I invite you to try Poda. I made it to find similar directories across multiple storage locations (or hampers, as I call them in the program). The typical use case is finding out whether I have duplicate or similar directories on my laptop, the other laptop, the NAS, flash drives, etc. I may have made a backup of a flash drive on my laptop and no longer be sure what has changed where. It also works within a single storage location!
[Update: There was a mistake in the use of dirdupes.py below. A sort filter was missing. It is now corrected.]
The way it works is that you first build an index of each hamper. For example, you may have four hampers:
- Your home directory on your laptop.
- An 8GB flash drive.
- A 32GB flash drive.
- A NAS server.
You configure and index the first three hampers on your laptop:
poda-hamper-add laptop-home /home/alvarezp .
poda-hamper-add flash-8gb /media/alvarezp/ABCD-EF00 .
poda-hamper-add flash-32gb /media/alvarezp/0123-4567 .
poda-reindex laptop-home
# Insert the 8GB flash drive
poda-reindex flash-8gb
# Remove the 8GB flash drive and insert the 32GB flash drive
poda-reindex flash-32gb
You configure the fourth hamper on the NAS server and index it there, by installing Poda on the server and following a similar process.
Then you bring the indexes together on your laptop. You need to do this manually for now. Each index is just a big text file under the .poda/indexes/hostname/hampername directory. Copy it to the same directory on your laptop; you may need to create that directory first with mkdir.
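The manual copy can be sketched as a small shell helper. This is only an illustration and not part of Poda: the name install_index is made up, the default location under the home directory is an assumption, and the file name index is inferred from the glob used in the post-processing command.

```shell
#!/bin/sh
# Hypothetical helper (not part of Poda): place an index file fetched from
# another machine where the glob .poda/indexes/*/*/index will find it.
# Usage: install_index SOURCE_FILE HOSTNAME HAMPERNAME [INDEX_ROOT]
install_index() {
    src=$1
    host=$2
    hamper=$3
    # Assumption: indexes live under ~/.poda/indexes unless told otherwise.
    root=${4:-"$HOME/.poda/indexes"}

    dest="$root/$host/$hamper"
    mkdir -p "$dest" || return 1   # create hostname/hampername if missing
    cp "$src" "$dest/index"        # the index is a plain text file
}
```

For instance, if the NAS index had already been fetched to /tmp/nas-index (with scp or a flash drive), something like install_index /tmp/nas-index nas nas-share would put it in place. The host name nas and the hamper name nas-share are placeholders here.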
Once this is done, run
cat .poda/indexes/*/*/index | sort | poda-dirdupes.py | sort -n
on your laptop to post-process the indexes.
You will get an output like the following:
23 80.85% 57: alvarezp-samsung:samplehamper:backup alvarezp-samsung:samplehamper:content
which means that there are 23 unique bytes and 57 duplicate bytes between those two directories, for an 80.85% similarity.
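Since the similarity percentage is the second field of each line, the output can be narrowed down with standard tools. The following is a minimal sketch and not part of Poda: filter_similar is a made-up helper name, and it assumes every output line follows the format shown above.

```shell
# Hypothetical helper (not part of Poda): keep only the lines whose
# similarity, the second whitespace-separated field, is at least MIN percent.
# Usage: ... | filter_similar MIN
filter_similar() {
    awk -v min="$1" '{
        pct = $2
        sub(/%$/, "", pct)              # strip the trailing percent sign
        if (pct + 0 >= min + 0) print   # compare numerically, print line as-is
    }'
}
```

It could be placed between poda-dirdupes.py and the final sort -n, for example as filter_similar 80, to list only the directory pairs that are at least 80% similar.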
More details are in the Readme file and the included sample run at https://gitlab.com/alvarezp2000/poda