I invite you to try Poda. I made it to find similar directories across multiple storage locations (or hampers, as I call them in the program). The typical use case is finding out whether I have duplicate or similar directories on my laptop, the other laptop, the NAS, flash drives, etc. For example, I may have made a backup of a flash drive on my laptop and no longer be sure what has changed where. It also works within a single storage location!
The way it works is that you first build an index of each hamper. For example, you may have four hampers:
- Your home directory on your laptop.
- An 8GB flash drive.
- A 32GB flash drive.
- A NAS server.
You configure and index the first three hampers on your laptop:
poda-hamper-add laptop-home /home/alvarezp .
poda-hamper-add flash-8gb /media/alvarezp/ABCD-EF00 .
poda-hamper-add flash-32gb /media/alvarezp/0123-4567 .
poda-reindex laptop-home
# Insert the 8GB flash drive
poda-reindex flash-8gb
# Remove the 8GB flash drive and insert the 32GB flash drive
poda-reindex flash-32gb
For the fourth hamper, you install Poda on the NAS server, then configure and index the hamper there following a similar process.
Then you bring the indexes together on your laptop. You need to do this manually for now. The index is just a big text file under the .poda/indexes/hostname/hampername directory. Copy it to the same directory on your laptop; you may need to create that directory first with mkdir.
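The manual copy could be wrapped in a small shell function like the one below. This is only a sketch, not part of Poda: the name import_index, the default base directory $HOME/.poda/indexes, and the host and hamper names nas and nas-share are all assumptions for illustration. The file name index is inferred from the glob used in the post-processing command.

```shell
# Hypothetical helper (not part of Poda):
#   import_index SRC_FILE HOSTNAME HAMPERNAME [BASE_DIR]
# Assumes indexes live under $HOME/.poda/indexes; pass BASE_DIR if yours differ.
import_index() {
    src=$1
    host=$2
    hamper=$3
    base=${4:-"$HOME/.poda/indexes"}
    dest="$base/$host/$hamper"
    # Create hostname/hampername if it does not exist yet.
    mkdir -p "$dest" || return 1
    # The index file is assumed to be named "index".
    cp -- "$src" "$dest/index"
}

# Example, after fetching the NAS index to /tmp by any means
# (scp, flash drive, ...); "nas" and "nas-share" are made-up names:
#   import_index /tmp/nas-share.index nas nas-share
```

Keeping the fetch step separate from the import step means it does not matter how the file reaches the laptop.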
Once this is done, run
cat .poda/indexes/*/*/index | poda-dirdupes.py | sort -n
on your laptop to post-process the indexes.
You will get an output like the following:
23 80.85% 57: alvarezp-samsung:samplehamper:backup alvarezp-samsung:samplehamper:content
which means that there are 23 unique bytes and 57 duplicate bytes between those two directories, for an 80.85% similarity.
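If the list gets long, you might want to keep only the pairs above some similarity. Here is a minimal sketch of that, assuming the percentage is always the second whitespace-separated field, as in the sample line above. The function name filter_similar and the default threshold of 80 are my own choices for illustration and are not part of Poda.

```shell
# Hypothetical helper (not part of Poda): keep only result lines whose
# similarity is at or above a threshold (default 80).
# Assumes the percentage is the second field, e.g. "80.85%".
filter_similar() {
    min=${1:-80}
    # "$2 + 0" makes awk read the leading number and ignore the "%" sign.
    awk -v min="$min" '($2 + 0) >= min'
}

# Usage, run from the directory that contains .poda:
#   cat .poda/indexes/*/*/index | poda-dirdupes.py | sort -n | filter_similar 80
```

Because the filter only reads standard input, it can go anywhere after poda-dirdupes.py in the pipeline.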
More details are in the Readme file and the included sample run at https://gitlab.com/alvarezp2000/poda