Poda: finds similar directories

I invite you to try Poda. I made it to find similar directories across multiple storage locations (or hampers, as I call them in the program). The typical use case is to find out whether I have duplicate or similar directories on my laptop, the other laptop, the NAS, flash drives, etc. I may have backed up a flash drive to my laptop and not be sure what has changed where. It also works within a single storage location!

[Update: There was a mistake in the use of dirdupes.py below. A sort filter was missing. It is now corrected.]

The way it works is that you first build an index of each hamper. For example, you may have four hampers:

  1. Your home directory on your laptop.
  2. An 8GB flash drive.
  3. A 32GB flash drive.
  4. A NAS server.

You configure and index the first three hampers on your laptop:

poda-hamper-add laptop-home /home/alvarezp .
poda-hamper-add flash-8gb /media/alvarezp/ABCD-EF00 .
poda-hamper-add flash-32gb /media/alvarezp/0123-4567 .
poda-reindex laptop-home
# Insert the 8GB flash drive
poda-reindex flash-8gb
# Remove the 8GB flash drive and insert the 32GB flash drive
poda-reindex flash-32gb

You configure and index the fourth hamper on the NAS server itself, by installing Poda there and following a similar process.

Then you bring the indexes together on your laptop. For now this is a manual step. The index is just a big text file under the .poda/indexes/hostname/hampername directory. Copy it into the same directory on your laptop, creating that directory with mkdir first if it does not exist.
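The manual copy could be wrapped in a small shell function like the sketch below. This is not part of Poda: the function name import_index and the PODA_DIR variable are made up for illustration, and it assumes the Poda data directory is $HOME/.poda and that the index file is named index, as the post-processing glob suggests.

```shell
# Hypothetical helper, not part of Poda: put an index file that was fetched
# from another machine where .poda/indexes/*/*/index will pick it up.
# Assumption: the Poda data directory is $HOME/.poda unless PODA_DIR is set.
import_index() {
    src_file=$1   # index file already copied over (for example with scp)
    host=$2       # hostname of the machine that produced the index
    hamper=$3     # hamper name used on that machine
    dest_dir="${PODA_DIR:-$HOME/.poda}/indexes/$host/$hamper"
    # Create the host/hamper directory if needed, then copy the index in.
    mkdir -p "$dest_dir" && cp "$src_file" "$dest_dir/index"
}
```

For instance, after fetching the NAS index to /tmp/nas-index, something like import_index /tmp/nas-index nas nas-data would place it, where nas and nas-data stand in for your own hostname and hamper name.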

Once this is done, run

cat .poda/indexes/*/*/index | sort | poda-dirdupes.py | sort -n

on your laptop to post-process the indexes.

You will get an output like the following:

                23   80.85%                 57: alvarezp-samsung:samplehamper:backup alvarezp-samsung:samplehamper:content

which means that there are 23 unique bytes and 57 duplicate bytes between those two directories, for an 80.85% similarity.
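On a large set of indexes the list can get long, so it may help to keep only the most similar pairs. The filter below is a sketch, not part of Poda: filter_similar is a made-up name, and it assumes the second whitespace-separated field is always the percentage, as in the sample line above.

```shell
# Hypothetical filter, not part of Poda: keep only the lines whose similarity
# is at or above a threshold (default 80).
# Assumption: field 2 is the percentage, formatted like "80.85%".
filter_similar() {
    awk -v min="${1:-80}" '{
        pct = $2
        sub(/%$/, "", pct)          # strip the trailing percent sign
        if (pct + 0 >= min) print   # compare numerically, print the whole line
    }'
}
```

It would go at the end of the pipeline, for example ... | poda-dirdupes.py | sort -n | filter_similar 90.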

More details are in the Readme file, along with an included sample run, at https://gitlab.com/alvarezp2000/poda
