Monday, March 8, 2021

How to deduplicate and reorganize 1 TB of unorganized pictures

I went from a 1 TB unorganized collection of mostly pictures and movies to a 300 GB compendium of pictures and movies organized by year and month.  Here is how.  I am documenting my steps here because I don't want to figure it out again.  This might not work for you, but I was satisfied with the results.

*** Build initial Compendium and backups ***

Collect all images into one place. (I used an external 7200 RPM HDD I had lying around.)

Make backups in several locations. (I went with several external 2 TB 7200 RPM HDDs.)
(These can be a little slow because they get stuffed in a drawer for years.  The intent is to fall back on these if I make a mistake or if a device fails.)

Purchase an external SSD (I went with 2 TB).  This will be my primary drive for this project and others in the future.  During this project, I want fast file operations.  Once the dedup and reorg are complete, I will back up my output to multiple external 7200 RPM drives and use the SSD as the primary browsing/editing device for my family photos.

Copy the compendium of images to the external SSD.

*** Remove junk files ***

First, remove temp "junk" files.

I used bash (Linux) to remove various temporary files that I knew I would no longer need:

Install Windows Subsystem for Linux
Go to the Windows store and install Ubuntu
Launch Ubuntu

cd /mnt/f/dedup

Experiment with the command below to find the files you want to delete:
find /mnt/f/dedup -iname '.DS_Store' -type f

Careful - Once you are satisfied with the target, you can add the delete switch:

find /mnt/f/dedup -iname '.DS_Store' -type f -delete

I wanted to get rid of many types of files.  You might not want to do the same.
These are the commands I ran; your mileage may vary:

find /mnt/f/dedup -type f -iname "*.THM" -delete
find /mnt/f/dedup -type f -iname "*.aae" -delete
find /mnt/f/dedup -type f -iname "*.db"  -delete
find /mnt/f/dedup -type f -iname "*.doc" -delete
find /mnt/f/dedup -type f -iname "*.docx" -delete
find /mnt/f/dedup -type f -iname "*.ini" -delete
find /mnt/f/dedup -type f -iname "*.mov_" -delete
find /mnt/f/dedup -type f -iname "*.pdf" -delete
find /mnt/f/dedup -type f -iname "*.psd"  -delete
find /mnt/f/dedup -type f -iname "*.temp" -delete
find /mnt/f/dedup -type f -iname "*.tmp" -delete
find /mnt/f/dedup -type f -iname "*.xmp" -delete
find /mnt/f/dedup -type f -iname ".DS_Store"  -delete
find /mnt/f/dedup -type f -iname "._*.avi" -size -5k -delete
find /mnt/f/dedup -type f -iname "._*.jpg" -size -5k -delete
find /mnt/f/dedup -type f -iname "._*.mov" -size -5k -delete
find /mnt/f/dedup -type f -iname "._.DS_Store"  -delete
find /mnt/f/dedup -type f -iname "._IMG_*.JPG" -size -5k -delete
find /mnt/f/dedup -type f -iname "._IMG_*.jpg" -size -5k -delete
find /mnt/f/dedup -type f -iname "._MVI_*.MOV" -size -5k -delete
find /mnt/f/dedup -type f -iname "._[0-9]*.jpg" -size -5k -delete
find /mnt/f/dedup -type f -iname "._MVI_*.AVI" -size -5k -delete
find /mnt/f/dedup -type f -iname "._MIV_*.MOV" -size -5k -delete
find /mnt/f/dedup -type f -iname "._[0-9][0-9]" -size -5k -delete
find /mnt/f/dedup -type f -iname "._IMG_[0-9]*.HEIC" -size -5k -delete
find /mnt/f/dedup -type f -iname "._IMG_[0-9]*.jpeg" -size -5k -delete
find /mnt/f/dedup -type f -iname "._[0-9][0-9][0-9][0-9]*-[0-9][0-9][0-9][0-9]*" -size -5k -delete
find /mnt/f/dedup -type f -iname "._IMG_*.CR2" -size -5k -delete
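The same idea can be folded into one pass that only lists candidates, which makes it easier to review everything before deleting anything. This is a minimal sketch assuming GNU find (the one Ubuntu ships). The function name list_junk is made up for this example, and the pattern list is only a subset of the commands above.

```shell
# list_junk DIR: print junk candidates in a single pass, without deleting anything.
list_junk() {
    # The outer \( ... \) is needed because -o binds less tightly than the implicit "and".
    # -size -5k rounds sizes up to whole KiB, so it matches files of 4096 bytes or less.
    find "$1" -type f \( \
        -iname '.DS_Store' -o \
        -iname '*.thm' -o -iname '*.aae' -o \
        -iname '*.tmp' -o -iname '*.temp' -o \
        \( -iname '._*' -size -5k \) \
    \) -print
}

# Example: list_junk /mnt/f/dedup | less
```

Once the list looks right, swapping -print for -delete removes the same set of files.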




*** Deduplication ***

I use a Windows application called "Duplicate Cleaner Free"

Install and run it.

I use these settings:
Find files with: Same content
More duplicate options: Same file extension
Search filters: Included: *.*
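To get a feel for what "Same content" means, here is a rough cross-check from bash. It is a sketch that assumes GNU coreutils (sha256sum, sort, uniq); it is not how Duplicate Cleaner works internally, and the function name list_duplicates is made up for this example. It only lists duplicate groups and never deletes.

```shell
# list_duplicates DIR: print groups of files whose contents are byte-for-byte identical.
list_duplicates() {
    find "$1" -type f -print0 |
        xargs -0 -r sha256sum |                  # one "hash  path" line per file
        sort |                                   # identical hashes become adjacent
        uniq -w64 --all-repeated=separate        # keep only lines whose first 64 chars repeat
}

# Example: list_duplicates /mnt/f/dedup > duplicates.txt
```

Each group in the output is separated by a blank line, and files that appear only once are left out.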

Scan location:
Start with a small section of your total project to get a sense of how the software works.  In fact, I scan each section of my project individually to keep things at a reasonable scope.  After processing all sections, I then run a final dedup that includes all of them (not just individually) because there are likely duplicates between sections.

The software may identify files that you realize are more junk files.  If so, go back and run bash commands to get rid of them, and then run a new scan.  You may have to repeat this multiple times to truly get rid of your junk files.

Once Duplicate Cleaner finds and displays a list of duplicates, it expects you to mark which files you want to delete.  Use the "Selection Assistant" to mass mark the files:

I marked the duplicates using this option:
Mark "All but one file in each group".

You may not like these options.  Experiment to find what works well for you.

Click "File Removal" to delete the duplicates.

On the dialog box that follows, I use the following options:
Skip problem files: uncheck
Remove empty folders: check
Delete to the Recycle Bin: uncheck
Via Windows Shell: check

When satisfied, click "Delete files"

If you are removing thousands of files, it may take a while to run.  Don't freak out if Windows says the app is not responding.  My system took several minutes to delete about 100,000 duplicates (roughly 250 GB) on an SSD attached via USB 3.  You may see a black screen appear, disappear, and reappear a few times.

Once you have processed each section and the entire project, your dedup is complete.  Make more backups because this was time consuming and you don't want to have to do it again.  I had to revert to this point due to mistakes more times than I'd care to admit.  That's why we make backups.
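A backup is only useful if it actually matches the original, so a copy followed by a comparison is worth the extra time. This is a minimal sketch assuming GNU cp and diff. The function name copy_and_verify and the example paths are placeholders, not a record of what was run for this project.

```shell
# copy_and_verify SRC DEST: copy a directory tree, then compare the copy with the original.
copy_and_verify() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -a copies recursively and keeps timestamps; "$src/." copies the contents of SRC.
    cp -a "$src/." "$dest/" || return 1
    # diff -rq prints nothing and returns 0 when both trees are identical.
    diff -rq "$src" "$dest"
}

# Example (placeholder paths):
# copy_and_verify /mnt/f/dedup /mnt/g/backup/dedup
```

The comparison reads every file on both drives, so expect it to take about as long as the copy itself.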

*** Reorganization ***

Create a folder to contain the reorganized files:
md f:\output

Create a folder to contain images that don't have dates:
md f:\output\remnant


Need to update Ubuntu so we can install exiftool
sudo apt-get update
sudo apt-get upgrade

I rebooted out of (bad) habit.

sudo apt-get update
sudo apt install libimage-exiftool-perl

Run one of the following commands from the directory you want problem files to be placed.  For me, that was /mnt/f/output/remnant
Note: Not sure about the above statement. Not sure it matters where the command is run from.  No files went into my "remnant" subdirectory.

Either (copy, leaving the originals in place):
exiftool -o . '-Directory<CreateDate' -d /mnt/f/output/%Y/%Y-%m%%-c -r '/mnt/f/reorg/pics/'
Or (move):
exiftool '-Directory<CreateDate' -d /mnt/f/output/%Y/%Y-%m%%-c -r '/mnt/f/reorg/pics/'


The first command copies all the files under /mnt/f/reorg/pics into /mnt/f/output; the second moves them instead.
Both create a directory structure based not on the file's timestamp, but on the "create date" stored in the EXIF data of the image.

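-o = Copy (don't move).  If you leave -o out, the tool moves the files instead of copying them.

-d = Date format applied to CreateDate.  Because the formatted date is written to Directory, it becomes the destination directory (/mnt/f/output/%Y/%Y-%m)
%Y = YYYY
%y = yy (not used here)
%m = mm
%%-c = Copy number.  The doubled % passes a literal %-c through the date formatting.  exiftool then appends a dash and a number when a file with the same name already exists at the destination.  Because this is part of the directory name, those files land in a sibling directory such as 2012-03-1.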

So a subdirectory for files created in  March of 2012 would look like this:  /mnt/f/output/2012/2012-03

-r = Recursive (go through all the subdirectories of the source directory)
In this case, the source directory is /mnt/f/reorg/pics
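The %Y and %m codes are the usual strftime-style date codes, so GNU date can preview which folder a given capture date maps to before any files are touched. This is a small sketch. The function name preview_dest is made up for this example, and it leaves out %%-c, which is specific to exiftool.

```shell
# preview_dest DATE: print the folder that a capture date would map to.
preview_dest() {
    # GNU date accepts the same %Y and %m codes used in the -d pattern.
    date -d "$1" '+/mnt/f/output/%Y/%Y-%m'
}

preview_dest '2012-03-15'   # prints /mnt/f/output/2012/2012-03
```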


*** Cleanup ***

I ran the exiftool command without the -o option, which meant files were moved, not copied.  The idea is that I want to pull all the files out of the source directory and into the reorganized repository.  But how do we track down the files that exiftool could not migrate?

The find command can find all files.  If you redirect output to a file, you will have a list of files to work through.

From the reorg directory, run this command:

find . -type f > remaining_files.txt

-type f = Look for "files" (as opposed to directories)

I still had a bunch of junk files remaining that I had to delete.
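A flat list of leftovers gets long, so a tally by file extension can show quickly what kind of files remain. This is a minimal sketch assuming GNU find and coreutils. The function name count_by_extension is made up for this example.

```shell
# count_by_extension DIR: tally files by lowercased extension, most common first.
count_by_extension() {
    find "$1" -type f -name '*.*' |      # only names that contain a dot
        sed 's/.*\.//' |                 # keep what follows the last dot
        tr '[:upper:]' '[:lower:]' |     # treat JPG and jpg as the same
        sort | uniq -c | sort -rn
}

# Example: count_by_extension /mnt/f/reorg
```

Files with no dot in the name are skipped, so they still need a separate look.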


*** Collection of scripts used during cleanup that I should explain later but we both know I will forget to do so ***
 

The command below finds and removes all files starting with "._" (without quotes).  That's pretty extreme, so buyer beware.

find /mnt/f/reorg -type f -iname "._*" -delete

This next command will find and remove all empty directories to make the remaining job easier:
find /mnt/f/reorg -type d -empty -delete

-type d = Limit the find command to directories
-empty = Empty ones
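With GNU find, -delete processes the contents of a directory before the directory itself, so nested empty directories disappear in a single pass. The sketch below prints each directory as it removes it. The function name prune_empty_dirs is made up for this example, and -mindepth 1 is an addition that keeps the top-level directory even when it ends up empty.

```shell
# prune_empty_dirs DIR: print and remove every empty directory below DIR.
prune_empty_dirs() {
    # -mindepth 1 protects DIR itself; -print shows what is being removed.
    find "$1" -mindepth 1 -type d -empty -print -delete
}

# Example: prune_empty_dirs /mnt/f/reorg
```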


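This moves every directory under 2006 whose name ends in "-1" into a "more" subdirectory (which must already exist):
find /mnt/f/output/2006 -type d -iname "*-1" -exec mv {} /mnt/f/output/2006/more/ \;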

find . -type f
This would find all files remaining in the dedup directory that were not moved to the reorg directory.


This lists every remaining file whose name does not match any of the listed patterns:
find . -type f ! -iname "IMG_*.jpg" -and ! -iname "MVI*.AVI" -and ! -iname "dscn*.jpg" -and ! -iname "IMG*.MOV" -and ! -iname "DSCN*.MOV" -and ! -iname "kimg_*.jpg" -and ! -iname "xIMG_*.JPG" -and ! -iname "1 IMG_*.JPG" -and ! -iname "MVI*.MOV" -and ! -iname "DSC*.JPG" -and ! -iname "DSCF*.AVI" -and ! -iname "IMG_*.jpeg" -and ! -iname "._IMG_[0-9]*.jpeg" -and ! -iname "IMG_*.CR2" -and ! -iname "IMG*.HEIC"


rename_files_in_single_directory:
#!/usr/bin/bash
# Renames files from whatever.ext to whatever_001.ext
for file in "$1"/*.*; do
        [ -f "$file" ] || continue   # skip directories and an unmatched glob
        ext="${file##*.}"; filename="${file%.*}"
        mv -- "$file" "${filename}_001.${ext}"
        # Dry run: comment out the mv line above and use: echo "${filename}_001.${ext}"
done

cycle_through:
#!/usr/bin/bash
# Requires a list of directories in file /mnt/f/reorg/dirs.txt
# and rename_files_in_single_directory somewhere on the PATH
while IFS= read -r dirname; do
        echo "Processing $dirname"
        rename_files_in_single_directory "$dirname"
done </mnt/f/reorg/dirs.txt

Command to find all directories in format YYYY-MM-1
(This was needed to find directories with duplicate filenames in a given month of a given year.)
find . -type d -iname "[0-9][0-9][0-9][0-9]-[0-9][0-9]-1" > dirs.txt
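One possible follow-up, once the YYYY-MM-1 directories are known, is to fold their files back into the matching YYYY-MM directory under a new name. This is a hedged sketch of that idea, not a record of what was actually run. The function name merge_overflow is made up, and it assumes GNU mv for the -n (no-clobber) option.

```shell
# merge_overflow SRC DEST: move files from SRC (e.g. 2012-03-1) into DEST (e.g. 2012-03),
# adding a _001 suffix so nothing already in DEST is overwritten.
merge_overflow() {
    for file in "$1"/*.*; do
        [ -f "$file" ] || continue          # skip directories and an unmatched glob
        name=$(basename "$file")
        # -n refuses to overwrite an existing file in DEST.
        mv -n -- "$file" "$2/${name%.*}_001.${name##*.}"
    done
}

# Example: merge_overflow /mnt/f/output/2012/2012-03-1 /mnt/f/output/2012/2012-03
```

Anything that mv -n skipped stays behind in the source directory, so leftovers there point to a second name collision.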



Friday, March 5, 2021

Add a custom directory to the path in bash

 Edit ~/.bashrc

Add the following line:

export PATH="$PATH:/<target_directory>"
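Every time ~/.bashrc is sourced again, a plain export line appends the directory once more. A small guard avoids the repeats. This is a sketch, and both the function name add_to_path and the example directory are made up for illustration.

```shell
# add_to_path DIR: append DIR to PATH only if it is not already there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

# Example (placeholder directory):
# add_to_path "$HOME/scripts"
```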

 

Brighten ls output in bash

 I can't easily see the folder names returned by the 'ls' command in bash when running Ubuntu on Windows.  Here is the fix.

Edit ~/.bashrc

Append this line:

LS_COLORS="ow=01;36;40" && export LS_COLORS

Update 2023-07-15:

Not working as well as it did.  Here's a newer way:

LS_COLORS='rs=0:di=1;35:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:';

export LS_COLORS

 



Customize and brighten the fonts in the bash prompt in Ubuntu on Windows

I run Ubuntu on Windows.  The prompt is too dark for me to see.  Here is how to brighten the prompt.

Edit ~/.bashrc

Append the following line:

 

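PS1='\[\e[37;1m\]\u@\h \[\e[35m\]\w/ \[\e[0m\]\$ '

 

\u = Username

\h = Hostname

\w = Working directory

\[ and \] = Wrap the color codes so bash does not count them as visible characters (without them, long command lines can wrap incorrectly)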

 

So \u@\h makes the prompt start with

<username>@<hostname>

\e[ = Start a color sequence

The part that follows is a color:  37;1m

The 37 is the color (white).

The 1 says to use a bright (bold) version.

m marks the end of the color sequence.

\e[0m = Reset to the default colors so the text you type is not colored.
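To pick a color, it helps to see the codes side by side. The sketch below prints each foreground code from 30 to 37 in its bright form. The function name show_colors is made up for this example, and the exact shades depend on the terminal's color scheme.

```shell
# show_colors: print each foreground color code (30-37) in its bright/bold form.
show_colors() {
    for code in 30 31 32 33 34 35 36 37; do
        # \033 is the same escape character that \e stands for in PS1.
        printf '\033[%s;1m%s;1\033[0m\n' "$code" "$code"
    done
}

show_colors
```

Each line shows the code to put between \e[ and m, for example 35;1 for bright magenta.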


I made some more changes to make my life easier.  I added some spaces to the prompt before and after the working directory so I could double-click the working directory path and get the path easily into my copy buffer.


How to turn off the bell in bash

The bell is terribly loud in Ubuntu on Windows.  Here is how to disable it:

Edit ~/.inputrc

Contents:

 

set bell-style none