How I recovered my photos and much of the metadata with a few tools (Part 2)

This is a continuation of another post detailing how I recovered thousands of photos from an accidentally formatted hard drive. Read that post here.

The Problem – Part 2

At this point, I was definitely breathing easier, but there were still a lot of problems to deal with.  The thing about PhotoRec is that it doesn't look at the data on your hard disk at the file/directory level.  It looks at the raw data on the hard drive, recognizes when it's a picture, and restores it to the location of your choice.  This means that filenames and directory structure are lost.  It also means that nearly every JPEG on the formatted hard drive was recovered…including tens of thousands of images from Firefox's cache along with hundreds (thousands?) of other miscellaneous images like wallpapers I'd downloaded over the years, with no obvious way to distinguish between them and actual photos.

My first thought was that I could use the metadata stored with most images to distinguish between photos we took and photos/graphics that I didn't care about.  This line of thought only got me halfway to a solution.  The problem was that a large number of our photos came from a scanner, which doesn't embed any metadata in the image.  Note that images my wife had already processed in Picasa did have metadata, because Picasa adds various tags to the photo. Unfortunately, she was only partway through this process, and the unprocessed images were differentiated from the processed ones only by which folder they were originally in.

The Code

I decided the best way to handle this was to write a set of Python scripts to weed through this seemingly homogeneous set of photos.  My first, and easiest, task was to determine which photos were taken with a digital camera.

Instead of re-inventing the wheel, I looked for a command line program that would print out the metadata for each photo.  I first alighted upon ExifTool by Phil Harvey.  This program is extremely thorough, so I began to code around it.

My first step was to gather all of the available metadata for all of the images.  I whipped together some code to do this, but I soon found out that ExifTool is slooooow.  At the rate it was processing images, it was literally going to take days!  After a bit more searching, I found jhead, another tool that would give me the data I needed to determine which camera took which picture.  Admittedly, it didn't return as much information as ExifTool, but it would suffice for this step.  The following Python code uses jhead to sort out all of the photos taken by digital cameras:

import subprocess
import os
import shelve

def pic_data(fn):
    """Runs jhead on a photo and returns its output as a dictionary.
    """
    cmd = 'jhead.exe'

    # Pass the command as a list so paths containing spaces still work
    proc = subprocess.Popen([cmd, fn],
                            stdout=subprocess.PIPE)

    stdout_value = proc.communicate()[0]
    stdout_value = stdout_value.strip()
    img_data = dict()
    for line in stdout_value.splitlines():
        # Each line looks like 'Camera make  : Canon'
        record = line.partition(':')
        img_data[record[0].strip()] = record[2].strip()

    return img_data

pics_location = 'C:\\LOCATION_OF_PICTURES'
index = 0
database = shelve.open('pic_database')
database.clear()

for dirpath, dirnames, filenames in os.walk(pics_location):
    for filename in filenames:
        if filename.endswith('jpg'):
            pic = os.path.join(dirpath, filename)
            info = pic_data(pic)
            database[info['File name']] = info
            index += 1
            # Print a progress message every 100 photos
            if index % 100 == 0:
                print 'Stored ' + info['File name'] + ' (among others)'

database.close()

What this does is iterate through all of the photos and run jhead on each one. The output from jhead looks like this:

File name    : FULL_PATH_TO_PHOTO_HERE
File size    : 2420211 bytes
File date    : 2005:05:14 17:55:33
Camera make  : Canon
Camera model : Canon PowerShot G6
Date/Time    : 2005:05:14 17:55:33
Resolution   : 3072 x 2304
Flash used   : No
Focal length : 11.2mm  (35mm equivalent: 56mm)
CCD width    : 7.21mm
Exposure time: 0.067 s  (1/15)
Aperture     : f/4.0
Focus dist.  : 1.27m
Whitebalance : Auto
Metering Mode: matrix

The actual data and fields vary depending on which photo jhead is run on. The example above is from a photo taken by a digital camera. This sort of data will help me filter out all the digicam pics.

The above Python script uses the awesome shelve module to create what amounts to a persistent dictionary. This was important because I knew it'd take me several trial runs to get my next script working right, and I didn't want to have to wait for jhead to run on all the photos each time. Even though jhead was much faster than ExifTool, it still took maybe 30 minutes to get the data for all the photos.
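
To see why that matters, here's a minimal sketch of a later run reopening the shelf and reading the cached jhead data without calling jhead at all (the 'Camera model' lookup is just an illustrative field):

import shelve

# Reopen the shelf from the earlier run; no jhead calls needed this time
database = shelve.open('pic_database')

# Peek at a few cached entries (keys are the 'File name' values)
for name in database.keys()[:5]:
    print name, '->', database[name].get('Camera model', 'n/a')

database.close()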

Digital Camera Photos

Now all I had to do was write a bit of code to look up each picture in the database I just created that had the fields ‘Camera make’ and ‘Camera model’. The following code did just that.

import os
import shutil
import shelve

database = shelve.open('pic_database')

scratch_dir = "C:\\Users\\Therms\\Desktop\\campics"
make_dict = dict()

#first we create a dictionary with each camera make
#as the key and a tuple consisting of the filename
#and camera model
for pic in database:
    curpic = database[pic]
    if 'Camera make' in curpic and 'Camera model' in curpic:
        model = curpic['Camera model']
        tup = curpic['File name'], model
        # setdefault returns the existing list for this make
        # (or starts a new one), then we append to it
        make_dict.setdefault(curpic['Camera make'], []).append(tup)

#now we step through the make dictionary and create a
#directory structure like this:  CameraMake/CameraModel/pics
#finally we delete each moved pic from the pic database so that
#we don't process them in later scripts
for make in make_dict:
    index = 1
    print make
    for filez, model in make_dict[make]:
        print 'Copying ' + str(index) + '/' + str(len(make_dict[make]))
        make_dir = os.path.join(scratch_dir, make)
        model_dir = os.path.join(scratch_dir, make, model)

        if not os.path.exists(make_dir):
            os.mkdir(make_dir)
        if not os.path.exists(model_dir):
            os.mkdir(model_dir)

        shutil.move(filez, model_dir)
        del database[filez]
        index += 1

database.close()

Of course, many of the pics that PhotoRec had recovered from our Firefox cache also had camera make/model information, but by sorting the photos into a directory structure containing the make/model information, I was able to just delete all the photos taken by cameras that we didn't own, leaving me with just our photos.
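
If you wanted to script that cleanup step too, something like this would work (the keep-list here is hypothetical; substitute the cameras you actually own):

import os
import shutil

scratch_dir = "C:\\Users\\Therms\\Desktop\\campics"
# Hypothetical keep-list of (make, model) pairs for cameras we owned
our_cameras = [('Canon', 'Canon PowerShot G6')]

for make in os.listdir(scratch_dir):
    make_dir = os.path.join(scratch_dir, make)
    for model in os.listdir(make_dir):
        # Delete the whole model directory if it isn't one of ours
        if (make, model) not in our_cameras:
            shutil.rmtree(os.path.join(make_dir, model))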

Planning the next step

At this point, I was a little unsure about how to proceed, so I spent a bit of time running ExifTool on various remaining photos to see if there was a pattern in the metadata that would help me filter out more of our photos.

As I mentioned earlier, ExifTool is slower than jhead, but it does return a lot more data about each photo. One piece of metadata it returns is a field called “Software” from the Image File Directory section of the EXIF metadata. This field is saved whenever you edit an image with Picasa and is set to “Picasa 2.7” (or whichever version was used). As my wife had edited a large number of the pics she’d scanned in, this gave me another large set of images to filter out.

The Picasa Code

Unfortunately, the jhead database I had constructed wasn’t going to work for me anymore because it just didn’t contain the data I needed. ExifTool is much more thorough (and this is for a photo with very little metadata):

---- ExifTool ----
ExifTool Version Number         : 7.48
---- File ----
File Name                       : f12049608.jpg
Directory                       : \\EHUD\Photos\picasa pics
File Size                       : 320 kB
File Modification Date/Time     : 2008:10:10 13:12:29-05:00
File Type                       : JPEG
MIME Type                       : image/jpeg
Exif Byte Order                 : Little-endian (Intel, II)
Current IPTC Digest             : 5e62ad2acd8219df49858cf37b38613b
Image Width                     : 2146
Image Height                    : 1552
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:0 (2 2)
---- JFIF ----
JFIF Version                    : 1.01
Resolution Unit                 : None
X Resolution                    : 1
Y Resolution                    : 1
---- IFD0 ----
Software                        : Picasa 3.0
---- ExifIFD ----
Exif Version                    : 0210
Image Unique ID                 : e98f333bf6bde083ad94ae5444e972ed
---- InteropIFD ----
Interoperability Index          : Unknown (    )
Interoperability Version        : 0100
---- IPTC ----
Caption-Abstract                : 6 months old, Jan. 2008
Keywords                        : meilyn
---- Composite ----
Image Size                      : 2146x1552

So I reworked the jhead script above to parse ExifTool's output into a database instead; it only took some minor changes to the pic_data function.
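
I don't have the exact change anymore, but it was along these lines. This sketch assumes exiftool.exe is on the PATH and uses the -g1 flag, which produces the grouped '---- IFD0 ----' style output shown above; those group header lines need to be skipped when parsing:

import subprocess

def pic_data(fn):
    """Runs ExifTool on a photo and returns its output as a dictionary.
    """
    cmd = 'exiftool.exe'

    # -g1 groups the output under '---- IFD0 ----' style headers,
    # matching the sample output above
    proc = subprocess.Popen([cmd, '-g1', fn],
                            stdout=subprocess.PIPE)

    stdout_value = proc.communicate()[0].strip()
    img_data = dict()
    for line in stdout_value.splitlines():
        # Skip the group header lines, e.g. '---- ExifIFD ----'
        if line.startswith('----'):
            continue
        record = line.partition(':')
        img_data[record[0].strip()] = record[2].strip()

    return img_data

# Key the shelf by the full path so the later move script still works
# (ExifTool's 'File Name' field is only the base name):
#     database[pic] = pic_data(pic)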

Once I had that done (it took hours to parse all the photos), I was able to whip up a simple script to move all those photos over.

import shelve
import os
import shutil

database = shelve.open('pic_database')

scratch_dir = "C:\\DIRECTORY_TO_STORE_PICASA_PHOTOS"
if not os.path.exists(scratch_dir):
    os.mkdir(scratch_dir)

picasas = 0
pic_list = []

for pic in database:
    curpic = database[pic]
    # The 'Software' field is set to e.g. 'Picasa 3.0' on edited photos
    if curpic.get('Software', '').startswith('Picasa'):
        picasas += 1
        pic_list.append(pic)

index = 1
for f in pic_list:
    shutil.move(f, scratch_dir)
    del database[f]
    print 'Moved ' + str(index) + ' of ' + str(picasas)
    index += 1

database.close()

That’s all for now. Next time I’ll talk about how I filtered through more of the photos.
