This is a continuation of another post detailing how I recovered thousands of photos from an accidentally formatted hard drive. Read that post here.
The Problem – Part 2
At this point, I was definitely breathing easier, but there were still a lot of problems to deal with. The problem with PhotoRec is that it doesn’t look at the data on your hard disk at the file/directory level. It looks at the raw data on the hard drive, recognizes whether it’s a picture, and then restores it to the location of your choice. This means that filenames and directory structure are lost. It also means that nearly every JPEG file on the formatted hard drive was recovered, including tens of thousands of images from Firefox’s cache along with hundreds (thousands?) of other images, like wallpapers that I’ve downloaded over the years, with no obvious way to distinguish between them and actual photos.
My first thought was that I could use the metadata stored with most images to distinguish between photos we took and photos/graphics that I didn’t care about. This line of thought only got me halfway to a solution. The problem is that a large number of our photos came from a scanner, which doesn’t embed any metadata in the image. Images that my wife had already processed in Picasa did have metadata, because Picasa adds various tags to the photo. Unfortunately, she was only partway through this process, and the unprocessed ones were distinguished from the processed ones only by which folder they were originally in.
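To make the idea of recognizing pictures in raw data concrete, here is a minimal sketch of signature-based detection. It is not PhotoRec's actual code; it relies only on the fact that JPEG data begins with the bytes FF D8 FF. The function name `find_jpeg_offsets` and the sample buffer are made up for illustration.

```python
# JPEG data begins with the start-of-image marker (FF D8) followed by FF
JPEG_SIGNATURE = b'\xff\xd8\xff'

def find_jpeg_offsets(raw):
    """Return every offset in raw (bytes) where a JPEG signature begins."""
    offsets = []
    pos = raw.find(JPEG_SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = raw.find(JPEG_SIGNATURE, pos + 1)
    return offsets

# hypothetical stand-in for a chunk of raw disk data
sample = (b'\x00' * 16 + JPEG_SIGNATURE + b'\xe0' +
          b'\x00' * 8 + JPEG_SIGNATURE + b'\xe1')
print(find_jpeg_offsets(sample))
```

Nothing in those bytes records the original filename or folder, which is why both are lost in this kind of recovery.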
The Code
I decided the best way to handle this was to write a set of Python scripts to weed through this seemingly homogeneous set of photos. My first, and easiest, task was to determine which photos were taken with a digital camera.
Instead of re-inventing the wheel, I looked for a command line program that would print out the metadata for each photo. I first alighted upon ExifTool by Phil Harvey. This program is extremely thorough, so I began to code around it.
My first step was to gather all of the available metadata for all of the images. I whipped together some code to do this, but I soon found out that ExifTool is slooooow. At the rate it was processing images, it was literally going to take days! After a bit more searching, I found jhead, another tool that would give me the data I needed for determining which camera took which picture. Admittedly, it didn’t return as much information as ExifTool, but it would suffice for this step. The following Python code used jhead to sort out all of the photos taken by digital cameras:
import subprocess
import os
import shelve

def pic_data(fn):
    """Returns a dictionary with photo data
    """
    cmd = 'jhead.exe'
    # pass the arguments as a list so paths containing spaces survive
    proc = subprocess.Popen([cmd, fn],
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    stdout_value = proc.communicate()[0]
    stdout_value = stdout_value.strip()
    out_list = stdout_value.splitlines()
    img_data = dict()
    for line in out_list:
        # split on the first colon only; values such as dates contain colons
        record = line.partition(':')
        img_data[record[0].strip()] = record[2].strip()
    return img_data

pics_location = 'C:\\LOCATION_OF_PICTURES'
index = 0
database = shelve.open('pic_database')
database.clear()
for dirs in os.walk(pics_location):
    for filename in dirs[2]:
        if filename.lower().endswith('jpg'):
            pic = os.path.join(dirs[0], filename)
            info = pic_data(pic)
            if 'File name' not in info:
                # jhead produced no usable output for this file
                continue
            database[info['File name']] = info
            # report progress once every 100 photos
            index += 1
            if index % 100 == 0:
                print('Stored ' + info['File name'] + ' (among others)')
database.close()
What this does is iterate through all of the photos and run jhead on each one. Jhead produces output like this:
File name    : FULL_PATH_TO_PHOTO_HERE
File size    : 2420211 bytes
File date    : 2005:05:14 17:55:33
Camera make  : Canon
Camera model : Canon PowerShot G6
Date/Time    : 2005:05:14 17:55:33
Resolution   : 3072 x 2304
Flash used   : No
Focal length : 11.2mm (35mm equivalent: 56mm)
CCD width    : 7.21mm
Exposure time: 0.067 s (1/15)
Aperture     : f/4.0
Focus dist.  : 1.27m
Whitebalance : Auto
Metering Mode: matrix
The actual data and fields vary depending on which photo jhead is run on. The output above is from a photo taken by a digital camera. This sort of data will help me filter out all the digicam pics.
The above Python script uses the awesome shelve module to create what amounts to a persistent dictionary. This was important because I knew that it’d take me several trial runs to get my next script working right, and I didn’t want to have to wait for jhead to run on all the photos each time. Even though jhead was much faster than ExifTool, it still took maybe 30 minutes to get the data for all the photos.
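As a small, self-contained illustration of the two ideas at work here (splitting each "field : value" line on its first colon, and keeping the result in a shelf so it survives between runs), here is a sketch. The helper name `parse_jhead`, the file name `photo.jpg`, and the temporary database location are assumptions made for the example; the sample text is a trimmed copy of the jhead output shown above.

```python
import os
import shelve
import tempfile

def parse_jhead(text):
    """Turn jhead-style 'field : value' lines into a dictionary."""
    data = {}
    for line in text.strip().splitlines():
        # partition splits on the first colon only, so values that
        # contain colons (dates, times, Windows paths) stay in one piece
        key, _, value = line.partition(':')
        data[key.strip()] = value.strip()
    return data

sample = """File name    : photo.jpg
Camera make  : Canon
Camera model : Canon PowerShot G6
Date/Time    : 2005:05:14 17:55:33
"""

# hypothetical location for the demo shelf
db_path = os.path.join(tempfile.mkdtemp(), 'demo_database')

# first run: parse once and store the result
db = shelve.open(db_path)
record = parse_jhead(sample)
db[record['File name']] = record
db.close()

# later run: reopen the shelf and read the record without re-parsing
db = shelve.open(db_path)
stored = db['photo.jpg']
db.close()
print(stored['Camera model'] + ' / ' + stored['Date/Time'])
```

The second `shelve.open` stands in for a later script: it gets the parsed dictionary back from disk without touching jhead again.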
Digital Camera Photos
Now all I had to do was write a bit of code to look up each picture in the database I just created that had the fields ‘Camera make’ and ‘Camera model’. The following code did just that.
import os
import shutil
import shelve

database = shelve.open('pic_database')
scratch_dir = "C:\\Users\\Therms\\Desktop\\campics"
make_dict = dict()
#first we create a dictionary with each camera make
#as the key and a list of tuples consisting of the filename
#and camera model as the value
for pic in database:
    curpic = database[pic]
    if 'Camera make' in curpic:
        #some photos may record a make but no model
        model = curpic.get('Camera model', 'Unknown')
        tup = curpic['File name'], model
        make_dict.setdefault(curpic['Camera make'], []).append(tup)
#now we step through the make dictionary and create a
#directory structure like this: CameraMake/CameraModel/pics
#finally we delete each moved pic from the pic database so that
#we don't process them in later scripts
for make in make_dict:
    index = 1
    print(make)
    for filez, model in make_dict[make]:
        print('Moving ' + str(index) + '/' + str(len(make_dict[make])))
        model_dir = os.path.join(scratch_dir, make, model)
        #makedirs also creates scratch_dir and the make directory if needed
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
        shutil.move(filez, model_dir)
        del database[filez]
        index += 1
database.close()
Of course, many of the pics that PhotoRec had recovered from our Firefox cache also had camera make/model information, but by sorting the photos into a directory structure containing the make/model information, I was able to just delete all the photos taken by cameras that we didn’t own, leaving me with just our photos.
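A quick tally per make and model can make it easier to spot cameras you never owned before deleting anything. The following is only a sketch of that idea, not part of the scripts above; the function name `count_by_camera` and the sample records are hypothetical, shaped like the parsed jhead output.

```python
from collections import Counter

def count_by_camera(records):
    """Count photos per (make, model) pair, skipping records without a make."""
    counts = Counter()
    for record in records:
        if 'Camera make' in record:
            model = record.get('Camera model', 'Unknown')
            counts[(record['Camera make'], model)] += 1
    return counts

# hypothetical records in the same shape as the parsed jhead output
records = [
    {'File name': 'a.jpg', 'Camera make': 'Canon',
     'Camera model': 'Canon PowerShot G6'},
    {'File name': 'b.jpg', 'Camera make': 'Canon',
     'Camera model': 'Canon PowerShot G6'},
    {'File name': 'c.jpg', 'Camera make': 'NIKON', 'Camera model': 'E995'},
    {'File name': 'd.jpg'},  # e.g. a scanned image with no camera fields
]

counts = count_by_camera(records)
for (make, model), total in sorted(counts.items()):
    print(make + ' / ' + model + ': ' + str(total))
```

A make/model pair with only a handful of photos is a good candidate for something that came out of the browser cache.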
Planning the next step
At this point, I was a little unsure about how to proceed, so I spent a little time running ExifTool on some of the photos that were left, trying to see if there was a pattern of metadata that would help me filter out some more of our photos.
As I mentioned earlier, ExifTool is slower than jhead, but it does return a lot more data about each photo. One piece of metadata it returns is a field called “Software” from the Image File Directory section of the EXIF metadata. This field is saved whenever you edit an image with Picasa and is set to “Picasa 2.7” (or whichever version was used). As my wife had edited a large number of the pics she’d scanned in, this gave me another large set of images to filter out.
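Since other programs can also write a “Software” field, one cautious way to use it is to check the value and not just its presence. This is a sketch under that assumption; the function name `edited_with_picasa` and the sample records are made up for illustration.

```python
def edited_with_picasa(record):
    """True when the record's Software field mentions Picasa."""
    return 'picasa' in record.get('Software', '').lower()

# hypothetical records in the shape of parsed metadata
records = [
    {'File Name': 'one.jpg', 'Software': 'Picasa 2.7'},
    {'File Name': 'two.jpg', 'Software': 'Picasa 3.0'},
    {'File Name': 'three.jpg', 'Software': 'Some Other Editor'},
    {'File Name': 'four.jpg'},  # no Software field at all
]

picasa_files = [r['File Name'] for r in records if edited_with_picasa(r)]
print(picasa_files)
```

Matching on the word rather than an exact string keeps the check independent of which Picasa version did the editing.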
The Picasa Code
Unfortunately, the jhead database I had constructed wasn’t going to work for me anymore because it just didn’t contain the data I needed. ExifTool is much more thorough (and this is for a photo with very little metadata):
---- ExifTool ----
ExifTool Version Number     : 7.48
---- File ----
File Name                   : f12049608.jpg
Directory                   : \\EHUD\Photos\picasa pics
File Size                   : 320 kB
File Modification Date/Time : 2008:10:10 13:12:29-05:00
File Type                   : JPEG
MIME Type                   : image/jpeg
Exif Byte Order             : Little-endian (Intel, II)
Current IPTC Digest         : 5e62ad2acd8219df49858cf37b38613b
Image Width                 : 2146
Image Height                : 1552
Encoding Process            : Baseline DCT, Huffman coding
Bits Per Sample             : 8
Color Components            : 3
Y Cb Cr Sub Sampling        : YCbCr4:2:0 (2 2)
---- JFIF ----
JFIF Version                : 1.01
Resolution Unit             : None
X Resolution                : 1
Y Resolution                : 1
---- IFD0 ----
Software                    : Picasa 3.0
---- ExifIFD ----
Exif Version                : 0210
Image Unique ID             : e98f333bf6bde083ad94ae5444e972ed
---- InteropIFD ----
Interoperability Index      : Unknown ( )
Interoperability Version    : 0100
---- IPTC ----
Caption-Abstract            : 6 months old, Jan. 2008
Keywords                    : meilyn
---- Composite ----
Image Size                  : 2146x1552
Thus I tweaked the above jhead script to parse the output from ExifTool into a database (just some minor tweaking of the pic_data function to take in ExifTool output).
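The tweaked function isn’t reproduced here. As a rough idea of what such a parser might look like, here is a minimal sketch that assumes the grouped output format shown above; the name `parse_exiftool` and the trimmed sample are illustrative only.

```python
def parse_exiftool(text):
    """Parse ExifTool-style grouped output into a flat dictionary."""
    data = {}
    for line in text.strip().splitlines():
        line = line.strip()
        # skip blank lines and group headers such as '---- IFD0 ----'
        if not line or line.startswith('----'):
            continue
        # split on the first colon only; timestamps contain more colons
        key, sep, value = line.partition(':')
        if sep:
            data[key.strip()] = value.strip()
    return data

sample = """---- File ----
File Name                   : f12049608.jpg
File Modification Date/Time : 2008:10:10 13:12:29-05:00
---- IFD0 ----
Software                    : Picasa 3.0
---- IPTC ----
Keywords                    : meilyn
"""

info = parse_exiftool(sample)
print(info['Software'])
```

Because the result is flat, a tag name that appears in two groups would keep only its last value; that is fine for a check on a single field like Software.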
Once I had that done (it took hours to parse all the photos), I was able to whip up a simple script to move all those photos over.
import shelve
import os
import shutil

database = shelve.open('pic_database')
scratch_dir = "C:\\DIRECTORY_TO_STORE_PICASA_PHOTOS"
if not os.path.exists(scratch_dir):
    os.mkdir(scratch_dir)
#collect the matches first so that we don't delete from the
#database while iterating over it
pic_list = []
for pic in database:
    curpic = database[pic]
    #any photo with a Software field is treated as a Picasa photo
    if 'Software' in curpic:
        pic_list.append(pic)
picasas = len(pic_list)
index = 1
for f in pic_list:
    shutil.move(f, scratch_dir)
    del database[f]
    print('Moved ' + str(index) + ' of ' + str(picasas))
    index += 1
database.close()
That’s all for now. Next time I’ll talk about how I filtered through more of the photos.