28 August 2010

ImageImprint - Identify Identical Images, Changes In Images And Produce Minimal JPEGs

Create hashes (checksums) for the content of JPEG files and other image types to identify duplicate images and image changes.

I've used MD5's for some time now to keep any eye on duplicate files and changes in files that I don't want to change. But when it comes to images the standard routine of just running an MD5 over the whole file doesn't cut it. The problem with image files (and many other types of files) is that the file can change without the image content actually changing, this might happen for example if you add meta data to the file, or the image is opened and resaved without any editing but changing the layout and structure of the file. Now while these are indeed changes to the file content and indeed invalidate a complete file hash this is not always desirable, you see what I am concerned about is image content. So what I wanted was a more fine grained way of identifying if an image file had changed and if so whether the actual image content had changed or just some other meta/extra data in the image file. What ImageImprint does is generates up to four image hashes (2 for any image file, 4 for JPEG files) that provide more insight on whether an image file has changed and the nature of the changes.

The four types of image hash ImageImprint can produce are as follows:
  • Full File Hash: The same as any other hashing program, runs the hash over the entire file. It can be used to identify identical files and any changes in the file.
  • Generic Image Hash: Renders the image to a standard format and then hashes the result. It can be used to identify identical images (even when the file format and encoding scheme vary "in some instances: non lossy") and changes to an image.
  • Minimal Jpeg Hash: Only for JPEG images. Reduces the JPEG to its most minimal form in a standard format and hashes the result. Can identify identical images that were derived from the same base encode and identify changes to a JPEG raw image data.
  • SOS Jpeg Hash: Only for JPEG images. This is really the kernel of a JPEG file, it is basically the raw image data but doesn't contain enough information to reconstruct the full image (the coding key information is missing). You can think of this like the image information with the colour space missing, or a compression algorithm with the key lookup table removed, I guess it is somewhere between those in reality. This really automates the second method in this process. Can be used in much the same way as the above hash, but does not guarantee complete JPEG integrity, and therefor can be used in some instances where a JPEG is more significantly reorganised (say optimised) without the actual image  changing.

Hopefully I'll end up getting this on a wiki so that users can update it to be more useful but for now if you have any comments or suggestions, just leave them as posts.

Release Version

Click here to download the latest version of ImageImprint (V 0.9.0.0).




Screen Shots




Usage

This is the same as "-h" but a little more detailed.

To run the program using these options you either need to create a shortcut to the application and then add the comandline options to the target line:


or run the program from the command prompt and append the command line options:


Usage: YarragoImageImprint_V_0_9_0_0.exe [options]

-l: Displays the programs licence.

-file: The file that ImageImprint will produce hashes for. ImageImprint only runs in file mode or directory mode, if both are specified file mode takes precedence.

-d {directory}: The directory that ImageImprint will produce hashes for.

-s {log}: Save the hash data to file instead of displaying it. Instead display the status.
Status: Start - Current - Files Processed - Second Per Files - Instantaneous Second Per Files (the number of seconds it took to process the last file).

-nr: The default operation of the program is to recursively process all sub directories in the base directory. This option stops the behavior so that only files directly in the base directory are processed.

-fs: Write out file size.

-f: Generate a full file hash.

-j: JPEG files only. Generate a minimal JPEG file hash.

-jffo: Only use the first frame in the JPEG image (only effects -j and -sos, does not effect -i). Some JPEGs contain multiple images, if this option is used only the first frame will be used for the JPEG hashes and subsequent images are ignored.

-sos: JPEG files only. Generate a hash based only on the SOS segment(s).

-w {directory}: Write out the hash data files for the minimal JPEG files and/or SOS segments to this directory (implies -j if it is not used).

-i: Generate the generic image hash (for any known image type).

-nv: Do not verify images when using the generic image method (this speeds up processing time considerably).

-debug: Run the debug test (takes time and does not produce a hash, but checks the JPEG hashes produced are sane).

-hash {algorithm}: Change the algorithm used to produce the hash (md5, sha1). Default if not specified is MD5.

-v {version}: The JPEG imprint version to use. Not supported in the current version but will be available in future versions if the hash algorithms change so you can verify against old hashes you produced. This is for legacy usage only and you should always generate new hashes with the most recent version of the algorithm.

-qv: Gets the program versions.

-np: Do not prompt to enter a key on exit.

Here is a sample command line I would typically use:
YarragoImageImprint_V_0_9_0_0.exe -f -j -sos -i -d "C:\pathtoimagefiles\ " -s "C:\ImageImprintLog.csv"

Here is a sample command line I would use if I was debugging or wanted to see what was going on:
YarragoImageImprint_V_0_9_0_0.exe -f -j -sos -i -d "C:\pathtoimagefiles\ " -s "C:\ImageImprintDebugLog.csv" -w "C:\writefileshere\ " -debug


Sample Usage

There are a number of tasks that ImageImprint will help you do:
  • Identify identical duplicate images.
  • Ensure that images aren't becoming corrupted over time either though disk corruption or human error.
  • Identify if a program is capable of performing an operation losslessly. (Really just a permutation of the above). 
  • Produce minimal JPEG files (without metadata). This serves a couple of purposes:
    • It can let you produce minimal files with identical images for further use or distribution. (E.g. If you want to produce images for distribution that don't have any metadata). 
    • It can let you debug my program for me, you can see what is being produced and determine whether there is a problem with it.


    What I typically do is generate the hashes for a collection of files and have the program save them to a .csv file using a command like:
    YarragoImageImprint_V_0_9_0_0.exe -f -j -sos -i -d "C:\pathtoimagefiles\ " -s "C:\ImageImprintLog.csv"

    To find duplicates I then open the result in spreadsheet software and sort each type of hash (one type at a time) and use a formula to match identical hashes.



    To find changes I use some method (spreadsheet or BeyondCompare) to compare an old log file with a current log file to find differences.

    If you have a single photo you are interested in you should just be able perform the comparison by hand.

    I have produced a set of sample images and hashes to better explain what the different hashes mean in a real world context.

    First download the sample images from here.

    Lenna is the standard test image so I wont depart here.


    Here is a table showing the each type of hash for a number of images, these are MD5's which I have shortened to just the first 3 bytes and last 3 bytes so 137BC77A0D150F3C6149755C968A38DE becomes 137...8DE.

    File Name Full File JPEG SOS Generic
    Image_Sample.bmp137...8DE

    225...169
    Image_Sample_RGB_Data.dat 225...169


    Lenna.bmp 303...ACC

    2DB...041
    Lenna.tiff 727...27E

    2DB...041
    Lenna_Jpeg_Standard.bmp 06A...BF3

    081...FC5
    Lenna_Minimal.jpg B1B...376 B1B...376 181...C61 081...FC5
    Lenna_Minimal_SOS_Segment_Only.sos 181...C61


    Lenna_Progressive.jpg E11...2E1 350...C99 AE3...8D7 081...FC5
    Lenna_Standard.jpg 587...0C9 B1B...376 181...C61 081...FC5
    Lenna_Standard_Comment.jpg 82A...320 B1B...376 181...C61 081...FC5
    Lenna_Standard_Meta_1.jpg 9B9...1F6 B1B...376 181...C61 081...FC5
    Lenna_Standard_Meta_2.jpg 108...790 B1B...376 181...C61 081...FC5
    Lenna_Standard_Restart.jpg 85E...194 729...99A 33E...D76 081...FC5

    Lenna.tiff: The original image file.

    Lenna.bmp: A resave from the original image. The file hash differs from the original because the file format is different, but the generic image hash stays the same because BMP is lossless and so the image is identical to the original.

    Lenna_Standard.jpg: A resave from the original. Again the file hash differs because it is a different file. The image hash also differs from the original because JPEG is a lossy compression and so the actual image is different. We also get a JPEG and SOS hash because it is a JPEG.

    Lenna_Jpeg_Standard.bmp: A resave from Lenna_Standard.jpg. The generic image hash stays the same as Lenna_Standard.jpg because BMP is lossless.

    Lenna_Minimal.jpg: This is the same as Lenna_Standard.jpg but is the minimal JPEG (as written out by ImageImprint), both JPEG hashes for both files are the same. The file hash and JPEG hash for this file are also the same, because the image is already in its minimal form and so when it is processed the same image is produced.

    Lenna_Minimal_SOS_Segment_Only.sos: This is a file that just contains the SOS segment from the Lenna_Standard.jpg image. You can see that the file hash matches the SOS hash from all images derived from Lenna_Standard.jpg.

    Lenna_Progressive.jpg: A resave of the original image as a progressive JPEG. The JPEG hash differs because the raw JPEG data is totally different, but because GIMP can save identical progressive and standard JPEG images the generic image hashes match.

    Lenna_Standard_Comment.jpg: A resave of Lenna_Standard.jpg with a comment added to the file. The file hash changes because the file has varied, but the JPEG, SOS and generic image hashes match because the JPEG image data hasn't been altered.

    Lenna_Standard_Meta_1.jpg: Much the same as above.
    Lenna_Standard_Meta_2.jpg: Much the same as above.

    Lenna_Standard_Restart.jpg: Much the same as progressive, except that it is a standard JPEG image with restart markers.

    Image_Sample.bmp: This is a test file I created to demonstrate how the generic image hash works. It is a normal 24bit BMP test image.

    Image_Sample_RGB_Data.dat: This is the raw data that is generated by the generic image hash function, hence the file hash is the same as the generic image hash for Image_Sample.bmp.

    Analysis

    The following is a description of what the program does and a bit of background on how it works.

    Generic Image
    Raw pixel data BGR 8 bytes per pixel (why BGR? because System.Drawing.Imaging.PixelFormat.Format24bppRgb is actually BGR which I expect is because of the endianness of the machine it is running on) and I didn't really want to add the overhead of reordering (it really doesn't matter so long as it is standard). Starting from the top left pixel and then working across the top row in order and then the second top row and so on until the far right pixel in the bottom row is reached.



    Here is the data that is produced to hash (the linebreaks and spaces are formatting niceties only).

    0000FF 0000FF 0000FF 0000FF 0000FF
    00FF00 00FF00 00FF00 00FF00 00FF00
    FF0000 FF0000 FF0000 FF0000 FF0000
    0000FF 00FF00 FF0000 0000FF 00FF00
    FF0000 0000FF 00FF00 FF0000 0000FF
    00FF00 FF0000 0000FF 00FF00 FF0000
    0000FF 0000FF 0000FF 0000FF 0000FF
    FF0000 FF0000 0000FF 00FF00 00FF00
    00FF00 00FF00 0000FF FF0000 FF0000
    0000FF 0000FF 0000FF 0000FF 0000FF

    If you want to see how it actually works you can pump it through an online hex to MD5 converter (such as the one here) to see that the hash produced is the same as in the table above.

    For those of you that better understand the JPEG standard you may be interested to know in more detail exactly how the JPEG hashes work. For those of you who are interested but don't know how JPEG files work I suggest you take a look here, here or here if you are really keen (Appendix B) and then come back when you have the basics down pat.

    When generating a hash for JPEG a minimal JPEG file is generated (you can write it out and examine the contents using the "-w" command line switch). The minimal JPEG only uses the segments found below from the original JPEG and discards all other segments. The segments are then sorted into a standard order and multiple segments of the same type are joined into a single segment. In this way JPEG files that have the same content but in a differing fragmentation can be converted to a uniform structure leading to identical output images and therefor image hashes.

    Jpeg Standard Format
    SOI
    [Tables/Misc]
    SOF
    [Tables/Misc]
    SOS[...More SOS's...]
    EOI

    Tables/Misc:
    DQT
    DHT
    DRI

    All tables joined into a single segment. Tables ordered by decoder id, then by precision (8 bit then 16bit for DQT's) or table type (AC then DC for DHT's). If the id byte is identical (decoder id + other info) is the same, then the tables are left in the order they appear in the image, in practice I don't think this should happen.

    The following are some information and caveats that I have identified with each hash.
    • Full File Hash: File hash changes even when the actual image content stays the same.
    • Generic Image Hash: Images that contain higher then 24bit fidelity may be incorrectly paired with images using 24bit or above (because the image is effectively flattened into 24bits). I have not tested images with an alpha channel. No guard against image skewing (I.E. you could combine the entire image onto a single row or any combination of rows, so long as the pixel order is the same and generate the same image), but in practice I don't think this will ever happen. Some image types can contain multiple images in the file, this hash is only ever run over the primary image in the file (subsequent images are ignored). This type of hash takes the longest because it uses the standard .NET image conversion function which somehow verifies the whole image and this process seems to take a while (and for this reason this type of hash is probably more robust and less error prone). It also uses up considerably more resources (RAM) because the full image must be decompressed to this raw format and for even reasonable sized JPEG's this can mean a whole bucket load of raw data (yes this could be done slightly more efficiently).
    • Minimal Jpeg Hash: There are a number of scenarios that could lead to a minimal JPEG that are different even though the image is truly identical and this is why you have the other hashes to provide some fall back against this. Will not identify more complex changes then simple reorganisation and fragmentation file changes (GIMP can produce identical progressive vs standard images, which leads to grossly different underlying data but identical images).
    • SOS Jpeg Hash: The image could become corrupt (conceivable the DHT, or DQT tables could become corrupt) without altering the hash. 
    Obviously the full file hash will be stable over time. I also imagine that both the JPEG hashes will be stable, at least up to a point, with bug fixes been included in later versions but the version commandline option providing backward historical compatibility for comparison against existing hash logs. However I envisaged that the generic image hash would not necessarily be stable because of the way it is implemented (I'd really like to get it to RGB one day, it just seems nicer), so don't rely on it for long term achieve verification alone (well at least not yet).
      The JPEG methods perform some file type validation (they make sure that the structure seems fine and that the critical segments are there but they don't check the data in the segments are sane. For this reason if you want to be sure that your image files are sane to begin with you are better using a different program or running the generic image hash which make use of Microsoft's image validation, although I'm not sure how this works and what its tolerance is like.

      As a result of the above it is best to use a combination (a couple) of the above for most tasks, hence giving you further insight on how the file and/or image has changed.


      Testing And Debugging

      This program was really only designed for my personal use, I didn't intend releasing it. I've only released it because I thought it might be useful to someone else. As a result the error handling is not as robust as it could have been, because I figured if there was a problem I would identify it when it occurred and I would treat it well and would only give it valid data. So go easy on it and be careful, it isn't the most robust code I've ever written but it does work well if you give it valid data (I.E. directories that actually exist). If you have a genuine repeatable bug you can produce with reasonable input let me know and I'll see what I can do to fix it. I'd say its somewhere between Alpha and Beta level code. I have successfully run the program over my photo collection which has in excess of 50,000 JPEG's and am fairly confident with the output. Running it over this collection from a network drive with the "-debug" option (which takes significantly longer) took around 2 full days.

      Here are a list of bugs that have been identified that I'm aware of but that I haven't got around to fixing yet:
      • When files are written out they are only valid if a valid hash is generated. I have not actually experienced this bug, but I expect that this is what happens if the input file is not valid.

      So this is where you can come in, the program works well and if you treat it well you can get a lot out of it, but if you do find an error let me know and I will try to fix it. I provide the software and you provide the testing win-win. The most untested component is the command line interface, because up until just about the release version I just modified the parameters in code (it meant I didn't have to have spend time on something that when I was using it myself I didn't need).

      Debug checks you can run for me:
      • Run the debug check for all your files and let me know if there are any failures.
      • There were a few other errors I was going to get you to check for me but I can't think of them right now? Oops. 

      The checks that the debug option performs are as follows:
      • They checks only apply to the custom JPEG processing I have written. 
      • Check that the minimal JPEG that is produced matches (image wise) the original image. 
      • Check that rehashing the minimal JPEG produces the same result.


      Plugins
      I see a few plugins in the future of this program so that users can generate these hashes as they download their images from their camera or card as part of their workflow. If you are from a company that produces this software you have two options, either write the plugin your self that makes use of the commandline interface of the program or give me a free version to work with and if I find time I will write the plugin for you.


      Interested in JPEGs in detail or Photo Management

      In the course of developing my program I was doing a bit of in depth reading on how JPEG files work and I came across the following site:
      http://www.impulseadventure.com/photo
      I think its really after my own heart on a lot of digital photography management and its really worth a look.

      13 August 2010

      Printing Word Documents To XPS With Default Filenames

      I was fairly disappointed when I found that when printing to an XPS file from Word you have to manually specify the filename every time. This is fine if you have one file you want to convert to an XPS file or even a few, but if you have a number it becomes quite a tedious process. I'm now using XPS documents as part of further work flow and I found it frustrating to have to enter the filename every time.



      What I actually wanted was an easy method to be able to convert Word documents to image files. A few months ago I wanted to be able to easily convert any document to an image file. There were a range of products out there, but none of them did what I wanted so I rolled my own that was able to convert XPS documents to image files:


      If you’re interested in this let me know and I'll upload it so everyone can use it.
      Update: I have made this available, see my free software page.

      With Word documents you already have the option to right click and print, but this only works in this case if your default printer is set to the XPS writer and you'll still have to specify the file name every time, all in all if you have enough documents this eats up a lot of time.

      So what I really wanted to be able to do was not even have to open the document, but instead right click on it and simply select Print To XPS.


      I couldn't find the information on how to do this, so I worked out how to do it and thought I would pass on here.

      The solution I came up with can be used in two ways:
      • If you just want to be able to quickly print the document you are working on to an XPS file with the same name from inside word.
      • If you want to be able to right click in explorer and print a word document straight to an XPS file.

      There are two parts to installing this:
      • Installing the macro into word.
      • Creating the right click context menu in explorer using the registry editor.


      Installing The Macro Into Word

      These instructions are for Microsoft Word 2003, but I'm sure that you can do the same things for other versions of Word, you will just have to be a little clever if things aren’t in the same place.

      Update 01/09/2010: I have tested this with Word 2007 under Windows 7 and found that it still works, there is just a little variance in where things are located.

      Open up a new word document and go to:
      Tools->Macro->VisualBasicEditor

      You will see the project explorer, it will contain two documents Normal and something like "Project(Document1)", we are interested in "Normal" because we want to apply the macros to the Normal template so that we can run it from any document, if you select the other option you will apply it to the currently opened document, which means the macro will only work with that document.

      Download this macro file I have created, which contains all the macros we need to print to XPS files using the documents name as the name for the outputted XPS file.

      Right click on "Normal" and select "Import File...".


      Select my macro file from the import file dialog.

      The file will now appear under the modules in the normal document and the code should show in the editor.


      Close the VisualBasic Editor.

      To test, Tools->Macro->Macros, two new macros should appear"
      "PrintToXpsWithFilename"
      "PrintToXpsWithFilenameAndClose"

      You should be able to directly run these two macros:
      "PrintToXpsWithFilename" will print to an .xps file with the same name and in the same directory as the current document and leave the document open.
      "PrintToXpsWithFilenameAndClose" will do the same, but will close the document (and word) once it has been printed.


      If you want you can now assign keyboard shortcuts or menu buttons, but these are fairly trivial so I will leave you to do that yourself.

      Those of you who are just interested in the macros and are not interested in the right click functionality can leave us now. For the rest we will progress onto the registry.

      Creating The Right Click Context Menu In Explorer Using Registry Editor

      If you aren’t familiar editing the registry you might want to do a bit of background reading first.
      Obviously backup your registry data first.

      Fire up the registry editor.

      First we will do it for old .doc documents, but see further down if you want to do it for .docx’s.

      Open the following key:
      HKEY_CLASSES_ROOT\Word.Document.8\shell\

      Create a key called "PrintToXPS", set its (default) string value so that it is "Print To XPS" this will be the name that shows up on the files right click explorer menu.


      Now create a sub key in the “PrintToXPS” key called "command", this key will hold the action that we want to perform. So now you should have:
      HKEY_CLASSES_ROOT\Word.Document.8\shell\PrintToXPS\command

      Set “command”s (default) string value to everything between the brackets but not including them ["C:\Program Files\Microsoft Office\OFFICE11\WINWORD.EXE" /mPrintToXpsWithFilenameAndClose /q /n "%1 "]

      Update 01/09/2010: For other versions of word you need to adjust the path you use. For Word 2007 use:
      C:\Program Files\Microsoft Office\OFFICE12\WINWORD.EXE 


      If you want to do it for other types of documents we need to find out where the shell command for that document type is located. I imagine it will be the same on your system so you probably won't need to look this up, but if you have trouble see what values have on your system.
      To do this expand HKEY_CLASSES_ROOT\, you will see a huge amount of sub keys, you want to scroll down until you find the extension of the word document you want to be able to print from. On my system I can see the following two keys with the default value listed afterward:
      HKEY_CLASSES_ROOT\.doc - Word.Document.8
      HKEY_CLASSES_ROOT\.docx - Word.Document.12

      Just replace this in the key paths you opened above and follow the instructions. So if I wanted to make it work for .docx I would open the following key and then follow the instructions above:
      HKEY_CLASSES_ROOT\Word.Document.12\shell\

      That should be all, you should now be able to right click on the Word document types you have set up and click “Print To XPS”.




      Some further info for those who just need to know:

      I think that the above is just a symbolic link to this key:
      HKEY_LOCAL_MACHINE\SOFTWARE\Classes\Word.Document.8\shell\

      So I have a feeling (although I haven't tested this) that if you want to apply it just to the current user you could put it in:
      HKEY_CURRENT_USER\SOFTWARE\Classes\Word.Document.8\shell\Print To XPS

      Have fun!