Mostly Unique Picture Filenames

At the broadest level you can divide pictures into two categories: those you take yourself and those you have downloaded from elsewhere.  Tack a simple hash of the contents onto the filename in some consistent manner and you can practically guarantee global uniqueness.

In the case of the former, the date is the only unique constant.  As a non-professional photographer, you can’t, without extreme effort, take two pictures at exactly the same time.  So while a collision is possible, you can compare the dates of two files (the value stored in the JPEG metadata, since file-system dates can get mangled) to determine whether they are the same picture.
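The date-plus-hash naming idea can be sketched roughly as follows.  This is only an illustration: the function names, the filename layout, and the choice of SHA-1 truncated to 8 characters are my own assumptions, and the date is assumed to arrive as the "YYYY:MM:DD HH:MM:SS" string that EXIF uses for its date-taken field (reading it out of the file is left to whatever imaging library you prefer).

```python
import hashlib
from datetime import datetime

# Layout EXIF uses for its date-taken value, e.g. "2009:03:14 15:09:26".
EXIF_DATE_FORMAT = "%Y:%m:%d %H:%M:%S"


def picture_filename(exif_date, content, extension="jpg", hash_chars=8):
    """Build a name of the form YYYYMMDD-HHMMSS-<short hash>.<extension>.

    The layout and the truncated SHA-1 are hypothetical choices, not a standard.
    """
    taken = datetime.strptime(exif_date, EXIF_DATE_FORMAT)
    digest = hashlib.sha1(content).hexdigest()[:hash_chars]
    return f"{taken:%Y%m%d-%H%M%S}-{digest}.{extension}"


def same_shot(exif_date_a, exif_date_b):
    """Treat two pictures as the same shot when their date-taken values match."""
    parse = lambda value: datetime.strptime(value, EXIF_DATE_FORMAT)
    return parse(exif_date_a) == parse(exif_date_b)
```

Calling picture_filename("2009:03:14 15:09:26", data) with the raw file bytes gives a name that sorts by date and only collides if both the timestamp and the hash prefix match.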

For files downloaded from elsewhere on the internet that have been passed through some kind of editor and no longer retain their date-taken metadata, you can only hash the content and compare file sizes.  The problem comes when a picture has been resized, so that both the content hash and the file size differ even though the visual is practically the same.  I believe that in most cases people modify a picture file in order to resize it, and the next most common reason is to add their own mark advertising their site (a tag at the bottom, a watermark, a logo, etc.).  There’s little you can do about added marks, but what if the hash were computed on a normalized, resized version of the picture?  Say you always resize to 425xN, where 425 is the width and N is the height scaled to keep the aspect ratio.  Then run the image through a filter to remove minor level changes and hash that content.  I wonder how effective it would be in finding duplicates, and then at that point how to programmatically tell which version is higher quality.
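To make the resize-then-filter-then-hash idea concrete, here is a minimal sketch in plain Python.  It assumes the picture has already been decoded into a grid of grayscale values from 0 to 255 (one list per row), which a real program would get from an imaging library.  The nearest-neighbour resize, the 32-level bucket size standing in for the "filter", and the names normalize and visual_hash are all my own assumptions, not a tested method.

```python
import hashlib

TARGET_WIDTH = 425  # the fixed width suggested in the text
LEVEL_STEP = 32     # hypothetical bucket size used to flatten minor level changes


def normalize(pixels, target_width=TARGET_WIDTH, level_step=LEVEL_STEP):
    """Resize a grayscale grid to a fixed width and coarsen its levels.

    pixels is a list of rows, each a list of ints in 0..255.
    """
    src_h, src_w = len(pixels), len(pixels[0])
    # N: the height scaled to keep the aspect ratio at the new width.
    target_height = max(1, round(src_h * target_width / src_w))
    out = []
    for y in range(target_height):
        src_row = pixels[y * src_h // target_height]  # nearest-neighbour row
        out.append([src_row[x * src_w // target_width] // level_step
                    for x in range(target_width)])
    return out


def visual_hash(pixels, target_width=TARGET_WIDTH, level_step=LEVEL_STEP):
    """Hash the normalized grid, so resized copies can produce the same digest."""
    normalized = normalize(pixels, target_width, level_step)
    data = bytes(value for row in normalized for value in row)
    return hashlib.sha1(data).hexdigest()
```

An exact hash like this is brittle: a pixel that sits near a bucket boundary, or a resize done with a different algorithm, will change the digest.  So I would treat it as a first-pass filter for candidate duplicates rather than a verdict.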

You can’t always determine the quality level of a picture file based on its dimensions or file size.  People tend to inflate both of these artificially, for whatever reason, and quite often I’ve seen pictures where the smaller of two duplicates is better looking.  Still, I think there are some simple automatic checks you can apply to get decent results.

  • Insanely high dimensions are indicative of an original because only complete assholes will blow up a tiny image into something like 2200×3600.  A file-size threshold of about half a megabyte will generally tell you whether this is the case.
  • If JPEG metadata exists, you can compare its values to the picture itself.  For example, it may record the original resolution (EXIF has PixelXDimension and PixelYDimension fields for this), which you can compare against the actual image dimensions.
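The two checks above could be combined into something like the sketch below.  The function name, the hint strings, and the exact cut-offs are hypothetical.  In particular, I am reading the half-megabyte rule as "a huge image under half a megabyte was probably blown up", which is my interpretation rather than an established fact.  The metadata dimensions are assumed to have been read already by some imaging library.

```python
HUGE_PIXELS = 2200 * 3600        # the "insanely high" example from the list
MIN_ORIGINAL_BYTES = 512 * 1024  # the half-megabyte threshold


def quality_hints(width, height, file_size, exif_width=None, exif_height=None):
    """Return rough, heuristic hints about whether a file looks like an original."""
    hints = []
    if width * height >= HUGE_PIXELS:
        # Huge dimensions with very few bytes suggest an artificially enlarged image.
        if file_size >= MIN_ORIGINAL_BYTES:
            hints.append("likely original")
        else:
            hints.append("possibly upscaled")
    if exif_width and exif_height:
        # Metadata that disagrees with the actual size means the file was resized.
        if (exif_width, exif_height) == (width, height):
            hints.append("dimensions match metadata")
        else:
            hints.append("resized since capture")
    return hints
```

For example, an 800×600 file whose metadata says 1600×1200 comes back with "resized since capture", so its larger sibling is the better candidate to keep.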

I’m just thinking aloud as I ponder collecting assets for mini-content sites.
