Your Cheatin’ File Formats

A privacy problem that seems to go largely unnoticed is the personal data hidden away in computer documents without their creators’ knowledge.  In fact, nearly all of the most popular document formats use metadata (data about the document, stored inside the file itself) to tuck away all sorts of nifty descriptive information.  Here are just a few examples:

  • When it was created/changed
  • Who made the changes, based on the user name or another name captured by the operating system
  • Which applications were used, including watermarking or similar identifying information that ties a document directly back to the exact copy of the software or hardware that created it
  • And on and on

Unless you use only text (.txt) files to store data, odds are pretty good that your documents (MS Word, PDF, JPEG, etc.) have gobs of this type of extra information attached.  In most cases, while perhaps overdone by complex document formats, this additional information is intended to be useful and is not stored for any nefarious, privacy-intruding purpose.

However, privacy issues can quickly arise when these documents are published to the web.  There, their metadata can reveal personal information that their authors never intended to publish.

A perfect example of this situation that has entered the annals of Web Lore is the Cat Schwartz (of circa-2000 TechTV fame) cropping wardrobe malfunction.  An original topless image was cropped to an innocuous head shot and posted to her blog, but oops, the thumbnail stored in the metadata still showed the original uncropped photo.  It is a small yet shocking example of hidden metadata in just one complex and ubiquitous Internet data exchange format, in this case a JPEG with EXIF metadata.
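The incident hinges on the fact that EXIF can carry its own small preview image, separate from the main picture, so cropping the picture does not necessarily crop the preview. As a rough sketch (not part of the original post), the check below uses System.Drawing to see whether an image still lists thumbnail bytes in its metadata. It assumes GDI+ surfaces the embedded thumbnail under property ID 0x501B (PropertyTagThumbnailData); the class and method names are invented for illustration.

```csharp
using System;
using System.Drawing;

static class ThumbnailCheck
{
    // Assumption: GDI+ exposes raw thumbnail bytes under this property ID
    // (PropertyTagThumbnailData).
    private const int ThumbnailDataId = 0x501B;

    // Returns true if the image's property list contains a thumbnail data item.
    public static bool HasEmbeddedThumbnail(Image image)
    {
        return Array.IndexOf(image.PropertyIdList, ThumbnailDataId) >= 0;
    }
}
```

Running a check like this on a cropped photo before uploading it would at least flag that a second, possibly uncropped, copy of the picture may be riding along.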

So what are users to do if they want to “scrub” all personal information and metadata from their documents before posting to the web?  Unfortunately, there appears to be no easy, one-size-fits-all solution to this problem.  Application vendors have little to gain and much to lose by stripping out such metadata.  Their applications need access to this metadata to provide increased functionality, and the market appears to make it clear that users value this functionality over privacy.  Even when vendors do provide mechanisms to eliminate such data, they make them cumbersome and onerous.  Third-party solutions often work on only one specific complex data format.

Windows Vista, surprisingly, does provide a mechanism for doing this (Properties | Advanced | “Remove Properties and Personal Information”), but it removes only some of the obvious metadata that Windows can identify and does nothing with vendor-specific data.  Also, you have to manually select the file(s); it can’t recursively cleanse subfolders.
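The missing recursion is easy enough to sketch by hand. The helper below is a hypothetical illustration, not a Windows feature: it walks a folder tree, mirrors the subfolder layout under a separate output folder, and hands each JPEG to whatever cleaning routine is passed in as a delegate. The names `RecursiveCleanser` and `CleanseTree` are made up for this example.

```csharp
using System;
using System.IO;

static class RecursiveCleanser
{
    // Applies cleanse(sourcePath, targetPath) to every .jpg/.jpeg under sourceRoot,
    // recreating the subfolder layout under targetRoot. Returns the number of files handled.
    public static int CleanseTree(string sourceRoot, string targetRoot,
                                  Action<string, string> cleanse)
    {
        int count = 0;

        // GetFiles takes a full snapshot up front, so files written during the
        // walk are never picked up as new input.
        foreach (string path in Directory.GetFiles(sourceRoot, "*", SearchOption.AllDirectories))
        {
            string ext = Path.GetExtension(path);
            bool isJpeg = ext.Equals(".jpg", StringComparison.OrdinalIgnoreCase) ||
                          ext.Equals(".jpeg", StringComparison.OrdinalIgnoreCase);
            if (!isJpeg) continue;

            // Returned paths start with sourceRoot as passed in, so the rest is the relative part.
            string relative = path.Substring(sourceRoot.Length)
                                  .TrimStart(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
            string target = Path.Combine(targetRoot, relative);

            Directory.CreateDirectory(Path.GetDirectoryName(target));
            cleanse(path, target);
            count++;
        }
        return count;
    }
}
```

Any method that takes an input path and an output path fits the delegate, for instance the `ExifTextCleanser.CleanseJpeg` method from the listing in this post: `RecursiveCleanser.CleanseTree(@"C:\photos", @"C:\photos-clean", ExifTextCleanser.CleanseJpeg);` (both folder names are placeholders).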

Take the simplest of examples: How do I remove personal data from my JPEGs before I post them to public photo sites?

The Windows Vista “Remove Properties” tool doesn’t help because it handles only a few of the obvious EXIF data items (like Title, Author, Tags, etc.) and leaves literally hundreds of others untouched (even very obvious ones like the “Taken On” date and the editing application).  Thus, for even this simplest example, the user is forced to turn to a third-party tool like ExifTool, an impressive but somewhat geeky, command-line-driven metadata utility that includes a cleaner.  One could also save the JPEG to a format that doesn’t normally carry EXIF metadata, like BMP or PNG, but get ready for some serious size bloat: BMP is typically stored uncompressed, and PNG’s lossless compression is much less compact for photos than JPEG’s lossy compression.

To “quickly” achieve this, I just gave up and wrote my own (C# source code below; now how’s that for geeky?), but it is only a marginal success because it handles only the text metadata.  When I tried to remove all metadata, I got some troublesome results (the compression was removed, or the changes were ignored because they caused inconsistencies).  This is a worrisome example of how even someone who is actively committed to removing all of this information can be thwarted.  But I figured the text attributes include most of the information that someone might want to scrub anyway (like dates, programs, etc.).
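One possible way around the compression trouble, offered as a sketch and not as the approach this post takes, is to leave the image library out of it and edit the file's bytes directly. A JPEG is a sequence of marker segments, and EXIF normally lives in APP1 segments (marker 0xFFE1), so copying every segment except APP1 leaves the compressed picture data untouched. The class and method names are invented for illustration. The code drops only APP1, so comments (COM) and other application segments would survive, and losing the EXIF orientation tag can make some photos display rotated.

```csharp
using System;
using System.IO;

static class JpegSegmentStripper
{
    // Returns a copy of the JPEG bytes with all APP1 (EXIF/XMP) segments removed.
    // The compressed image data is copied verbatim, so nothing is re-encoded.
    public static byte[] StripApp1(byte[] jpeg)
    {
        if (jpeg.Length < 4 || jpeg[0] != 0xFF || jpeg[1] != 0xD8)
            throw new InvalidDataException("Not a JPEG (missing SOI marker).");

        using (var output = new MemoryStream(jpeg.Length))
        {
            output.Write(jpeg, 0, 2); // SOI marker
            int pos = 2;

            // Walk the header segments until the start-of-scan marker.
            while (pos + 4 <= jpeg.Length)
            {
                if (jpeg[pos] != 0xFF)
                    throw new InvalidDataException("Expected a marker at offset " + pos + ".");

                byte marker = jpeg[pos + 1];
                if (marker == 0xDA) break; // SOS: everything after this is image data

                // Segment length is big-endian and includes its own two bytes.
                int length = (jpeg[pos + 2] << 8) | jpeg[pos + 3];
                if (length < 2 || pos + 2 + length > jpeg.Length)
                    throw new InvalidDataException("Corrupt segment length at offset " + pos + ".");

                if (marker != 0xE1) // keep everything except APP1
                    output.Write(jpeg, pos, 2 + length);

                pos += 2 + length;
            }

            // Copy the scan data (and trailing EOI marker) unchanged.
            output.Write(jpeg, pos, jpeg.Length - pos);
            return output.ToArray();
        }
    }
}
```

Usage would look something like `File.WriteAllBytes("clean.jpg", JpegSegmentStripper.StripApp1(File.ReadAllBytes("photo.jpg")));` with both file names being placeholders. Checking the result with a metadata viewer before publishing is still a good idea.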

So there is one complex data format partially down, thousands more to go.  Privacy really shouldn’t be this hard, folks…

  
// Disclaimer: Use of this code is entirely at your own risk.
// This software is provided "as is" without warranty of any kind.
// C# class to remove text metadata from a JPEG file.
// Note: removing non-text metadata can have the undesired effect of
// altering the compression or other image characteristics.
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Text;

class ExifTextCleanser
{
    // PropertyItem.Type value for a null-terminated ASCII string
    private const short TextPropertyType = 2;

    public static void RemoveImageTextPropertyItems(Image image)
    {
        // PropertyItems builds a fresh array on each call, so removing
        // items while iterating over it is safe
        foreach (PropertyItem pi in image.PropertyItems)
        {
            if (pi.Type == TextPropertyType)
            {
                image.RemovePropertyItem(pi.Id);
            }
        }
    }

    public static void PrintImageTextProperties(Image image)
    {
        Console.WriteLine("properties id count=" + image.PropertyIdList.Length);

        foreach (PropertyItem pi in image.PropertyItems)
        {
            if (pi.Type == TextPropertyType)
            {
                // Values are null-terminated; trim the terminator before printing
                byte[] raw = pi.Value ?? new byte[0];
                string textProperty = Encoding.ASCII.GetString(raw).TrimEnd('\0');
                Console.WriteLine("Property, ID=" + pi.Id + ", value=" + textProperty);
            }
        }
    }

    public static void CleanseJpeg(string originalFileName, string newFileName)
    {
        string extension = Path.GetExtension(originalFileName);
        if (!extension.Equals(".jpg", StringComparison.OrdinalIgnoreCase) &&
            !extension.Equals(".jpeg", StringComparison.OrdinalIgnoreCase))
        {
            Console.WriteLine(originalFileName + " not a JPEG.");
            return;
        }

        // newFileName must differ from originalFileName: the source file
        // stays locked for as long as the Bitmap is open
        using (Bitmap bitmap = new Bitmap(originalFileName))
        {
            PrintImageTextProperties(bitmap);     // take a peek at this metadata info
            RemoveImageTextPropertyItems(bitmap); // then nuke it

            // save the cleansed version of the file, explicitly as a JPEG
            bitmap.Save(newFileName, ImageFormat.Jpeg);
        }
    }
}