
Setting Up Effective Backup Systems for Digital Images

21st September 2015 Digital Asset Management


If your house burnt down right now, and assuming everyone that lives in the house was safe, what would you try and retrieve from your house before the walls came down? The answer, in every case, is your most valuable possession. And that is, of course, your family's photographs. All your other items can be re-bought, but you can't get back the records of your life.

So let's look at how to make sure you never, ever lose your irreplaceable history. More specifically - your digital history. Backing up digital data is much easier and less expensive than making physical copies of everything.

How Big is the Problem?

Let's use some back of the envelope mathematics. Say you're a hobby photographer - you might generate a few hundred photos a month, some family photos, some nice landscapes you took on that weekend away to Croajingolong, etc.

Multiply that out over a few years - say 5 years since you started with digital, using a typical modern camera of 16 to 20 megapixels and shooting RAW - and this works out at roughly 150 GB of images. Not a lot by modern standards really, but add to that a scanned archive of thousands of historical family photographs and the raw data might soon be a full terabyte or so. My personal family archive, more extensive than most because we run a scanning service, is about 1 TB at this point.
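If you want to sanity check this against your own situation, the back of the envelope arithmetic is trivial. The figures below are just illustrative assumptions - plug in your own shooting rate and typical RAW file size:

```python
# Rough, illustrative storage estimate - the numbers are assumptions, not rules.
photos_per_month = 125      # a modest hobbyist shooting rate
raw_size_mb = 20            # typical 16-20 megapixel RAW file
years = 5

total_photos = photos_per_month * 12 * years
total_gb = total_photos * raw_size_mb / 1024
print(f"{total_photos} photos, roughly {total_gb:.0f} GB of RAW data")
# -> 7500 photos, roughly 146 GB of RAW data
```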

If you're a professional photographer, your volumes are higher - often quite a bit higher. You might generate several hundred images each week, so you might well have several terabytes of images; in practice, most professional photographers have approximately 10 TB of genuinely valuable RAW data at this stage.

So, how do we cope with this sort of data? 150GB isn't so hard, but of course many terabytes is much more difficult.

What is Backup, really?

There is only one good definition of backup, and that is multiple copies of your important data in significantly different physical locations, with regular retrieval checks.

No other system is really backup. Things like RAID 5 are not backup, but more on this later. The only true path to data safety is making sure you have identical, checked and complete copies of your data in different locations.

These are the things most likely to cause you data-grief:

  • Device failure (easily the most common issue)
  • Human error (accidental deletion or some sort of editing disaster)
  • Theft
  • Fire


Backup prevents outright loss of your data under any of these circumstances. All backup systems use 'redundancy' to work. Redundancy just means having multiple identical (redundant) copies of your data. If any one copy is destroyed, another copy is suddenly not so redundant and becomes your working copy - which you should immediately back up again, because at that point you're down to just one copy.

Having any data as just one copy puts you at significant risk. Hard drive failure is amazingly common and can be very sudden - from working fine to total, irretrievable failure in the blink of an eye, especially with SSDs.

Data in only one physical location puts you at risk. Even multiple copies spread across your house carry significant risk - perhaps someone steals both your hard drives, or your house burns to the ground - and that definitely happens in Australia! Fire has destroyed the life's work of several major photographers of worldwide importance.

A Good Backup System

A good backup system:

  • Is as near to automatic as possible - the more work it involves, the less likely you are to do it.
  • Will let you get back to work within a few hours of a negative event. The process should allow you to restore everything, or individual files, easily.
  • Is inexpensive - the more it costs the more reluctant you will be to implement it.
  • Will prevent catastrophic, outright loss, under all circumstances.
  • Will ideally prevent any loss at any time.
  • Is regularly checked to make sure you really can restore things from the backup.

So this is what we're trying to achieve - a totally safe, inexpensive, easy to implement system.

Traditional Approaches

The traditional approach is to use backup software to make backup sets. This is a slow, laborious method, and generally requires you to have a single storage system to back up to that is of equal size to the data being stored. Sure, you can do incremental backups, meaning that after the initial long, slow process of backing everything up, it just backs up your recent changes. But trying to restore a file you've accidentally deleted is tedious, and this system realistically requires the storage to be connected locally when making the backup. So typically people keep their backup system right next to their computer - and before you know it, someone has come in and pinched the computer and the backup in one go.

Another approach is to use file synchronisation. So rather than create a big, monolithic backup set, you essentially synchronise one library of files to another. Say your master library of photos on your PC to an external hard drive. This is very effective and easy, but it's also not really backup. The significant issue here is that if you synchronise the system, you will synchronise mistakes as well. E.g. you accidentally delete a folder in your master library, then you synchronise your latest changes to your backup - well, you've just deleted your backup! For this reason, synchronisation needs to be paired with version control and fail-safes.

Essentially, you implement a system where the synchronised backup side:

  • Implements 'journalling' - i.e. keeps older versions of files so you can roll back to a previous version if you realise you've accidentally damaged or deleted a file
  • Implements fail-safes - basically, if more than 1% of your files will be changed, it will stop the process and warn you so you can't accidentally delete at both ends.

Synchronisation with journalling and fail-safes is probably the most reliable method of all, as it's very easy to automate. The backup you make is always better than the backup you don't!
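To make the two ideas concrete, here is a minimal sketch of a one-way synchronisation with journalling and a fail-safe, using only the Python standard library. The folder paths and the 1% threshold are placeholder assumptions, and dedicated tools like SyncBack or rsync do all of this far more robustly - this just shows the mechanisms at work:

```python
import shutil
import sys
from datetime import datetime
from pathlib import Path

SOURCE = Path("/photos/master")     # hypothetical paths - point these at your own folders
BACKUP = Path("/backup/photos")
JOURNAL = BACKUP / "_journal" / datetime.now().strftime("%Y-%m-%d")
FAILSAFE_FRACTION = 0.01            # abort if more than 1% of backed-up files would change

def files_under(root):
    """All files below root, as paths relative to root."""
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}

src_files = files_under(SOURCE)
dst_files = {p for p in files_under(BACKUP) if p.parts[0] != "_journal"}

new_or_changed = {
    p for p in src_files
    if p not in dst_files or (SOURCE / p).stat().st_mtime > (BACKUP / p).stat().st_mtime
}
deleted = dst_files - src_files
changes = len(new_or_changed) + len(deleted)

# Fail-safe: an unusually large number of changes probably means a mistake at the source end.
if dst_files and changes / len(dst_files) > FAILSAFE_FRACTION:
    sys.exit(f"Refusing to sync: {changes} changes exceeds the fail-safe threshold")

# Journalling: move anything about to be overwritten or deleted into a dated
# journal folder instead of destroying it, so it can be rolled back later.
for rel in (new_or_changed & dst_files) | deleted:
    dest = JOURNAL / rel
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(BACKUP / rel), str(dest))

for rel in new_or_changed:
    (BACKUP / rel).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(SOURCE / rel, BACKUP / rel)
```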

A Better Approach

We're going to divide the solutions into two parts:

1. Prevent Outright Catastrophic Loss

2. Prevent Any Loss

Step 1 will prevent the truly terrible from ever happening. We're going to make it possible to safely store and retrieve every image we ever produce, simply and cheaply. We do this by reducing the problem and making sure only absolutely key data, the stuff we really can't live without, is very very thoroughly backed up.

Step 2 is considerably harder and more expensive, and involves protecting all of your data.

Preventing Catastrophic Loss

Preventing catastrophic loss is the main goal. Losing one precious image is bad, but losing your life's work or your entire family history is truly awful. So the most important backup system to implement is one that prevents outright loss of everything.

To prevent this, we look at two major technologies. One which allows us to reduce the size of the problem, and the other that allows us to not worry about backup hardware and geography. These two technologies are Compression and The Cloud.

Compression

Compression of images is a long, long established practice. Mathematical algorithms are used to reduce the data required to store an image. The most basic method of compression is Run Length Encoding. Rather than store every single pixel individually, we store groups of like pixels instead (e.g. instead of 15 black pixels in a row, so 0 0 0 0 ... 0, requiring 15 values, we store 15*0 instead). There are many variations on this theme; Run Length Encoding itself is lossless, while more sophisticated schemes like JPEG reduce the data size far more dramatically, at the price of a very slight loss in quality.
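As a toy illustration of the idea (not how real image formats actually implement it), a few lines of Python are enough to run length encode a row of pixel values:

```python
# Toy Run Length Encoding: store "how many of what" rather than every value individually.
def rle_encode(pixels):
    runs = []
    count = 1
    for prev, current in zip(pixels, pixels[1:]):
        if current == prev:
            count += 1
        else:
            runs.append((count, prev))
            count = 1
    runs.append((count, pixels[-1]))
    return runs

row = [0] * 15 + [255] * 3      # 15 black pixels followed by 3 white ones
print(rle_encode(row))          # [(15, 0), (3, 255)] - 2 runs instead of 18 values
```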

The reality is, even when you shoot RAW, the product of your photographic and editing efforts is ultimately a single, rendered image. And images tend to compress very well indeed, with little to no visible quality loss if you use conservative lossy compression levels. A typical example would be a 120 MB edited 16 bit TIFF file, which is about what you need for a 300 DPI A3 print. If you try to store all of these, you'll soon have a huge pool of gigabytes or even terabytes of data. And of course, we store these where we can. But image compression - which is basically converting your image to a high quality JPEG - will massively reduce the file size. Typically a 120 MB TIFF can be reduced to 4 or 5 MB as a JPEG.
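In practice you'd normally just do this conversion from Lightroom or Photoshop's export dialog, but if you'd like to script it, a rough sketch using the Pillow library (an assumption - install it with pip, and note Pillow's 16 bit TIFF support is limited) might look like this; the folder names are placeholders:

```python
from pathlib import Path
from PIL import Image   # pip install Pillow

SOURCE = Path("/photos/masters")        # hypothetical folder of edited TIFFs
ARCHIVE = Path("/photos/archive_jpg")
ARCHIVE.mkdir(parents=True, exist_ok=True)

for tiff in SOURCE.glob("*.tif"):
    with Image.open(tiff) as im:
        # JPEG is 8 bit RGB only, so convert first (this does no colour management)
        im.convert("RGB").save(ARCHIVE / (tiff.stem + ".jpg"), "JPEG", quality=92)
```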

Your first concern about compression will likely be quality loss, and yes JPEG is a lossy compression technique. But just like a high quality MP3 file, in most circumstances this quality loss will be imperceptible. And certainly in the context of a disaster, complete retrieval of even slightly compressed data will be a fantastic result.

As an exercise, try this - take one of your images, save one copy as a TIFF and another as a quality level 12 JPEG. Now open both in Photoshop, lay one as a layer over the other, and set the blending mode to Difference. You will very quickly see that the tiny amount of change to the image is of no real significance.
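If you'd rather run the same check outside Photoshop, here's a rough equivalent using Pillow (again an assumption, and the filenames are placeholders):

```python
from PIL import Image, ImageChops   # pip install Pillow

original = Image.open("image.tif").convert("RGB")
compressed = Image.open("image.jpg").convert("RGB")

diff = ImageChops.difference(original, compressed)
# getextrema() gives (min, max) per channel; the largest max is the worst deviation out of 255
print("largest per-channel difference:", max(high for _low, high in diff.getextrema()))
diff.save("difference.png")   # mostly black = almost no visible change
```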

Here's an example with a complex, detailed image. The differences are shown in the second image - black means identical, other tones mean some change. It's clear the difference is negligible in practical terms and certainly of no great concern in a disaster recovery scenario versus the potential of losing everything.

Acedia by Jeremy Geddes
Blending mode difference example

So, compression helps us vastly reduce the size of the problem, by a factor of 10 to 20 times or more. And it is this reduction in size that makes the second technology feasible.

The Cloud

The Cloud is really just a fancy name for a massive bunch of server computers distributed around the world in various data centers. The technical implementations of The Cloud are irrelevant to the end user, what matters is the result. Basically, a readily available and universally accessible pool of inexpensive, safe storage that is completely managed by someone else.

Cloud storage is implemented by the really major IT companies - Google, Amazon, Microsoft etc. - people who know computing backwards and know how to build massively redundant, geographically distributed storage systems. And you simply buy some space on these systems. The offerings are all much the same, and much the same price, now; it's really just a very pedestrian service.

All you then need is a relatively good internet connection with decent upload speeds. Add to that a cloud account, and you simply upload your compressed files to your cloud storage. Even over typical home ADSL this system is quite feasible - you just leave it running a few hours a week for the uploading.
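To get a feel for the numbers, here's some illustrative arithmetic - the shooting rate, file size and sustained upload speed below are assumptions, so substitute your own:

```python
jpegs_per_month = 200     # assumed export rate
jpeg_size_mb = 5          # typical high quality JPEG from a ~120 MB TIFF
upload_mbps = 1.0         # sustained ADSL upload speed, in megabits per second

total_megabits = jpegs_per_month * jpeg_size_mb * 8
hours = total_megabits / upload_mbps / 3600
print(f"~{hours:.1f} hours of uploading per month")   # ~2.2 hours
```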

My favourite cloud services are:

Zenfolio

(PremiumAU Account - $160/yr)

Zenfolio offers unlimited photo storage. You can store as many images as you like on this service for just $160 a year. And you can use it as an amazing, easy and beautiful private or public gallery & print sales system as well. It's really amazing value and I think the ideal service for most photographers. Good mobile/tablet clients too, which means my photo library can be with me wherever I go.

Dropbox

(100 GB account with 'PackRat' extension - $139/yr)

Dropbox is a general cloud storage system. I add the 'PackRat' extension, which makes it a journalling system, meaning if I accidentally screw up or delete a file, I can get it back later. I store all my important stuff in here - photos of course, but also documents, records, my Lightroom catalogue etc. It's super handy.

Amazon Glacier

($0.012 per GB / month)

Amazon Glacier is the cheapest for really large volumes of data, but it does take many hours to retrieve files. Basically you upload your files to this service and they are stored away in deep storage. If you need to retrieve them, your files are re-loaded and several hours later you can start getting them back. It's not as convenient, but very, very cheap.

With any of these systems, you simply upload your compressed files to them, either by dropping them in a particular folder, or by using plugins with your imaging apps, and they take care of the rest. I'll describe a particular work-flow below.

The cloud is essentially risk free, although it is not unheard of for a single cloud based storage company to disappear. For this reason, I recommend you stick with big, established names in the area - and it's now quite affordable to even use two clouds.

Preventing Loss of ALL your Data

This is harder to achieve. Here, we want to store everything - all the raw data in full, non-compressed form. Our RAW files plus any resulting TIFFs etc. with all their layers and so on, post editing. This can mean we soon start to need a lot of storage space.

I have written about my preferred technique - the Ultimate Backup System. In short, you have two computers connected by simple internet connections, and overnight these back up to each other. Each computer is just a large box of disks. It's not tremendously hard to set up, but it does take some initial commitment to get going.

For most people, now that 3 and 4 TB drives are readily available and inexpensive, the easiest approach is to just buy a bunch of larger single hard drives and then set up a simple synchronisation backup system with reminders. That is, once a week or so, you get your computer to remind you to back up your recent changes and new files to one of these drives, and you then store this backup off site at work or a friend's house. Of course, once you have more data than fits on a single drive, it becomes more difficult. But usually a sensible division of data - e.g. family photos as one division, client weddings, portrait sessions etc. as others - makes the process quite manageable.

The temptation here, especially as your pool of data grows, is to start looking at expensive and complex multi-disk storage systems. And these have some definite pluses - basically, they join multiple hard drives together to form a much larger pool of storage. It's not hard these days to put together something that can store 48 TB. You then just treat it as a giant hard drive and back up to it as normal.

There are some significant problems with these systems though. Firstly, they are a single device, so they do absolutely nothing to protect you from fire and theft - you really need two of whatever system you are buying, with one stored offsite, for it to constitute a real backup. Secondly, they're expensive. Not only must you buy the unit and the hard drives themselves, you also need extra hard drives for internal redundancy within the unit.

These systems also often tend to be slower than single disk systems, especially if they are network attached. Multi drive systems usually offer some sort of RAID, e.g. RAID 5 or BeyondRAID. These claim to offer you greater protection for your data - e.g. if one of the 5 hard drives in your system fails, just swap in a new one and the system will restore itself. But as any experienced IT person will tell you, this is a recipe for disaster.

For one, RAID systems have notoriously high failure rates - they stress the disks in them far more than normal usage through their constant write strategies, so you really need to at least put expensive server quality hard drives in them. Secondly, RAID systems are absolutely notorious for failing during the 're-sync' operation when you do have to replace a drive. Either another drive fails along the way, or some internal RAID issue occurs and the whole thing goes irretrievably down.

This Synology unit will happily store 48 TB of data for you in one (easily steal-able!) unit

Multi-disk systems encourage poor, all-your-eggs-in-one-volatile-basket backup strategies and should, if possible, be avoided for backup purposes. They are actually better suited to working drive purposes, if you want to have all your files accessible at once in a single pool of storage.

A better multi disk approach is known as 'RAID over any file-system' - AKA RAID-F, similar to traditional parity RAID techniques like RAID 3 and 6. Here, all your data is stored on individual drives as normal, and a separate drive is used to store 'parity' information. This offers the same recovery benefits as traditional RAID, i.e. if a drive in the system fails you replace it and the system can restore everything. The great thing about this is that if any single drive fails and then for some reason the re-sync fails, you lose only that drive's data, as each separate drive in the system is just a regular, normal drive. This means the all too regular catastrophic loss problem with RAID systems simply never happens here. And it also tends to be a faster system as well.
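The parity idea itself is simple enough to demonstrate in a few lines of Python - this toy sketch treats three byte strings as 'drives' and rebuilds a lost one from the survivors plus the parity:

```python
# Parity: the parity drive stores the XOR of the data drives, so any single
# lost drive can be rebuilt by XOR-ing everything that's left.
drive_a = bytes([0x10, 0x22, 0x3C])
drive_b = bytes([0xA5, 0x07, 0x99])
drive_c = bytes([0x4E, 0xF0, 0x11])

parity = bytes(a ^ b ^ c for a, b, c in zip(drive_a, drive_b, drive_c))

# Suppose drive_b dies: rebuild its contents from the other drives plus parity
rebuilt_b = bytes(a ^ c ^ p for a, c, p in zip(drive_a, drive_c, parity))
assert rebuilt_b == drive_b
print("drive_b recovered:", rebuilt_b.hex())
```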

It's generally actually cheaper to set up a parity RAID system like this than to buy the more expensive, but convenient, disk aggregation systems like those from QNAP/Synology/Drobo etc. But it's also more work. The basic approach is to buy a server style case, fill it with hard drives, and then install a Linux parity RAID system on top. It sounds harder than it is in practice, as these things are very modular these days.

General Tips on Traditional Disk Based Backup Systems

If you do choose the pre-packaged multi-drive path, we recommend Synology as a very well proven company in this area, and we strongly recommend against Drobo. We have withdrawn Drobo from sale here because of the ongoing huge proportion of hardware faults, and really quite poor support service.

Also, whether you go single or multi drive, it's always good practice to vary the brands and batches of hard drives you buy. This helps avoid the common problem of batch failure. That is, if you buy 4 drives from one vendor and put them in a RAID system, they all get basically the same wear and tear, so there is an excellent chance that when one drive fails, the others will be right on the point of failure too - which puts your data at serious risk. It's much better to buy a bunch of different ones, as they'll have different failure characteristics.

Finally, remember that for a consumer hard drive, about 4 years of powered use is the most you will get. Hard drives do not last forever, and need regular replacing. Check your backups, and change your disks entirely every few years.

How do we put this into practise? It needs to be as automated as possible or you're not likely to use it. Here's a basic approach:

  1. Copy RAWs from camera to PC to a temporary storage drive as soon as possible after shooting.
    (Leave your images on the camera card while you do all the next steps, as this is in itself already a backup)
  2. Import RAWs into the Lightroom master library via copy, not move. This means there are now two copies on my PC, and still one on the card.
  3. Edit my images as normal - visual edits, tagging etc.
  4. Catastrophic Failure Prevention:
    I then immediately trigger two automated exports directly from Lightroom:
    a) One export is a set of high quality JPGs into my 'fast album' - a local library of full size, high quality JPGs that I feed to all the devices in my house for viewing on TV etc. This folder is a Dropbox-synchronised Photos folder, which means anything I put in it automatically starts syncing to Dropbox in the cloud, and out to other machines like my office PC. Several more copies are made at this point.
    b) Another set of JPGs is sent to my Zenfolio library. This provides another backup and online galleries at the same time. I can share these galleries with friends and family, or clients, and directly offer prints for sale through this same system, meaning I can pretty much decimate a bird population with a single stone.
  5. Full Data Backup: Overnight my RAW files automatically synchronise with my work server using SyncBack SE. This allows me to set up fail-safes like 'don't synchronise if more than 2% of my files will be changed', which prevents large accidental propagation of issues, e.g. if I have accidentally deleted a folder or something. If you don't have a home and work server set up, then this is a good thing to arrange with a friend or colleague. You just set it up so that overnight your files synchronise from your machine to theirs.

    Both my work and home server are just standard PCs loaded with lots of large disks and using a simple parity raid system called FLEXRAID.

    ALTERNATIVELY

    Back up your RAWs to a portable hard drive (or hard drive aggregate unit), and then take this somewhere else - using multiple portable systems so you can cycle them (drop off the newly filled one, bring the older one back for the next fill-up, etc).

The end result is a lot of copies. Basically, I end up with JPG copies on two separate clouds for emergencies, so there is simply no way for me to lose everything even in the event of a major disaster. I also have two copies of all my RAW files, processed TIFFs and Lightroom catalogue using what is, in effect, my own 'mini private cloud'. Once set up, the process is almost entirely automatic. I have never lost a single photo in the 10+ years I have been following roughly this approach.

Each year it gets cheaper, faster, and easier. With any luck, we'll get proper high speed NBN connections at both my office and home, then I will really be able to make this thing fly and won't even need to leave uploads happening over night as I do now. I will then be able to implement a similarly automatic and robust system for video as well, which of course is another order of magnitude more in data terms.