My Amazon S3 photo backup solution
Far far away, someone, somewhere said:
“There are two kinds of people, those who back up their data and those who have never lost all their data.”
Luckily for me, I have never lost all of my data, simply because I do backups regularly. I never do a full backup of my machine though. I can download an operating system in a few minutes, restore my system preferences with a single click, install all my frequently used apps using a single command, pull all of my projects from GitHub and listen to music on my Technics SL-1200 or stream it from Apple Music. The only thing that I keep backed up is my photo collection.
My backup strategy in a nutshell
Since May 2007 I have kept all of my photos in a well organized collection, ordered chronologically by year and by session / event. I keep exactly the same habit for all of my pictures taken on my iPhone in parallel. It is not an enormous amount of data (around 200GB) but the sentimental value that it holds is immense.
No matter what, I always store this collection on two physical devices. It can be my computer’s hard drive, an external flash drive, a NAS server or a RAID array. Currently I use two totally average external hard drives by Seagate. I am the happy owner of a superb Sony α7R III that shoots 80 megabyte ARW files. Taking that into consideration, I’ve realised that I may run out of storage on these hard drives very quickly, but for now they do the job.
However, things happen! Disks fail, people rob, rivers flood, comets fall. In case any of that occurs I need one more copy in the cloud. I have tested multiple solutions and services over the past few years and finally I feel that I have found something that is going to stick around. Although making a backup to a local hard drive is fairly easy and straightforward, cloud backups are way more complicated. Luckily I am here to help you out.
What I consider to be a good cloud backup and things that I don’t care about
There are plenty of services that offer cloud storage for amateur and professional photographers. Dropbox, Google Drive, Box, OneDrive, Zoolz or Backblaze just to name a few.
There are a few key things that I need to get out of my cloud backup solution. Security first: I really don’t want anyone to look at the pictures of my beautiful girlfriend. There is a reasonable chance that my collection will grow over time, so auto-scaling and unlimited storage resources are another must-have. New services show up and vanish often and I am really not interested in investing my time in solutions that may not be around tomorrow. Do you remember Copy? Quite a cool service, but it didn’t stay around for long. Price is an obvious factor too, of course.
The providers listed above usually offer tons of things that I simply don’t care about. I don’t need a fancy app with tons of bells and whistles. I don’t need a constant live sync and seamless integration with my OS. It is a last resort backup — the file structure is probably never going to change. I will just add more stuff over time.
I am here today not to compare the available options or convince you to use one over the other. I spent years looking for a solution that suits my needs and I would like to share it with you.
Say hello to AWS Simple Storage Service (S3)
AWS (Amazon Web Services) is a platform that offers a number of things that your business or you, as an individual, may need: from computational power, database storage and content delivery networks to machine learning and IoT (Internet of Things) related products. A storage solution is one of the many services that AWS has to offer. It is well established and proven by the mile-long list of clients like Adobe, Airbnb, Netflix, NASA, SoundCloud, Canon, GoPro… The list goes on and on.
You may have heard the opinion that AWS is complicated to use. In reality it is crazy complicated but to be in a band you don’t have to play all the instruments — just master a single one. Storage is what we need.
AWS has a number of storage solutions in its product list. From simple solutions like Amazon Simple Storage Service (S3) to the AWS Snowmobile — a 45-foot long shipping container pulled by a truck to transfer extremely large amounts of data (up to 100PB). The thing that we need is a container of data stored within an S3 bucket and its seamless transition to the Glacier class using lifecycle policies. Let me explain.
What is S3 and how it works
Amazon S3 is a simple storage solution that offers a range of classes designed for specific use cases. For frequently used, general storage use S3 Standard. Infrequent Access works best for files that you don’t have to access very often but still want available whenever you need them. For archiving purposes, Glacier is the best option. Each of these classes comes with pros and cons and each of them suits different needs. The main differences between them are price and waiting time to access objects (photos in our case). For those who are curious, I would direct you to Marc Trimuschat's presentation from the AWS Summit 2017. Deep Dive on Object Storage tells you everything that you need.
Essentially, files stored in the hot storage (S3 Standard) are accessible immediately but they will cost you a fortune ($0.023 per GB per month). Cold storage (Glacier) on the other hand is extremely cheap ($0.004 per GB per month) but a file restoration can take from 1 minute up to 12 hours. You will be charged for each GB retrieved from the cold storage cluster too. The pricing may vary a bit depending on the region of your S3 “bucket”.
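To put those two prices side by side, here is a minimal sketch of the arithmetic for a collection of roughly 200GB. It only uses the per-GB prices quoted above and ignores request, retrieval and transfer charges; the function and constant names are made up for this illustration.

```python
# Per-GB monthly storage prices quoted in the article (they vary by region).
STANDARD_USD_PER_GB_MONTH = 0.023
GLACIER_USD_PER_GB_MONTH = 0.004


def monthly_storage_cost(size_gb: float, usd_per_gb_month: float) -> float:
    """Return the storage-only cost in USD for keeping size_gb for one month."""
    return size_gb * usd_per_gb_month


if __name__ == "__main__":
    collection_gb = 200  # approximate size of the photo collection
    standard = monthly_storage_cost(collection_gb, STANDARD_USD_PER_GB_MONTH)
    glacier = monthly_storage_cost(collection_gb, GLACIER_USD_PER_GB_MONTH)
    print(f"S3 Standard: ${standard:.2f} per month")
    print(f"Glacier:     ${glacier:.2f} per month")
```

For 200GB this works out to about $4.60 a month in Standard versus about $0.80 in Glacier, before any other charges.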
Privacy of files is something that we can easily control with S3. If you want to make a file public or private, no more than a single click is needed. Lifecycle policies help us to create a set of rules that invisibly migrate files between storage classes. I utilised the power of this feature to migrate all the files imported to the Standard bucket to Glacier the next day.
I mentioned before that AWS is complicated to use, but I hope that this step by step guide can make things easier for you. S3 may actually be one of the easiest services to use among the humongous number of products in the AWS portfolio.
Start by creating a free AWS account. This process requires you to add a credit card to your account and authorise it by a phone call you will receive from Amazon’s bot. It is worth mentioning that you are eligible to use a Free Tier that gives you access to a subset of AWS features totally for free. You can end this process here, but I would strongly suggest looking at the IAM (Identity and Access Management) best practices. Personally I use my “root” account just for billing purposes and to manage users. For using AWS services I created an IAM user with sufficient permissions for my everyday tasks (security first). Read more about the recommended way of using the AWS platform in the AWS Identity and Access Management documentation. The Getting Started with Amazon Web Services webinar is another helpful resource to start with.
“When you first create an AWS account, you begin with a single sign-in identity that has complete access to all AWS services and resources in the account. This identity is called the AWS account root user and is accessed by signing in with the email address and password that you used to create the account. We strongly recommend that you do not use the root user for your everyday tasks, even the administrative ones. Instead, adhere to the best practice of using the root user only to create your first IAM user. Then securely lock away the root user credentials and use them to perform only a few account and service management tasks.”
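As an illustration of what "sufficient permissions" could look like for a backup-only IAM user, here is a minimal sketch that builds a policy document limited to a single bucket. The bucket name `my-photo-backup` and the helper name are hypothetical, and this is not necessarily the policy I use; treat it as a starting point to adapt.

```python
import json


def backup_user_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that only covers one backup bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Uploading and downloading are granted on the objects inside it.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "my-photo-backup" is a placeholder bucket name.
    print(json.dumps(backup_user_policy("my-photo-backup"), indent=2))
```

The printed JSON can be pasted into the IAM policy editor. Restoring archived objects from Glacier would additionally need the `s3:RestoreObject` action.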
Our account is now secure and ready to use, so it is time to create the first storage “bucket” under the S3 section. Use a unique name for your bucket and choose a location of interest. Make a wise decision at this point because you won’t be able to change those details later on. Hit the “Create” button and we are almost set up.
In theory we are ready to use the service now, but there is one thing that helps to automate our workflow a lot. We definitely don’t want to change the storage class (Standard, IA or Glacier) for every file manually. As mentioned before, lifecycle policies can automate the process for us. My aim is to migrate all the files that I put into my Standard S3 bucket to cheap cold storage (Glacier) as soon as possible. To set it up that way, click on the name of the bucket created in the previous step and navigate to Lifecycle rules under the Management tab. Click the “Add lifecycle rule” button to define a new rule. Give your rule a meaningful name and move on to the Transitions section. For the current version of your files, create a rule that moves the file to Glacier after one day. We don’t need to tweak settings for the previous versions because we didn’t enable file versioning in the first place (you don’t need that for backups). Click Next to reach the Expiration tab and keep it as it is (we really don’t want our files to be removed), then proceed to the next tab, Review. Make sure that you are happy with all the settings in the last step and save the rule. We are done!
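If you prefer to keep the rule as code rather than clicking through the console, the same transition can be described as a lifecycle configuration. The sketch below only builds and prints the configuration; the rule name is a placeholder, and the shape follows what boto3's `put_bucket_lifecycle_configuration` expects as far as I know, so check it against the current documentation before relying on it.

```python
import json


def glacier_transition_config(rule_id: str, days: int = 1) -> dict:
    """Build a lifecycle configuration that moves every object to Glacier."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                # An empty prefix applies the rule to every object in the bucket.
                "Filter": {"Prefix": ""},
                # Only a transition is defined; there is no expiration,
                # so nothing is ever deleted by this rule.
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


if __name__ == "__main__":
    # "move-to-glacier" is a placeholder rule name.
    config = glacier_transition_config("move-to-glacier", days=1)
    print(json.dumps(config, indent=2))
    # Assumption: with boto3 installed and credentials configured, this could be
    # applied with something like
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="my-photo-backup", LifecycleConfiguration=config)
```

Keeping the rule in a small script like this makes it easy to recreate the setup on a new bucket.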
GUI or not
Although the S3 web interface is very user friendly and fast, you may be interested in using a GUI (graphical user interface) tool to send files to your bucket. Luckily there are a lot of tools out there that let you access your Simple Storage Service easily. As a macOS user my personal preference is ForkLift 3. Transmit 5 is another app for the Apple system that has garnered a great reputation. Maybe Cyberduck? FileZilla Pro and S3 Browser could be good options for Windows users. Play around with the available options and let me know about your preferred way to interact with S3 objects.
Happy backing up
I am very happy with this solution and it works for me really well. I managed to reduce the cost of my digital backups from £8 per month to less than £1. I have a reliable and secure copy of my files and a great system in place that hopefully is going to serve me for the long term. Let me know about your backup strategy in the comments below. If you have any questions or need some more clarification on anything in this post, I am always keen to help. Happy backing up!
On 27 March 2019, Amazon announced Glacier Deep Archive, an even more cost-effective storage class that perfectly suits my needs.
Nice I'm glad I found this article, as I have been wondering about the viability of using S3 as a backup solution for some time.
I just have some questions for implementation:
1) Do you have any suggestions on the structure of buckets? For example, do you have a separate bucket for each year, or each month in a year, etc.?
2) Is there any benefit to choosing servers geographically closest to you, or should I just choose the cheapest one (N. Virginia)? It would seem that the point is never to access the data unless in an emergency. So the server proximity would be irrelevant?
3) Do you have any trouble uploading large files, such as a video that is several gigabytes? Would aws cli serve to sync large files (`aws s3 sync . s3://whatever`)?
4) Roughly how much data are you storing, is it terabytes? You said 200gb at the start, but is that still true with your a7r3? Is the system scaling to handle this extra data? What kind of upload speed do you have to handle this? I have lots of data ...
Sorry for all the questions!
I am glad that you found it useful. I am more than happy to answer your questions.
I don't have any particular order or complicated file structure. I keep everything in one bucket. This one is split by years, and inside each year's directory I have directories with sessions. Like so:
Totally go for the cheapest one in this case. There is no reason why I picked my local one apart from habit. All my S3 instances are in London ¯\_(ツ)_/¯
Cannot help with this one because I have never tried sending big files like this. Sorry.
I currently have around 250GB. I add new folders every so often and it scales really well for me. My session folders are not huge though, between 3GB and 15GB. It goes really quickly for me. I cannot give you a number but I can give an approximate comparison: a Dropbox upload is a million times slower than this one. Speed of this solution is not a concern for me at all.
I am more than happy to help further if you have more questions. Have a lovely day :)
Excellent article. I was attending a work course on AWS when it got me to thinking about using something like this as Cloud Storage for my RAWs and I came across your article.
As I assign keywords to all my images at time of import into Lightroom - Is it possible to search for a particular image by Keyword in AWS ?
Also I have my images placed in Folders which are contained inside a master folder. is it possible to import the folder structure or is it just the files ?
thanks in advance.
I am glad that my article helped you out. Unfortunately I am not able to answer your questions, for multiple reasons. I use S3 purely for archiving and I have no clue about Lightroom. I am a Capture One user.
I am more than happy to help you with further questions if you have any.
Have a nice day 🥑
Yes, you can create a folder structure. You can also add tags and additional properties to the uploaded objects, but I'm not sure you can use them for searching objects.
Thank you for sharing this. I'm about to do something very similar. At least regarding backing up raw or original photos.
Are prices prorated, or do you get charged for a full month? For example, 1TB would cost $23/month in Standard, and $4/month in Glacier. Do you end up paying $23 for the 1 day your photos sit in Standard storage?
It is hard to understand your question, but I can assure you that currently I store there about 300GB of data and my highest bill from AWS so far was £1.03 for a month. Hopefully this helps you to understand the costing a bit better.
Prorated means that you only pay for the time you use. So when you put files into storage, are you being charged for 1 day of Standard storage, or are you charged for a full month of Standard storage?
The way I set it up is described above. All the files that I upload to standard S3 are kept like that for a month. After that time, they migrate to Glacier. You can customise it though and almost instantly send them to Glacier if this workflow suits you better.
OK, thanks for getting back to me.
Is S3 Glacier the best solution for storing family photos?
When it comes to family photos storage, privacy should be one of the main concerns and Amazon Glacier is fantastic at security as long as it is correctly set up.
This is awesome! Have been meaning to streamline my backup solution for aaaaages and finally I have something simple and robust. Thanks! 🙌
I am glad that you like it. It still works amazingly well for me to this day. Good luck :)
I'm a little confused — why is it that you upload to S3 & then have it transfer to Glacier? Is there some particular reason to do this rather than use something like Arq to transfer directly to Glacier?
I think I understand after looking at it a bit more… it's not possible to directly back up to Glacier… I think!
To be honest I have no clue what Arq is, but my reason for doing so was ease of use and S3 integration with GUI apps that support this protocol. I use an app called ForkLift but there are some others like Transmit (super cool looking but too expensive for my needs).
Arq is just an automated backup tool for macOS, which handles both backing up and restoring from S3/Glacier. After reading up on it more, it's still necessary to back up to S3 and then use a lifecycle policy to migrate it.
Arq is worth checking out :) It would save you the effort of manually using Forklift, and it can do incremental backups hourly for you.
This is interesting and definitely worth looking at for some more automation freaks than me. I am kinda happy with my very manual process. I do it very rarely so automating this would be an overkill in my case. For professional photographers it is a really fantastic tool!
Coming to this discussion after some research on Glacier Deep. One of the things you don't mention is if the pricing for S3 Standard is prorated (a comment below asks, but not necessarily in a clear way). If I understand correctly, if you upload a file to S3 standard, you pay for a month's worth of its storage, regardless of whether you lifecycle it to Glacier 1 day later or 30 days. Has this been your experience? Or are you charged a prorated cost for S3 Standard depending on how long the file "sits" there?
Another point: it is apparently possible to upload directly to Glacier Deep (PUT costs more, but you presumably save the cost of S3 Standard). The flipside is that, apparently, it is command-line only. Have you tried this? etc :)
That was the case for me. I paid for the first period at the Standard storage class rate and then, once the class changed after a period of time, at the Glacier rate.
There is an option to create a Glacier / Deep Archive bucket and put items directly into it. As far as I know it comes with some restrictions though. You cannot use any GUI clients because the S3 protocol doesn't have access to Glacier objects. Back when I was setting this system up it was a big restriction for me. Things may have changed since then though.
Thanks for reading and good luck :)
Do you mean S3 Glacier service (which operates with Vaults and Archives), as opposed to a Glacier-class object in an S3 bucket? If so, you are correct, it doesn't have a GUI client and all interactions are strictly through API calls. Basically, Glacier class storage in S3 works by S3 itself interacting with S3 Glacier via API (https://stackoverflow.com/a....
It means that the S3 Glacier is more specialised service and is probably meant for API integration, rather than direct access by users.
About uploading to Glacier: I can confirm that CloudBerry Backup (which seems to be the most popular S3 Windows backup client and which is, thankfully, free for personal use) can upload directly to any class, including both Glacier and Glacier Deep Archive. I couldn't find any upload settings for S3 Browser, though, as it seems to upload to a bucket default storage class. In any case, uploading directly to Glacier is not limited to command line.
I may be wrong, but in my understanding prorated means that you pay only for the time you store an object in S3. So in Pawel's case he will be billed for 1 day of Standard class storage and the rest will be for Glacier class.
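To make the prorated reading concrete, here is a small sketch of the arithmetic, assuming storage really is billed pro rata by time and using the $0.023 and $0.004 per GB figures from the article. The 30-day month and the function name are simplifications for illustration, not how AWS itemises a bill.

```python
def prorated_cost(size_gb: float, usd_per_gb_month: float,
                  days_stored: float, days_in_month: int = 30) -> float:
    """Storage cost in USD when only the days actually stored are billed."""
    return size_gb * usd_per_gb_month * days_stored / days_in_month


if __name__ == "__main__":
    size_gb = 1000  # the 1TB example from the question above
    one_day_standard = prorated_cost(size_gb, 0.023, days_stored=1)
    rest_in_glacier = prorated_cost(size_gb, 0.004, days_stored=29)
    print(f"1 day in Standard:  ${one_day_standard:.2f}")
    print(f"29 days in Glacier: ${rest_in_glacier:.2f}")
```

Under that assumption, one day of 1TB in Standard costs well under a dollar rather than the full $23.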
Great post! If I found it earlier, it could have saved me some time reading AWS docs and watching lengthy youtube videos :) Still, using IAM properly is something I haven't gotten into yet.
One important thing I would add is AWS has very flexible server-side (and client-side) encryption settings for protecting data at rest. You can either manage keys yourself via using client-side encryption and uploading already encrypted data, or let AWS manage encryption server-side. The latter has at least two options: generate and use your own encryption keys via AWS KMS (Key Management Service), or let AWS manage the encryption automatically in the background without you needing to do anything (https://youtu.be/VC0k-noNwO....
I'm not sure the server-side encryption is on by default though, and I think I had to specifically enable it in bucket settings, but it's there and it's as simple as turning it on. This way your girlfriend's photos are safe from anyone's, even the employee's eyes! :)
Thanks for the helpful article. I was looking into AWS for work and likewise thought maybe should expand my backups to use AWS rather than just a NAS and on site backup. However one point to consider is the fact that I got numerous warnings about how Glacier and Glacier Deep Archive are charged. Not only per request, but also per file, additional storage is required and it is very very hard to work out just what this means as far as $$ is concerned. Also I read each type has a minimum time of storage and you will be charged if you delete earlier than that time, which for GDA is 6 months. I have over 100,000 photos I was going to simply replicate to S3 and then push to GDA. I am now thinking that is a very bad idea and that I should perhaps zip them up or something but that creates additional problems around how many versions do you keep, how often you update, when do you delete, etc. I suggest that is why "simpler" options like Dropbox, Google, Drive, etc are popular as you just buy a bucket of storage and can upload, add, change, etc as much as you like without fear of racking up a large bill. If you know what you are doing and can understand the complicated charging model then perhaps S3 & AWS might work out cheaper.. but it could cost you a lot more as well!
Hi. Thanks for reading.
Pricing so far is working very well for me. I currently store 10+ years of photos taken on my iPhone and professional camera and my monthly bill is never more than £0.45.
Good advice though. It is worth doing the math before investing the time and effort to set up a system like that.
Have a great day 👍👋
Have you done any large volume (exceeding 10GB per month free tier limit) data retrievals so far? If so, roughly how much did it cost you?
I have never done it before and hopefully I will never have to. I use Glacier Deep Archive for deep archive :) This is just a last resort backup option for me.
You will be surprised how costly it would be to get your backup back from S3. All offerings have the transfer out priced per GiB and it is not cheap. E.g., assume you have 1TB in Glacier Deep Archive and a disaster hit, so you need to retrieve that 1TB back -- be prepared to pay at least $120 (1024GiB * (Retrieval cost + Data Transfer out)). I learnt it the hard way for retrieving 5TB :)
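The formula in this comment can be sketched in a few lines. The per-GiB rates below are hypothetical placeholders chosen only to show the shape of the calculation; they are not AWS's published prices, which depend on region and retrieval tier.

```python
def restore_cost(size_gib: float, retrieval_usd_per_gib: float,
                 transfer_out_usd_per_gib: float) -> float:
    """Cost in USD to pull size_gib out: size * (retrieval + transfer out)."""
    return size_gib * (retrieval_usd_per_gib + transfer_out_usd_per_gib)


if __name__ == "__main__":
    # Placeholder rates for illustration only; look up the real ones
    # for your region before estimating a restore.
    retrieval_rate = 0.02
    transfer_out_rate = 0.10
    cost = restore_cost(1024, retrieval_rate, transfer_out_rate)
    print(f"Restoring 1024 GiB: ${cost:.2f}")
```

Whatever the exact rates are, the cost grows linearly with the amount restored, so a full restore is where a cheap archive gets expensive.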
This is a very valid point. As I said before, my fingers are crossed that I will never have to do it 🤞
Does it mean that you don't have any assurance that the backups you are creating are actually good and can be relied upon? :). Personally, I would be uncomfortable keeping a "last resort" backup without at least an integrity check from time to time to ensure that it is not a dud. I think in your case, it is easier -- you can just request a random file retrieval to confirm that that particular file is retrievable. In my case I had a binary encrypted blob of my local disk image, so doing backup verification was a really costly operation.
Thank you for this article.
I started using S3 a few years ago for both syncing across machines and operating systems and archival of documents, photos, music and video. I use it to host two static websites as well. There are easy to use GUIs (Transmit on Mac, Cloudberry Explorer on PC, Cyberduck, Filezilla, others both paid and "free") and the command line is easy to use with a wee bit of learning and patience. (It will be familiar to anyone who has worked with Linux.)
Your reminder to set up users beyond root is timely. I have neglected this but understand its importance. I have never explored Glacier but will look into it. Seems simple enough. One thing I've begun doing is backing up photos and other files directly from Android phone to S3 buckets. Haven't found a way to automate this though.
So called "power users" and IT folks are familiar with AWS, but it's a best kept secret for the rest of us that deserves to be shared.
By the way, Happy New Year!
Thanks for reading. New year is a great opportunity to revisit backup solutions. Happy 2020 to you as well 🎊
Thanks for the info!!!
Just some quick notes as I was reading through the pricing policy. If one considers this an absolute last-resort backup then it's fine, i.e. just dump the data without needing to access it unless a disaster requires that.
1) Retrieval time (for objects to become available for download) for Glacier and Deep Archive is 6/12 hours.
2) Storage consideration: AWS will add a small amount of extra data for each object in Glacier/Deep Archive, something around 32KB. This is not an issue for a small number of objects. It is a good idea to compress and package objects into a single file (per year, for example).
3) Also consider the traffic, especially outbound, as it will be chargeable.
4) Consider using a US region to host your S3 bucket rather than one close to you; the US regions are the cheapest. The latency difference when uploading/downloading to these regions is really negligible as we are not accessing those objects frequently anyway.
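To get a feel for point 2, here is a rough sketch of how much extra billable storage the per-object overhead adds up to. The 32KB figure is the approximate number mentioned above and the 100,000 photos come from an earlier comment; both are treated as assumptions rather than exact AWS values.

```python
KIB_PER_GIB = 1024 * 1024


def overhead_gib(object_count: int, overhead_kib_per_object: float = 32) -> float:
    """Total extra storage in GiB added by a fixed per-object overhead."""
    return object_count * overhead_kib_per_object / KIB_PER_GIB


if __name__ == "__main__":
    loose_photos = overhead_gib(100_000)  # every photo stored as its own object
    yearly_archives = overhead_gib(13)    # e.g. one packaged file per year
    print(f"100,000 objects:    {loose_photos:.2f} GiB of overhead")
    print(f"13 yearly archives: {yearly_archives:.6f} GiB of overhead")
```

For 100,000 separate photos that is about 3 GiB of overhead, which is small next to a collection of a few hundred GB but not nothing, and it is why packaging files together helps.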
Thanks for the great and simple tutorial @pawelgrzybek! You really demystified AWS S3.
Can you go into greater detail on how to secure your S3 bucket? What are the IAM configurations you have in place for securing the photo backup?
Also, I read online that you can enter 0 for the days when creating the transition to Glacier rule, and that way it transfers the objects immediately. Do you have any experience with that?
I am glad you found it helpful.
I kept this bucket private, accessible just to IAM users of my account (literally just myself). I remember doing some copy/pasting from StackOverflow to set these things up back then, but now it is all achievable using the GUI (Permissions > Access control list).
Personally I didn't explore the option to move classes immediately after upload. I am not sure it would be a massive cost saving for me. I do not store enough data for this to make a big difference. It is good to know though :)
Thanks again for reading and have a nice day 🥑
Ah I love it! I also went down this road and never looked back (actually that is a lie, I did try a couple of other options later on to see if they compare, and ended up returning everything to S3 haha).
Thanks for putting this information up here. I also remember being completely overwhelmed at first with the AWS offerings (and back then there was no snowmobile!). Loads of people will find this helpful!
One thing I'd like to add, if/when wanting to take things further is to consider git-annex. It's something I've been using for a long time as well, with S3 as the storage backend, and it provides some things I've never seen anything else achieve elsewhere.
Unfortunately it does add more complexity, as you'd be learning just the basics of git, and then the basics of git-annex on top of all the S3 stuff above (though it does have an automated sync tool which can make a lot of this point and click!). But that is the theme of this post! Huzzah!
Ultimately the tool manages metadata and filenames, and treats them separate from their content. Which maybe sounds small or confusing, and it is at first. But once you get used to it it's huge!
It means you can move files around in their structure, rename them, even copy them, and they'll only be stored in S3 once without alteration.
But most importantly for me, it keeps track of your data locations, and lets you drop and retrieve the content of your files easily.
For example; I can ask it where a photo exists, and it knows that I have a copy of a photo on S3, and another copy on an external HD and another on my laptop.
Because there is a separation of files and their content, I can drop a file I don't need right now (the tool will confirm there is still a copy around at least before allowing it), and what I'll get in the end is essentially an empty file of the same name as a placeholder. So I can still see my files, their structure, and what I have available. But I don't have to have all the "content" of those files around at all times.
Git itself can be stored and hosted cheaply through AWS, so the whole package can live there.
The list of other things you can do is a mile long, but some others of note:
* It can handle encryption, so everything you put into S3 is encrypted if you so choose, even from Amazon itself.
* It can add another layer of metadata on top of your files, managed in the tool itself. Then you can use that as search criteria in other commands, and even dynamically restructure the layout into temporary metadata driven views (think tagging people in photos, then dynamically restructuring the layout so all photos of people are in folders of their name, then switching it back as it was as if nothing happened).
* It can add multiple "backends", which can include S3 but don't have to. You can even store data in different cloud services and access them all from one place if you like, moving them between services as you see fit (they all have free plans up to like 5gb right? :P ).
* Because it uses git as the backbone for syncing all the metadata and location info, you have all the knowledge completely offline. So you could check where a file lives, for instance, on a laptop offline on a beach somewhere, if you so desired. :P
Sorry that became a large ramble, and I ended up removing a number of other features I also use for space, haha.
Back to the regularly scheduled awesome tutorial!
O wow! This comment deserves to be an article by itself. Thanks for sharing. It is pretty amazing how you extended the simple idea that I described in my post.
Agree with your comments. This comment itself is an article.
I'm trying to address this problem of meta data not being passed with uploads. Specifically, the date the photo was taken/created. Do you know of any less complicated solutions that will pass all meta data with photos to S3?
Thanks for this explanation - helped me to setup my own photo backup in AWS
I am glad it helped you out. Have a fab day!
Love it! Thanks for the detailed explanation.
Very useful article! I currently use the external drive+time machine backup for my photo's which go back over 20 years plus but my external drive is suddenly giving me write issues. So, time to think of getting a big bucket in the cloud. Carbonite has been there for ages but most of these solutions use Amazon cloud services anyway so why not just go straight there? Your article is making that possible, thanks!
I am glad that my article helped you out.
This is just what I was looking for. My SSD that had all of my newborn's photos nearly died this past weekend and now I've been tasked with finding a better storage solution. First thought was disks + fireproof safe, but now I'm thinking S3 is better. You make it sound much easier than AWS' documentation! I'll certainly report back if it isn't as great as I hoped.
Great blog! Thanks Pawel for taking the time to post and respond... lots of options and tools to check out!
I am glad that you liked it Craig 🙌