Some of my ELC comrades and I are developing a site that lets charities create online storefronts to sell books, CDs, and other media. A big part of the development process thus far has centered around importing huge product datasets from the project's affiliated media distributors.
We're an AWS-oriented company, with most of our projects hosted on EC2 via RightScale, so the natural choice for media hosting was Amazon's Simple Storage Service, or S3. After a few days' work on the less-than-simple task of migrating millions of product images to our S3 account, I've decided to publish my findings so others might have an easier time of it. The following is a summary of the software, Ruby libraries, and tricks I used to get through it.
Software
JungleDisk
Mac/PC/Linux, $20, http://jungledisk.com/
JungleDisk mounts your S3 account as a virtual drive. It sounds cool, but the implementation is actually quite clunky and, as you'll read below, proprietary and flawed.
JungleDisk has a feature that allows for "fast renaming". As renaming or moving objects (files) on S3 is a technical impossibility, this feature leverages a hidden JungleDisk-owned EC2 instance. Because the EC2 server is within the same Amazonian cloud, it can serve as a faster intermediary storage tank for files en route to their "new name" or "new path".
Unfortunately, the JungleDisk software uses a proprietary method of storing files on S3. When a JungleDisk-created bucket is viewed using more conventional methods, the file structure is obscure and incomprehensible. See the images below.
The mounted virtual disk looks good...
...but this is what's really in the bucket:
Furthermore, JungleDisk cannot recognize S3 buckets that aren't named according to its arbitrary bucket-naming scheme. Boo. Bottom line: JungleDisk is no good for serious developers, and is even of questionable value to a novice performing simple backups, especially considering that it's no longer a free program.
Transmit
Mac, $29.95, http://panic.com/transmit/
A popular (S)FTP client for Mac that recently (version 3.6) added S3 support. I didn't end up using this much because the software imposes a restrictive 'left pane is your stuff, right pane is their stuff' metaphor that is likely helpful for newbies but ultimately frustrating if you want remote file listings ('their stuff') in both panes, e.g. your EC2 filesystem on the left and S3 on the right. Duh.
S3Fox Firefox Add-on
https://addons.mozilla.org/en-US/firefox/addon/3247
S3Fox provides a dual-pane file manager for S3 within Firefox. It seems to work pretty well for managing smaller sets of objects but is ultimately not reliable for doing larger scale transfers or permissions management. I was initially impressed to see the ACL management options, which are actually more functional than some of the standalone software I tried, but upon further inspection it became clear that S3Fox's ACL management is flawed: Though it's possible to set rights on individual objects (files), it appears that S3Fox doesn't really understand the notion of a "directory" within a bucket (who can blame it?), and therefore does not properly handle ACL configuration on "directories". Furthermore, when changing an entire bucket's ACL and clicking "Apply to subfolders", there's no way to know if it's successfully setting permissions on the bucket's contents without hand-checking them. In my case it wasn't, so it wasn't of much use to me.
One bonus: S3Fox provides a handy 'Copy URL to Clipboard' context menu item that produces the fully qualified web address of an object, i.e., s3.amazonaws.com/bucket_xyz/path/to/file.jpg
Forklift
Mac, $29.95, http://binarynights.com/
Forklift is awesome! It's a relatively new dual-pane file manager for Mac that supports lots of protocols, S3 included. Unlike Transmit, it allows you to view whatever you want in either pane. It handles large file listings well, and it has the most functional ACL management of any of the software I tried out.
This actually works!
Forklift also boasts a highly functional and customizable keyboard shortcut system, allowing for more nimble file management than what can typically be accomplished in a GUI.
My only real complaints about Forklift are that it doesn't provide options for renaming or moving files on S3, and that moving a file from EC2 to S3 or vice versa means downloading the file to your local machine and re-uploading it. It would be great to see a future version of Forklift leveraging a behind-the-scenes EC2 instance for quick file renaming and moving (like JungleDisk), or at least direct transfers between S3 and EC2 filesystems. Forklift is allegedly the first file manager for Mac that supports FXP, so who knows, they might do something similar for AWS stuff.
Programmatic Approaches
Ruby Libraries
After playing around with various software options with little luck, I decided it was time to move to a more programmatic approach. There are quite a few Ruby libraries out there for interfacing with S3, but in the end I chose Marcel Molina's AWS::S3. Unlike every other library I tried before finding it, it's current, thorough, well documented, and it actually works!
Other Interesting Options
FuseOverAmazon is a FUSE filesystem that allows you to mount an Amazon S3 bucket as a local filesystem. It stores files "natively" in S3 (i.e., you should be able to use other programs to access the same files).
s3-bash is a small collection of BASH scripts for PUTting, GETting and DELETEing files. I played with it a bit and it works but it doesn't seem to have any ACL support.
Gotchas / Caveats / Pitfalls
Think Sustainably
Be sure to choose a sustainable directory structure and naming convention for your media before you upload it to S3. Once you've created a bucket you can't rename it. You can't rename files. You can't move files. You can only copy them, so choose wisely.
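To make the cost of a bad naming decision concrete: a "rename" on S3 amounts to copying the object to its new key and then deleting the original. Here's a minimal sketch of that sequence. Note the assumptions: a plain Ruby Hash stands in for a bucket, `rename_object` is a hypothetical helper name, and nothing here talks to S3 or uses any S3 library's API.

```ruby
# A Hash stands in for an S3 bucket: keys are object names, values are data.
# This only models the copy-then-delete sequence; it does not contact S3.
def rename_object(bucket, old_key, new_key)
  raise KeyError, "no such object: #{old_key}" unless bucket.key?(old_key)
  raise ArgumentError, "#{new_key} already exists" if bucket.key?(new_key)

  bucket[new_key] = bucket[old_key] # step 1: copy to the new key
  bucket.delete(old_key)            # step 2: delete the original
  new_key
end

# Hypothetical example keys
bucket = { 'covers/old_name.jpg' => 'image bytes' }
rename_object(bucket, 'covers/old_name.jpg', 'covers/new_name.jpg')
```

Against the real service, each of those two steps is a separate request per object, which adds up quickly across millions of files.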
Consider Access Control before Uploading
By default, your objects (files) will not be readable by anyone but their owner. If you want files to be viewable to the public, consider specifying permissions as you upload the files rather than after, especially if you're dealing with a large set of files. The following is an excerpt from the AWS documentation on Access Control Lists (ACLs):
Bucket and object ACLs are completely independent; an object does not inherit the ACL from its bucket. For example, if you create a bucket and grant write access to another user, you will not be able to access the user’s objects unless the user explicitly grants access. This also applies if you grant anonymous write access to a bucket. Only the user “anonymous” will be able to access objects the user created unless permission is explicitly granted to the bucket owner.
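At the REST level, S3 lets you attach a "canned" ACL to an upload via the `x-amz-acl` request header, which is what makes it possible to set permissions as you upload rather than after. The following is a rough sketch of building those headers. `upload_headers` and `CANNED_ACLS` are hypothetical names of my own, the list covers only a few of the canned ACL values, and the code just builds a Hash without sending a request.

```ruby
# A few of the canned ACL names accepted by S3's x-amz-acl request header.
CANNED_ACLS = %w[private public-read public-read-write authenticated-read].freeze

# Build the headers for a PUT that sets the object's ACL at upload time.
def upload_headers(content_type, acl = 'private')
  unless CANNED_ACLS.include?(acl)
    raise ArgumentError, "unknown canned ACL: #{acl}"
  end
  { 'Content-Type' => content_type, 'x-amz-acl' => acl }
end

# Headers for a publicly readable product image
upload_headers('image/jpeg', 'public-read')
```

Defaulting to 'private' mirrors S3's own default, so public access is always an explicit choice.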
Atomize your Directories
Most computers start to get pissed off when you've got more than a few thousand files in a directory. If you're working with huge numbers of files, creating subfolders with smaller file sets will help keep your directories from getting unwieldy and your software from crashing.
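One way to atomize is to derive the subfolders from a hash of the filename. This is a minimal sketch of that idea, not the scheme we actually used. `sharded_key` is a hypothetical helper, and the choice of MD5 with two levels of two hex characters is an assumption for illustration.

```ruby
require 'digest'

# Spread files across nested subfolders named after the leading hex
# characters of the MD5 digest of the filename.
def sharded_key(filename, levels = 2, width = 2)
  digest = Digest::MD5.hexdigest(filename)
  parts = (0...levels).map { |i| digest[i * width, width] }
  (parts + [filename]).join('/')
end

# Returns something of the form "xx/yy/cover_12345.jpg", where xx and yy
# are hex pairs taken from the digest (hypothetical filename).
sharded_key('cover_12345.jpg')
```

Two levels of two hex characters give 256 × 256 = 65,536 leaf folders, so a set of, say, five million files averages roughly 76 files per folder. The trade-off is that hashed paths aren't browsable by eye; sharding on the leading characters of a product ID is an alternative if that matters.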
Conclusion
Forklift, S3Fox, and Transmit are easy, functional tools for small-scale management of an S3 account, but if you're doing serious development and moving a lot of data around, you'll probably need to write a custom solution. For more Ruby/Rails-centric S3 material, see my S3-related del.icio.us bookmarks, or browse the projects and tools labeled 's3' on Google Code.
..and good luck!