S3 is Amazon’s object storage service: a durable, scalable, fast, and secure storage system, made highly available through a web interface so you and I can upload and download any amount of data from anywhere in the world.
That’s a lot of marketing-speak. Let’s break this down.
What is Object Storage?
Object storage treats every file as a self-contained object rather than as a file inside a filesystem hierarchy.
A filesystem needs an operating system to manage it. Object storage doesn’t.
S3 is like that hard drive you have on the cloud where you don’t need to worry about formatting it with a filesystem like FAT (FAT12, FAT16, FAT32), exFAT, NTFS, HFS and HFS+, HPFS, APFS, UFS, ext2, ext3, ext4, XFS, btrfs, ISO 9660, Files-11, Veritas File System, VMFS, ZFS, ReiserFS and UDF.
Durability
Every time you upload something to S3, it gets replicated to multiple drives across multiple Availability Zones. That way it’s nearly impossible to lose your data.
Who cares if one of the hard drives crashes; there are multiple copies of your data sitting on multiple other physical drives!
AWS designs S3 for 99.999999999% durability, the famous “eleven nines”, which means the chance that you’d actually lose a file is about 0.000000001% per year. That math comes out to losing 1 object in every 100 billion objects stored.
With a Seagate hard drive, I have lost data to hard drive crashes at a ratio of 1 in every 2! I’d say 1 in 100 billion is progress!
Availability
S3 is highly available: AWS designs it for 99.99% availability and backs that with a service-level agreement. AWS, GCP, Azure and the other cloud providers are big on these 9s. They take pride in how many 9s they can add to the percentage!
We are still living in an era where we’re used to seeing messages like — “our servers will be down from so and so time on such and such date for maintenance, you might experience connection problems during the time”.
We’d, of course, like everything to be up 100% of the time. On the way from scheduled-maintenance downtime to that ideal, I think 99.99% is a pretty good transition!
Scalability
S3 is scalable which means the hard drive is like the big beanstalk that Jack climbed. It grows with your data! No more running out of hard disk space!
S3 allows you to store a virtually unlimited amount of data.
It’s unlimited because AWS is buying up hard drives and adding to its arsenal faster than people on AWS can fill up the existing hard drives on the platform. As long as AWS can keep a step ahead of us, to us the storage will just seem unlimited.
Also, the cloud is shared. If both you and your friend bought physical hard drives, and your friend has space left on their drive, you don’t get to use that space because it’s not yours and it’s not connected to your computer! But, on the cloud, if someone stops needing a lot of storage, that storage is freed up for others to use.
The size limit per file is 5TB!
The number of buckets you can create in your AWS account is capped at 100 by default. If you need more buckets, you can contact AWS to increase this limit.
Security
Your data can be encrypted in S3 using Amazon-provided encryption tools. AWS also provides access-management tools that let you restrict who can reach your data. You can monitor for irregular access requests via other AWS services that tie into S3, like Amazon Macie, and you can enable audit logging of every access request made against your data in S3.
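To make that concrete, here is a minimal sketch using boto3, the AWS SDK for Python. The bucket name my-example-bucket is just a placeholder; the two calls turn on default server-side encryption and block all public access for that bucket.

import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical bucket name

# Encrypt every new object at rest by default (SSE-S3, AES-256).
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Block every form of public access to the bucket.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)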
Compliance
S3 maintains compliance programs, such as PCI-DSS, HIPAA/HITECH, FedRAMP, EU Data Protection Directive, and FISMA, to help you meet regulatory requirements.
Try getting any of the above from a hard drive. It just doesn’t exist. This is why S3 is a paradigm shift.
There was a time when people paid $5000 for a CD-RW drive to be able to read and store data on CD-ROMs. Then thumb drives/pen drives started coming into the market. They were a bit expensive at first but quickly dropped in price. People who paid 5 grand for their CD writers held onto them for a long time…longer than they should have! Today we can pick up a thumb drive with 100x the storage capacity of those compact discs for next to nothing. That was a paradigm shift we saw in our lifetime. The shift to the cloud is the next one we get to see in the same lifetime!
Pay-as-you-go
How many times have you bought a 2TB hard drive and still have 100GB left on it?
With hard drives, we always paid for the storage space upfront, regardless of how much of it we’d actually use. With S3, you only get billed for the space you’re actually using, metered continuously as you go!
Having said that, S3 is not cheap if you just need space for personal storage. For that use a service like Google Drive or Dropbox. S3 is meant for storing data that otherwise would be a part of your backend infrastructure, and would have cost you an arm and a leg to maintain a data center yourself. I talk about some of the fundamentals of cloud computing in this article.
The cost varies depending on many factors. It’ll be different if you frequently access your data versus if you just want to archive it. It’ll be different if you access it from within the same region it was created in or from outside that region.
If you need to copy data over to another region, that’s called network egress and has a slightly different pricing structure. If you’re running a compute service on the data, and the data needs to be moved across regions, that’s network egress too, even though you may not have directly initiated the transfer. Pricing also depends on the amount of data you need to store. The price goes down slightly once you’re storing more than 50TB.
Just so we can look at some numbers let’s say you want to use AWS like a 512 GB hard drive and use it frequently.
Frequently accessed data is best stored in standard S3 instead of S3-IA, which stands for Infrequent Access and costs less to store (but charges a fee for retrieval).
It’d cost 2.3 cents per GB per month, so maintaining a frequently used 512 GB drive on the AWS cloud would cost you 512 * 0.023 = 11 dollars and 78 cents per month. Google Cloud Storage is similar: Google currently charges 2.6 cents per GB per month, so that same 512 GB drive on GCP (Google Cloud Platform) will run you about 13 dollars and 31 cents per month.
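If you want to sanity-check those numbers, the back-of-the-envelope math is a couple of lines of Python (the per-GB rates are the ones quoted above and will drift over time):

size_gb = 512
aws_rate = 0.023   # S3 Standard, dollars per GB per month (rate quoted above)
gcp_rate = 0.026   # Google Cloud Storage, dollars per GB per month (rate quoted above)

print(f"AWS S3: ${size_gb * aws_rate:.2f} per month")   # AWS S3: $11.78 per month
print(f"GCP:    ${size_gb * gcp_rate:.2f} per month")   # GCP:    $13.31 per month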
As I mentioned above, for personal data storage use something like Google Drive, which costs 99 cents a month for 1TB. Of course, you can’t run an API compute service on that data or tie it to any data analytics or anything other than just storing it.
You can find out more about AWS S3 pricing here.
Use Cases
Storing your users’ data as resources for your API is the most common use case. But there are many other ways that S3 can become immediately useful.
You can use S3 to host your static website! If you create a website and store it on your local hard drive, no one will see it, because your computer is not a server (unless you turn it into one and never shut it off, and even then it’ll just be a baby server that can probably handle a few users and that’s about it).
Once you upload your website to S3, it’s on AWS servers that are available worldwide. You can take advantage of that and make your website available worldwide. And since S3 scales as part of the AWS family of services, handling more users as your numbers grow is seamless!
The only catch is that websites on S3 can’t be dynamic. Users can’t interact with the site and expect different things to show on the page based on their interaction. Everybody gets the same data and the same experience — static site.
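To sketch what hosting a static site looks like with boto3 (the bucket and file names here are placeholders, and you’d still need to allow public reads on the bucket, which I’m skipping):

import boto3

s3 = boto3.client("s3")
bucket = "my-website-bucket"  # hypothetical bucket name

# Upload the site's pages, marking them as HTML so browsers render them.
s3.upload_file("index.html", bucket, "index.html",
               ExtraArgs={"ContentType": "text/html"})
s3.upload_file("error.html", bucket, "error.html",
               ExtraArgs={"ContentType": "text/html"})

# Tell S3 to serve the bucket as a static website.
s3.put_bucket_website(
    Bucket=bucket,
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)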
If you need to host a dynamic site like an e-commerce store, AWS provides services like EC2, where you can provision a server of your own and build a modern website with all the bells and whistles.
A number of AWS services tie in with S3. Services like Macie monitor sensitive data access, and Big Data and analytics services can take their input from the data stored in S3. You can use it with AWS Lake Formation to build a data lake that feeds analytics and machine learning.
The data on S3 can also be archived: data that you know you don’t need frequently but still have to keep, like corporate documents, can be pushed out to S3 Glacier, which is even cheaper than S3 but can take longer to get the data back if you do need it.
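Archiving doesn’t have to be a manual chore, either. Here’s a minimal boto3 sketch of a lifecycle rule (bucket name and prefix are placeholders) that moves objects under an archive/ prefix to Glacier after 90 days:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-documents",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},  # only objects under archive/
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)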
There are many ways of leveraging the data on S3. These are just some of the common ones. Hop on to AWS S3 and browse through their docs.
Buckets
Your data, whether pictures, videos, or any other files, are all stored as objects. The objects are stored in Buckets. A Bucket is like a folder on your hard drive.
You create a Bucket and configure its access settings, like private, public, etc. You can also assign access rights that control who in your AWS environment can use it. For example, if you have your company’s backend running on AWS, then you probably have different groups of users like developers, IT, clients, etc. Not everyone should have access to all of your data.
You can configure and modify all that in a Bucket’s settings at any time. This is called Access Control Information and it is a sub-resource.
(A subresource is just another resource that is tied to a parent resource, and its lifecycle depends on the lifecycle of the parent. We’ll talk about resources in a second.)
You upload all your files to that bucket through a web interface. You get a URL for your Bucket. If your bucket is public, you can access your files from anywhere in the world just by typing in the URL on a web browser. If your bucket is private, you can still access it from anywhere in the world, you’ll just need to log in to your account.
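You can also do all of this programmatically. Here’s a minimal boto3 sketch (bucket and file names are placeholders) that uploads a file and then generates a temporary, pre-signed link so someone without an AWS login can download it from a private bucket:

import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical bucket name

# Upload a local file as an object in the bucket.
s3.upload_file("vacation.jpg", bucket, "photos/vacation.jpg")

# Create a link to the private object that expires in one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "photos/vacation.jpg"},
    ExpiresIn=3600,
)
print(url)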
The URL for a bucket comes in two different flavors: a path-style URL and a virtual-hosted-style URL.
A path style URL has the following format
http://s3-region-name.amazonaws.com/bucket-name
e.g. if your bucket is called myBucket and you created it in Paris
http://s3-eu-west-3.amazonaws.com/myBucket
A virtual-hosted-style URL for the same bucket would look like this:
http://myBucket.s3-eu-west-3.amazonaws.com
i.e. http://bucket-name.s3-region-name.amazonaws.com
In a virtual-hosted-style URL, the region name is optional, so you can also use:
http://myBucket.s3.amazonaws.com
Every bucket has a globally unique name. Once you’ve created a bucket with a particular name, no one else in the world can use that name. So if someone is already using a name you want for your bucket, well, you’ll need to come up with a different one. This is done so that buckets can be accessed globally from anywhere with a link, for the same reason two websites can’t have the same domain name: if duplicates were allowed, the DNS name servers wouldn’t be able to resolve them.
When you’re designing your AWS environment, come up with a naming scheme for your buckets that is unique enough for your project. That way you won’t spend a lot of time chasing dead ends.
ARN
Working on AWS, you’re going to bump into this term often. Buckets and objects are resources on AWS. In a nutshell, every bucket you create and every object you upload to AWS is a resource that can be requested via a REST API. Amazon assigns a unique name to identify each of these resources. This is the Amazon Resource Name, a.k.a. the ARN.
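For example, a bucket named myBucket has the ARN arn:aws:s3:::myBucket, and an object inside it would have an ARN like arn:aws:s3:::myBucket/photos/cat.jpg (the bucket and key here are just illustrations).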
Regions
Sometimes you create a bucket, and when you log in the next day you don’t see it. This is because Buckets are region-specific.
When you log into your AWS account, you get to pick which region you want to operate out of: regions like N. Virginia (us-east-1), California (us-west-1), London (eu-west-2), etc. When you pick your region, the resources you create get created on the servers in that region.
Some resources, like access rights policies, are globally available, but buckets are region-specific. If you create a Bucket in N. Virginia (us-east-1), that bucket is not available when you switch your region to Mumbai (ap-south-1). When you log in, AWS drops you into your default region, which can be different from the region you created your resources in. So if you don’t see your bucket, check the region.
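If you’re creating buckets from code, you pin them to a region explicitly. A minimal boto3 sketch (the bucket name is a placeholder and must be globally unique):

import boto3

# Create the client against a specific region, e.g. Paris.
s3 = boto3.client("s3", region_name="eu-west-3")

# Outside us-east-1, you must say which region the bucket should live in.
s3.create_bucket(
    Bucket="my-example-bucket",  # hypothetical name; must be globally unique
    CreateBucketConfiguration={"LocationConstraint": "eu-west-3"},
)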
Choose your region based on where you are, or where you’re going to access the bucket from the most. If you’re in California and need to access your files from California, don’t create a bucket in Sydney (ap-southeast-2). The reason is latency: if you store your stuff in Sydney, your files are physically sitting on a server there, and every request has to cross the Pacific and back.
Depending on the sensitivity of your data, there may be other data compliance regulatory requirements that you’ll have to keep in mind when you’re using S3. Some data are not allowed to leave the country and must be restricted to a particular geographic location.