Overview
Teaching: 20 min Exercises: 0 min
Questions
Why and when should we use the cloud?
How do we use the cloud for data science?
Who is / are AWS?
Objectives
Learners will describe advantages and disadvantages of the cloud
Learners will analyze their use-cases for suitability for cloud computing
Learners will log into the AWS console and look around
Cloud computing encompasses a large collection of publicly available services, provided by many different companies, where you can provision computing on machines that are only accessible to you through an intermediated interface (such as a web browser or SSH).
These services range from things like Google Drive or Dropbox, which provide access to storage through a browser, to services that give you access to a bare-metal machine running Linux (“bare metal” means that you get the entire machine to yourself: you are the “single tenant” of this machine).
This contrasts with buying your own desktop or laptop computer, or cluster of machines, or with buying external storage devices (such as a RAID, redundant array of independent disks). It also contrasts with some services that are not publicly accessible, such as institutional clusters, and the XSEDE services, that may also only be accessible to you through an intermediated interface.
You do not wait for compute tasks to go through a queue
Compute can start as soon as you want it
You do not purchase and maintain hardware, operating systems, etc.
Upgrades just happen
You pay only for the resources you use, then shut them off
You don’t have to buy into an institutional cluster if the cost calculation doesn’t make sense for you.
You have huge scale-up potential (reduced processing time)
In principle, you have near-infinite computing capacity.
There is a huge support community rapidly expanding cloud tools and tech
Because of the public availability of these resources, and substantial buy-in from industry, there is a large eco-system of tools and resources.
Storage, reliability, security and many other off-the-shelf services
And they just keep making new stuff.
You don’t have time to learn how to work on the public cloud
There is stuff to learn. That’s what we’re here for! But there will be more to learn after this session is over. If you prefer to learn other things, you might not want to invest your time in learning about the cloud.
You operate your computer(s) at a very high duty cycle (more cost-effective)
If your computer is constantly computing something, the cloud might end up costing you more.
Example data science workflow: acquiring, parsing, and munging data; analyzing it; then building, testing, and validating models
Cloud computing is like a utility: You pay for resources you allocate
The burden of cloud management is on each of us
There are details to learn about managing your work on the public cloud. Without this skill, life can quickly become expensive, for example if you accidentally allocate expensive resources and leave them running. (Cloud instances can be turned off without losing state/progress, and they can be saved as machine images.)
Lemons from lemonade
A good way of getting some bad news is to publish your cloud access credentials on GitHub and then delete them. Git keeps the full history of a repository: someone who is not your friend can look back through your public repository to the version where the key was present, grab that key, and start using your cloud account at your expense.
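One way to lower the odds of this happening is to scan files for anything that looks like a key before you commit them. The sketch below is only a heuristic, and it rests on an assumption: that the key in question is a long-term AWS access key ID, which is a 20-character string starting with `AKIA`. The function name `find_access_key_ids` is ours, and the sample key is the placeholder used in AWS's own documentation. It does not detect the secret key itself. Remember that if a real key has already been pushed, deleting it from the repository is not enough: the key needs to be deactivated and replaced.

```python
import re

# Assumption: long-term AWS access key IDs are 20 characters and start with "AKIA".
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_access_key_ids(text):
    """Return (line_number, match) pairs for strings that look like key IDs."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.findall(line):
            hits.append((number, match))
    return hits


# A made-up credentials file; the key is a documentation placeholder, not a real one.
sample = """[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
region = us-east-2
"""

for number, key in find_access_key_ids(sample):
    # Print only the ends of the match so the scan does not itself leak the key.
    print(f"line {number}: possible access key ID {key[:4]}...{key[-4:]}")
```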
This framework is the vocabulary and relationships we use to describe using the public cloud platform for data-driven research. Here comes the jargon storm! In what follows we assume you are a Researcher focused on data-driven science and that you are interested in adopting the cloud as a way of streamlining that process in some capacity.
As a Researcher you do perfunctory processing and exploratory processing. The cloud can help you with both, but at a cost: The time you invest to learn new methods. We call this cloud adoption and the core premise is that you no longer have a familiar computer with an attached storage system where you log in and do your work. The cloud model is (they like to say) cattle not pets: You have a huge pool of available compute resources and you rent them by the hour. When you are done with them you simply Stop or Terminate them and they go back into the resource pool. Before continuing let’s do a quick cost analysis of what this means: How does a cloud machine compare to a desktop?
A Cloud Under Your Desk
A good desktop might cost $3000, and a very powerful cloud instance will cost about $0.40 (USD) per hour. Let’s say for the sake of argument that they are equivalent in compute power and attached storage. If you work eight-hour days, five days a week, with four weeks of vacation, you use about 1,920 hours a year, so your annual compute cost is roughly $800. Over three years your “cloud under the desk” runs you $2400, but you can make this cheaper if you do not need the compute power, or you can throttle it up when you need a lot. You might also ask: What are the additional tradeoffs and other factors?
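The arithmetic behind these numbers can be written out as a short sketch. The price figures come from the paragraph above; the five-day work week is our assumption, and the constant and function names are just for illustration. The exact results are $768 per year and $2,304 over three years, which the text rounds up to $800 and $2400.

```python
# Figures quoted in the text; the five-day work week is an assumption.
HOURLY_RATE_USD = 0.40       # "very powerful" cloud instance
DESKTOP_PRICE_USD = 3000.0   # "good desktop"
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5            # assumption
WEEKS_PER_YEAR = 52 - 4      # four weeks of vacation


def annual_cloud_cost(rate=HOURLY_RATE_USD):
    """Cost of one instance that runs only during working hours for a year."""
    hours = HOURS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_YEAR
    return hours * rate


yearly = annual_cloud_cost()
three_years = 3 * yearly

print(f"Hours per year: {HOURS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_YEAR}")
print(f"Cloud, 1 year:  ${yearly:,.0f}")
print(f"Cloud, 3 years: ${three_years:,.0f}")
print(f"Desktop, once:  ${DESKTOP_PRICE_USD:,.0f}")
```

Changing `HOURLY_RATE_USD` or the schedule constants is a quick way to try the comparison with your own numbers.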
Compute (= EC2)
Storage (= S3)
Manage = Databases… SQL, Not Only SQL… Data Warehouses… query machinery
Web = Web services, web sites, APIs, Clients, confederation, …
Services = All of the above simplified: Often no Compute involved
Admin is easy to dismiss; but it is always present, even on the cloud
This really depends on the value of your time in relation to your research budget, and on how much of your wall clock time you spend doing computing. It also depends on your team’s capacity to assess and learn cloud tech for your work. This might be very fast - which is what we find in the majority of cases - but if you are getting into sophisticated work, e.g. using a web framework or developing a database, then substantial bootstrapping effort will be required.
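To put a rough number on "how much of your wall clock time you spend doing computing", we can ask at what duty cycle the rented machine starts to cost more than the purchased one. This is a minimal sketch using the $3000 desktop and $0.40 per hour figures quoted earlier. It assumes a three-year machine lifetime and ignores electricity, maintenance, storage charges, and the value of your time, so treat the result as a ballpark only. The function names are ours.

```python
HOURS_PER_YEAR = 365 * 24


def break_even_hours(desktop_price, hourly_rate):
    """Hours of cloud use that cost as much as buying the desktop outright."""
    return desktop_price / hourly_rate


def break_even_duty_cycle(desktop_price, hourly_rate, lifetime_years):
    """Fraction of the machine's lifetime you must be computing to break even."""
    lifetime_hours = lifetime_years * HOURS_PER_YEAR
    return break_even_hours(desktop_price, hourly_rate) / lifetime_hours


hours = break_even_hours(3000.0, 0.40)
duty = break_even_duty_cycle(3000.0, 0.40, lifetime_years=3)

print(f"Break-even usage:      {hours:,.0f} hours")
print(f"Break-even duty cycle: {duty:.0%} of three years")
```

With these inputs the break-even point is about 7,500 hours, or roughly 29% of three years. Computing only during working hours (about 1,920 of 8,760 hours a year) sits below that, while a machine that is busy around the clock sits well above it.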
Google Cloud Platform: Easy interface, cheap computing options ($300 credits, free signup)
Amazon Web Services: LOTS of services, features, most widely used ($200 AWS Educate)
Microsoft Azure: Integrates well with other Microsoft products
Colab: Free compute resources through a Jupyter Notebook interface
MyBinder: Jupyter Notebooks from a GitHub repository
Let’s get concrete. There are several ways to interact with cloud computing resources. Today, we will see how to interact through a web console, through command line interfaces, and through programmatic APIs.
We’ll start with a relatively manual way that is also relatively straightforward: using the AWS web console. To do so, we go to:
http://aws.amazon.com/
To log in, click “My account” at the top right and then “AWS management console”.
The account ID we will use here is “uwescience”. Enter your credentials. You might have to change your password the first time that you log in.
Once you are in, you can select from among several different services through the console. There is a dizzying array of services to choose from, really.
One thing that we will point out first before we even look at any of the services that we can access is that AWS operates several different data centers. You would think that this doesn’t matter, because it’s all in the cloud anyway, but that’s wrong. The location of the data centers matters, because communication between data centers is slow, expensive, and sometimes just impossible. For this reason, you always want to keep an eye on the “region” in which you are operating. For everything that we will do here, we will use the “us-east-2” region (Ohio).
Let’s look at some services that we can use. Let’s start with S3.
Key Points
The cloud provides on-demand access to near-infinite computational resources
Resources need to be carefully managed, because charges are usually tied to how long resources are held
The cloud is great for bursty, high-volume computing, or for some small services you might want to run