AWS Athena is a new, serverless technology enabling users to query S3 data interactively. This course teaches you everything you need to use Athena, including access configuration, schema definition, querying, and performance and cost optimization.
Ever wish you could query data without needing to provision, manage, and configure infrastructure and software? Enter AWS Athena, a scalable, serverless, and interactive query service newly provided by Amazon Web Services. In this course, Getting Started with AWS Athena, you'll learn how to use Athena to perform ad hoc analysis of data in the Amazon cloud. First, you'll explore how to set up user access and define schemas that point to your S3 data. Next, you'll discover how to query information using SQL in a few simple steps. Finally, you'll delve into how Athena works behind the scenes and understand the best practices that drive Athena cost and performance optimization. By the end of this course, you'll have the skills and knowledge necessary to start implementing solutions with AWS Athena on your own datasets in your own AWS environments.
Aaron is a software developer and consultant at Morrison Consulting LLC, specializing in Amazon Web Services (AWS). His interests in the cloud space include automation, serverless computing, and security.
Course Overview
Hey everyone, my name is Aaron Medacco, and welcome to my course, Getting Started with AWS Athena. I'm a software developer and cloud architect specializing in Amazon Web Services. Have you ever wanted to query data without configuring and managing infrastructure? What about without requiring ETL jobs? Then you've come to the right place. Athena allows you to query data directly in S3 without managing instances or clusters, and there's no need to transform data. In this course, I'll show you everything you need to use Athena and perform ad hoc analysis of data in the Amazon cloud. The major topics we'll cover include uploading files to S3, configuring user access to Athena in an AWS account, defining schemas we can use to query our S3 data from Athena, querying our data using standard SQL, and cost and performance optimization using best practices. By the end of this course, you'll have mastered the skills required to perform and implement ad hoc analysis of data using Athena. Before beginning the course, you should be comfortable with using Amazon Web Services at a basic level, and familiar with writing SQL. I hope you'll join me on this journey to learn how to implement solutions using Athena, with the Getting Started with AWS Athena course, here at Pluralsight.
Establishing Access to Your Data
We've talked about Athena at a high level, and by now we know that our data needs to be stored in Amazon S3 before we can use Athena. We also need to validate access. Does the user have permission to use Athena? Does the user have permission to read files in S3? We'll need to ensure these concerns are taken care of before we go any further. In this module, we'll start by introducing the scenario data for Wired Brain Coffee and their third-party gift card vendor. We'll review the different methods for granting permission and securing access, both to the Athena service and to objects stored in S3. Now, remember, if you're watching this course, much of this should be review, so we're not going to go into these in serious detail. However, it's important to talk about them so you know how you can protect your data, and of course, if you're trying to allow access to data for yourself, a coworker, or perhaps even another AWS account, you'll want to know where to look if you aren't able to interact with the data you're trying to reach. Finally, we'll follow the discussion up with a demo. We're going to create a user in our AWS account, assign them the permissions they'll need to use Athena and interact with S3, upload the gift card data from Wired Brain Coffee and the third party, and validate our user's access before we move on. Pretty short module, let's get started.
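To make the two permission questions above concrete, here is a minimal sketch of an IAM policy document built in Python. The bucket names and the helper name `build_athena_read_policy` are hypothetical and do not come from the course. The separate results bucket is an assumption based on Athena writing query output to S3. A real account will likely need more than this, for example data catalog permissions, so treat it as a starting point rather than a complete policy.

```python
import json


def build_athena_read_policy(data_bucket, results_bucket):
    """Return an IAM policy document (as a dict) for basic Athena use.

    Both bucket names are supplied by the caller; nothing here is
    specific to the course's demo account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Permission to use the Athena service itself.
                "Sid": "RunAthenaQueries",
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                ],
                "Resource": "*",
            },
            {
                # Permission to read the source files stored in S3.
                "Sid": "ReadSourceData",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{data_bucket}",
                    f"arn:aws:s3:::{data_bucket}/*",
                ],
            },
            {
                # Assumption: query results are written to a separate bucket.
                "Sid": "WriteQueryResults",
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:PutObject",
                ],
                "Resource": [
                    f"arn:aws:s3:::{results_bucket}",
                    f"arn:aws:s3:::{results_bucket}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    policy = build_athena_read_policy("example-giftcard-data", "example-athena-results")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor and attached to the user. If a query fails with an access error, checking which of these three statements is missing is a reasonable first step.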
Creating Databases & Tables to Define Your Schema
Are you ready to get your hands dirty? Before we can retrieve information from S3 data by running SQL queries, we need to set up our schema within AWS Athena. We need to tell Athena where our data is, describe the columns and data types it contains, and designate how to interpret the format of our data. Therefore, in this module we'll start by reviewing what we know about databases and tables in Athena. We know Athena uses Hive underneath. We know that databases and tables in Athena are different from those of a relational database. And we know it's up to us as the user to manage and maintain our schemas, since the data they point to isn't validated by Athena for us prior to query time. We'll then talk about how to create our schema, the different tools we can use, and the rules or syntax for successfully managing databases and tables. In our scenario, we'll be creating tables from the data we uploaded to S3 in an earlier module, so we're going to be spending most of this module in demos. We'll be walking through how to set up our schema using the AWS Management Console in our browser, and we'll cover both methods for doing this: the wizard-based flow for creating tables, and simply running a DDL statement within the query editor. There are some other neat items in the console, such as the catalog manager and history tabs, which can come in handy too, and as promised, we'll also walk through setting up a third-party tool to work with Athena. In this case, we'll be using SQL Workbench, but you can use any tool that supports a JDBC connection. Obviously the exact steps for setup will vary based on the tool. The idea here is that we know more than one way to leverage the service. If you're already using your own business intelligence tools, being able to easily incorporate Athena into your existing workflow is going to be beneficial. Let's get started.
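As a rough illustration of the DDL route described above, the sketch below assembles a Hive-style `CREATE EXTERNAL TABLE` statement that names the S3 location, the columns and their types, and the file format. The database, table, column, and bucket names are hypothetical, and the sketch assumes comma-delimited text files with a single header row. The course's actual gift card files may be laid out differently.

```python
def build_create_table_ddl(database, table, columns, s3_location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for delimited text files in S3.

    columns is a list of (name, hive_type) pairs. The statement is only
    assembled here; it still has to be run in the query editor or a client.
    """
    if not s3_location.startswith("s3://"):
        raise ValueError("s3_location must start with s3://")
    # The location should be a folder-style prefix, so end it with a slash.
    if not s3_location.endswith("/"):
        s3_location += "/"

    column_lines = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'\n"
        # Assumption: each file starts with one header row to skip.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


if __name__ == "__main__":
    # Hypothetical schema, for illustration only.
    ddl = build_create_table_ddl(
        database="giftcards",
        table="wired_brain_cards",
        columns=[("card_id", "string"), ("balance", "double")],
        s3_location="s3://example-giftcard-data/wired-brain",
    )
    print(ddl)
```

Because Athena does not validate the underlying data when the table is created, a typo in the location or a wrong column type only shows up at query time, which is why it helps to generate the statement from one reviewed definition.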
Retrieving Information by Querying Tables with SQL
Querying data in Athena is just like querying data in a relational database, even though we aren't dealing with a relational database behind the scenes. Until this point we've focused on knowledge and setup. We've granted access, learned about Athena's underlying tech, and defined our tables. In this module, we'll start realizing the value of AWS Athena: we'll write SQL to query our S3 data. First, we'll talk about the Presto SQL engine, the technology running queries for you underneath Athena. We'll talk about the origin of Presto, what it was designed for, and how it allows you to query structured, semi-structured, and unstructured data in Athena. Next, we'll define ANSI-compliant SQL. Athena uses ANSI SQL, but what does that actually mean to us? Then, we'll revisit our Wired Brain Coffee scenario by providing a visual of what we've provisioned so far, and what we'll be querying for. Finally, we'll finish up with a demo to show how easy it is to query in Athena. We'll start by running some simple queries in the management console, and demonstrate how to save queries for later reference. Then, we'll move back into SQL Workbench, where we'll solve Wired Brain Coffee's gift card discrepancy problem. By the end of the module, you'll have gone through each step necessary to start using AWS Athena, from access, to uploading data, to schemas, to querying. Let's get started.
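To give a feel for the kind of query the discrepancy problem calls for, here is a small sketch that compares two gift card tables and returns the cards whose balances disagree or that are missing on the vendor side. The table names, column names, and sample rows are invented for illustration and are not the course's data. Because the query sticks to plain ANSI SQL, the sketch runs it against an in-memory SQLite database as a local stand-in, and the same statement would be expected to work in Athena against tables with matching columns.

```python
import sqlite3

# Hypothetical tables and columns; plain ANSI SQL with no engine-specific functions.
DISCREPANCY_SQL = """
SELECT w.card_id,
       w.balance AS wired_brain_balance,
       v.balance AS vendor_balance
FROM wired_brain_cards AS w
LEFT JOIN vendor_cards AS v
       ON v.card_id = w.card_id
WHERE v.card_id IS NULL
   OR v.balance <> w.balance
ORDER BY w.card_id
"""


def demo_connection():
    """Create an in-memory database with made-up rows for both sources."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wired_brain_cards (card_id TEXT, balance REAL)")
    conn.execute("CREATE TABLE vendor_cards (card_id TEXT, balance REAL)")
    conn.executemany(
        "INSERT INTO wired_brain_cards VALUES (?, ?)",
        [("A1", 25.0), ("B2", 10.0), ("C3", 5.0)],
    )
    conn.executemany(
        "INSERT INTO vendor_cards VALUES (?, ?)",
        [("A1", 25.0), ("B2", 12.5)],
    )
    return conn


def find_discrepancies(conn):
    """Return rows where the two sources disagree or the vendor row is missing."""
    return conn.execute(DISCREPANCY_SQL).fetchall()


if __name__ == "__main__":
    for row in find_discrepancies(demo_connection()):
        print(row)
```

With the sample rows above, card B2 is reported because the balances differ and card C3 because the vendor has no record of it, while A1 matches and is left out.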
AWS Athena vs. Other Solutions
Congratulations on making it this far. At this point, you know both how to set up and use Athena to query your S3 data, and you know how to take advantage of the best practices when optimizing for cost and performance. In this module, we'll wrap up the course by doing a comparison between AWS Athena and other data solution services provided by Amazon Web Services. Note that these aren't apples-to-apples comparisons, but are intended to give you some general guidance when architecting solutions on AWS. As you continue to expand your knowledge of AWS, it's important to understand the differences between the services, and the types of problems they are effective at solving, especially as AWS grows its catalog. We'll start with Redshift. We'll identify what Redshift is, the differences it has with Athena, and address the use cases where Redshift is preferable to Athena. We'll consider important items like solution setup, cost, performance, and features in our comparison. These services are certainly very different, but I believe it's valuable to compare them, seeing as they both fit within the realm of data in the cloud. Then we'll move on to AWS Elastic MapReduce. Like Redshift, EMR is much more powerful and rich in application than Athena, is designed for a broader range of use cases, and is certainly a mainstay in the data service catalog of AWS. We'll talk about when using Athena is preferable to EMR, and when EMR will be more attractive depending on what you're trying to accomplish. Finally, as a segue, I'd like to call attention to a popular quote by Maslow, which I believe applies to technology professionals as they become familiar with tools: "I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail."
We've spent roughly two hours on learning AWS Athena, and while I want you to be excited about solving problems with it, we should always take a step back to consider if the tools we are using are best equipped for the situation.