Harnessing Amazon EMR on AWS Outposts for Hybrid Big Data Analytics

January 30, 2025

Businesses need powerful, flexible tools to manage and analyze large volumes of data. Amazon EMR is an industry-leading big data solution for petabyte-scale data processing, interactive analytics, and machine learning, with support for more than 20 open source frameworks such as Apache Hadoop, Apache Hive, and Apache Spark. However, data residency requirements, latency constraints, and hybrid architecture needs often challenge purely cloud-based solutions. Amazon EMR on AWS Outposts addresses these challenges by bringing Amazon EMR directly to your on-premises environment.

The service combines the scalability, performance, and ease of use of Amazon EMR with the control and proximity of your own data center, so enterprises can meet stringent regulatory and operational requirements while opening up new data processing options. EMR on Outposts is a native hybrid data analytics service that lets you access and process data both on premises and in the cloud. It integrates with existing IT infrastructure, so you can keep data where it best fits your needs while running all computation on premises. For example, consider a hybrid setup in which sensitive data stays local in Amazon S3 on Outposts and public data sits in a Regional Amazon Simple Storage Service (Amazon S3) bucket. This configuration lets you augment sensitive on-premises data with cloud data while all data processing and compute run on premises on AWS Outposts racks.

1. Prerequisites

Before you start with Amazon EMR on AWS Outposts, a few prerequisites must be in place. The first is an AWS account and a role with administrator access. If you don't have an AWS account, create one before continuing; it is required for access to AWS services.

Once your AWS account is set up, make sure that an Outposts rack is installed and operational. The Outposts rack provides the hardware infrastructure for the on-premises deployment of Amazon EMR. You also need an EC2 key pair, which lets you connect to the EMR cluster nodes even if Regional connectivity is lost, so you can still reach the nodes during a connectivity outage.
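As an illustration of the key pair step, here is a minimal Python sketch that uses the boto3 EC2 client. The key name `emr-outposts-demo-key` and the output directory are assumptions made for this example, not values from the walkthrough. The function that calls AWS is defined but not invoked, because it needs credentials and creates a real resource.

```python
"""Sketch: create an EC2 key pair for SSH access to EMR cluster nodes."""
from pathlib import Path


def key_pair_request(key_name: str) -> dict:
    """Build the arguments for the EC2 CreateKeyPair call."""
    return {"KeyName": key_name, "KeyType": "rsa", "KeyFormat": "pem"}


def create_key_pair(key_name: str, out_dir: str = ".") -> Path:
    """Create the key pair and save the private key with owner-only read access."""
    import boto3  # imported lazily so the request builder works without AWS access

    ec2 = boto3.client("ec2")
    response = ec2.create_key_pair(**key_pair_request(key_name))
    pem_path = Path(out_dir) / f"{key_name}.pem"
    pem_path.write_text(response["KeyMaterial"])
    pem_path.chmod(0o400)  # SSH clients reject private keys readable by others
    return pem_path


# Hypothetical key name, chosen only for this example.
request = key_pair_request("emr-outposts-demo-key")
print(request)
```

Calling `create_key_pair("emr-outposts-demo-key")` with valid credentials would write the private key to the current directory. Keep that file somewhere reachable from the on-premises network, since it is the fallback access path.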

Setting up Direct Connect is the final prerequisite step, although this is only required if you are planning to deploy the second AWS CloudFormation template. Direct Connect establishes a private, dedicated connection between your on-premises environment and AWS, optimizing the network performance and ensuring secure data transit. With these prerequisites met, you can proceed with deploying the CloudFormation stacks that structure your hybrid big data analytics solution.

2. Deploy the CloudFormation Stacks

Deploying the CloudFormation stacks is a multi-step process that sets up the required infrastructure for Amazon EMR on AWS Outposts. First, Stack1 provisions the network infrastructure on Outposts. This includes creating the S3 on Outposts bucket and Regional S3 bucket. Sample data is copied to each bucket to simulate a real-world data setup. Confidential customer stockholding data is copied to the S3 on Outposts bucket, while non-confidential stock details are copied to the Regional S3 bucket.

Next, Stack2 provisions the infrastructure necessary for connecting to the Regional S3 bucket via Direct Connect. This stack establishes a VPC with private connectivity for both the Regional S3 bucket and the Outposts subnet. It also creates an Amazon S3 VPC interface endpoint for private access to Amazon S3 and configures a private Amazon Route 53 hosted zone for S3 DNS resolution within the VPC. You can skip this stack if Direct Connect routing is not needed.

Stack3 provisions the EMR cluster infrastructure along with AWS Glue database and tables. The stack establishes an AWS Glue database named oktank_outpostblog_temp and creates three tables within it: stock_details, stockholdings_info, and stockholdings_info_detailed. The stock_details table holds public stock information stored in the Regional S3 bucket, while the other two tables contain confidential stockholding data in the S3 on Outposts bucket. A runtime role named outpostblog-runtimeRole1 is also created, simplifying access controls for different jobs within the single EMR cluster.

The final stack, Stack4, provisions EMR Studio, which you use to interact with the data stored in both the S3 on Outposts bucket and the Regional S3 bucket. After deploying these CloudFormation stacks with an admin role, you have the resources needed to perform hybrid big data analytics in your environment.
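The deployment order described above can be sketched in Python as follows. The stack names reuse the labels from this section, while the template URLs, the `CAPABILITY_NAMED_IAM` capability, and the absence of stack parameters are assumptions; the real templates may need different inputs. The `deploy` function is defined but not called here because it requires AWS credentials.

```python
"""Sketch: deploy the four stacks in order, skipping Stack2 when not needed."""


def deployment_plan(use_direct_connect: bool) -> list[str]:
    """Return the stack labels in deployment order."""
    plan = ["Stack1"]  # network, S3 on Outposts bucket, Regional S3 bucket
    if use_direct_connect:
        plan.append("Stack2")  # private connectivity to the Regional bucket
    plan += ["Stack3", "Stack4"]  # EMR cluster and Glue tables, then EMR Studio
    return plan


def deploy(stack_name: str, template_url: str) -> None:
    """Create one stack and block until CloudFormation reports completion."""
    import boto3  # imported lazily; only needed for a real deployment

    cfn = boto3.client("cloudformation")
    cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,  # assumption: templates are hosted in S3
        Capabilities=["CAPABILITY_NAMED_IAM"],  # assumption: stacks create named IAM roles
    )
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)


with_dx = deployment_plan(use_direct_connect=True)
without_dx = deployment_plan(use_direct_connect=False)
print(with_dx)
print(without_dx)
```

Waiting for each stack to finish before starting the next keeps the sketch simple and follows the order in which the stacks are described.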

3. Access the Data and Join Tables

With the infrastructure in place, you can use Amazon EMR to access and join the data stored in the local and Regional S3 buckets. Begin by navigating to the Outputs tab of Stack4 in the AWS CloudFormation console. From there, select the EMR Studio URL to open EMR Studio in a new window. Create a workspace with the default options; it launches in a new tab.

Connect to the EMR cluster using the runtime role (outpostblog-runtimeRole1) and confirm that the workspace is attached to the cluster. Use the File Browser tab to open a notebook and choose the PySpark kernel. Run a query in the notebook to read data from the stock_details table, which points to the Regional S3 bucket. Then run another query to read the confidential data stored in the local S3 on Outposts bucket.

A key requirement for Oktank, the example company in this walkthrough, is to enrich on-premises data with public data from the Regional S3 bucket. With Amazon EMR, this enrichment is a join: run a query that combines the datasets from the local and Regional S3 buckets. Because the notebook is interactive, you can explore and analyze the combined data immediately and use the results to support data-driven decisions.
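The queries described in this section might look like the following Python sketch. The database and table names come from the stacks above; the join column `stock_symbol` is a hypothetical placeholder, because the table schemas are not shown here. The code only builds the SQL strings, so it runs anywhere; in the EMR Studio notebook you would pass each string to `spark.sql(...)`.

```python
"""Sketch: build the read and join queries for the notebook."""

DATABASE = "oktank_outpostblog_temp"


def select_sample(table: str, limit: int = 10) -> str:
    """Return a query that reads a few rows from one catalog table."""
    return f"SELECT * FROM {DATABASE}.{table} LIMIT {limit}"


def join_query(join_column: str) -> str:
    """Return a query that enriches confidential holdings with public stock details."""
    return (
        f"SELECT h.*, d.* "
        f"FROM {DATABASE}.stockholdings_info AS h "
        f"JOIN {DATABASE}.stock_details AS d "
        f"ON h.{join_column} = d.{join_column}"
    )


queries = [
    select_sample("stock_details"),       # Regional S3 bucket
    select_sample("stockholdings_info"),  # S3 on Outposts bucket
    join_query("stock_symbol"),           # hypothetical join column
]
for query in queries:
    print(query)
    # In the notebook, where the PySpark kernel provides `spark`:
    # spark.sql(query).show()
```

Replace `stock_symbol` with whichever column the two tables actually share before running the join.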

4. Control Access to Tables Using Lake Formation

Managing access control to your data tables is critical for maintaining data security and compliance. Lake Formation offers a robust solution for this purpose. On the Lake Formation console, navigate to Tables in the sidebar and select the stockholdings_info table. View the current access permissions and revoke permissions as necessary from IAMAllowedPrincipals to restrict table access.

Attempt to rerun the data access queries in the EMR Studio notebook to confirm that access is correctly restricted. The query should fail because Lake Formation has denied the permissions. To resolve this, go back to the Lake Formation console, select the stockholdings_info table, and grant the necessary permissions to the runtime role (outpostblog-runtimeRole1). Assigning appropriate permissions ensures that the runtime role can access the required table data for processing.

This exercise shows how Lake Formation enforces access controls on catalog tables as soon as permissions change. Note that Lake Formation controls access to the tables only; the underlying data files in the S3 on Outposts bucket still require IAM permissions. Stack3 sets up those IAM permissions.
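The console steps above have API equivalents. The following Python sketch builds the requests for the boto3 Lake Formation client. The account ID `111122223333`, the resulting role ARN, and the choice of `SELECT` as the granted permission are assumptions for illustration; the walkthrough itself only says to grant the necessary permissions. The functions that call AWS are defined but not invoked.

```python
"""Sketch: revoke default table access, then grant it to the runtime role."""

DATABASE = "oktank_outpostblog_temp"
TABLE = "stockholdings_info"


def table_resource() -> dict:
    """Describe the catalog table in the shape Lake Formation expects."""
    return {"Table": {"DatabaseName": DATABASE, "Name": TABLE}}


def revoke_request() -> dict:
    """Request that removes the default IAMAllowedPrincipals access."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": table_resource(),
        "Permissions": ["ALL"],
    }


def grant_request(role_arn: str) -> dict:
    """Request that lets the runtime role read the table (SELECT is an assumption)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": table_resource(),
        "Permissions": ["SELECT"],
    }


def revoke_default_access() -> None:
    import boto3  # imported lazily; only needed for a real call

    boto3.client("lakeformation").revoke_permissions(**revoke_request())


def grant_to_runtime_role(role_arn: str) -> None:
    import boto3

    boto3.client("lakeformation").grant_permissions(**grant_request(role_arn))


# Hypothetical account ID; the role name comes from Stack3.
ROLE_ARN = "arn:aws:iam::111122223333:role/outpostblog-runtimeRole1"
grant = grant_request(ROLE_ARN)
print(revoke_request())
print(grant)
```

Keeping the revoke and grant calls separate mirrors the walkthrough, where you rerun the notebook query between the two steps to see the access failure.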

5. Submit a Batch Job

Lastly, performing batch jobs as an EMR step allows for automated, large-scale data processing. Ensure that the stockholdings_info_detailed table is initially empty by running a verification query in the notebook. After detaching the notebook from the cluster, navigate to the EMR console and submit a step on the EMR cluster named EMROutpostBlog.

Select Spark Application for Type, and choose the Python script from the scripts folder in the S3 bucket created by the CloudFormation templates. Under Permissions, select the runtime role (outpostblog-runtimeRole1), then add the step to submit the job. Wait for the job to complete; it inserts data into the stockholdings_info_detailed table. Rerun the earlier verification query in the notebook to confirm that the data has been inserted.
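The same step can be submitted programmatically. The sketch below uses the boto3 EMR client and passes the runtime role through the `ExecutionRoleArn` parameter of `add_job_flow_steps`. The step name, script URI, cluster ID, and account ID are hypothetical placeholders; the actual script lives in the scripts folder of the bucket created by the templates. The submitting function is defined but not called, since it needs credentials and a running cluster.

```python
"""Sketch: submit the PySpark script as an EMR step with a runtime role."""


def spark_step(script_uri: str, name: str = "stockholdings-batch-job") -> dict:
    """Build a step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,  # hypothetical step name
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_step(cluster_id: str, role_arn: str, script_uri: str) -> str:
    """Submit the step, wait for it to finish, and return the step ID."""
    import boto3  # imported lazily; only needed for a real submission

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[spark_step(script_uri)],
        ExecutionRoleArn=role_arn,  # runtime role for this step
    )
    step_id = response["StepIds"][0]
    emr.get_waiter("step_complete").wait(ClusterId=cluster_id, StepId=step_id)
    return step_id


# Hypothetical script location, used only to show the step structure.
step = spark_step("s3://example-bucket/scripts/example_job.py")
print(step)
```

After `submit_step` returns, rerunning the verification query against stockholdings_info_detailed shows whether the job inserted rows.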

Conclusion

Companies need robust, versatile tools to handle and analyze large volumes of data. Amazon EMR is a leading choice for big data processing in the cloud, covering petabyte-scale data processing, interactive analytics, and machine learning with more than 20 open source frameworks such as Apache Hadoop, Apache Hive, and Apache Spark. Purely cloud-based solutions, however, often run into data residency requirements, latency constraints, and the need for hybrid architectures. Amazon EMR on AWS Outposts addresses these challenges by bringing Amazon EMR directly to your on-premises environment.

The service combines the scalability, performance, and ease of use of Amazon EMR with the control and proximity of your data center, so enterprises can comply with stringent regulatory and operational requirements while pursuing new data processing opportunities. As this walkthrough shows, sensitive data can stay local in Amazon S3 on Outposts while public data resides in a Regional Amazon S3 bucket. You can then enrich the sensitive on-premises data with cloud data while all processing runs on premises on AWS Outposts racks.
