Simplify Enterprise Data Access Using Amazon SageMaker Lakehouse

December 9, 2024

Organizations are increasingly leveraging data to drive innovation and inform their decisions, yet building data-driven applications can be a complex endeavor. This complexity arises from the need for collaboration among multiple teams and the integration of various data sources, tools, and services. For instance, creating a targeted marketing application that delivers personalized offers involves data engineers, data scientists, and business analysts, each utilizing distinct systems and tools. This multifaceted process leads to several challenges, including the steep learning curve for multiple systems, difficulties in managing data and code across different services, and complexities in controlling user access among various systems. Organizations often resort to creating custom solutions to bridge these systems, striving for a more unified approach that allows them to choose the best tools while ensuring a seamless experience for their data teams.

Amazon SageMaker Lakehouse offers a solution to these challenges by providing unified access to data stored across both data warehouses and data lakes. Through SageMaker Lakehouse, users can utilize their preferred analytics, machine learning, and business intelligence engines via an open Apache Iceberg REST API, which ensures secure access to data with consistent, fine-grained access controls.

1. Set Up Prerequisites

To begin, there are several prerequisites that need to be set up to ensure a smooth implementation of Amazon SageMaker Lakehouse. First, you need to create a custom IAM role following the guidelines in the Requirements for roles used to register locations. For this guide, we will use the IAM role named LakeFormationRegistrationRole. Next, set up an Amazon Virtual Private Cloud (VPC) with both private and public subnets. This VPC will provide the necessary networking infrastructure for your data lakehouse setup.

Then create an S3 bucket named customer-data to store customer-related data (S3 bucket names allow lowercase letters, numbers, and hyphens, but not underscores). You will also need an Amazon Redshift Serverless endpoint named sales-dw to host the store_sales dataset, and a second Redshift Serverless endpoint named sales-analysis-dw so that sales analysts can run churn analysis.
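If you prefer to script these prerequisites, the bucket and both Redshift Serverless endpoints can be created from the AWS CLI. The following is a minimal sketch, assuming the us-east-1 Region; the hyphenated resource names follow the naming rules noted above, and the base capacity is an illustrative minimum:

```bash
# Create the data lake bucket; bucket names are globally unique, so adjust if taken.
aws s3api create-bucket --bucket customer-data --region us-east-1

# Sales data warehouse: a Redshift Serverless namespace plus a workgroup
# (8 RPUs is the minimum base capacity).
aws redshift-serverless create-namespace --namespace-name sales-dw
aws redshift-serverless create-workgroup \
    --workgroup-name sales-dw \
    --namespace-name sales-dw \
    --base-capacity 8

# Second endpoint, used by sales analysts for churn analysis.
aws redshift-serverless create-namespace --namespace-name sales-analysis-dw
aws redshift-serverless create-workgroup \
    --workgroup-name sales-analysis-dw \
    --namespace-name sales-analysis-dw \
    --base-capacity 8
```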

Next, create an IAM role named DataTransferRole following the instructions in Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog. Make sure you have the latest version of the AWS CLI installed, following Installing or updating to the latest version of the AWS CLI. Finally, create a data lake administrator following the instructions in Create a data lake administrator; this guide uses an IAM role called Admin.

2. Configure Data Lake Administrators

Configuring data lake administrators is a crucial step in managing user permissions and securing access to data within your organization. Start by signing in to the AWS Management Console as Admin and navigate to AWS Lake Formation. In the navigation pane, under Administration, choose Administrative roles and tasks.

Under Data lake administrators, select Add. On the Add administrators page, under Access type, choose Data lake administrator. Under IAM users and roles, select Admin and confirm the selection. This grants the Admin role the necessary permissions to manage user access to catalog objects using AWS Lake Formation.

Next, on the Add administrators page, for Access type, select Read-only administrators. Under IAM users and roles, select AWSServiceRoleForRedshift and confirm the selection. This step allows Amazon Redshift to discover and access catalog objects in the AWS Glue Data Catalog, ensuring seamless integration between Redshift and other data sources.
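The same administrator configuration can also be applied from the AWS CLI. Below is a sketch using the hypothetical account ID 111122223333; note that put-data-lake-settings replaces the entire settings object, so include every administrator you intend to keep:

```bash
# Make the Admin role a data lake administrator and the Redshift service-linked
# role a read-only administrator. This call overwrites existing settings.
aws lakeformation put-data-lake-settings --data-lake-settings '{
    "DataLakeAdmins": [
        {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/Admin"}
    ],
    "ReadOnlyAdmins": [
        {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift"}
    ]
}'
```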

3. Create Customer Table in Amazon S3 Data Lake in AWS Glue Data Catalog

Creating a customer table in the Amazon S3 data lake within the AWS Glue Data Catalog is essential for organizing and managing your data. Begin by creating an AWS Glue database called customerdb in the default catalog of your account. To do this, navigate to the AWS Lake Formation console and select Databases in the navigation pane.

Once the database is created, select it and choose Edit. Uncheck Use only IAM access control for new tables in this database so that access to new tables is governed by Lake Formation permissions rather than IAM-only controls. Then sign in to the Athena console as Admin and select a workgroup that the role has access to. Run a SQL script to create the customer table and populate it with data; a sketch of such a script follows.
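The guide's original script is not reproduced here, so the sketch below only illustrates the general shape: an Iceberg table in customerdb stored at the registered S3 location, with hypothetical columns and sample rows:

```sql
-- Hypothetical schema; substitute the columns from your own dataset.
CREATE TABLE customerdb.customer (
    customer_id   bigint,
    customer_name string,
    email         string,
    country       string
)
LOCATION 's3://customer-data/customer/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

-- Populate the table with a few sample rows.
INSERT INTO customerdb.customer VALUES
    (1, 'Ana Carvajal', 'ana@example.com', 'ES'),
    (2, 'Raj Patel', 'raj@example.com', 'IN');
```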

4. Register S3 Bucket with Lake Formation

Registering your S3 bucket with Lake Formation is a critical step in centralizing your data management and ensuring secure access. Start by signing in to the Lake Formation console as Data Lake Admin. In the navigation pane, choose Administration, then Data lake locations. Select Register location to begin the registration process.

For the Amazon S3 path, enter s3://customer-data/. This specifies the S3 bucket you want to register. Next, select the IAM role LakeFormationRegistrationRole, created earlier for this purpose. For Permission mode, choose Lake Formation so that permissions on this location are managed by Lake Formation. Finally, choose Register location to complete the process.
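The equivalent registration from the AWS CLI might look like the following sketch (the account ID is hypothetical); leaving hybrid access unset keeps the location in Lake Formation permission mode:

```bash
# Register the bucket under the custom registration role created earlier.
aws lakeformation register-resource \
    --resource-arn arn:aws:s3:::customer-data \
    --role-arn arn:aws:iam::111122223333:role/LakeFormationRegistrationRole
```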

5. Create Sales Database in Amazon Redshift

Creating a sales database in Amazon Redshift is essential for managing your sales data and performing analytics. Begin by signing in to the Redshift endpoint sales-dw as an admin user and run a statement to create a database named salesdb. This database will serve as the central repository for your sales data.

Next, connect to the newly created salesdb database. Create the sales schema and the store_sales table, then populate store_sales with the data needed for your sales analysis; a sketch of these statements follows. This sets the foundation for in-depth, data-driven analysis of your sales data in Amazon Redshift.
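The columns in the sketch below are hypothetical stand-ins for the store_sales dataset; run the CREATE DATABASE statement while connected to the default database, then reconnect to salesdb for the rest:

```sql
-- On the sales-dw endpoint, connected to the default database:
CREATE DATABASE salesdb;

-- After reconnecting to salesdb:
CREATE SCHEMA sales;

CREATE TABLE sales.store_sales (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      decimal(10,2)
);

INSERT INTO sales.store_sales VALUES
    (101, 1, '2024-11-02', 250.00),
    (102, 2, '2024-11-03', 99.95);
```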

6. Create Churn Lakehouse RMS Catalog in AWS Glue Data Catalog

Creating a churn lakehouse RMS catalog in the AWS Glue Data Catalog lets you organize and manage customer churn data effectively. Sign in to the Lake Formation console as Data Lake Admin. In the left navigation pane, select Data Catalog, choose Catalogs, and then choose Create catalog.

Provide the necessary details for the catalog, including the name churn_lakehouse. For Type, select Managed catalog, and for Storage, select Redshift. Under Access from engines, make sure Access this catalog from Iceberg compatible engines is selected; this enables integration with Iceberg-compatible analytics engines. Choose Next to proceed.

Under Principals, select IAM users and roles. Choose Admin, and under Catalog permissions, select Super user to grant the Admin role full control over the catalog. Choose Add, then Create catalog to finalize the creation of the churn lakehouse RMS catalog. This catalog will now serve as the central repository for your customer churn data, enabling comprehensive analysis and insights.

7. Access Churn Lakehouse RMS Catalog from Amazon EMR Spark Engine

Accessing the churn lakehouse RMS catalog from the Amazon EMR Spark engine is essential for leveraging Spark’s powerful data processing capabilities. Begin by setting up an EMR Studio. This provides a user-friendly interface for managing your EMR applications and notebooks.

Next, create an EMR Serverless application using the AWS CLI; a sketch of the command follows. A serverless application lets you run Spark jobs without managing the underlying infrastructure, reducing operational overhead. By integrating the churn lakehouse RMS catalog with the EMR Spark engine, you can efficiently process and analyze large volumes of customer churn data, leading to valuable insights and more informed business decisions.
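In the sketch below, the application name and release label are assumptions; use a release that supports SageMaker Lakehouse catalogs, and add any runtime configuration your setup requires:

```bash
# Create a Spark application on EMR Serverless.
aws emr-serverless create-application \
    --name churn-analysis \
    --type SPARK \
    --release-label emr-7.5.0
```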

8. Sign in to EMR Studio and Use the EMR Studio Workspace

Sign in to EMR Studio and create a Workspace. Attach the Workspace to the EMR Serverless application created in the previous step, then open a notebook. From the notebook, you can query the churn_lakehouse catalog with Spark SQL and explore the churn data alongside the rest of your lakehouse.
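The following is a minimal Spark SQL sketch of such a notebook session; churn_lakehouse is the catalog created earlier, while the schema and table names (public.customer_churn) and the churn_flag column are hypothetical placeholders for whatever you load into the catalog:

```sql
-- List the schemas that the RMS catalog exposes.
SHOW SCHEMAS IN churn_lakehouse;

-- Hypothetical churn query; adjust schema, table, and column names to your data.
SELECT churn_flag, COUNT(*) AS customers
FROM churn_lakehouse.public.customer_churn
GROUP BY churn_flag;
```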
