In today’s data-driven era, ensuring the quality of data is paramount for organizations aiming to harness the power of analytics, machine learning (ML), and informed decision-making. Two tools that have risen to prominence in this quest for high-quality data management are Apache Iceberg and AWS Glue. The combination of these technologies allows for effective data management, quality assurance, and seamless integration, making them invaluable for enterprises dealing with large-scale, continuously incoming data. This article delves into the steps and strategies for deploying these tools together to ensure high-quality data management.
1. Deploy Resources with AWS CloudFormation
To kick off, deploying the necessary resources using AWS CloudFormation is a critical first step. This automated setup streamlines the process, ensuring all required components are efficiently configured. Begin by selecting the “Launch stack” option in the AWS CloudFormation console. AWS CloudFormation simplifies the task of defining and deploying infrastructure as code, making it easier to manage and replicate environments.
For the parameters, IcebergDatabaseName is prepopulated with a default value. However, users have the flexibility to modify this value based on their specific requirements. Once the database name is established, proceed by selecting “Next.” This step ensures that the database configuration aligns with your desired setup.
After reviewing the parameters, click “Next” once more. This leads to a summary page where users must acknowledge that AWS CloudFormation might create IAM resources with custom names. This acknowledgment is crucial as it grants permission for the creation of IAM roles and policies necessary for the operation of AWS Glue and Iceberg.
Finally, submit the stack creation. After the stack’s creation is complete, check the “Outputs” tab. The values provided here are essential for the subsequent steps, as they contain details like the S3 bucket name, IAM role, and other resource identifiers required for configuring the environment further.
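For readers who prefer to script this step rather than click through the console, a minimal boto3 sketch is shown below. The stack name, template location, and database value are illustrative placeholders rather than values from the actual stack; the CAPABILITY_NAMED_IAM flag corresponds to the IAM acknowledgment described above.

```python
import boto3

cfn = boto3.client("cloudformation")

# Stack name, template URL, and database name below are placeholders; the
# article itself uses the console "Launch stack" link.
cfn.create_stack(
    StackName="iceberg-wap-stack",
    TemplateURL="https://<template-bucket>.s3.amazonaws.com/iceberg-wap.yaml",
    Parameters=[
        {"ParameterKey": "IcebergDatabaseName", "ParameterValue": "iceberg_wap_db"}
    ],
    # Acknowledge that the stack may create IAM resources with custom names
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Wait for creation to finish, then read the stack outputs used in later steps
cfn.get_waiter("stack_create_complete").wait(StackName="iceberg-wap-stack")
outputs = cfn.describe_stacks(StackName="iceberg-wap-stack")["Stacks"][0]["Outputs"]
print(outputs)
```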
2. Configure Iceberg JAR Files
With the resources deployed, the next step involves configuring the Iceberg JAR files. Start by selecting the necessary JAR files from the Iceberg releases page. Two specific JAR files are required: the “1.6.1 Spark 3.3 with Scala 2.12 runtime Jar” and the “1.6.1 aws-bundle Jar.” These files enable the integration of Iceberg with your Spark environment, facilitating data management and query execution.
Download these JAR files onto your local machine. Once downloaded, open the Amazon S3 console and navigate to the S3 bucket created through the CloudFormation stack. The S3 bucket name can be found on the CloudFormation Outputs tab. Ensuring the correct bucket is selected is vital for the setup’s success.
Within the S3 bucket, choose the “Create folder” option and establish a path for the JAR files. Typically, this path is named “jars.” After creating the folder, upload the two downloaded JAR files to s3://&lt;your-s3-bucket&gt;/jars/ from your local machine. This step ensures that the JAR files are accessible to your AWS Glue environment, enabling the use of Iceberg’s functionalities within Spark sessions.
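If you prefer to script the upload, a short boto3 sketch follows; the bucket placeholder and the exact JAR file names are assumptions based on the Iceberg 1.6.1 release naming.

```python
import boto3

s3 = boto3.client("s3")
bucket = "<your-s3-bucket>"  # taken from the CloudFormation Outputs tab

# File names follow the Iceberg 1.6.1 artifact naming; adjust if you
# downloaded different builds.
for jar in [
    "iceberg-spark-runtime-3.3_2.12-1.6.1.jar",
    "iceberg-aws-bundle-1.6.1.jar",
]:
    # Uploading under the jars/ prefix plays the role of the folder created in the console
    s3.upload_file(jar, bucket, f"jars/{jar}")
```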
3. Upload a Jupyter Notebook on AWS Glue Studio
Having configured the JAR files, the next phase involves uploading a Jupyter Notebook on AWS Glue Studio to facilitate interaction with Iceberg. Start by downloading the notebook file named wap.ipynb. This notebook contains the necessary code and configurations to implement the Write-Audit-Publish (WAP) pattern using Iceberg and AWS Glue.
Open the AWS Glue Studio console and create a new job by selecting the “Notebook” option. Choose the “Upload Notebook” feature and select “Choose file” to upload the wap.ipynb file you downloaded earlier. This step imports the notebook into AWS Glue Studio, making it ready for execution.
Next, assign the IAM role created through the CloudFormation stack to the notebook. Typically, this role is named IcebergWAPGlueJobRole. Selecting the correct IAM role ensures that the notebook has the necessary permissions to interact with AWS resources and execute its tasks successfully. For the job name, enter iceberg_wap at the top left of the page and then choose “Save.” This step finalizes the configuration, and the notebook is now ready for further setup and execution.
4. Set Up Iceberg Branches
With the notebook uploaded, the next step involves setting up Iceberg branches within the Jupyter Notebook environment on AWS Glue Studio. Begin by running the initial cells in the notebook to configure the session for Iceberg with Glue. The %additional_python_modules pandas==2.2 magic installs pandas, which is used to visualize the temperature and humidity data in the notebook. Before running the cell, replace the bucket placeholder with the name of the S3 bucket where you uploaded the Iceberg JAR files.
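For orientation, the setup cells typically combine AWS Glue interactive session magics along the following lines; the Glue version, idle timeout, and JAR paths shown here are assumptions, with &lt;your-s3-bucket&gt; standing in for your bucket name.

```python
# Glue interactive session magics (values are illustrative)
%glue_version 4.0
%idle_timeout 60
%additional_python_modules pandas==2.2
%extra_jars s3://<your-s3-bucket>/jars/iceberg-spark-runtime-3.3_2.12-1.6.1.jar,s3://<your-s3-bucket>/jars/iceberg-aws-bundle-1.6.1.jar
```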
Next, initialize the SparkSession by running the corresponding cell. The first three settings, starting with spark.sql, are essential for using Iceberg with Glue. Set the default catalog name to glue_catalog using spark.sql.defaultCatalog. Additionally, enabling the configuration spark.sql.execution.arrow.pyspark.enabled is necessary for data visualization with pandas.
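A minimal sketch of these settings is shown below, written as a standalone SparkSession build against a Glue Data Catalog-backed Iceberg catalog named glue_catalog; in the Glue notebook the same settings are applied when the interactive session starts, and the warehouse path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Enable Iceberg SQL extensions (branches, CALL procedures, and so on)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog backed by the AWS Glue Data Catalog
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<your-s3-bucket>/warehouse")
    # Make glue_catalog the default so table names can omit the catalog prefix
    .config("spark.sql.defaultCatalog", "glue_catalog")
    # Needed for fast Spark-to-pandas conversion when visualizing the data
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)
```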
After the session is created, indicated by the message Session has been created, proceed by copying the temperature and humidity dataset to the S3 bucket created through the CloudFormation stack. Ensure you replace the bucket placeholder with the correct bucket name from the CloudFormation Outputs tab before running the cell.
Configure the data source bucket name and path (DATA_SRC), the Iceberg data warehouse path (ICEBERG_LOC), and the database and table names for the Iceberg table (DB_TBL), replacing the bucket placeholder with the appropriate S3 bucket name. Then read the dataset and create the Iceberg table using a Create Table As Select (CTAS) query. Finally, visualize the temperature and humidity data by running the provided code, which displays the data for each room using pandas and matplotlib.
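An illustrative version of these steps might look like the following; the paths, file format, and table name are assumptions standing in for the notebook’s actual values.

```python
# Placeholder locations and names; substitute the bucket from the CloudFormation outputs
DATA_SRC = "s3://<your-s3-bucket>/data/"         # source dataset path (format assumed CSV)
ICEBERG_LOC = "s3://<your-s3-bucket>/iceberg/"   # Iceberg data warehouse path
DB_TBL = "iceberg_wap_db.room_readings"          # database.table (names assumed)

# Read the raw temperature/humidity dataset and register it for SQL
df = spark.read.option("header", "true").option("inferSchema", "true").csv(DATA_SRC)
df.createOrReplaceTempView("src_readings")

# Create the Iceberg table from the source data with a CTAS query
spark.sql(f"""
    CREATE TABLE {DB_TBL}
    USING iceberg
    LOCATION '{ICEBERG_LOC}'
    AS SELECT * FROM src_readings
""")

# Convert to pandas for per-room plotting (Arrow must be enabled, as above)
pdf = spark.table(DB_TBL).toPandas()
```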
Finally, create Iceberg branches by running the necessary queries. The branches, such as the staging (stg) and audit branches, are created using the ALTER TABLE db.table CREATE BRANCH query, as sketched below. These branches will be used for implementing the WAP pattern in subsequent steps.
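A sketch of the branch-creation step, reusing the DB_TBL name assumed above:

```python
# stg holds raw incoming writes; audit holds quality-approved data
spark.sql(f"ALTER TABLE {DB_TBL} CREATE BRANCH stg")
spark.sql(f"ALTER TABLE {DB_TBL} CREATE BRANCH audit")
```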
5. Implement WAP Pattern with Iceberg
With the Iceberg branches set up, it’s time to implement the WAP pattern, ensuring high-quality data management with Apache Iceberg and AWS Glue. The WAP pattern’s three phases—Write, Audit, and Publish—segregate and validate data effectively.
5.1 Write Phase: Write Incoming Data into the Iceberg Stg Branch
To begin, write the incoming temperature and humidity data, which may include erroneous values, into the stg branch of the Iceberg table. Start by running the cell designed to write data into the Iceberg table. This step captures all incoming data, regardless of its quality, into a staging environment separated from the main dataset.
Upon writing the records, visualize the current data in the stg branch by running the subsequent code. This visualization helps identify any anomalies or erroneous values in the new data. For instance, you might notice incorrect readings like temperatures around 100°C in certain rooms. This phase confirms that new data is isolated in the staging branch, preventing it from affecting downstream consumers until it has been validated.
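One way to route writes to the staging branch is Iceberg’s spark.wap.branch session property, sketched below; the notebook may use a different mechanism, and new_readings_df is an illustrative name for the incoming batch.

```python
# new_readings_df is assumed to hold the incoming batch, including the
# erroneous readings described above
spark.conf.set("spark.wap.branch", "stg")   # route subsequent writes to the stg branch
new_readings_df.writeTo(DB_TBL).append()    # stage the records without touching main

# Read the branch back to inspect what was staged
stg_df = spark.sql(f"SELECT * FROM {DB_TBL} VERSION AS OF 'stg'")
stg_df.toPandas().head()
```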
5.2 Audit Phase: Evaluate the Data Quality in the Stg Branch
In the audit phase, evaluate the quality of the captured data using AWS Glue Data Quality. Run the provided code to initiate data quality checks on the stg branch. The evaluation rule, defined in DQ_RULESET, checks if the temperature values fall within the normal range of −10 to 50°C, based on the device specifications. Any values outside this range are considered erroneous.
The output displays the evaluation result. If any temperature data violates the criteria, it will show as “Failed.” For instance, a temperature reading of 105°C will trigger a failure, highlighting the need for further action.
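A hedged sketch of such a check, using the EvaluateDataQuality transform available in AWS Glue, is shown below; the evaluation context name and the temperature column name are assumptions, and DQ_RULESET mirrors the rule described above.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsgluedq.transforms import EvaluateDataQuality

glue_context = GlueContext(spark.sparkContext)

# DQDL ruleset: temperatures outside the -10..50 degC device range fail the check
DQ_RULESET = """
Rules = [
    ColumnValues "temperature" between -10 and 50
]
"""

# Evaluate the staged records (stg_df from the write phase) against the ruleset
stg_dyf = DynamicFrame.fromDF(stg_df, glue_context, "stg_dyf")
dq_results = EvaluateDataQuality.apply(
    frame=stg_dyf,
    ruleset=DQ_RULESET,
    publishing_options={"dataQualityEvaluationContext": "iceberg_wap_audit"},
)
dq_results.toDF().show(truncate=False)   # each rule shows Passed or Failed
```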
Filter out the incorrect data from the stg branch. Then, update the latest snapshot in the audit branch with the valid data. This ensures that only quality-approved data is moved to the audit branch, ready for downstream use. The audit branch now contains data that meets the defined quality criteria, safeguarding the integrity of the dataset.
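One possible implementation of this step, continuing the names used earlier:

```python
# Keep only readings inside the device's valid range, mirroring the DQDL rule
valid_df = new_readings_df.filter("temperature BETWEEN -10 AND 50")

# Point writes at the audit branch and land only the quality-approved records
spark.conf.set("spark.wap.branch", "audit")
valid_df.writeTo(DB_TBL).append()
```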
5.3 Publish Phase: Publish the Valid Data to the Downstream Side
In the final phase, publish the validated data from the audit branch to the main branch, making it available for downstream applications. Start by running the fast_forward Iceberg procedure to merge the validated data from the audit branch with the main branch. This operation ensures that only data passing the quality checks is consumed by downstream users.
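The publish step maps to Iceberg’s fast_forward procedure; a sketch using the names assumed earlier:

```python
# Fast-forward main to the head of the audit branch so downstream readers see
# only validated data
spark.sql(f"""
    CALL glue_catalog.system.fast_forward(
        table => '{DB_TBL}',
        branch => 'main',
        to => 'audit'
    )
""")
```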
After completing the procedure, review the published data by querying the main branch in the Iceberg table. This step simulates a downstream query, confirming that only valid temperature and humidity data is accessible. The query results should display data that has passed the data quality evaluation, ensuring accuracy and reliability.
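A minimal downstream-style read might look like this; clearing the spark.wap.branch override ensures the query resolves to the main branch:

```python
# Behave like a downstream consumer: no branch override, read the main branch
spark.conf.unset("spark.wap.branch")
spark.table(DB_TBL).show()   # only validated temperature/humidity rows appear
```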
In this use case, the WAP pattern effectively managed data quality by isolating and validating incoming data through the stg and audit branches before publishing to the main branch. This approach prevented erroneous data from being visualized or used for analysis, promoting accurate insights and decisions.
Conclusion
Maintaining high-quality data is crucial for organizations looking to leverage analytics, machine learning (ML), and informed decision-making. Apache Iceberg, an open table format for huge analytic datasets, and AWS Glue, a fully managed ETL (extract, transform, and load) service, together provide an effective way to manage data, enforce data quality, and integrate seamlessly, making them essential for companies handling vast amounts of continuously incoming data. By deploying the two in tandem, as in the WAP workflow above, enterprises can efficiently organize their data lakes, implement robust data governance, and streamline their data pipelines, ensuring that data remains clean, structured, and easily accessible for analytical needs. This article has walked through the steps and strategies for integrating these two tools to achieve high-quality data management.