How Can Apache Pinot Enable Real-Time Analytics on AWS?

August 7, 2024

The advent of real-time analytics has transformed how businesses interact with data, making rapid processing and timely insights more critical than ever. Online Analytical Processing (OLAP) has always been a cornerstone for analyzing vast datasets by aggregating data into user-friendly structures. The traditional batch processing methods of OLAP, however, often fall short of meeting the demands of modern applications and user expectations. Enter Apache Pinot, an open-source real-time OLAP datastore designed to achieve low latency, high concurrency, and petabyte-scale data handling. Apache Pinot, when deployed on Amazon Web Services (AWS), empowers companies to turn raw data into actionable intelligence immediately. This article guides you on setting up Apache Pinot on AWS to build a real-time analytics solution and visualize data using Tableau.

1. Preconditions

Before diving into the deployment process of Apache Pinot on AWS, it is crucial to ensure that all necessary preconditions are met. First and foremost, an active AWS account is vital, as the guide centers around AWS’s vast suite of cloud services. In addition to having an AWS account, a basic understanding of Amazon S3, EC2, and Kinesis Data Streams is essential, as these services are pivotal in the setup and functioning of the real-time analytics solution.

An IAM role with permissions to access AWS CloudShell and to create Kinesis data streams, EC2 instances, and S3 buckets also needs to be in place. This role facilitates the deployment process and the seamless interaction between the different AWS services. On the local machine, Git should be installed to clone the repository required for setting up Apache Pinot, and both Node.js and npm need to be installed for package management. Ensuring these preconditions are met sets a solid foundation for building the real-time analytics solution with Apache Pinot on AWS.

2. Preparation for Tableau Visualization

Tableau is a powerful visualization tool that complements Apache Pinot by transforming raw data into interactive, insightful dashboards. To prepare for this step, first download and install Tableau Desktop; this guide uses version 2023.3.0. With Tableau in place, you can visualize the data processed by Apache Pinot in near real-time, enabling quick, insightful analysis of fresh data.

Additionally, the Kinesis Data Generator (KDG) must be installed using AWS CloudFormation. The KDG will generate sample web transactions to populate the Kinesis Data Streams, which are integral in testing and demonstrating the real-time analytics capabilities of Apache Pinot. Furthermore, it is essential to download and configure the Apache Pinot drivers. Save these drivers in the C:\Program Files\Tableau\Drivers folder if using Windows, or follow specific instructions for other operating systems. Properly setting up Tableau ensures a smooth transition from data capture to visualization, making the entire analytics process efficient and effective.

3. Initiate AWS CDK

Next, it is crucial to bootstrap the AWS Cloud Development Kit (CDK) environment, which simplifies cloud infrastructure setup by letting you define resources in familiar programming languages. To do this, run the AWS CDK bootstrap command in the form cdk bootstrap aws://<account-id>/<region>. For instance, if the AWS account ID is 123456789000 and the designated region is us-east-1, the command would be cdk bootstrap aws://123456789000/us-east-1. This step establishes the resources the CDK needs in your AWS account to deploy Apache Pinot successfully.

The AWS CDK uses high-level constructs to represent AWS components, significantly simplifying the build process. Bootstrapping is only necessary if AWS CDK has not been used previously in your deployment account and region. By successfully bootstrapping the AWS CDK, you lay the groundwork for efficiently managing and orchestrating your cloud infrastructure, paving the way for deploying a highly scalable, real-time OLAP datastore.
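As a minimal sketch of how the bootstrap target is assembled (using the hypothetical account ID and region from the example above), the command string is just the account and region joined into an aws:// URI:

```python
# Build the `cdk bootstrap` target string for a given account and region.
# The account ID and region below are the hypothetical example values from
# this guide; substitute your own before running the real command.
account_id = "123456789000"
region = "us-east-1"

bootstrap_command = f"cdk bootstrap aws://{account_id}/{region}"
print(bootstrap_command)  # cdk bootstrap aws://123456789000/us-east-1
```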

4. Clone and Configure Repository

With the AWS CDK environment in place, the next step involves cloning the relevant GitHub repository and installing necessary dependencies. Begin by running the command: git clone https://github.com/aws-samples/near-realtime-apache-pinot-workshop. This clones the repository containing all the scripts and configuration files required for setting up and deploying Apache Pinot on AWS. Navigate into the cloned repository by executing cd near-realtime-apache-pinot-workshop.

Following this, install all the dependencies listed in the package.json file by running npm i. This command ensures that all required packages and libraries are available, facilitating a seamless deployment process. This preparation is crucial because it ensures that all the tools and configurations are correctly set up, enabling you to focus on deploying and managing the real-time OLAP datastore without hitches.

5. Deploy AWS CDK Stack

Deploying the required AWS cloud infrastructure is the next critical step. Execute the command cdk deploy --parameters IpAddress="<your-IP-address-in-CIDR-notation>", substituting the IP address range that should be allowed to reach the solution. This command initiates the deployment of the AWS resources needed to run Apache Pinot; confirm the prompt by typing 'y'. The deployment of the AWS CDK stack typically takes about 10-12 minutes and ends with a message listing the AWS objects created, the stack ARN, and the total deployment time.

The deployment provisions the requisite compute, storage, and networking resources to host Apache Pinot while ensuring scalability and high availability. Once completed, you will have a robust infrastructure capable of ingesting and processing streaming and batch data sources in near real-time. With the AWS CDK managing the deployment, the process is streamlined, accommodating the intricate requirements of a real-time OLAP datastore.

6. Access Apache Pinot Controller

Upon deploying the AWS CDK stack, the next step is accessing the Apache Pinot controller. Retrieve the Application Load Balancer (ALB) DNS name for the Apache Pinot controller from the CloudFormation console. Navigate to the Stacks section, select the ApachePinotSolutionStack, and then choose Outputs. Copy the value for ControllerDNSUrl. This DNS name allows you to access the Apache Pinot controller via a web browser.

By pasting the ALB DNS name into a browser, you can monitor and manage the details of the Apache Pinot controller, including the number of controllers, brokers, servers, minions, tenants, and tables. You can also view lists of tenants, controllers, and brokers. This level of control and visibility ensures that the Apache Pinot cluster is functioning optimally, providing quick access to real-time analytics capabilities.
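Besides the browser UI, the Pinot controller also exposes a REST API; for example, a GET request to /tables returns the tables in the cluster. Below is a minimal sketch using only the Python standard library; the ALB DNS name is a placeholder for the ControllerDNSUrl value copied from the CloudFormation outputs:

```python
import json
import urllib.request

def tables_endpoint(controller_url: str) -> str:
    """Build the URL of the Pinot controller's list-tables endpoint."""
    return controller_url.rstrip("/") + "/tables"

def list_tables(controller_url: str) -> list:
    """Fetch the table names from the Pinot controller REST API."""
    with urllib.request.urlopen(tables_endpoint(controller_url)) as resp:
        return json.load(resp)["tables"]

# Example (requires network access to the cluster), with the ALB DNS name
# from the CloudFormation outputs substituted in:
#   list_tables("http://<controller-alb-dns-name>")
```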

7. Stream Data and Visualization

After setting up the Apache Pinot controller, you can proceed to stream data and visualize it using Tableau. Utilize the Kinesis Data Generator (KDG) tool to send sample web transactions to a Kinesis data stream. Follow the instructions to access the KDG tool and use the provided record template to send sample data to a stream named pinot-stream. Ensure you send a handful of records by choosing “Send data” and then stop the process by selecting “Stop sending data to Kinesis.”

Following this, you should see the web transactions data reflected in Tableau Desktop, offering a near real-time visualization of the ingested data. With this setup, you can manipulate and analyze the data quickly, making it easier to derive actionable insights and business intelligence from streaming data.
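If you prefer scripting the test data rather than using the KDG console, records of the same kind can be sent to the pinot-stream data stream with the AWS SDK. The sketch below assumes boto3 and configured AWS credentials, and the transaction field names are illustrative, not the exact KDG record template from the workshop:

```python
import json
import random
from datetime import datetime, timezone

def make_transaction() -> dict:
    """Build one sample web-transaction record. The field names here are
    illustrative, not the exact KDG template used in the workshop."""
    return {
        "userId": random.randint(1, 1000),
        "productName": random.choice(["apparel", "electronics", "books"]),
        "price": round(random.uniform(5.0, 500.0), 2),
        "transactionTime": datetime.now(timezone.utc).isoformat(),
    }

def send_transaction(kinesis_client, stream_name: str = "pinot-stream") -> None:
    """Send one record to the Kinesis data stream that Apache Pinot ingests."""
    record = make_transaction()
    kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record["userId"]),
    )

# Example (requires boto3 and AWS credentials):
#   import boto3
#   send_transaction(boto3.client("kinesis"))
```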

8. Clean Up

Once you have successfully tested and visualized the real-time data, it is essential to clean up the AWS resources you created to avoid unnecessary costs. Start by disabling termination protection on the EC2 instances involved in the deployment. In the Amazon EC2 console, select Instances from the navigation pane, then choose Actions, Instance settings, and Change termination protection. Clear the Termination protection checkbox for each of the following instances: ApachePinotSolutionStack/bastionHost, ApachePinotSolutionStack/zookeeperNode1, ApachePinotSolutionStack/zookeeperNode2, and ApachePinotSolutionStack/zookeeperNode3.

Next, run the command cdk destroy from the cloned GitHub repository and confirm by typing ‘y’ when prompted. This command will dismantle the AWS CDK stack and remove all associated resources, ensuring that you do not incur further charges.

9. Scale Solution for Production

Though the example given in this guide uses minimal resources to demonstrate functionality, scaling the solution for a production environment requires more robust configurations. The solution provides autoscaling policies to independently scale brokers and servers based on CPU requirements. When autoscaling is initiated, an AWS Lambda function runs the logic needed to add or remove brokers and servers in the Apache Pinot cluster.

In Apache Pinot, tables can be tagged with identifiers for routing queries to appropriate servers. This functionality is useful for building multi-tenant clusters but requires careful management to ensure no active tags or tables are associated with brokers or servers being removed. The autoscaling policy also handles rebalancing segments anytime a new broker or server is added, ensuring optimum resource utilization and seamless performance.
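For reference, a segment rebalance can also be triggered manually through the Pinot controller's REST API (a POST to /tables/{tableName}/rebalance). The sketch below only builds the request URL; the controller address and table name are hypothetical placeholders:

```python
from urllib.parse import urlencode

def rebalance_url(controller_url: str, table: str,
                  table_type: str = "REALTIME", dry_run: bool = True) -> str:
    """Build the Pinot controller rebalance endpoint URL for a table.
    A POST to this URL triggers (or, with dryRun=true, previews) a rebalance."""
    query = urlencode({"type": table_type, "dryRun": str(dry_run).lower()})
    return f"{controller_url.rstrip('/')}/tables/{table}/rebalance?{query}"

# Controller address and table name below are hypothetical examples.
print(rebalance_url("http://<controller-alb-dns-name>", "webTransactions"))
```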

By following these steps, you can efficiently scale your Apache Pinot deployment to meet growing demands and complex analytics requirements. Achieving a scalable, real-time OLAP datastore enables businesses to process and analyze data at unprecedented speeds, turning information into a strategic asset.
