Aerospike Connect for Presto

Kiran Matty - Director of Product Management Blog, Technology, Developer

Sound business decisions lean heavily on insights generated from data such as device metrics, user behavior tracking, business transactions, location data, and much more. A Business Intelligence (BI) tool such as Tableau, MicroStrategy, Looker, or Qlik, etc. is a staple in a business or data analyst’s toolbox and SQL is their language of choice for data mining. However, they not only care about premier dashboarding and analytics capabilities but also the speed of analysis, which is usually measured as time-to-insight. Dashboarding and interactive queries demand near real-time response times to be impactful. Such workloads are typically read-heavy, low latency, and high throughput. Large datasets such as a table with a million records or more are typically stored in a database and queried on an as-needed basis. Under the covers, BI tools generate SQL queries and send it over a JDBC interface to the SQL query engine, which is backed by a database. No matter how optimized your SQL query engine is, most of the performance gains are wiped out by slow reads from the underlying database. Slower queries can lead to loss of productivity, which can be very pronounced for large enterprises where millions of queries are run on a daily basis. Needless to say, this can severely impact your decision-making ability, not to mention revenues.

Aerospike Connect for Presto is finally here to help!

Aerospike is a highly scalable NoSQL data platform and its Hybrid Memory Architecture™ makes it an ideal database for analytics workloads. It is typically deployed into real-time environments managing terabyte-to-petabyte data volumes and supports millisecond read and write latencies. It leverages bleeding-edge storage innovations such as PMem or persistent memory from best of breed hardware companies such as HPE and Intel. By storing indexes in DRAM, and data on persistent storage (SSD) and reading directly from disk, Aerospike provides unparalleled speed and cost-efficiency. It can be deployed alongside other scalable distributed software such as Kafka, Presto, Spark, etc. via the Aerospike Connect product line for data analysis.

We are very excited to announce that Aerospike Connect for Presto is out of beta and now generally available. The Presto connector enables business and data analysts to use ANSI SQL, which they are very comfortable with, to query data stored in Aerospike via Presto. Presto is a highly parallel and distributed SQL query engine. Further, it is multi-tenant and capable of concurrently running hundreds of memory, I/O, and CPU-intensive queries, and scaling to thousands of workers. Figure 1 below depicts a high-level architecture of a typical deployment:

Presto: High-level architectural deployment

Figure 1: High-level architecture

With its recently announced multi-site clustering capability, Aerospike enables you to extend the above architecture to provide a globally consistent view of data to a geographically distributed team of business or data analysts.

Salient Features of Aerospike Connect for Presto

  • Run ANSI SQL queries in-place on massive amounts of data 
    • Query data stored in Aerospike without the need for complex and error-prone processes of copying data over to other analytics systems, which significantly helps with governance and compliance.
    • Support for Presto data types, including complex types such as maps and arrays, along with both read and write SQL statements
    • Presto Aggregate functions such as min/max, sum, avg, etc. are supported
  • Federate queries across multiple Aerospike clusters or between Aerospike and other databases
    • Enterprise database deployments are typically polyglot in nature, but you can now deploy Aerospike into an ecosystem of DBs that consist of Cassandra, PostgreSQL, Oracle, etc.
  • Analyze Aerospike data via Business Intelligence (BI) tools
    • Create insightful dashboards using Tableau, Qlik, Looker, etc by accessing Presto over JDBC to analyze data stored in Aerospike.
  • Query records with different schemas within the same set in Aerospike
    • Aerospike is a NoSQL DB and schemaless, but the presto connector reconciles those differences and offers a SQL experience that you are familiar with, while leveraging the benefits of a NoSQL DB.
    • It supports schema inference so that you have the option to not provide any table definition in the Presto Catalog.
  • Accelerate queries by leveraging Aerospike’s massive parallelism
    • It parallely scans 4,096 partitions to load data into upto 32K Presto splits across your Presto cluster and uses the recently released Aerospike expressions for pushdowns to the database
  • Leverage Presto Cost-Based Optimization via row count for query optimization
    • Aerospike Connect for Presto is one of the few Presto connectors that supports it for speeding up joins in Presto
  • Secure your deployment with TLS and LDAP support
    • Use TLS to secure connection between Presto and the Aerospike clusters and LDAP for authenticating Presto users with the Aerospike database
  • Deploy anywhere 
    • Deploy in a cloud or Kubernetes environment to help you leverage Managed Presto Services offered by Cloud providers

Key Use Cases

  1. Interactive Queries
    Business analysts run 100’s of interactive (millisecond-to-seconds query latency) or ad-hoc queries on a daily basis to generate insights for business critical decisions E.g. How many users have clicked the new banner ad? What are some categories of ads they’ve seen?, etc. These queries are characterized by low response times and usually involve smaller data sets. You can either use Presto CLI or any SQL editor that supports a Presto JDBC driver such as Hue, Zeppelin, Quix , etc. (see Figure 2, below).  An Aerospike customer is currently trialing our Presto connector to query and inspect data for PII that is ingested into an edge Aerospike system in their real-time IoT pipeline for compliance with GDPR. Aerospike Connect for Presto is designed such that it does not cause any performance degradation for the database at query time.

    Interactive query against Aerospike data using the Quix SQL editor
    Figure 2 Interactive query against Aerospike data using the Quix SQL editor
  2. BI Dashboarding 
    Data analysts can use a BI tool such as Tableau, Looker, Qlik, etc. to connect to Presto over JDBC (using the Presto JDBC driver for the BI tool) and analyze data stored in Aerospike at scale for visualization and reporting. They can serve up real-time dashboards and deliver high-quality data insights quicker than before. For example, Figure 3 depicts the visualization of Covid data that is loaded into Aerospike. Here we are analyzing the correlation between the total COVID cases and total deaths for each state in the US.

    Tableau Dashboard showing COVID-19 correlation analysis of data stored in Aerospike
    Figure 3 Tableau Dashboard showing COVID-19 correlation analysis of data stored in Aerospike
  3. Data Preparation for AI/ML
    Python is extensively used by data scientists looking to prep, process, and to analyze data for analytics and machine learning use cases. Jupyter Notebook is an open source, interactive, and web-based notebook that is used for data analysis and visualization. You can now analyze data stored in the Aerospike database via Presto using the Jupyter Notebook with the PyHive Presto Python library. The advantage of using a Jupyter Presto notebook is that you can not only explore data stored in Aerospike, but also pre-process it to create AI/ML models using popular Python libraries such as Pandas, NumPy, Scikit-learn, etc., and quickly progress from POC to production.

    Jupyter Python notebook for pre-processing data stored in Aerospike
    Figure 4 Jupyter Python notebook for pre-processing data stored in Aerospike

Now, you are off-to-the races!  You can now analyze data stored in Aerospike with the BI tool of your choice, run interactive SQL queries, launch your Jupyter notebook to prepare data in Python for AI/ML use cases, and significantly improve your Time-to-Insight. Please refer to the Aerospike Connect for Presto documentation for details and reach out to us if you have any questions.

Share:

About Author

mm

    Kiran Matty - Director of Product Management

    All posts by this author
    Kiran Matty is the Director of Product Management at Aerospike, where he manages ecosystem products that relate to AI/ML, Streaming, and Aerospike DB access. His experience spans big data and ML infrastructure, data security, and analytics. Prior to Aerospike, he was a product manager at Visa, Hortonworks, HPE, and Cisco. His interests include large scale distributed systems and AI/ML.