Monday, August 17 • 11:00am - 11:40am
Breakthrough OLAP performance on Cassandra and Spark


Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but is typically not thought of as an OLAP database for analytical queries.  This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance.  We will then introduce a new open-source project that combines the above performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet.
First, the existing Cassandra Spark connector allows one to easily load data from Cassandra to Spark.  We’ll cover how to accelerate queries through different caching options in Spark, and the tradeoffs and limitations around performance, memory, and updating data in real time.  We then dive into the use of columnar storage layout and efficient coding techniques that dramatically speed up I/O for OLAP use cases.  Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format.  Next, we explore how to integrate this new storage with Spark SQL and its pluggable data storage API.  Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark’s Catalyst execution planner, and vectorization for fast cache- and GPU-friendly calculations (see Spark’s Project Tungsten).
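
As a rough illustration of why a columnar layout helps analytical queries, the sketch below pivots a few row records into one contiguous array per column and dictionary-encodes a low-cardinality string column, so an aggregation reads only the columns it needs. The sample data, field names, and helper functions are hypothetical and are not taken from the talk or from FiloDB; this is a minimal sketch of the general technique, not the project's actual storage format.

```python
from array import array

# Hypothetical sample data; the field names are illustrative only.
rows = [
    {"city": "SF", "fare": 12.5, "passengers": 1},
    {"city": "NYC", "fare": 30.0, "passengers": 2},
    {"city": "SF", "fare": 7.5, "passengers": 1},
]


def to_columnar(rows):
    """Pivot row records into one contiguous container per column."""
    cities = [r["city"] for r in rows]
    # Dictionary-encode the string column: store each distinct value once
    # and keep only small integer codes per row.
    dictionary = sorted(set(cities))
    code_of = {value: i for i, value in enumerate(dictionary)}
    return {
        "city_dict": dictionary,
        "city_codes": array("i", (code_of[c] for c in cities)),
        "fare": array("d", (r["fare"] for r in rows)),
        "passengers": array("i", (r["passengers"] for r in rows)),
    }


def total_fare_by_city(columns):
    """Aggregate while touching only the two columns the query needs."""
    totals = [0.0] * len(columns["city_dict"])
    for code, fare in zip(columns["city_codes"], columns["fare"]):
        totals[code] += fare
    return dict(zip(columns["city_dict"], totals))


columns = to_columnar(rows)
print(total_fare_by_city(columns))
```

The `passengers` column is never read by the aggregation, which is the point of the layout: a query's I/O scales with the columns it projects rather than with the full row width.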

FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra.  We will briefly cover interesting use cases, such as:
* Easy exactly-once ingestion from Kafka for streaming and IoT applications
* Incremental computed columns and geospatial annotations

We’ll discuss how FiloDB improves aggregations needed for choropleth maps over standard PostGIS solutions.


Evan Chan

Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest in open-source technologies.  He is the creator of the FiloDB open-source distributed analytical database, as well as the Spark Job Server.  He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. He...

Monday August 17, 2015 11:00am - 11:40am
Track A
