Tuesday, August 18 • 10:20am - 10:50am
Developing Spark SQL Integration for MongoDB using Spark's External Datasource API


The external data sources API introduced with Apache Spark 1.2.0 provides a clean and systematic way to integrate a wide range of external database systems with Spark SQL. MongoDB presents an interesting challenge for such integration because its data model, based on JSON, involves no prescriptive schema and is aggressively non-rectangular. This presentation covers the integration issues from the viewpoint of a "Spark outsider": a moderately competent Scala programmer who is an early adopter of the external data source API but not familiar with Spark internals. The target audience is developers who need to integrate the database of their choice as quickly and practically as possible.

Topics:
• The external data source API (including significant enhancements coming in Spark 1.3.0)
• The SchemaRDD mechanism (to become DataFrame in Spark 1.3.0)
• MongoDB, its data model, and its Scala API (Casbah)
• The implementation approach, including efficient schema inference, filter and projection push-down, and data partitioning
• Examples of querying MongoDB through Spark SQL, HiveQL, and the DataFrame API

Many of the Scala code samples will be based on the NSMC project: https://github.com/spirom/spark-mongodb-connector

(This probably makes sense as either a full-length talk or a tutorial, although a half-length talk could provide some value. A lightning talk is unlikely to make sense.)
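One way to picture the schema-inference topic above: because MongoDB documents carry no prescriptive schema, a connector must sample documents and derive a relational schema from their union. The sketch below models documents as plain Scala Maps and widens conflicting field types; all names and the widening rules are illustrative assumptions, not NSMC's actual implementation.

```scala
// Minimal sketch of schema inference over schemaless, non-rectangular
// documents, modeled here as Scala Maps. Names and rules are
// illustrative; this is not NSMC's actual implementation.
object SchemaInference {
  // Widen two observed field types to a common type; fall back to String.
  def widen(a: String, b: String): String =
    if (a == b) a
    else if (Set(a, b) == Set("Int", "Double")) "Double"
    else "String"

  // Name a value's type; everything non-numeric is treated as String here.
  def typeOf(value: Any): String = value match {
    case _: Int    => "Int"
    case _: Double => "Double"
    case _         => "String"
  }

  // Fold over a sample of documents, taking the union of their fields
  // and widening any conflicting field types. Fields absent from some
  // documents would become nullable columns in a real connector.
  def infer(docs: Seq[Map[String, Any]]): Map[String, String] =
    docs.foldLeft(Map.empty[String, String]) { (schema, doc) =>
      doc.foldLeft(schema) { case (acc, (field, value)) =>
        val t = typeOf(value)
        acc.updated(field, acc.get(field).map(widen(_, t)).getOrElse(t))
      }
    }
}
```

For example, inferring over `Seq(Map("name" -> "a", "age" -> 30), Map("name" -> "b", "age" -> 2.5, "city" -> "x"))` yields a schema where `age` is widened to `Double` and `city` appears even though only one document has it.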
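The filter push-down topic can be sketched in a similar self-contained way: Spark hands the data source a set of simple filters, and the connector translates the ones it understands into a MongoDB-style query document. The `Filter` case classes below stand in for the ones in `org.apache.spark.sql.sources`; the translation is a simplified assumption, not the connector's real code.

```scala
// Sketch of filter push-down: translating Spark-style filters into a
// MongoDB-style query document. These case classes stand in for the
// org.apache.spark.sql.sources filters; the mapping is illustrative.
sealed trait Filter
case class EqualTo(attribute: String, value: Any)     extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter
case class LessThan(attribute: String, value: Any)    extends Filter

object FilterPushdown {
  // Each supported filter becomes one clause of the query document.
  // A real connector would report unsupported filters back to Spark
  // so they can be re-applied after the scan.
  def toMongoQuery(filters: Seq[Filter]): Map[String, Any] =
    filters.map {
      case EqualTo(attr, v)     => attr -> v
      case GreaterThan(attr, v) => attr -> Map("$gt" -> v)
      case LessThan(attr, v)    => attr -> Map("$lt" -> v)
    }.toMap
}
```

Evaluating these predicates inside MongoDB, rather than in Spark after a full scan, is what makes push-down worthwhile: only matching documents cross the wire.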

Speakers

Spiro Michaylov

Development Manager, Tableau Software
Spiro Michaylov is a development manager in the data platform organization at Tableau Software in Kirkland, Washington. He has been working in distributed systems and big data for almost twenty years, developing compilers for parallel scientific computing, high frequency trading infrastructure in the securities industry, and parts of the ETL and data integration platform at Ab Initio Software. He designed several of the enterprise DBMS features...


Tuesday August 18, 2015 10:20am - 10:50am
Track B
