Tuesday, August 18 • 4:20pm - 5:00pm
A Gentle Introduction to Apache Spark and Locality Sensitive Hashing

Apache Spark is an increasingly popular Big Data computation platform that lets developers be more productive than they are with MapReduce. But what about writing programs that run fast? In this talk, we will see how to find approximate nearest neighbours in a web log quickly with hashing. We will also use this journey as a reason to study and apply the basic tenets of writing a fast Spark program.

Locality-sensitive hashing is used in cases that suffer from “the curse of dimensionality” (where feature dimensions are intractably many). It helps find approximate nearest neighbours by hashing examples in a clever way, so that similar examples are likely to land in the same bucket. We will see how to use it to find close user behaviours in a web log efficiently by exploiting the parallelism offered by Spark.

Along the way, we will encounter the landmark notions of how to write efficient Spark programs in Scala, including partition-specific commands, variable capture avoidance, early filters, sparse shuffling, and broadcast variables. After this talk, attendees will be acquainted with a technique that is very easy to parallelise and has many other uses (e.g. de-duplication), and will have a couple more techniques in their grab bag for removing the bottlenecks in their Spark programs.
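To make the idea of "hashing examples in a clever way" concrete, here is a minimal sketch of one common flavour of locality-sensitive hashing: MinHash signatures with banding. It is not taken from the talk, which uses Spark and Scala; it is plain, single-machine Python using only the standard library. The function names, the parameters `num_hashes` and `bands`, and the sample web log are all hypothetical, and each user is assumed to be represented by a non-empty set of visited paths.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item`; each seed gives a different hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=32):
    """One minimum per hash function; similar sets tend to agree on many positions."""
    # Assumes `items` is non-empty (min() would fail on an empty set).
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def candidate_pairs(sets_by_user, num_hashes=32, bands=16):
    """Return user pairs whose signatures agree on at least one whole band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for user, items in sets_by_user.items():
        sig = minhash_signature(items, num_hashes)
        for b in range(bands):
            # Users sharing the same slice of the signature fall in the same bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(user)
    pairs = set()
    for users in buckets.values():
        for u, v in combinations(sorted(users), 2):
            pairs.add((u, v))
    return pairs


# Hypothetical web log: user -> set of visited paths.
logs = {
    "alice": {"/home", "/cart", "/checkout"},
    "bob": {"/home", "/cart", "/checkout"},
    "carol": {"/blog/1", "/blog/2", "/about"},
}
print(candidate_pairs(logs))
```

Only users that share a bucket are compared, which avoids scoring every pair. Each user's signature is computed independently of the others, which is what makes the technique easy to parallelise; in a Spark setting the bucketing step would be expressed as a keyed grouping, but that translation is left out of this sketch.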

Speakers
François Garillot

François Garillot joined Typesafe in 2012 after an early stint in research, where he spoke frequently at international conferences. He is now working in Typesafe's Spark team, leveraging his Scala knowledge to improve Spark's support for scalable machine learning and data science applications. Based in Lausanne, he speaks at Swiss conferences and Scala user groups in Lyon and Paris. He recently spoke at Strata Hadoop Barcelona on how to...


Tuesday August 18, 2015 4:20pm - 5:00pm
Track A
