
Building a Versatile Analytics Pipeline on Top of Apache Spark

Updated on November 2, 2020 · Data · General Engineering

Grammarly, like most growing companies, strives to make data-driven decisions. That means that we need a reliable way to collect, analyze, and query data about our users. We started out using third-party tools like Mixpanel to handle our analytics needs, but soon our needs surpassed the capabilities of those tools. For example, we wanted to control the pre-aggregation and enrichment of data, to generate reports that were more customized, and to have higher confidence in the accuracy of data. So we developed our own in-house analytics engine and application on top of Apache Spark. Recently, I gave a talk at the Spark Summit sharing some of our learnings along the way. The talk covered:

  • Outputting data to several storage systems in a single Spark job
  • Dealing with the Spark memory model and building a custom spillable data structure for data traversal
  • Implementing a custom query language with parser combinators on top of the Spark SQL parser
  • Building a custom query optimizer and analyzer
  • Storing data with flexible schemas and querying multi-schema data with schema conflicts
  • Writing custom aggregation functions in Spark SQL
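
To give a flavor of the spillable-data-structure idea from the list above: the core trick is to keep a bounded in-memory buffer and move overflow to disk, then iterate over both transparently. Here is a minimal, hypothetical Python sketch of that pattern (our actual implementation is in Scala inside Spark, and the names below are illustrative only):

```python
import pickle
import tempfile

class SpillableList:
    """Append-only list that spills its buffer to a temp file once it
    exceeds max_in_memory items. A sketch of the spilling pattern only."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.buffer = []
        self.spill_file = None

    def append(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Serialize the full in-memory buffer to disk and clear it.
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        pickle.dump(self.buffer, self.spill_file)
        self.buffer = []

    def __iter__(self):
        # Replay spilled batches in order, then the live buffer.
        if self.spill_file is not None:
            self.spill_file.seek(0)
            while True:
                try:
                    batch = pickle.load(self.spill_file)
                except EOFError:
                    break
                yield from batch
        yield from self.buffer

items = SpillableList(max_in_memory=3)
for i in range(10):
    items.append(i)
assert list(items) == list(range(10))
```

Spark itself uses similar spillable collections internally; building your own is mainly about choosing when to spill relative to the memory Spark thinks it owns.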
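
The parser-combinator approach mentioned above composes a grammar out of small parsers, each consuming a prefix of the input and handing the rest to the next one. A toy Python sketch of the idea, parsing a hypothetical `FUNNEL(...)` construct (our real implementation is Scala parser combinators layered over the Spark SQL parser; everything here is illustrative):

```python
import re

# A parser is a function: text -> (value, rest) on success, or None.

def token(pattern):
    """Parser for a single regex token, skipping leading whitespace."""
    regex = re.compile(r"\s*(" + pattern + r")")
    def parse(text):
        m = regex.match(text)
        return (m.group(1), text[m.end():]) if m else None
    return parse

def seq(*parsers):
    """Run parsers in order, collecting all their results."""
    def parse(text):
        results = []
        for p in parsers:
            out = p(text)
            if out is None:
                return None
            value, text = out
            results.append(value)
        return (results, text)
    return parse

ident = token(r"[A-Za-z_]+")
lparen, rparen, comma = token(r"\("), token(r"\)"), token(r",")

def funnel(text):
    """Parse FUNNEL(step, step, ...) into a list of step names."""
    out = seq(token(r"FUNNEL"), lparen, ident)(text)
    if out is None:
        return None
    (_, _, first), rest = out
    steps = [first]
    while True:
        more = seq(comma, ident)(rest)
        if more is None:
            break
        (_, step), rest = more
        steps.append(step)
    out = rparen(rest)
    return (steps, out[1]) if out else None

print(funnel("FUNNEL(signup, activate, purchase)")[0])
# ['signup', 'activate', 'purchase']
```

Because each combinator is just a function, custom constructs can fall back to the underlying SQL parser for anything the extension grammar doesn't recognize.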
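
Custom aggregation functions in Spark SQL follow a zero/update/merge/evaluate contract: each partition aggregates independently with `update`, partial results are combined with `merge`, and `evaluate` produces the final value. A plain-Python sketch of that contract (not the Spark API itself; names are illustrative):

```python
class Average:
    """Sketch of the accumulator contract Spark SQL aggregators follow."""

    def zero(self):
        return (0.0, 0)  # (running sum, count)

    def update(self, acc, value):
        s, n = acc
        return (s + value, n + 1)

    def merge(self, a, b):
        # Combine partial accumulators from two partitions.
        return (a[0] + b[0], a[1] + b[1])

    def evaluate(self, acc):
        s, n = acc
        return s / n if n else None

agg = Average()
# Simulate two partitions aggregated independently, then merged.
left = agg.zero()
for v in (1.0, 2.0):
    left = agg.update(left, v)
right = agg.zero()
for v in (3.0, 4.0):
    right = agg.update(right, v)
print(agg.evaluate(agg.merge(left, right)))  # 2.5
```

The key design constraint is that `merge` must be associative and commutative, since Spark gives no guarantees about the order in which partition results arrive.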


Here is the video of the talk:

Check out the slides as well:
