Building a Versatile Analytics Pipeline on Top of Apache Spark

Updated on November 2, 2020
Data · General Engineering

Grammarly, like most growing companies, strives to make data-driven decisions. That means we need a reliable way to collect, analyze, and query data about our users. We started out using third-party tools like Mixpanel to handle our analytics needs, but our needs soon outgrew those tools: we wanted to control the pre-aggregation and enrichment of data, generate more customized reports, and have higher confidence in the accuracy of our data. So we developed our own in-house analytics engine and application on top of Apache Spark. Recently, I gave a talk at Spark Summit sharing some of what we learned along the way. The talk covered:

  • Outputting data to several storage systems in a single Spark job
  • Working with the Spark memory model and building a custom spillable data structure for data traversal
  • Implementing a custom query language with parser combinators on top of the Spark SQL parser
  • Building a custom query optimizer and analyzer
  • Storing flexible-schema data and querying across multiple schemas with schema conflicts
  • Implementing custom aggregation functions in Spark SQL
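To give a flavor of the spillable-data-structure idea: a buffer that holds records in memory up to a threshold, then spills chunks to disk and can still be traversed in order. This is a minimal hypothetical Python sketch of the general pattern, not Grammarly's actual implementation (which lives inside Spark's JVM memory model).

```python
import os
import pickle
import tempfile

class SpillableBuffer:
    """Append-only buffer that spills chunks to disk past a memory threshold.

    Hypothetical illustration of the spillable-structure pattern; the real
    implementation would hook into Spark's memory manager.
    """

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self._memory = []        # current in-memory chunk
        self._spill_files = []   # paths of chunks already spilled to disk

    def append(self, item):
        self._memory.append(item)
        if len(self._memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Serialize the current chunk to a temp file and start a fresh one.
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(self._memory, f)
        f.close()
        self._spill_files.append(f.name)
        self._memory = []

    def __iter__(self):
        # Traverse spilled chunks first (in spill order), then the live chunk.
        for path in self._spill_files:
            with open(path, "rb") as f:
                yield from pickle.load(f)
        yield from self._memory

    def close(self):
        for path in self._spill_files:
            os.remove(path)
        self._spill_files = []
```

The key property is that iteration order matches append order regardless of how many chunks ended up on disk.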
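On the parser-combinator point: the idea is to build small parsers for primitive tokens and compose them into larger grammar rules. The talk's implementation sits on top of the Spark SQL parser in Scala; below is a deliberately tiny, hypothetical Python sketch of the combinator style itself, where each parser returns either `(value, remaining_input)` or `None`.

```python
import re

# Each parser is a function: str -> (parsed_value, rest_of_input) | None.

def token(expected):
    """Parser for a fixed literal, skipping leading whitespace."""
    def parse(s):
        s = s.lstrip()
        if s.startswith(expected):
            return expected, s[len(expected):]
        return None
    return parse

def regex(pattern):
    """Parser for a regular expression, skipping leading whitespace."""
    compiled = re.compile(pattern)
    def parse(s):
        s = s.lstrip()
        m = compiled.match(s)
        if m:
            return m.group(0), s[m.end():]
        return None
    return parse

def seq(*parsers):
    """Combinator: run parsers in sequence; fail if any sub-parser fails."""
    def parse(s):
        results = []
        for p in parsers:
            r = p(s)
            if r is None:
                return None
            value, s = r
            results.append(value)
        return results, s
    return parse

# Compose primitives into a rule, e.g. a simple `name = number` setting.
ident = regex(r"[A-Za-z_]\w*")
number = regex(r"\d+")
assignment = seq(ident, token("="), number)
```

A custom query language extends this same composition idea with many more rules, delegating anything SQL-shaped to the underlying Spark SQL parser.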
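Custom aggregation functions in Spark SQL follow a zero/reduce/merge/finish contract: initialize a buffer, fold values into it per partition, merge partial buffers across partitions, then compute the final result. Here is a hypothetical Python rendering of that contract (Spark's actual API is the Scala/Java `Aggregator`), using a mean aggregator as the example.

```python
class MeanAggregator:
    """Mean via the zero/reduce/merge/finish contract used by Spark SQL
    custom aggregators (hypothetical Python sketch of the pattern)."""

    def zero(self):
        return (0.0, 0)                      # (running sum, count)

    def reduce(self, buf, value):
        return (buf[0] + value, buf[1] + 1)  # fold one value into a buffer

    def merge(self, a, b):
        return (a[0] + b[0], a[1] + b[1])    # combine two partial buffers

    def finish(self, buf):
        return buf[0] / buf[1] if buf[1] else None

def aggregate(partitions, agg):
    """Mimic distributed execution: reduce within each partition,
    then merge the partial buffers and finish."""
    partials = []
    for part in partitions:
        buf = agg.zero()
        for v in part:
            buf = agg.reduce(buf, v)
        partials.append(buf)
    result = agg.zero()
    for p in partials:
        result = agg.merge(result, p)
    return agg.finish(result)
```

The split between `reduce` (within a partition) and `merge` (across partitions) is what lets the engine run the aggregation in parallel without shuffling raw rows.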


Here is the video of the talk:

Check out the slides as well:
