
Building Open Source Google Analytics from Scratch


This tutorial shows you how to build your own open source analytics and dashboard platform using AWS Lambda and Cube.js.
From an engineering standpoint, the technology behind Google Analytics was pretty sophisticated when it was created. Custom, tailor-made algorithms were implemented for event collection, sampling, aggregation, and storing output for reporting purposes. Back then, shipping such a piece of software required years of engineering time. The big data landscape has changed drastically since then. In this tutorial, we're going to rebuild an entire Google Analytics pipeline, covering everything from data collection to reporting. By using the most recent big data technology available, we'll see how simple it is to reproduce such software nowadays.
Here’s an analytics dashboard with an embedded tracking code that collects data about its visitors while visualizing it at the same time.
Check out the source code on GitHub. Give it a star if you like it!
If you’re familiar with Google Analytics, you probably already know that every web page tracked by GA contains a GA tracking code. It loads an async script that assigns a tracking cookie to a user if it isn’t set yet. It also sends an XHR for every user interaction, like a page load. These XHR requests are then processed and raw event data is stored and scheduled for aggregation processing. Depending on the total amount of incoming requests, the data will also be sampled.
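The two jobs described above — assigning a visitor id once and reporting each interaction — can be sketched in a few lines. This is a minimal illustration, not GA's actual tracker; the cookie name `uid`, the field names, and the `/collect` endpoint are placeholders:

```javascript
// Sketch of a tracker's core logic. The cookie name "uid" and the
// payload field names are illustrative placeholders, not what GA uses.
function getOrCreateVisitorId(cookie) {
  // Parse an existing "uid=..." pair out of the cookie string, if present
  const match = /(?:^|;\s*)uid=([^;]+)/.exec(cookie || "");
  if (match) return match[1];
  // Otherwise mint a random id to be stored in a new cookie
  return Math.random().toString(36).slice(2);
}

function buildPageViewPayload(id, url) {
  return JSON.stringify({ anonymousId: id, event: "pageview", url });
}

// In a browser, this would run on page load (placeholder endpoint):
//   const id = getOrCreateVisitorId(document.cookie);
//   document.cookie = "uid=" + id + "; path=/; max-age=31536000";
//   const xhr = new XMLHttpRequest();
//   xhr.open("POST", "/collect");
//   xhr.setRequestHeader("Content-Type", "application/json");
//   xhr.send(buildPageViewPayload(id, location.href));
```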
Even though this is a high-level overview of Google Analytics essentials, it’s enough to reproduce most of the functionality. Let me show you how.
There are numerous ways of implementing a backend. We’ll take the serverless route because the most important thing about web analytics is scalability. In this case, your event processing pipeline scales in proportion to the load. Just as Google Analytics does.
We’ll stick with Amazon Web Services for this tutorial. Google Cloud Platform can also be used as they have pretty similar products. Here’s a sample architecture of the web analytics backend we’re going to build.
For the sake of simplicity, we're only going to collect page view events. The journey of a page view event begins in the visitor's browser, where an XHR request to an API Gateway is initiated. The request event is then passed to Lambda, where event data is processed and written to a Kinesis Data Stream. Kinesis Data Firehose uses the Kinesis Data Stream as input and writes processed Parquet files to S3. Athena is used to query the Parquet files directly from S3. Cube.js will generate SQL analytics queries and provide an API for viewing the analytics in a browser.
This seems very complex at first, but component decomposition is key. It allows us to build scalable and reliable systems. Let’s start implementing the data collection.
To deploy the data collection backend, we'll use the Serverless Application Framework. It lets you develop serverless applications with minimal code dependencies on cloud providers. Before we start, please ensure Node.js is installed on your machine. Also, if you don't have an AWS account yet, you'll need to sign up for free, then install and configure the AWS CLI.
To install the Serverless Framework CLI, run:
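The original command isn't reproduced above, but a standard global install via npm (which ships with Node.js) looks like this:

```shell
# Install the Serverless Framework CLI globally via npm
npm install -g serverless

# Confirm the CLI is available
serverless --version
```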
Now create the event-collection service from a Node.js template:
This will scaffold the entire directory structure. Let’s cd to the created directory and add the aws-sdk dependency:
Install the yarn package manager if you don’t have it:
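One common way to install yarn, via npm:

```shell
# Install the yarn package manager globally via npm
npm install -g yarn

# Confirm yarn is available
yarn --version
```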
We’ll need to update handler.js with this snippet:
As you can see, the only thing this simple function does is write a record into a Kinesis Data Stream named event-collection.
