Home United States USA — software Getting Started With Batch Processing Using Apache Flink

Getting Started With Batch Processing Using Apache Flink

October 13, 2017

434

If you’ve been following software development news recently you probably heard about the new project called Apache Flink. I’ve already written about it a bit…
If you’ve been following software development news recently, you probably heard about a new project called Apache Flink. I’ve already written about it a bit here and here, but if you are not familiar with it, Apache Flink is a new-generation big data processing tool that can process either finite sets of data (this is also called batch processing) or potentially infinite streams of data (stream processing). In terms of new features, many believe Apache Flink is a game changer and can even replace Apache Spark in the future.
In this article, I’ll introduce you to how you can use Apache Flink to implement simple batch processing algorithms. We will start with setting up our development environment, and then we will see how we can load data, process a dataset, and write data back to an external system.
You might have heard that stream processing is « the new hot thing right now » and that Apache Flink is a tool for stream processing. This can pose a question: Why do we need to learn how to implement batch processing applications?
While it is true that stream processing has become more and more widespread, many tasks still require batch processing. Also, if you are just getting started with Apache Flink, in my opinion, it is better to start with batch processing since it is simpler and in a way resembles working with a database. Once you’ve covered batch processing, you can learn about stream processing where Apache Flink really shines!
If you want to implement some Apache Flink applications yourself, first you need to create a Flink project. In this article, we are going to write applications in Java, but you can also write Flink application in Scala, Python, or R.
To create a Flink Java project, execute the following command:
After you enter group id, artifact id, and a project version, this command will create the following project structure:
The most important here is the massive pom.xml that specifies all the necessary dependencies. Automatically created Java classes are examples of some simple Flink applications that you can take a look at, but we don’t need them for our purposes.
To start developing your first Flink application, create a class with the main method like this:
There is nothing special about this main method. All we have to do is to add some boilerplate code.
First, we need to create a Flink execution environment that will behave differently if you run it on a local machine or in a Flink cluster:
Alternatively, you could create a collection environment like this:
This will create a Flink execution environment that, instead of running Flink application on a local cluster, will emulate all operations using in-memory collections in a single Java process. Your application will run faster, but this environment some subtle differences from a local cluster with multiple nodes.
Before we can do anything, we need to read data into Apache Flink. We can read data from numerous systems, including local filesystem, S3, HDFS, HBase, Cassandra, etc. No matter where we read a dataset from, Apache Flink allows us to work with data in a uniform way using the DataSet class:
All items in a dataset should have the same type. The single generics parameter specifies a type of the data that is stored in a dataset.
To read data from a file, we can use the readTextFile method that will read lines in a file line by line and return a dataset of type String:
If you specify a file path like this, Flink will attempt to read a local file. If you want to read a file from HDFS, you need to specify the hdfs:// protocol:
Flink also has support for CSV files, but in this case, it won’t return a dataset of strings. It will try to parse every line and return a dataset of Tuple instances:
Tuple2 is a class that stores an immutable pair of two fields, but there are other classes like Tuple0, Tuple1, Tuple3, up to Tuple25 that store from zero to twenty-five fields. Later, we will see how to work with these classes.
The types method specifies types and number of columns in a CSV file, so Flink could read a parse them.
We can also create small datasets that are very good for small experiments and unit tests:
A question that you may ask is what data we can store in a DataSet? Not every Java type can be used in a dataset, and there are four different categories of types that you can use:
Now to the data processing part! How do you implement an algorithm for processing your data? To do this, you can use a number of operations that resemble Java 8 streams operations, such as:
Keep in mind that the biggest difference between Java streams and these operations is that Java 8 works with data in memory and can access local data, while Flink works with data on a cluster in a distributed environment.
Let’s take a look at a simple example that uses these operations. The following example is very straightforward. It creates a dataset of numbers, which squares every number and filters out all odd numbers.
If you have any experience with Java 8 you are probably wondering why I don’t use lambdas here. We can use lambdas here but it can cause some complications, as I’ve written here.
After we’ve finished processing our data it would make sense to save the result of our hard work. Flink can store data into a number of third-party systems such as HDFS, S3, Cassandra, etc.
For example, to write data to a file, we need to use the writeAsText method from the DataSet class:
For debugging/testing purposes Flink can write data to standard output or to standard output:
To implement some meaningful algorithms we need to first download a Grouplens movies dataset. It contains several CSV files with information about movies and movie ratings. We are going to work with the movies.csv file from this dataset which contains a list of all movies and looks like this:
It has three columns:
We can now load this CSV file in Apache Flink and perform some meaningful processing. Here we will load a file from a local filesystem, while in a realistic environment you would read a much bigger dataset and it would probably reside in a distributed system, such as S3 or HDFS.
In this demo let’s find all movies of the « Action » genre. Here is a code snippet that does this:
Let’s break it down. First, we read a CSV file using the readCsvFile method:
Using helper methods, we specify how to parse strings in the CSV file and that we need to skip the first line. In the last line, we specify a type of each column in the CSV file and Flink will parse data for us.