
Running Hadoop on a Raspberry Pi 2 cluster


Last week I wrote about a 300-node cluster built from Raspberry Pi (RPi) microcomputers. But can you do useful work on such a low-cost, low-power cluster? Yes, you can. Hadoop runs on massive clusters, but you can also run it on your own, highly scalable RPi cluster.
I've been involved with cluster computing ever since DEC introduced VAXclusters in 1984. In those days, a three-node VAXcluster cost about $1 million. Today you can build a much more powerful cluster for under $1,000, including far more storage than anyone could afford back then.
Hadoop is the open-source counterpart of Google's MapReduce and the Google File System (GFS), widely used for large data-crunching applications. It uses a shared-nothing architecture, which means that as you add cluster nodes, performance scales up smoothly.
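To make that concrete, here's what the storage side looks like to a program: files written into HDFS (the GFS analog) are split into blocks and spread, with replicas, across the cluster's nodes, so computation can run where the data already sits. The sketch below is illustrative rather than anything from the paper; the path is assumed, and the client simply asks HDFS which hosts hold each block of a freshly written file.

```java
// Minimal sketch (not from the paper): write a file into HDFS and ask where its
// blocks landed. Block placement across DataNodes is what makes shared-nothing
// scaling work -- each added node brings its own disk and CPU.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // HDFS client

        Path p = new Path("/demo/sample.txt");      // hypothetical path
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.writeUTF("hello from the cluster");
        }

        FileStatus status = fs.getFileStatus(p);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            // Each block reports the DataNodes (cluster members) holding a replica.
            System.out.println(String.join(",", b.getHosts()));
        }
    }
}
```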
In the paper Performance of a Low Cost Hadoop Cluster for Image Analysis, researchers Basit Qureshi, Yasir Javed, Anis Koubaa, Mohamed-Foued Sriti, and Maram Alajlan built a 20-node RPi Model 2 cluster, brought up Hadoop on it, and used it for surveillance drone image analysis. They also benchmarked the RPi cluster against a 4-node PC cluster based on 3GHz Intel i7 CPUs, each with 4GB of RAM.
The 20-node cluster was divided into four 5-node subnets, each attached to a 16-port switch that is, in turn, networked to a managed 24-port core switch. The extra switch ports make it easy to expand the cluster.
Each 700MHz RPi B runs Raspbian, an ARM-optimized version of Debian Linux. Each RPi has a Class 10, 16GB SD card capable of read/write speeds up to 80MB/s. An image of the OS with Hadoop 2.6.2 was copied onto the SD cards. The Hadoop master node, which runs only the NameNode, was installed on a PC running Ubuntu 14.04.
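For a sense of how the nodes are wired together in software: every RPi has to know where the NameNode (and, presumably, the YARN resource manager) lives. In a real cluster these settings sit in core-site.xml and related files baked into the SD card image; the snippet below just expresses them programmatically for illustration, and the hostname, port, and replication factor are assumptions, not values from the paper.

```java
// Illustrative only: the properties every worker node needs in order to find
// the master PC. In the cluster itself these would live in core-site.xml and
// yarn-site.xml on the SD card image, not in application code.
// "hadoop-master", port 9000, and replication 3 are assumed values.
import org.apache.hadoop.conf.Configuration;

public class ClusterWiring {
    public static Configuration clusterConf() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop-master:9000");      // NameNode on the Ubuntu PC
        conf.set("yarn.resourcemanager.hostname", "hadoop-master"); // assumed: ResourceManager co-located on the master
        conf.set("dfs.replication", "3");                           // HDFS replicas per block
        return conf;
    }
}
```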
You’d expect a cluster of 64-bit, 3GHz x86 CPUs to be much faster than 700MHz, 32-bit ARM CPUs, and you’d be right. The team ran a series of tests that were a) compute-intensive (calculating pi), b) I/O-intensive (document word counts), and c) both (large image file pixel counts).
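The paper's benchmark code isn't reproduced here, but the I/O-intensive word count workload is the canonical Hadoop job, and a Hadoop 2.x version looks roughly like this. This is a sketch close to the stock example that ships with Hadoop, not the authors' code.

```java
// A minimal Hadoop 2.x word-count job, close to the stock example that ships
// with Hadoop; the paper's exact benchmark code is not shown here.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine locally to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The compute-intensive pi test has a stock counterpart too (a quasi-Monte Carlo estimator in the same Hadoop examples jar), though whether the authors used the stock programs or their own variants isn't stated in this article.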
The word count results, taken from a figure in the paper, tell the story: in general, the x86 cluster was 10 to 20 times faster. However, the ability to put a Hadoop cluster in a backpack with a battery opens up possibilities for powerful edge computing, such as the drone video pre-processing the authors explore in their paper. And today we have the RPi Model 3, whose processor runs at almost double the clock speed of the RPi the researchers tested.
Mobile edge clusters aren’t a thing today, but they will be, because our ability to gather data at the edge is growing much faster than network bandwidth to the edge. We’ll have to pre-process IoT data, for example, to compact it for network transmission.
When will they be economically viable? Three things have to happen first.
All three will happen in the next five years. Then backpack clusters will be capable of real work out in the wild.
Courteous comments welcome, of course.
