
Apache Spark Performance Tuning – Straggler Tasks


The final installment in this Spark performance tuning series discusses detecting straggler tasks and principles for improving shuffle in our example app.
This is the last article of a four-part series about Apache Spark on YARN. Spark divides transformation operations into two types, “narrow” and “wide.” This distinction is important because it has strong implications for how transformations are evaluated and how their performance can be improved. Spark relies heavily on the key/value pair paradigm for defining and parallelizing operations, especially the wide transformations that require data to be redistributed between machines.
A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case with the YARN cluster manager. To understand the use case and the performance bottlenecks identified, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks. The resource planning bottleneck was addressed, and notable performance improvements were achieved in the use case Spark application, as discussed in our previous blog on Apache Spark on YARN – Resource Planning. To learn about partition tuning in the use case Spark application, refer to our previous blog on Apache Spark Performance Tuning – Degree of Parallelism.
In this blog, let us discuss the shuffle and straggler task problems in order to improve the performance of the use case application.
There are two primary techniques for avoiding the performance problems associated with shuffles: “shuffle less” and “shuffle better.”
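As an illustration of “shuffle better”: aggregating with a map-side combine (what Spark's reduceByKey does) moves far less data across the network than collecting every value per key (groupByKey). The sketch below is a plain-Python, Spark-free stand-in for the idea; the partition contents and key names are illustrative assumptions, not data from the use case:

```python
from collections import Counter

# Two "partitions" of (call_type, 1) pairs, standing in for RDD records.
partitions = [
    [("Medical", 1), ("Fire", 1), ("Medical", 1)],
    [("Medical", 1), ("Alarm", 1), ("Alarm", 1)],
]

# groupByKey-style: every record crosses the shuffle boundary.
shuffled_naive = [rec for part in partitions for rec in part]

# reduceByKey-style: combine within each partition first ("map-side
# combine"), so at most one record per key per partition is shuffled.
shuffled_combined = [
    item for part in partitions for item in Counter(k for k, _ in part).items()
]

# Final reduce after the (smaller) shuffle.
totals = Counter()
for key, count in shuffled_combined:
    totals[key] += count

print(len(shuffled_naive), len(shuffled_combined))  # 6 4
print(dict(totals))  # {'Medical': 3, 'Fire': 1, 'Alarm': 2}
```

Both paths produce the same totals, but the combined version shuffles 4 records instead of 6; on real data the gap grows with the number of duplicate keys per partition.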
Operations on key/value pairs can cause shuffles, because all records sharing a key must be brought to the same executor before they can be combined.
Memory errors in the driver are mainly caused by actions. The last three performance issues (out-of-memory errors on the executors, shuffles, and straggler tasks) are caused by the shuffles associated with wide transformations.
Tuning the number of partitions based on the input dataset size is explained in our previous blog on Apache Spark Performance Tuning – Degree of Parallelism. The DataFrame API implementation of the application, submitted with the following configuration, is shown in the below screenshot:
Looking at the Shuffle Read and Shuffle Write columns, the data shuffled across all stages is on the order of bytes and kilobytes (KB), in line with the “shuffle less” principle in our use case application.
The “Executors” tab in the Spark UI provides a summary of input, shuffle read, and shuffle write, as shown in the below diagram:
Internally, Spark does the following:
Performance can be improved by reducing the input size, that is, by filtering data out of the input datasets early, in both the low-level (RDD) and high-level (DataFrame) API implementations.
Our input dataset has 34 columns. Three columns were used for computation to answer the use case scenario questions.
The updated RDD and DataFrame API implementation code below improves performance by selecting only the data needed for this use case scenario:
The above line, added at the beginning of the RDD API implementation, selects the three needed columns and drops the other 31 from the RDD, reducing the input size in every shuffle stage.
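The exact line in the original post appears as a screenshot. As a rough sketch of such a projection, here is a plain-Python, Spark-free stand-in for the map step; the sample values and field positions are assumptions for illustration, not the dataset's actual schema:

```python
import csv
from io import StringIO

# Two raw CSV lines standing in for records of the 34-column fire-calls
# dataset (only four columns shown, values invented for illustration).
raw_lines = [
    "1,Medical Incident,01/02/2018,Tenderloin",
    "2,Structure Fire,01/03/2018,Mission",
]

def project(line):
    # Stand-in for the map() at the top of the RDD pipeline: parse the row
    # and keep only the three fields the computations need, so every
    # downstream shuffle moves 3 columns instead of 34.
    fields = next(csv.reader(StringIO(line)))
    return (fields[1], fields[2], fields[3])

projected = [project(line) for line in raw_lines]
print(projected[0])  # ('Medical Incident', '01/02/2018', 'Tenderloin')
```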
The below code does the same thing in the DataFrame API implementation:
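In the DataFrame API the same idea is a select() applied right after the read. A Spark-free stand-in using rows as dicts, where the column names are assumptions about the dataset's schema:

```python
# Rows as dicts, standing in for a DataFrame over the 34-column dataset;
# column names and values here are illustrative assumptions.
rows = [
    {"CallNumber": 1, "CallType": "Medical Incident",
     "CallDate": "01/02/2018", "Neighborhood": "Tenderloin"},
    {"CallNumber": 2, "CallType": "Structure Fire",
     "CallDate": "01/03/2018", "Neighborhood": "Mission"},
]

WANTED = ("CallType", "CallDate", "Neighborhood")

# Equivalent in spirit to df.select("CallType", "CallDate", "Neighborhood"):
# project down to the needed columns before any wide transformation runs.
selected = [{col: row[col] for col in WANTED} for row in rows]
print(selected[0])
```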
The code block of the RDD API implementation is given below:
The code block of the DataFrame API implementation is given below:
The Spark submit command with partition tuning used to execute the RDD and DataFrame API implementation in YARN is as follows:
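The exact command in the post appears as a screenshot; the sketch below shows only the general shape of such a command. The class name, jar path, and the specific executor and partition values are placeholders, not the post's actual figures:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 3 \
  --executor-memory 4g \
  --conf spark.default.parallelism=24 \
  --conf spark.sql.shuffle.partitions=24 \
  --class com.example.FireCallsApp \
  fire-calls-app.jar
```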
The execution durations after reducing the input size in the RDD and DataFrame API implementations are shown in the below diagram:
The performance durations (without any performance tuning) of the different API implementations of the use case Spark application running on YARN are shown in the below diagram:
For more details, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks.
We tuned the number of executors, cores, and memory for the RDD and DataFrame implementations of the use case Spark application. The below diagram shows the performance improvements after tuning the resources:
For more details, refer to our previous blog on Apache Spark on YARN – Resource Planning.
We tuned the default parallelism and shuffle partitions for both the RDD and DataFrame implementations in our previous blog on Apache Spark Performance Tuning – Degree of Parallelism. This did not improve performance, but it did reduce the scheduler overhead.
In this blog, we discussed the shuffle principles, examined the shuffling in the use case application, detected straggler tasks in the application, and reduced the input size to improve the performance of the different API implementations of the Spark application.
The code examples are available on GitHub.
