Pointers for beginners to learn Apache Spark

3 minute read

Distributed Systems

  1. A Thorough Introduction to Distributed Systems

Official Apache Spark guide

  1. RDD Programming Guide

Web resources, ebooks, gitbooks and tutorials for Apache Spark

  1. A nice, brief gitbook on running Spark from a USB stick in local mode: How to light your ‘Spark on a stick’
  2. SparkSQL Getting Started
  3. Running Spark App In Standalone Cluster Mode
  4. Spark Recipes
  5. HOW TO SETUP APACHE SPARK STANDALONE CLUSTER ON MULTIPLE MACHINE
  6. how spark internally executes a program

Papers published on Apache Spark

  1. Spark SQL: Relational Data Processing in Spark
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Spark Architecture

  1. Spark Architecture
  2. Spark Misconceptions
  3. Spark Architecture: Shuffle
  4. RDDs: Building blocks of Spark
  5. Spark DataFrames

Spark topics to be aware of when building efficient data pipelines

Spark Execution modes

  1. Spark Master UI
  2. Spark on Yarn
  3. Spark standalone cluster tutorial
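
The execution mode is ultimately selected by the master URL the application is started with. Below is a minimal sketch in Scala; the host name and app name are placeholders, not values from the resources above.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the master URL decides where executors run.
val spark = SparkSession.builder()
  .appName("execution-mode-demo")
  // Local mode: everything in a single JVM, using all available cores.
  .master("local[*]")
  // Standalone cluster mode would instead use something like:
  //   .master("spark://master-host:7077")
  // On YARN, the master is simply "yarn" and is usually passed via
  //   spark-submit --master yarn --deploy-mode cluster
  .getOrCreate()

println(spark.sparkContext.master)
spark.stop()
```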

Spark dynamic allocation

  1. Smart Resource Utilization With Spark Dynamic Allocation
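
A hedged sketch of the dynamic allocation settings discussed in the article above; the executor counts and idle timeout are placeholders, not recommendations for your workload.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  // Let Spark grow and shrink the executor pool based on pending tasks.
  .config("spark.dynamicAllocation.enabled", "true")
  // The external shuffle service is typically required on YARN so that
  // shuffle files survive when an idle executor is removed.
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()
```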

Spark Speculative tasks

  1. Speculative execution in Spark
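
Speculation is switched on through configuration. A minimal sketch follows; the multiplier and quantile shown are Spark's documented defaults, included only to make the knobs visible.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-demo")
  // Re-launch a copy of tasks that are running much slower than their peers.
  .config("spark.speculation", "true")
  // A task is considered slow if it runs 1.5x longer than the median task...
  .config("spark.speculation.multiplier", "1.5")
  // ...and speculation only kicks in after 75% of the stage's tasks finish.
  .config("spark.speculation.quantile", "0.75")
  .getOrCreate()
```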

Spark Partitioning

  1. An Intro to Apache Spark Partitioning: What You Need to Know
  2. Spark Under The Hood : Partition

Shuffling in Apache Spark

  1. A brief Coursera lecture on shuffling in Apache Spark: Shuffling: What it is and why it’s important
  2. Another good article on shuffle by Cloudera: Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle
  3. An in-depth explanation of Spark shuffle: Apache Spark Shuffles Explained In Depth
  4. You Won’t Believe How Spark Shuffling Will Probably Bite You (Also Windowing)
  5. A video on shuffle by Yandex on Coursera: Shuffle. Where to send data?

Tuning Apache Spark for performance

  1. Official Spark configuration page (version 2.3.x): Spark Configuration
  2. How-to: Tune Your Apache Spark Jobs (Part 1)
  3. How-to: Tune Your Apache Spark Jobs (Part 2)
  4. Spark performance tuning from the trenches
  5. Tune your Spark jobs (Part 2)
  6. One operation and maintenance

Commonly occurring errors and issues in Apache Spark

  1. Some Lessons of Spark and Memory Issues on EMR

Most common Apache Spark mistakes and gotchas

  1. Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Spark input splits work the same way as Hadoop input splits; Spark uses the same underlying Hadoop InputFormat APIs. For partitions, Spark by default creates one partition per HDFS block. For example, if you have a 1 GB file and your HDFS block size is 128 MB, the file occupies 8 HDFS blocks and Spark will create 8 partitions by default. If you need a further split within a partition, it is done on line boundaries.
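
A quick way to see this is to read a file and check the partition count. The HDFS path and block size below are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-count-demo").getOrCreate()

// Assuming a 1 GB file stored with a 128 MB HDFS block size,
// the RDD normally comes back with one partition per block.
val rdd = spark.sparkContext.textFile("hdfs:///data/events-1gb.txt")
println(s"Default partitions: ${rdd.getNumPartitions}")   // expected: 8
```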

On ingest, Spark relies on HDFS settings to determine the splits: block size maps 1:1 to RDD partitions. Spark then gives you fine-grained control over the number of partitions at run time. Transformations like repartition, coalesce, and repartitionAndSortWithinPartitions give you direct control over the number of partitions being computed. Used correctly, they can greatly improve the efficiency of a Spark job.
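
A short sketch of those three transformations, continuing the example above; the path, key extraction and partition counts are placeholders.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-demo").getOrCreate()
val rdd = spark.sparkContext.textFile("hdfs:///data/events-1gb.txt")

// repartition: full shuffle; use it to increase or rebalance parallelism.
val widened = rdd.repartition(200)

// coalesce: shrinks the partition count without a full shuffle, handy before
// writing output so you don't produce hundreds of tiny files.
val narrowed = widened.coalesce(20)

// repartitionAndSortWithinPartitions: for key-value RDDs, shuffle and sort
// each partition in a single pass, cheaper than repartition followed by sortBy.
val pairs  = rdd.map(line => (line.take(8), line))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(50))
```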

When reading compressed file formats from disk, Spark partitioning also depends on whether the format is splittable. For instance, bzip2, Snappy, and LZO (if indexed) are splittable, while gzip is not.
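
You can see the effect directly; the file paths below are assumptions. A large gzip file ends up in a single partition, while the same data compressed with bzip2 splits into roughly one partition per block.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("splittable-demo").getOrCreate()

val gz  = spark.sparkContext.textFile("hdfs:///data/logs-1gb.txt.gz")
val bz2 = spark.sparkContext.textFile("hdfs:///data/logs-1gb.txt.bz2")

println(s"gzip partitions:  ${gz.getNumPartitions}")   // typically 1
println(s"bzip2 partitions: ${bz2.getNumPartitions}")  // roughly one per block
```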

Running Apache Spark on EMR

Apache Spark on EMR with S3 as the storage layer is one of the best combinations for executing ETL tasks in the cloud these days. Running Spark on EMR takes away the hassle of setting up and administering a Spark/Hadoop cluster, and it also comes with auto scaling.

A regular Spark execution on EMR looks like this:

  1. Spawn a new EMR cluster sized for the resources your job requires.
  2. Select the latest Spark version and other tools like Hive, Zeppelin and Ganglia.
  3. Pass the necessary configurations for Spark and YARN, which are loaded during the bootstrap process.
  4. Once the cluster is up, run your Spark applications using Step execution, AWS Lambda or spark-submit.
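
A minimal sketch of the kind of job you might run as an EMR step. The S3 paths, class name and filter expression are placeholders; on EMR the jar is usually submitted with something like `spark-submit --master yarn --deploy-mode cluster --class com.example.EtlJob s3://my-bucket/jars/etl-job.jar`.

```scala
import org.apache.spark.sql.SparkSession

object EtlJob {
  def main(args: Array[String]): Unit = {
    // The master is supplied by spark-submit / YARN on the cluster,
    // so only the application name is set here.
    val spark = SparkSession.builder()
      .appName("etl-job")
      .getOrCreate()

    // Read raw events from S3, keep one event type, write curated output back.
    val df = spark.read.json("s3://my-bucket/raw/events/")
    df.filter("eventType = 'purchase'")
      .write.mode("overwrite")
      .parquet("s3://my-bucket/curated/purchases/")

    spark.stop()
  }
}
```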

  1. Discusses shuffle and task memory spill on EMR: Tuning My Apache Spark Data Processing Cluster on Amazon EMR
  2. Tuning Spark Jobs on EMR with YARN - Lessons Learnt
  3. Setting spark.speculation in Spark 2.1.0 while writing to s3

Apache Spark best practices

  1. Spark Best Practices
  2. Lessons From the Field: Applying Best Practices to Your Apache Spark Applications
  3. Spark Best Practices
  4. Best Practices for Spark Programming - Part I