Pointers for beginners to learn Apache Spark

I’ve tried to collect good links on Apache Spark and its architecture here. I’ll update this list regularly as I come across new articles on Apache Spark.

Distributed Systems

  1. A Thorough Introduction to Distributed Systems
  2. Why I love databases
  3. How Sharding Works

Hadoop-Map Reduce paradigm

  1. Shuffle Operation in Hadoop and Spark

Official Apache Spark guide

  1. RDD Programming Guide

Web resources, gitbooks and tutorials for Apache Spark

  1. **A nice brief gitbook on running Spark from a USB stick in local mode:** How to light your ‘Spark on a stick’
  2. SparkSQL Getting Started
  3. Running Spark App In Standalone Cluster Mode
  4. Spark Recipes
  5. how to setup apache spark standalone cluster on multiple machine
  6. how spark internally executes a program

Papers published on Apache Spark

  1. Spark SQL: Relational Data Processing in Spark
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
  3. Optimizing Shuffle Performance in Spark

Spark Architecture

  1. Spark Architecture
  2. Spark Misconceptions
  3. Spark Architecture: Shuffle
  4. RDD’s : Building block of Spark
  5. Spark DataFrames

Spark topics to be aware of when building efficient data pipelines

Spark Execution

  1. Spark Execution - plans
  2. Spark on Yarn
  3. Spark standalone cluster tutorial
  4. Spark Master UI

Spark dynamic allocation

  1. Smart Resource Utilization With Spark Dynamic Allocation

Spark Speculative tasks

  1. Speculative execution in Spark

Persisting and Checkpointing in Apache Spark

  1. Apache Spark Caching Vs Checkpointing

Spark Serialization

  1. Serialization in Spark

External shuffle service

  1. External shuffle service in Apache Spark

Spark Partitioning

  1. An Intro to Apache Spark Partitioning: What You Need to Know
  2. Spark Under The Hood : Partition
  3. Partitioning in Spark
  4. Partitioning internals in Spark

Shuffling in Apache Spark

  1. All about Shuffling
  2. Another good article on shuffle by Cloudera : Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle
  3. In-depth explanation on Spark shuffle : Apache Spark Shuffles Explained In Depth
  4. You Won’t Believe How Spark Shuffling Will Probably Bite You (Also Windowing)
  5. A video on shuffle by Yandex on Coursera : Shuffle. Where to send data?
  6. A brief Coursera lecture on shuffling in Apache Spark : Shuffling: What it is and why it’s important

Tuning Apache Spark for performance

  1. Official Spark configuration page, version 2.3.x : Spark Configuration
  2. How-to: Tune Your Apache Spark Jobs (Part 1)
  3. How-to: Tune Your Apache Spark Jobs (Part 2)
  4. Spark performance tuning from the trenches
  5. Tune your Spark (Part 2) jobs
  6. One operation and maintenance

Spark SQL

  1. Introducing Window Functions in Spark SQL

Spark Operations

  1. Union, reduce, map Operations

Commonly occurring errors and issues in Apache Spark

  1. Some Lessons of Spark and Memory Issues on EMR

Most common Apache Spark mistakes and gotchas

  1. Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Spark input splits work the same way as Hadoop input splits, because Spark uses the same underlying Hadoop InputFormat APIs. By default, Spark creates one partition for each HDFS block. For example, if you have a 1 GB file and your HDFS block size is 128 MB, the file occupies 8 HDFS blocks and Spark creates 8 partitions by default. If you want to split further within a partition, the split is made on line boundaries.
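The arithmetic in the example above can be sketched in a few lines of plain Python. This is only a back-of-the-envelope estimate, not Spark code: the function name `estimate_input_partitions` is made up for illustration, and it assumes exactly one partition per HDFS block, ignoring split-size settings and the small slack Hadoop allows on the last split.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimate_input_partitions(file_size_bytes, block_size_bytes=128 * MIB):
    """Estimate default input partitions as one per HDFS block.

    Hypothetical helper: assumes a strict 1:1 mapping between HDFS blocks
    and partitions, which is a simplification of what Hadoop's
    InputFormat actually does.
    """
    if file_size_bytes <= 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: a partially filled last block still counts as a block.
    return -(-file_size_bytes // block_size_bytes)


# The example from the text: a 1 GB file with 128 MB blocks.
print(estimate_input_partitions(1 * GIB, 128 * MIB))  # 8
```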

On ingest, Spark relies on HDFS settings to determine the splits based on block size, and each split maps 1:1 to an RDD partition. However, Spark then gives you fine-grained control over the number of partitions at run time. Transformations such as repartition, coalesce, and repartitionAndSortWithinPartitions give you direct control over the number of partitions being computed. Used correctly, these transformations can greatly improve the efficiency of a Spark job.
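To make the difference between coalesce and repartition a little more concrete, here is a toy model in plain Python. It is not the Spark API and not how Spark implements these transformations. It only assumes the commonly described behaviour: coalesce merges whole existing partitions without a full shuffle (so it cannot increase the partition count), while repartition redistributes every record. Partitions are modelled as lists of records, and the function names simply mirror the Spark ones.

```python
def coalesce(partitions, n):
    """Toy model: merge whole neighbouring partitions into at most n groups.

    No record leaves the group its original partition was assigned to,
    which mimics avoiding a full shuffle. The count cannot grow.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    n = min(n, len(partitions))
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Toy model: redistribute every record round-robin (a full shuffle)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out


# Four skewed partitions holding nine records in total.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

print([len(p) for p in coalesce(skewed, 2)])     # [7, 2]  cheap, but still skewed
print([len(p) for p in repartition(skewed, 2)])  # [5, 4]  balanced, but every record moved
print(len(coalesce(skewed, 8)))                  # 4       coalesce cannot add partitions
```

In PySpark the corresponding calls on an RDD are `rdd.coalesce(n)` and `rdd.repartition(n)`. The trade-off the toy model hints at is the usual one: coalesce is cheaper when reducing the partition count, while repartition pays for a shuffle to get evenly sized partitions.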

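When reading compressed file formats from disk, Spark partitioning depends on whether the format is splittable. For instance, bzip2 and LZO (if indexed) are splittable, while gzip is not, so a gzip file is read as a single partition regardless of its size. Snappy is splittable only when it is used inside a container format such as SequenceFile, Avro, or Parquet.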
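As a rough illustration of what splittability means for the partition count, the sketch below combines the codec rule with the one-partition-per-block estimate. It is plain Python, not Spark code. The helper name `estimate_partitions`, the extension table, and the example file names are all made up for illustration, and files with an unknown extension are assumed to be uncompressed and therefore splittable.

```python
import os

GIB = 1024 ** 3
MIB = 1024 ** 2

# Only the codecs with a clear-cut answer are listed here (an illustrative table).
SPLITTABLE_BY_EXTENSION = {".bz2": True, ".gz": False}


def estimate_partitions(path, file_size_bytes, block_size_bytes=128 * MIB,
                        lzo_indexed=False):
    """Estimate input partitions for a single file (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".lzo":
        # LZO is splittable only when an index has been built for the file.
        splittable = lzo_indexed
    else:
        # Assumption: anything not in the table is treated as uncompressed.
        splittable = SPLITTABLE_BY_EXTENSION.get(ext, True)
    if not splittable:
        # A non-splittable file has to be read by a single task.
        return 1
    return max(1, -(-file_size_bytes // block_size_bytes))


print(estimate_partitions("events.json.gz", 1 * GIB))   # 1
print(estimate_partitions("events.json.bz2", 1 * GIB))  # 8
print(estimate_partitions("events.lzo", 1 * GIB, lzo_indexed=True))  # 8
```

The practical consequence is that a large gzip file gives you no read parallelism, so it usually needs a repartition right after loading.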

Running Apache Spark on EMR

Apache Spark on EMR with S3 as the storage layer is one of the best combinations for running your ETL tasks in the cloud these days. Running Spark on EMR takes away the hassle of setting up and administering a Spark/Hadoop cluster. It also comes with an auto scaling feature.

A regular Spark execution on EMR looks like this:

Spawn a new EMR cluster sized for the resources your job requires. Select the latest Spark version and other tools such as Hive, Zeppelin, and Ganglia. Pass the Spark and YARN configurations that need to be loaded during the bootstrap process. Once the cluster is up, run your Spark applications using Step execution, AWS Lambda, or spark-submit.
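For the Step execution route, a step is essentially a spark-submit command wrapped in a small JSON-like structure. The sketch below builds one such definition in plain Python. The helper name `build_spark_step`, the step name, the bucket, and the script path are all hypothetical, and the chosen `ActionOnFailure` value and `--deploy-mode cluster` are just one reasonable setup, not a recommendation from any of the linked articles.

```python
def build_spark_step(name, app_path, app_args=(), spark_conf=None):
    """Build an EMR step definition that runs spark-submit (hypothetical helper)."""
    args = ["spark-submit", "--deploy-mode", "cluster"]
    # Each Spark setting becomes a "--conf key=value" pair on the command line.
    for key, value in sorted((spark_conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    args.extend(app_args)
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        # command-runner.jar is how EMR runs commands such as spark-submit.
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


step = build_spark_step(
    "nightly-etl",                  # hypothetical step name
    "s3://my-bucket/jobs/etl.py",   # hypothetical application location
    app_args=["--date", "2018-01-01"],
    spark_conf={"spark.dynamicAllocation.enabled": "true"},
)
print(step["HadoopJarStep"]["Args"])
```

With boto3, a definition like this can be submitted to a running cluster through `boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`, where `cluster_id` is the ID of your own cluster.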

  1. Discusses shuffle and task memory spill on EMR : Tuning My Apache Spark Data Processing Cluster on Amazon EMR
  2. Tuning Spark Jobs on EMR with YARN - Lessons Learnt
  3. Setting spark.speculation in Spark 2.1.0 while writing to s3
  4. Submitting User Applications with spark-submit

Apache Spark best practices

  1. Spark Best Practices
  2. Lessons From the Field: Applying Best Practices to Your Apache Spark Applications
  3. Spark Best Practices
  4. Spark best practices
  5. Best Practices for Spark Programming - Part I
  6. Apache Spark - Best Practices and Tuning

StackOverflow questions on Apache Spark

  1. Spark: driver/worker configuration. Does driver run on Master node?
  2. Resolving dependency problems in Apache Spark