Common ETL transformation scenarios


ETL transformations include any change you make to your source data before storing it in your target, including conforming different source systems, filtering data, and adding new custom derived columns.

Parsing and Validation

This is done in a series of map(), flatMap(), filter(), groupByKey(), reduceByKey() and leftOuterJoin() transformations, which are reused across streaming and batch jobs.
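As a rough illustration of parsing and validation, the sketch below uses plain Python rather than Spark so it can run anywhere. The `user_id,amount` input format and the names `parse_line`, `is_valid` and `total_per_user` are assumptions made for this example only. The idea is that the same small functions could be handed to map() and filter() in either a streaming or a batch job.

```python
from collections import defaultdict
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, int]]:
    """Parse a hypothetical 'user_id,amount' line; return None if malformed."""
    parts = line.strip().split(",")
    if len(parts) != 2:
        return None
    user_id, raw_amount = parts[0].strip(), parts[1].strip()
    try:
        return user_id, int(raw_amount)
    except ValueError:
        return None


def is_valid(record: Optional[Tuple[str, int]]) -> bool:
    """Keep only parsed records with a non-empty key and a non-negative amount."""
    return record is not None and record[0] != "" and record[1] >= 0


def total_per_user(lines):
    """map -> filter -> reduce-by-key, written with plain Python built-ins."""
    parsed = map(parse_line, lines)        # analogous to rdd.map(parse_line)
    valid = filter(is_valid, parsed)       # analogous to rdd.filter(is_valid)
    totals = defaultdict(int)
    for user_id, amount in valid:          # analogous to reduceByKey(add)
        totals[user_id] += amount
    return dict(totals)


sample = ["u1,10", "u2,5", "bad line", "u1,7", "u3,-4", ",3"]
print(total_per_user(sample))
```

Because the parsing and validation rules live in ordinary functions, they can be unit tested without a cluster.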

Data Enrichment and state maintenance

Data enrichment and state maintenance are done with updateStateByKey() in streaming jobs and mapPartitions() in batch jobs. The business logic for merging sessions and enriching them with additional information is reused across both.
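A minimal sketch of how the session-merging logic might be shared is shown below, again in plain Python. The session fields (`start`, `end`, `pages`), the event shape `(timestamp, page)` and the `user_profiles` lookup are all hypothetical. The function takes the new events for a key plus the previous state, which is the general shape an update function for updateStateByKey() has; a batch job could call the same function once with all events and no prior state.

```python
def merge_session(events, state):
    """Fold new (timestamp, page) events into the previous session state.

    `state` is None the first time a key is seen.
    """
    session = dict(state) if state is not None else {"start": None, "end": None, "pages": 0}
    for ts, _page in events:
        session["start"] = ts if session["start"] is None else min(session["start"], ts)
        session["end"] = ts if session["end"] is None else max(session["end"], ts)
        session["pages"] += 1
    return session


def enrich(session, user_id, user_profiles):
    """Attach extra attributes from a hypothetical profile lookup table."""
    enriched = dict(session)
    enriched["country"] = user_profiles.get(user_id, {}).get("country", "unknown")
    return enriched


batch1 = [(100, "/home"), (130, "/cart")]
batch2 = [(90, "/landing"), (200, "/pay")]

# Streaming style: state is carried from one micro-batch to the next.
streaming_state = merge_session(batch2, merge_session(batch1, None))

# Batch style: all events for the key are processed in one pass.
batch_state = merge_session(batch1 + batch2, None)

profiles = {"u1": {"country": "IN"}}
print(enrich(streaming_state, "u1", profiles))
```

Both paths produce the same session here, which is the property that makes the logic safe to reuse.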

Data cleaning

Mapping NULL to 0

“Male” to “M” and “Female” to “F”

Date format consistency
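The three cleaning rules above could be sketched as small per-field functions. The field names (`gender`, `purchases`, `signup`), the list of accepted input date formats and the ISO `YYYY-MM-DD` target format are assumptions for illustration; a real pipeline would use whatever its sources actually emit.

```python
from datetime import datetime

GENDER_CODES = {"male": "M", "female": "F"}

# Assumed input formats, tried in order; ambiguous dates resolve to the first match.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def null_to_zero(value):
    """Map a missing numeric value to 0."""
    return 0 if value is None else value


def normalize_gender(value):
    """Map 'Male'/'Female' (any case) to 'M'/'F'; leave other values untouched."""
    if value is None:
        return None
    return GENDER_CODES.get(value.strip().lower(), value)


def normalize_date(value):
    """Rewrite a date string in any assumed input format as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


def clean_record(record):
    """Apply all three rules to one hypothetical record (a dict)."""
    return {
        "gender": normalize_gender(record["gender"]),
        "purchases": null_to_zero(record["purchases"]),
        "signup": normalize_date(record["signup"]),
    }


print(clean_record({"gender": "Female", "purchases": None, "signup": "25/12/2021"}))
```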

Running aggregates

You have two points (X1, Y1) and (X2, Y2), and you need to store the distance between them in your target.
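A small sketch of that derived column follows. It assumes the straight-line (Euclidean) distance is what is wanted; the source does not say which distance measure applies, and the column names `x1`, `y1`, `x2`, `y2` and `distance` are hypothetical.

```python
import math


def add_distance(row):
    """Return a copy of the row with a derived 'distance' column added."""
    derived = dict(row)
    # Euclidean distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
    derived["distance"] = math.hypot(row["x2"] - row["x1"], row["y2"] - row["y1"])
    return derived


rows = [
    {"x1": 0, "y1": 0, "x2": 3, "y2": 4},
    {"x1": 1, "y1": 1, "x2": 1, "y2": 1},
]
for out in map(add_distance, rows):
    print(out)
```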

Data enrichment

Feature generation

Corroboration

Deduplication

Incremental loads, updates, and delta delivery

Data skewness

Salting and hashing
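One common way to handle a skewed key is to salt it: append a small random suffix so that records for a hot key spread over several sub-keys, aggregate per salted key, then strip the salt and aggregate again. The sketch below shows that two-stage idea in plain Python. `NUM_SALTS`, the `#` separator and the function names are assumptions for this example, and keys are assumed not to contain `#`.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # assumed number of sub-keys per original key


def add_salt(key, rng):
    """Turn 'hot' into one of 'hot#0' .. 'hot#3' at random."""
    return f"{key}#{rng.randrange(NUM_SALTS)}"


def strip_salt(salted_key):
    """Recover the original key from a salted key."""
    return salted_key.rsplit("#", 1)[0]


def salted_sum(pairs, seed=0):
    """Sum values per key in two stages so no single key carries all the work."""
    rng = random.Random(seed)

    # Stage 1: partial sums per salted key (this is the stage that would be
    # spread across partitions in a distributed job).
    partial = defaultdict(int)
    for key, value in pairs:
        partial[add_salt(key, rng)] += value

    # Stage 2: remove the salt and combine the much smaller set of partials.
    final = defaultdict(int)
    for salted_key, subtotal in partial.items():
        final[strip_salt(salted_key)] += subtotal
    return dict(final)


pairs = [("hot", 1)] * 1000 + [("cold", 2)] * 3
print(salted_sum(pairs))
```

The final totals match a direct sum per key; only the intermediate grouping changes.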

Metric Generation and Save to DB

Metrics are generated using flatMap() and reduceByKey() after sessionization and data enrichment, and are then saved to the DB.
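The metric step could look roughly like the sketch below. Each enriched session emits several (key, value) pairs, as a flatMap() function would, the pairs are summed per key as reduceByKey() would do, and the totals are written to a table. The metric names, the session fields and the use of an in-memory SQLite database are assumptions for illustration; the source does not say which database or schema is used.

```python
import sqlite3
from collections import defaultdict
from itertools import chain


def session_metrics(session):
    """flatMap-style function: one session yields several (key, value) pairs."""
    return [
        (("sessions", session["country"]), 1),
        (("page_views", session["country"]), session["pages"]),
    ]


def reduce_by_key(pairs):
    """Sum values per key, mirroring reduceByKey with addition."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def save_metrics(conn, totals):
    """Upsert the aggregated metrics into a hypothetical 'metrics' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "name TEXT, country TEXT, value INTEGER, PRIMARY KEY (name, country))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO metrics VALUES (?, ?, ?)",
        [(name, country, value) for (name, country), value in totals.items()],
    )
    conn.commit()


sessions = [
    {"country": "IN", "pages": 3},
    {"country": "IN", "pages": 2},
    {"country": "US", "pages": 5},
]
totals = reduce_by_key(chain.from_iterable(map(session_metrics, sessions)))

conn = sqlite3.connect(":memory:")
save_metrics(conn, totals)
saved = conn.execute("SELECT name, country, value FROM metrics ORDER BY name, country").fetchall()
print(saved)
```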