AWS Data Ingestion
Data Ingestion in AWS (Batch + Streaming)
Why Data Ingestion Matters
Before you can analyze anything, you must bring data into your system:
● From databases, APIs, logs, apps, IoT devices
● Into a data lake (usually S3) or warehouse
There are two main ingestion types:
1. Batch Ingestion (Scheduled / Bulk)
What it is:
● Data is collected and processed in chunks (hourly, daily, etc.)
When to use:
● Historical data
● Periodic reports
● Large datasets
Key Service: AWS Glue
What it does:
● Extract, Transform, Load (ETL)
● Reads from sources → transforms → writes to S3/Redshift
Features:
● Serverless
● Runs Apache Spark jobs (PySpark or Scala)
● Built-in data catalog (the Glue Data Catalog)
Use case:
● Daily pipeline: RDS → S3 → Redshift
Think: scheduled ETL engine
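A scheduled run of a daily pipeline like this can be started from Python. The following is a minimal sketch, not a definitive implementation. It assumes a Glue job already exists. The job name, bucket name, S3 layout, and the `--run_date` / `--target_path` argument names are all hypothetical. `glue_client` is expected to be a boto3 Glue client (`boto3.client("glue")`), whose `start_job_run` call takes `JobName` and `Arguments`.

```python
from datetime import date


def build_job_arguments(run_date: date, bucket: str) -> dict:
    """Build Glue job arguments for one daily run.

    Glue job arguments are strings keyed with a leading "--".
    The argument names and S3 layout here are hypothetical; the job
    script decides which arguments it actually reads.
    """
    partition = f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}"
    return {
        "--run_date": run_date.isoformat(),
        "--target_path": f"s3://{bucket}/raw/orders/{partition}/",
    }


def start_daily_run(glue_client, job_name: str, run_date: date, bucket: str) -> str:
    """Start the Glue job for `run_date` and return the job run ID."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments=build_job_arguments(run_date, bucket),
    )
    return response["JobRunId"]
```

In practice a scheduler (for example a Glue trigger or Step Functions) would usually make this call rather than a person.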
Legacy/Orchestration: AWS Data Pipeline
What it does:
● Moves data between AWS services on a schedule
Note:
● Largely replaced by Glue / Step Functions; AWS has put it in maintenance mode and no longer offers it to new customers
Think: older scheduling tool
2. Streaming Ingestion (Real-Time)
What it is:
● Data is processed continuously as it arrives
When to use:
● Real-time analytics
● Monitoring systems
● Fraud detection
1. Amazon Kinesis
What it does:
● Collects and processes real-time data streams
Components:
● Kinesis Data Streams → ingestion
● Amazon Data Firehose (formerly Kinesis Data Firehose) → auto-load to S3/Redshift
Use cases:
● Log streaming
● Clickstream data
● IoT data
Think: AWS-native real-time pipeline
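As an illustration of the ingestion side, here is a hedged sketch of a producer writing events to Kinesis Data Streams. It assumes `kinesis_client` is a boto3 Kinesis client (`boto3.client("kinesis")`). The stream name and event fields are hypothetical. `PutRecords` accepts at most 500 records per request, so the events are sent in batches.

```python
import json
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_kinesis_records(events: Iterable[dict], key_field: str) -> List[dict]:
    """Convert events into the record shape PutRecords expects."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Records with the same partition key go to the same shard.
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]


def chunked(records: List[dict], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send_events(kinesis_client, stream_name: str, events: Iterable[dict], key_field: str) -> int:
    """Send events in batches and return how many records the service rejected."""
    failed = 0
    for batch in chunked(to_kinesis_records(events, key_field)):
        response = kinesis_client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A production producer would also retry the rejected records; that step is left out here to keep the sketch short.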
2. Amazon MSK
What it is:
● Amazon Managed Streaming for Apache Kafka: a managed Apache Kafka service
Why use it:
● If your system already uses Kafka
● More control than Kinesis (topic and broker configuration, Kafka ecosystem tooling)
Use cases:
● Large-scale event streaming
● Enterprise pipelines
Think: Kafka on AWS without managing servers
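Because MSK exposes standard Kafka, existing producer code carries over largely unchanged. The sketch below assumes a `producer` object with `send(topic, key=..., value=...)` and `flush()` methods, which is the interface kafka-python's `KafkaProducer` provides. The topic name and event fields are hypothetical, and connecting to the cluster (bootstrap brokers, TLS or IAM authentication) is deliberately omitted.

```python
import json


def publish_events(producer, topic: str, events: list, key_field: str) -> int:
    """Publish events keyed by `key_field` and return the number sent.

    With Kafka's default partitioner, events that share a key land on
    the same partition, so their relative order is preserved.
    """
    for event in events:
        producer.send(
            topic,
            key=str(event[key_field]).encode("utf-8"),
            value=json.dumps(event).encode("utf-8"),
        )
    producer.flush()  # block until buffered messages have been sent
    return len(events)
```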
Batch vs Streaming (Quick Comparison)
Feature     Batch           Streaming
--------------------------------------------------
Speed       Delayed         Real-time
Data size   Large chunks    Continuous small events
Tools       Glue            Kinesis / MSK
Use case    Reports, ETL    Live dashboards
Real Data Engineering Flow
Batch:
1. Data from DB → Glue job
2. Store → S3
3. Query → Athena / Redshift
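The query step of the batch flow can be sketched as follows, assuming the Glue job wrote data partitioned by `year`, `month`, and `day`. The table, database, and output location are hypothetical. `athena_client` is expected to be a boto3 Athena client (`boto3.client("athena")`), whose `start_query_execution` call takes `QueryString`, `QueryExecutionContext`, and `ResultConfiguration`.

```python
from datetime import date


def build_partition_query(table: str, run_date: date) -> str:
    """Build a row-count query restricted to one daily partition.

    `table` is interpolated directly into the SQL, so it must come
    from trusted configuration, never from user input.
    """
    return (
        f"SELECT COUNT(*) AS row_count FROM {table} "
        f"WHERE year = '{run_date:%Y}' AND month = '{run_date:%m}' "
        f"AND day = '{run_date:%d}'"
    )


def start_partition_query(athena_client, database: str, table: str,
                          run_date: date, output_location: str) -> str:
    """Submit the query to Athena and return the query execution ID."""
    response = athena_client.start_query_execution(
        QueryString=build_partition_query(table, run_date),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Filtering on the partition columns lets Athena scan only that day's files rather than the whole table.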
Streaming:
1. App events → Kinesis / MSK
2. Process → Lambda / consumers
3. Store → S3 / Redshift
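The processing step of the streaming flow might look like the following minimal sketch of a Lambda function triggered by a Kinesis stream. Lambda delivers each record's payload base64-encoded under `record["kinesis"]["data"]`. The assumption that every payload is a JSON object is specific to this example.

```python
import base64
import json


def decode_kinesis_event(event: dict) -> list:
    """Decode the base64 payload of every record in a Kinesis-triggered event."""
    decoded = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))  # assumes JSON payloads
    return decoded


def handler(event, context):
    """Lambda entry point: decode the batch and report how many events it held."""
    events = decode_kinesis_event(event)
    # A real consumer would transform the events here and write them
    # to S3 or Redshift; that step is omitted in this sketch.
    return {"processed": len(events)}
```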
Summary:
● Glue = batch ETL (most important)
● Kinesis = real-time streaming (AWS-native)
● MSK = Kafka-based streaming
● Data Pipeline = older scheduling tool
Rule of thumb:
● Use batch for cost efficiency
● Use streaming for real-time insights