“Azure Data Engineering: Build once, scale endlessly.”
Azure Data Engineering refers to designing, building, and maintaining data pipelines and analytics solutions on Microsoft Azure.
What an Azure Data Engineer Does
An Azure Data Engineer focuses on:
Ingesting data from multiple sources
Transforming and cleaning data
Storing data efficiently
Enabling analytics, reporting, and machine learning
Core Azure Data Engineering Services
1. Data Ingestion
Azure Data Factory (ADF) – ETL/ELT pipelines, scheduling, orchestration
Azure Event Hubs – Real-time streaming ingestion
Azure IoT Hub – IoT data ingestion
2. Data Storage
Azure Data Lake Storage Gen2 (ADLS) – Central data lake (raw, curated, processed data)
Azure Blob Storage – Object storage
Azure SQL Database / Managed Instance – Relational data
3. Data Processing
Azure Databricks – Big data processing with Spark (batch + streaming)
Azure Synapse Analytics
Spark Pools (big data)
Dedicated SQL Pools (data warehousing)
Serverless SQL Pools (query data lake directly)
4. Analytics & Reporting
Power BI – Dashboards and reports
Synapse SQL – Analytical queries
5. Orchestration & DevOps
Azure Data Factory – Pipeline orchestration
Azure DevOps / GitHub – CI/CD for data pipelines
ARM / Bicep / Terraform – Infrastructure as Code
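The raw/curated/processed layering mentioned under ADLS above is usually implemented as nothing more than a folder-naming convention in the lake. A minimal sketch of one such convention (zone and dataset names here are illustrative, not an Azure standard):

```python
from datetime import date

# Hypothetical medallion-style folder convention for an ADLS Gen2 data lake.
# The zone names mirror the raw/curated/processed layers described above.
ZONES = ("raw", "curated", "processed")

def lake_path(zone: str, dataset: str, run_date: date) -> str:
    """Build a date-partitioned folder path for one zone of the lake."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (f"{zone}/{dataset}/year={run_date.year}"
            f"/month={run_date.month:02d}/day={run_date.day:02d}")

print(lake_path("raw", "sales", date(2024, 5, 1)))
# raw/sales/year=2024/month=05/day=01
```

Date-partitioned paths like this are what make incremental loads and serverless SQL queries over the lake cheap: each run reads or writes only its own partition.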

1. What is Azure Data Factory?
Azure Data Factory (ADF) is a cloud-based data integration service used to ingest, orchestrate, and transform data from multiple sources into analytical systems such as Azure Data Lake Storage (ADLS), Azure Synapse Analytics, or Azure SQL Database.
Key idea:
ADF is an orchestration tool, not a database or analytics engine.
ADF supports:
• Batch data movement
• Data transformation
• Workflow orchestration
• Hybrid (cloud + on-prem) data integration
2. Core ADF Architecture Components
2.1 Pipelines
A pipeline is a logical container for activities.
Defines workflow logic
Does NOT process data itself
Can be triggered manually or automatically
📌 Pipelines answer the question: “What should run, and in what order?”
2.2 Activities
An activity performs a task inside a pipeline.
Main activity categories:
Data Movement:
• Copy Activity (most common)
• Moves data between sources and sinks
Data Transformation:
• Mapping Data Flow (Spark-based)
• Databricks Notebook
• Stored Procedure
• Azure Function
Control Flow:
• If Condition
• ForEach
• Switch
• Until
• Wait
📌 Rule: Retries, timeouts, and fault tolerance are configured at the activity level, not pipeline level.
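The pipeline/activity split above is visible in the JSON that ADF stores behind the authoring UI. A hand-written sketch of a minimal pipeline definition (all names hypothetical), shown as a Python dict; note that the retry and timeout policy sits on the Copy activity, not on the pipeline:

```python
import json

# Illustrative ADF pipeline definition: the pipeline is only a container
# for activities; fault tolerance is configured per activity via "policy".
pipeline = {
    "name": "pl_ingest_sales",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopySalesToLake",
                "type": "Copy",
                "policy": {  # retries/timeouts live here, at the activity level
                    "retry": 3,
                    "retryIntervalInSeconds": 30,
                    "timeout": "0.01:00:00",
                },
                "inputs": [{"referenceName": "ds_sql_sales", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "ds_adls_sales", "type": "DatasetReference"}],
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The pipeline object itself carries no retry settings at all, which is exactly the rule stated above.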
2.3 Linked Services
Linked Services define connection information.
They contain:
• Endpoint
• Authentication method
• Credentials (via Key Vault)
Examples:
• Azure Data Lake Gen2
• Azure SQL Database
• Synapse Analytics
• Blob Storage
• REST APIs
🚫 Linked Services do NOT define schema or structure.
2.4 Datasets
Datasets describe the structure of data within a data store.
They define:
• File format (CSV, Parquet, JSON)
• Folder path or table
• Schema (optional)
📌 One Linked Service → many Datasets
📌 Datasets are reusable
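The split described above (connection details in the Linked Service, structure in the Dataset) shows up directly in the JSON definitions. A sketch with one Linked Service reused by two Datasets, illustrating "one Linked Service → many Datasets" (account, folder, and object names are all hypothetical):

```python
# One Linked Service: connection only (endpoint + auth), no schema or paths.
linked_service = {
    "name": "ls_adls",
    "properties": {
        "type": "AzureBlobFS",  # ADLS Gen2 connector type
        "typeProperties": {
            "url": "https://mylake.dfs.core.windows.net"  # hypothetical account
        },
    },
}

def adls_dataset(name: str, folder: str, fmt: str) -> dict:
    """Dataset: references the shared Linked Service, adds path + format."""
    return {
        "name": name,
        "properties": {
            "linkedServiceName": {"referenceName": "ls_adls",
                                  "type": "LinkedServiceReference"},
            "type": fmt,  # e.g. "Parquet" or "DelimitedText"
            "typeProperties": {
                "location": {"type": "AzureBlobFSLocation", "folderPath": folder}
            },
        },
    }

# Two datasets, both pointing at the same Linked Service.
datasets = [
    adls_dataset("ds_raw_sales", "raw/sales", "Parquet"),
    adls_dataset("ds_raw_customers", "raw/customers", "DelimitedText"),
]
```

Notice the Linked Service knows nothing about folders or formats, and each Dataset knows nothing about credentials.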
3. Triggers (Scheduling Pipelines)
Triggers determine when pipelines run.
3.1 Schedule Trigger
• Time-based
• Simple scheduling
• No backfill
3.2 Tumbling Window Trigger ⭐
• Fixed, non-overlapping intervals
• Supports retry, dependency, and backfill
• Ideal for incremental loads
📌 Keyword match:
Incremental + reliability → Tumbling Window
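The tumbling-window behaviour above (fixed, non-overlapping intervals that can be replayed from a past start date for backfill) can be sketched without any Azure dependency:

```python
from datetime import datetime, timedelta

def tumbling_windows(start: datetime, end: datetime, interval: timedelta):
    """Yield fixed, non-overlapping (window_start, window_end) pairs.

    Backfill is just running this from a start date in the past:
    each window is produced exactly once, in order, with no gaps.
    """
    ws = start
    while ws + interval <= end:
        yield ws, ws + interval
        ws += interval  # next window starts where the previous one ended

windows = list(tumbling_windows(datetime(2024, 1, 1),
                                datetime(2024, 1, 2),
                                timedelta(hours=6)))
# 4 windows: 00:00-06:00, 06:00-12:00, 12:00-18:00, 18:00-24:00
```

Each window's start/end is what ADF passes to the pipeline as parameters, which is why this trigger pairs naturally with incremental loads: the pipeline processes exactly one slice per run.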
3.3 Event-Based Trigger
• Runs pipeline on Blob create/delete
• Used for micro-batch ingestion
• Not streaming
4. Data Movement with Copy Activity
The Copy Activity is the backbone of ADF.
Capabilities:
• High-performance ingestion
• Supports filtering and partitioning
• Fault tolerance
• Incremental loading
• Supports many source/sink combinations
Common scenarios:
• Load data from SQL → ADLS
• Load REST API → Data Lake
• Load files → Synapse
📌 Rule:
Simple ingestion = Copy Activity, not Data Flow.
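The incremental-loading pattern the Copy Activity is typically used for follows a watermark scheme: read the last-loaded timestamp, copy only newer rows, then advance the watermark. A minimal in-memory sketch of that logic (table, column, and function names are illustrative stand-ins for the watermark table and source query):

```python
from datetime import datetime

# Hypothetical stand-ins for a watermark table and a source table.
watermark = {"sales": datetime(2024, 1, 1)}

source_rows = [
    {"id": 1, "modified": datetime(2023, 12, 31)},  # already loaded
    {"id": 2, "modified": datetime(2024, 1, 15)},   # new since watermark
    {"id": 3, "modified": datetime(2024, 2, 1)},    # new since watermark
]

def incremental_copy(table: str, rows: list) -> list:
    """Copy only rows newer than the stored watermark, then advance it."""
    last = watermark[table]
    new_rows = [r for r in rows if r["modified"] > last]
    if new_rows:
        # Advance the watermark to the newest row just copied.
        watermark[table] = max(r["modified"] for r in new_rows)
    return new_rows

loaded = incremental_copy("sales", source_rows)
# loaded contains ids 2 and 3; watermark advances to 2024-02-01
```

In ADF the same idea is expressed declaratively: a Lookup activity reads the watermark, the Copy Activity's source query filters on it, and a final activity writes the new high-water mark back.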