+============================================================+
| |
| AZURE ANALYTICS FUNDAMENTALS - PART 2 |
| |
| Data Lakes - Data Warehouses - Modern Architecture |
| |
| Powered by HitaVir Tech |
+============================================================+
Welcome to Fundamentals of Analytics on Azure Cloud Platform - Part 2 by HitaVir Tech!
In Part 1 you built the mental model: analytics concepts, the 5 Vs, and the Azure services that solve each V. In Part 2 you will zoom out and learn how those services combine into production architectures used by real companies today.
Part 1: The Ingredients | Part 2: The Recipe |
What is analytics / ML? | What is a data warehouse? |
The 5 Vs diagnostic | What is a data lake? |
One service per V | How they combine into a Lakehouse |
ADLS -> Synapse Serverless mini lab | Reference architectures for 6 real use cases |
Pillar | Topics |
Data Lakes | What, why, zones, governance on Azure |
Data Warehouses | Columnar MPP, star schemas, Synapse Dedicated |
Modern Data Architecture | Lakehouse + Microsoft Fabric |
Azure Services | ADLS Gen2, Synapse, Databricks, Purview, ADF, Event Hubs, Power BI |
Reference Architectures | Batch BI, Streaming, ML, Log analytics, 360° customer, Data mesh |
Common Use Cases | When to pick which pattern |
+================================================================+
| |
| Services are Lego bricks. Architecture is the castle. |
| |
| Any junior can spin up ADLS + Synapse. |
| Seniors know WHEN to use which, WHY, and HOW they join. |
| |
+================================================================+
2-3 hours (concept-heavy, no new hands-on required; uses the Part 1 lab as the anchor)
If you are... | Do this |
A student new to cloud | Read top-to-bottom; pause at each reference architecture |
A working engineer | Skim sections 1-3, deep-read the reference architectures for patterns you ship |
A solution architect | Use the reference diagrams as whiteboard starters with stakeholders |
A reference reader | Jump to the Quiz, the Cheat Sheet, and the Appendix of Resources |
HitaVir Tech says: "Services change names every few years: HDInsight became Databricks, SQL DW became Synapse, Synapse is becoming Fabric. But the shapes of data architectures stay stable for decades. Master the shapes. You will pick up the services in a week."
Required
Helpful
+--------------------------------------------------------------+
| 5 Vs framework Azure service toolkit |
| ------------------- ------------------------ |
| Volume ADLS Gen2, Synapse, Databricks |
| Variety ADF, Synapse, AI Vision, Doc Intel.|
| Velocity Event Hubs, Stream Analytics, Fn |
| Veracity ADF flows, Purview, Defender |
| Value Power BI, Azure ML, OpenAI |
+--------------------------------------------------------------+
In Part 2 we compose these services into proven shapes.
Part 2 is concept-heavy. Every diagram is annotated with the services you already met in Part 1. The Part 1 hands-on lab is the practical anchor; this codelab teaches the architectures that scale it up.
Warning: if you choose to experiment with Synapse Dedicated SQL Pools or Databricks, they can burn through free-tier credits quickly. Use serverless modes and delete resource groups the same day.
+==============================================================+
| SECTION 1 - ARCHITECTURES (the big three) |
+==============================================================+
Three architecture patterns underpin most modern analytics in production:
+----------------------+
| 1. DATA LAKE |
| Store anything, |
| cheap and forever |
+----------+-----------+
|
| grew alongside
v
+----------------------+
| 2. DATA WAREHOUSE |
| Fast SQL on |
| curated tables |
+----------+-----------+
|
| combined into
v
+----------------------+
| 3. LAKEHOUSE |
| (Modern Data Arch.) |
| Best of both |
+----------------------+
Part 2 tours each architecture, shows the Azure services that implement it, then demonstrates how real companies blend all three for different use cases.
+==============================================================+
| ARCHITECTURE 1 - DATA LAKE |
| "Store first, schema later." |
+==============================================================+
A data lake is a centralized repository that stores any type of data (structured, semi-structured, unstructured) at any scale, in its native format, typically on cheap object storage like Azure Data Lake Storage Gen2.
The defining move: you ingest now and decide the schema later (called schema-on-read). Contrast with warehouses, which demand schema-on-write.
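The difference between schema-on-read and schema-on-write can be sketched in a few lines of plain Python. This is a minimal illustration, not how ADLS or Synapse work internally; the sample records, field names, and both helper functions are hypothetical.

```python
import json

# Raw lines as they might land in the lake: no schema was enforced at ingest.
RAW_LINES = [
    '{"order_id": 1, "amount": "19.99", "country": "IN"}',
    '{"order_id": 2, "amount": 5, "coupon": "WELCOME"}',
    'not even json',
]

def read_with_schema(lines):
    """Schema-on-read: types are applied only when the data is queried."""
    rows, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
            rows.append({
                "order_id": int(record["order_id"]),
                "amount": float(record["amount"]),
                "country": record.get("country", "UNKNOWN"),
            })
        except (ValueError, KeyError, TypeError):
            rejected.append(line)  # the raw line still exists in the lake
    return rows, rejected

def write_with_schema(table, line):
    """Schema-on-write: a bad record is rejected before it is stored."""
    record = json.loads(line)  # raises ValueError on malformed input
    if not isinstance(record.get("amount"), (int, float)):
        raise ValueError("amount must be numeric at load time")
    table.append(record)

rows, rejected = read_with_schema(RAW_LINES)
```

The lake accepted all three lines and sorted them out at query time; the warehouse-style loader would have refused the first and third at the door.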
+------------------------------------------------------------+
| |
| ANY DATA ---> ADLS Gen2 (object store) ---> ENGINES|
| |
| CSV, JSON, cheap, durable, Synapse, |
| Parquet, Delta, infinite scale, Databricks, |
| logs, images, one source of truth HDInsight, |
| Kafka events Power BI, ML |
| |
+------------------------------------------------------------+
One lake. Many engines. That is the core promise.
Before ~2010, analytics meant warehouses: expensive, schema-strict, row-limited. Then data exploded:
Problem With Warehouse-Only World | Who Felt It |
Warehouse storage cost $1000s / TB / month | Every CFO |
Could not store PDFs, images, videos | Healthcare, retail, media |
Schema changes took weeks | Fast-moving startups |
Historical data deleted to save cost | Regulated industries |
Data lakes fixed this by leveraging cheap object storage (ADLS at ~$0.018 / GB / month) and decoupling compute from storage.
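To make the price gap concrete, here is a rough back-of-envelope sketch. It assumes decimal terabytes, takes $1,000 per TB per month as the low end of the warehouse figure quoted above, and uses a hypothetical 50 TB of history. Real Azure pricing varies by region, access tier, and redundancy option.

```python
LAKE_USD_PER_GB_MONTH = 0.018       # figure quoted in the text
WAREHOUSE_USD_PER_TB_MONTH = 1000   # assumed low end of "$1000s / TB / month"
GB_PER_TB = 1000                    # decimal terabytes, an assumption

def lake_monthly_cost(terabytes):
    """Monthly storage bill for data kept in the lake."""
    return terabytes * GB_PER_TB * LAKE_USD_PER_GB_MONTH

def warehouse_monthly_cost(terabytes):
    """Monthly storage bill for the same data in a legacy warehouse."""
    return terabytes * WAREHOUSE_USD_PER_TB_MONTH

history_tb = 50  # hypothetical volume of historical data
lake = lake_monthly_cost(history_tb)
warehouse = warehouse_monthly_cost(history_tb)
print(f"lake: ${lake:,.0f}/month, warehouse: ${warehouse:,.0f}/month")
```

Under these assumptions the lake is more than fifty times cheaper, which is why "keep history forever" became realistic.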
abfss://lake@hitavirtech.dfs.core.windows.net/
|
+-- raw/ <-- BRONZE: untouched, as ingested
| + source of truth
| + can replay anything from here
|
+-- curated/ <-- SILVER: cleaned, typed, Delta/Parquet
| + deduped, quality-checked
| + partitioned for fast scans
|
+-- analytics/ <-- GOLD: pre-aggregated, BI-ready
+ joins done once
+ powers Power BI and ML features
Zone | Shape | Readers |
Raw / Bronze | Original bytes: CSV, JSON, images, dumps | Data engineers only |
Curated / Silver | Cleaned, typed, often Delta + partitions | Analysts, ML engineers |
Analytics / Gold | Aggregated, ready for Power BI and models | Business users, BI tools |
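The three zones can be pictured as two small transforms. The sketch below uses in-memory lists in place of lake folders; the sample rows, the "no negative amounts" rule, and the function names are illustrative assumptions, not part of any Azure API.

```python
from collections import defaultdict

# BRONZE: untouched records, exactly as ingested (strings, duplicates, junk).
bronze = [
    {"order_id": "1", "country": "IN", "amount": "100.0"},
    {"order_id": "1", "country": "IN", "amount": "100.0"},  # duplicate
    {"order_id": "2", "country": "US", "amount": "-5"},     # fails quality rule
    {"order_id": "3", "country": "US", "amount": "40.5"},
    {"order_id": "4", "country": "IN", "amount": "oops"},   # not a number
]

def to_silver(raw_rows):
    """Clean, type, dedupe, and quality-check the raw rows."""
    seen, silver = set(), []
    for row in raw_rows:
        try:
            typed = {
                "order_id": int(row["order_id"]),
                "country": row["country"],
                "amount": float(row["amount"]),
            }
        except (KeyError, ValueError):
            continue  # in practice, route to a quarantine folder
        if typed["amount"] < 0 or typed["order_id"] in seen:
            continue
        seen.add(typed["order_id"])
        silver.append(typed)
    return silver

def to_gold(silver_rows):
    """Pre-aggregate once so BI tools read a small, ready-made table."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Note that `bronze` is never modified: if a cleaning rule turns out to be wrong, you can replay everything from the raw zone.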
+--------------------------------------------------------------+
| DELTA LAKE - ACID Transactions on ADLS |
+--------------------------------------------------------------+
| Format : Parquet files + JSON transaction log |
| Superpower: UPDATE, DELETE, MERGE on lake files |
| Travel : Time travel (query any past version) |
| Used by : Databricks, Synapse Spark, Microsoft Fabric |
| |
| Solves the "warehouse features on lake files" problem. |
+--------------------------------------------------------------+
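The idea behind "data files plus a transaction log" can be shown with a toy model. This is a deliberately simplified sketch, not the real Delta Lake protocol: the class, its method names, and the file names are all hypothetical, and dictionaries stand in for Parquet files.

```python
class ToyDeltaTable:
    """Toy model only: an append-only commit log over immutable 'files'."""

    def __init__(self):
        self.files = {}  # file name -> list of rows (stands in for Parquet)
        self.log = []    # one entry per commit: {"add": [...], "remove": [...]}

    def commit(self, add=None, remove=()):
        """Write new files, then record the change as one log entry."""
        add = add or {}
        self.files.update(add)
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1  # version number of this commit

    def read(self, version=None):
        """Replay the log up to `version` to find the live files."""
        if version is None:
            version = len(self.log) - 1
        live = []
        for entry in self.log[: version + 1]:
            live = [name for name in live if name not in entry["remove"]]
            live.extend(entry["add"])
        return [row for name in live for row in self.files[name]]

table = ToyDeltaTable()
v0 = table.commit(add={"part-0": [{"id": 1, "amount": 10}]})
v1 = table.commit(add={"part-1": [{"id": 2, "amount": 20}]})
# An UPDATE rewrites the affected file and swaps it in with a single commit.
v2 = table.commit(add={"part-2": [{"id": 1, "amount": 99}]}, remove=["part-0"])
```

Calling `table.read()` returns the latest state, while `table.read(v0)` returns the table as it looked after the first commit. Old files are never edited in place, which is what makes time travel possible.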
Layer | Purpose | Service |
Storage | Raw bytes, infinite scale | ADLS Gen2 |
Governance | Permissions, catalog, lineage | Microsoft Purview |
Cataloging | Schema + lineage | Purview + Synapse Catalog |
ETL / ELT | Move raw -> curated -> analytics | Azure Data Factory, Synapse Pipelines, Databricks |
Query | SQL on lake files | Synapse Serverless SQL |
ML | Train on lake data directly | Azure Machine Learning |
+--------------------------------------------------------------+
| MICROSOFT PURVIEW - Unified Data Governance |
+--------------------------------------------------------------+
| Catalog : Scan ADLS, Synapse, SQL, Power BI, S3, GCS |
| Discovery : Auto-classify sensitive data (500+ types) |
| Lineage : End-to-end, column-level, across services |
| Policy : Data access governance across clouds |
| |
| Turns a raw ADLS account into a governed, multi-tenant lake.|
+--------------------------------------------------------------+
Purview is what lets one ADLS account serve 20 teams without everyone seeing everyone else's columns.
Strength | Weakness |
Cheap per GB | Can become a "data swamp" without governance |
Any format | Query performance < a warehouse on the same data |
Separates storage and compute | Schema enforcement is optional (and often skipped) |
Multi-engine access (Synapse, Databricks, Power BI, ML) | Harder for business users to self-serve |
+--------------------------------------------------------------+
| |
| NO CATALOG -> "Which container has customers?" |
| NO QUALITY RULES -> "Why are 40% of amounts negative?" |
| NO GOVERNANCE -> "Who deleted last quarter's data?" |
| NO LIFECYCLE -> "We're paying for 2014 clickstream" |
| |
| ==> DATA SWAMP (useless, expensive) |
+--------------------------------------------------------------+
Every successful data lake is paired with Purview + quality rules + lifecycle policies + Azure Policy. Skip these and your lake drowns.
HitaVir Tech says: "A data lake without a catalog is a data swamp. A data lake without quality rules is a liability. Governance is not optional; it is the difference between an asset and a landfill."
Data lake in one line: store everything cheaply, govern it strictly, query it from any engine.
+==============================================================+
| ARCHITECTURE 2 - DATA WAREHOUSE |
| "Fast SQL on curated, trusted data." |
+==============================================================+
A data warehouse is a centralized, highly structured database optimized for analytical queries (aggregations, joins, and scans across billions of rows) at interactive speeds.
Key properties:
Property | What It Means |
Schema-on-write | Every row fits a predefined schema at load time |
Columnar storage | Stores columns together, not rows; often 10-100x faster scans |
MPP (Massively Parallel Processing) | Splits work across many compute nodes automatically |
Optimized for reads | Writes are slower; reads are lightning-fast |
Business-user friendly | Clean star schemas; analysts can self-serve SQL |
ROW STORE (OLTP, e.g. Azure SQL DB)
-----------------------------------
[id | name | country | amount] <-- each row stored together
Great for: "Get everything about order 1042"
Bad for: "SUM(amount) across 1B rows"
COLUMNAR STORE (OLAP, e.g. Synapse Dedicated, Parquet)
------------------------------------------------------
[id][id][id]...
[name][name][name]...
[country][country][country]...
[amount][amount][amount]... <-- each column stored together
Great for: "SUM(amount) across 1B rows" (scan only one column)
Bad for: "Get everything about order 1042"
Warehouses use columnar. That one design choice is why they can aggregate billions of rows in seconds.
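A small sketch shows why the layout matters. It counts how many field values each layout has to touch to answer `SUM(amount)`; the three sample orders are made up, and plain Python lists stand in for pages on disk.

```python
# Row store: each record is kept together (good for "fetch order 1042").
row_store = [
    {"id": 1041, "name": "Asha", "country": "IN", "amount": 20.0},
    {"id": 1042, "name": "Ben", "country": "US", "amount": 35.5},
    {"id": 1043, "name": "Chen", "country": "SG", "amount": 12.5},
]

# Column store: each column is kept together (good for "SUM(amount)").
column_store = {
    column: [row[column] for row in row_store]
    for column in row_store[0]
}

def sum_amount_row_store(rows):
    """Reads every field of every row just to use one value per row."""
    fields_read = 0
    total = 0.0
    for row in rows:
        fields_read += len(row)
        total += row["amount"]
    return total, fields_read

def sum_amount_column_store(columns):
    """Reads only the one column the query needs."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)

row_total, row_fields = sum_amount_row_store(row_store)
col_total, col_fields = sum_amount_column_store(column_store)
```

Both layouts give the same total, but the row store touched four times as many values. With dozens of columns and a billion rows, that ratio is the difference between seconds and minutes, before compression even enters the picture.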
Most warehouse tables follow the star schema:
+---------------------+
| DIM_CUSTOMER |
| (who bought) |
+----------+----------+
|
|
+---------------+ +---------------+ +---------------+
| DIM_PRODUCT |----| FACT_SALES |----| DIM_DATE |
| (what sold) | | (the event) | | (when sold) |
+---------------+ +-------+-------+ +---------------+
|
|
+----------+----------+
| DIM_STORE |
| (where sold) |
+---------------------+
Star schemas make queries fast AND readable: SELECT country, SUM(amount) FROM fact_sales JOIN dim_store ....
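Here is a runnable miniature of that kind of query, using Python's built-in `sqlite3` module as a stand-in for a warehouse engine. The `store_id` join key, the column types, and the sample rows are assumptions made for illustration; only the table names come from the diagram above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE fact_sales (
        sale_id  INTEGER PRIMARY KEY,
        store_id INTEGER REFERENCES dim_store(store_id),
        amount   REAL
    );
    INSERT INTO dim_store VALUES (1, 'IN'), (2, 'US');
    INSERT INTO fact_sales VALUES
        (100, 1, 250.0), (101, 1, 50.0), (102, 2, 75.0);
""")

# The fact table holds the events; the dimension supplies the label to group by.
rows = conn.execute("""
    SELECT d.country, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_id = f.store_id
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()
```

The query reads almost like the business question ("revenue by country"), which is the readability benefit the star schema is after.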
+--------------------------------------------------------------+
| SYNAPSE DEDICATED SQL POOL (formerly SQL DW) |
+--------------------------------------------------------------+
| Category : Columnar MPP warehouse |
| Distrib. : Round-robin, hash, replicated |
| Scale : DW100c to DW30000c service levels, petabyte range |
| SQL : T-SQL (SQL Server flavored) |
| Pairings : PolyBase, COPY statement, Power BI |
| |
| The engine behind Mars, Rolls-Royce, Walmart analytics. |
+--------------------------------------------------------------+
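The distribution options listed in the box decide where each row physically lands. The sketch below contrasts hash and round-robin placement across the 60 distributions a dedicated SQL pool uses. The CRC32 hash is a stand-in (the engine's real hash function is internal), and the customer keys are hypothetical.

```python
import zlib
from collections import Counter

DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions

def hash_distribution(key):
    """Same key always lands on the same distribution (stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % DISTRIBUTIONS

def round_robin_distribution(row_number):
    """Rows are dealt out evenly with no regard to their key."""
    return row_number % DISTRIBUTIONS

customer_ids = [f"cust-{i % 500}" for i in range(6000)]  # hypothetical keys

hashed = [hash_distribution(cid) for cid in customer_ids]
dealt = [round_robin_distribution(n) for n in range(len(customer_ids))]

# Hash: every row for one customer is co-located, so joins and GROUP BYs
# on that key need no data movement between distributions.
placements_for_one = {d for cid, d in zip(customer_ids, hashed) if cid == "cust-7"}

# Round-robin: perfectly even sizes, but one customer's rows are scattered.
round_robin_sizes = Counter(dealt)
scattered_for_one = {d for cid, d in zip(customer_ids, dealt) if cid == "cust-7"}
```

The third option, replicated, is not shown: it simply keeps a full copy of a small table (typically a dimension) on every compute node so joins against it never move data.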
+----------------------------+
| Dedicated SQL Pool |
| (hot, curated tables) |
+-------------+--------------+
|
| joins across
v
+----------------------------+
| ADLS via Serverless SQL |
| (cold, historical data) |
+----------------------------+
One T-SQL query spans both the warehouse (recent, hot) and the lake (years of history). No duplicate storage, no duplicate pipelines.
Source | Loader |
ADLS files | COPY statement / PolyBase |
Event Hubs | Synapse streaming ingestion |
Azure SQL / Postgres | Azure Data Factory + CDC |
SaaS apps (Dynamics, Salesforce) | Data Factory connectors / Power Automate |
Strength | Weakness |
Sub-second SQL on billions of rows | Expensive per TB stored |
Business-analyst friendly | Rigid schema: changes need migrations |
Mature BI tool ecosystem (Power BI) | Only handles structured data |
ACID transactions and governance baked-in | Locked into one vendor's engine |
+--------------------+------------------------+------------------------+
| ATTRIBUTE | DATA LAKE | DATA WAREHOUSE |
+--------------------+------------------------+------------------------+
| Data type | Anything | Structured only |
| Schema | On read | On write |
| Cost / TB stored | $ (cheap) | $$$$ (expensive) |
| Query speed | Medium | Fast |
| Users | Engineers, data sci. | Analysts, business |
| Azure example | ADLS + Serverless SQL | Synapse Dedicated |
+--------------------+------------------------+------------------------+
HitaVir Tech says: "Warehouses are optimized for the answers you know you want. Lakes are optimized for the answers you haven't invented questions for yet. Real companies need both. The next section shows how to stop choosing and combine them."
Warehouse in one line: columnar + MPP + star schema = fast answers for business users.
+==============================================================+
| ARCHITECTURE 3 - MODERN DATA ARCHITECTURE |
| (aka the "Lakehouse" pattern on Azure) |
+==============================================================+
By 2018, most companies had both a lake and a warehouse, and suffered:
Pain | Symptom |
Two copies of the truth | Lake says one number, warehouse says another |
Pipeline sprawl | 200 Data Factory jobs shuffling data between them |
Permission chaos | Entra ID for SQL, ACLs for storage, separate audits |
Skill silos | Data engineers in Spark, analysts in SQL, no common tool |
ML engineers stuck | Data scientists denied warehouse access, scraping lakes by hand |
+==============================================================+
| |
| ONE GOVERNED PLATFORM |
| |
| - Unified storage (ADLS Gen2 = source of truth) |
| - Open table format (Delta Lake / Iceberg) |
| - Purpose-built engines (pick the right tool per job) |
| - Shared catalog + governance (Purview) |
| - Common security model (Entra ID + RBAC + Key Vault) |
| |
+==============================================================+
Instead of lake or warehouse, you get lake and warehouse, unified by one catalog, one permission model, one lineage.
In 2023 Microsoft launched Fabric โ a single SaaS surface that bundles:
Fabric Pillar | Under the Hood |
OneLake | One tenant-wide ADLS, "OneDrive for data" |
Data Factory | Ingest and orchestrate (same engine as ADF) |
Synapse Data Engineering | Spark notebooks on Delta Lake |
Synapse Data Warehouse | T-SQL warehouse over Delta (not Dedicated Pool!) |
Synapse Real-Time Analytics | KQL / Data Explorer on streams |
Power BI | Native BI over OneLake (Direct Lake mode) |
Data Activator | Trigger actions from data signals |
Copilot | Natural-language across all pillars |
Fabric is where Azure analytics is heading. The Part 1 services still exist; Fabric simply bundles them on one pricing model, one identity, one lakehouse.
+--------+ +--------+ +--------+ +--------+ +--------+
| 1 | | 2 | | 3 | | 4 | | 5 |
| SCALABLE| |PURPOSE-| |SEAMLESS| |UNIFIED | |FUTURE- |
| DATA | | BUILT | | DATA | |GOVERN- | | PROOF |
| LAKE | | ENGINES| |MOVEMENT| | ANCE | | ML |
+---------+ +--------+ +--------+ +--------+ +--------+
| | | | |
v v v v v
ADLS + Synapse, Shortcuts, Purview + Azure ML,
Fabric Databricks, Mirroring, Entra ID + OpenAI,
OneLake Power BI, ADF zero- Key Vault + Synapse
Data Explorer copy link Defender ML, Fabric
Every modern Azure architecture starts from ADLS Gen2 / OneLake. Why?
Reason | Impact |
Effectively unlimited scale | Never outgrow it |
11 nines durability (LRS; 16 nines with GRS) | Your data is safer than on any disk |
Pennies per GB | Keep history forever |
Native reader for Synapse, Databricks, Power BI, ML | One source, many consumers |
One-size-fits-all is dead. Pick the right engine per workload:
Workload | Engine | Why |
Ad-hoc SQL on lake files | Synapse Serverless SQL | Pay per TB scanned, zero setup |
Dashboards on curated tables | Synapse Dedicated / Fabric Warehouse | Sub-second BI |
Petabyte Spark / ML | Databricks / Synapse Spark | Custom transforms + ML at scale |
Low-latency lookups (single-digit ms) | Cosmos DB | Document / key-value queries |
Full-text + vector search | AI Search | RAG, log search, enterprise search |
Real-time aggregation | Stream Analytics / ADX | Streaming SQL / KQL |
Instead of 200 brittle ETL jobs, modern Azure architectures rely on:
Layer | Service | Purpose |
Identity | Microsoft Entra ID (formerly Azure AD) | Who you are |
Fine-grained access | Azure RBAC + ADLS ACLs + Purview policies | What you can do |
Encryption | Key Vault + Customer-Managed Keys | At-rest and in-transit |
PII scanning | Microsoft Purview + Defender for Cloud | Find sensitive data |
Audit | Activity Log + Azure Monitor | Every API call, every query |
Data discovery | Microsoft Purview | Business-friendly data catalog |
Policy-as-code | Azure Policy | Enforce rules on resources |
ML is no longer a bolt-on; it lives inside the platform:
Capability | Service |
Full ML lifecycle | Azure Machine Learning |
Foundation models (GPT-4 family; partner and open models such as Llama) | Azure OpenAI Service / Azure AI model catalog |
Spark-native ML in notebooks | Databricks MLflow / Synapse ML |
No-code ML | Azure ML Designer / AutoML |
Natural-language BI | Power BI Copilot |
RAG | AI Search + Azure OpenAI |
+==================================================================+
| |
| +---------+ +---------+ +---------+ +---------+ |
| | Batch | | Stream | | OpTx DB | | SaaS | |
| | files | | events | | (CDC) | | apps | |
| +----+----+ +----+----+ +----+----+ +----+----+ |
| | | | | |
| +-------------+-------------+-------------+ |
| | |
| v |
| +-----------------------------------------------------+ |
| | ADLS GEN2 / ONELAKE --- Centralized Lake | |
| | Raw -> Curated -> Analytics (Delta/Parquet) | |
| +---------------------------+-------------------------+ |
| | |
| governed by | |
| v |
| +-----------------------------------------------------+ |
| | PURVIEW + ENTRA ID + KEY VAULT + DEFENDER + POLICY | |
| +-----------------------------------------------------+ |
| | |
| +------------+------------+------------+-------------+ |
| | | | | | |
| v v v v v |
| +-------+ +---------+ +-----+ +-----------+ +------+ |
| |Synapse| |Synapse | |Data-| |Azure Data | |Azure | |
| |Server-| |Dedicated| |bricks| |Explorer | |ML / | |
| |less | |(MPP) | |Spark | | (KQL) | |OpenAI| |
| +---+---+ +----+----+ +--+--+ +-----+-----+ +--+---+ |
| | | | | | |
| +------------+-----------+-------------+-----------+ |
| | |
| v |
| +----------------------------------------+ |
| | VALUE = Power BI + Copilot + apps | |
| +----------------------------------------+ |
| |
+==================================================================+
Look carefully: every service from Part 1 has a home. That is modern Azure data architecture.
Centralize storage in a governed ADLS / OneLake. Use the best engine for each workload. Let data move frictionlessly between them. Secure it all uniformly. Build ML natively on top.
HitaVir Tech says: "Don't build one monolith. Don't build 20 silos. Build one lake, with many engines, one catalog, one security model. That's how Azure's biggest analytics customers run."
Lakehouse in one line: one lake for storage, many engines for compute, one catalog for trust.
+==============================================================+
| THE COMPLETE AZURE SERVICE MAP |
| for Modern Data Architecture |
+==============================================================+
Each service answers a specific question in the modern architecture:
Layer | Question | Service |
Storage | Where does my data live? | ADLS Gen2 / OneLake |
Governance | Who can see what? | Microsoft Purview |
Catalog | What data do we have? | Purview + Synapse Catalog |
ETL / ELT | How do I shape it? | Azure Data Factory, Databricks, Synapse Spark |
SQL on lake | How do I explore? | Synapse Serverless SQL |
SQL on warehouse | How do I serve BI? | Synapse Dedicated / Fabric Warehouse |
Stream ingest | How do I handle real time? | Event Hubs + Stream Analytics |
Time-series / logs | How do I query logs? | Azure Data Explorer (KQL) |
BI | How do people see it? | Power BI |
ML | How do we predict? | Azure Machine Learning |
LLMs | How do we add GenAI? | Azure OpenAI + AI Search |
Real-time search | How do we find a needle? | Azure AI Search |
+--------------------------------------------------------------+
| AZURE DATABRICKS - The Lakehouse Pioneer |
+--------------------------------------------------------------+
| Audience : Data engineers + data scientists + analysts |
| Engine : Apache Spark (Photon) + Delta Lake |
| Notebooks : Python, SQL, R, Scala |
| ML : MLflow tracking, feature store, model serving |
| Governance: Unity Catalog (like Purview, for Databricks) |
| |
| The "full lakehouse in one product" option. |
+--------------------------------------------------------------+
Classic ETL means writing code to extract-transform-load between systems. Azure's zero-copy features make data appear in another system without hand-built pipelines (Mirroring replicates changes for you; Shortcuts reference the data in place):
+-----------------+ +------------------------+
| Azure SQL DB |==Mirror=>| Microsoft Fabric |
| (app DB) | managed | (Lakehouse, OneLake) |
+-----------------+ +------------------------+
Zero-copy on Azure today:
- Azure SQL -> Fabric (Mirroring)
- Cosmos DB -> Fabric (Mirroring)
- Snowflake -> Fabric (Mirroring)
- OneLake Shortcut -> ADLS Gen2 / S3 / GCS (no copy at all)
Fewer pipelines to maintain. Fresher analytics. Less on-call pain.
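Conceptually, change replication means replaying a stream of inserts, updates, and deletes onto a copy. The sketch below shows that idea only; it is not how Fabric Mirroring is implemented, and the change-feed format, field names, and `apply_changes` helper are hypothetical.

```python
def apply_changes(replica, change_feed):
    """Replay inserts, updates, and deletes onto a keyed replica table."""
    for change in change_feed:
        key = change["id"]
        if change["op"] == "delete":
            replica.pop(key, None)
        else:  # "insert" and "update" are both treated as upserts
            replica[key] = change["row"]
    return replica

# Hypothetical change feed captured from an application database.
change_feed = [
    {"op": "insert", "id": 1, "row": {"name": "Asha", "plan": "free"}},
    {"op": "insert", "id": 2, "row": {"name": "Ben", "plan": "pro"}},
    {"op": "update", "id": 1, "row": {"name": "Asha", "plan": "pro"}},
    {"op": "delete", "id": 2},
]

analytics_replica = apply_changes({}, change_feed)
```

With a managed service, this loop (plus retries, ordering, and schema drift handling) is code you no longer own, which is where the "less on-call pain" comes from.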
In a modern architecture, ADF is everywhere:
Capability | Role |
Copy Activity | Move data between 100+ sources and sinks |
Mapping Data Flows | Visual Spark-based transforms |
Triggers | Schedule, event, tumbling window |
Integration Runtimes | Cloud or self-hosted, on-prem to Azure |
Git integration | Pipelines as code (Azure DevOps / GitHub) |
Monitoring | Built-in dashboards + Azure Monitor |
When someone describes a problem, scan this table first:
Problem Sounds Like... | Reach For |
"We have terabytes piling up and need cheap storage" | ADLS Gen2 + Archive tier |
"Analysts want SQL on 1B rows, must return in seconds" | Synapse Dedicated / Fabric Warehouse |
"We dump random files hourly, want ad-hoc SQL" | Synapse Serverless + ADF |
"Events come at 1M / sec and drive a live dashboard" | Event Hubs + Stream Analytics |
"Logs need to be searchable with keyword filters" | Azure Data Explorer (KQL) |
"We keep 2 copies of the same data in lake and warehouse" | Serverless / Mirroring / Fabric Direct Lake |
"We need to share a slice with another Azure tenant" | Azure Data Share / Purview data policies |
"Non-engineers can't find any data" | Microsoft Purview / Fabric catalog |
"We want the CEO to ask questions in English" | Power BI Copilot |
"Our support team wants to chat with our docs" | AI Search + Azure OpenAI (RAG) |
HitaVir Tech says: "When a junior asks 'which Azure service should we use?', the senior reply is another question: 'what is the actual pattern?' Service choice without pattern = tech for tech's sake."
+==============================================================+
| SECTION 2 - COMMON USE CASES |
| (where the patterns show up in real life) |
+==============================================================+
Most real-world analytics work on Azure falls into six repeatable patterns. Recognize them and you'll know which reference architecture to reach for.
+-----------------------------------------------------------+
| 1. BATCH BI - nightly dashboards |
| 2. REAL-TIME ANALYTICS- live metrics, fraud, IoT |
| 3. LOG / APP OBS. - search + troubleshoot logs |
| 4. CUSTOMER 360 - unify profiles across sources |
| 5. ML / PREDICTIVE - forecast, recommend, score |
| 6. DATA MESH - domain-owned, shared data |
+-----------------------------------------------------------+
Who needs it: Every company with a CFO.
Shape: Operational databases + flat files -> data lake -> warehouse -> Power BI.
Trait | Value |
Freshness | Daily or hourly |
Volume | GB to TB |
Velocity V | Batch |
Core service | Synapse Dedicated / Fabric Warehouse |
Example prompt: "Revenue by region compared to last quarter, refreshed every morning at 8 am."
Who needs it: Rideshare, fintech, ad-tech, IoT, online gaming.
Shape: Events -> Event Hubs -> stream processor -> live dashboard and ADLS for history.
Trait | Value |
Freshness | Sub-second to seconds |
Volume | Millions of events / sec |
Velocity V | Streaming |
Core service | Event Hubs + Stream Analytics |
Example prompt: "Alert the risk team the moment any card transaction looks fraudulent."
Who needs it: Every engineering team at scale.
Shape: Application logs -> Log Analytics / ADX -> ADLS archive.
Trait | Value |
Freshness | Seconds |
Volume | TB/day in logs |
Velocity V | Streaming |
Core service | Azure Data Explorer (KQL) |
Example prompt: "Search the last 30 days of production logs for any mention of this request ID."
Who needs it: Retail, banking, telecom, SaaS.
Shape: Unify profiles from CRM, web, mobile, support into one view, served to marketing + ML.
Trait | Value |
Freshness | Hourly |
Volume | TB |
Dominant V | Variety |
Core service | ADLS + ADF + Synapse |
Example prompt: "Show one customer's full lifetime journey: ads seen, orders placed, tickets filed."
Who needs it: Forecasting, recommendations, fraud, churn, dynamic pricing.
Shape: Lake -> feature store -> model training -> model endpoint -> prediction served in app or BI.
Trait | Value |
Freshness | Training weekly, inference real-time |
Volume | GB to PB of history |
Dominant V | Value |
Core service | Azure ML + ADLS |
Example prompt: "Predict which customers will churn next month so we can retain them."
Who needs it: Enterprises with many product teams owning their own data.
Shape: Each domain team curates its own data products on ADLS / OneLake; a central catalog (Purview + Fabric) makes them discoverable and access-controlled.
Trait | Value |
Freshness | Per-domain |
Ownership | Distributed to domain teams |
Dominant V | Veracity + Value |
Core service | Purview + Fabric domains |
Example prompt: "The Finance team owns 'invoices', Marketing owns 'campaigns', but anyone at the company can discover and request access."
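The six use cases above reduce to a simple lookup: identify the dominant question, then read off the pattern and its core service. The sketch below encodes that as a dictionary. The question phrasings are shorthand chosen for this example, while the core services are the ones listed in each use case's trait table.

```python
# dominant question -> (pattern, core service from the trait tables above)
PATTERNS = {
    "weekly kpis": ("Batch BI", "Synapse Dedicated / Fabric Warehouse"),
    "instant alerts": ("Real-Time Analytics", "Event Hubs + Stream Analytics"),
    "find a log line": ("Log / App Observability", "Azure Data Explorer (KQL)"),
    "one 360 view": ("Customer 360", "ADLS + ADF + Synapse"),
    "predict the future": ("ML / Predictive", "Azure ML + ADLS"),
    "federated ownership": ("Data Mesh", "Purview + Fabric domains"),
}

def pick_pattern(dominant_question):
    """Map the dominant question to one of the six shapes."""
    key = dominant_question.strip().lower()
    if key not in PATTERNS:
        raise ValueError(f"unrecognized question: {dominant_question!r}")
    return PATTERNS[key]

pattern, core_service = pick_pattern("Instant alerts")
```

Real problems rarely arrive labeled this cleanly, so treat the lookup as a first guess to test against freshness, volume, and ownership requirements.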
What is the dominant question?
|
+-------+---+-----+----+-----+----+-------+
| | | | | |
Weekly Instant Find a One Predict Federated
KPIs alerts log line 360 view future ownership
| | | | | |
v v v v v v
BATCH REAL- LOG CUSTOMER ML / DATA
 BI TIME OBS. 360 PRED. MESH
HitaVir Tech says: "New engineers try to force every problem into the pattern they already know. Seniors look at the dominant V and pick the shape, then fill in services. Diagnose first. Build second."
+==============================================================+
| SECTION 3 - REFERENCE ARCHITECTURES |
+==============================================================+
For each use case, here is a whiteboard-ready Azure architecture you can copy, adapt, and defend in a design review.
Azure SQL / Postgres Dynamics 365, Salesforce
| |
+---- ADF with CDC connector ----+
|
v
+--------------------------------------+
| ADLS Gen2 Data Lake |
| raw -> curated (Delta/Parquet) |
+---------------+----------------------+
|
v (ADF Data Flows, Purview DQ, catalog)
|
+---------------+----------------------+
| Synapse Dedicated SQL Pool |
| - Star schemas |
| - Nightly COPY loads |
+---------------+----------------------+
|
v
Power BI
(Direct Query or Import)
|
v
Executives
When to pick it: Finance, ops, exec reporting. Stable queries, predictable loads.
Producers (apps, IoT, clickstream)
|
v
+-------------------------+
| Azure Event Hubs |
| (durable, partitioned) |
+------------+------------+
|
+-----------+------------+------------+
| | |
v v v
Stream Analytics Functions Capture
(continuous SQL) (alerting on |
| anomalies) v
| | ADLS
v v (Parquet, hist.)
Live dashboard Service |
(Power BI streaming) Bus / Teams v
OR Data Explorer alert Synapse
Serverless
ad-hoc
When to pick it: Fraud detection, live pricing, real-time personalization, IoT.
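The "continuous SQL" box in this architecture usually boils down to windowed aggregation. Below is a minimal sketch of a tumbling (fixed, non-overlapping) window in plain Python. It processes a finished list rather than a live stream, and the event shape, the 10-second window, and the threshold of four are hypothetical choices for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per card in fixed, non-overlapping time windows."""
    counts = defaultdict(int)
    for event in events:
        window_start = (event["ts"] // window_seconds) * window_seconds
        counts[(window_start, event["card"])] += 1
    return dict(counts)

def suspicious(counts, threshold):
    """Flag any card with too many transactions in a single window."""
    return sorted(key for key, n in counts.items() if n >= threshold)

# Hypothetical card transactions; `ts` is seconds since the stream started.
events = [
    {"ts": 1, "card": "A"}, {"ts": 3, "card": "A"}, {"ts": 4, "card": "B"},
    {"ts": 7, "card": "A"}, {"ts": 9, "card": "A"}, {"ts": 12, "card": "A"},
    {"ts": 14, "card": "B"},
]

counts = tumbling_window_counts(events, window_seconds=10)
alerts = suspicious(counts, threshold=4)
```

A real stream processor computes the same thing incrementally and emits each window's result as soon as the window closes, which is what keeps the alert latency low.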
App services / AKS / VMs / Activity Log / Defender
|
v
Azure Monitor
(agent + diagnostic settings)
|
+----------+-----------+
| |
v v
Log Analytics / ADLS (archive)
Data Explorer (cheap, years)
(hot, 30-90 days) |
| v
v Synapse Serverless
KQL dashboards / for historical
Grafana / Sentinel audits
When to pick it: SRE and platform teams, security logs (Sentinel), microservice observability.
Dynamics 365 Web events Mobile app Zendesk / SN
| | | |
+------------+------+-------+----------------+
|
v
ADLS Gen2 Data Lake (raw)
|
v ADF + Purview DQ
|
ADLS Gen2 Data Lake (curated, Delta)
|
v
+----------------+------------------+
| |
v v
Synapse Azure ML
Unified Customer Features + Models
table (serving BI) (churn, LTV, NBA)
| |
v v
Power BI 360 Marketing automation
dashboard (personalized offers)
When to pick it: Retail, banking, telecom, subscription SaaS.
+----------------------+
| ADLS / OneLake |
| (historical data) |
+----------+-----------+
|
v
+----------------------+
| Databricks / Synapse|
| Feature engineering |
+----------+-----------+
|
v
+-------------------------+
| Azure ML Feature Store |
+-----+-----------+-------+
| |
(training) (serving)
v v
Azure ML Real-time
Jobs endpoint
(compute (managed online
cluster) inference)
| |
v v
Models Mobile / web app
(personalized UX)
When to pick it: Recommenders, fraud detection, demand forecasting, dynamic pricing.
Domain A (Orders) Domain B (Marketing) Domain C (Finance)
owns its own pipes owns its own pipes owns its own pipes
| | |
v v v
ADLS + Synapse ADLS + Synapse ADLS + Synapse
(Orders domain) (Marketing domain) (Finance domain)
| | |
+-----------+-----------+-------------------------+
|
v
+-----------------------------------+
| Microsoft Purview (central cat) |
| + Fabric domains + Data Policies |
+---------------+-------------------+
|
+------------+------------+
| | |
v v v
Analyst Data sci. Executive
(Synapse) (Azure ML) (Power BI / Copilot)
When to pick it: Large enterprises where domain teams must own their data products, but a central platform team guarantees governance.
# | Pattern | Primary V | Storage | Compute | Serve |
1 | Batch BI | Volume | ADLS + Synapse | ADF, Synapse Dedicated | Power BI |
2 | Real-Time | Velocity | Event Hubs + ADLS | Stream Analytics, Functions | Power BI, Data Explorer |
3 | Log Obs. | Velocity + Variety | Log Analytics + ADX + ADLS | Azure Monitor | KQL dashboards, Sentinel |
4 | 360 | Variety + Value | ADLS + Synapse | ADF | Power BI + apps |
5 | ML | Value | ADLS | Databricks, Azure ML | Endpoint in app |
6 | Mesh | Veracity + Value | Distributed ADLS | Per-domain | Purview + Power BI |
HitaVir Tech says: "Architects don't memorize 50 services; they memorize 6 shapes. When someone brings a new problem, they map it to a shape first, then pick services to fit. Copy these six patterns. Most of your career, you'll be adapting one of them."
+==============================================================+
| QUIZ - TEST YOUR UNDERSTANDING |
+==============================================================+
Answer each question before revealing. No peeking: this is how you build real recall.
Which of the following best describes a data lake?
Answer: B. A lake holds any data in native format; multiple engines (Synapse, Databricks, Power BI, Azure ML) read from it.
Which property is specific to data warehouses, not data lakes?
Answer: C. Columnar + MPP is the warehouse signature, enabling fast aggregation on billions of rows.
Match each zone to its typical reader:
Zone | Readers |
Raw (Bronze) | ? |
Curated (Silver) | ? |
Analytics (Gold) | ? |
Answer: Raw = data engineers only. Curated = analysts and ML engineers. Analytics = business users and BI tools.
In an Azure Modern Data Architecture, which service is the "source of truth" storage layer?
Answer: B. ADLS Gen2 (or OneLake in Fabric). Every other analytics engine on Azure reads from it.
What does Synapse Serverless SQL enable?
Answer: C. Serverless SQL lets you query lake data in place with OPENROWSET, so there is no duplicate storage.
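As a rough illustration of what such a query looks like, the sketch below only builds the T-SQL text; it does not connect to Synapse. The storage account, container, and path are hypothetical placeholders, and the helper name `build_openrowset_query` is invented for this example.

```python
def build_openrowset_query(account: str, container: str, path: str, top: int = 10) -> str:
    """Build a serverless SQL query that reads Parquet files directly from the lake.

    All arguments are assumed to be trusted values: this is plain string
    formatting, not parameter binding.
    """
    url = f"https://{account}.dfs.core.windows.net/{container}/{path}"
    return (
        f"SELECT TOP {top} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


if __name__ == "__main__":
    # Hypothetical account, container, and folder names.
    print(build_openrowset_query("mylakeaccount", "curated", "orders/*.parquet", top=5))
```

The files stay in ADLS; the query engine reads them where they sit, which is why no second copy is loaded into a warehouse table.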
Which service provides unified data catalog, lineage, and policy across ADLS, Synapse, Power BI, and even AWS S3?
Answer: B. Microsoft Purview. Azure RBAC is coarse identity; Defender finds PII; Activity Log audits; Purview governs across the whole estate.
A fraud team needs to block bad transactions within 200 ms. Which pattern fits?
Answer: B. Real-time analytics with streaming + serverless alerting.
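To give a feel for the kind of rule a streaming job evaluates per event, here is a minimal in-process sketch of a sliding-window velocity check. It is a stand-in for illustration only: it is not Event Hubs or Stream Analytics code, it says nothing about meeting a 200 ms budget, and the class name and thresholds are assumptions.

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flag a card that makes more than max_txns transactions inside window_s seconds."""

    def __init__(self, max_txns: int = 3, window_s: float = 60.0) -> None:
        self.max_txns = max_txns
        self.window_s = window_s
        self._seen = defaultdict(deque)  # card_id -> timestamps still inside the window

    def is_suspicious(self, card_id: str, ts: float) -> bool:
        times = self._seen[card_id]
        times.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while times and ts - times[0] > self.window_s:
            times.popleft()
        return len(times) > self.max_txns


if __name__ == "__main__":
    rule = VelocityRule(max_txns=2, window_s=10.0)
    for t in (0.0, 1.0, 2.0):
        print(t, rule.is_suspicious("card-1", t))
```

The point of the pattern is that each event is judged as it arrives, using only a small amount of recent state, instead of waiting for a nightly batch.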
Microsoft Fabric Mirroring between Azure SQL and Fabric eliminates the need to:
Answer: B. Mirroring replicates changes automatically, so there is no pipeline code to maintain.
Which service helps non-engineers discover datasets using business terms instead of table names?
Answer: C. Purview provides a business-friendly catalog, lineage, and classification.
A company stores 5 years of clickstream JSON in ADLS but has no Purview, no Azure Policy, no quality rules. What is this?
Answer: B. A data swamp: no catalog, no governance, no trust. Storage alone is not an architecture.
Score | Meaning |
9-10 | You are ready for production design reviews |
7-8 | Solid. Re-read the reference architectures section |
5-6 | Review the three architecture chapters and retake |
< 5 | Re-do Part 1 first: the 5 Vs are the foundation |
HitaVir Tech says: "Don't guess. The questions here are the same ones that show up in every Azure analytics interview. Know them cold."
+==============================================================+
| CONGRATULATIONS - PART 2 DONE! |
+==============================================================+
Data Lakes
Topic |
What a lake is (schema-on-read) |
The medallion zones (raw/bronze, curated/silver, analytics/gold) |
Why lakes become swamps without governance |
Delta Lake: ACID on ADLS |
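The zone flow can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the sample records, field names, and cleaning rules are invented for illustration, and a real pipeline would read and write files in ADLS rather than in-memory lists.

```python
import json
from collections import Counter

# Hypothetical raw-zone payload: JSON lines exactly as a source system emitted them.
RAW_LINES = [
    '{"order_id": "1", "country": "in", "amount": "120.50"}',
    '{"order_id": "2", "country": "US", "amount": "80"}',
    '{"order_id": "2", "country": "US", "amount": "80"}',
    "not-json",
]


def to_curated(raw_lines):
    """Raw -> curated: parse, type, standardize, de-duplicate; skip unparseable rows."""
    seen, curated = set(), []
    for line in raw_lines:
        try:
            rec = json.loads(line)
            row = {
                "order_id": int(rec["order_id"]),
                "country": rec["country"].upper(),
                "amount": float(rec["amount"]),
            }
        except (ValueError, KeyError, TypeError, AttributeError):
            continue
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            curated.append(row)
    return curated


def to_analytics(curated):
    """Curated -> analytics: aggregate to the grain a dashboard needs."""
    totals = Counter()
    for row in curated:
        totals[row["country"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    print(to_analytics(to_curated(RAW_LINES)))
```

Raw keeps everything, including the bad line; curated is typed and de-duplicated; analytics holds only the aggregate that a report reads.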
Data Warehouses
Topic |
Schema-on-write, columnar, MPP |
Star schemas (fact + dimensions) |
Synapse Dedicated vs Serverless |
Fabric Warehouse (the new kid) |
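A star schema is easier to remember once you have joined one. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite is a row store, not a columnar MPP engine, and the table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions describe the "who / what / when" of each fact.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
    -- The fact table holds measures plus foreign keys to the dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product,
        date_key    INTEGER REFERENCES dim_date,
        amount      REAL
    );
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO dim_date    VALUES (20240101, 2024), (20250101, 2025);
    INSERT INTO fact_sales  VALUES (1, 20240101, 10.0), (1, 20250101, 15.0), (2, 20250101, 40.0);
    """
)

# Typical star-schema query: join the fact to its dimensions, then aggregate.
rows = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

The same fact-plus-dimensions shape carries over to Synapse Dedicated or Fabric Warehouse; only the engine underneath changes.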
Modern Data Architecture (Lakehouse)
Pillar |
Scalable data lake (ADLS / OneLake) |
Purpose-built engines |
Seamless movement (Mirroring, Shortcuts, Direct Lake) |
Unified governance (Purview + Entra) |
Built-in AI / ML (Azure ML + OpenAI) |
Six Reference Architectures
# | Pattern | Core Service |
1 | Batch BI | Synapse + Power BI |
2 | Real-Time | Event Hubs + Stream Analytics |
3 | Log Observability | Data Explorer / Log Analytics |
4 | Customer 360 | ADLS + ADF + Synapse |
5 | ML / Predictive | Azure ML + Databricks |
6 | Data Mesh | Purview + Fabric domains |
Part 1 Gave You... | Part 2 Gave You... |
A diagnostic framework (5 Vs) | Reference architectures |
One service per V | Combinations that scale |
A basic ADLS -> Synapse Serverless lab | The full Lakehouse picture |
What to pick for each bottleneck | Which shape fits each real-world problem |
raw/ -> curated/ as Parquet.
+==============================================================+
| |
| Part 1 = the 5 Vs + one service per V |
| Part 2 = three architectures + six reference shapes |
| |
| Together = you can design and defend |
| an analytics platform on Azure. |
| |
+==============================================================+
HitaVir Tech says: "You just completed the same journey a new hire at a top cloud team makes in their first three months. Keep the cheat sheet, apply the patterns, and your architecture reviews will sound like a 5-year veteran's. Fundamentals compound."
Welcome to the Lakehouse. See you in Part 3: a hands-on ADF + Synapse + Power BI capstone.
- HitaVir Tech
+==============================================================+
| RESOURCES AND NEXT STEPS |
+==============================================================+
Topic | Where to Read |
Azure Data Lake Storage Gen2 | learn.microsoft.com/azure/storage/blobs/data-lake-storage-introduction |
Microsoft Purview | learn.microsoft.com/purview |
Azure Data Factory | learn.microsoft.com/azure/data-factory |
Azure Synapse Analytics | learn.microsoft.com/azure/synapse-analytics |
Azure Databricks | learn.microsoft.com/azure/databricks |
Azure Event Hubs | learn.microsoft.com/azure/event-hubs |
Azure Stream Analytics | learn.microsoft.com/azure/stream-analytics |
Azure Data Explorer | learn.microsoft.com/azure/data-explorer |
Power BI | learn.microsoft.com/power-bi |
Azure Machine Learning | learn.microsoft.com/azure/machine-learning |
Microsoft Fabric | learn.microsoft.com/fabric |
Whitepaper | Why Read It |
Azure Well-Architected Framework | The canonical design checklist |
Cloud Adoption Framework - Data | Governance, operating models |
Azure Data Architecture Guide | Reference diagrams, endorsed by Microsoft |
Modern Data Warehouse Architecture | The classic lakehouse blueprint |
Microsoft Fabric Whitepaper | Why Fabric, and how it maps to Azure |
Resource | Provider |
Microsoft Learn - Azure Data Engineer path | learn.microsoft.com |
DP-203 Azure Data Engineer Associate cert | Microsoft Certification |
DP-500 / DP-600 Fabric certs | Microsoft Certification |
AI-102 Azure AI Engineer cert | Microsoft Certification |
Book | Why |
Designing Data-Intensive Applications by Martin Kleppmann | The single best data engineering book ever written |
The Data Warehouse Toolkit by Ralph Kimball | Star schemas, dimensional modeling, classics |
Fundamentals of Data Engineering by Joe Reis | Modern, cloud-era data engineering |
Azure Data Engineering Cookbook by Ahmad Osama | Practical Azure recipes |
+==============================================================+
| AZURE ANALYTICS - PART 2 CHEAT SHEET |
| (screenshot and keep) |
+==============================================================+
Store ANY data ---> ADLS Gen2 (raw | curated | analytics)
Schema-on-READ ---> decided at query time
Read by ---> Synapse, Databricks, Power BI, Azure ML
Governed by ---> Microsoft Purview + Azure Policy
Columnar + MPP ---> fast aggregation on billions of rows
Schema-on-WRITE ---> decided before load
Star schema ---> fact + dimensions
Azure service ---> Synapse Dedicated / Fabric Warehouse
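The two schema rules in this cheat sheet can be contrasted in a short sketch. It is only an illustration under assumptions: the `SCHEMA` dictionary, the function names, and the sample records are hypothetical, and plain Python lists stand in for the warehouse table and the lake files.

```python
import json

SCHEMA = {"user_id": int, "event": str}  # hypothetical warehouse table schema


def load_schema_on_write(table: list, record: dict) -> None:
    """Warehouse style: validate against the schema BEFORE the row is stored."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"columns {sorted(record)} do not match schema {sorted(SCHEMA)}")
    for col, typ in SCHEMA.items():
        if not isinstance(record[col], typ):
            raise ValueError(f"{col} must be {typ.__name__}")
    table.append(record)


def query_schema_on_read(lake_files: list, column: str) -> list:
    """Lake style: store anything; apply structure only when a query reads it."""
    values = []
    for raw in lake_files:
        try:
            doc = json.loads(raw)
        except ValueError:
            continue  # unreadable for this query, but still kept in the lake
        if isinstance(doc, dict) and column in doc:
            values.append(doc[column])
    return values


if __name__ == "__main__":
    lake = ['{"user_id": 1, "event": "click"}', '{"user_id": "u2", "extra": true}', "<xml/>"]
    print(query_schema_on_read(lake, "user_id"))
```

Schema-on-write rejects bad rows at load time, which keeps queries fast and predictable; schema-on-read accepts everything and pushes the interpretation cost to each reader.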
One lake + many engines + one catalog + one security model + native AI/ML
Pillar | Services |
Scalable lake | ADLS Gen2 / OneLake |
Purpose-built engines | Synapse, Databricks, Data Explorer, Cosmos DB |
Seamless movement | Mirroring, Shortcuts, Direct Lake, ADF |
Unified governance | Purview, Entra ID, Key Vault, Defender, Policy |
Built-in AI | Azure ML, OpenAI, AI Search, Copilot |
# | Pattern | When | Core |
1 | Batch BI | Daily dashboards | Synapse + Power BI |
2 | Real-Time | Fraud, IoT, live | Event Hubs + Stream Analytics |
3 | Log Obs. | Search production logs | Data Explorer / Sentinel |
4 | 360 | Unify customer view | ADLS + ADF + Synapse |
5 | ML | Predict & personalize | Azure ML + Databricks |
6 | Data Mesh | Enterprise domain ownership | Purview + Fabric |
Azure service icons used in this codelab are from the official Microsoft Azure Public Service Icons set (V23), freely distributed by Microsoft for use in architecture diagrams and educational materials.