+============================================================+
| |
| AZURE ANALYTICS FUNDAMENTALS - PART 1 |
| |
| Concepts - The 5 Vs of Big Data - Azure Mapping |
| |
| Powered by HitaVir Tech |
+============================================================+
Welcome to Fundamentals of Analytics on Azure Cloud Platform - Part 1 by HitaVir Tech!
This codelab builds your mental model for analytics on Microsoft Azure: one concept at a time, one Azure service at a time. No prior Azure experience required.
Pillar | Topics |
Concepts | Analytics, Machine Learning, core framework |
The 5 Vs | Volume, Variety, Velocity, Veracity, Value |
Azure Services | One toolkit per V: the complete mapping |
Hands-on Lab | ADLS Gen2 -> Synapse Serverless SQL end-to-end |
Every data challenge you will face maps to one of five questions:
Question | V |
"How do we store 500 TB?" | Volume |
"We have CSVs, JSON, images. Help!" | Variety |
"Dashboards must refresh every second" | Velocity |
"Half our timestamps are malformed" | Veracity |
"Who actually uses this dashboard?" | Value |
The 5 Vs give you a framework to diagnose. Azure gives you a toolbox to solve each V.
3-4 hours (concepts + hands-on lab)
If you are... | Do this |
A student new to cloud | Read top-to-bottom, do the hands-on lab |
A working engineer | Skim Part 1-2, deep-read the 5 Vs, focus on Azure services for your bottleneck V |
A trainer or mentor | Use section headers as talking points; the spotlight cards are slide-ready |
A reference reader | Jump to the Cheat Sheet at the end |
HitaVir Tech says: "The 5 Vs aren't academic jargon; they're the mental checklist every senior engineer runs when someone says 'we have a data problem.' Learn to speak this language and every cloud will feel familiar."
Required
Azure account (azure.microsoft.com/free)
Helpful
Everything in this codelab runs in the Azure Portal in your browser. Zero software installation on your machine.
Every step stays inside the Azure free tier / low-cost services:
Free / Low-Cost Budget | Usage in This Codelab |
ADLS Gen2 storage (5 GB free) | < 1 MB |
Synapse Serverless SQL ($5 / TB scanned) | < $0.01 total |
Synapse workspace (free to create) | 1 workspace |
Estimated total cost | ~$0.00 |
Warning: always clean up. Step 10 of the lab is a cleanup ritual. Skip it and Azure will happily bill you for forgotten resources.
+==============================================================+
| SECTION 1 - ANALYTICS CONCEPTS |
+==============================================================+
Before we touch Azure, we need three anchor ideas:
+----------------------+
| 1. ANALYTICS |
| turn data into |
| decisions |
+----------+-----------+
|
| powered by
v
+----------------------+
| 2. MACHINE LEARNING|
| algorithms that |
| learn patterns |
+----------+-----------+
|
| challenged by
v
+----------------------+
| 3. THE 5 Vs |
| of Big Data |
+----------------------+
Analytics is the practice of turning raw data into useful insights that help people make better decisions.
That one line is the whole discipline. SQL queries, dashboards, ML models, data lakes: all of it is just tooling in service of that idea.
Imagine HitaVir Coffee, with 50 locations across India. Every day, each store generates data:
Data Stream | Data Stream |
Orders | Payments |
Loyalty | Inventory |
Shifts | Equipment |
Deliveries | Reviews |
One transaction alone is meaningless. But aggregate across 50 stores for a year and patterns emerge:
Observation | Action |
Mondays are slowest | Launch "Monday BOGO" |
Store #23 sells 2x pastries | Copy their layout |
Cappuccinos drop in summer | Push cold brew |
That journey, from raw events to actions, is analytics.
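The step from raw events to an observation such as "Mondays are slowest" can be sketched in a few lines. This is a minimal illustration only: the `ORDERS` rows are made-up sample values, not real HitaVir Coffee data, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical sample rows: (store_id, order_date, cups_sold).
ORDERS = [
    (23, date(2026, 4, 20), 310),  # Monday
    (23, date(2026, 4, 21), 450),  # Tuesday
    (23, date(2026, 4, 22), 520),  # Wednesday
    (7, date(2026, 4, 20), 280),   # Monday
    (7, date(2026, 4, 21), 430),   # Tuesday
    (7, date(2026, 4, 22), 500),   # Wednesday
]

def cups_by_weekday(orders):
    """Aggregate: total cups sold per weekday across all stores."""
    totals = defaultdict(int)
    for _store, day, cups in orders:
        totals[WEEKDAYS[day.weekday()]] += cups
    return dict(totals)

def slowest_weekday(orders):
    """Observation: the weekday with the lowest total."""
    totals = cups_by_weekday(orders)
    return min(totals, key=totals.get)

print(cups_by_weekday(ORDERS))
print("Slowest day:", slowest_weekday(ORDERS))
```

A single row says nothing; the aggregate is what points to an action such as a Monday promotion.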
+================================================================+
| |
| L4 PRESCRIPTIVE "What should we do?" |
| |
| ^ |
| | |
| L3 PREDICTIVE "What will happen?" |
| |
| ^ |
| | |
| L2 DIAGNOSTIC "Why did it happen?" |
| |
| ^ |
| | |
| L1 DESCRIPTIVE "What happened?" |
| |
+================================================================+
Level | Question | Example | Powered By |
L1 Descriptive | What happened? | "Sold 12,400 cappuccinos" | SQL / BI |
L2 Diagnostic | Why did it happen? | "Bean shortage hit week 3" | SQL + drill-down |
L3 Predictive | What will happen? | "June sales up 15%" | Machine learning |
L4 Prescriptive | What should we do? | "Order 200 kg by May 25" | ML + optimization |
Most companies live at L1-L2. Analytics engineers build the foundation that makes L3-L4 possible.
HitaVir Tech says: "Never build a dashboard nobody looks at. Always ask: what decision will this insight change? If the answer is 'none', don't build it."
L1-L2 is Synapse + Power BI. L3-L4 adds Azure ML + Azure OpenAI.
Machine Learning (ML) is a branch of AI where algorithms learn patterns from data instead of being explicitly programmed.
+-----------------------------+ +-----------------------------+
| TRADITIONAL PROGRAMMING | | MACHINE LEARNING |
| ------------------------- | | ------------------------- |
| | | |
| Rules + Data | | Data + Answers |
| | | | | |
| v | | v |
| Program | | Learned Model |
| | | | | |
| v | | v |
| Answer | | Rules (weights) |
| | | |
| Human writes the rules. | | Machine learns the rules. |
+-----------------------------+ +-----------------------------+
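The contrast in the two boxes can be made concrete with a toy example. In the sketch below, the hand-written rule uses a cutoff a human chose, while `learn_threshold` picks the cutoff from labeled examples (a one-feature decision stump). The data and the function names are illustrative assumptions, not part of any Azure service.

```python
def rule_based_is_fraud(amount):
    """Traditional programming: a human chose the 10_000 cutoff."""
    return amount > 10_000

def learn_threshold(examples):
    """Machine learning in miniature: choose the cutoff that
    misclassifies the fewest labeled examples."""
    candidates = sorted({amount for amount, _ in examples})
    best_cutoff, best_errors = None, len(examples) + 1
    for cutoff in candidates:
        errors = sum((amount > cutoff) != is_fraud
                     for amount, is_fraud in examples)
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff

# Hypothetical labeled data: (transaction amount, was it fraud?)
LABELED = [(120, False), (900, False), (2_500, False),
           (4_000, True), (7_500, True), (15_000, True)]

cutoff = learn_threshold(LABELED)
print("Learned cutoff:", cutoff)
print("Rule flags 4000?", rule_based_is_fraud(4_000))
print("Model flags 4000?", 4_000 > cutoff)
```

The learned cutoff comes from the data, so it changes when the labeled examples change; the hand-written rule does not.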
Flavor | Data Needed | Example | Azure Service |
Supervised | Labeled examples | Spam filter, fraud detection, image classification | Azure ML |
Unsupervised | No labels | Customer segmentation, anomaly detection, topic modeling | Azure ML, AI Language |
Reinforcement | Reward signals | Game AI, robotics, recommenders | Azure ML, Personalizer |
Level | Stops at SQL | Needs ML |
L1 Descriptive | OK | - |
L2 Diagnostic | OK | - |
L3 Predictive | - | Machine learning |
L4 Prescriptive | - | ML + optimization |
HitaVir Tech says: "ML is not magic; it's statistics at scale. If your analytics foundations are shaky, your ML models will be too. Clean data first, cool models second."
Coming up in "Azure Services for Value" (L3-L4 analytics).
+==============================================================+
| SECTION 2 - THE 5 Vs OF BIG DATA |
+==============================================================+
In 2001, analyst Doug Laney described big-data challenges with three Vs: Volume, Variety, Velocity. Later the industry added Veracity (trust) and Value (outcome). Together they form the universal diagnostic checklist.
*
VOLUME
How much is it?
/ \
/ \
/ \
/ \
VARIETY VELOCITY
How many formats? How fast?
\ /
\ /
\ /
\ /
VERACITY VALUE
Can we trust it? Worth it?
V | Name | Question |
1 | VOLUME | How much? (scale) |
2 | VARIETY | How many formats? |
3 | VELOCITY | How fast? (speed) |
4 | VERACITY | Can we trust it? |
5 | VALUE | What outcome? |
Miss any one V and your data platform has a hole. Let's tour each.
Volume -> ADLS Gen2. Variety -> Data Factory. Velocity -> Event Hubs. Veracity -> Purview. Value -> Power BI.
+==============================================================+
| THE 1st V - VOLUME |
| "How much data do we have?" |
+==============================================================+
Volume is the size of your data: how many bytes, rows, events, or files you must store, move, and process.
Unit | Power | What It Holds |
Byte (B) | 1 | A letter |
KB | 10^3 | One email |
MB | 10^6 | One song |
GB | 10^9 | One DVD |
TB | 10^12 | One year of company sales |
PB | 10^15 | One day of YouTube uploads |
EB | 10^18 | All of Netflix streaming |
ZB | 10^21 | The entire internet per year |
A traditional database runs fine up to ~1-10 TB. Past that, things break.
At big-data scale, you need distributed systems โ hundreds of machines sharing the load.
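A rough back-of-the-envelope calculation shows why one machine stops being enough. The sketch below assumes a read throughput of 200 MB/s per machine and perfect parallelism; both are illustrative assumptions, not measurements of any Azure service.

```python
TB = 10**12  # bytes, using decimal units (1 TB = 10^12 bytes)

def scan_hours(total_bytes, mb_per_sec_per_machine, machines=1):
    """Hours needed to read total_bytes once, assuming perfect parallelism."""
    bytes_per_sec = mb_per_sec_per_machine * 10**6 * machines
    return total_bytes / bytes_per_sec / 3600

# Assumed throughput: 200 MB/s per machine (illustrative, not a benchmark).
single = scan_hours(10 * TB, 200)
cluster = scan_hours(10 * TB, 200, machines=100)
print(f"1 machine: {single:.1f} h, 100 machines: {cluster * 60:.1f} min")
```

Under these assumptions a full scan of 10 TB drops from most of a day to a few minutes, which is the basic argument for distributed engines.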
Company | Daily Volume |
Instagram | 100M+ photos uploaded |
Amazon | Billions of events |
Uber | 10s of TB of trip data |
Netflix | PB of logs and streams |
Bing | Unimaginable |
HitaVir Tech says: "What works at 10 GB catastrophically fails at 10 TB. Always ask: how does this scale at 100x?"
Volume in one line: design for 100x your current data, or rebuild painfully later.
Coming up in "Azure Services for Volume".
+==============================================================+
| THE 2nd V - VARIETY |
| "How many kinds of data?" |
+==============================================================+
Variety is the diversity of data: formats, schemas, and sources.
Twenty years ago, "data" meant rows in a database. Today, it means far more:
Type | Examples | Schema |
Structured | SQL tables, CSV, spreadsheets | Fixed |
Semi-structured | JSON from APIs, XML, Parquet, Avro | Flexible |
Unstructured | Images, video, audio, PDFs, free text | None |
Each format needs different tooling:
Format | Tool Needed |
SQL tables | Relational engine |
JSON | Document parser |
PDF | OCR |
Image | Computer vision |
Audio | Speech-to-text |
Free text | NLP / embeddings |
Most real projects combine these. Example: "Correlate support emails + call recordings + order history into one insight." Three completely different pipelines feeding one answer.
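To illustrate why each format needs its own handling, here is a minimal sketch that reads one structured source (CSV) and one semi-structured source (JSON) into a shared record shape. The sample payloads and field names are hypothetical.

```python
import csv
import io
import json

CSV_TEXT = "order_id,amount\n1001,250.0\n1002,99.5\n"             # structured
JSON_TEXT = '[{"order_id": 1003, "amount": 40, "note": "gift"}]'  # semi-structured

def from_csv(text):
    """CSV values arrive as strings, so types must be applied explicitly."""
    return [{"order_id": int(row["order_id"]), "amount": float(row["amount"])}
            for row in csv.DictReader(io.StringIO(text))]

def from_json(text):
    """JSON keeps types but may carry extra fields; keep only the shared schema."""
    return [{"order_id": int(row["order_id"]), "amount": float(row["amount"])}
            for row in json.loads(text)]

records = from_csv(CSV_TEXT) + from_json(JSON_TEXT)
print(records)
```

Each parser is small, but every new format adds another one; that is the cost Variety describes.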
Industry | Variety Mix |
Healthcare | Patient records + X-rays + doctor's notes |
Retail | Orders + product photos + reviews |
Banking | Transactions + scanned checks + call transcripts |
Autonomous cars | Sensors + video + maps + LiDAR |
HitaVir Tech says: "90% of the world's data is unstructured. But 90% of analytics happens on structured data. Your job is often to convert chaos into order."
Variety in one line: structure is created, not found; choose tools that embrace format diversity.
Coming up in "Azure Services for Variety".
+==============================================================+
| THE 3rd V - VELOCITY |
| "How fast is the data?" |
+==============================================================+
Velocity is the speed at which data arrives, moves, and must be processed to deliver value.
Freshness | Approach | Example Use Case |
Next day | Batch (nightly) | Finance reports |
Every hour | Mini-batch | Ops dashboards |
Seconds | Streaming | Live pricing |
Milliseconds or less | Real-time | Fraud detection, HFT |
Problem | Solution |
Disks too slow | In-memory / caches |
Batch SQL too slow | Stream-processing engines |
One machine too small | Horizontal auto-scaling |
Failures = data loss | Durable logs (Kafka / Event Hubs) |
Scenario | Required Latency |
Credit card fraud | Under 100 ms |
Stock trading | Microseconds |
Social feed | Seconds |
Delivery tracking | Minutes |
Exec dashboard | Hourly |
Month close | Daily batch |
HitaVir Tech says: "Streaming is fashionable. Batch is profitable. 80% of real-world analytics runs on batch; don't reach for streaming unless the business truly cannot wait."
Velocity in one line: match the pipeline's speed to the decision's deadline, no faster.
Coming up in "Azure Services for Velocity".
+==============================================================+
| THE 4th V - VERACITY |
| "Can we trust the data?" |
+==============================================================+
Veracity is the accuracy, consistency, and trustworthiness of your data.
Big volumes and fast pipelines are useless if the data is wrong.
Enemy | Symptom |
Missing | NULL in required fields |
Duplicates | Same row repeated |
Inconsistent | 2024-01-05 vs 01/05/24 |
Outliers | Age = 347 |
Units | USD mixed with INR |
Bias | Only US users sampled |
Stale | Last updated 2019 |
Broken joins | Order with no user |
Noise | Flaky sensor readings |
A beautiful dashboard built on bad data is worse than no dashboard: it creates false confidence. The most dangerous insight is a wrong insight someone believes.
Dimension | Question |
Completeness | Required fields populated? |
Accuracy | Data reflects reality? |
Consistency | Related systems agree? |
Timeliness | Is it current enough? |
Validity | Matches formats / rules? |
Uniqueness | Any unintended duplicates? |
Incident | Consequence |
NASA Mars Climate Orbiter (1999) | Lost $125M: metric vs imperial unit mismatch |
Knight Capital (2012) | $440M loss in 45 minutes: a faulty software deployment sent erroneous orders |
Google Flu Trends | Overestimated flu peaks 100%+ due to search bias |
HitaVir Tech says: "Senior engineers obsess over data quality. Juniors obsess over cool tools. Guess which group builds systems that actually work in production."
Veracity in one line: quality rules are a pipeline concern, not a hope.
Coming up in "Azure Services for Veracity".
+==============================================================+
| THE 5th V - VALUE |
| "What business outcome does it drive?" |
+==============================================================+
Value is the business outcome your data and analytics actually deliver: revenue gained, cost saved, risk reduced, customer experience improved.
Without Value, the other four Vs are expensive hobbies.
VALUE
(outcome)
^
| enabled by
|
Insights & decisions
^
| enabled by
|
Analytics + ML
^
| enabled by
|
Trustworthy (Veracity) data
^
| at the right
| speed (Velocity)
|
across Varieties
^
| stored at
|
the right scale (Volume)
Industry | Analytics Value |
Retail | Recommendation engine -> +20% revenue |
Banking | Fraud detection -> millions saved |
Logistics | Route optimization -> -15% fuel cost |
Healthcare | Early diagnosis models -> better outcomes |
Streaming | Personalized content -> higher retention |
Most companies have folders full of unused dashboards: the dashboard graveyard. Every one cost engineering time, storage, and compute.
The difference between a valuable dashboard and a graveyard dashboard:
+------------------------------------------------------+
| |
| "What decision will change because of this?" |
| |
| If nobody can answer -> DON'T BUILD IT. |
| |
+------------------------------------------------------+
HitaVir Tech says: "A data platform that costs more than the decisions it enables is a failure, no matter how beautiful the architecture. Lead with Value."
Value in one line: start from the decision, work backwards to the pipeline.
Coming up in "Azure Services for Value".
+==============================================================+
| SECTION 3 - AZURE SERVICES BY THE 5 Vs |
+==============================================================+
Now we map each V to the Azure services that solve it.
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
| | | | | | | | | | | | | |
| INGST | ->| STORE | ->| CATLG | ->| PROCS | ->| QUERY | ->| VIEW | ->| ACT |
| | | | | | | | | | | | | |
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
The 5 Vs tell you where the bottleneck is. The Azure services tell you what solves it.
+==============================================================+
| AZURE FOR VOLUME |
| "Store any amount of data, affordably." |
+==============================================================+
ADLS Gen2, Blob Storage, Synapse Analytics, HDInsight, Databricks
Service | Purpose |
Azure Data Lake Storage Gen2 | Object storage built on Blob, with a hierarchical namespace: the data-lake foundation |
Azure Blob Storage | Raw object storage with hot / cool / archive tiers |
Azure Archive Storage | Cheapest long-term vault (hours to retrieve) |
Azure Synapse Analytics | Analytics platform: dedicated + serverless SQL + Spark |
Azure HDInsight | Managed Hadoop / Spark / Kafka clusters |
Azure Databricks | Managed Apache Spark + MLflow + Delta Lake |
+--------------------------------------------------------------+
| AZURE DATA LAKE STORAGE GEN2 |
+--------------------------------------------------------------+
| Built on : Azure Blob Storage |
| Durability : 99.999999999% (11 nines, LRS) |
| Scale : Exabytes (practically unlimited) |
| Pricing : ~$0.018 / GB / month (hot LRS) |
| Features : Hierarchical namespace, POSIX ACLs |
| Read by : Synapse, Databricks, HDInsight, Power BI |
| |
| If you remember only one Azure service - make it ADLS Gen2. |
+--------------------------------------------------------------+
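The pricing figures quoted in this codelab can be turned into a quick estimate. The sketch below uses the approximate ~$0.018 / GB / month hot LRS figure from the card and the $5 / TB scanned figure quoted for Synapse Serverless SQL; both are treated as assumptions, since actual prices vary by region and change over time.

```python
HOT_LRS_USD_PER_GB_MONTH = 0.018  # approximate figure quoted in the card
SERVERLESS_USD_PER_TB = 5.0       # Synapse Serverless SQL figure quoted earlier

def monthly_storage_cost(terabytes):
    """Storage cost per month, using decimal units (1 TB = 1000 GB)."""
    return terabytes * 1000 * HOT_LRS_USD_PER_GB_MONTH

def scan_cost(bytes_scanned):
    """Query cost for a given number of bytes scanned."""
    return bytes_scanned / 10**12 * SERVERLESS_USD_PER_TB

print(f"500 TB stored: ${monthly_storage_cost(500):,.0f} per month")
print(f"1 GB scanned:  ${scan_cost(10**9):.3f}")
```

Estimates like this are useful mainly for comparing orders of magnitude, for example lab-sized data versus the "500 TB" question from the introduction.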
Tier | Access Pattern | Relative Cost |
Hot | Frequent access | $$$$ |
Cool | Monthly access | $$ |
Cold | Every 90+ days | $ |
Archive | Compliance vault, hours to rehydrate | ¢ |
+--------------------------------------------------------------+
| AZURE SYNAPSE ANALYTICS - Unified Analytics |
+--------------------------------------------------------------+
| Engines : Dedicated SQL Pool (MPP) + Serverless SQL + |
| Apache Spark pools + Data Explorer pools |
| Storage : ADLS Gen2 under the hood |
| SQL : T-SQL (SQL Server flavored) |
| Integration : One workspace, notebooks, pipelines, Power BI|
| |
| One UI for lake, warehouse, Spark, and BI. |
+--------------------------------------------------------------+
+--------------------------------------------------------------+
| AZURE DATABRICKS - Managed Spark + Delta Lake |
+--------------------------------------------------------------+
| Engine : Apache Spark (Photon-accelerated) |
| Superpower : Delta Lake (ACID on the data lake) |
| ML : MLflow, Feature Store, AutoML |
| Governance : Unity Catalog |
| |
| Use for petabyte-scale custom Spark + ML workloads. |
+--------------------------------------------------------------+
abfss://lake@hitavirtechanalytics.dfs.core.windows.net/
|
+-- raw/ <-- Bronze: untouched source data
| +-- sales/2026/04/22/orders.csv
| +-- inventory/2026/04/22/stock.json
|
+-- curated/ <-- Silver: cleaned, typed Parquet/Delta
| +-- sales_fact/year=2026/month=04/day=22/part-001.parquet
|
+-- analytics/ <-- Gold: pre-aggregated for BI
+-- daily_revenue/year=2026/month=04/day=22/part-001.parquet
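The `year=/month=/day=` folders in this layout follow a Hive-style partitioning convention. A small helper, sketched here with hypothetical function names, can build such paths consistently so that every writer produces the same structure.

```python
from datetime import date

def partition_path(zone, table, day, filename):
    """Build a Hive-style partitioned path (year=/month=/day=)."""
    return (f"{zone}/{table}/year={day.year}/month={day.month:02d}/"
            f"day={day.day:02d}/{filename}")

def abfss_uri(container, account, path):
    """Full URI in the abfss://container@account.dfs.core.windows.net/ form."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

path = partition_path("curated", "sales_fact", date(2026, 4, 22),
                      "part-001.parquet")
print(abfss_uri("lake", "hitavirtechanalytics", path))
```

Zero-padding the month and day keeps folders sorted lexicographically, and query engines can skip whole partitions when a filter names a date.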
How much data?
|
+--------------------+--------------------+
| | |
< 1 TB 1-100 TB > 100 TB
| | |
v v v
Azure SQL DB ADLS + ADLS + Databricks +
or Synapse Synapse Synapse Dedicated +
Serverless Serverless Purview governance
(small + cheap) (most common) (huge platform)
HitaVir Tech says: "Start with ADLS Gen2. Every Azure analytics service reads from it. You'll never regret putting data into the lake; you may regret putting it anywhere else first."
+==============================================================+
| AZURE FOR VARIETY |
| "Handle any data format, elegantly." |
+==============================================================+
Data Factory, Synapse, Cosmos DB, AI Vision, Doc Intelligence, AI Speech, AI Language, AI Search
Service | Purpose |
ADLS Gen2 | Holds every format: CSV, JSON, Parquet, images, video |
Azure Data Factory | ETL / ELT, 100+ connectors, mapping data flows |
Synapse Serverless SQL | T-SQL over CSV, JSON, Parquet, and Delta files directly in ADLS |
Azure Cosmos DB | Multi-model NoSQL (document, graph, key-value) |
Azure AI Vision | Images / video -> structured labels, OCR |
Azure AI Document Intelligence | PDFs, forms, invoices -> text and tables |
Azure AI Speech | Speech -> text, speaker ID, translation |
Azure AI Language | NLP: sentiment, entities, summarization |
Azure AI Search | Full-text and vector search over any source |
+--------------------------------------------------------------+
| AZURE DATA FACTORY - Serverless ETL / ELT |
+--------------------------------------------------------------+
| Connectors : 100+ (SQL, SaaS, files, APIs, on-prem) |
| Pipelines : Drag-and-drop + code-free mapping data flows |
| Triggers : Schedule, event-based, tumbling window |
| Integration : Git (Azure DevOps / GitHub) |
| |
| The "orchestrator" of Azure data platforms. |
+--------------------------------------------------------------+
INPUT OUTPUT
----- ------
CSV + +---- Parquet
JSON +---> Copy Activity ---> Mapping Data Flow -+ (optimized)
Parquet + (schema + +---- Delta
Oracle | transforms) tables
SAP +
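-- Join Parquet orders with line-delimited JSON customers in one query.
SELECT c.customer_id, SUM(o.amount) AS total_spent
FROM OPENROWSET(
    BULK 'https://hitavirtech.dfs.core.windows.net/lake/orders/*.parquet',
    FORMAT = 'PARQUET'
) AS o
JOIN (
    -- Each JSON line is read as one text column, then parsed with JSON_VALUE.
    SELECT JSON_VALUE(doc, '$.customer_id') AS customer_id
    FROM OPENROWSET(
        BULK 'https://hitavirtech.dfs.core.windows.net/lake/customers/*.json',
        FORMAT = 'CSV', FIELDTERMINATOR = '0x0b', FIELDQUOTE = '0x0b'
    ) WITH (doc NVARCHAR(MAX)) AS r
) AS c ON c.customer_id = o.customer_id
GROUP BY c.customer_id;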
Serverless SQL reads CSV, JSON, Parquet, Delta directly from ADLS. You never leave T-SQL.
Input | Azure Service | Output |
Images | AI Vision | Labels, faces, OCR |
PDFs / scans | Doc Intelligence | Extracted text + tables + key-value |
Audio / voice | AI Speech | Transcripts + speaker ID |
Free text | AI Language | Sentiment, entities, summarization |
Translations | Translator | 100+ languages |
Magic step: chaos in -> structured features out -> then into ADLS / Synapse / Databricks as normal.
Customer review (raw text)
|
v
AI Language ---> sentiment = negative, topic = shipping
|
v
ADLS Gen2 (enriched records)
|
v
Data Factory ---> Synapse table
|
v
Synapse SQL: "avg sentiment per product / month"
|
v
Power BI dashboard for the CX team
|
v
Action: fix shipping partner in Region X
What's my data?
|
+---------+---------+--+---+---------+---------+
| | | | | |
Tabular JSON Images PDFs Audio Free text
| | | | | |
v v v v v v
ADLS + ADLS + AI Doc AI AI
Synapse Synapse Vision Intell. Speech Language
or Cosmos
HitaVir Tech says: "The magic of modern analytics: unstructured data becomes structured features in minutes via Azure AI services. What took PhDs years a decade ago is now an API call."
+==============================================================+
| AZURE FOR VELOCITY |
| "Move and process data in real time." |
+==============================================================+
Event Hubs, Stream Analytics, Data Explorer, Functions, Event Grid, Logic Apps
Service | Purpose |
Azure Event Hubs | Real-time event stream (Kafka-compatible) |
Event Hubs Capture | Buffered delivery to ADLS / Blob (no code) |
Azure Stream Analytics | SQL on streams, sub-second latency |
Azure Data Explorer (Kusto) | Blazing-fast time-series + log analytics |
Azure Functions | Event-driven serverless code |
Azure Event Grid | Serverless event bus across Azure |
Azure Service Bus / Queue Storage | Queue and pub-sub messaging |
+--------------------------------------------------------------+
| AZURE EVENT HUBS - Real-time Event Stream |
+--------------------------------------------------------------+
| Latency : Sub-second |
| Retention : 1-7 days (90 days on Premium) |
| Throughput : MB/sec per partition, scale by partition |
| Kafka API : Yes - drop-in for Kafka clients |
| |
| The "high-speed conveyor belt" for events on Azure. |
+--------------------------------------------------------------+
PRODUCERS EVENT HUBS CONSUMERS
------------ ------------ ------------
App events +----------------------+ Functions
Clickstreams --->| >> >> >> >> >> >> |---> Stream Analytics
IoT sensors | durable, ordered, | Capture -> ADLS
Transactions | partitioned | Data Explorer
Card swipes +----------------------+ Synapse
Event Hubs holds events durably. Multiple consumers read the same stream independently.
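The idea of a durable, partitioned log that several consumers read independently can be modeled in a few lines. `TinyEventLog` below is a toy stand-in written for illustration; it is not the Event Hubs SDK, and its class and method names are hypothetical.

```python
import zlib

class TinyEventLog:
    """Toy stand-in for a partitioned, append-only event stream."""

    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]
        self.offsets = {}  # (consumer_group, partition) -> next index to read

    def send(self, key, event):
        # The same key always maps to the same partition,
        # which preserves ordering per key.
        index = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[index].append(event)
        return index

    def read(self, consumer_group, partition):
        # Reading advances only this group's offset; events are not removed.
        start = self.offsets.get((consumer_group, partition), 0)
        events = self.partitions[partition][start:]
        self.offsets[(consumer_group, partition)] = start + len(events)
        return events

log = TinyEventLog()
p = log.send("card-42", {"amount": 120})
log.send("card-42", {"amount": 15000})

fraud_view = log.read("fraud-function", p)   # first consumer group
archive_view = log.read("capture", p)        # second group sees the same events
print(fraud_view == archive_view)
```

Because each consumer group tracks its own offset, adding a new consumer never affects what the existing ones receive.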
+--------------------------------------------------------------+
| EVENT HUBS CAPTURE - The Easy Button |
+--------------------------------------------------------------+
| Model : Fully managed, no code |
| Buffer : Time window or size (whichever first) |
| Format : Avro (native) or Parquet via Stream Analytics|
| Sinks : ADLS Gen2, Blob Storage |
| |
| No cluster - the laziest streaming archive on Azure. |
+--------------------------------------------------------------+
Producers ---> Event Hubs ---> Capture (no servers) ---> ADLS
auto-write every N mins
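The "time window or size, whichever first" rule in the Capture card can be sketched as a small buffer. This is a simplified model written for illustration: `CaptureBuffer` is a hypothetical class, and it checks the time window only when a new event arrives, whereas a managed service runs on its own clock.

```python
class CaptureBuffer:
    """Toy model of 'flush on time window or size, whichever comes first'."""

    def __init__(self, max_seconds, max_bytes):
        self.max_seconds, self.max_bytes = max_seconds, max_bytes
        self.events, self.size, self.window_start = [], 0, None
        self.flushed = []  # each entry stands for one file written to the lake

    def add(self, timestamp, payload):
        if self.window_start is None:
            self.window_start = timestamp
        # Time trigger: close the open window before accepting this event.
        if timestamp - self.window_start >= self.max_seconds:
            self._flush()
            self.window_start = timestamp
        self.events.append(payload)
        self.size += len(payload)
        # Size trigger: flush as soon as the buffer reaches the byte limit.
        if self.size >= self.max_bytes:
            self._flush()

    def _flush(self):
        if self.events:
            self.flushed.append(self.events)
        self.events, self.size, self.window_start = [], 0, None

buf = CaptureBuffer(max_seconds=300, max_bytes=10)
buf.add(0, b"aaaa")       # 4 bytes buffered
buf.add(10, b"bbbbbbb")   # 11 bytes total, so the size trigger fires
buf.add(20, b"cc")        # a new window starts at t=20
buf.add(400, b"dd")       # 380 s later, so the time trigger flushes "cc" first
print(buf.flushed)
```

Small windows give fresher files but more of them; larger windows give fewer, bigger files that are cheaper to query.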
SELECT user_id, amount, location
INTO FraudAlerts
FROM TransactionsStream TIMESTAMP BY event_time
WHERE amount > 10000 OR is_foreign = 1;
Results arrive in near real time, not after the nightly batch.
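The same filter can be sketched in plain Python to show what "processing a stream" means: each event is evaluated as it arrives, without waiting for the full dataset. The events below are hypothetical, and `fraud_alerts` is an illustrative function, not part of Stream Analytics.

```python
def fraud_alerts(transactions, amount_limit=10_000):
    """Yield alerts one event at a time, mirroring the WHERE clause
    of the streaming query."""
    for tx in transactions:
        if tx["amount"] > amount_limit or tx["is_foreign"] == 1:
            # Project only the columns named in the SELECT list.
            yield {k: tx[k] for k in ("user_id", "amount", "location")}

# Hypothetical events; a real job would read them from an Event Hubs input.
stream = iter([
    {"user_id": "u1", "amount": 50, "location": "Pune",
     "is_foreign": 0, "event_time": 1},
    {"user_id": "u2", "amount": 25_000, "location": "Delhi",
     "is_foreign": 0, "event_time": 2},
    {"user_id": "u3", "amount": 80, "location": "Paris",
     "is_foreign": 1, "event_time": 3},
])

alerts = list(fraud_alerts(stream))
print(alerts)
```

The generator never holds more than one event in memory, which is the property that lets streaming engines keep latency low.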
Blob created +
Event Hubs +---> Azure Function ---> Synapse load
Cosmos change + |
Event Grid + +----> Service Bus / Queue alert
Perfect for: event reactions, enrichment, alerting, small transforms.
+--------------------------------------------------------------+
| AZURE DATA EXPLORER (ADX / KUSTO) |
+--------------------------------------------------------------+
| Category : Time-series + log analytics |
| Query lang : Kusto Query Language (KQL) |
| Scale : Billions of rows, sub-second |
| Ingest rate : GB/sec, auto-indexed |
| |
| Same engine powers Azure Monitor, Sentinel, Log Analytics. |
+--------------------------------------------------------------+
Rideshare app (1 million events/sec)
|
v
Azure Event Hubs
|
+--------+--------+---------+
| | | |
v v v v
Function Stream Capture
fraud Analytics buffer
flag real-time --> ADLS (Parquet)
| | |
v v v
Service Power BI Synapse
Bus live dash Serverless
alert (historical)
How fresh must the data be?
|
+---------+----------+---+---+-----------+
| | | | |
Next day 15 minutes Seconds Sub-second Kafka shop
| | | | |
v v v v v
ADF Capture Event Stream Event Hubs
pipeline -> ADLS Hubs + Analytics (Kafka API)
Function
HitaVir Tech says: "Every team thinks they need real-time until they see the bill. Start with Event Hubs Capture and 5-minute micro-batches, then graduate later. Most of the time, you won't need to."
+==============================================================+
| AZURE FOR VERACITY |
| "Make sure the data is trustworthy." |
+==============================================================+
Mapping Data Flows, Microsoft Purview, Info Protection, Activity Log, Defender for Cloud, Key Vault, Azure Monitor
Service | Purpose |
ADF Mapping Data Flows | Visual data cleaning and profiling |
Microsoft Purview Data Quality | Rule-based DQ checks |
Great Expectations / Deequ on Spark | Unit tests for data (open-source in Databricks) |
Microsoft Purview | Data catalog + lineage + policy |
Azure Activity Log | Audit every control-plane change |
Azure Monitor + Log Analytics | Resource and pipeline telemetry |
Microsoft Defender for Cloud | Discover and classify PII, CSPM |
Azure Key Vault | Manage encryption keys and secrets |
+--------------------------------------------------------------+
| ADF MAPPING DATA FLOWS - No-Code Data Prep |
+--------------------------------------------------------------+
| Interface : Visual, drag-and-drop |
| Transforms : 90+ (nulls, dupes, dates, joins, aggs...) |
| Engine : Spark (managed, auto-scaled) |
| Debug : Interactive, with sample data |
| |
| Hand this to business analysts - no Spark code needed. |
+--------------------------------------------------------------+
Source ---> Profile ---> Apply transforms ---> Sink
(stats, (fill nulls, parse (to ADLS,
anomalies) dates, dedupe) Synapse)
+--------------------------------------------------------------+
| MICROSOFT PURVIEW - Unified Data Governance |
+--------------------------------------------------------------+
| Catalog : Scan ADLS, Synapse, SQL, Power BI, S3, GCS |
| Lineage : End-to-end column-level lineage |
| DQ rules : Completeness, uniqueness, ranges, custom |
| Insights : Sensitivity labels, hotspots, adoption |
| |
| The "nervous system" for multi-cloud data governance. |
+--------------------------------------------------------------+
RULES CHECK RESULT
----- -------------
order_id IS NOT NULL ... PASS
amount BETWEEN 0 AND 1_000_000 ... PASS
customer_id IN customers ... PASS
COUNT(DISTINCT order_id) = COUNT(*) ... FAIL - 23 dupes!
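Rules like the four listed here are simple to express as code. The sketch below is a plain-Python illustration with hypothetical sample rows; it is not the Purview Data Quality API, and the function name is an assumption made for the example.

```python
def run_quality_checks(orders, valid_customer_ids):
    """Return {rule_name: number_of_failing_rows} for four basic rules."""
    ids = [o["order_id"] for o in orders]
    non_null_ids = [i for i in ids if i is not None]
    return {
        "order_id IS NOT NULL": sum(i is None for i in ids),
        "amount BETWEEN 0 AND 1_000_000":
            sum(not (0 <= o["amount"] <= 1_000_000) for o in orders),
        "customer_id IN customers":
            sum(o["customer_id"] not in valid_customer_ids for o in orders),
        "order_id is unique": len(non_null_ids) - len(set(non_null_ids)),
    }

# Hypothetical rows, including one duplicated order_id.
SAMPLE_ORDERS = [
    {"order_id": 1, "amount": 250, "customer_id": "c1"},
    {"order_id": 2, "amount": 90, "customer_id": "c2"},
    {"order_id": 2, "amount": 90, "customer_id": "c2"},
]

results = run_quality_checks(SAMPLE_ORDERS, {"c1", "c2"})
for rule, failures in results.items():
    status = "PASS" if failures == 0 else f"FAIL - {failures} rows"
    print(f"{rule:32} {status}")
```

Returning a failure count per rule, rather than a single boolean, makes it easy to route failing rows to quarantine and to track trends over time.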
+--------------------------------------------------------------+
| MICROSOFT DEFENDER FOR CLOUD - CSPM + PII detection |
+--------------------------------------------------------------+
| Method : ML-based classification of storage contents |
| Finds : Credit cards, SSNs, emails, addresses |
| Output : Severity alerts -> Sentinel SIEM |
| |
| Sniffs sensitive data hiding in your storage accounts. |
+--------------------------------------------------------------+
Essential for regulated industries (finance, healthcare, gov).
PROFILE ---> DEFINE RULES ---> ENFORCE ---> ALERT ---> FIX ---> MONITOR
(know) (expected) (every run) (fail fast) (fix) (trends)
^ |
| |
+---------------------------- loop ---------------------------------------+
Raw sales CSV from 50 stores
|
v
ADF Mapping Data Flow reads it
|
v
Purview Data Quality rules:
PASS - order_id unique
PASS - amount in [0, 1M]
FAIL - store_id in valid list (12 rows failed)
|
v
+--+--+
v v
Quarantine Curated
container + zone
Teams alert (Parquet)
HitaVir Tech says: "Every pipeline must have quality rules, not 'someday' but from day one. It is 10x cheaper to catch bad data at ingest than to explain a wrong dashboard to the CEO."
+==============================================================+
| AZURE FOR VALUE |
| "Turn data into decisions and ROI." |
+==============================================================+
Power BI, Azure ML, Azure OpenAI, Anomaly Detector, Personalizer, Metrics Advisor
Service | Purpose |
Microsoft Power BI | Dashboards, BI, natural-language analytics |
Azure Machine Learning | Build, train, deploy ML models |
Azure AI Anomaly Detector | No-code time-series anomaly detection |
Azure AI Personalizer | Contextual recommendation engine |
Azure AI Metrics Advisor | Proactive KPI anomaly monitoring |
Azure OpenAI Service | OpenAI GPT-family LLMs hosted on Azure |
Power BI Copilot | Ask data questions in natural language |
Synapse ML / Fabric ML | ML in notebooks next to your data |
Microsoft Fabric | SaaS analytics: Lakehouse + Warehouse + BI |
+--------------------------------------------------------------+
| MICROSOFT POWER BI - Business Intelligence |
+--------------------------------------------------------------+
| Sources : Synapse, SQL, ADLS, Fabric, 100+ connectors |
| Engine : VertiPaq (in-memory columnar) |
| Superpowers : Copilot (natural language) + Embedded |
| Editions : Free | Pro | Premium | Fabric |
| |
| Named a Leader in Gartner's BI Magic Quadrant for 17 years. |
+--------------------------------------------------------------+
Synapse / SQL / ADLS ---> Dataset ---> Visuals ---> Report ---> App
(data source) (VertiPaq (charts) (pages) (share)
cache)
+--------------------------------------------------------------+
| AZURE MACHINE LEARNING - Full-Lifecycle ML |
+--------------------------------------------------------------+
| Studio : Browser IDE for ML |
| AutoML : Tries many models automatically |
| Pipelines : Train / evaluate / deploy as YAML + CLI v2 |
| MLOps : Model registry, endpoints, monitoring |
| |
| From raw data to deployed model - one platform. |
+--------------------------------------------------------------+
+--------------------------------------------------------------+
| AZURE OPENAI SERVICE |
+--------------------------------------------------------------+
| Models : GPT-4, GPT-4o, embeddings, DALL-E, whisper |
| Compliance : SOC, HIPAA, FedRAMP, private network |
| Integration : RAG with AI Search, Cognitive Services |
| |
| Same OpenAI models, but your data never leaves Azure. |
+--------------------------------------------------------------+
Your docs (ADLS)
|
v
AI Search (vector index)
|
v
Azure OpenAI (GPT-4o with RAG)
|
v
Chat / Copilot experiences
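The retrieval step in this flow can be illustrated without any cloud service. The sketch below uses word counts as a toy "embedding" and cosine similarity for ranking; a real deployment would use a learned embedding model and a vector index. All names and documents here are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts. Real systems use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, docs, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, docs):
    """Ground the model by placing retrieved text ahead of the question."""
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical documents standing in for files in the lake.
DOCS = ["refund policy allows returns within 30 days",
        "store 23 sells twice as many pastries"]
print(build_prompt("what is the refund policy", DOCS))
```

The language model only sees the retrieved text plus the question, which is how RAG keeps answers tied to your own documents.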
+--------------------------------------------------------------+
| AZURE AI PERSONALIZER - Netflix-Style Recs |
+--------------------------------------------------------------+
| Inputs : Context features + reward signals |
| Use cases : "You may like...", "Related items..." |
| Real-time : Inference in milliseconds |
| |
| Contextual-bandit reinforcement learning, no PhD required. |
+--------------------------------------------------------------+
User types: "Show me top 5 products last quarter"
|
v
Copilot interprets -> writes DAX -> runs -> visualizes
|
v
Bar chart appears in 2 seconds
Analysts no longer gatekeep simple questions.
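-- Illustrative only: assumes a churn model trained elsewhere (e.g. Azure ML),
-- exported to ONNX, and stored in a hypothetical dbo.models table
-- in a Synapse dedicated SQL pool.
DECLARE @model VARBINARY(MAX) =
    (SELECT model FROM dbo.models WHERE name = 'churn_model');
-- then use it
SELECT d.customer_id, p.risk
FROM PREDICT(MODEL = @model,
             DATA = dbo.customer_features AS d,
             RUNTIME = ONNX)
WITH (risk FLOAT) AS p
WHERE p.risk > 0.8;
Model scoring without leaving your Synapse warehouse.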
+==============================================================+
| |
| BUSINESS VALUE revenue - savings - growth |
| ^ |
| | |
| APPLICATION LAYER Personalizer, Metrics Advisor, Anomaly Detector |
| ^ |
| | |
| ML LAYER Azure ML, Synapse ML, OpenAI |
| ^ |
| | |
| ANALYSIS LAYER Synapse + Power BI + Fabric |
| ^ |
| | |
| BUILD LAYER ADLS + ADF + Event Hubs + Purview |
| |
+==============================================================+
HitaVir Tech says: "The best data platform is worthless if nobody uses the outputs. Start from Value and work backwards: who sees which insight, and what decision changes? Design everything else to serve that."
+==============================================================+
| END-TO-END AZURE ANALYTICS STACK |
+==============================================================+
All 5 Vs combined into one living architecture:
INGEST (Velocity + Variety Layers)
+---------------------------+    +---------------------------+
| Azure Event Hubs          |    | Data Factory              |
| Event Hubs Capture        |    | AI Vision / Doc Intel.    |
| Functions                 |    | AI Speech / Language      |
| IoT Hub                   |    | Translator                |
+-------------+-------------+    +-------------+-------------+
              |                                |
              +----------------+---------------+
                               |
                               v
STORE (Volume Layer)
 +----------------------------------------------------------+
 | ADLS Gen2 Data Lake (raw / curated / analytics)          |
 |   <------> Microsoft Purview (catalog + lineage)         |
 +-----------------------------+----------------------------+
                               |
                               v
PROCESS + VERACITY
 +----------------------------------------------------------+
 | ADF Data Flows / Databricks / Synapse Spark              |
 |   <--- Purview DQ / Defender for Cloud / Policy          |
 +-----------------------------+----------------------------+
                               |
                               v
QUERY
      +--------------+  +--------------+  +--------------+
      | Synapse      |  | Synapse      |  | Azure ML     |
      | Serverless   |  | Dedicated    |  | (training)   |
      +------+-------+  +------+-------+  +------+-------+
             |                 |                 |
             +-----------------+-----------------+
                               |
                               v
VALUE LAYER
 +----------------------------------------------------------+
 | Power BI                    | Azure OpenAI (GPT)         |
 | Copilot (NL)                | Personalizer               |
 +-----------------------------+----------------------------+
                               |
                               v
                 Decisions - Revenue - Growth
Every box maps to a V. Every arrow is a managed Azure service.
+==============================================================+
| HANDS-ON LAB - ADLS -> SYNAPSE SERVERLESS SQL |
+==============================================================+
You will build a mini pipeline that touches Volume (ADLS Gen2), Variety (CSV auto-queried), and Value (SQL insights).
Step 1          Step 2-3       Step 4-5         Step 6          Step 7-9
+------+        +------+       +-----------+    +----------+    +----------+
| CSV  | -----> | ADLS | ----> | Synapse   | -> | Serverl. | -> | T-SQL    |
+------+        +------+       | workspace |    | SQL      |    +----------+
                               +-----------+    +----------+
Prepare         Upload         Create           Run             OPENROWSET
sample data     to container   workspace                        get insights
On your laptop, create a file called sales.csv:
order_id,customer,product,quantity,amount,order_date
1001,Ravi,Laptop,1,75000.00,2026-04-01
1002,Priya,Mouse,2,1500.00,2026-04-01
1003,Amit,Keyboard,1,3500.00,2026-04-02
1004,Ravi,Monitor,1,18000.00,2026-04-02
1005,Sneha,Headphones,1,4500.00,2026-04-03
1006,Priya,Laptop,1,75000.00,2026-04-03
1007,Vikram,USB Cable,3,900.00,2026-04-04
1008,Neha,Webcam,1,5000.00,2026-04-04
1009,Ravi,Mouse,1,750.00,2026-04-05
1010,Amit,Monitor,1,18000.00,2026-04-05
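Before uploading, it can be worth checking the file locally, since a malformed row is far cheaper to catch on a laptop than in a query. The following is a minimal sketch, assuming the six-column layout shown above; `validate_sales_csv` and `EXPECTED_HEADER` are hypothetical names, and the rules (positive integer quantity, ISO dates) are illustrative rather than part of the lab.

```python
import csv
import io
from datetime import date
from decimal import Decimal, InvalidOperation

EXPECTED_HEADER = ["order_id", "customer", "product", "quantity", "amount", "order_date"]

def validate_sales_csv(text: str) -> list[str]:
    """Return human-readable problems; an empty list means the file looks clean."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    problems = []
    # Record numbers equal line numbers here because no field spans multiple lines.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected {len(EXPECTED_HEADER)} fields, got {len(row)}")
            continue
        order_id, _customer, _product, quantity, amount, order_date = row
        if not order_id.isdecimal():
            problems.append(f"line {line_no}: order_id {order_id!r} is not an integer")
        if not quantity.isdecimal() or int(quantity) < 1:
            problems.append(f"line {line_no}: quantity {quantity!r} is not a positive integer")
        try:
            Decimal(amount)
        except InvalidOperation:
            problems.append(f"line {line_no}: amount {amount!r} is not a number")
        try:
            date.fromisoformat(order_date)
        except ValueError:
            problems.append(f"line {line_no}: order_date {order_date!r} is not YYYY-MM-DD")
    return problems

if __name__ == "__main__":
    sample = (
        "order_id,customer,product,quantity,amount,order_date\n"
        "1001,Ravi,Laptop,1,75000.00,2026-04-01\n"
        "1002,Priya,Mouse,two,1500.00,04/01/2026\n"
    )
    for problem in validate_sales_csv(sample):
        print(problem)
```

To check the real file, read it with `open("sales.csv", encoding="utf-8").read()` and pass the text to `validate_sales_csv`.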
1. Sign in to the Azure Portal (portal.azure.com)
2. Resource group: rg-hitavirtech-analytics (create new)
3. Storage account name: hitavirtechanalyticsXX (lowercase, unique; add random digits)
4. Region (e.g. Central India)
5. Container: lake → Create
6. lake → + Add Directory → raw
7. raw/ → + Add Directory → sales
8. raw/sales/ → Upload → select sales.csv → Upload
Your file now lives at:
https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv
1. Resource group: rg-hitavirtech-analytics
2. Workspace name: syn-hitavirtech-XX
3. Data Lake Storage Gen2: account hitavirtechanalyticsXX + container lake
Without this step, Synapse Serverless SQL cannot read your files.
1. Browse to lake → raw → sales
2. Right-click sales.csv → New SQL script → Select TOP 100 rows
3. Review the generated OPENROWSET query → click Run
You should see 10 rows.
Replace the auto-generated query with:
SELECT *
FROM OPENROWSET(
    BULK 'https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales;
Replace hitavirtechanalyticsXX with your actual account name. Click Run.
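Forgetting to swap in the real account name is an easy mistake. The sketch below builds the URL and the query from parts and rejects the `XX` placeholder, relying on the rule that storage account names are 3-24 lowercase letters and digits. `adls_url` and `openrowset_query` are hypothetical helpers, and `hitavirtechanalytics42` is a made-up account name.

```python
import re

def adls_url(account: str, container: str, path: str) -> str:
    """Build the ADLS Gen2 (dfs endpoint) URL that OPENROWSET reads from."""
    if not re.fullmatch(r"[a-z0-9]{3,24}", account):
        raise ValueError("storage account names are 3-24 lowercase letters and digits")
    return f"https://{account}.dfs.core.windows.net/{container}/{path.lstrip('/')}"

def openrowset_query(account: str, container: str = "lake",
                     path: str = "raw/sales/sales.csv") -> str:
    """Return the SELECT * query from the lab with the URL filled in."""
    url = adls_url(account, container, path)
    return (
        "SELECT *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT = 'CSV',\n"
        "    PARSER_VERSION = '2.0',\n"
        "    HEADER_ROW = TRUE\n"
        ") AS sales;"
    )

if __name__ == "__main__":
    # Made-up account name; substitute your own.
    print(openrowset_query("hitavirtechanalytics42"))
```

Paste the printed query into the Synapse SQL script editor and run it as in the step above.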
-- Top customers by revenue
SELECT customer, SUM(CAST(amount AS DECIMAL(12,2))) AS total_spent
FROM OPENROWSET(
    BULK 'https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
) AS sales
GROUP BY customer
ORDER BY total_spent DESC;
-- Best-selling products
SELECT product, SUM(CAST(quantity AS INT)) AS units_sold
FROM OPENROWSET(
    BULK 'https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
) AS sales
GROUP BY product
ORDER BY units_sold DESC;
-- Daily revenue
SELECT order_date, SUM(CAST(amount AS DECIMAL(12,2))) AS daily_revenue
FROM OPENROWSET(
    BULK 'https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
) AS sales
GROUP BY order_date
ORDER BY order_date;
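To know what the three queries above should return before running them in Synapse, the same aggregations can be reproduced locally. This sketch loads the ten sample rows into an in-memory SQLite database; it is a cross-check under the assumption that SQLite's SUM and GROUP BY behave like T-SQL's for this data, not a substitute for Serverless SQL. Function names are hypothetical, and the product query adds a tie-breaker on product name so the order is deterministic.

```python
import sqlite3

# The ten rows of sales.csv from the lab.
SALES = [
    (1001, "Ravi", "Laptop", 1, 75000.00, "2026-04-01"),
    (1002, "Priya", "Mouse", 2, 1500.00, "2026-04-01"),
    (1003, "Amit", "Keyboard", 1, 3500.00, "2026-04-02"),
    (1004, "Ravi", "Monitor", 1, 18000.00, "2026-04-02"),
    (1005, "Sneha", "Headphones", 1, 4500.00, "2026-04-03"),
    (1006, "Priya", "Laptop", 1, 75000.00, "2026-04-03"),
    (1007, "Vikram", "USB Cable", 3, 900.00, "2026-04-04"),
    (1008, "Neha", "Webcam", 1, 5000.00, "2026-04-04"),
    (1009, "Ravi", "Mouse", 1, 750.00, "2026-04-05"),
    (1010, "Amit", "Monitor", 1, 18000.00, "2026-04-05"),
]

def load_sales() -> sqlite3.Connection:
    """Create an in-memory table holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER, customer TEXT, product TEXT,"
        " quantity INTEGER, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", SALES)
    return conn

def top_customers(conn):
    return conn.execute(
        "SELECT customer, SUM(amount) AS total_spent FROM sales"
        " GROUP BY customer ORDER BY total_spent DESC"
    ).fetchall()

def units_by_product(conn):
    # Tie-breaker on product added here; the T-SQL version leaves ties unordered.
    return conn.execute(
        "SELECT product, SUM(quantity) AS units_sold FROM sales"
        " GROUP BY product ORDER BY units_sold DESC, product"
    ).fetchall()

def daily_revenue(conn):
    return conn.execute(
        "SELECT order_date, SUM(amount) AS daily_revenue FROM sales"
        " GROUP BY order_date ORDER BY order_date"
    ).fetchall()

if __name__ == "__main__":
    conn = load_sales()
    for title, query in [("Top customers", top_customers),
                         ("Units by product", units_by_product),
                         ("Daily revenue", daily_revenue)]:
        print(title)
        for row in query(conn):
            print("  ", row)
```

Ravi should lead the customer list with 93750, and the five daily totals should add up to 202150, the sum of the amount column.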
At the bottom of every query result: "Data processed: 1 KB" (or similar).
That number is your bill. Serverless SQL charges ~$5 per TB scanned. At scale, every analytics engineer watches it. Partitioning plus Parquet can shrink it 100-1000×, because queries then read only the partitions and columns they need.
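The arithmetic behind that bill is simple enough to sketch. The estimate below assumes the approximate $5 per TB rate quoted above and decimal terabytes, and it ignores any per-query minimums or rounding that the service may apply; `scan_cost_usd` and `monthly_cost_usd` are hypothetical helpers, and the 500 GB workload is an invented example.

```python
BYTES_PER_TB = 10**12     # assumption: decimal terabytes
PRICE_PER_TB_USD = 5.0    # approximate rate quoted in the text

def scan_cost_usd(bytes_scanned: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of a single query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

def monthly_cost_usd(bytes_per_query: float, queries_per_day: int,
                     days: int = 30, reduction: float = 1.0) -> float:
    """Monthly cost; `reduction` is the factor by which partitioning and
    Parquet cut the bytes scanned (1.0 means no optimisation)."""
    if reduction < 1.0:
        raise ValueError("reduction must be >= 1.0")
    return scan_cost_usd(bytes_per_query / reduction) * queries_per_day * days

if __name__ == "__main__":
    raw = monthly_cost_usd(500 * 10**9, queries_per_day=1)
    tuned = monthly_cost_usd(500 * 10**9, queries_per_day=1, reduction=100)
    print(f"raw CSV scan:     ${raw:.2f} per month")
    print(f"100x less scanned: ${tuned:.2f} per month")
```

One daily query over 500 GB of raw CSV comes to about $75 a month under these assumptions; the same query scanning 100× less data costs under a dollar.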
Warning: forgetting cleanup = surprise Azure bill.
The simplest, safest cleanup on Azure: delete the entire resource group.
Resource group to delete: rg-hitavirtech-analytics
HitaVir Tech says: "The last 5 minutes of cleanup are the most valuable 5 minutes of the entire lab. Engineers who skip it end up with $300 surprise bills."
+==============================================================+
| CONGRATULATIONS - PART 1 DONE! |
+==============================================================+
Analytics Concepts
Topic |
Analytics and the four maturity levels |
Machine Learning: three flavors |
The 5 Vs of Big Data
V | Theme |
VOLUME | Scale |
VARIETY | Formats |
VELOCITY | Speed |
VERACITY | Trust |
VALUE | Outcome |
Azure Services Mapped to Each V
V | Key Services |
Volume | ADLS Gen2 • Synapse • Databricks • HDInsight |
Variety | Data Factory • Synapse Serverless • AI Vision • Doc Intelligence • AI Speech • AI Language |
Velocity | Event Hubs • Stream Analytics • Data Explorer • Functions |
Veracity | ADF Data Flows • Purview • Defender • Activity Log |
Value | Power BI • Azure ML • Azure OpenAI • Personalizer |
Hands-on Skills
OPENROWSET queries on CSV
Part 2 - Advanced Analytics on Azure will cover:
+==============================================================+
| |
| The 5 Vs = universal data challenge framework |
| Azure = complete toolbox for each V |
| |
| Learn both -> you can pick up any cloud's analytics |
| stack in a week. |
| |
+==============================================================+
HitaVir Tech says: "Analytics is not about tools. Tools change every two years. Analytics is about asking the right question, finding the right data, and presenting an insight people can act on. Master the fundamentals (Volume, Variety, Velocity, Veracity, Value) and every new tool becomes just another syntax."
Welcome to cloud analytics on Azure. See you in Part 2.
- HitaVir Tech
+==============================================================+
| DIAGNOSE YOUR OWN PROJECT WITH THE 5 Vs |
+==============================================================+
Think of a data project you work on (or want to build). Run it through these five questions. The V that feels most stressful is your bottleneck; that's where to focus first.
# | Question | Your V | Azure Services to Study |
1 | "Do we have somewhere cheap and durable to store everything?" | Volume | ADLS Gen2 • Synapse • Databricks |
2 | "Do we have to handle more than one data format?" | Variety | ADF • Synapse Serverless • AI Vision • Doc Intelligence |
3 | "Is the current data freshness good enough for stakeholders?" | Velocity | Event Hubs • Stream Analytics • Functions |
4 | "Do stakeholders trust the numbers we publish?" | Veracity | ADF Data Flows • Purview • Defender |
5 | "Is anyone actually acting on our outputs?" | Value | Power BI • Azure ML • Personalizer |
Score each V from 1 (healthy) to 5 (painful). The highest score is where a senior engineer should lead the next sprint.
Pro Tip: "Stacking solutions for Velocity when Veracity is the real problem is the most common and expensive Azure mistake. Diagnose honestly before you build."
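The scoring exercise above is easy to turn into a small helper. This sketch takes one score per V, validates the 1-5 range, and returns the highest-scoring V (or Vs, on a tie) with the services to study from the table. `V_SERVICES`, `bottlenecks`, and `study_plan` are hypothetical names, and the example scores are invented.

```python
# Services to study per V, copied from the diagnostic table.
V_SERVICES = {
    "Volume": ["ADLS Gen2", "Synapse", "Databricks"],
    "Variety": ["ADF", "Synapse Serverless", "AI Vision", "Doc Intelligence"],
    "Velocity": ["Event Hubs", "Stream Analytics", "Functions"],
    "Veracity": ["ADF Data Flows", "Purview", "Defender"],
    "Value": ["Power BI", "Azure ML", "Personalizer"],
}

def bottlenecks(scores: dict[str, int]) -> list[str]:
    """Return the V (or tied Vs) with the highest pain score, in table order."""
    if set(scores) != set(V_SERVICES):
        raise ValueError(f"expected one score for each of {list(V_SERVICES)}")
    for v, score in scores.items():
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{v}: score must be an integer from 1 to 5")
    worst = max(scores.values())
    return [v for v in V_SERVICES if scores[v] == worst]

def study_plan(scores: dict[str, int]) -> dict[str, list[str]]:
    """Map each bottleneck V to the Azure services listed for it."""
    return {v: V_SERVICES[v] for v in bottlenecks(scores)}

if __name__ == "__main__":
    # Invented example: trust in the numbers is the biggest pain.
    scores = {"Volume": 2, "Variety": 1, "Velocity": 3, "Veracity": 5, "Value": 4}
    for v, services in study_plan(scores).items():
        print(f"Focus on {v}: {', '.join(services)}")
```

Returning every tied V, instead of picking one arbitrarily, keeps the honest diagnosis the Pro Tip asks for.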
+==============================================================+
| AZURE ANALYTICS - PART 1 CHEAT SHEET |
| (screenshot and keep) |
+==============================================================+
Term | Definition |
Analytics | Turning data into decisions |
Machine Learning | Algorithms that learn patterns from data instead of being explicitly programmed |
Descriptive → Diagnostic → Predictive → Prescriptive | The 4 levels of analytics maturity |
# | V | Question | One-Liner |
1 | Volume | How much? | Design for 100× your current data |
2 | Variety | How many formats? | Structure is created, not found |
3 | Velocity | How fast? | Match speed to the decision's deadline |
4 | Veracity | Can we trust it? | Quality rules are pipeline-level, not tribal |
5 | Value | Worth it? | Start from the decision, work backwards |
V | Store | Process | Catalog / Quality | Deliver |
Volume | ADLS Gen2 • Archive | Synapse • Databricks | Purview | - |
Variety | ADLS • Cosmos DB | AI Vision • Doc Intel. • AI Speech • AI Language | ADF • Purview | Synapse Serverless |
Velocity | Event Hubs • Kafka API | Functions • Stream Analytics | Capture (→ ADLS) | Data Explorer |
Veracity | - | ADF Data Flows | Purview DQ • Defender • Activity Log • Policy | Purview |
Value | - | Azure ML • Personalizer • Anomaly Detector | - | Power BI • Copilot • OpenAI |
INGEST ----> STORE ---> CATALOG ----> PROCESS ------> QUERY -----> VISUALIZE -> DECIDE
|            |          |             |               |            |            |
Event Hubs   ADLS       Purview       ADF flows       Synapse      Power BI     Human
ADF          Synapse    Fabric cat.   Databricks      Dedicated    Azure ML     action
Functions    Archive    Data Shares   Synapse Spark   Serverless   OpenAI
rg delete is your friend.
Azure service icons used in this codelab are from the official Microsoft Azure Public Service Icons set (V23), freely distributed by Microsoft for use in architecture diagrams and educational materials.