
  +============================================================+
  |                                                            |
  |      AZURE ANALYTICS FUNDAMENTALS - PART 1                 |
  |                                                            |
  |      Concepts  -  The 5 Vs of Big Data  -  Azure Mapping   |
  |                                                            |
  |                 Powered by HitaVir Tech                    |
  +============================================================+

Welcome to Fundamentals of Analytics on Azure Cloud Platform - Part 1 by HitaVir Tech!

This codelab builds your mental model for analytics on Microsoft Azure, one concept at a time, one Azure service at a time. No prior Azure experience required.

What You Will Master

  Pillar               Topics
  -------------------  ----------------------------------------------
  🧠 Concepts          Analytics, Machine Learning, core framework
  📐 The 5 Vs          Volume, Variety, Velocity, Veracity, Value
  ☁️ Azure Services    One toolkit per V: the complete mapping
  🛠️ Hands-on Lab      ADLS Gen2 → Synapse Serverless SQL end-to-end

Why the 5 Vs Framework Matters

Every data challenge you will face maps to one of five questions:

  Question                                     V
  -------------------------------------------  --------
  📦 "How do we store 500 TB?"                 Volume
  🧩 "We have CSVs, JSON, images. Help!"       Variety
  🌊 "Dashboards must refresh every second"    Velocity
  🛡️ "Half our timestamps are malformed"       Veracity
  💎 "Who actually uses this dashboard?"       Value

The 5 Vs give you a framework to diagnose. Azure gives you a toolbox to solve each V.

Estimated Duration

3-4 hours (concepts + hands-on lab)

How to Use This Codelab

  If you are...                Do this
  ---------------------------  --------------------------------------------------------------------------------
  🎓 A student new to cloud    Read top-to-bottom, do the hands-on lab
  🛠️ A working engineer        Skim Part 1-2, deep-read the 5 Vs, focus on Azure services for your bottleneck V
  🧑‍🏫 A trainer or mentor      Use section headers as talking points; the spotlight cards are slide-ready
  🔖 A reference reader        Jump to the Cheat Sheet at the end

💡 HitaVir Tech says: "The 5 Vs aren't academic jargon; they're the mental checklist every senior engineer runs when someone says 'we have a data problem.' Learn to speak this language and every cloud will feel familiar."

What You Need

Required

Helpful

No Local Installs Required

Everything in this codelab runs in the Azure Portal in your browser. Zero software installation on your machine.

⚠️ Cost Awareness

Every step stays inside the Azure free tier / low-cost services:

  Free / Low-Cost Budget                       Usage in This Codelab
  -------------------------------------------  ---------------------
  💾 ADLS Gen2 storage: 5 GB free              < 1 MB
  🔍 Synapse Serverless SQL: $5 / TB scanned   < 1¢ total
  📚 Synapse workspace: free to create         1 workspace
  💰 Estimated total cost                      ~$0.00
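The "$5 / TB scanned" figure in the budget above can be turned into a quick back-of-the-envelope estimate. The sketch below is only arithmetic: it assumes a decimal terabyte (10^12 bytes) and the list price quoted in this codelab, and it ignores any per-query minimums or regional price differences. The function name `serverless_query_cost` is made up for illustration.

```python
TB = 10**12                 # assumption: decimal terabyte (10^12 bytes)
PRICE_PER_TB_USD = 5.0      # price quoted in this codelab; verify current pricing


def serverless_query_cost(bytes_scanned: int,
                          price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Rough USD cost of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TB * price_per_tb


# A lab-sized query over roughly 1 MB of data.
lab_cost = serverless_query_cost(1_000_000)
print(f"${lab_cost:.6f}")
```

At lab scale the result is a tiny fraction of a cent, which is why the estimated total rounds to zero.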

⚠️ Always clean up. Step 10 of the lab is a cleanup ritual. Skip it and Azure will happily bill you for forgotten resources.

Services the Hands-on Lab Will Use

Storage Accounts ADLS Gen2 Synapse

  +==============================================================+
  |         SECTION  1   -   ANALYTICS  CONCEPTS                 |
  +==============================================================+

Before we touch Azure, we need three anchor ideas:

                    +----------------------+
                    |  1.  ANALYTICS       |
                    |  turn data into      |
                    |  decisions           |
                    +----------+-----------+
                               |
                               | powered by
                               v
                    +----------------------+
                    |  2.  MACHINE LEARNING|
                    |  algorithms that     |
                    |  learn patterns      |
                    +----------+-----------+
                               |
                               | challenged by
                               v
                    +----------------------+
                    |  3.  THE 5 Vs        |
                    |  of Big Data         |
                    +----------------------+

Each Anchor Maps to an Azure Service Family

Synapse ML ADLS Gen2

What is Analytics?

📊 Analytics is the practice of turning raw data into useful insights that help people make better decisions.

That one line is the whole discipline. SQL queries, dashboards, ML models, data lakes: all of it is just tooling in service of that idea.

Real-World Example: HitaVir Coffee

Imagine HitaVir Coffee, a chain with 50 locations across India. Every day, each store generates data:

  Data Stream   Icon    Data Stream   Icon
  -----------   ----    -----------   ----
  Orders        ☕      Payments      💰
  Loyalty       👥      Inventory     📦
  Shifts        🕐      Equipment     🌡️
  Deliveries    🚚      Reviews       ⭐

One transaction alone is meaningless. But aggregate across 50 stores for a year and patterns emerge:

  Observation                       Action
  --------------------------------  ------------------------
  🐢 Mondays are slowest            🎯 Launch "Monday BOGO"
  📈 Store #23 sells 2x pastries    🔁 Copy their layout
  📉 Cappuccinos drop in summer     🧊 Push cold brew

That journey, from raw events to actions, is analytics.

The Four Levels of Analytics

  +================================================================+
  |                                                                |
  |     L4   PRESCRIPTIVE       "What should we do?"               |
  |                                                                |
  |                  ^                                             |
  |                  |                                             |
  |     L3   PREDICTIVE         "What will happen?"                |
  |                                                                |
  |                  ^                                             |
  |                  |                                             |
  |     L2   DIAGNOSTIC         "Why did it happen?"               |
  |                                                                |
  |                  ^                                             |
  |                  |                                             |
  |     L1   DESCRIPTIVE        "What happened?"                   |
  |                                                                |
  +================================================================+

  Level            Icon  Question             Example                     Powered By
  ---------------  ----  -------------------  --------------------------  --------------------
  L1 Descriptive   📸    What happened?       "Sold 12,400 cappuccinos"   🔍 SQL / BI
  L2 Diagnostic    🕵️    Why did it happen?   "Bean shortage hit week 3"  🔍 SQL + drill-down
  L3 Predictive    🔮    What will happen?    "June sales up 15%"         🤖 Machine learning
  L4 Prescriptive  🎯    What should we do?   "Order 200kg by May 25"     🤖 ML + optimization

Most companies live at L1-L2. Analytics engineers build the foundation that makes L3-L4 possible.

What Analytics Is NOT

💡 HitaVir Tech says: "Never build a dashboard nobody looks at. Always ask: what decision will this insight change? If the answer is 'none', don't build it."

Preview: Azure Services Across Analytics Maturity

Synapse Power BI ML OpenAI

L1-L2 is Synapse + Power BI. L3-L4 adds Azure ML + Azure OpenAI.

What is Machine Learning?

🤖 Machine Learning (ML) is a branch of AI where algorithms learn patterns from data instead of being explicitly programmed.

Traditional Programming vs Machine Learning

  +-----------------------------+      +-----------------------------+
  |    TRADITIONAL PROGRAMMING  |      |    MACHINE LEARNING         |
  |  -------------------------  |      |  -------------------------  |
  |                             |      |                             |
  |   Rules + Data              |      |   Data + Answers            |
  |         |                   |      |         |                   |
  |         v                   |      |         v                   |
  |     Program                 |      |    Learned Model            |
  |         |                   |      |         |                   |
  |         v                   |      |         v                   |
  |     Answer                  |      |    Rules (weights)          |
  |                             |      |                             |
  |  Human writes the rules.    |      |  Machine learns the rules.  |
  +-----------------------------+      +-----------------------------+
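The contrast in the two boxes above can be made concrete with a toy fraud check. This is a minimal sketch in plain Python, not a real ML library: the amounts, labels, and the 1000 threshold are invented for illustration, and "learning" here just means picking the cut-off that best separates the labeled examples.

```python
def rule_based_is_fraud(amount: float) -> bool:
    # Traditional programming: a human chose the 1000 threshold.
    return amount > 1000


def learn_threshold(amounts, labels):
    """Machine learning in miniature: pick the cut-off that
    classifies the labeled examples most accurately."""
    best_t, best_correct = None, -1
    for t in sorted(amounts):
        correct = sum((a >= t) == y for a, y in zip(amounts, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


# Data + answers in, rule (a single learned weight) out.
amounts = [20, 35, 50, 400, 900, 1200]
labels = [False, False, False, True, True, True]
threshold = learn_threshold(amounts, labels)

print(rule_based_is_fraud(500))   # the hand-written rule misses this one
print(500 >= threshold)           # the learned rule flags it
```

The hand-written rule never changes unless someone edits it; the learned threshold moves whenever the labeled data does.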

The Three Flavors of ML

  Flavor         Icon  Data Needed       Example                                                   Azure Service
  -------------  ----  ----------------  --------------------------------------------------------  -----------------------------
  Supervised     🎯    Labeled examples  Spam filter, fraud detection, image classification        🤖 Azure ML
  Unsupervised   🔎    No labels         Customer segmentation, anomaly detection, topic modeling  🤖 Azure ML • 💭 AI Language
  Reinforcement  🎮    Reward signals    Game AI, robotics, recommenders                           🤖 Azure ML • 🎯 Personalizer

How ML Powers Analytics

  Level  Stops at SQL               Needs ML
  -----  --------------------       --------------------
  L1     Descriptive     OK          -
  L2     Diagnostic      OK          -
  L3     Predictive                  Machine learning
  L4     Prescriptive                ML + optimization

💡 HitaVir Tech says: "ML is not magic; it's statistics at scale. If your analytics foundations are shaky, your ML models will be too. Clean data first, cool models second."

Preview: Azure ML Services

ML OpenAI Cognitive Services AI Search Anomaly Detector Personalizer

Coming up in "Azure Services for Value" (L3-L4 analytics).

  +==============================================================+
  |         SECTION  2   -   THE  5  Vs  OF  BIG  DATA           |
  +==============================================================+

Where the 5 Vs Come From

In 2001, analyst Doug Laney described big-data challenges with three Vs: Volume, Variety, Velocity. Later the industry added Veracity (trust) and Value (outcome). Together they form the universal diagnostic checklist.

The 5 Vs Star

                          *
                     VOLUME
                  How much is it?
                       / \
                      /   \
                     /     \
                    /       \
           VARIETY             VELOCITY
       How many formats?     How fast?
                 \             /
                  \           /
                   \         /
                    \       /
             VERACITY        VALUE
         Can we trust it?  Worth it?

The 5 Vs at a Glance

  #   V              Question
  --  -------------  -------------------
  1   📦 VOLUME      How much? (scale)
  2   🧩 VARIETY     How many formats?
  3   🌊 VELOCITY    How fast? (speed)
  4   🛡️ VERACITY    Can we trust it?
  5   💎 VALUE       What outcome?

Miss any one V and your data platform has a hole. Let's tour each.

Preview: Azure's One Service Per V

ADLS Gen2 Data Factory Event Hubs Data Shares Power BI

Volume → ADLS Gen2. Variety → Data Factory. Velocity → Event Hubs. Veracity → Purview. Value → Power BI.

  +==============================================================+
  |              THE  1st  V   -   VOLUME                        |
  |              "How much data do we have?"                     |
  +==============================================================+

What is Volume?

📦 Volume is the size of your data: how many bytes, rows, events, or files you must store, move, and process.

The Scale Ladder

  Unit       Power   What It Holds
  ---------  ------  ----------------------------
  Byte (B)   1       A letter
  KB         10^3    One email
  MB         10^6    One song
  GB         10^9    One DVD
  TB         10^12   One year of company sales
  PB         10^15   One day of YouTube uploads
  EB         10^18   All of Netflix streaming
  ZB         10^21   The entire internet per year
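To get a feel for the ladder, it can help to convert raw byte counts into these units. The helper below is a small sketch that follows the powers of 1000 shown in the table; the name `humanize` is an illustrative choice, not part of any Azure SDK.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def humanize(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000  # step up one rung of the ladder


print(humanize(1_500))            # a short email
print(humanize(500 * 10**12))     # the "500 TB" question from the intro
```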

Why Volume is Hard

A traditional database runs fine up to ~1-10 TB. Past that, things break.

At big-data scale, you need distributed systems: hundreds of machines sharing the load.

Real-World Volume

  Company        Daily Volume
  -------------  ------------------------
  📱 Instagram   100M+ photos uploaded
  🛒 Amazon      Billions of events
  🚗 Uber        10s of TB of trip data
  🎬 Netflix     PB of logs and streams
  🔎 Bing        Unimaginable

Questions Volume Forces You to Ask

💡 HitaVir Tech says: "What works at 10 GB catastrophically fails at 10 TB. Always ask: how does this scale at 100x?"

📦 Volume in one line: design for 100× your current data, or rebuild painfully later.

Preview: Azure Services That Solve Volume

ADLS Gen2 Synapse Databricks HDInsight

Coming up in "Azure Services for Volume".

  +==============================================================+
  |              THE  2nd  V   -   VARIETY                       |
  |              "How many kinds of data?"                       |
  +==============================================================+

What is Variety?

🧩 Variety is the diversity of data: formats, schemas, and sources.

Twenty years ago, "data" meant rows in a database. Today, it means far more:

  Type             Icon  Examples                               Schema
  ---------------  ----  -------------------------------------  --------
  Structured       📊    SQL tables, CSV, spreadsheets          Fixed
  Semi-structured  🧩    JSON from APIs, XML, Parquet, Avro     Flexible
  Unstructured     🎞️    Images, video, audio, PDFs, free text  None

Why Variety is Hard

Each format needs different tooling:

  Format       Tool Needed
  -----------  -----------------
  SQL tables   Relational engine
  JSON         Document parser
  PDF          OCR
  Image        Computer vision
  Audio        Speech-to-text
  Free text    NLP / embeddings

Most real projects combine these. Example: "Correlate support emails + call recordings + order history into one insight." Three completely different pipelines feeding one answer.

Real-World Variety

  Industry             Variety Mix
  -------------------  ------------------------------------------------
  🏥 Healthcare        Patient records + X-rays + doctor's notes
  🛒 Retail            Orders + product photos + reviews
  🏦 Banking           Transactions + scanned checks + call transcripts
  🚗 Autonomous cars   Sensors + video + maps + LiDAR

Questions Variety Forces You to Ask

💡 HitaVir Tech says: "90% of the world's data is unstructured. But 90% of analytics happens on structured data. Your job is often to convert chaos into order."

🧩 Variety in one line: structure is created, not found. Choose tools that embrace format diversity.

Preview: Azure Services That Solve Variety

Data Factory Synapse Cosmos DB AI Vision Doc Intelligence AI Language

Coming up in "Azure Services for Variety".

  +==============================================================+
  |              THE  3rd  V   -   VELOCITY                      |
  |              "How fast is the data?"                         |
  +==============================================================+

What is Velocity?

🌊 Velocity is the speed at which data arrives, moves, and must be processed to deliver value.

The Velocity Spectrum

  Freshness             Icon  Approach         Example Use Case
  --------------------  ----  ---------------  --------------------
  Next day              🐌    Batch (nightly)  Finance reports
  Every hour            🐇    Mini-batch       Ops dashboards
  Seconds               🚀    Streaming        Live pricing
  Milliseconds or less  ⚡    Real-time        Fraud detection, HFT
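One way to read the spectrum is as a lookup: given how stale the data is allowed to be, pick the slowest approach that still meets the deadline. The sketch below encodes that idea. The exact cut-offs (one day, one hour, one second) are assumptions chosen to match the rows above, and `pick_approach` is a hypothetical helper.

```python
def pick_approach(max_staleness_seconds: float) -> str:
    """Return the slowest (usually cheapest) approach that meets the deadline."""
    if max_staleness_seconds >= 24 * 3600:
        return "batch (nightly)"
    if max_staleness_seconds >= 3600:
        return "mini-batch"
    if max_staleness_seconds >= 1:
        return "streaming"
    return "real-time"


print(pick_approach(24 * 3600))  # finance reports
print(pick_approach(3600))       # ops dashboards
print(pick_approach(5))          # live pricing
print(pick_approach(0.05))       # fraud detection
```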

Why Velocity is Hard

  Problem                Solution
  ---------------------  ---------------------------------
  Disks too slow         In-memory / caches
  Batch SQL too slow     Stream-processing engines
  One machine too small  Horizontal auto-scaling
  Failures = data loss   Durable logs (Kafka / Event Hubs)

Real-World Velocity

  Scenario               Required Latency
  ---------------------  ----------------
  💳 Credit card fraud   Under 100 ms
  📈 Stock trading       Microseconds
  📱 Social feed         Seconds
  🚚 Delivery tracking   Minutes
  📊 Exec dashboard      Hourly
  🧾 Month close         Daily batch

Questions Velocity Forces You to Ask

💡 HitaVir Tech says: "Streaming is fashionable. Batch is profitable. 80% of real-world analytics runs on batch, so don't reach for streaming unless the business truly cannot wait."

🌊 Velocity in one line: match the pipeline's speed to the decision's deadline, and no faster.

Preview: Azure Services That Solve Velocity

Event Hubs Stream Analytics Data Explorer Functions Event Grid

Coming up in "Azure Services for Velocity".

  +==============================================================+
  |              THE  4th  V   -   VERACITY                      |
  |              "Can we trust the data?"                        |
  +==============================================================+

What is Veracity?

🛡️ Veracity is the accuracy, consistency, and trustworthiness of your data.

Big volumes and fast pipelines are useless if the data is wrong.

The Veracity Enemies

  Enemy          Icon  Symptom
  -------------  ----  ------------------------
  Missing        🦠    NULL in required fields
  Duplicates     🗑️    Same row repeated
  Inconsistent   🎭    2024-01-05 vs 01/05/24
  Outliers       📉    Age = 347
  Units          🪙    USD mixed with INR
  Bias           🪞    Only US users sampled
  Stale          🪤    Last updated 2019
  Broken joins   📎    Order with no user
  Noise          🎲    Flaky sensor readings

GIGO: Garbage In, Garbage Out

🛡️ A beautiful dashboard built on bad data is worse than no dashboard: it creates false confidence. The most dangerous insight is a wrong insight someone believes.

Six Dimensions of Data Quality

  Dimension      Icon  Question
  -------------  ----  ---------------------------
  Completeness   🧩    Required fields populated?
  Accuracy       🎯    Data reflects reality?
  Consistency    🧭    Related systems agree?
  Timeliness     ⏰    Is it current enough?
  Validity       ✅    Matches formats / rules?
  Uniqueness     🔢    Any unintended duplicates?
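Three of these dimensions (completeness, validity, uniqueness) are easy to check mechanically. The sketch below runs them over a handful of made-up order rows; the field names `order_id`, `amount`, and `ts`, the ISO date rule, and the function `quality_issues` are all illustrative assumptions rather than anything prescribed by this codelab.

```python
from datetime import datetime


def quality_issues(rows, required=("order_id", "amount", "ts")):
    """Return (row_index, dimension, field) for every rule a row breaks."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Completeness: required fields must be populated.
        for field in required:
            if row.get(field) in (None, ""):
                issues.append((i, "completeness", field))
        # Validity: timestamps must use the YYYY-MM-DD format.
        ts = row.get("ts")
        if ts:
            try:
                datetime.strptime(ts, "%Y-%m-%d")
            except ValueError:
                issues.append((i, "validity", "ts"))
        # Uniqueness: an order_id may appear only once.
        key = row.get("order_id")
        if key is not None:
            if key in seen_ids:
                issues.append((i, "uniqueness", "order_id"))
            seen_ids.add(key)
    return issues


rows = [
    {"order_id": 1, "amount": 4.5, "ts": "2024-01-05"},
    {"order_id": 2, "amount": None, "ts": "01/05/24"},
    {"order_id": 1, "amount": 3.0, "ts": "2024-01-06"},
]
for issue in quality_issues(rows):
    print(issue)
```

Accuracy, consistency, and timeliness usually need a second source of truth or a clock to compare against, so they are left out of this sketch.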

Real-World Veracity Failures

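  Incident                               Consequence
  -------------------------------------  ------------------------------------------------------------------------
  🛰️ NASA Mars Climate Orbiter (1999)    Lost $125M: metric vs imperial unit mismatch
  🏦 Knight Capital (2012)               $440M loss in 45 minutes: a faulty software deployment sent erroneous orders
  🤧 Google Flu Trends                   Overestimated flu peaks by 100%+ due to search bias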

Questions Veracity Forces You to Ask

💡 HitaVir Tech says: "Senior engineers obsess over data quality. Juniors obsess over cool tools. Guess which group builds systems that actually work in production."

🛡️ Veracity in one line: quality rules are a pipeline concern, not a hope.

Preview: Azure Services That Solve Veracity

Data Factory Data Shares Defender Key Vault Activity Log

Coming up in "Azure Services for Veracity".

  +==============================================================+
  |              THE  5th  V   -   VALUE                         |
  |          "What business outcome does it drive?"              |
  +==============================================================+

What is Value?

💎 Value is the business outcome your data and analytics actually deliver: revenue gained, cost saved, risk reduced, customer experience improved.

Without Value, the other four Vs are expensive hobbies.

The Value Pyramid

                       VALUE
                     (outcome)
                         ^
                         |  enabled by
                         |
                Insights & decisions
                         ^
                         |  enabled by
                         |
                   Analytics + ML
                         ^
                         |  enabled by
                         |
               Trustworthy (Veracity) data
                         ^
                         |  at the right
                         |  speed (Velocity)
                         |
                  across Varieties
                         ^
                         |  stored at
                         |
                   the right scale (Volume)

Examples of Real Value

  Industry        Analytics Value
  --------------  ----------------------------------------
  🛒 Retail       Recommendation engine → +20% revenue
  🏦 Banking      Fraud detection → millions saved
  🚚 Logistics    Route optimization → -15% fuel cost
  🏥 Healthcare   Early diagnosis models → better outcomes
  🎬 Streaming    Personalized content → higher retention

The Dashboard Graveyard

Most companies have folders full of unused dashboards: the dashboard graveyard. Every one cost engineering time, storage, and compute.

The difference between a valuable dashboard and a graveyard dashboard:

  +------------------------------------------------------+
  |                                                      |
  |   "What decision will change because of this?"       |
  |                                                      |
  |   If nobody can answer   ->  DON'T BUILD IT.         |
  |                                                      |
  +------------------------------------------------------+

How to Measure Value

Questions Value Forces You to Ask

💡 HitaVir Tech says: "A data platform that costs more than the decisions it enables is a failure, no matter how beautiful the architecture. Lead with Value."

💎 Value in one line: start from the decision, work backwards to the pipeline.

Preview: Azure Services That Deliver Value

Power BI ML OpenAI Personalizer Anomaly Detector

Coming up in "Azure Services for Value".

  +==============================================================+
  |       SECTION  3  -  AZURE  SERVICES  BY  THE  5  Vs         |
  +==============================================================+

The Headline Cast

ADLS Gen2 Synapse Databricks Data Factory Event Hubs Stream Analytics Functions Power BI Azure ML Azure OpenAI

Now we map each V to the Azure services that solve it.

The Golden Rule: Every Stack Follows One Shape

  +-------+   +-------+   +-------+   +-------+   +-------+   +-------+   +-------+
  |       |   |       |   |       |   |       |   |       |   |       |   |       |
  | INGST | ->| STORE | ->| CATLG | ->| PROCS | ->| QUERY | ->| VIEW  | ->|  ACT  |
  |       |   |       |   |       |   |       |   |       |   |       |   |       |
  +-------+   +-------+   +-------+   +-------+   +-------+   +-------+   +-------+

The 5 Vs tell you where the bottleneck is. The Azure services tell you what solves it.

  +==============================================================+
  |              AZURE  FOR  VOLUME                              |
  |         "Store any amount of data, affordably."              |
  +==============================================================+

The Volume Lineup

ADLS Gen2 • Blob Storage • Synapse Analytics • HDInsight • Databricks

The Volume Toolkit

  Service                        Short Name     Purpose
  -----------------------------  -------------  ---------------------------------------------------------------------------
  Azure Data Lake Storage Gen2   ADLS Gen2      Object storage built on Blob, hierarchical namespace: the data-lake foundation
  Azure Blob Storage             Blob Storage   Raw object storage: hot / cool / archive tiers
  Azure Archive Storage          Archive        Cheapest long-term vault (hours-to-retrieve)
  Azure Synapse Analytics        Synapse        Analytics platform: dedicated + serverless SQL + Spark
  Azure HDInsight                HDInsight      Managed Hadoop / Spark / Kafka clusters
  Azure Databricks               Databricks     Managed Apache Spark + MLflow + Delta Lake

Service Spotlight: Azure Data Lake Storage Gen2

ADLS Gen2

  +--------------------------------------------------------------+
  |  AZURE  DATA  LAKE  STORAGE  GEN2                            |
  +--------------------------------------------------------------+
  |  Built on    :  Azure Blob Storage                           |
  |  Durability  :  99.999999999%  (11 nines, LRS)               |
  |  Scale       :  Exabytes (practically unlimited)             |
  |  Pricing     :  ~$0.018 / GB / month (hot LRS)               |
  |  Features    :  Hierarchical namespace, POSIX ACLs           |
  |  Read by     :  Synapse, Databricks, HDInsight, Power BI     |
  |                                                              |
  |  If you remember only one Azure service - make it ADLS Gen2. |
  +--------------------------------------------------------------+

Blob Storage Tiers: The Cost Pyramid

  Tier     Icon  Access Pattern                        Relative Cost
  -------  ----  ------------------------------------  -------------
  Hot      🔥    Frequent access                       $$$$
  Cool     ❄️    Monthly access                        $$
  Cold     🧊    Every 90+ days                        $
  Archive  🏔️    Compliance vault, hours to rehydrate  ¢
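The tier choice can be sketched as a small rule based on how often the data is read. The 30-day and 90-day cut-offs below are assumptions taken loosely from the "monthly" and "every 90+ days" access patterns above, not official thresholds, and `pick_tier` is a hypothetical helper.

```python
def pick_tier(days_between_reads: int, can_wait_hours_to_read: bool = False) -> str:
    """Suggest a storage tier from the typical gap between reads."""
    if days_between_reads < 30:
        return "hot"
    if days_between_reads < 90:
        return "cool"
    # Rarely read data: archive only if hours-long rehydration is acceptable.
    if can_wait_hours_to_read:
        return "archive"
    return "cold"


print(pick_tier(1))                                  # today's sales files
print(pick_tier(45))                                 # last month's reports
print(pick_tier(120))                                # old logs, still queried sometimes
print(pick_tier(365, can_wait_hours_to_read=True))   # compliance records
```

In practice a storage lifecycle policy would apply rules like these automatically; the function only illustrates the decision.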

Service Spotlight: Azure Synapse Analytics

Synapse

  +--------------------------------------------------------------+
  |  AZURE  SYNAPSE  ANALYTICS  -  Unified Analytics             |
  +--------------------------------------------------------------+
  |  Engines     :  Dedicated SQL Pool (MPP) + Serverless SQL +  |
  |                 Apache Spark pools + Data Explorer pools     |
  |  Storage     :  ADLS Gen2 under the hood                     |
  |  SQL         :  T-SQL (SQL Server flavored)                  |
  |  Integration :  One workspace, notebooks, pipelines, Power BI|
  |                                                              |
  |  One UI for lake, warehouse, Spark, and BI.                  |
  +--------------------------------------------------------------+

Service Spotlight: Azure Databricks

Databricks

  +--------------------------------------------------------------+
  |  AZURE  DATABRICKS  -  Managed Spark + Delta Lake            |
  +--------------------------------------------------------------+
  |  Engine      :  Apache Spark (Photon-accelerated)            |
  |  Superpower  :  Delta Lake (ACID on the data lake)           |
  |  ML          :  MLflow, Feature Store, AutoML                |
  |  Governance  :  Unity Catalog                                |
  |                                                              |
  |  Use for petabyte-scale custom Spark + ML workloads.         |
  +--------------------------------------------------------------+

ADLS Data Lake: Medallion Architecture

  abfss://lake@hitavirtechanalytics.dfs.core.windows.net/
    |
    +-- raw/             <-- Bronze:  untouched source data
    |     +-- sales/2026/04/22/orders.csv
    |     +-- inventory/2026/04/22/stock.json
    |
    +-- curated/         <-- Silver:  cleaned, typed Parquet/Delta
    |     +-- sales_fact/year=2026/month=04/day=22/part-001.parquet
    |
    +-- analytics/       <-- Gold:    pre-aggregated for BI
          +-- daily_revenue/year=2026/month=04/day=22/part-001.parquet
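The folder layout above follows a repeatable naming pattern, so paths are usually generated rather than typed. The sketch below rebuilds the example paths from a date. The root URL is copied from the tree; the helper names `raw_path` and `partition_path` are invented for illustration.

```python
from datetime import date

# Root of the example lake shown in the tree above.
LAKE_ROOT = "abfss://lake@hitavirtechanalytics.dfs.core.windows.net"


def raw_path(source: str, day: date, filename: str) -> str:
    """Bronze: untouched files, foldered by source and calendar date."""
    return f"{LAKE_ROOT}/raw/{source}/{day:%Y/%m/%d}/{filename}"


def partition_path(zone: str, table: str, day: date,
                   part: str = "part-001.parquet") -> str:
    """Silver / Gold: key=value partition folders that query engines can prune."""
    return (f"{LAKE_ROOT}/{zone}/{table}/"
            f"year={day:%Y}/month={day:%m}/day={day:%d}/{part}")


d = date(2026, 4, 22)
print(raw_path("sales", d, "orders.csv"))
print(partition_path("curated", "sales_fact", d))
print(partition_path("analytics", "daily_revenue", d))
```

The `year=/month=/day=` folders matter because engines that understand partitioned layouts can skip whole folders when a query filters on date.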

Volume Decision Tree

                      How much data?
                            |
       +--------------------+--------------------+
       |                    |                    |
     < 1 TB              1-100 TB             > 100 TB
       |                    |                    |
       v                    v                    v
    Azure SQL DB          ADLS +              ADLS + Databricks +
    or Synapse            Synapse             Synapse Dedicated +
    Serverless            Serverless          Purview governance
    (small + cheap)       (most common)        (huge platform)
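The same tree can be written as a few lines of code, which is handy when sizing several datasets at once. This is a direct transcription of the branches above; the function name `volume_stack` is a made-up helper, and the boundaries are the ones drawn in the diagram.

```python
def volume_stack(terabytes: float) -> list[str]:
    """Map a data size in TB to the stack suggested by the decision tree."""
    if terabytes < 1:
        return ["Azure SQL DB or Synapse Serverless"]
    if terabytes <= 100:
        return ["ADLS Gen2", "Synapse Serverless"]
    return ["ADLS Gen2", "Databricks", "Synapse Dedicated", "Purview"]


for size_tb in (0.2, 40, 500):
    print(size_tb, "TB ->", " + ".join(volume_stack(size_tb)))
```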

💡 HitaVir Tech says: "Start with ADLS Gen2. Every Azure analytics service reads from it. You'll never regret putting data into the lake, but you may regret putting it anywhere else first."

  +==============================================================+
  |              AZURE  FOR  VARIETY                             |
  |         "Handle any data format, elegantly."                 |
  +==============================================================+

The Variety Lineup

Data Factory • Synapse • Cosmos DB • AI Vision • Doc Intelligence • AI Speech • AI Language • AI Search

The Variety Toolkit

  Service                          Short Name        Purpose
  -------------------------------  ----------------  ------------------------------------------------------
  ADLS Gen2                        ADLS Gen2         Holds every format: CSV, JSON, Parquet, images, video
  Azure Data Factory               Data Factory      ETL / ELT, 100+ connectors, mapping data flows
  Synapse Serverless SQL           Synapse           OPENROWSET on CSV / JSON / Parquet, no setup
  Azure Cosmos DB                  Cosmos DB         Multi-model NoSQL (document, graph, key-value)
  Azure AI Vision                  AI Vision         Images / video → structured labels, OCR
  Azure AI Document Intelligence   Doc Intelligence  PDFs, forms, invoices → text and tables
  Azure AI Speech                  AI Speech         Speech → text, speaker ID, translation
  Azure AI Language                AI Language       NLP: sentiment, entities, summarization
  Azure AI Search                  AI Search         Full-text and vector search over any source

Service Spotlight: Azure Data Factory

Data Factory

  +--------------------------------------------------------------+
  |  AZURE  DATA  FACTORY  -  Serverless ETL / ELT               |
  +--------------------------------------------------------------+
  |  Connectors  :  100+  (SQL, SaaS, files, APIs, on-prem)      |
  |  Pipelines   :  Drag-and-drop + code-free mapping data flows |
  |  Triggers    :  Schedule, event-based, tumbling window       |
  |  Integration :  Git (Azure DevOps / GitHub)                  |
  |                                                              |
  |  The "orchestrator" of Azure data platforms.                 |
  +--------------------------------------------------------------+

Data Factory Flow

  INPUT                                                       OUTPUT
  -----                                                       ------

  CSV       +                                           +---- Parquet
  JSON      +---> Copy Activity ---> Mapping Data Flow -+     (optimized)
  Parquet   +              (schema +                    +---- Delta
  Oracle    |               transforms)                        tables
  SAP       +

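Synapse Serverless SQL: One SQL, Many Formats

-- Join Parquet orders to JSON customers in a single serverless query.
-- Assumes line-delimited JSON documents that each carry a customer_id property.
SELECT  c.customer_id, SUM(o.amount) AS total_spent
FROM    OPENROWSET(
          BULK 'https://hitavirtech.dfs.core.windows.net/lake/orders/*.parquet',
          FORMAT = 'PARQUET'
        ) AS o
JOIN    (
          -- JSON is read as one text column per line, then parsed with JSON_VALUE.
          SELECT JSON_VALUE(r.doc, '$.customer_id') AS customer_id
          FROM   OPENROWSET(
                   BULK 'https://hitavirtech.dfs.core.windows.net/lake/customers/*.json',
                   FORMAT = 'CSV', FIELDTERMINATOR = '0x0b', FIELDQUOTE = '0x0b'
                 ) WITH (doc NVARCHAR(MAX)) AS r
        ) AS c  ON c.customer_id = o.customer_id
GROUP BY c.customer_id;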

Serverless SQL reads CSV, JSON, Parquet, Delta directly from ADLS. You never leave T-SQL.

Unstructured → Structured: The AI Extractor Pipeline

AI Vision Doc Intelligence AI Speech AI Language Translator

  Input          Icon  Azure Service     Output
  -------------  ----  ----------------  -----------------------------------
  Images         🖼️    AI Vision         Labels, faces, OCR
  PDFs / scans   📄    Doc Intelligence  Extracted text + tables + key-value
  Audio / voice  🎤    AI Speech         Transcripts + speaker ID
  Free text      💬    AI Language       Sentiment, entities, summarization
  Translations   🌐    Translator        100+ languages

Magic step: chaos in → structured features out → then into ADLS / Synapse / Databricks as normal.

Real-World Example: E-commerce Review Pipeline

  Customer review (raw text)
        |
        v
  AI Language  --->  sentiment = negative, topic = shipping
        |
        v
  ADLS Gen2 (enriched records)
        |
        v
  Data Factory  --->  Synapse table
        |
        v
  Synapse SQL:   "avg sentiment per product / month"
        |
        v
  Power BI dashboard for the CX team
        |
        v
  Action:  fix shipping partner in Region X

Variety Decision Tree

                       What's my data?
                             |
      +---------+---------+--+---+---------+---------+
      |         |         |      |         |         |
   Tabular   JSON      Images   PDFs   Audio     Free text
      |         |         |      |         |         |
      v         v         v      v         v         v
   ADLS +     ADLS +    AI      Doc      AI         AI
   Synapse    Synapse   Vision  Intell.  Speech     Language
              or Cosmos
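A first pass at this tree is often just a lookup on file extension. The sketch below routes a filename to the branch shown above. The extension list is an illustrative assumption and far from complete, and `route` is a hypothetical helper, not an Azure API.

```python
from pathlib import PurePosixPath

# Branches of the decision tree above.
ROUTES = {
    "tabular": "ADLS + Synapse",
    "json": "ADLS + Synapse or Cosmos DB",
    "image": "AI Vision",
    "pdf": "Doc Intelligence",
    "audio": "AI Speech",
    "text": "AI Language",
}

# Assumed mapping from file extension to data kind (extend as needed).
EXTENSIONS = {
    ".csv": "tabular", ".parquet": "tabular",
    ".json": "json",
    ".jpg": "image", ".png": "image",
    ".pdf": "pdf",
    ".wav": "audio", ".mp3": "audio",
    ".txt": "text",
}


def route(filename: str) -> str:
    """Pick the service family for a file based on its extension."""
    kind = EXTENSIONS.get(PurePosixPath(filename).suffix.lower())
    return ROUTES.get(kind, "unknown: inspect manually")


for name in ("orders.csv", "invoice.PDF", "call-0042.wav", "notes.xyz"):
    print(name, "->", route(name))
```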

💡 HitaVir Tech says: "The magic of modern analytics: unstructured data becomes structured features in minutes via Azure AI services. What took PhDs years a decade ago is now an API call."

  +==============================================================+
  |              AZURE  FOR  VELOCITY                            |
  |         "Move and process data in real time."                |
  +==============================================================+

The Velocity Lineup

Event Hubs • Stream Analytics • Data Explorer • Functions • Event Grid • Logic Apps

The Velocity Toolkit

  Service                             Short Name          Purpose
  ----------------------------------  ------------------  ------------------------------------------
  Azure Event Hubs                    Event Hubs          Real-time event stream (Kafka-compatible)
  Event Hubs Capture                  Event Hubs Capture  Buffered delivery to ADLS / Blob (no code)
  Azure Stream Analytics              Stream Analytics    SQL on streams, sub-second latency
  Azure Data Explorer (Kusto)         Data Explorer       Blazing-fast time-series + log analytics
  Azure Functions                     Functions           Event-driven serverless code
  Azure Event Grid                    Event Grid          Serverless event bus across Azure
  Azure Service Bus / Queue Storage   Service Bus         Queue and pub-sub messaging

Service Spotlight - Azure Event Hubs

  +--------------------------------------------------------------+
  |  AZURE  EVENT  HUBS  -  Real-time Event Stream               |
  +--------------------------------------------------------------+
  |  Latency     :  Sub-second                                   |
  |  Retention   :  1-7 days (90 days on Premium)                |
  |  Throughput  :  MB/sec per partition, scale by partition     |
  |  Kafka API   :  Yes - drop-in for Kafka clients              |
  |                                                              |
  |  The "high-speed conveyor belt" for events on Azure.         |
  +--------------------------------------------------------------+

Event Hubs in Action โ€” The Conveyor Belt

  PRODUCERS                EVENT HUBS                CONSUMERS
  ------------             ------------              ------------

  App events        +----------------------+         Functions
  Clickstreams  --->|  >> >> >> >> >> >>   |--->    Stream Analytics
  IoT sensors       |  durable, ordered,   |         Capture -> ADLS
  Transactions      |  partitioned         |         Data Explorer
  Card swipes       +----------------------+         Synapse

Event Hubs holds events durably. Multiple consumers read the same stream independently.
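To make "partitioned, ordered, independently read" concrete, here is a minimal in-memory sketch. It is a toy model, not the Event Hubs SDK: the class `PartitionedLog`, its methods, and the consumer group names are all hypothetical, and hashing the key with CRC32 is only an assumption chosen for illustration.

```python
from collections import defaultdict
from zlib import crc32


class PartitionedLog:
    """Toy stand-in for a partitioned event stream (hypothetical, not the Event Hubs SDK)."""

    def __init__(self, partition_count: int) -> None:
        self.partitions = [[] for _ in range(partition_count)]
        # One read position per (consumer group, partition) pair.
        self.offsets = defaultdict(int)

    def send(self, partition_key: str, event: dict) -> int:
        """Append the event to the partition chosen by hashing the key."""
        index = crc32(partition_key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(event)
        return index

    def read(self, consumer_group: str, partition: int, max_events: int = 10) -> list:
        """Return the next events for this group and advance only this group's offset."""
        start = self.offsets[(consumer_group, partition)]
        batch = self.partitions[partition][start:start + max_events]
        self.offsets[(consumer_group, partition)] = start + len(batch)
        return batch


log = PartitionedLog(partition_count=4)
p = log.send("rider-42", {"type": "ride_requested"})
log.send("rider-42", {"type": "ride_started"})  # same key, so same partition

fraud_view = log.read("fraud-function", p)
archive_view = log.read("capture-archive", p)
print(fraud_view == archive_view)  # True: both groups see the same ordered events
```

Because events are appended rather than removed on read, adding a new consumer never disturbs the existing ones; each group simply tracks its own position.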

Service Spotlight - Event Hubs Capture

  +--------------------------------------------------------------+
  |  EVENT  HUBS  CAPTURE  -  The Easy Button                    |
  +--------------------------------------------------------------+
  |  Model       :  Fully managed, no code                       |
  |  Buffer      :  Time window or size (whichever first)        |
  |  Format      :  Avro (native) or Parquet via Stream Analytics|
  |  Sinks       :  ADLS Gen2, Blob Storage                      |
  |                                                              |
  |  No cluster - the laziest streaming archive on Azure.        |
  +--------------------------------------------------------------+
  Producers  --->  Event Hubs  --->  Capture (no servers)  --->  ADLS
                                    auto-write every N mins
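The "time window or size, whichever first" rule on the card can be sketched as a small buffer. This only illustrates the flushing logic; the class `CaptureBuffer`, its parameters, and the explicit `now` timestamps are assumptions made for the example and do not reflect how the managed service is implemented.

```python
class CaptureBuffer:
    """Flush when the size limit or the time window is reached, whichever comes first."""

    def __init__(self, max_bytes: int, window_seconds: float) -> None:
        self.max_bytes = max_bytes
        self.window_seconds = window_seconds
        self.events = []
        self.buffered_bytes = 0
        self.window_start = None
        self.flushed_files = []  # each entry stands in for one file written to the lake

    def add(self, payload: bytes, now: float) -> None:
        """Buffer one event; the first event of a batch opens the time window."""
        if self.window_start is None:
            self.window_start = now
        self.events.append(payload)
        self.buffered_bytes += len(payload)
        self.maybe_flush(now)

    def maybe_flush(self, now: float) -> bool:
        """Write the batch out if either trigger has fired; return True if it did."""
        if not self.events:
            return False
        size_hit = self.buffered_bytes >= self.max_bytes
        time_hit = now - self.window_start >= self.window_seconds
        if size_hit or time_hit:
            self.flushed_files.append(list(self.events))
            self.events, self.buffered_bytes, self.window_start = [], 0, None
            return True
        return False


buffer = CaptureBuffer(max_bytes=10, window_seconds=300)
buffer.add(b"12345", now=0)    # 5 bytes buffered, nothing written yet
buffer.add(b"67890", now=1)    # 10 bytes buffered: the size trigger fires
buffer.add(b"x", now=2)        # opens a new window
buffer.maybe_flush(now=302)    # 300 seconds elapsed: the time trigger fires
print(len(buffer.flushed_files))  # 2
```

A quiet stream is flushed by the clock and a busy stream by the size limit, which is why the archive keeps up in both cases without any tuning code.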

Azure Stream Analytics - Continuous SQL

SELECT  user_id, amount, location
INTO    FraudAlerts
FROM    TransactionsStream TIMESTAMP BY event_time
WHERE   amount > 10000 OR is_foreign = 1;

Results arrive moments after each event (typically sub-second to a few seconds), not after the nightly batch.
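For readers who think in code rather than SQL, the same continuous filter can be sketched as a Python generator. This is only an analogy for how a standing query evaluates each event as it arrives; the function `fraud_alerts` and the sample events are hypothetical.

```python
from typing import Iterable, Iterator


def fraud_alerts(transactions: Iterable[dict], threshold: float = 10000) -> Iterator[dict]:
    """Yield an alert as soon as a matching event arrives, mirroring the WHERE clause."""
    for tx in transactions:
        if tx["amount"] > threshold or tx["is_foreign"] == 1:
            # Project the same three columns as the SELECT list.
            yield {"user_id": tx["user_id"], "amount": tx["amount"], "location": tx["location"]}


# Hypothetical sample events standing in for TransactionsStream.
stream = [
    {"user_id": "u1", "amount": 250, "location": "Pune", "is_foreign": 0},
    {"user_id": "u2", "amount": 18000, "location": "Delhi", "is_foreign": 0},
    {"user_id": "u3", "amount": 40, "location": "Lisbon", "is_foreign": 1},
]

alerts = list(fraud_alerts(stream))
print([a["user_id"] for a in alerts])  # ['u2', 'u3']
```

The generator never waits for the stream to end, which is the key difference from a batch query over a finished table.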

Azure Functions - The Universal Event Glue

  Blob created     +
  Event Hubs       +--->  Azure Function  --->  Synapse load
  Cosmos change    +         |
  Event Grid       +         +---->  Service Bus / Queue alert

Perfect for: event reactions, enrichment, alerting, small transforms.

Service Spotlight - Azure Data Explorer

  +--------------------------------------------------------------+
  |  AZURE  DATA  EXPLORER  (ADX / KUSTO)                        |
  +--------------------------------------------------------------+
  |  Category    :  Time-series + log analytics                  |
  |  Query lang  :  Kusto Query Language (KQL)                   |
  |  Scale       :  Billions of rows, sub-second                 |
  |  Ingest rate :  GB/sec, auto-indexed                         |
  |                                                              |
  |  Same engine powers Azure Monitor, Sentinel, Log Analytics.  |
  +--------------------------------------------------------------+

Real-World Velocity Pipeline - Rideshare App

  Rideshare app (1 million events/sec)
              |
              v
       Azure Event Hubs
              |
     +--------+---------+
     |        |         |
     v        v         v
  Function  Stream    Capture
  fraud     Analytics buffer
  flag      real-time --> ADLS (Parquet)
     |        |         |
     v        v         v
   Service  Power BI   Synapse
   Bus      live dash  Serverless
   alert                (historical)

Velocity Decision Tree

                       How fresh must the data be?
                                 |
        +---------+----------+---+---+-----------+
        |         |          |       |           |
     Next day  15 minutes  Seconds  Sub-second  Kafka shop
        |         |          |       |           |
        v         v          v       v           v
     ADF        Capture    Event     Stream      Event Hubs
     pipeline   -> ADLS    Hubs +    Analytics   (Kafka API)
                           Function
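The decision tree above can be written down as a small lookup function. The branch labels come from the tree; the numeric cut-offs between branches (1 second, 60 seconds, one day) are assumptions added here so the example is executable, and the function name is hypothetical.

```python
def pick_velocity_service(freshness_seconds: float, kafka_clients: bool = False) -> str:
    """Map a freshness requirement to the branch of the Velocity decision tree."""
    if kafka_clients:
        return "Event Hubs (Kafka API)"
    if freshness_seconds < 1:            # "Sub-second"
        return "Stream Analytics"
    if freshness_seconds < 60:           # "Seconds" (assumed cut-off)
        return "Event Hubs + Function"
    if freshness_seconds < 24 * 3600:    # minutes to hours, e.g. "15 minutes"
        return "Capture -> ADLS"
    return "ADF pipeline"                # "Next day"


for need in (0.5, 5, 15 * 60, 24 * 3600):
    print(need, "->", pick_velocity_service(need))
```

Writing the rule down this way also makes the cost conversation easier: the further left the answer lands, the cheaper the pipeline tends to be.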

💡 HitaVir Tech says: "Every team thinks they need real-time until they see the bill. Start with Event Hubs Capture and 5-minute micro-batches, then graduate later. Most of the time, you won't need to."

  +==============================================================+
  |              AZURE  FOR  VERACITY                            |
  |         "Make sure the data is trustworthy."                 |
  +==============================================================+

The Veracity Lineup

Mapping Data Flows • Microsoft Purview • Info Protection • Activity Log • Defender for Cloud • Key Vault • Azure Monitor

The Veracity Toolkit

  Service                                Purpose
  -------                                -------
  ADF Mapping Data Flows                 Visual data cleaning and profiling
  Microsoft Purview Data Quality         Rule-based DQ checks
  Great Expectations / Deequ on Spark    Unit tests for data (open-source in Databricks)
  Microsoft Purview                      Data catalog + lineage + policy
  Azure Activity Log                     Audit every control-plane change
  Azure Monitor + Log Analytics          Resource and pipeline telemetry
  Microsoft Defender for Cloud           Discover and classify PII, CSPM
  Azure Key Vault                        Manage encryption keys and secrets

Service Spotlight - ADF Mapping Data Flows

  +--------------------------------------------------------------+
  |  ADF  MAPPING  DATA  FLOWS  -  No-Code Data Prep             |
  +--------------------------------------------------------------+
  |  Interface   :  Visual, drag-and-drop                        |
  |  Transforms  :  90+ (nulls, dupes, dates, joins, aggs...)    |
  |  Engine      :  Spark (managed, auto-scaled)                 |
  |  Debug       :  Interactive, with sample data                |
  |                                                              |
  |  Hand this to business analysts - no Spark code needed.      |
  +--------------------------------------------------------------+
  Source  --->  Profile  --->  Apply transforms  --->  Sink
                (stats,        (fill nulls, parse   (to ADLS,
                 anomalies)     dates, dedupe)       Synapse)

Microsoft Purview - Catalog + Quality + Lineage

  +--------------------------------------------------------------+
  |  MICROSOFT  PURVIEW  -  Unified Data Governance              |
  +--------------------------------------------------------------+
  |  Catalog     :  Scan ADLS, Synapse, SQL, Power BI, S3, GCS   |
  |  Lineage     :  End-to-end column-level lineage              |
  |  DQ rules    :  Completeness, uniqueness, ranges, custom     |
  |  Insights    :  Sensitivity labels, hotspots, adoption       |
  |                                                              |
  |  The "nervous system" for multi-cloud data governance.       |
  +--------------------------------------------------------------+
  RULES                                          CHECK RESULT
  -----                                          -------------
  order_id IS NOT NULL                     ...   PASS
  amount BETWEEN 0 AND 1_000_000           ...   PASS
  customer_id IN customers                 ...   PASS
  COUNT(DISTINCT order_id) = COUNT(*)      ...   FAIL - 23 dupes!
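The four rules in the panel above are easy to prototype before configuring them in a governance tool. The sketch below evaluates the same rules over a tiny in-memory batch; the function `run_checks` and the sample rows are hypothetical, and this is plain Python rather than Purview's own rule syntax.

```python
def run_checks(orders: list, known_customers: set) -> dict:
    """Evaluate each rule over the whole batch and return rule -> passed?"""
    order_ids = [row["order_id"] for row in orders]
    return {
        "order_id IS NOT NULL": all(oid is not None for oid in order_ids),
        "amount BETWEEN 0 AND 1_000_000": all(0 <= row["amount"] <= 1_000_000 for row in orders),
        "customer_id IN customers": all(row["customer_id"] in known_customers for row in orders),
        "COUNT(DISTINCT order_id) = COUNT(*)": len(set(order_ids)) == len(order_ids),
    }


# Hypothetical batch containing one duplicated order_id.
orders = [
    {"order_id": 1, "amount": 120.0, "customer_id": "C1"},
    {"order_id": 2, "amount": 80.0, "customer_id": "C2"},
    {"order_id": 2, "amount": 80.0, "customer_id": "C2"},
]

results = run_checks(orders, known_customers={"C1", "C2"})
for rule, passed in results.items():
    print(f"{rule:<40} {'PASS' if passed else 'FAIL'}")
```

Only the uniqueness rule fails for this batch, which matches the shape of the report above: most rules pass quietly and one points at the rows to investigate.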

Service Spotlight - Microsoft Defender for Cloud

  +--------------------------------------------------------------+
  |  MICROSOFT  DEFENDER  FOR  CLOUD  -  CSPM + PII detection    |
  +--------------------------------------------------------------+
  |  Method      :  ML-based classification of storage contents  |
  |  Finds       :  Credit cards, SSNs, emails, addresses        |
  |  Output      :  Severity alerts -> Sentinel SIEM             |
  |                                                              |
  |  Sniffs sensitive data hiding in your storage accounts.      |
  +--------------------------------------------------------------+

Activity Log + Policy - The Audit Twins

Essential for regulated industries (finance, healthcare, gov).

The Data Quality Lifecycle

  PROFILE  --->  DEFINE RULES  --->  ENFORCE  --->  ALERT  --->  FIX  --->  MONITOR
  (know)         (expected)           (every run)    (fail fast)  (fix)     (trends)
     ^                                                                         |
     |                                                                         |
     +---------------------------- loop ---------------------------------------+

Real-World Example - Sales Lake Quality Gate

  Raw sales CSV from 50 stores
        |
        v
  ADF Mapping Data Flow reads it
        |
        v
  Purview Data Quality rules:
     PASS - order_id unique
     PASS - amount in [0, 1M]
     FAIL - store_id in valid list   (12 rows failed)
        |
        v
     +--+-------+
     v          v
  Quarantine    Curated
  container +   zone
  Teams alert   (Parquet)
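The split at the bottom of this diagram is a row-level routing decision. A minimal sketch, assuming each row carries a `store_id` and that the list of valid stores is available as a set; the function name, the sample rows, and the printed alert are hypothetical stand-ins for the quarantine container and the Teams message.

```python
def quality_gate(rows: list, valid_store_ids: set) -> tuple:
    """Route each row to the curated zone or to quarantine based on its store_id."""
    curated, quarantine = [], []
    for row in rows:
        if row["store_id"] in valid_store_ids:
            curated.append(row)
        else:
            quarantine.append(row)
    return curated, quarantine


# Hypothetical rows; S99 is not in the valid store list.
rows = [
    {"order_id": 1, "store_id": "S01"},
    {"order_id": 2, "store_id": "S99"},
    {"order_id": 3, "store_id": "S02"},
]

curated, quarantine = quality_gate(rows, valid_store_ids={"S01", "S02"})
if quarantine:
    print(f"ALERT: {len(quarantine)} row(s) failed the store_id check")
```

Keeping the failed rows instead of dropping them is the point of the quarantine branch: someone can fix and replay them later.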

💡 HitaVir Tech says: "Every pipeline must have quality rules, not 'someday' but from day one. It is 10x cheaper to catch bad data at ingest than to explain a wrong dashboard to the CEO."

  +==============================================================+
  |              AZURE  FOR  VALUE                               |
  |       "Turn data into decisions and ROI."                    |
  +==============================================================+

The Value Lineup

Power BI • Azure ML • Azure OpenAI • Anomaly Detector • Personalizer • Metrics Advisor

The Value Toolkit

  Service                      Purpose
  -------                      -------
  Microsoft Power BI           Dashboards, BI, natural-language analytics
  Azure Machine Learning       Build, train, deploy ML models
  Azure AI Anomaly Detector    No-code time-series anomaly detection
  Azure AI Personalizer        Contextual recommendation engine
  Azure AI Metrics Advisor     Proactive KPI anomaly monitoring
  Azure OpenAI Service         GPT-family large language models, hosted on Azure
  Power BI Copilot             Ask data questions in natural language
  Synapse ML / Fabric ML       ML in notebooks next to your data
  Microsoft Fabric             SaaS analytics: Lakehouse + Warehouse + BI

Service Spotlight - Microsoft Power BI

  +--------------------------------------------------------------+
  |  MICROSOFT  POWER  BI  -  Business Intelligence              |
  +--------------------------------------------------------------+
  |  Sources     :  Synapse, SQL, ADLS, Fabric, 100+ connectors  |
  |  Engine      :  VertiPaq (in-memory columnar)                |
  |  Superpowers :  Copilot (natural language) + Embedded        |
  |  Editions    :  Free  |  Pro  |  Premium  |  Fabric          |
  |                                                              |
  |  A Leader in Gartner's BI Magic Quadrant for 17 years.       |
  +--------------------------------------------------------------+

Power BI Flow

  Synapse / SQL / ADLS  --->  Dataset  --->  Visuals  --->  Report  --->  App
  (data source)              (VertiPaq       (charts)       (pages)       (share)
                              cache)

Service Spotlight - Azure Machine Learning

  +--------------------------------------------------------------+
  |  AZURE  MACHINE  LEARNING  -  Full-Lifecycle ML              |
  +--------------------------------------------------------------+
  |  Studio      :  Browser IDE for ML                           |
  |  AutoML      :  Tries many models automatically              |
  |  Pipelines   :  Train / evaluate / deploy as YAML + CLI v2   |
  |  MLOps       :  Model registry, endpoints, monitoring        |
  |                                                              |
  |  From raw data to deployed model - one platform.             |
  +--------------------------------------------------------------+

Service Spotlight - Azure OpenAI Service

  +--------------------------------------------------------------+
  |  AZURE  OPENAI  SERVICE                                      |
  +--------------------------------------------------------------+
  |  Models      :  GPT-4, GPT-4o, embeddings, DALL-E, Whisper   |
  |  Compliance  :  SOC, HIPAA, FedRAMP, private network         |
  |  Integration :  RAG with AI Search, Cognitive Services       |
  |                                                              |
  |  Same OpenAI models, but your data never leaves Azure.       |
  +--------------------------------------------------------------+
  Your docs (ADLS)
       |
       v
  AI Search (vector index)
       |
       v
  Azure OpenAI (GPT-4o with RAG)
       |
       v
  Chat / Copilot experiences
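The retrieval half of this RAG flow can be sketched without any cloud service. The example below ranks documents by cosine similarity over simple word counts and assembles a grounded prompt. It is a teaching stand-in: real deployments use learned embeddings and a vector index such as AI Search, and every function name and sample document here is hypothetical. No model is called.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lower-cased word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list, top_k: int = 2) -> list:
    """Return the top_k documents most similar to the question."""
    query = embed(question)
    ranked = sorted(documents, key=lambda doc: cosine(query, embed(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, documents: list, top_k: int = 2) -> str:
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "Refunds are processed within 5 business days.",
    "Shipping to Region X takes 7 days.",
    "Our office is closed on public holidays.",
]
print(build_prompt("How long do refunds take?", docs, top_k=1))
```

The model only ever sees the retrieved context plus the question, which is how RAG keeps answers tied to your own documents.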

Service Spotlight - Azure AI Personalizer

  +--------------------------------------------------------------+
  |  AZURE  AI  PERSONALIZER  -  Netflix-Style Recs              |
  +--------------------------------------------------------------+
  |  Inputs      :  Context features + reward signals            |
  |  Use cases   :  "You may like...", "Related items..."        |
  |  Real-time   :  Inference in milliseconds                    |
  |                                                              |
  |  Contextual-bandit reinforcement learning, no PhD required.  |
  +--------------------------------------------------------------+
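To give a feel for "context features + reward signals", here is a toy epsilon-greedy contextual bandit. It is a deliberately simplified illustration of the contextual-bandit idea, not the algorithm or API that Personalizer actually uses; the class, its method names, and the sample contexts are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Toy contextual bandit: track the mean reward per (context, action) pair."""

    def __init__(self, actions, epsilon: float = 0.1, seed: int = 0) -> None:
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)  # running mean reward

    def rank(self, context: str) -> str:
        """Pick an action: explore with probability epsilon, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda action: self.values[(context, action)])

    def reward(self, context: str, action: str, value: float) -> None:
        """Fold the observed reward into the running mean for this pair."""
        key = (context, action)
        self.counts[key] += 1
        self.values[key] += (value - self.values[key]) / self.counts[key]


bandit = EpsilonGreedyBandit(["laptop", "mouse"], epsilon=0.0)
bandit.reward("student", "mouse", 1.0)   # a student clicked the mouse recommendation
print(bandit.rank("student"))            # mouse
```

With `epsilon=0.0` the example is fully deterministic; a non-zero value trades a little short-term reward for continued exploration of other actions.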

Power BI Copilot - Natural-Language BI

  User types:   "Show me top 5 products last quarter"
                        |
                        v
     Copilot interprets -> writes DAX -> runs -> visualizes
                        |
                        v
            Bar chart appears within seconds

Analysts no longer gatekeep simple questions.

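Synapse ML - SQL-Native Machine Learning

-- Illustrative sketch using T-SQL PREDICT in a Synapse dedicated SQL pool.
-- Assumptions: a churn model was trained elsewhere (for example in Azure ML),
-- exported to ONNX, and stored in a hypothetical table dbo.models(name, model).
-- The output column in WITH (...) must match the model's own output name.
SELECT d.customer_id, p.churn_probability AS risk
FROM   PREDICT(MODEL   = (SELECT model FROM dbo.models WHERE name = 'churn_model'),
               DATA    = dbo.customer_features AS d,
               RUNTIME = ONNX)
WITH   (churn_probability FLOAT) AS p
WHERE  p.churn_probability > 0.8;

Scoring runs inside your Synapse warehouse; the model itself is trained elsewhere and imported in ONNX format.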

The Value Stack

  +==============================================================+
  |                                                              |
  |   BUSINESS  VALUE      revenue - savings - growth            |
  |        ^                                                     |
  |        |                                                     |
  |   APPLICATION LAYER    Personalizer, Metrics Advisor, AD     |
  |        ^                                                     |
  |        |                                                     |
  |   ML LAYER             Azure ML, Synapse ML, OpenAI          |
  |        ^                                                     |
  |        |                                                     |
  |   ANALYSIS LAYER       Synapse + Power BI + Fabric           |
  |        ^                                                     |
  |        |                                                     |
  |   BUILD LAYER          ADLS + ADF + Event Hubs + Purview     |
  |                                                              |
  +==============================================================+

💡 HitaVir Tech says: "The best data platform is worthless if nobody uses the outputs. Start from Value and work backwards: who sees which insight, and what decision changes? Design everything else to serve that."

  +==============================================================+
  |         END-TO-END  AZURE  ANALYTICS  STACK                  |
  +==============================================================+

The Full Cast - Every Service You Will See Below

Event Hubs • Data Factory • ADLS Gen2 • Synapse • Databricks • Stream Analytics • Data Explorer • Azure ML • Power BI • Azure OpenAI

All 5 Vs combined into one living architecture:

  INGEST  (Velocity + Variety Layers)
  +---------------------------+    +---------------------------+
  |  Azure Event Hubs         |    |  Data Factory             |
  |  Event Hubs Capture       |    |  AI Vision / Doc Intel.   |
  |  Functions                |    |  AI Speech / Language     |
  |  IoT Hub                  |    |  Translator               |
  +-------------+-------------+    +-------------+-------------+
                |                                |
                +----------------+---------------+
                                 |
                                 v
  STORE  (Volume Layer)
  +----------------------------------------------------------+
  |   ADLS Gen2  Data Lake  (raw  /  curated  /  analytics)  |
  |   <------>  Microsoft Purview  (catalog + lineage)       |
  +-----------------------------+----------------------------+
                                |
                                v
  PROCESS + VERACITY
  +----------------------------------------------------------+
  |   ADF Data Flows  /  Databricks  /  Synapse Spark        |
  |   <---  Purview DQ  /  Defender for Cloud  /  Policy     |
  +-----------------------------+----------------------------+
                                |
                                v
  QUERY
  +--------------+  +--------------+  +--------------+
  |  Synapse     |  |  Synapse     |  | Azure ML     |
  |  Serverless  |  |  Dedicated   |  |  (training)  |
  +------+-------+  +------+-------+  +------+-------+
         |                 |                 |
         +-----------------+-----------------+
                           |
                           v
  VALUE LAYER
  +----------------------------------------------------------+
  |   Power BI       |    Azure OpenAI (GPT)                 |
  |   Copilot (NL)   |    Personalizer                       |
  +-----------------------------+----------------------------+
                                |
                                v
                 Decisions  -  Revenue  -  Growth

Every box maps to a V. Every arrow is a managed Azure service.

  +==============================================================+
  |  HANDS-ON  LAB  -  ADLS  ->  SYNAPSE  SERVERLESS SQL         |
  +==============================================================+

The Services We'll Touch

Storage Account • ADLS Gen2 • Synapse

You will build a mini pipeline that touches Volume (ADLS Gen2), Variety (CSV auto-queried), and Value (SQL insights).

   Step 1         Step 2-3         Step 4-5           Step 6          Step 7-9
  +------+       +------+          +----------+    +--------+     +----------+
  | CSV  | ----> | ADLS | ------>  | Synapse  | -> |Serverl.| --> | T-SQL    |
  +------+       +------+          | workspace|    | SQL    |     +----------+
  Prepare        Upload            Create                        Run OPENROWSET
  sample data    to container      workspace                     get insights

📄 Step 1 - Prepare Sample Data

On your laptop, create a file called sales.csv:

order_id,customer,product,quantity,amount,order_date
1001,Ravi,Laptop,1,75000.00,2026-04-01
1002,Priya,Mouse,2,1500.00,2026-04-01
1003,Amit,Keyboard,1,3500.00,2026-04-02
1004,Ravi,Monitor,1,18000.00,2026-04-02
1005,Sneha,Headphones,1,4500.00,2026-04-03
1006,Priya,Laptop,1,75000.00,2026-04-03
1007,Vikram,USB Cable,3,900.00,2026-04-04
1008,Neha,Webcam,1,5000.00,2026-04-04
1009,Ravi,Mouse,1,750.00,2026-04-05
1010,Amit,Monitor,1,18000.00,2026-04-05
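If you prefer to generate the file rather than type it, the sketch below produces exactly the ten rows shown above. The helper names `sales_csv_text` and `write_sales_csv` are hypothetical; nothing here is Azure-specific.

```python
import csv
import io
from pathlib import Path

HEADER = ["order_id", "customer", "product", "quantity", "amount", "order_date"]
ROWS = [
    ["1001", "Ravi", "Laptop", "1", "75000.00", "2026-04-01"],
    ["1002", "Priya", "Mouse", "2", "1500.00", "2026-04-01"],
    ["1003", "Amit", "Keyboard", "1", "3500.00", "2026-04-02"],
    ["1004", "Ravi", "Monitor", "1", "18000.00", "2026-04-02"],
    ["1005", "Sneha", "Headphones", "1", "4500.00", "2026-04-03"],
    ["1006", "Priya", "Laptop", "1", "75000.00", "2026-04-03"],
    ["1007", "Vikram", "USB Cable", "3", "900.00", "2026-04-04"],
    ["1008", "Neha", "Webcam", "1", "5000.00", "2026-04-04"],
    ["1009", "Ravi", "Mouse", "1", "750.00", "2026-04-05"],
    ["1010", "Amit", "Monitor", "1", "18000.00", "2026-04-05"],
]


def sales_csv_text() -> str:
    """Render the header and rows as CSV text with Unix line endings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(ROWS)
    return buffer.getvalue()


def write_sales_csv(path: str = "sales.csv") -> Path:
    """Write the sample data to disk and return the path that was written."""
    target = Path(path)
    target.write_text(sales_csv_text(), encoding="utf-8")
    return target
```

Calling `write_sales_csv()` from a Python shell creates `sales.csv` in the current directory, ready for the upload in Step 3.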

🪣 Step 2 - Create a Storage Account with ADLS Gen2

  1. Sign in to the Azure Portal (portal.azure.com)
  2. Click Create a resource → Storage account
  3. Resource group: rg-hitavirtech-analytics (create new)
  4. Storage account name: hitavirtechanalyticsXX (3-24 characters, lowercase letters and digits only, globally unique; replace XX with random digits)
  5. Region: the one closest to you (for example, Central India)
  6. Redundancy: LRS (cheapest)
  7. Advanced tab → enable Hierarchical namespace (this makes it ADLS Gen2)
  8. Review + create → Create
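Storage account names are a common first stumbling block, so it can help to check a candidate before opening the portal. A minimal sketch, assuming the usual naming rule of 3 to 24 lowercase letters and digits; the function name is hypothetical, and global uniqueness can only be confirmed by Azure itself.

```python
import re

# 3-24 characters, lowercase letters and digits only (no hyphens or underscores).
ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the local naming rule only; uniqueness is verified by Azure at creation."""
    return ACCOUNT_NAME.fullmatch(name) is not None


for candidate in ("hitavirtechanalytics42", "HitaVirTech", "rg-hitavirtech-analytics"):
    print(candidate, is_valid_storage_account_name(candidate))
```

Note that the resource group name from step 3 would fail this check: hyphens are fine for resource groups but not for storage accounts.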

โฌ†๏ธ Step 3 โ€” Upload the CSV

  1. Open your storage account โ†’ Containers โ†’ + Container
  2. Name: lake โ†’ Create
  3. Open lake โ†’ + Add Directory โ†’ raw
  4. Inside raw/ โ†’ + Add Directory โ†’ sales
  5. Inside raw/sales/ โ†’ Upload โ†’ select sales.csv โ†’ Upload

Your file now lives at:

https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv
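The URL follows a fixed pattern of account, container, and path, which you will paste into every OPENROWSET query later. A small helper makes that pattern explicit; the function name is hypothetical, and `hitavirtechanalyticsXX` is the same placeholder used throughout the lab.

```python
def adls_https_url(account: str, container: str, path: str) -> str:
    """Build the ADLS Gen2 (dfs endpoint) URL for a file in a container."""
    return f"https://{account}.dfs.core.windows.net/{container}/{path.lstrip('/')}"


print(adls_https_url("hitavirtechanalyticsXX", "lake", "raw/sales/sales.csv"))
```

Swap in your real account name once and reuse the resulting string in Steps 7 and 8.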

๐Ÿ—๏ธ Step 4 โ€” Create a Synapse Workspace

  1. Portal โ†’ Create a resource โ†’ Azure Synapse Analytics
  2. Resource group: rg-hitavirtech-analytics
  3. Workspace name: syn-hitavirtech-XX
  4. Region: same as the storage account
  5. Select Data Lake Storage Gen2: pick hitavirtechanalyticsXX + container lake
  6. SQL admin credentials: set a strong password (save it!)
  7. Review + create โ†’ Create (takes ~3 minutes)

๐Ÿ” Step 5 โ€” Grant Synapse Access to the Lake

  1. Open the storage account โ†’ Access Control (IAM) โ†’ + Add role assignment
  2. Role: Storage Blob Data Contributor
  3. Assign access to: Managed identity โ†’ pick the Synapse workspace
  4. Save

Without this step, Synapse Serverless SQL cannot read your files.

💻 Step 6 - Open Synapse Studio

  1. Portal → your Synapse workspace → Open Synapse Studio
  2. Left nav → Data tab
  3. Expand Linked → Azure Data Lake Storage Gen2 → your account → lake → raw → sales
  4. Right-click sales.csv → New SQL script → Select TOP 100 rows
  5. Synapse auto-generates an OPENROWSET query; click Run

You should see the 10 data rows (plus the header line as an extra row if the generated script does not set HEADER_ROW = TRUE). 🎉

๐Ÿ” Step 7 โ€” Query with Serverless SQL

Replace the auto-generated query with:

SELECT *
FROM OPENROWSET(
    BULK 'https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales;

Replace hitavirtechanalyticsXX with your actual account name. Click Run.

💡 Step 8 - Analytical Queries

-- Top customers by revenue
SELECT customer, SUM(CAST(amount AS DECIMAL(12,2))) AS total_spent
FROM OPENROWSET(
    BULK 'https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
) AS sales
GROUP BY customer
ORDER BY total_spent DESC;

-- Best-selling products
SELECT product, SUM(CAST(quantity AS INT)) AS units_sold
FROM OPENROWSET(
    BULK 'https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
) AS sales
GROUP BY product
ORDER BY units_sold DESC;

-- Daily revenue
SELECT order_date, SUM(CAST(amount AS DECIMAL(12,2))) AS daily_revenue
FROM OPENROWSET(
    BULK 'https://hitavirtechanalyticsXX.dfs.core.windows.net/lake/raw/sales/sales.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
) AS sales
GROUP BY order_date
ORDER BY order_date;
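To know what the three queries should return before you run them in Synapse, you can reproduce them locally. The sketch below loads the same ten rows into an in-memory SQLite database and runs equivalent SQL. SQLite stands in for Serverless SQL here purely as a checking aid; the table name `sales` and the helper `load_sales` are assumptions made for the example.

```python
import csv
import io
import sqlite3

SALES_CSV = """order_id,customer,product,quantity,amount,order_date
1001,Ravi,Laptop,1,75000.00,2026-04-01
1002,Priya,Mouse,2,1500.00,2026-04-01
1003,Amit,Keyboard,1,3500.00,2026-04-02
1004,Ravi,Monitor,1,18000.00,2026-04-02
1005,Sneha,Headphones,1,4500.00,2026-04-03
1006,Priya,Laptop,1,75000.00,2026-04-03
1007,Vikram,USB Cable,3,900.00,2026-04-04
1008,Neha,Webcam,1,5000.00,2026-04-04
1009,Ravi,Mouse,1,750.00,2026-04-05
1010,Amit,Monitor,1,18000.00,2026-04-05
"""


def load_sales() -> sqlite3.Connection:
    """Load the lab's sample rows into an in-memory SQLite table named sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER, customer TEXT, product TEXT, "
        "quantity INTEGER, amount REAL, order_date TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(SALES_CSV)))
    conn.executemany(
        "INSERT INTO sales VALUES (:order_id, :customer, :product, :quantity, :amount, :order_date)",
        rows,
    )
    return conn


conn = load_sales()
top_customers = conn.execute(
    "SELECT customer, SUM(amount) AS total_spent FROM sales "
    "GROUP BY customer ORDER BY total_spent DESC"
).fetchall()
units_sold = dict(conn.execute(
    "SELECT product, SUM(quantity) FROM sales GROUP BY product"
).fetchall())
daily_revenue = conn.execute(
    "SELECT order_date, SUM(amount) FROM sales GROUP BY order_date ORDER BY order_date"
).fetchall()

print(top_customers[0])   # ('Ravi', 93750.0)
print(units_sold["Mouse"])  # 3
print(daily_revenue[0])   # ('2026-04-01', 76500.0)
```

If the numbers in Synapse Studio differ from these, the usual suspects are a missing `HEADER_ROW = TRUE` or a typo in the uploaded file.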

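💰 Step 9 - The "Data Scanned" Number

After every query, Synapse Studio reports how much data the query processed (check the Messages tab).

That number drives your bill. Serverless SQL charges roughly $5 per TB processed (pricing varies by region), with a 10 MB minimum per query. At scale, every analytics engineer watches it. Partitioning plus a columnar format such as Parquet can shrink it by orders of magnitude, because queries read only the partitions and columns they need.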
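A back-of-the-envelope estimator makes the pricing rule tangible. The sketch assumes about $5 per TB, rounding up to whole megabytes, a 10 MB minimum per query, and binary units (1 TB = 1024^4 bytes); actual prices vary by region and over time, and the function name is hypothetical.

```python
import math

PRICE_PER_TB_USD = 5.0   # assumed list price; check the pricing page for your region
MIN_BILLED_MB = 10       # assumed per-query minimum
MB = 1024 ** 2
TB = 1024 ** 4


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the charge for one query from the bytes it scanned."""
    billed_mb = max(math.ceil(bytes_scanned / MB), MIN_BILLED_MB)
    return billed_mb * MB / TB * PRICE_PER_TB_USD


print(estimate_query_cost(TB))                 # 5.0
print(estimate_query_cost(500 * 1024 ** 3))    # a 500 GB CSV scan
print(estimate_query_cost(5 * 1024 ** 3))      # the same query reading 5 GB of Parquet
```

The lab's tiny CSV falls under the minimum, so each query costs a small fraction of a cent; the estimator matters once a daily dashboard scans hundreds of gigabytes.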

🧹 Step 10 - Cleanup (Mandatory!)

⚠️ Forgetting cleanup = surprise Azure bill.

The simplest, safest cleanup on Azure: delete the entire resource group.

  1. Portal → Resource groups → rg-hitavirtech-analytics
  2. Delete resource group → type the name to confirm → Delete
  3. Wait ~5 minutes for all resources to terminate
  4. 💰 Check Cost Management next day → confirm ~$0

💡 HitaVir Tech says: "The last 5 minutes of cleanup are the most valuable 5 minutes of the entire lab. Engineers who skip it end up with $300 surprise bills."

  +==============================================================+
  |          CONGRATULATIONS  -  PART 1 DONE!                    |
  +==============================================================+

What You Learned

🧠 Analytics Concepts

  Topic                                     Icon
  -----                                     ----
  Analytics and the four maturity levels    📊
  Machine Learning: three flavors           🤖

The 5 Vs of Big Data

  V           Icon    Theme
  -           ----    -----
  VOLUME      📦      Scale
  VARIETY     🧩      Formats
  VELOCITY    🌊      Speed
  VERACITY    🛡️      Trust
  VALUE       💎      Outcome

โ˜๏ธ Azure Services Mapped to Each V

ADLS Gen2 Synapse Databricks Data Factory Event Hubs Stream Analytics Power BI ML OpenAI

V

Key Services

๐Ÿ“ฆ Volume

๐Ÿชฃ ADLS Gen2 โ€ข ๐Ÿ›๏ธ Synapse โ€ข ๐Ÿ”ฅ Databricks โ€ข ๐Ÿ˜ HDInsight

๐Ÿงฉ Variety

๐Ÿ•ธ๏ธ Data Factory โ€ข ๐Ÿ” Synapse Serverless โ€ข ๐Ÿค– AI Vision โ€ข ๐Ÿ“ Doc Intelligence โ€ข ๐Ÿ”Š AI Speech โ€ข ๐Ÿ’ญ AI Language

๐ŸŒŠ Velocity

๐ŸŒŠ Event Hubs โ€ข ๐ŸŽฏ Stream Analytics โ€ข ๐Ÿ”ฌ Data Explorer โ€ข โšก Functions

๐Ÿ›ก๏ธ Veracity

๐Ÿงช ADF Data Flows โ€ข ๐Ÿ›ก๏ธ Purview โ€ข ๐Ÿ•ต๏ธ Defender โ€ข ๐Ÿ“œ Activity Log

๐Ÿ’Ž Value

๐Ÿ“Š Power BI โ€ข ๐Ÿค– Azure ML โ€ข ๐Ÿง  Azure OpenAI โ€ข ๐ŸŽฏ Personalizer

🛠️ Hands-on Skills

What's Coming in Part 2

🚀 Part 2 - Advanced Analytics on Azure will cover:

What To Do Next

  1. 🔄 Repeat this lab with a different dataset (try a Kaggle CSV)
  2. 📖 Read the Azure Synapse and ADLS docs
  3. 💰 Watch your Azure bill for a few days
  4. 🧠 Apply the 5 Vs to a project you work on and identify the bottleneck V

Final Thoughts

  +==============================================================+
  |                                                              |
  |    The 5 Vs  =  universal data challenge framework           |
  |    Azure    =  complete toolbox for each V                   |
  |                                                              |
  |    Learn both  ->  you can pick up any cloud's analytics     |
  |    stack in a week.                                          |
  |                                                              |
  +==============================================================+

💡 HitaVir Tech says: "Analytics is not about tools. Tools change every two years. Analytics is about asking the right question, finding the right data, and presenting an insight people can act on. Master the fundamentals (Volume, Variety, Velocity, Veracity, Value) and every new tool becomes just another syntax."

🎓 Welcome to cloud analytics on Azure. See you in Part 2.

- HitaVir Tech ☁️

  +==============================================================+
  |        DIAGNOSE  YOUR  OWN  PROJECT  WITH  THE  5  Vs        |
  +==============================================================+

Think of a data project you work on (or want to build). Run it through these five questions. The V that feels most stressful is your bottleneck; that's where to focus first.

  1. "Do we have somewhere cheap and durable to store everything?"
     Your V                   :  📦 Volume
     Azure Services to Study  :  🪣 ADLS Gen2 • 🏛️ Synapse • 🔥 Databricks

  2. "Do we have to handle more than one data format?"
     Your V                   :  🧩 Variety
     Azure Services to Study  :  🕸️ ADF • 🔍 Synapse Serverless • 🤖 AI Vision • Doc Intelligence

  3. "Is the current data freshness good enough for stakeholders?"
     Your V                   :  🌊 Velocity
     Azure Services to Study  :  🌊 Event Hubs • 🎯 Stream Analytics • ⚡ Functions

  4. "Do stakeholders trust the numbers we publish?"
     Your V                   :  🛡️ Veracity
     Azure Services to Study  :  🧪 ADF Data Flows • 🛡️ Purview • 🕵️ Defender

  5. "Is anyone actually acting on our outputs?"
     Your V                   :  💎 Value
     Azure Services to Study  :  📊 Power BI • 🤖 Azure ML • 🎯 Personalizer

Score each V from 1 (healthy) to 5 (painful). The highest score is where a senior engineer should lead the next sprint.

🎯 Pro Tip: "Stacking solutions for Velocity when Veracity is the real problem is the most common and expensive Azure mistake. Diagnose honestly before you build."
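The self-assessment is simple enough to script, which helps when several teammates score the same project. A minimal sketch of the "highest score wins" rule; the function name and the sample scores are hypothetical.

```python
def bottleneck_vs(scores: dict) -> list:
    """Return the V (or Vs, on a tie) with the highest pain score from 1 to 5."""
    for v, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{v}: score must be between 1 and 5, got {score}")
    worst = max(scores.values())
    return [v for v, score in scores.items() if score == worst]


# Hypothetical scores for one project: 1 = healthy, 5 = painful.
project = {"Volume": 2, "Variety": 3, "Velocity": 2, "Veracity": 5, "Value": 3}
print(bottleneck_vs(project))  # ['Veracity']
```

Returning a list rather than a single name keeps ties visible, so a team does not silently pick one problem when two are equally painful.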

  +==============================================================+
  |          AZURE  ANALYTICS  -  PART 1  CHEAT  SHEET           |
  |                  (screenshot and keep)                       |
  +==============================================================+

The Azure Analytics Toolbox - At a Glance

ADLS Gen2 • Synapse • Databricks • Data Factory • Event Hubs • Stream Analytics • Data Explorer • Cosmos DB • Defender • Key Vault • Power BI • Azure ML • Azure OpenAI • AI Search

🧠 Concepts in a Sentence

  Term                   Definition
  ----                   ----------
  📊 Analytics           Turning data into decisions
  🤖 Machine Learning    Algorithms that learn patterns from data instead of being explicitly programmed
  📸 Descriptive → 🕵️ Diagnostic → 🔮 Predictive → 🎯 Prescriptive
                         The 4 levels of analytics maturity

The 5 Vs in a Sentence

  V   Icon   Question             One-Liner
  -   ----   --------             ---------
  1   📦     How much?            Design for 100× your current data
  2   🧩     How many formats?    Structure is created, not found
  3   🌊     How fast?            Match speed to the decision's deadline
  4   🛡️     Can we trust it?     Quality rules live in the pipeline, not in tribal knowledge
  5   💎     Worth it?            Start from the decision, work backwards

โ˜๏ธ Azure Services โ€” By V

V

Store

Process

Catalog / Quality

Deliver

๐Ÿ“ฆ Volume

๐Ÿชฃ ADLS Gen2 โ€ข ๐ŸงŠ Archive

๐Ÿ›๏ธ Synapse โ€ข ๐Ÿ”ฅ Databricks

๐Ÿ—๏ธ Purview

โ€”

๐Ÿงฉ Variety

๐Ÿชฃ ADLS โ€ข โšก Cosmos DB

๐Ÿค– AI Vision โ€ข ๐Ÿ“ Doc Intel. โ€ข ๐Ÿ”Š AI Speech โ€ข ๐Ÿ’ญ AI Language

๐Ÿ•ธ๏ธ ADF โ€ข ๐Ÿ“š Purview

๐Ÿ” Synapse Serverless

๐ŸŒŠ Velocity

๐ŸŒŠ Event Hubs โ€ข ๐Ÿช Kafka API

โšก Functions โ€ข ๐ŸŽฏ Stream Analytics

๐Ÿš’ Capture (โ†’ ADLS)

๐Ÿ”ฌ Data Explorer

๐Ÿ›ก๏ธ Veracity

โ€”

๐Ÿงช ADF Data Flows

๐Ÿ›ก๏ธ Purview DQ โ€ข ๐Ÿ•ต๏ธ Defender โ€ข ๐Ÿ“œ Activity Log โ€ข โš™๏ธ Policy

๐Ÿ—๏ธ Purview

๐Ÿ’Ž Value

โ€”

๐Ÿค– Azure ML โ€ข ๐ŸŽฏ Personalizer โ€ข ๐Ÿ‘๏ธ Anomaly Detector

โ€”

๐Ÿ“Š Power BI โ€ข ๐Ÿ’ฌ Copilot โ€ข ๐Ÿง  OpenAI

🛠️ The Universal Azure Analytics Pipeline

   INGEST  -->  STORE  -->  CATALOG  -->  PROCESS  -->  QUERY  -->  VISUALIZE  -->  DECIDE
     |          |           |             |             |            |               |
   Event Hubs  ADLS         Purview       ADF flows     Synapse      Power BI        Human
   ADF         Synapse      Fabric cat.   Databricks    Dedicated    Azure ML        action
   Functions   Archive      Data Shares   Synapse Spark Serverless   OpenAI

🎯 The Decision Rules

📈 Next Steps

  1. Re-do the hands-on lab with a different dataset.
  2. Apply the 5 Vs self-assessment to a real project.
  3. Read the Azure Synapse and ADLS documentation pages.
  4. Move to Part 2 - Advanced Analytics on Azure.

Azure service icons used in this codelab are from the official Microsoft Azure Public Service Icons set (V23), freely distributed by Microsoft for use in architecture diagrams and educational materials.