
+============================================================+
| |
| AWS ANALYTICS FUNDAMENTALS - PART 1 |
| |
| Concepts - The 5 Vs of Big Data - AWS Mapping |
| |
| Powered by HitaVir Tech |
+============================================================+
Welcome to Fundamentals of Analytics on AWS - Part 1 by HitaVir Tech!
This codelab builds your mental model for analytics in the cloud, one concept and one AWS service at a time. No prior AWS experience required.
Pillar | Topics |
Concepts | Analytics, Machine Learning, core framework |
The 5 Vs | Volume, Variety, Velocity, Veracity, Value |
AWS Services | One toolkit per V: the complete mapping |
Hands-on Lab | S3 -> Glue -> Athena end-to-end |
Every data challenge you will face maps to one of five questions:
Question | V |
"How do we store 500 TB?" | Volume |
"We have CSVs, JSON, images - help!" | Variety |
"Dashboards must refresh every second" | Velocity |
"Half our timestamps are malformed" | Veracity |
"Who actually uses this dashboard?" | Value |
The 5 Vs give you a framework to diagnose. AWS gives you a toolbox to solve each V.
Estimated duration: 3-4 hours (concepts + hands-on lab)
If you are... | Do this |
A student new to cloud | Read top-to-bottom, do the hands-on lab |
A working engineer | Skim Part 1-2, deep-read the 5 Vs, focus on AWS services for your bottleneck V |
A trainer or mentor | Use section headers as talking points; the spotlight cards are slide-ready |
A reference reader | Jump to the Cheat Sheet at the end |
HitaVir Tech says: "The 5 Vs aren't academic jargon; they're the mental checklist every senior engineer runs when someone says 'we have a data problem.' Learn to speak this language and every cloud will feel familiar."
Required
Helpful
Everything in this codelab runs in the AWS web console in your browser. Zero software installation on your machine.
Every step stays inside the AWS Free Tier or costs a fraction of a cent (Athena bills per query and has no free allowance):
Free Tier Budget | Usage in This Codelab |
S3 storage: 5 GB | < 1 MB |
Athena: pay per query | < 1¢ total |
Glue Data Catalog: 1M objects | 1 table |
Estimated total cost | About $0.00 (under 1¢) |
Warning: always clean up. Step 10 of the lab is a cleanup ritual. Skip it and AWS will happily bill you for forgotten resources.

+==============================================================+
| SECTION 1 - ANALYTICS CONCEPTS |
+==============================================================+
Before we touch AWS, we need three anchor ideas:
+----------------------+
| 1. ANALYTICS |
| turn data into |
| decisions |
+----------+-----------+
|
| powered by
v
+----------------------+
| 2. MACHINE LEARNING|
| algorithms that |
| learn patterns |
+----------+-----------+
|
| challenged by
v
+----------------------+
| 3. THE 5 Vs |
| of Big Data |
+----------------------+

Analytics is the practice of turning raw data into useful insights that help people make better decisions.
That one line is the whole discipline. SQL queries, dashboards, ML models, data lakes: all of it is tooling in service of that idea.
Imagine HitaVir Coffee, a chain with 50 locations across India. Every day, each store generates data:
Data Stream | Data Stream |
Orders | Payments |
Loyalty | Inventory |
Shifts | Equipment |
Deliveries | Reviews |
One transaction alone is meaningless. But aggregate across 50 stores for a year and patterns emerge:
Observation | Action |
Mondays are slowest | Launch "Monday BOGO" |
Store #23 sells 2x pastries | Copy their layout |
Cappuccinos drop in summer | Push cold brew |
That journey, from raw events to actions, is analytics.
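As a minimal sketch of that journey, the snippet below averages a handful of made-up daily order counts by weekday and picks the slowest day. The numbers and names (`daily_orders`, `average_orders_by_weekday`) are hypothetical and only illustrate the step from raw events to an observation you can act on.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical (date, order_count) pairs for one store; not real data.
daily_orders = [
    (date(2026, 4, 6), 310),   # Monday
    (date(2026, 4, 7), 420),   # Tuesday
    (date(2026, 4, 8), 450),   # Wednesday
    (date(2026, 4, 13), 290),  # Monday
    (date(2026, 4, 14), 410),  # Tuesday
    (date(2026, 4, 15), 470),  # Wednesday
]


def average_orders_by_weekday(rows):
    """Group order counts by weekday name and average them."""
    buckets = defaultdict(list)
    for day, count in rows:
        buckets[WEEKDAYS[day.weekday()]].append(count)
    return {name: sum(counts) / len(counts) for name, counts in buckets.items()}


averages = average_orders_by_weekday(daily_orders)
slowest = min(averages, key=averages.get)
print(f"Slowest day: {slowest} ({averages[slowest]:.0f} orders on average)")
```

On this toy sample the slowest day is Monday, which is the kind of observation that would justify a "Monday BOGO" experiment.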
+================================================================+
| |
| L4 PRESCRIPTIVE "What should we do?" |
| |
| ^ |
| | |
| L3 PREDICTIVE "What will happen?" |
| |
| ^ |
| | |
| L2 DIAGNOSTIC "Why did it happen?" |
| |
| ^ |
| | |
| L1 DESCRIPTIVE "What happened?" |
| |
+================================================================+
Level | Question | Example | Powered By |
L1 Descriptive | What happened? | "Sold 12,400 cappuccinos" | SQL / BI |
L2 Diagnostic | Why did it happen? | "Bean shortage hit week 3" | SQL + drill-down |
L3 Predictive | What will happen? | "June sales up 15%" | Machine learning |
L4 Prescriptive | What should we do? | "Order 200 kg by May 25" | ML + optimization |
Most companies live at L1-L2. Analytics engineers build the foundation that makes L3-L4 possible.
HitaVir Tech says: "Never build a dashboard nobody looks at. Always ask: what decision will this insight change? If the answer is 'none', don't build it."

L1-L2 is Athena + QuickSight. L3-L4 adds SageMaker + Bedrock.
Machine Learning (ML) is a branch of AI where algorithms learn patterns from data instead of being explicitly programmed.
+-----------------------------+ +-----------------------------+
| TRADITIONAL PROGRAMMING | | MACHINE LEARNING |
| ------------------------- | | ------------------------- |
| | | |
| Rules + Data | | Data + Answers |
| | | | | |
| v | | v |
| Program | | Learned Model |
| | | | | |
| v | | v |
| Answer | | Rules (weights) |
| | | |
| Human writes the rules. | | Machine learns the rules. |
+-----------------------------+ +-----------------------------+
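The contrast in the two boxes can be sketched in a few lines. The first function is a hand-written rule; the second learns its cut-off from labeled examples. This is a toy one-feature threshold learner with made-up data, an illustration under those assumptions and not how SageMaker or any production model works.

```python
# Traditional programming: a human picks the rule.
def is_suspicious_rule(amount):
    return amount > 10_000  # threshold chosen by a person


# Machine learning (toy version): the threshold is learned from data + answers.
def learn_threshold(examples):
    """Return the cut-off that misclassifies the fewest labeled examples.

    `examples` is a list of (amount, is_fraud) pairs. A transaction is
    predicted as fraud when amount >= threshold.
    """
    best_threshold, best_errors = None, len(examples) + 1
    for candidate in sorted(amount for amount, _ in examples):
        errors = sum((amount >= candidate) != label for amount, label in examples)
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# Hypothetical labeled transactions: (amount, was it fraud?)
labeled = [
    (120, False), (900, False), (4_500, False),
    (15_000, True), (22_000, True), (60_000, True),
]
threshold = learn_threshold(labeled)
print("Learned threshold:", threshold)
```

The learned threshold is the "rules (weights)" output of the right-hand box: nobody typed it in, it fell out of the examples.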
Flavor | Data Needed | Example | AWS Service |
Supervised | Labeled examples | Spam filter, fraud detection, image classification | SageMaker |
Unsupervised | No labels | Customer segmentation, anomaly detection, topic modeling | SageMaker, Comprehend |
Reinforcement | Reward signals | Game AI, robotics, recommenders | SageMaker RL, DeepRacer |
Level | Stops at SQL | Needs ML |
L1 Descriptive | OK | - |
L2 Diagnostic | OK | - |
L3 Predictive | - | Machine learning |
L4 Prescriptive | - | ML + optimization |
HitaVir Tech says: "ML is not magic; it's statistics at scale. If your analytics foundations are shaky, your ML models will be too. Clean data first, cool models second."

Coming up in "AWS Services for Value" (L3-L4 analytics).
+==============================================================+
| SECTION 2 - THE 5 Vs OF BIG DATA |
+==============================================================+
In 2001, analyst Doug Laney described big-data challenges with three Vs: Volume, Variety, Velocity. Later the industry added Veracity (trust) and Value (outcome). Together they form the universal diagnostic checklist.
*
VOLUME
How much is it?
/ \
/ \
/ \
/ \
VARIETY VELOCITY
How many formats? How fast?
\ /
\ /
\ /
\ /
VERACITY VALUE
Can we trust it? Worth it?
# | V | Question |
1 | VOLUME | How much? (scale) |
2 | VARIETY | How many formats? |
3 | VELOCITY | How fast? (speed) |
4 | VERACITY | Can we trust it? |
5 | VALUE | What outcome? |
Miss any one V and your data platform has a hole. Let's tour each.

Volume -> S3. Variety -> Glue. Velocity -> Kinesis. Veracity -> Lake Formation. Value -> QuickSight.
+==============================================================+
| THE 1st V - VOLUME |
| "How much data do we have?" |
+==============================================================+
Volume is the size of your data: how many bytes, rows, events, or files you must store, move, and process.
Unit | Power | What It Holds |
Byte (B) | 1 | A letter |
KB | 10^3 | One email |
MB | 10^6 | One song |
GB | 10^9 | One DVD |
TB | 10^12 | One year of company sales |
PB | 10^15 | One day of YouTube uploads |
EB | 10^18 | All of Netflix streaming |
ZB | 10^21 | The entire internet per year |
A traditional single-node database typically runs fine up to roughly 1-10 TB. Past that, things start to break.
At big-data scale, you need distributed systems: hundreds of machines sharing the load.
Company | Daily Volume |
Instagram | 100M+ photos uploaded |
Amazon | Billions of events |
Uber | 10s of TB of trip data |
Netflix | PB of logs and streams |
Google | Unimaginable |
HitaVir Tech says: "What works at 10 GB catastrophically fails at 10 TB. Always ask: how does this scale at 100x?"
Volume in one line: design for 100x your current data, or rebuild painfully later.

Coming up in "AWS Services for Volume".
+==============================================================+
| THE 2nd V - VARIETY |
| "How many kinds of data?" |
+==============================================================+
Variety is the diversity of data: formats, schemas, and sources.
Twenty years ago, "data" meant rows in a database. Today, it means far more:
Type | Examples | Schema |
Structured | SQL tables, CSV, spreadsheets | Fixed |
Semi-structured | JSON from APIs, XML, Parquet, Avro | Flexible |
Unstructured | Images, video, audio, PDFs, free text | None |
Each format needs different tooling:
Format | Tool Needed |
SQL tables | Relational engine |
JSON | Document parser |
PDF / scans | OCR |
Image | Computer vision |
Audio | Speech-to-text |
Free text | NLP / embeddings |
Most real projects combine these. Example: "Correlate support emails + call recordings + order history into one insight." Three completely different pipelines feeding one answer.
Industry | Variety Mix |
Healthcare | Patient records + X-rays + doctor's notes |
Retail | Orders + product photos + reviews |
Banking | Transactions + scanned checks + call transcripts |
Autonomous cars | Sensors + video + maps + LiDAR |
HitaVir Tech says: "90% of the world's data is unstructured. But 90% of analytics happens on structured data. Your job is often to convert chaos into order."
Variety in one line: structure is created, not found, so choose tools that embrace format diversity.

Coming up in "AWS Services for Variety".
+==============================================================+
| THE 3rd V - VELOCITY |
| "How fast is the data?" |
+==============================================================+
Velocity is the speed at which data arrives, moves, and must be processed to deliver value.
Freshness | Approach | Example Use Case |
Next day | Batch (nightly) | Finance reports |
Every hour | Mini-batch | Ops dashboards |
Seconds | Streaming | Live pricing |
Milliseconds or less | Real-time | Fraud detection, HFT |
Problem | Solution |
Disks too slow | In-memory / caches |
Batch SQL too slow | Stream-processing engines |
One machine too small | Horizontal auto-scaling |
Failures = data loss | Durable logs (Kafka / Kinesis) |
Scenario | Required Latency |
Credit card fraud | Under 100 ms |
Stock trading | Microseconds |
Social feed | Seconds |
Delivery tracking | Minutes |
Exec dashboard | Hourly |
Month close | Daily batch |
HitaVir Tech says: "Streaming is fashionable. Batch is profitable. 80% of real-world analytics runs on batch. Don't reach for streaming unless the business truly cannot wait."
Velocity in one line: match the pipeline's speed to the decision's deadline, no faster.

Coming up in "AWS Services for Velocity".
+==============================================================+
| THE 4th V - VERACITY |
| "Can we trust the data?" |
+==============================================================+
Veracity is the accuracy, consistency, and trustworthiness of your data.
Big volumes and fast pipelines are useless if the data is wrong.
Enemy | Symptom |
Missing | NULL in required fields |
Duplicates | Same row repeated |
Inconsistent | 2024-01-05 vs 01/05/24 |
Outliers | Age = 347 |
Units | USD mixed with INR |
Bias | Only US users sampled |
Stale | Last updated 2019 |
Broken joins | Order with no user |
Noise | Flaky sensor readings |
A beautiful dashboard built on bad data is worse than no dashboard: it creates false confidence. The most dangerous insight is a wrong insight someone believes.
Dimension | Question |
Completeness | Required fields populated? |
Accuracy | Data reflects reality? |
Consistency | Related systems agree? |
Timeliness | Is it current enough? |
Validity | Matches formats / rules? |
Uniqueness | Any unintended duplicates? |
Incident | Consequence |
NASA Mars Climate Orbiter (1999) | $125M spacecraft lost to a metric vs imperial unit mismatch |
Knight Capital (2012) | $440M loss in 45 minutes after a faulty software deployment sent erroneous orders |
Google Flu Trends | Overestimated flu peaks by 100%+ due to search bias |
HitaVir Tech says: "Senior engineers obsess over data quality. Juniors obsess over cool tools. Guess which group builds systems that actually work in production."
Veracity in one line: quality rules are a pipeline concern, not a hope.

Coming up in "AWS Services for Veracity".
+==============================================================+
| THE 5th V - VALUE |
| "What business outcome does it drive?" |
+==============================================================+
Value is the business outcome your data and analytics actually deliver: revenue gained, cost saved, risk reduced, customer experience improved.
Without Value, the other four Vs are expensive hobbies.
VALUE
(outcome)
^
| enabled by
|
Insights & decisions
^
| enabled by
|
Analytics + ML
^
| enabled by
|
Trustworthy (Veracity) data
^
| at the right
| speed (Velocity)
|
across Varieties
^
| stored at
|
the right scale (Volume)
Industry | Analytics Value |
Retail | Recommendation engine -> +20% revenue |
Banking | Fraud detection -> millions saved |
Logistics | Route optimization -> -15% fuel cost |
Healthcare | Early diagnosis models -> better outcomes |
Streaming | Personalized content -> higher retention |
Most companies have folders full of unused dashboards: the dashboard graveyard. Every one cost engineering time, storage, and compute.
The difference between a valuable dashboard and a graveyard dashboard:
+------------------------------------------------------+
| |
| "What decision will change because of this?" |
| |
| If nobody can answer -> DON'T BUILD IT. |
| |
+------------------------------------------------------+
HitaVir Tech says: "A data platform that costs more than the decisions it enables is a failure, no matter how beautiful the architecture. Lead with Value."
Value in one line: start from the decision, work backwards to the pipeline.

Coming up in "AWS Services for Value".
+==============================================================+
| SECTION 3 - AWS SERVICES BY THE 5 Vs |
+==============================================================+

Now we map each V to the AWS services that solve it.
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
| | | | | | | | | | | | | |
| INGST | ->| STORE | ->| CATLG | ->| PROCS | ->| QUERY | ->| VIEW | ->| ACT |
| | | | | | | | | | | | | |
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
The 5 Vs tell you where the bottleneck is. The AWS services tell you what solves it.
+==============================================================+
| AWS FOR VOLUME |
| "Store any amount of data, affordably." |
+==============================================================+

S3, Glacier, Redshift, EMR, Lake Formation
Service | Purpose |
Amazon S3 | Object storage: the data-lake foundation (virtually unlimited scale) |
Amazon S3 Glacier | Cheapest archive tier for cold data |
Amazon Redshift | Petabyte-scale columnar data warehouse |
Amazon EMR | Managed Spark / Hadoop clusters for huge batch jobs |
AWS Lake Formation | Manage, govern, and secure a data lake on S3 |
Amazon EBS / EFS | Block and file storage for compute workloads |

+--------------------------------------------------------------+
| AMAZON S3 - Simple Storage Service |
+--------------------------------------------------------------+
| Category : Object storage / data lake |
| Durability : 99.999999999% (11 nines) |
| Scale : Unlimited (practically) |
| Pricing : ~$0.023 / GB / month (Standard) |
| Read by : Athena, Redshift, EMR, SageMaker, QuickSight|
| |
| If you remember only one AWS service - make it S3. |
+--------------------------------------------------------------+
Class | Access Pattern | Relative Cost |
S3 Standard | Hot, frequent access | $$$$ |
S3 Intelligent-Tiering | Auto hot/cold moves | $$$ |
S3 Standard-IA | Monthly access | $$ |
S3 Glacier Instant | Rare access, instant | $ |
S3 Glacier Flexible | Hours to retrieve | ¢ |
S3 Glacier Deep Archive | Compliance vault | ¢ |
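To get a feel for what Volume costs, here is a rough back-of-the-envelope sketch using the approximate S3 Standard price quoted in the card above. It assumes a flat rate and treats 1 TB as 1000 GB for simplicity; real pricing varies by region, volume tier, and storage class, and request and transfer charges are ignored.

```python
# Approximate S3 Standard price from the card above; treat it as an assumption.
STANDARD_PRICE_PER_GB_MONTH = 0.023


def monthly_storage_cost(terabytes, price_per_gb=STANDARD_PRICE_PER_GB_MONTH):
    """Estimate monthly storage cost in USD at a flat per-GB rate."""
    gigabytes = terabytes * 1000  # simplification: 1 TB = 1000 GB
    return gigabytes * price_per_gb


for size_tb in (1, 100, 500):
    print(f"{size_tb:>4} TB -> about ${monthly_storage_cost(size_tb):,.0f} per month")
```

At this rate the "How do we store 500 TB?" question from the introduction comes to roughly $11,500 per month in Standard, which is why the colder classes in the table matter.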

+--------------------------------------------------------------+
| AMAZON REDSHIFT - Data Warehouse |
+--------------------------------------------------------------+
| Category : Columnar MPP data warehouse |
| Scale : Petabytes (exabytes tested) |
| SQL : PostgreSQL-flavored |
| Modes : Serverless | Provisioned (RA3 nodes) |
| Superpower : Sub-second queries over billions of rows |
| |
| Use when you need fast SQL on huge structured data. |
+--------------------------------------------------------------+

+--------------------------------------------------------------+
| AMAZON EMR - Elastic MapReduce |
+--------------------------------------------------------------+
| Category : Managed Hadoop / Spark / Presto clusters |
| Scale : Thousands of nodes |
| Pricing : Per-instance-hour (Spot = up to 90% off) |
| |
| Use for petabyte-scale custom Spark jobs. |
+--------------------------------------------------------------+
s3://hitavirtech-analytics/
|
+-- raw/ <-- Bronze: untouched source data
| +-- sales/2026/04/22/orders.csv
| +-- inventory/2026/04/22/stock.json
|
+-- curated/ <-- Silver: cleaned, typed Parquet
| +-- sales_fact/year=2026/month=04/day=22/part-001.parquet
|
+-- analytics/ <-- Gold: pre-aggregated for BI
+-- daily_revenue/year=2026/month=04/day=22/part-001.parquet
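The `year=/month=/day=` folders in the layout above follow the Hive-style partition naming that Glue and Athena can recognize. A small helper like the following (the function name and arguments are hypothetical) shows how such a key might be built consistently:

```python
from datetime import date


def curated_key(table, day, part=1):
    """Build a Hive-style partitioned object key for the curated zone."""
    return (
        f"curated/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"part-{part:03d}.parquet"
    )


key = curated_key("sales_fact", date(2026, 4, 22))
print(key)
```

Zero-padding the month and day keeps keys sortable as plain strings and matches the sample paths shown in the tree.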
How much data?
|
+--------------------+--------------------+
| | |
< 1 TB 1-100 TB > 100 TB
| | |
v v v
RDS or S3 + S3 + EMR +
Redshift Serverless Athena Redshift +
(small + cheap) (most common) Lake Formation
(huge platform)
HitaVir Tech says: "Start with S3. Every analytics service on AWS reads from it. You'll never regret putting data into S3, but you may regret putting it anywhere else first."
+==============================================================+
| AWS FOR VARIETY |
| "Handle any data format, elegantly." |
+==============================================================+

Glue, Athena, DynamoDB, Rekognition, Textract, Transcribe, Comprehend, OpenSearch
Service | Purpose |
Amazon S3 | Holds every format: CSV, JSON, Parquet, images, video |
AWS Glue | ETL for all formats; crawlers auto-detect schema |
AWS Glue Data Catalog | Metadata layer: one view across formats |
Amazon Athena | SQL on CSV, JSON, Parquet, ORC, Avro |
Amazon DynamoDB | NoSQL for flexible docs |
Amazon Rekognition | Images / video -> structured labels |
Amazon Textract | PDFs and scans -> text and tables |
Amazon Transcribe | Speech -> text |
Amazon Comprehend | NLP: sentiment, entities, topics |
Amazon OpenSearch | Full-text search and log analytics |

+--------------------------------------------------------------+
| AWS GLUE - Serverless ETL + Data Catalog |
+--------------------------------------------------------------+
| Crawlers : Auto-detect schema for new S3 folders |
| Catalog : Central metadata (Athena, Redshift, EMR) |
| ETL Jobs : Python / Spark transformations |
| DataBrew : No-code visual data prep |
| Data Quality : Rule-based DQ checks |
| |
| The "nervous system" of your AWS data lake. |
+--------------------------------------------------------------+
INPUT OUTPUT
----- ------
CSV + +---- Parquet
JSON +---> Crawler ---> Catalog ---> ETL Job ----+ (optimized)
Parquet + (schema + +---- Updated
Avro | partitions) tables
Database +
SELECT *
FROM csv_orders
JOIN json_customers USING (customer_id)
JOIN parquet_products USING (product_id);
Athena reads CSV, JSON, Parquet, ORC, and Avro, all via the Glue Catalog. You never leave SQL.

Input | AWS Service | Output |
Images | Rekognition | Labels, faces, moderation |
PDFs / scans | Textract | Extracted text + tables |
Audio / voice | Transcribe | Transcripts |
Free text | Comprehend | Sentiment, entities, topics |
Text in other languages | Translate | Translated text |
Magic step: chaos in -> structured features out -> then into S3 / Athena / Redshift as normal.
Customer review (raw text)
|
v
Comprehend ---> sentiment = negative, topic = shipping
|
v
S3 (enriched records)
|
v
Glue Crawler ---> Data Catalog
|
v
Athena SQL: "avg sentiment per product / month"
|
v
QuickSight dashboard for the CX team
|
v
Action: fix shipping partner in Region X
What's my data?
|
+---------+---------+--+---+---------+---------+
| | | | | |
Tabular JSON Images PDFs Audio Free text
| | | | | |
v v v v v v
S3+Glue+ S3 + Reko- Text- Transcribe Comprehend
Athena Athena gnition tract
or DDB
HitaVir Tech says: "The magic of modern analytics: unstructured data becomes structured features in minutes via AWS AI services. What took PhDs years a decade ago is now an API call."
+==============================================================+
| AWS FOR VELOCITY |
| "Move and process data in real time." |
+==============================================================+

Kinesis Streams, Firehose, Kinesis Analytics, MSK, Lambda, EventBridge, SNS, SQS
Service | Purpose |
Kinesis Data Streams | Real-time event stream (like Kafka) |
Kinesis Firehose (now Amazon Data Firehose) | Buffered streaming delivery to S3 / Redshift |
Amazon MSK | Managed Apache Kafka |
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) | SQL / Flink on streams in real time |
AWS Lambda | Event-driven serverless functions |
Amazon EventBridge | Serverless event bus across AWS |
Amazon SNS / SQS | Pub-sub / queue messaging |

+--------------------------------------------------------------+
| KINESIS DATA STREAMS - Real-time Event Stream |
+--------------------------------------------------------------+
| Latency : Sub-second |
| Retention : 24 hours (up to 365 days) |
| Throughput : MB/sec per shard, scale by sharding |
| Ordering : Per-shard ordered |
| |
| The "high-speed conveyor belt" for events. |
+--------------------------------------------------------------+
PRODUCERS KINESIS CONSUMERS
------------ --------- ------------
App events +----------------------+ Lambda
Clickstreams --->| >> >> >> >> >> >> |---> Kinesis Analytics
IoT sensors | durable, ordered, | Firehose -> S3
Transactions | up to 365d retention| OpenSearch
Card swipes +----------------------+ Redshift
Kinesis holds events durably. Multiple consumers read the same stream independently.
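Per-shard ordering works because every record carries a partition key, and Kinesis maps the MD5 hash of that key onto a 128-bit hash key space that is divided among the shards. The sketch below imitates that mapping in plain Python, assuming the shards split the hash space evenly (real streams can have uneven ranges after resharding). The function name is hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# All events for one rider land on one shard, so their order is preserved.
for key in ("rider-1", "rider-2", "rider-1"):
    print(key, "-> shard", shard_for(key, shard_count=4))
```

This is why choosing a partition key such as a user or device ID matters: ordering is only guaranteed among records that share a shard.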

+--------------------------------------------------------------+
| KINESIS FIREHOSE - The Easy Button |
+--------------------------------------------------------------+
| Model : Serverless, fully managed |
| Buffer : e.g. 60 sec or 5 MB, configurable (whichever first) |
| Transforms : Optional Lambda enrichment |
| Sinks : S3 (Parquet!), Redshift, OpenSearch, HTTP |
| |
| No code, no cluster - the laziest streaming on AWS. |
+--------------------------------------------------------------+
Producers ---> Firehose ---> Convert to Parquet ---> S3
(no servers,
auto-scale)
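The "60 sec or 5 MB, whichever first" buffering idea can be sketched as a tiny in-memory micro-batcher. This is only an illustration of the flush rule, not how Firehose is implemented; the class name is hypothetical, and for simplicity the time limit is only checked when a new record arrives.

```python
class MicroBatchBuffer:
    """Collect records and flush when a size or age limit is reached."""

    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=60):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # arrival time of the first buffered record

    def add(self, record, now):
        """Add a bytes record; return the flushed batch if a limit was hit, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        return self.flush() if too_big or too_old else None

    def flush(self):
        """Return the buffered records and reset the buffer."""
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch


# Tiny limits so the behavior is easy to see.
buffer = MicroBatchBuffer(max_bytes=10, max_seconds=60)
print(buffer.add(b"12345", now=0.0))  # below both limits -> None
print(buffer.add(b"67890", now=1.0))  # size limit reached -> batch of two records
```

Each flushed batch corresponds to one object written to the sink, which is why larger buffers mean fewer, bigger files in S3.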
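-- Illustrative streaming SQL only; exact syntax depends on the engine (Flink SQL, ksqlDB, ...)
CREATE STREAM fraud_alerts AS
SELECT user_id, amount, location
FROM transactions_stream
WHERE amount > 10000 OR is_foreign = TRUE;
Results arrive moments after each event, not after the nightly batch.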
S3 new file +
Kinesis event +---> Lambda ---> Redshift load
DynamoDB row + |
EventBridge + +----> SNS / SQS alert
Perfect for: event reactions, enrichment, alerting, small transforms.
Rideshare app (1 million events/sec)
|
v
Kinesis Data Streams
|
+--------+--------+---------+
| | | |
v v v v
Lambda Analytics Firehose
fraud real-time buffer
flag aggregate --> S3 (Parquet)
| | |
v v v
SNS QuickSight Athena
alert live dash. historical BI
How fresh must the data be?
|
+---------+----------+---+---+-----------+
| | | | |
Next day 15 minutes Seconds Sub-second Kafka shop
| | | | |
v v v v v
Glue Firehose Kinesis Kinesis MSK
batch -> S3 + Lambda Analytics
(Flink)
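The decision tree above can be written down as a small lookup function. The branch labels come from the diagram; the exact cut-off values between branches are assumptions made for this sketch, so adjust them to your own latency budget.

```python
def suggest_ingest_path(max_staleness_seconds, already_on_kafka=False):
    """Suggest an ingest path from the freshness requirement.

    Thresholds between branches are illustrative assumptions.
    """
    if already_on_kafka:
        return "MSK"
    if max_staleness_seconds >= 24 * 3600:   # next day is fine
        return "Glue batch"
    if max_staleness_seconds >= 15 * 60:     # 15 minutes or more
        return "Firehose -> S3"
    if max_staleness_seconds >= 1:           # seconds
        return "Kinesis + Lambda"
    return "Kinesis Analytics (Flink)"       # sub-second


print(suggest_ingest_path(24 * 3600))  # Glue batch
print(suggest_ingest_path(15 * 60))    # Firehose -> S3
print(suggest_ingest_path(5))          # Kinesis + Lambda
print(suggest_ingest_path(0.2))        # Kinesis Analytics (Flink)
```

Starting from the loosest freshness requirement the business will accept usually lands you on the cheapest branch.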
HitaVir Tech says: "Every team thinks they need real-time until they see the bill. Start with Firehose and 5-minute micro-batches, and graduate later. Most of the time, you won't need to."
+==============================================================+
| AWS FOR VERACITY |
| "Make sure the data is trustworthy." |
+==============================================================+

DataBrew, Glue Data Quality, Lake Formation, Macie, CloudTrail, Config, IAM, KMS
Service | Purpose |
AWS Glue DataBrew | Visual data cleaning and profiling |
AWS Glue Data Quality | Rule-based DQ checks |
Deequ (open-source library from AWS Labs) | Unit tests for data on Spark |
AWS Lake Formation | Governance and fine-grained permissions |
AWS CloudTrail | Audit every API call |
AWS Config | Track resource configuration drift |
Amazon Macie | Discover and classify PII in S3 |
AWS KMS | Manage encryption keys |

+--------------------------------------------------------------+
| GLUE DATABREW - No-Code Data Prep |
+--------------------------------------------------------------+
| Interface : Visual, spreadsheet-like |
| Transforms : 250+ built-in (nulls, dupes, dates...) |
| Profiling : Auto column stats, anomaly flags |
| Recipes : Save and schedule as jobs |
| |
| Hand this to business analysts - no Spark needed. |
+--------------------------------------------------------------+
Load ---> Profile ---> Apply transforms ---> Export
(stats, (fill nulls, parse (to S3 or
anomalies) dates, standardize) Redshift)

+--------------------------------------------------------------+
| GLUE DATA QUALITY - Rules That Guard the Lake |
+--------------------------------------------------------------+
| Rule types : Completeness, uniqueness, ranges, custom |
| Enforcement : Block pipeline OR quarantine OR alert |
| ML-assisted : Recommends rules from sample data |
| |
| Data quality becomes a pipeline concern, not tribal. |
+--------------------------------------------------------------+
RULES CHECK RESULT
----- -------------
order_id IS NOT NULL ... PASS
amount BETWEEN 0 AND 1_000_000 ... PASS
customer_id IN customers ... PASS
COUNT(DISTINCT order_id) = COUNT(*) ... FAIL - 23 dupes!
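The rule checks above can be mimicked in plain Python to see what each one actually tests. This is a sketch on a made-up sample, not the Glue Data Quality rule language; the function name and the sample rows are hypothetical.

```python
def run_quality_rules(rows, known_customers):
    """Return {rule_name: passed} for a list of order dicts."""
    order_ids = [row["order_id"] for row in rows]
    return {
        "order_id IS NOT NULL": all(oid is not None for oid in order_ids),
        "amount BETWEEN 0 AND 1_000_000": all(
            0 <= row["amount"] <= 1_000_000 for row in rows
        ),
        "customer_id IN customers": all(
            row["customer_id"] in known_customers for row in rows
        ),
        "order_id is unique": len(set(order_ids)) == len(order_ids),
    }


# Hypothetical sample with one duplicated order_id.
orders = [
    {"order_id": 1001, "amount": 75000.0, "customer_id": "ravi"},
    {"order_id": 1002, "amount": 1500.0, "customer_id": "priya"},
    {"order_id": 1002, "amount": 1500.0, "customer_id": "priya"},
]
results = run_quality_rules(orders, known_customers={"ravi", "priya"})
for rule, passed in results.items():
    print(f"{rule:<35} {'PASS' if passed else 'FAIL'}")
```

In a real pipeline the failing rows would be routed to a quarantine location and an alert raised, rather than just printed.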

+--------------------------------------------------------------+
| AMAZON MACIE - PII Detective |
+--------------------------------------------------------------+
| Method : ML-based classification of S3 contents |
| Finds : Credit cards, SSNs, emails, addresses |
| Output : Severity alerts -> Security Hub |
| |
| Sniffs sensitive data hiding in your data lake. |
+--------------------------------------------------------------+

Essential for regulated industries (finance, healthcare, gov).
PROFILE ---> DEFINE RULES ---> ENFORCE ---> ALERT ---> FIX ---> MONITOR
(know) (expected) (every run) (fail fast) (fix) (trends)
^ |
| |
+---------------------------- loop ---------------------------------------+
Raw sales CSV from 50 stores
|
v
Glue ETL reads it
|
v
Glue Data Quality runs rules:
PASS - order_id unique
PASS - amount in [0, 1M]
FAIL - store_id in valid list (12 rows failed)
|
v
+--+--+
v v
Quarantine Curated
bucket + zone
SNS alert (Parquet)
HitaVir Tech says: "Every pipeline must have quality rules, and not 'someday': from day one. It is 10x cheaper to catch bad data at ingest than to explain a wrong dashboard to the CEO."
+==============================================================+
| AWS FOR VALUE |
| "Turn data into decisions and ROI." |
+==============================================================+

QuickSight, SageMaker, Forecast, Personalize, Fraud Detector, Bedrock
Service | Purpose |
Amazon QuickSight | Dashboards, BI, natural-language analytics |
Amazon SageMaker | Build, train, deploy ML models |
Amazon Forecast | No-code time-series forecasting |
Amazon Personalize | Recommendation engines |
Amazon Fraud Detector | Fraud-prediction models |
Amazon Q in QuickSight | Ask data questions in natural language |
Amazon Bedrock | Foundation models (Claude, Llama, etc.) |
Redshift ML | Train and run ML using SQL in Redshift |
Amazon Lookout for Metrics | Auto-detect anomalies in business KPIs |

+--------------------------------------------------------------+
| AMAZON QUICKSIGHT - Business Intelligence |
+--------------------------------------------------------------+
| Sources : S3 via Athena, Redshift, RDS, Excel... |
| Engine : SPICE (in-memory columnar cache) |
| Superpowers : Amazon Q (natural language) + Embedded |
| Editions : Standard | Enterprise |
| |
| Your "Tableau / Power BI" on AWS, with AI built in. |
+--------------------------------------------------------------+
Athena / Redshift ---> SPICE ---> Visuals ---> Dashboard ---> Share
(data source) (cache) (charts) (compose) (embed)

+--------------------------------------------------------------+
| AMAZON SAGEMAKER - Full-Lifecycle ML |
+--------------------------------------------------------------+
| Studio : Browser IDE for ML |
| AutoPilot : AutoML - tries many models |
| Pipelines : Train / evaluate / deploy as code |
| Monitor : Catch model drift in production |
| |
| From raw data to deployed model - one platform. |
+--------------------------------------------------------------+

+--------------------------------------------------------------+
| AMAZON FORECAST - No-Code Time-Series ML |
+--------------------------------------------------------------+
| Inputs : Historical + related series + metadata |
| AutoML : Tries multiple algorithms, picks best |
| Accuracy : Up to 50% better than simple baselines (vendor claim) |
| |
| Same tech Amazon uses internally for demand planning. |
+--------------------------------------------------------------+
Past sales + Weather + Holidays
|
v
Forecast (AutoML)
|
v
Daily predictions per SKU per store

+--------------------------------------------------------------+
| AMAZON PERSONALIZE - Netflix-Style Recs |
+--------------------------------------------------------------+
| Inputs : Users + items + interaction events |
| Use cases : "You may like...", "Related items..." |
| Real-time : Inference in milliseconds |
| |
| Same ML powering Amazon.com recommendations. |
+--------------------------------------------------------------+
User types: "Show me top 5 products last quarter"
|
v
Q interprets -> writes SQL -> runs -> visualizes
|
v
Bar chart appears in 2 seconds
Analysts no longer gatekeep simple questions.
CREATE MODEL churn_model
FROM customer_features
TARGET churned
FUNCTION predict_churn
IAM_ROLE DEFAULT
SETTINGS (S3_BUCKET 'my-ml-bucket');
-- then use it
SELECT customer_id, predict_churn(features) AS risk
FROM customers
WHERE risk > 0.8;
ML without leaving your data warehouse.
+==============================================================+
| |
| BUSINESS VALUE revenue - savings - growth |
| ^ |
| | |
| APPLICATION LAYER Personalize, Fraud Detector, Lookout |
| ^ |
| | |
| ML LAYER SageMaker, Forecast, Redshift ML |
| ^ |
| | |
| ANALYSIS LAYER Athena + QuickSight + Redshift |
| ^ |
| | |
| BUILD LAYER S3 + Glue + Kinesis + Lake Formation |
| |
+==============================================================+
HitaVir Tech says: "The best data platform is worthless if nobody uses the outputs. Start from Value and work backwards: who sees which insight, and what decision changes? Design everything else to serve that."
+==============================================================+
| END-TO-END AWS ANALYTICS STACK |
+==============================================================+

All 5 Vs combined into one living architecture:
INGEST (Velocity + Variety Layers)
+---------------------------+ +---------------------------+
| Kinesis Data Streams | | Glue Crawlers |
| Kinesis Firehose | | Rekognition / Textract |
| Lambda | | Transcribe / Comprehend |
| MSK (Kafka) | | Translate |
+-------------+-------------+ +-------------+-------------+
| |
+----------------+---------------+
|
v
STORE (Volume Layer)
+----------------------------------------------------------+
| Amazon S3 Data Lake (raw / curated / analytics) |
| <------> AWS Glue Data Catalog (metadata) |
+-----------------------------+----------------------------+
|
v
PROCESS + VERACITY
+----------------------------------------------------------+
| Glue ETL / EMR (transform) |
| <--- Glue Data Quality / DataBrew / Lake Formation|
+-----------------------------+----------------------------+
|
v
QUERY
+------------+ +------------+ +------------+
| Athena | | Redshift | | SageMaker |
| (SQL) | | (MPP SQL) | | (ML) |
+------+-----+ +------+-----+ +------+-----+
| | |
+---------------+----------------+
|
v
VALUE LAYER
+----------------------------------------------------------+
| QuickSight | Forecast |
| Amazon Q (NL) | Personalize |
+-----------------------------+----------------------------+
|
v
Decisions - Revenue - Growth
Every box maps to a V. Every service inside it is managed by AWS.
+==============================================================+
| HANDS-ON LAB - S3 -> GLUE -> ATHENA |
+==============================================================+

You will build a mini pipeline that touches Volume (S3), Variety (CSV auto-cataloged), and Value (SQL insights).
Step 1         Step 2-3       Step 4-5            Step 6            Step 7-9
+------+       +------+       +-----------+       +---------+       +--------+
| CSV  | ----> |  S3  | ----> |  Crawler  | ----> | Catalog | ----> | Athena |
+------+       +------+       +-----------+       +---------+       +--------+
Prepare        Upload         Create & run        Table auto        Run SQL,
sample data    to bucket      crawler             populated         get insights
On your laptop, create a file called sales.csv:
order_id,customer,product,quantity,amount,order_date
1001,Ravi,Laptop,1,75000.00,2026-04-01
1002,Priya,Mouse,2,1500.00,2026-04-01
1003,Amit,Keyboard,1,3500.00,2026-04-02
1004,Ravi,Monitor,1,18000.00,2026-04-02
1005,Sneha,Headphones,1,4500.00,2026-04-03
1006,Priya,Laptop,1,75000.00,2026-04-03
1007,Vikram,USB Cable,3,900.00,2026-04-04
1008,Neha,Webcam,1,5000.00,2026-04-04
1009,Ravi,Mouse,1,750.00,2026-04-05
1010,Amit,Monitor,1,18000.00,2026-04-05
Bucket name: hitavirtech-analytics-yourname (globally unique - add your name)
Region: pick one (for example, ap-south-1 = Mumbai)
Create folder -> raw
raw/ -> Create folder -> sales
raw/sales/ -> Upload -> select sales.csv -> Upload
Your object now lives at:
s3://hitavirtech-analytics-yourname/raw/sales/sales.csv
Database name: hitavirtech_sales_db -> Create
Crawler name: hitavirtech-sales-crawler -> Next
Data source: s3://hitavirtech-analytics-yourname/raw/sales/ -> Add
IAM role: AWSGlueServiceRole-hitavirtech -> Create
Target database: hitavirtech_sales_db -> Schedule: On demand -> Next -> Create
Select hitavirtech-sales-crawler -> Run crawler
When the crawler finishes, the sales table appears with columns order_id, customer, product, quantity, amount, order_date
Glue just auto-discovered the schema.
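To get a feel for what the crawler just did, here is a minimal Python sketch of schema inference over the first rows of sales.csv. It is a toy illustration, not Glue's actual classifier: the names `infer_type` and `infer_schema`, the type labels, and the precedence order (bigint, then double, then date, else string) are all assumptions made for this example. The types your crawler picks may differ, so check the table in the Glue console.

```python
import csv
import io
from datetime import datetime

# First rows of sales.csv from this lab
SAMPLE = """order_id,customer,product,quantity,amount,order_date
1001,Ravi,Laptop,1,75000.00,2026-04-01
1002,Priya,Mouse,2,1500.00,2026-04-01
1007,Vikram,USB Cable,3,900.00,2026-04-04
"""


def parses_as(convert, value):
    """True if convert(value) succeeds without raising ValueError."""
    try:
        convert(value)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Pick the narrowest label that fits every value (assumed precedence)."""
    if all(parses_as(int, v) for v in values):
        return "bigint"
    if all(parses_as(float, v) for v in values):
        return "double"
    if all(parses_as(lambda s: datetime.strptime(s, "%Y-%m-%d"), v) for v in values):
        return "date"
    return "string"


def infer_schema(csv_text):
    """Map each header column to an inferred type label."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}


schema = infer_schema(SAMPLE)
for column, type_label in schema.items():
    print(f"{column:<12}{type_label}")
```

The point of the sketch is the idea that structure is derived from the data itself: nobody declared that amount is numeric, the values did.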
Query result location: s3://hitavirtech-analytics-yourname/athena-results/
Data source = AwsDataCatalog, Database = hitavirtech_sales_db
SELECT * FROM sales LIMIT 5;
You should see 5 rows.
-- Top customers by revenue
SELECT customer, SUM(amount) AS total_spent
FROM sales
GROUP BY customer
ORDER BY total_spent DESC;
-- Best-selling products
SELECT product, SUM(quantity) AS units_sold
FROM sales
GROUP BY product
ORDER BY units_sold DESC;
-- Daily revenue
SELECT order_date, SUM(amount) AS daily_revenue
FROM sales
GROUP BY order_date
ORDER BY order_date;
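If you want a known answer to compare your Athena results against, the three queries can be rehearsed locally. The sketch below loads the same ten rows into an in-memory SQLite database using only Python's standard library. SQLite is a stand-in here, not Athena: these particular queries happen to run unchanged, but types and functions differ between the two engines, and the column types in the CREATE TABLE statement are assumptions.

```python
import csv
import io
import sqlite3

SALES_CSV = """order_id,customer,product,quantity,amount,order_date
1001,Ravi,Laptop,1,75000.00,2026-04-01
1002,Priya,Mouse,2,1500.00,2026-04-01
1003,Amit,Keyboard,1,3500.00,2026-04-02
1004,Ravi,Monitor,1,18000.00,2026-04-02
1005,Sneha,Headphones,1,4500.00,2026-04-03
1006,Priya,Laptop,1,75000.00,2026-04-03
1007,Vikram,USB Cable,3,900.00,2026-04-04
1008,Neha,Webcam,1,5000.00,2026-04-04
1009,Ravi,Mouse,1,750.00,2026-04-05
1010,Amit,Monitor,1,18000.00,2026-04-05
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER, customer TEXT, product TEXT, "
    "quantity INTEGER, amount REAL, order_date TEXT)"
)
data_rows = list(csv.reader(io.StringIO(SALES_CSV)))[1:]  # skip the header row
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", data_rows)

top_customers = conn.execute(
    "SELECT customer, SUM(amount) AS total_spent FROM sales "
    "GROUP BY customer ORDER BY total_spent DESC"
).fetchall()

units_sold = conn.execute(
    "SELECT product, SUM(quantity) AS units_sold FROM sales "
    "GROUP BY product ORDER BY units_sold DESC"
).fetchall()

daily_revenue = conn.execute(
    "SELECT order_date, SUM(amount) AS daily_revenue FROM sales "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()

print("Top customer:", top_customers[0])
print("Units sold:", units_sold)
print("Daily revenue:", daily_revenue)
```

Several products tie on units sold, so the order among tied rows is not guaranteed in either engine; add a second sort key if you need it to be stable.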
At the bottom of every result: "Data scanned: 412 B" or similar.
Athena bills by the amount of data each query scans, so that number is your bill. At scale, every analytics engineer watches it. Partitioning plus a columnar format such as Parquet can shrink it by 100-1000x.
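To make the "data scanned" figure concrete, here is a rough cost estimator in Python. The price of $5.00 per TB scanned and the 10 MB per-query minimum are assumptions based on commonly published Athena list pricing; rates vary by region and change over time, so check the current pricing page before relying on the numbers. The function name `estimate_query_cost` is hypothetical.

```python
PRICE_PER_TB_USD = 5.00           # assumption: list price, varies by region
MIN_BILLED_BYTES = 10 * 1024**2   # assumption: 10 MB minimum billed per query
BYTES_PER_TB = 1024**4


def estimate_query_cost(bytes_scanned):
    """Estimate the cost in USD of one query from its 'Data scanned' figure."""
    billed_bytes = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


examples = [
    ("this lab (412 B)", 412),
    ("1 GB", 1024**3),
    ("1 TB", 1024**4),
]
for label, scanned in examples:
    print(f"{label:<18} ${estimate_query_cost(scanned):.6f}")
```

Under these assumptions the lab queries cost a tiny fraction of a cent each, while a query that scans a full terabyte costs dollars. Cutting the bytes scanned by 100x cuts the bill by the same factor once you are above the minimum.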
Warning: Forgetting cleanup = surprise AWS bill.
Delete the crawler: hitavirtech-sales-crawler
Delete the database: hitavirtech_sales_db
HitaVir Tech says: "The last 5 minutes of cleanup are the most valuable 5 minutes of the entire lab. Engineers who skip it end up with $300 surprise bills."
+==============================================================+
| CONGRATULATIONS - PART 1 DONE! |
+==============================================================+
Analytics Concepts
Topic |
Analytics and the four maturity levels |
Machine Learning - three flavors |
The 5 Vs of Big Data
V | Theme |
VOLUME | Scale |
VARIETY | Formats |
VELOCITY | Speed |
VERACITY | Trust |
VALUE | Outcome |

AWS Services Mapped to Each V
V | Key Services |
Volume | S3 • Redshift • EMR • Glacier • Lake Formation |
Variety | Glue • Athena • Rekognition • Textract • Transcribe • Comprehend |
Velocity | Kinesis • Firehose • MSK • Kinesis Analytics • Lambda |
Veracity | DataBrew • Glue DQ • Lake Formation • Macie • CloudTrail |
Value | QuickSight • SageMaker • Forecast • Personalize • Redshift ML |
Hands-on Skills
Part 2 - Advanced Analytics on AWS will cover:
+==============================================================+
| |
| The 5 Vs = universal data challenge framework |
| AWS = complete toolbox for each V |
| |
| Learn both -> you can pick up any cloud's analytics |
| stack in a week. |
| |
+==============================================================+
HitaVir Tech says: "Analytics is not about tools. Tools change every two years. Analytics is about asking the right question, finding the right data, and presenting an insight people can act on. Master the fundamentals (Volume, Variety, Velocity, Veracity, Value) and every new tool becomes just another syntax."
Welcome to cloud analytics. See you in Part 2.
- HitaVir Tech
+==============================================================+
| DIAGNOSE YOUR OWN PROJECT WITH THE 5 Vs |
+==============================================================+
Think of a data project you work on (or want to build). Run it through these five questions. The V that feels most stressful is your bottleneck, and that's where to focus first.
# | Question | Your V | AWS Services to Study |
1 | "Do we have somewhere cheap and durable to store everything?" | Volume | S3 • Glacier • Redshift |
2 | "Do we have to handle more than one data format?" | Variety | Glue • Athena • Rekognition • Textract |
3 | "Is the current data freshness good enough for stakeholders?" | Velocity | Kinesis • Firehose • Lambda |
4 | "Do stakeholders trust the numbers we publish?" | Veracity | DataBrew • Glue DQ • Macie |
5 | "Is anyone actually acting on our outputs?" | Value | QuickSight • SageMaker • Personalize |
Score each V from 1 (healthy) to 5 (painful). The highest score is where a senior engineer should lead the next sprint.
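The scoring exercise can be captured in a few lines of Python. This is only a sketch: the function name `find_bottlenecks` and the example scores are hypothetical, and ties are reported together rather than broken, since the text does not say how to rank equal scores.

```python
def find_bottlenecks(scores):
    """Return the V(s) with the highest pain score (1 = healthy, 5 = painful)."""
    for v_name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{v_name} score must be between 1 and 5, got {score}")
    worst = max(scores.values())
    return [v_name for v_name, score in scores.items() if score == worst]


# Hypothetical self-assessment for an example project
example_scores = {
    "Volume": 2,
    "Variety": 3,
    "Velocity": 2,
    "Veracity": 5,
    "Value": 4,
}

bottlenecks = find_bottlenecks(example_scores)
print("Focus the next sprint on:", ", ".join(bottlenecks))
```

For the example scores this points at Veracity, so the services to study first would be the ones listed in row 4 of the table.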
Pro Tip: "Stacking solutions for Velocity when Veracity is the real problem is the most common and expensive AWS mistake. Diagnose honestly before you build."
+==============================================================+
| AWS ANALYTICS - PART 1 CHEAT SHEET |
| (screenshot and keep) |
+==============================================================+

Term | Definition |
Analytics | Turning data into decisions |
Machine Learning | Algorithms that learn patterns instead of being programmed |
Descriptive -> Diagnostic -> Predictive -> Prescriptive | The 4 levels of analytics maturity |
# | V | Question | One-Liner |
1 | Volume | How much? | Design for 100x your current data |
2 | Variety | How many formats? | Structure is created, not found |
3 | Velocity | How fast? | Match speed to the decision's deadline |
4 | Veracity | Can we trust it? | Quality rules are pipeline-level, not tribal |
5 | Value | Worth it? | Start from the decision, work backwards |
V | Store | Process | Catalog / Quality | Deliver |
Volume | S3 • Glacier | Redshift • EMR | Lake Formation | - |
Variety | S3 • DynamoDB | Rekognition • Textract • Transcribe • Comprehend | Glue • Catalog | Athena |
Velocity | Kinesis • MSK | Lambda • Kinesis Analytics | Firehose (-> S3) | OpenSearch |
Veracity | - | DataBrew | Glue DQ • Macie • CloudTrail • Config | Lake Formation |
Value | - | SageMaker • Forecast • Personalize • Redshift ML | - | QuickSight • Q • Bedrock |
INGEST --> STORE --> CATALOG --> PROCESS --> QUERY --> VISUALIZE --> DECIDE
|          |         |           |           |         |             |
Kinesis    S3        Glue        Glue ETL    Athena    QuickSight    Human
Firehose   Redshift  Catalog     EMR         Redshift  SageMaker     action
DMS        Glacier   Lake Form.  DataBrew    Spectrum  Forecast
AWS service icons used in this codelab are from the official AWS Architecture Icons deck, freely distributed by Amazon Web Services for use in architecture diagrams and educational materials.