AWS

  +============================================================+
  |                                                            |
  |      AWS ANALYTICS FUNDAMENTALS - PART 1                   |
  |                                                            |
  |      Concepts  -  The 5 Vs of Big Data  -  AWS Mapping     |
  |                                                            |
  |                 Powered by HitaVir Tech                    |
  +============================================================+

Welcome to Fundamentals of Analytics on AWS - Part 1 by HitaVir Tech!

This codelab builds your mental model for analytics in the cloud, one concept at a time, one AWS service at a time. No prior AWS experience required.

What You Will Master

  Pillar             Topics
  -----------------  -------------------------------------------
  🧠 Concepts        Analytics, Machine Learning, core framework
  📐 The 5 Vs        Volume, Variety, Velocity, Veracity, Value
  ☁️ AWS Services    One toolkit per V: the complete mapping
  🛠️ Hands-on Lab    S3 → Glue → Athena end-to-end

Why the 5 Vs Framework Matters

Every data challenge you will face maps to one of five questions:

  Question                                      V
  --------------------------------------------  --------
  📦 "How do we store 500 TB?"                  Volume
  🧩 "We have CSVs, JSON, images - help!"       Variety
  🌊 "Dashboards must refresh every second"     Velocity
  🛡️ "Half our timestamps are malformed"        Veracity
  💎 "Who actually uses this dashboard?"        Value

The 5 Vs give you a framework to diagnose. AWS gives you a toolbox to solve each V.

Estimated Duration

3-4 hours (concepts + hands-on lab)

How to Use This Codelab

  If you are...               Do this
  --------------------------  ------------------------------------------------------------------------------
  🎓 A student new to cloud   Read top-to-bottom, do the hands-on lab
  🛠️ A working engineer       Skim Part 1-2, deep-read the 5 Vs, focus on AWS services for your bottleneck V
  🧑‍🏫 A trainer or mentor      Use section headers as talking points; the spotlight cards are slide-ready
  🔖 A reference reader       Jump to the Cheat Sheet at the end

💡 HitaVir Tech says: "The 5 Vs aren't academic jargon; they're the mental checklist every senior engineer runs when someone says 'we have a data problem.' Learn to speak this language and every cloud will feel familiar."

What You Need

Required

Helpful

No Local Installs Required

Everything in this codelab runs in the AWS web console in your browser. Zero software installation on your machine.

⚠️ Cost Awareness

Every step stays inside the AWS Free Tier or costs a fraction of a cent:

  Free Tier Budget                     Usage in This Codelab
  -----------------------------------  ---------------------
  🪣 S3 storage: 5 GB                  < 1 MB
  🔍 Athena: pay per query             < 1¢ total
  📚 Glue Data Catalog: 1M objects     1 table
  💰 Estimated total cost              Under $0.01

⚠️ Always clean up. Step 10 of the lab is a cleanup ritual. Skip it and AWS will happily bill you for forgotten resources.

Services the Hands-on Lab Will Use

S3 Glue Glue Crawler Glue Catalog Athena

  +==============================================================+
  |         SECTION  1   -   ANALYTICS  CONCEPTS                 |
  +==============================================================+

Before we touch AWS, we need three anchor ideas:

                    +----------------------+
                    |  1.  ANALYTICS       |
                    |  turn data into      |
                    |  decisions           |
                    +----------+-----------+
                               |
                               | powered by
                               v
                    +----------------------+
                    |  2.  MACHINE LEARNING|
                    |  algorithms that     |
                    |  learn patterns      |
                    +----------+-----------+
                               |
                               | challenged by
                               v
                    +----------------------+
                    |  3.  THE 5 Vs        |
                    |  of Big Data         |
                    +----------------------+

Each Anchor Maps to an AWS Service Family

Analytics → Athena. Machine Learning → SageMaker. The 5 Vs → S3.

What is Analytics?

📊 Analytics is the practice of turning raw data into useful insights that help people make better decisions.

That one line is the whole discipline. SQL queries, dashboards, ML models, data lakes: all of it is just tooling in service of that idea.

Real-World Example - HitaVir Coffee

Imagine HitaVir Coffee: 50 locations across India. Every day, each store generates data:

  Data Stream   Icon    Data Stream   Icon
  -----------   ----    -----------   ----
  Orders        ☕      Payments      💰
  Loyalty       👥      Inventory     📦
  Shifts        🕐      Equipment     🌡️
  Deliveries    🚚      Reviews       ⭐

One transaction alone is meaningless. But aggregate across 50 stores for a year and patterns emerge:

  Observation                       Action
  --------------------------------  -------------------------
  🐢 Mondays are slowest            🎯 Launch "Monday BOGO"
  📈 Store #23 sells 2x pastries    🔁 Copy their layout
  📉 Cappuccinos drop in summer     🧊 Push cold brew

That journey, from raw events to actions, is analytics.
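To make the aggregation step concrete, here is a minimal Python sketch. The order records, store IDs, and amounts are hypothetical sample data (not real HitaVir Coffee figures); the sketch only shows how individual transactions roll up into an observation such as "Mondays are slowest".

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical sample orders: (order date, store id, amount in INR).
ORDERS = [
    (date(2026, 4, 20), 23, 450.0),  # a Monday
    (date(2026, 4, 21), 23, 900.0),  # a Tuesday
    (date(2026, 4, 21), 7, 700.0),   # a Tuesday
    (date(2026, 4, 22), 7, 650.0),   # a Wednesday
]

def revenue_by_weekday(orders):
    """Sum order amounts per weekday name."""
    totals = defaultdict(float)
    for order_date, _store, amount in orders:
        totals[WEEKDAYS[order_date.weekday()]] += amount
    return dict(totals)

def slowest_day(orders):
    """Return the weekday with the lowest total revenue."""
    totals = revenue_by_weekday(orders)
    return min(totals, key=totals.get)

print(slowest_day(ORDERS))
```

One row says nothing; the grouped totals are what turn into a decision.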

The Four Levels of Analytics

  +================================================================+
  |                                                                |
  |     L4   PRESCRIPTIVE       "What should we do?"               |
  |                                                                |
  |                  ^                                             |
  |                  |                                             |
  |     L3   PREDICTIVE         "What will happen?"                |
  |                                                                |
  |                  ^                                             |
  |                  |                                             |
  |     L2   DIAGNOSTIC         "Why did it happen?"               |
  |                                                                |
  |                  ^                                             |
  |                  |                                             |
  |     L1   DESCRIPTIVE        "What happened?"                   |
  |                                                                |
  +================================================================+

  Level            Icon  Question             Example                      Powered By
  ---------------  ----  -------------------  ---------------------------  ---------------------
  L1 Descriptive   📸    What happened?       "Sold 12,400 cappuccinos"    🔍 SQL / BI
  L2 Diagnostic    🕵️    Why did it happen?   "Bean shortage hit week 3"   🔍 SQL + drill-down
  L3 Predictive    🔮    What will happen?    "June sales up 15%"          🤖 Machine learning
  L4 Prescriptive  🎯    What should we do?   "Order 200 kg by May 25"     🤖 ML + optimization

Most companies live at L1-L2. Analytics engineers build the foundation that makes L3-L4 possible.

What Analytics Is NOT

💡 HitaVir Tech says: "Never build a dashboard nobody looks at. Always ask: what decision will this insight change? If the answer is 'none', don't build it."

Preview - AWS Services Across Analytics Maturity

Athena QuickSight SageMaker Bedrock

L1-L2 is Athena + QuickSight. L3-L4 adds SageMaker + Bedrock.

What is Machine Learning?

🤖 Machine Learning (ML) is a branch of AI where algorithms learn patterns from data instead of being explicitly programmed.

Traditional Programming vs Machine Learning

  +-----------------------------+      +-----------------------------+
  |    TRADITIONAL PROGRAMMING  |      |    MACHINE LEARNING         |
  |  -------------------------  |      |  -------------------------  |
  |                             |      |                             |
  |   Rules + Data              |      |   Data + Answers            |
  |         |                   |      |         |                   |
  |         v                   |      |         v                   |
  |     Program                 |      |    Learned Model            |
  |         |                   |      |         |                   |
  |         v                   |      |         v                   |
  |     Answer                  |      |    Rules (weights)          |
  |                             |      |                             |
  |  Human writes the rules.    |      |  Machine learns the rules.  |
  +-----------------------------+      +-----------------------------+
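The contrast in the diagram can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the labeled transactions are invented, and `learn_threshold` is a hypothetical stand-in for real model training, not how SageMaker or any library works internally.

```python
def rule_based_is_fraud(amount):
    """Traditional programming: a human chooses the rule (the threshold)."""
    return amount > 10_000

def learn_threshold(examples):
    """ML in miniature: choose the threshold that classifies the most
    labeled (amount, is_fraud) examples correctly."""
    candidates = sorted({amount for amount, _ in examples})
    best_threshold, best_correct = None, -1
    for threshold in candidates:
        correct = sum((amount >= threshold) == label
                      for amount, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

# Hypothetical labeled data: (transaction amount, was it fraud?)
LABELED = [(120, False), (900, False), (4_000, False),
           (7_500, True), (15_000, True)]

model_threshold = learn_threshold(LABELED)
print(model_threshold)
```

The hand-written rule misses the 7,500 fraud case; the learned threshold comes from the data, so it moves when the data changes.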

The Three Flavors of ML

  Flavor         Icon  Data Needed       Example                                                    AWS Service
  -------------  ----  ----------------  ---------------------------------------------------------  ------------------------------
  Supervised     🎯    Labeled examples  Spam filter, fraud detection, image classification         🤖 SageMaker
  Unsupervised   🔎    No labels         Customer segmentation, anomaly detection, topic modeling   🤖 SageMaker • 💭 Comprehend
  Reinforcement  🎮    Reward signals    Game AI, robotics, recommenders                            🤖 SageMaker RL • 🏎️ DeepRacer

How ML Powers Analytics

  Level  Name            Stops at SQL    Needs ML
  -----  ------------    ------------    --------------------
  L1     Descriptive     OK              -
  L2     Diagnostic      OK              -
  L3     Predictive      -               Machine learning
  L4     Prescriptive    -               ML + optimization

💡 HitaVir Tech says: "ML is not magic; it's statistics at scale. If your analytics foundations are shaky, your ML models will be too. Clean data first, cool models second."

Preview - AWS ML Services

SageMaker Bedrock Comprehend Personalize Forecast Fraud Detector

Coming up in "AWS Services for Value" (L3-L4 analytics).

  +==============================================================+
  |         SECTION  2   -   THE  5  Vs  OF  BIG  DATA           |
  +==============================================================+

Where the 5 Vs Come From

In 2001, analyst Doug Laney described big-data challenges with three Vs: Volume, Variety, Velocity. Later the industry added Veracity (trust) and Value (outcome). Together they form the universal diagnostic checklist.

The 5 Vs Star

                          *
                     VOLUME
                  How much is it?
                       / \
                      /   \
                     /     \
                    /       \
           VARIETY             VELOCITY
       How many formats?     How fast?
                 \             /
                  \           /
                   \         /
                    \       /
             VERACITY        VALUE
         Can we trust it?  Worth it?

The 5 Vs at a Glance

  #  V             Question
  -  ------------  -----------------
  1  📦 VOLUME     How much? (scale)
  2  🧩 VARIETY    How many formats?
  3  🌊 VELOCITY   How fast? (speed)
  4  🛡️ VERACITY   Can we trust it?
  5  💎 VALUE      What outcome?

Miss any one V and your data platform has a hole. Let's tour each.

Preview - One AWS Service Per V

S3 Glue Kinesis Lake Formation QuickSight

Volume → S3. Variety → Glue. Velocity → Kinesis. Veracity → Lake Formation. Value → QuickSight.

  +==============================================================+
  |              THE  1st  V   -   VOLUME                        |
  |              "How much data do we have?"                     |
  +==============================================================+

What is Volume?

📦 Volume is the size of your data: how many bytes, rows, events, or files you must store, move, and process.

The Scale Ladder

  Unit      Bytes   What It Holds
  --------  ------  ----------------------------
  Byte (B)  1       A letter
  KB        10^3    One email
  MB        10^6    One song
  GB        10^9    One DVD
  TB        10^12   One year of company sales
  PB        10^15   One day of YouTube uploads
  EB        10^18   All of Netflix streaming
  ZB        10^21   The entire internet per year
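As a quick way to get a feel for the ladder, the sketch below converts a raw byte count into the decimal units listed above. `human_size` is a hypothetical helper name, and it assumes powers of 1000 (as in the table), not the binary 1024-based units some tools report.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_size(num_bytes):
    """Format a byte count using decimal (powers of 1000) units."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the number below 1000,
        # or at ZB, the largest unit in the ladder.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

print(human_size(500 * 10**12))
```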

Why Volume is Hard

A traditional database runs fine up to ~1-10 TB. Past that, things break:

At big-data scale, you need distributed systems: hundreds of machines sharing the load.

Real-World Volume

  Company        Daily Volume
  -------------  ------------------------
  📱 Instagram   100M+ photos uploaded
  🛒 Amazon      Billions of events
  🚗 Uber        Tens of TB of trip data
  🎬 Netflix     PB of logs and streams
  🔎 Google      Unimaginable

Questions Volume Forces You to Ask

💡 HitaVir Tech says: "What works at 10 GB catastrophically fails at 10 TB. Always ask: how does this scale at 100x?"

📦 Volume in one line: design for 100× your current data, or rebuild painfully later.

Preview - AWS Services That Solve Volume

S3 Glacier Redshift EMR Lake Formation

Coming up in "AWS Services for Volume".

  +==============================================================+
  |              THE  2nd  V   -   VARIETY                       |
  |              "How many kinds of data?"                       |
  +==============================================================+

What is Variety?

🧩 Variety is the diversity of data: formats, schemas, and sources.

Twenty years ago, "data" meant rows in a database. Today, it means far more:

  Type             Icon  Examples                                Schema
  ---------------  ----  --------------------------------------  --------
  Structured       📊    SQL tables, CSV, spreadsheets           Fixed
  Semi-structured  🧩    JSON from APIs, XML, Parquet, Avro      Flexible
  Unstructured     🎞️    Images, video, audio, PDFs, free text   None

Why Variety is Hard

Each format needs different tooling:

  Format       Tool Needed
  -----------  -----------------
  SQL tables   Relational engine
  JSON         Document parser
  PDF          OCR
  Image        Computer vision
  Audio        Speech-to-text
  Free text    NLP / embeddings

Most real projects combine these. Example: "Correlate support emails + call recordings + order history into one insight." Three completely different pipelines feeding one answer.
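The "different tooling per format, one answer at the end" idea can be sketched for the two simplest cases. This is an illustrative sketch only: the file names, the `.jsonl` convention, and the `load_records` helper are assumptions, and real pipelines would add type conversion and error handling.

```python
import csv
import io
import json

def records_from_csv(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))

def records_from_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# One parser per format; every parser returns the same record shape.
PARSERS = {".csv": records_from_csv, ".jsonl": records_from_json_lines}

def load_records(filename, text):
    """Pick a parser by file extension and return a list of dicts."""
    for extension, parser in PARSERS.items():
        if filename.endswith(extension):
            return parser(text)
    raise ValueError(f"no parser registered for {filename}")

orders = load_records("orders.csv", "order_id,amount\n1,250\n")
events = load_records("events.jsonl", '{"order_id": 2, "amount": 90}\n')
print(orders, events)
```

Note that the CSV values come back as strings while the JSON values keep their types; reconciling that difference is exactly the "structure is created" work that Variety demands.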

Real-World Variety

  Industry             Variety Mix
  -------------------  ------------------------------------------------
  🏥 Healthcare        Patient records + X-rays + doctor's notes
  🛒 Retail            Orders + product photos + reviews
  🏦 Banking           Transactions + scanned checks + call transcripts
  🚗 Autonomous cars   Sensors + video + maps + LiDAR

Questions Variety Forces You to Ask

💡 HitaVir Tech says: "An estimated 80-90% of the world's data is unstructured, yet most analytics happens on structured data. Your job is often to convert chaos into order."

🧩 Variety in one line: structure is created, not found. Choose tools that embrace format diversity.

Preview - AWS Services That Solve Variety

Glue Athena DynamoDB Rekognition Textract Comprehend

Coming up in "AWS Services for Variety".

  +==============================================================+
  |              THE  3rd  V   -   VELOCITY                      |
  |              "How fast is the data?"                         |
  +==============================================================+

What is Velocity?

🌊 Velocity is the speed at which data arrives, moves, and must be processed to deliver value.

The Velocity Spectrum

  Freshness             Icon  Approach         Example Use Case
  --------------------  ----  ---------------  --------------------
  Next day              🐌    Batch (nightly)  Finance reports
  Every hour            🐇    Mini-batch       Ops dashboards
  Seconds               🚀    Streaming        Live pricing
  Milliseconds or less  ⚡    Real-time        Fraud detection, HFT
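The spectrum can be read as a lookup from "how stale may the data be?" to the cheapest approach that still meets it. The sketch below encodes that reading; the exact cut-offs (one day, one hour, one second) are assumptions chosen to mirror the table rows, not AWS-defined limits.

```python
def choose_approach(max_staleness_seconds):
    """Return the slowest (usually cheapest) approach that still meets
    the freshness the decision needs."""
    if max_staleness_seconds >= 24 * 3600:
        return "batch (nightly)"
    if max_staleness_seconds >= 3600:
        return "mini-batch"
    if max_staleness_seconds >= 1:
        return "streaming"
    return "real-time"

print(choose_approach(24 * 3600))  # a finance report read the next day
print(choose_approach(0.05))       # a fraud check during card authorization
```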

Why Velocity is Hard

  Problem                 Solution
  ----------------------  ------------------------------
  Disks too slow          In-memory / caches
  Batch SQL too slow      Stream-processing engines
  One machine too small   Horizontal auto-scaling
  Failures = data loss    Durable logs (Kafka / Kinesis)

Real-World Velocity

  Scenario               Required Latency
  ---------------------  ----------------
  💳 Credit card fraud   Under 100 ms
  📈 Stock trading       Microseconds
  📱 Social feed         Seconds
  🚚 Delivery tracking   Minutes
  📊 Exec dashboard      Hourly
  🧾 Month close         Daily batch

Questions Velocity Forces You to Ask

💡 HitaVir Tech says: "Streaming is fashionable. Batch is profitable. 80% of real-world analytics runs on batch, so don't reach for streaming unless the business truly cannot wait."

🌊 Velocity in one line: match the pipeline's speed to the decision's deadline, and go no faster.

Preview - AWS Services That Solve Velocity

Kinesis Data Streams Kinesis Firehose Kinesis Analytics Lambda EventBridge

Coming up in "AWS Services for Velocity".

  +==============================================================+
  |              THE  4th  V   -   VERACITY                      |
  |              "Can we trust the data?"                        |
  +==============================================================+

What is Veracity?

🛡️ Veracity is the accuracy, consistency, and trustworthiness of your data.

Big volumes and fast pipelines are useless if the data is wrong.

The Veracity Enemies

  Enemy          Icon  Symptom
  -------------  ----  -----------------------
  Missing        🦠    NULL in required fields
  Duplicates     🗑️    Same row repeated
  Inconsistent   🎭    2024-01-05 vs 01/05/24
  Outliers       📉    Age = 347
  Units          🪙    USD mixed with INR
  Bias           🪞    Only US users sampled
  Stale          🪤    Last updated 2019
  Broken joins   📎    Order with no user
  Noise          🎲    Flaky sensor readings

GIGO - Garbage In, Garbage Out

🛡️ A beautiful dashboard built on bad data is worse than no dashboard, because it creates false confidence. The most dangerous insight is a wrong insight someone believes.

Six Dimensions of Data Quality

  Dimension      Icon  Question
  -------------  ----  --------------------------
  Completeness   🧩    Required fields populated?
  Accuracy       🎯    Data reflects reality?
  Consistency    🧭    Related systems agree?
  Timeliness     ⏰    Is it current enough?
  Validity       ✅    Matches formats / rules?
  Uniqueness     🔢    Any unintended duplicates?
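Three of these dimensions (completeness, validity, uniqueness) are easy to check mechanically. The sketch below is a minimal illustration: the field names `order_id` and `order_date`, the ISO-date rule, and the sample rows are all assumptions made for the example, and a production pipeline would use a dedicated data-quality tool.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # validity rule: ISO dates only

def quality_report(rows, required=("order_id", "order_date")):
    """Count rows that fail completeness, validity, or uniqueness."""
    seen_ids = set()
    report = {"incomplete": 0, "invalid_date": 0, "duplicate": 0}
    for row in rows:
        # Completeness: every required field must be present and non-empty.
        if any(row.get(field) in (None, "") for field in required):
            report["incomplete"] += 1
        # Validity: a date that is present must match the expected format.
        date_value = row.get("order_date")
        if date_value and not DATE_RE.match(date_value):
            report["invalid_date"] += 1
        # Uniqueness: the same order_id must not appear twice.
        order_id = row.get("order_id")
        if order_id is not None:
            if order_id in seen_ids:
                report["duplicate"] += 1
            seen_ids.add(order_id)
    return report

ROWS = [
    {"order_id": 1, "order_date": "2024-01-05"},
    {"order_id": 2, "order_date": "01/05/24"},    # inconsistent format
    {"order_id": 2, "order_date": "2024-01-06"},  # duplicate id
    {"order_id": 3, "order_date": None},          # missing required field
]

print(quality_report(ROWS))
```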

Real-World Veracity Failures

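  Incident                              Consequence
  ------------------------------------  ---------------------------------------------------------------------------------
  🛰️ NASA Mars Climate Orbiter (1999)   $125M spacecraft lost to a metric vs imperial unit mismatch
  🏦 Knight Capital (2012)              $440M lost in 45 minutes after a faulty deployment reactivated obsolete trading code
  🤧 Google Flu Trends                  Overestimated flu peaks by 100%+ due to search bias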

Questions Veracity Forces You to Ask

💡 HitaVir Tech says: "Senior engineers obsess over data quality. Juniors obsess over cool tools. Guess which group builds systems that actually work in production."

🛡️ Veracity in one line: quality rules are a pipeline concern, not a hope.

Preview - AWS Services That Solve Veracity

Glue DataBrew Glue Data Quality Lake Formation Macie CloudTrail KMS

Coming up in "AWS Services for Veracity".

  +==============================================================+
  |              THE  5th  V   -   VALUE                         |
  |          "What business outcome does it drive?"              |
  +==============================================================+

What is Value?

💎 Value is the business outcome your data and analytics actually deliver: revenue gained, cost saved, risk reduced, customer experience improved.

Without Value, the other four Vs are expensive hobbies.

The Value Pyramid

                       VALUE
                     (outcome)
                         ^
                         |  enabled by
                         |
                Insights & decisions
                         ^
                         |  enabled by
                         |
                   Analytics + ML
                         ^
                         |  enabled by
                         |
               Trustworthy (Veracity) data
                         ^
                         |  at the right
                         |  speed (Velocity)
                         |
                  across Varieties
                         ^
                         |  stored at
                         |
                   the right scale (Volume)

Examples of Real Value

  Industry        Analytics Value
  --------------  -----------------------------------------
  🛒 Retail       Recommendation engine → +20% revenue
  🏦 Banking      Fraud detection → millions saved
  🚚 Logistics    Route optimization → -15% fuel cost
  🏥 Healthcare   Early diagnosis models → better outcomes
  🎬 Streaming    Personalized content → higher retention

The Dashboard Graveyard

Most companies have folders full of unused dashboards: the dashboard graveyard. Every one cost engineering time, storage, and compute.

The difference between a valuable dashboard and a graveyard dashboard:

  +------------------------------------------------------+
  |                                                      |
  |   "What decision will change because of this?"       |
  |                                                      |
  |   If nobody can answer   ->  DON'T BUILD IT.         |
  |                                                      |
  +------------------------------------------------------+

How to Measure Value

Questions Value Forces You to Ask

💡 HitaVir Tech says: "A data platform that costs more than the decisions it enables is a failure, no matter how beautiful the architecture. Lead with Value."

💎 Value in one line: start from the decision, work backwards to the pipeline.

Preview - AWS Services That Deliver Value

QuickSight SageMaker Bedrock Forecast Personalize

Coming up in "AWS Services for Value".

  +==============================================================+
  |       SECTION  3  -  AWS  SERVICES  BY  THE  5  Vs           |
  +==============================================================+

The Headline Cast

S3 Redshift EMR Glue Athena Kinesis Firehose Lambda QuickSight SageMaker

Now we map each V to the AWS services that solve it.

The Golden Rule - Every Stack Follows One Shape

  +-------+   +-------+   +-------+   +-------+   +-------+   +-------+   +-------+
  |       |   |       |   |       |   |       |   |       |   |       |   |       |
  | INGST | ->| STORE | ->| CATLG | ->| PROCS | ->| QUERY | ->| VIEW  | ->|  ACT  |
  |       |   |       |   |       |   |       |   |       |   |       |   |       |
  +-------+   +-------+   +-------+   +-------+   +-------+   +-------+   +-------+

The 5 Vs tell you where the bottleneck is. The AWS services tell you what solves it.

  +==============================================================+
  |              AWS  FOR  VOLUME                                |
  |         "Store any amount of data, affordably."              |
  +==============================================================+

The Volume Lineup

S3 • Glacier • Redshift • EMR • Lake Formation

The Volume Toolkit

  Service              Purpose
  -------------------  ---------------------------------------------------------------------
  Amazon S3            Object storage: the data-lake foundation (virtually unlimited scale)
  Amazon S3 Glacier    Cheapest archive tier for cold data
  Amazon Redshift      Petabyte-scale columnar data warehouse
  Amazon EMR           Managed Spark / Hadoop clusters for huge batch jobs
  AWS Lake Formation   Manage, govern, and secure a data lake on S3
  Amazon EBS / EFS     Block and file storage for compute workloads

Service Spotlight - Amazon S3

S3

  +--------------------------------------------------------------+
  |  AMAZON  S3  -  Simple Storage Service                       |
  +--------------------------------------------------------------+
  |  Category     :  Object storage / data lake                  |
  |  Durability   :  99.999999999%  (11 nines)                   |
  |  Scale        :  Unlimited (practically)                     |
  |  Pricing      :  ~$0.023 / GB / month (Standard)             |
  |  Read by      :  Athena, Redshift, EMR, SageMaker, QuickSight|
  |                                                              |
  |  If you remember only one AWS service - make it S3.          |
  +--------------------------------------------------------------+
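The per-GB price in the card turns into a monthly bill with simple arithmetic. The sketch below is a rough estimate only: it assumes the flat ~$0.023 per GB-month rate quoted above and 1 TB = 1000 GB, and it ignores request charges, data transfer, and volume discounts. Actual prices vary by region and change over time.

```python
PRICE_PER_GB_MONTH = 0.023  # S3 Standard figure quoted in the card (assumed flat)

def monthly_storage_cost(terabytes, price_per_gb=PRICE_PER_GB_MONTH):
    """Estimate monthly storage cost in USD, using 1 TB = 1000 GB."""
    return round(terabytes * 1000 * price_per_gb, 2)

# "How do we store 500 TB?" from the opening question table
print(monthly_storage_cost(500))
```

At this rate 500 TB comes to roughly $11,500 per month in S3 Standard, which is why the colder storage classes matter for data that is rarely read.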

S3 Storage Classes - The Cost Pyramid

  Class                           Icon  Access Pattern         Relative Cost
  ------------------------------  ----  ---------------------  -------------
  S3 Standard                     🔥    Hot, frequent access   $$$$
  S3 Intelligent-Tiering          🌡️    Auto hot/cold moves    $$$
  S3 Standard-IA                  ❄️    Monthly access         $$
  S3 Glacier Instant Retrieval    🧊    Rare access, instant   $
  S3 Glacier Flexible Retrieval   🗄️    Hours to retrieve      ¢
  S3 Glacier Deep Archive         🏔️    Compliance vault       ¢

Service Spotlight - Amazon Redshift

Redshift

  +--------------------------------------------------------------+
  |  AMAZON  REDSHIFT  -  Data Warehouse                         |
  +--------------------------------------------------------------+
  |  Category     :  Columnar MPP data warehouse                 |
  |  Scale        :  Petabytes (exabytes in S3 via Spectrum)     |
  |  SQL          :  PostgreSQL-flavored                         |
  |  Modes        :  Serverless  |  Provisioned (RA3 nodes)      |
  |  Superpower   :  Sub-second queries over billions of rows    |
  |                                                              |
  |  Use when you need fast SQL on huge structured data.         |
  +--------------------------------------------------------------+

Service Spotlight - Amazon EMR

EMR

  +--------------------------------------------------------------+
  |  AMAZON  EMR  -  Elastic MapReduce                           |
  +--------------------------------------------------------------+
  |  Category     :  Managed Hadoop / Spark / Presto clusters    |
  |  Scale        :  Thousands of nodes                          |
  |  Pricing      :  Per-instance-hour (Spot: up to 90% off)     |
  |                                                              |
  |  Use for petabyte-scale custom Spark jobs.                   |
  +--------------------------------------------------------------+

S3 Data Lake - Medallion Architecture

  s3://hitavirtech-analytics/
    |
    +-- raw/             <-- Bronze:  untouched source data
    |     +-- sales/2026/04/22/orders.csv
    |     +-- inventory/2026/04/22/stock.json
    |
    +-- curated/         <-- Silver:  cleaned, typed Parquet
    |     +-- sales_fact/year=2026/month=04/day=22/part-001.parquet
    |
    +-- analytics/       <-- Gold:    pre-aggregated for BI
          +-- daily_revenue/year=2026/month=04/day=22/part-001.parquet
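The `year=/month=/day=` folders in the Silver and Gold layers follow the Hive-style partition naming that Glue crawlers and Athena recognize as partition columns. A small sketch of building such keys is shown below; `partition_key` is a hypothetical helper name, and the layer, table, and file names are taken from the layout above.

```python
from datetime import date

def partition_key(layer, table, day, filename):
    """Build a Hive-style partitioned object key (no bucket prefix)."""
    return (
        f"{layer}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )

print(partition_key("curated", "sales_fact", date(2026, 4, 22), "part-001.parquet"))
```

Zero-padding the month and day keeps keys sorted chronologically and matches the paths in the tree.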

Volume Decision Tree

                      How much data?
                            |
       +--------------------+--------------------+
       |                    |                    |
     < 1 TB              1-100 TB             > 100 TB
       |                    |                    |
       v                    v                    v
    RDS or                S3 +               S3 + EMR +
  Redshift Serverless    Athena              Redshift +
  (small + cheap)       (most common)        Lake Formation
                                             (huge platform)
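The decision tree above can be written as a small function, which makes the boundaries explicit. This is only a transcription of the tree: `recommend_stack` is a hypothetical name, and the 1 TB and 100 TB cut-offs are the rough guideposts from the diagram, not hard limits of any service.

```python
def recommend_stack(terabytes):
    """Map a data volume in TB to the starting stack from the tree."""
    if terabytes < 1:
        return "RDS or Redshift Serverless"
    if terabytes <= 100:
        return "S3 + Athena"
    return "S3 + EMR + Redshift + Lake Formation"

for size_tb in (0.2, 40, 500):
    print(size_tb, "TB ->", recommend_stack(size_tb))
```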

💡 HitaVir Tech says: "Start with S3. Every analytics service on AWS reads from it. You'll never regret putting data into S3, but you may regret putting it anywhere else first."

  +==============================================================+
  |              AWS  FOR  VARIETY                               |
  |         "Handle any data format, elegantly."                 |
  +==============================================================+

The Variety Lineup

Glue • Athena • DynamoDB • Rekognition • Textract • Transcribe • Comprehend • OpenSearch

The Variety Toolkit

  Service                 Purpose
  ----------------------  ------------------------------------------------------
  Amazon S3               Holds every format: CSV, JSON, Parquet, images, video
  AWS Glue                ETL for all formats; crawlers auto-detect schema
  AWS Glue Data Catalog   Metadata layer: one view across formats
  Amazon Athena           SQL on CSV, JSON, Parquet, ORC, Avro
  Amazon DynamoDB         NoSQL store for flexible key-value and document data
  Amazon Rekognition      Images / video → structured labels
  Amazon Textract         PDFs and scans → text and tables
  Amazon Transcribe       Speech → text
  Amazon Comprehend       NLP: sentiment, entities, topics
  Amazon OpenSearch       Full-text search and log analytics

Service Spotlight - AWS Glue

Glue Crawler Catalog DataBrew Data Quality

  +--------------------------------------------------------------+
  |  AWS  GLUE  -  Serverless ETL + Data Catalog                 |
  +--------------------------------------------------------------+
  |  Crawlers     :  Auto-detect schema for new S3 folders       |
  |  Catalog      :  Central metadata (Athena, Redshift, EMR)    |
  |  ETL Jobs     :  Python / Spark transformations              |
  |  DataBrew     :  No-code visual data prep                    |
  |  Data Quality :  Rule-based DQ checks                        |
  |                                                              |
  |  The "nervous system" of your AWS data lake.                 |
  +--------------------------------------------------------------+

Glue Flow

  INPUT                                                       OUTPUT
  -----                                                       ------

  CSV       +                                           +---- Parquet
  JSON      +---> Crawler ---> Catalog ---> ETL Job ----+     (optimized)
  Parquet   +              (schema +                    +---- Updated
  Avro      |               partitions)                        tables
  Database  +

Athena - One SQL, Many Formats

SELECT *
FROM   csv_orders
JOIN   json_customers   USING (customer_id)
JOIN   parquet_products USING (product_id);

Athena reads CSV, JSON, Parquet, ORC, and Avro, all through the Glue Catalog. You never leave SQL.

Unstructured → Structured: The AI Extractor Pipeline

Rekognition Textract Transcribe Comprehend Translate

  Input                   Icon  AWS Service   Output
  ----------------------  ----  ------------  ---------------------------
  Images                  🖼️    Rekognition   Labels, faces, moderation
  PDFs / scans            📄    Textract      Extracted text + tables
  Audio / voice           🎤    Transcribe    Transcripts
  Free text               💬    Comprehend    Sentiment, entities, topics
  Foreign-language text   🌐    Translate     Translated text

Magic step: chaos in → structured features out → then into S3 / Athena / Redshift as normal.

Real-World Example - E-commerce Review Pipeline

  Customer review (raw text)
        |
        v
  Comprehend  --->  sentiment = negative, topic = shipping
        |
        v
  S3 (enriched records)
        |
        v
  Glue Crawler  --->  Data Catalog
        |
        v
  Athena SQL:   "avg sentiment per product / month"
        |
        v
  QuickSight dashboard for the CX team
        |
        v
  Action:  fix shipping partner in Region X

Variety Decision Tree

                       What's my data?
                             |
      +---------+---------+--+---+---------+---------+
      |         |         |      |         |         |
   Tabular   JSON      Images   PDFs   Audio     Free text
      |         |         |      |         |         |
      v         v         v      v         v         v
   S3+Glue+   S3 +      Reko-   Text-   Transcribe  Comprehend
   Athena     Athena    gnition  tract
              or DDB

💡 HitaVir Tech says: "The magic of modern analytics: unstructured data becomes structured features in minutes via AWS AI services. What took PhDs years a decade ago is now an API call."

  +==============================================================+
  |              AWS  FOR  VELOCITY                              |
  |         "Move and process data in real time."                |
  +==============================================================+

The Velocity Lineup

Kinesis Streams • Firehose • Kinesis Analytics • MSK • Lambda • EventBridge • SNS • SQS

The Velocity Toolkit

  Service                  Purpose
  -----------------------  --------------------------------------------
  Kinesis Data Streams     Real-time event stream (like Kafka)
  Kinesis Firehose         Buffered streaming delivery to S3 / Redshift
  Amazon MSK               Managed Apache Kafka
  Kinesis Data Analytics   SQL / Flink on streams in real time
  AWS Lambda               Event-driven serverless functions
  Amazon EventBridge       Serverless event bus across AWS
  Amazon SNS / SQS         Pub-sub / queue messaging

Service Spotlight - Amazon Kinesis Data Streams

Kinesis Data Streams

  +--------------------------------------------------------------+
  |  KINESIS  DATA  STREAMS  -  Real-time Event Stream           |
  +--------------------------------------------------------------+
  |  Latency      :  Sub-second                                  |
  |  Retention    :  24 hours (up to 365 days)                   |
  |  Throughput   :  MB/sec per shard, scale by sharding         |
  |  Ordering     :  Per-shard ordered                           |
  |                                                              |
  |  The "high-speed conveyor belt" for events.                  |
  +--------------------------------------------------------------+
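"Per-shard ordered" and "scale by sharding" both follow from how records are routed: Kinesis hashes each record's partition key with MD5 into a 128-bit number, and every shard owns a range of that number space. The sketch below imitates that routing under a simplifying assumption that the shards split the hash space evenly; `shard_for` and the `rider-42` key are hypothetical names for illustration.

```python
import hashlib

def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index via its 128-bit MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = 2**128 // shard_count  # assumes evenly split hash-key ranges
    # Clamp so the last shard absorbs the remainder of the hash space.
    return min(hash_value // range_size, shard_count - 1)

# The same key always lands on the same shard, so its events stay ordered.
print(shard_for("rider-42", 4), shard_for("rider-42", 4))
```

Choosing a high-cardinality partition key (a user or device ID, for example) spreads load across shards, while a single constant key would send everything to one shard.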

Kinesis in Action - The Conveyor Belt

  PRODUCERS                KINESIS                   CONSUMERS
  ------------             ---------                 ------------

  App events        +----------------------+         Lambda
  Clickstreams  --->|  >> >> >> >> >> >>   |--->    Kinesis Analytics
  IoT sensors       |  durable, ordered,   |         Firehose -> S3
  Transactions      |  up to 365d retention|         OpenSearch
  Card swipes       +----------------------+         Redshift

Kinesis holds events durably. Multiple consumers read the same stream independently.
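
To make "per-shard ordered" and "scale by sharding" concrete, here is a minimal sketch of how a partition key could be routed to a shard. Kinesis hashes the partition key with MD5 into a 128-bit number; the function name `shard_for` and the assumption that shards split the hash space into equal ranges are illustrative only.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard that would own this partition key.

    Assumes the hash space is split into equal, contiguous ranges, one per
    shard. Real streams can have uneven ranges after resharding.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


if __name__ == "__main__":
    # The same key always maps to the same shard, which preserves its order.
    for key in ["rider-17", "rider-17", "rider-42", "driver-9"]:
        print(key, "-> shard", shard_for(key, shard_count=4))
```

Because one key always lands on one shard, all events for a given rider stay in order, while different riders spread across shards for throughput.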

Service Spotlight - Kinesis Firehose

  +--------------------------------------------------------------+
  |  KINESIS  FIREHOSE  -  The Easy Button                       |
  +--------------------------------------------------------------+
  |  Model        :  Serverless, fully managed                   |
  |  Buffer       :  e.g. 60 sec or 5 MB (whichever first)       |
  |  Transforms   :  Optional Lambda enrichment                  |
  |  Sinks        :  S3 (Parquet!), Redshift, OpenSearch, HTTP   |
  |                                                              |
  |  No code, no cluster - the laziest streaming on AWS.         |
  +--------------------------------------------------------------+
  Producers  --->  Firehose  --->  Convert to Parquet  --->  S3
                  (no servers,
                   auto-scale)
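
The size-or-time buffering rule in the card above can be sketched in a few lines. This is a toy model under stated assumptions, not the Firehose implementation: `BufferSketch`, its method names, and the injected `now` clock are all hypothetical, and the real service lets you tune both thresholds.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BufferSketch:
    """Toy model of size-or-time buffering (illustration only)."""
    max_bytes: int = 5 * 1024 * 1024   # size threshold from the card above
    max_seconds: float = 60.0          # time threshold from the card above
    _records: List[bytes] = field(default_factory=list)
    _size: int = 0
    _opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer one record; return a batch if a threshold was reached."""
        if self._opened_at is None:
            self._opened_at = now
        self._records.append(record)
        self._size += len(record)
        return self.tick(now)

    def tick(self, now: float) -> Optional[List[bytes]]:
        """Flush if the buffer is full or old enough; otherwise return None."""
        if not self._records:
            return None
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        if too_big or too_old:
            batch = self._records
            self._records, self._size, self._opened_at = [], 0, None
            return batch
        return None
```

Passing `now` in explicitly keeps the sketch deterministic: a quiet stream flushes on the timer, a busy one flushes on size.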

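Kinesis Data Analytics - Continuous SQL

-- Kinesis Data Analytics SQL sketch; column types and the input stream name are assumed.
CREATE OR REPLACE STREAM "FRAUD_ALERTS" (user_id VARCHAR(32), amount DOUBLE, location VARCHAR(64));
-- The pump runs continuously, inserting matching rows as they arrive.
CREATE OR REPLACE PUMP "FRAUD_PUMP" AS INSERT INTO "FRAUD_ALERTS"
SELECT STREAM user_id, amount, location
FROM   "SOURCE_SQL_STREAM_001"
WHERE  amount > 10000 OR is_foreign = TRUE;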

Results in milliseconds โ€” not after the nightly batch.

AWS Lambda - The Universal Event Glue

  S3 new file      +
  Kinesis event    +--->  Lambda  --->  Redshift load
  DynamoDB row     +         |
  EventBridge      +         +---->  SNS / SQS alert

Perfect for: event reactions, enrichment, alerting, small transforms.

Real-World Velocity Pipeline - Rideshare App

  Rideshare app (1 million events/sec)
              |
              v
       Kinesis Data Streams
              |
     +--------+---------+
     |        |         |
     v        v         v
   Lambda  Analytics  Firehose
   fraud   real-time  buffer
   flag    aggregate  --> S3 (Parquet)
     |        |         |
     v        v         v
   SNS     QuickSight  Athena
   alert   live dash.  historical BI

Velocity Decision Tree

                       How fresh must the data be?
                                 |
        +---------+----------+---+---+-----------+
        |         |          |       |           |
     Next day  15 minutes  Seconds  Sub-second  Kafka shop
        |         |          |       |           |
        v         v          v       v           v
     Glue       Firehose    Kinesis  Kinesis     MSK
     batch      -> S3       + Lambda Analytics
                                    (Flink)
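
The tree above can also be read as a tiny function. This is only a sketch: the numeric cutoffs (1 second, 60 seconds, 15 minutes) are illustrative assumptions, not AWS limits, and `pick_velocity_service` is a hypothetical name.

```python
def pick_velocity_service(max_staleness_seconds: float, kafka_shop: bool = False) -> str:
    """Map a freshness requirement to the branch names used in the tree.

    The cutoffs are assumptions chosen to mirror the tree's labels.
    """
    if kafka_shop:
        return "MSK"
    if max_staleness_seconds < 1:
        return "Kinesis Analytics (Flink)"
    if max_staleness_seconds < 60:
        return "Kinesis + Lambda"
    if max_staleness_seconds <= 15 * 60:
        return "Firehose -> S3"
    return "Glue batch"
```

Asking stakeholders for a number ("how stale can this be, in seconds?") tends to settle the real-time debate faster than asking whether they want real-time.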

💡 HitaVir Tech says: "Every team thinks they need real-time until they see the bill. Start with Firehose and 5-minute micro-batches, and graduate later. Most of the time, you won't need to."

  +==============================================================+
  |              AWS  FOR  VERACITY                              |
  |         "Make sure the data is trustworthy."                 |
  +==============================================================+

The Veracity Lineup

DataBrew • Glue Data Quality • Lake Formation • Macie • CloudTrail • Config • IAM • KMS

The Veracity Toolkit

  Service                   Purpose
  -------                   -------
  AWS Glue DataBrew         Visual data cleaning and profiling
  AWS Glue Data Quality     Rule-based DQ checks
  Amazon Deequ (library)    Unit tests for data on Spark
  AWS Lake Formation        Governance and fine-grained permissions
  AWS CloudTrail            Audit every API call
  AWS Config                Track resource configuration drift
  Amazon Macie              Discover and classify PII in S3
  AWS KMS                   Manage encryption keys

Service Spotlight - AWS Glue DataBrew

  +--------------------------------------------------------------+
  |  GLUE  DATABREW  -  No-Code Data Prep                        |
  +--------------------------------------------------------------+
  |  Interface    :  Visual, spreadsheet-like                    |
  |  Transforms   :  250+ built-in (nulls, dupes, dates...)      |
  |  Profiling    :  Auto column stats, anomaly flags            |
  |  Recipes      :  Save and schedule as jobs                   |
  |                                                              |
  |  Hand this to business analysts - no Spark needed.           |
  +--------------------------------------------------------------+
  Load  --->  Profile  --->  Apply transforms  --->  Export
              (stats,        (fill nulls, parse      (to S3 or
               anomalies)     dates, standardize)     Redshift)

Service Spotlight - Glue Data Quality

  +--------------------------------------------------------------+
  |  GLUE  DATA  QUALITY  -  Rules That Guard the Lake           |
  +--------------------------------------------------------------+
  |  Rule types    :  Completeness, uniqueness, ranges, custom   |
  |  Enforcement   :  Block pipeline  OR  quarantine  OR  alert  |
  |  ML-assisted   :  Recommends rules from sample data          |
  |                                                              |
  |  Data quality becomes a pipeline concern, not tribal.        |
  +--------------------------------------------------------------+
  RULES                                          CHECK RESULT
  -----                                          -------------
  order_id IS NOT NULL                     ...   PASS
  amount BETWEEN 0 AND 1_000_000           ...   PASS
  customer_id IN customers                 ...   PASS
  COUNT(DISTINCT order_id) = COUNT(*)      ...   FAIL - 23 dupes!
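
To see what those four rules actually check, here is a plain-Python sketch that evaluates them over a list of order rows. It is not Glue Data Quality's rule language; `check_orders` and the row layout are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def check_orders(orders: List[dict], known_customers: Iterable[str]) -> Dict[str, str]:
    """Evaluate the four example rules and report PASS or FAIL for each."""
    customers = set(known_customers)
    ids = [o.get("order_id") for o in orders]
    non_null_ids = [i for i in ids if i is not None]
    dupes = len(non_null_ids) - len(set(non_null_ids))
    checks = {
        "order_id is not null": all(i is not None for i in ids),
        "amount between 0 and 1,000,000": all(
            o.get("amount") is not None and 0 <= o["amount"] <= 1_000_000
            for o in orders
        ),
        "customer_id in customers": all(
            o.get("customer_id") in customers for o in orders
        ),
        "order_id is unique": dupes == 0,
    }
    return {rule: ("PASS" if ok else "FAIL") for rule, ok in checks.items()}
```

Writing the rules as code, even toy code, forces the team to agree on what "valid" means before the pipeline runs.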

Service Spotlight - Amazon Macie

  +--------------------------------------------------------------+
  |  AMAZON  MACIE  -  PII Detective                             |
  +--------------------------------------------------------------+
  |  Method       :  ML-based classification of S3 contents      |
  |  Finds        :  Credit cards, SSNs, emails, addresses       |
  |  Output       :  Severity alerts -> Security Hub             |
  |                                                              |
  |  Sniffs sensitive data hiding in your data lake.             |
  +--------------------------------------------------------------+
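
For intuition only, here is a toy pattern-based scanner for two of the data types listed in the card. It is not how Macie works internally (the card describes ML-based classification); the regular expressions, `find_pii`, and `luhn_valid` are illustrative assumptions. The Luhn checksum is the standard check-digit test for card numbers.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:       # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_pii(text: str) -> dict:
    """Return candidate emails and card numbers found in a block of text."""
    cards = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            cards.append(digits)
    return {"emails": EMAIL_RE.findall(text), "cards": cards}
```

The checksum step matters: without it, any long order number would be flagged as a credit card.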

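CloudTrail + Config - The Audit Twins

CloudTrail records every API call (who did what, and when); Config tracks how resource configurations drift over time. Together they are essential for regulated industries (finance, healthcare, government).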

The Data Quality Lifecycle

  PROFILE  --->  DEFINE RULES  --->  ENFORCE  --->  ALERT  --->  FIX  --->  MONITOR
  (know)         (expected)           (every run)    (fail fast)  (fix)     (trends)
     ^                                                                         |
     |                                                                         |
     +---------------------------- loop ---------------------------------------+

Real-World Example - Sales Lake Quality Gate

  Raw sales CSV from 50 stores
        |
        v
  Glue ETL reads it
        |
        v
  Glue Data Quality runs rules:
     PASS - order_id unique
     PASS - amount in [0, 1M]
     FAIL - store_id in valid list   (12 rows failed)
        |
        v
     +--+--+
     v     v
  Quarantine   Curated
  bucket +    zone
  SNS alert   (Parquet)
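
The fork at the bottom of that diagram is a row-level split. A minimal sketch, assuming each row is a dict with a `store_id` field and that the valid store list is known up front (`route_rows` is a hypothetical helper, not a Glue API):

```python
from typing import Iterable, List, Tuple


def route_rows(rows: List[dict], valid_store_ids: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Split rows into (curated, quarantined) using the store_id rule."""
    valid = set(valid_store_ids)
    curated: List[dict] = []
    quarantined: List[dict] = []
    for row in rows:
        if row.get("store_id") in valid:
            curated.append(row)
        else:
            quarantined.append(row)
    return curated, quarantined


if __name__ == "__main__":
    good, bad = route_rows(
        [{"store_id": "S01", "amount": 120.0}, {"store_id": "S99", "amount": 80.0}],
        valid_store_ids=["S01", "S02"],
    )
    print(len(good), "curated,", len(bad), "quarantined")
```

Quarantining instead of dropping keeps the bad rows around for debugging, so the store that sent them can be told exactly what failed.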

💡 HitaVir Tech says: "Every pipeline must have quality rules: not 'someday', but from day one. It is 10x cheaper to catch bad data at ingest than to explain a wrong dashboard to the CEO."

  +==============================================================+
  |              AWS  FOR  VALUE                                 |
  |       "Turn data into decisions and ROI."                    |
  +==============================================================+

The Value Lineup

QuickSight • SageMaker • Forecast • Personalize • Fraud Detector • Bedrock

The Value Toolkit

  Service                       Purpose
  -------                       -------
  Amazon QuickSight             Dashboards, BI, natural-language analytics
  Amazon SageMaker              Build, train, deploy ML models
  Amazon Forecast               No-code time-series forecasting
  Amazon Personalize            Recommendation engines
  Amazon Fraud Detector         Fraud-prediction models
  Amazon Q in QuickSight        Ask data questions in natural language
  Amazon Bedrock                Foundation models (Claude, Llama, etc.)
  Redshift ML                   Train and run ML using SQL in Redshift
  Amazon Lookout for Metrics    Auto-detect anomalies in business KPIs

Service Spotlight - Amazon QuickSight

  +--------------------------------------------------------------+
  |  AMAZON  QUICKSIGHT  -  Business Intelligence                |
  +--------------------------------------------------------------+
  |  Sources      :  S3 via Athena, Redshift, RDS, Excel...      |
  |  Engine       :  SPICE (in-memory columnar cache)            |
  |  Superpowers  :  Amazon Q (natural language) + Embedded      |
  |  Editions     :  Standard  |  Enterprise                     |
  |                                                              |
  |  Your "Tableau / Power BI" on AWS, with AI built in.         |
  +--------------------------------------------------------------+

QuickSight Flow

  Athena / Redshift  --->  SPICE  --->  Visuals  --->  Dashboard  --->  Share
  (data source)            (cache)      (charts)      (compose)        (embed)

Service Spotlight - Amazon SageMaker

  +--------------------------------------------------------------+
  |  AMAZON  SAGEMAKER  -  Full-Lifecycle ML                     |
  +--------------------------------------------------------------+
  |  Studio       :  Browser IDE for ML                          |
  |  Autopilot    :  AutoML - tries many models                  |
  |  Pipelines    :  Train / evaluate / deploy as code           |
  |  Monitor      :  Catch model drift in production             |
  |                                                              |
  |  From raw data to deployed model - one platform.             |
  +--------------------------------------------------------------+

Service Spotlight - Amazon Forecast

  +--------------------------------------------------------------+
  |  AMAZON  FORECAST  -  No-Code Time-Series ML                 |
  +--------------------------------------------------------------+
  |  Inputs       :  Historical + related series + metadata      |
  |  AutoML       :  Tries multiple algorithms, picks best       |
  |  Accuracy     :  Up to 50% better per AWS (data-dependent)   |
  |                                                              |
  |  Same tech Amazon uses internally for demand planning.       |
  +--------------------------------------------------------------+
  Past sales + Weather + Holidays
           |
           v
      Forecast (AutoML)
           |
           v
   Daily predictions per SKU per store

Service Spotlight - Amazon Personalize

  +--------------------------------------------------------------+
  |  AMAZON  PERSONALIZE  -  Netflix-Style Recs                  |
  +--------------------------------------------------------------+
  |  Inputs       :  Users + items + interaction events          |
  |  Use cases    :  "You may like...", "Related items..."       |
  |  Real-time    :  Inference in milliseconds                   |
  |                                                              |
  |  Same ML powering Amazon.com recommendations.                |
  +--------------------------------------------------------------+

Amazon Q in QuickSight - Natural-Language BI

  User types:   "Show me top 5 products last quarter"
                        |
                        v
         Q interprets -> writes SQL -> runs -> visualizes
                        |
                        v
            Bar chart appears within seconds

Analysts no longer gatekeep simple questions.

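Redshift ML - SQL-Native Machine Learning

CREATE MODEL churn_model
FROM   customer_features
TARGET churned
FUNCTION predict_churn
IAM_ROLE DEFAULT
SETTINGS (S3_BUCKET 'my-ml-bucket');

-- Then use it: pass the same feature columns the model was trained on.
-- tenure_months and monthly_spend are hypothetical column names, and the
-- 0.8 filter assumes the model returns a numeric score.
SELECT customer_id, risk
FROM  (SELECT customer_id,
              predict_churn(tenure_months, monthly_spend) AS risk
       FROM   customers) AS scored
WHERE  risk > 0.8;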

ML without leaving your data warehouse.

The Value Stack

  +==============================================================+
  |                                                              |
  |   BUSINESS  VALUE      revenue - savings - growth            |
  |        ^                                                     |
  |        |                                                     |
  |   APPLICATION LAYER    Personalize, Fraud Detector, Lookout  |
  |        ^                                                     |
  |        |                                                     |
  |   ML LAYER             SageMaker, Forecast, Redshift ML      |
  |        ^                                                     |
  |        |                                                     |
  |   ANALYSIS LAYER       Athena + QuickSight + Redshift        |
  |        ^                                                     |
  |        |                                                     |
  |   BUILD LAYER          S3 + Glue + Kinesis + Lake Formation  |
  |                                                              |
  +==============================================================+

💡 HitaVir Tech says: "The best data platform is worthless if nobody uses the outputs. Start from Value and work backwards: who sees which insight, and what decision changes? Design everything else to serve that."

  +==============================================================+
  |         END-TO-END  AWS  ANALYTICS  STACK                    |
  +==============================================================+

The Full Cast - Every Service You Will See Below

Kinesis • Firehose • S3 • Glue • Athena • Redshift • EMR • SageMaker • QuickSight • Bedrock

All 5 Vs combined into one living architecture:

  INGEST  (Velocity + Variety Layers)
  +---------------------------+    +---------------------------+
  |  Kinesis Data Streams     |    |  Glue Crawlers            |
  |  Kinesis Firehose         |    |  Rekognition / Textract   |
  |  Lambda                   |    |  Transcribe / Comprehend  |
  |  MSK (Kafka)              |    |  Translate                |
  +-------------+-------------+    +-------------+-------------+
                |                                |
                +----------------+---------------+
                                 |
                                 v
  STORE  (Volume Layer)
  +----------------------------------------------------------+
  |   Amazon S3  Data Lake  (raw  /  curated  /  analytics)  |
  |   <------>  AWS Glue Data Catalog  (metadata)            |
  +-----------------------------+----------------------------+
                                |
                                v
  PROCESS + VERACITY
  +----------------------------------------------------------+
  |   Glue ETL  /  EMR  (transform)                          |
  |   <---  Glue Data Quality  /  DataBrew  /  Lake Formation|
  +-----------------------------+----------------------------+
                                |
                                v
  QUERY
  +------------+  +------------+  +------------+
  |  Athena    |  |  Redshift  |  | SageMaker  |
  |  (SQL)     |  |  (MPP SQL) |  |  (ML)      |
  +------+-----+  +------+-----+  +------+-----+
         |               |                |
         +---------------+----------------+
                         |
                         v
  VALUE LAYER
  +----------------------------------------------------------+
  |   QuickSight     |    Forecast                           |
  |   Amazon Q (NL)  |    Personalize                        |
  +-----------------------------+----------------------------+
                                |
                                v
                 Decisions  -  Revenue  -  Growth

Every layer maps to one or more Vs. Every box is filled with managed AWS services.

  +==============================================================+
  |   HANDS-ON  LAB  -  S3  ->  GLUE  ->  ATHENA                 |
  +==============================================================+

The Services We'll Touch

S3 • Glue Crawler • Glue Catalog • Athena

You will build a mini pipeline that touches Volume (S3), Variety (CSV auto-cataloged), and Value (SQL insights).

   Step 1         Step 2-3         Step 4-5           Step 6          Step 7-9
  +------+       +------+          +----------+    +--------+     +----------+
  | CSV  | ----> |  S3  | ------>  | Crawler  | -> | Catalog| --> | Athena   |
  +------+       +------+          +----------+    +--------+     +----------+
  Prepare        Upload            Create & run     Table auto    Run SQL,
  sample data    to bucket          crawler          populated     get insights

📄 Step 1 - Prepare Sample Data

On your laptop, create a file called sales.csv:

order_id,customer,product,quantity,amount,order_date
1001,Ravi,Laptop,1,75000.00,2026-04-01
1002,Priya,Mouse,2,1500.00,2026-04-01
1003,Amit,Keyboard,1,3500.00,2026-04-02
1004,Ravi,Monitor,1,18000.00,2026-04-02
1005,Sneha,Headphones,1,4500.00,2026-04-03
1006,Priya,Laptop,1,75000.00,2026-04-03
1007,Vikram,USB Cable,3,900.00,2026-04-04
1008,Neha,Webcam,1,5000.00,2026-04-04
1009,Ravi,Mouse,1,750.00,2026-04-05
1010,Amit,Monitor,1,18000.00,2026-04-05
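
If you would rather generate the file than type it, this short script writes the same ten rows. It is a convenience sketch; the function name `write_sales_csv` is an assumption, and the data is copied from the listing above.

```python
import csv
from pathlib import Path

HEADER = ["order_id", "customer", "product", "quantity", "amount", "order_date"]
ROWS = [
    (1001, "Ravi", "Laptop", 1, "75000.00", "2026-04-01"),
    (1002, "Priya", "Mouse", 2, "1500.00", "2026-04-01"),
    (1003, "Amit", "Keyboard", 1, "3500.00", "2026-04-02"),
    (1004, "Ravi", "Monitor", 1, "18000.00", "2026-04-02"),
    (1005, "Sneha", "Headphones", 1, "4500.00", "2026-04-03"),
    (1006, "Priya", "Laptop", 1, "75000.00", "2026-04-03"),
    (1007, "Vikram", "USB Cable", 3, "900.00", "2026-04-04"),
    (1008, "Neha", "Webcam", 1, "5000.00", "2026-04-04"),
    (1009, "Ravi", "Mouse", 1, "750.00", "2026-04-05"),
    (1010, "Amit", "Monitor", 1, "18000.00", "2026-04-05"),
]


def write_sales_csv(path: str = "sales.csv") -> Path:
    """Write the sample orders to a CSV file and return its path."""
    target = Path(path)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return target


if __name__ == "__main__":
    print("wrote", write_sales_csv())
```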

🪣 Step 2 - Create an S3 Bucket

  1. Sign in to the AWS Management Console
  2. Search for S3 in the top search bar
  3. Click Create bucket
  4. Bucket name: hitavirtech-analytics-yourname (must be globally unique, so add your name)
  5. Region: closest to you (ap-south-1 = Mumbai)
  6. Leave defaults → Create bucket

⬆️ Step 3 - Upload the CSV

  1. Open your bucket → Create folder → raw
  2. Inside raw/ → Create folder → sales
  3. Inside raw/sales/ → Upload → select sales.csv → Upload

Your object now lives at:

s3://hitavirtech-analytics-yourname/raw/sales/sales.csv
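
The same upload can be scripted. This is a hedged sketch that assumes boto3 is installed and AWS credentials are configured; `object_key` and `upload_sales_csv` are hypothetical helper names, while `upload_file` is the standard boto3 S3 client call. S3 has no real folders, so the "folders" from the console are just a key prefix.

```python
def object_key(prefix: str, filename: str) -> str:
    """Join an S3 prefix and a file name into an object key."""
    return f"{prefix.strip('/')}/{filename}"


def upload_sales_csv(bucket: str, local_path: str = "sales.csv") -> str:
    """Upload the file and return its s3:// URI (needs boto3 and credentials)."""
    import boto3  # imported lazily so object_key works without boto3

    key = object_key("raw/sales", "sales.csv")
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    # Replace the bucket name with the one you created in Step 2.
    print(upload_sales_csv("hitavirtech-analytics-yourname"))
```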

🕸️ Step 4 - Create Glue Database and Crawler

  1. Search for Glue in the AWS Console
  2. Left menu → Databases → Add database
  3. Name: hitavirtech_sales_db → Create
  4. Left menu → Crawlers → Create crawler
  5. Name: hitavirtech-sales-crawler → Next
  6. Add a data source → S3 → path: s3://hitavirtech-analytics-yourname/raw/sales/ → Add
  7. Next → Create new IAM role AWSGlueServiceRole-hitavirtech → Create
  8. Next → Target database hitavirtech_sales_db → Schedule On demand → Next → Create

🕷️ Step 5 - Run the Crawler

  1. Select hitavirtech-sales-crawler → Run crawler
  2. Wait 1-2 minutes until Status = Completed and Table changes = 1 created
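
For reference, Steps 4 and 5 could also be done from code. The sketch below assumes boto3 and credentials are available and that the IAM role already exists (the console wizard creates it for you; the API calls do not). `crawler_request` and `create_and_run_crawler` are hypothetical helper names wrapping the boto3 Glue calls `create_database`, `create_crawler`, and `start_crawler`.

```python
def crawler_request(bucket: str, role: str = "AWSGlueServiceRole-hitavirtech") -> dict:
    """Build keyword arguments for glue.create_crawler from the lab's values.

    If the role name alone is not accepted, pass the role's full ARN instead.
    """
    return {
        "Name": "hitavirtech-sales-crawler",
        "Role": role,
        "DatabaseName": "hitavirtech_sales_db",
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/raw/sales/"}]},
    }


def create_and_run_crawler(bucket: str) -> None:
    """Create the database and crawler, then start one on-demand run."""
    import boto3  # imported lazily; needs credentials at call time

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": "hitavirtech_sales_db"})
    glue.create_crawler(**crawler_request(bucket))
    glue.start_crawler(Name="hitavirtech-sales-crawler")
```

Leaving out a schedule in the request corresponds to the "On demand" choice in the console.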

📚 Step 6 - Verify the Table

  1. Glue → Tables → open the new sales table
  2. Inspect:
    • Columns: order_id, customer, product, quantity, amount, order_date
    • Types: inferred automatically (bigint, string, double)
    • Location: your S3 folder

Glue just auto-discovered the schema. 🎉

🔍 Step 7 - Query with Athena

  1. Open Athena in the AWS Console
  2. First time: Edit settings → set query results location to s3://hitavirtech-analytics-yourname/athena-results/
  3. In the editor: Data source = AwsDataCatalog, Database = hitavirtech_sales_db
  4. Run:
SELECT * FROM sales LIMIT 5;

You should see 5 rows. 🎉

💡 Step 8 - Analytical Queries

-- Top customers by revenue
SELECT customer, SUM(amount) AS total_spent
FROM sales
GROUP BY customer
ORDER BY total_spent DESC;

-- Best-selling products
SELECT product, SUM(quantity) AS units_sold
FROM sales
GROUP BY product
ORDER BY units_sold DESC;

-- Daily revenue
SELECT order_date, SUM(amount) AS daily_revenue
FROM sales
GROUP BY order_date
ORDER BY order_date;
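
To sanity-check what Athena should return, you can run two of these queries locally against the same ten rows. This sketch uses Python's built-in sqlite3 as a stand-in; SQLite is not Athena's engine, but these simple aggregations behave the same way. The names `load_sales` and `QUERIES` are assumptions.

```python
import csv
import io
import sqlite3

SALES_CSV = """order_id,customer,product,quantity,amount,order_date
1001,Ravi,Laptop,1,75000.00,2026-04-01
1002,Priya,Mouse,2,1500.00,2026-04-01
1003,Amit,Keyboard,1,3500.00,2026-04-02
1004,Ravi,Monitor,1,18000.00,2026-04-02
1005,Sneha,Headphones,1,4500.00,2026-04-03
1006,Priya,Laptop,1,75000.00,2026-04-03
1007,Vikram,USB Cable,3,900.00,2026-04-04
1008,Neha,Webcam,1,5000.00,2026-04-04
1009,Ravi,Mouse,1,750.00,2026-04-05
1010,Amit,Monitor,1,18000.00,2026-04-05
"""

QUERIES = {
    "top_customers": "SELECT customer, SUM(amount) AS total_spent "
                     "FROM sales GROUP BY customer ORDER BY total_spent DESC",
    "daily_revenue": "SELECT order_date, SUM(amount) AS daily_revenue "
                     "FROM sales GROUP BY order_date ORDER BY order_date",
}


def load_sales() -> sqlite3.Connection:
    """Load the sample CSV into an in-memory SQLite table named sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER, customer TEXT, product TEXT, "
        "quantity INTEGER, amount REAL, order_date TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(SALES_CSV)))
    conn.executemany(
        "INSERT INTO sales VALUES "
        "(:order_id, :customer, :product, :quantity, :amount, :order_date)",
        rows,
    )
    return conn


if __name__ == "__main__":
    conn = load_sales()
    for name, sql in QUERIES.items():
        print(name, conn.execute(sql).fetchall())
```

With the sample data, Ravi tops the customer list at 93,750 and the five daily totals add up to 202,150, so those are the numbers to look for in the Athena results.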

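💰 Step 9 - The "Data Scanned" Number

At the bottom of every result: "Data scanned: 412 B" or similar.

That number drives your bill: Athena charges by the amount of data each query scans. At scale, every analytics engineer watches it. On large tables, partitioning plus a columnar format such as Parquet can cut it by orders of magnitude.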
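
A back-of-the-envelope estimate shows how the number turns into money. The constants below are assumptions to check against the Athena pricing page for your Region: a price of 5 USD per TB scanned and a 10 MB minimum charge per query. `estimated_query_cost` is a hypothetical helper.

```python
PRICE_PER_TB_USD = 5.00                    # assumption: varies by Region
MIN_BYTES_PER_QUERY = 10 * 1024 * 1024     # assumption: 10 MB minimum per query
BYTES_PER_TB = 1024 ** 4


def estimated_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of one query from its 'Data scanned' value."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    for label, size in [("lab query", 412), ("1 GB scan", 1024 ** 3), ("1 TB scan", 1024 ** 4)]:
        print(f"{label}: ${estimated_query_cost(size):.6f}")
```

Under these assumptions the lab queries cost a tiny fraction of a cent each, while a dashboard that scans a terabyte on every refresh adds up quickly.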

🧹 Step 10 - Cleanup (Mandatory!)

⚠️ Forgetting cleanup = surprise AWS bill.

  1. 🕸️ Glue → Crawlers → delete hitavirtech-sales-crawler
  2. 🕸️ Glue → Databases → delete hitavirtech_sales_db
  3. 🪣 S3 → your bucket → Empty → then Delete
  4. IAM → Roles → delete AWSGlueServiceRole-hitavirtech (it costs nothing, but keeps the account tidy)
  5. 💰 Check the Billing Dashboard the next day → confirm there are no ongoing charges

💡 HitaVir Tech says: "The last 5 minutes of cleanup are the most valuable 5 minutes of the entire lab. Engineers who skip it end up with $300 surprise bills."

  +==============================================================+
  |          CONGRATULATIONS  -  PART 1 DONE!                    |
  +==============================================================+

What You Learned

🧠 Analytics Concepts

  Topic                                     Icon
  -----                                     ----
  Analytics and the four maturity levels    📊
  Machine Learning - three flavors          🤖

📐 The 5 Vs of Big Data

  V           Icon   Theme
  -           ----   -----
  VOLUME      📦     Scale
  VARIETY     🧩     Formats
  VELOCITY    🌊     Speed
  VERACITY    🛡️     Trust
  VALUE       💎     Outcome

☁️ AWS Services Mapped to Each V

  V              Key Services
  -              ------------
  📦 Volume      🪣 S3 • 🏛️ Redshift • 🐘 EMR • 🧊 Glacier • 🏗️ Lake Formation
  🧩 Variety     🕸️ Glue • 🔍 Athena • 🤖 Rekognition • 📝 Textract • 🔊 Transcribe • 💭 Comprehend
  🌊 Velocity    🌊 Kinesis • 🚒 Firehose • MSK • 🎯 Kinesis Analytics • ⚡ Lambda
  🛡️ Veracity    🧪 DataBrew • 🛡️ Glue DQ • 🏗️ Lake Formation • 🕵️ Macie • 📜 CloudTrail
  💎 Value       📊 QuickSight • 🤖 SageMaker • 🔮 Forecast • 🎯 Personalize • 🧠 Redshift ML

🛠️ Hands-on Skills

What's Coming in Part 2

🚀 Part 2 - Advanced Analytics on AWS will cover:

What To Do Next

  1. 🔄 Repeat this lab with a different dataset (try a Kaggle CSV)
  2. 📖 Read the Amazon Athena and AWS Glue docs
  3. 💰 Watch your AWS bill for a few days
  4. 🧠 Apply the 5 Vs to a project you work on and identify the bottleneck V

Final Thoughts

  +==============================================================+
  |                                                              |
  |    The 5 Vs  =  universal data challenge framework           |
  |    AWS      =  complete toolbox for each V                   |
  |                                                              |
  |    Learn both  ->  you can pick up any cloud's analytics     |
  |    stack in a week.                                          |
  |                                                              |
  +==============================================================+

💡 HitaVir Tech says: "Analytics is not about tools. Tools change every two years. Analytics is about asking the right question, finding the right data, and presenting an insight people can act on. Master the fundamentals (Volume, Variety, Velocity, Veracity, Value) and every new tool becomes just another syntax."

🎓 Welcome to cloud analytics. See you in Part 2.

- HitaVir Tech ☁️

  +==============================================================+
  |        DIAGNOSE  YOUR  OWN  PROJECT  WITH  THE  5  Vs         |
  +==============================================================+

Think of a data project you work on (or want to build). Run it through these five questions. The V that feels most stressful is your bottleneck, and that's where to focus first.

  #  Question                                                        Your V        AWS Services to Study
  -  --------                                                        ------        ---------------------
  1  "Do we have somewhere cheap and durable to store everything?"   📦 Volume     🪣 S3 • 🧊 Glacier • 🏛️ Redshift
  2  "Do we have to handle more than one data format?"               🧩 Variety    🕸️ Glue • 🔍 Athena • 🤖 Rekognition • 📝 Textract
  3  "Is the current data freshness good enough for stakeholders?"   🌊 Velocity   🌊 Kinesis • 🚒 Firehose • ⚡ Lambda
  4  "Do stakeholders trust the numbers we publish?"                 🛡️ Veracity   🧪 DataBrew • 🛡️ Glue DQ • 🕵️ Macie
  5  "Is anyone actually acting on our outputs?"                     💎 Value      📊 QuickSight • 🤖 SageMaker • 🎯 Personalize

Score each V from 1 (healthy) to 5 (painful). The highest score is where a senior engineer should lead the next sprint.
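
The scoring exercise is simple enough to automate for a team survey. A minimal sketch, assuming scores are collected as a dict keyed by V name (`bottleneck_vs` is a hypothetical helper); ties are returned together rather than broken arbitrarily.

```python
from typing import Dict, List

FIVE_VS = ("Volume", "Variety", "Velocity", "Veracity", "Value")


def bottleneck_vs(scores: Dict[str, int]) -> List[str]:
    """Return the V (or tied Vs) with the highest pain score, 1 to 5."""
    missing = [v for v in FIVE_VS if v not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    for v in FIVE_VS:
        if not 1 <= scores[v] <= 5:
            raise ValueError(f"{v} score must be between 1 and 5")
    worst = max(scores[v] for v in FIVE_VS)
    return [v for v in FIVE_VS if scores[v] == worst]


if __name__ == "__main__":
    example = {"Volume": 2, "Variety": 3, "Velocity": 2, "Veracity": 5, "Value": 4}
    print("Focus first on:", ", ".join(bottleneck_vs(example)))
```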

🎯 Pro Tip: "Stacking solutions for Velocity when Veracity is the real problem is the most common and expensive AWS mistake. Diagnose honestly before you build."

  +==============================================================+
  |           AWS  ANALYTICS  -  PART 1  CHEAT  SHEET             |
  |                  (screenshot and keep)                        |
  +==============================================================+

The AWS Analytics Toolbox - At a Glance

S3 • Glacier • Redshift • EMR • Glue • Athena • Kinesis • Kinesis Firehose • Lambda • Macie • CloudTrail • KMS • QuickSight • SageMaker • Bedrock

🧠 Concepts in a Sentence

  Term                   Definition
  ----                   ----------
  📊 Analytics           Turning data into decisions
  🤖 Machine Learning    Algorithms that learn patterns instead of being programmed
  Maturity levels        📸 Descriptive → 🕵️ Diagnostic → 🔮 Predictive → 🎯 Prescriptive (the 4 levels of analytics maturity)

📐 The 5 Vs in a Sentence

  V  Icon  Question            One-Liner
  -  ----  --------            ---------
  1  📦    How much?           Design for 100× your current data
  2  🧩    How many formats?   Structure is created, not found
  3  🌊    How fast?           Match speed to the decision's deadline
  4  🛡️    Can we trust it?    Quality rules are pipeline-level, not tribal
  5  💎    Worth it?           Start from the decision, work backwards

☁️ AWS Services - By V

  📦 Volume
     Store             :  🪣 S3 • 🧊 Glacier
     Process           :  🏛️ Redshift • 🐘 EMR
     Catalog / Quality :  🏗️ Lake Formation
     Deliver           :  -

  🧩 Variety
     Store             :  🪣 S3 • ⚡ DynamoDB
     Process           :  🤖 Rekognition • 📝 Textract • 🔊 Transcribe • 💭 Comprehend
     Catalog / Quality :  🕸️ Glue • 📚 Catalog
     Deliver           :  🔍 Athena

  🌊 Velocity
     Store             :  🌊 Kinesis • MSK
     Process           :  ⚡ Lambda • 🎯 K. Analytics
     Catalog / Quality :  🚒 Firehose (→ S3)
     Deliver           :  🔎 OpenSearch

  🛡️ Veracity
     Store             :  -
     Process           :  🧪 DataBrew
     Catalog / Quality :  🛡️ Glue DQ • 🕵️ Macie • 📜 CloudTrail • ⚙️ Config
     Deliver           :  🏗️ Lake Formation

  💎 Value
     Store             :  -
     Process           :  🤖 SageMaker • 🔮 Forecast • 🎯 Personalize • 🧠 Redshift ML
     Catalog / Quality :  -
     Deliver           :  📊 QuickSight • 💬 Q • 🧠 Bedrock

🛠️ The Universal AWS Analytics Pipeline

   INGEST  -->  STORE  -->  CATALOG  -->  PROCESS  -->  QUERY  -->  VISUALIZE  -->  DECIDE
     |          |           |             |             |            |               |
   Kinesis     S3          Glue          Glue ETL       Athena       QuickSight      Human
   Firehose    Redshift    Catalog       EMR            Redshift     SageMaker       action
   DMS         Glacier     Lake Form.    DataBrew       Spectrum     Forecast

🎯 The Decision Rules

📈 Next Steps

  1. Re-do the hands-on lab with a different dataset.
  2. Apply the 5 Vs self-assessment to a real project.
  3. Read the Amazon Athena and AWS Glue documentation.
  4. Move to Part 2 - Advanced Analytics on AWS.

AWS service icons used in this codelab are from the official AWS Architecture Icons deck, freely distributed by Amazon Web Services for use in architecture diagrams and educational materials.