
+============================================================+
| |
| AWS ANALYTICS FUNDAMENTALS - PART 2 |
| |
| Data Lakes - Data Warehouses - Modern Architecture |
| |
| Powered by HitaVir Tech |
+============================================================+
Welcome to Fundamentals of Analytics on AWS - Part 2 by HitaVir Tech!
In Part 1 you built the mental model: analytics concepts, the 5 Vs, and the AWS services that solve each V. In Part 2 you will zoom out and learn how those services combine into production architectures used by real companies today.
Part 1: The Ingredients | Part 2: The Recipe |
What is analytics / ML? | What is a data warehouse? |
The 5 Vs diagnostic | What is a data lake? |
One service per V | How they combine into a Lake House |
S3 -> Glue -> Athena mini lab | Reference architectures for 6 real use cases |
Pillar | Topics |
Data Lakes | What, why, zones, governance |
Data Warehouses | Columnar MPP, star schemas, Redshift |
Modern Data Architecture | Lake House: combining both worlds |
AWS Services | S3, Lake Formation, Redshift, Glue, Athena, Kinesis, QuickSight |
Reference Architectures | Batch BI, Streaming, ML, Log analytics, 360° customer, Data mesh |
Common Use Cases | When to pick which pattern |

+================================================================+
| |
| Services are Lego bricks. Architecture is the castle. |
| |
| Any junior can spin up S3 + Redshift. |
| Seniors know WHEN to use which, WHY, and HOW they join. |
| |
+================================================================+
2-3 hours (concept-heavy, no new hands-on required; uses the Part 1 lab as the anchor)
If you are... | Do this |
A student new to cloud | Read top-to-bottom; pause at each reference architecture |
A working engineer | Skim sections 1-3, deep-read the reference architectures for patterns you ship |
A solution architect | Use the reference diagrams as whiteboard starters with stakeholders |
A reference reader | Jump to the Quiz, the Cheat Sheet, and the Appendix of Resources |
HitaVir Tech says: "Service names change every few years: Kinesis Data Analytics became Managed Service for Apache Flink, and Amazon Elasticsearch Service became OpenSearch Service. But the shapes of data architectures stay stable for decades. Master the shapes. You will pick up the services in a week."
Required
Helpful

+--------------------------------------------------------------+
| 5 Vs framework AWS service toolkit |
| ------------------- ---------------------- |
| Volume S3, Glacier, Redshift, EMR |
| Variety Glue, Athena, Rekognition, ... |
| Velocity Kinesis, Firehose, Lambda, MSK |
| Veracity DataBrew, Glue DQ, Macie, ... |
| Value QuickSight, SageMaker, Forecast |
+--------------------------------------------------------------+
In Part 2 we compose these services into proven shapes.
Part 2 is concept-heavy. Every diagram is annotated with the services you already met in Part 1. The Part 1 hands-on lab is the practical anchor; this codelab teaches the architectures that scale it up.
Warning: if you choose to experiment with Redshift or Kinesis, they can exit the free tier quickly. Use serverless modes and destroy resources the same day.
+==============================================================+
| SECTION 1 - ARCHITECTURES (the big three) |
+==============================================================+

Three architecture patterns underpin most modern analytics in production:
+----------------------+
| 1. DATA LAKE |
| Store anything, |
| cheap and forever |
+----------+-----------+
|
| grew alongside
v
+----------------------+
| 2. DATA WAREHOUSE |
| Fast SQL on |
| curated tables |
+----------+-----------+
|
| combined into
v
+----------------------+
| 3. LAKE HOUSE |
| (Modern Data Arch.) |
| Best of both |
+----------------------+
Part 2 tours each architecture, shows the AWS services that implement it, then demonstrates how real companies blend all three for different use cases.
+==============================================================+
| ARCHITECTURE 1 - DATA LAKE |
| "Store first, schema later." |
+==============================================================+

A data lake is a centralized repository that stores any type of data (structured, semi-structured, unstructured) at any scale, in its native format, typically on cheap object storage like Amazon S3.
The defining move: you ingest now and decide the schema later (called schema-on-read). Contrast with warehouses, which demand schema-on-write.
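A minimal sketch of schema-on-read, assuming raw events are kept as untouched JSON lines and the reader supplies the schema. The sample records, `ORDER_SCHEMA`, and `read_with_schema` are hypothetical names for illustration, not part of any AWS API.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    '{"order_id": "1042", "amount": "19.99", "country": "IN"}',
    '{"order_id": "1043", "amount": "5.00"}',
    '{"order_id": "1044", "amount": "12.50", "country": "US", "coupon": "NEW10"}',
]

# The schema lives with the reader: column name -> (type caster, default value).
ORDER_SCHEMA = {
    "order_id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "UNKNOWN"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and apply the schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, (cast, default) in schema.items():
            value = record.get(column)
            row[column] = cast(value) if value is not None else default
        rows.append(row)
    return rows


orders = read_with_schema(RAW_LINES, ORDER_SCHEMA)
print(orders[1])
```

The second record has no `country` and the third carries an extra `coupon` field, yet both were ingested without complaint. A schema-on-write system would have rejected or reshaped them at load time.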
+------------------------------------------------------------+
| |
| ANY DATA ---> S3 (object store) ---> ANY ENGINE |
| |
| CSV, JSON, cheap, durable, Athena, |
| Parquet, logs, infinite scale, Redshift, |
| images, PDFs, one source of truth EMR, |
| Kafka events SageMaker |
| |
+------------------------------------------------------------+
One lake. Many engines. That is the core promise.
Before ~2010, analytics meant warehouses: expensive, schema-strict, row-limited. Then data exploded:
Problem With Warehouse-Only World | Who Felt It |
Warehouse storage cost $1000s / TB / month | Every CFO |
Could not store PDFs, images, videos | Healthcare, retail, media |
Schema changes took weeks | Fast-moving startups |
Historical data deleted to save cost | Regulated industries |
Data lakes fixed this by leveraging cheap object storage (S3 at ~$0.023 / GB / month) and decoupling compute from storage.
s3://hitavirtech-lake/
|
+-- raw/ <-- BRONZE: untouched, as ingested
| + source of truth
| + can replay anything from here
|
+-- curated/ <-- SILVER: cleaned, typed, Parquet
| + deduped, quality-checked
| + partitioned for fast scans
|
+-- analytics/ <-- GOLD: pre-aggregated, BI-ready
+ joins done once
+ powers dashboards and ML features
Zone | Shape | Readers |
Raw / Bronze | Original bytes: CSV, JSON, images, dumps | Data engineers only |
Curated / Silver | Cleaned, typed, often Parquet + partitions | Analysts, ML engineers |
Analytics / Gold | Aggregated, ready for dashboards and models | Business users, BI tools |
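The three zones above can be sketched as two small transforms. This is a toy, in-memory illustration under assumed sample data: a real pipeline would read from and write to S3 prefixes (typically with Glue or Spark), and `to_curated` / `to_analytics` are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw (bronze) extract: a duplicate row, untyped strings, one bad value.
RAW_CSV = """order_id,country,amount
1,IN,10.50
2,US,20.00
2,US,20.00
3,IN,not_a_number
4,US,5.25
"""


def to_curated(raw_text):
    """Bronze -> silver: parse, type, drop bad rows, dedupe on order_id."""
    seen, curated = set(), []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            typed = {
                "order_id": int(row["order_id"]),
                "country": row["country"],
                "amount": float(row["amount"]),
            }
        except ValueError:
            continue  # a real pipeline would quarantine the row instead of dropping it
        if typed["order_id"] not in seen:
            seen.add(typed["order_id"])
            curated.append(typed)
    return curated


def to_analytics(curated_rows):
    """Silver -> gold: pre-aggregate revenue per country."""
    totals = defaultdict(float)
    for row in curated_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


curated = to_curated(RAW_CSV)
gold = to_analytics(curated)
print(gold)
```

Because the raw text is never modified, the curated and analytics layers can be rebuilt from it at any time, which is the "can replay anything from here" property of the raw zone.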

Layer | Purpose | Service |
Storage | Raw bytes, infinite scale | Amazon S3 |
Governance | Permissions, row/column security | AWS Lake Formation |
Cataloging | Schema + partitions metadata | AWS Glue Data Catalog |
ETL | Move raw -> curated -> analytics | AWS Glue ETL, EMR |
Query | SQL on lake files | Amazon Athena |
ML | Train on lake data directly | Amazon SageMaker |

+--------------------------------------------------------------+
| AWS LAKE FORMATION - Governance for S3 Data Lakes |
+--------------------------------------------------------------+
| Permissions : Table / column / row-level grants |
| Discovery : Blueprint ingestion from DBs, logs |
| Auditing : Every access logged to CloudTrail |
| Catalog : Wraps the Glue Data Catalog with RBAC |
| |
| Turns a raw S3 bucket into a governed, multi-tenant lake. |
+--------------------------------------------------------------+
Lake Formation is what lets one S3 bucket serve 20 teams without everyone seeing everyone else's columns.
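To make column-level and row-level grants concrete, here is a conceptual sketch in plain Python. It only mimics the effect of such grants; it is not the Lake Formation API, and `GRANTS`, `CUSTOMERS`, and `read_table` are hypothetical names.

```python
# Hypothetical grants: which columns and which rows each team may read.
GRANTS = {
    "marketing": {
        "columns": {"customer_id", "country"},
        "row_filter": lambda row: True,
    },
    "finance_in": {
        "columns": {"customer_id", "country", "amount"},
        "row_filter": lambda row: row["country"] == "IN",
    },
}

CUSTOMERS = [
    {"customer_id": 1, "country": "IN", "amount": 120.0, "email": "a@example.com"},
    {"customer_id": 2, "country": "US", "amount": 80.0, "email": "b@example.com"},
]


def read_table(rows, team):
    """Return only the rows and columns that the team's grant allows."""
    grant = GRANTS[team]
    return [
        {col: value for col, value in row.items() if col in grant["columns"]}
        for row in rows
        if grant["row_filter"](row)
    ]


print(read_table(CUSTOMERS, "marketing"))
print(read_table(CUSTOMERS, "finance_in"))
```

Both teams query the same table, but neither ever sees `email`, and only the finance team for India sees `amount`, restricted to its own rows.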
Strength | Weakness |
Cheap per GB | Can become a "data swamp" without governance |
Any format | Query performance < a warehouse on the same data |
Separates storage and compute | Schema enforcement is optional (and often skipped) |
Multi-engine access (Athena, Spark, Redshift, ML) | Harder for business users to self-serve |
+--------------------------------------------------------------+
| |
| NO CATALOG -> "Which bucket has customers?" |
| NO QUALITY RULES -> "Why are 40% of amounts negative?" |
| NO GOVERNANCE -> "Who deleted last quarter's data?" |
| NO LIFECYCLE -> "We're paying for 2014 clickstream" |
| |
| ==> DATA SWAMP (useless, expensive) |
+--------------------------------------------------------------+
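The "NO LIFECYCLE" symptom above is usually addressed with an S3 lifecycle rule. The sketch below only builds the rule document in the shape the S3 lifecycle configuration API accepts, to the best of my understanding; it makes no AWS call. The prefix and day counts are assumptions to replace with your own retention policy.

```python
import json

# Assumed prefix and retention periods; adjust them to your own policy.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-raw-clickstream",
            "Filter": {"Prefix": "raw/clickstream/"},
            "Status": "Enabled",
            # Move objects to a colder storage class after 90 days...
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            # ...and delete them after roughly five years.
            "Expiration": {"Days": 1825},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3, a document like this would typically be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call; check the current documentation before applying it to a real bucket.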
Every successful data lake is paired with Glue Catalog + Lake Formation + quality rules + lifecycle policies. Skip these and your lake drowns.
HitaVir Tech says: "A data lake without a catalog is a data swamp. A data lake without quality rules is a liability. Governance is not optional: it is the difference between an asset and a landfill."
Data lake in one line: store everything cheaply, govern it strictly, query it from any engine.
+==============================================================+
| ARCHITECTURE 2 - DATA WAREHOUSE |
| "Fast SQL on curated, trusted data." |
+==============================================================+

A data warehouse is a centralized, highly-structured database optimized for analytical queries (aggregations, joins, and scans across billions of rows) at interactive speeds.
Key properties:
Property | What It Means |
Schema-on-write | Every row fits a predefined schema at load time |
Columnar storage | Stores columns together, not rows: 10-100x faster scans |
MPP (Massively Parallel Processing) | Splits work across many nodes automatically |
Optimized for reads | Writes are slower; reads are lightning-fast |
Business-user friendly | Clean star schemas; analysts can self-serve SQL |
ROW STORE (OLTP, e.g. MySQL)
----------------------------
[id | name | country | amount] <-- each row stored together
Great for: "Get everything about order 1042"
Bad for: "SUM(amount) across 1B rows"
COLUMNAR STORE (OLAP, e.g. Redshift, Parquet)
---------------------------------------------
[id][id][id]...
[name][name][name]...
[country][country][country]...
[amount][amount][amount]... <-- each column stored together
Great for: "SUM(amount) across 1B rows" (scan only one column)
Bad for: "Get everything about order 1042"
Warehouses use columnar. That one design choice is why they can aggregate billions of rows in seconds.
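The trade-off can be seen with the same three orders held in both layouts. This is a toy in-memory sketch with made-up data; real engines add compression, encoding, and disk I/O, which is where most of the speedup comes from.

```python
# The same three orders in two layouts.
row_store = [
    {"id": 1, "name": "Asha", "country": "IN", "amount": 10.0},
    {"id": 2, "name": "Ben", "country": "US", "amount": 20.0},
    {"id": 3, "name": "Chen", "country": "SG", "amount": 30.0},
]

column_store = {
    "id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chen"],
    "country": ["IN", "US", "SG"],
    "amount": [10.0, 20.0, 30.0],
}


def total_amount_rows(rows):
    """Row layout: every full row is visited just to read one field."""
    return sum(row["amount"] for row in rows)


def total_amount_columns(columns):
    """Columnar layout: only the 'amount' column is scanned."""
    return sum(columns["amount"])


def get_order_columns(columns, order_id):
    """Point lookup in a columnar layout: one read per column to rebuild the row."""
    i = columns["id"].index(order_id)
    return {name: values[i] for name, values in columns.items()}


print(total_amount_rows(row_store), total_amount_columns(column_store))
print(get_order_columns(column_store, 2))
```

The aggregate touches one list in the columnar layout, while the point lookup has to visit every column, which mirrors the "great for / bad for" notes in the diagram.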
Most warehouse tables follow the star schema:
                  +---------------------+
                  |    DIM_CUSTOMER     |
                  |    (who bought)     |
                  +----------+----------+
                             |
                             |
+---------------+    +-------+-------+    +---------------+
|  DIM_PRODUCT  |----|  FACT_SALES   |----|   DIM_DATE    |
|  (what sold)  |    |  (the event)  |    |  (when sold)  |
+---------------+    +-------+-------+    +---------------+
                             |
                             |
                  +----------+----------+
                  |      DIM_STORE      |
                  |    (where sold)     |
                  +---------------------+
Star schemas make queries fast AND readable: SELECT country, SUM(amount) FROM fact_sales JOIN dim_store ....
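The elided query can be completed as a runnable sketch. SQLite stands in for the warehouse here purely for convenience; the table names follow the diagram, while the column names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, store_id INTEGER, amount REAL);
INSERT INTO dim_store VALUES (1, 'IN'), (2, 'US');
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# One fact table joined to one dimension: the core shape of a star-schema query.
revenue_by_country = conn.execute("""
    SELECT d.country, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_id = f.store_id
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()

print(revenue_by_country)
```

Adding another dimension (date, product, customer) is one more `JOIN` on a key in the fact table, which is why analysts find the layout easy to query.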

+--------------------------------------------------------------+
| AMAZON REDSHIFT - AWS's Managed Data Warehouse |
+--------------------------------------------------------------+
| Category : Columnar MPP warehouse |
| Modes : Serverless | Provisioned (RA3) |
| Scale : GBs to petabytes |
| SQL dialect : PostgreSQL-flavored |
| Superpowers : Redshift Spectrum, Redshift ML, Data Share |
| |
| The engine behind Nasdaq, McDonald's, Yelp analytics. |
+--------------------------------------------------------------+
+----------------------------+
| Redshift Cluster |
| (hot, curated tables) |
+-------------+--------------+
|
| joins across
v
+----------------------------+
| S3 Data Lake via Spectrum |
| (cold, historical data) |
+----------------------------+
One SQL query spans both the warehouse (recent, hot) and the lake (years of history). No duplicate storage, no duplicate pipelines.
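As a loose analogy only, the idea of one query spanning two storage locations can be tried locally with SQLite's attached databases. This is not Redshift Spectrum and says nothing about its performance or syntax; the `main` database stands in for hot warehouse tables and the attached `lake` database for historical data in S3. Table names and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                  # stands in for the warehouse
conn.execute("ATTACH DATABASE ':memory:' AS lake")  # stands in for the S3 lake

conn.executescript("""
CREATE TABLE main.sales_recent (sale_year INTEGER, amount REAL);
CREATE TABLE lake.sales_history (sale_year INTEGER, amount REAL);
INSERT INTO main.sales_recent VALUES (2024, 100.0), (2024, 50.0);
INSERT INTO lake.sales_history VALUES (2019, 40.0), (2020, 60.0);
""")

# A single SQL statement reads from both locations.
yearly = conn.execute("""
    SELECT sale_year, SUM(amount) FROM (
        SELECT sale_year, amount FROM main.sales_recent
        UNION ALL
        SELECT sale_year, amount FROM lake.sales_history
    )
    GROUP BY sale_year
    ORDER BY sale_year
""").fetchall()

print(yearly)
```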
Source | Loader |
S3 files | COPY command |
Kinesis Data Streams | Redshift streaming ingestion |
Operational DBs (MySQL, Postgres) | AWS DMS + CDC |
SaaS apps (Salesforce, etc.) | AWS AppFlow or partner tools |
Strength | Weakness |
Sub-second SQL on billions of rows | Expensive per TB stored |
Business-analyst friendly | Rigid schema: changes need migrations |
Mature BI tool ecosystem | Only handles structured data |
ACID transactions and governance baked-in | Locked into one vendor's engine |
+--------------------+------------------------+------------------------+
| ATTRIBUTE | DATA LAKE | DATA WAREHOUSE |
+--------------------+------------------------+------------------------+
| Data type | Anything | Structured only |
| Schema | On read | On write |
| Cost / TB stored | $ (cheap) | $$$$ (expensive) |
| Query speed | Medium | Fast |
| Users | Engineers, data sci. | Analysts, business |
| AWS example | S3 + Glue + Athena | Redshift |
+--------------------+------------------------+------------------------+
HitaVir Tech says: "Warehouses are optimized for the answers you know you want. Lakes are optimized for the answers you haven't invented questions for yet. Real companies need both. The next section shows how to stop choosing and combine them."
Warehouse in one line: columnar + MPP + star schema = fast answers for business users.
+==============================================================+
| ARCHITECTURE 3 - MODERN DATA ARCHITECTURE |
| (aka the "Lake House" pattern) |
+==============================================================+

By 2018, most companies had both a lake and a warehouse, and suffered:
Pain | Symptom |
Two copies of the truth | Lake says one number, warehouse says another |
Pipeline sprawl | 200 jobs shuffling data between them |
Permission chaos | IAM for S3, Redshift grants, separate audits |
Skill silos | Data engineers in Spark, analysts in SQL, no common tool |
ML engineers stuck | Data scientists denied warehouse access, scraping lakes by hand |
+==============================================================+
| |
| ONE GOVERNED PLATFORM |
| |
| - Unified storage (S3 data lake = source of truth) |
| - Purpose-built engines (pick the right tool per job) |
| - Shared catalog + governance (Lake Formation) |
| - Seamless movement (Spectrum, Athena, zero-ETL) |
| - Common security model (IAM + Lake Formation + KMS) |
| |
+==============================================================+
Instead of lake or warehouse, you get lake and warehouse, unified by one catalog, one permission model, one lineage.
+---------+  +---------+  +---------+  +---------+  +---------+
|    1    |  |    2    |  |    3    |  |    4    |  |    5    |
|SCALABLE |  |PURPOSE- |  |SEAMLESS |  | UNIFIED |  | FUTURE- |
|  DATA   |  |  BUILT  |  |  DATA   |  | GOVERN- |  |  PROOF  |
|  LAKE   |  | ENGINES |  |MOVEMENT |  |  ANCE   |  |   ML    |
+----+----+  +----+----+  +----+----+  +----+----+  +----+----+
     |            |            |            |            |
     v            v            v            v            v
   S3 +        Athena,     Spectrum,    Lake Form.   SageMaker,
   Lake        Redshift,   Zero-ETL,    + IAM +      Bedrock,
   Form.       EMR, OS,    Federated    KMS + Q      Glue ML
               DynamoDB    query

Every modern architecture starts from S3. Why S3?
Reason | Impact |
Infinite scale | Never outgrow it |
11 nines durability | Your data is safer than on any disk |
Pennies per GB | Keep history forever |
Native reader for Athena, Redshift, EMR, Glue, SageMaker | One source, many consumers |
One-size-fits-all is dead. Pick the right engine per workload:
Workload | Engine | Why |
Ad-hoc SQL on lake files | Athena | Pay per query, zero setup |
Dashboards on curated tables | Redshift | Sub-second BI |
Petabyte Spark jobs | EMR | Custom transforms at scale |
Single-digit-ms lookups | DynamoDB | Key-value queries |
Full-text + log search | OpenSearch | Observability and logs |
Real-time aggregation | Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) | Streaming SQL / Flink |
Instead of 200 brittle ETL jobs, modern architectures rely on Redshift Spectrum, zero-ETL integrations, and federated queries to move data between engines.

Layer | Service | Purpose |
Identity | IAM | Who you are |
Fine-grained access | Lake Formation | What columns / rows you see |
Encryption | KMS | Keys for encryption at rest (TLS covers data in transit) |
PII scanning | Macie | Find sensitive data in S3 |
Audit | CloudTrail | Audit trail of API activity |
Data discovery | DataZone | Business-friendly data catalog |

ML is no longer a bolt-on; it lives inside the platform:
Capability | Service |
Full ML lifecycle | SageMaker |
Foundation models (Claude, Llama, Titan) | Bedrock |
SQL-native ML in the warehouse | Redshift ML |
No-code ML in the lake | SageMaker Canvas |
Natural-language BI | Amazon Q in QuickSight |
+==================================================================+
| |
| +---------+ +---------+ +---------+ +---------+ |
| | Batch | | Stream | | OpTx DB | | SaaS | |
| | files | | events | | (CDC) | | apps | |
| +----+----+ +----+----+ +----+----+ +----+----+ |
| | | | | |
| +-------------+-------------+-------------+ |
| | |
| v |
| +-----------------------------------------------------+ |
| | AMAZON S3 --- Centralized Data Lake | |
| | Raw -> Curated -> Analytics (Parquet) | |
| +---------------------------+-------------------------+ |
| | |
| governed by | |
| v |
| +-----------------------------------------------------+ |
| | LAKE FORMATION + GLUE CATALOG + IAM + KMS + MACIE | |
| +-----------------------------------------------------+ |
| | |
| +------------+------------+------------+-------------+ |
| | | | | | |
| v v v v v |
| +-------+ +---------+ +-----+ +-----------+ +------+ |
| | Athena| |Redshift | | EMR | | OpenSearch| |Sage- | |
| | SQL | | MPP | |Spark| | Search | |Maker | |
| +---+---+ +----+----+ +--+--+ +-----+-----+ +--+---+ |
| | | | | | |
| +------------+-----------+-------------+-----------+ |
| | |
| v |
| +----------------------------------------+ |
| | VALUE = QuickSight + Amazon Q + apps| |
| +----------------------------------------+ |
| |
+==================================================================+
Look carefully: every service from Part 1 has a home. That is modern data architecture.
Centralize storage in a governed S3 lake. Use the best engine for each workload. Let data move frictionlessly between them. Secure it all uniformly. Build ML natively on top.
HitaVir Tech says: "Don't build one monolith. Don't build 20 silos. Build one lake, with many engines, one catalog, one security model. That's how AWS's biggest analytics customers run."
Lake House in one line: one lake for storage, many engines for compute, one catalog for trust.
+==============================================================+
| THE COMPLETE AWS SERVICE MAP |
| for Modern Data Architecture |
+==============================================================+

Each service answers a specific question in the modern architecture:
Layer | Question | Service |
Storage | Where does my data live? | Amazon S3 |
Governance | Who can see what? | AWS Lake Formation |
Catalog | What data do we have? | AWS Glue Data Catalog |
ETL | How do I shape it? | AWS Glue ETL, EMR |
SQL on lake | How do I explore? | Amazon Athena |
SQL on warehouse | How do I serve BI? | Amazon Redshift |
Stream ingest | How do I handle real time? | Kinesis + Firehose + MSK |
Search | How do I query logs? | Amazon OpenSearch |
BI | How do people see it? | Amazon QuickSight |
ML | How do we predict? | Amazon SageMaker |
Discovery | How do business users find data? | Amazon DataZone |

+--------------------------------------------------------------+
| AMAZON DATAZONE - Business-Friendly Data Catalog |
+--------------------------------------------------------------+
| Audience : Business analysts, product managers |
| Features : Search by business term, request access |
| Integrates : Glue Catalog, Redshift, S3, third-party |
| ML assist : Auto-generate descriptions for tables |
| |
| Bridges the "I need data" gap between teams and engineers. |
+--------------------------------------------------------------+
Classic ETL means writing code to extract, transform, and load data between systems. With zero-ETL, AWS manages the replication between services for you:
+----------------+                +-----------------------+
| Aurora MySQL   | ==zero-ETL==>  | Redshift (analytics)  |
| (app DB)       |    managed     |                       |
+----------------+                +-----------------------+
Supported today:
- Aurora -> Redshift
- RDS -> Redshift
- DynamoDB -> OpenSearch
- S3 -> Redshift (auto-copy)
Fewer pipelines to maintain. Fresher analytics. Less on-call pain.
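What a managed integration saves you from writing is, in essence, a change-replay loop like the one below. This is a conceptual sketch only: it does not describe how AWS implements zero-ETL, and the event format and `apply_changes` helper are hypothetical.

```python
# Hypothetical change events captured from the application database, in commit order.
CHANGES = [
    {"op": "insert", "key": 1, "row": {"status": "placed", "amount": 30.0}},
    {"op": "insert", "key": 2, "row": {"status": "placed", "amount": 12.0}},
    {"op": "update", "key": 1, "row": {"status": "shipped", "amount": 30.0}},
    {"op": "delete", "key": 2, "row": None},
]


def apply_changes(replica, changes):
    """Replay inserts, updates, and deletes onto an analytics-side copy."""
    for change in changes:
        if change["op"] == "delete":
            replica.pop(change["key"], None)
        else:  # inserts and updates are both upserts on the replica
            replica[change["key"]] = dict(change["row"])
    return replica


analytics_copy = apply_changes({}, CHANGES)
print(analytics_copy)
```

A hand-built version also has to handle ordering, retries, schema changes, and backfills, which is where most of the on-call pain in classic pipelines comes from.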

In a modern architecture, Glue is everywhere:
Capability | Role |
Crawlers | Auto-discover schemas in S3 |
Data Catalog | The shared metadata layer |
ETL jobs (Spark / Python) | Transform raw -> curated |
DataBrew | No-code cleaning for analysts |
Data Quality | Rules enforced on every run |
Workflows | Orchestrate multi-job pipelines |
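"Rules enforced on every run" can be pictured as a gate that each batch must pass before it is promoted. The sketch below is plain Python, not Glue Data Quality's rule language; `check_non_negative` and the sample batch are hypothetical.

```python
def check_non_negative(rows, column, max_bad_ratio=0.0):
    """Pass only if the share of negative values in `column` is within the limit."""
    if not rows:
        return {"passed": True, "bad_ratio": 0.0}
    bad = sum(1 for row in rows if row[column] < 0)
    ratio = bad / len(rows)
    return {"passed": ratio <= max_bad_ratio, "bad_ratio": ratio}


# Hypothetical batch: 2 of 5 amounts are negative.
batch = [{"amount": a} for a in (10.0, -5.0, 3.0, -1.0, 8.0)]
result = check_non_negative(batch, "amount")
print(result)

# A pipeline step would stop here instead of writing bad data to curated/.
if not result["passed"]:
    print("Batch rejected: quarantine it and alert the owning team.")
```

Running the check inside the job, rather than in a later audit, is what keeps "Why are 40% of amounts negative?" from ever reaching a dashboard.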
When someone describes a problem, scan this table first:
Problem Sounds Like... | Reach For |
"We have terabytes piling up and need cheap storage" | S3 + Glacier |
"Analysts want SQL on 1B rows, must return in seconds" | Redshift |
"We dump random files hourly, want ad-hoc SQL" | Athena + Glue |
"Events come at 1M / sec and drive a live dashboard" | Kinesis + Kinesis Analytics |
"Logs need to be searchable with keyword filters" | OpenSearch |
"We keep 2 copies of the same data in lake and warehouse" | Redshift Spectrum / Zero-ETL |
"We need to share a slice with another AWS account" | Redshift Data Sharing / Lake Formation grants |
"Non-engineers can't find any data" | DataZone |
"We want the CEO to ask questions in English" | Amazon Q in QuickSight |
HitaVir Tech says: "When a junior asks 'which AWS service should we use?', the senior reply is another question: 'what is the actual pattern?' Service choice without pattern = tech for tech's sake."
+==============================================================+
| SECTION 2 - COMMON USE CASES |
| (where the patterns show up in real life) |
+==============================================================+
Most real-world analytics work on AWS falls into six repeatable patterns. Recognize them and you'll know which reference architecture to reach for.

+-----------------------------------------------------------+
| 1. BATCH BI - nightly dashboards |
| 2. REAL-TIME ANALYTICS- live metrics, fraud, IoT |
| 3. LOG / APP OBS. - search + troubleshoot logs |
| 4. CUSTOMER 360 - unify profiles across sources |
| 5. ML / PREDICTIVE - forecast, recommend, score |
| 6. DATA MESH - domain-owned, shared data |
+-----------------------------------------------------------+

Who needs it: Every company with a CFO.
Shape: Operational databases + flat files -> data lake -> warehouse -> dashboards.
Trait | Value |
Freshness | Daily or hourly |
Volume | GB to TB |
Velocity V | Batch |
Core service | Redshift |
Example prompt: "Revenue by region compared to last quarter, refreshed every morning at 8 am."

Who needs it: Rideshare, fintech, ad-tech, IoT, online gaming.
Shape: Events -> Kinesis -> stream processor -> live dashboard and S3 for history.
Trait | Value |
Freshness | Sub-second to seconds |
Volume | Millions of events / sec |
Velocity V | Streaming |
Core service | Kinesis + Kinesis Analytics |
Example prompt: "Alert the risk team the moment any card transaction looks fraudulent."
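The "stream processor" step in this shape often boils down to a windowed rule evaluated per event. Here is a minimal sketch of a sliding-window velocity check, assuming events arrive in timestamp order; the rule, thresholds, and `VelocityRule` class are hypothetical, and a production system would run equivalent logic in a streaming engine rather than in one Python process.

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flag a card that makes more than `limit` transactions within `window_s` seconds."""

    def __init__(self, limit=3, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.recent = defaultdict(deque)  # card_id -> timestamps inside the window

    def observe(self, card_id, ts):
        window = self.recent[card_id]
        window.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= ts - self.window_s:
            window.popleft()
        return len(window) > self.limit


rule = VelocityRule(limit=3, window_s=60)
events = [
    ("card-1", 0), ("card-1", 10), ("card-2", 15),
    ("card-1", 20), ("card-1", 25), ("card-1", 200),
]
alerts = [(card, ts) for card, ts in events if rule.observe(card, ts)]
print(alerts)
```

State is kept per card and bounded by the window, so memory grows with the number of active cards rather than with the length of the stream.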

Who needs it: Every engineering team at scale.
Shape: Application logs -> Firehose -> OpenSearch + S3 archive.
Trait | Value |
Freshness | Seconds |
Volume | TB/day in logs |
Velocity V | Streaming |
Core service | OpenSearch |
Example prompt: "Search the last 30 days of production logs for any mention of this request ID."

Who needs it: Retail, banking, telecom, SaaS.
Shape: Unify profiles from CRM, web, mobile, support into one view, served to marketing + ML.
Trait | Value |
Freshness | Hourly |
Volume | TB |
Dominant V | Variety |
Core service | S3 + Glue + Redshift |
Example prompt: "Show one customer's full lifetime journey: ads seen, orders placed, tickets filed."

Who needs it: Forecasting, recommendations, fraud, churn, dynamic pricing.
Shape: Lake -> feature store -> model training -> model endpoint -> prediction served in app or BI.
Trait | Value |
Freshness | Training weekly, inference real-time |
Volume | GB to PB of history |
Dominant V | Value |
Core service | SageMaker + S3 |
Example prompt: "Predict which customers will churn next month so we can retain them."

Who needs it: Enterprises with many product teams owning their own data.
Shape: Each domain team curates its own data products on S3; a central catalog (Lake Formation + DataZone) makes them discoverable and access-controlled.
Trait | Value |
Freshness | Per-domain |
Ownership | Distributed to domain teams |
Dominant V | Veracity + Value |
Core service | Lake Formation + DataZone |
Example prompt: "The Finance team owns 'invoices', Marketing owns 'campaigns', but anyone at the company can discover and request access."
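The six use cases above can be captured as a small lookup, which is a handy way to force the "diagnose first" habit in a design checklist. The keys paraphrase each pattern's dominant question and are my own labels; the core services are copied from the trait tables in this section.

```python
# Keys paraphrase the dominant question; values are the six patterns from this section.
PATTERNS = {
    "scheduled_kpis": "Batch BI",
    "instant_alerts": "Real-time analytics",
    "find_a_log_line": "Log / app observability",
    "one_customer_view": "Customer 360",
    "predict_the_future": "ML / predictive",
    "federated_ownership": "Data mesh",
}

CORE_SERVICES = {
    "Batch BI": ["Redshift"],
    "Real-time analytics": ["Kinesis", "Kinesis Analytics"],
    "Log / app observability": ["OpenSearch"],
    "Customer 360": ["S3", "Glue", "Redshift"],
    "ML / predictive": ["SageMaker", "S3"],
    "Data mesh": ["Lake Formation", "DataZone"],
}


def pick_pattern(dominant_question):
    """Map a diagnosed need to (pattern, core services); fail loudly if undiagnosed."""
    try:
        pattern = PATTERNS[dominant_question]
    except KeyError:
        raise ValueError(f"Diagnose first: unknown question {dominant_question!r}") from None
    return pattern, CORE_SERVICES[pattern]


print(pick_pattern("instant_alerts"))
```

The function deliberately refuses to guess: an unrecognized need raises an error instead of defaulting to a familiar pattern.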
                 What is the dominant question?
                              |
   +---------+---------+------+---+----------+-----------+
   |         |         |          |          |           |
 Weekly   Instant   Find a      One       Predict   Federated
 KPIs     alerts    log line    360 view  future    ownership
   |         |         |          |          |           |
   v         v         v          v          v           v
 BATCH     REAL-      LOG      CUSTOMER    ML /        DATA
 BI        TIME       OBS.     360         PRED.       MESH
HitaVir Tech says: "New engineers try to force every problem into the pattern they already know. Seniors look at the dominant V and pick the shape, then fill in services. Diagnose first. Build second."
+==============================================================+
| SECTION 3 - REFERENCE ARCHITECTURES |
+==============================================================+
For each use case, here is a whiteboard-ready AWS architecture you can copy, adapt, and defend in a design review.

Operational DBs (RDS, Aurora) SaaS apps (Salesforce)
| |
+---- AWS DMS (CDC) ----+ AppFlow
|
v
+--------------------------------------+
| Amazon S3 Data Lake |
| raw -> curated (Parquet) |
+---------------+----------------------+
|
v (Glue ETL, Data Quality, Catalog)
|
+---------------+----------------------+
| Amazon Redshift (curated tables) |
| - Star schemas |
| - Nightly loads via COPY |
+---------------+----------------------+
|
v
Amazon QuickSight
(SPICE dashboards)
|
v
Executives
When to pick it: Finance, ops, exec reporting. Stable queries, predictable loads.

Producers (apps, IoT, clickstream)
|
v
+-------------------------+
| Kinesis Data Streams |
| (durable, ordered) |
+------------+------------+
|
+-----------+------------+------------+
| | |
v v v
Kinesis Analytics Lambda Firehose
(continuous SQL) (alerting on |
| anomalies) v
| | S3 lake
v v (Parquet, hist.)
Live dashboard SNS/SQS |
(QuickSight or | v
OpenSearch dash.) Ops team Athena
phone ad-hoc
When to pick it: Fraud detection, live pricing, real-time personalization, IoT.

App servers / containers / Lambda / CloudTrail
|
v
Kinesis Firehose
|
+----------+-----------+
| |
v v
OpenSearch S3 (cold)
(hot, 30 days) (cheap, years)
| |
v v
Kibana / OS Athena for historical
Dashboards search and audits
When to pick it: SRE and platform teams, security logs, microservice observability.

CRM Web events Mobile app Support tickets
| | | |
+------------+------+-------+----------------+
|
v
S3 Data Lake (raw)
|
v Glue ETL + Data Quality
|
S3 Data Lake (curated)
|
v
+-----------------+------------------+
| |
v v
Redshift SageMaker
Unified Customer Features + Models
table (serving BI) (churn, LTV, NBA)
| |
v v
QuickSight 360 Marketing automation
dashboard (personalized offers)
When to pick it: Retail, banking, telecom, subscription SaaS.
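The unification step at the heart of this architecture is a fold of several sources into one profile per customer. A toy sketch, assuming every source already shares a clean `customer_id` (in practice, identity resolution is the hard part); the sample data and `build_customer_360` are hypothetical.

```python
from collections import defaultdict

# Hypothetical extracts from three sources, already keyed by a shared customer_id.
CRM = [{"customer_id": "c1", "name": "Asha"}, {"customer_id": "c2", "name": "Ben"}]
ORDERS = [{"customer_id": "c1", "amount": 40.0}, {"customer_id": "c1", "amount": 10.0}]
TICKETS = [{"customer_id": "c2", "subject": "Refund"}]


def build_customer_360(crm, orders, tickets):
    """Fold every source into one profile per customer."""
    profiles = defaultdict(lambda: {"name": None, "lifetime_value": 0.0, "tickets": []})
    for row in crm:
        profiles[row["customer_id"]]["name"] = row["name"]
    for row in orders:
        profiles[row["customer_id"]]["lifetime_value"] += row["amount"]
    for row in tickets:
        profiles[row["customer_id"]]["tickets"].append(row["subject"])
    return dict(profiles)


profiles = build_customer_360(CRM, ORDERS, TICKETS)
print(profiles["c1"])
```

The resulting profile table is the kind of output that would land in the curated zone and feed both the Redshift table and the SageMaker features shown in the diagram.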

+----------------------+
| S3 Data Lake |
| (historical data) |
+----------+-----------+
|
v
+----------------------+
| Glue / EMR |
| Feature engineering |
+----------+-----------+
|
v
+-------------------------+
| SageMaker Feature Store|
+-----+-----------+-------+
| |
(training) (serving)
v v
SageMaker Real-time
Training endpoint
Jobs (low-latency)
| |
v v
Models Mobile / web app
(personalized UX)
When to pick it: Recommenders, fraud detection, demand forecasting, dynamic pricing.

Domain A (Orders) Domain B (Marketing) Domain C (Finance)
owns its own pipes owns its own pipes owns its own pipes
| | |
v v v
S3 bucket + Glue S3 bucket + Glue S3 bucket + Glue
Catalog (Orders) Catalog (Marketing) Catalog (Finance)
| | |
+-----------+-----------+-------------------------+
|
v
+-----------------------------------+
| Lake Formation (central RBAC) |
| + Amazon DataZone (discovery) |
+---------------+-------------------+
|
+------------+------------+
| | |
v v v
Analyst Data sci. Executive
(Athena) (SageMaker) (QS / Q)
When to pick it: Large enterprises where domain teams must own their data products, but a central platform team guarantees governance.
# | Pattern | Primary V | Storage | Compute | Serve |
1 | Batch BI | Volume | S3 + Redshift | Glue, Redshift | QuickSight |
2 | Real-Time | Velocity | Kinesis + S3 | Kinesis Analytics, Lambda | OpenSearch, QuickSight |
3 | Log Obs. | Velocity + Variety | OpenSearch + S3 | Firehose | Kibana, Athena |
4 | 360 | Variety + Value | S3 + Redshift | Glue | QuickSight + apps |
5 | ML | Value | S3 | Glue, SageMaker | Endpoint in app |
6 | Mesh | Veracity + Value | Distributed S3 | Per-domain | Lake Formation + DataZone |
HitaVir Tech says: "Architects don't memorize 50 services; they memorize 6 shapes. When someone brings a new problem, they map it to a shape first, then pick services to fit. Copy these six patterns. Most of your career, you'll be adapting one of them."
+==============================================================+
| QUIZ - TEST YOUR UNDERSTANDING |
+==============================================================+

Answer each question before revealing. No peeking: this is how you build real recall.
Which of the following best describes a data lake?
Answer: B. A lake holds any data in native format; multiple engines (Athena, Redshift, EMR, SageMaker) read from it.
Which property is specific to data warehouses, not data lakes?
Answer: C. Columnar + MPP is the warehouse signature, enabling fast aggregation on billions of rows.
Match each zone to its typical reader:
Zone | Readers |
Raw (Bronze) | ? |
Curated (Silver) | ? |
Analytics (Gold) | ? |
Answer: Raw = data engineers only. Curated = analysts and ML engineers. Analytics = business users and BI tools.
In an AWS Modern Data Architecture, which service is the "source of truth" storage layer?
Answer: B. S3. Every other analytics engine on AWS reads from it.
What does Redshift Spectrum enable?
Answer: C. Spectrum lets you query lake data from the warehouse, with no duplicate storage.
Which service provides fine-grained (column/row) permissions on the S3 data lake?
Answer: B. Lake Formation. IAM provides coarse identity; Macie finds PII; CloudTrail audits; Lake Formation enforces column and row-level access.
A fraud team needs to block bad transactions within 200 ms. Which pattern fits?
Answer: B. Real-time analytics with streaming + serverless alerting.
Zero-ETL between Aurora and Redshift eliminates the need to:
Answer: B. Zero-ETL replicates changes automatically, so there is no pipeline code to maintain.
Which service helps non-engineers discover datasets using business terms instead of table names?
Answer: C. DataZone provides a business-friendly catalog on top of the Glue Catalog.
A company stores 5 years of clickstream JSON in S3 but has no Glue Catalog, no Lake Formation, no quality rules. What is this?
Answer: B. A data swamp: no catalog, no governance, no trust. Storage alone is not an architecture.
Score | Meaning |
9-10 | You are ready for production design reviews |
7-8 | Solid. Re-read the reference architectures section |
5-6 | Review the three architecture chapters and retake |
< 5 | Re-do Part 1 first: the 5 Vs are the foundation |
HitaVir Tech says: "Don't guess. Questions like these commonly show up in AWS analytics interviews. Know them cold."
+==============================================================+
| CONGRATULATIONS - PART 2 DONE! |
+==============================================================+

๐ชฃ Data Lakes
Topic | Icon |
What a lake is (schema-on-read) | ๐ |
The medallion zones (raw, curated, analytics) | ๐ฅ๐ฅ๐ฅ |
Why lakes become swamps without governance | ๐ |
๐๏ธ Data Warehouses
Topic | Icon |
Schema-on-write, columnar, MPP | ๐ |
Star schemas (fact + dimensions) | โญ |
Redshift serverless vs provisioned | ๐๏ธ |
Redshift Spectrum (querying the lake) | ๐ญ |
๐ก Modern Data Architecture (Lake House)
Pillar |
Scalable data lake |
Purpose-built engines |
Seamless movement (Spectrum, zero-ETL) |
Unified governance |
Built-in AI / ML |
Six Reference Architectures
# | Pattern | Core Service |
1 | Batch BI | Redshift + QuickSight |
2 | Real-Time | Kinesis + Kinesis Data Analytics |
3 | Log Observability | OpenSearch |
4 | Customer 360 | S3 + Glue + Redshift |
5 | ML / Predictive | SageMaker + S3 |
6 | Data Mesh | Lake Formation + DataZone |
Part 1 Gave You... | Part 2 Gave You... |
A diagnostic framework (5 Vs) | Reference architectures |
One service per V | Combinations that scale |
A basic S3 → Glue → Athena lab | The full Lake House picture |
What to pick for each bottleneck | Which shape fits each real-world problem |
raw/ → curated/ as Parquet.
+==============================================================+
| |
| Part 1 = the 5 Vs + one service per V |
| Part 2 = three architectures + six reference shapes |
| |
| Together = you can design and defend |
| an analytics platform on AWS. |
| |
+==============================================================+
HitaVir Tech says: "You just completed the same journey a new hire at a top cloud team makes in their first three months. Keep the cheat sheet, apply the patterns, and your architecture reviews will sound like a 5-year veteran's. Fundamentals compound."
Welcome to the Lake House. See you in Part 3: hands-on Glue ETL, Redshift loading, and a capstone project.
- HitaVir Tech
+==============================================================+
| RESOURCES AND NEXT STEPS |
+==============================================================+

Topic | Where to Read |
Amazon S3 | docs.aws.amazon.com/s3 |
AWS Lake Formation | docs.aws.amazon.com/lake-formation |
AWS Glue | docs.aws.amazon.com/glue |
Amazon Athena | docs.aws.amazon.com/athena |
Amazon Redshift | docs.aws.amazon.com/redshift |
Amazon Kinesis | docs.aws.amazon.com/kinesis |
Amazon OpenSearch | docs.aws.amazon.com/opensearch-service |
Amazon QuickSight | docs.aws.amazon.com/quicksight |
Amazon SageMaker | docs.aws.amazon.com/sagemaker |
Amazon DataZone | docs.aws.amazon.com/datazone |
Whitepaper | Why Read It |
AWS Well-Architected Framework: Analytics Lens | The canonical design checklist |
Modern Data Architecture on AWS | The lake house, explained by AWS itself |
Data Analytics Lens: Reference Architectures | AWS-blessed reference diagrams |
Building a Data Lake on AWS | Zone design, governance patterns |
Streaming Data Solutions | Choosing between Kinesis, MSK, Firehose |
Resource | Provider |
AWS Skill Builder: Analytics Learning Plan | aws.amazon.com/training |
AWS Solutions Architect, Associate path | AWS Training |
AWS Data Analytics Specialty certification (retired in 2024; see AWS Certified Data Engineer - Associate) | AWS Certification |
Book | Why |
Designing Data-Intensive Applications, by Martin Kleppmann | The single best data engineering book ever written |
The Data Warehouse Toolkit, by Ralph Kimball and Margy Ross | The classic on star schemas and dimensional modeling |
Fundamentals of Data Engineering, by Joe Reis and Matt Housley | Modern, cloud-era data engineering |
+==============================================================+
| AWS ANALYTICS - PART 2 CHEAT SHEET |
| (screenshot and keep) |
+==============================================================+

Data Lake
Store ANY data ---> S3 (raw | curated | analytics)
Schema-on-READ ---> decided at query time
Read by ---> Athena, Redshift, EMR, SageMaker
Governed by ---> Lake Formation + Glue Catalog
Data Warehouse
Columnar + MPP ---> fast aggregation on billions of rows
Schema-on-WRITE ---> decided before load
Star schema ---> fact + dimensions
AWS service ---> Amazon Redshift (serverless or RA3)
Lake House
One lake + many engines + one catalog + one security model + native ML
Pillar | Services |
Scalable lake | S3 |
Purpose-built engines | Athena, Redshift, EMR, OpenSearch, DynamoDB |
Seamless movement | Spectrum, Zero-ETL, Federated query |
Unified governance | Lake Formation, IAM, KMS, Macie, CloudTrail |
Built-in ML | SageMaker, Redshift ML, Bedrock, Amazon Q |
# | Pattern | When | Core |
1 | Batch BI | Daily dashboards | Redshift + QuickSight |
2 | Real-Time | Fraud, IoT, live | Kinesis + Kinesis Data Analytics |
3 | Log Observability | Search production logs | OpenSearch |
4 | Customer 360 | Unify customer view | S3 + Glue + Redshift |
5 | ML | Predict & personalize | SageMaker |
6 | Data Mesh | Enterprise domain ownership | Lake Formation + DataZone |
tickit dataset.
AWS service icons used in this codelab are from the official AWS Architecture Icons deck, freely distributed by Amazon Web Services for use in architecture diagrams and educational materials.