"It's not the plane, it's the pilot." — The tools do not make the engineer. YOU do.
Welcome to HitaVir Tech Code: Learn Like a Top Performer. This is the most important codelab in your entire program. Every other lab teaches you a tool. This one teaches you how to learn any tool.
In the real world, nobody hands you a step-by-step guide. You get a Slack message at 2 AM:
@on-call Pipeline down. EU customer data stale since midnight.
Revenue dashboards blank. Fix ASAP.
The engineer who fixes this is not the one who watched the most tutorials. It is the one who practiced under pressure, debugged methodically, and built muscle memory through execution.
What Companies Want | What Most Learners Do | The Gap |
Build from scratch | Follow tutorials | Cannot create without a guide |
Debug under pressure | Panic at errors | No systematic debugging process |
Explain decisions | Memorize commands | Cannot answer "why" questions |
Ship production code | Write lab exercises | No production mindset |
This codelab closes that gap. Every principle here is battle-tested by engineers at top companies.
This is not a motivational talk. This is an operational playbook with hands-on exercises, templates you will build, and habits you will practice starting today.
These 11 guidelines are your flight manual. Read them carefully. Come back to them when you are stuck. Internalize them until they become instinct.
Do not read passively. Execute every command yourself. Your fingers on the keyboard are worth more than hours of passive reading.
"Don't think. Just do." — Open the terminal right now. The learning starts when you start typing.
The Learning Pyramid (commonly cited, approximate retention rates):
Lecture → 5%
Reading → 10%
Video → 20%
Watching a demo → 30%
Discussion → 50%
Practice by Doing → 75%
Teaching Others → 90%
HitaVir Tech codelabs operate in the 75-90% zone. Every step has a command to run, an output to verify, and a concept to explain.
Rules:
First run it. Then understand why it works. Then explain it to someone else.
"You don't have time to think up there. If you think, you're dead." — In interviews and production, hesitation kills. Build muscle memory through repetition.
EXECUTE → Run the command, write the code
|
UNDERSTAND → Read the output, trace the logic, check what changed
|
EXPLAIN → Say out loud or write down WHY it worked
Phase | Question to Ask |
Execute | What do I expect this to produce? |
Understand | Was the output what I expected? If not, why? |
Explain | Can I explain this to a teammate without looking at the guide? |
If you can execute but cannot explain, you are not ready for an interview.
Errors are missions, not failures. Every stack trace is a clue.
"Maybe so, sir. But not today." — The bug thinks it can beat you. Prove it wrong.
The Debugging Protocol:
1. STOP → Do not change anything yet
2. READ → Read the ENTIRE error message
3. CLASSIFY → Syntax? Connection? Permission? Data? Resource?
4. ISOLATE → What specific line or component failed?
5. HYPOTHESIZE → What is the most likely cause?
6. TEST → Change ONE thing and test again
7. DOCUMENT → Write down what fixed it
Error Classification Quick Reference:
Type | Signs | First Check |
Syntax | | Check the exact line character by character |
Connection | | Check endpoint, port, network |
Permission | | Check IAM roles, credentials |
Data | | Check input for nulls, type mismatches |
Resource | | Scale up or optimize the query |
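The CLASSIFY step can be sketched in a few lines of Python. The keyword lists below are illustrative assumptions, not the codelab's official signs; treat them as a starting point and extend them with the messages you actually see.

```python
# Hypothetical keyword lists: assumptions for illustration only.
ERROR_SIGNS = {
    "Syntax": ("syntaxerror", "unexpected token", "parse error"),
    "Connection": ("connection refused", "timed out", "could not resolve"),
    "Permission": ("permission denied", "accessdenied", "forbidden"),
    "Data": ("null", "cannot cast", "schema mismatch"),
    "Resource": ("out of memory", "outofmemory", "no space left"),
}


def classify_error(message: str) -> str:
    """Return the first error type whose keywords appear in the message."""
    lowered = message.lower()
    for error_type, signs in ERROR_SIGNS.items():
        if any(sign in lowered for sign in signs):
            return error_type
    return "Unknown"


print(classify_error("touch: cannot touch '/root/x': Permission denied"))
```

A keyword match is only a first guess. The First Check column still tells you where to look before you change anything.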
After every step, verify your output matches the expected result before moving on.
"Stay on target." — Precision matters. Validate every checkpoint like your deployment depends on it — because one day, it will.
After CREATING a resource → Verify it exists
After LOADING data → Check row counts and sample records
After RUNNING a pipeline → Check job status AND output quality
After CONFIGURING → Test the connection immediately
A missed validation in Step 3 becomes an impossible-to-debug failure in Step 15. Never skip a checkpoint.
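As a minimal sketch of the "after LOADING data" checkpoint, the function below checks a row count and scans for missing values. The function name, the field names, and the sample rows are hypothetical and exist only for illustration.

```python
def validate_load(rows, expected_count, required_fields):
    """Return a list of problems found; an empty list means the checkpoint passed."""
    problems = []
    if len(rows) != expected_count:
        problems.append(f"row count {len(rows)} != expected {expected_count}")
    for index, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) is None:
                problems.append(f"row {index}: missing value for '{field}'")
    return problems


# Hypothetical sample data: the second record has no region.
sample = [
    {"order_id": 1, "region": "EU"},
    {"order_id": 2, "region": None},
]
print(validate_load(sample, expected_count=2, required_fields=["order_id", "region"]))
```

Returning a list of problems instead of raising on the first one lets you see every failed check in a single run.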
After completing the lab, close this document and rebuild the entire project from memory.
"Push beyond your limits." — If you cannot rebuild it blind, you have not learned it. You have only followed instructions.
The 3-Attempt System:
Attempt 1 (Guided) → Complete the lab, take notes, rebuild with guide open
Attempt 2 (Notes) → Close the lab, use only your notes
Attempt 3 (Blind) → Close everything, rebuild from a blank terminal
If you complete Attempt 3, you OWN that skill.
Write down what surprised you, what broke, and what clicked.
"Trust your instincts." — Your journal trains your engineering intuition. What you write down today becomes the instinct you rely on tomorrow.
Every session, document:
Set a timer. No distractions. One codelab equals one focused session.
"The only place where success comes before work is in the dictionary." — Block the time. Silence the noise. Execute.
Daily System:
First 15 min → Review yesterday's notes, set today's goal
Core Block → 2-3 hours of FOCUSED execution (no phone, no tabs)
After Lab → Write journal entry, attempt rebuild
End of Day → Commit to GitHub, plan tomorrow
The 20-Minute Rule: If stuck for 20 minutes, ask for help. Do not waste 2 hours.
At every step, ask: "How would this work in a production environment with 10TB of data?"
"It's not the mission, it's the man." — The codelab is training. The real mission is your career. Think production from day one.
Four Questions for Every Step:
RELIABILITY → What happens when this fails at 3 AM?
SCALABILITY → Will this work with 100x the data?
COST → Am I wasting cloud resources?
SECURITY → Am I hardcoding credentials? (NEVER do this)
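One common way to avoid hardcoded credentials is to read them from the environment at runtime. The sketch below assumes the secret has been exported as an environment variable; the helper name and the variable name `WAREHOUSE_PASSWORD` mentioned afterwards are hypothetical.

```python
import os


def get_required_secret(name: str) -> str:
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

A call such as `get_required_secret("WAREHOUSE_PASSWORD")` then keeps the value out of the source code and out of Git history. In a cloud environment, a managed secret store would typically replace the plain environment variable.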
Share knowledge. Help your batchmates. Review each other's work.
"You're a team. You live together, you fly together." — In real engineering teams, nobody ships alone. Start that habit now.
Do not be afraid to break things in a dev environment. That is how you learn what not to break in production.
"You gotta let go." — Fear of errors is the biggest enemy of learning. Break it, fix it, own it.
Things you SHOULD break on purpose:
- Run a command with wrong parameters (see what happens)
- Delete a file and recover it with Git
- Crash a Spark job with bad data (learn to read the stack trace)
- Misconfigure IAM and see what AccessDenied looks like
When the lab gets hard, do not skip ahead. Stay with the struggle.
"I'm not leaving my wingman." — The hard step you are stuck on right now? That is the one that will separate you in interviews.
The moment you want to skip is the moment you need to stay. The struggle is the learning.
Your mission is clear: land the dream job. Every command you type, every error you debug, every checkpoint you validate brings you one step closer. The only thing standing between you and that offer letter is execution. So stop thinking about it — and start building.
By the end of this codelab, you will build a complete personal learning system:
hitavir-learning-system/
  README.md                      <-- Your learning manifesto
  journal/
    week-01.md                   <-- Weekly progress logs
  debug-log/
    errors-and-fixes.md          <-- Personal error database
  notes/
    topic-notes-template.md      <-- Structured note-taking
  challenges/
    rebuild-tracker.md           <-- Rebuild attempt tracker
  interview-prep/
    questions-bank.md            <-- Interview Q&A collection
Component | Purpose | Career Impact |
Debug Log | Track every error and fix | Eliminates repeat mistakes |
Journal | Daily learning documentation | Builds engineering intuition |
Notes Template | Structured topic summaries | Interview-ready explanations |
Rebuild Tracker | Measure skill ownership | Proves you can build from scratch |
Interview Bank | Collect questions per codelab | Walk into interviews prepared |
The same learning framework applies across all platforms:
Learning Principle | AWS Application | Azure Application | GCP Application |
Checkpoint Validation | Verify S3 objects after upload | Verify ADLS files after write | Verify GCS objects after transfer |
Debugging Protocol | Read CloudWatch logs | Read Azure Monitor logs | Read Cloud Logging logs |
Real-World Thinking | Check EC2 costs | Check VM costs | Check Compute Engine costs |
Rebuild Challenge | Recreate Glue job from scratch | Recreate ADF pipeline from scratch | Recreate Dataflow job from scratch |
Databricks runs on all three clouds — master the learning system once, apply it everywhere.
Requirement | Details |
Computer | Any OS (Windows, Mac, or Linux) |
Terminal | Git Bash (Windows), Terminal (Mac/Linux) |
Git | Installed and configured |
Text Editor | VS Code recommended |
GitHub Account | Free account at github.com |
Commit to these before starting:
[ ] I will type every command myself
[ ] I will read every error message completely
[ ] I will attempt rebuilds without the guide
[ ] I will keep a daily learning journal
[ ] I will ask for help after 20 minutes of being stuck
[ ] I understand that struggling IS the learning process
If you are not ready to commit to these, stop here. Come back when you are. There are no shortcuts at this altitude.
This codelab teaches a layered system — each layer builds on the previous one:
Layer 10: LET GO OF FEAR + LOYALTY TO PROCESS
Break things. Stay with the struggle.
──────────────────────────────────────────────
Layer 9: TEAM MINDSET
Share, review, collaborate
──────────────────────────────────────────────
Layer 8: REAL-WORLD THINKING
Reliability, scalability, cost, security
──────────────────────────────────────────────
Layer 7: PRODUCTIVITY DISCIPLINE
Deep work blocks, 20-minute rule
──────────────────────────────────────────────
Layer 6: LEARNING JOURNAL
Track, reflect, build intuition
──────────────────────────────────────────────
Layer 5: REBUILD CHALLENGE
Close the guide. Build from scratch.
──────────────────────────────────────────────
Layer 4: CHECKPOINT VALIDATION
Verify before advancing
──────────────────────────────────────────────
Layer 3: DEBUGGING MINDSET
Errors are data. Read them. Fix them.
──────────────────────────────────────────────
Layer 2: EXECUTE-UNDERSTAND-EXPLAIN
The fundamental learning cycle
──────────────────────────────────────────────
Layer 1: LEARN BY DOING
Hands on the keyboard. Always.
──────────────────────────────────────────────
CODELAB SKILL → PRODUCTION SKILL
───────────── ──────────────────
Learn by Doing → Ship code daily
Execute-Understand-Explain → Design docs and code reviews
Debugging Mindset → On-call incident response
Checkpoint Validation → Data quality pipelines
Rebuild Challenge → System design interviews
Learning Journal → Engineering decision records
Productivity Discipline → Sprint planning and delivery
Real-World Thinking → Architecture decisions
Team Mindset → Cross-functional collaboration
Let Go of Fear → Experimentation culture
Loyalty to Process → Shipping through adversity
Create the complete directory structure for your personal learning system.
# Create the full directory structure
mkdir -p ~/hitavir-learning-system/{journal,debug-log,notes,challenges,interview-prep}
# Verify the structure
find ~/hitavir-learning-system -type d | sort
Every production system starts with a well-organized structure. Your learning system is no different. Engineers who organize their knowledge systematically learn faster and retain more. This directory becomes your personal engineering knowledge base.
/home/user/hitavir-learning-system
/home/user/hitavir-learning-system/challenges
/home/user/hitavir-learning-system/debug-log
/home/user/hitavir-learning-system/interview-prep
/home/user/hitavir-learning-system/journal
/home/user/hitavir-learning-system/notes
Error | Cause | Fix |
| Parent path does not exist | The |
| No write access to home directory | Check with |
Verify before moving on:
[x] All 5 subdirectories exist
[x] You typed the command yourself (not copy-pasted)
[x] You can explain what -p does in mkdir
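To check your explanation of `-p`, here is a rough Python equivalent of the same step using `pathlib`. It is a sketch that runs in a temporary directory, so it does not touch your real learning system.

```python
import tempfile
from pathlib import Path

SUBDIRS = ["journal", "debug-log", "notes", "challenges", "interview-prep"]


def create_learning_system(root: Path) -> list[str]:
    """Create the five subdirectories under root and return their sorted names."""
    for name in SUBDIRS:
        # parents=True creates missing parent directories (like mkdir -p);
        # exist_ok=True makes a re-run succeed instead of raising FileExistsError.
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "hitavir-learning-system"
    print(create_learning_system(root))
    print(create_learning_system(root))  # second run changes nothing and raises no error
```

The two keyword arguments correspond to the two behaviours of `-p`: create missing parents, and do not complain when the directory already exists.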
Turn your learning directory into a Git repository with a professional README.
cd ~/hitavir-learning-system
# Initialize Git repository
git init
# Create README
cat > README.md << 'README'
# HitaVir Tech - My Learning System
Personal engineering knowledge base built during the HitaVir Tech Data Engineering program.
### Structure
- **journal/** — Weekly learning logs
- **debug-log/** — Error database with fixes
- **notes/** — Structured topic notes
- **challenges/** — Rebuild attempt tracker
- **interview-prep/** — Interview questions bank
### Learning Methodology
- Execute-Understand-Explain cycle for every concept
- 7-step debugging protocol for every error
- 3-attempt rebuild challenge for every codelab
- Daily journal for continuous improvement
Built with discipline. Maintained with consistency.
README
# Stage and commit
git add .
git commit -m "feat: initialize learning system with directory structure and README"
# Verify
git log --oneline
"It's not the plane, it's the pilot." — Your GitHub profile is your engineering portfolio. Every recruiter and hiring manager will check it. A well-maintained learning repository shows discipline, consistency, and genuine engineering mindset — traits that cannot be faked in an interview.
Initialized empty Git repository in /home/user/hitavir-learning-system/.git/
[main (root-commit) abc1234] feat: initialize learning system with directory structure and README
1 file changed, 16 insertions(+)
create mode 100644 README.md
abc1234 feat: initialize learning system with directory structure and README
Error | Cause | Fix |
| Git not installed | Install Git first (see Environment Setup codelab) |
| Git user not configured | Run |
Verify before moving on:
[x] Git repository initialized (ls -la .git confirms)
[x] README.md committed
[x] git log shows your commit
Create a structured error database that you will maintain throughout the entire program.
cat > ~/hitavir-learning-system/debug-log/errors-and-fixes.md << 'DEBUGLOG'
# Debug Log — Error Database
### How to Use This Log
After every error you encounter, add a row to the table below.
Review this log weekly to identify patterns in your mistakes.
### Error Log
| Date | Codelab | Error Message | Type | Root Cause | Fix Applied |
|------|---------|--------------|------|------------|-------------|
| 2026-04-14 | Learning System | (example) Permission denied | Permission | No write access | Used home directory |
### Patterns I Notice
- (Update weekly: what types of errors do I hit most often?)
### Key Debugging Lessons
- Always read the ENTIRE error message before searching online
- Change ONE thing at a time when debugging
- Document the fix immediately — future you will thank present you
DEBUGLOG
# Verify
cat ~/hitavir-learning-system/debug-log/errors-and-fixes.md
"Maybe so, sir. But not today." — The same error will try to beat you twice. Your debug log ensures it never wins a second time. Production engineers maintain runbooks for exactly this reason — a catalog of known issues and their fixes that turns hours of debugging into minutes.
The full debug log template should display with the header, table format, and sections.
Error | Cause | Fix |
| Heredoc delimiter not on its own line | Ensure |
File appears empty | Heredoc syntax error | Check matching opening/closing tags |
Verify before moving on:
[x] errors-and-fixes.md exists in debug-log/
[x] File contains the table header and example row
[x] You understand the 7-step debugging protocol
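The weekly review of the debug log can be partly automated. The sketch below counts how often each value appears in the Type column. It assumes the six-column table layout from the template above and that no cell contains a literal `|`.

```python
from collections import Counter


def count_error_types(markdown: str) -> Counter:
    """Count values in the Type column of the debug log table."""
    counts = Counter()
    for line in markdown.splitlines():
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header row, the |---| separator row, and malformed rows.
        if len(cells) != 6 or cells[0] == "Date" or set(cells[0]) <= {"-"}:
            continue
        counts[cells[3]] += 1
    return counts
```

Calling `count_error_types(Path(...).read_text())` on your log and then `.most_common()` on the result gives a quick answer for the "Patterns I Notice" section.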
Create a weekly journal template that captures daily learning progress.
cat > ~/hitavir-learning-system/journal/week-01.md << 'JOURNAL'
# Week 01 — Learning Journal
### Day 1 — [Date]
**What I Worked On**
- [Codelab name / topic]
**Key Concepts Learned**
- [Concept 1: one sentence explanation]
- [Concept 2: one sentence explanation]
**Commands I Practiced**
- [command 1] — [what it does]
- [command 2] — [what it does]
**Errors I Hit**
| Error | Type | Fix |
|-------|------|-----|
| [error message] | [classification] | [what fixed it] |
**What Clicked Today** (aha moments)
- [insight]
**What Is Still Confusing** (to revisit)
- [question or concept]
**Rebuild Attempt**
- Attempted: [Yes/No]
- Completion: [percentage]
- What I forgot: [specific items]
**Tomorrow's Focus**
- [what to work on next]
---
### Day 2 — [Date]
(Same structure — copy and fill daily)
JOURNAL
# Verify
cat ~/hitavir-learning-system/journal/week-01.md
"Trust your instincts." — Your journal trains your engineering intuition. The human brain forgets 70% of new information within 24 hours. A journal captures insights at the moment they happen, when they are sharpest. What you write down today becomes the instinct you rely on tomorrow.
The full journal template should display with Day 1 structure and all sections.
Error | Cause | Fix |
Heredoc not closing | Missing | Ensure closing tag is alone on its own line |
File in wrong directory | Typo in path | Verify with |
Verify before moving on:
[x] week-01.md exists in journal/
[x] Template has all required sections
[x] You commit to filling this out daily
Create a reusable template for structured notes on each topic you learn.
cat > ~/hitavir-learning-system/notes/topic-notes-template.md << 'NOTES'
# [Topic Name] — Notes
### One-Line Summary
[Explain this topic in one sentence]
### Key Concepts
| Concept | What It Does | When to Use It |
|---------|-------------|----------------|
| [concept] | [description] | [use case] |
### Essential Commands
- `[command 1]` — [what it does]
- `[command 2]` — [what it does]
### How It Works (Architecture)
[Describe the flow or draw an ASCII diagram]
### Common Pitfalls
- [Pitfall 1 and how to avoid it]
- [Pitfall 2 and how to avoid it]
### Interview-Ready Explanation
"[2-3 sentence explanation you could give in an interview]"
### Multi-Cloud Equivalents
| Capability | AWS | Azure | GCP |
|-----------|-----|-------|-----|
| [service type] | [AWS service] | [Azure service] | [GCP service] |
### Related Topics
- [Topic A] — [how it connects]
- [Topic B] — [how it connects]
NOTES
# Verify
cat ~/hitavir-learning-system/notes/topic-notes-template.md
Structured notes are dramatically more useful than free-form notes. This template forces you to organize knowledge in a way that is immediately useful for interviews, code reviews, and architecture discussions. The multi-cloud section builds the cross-platform awareness that top employers value.
The complete notes template should display with all sections.
Error | Cause | Fix |
File created but empty | Heredoc syntax error | Re-run the command, check for matching |
Cannot find file later | Forgot the path | Use |
Verify before moving on:
[x] topic-notes-template.md exists in notes/
[x] Template includes multi-cloud section
[x] You plan to copy this template for each new topic
Create a tracker to log your rebuild attempts and measure skill ownership over time.
cat > ~/hitavir-learning-system/challenges/rebuild-tracker.md << 'TRACKER'
# Rebuild Challenge Tracker
### How This Works
After completing each codelab, attempt to rebuild it without the guide.
Track your attempts here. Aim for Attempt 3 (blind rebuild) on every lab.
### Rebuild Log
| Codelab | Attempt | Time | Completion | What I Forgot |
|---------|---------|------|------------|---------------|
| HitaVir Tech Code | 1 (Guided) | -- min | --% | -- |
| HitaVir Tech Code | 2 (Notes) | -- min | --% | -- |
| HitaVir Tech Code | 3 (Blind) | -- min | --% | -- |
### Personal Records
- Fastest full rebuild: --
- Most improved codelab: --
- Skills I can rebuild blind: --
### Streak
- Current rebuild streak: 0 codelabs
- Best streak: 0 codelabs
TRACKER
# Verify
cat ~/hitavir-learning-system/challenges/rebuild-tracker.md
"Push beyond your limits." — The rebuild tracker is your objective measure of skill ownership. Anyone can follow a guide. The tracker proves you can build without one. When you can rebuild three codelabs blind, you are genuinely job-ready — not just tutorial-complete.
The complete rebuild tracker should display with the log table, personal records, and streak sections.
Error | Cause | Fix |
Heredoc content missing sections | Copy error | Re-run the full command block |
File permissions issue | Restrictive umask | Check with |
Verify before moving on:
[x] rebuild-tracker.md exists in challenges/
[x] Table has columns for Codelab, Attempt, Time, Completion
[x] You understand the 3-attempt system
Create a structured bank for collecting interview questions from every codelab.
cat > ~/hitavir-learning-system/interview-prep/questions-bank.md << 'INTERVIEW'
# Interview Questions Bank
### How to Use
After each codelab, add relevant interview questions below.
Practice answering out loud (2 minutes max per question).
Rate your confidence honestly.
### From: HitaVir Tech Code (Learning Guidelines)
| # | Question | My Answer Summary | Confidence |
|---|----------|-------------------|------------|
| 1 | How do you approach learning a new technology? | Execute-Understand-Explain cycle | Low / Med / High |
| 2 | Walk me through your debugging process | 7-step protocol | Low / Med / High |
| 3 | How do you ensure data quality in pipelines? | Checkpoint validation at every step | Low / Med / High |
| 4 | How do you stay productive on complex tasks? | Deep work blocks + 20-min rule | Low / Med / High |
| 5 | Tell me about a time you were stuck. How did you resolve it? | Debug log example | Low / Med / High |
| 6 | How do you handle working across multiple cloud platforms? | Cloud-agnostic methodology | Low / Med / High |
| 7 | How do you prepare for unfamiliar technical challenges? | Layered learning system | Low / Med / High |
| 8 | How would you design a system to handle 10x data growth? | 4 questions: reliability, scalability, cost, security | Low / Med / High |
### From: [Next Codelab Name]
(Copy this section header and table for each new codelab)
INTERVIEW
# Verify
cat ~/hitavir-learning-system/interview-prep/questions-bank.md
"You don't have time to think up there. If you think, you're dead." — Interview preparation is not something you cram the night before. It is something you build incrementally, one codelab at a time. By the end of this program, you will have 50+ practiced questions with confident answers.
The complete interview bank should display with the table of 8 questions.
Error | Cause | Fix |
Table formatting broken | Missing pipe characters | Ensure every row has the same number of `|` characters |
File not created | Path typo | Verify directory exists with |
Verify before moving on:
[x] questions-bank.md exists in interview-prep/
[x] 8 questions from this codelab are listed
[x] You can answer at least 3 of them out loud right now
Trigger intentional errors and practice the 7-step debugging protocol on each one.
cd ~/hitavir-learning-system
# ERROR DRILL 1: File not found
cat this-file-does-not-exist.txt
# STOP → READ the error → CLASSIFY it → What type is this?
# ERROR DRILL 2: Permission denied
touch /root/unauthorized-file.txt
# STOP → READ the error → CLASSIFY it → What type is this?
# ERROR DRILL 3: Command not found
databrickzz --version
# STOP → READ the error → CLASSIFY it → What is the fix?
# ERROR DRILL 4: Invalid argument
mkdir ""
# STOP → READ the error → CLASSIFY it → What went wrong?
# ERROR DRILL 5: Git error (if not in repo)
cd /tmp && git log
# STOP → READ the error → CLASSIFY it → What is the fix?
# Return to your learning system
cd ~/hitavir-learning-system
"You gotta let go." — Most learners freeze when they see an error. This drill rewires your response. After practicing intentional errors, you will read error messages with curiosity instead of fear. Production engineers encounter dozens of errors daily — the difference is they read them calmly and fix them systematically.
# Drill 1 — File not found:
cat: this-file-does-not-exist.txt: No such file or directory
→ Type: PATH ERROR | Fix: Check filename and current directory
# Drill 2 — Permission denied:
touch: cannot touch '/root/unauthorized-file.txt': Permission denied
→ Type: PERMISSION ERROR | Fix: Use accessible path or check permissions
# Drill 3 — Command not found:
bash: databrickzz: command not found
→ Type: COMMAND ERROR | Fix: Check spelling, verify installation
# Drill 4 — Invalid argument:
mkdir: cannot create directory '': No such file or directory
→ Type: ARGUMENT ERROR | Fix: Provide a valid directory name
# Drill 5 — Not a git repo:
fatal: not a git repository (or any of the parent directories): .git
→ Type: CONTEXT ERROR | Fix: Navigate to a git repository first
Error | Cause | Fix |
Drill 2 does not error | Running as root | Use a non-root user or try a different protected path |
Drill 5 works (no error) | /tmp is a git repo | Use a different non-git directory |
Verify before moving on:
[x] You triggered all 5 errors
[x] You classified each error by type
[x] You identified the fix for each
[x] You added at least 2 entries to your debug log
Now update your debug log with the errors you just practiced:
# Add your drill results to the debug log
# Note: >> appends to the END of the file, below "Key Debugging Lessons".
# Afterwards, move these two rows up into the Error Log table in your editor.
echo "| $(date +%Y-%m-%d) | Learning System | cat: No such file | Path | Wrong filename | Check spelling and pwd |" >> ~/hitavir-learning-system/debug-log/errors-and-fixes.md
echo "| $(date +%Y-%m-%d) | Learning System | Permission denied | Permission | No write access | Use home directory |" >> ~/hitavir-learning-system/debug-log/errors-and-fixes.md
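Because `>>` always writes at the end of the file, the new rows land after the last section rather than inside the Error Log table. One possible alternative, sketched here under the assumption that the Error Log is the first Markdown table in the file, is to insert the row directly after the table's last line.

```python
def insert_table_row(markdown: str, row: str) -> str:
    """Insert row after the last line of the first Markdown table in the text."""
    lines = markdown.splitlines()
    last_table_line = None
    for index, line in enumerate(lines):
        if line.startswith("|"):
            last_table_line = index
        elif last_table_line is not None:
            break  # the first table has ended
    if last_table_line is None:
        raise ValueError("no Markdown table found")
    lines.insert(last_table_line + 1, row)
    return "\n".join(lines) + "\n"
```

To apply it, read the log with `pathlib.Path.read_text()`, pass the text and the new row to `insert_table_row`, and write the result back with `write_text()`.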
Commit your complete learning system to Git and run a full validation.
cd ~/hitavir-learning-system
# Stage all files
git add .
# Commit
git commit -m "feat: complete learning system setup with all templates and debug drills"
# Run full validation
echo "=== LEARNING SYSTEM VALIDATION ==="
echo ""
echo "Directory Structure:"
for dir in journal debug-log notes challenges interview-prep; do
  if [ -d "$dir" ]; then
    echo " [PASS] $dir/"
  else
    echo " [FAIL] $dir/ missing"
  fi
done
echo ""
echo "Key Files:"
for file in README.md debug-log/errors-and-fixes.md journal/week-01.md notes/topic-notes-template.md challenges/rebuild-tracker.md interview-prep/questions-bank.md; do
  if [ -f "$file" ]; then
    echo " [PASS] $file"
  else
    echo " [FAIL] $file missing"
  fi
done
echo ""
echo "Git Status:"
git log --oneline
echo ""
echo "=== VALIDATION COMPLETE ==="
"Stay on target." — This is your final checkpoint for the hands-on section. A complete validation before moving on ensures every component is in place. In production, this is equivalent to running integration tests before deploying.
=== LEARNING SYSTEM VALIDATION ===
Directory Structure:
[PASS] journal/
[PASS] debug-log/
[PASS] notes/
[PASS] challenges/
[PASS] interview-prep/
Key Files:
[PASS] README.md
[PASS] debug-log/errors-and-fixes.md
[PASS] journal/week-01.md
[PASS] notes/topic-notes-template.md
[PASS] challenges/rebuild-tracker.md
[PASS] interview-prep/questions-bank.md
Git Status:
def5678 feat: complete learning system setup with all templates and debug drills
abc1234 feat: initialize learning system with directory structure and README
=== VALIDATION COMPLETE ===
Error | Cause | Fix |
| Already committed | That is fine — run validation anyway |
Some files show [FAIL] | Missed a previous step | Go back and create the missing file |
Verify before moving on:
[x] All directories show [PASS]
[x] All files show [PASS]
[x] Git log shows 2 commits
[x] You can explain what each file is for
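The same checks can also be written as a small Python function that returns the missing paths, which is easier to reuse in a test than printed PASS/FAIL lines. This is a sketch only; the demo runs against an empty temporary directory rather than your real repository.

```python
import tempfile
from pathlib import Path

REQUIRED_DIRS = ["journal", "debug-log", "notes", "challenges", "interview-prep"]
REQUIRED_FILES = [
    "README.md",
    "debug-log/errors-and-fixes.md",
    "journal/week-01.md",
    "notes/topic-notes-template.md",
    "challenges/rebuild-tracker.md",
    "interview-prep/questions-bank.md",
]


def missing_paths(root: Path) -> list[str]:
    """Return every required directory or file that does not exist under root."""
    missing = [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing


# Demo against an empty temporary directory: every required path is reported.
with tempfile.TemporaryDirectory() as tmp:
    print(missing_paths(Path(tmp)))
```

Against a complete learning system, `missing_paths(Path.home() / "hitavir-learning-system")` should return an empty list.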
Answer these without looking back. If you cannot answer all of them, revisit the relevant section.
1. What are the 3 phases of the Execute-Understand-Explain cycle?
2. What are the 7 steps of the Debugging Protocol?
3. Name 5 error types from the classification system.
4. What is the 3-attempt rebuild system?
5. What are the 4 questions of real-world thinking?
6. What is the 20-minute rule?
7. Why should you break things on purpose in a dev environment?
8. What goes in a daily journal entry?
9. How does the team mindset principle apply to engineering?
10. What does "loyalty to the process" mean when a lab gets hard?
If you can answer all 10 from memory, you have internalized these guidelines.
DO:
- Use meaningful variable names (customer_orders, not x)
- Follow the naming conventions of the project you are in
- Keep functions small and focused (one function, one job)
- Write comments only where the logic is not self-evident
DO NOT:
- Hardcode file paths, credentials, or config values (EVER)
- Write 500-line functions
- Skip error handling
- Commit without reviewing your own diff first
DO:
- Commit frequently with clear messages (feat:, fix:, docs:)
- Use branches for new features
- Review your diff before every commit
- Push to GitHub every day
DO NOT:
- Commit directly to main/master without review
- Write commit messages like "fix" or "update" or "stuff"
- Push secrets, API keys, or credentials
- Ignore .gitignore
DO:
- Tag all cloud resources (project, owner, environment)
- Use IAM roles with least privilege
- Set billing alerts on every cloud account
- Enable logging and monitoring
DO NOT:
- Use root/admin accounts for daily work
- Leave clusters and VMs running when not in use
- Store data without encryption
- Hardcode credentials in code or config files
Bottleneck | Optimization |
Context switching | One codelab per session, no tab switching |
Passive reading | Replace with hands-on execution |
Fear of errors | Practice intentional error drills weekly |
Information overload | Use structured notes template, not free-form |
Forgetting | Journal daily, review weekly |
Tutorial dependency | Rebuild challenge after every codelab |
As you progress through the program, apply these optimization patterns:
Technology | Key Optimization |
Spark/PySpark | Minimize shuffles, use partitioning, cache wisely |
SQL | Use indexes, avoid SELECT *, analyze query plans |
Cloud Storage | Use lifecycle policies, right-size storage tiers |
Pipelines | Implement incremental loads, avoid full refreshes |
Databricks | Use autoscaling clusters, Delta Lake optimization (ZORDER, OPTIMIZE) |
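To make the SQL row concrete, the sketch below uses Python's built-in `sqlite3` module to compare the query plan before and after adding an index. The table, column, and index names are made up for illustration, and the exact wording of the plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("EU", 10.0), ("US", 20.0), ("EU", 5.5)],
)

# Select only the columns needed instead of SELECT *.
QUERY = "SELECT id, amount FROM orders WHERE region = 'EU'"


def query_plan(connection, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


plan_before = query_plan(conn, QUERY)  # expected to describe a full table scan
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")
plan_after = query_plan(conn, QUERY)   # expected to mention the new index

print(plan_before)
print(plan_after)
conn.close()
```

Other databases expose the same idea through their own `EXPLAIN` variants; the habit to build is checking the plan before and after a change rather than guessing.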
Cloud FinOps is not an afterthought — it is a core engineering discipline. Every line of code you write, every cluster you spin up, every query you run has a cost. The best engineers think about cost the same way they think about performance: continuously.
INFORM → Know what you are spending and why
OPTIMIZE → Reduce waste without reducing capability
OPERATE → Build cost-awareness into daily engineering habits
Pillar | Engineer's Responsibility | Example |
Inform | Tag every resource, review bills weekly | |
Optimize | Right-size, schedule, use spot/preemptible | Downsize m5.xlarge to m5.large if CPU is at 20% |
Operate | Auto-terminate, lifecycle policies, alerts | Cluster auto-stops after 15 min idle |
Practice | AWS | Azure | GCP |
Cost Dashboard | Cost Explorer | Cost Management | Billing Reports |
Budget Alerts | AWS Budgets | Azure Budgets | GCP Budget Alerts |
Right-Sizing | Compute Optimizer | Azure Advisor | Recommender |
Spot/Preemptible | Spot Instances | Spot VMs | Spot VMs (successor to Preemptible VMs) |
Storage Tiers | S3 Standard → IA → Glacier | Hot → Cool → Archive | Standard → Nearline → Coldline |
Auto-Shutdown | Lambda + EventBridge (formerly CloudWatch Events) | Azure Automation | Cloud Scheduler + Functions |
Reserved Pricing | Reserved Instances / Savings Plans | Reserved VMs | Committed Use Discounts |
Databricks Cost Killers (and how to avoid them):
1. IDLE CLUSTERS
Problem: Cluster running 24/7 but used 2 hours/day
Fix: Auto-termination after 10-15 min idle
Savings: Up to 90%
2. OVERSIZED CLUSTERS
Problem: 8-node cluster for a job that runs on 2 nodes
Fix: Use autoscaling (min 1, max based on workload)
Savings: 50-75%
3. ON-DEMAND FOR DEV/TEST
Problem: Using on-demand instances for development
Fix: Use spot instances for non-critical workloads
Savings: 60-80%
4. FULL TABLE SCANS
Problem: Reading entire Delta table when you need one partition
Fix: Partition pruning, ZORDER, predicate pushdown
Savings: Varies (can be 10x-100x on large tables)
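The "up to 90%" figure for idle clusters follows from simple arithmetic, sketched below. It is an idealized upper bound that assumes cost scales linearly with running hours and ignores the idle minutes before auto-termination triggers.

```python
def idle_savings_fraction(hours_used_per_day: float, hours_running_per_day: float = 24.0) -> float:
    """Fraction of compute hours saved if the cluster ran only while in use."""
    if not 0 <= hours_used_per_day <= hours_running_per_day:
        raise ValueError("hours used must be between 0 and hours running")
    return 1 - hours_used_per_day / hours_running_per_day


# Example from the list: a cluster that runs 24/7 but is used 2 hours/day.
print(f"{idle_savings_fraction(2):.1%}")  # prints 91.7%
```

In practice each work session adds the auto-termination delay to the billed time, so the realized saving lands somewhat below this bound.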
Writing cost-efficient code is just as important as configuring cloud resources. Bad code wastes compute, memory, and money.
Pattern 1: Use Generators Instead of Lists for Large Data
# BAD — Loads entire dataset into memory at once
# Cost: High memory usage, potential OOM on large files
def read_all_records(filepath):
    with open(filepath) as f:
        return [line.strip() for line in f]  # All in memory
# GOOD — Yields one record at a time
# Cost: Constant memory regardless of file size
def read_records(filepath):
    with open(filepath) as f:
        for line in f:
            yield line.strip()  # One at a time
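A generator only pays off if the caller also consumes it lazily. The sketch below shows one way to do that: it streams a small throwaway file through `read_records` and counts matching lines without ever building a list. The helper `count_matching` and the sample rows are illustrative assumptions, not part of the codelab.

```python
import os
import tempfile


def read_records(filepath):
    """Yield stripped lines one at a time instead of building a list."""
    with open(filepath) as f:
        for line in f:
            yield line.strip()


def count_matching(filepath, needle):
    """Count lines containing `needle`, holding only one line in memory at a time."""
    return sum(1 for record in read_records(filepath) if needle in record)


# Write a tiny sample file (hypothetical data) so the example runs anywhere
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,EU,100\n2,US,250\n3,EU,75\n")
    sample_path = tmp.name

eu_rows = count_matching(sample_path, ",EU,")
os.remove(sample_path)
print(eu_rows)  # 2
```

Wrapping the generator in `list(...)` would load everything back into memory and undo the benefit, so keep the pipeline lazy end to end.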
Pattern 2: Filter Early, Process Late
# BAD — Processes everything, then filters
# Cost: Wasted CPU cycles on data you will discard
result = []
for record in massive_dataset:
    transformed = expensive_transformation(record)
    if transformed["region"] == "EU":
        result.append(transformed)
# GOOD — Filter first, then process only what you need
# Cost: Transforms only matching records
result = [
    expensive_transformation(record)
    for record in massive_dataset
    if record["region"] == "EU"
]
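To see the difference rather than take it on faith, count how often the expensive step runs. This is a minimal sketch with a stand-in `expensive_transformation` and four made-up records; the 0.9 conversion factor is an arbitrary placeholder.

```python
calls = {"count": 0}


def expensive_transformation(record):
    """Stand-in for costly work; counts how many times it is invoked."""
    calls["count"] += 1
    return {**record, "amount_eur": record["amount"] * 0.9}  # placeholder math


massive_dataset = [
    {"region": "EU", "amount": 100},
    {"region": "US", "amount": 200},
    {"region": "EU", "amount": 300},
    {"region": "APAC", "amount": 400},
]

# Transform everything, then filter
calls["count"] = 0
late = [t for t in map(expensive_transformation, massive_dataset) if t["region"] == "EU"]
late_calls = calls["count"]

# Filter first, then transform only the survivors
calls["count"] = 0
early = [expensive_transformation(r) for r in massive_dataset if r["region"] == "EU"]
early_calls = calls["count"]

print(late_calls, early_calls)  # 4 2
```

Both approaches produce the same rows, but the early filter does half the work here. The trick only applies when the filter field exists before the transformation runs, as `region` does in this example.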
Pattern 3: Use Built-in Functions Over Loops
# BAD — Manual loop (slower, more code to maintain)
total = 0
for sale in sales_data:
    total += sale["amount"]
# GOOD — Built-in sum() with generator (C-optimized, faster)
total = sum(sale["amount"] for sale in sales_data)
Pattern 4: Efficient String Operations
# BAD — String concatenation in a loop (may create a new string each time)
# Cost: Up to O(n^2) copying for n strings, and it leaves a trailing ", "
query = ""
for column in columns:
    query += column + ", "
# GOOD — join() method (single allocation)
# Cost: O(n) copying, and no trailing separator to strip
query = ", ".join(columns)
Pattern 5: Context Managers for Resource Cleanup
# BAD — Resource leak if exception occurs between open and close
# Cost: Leaked connections = wasted cloud resources + potential billing
conn = database.connect()
result = conn.execute("SELECT * FROM orders")
conn.close()  # Never reached if execute() raises
# GOOD — Context manager guarantees cleanup
# Cost: Connection always released, even on error
with database.connect() as conn:
    result = conn.execute("SELECT * FROM orders")
# Connection released here (assuming the driver closes it on exit)
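Not every driver closes the connection when its `with` block ends. Python's built-in `sqlite3` is a good example: `with conn:` commits or rolls back the transaction but leaves the connection open. A minimal sketch of a pattern that works either way is to wrap the connection in `contextlib.closing`; the in-memory database and the `orders` table here are made up for the demo.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() on exit, whether or not an exception occurred
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,)])
    conn.commit()
    rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()

# Prove the connection really is closed: any further use raises ProgrammingError
try:
    conn.execute("SELECT 1")
    connection_closed = False
except sqlite3.ProgrammingError:
    connection_closed = True

print(rows, connection_closed)
```

Check your own driver's documentation for what its context manager actually does before relying on it for cleanup.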
Pattern 6: Batch Operations Over Individual Calls
# BAD — One API call per record (N network round trips)
# Cost: Slow + may hit API rate limits + higher network charges
for record in records:
    s3_client.put_object(Bucket=bucket, Key=record["id"], Body=record["data"])
# GOOD — Batch upload (fewer API calls, lower cost)
# Cost: Fewer round trips, lower API call charges
# Trade-off: records land in one JSONL object instead of one object each
import io
buffer = io.BytesIO()
for record in records:
    buffer.write(record["data"].encode() + b"\n")
s3_client.put_object(Bucket=bucket, Key="batch_upload.jsonl", Body=buffer.getvalue())
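One giant buffer is not always practical, because very large payloads bring back the memory problem from Pattern 1. A common middle ground is fixed-size batches. The sketch below shows only the batching logic, with no cloud calls; `chunked`, `to_jsonl`, and the batch size of 2 are illustrative choices, and each payload would become one upload in a real pipeline.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items until the iterable is exhausted."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def to_jsonl(batch):
    """Serialize a batch of dicts as newline-delimited JSON bytes."""
    lines = (json.dumps(record, sort_keys=True) for record in batch)
    return ("\n".join(lines) + "\n").encode("utf-8")


records = [{"id": i, "data": f"row-{i}"} for i in range(5)]  # made-up records
payloads = [to_jsonl(batch) for batch in chunked(records, 2)]

print(len(records), "records ->", len(payloads), "uploads")
```

Five records become three payloads instead of five separate calls. Tune the batch size against your memory budget and any object-size limits of the target service.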
Pattern 7: Cache Expensive Computations
from functools import lru_cache
# BAD — Recalculates every time (wastes CPU)
def get_exchange_rate(currency):
    return api_call_to_exchange_service(currency)  # Slow + costs per call
# GOOD — Cache results (avoids redundant API calls)
@lru_cache(maxsize=128)
def get_exchange_rate(currency):
    return api_call_to_exchange_service(currency)  # Called once per currency
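You can verify that the cache is doing its job with `cache_info()`, which `lru_cache` adds to every decorated function. In this sketch the exchange service is replaced by a stand-in that records each call; the rates are invented for the demo.

```python
from functools import lru_cache

api_calls = []  # records every "real" call so we can count them


def api_call_to_exchange_service(currency):
    """Hypothetical stand-in for a slow, billed API call."""
    api_calls.append(currency)
    return {"EUR": 1.0, "USD": 1.08}.get(currency, 0.0)  # made-up rates


@lru_cache(maxsize=128)
def get_exchange_rate(currency):
    return api_call_to_exchange_service(currency)


for currency in ["USD", "EUR", "USD", "USD"]:
    get_exchange_rate(currency)

info = get_exchange_rate.cache_info()
print(info.hits, info.misses, api_calls)  # 2 2 ['USD', 'EUR']

# Cached values never expire on their own; clear them when rates may have changed
get_exchange_rate.cache_clear()
```

Four lookups cost only two API calls. The flip side is staleness: `lru_cache` has no time-to-live, so for values that change, such as exchange rates, decide how and when to call `cache_clear()`.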
Pattern 8: Use Appropriate Data Structures
# BAD — Using list for membership checks (O(n) per lookup)
blocked_users = ["user1", "user2", "user3", ..., "user10000"]
if user_id in blocked_users:  # Scans entire list
    block()
# GOOD — Using set for membership checks (O(1) per lookup)
blocked_users = {"user1", "user2", "user3", ..., "user10000"}
if user_id in blocked_users:  # Instant hash lookup
    block()
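Measure it yourself instead of trusting the Big-O comment. This minimal sketch times the same lookup against a list and a set of 10,000 generated IDs using `timeit`; the ID format and repetition count are arbitrary, and the exact timings will differ on every machine.

```python
import timeit

user_ids = [f"user{i}" for i in range(1, 10_001)]
blocked_list = user_ids
blocked_set = set(user_ids)

probe = "user10000"  # last element: the worst case for a list scan

list_seconds = timeit.timeit(lambda: probe in blocked_list, number=1_000)
set_seconds = timeit.timeit(lambda: probe in blocked_set, number=1_000)

print(f"list: {list_seconds:.4f}s   set: {set_seconds:.4f}s")
```

Both containers give the same answer; only the cost differs. Building the set has a one-time cost, so the switch pays off when you do many lookups against the same collection.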
from pyspark.sql.functions import col, sum as spark_sum  # alias avoids shadowing Python's built-in sum()
# BAD — Collect entire DataFrame to driver (OOM risk, defeats Spark)
all_data = df.collect()  # Pulls everything to one machine
for row in all_data:
    process(row)
# GOOD — Keep processing distributed
df.filter(col("status") == "active") \
    .groupBy("region") \
    .agg(spark_sum("revenue").alias("total_revenue")) \
    .write.format("delta").save("/output/path")
# BAD — No partition pruning: the table is NOT partitioned by sale_date,
# so Spark cannot skip partitions and may scan far more data than it needs
df = spark.read.format("delta").load("/data/sales")
result = df.filter(col("sale_date") == "2026-04-14")
# GOOD — Partition-aware reads: the same query against a table that IS
# partitioned by sale_date, so Spark reads only the matching partition
df = spark.read.format("delta").load("/data/sales")
result = df.filter(col("sale_date") == "2026-04-14")
# The code is identical; the table layout makes the difference
# Verify with: result.explain() → should show PartitionFilters
# BAD — Persisting data you only use once (wastes memory/disk)
df.cache()
result = df.groupBy("category").count()
result.show()
# df stays cached, consuming cluster memory
# GOOD — Cache only when reusing the same DataFrame multiple times
df.cache()
summary = df.groupBy("category").count()
details = df.filter(col("amount") > 1000)
summary.show()
details.show()
df.unpersist() # Release when done
After EVERY codelab that uses cloud resources:
[ ] Check cloud console for running resources
[ ] Terminate idle clusters, VMs, and services
[ ] Verify billing dashboard for unexpected charges
[ ] Set up budget alerts if not already done
[ ] Review code for the 8 Python cost patterns above
[ ] Check Spark jobs for unnecessary shuffles and full scans
[ ] Verify storage lifecycle policies are in place
[ ] Tag all resources with project/owner/environment
The most expensive bug is the one that runs in production for a month before anyone notices the bill.
Delete your entire learning system:
rm -rf ~/hitavir-learning-system
Now rebuild everything from memory.
Time yourself. Log the result in your rebuild tracker.
Find a peer, friend, or family member. Explain in 2 minutes:
If they understand your explanation, you truly know it.
Without looking at the reference table, classify these errors:
1. "SyntaxError: unexpected EOF while parsing" → Type: ?
2. "Connection refused on port 5432" → Type: ?
3. "Access Denied for user 'admin'@'localhost'" → Type: ?
4. "java.lang.OutOfMemoryError: Java heap space" → Type: ?
5. "FileNotFoundError: [Errno 2] No such file" → Type: ?
Explain to a hiring manager in 60 seconds:
Time yourself. If it takes more than 60 seconds, simplify.
A teammate sends you this message:
"Spark job keeps failing with OutOfMemory.
Tried increasing executor memory to 64GB but still fails."
Using the debugging protocol, write your response. What questions would you ask? What would you check first?
"Push beyond your limits." — These challenges are optional, but the engineers who do them are the ones who get hired.
cd ~/hitavir-learning-system
# Create repo on github.com first, then:
# git remote add origin https://github.com/YOUR-USERNAME/hitavir-learning-system.git
# git push -u origin main
Signal | What It Demonstrates |
Daily commits | Consistency and discipline |
Growing debug log | Systematic learning from mistakes |
Improving rebuild times | Deliberate practice mentality |
Structured notes | Communication and documentation skills |
Professional README | Self-awareness and professionalism |
After completing this codelab, share your journey:
Post idea:
"Started building my Data Engineering learning system at
HitaVir Tech. My approach: Execute-Understand-Explain for
every concept. A personal debug log for every error. And the
rebuild challenge to convert tutorials into real skills.
Day 1 of building in public.
#DataEngineering #LearningInPublic #HitaVirTech"
"It's not the plane, it's the pilot." — Your portfolio proves you are the pilot, not just a passenger.
Practice answering these out loud. Two minutes maximum per question.
Q1: How do you approach learning a new technology or tool? Use the Execute-Understand-Explain cycle.
Q2: Walk me through how you debug a failing data pipeline. Use the 7-step Debugging Protocol with a specific example.
Q3: How do you ensure data quality in your pipelines? Reference checkpoint validation — row counts, schema checks, sample verification.
Q4: How do you stay productive when working on complex tasks? Reference deep work blocks, the 20-minute rule, and daily journaling.
Q5: Tell me about a time you were stuck on a technical problem. Use a real example from your debug log.
Q6: How do you handle working across multiple cloud platforms? Reference the cloud-agnostic learning framework.
Q7: How do you collaborate with team members on technical projects? Reference the team mindset — code reviews, knowledge sharing, pair debugging.
Q8: How would you design a system to handle 10x your current data? Reference the 4 questions: reliability, scalability, cost, security.
"Trust your instincts." — Your answers should flow naturally because you have lived these principles, not memorized them.
[x] Complete learning directory structure (6 directories)
[x] Professional README with Git repository
[x] Debug log with error database template
[x] Weekly journal with daily structure
[x] Topic notes template with multi-cloud section
[x] Rebuild challenge tracker with streak counter
[x] Interview questions bank with 8 practiced questions
[x] 5 intentional error drills with classifications
[x] Full system validation
[x] 11 guidelines for top-performer learning
[x] Execute-Understand-Explain cycle
[x] 7-step Debugging Protocol
[x] Error classification system (5 types)
[x] 3-attempt rebuild system
[x] Checkpoint validation discipline
[x] Productivity systems for deep work
[x] Real-world thinking (4 questions)
[x] Team mindset principles
[x] Fear management in engineering
[x] Process loyalty through difficulty
Your next codelab is Data Engineering Environment Setup on Windows 11. Apply everything you learned here:
You have completed the lab. But completing is not mastering.
Go back. Rebuild it without the guide. Break it on purpose. Fix it under pressure.
"Don't think. Just do." — When the interviewer asks you to describe your engineering process, your answer should flow before your doubt has time to speak.
"Push beyond your limits." — The engineers who land the best roles are not the ones who finished the codelab. They are the ones who finished it, broke it, rebuilt it, and then taught it to someone else.
"I'm not leaving my wingman." — Your batchmates are on the same mission. Help them. Challenge them. Rise together.
Your mission was never just this lab. Your mission is the career you are building. Every command you executed today is a brick in that foundation.
Now go fly.
— HitaVir Tech | Building Engineers, Not Just Learners
This codelab creates only local files and a Git repository. No cloud resources to clean up.
KEEP your entire learning system:
~/hitavir-learning-system/    <-- This is your knowledge base
    README.md
    journal/week-01.md
    debug-log/errors-and-fixes.md
    notes/topic-notes-template.md
    challenges/rebuild-tracker.md
    interview-prep/questions-bank.md
cd ~/hitavir-learning-system
git add .
git commit -m "docs: complete learning system setup after HitaVir Tech Code codelab"
# git push origin main
Build this habit now:
After EVERY codelab:
1. Stop running clusters, VMs, or services
2. Delete temporary cloud resources
3. Check cloud billing dashboard
4. Commit and push all work to GitHub
5. Write your journal entry
6. Update your debug log
7. Log your rebuild attempt
Resource management is an engineering discipline. Practice it from day one.
Proceed to the next codelab: Data Engineering Environment Setup on Windows 11.