Welcome to HitaVirTech Batch 5! This codelab walks you through setting up a complete Data Engineering development environment on Windows 11 from scratch.
| Tool | Purpose |
| --- | --- |
| Windows Terminal | Modern terminal experience |
| Git & Git Bash | Version control & Unix shell on Windows |
| Python 3.11 | Core scripting & PySpark |
| VS Code | Primary IDE |
| Java 11 (JDK) | Required for PySpark / Spark |
| Apache Spark | Local Spark for PySpark development |
| GitHub Account | Cloud version control & collaboration |
| Databricks CLI | Connect to Databricks workspace |
| Docker Desktop | Containerized environments |
Windows Terminal gives you a modern, tabbed terminal with Git Bash, PowerShell, and CMD all in one place.
Press Win + S, type Microsoft Store, and open it.
Search for Windows Terminal and click Install.
Alternatively, open PowerShell as Administrator and run:
winget install Microsoft.WindowsTerminal
wt --version
# Windows Terminal v1.19.x
Git is essential for version control. Git Bash gives you a Unix-like shell on Windows.
Go to https://git-scm.com/download/win
Download the 64-bit Windows installer or use winget:
winget install Git.Git
| Screen | Recommended Option |
| --- | --- |
| Default editor | Visual Studio Code |
| Initial branch name | main |
| PATH environment | Git from the command line and 3rd-party software |
| Line endings | Checkout Windows-style, commit Unix-style |
| Terminal emulator | Use MinTTY (Git Bash) |
| Credential helper | Git Credential Manager |
Open Git Bash and run:
git config --global user.name "Your Full Name"
git config --global user.email "your@email.com"
git config --global init.defaultBranch main
git config --global core.autocrlf true
ssh-keygen -t ed25519 -C "your@email.com"
# Press Enter to accept defaults
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
# Copy public key - paste this into GitHub later
cat ~/.ssh/id_ed25519.pub
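The `cat` output above is exactly what you will paste into GitHub. If you want to sanity-check that you copied a well-formed key line (and not, say, the private key by mistake), a tiny Python helper can verify its shape. This is only an illustrative sketch — the function name is made up, and it checks format, not validity:

```python
def looks_like_ed25519_pubkey(line: str) -> bool:
    """Rough shape check for an OpenSSH ed25519 public key line."""
    parts = line.strip().split()
    # Expected format: "ssh-ed25519 <base64-blob> [optional comment]"
    # The base64 blob always starts with "AAAA" (length-prefixed key type).
    return len(parts) >= 2 and parts[0] == "ssh-ed25519" and parts[1].startswith("AAAA")

sample = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFakeKeyMaterialForDemo your@email.com"
print(looks_like_ed25519_pubkey(sample))  # True
```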
git --version
# git version 2.44.x.windows.1
Python is the primary language for data engineering.
Important: Use Python 3.11 for best PySpark compatibility. Avoid 3.12+.
Go to https://www.python.org/downloads/
Download Python 3.11.x or use winget:
winget install Python.Python.3.11
CRITICAL: Check Add Python to PATH before clicking Install!
Open a new Git Bash window:
python --version
# Python 3.11.x
pip --version
# pip 23.x from ...
Create a project folder with a virtual environment, activate it, and install the course packages into it. (Order matters: a venv does not see packages installed into the global interpreter beforehand, so activate first, then install.)

mkdir -p /c/hitavirtect_codelabs
cd /c/hitavirtect_codelabs
python -m venv .venv
source .venv/Scripts/activate
# (.venv) prefix appears in terminal
pip install --upgrade pip
pip install pyspark==3.5.1 pandas numpy pyarrow boto3 delta-spark jupyter black flake8 pytest
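With the environment activated, a short stdlib-only Python check confirms both that you are inside the venv and that the key packages from the `pip install` line resolve. This is a convenience sketch, not part of the official setup; the helper names are made up:

```python
import importlib.util
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at .venv while base_prefix
    # still points at the system Python install.
    return sys.prefix != sys.base_prefix

def missing_packages(names):
    # find_spec() locates a package without importing it,
    # so this stays fast and side-effect free.
    return [n for n in names if importlib.util.find_spec(n) is None]

print("venv active:", in_virtualenv())
print("missing:", missing_packages(["pyspark", "pandas", "numpy", "pyarrow", "boto3"]))
```

An empty `missing:` list means every package is importable from the active interpreter.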
Use Java 11 for this setup. Avoid Java 17+ here: although Spark 3.5 can run on newer JDKs, they require extra JVM flags and are a common source of setup problems on Windows.
Go to https://adoptium.net/temurin/releases/?version=11
Select: Version 11 LTS, OS Windows, Architecture x64, Package JDK
Or via winget:
winget install EclipseAdoptium.Temurin.11.JDK
# PowerShell as Administrator
[System.Environment]::SetEnvironmentVariable(
"JAVA_HOME",
"C:\Program Files\Eclipse Adoptium\jdk-11.x.x.x-hotspot",
"Machine"
)
# Open a NEW Git Bash window so the updated environment variable is picked up
java -version
# openjdk version "11.0.x"
echo $JAVA_HOME
# C:\Program Files\Eclipse Adoptium\jdk-11...
Go to https://spark.apache.org/downloads.html
Select: Spark 3.5.1, Pre-built for Hadoop 3.3
mkdir -p /c/spark
mv ~/Downloads/spark-3.5.1-bin-hadoop3.tgz /c/spark/
cd /c/spark && tar -xzf spark-3.5.1-bin-hadoop3.tgz
ls spark-3.5.1-bin-hadoop3/
PySpark on Windows needs winutils.exe to work correctly:
mkdir -p /c/hadoop/bin
curl -L -o /c/hadoop/bin/winutils.exe \
https://github.com/cdarlint/winutils/raw/master/hadoop-3.3.5/bin/winutils.exe
# PowerShell as Administrator
[System.Environment]::SetEnvironmentVariable("SPARK_HOME", "C:\spark\spark-3.5.1-bin-hadoop3", "Machine")
[System.Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\hadoop", "Machine")
[System.Environment]::SetEnvironmentVariable("PYSPARK_PYTHON", "python", "Machine")
$path = [System.Environment]::GetEnvironmentVariable("Path", "Machine")
[System.Environment]::SetEnvironmentVariable("Path", $path + ";C:\spark\spark-3.5.1-bin-hadoop3\bin", "Machine")
python -c "
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.createDataFrame([('Alice', 25), ('Bob', 30)], ['name', 'age'])
df.show()
spark.stop()
print('PySpark works!')"
Expected output:
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
| Bob| 30|
+-----+---+
PySpark works!
winget install Microsoft.VisualStudioCode
Or download from https://code.visualstudio.com/
During install, tick:
- Add "Open with Code" action to Windows Explorer file and directory context menus
- Add to PATH (ticked by default)

Then install the recommended extensions from Git Bash:
code --install-extension ms-python.python
code --install-extension ms-python.black-formatter
code --install-extension databricks.databricks
code --install-extension amazonwebservices.aws-toolkit-vscode
code --install-extension ms-azuretools.vscode-docker
code --install-extension ms-toolsai.jupyter
code --install-extension eamodio.gitlens
code --install-extension redhat.vscode-yaml
Press Ctrl+Shift+P, run Preferences: Open User Settings (JSON), and add:
{
"python.defaultInterpreterPath": "C:/hitavirtect_codelabs/.venv/Scripts/python.exe",
"editor.formatOnSave": true,
"editor.tabSize": 4,
"files.eol": "\n",
"[python]": {
"editor.defaultFormatter": "ms-python.black-formatter"
},
"terminal.integrated.defaultProfile.windows": "Git Bash",
"git.autofetch": true
}
code --version
# 1.88.x
GitHub is the world's largest platform for hosting code, collaborating on projects, and managing your data engineering portfolio. You'll use GitHub to store codelabs, Databricks notebooks, and project code throughout this course.
Open your browser and go to https://github.com
Click the Sign up button in the top-right corner.
Fill in the registration form:
| Field | Guidance |
| --- | --- |
| Email | Use a professional email you check regularly |
| Password | Minimum 15 characters, or 8+ with a number and lowercase letter |
| Username | Choose a professional handle (e.g., `john-doe`) |
| Email preferences | Optional — tick if you want product updates |
Tip: Your GitHub username will appear in all your commit history and public repos. Choose something professional like john-doe or jdoe-de — not coolgamer123.
Click Continue.
GitHub shows a visual CAPTCHA. Follow the on-screen instructions to verify you are human, then click Create account.
GitHub sends a launch code (8-digit number) to your email. Enter it to verify your address.
GitHub may ask a few onboarding questions:
Click Continue or Skip personalisation.
Select GitHub Free — this is sufficient for all HitaVirTech coursework.
| Plan | Cost | What You Get |
| --- | --- | --- |
| Free | $0/month | Unlimited public & private repos, 2,000 CI/CD mins/month |
| Pro | $4/month | More CI/CD minutes, advanced insights |
You do NOT need a paid plan for this course.
You generated an SSH key in the Git step. Now add it to GitHub so you can push/pull without a password.
Go to GitHub > Settings > SSH and GPG keys > New SSH key, then fill in:
- Title: HitaVirTech-Laptop (or any label)
- Key type: Authentication Key
- Key: paste the output of `cat ~/.ssh/id_ed25519.pub` from Git Bash

Click Add SSH key, then test the connection:

ssh -T git@github.com
# Hi your-username! You have successfully authenticated
On GitHub, click New repository and fill in:
- Repository name: hitavir-batch5
- Description: HitaVirTech Batch 5 - Data Engineering Projects
- Tick Add a README file

Then clone it locally:
cd /c/hitavirtect_codelabs
git clone git@github.com:your-username/hitavir-batch5.git
cd hitavir-batch5
ls
# README.md
Make sure your local Git matches your GitHub account:
git config --global user.name "Your GitHub Display Name"
git config --global user.email "your-github-email@example.com"
# Verify
git config --list | grep user
# user.name=Your GitHub Display Name
# user.email=your-github-email@example.com
| Action | Why It Matters |
| --- | --- |
| Add a profile photo | Makes you recognisable to collaborators |
| Write a bio | Shows your skills to potential employers |
| Pin your best repos | Highlights your data engineering work |
| Use README.md in repos | Documents your projects professionally |
winget install Databricks.DatabricksCLI
Or via pip (note: the PyPI databricks-cli package is the legacy CLI; prefer winget for the current Go-based CLI):
pip install databricks-cli
databricks configure --token --profile hitavir-dev
# Databricks Host: https://adb-XXXXXXXXXX.azuredatabricks.net
# Token: dapiXXXXXXXXXXXXXXXX
Generate token: Databricks > User Settings > Developer > Access Tokens > Generate New Token
databricks --version
# Databricks CLI v0.x.x
databricks clusters list --profile hitavir-dev
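`databricks configure` stores the profile in `~/.databrickscfg`, a plain INI file. If a command complains about a missing or malformed profile, you can inspect that file with the stdlib `configparser`. The sample content below mirrors the placeholder host and token from the configure step above; `read_profile` is an illustrative helper:

```python
import configparser

# Example of what ~/.databrickscfg looks like after `databricks configure`
# (placeholder values, matching the configure step above).
SAMPLE = """\
[hitavir-dev]
host  = https://adb-XXXXXXXXXX.azuredatabricks.net
token = dapiXXXXXXXXXXXXXXXX
"""

def read_profile(cfg_text: str, profile: str) -> dict:
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    return dict(cfg[profile])

profile = read_profile(SAMPLE, "hitavir-dev")
print(profile["host"])
```

In practice you would pass `open(os.path.expanduser("~/.databrickscfg")).read()` instead of `SAMPLE`.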
# PowerShell as Administrator - then restart PC
wsl --install
After restart:
wsl --set-default-version 2
wsl --install -d Ubuntu-22.04
Download from https://www.docker.com/products/docker-desktop/
Install settings:
- Use WSL 2 instead of Hyper-V (recommended)

After installation, start Docker Desktop and verify:
docker --version
# Docker version 26.x.x
docker run hello-world
# Hello from Docker!
docker run -d \
--name pg-dev \
-e POSTGRES_PASSWORD=hitavir123 \
-e POSTGRES_DB=dataengineering \
-p 5432:5432 \
postgres:15
docker ps
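To connect to this container from Python (e.g., via SQLAlchemy or psycopg2), you need a connection URL assembled from the `docker run` flags above. A tiny helper makes the mapping explicit — the defaults below come from that example (`pg_url` is a made-up name, and the password is for local dev only):

```python
def pg_url(user="postgres", password="hitavir123", host="localhost",
           port=5432, db="dataengineering"):
    # "postgres" is the default superuser of the postgres:15 image;
    # password and db come from the -e flags in the docker run example.
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

print(pg_url())  # postgresql://postgres:hitavir123@localhost:5432/dataengineering
```

Change the password (and avoid committing it) for anything beyond throwaway local development.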
Run this one-liner in Git Bash to check everything:
echo '=== HitaVirTech DE Environment Check ===' && \
echo "Git: $(git --version)" && \
echo "Python: $(python --version)" && \
echo "PySpark: $(python -c 'import pyspark; print(pyspark.__version__)')" && \
echo "Java: $(java -version 2>&1 | head -1)" && \
echo "Docker: $(docker --version)" && \
echo "VS Code: $(code --version | head -1)" && \
echo "GitHub SSH: $(ssh -T git@github.com 2>&1 | head -1)" && \
echo '======================================='
=== HitaVirTech DE Environment Check ===
Git: git version 2.44.0.windows.1
Python: Python 3.11.9
PySpark: 3.5.1
Java: openjdk version "11.0.23"
Docker: Docker version 26.x.x
VS Code: 1.88.x
GitHub SSH: Hi your-username! You have successfully authenticated
=======================================
| Issue | Fix |
| --- | --- |
| `python` not recognised | Reinstall Python with Add to PATH ticked |
| PySpark Java error | Ensure JAVA_HOME points to JDK 11 exactly |
| winutils.exe error | Check HADOOP_HOME is set correctly |
| Docker won't start | Enable Virtualization in BIOS |
| SSH auth failed | Re-run `eval "$(ssh-agent -s)"` and `ssh-add ~/.ssh/id_ed25519` |
| GitHub push denied | Check SSH key is added to GitHub Settings |
You have successfully set up a complete Data Engineering environment on Windows 11!