My complete GitHub Actions setup: parallel test jobs, Docker build caching, SSH deployment to VPS, zero-downtime with PM2 reload, secrets management, and the workflow patterns I've refined over two years.
Every project I've worked on eventually reaches the same inflection point: the deploy process gets too painful to do manually. You forget to run the tests. You build locally but forget to bump the version. You SSH into production and realize the last person who deployed left a stale .env file.
GitHub Actions solved this for me two years ago. Not perfectly on day one — the first workflow I wrote was a 200-line YAML nightmare that timed out half the time and cached nothing. But iteration by iteration, I arrived at something that deploys this site reliably, with zero downtime, in under four minutes.
This is that workflow, explained section by section. Not the docs version. The version that survives contact with production.
Before we get into the full pipeline, you need a clear mental model of how GitHub Actions works. If you've used Jenkins or CircleCI, forget most of what you know. The concepts map loosely, but the execution model is different enough to trip you up.
```yaml
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 6 * * 1" # Every Monday at 6 AM UTC
  workflow_dispatch:
    inputs:
      environment:
        description: "Target environment"
        required: true
        default: "staging"
        type: choice
        options:
          - staging
          - production
```

Four triggers, each serving a different purpose:

- push to main is your production deploy trigger. Code merged? Ship it.
- pull_request runs your CI checks on every PR. This is where lint, type checks, and tests live.
- schedule is cron for your repo. I use it for weekly dependency audit scans and stale cache cleanup.
- workflow_dispatch gives you a manual "Deploy" button in the GitHub UI with input parameters. Invaluable when you need to deploy staging without a code change — maybe you updated an environment variable or need to repull a base Docker image.

One thing that bites people: pull_request runs against the merge commit, not the PR branch HEAD. This means your CI is testing what the code will look like after merge. That's actually what you want, but it surprises people when a green branch goes red after a rebase.
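If you ever do need CI to run against the branch HEAD rather than the merge commit (say, to reproduce exactly what a contributor pushed), you can override the checkout ref. A minimal sketch using the standard pull_request event context:

```yaml
- uses: actions/checkout@v4
  with:
    # Check out the PR branch HEAD instead of the synthetic merge commit
    ref: ${{ github.event.pull_request.head.sha }}
```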
```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
```

Jobs run in parallel by default. Each job gets a fresh VM (the "runner"). ubuntu-latest gives you a reasonably beefy machine — 4 vCPUs, 16 GB RAM as of 2026. That's free for public repos, 2000 minutes/month for private.
Steps run sequentially within a job. Each uses: step pulls in a reusable action from the marketplace. Each run: step executes a shell command.
The --frozen-lockfile flag is crucial. Without it, pnpm install might update your lockfile during CI, which means you're not testing the same dependencies that were committed. I've seen this cause phantom test failures that vanish locally because the lockfile on the developer's machine is already correct.
```yaml
env:
  NODE_ENV: production
  NEXT_TELEMETRY_DISABLED: 1

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy
        env:
          SSH_PRIVATE_KEY: ${{ secrets.SSH_PRIVATE_KEY }}
          DEPLOY_HOST: ${{ secrets.DEPLOY_HOST }}
        run: |
          echo "$SSH_PRIVATE_KEY" > key.pem
          chmod 600 key.pem
          ssh -i key.pem deploy@"$DEPLOY_HOST" "cd /var/www/app && ./deploy.sh"
```

Environment variables set with env: at the workflow level are plain text, visible in logs. Use these for non-sensitive config: NODE_ENV, telemetry flags, feature toggles.
Secrets (${{ secrets.X }}) are encrypted at rest, masked in logs, and only available to workflows in the same repo. They're set in Settings > Secrets and variables > Actions.
The environment: production line is significant. GitHub Environments let you scope secrets to specific deployment targets. Your staging SSH key and your production SSH key can both be named SSH_PRIVATE_KEY but hold different values depending on which environment the job targets. This also unlocks required reviewers — you can gate production deploys behind a manual approval.
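You can script all of this instead of clicking through the UI. A sketch with the GitHub CLI; the secret names match the ones used later in this post, and the key files are assumed to exist locally:

```bash
# Same secret name, scoped to two different environments
gh secret set SSH_PRIVATE_KEY --env staging < staging_deploy_key
gh secret set SSH_PRIVATE_KEY --env production < deploy_key
gh secret set DEPLOY_HOST --env production --body "akousa.net"
```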
Here's how I structure the CI half of the pipeline. The goal: catch every category of error in the fastest possible time.
```yaml
name: CI

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint

  typecheck:
    name: Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm tsc --noEmit

  test:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm test -- --coverage
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: coverage-report
          path: coverage/
          retention-days: 7

  build:
    name: Build
    needs: [lint, typecheck, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm build
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: .next/
          retention-days: 1
```

Lint, typecheck, and test run in parallel. They have no dependencies on each other. A type error doesn't block lint from running, and a failed test doesn't need to wait for the type checker. On a typical run, all three complete in 30-60 seconds while running simultaneously.
Build waits for all three. The needs: [lint, typecheck, test] line means the build job only starts if lint, typecheck, AND test all pass. There's no point building a project that has lint errors or type failures.
concurrency with cancel-in-progress: true is a huge time saver. If you push two commits in quick succession, the first CI run is cancelled. Without this, you'll have stale runs consuming your minutes budget and cluttering the checks UI.
Coverage upload with if: always() means you get the coverage report even when tests fail. This is useful for debugging — you can see which tests failed and what they covered.
By default, if one job in a matrix fails, GitHub cancels the others. For CI, I actually want this behavior — if lint fails, I don't care about the test results. Fix the lint first.
But for test matrices (say, testing across Node 20 and Node 22), you might want to see all failures at once:
```yaml
test:
  strategy:
    fail-fast: false
    matrix:
      node-version: [20, 22]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: pnpm/action-setup@v4
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ matrix.node-version }}
        cache: "pnpm"
    - run: pnpm install --frozen-lockfile
    - run: pnpm test
```

fail-fast: false lets both matrix legs complete. If Node 22 fails but Node 20 passes, you see that information immediately instead of having to re-run.
The single biggest improvement you can make to CI speed is caching. A cold pnpm install on a medium project takes 30-45 seconds. With a warm cache, it takes 3-5 seconds. Multiply that across four parallel jobs and you're saving two minutes on every run.
```yaml
- uses: actions/setup-node@v4
  with:
    node-version: 22
    cache: "pnpm"
```

This one-liner caches the pnpm store (~/.local/share/pnpm/store). On cache hit, pnpm install --frozen-lockfile just hard-links from the store instead of downloading. This alone cuts install time by 80% on repeat runs.
If you need more control — say, you want to cache based on the OS too — use actions/cache directly:
```yaml
- uses: actions/cache@v4
  with:
    path: ~/.local/share/pnpm/store
    key: pnpm-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}
    restore-keys: |
      pnpm-${{ runner.os }}-
```

(Cache the store only, not node_modules — see the pitfalls section at the end for why caching installed modules backfires.) The restore-keys fallback is important. If pnpm-lock.yaml changes (new dependency), the exact key won't match, but the prefix match will still restore most of the cached packages. Only the diff gets downloaded.
Next.js has its own build cache in .next/cache. Caching this between runs means incremental builds — only changed pages and components get recompiled.
```yaml
- uses: actions/cache@v4
  with:
    path: .next/cache
    key: nextjs-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}-${{ hashFiles('src/**/*.ts', 'src/**/*.tsx') }}
    restore-keys: |
      nextjs-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}-
      nextjs-${{ runner.os }}-
```

This three-level key strategy means:

- Source and lockfile both unchanged: exact key hit, full cache reuse.
- Source changed, lockfile unchanged: the first restore-key matches, so most of the build cache carries over and only changed files recompile.
- Lockfile changed: the OS-level fallback still restores something, and Next.js invalidates whatever it must.
Real numbers from my project: cold build takes ~55 seconds, cached build takes ~15 seconds. That's a 73% reduction.
Docker builds are where caching gets really impactful. A full Next.js Docker build — installing OS deps, copying source, running pnpm install, running next build — takes 3-4 minutes cold. With layer caching, it's 30-60 seconds.
```yaml
- uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    tags: ghcr.io/${{ github.repository }}:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

type=gha uses GitHub Actions' built-in cache backend. mode=max caches all layers, not just the final ones. This is critical for multi-stage builds where intermediate layers (like pnpm install) are the most expensive to rebuild.
If you're in a monorepo with Turborepo, remote caching is transformative. First build uploads task outputs to the cache. Subsequent builds download instead of recomputing.
```yaml
- run: pnpm turbo build --remote-only
  env:
    TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
    TURBO_TEAM: ${{ vars.TURBO_TEAM }}
```

I've seen monorepo CI times drop from 8 minutes to 90 seconds with Turbo remote cache. The catch: it requires a Vercel account or self-hosted Turbo server. For single-app repos, it's overkill.
If you're deploying to a VPS (or any server), Docker gives you reproducible builds. The same image that runs in CI is the same image that runs in production. No more "it works on my machine" because the machine is the image.
Before we get to the workflow, here's the Dockerfile I use for Next.js:
```dockerfile
# Stage 1: Dependencies
FROM node:22-alpine AS deps
RUN corepack enable && corepack prepare pnpm@latest --activate
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile --prod=false

# Stage 2: Build
FROM node:22-alpine AS builder
RUN corepack enable && corepack prepare pnpm@latest --activate
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
ENV NEXT_TELEMETRY_DISABLED=1
RUN pnpm build

# Stage 3: Production
FROM node:22-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
COPY --from=builder /app/public ./public
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
USER nextjs
EXPOSE 3000
ENV PORT=3000
CMD ["node", "server.js"]
```

Three stages, clear separation. The final image is ~150MB instead of the ~1.2GB you'd get copying everything. Only production artifacts make it to the runner stage.
```yaml
name: Build and Push Docker Image

on:
  push:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

Let me unpack the important decisions here.
I use ghcr.io instead of Docker Hub mainly because GITHUB_TOKEN is automatically available in every workflow — no need to store Docker Hub credentials, and the image lives right next to the code that builds it.

```yaml
platforms: linux/amd64,linux/arm64
```

This line adds maybe 90 seconds to your build, but it's worth it. ARM64 images run natively on M-series Macs (your developers' laptops) and on ARM server instances like AWS Graviton. Without this, your developers on M-series Macs are running x86 images through Rosetta emulation. It works, but it's noticeably slower and occasionally surfaces weird architecture-specific bugs.
QEMU provides the cross-compilation layer. Buildx orchestrates the multi-arch build and pushes a manifest list so Docker automatically pulls the right architecture.
```yaml
tags: |
  type=sha,prefix=
  type=ref,event=branch
  type=raw,value=latest,enable={{is_default_branch}}
```

Every image gets three tags:

- abc1234 (commit SHA): immutable. You can always deploy an exact commit.
- main (branch name): mutable. Points to the latest build from that branch.
- latest: mutable. Only set on the default branch. This is what your server pulls.

Never deploy latest in production without also recording the SHA somewhere. When something breaks, you need to know which latest. I store the deployed SHA in a file on the server that the health endpoint reads.
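Here's a minimal sketch of what such a health endpoint can look like, written as a Next.js App Router route handler. The path is an assumption: it only works if the server's .deployed-sha file is bind-mounted into the container (e.g. adding -v /var/www/akousa.net/.deployed-sha:/app/.deployed-sha:ro to docker run):

```ts
// app/api/health/route.ts (hypothetical sketch)
import { readFile } from "node:fs/promises";

export async function GET() {
  let sha = "unknown";
  try {
    // Assumes the host's .deployed-sha is bind-mounted at this path
    sha = (await readFile("/app/.deployed-sha", "utf8")).trim();
  } catch {
    // File missing (e.g. first boot): report "unknown" rather than failing health
  }
  return Response.json({ status: "ok", sha });
}
```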
This is where it all comes together. CI passes, Docker image is built and pushed, now we need to tell the server to pull the new image and restart.
```yaml
deploy:
  name: Deploy to Production
  needs: [build-and-push]
  runs-on: ubuntu-latest
  environment: production
  steps:
    - name: Deploy via SSH
      uses: appleboy/ssh-action@v1
      with:
        host: ${{ secrets.DEPLOY_HOST }}
        username: ${{ secrets.DEPLOY_USER }}
        key: ${{ secrets.SSH_PRIVATE_KEY }}
        port: ${{ secrets.SSH_PORT }}
        script_stop: true
        script: |
          set -euo pipefail
          APP_DIR="/var/www/akousa.net"
          IMAGE="ghcr.io/${{ github.repository }}:latest"
          DEPLOY_SHA="${{ github.sha }}"

          echo "=== Deploying $DEPLOY_SHA ==="

          # Pull the latest image
          docker pull "$IMAGE"

          # Stop and remove old container
          docker stop akousa-app || true
          docker rm akousa-app || true

          # Start new container
          docker run -d \
            --name akousa-app \
            --restart unless-stopped \
            -e NODE_ENV=production \
            -e DATABASE_URL="${DATABASE_URL}" \
            -p 3000:3000 \
            "$IMAGE"

          # Wait for health check
          echo "Waiting for health check..."
          for i in $(seq 1 30); do
            if curl -sf http://localhost:3000/api/health > /dev/null 2>&1; then
              echo "Health check passed on attempt $i"
              break
            fi
            if [ "$i" -eq 30 ]; then
              echo "Health check failed after 30 attempts"
              exit 1
            fi
            sleep 2
          done

          # Record deployed SHA
          echo "$DEPLOY_SHA" > "$APP_DIR/.deployed-sha"

          # Prune old images
          docker image prune -af --filter "until=168h"

          echo "=== Deploy complete ==="
```

For anything beyond a simple pull-and-restart, I move the logic into a script on the server rather than inlining it in the workflow:
```bash
#!/bin/bash
# /var/www/akousa.net/deploy.sh
set -euo pipefail

APP_DIR="/var/www/akousa.net"
LOG_FILE="$APP_DIR/deploy.log"
IMAGE="ghcr.io/akousa/akousa-net:latest"

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

log "Starting deployment..."

# Login to GHCR
echo "$GHCR_TOKEN" | docker login ghcr.io -u akousa --password-stdin

# Pull with retry
for attempt in 1 2 3; do
  if docker pull "$IMAGE"; then
    log "Image pulled successfully on attempt $attempt"
    break
  fi
  if [ "$attempt" -eq 3 ]; then
    log "ERROR: Failed to pull image after 3 attempts"
    exit 1
  fi
  log "Pull attempt $attempt failed, retrying in 5s..."
  sleep 5
done

# Health check function
health_check() {
  local port=$1
  local max_attempts=30
  for i in $(seq 1 $max_attempts); do
    if curl -sf "http://localhost:$port/api/health" > /dev/null 2>&1; then
      return 0
    fi
    sleep 2
  done
  return 1
}

# Alternate between 3000 and 3001 so consecutive deploys never collide:
# read the port Nginx currently routes to, deploy to the other one
CURRENT_PORT=$(grep -oP 'server 127\.0\.0\.1:\K\d+' /etc/nginx/conf.d/upstream.conf)
if [ "$CURRENT_PORT" = "3000" ]; then NEW_PORT=3001; else NEW_PORT=3000; fi

# Start new container on the alternate port
docker run -d \
  --name akousa-app-new \
  --env-file "$APP_DIR/.env.production" \
  -p "$NEW_PORT:3000" \
  "$IMAGE"

# Verify new container is healthy
if ! health_check "$NEW_PORT"; then
  log "ERROR: New container failed health check. Rolling back."
  docker stop akousa-app-new || true
  docker rm akousa-app-new || true
  exit 1
fi

log "New container healthy. Switching traffic..."

# Switch Nginx upstream
sudo sed -i "s/server 127.0.0.1:$CURRENT_PORT/server 127.0.0.1:$NEW_PORT/" /etc/nginx/conf.d/upstream.conf
sudo nginx -t && sudo nginx -s reload

# Stop old container
docker stop akousa-app || true
docker rm akousa-app || true

# Rename new container so the next deploy finds it under the canonical name
docker rename akousa-app-new akousa-app

log "Deployment complete."
```

The workflow then becomes a single SSH command:
```yaml
script: |
  cd /var/www/akousa.net && ./deploy.sh
```

This is better because: (1) the deploy logic is version-controlled on the server, (2) you can run it manually over SSH for debugging, and (3) you don't have to escape YAML inside YAML inside bash.
"Zero downtime" sounds like marketing speak, but it has a precise meaning: no request gets a connection refused or a 502 during deployment. Here are three real approaches, from simplest to most robust.
If you're running Node.js directly (not in Docker), PM2's cluster mode gives you the easiest zero-downtime path.
```bash
# ecosystem.config.js already has:
#   instances: 2
#   exec_mode: "cluster"
pm2 reload akousa --update-env
```

pm2 reload (not restart) does a rolling restart. It spins up new workers, waits for them to be ready, then kills old workers one by one. At no point are zero workers serving traffic.
The --update-env flag reloads environment variables from the ecosystem config. Without it, your old env persists even after a deploy that changed .env.
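For reference, a minimal ecosystem.config.js matching the comments above. This is a sketch, not the exact file behind this site: the script path assumes a standard Next.js install, so adjust the name, path, and port to your app:

```js
// ecosystem.config.js (sketch)
module.exports = {
  apps: [
    {
      name: "akousa",
      script: "node_modules/next/dist/bin/next", // a JS entry point, required for cluster mode
      args: "start",
      instances: 2,
      exec_mode: "cluster", // rolling reloads need cluster mode
      env: { NODE_ENV: "production", PORT: 3000 },
    },
  ],
};
```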
In your workflow:
```yaml
- name: Deploy and reload PM2
  uses: appleboy/ssh-action@v1
  with:
    host: ${{ secrets.DEPLOY_HOST }}
    username: ${{ secrets.DEPLOY_USER }}
    key: ${{ secrets.SSH_PRIVATE_KEY }}
    script: |
      cd /var/www/akousa.net
      git pull origin main
      pnpm install --frozen-lockfile
      pnpm build
      pm2 reload ecosystem.config.js --update-env
```

This is what I use for this site. It's simple, reliable, and the downtime is literally zero — I've tested it with a load generator running 100 req/s during deploys. Not a single 5xx.
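If you want to run the same experiment, any HTTP load generator will do. A sketch using hey (https://github.com/rakyll/hey): fire it in one terminal, trigger a deploy in another, and check the response-code histogram afterwards:

```bash
# Roughly 100 req/s for two minutes against the live site
hey -z 2m -q 100 -c 1 https://akousa.net/
```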
For Docker deployments, blue/green gives you a clean separation between the old and new versions.
The concept: run the old container ("blue") on port 3000 and the new container ("green") on port 3001. Nginx points to blue. You start green, verify it's healthy, switch Nginx to green, then stop blue.
Nginx upstream config:
```nginx
# /etc/nginx/conf.d/upstream.conf
upstream app_backend {
    server 127.0.0.1:3000;
}
```

```nginx
# /etc/nginx/sites-available/akousa.net
server {
    listen 443 ssl http2;
    server_name akousa.net;

    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
    }
}
```
```bash
#!/bin/bash
set -euo pipefail

CURRENT_PORT=$(grep -oP 'server 127\.0\.0\.1:\K\d+' /etc/nginx/conf.d/upstream.conf)

if [ "$CURRENT_PORT" = "3000" ]; then
  NEW_PORT=3001
  OLD_PORT=3000
else
  NEW_PORT=3000
  OLD_PORT=3001
fi

echo "Current: $OLD_PORT -> New: $NEW_PORT"

# Start new container on the alternate port
docker run -d \
  --name "akousa-app-$NEW_PORT" \
  --env-file /var/www/akousa.net/.env.production \
  -p "$NEW_PORT:3000" \
  "ghcr.io/akousa/akousa-net:latest"

# Wait for health
for i in $(seq 1 30); do
  if curl -sf "http://localhost:$NEW_PORT/api/health" > /dev/null; then
    echo "New container healthy on port $NEW_PORT"
    break
  fi
  [ "$i" -eq 30 ] && { echo "Health check failed"; docker stop "akousa-app-$NEW_PORT"; docker rm "akousa-app-$NEW_PORT"; exit 1; }
  sleep 2
done

# Switch Nginx
sudo sed -i "s/server 127.0.0.1:$OLD_PORT/server 127.0.0.1:$NEW_PORT/" /etc/nginx/conf.d/upstream.conf
sudo nginx -t && sudo nginx -s reload

# Stop old container
sleep 5 # Let in-flight requests complete
docker stop "akousa-app-$OLD_PORT" || true
docker rm "akousa-app-$OLD_PORT" || true

echo "Switched from :$OLD_PORT to :$NEW_PORT"
```

The 5-second sleep after the Nginx reload isn't laziness — it's grace time. Nginx's reload is graceful (existing connections are kept open), but some long-polling connections or streaming responses need time to complete.
For a more structured approach, Docker Compose can manage the blue/green swap:
```yaml
# docker-compose.yml
services:
  app:
    image: ghcr.io/akousa/akousa-net:latest
    restart: unless-stopped
    env_file: .env.production
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 0
        order: stop-first
    ports:
      - "3000:3000"
```

The order: start-first is the key line. It means "start the new container before stopping the old one." Combined with parallelism: 1, you get a rolling update — one container at a time, always maintaining capacity.
Deploy with:

```bash
docker compose pull
docker compose up -d --remove-orphans
```

One caveat: plain docker compose up ignores most of the deploy: section. update_config, rollback_config, and failure_action are honored by Docker Swarm, not by Compose itself, so the start-first rolling update and automatic rollback only kick in when the stack runs under Swarm. Run that way, Swarm won't shift traffic to a new container until its healthcheck passes, and failure_action: rollback reverts to the previous version if it never does. This is as close to Kubernetes-style rolling deployments as you get on a single VPS.
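Getting those semantics is two commands on the VPS (a sketch; a single-node swarm is fine here, and "app" is just a stack name):

```bash
docker swarm init                               # one-time: turn the VPS into a single-node swarm
docker stack deploy -c docker-compose.yml app   # deploys with update_config/rollback_config honored
```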
Secrets management is one of those things that's easy to get "mostly right" and catastrophically wrong in the remaining edge cases.
```yaml
# Set via GitHub UI: Settings > Secrets and variables > Actions
steps:
  - name: Use a secret
    env:
      DB_URL: ${{ secrets.DATABASE_URL }}
    run: |
      # The value is masked in logs
      echo "Connecting to database..."
      # This would print "Connecting to ***" in the logs
      echo "Connecting to $DB_URL"
```

GitHub automatically redacts secret values from log output. If your secret is p@ssw0rd123 and any step prints that string, the logs show ***. This works well, with one caveat: if your secret is short (like a 4-digit PIN), GitHub might not mask it because it could match innocent strings. Keep secrets reasonably complex.
```yaml
jobs:
  deploy-staging:
    environment: staging
    steps:
      - run: echo "Deploying to ${{ secrets.DEPLOY_HOST }}"
        # DEPLOY_HOST = staging.akousa.net

  deploy-production:
    environment: production
    steps:
      - run: echo "Deploying to ${{ secrets.DEPLOY_HOST }}"
        # DEPLOY_HOST = akousa.net
```

Same secret name, different values per environment. The environment field on the job determines which set of secrets is injected.
Production environments should have required reviewers enabled. This means a push to main triggers the workflow, CI runs automatically, but the deploy job pauses and waits for someone to click "Approve" in the GitHub UI. For a solo project, this might feel like overhead. For anything with users, it's a lifesaver the first time you accidentally merge something broken.
Static credentials (AWS access keys, GCP service account JSON files) stored in GitHub Secrets are a liability. They don't expire, they can't be scoped to a specific workflow run, and if they leak, you have to rotate them manually.
OIDC (OpenID Connect) solves this. GitHub Actions acts as an identity provider, and your cloud provider trusts it to issue short-lived credentials on the fly:
```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # Required for OIDC
      contents: read
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: eu-central-1

      - name: Push to ECR
        run: |
          aws ecr get-login-password --region eu-central-1 | \
            docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-central-1.amazonaws.com
```

No access key. No secret key. The configure-aws-credentials action requests a temporary token from AWS STS using GitHub's OIDC token. The token is scoped to the specific repo, branch, and environment. It expires after the workflow run.
Setting this up on the AWS side requires an IAM OIDC identity provider and a role trust policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:akousa/akousa-net:ref:refs/heads/main"
        }
      }
    }
  ]
}
```

The sub condition is crucial. Without it, any repo that somehow obtains your OIDC provider's details could assume the role. With it, only the main branch of your specific repo can.
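For completeness, creating the identity provider on the AWS side is a single CLI call. A sketch: the thumbprint is a placeholder (check the AWS docs for the current value; AWS has been de-emphasizing thumbprint pinning for GitHub's provider):

```bash
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list <current-github-oidc-thumbprint>
```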
GCP has an equivalent setup with Workload Identity Federation. Azure has federated credentials. If your cloud supports OIDC, use it. There's no reason to store static cloud credentials in 2026.
For VPS deployments over SSH, generate a dedicated key pair:
```bash
ssh-keygen -t ed25519 -C "github-actions-deploy" -f deploy_key -N ""
```

Add the public key to the server's ~/.ssh/authorized_keys with restrictions:
```
restrict,command="/var/www/akousa.net/deploy.sh" ssh-ed25519 AAAA... github-actions-deploy
```
The restrict prefix disables port forwarding, agent forwarding, PTY allocation, and X11 forwarding. The command= prefix means this key can only execute the deploy script. Even if the private key is compromised, the attacker can run your deploy script and nothing else.
Add the private key to GitHub Secrets as SSH_PRIVATE_KEY. This is the one static credential I accept — SSH keys with forced commands have a very limited blast radius.
Every PR deserves a preview environment. It catches visual bugs that unit tests miss, lets designers review without checking out code, and makes QA's life dramatically easier.
```yaml
name: Preview Deploy

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  preview:
    runs-on: ubuntu-latest
    environment:
      name: preview-${{ github.event.number }}
      url: https://pr-${{ github.event.number }}.preview.akousa.net
    steps:
      - uses: actions/checkout@v4

      - name: Build preview image
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:pr-${{ github.event.number }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Deploy preview
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.PREVIEW_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            PR_NUM=${{ github.event.number }}
            PORT=$((4000 + PR_NUM))
            IMAGE="ghcr.io/${{ github.repository }}:pr-${PR_NUM}"

            docker pull "$IMAGE"
            docker stop "preview-${PR_NUM}" || true
            docker rm "preview-${PR_NUM}" || true
            docker run -d \
              --name "preview-${PR_NUM}" \
              --restart unless-stopped \
              -e NODE_ENV=preview \
              -p "${PORT}:3000" \
              "$IMAGE"

      - name: Comment PR with preview URL
        uses: actions/github-script@v7
        with:
          script: |
            const url = `https://pr-${{ github.event.number }}.preview.akousa.net`;
            const body = `### Preview Deployment
            | Status | URL |
            |--------|-----|
            | :white_check_mark: Deployed | [${url}](${url}) |

            _Last updated: ${new Date().toISOString()}_
            _Commit: \`${{ github.sha }}\`_`;

            // Find existing comment
            const comments = await github.rest.issues.listComments({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
            });
            const botComment = comments.data.find(c =>
              c.user.type === 'Bot' && c.body.includes('Preview Deployment')
            );

            if (botComment) {
              await github.rest.issues.updateComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                comment_id: botComment.id,
                body,
              });
            } else {
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: context.issue.number,
                body,
              });
            }
```

The preview URL is computed directly from the PR number rather than passed back from the SSH step — a script running on the remote host can't write to $GITHUB_OUTPUT, which only exists on the runner. The port calculation (4000 + PR_NUM) is a pragmatic hack. PR #42 gets port 4042. As long as you don't have more than a few hundred open PRs, there are no collisions. An Nginx wildcard config routes pr-*.preview.akousa.net to the right port.
Preview environments that aren't cleaned up eat disk and memory. Add a cleanup job:
```yaml
name: Cleanup Preview

on:
  pull_request:
    types: [closed]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Remove preview container
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.PREVIEW_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            PR_NUM=${{ github.event.number }}
            docker stop "preview-${PR_NUM}" || true
            docker rm "preview-${PR_NUM}" || true
            docker rmi "ghcr.io/${{ github.repository }}:pr-${PR_NUM}" || true
            echo "Preview for PR #${PR_NUM} cleaned up."

      - name: Deactivate environment
        uses: actions/github-script@v7
        with:
          script: |
            const deployments = await github.rest.repos.listDeployments({
              owner: context.repo.owner,
              repo: context.repo.repo,
              environment: `preview-${{ github.event.number }}`,
            });
            for (const deployment of deployments.data) {
              await github.rest.repos.createDeploymentStatus({
                owner: context.repo.owner,
                repo: context.repo.repo,
                deployment_id: deployment.id,
                state: 'inactive',
              });
            }
```
}In your repository settings (Settings > Branches > Branch protection rules), require these checks before merging:
- lint — No lint errors
- typecheck — No type errors
- test — All tests pass
- build — Project builds successfully

Without this, someone will merge a PR with failing checks. Not maliciously — they'll see "2 of 4 checks passed" and assume the other two are still running. Lock it down.
Also enable "Require branches to be up to date before merging." This forces a re-run of CI after rebasing onto the latest main. It catches the case where two PRs individually pass CI but conflict when combined.
A deployment that nobody knows about is a deployment that nobody trusts. Notifications close the feedback loop.
```yaml
- name: Notify Slack
  if: always()
  uses: slackapi/slack-github-action@v2
  with:
    webhook: ${{ secrets.SLACK_DEPLOY_WEBHOOK }}
    webhook-type: incoming-webhook
    payload: |
      {
        "blocks": [
          {
            "type": "header",
            "text": {
              "type": "plain_text",
              "text": "${{ job.status == 'success' && 'Deploy Successful' || 'Deploy Failed' }}"
            }
          },
          {
            "type": "section",
            "fields": [
              {
                "type": "mrkdwn",
                "text": "*Repository:*\n${{ github.repository }}"
              },
              {
                "type": "mrkdwn",
                "text": "*Branch:*\n${{ github.ref_name }}"
              },
              {
                "type": "mrkdwn",
                "text": "*Commit:*\n<${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}|${{ github.sha }}>"
              },
              {
                "type": "mrkdwn",
                "text": "*Triggered by:*\n${{ github.actor }}"
              }
            ]
          },
          {
            "type": "actions",
            "elements": [
              {
                "type": "button",
                "text": {
                  "type": "plain_text",
                  "text": "View Run"
                },
                "url": "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
              }
            ]
          }
        ]
      }
```

The if: always() is critical. Without it, the notification step is skipped when the deploy fails — which is exactly when you need it most.
For richer deployment tracking, use the GitHub Deployments API. This gives you a deployment history in the repo UI and enables status badges:
```yaml
- name: Create GitHub Deployment
  id: deployment
  uses: actions/github-script@v7
  with:
    script: |
      const deployment = await github.rest.repos.createDeployment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        ref: context.sha,
        environment: 'production',
        auto_merge: false,
        required_contexts: [],
        description: `Deploying ${context.sha.substring(0, 7)} to production`,
      });
      return deployment.data.id;

- name: Deploy
  run: |
    # ... actual deployment steps ...

- name: Update deployment status
  if: always()
  uses: actions/github-script@v7
  with:
    script: |
      const deploymentId = ${{ steps.deployment.outputs.result }};
      await github.rest.repos.createDeploymentStatus({
        owner: context.repo.owner,
        repo: context.repo.repo,
        deployment_id: deploymentId,
        state: '${{ job.status }}' === 'success' ? 'success' : 'failure',
        environment_url: 'https://akousa.net',
        log_url: `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`,
        description: '${{ job.status }}' === 'success'
          ? 'Deployment succeeded'
          : 'Deployment failed',
      });
```

Now your Environments tab in GitHub shows a complete deployment history: who deployed what, when, and whether it succeeded.
For critical deployments, I also trigger an email on failure. Not via GitHub Actions' built-in email (too noisy), but via a targeted webhook:
```yaml
- name: Alert on failure
  if: failure()
  run: |
    curl -X POST "${{ secrets.ALERT_WEBHOOK_URL }}" \
      -H "Content-Type: application/json" \
      -d '{
        "subject": "DEPLOY FAILED: ${{ github.repository }}",
        "body": "Commit: ${{ github.sha }}\nActor: ${{ github.actor }}\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
      }'
```

This is my last line of defense. Slack is great but it's also noisy — people mute channels. A "DEPLOY FAILED" email with a link to the run gets attention.
Here's everything wired together into a single, production-ready workflow. This is very close to what actually deploys this site.
```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      skip_tests:
        description: "Skip tests (emergency deploy)"
        required: false
        type: boolean
        default: false

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

env:
  NODE_VERSION: "22"
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ============================================================
  # CI: Lint, type check, and test in parallel
  # ============================================================
  lint:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Run ESLint
        run: pnpm lint

  typecheck:
    name: Type Check
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Run TypeScript compiler
        run: pnpm tsc --noEmit

  test:
    name: Unit Tests
    if: ${{ !inputs.skip_tests }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Run tests with coverage
        run: pnpm test -- --coverage
      - name: Upload coverage report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/
          retention-days: 7

  # ============================================================
  # Build: Only after CI passes
  # ============================================================
  build:
    name: Build Application
    needs: [lint, typecheck, test]
    if: always() && !cancelled() && needs.lint.result == 'success' && needs.typecheck.result == 'success' && (needs.test.result == 'success' || needs.test.result == 'skipped')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Cache Next.js build
        uses: actions/cache@v4
        with:
          path: .next/cache
          key: nextjs-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}-${{ hashFiles('src/**/*.ts', 'src/**/*.tsx') }}
          restore-keys: |
            nextjs-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}-
            nextjs-${{ runner.os }}-
      - name: Build Next.js application
        run: pnpm build

  # ============================================================
  # Docker: Build and push image (main branch only)
  # ============================================================
  docker:
    name: Build Docker Image
    needs: [build]
    if: github.ref == 'refs/heads/main' && github.event_name != 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up QEMU for multi-platform builds
        uses: docker/setup-qemu-action@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract image metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=raw,value=latest,enable={{is_default_branch}}
      - name: Build and push Docker image
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ============================================================
  # Deploy: SSH into VPS and update
  # ============================================================
  deploy:
    name: Deploy to Production
    needs: [docker]
    if: github.ref == 'refs/heads/main' && github.event_name != 'pull_request'
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://akousa.net
    steps:
      - name: Create GitHub Deployment
        id: deployment
        uses: actions/github-script@v7
        with:
          script: |
            const deployment = await github.rest.repos.createDeployment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              ref: context.sha,
              environment: 'production',
              auto_merge: false,
              required_contexts: [],
              description: `Deploy ${context.sha.substring(0, 7)}`,
            });
            return deployment.data.id;

      - name: Deploy via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.DEPLOY_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          port: ${{ secrets.SSH_PORT }}
          script_stop: true
          command_timeout: 5m
          script: |
            set -euo pipefail
            APP_DIR="/var/www/akousa.net"
            IMAGE="ghcr.io/${{ github.repository }}:latest"
            SHA="${{ github.sha }}"

            echo "=== Deploy $SHA started at $(date) ==="

            # Pull new image
            docker pull "$IMAGE"

            # Alternate ports so consecutive deploys never collide:
            # read the port Nginx currently routes to, deploy to the other one
            CURRENT_PORT=$(grep -oP 'server 127\.0\.0\.1:\K\d+' /etc/nginx/conf.d/upstream.conf)
            if [ "$CURRENT_PORT" = "3000" ]; then NEW_PORT=3001; else NEW_PORT=3000; fi

            # Run new container on the alternate port
            docker run -d \
              --name akousa-app-new \
              --env-file "$APP_DIR/.env.production" \
              -p "$NEW_PORT:3000" \
              "$IMAGE"

            # Health check
            echo "Running health check..."
            for i in $(seq 1 30); do
              if curl -sf "http://localhost:$NEW_PORT/api/health" > /dev/null 2>&1; then
                echo "Health check passed (attempt $i)"
                break
              fi
              if [ "$i" -eq 30 ]; then
                echo "ERROR: Health check failed"
                docker logs akousa-app-new --tail 50
                docker stop akousa-app-new && docker rm akousa-app-new
                exit 1
              fi
              sleep 2
            done

            # Switch traffic
            sudo sed -i "s/server 127.0.0.1:$CURRENT_PORT/server 127.0.0.1:$NEW_PORT/" /etc/nginx/conf.d/upstream.conf
            sudo nginx -t && sudo nginx -s reload

            # Grace period for in-flight requests
            sleep 5

            # Stop old container
            docker stop akousa-app || true
            docker rm akousa-app || true

            # Rename so the next deploy finds it under the canonical name
            docker rename akousa-app-new akousa-app

            # Record deployment
            echo "$SHA" > "$APP_DIR/.deployed-sha"
            echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $SHA" >> "$APP_DIR/deploy.log"

            # Cleanup old images (older than 7 days)
            docker image prune -af --filter "until=168h"

            echo "=== Deploy complete at $(date) ==="
      - name: Update deployment status
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const deploymentId = ${{ steps.deployment.outputs.result }};
            await github.rest.repos.createDeploymentStatus({
              owner: context.repo.owner,
              repo: context.repo.repo,
              deployment_id: deploymentId,
              state: '${{ job.status }}' === 'success' ? 'success' : 'failure',
              environment_url: 'https://akousa.net',
              log_url: `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`,
            });

      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_DEPLOY_WEBHOOK }}
          webhook-type: incoming-webhook
          payload: |
            {
              "blocks": [
                {
                  "type": "header",
                  "text": {
                    "type": "plain_text",
                    "text": "${{ job.status == 'success' && 'Deploy Successful' || 'Deploy Failed' }}"
                  }
                },
                {
                  "type": "section",
                  "fields": [
                    {
                      "type": "mrkdwn",
                      "text": "*Commit:*\n<${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}|`${{ github.sha }}`>"
                    },
                    {
                      "type": "mrkdwn",
                      "text": "*Actor:*\n${{ github.actor }}"
                    }
                  ]
                },
                {
                  "type": "actions",
                  "elements": [
                    {
                      "type": "button",
                      "text": { "type": "plain_text", "text": "View Run" },
                      "url": "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                    }
                  ]
                }
              ]
            }

      - name: Alert on failure
        if: failure()
        run: |
          curl -sf -X POST "${{ secrets.ALERT_WEBHOOK_URL }}" \
            -H "Content-Type: application/json" \
            -d '{
              "subject": "DEPLOY FAILED: ${{ github.repository }}",
              "body": "Commit: ${{ github.sha }}\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }' || true
```

When I push to main:

1. Lint, typecheck, and unit tests run in parallel.
2. Build starts once all three pass.
3. The Docker image is built for amd64 and arm64 and pushed to ghcr.io.
4. The deploy job SSHes into the VPS, swaps containers behind Nginx, records the SHA, and notifies Slack.

When I open a PR:

Only the CI jobs run: lint, typecheck, test, and build. The docker and deploy jobs never start (their if conditions gate them to main branch only).

When I need an emergency deploy (skip tests):

I trigger the workflow manually from the Actions tab with skip_tests: true. The test job is skipped, and the build job's if condition explicitly allows a skipped test result to proceed.

This has been my workflow for two years. It's survived server migrations, Node.js major version upgrades, pnpm replacing npm, and the addition of 15 tools to this site. The total end-to-end time from push to production: 3 minutes 40 seconds on average. The slowest step is the multi-platform Docker build at ~90 seconds. Everything else is cached down to near-instant.
I'll close with the mistakes I made so you don't have to.
Pin your action versions. uses: actions/checkout@v4 is fine, but for production, consider uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 (the full SHA). A compromised action could exfiltrate your secrets. The tj-actions/changed-files incident in 2025 proved this isn't theoretical.
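Pinning doesn't have to mean stagnating: Dependabot will open PRs bumping pinned SHAs for you. A minimal config, assuming the standard .github/workflows layout:

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```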
Don't cache everything. I once cached node_modules directly (not just the pnpm store) and spent two hours debugging a phantom build failure caused by stale native bindings. Cache the package manager store, not the installed modules.
Set timeouts. Every job should have timeout-minutes. The default is 360 minutes (6 hours). If your deploy hangs because the SSH connection dropped, you don't want to discover it six hours later when you've burned through your monthly minutes.
```yaml
jobs:
  deploy:
    timeout-minutes: 15
    runs-on: ubuntu-latest
```

Use concurrency wisely. For PRs, cancel-in-progress: true is always right — nobody cares about the CI result of a commit that's already been force-pushed over. For production deploys, set it to false. You don't want a fast-follow commit to cancel a deploy that's mid-rollout.
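For the production deploy workflow, that looks like:

```yaml
concurrency:
  group: deploy-production
  cancel-in-progress: false # queue new deploys instead of killing a rollout
```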
Test your workflow file. Use act (https://github.com/nektos/act) to run workflows locally. It won't catch everything (secrets aren't available, and the runner environment differs), but it catches YAML syntax errors and obvious logic bugs before you push.
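The invocation takes the event name and, optionally, a single job:

```bash
# Simulate a pull_request event and run only the lint job locally
act pull_request -j lint
```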
Monitor your CI costs. GitHub Actions minutes are free for public repos and cheap for private ones, but they add up. Multi-platform Docker builds are 2x the minutes (one per platform). Matrix test strategies multiply your runtime. Keep an eye on the billing page.
The best CI/CD pipeline is the one you trust. Trust comes from reliability, observability, and incremental improvement. Start with a simple lint-test-build pipeline. Add Docker when you need reproducibility. Add SSH deployment when you need automation. Add notifications when you need confidence. Don't build the full pipeline on day one — you'll get the abstractions wrong.
Build the pipeline you need today, and let it grow with your project.