My complete GitHub Actions setup: parallel test jobs, Docker build caching, SSH deployment to VPS, zero-downtime with PM2 reload, secrets management, and the workflow patterns I've refined over two years.
Every project I've worked on eventually reaches the same inflection point: the deploy process gets too painful to do manually. You forget to run the tests. You build locally but forget to bump the version. You SSH into production and realize the last person who deployed left a stale .env file.
GitHub Actions solved this for me two years ago. Not perfectly on day one — the first workflow I wrote was a 200-line YAML nightmare that timed out half the time and cached nothing. But iteration by iteration, I arrived at something that deploys this site reliably, with zero downtime, in under four minutes.
This is that workflow, explained section by section. Not the docs version. The version that survives contact with production.
Before we get into the full pipeline, you need a clear mental model of how GitHub Actions works. If you've used Jenkins or CircleCI, forget most of what you know. The concepts map loosely, but the execution model is different enough to trip you up.
```yaml
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 6 * * 1" # Every Monday at 6 AM UTC
  workflow_dispatch:
    inputs:
      environment:
        description: "Target environment"
        required: true
        default: "staging"
        type: choice
        options:
          - staging
          - production
```

Four triggers, each serving a different purpose:

- push to main is your production deploy trigger. Code merged? Ship it.
- pull_request runs your CI checks on every PR. This is where lint, type checks, and tests live.
- schedule is cron for your repo. I use it for weekly dependency audit scans and stale cache cleanup.
- workflow_dispatch gives you a manual "Deploy" button in the GitHub UI with input parameters. Invaluable when you need to deploy staging without a code change — maybe you updated an environment variable or need to repull a base Docker image.

One thing that bites people: pull_request runs against the merge commit, not the PR branch HEAD. This means your CI is testing what the code will look like after merge. That's actually what you want, but it surprises people when a green branch goes red after a rebase.
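If you ever do need CI to run against the branch HEAD rather than the merge commit (say, to reproduce exactly what a contributor pushed), you can override the checkout ref. A minimal sketch using the standard pull_request event context:

```yaml
- uses: actions/checkout@v4
  with:
    # Check out the PR branch HEAD instead of the synthetic merge commit
    ref: ${{ github.event.pull_request.head.sha }}
```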
```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
```

Jobs run in parallel by default. Each job gets a fresh VM (the "runner"). ubuntu-latest gives you a reasonably beefy machine — 4 vCPUs, 16 GB RAM as of 2026. That's free for public repos, 2000 minutes/month for private.
Steps run sequentially within a job. Each uses: step pulls in a reusable action from the marketplace. Each run: step executes a shell command.
The --frozen-lockfile flag is crucial. Without it, pnpm install might update your lockfile during CI, which means you're not testing the same dependencies that were committed. I've seen this cause phantom test failures that vanish locally because the lockfile on the developer's machine is already correct.
```yaml
env:
  NODE_ENV: production
  NEXT_TELEMETRY_DISABLED: 1

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy
        env:
          SSH_PRIVATE_KEY: ${{ secrets.SSH_PRIVATE_KEY }}
          DEPLOY_HOST: ${{ secrets.DEPLOY_HOST }}
        run: |
          echo "$SSH_PRIVATE_KEY" > key.pem
          chmod 600 key.pem
          ssh -i key.pem deploy@"$DEPLOY_HOST" "cd /var/www/app && ./deploy.sh"
```

Environment variables set with env: at the workflow level are plain text, visible in logs. Use these for non-sensitive config: NODE_ENV, telemetry flags, feature toggles.
Secrets (${{ secrets.X }}) are encrypted at rest, masked in logs, and only available to workflows in the same repo. They're set in Settings > Secrets and variables > Actions.
The environment: production line is significant. GitHub Environments let you scope secrets to specific deployment targets. Your staging SSH key and your production SSH key can both be named SSH_PRIVATE_KEY but hold different values depending on which environment the job targets. This also unlocks required reviewers — you can gate production deploys behind a manual approval.
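You can script all of this instead of clicking through the UI. A sketch with the GitHub CLI; the secret names match the ones used later in this post, and the key files are assumed to exist locally:

```bash
# Same secret name, scoped to two different environments
gh secret set SSH_PRIVATE_KEY --env staging < staging_deploy_key
gh secret set SSH_PRIVATE_KEY --env production < deploy_key
gh secret set DEPLOY_HOST --env production --body "akousa.net"
```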
Here's how I structure the CI half of the pipeline. The goal: catch every category of error in the fastest possible time.
```yaml
name: CI

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint

  typecheck:
    name: Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm tsc --noEmit

  test:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm test -- --coverage
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: coverage-report
          path: coverage/
          retention-days: 7

  build:
    name: Build
    needs: [lint, typecheck, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: "pnpm"
      - run: pnpm install --frozen-lockfile
      - run: pnpm build
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: .next/
          retention-days: 1
```

Lint, typecheck, and test run in parallel. They have no dependencies on each other. A type error doesn't block lint from running, and a failed test doesn't need to wait for the type checker. On a typical run, all three complete in 30-60 seconds while running simultaneously.
Build waits for all three. The needs: [lint, typecheck, test] line means the build job only starts if lint, typecheck, AND test all pass. There's no point building a project that has lint errors or type failures.
concurrency with cancel-in-progress: true is a huge time saver. If you push two commits in quick succession, the first CI run is cancelled. Without this, you'll have stale runs consuming your minutes budget and cluttering the checks UI.
Coverage upload with if: always() means you get the coverage report even when tests fail. This is useful for debugging — you can see which tests failed and what they covered.
By default, if one job in a matrix fails, GitHub cancels the others. For CI, I actually want this behavior — if lint fails, I don't care about the test results. Fix the lint first.
But for test matrices (say, testing across Node 20 and Node 22), you might want to see all failures at once:
```yaml
test:
  strategy:
    fail-fast: false
    matrix:
      node-version: [20, 22]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: pnpm/action-setup@v4
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ matrix.node-version }}
        cache: "pnpm"
    - run: pnpm install --frozen-lockfile
    - run: pnpm test
```

fail-fast: false lets both matrix legs complete. If Node 22 fails but Node 20 passes, you see that information immediately instead of having to re-run.
The single biggest improvement you can make to CI speed is caching. A cold pnpm install on a medium project takes 30-45 seconds. With a warm cache, it takes 3-5 seconds. Multiply that across four parallel jobs and you're saving two minutes on every run.
```yaml
- uses: actions/setup-node@v4
  with:
    node-version: 22
    cache: "pnpm"
```

This one-liner caches the pnpm store (~/.local/share/pnpm/store). On cache hit, pnpm install --frozen-lockfile just hard-links from the store instead of downloading. This alone cuts install time by 80% on repeat runs.
If you need more control — say, you want to cache based on the OS too — use actions/cache directly:
```yaml
- uses: actions/cache@v4
  with:
    path: ~/.local/share/pnpm/store
    key: pnpm-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}
    restore-keys: |
      pnpm-${{ runner.os }}-
```

(Cache the store only, not node_modules — see the pitfalls section at the end for why caching installed modules backfires.) The restore-keys fallback is important. If pnpm-lock.yaml changes (new dependency), the exact key won't match, but the prefix match will still restore most of the cached packages. Only the diff gets downloaded.
Next.js has its own build cache in .next/cache. Caching this between runs means incremental builds — only changed pages and components get recompiled.
```yaml
- uses: actions/cache@v4
  with:
    path: .next/cache
    key: nextjs-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}-${{ hashFiles('src/**/*.ts', 'src/**/*.tsx') }}
    restore-keys: |
      nextjs-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}-
      nextjs-${{ runner.os }}-
```

This three-level key strategy means:

- Source and lockfile both unchanged: exact key hit, full cache reuse.
- Source changed, lockfile unchanged: the first restore-key matches, so most of the build cache carries over and only changed files recompile.
- Lockfile changed: the OS-level fallback still restores something, and Next.js invalidates whatever it must.
Real numbers from my project: cold build takes ~55 seconds, cached build takes ~15 seconds. That's a 73% reduction.
Docker builds are where caching gets really impactful. A full Next.js Docker build — installing OS deps, copying source, running pnpm install, running next build — takes 3-4 minutes cold. With layer caching, it's 30-60 seconds.
```yaml
- uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    tags: ghcr.io/${{ github.repository }}:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

type=gha uses GitHub Actions' built-in cache backend. mode=max caches all layers, not just the final ones. This is critical for multi-stage builds where intermediate layers (like pnpm install) are the most expensive to rebuild.
If you're in a monorepo with Turborepo, remote caching is transformative. First build uploads task outputs to the cache. Subsequent builds download instead of recomputing.
```yaml
- run: pnpm turbo build --remote-only
  env:
    TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
    TURBO_TEAM: ${{ vars.TURBO_TEAM }}
```

I've seen monorepo CI times drop from 8 minutes to 90 seconds with Turbo remote cache. The catch: it requires a Vercel account or self-hosted Turbo server. For single-app repos, it's overkill.
If you're deploying to a VPS (or any server), Docker gives you reproducible builds. The same image that runs in CI is the same image that runs in production. No more "it works on my machine" because the machine is the image.
Before we get to the workflow, here's the Dockerfile I use for Next.js:
```dockerfile
# Stage 1: Dependencies
FROM node:22-alpine AS deps
RUN corepack enable && corepack prepare pnpm@latest --activate
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile --prod=false

# Stage 2: Build
FROM node:22-alpine AS builder
RUN corepack enable && corepack prepare pnpm@latest --activate
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
ENV NEXT_TELEMETRY_DISABLED=1
RUN pnpm build

# Stage 3: Production
FROM node:22-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
COPY --from=builder /app/public ./public
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
USER nextjs
EXPOSE 3000
ENV PORT=3000
CMD ["node", "server.js"]
```

Three stages, clear separation. The final image is ~150MB instead of the ~1.2GB you'd get copying everything. Only production artifacts make it to the runner stage.
```yaml
name: Build and Push Docker Image

on:
  push:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

Let me unpack the important decisions here.
I use ghcr.io instead of Docker Hub mainly because GITHUB_TOKEN is automatically available in every workflow — no need to store Docker Hub credentials, and the image lives right next to the code that builds it.

```yaml
platforms: linux/amd64,linux/arm64
```

This line adds maybe 90 seconds to your build, but it's worth it. ARM64 images run natively on M-series Macs (your developers' laptops) and on ARM server instances like AWS Graviton. Without this, your developers on M-series Macs are running x86 images through Rosetta emulation. It works, but it's noticeably slower and occasionally surfaces weird architecture-specific bugs.
QEMU provides the cross-compilation layer. Buildx orchestrates the multi-arch build and pushes a manifest list so Docker automatically pulls the right architecture.
```yaml
tags: |
  type=sha,prefix=
  type=ref,event=branch
  type=raw,value=latest,enable={{is_default_branch}}
```

Every image gets three tags:

- abc1234 (commit SHA): immutable. You can always deploy an exact commit.
- main (branch name): mutable. Points to the latest build from that branch.
- latest: mutable. Only set on the default branch. This is what your server pulls.

Never deploy latest in production without also recording the SHA somewhere. When something breaks, you need to know which latest. I store the deployed SHA in a file on the server that the health endpoint reads.
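Here's a minimal sketch of what such a health endpoint can look like, written as a Next.js App Router route handler. The path is an assumption: it only works if the server's .deployed-sha file is bind-mounted into the container (e.g. adding -v /var/www/akousa.net/.deployed-sha:/app/.deployed-sha:ro to docker run):

```ts
// app/api/health/route.ts (hypothetical sketch)
import { readFile } from "node:fs/promises";

export async function GET() {
  let sha = "unknown";
  try {
    // Assumes the host's .deployed-sha is bind-mounted at this path
    sha = (await readFile("/app/.deployed-sha", "utf8")).trim();
  } catch {
    // File missing (e.g. first boot): report "unknown" rather than failing health
  }
  return Response.json({ status: "ok", sha });
}
```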
This is where it all comes together. CI passes, Docker image is built and pushed, now we need to tell the server to pull the new image and restart.
```yaml
deploy:
  name: Deploy to Production
  needs: [build-and-push]
  runs-on: ubuntu-latest
  environment: production
  steps:
    - name: Deploy via SSH
      uses: appleboy/ssh-action@v1
      with:
        host: ${{ secrets.DEPLOY_HOST }}
        username: ${{ secrets.DEPLOY_USER }}
        key: ${{ secrets.SSH_PRIVATE_KEY }}
        port: ${{ secrets.SSH_PORT }}
        script_stop: true
        script: |
          set -euo pipefail
          APP_DIR="/var/www/akousa.net"
          IMAGE="ghcr.io/${{ github.repository }}:latest"
          DEPLOY_SHA="${{ github.sha }}"

          echo "=== Deploying $DEPLOY_SHA ==="

          # Pull the latest image
          docker pull "$IMAGE"

          # Stop and remove old container
          docker stop akousa-app || true
          docker rm akousa-app || true

          # Start new container
          docker run -d \
            --name akousa-app \
            --restart unless-stopped \
            -e NODE_ENV=production \
            -e DATABASE_URL="${DATABASE_URL}" \
            -p 3000:3000 \
            "$IMAGE"

          # Wait for health check
          echo "Waiting for health check..."
          for i in $(seq 1 30); do
            if curl -sf http://localhost:3000/api/health > /dev/null 2>&1; then
              echo "Health check passed on attempt $i"
              break
            fi
            if [ "$i" -eq 30 ]; then
              echo "Health check failed after 30 attempts"
              exit 1
            fi
            sleep 2
          done

          # Record deployed SHA
          echo "$DEPLOY_SHA" > "$APP_DIR/.deployed-sha"

          # Prune old images
          docker image prune -af --filter "until=168h"

          echo "=== Deploy complete ==="
```

For anything beyond a simple pull-and-restart, I move the logic into a script on the server rather than inlining it in the workflow:
```bash
#!/bin/bash
# /var/www/akousa.net/deploy.sh
set -euo pipefail

APP_DIR="/var/www/akousa.net"
LOG_FILE="$APP_DIR/deploy.log"
IMAGE="ghcr.io/akousa/akousa-net:latest"

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

log "Starting deployment..."

# Login to GHCR
echo "$GHCR_TOKEN" | docker login ghcr.io -u akousa --password-stdin

# Pull with retry
for attempt in 1 2 3; do
  if docker pull "$IMAGE"; then
    log "Image pulled successfully on attempt $attempt"
    break
  fi
  if [ "$attempt" -eq 3 ]; then
    log "ERROR: Failed to pull image after 3 attempts"
    exit 1
  fi
  log "Pull attempt $attempt failed, retrying in 5s..."
  sleep 5
done

# Health check function
health_check() {
  local port=$1
  local max_attempts=30
  for i in $(seq 1 $max_attempts); do
    if curl -sf "http://localhost:$port/api/health" > /dev/null 2>&1; then
      return 0
    fi
    sleep 2
  done
  return 1
}

# Alternate between 3000 and 3001 so consecutive deploys never collide:
# read the port Nginx currently routes to, deploy to the other one
CURRENT_PORT=$(grep -oP 'server 127\.0\.0\.1:\K\d+' /etc/nginx/conf.d/upstream.conf)
if [ "$CURRENT_PORT" = "3000" ]; then NEW_PORT=3001; else NEW_PORT=3000; fi

# Start new container on the alternate port
docker run -d \
  --name akousa-app-new \
  --env-file "$APP_DIR/.env.production" \
  -p "$NEW_PORT:3000" \
  "$IMAGE"

# Verify new container is healthy
if ! health_check "$NEW_PORT"; then
  log "ERROR: New container failed health check. Rolling back."
  docker stop akousa-app-new || true
  docker rm akousa-app-new || true
  exit 1
fi

log "New container healthy. Switching traffic..."

# Switch Nginx upstream
sudo sed -i "s/server 127.0.0.1:$CURRENT_PORT/server 127.0.0.1:$NEW_PORT/" /etc/nginx/conf.d/upstream.conf
sudo nginx -t && sudo nginx -s reload

# Stop old container
docker stop akousa-app || true
docker rm akousa-app || true

# Rename new container so the next deploy finds it under the canonical name
docker rename akousa-app-new akousa-app

log "Deployment complete."
```

The workflow then becomes a single SSH command:
```yaml
script: |
  cd /var/www/akousa.net && ./deploy.sh
```

This is better because: (1) the deploy logic is version-controlled on the server, (2) you can run it manually over SSH for debugging, and (3) you don't have to escape YAML inside YAML inside bash.
"Zero downtime" sounds like marketing speak, but it has a precise meaning: no request gets a connection refused or a 502 during deployment. Here are three real approaches, from simplest to most robust.
If you're running Node.js directly (not in Docker), PM2's cluster mode gives you the easiest zero-downtime path.
```bash
# ecosystem.config.js already has:
#   instances: 2
#   exec_mode: "cluster"
pm2 reload akousa --update-env
```

pm2 reload (not restart) does a rolling restart. It spins up new workers, waits for them to be ready, then kills old workers one by one. At no point are zero workers serving traffic.
The --update-env flag reloads environment variables from the ecosystem config. Without it, your old env persists even after a deploy that changed .env.
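For reference, a minimal ecosystem.config.js matching the comments above. This is a sketch, not the exact file behind this site: the script path assumes a standard Next.js install, so adjust the name, path, and port to your app:

```js
// ecosystem.config.js (sketch)
module.exports = {
  apps: [
    {
      name: "akousa",
      script: "node_modules/next/dist/bin/next", // a JS entry point, required for cluster mode
      args: "start",
      instances: 2,
      exec_mode: "cluster", // rolling reloads need cluster mode
      env: { NODE_ENV: "production", PORT: 3000 },
    },
  ],
};
```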
In your workflow:
```yaml
- name: Deploy and reload PM2
  uses: appleboy/ssh-action@v1
  with:
    host: ${{ secrets.DEPLOY_HOST }}
    username: ${{ secrets.DEPLOY_USER }}
    key: ${{ secrets.SSH_PRIVATE_KEY }}
    script: |
      cd /var/www/akousa.net
      git pull origin main
      pnpm install --frozen-lockfile
      pnpm build
      pm2 reload ecosystem.config.js --update-env
```

This is what I use for this site. It's simple, reliable, and the downtime is literally zero — I've tested it with a load generator running 100 req/s during deploys. Not a single 5xx.
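If you want to run the same experiment, any HTTP load generator will do. A sketch using hey (https://github.com/rakyll/hey): fire it in one terminal, trigger a deploy in another, and check the response-code histogram afterwards:

```bash
# Roughly 100 req/s for two minutes against the live site
hey -z 2m -q 100 -c 1 https://akousa.net/
```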
For Docker deployments, blue/green gives you a clean separation between the old and new versions.
The concept: run the old container ("blue") on port 3000 and the new container ("green") on port 3001. Nginx points to blue. You start green, verify it's healthy, switch Nginx to green, then stop blue.
Nginx upstream config:
```nginx
# /etc/nginx/conf.d/upstream.conf
upstream app_backend {
    server 127.0.0.1:3000;
}
```

```nginx
# /etc/nginx/sites-available/akousa.net
server {
    listen 443 ssl http2;
    server_name akousa.net;

    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
    }
}
```
```bash
#!/bin/bash
set -euo pipefail

CURRENT_PORT=$(grep -oP 'server 127\.0\.0\.1:\K\d+' /etc/nginx/conf.d/upstream.conf)

if [ "$CURRENT_PORT" = "3000" ]; then
  NEW_PORT=3001
  OLD_PORT=3000
else
  NEW_PORT=3000
  OLD_PORT=3001
fi

echo "Current: $OLD_PORT -> New: $NEW_PORT"

# Start new container on the alternate port
docker run -d \
  --name "akousa-app-$NEW_PORT" \
  --env-file /var/www/akousa.net/.env.production \
  -p "$NEW_PORT:3000" \
  "ghcr.io/akousa/akousa-net:latest"

# Wait for health
for i in $(seq 1 30); do
  if curl -sf "http://localhost:$NEW_PORT/api/health" > /dev/null; then
    echo "New container healthy on port $NEW_PORT"
    break
  fi
  [ "$i" -eq 30 ] && { echo "Health check failed"; docker stop "akousa-app-$NEW_PORT"; docker rm "akousa-app-$NEW_PORT"; exit 1; }
  sleep 2
done

# Switch Nginx
sudo sed -i "s/server 127.0.0.1:$OLD_PORT/server 127.0.0.1:$NEW_PORT/" /etc/nginx/conf.d/upstream.conf
sudo nginx -t && sudo nginx -s reload

# Stop old container
sleep 5 # Let in-flight requests complete
docker stop "akousa-app-$OLD_PORT" || true
docker rm "akousa-app-$OLD_PORT" || true

echo "Switched from :$OLD_PORT to :$NEW_PORT"
```

The 5-second sleep after the Nginx reload isn't laziness — it's grace time. Nginx's reload is graceful (existing connections are kept open), but some long-polling connections or streaming responses need time to complete.
For a more structured approach, Docker Compose can manage the blue/green swap:
```yaml
# docker-compose.yml
services:
  app:
    image: ghcr.io/akousa/akousa-net:latest
    restart: unless-stopped
    env_file: .env.production
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 0
        order: stop-first
    ports:
      - "3000:3000"
```

The order: start-first is the key line. It means "start the new container before stopping the old one." Combined with parallelism: 1, you get a rolling update — one container at a time, always maintaining capacity.
Deploy with:

```bash
docker compose pull
docker compose up -d --remove-orphans
```

One caveat: plain docker compose up ignores most of the deploy: section. update_config, rollback_config, and failure_action are honored by Docker Swarm, not by Compose itself, so the start-first rolling update and automatic rollback only kick in when the stack runs under Swarm. Run that way, Swarm won't shift traffic to a new container until its healthcheck passes, and failure_action: rollback reverts to the previous version if it never does. This is as close to Kubernetes-style rolling deployments as you get on a single VPS.
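Getting those semantics is two commands on the VPS (a sketch; a single-node swarm is fine here, and "app" is just a stack name):

```bash
docker swarm init                               # one-time: turn the VPS into a single-node swarm
docker stack deploy -c docker-compose.yml app   # deploys with update_config/rollback_config honored
```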
Secrets management is one of those things that's easy to get "mostly right" and catastrophically wrong in the remaining edge cases.
```yaml
# Set via GitHub UI: Settings > Secrets and variables > Actions
steps:
  - name: Use a secret
    env:
      DB_URL: ${{ secrets.DATABASE_URL }}
    run: |
      # The value is masked in logs
      echo "Connecting to database..."
      # This would print "Connecting to ***" in the logs
      echo "Connecting to $DB_URL"
```

GitHub automatically redacts secret values from log output. If your secret is p@ssw0rd123 and any step prints that string, the logs show ***. This works well, with one caveat: if your secret is short (like a 4-digit PIN), GitHub might not mask it because it could match innocent strings. Keep secrets reasonably complex.
```yaml
jobs:
  deploy-staging:
    environment: staging
    steps:
      - run: echo "Deploying to ${{ secrets.DEPLOY_HOST }}"
        # DEPLOY_HOST = staging.akousa.net

  deploy-production:
    environment: production
    steps:
      - run: echo "Deploying to ${{ secrets.DEPLOY_HOST }}"
        # DEPLOY_HOST = akousa.net
```

Same secret name, different values per environment. The environment field on the job determines which set of secrets is injected.
Production environments should have required reviewers enabled. This means a push to main triggers the workflow, CI runs automatically, but the deploy job pauses and waits for someone to click "Approve" in the GitHub UI. For a solo project, this might feel like overhead. For anything with users, it's a lifesaver the first time you accidentally merge something broken.
Static credentials (AWS access keys, GCP service account JSON files) stored in GitHub Secrets are a liability. They don't expire, they can't be scoped to a specific workflow run, and if they leak, you have to rotate them manually.
OIDC (OpenID Connect) solves this. GitHub Actions acts as an identity provider, and your cloud provider trusts it to issue short-lived credentials on the fly:
```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # Required for OIDC
      contents: read
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: eu-central-1

      - name: Push to ECR
        run: |
          aws ecr get-login-password --region eu-central-1 | \
            docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-central-1.amazonaws.com
```

No access key. No secret key. The configure-aws-credentials action requests a temporary token from AWS STS using GitHub's OIDC token. The token is scoped to the specific repo, branch, and environment. It expires after the workflow run.
Setting this up on the AWS side requires an IAM OIDC identity provider and a role trust policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:akousa/akousa-net:ref:refs/heads/main"
        }
      }
    }
  ]
}
```

The sub condition is crucial. Without it, any repo that somehow obtains your OIDC provider's details could assume the role. With it, only the main branch of your specific repo can.
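For completeness, creating the identity provider on the AWS side is a single CLI call. A sketch: the thumbprint is a placeholder (check the AWS docs for the current value; AWS has been de-emphasizing thumbprint pinning for GitHub's provider):

```bash
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list <current-github-oidc-thumbprint>
```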
GCP has an equivalent setup with Workload Identity Federation. Azure has federated credentials. If your cloud supports OIDC, use it. There's no reason to store static cloud credentials in 2026.
For VPS deployments over SSH, generate a dedicated key pair:
```bash
ssh-keygen -t ed25519 -C "github-actions-deploy" -f deploy_key -N ""
```

Add the public key to the server's ~/.ssh/authorized_keys with restrictions:
```
restrict,command="/var/www/akousa.net/deploy.sh" ssh-ed25519 AAAA... github-actions-deploy
```
The restrict prefix disables port forwarding, agent forwarding, PTY allocation, and X11 forwarding. The command= prefix means this key can only execute the deploy script. Even if the private key is compromised, the attacker can run your deploy script and nothing else.
Add the private key to GitHub Secrets as SSH_PRIVATE_KEY. This is the one static credential I accept — SSH keys with forced commands have a very limited blast radius.
Every PR deserves a preview environment. It catches visual bugs that unit tests miss, lets designers review without checking out code, and makes QA's life dramatically easier.
```yaml
name: Preview Deploy

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  preview:
    runs-on: ubuntu-latest
    environment:
      name: preview-${{ github.event.number }}
      url: https://pr-${{ github.event.number }}.preview.akousa.net
    steps:
      - uses: actions/checkout@v4

      - name: Build preview image
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:pr-${{ github.event.number }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Deploy preview
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.PREVIEW_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            PR_NUM=${{ github.event.number }}
            PORT=$((4000 + PR_NUM))
            IMAGE="ghcr.io/${{ github.repository }}:pr-${PR_NUM}"

            docker pull "$IMAGE"
            docker stop "preview-${PR_NUM}" || true
            docker rm "preview-${PR_NUM}" || true
            docker run -d \
              --name "preview-${PR_NUM}" \
              --restart unless-stopped \
              -e NODE_ENV=preview \
              -p "${PORT}:3000" \
              "$IMAGE"

      - name: Comment PR with preview URL
        uses: actions/github-script@v7
        with:
          script: |
            const url = `https://pr-${{ github.event.number }}.preview.akousa.net`;
            const body = `### Preview Deployment
            | Status | URL |
            |--------|-----|
            | :white_check_mark: Deployed | [${url}](${url}) |

            _Last updated: ${new Date().toISOString()}_
            _Commit: \`${{ github.sha }}\`_`;

            // Find existing comment
            const comments = await github.rest.issues.listComments({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
            });
            const botComment = comments.data.find(c =>
              c.user.type === 'Bot' && c.body.includes('Preview Deployment')
            );

            if (botComment) {
              await github.rest.issues.updateComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                comment_id: botComment.id,
                body,
              });
            } else {
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: context.issue.number,
                body,
              });
            }
```

The preview URL is computed directly from the PR number rather than passed back from the SSH step — a script running on the remote host can't write to $GITHUB_OUTPUT, which only exists on the runner. The port calculation (4000 + PR_NUM) is a pragmatic hack. PR #42 gets port 4042. As long as you don't have more than a few hundred open PRs, there are no collisions. An Nginx wildcard config routes pr-*.preview.akousa.net to the right port.
Preview environments that aren't cleaned up eat disk and memory. Add a cleanup job:
```yaml
name: Cleanup Preview

on:
  pull_request:
    types: [closed]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Remove preview container
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.PREVIEW_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            PR_NUM=${{ github.event.number }}
            docker stop "preview-${PR_NUM}" || true
            docker rm "preview-${PR_NUM}" || true
            docker rmi "ghcr.io/${{ github.repository }}:pr-${PR_NUM}" || true
            echo "Preview for PR #${PR_NUM} cleaned up."

      - name: Deactivate environment
        uses: actions/github-script@v7
        with:
          script: |
            const deployments = await github.rest.repos.listDeployments({
              owner: context.repo.owner,
              repo: context.repo.repo,
              environment: `preview-${{ github.event.number }}`,
            });
            for (const deployment of deployments.data) {
              await github.rest.repos.createDeploymentStatus({
                owner: context.repo.owner,
                repo: context.repo.repo,
                deployment_id: deployment.id,
                state: 'inactive',
              });
            }
```
}In your repository settings (Settings > Branches > Branch protection rules), require these checks before merging:
- lint — No lint errors
- typecheck — No type errors
- test — All tests pass
- build — Project builds successfully

Without this, someone will merge a PR with failing checks. Not maliciously — they'll see "2 of 4 checks passed" and assume the other two are still running. Lock it down.
Also enable "Require branches to be up to date before merging." This forces a re-run of CI after rebasing onto the latest main. It catches the case where two PRs individually pass CI but conflict when combined.
A deployment that nobody knows about is a deployment that nobody trusts. Notifications close the feedback loop.
```yaml
- name: Notify Slack
  if: always()
  uses: slackapi/slack-github-action@v2
  with:
    webhook: ${{ secrets.SLACK_DEPLOY_WEBHOOK }}
    webhook-type: incoming-webhook
    payload: |
      {
        "blocks": [
          {
            "type": "header",
            "text": {
              "type": "plain_text",
              "text": "${{ job.status == 'success' && 'Deploy Successful' || 'Deploy Failed' }}"
            }
          },
          {
            "type": "section",
            "fields": [
              {
                "type": "mrkdwn",
                "text": "*Repository:*\n${{ github.repository }}"
              },
              {
                "type": "mrkdwn",
                "text": "*Branch:*\n${{ github.ref_name }}"
              },
              {
                "type": "mrkdwn",
                "text": "*Commit:*\n<${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}|${{ github.sha }}>"
              },
              {
                "type": "mrkdwn",
                "text": "*Triggered by:*\n${{ github.actor }}"
              }
            ]
          },
          {
            "type": "actions",
            "elements": [
              {
                "type": "button",
                "text": {
                  "type": "plain_text",
                  "text": "View Run"
                },
                "url": "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
              }
            ]
          }
        ]
      }
```

The if: always() is critical. Without it, the notification step is skipped when the deploy fails — which is exactly when you need it most.
For richer deployment tracking, use the GitHub Deployments API. This gives you a deployment history in the repo UI and enables status badges:
```yaml
- name: Create GitHub Deployment
  id: deployment
  uses: actions/github-script@v7
  with:
    script: |
      const deployment = await github.rest.repos.createDeployment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        ref: context.sha,
        environment: 'production',
        auto_merge: false,
        required_contexts: [],
        description: `Deploying ${context.sha.substring(0, 7)} to production`,
      });
      return deployment.data.id;

- name: Deploy
  run: |
    # ... actual deployment steps ...

- name: Update deployment status
  if: always()
  uses: actions/github-script@v7
  with:
    script: |
      const deploymentId = ${{ steps.deployment.outputs.result }};
      await github.rest.repos.createDeploymentStatus({
        owner: context.repo.owner,
        repo: context.repo.repo,
        deployment_id: deploymentId,
        state: '${{ job.status }}' === 'success' ? 'success' : 'failure',
        environment_url: 'https://akousa.net',
        log_url: `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`,
        description: '${{ job.status }}' === 'success'
          ? 'Deployment succeeded'
          : 'Deployment failed',
      });
```

Now your Environments tab in GitHub shows a complete deployment history: who deployed what, when, and whether it succeeded.
For critical deployments, I also trigger an email on failure. Not via GitHub Actions' built-in email (too noisy), but via a targeted webhook:
```yaml
- name: Alert on failure
  if: failure()
  run: |
    curl -X POST "${{ secrets.ALERT_WEBHOOK_URL }}" \
      -H "Content-Type: application/json" \
      -d '{
        "subject": "DEPLOY FAILED: ${{ github.repository }}",
        "body": "Commit: ${{ github.sha }}\nActor: ${{ github.actor }}\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
      }'
```

This is my last line of defense. Slack is great but it's also noisy — people mute channels. A "DEPLOY FAILED" email with a link to the run gets attention.
Here's everything wired together into a single, production-ready workflow. This is very close to what actually deploys this site.
```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      skip_tests:
        description: "Skip tests (emergency deploy)"
        required: false
        type: boolean
        default: false

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

env:
  NODE_VERSION: "22"
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ============================================================
  # CI: Lint, type check, and test in parallel
  # ============================================================
  lint:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Run ESLint
        run: pnpm lint

  typecheck:
    name: Type Check
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Run TypeScript compiler
        run: pnpm tsc --noEmit

  test:
    name: Unit Tests
    if: ${{ !inputs.skip_tests }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Run tests with coverage
        run: pnpm test -- --coverage
      - name: Upload coverage report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/
          retention-days: 7

  # ============================================================
  # Build: Only after CI passes
  # ============================================================
  build:
    name: Build Application
    needs: [lint, typecheck, test]
    if: always() && !cancelled() && needs.lint.result == 'success' && needs.typecheck.result == 'success' && (needs.test.result == 'success' || needs.test.result == 'skipped')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "pnpm"
      - name: Install dependencies
        run: pnpm install --frozen-lockfile
      - name: Cache Next.js build
        uses: actions/cache@v4
        with:
          path: .next/cache
          key: nextjs-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}-${{ hashFiles('src/**/*.ts', 'src/**/*.tsx') }}
          restore-keys: |
            nextjs-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}-
            nextjs-${{ runner.os }}-
      - name: Build Next.js application
        run: pnpm build

  # ============================================================
  # Docker: Build and push image (main branch only)
  # ============================================================
  docker:
    name: Build Docker Image
    needs: [build]
    if: github.ref == 'refs/heads/main' && github.event_name != 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up QEMU for multi-platform builds
        uses: docker/setup-qemu-action@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract image metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=raw,value=latest,enable={{is_default_branch}}
      - name: Build and push Docker image
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ============================================================
  # Deploy: SSH into VPS and update
  # ============================================================
  deploy:
    name: Deploy to Production
    needs: [docker]
    if: github.ref == 'refs/heads/main' && github.event_name != 'pull_request'
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://akousa.net
    steps:
      - name: Create GitHub Deployment
        id: deployment
        uses: actions/github-script@v7
        with:
          script: |
            const deployment = await github.rest.repos.createDeployment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              ref: context.sha,
              environment: 'production',
              auto_merge: false,
              required_contexts: [],
              description: `Deploy ${context.sha.substring(0, 7)}`,
            });
            return deployment.data.id;

      - name: Deploy via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.DEPLOY_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          port: ${{ secrets.SSH_PORT }}
          script_stop: true
          command_timeout: 5m
          script: |
            set -euo pipefail
            APP_DIR="/var/www/akousa.net"
            IMAGE="ghcr.io/${{ github.repository }}:latest"
            SHA="${{ github.sha }}"

            echo "=== Deploy $SHA started at $(date) ==="

            # Pull new image
            docker pull "$IMAGE"

            # Alternate ports so consecutive deploys never collide:
            # read the port Nginx currently routes to, deploy to the other one
            CURRENT_PORT=$(grep -oP 'server 127\.0\.0\.1:\K\d+' /etc/nginx/conf.d/upstream.conf)
            if [ "$CURRENT_PORT" = "3000" ]; then NEW_PORT=3001; else NEW_PORT=3000; fi

            # Run new container on the alternate port
            docker run -d \
              --name akousa-app-new \
              --env-file "$APP_DIR/.env.production" \
              -p "$NEW_PORT:3000" \
              "$IMAGE"

            # Health check
            echo "Running health check..."
            for i in $(seq 1 30); do
              if curl -sf "http://localhost:$NEW_PORT/api/health" > /dev/null 2>&1; then
                echo "Health check passed (attempt $i)"
                break
              fi
              if [ "$i" -eq 30 ]; then
                echo "ERROR: Health check failed"
                docker logs akousa-app-new --tail 50
                docker stop akousa-app-new && docker rm akousa-app-new
                exit 1
              fi
              sleep 2
            done

            # Switch traffic
            sudo sed -i "s/server 127.0.0.1:$CURRENT_PORT/server 127.0.0.1:$NEW_PORT/" /etc/nginx/conf.d/upstream.conf
            sudo nginx -t && sudo nginx -s reload

            # Grace period for in-flight requests
            sleep 5

            # Stop old container
            docker stop akousa-app || true
            docker rm akousa-app || true

            # Rename so the next deploy finds it under the canonical name
            docker rename akousa-app-new akousa-app

            # Record deployment
            echo "$SHA" > "$APP_DIR/.deployed-sha"
            echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $SHA" >> "$APP_DIR/deploy.log"

            # Cleanup old images (older than 7 days)
            docker image prune -af --filter "until=168h"

            echo "=== Deploy complete at $(date) ==="
      - name: Update deployment status
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const deploymentId = ${{ steps.deployment.outputs.result }};
            await github.rest.repos.createDeploymentStatus({
              owner: context.repo.owner,
              repo: context.repo.repo,
              deployment_id: deploymentId,
              state: '${{ job.status }}' === 'success' ? 'success' : 'failure',
              environment_url: 'https://akousa.net',
              log_url: `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`,
            });

      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_DEPLOY_WEBHOOK }}
          webhook-type: incoming-webhook
          payload: |
            {
              "blocks": [
                {
                  "type": "header",
                  "text": {
                    "type": "plain_text",
                    "text": "${{ job.status == 'success' && 'Deploy Successful' || 'Deploy Failed' }}"
                  }
                },
                {
                  "type": "section",
                  "fields": [
                    {
                      "type": "mrkdwn",
                      "text": "*Commit:*\n<${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}|`${{ github.sha }}`>"
                    },
                    {
                      "type": "mrkdwn",
                      "text": "*Actor:*\n${{ github.actor }}"
                    }
                  ]
                },
                {
                  "type": "actions",
                  "elements": [
                    {
                      "type": "button",
                      "text": { "type": "plain_text", "text": "View Run" },
                      "url": "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                    }
                  ]
                }
              ]
            }

      - name: Alert on failure
        if: failure()
        run: |
          curl -sf -X POST "${{ secrets.ALERT_WEBHOOK_URL }}" \
            -H "Content-Type: application/json" \
            -d '{
              "subject": "DEPLOY FAILED: ${{ github.repository }}",
              "body": "Commit: ${{ github.sha }}\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }' || true
```

When I push to main:

1. Lint, typecheck, and unit tests run in parallel.
2. Build starts once all three pass.
3. The Docker image is built for amd64 and arm64 and pushed to ghcr.io.
4. The deploy job SSHes into the VPS, swaps containers behind Nginx, records the SHA, and notifies Slack.

When I open a PR:

Only the CI jobs run: lint, typecheck, test, and build. The docker and deploy jobs never start (their if conditions gate them to main branch only).

When I need an emergency deploy (skip tests):

I trigger the workflow manually from the Actions tab with skip_tests: true. The test job is skipped, and the build job's if condition explicitly allows a skipped test result to proceed.

This has been my workflow for two years. It's survived server migrations, Node.js major version upgrades, pnpm replacing npm, and the addition of 15 tools to this site. The total end-to-end time from push to production: 3 minutes 40 seconds on average. The slowest step is the multi-platform Docker build at ~90 seconds. Everything else is cached down to near-instant.
I'll close with the mistakes I made so you don't have to.
Pin your action versions. uses: actions/checkout@v4 is fine, but for production, consider uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 (the full SHA). A compromised action could exfiltrate your secrets. The tj-actions/changed-files incident in 2025 proved this isn't theoretical.
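Pinning doesn't have to mean stagnating: Dependabot will open PRs bumping pinned SHAs for you. A minimal config, assuming the standard .github/workflows layout:

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```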
Don't cache everything. I once cached node_modules directly (not just the pnpm store) and spent two hours debugging a phantom build failure caused by stale native bindings. Cache the package manager store, not the installed modules.
Set timeouts. Every job should have timeout-minutes. The default is 360 minutes (6 hours). If your deploy hangs because the SSH connection dropped, you don't want to discover it six hours later when you've burned through your monthly minutes.
```yaml
jobs:
  deploy:
    timeout-minutes: 15
    runs-on: ubuntu-latest
```

Use concurrency wisely. For PRs, cancel-in-progress: true is always right — nobody cares about the CI result of a commit that's already been force-pushed over. For production deploys, set it to false. You don't want a fast-follow commit to cancel a deploy that's mid-rollout.
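For the production deploy workflow, that looks like:

```yaml
concurrency:
  group: deploy-production
  cancel-in-progress: false # queue new deploys instead of killing a rollout
```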
Test your workflow file. Use act (https://github.com/nektos/act) to run workflows locally. It won't catch everything (secrets aren't available, and the runner environment differs), but it catches YAML syntax errors and obvious logic bugs before you push.
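The invocation takes the event name and, optionally, a single job:

```bash
# Simulate a pull_request event and run only the lint job locally
act pull_request -j lint
```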
Monitor your CI costs. GitHub Actions minutes are free for public repos and cheap for private ones, but they add up. Multi-platform Docker builds are 2x the minutes (one per platform). Matrix test strategies multiply your runtime. Keep an eye on the billing page.
The best CI/CD pipeline is the one you trust. Trust comes from reliability, observability, and incremental improvement. Start with a simple lint-test-build pipeline. Add Docker when you need reproducibility. Add SSH deployment when you need automation. Add notifications when you need confidence. Don't build the full pipeline on day one — you'll get the abstractions wrong.
Build the pipeline you need today, and let it grow with your project.