---
title: "The Definitive GitHub Actions Debugging Guide: 65+ Real Errors and How to Fix Them"
description: "Every GitHub Actions error message, root cause, and fix in one place. From YAML gotchas to OIDC failures — the debugging reference you'll actually bookmark."
date: 2026-05-29
tags: ["GitHub Actions", "DevOps", "CI/CD", "Debugging", "GitHub"]
canonical: https://htek.dev/articles/github-actions-debugging-guide
---
GitHub Actions is the CI/CD backbone for millions of repositories. It's also the source of some of the most confusing, silent, and undocumented failure modes in modern DevOps.

I've spent years debugging Actions workflows — first across [500+ repository migrations at an enterprise scale](/articles/lessons-from-500-github-migrations), then building [agentic DevOps platforms](/articles/agentic-devops-next-evolution-of-shift-left) that push Actions to its limits. This guide is the result: every error message I've collected, every silent failure I've traced, and every workaround that actually works.

**This is a reference guide, not a tutorial.** Bookmark it. Search it when something breaks. Each section gives the actual error message (so you can Ctrl+F or Google it) or, for silent failures, the symptom, followed by the root cause and a copy-paste fix.

## Quick Diagnosis Flowchart

![Quick diagnosis flowchart showing 6 debugging paths for GitHub Actions failures](/images/articles/github-actions-debugging-guide/diagnosis-flowchart.webp)
*Start here: identify your failure category before diving into 65+ specific scenarios.*

Before diving into 65+ scenarios, start here:

1. **Workflow never appears in Actions tab?** → [YAML Syntax Issues](#yaml-syntax--validation-errors) or [Trigger Problems](#trigger-problems)
2. **Workflow runs but a step fails?** → Check the error message against the sections below
3. **Workflow runs but produces wrong results silently?** → [Silent Failures](#silent-failures-the-most-dangerous-category)
4. **Secrets are empty or permissions denied?** → [Secrets & Permissions](#secrets-permissions--authentication)
5. **Cache miss or artifact not found?** → [Caching & Artifacts](#caching-artifacts--dependencies)
6. **Jobs cancelled unexpectedly?** → [Concurrency Issues](#concurrency--timing)

> **Pro tip:** Install [`actionlint`](https://github.com/rhysd/actionlint) right now. It catches many of the syntax and context issues in this guide *before* you push. Run it locally or add it to your CI: `uses: raven-actions/actionlint@v2`.

---

## YAML Syntax & Validation Errors

These errors prevent GitHub from parsing your workflow. Depending on the trigger, either no run appears at all or a run fails instantly with an "Invalid workflow file" annotation and no job logs.

### Unexpected or Typo'd YAML Keys

**Error:**
```
The workflow is not valid. .github/workflows/ci.yml (Line: 6, Col: 5):
Unexpected value 'default'

unexpected key "Shell" for step to run shell command. expected one of
"continue-on-error", "env", "id", "if", "name", "run", "shell",
"timeout-minutes", "working-directory" [syntax-check]
```

**Root cause:** GitHub Actions validates workflow keys by exact name: they are case-sensitive and must be spelled exactly as documented. `default:` is not `defaults:`. `Shell:` is not `shell:`. `branch:` is not `branches:`.

**Fix:** Use `actionlint` to catch these before pushing. Common corrections:
- `default:` → `defaults:`
- `branch:` → `branches:`
- `Shell:` → `shell:`

Standard YAML linters (`yamllint`, Python `yaml.safe_load()`) won't catch these because the YAML is syntactically valid — it's semantically wrong for GitHub Actions.

### Missing Required Keys

**Error:**
```
"runs-on" section is missing in job "test" [syntax-check]
"jobs" section should not be empty [syntax-check]
```

**Fix:** Every job needs `runs-on:` and at least one entry in `steps:`. Matrix keys are compared case-insensitively — `node` and `NODE` cannot coexist.

### Expression Syntax Errors

**Error:**
```
got unexpected character '"' while lexing expression...
do you mean string literals? only single quotes are available
for string delimiter [expression]
```

**Root cause:** GitHub Actions expressions use a custom mini-language, not JavaScript. Double quotes are not valid string delimiters. The `+` operator doesn't exist for concatenation.

**Fix:**
```yaml
# ❌ Wrong
run: echo "${{ "hello" }}"
run: echo "${{ var1 + var2 }}"

# ✅ Correct
run: echo "${{ 'hello' }}"
run: echo "${{ format('{0}{1}', var1, var2) }}"
```

### Context Variable Type Errors

**Error:**
```
receiver of object dereference "owner" must be type of object but
got "string" [expression]
```

**Root cause:** `github.repository` is a string (`"owner/repo"`), not an object. People try `github.repository.owner` expecting the org name.

**Fix:** Use `github.repository_owner` for the owner. Use `toJSON(env)` to dump environment variables, not `${{ env }}` (which outputs the string `'Object'`).
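Inside `run:` steps the same string is exported as the `GITHUB_REPOSITORY` environment variable (and the owner alone as `GITHUB_REPOSITORY_OWNER`). If a shell step needs both halves, plain parameter expansion is enough. A minimal sketch; the function names and the `octo-org/hello-world` fallback are placeholders for running it outside Actions:

```bash
#!/usr/bin/env bash
# Split an "owner/repo" string such as $GITHUB_REPOSITORY into its two parts.
# github.repository is a plain string, so treat it as one.
repo_owner() { printf '%s\n' "${1%%/*}"; }  # drop everything from the first "/"
repo_name()  { printf '%s\n' "${1#*/}"; }   # drop everything up to the first "/"

full=${GITHUB_REPOSITORY:-octo-org/hello-world}  # placeholder when run locally
echo "owner=$(repo_owner "$full") name=$(repo_name "$full")"
```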

### `secrets.*` in Unexpected Contexts — Silent Failures

**Error:** No error. The workflow behaves unexpectedly or steps are silently skipped.

**Root cause:** GitHub's [context availability table](https://docs.github.com/en/actions/learn-github-actions/contexts#context-availability) does not list `secrets` for `if:` conditions, and the docs state that secrets cannot be referenced directly in `if:` conditionals. Conditions that reference them anyway are a common source of surprises, particularly in composite actions, reusable workflows, or when the secret is undefined: an undefined secret evaluates to an empty string, so the condition quietly takes a branch you didn't expect.

**Fix:**
```yaml
# ❌ Unsupported: secrets can't be referenced directly in if:
- if: ${{ secrets.MY_SECRET != '' }}
  run: echo "has secret"

# ✅ Map to env first, then check the env var (the documented pattern)
- env:
    MY_SECRET: ${{ secrets.MY_SECRET }}
  run: |
    if [ -n "$MY_SECRET" ]; then
      echo "has secret"
    fi
```

This pattern is especially dangerous because the failure mode is silence — no error, no notification. The env-mapping approach is more explicit and `actionlint` can validate it.

### `env` Context Unavailable in Reusable Workflow `with:`

**Error:**
```
Unrecognized named-value: 'env'. Located at position 1 within
expression: env.SOMETHING
```

**Root cause:** The `env` context is [not available](https://github.com/actions/runner/issues/2372) in the `with:` block when calling reusable workflows. This is a confirmed open bug with 226+ reactions.

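**Fix:** Pass the value through a context that *is* allowed in a reusable workflow's `with:` block, such as `vars`, `inputs`, `needs`, or `matrix`. For a computed value, have an earlier job expose it as an output and pass `needs.<job>.outputs.<name>`; otherwise hardcode it. Either way, this remains a known platform limitation.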

### `if:` Conditionals Always Evaluating to `true`

**Error:** No error. The step always runs regardless of condition.

**Root cause:** Using YAML block scalar `|`, trailing spaces, or wrapping `${{ }}` with extra characters makes the condition a non-empty string — which is always truthy.

```yaml
# ❌ Always true — trailing newline from |
if: |
  ${{ github.event_name == 'push' }}

# ❌ Always true — trailing space
if: "${{ github.event_name == 'push' }} "

# ❌ Always true — extra characters between ${{ }} blocks
if: ${{ github.event_name == 'push' }} && ${{ github.ref_name == 'main' }}
```

**Fix:**
```yaml
# ✅ Correct — no extra characters
if: github.event_name == 'push'

# ✅ Correct — single expression, no wrapping needed
if: github.event_name == 'push' && github.ref_name == 'main'
```

### Boolean Inputs Are Strings in Composite Actions

```yaml
# In composite action — this is ALWAYS false:
if: ${{ inputs.realRun == true }}
```

**Root cause:** Composite actions receive all inputs as strings, even when declared with `type: boolean`. This is a [confirmed bug](https://github.com/actions/runner/issues/2238) with 117+ reactions.

**Fix:** Compare to the string `'true'`:
```yaml
if: ${{ inputs.realRun == 'true' }}
```

### Composite Actions: No `defaults:` Support

**Root cause:** Composite actions do not support the `defaults:` key. You cannot set a default shell. Every `run:` step must explicitly specify `shell:`.

**Fix:**
```yaml
runs:
  using: composite
  steps:
    - run: echo "hello"
      shell: bash        # Required on EVERY step
    - run: echo "world"
      shell: bash        # Must repeat
```

### Tab Characters in YAML

**Error:**
```
found a tab character where an indentation space is expected
```

**Fix:** YAML does not allow tabs for indentation. In VS Code: View → Render Whitespace. Add to `.editorconfig`:
```ini
[*.{yml,yaml}]
indent_style = space
indent_size = 2
```

---

## Silent Failures: The Most Dangerous Category

![Silent failures in CI/CD — everything looks green but hidden problems lurk beneath the surface](/images/articles/github-actions-debugging-guide/silent-failures.webp)
*The most dangerous bugs are the ones your pipeline says passed.*

These are the scenarios where *nothing visibly breaks* — your workflow just does the wrong thing.

### Scheduled Workflows Silently Disabled After 60 Days

**Symptom:** A cron workflow that's been running for months just stops. No notification.

**Root cause:** GitHub [automatically disables](https://github.com/orgs/community/discussions/86087) `schedule`-triggered workflows in public repositories after 60 days without repository activity (for example, no commits). Workflow runs themselves don't count as activity.

**Fix:**
```yaml
- uses: gautamkrishnar/keepalive-workflow@v2
  with:
    time_elapsed: '45'  # triggers 15 days before the 60-day cutoff
```

Or re-enable manually:
```bash
gh workflow enable "Workflow Name" --repo OWNER/REPO
```

### `GITHUB_TOKEN` Cannot Trigger Downstream Workflows

**Symptom:** A workflow pushes a commit or creates a tag, but the expected downstream workflow (triggered by `on: push`) never fires.

**Root cause:** This is [by design](https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#using-the-github_token-in-a-workflow). Events caused by `GITHUB_TOKEN` (such as a push or tag it creates) do not start new workflow runs, with the exception of `workflow_dispatch` and `repository_dispatch` events. It's GitHub's recursion prevention mechanism.

**Fix:** Use a GitHub App installation token or a PAT:
```yaml
- uses: actions/create-github-app-token@v1
  id: app-token
  with:
    app-id: ${{ vars.APP_ID }}
    private-key: ${{ secrets.APP_PRIVATE_KEY }}

- uses: actions/checkout@v4
  with:
    token: ${{ steps.app-token.outputs.token }}
```

### Cache Rate Limiting Falls Through as "Cache Not Found"

**Error:**
```
Warning: Failed to restore: Failed to GetCacheEntryDownloadURL:
Rate limited: Failed request: (429) Too Many Requests
Cache not found for input keys: ...
```

**Root cause:** When the [cache API rate limits](https://github.com/actions/cache/issues/1758) you, the action reports it as a cache miss — not a rate limit error. Your build proceeds without cache, silently slower.

**Fix:** Don't trigger hundreds of parallel matrix jobs all saving caches simultaneously. Stagger cache operations or use fewer, broader cache keys.

### Fork PR Secrets Evaluate to Empty Strings

**Symptom:** A contributor opens a PR from a fork. Secret-dependent steps fail or skip silently.

**Root cause:** Secrets are [not passed](https://docs.github.com/en/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions#using-secrets-in-a-workflow) to workflows triggered by `pull_request` from forks. This is a deliberate security boundary.

**Fix:** Design CI to not require secrets for tests. For deployment previews after code review, use `pull_request_target` with a mandatory label gate:
```yaml
on:
  pull_request_target:
    types: [labeled]

jobs:
  deploy-preview:
    if: github.event.label.name == 'safe to test'
    # ...
```

> ⚠️ **Security warning:** Never checkout fork code with `pull_request_target` and then run it with repository secrets. This creates a [pwn-request vulnerability](https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/).

---

## Runner & Environment Problems

### Self-Hosted Runner Registration & Update Loops

**Error:**
```
Runner update in progress, do not shutdown runner.
Downloading 2.277.1 runner... Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
[...loops again...]
```

**Root cause:** Containerized runners built on older Ubuntu images (18.04) hit glibc incompatibility when auto-update downloads a newer runner binary.

**Fix:**
1. Rebuild container on Ubuntu 22.04+
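2. Disable auto-update: register with `./config.sh --disableupdate` (some community runner images expose this as an environment variable such as `DISABLE_AUTO_UPDATE=1`)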
3. Add `rm -rf /home/runner/actions-runner` to container entrypoint before `./config.sh`
4. Add watchdog cron polling `GET /orgs/{org}/actions/runners` every 5 minutes

### Runner Out of Disk Space

**Error:**
```
No space left on device (os error 28)
```

**Root cause:** GitHub-hosted `ubuntu-latest` runners have ~14GB usable, but pre-installed toolchains (Android SDK ~8GB, .NET ~1.5GB, Haskell ~5GB) consume most of it.

**Fix:** Add a cleanup step before heavy builds:
```yaml
- name: Free Disk Space
  uses: jlumbroso/free-disk-space@main
  with:
    tool-cache: false
    android: true
    dotnet: true
    haskell: true
    large-packages: true
```
This reclaims ~10-15GB.
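Cleanup helps, but a full disk still surfaces as an obscure `os error 28` deep inside a build. A small guard step can fail early with a readable annotation instead. This is a sketch, not part of `free-disk-space`; the `require_free_gb` name and thresholds are illustrative, and it assumes POSIX `df -P` output (1K blocks, available space in column 4):

```bash
#!/usr/bin/env bash
# require_free_gb MIN_GB [PATH]
# Fails with a readable message when PATH has less than MIN_GB available.
require_free_gb() {
  local min_gb=$1 path=${2:-/} avail_kb avail_gb
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  if [ -z "$avail_kb" ]; then
    echo "::error::Could not read free space for ${path}"
    return 1
  fi
  avail_gb=$(( avail_kb / 1024 / 1024 ))
  if (( avail_kb < min_gb * 1024 * 1024 )); then
    # ::error:: is a workflow command; the runner shows it as an annotation
    echo "::error::Only ${avail_gb} GB free on ${path}; need at least ${min_gb} GB"
    return 1
  fi
  echo "${avail_gb} GB free on ${path}"
}
```

Paste it into a `run:` step placed after the cleanup step and call, say, `require_free_gb 20` before the heavy build; pick the threshold from what your build actually needs.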

### Environment Variables Not Persisting Between Steps

**Error:**
```
Warning: The `set-output` command is deprecated and will be disabled soon.
```

**Root cause:** `::set-output` and `::set-env` [were deprecated](https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/) in favor of environment files.

**Fix:**
```yaml
# ❌ Deprecated
- run: echo "::set-output name=dir::$(yarn cache dir)"

# ✅ Current
- run: echo "dir=$(yarn cache dir)" >> $GITHUB_OUTPUT

# For multi-line values (NAME<<DELIMITER syntax):
- run: |
    echo "MY_VAR<<EOF" >> "$GITHUB_ENV"
    echo "$multiline_value" >> "$GITHUB_ENV"
    echo "EOF" >> "$GITHUB_ENV"
```
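You can exercise step scripts that write to these files without pushing a commit by pointing them at a temp file. A minimal sketch, assuming bash with `od` and `/dev/urandom` available; `write_multiline` is a hypothetical helper. It uses a random delimiter rather than a fixed `EOF`, because a value containing a line that reads exactly `EOF` would otherwise end the block early:

```bash
#!/usr/bin/env bash
# write_multiline FILE NAME VALUE
# Appends NAME/VALUE to an Actions environment file ($GITHUB_ENV or $GITHUB_OUTPUT)
# using the NAME<<DELIMITER ... DELIMITER syntax with a random delimiter.
write_multiline() {
  local file=$1 name=$2 value=$3 delim
  delim="ghadelimiter_$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')"
  {
    printf '%s<<%s\n' "$name" "$delim"
    printf '%s\n' "$value"
    printf '%s\n' "$delim"
  } >> "$file"
}

# Local dry run: outside a runner, use a temp file as the "environment file".
out_file=$(mktemp)
write_multiline "$out_file" changelog $'line one\nEOF\nline two'
cat "$out_file"
```

In a real step you would call `write_multiline "$GITHUB_OUTPUT" changelog "$value"` and read the result as `steps.<id>.outputs.changelog`.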

### Tools Not Found in Next Step (PATH Issues)

**Error:**
```
/bin/bash: my-tool: command not found
```

**Root cause:** Each `run:` step spawns a fresh shell. `export PATH=...` is lost when that step ends.

**Fix:** Write to `$GITHUB_PATH`, not `PATH`:
```yaml
- name: Install tool
  run: |
    pip install my-cli-tool
    echo "$HOME/.local/bin" >> $GITHUB_PATH

- name: Use tool  # PATH is now updated
  run: my-cli-tool --version
```

### Docker Not Available on Runner

**Error:**
```
Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
Is the docker daemon running?
```

**Root cause:** `ubuntu-latest-slim`, ARC containers, and self-hosted runners without DinD don't expose Docker.

**Fix:**
- Standard `ubuntu-latest`: Docker is available natively
- ARC/containerized: Use DinD sidecar or switch to JavaScript/composite actions
- For private registry pulls, add `docker/login-action` before container actions

### Service Container Connectivity

**Error:**
```
connection to server at "localhost", port 5432 failed: Connection refused
```

**Root cause:** In containerized jobs (`container:` at job level), service containers are on a Docker bridge network. `localhost` doesn't work.

**Fix:** Always add health checks, and use the service label as hostname in containerized jobs:
```yaml
services:
  postgres:
    image: postgres:15
    env:
      POSTGRES_PASSWORD: password
    ports:
      - 5432:5432
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5
```

For containerized jobs, connect to `postgres:5432` (the service label), not `localhost:5432`.
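Health checks make the runner hold the job until the container reports healthy, but when you can't add them, or you want to tell "wrong host" apart from "not up yet", a readiness loop inside the step helps. A minimal sketch using bash's `/dev/tcp` redirection (supported by Ubuntu's bash, but not by every shell); `wait_for_port` and `DB_HOST` are hypothetical names:

```bash
#!/usr/bin/env bash
# wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Polls until HOST:PORT accepts a TCP connection or the timeout expires.
wait_for_port() {
  local host=$1 port=$2 timeout=${3:-30}
  local deadline=$(( SECONDS + timeout ))
  while (( SECONDS < deadline )); do
    # Opening /dev/tcp/HOST/PORT succeeds only if the TCP connect succeeds;
    # the subshell closes the connection again immediately.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      echo "${host}:${port} is accepting connections"
      return 0
    fi
    sleep 1
  done
  echo "Timed out after ${timeout}s waiting for ${host}:${port}" >&2
  return 1
}

# Hypothetical usage: DB_HOST=postgres in container jobs, localhost otherwise.
# wait_for_port "${DB_HOST:-localhost}" 5432 60
```

If the loop times out on `localhost` in a containerized job but succeeds on `postgres`, you've confirmed the bridge-network issue above.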

### Runner Image Deprecation

**Error:**
```
No hosted runners with requested label(s): 'ubuntu-18.04' can be found.
sudo: docker-compose: command not found
```

**Fix:**
```yaml
# ❌ Removed
- run: sudo docker-compose up -d

# ✅ Docker Compose v2 plugin syntax
- run: sudo docker compose up -d
```

Track upcoming removals at the [`actions/runner-images` releases](https://github.com/actions/runner-images/releases).

### Windows Runner Gotchas

**Error:**
```
AssertionError: expected '40-learnings\\passesdefaultgate.md' to contain '40-learnings/'
```

**Root cause:** Path separators (`\` vs `/`), missing POSIX tools (`jq`, `sed`), shebangs not honored, CRLF line endings.

**Fix:**
```yaml
defaults:
  run:
    shell: bash  # uses Git Bash on Windows

# Install missing tools
- if: runner.os == 'Windows'
  run: choco install jq -y
  shell: pwsh

# Disable CRLF auto-conversion (run this before actions/checkout)
- run: git config --global core.autocrlf false
```

### Node.js Runtime Deprecation

**Error:**
```
Node.js 16 actions are deprecated. Please update the following actions
to use Node.js 20: actions/checkout@v3, actions/cache@v3
```

**Fix:** Bump to latest major versions of all actions. For own actions, update `action.yml` to `runs.using: node24`. Emergency workaround:
```yaml
env:
  FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: 'true'
```

**Deprecation timeline:** node12 (cutoff mid-2023) → node16 (mid-2024) → node20 (enforcement rolling out 2025-2026). Check the [GitHub Actions changelog](https://github.blog/changelog/label/actions/) for the latest timeline.

---

## Secrets, Permissions & Authentication

![GitHub Actions permission model — nested security layers from repository settings to GITHUB_TOKEN to OIDC federation](/images/articles/github-actions-debugging-guide/permissions-model.webp)
*The GitHub Actions permission model: repo defaults → workflow permissions block → GITHUB_TOKEN scope. The #1 source of 403 errors.*

### `GITHUB_TOKEN` Permission Denied (403)

**Error:**
```
remote: Permission to org/repo.git denied to github-actions[bot].
fatal: unable to access '...': The requested URL returned error: 403
```

**Root cause:** Default `GITHUB_TOKEN` is read-only since GitHub [tightened defaults for new repos and orgs in February 2023](https://github.blog/changelog/2023-02-02-github-actions-updating-the-default-github_token-permissions-to-read-only/).

**Fix:** Add explicit `permissions:` to the job:
```yaml
permissions:
  contents: write       # git push
  pull-requests: write  # PR creation
  packages: write       # GHCR push
```

> **Critical:** The `permissions:` block completely replaces defaults. Any permission not listed becomes `none`. Listing only `contents: write` drops all other permissions including `pull-requests`.

### OIDC Federation Failures with AWS

**Error:**
```
Could not assume role with OIDC: Not authorized to perform
sts:AssumeRoleWithWebIdentity
```

**Root causes and fixes:**

1. **Reusable workflows change the `sub` claim.** The OIDC JWT subject reflects the *calling* repo, not the reusable workflow's repo. IAM trust policies must match the caller.

2. **Missing `permissions: id-token: write`** on the calling job.

3. **Audience mismatch:**
```yaml
- uses: aws-actions/configure-aws-credentials@v4
  with:
    audience: sts.amazonaws.com  # must match trust policy
    role-to-assume: arn:aws:iam::123456789012:role/MyRole
    aws-region: us-east-1
```
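Most `sts:AssumeRoleWithWebIdentity` denials come down to a claim that doesn't match the trust policy, so it helps to see the claims the runner actually issues. A minimal sketch that decodes a JWT's payload segment; it assumes GNU-style `base64 -d`, does **not** verify the signature, and `decode_jwt_payload` is a hypothetical helper. The commented lines use the runner's `ACTIONS_ID_TOKEN_REQUEST_URL` / `ACTIONS_ID_TOKEN_REQUEST_TOKEN` variables, which are only set when the job has `id-token: write`:

```bash
#!/usr/bin/env bash
# decode_jwt_payload TOKEN
# Prints the JSON claims (middle segment) of a JWT. Does NOT verify the signature.
decode_jwt_payload() {
  local payload
  # Take the middle segment and map base64url characters back to base64.
  payload=$(printf '%s' "$1" | cut -d '.' -f 2 | tr '_-' '/+')
  # base64url drops '=' padding; restore it so `base64 -d` accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# In a job with `permissions: id-token: write`:
# token=$(curl -sSf -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
#   "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')
# decode_jwt_payload "$token"
```

Compare the printed `sub` (and, for reusable workflows, `job_workflow_ref`) with the `token.actions.githubusercontent.com:sub` condition in your IAM trust policy.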

### Cross-Repo Access (403)

**Error:**
```
remote: Permission to other-org/other-repo.git denied to github-actions[bot].
```

**Root cause:** `GITHUB_TOKEN` is scoped to a single repository. It cannot access other repos — this is a [security boundary by design](https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication).

**Fix:** Use a GitHub App installation token (recommended) or a PAT:
```yaml
- uses: actions/create-github-app-token@v1
  id: app-token
  with:
    app-id: ${{ vars.APP_ID }}
    private-key: ${{ secrets.APP_PRIVATE_KEY }}
    repositories: "target-repo"

- uses: actions/checkout@v4
  with:
    token: ${{ steps.app-token.outputs.token }}
    repository: org/target-repo
```

### Environment Protection Rules Blocking Deployments

**Error:**
```
This deployment was rejected
```

**Root cause:** The triggering ref doesn't match the environment's allowed branches/tags filter, or the only required reviewer is the person who triggered the workflow while the environment's *Prevent self-review* setting is enabled (self-approval is then blocked).

**Fix:** Ensure the triggering ref matches the environment's branch filter pattern. Add a second reviewer if the triggering user is the sole required reviewer.

### GitHub App Token Generation Failures

**Error:**
```
error:0909006C:PEM routines:get_name:no start line
```

**Root cause:** Private key corrupted during shell escaping or base64 encoding.

**Fix:** Store the raw PEM file directly as a GitHub secret:
```bash
gh secret set APP_PRIVATE_KEY < my-app.private-key.pem
```

Use [`actions/create-github-app-token@v1`](https://github.com/actions/create-github-app-token) (official, node20-native) instead of `tibdex/github-app-token`.
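`no start line` means OpenSSL never found a `-----BEGIN ... PRIVATE KEY-----` line, which is what a base64-encoded copy of the key or a key with escaped `\n` sequences looks like. A pre-flight check before running `gh secret set` can catch both. A sketch under those assumptions; `check_pem` is a hypothetical helper that only inspects the BEGIN/END lines, not the key material:

```bash
#!/usr/bin/env bash
# check_pem FILE
# Verifies FILE has PEM BEGIN/END private-key lines on lines of their own.
check_pem() {
  local file=$1
  if ! grep -Eq '^-----BEGIN ([A-Z]+ )?PRIVATE KEY-----$' "$file"; then
    echo "No BEGIN PRIVATE KEY line in ${file} (base64-encoded or escaped newlines?)" >&2
    return 1
  fi
  if ! grep -Eq '^-----END ([A-Z]+ )?PRIVATE KEY-----$' "$file"; then
    echo "No END PRIVATE KEY line in ${file} (truncated key?)" >&2
    return 1
  fi
  echo "${file} looks like a PEM private key"
}
```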

### Docker Registry Auth (GHCR)

**Error:**
```
denied: installation not allowed to Write organization package
```

**Fix:**
1. Add `permissions: packages: write` to the job
2. For org packages: visit package settings → Manage Actions Access → add the repository with Write access
3. Don't set `DOCKER_CONFIG: $HOME/.docker` at job level — it breaks credential persistence

### Dependabot Secrets Namespace

**Root cause:** Dependabot runs in a separate secrets namespace. Repository secrets are not available to Dependabot-triggered workflows.

**Fix:** Add secrets to both namespaces:
```bash
gh secret set NPM_TOKEN --body "npm_xxx" --app actions
gh secret set NPM_TOKEN --body "npm_xxx" --app dependabot
```

### PAT vs. GITHUB_TOKEN Decision Matrix

| Scenario | Use |
|----------|-----|
| Push to same repo | `GITHUB_TOKEN` + `contents: write` |
| Create PR on same repo | `GITHUB_TOKEN` + `pull-requests: write` |
| Push to different repo | GitHub App token or PAT |
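| Trigger another workflow (push, PR, tag events) | GitHub App token or PAT (events from `GITHUB_TOKEN` don't start new runs, except `workflow_dispatch`/`repository_dispatch`) |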
| Cross-org operations | Classic PAT with `repo` scope |

**Prefer GitHub App tokens over PATs:** PATs are tied to individuals (leave org = token breaks), expire, and are harder to audit.

---

## Caching, Artifacts & Dependencies

### Cache Miss Despite Recent Save

**Error:**
```
Cache not found for input keys: Linux-node-abc123def456
```

**Root causes:**
1. **Branch scoping:** Caches from `main` are accessible to branches, but not vice-versa
2. **Version mismatch:** Changing OS or compression tool changes the cache version hash
3. **Rate limiting:** 429s fall through silently as "cache not found"
4. **Infrastructure outage:** Check [githubstatus.com](https://githubstatus.com)

**Fix:** Always prime cache on the default branch first. Use the [List Caches API](https://docs.github.com/en/rest/actions/cache#list-github-actions-caches-for-a-repository) to debug version mismatches.

### `cache-hit` Output Semantics

```yaml
# ❌ Wrong — cache-hit is empty string (not 'false') on full miss
if: steps.cache.outputs.cache-hit == 'false'

# ✅ Correct — always use != 'true'
if: steps.cache.outputs.cache-hit != 'true'
```

`cache-hit` is `'true'` on exact key match, empty string on miss, and `'false'` on `restore-keys` match. Yes, really.
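If you branch on the output inside a shell step instead of an `if:`, pass it in through an env var and apply the same rule. A sketch; `CACHE_HIT`, `cache_state`, and `needs_install` are hypothetical names, and the three states mirror the semantics described above:

```bash
#!/usr/bin/env bash
# Interpret actions/cache's `cache-hit` output inside a run: step, assuming it is
# passed in as an env var, e.g. CACHE_HIT: ${{ steps.cache.outputs.cache-hit }}
cache_state() {
  case "${1-}" in
    true)  echo exact ;;    # primary key matched exactly
    false) echo partial ;;  # restored from a restore-keys prefix; may be stale
    *)     echo miss ;;     # empty string: nothing restored
  esac
}

# Same rule as `if: steps.cache.outputs.cache-hit != 'true'`:
# install on both a partial restore and a full miss.
needs_install() { [ "${1-}" != "true" ]; }

if needs_install "${CACHE_HIT-}"; then
  echo "cache state: $(cache_state "${CACHE_HIT-}"); installing dependencies"
fi
```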

### Cache Size Limit (10 GB Per Repo)

**Symptom:** Random cache misses on older branches.

**Root cause:** Repos have a [10 GB total cache limit](https://github.com/actions/cache#cache-limits). Oldest caches are LRU-evicted silently.

**Fix:** Clean up branch caches on PR close:
```yaml
on:
  pull_request:
    types: [closed]
jobs:
  cleanup:
    runs-on: ubuntu-latest
    permissions:
      actions: write
    steps:
      - run: |
          for id in $(gh cache list --ref refs/pull/${{ github.event.pull_request.number }}/merge \
            --limit 100 --json id --jq '.[].id'); do
            gh cache delete $id
          done
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GH_REPO: ${{ github.repository }}
```

### `upload-artifact` v3 → v4 Breaking Changes

**Error:**
```
An artifact with the same name already exists for the associated workflow run.
```

**Root cause:** v4 artifacts are [immutable](https://github.com/actions/upload-artifact). Multiple jobs can no longer upload to the same artifact name.

**Fix:**
```yaml
# v4 — unique names per matrix job
- uses: actions/upload-artifact@v4
  with:
    name: build-${{ matrix.os }}-${{ matrix.node }}

# Download all and merge
- uses: actions/download-artifact@v4
  with:
    pattern: build-*
    merge-multiple: true
    path: dist/
```

### Cross-Workflow Artifact Download

**Error:**
```
Unable to download artifact(s): Artifact not found for name: my-artifact
```

**Fix:** Both upload and download must use the **same version family** (v3↔v3 or v4↔v4 — they use different storage backends):
```yaml
- uses: actions/download-artifact@v4
  with:
    name: my-artifact
    github-token: ${{ secrets.GITHUB_TOKEN }}  # required for cross-workflow; needs actions: read
    run-id: ${{ github.event.workflow_run.id }}
```

### `npm ci` Cache Save Timeout

**Error:**
```
The operation was canceled.
```

**Root cause:** Cache save (tar compression) on large `node_modules` exceeds the job timeout. Missing `zstd` in DinD containers forces slow gzip fallback.

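**Fix:** Cache the npm cache directory (`~/.npm` by default), not `node_modules`:
```yaml
- name: Get npm cache directory
  id: npm-cache-dir
  shell: bash
  run: echo "dir=$(npm config get cache)" >> "$GITHUB_OUTPUT"

- uses: actions/cache@v5
  with:
    path: ${{ steps.npm-cache-dir.outputs.dir }}
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
```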

For DinD environments, install `zstd`: `apt-get install -y zstd`.

### Docker Layer Caching

**Error:**
```
cache export feature is currently not supported for docker driver
```

**Fix:** You must use `docker/setup-buildx-action` first — the default Docker driver doesn't support cache export:
```yaml
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v6
  with:
    cache-from: type=gha,scope=${{ github.workflow }}
    cache-to: type=gha,mode=max,scope=${{ github.workflow }}
```

### Cache Corruption

**Error:**
```
tar: Error is not recoverable: exiting now
gzip: stdin: unexpected end of file
```

**Fix:** Delete the corrupt cache via CLI:
```bash
gh cache list --repo owner/repo
gh cache delete <cache-id> --repo owner/repo
```

To make a stalled cache download abort sooner instead of hanging (the default segment timeout is 10 minutes), shorten the timeout:
```yaml
env:
  SEGMENT_DOWNLOAD_TIMEOUT_MINS: 5
```

### Git LFS Files Not Downloaded

**Symptom:** Binary files are 140-byte text pointers instead of actual content.

**Fix:**
```yaml
- uses: actions/checkout@v4
  with:
    lfs: true
    fetch-depth: 1
```

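To cache LFS objects, check out *without* LFS, restore the cache, then pull. Key the cache on the list of LFS object IDs; a key built from `.lfsconfig` rarely changes, so new LFS content would never be saved:
```yaml
- uses: actions/checkout@v4   # lfs: false (the default)
- run: git lfs ls-files --long | cut -d ' ' -f1 | sort > .lfs-assets-id
- uses: actions/cache@v5
  with:
    path: .git/lfs
    key: ${{ runner.os }}-lfs-${{ hashFiles('.lfs-assets-id') }}
- run: git lfs pull
```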

### Lockfile Hash Returns Empty String

**Error:**
```
Cache not found for input keys: Linux-node-
```

**Root cause:** `hashFiles('**/package-lock.json')` matched no files, returning empty string.

**Fix:** Debug with:
```yaml
- run: |
    echo "Hash: ${{ hashFiles('**/package-lock.json') }}"
    find . -name "package-lock.json" -not -path "*/node_modules/*"
```

Correct patterns per ecosystem:
```yaml
# npm
key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
# pip
key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements*.txt', '**/pyproject.toml') }}
# Gradle
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle*', '**/gradle-wrapper.properties') }}
```
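Rather than only echoing the hash, a guard step can fail the job when the lockfile is missing, so a key like `Linux-node-` never gets used. A minimal sketch; `require_lockfile` is a hypothetical helper, and like the debug `find` above it ignores copies under `node_modules`:

```bash
#!/usr/bin/env bash
# require_lockfile [ROOT] [NAME]
# Fails when no NAME file exists under ROOT (ignoring node_modules), the
# situation in which hashFiles() returns an empty string.
require_lockfile() {
  local root=${1:-.} name=${2:-package-lock.json} found
  found=$(find "$root" -name "$name" -not -path '*/node_modules/*' -print -quit)
  if [ -z "$found" ]; then
    # ::error:: is a workflow command; the runner turns it into an annotation
    echo "::error::No ${name} under ${root}; the cache key's hashFiles() part would be empty"
    return 1
  fi
  echo "Found lockfile: ${found}"
}
```

Run it as a step before `actions/cache`, for example `require_lockfile . package-lock.json`, swapping the file name for other ecosystems.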

---

## Trigger Problems

### Workflow Not Triggering At All

**No error. No run appears.**

**Root causes (in priority order):**
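1. Workflow file is missing from the branch the event reads it from: the default branch for `schedule`, `workflow_dispatch` (UI), and `workflow_run`; the pushed branch for `push`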
2. YAML syntax error (silently rejected)
3. Branch filter mismatch (`branches: [master]` but default is `main`)
4. Workflow disabled via UI or inactivity
5. Commit made by `GITHUB_TOKEN` (won't trigger downstream)

**Fix:**
```bash
# Check workflow state (--all includes disabled workflows)
gh workflow list --all
gh workflow view "My Workflow"
```

### `workflow_dispatch` Button Not Showing

**Root causes:**
1. Workflow file not on default branch (most common)
2. No write access to repository
3. Wrong YAML indentation:

```yaml
# ❌ Wrong — nested under push
on:
  push:
    branches: [main]
    workflow_dispatch:      # indented under push

# ✅ Correct — sibling of push
on:
  push:
    branches: [main]
  workflow_dispatch:        # same level as push
```

### Cron Schedule Running Late or Not Running

**Root cause:** GitHub does [not guarantee cron timing](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#schedule). During high load, scheduled runs can be delayed by hours or skipped entirely. Minimum interval is 5 minutes. Public/free-tier repos are deprioritized. All times are UTC.

A [real-world case](https://github.com/leonardaraz/fyndplats-cache-warmer/issues/4): workflow configured for `*/10 * * * *` (expected ~144 runs/day), but only 4 runs fired in 32 hours.

**Fix:** For time-sensitive operations, have an external scheduler trigger `workflow_dispatch` through the API. Treat GitHub-hosted scheduled workflows as best-effort: under load, runs can start an hour or more late, or not at all.
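A sketch of the external trigger, assuming `curl` and a token in `GH_TOKEN` that is allowed to dispatch workflows (for example a fine-grained PAT with Actions write permission). It calls the REST endpoint for creating a workflow dispatch event, which returns `204 No Content` on success; `dispatch_workflow` and the `DRY_RUN` switch are hypothetical conveniences, and the target workflow must declare `on: workflow_dispatch`:

```bash
#!/usr/bin/env bash
# dispatch_workflow OWNER/REPO WORKFLOW_FILE [REF]
# Set DRY_RUN=1 to print the request instead of sending it.
dispatch_workflow() {
  local repo=$1 workflow=$2 ref=${3:-main} url body
  url="https://api.github.com/repos/${repo}/actions/workflows/${workflow}/dispatches"
  body=$(printf '{"ref":"%s"}' "$ref")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'POST %s %s\n' "$url" "$body"
    return 0
  fi
  # --fail makes curl exit non-zero on HTTP errors (4xx/5xx)
  curl --fail --silent --show-error -X POST \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer ${GH_TOKEN:?GH_TOKEN must be set}" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    -d "$body" "$url"
}
```

Call it from any scheduler you control, such as a system crontab entry that runs a script invoking `dispatch_workflow OWNER/REPO your-workflow.yml main`.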

### `workflow_run` Not Firing

**Root causes:**
1. The listener workflow must be on the **default branch**
2. `workflows: ["CI Build"]` must **exactly match** the source workflow's `name:` field
3. Missing `types: [completed]` — without it, fires on both start and finish
4. Source workflow triggered by `GITHUB_TOKEN` (recursion prevention)

**Fix:**
```yaml
on:
  workflow_run:
    workflows: ["CI Build"]     # exact match to name: in source workflow
    types: [completed]

jobs:
  post-build:
    if: github.event.workflow_run.conclusion == 'success'
```

### `repository_dispatch` Returns 204 But Workflow Doesn't Run

**Root cause:** API returns 204 even when `event_type` doesn't match — the mismatch is silent.

**Fix:** Verify `event_type` exactly matches the workflow's `types:`:
```yaml
on:
  repository_dispatch:
    types: [docker-image-updated]  # must EXACTLY match API call
```

### Path Filters Not Working as Expected

**Root cause:** `paths:` and `paths-ignore:` are [mutually exclusive](https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#onpushpull_requestpull_request_targetpathspaths-ignore) — using both on the same event is not supported. `docs` (without `/**`) matches a file literally named `docs`, not the directory.

**Fix:**
```yaml
# Correct: ignore docs directory
on:
  push:
    paths-ignore:
      - 'docs/**'
      - '*.md'      # root-level files only; '**.md' matches every directory
```

### Tag Push vs. Release Published

| Trigger | When It Fires | Use Case |
|---------|--------------|----------|
| `push: tags: [v*]` | On tag push | Binary build |
| `release: types: [created]` | Non-draft release created (drafts don't fire `created`) | Build + attach release assets |
| `release: types: [published]` | Explicit publish | Deploy to prod |

---

## Concurrency & Timing

### Jobs Cancelled Unexpectedly

**Root cause:** Overly broad concurrency group key. Using `group: ${{ github.workflow }}` alone means all runs compete, even on different branches.

**Fix:**
```yaml
# PR workflows — cancel stale runs on same PR
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

# Production deploys — queue, never cancel
concurrency:
  group: deploy-production
  cancel-in-progress: false

# Branch-sensitive — cancel only on non-default branches
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
```

### Empty `head_ref` Causing Cross-Branch Cancellation

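**Root cause:** `github.head_ref` is only set for `pull_request` and `pull_request_target` events; for push events it is empty. If the group key is built from `github.head_ref`, every push-triggered run gets the same key and they cancel each other.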

**Fix:**
```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
```

### Job `needs` Failure Cascading

**Symptom:** A downstream job is `Skipped` even though you want it to run after upstream failure.

**Root cause:** Default `if:` on every job is `success()`, meaning "only run if ALL needs jobs succeeded."

**Fix:**
```yaml
# Always run (notifications, cleanup)
final-job:
  needs: [job-a, job-b]
  if: always()
  runs-on: ubuntu-latest
  steps:
    - if: contains(needs.*.result, 'failure')
      run: exit 1
```

### Default Timeout is 6 Hours

**Root cause:** A hung test suite silently consumes a runner for 6 hours.

**Fix:** Always set `timeout-minutes` at the job level:
```yaml
jobs:
  test:
    timeout-minutes: 20
    steps:
      - run: npm test
        timeout-minutes: 10
```

### Matrix `include` vs. `exclude` Confusion

**Key insight:**
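- An `include` entry is merged into every existing combination whose original values it doesn't overwrite: it **adds properties** to those rows and does not create a new job
- An `include` entry that can't be added to any combination without overwriting an original value creates a new job
- `exclude` entries can be partial matches: an entry listing only some keys removes every combination with those values. Keys in an `exclude` entry must come from the base matrix, and exclusions are applied before `include`, so `include` can add an excluded combination back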
- Max 256 matrix jobs per workflow run
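A small static matrix makes these rules concrete. A sketch with arbitrary OS/Node values and a hypothetical extra `npm` property:

```yaml
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest]
        node: [18, 20]
        exclude:
          # Removes only the windows-latest + 18 combination
          - os: windows-latest
            node: 18
        include:
          # Doesn't overwrite any original value in the node-20 rows,
          # so it adds npm: 10 to ubuntu/20 and windows/20 (no new job)
          - node: 20
            npm: 10
          # Would overwrite `os` in every row, so it becomes a new job
          - os: macos-latest
            node: 20
    # Resulting jobs (4): ubuntu/18, ubuntu/20 (npm 10), windows/20 (npm 10), macos/20
    steps:
      - run: echo "os=${{ matrix.os }} node=${{ matrix.node }} npm=${{ matrix.npm }}"
```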

```yaml
strategy:
  fail-fast: false  # strongly recommended for diagnostics
  matrix: ${{ fromJSON(needs.prepare.outputs.matrix) }}
```
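The snippet above assumes an upstream `prepare` job that publishes the matrix as JSON. A minimal sketch of that job; the matrix contents are hypothetical and would normally be computed (for example, from changed files):

```yaml
jobs:
  prepare:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.set.outputs.matrix }}
    steps:
      - id: set
        # The JSON must stay on a single line for the name=value output format
        run: echo 'matrix={"os":["ubuntu-latest"],"node":[18,20]}' >> "$GITHUB_OUTPUT"

  test:
    needs: prepare
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix: ${{ fromJSON(needs.prepare.outputs.matrix) }}
    steps:
      - run: echo "Testing on Node ${{ matrix.node }}"
```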

### Dynamic Matrix and Required Status Checks

**The problem:** Matrix job names like `test (ubuntu-latest, 16)` change when matrix values change. Branch protection requires exact string matches — no wildcards.

**Fix:** Add a stable summary job and require that instead:
```yaml
test-summary:
  needs: [test]
  if: always()
  runs-on: ubuntu-latest
  steps:
    - if: needs.test.result != 'success'
      run: exit 1
```

---

## Known Unsolved Problems

These are confirmed platform limitations with no clean workaround. Understanding them saves hours of debugging dead ends.

### No SSH / Interactive Debugging ([#241](https://github.com/actions/runner/issues/241) — 107 👍, open since 2019)

The runner has no TTY allocated, so interactive debugging is not possible natively. Workarounds like [`mxschmitt/action-tmate`](https://github.com/mxschmitt/action-tmate) open SSH reverse tunnels, but they are a security risk: the SSH connection string is printed to the job log, which anyone can read on a public repository.
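If you accept that risk, a common pattern is to open the session only on failure and restrict who can connect. A sketch, assuming `mxschmitt/action-tmate@v3` and its `limit-access-to-actor` input (check the action's README for the version you pin):

```yaml
steps:
  - uses: actions/checkout@v4
  - run: npm test
  # Open an SSH session only when an earlier step failed
  - name: Debug via tmate
    if: failure()
    uses: mxschmitt/action-tmate@v3
    timeout-minutes: 15   # don't let a forgotten session hold the runner
    with:
      # Only SSH keys registered on the triggering user's GitHub account can connect
      limit-access-to-actor: true
```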

### No Step-Level Retry

There's no native `retry: 3` syntax on steps. Use [`nick-fields/retry`](https://github.com/nick-fields/retry) for `run:` steps, or a bash loop:
```bash
for i in 1 2 3; do
  flaky-command && break
  # Fail the step if the last attempt also failed
  if [ "$i" -eq 3 ]; then exit 1; fi
  sleep 15
done
```
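With `nick-fields/retry`, the same three-attempt policy becomes a single step. A sketch assuming the v3 inputs (`timeout_minutes`, `max_attempts`, `retry_wait_seconds`, `command`); `npm run test:integration` is a hypothetical command:

```yaml
- name: Run flaky integration tests
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 10       # per attempt
    max_attempts: 3
    retry_wait_seconds: 15
    command: npm run test:integration
```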

### No Early-Exit / Step Flow Control ([#662](https://github.com/actions/runner/issues/662) — 1,031 👍)

The highest-voted open runner issue. You cannot exit a job early with a specific conclusion (success/neutral). Every step must use `if:` guards to skip, creating verbose YAML.
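In practice, the workaround is a "gate" step that writes an output and an `if:` guard on every later step. A sketch; the `package.json` check is a hypothetical condition:

```yaml
steps:
  - uses: actions/checkout@v4
  - name: Decide whether the rest of the job should run
    id: gate
    run: |
      # Hypothetical condition: only proceed when package.json exists
      if [ -f package.json ]; then
        echo "proceed=true" >> "$GITHUB_OUTPUT"
      else
        echo "proceed=false" >> "$GITHUB_OUTPUT"
      fi
  - if: steps.gate.outputs.proceed == 'true'
    run: npm ci
  - if: steps.gate.outputs.proceed == 'true'
    run: npm test
```

The job still concludes as success, not neutral. Moving the gate into its own job and putting the `if:` on the downstream job avoids repeating the guard, at the cost of the downstream job showing as `Skipped`.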

### Reusable Workflows Cannot Be Called from Composite Actions

Composite actions are inlined steps on the parent runner. Calling a reusable workflow (which spawns a separate runner) from inside a composite action is architecturally impossible without a lifecycle model redesign.
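The workable split is to call reusable workflows at the job level and keep composite actions for step-level logic. A sketch; the workflow path, the `environment` input, and the local composite action are hypothetical:

```yaml
jobs:
  build:
    # Reusable workflows are called with `uses:` on a job, never from a step
    uses: ./.github/workflows/build.yml
    with:
      environment: staging   # must be declared under on.workflow_call.inputs
    secrets: inherit

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Composite actions run inline as steps on this job's runner
      - uses: ./.github/actions/deploy
```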

### No `services:` or `container:` in Composite Actions ([ADR 0549](https://github.com/actions/runner/blob/main/docs/adrs/0549-composite-run-steps.md))

By architectural decision. Service containers require Docker lifecycle management at the job level — composite actions don't have job-level lifecycle.
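The usual workaround is to declare the service on the calling job and let the composite action assume it is reachable. A sketch with a Postgres service; `./.github/actions/run-db-tests` is a hypothetical composite action:

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    # Services belong to the job; composite actions can't declare them
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      # The composite action connects to the service at localhost:5432
      - uses: ./.github/actions/run-db-tests
```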

### Secret Masking Edge Cases ([#475](https://github.com/actions/runner/issues/475) — 68 👍, open since 2020)

`::add-mask::` echoes the secret value before the mask takes effect. Short secrets (1-3 chars) cause entire log lines to become `***`. Base64 and URL-encoded versions of secrets may not be masked.
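A defensive pattern is to register a mask for a derived value, and for any encoded form you will print, before anything else touches it. A sketch for Linux runners (`base64 -w0` is GNU coreutils); `./scripts/get-token.sh` is a hypothetical script:

```yaml
- name: Fetch a short-lived token and mask it
  run: |
    TOKEN="$(./scripts/get-token.sh)"
    # Register the mask before the value is printed or passed on
    echo "::add-mask::$TOKEN"
    # Encoded forms are not masked automatically, so register them as well
    echo "::add-mask::$(printf '%s' "$TOKEN" | base64 -w0)"
    # Make the token available to later steps in this job
    echo "API_TOKEN=$TOKEN" >> "$GITHUB_ENV"
```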

### Cost/Billing Opacity

No per-workflow, per-job, or per-repository breakdown of Actions minutes. The billing page shows total org-level usage. For an approximate per-run duration, fetch the run with `gh api repos/{owner}/{repo}/actions/runs/<run-id>` and compare its `run_started_at` and `updated_at` timestamps.

---

## Essential Tooling

### `actionlint` — The Single Most Impactful Tool

[`rhysd/actionlint`](https://github.com/rhysd/actionlint) catches the majority of syntax, context, and type errors in this guide before you push:

```bash
# Install
go install github.com/rhysd/actionlint/cmd/actionlint@latest
# Or brew install actionlint

# Run
actionlint

# In CI, run it as a workflow step with the raven-actions/actionlint@v2 action
```

It validates YAML syntax, expression types, context availability, matrix configurations, reusable workflow inputs/outputs, `run:` scripts (via shellcheck, when it is installed), and the inputs and versions of popular actions.
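To catch these errors on every change to your workflows, a minimal lint workflow might look like this, assuming the `raven-actions/actionlint@v2` wrapper action (actionlint's own docs also describe a download-script approach):

```yaml
name: Lint workflows
on:
  pull_request:
    paths:
      - '.github/workflows/**'
jobs:
  actionlint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: raven-actions/actionlint@v2
```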

### Online Playground

Don't want to install anything? Use the [actionlint playground](https://rhysd.github.io/actionlint/) — paste your workflow YAML and get instant feedback.

### Debug Logging

Enable debug logging for any workflow run:
1. Go to the failed run → "Re-run all jobs" → check "Enable debug logging"
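2. Or set a repository secret or variable named `ACTIONS_STEP_DEBUG` to `true` (adds `##[debug]` output to all steps); set `ACTIONS_RUNNER_DEBUG` to `true` as well if you also need runner diagnostic logs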
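You can also make your own steps debug-aware. A sketch using the `runner.debug` context (set to `1` only when debug logging is on) and the `::debug::` workflow command; note the `github` context includes a token, which the runner masks in logs:

```yaml
steps:
  - name: Dump the github context only in debug runs
    if: runner.debug == '1'
    env:
      GITHUB_CONTEXT: ${{ toJSON(github) }}
    run: echo "$GITHUB_CONTEXT"
  - name: Emit a debug-only message
    # ::debug:: lines are hidden unless step debug logging is enabled
    run: echo "::debug::Working directory is $PWD"
```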

### `gh` CLI for Debugging

```bash
# List workflow runs
gh run list --workflow ci.yml

# View specific run logs
gh run view <run-id> --log

# Search the full log, or print only the failed steps
gh run view <run-id> --log | grep -i 'error'
gh run view <run-id> --log-failed

# List and delete caches
gh cache list
gh cache delete <id>

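# Check workflow state (--all includes disabled workflows)
gh workflow list --all
gh workflow enable "Workflow Name"

# Re-run only the failed jobs with debug logging enabled
gh run rerun <run-id> --failed --debug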
```

---

## Cross-Reference: Related Guides

If you're working with GitHub Actions in the context of platform engineering and DevOps automation, these related articles go deeper on specific patterns:

- [Lessons from 500 GitHub Migrations](/articles/lessons-from-500-github-migrations) — enterprise-scale GitHub rollouts
- [Platform Engineering with GitHub](/articles/platform-engineering-with-github) — building internal developer platforms on GitHub
- [GitOps for Everything: Beyond Deployments](/articles/gitops-for-everything-beyond-deployments) — declarative infrastructure with Actions
- [GitHub Agentic Workflows: Hands-On Guide](/articles/github-agentic-workflows-hands-on-guide) — automated workflows with GitHub Copilot
- [CI Monitor Extension: Agent CI Feedback Loop](/articles/ci-monitor-extension-agent-ci-feedback-loop) — automated CI debugging with AI agents

---

## Resources

Every error message, workaround, and fix in this guide is sourced from real GitHub Issues, official documentation, and architecture decision records:

- **[`rhysd/actionlint`](https://github.com/rhysd/actionlint)** — Static linter for GitHub Actions workflows (the canonical error message reference)
- **[`actions/runner` Issues](https://github.com/actions/runner/issues)** — Official runner bug tracker
- **[`actions/cache` Tips & Workarounds](https://github.com/actions/cache/blob/main/tips-and-workarounds.md)** — Official caching troubleshooting
- **[`actions/upload-artifact` Migration Guide](https://github.com/actions/upload-artifact)** — v3 → v4 breaking changes
- **[GitHub Actions Context Availability](https://docs.github.com/en/actions/learn-github-actions/contexts#context-availability)** — Which contexts are available where
- **[GitHub Actions Security Guides](https://docs.github.com/en/actions/security-for-github-actions)** — `GITHUB_TOKEN`, OIDC, fork PR security
- **[`actions/runner` ADRs](https://github.com/actions/runner/tree/main/docs/adrs)** — Architecture decisions explaining why limitations exist
- **[GitHub Status](https://githubstatus.com)** — Check for infrastructure incidents before debugging

This guide covers the scenarios that have cost me and thousands of other developers the most debugging hours. If your specific error isn't here, [open an issue](https://github.com/htekdev/htek-dev-site/issues) or reach out on [LinkedIn](https://linkedin.com/in/htekdev) — I'll add it to the next update.
