A 403 from Cloudflare R2 on boot, and why I ended up bundling everything in the Docker image
Cloudflare's bot protection blocks Python's default urllib User-Agent with a 403. Here's the one-line fix — and why I later deleted the download path entirely.
Scripture2Slides boots, looks at its empty data/ directory, and asks Cloudflare R2 for the ~36 MB of PowerPoint templates, layouts, and background images it needs before it can serve a single request. Locally, the download finishes in a second. On Railway, last week, it stopped working.
urllib.error.HTTPError: HTTP Error 403: Forbidden
The app never got past startup. Every deploy failed the health check. It had been working for months.
The diagnosis
The download code was the most innocent thing in the repo:
# server/scripts/download-data.py — before
def download_archive(archive: str) -> bytes:
    url = f"{BASE_URL}/{archive}"
    with urllib.request.urlopen(url, timeout=DOWNLOAD_TIMEOUT_SECONDS) as response:
        return response.read()
Python’s stdlib urllib.request sends a User-Agent of Python-urllib/3.12 by default. That string is on Cloudflare’s “suspected scraper” list. When Cloudflare’s bot-management kicks in (either because my R2 bucket is behind a zone with the default ruleset, or because a recent tuning pass tightened the rules), requests with that User-Agent get a 403 before they ever reach the bucket.
It hadn’t broken before because the rule wasn’t firing before. Something upstream flipped and I inherited the behavior.
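You don't need a failing deploy to see the offending header. The stdlib's default opener is constructed with exactly the User-agent string in question; a quick inspection sketch (no request is made):

```python
import urllib.request

# build_opener() returns an OpenerDirector whose default headers include
# the "Python-urllib/<major>.<minor>" User-agent that triggers the 403.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.12')]
```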
The fix is a single line:
# server/scripts/download-data.py — after
def download_archive(archive: str) -> bytes:
    url = f"{BASE_URL}/{archive}"
    request = urllib.request.Request(url, headers={"User-Agent": "scripture2slides/1.0"})
    with urllib.request.urlopen(request, timeout=DOWNLOAD_TIMEOUT_SECONDS) as response:
        return response.read()
Any non-default User-Agent works. Cloudflare’s rule is specifically looking for the stdlib string. Deploys unblocked, health checks green, I could have moved on.
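You can confirm the override without touching the network: `Request` stores header keys in capitalized form, so the patched header is inspectable before any call is made (the URL below is a placeholder):

```python
import urllib.request

request = urllib.request.Request(
    "https://example.com/templates.tar.gz",  # placeholder URL
    headers={"User-Agent": "scripture2slides/1.0"},
)
# Request normalizes header keys with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))  # scripture2slides/1.0
```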
The “but wait” moment
The fix worked, but it exposed a dumber question I hadn’t asked in a year: why am I downloading my own static data from my own bucket at every boot?
The templates change roughly never. The layouts change roughly never. The curated background images change roughly never. When I added R2 originally, the mental model was “CDN” — small web tier, big storage tier, hydrate on demand. But the total “big storage” was 36 MB. That’s not a CDN problem. That’s a git problem I’d lazily offloaded to object storage.
The 403 was doing me a favor. Every “fix the symptom” patch is an invitation to ask whether the thing failing should exist at all. Big-company instincts say add retries, add a cache, add a circuit breaker. Indie instincts should say delete the dependency.
The in-between step I also didn’t need
Before I got there, I took a half-step. Commit a9273aa added FORCE_STATIC_DATA_DOWNLOAD and SKIP_STATIC_DATA_DOWNLOAD environment variables and a “prefer bundled if already present” check in start.sh:
# runtime_assets_present: helper elsewhere in start.sh that checks
# whether the bundled data files already exist in the image.
if [[ "${SKIP_STATIC_DATA_DOWNLOAD:-0}" == "1" ]]; then
  echo "⚠️ Skipping static data download"
elif [[ "${FORCE_STATIC_DATA_DOWNLOAD:-0}" == "1" ]]; then
  python scripts/download-data.py
elif runtime_assets_present; then
  echo "✅ Using bundled static data; skipping remote download"
else
  python scripts/download-data.py
fi
I also changed download-data.py so that a failed download falls back to keeping any existing data on disk instead of exiting non-zero. This was the incrementalist move: keep R2 as the source of truth, treat bundled as cache. It felt responsible. It was actually overthinking. The env knobs are three kinds of “do this download when I say so” for data that nobody ever needs to update independently of a deploy.
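The fallback amounted to roughly this shape (a sketch, not the real script; `hydrate`, `DATA_DIR`, and the error handling are illustrative stand-ins for what download-data.py did):

```python
import sys
from pathlib import Path

DATA_DIR = Path("data")  # stand-in for the real data directory

def hydrate(archive: str, download_archive) -> None:
    """Refresh one archive from R2; on failure, keep whatever is on disk."""
    target = DATA_DIR / archive
    try:
        payload = download_archive(archive)
    except OSError as exc:  # urllib's URLError/HTTPError are OSError subclasses
        if target.exists():
            # Fall back to the bundled/cached copy instead of exiting non-zero.
            print(f"download failed ({exc}); keeping existing {target}")
            return
        sys.exit(f"no local copy of {archive} and download failed: {exc}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
```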
The actual fix
Commit 7c832fb: check the static data into git. The .dockerignore became an allowlist:
data/*
!data/templates/
!data/layouts/
!data/backgrounds/
data/templates/uploads/
data/backgrounds/builder/
Templates, layouts, and curated backgrounds ship in the image. User uploads (templates/uploads/) and anything generated at runtime (backgrounds/builder/) stay excluded. The .gitignore mirrors this with an allowlist of extensions per directory — .pptx, .webp, .json in, everything else out — so it’s hard to accidentally commit someone’s uploaded deck.
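For illustration, that per-directory allowlist has roughly this shape (not the exact file; note that pattern order matters, since a directory must be un-ignored before files inside it can be):

```gitignore
# Ignore everything under data/, then re-admit curated asset types only.
data/**
!data/templates/
!data/templates/*.pptx
!data/templates/*.webp
!data/layouts/
!data/layouts/*.pptx
!data/layouts/*.webp
!data/layouts/*.json
!data/backgrounds/
!data/backgrounds/*.webp
# User uploads and runtime-generated files stay out.
data/templates/uploads/
data/backgrounds/builder/
```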
Totals: 55 files, 36.27 MB into the repo. Eight PowerPoint templates plus their WebP previews (~18 MB), nine layout .pptxs with previews and an index (~16 MB), twelve curated background images (~1.8 MB). The Docker image grew by the same amount; cold start is now a boring COPY . . with no network traffic.
The trade I made, explicit
What I gave up:
- Git now holds binaries it philosophically shouldn’t. Pulls are slightly slower.
- Docker image is ~36 MB heavier.
- Asset updates require a deploy. (For assets that changed in the last 12 months: zero of them.)
What I got:
- Zero boot-time network calls to anything other than Postgres.
- Immune to any future bot-protection surprise from any future storage provider.
- Cold starts that can’t fail halfway. This is the big one for Railway-type platforms where a failed boot means the old container keeps running stale, or worse, the health check flaps.
- One fewer system to think about when something’s broken at 11pm.
What not to bundle
The Bibles — 80+ translations, several hundred megabytes — still live in R2 and hydrate on demand, because (a) they’re large enough that image bloat becomes real and (b) the set genuinely grows over time. User-uploaded decks obviously stay in R2 too. The working rule:
Bundle what ships with the code. Fetch what ships with the users.
If an asset’s update cadence is the same as your deploy cadence, bundling is free. If it’s faster or independent, you need real storage.
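The fetch side stays small too. A sketch of the hydrate-on-first-use pattern the Bibles follow (names are hypothetical; `fetch` stands in for the real R2 download with the fixed User-Agent):

```python
from pathlib import Path

BIBLES_DIR = Path("data/bibles")  # hypothetical cache location

def get_bible(translation: str, fetch) -> bytes:
    """Return one translation, downloading from R2 only on first use."""
    cached = BIBLES_DIR / f"{translation}.json"
    if cached.exists():
        return cached.read_bytes()  # warm path: no network
    payload = fetch(translation)    # cold path: one download, cached after
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(payload)
    return payload
```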
The meta-lesson I keep relearning is that a lot of infra is cargo-culted from contexts that don’t apply. CDN-style asset hydration makes sense when assets are huge or change fast. For a 36 MB folder of templates that change every 18 months, it’s a Rube Goldberg machine whose most interesting failure mode is a 403 from a provider you thought you trusted. The 403 was the warning shot. The bundle was the actual fix.
If you want the result, you can poke at scripture2slides.com — every one of those template previews is now shipping from inside the image.