Django Migration Failed During Deploy: Recovery Playbook

Problem statement

A failed Django migration during deploy usually means your release reached production, but the database change step did not finish cleanly. That creates one of the highest-risk deployment states:

new app code may expect schema changes that are not present
part of a migration may have run, especially with non-transactional DDL or custom SQL
new web or worker processes may fail on startup
rolling deploys may leave old and new code hitting the same database with different assumptions

The main goal is not to make migrate pass again as quickly as possible. The goal is to contain the blast radius, determine the real database state, and choose the safest recovery path: fix forward, roll back code, or restore and repair the database.

Quick answer

When a Django migration fails in production during deploy:

Stop further deploys and retries
Pause workers and prevent new instances from starting incompatible code
Capture the exact failed migration name and error
Check both Django migration history and the real database schema and data state
Decide whether to:
- retry or fix forward if the failure is understood and safe
- roll back app code if the migration never changed the database and the previous release is still compatible
- restore or manually repair the database only when partial changes or data integrity issues make forward recovery unsafe
Verify app health, migration state, and background jobs before reopening traffic

Step-by-step solution

1) First contain the failed deploy

Freeze the release pipeline first. Do not let automation make the incident worse.

disable auto-deploys in CI/CD
stop rollout of new web instances
pause Celery workers, cron jobs, or other background jobs that depend on the new schema
if your app runs migrations in startup scripts, stop repeated restarts and retries

If needed, enable maintenance mode or drain traffic at the load balancer.

Example Nginx maintenance block:

server {
    listen 80;
    server_name example.com;

    location / {
        return 503;
    }

    error_page 503 @maintenance;
    location @maintenance {
        root /var/www/html;
        try_files /maintenance.html =503;
    }
}

Reload safely:

sudo nginx -t && sudo systemctl reload nginx

If production serves HTTPS, apply the same maintenance response on the TLS server block as well. Do not update only the port 80 vhost while HTTPS traffic still reaches the app.

Record the incident context:

release version or commit SHA
failed migration name
exact error message and traceback
where it failed:
- CI/CD release step
- python manage.py migrate
- container entrypoint
- app startup hook

Check runtime evidence:

systemctl status gunicorn
journalctl -u gunicorn -n 200 --no-pager

Docker example:

docker logs <web-container> --tail 200
docker logs <release-job-container> --tail 200

If your services are in a crash loop, stop the loop before analysis. Repeated startup attempts can keep rerunning the same failing migration or keep starting incompatible code against a half-changed schema.

2) Determine what state the database is in

Do not trust only the deploy output. Check Django’s view of migration history and the actual database state.

Show migrations:

python manage.py showmigrations
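
If you want to script this check, the list output can be parsed for entries still marked "[ ]". The following is a minimal sketch, assuming the default list format of showmigrations (an app label line followed by indented "[X]" or "[ ]" entries). The function name and the sample app names are hypothetical.

```python
from __future__ import annotations


def unapplied_migrations(showmigrations_output: str) -> list[tuple[str | None, str]]:
    """Return (app_label, migration_name) pairs that are still marked "[ ]"."""
    pending = []
    current_app = None
    for raw_line in showmigrations_output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[ ]"):
            # Unapplied entry: keep the name that follows the marker.
            pending.append((current_app, line[3:].strip()))
        elif line.startswith("[X]") or line.startswith("("):
            # Applied entry, or a note such as "(no migrations)".
            continue
        else:
            # Any other non-empty line is treated as an app label.
            current_app = line
    return pending


# Hypothetical output captured from `python manage.py showmigrations`.
sample = """\
auth
 [X] 0001_initial
orders
 [X] 0001_initial
 [ ] 0002_add_status
sessions
 (no migrations)
"""
print(unapplied_migrations(sample))
```

This only tells you what Django has recorded. It says nothing about whether the schema itself changed, which is why the direct database checks are still needed.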

Inspect the SQL for the failed migration:

python manage.py sqlmigrate app_name 00xx_migration_name

Open a DB shell from the same app environment:

python manage.py dbshell

Check whether Django marked the migration as applied:

SELECT app, name, applied
FROM django_migrations
ORDER BY applied DESC
LIMIT 20;

Then inspect the schema directly. PostgreSQL examples:

Check whether a column exists:

SELECT column_name
FROM information_schema.columns
WHERE table_name = 'my_table'
  AND column_name = 'new_column';

Check indexes:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'my_table';

Check locks and blockers (pg_blocking_pids requires PostgreSQL 9.6 or later):

SELECT pid, usename, state, wait_event_type, wait_event,
       pg_blocking_pids(pid) AS blocked_by, query
FROM pg_stat_activity
WHERE datname = current_database()
ORDER BY query_start;

For data migrations, inspect whether rows were already changed. If the migration contains RunPython or raw SQL, determine whether it is idempotent. A failed data migration may have modified some rows before crashing.

What you are trying to classify is one of these states:

migration not applied at all
migration partially applied
migration applied, but deploy failed later for another reason

Do not decide on rollback or retry until you know which state you are in.
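
The classification can be written down as a small rule. This is a sketch under stated assumptions, not a definitive tool: the inputs are the results of your own manual checks (whether the migration has a row in django_migrations, and whether each object the migration should create is present), and the function name, enum, and the index name in the example are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class MigrationState(Enum):
    NOT_APPLIED = "migration not applied at all"
    PARTIALLY_APPLIED = "migration partially applied"
    APPLIED = "migration applied"


def classify_migration_state(
    recorded_in_history: bool,
    expected_objects: dict[str, bool],
) -> MigrationState:
    """Combine Django's recorded history with direct schema/data evidence.

    expected_objects maps a human-readable check (for example a column or
    index the migration should create) to whether it was found.
    """
    if not expected_objects:
        raise ValueError("list at least one schema or data check")
    all_present = all(expected_objects.values())
    none_present = not any(expected_objects.values())
    if recorded_in_history and all_present:
        return MigrationState.APPLIED
    if not recorded_in_history and none_present:
        return MigrationState.NOT_APPLIED
    # Any disagreement between history and schema is treated conservatively.
    return MigrationState.PARTIALLY_APPLIED


checks = {
    "column my_table.new_column": True,
    "index my_table_new_column_idx": False,  # hypothetical index name
}
print(classify_migration_state(False, checks).value)
```

Treating every mismatch as "partially applied" is a deliberately cautious choice: it forces a manual review whenever history and schema disagree.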

3) Classify the failure before taking action

Use the failure type to choose the path.

Usually safe to retry after containment:

temporary database connectivity loss
lock timeout
interrupted release job
transient environment issue unrelated to schema change logic

Requires code or migration fix:

broken RunPython
invalid SQL in migration
missing import or dependency
cross-branch migration ordering problem

Higher-risk database compatibility or scale issue:

unique constraint creation blocked by duplicate rows
large table rewrite taking too long
insufficient privileges for DDL
partially applied non-transactional changes

If you cannot clearly classify the failure, assume it is not safe to blindly rerun.

4) Recovery path 1: fix forward and complete the migration

Choose fix-forward when:

the previous app release is no longer schema-compatible
the failure is understood
the database is repairable without restore
schema or data integrity can be preserved

Before retrying, confirm you have a backup or snapshot. In production, verify it explicitly rather than assuming one exists.

Run migrations manually, outside app startup:

python manage.py migrate app_name 00xx_migration_name

If the migration itself needs a change, ship a targeted fix and rerun only the release step first, before restarting web or worker processes. Keep old and new code compatible during the transition.

A safer release sequence is:

deploy fixed code or migration
run migration as a dedicated one-off step
verify migration state
restart web
restart workers

Sanity checks after fix-forward:

python manage.py showmigrations
python manage.py check --deploy

Inspect whether Django still plans to apply any migrations:

python manage.py migrate --plan

Then test app health endpoints, admin access, and a schema-sensitive path.

If the migration was data-heavy, also check:

row counts or spot-checks for affected records
worker queues for jobs that may replay bad assumptions
database locks, long-running transactions, or blocked sessions

Rollback note: if the fixed migration still fails and the database state is getting more complex, stop and reassess. Do not keep retrying different ad hoc changes against production.

5) Recovery path 2: roll back application code safely

Roll back code when:

the migration did not change the database
the failed step happened before schema changes
the old release remains compatible with the current database

Redeploy the last known good app version. Roll back workers and scheduled jobs too, not just the web service.

Examples:

VM or systemd deployment: switch the app directory, artifact, or release symlink back to the previous known-good version, then restart services
containerized deployment: redeploy the previous image tag for both web and worker services

After redeploying the last known good release, restart the affected services:

sudo systemctl restart gunicorn
sudo systemctl restart celery

Important: do not automatically reverse migrations in production unless you have tested that reversal and know it is safe. Many reverse operations are risky, slow, or lossy.

Verification after code rollback:

app starts without model or schema mismatch errors
requests succeed
background jobs run normally
no startup script is still trying to apply the broken migration
the rolled-back release is actually the version now running

A rollback is only successful if both the code version and the running processes match the intended release.

6) Recovery path 3: restore or manually repair the database

Use restore or manual repair only when:

partial schema changes were applied and forward recovery is unsafe
a data migration half-finished and damaged consistency
non-transactional operations left the schema in a broken intermediate state

Restore considerations:

use point-in-time recovery or a tested snapshot process
define the expected data loss window before acting
coordinate app rollback with DB restore so code and schema match
restrict DB credentials and shell access during emergency operations

Manual repair may include:

dropping a partially created index or constraint
reverting a failed column change
correcting rows affected by a partial data migration
updating django_migrations only if the schema truly matches the recorded state

Be careful with migrate --fake. It records the migration as applied in django_migrations without running its SQL, so it is only safe when the database already matches what Django expects.

Example of targeted inspection before any fake apply:

python manage.py sqlmigrate app_name 00xx_migration_name
python manage.py dbshell

If the SQL says a column should exist, verify it exists. If an index should exist, verify that exact index exists. Only then consider a fake mark, and only as part of a controlled repair.

If you need to restore, document:

recovery point used
expected data loss window
code version paired with the restored database
follow-up actions needed before reopening traffic

Explanation

This recovery flow works because it separates three states that often get mixed together:

migration not applied
migration partially applied
migration applied, but release still failed for another reason

That distinction determines whether rollback is easy, dangerous, or impossible.
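
One way to make that decision explicit is a small function that maps the state and a few yes/no judgments to a recovery path. This is a sketch of one possible ordering based on the criteria in this playbook; the function name, argument names, and the order in which the rules are checked are assumptions, and the answers to the yes/no questions still come from a human.

```python
def choose_recovery_path(
    state: str,
    *,
    failure_understood: bool,
    old_release_compatible: bool,
    forward_repair_safe: bool,
) -> str:
    """Suggest a recovery path from the classified migration state."""
    if state not in {"not_applied", "partially_applied", "applied"}:
        raise ValueError(f"unknown state: {state!r}")
    # Nothing changed in the database and the old code still fits it.
    if state == "not_applied" and old_release_compatible:
        return "roll back code"
    # The failure is understood and the database can be repaired in place.
    if failure_understood and forward_repair_safe:
        return "fix forward"
    # Partial changes that cannot be completed or repaired safely.
    if state == "partially_applied" and not forward_repair_safe:
        return "restore or manually repair"
    # Anything else is not clear enough to act on yet.
    return "stay contained and investigate"


print(choose_recovery_path(
    "partially_applied",
    failure_understood=True,
    old_release_compatible=False,
    forward_repair_safe=False,
))
```

The fallback result is intentional: when the inputs do not clearly match a path, the sketch refuses to pick one rather than defaulting to a retry.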

A dedicated migration step is also the main prevention measure. Running migrations from app startup is convenient early on, but in production it can cause repeated retries, race conditions, and multiple instances competing to run the same change. A single release job is much safer.

Example CI/CD pattern:

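#!/usr/bin/env bash
set -euo pipefail  # abort on the first failing command

# release step
python manage.py migrate --noinput

# runs only if migrate succeeded, because set -e stops the script on failure
sudo systemctl restart gunicorn
sudo systemctl restart celery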
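
If your release job is written in Python rather than shell, the same stop-on-failure rule can be sketched with subprocess. This is a minimal illustration, not a complete deploy tool: run_release and RELEASE_STEPS are hypothetical names, and the commands simply mirror the ones used in this playbook.

```python
import subprocess
import sys


def run_release(steps):
    """Run each (name, command) step in order.

    Returns None if every step succeeded, otherwise the name of the first
    step that exited non-zero. Later steps are not run after a failure.
    """
    for name, command in steps:
        print(f"running: {name}", flush=True)
        result = subprocess.run(command)
        if result.returncode != 0:
            print(
                f"{name} failed with exit code {result.returncode}; stopping",
                file=sys.stderr,
            )
            return name
    return None


# Hypothetical step list; adjust service names to your environment.
RELEASE_STEPS = [
    ("migrate", [sys.executable, "manage.py", "migrate", "--noinput"]),
    ("restart web", ["sudo", "systemctl", "restart", "gunicorn"]),
    ("restart workers", ["sudo", "systemctl", "restart", "celery"]),
]
```

Calling run_release(RELEASE_STEPS) from a single release job keeps the restart steps from running when migrate fails, and the returned step name gives you the first item for the incident record.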

Keep secrets out of command history and repo files. Use the same production environment variables you normally use for app runtime, such as DATABASE_URL, DJANGO_SETTINGS_MODULE, and secret values loaded from your process manager, secret store, or container environment. Do not paste credentials into ad hoc shell commands unless there is no safer option.

When to script this

If your team repeats these incident checks often, standardize them. Common candidates are backup verification, single migration runner enforcement, collection of migration and log evidence, maintenance-mode toggles, and post-recovery health checks. A reusable release template or incident script reduces decision errors during a real outage.

Edge cases / notes

Failed migration during rolling deploys

This is one of the worst cases. Old and new app versions may both hit one database. If the new version expects a column that does not exist yet, only the instances already running the new code fail, so the outage looks partial and intermittent. Drain traffic, stop the rollout, and make sure workers are version-aligned with the chosen recovery path.

Failed data migration on a large table

Large updates can hold locks or run long enough to exceed deploy timeouts. If a data migration failed, inspect how many rows changed before retrying. Batch updates and idempotent logic are safer than one large write.
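
The batching idea can be shown with a small, self-contained sketch. It uses sqlite3 from the standard library as a stand-in database so it runs anywhere; in a real Django project the same loop would live inside a RunPython function and use the ORM or your production database. The orders table, the status column, and the function name are hypothetical.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Set status to 'active' where it is NULL, one committed batch at a time.

    The WHERE clause only selects rows that still need the change, so
    rerunning after a crash continues where the last run stopped.
    """
    total = 0
    while True:
        cursor = conn.execute(
            "UPDATE orders SET status = 'active' "
            "WHERE id IN (SELECT id FROM orders WHERE status IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions keep lock times small
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)", [(None,)] * 2500)
conn.commit()

print(backfill_in_batches(conn, batch_size=1000))  # rows updated on the first run
print(backfill_in_batches(conn, batch_size=1000))  # nothing left on a rerun
```

Committing per batch means a failure loses at most one batch of work, at the cost of the migration no longer being a single all-or-nothing transaction. That trade-off is why the update itself has to be safe to repeat.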

Multiple containers retried the same migration

If every container runs migrate on startup, one failure can cascade into repeated attempts. Move migration execution into a dedicated release job and ensure only one runner is active at a time.
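
On PostgreSQL, one way to enforce a single runner is a session-level advisory lock taken before migrate starts. The sketch below assumes a DB-API cursor that uses the %s parameter style (as psycopg and Django's connection.cursor() do). The lock key and function names are hypothetical, and every release job must agree on the same key.

```python
# Arbitrary, hypothetical key shared by all release jobs.
MIGRATION_LOCK_KEY = 748201


def try_acquire_migration_lock(cursor, key=MIGRATION_LOCK_KEY):
    """Return True if this session now holds the migration lock.

    pg_try_advisory_lock does not wait: it returns false immediately when
    another session already holds the same key.
    """
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    return bool(cursor.fetchone()[0])


def release_migration_lock(cursor, key=MIGRATION_LOCK_KEY):
    """Release the lock; returns False if this session did not hold it."""
    cursor.execute("SELECT pg_advisory_unlock(%s)", (key,))
    return bool(cursor.fetchone()[0])
```

A runner that gets False should exit without running migrate. Because the lock is tied to the database session, PostgreSQL releases it when the runner's connection closes, so a crashed job does not leave it held. This only works if the lock and the migration use a connection that stays open for the whole run, which is an assumption to verify if a connection pooler sits in front of the database.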

PostgreSQL transactional behavior

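PostgreSQL supports transactional DDL, and Django runs each migration inside a transaction on PostgreSQL by default, so most failed schema changes roll back cleanly. That does not cover everything. CREATE INDEX CONCURRENTLY cannot run inside a transaction, so migrations that use it must set atomic = False, and a failed concurrent build can leave an INVALID index behind that has to be dropped before retrying. Custom SQL and app-level data changes also need explicit review. Check the actual migration SQL and the actual database state rather than assuming rollback behavior.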

Internal links

For the background on migration state and compatibility, see How Django Migrations Work in Production.

For a safer release design, read How to Run Django Migrations Safely During Deployment.

If you need the full web stack around the app server, see Deploy Django with Gunicorn and Nginx.

For app version rollback steps, see How to Roll Back a Django Deployment Safely.

If the failed deploy exposed broader production gaps, review Django Deployment Checklist for Production.

FAQ

How do I know whether a failed Django migration changed the database?

Check both django_migrations and the real database schema. Use showmigrations, inspect the migration SQL with sqlmigrate, and then verify tables, columns, indexes, constraints, and any affected rows directly in the database.

Is it safe to rerun `python manage.py migrate` after a deployment failure?

Sometimes, but only after you confirm why it failed. Retrying is often reasonable for transient failures like lock timeouts or interrupted release jobs. It is not safe to blindly rerun if the migration contains broken logic, partial data changes, or non-transactional schema changes.

When should I roll back code instead of fixing forward?

Roll back code when the migration never changed the database and the previous release is still compatible with the current schema. If schema changes already landed or the old code cannot work with the current database state, fix-forward is usually safer.

Can I use `python manage.py migrate --fake` in production?

Only when you have verified that the database already matches the migration’s intended state. --fake is not a repair tool by itself. If you fake-apply a migration against the wrong schema, you can make later deploys much harder to recover.

What should I check before reopening traffic after a migration incident?

Confirm:

migration history is correct
the actual schema matches expected state
web health checks pass
workers are running the correct version
no background job is still using old assumptions
critical user paths and admin actions succeed