Django Migration Failed During Deploy: Recovery Playbook
Problem statement
A failed Django migration during deploy usually means your release reached production, but the database change step did not finish cleanly. That creates one of the highest-risk deployment states:
- new app code may expect schema changes that are not present
- part of a migration may have run, especially with non-transactional DDL or custom SQL
- new web or worker processes may fail on startup
- rolling deploys may leave old and new code hitting the same database with different assumptions
The main goal is not to make migrate pass again as quickly as possible. The goal is to contain the blast radius, determine the real database state, and choose the safest recovery path: fix forward, roll back code, or restore and repair the database.
Quick answer
When a Django migration fails in production during deploy:
- Stop further deploys and retries
- Pause workers and prevent new instances from starting incompatible code
- Capture the exact failed migration name and error
- Check both Django migration history and the real database schema and data state
- Decide whether to:
  - retry or fix forward if the failure is understood and safe
  - roll back app code if the migration never changed the database and the previous release is still compatible
  - restore or manually repair the database only when partial changes or data integrity issues make forward recovery unsafe
- Verify app health, migration state, and background jobs before reopening traffic
Step-by-step solution
1) First contain the failed deploy
Freeze the release pipeline first. Do not let automation make the incident worse.
- disable auto-deploys in CI/CD
- stop rollout of new web instances
- pause Celery workers, cron jobs, or other background jobs that depend on the new schema
- if your app runs migrations in startup scripts, stop repeated restarts and retries
If needed, enable maintenance mode or drain traffic at the load balancer.
Example Nginx maintenance block:
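server {
    listen 80;
    server_name example.com;

    # Answer every request with 503 while the incident is open
    location / {
        return 503;
    }

    # Serve a static page as the 503 response body
    error_page 503 @maintenance;
    location @maintenance {
        root /var/www/html;
        try_files /maintenance.html =503;
    }
}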
Reload safely:
sudo nginx -t && sudo systemctl reload nginx
If production serves HTTPS, apply the same maintenance response on the TLS server block as well. Do not update only the port 80 vhost while HTTPS traffic still reaches the app.
Record the incident context:
- release version or commit SHA
- failed migration name
- exact error message and traceback
- where it failed:
  - CI/CD release step
  - python manage.py migrate
  - container entrypoint
  - app startup hook
Check runtime evidence:
systemctl status gunicorn
journalctl -u gunicorn -n 200 --no-pager
Docker example:
docker logs <web-container> --tail 200
docker logs <release-job-container> --tail 200
If your services are in a crash loop, stop the loop before analysis. Repeated startup attempts can keep rerunning the same failing migration or keep starting incompatible code against a half-changed schema.
2) Determine what state the database is in
Do not trust only the deploy output. Check Django’s view of migration history and the actual database state.
Show migrations:
python manage.py showmigrations
Inspect the SQL for the failed migration:
python manage.py sqlmigrate app_name 00xx_migration_name
Open a DB shell from the same app environment:
python manage.py dbshell
Check whether Django marked the migration as applied:
SELECT app, name, applied
FROM django_migrations
ORDER BY applied DESC
LIMIT 20;
Then inspect the schema directly. PostgreSQL examples:
Check whether a column exists:
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'my_table'
AND column_name = 'new_column';
Check indexes:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'my_table';
Check locks and blockers:
SELECT pid, usename, state, wait_event_type, wait_event,
       pg_blocking_pids(pid) AS blocked_by, query
FROM pg_stat_activity
WHERE datname = current_database()
ORDER BY query_start;
For data migrations, inspect whether rows were already changed. If the migration contains RunPython or raw SQL, determine whether it is idempotent. A failed data migration may have modified some rows before crashing.
What you are trying to classify is one of these states:
- migration not applied at all
- migration partially applied
- migration applied, but deploy failed later for another reason
Do not decide on rollback or retry until you know which state you are in.
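The three states can be written down as a small decision function. This is a minimal sketch, not part of Django: `classify_migration_state` and its parameter names are hypothetical, and the inputs are assumed to be collected by hand from django_migrations, the sqlmigrate output, and direct schema queries. It only covers objects a migration creates; row-level changes from a data migration still need their own checks.

```python
from enum import Enum


class MigrationState(Enum):
    NOT_APPLIED = "migration not applied at all"
    PARTIALLY_APPLIED = "migration partially applied"
    APPLIED = "migration applied, deploy failed later"


def classify_migration_state(recorded, expected_objects, present_objects):
    """Classify a failed migration from evidence gathered manually.

    recorded: True if django_migrations has a row for the migration.
    expected_objects: names (columns, indexes, constraints) the migration
        SQL should create, read from sqlmigrate output.
    present_objects: names actually found in the database.
    """
    expected = set(expected_objects)
    found = expected & set(present_objects)

    if recorded and found == expected:
        return MigrationState.APPLIED
    if not recorded and not found:
        return MigrationState.NOT_APPLIED
    # Any disagreement between the history row and the schema is treated
    # as partial, because neither source can be trusted on its own.
    return MigrationState.PARTIALLY_APPLIED
```

Treating every mismatch as partial is a deliberately cautious choice: it pushes ambiguous cases toward manual inspection rather than toward a retry.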
3) Classify the failure before taking action
Use the failure type to choose the path.
Usually safe to retry after containment:
- temporary database connectivity loss
- lock timeout
- interrupted release job
- transient environment issue unrelated to schema change logic
Requires code or migration fix:
- broken RunPython
- invalid SQL in migration
- missing import or dependency
- cross-branch migration ordering problem
Higher-risk database compatibility or scale issue:
- unique constraint creation blocked by duplicate rows
- large table rewrite taking too long
- insufficient privileges for DDL
- partially applied non-transactional changes
If you cannot clearly classify the failure, assume it is not safe to blindly rerun.
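One way to keep that default explicit in a runbook script is a lookup that falls back to the cautious answer. The sketch below is illustrative only: `FailureKind`, `GUIDANCE`, and `next_step` are hypothetical names, and the guidance strings paraphrase the categories listed above.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"    # connectivity loss, lock timeout, interrupted job
    NEEDS_FIX = "needs_fix"    # broken RunPython, invalid SQL, missing import
    HIGH_RISK = "high_risk"    # duplicate rows, long table rewrite, partial DDL
    UNKNOWN = "unknown"        # anything that does not clearly fit above


GUIDANCE = {
    FailureKind.TRANSIENT: "retry after containment",
    FailureKind.NEEDS_FIX: "fix the code or migration before rerunning",
    FailureKind.HIGH_RISK: "review database state and plan before any rerun",
}


def next_step(kind):
    # UNKNOWN is intentionally absent from GUIDANCE, so unclassified
    # failures always get the cautious default.
    return GUIDANCE.get(kind, "do not rerun; keep investigating")
```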
4) Recovery path 1: fix forward and complete the migration
Choose fix-forward when:
- the previous app release is no longer schema-compatible
- the failure is understood
- the database is repairable without restore
- schema or data integrity can be preserved
Before retrying, confirm you have a backup or snapshot. In production, verify it explicitly rather than assuming one exists.
Run migrations manually, outside app startup:
python manage.py migrate app_name 00xx_migration_name
If needed, ship a targeted fix and redeploy only the release step first. Keep old and new code compatible during the transition.
A safer release sequence is:
- deploy fixed code or migration
- run migration as a dedicated one-off step
- verify migration state
- restart web
- restart workers
Sanity checks after fix-forward:
python manage.py showmigrations
python manage.py check --deploy
Inspect whether Django still plans to apply any migrations:
python manage.py migrate --plan
Then test app health endpoints, admin access, and a schema-sensitive path.
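If you want the "nothing left to apply" check to be scriptable, the text output of showmigrations can be scanned for unapplied entries. The sketch below assumes the default list format, where app labels are unindented and each migration line starts with `[X]` or `[ ]`; `pending_migrations` is a hypothetical helper name, not a Django API.

```python
def pending_migrations(showmigrations_output):
    """Return (app, migration) pairs that showmigrations lists as unapplied."""
    pending = []
    app = None
    for line in showmigrations_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("[ ]"):
            # Unapplied migration: keep the name that follows the marker.
            pending.append((app, stripped[3:].strip()))
        elif stripped.startswith("[X]"):
            continue
        elif not line.startswith(" "):
            # Unindented lines are app labels in the list format.
            app = stripped
    return pending
```

An empty result means Django's history shows nothing pending. It says nothing about whether the schema matches that history, so the direct database checks still apply.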
If the migration was data-heavy, also check:
- row counts or spot-checks for affected records
- worker queues for jobs that may replay bad assumptions
- database locks, long-running transactions, or blocked sessions
Rollback note: if the fixed migration still fails and the database state is getting more complex, stop and reassess. Do not keep retrying different ad hoc changes against production.
5) Recovery path 2: roll back application code safely
Roll back code when:
- the migration did not change the database
- the failed step happened before schema changes
- the old release remains compatible with the current database
Redeploy the last known good app version. Roll back workers and scheduled jobs too, not just the web service.
Examples:
- VM or systemd deployment: switch the app directory, artifact, or release symlink back to the previous known-good version, then restart services
- containerized deployment: redeploy the previous image tag for both web and worker services
After redeploying the last known good release, restart the affected services:
sudo systemctl restart gunicorn
sudo systemctl restart celery
Important: do not automatically reverse migrations in production unless you have tested that reversal and know it is safe. Many reverse operations are risky, slow, or lossy.
Verification after code rollback:
- app starts without model or schema mismatch errors
- requests succeed
- background jobs run normally
- no startup script is still trying to apply the broken migration
- the rolled-back release is actually the version now running
A rollback is only successful if both the code version and the running processes match the intended release.
6) Recovery path 3: restore or manually repair the database
Use restore or manual repair only when:
- partial schema changes were applied and forward recovery is unsafe
- a data migration half-finished and damaged consistency
- non-transactional operations left the schema in a broken intermediate state
Restore considerations:
- use point-in-time recovery or a tested snapshot process
- define the expected data loss window before acting
- coordinate app rollback with DB restore so code and schema match
- restrict DB credentials and shell access during emergency operations
Manual repair may include:
- dropping a partially created index or constraint
- reverting a failed column change
- correcting rows affected by a partial data migration
- updating django_migrations only if the schema truly matches the recorded state
Be careful with --fake. This is only safe when the database already matches what Django expects.
Example of targeted inspection before any fake apply:
python manage.py sqlmigrate app_name 00xx_migration_name
python manage.py dbshell
If the SQL says a column should exist, verify it exists. If an index should exist, verify that exact index exists. Only then consider a fake mark, and only as part of a controlled repair.
If you need to restore, document:
- recovery point used
- expected data loss window
- code version paired with the restored database
- follow-up actions needed before reopening traffic
Explanation
This recovery flow works because it separates three states that often get mixed together:
- migration not applied
- migration partially applied
- migration applied, but release still failed for another reason
That distinction determines whether rollback is easy, dangerous, or impossible.
A dedicated migration step is also the main prevention measure. Running migrations from app startup is convenient early on, but in production it can cause repeated retries, race conditions, and multiple instances competing to run the same change. A single release job that runs migrate once, before any process restarts, is much safer.
Example CI/CD pattern:
# release step
python manage.py migrate --noinput
# only if migrate succeeds
sudo systemctl restart gunicorn
sudo systemctl restart celery
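The same gate can be expressed as a small Python release script, which some teams find easier to test than shell. This is a sketch under assumptions: the service names mirror the commands above, `release` is a hypothetical function name, and the `run` parameter exists only so the logic can be exercised without touching real services.

```python
import subprocess

MIGRATE = ["python", "manage.py", "migrate", "--noinput"]
RESTARTS = [
    ["sudo", "systemctl", "restart", "gunicorn"],
    ["sudo", "systemctl", "restart", "celery"],
]


def release(run=subprocess.run):
    """Run migrate once; restart services only if it exits with status 0."""
    result = run(MIGRATE)
    if result.returncode != 0:
        # Leave the old processes running and surface the failure to CI/CD.
        return result.returncode
    for command in RESTARTS:
        code = run(command).returncode
        if code != 0:
            return code
    return 0
```

Returning the exit code lets the calling pipeline fail the job with `sys.exit(release())` instead of continuing to later steps.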
Keep secrets out of command history and repo files. Use the same production environment variables you normally use for app runtime, such as DATABASE_URL, DJANGO_SETTINGS_MODULE, and secret values loaded from your process manager, secret store, or container environment. Do not paste credentials into ad hoc shell commands unless there is no safer option.
When to script this
If your team repeats these incident checks often, standardize them. Common candidates are backup verification, single migration runner enforcement, collection of migration and log evidence, maintenance-mode toggles, and post-recovery health checks. A reusable release template or incident script reduces decision errors during a real outage.
Edge cases / notes
Failed migration during rolling deploys
This is one of the worst cases. Old and new app versions may both hit one database. If the new version expects a column that does not exist yet, only part of the fleet may fail. Drain traffic, stop the rollout, and make sure workers are version-aligned with the chosen recovery path.
Failed data migration on a large table
Large updates can hold locks or run long enough to exceed deploy timeouts. If a data migration failed, inspect how many rows changed before retrying. Batch updates and idempotent logic are safer than one large write.
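To make the batching and idempotency point concrete, here is a minimal sketch of a backfill loop. It uses the standard-library sqlite3 module purely so the example runs on its own; the `orders` table, the `status` column, and `backfill_in_batches` are hypothetical, and in a real project the same loop would live inside a RunPython function against your production database.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=500):
    """Set status on rows where it is still NULL, one short transaction per batch."""
    total = 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM orders WHERE status IS NULL ORDER BY id LIMIT ?",
            (batch_size,),
        )]
        if not ids:
            return total
        # The IS NULL filter makes the update idempotent: rows that an
        # earlier, interrupted run already changed are skipped.
        cur = conn.execute(
            "UPDATE orders SET status = 'pending' "
            "WHERE id BETWEEN ? AND ? AND status IS NULL",
            (ids[0], ids[-1]),
        )
        conn.commit()  # committing per batch keeps lock time bounded
        total += cur.rowcount
```

Because each batch commits separately, a crash leaves some rows updated and others not. That is acceptable only because rerunning picks up exactly where the previous run stopped.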
Multiple containers retried the same migration
If every container runs migrate on startup, one failure can cascade into repeated attempts. Move migration execution into a dedicated release job and ensure only one runner is active at a time.
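On PostgreSQL, one possible way to enforce a single runner is a session-level advisory lock taken before migrate starts. The sketch below assumes a DB-API cursor that uses `%s` placeholders (as psycopg and Django's `connection.cursor()` do); `migration_lock` and the key value are hypothetical, and any integer works as long as every runner uses the same one.

```python
from contextlib import contextmanager

# Arbitrary application-chosen key; all runners must share the same value.
MIGRATION_LOCK_KEY = 7215001


@contextmanager
def migration_lock(cursor, key=MIGRATION_LOCK_KEY):
    """Yield True if this session took the advisory lock, else False."""
    cursor.execute("SELECT pg_try_advisory_lock(%s)", [key])
    acquired = bool(cursor.fetchone()[0])
    try:
        yield acquired
    finally:
        if acquired:
            # Release explicitly; the lock would otherwise last until
            # the database session ends.
            cursor.execute("SELECT pg_advisory_unlock(%s)", [key])
```

A runner would call migrate only when the yielded value is True and exit cleanly otherwise. Session-level advisory locks depend on holding one database session for the whole run, so they are not reliable behind a connection pooler in transaction-pooling mode.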
PostgreSQL transactional behavior
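PostgreSQL runs most schema changes inside a transaction, and Django wraps each migration in one by default, so a failed migration usually rolls back as a unit. That guarantee does not cover everything: migrations marked atomic = False, concurrent index creation (CREATE INDEX CONCURRENTLY cannot run inside a transaction block and can leave an invalid index behind if it fails), custom SQL, and data changes committed in batches all need explicit review. Check the actual migration SQL and the actual database state rather than assuming rollback behavior.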
Internal links
For the background on migration state and compatibility, see How Django Migrations Work in Production.
For a safer release design, read How to Run Django Migrations Safely During Deployment.
If you need the full web stack around the app server, see Deploy Django with Gunicorn and Nginx.
For app version rollback steps, see How to Roll Back a Django Deployment Safely.
If the failed deploy exposed broader production gaps, review Django Deployment Checklist for Production.
FAQ
How do I know whether a failed Django migration changed the database?
Check both django_migrations and the real database schema. Use showmigrations, inspect the migration SQL with sqlmigrate, and then verify tables, columns, indexes, constraints, and any affected rows directly in the database.
Is it safe to rerun python manage.py migrate after a deployment failure?
Sometimes, but only after you confirm why it failed. Retrying is often reasonable for transient failures like lock timeouts or interrupted release jobs. It is not safe to blindly rerun if the migration contains broken logic, partial data changes, or non-transactional schema changes.
When should I roll back code instead of fixing forward?
Roll back code when the migration never changed the database and the previous release is still compatible with the current schema. If schema changes already landed or the old code cannot work with the current database state, fix-forward is usually safer.
Can I use python manage.py migrate --fake in production?
Only when you have verified that the database already matches the migration’s intended state. --fake is not a repair tool by itself. If you fake-apply a migration against the wrong schema, you can make later deploys much harder to recover.
What should I check before reopening traffic after a migration incident?
Confirm:
- migration history is correct
- the actual schema matches expected state
- web health checks pass
- workers are running the correct version
- no background job is still using old assumptions
- critical user paths and admin actions succeed