Live Migration
1. Live Migration Workflow
- Verify the storage backend is appropriate for the migration type
- Perform a shared storage check for normal migrations
- Do the inverse for block migrations
- Checks are run on both the source and destination, orchestrated via RPC calls from the scheduler
- On the destination
- Create the necessary volume connections
- If block migration, create the instance directory, populate missing backing files from Glance and create empty instance disks
- On the source
- Initiate the actual live migration
- Upon completion
- Generate the Libvirt XML and define it on the destination
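
The storage check in the workflow above can be sketched roughly as follows. This is a minimal illustration, assuming both instance directories are reachable as local paths from one process; in a real deployment the two halves run on different hosts and are coordinated over RPC. The function names are hypothetical and are not Nova's own.

```python
import os
import tempfile


def create_marker(dest_instances_dir):
    """Destination side: drop a uniquely named marker file, return its name."""
    fd, path = tempfile.mkstemp(dir=dest_instances_dir)
    os.close(fd)
    return os.path.basename(path)


def source_sees_marker(source_instances_dir, marker_name):
    """Source side: on shared storage the marker is visible here too."""
    return os.path.exists(os.path.join(source_instances_dir, marker_name))


def storage_ok_for(block_migration, is_shared):
    """Normal migrations need shared storage; block migrations need the inverse."""
    return (not is_shared) if block_migration else is_shared
```

A caller would create the marker on the destination, test for it on the source, pass the result to `storage_ok_for`, and remove the marker afterwards.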
- Why migration
- Operations
- Key to performing non-disruptive work
- Re-balancing workloads and resources
- Expectations versus reality
- Special snowflakes
- Ephemeral instances and the "cloud" way
- Types of migration
- Migrate
- Completely "cold", libvirt does almost nothing
- Shares a code path with "resize"
- Extremely brittle (uses SSH and copies files around)
- Live migration
- Orchestrated almost entirely by Libvirt (via DomainMigrateToURI)
- Block migration
- Similar code path to live migration
- More risky and brittle (disks are moving along with state)
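
Since live migration is handed off almost entirely to Libvirt, the call on the source side is small. The sketch below assumes the libvirt Python bindings, where a `virDomain` object exposes `migrateToURI`; the `qemu+tcp://` URI scheme and the function name `live_migrate` are assumptions for illustration.

```python
def live_migrate(dom, dest_host, flags, bandwidth=0):
    """Ask Libvirt to migrate a running domain to dest_host.

    dom is expected to be a libvirt.virDomain (anything with a compatible
    migrateToURI method works). flags is the OR-ed migration bitmask.
    """
    dest_uri = "qemu+tcp://%s/system" % dest_host  # assumed transport
    # dname=None keeps the same domain name on the destination.
    dom.migrateToURI(dest_uri, flags, None, bandwidth)
    return dest_uri
```

The call blocks until the migration finishes or fails, which is part of why these operations count as long-running and brittle.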
3. Live Migrations
- Nova offloads capability comparisons to Libvirt
- The API equivalent of `virsh capabilities` is run by the scheduler on the source and destination
- Nova live migration
- Important config options
- live_migration_flag += VIR_MIGRATE_LIVE
- block_migration_flag += VIR_MIGRATE_LIVE
- Standardized virtual CPU flags
- libvirt_cpu_mode = custom
- libvirt_cpu_model = cpu64-rhel6
- "Max Downtime" (not currently tunable)
- Look for upstream patches soon
- QEMU keeps syncing memory until the final cutover can be done within "30" milliseconds
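
To show what adding `VIR_MIGRATE_LIVE` to the flag options amounts to, here is a minimal sketch that turns a comma-separated flag string into the bitmask passed to Libvirt. It assumes the `+=` in the notes means appending to the existing value. The numeric values are those of libvirt's `virDomainMigrateFlags` enum, copied in so the sketch runs without the bindings; the starting value of `live_migration_flag` and the helper name are hypothetical.

```python
# Values from libvirt's virDomainMigrateFlags enum.
MIGRATE_FLAGS = {
    "VIR_MIGRATE_LIVE": 1,
    "VIR_MIGRATE_PEER2PEER": 2,
    "VIR_MIGRATE_UNDEFINE_SOURCE": 16,
    "VIR_MIGRATE_NON_SHARED_INC": 128,
}


def flags_to_bitmask(flag_string):
    """OR together the comma-separated flag names from a config value."""
    mask = 0
    for name in flag_string.split(","):
        name = name.strip()
        if name:
            mask |= MIGRATE_FLAGS[name]
    return mask


# Hypothetical starting value, then the "+=" from the notes.
live_migration_flag = "VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER"
live_migration_flag += ", VIR_MIGRATE_LIVE"
```

Without the `VIR_MIGRATE_LIVE` bit set, Libvirt pauses the guest for the duration of the transfer, so the migration is not actually live.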
4. Brittle Operations
- Any long running, synchronous tasks
- All migrations (memory sync, disk sync, etc)
- No graceful way to stop services
- Most prone to failure
- Migrate and resize
- Live migration (block or otherwise)
- Instance snapshot
5. Recovering from failures
- Always investigate before forcing actions
- Look at the logs for exceptions
- Check whether an instance is running on multiple hypervisors
- `nova reset-state --active` and `nova reboot --hard` can go a long way
- Sometimes, brute force is going to be required
- `kill -9` qemu or kvm processes
- Alter the database records, commonly `host`
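
The "running on multiple hypervisors" check can be sketched as a simple inventory comparison. This assumes you have already collected the UUIDs of running domains from each hypervisor (for example from `virsh list --uuid` on each host); the function name and the input shape are hypothetical.

```python
from collections import defaultdict


def find_duplicated_instances(running_by_host):
    """Return {instance_uuid: sorted hosts} for instances seen on more than one host."""
    hosts_by_instance = defaultdict(set)
    for host, uuids in running_by_host.items():
        for uuid in uuids:
            hosts_by_instance[uuid].add(host)
    return {
        uuid: sorted(hosts)
        for uuid, hosts in hosts_by_instance.items()
        if len(hosts) > 1
    }
```

Any UUID this returns deserves investigation before forcing a state reset or editing the `host` column, since two live copies of the same disk can corrupt data.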
6. "Stuck" Live Migrations
- Live migrations can get stuck
- Instances left in a paused state on both ends
- Monitor socket is unresponsive, Libvirt is helpless
- Generally a result of an overly aggressive "max downtime" and rapidly changing memory state (e.g., JVM)
- Can be a result of a QEMU issue/bug
- managedSave (suspend) will generally be prone to this as well
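
Why an aggressive max downtime combined with fast-changing memory can keep a migration from ever finishing is easier to see with a toy model. The sketch below is a deliberately simplified pre-copy loop, not QEMU's actual algorithm: each pass re-sends whatever was dirtied during the previous pass, and cutover happens once the remainder can be sent within the downtime budget. All names and parameters are hypothetical.

```python
def passes_until_cutover(memory_mb, dirty_mb_per_s, bandwidth_mb_per_s,
                         max_downtime_s=0.030, max_passes=100):
    """Return the number of pre-copy passes before cutover, or None if it never converges."""
    remaining = float(memory_mb)
    for n in range(max_passes + 1):
        # Cut over once the remaining data fits in the downtime budget.
        if remaining / bandwidth_mb_per_s <= max_downtime_s:
            return n
        transfer_time = remaining / bandwidth_mb_per_s
        # Memory dirtied while this pass was in flight must be sent again.
        remaining = min(dirty_mb_per_s * transfer_time, memory_mb)
    return None
```

In this model a guest that dirties memory as fast as the link can carry it never converges, whereas the same guest with a low dirty rate, or with a larger downtime budget, reaches cutover in a handful of passes.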