Saturday, March 19, 2016

Live Migration


1. Live Migration Workflow

  • Verify the storage backend is appropriate for the migration type
    • Perform a shared storage check for normal migrations
    • Do the inverse for block migrations
    • Checks are run on both the source and destination, orchestrated via RPC calls from the scheduler
  • On the destination
    • Create the necessary volume connections
    • If block migration, create the instance directory, populate missing backing files from Glance and create empty instance disks
  • On the source
    • Initiate the actual live migration
  • Upon completion
    • Generate the Libvirt XML and define it on the destination
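
The storage check in the first step of the workflow can be sketched as a small predicate. This is only an illustration of the rule described above, not Nova's actual code; the function name `check_storage` and the exception `MigrationPreCheckError` are hypothetical.

```python
class MigrationPreCheckError(Exception):
    """Hypothetical error: storage does not suit the requested migration type."""


def check_storage(block_migration: bool, shared_storage: bool) -> None:
    """Raise if the storage backend does not match the migration type.

    A normal live migration expects source and destination to share the
    instance storage; a block migration expects the inverse.
    """
    if block_migration and shared_storage:
        raise MigrationPreCheckError(
            "block migration requested, but instance storage is shared")
    if not block_migration and not shared_storage:
        raise MigrationPreCheckError(
            "live migration requested, but instance storage is not shared")
```

Calling `check_storage(block_migration=False, shared_storage=True)` returns silently, while a mismatched combination raises before any work is started on the destination.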

2. Migrations

  • Why migration
    • Operations
      • Key to performing non-disruptive work
      • Re-balancing workloads and resources
    • Expectations versus reality
      • Special snowflakes
      • Ephemeral instances and the "cloud" way
  • Types of migration
    • Migrate
      • Completely "cold", libvirt does almost nothing
      • Shares a code path with "resize"
      • Extremely brittle (uses SSH and copies files around)
    • Live migration
      • Orchestrated almost entirely by Libvirt (via DomainMigrateToURI)
    • Block migration
      • Similar code path to live migration
      • More risky and brittle (disks are moving along with state)
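
To make the difference between live and block migration concrete, here is a rough sketch of the single libvirt call that does most of the work. It assumes `domain` is a libvirt-python `virDomain` object (anything with a compatible `migrateToURI` method will do) and that the flag combination shown is acceptable for the deployment; the helper name `live_migrate` is hypothetical and this is not Nova's implementation.

```python
# Values mirror libvirt's virDomainMigrateFlags enum. With libvirt-python
# installed, use libvirt.VIR_MIGRATE_* instead of redefining them.
VIR_MIGRATE_LIVE = 1
VIR_MIGRATE_PEER2PEER = 2
VIR_MIGRATE_UNDEFINE_SOURCE = 16
VIR_MIGRATE_NON_SHARED_INC = 128


def live_migrate(domain, dest_uri, block_migration=False, bandwidth=0):
    """Start a migration of `domain` to `dest_uri` and return the flags used.

    The flag set is an assumption made for illustration. Block migration
    only adds the flag that copies non-shared disks incrementally.
    """
    flags = (VIR_MIGRATE_LIVE
             | VIR_MIGRATE_PEER2PEER
             | VIR_MIGRATE_UNDEFINE_SOURCE)
    if block_migration:
        flags |= VIR_MIGRATE_NON_SHARED_INC
    # virDomain.migrateToURI(duri, flags, dname, bandwidth)
    domain.migrateToURI(dest_uri, flags, None, bandwidth)
    return flags
```

A destination URI would typically look like `qemu+tcp://dest-host/system`, where `dest-host` is a placeholder.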

3. Live Migrations

  • Nova offloads capabilities comparisons to Libvirt
    • The API equivalent of virsh capabilities is run by the scheduler on the source and destination
  • Nova live migration
    • Important config options
      • live_migration_flag += VIR_MIGRATE_LIVE
      • block_migration_flag += VIR_MIGRATE_LIVE
    • Standardized virtual CPU flags
      • libvirt_cpu_mode = custom
      • libvirt_cpu_model = cpu64-rhel6
    • "Max Downtime" (not currently tunable)
      • Look for upstream patches soon
      • QEMU keeps iterating until the cutover can be done within "30" milliseconds
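
The flag options above hold a comma-separated list of libvirt flag names. A minimal sketch of how such a value could be turned into the bitmask libvirt expects, and checked for `VIR_MIGRATE_LIVE`, is shown below. The helper names are hypothetical, and the numeric values are a subset of libvirt's `virDomainMigrateFlags` enum reproduced here so the example runs without libvirt-python installed.

```python
from functools import reduce
from operator import or_

# Subset of libvirt's virDomainMigrateFlags enum.
MIGRATE_FLAGS = {
    "VIR_MIGRATE_LIVE": 1,
    "VIR_MIGRATE_PEER2PEER": 2,
    "VIR_MIGRATE_TUNNELLED": 4,
    "VIR_MIGRATE_PERSIST_DEST": 8,
    "VIR_MIGRATE_UNDEFINE_SOURCE": 16,
    "VIR_MIGRATE_NON_SHARED_INC": 128,
}


def parse_migration_flags(option_value: str) -> int:
    """OR together the flags named in a comma-separated option value."""
    names = [name.strip() for name in option_value.split(",") if name.strip()]
    return reduce(or_, (MIGRATE_FLAGS[name] for name in names), 0)


def has_live_flag(option_value: str) -> bool:
    """Return True if VIR_MIGRATE_LIVE is part of the option value."""
    flags = parse_migration_flags(option_value)
    return bool(flags & MIGRATE_FLAGS["VIR_MIGRATE_LIVE"])
```

For example, `has_live_flag("VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER")` is false, which is the situation the config note above is meant to correct.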

4. Brittle Operations

  • Any long-running, synchronous tasks
    • All migrations (memory sync, disk sync, etc.)
  • No graceful way to stop services
  • Most prone to failure
    • Migrate and resize
    • Live migration (block or otherwise)
    • Instance snapshot

5. Recovering from failures

  • Always investigate before forcing actions
    • Look at the log for exceptions
    • Check whether an instance is running on multiple hypervisors
    • `nova reset-state --active` and `nova reboot --hard` can go a long way
  • Sometimes, brute force is going to be required
    • kill -9 qemu or kvm processes
    • Alter the database records, commonly `host`
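
The "running on multiple hypervisors" check can be sketched as a simple cross-reference. The input is assumed to be a mapping from hypervisor hostname to the instance UUIDs found running there, however that was collected (for example from `virsh list` on each host); the function name and data shape are hypothetical.

```python
from collections import defaultdict


def find_duplicated_instances(running_by_host):
    """Return {instance_uuid: [hosts]} for instances running on more than one host.

    `running_by_host` maps a hypervisor hostname to an iterable of the
    instance UUIDs currently running on it.
    """
    hosts_by_instance = defaultdict(list)
    for host, uuids in running_by_host.items():
        for uuid in uuids:
            hosts_by_instance[uuid].append(host)
    return {
        uuid: sorted(hosts)
        for uuid, hosts in hosts_by_instance.items()
        if len(hosts) > 1
    }
```

Any UUID in the result needs a human decision about which copy is the real one before a process is killed or the `host` column is changed.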

6. "Stuck" Live Migrations

  • Live migrations can get stuck
  • Instances left in a paused state on both ends
    • Monitor socket is unresponsive, Libvirt is helpless
  • Generally a result of an overly aggressive "max downtime" and rapidly changing memory state (e.g., JVM)
  • Can be a result of a QEMU issue/bug
    • managedSave (suspend) will generally be prone to this as well
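
Why a tight "max downtime" and a busy guest combine badly can be seen with some back-of-the-envelope arithmetic. The sketch below is a deliberately simplified model of pre-copy migration, not QEMU's actual algorithm: each pass sends whatever is currently dirty, the guest dirties more memory at a constant rate while that pass is in flight, and the cutover happens once the remainder can be sent within the downtime budget. All names and numbers are illustrative assumptions.

```python
def passes_until_cutover(memory_mb, dirty_rate_mb_s, bandwidth_mb_s,
                         max_downtime_ms=30.0, max_passes=50):
    """Return the number of pre-copy passes needed before cutover, or None.

    None means the remaining dirty memory never shrank enough to be sent
    within `max_downtime_ms` in `max_passes` passes.
    """
    remaining_mb = float(memory_mb)
    for passes in range(max_passes + 1):
        # Cutover is possible once the remainder fits in the downtime budget.
        if remaining_mb / bandwidth_mb_s * 1000.0 <= max_downtime_ms:
            return passes
        # While this pass is being sent, the guest keeps dirtying memory.
        transfer_s = remaining_mb / bandwidth_mb_s
        remaining_mb = min(float(memory_mb), dirty_rate_mb_s * transfer_s)
    return None
```

In this model a 4096 MB guest dirtying 100 MB/s over a 1000 MB/s link reaches the 30 ms budget after three passes, while one that dirties memory faster than the link can carry it never converges.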


