Live Migration of Linux Containers

Recently, I've been playing around with checkpoint and restore of Linux containers. One of the obvious applications is checkpointing on one host and restoring on another (i.e. live migration). Live migration has all sorts of interesting applications, so it is nice to know that at least a proof of concept of it works today.

Anyway, onto the interesting bits! The first thing I did was create two VMs and install the development versions of CRIU and LXC on both hosts:

sudo add-apt-repository ppa:ubuntu-lxc/daily
sudo apt-get update
sudo apt-get install lxc

sudo apt-get install build-essential protobuf-c-compiler
git clone https://github.com/xemul/criu && cd criu && sudo make install

Then, I created a container:

sudo lxc-create -t ubuntu -n u1 -- -r trusty -a amd64

Since the work on container checkpoint/restore is so young, not all container configurations are supported. In particular, I had to add the following to my config:

cat << EOF | sudo tee -a /var/lib/lxc/u1/config
# hax for criu
lxc.console = none
lxc.tty = 0
lxc.cgroup.devices.deny = c 5:1 rwm
EOF
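
Before checkpointing, it can be handy to confirm that these settings actually made it into the config. Here is a minimal sketch of such a check; the function name missing_criu_settings is made up for illustration, and it only does exact line matching, so differently spaced entries would be reported as missing:

```shell
#!/bin/sh
# Hypothetical helper (not part of lxc or criu): print each of the
# CRIU-related settings from the snippet above that is absent from
# the given LXC config file.
missing_criu_settings() {
  config=$1
  for setting in \
    "lxc.console = none" \
    "lxc.tty = 0" \
    "lxc.cgroup.devices.deny = c 5:1 rwm"
  do
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq "$setting" "$config" || echo "$setting"
  done
}

# Example against a throwaway file rather than a real container config.
demo_config=$(mktemp)
printf '%s\n' "lxc.console = none" "lxc.tty = 0" > "$demo_config"
missing_criu_settings "$demo_config"
rm -f "$demo_config"
```

In this example the only line printed is the lxc.cgroup.devices.deny entry, since the throwaway file leaves it out. For a real container you would pass /var/lib/lxc/u1/config instead.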

Finally, although the lxc-checkpoint tool lets us checkpoint and restore containers, there is no direct support for migration today. There are several tools in the works for this, but for now we can just use a cheesy shell script:

cat > migrate <<'EOF'
#!/bin/sh
set -e

usage() {
  echo $0 container user@host.to.migrate.to
  exit 1
}

if [ "$(id -u)" != "0" ]; then
  echo "ERROR: Must run as root."
  usage
fi

if [ "$#" != "2" ]; then
  echo "Bad number of args."
  usage
fi

name=$1
host=$2

checkpoint_dir=/tmp/checkpoint

do_rsync() {
  rsync -aAXHltzh --progress --numeric-ids --devices --rsync-path="sudo rsync" $1 $host:$1
}

# we assume the same lxcpath on both hosts, that is bad.
LXCPATH=$(lxc-config lxc.lxcpath)

lxc-checkpoint -n $name -D $checkpoint_dir -s -v

do_rsync $LXCPATH/$name/
do_rsync $checkpoint_dir/

ssh $host "sudo lxc-checkpoint -r -n $name -D $checkpoint_dir -v"
ssh $host "sudo lxc-wait -n $name -s RUNNING"
EOF
chmod +x migrate
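
One shell detail matters when a script is written out through a here-document like this: whether the delimiter is quoted decides if things like $1 and $(id -u) are expanded while the file is being written or left alone until the generated script runs. The sketch below shows the difference using throwaway temp files and made-up argument values; it is not part of the migrate script itself:

```shell
#!/bin/sh
# Sketch: the same line written through an unquoted and a quoted
# here-document. Temp files and argument values are illustrative only.
unquoted=$(mktemp)
quoted=$(mktemp)

# Give the writing shell a first positional argument of its own.
set -- writer-arg

# Unquoted delimiter: $1 is expanded now, by the shell running cat.
cat > "$unquoted" <<EOF
echo "first arg: $1"
EOF

# Quoted delimiter: the text is written literally, so $1 is expanded
# later, when the generated script itself runs.
cat > "$quoted" <<'EOF'
echo "first arg: $1"
EOF

unquoted_result=$(sh "$unquoted" runtime-arg)
quoted_result=$(sh "$quoted" runtime-arg)
echo "unquoted: $unquoted_result"
echo "quoted:   $quoted_result"
rm -f "$unquoted" "$quoted"
```

The unquoted version reports writer-arg no matter what it is called with, while the quoted version reports runtime-arg. For a generated script that reads its own arguments, the quoted form is the one you want.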

Now, for the magic show! I've set up the container I created above to be a web server running micro-httpd that serves an incredibly important message:

$ ssh ubuntu@$(sudo lxc-info -n u1 -H -i)
ubuntu@u1:~$ sudo apt-get install micro-httpd
ubuntu@u1:~$ echo "Meshuggah is the best metal band." | sudo tee /var/www/index.html
ubuntu@u1:~$ exit
$ curl -s $(sudo lxc-info -n u1 -H -i)
Meshuggah is the best metal band.

Let's migrate!

$ sudo ./migrate u1 ubuntu@criu2.local
  # lots of rsync output...
$ ssh ubuntu@criu2.local 'curl -s $(sudo lxc-info -n u1 -H -i)'
Meshuggah is the best metal band.

Of course, there are several caveats to this. You have to add the lines above to your config, which means you can't dump containers with ttys. Since containers have the host's fusectl bind mounted and fuse mounts aren't supported by criu, containers or hosts using fuse can't be dumped. You can't migrate unprivileged containers yet. There are probably others that I'm forgetting, though a list of troubleshooting steps is available at criu.org/LXC#Troubleshooting.
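
For the fuse caveat, a quick look at the mount table shows what is there before you try a dump. This is only a rough sketch: the function name list_fuse_mounts is hypothetical, it assumes the /proc/mounts column layout, and it simply lists every entry whose filesystem type starts with "fuse" without deciding whether a given entry would actually make criu fail:

```shell
#!/bin/sh
# Hypothetical helper: print the mount point and type of every
# fuse-related entry in a mounts table laid out like /proc/mounts
# (device, mount point, type, options, dump, pass).
list_fuse_mounts() {
  awk '$3 ~ /^fuse/ { print $2, $3 }' "${1:-/proc/mounts}"
}

# Example with made-up sample data instead of the live mount table.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/sda1 / ext4 rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
sshfs#host:/srv /mnt/remote fuse.sshfs rw,nosuid,nodev 0 0
EOF
list_fuse_mounts "$sample"
rm -f "$sample"
```

Called with no argument, it reads /proc/mounts on the machine (or inside the container) where it runs.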

There is ongoing work in both CRIU and LXC to get rid of all the caveats above, so stay tuned!

Comments

gustavo at 2014-10-15 01:47:58 UTC:

i couldn't agree more with webserver's message. besides, this is a very cool feature

huats at 2014-10-18 20:52:14 UTC:

I am running the Ubuntu daily PPA. When I try to use lxc-checkpoint (CRIU is installed from git, since I followed your guide), I get:

lxc-checkpoint -r -n u1 -D /tmp/checkpoint/ -v
root@test-criu# sh: 1: /usr/lib/x86_64-linux-gnu/lxc/lxc-restore-net: not found

Have you faced that?

tycho at 2014-10-24 18:52:43 UTC:

Yep, I've sent a patch to the list and filed a bug, https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1384751

David VomLehn at 2014-11-18 03:57:04 UTC:

It looks like:

cat | sudo tee -a /var/lib/lxc/u1/config << EOF

should be:

cat << EOF | sudo tee -a /var/lib/lxc/u1/config

at least for bash.

tycho at 2014-11-30 15:32:21 UTC:

Updated, thanks!

johns at 2014-12-08 18:08:22 UTC:

Can you confirm if "live" migrating the container means the processes running inside it are started on the target system with their memory contents intact as if nothing had happened? Is there a difference in concepts between "live migration" of VMs and containers?

tycho at 2014-12-11 16:33:04 UTC:

Yes, this is migration in the same sense as VMs, where everything is migrated as-is, using the CRIU tool to do the heavy lifting in terms of migration.

Marc MAURICE at 2015-01-14 11:43:30 UTC:

Hello, two changes need to be made in the script:

* Use lxc-wait -n $name at the end of the script.
* Change the rsync flags to -aXHzh:
  --> a is needed to preserve file attributes (otherwise, for example, /tmp/ loses its sticky bit)
  --> X is needed to preserve advanced file attributes (like the setuid bit for ping, otherwise ping is not working after migration)
  --> H is needed to preserve hardlinks if you have some hardlinks in the container.

Bob Brown at 2015-01-22 08:57:34 UTC:

Hey Tycho - I met you in Dunedin at our Hackathon event. I then coincidentally found this post a couple of days later while researching how to move an LXC container from one machine to another. Is there anything special about the LXC container that would prevent it working seamlessly if copied from one machine to another (assuming it was stopped at the time)? I can't quite see how file ownership in the container itself works, but the ownership is certainly stripped if I just copy the files. rsync has helped, and I've just spotted Marc's suggestion above, which I shall try. Next stop, LXD :) Cheers, Bob.

Lennie at 2015-01-24 14:58:30 UTC:

When using rsync, I suggest you use --numeric-ids. I think that might be the problem Bob Brown was having.

tycho at 2015-01-29 16:57:57 UTC:

Hey all, thanks for the comments! I've updated the post accordingly.

Guido Jäkel at 2015-02-05 12:48:31 UTC:

To shorten the offline/takeover time span, you may "pre-sync" the container's rootfs before taking the checkpoint and then "final-sync" the latest changes again using rsync's "update newer" feature.

JPD at 2015-03-16 04:29:08 UTC:

Hi, I tried to follow your tutorial and ran into some issues with the checkpoint. Checkpointing my container (u1) fails every time. The error shown in dump.log is:

(00.011624) Error (mount.c:624): 113:./sys/fs/cgroup/perf_event doesn't have a proper root mount

I'm not sure what it means, and I'm still not very familiar with lxc configuration. I use the command below to checkpoint lxc with criu:

sudo lxc-checkpoint -s -D /tmp/checkpoint -n u1 -v

And I use the commands below to create and run my lxc container:

sudo lxc-create -t ubuntu -n u1 -- -r trusty -a amd64
sudo lxc-start -t ubuntu -n u1

PS. I use criu version 1.5. Thanks in advance for the help. JPD

Ruslan at 2015-03-23 21:06:06 UTC:

JPD, I get the same error too, so I've sent a letter[1] to lxc-devel mailing list asking if they could help. [1] https://lists.linuxcontainers.org/pipermail/lxc-devel/2015-March/011489.html

Swapnil Haria at 2015-05-02 19:36:46 UTC:

I am able to checkpoint/restore as well as migrate containers correctly using this tutorial. However, processes running inside the container get killed in the transition. Is there a way to launch processes in the container so that they get resumed after the container is restored? I have tried using nohup to create background processes as well, to no avail.

Craynic at 2015-05-19 06:37:55 UTC:

I got a similar error to JPD's:

Error (mount.c:630): 105:./sys/fs/cgroup/perf_event doesn't have a proper root mount

Andrew at 2015-06-29 10:08:19 UTC:

I've tried the above migration code, but I get the following message: "sudo: no tty present and no askpass program specified". Any suggestions as to how to fix this?

Casey at 2015-07-25 04:26:52 UTC:

Meshuggah!!!!

Lena at 2016-01-23 21:03:55 UTC:

Hello Tycho, thanks for your post. I ran into a problem here. After I added the lines to the config file, I couldn't start my container. It showed me:

lxc-start: lxc_start.c: main: 344 The container failed to start.
lxc-start: lxc_start.c: main: 346 To get more details, run the container in foreground mode.
lxc-start: lxc_start.c: main: 348 Additional information can be obtained by setting the --logfile and --logpriority options.

Any ideas on how to resolve this issue? Thank you!

Lena at 2016-01-26 15:07:52 UTC:

Hey tycho, thanks for the post! I'm pretty new to shell scripting, so I was wondering if every "name" in your code should be replaced by "u1"? And should every "host" be replaced by the actual host name or IP we want to migrate to? And in "echo $0 container user@host.to.migrate.to", should "container" be replaced with u1 too? Is that correct? Thanks!
