stacker: build OCI images without host privilege

Ahoy! Recently, I've been working on a tool called stacker, which allows unprivileged users to build OCI images. The images are generated without uid shifting, so they look like any other OCI image generated by Docker or some other mechanism, while not requiring root (worth noting that this is what James Bottomley has described as his motivation for writing shiftfs).

Some base setup is required in order to make this happen, though. First, you can follow stacker's install guide to build and install it.

Next, as with any user namespaces setup, stacker needs a 65k uid delegation. On my ubuntu VM with the ubuntu user, this looks like,

$ grep ubuntu /etc/subuid
ubuntu:165536:65536
$ grep ubuntu /etc/subgid
ubuntu:165536:65536

Note that these can be any 65k range of subuids; stacker will use whatever range is delegated to the user you run it as.
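
If the user doesn't already have such a delegation, recent versions of shadow's usermod can add one. A minimal sketch (the starting offset 165536 is just an example; any unused 65k range is fine):

# add a 65536-wide uid and gid delegation for the ubuntu user
# (the starting offset 165536 is arbitrary; pick any unused range)
sudo usermod --add-subuids 165536-231071 --add-subgids 165536-231071 ubuntu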

Finally, stacker also needs a btrfs filesystem. Stacker was designed to build a large number of varying images from a single base image, and it uses btrfs to avoid doing a large amount of i/o (and compression/decompression) when undiffing filesystems back to their original state. For the purposes of this blog post, we can just use a loopback mounted btrfs filesystem. A slightly modified excerpt from the stacker test suite:

# btrfs setup
sudo truncate -s 100G btrfs.loop
sudo mkfs.btrfs btrfs.loop
sudo mkdir -p roots
# allow for unprivileged subvolume deletion; use a sane flushing strategy
sudo mount -o user_subvol_rm_allowed,flushoncommit,loop btrfs.loop roots
# now make sure ubuntu can actually do stuff with this filesystem
sudo chown -R ubuntu:ubuntu roots
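
As a quick sanity check that the unprivileged user can really manage subvolumes on this mount (which is what user_subvol_rm_allowed is for), you can create and delete a throwaway subvolume as that user:

# run these as the ubuntu user, not root
btrfs subvolume create roots/smoke-test
btrfs subvolume delete roots/smoke-test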

And with that, we can actually run stacker and build an image:

stacker build -f ./stacker.yaml

What goes in stacker.yaml you ask? Consider the example from stacker's readme:

centos:
    from:
        type: tar
        url: http://example.com/centos.tar.gz
    environment:
        http_proxy: http://example.com:8080
        https_proxy: https://example.com:8080
    labels:
        foo: bar
        bar: baz
boot:
    from:
        type: built
        tag: centos
    run: |
        yum install openssh-server
        echo meshuggah rocks
web:
    from:
        type: built
        tag: centos
    import: ./lighttp.cfg
    run: |
        yum install lighttpd
        cp /stacker/lighttp.cfg /etc/lighttpd/lighttp.cfg
    entrypoint: lighttpd
    volumes:
        - /data/db
    working_dir: /var/lib/www

Each top-level key is the name of a tag in the OCI image to be built; in this case there will be three tags at the end: centos, boot, and web (admittedly, this example is quite contrived :). Underneath those, there are the following keys:

- from: this describes the base image that stacker will start from. You can either start from some other image in the same stackerfile, a Docker image, or a tarball.

- import: a set of files to download or copy into the container. Stacker will put these files at /stacker, which will be automatically cleaned up after the commands in the run section are run and the image is finalized.

- run: the set of commands to run in order to build the image; they are run in a user namespaced container, with the set of imported files available in /stacker.

- environment, labels, working_dir, volumes: these all correspond exactly to the similarly named bits in the OCI image config spec, and are available for users to pass things through to the runtime environment of the image.
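
Once the build finishes, the result is a normal OCI layout, so any OCI-aware tool can inspect it. For example, assuming stacker wrote its output to a layout directory called oci (check the stacker docs for the actual output path), umoci can list the tags and unpack one of them:

# list the tags in the generated OCI layout (the "oci" path is an assumption)
umoci list --layout oci
# unpack one of the tags to poke around in its root filesystem
sudo umoci unpack --image oci:web web-bundle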

That's a bit about stacker. Hopefully some more details about the internals will appear at some point :). Happy hacking!

Mounting your home directory in LXD

As of LXD stable 2.0.8 and feature release 2.6, LXD has support for various UID and GID map related manipulations. A common question is: "How do I bind-mount my home directory into a container?" Previously the answer was "well, it's complicated, but you can do it; it's slightly less complicated if you use privileged containers". With this feature, though, you can now do it very easily in unprivileged containers.

First, find out your uid on the host:

$ id
uid=1000(tycho) gid=1000(tycho) groups=1000(tycho),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),112(lpadmin),124(sambashare),129(libvirtd),149(lxd),150(sbuild)

On standard Ubuntu hosts, the uid of the first user is 1000. Now, we need to allow LXD to remap this id; you'll need an additional entry for root to do this:

$ echo 'root:1000:1' | sudo tee -a /etc/subuid /etc/subgid

Now, create a container, and set the idmap up to map both uid and gid 1000 to uid and gid 1000 inside the container.

$ lxc init ubuntu-daily:z zesty
Creating zesty
$ lxc config set zesty raw.idmap 'both 1000 1000'
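
If you ever need to map the uid and gid to different values, raw.idmap also accepts separate uid and gid lines; the "both" setting above should be equivalent to something like this sketch:

# one mapping per line: host id first, then the id inside the container
lxc config set zesty raw.idmap 'uid 1000 1000
gid 1000 1000'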

Next, set up your home directory to be mounted in the container:

$ lxc config device add zesty homedir disk source=/home/tycho path=/home/ubuntu
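
You can double-check the device (and remove it again later if you like) with the usual config commands:

# show the devices attached to the container
lxc config device show zesty
# and, if you change your mind later:
# lxc config device remove zesty homedir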

And leave an insightful message for users of the container:

$ echo 'meshuggah rocks' >> message

Finally, start your container and read the message:

$ lxc start zesty
$ lxc exec zesty cat /home/ubuntu/message
meshuggah rocks

And enjoy the insight offered to you by your home directory :)

LXD networking: lxdbr0 explained

Recently, LXD stopped depending on lxc, and thus moved to using its own bridge, called lxdbr0. lxdbr0 behaves significantly differently than lxcbr0: it is ipv6 link local only by default (i.e. there is no ipv4 or ipv6 subnet configured by default), and only HTTP traffic is proxied over the network. This means that e.g. you can't SSH to your LXD containers with the default configuration of lxdbr0.

The motivation for this change was mostly to avoid picking subnets for users, because doing so can cause breakage, and to have users pick their own subnets instead. Previously, the script that set up lxcbr0 looked around on the host's network, and picked the first 10.0.*.1 address for the bridge that was available. Of course, in some cases (e.g. networks which weren't visible at the time of bridge creation) this can break routing for users' networks.

So, if you want to have parity with lxcbr0, you'll need to configure the bridge yourself. There are a few ways to do this. For a step by step walkthrough of just configuring the bridge, simply do:

sudo dpkg-reconfigure -p medium lxd

And answer the questions however you like. If you've never configured LXD at all (and e.g. want to use a fancy filesystem like ZFS), try:

sudo lxd init

Which will configure all of LXD (both the filesystem and lxdbr0). Finally, you can edit the file /etc/default/lxd-bridge and then do a:

sudo service lxd-bridge stop && sudo service lxd restart

For feature parity with lxcbr0, you can use something like the following (note the 10.0.4.*, so as not to conflict with lxcbr0):

# Whether to setup a new bridge or use an existing one
USE_LXD_BRIDGE="true"

# Bridge name
# This is still used even if USE_LXD_BRIDGE is set to false
# set to an empty value to fully disable
LXD_BRIDGE="lxdbr0"

# Path to an extra dnsmasq configuration file
LXD_CONFILE=""

# DNS domain for the bridge
LXD_DOMAIN="lxd"

# IPv4
## IPv4 address (e.g. 10.0.4.1)
LXD_IPV4_ADDR="10.0.4.1"

## IPv4 netmask (e.g. 255.255.255.0)
LXD_IPV4_NETMASK="255.255.255.0"

## IPv4 network (e.g. 10.0.4.0/24)
LXD_IPV4_NETWORK="10.0.4.0/24"

## IPv4 DHCP range (e.g. 10.0.4.2,10.0.4.254)
LXD_IPV4_DHCP_RANGE="10.0.4.2,10.0.4.254"

## IPv4 DHCP number of hosts (e.g. 250)
LXD_IPV4_DHCP_MAX="253"

## NAT IPv4 traffic
LXD_IPV4_NAT="true"

# IPv6
## IPv6 address (e.g. 2001:470:b368:4242::1)
LXD_IPV6_ADDR=""

## IPv6 CIDR mask (e.g. 64)
LXD_IPV6_MASK=""

## IPv6 network (e.g. 2001:470:b368:4242::/64)
LXD_IPV6_NETWORK=""

## NAT IPv6 traffic
LXD_IPV6_NAT="false"

# Run a minimal HTTP PROXY server
LXD_IPV6_PROXY="false"

And that's it! That's all you need to do to configure lxdbr0.
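
Once lxd-bridge has been restarted with that configuration, you can verify that the bridge actually picked up the address, and that newly started containers get a lease from the 10.0.4.x range:

# the bridge should now carry the 10.0.4.1/24 address configured above
ip addr show lxdbr0
# freshly started containers should show a 10.0.4.x address here
lxc list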

Sometimes, though, you don't really want your containers to live on a separate network from the host, because you want to ssh to them directly or something. There are a few ways to accomplish this; the simplest is with macvlan:

lxc profile device set default eth0 parent eth0
lxc profile device set default eth0 nictype macvlan

Another way to do this is by adding another bridge which is bridged onto your main NIC. You'll need to edit your /etc/network/interfaces.d/eth0.cfg to look like this:

# The primary network interface
auto eth0
iface eth0 inet manual # note the manual here

And then add a bridge by creating /etc/network/interfaces.d/containerbr.cfg with the contents:

auto containerbr
iface containerbr inet dhcp
  bridge_ports eth0

Finally, you'll need to change the default lxd profile to use your new bridge:

lxc profile device set default eth0 parent containerbr

Restart the networking service (which, if you do it over ssh, may kick you off :), and away you go. If you want some of your containers to be on one bridge, and some on the other, you can use different profiles to accomplish this.
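
For example, here is a minimal sketch of keeping the default profile on lxdbr0 while putting selected containers on the new bridge (the profile and container names are made up):

# copy the default profile and point the copy's NIC at the host-side bridge
lxc profile copy default hostbridge
lxc profile device set hostbridge eth0 parent containerbr
# containers launched with -p hostbridge land on containerbr;
# everything else keeps using the default profile's bridge
lxc launch ubuntu: c1 -p hostbridge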

linux.conf.au 2016 talk

Last week I did this ridiculous thing where I flew around the world in the easterly direction, giving talks at FOSDEM and linux.conf.au. The linux.conf.au staff always do a great job of making talk videos, and this year was no exception.

My talk was on LXD and live migration, a brief history of both as well as a status update and some discussion of future work on both. There were also lots of questions in this talk, so there's a lot of discussion of basic migration questions and inner workings.

Unfortunately, I can't embed it here, so I'll give you a link instead. Also, keep in mind that at the time I was giving this talk I had been up for ~40 hours, so I forgot some English words here and there :)

https://www.youtube.com/watch?v=ol85OJxDaHc

Using the LXD API from Python

After our recent splash at ODS in Vancouver, it seems that there is a lot of interest in writing some python code to drive LXD to do various things. The first option is to use pylxd, a project maintained by a friend of mine at Canonical named Chuck Short. However, the primary client of pylxd is OpenStack, so it is python2, and since we don't want to add a lot of dependencies to that module, it uses raw python urllib and friends, which as you know can sometimes be... painful :)

Another option would be to use python's awesome requests module, which is considerably more user friendly. However, since LXD uses client certificates, it can be a bit challenging to get the basic bits going. Here's a small program that just does some GETs to the API, to see how it might work:

import os.path

import requests

conf_dir = os.path.expanduser('~/.config/lxc')
crt = os.path.join(conf_dir, 'client.crt')
key = os.path.join(conf_dir, 'client.key')

print(requests.get('https://127.0.0.1:8443/1.0', verify=False, cert=(crt, key)).text)

which gives me (piped through jq for sanity):

$ python3 lxd.py | jq .
{
  "type": "sync",
  "status": "Success",
  "status_code": 200,
  "metadata": {
    "api_compat": 1,
    "auth": "trusted",
    "config": {
      "trust-password": true
    },
    "environment": {
      "backing_fs": "ext4",
      "driver": "lxc",
      "kernel_version": "3.19.0-15-generic",
      "lxc_version": "1.1.2",
      "lxd_version": "0.9"
    }
  }
}

It just piggybacks on the certificates generated by the lxc client for now, but it would be great to have some python code that could generate those as well!
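
For reference, the certificate doesn't have to come from the lxc client at all: LXD will trust any client certificate that has been added to its trust store, so you could generate your own with openssl, roughly like this (the paths and subject name are just examples):

# generate a self-signed client certificate and key for the API
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
    -subj "/CN=python-client" \
    -keyout client.key -out client.crt
# then add it to LXD's trust store, e.g. via the lxc client:
# lxc config trust add client.crt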

Another bit I should point out is lxd's --debug flag, which prints out every request it receives and every response it sends. I found this useful while developing the default lxc client, and it will probably be useful to those of you out there who are developing your own clients.

Happy hacking!

Live Migration in LXD

There has been a lot of interest on the various mailing lists as well as internally at Canonical about the state of migration in LXD, so I thought I'd write a bit about the current state of affairs.

Migration in LXD today passes the "Doom demo" test, i.e. it works well enough to reproduce the LXD announcement demo under certain conditions, which I'll cover below. There is still a lot of ongoing work to make CRIU (the underlying migration technology) work with all these configurations, so support will eventually arrive for everything. For now, though, you'll need to use the configuration I describe below.

First, I should note that things currently won't work on a systemd host. Since systemd re-mounts the rootfs as MS_SHARED, lots of things automatically become shared mounts, which confuses CRIU. There are several mailing list threads about ongoing work with respect to shared mounts in CRIU and I expect something to be merged that will resolve the situation shortly, but for now your host machine needs to be a non-systemd host (i.e. trusty or utopic will work just fine, but not vivid).

You'll need to install the daily versions of liblxc and lxd from their respective PPAs on each host:

sudo apt-add-repository -y ppa:ubuntu-lxc/daily
sudo apt-add-repository -y ppa:ubuntu-lxc/lxd-git-master
sudo apt-get update
sudo apt-get install lxd

Also, you'll need to uninstall lxcfs on both hosts:

sudo apt-get remove lxcfs

liblxc currently doesn't support migrating the mount configuration that lxcfs uses, although there is some work on that as well. The overmounting issue has been fixed in lxcfs, so I expect to land some patches in liblxc soon that will make lxcfs work.

Next, you'll want to set a password for your new lxd instance:

lxc config set password foo

You need some images in lxd, which can be acquired easily enough with lxd-images (of course, this only needs to be done on the source host of the migration):

lxd-images import lxc ubuntu trusty amd64 --alias ubuntu

You'll also need to set a few configuration items in lxd. First, the container needs to be privileged, although there is yet more ongoing work to remove this restriction. There are also a few things that CRIU does not support, so we need to set our container config to respect those as well. You can do all of this using lxd's profiles mechanism, that is:

lxc config profile create migratable
lxc config profile edit migratable

And paste the following content in instead of what's there:

name: migratable
config:
  raw.lxc: |
    lxc.console = none
    lxc.cgroup.devices.deny = c 5:1 rwm
    lxc.mount.auto =
    lxc.mount.auto = proc:mixed sys:mixed
  security.privileged: "true"
devices:
  eth0:
    nictype: bridged
    parent: lxcbr0
    type: nic

Then, launch your container:

lxc launch ubuntu migratee -p migratable

Finally, add both of your LXDs as non unix-socket remotes (required for now, but not forever):

lxc remote add lxd thishost:8443   # don't use localhost here
lxc remote add lxd2 otherhost:8443 # use a publicly addressable name

Profiles used by a particular container need to be present on both the source of the migration and the sink, so we should copy the profile to the sink as well:

lxc config profile copy migratable lxd2:

And now, you're ready for the magic!

lxc start migratee
lxc move lxd:migratee lxd2:migratee

With luck, you'll have migrated the container to lxd2. Of course, things don't always go right the first time. The full log file for the migration attempts should be available in /var/log/lxd/migratee/migration_{dump|restore}_<timestamp>.log, on the respective host where the dump or restore took place. If you aren't successful in migrating things (or parsing the dump/restore log), feel free to mail lxc-users, and I can help you debug what went wrong.
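
When the move does succeed, the container should show up as RUNNING on the target remote, which you can confirm with something like:

# the container should now exist (and be running) only on lxd2
lxc list lxd2:
lxc exec lxd2:migratee -- hostname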

Happy hacking!

lxd and Doom migration demo

Last week at the Openstack Developer Summit I gave a live demo of migrating a linux container running doom, which generated quite a lot of excitement! Several people asked me for steps on reproducing the demo, which I have just posted.

I am one of Canonical's developers working on lxd, and I will be focused on bringing migration and other features into it. I'm very excited about the opportunity to work on this project! Stay tuned!

Live Migration of Linux Containers

Recently, I've been playing around with checkpoint and restore of Linux containers. One of the obvious applications is checkpointing on one host and restoring on another (i.e. live migration). Live migration has all sorts of interesting applications, so it is nice to know that at least a proof of concept of it works today.

Anyway, onto the interesting bits! The first thing I did was create two VMs, and install the development versions of CRIU and LXC on both hosts:

sudo add-apt-repository ppa:ubuntu-lxc/daily
sudo apt-get update
sudo apt-get install lxc

sudo apt-get install build-essential protobuf-c-compiler
git clone https://github.com/xemul/criu && cd criu && sudo make install

Then, I created a container:

sudo lxc-create -t ubuntu -n u1 -- -r trusty -a amd64

Since the work on container checkpoint/restore is so young, not all container configurations are supported. In particular, I had to add the following to my config:

cat << EOF | sudo tee -a /var/lib/lxc/u1/config
# hax for criu
lxc.console = none
lxc.tty = 0
lxc.cgroup.devices.deny = c 5:1 rwm
EOF

Finally, although the lxc-checkpoint tool allows us to checkpoint and restore containers, there is no direct support for migration today. There are several tools in the works for this, but for now we can just use a cheesy shell script:

cat > migrate << "EOF"
#!/bin/sh
set -e

usage() {
  echo $0 container user@host.to.migrate.to
  exit 1
}

if [ "$(id -u)" != "0" ]; then
  echo "ERROR: Must run as root."
  usage
fi

if [ "$#" != "2" ]; then
  echo "Bad number of args."
  usage
fi

name=$1
host=$2

checkpoint_dir=/tmp/checkpoint

do_rsync() {
  rsync -aAXHltzh --progress --numeric-ids --devices --rsync-path="sudo rsync" $1 $host:$1
}

# we assume the same lxcpath on both hosts, that is bad.
LXCPATH=$(lxc-config lxc.lxcpath)

lxc-checkpoint -n $name -D $checkpoint_dir -s -v

do_rsync $LXCPATH/$name/
do_rsync $checkpoint_dir/

ssh $host "sudo lxc-checkpoint -r -n $name -D $checkpoint_dir -v"
ssh $host "sudo lxc-wait -n u1 -s RUNNING"
EOF
chmod +x migrate
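
Before attempting a cross-host migration, it can be worth checking that a plain local checkpoint and restore works, using the same flags the script uses; a quick smoke test looks something like:

# dump the container (-s stops it after the dump succeeds)
sudo lxc-checkpoint -n u1 -D /tmp/checkpoint -s -v
# restore it in place from the same image directory
sudo lxc-checkpoint -r -n u1 -D /tmp/checkpoint -v
sudo lxc-wait -n u1 -s RUNNING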

Now, for the magic show! I've set up the container I created above to be a web server running micro-httpd that serves an incredibly important message:

$ ssh ubuntu@$(sudo lxc-info -n u1 -H -i)
ubuntu@u1:~$ sudo apt-get install micro-httpd
ubuntu@u1:~$ echo "Meshuggah is the best metal band." | sudo tee /var/www/index.html
ubuntu@u1:~$ exit
$ curl -s $(sudo lxc-info -n u1 -H -i)
Meshuggah is the best metal band.

Let's migrate!

$ sudo ./migrate u1 ubuntu@criu2.local
  # lots of rsync output...
$ ssh ubuntu@criu2.local 'curl -s $(sudo lxc-info -n u1 -H -i)'
Meshuggah is the best metal band.

Of course, there are several caveats to this. You've got to add the lines above to your config, which means you can't dump containers with ttys. Since containers have the host's fusectl bind mounted and fuse mounts aren't supported by criu, containers or hosts using fuse can't be dumped. You can't migrate unprivileged containers yet. There are probably others that I'm forgetting, but a list of troubleshooting steps is available at criu.org/LXC#Troubleshooting.

There is ongoing work in both CRIU and LXC to get rid of all the caveats above, so stay tuned!