stacker: build OCI images without host privilege

Ahoy! Recently, I've been working on a tool called stacker, which allows unprivileged users to build OCI images. The images are generated without uid shifting, so they look like any other OCI image generated by Docker or some other mechanism, while not requiring root (worth noting that this is what James Bottomley has described as his motivation for writing shiftfs).

Some base setup is required in order to make this happen, though. First, you can follow stacker's install guide to build and install it.

Next, as with any user namespaces setup, stacker needs a 65k uid delegation. On my ubuntu VM with the ubuntu user, this looks like,

$ grep ubuntu /etc/subuid
$ grep ubuntu /etc/subgid

Note that these can be any 65k range of subuids; stacker will use whatever you give the user you run it as.
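To double check a delegation before running stacker, something like the following sketch may help. It only parses the standard `name:start:count` format of /etc/subuid and /etc/subgid; the helper names are made up for this example, and treating "65k" as a single range of at least 65536 ids is an assumption about what stacker needs.

```python
from typing import List, Tuple


def subid_ranges(text: str, user: str) -> List[Tuple[int, int]]:
    """Return (start, count) pairs delegated to `user` in subuid/subgid text."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(":")
        # Each entry has the form name:start:count
        if len(parts) != 3 or parts[0] != user:
            continue
        ranges.append((int(parts[1]), int(parts[2])))
    return ranges


def has_64k_range(text: str, user: str) -> bool:
    """True if at least one delegated range holds 65536 ids or more (assumed requirement)."""
    return any(count >= 65536 for _, count in subid_ranges(text, user))


if __name__ == "__main__":
    # "ubuntu" matches the user in this post; substitute your own.
    for path in ("/etc/subuid", "/etc/subgid"):
        try:
            with open(path) as f:
                print(path, has_64k_range(f.read(), "ubuntu"))
        except OSError as err:
            print(path, "could not be read:", err)
```

If either file reports False, add a range for the user before going any further.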

Finally, stacker also needs a btrfs filesystem. Stacker was designed to build a large number of varying images from a single base image, and uses btrfs to avoid doing a large amount of i/o (and compression/decompression), undiffing filesystems back to their original state. For the purposes of this blog post, we can just use a loopback mounted btrfs filesystem. A slightly modified excerpt from the stacker test suite:

# btrfs setup
sudo truncate -s 100G btrfs.loop
sudo mkfs.btrfs btrfs.loop
sudo mkdir -p roots
# allow for unprivileged subvolume deletion; use a sane flushing strategy
sudo mount -o user_subvol_rm_allowed,flushoncommit,loop btrfs.loop roots
# now make sure ubuntu can actually do stuff with this filesystem
sudo chown -R ubuntu:ubuntu roots

And with that, we can actually run stacker and build an image:

stacker build -f ./stacker.yaml

What goes in stacker.yaml you ask? Consider the example from stacker's readme:

centos:
	from:
		type: tar
	labels:
		foo: bar
		bar: baz
boot:
	from:
		type: built
		tag: centos
	run: |
		yum install openssh-server
		echo meshuggah rocks
web:
	from:
		type: built
		tag: centos
	import: ./lighttp.cfg
	run: |
		yum install lighttpd
		cp /stacker/lighttp.cfg /etc/lighttpd/lighttp.cfg
	entrypoint: lighthttpd
	volumes:
		- /data/db
	working_dir: /var/lib/www

The top level describes the name of a tag in the OCI image to be built, in this case there will be three tags at the end: centos, boot, and web (notably, this example is quite contrived :). Underneath those, there are the following keys:

* from: this describes the base image that stacker will start from. You can either start from some other image in the same stackerfile, a Docker image, or a tarball.
* import: A set of files to download or copy into the container. Stacker will put these files at /stacker, which will be automatically cleaned up after the commands in the run section are run and the image is finalized.
* run: This is the set of commands to run in order to build the image; they are run in a user namespaced container, with the set of files imported available in /stacker.
* environment, labels, working_dir, volumes: these all correspond exactly to the similarly named bits in the OCI image config spec, and are available for users to pass things through to the runtime environment of the image.

That's a bit about stacker. Hopefully some more details about the internals will appear at some point :). Happy hacking!

pw, a stateless password generation tool

Ahoy! Recently, I have been working on a new stateless password generation tool, primarily to learn the language Rust. The idea was to build a replacement for password, which, while I use it daily, could use a few extra features.

While I could elaborate on pw's features, I think it's best to just copy the text from pw's readme:

pw uses pbkdf2 with sha512 to stretch your password, with the supplied entity as the salt. The result is encoded in base58, meaning that each symbol in the password has ~5.86 bits of entropy. By default, pw generates passwords of length 20, so there are ~117 bits of entropy per (default) password. By comparison, "correct horse battery staple" is only 44 bits.
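The entropy figures above, and the general shape of the scheme, can be sketched in a few lines of Python. This is an illustration only, not pw's actual code: the iteration count, the Bitcoin-style base58 alphabet, the choice of which 20 symbols to keep, and the function names are all assumptions made for the example.

```python
import hashlib
import math

# Bitcoin-style base58 alphabet (assumed; pw may order its symbols differently).
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def bits_per_symbol(alphabet_size: int) -> float:
    """Entropy of one uniformly chosen symbol, in bits."""
    return math.log2(alphabet_size)


def base58_encode(data: bytes) -> str:
    """Encode bytes as a base58 string by repeated division."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = BASE58[rem] + out
    return out


def derive(secret: str, entity: str, iterations: int = 10000, length: int = 20) -> str:
    """Stretch `secret` with pbkdf2-sha512, salted by `entity`, and encode in base58."""
    raw = hashlib.pbkdf2_hmac("sha512", secret.encode(), entity.encode(), iterations)
    # Keep the trailing symbols: the low-order base58 digits of a 512-bit
    # value are very close to uniformly distributed.
    return base58_encode(raw)[-length:]


if __name__ == "__main__":
    print(f"bits per base58 symbol: {bits_per_symbol(58):.2f}")
    print(f"bits in 20 symbols:     {20 * bits_per_symbol(58):.0f}")
    # Four words drawn from a 2048-word list.
    print(f"bits in 4 of 2048:      {4 * bits_per_symbol(2048):.0f}")
    print(derive("my master secret", "example.com"))
```

The same secret and entity always give the same output, which is what makes the tool stateless.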

Password Rotation

Changing passwords, memorably. pw offers several features for changing the generated password for a given salt and user secret combination. For example, some organizations require users to change their password every 90 days. This is security theater, but nonetheless, users must cooperate. Using a standard password generator, users could append a "2" and a "3" ("4"...) to their password ad infinitum; the problem with this is that it makes some part of the plaintext input known. pw uses a novel method of changing the number of iterations for pbkdf2 based on such inputs. --otp can be used directly to change the number of iterations and thus the generated password. --period and --date can be used together to work around organizations that e.g. require you to change your password every 90 days. --period alone calculates the password based on the current date, while --date allows you to pass an arbitrary date for which to calculate the password.
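One way to picture the rotation idea is the sketch below. It is a guess at the general technique, not pw's implementation: how pw actually maps --otp, --period and --date onto an iteration count is not described here, so the formula (base count, plus offset, plus number of whole periods since 1970-01-01) and every name in the code are hypothetical.

```python
import datetime
import hashlib
from typing import Optional


def period_index(date: datetime.date, period_days: int) -> int:
    """Number of whole periods elapsed since 1970-01-01 (an assumed reference point)."""
    epoch = datetime.date(1970, 1, 1)
    return (date - epoch).days // period_days


def iteration_count(base: int, otp: int = 0,
                    date: Optional[datetime.date] = None,
                    period_days: Optional[int] = None) -> int:
    """Hypothetical mapping from rotation inputs to a pbkdf2 iteration count."""
    count = base + otp
    if period_days is not None:
        when = date or datetime.date.today()
        count += period_index(when, period_days)
    return count


def rotated_key(secret: str, entity: str, date: datetime.date,
                period_days: int, base_iterations: int = 10000) -> bytes:
    """Derive key material that changes once every `period_days` days."""
    iterations = iteration_count(base_iterations, date=date, period_days=period_days)
    return hashlib.pbkdf2_hmac("sha512", secret.encode(), entity.encode(), iterations)
```

Every date inside the same 90 day window yields the same key, and the first day of the next window yields an unrelated one, without any counter being appended to the plaintext input.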

Adding Special Characters

By default the base58 encoding includes only alphanumeric characters. Some organizations require special characters in their passwords. Users can add arbitrary special characters by supplying an argument to --special. By default, --special includes 25 typically allowed special characters.

Salt Recommendations

The salt is of particular importance to generated passwords. A typical suggestion is to use the domain of the entity that the password is for, but the problem is that an attacker who steals that entity's password database may just generate a rainbow table for that domain. So, some personalized version of the salt is recommended. An additional feature (discussed in TODO) would be a global offset for the algorithm, so people could choose e.g. to not use the default offset of 0, but something else for all of their passwords.


pw has support for storing a password in the OS native keyring, via --{get,set,delete}-keyring-password, so that users don't have to type in their password each invocation.

There is also X11 clipboard support on Linux via xclip, so users can pass --clipboard to pw, and it will automatically copy the generated password to the clipboard.

Finally, worth noting is that pw has support for a configuration file, allowing for a few other features, which can be configured via --{get,set,edit,delete}-keyring-config. For example, users can store OTP offsets, special character sets, or even pre-shared key material (config key preshared, a string) to use for generating particular passwords. Currently this config file must be stored in the keyring, so it is not exposed to unencrypted access. Of course, this is not stateless, and pw can function entirely without this configuration, but it may be useful to some.

Linux Piter

Last weekend I attended the Linux Piter conference for the second year in a row. I have thoroughly enjoyed this conference both for the caliber of the speakers (Christoph Hellwig and Lennart Poettering this year) and, even more, for the caliber of the audience. I receive interesting technical questions, suggestions, and insights about my talks when I present there. I would liken it to a conference with a less corporate, more community-focused audience that is highly technical.

Getting to Russia can be complicated for most, but speaking here is interesting in addition to the technical aspects: the program committee puts on a "cultural day" the day after the conference, showing visitors around Saint Petersburg, which is a much nicer speaker gift than a box of chocolates or another USB charger :)

Just how expensive is slub_debug=p?

Recently, I became interested in a debugging option in the Linux kernel, slub_debug=p.


with slub_debug=p

```
Average Half load -j 2 Run (std deviation):
Elapsed Time 44.586 (1.67125)
User Time 73.874 (2.51294)
System Time 7.756 (0.741741)
Percent CPU 182.4 (0.547723)
Context Switches 13880.8 (157.161)
Sleeps 15745.2 (24.3146)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 32.702 (0.400087)
User Time 89.22 (16.3062)
System Time 8.945 (1.37014)
Percent CPU 266.4 (88.5729)
Context Switches 15701 (1929.57)
Sleeps 15722.2 (78.1875)
```

without slub_debug=p

```
Average Half load -j 2 Run (std deviation):
Elapsed Time 40.614 (0.232873)
User Time 69.978 (0.503061)
System Time 5.09 (0.182209)
Percent CPU 184.4 (0.547723)
Context Switches 13596 (121.501)
Sleeps 15740.4 (46.4629)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 30.622 (0.171523)
User Time 86.233 (17.1381)
System Time 5.874 (0.853557)
Percent CPU 270.1 (90.3431)
Context Switches 15370.3 (1875.97)
Sleeps 15777.4 (74.43)
```
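To turn the two sets of numbers into an overhead figure, a few lines of arithmetic are enough. The sketch below assumes that the first set of results above was taken with slub_debug=p enabled (the second is labelled as without), and simply computes the relative difference of the means; it ignores the standard deviations entirely.

```python
def overhead_percent(with_debug: float, without_debug: float) -> float:
    """Relative cost of enabling the option, as a percentage of the baseline."""
    return (with_debug - without_debug) / without_debug * 100.0


# (with slub_debug=p, without slub_debug=p), mean values copied from the results above
RESULTS = {
    "-j 2 elapsed time": (44.586, 40.614),
    "-j 2 system time": (7.756, 5.09),
    "-j 4 elapsed time": (32.702, 30.622),
    "-j 4 system time": (8.945, 5.874),
}

for name, (with_debug, without_debug) in RESULTS.items():
    print(f"{name}: {overhead_percent(with_debug, without_debug):+.1f}%")
```

On these numbers the wall clock cost is somewhere between roughly 7 and 10 percent, while system time grows by a bit more than half.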

Mounting your home directory in LXD

As of LXD stable 2.0.8 and feature release 2.6, LXD has support for various UID and GID map related manipulations. A common question is: "How do I bind-mount my home directory into a container?" Before, the answer was "well, it's complicated but you can do it; it's slightly less complicated if you do it in privileged containers". However, with this feature, you can now do it very easily in unprivileged containers.

First, find out your uid on the host:

$ id
uid=1000(tycho) gid=1000(tycho) groups=1000(tycho),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),112(lpadmin),124(sambashare),129(libvirtd),149(lxd),150(sbuild)

On standard Ubuntu hosts, the uid of the first user is 1000. Now, we need to allow LXD to remap this id; you'll need an additional entry for root to do this:

$ echo 'root:1000:1' | sudo tee -a /etc/subuid /etc/subgid

Now, create a container, and set the idmap up to map both uid and gid 1000 to uid and gid 1000 inside the container.

$ lxc init ubuntu-daily:z zesty
Creating zesty
$ lxc config set zesty raw.idmap 'both 1000 1000'

Next, set up your home directory to be mounted in the container:

$ lxc config device add zesty homedir disk source=/home/tycho path=/home/ubuntu

And leave an insightful message for users of the container:

$ echo 'meshuggah rocks' >> message

Finally, start your container and read the message:

$ lxc start zesty
$ lxc exec zesty cat /home/ubuntu/message
meshuggah rocks

And enjoy the insight offered to you by your home directory :)

LXD networking: lxdbr0 explained

Recently, LXD stopped depending on lxc, and thus moved to using its own bridge, called lxdbr0. lxdbr0 behaves significantly differently than lxcbr0: it is ipv6 link local only by default (i.e. there is no ipv4 or ipv6 subnet configured by default), and only HTTP traffic is proxied over the network. This means that e.g. you can't SSH to your LXD containers with the default configuration of lxdbr0.

The motivation for this change was mostly to avoid picking subnets for users, because this can cause breakage, and to have users pick their own subnets instead. Previously, the script that set up lxcbr0 looked around on the host's network, and picked the first 10.0.*.1 address for the bridge that was available. Of course, in some cases (e.g. networks which weren't visible at the time of bridge creation) this can break routing for users' networks.
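To make the failure mode concrete, here is a rough sketch of "pick the first free 10.0.*" logic using Python's ipaddress module. It is not the actual lxcbr0 setup script: the function name, the /24 size, and the starting point of the scan are all assumptions for illustration.

```python
import ipaddress
from typing import Iterable, Optional


def first_free_subnet(in_use: Iterable[str]) -> Optional[ipaddress.IPv4Network]:
    """Return the first 10.0.N.0/24 that overlaps none of the networks in `in_use`."""
    used = [ipaddress.ip_network(net, strict=False) for net in in_use]
    for third_octet in range(256):
        candidate = ipaddress.ip_network(f"10.0.{third_octet}.0/24")
        if not any(candidate.overlaps(net) for net in used):
            return candidate
    return None  # every candidate collides with something


if __name__ == "__main__":
    # Only networks visible right now can be avoided.
    visible_now = ["10.0.0.0/24", "192.168.1.0/24"]
    print(first_free_subnet(visible_now))
```

The weakness is in the input rather than the loop: a VPN or office network that shows up after the bridge was created was never in the list, so the subnet that looked free can later collide with it.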

So, if you want to have parity with lxcbr0, you'll need to configure the bridge yourself. There are a few ways to do this. For a step by step walkthrough of just configuring the bridge, simply do:

sudo dpkg-reconfigure -p medium lxd

And answer the questions however you like. If you've never configured LXD at all (and e.g. want to use a fancy filesystem like ZFS), try:

sudo lxd init

Which will configure all of LXD (both the filesystem and lxdbr0). Finally, you can edit the file /etc/default/lxd-bridge and then do a:

sudo service lxd-bridge stop && sudo service lxd restart

For feature parity with lxcbr0, you can use something like the following (note the 10.0.4.*, so as not to conflict with lxcbr0):

# Whether to setup a new bridge or use an existing one

# Bridge name
# This is still used even if USE_LXD_BRIDGE is set to false
# set to an empty value to fully disable

# Path to an extra dnsmasq configuration file

# DNS domain for the bridge

# IPv4
## IPv4 address (e.g.

## IPv4 netmask (e.g.

## IPv4 network (e.g.

## IPv4 DHCP range (e.g.,

## IPv4 DHCP number of hosts (e.g. 250)

## NAT IPv4 traffic

# IPv6
## IPv6 address (e.g. 2001:470:b368:4242::1)

## IPv6 CIDR mask (e.g. 64)

## IPv6 network (e.g. 2001:470:b368:4242::/64)

## NAT IPv6 traffic

# Run a minimal HTTP PROXY server

And that's it! That's all you need to do to configure lxdbr0.

Sometimes, though, you don't really want your containers to live on a separate network from the host, because you want to ssh to them directly or something. There are a few ways to accomplish this; the simplest is with macvlan:

lxc profile device set default eth0 parent eth0
lxc profile device set default eth0 nictype macvlan

Another way to do this is by adding another bridge which is bridged onto your main NIC. You'll need to edit your /etc/network/interfaces.d/eth0.cfg to look like this:

# The primary network interface
auto eth0
iface eth0 inet manual # note the manual here

And then add a bridge by creating /etc/network/interfaces.d/containerbr.cfg with the contents:

auto containerbr
iface containerbr inet dhcp
  bridge_ports eth0

Finally, you'll need to change the default lxd profile to use your new bridge:

lxc profile device set default eth0 parent containerbr

Restart the networking service (which, if you do it over ssh, may boot you :), and away you go. If you want some of your containers to be on one bridge, and some on the other, you can use different profiles to accomplish this.

2016 talk

Last week I did this ridiculous thing where I flew around the world in the easterly direction, giving talks at FOSDEM and The staff always do a great job of making talk videos, and this year was no exception.

My talk was on LXD and live migration, a brief history of both as well as a status update and some discussion of future work on both. There were also lots of questions in this talk, so there's a lot of discussion of basic migration questions and inner workings.

Unfortunately, I can't embed it here, so I'll give you a link instead. Also, keep in mind that at the time I was giving this talk I had been up for ~40 hours, so I forgot some English words here and there :)

Using the LXD API from Python

After our recent splash at ODS in Vancouver, it seems that there is a lot of interest in writing some python code to drive LXD to do various things. The first option is to use pylxd, a project maintained by a friend of mine at Canonical named Chuck Short. However, the primary client of this is OpenStack, and thus it is python2. We also don't want to add a lot of dependencies in this module, so we're using raw python urllib and friends, which as you know can sometimes be...painful :)

Another option would be to use python's awesome requests module, which is considerably more user friendly. However, since LXD uses client certificates, it can be a bit challenging to get the basic bits going. Here's a small program that just does some GETs to the API, to see how it might work:

import os.path

import requests

conf_dir = os.path.expanduser('~/.config/lxc')
crt = os.path.join(conf_dir, 'client.crt')
key = os.path.join(conf_dir, 'client.key')

print(requests.get('', verify=False, cert=(crt, key)).text)

which gives me (piped through jq for sanity):

$ python3 | jq .
{
  "type": "sync",
  "status": "Success",
  "status_code": 200,
  "metadata": {
    "api_compat": 1,
    "auth": "trusted",
    "config": {
      "trust-password": true
    },
    "environment": {
      "backing_fs": "ext4",
      "driver": "lxc",
      "kernel_version": "3.19.0-15-generic",
      "lxc_version": "1.1.2",
      "lxd_version": "0.9"
    }
  }
}

It just piggybacks on the certificates generated by the lxc client for now, but it would be great to have some python code that could generate those as well!

Another bit I should point out for people is lxd's --debug flag, which prints out every request it receives and response that it sends. I found this useful while developing the default lxc client, and it will probably be useful to those of you out there who are developing your own clients.

Happy hacking!

Live Migration in LXD

There has been a lot of interest on the various mailing lists as well as internally at Canonical about the state of migration in LXD, so I thought I'd write a bit about the current state of affairs.

Migration in LXD today passes the "Doom demo" test, i.e. it works well enough to reproduce the LXD announcement demo under certain conditions, which I'll cover below. There is still a lot of ongoing work to make CRIU (the underlying migration technology) work with all these configurations, so support will eventually arrive for everything. For now, though, you'll need to use the configuration I describe below.

First, I should note that things currently won't work on a systemd host. Since systemd re-mounts the rootfs as MS_SHARED, lots of things automatically become shared mounts, which confuses CRIU. There are several mailing list threads about ongoing work with respect to shared mounts in CRIU and I expect something to be merged that will resolve the situation shortly, but for now your host machine needs to be a non-systemd host (i.e. trusty or utopic will work just fine, but not vivid).

You'll need to install the daily versions of liblxc and lxd from their respective PPAs on each host:

sudo apt-add-repository -y ppa:ubuntu-lxc/daily
sudo apt-add-repository -y ppa:ubuntu-lxc/lxd-git-master
sudo apt-get update
sudo apt-get install lxd

Also, you'll need to uninstall lxcfs on both hosts:

sudo apt-get remove lxcfs

liblxc currently doesn't support migrating the mount configuration that lxcfs uses, although there is some work on that as well. The overmounting issue has been fixed in lxcfs, so I expect to land some patches in liblxc soon that will make lxcfs work.

Next, you'll want to set a password for your new lxd instance:

lxc config set password foo

You need some images in lxd, which can be acquired easily enough by lxd-images (of course, this only needs to be done on the source host of the migration):

lxd-images import lxc ubuntu trusty amd64 --alias ubuntu

You'll also need to set a few configuration items in lxd. First, the container needs to be privileged, although there is yet more ongoing work to remove this restriction. There are also a few things that CRIU does not support, so we need to set our container config to respect those as well. You can do all of this using lxd's profiles mechanism, that is:

lxc config profile create migratable
lxc config profile edit migratable

And paste the following content in instead of what's there:

name: migratable
config:
  raw.lxc: |
    lxc.console = none
    lxc.cgroup.devices.deny = c 5:1 rwm = = proc:mixed sys:mixed
  security.privileged: "true"
devices:
  eth0:
    nictype: bridged
    parent: lxcbr0
    type: nic

Next, launch your container:

lxc launch ubuntu migratee -p migratable

Finally, add both of your LXDs as non unix-socket remotes (required for now, but not forever):

lxc remote add lxd thishost:8443   # don't use localhost here
lxc remote add lxd2 otherhost:8443 # use a publicly addressable name

Profiles used by a particular container need to be present on both the source of the migration and the sink, so we should copy the profile to the sink as well:

lxc config profile copy migratable lxd2:

And now, you're ready for the magic!

lxc start migratee
lxc move lxd:migratee lxd2:migratee

With luck, you'll have migrated the container to lxd2. Of course, things don't always go right the first time. The full log file for the migration attempts should be available in /var/log/lxd/migratee/migration_{dump|restore}_<timestamp>.log, on the respective host where the dump or restore took place. If you aren't successful in migrating things (or parsing the dump/restore log), feel free to mail lxc-users, and I can help you debug what went wrong.

Happy hacking!

setproctitle() in Linux

While working on LXD, one of the things I occasionally do is submit patches to LXC (e.g. the migration work or other things). In particular, the LXC monitor process (the process that's the parent of init) is fork()ed in the C API call, so the monitor inherits the name of whatever binary made the API call (in our case, LXD). This can be slightly confusing (especially in the case where LXD dies but a process that looks like it is named LXD lives on). Should be easy enough to fix, right? Lots of *nixes seem to have a setproctitle() function to correct this, so we'll just call that!

Linux has no setproctitle(), but lo, there is prctl(), which has a PR_SET_NAME mode that we can use. Done! Except for one small caveat from the man page:

The name can be up to 16 bytes long, and should be null-terminated if it contains fewer bytes.

Yes, you read that right: 16 bytes. That's not useful for a lot of process names, especially something which would be ideal for LXC:

[lxc monitor] /var/lib/lxc container-name

Ok, so how hard can it be to write our own? If you look around on the internet, a lot of people suggest something like strcpy(argv[0], "my-proc-name"). That works, but what happens if your process name is longer than the original? You smash the stack! Try cat /proc/<pid>/environ on the program below:

#include <string.h>
#include <stdio.h>

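int main(int argc, char* argv[]) {
    char buf[1024];
    /* fill buf with 1023 '0' characters plus a terminating NUL */
    memset(buf, '0', sizeof(buf));
    buf[1023] = 0;
    /* deliberately write far past the end of argv[0], into the
     * remaining arguments and the initial environment */
    strncpy(argv[0], buf, sizeof(buf));
    return 0;
}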

If your process name is longer than the original environment, you overwrite something else potentially more useful, which could cause all sorts of nastiness, especially as something that runs as root.

The thing is, the environment isn't necessarily all that useful; it doesn't indicate the current environment, just the initial environment. So we could use that space for the process name, as long as the kernel knew the environment wasn't valid any more. prctl() to the rescue again, we can pass it PR_SET_MM and PR_SET_MM_ENV_{START|END} to update these locations.

Problem solved! Except that we want to do this from inside liblxc, which has no concept of argv. prctl() has no PR_GET_MM calls, so we can't just go the other way with it. We could invent some ugly API where you have to pass it in, but that would require users to either set their argv pointers up front, or carry them around until they needed them, or something similarly ugly. Instead, we steal an idea from the CRIU codebase: we look in /proc/<pid>/stat. This file has (in columns 48-51, if your kernel is new enough) exactly the values you would want from PR_GET_MM_*! Thus, we can use this file to find out, inside of liblxc, where it is safe to put the new proctitle.
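The /proc/<pid>/stat trick is easy to try out from a script. The sketch below is in Python rather than C and is not the liblxc code; it only shows the parsing step. The field names follow proc(5), where fields 48 through 51 are arg_start, arg_end, env_start and env_end, and the function name is made up for the example.

```python
from typing import Dict

# 1-based field numbers as documented in proc(5).
FIELDS = {"arg_start": 48, "arg_end": 49, "env_start": 50, "env_end": 51}


def parse_mm_fields(stat_line: str) -> Dict[str, int]:
    """Extract the argv/environ address range from one /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split only what follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field n lives at rest[n - 3].
    return {name: int(rest[n - 3]) for name, n in FIELDS.items()}


if __name__ == "__main__":
    try:
        with open("/proc/self/stat") as f:
            for name, value in parse_mm_fields(f.read()).items():
                print(f"{name}: {value:#x}")
    except (OSError, IndexError, ValueError) as err:
        # Not Linux, or a kernel too old to report these fields.
        print("could not read the fields:", err)
```

The span from arg_start to env_end is the region a long process title can occupy, provided the kernel is then told (via PR_SET_MM) where the arguments and environment now live.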

Putting it all together, liblxc now has an implementation of setproctitle() that will overwrite your initial environment (but is careful not to overwrite anything else), which can be used to set process titles longer than 16 bytes. Enjoy!