OpenStack – Take 2 – Storage Architecture w/ X-IO HyperISE

Re-architecting things…again.

After getting basic services configured, we ran into a bit of a conundrum where storage was concerned.

OpenStack has two storage methods, block and object (served by Cinder and Swift, respectively, in the native OpenStack architecture). Swift is object-only, and Cinder is basically a block abstraction layer that sits over a backend driver of some sort. Swift is inherently redundant, replicates itself, and can easily use a variety of disk media, but we didn’t have as much of a use case for it as we did for volume storage for persistent VM volumes.

Cinder, on the other hand, has issues. While it can have multiple API servers, a cinder volume (when using LVM) is basically tied to a specific cinder volume server. Even though the SeaMicro has the capability of shifting disks around, cinder doesn’t have any concept of migration. In essence, if a cinder server goes offline, all volumes attached to that server are unavailable until it is restored or rebuilt, even though we could easily have made the data available to another cinder server. This isn’t ideal for the architecture we’re trying to build.

Enter Ceph. Ceph is distributed and redundant (much like Swift), supports both object and block storage, and is a highly available cinder backend option. It’s also incredibly resource intensive in terms of network and storage I/O. Like Swift, it has inherent replication, but the differences are quick to spot. Swift is “eventually consistent”: an object written to Swift will be replicated to other locations (depending on the number of defined replicas) at some point in time, but replication isn’t a blocking operation. Ceph is immediately consistent: when data changes on a Ceph object or block, it is immediately replicated, which means you get double hits on the network for write operations. On top of that, Ceph writes every change twice, first to its journal and then to the actual storage device, as a sequential operation. (btrfs is a copy-on-write filesystem and can avoid the double-write penalty, but it is not considered production ready.) This means that the architecture needs to be considered a good deal more carefully than just randomly assigning disks to our storage servers and creating a large LVM volume on them.

Re-architecting the root disk layout

While testing the SeaMicro and OpenStack, we also had our partners over at X-IO drop off one of their new iSCSI-based HyperISE SSD+disk arrays. While it isn’t technically appropriate for Ceph, it would perform faster than the external JBOD spindles that were attached to the SeaMicro, so we decided to include it in the design. This gave us three tiers of storage:

  1. SeaMicro Internal SSD – ~3TB of full SSD RAID, currently divided into 64 48GB volumes as the servers’ root disks
  2. X-IO HyperISE – ~16TB usable auto-tiered SSD+disk RAID, with 20Gbps aggregate uplink into the SeaMicro
  3. SeaMicro external JBOD – 60 ~3TB individual drives attached to the SeaMicro via eSATA

The first thing that stuck out is that our decision to use the internal SSD as the root drives for our servers was a dumb one. That’s the fastest disk on the system, and we were wasting it on tasks that needed no I/O performance whatsoever. We decided to keep our bootstrap server on the SSD just to save the trouble of rebuilding it, but we unassigned all the other servers’ storage (something the SeaMicro makes quite easy), then wrote a script to delete the other 63 volumes and recycle them back into the RAID pool. This left us with 3080GB free in our internal SSD pool for high performance storage use.

This does leave us with the small problem of not having any root disks for our servers (other than the bootstrap server that we left configured). Since we don’t care about performance, we’re going to just carve up 6 of the 3TB JBOD drives into 11 volumes each and assign those to the other servers. (This leaves us with 54 JBOD drives unused, which will become important later.) To do this, we need to switch the JBOD slots into “volume” mode first: storage set mgmt-mode volume slot 2. We’ll then take the first 6 disks (JBOD2/6 – JBOD2/11 on our chassis) and create 11 volumes on each disk, giving us a total of 66 volumes (3 more than we need, but that’s fine), and finally assign the new volumes to vdisk 0 on each server (other than our bootstrap server):

seasm15k01# storage clear-metadata disk JBOD2/6-81 slot 2
All data on the specified disk(s) will be lost with this operation
Are you sure you want to proceed (yes/no): yes
Please enter ‘yes’ again if you really want to proceed (yes/no): yes
seasm15k01# storage clear-metadata disk JBOD5/0-75 slot 5
All data on the specified disk(s) will be lost with this operation
Are you sure you want to proceed (yes/no): yes
Please enter ‘yes’ again if you really want to proceed (yes/no): yes
seasm15k01# storage create pool 2/rootpool-1 disk JBOD2/6
Pool 2/rootpool-1 created successfully.
seasm15k01# storage create pool 2/rootpool-2 disk JBOD2/7
Pool 2/rootpool-2 created successfully.
seasm15k01# storage create pool 2/rootpool-3 disk JBOD2/8
Pool 2/rootpool-3 created successfully.
seasm15k01# storage create pool 2/rootpool-4 disk JBOD2/9
Pool 2/rootpool-4 created successfully.
seasm15k01# storage create pool 2/rootpool-5 disk JBOD2/10
Pool 2/rootpool-5 created successfully.
seasm15k01# storage create pool 2/rootpool-6 disk JBOD2/11
Pool 2/rootpool-6 created successfully.
seasm15k01# storage create volume-prefix 2/rootpool-1/rootvol size max#11 count 11
seasm15k01# storage create volume-prefix 2/rootpool-2/rootvol size max#11 count 11
seasm15k01# storage create volume-prefix 2/rootpool-3/rootvol size max#11 count 11
seasm15k01# storage create volume-prefix 2/rootpool-4/rootvol size max#11 count 11
seasm15k01# storage create volume-prefix 2/rootpool-5/rootvol size max#11 count 11
seasm15k01# storage create volume-prefix 2/rootpool-6/rootvol size max#11 count 11
seasm15k01(config)# storage assign-range 0/0-31/0,33/0-63/0 0 volume rootvol uuid
seasm15k01(config)# end
seasm15k01# show storage assign brief
server vdisk type id assignment property
0/0 0 volume rootvol(2/rootpool-3/rootvol-10) active RW
1/0 0 volume rootvol(2/rootpool-4/rootvol-9) active RW
2/0 0 volume rootvol(2/rootpool-4/rootvol-8) active RW
3/0 0 volume rootvol(2/rootpool-4/rootvol-7) active RW
4/0 0 volume rootvol(2/rootpool-4/rootvol-6) active RW
5/0 0 volume rootvol(2/rootpool-4/rootvol-5) active RW
6/0 0 volume rootvol(2/rootpool-4/rootvol-4) active RW
7/0 0 volume rootvol(2/rootpool-4/rootvol-3) active RW
8/0 0 volume rootvol(2/rootpool-4/rootvol-2) active RW
9/0 0 volume rootvol(2/rootpool-4/rootvol-1) active RW
10/0 0 volume rootvol(2/rootpool-4/rootvol-0) active RW
11/0 0 volume rootvol(2/rootpool-1/rootvol-10) active RW
12/0 0 volume rootvol(2/rootpool-5/rootvol-9) active RW
13/0 0 volume rootvol(2/rootpool-5/rootvol-4) active RW
14/0 0 volume rootvol(2/rootpool-5/rootvol-5) active RW
15/0 0 volume rootvol(2/rootpool-5/rootvol-6) active RW
16/0 0 volume rootvol(2/rootpool-5/rootvol-7) active RW
17/0 0 volume rootvol(2/rootpool-5/rootvol-0) active RW
18/0 0 volume rootvol(2/rootpool-5/rootvol-1) active RW
19/0 0 volume rootvol(2/rootpool-5/rootvol-2) active RW
20/0 0 volume rootvol(2/rootpool-5/rootvol-3) active RW
21/0 0 volume rootvol(2/rootpool-3/rootvol-1) active RW
22/0 0 volume rootvol(2/rootpool-3/rootvol-0) active RW
23/0 0 volume rootvol(2/rootpool-3/rootvol-3) active RW
24/0 0 volume rootvol(2/rootpool-3/rootvol-2) active RW
25/0 0 volume rootvol(2/rootpool-3/rootvol-5) active RW
26/0 0 volume rootvol(2/rootpool-3/rootvol-4) active RW
27/0 0 volume rootvol(2/rootpool-3/rootvol-7) active RW
28/0 0 volume rootvol(2/rootpool-3/rootvol-6) active RW
29/0 0 volume rootvol(2/rootpool-2/rootvol-10) active RW
30/0 0 volume rootvol(2/rootpool-3/rootvol-8) active RW
31/0 0 volume rootvol(2/rootpool-4/rootvol-10) active RW
32/0 0 volume RAIDVOL(7/RAIDPOOL/RAIDVOL-0) active RW
33/0 0 volume rootvol(2/rootpool-5/rootvol-10) active RW
34/0 0 volume rootvol(2/rootpool-6/rootvol-1) active RW
35/0 0 volume rootvol(2/rootpool-6/rootvol-10) active RW
36/0 0 volume rootvol(2/rootpool-6/rootvol-0) active RW
37/0 0 volume rootvol(2/rootpool-3/rootvol-9) active RW
38/0 0 volume rootvol(2/rootpool-6/rootvol-2) active RW
39/0 0 volume rootvol(2/rootpool-6/rootvol-3) active RW
40/0 0 volume rootvol(2/rootpool-6/rootvol-4) active RW
41/0 0 volume rootvol(2/rootpool-6/rootvol-5) active RW
42/0 0 volume rootvol(2/rootpool-6/rootvol-6) active RW
43/0 0 volume rootvol(2/rootpool-6/rootvol-7) active RW
44/0 0 volume rootvol(2/rootpool-6/rootvol-8) active RW
45/0 0 volume rootvol(2/rootpool-6/rootvol-9) active RW
46/0 0 volume rootvol(2/rootpool-2/rootvol-2) active RW
47/0 0 volume rootvol(2/rootpool-2/rootvol-3) active RW
48/0 0 volume rootvol(2/rootpool-2/rootvol-0) active RW
49/0 0 volume rootvol(2/rootpool-2/rootvol-1) active RW
50/0 0 volume rootvol(2/rootpool-2/rootvol-6) active RW
51/0 0 volume rootvol(2/rootpool-2/rootvol-7) active RW
52/0 0 volume rootvol(2/rootpool-2/rootvol-4) active RW
53/0 0 volume rootvol(2/rootpool-2/rootvol-5) active RW
54/0 0 volume rootvol(2/rootpool-2/rootvol-8) active RW
55/0 0 volume rootvol(2/rootpool-2/rootvol-9) active RW
56/0 0 volume rootvol(2/rootpool-1/rootvol-6) active RW
57/0 0 volume rootvol(2/rootpool-1/rootvol-7) active RW
58/0 0 volume rootvol(2/rootpool-1/rootvol-4) active RW
59/0 0 volume rootvol(2/rootpool-1/rootvol-5) active RW
60/0 0 volume rootvol(2/rootpool-1/rootvol-2) active RW
61/0 0 volume rootvol(2/rootpool-1/rootvol-3) active RW
62/0 0 volume rootvol(2/rootpool-1/rootvol-0) active RW
63/0 0 volume rootvol(2/rootpool-1/rootvol-1) active RW
* 64 entries
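
The pool and volume commands in the session above are repetitive enough to generate. The following is only a sketch of how that might look in Python: the command syntax is copied from the session, while the helper name root_pool_commands and its parameters are hypothetical and not part of any SeaMicro tooling.

```python
# Sketch: emit the SeaMicro CLI lines shown above for the six root pools.
# Slot, disk numbers and volume counts come from the article; the helper
# name and its parameters are hypothetical.

def root_pool_commands(slot=2, first_disk=6, pools=6, vols_per_pool=11):
    """Return the pool and volume-prefix commands as a list of strings."""
    cmds = []
    for n in range(1, pools + 1):
        disk = first_disk + n - 1
        cmds.append(f"storage create pool {slot}/rootpool-{n} disk JBOD{slot}/{disk}")
    for n in range(1, pools + 1):
        cmds.append(
            f"storage create volume-prefix {slot}/rootpool-{n}/rootvol "
            f"size max#{vols_per_pool} count {vols_per_pool}"
        )
    return cmds


commands = root_pool_commands()

# The arithmetic from the text: 6 drives x 11 volumes, 63 servers to cover.
total_volumes = 6 * 11
servers_needing_root = 64 - 1   # every server except the bootstrap node
spare_volumes = total_volumes - servers_needing_root

print("\n".join(commands))
print(f"# {total_volumes} volumes, {spare_volumes} spare")
```

The generated lines would still be pasted into the chassis CLI by hand; nothing here talks to the SeaMicro itself.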

Once done, we’re left with a very similar layout to what we had before, but using the JBOD drives instead. Because we’re running redundant controllers, losing a single JBOD drive costs us at most a controller and 10 compute servers. (Note: in a production environment, we would either be booting from an iSCSI SAN or have more internal RAID resources on the SeaMicro to insulate against drive failures. This layout is something of a quirk of our particular environment.)

Of course, since we just wiped out all of our root drives, we need to rebuild the stack. Again. We’re getting pretty good at this. The only real difference is that we’ll change our DHCP configuration to distribute the 3 controller and 3 storage servers across the 6 JBOD drives (1 controller/storage and 9-10 compute resources per drive). To make that work, we’ll use the following assignments:

  • controller-0 – Server 0/0 (rootpool-3)
  • controller-1 – Server 1/0 (rootpool-4)
  • controller-2 – Server 11/0 (rootpool-1)
  • storage-0 – Server 12/0 (rootpool-5)
  • storage-1 – Server 29/0 (rootpool-2)
  • storage-2 – Server 34/0 (rootpool-6)

When changing the DHCP config file, we’ll simply swap the compute entry’s MAC address with the appropriate controller or storage MAC address and keep the same IP assignments as our previous build. No other changes are necessary.
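
One way to double-check the role-to-pool assignments listed above is to group the output of “show storage assign brief” by pool. This is a rough sketch, assuming the line format shown earlier in this post; the regular expression and the function name servers_by_pool are illustrative only, and the sample is a handful of lines copied from that output.

```python
import re
from collections import defaultdict

# Roles as listed in the article; every other server is a compute node.
ROLES = {"0/0": "controller-0", "1/0": "controller-1", "11/0": "controller-2",
         "12/0": "storage-0", "29/0": "storage-1", "34/0": "storage-2"}

# Matches e.g. "0/0 0 volume rootvol(2/rootpool-3/rootvol-10) active RW"
LINE = re.compile(r"^(\d+/\d+) \d+ volume \w+\((\d+)/([\w-]+)/[\w-]+\)")


def servers_by_pool(text):
    """Group server IDs by the pool that backs their vdisk."""
    pools = defaultdict(list)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            server, _slot, pool = match.groups()
            pools[pool].append(server)
    return dict(pools)


sample = """\
0/0 0 volume rootvol(2/rootpool-3/rootvol-10) active RW
1/0 0 volume rootvol(2/rootpool-4/rootvol-9) active RW
2/0 0 volume rootvol(2/rootpool-4/rootvol-8) active RW
11/0 0 volume rootvol(2/rootpool-1/rootvol-10) active RW
12/0 0 volume rootvol(2/rootpool-5/rootvol-9) active RW
29/0 0 volume rootvol(2/rootpool-2/rootvol-10) active RW
34/0 0 volume rootvol(2/rootpool-6/rootvol-1) active RW
"""

pools = servers_by_pool(sample)
# Map each controller/storage role to the pool (and so the drive) it boots from.
role_pools = {ROLES[s]: pool for pool, servers in pools.items()
              for s in servers if s in ROLES}
print(role_pools)
```

If two roles ever landed in the same pool, the set of values in role_pools would have fewer than six entries, which is the condition we are trying to avoid.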

(On the plus side, getting our OpenStack deployment back to this point was fairly painless by following the previous writeups in this blog.)

With HAProxy, MariaDB+Galera, RabbitMQ and Keystone re-deployed, we can circle back to how to get the storage component of OpenStack in place.

Network Connectivity to the storage servers

We’ve already assigned and preseeded our storage servers, but now we need to decide how to configure them. Because we’re settling on a storage backend that has replication requirements, as well as iSCSI connectivity, we need to have more than one storage network available. On our SeaMicro, we’ve already assigned VLAN 100 as our management VLAN (this VLAN is internal to the SeaMicro). We’ll now create VLAN 150 (Client Storage) as the storage network between the clients and the Ceph servers, as well as VLAN 200 (Storage Backend) as the iSCSI and replication network. On the storage servers themselves, we’ve already assigned NIC 0 as the management NIC. We’re going to assign NICs 1-3 as the client storage network (3Gbps aggregate throughput per server) and NICs 4-7 as the backend network (4Gbps aggregate throughput per server). This allows for the higher overhead of the replication and iSCSI networks to have more bandwidth available. (Our iSCSI array has already been connected to interfaces TenGig 0/1 and TenGig 7/1 in VLAN 200)

seasm15k01# conf
Enter configuration commands, one per line. End with CNTL/Z.
seasm15k01(config)# switch system-vlan 150
seasm15k01(config)# switch system-vlan 200
seasm15k01(config)# server id 12/0
seasm15k01(config-id-12/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# exit
seasm15k01(config-id-12/0)# exit
seasm15k01(config)# server id 29/0
seasm15k01(config-id-29/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# exit
seasm15k01(config-id-29/0)# exit
seasm15k01(config)# server id 34/0
seasm15k01(config-id-34/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# end
seasm15k01# show vlan
Default Vlan : 0
Number of User Configured Vlans : 3
Number of Default Vlans : 1
Flags : T = Tagged U = Untagged
: I = Incomplete bond state because of difference in the bond member configuration.
: D = interface configured for untagged traffic drop
: P = Vlan pass through enabled
Vlan Port Members
—– ————————————————————————————————————-
100 srv 0/0/0 (U ), srv 1/0/0 (U ), srv 17/0/0 (U ), srv 16/0/0 (U ), srv 32/0/0 (U )
srv 33/0/0 (U ), srv 49/0/0 (U ), srv 48/0/0 (U ), srv 2/0/0 (U ), srv 3/0/0 (U )
srv 19/0/0 (U ), srv 18/0/0 (U ), srv 34/0/0 (U ), srv 35/0/0 (U ), srv 51/0/0 (U )
srv 50/0/0 (U ), srv 6/0/0 (U ), srv 7/0/0 (U ), srv 23/0/0 (U ), srv 22/0/0 (U )
srv 38/0/0 (U ), srv 39/0/0 (U ), srv 55/0/0 (U ), srv 54/0/0 (U ), srv 10/0/0 (U )
srv 11/0/0 (U ), srv 27/0/0 (U ), srv 26/0/0 (U ), srv 42/0/0 (U ), srv 43/0/0 (U )
srv 59/0/0 (U ), srv 58/0/0 (U ), srv 14/0/0 (U ), srv 15/0/0 (U ), srv 31/0/0 (U )
srv 30/0/0 (U ), srv 46/0/0 (U ), srv 47/0/0 (U ), srv 63/0/0 (U ), srv 62/0/0 (U )
srv 12/0/0 (U ), srv 13/0/0 (U ), srv 29/0/0 (U ), srv 28/0/0 (U ), srv 44/0/0 (U )
srv 45/0/0 (U ), srv 61/0/0 (U ), srv 60/0/0 (U ), srv 8/0/0 (U ), srv 9/0/0 (U )
srv 25/0/0 (U ), srv 24/0/0 (U ), srv 40/0/0 (U ), srv 41/0/0 (U ), srv 57/0/0 (U )
srv 56/0/0 (U ), srv 4/0/0 (U ), srv 5/0/0 (U ), srv 21/0/0 (U ), srv 20/0/0 (U )
srv 36/0/0 (U ), srv 37/0/0 (U ), srv 53/0/0 (U ), srv 52/0/0 (U )
150 srv 34/0/1 (U ), srv 34/0/2 (U ), srv 34/0/3 (U ), srv 12/0/1 (U ), srv 12/0/2 (U )
srv 12/0/3 (U ), srv 29/0/1 (U ), srv 29/0/2 (U ), srv 29/0/3 (U )
200 te 0/1 (U ), te 7/1 (U ), srv 34/0/4 (U ), srv 34/0/5 (U ), srv 34/0/6 (U )
srv 34/0/7 (U ), srv 12/0/4 (U ), srv 12/0/5 (U ), srv 12/0/6 (U ), srv 12/0/7 (U )
srv 29/0/4 (U ), srv 29/0/5 (U ), srv 29/0/6 (U ), srv 29/0/7 (U )
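
The per-NIC VLAN session above repeats the same fourteen lines for each storage server, so it could be generated instead of typed. A minimal sketch, assuming the command syntax shown in that session; the function name nic_vlan_commands and the VLAN_BY_NIC mapping are hypothetical helpers.

```python
# Sketch: build the config-mode lines used above to place NICs 1-3 in
# VLAN 150 and NICs 4-7 in VLAN 200 on each storage server.

def nic_vlan_commands(servers, vlan_by_nic):
    """Return the CLI lines for the given servers and NIC-to-VLAN map."""
    cmds = [f"switch system-vlan {vlan}" for vlan in sorted(set(vlan_by_nic.values()))]
    for server in servers:
        cmds.append(f"server id {server}")
        for nic in sorted(vlan_by_nic):
            cmds.append(f"nic {nic}")
            cmds.append(f"untagged-vlan {vlan_by_nic[nic]}")
        cmds.append("exit")  # leave the NIC context
        cmds.append("exit")  # leave the server context
    return cmds


VLAN_BY_NIC = {1: 150, 2: 150, 3: 150, 4: 200, 5: 200, 6: 200, 7: 200}
commands = nic_vlan_commands(["12/0", "29/0", "34/0"], VLAN_BY_NIC)
print("\n".join(commands))
```

The output is meant to be entered after “conf” on the chassis, as in the session above.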

With our NICs in the correct VLANs, now we need to decide how to use them. Because we’re using iSCSI on the backend, we could use MPIO there, which is typically the recommended approach for iSCSI. However, that doesn’t help us much with the client side network or replication. Since our iSCSI array is presenting 4 MPIO targets already, we have distinct enough flows that we can take advantage of LACP if it is configured with a Layer 3+4 hashing algorithm. On top of that, an awesome feature of the SeaMicro is auto-LACP between its internal fabric and the server cards. All we need to do is configure Linux for LACP NIC bonding (mode 4) with the right hash and we’re good to go. Let’s start by installing the interface bonding software with “apt-get install ifenslave”.

We then add the bonding module to the system:

echo "bonding" >> /etc/modules
modprobe bonding

Then add the following to /etc/network/interfaces:

auto eth1
iface eth1 inet manual
bond-master bond0

auto eth2
iface eth2 inet manual
bond-master bond0

auto eth3
iface eth3 inet manual
bond-master bond0

auto bond0
iface bond0 inet static
bond-mode 4
bond-miimon 100
bond-lacp-rate 0
bond-slaves eth1 eth2 eth3
bond-xmit-hash-policy layer3+4

auto eth4
iface eth4 inet manual
bond-master bond1

auto eth5
iface eth5 inet manual
bond-master bond1

auto eth6
iface eth6 inet manual
bond-master bond1

auto eth7
iface eth7 inet manual
bond-master bond1

auto bond1
iface bond1 inet static
bond-mode 4
bond-miimon 100
bond-lacp-rate 0
bond-slaves eth4 eth5 eth6 eth7
bond-xmit-hash-policy layer3+4

At this point, the easiest way to get the bonded interfaces active is to just reboot the server. They should be functional when it restarts.

root@storage-0:~# ip addr
10: bond0:  mtu 1500 qdisc noqueue state UP group default
link/ether 00:22:99:ec:05:01 brd ff:ff:ff:ff:ff:ff
inet brd scope global bond0
valid_lft forever preferred_lft forever
inet6 fe80::222:99ff:feec:501/64 scope link
valid_lft forever preferred_lft forever
11: bond1:  mtu 1500 qdisc noqueue state UP group default
link/ether 00:22:99:ec:05:06 brd ff:ff:ff:ff:ff:ff
inet brd scope global bond1
valid_lft forever preferred_lft forever
inet6 fe80::222:99ff:feec:506/64 scope link
valid_lft forever preferred_lft forever

And on the SeaMicro, they can be seen with its LACP info command:

seasm15k01# show lacp info server 12/0
Server ID 12/0
Bond ID Slave ID Slave-State Actor-state Partner-state VLAN-id Bond-MAC
320 4 bundled 3d 3d 200 00:22:99:ec:05:06
320 5 bundled 3d 3d 200 00:22:99:ec:05:06
320 6 bundled 3d 3d 200 00:22:99:ec:05:06
320 7 bundled 3d 3d 200 00:22:99:ec:05:06
322 1 bundled 3d 3d 150 00:22:99:ec:05:03
322 2 bundled 3d 3d 150 00:22:99:ec:05:03
322 3 bundled 3d 3d 150 00:22:99:ec:05:03
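
To illustrate why a layer 3+4 hash lets separate flows land on different bond members, here is a small sketch. It uses the formula given in older Linux bonding documentation for the layer3+4 policy; newer kernels compute the hash differently, so treat this as an illustration of the idea and not as what any given kernel does. The IP addresses and source ports are hypothetical documentation-range values, and 3260 is the standard iSCSI port.

```python
import ipaddress


def layer34_slave(src_ip, src_port, dst_ip, dst_port, slave_count):
    """Pick a bond member using the older documented layer3+4 formula:
    ((src port XOR dst port) XOR ((src IP XOR dst IP) AND 0xffff)) mod count
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ ((src ^ dst) & 0xFFFF)) % slave_count


# Hypothetical initiator and four target portals, one session each.
initiator = "192.0.2.10"
targets = ["192.0.2.101", "192.0.2.102", "192.0.2.103", "192.0.2.104"]

flows = [layer34_slave(initiator, 40000 + i, target, 3260, 4)
         for i, target in enumerate(targets)]
for target, member in zip(targets, flows):
    print(f"{initiator} -> {target}:3260 uses bond member {member}")
```

Because both addresses and ports feed the hash, each iSCSI session is pinned to one member but different sessions can use different members. A single flow never exceeds the speed of one link.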


Now that we have a bundled uplink, we can bring up the X-IO ISE array. Because the X-IO presents 4 targets, we don’t need more than 1 session per server on the storage server side; 4 targets is enough to utilize our full LACP link. We’ll start by installing the required utilities with “apt-get install multipath-tools open-iscsi”.

We’re not bothering with internal security right now, so we’ll leave off any CHAP authentication for the iSCSI sessions, making them fairly easy to discover and log in to:

root@storage-0:~# iscsiadm -m discovery -t st -p,1,1
root@storage-0:~# iscsiadm -m discovery -t st -p,1,1
root@storage-0:~# iscsiadm -m node -L all

Once login has been confirmed and the drives are visible on the system, set iSCSI to automatically connect on start in /etc/iscsi/iscsid.conf:

node.startup = automatic

On our system, we now have drives /dev/sdb-e visible in dmesg. We need to quickly create a basic /etc/multipath.conf file:

defaults {
    user_friendly_names yes
}
blacklist {
    devnode "sda$"
}
blacklist_exceptions {
    device {
        vendor "XIOTECH"
    }
}
devices {
    device {
        vendor "XIOTECH"
        product "ISE3400"
        path_grouping_policy multibus
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        path_checker tur
        path_selector "round-robin 0"
        no_path_retry 12
        rr_min_io 1
    }
}

Once the config file is in place, restart multipath with “service multipath-tools restart” and the multipath device should be available for configuration:

root@storage-0:~# multipath -ll
mpath0 (36001f932004f0000052a000200000000) dm-0 XIOTECH,ISE3400
size=5.1T features=’1 queue_if_no_path’ hwhandler=’0′ wp=rw
`-+- policy=’round-robin 0′ prio=1 status=active
|- 34:0:0:0 sde 8:64 active ready running
|- 32:0:0:0 sdb 8:16 active ready running
|- 33:0:0:0 sdc 8:32 active ready running
`- 35:0:0:0 sdd 8:48 active ready running
root@storage-0:~# sgdisk /dev/mapper/mpath0 -p
Creating new GPT entries.
Disk /dev/mapper/mpath0: 10848567296 sectors, 5.1 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 74C6457B-38C6-41A3-8EC6-AC1A70018AC1
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 10848567262
Partitions will be aligned on 2048-sector boundaries
Total free space is 10848567229 sectors (5.1 TiB)

Once this is all confirmed, we’ll do the same on the other 2 servers and their iSCSI exported volumes.

Leveraging MPIO and the SeaMicro Fabric

Because of the SeaMicro’s abstraction layer between the server cards and storage on the chassis, a unique ability exists to present the same disk to a server via multiple ASIC paths. Since we’re already using MPIO for the iSCSI connection, it’s fairly trivial to increase performance between the storage servers and the SSD-based disk on the SeaMicro chassis. Vdisk 0 is already in use by the root volume, so we’ll start with vdisk 1 and assign our volumes to the servers. We have a specific use in mind for the SSD volume which we’ll get into in the next article, but for now we’re going to create 3 500GB volumes and get them attached.

seasm15k01# storage create volume-prefix 7/RAIDPOOL/Journal size 500 count 3
seasm15k01# conf
Enter configuration commands, one per line. End with CNTL/Z.
seasm15k01(config)# storage assign 12/0 1 volume 7/RAIDPOOL/Journal-0
seasm15k01(config)# storage assign 12/0 2 volume 7/RAIDPOOL/Journal-0
seasm15k01(config)# storage assign 12/0 3 volume 7/RAIDPOOL/Journal-0
seasm15k01(config)# storage assign 12/0 5 volume 7/RAIDPOOL/Journal-0
seasm15k01(config)# storage assign 29/0 5 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 29/0 3 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 29/0 2 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 29/0 1 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 34/0 1 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 2 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 3 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 5 volume 7/RAIDPOOL/Journal-2

Once the storage assignment is complete, we can move to the storage server and create a quick script to pull the serial number from the drive. (Note: The SeaMicro appears to present the same UUID for all volumes, so we cannot use UUID blacklisting in this case. Instead, we’re blacklisting “devnode sda$” in the multipath config.) The script lives at /root/getDiskSerialNum:

#!/bin/sh
/usr/bin/sginfo -s "$1" | cut -d"'" -f2 | tr -d '\n'

We can use the serial number pulled from the above script to determine the serial of the multipath presented disk, and then modify our config to whitelist it in /etc/multipath.conf:

defaults {
    user_friendly_names yes
}
blacklist {
    devnode "sda$"
    device {
        vendor "ATA"
        product "*"
    }
}
blacklist_exceptions {
    device {
        vendor "XIOTECH"
    }
    device {
        vendor "ATA"
        product "SMvDt6S3HSgAOfPp"
    }
}
devices {
    device {
        vendor "XIOTECH"
        product "ISE3400"
        path_grouping_policy multibus
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        path_checker tur
        path_selector "round-robin 0"
        no_path_retry 12
        rr_min_io 1
    }
    device {
        vendor "ATA"
        user_friendly_names yes
        rr_min_io 1
        no_path_retry queue
        rr_weight uniform
        path_grouping_policy group_by_serial=1
        getuid_callout "/root/getDiskSerialNum /dev/%n"
    }
}

Checking our active multipath links now shows both the iSCSI multipath and the direct-attached SSD multipath devices available:

root@storage-0:~# multipath -ll
mpath1 (35000c5001feb99f0) dm-2 ATA,SMvDt6S3HSgAOfPp
size=500G features=’0′ hwhandler=’0′ wp=rw
|-+- policy=’round-robin 0′ prio=1 status=active
| `- 4:0:0:0 sdb 8:16 active ready running
|-+- policy=’round-robin 0′ prio=1 status=enabled
| `- 8:0:0:0 sdc 8:32 active ready running
|-+- policy=’round-robin 0′ prio=1 status=enabled
| `- 12:0:0:0 sdd 8:48 active ready running
`-+- policy=’round-robin 0′ prio=1 status=enabled
`- 20:0:0:0 sde 8:64 active ready running
mpath0 (36001f932004f0000052a000200000000) dm-0 XIOTECH,ISE3400
size=5.1T features=’1 queue_if_no_path’ hwhandler=’0′ wp=rw
`-+- policy=’round-robin 0′ prio=1 status=active
|- 34:0:0:0 sdi 8:128 active ready running
|- 33:0:0:0 sdf 8:80 active ready running
|- 32:0:0:0 sdg 8:96 active ready running
`- 35:0:0:0 sdh 8:112 active ready running

This leaves us with two high speed volumes available.

Just a Bunch Of Disks

The last piece in our storage architecture is slower but high capacity spindle storage. We left most of the JBOD disks unallocated on the SeaMicro chassis. Now we’re going to create full-disk volumes out of those and assign 18 of them to each of the storage servers. A quirk of the SeaMicro: pools cannot span multiple disks unless they are in a RAID configuration, so we will need to create 54 JBOD pools first, then assign a single volume to each pool. Fortunately this process is fairly easy to script. Once this process is complete, we’ll end up with a volume layout as follows:

seasm15k01# show storage volume brief
A = Assigned, U = Unassigned, L = Linear, S = Stripe
slot pool name volume name prov. size actual size attr
2 jbodpool-1 jbodvol-1 2794GB 2794.00GB AL
2 jbodpool-2 jbodvol-2 2794GB 2794.00GB AL
2 jbodpool-3 jbodvol-3 2794GB 2794.00GB AL
2 jbodpool-4 jbodvol-4 2794GB 2794.00GB AL
2 jbodpool-5 jbodvol-5 2794GB 2794.00GB AL
2 jbodpool-6 jbodvol-6 2794GB 2794.00GB AL
2 jbodpool-7 jbodvol-7 2794GB 2794.00GB AL
2 jbodpool-8 jbodvol-8 2794GB 2794.00GB AL
2 jbodpool-9 jbodvol-9 2794GB 2794.00GB AL
2 jbodpool-10 jbodvol-10 2794GB 2794.00GB AL
2 jbodpool-11 jbodvol-11 2794GB 2794.00GB AL
2 jbodpool-12 jbodvol-12 2794GB 2794.00GB AL
2 jbodpool-13 jbodvol-13 2794GB 2794.00GB AL
2 jbodpool-14 jbodvol-14 2794GB 2794.00GB AL
2 jbodpool-15 jbodvol-15 2794GB 2794.00GB AL
2 jbodpool-16 jbodvol-16 2794GB 2794.00GB AL
2 jbodpool-17 jbodvol-17 2794GB 2794.00GB AL
2 jbodpool-18 jbodvol-18 2794GB 2794.00GB AL
2 jbodpool-19 jbodvol-19 2794GB 2794.00GB AL
2 jbodpool-20 jbodvol-20 2794GB 2794.00GB AL
2 jbodpool-21 jbodvol-21 2794GB 2794.00GB AL
2 jbodpool-22 jbodvol-22 2794GB 2794.00GB AL
2 jbodpool-23 jbodvol-23 2794GB 2794.00GB AL
2 jbodpool-24 jbodvol-24 2794GB 2794.00GB AL
2 rootpool-1 rootvol-0 254GB 254.00GB AL
2 rootpool-1 rootvol-1 254GB 254.00GB AL
2 rootpool-1 rootvol-2 254GB 254.00GB AL
2 rootpool-1 rootvol-3 254GB 254.00GB AL
2 rootpool-1 rootvol-4 254GB 254.00GB AL
2 rootpool-1 rootvol-5 254GB 254.00GB AL
2 rootpool-1 rootvol-6 254GB 254.00GB AL
2 rootpool-1 rootvol-7 254GB 254.00GB AL
2 rootpool-1 rootvol-8 254GB 254.00GB UL
2 rootpool-1 rootvol-9 254GB 254.00GB UL
2 rootpool-1 rootvol-10 254GB 254.00GB AL
2 rootpool-2 rootvol-0 254GB 254.00GB AL
2 rootpool-2 rootvol-1 254GB 254.00GB AL
2 rootpool-2 rootvol-2 254GB 254.00GB AL
2 rootpool-2 rootvol-3 254GB 254.00GB AL
2 rootpool-2 rootvol-4 254GB 254.00GB AL
2 rootpool-2 rootvol-5 254GB 254.00GB AL
2 rootpool-2 rootvol-6 254GB 254.00GB AL
2 rootpool-2 rootvol-7 254GB 254.00GB AL
2 rootpool-2 rootvol-8 254GB 254.00GB AL
2 rootpool-2 rootvol-9 254GB 254.00GB AL
2 rootpool-2 rootvol-10 254GB 254.00GB AL
2 rootpool-3 rootvol-0 254GB 254.00GB AL
2 rootpool-3 rootvol-1 254GB 254.00GB AL
2 rootpool-3 rootvol-2 254GB 254.00GB AL
2 rootpool-3 rootvol-3 254GB 254.00GB AL
2 rootpool-3 rootvol-4 254GB 254.00GB AL
2 rootpool-3 rootvol-5 254GB 254.00GB AL
2 rootpool-3 rootvol-6 254GB 254.00GB AL
2 rootpool-3 rootvol-7 254GB 254.00GB AL
2 rootpool-3 rootvol-8 254GB 254.00GB AL
2 rootpool-3 rootvol-9 254GB 254.00GB AL
2 rootpool-3 rootvol-10 254GB 254.00GB AL
2 rootpool-4 rootvol-0 254GB 254.00GB AL
2 rootpool-4 rootvol-1 254GB 254.00GB AL
2 rootpool-4 rootvol-2 254GB 254.00GB AL
2 rootpool-4 rootvol-3 254GB 254.00GB AL
2 rootpool-4 rootvol-4 254GB 254.00GB AL
2 rootpool-4 rootvol-5 254GB 254.00GB AL
2 rootpool-4 rootvol-6 254GB 254.00GB AL
2 rootpool-4 rootvol-7 254GB 254.00GB AL
2 rootpool-4 rootvol-8 254GB 254.00GB AL
2 rootpool-4 rootvol-9 254GB 254.00GB AL
2 rootpool-4 rootvol-10 254GB 254.00GB AL
2 rootpool-5 rootvol-0 254GB 254.00GB AL
2 rootpool-5 rootvol-1 254GB 254.00GB AL
2 rootpool-5 rootvol-2 254GB 254.00GB AL
2 rootpool-5 rootvol-3 254GB 254.00GB AL
2 rootpool-5 rootvol-4 254GB 254.00GB AL
2 rootpool-5 rootvol-5 254GB 254.00GB AL
2 rootpool-5 rootvol-6 254GB 254.00GB AL
2 rootpool-5 rootvol-7 254GB 254.00GB AL
2 rootpool-5 rootvol-8 254GB 254.00GB UL
2 rootpool-5 rootvol-9 254GB 254.00GB AL
2 rootpool-5 rootvol-10 254GB 254.00GB AL
2 rootpool-6 rootvol-0 254GB 254.00GB AL
2 rootpool-6 rootvol-1 254GB 254.00GB AL
2 rootpool-6 rootvol-2 254GB 254.00GB AL
2 rootpool-6 rootvol-3 254GB 254.00GB AL
2 rootpool-6 rootvol-4 254GB 254.00GB AL
2 rootpool-6 rootvol-5 254GB 254.00GB AL
2 rootpool-6 rootvol-6 254GB 254.00GB AL
2 rootpool-6 rootvol-7 254GB 254.00GB AL
2 rootpool-6 rootvol-8 254GB 254.00GB AL
2 rootpool-6 rootvol-9 254GB 254.00GB AL
2 rootpool-6 rootvol-10 254GB 254.00GB AL
5 jbodpool-25 jbodvol-25 2794GB 2794.00GB AL
5 jbodpool-26 jbodvol-26 2794GB 2794.00GB AL
5 jbodpool-27 jbodvol-27 2794GB 2794.00GB AL
5 jbodpool-28 jbodvol-28 2794GB 2794.00GB AL
5 jbodpool-29 jbodvol-29 2794GB 2794.00GB AL
5 jbodpool-30 jbodvol-30 2794GB 2794.00GB AL
5 jbodpool-31 jbodvol-31 2794GB 2794.00GB AL
5 jbodpool-32 jbodvol-32 2794GB 2794.00GB AL
5 jbodpool-33 jbodvol-33 2794GB 2794.00GB AL
5 jbodpool-34 jbodvol-34 2794GB 2794.00GB AL
5 jbodpool-35 jbodvol-35 2794GB 2794.00GB AL
5 jbodpool-36 jbodvol-36 2794GB 2794.00GB AL
5 jbodpool-37 jbodvol-37 2794GB 2794.00GB AL
5 jbodpool-38 jbodvol-38 2794GB 2794.00GB AL
5 jbodpool-39 jbodvol-39 2794GB 2794.00GB AL
5 jbodpool-40 jbodvol-40 2794GB 2794.00GB AL
5 jbodpool-41 jbodvol-41 2794GB 2794.00GB AL
5 jbodpool-42 jbodvol-42 2794GB 2794.00GB AL
5 jbodpool-43 jbodvol-43 2794GB 2794.00GB AL
5 jbodpool-44 jbodvol-44 2794GB 2794.00GB AL
5 jbodpool-45 jbodvol-45 2794GB 2794.00GB AL
5 jbodpool-46 jbodvol-46 2794GB 2794.00GB AL
5 jbodpool-47 jbodvol-47 2794GB 2794.00GB AL
5 jbodpool-48 jbodvol-48 2794GB 2794.00GB AL
5 jbodpool-49 jbodvol-49 2794GB 2794.00GB AL
5 jbodpool-50 jbodvol-50 2794GB 2794.00GB AL
5 jbodpool-51 jbodvol-51 2794GB 2794.00GB AL
5 jbodpool-52 jbodvol-52 2794GB 2794.00GB AL
5 jbodpool-53 jbodvol-53 2794GB 2794.00GB AL
5 jbodpool-54 jbodvol-54 2794GB 2794.00GB AL
7 RAIDPOOL Journal-0 500GB 500.00GB AL
7 RAIDPOOL Journal-1 500GB 500.00GB AL
7 RAIDPOOL Journal-2 500GB 500.00GB AL
* 124 entries

Once that’s done, we can assign the disks from these pools to our storage servers with a single command:

seasm15k01(config)# storage assign-range 12/0,29/0,34/0 4,6-22 volume jbodvol uuid
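
The vdisk list “4,6-22” in that command looks odd until it is expanded next to the vdisks already in use. A quick sketch of the arithmetic follows; expand_ranges is a hypothetical helper, and the numbers are taken from the commands earlier in this post.

```python
def expand_ranges(spec):
    """Expand a spec such as "4,6-22" into a list of integers."""
    values = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            values.extend(range(int(low), int(high) + 1))
        else:
            values.append(int(part))
    return values


root_vdisk = [0]                # root volume, assigned earlier
journal_vdisks = [1, 2, 3, 5]   # four paths to the SSD Journal volume
jbod_vdisks = expand_ranges("4,6-22")
servers = ["12/0", "29/0", "34/0"]

total_jbod_volumes = len(jbod_vdisks) * len(servers)
print(len(jbod_vdisks), "JBOD vdisks per server,", total_jbod_volumes, "in total")
```

The list skips 5 because that vdisk already carries one of the Journal paths, and the remaining slots give 18 per server, or 54 across the three storage servers.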

Now on our three storage servers, we have the following drives available:

root@storage-0:~# cat /proc/partitions
major minor #blocks name
8 0 266338304 sda
8 1 232951808 sda1
8 2 1 sda2
8 5 33383424 sda5
8 16 2929721344 sdb
8 32 2929721344 sdc
8 48 524288000 sdd
8 64 2929721344 sde
8 80 2929721344 sdf
8 96 524288000 sdg
8 112 2929721344 sdh
8 128 2929721344 sdi
8 144 524288000 sdj
8 160 2929721344 sdk
8 176 2929721344 sdl
8 192 2929721344 sdm
8 208 2929721344 sdn
8 224 2929721344 sdo
8 240 524288000 sdp
65 0 2929721344 sdq
65 16 2929721344 sdr
65 32 2929721344 sds
65 48 2929721344 sdt
65 64 2929721344 sdu
65 80 2929721344 sdv
65 96 2929721344 sdw
252 0 524288000 dm-0
65 144 5424283648 sdz
65 160 5424283648 sdaa
65 112 5424283648 sdx
65 128 5424283648 sdy
252 1 5424283648 dm-1

You can see our partitioned root drive on sda, the directly attached SSD at dm-0, and the iSCSI target at dm-1. The rest of the available devices are the simple JBOD volumes.
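With this many devices it helps to group them by size. The following is a small sketch that parses /proc/partitions-style text and buckets device names by block count; the sample is a trimmed copy of the listing above, and the function name group_by_size is ours.

```python
from collections import defaultdict

# Trimmed sample of the /proc/partitions listing shown above.
SAMPLE = """\
major minor #blocks name
8 0 266338304 sda
8 1 232951808 sda1
8 16 2929721344 sdb
8 32 2929721344 sdc
8 48 524288000 sdd
252 0 524288000 dm-0
"""


def group_by_size(text):
    """Map block count -> list of device names, skipping the header line."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        _major, _minor, blocks, name = fields
        groups[int(blocks)].append(name)
    return dict(groups)


for blocks, names in sorted(group_by_size(SAMPLE).items()):
    # /proc/partitions reports sizes in 1 KiB blocks.
    print(f"{blocks / 1024**2:8.1f} GiB  {' '.join(names)}")
```

The 2929721344-block devices work out to 2794 GiB and the 524288000-block devices to 500 GiB, matching the JBOD and journal volume sizes reported by the chassis.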

Now we’re ready to actually do something with all of this disk.

OpenStack – Take 2 – The Keystone Identity Service


Keystone is, more or less, the glue that ties OpenStack together.  It’s required for any of the individual services to be installed and function together.

Fortunately for us, keystone is basically just a REST API, so it’s very easy to make redundant and there isn’t a whole lot to it.

We’ll start by installing keystone and the python mysql client on all three controller nodes:

apt-get install keystone python-mysqldb

Once that’s done, we need a base configuration for keystone.  There are a lot of default options installed in the config file, but we really only care (for now) about giving it an admin token, and connecting it to our DB and Message queue.  Also, because we’re colocating our load balancers on the controller nodes (something which clearly wouldn’t be done in production), we’re going to shift the ports that keystone is binding to so the real ports are available to HAProxy.  (The default ports are being incremented by 10000 for this.)  Everything else will be left at its default value.

/etc/keystone/keystone.conf: (Note – Commented out default config is left in the file, but not included here)

# The port number which the admin service listens on.
# The port number which the public service listens on.
# RabbitMQ HA cluster host:port pairs. (list value)
connection = mysql://keystone:openstack@
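To make the port shift concrete, here is a minimal sketch of the arithmetic. It assumes keystone's stock ports (5000 for the public API, 35357 for the admin API); the dictionary and function names are ours, for illustration only.

```python
# Keystone's stock listener ports (assumption: public 5000, admin 35357).
DEFAULT_PORTS = {"public": 5000, "admin": 35357}
PORT_SHIFT = 10000  # offset described in the text


def shifted_ports(defaults, shift=PORT_SHIFT):
    """Return the ports keystone itself binds to once HAProxy owns the defaults."""
    return {name: port + shift for name, port in defaults.items()}


backend_ports = shifted_ports(DEFAULT_PORTS)
for name, port in DEFAULT_PORTS.items():
    print(f"{name}: HAProxy listens on {port}, "
          f"keystone binds to {backend_ports[name]}")
```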

We’ll then copy this configuration file to /etc/keystone/keystone.conf on each of the other controller nodes.  (There is no node specific information in our configuration, but if any explicit IP binds or similar host specific statements are made, obviously that needs to be changed from node to node)
Now that we have the config files in place, we can create the DB and DB user, then get the keystone service started and its DB tables populated.  (We’ll be doing all of this from the first controller node)

root@controller-0:~# mysql -u root -popenstack -h
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 122381
Server version: 5.5.38-MariaDB-1~trusty-wsrep-log binary distribution, wsrep_25.10.r3997

Copyright (c) 2000, 2014, Oracle, Monty Program Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> create database keystone;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> GRANT ALL ON keystone.* to keystone@'%' IDENTIFIED BY 'openstack';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> exit

root@controller-0:~# service keystone restart
keystone stop/waiting
keystone start/running, process 21313
root@controller-0:~# keystone-manage db_sync

Once the initial DB has been populated, we want to copy the SSL certificates from the first keystone node to the other two.  Copy the entire contents of /etc/keystone/ssl to the other two nodes, and make sure the directories and their files are chowned to keystone:keystone.

We can then restart the keystone service on the 2nd and 3rd nodes with “service keystone restart” and we should have our keystone nodes listening on the custom ports and ready for HAProxy configuration.  Because this API is accessible from both the public and management interfaces, we’ll need to have HAProxy listen on multiple networks this time:

/etc/haproxy/haproxy.cfg – (Note: We’re adding this to the bottom of the file)

listen keystone_admin_private
balance source
option tcpka
option httpchk
maxconn 10000
server controller-0 check inter 2000 rise 2 fall 5
server controller-1 check inter 2000 rise 2 fall 5
server controller-2 check inter 2000 rise 2 fall 5
listen keystone_api_private
balance source
option tcpka
option httpchk
maxconn 10000
server controller-0 check inter 2000 rise 2 fall 5
server controller-1 check inter 2000 rise 2 fall 5
server controller-2 check inter 2000 rise 2 fall 5
listen keystone_admin_public
balance source
option tcpka
option httpchk
maxconn 10000
server controller-0 check inter 2000 rise 2 fall 5
server controller-1 check inter 2000 rise 2 fall 5
server controller-2 check inter 2000 rise 2 fall 5
listen keystone_api_public
balance source
option tcpka
option httpchk
maxconn 10000
server controller-0 check inter 2000 rise 2 fall 5
server controller-1 check inter 2000 rise 2 fall 5
server controller-2 check inter 2000 rise 2 fall 5
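Since the four stanzas differ only in name, listen address and port, they are easy to generate. Here is a hedged sketch of doing so; every address below comes from the 192.0.2.0/24 documentation range and is hypothetical, as are the VIP and the helper name listen_block. The directives are the ones used in the config above.

```python
# Hypothetical management addresses for the three controllers.
CONTROLLERS = {
    "controller-0": "192.0.2.10",
    "controller-1": "192.0.2.11",
    "controller-2": "192.0.2.12",
}


def listen_block(name, vip, port, backends, backend_port):
    """Render one HAProxy listen stanza with the options used above."""
    lines = [
        f"listen {name}",
        f"    bind {vip}:{port}",
        "    balance source",
        "    option tcpka",
        "    option httpchk",
        "    maxconn 10000",
    ]
    for host, addr in backends.items():
        lines.append(
            f"    server {host} {addr}:{backend_port} "
            "check inter 2000 rise 2 fall 5"
        )
    return "\n".join(lines)


# Hypothetical VIP 192.0.2.5; 5000 -> 15000 follows the port shift above.
block = listen_block("keystone_api_private", "192.0.2.5", 5000,
                     CONTROLLERS, 15000)
print(block)
```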

We then reload haproxy on all 3 nodes with “service haproxy reload” and then we can check the haproxy statistics page to determine whether the new keystone services are up and detected by the load balancer:


The last step for keystone is creating the users, services and endpoints that tie everything together.  There are numerous keystone deployment scripts available online, so we picked one and modified it for our uses.  One thing of note is that we need to differentiate between the public and admin URLs, and the internal URLs which run on our management network.

We’ve left the object and networking (quantum/neutron) services out for now, as we’ll be addressing those in a later article.  Since we know we’re going to be using Glance and Cinder as the image and volume services, we created those now.

A copy of our keystone deployment script can be found here:

We also need to add keystone credentials to the servers we’ll be issuing keystone and other OpenStack commands from.  We’ll place this file on all three controllers for now:


export OS_USERNAME=admin
export OS_PASSWORD=openstack
export OS_TENANT_NAME=admin
export OS_AUTH_URL=

We’ll load that into our environment now and on next login with the following commands:

source ~/.openstack_credentials
echo ". ~/.openstack_credentials" >> ~/.profile
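If a command later complains about authentication, the first thing to rule out is a shell where the credentials file was never sourced. A minimal sketch of that check, using the four variable names from the file above (the function name missing_credentials is ours):

```python
import os

# The variables exported by the credentials file above.
REQUIRED = ("OS_USERNAME", "OS_PASSWORD", "OS_TENANT_NAME", "OS_AUTH_URL")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_credentials()
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all OpenStack credentials are set")
```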

Now we can confirm that our keystone users, services and endpoints are in place and ready to go:

root@controller-0:~# keystone user-list
|                id                |  name  | enabled |       email       |
| c8c5f82c2368445398ef75bd209dded1 | admin  |   True  | |
| 9b6461349428440b9008cc17bdf9aaf5 | cinder |   True  | |
| e6793ca5c3c94918be70010b58653428 | glance |   True  | |
| d2c0dbfba9ae405d8f803df878afb505 |  nova  |   True  |  |
| a0dd7577399a49008a1e5aa35be56065 |  test  |   True  |  |
root@controller-0:~# keystone service-list
|                id                |   name   |   type   |        description        |
| 42a44c9b7b374302af2d2b998376665e |  cinder  |  volume  |  OpenStack Volume Service |
| 163f251efd474459aaf6edb0e766e53d |   ec2    |   ec2    |   OpenStack EC2 service   |
| d734fbd95ec04ade9b680010511d716a |  glance  |  image   |  OpenStack Image Service  |
| c9d70e0f77ed42b1a8b96c51eadb6d20 | keystone | identity |     OpenStack Identity    |
| 8cf8f2b113054a7cb29203e3c31a3ef4 |   nova   | compute  | OpenStack Compute Service |
root@controller-0:~# keystone endpoint-list
|                id                |   region  |                  publicurl                  |              internalurl               |                   adminurl                  |            service_id            |
| 127fa3f046c142c5a83122c68ac9ae79 | regionOne |         |          |         | d734fbd95ec04ade9b680010511d716a |
| 23c84da682614d4db00a8fccba5550b7 | regionOne |$(tenant_id)s |$(tenant_id)s |$(tenant_id)s | 8cf8f2b113054a7cb29203e3c31a3ef4 |
| 29ce6f0c712b499d9537e861d40846d5 | regionOne |  |  |  | 163f251efd474459aaf6edb0e766e53d |
| a4a8e4d6fb9548b4b59ef335581c907b | regionOne |       |       |      | c9d70e0f77ed42b1a8b96c51eadb6d20 |
| f7e663f609a440a9985e30efc1a2c7cf | regionOne |$(tenant_id)s |$(tenant_id)s |$(tenant_id)s | 42a44c9b7b374302af2d2b998376665e |

With keystone up and running, we’ll take a little detour to talk about storage in the next article.

OpenStack – Take 2 – HA Database and Message Queue


With our 3 controller servers running on a bare bones Ubuntu install, it’s time to start getting the services required for OpenStack up and running. Before any of the actual cloud services can be installed, we need a shared database and message queue. Because our design goal here is both redundancy and load balancing, we can’t just install a basic MySQL package.

A little research showed that there are 3 options for a MySQL compatible multi-master cluster: MySQL w/ a wsrep patch, MariaDB, or Percona XtraDB Cluster. In all cases Galera is used as the actual clustering mechanism. For ease of installation (and because it has solid Ubuntu package support), we decided to use MariaDB + Galera for the clustered database.

MariaDB Cluster install

MariaDB has a handy tool for building the apt commands for a chosen mirror, so we’ll use that to build a set of commands to install MariaDB and Galera. We’ll also install rsync at this time if it isn’t already installed, since we’ll be using it for Galera cluster sync.

apt-get install software-properties-common
apt-key adv --recv-keys --keyserver hkp:// 0xcbcb082a1bb943db
add-apt-repository 'deb trusty main'
apt-get update
apt-get install mariadb-galera-server galera rsync

MariaDB will prompt for a root password during install. For ease of use, we’re using “openstack” as the password for all service accounts.

The configuration is shared on all 3 nodes, so first we’ll build the configuration on the controller-0 node, then copy it over to the others.



# Galera Provider Configuration

# Galera Cluster Configuration

# Galera Synchronization Configuration

# Galera Node Configuration

We’ll also comment out the “bind-address =” line from /etc/mysql/my.cnf so that the DB server will listen on our specified address instead of just localhost. (Note: While you can bind address to for all interfaces, this will interfere with our colocated HAProxy unless mysql is moved to a port other than 3306. Since we want to keep mysql on the default port, we’ll just bind it to the internal IP for now)

The last bit of prep needed is to copy the contents of /etc/mysql/debian.cnf from the first node to the other two, since Ubuntu/Debian uses a system maintenance account whose credentials are randomly generated at install time.

Now it’s time to bring up and test the cluster!

First, stop all nodes with “service mysql stop”. Double check that mysqld is actually stopped; the first time we did this it was still running on the 2nd and 3rd nodes for some reason.
Initialize the cluster on the primary node: “service mysql start --wsrep-new-cluster”
Once that’s up and running, start mysql on the next two nodes and wait for it to fully come up: “service mysql start”

Now, it’s time to test the database replication and multi-master config. We’ll do this by writing data to the first node, confirming it replicated onto the second, writing data to the second, confirming on the third, etc…

root@controller-0:~# mysql -u root -popenstack -e "CREATE DATABASE clustertest;"
root@controller-0:~# mysql -u root -popenstack -e "CREATE TABLE clustertest.table1 ( id INT NOT NULL AUTO_INCREMENT, col1 VARCHAR(128), col2 INT, col3 VARCHAR(50), PRIMARY KEY(id) );"
root@controller-0:~# mysql -u root -popenstack -e "INSERT INTO clustertest.table1 (col1, col2, col3) VALUES ('col1_value1',25,'col3_value1');"

root@controller-1:/etc/mysql/conf.d# mysql -u root -popenstack -e "SELECT * FROM clustertest.table1;"

| id | col1 | col2 | col3 |
| 1 | col1_value1 | 25 | col3_value1 |

root@controller-1:/etc/mysql/conf.d# mysql -u root -popenstack -e "INSERT INTO clustertest.table1 (col1,col2,col3) VALUES ('col1_value2',5000,'col3_value2');"

root@controller-2:/etc/mysql/conf.d# mysql -u root -popenstack -e "SELECT * FROM clustertest.table1;"

| id | col1 | col2 | col3 |
| 1 | col1_value1 | 25 | col3_value1 |
| 4 | col1_value2 | 5000 | col3_value2 |

root@controller-2:/etc/mysql/conf.d# mysql -u root -popenstack -e "INSERT INTO clustertest.table1 (col1,col2,col3) VALUES ('col1_value3',99999,'col3_value3');"

root@controller-0:/etc/mysql/conf.d# mysql -u root -popenstack -e "SELECT * FROM clustertest.table1;"

| id | col1 | col2 | col3 |
| 1 | col1_value1 | 25 | col3_value1 |
| 4 | col1_value2 | 5000 | col3_value2 |
| 5 | col1_value3 | 99999 | col3_value3 |
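Note that the ids come out as 1, 4, 5 rather than 1, 2, 3. That is expected on a Galera cluster: by default (wsrep_auto_increment_control) each node gets its own auto_increment_offset and an auto_increment_increment equal to the cluster size, so inserts on different nodes can never collide. The sketch below models how the next id is picked; the offsets passed in are hypothetical, chosen only to reproduce the values seen above, since we did not record which node held which offset.

```python
def next_auto_increment(current_max, increment, offset):
    """Smallest value greater than current_max of the form offset + n * increment."""
    if current_max < offset:
        return offset
    steps = (current_max - offset) // increment + 1
    return offset + steps * increment


# Three-node cluster, so the increment is 3. Offsets are assumptions.
second_id = next_auto_increment(current_max=1, increment=3, offset=1)
third_id = next_auto_increment(current_max=second_id, increment=3, offset=2)
print(second_id, third_id)
```

The practical takeaway is that gaps in AUTO_INCREMENT columns are normal here and should not be read as lost rows.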

HAProxy + MariaDB integration

Now that we have a clustered database, it’s time to tie it to the HAProxy so that requests can be load balanced, and the MySQL server taken offline if necessary. We don’t need MySQL to be externally accessible, so we’ll only be configuring this service on the management ( network. Let’s start by adding an haproxy user on each of the 3 database servers. (This user should probably be restricted to the load balancer hosts in a more secure production environment) This user has no password and cannot actually connect to any databases, it simply allows a connection to the mysql server for a health check. (Note that you must use the “CREATE USER” DDL in order for changes to be replicated. If you insert directly into the mysql tables, the changes will NOT be replicated to the rest of the cluster)

root@controller-0:~# mysql -u root -popenstack
MariaDB [mysql]> CREATE USER 'haproxy'@'%';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit

Once done, we’ll add (on each node) the health check to each of the 3 HAProxy servers in /etc/haproxy/haproxy.cfg: (Configured to listen on the internal VIP presented by keepalived)

listen galera
balance source
mode tcp
option tcpka
option mysql-check user haproxy
server controller-0 check weight 1
server controller-1 check weight 1
server controller-2 check weight 1

(Note: If you forgot to tell MySQL to explicitly bind to the above IP addresses, haproxy will fail since MySQL is already bound to 3306. If you need multiple IP addresses, you can move MySQL to port 3307 and proxy to that instead)

We’ll reload the haproxy config with “service haproxy reload” and see if we can connect to MySQL on the VIP:

root@controller-0:/var/log# mysql -h -u root -popenstack
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 226
Server version: 5.5.38-MariaDB-1~trusty-wsrep-log binary distribution, wsrep_25.10.r3997

Copyright (c) 2000, 2014, Oracle, Monty Program Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]>

We can also see our MySQL services being monitored by the load balancer statistics page:


RabbitMQ Message Queue

The message queue of choice for most OpenStack installations appears to be RabbitMQ. Since it supports clustering natively, and the OpenStack services will load balance to the message queue without any additional proxy, this step goes fairly quickly.

Install the RabbitMQ server on all the controller nodes with “apt-get install ntp rabbitmq-server“. Once the install is completed, stop the rabbitmq service on all nodes (“service rabbitmq-server stop“) and confirm that the service isn’t running.

RabbitMQ wants to do hostname based clustering, and since we’re not running any DNS in this environment, we need to add the following lines to /etc/hosts: controller-0 controller-1 controller-2

Additionally, these names must match the actual hostnames of your servers. If your servers still have the default “ubuntu” hostname, clustering will fail.
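A quick way to catch the hostname mismatch before attempting to cluster is sketched below. The addresses are hypothetical (documentation range 192.0.2.0/24) and the helper names are ours.

```python
import socket

# Hypothetical management addresses; substitute the real ones.
HOSTS = {
    "controller-0": "192.0.2.10",
    "controller-1": "192.0.2.11",
    "controller-2": "192.0.2.12",
}


def hosts_lines(mapping):
    """Render /etc/hosts lines in 'address name' form."""
    return [f"{addr} {name}" for name, addr in mapping.items()]


def hostname_is_clusterable(hostname, mapping):
    """True if the local hostname is one of the names the cluster will use."""
    return hostname in mapping


print("\n".join(hosts_lines(HOSTS)))
print("local hostname ok:",
      hostname_is_clusterable(socket.gethostname(), HOSTS))
```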

In order for the servers to cluster, the Erlang cookie needs to be the same on all nodes. Copy the file /var/lib/rabbitmq/.erlang.cookie from the first node to the other two nodes. (Since we don’t have root login enabled, we used scp to the ubuntu account, then moved it into place locally. You’ll need to make sure the file has user and group “rabbitmq” after copying.) Now we can restart the service on all nodes with “service rabbitmq-server start”.

With these steps in place, clustering rabbitmq is simple. We’ll start on the 2nd node. (controller-1):

root@controller-1:/var/lib/rabbitmq# service rabbitmq-server stop
* Stopping message broker rabbitmq-server [ OK ]
root@controller-1:/var/lib/rabbitmq# service rabbitmq-server start
* Starting message broker rabbitmq-server [ OK ]
root@controller-1:/var/lib/rabbitmq# cd
root@controller-1:~# rabbitmqctl stop_app
Stopping node ‘rabbit@controller-1’ …
root@controller-1:~# rabbitmqctl join_cluster rabbit@controller-0
Clustering node ‘rabbit@controller-1’ with ‘rabbit@controller-0’ …
root@controller-1:~# rabbitmqctl start_app
Starting node ‘rabbit@controller-1’ …
root@controller-1:~# rabbitmqctl cluster_status
Cluster status of node ‘rabbit@controller-1’ …

Now that the first two nodes are clustered, we’ll do the same with the 3rd:

root@controller-2:/var/lib/rabbitmq# service rabbitmq-server stop
* Stopping message broker rabbitmq-server [ OK ]
root@controller-2:/var/lib/rabbitmq# service rabbitmq-server start
* Starting message broker rabbitmq-server [ OK ]
root@controller-2:/var/lib/rabbitmq# rabbitmqctl stop_app
Stopping node ‘rabbit@controller-2’ …
root@controller-2:/var/lib/rabbitmq# rabbitmqctl join_cluster rabbit@controller-0
Clustering node ‘rabbit@controller-2’ with ‘rabbit@controller-0’ …
root@controller-2:/var/lib/rabbitmq# rabbitmqctl start_app
Starting node ‘rabbit@controller-2’ …
root@controller-2:/var/lib/rabbitmq# rabbitmqctl cluster_status
Cluster status of node ‘rabbit@controller-2’ …

And we’re done! With the clustered DB and MQ building blocks in place, we’re ready to start installing the actual OpenStack services on our controllers. It’s worth noting that RabbitMQ has a concept of disc and RAM nodes, and a node’s type can be changed at any time. Nothing in the OpenStack documentation suggests that a RAM node is required for performance, but if scale becomes a concern, it would be worth looking into.

OpenStack – Take 2 – High Availability for all services


A major part of our design decision for rebuilding our OpenStack from scratch was availability, closer to what one would see in production. This is one of the things Juju got right, installing most services using HAProxy so that clients could connect to any of the servers running the requested service. What it lacked was load balancers and external HA access.

Since we’re doing 3 controller nodes, and basically converging all services onto those 3 nodes, we’ll do that with the load balancers as well. We need both internal and externally accessible load balanced and redundant servers to take into account both customer APIs and internal/management access from the compute nodes.

Virtual IPs for Inside and Outside use

Since HAProxy doesn’t handle redundancy for itself, we’ll need a VIP that clients can point to for access. Keepalived handles that nicely with a VRRP-like Virtual IP that uses gratuitous arp for rapid failover. (While keepalived calls it VRRP, it is not compatible with other VRRP devices and cannot be joined to a VRRP group with, say, a Cisco or Juniper router) We’ll need both internal and external IPs and to ensure that both are capable of failing over. Technically, these don’t need to fail over together, since they function independently as far as HAProxy is concerned, so we don’t need to do interface or VRRP group tracking which greatly simplifies the configuration.

Our management DHCP range starts at, with IPs below that reserved for things like this. Since this is the main controller IP, we’ll assign it for a nice round number. On the “external” network, we have available, and we want to keep most of it for floating IPs. We’ll use for the controllers’ real and virtual IPs on the outside network.

First we need to allow the virtual IP address to bind to the NICs:

echo "net.ipv4.ip_nonlocal_bind=1" >> /etc/sysctl.conf
sysctl -p

Then we’ll install the haproxy and keepalived package: “apt-get install keepalived haproxy”

We’ll need to create a keepalived config for each controller:


global_defs {
router_id controller-0
vrrp_script haproxy {
script "killall -0 haproxy"
interval 2
weight 2
vrrp_instance 1 {
virtual_router_id 1
advert_int 1
priority 100
state MASTER
interface eth0
virtual_ipaddress { dev eth0
track_script {

vrrp_instance 2 {
virtual_router_id 2
advert_int 1
priority 100
state MASTER
interface eth1
virtual_ipaddress { dev eth1
track_script {

All that needs to be changed from one host to another is the router_id and possibly the interface name. Start up keepalived: “service keepalived restart” and you should see a Virtual IP available on the first node that is restarted.

root@controller-0:~# ip -4 addr show
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default
inet scope host lo
valid_lft forever preferred_lft forever
2: eth0: mtu 9000 qdisc pfifo_fast state UP group default qlen 1000
inet brd scope global eth0
valid_lft forever preferred_lft forever
inet scope global eth0
valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
inet brd scope global eth1
valid_lft forever preferred_lft forever
inet scope global eth1
valid_lft forever preferred_lft forever

Rebooting the first node should immediately cause the VIP /32 to move to one of the other servers. (Our recommendation is to have each node with a different priority so there is no ambiguity as to which node is the master, and to set the backups into initial state BACKUP)
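To illustrate why distinct priorities matter, here is a simplified sketch of how the master is chosen. It assumes keepalived's behaviour for a positive track_script weight (the weight is added to the base priority while the check succeeds) and ignores tie-breaking entirely; the priorities 101/100/99 are hypothetical values, not the ones from our config.

```python
def effective_priority(base, haproxy_ok, weight=2):
    """Base priority plus the track_script weight while the check passes."""
    return base + weight if haproxy_ok else base


def elect_master(nodes):
    """nodes maps name -> (base_priority, haproxy_ok); highest priority wins."""
    return max(nodes, key=lambda name: effective_priority(*nodes[name]))


# Hypothetical priorities spaced closer together than the script weight.
nodes = {
    "controller-0": (101, True),
    "controller-1": (100, True),
    "controller-2": (99, True),
}
healthy_master = elect_master(nodes)

# haproxy dies on controller-0: 101 now loses to controller-1's 100 + 2.
nodes["controller-0"] = (101, False)
failed_master = elect_master(nodes)
print(healthy_master, "->", failed_master)
```

One design consequence worth noting: the gap between neighbouring priorities has to be smaller than the script weight, otherwise a node with a dead haproxy would still hold the VIP.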

HAProxy and Statistics

Now that we have a redundant VirtualIP, we need to get the load balancer working. We installed HAProxy in the previous step, and it nearly works out of the box. We’re going to add an externally facing statistics webpage to the default config however so we can see what it’s doing.


global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
user haproxy
group haproxy
stats socket /var/lib/haproxy/status
maxconn 4000

defaults
log global
mode http
option httplog
option dontlognull
contimeout 5000
clitimeout 50000
srvtimeout 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http

listen stats
mode http
stats enable
stats uri /stats
stats realm HAProxy Statistics
stats auth admin:openstack

For each node, we’ll want to change the listen IP to be the “outside” IP of that particular server. No services are defined for now; we’ll get to that in the next step.

The last step here is to edit “/etc/default/haproxy” so it says “ENABLED=1”. Once that’s done, activate the proxy with “service haproxy restart” and you should be able to reach the proxy’s statistics pages on the addresses that were defined.
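The statistics page can also be read by scripts: HAProxy serves the same data as CSV when ";csv" is appended to the stats URI. Below is a minimal sketch of parsing that export. The sample rows are made up and trimmed to three columns (the real export has many more), and the function name server_status is ours.

```python
import csv
import io

# Made-up, trimmed sample of the CSV export; the real header starts with
# "# pxname,svname," and continues with many more columns.
SAMPLE = """\
# pxname,svname,status
galera,controller-0,UP
galera,controller-1,UP
galera,controller-2,DOWN
galera,BACKEND,UP
"""


def server_status(csv_text):
    """Map (proxy, server) -> status, skipping FRONTEND/BACKEND summary rows."""
    text = csv_text.lstrip("# ")
    reader = csv.DictReader(io.StringIO(text))
    return {
        (row["pxname"], row["svname"]): row["status"]
        for row in reader
        if row["svname"] not in ("FRONTEND", "BACKEND")
    }


for (proxy, server), status in server_status(SAMPLE).items():
    print(f"{proxy}/{server}: {status}")
```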


To be continued – we’ll set up a clustered database and message queue for use with HAProxy in the next step.

OpenStack – Take 2 – Doing It The Hard Way

This is Part 2 of an ongoing series of testing AMD’s SeaMicro 15000 chassis with OpenStack. (Note: Part 1 is delayed while it gets vetted by AMD for technical accuracy)

In the first part, we configured the SeaMicro for a basic OpenStack deployment using MaaS and Juju for bootstrapping and orchestration. This works fine for the purposes of showing what OpenStack looks like and what it can do (quickly) on the SeaMicro hardware. That’s great and all, but Juju is only easily configurable within the context of the options provided with its specific service charms. Because it fully manages the configuration, any manual configuration added (for example, the console proxy in our previous example) will get wiped out if any Juju changes (to a relationship for example) are made.

For production purposes, there are other more powerful orchestration suites out there (Puppet, Chef, SaltStack, etc) but because they are highly configurable they also require a significantly larger amount of manual intervention and scripting. This makes sense, of course, since the reason Juju is as rapid and easy as it is is exactly the same reason that it is of questionable value in a production deployment. To that end, we’re going to deploy OpenStack on the SeaMicro chassis the hard way: from scratch.

The Architecture

If you’re going to do something, you should do it right. We decided to take a slightly different approach to the design of the stack than the Juju based reference architecture did. If we were creating a “cloud-in-a-box” for our own datacenter, we would be attempting to optimize for scalability, availability and performance. This means highly available control services capable of running one or more additional SeaMicro chassis, as well as optimizing the number of compute resources available within the chassis. While the bootstrap/install node isn’t required to be a blade on the chassis for this exercise, we’re going to leave it in place with the expectation that it would be used in future testing as an orchestration control node. Based on that, we have 63 server cards available to us. The Juju deployment used 9 of them, with most control services being non-redundant. (The only redundant services were the Cinder block services)

The Juju based architecture had the following layout for the 29 configured servers:

  • Machine 1 – Juju control node
  • Machine 2 – MySQL database node
  • Machine 3 – RabbitMQ AMQP node
  • Machine 4 – Keystone identity service, Glance image service, and Horizon OpenStack dashboard service
  • Machine 5 – Nova cloud controller
  • Machine 6 – Neutron network gateway
  • Machines 7-9 – Cinder block storage nodes
  • Machines 10-29 – Nova compute nodes

The goal of this experiment would be to end up with the following layout:

  • Machines 1-3 – Controller nodes (Keystone, Nova controller, MySQL, RabbitMQ and Neutron)
  • Machines 4-6 – Storage nodes (Cinder, Ceph, Swift, etc… Specific backends TBD)
  • Machines 7-63 – Nova compute nodes

As part of this re-deployment we’ll be running through each of the OpenStack control services, their ability to be made highly available, and what configuration is required to do so.
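The target layout is simple enough to express as a lookup, which is handy later when generating per-role configuration. This is only a sketch of the split described above; the function name and role labels are ours.

```python
from collections import Counter


def target_layout(total_cards=63):
    """Assign a role to each machine number per the target layout."""
    roles = {}
    for machine in range(1, total_cards + 1):
        if machine <= 3:
            roles[machine] = "controller"
        elif machine <= 6:
            roles[machine] = "storage"
        else:
            roles[machine] = "compute"
    return roles


counts = Counter(target_layout().values())
for role in ("controller", "storage", "compute"):
    print(role, counts[role])
```

That leaves 57 of the 63 available cards for compute, against 20 in the Juju layout.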

A world without MaaS

Since we’re eliminating the MaaS bootstrap server, we’ll need to replace the services it provides. NAT and routing are configured in the kernel still, so the cloud servers will still have the same internet access as before. The services we’ll need to replace are:

  • TFTP – using tftpd-hpa for serving the PXE boot images to the servers on initial install
  • DHCP – using isc-dhcpd for address assignment and TFTP server options
  • DNS – MaaS uses dnsmasq to cache/serve local DNS. We’ll just be replacing this with a DHCP option for the upstream DNS servers for simplicity’s sake

Configuring the bootstrap process

Ubuntu makes this easy, with a simple apt-get install tftpd-hpa. Because the services conflict, this step will uninstall MaaS and start the tftpd-hpa service.
On our MaaS server, isc-dhcp was already installed, so we just needed to create a /etc/dhcpd.conf file. Since we want to have “static” IP addresses, we’ll create fixed leases for every server rather than an actual DHCP pool.

First, we need all the MAC addresses of the down servers (everything except our MaaS server):

seasm15k01# show server summary | include /0 | include Intel | exclude up

This output is easily run through shell, perl, or whatever your text-parsing language of choice is to produce a DHCP config that looks something like the following:

subnet netmask {
filename "pxelinux.0";
option subnet-mask;
option broadcast-address;
option domain-name "local";
option routers;
option interface-mtu 9000; # Need this for Neutron GRE

host controller-0 {
hardware ethernet 00:22:99:ec:00:00;


Restart the DHCP server process and we now have a functioning DHCP environment directing the servers to our TFTP server.
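For reference, generating the fixed leases can be sketched in a few lines. The input list here is hypothetical: only controller-0's MAC comes from the config above, while the second MAC and both addresses (documentation range 192.0.2.0/24) are placeholders, and we assume the chassis output has already been reduced to name/MAC/address triples.

```python
# Hypothetical name / MAC / address triples; only the first MAC is real.
SERVERS = [
    ("controller-0", "00:22:99:ec:00:00", "192.0.2.10"),
    ("controller-1", "00:22:99:ec:01:00", "192.0.2.11"),
]


def host_stanza(name, mac, address):
    """Render one ISC dhcpd host block with a fixed address."""
    return (
        f"host {name} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {address};\n"
        f"}}"
    )


print("\n\n".join(host_stanza(*server) for server in SERVERS))
```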

Fun with preseeds

The basic Ubuntu netboot image loads an interactive installer, which was fine when we configured the MaaS server, but nobody wants to manually enter information for 63 servers for installation. By passing some preseed information into the kernel, we can have it download and run the installer unattended; it just needs some hints as to what it should be doing.

This took a lot of trial and error, even just to get a good environment with nothing but the base system tools and ssh server. (Which is all we want for the controller nodes for now)

The PXE defaults config we settled on looks something like this:

default install
label install
menu label ^Install
menu default
kernel linux
append initrd=ubuntu-installer/amd64/initrd.gz console=ttyS0,9600n8 auto=true priority=critical
interface=auto netcfg/dhcp_timeout=120 preseed/url= -- quiet
ipappend 2

This tells the kernel to use the SeaMicro serial console (mandatory in this environment) and interface eth0, to disable hardware-based interface renaming, and to fetch a preseed file hosted on the MaaS server for install.

The preseed file partitions the disk based on a standard single-partition layout (plus swap), creates a “ubuntu” user with a long useless password, disables root login, and copies an authorized keys file for ssh use into /home/ubuntu/.ssh, allowing ssh login post-install. Since we’re not using a fastpath installer like MaaS does, this takes a bit of time, but it’s hands off once the preseed file is created. Once we get to installing compute nodes later on, we’ll probably find a way to use the preseed file to script the installation and configuration of the OpenStack components on the compute nodes, but since the controller nodes will be installed manually (for this experiment), there isn’t any reason to add much beyond basic ssh access in the initial preseed.

One note: Ubuntu insists on overwriting the preseed target’s /etc/network/interfaces with what it uses to bootstrap the network. Because of udev, this may not be accurate and causes the server to come up without a network. A hacky solution that seems to work is to download an interfaces file, then chattr +i /target/etc/network/interfaces at the end of the preseed so the installer cannot overwrite it. Additionally, udev was exhibiting some strange behavior, renaming only eth0 and eth3 to the old ethX nomenclature on most servers, but leaving the other 5 interfaces as the newer pXpX style. This unfortunately seemed to be somewhat inconsistent, with some servers acknowledging the udev persistent net rules file to rename all interfaces to ethX, and others ignoring it. Since eth0 was renamed in all cases, we decided to ignore this issue for the time being since this isn’t intended to be a production preseed environment.

To be continued…. Once all these nodes install and boot.

Adventures in OpenStack with a SeaMicro 15000


The Chassis

Semaphore was recently approached by a friend of the company who was now working at AMD and happened to have one of their recently acquired SeaMicro 15000 high density compute chassis available for some short term testing. They offered to loan it to us for a bit in hopes we’d resell them with a particular focus on OpenStack as an enterprise private cloud since it requires some expertise to get up and running. Having never used OpenStack, and even having little experience with AWS and other cloud services, naturally we said, “Of course!”

A week or so later, AMD pulled up to the loading dock with 3 pallets of hardware. They’d brought a SeaMicro 15k chassis, and 2 large disk arrays (we decided to only set up one of the arrays given limited cooling in our lab area). A lot of heavy lifting later, we had both devices on the lab bench, powered on, and ready to start deployment.

After power on, we got the following specs from the chassis and disk array:

  • 64 Server Cards, each with a single Intel Xeon E3-1265L V3 chip (4 cores, 8 threads), and 32GB of DDR3 RAM
  • 1 Storage Card with 8 480GB SSD drives in a RAID configuration
  • 2 MX (controller) cards, each with 2 10Gig Ethernet SFP+ ports
  • 60 3TB JBOD drives, attached in 2 paths to the chassis via an e-SATA connection

The slickest thing about the SeaMicro chassis is that the compute cards are essentially just a Haswell Northbridge (less a graphics controller). The Southbridge is replaced by a set of ASICs which communicate with the SeaMicro chassis for disk and network I/O configuration and presentation. Each server card has 8 network paths that present to the server as Intel E1000 NICs. The virtual server NICs are fully configurable from the SeaMicro chassis using an IOS-like command line, with full 802.1q VLAN tagging and trunking support (if desired). By default the chassis presents a single untagged VLAN to all the server NICs, as well as the external 10Gig ports.

Disk I/O is even better. Since we had a RAID storage card, we configured a single volume of around 3TB, and with just a few lines of configuration on the chassis were able to split the RAID volume into 64 virtual volumes of 48GB and present each to one of the server cards as a root volume. These were presented as hot-plug SCSI devices, and could be dynamically moved from one server card to another via a couple of quick config statements on the SeaMicro chassis. For the JBOD, we were able to assign disk ranges to lists of servers with a single command, feeding it a list of server cards and a number of drives, and the SM would automatically assign that number of disks (still hot-plug) to each server and attach them via the SeaMicro internal fabric and ASICs. Pretty cool stuff! (And invaluable during the OpenStack deployment. More on that later.)

On to OpenStack!

With the chassis powered on, root volumes assigned, and the JBOD volumes available to assign to whichever server made the most sense, we were ready to get going with OpenStack. First hurdle: there is zero removable media available to the chassis. This is pretty normal for setups like this, but unlike something like VMware, there isn’t any easy way to mount an ISO for install. Fortunately installing a DHCP server is trivial on OS X, and it has a built-in TFTP server, so setting up a quick PXE boot server to get the bootstrap node up took just a few minutes. A nice feature of the SeaMicro chassis is that the power-on command for the individual servers allows a one-time PXE boot order change that will go away on the next power on, so you don’t need to mess with boot order in the BIOS at all. We installed Ubuntu 14.04 on one of the server nodes for bootstrapping and then started to look at what we needed to do next.

We’d received a SeaMicro/OpenStack Reference Architecture document from AMD, and had also found a blog article on how a group from Canonical configured 10 SeaMicro chassis for OpenStack in about 6 hours, plus an OpenStack Rapid Deployment Guide for Ubuntu. This seemed like enough knowledge to be dangerous when starting from absolutely nothing, so we dove right in.

Bootstrapping the metal

The reference/rapid deployment architectures all appeared to use MaaS (Metal as a Service) for bootstrapping the individual server blades. MaaS also has a plugin for the SeaMicro chassis to further speed deployment, so once the MaaS admin page was up and running, we were off to the races:

maas maas node-group probe-and-enlist-hardware model=seamicro15k mac= username=admin password=seamicro power_control=restapi2

A few seconds later, MaaS was populated with 64 nodes, each with 8 displayed MAC addresses. Not too shabby. We deleted the bootstrap node from the MaaS node list since it was statically configured, then told MaaS to commission the other 63 nodes for automation. Using the SeaMicro REST API, MaaS powered on each server using PXE boot, ran a quick smoke test to confirm reachability, then powered it back off and listed it as ready for use. Easy as pie, pretty impressive compared to the headaches of booting headless/diskless consoles of old. (I’m looking at you SunOS)

All the Ubuntu + OpenStack reference architectures use a service orchestration tool called Juju. It’s based on a set of scripts called “charms” that deploy an individual service to a machine, configure it, then add relationship hooks to other services (e.g., telling an API service that it’s going to be using MySQL as its shared backend database).

Juju requires its own server (machine “0”) to run the orchestration tools and deploy services from, so after pointing Juju in the direction of the MaaS API and running a quick bootstrap, I had a bootstrap server running, powered on and automatically provisioned by MaaS. Juju also deploys the deploying user’s ssh public key to the new server, for use with its internal “juju ssh” command, which is quite handy. (I’d later come to learn that password auth is basically nonexistent in cloud architecture, at least on initial deployment. Works for us.)

Now it was time to start getting OpenStack deployed. The AMD provided reference architecture didn’t quite match the Ubuntu one, which didn’t at all match what I was seeing in the Canonical deployment test, so I had to make some decisions. By default when you deploy a new Juju service, it instantiates a new machine. This seems very wasteful on a deployment of this size, so it made sense to colocate some of the services, so the initial deployment looked a bit like this:

juju deploy mysql
juju deploy --config=openstack.cfg keystone
juju deploy --config=openstack.cfg nova-cloud-controller
juju deploy nova-compute
juju deploy glance --to 2
juju deploy rabbitmq-server --to 1
juju deploy openstack-dashboard --to 2

Networking headaches

Once completed, this (sorta) left us with 4 new servers with the basic OpenStack services running. Keystone, Glance and Horizon (dashboard) were all colocated on one server, and MySQL and RabbitMQ on another. The Nova controller and first Nova Compute server were standalone. (Both the Ubuntu and AMD reference architectures used this basic layout.) After a lengthy series of “add-relation” commands, the services were associated and I had an apparently working OpenStack cloud with a functional dashboard. A few clicks later and an instance was spawned running Ubuntu 14.04 Server, success! Kinda… It didn’t appear to have any networking. The reference config from AMD had the “quantum-gateway” charm installed (the charm for the newly named Neutron networking service), but the config file supplied used a Flat DHCP Networking service through Nova, which didn’t appear to actually be working out of the box. Most of the documentation used Neutron rather than Nova-Network anyways, which seemed like a better solution for what we wanted to do. No problem, just change the nova-cloud-controller charm config to use Neutron instead, right?


The network configuration is baked into the configs at install time by Juju. While some config can be changed post-deploy, that wasn’t one of them. This was the first (of many) times that the “juju destroy-environment” command came in handy as a reset to zero button. After a few false starts, we had the above cloud config up and running again, this time with quantum-gateway (why the charm hasn’t been renamed to Neutron, we don’t know) correctly deployed and configured to work with the Nova cloud controller. This also added the missing “Networks” option to Horizon, allowing us to automatically create public and private subnets, as well as private tenant routers for L3 services between the networks. An instance was brought up again, and this time it could ping things! A floating external IP was easily associated with the instance, and with a few security changes we could ping the instance from the outside world. Success! Since our keypair was automatically installed to the instance on create, we opened an ssh session to the instance and… got absolutely nothing.

Neutron, as deployed by Juju, uses its ML2 (Modular Layer 2) plugin to allow for configurable tenant network backends. By default, it uses GRE tunnels between compute nodes to tie the tenant networks together across the OpenStack-internal management network. This is great, but because GRE is an encapsulation protocol, it has overhead and reduces your effective MTU. Our attempts to run non-ICMP traffic were running into MTU issues (as is common with GRE) and failing. The quantum-gateway Juju charm does have a knob to reduce the tenant network MTU, but since the SeaMicro supports jumbo frames across its fabric, we added a DHCP Option 26 to the MaaS server to increase the management network MTU to 9000 at server start time, and rebooted the whole cluster.
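
To put rough numbers on the MTU problem, here is a minimal sketch of the arithmetic. The 42-byte overhead is our assumption for GRE over an IPv4 underlay (20-byte outer IP header, 8-byte GRE header including the tunnel key, 14-byte inner Ethernet header); the function name is made up for illustration, and the exact overhead may differ in your environment.

```shell
# Assumed per-packet overhead for GRE tenant tunnels over IPv4:
#   20 bytes outer IPv4 header
# +  8 bytes GRE header (4-byte base + 4-byte key)
# + 14 bytes inner Ethernet header
GRE_OVERHEAD=$((20 + 8 + 14))

# tenant_mtu UNDERLAY_MTU
# Prints the largest instance MTU that fits inside one underlay packet.
tenant_mtu() {
    echo $(( $1 - GRE_OVERHEAD ))
}

echo "1500-byte underlay: instance MTU must be <= $(tenant_mtu 1500)"
echo "9000-byte underlay: instance MTU can be up to $(tenant_mtu 9000)"
```

With a standard 1500-byte management network, instances using their default 1500-byte MTU overflow the tunnel, which matches the failures we saw. With jumbo frames on the underlay, a default-MTU instance fits with plenty of room to spare.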

SeaMicro Storage quirks

At this point we had a single working instance with full networking available on our one compute node. There were two things left to do before the cloud could really be considered “working”: scale out compute capacity, and add persistent volume storage.

To this point, the instances were using temporary storage on the compute card that would be destroyed when the instance was terminated. This works for most instances, but there was a slight problem: our compute nodes only had 48GB of attached storage, and some of that was taken up by the hypervisor OS. That doesn’t leave a lot for instance storage. Since we had 60 3TB drives attached to the SeaMicro, we decided to give each compute node one disk, giving it 3TB for local non-persistent instance volumes. The initial plan was to add a total of 20 compute nodes, which surely would be as simple as typing “juju deploy -n 20 nova-compute”, right? This is where the biggest headache of using Juju/MaaS/SeaMicro came into play. Juju is machine agnostic: it grabs a random machine from MaaS based on constraints about RAM and CPU cores (if given; all our machines are identical, so there were no constraints). Juju tracks hostnames, which are derived from MaaS. MaaS assigns hostnames to devices as a random 5-character string in the local domain (.master in this case), and tracks the server MAC addresses. The SeaMicro chassis is only aware of the MAC addresses of the servers. On top of this, we needed to have the disk available to the compute node prior to deploying nova-compute onto it.

So, how to add the disk to the compute nodes? Well, first we needed to know which machines we’re talking about. Juju can add an empty container machine, although it doesn’t have a “-n” flag, so type “juju add-machine” 20 times and wait for them to boot. While waiting, get the hostnames of the last 20 machines from Juju. Then go over to MaaS’s web interface (which can only show 50 machines at a time), search for the random 5-character string for each of the 20 servers, and make note of the MAC address. Then go over to the SeaMicro command line and issue “show server summary | include ” to get the server number in the SeaMicro chassis. It’s annoyingly time consuming, and if you end up destroying and rebuilding the Juju environment or even the compute nodes, you have to do it all over again, since MaaS randomly assigns the servers to Juju. Ugh.
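
If the three lists can be dumped to text, the cross-referencing itself is a simple join on MAC address. The sketch below assumes two hypothetical input files that we invented for illustration: one with “hostname MAC” pairs copied out of MaaS, and one with “server-id MAC” pairs trimmed from the chassis output. Neither format is something MaaS or the SeaMicro CLI produces as-is, and the function name and sample data are made up.

```shell
# map_hosts MAAS_FILE CHASSIS_FILE   (Juju hostnames arrive on stdin)
#   MAAS_FILE lines:    <hostname> <mac>     (hypothetical format)
#   CHASSIS_FILE lines: <server-id> <mac>    (hypothetical format)
# Prints: <hostname> <mac> <server-id>, or "unknown" where no match exists.
map_hosts() {
    awk -v maas="$1" -v chassis="$2" '
        BEGIN {
            # Load both lookup tables, normalizing MAC case.
            while ((getline line < maas) > 0)
                if (split(line, f, " ") >= 2) mac[f[1]] = tolower(f[2])
            while ((getline line < chassis) > 0)
                if (split(line, f, " ") >= 2) id[tolower(f[2])] = f[1]
        }
        NF {
            m = ($1 in mac) ? mac[$1] : ""
            print $1, (m != "" ? m : "unknown"), ((m in id) ? id[m] : "unknown")
        }
    '
}

# Demo with made-up hostnames and MAC addresses.
tmp=$(mktemp -d)
printf 'x7k2p 00:22:99:aa:00:01\nq9m4z 00:22:99:aa:00:02\n' > "$tmp/maas.txt"
printf '2/0 00:22:99:AA:00:02\n4/0 00:22:99:AA:00:01\n' > "$tmp/chassis.txt"
printf 'x7k2p\nq9m4z\n' | map_hosts "$tmp/maas.txt" "$tmp/chassis.txt"
rm -rf "$tmp"
```

The third column is what the chassis needs when assigning disks. It does not remove the manual export steps, but it does make the rebuild-and-repeat cycle less painful.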

As a side note, since this was a fairly substantial portion of the time spent getting the initial install up and running, we reached out to AMD about these issues. They’re already on the problem, and are working with Canonical to further integrate the SeaMicro’s REST API with MaaS so the MaaS assigned machine names match the server IDs in the chassis itself, as well as presenting the presence of disk resources to MaaS so they can be used as Juju constraints when assigning new nodes for a particular function. For example, when creating the storage nodes, Juju could be told to only pick a machine with attached SSD resources for assignment. These two changes would significantly streamline the provisioning process, as well as making it much easier to determine which compute cards in the chassis were being used by Juju rather than having to cross-reference them by MAC address in MaaS.

Fortunately, once the server list was generated, attaching the storage on the SeaMicro was easy: “storage assign-range 2/0,4/0,7/0… 1 disk external-disks” and the chassis would automatically assign and attach one of the JBOD drives to each of the listed servers as VDisk 1 (the 2nd disk attached to the server). Since a keypair is already installed on the Juju server, a little shell scripting made it fairly easy to automatically log in to each of the empty nodes and format and mount the newly attached disk on the instances path for deployment. Deployment then works fairly automatically: “juju add-unit nova-compute --to 10”, repeated for each of the 20 new machines. After spawning a few test instances, the expanded cloud was up and running.
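
The “little shell scripting” can be sketched roughly as follows. This is not the script we ran; the device name (/dev/sdb), the ext4 filesystem, and the function names are assumptions for illustration, and /var/lib/nova/instances is nova’s default instances path. It defaults to a dry run that only prints the commands, since running mkfs against the wrong device is not something you want to discover on 20 nodes at once.

```shell
DISK=${DISK:-/dev/sdb}                             # assumption: the JBOD disk appears as the 2nd SCSI device
MOUNTPOINT=${MOUNTPOINT:-/var/lib/nova/instances}  # nova's default instances path
DRY_RUN=${DRY_RUN:-1}                              # 1 = print the ssh commands instead of running them

# Build the command line to run on each node.
remote_cmd() {
    echo "sudo mkfs.ext4 -q $DISK && sudo mkdir -p $MOUNTPOINT && sudo mount $DISK $MOUNTPOINT"
}

# prep_node HOST: format and mount the new disk on one node.
prep_node() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh -n ubuntu@$1 '$(remote_cmd)'"
    else
        # -n keeps ssh from swallowing the host list on stdin
        ssh -n "ubuntu@$1" "$(remote_cmd)"
    fi
}

# prep_all: read one host per line from stdin.
prep_all() {
    while read -r host; do
        if [ -n "$host" ]; then
            prep_node "$host"
        fi
    done
}
```

Feed it the addresses of the empty machines, one per line, review the printed commands, then rerun with DRY_RUN=0. A real version would also want an /etc/fstab entry so the mount survives a reboot.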

Storage Services

At this point what we were really missing was persistent volume storage so instances didn’t need to lose their data when terminated. OpenStack offers a few ways to do this. The basic OpenStack volume service is “Cinder”, which uses pluggable backends, LVM2 volume groups being the default. Since this exercise is a basic OpenStack proof of concept, we didn’t utilize any of the more advanced storage mechanisms available to OpenStack to start with, choosing to use the SeaMicro to assign 6 3TB JBOD drives to each Cinder node in an LVM configuration across 3 nodes for a total of ~54TB of non-redundant persistent volume storage. Cinder + LVM has some significant issues in terms of redundancy, but it was easy enough to set up. We created some mid-sized volumes from our Ubuntu Server image, started some instances from the volumes, then tore down the instances and re-created them on different hypervisors. As expected, all our data was still in there. Performance wasn’t particularly bad, although we didn’t do much in the way of load testing on it. For file I/O heavy loads and redundancy, there are certainly some better ways to approach storage that we’ll explore in another writeup.
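
For reference, the capacity math and the shape of the per-node LVM setup look roughly like this. The device names are assumptions, and the helper only prints the commands rather than running them; “cinder-volumes” is the volume group name Cinder’s LVM driver looks for by default.

```shell
NODES=3
DISKS_PER_NODE=6
DISK_TB=3
echo "raw capacity: $((NODES * DISKS_PER_NODE * DISK_TB)) TB"

# lvm_cmds DEVICE...
# Prints (does not run) the LVM commands for one Cinder node.
lvm_cmds() {
    echo "pvcreate $*"
    echo "vgcreate cinder-volumes $*"
}

# Assumed device names for the six JBOD disks on one node.
lvm_cmds /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
```

Note that each volume group lives entirely on one node, which is exactly where the redundancy problem comes from: lose the node and its 18TB of volumes goes with it.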

At this point, we haven’t implemented any object storage. This can be done with the Ceph distributed filesystem, or Swift which is OpenStack’s version of S3 object storage. Since we’re using local storage for Glance images and didn’t have a use case for object storage during this proof of concept, we decided to skip this step for the time being until we do a more thorough look at OpenStack’s various storage options and subsystems.

Console Juju

OpenStack offers a couple of flavors of VNC console services, one directly proxied through the cloud controller (novnc), and a java viewer (xvpvnc). These are fairly straightforward to set up, involving a console proxy service running on the cloud controller (or any other externally accessible server), and a VNC service running on each compute node. Configuration is just a few lines in /etc/nova/nova.conf on each of these servers. But there’s a caveat here: there isn’t a Juju charm or configuration option for the VNC console services. Because the console services have their configuration installed in a file managed by Juju, any relationship change affecting the nova-cloud-controller or nova-compute service will cause the console configuration to be wiped out on every node. Additionally, the console config on the compute nodes needs to be configured (and the compute service restarted) BEFORE any instances are created on that node. If any instances exist beforehand, they won’t have console access; only new instances will. While this isn’t the end of the world, especially since one assumes the base relationships in Juju wouldn’t be changing much, it does highlight a potential problem with Juju in that if you’re adding custom config that isn’t deployed with the charm, you run the risk of losing it. While we haven’t looked at how difficult custom charms are to write yet, this clearly could be a problem in other areas as well, for example using iSCSI as a Cinder and/or Ceph backend, using something other than the built-in network backend for Neutron, etc. While there will always be a tradeoff when using service orchestration tools, this does seem like a significant one, since being able to add custom config segments to a managed config file is fairly important.
It seems unlikely to us that large production OpenStack clouds are deployed in this manner. The potential to wipe out large amounts of configuration unexpectedly (Or worse, have inconsistent configuration where newer compute units have different configs than older ones) is significant.

(Note: The scripts below auto-deploy the VNC console config to all compute nodes, inserting it into /etc/nova/nova.conf immediately before the rabbit_host line.)

– Run from MaaS node
# Copy the config script to every nova-compute unit and run it as root
for i in $(juju status nova-compute | grep public-address | awk '{print $2}'); do
scp ubuntu@$i:
ssh ubuntu@$i 'sudo /bin/bash /home/ubuntu/'
done

– Run from MaaS node
. /root/.profile
# IP address of the br0 bridge on this node
IPADDR=$(ifconfig br0 | grep "inet addr:" | cut -d: -f2 | awk '{print $1}')
echo $IPADDR
# grep -q exits with status 1 when VNC has not been configured yet
grep -q "vnc_enabled = true" /etc/nova/nova.conf
isvnc=$?
if [ "$isvnc" == "1" ]; then
# Get line of RabbitMQ config
rline=$(grep -n -m 1 rabbit_host /etc/nova/nova.conf | cut -f1 -d:)
echo ${rline}
# Each inserted line ends in an escaped backslash so sed sees one line per option
sed -i "${rline} i\\
vnc_enabled = true\\
vncserver_listen = \\
vncserver_proxyclient_address = ${IPADDR}\\
novncproxy_base_url=""\\
xvpvncproxy_base_url=""
" /etc/nova/nova.conf
apt-get install novnc -y
initctl restart nova-compute
fi

Parting thoughts

This article is long enough already, but the initial impression is that OpenStack is complicated, but not as bad as it looks. This is obviously aided by rapid deployment tools, but once the architecture and how the services interact makes sense, most of the mystery is gone. Additionally, if you want a lot of compute resources in a small (and power efficient) footprint, the SeaMicro 15000 is an incredible solution. Juju/MaaS have some issues in terms of ease of use with the SeaMicro, but at least some of them are already being addressed by AMD/Canonical.

Since our proof of concept was basically done, we had the option to go a couple different directions here, the most obvious being an exploration of more advanced, efficient and redundant OpenStack storage. To do that, we’d need to tear down the Juju based stack and go with something more flexible. Since this isn’t production, no better way to do that than to just install everything from scratch, so stay tuned for that in upcoming articles.

A rat ate my network? A story from the service truck…

In a technology version of the classic “My dog ate my homework” excuse, we were recently called onsite to a local school, where the network had suddenly gone offline. It turned out to be a case of “A rat ate my network”…

Day 1:

We arrived onsite to discover that their entire network was down, including all Internet access. We looked at their data room and discovered that five of the six strands of fiber that provide their backbone connection were severely damaged for unknown reasons and were passing no light. An entire section of each was actually completely missing!

We managed to do an in-place splice to bring them back on-line by having one tech hold the fusion splicer up near the ceiling while the other tech hastily cleaved and spliced the strands from a ladder. Light was verified and the network came back up. Not the prettiest work, but it was working again quickly and we offered to provide them with a quote to properly replace the damaged bundle.

Day 2:

We received a call that they were down once again and we high-tailed it back over to the school. Once onsite, we discovered that the fiber was again completely severed and missing up to the bulkheads in the panel. All evidence of our previous work was COMPLETELY gone. This time, however, we found copious rat droppings in the panel along with bits of the damaged fiber that remained.

So, we installed a secure wall-mount fiber enclosure and routed fiber into that, where we terminated a new bulkhead and locked it up – hopefully safe from rodent tampering!

It’s interesting to note that this particular rat seemed to have a taste for the 10gig aqua 50 micron multi-mode fiber and it left the standard orange 62.5 micron multi-mode alone. I wonder what wine goes with aqua fiber?

Did you know that we operate a full service low-voltage structured cabling group and field service division, including fiber optic splicing, security systems and more? 

Cisco Unified Communications Manager – Time Based Call Routing

Time based call routing is a convenient feature in the Cisco Unified Communications Manager that allows calls to be treated differently according to a time schedule.

For this example we’ll assume an incoming number will be manipulated so that during regular office hours it will route to reception. Off-hours including lunchtime the call will go direct to voicemail.

The main number is 206-555-1000. The CUCM strips all but the last four digits.
Hours of operation are Mon-Fri 8:00am to 6:00pm with a lunch between noon and 1:00pm.

The starting point is to create Time Periods associated with those hours.

Best practice is to name the time periods based on what they are.

The Open Hours will need two time periods:

  • Mon-Fri 8:00 to 12:00
  • Mon-Fri 13:00 to 18:00

The Closed Hours will need four time periods:

  • Mon-Fri 0:00 to 8:00
  • Mon-Fri 12:00 to 13:00
  • Mon-Fri 18:00 to 24:00
  • Weekends 0:00 to 24:00

Next is to assign the Time Periods to Time Schedules

  • TS_Reception_Open – the two associated time periods.
  • TS_Reception_Closed – the four associated time periods.

Once done you need to create two Partitions assigning the Time Schedules:

  • Part_Reception_Open – TS_Reception_Open
  • Part_Reception_Closed – TS_Reception_Closed

The Calling Search Space (SEA_Local) assigned to the incoming Gateway must then contain those two partitions. The extensions in our example are laid out as follows:

  • For our example the reception phone Line1 is assigned 2000.
  • On the Cisco Unity server the voicemail subscriber with extension 1000 is Reception.
  • The Alternate Extension assigned to the subscriber is 2000 for convenience.

Next is to create a Translation Pattern for open office hours:

  • Translation Pattern – 1000
  • Partition – Part_Reception_Open
  • Description – Main Line 206-555-1000 Open
  • Calling Search Space – SEA_Local
  • Route This Pattern – checked
  • Called Party Transform Mask – 2000

Create a CTI Route Point for the Main Line Off-Hours going direct to Voicemail:

  • DeviceName – MainLineClosed
  • Description – Main Line 206-555-1000 Closed
  • Assign the appropriate Device Pool
  • Calling Search Space – SEA_Local
  • Save and then Line [1] – Add a new DN
  • Directory Number – 1000
  • Partition – Part_Reception_Closed
  • Description/Alerting – Main Line Closed
  • Calling Search Space – SEA_Local
  • Click the box to Forward All to Voicemail
  • Calling Search Space – SEA_Local

The Cisco Unified Communications Manager will then route the call appropriately.

Call flow follows:

The PSTN delivers a call to 206-555-1000, and for our example the CUCM strips it to the last four digits (1000).

  • The Gateway sees incoming calls off the Calling Search Space of SEA_Local.
  • In that calling search space it sees two entries for 1000.
  • Next the CUCM checks the time schedule to see how to treat the call.
  • During open hours it will be translated to 2000, Line1 of the Reception phone.
  • After hours it will route to the CTI Route Point with a CallFwdAll to the voicemail server and the subscriber extension 1000.
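
The decision the CUCM makes for each incoming call can be modeled in a few lines of shell. This is only a sketch of the logic, not anything that runs on the CUCM; the function name is made up, and treating each time period as including its start time but not its end time is our assumption.

```shell
# route_call DOW HHMM
#   DOW:  1 (Monday) .. 7 (Sunday), as printed by `date +%u`
#   HHMM: 24-hour time without a colon, e.g. 0930
route_call() {
    dow=$1
    # Strip leading zeros so 0930 is not parsed as an octal number
    t=$(printf '%s' "$2" | sed 's/^0*//')
    t=${t:-0}
    open=0
    if [ "$dow" -le 5 ]; then
        # TS_Reception_Open: Mon-Fri 8:00 to 12:00 and 13:00 to 18:00
        if [ "$t" -ge 800 ] && [ "$t" -lt 1200 ]; then open=1; fi
        if [ "$t" -ge 1300 ] && [ "$t" -lt 1800 ]; then open=1; fi
    fi
    if [ "$open" -eq 1 ]; then
        echo "Part_Reception_Open: translate 1000 to 2000 (reception)"
    else
        echo "Part_Reception_Closed: forward 1000 to voicemail"
    fi
}

# Where would a call arriving right now go?
route_call "$(date +%u)" "$(date +%H%M)"
```

Because the open and closed schedules together cover the whole week with no overlap, exactly one of the two entries for 1000 is active at any moment, which is what lets the same dialed number land in two different places.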

Aruba Partner Summit – 2011

Last week, I had the pleasure of attending the Aruba Networks 2011 Partner Summit. This event ran from Monday 4/4 through Wednesday 4/6 with the most significant density of activity on Tuesday.

I took about 14 pages of notes – more than usual for an event of this sort and feel like the event was worthwhile, in spite of the invariable backlog I accumulated back at the office.

We’ve been working with the Aruba Air Mesh (their outdoor industrial mesh product acquired from Azalea Networks) since 2008, but are a “newly-minted” Aruba Indoor Partner, so I had a lot of ground to cover.

First and foremost, I walked away with a very crisp vision of Aruba’s place in the market – how they view their products and the marketplace itself. Understanding the way they see the world helped me to understand a lot of things, especially their recent entrance into the closet switching market (see their Mobility Access Switches).

I’ll have to admit that, though the switching platforms seemed like great devices, I didn’t understand why Aruba would want to increase their competitive overlap with big players like Cisco, Juniper (did you know we’re now a Juniper partner?!) and HP. It didn’t make a lot of sense on the surface – until I heard from the Aruba CEO, Dominic Orr.

The Security Philosophy

I’ve long said that one of the biggest problems we face as IT professionals is the schizophrenic nature of security policies applied using differing tools, protocols and devices all over the network like a patchwork quilt. There are too many seams and too much complexity. The reality is that increased complexity is less secure – human nature has too many opportunities to slip us up as we make things more complex.

So how does Aruba simplify things? It all boils down to a radically different philosophy about network attachment: for as long as I’ve been involved in networking, the underlying vision has always been port-centric. Everything tied back to the port in some way. Granted, there were complex schemes to make bits and parts more “user focused”, but they were inelegant and inefficient. Aruba has really stepped up the game with a model that provides what we’ve really needed all along. I’m frankly amazed that nobody’s gone here before – though hindsight usually works like that!

Aruba’s focus is on identifying the user and device, independent of where and how they connect – and then applying a security edge that’s tailored to that user and their device (and even their OS/browser, etc.), no matter what media they use to connect. With this user-centric contextual model, security can be adapted to the requirements of the individual user and then further enhanced to account for the different types of devices they use to access the network. You don’t need to think about how they connect at all – it’s just enough to know that they can connect and leave it at that.

What happened to “Core/Distribution/Access”?

To see how this plays out, you need to understand how Aruba’s view of the future world for networking differs philosophically from the view of traditional providers (read: Cisco) in a fairly significant way: they see it consisting of essentially two big “buckets”. The first is datacenter – this can be internal to an organization, or in the cloud. It’s a market in which Aruba seems to have no horse or interest. The second (and final, in their mind) is the “access” layer. This encompasses EVERYTHING else: wired access, wireless access, VPN access. Any method a user can connect to the datacenter.

What’s particularly important here is that Aruba’s product strategy provides for a centralized user-centric AAA model that’s unified across all of the connection methods that also pushes a security policy to the point of connection based not only on user identity, but also based on other context information (such as device type and OS).

For example, if I connect to the network via WiFi with my laptop, and then via WiFi with my iPad, the context is different based on the device and OS – so a differential layered security policy can be applied: first the one for my user and then what amounts to an overlay on top of that for my device type (and even OS/Browser version). This policy can be applied to my wired ports, my wifi access and my VPN access (soft or hard client).

It’s all managed through the same centralized Mobility Controller architecture that Aruba wireless customers already know so well – just extended to wired access. Also, it’s all manageable by the Aruba AirWave management suite.

This is a significant break from the view of traditional wired access vendors – and amounts to the equivalent of a “wired access point” if you want to think of it that way.

Bad news for Intel and Microsoft? We’re plumbers?

Dominic got a few good laughs during his presentation on Wednesday, but one of his semi-humorous comments resonated with me when he was discussing the huge shift in the marketplace away from traditional architectures (wired to wireless), devices (PCs to smart phones and tablets) and vendors (away from Intel/MSFT to an architecture dominated by players like ARM/Apple). He said that Aruba’s role and Aruba’s partners’ roles were as “plumbers”. I like that particular analogy a lot.

His central point was that a HUGE amount of money will change hands as these marketplace and technology transitions take place and that the core infrastructure providers (us plumbers) are well positioned to capture a significant part of that.

It’s the people

Another key take-away for me was purely qualitative – and was with regard to the people at Aruba Networks and their corporate culture. They feel, to me, like Cisco used to feel back in the mid to late 1990s and early 2000s (yes, I’ve been a Cisco partner for that long!). They want to innovate, to disrupt, to partner, to team. They want to get deals done and they have the flexibility to make a deal happen. There wasn’t a single big ego to be found among the Aruba staff. THESE are the kind of people I want to work with and I want my customers to work with. They get things done and have fun doing it.

Anti-fan-boy disclaimer:

So, lest I sound like a total Aruba fan-boy, I’d be remiss if I didn’t mention a significant place where Aruba has been totally silent (at least so far as I’ve observed). It’s the inside of the device – especially the new wave of mobile devices like smart-phones and tablets.

We can make connection of these devices easy and flexible, we can implement a user-centric, device modified security perimeter at the point of connection, but that device still ends up with access to the network. So we still need to make sure that the device isn’t owned by malware. I think this is obvious, but it’s notably absent from Aruba’s articulation of their world view and the picture they paint is a little too rosy for my taste. I don’t want them to help customers lose sight of this important part of a secure architecture. In the meantime, it just means that we, as a partner, need to emphasize this element a little more carefully.

Seriously? The iPad will drive my business?

What else stood out? Tablets. The iPad in particular. It really is everywhere. It’s in the hands of all of the participants. It (and iPhone) is in the hands of people in >80% of Fortune 100 companies. It was a $54.8M market in 2011 and is expected to be a $300M market by 2013 (thanks Gartner Group for the data that Aruba passed on to us!). Given my concerns about the security of such devices in enterprise environments, I’m still hesitant about this development, but cannot deny its impact.

It’s clear to me that helping my customers understand how to handle this inrush of “iDevices” will help me to better help them. I’ve been seeing them in the marketplace with increasing frequency and we’re actually making significant changes to our own network and internal security to support them securely – so too do our customers need to invest in the planning and implementation of policies and infrastructure to handle this massive new challenge.

Standout stats

A few more stand-out statistics (from Gartner as well, according to Aruba) that were scattered in various presentations throughout the event:

  • The SmartPhone market was $289M in 2011 and expected to be $1B in 2013.
  • Virtual desktops should be 45M strong by 2013. They accounted for 2% of desktops in 2009 and should be 13% by the end of 2012.
  • The enterprise access market is $11B for 2013 – of which ~$3B is WLAN and with the impact of the iPad alone may be as high as $3.75B.
  • Mobility budgets have increased 10% year-over-year since 2007 – despite the recession.
  • 2011 – 2014 compound annual growth of network access for traditional Ethernet Switching is expected to be -2% and WLAN is +80%!
  • Licensed band wireless infrastructure (cellular) costs 8-10X as much as WLAN to provide similar density coverage (this stat is an Aruba source, not Gartner – Aruba didn’t cite a source here).

Other notable items were:

  • Aruba is increasingly hearing about WLAN coverage problems in non-traditional coverage locations: Bathrooms. Stairwells. Elevators. People want coverage EVERYWHERE! Remind me not to use other people’s iPads from now on. Ick!
  • Density requirements for WLAN have changed: 5 years ago in a conference room meeting of 25 people, a handful might have had laptops with active WiFi. Today, every person could reasonably be expected to have one or MORE devices that could actually be active on WiFi. Smart phones, tablets and still even laptops. So a meeting of 25 people might have 30 or 50 (or more) active devices!
  • Cloud architectures put users on the “wrong side” of the corporate firewall. I suppose I’ve been aware of the perimeter implications of cloud, but haven’t thought of it in quite this context.

That Tesla guy was smarter than everyone thought

In closing, I was reminded during a presentation of a quote by Nikola Tesla in 1926. This is, truly, mind blowing prescience:

“When wireless is perfectly applied the whole earth will be converted into a huge brain, which in fact it is, all things being particles of a real and rhythmic whole. We shall be able to communicate with one another instantly, irrespective of distance. Not only this, but through television and telephony we shall see and hear one another as perfectly as though we were face to face, despite intervening distances of thousands of miles; and the instruments through which we shall be able to do this will be amazingly simple compared with our present telephone. A man will be able to carry one in his vest pocket.”

A phased approach to IPv6 adoption

There’s been much noise about IPv4 address runout and IPv6 adoption lately, even as far as some poorly written articles in the mainstream media. The IANA address pool was depleted on February 3rd, 2011, with the final /8 blocks being assigned to the RIRs for allocation. (ARIN article on IANA runout) Now, just over 2 months later, APNIC appears to be close to exhausting their assignable address pool, the first of the RIRs to do so. (IPv4 Address Report)

So, what does all this mean for non-carrier companies who haven’t adopted IPv6 yet? Will your networks fall off the face of the internet? Will websites become unavailable to your employees, or even worse, your customers?

Well, no. In fact, you probably won’t notice anything at all. IPv4/IPv6 dual-stack hosts will be the norm for the foreseeable future in most regions. However, we are facing the possibility of IPv6 only clients starting to show up in the Asia/Pacific region. This means that even though the sky isn’t falling, steps should be taken to adopt IPv6 in a manageable fashion, particularly if you have a global presence including a customer base in Asia.

Here at Semaphore Corporation, we’ve been running down this road for some time, both to migrate our own network and to prepare ourselves for assisting our customers in their migration. I’ll present a straightforward high-level roadmap of how we recommend handling IPv6 adoption based on our own experiences.

IPv6 adoption is a 6-phase process (more or less)

Phase 1 – Internet Connectivity and External Infrastructure

Several of the other phases are optional based on your needs and the scope of your customer base. This phase, however, is not. The single most important step in IPv6 adoption is ensuring your network provider supports IPv6 natively.

Internet Connectivity

At this point, any major business class carrier should offer IPv6 support to their customers. Broadband providers may or may not be an exception to this, although most have pilot programs in place. The first step in this process should be to call your carrier and ask to receive an IPv6 address allocation and have it routed to you. You can typically base your address space request size on your current network infrastructure. The following are the most likely possibilities for address allocation:

  • Single directly-connected subnet – This model is most often seen in the datacenter for small deployments. Your layer-3 gateway for your external devices is your ISP, and you either have a single server or a layer-2 infrastructure handling connectivity to your servers. Your ISP will likely assign you a /64 (single IPv6 subnet) to match your current IPv4 needs.
  • Single external subnet, multiple internal – There is much confusion as to how perimeter security will function under IPv6. Many state (myself among them) that NAT is no longer necessary and security should be performed on routed subnets. Even if you currently use NAT, if you have infrastructure with multiple subnets behind your firewall, I would recommend requesting an address allocation which can be divided up into smaller subnets. The size of this allocation will vary based on your carrier, ranging from a /60 (16 /64 subnets) or a /56 (256 /64 subnets) all the way to a /48 (65,536 /64 subnets).
  • BGP connected and/or multihomed network – If you’re running BGP and have your routes propagating into the global table via your own AS, a /48 is required. This appears to have been decided upon as the new /24, the longest prefix length that will be allowed into the global table. Attempting to announce a netblock smaller than a /48 will likely lead to it being filtered by many networks. There isn’t a particular need for a netblock larger than a /48, so this is likely what you will get.
  • Tier 1/Tier 2 Network Service Provider – If you will be delegating BGP routable space to your end-customers, you will need to go to ARIN (or your local RIR) for your address assignment. The RIRs appear to be assigning /32 length networks to NSPs. Presumably, if you fall into this category, you’re already well on your way to adoption and this post isn’t for you. =)
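
To make the allocation sizes above concrete, here is a minimal Python sketch that counts how many /64 subnets fit in a given prefix. The prefixes are carved out of 2001:db8::/32, the range reserved for documentation, and the function name is my own invention for illustration, not anything your carrier will hand you.

```python
import ipaddress


def count_64s(prefix: str) -> int:
    """Return how many /64 subnets fit inside the given IPv6 prefix."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen > 64:
        raise ValueError("prefix is longer than /64")
    # Each bit between the prefix length and /64 doubles the subnet count.
    return 2 ** (64 - network.prefixlen)


# Stand-in prefixes from the documentation range (assumed, not real allocations).
for example in ("2001:db8:0:10::/60", "2001:db8:0:100::/56", "2001:db8:1::/48"):
    print(example, "->", count_64s(example), "x /64")
# 2001:db8:0:10::/60 -> 16 x /64
# 2001:db8:0:100::/56 -> 256 x /64
# 2001:db8:1::/48 -> 65536 x /64
```

The same arithmetic shows why a /32 for a service provider is so large: it holds 65,536 /48 site allocations.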

External Network Infrastructure

This may or may not be required, depending on your particular network model. If your external servers live outside a firewall and are directly connected to a switch (the datacenter model mentioned above), then you may not need to do anything at all here. If you have a router outside your firewall, or a layer-3 switch, you’ll need to assess its IPv6 capability. Most modern layer-3 devices offered by the large players (Cisco, Juniper, etc) have supported IPv6 for years. A code upgrade may be required on some older chassis however. If you do have an external layer-3 device that supports IPv6, now is the time to bind the IPv6 network assigned by your ISP to the internet facing interface and begin testing IPv6 connectivity to outside networks. If you’ve gotten that far, congratulations, the hardest part is done! (Getting started)

Phase 2 – External Servers

This is the second phase that I would consider “mandatory” for most companies at this point. Everything past this phase can wait, since most likely companies in all regions will run their externally facing servers dual-stack indefinitely. However, if IPv6-only clients begin showing up, you need to make sure they can reach you! This phase includes external webservers, mail servers (although this is less critical, as mail will be relayed by dual-stack hosts), and DNS servers (again, relayed, so less critical).

Focus first on ensuring that your external web presence is visible to both IPv4 and IPv6 users. Your primary website should be accessible by both v4 and v6 clients at the same URL, which means the site’s hostname has both an A and a AAAA record. Earlier it was typical to see a site published under a separate IPv6-specific hostname. However, these are becoming less common, since users don’t know to look for them and they will expect the same names they used on IPv4 to work with IPv6.

Most DNS servers should support querying on IPv6 addresses at this point. Note that you do not need to query a DNS server on its IPv6 address to receive an IPv6 AAAA record back from the server. Most dual-stack clients will continue to query their DNS servers on their IPv4 addresses for the time being. However, enabling IPv6 on your DNS servers will allow other IPv6 enabled DNS servers to switch to v6 as their primary querying method. Additionally, if your nameservers exist within your own domain, you will need to register an IPv6 glue record for each server with your DNS registrar. This allows the registrar to point IPv6 clients to the server’s address to bootstrap the resolution process. This should be done for all DNS servers once they become IPv6 enabled.
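
A quick way to check whether a name is published dual-stack is to ask the resolver which address families come back for it. The sketch below is one way to do that with Python’s standard socket module. The function names are hypothetical, and because it goes through the system resolver, the answer depends on how the machine running it is configured.

```python
import socket


def address_families(hostname, resolver=socket.getaddrinfo):
    """Return the set of address families ("IPv4", "IPv6") a name resolves to."""
    families = set()
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The name did not resolve at all.
        return families
    for family, _type, _proto, _canonname, _sockaddr in results:
        if family == socket.AF_INET:
            families.add("IPv4")   # came from an A record
        elif family == socket.AF_INET6:
            families.add("IPv6")   # came from a AAAA record
    return families


def is_dual_stack(hostname, resolver=socket.getaddrinfo):
    """True when the name resolves to both IPv4 and IPv6 addresses."""
    return address_families(hostname, resolver) == {"IPv4", "IPv6"}


# Usage (performs a live lookup, so it is left as a comment here):
#   print(is_dual_stack("www.example.com"))
```

The resolver argument exists only so the check can be exercised without a network; in normal use you would leave it at its default.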

Phase 3 – Firewalls and DMZ Servers

If your website is served by a server on a DMZ for security reasons, this phase is for you! Firewalls provided by the large network vendors (recent Cisco ASA firewalls, and Juniper Netscreen and SRX firewalls) support IPv6 in all recent code revisions. You will need to confirm support with your existing vendor, or evaluate replacing your firewall. (I’m a big fan of the Juniper SRX, personally.) Something else to consider is caveats for IPv6 deployment. If you are running active/passive stateful failover, there may be restrictions on IPv6. For example, older Cisco PIX firewalls support IPv6, but not in stateful failover mode. This is also the time to determine whether you will be doing NAT or not. Again, my recommendation is to drop NAT for IPv6. A modern flow/session based firewall can provide just as much security without requiring complex NAT/PAT rules for inbound traffic. If you are doing NAT, you will need to research the unique local address (ULA) space, fc00::/7 as defined in RFC 4193, which is the IPv6 equivalent of RFC 1918 private IPv4 address space. (The older site-local scope, fec0::/10, was deprecated by RFC 3879.)

Servers in your DMZ will need to be evaluated in the same way as your external servers in the above phase. If you aren’t doing NAT, then all that will be required is to allow inbound traffic for web, mail, etc. to the server’s address and you’re done. The port translations that were the typical approach for a DMZ in the IPv4 world are not required.

Phase 4 – Internal Network Infrastructure

This is the first phase that is truly optional at this point. As I mentioned above, all externally facing resources even in IPv4 constrained Asia will likely continue to respond on both IPv4 and IPv6 addresses for the foreseeable future. However, progress should be made towards eventually supporting a dual-stack architecture company-wide.

Much like the external infrastructure above, all internal layer-3 devices should be assessed for IPv6 support. Internal layer-2 only devices (access switches) need not be upgraded, although if you want the future capability to manage them via an IPv6 address, they too will need IPv6 support. If you were assigned a subnettable block from your ISP, address space planning should be done at this stage, matching your IPv4 internal subnetting scheme. Again, if NAT is in use, you will be using the unique local (ULA, fc00::/7) IPv6 address space for your internal networks.
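
As one possible way to match an IPv4 subnetting scheme, the sketch below maps each internal IPv4 /24 to a /64 inside a /48 by reusing the second and third IPv4 octets as the 16-bit IPv6 subnet ID. This convention is purely an assumption for illustration, as are the function name and the 2001:db8:abcd::/48 documentation prefix; any scheme that keeps the two address plans easy to correlate would do.

```python
import ipaddress


def map_ipv4_to_ipv6(site_prefix, ipv4_subnets):
    """Map each IPv4 /24 to a /64 inside a /48 site prefix.

    Assumed convention: subnet ID = (second octet << 8) | third octet.
    """
    site = ipaddress.IPv6Network(site_prefix)
    if site.prefixlen != 48:
        raise ValueError("this sketch expects a /48 site prefix")
    plan = {}
    for text in ipv4_subnets:
        v4 = ipaddress.IPv4Network(text)
        if v4.prefixlen != 24:
            raise ValueError("this sketch only handles IPv4 /24 subnets")
        octets = v4.network_address.packed
        subnet_id = (octets[1] << 8) | octets[2]  # fits in 16 bits
        # The subnet ID occupies bits 48-63, directly above the interface ID.
        base = int(site.network_address) | (subnet_id << 64)
        plan[text] = ipaddress.IPv6Network((base, 64))
    return plan


plan = map_ipv4_to_ipv6("2001:db8:abcd::/48", ["10.1.20.0/24", "10.1.30.0/24"])
for v4, v6 in plan.items():
    print(v4, "->", v6)
```

Note that the octets land in the IPv6 address as hexadecimal, so 10.1.20.0/24 becomes subnet 114 rather than something that reads as "1.20". Some people prefer to write the decimal digits directly into the subnet ID for readability, at the cost of using the space less densely.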

If your network uses any site-to-site VPN via IPSec or other secured tunnels, this is also the time to evaluate if the tunnel endpoints support IPv6. Because IPSec is built in to the IPv6 protocol, most site-to-site VPN endpoints should support it fully. Configuration may be dramatically different than the equivalent IPv4 configuration.

Phase 5 – LAN Servers

Many LAN services simply do not support IPv6 at this point. However, some basic services most likely do and can be turned up immediately. Services such as internal DNS can be assigned IPv6 addresses which will allow them to query IPv6 DNS servers on the internet. Things such as internal web services, file servers and other basic LAN servers can slowly have dual-stack IPv6 addresses added over time. Corporate mail servers such as Exchange (2007 or later) support IPv6 and can have v6 addresses added, but make sure to read the vendor’s list of caveats for the particular software and OS version you are running prior to adding any IPv6 addresses to the server.

Phase 6 – LAN and VPN clients

As mentioned above, there is no reason to have IPv6-only clients at this point in time. This is a good thing, as IPv6-only clients have some major hurdles at this point. In particular, there is no widely supported way to pass DNS server configuration information to clients receiving an automatic IPv6 address. IPv6 address autoconfiguration is performed via one of three methods currently:

  • Router Advertisement (Stateless autoconf) – This is the most common method for client autoconfiguration at this point in time. The router on the subnet periodically multicasts a router advertisement carrying its address and the network prefix (IPv6 has no broadcast). IPv6 clients listen for these advertisements and assign themselves an address and default gateway based on that information. There are extensions to allow for DNS configuration to be passed to the client as well (RDNSS), but they are not widely supported at this time. For dual-stack hosts, this method works perfectly and seamlessly. When we implemented this on our network, our employees were happily surfing IPv6 websites without even realizing it was happening.
  • DHCPv6 (Stateful autoconf) – This method requires a DHCPv6 server running on the network (either on the router, or elsewhere). Much like IPv4’s DHCP, DHCPv6 maintains a database of all address leases on the network. It can also pass along information such as DNS server addresses or other custom options to the clients. However, at this point, DHCPv6 CANNOT pass along a gateway IP address, making it of questionable value.
  • Stateless DHCPv6 – This will most likely be where we end up once the dust settles. This method relies on the router advertisement/neighbor discovery model to handle address and routing/gateway management. The “Other Configuration” (O) flag is set in the router advertisement, telling the client to multicast a query for a DHCPv6 server once it has its address. The DHCPv6 server in this case hands off DNS information (and other custom fields) to the client, but does not need to keep a database of addresses.
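
To illustrate how a client can assign itself an address from a router advertisement, here is a minimal Python sketch of the classic modified EUI-64 method: the client takes the advertised /64 prefix and builds the lower 64 bits from its own MAC address. The function name, prefix, and MAC address are made up for the example, and many operating systems now generate randomized or privacy interface identifiers instead, so treat this as one possible derivation rather than what every host does.

```python
import ipaddress


def slaac_eui64_address(prefix, mac):
    """Combine an advertised /64 prefix with a modified EUI-64 interface ID."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("stateless autoconfiguration expects a /64 prefix")
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit of the first byte and
    # insert ff:fe between the two halves of the MAC.
    eui64 = (
        bytes([mac_bytes[0] ^ 0x02])
        + mac_bytes[1:3]
        + b"\xff\xfe"
        + mac_bytes[3:]
    )
    interface_id = int.from_bytes(eui64, "big")
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)


# Documentation prefix and an invented MAC address.
print(slaac_eui64_address("2001:db8:0:1::/64", "00:11:22:33:44:55"))
```

The default gateway is not computed at all: the client simply uses the link-local source address of the router advertisement it received.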

VPN clients will most likely be using the IPv6-specified IPSec protocol. You should check with your VPN concentrator/firewall vendor for support for IPv6. It will also be necessary to confirm that the VPN software being used by your clients supports IPv6 as well. We have not yet implemented client-based dynamic VPN for IPv6, so I can’t speak to how easy this process is to implement. If you’ve made it work, let us know!

The Pitch

This process can still seem daunting at first, although it is fairly straightforward if you handle it in small manageable chunks. Semaphore received its /32 allocation from ARIN in 2003 and has been working with IPv6 ever since. If you are a company in the Pacific Northwest area and would like assistance with IPv6 adoption, please don’t hesitate to contact us! Our consulting engineers would be happy to help you with a readiness assessment and help you form a plan for full IPv6 adoption over time.

To contact us, please email