Month: July 2014

Something funny happened

On the way to building a Gravatar app, we noticed that taking pictures of ourselves to update our Gravatars was something we only wanted to do every month or so, but then we started taking selfies and sharing them with each other and that became a daily and very fun habit. So our Gravatar app morphed into a Selfies app, and it’s now ready for the world to play with! You can read more about the app here. We hope you become one of the first brave souls to try it out, and let us know what you think.

OpenStack – Take 2 – The Keystone Identity Service

Keystone is, more or less, the glue that ties OpenStack together.  It’s required for any of the individual services to be installed and function together.

Fortunately for us, keystone is basically just a REST API, so it’s very easy to make redundant and there isn’t a whole lot to it.

We’ll start by installing keystone and the python mysql client on all three controller nodes:

apt-get install keystone python-mysqldb

Once that’s done, we need a base configuration for keystone.  There are a lot of default options installed in the config file, but we really only care (for now) about giving it an admin token, and connecting it to our DB and Message queue.  Also, because we’re colocating our load balancers on the controller nodes (something which clearly wouldn’t be done in production), we’re going to shift the ports that keystone is binding to so the real ports are available to HAProxy.  (The default ports are being incremented by 10000 for this.)  Everything else will be left at its default value.

/etc/keystone/keystone.conf: (Note – Commented out default config is left in the file, but not included here)


[DEFAULT]
admin_token=ADMIN
# The port number which the admin service listens on.
admin_port=45357
# The port number which the public service listens on.
public_port=15000
# RabbitMQ HA cluster host:port pairs. (list value)
rabbit_hosts=10.1.1.20:5672,10.1.1.21:5672,10.1.1.22:5672
[database]
connection = mysql://keystone:openstack@10.1.1.10/keystone

We’ll then copy this configuration file to /etc/keystone/keystone.conf on each of the other controller nodes.  (There is no node specific information in our configuration, but if any explicit IP binds or similar host specific statements are made, obviously that needs to be changed from node to node)
Now that we have the config files in place, we can create the DB and DB user, then get the keystone service started and its DB tables populated.  (We’ll be doing all of this from the first controller node)

root@controller-0:~# mysql -u root -popenstack -h 10.1.1.10
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 122381
Server version: 5.5.38-MariaDB-1~trusty-wsrep-log mariadb.org binary distribution, wsrep_25.10.r3997

Copyright (c) 2000, 2014, Oracle, Monty Program Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> create database keystone;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> GRANT ALL ON keystone.* to keystone@'%' IDENTIFIED BY 'openstack';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> exit
Bye

root@controller-0:~# service keystone restart
keystone stop/waiting
keystone start/running, process 21313
root@controller-0:~# keystone-manage db_sync

Once the initial DB has been populated, we want to copy the SSL certificates from the first keystone node to the other two.  Copy the entire contents of /etc/keystone/ssl to the other two nodes, and make sure the directories and their files are chowned to keystone:keystone.
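For illustration, a minimal sketch of that copy step (run from controller-0, assuming key-based ssh to the ubuntu account and passwordless sudo on the other nodes – adjust to taste):

for node in 10.1.1.21 10.1.1.22; do
scp -r /etc/keystone/ssl ubuntu@${node}:/tmp/keystone-ssl
ssh ubuntu@${node} 'sudo rm -rf /etc/keystone/ssl && sudo mv /tmp/keystone-ssl /etc/keystone/ssl && sudo chown -R keystone:keystone /etc/keystone/ssl'
done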

We can then restart the keystone service on the 2nd and 3rd nodes with “service keystone restart” and we should have our keystone nodes listening on the custom ports and ready for HAProxy configuration.  Because this API is accessible from both the public and management interfaces, we’ll need to have HAProxy listen on multiple networks this time:

/etc/haproxy/haproxy.cfg – (Note: We’re adding this to the bottom of the file)


listen keystone_admin_private 10.1.1.10:35357
balance source
option tcpka
option httpchk
maxconn 10000
server controller-0 10.1.1.20:45357 check inter 2000 rise 2 fall 5
server controller-1 10.1.1.21:45357 check inter 2000 rise 2 fall 5
server controller-2 10.1.1.22:45357 check inter 2000 rise 2 fall 5
listen keystone_api_private 10.1.1.10:5000
balance source
option tcpka
option httpchk
maxconn 10000
server controller-0 10.1.1.20:15000 check inter 2000 rise 2 fall 5
server controller-1 10.1.1.21:15000 check inter 2000 rise 2 fall 5
server controller-2 10.1.1.22:15000 check inter 2000 rise 2 fall 5
listen keystone_admin_public 192.168.243.10:35357
balance source
option tcpka
option httpchk
maxconn 10000
server controller-0 192.168.243.11:45357 check inter 2000 rise 2 fall 5
server controller-1 192.168.243.12:45357 check inter 2000 rise 2 fall 5
server controller-2 192.168.243.13:45357 check inter 2000 rise 2 fall 5
listen keystone_api_public 192.168.243.10:5000
balance source
option tcpka
option httpchk
maxconn 10000
server controller-0 192.168.243.11:15000 check inter 2000 rise 2 fall 5
server controller-1 192.168.243.12:15000 check inter 2000 rise 2 fall 5
server controller-2 192.168.243.13:15000 check inter 2000 rise 2 fall 5

We then reload haproxy on all 3 nodes with “service haproxy reload” and then we can check the haproxy statistics page to determine whether the new keystone services are up and detected by the load balancer:

(Screenshot: HAProxy statistics page showing the new keystone services)
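Beyond the statistics page, a quick way to confirm the whole chain (VIP, HAProxy, and keystone itself) is to hit the public API port on the VIPs directly; each of these should return keystone’s version document:

curl http://10.1.1.10:5000/
curl http://192.168.243.10:5000/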

The last step for keystone is creating the users, services and endpoints that tie everything together.  There are numerous keystone deployment scripts available online, so we picked one and modified it for our uses.  One thing of note is that we need to differentiate between the public and admin URLs, and the internal URLs which run on our management network.

We’ve left the object and networking (quantum/neutron) services out for now, as we’ll be addressing those in a later article.  Since we know we’re going to be using Glance and Cinder as the image and volume services, we created those now.

A copy of our keystone deployment script can be found here:  keystone_deploy.sh
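To give a sense of what the script does (this is an illustrative excerpt, not a verbatim piece of keystone_deploy.sh), the identity service and its endpoint are created with the internal URL pointed at the management VIP and the public/admin URLs at the external VIP:

keystone service-create --name keystone --type identity --description "OpenStack Identity"
keystone endpoint-create --region regionOne --service-id <identity_service_id> \
--publicurl http://192.168.243.10:5000/v2.0 \
--internalurl http://10.1.1.10:5000/v2.0 \
--adminurl http://192.168.243.10:35357/v2.0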

We also need to add keystone credentials to the servers we’ll be issuing keystone and other OpenStack commands from.  We’ll place this file on all three controllers for now:

~/.openstack_credentials


export OS_USERNAME=admin
export OS_PASSWORD=openstack
export OS_TENANT_NAME=admin
export OS_AUTH_URL=http://192.168.243.10:35357/v2.0

We’ll load that into our environment now and on next login with the following commands:

source ~/.openstack_credentials
echo ". ~/.openstack_credentials" >> ~/.profile

Now we can confirm that our keystone users, services and endpoints are in place and ready to go:


root@controller-0:~# keystone user-list
+----------------------------------+--------+---------+-------------------+
|                id                |  name  | enabled |       email       |
+----------------------------------+--------+---------+-------------------+
| c8c5f82c2368445398ef75bd209dded1 | admin  |   True  |  admin@domain.com |
| 9b6461349428440b9008cc17bdf9aaf5 | cinder |   True  | cinder@domain.com |
| e6793ca5c3c94918be70010b58653428 | glance |   True  | glance@domain.com |
| d2c0dbfba9ae405d8f803df878afb505 |  nova  |   True  |  nova@domain.com  |
| a0dd7577399a49008a1e5aa35be56065 |  test  |   True  |  test@domain.com  |
+----------------------------------+--------+---------+-------------------+
root@controller-0:~# keystone service-list
+----------------------------------+----------+----------+---------------------------+
|                id                |   name   |   type   |        description        |
+----------------------------------+----------+----------+---------------------------+
| 42a44c9b7b374302af2d2b998376665e |  cinder  |  volume  |  OpenStack Volume Service |
| 163f251efd474459aaf6edb0e766e53d |   ec2    |   ec2    |   OpenStack EC2 service   |
| d734fbd95ec04ade9b680010511d716a |  glance  |  image   |  OpenStack Image Service  |
| c9d70e0f77ed42b1a8b96c51eadb6d20 | keystone | identity |     OpenStack Identity    |
| 8cf8f2b113054a7cb29203e3c31a3ef4 |   nova   | compute  | OpenStack Compute Service |
+----------------------------------+----------+----------+---------------------------+
root@controller-0:~# keystone endpoint-list
+----------------------------------+-----------+---------------------------------------------+----------------------------------------+---------------------------------------------+----------------------------------+
|                id                |   region  |                  publicurl                  |              internalurl               |                   adminurl                  |            service_id            |
+----------------------------------+-----------+---------------------------------------------+----------------------------------------+---------------------------------------------+----------------------------------+
| 127fa3f046c142c5a83122c68ac9ae79 | regionOne |          http://192.168.243.10:9292         |         http://10.1.1.10:9292          |          http://192.168.243.10:9292         | d734fbd95ec04ade9b680010511d716a |
| 23c84da682614d4db00a8fccba5550b7 | regionOne | http://192.168.243.10:8774/v2/$(tenant_id)s | http://10.1.1.10:8774/v2/$(tenant_id)s | http://192.168.243.10:8774/v2/$(tenant_id)s | 8cf8f2b113054a7cb29203e3c31a3ef4 |
| 29ce6f0c712b499d9537e861d40846d5 | regionOne |  http://192.168.243.10:8773/services/Cloud  |  http://10.1.1.10:8773/services/Cloud  |  http://192.168.243.10:8773/services/Admin  | 163f251efd474459aaf6edb0e766e53d |
| a4a8e4d6fb9548b4b59ef335581c907b | regionOne |       http://192.168.243.10:5000/v2.0       |       http://10.1.1.10:5000/v2.0       |       http://192.168.243.10:35357/v2.0      | c9d70e0f77ed42b1a8b96c51eadb6d20 |
| f7e663f609a440a9985e30efc1a2c7cf | regionOne | http://192.168.243.10:8776/v1/$(tenant_id)s | http://10.1.1.10:8776/v1/$(tenant_id)s | http://192.168.243.10:8776/v1/$(tenant_id)s | 42a44c9b7b374302af2d2b998376665e |
+----------------------------------+-----------+---------------------------------------------+----------------------------------------+---------------------------------------------+----------------------------------+

With keystone up and running, we’ll take a little detour to talk about storage in the next article.

OpenStack – Take 2 – HA Database and Message Queue

With our 3 controller servers running on a bare bones Ubuntu install, it’s time to start getting the services required for OpenStack up and running. Before any of the actual cloud services can be installed, we need a shared database and message queue. Because our design goal here is both redundancy and load balancing, we can’t just install a basic MySQL package.

A little research showed that there are 3 options for a MySQL compatible multi-master cluster: MySQL w/ a wsrep patch, MariaDB, or Percona XtraDB Cluster. In all cases Galera is used as the actual clustering mechanism. For ease of installation (and because it has solid Ubuntu package support), we decided to use MariaDB + Galera for the clustered database.

MariaDB Cluster install

MariaDB has a handy tool (here) that builds the apt commands for mirror selection, so we’ll use that to build a set of commands to install MariaDB and Galera. We’ll also install rsync at this time if not already installed, since we’ll be using that for Galera cluster sync.

apt-get install software-properties-common
apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xcbcb082a1bb943db
add-apt-repository 'deb http://ftp.osuosl.org/pub/mariadb/repo/5.5/ubuntu trusty main'
apt-get update
apt-get install mariadb-galera-server galera rsync

MariaDB will prompt for a root password during install; for ease of use we’re using “openstack” as the password for all service accounts.

The configuration is shared on all 3 nodes, so first we’ll build the configuration on the controller-0 node, then copy it over to the others.

/etc/mysql/conf.d/cluster.cnf


[mysqld]
query_cache_size=0
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
query_cache_type=0
bind-address=10.1.1.20

# Galera Provider Configuration
wsrep_provider=/usr/lib/galera/libgalera_smm.so

# Galera Cluster Configuration
wsrep_cluster_name="openstack"
wsrep_cluster_address="gcomm://10.1.1.20,10.1.1.21,10.1.1.22"

# Galera Synchronization Configuration
wsrep_sst_method=rsync

# Galera Node Configuration
wsrep_node_address="10.1.1.20"
wsrep_node_name="controller-0"

We’ll also comment out the “bind-address = 127.0.0.1” line from /etc/mysql/my.cnf so that the DB server will listen on our specified address instead of just localhost. (Note: While you can bind address to 0.0.0.0 for all interfaces, this will interfere with our colocated HAProxy unless mysql is moved to a port other than 3306. Since we want to keep mysql on the default port, we’ll just bind it to the internal IP for now)
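A quick way to do that on each node (a one-liner sketch – double check your my.cnf before running it):

sed -i 's/^bind-address/#bind-address/' /etc/mysql/my.cnf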

The last bit of prep needed is to copy the contents of /etc/mysql/debian.cnf from the first node to the other two, since Ubuntu/Debian uses a system maintenance account and the credentials are randomly generated at install time.

Now it’s time to bring up and test the cluster!

First, stop all nodes with “service mysql stop”. Double check that mysqld is actually stopped; the first time we did that it was still running on the 2nd and 3rd node for some reason.
Initialize the cluster on the primary node: “service mysql start --wsrep-new-cluster”
Once that’s up and running, start mysql on the next two nodes and wait for it to fully come up: “service mysql start”

Now, it’s time to test the database replication and multi-master config. We’ll do this by writing data to the first node, confirming it replicated onto the second, writing data to the second, confirming on the third, etc…

root@controller-0:~# mysql -u root -popenstack -e "CREATE DATABASE clustertest;"
root@controller-0:~# mysql -u root -popenstack -e "CREATE TABLE clustertest.table1 ( id INT NOT NULL AUTO_INCREMENT, col1 VARCHAR(128), col2 INT, col3 VARCHAR(50), PRIMARY KEY(id) );"
root@controller-0:~# mysql -u root -popenstack -e "INSERT INTO clustertest.table1 (col1, col2, col3) VALUES ('col1_value1',25,'col3_value1');"

root@controller-1:/etc/mysql/conf.d# mysql -u root -popenstack -e "SELECT * FROM clustertest.table1;"

+----+-------------+------+-------------+
| id | col1        | col2 | col3        |
+----+-------------+------+-------------+
|  1 | col1_value1 |   25 | col3_value1 |
+----+-------------+------+-------------+

root@controller-1:/etc/mysql/conf.d# mysql -u root -popenstack -e "INSERT INTO clustertest.table1 (col1,col2,col3) VALUES ('col1_value2',5000,'col3_value2');"

root@controller-2:/etc/mysql/conf.d# mysql -u root -popenstack -e "SELECT * FROM clustertest.table1;"

+----+-------------+------+-------------+
| id | col1        | col2 | col3        |
+----+-------------+------+-------------+
|  1 | col1_value1 |   25 | col3_value1 |
|  4 | col1_value2 | 5000 | col3_value2 |
+----+-------------+------+-------------+

root@controller-2:/etc/mysql/conf.d# mysql -u root -popenstack -e "INSERT INTO clustertest.table1 (col1,col2,col3) VALUES ('col1_value3',99999,'col3_value3');"

root@controller-0:/etc/mysql/conf.d# mysql -u root -popenstack -e "SELECT * FROM clustertest.table1;"

+----+-------------+-------+-------------+
| id | col1        | col2  | col3        |
+----+-------------+-------+-------------+
|  1 | col1_value1 |    25 | col3_value1 |
|  4 | col1_value2 |  5000 | col3_value2 |
|  5 | col1_value3 | 99999 | col3_value3 |
+----+-------------+-------+-------------+

HAProxy + MariaDB integration

Now that we have a clustered database, it’s time to tie it to the HAProxy so that requests can be load balanced, and the MySQL server taken offline if necessary. We don’t need MySQL to be externally accessible, so we’ll only be configuring this service on the management (10.1.1.0/24) network. Let’s start by adding an haproxy user on each of the 3 database servers. (This user should probably be restricted to the load balancer hosts in a more secure production environment) This user has no password and cannot actually connect to any databases, it simply allows a connection to the mysql server for a health check. (Note that you must use the “CREATE USER” DDL in order for changes to be replicated. If you insert directly into the mysql tables, the changes will NOT be replicated to the rest of the cluster)

root@controller-0:~# mysql -u root -popenstack
MariaDB [mysql]> CREATE USER 'haproxy'@'%';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> exit
Bye

Once done, we’ll add (on each node) the health check to each of the 3 HAProxy servers in /etc/haproxy/haproxy.cfg: (Configured to listen on the internal VIP presented by keepalived)

listen galera 10.1.1.10:3306
balance source
mode tcp
option tcpka
option mysql-check user haproxy
server controller-0 10.1.1.20:3306 check weight 1
server controller-1 10.1.1.21:3306 check weight 1
server controller-2 10.1.1.22:3306 check weight 1

(Note: If you forgot to tell MySQL to explicitly bind to the above IP addresses, haproxy will fail since MySQL is already bound to 3306. If you need multiple IP addresses, you can move MySQL to port 3307 and proxy to that instead)

We’ll reload the haproxy config with “service haproxy reload” and see if we can connect to MySQL on the VIP:

root@controller-0:/var/log# mysql -h 10.1.1.10 -u root -popenstack
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 226
Server version: 5.5.38-MariaDB-1~trusty-wsrep-log mariadb.org binary distribution, wsrep_25.10.r3997

Copyright (c) 2000, 2014, Oracle, Monty Program Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]>

We can also see our MySQL services being monitored by the load balancer statistics page:

(Screenshot: HAProxy statistics page showing the MySQL backends)

RabbitMQ Message Queue

The message queue of choice for most OpenStack installations appears to be RabbitMQ. Since it supports clustering natively and the OpenStack services will load balance across the message queue nodes without any additional proxy, this step goes fairly quickly.

Install the RabbitMQ server on all the controller nodes with “apt-get install ntp rabbitmq-server“. Once the install is completed, stop the rabbitmq service on all nodes (“service rabbitmq-server stop“) and confirm that the service isn’t running.

RabbitMQ wants to do hostname based clustering, and since we’re not running any DNS in this environment, we need to add the following lines to /etc/hosts:

10.1.1.20 controller-0
10.1.1.21 controller-1
10.1.1.22 controller-2

Additionally, these entries must match the actual hostnames of your servers. If your servers still have the default “ubuntu” hostname, clustering will fail.

In order for the servers to cluster, the Erlang cookie needs to be the same on all nodes. Copy the file /var/lib/rabbitmq/.erlang.cookie from the first node to the other two nodes. (Since we don’t have root login enabled, we used scp to the ubuntu account, then moved it into place locally. You’ll need to make sure the file has user and group “rabbitmq” after copying.) Now we can restart the service on all nodes with “service rabbitmq-server start”.
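A rough sketch of that copy, using the ubuntu account as described (repeat for the third node):

scp /var/lib/rabbitmq/.erlang.cookie ubuntu@10.1.1.21:/tmp/
ssh ubuntu@10.1.1.21 'sudo mv /tmp/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie && sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie && sudo chmod 400 /var/lib/rabbitmq/.erlang.cookie'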

With these steps in place, clustering rabbitmq is simple. We’ll start on the 2nd node. (controller-1):

root@controller-1:/var/lib/rabbitmq# service rabbitmq-server stop
* Stopping message broker rabbitmq-server [ OK ]
root@controller-1:/var/lib/rabbitmq# service rabbitmq-server start
* Starting message broker rabbitmq-server [ OK ]
root@controller-1:/var/lib/rabbitmq# cd
root@controller-1:~# rabbitmqctl stop_app
Stopping node ‘rabbit@controller-1’ …
…done.
root@controller-1:~# rabbitmqctl join_cluster rabbit@controller-0
Clustering node ‘rabbit@controller-1’ with ‘rabbit@controller-0’ …
…done.
root@controller-1:~# rabbitmqctl start_app
Starting node ‘rabbit@controller-1’ …
…done.
root@controller-1:~# rabbitmqctl cluster_status
Cluster status of node ‘rabbit@controller-1’ …
[{nodes,[{disc,[‘rabbit@controller-0′,’rabbit@controller-1’]}]},
{running_nodes,[‘rabbit@controller-0′,’rabbit@controller-1’]},
{partitions,[]}]
…done.

Now that the first two nodes are clustered, we’ll do the same with the 3rd:

root@controller-2:/var/lib/rabbitmq# service rabbitmq-server stop
* Stopping message broker rabbitmq-server [ OK ]
root@controller-2:/var/lib/rabbitmq# service rabbitmq-server start
* Starting message broker rabbitmq-server [ OK ]
root@controller-2:/var/lib/rabbitmq# rabbitmqctl stop_app
Stopping node ‘rabbit@controller-2’ …
…done.
root@controller-2:/var/lib/rabbitmq# rabbitmqctl join_cluster rabbit@controller-0
Clustering node ‘rabbit@controller-2’ with ‘rabbit@controller-0’ …
…done.
root@controller-2:/var/lib/rabbitmq# rabbitmqctl start_app
Starting node ‘rabbit@controller-2’ …
…done.
root@controller-2:/var/lib/rabbitmq# rabbitmqctl cluster_status
Cluster status of node ‘rabbit@controller-2’ …
[{nodes,[{disc,[‘rabbit@controller-0′,’rabbit@controller-1’,
‘rabbit@controller-2’]}]},
{running_nodes,[‘rabbit@controller-0′,’rabbit@controller-2’,
‘rabbit@controller-1’]},
{partitions,[]}]
…done.

And we’re done! With the clustered DB and MQ building blocks in place, we’re ready to start installing the actual OpenStack services on our controllers. It’s worth noting that RabbitMQ has a concept of a disc and a RAM node and nodes can be changed at any time. Nothing in any of the documentation for OpenStack suggests that a RAM node is required for performance, but presumably if there becomes a question of scale, it would be worth looking into.
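For reference, converting an existing disc node to a RAM node later is only a few commands, run on the node being converted:

rabbitmqctl stop_app
rabbitmqctl change_cluster_node_type ram
rabbitmqctl start_app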

OpenStack – Take 2 – High Availability for all services

A major part of our design decision for rebuilding our OpenStack from scratch was availability, closer to what one would see in production. This is one of the things Juju got right, installing most services using HAProxy so that clients could connect to any of the servers running the requested service. What it lacked was load balancers and external HA access.

Since we’re doing 3 controller nodes, and basically converging all services onto those 3 nodes, we’ll do that with the load balancers as well. We need both internal and externally accessible load balanced and redundant servers to take into account both customer APIs and internal/management access from the compute nodes.

Virtual IPs for Inside and Outside use

Since HAProxy doesn’t handle redundancy for itself, we’ll need a VIP that clients can point to for access. Keepalived handles that nicely with a VRRP-like Virtual IP that uses gratuitous arp for rapid failover. (While keepalived calls it VRRP, it is not compatible with other VRRP devices and cannot be joined to a VRRP group with, say, a Cisco or Juniper router) We’ll need both internal and external IPs and to ensure that both are capable of failing over. Technically, these don’t need to fail over together, since they function independently as far as HAProxy is concerned, so we don’t need to do interface or VRRP group tracking which greatly simplifies the configuration.

Our management DHCP range starts at 10.1.1.20, with IPs below that reserved for things like this. Since this is the main controller IP, we’ll assign it 10.1.1.10 for a nice round number. On the “external” network, we have 192.168.243.0/24 available, and we want to keep most of it for floating IPs. We’ll use 192.168.243.10-13 for the controllers’ real and virtual IPs on the outside network.

First we need to allow the virtual IP address to bind to the NICs:

echo “net.ipv4.ip_nonlocal_bind=1” >> /etc/sysctl.conf
sysctl -p

Then we’ll install the haproxy and keepalived package: “apt-get install keepalived haproxy”

We’ll need to create a keepalived config for each controller:

/etc/keepalived/keepalived.conf


global_defs {
router_id controller-0
}
vrrp_script haproxy {
script “killall -0 haproxy”
interval 2
weight 2
}
vrrp_instance 1 {
virtual_router_id 1
advert_int 1
priority 100
state MASTER
interface eth0
virtual_ipaddress {
10.1.1.10 dev eth0
}
track_script {
haproxy
}
}

vrrp_instance 2 {
virtual_router_id 2
advert_int 1
priority 100
state MASTER
interface eth1
virtual_ipaddress {
192.168.243.10 dev eth1
}
track_script {
haproxy
}
}

All that needs to be changed from one host to another is the router_id and possibly the interface names. Start up keepalived with “service keepalived restart” and you should see a Virtual IP available on the first node that is restarted.

root@controller-0:~# ip -4 addr show
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0: mtu 9000 qdisc pfifo_fast state UP group default qlen 1000
inet 10.1.1.20/24 brd 10.1.1.255 scope global eth0
valid_lft forever preferred_lft forever
inet 10.1.1.10/32 scope global eth0
valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
inet 192.168.243.11/24 brd 192.168.243.255 scope global eth1
valid_lft forever preferred_lft forever
inet 192.168.243.10/32 scope global eth1
valid_lft forever preferred_lft forever

Rebooting the first node should immediately cause the VIP /32 to move to one of the other servers. (Our recommendation is to have each node with a different priority so there is no ambiguity as to which node is the master, and to set the backups into initial state BACKUP)
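If you follow that recommendation, the matching stanza on a backup node only differs in priority and state; for example, on controller-1 the first instance would look something like this:

vrrp_instance 1 {
virtual_router_id 1
advert_int 1
priority 90
state BACKUP
interface eth0
virtual_ipaddress {
10.1.1.10 dev eth0
}
track_script {
haproxy
}
}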

HAProxy and Statistics

Now that we have a redundant VirtualIP, we need to get the load balancer working. We installed HAProxy in the previous step, and it nearly works out of the box. We’re going to add an externally facing statistics webpage to the default config however so we can see what it’s doing.

/etc/haproxy/haproxy.cfg


global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
user haproxy
group haproxy
daemon
stats socket /var/lib/haproxy/status
maxconn 4000

defaults
log global
mode http
option httplog
option dontlognull
contimeout 5000
clitimeout 50000
srvtimeout 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http

listen stats 192.168.243.11:80
mode http
stats enable
stats uri /stats
stats realm HAProxy Statistics
stats auth admin:openstack

For each node, we’ll want to change the listen IP to be the “outside” IP of that particular server. For now, no services are defined, we’ll get to that in the next step.

The last step here is to edit “/etc/default/haproxy” so it says “ENABLED=1”. Once that’s done, activate the proxy with “service haproxy restart” and you should be able to reach the proxy’s statistics pages on the addresses that were defined.
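A quick way to confirm each node’s statistics page is answering (using the credentials defined above):

curl -u admin:openstack http://192.168.243.11/stats
curl -u admin:openstack http://192.168.243.12/stats
curl -u admin:openstack http://192.168.243.13/stats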

(Screenshot: HAProxy statistics page)

To be continued – we’ll setup a clustered database and message queue for use with HAProxy in the next step.

OpenStack – Take 2 – Doing It The Hard Way

This is Part 2 of an ongoing series of testing AMD’s SeaMicro 15000 chassis with OpenStack. (Note: Part 1 is delayed while it gets vetted by AMD for technical accuracy)

In the first part, we configured the SeaMicro for a basic OpenStack deployment using MaaS and Juju for bootstrapping and orchestration. This works fine for the purposes of showing what OpenStack looks like and what it can do (quickly) on the SeaMicro hardware. That’s great and all, but Juju is only easily configurable within the context of the options provided with its specific service charms. Because it fully manages the configuration, any manual configuration added (for example, the console proxy in our previous example) will get wiped out if any Juju changes (to a relationship for example) are made.

For production purposes, there are other more powerful orchestration suites out there (Puppet, Chef, SaltStack, etc) but because they are highly configurable they also require a significantly larger amount of manual intervention and scripting. This makes sense, of course, since the reason Juju is as rapid and easy as it is is exactly the same reason that it is of questionable value in a production deployment. To that end, we’re going to deploy OpenStack on the SeaMicro chassis the hard way: from scratch.

The Architecture

If you’re going to do something, you should do it right. We decided to take a slightly different approach to the design of the stack than the Juju based reference architecture did. If we were creating a “cloud-in-a-box” for our own datacenter, we would be attempting to optimize for scalability, availability and performance. This means highly available control services capable of running one or more additional SeaMicro chassis, as well as optimizing the number of compute resources available within the chassis. While the bootstrap/install node isn’t required to be a blade on the chassis for this exercise, we’re going to leave it in place with the expectation that it would be used in future testing as an orchestration control node. Based on that, we have 63 server cards available to us. The Juju deployment used 9 of them, with most control services being non-redundant. (The only redundant services were the Cinder block services)

The Juju based architecture had the following layout for the 29 configured servers:

  • Machine 1 – Juju control node
  • Machine 2 – MySQL database node
  • Machine 3 – RabbitMQ AMQP node
  • Machine 4 – Keystone identity service, Glance image service, and Horizon OpenStack dashboard service
  • Machine 5 – Nova cloud controller
  • Machine 6 – Neutron network gateway
  • Machines 7-9 – Cinder block storage nodes
  • Machines 10-29 – Nova compute nodes

The goal of this experiment would be to end up with the following layout:

  • Machines 1-3 – Controller nodes (Keystone, Nova controller, MySQL, RabbitMQ and Neutron)
  • Machines 4-6 – Storage nodes (Cinder, Ceph, Swift, etc… Specific backends TBD)
  • Machines 7-63 – Nova compute nodes

As part of this re-deployment we’ll be running through each of the OpenStack control services, their ability to be made highly available, and what configuration is required to do so.

A world without MaaS

Since we’re eliminating the MaaS bootstrap server, we’ll need to replace the services it provides. NAT and routing are configured in the kernel still, so the cloud servers will still have the same internet access as before. The services we’ll need to replace are:

  • TFTP – using tftpd-hpa for serving the PXE boot images to the servers on initial install
  • DHCP – using isc-dhcpd for address assignment and TFTP server options
  • DNS – MaaS uses dnsmasq to cache/serve local DNS. We’ll just be replacing this with a DHCP option for the upstream DNS servers for simplicity’s sake

Configuring the bootstrap process

Ubuntu makes this easy, with a simple apt-get install tftpd-hpa. Because the services conflict, this step will uninstall MaaS and start the tftpd-hpa service.
On our MaaS server, isc-dhcp was already installed, so we just needed to create a /etc/dhcpd.conf file. Since we want to have “static” IP addresses, we’ll create fixed leases for every server rather than an actual DHCP pool.

First, we need all the MAC addresses of the down servers (everything except our MaaS server):

seasm15k01# show server summary | include /0 | include Intel | exclude up

This is easily converted into a DHCP config with a shell or perl script (or whatever your text-parsing language of choice is), giving us something that looks like the following:


subnet 10.1.1.0 netmask 255.255.255.0 {
filename "pxelinux.0";
option subnet-mask 255.255.255.0;
option broadcast-address 10.1.1.255;
option domain-name "local";
option routers 10.1.1.1;
option interface-mtu 9000; # Need this for Neutron GRE
}

host controller-0 {
hardware ethernet 00:22:99:ec:00:00;
fixed-address 10.1.1.20;
}

etc…
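As an example of that conversion, something along these lines will spit out the host blocks (a sketch – it assumes the MAC addresses have been saved one per line to maclist.txt, and the hostnames and IP offsets here are only illustrative):

#!/bin/bash
n=20
i=0
while read mac; do
cat >> /etc/dhcpd.conf <<EOF
host node-${i} {
hardware ethernet ${mac};
fixed-address 10.1.1.${n};
}
EOF
i=$((i+1))
n=$((n+1))
done < maclist.txt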

Restart the DHCP server process and we now have a functioning DHCP environment directing the servers to our TFTP server.

Fun with preseeds

The basic Ubuntu netboot image loads an interactive installer, which is great for when we configured the MaaS server, but nobody wants to manually enter information for 63 servers for installation. By passing some preseed information into the kernel, we can have it download and run the installer unattended; it just needs some hints as to what it should be doing.

This took a lot of trial and error, even just to get a good environment with nothing but the base system tools and ssh server. (Which is all we want for the controller nodes for now)

The PXE defaults config we settled on looks something like this:


default install
label install
menu label ^Install
menu default
kernel linux
append initrd=ubuntu-installer/amd64/initrd.gz console=ttyS0,9600n8 auto=true priority=critical interface=auto netcfg/dhcp_timeout=120 preseed/url=http://10.1.1.1/preseed-openstack.cfg -- quiet
ipappend 2

This tells the kernel to use the SeaMicro serial console (mandatory in this environment), interface eth0, to disable hardware-based interface renaming, and to fetch a preseed file hosted on the MaaS server for install.

The preseed file partitions the disk based on a standard single-partition layout (plus swap), creates a “ubuntu” user with a long useless password, disables root login, and copies an authorized keys file for ssh use into /home/ubuntu/.ssh, allowing ssh login post-install. Since we’re not using a fastpath installer like MaaS does, this takes a bit of time, but it’s hands off once the preseed file is created. Once we get to installing compute nodes later on, we’ll probably find a way to use the preseed file to script the installation and configuration of the OpenStack components on the compute nodes, but since the controller nodes will be installed manually (for this experiment), there isn’t any reason to add much beyond basic ssh access in the initial preseed.

One note: Ubuntu insists on overwriting the preseed target’s /etc/network/interfaces with what it uses to bootstrap the network. Because of udev, this may not be accurate and causes the server to come up without a network. A hacky solution that seems to work is to download an interfaces file, then chattr +i /target/etc/network/interfaces at the end of the preseed so the installer cannot overwrite it. Additionally, udev was exhibiting some strange behavior, renaming only eth0 and eth3 to the old ethX nomenclature on most servers, but leaving the other 5 interfaces as the newer pXpX style. This unfortunately seemed to be somewhat inconsistent, with some servers acknowledging the udev persistent net rules file to rename all interfaces to ethX, and others ignoring it. Since eth0 was renamed in all cases, we decided to ignore this issue for the time being since this isn’t intended to be a production preseed environment.
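Expressed as preseed directives, the workaround looks roughly like this (hypothetical – the URL is wherever you host the interfaces file, in our case the MaaS server):

d-i preseed/late_command string \
wget -O /target/etc/network/interfaces http://10.1.1.1/interfaces ; \
in-target chattr +i /etc/network/interfaces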

To be continued…. Once all these nodes install and boot.

Adventures in OpenStack with a SeaMicro 15000

The Chassis

Semaphore was recently approached by a friend of the company who was now working at AMD and happened to have one of their recently acquired SeaMicro 15000 high density compute chassis available for some short term testing. They offered to loan it to us for a bit in hopes we’d resell them with a particular focus on OpenStack as an enterprise private cloud since it requires some expertise to get up and running. Having never used OpenStack, and even having little experience with AWS and other cloud services, naturally we said, “Of course!”

A week or so later, AMD pulled up to the loading dock with 3 pallets of hardware. They’d brought a SeaMicro 15k chassis, and 2 large disk arrays (we decided to only set up one of the arrays given limited cooling in our lab area). A lot of heavy lifting later, we had both devices on the lab bench, powered on, and ready to start deployment.

(Photo: the SeaMicro 15000 chassis and disk array on the lab bench)

After power on, we got the following specs from the chassis and disk array:

  • 64 Server Cards, each with a single Intel Xeon E3-1265L V3 chip (4 cores, 8 threads), and 32GB of DDR3 RAM
  • 1 Storage Card with 8 480GB SSD drives in a RAID configuration
  • 2 MX (controller) cards, each with 2 10Gig Ethernet SFP+ ports
  • 60 3TB JBOD drives, attached in 2 paths to the chassis via an e-SATA connection

The slickest thing about the SeaMicro chassis is that the compute cards are essentially just a Haswell Northbridge (less a graphics controller). The Southbridge is replaced by a set of ASICs which communicate with the SeaMicro chassis for disk and network I/O configuration and presentation. Each server card has 8 network paths that present to the server as Intel E1000 NICs. The virtual server NICs are fully configurable from the SeaMicro chassis using an IOS-like command line, with full 802.1q VLAN tagging and trunking support (if desired). By default the chassis presents a single untagged VLAN to all the server NICs, as well as the external 10Gig ports.

Disk I/O is even better. Since we had a RAID storage card, we configured a single volume of around 3TB, and with just a few lines of configuration on the chassis we were able to split the RAID volume into 64 virtual volumes of 48GB and present each to one of the server cards as a root volume. These were presented as hot-plug SCSI devices, and could be dynamically moved from one server card to another via a couple quick config statements on the SeaMicro chassis. For the JBOD, we were able to assign disk ranges to lists of servers with a single command, feeding it a list of server cards and a number of drives, and the SM would automatically assign that number of disks (still hot-plug) to the server and attach them via the SeaMicro internal fabric and ASICs. Pretty cool stuff! (And invaluable during the OpenStack deployment. More on that later.)

On to OpenStack!

With the chassis powered on, root volumes assigned, and the JBOD volumes available to assign to whichever server made the most sense, we were ready to get going with OpenStack. First hurdle: there is zero removable media available to the chassis. This is pretty normal for setups like this, but unlike something like VMware, there isn’t any easy ability to mount an ISO for install. Fortunately installing a DHCP server is trivial on OSX, and it has a built-in TFTP server, so setting up a quick PXE boot server took just a few minutes to get the bootstrap node up. A nice feature of the SeaMicro chassis is that the power-on command for the individual servers allows a one-time PXE boot order change that will go away on the next power on, so you don’t need to mess with boot order in the BIOS at all. We installed Ubuntu 14.04 on one of the server nodes for bootstrapping and then started to look at what we needed to do next.

We’d received a SeaMicro/OpenStack Reference Architecture document which AMD made available to us, found a blog article on how a group from Canonical configured 10 SeaMicro chassis for OpenStack in about 6 hours, and found an OpenStack Rapid Deployment Guide for Ubuntu. This seemed like enough knowledge to be dangerous when starting from absolutely nothing, so we dove right in.

Bootstrapping the metal

The reference/rapid deployment architectures all appeared to use MaaS (Metal as a Service) for bootstrapping the individual server blades. MaaS also has a plugin for the SeaMicro chassis to further speed deployment, so once the MaaS admin page was up and running, we were off to the races:

maas maas node-group probe-and-enlist-hardware model=seamicro15k mac= username=admin password=seamicro power_control=restapi2

A few seconds later, MaaS was populated with 64 nodes, each with 8 displayed MAC addresses. Not too shabby. We deleted the bootstrap node from the MaaS node list since it was statically configured, then told MaaS to commission the other 63 nodes for automation. Using the SeaMicro REST API, MaaS powered on each server using PXE boot, ran a quick smoke test to confirm reachability, then powered it back off and listed it as ready for use. Easy as pie, pretty impressive compared to the headaches of booting headless/diskless consoles of old. (I’m looking at you SunOS)

(Screenshot: the MaaS node listing with the enlisted SeaMicro servers)

All the Ubuntu + OpenStack reference architectures use a service orchestration tool called Juju. It’s based on a set of scripts called “charms” to deploy an individual service to a machine, configure it, then add relationship hooks to other services. (e.g., tell an API service that it’s going to be using MySQL as its shared backend database)

Juju requires its own server (machine “0”) to run the orchestration tools and deploy services from, so after a quick bootstrap pointing Juju in the direction of the MaaS API, I had a bootstrap server running, powered on and automatically provisioned by MaaS. Juju also deploys the deploying user’s ssh public key to the new server, for use with its internal “juju ssh” command, which is quite handy. (I’d later come to learn that password auth is basically nonexistent in cloud architecture, at least on initial deployment. Works for us.)
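For anyone following along, pointing Juju at MaaS is just a small stanza in ~/.juju/environments.yaml followed by a bootstrap (values here are illustrative, and the API key comes from the MaaS preferences page):

environments:
  maas:
    type: maas
    maas-server: 'http://<maas-server>/MAAS/'
    maas-oauth: '<MAAS-API-KEY>'
    default-series: trusty

juju bootstrap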

Now it was time to start getting OpenStack deployed. The AMD-provided reference architecture didn’t quite match the Ubuntu one, which didn’t at all match what I was seeing in the Canonical deployment test, so I had to make some decisions. By default when you deploy a new Juju service, it instantiates a new machine. This seems very wasteful on a deployment of this size, so it made sense to colocate some of the services. The initial deployment looked a bit like this:

juju deploy mysql
juju deploy --config=openstack.cfg keystone
juju deploy --config=openstack.cfg nova-cloud-controller
juju deploy nova-compute
juju deploy glance --to 2
juju deploy rabbitmq-server --to 1
juju deploy openstack-dashboard --to 2

Networking headaches

Once completed, this (sorta) left us with 4 new servers with the basic OpenStack services running. Keystone, Glance and Horizon (dashboard) were all colocated on one server, and MySQL and RabbitMQ on another. The Nova controller and first Nova Compute server were standalone. (Both the Ubuntu and AMD reference architectures used this basic layout) After a lengthy series of “add-relation” commands, the services were associated and I had an apparently working OpenStack cloud with a functional dashboard. A few clicks later and an instance was spawned running Ubuntu 14.04 Server, success! Kinda… It didn’t appear to have any networking. The reference config from AMD had the “quantum-gateway” charm installed (the charm for the newly named Neutron networking service), but the config file supplied used a Flat DHCP Networking service through Nova, which didn’t appear to actually be working out of the box. Most of the documentation used Neutron rather than Nova-Network anyways, which seemed like a better solution for what we wanted to do. No problem, just change the nova-cloud-controller charm config to use Neutron instead, right?

Wrong.

The network configuration is baked into the configs at install time by Juju. While some config can be changed post-deploy, that wasn’t one of them. This was the first (of many) times that the “juju destroy-environment” command came in handy as a reset-to-zero button. After a few false starts, we had the above cloud config up and running again, this time with quantum-gateway (Why the charm hasn’t been renamed to Neutron, we don’t know) correctly deployed and configured to work with the Nova cloud controller. This also added the missing “Networks” option to Horizon, allowing us to automatically create public and private subnets, as well as private tenant routers for L3 services between the networks. An instance was brought up again, and this time it could ping things! A floating external IP was easily associated with the instance, and with a few security changes we could ping the instance from the outside world. Success! Since our keypair was automatically installed to the instance on create, we opened an ssh session to the instance and… got absolutely nothing.

Neutron, as deployed by Juju by default uses its ML2 (Modular Layer 2) plugin to allow for configurable tenant network backends. By default, it uses GRE tunnels between compute nodes to tie the tenant networks together across the OpenStack-internal management network. This is great, but because GRE is an encapsulation protocol, it has overhead and reduces your effective MTU. Our attempts to run non-ICMP traffic were running into MTU issues (as is common with GRE) and failing. The quantum-gateway Juju charm does have a knob to reduce the tenant network MTU, but since the SeaMicro supports jumbo frames across its fabric, we added a DHCP Option 26 to the MaaS server to increase the management network MTU to 9000 on server start time, and rebooted the whole cluster.
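In dhcpd terms that is a single extra option in the subnet block (DHCP option 26 is interface-mtu):

option interface-mtu 9000;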

SeaMicro Storage quirks

At this point we had a single working instance with full networking available on our one compute node. There were two things left to do before the cloud could really be considered “working”: scaling out compute capacity and adding persistent volume storage.

To this point, the instances were using temporary storage on the compute card that would be destroyed when the instance was terminated. This works for most instances, but there was a slight problem: our compute nodes only had 48GB of attached storage, and some of that was taken up by the hypervisor OS. That doesn’t leave a lot for instance storage. Since we had 60 3TB drives attached to the SeaMicro, we decided to give each compute node one disk, giving it 3TB for local non-persistent instance volumes. The initial plan was to add a total of 20 compute nodes, which surely would be as simple as typing “juju deploy -n 20 nova-compute”, right? This is where the biggest headache of using Juju/MaaS/SeaMicro came into play. Juju is machine agnostic; it grabs a random machine from MaaS based on constraints about RAM and CPU cores (if given; all our machines are identical, so there were no constraints). Juju tracks hostnames, which are derived from MaaS. MaaS assigns hostnames to devices as a random 5-character string in the local domain (.master in this case), and tracks the server MAC addresses. The SeaMicro chassis is only aware of the MAC addresses of the servers. On top of this, we needed to have the disk available to the compute node prior to deploying nova-compute onto it.

So, how to add the disk to the compute nodes? Well, first we needed to know which machines we’re talking about. Juju can add an empty container machine, although it doesn’t have a “-n” flag, so type “juju add-machine” 20 times and wait for them to boot. While waiting, get the hostnames of the last 20 machines from Juju. Then go over to MaaS’s web interface (which can only show 50 machines at a time), and search for the random 5-digit string for each of the 20 servers, and make note of the MAC address. Then go over to the SeaMicro command line and issue “show server summary | include ” to get the server number in the SeaMicro chassis. It’s annoyingly time consuming, and if you end up destroying and rebuilding the Juju environment or even the compute nodes, you have to do it all over again, since MaaS randomly assigns the servers to Juju. Ugh.

As a side note, since this was a fairly substantial portion of the time spent getting the initial install up and running, we reached out to AMD about these issues. They’re already on the problem, and are working with Canonical to further integrate the SeaMicro’s REST API with MaaS so the MaaS assigned machine names match the server IDs in the chassis itself, as well as presenting the presence of disk resources to MaaS so they can be used as Juju constraints when assigning new nodes for a particular function. For example, when creating the storage nodes, Juju could be told to only pick a machine with attached SSD resources for assignment. These two changes would significantly streamline the provisioning process, as well as making it much easier to determine which compute cards in the chassis were being used by Juju rather than having to cross-reference them by MAC address in MaaS.

Fortunately, once the server list was generated, attaching the storage on the SeaMicro was easy: “storage assign-range 2/0,4/0,7/0… 1 disk external-disks” and the chassis would automatically assign and attach one of the JBOD drives to each of the listed servers as VDisk 1 (the 2nd disk attached to the server). Since a keypair is already installed on the Juju server, a little shell scripting made it fairly easy to automatically log in to each of the empty nodes and format and mount the newly attached disk on the instances path for deployment (sketched below). Deployment then works fairly automatically: “juju add-unit nova-compute --to 10” and so on for the 20 new machines. After spawning a few test instances, the cloud was left looking something like this:

(Diagram: the cloud layout at this point)
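That disk-prep scripting looked roughly like this (the device name and instances path are assumptions about our layout, and the machine list would be filtered down to just the 20 new nodes):

#!/bin/bash
for host in `juju status | grep dns-name | awk '{print $2}'`; do
ssh ubuntu@${host} 'sudo mkfs.ext4 -q /dev/sdb && sudo mkdir -p /var/lib/nova/instances && echo "/dev/sdb /var/lib/nova/instances ext4 defaults 0 2" | sudo tee -a /etc/fstab && sudo mount /var/lib/nova/instances'
done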

Storage Services

At this point what we were really missing was persistent volume storage so instances didn’t need to lose their data when terminated. OpenStack offers a few ways to do this. The basic OpenStack volume service is “Cinder”, which uses pluggable backends, LVM2 volume groups being the default. Since this exercise is a basic OpenStack proof of concept, we didn’t utilize any of the more advanced storage mechanisms available to OpenStack to start with, choosing to use the SeaMicro to assign 6 3TB JBOD drives to each Cinder node in an LVM configuration across 3 nodes for a total of ~54TB of non-redundant persistent volume storage. Cinder + LVM has some significant issues in terms of redundancy, but it was easy enough to set up. We created some mid-sized volumes from our Ubuntu Server image, started some instances from the volumes, then tore down the instances and re-created them on different hypervisors. As expected, all our data was still in there. Performance wasn’t particularly bad, although we didn’t do much in the way of load testing on it. For file I/O heavy loads and redundancy, there are certainly some better ways to approach storage that we’ll explore in another writeup.
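Whether done by hand or via the charm’s configuration, the LVM end state on each Cinder node is simply the six JBOD drives in one volume group for Cinder to carve volumes out of; something like this (device names are assumptions):

pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
vgcreate cinder-volumes /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg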

At this point, we haven’t implemented any object storage. This can be done with the Ceph distributed filesystem, or Swift which is OpenStack’s version of S3 object storage. Since we’re using local storage for Glance images and didn’t have a use case for object storage during this proof of concept, we decided to skip this step for the time being until we do a more thorough look at OpenStack’s various storage options and subsystems.

Console Juju

OpenStack offers a couple flavors of VNC console services, one directly proxied through the cloud controller (novnc), and a java viewer (xvpvnc). These are fairly straightforward to set up, involving a console proxy service running on the cloud controller (or any other externally accessible server), and a VNC service running on each compute node. Configuration is just a few lines in /etc/nova/nova.conf on each of these servers. But there’s a caveat here: there isn’t a Juju charm or configuration option for the VNC console services. Because the console services have their configuration installed in a file managed by Juju, any relationship change affecting the nova-cloud-controller or nova-compute service will cause the console configuration to be wiped out on every node. Additionally, the console config on the compute nodes needs to be configured (and the compute service restarted) BEFORE any instances are created on that node. If any instances exist beforehand, they won’t have console access, only new instances will. While this isn’t the end of the world, especially since one assumes the base relationships in Juju wouldn’t be changing much, it does highlight a potential problem with Juju in that if you’re adding custom config that isn’t deployed with the charm, you run the risk of losing it. While we haven’t looked at how difficult custom charms are to write yet, this clearly could be a problem in other areas as well, for example using iSCSI as a cinder and/or ceph backend, using something other than the built-in network backend for Neutron, etc. While there will always be a tradeoff when using service orchestration tools, this does seem like a significant one, since being able to add custom config segments to a managed config file is fairly important.
It seems unlikely to us that large production OpenStack clouds are deployed in this manner. The potential to wipe out large amounts of configuration unexpectedly (or worse, have inconsistent configuration where newer compute units have different configs than older ones) is significant.

(Note: The below scripts will auto-deploy the VNC console configuration to all compute nodes, inserting it into /etc/nova/nova.conf immediately after the rabbitmq config)

deploy_console.sh – Run from MaaS node
#!/bin/bash
for i in `juju status nova-compute | grep public-address | awk '{print $2}'`; do
scp setup_compute_console.sh ubuntu@$i:
ssh ubuntu@$i 'sudo /bin/bash /home/ubuntu/setup_compute_console.sh'
done;

setup_compute_console.sh – Pushed to and run on each compute node by the script above
#!/bin/bash
. /root/.profile
IPADDR=`ifconfig br0 | grep "inet addr:" | cut -d: -f2 | awk '{print $1}'`
echo $IPADDR
grep -q "vnc_enabled = true" /etc/nova/nova.conf
isvnc=$?
if [ "$isvnc" == "1" ]; then
# Get the line number of the RabbitMQ config so the VNC settings are inserted right after it
rline=`grep -n -m 1 rabbit_host /etc/nova/nova.conf | cut -f1 -d:`
((rline++))
echo ${rline}
sed -i "${rline} i\\
\\
vnc_enabled = true\\
vncserver_listen = 0.0.0.0\\
vncserver_proxyclient_address = ${IPADDR}\\
novncproxy_base_url=http://192.168.243.7:6080/vnc_auto.html\\
xvpvncproxy_base_url=http://192.168.243.7:6081/console\\
" /etc/nova/nova.conf
apt-get install novnc -y
initctl restart nova-compute
fi

Parting thoughts

This article is long enough already, but the initial impression is that OpenStack is complicated, but not as bad as it looks. This is obviously aided by rapid deployment tools, but once the architecture and how the services interact makes sense, most of the mystery is gone. Additionally, if you want a lot of compute resources in a small (and power efficient) footprint, the SeaMicro 15000 is an incredible solution. Juju/MaaS have some issues in terms of ease of use with the SeaMicro, but at least some of them are already being addressed by AMD/Canonical.

Since our proof of concept was basically done, we had the option to go a couple different directions here, the most obvious being an exploration of more advanced, efficient and redundant OpenStack storage. To do that, we’d need to tear down the Juju based stack and go with something more flexible. Since this isn’t production, there’s no better way to do that than to just install everything from scratch, so stay tuned for that in upcoming articles.