Open Source

Using pymodbus in the TornadoWeb IO loop.

This is more a quick dump of some proof-of-concept code.  We’re in the process of writing communications drivers for an energy management system, many of which need to communicate with devices like Modbus energy meters.

Traditionally I’ve just used the excellent pymodbus library with its synchronous interface for batch-processing scripts, but this time I need real-time and I need to do things asynchronously.  I can either run the synchronous client in a thread, or, use the Twisted interface.

We’re actually using Tornado for our core library, and thankfully there’s an adaptor module to allow you to use Twisted applications.  But how do you do it?  Twisted code requires quite a bit of getting used to, and I’ve still not got my head around it.  I haven’t got my head fully around Tornado either.

So how does one combine these?

The following code pulls out the first couple of registers out of a CET PMC330A energy meter that’s monitoring a few circuits in our office. It is a stripped down copy of this script.

#!/usr/bin/env python
'''
Pymodbus Asynchronous Client Examples -- using Tornado
--------------------------------------------------------------------------

The following is an example of how to use the asynchronous modbus
client implementation from pymodbus.
'''
#---------------------------------------------------------------------------# 
# import needed libraries
#---------------------------------------------------------------------------# 
import tornado
import tornado.platform.twisted
tornado.platform.twisted.install()
from twisted.internet import reactor, protocol
from pymodbus.constants import Defaults

#---------------------------------------------------------------------------# 
# choose the requested modbus protocol
#---------------------------------------------------------------------------# 
from pymodbus.client.async import ModbusClientProtocol
#from pymodbus.client.async import ModbusUdpClientProtocol

#---------------------------------------------------------------------------# 
# configure the client logging
#---------------------------------------------------------------------------# 
import logging
logging.basicConfig()
log = logging.getLogger()
log.setLevel(logging.DEBUG)

#---------------------------------------------------------------------------# 
# example requests
#---------------------------------------------------------------------------# 
# simply call the methods that you would like to use. An example session
# is displayed below along with some assert checks. Note that unlike the
# synchronous version of the client, the asynchronous version returns
# deferreds which can be thought of as a handle to the callback to send
# the result of the operation.  We are handling the result using the
# deferred assert helper(dassert).
#---------------------------------------------------------------------------# 
def beginAsynchronousTest(client):
    io_loop = tornado.ioloop.IOLoop.current()

    def _dump(result):
        logging.info('Register values: %s', result.registers)
    def _err(result):
        logging.error('Error: %s', result)

    rq = client.read_holding_registers(0, 4, unit=1)
    rq.addCallback(_dump)
    rq.addErrback(_err)

    #-----------------------------------------------------------------------# 
    # close the client at some time later
    #-----------------------------------------------------------------------# 
    io_loop.add_timeout(io_loop.time() + 1, client.transport.loseConnection)
    io_loop.add_timeout(io_loop.time() + 2, io_loop.stop)

#---------------------------------------------------------------------------# 
# choose the client you want
#---------------------------------------------------------------------------# 
# make sure to start an implementation to hit against. For this
# you can use an existing device, the reference implementation in the tools
# directory, or start a pymodbus server.
#---------------------------------------------------------------------------# 
defer = protocol.ClientCreator(reactor, ModbusClientProtocol
        ).connectTCP("10.20.30.40", Defaults.Port)
defer.addCallback(beginAsynchronousTest)
tornado.ioloop.IOLoop.current().start()

Qt: Creating QSharedPointers from “this”

Hi all,

This is more a note to self for future reference.  Qt has a nice handy reference counting memory management system by means of QSharedPointer and QWeakPointer.  The system is apparently thread-safe and seems to be totally transparent.

One gotcha though, is two QSharedPointer objects cannot share the same pointer unless one is cloned from the other (either directly or via QWeakPointer).  The other is that you must leave deletion of the object to QSharedPointer.  You’ve given it your precious pointer, it has adopted it and while you may call the object, it is no longer yours, so don’t go deleting it.

So you create an object, you want to pass a reference to yourself to some other object.  How?  Like this?

QSharedPointer<MyClass> MyClass::ref() {
    return QSharedPointer<MyClass>(this); /* NO! */
}

No, not like that! That will create QSharedPointer instances left right and centre. Not what you want to do at all. What you need to do, is create the initial reference, but then store a weak reference to it. Then all future calls, you simply call the toStrongRef method of the weak reference to get a QSharedPointer that’s linked to the first one you handed out.

Then, having done this, when you create your new object, you create it with the new keyword as normal, take a QSharedPointer reference to it, then forget all about the original pointer! You can get it back by calling the data method of the pointer object.

To make it simple, here’s a base class you can inherit to do this for you.

    #include <QWeakPointer>
    #include <QSharedPointer>

    /*!
     * Self-Reference helper.  This allows for objects to maintain
     * references to "this" via the QSharedPointer reference-counting
     * smart pointers.
     */
    template<typename T>
    class SelfRef {
        public:
            /*!
             * Get a strong reference to this object.
             */
            QSharedPointer<T>    ref()
            {
                QSharedPointer<T> this_ref(this->this_weak);
                if (this_ref.isNull()) {
                    this_ref = QSharedPointer<T>((T*)this);
                    this->this_weak = this_ref.toWeakRef();
                }
                return this_ref;
            }

            /*!
             * Get a weak reference to this object.
             */
            QWeakPointer<T>        weakRef() const
            {
                return this->this_weak;
            }
        private:
            /*! A weak reference to this object */
            QWeakPointer<T>        this_weak;
    };

Example usage:

#include <iostream>
#include <stdexcept>
#include "SelfRef.h"

class Test : public SelfRef<Test> {
        public:
                Test()
                {
                        std::cout << __func__ << std::endl;
                        this->freed = false;
                }
                ~Test()
                {
                        std::cout << __func__ << std::endl;
                        this->freed = true;
                }

                void test() {
                        if (this->freed)
                                throw std::runtime_error("Already freed!");
                        std::cout
                                << "Test object is at "
                                << (void*)this
                                << std::endl;
                }

                bool                    freed;
                QSharedPointer<Test>    another;
};

int main(void) {
        Test* a = new Test();
        if (a != NULL) {
                QSharedPointer<Test> ref1 = a->ref();
                if (!ref1.isNull()) {
                        QSharedPointer<Test> ref2 = a->ref();
                        ref2->test();
                }
                ref1->test();
        }
        a->test();
        return 0;
}

Note that the line before the return is a deliberate use after free bug to prove the pointer really was freed.  Also note that the idea of setting a boolean flag to indicate the constructor has been called only works here because nothing happens between that use after free attempt and the destructor being called.  Don’t rely on this to see if your object is being called after destruction.  This is what the output session from gdb looks like:

RC=0 stuartl@rikishi /tmp/qtsp $ make CXXFLAGS=-g
g++ -c -g -I/usr/share/qt4/mkspecs/linux-g++ -I. -I/usr/include/qt4/QtCore -I/usr/include/qt4/QtGui -I/usr/include/qt4 -I. -I. -o test.o test.cpp
g++ -Wl,-O1 -o qtsp test.o    -L/usr/lib64/qt4 -lQtGui -L/usr/lib64 -L/usr/lib64/qt4 -L/usr/X11R6/lib -lQtCore -lgthread-2.0 -lglib-2.0 -lpthread 
RC=0 stuartl@rikishi /tmp/qtsp $ gdb ./qtsp 
GNU gdb (Gentoo 7.7.1 p1) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./qtsp...done.
(gdb) r
Starting program: /tmp/qtsp/qtsp 
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Test
Test object is at 0x555555759c90
Test object is at 0x555555759c90
~Test
terminate called after throwing an instance of 'std::runtime_error'
  what():  Already freed!

Program received signal SIGABRT, Aborted.
0x00007ffff5820775 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff5820775 in raise () from /lib64/libc.so.6
#1  0x00007ffff5821bf8 in abort () from /lib64/libc.so.6
#2  0x00007ffff610cd75 in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#3  0x00007ffff6109ec8 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#4  0x00007ffff6109f15 in std::terminate() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#5  0x00007ffff610a2e9 in __cxa_throw () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#6  0x0000555555555cea in Test::test (this=0x555555759c90) at test.cpp:20
#7  0x0000555555555315 in main () at test.cpp:41
(gdb) up
#1  0x00007ffff5821bf8 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00007ffff610cd75 in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
(gdb) up
#3  0x00007ffff6109ec8 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
(gdb) up
#4  0x00007ffff6109f15 in std::terminate() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
(gdb) up
#5  0x00007ffff610a2e9 in __cxa_throw () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
(gdb) up
#6  0x0000555555555cea in Test::test (this=0x555555759c90) at test.cpp:20
20                                      throw std::runtime_error("Already freed!");
(gdb) up
#7  0x0000555555555315 in main () at test.cpp:41
41              a->test();
(gdb) quit
A debugging session is active.

        Inferior 1 [process 17906] will be killed.

Quit anyway? (y or n) y

You’ll notice it fails right on that second last line because the last QSharedPointer went out of scope before this.  This is why you forget all about the pointer once you create the first QSharedPointer.

To remove the temptation to use the pointer directly, you can make all your constructors protected (or private) and use a factory that returns a QSharedPointer to your new object.

A useful macro for doing this:

/*!
 * Create an instance of ClassName with the given arguments
 * and immediately return a reference to it.
 *
 * @returns	QSharedPointer<ClassName> object
 */
#define newRef(ClassName, args ...)	\
	((new ClassName(args))->ref().dynamicCast<ClassName>())

Introduction to Python’s asyncio

Just recently I’ve been looking into asynchronous programming.

Previously I had an aversion to asynchronous code due to the ugly twisted web of callback functions that it can turn into. However, after finding that having a large number of threads blocking on locks and semaphores still manages to thrash a machine, I’ve come to the conclusion that I should put aside my feelings and try it anyway.

Our codebase is written in Python 2.7, sadly, not new enough to have asyncio. However we do plan to eventually move to Python 3.x when things are a bit more stable in the Debian/Ubuntu department (Ubuntu 12.04 didn’t support it and there are a few sites that still run it, one or two still run 10.04).

That said, there’s thankfully a port of what became asyncio in the form of Trollius.

Reading through the examples though still had me lost and the documentation is not exactly extensive. In particular, coroutines and yielding. The yield operator is not new, it’s been in Python for some time, but until now I never really understood it or how it was useful in co-operative programming.

Thankfully, Sahand Saba has written a guide on how this all works:
http://sahandsaba.com/understanding-asyncio-node-js-python-3-4.html

I might put some more notes up as I learn more, but that guide explained a lot of the fundamentals behind a lot of event loop frameworks including asyncio.

ShellShock, HeartBleed and faith in Open Source

Well, it’s been a busy year so far for security vulnerabilities in open-source projects.  Not that those have been the only two bugs, they’re just two high-profile ones that are getting a lot of media attention.

Now, a number of us do take sheer delight in pointing and laughing when one of the big boys, whether they be based in Redmond or California, makes a security balls-up on a big scale.  After all, people pay big dollars to use some of that software, and many are dependent on it for their livelihoods.

The question does get raised though, what do you trust more?  A piece of software whose code is a complete secret, or the a piece of software anyone can audit?  Some argue the former, because anyone can find the holes in the latter and exploit them.  Some argue the latter, since anyone can find the holes and fix them.  Not being able to see the code doesn’t guarantee a lack of security issues however, and these last two headline-making bugs is definitely evidence that having the code isn’t a guarantee to a bug-free utopia.

There is no guarantee either way.

I’ve seen both open-source systems and high-end commercial systems both perform well and I’ve seen both make a dismal failure.  Bad code is bad code, no matter what the license, and even having the source available doesn’t mean you can fix it as first one must be able to understand what its intent is.  Information Technology in particular seems to attract the technologically inept but socially capable types that are able to talk their way into nearly any position, and so you wind up with the monstrosities that you might see on The Daily WTF.  These same people lurk amongst open-source circles too, and there are those who just make an honest mistake.  Security is hard, and it can be easy to overlook a possible hole.

I run Gentoo here, have done so now since 2004 (damn, 10 years already, but I digress…).  I’ve been building my own stage 3 tarballs from scratch since 2010.  July 2010 I bought my current desktop, a 6-core AMD Phenom machine, and so combined with the 512Kbps ADSL I had at the time, it was faster for me to compile stage 3 tarballs for the various systems (i386, AMD64 and about 6 different MIPS builds) than to download the sources.  If I wanted an up-to-date stage 3, I just took my last build, ran it through Gentoo Catalyst, and out came a freshly built tarball.

I still obtain my operating systems that way.  Even though I’ve upgraded the ADSL, I still use the same scripts that used to produce the official Gentoo/MIPS media.

This means I could audit every piece of software that forms my core system.  I have the source code there, all of it.  Not many Linux users have this, most have it at arms reach (i.e. an apt-get source ${PACKAGE} away), or at worst, a polite email/letter to their supplier (e.g. Netcomm will supply sources for their routers for a ~AU$10 fee), however I already have it.

So did I do any audits?  Have I done any audits?  No.  Ultimately I just blindly trust what comes down the wire, and to some, that is arguably no better than just blindly trusting what Apple and Microsoft produce.

Those who say that, do have a point.  I didn’t pick up on HeartBleed, nor on ShellShock, and I probably haven’t spotted what will become the next headline-grabbing bug.  There’s a lot of source code that goes into a GNU/Linux system, and if I were to sit there and audit it, myself, it’d take me a lifetime.  It’d cost me a fortune to pay a team to analyse it.

However, I at least have the choice of auditing parts of it.  I’ll never be able to audit the copies of Microsoft Windows, or the one copy of Apple MacOS X I have.  For those, I’m reliant on the upstream vendors to audit, test and patch their code, I cannot do it myself.

For the open-source software though, it’s ultimately my choice.  I can do it myself, I can also pay someone to do it, I’ve simply chosen not to at this time.  This is an important distinction that the anti-open-source camp seem to forget.

As for the quality factor: well I’ve spent more time arguing with some piece of proprietary software and having trouble getting it to do something I need it to do, or fixing up some cock up caused by a bug in the said software.  One option, I spend hours arguing with it to make it work, and have to pay good money for the privilege.  The other, they money stays in my pocket, and in theory I can re-build it to make it work if needed.  One will place arbitrary restrictions on how I use the software as an end user, forcing me to spend money on more expensive licenses, the other will happily let me keep pushing it until I hit my system’s technical limits.

Neither offer me any kind of warranty regarding to losses I might suffer as a result of their software (I’m sorry, but US$5.00 is as good as worthless), so the money might as well stay in my pocket while I learn something about the software I use.

I remain in control of my destiny that way, and that is the way I’d like to keep it.

Ceph and FlashCache in OpenNebula

Well, lately I’ve been doing some development work with OpenNebula.

We’ve recently deployed a 3-node Ceph cluster which we intend to use as our back-end storage for numerous things: among them being VM storage.  Initially I thought the throughput would be “good enough”, 3 hosts each with gigabit links supplying VM hosts with gigabit backhaul links.

It’d be comparable to typical HDDs, or so I thought.  What I didn’t count on in particular was the random-read latency introduced by round-tripping over the network and overheads.  When I tried Ceph with just libvirt, things weren’t too bad, I was close to saturating my 1Gbps link.  Put two VMs on and again, things hummed along.  Not blistering fast mind you but reasonable.

I got OpenNebula talking to it easy enough.  We’re running the stable version: 4.4.  There are a few things I learned about the way OpenNebula uses Ceph:

  • OpenNebula uses v1-format RBDs (the Ceph default actually)
  • Since v1 RBDs don’t support COW clones, instance images are copied.
  • Copying a 160GB image in triplicate over gigabit Ethernet takes a while, and brought our little cluster to a crawl.

Naturally, we’re looking into beefing up the network links and CPUs on the storage nodes, but I’ve also been looking at ways to reduce the load on the back-end cluster.  One is through caching.  There are a couple of projects out there which allow you to combine two types of storage, using a smaller, faster block device to act as a cache for a larger, slower device.  Two which immediately come to mind: FlashCache and bcache.

bcache is on the TODO list, it has a few more knobs and dials to be able to play with, and shares a single cache device with multiple back-end devices, so might yet be worth investing time in.

Sébastian Han posted a guide on doing RBD caching using FlashCache, and so my work has largely been based on this initial work.  I’ve been hacking up a OpenNebula datastore management and transfer management driver which harnesses FlashCache and the newer v2 RBD format to produce a flexible storage subsystem for OpenNebula.

The basic concept is simple enough:

  • Logical Volume Manager, is used to allocate slices of a SSD to use as cache for back-end RBDs.
  • For non-persistent images, a new copy-on-write clone of the base image is created
  • A flashcache composite device is produced using the LVM volume as cache and the RBD as the backend
  • KVM/QEMU/Xen uses this composite device like a regular disk

The initial attempt worked well for Linux VMs, read performance initially would be between 20MB/sec and 120MB/sec depending on network/storage cluster load.  Subsequent reads would then exceed 240MB/sec.  Write performance was limited to what the cluster could do, unless you used writeback mode at which point speed picked up dramatically.

Windows proved to be a puzzle, it seems some Windows images have an odd way of accessing the disk, and this impacts performance badly.  In many cases, the images were of a sparse nature, with most of the content being in the first 8GB.  So I made sure to allocate 8GB chunks of my SSD, and performed what I call pre-caching: seeding the contents of the SSD with the initial 8GB (or however big the SSD partition is) of the image.

That picks up the initial boot performance by a big margin, at the cost of the image taking a little longer to deploy in the PROLOG stage.

For those who are interested, some early code is available via git.

bcache might be worth a look-in as it has read-ahead caching.  I haven’t done so yet.  I’d like to split the caching subsystem out and have cache drivers much like we have for datastore managers and transfer managers alike.  The same concept would work for iSCSI/CLVM storage or Gluster storage as it does for Ceph.

ceph and stgt

Hi all,

This is more a note to myself on how to configure stgt to talk to a Ceph rbd. Everyone seems to recommend patching tgt-admin: this is simply not necessary. The challenge is the lax way that tgt-admin parses the configuration file.

My scenario: VMWare ESXi virtual machine host, needing to use storage on Ceph.
I have 3 storage nodes running ceph-mon and ceph-osd daemons. They also have a version of tgtd that supports Ceph. (See the ceph-extras repository.)

The /etc/tgt/conf.d/${CLIENT}.conf configuration file. (I’m putting all the targets for ${CLIENT} here.)

# Target naming: iqn.yyyy-mm.backwards.domain.your:client.target
# where yyyy-mm: year and month of target creation
# backwards.domain.your: Your domain name; written backwards.
# client.target: A name for the target, since it's for one client here I name it
# as the client's host name then give the rest some descriptive title.
<target iqn.2014-02.domain.my:my-client.my-target-name>
    driver iscsi
    bs-type rbd
    backing-store pool-name/rbd-name
    initiator-address ip.of.my.client
</target>

For better or worse, I run the tgt daemon on the Ceph nodes themselves. Multipath I’m not sure about at this point, I’ve set up the targets on all of my Ceph nodes so I can connect to any, but I have not tested this yet.

To enable that target:

# tgt-admin -v -e

Then to verify:

# tgt-admin -s

You should see your LUNs listed.

apt repository hell whilst installing mariadb-galera-server on Ubuntu

Hi all,

Not often I have a whinge about something, but this problem has been bugging me of late more than somewhat.  I’m in the process of setting up an OpenStack cluster at work.  Now, as the underlying OS we’ve chosen Ubuntu Linux which is fine.  Ubuntu is a quite stable, reliable and well supported platform.

One of my pet peeves though, is when some package manager decides to get lazy.  Now, those of us who have been around the Linux scene have probably discovered RPM dependency hell… and the smug Debian users who tell us that Debian doesn’t do this.

Ho ho, errm… no, when APT wants to go into dummy mode, it does so with style:

Nov 12 05:32:27 in-target: Setting up python3-update-manager (1:0.186.2) ...
Nov 12 05:32:27 in-target: Setting up python3-distupgrade (1:0.192.13) ...
Nov 12 05:32:27 in-target: Setting up ubuntu-release-upgrader-core 
(1:0.192.13) ...
Nov 12 05:32:27 in-target: Setting up update-manager-core (1:0.186.2) ...
Nov 12 05:32:27 in-target: Processing triggers for libc-bin ...
Nov 12 05:32:27 in-target: ldconfig deferred processing now taking place
Nov 12 05:32:27 in-target: Processing triggers for initramfs-tools ...
Nov 12 05:32:27 in-target: Processing triggers for ca-certificates ...
Nov 12 05:32:27 in-target: Updating certificates in /etc/ssl/certs... 
Nov 12 05:32:29 in-target: 158 added, 0 removed; done.
Nov 12 05:32:29 in-target: Running hooks in /etc/ca-certificates/update.d....
Nov 12 05:32:29 in-target: done.
Nov 12 05:32:29 in-target: Processing triggers for sgml-base ...
Nov 12 05:32:29 pkgsel: installing additional packages
Nov 12 05:32:29 in-target: Reading package lists...
Nov 12 05:32:29 in-target: 
Nov 12 05:32:29 in-target: Building dependency tree...
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: Reading state information...
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: openssh-server is already the newest version.
Nov 12 05:32:30 in-target: Some packages could not be installed. This may 
mean that you have
Nov 12 05:32:30 in-target: requested an impossible situation or if you are 
using the unstable
Nov 12 05:32:30 in-target: distribution that some required packages have not 
yet been created
Nov 12 05:32:30 in-target: or been moved out of Incoming.
Nov 12 05:32:30 in-target: The following information may help to resolve the 
situation:
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: The following packages have unmet dependencies:
Nov 12 05:32:30 in-target:  mariadb-galera-server : Depends: 
mariadb-galera-server-5.5 (= 5.5.33a+maria-1~raring) but it is not going to 
be installed

Mmmm, great, not going to be installed. May I ask why not? No, I’ll just drop to a shell and do it myself then.

Nov 12 05:32:30 in-target: E: Unable to correct problems, you have held 
broken packages.

Now this is probably one of my most hated things about computing, is when a software package accuses YOU of doing something that you haven’t. Excuse me… I have held broken packages? I simply performed a fresh install then told you to do an install!

So let’s have a closer look.

Nov 12 05:32:30 main-menu[20801]: WARNING **: Configuring 'pkgsel' failed 
with error code 100
Nov 12 05:32:30 main-menu[20801]: WARNING **: Menu item 'pkgsel' failed.
Nov 12 05:37:38 main-menu[20801]: INFO: Modifying debconf priority limit from 
'high' to 'medium'
Nov 12 05:37:38 debconf: Setting debconf/priority to medium
Nov 12 05:37:38 main-menu[20801]: DEBUG: resolver (ext2-modules): package 
doesn't exist (ignored)
Nov 12 05:37:40 main-menu[20801]: INFO: Menu item 'di-utils-shell' selected
~ # chroot /target
chroot: can't execute '/bin/network-console': No such file or directory
~ # chroot /target bin/bash

We give it a shot ourselves to see the error more clearly.

root@test-mgmt0:/# apt-get install mariadb-galera-server
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-galera-server : Depends: mariadb-galera-server-5.5 (= 
5.5.33a+maria-1~raring) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Fine, so we’ll try installing that instead then.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-galera-server-5.5 : Depends: mariadb-client-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             Depends: libmariadbclient18 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Okay, closer, so we need to install those too. But hang on, isn’t that apt‘s responsibility to know this stuff? (which it clearly does).

Also note we don’t get told why it isn’t going to be installed. It refuses to install the packages, “just because”. No reason given.

We try adding in the deps to our list.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
                      Depends: mariadb-common but it is not going to be 
installed
                      Depends: libmariadbclient18 (>= 5.5.33a+maria-1~raring) 
but it is not going to be installed
                      Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mariadb-galera-server-5.5 : Depends: libmariadbclient18 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Okay, some more deps, we’ll add those…

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: mariadb-common but it is not going to be 
installed
                      Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but it is not going to be installed
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
                      Depends: mariadb-common but it is not going to be 
installed
                      Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mariadb-galera-server-5.5 : PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Wash-rinse-repeat!

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but it is not going to be installed
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
 mariadb-common : Depends: mysql-common but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common libdbd-mysql-perl 
mysql-common
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but 5.5.34-0ubuntu0.13.04.1 is to be installed
 mariadb-client-5.5 : Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mysql-common : Breaks: mysql-client-5.1
                Breaks: mysql-server-core-5.1
E: Unable to correct problems, you have held broken packages.

Aha, so there’s a newer version in the Ubuntu repository that’s overriding ours. Brilliant. Ohh, and there’s a mysql-client binary too, but it won’t tell me what version it’s trying for.

Looking in the repository myself I spot a package named mysql-common_5.5.33a+maria-1~raring_all.deb. That is likely our culprit, so I try version 5.5.33a+maria-1~raring.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common libdbd-mysql-perl 
mysql-common=5.5.33a+maria-1~raring libmysqlclient18=5.5.33a+maria-1~raring 
mariadb-client-core-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  galera libaio1 libdbi-perl libhtml-template-perl libnet-daemon-perl 
libplrpc-perl

Bingo!

So, for those wanting to pre-seed MariaDB Cluster 5.5, I used the following in my preseed file:

# MariaDB 5.5 repository list - created 2013-11-12 05:20 UTC
# http://mariadb.org/mariadb/repositories/
d-i apt-setup/local3/repository string \
        deb http://mirror.aarnet.edu.au/pub/MariaDB/repo/5.5/ubuntu raring main
d-i apt-setup/local3/comment string \
        "MariaDB repository"
d-i pkgsel/include string mariadb-galera-server-5.5 \
        mariadb-client-5.5 libmariadbclient18 mariadb-common \
        libdbd-mysql-perl mysql-common=5.5.33a+maria-1~raring \
        libmysqlclient18=5.5.33a+maria-1~raring mariadb-client-core-5.5 \
        galera

# For unattended installation, we set the password here
mysql-server mysql-server/root_password select DatabaseRootPassword
mysql-server mysql-server/root_password_again select DatabaseRootPassword

So yeah, next time someone mentions this:

Gentoo: Increasing blood pressure since 1999.

it doesn’t just apply to Gentoo!

GL4Ever FlyTouch III Kernel Sources

Well, some might remember my time with a cheap and nasty Android tablet (some might call these “landfill Android”).  The device packaging did not once even acknowledge the fact that there was GPL’ed software onboard, let alone how one obtains the source.

I discovered it was based around the Vimicro VC0882 SoC.  Turns out, that’s the same as the ViewSonic ViewPad 10e, who do release their kernel sources on their knowledge base.

Thank-you ViewSonic, you have just helped me greatly!  Maybe I should track down one of your tablets and buy one in appreciation.

OpenStack: My foray into cloud computing

I’ve been working with VRT Systems for a few years now. Originally brought in as a software engineer, my role shifted to include network administration duties.

This of course does not phase me, I’ve done network administration work before for charities. There are some small differences, for example, back then it was a single do-everything box running Gentoo hosting a Samba-based NT domain for about 5 Windows XP workstations, now it’s about 20 Windows 7 workstations, a Samba-based NT domain backed by LDAP, and a number of servers.

Part of this has been to move our aging infrastructure to a more modern “private cloud” infrastructure. In the following series, I plan to detail my notes on what I’ve learned through this process, so that others may benefit from my insight.  At this stage, I don’t have all the answers, and there are some things  I may have wrong below.

Planning

The first stage with any such network development (this goes for “cloud”-like and traditional structures) is to consider how we want the network to operate, how it is going to be managed, and what skills we need.

Both my manager and I are Unix-oriented people, in my case I’ll be honest — I have a definite bias towards open source, and I’ll try to assess a solution on technical merit rather than via glossy brochures.

After looking at some commercial solutions, my manager more or less came to the conclusion that a lot of these highly expensive servers are not so magical, they are fundamentally just standard desktops in a small form factor. While we could buy a whole heap of 1U high rack servers, we might be better served by using more standard hardware.

The plan is to build a cluster of standard boxes, in as small form factor as practical, which would be managed at a higher level for load balancing and redundancy.

Hardware: first attempt

One key factor we wanted to reduce was the power consumption. Our existing rack of hardware chews about 1.5kW of power. Since we want to run a lot of virtual machines, we want to make them as efficient as possible. We wanted a small building block that would handle a small handful of VMs, and storing data across multiple nodes for redundancy.

After some research, we wound up with our first attempt at a compute node:

Motherboard: Intel DQ77KB Mini ITX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 8GB SODIMM
Storage: Intel 520S 240GB SSD
Networking: Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for client-facing network

The plan, is that we’d have many of these, they would pool their storage in a redundant fashion.  The two on-board NICs would be bonded together using LACP and would form a back-end storage network for the nodes to share data.  The one PCIe card would be the “public” face of the cluster and would connect it to the outside world using VLANs.

For the OS, we threw on Ubuntu 12.04 LTS AMD64, and we ran the KVM hypervisor. We then tried throwing this on one of our power meters to see how much power the thing drew. At first my manager asked if the thing was even turned on … it was idling at 10W.

Loaded it on with a few virtual machines, eventually I had 6 VMs going on the thing, ranging from Linux, Windows 2000, Windows XP and a Windows 2008R2 P2V image for one of our customer projects.

The CPU load sat at about 6.0, and the power consumption did not budge above 30W. Our existing boxes drew 300W, so theoretically we could run 10 of these for just one of our old servers.

Management software

Running QEMU VMs from bash scripts is all very well, but in this case we need to be able to give non-technical users use of a subset of the cluster for projects.  I hardly expect them to write bash scripts to fire up KVM over SSH.

We considered a few options: Ganeti, OpenNebula and OpenStack.

Ganeti looked good but the lack of a template system and media library let it down for us, and OpenNebula proved a bit fiddly as well.  OpenStack is a big behemoth and will take quite a bit of research however.

Storage

One factor that stood out like a sore thumb: our initial infrastructure was going to just have all compute nodes, with shared storage between them.  There were a couple of options for doing this such as having the nodes in pairs with DR:BD, using Ceph or Sheepdog, etc… but by far, the most common approach was to have a storage backend on a SAN.

SANs get very expensive very quickly.  Nice hardware, but overkill and over budget.  We figured we should plan for that eventuality, should the need arise, but it’d be a later addition.  We don’t need blistering speed, if we can sustain 160Mbps throughput, that’d probably be fine for most things.

Reading the literature, Ceph looked by far and above the best choice, but it had a catch — you can’t run Ceph server daemons, and Ceph in-kernel clients, on the same host.  Doing so you run the risk of a deadlock, in much the same manner as NFS does when you mount from localhost.

OpenStack actually has 3 types of storage:

  • Ephemeral storage
  • Block storage
  • Image storage

Ephemeral storage is specific to a given virtual machine.  It often lives on the compute node with the VM, or on a back-end storage system, and stores data temporarily for the life of a virtual machine instance.  When a VM instance is created, new copies of ephemeral block devices are created from images stored in image storage.  Once the virtual machine is terminated, these ephemeral block devices are deleted.

Block storage is the persistent storage for a given VM.  Say you were running a mail server … your OS and configuration might exist on a ephemeral device, but your mail would sit on a block device.

Image storage are simply raw images of block devices.  Image storage cannot be mounted as a block device directly, but rather, the storage area is used as a repository which is read from when creating the other two types of storage.

Ephemeral storage in OpenStack is managed by the compute node itself, often using LVM on a local block device.  There is no redundancy as it’s considered to be temporary data only.

For block storage, OpenStack provides a service called cinder.  This, at its heart, seems to use LVM as well, and exports the block devices over iSCSI.

For image storage, OpenStack has a redundant storage system called swift.  The basis for this seems to be rsync, with a service called swift-proxy providing a REST-interface over http.  swift-proxy is very network intensive, and benefits from hardware such as high-speed networking (e.g. 10Gbps Ethernet).

Hardware: second attempt

Having researched how storage works in OpenStack somewhat, it became clear that one single building block would not do.  There would in fact be two other types of node: storage nodes, and management nodes.

The storage nodes would contain largish spinning disks, with software maintaining copies and load balancing between all nodes.

The management nodes would contain the high-speed networking, and would provide services such as Ceph monitors (if we use Ceph), swift-proxy and other core functions.  RabbitMQ and the core database would run here for example.

Without the need for big storage, the compute nodes could be downsized in disk, and expanded in RAM.  So we now had a network that looked like this:

Node Type Compute Management Storage
Motherboard: Intel DQ77KB Mini ITX Intel DQ77MH Micro ATX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 2*8GB SODIMM 2*4GB DIMM
Storage: Intel 520S 60GB SSD Intel 520S 60GB SSD for OS, 2*Seagate ST3000VX000-1CU1 3TB HDDs for data
Networking: Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for client-facing network Onboard dual gigabit for management, PCIe 10GbE for cluster communications Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for management

The management and storage nodes are slightly tweaked versions of what we use for compute nodes. The motherboard is basically the same chipset, but capable of taking larger PCIe cards and using a standard ATX power supply.

Since we’re not storing much on the compute nodes, we’ve gone for 60GB SSDs rather than 240GB SSDs to cut the cost down a little. We might have to look at 120GB SSDs in newer nodes, or maybe look at other options, as Intel seem to have discontinued the 60GB 520S … bless them! The Intel 520S SSDs were chosen due to the 5-year warranty offered.

The management and storage nodes, rather than going into small Mini-ITX media-centre style cases, are put in larger 2U rackmount cases. These cases have room for 4 HDDs, in theory.

Deployment

For testing purposes, we got two of each node. This allows us to try out things like testing what would happen if a node went belly up by yanking its power, and to test load balancing when things are working properly.

We haven’t bought the 10GbE cards at this stage, as we’re not sure exactly which ones to get (we have a Cisco SG500X switch to plug them into) and they’re expensive.

The final cluster will have at least 3 storage nodes, 3 management nodes and maybe as many as 16 compute nodes. I say at least 3 storage nodes — in buying the test hardware, I accidentally ordered 7 cases, and so we might decide to build an extra storage node.

Each of those gives us 6TB of storage, and the production plan is to load balance with a replica on at least 3 nodes… so we can survive any two going belly up. The disks also push close to 800Mbps throughput, so with 3 nodes serving up data, that should be enough to saturate the dual-gigabit link on the compute node. 4 nodes would give us 8TB of effective storage.

With so many nodes though, one problem remains, deploying the configuration and managing it all. We’re using Ubuntu as our base platform, and so it makes sense to tap into their technologies for deployment.

We’ll be looking to use Ubuntu Cloud and Juju to manage the deployment.

Ubuntu Cloud itself is a packaged version of OpenStack.  The components of OpenStack are deployed with Juju.  Juju itself can deploy services either to “public clouds” like Amazon AWS, or to one’s own private cluster using Ubuntu MAAS (Metal As A Service).

Metal As a Service itself basically is a deployment system which installs and configures Ubuntu on network-booting clients for automatic installation and configuration.

The underlying technology is based on a few components: dnsmasq DHCP/DNS server, tftp-hpa TFTP server, and the configuration gets served up to the installer via a web service API.  There’s a web interface for managing it all.  Once installed, you then deploy services using Juju (the word juju apparently translates to “magic”).

Further research

So having researched what hardware will likely be needed, I need to research a few things.

Firstly, the storage mechanism, we can either go with the pure OpenStack approach with cinder managing LVM based storage and exporting over iSCSI, or we get cinder to manage a Ceph back-end storage cluster.  This decision has not yet been made.  My two biggest concerns with cinder are:

  • Does cinder manage multiple replicas of block storage?
  • Does cinder try to load-balance between replicas?

With image storage, if we use Ceph, we have two choices.  We can either:

  • Install Swift on the storage nodes, partition the drives and use some of the storage for Swift, and the rest for Ceph… with Swift-proxy on the management nodes.
  • Install Rados Gateway on the management nodes in place of Swift

But which is the better approach?  My understanding is that Ceph doesn’t fully integrate into the OpenStack identity service (called keystone).  I need to find out if this matters much, or whether splitting storage between Swift and Ceph might be better.

Metal As a Service seems great in concept.  I’ve been researching OpenStack and Ceph for a few months now (with numerous interruptions), and I’m starting to get a picture as to how it all fits together.  Now the next step is to understand MAAS and Juju.  I don’t mind magic in entertainment, but I do not like it in my systems.  So my first step will be to get to understand MAAS and Juju on a low level.

Crucially, I want to figure out how one customises the image provided by MAAS… in particular, making sure it deploys to the 60GB SSD on each node, and not just the first block device it sees.

The storage nodes have their two 6Gbps SATA ports connected to the 3TB HDDs for performance, making them visible as /dev/sda and /dev/sdb — MAAS needs to understand that the disk it needs to deploy to is called /dev/sdc in this case.  I’d also perfer it to use XFS rather than EXT4, and a user called something other than “ubuntu”.  These are things I’d like to work out how to configure.

As for Juju, I need to work out exactly what it does when it “bootstraps” itself.  When I tried it last, it randomly picked a compute node.  I’d be happier if it deployed itself to the management node I ran it from.  I also need to figure out how it picks out nodes and deploys the application.  My quick testing with it had me asking it to deploy all the OpenStack components, only to have it sit there doing nothing… so clearly I missed something in the docs.  How is it supposed to work?  I’ll need to find out.  It certainly isn’t this simple.