Single sign-on with OpenConnect VPN server over FreeIPA

In March of 2015, version 0.10.0 of the OpenConnect VPN server was released. One of its main features is the addition of MS-KKDCP support and GSSAPI authentication. Acronyms aside, that means that for VPN users, authentication against FreeIPA, which uses Kerberos, is greatly simplified. Before explaining more, let’s first explore what the typical login process is on a VPN network.

Currently, with a typical VPN server or product, one logs into the VPN server using a username-password pair, and then signs into the Kerberos realm using, again, a username-password pair. Often the very same credentials are used for both logins. That is, we have two independent authentication steps to log into a network, one after the other, consuming the user’s time without necessarily increasing the security level. Can things be simplified to achieve single sign-on over the VPN? We believe yes, and that’s why we combined the two independent authentications into a single authentication instance. The user logs into the Kerberos realm once and uses the obtained credentials to log into the VPN server as well. That way, the necessary passwords are asked for only once, minimizing login time and frustration.

How is that done? If the user needs to connect to the VPN in order to access the Kerberos realm, how could he perform Kerberos authentication prior to that? To answer that question we’ll first explain the protocols in use. The protocol followed by the OpenConnect VPN server is HTTPS based; hence, any authentication method available for HTTPS is available to the VPN server as well. In this particular case, we take advantage of the SPNEGO and the MS-KKDCP protocols. The former enables GSSAPI negotiation over HTTPS, thus allowing a Kerberos ticket to be used to authenticate to the server. The MS-KKDCP protocol allows an HTTPS server to behave as a proxy to a Kerberos Authentication Server, and that’s the key point which allows the user to obtain the Kerberos ticket over the VPN server protocol. Thus, the combination of the two protocols allows the OpenConnect VPN server to operate both as a proxy to the KDC and as a Kerberos-enabled service. Furthermore, the usage of HTTPS ensures that all transactions with the Kerberos server are protected using the OpenConnect server’s key, ensuring the privacy of the exchange. However, there is a catch; since the OpenConnect server is now a proxy for Kerberos messages, the Kerberos Authentication Server cannot see the real IPs of the clients, and thus cannot prevent a flood of requests which could cause a denial of service. To address that, we introduced a point system in the OpenConnect VPN server that bans IP addresses which perform more than a pre-configured number of requests.

As a consequence, with the above setup, the login process is simplified by reducing the steps required to log into a network managed by FreeIPA. The user logs into the Kerberos Authentication Server, and the VPN to the FreeIPA-managed network is made available with no additional prompts.

Wouldn’t that reduce security? Isn’t it more secure to ask the user for one set of credentials to connect to the home network and a different set to access the services inside it? That’s a valid concern. There are networks where this is indeed a good design choice, but in other networks it may not be. By stacking multiple authentication methods you may end up training the less security-oriented users to try every password they were provided at every login prompt until one works. However, it is desirable to increase the authentication strength when users come from untrusted networks. For that, it is possible, and recommended, to configure FreeIPA to require a second factor authenticator (OTP) as part of the login process.

Another, equally important concern with single sign-on is that it removes re-authentication to the VPN for the whole validity period of a Kerberos ticket. That is, given the long lifetime of Kerberos tickets, how can we prevent a stolen laptop from being able to access the VPN? We address that by enforcing a configurable TGT lifetime limit on the VPN server. This way, VPN authentication will only occur if the user’s ticket is fresh, and the user’s password will be required otherwise.

Setting everything up

The next paragraphs move from theory to practice, and describe the minimum set of steps required to set up the OpenConnect VPN server and client with FreeIPA. At this point we assume that a FreeIPA setup is already in place and that a realm named KERBEROS.REALM exists. See the Fedora FreeIPA guide for information on how to set up FreeIPA.

Server side: Fedora 22, RHEL7

The first step is to install the latest release of the 0.10.x branch of the OpenConnect VPN server (ocserv) on the server system. You can use the following command. On RHEL 7 you will also need to set up the EPEL 7 repository.

yum install -y ocserv

That will install the server in an unconfigured state. The server uses a single configuration file, found in /etc/ocserv/ocserv.conf. It contains several directives documented inline. To allow authentication with Kerberos tickets as well as with a password (e.g., for clients that cannot obtain a ticket, such as clients on mobile phones), you need to enable both PAM and GSSAPI authentication with the following two lines in the configuration file.

auth = pam
enable-auth = gssapi[tgt-freshness-time=360]

The option ‘tgt-freshness-time’, available with OpenConnect VPN server 0.10.5 and later, specifies the maximum age, in seconds, of a Kerberos TGT that will be accepted for VPN authentication. A user will have to re-authenticate if this time is exceeded. In effect, that prevents use of the VPN for the whole lifetime of a Kerberos ticket.

The following line will enable the MS-KKDCP proxy on ocserv. You’ll need to replace KERBEROS.REALM with your realm and KDC-IP-ADDRESS with your KDC’s IP address.

kkdcp = /KdcProxy KERBEROS.REALM tcp@KDC-IP-ADDRESS:88

Note that for PAM authentication to operate you will also need to set up an /etc/pam.d/ocserv file. We recommend using pam_sss (SSSD’s PAM module) for that, although the file can contain anything that best suits the local policy. An example of an SSSD PAM configuration is shown in the Fedora Deployment Guide.
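
As an illustration, a minimal /etc/pam.d/ocserv delegating to SSSD might look like the following sketch; adapt it to the local policy:

#%PAM-1.0
auth     required   pam_sss.so
account  required   pam_sss.so
password required   pam_sss.so
session  optional   pam_sss.so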

The remaining options in ocserv.conf concern the VPN network setup; the comments in the default configuration file should be self-explanatory. At minimum you’ll need to specify a range of IPs for the VPN network, the addresses of the DNS servers, and the routes to push to the clients, as in the sketch below.
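
An illustrative snippet, with example addresses (the directive names are those of the stock ocserv.conf):

ipv4-network = 192.168.100.0
ipv4-netmask = 255.255.255.0
dns = 192.168.1.2
route = 192.168.1.0/255.255.255.0

Once these are set, the server can be enabled and started with the following commands.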

systemctl enable ocserv
systemctl start ocserv

The status of the server can be checked using “systemctl status ocserv”.

Client side: Fedora 21, RHEL7

The first step is to install the OpenConnect VPN client, named openconnect, on the client system. The version must be 7.05 or later. On RHEL 7 you will need to set up the EPEL 7 repository.

yum install -y openconnect NetworkManager-openconnect

Set up Kerberos to use ocserv as its KDC. For that you’ll need to modify /etc/krb5.conf to contain the following:

[realms]
KERBEROS.REALM = {
    kdc = https://ocserv.example.com/KdcProxy
    http_anchors = FILE:/path-to-your/ca.pem
    admin_server = ocserv.example.com
    auth_to_local = DEFAULT
}

[domain_realm]
.kerberos.test = KERBEROS.REALM
kerberos.test = KERBEROS.REALM

Note that ocserv.example.com should be replaced with the DNS name of your server, and /path-to-your/ca.pem should be replaced by a PEM-formatted file which holds the server’s Certificate Authority. For the kdc option, the server’s DNS name is preferred to an IP address to simplify server name verification for the Kerberos libraries. At this point you should be able to use kinit to authenticate and obtain a ticket from the Kerberos Authentication Server. Note, however, that kinit is very brief in its printed errors, and a server certificate verification error will not be easy to debug. Ensure that the http_anchors file is in PEM format, that it contains the Certificate Authority that signed the server’s certificate, and that the server’s certificate DNS name matches the DNS name set up in the file. Note also that this approach requires the user to always use OpenConnect’s KDCProxy. To avoid that restriction, and allow the user to use the KDC directly when on the LAN, we are currently working towards auto-discovery of the KDC.

Then, at a terminal run:

$ kinit
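
If kinit fails with a terse error, one debugging aid (assuming a reasonably recent MIT Kerberos) is the KRB5_TRACE environment variable, which makes the libraries log their progress, including the exchange with the HTTPS proxy:

$ KRB5_TRACE=/dev/stderr kinit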

If the command succeeds, the ticket is obtained, and at this point you will be able to set up openconnect from the NetworkManager GUI and connect to the server using the Kerberos credentials. To set up the VPN via NetworkManager, on the system menu select VPN, then Network Settings, and add a new network of type “Cisco AnyConnect Compatible VPN (openconnect)”. In the Gateway field, fill in the server’s DNS name, add the server’s CA certificate, and that’s all that is required.

To use the command line client with Kerberos, the following trick is recommended. It creates a tun device as root and then runs the openconnect client as a normal user; running the client itself under sudo would prevent access to the user’s Kerberos credentials.

$ sudo ip tuntap add vpn0 mode tun user my-user-name
$ openconnect server.example.com -i vpn0

Client side: Windows

A Windows client is available for OpenConnect VPN at this web site. Its setup, similarly to NetworkManager, requires setting the server’s DNS name and its certificate. Configuring Windows for use with FreeIPA is outside the scope of this text, but more information can be found at this FreeIPA manual.

Conclusion

A single sign-on solution using FreeIPA and the OpenConnect VPN has many benefits. The core optimization, a single login prompt authorizing access to network resources, saves user time and frustration. It is important to note that these optimizations are possible because VPN access is made part of the deployed infrastructure, rather than an afterthought. With careful planning, an OpenConnect VPN solution can provide a secure and easy approach to network authentication.

The hidden costs of embargoes

It’s 2015 and it’s pretty clear the Open Source way has largely won as a development model for large and small projects. But when it comes to security we still practice a less-than-open model of embargoes with minimal or, in some cases, no community involvement. With the transition to more open development tools, such as Gitorious and GitHub, it is now time for the security process to change and become more open.

The problem

In general the argument for embargoes consists of “we’ll fix these issues in private and release the update in a coordinated fashion in order to minimize the window during which an attacker knows about the issue but no update is available”. So why not reduce the risk of security issues being exploited by attackers and simply embargo all security issues, fix them privately, and coordinate vendor updates to limit the time attackers know about them? Well, for one thing we can’t be certain that an embargoed issue is known only to the people who found and reported it, and the people working on it. If one person is able to find the flaw, then a second person can find it. This exact scenario happened with the high-profile OpenSSL “Heartbleed” issue: it was found independently both by the Google security team and by an engineer at Codenomicon.

Additionally, there is a mismatch between how Open Source software is built and how security flaws are handled in it. Open Source development, testing, QA and distribution of the software mostly happen in public now. Most Open Source organizations have public source code trees that anyone can view, and in many cases submit change requests to. Many projects have also grown in size, not only code-wise but developer-wise, and some now involve hundreds or even thousands of developers (OpenStack, the Linux Kernel, etc.). It is clear that the Open Source way has scaled and works well with these projects; however, the old-fashioned way of handling security flaws in secret has not scaled as well.

Process and tools

The Open Source development method generally looks something like this: the project has a public source code repository that anyone can copy from, and specific people can commit to. The project may have a continuous integration platform like Jenkins, and then QE testing to make sure everything still works. Fewer and fewer projects have a private or hidden repository, partly because there is little benefit to doing so, and partly because many providers do not allow private repositories on their free plans. The same applies to the continuous integration and testing environments used by many projects (especially when using free services). So handling a security issue in secret, without exposing it, means that many projects cannot use their existing infrastructure and processes, but must instead email patches around privately and do builds and testing on their own workstations.

Code expertise

But let’s assume that an embargoed issue has been discovered and reported to upstream, and only the original reporter and upstream know. Now we have to hope that between the reporter and upstream there is enough time and expertise to properly research the issue and fully understand it. The researcher may have found the issue through fuzzing and may only have a test case that causes a crash, with no idea which section of code is affected or how. Alternatively, the researcher may know what code is affected but may not fully understand the code or how it relates to other sections of the program, in which case the issue may be more severe than they think, or less severe, or perhaps not exploitable at all. The upstream project may also not have the time or resources to understand the code properly: many of these people are volunteers, projects have turnover, and the person who originally wrote the code may be long gone. In this case making the issue public means that additional people, such as vendor security teams, can also participate in looking at the issue and helping to fully understand it.

Patch creation for an embargoed issue means that only the researcher and upstream participate. The end result is often patches that are incomplete and do not fully address the issue. This happened with the Bash Shellshock issue (CVE-2014-6271), where the initial patch, and even subsequent patches, were incomplete, resulting in several more CVEs (CVE-2014-6277, CVE-2014-6278, CVE-2014-7169). For a fairly complete listing of such examples, simply search the CVE database for “because of an incomplete fix for”.

So assuming we now have a fully understood issue and a correct patch, we actually need to patch the software and run it through QA before release. If the issue is embargoed, this must be done in secret. However, many Open Source projects use public or open infrastructure, or services which do not support selective privacy (a project is either public or private). Thus for many projects the patching and QA cycle must happen outside of their normal infrastructure. For some projects this may not even be possible; if they have a large Jenkins infrastructure, replicating it privately can be very difficult.

And finally, once we have a fully patched and tested source code release, we may still need to coordinate that release with other vendors, which has significant overhead and time constraints for all concerned. Obviously, if the issue is public, the entire effort spent on private coordination is not needed, and that effort can be spent on other things, such as ensuring the patches are correct and address the flaw properly.

The Fix

The answer to the embargo question is surprisingly simple: we only embargo the issues that really matter, and even then we use embargoes sparingly. Making an issue public gets more eyes on it: additional security experts, who would not have been aware under an embargo, can join the original researcher and the upstream project, increasing the chances of the issue being properly understood and patched the first time around. And for the majority of lower-severity issues (e.g., most XSS and temporary-file vulnerabilities), attackers have little to no interest in them, so the cost of an embargo makes no sense. In short: why not treat most security bugs like normal bugs and get them fixed quickly and properly the first time around?

Emergency Security Band-Aids with Systemtap

Software security vulnerabilities are a fact of life. So are the subsequent publicity, package updates, and the disruption of service restarts. Administrators are used to it, users bear it, and it is the default, traditional method.

On the other hand, in some circumstances the update-and-restart method is unacceptable, which has led to the development of online-fix facilities like kpatch, where code may be surgically replaced in a running system. There is plenty of potential in these systems, but they are still at an early stage of deployment.

In this article, we present another option: a limited sort of live patching using systemtap. This tool, now a decade old, is conventionally thought of as a tracing widget, but it can do more. It can not only monitor the detailed internals of the Linux kernel and user-space programs, it can also change them – a little. That turns out to be just enough to defeat some classes of security vulnerabilities. We refer to these as security “band-aids” rather than fixes, because we expect them to be used temporarily.

Systemtap

Systemtap is a system-wide programmable probing tool introduced in 2005 and supported since RHEL 4 on all Red Hat and many other Linux distributions. (Its capabilities vary with the kernel’s; for example, user-space probing is not available in RHEL 4.) It is system-wide in the sense that it allows looking into much of the software stack: device drivers, the kernel core, system libraries, through to user applications. It is programmable because it is driven by programs in the systemtap scripting language, which specify what operations to perform. It probes in the sense that, like a surgical instrument, it safely opens up running software so that we can peek and poke at its internals.

Systemtap’s script language is inspired by dtrace, awk, and C, and is intended to be easy to understand, and compact yet expressive. Here is hello-world:

probe oneshot { printf("hello world\n") }

Here is some counting of vfs I/O:

 global per_pid_io # an array
 probe kernel.function("vfs_read") 
     { per_pid_io["read",pid()] += $count }
 probe kernel.function("vfs_write")
     { per_pid_io["write",pid()] += $count }
 probe timer.s(5) { exit() } # per_pid_io will be printed

Here is a system-wide strace for non-root processes:

probe syscall.*
{
    if (uid() != 0)
        printf("%s %d %s %s\n", execname(), tid(), name, argstr)
}

Additional serious and silly samples are available on the Internet and also distributed with systemtap packages.

The meaning of a systemtap script is simple: whenever a given probed event occurs, pause that context, evaluate the statements in the probe handler atomically (safely and quickly), then resume the context. Those statements can inspect source-level state in context, trace it, or store it away for later analysis/summary.

The systemtap scripting language is well-featured. It has the usual control-flow statements and functions with recursion. It deals with integral and string data types, and look-up tables are available. Unusually for a small language, full type checking is paired with type inference, so type declarations are not necessary. There are dozens of probe types: timers; calls, interiors, and returns of arbitrary functions; designated tracing instrumentation in the kernel or user-space; hardware performance counter values; Java methods; and more.

Systemtap scripts are run by algorithmic translation to C, compilation to machine code, and execution within the kernel as a loadable module. The intuitive hazards of this technique are ameliorated by prolific checking throughout the process, both during translation and within the generated C code and its runtime. Both time and space usage are limited, so experimentation is safe. There are even modes available for non-root users to inspect only their own processes. This capability has been used for a wide range of purposes:

  • performance tuning via profiling
  • program understanding via pinpoint tracing, dynamic call-graphs, and line-by-line pretty-printing of local variables

Many scripts can run independently at the same time and any can be completely stopped and removed. People have even written interactive games!

Methodology

While systemtap is normally used in a passive (read-only) capacity, it may be configured to permit active manipulations of state. When invoked in “guru mode”, a probe handler may send signals, insert delays, change variables in the probed programs, and even run arbitrary C code. Since systemtap cannot change the program code, changing data is the approach of choice for construction of security band-aids. Here are some of the steps involved, once a vulnerability has been identified in some piece of kernel or user code.

Plan

To begin, we need to look at the vulnerable code and, if available, the corrective patch, to understand:

  • Is the bug mainly (a) data-processing-related or (b) algorithmic?
  • Is the bug (a) localized or (b) widespread?
  • Is the control flow to trigger the bug (a) simple or (b) complicated?
  • Are control flow paths to bypass the bug (a) available nearby (including in callers) or (b) difficult to reach?
  • Is the bug dependent mainly on (a) local data (such as function parameters) or (b) global state?
  • Is the vulnerability-triggering data accessible over (a) a broad range of the function (a function parameter or global) or only (b) a narrow window (a local variable inside a nested block)?
  • Are the bug triggering conditions (a) specific and selective or (b) complex and imprecise?
  • Are the bug triggering conditions (a) deliberate or (b) incidental in normal operation?

More (a)s than (b)s means it’s more likely that systemtap band-aids would work in the particular situation while more (b)s means it’s likely that patching the traditional way would be best.

Then we need to decide how to change the system state at the vulnerable point. One possibility is to change to a safer error state; the other is to change to a correct state.

In a type-1 band-aid, we redirect the flow of control away from the vulnerable areas. In this approach, we want to “corrupt” the incoming data in the smallest way necessary to short-circuit the function and bypass the vulnerable regions. This is especially appropriate if:

  • correcting the data is difficult, perhaps because it is in multiple locations, because the correction would need to be temporary, or because no obvious corrected value can be computed
  • error handling code already exists and is accessible
  • we don’t have a clear identification of vulnerable data states and want to err on the side of error-handling
  • the vulnerability trigger is likely deliberate, so we don’t want to spend the effort of performing a corrected operation

A type-2 band-aid corrects the data so that the vulnerable code runs correctly. This is especially appropriate in the cases complementary to the above, and if:

  • the vulnerable code and data states can arise from real workloads, so we would like those operations to succeed
  • corrected data can be practically computed from nearby state
  • natural points occur in the vulnerable code where the corrected data may be inserted
  • natural points occur in the vulnerable code where clean up code (restoring of previous state) may be inserted, if necessary

Implement

With the vulnerability-band-aid approach chosen, we need to express our intent in the systemtap scripting language. The model is simple: for each place where the state change is to be done we place a probe. In each probe handler, we detect whether the context indicates an exploit is in progress and, if so, make changes to the context. We might also need additional probes to detect and capture state from before the vulnerable section of code, for diagnostic purposes.

A minimal script form for changing state can be easily written. It demonstrates one kernel and one user-space function-entry probe, where each happens to take a parameter named p that needs to be range-limited. (The dollar sign identifies the symbol as a variable in the context of the probed program, not as a script-level temporary variable.)

probe kernel.function("foo"),
      process("/lib*/libc.so").function("bar")
{
    if ($p > 100)
        $p = 4
}

Another possible action in the probe handler is to deliver a signal to the current user-space process using the raise function. In this script, a global variable in the target program is checked at every statement in the given source file and line-number range, and a killing blow is delivered if necessary:

probe process("/bin/foo").statement("*@src/foo.c:100-200")
{
    if (@var("a_global") > 1000)
        raise(9) # SIGKILL
}

Another possible action is logging the attempt at the systemtap process console:

    # ...
    printf("check process %s pid=%d uid=%d",
           execname(), pid(), uid())
    # ...

Or sending a note to the sysadmin:

    # ...
    system(sprintf("/bin/logger check process %s pid=%d uid=%d",
                   execname(), pid(), uid()))
    # ...

These and other actions may be done in any combination.

During development of a band-aid, one should start with tracing only (no band-aid countermeasures) to fine-tune the detection of the vulnerable state. If an exploit is available, run it without systemtap, with systemtap (tracing only), and with the operational band-aid. If a normal workload can trigger the bug, run it with and without the same spectrum of systemtap supervision, to confirm that we’re not harming that traffic.
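
One way to follow that advice is to guard the countermeasure behind a script global that can be flipped from the stap command line with the -G option. Here is a sketch, reusing the hypothetical foo/$p probe from the minimal script above:

global fix = 0  # trace only; run with: stap -g -G fix=1 band_aid.stp to activate
probe kernel.function("foo")
{
    if ($p > 100) {
        # always log the suspicious state
        printf("suspicious p=%d in %s[%d]\n", $p, execname(), pid())
        if (fix)
            $p = 4   # the countermeasure itself, off by default
    }
}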

Deploy

To run a systemtap script, we need systemtap on a developer workstation. Systemtap has been included in RHEL 4 and later since 2006. RHEL 6 and RHEL 7 still receive rebases from new upstream releases, though capabilities vary. (Systemtap upstream is tested against a gamut of kernels, from RHEL 4 through fresh kernel.org, and against other distributions.)

In addition to systemtap itself, security band-aid scripts usually require source-level debugging information for the buggy programs. That is because we need the same symbolic information about types, functions, and variables in the program as an interactive debugger like gdb does. On Red Hat/Fedora distributions this means the “-debuginfo” RPMs available on RHN and via yum. Some other distributions make them available as “-dbgsym” packages. Systemtap scripts that probe the kernel will probably need the larger “kernel-debuginfo”; those that probe user-space will probably need a corresponding “foobar-debuginfo” package. (Systemtap will tell you if one is missing.)
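
On Red Hat and Fedora systems, assuming the debuginfo repositories are enabled, these packages can be fetched with the debuginfo-install utility from yum-utils; an illustrative session:

# yum install -y yum-utils
# debuginfo-install -y kernel   # for probing the kernel
# debuginfo-install -y glibc    # for probing user-space glibc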

Running systemtap security band-aid scripts will generally require root privileges and a “guru mode” flag (-g) to designate permission to modify state, such as:

# stap -g band_aid.stp
[...]
^C

The script will inject instrumentation into existing and future processes and continue running until it is manually interrupted, it stops itself, or error conditions arise. In case of most errors, the script will stop cleanly, print a diagnostic message, and point to a manual page with further advice. For transient or acceptable error conditions, command line options are available to suppress some safety checks altogether.

Systemtap scripts may be distributed to a network of homogeneous workstations in “pre-compiled” (kernel-object) form, so that a full systemtap + compiler + debuginfo installation is not necessary on the other computers. Systemtap includes automation for remote compilation and execution of scripts. A related facility is available to create MOK-signed modules for machines running under Secure Boot. Scripts may also be installed for automatic execution at startup via initscripts.
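
As a sketch of that workflow (the module name band_aid is arbitrary): compile on a machine that has systemtap and the debuginfo installed, copy the resulting kernel object to a target running the identical kernel, and execute it there with the staprun runtime:

$ stap -g -p4 -m band_aid band_aid.stp   # stop after compilation; produces band_aid.ko
# staprun band_aid.ko                    # on the target machine, as root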

Some examples

Of all the security bugs for which systemtap band-aids have been published, we analyse a few below.

CVE-2013-2094, perf_swevent_enabled array out-of-bound access

This was an older bug in the kernel’s perf-event subsystem, which takes a complex command struct from a syscall. The bug was a missing range check on a value inside the struct pointed to by the event parameter. We opted for a type-2 data-correction fix, even though a type-1 failure-induction could have worked as well.

This script demonstrates an unusual technique: embedded-C code called from the script, to adjust the erroneous value. There is a documented argument-passing API between embedded-C and script, but systemtap cannot analyze or guarantee anything about the safety of the C code. In this case, the same calculation could have been expressed within the safe scripting language, but it serves to demonstrate how a more intricate correction could be fitted.

# declaration for embedded-C code
%{
#include <linux/perf_event.h>
%}
# embedded-C function - note %{ %} bracketing
function sanitize_config:long (event:long) %{
    struct perf_event *event;
    event = (struct perf_event *) (unsigned long) STAP_ARG_event;
    event->attr.config &= INT_MAX;
%}

probe kernel.function("perf_swevent_init").call {
    sanitize_config($event)  # called with pointer
}

CVE-2015-3456, “venom”

Here is a simple example from the recent VENOM bug, CVE-2015-3456, in QEMU’s floppy-drive emulation code. In this case, a buffer-overflow bug allows some user-supplied data to overwrite unrelated memory. The official upstream patch adds explicit range limiting by way of a newly introduced index variable pos, in several functions. For example:

@@ -1852,10 +1852,13 @@
 static void fdctrl_handle_drive_specification_command(FDCtrl *fdctrl, int direction)
 {
     FDrive *cur_drv = get_cur_drv(fdctrl);
+    uint32_t pos;
-    if (fdctrl->fifo[fdctrl->data_pos - 1] & 0x80) {
+    pos = fdctrl->data_pos - 1;
+    pos %= FD_SECTOR_LEN;
+    if (fdctrl->fifo[pos] & 0x80) {
         /* Command parameters done */
-        if (fdctrl->fifo[fdctrl->data_pos - 1] & 0x40) {
+        if (fdctrl->fifo[pos] & 0x40) {
             fdctrl->fifo[0] = fdctrl->fifo[1];
             fdctrl->fifo[2] = 0;
             fdctrl->fifo[3] = 0;

Inspecting the original code, we see that the vulnerable index was inside a heap object at fdctrl->data_pos. A type-2 systemtap band-aid would have to adjust that value before the code runs the fifo[] dereference, and subsequently restore the previous value. This might be expressed like this:

global saved_data_pos
probe process("/usr/bin/qemu-system-*").function("fdctrl_*spec*_command").call
{
    saved_data_pos[tid()] = $fdctrl->data_pos;
    $fdctrl->data_pos = $fdctrl->data_pos % 512 # FD_SECTOR_LEN
}
probe process("/usr/bin/qemu-system-*").function("fdctrl_*spec*_command").return
{
    $fdctrl->data_pos = saved_data_pos[tid()]
    delete saved_data_pos[tid()]
}

The same work would have to be done at each of the three analogous vulnerable sites, unless further detailed analysis suggested that a single common higher-level function could do the job.

However, this is probably too much work. The CVE advisory suggests that any call into this area of code is likely a deliberate exploit attempt (since modern operating systems don’t use the floppy driver). Therefore, we could opt for a type-1 band-aid, where we bypass the vulnerable computations entirely. We find that all the vulnerable functions ultimately have a common caller, fdctrl_write.

static void fdctrl_write (void *opaque, uint32_t reg, uint32_t value)
{
    FDCtrl *fdctrl = opaque;
    [...]
    reg &= 7;
    switch (reg) {
    case FD_REG_DOR:
        fdctrl_write_dor(fdctrl, value);
        break;
        [...]
    case FD_REG_CCR:
        fdctrl_write_ccr(fdctrl, value);
        break;
    default:
        break;
    }
}

We can disarm the entire simulated floppy driver by pretending that the simulated CPU is addressing a reserved FDC register, thus falling through to the default: case. This requires just one probe. Here we’re being more conservative than necessary, overwriting only the low few bits:

probe process("/usr/bin/qemu-system-*").function("fdctrl_write")
{
    $reg = (($reg & ~7) | 6) # replace register address with 0x__6
}

CVE-2015-0235, “ghost”

This recent bug in glibc involved a buffer overflow related to dynamic allocation with a miscalculated size. It affected a function that is commonly used in normal software, and the data required to determine whether the vulnerability would be triggered is not available in situ. Therefore, a type-1 error-inducing band-aid would not be appropriate.

However, it is a good candidate for type-2 data-correction. The script below works by incrementing the size_needed variable set around line 86 of glibc nss/digits_dots.c, so as to account for the missing sizeof (*h_alias_ptr). This makes the subsequent comparisons work and return error codes for buffer-overflow situations.

85
86 size_needed = (sizeof (*host_addr)
87             + sizeof (*h_addr_ptrs) + strlen (name) + 1);
88
89 if (buffer_size == NULL)
90   {
91     if (buflen < size_needed)
92       {
93         if (h_errnop != NULL)
94           *h_errnop = TRY_AGAIN;
95         __set_errno (ERANGE);
96         goto done;
97       }
98   }
99 else if (buffer_size != NULL && *buffer_size < size_needed)
100  {
101    char *new_buf;
102    *buffer_size = size_needed;
103    new_buf = (char *) realloc (*buffer, *buffer_size);

The script demonstrates another unusual technique. The variable in need of correction (size_needed) is set deep within a particular function, so we need “statement” probes to place the fix before the bad value is used. Because of compiler optimizations, the exact line number where the probe may be placed can’t be known a priori, so we ask systemtap to try a whole range. The probe handler then protects itself against being invoked more than once (per function call) using an auxiliary flag array.

global added%
global trap = 1 # stap -G trap=0 to only trace, not fix
probe process("/lib*/libc.so.6").statement("__nss_hostname_digits_dots@*:87-102")
{
    if (! added[tid()])
    {
        added[tid()] = 1; # we only want to add once
        printf("%s[%d] BOO! size_needed=%d ", execname(), tid(),
 $size_needed)
        if (trap)
        {
            # The &@cast() business is a fancy sizeof(uintptr_t),
            # which makes this script work for both 32- and 64-bit glibc's.
            $size_needed = $size_needed + &@cast(0, "uintptr_t")[1]
            printf("ghostbusted to %d", $size_needed)
        }
    printf("\n")
    }
}

probe process("/lib*/libc.so.6").function("__nss_hostname_digits_dots").return
{
    delete added[tid()] # reset for next call
}

This type-2 band-aid allows applications to operate as though glibc were patched.

Conclusions

We hope you enjoyed this foray into systemtap and its unexpected application as a potential band-aid for security bugs. If you would like to learn more, read our documentation, contact our team, or just go forth and experiment. If this technology seems like a fit for your installation and situation, consult your vendor for a possible systemtap band-aid.

JSON, Homoiconicity, and Database Access

During a recent review of an internal web application based on the Node.js platform, we discovered that combining JavaScript Object Notation (JSON) and database access (database query generators or object-relational mappers, ORMs) creates interesting security challenges, particularly for JavaScript programming environments.

To see why, we first have to examine traditional SQL injection.

Traditional SQL injection

Most programming languages do not track where strings and numbers come from. Looking at a string object, it is not possible to tell if the object corresponds to a string literal in the source code, or input data which was read from a network socket. Combined with certain programming practices, this lack of discrimination leads to security vulnerabilities. Early web applications relied on string concatenation to construct SQL queries before sending them to the database, using Perl constructs like this to load a row from the users table:

# WRONG: SQL injection vulnerability
$dbh->selectrow_hashref(qq{
  SELECT * FROM users WHERE users.user = '$user'
})

But if the externally supplied value for $user is "'; DROP TABLE users; --", instead of loading the user, the database may end up deleting the users table, due to SQL injection. Here’s the effective SQL statement after expansion of such a value:

  SELECT * FROM users WHERE users.user = ''; DROP TABLE users; --'

Because the provenance of strings is not tracked by the programming environment (as explained above), the SQL database driver only sees the entire query string and cannot easily reject such crafted queries.

Experience showed again and again that simply trying to avoid pasting untrusted data into query strings did not work. Too much data which looks trustworthy at first glance turns out to be under external control. This is why current guidelines recommend parametrized queries (sometimes also called prepared statements), where the SQL query string is (usually) a string literal and the variable parameters are kept separate, combined only in the database driver itself (which has the necessary database-specific knowledge to perform any required quoting of the variables).
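
For illustration, here is the earlier Perl lookup rewritten with a placeholder, assuming the standard DBI interface; the driver receives the SQL text and the value separately:

# Parametrized query: $user never becomes part of the SQL text
$dbh->selectrow_hashref(qq{
  SELECT * FROM users WHERE users.user = ?
}, undef, $user)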

Homoiconicity and Query-By-Example

Query-By-Example is a way of constructing database queries based on example values. Consider a web application as an example. It might have a users table, containing columns such as user_id (a serial primary key), name, password (we assume the password is stored in the clear, even though this practice is debatable), a flag that indicates if the user is an administrator, a last_login column, and several more.

We could describe a concrete row in the users table like this, using JavaScript Object Notation (JSON):

{
  "user_id": 1,
  "name": "admin",
  "password": "secret",
  "is_admin": true,
  "last_login": 1431519292
}

The query-by-example style of writing database queries takes such a row descriptor, omits the unknown parts, and treats the rest as column values to match. We could check user name and password during a login operation like this:

{
  "name": "admin",
  "password": "secret",
}

If the database returns a row, we know that the user exists, and that the login attempt has been successful.

But we can do better. With some additional syntax, we can even express query operators. We could select the regular users who have logged in today (“1431475200” refers to midnight UTC on the day in question, and "$gte" stands for “greater than or equal”) with this query:

{
  "last_login": {"$gte": 1431475200},
  "is_admin": false
}

This is in fact the query syntax used by Sequelize, an object-relational mapping (ORM) tool for Node.js.

This achieves homoiconicity, a property of a programming environment in which code (here: database queries) and data look very much alike, roughly speaking, and can be manipulated with similar programming language constructs. It is often hailed as a primary design achievement of the programming language Lisp. Homoiconicity makes query construction with the Sequelize toolkit particularly convenient. But it also means that there are no clear boundaries between code and data, similar to the old way of constructing SQL query strings using string concatenation, as explained above.

Getting JSON To The Database

Some server-side programming frameworks, notably those for Node.js, automatically decode the bodies of POST requests of content type application/json into JavaScript objects. In Node.js, these JSON-derived objects are indistinguishable from other objects created by the application code. In other words, there is no marker class or other attribute which makes it possible to tell apart objects which come from inputs and objects which were created by (for example) object literals in the source.
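
A minimal sketch of this automatic decoding, assuming an Express application with the body-parser middleware (names and port are illustrative):

var express = require("express");
var bodyParser = require("body-parser");

var app = express();
// Bodies of content type application/json are decoded into req.body.
app.use(bodyParser.json());

app.post("/user/auth", function (req, res) {
  // req.body is a plain JavaScript object; nothing marks it as external input.
  console.log(req.body);
  res.end();
});

app.listen(3000);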

Here is a simple example of a hypothetical login request. When Node.js processes the POST request shown below, it assigns a JavaScript object to the req.body field in exactly the same way the application code shown after it does.

POST request:

POST /user/auth HTTP/1.0
Content-Type: application/json

{"name":"admin","password":"secret"}

Application code:

req.body = {
  name: "admin",
  password: "secret"
}

In a Node.js application using Sequelize, the application would first define a model, User, and then use it as part of the authentication procedure, in code similar to this (for the sake of this example, we still assume the password is stored in plain text; the reason for that will become clear immediately):

User.findOne({
  where: {
    name: req.body.name,
    password: req.body.password
  }
}).then(function (user) {
  if (user) {
    // We got a user object, which means that login was successful.
    …
  } else {
    // No user object, login failure.
    …
  }
})

The where object is the query-by-example part.

However, this construction has a security issue which is very difficult to fix. Suppose that the POST request looks like this instead:

POST /user/auth HTTP/1.0
Content-Type: application/json

{
  "name": {"$gte": ""},
  "password": {"$gte": ""}
}

This means that Sequelize will be invoked with the following query, in which the name and password values came verbatim from the POST request, although nothing about them indicates that to the Sequelize code:

User.findOne({
  where: {
    name: {"$gte": ""},
    password: {"$gte": ""}
  }
})

Sequelize will translate this into a query similar to this one:

SELECT * FROM users where name >= ''  AND password >= '';

Any string is greater than or equal to the empty string, so this query will find any user in the system, regardless of the user name or password. Unless there are other constraints imposed by the application, this allows an attacker to bypass authentication.

What can be done about this? Unfortunately, not much. Validating POST request contents and checking that all the values passed to database queries are of the expected type (string, number or Boolean) mitigates individual injection issues, but the experience with SQL injection mentioned at the beginning of this post suggests that this is not likely to work out in practice, particularly in Node.js, where so much data is exposed as JSON objects. Another option would be to break homoiconicity and mark in the query syntax where the query begins and the data ends; getting this right is a bit tricky. Other Node.js database frameworks do not describe query structure in terms of JSON objects at all; Knex.js and Bookshelf.js are in this category.
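
As a sketch of that kind of value checking (validateScalar is a hypothetical helper, not part of Sequelize or Node.js):

// Reject anything that is not a plain scalar, so that crafted objects
// such as {"$gte": ""} never reach the query layer.
function validateScalar(value) {
  var t = typeof value;
  if (t !== "string" && t !== "number" && t !== "boolean") {
    throw new Error("unexpected value type: " + t);
  }
  return value;
}

User.findOne({
  where: {
    name: validateScalar(req.body.name),
    password: validateScalar(req.body.password)
  }
}).then(function (user) {
  // … as before
});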

Due to the prevalence of JSON, such issues are most likely to occur in Node.js applications and frameworks. However, as early as July 2014, Kazuho Oku described a JSON injection issue in the SQL::Maker Perl package, discovered by his colleague Toshiharu Sugiyama.

Update (2015-05-26): After publishing this blog post, we learned that a very similar issue has also been described in the context of MongoDB: Hacking NodeJS and MongoDB.

Other fixable issues in Sequelize

Sequelize overloads the findOne method with a convenience feature for primary-key-based lookup. This encourages programmers to write code like this:

User.findOne(req.body.user_id).then(function (user) {
  … // Process results.
})

This allows attackers to ship a complete query object (with the “{where: …}” wrapper) in a POST request. Even with strict query-by-example queries, this can be abused to probe the values of normally inaccessible table columns. This can be done efficiently using comparison operators (with one bit leaking per query) and binary search.
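
For instance, against the findOne(req.body.user_id) pattern above, a hypothetical probing request could wrap a comparison in the where object; whether the login succeeds then reveals whether any user’s password sorts at or above the probe value, and repeating the request with adjusted boundaries implements the binary search:

POST /user/auth HTTP/1.0
Content-Type: application/json

{"user_id": {"where": {"password": {"$gte": "m"}}}}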

But there is another issue. This construct

User.findOne({
  where: "user_id IN (SELECT user_id " +
    "FROM blocked_users WHERE unblock_time IS NULL)"
}).then(function (user) {
  … // Process results.
})

pastes the where string directly into the generated SQL query (here it is used to express something that would be difficult to do directly in Sequelize, say, because the blocked_users table is not modeled). With the “findOne(req.body.user_id)” example above, a POST request such as

POST /user/auth HTTP/1.0
Content-Type: application/json

{"user_id":{"where":"0=1; DROP TABLE users;--"}}

would result in a generated query in which the injected parts come straight from the request:

SELECT * FROM users WHERE 0=1; DROP TABLE users;--;

(This will not work with some databases and database drivers which reject multi-statement queries. In such cases, fairly efficient information leaks can be created with sub-queries and a binary search approach.)

This is not a defect in Sequelize; it is a deliberate feature. Perhaps it would be better if this functionality were not reachable with plain JSON objects. Sequelize already supports marker objects for including literals, and a similar marker object could be used for verbatim SQL.

The Sequelize upstream developers have mitigated the first issue in version 3.0.0. A new method, findById (with an alias, findByPrimary), has been added, which queries exclusively by primary keys (“{where: …}” queries are not supported). At the same time, the search-by-primary-key automation has been removed from findOne, forcing applications to choose explicitly between primary-key lookup and full JSON-based query expressions. This explicit choice means that the second issue (although not completely removed from version 3.0.0) is no longer directly exposed. But as expected, altering the structure of a query by introducing JSON constructs (as with the "$gte" example) is still possible, and to prevent that, applications have to check the JSON values that they put into Sequelize queries.
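
With that API, the earlier primary-key lookup would be written explicitly; a sketch against the Sequelize 3.x interface:

// Primary-key lookup only; JSON query expressions in user_id are not interpreted.
User.findById(req.body.user_id).then(function (user) {
  … // Process results.
});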

Conclusion

JSON-based query-by-example expressions can be an intuitive way to write database queries. However, this approach, when taken further and enhanced with operators, can lead to a reemergence of injection issues which are reminiscent of SQL injection, something these tools try to avoid by operating at a higher abstraction level. If you, as an application developer, decide to use such a tool, then you will have to make sure that data passed into queries has been properly sanitized.