Overview

Build a system in which the operating system does not realize it is running in multiple locations, on different hardware, in a distributed fashion.

Current “Single System Image” distributed operating systems are generally concerned with IPC and checkpointing of processes. Mostly they replace and extend library functions that were originally designed to facilitate threading and multi-process communication on a single physical machine. These solutions require large modifications and patches to the underlying operating system, and while every attempt is made to hide their existence from userland processes, they invariably end up exposing the very calls and hooks that they attempted to avoid exposing in the first place.

That is to say, the definition of “Single System Image” is generic enough to allow for additional calls and hooks, but most projects attempt to hide these complex systems from user-space programs. The ideal version of a distributed operating system for certain workloads, and I would argue for most distributed workloads today, tolerates physical and network failures such that the operating system and all processes above it continue to run uninterrupted, unaware of any problem.

Idea

It should be possible to construct a system such that:

  • Any operating system can be run, since we’re exposing virtualized resources that emulate hardware.
  • The operating system is unaware, or optionally aware, that it is running in a distributed environment.
  • All resources that are exposed through the virtualized hardware abstraction layer are distributed.
    • Network Devices
    • HID Devices
    • Block Devices
    • System Memory
    • CPU and related computing resources
  • Node failure at a physical or network layer can be detected and hidden from the operating system above.
    • This is completely optional, and configurable.

Example 1 - A distributed LDAP server

A three node system is constructed to facilitate a distributed hardware abstraction layer (HAL). These nodes are configured such that everything from CPU registers, to main system memory, to packet buffers on the network card is only acted upon when quorum between the nodes is reached.

Linux is installed; the operating system sees a single CPU with 512MB of RAM. The instruction set on the CPU is nothing special. The peripherals that it sees are:

  • A serial console
  • A network card
  • A single hard drive

OpenLDAP is installed and configured; it stores data on the hard drive, and in system memory as a cache. It binds to a port and is exposed and available through the network card.

An external system (think of a developer laptop or similar, outside the distributed system) sends an ARP request that ends up arriving at the virtualized network device. Because the virtual network device is distributed, layer-2 frames (or similar) appear as the input.

The ARP request is ingested at one of the physical nodes, and presented to the operating system once the same information has been confirmed by quorum.
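A minimal sketch of that ingestion path, assuming Python, the psycopg2 driver (CockroachDB speaks the PostgreSQL wire protocol), and an invented nic_rx table; since a committed write in CockroachDB implies replication to a quorum of nodes, the frame is handed to the guest only after the INSERT returns:

    import psycopg2

    # CockroachDB speaks the PostgreSQL wire protocol; 26257 is its default
    # port. The connection string and the nic_rx table are assumptions.
    conn = psycopg2.connect("postgresql://root@localhost:26257/hal?sslmode=disable")
    conn.autocommit = True

    def deliver_to_guest(frame: bytes) -> None:
        # Placeholder: push the frame into the emulator's virtual NIC FIFO.
        print(f"guest rx: {len(frame)} bytes")

    def ingest_frame(frame: bytes) -> None:
        """Persist an inbound layer-2 frame; the INSERT returning means a
        quorum of nodes has durably replicated the row."""
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO nic_rx (frame) VALUES (%s)",
                (psycopg2.Binary(frame),),
            )
        # Only now is it safe to present the frame to the guest OS.
        deliver_to_guest(frame)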

The operating system (Linux) operates on the frame, and eventually a response is generated and handed to the virtualized network card to send. That response is seen by all hosts connected to any of the bridged physical hosts.

The external system at this point sees the ARP response and sends a TCP request; the process repeats in such a way that an LDAP transaction can occur.

In the middle of processing the LDAP query from the external client, the network link of one of the three nodes is unplugged. The underlying virtualization system detects this through CockroachDB’s own mechanisms. The virtualization system requires no input from the hosted operating system, and the resources are unchanged, so the transaction completes, with the results being sent to the virtualized ethernet device all the same.
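One hedged way the virtualization layer might notice the failed node through CockroachDB itself is sketched below; the crdb_internal.gossip_liveness virtual table exists, but its exact columns vary by CockroachDB version, so treat this as an assumption:

    import psycopg2

    conn = psycopg2.connect("postgresql://root@localhost:26257/hal?sslmode=disable")

    def live_node_ids() -> list[int]:
        """Ask the cluster which nodes it currently considers usable.
        crdb_internal tables are version-dependent; this is a sketch."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT node_id FROM crdb_internal.gossip_liveness "
                "WHERE draining = false AND decommissioning = false"
            )
            return [row[0] for row in cur.fetchall()]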

This is a trivial example, but it demonstrates that any arbitrary process could be set up to be long-lived and distributed without needing to be aware of these facts.

Example 2 - A distributed email system

A distributed virtualization system is constructed such that there are three nodes, and quorum must be maintained for any transaction. The entire HAL is distributed, and every resource exposed to the guest operating system in fact lives in a database, from the virtualized FIFO packet buffers on a network card to the internal registers of a CPU and their corresponding values.
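A hedged sketch of what that database-backed HAL state could look like, using invented table and column names with CockroachDB’s SQL dialect, driven from Python:

    import psycopg2

    conn = psycopg2.connect("postgresql://root@localhost:26257/hal?sslmode=disable")
    conn.autocommit = True

    # Invented schema: one table per resource class the HAL exposes.
    DDL = """
    CREATE TABLE IF NOT EXISTS cpu_registers (name STRING PRIMARY KEY, value INT8);
    CREATE TABLE IF NOT EXISTS memory_pages  (page INT8  PRIMARY KEY, data BYTES);
    CREATE TABLE IF NOT EXISTS nic_fifo (
        id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        direction STRING,  -- 'rx' or 'tx'
        frame     BYTES,
        created   TIMESTAMPTZ DEFAULT now()
    );
    """

    with conn.cursor() as cur:
        cur.execute(DDL)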

Some operating system is installed on this system; software is installed on it that communicates via the SMTP protocol. The system is set up to store files and information through web calls, such as to AWS S3.

External to the distributed system, the MX records of a particular domain are set to external endpoints that map to the three nodes of the distributed system. NAT is used to translate the specific external IP addresses seen as destinations outside the distributed system into a single standardized internal address. NATing the packets outside the distributed system allows the guest operating system to remain unaware that it is running in a distributed fashion.
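In production this translation would likely live in the host’s NAT layer, but the core of it is a lookup and rewrite; a minimal Python sketch with hypothetical addresses (checksum recomputation omitted):

    import ipaddress

    # Hypothetical mapping: each node's external address rewrites to the one
    # internal address the guest believes it owns.
    EXTERNAL_TO_INTERNAL = {
        ipaddress.ip_address("203.0.113.10"): ipaddress.ip_address("10.0.0.2"),
        ipaddress.ip_address("203.0.113.11"): ipaddress.ip_address("10.0.0.2"),
        ipaddress.ip_address("203.0.113.12"): ipaddress.ip_address("10.0.0.2"),
    }

    def rewrite_dst(packet: bytearray) -> bytearray:
        """Rewrite the IPv4 destination address in place; in an IPv4 header
        the destination occupies bytes 16..20. Checksums are not fixed up."""
        dst = ipaddress.ip_address(bytes(packet[16:20]))
        internal = EXTERNAL_TO_INTERNAL.get(dst)
        if internal is not None:
            packet[16:20] = internal.packed
        return packet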

As an email is received by the distributed system through the virtual ethernet card over multiple packets, all information is stored in the database. Because of this consistency, a physical node failing would not result in loss of data.

If multiple advertisements of the same IP space were used, individual packets would simply be routed to another node that had not failed, allowing for the individual TCP connection to persist in the event of node failure.

In this example, the operating system accepts the SMTP email and sends a response, which is handed back to the virtualized network card and processed as outgoing packets. During this time it is free to asynchronously hit an external HTTP API such as AWS S3 to store the information.

The resulting system provides a fault-tolerant email server that utilizes external third-party storage mechanisms, which may offer specific features outside the scope of this distributed system.

Example 3 - Shared compute environments

Similar to the above examples, a distributed virtualization environment is constructed with three physically separated nodes. A guest operating system is installed and presented with resources that are, in fact, backed by a distributed, concurrent, and fault-tolerant database. Everything from CPU registers to disk access is distributed.

Because the operating system can only be accessed through the distributed virtualized environment, it can be presented with a HAL (hardware abstraction layer) exposing multiple input and output devices. It would be possible to provide multiple keyboards, for example. It is also possible to MUX multiple external keyboard events into a single virtualized keyboard, as sketched below.
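A sketch of that MUXing, assuming Python threads and a shared queue standing in for the single virtualized HID device:

    import queue
    import threading

    # One shared queue stands in for the single virtualized keyboard device.
    virtual_keyboard: "queue.Queue[tuple[str, str]]" = queue.Queue()

    def pump_external_keyboard(user: str, events: str) -> None:
        """Forward one user's key events into the shared virtual device."""
        for key in events:
            virtual_keyboard.put((user, key))

    # Two hypothetical users typing concurrently:
    threading.Thread(target=pump_external_keyboard, args=("alice", "ls\n")).start()
    threading.Thread(target=pump_external_keyboard, args=("bob", "vi\n")).start()

    # The guest-facing side drains the queue as a single HID event stream.
    for _ in range(6):
        user, key = virtual_keyboard.get()
        print(f"guest sees key {key!r} (from {user})")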

This, in addition to the network card exposed for remote access, allows multiple concurrent users to access the distributed virtualized resources by way of the guest operating system.

The effect is that multiple users could be manipulating or utilizing the guest operating system, say a shared development system, in a distributed fashion. It would be possible to construct a system whereby resources were dynamically exposed to the guest operating system as developers showed up to work, with increased performance made possible by exposing new CPU resources, or perhaps more memory.

Use cases

Some workloads are safety critical, in that they control physical devices or mechanisms that could possibly cause harm to people, places, or things. Currently these systems are designed to ‘fail-safe’; that is, to move into a defined mode of operation that stays within some bounds. Using a distributed operating system, these safety-critical applications could be constructed in such a way as to “never fail”; that is, if the system they run on is itself distributed, the processes can be guaranteed to be distributed as well, without the need to program specific distributed architectures.

For example, in a distributed system that controls access to physical portals such as door locks, a distributed operating system could be used to facilitate multiple control points, whereby, if any one of those control points is compromised or disconnected, the distributed system proactively removes that node from the system. This holds so long as each door is controlled via a network that each control room is independently wired into.

Challenges and issues

Latency for specific resources

The most obvious challenge with constructing a system of this kind is the latency between nodes of the distributed system. Certain resources, CPU registers in particular, are accessed and changed very rapidly, and imposing strict distributed quorum on them will reduce the effective clock rate of the virtualized hardware substantially. Construction of a prototype system should be done on a low-latency network: assuming a 0.2ms quorum round-trip, the estimated clock rate is 1 / 0.0002s ≈ 5kHz.

To address this, steps could be taken to group or bunch operations, with many operations being queued up. As long as the queue of operations is agreed upon ahead of time, operations could be run without quorum, assuming the resources that those operations modify are bounded and known. That is, it should be possible to perform some kind of introspection on the resources that certain code paths modify. Once this is done, the distributed virtualization system need only take those changes into account, and be able to replicate them on another node if required.
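A sketch of the batching idea, using a hypothetical flattened hal_state table (resource name to value); committing a pre-agreed batch in one transaction means one quorum round amortized across every operation in it:

    import psycopg2

    conn = psycopg2.connect("postgresql://root@localhost:26257/hal?sslmode=disable")

    def commit_batch(ops: list[tuple[str, int]]) -> None:
        """Replicate a pre-agreed batch of (resource, value) writes in one
        transaction: one COMMIT, and therefore one quorum round, for all."""
        with conn:  # transaction scope; COMMIT on clean exit
            with conn.cursor() as cur:
                cur.executemany(
                    "UPSERT INTO hal_state (resource, value) VALUES (%s, %s)",
                    ops,
                )

    # At 5,000 quorum rounds/s (0.2ms each), batches of 1,024 operations
    # give roughly 5,000 * 1,024 ≈ 5.1M operations/s instead of ~5,000.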

Implementation

Taking a quick look at the languages supported by both Unicorn and CockroachDB, Python or Rust seems like a good starting point. It appears as though most of the configuration of the failover mechanisms and current state is available through the SQL interface.

There are existing Unicorn projects that aim to log everything into a database, providing hooks around the normal execution of binaries. This would be a logical place to start. Perhaps look at ways to abstract all the HAL systems in Unicorn back to a single dynamic call that takes some identifier of the resource, whether the call is a read or a write, and optional data. This would allow a single communication point between the two systems, with all reads resolved via a SQL query, and all writes translated into a SQL UPDATE or INSERT.
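A minimal sketch of that single dynamic call, wiring Unicorn’s memory hooks to the same hypothetical hal_state table as in the batching sketch above; hal_access and the memory map are assumptions, while the hook signatures are Unicorn’s real Python API:

    import psycopg2
    from unicorn import (Uc, UC_ARCH_ARM, UC_MODE_THUMB,
                         UC_HOOK_MEM_READ, UC_HOOK_MEM_WRITE)

    conn = psycopg2.connect("postgresql://root@localhost:26257/hal?sslmode=disable")
    conn.autocommit = True

    def hal_access(resource: str, is_write: bool, data=None):
        """The single communication point: reads resolve via SELECT, writes
        via UPSERT (CockroachDB's insert-or-update statement)."""
        with conn.cursor() as cur:
            if is_write:
                cur.execute(
                    "UPSERT INTO hal_state (resource, value) VALUES (%s, %s)",
                    (resource, data),
                )
                return None
            cur.execute("SELECT value FROM hal_state WHERE resource = %s",
                        (resource,))
            row = cur.fetchone()
            return row[0] if row else None

    # Unicorn's memory-hook signature: (uc, access, address, size, value,
    # user_data).
    def on_mem_write(uc, access, address, size, value, user_data):
        hal_access(f"mem:{address:#x}", True, value)

    def on_mem_read(uc, access, address, size, value, user_data):
        hal_access(f"mem:{address:#x}", False)

    mu = Uc(UC_ARCH_ARM, UC_MODE_THUMB)
    mu.mem_map(0x2000_0000, 0x1000)  # hypothetical SRAM region
    mu.hook_add(UC_HOOK_MEM_WRITE, on_mem_write)
    mu.hook_add(UC_HOOK_MEM_READ, on_mem_read)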

Starting with a very simple and small architecture would ease this process; perhaps something along the lines of MIPS or an ARM Cortex-M0 core. The ideal proof of concept would be a simple ARM core running something like FreeRTOS.
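For scale, a bare Unicorn ARM core in Thumb mode (the instruction set a Cortex-M0 runs) is only a few lines; a possible skeleton to grow from before loading anything like a FreeRTOS image:

    from unicorn import Uc, UC_ARCH_ARM, UC_MODE_THUMB
    from unicorn.arm_const import UC_ARM_REG_R0

    CODE = b"\x01\x30"  # Thumb encoding of: adds r0, #1

    mu = Uc(UC_ARCH_ARM, UC_MODE_THUMB)
    mu.mem_map(0x1000, 0x1000)
    mu.mem_write(0x1000, CODE)
    mu.reg_write(UC_ARM_REG_R0, 41)
    # The low bit of the start address selects Thumb state in Unicorn.
    mu.emu_start(0x1000 | 1, 0x1000 + len(CODE))
    print(mu.reg_read(UC_ARM_REG_R0))  # prints 42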