Design and Best Practices

Quick points:

Uses a key-value store
Objects are represented as JSON
Uses watchers on a key or range of keys to monitor for any updates

This document focuses on the semantics of transactions rather than on their API. To keep things simple, the examples below internally use a Transaction wrapper. For an overview of the formal API, refer to Entities and Operations.

Transaction Basics

The SDP configuration database interface is built around the concept of transactions, i.e. blocks of read and write queries to the database state that are guaranteed to be executed atomically. For example, consider this code:

for txn in config.txn():
   a = txn.get('a')
   if a is None:
       txn.create('a', '1')
   else:
       txn.update('a', str(int(a)+1))

It is guaranteed that we increment the 'a' key exactly once here, no matter how many other processes might be operating on it. How does this work?

The way transactions are implemented follows the philosophy of Software Transactional Memory as opposed to a lock-based implementation. The idea is that all reads are performed, but all writes are actually delayed until the end of the transaction. So in the above example, 'a' is actually read from the database using get, but the writes performed by create or update do not happen immediately.

Once the transaction finishes (the end of the for loop), an implicit commit operation sends a single request to the database that updates all written values only if none of the values that were read have been written in the meantime. If the commit fails, we repeat the transaction (that’s why it is a loop!) until it succeeds. The idea is that this is fairly rare, and repeating the transaction should typically be cheap.
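To make the commit-and-retry cycle concrete, here is a minimal in-memory sketch of the same idea. It is not the configuration database implementation: ToyStore, ToyTxn and the put method are hypothetical names invented for this illustration, and a per-key write counter stands in for the database's revision check.

```python
class ToyStore:
    """Illustrative in-memory key-value store with a per-key write counter."""

    def __init__(self):
        self.values = {}     # key -> current value
        self.revisions = {}  # key -> number of writes so far

    def write(self, key, value):
        self.values[key] = value
        self.revisions[key] = self.revisions.get(key, 0) + 1

    def txn(self):
        """Yield fresh transactions until one of them commits."""
        while True:
            txn = ToyTxn(self)
            yield txn
            # Execution resumes here once the body of the caller's loop is done.
            if txn.commit():
                return


class ToyTxn:
    """Hypothetical transaction: reads are immediate, writes are delayed."""

    def __init__(self, store):
        self._store = store
        self._reads = {}   # key -> write counter seen when the key was read
        self._writes = {}  # key -> value to write at commit time

    def get(self, key):
        self._reads[key] = self._store.revisions.get(key, 0)
        return self._writes.get(key, self._store.values.get(key))

    def put(self, key, value):
        self._writes[key] = value  # nothing reaches the store yet

    def commit(self):
        # Refuse to commit if any key we read has been written in the meantime.
        for key, revision in self._reads.items():
            if self._store.revisions.get(key, 0) != revision:
                return False
        for key, value in self._writes.items():
            self._store.write(key, value)
        return True


store = ToyStore()
store.write('a', '1')
attempts = 0
for txn in store.txn():
    attempts += 1
    a = txn.get('a')
    if attempts == 1:
        store.write('a', '5')  # simulate another process writing 'a'
    txn.put('a', str(int(a) + 1))
```

In this run the first attempt reads '1', but the simulated concurrent write invalidates that read, so the commit is rejected and the loop body runs a second time against the new value.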

Usage Guidelines

What does this mean for everyday usage? Transactions should be as self-contained as possible - i.e. they should explicitly contain all assumptions about the database state they are making. If we wrote the above transaction as follows:

for txn in config.txn():
   a = txn.get('a')

for txn in config.txn():
   if a is None:
       txn.create('a', '1')
   else:
       txn.update('a', str(int(a)+1))

A whole number of things could happen between the first and the second transaction:

(1) The 'a' key could not exist in the first transaction, but could have been created by the second (which would cause us to fail)
(2) The 'a' key could exist in the first transaction, but could have been deleted by the second (which would also cause the above to fail)
(3) Another transaction might have updated the 'a' key with a new value (which would cause that update to be lost)

A rule of thumb is that you should assume nothing about the database state at the start of a transaction. If you rely on something, you need to (re)query it after you enter it. If for some reason you couldn’t merge the transactions above, you should write something like:

for txn in config.txn():
   a = txn.get('a')

for txn in config.txn():
   assert txn.get('a') == a, "database state independently updated!"
   if a is None:
       txn.create('a', '1')
   else:
       txn.update('a', str(int(a)+1))

This would especially catch case (3) above. This sort of approach can be useful when we want to make sub-transactions that only depend on a part of the overall state:

for txn in config.txn():
    keys = txn.list_keys('/as/')
for key in keys:
    for txn in config.txn():
        a = txn.get(key)
        # Safety check: Path might have vanished in the meantime!
        if a is None:
            break
        # ... do something that depends solely on existence of "key" ...

This can especially be combined with watchers (see below) to keep track of many objects without requiring huge transactions.

Wrapping transactions

The safest way to work with transactions is to make them as “large” as possible, spanning all the way from getting inputs to writing outputs. This should be the default unless we have a strong reason to do it differently (examples for such reasons would be transactions becoming too large, or transactions taking so long that they never finish - but either should be extremely rare).

However, in the context of a program with complex behaviour this might appear cumbersome: This means we have to pass the transaction object to every single method that could either read or write the state. An elegant way to get around this is to move such methods to a “model” class that wraps the transaction itself:

class IncrementModel:
    def __init__(self, txn):
        self._txn = txn
    def increase(self, key):
        a = self._txn.get(key)
        if a is None:
            self._txn.create(key, '1')
        else:
            self._txn.update(key, str(int(a)+1))
    def list_objects(self):
        return self._txn.list_keys("/a")
    def some_check(self, obj):
        return True

# ...
for txn in config.txn():
   model = IncrementModel(txn)
   model.increase('a')

In fact, we can provide factory functions that entirely hide the transaction object from view:

def increment_txn(config):
    for txn in config.txn():
        yield IncrementModel(txn)

# ...
for model in increment_txn(config):
   model.increase('a')

We could wrap this model the same way again to build as many abstraction layers as we want - key is that high-level methods such as “increase” are now directly tied to the existence of a transaction object.

Dealing with roll-backs

Especially as we start wrapping transactions more and more, we must keep in mind that while we can easily “roll back” any writes of the transaction (as they are not actually performed immediately), the same might not be true for program state. So for instance, the following would be unsafe:

to_update = ['a','b','c']
for model in increment_txn(config):
    while to_update:
        model.increase(to_update.pop())

Clearly this transaction would work differently the second time around! For this reason, keep in mind that while we expect the body of the for loop to execute only once, it is entirely possible that it executes multiple times, and the code should be written accordingly.

Fortunately, this sort of occurrence should be relatively rare - the following might be more typical:

objects_found = []
for model in increment_txn(config):
    for obj in model.list_objects():
        if model.some_check(obj):
            LOGGER.debug(f'Found {obj}!')
            objects_found.append(obj)

In this case, objects_found might contain duplicate objects if the transaction repeats - which could be easily fixed by moving the initialisation into the for loop.
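The effect of moving the initialisation can be sketched without a database. In the following illustration, fake_txn_loop is a hypothetical stand-in for a transaction loop whose commit fails a given number of times. It simply yields the same listing on every attempt.

```python
def fake_txn_loop(attempts):
    """Hypothetical stand-in for a transaction loop that runs `attempts` times."""
    for _ in range(attempts):
        # What model.list_objects() would return on each attempt
        yield ["obj1", "obj2"]


def collect_unsafe(attempts):
    objects_found = []  # initialised once, outside the transaction loop
    for objects in fake_txn_loop(attempts):
        for obj in objects:
            objects_found.append(obj)
    return objects_found


def collect_safe(attempts):
    objects_found = []
    for objects in fake_txn_loop(attempts):
        objects_found = []  # reset on every attempt, so repeats do not accumulate
        for obj in objects:
            objects_found.append(obj)
    return objects_found


unsafe_result = collect_unsafe(2)
safe_result = collect_safe(2)
```

With two attempts, the unsafe variant ends up with every object listed twice, whereas the safe variant only keeps the results of the final attempt.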

On the other hand, note that a repeated transaction would also emit the 'Found ...' log lines more than once, which might be seen as confusing. In this case, this is relatively benign and therefore likely acceptable. It might be possible to generate log messages at the start and end of transactions to make this more visible.

Another possible approach could be to replicate the transaction behaviour: for example, we could route the logging calls through IncrementModel, which would internally collect the log lines, and increment_txn could then emit them in one go once the transaction actually goes through.
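One possible shape for such a buffering wrapper is sketched below. BufferingModel and buffered_txn are hypothetical names, and the sketch relies on one assumption: the wrapped transaction loop only finishes iterating once an attempt has been committed, as described for the transaction loop earlier.

```python
class BufferingModel:
    """Hypothetical model that collects log lines instead of emitting them."""

    def __init__(self, txn):
        self._txn = txn
        self.pending = []

    def debug(self, message):
        self.pending.append(message)  # buffered until the transaction commits


def buffered_txn(txn_loop, emit):
    """Wrap a transaction loop and emit log lines of the committed attempt only."""
    model = None
    for txn in txn_loop:
        model = BufferingModel(txn)  # fresh buffer: lines of failed attempts are dropped
        yield model
    # Assumption: the wrapped loop ends only after a successful commit.
    if model is not None:
        for line in model.pending:
            emit(line)


emitted = []
# iter(range(2)) stands in for a transaction loop that needs two attempts
for model in buffered_txn(iter(range(2)), emitted.append):
    model.debug('Found obj1!')
```

Even though the loop body ran twice, only the lines of the final attempt are emitted. Note that if the caller leaves the loop early with break, nothing is emitted in this sketch.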

Watchers

Occasionally we might want to actively track something in the configuration. For sake of example, let’s say we want to wait for a key to appear so we can print it. A simple implementation using polling might look like the following:

import time

while True:
    for txn in config.txn():
        line = txn.get('/line_to_print')
        if line is not None:
            txn.delete('/line_to_print')
    if line is not None:
        print(line)
    time.sleep(1)

(Note that we are making sure to print outside the transaction loop - otherwise lines might get printed multiple times if we were running more than one instance of this program in parallel!)

But clearly this is not very good - it re-queries the database every second, which adds database load and is pretty slow. Instead, we can use a watcher loop:

for watcher in config.watcher():
    for txn in watcher.txn():
        line = txn.get('/line_to_print')
        if line is not None:
            txn.delete('/line_to_print')
    if line is not None:
        print(line)

Note that we are calling txn on the watcher instead of config: What is happening here is that the watcher object collects keys read by the transaction, and only iterates once one of them has been written. It is a concept that has a lot in common with the transaction loop, except that while the transaction loop only iterates if the transaction is inconsistent, the watcher loop always iterates.

Note that you can have multiple separate transactions within a watcher loop, which however are not guaranteed to be consistent. For example:

for watcher in config.watcher():
    for txn in watcher.txn():
        line = txn.get('/line_to_print')
    print('A:', line)
    for txn in watcher.txn():
        line = txn.get('/line_to_print')
    print('B:', line)

In this program we might get different results for A and B. However, the watcher does guarantee that the loop will iterate if any of the read values have been invalidated. So if the line was deleted between the two transactions, the following output would be generated:

A: something
B: None
A: None
B: None

After all, while transaction B had a current view of the situation the first time around, the view of transaction A became out-of-date.

By default, the watcher only iterates if any key read by a watcher transaction has been written. This may take an arbitrary amount of time (possibly forever), hence we can “force” the watcher loop to go to its next iteration via two methods. A default timeout can be set either upon initiation:

for watcher in etcd3.watcher(timeout=60):
    ...

or manually with the watcher.set_timeout(<new_timeout>) method. The timeout is valid for the whole life-cycle of the watcher. Alternatively, you can set a “wake-up call”, on a loop-by-loop basis, using the watcher.set_wake_up_at(<value_of_alarm>) method, where the alarm is specified as an absolute datetime object. This guarantees that the watcher will wake up at the given time or earlier. In particular, if the method gets called multiple times, the watcher will wake up at the earliest of the times specified, whether that comes from the timeout or from any of the wake-up calls.
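The "earliest wins" rule can be modelled in a few lines. This is only a sketch of the semantics described above, not the watcher's code: next_wake_up is a hypothetical helper, and measuring the timeout from the start of the iteration is an assumption made for the example.

```python
from datetime import datetime, timedelta


def next_wake_up(iteration_start, timeout=None, wake_up_calls=()):
    """Return the latest moment at which the watcher would iterate, or None.

    iteration_start: datetime at which the current loop iteration began
    timeout:         optional timeout in seconds (assumed relative to the start)
    wake_up_calls:   absolute datetimes, one per wake-up call in this iteration
    """
    candidates = list(wake_up_calls)
    if timeout is not None:
        candidates.append(iteration_start + timedelta(seconds=timeout))
    # With neither timeout nor wake-up calls, only a write wakes the watcher
    return min(candidates) if candidates else None


start = datetime(2024, 1, 1, 12, 0, 0)
alarms = [start + timedelta(seconds=90), start + timedelta(seconds=30)]
wake_up = next_wake_up(start, timeout=60, wake_up_calls=alarms)
```

Here the 30-second alarm is earlier than both the 60-second timeout and the 90-second alarm, so it determines when the loop iterates at the latest.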

Watcher Internals

Creating watchers

Behind the scenes, each watcher object maintains a dictionary of values (and key lists) queried by transactions in the last watcher iteration. After every watcher loop iteration, we need to make sure that we have watchers that cover

all read keys (_get_queries)
all keys listed (_list_queries)

The way this is implemented is that every Etcd3Watcher maintains a permanent connection to the etcd server. After a watcher loop iteration we then go ahead and

Create range watches using WatchCreateRequest for every list query. To keep things simple, we subscribe to the values associated with the keys as well, as in most cases the client code also wants the values, and this cuts down on the number of watchers we need to register.
Create specific watches using WatchCreateRequest for every read key that is not matched by a list request already
Cancel any existing watches that are now redundant.
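The three steps above amount to a set difference between the watches we want and the watches we have. A minimal sketch follows, with watches modelled as plain ('range', prefix) or ('key', key) tuples. plan_watches and this representation are hypothetical, and the sketch does not attempt to merge nested prefixes.

```python
def plan_watches(get_queries, list_queries, active):
    """Return (to_create, to_cancel) as sets of ('range' | 'key', path) tuples.

    get_queries:  keys read in the last watcher iteration
    list_queries: key prefixes listed in the last watcher iteration
    active:       set of watches currently registered
    """
    # One range watch per list query
    wanted = {("range", prefix) for prefix in list_queries}
    # One specific watch per read key not already covered by a range watch
    for key in get_queries:
        if not any(key.startswith(prefix) for prefix in list_queries):
            wanted.add(("key", key))
    # Anything registered but no longer wanted is redundant
    return wanted - active, active - wanted


to_create, to_cancel = plan_watches(
    get_queries={"/as/1", "/line_to_print"},
    list_queries={"/as/"},
    active={("key", "/as/1"), ("key", "/old")},
)
```

In this example the existing watch on '/as/1' is cancelled because the new range watch on '/as/' already covers it.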

Watcher cache

The watcher does not only keep track of the queries made, but also of their results (in _cache and _list_cache respectively). The idea is that if two transactions within one watcher loop iteration read the same value, we do not have to re-query the database! In fact, the watcher ensures that we enter every iteration of the watcher loop with a “fresh” cache, which is to say that we have updated _cache and _list_cache such that they are consistent with a database _revision that corresponds to the newest received change notifications.

That way, transactions can just assume that the _revision from the watcher represents the current state of the database, and proceed to answer all matching get and list_keys requests from its cache. Sometimes this might cause a transaction to read values that are slightly out of date. In those cases, one of two things happens:

If the transaction makes a write, we will verify all reads before committing it. In that process we will find that a read value was out of date, which will cause us to refresh the cache, then start the transaction over with the new _revision. Note that this means that the _revision and cached values can change within one watcher loop iteration.
If the transaction did not make any write, the fact that values were out-of-date will not be noticed immediately - however we will have noted that we read the values in question in the _get_queries or _list_queries, and will therefore (eventually) receive a change notification that causes the watcher to start another iteration.

However, both of these are clearly quite unlikely. In practice we have a very high chance that we can serve basically all database reads from the cache, and don’t need to repeat.

Note that _cache and _list_cache need to be tracked separately from _get_queries and _list_queries: The former will already be pre-populated from the start of the watcher loop iteration, whereas the latter tracks what was actually read within an iteration. After every iteration, we will want to purge the cache for anything that has not actually been read.
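The purge step can be pictured as filtering the caches by what was actually queried. The sketch below uses plain dictionaries and sets. purge_unread is a hypothetical helper, and the real data structures behind _cache, _list_cache, _get_queries and _list_queries may well differ.

```python
def purge_unread(cache, list_cache, get_queries, list_queries):
    """Drop cache entries that were not read in the iteration just finished.

    cache:        key -> cached value
    list_cache:   prefix -> cached list of keys
    get_queries:  keys actually read during the iteration
    list_queries: prefixes actually listed during the iteration
    """
    kept_cache = {key: value for key, value in cache.items() if key in get_queries}
    kept_list_cache = {
        prefix: keys for prefix, keys in list_cache.items() if prefix in list_queries
    }
    return kept_cache, kept_list_cache


new_cache, new_list_cache = purge_unread(
    cache={"/a": "1", "/b": "2"},
    list_cache={"/as/": ["/as/1"], "/bs/": []},
    get_queries={"/a"},
    list_queries={"/as/"},
)
```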

Cache consistency problem

Transactions should ensure that it is impossible to observe an inconsistent database state. E.g. if something updates two keys A and B as part of a single transaction, it should be impossible to have another transaction see a database state where only A was changed but not B (or vice-versa). After all, pure read transactions will not be re-verified for consistency, and even inconsistent transactions that don’t write the database can still have destructive effects outside of it (e.g. cause exceptions and crashes).

When serving reads from a cache, that means that we need to be sure that all values in the cache are up-to-date with the specified _revision. Unfortunately, while the etcd server will send us a WatchResponse message every time a change was made in the database (complete with the applicable database revision), we have no way of knowing whether we have received all WatchResponse messages corresponding to a certain database revision. After all, as explained above we will potentially be registering many watchers on different keys on the same connection, and they will all be reporting back semi-independently and with no guaranteed order.

This means that once there are at least two watchers, it becomes generally impossible to make any definite statements about database state from the WatchResponses alone: We never know for certain whether there are more outstanding WatchResponse objects that are still in-flight. We cannot rely on all updates to be in one packet, as that clearly won’t be the case once the transaction gets big enough. Any timeout would also be unreliable, as a network disruption might delay a second packet indefinitely. Besides, we really wouldn’t want to wait for seconds every time we need to refresh the cache.

Progress notifications

All we really need is a way to ask the server whether we have received everything for a certain revision. This is effectively equivalent to simply asking the server what the “current” revision of the database is. Fortunately, there is a message for (almost) this exact purpose: Progress notification requests. Originally they were meant as a way for extremely long-running watchers to deal with situations where they need to re-connect to a server after a prolonged phase of not getting any WatchResponses, and therefore having an extremely out-of-date revision.

Happily, the protocol has a way to explicitly request such progress notifications from the server (WatchProgressRequest), which is documented as:

Requests the [sic] a watch stream progress status be sent in the
watch response stream as soon as possible.

This would cause a WatchResponse with no events to be sent, and the ResponseHeader will have a revision, which is documented as:

For watch progress responses, the header.revision indicates
progress. All future events received in this stream are guaranteed
to have a higher revision number than the header.revision number.

Which is precisely what we need: The server guarantees that there are no outstanding WatchResponse messages for that revision or earlier once we have received a progress notification.

Therefore the solution to the cache consistency problem is fairly straightforward: Whenever we need to refresh the cache (which happens when we enter a watcher loop iteration, or when a transaction within the watcher loop iteration detects that we are out of date), we send out a progress notification request. We then keep updating the cache with all incoming watch responses, and return only once we receive the progress notification, as that’s the only time we can be sure that things are consistent.
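The refresh procedure can be sketched over a simulated response stream. The following is not the watcher's actual code: refresh_cache is a hypothetical helper, watch responses are modelled as plain dictionaries, and a response with no events is assumed to be the progress notification. Sending the request itself is left out.

```python
def refresh_cache(cache, responses):
    """Apply watch responses to `cache` until a progress notification arrives.

    responses: iterator of dicts {"revision": int, "events": [(key, value), ...]},
               where a value of None marks a deleted key.
    Returns the revision with which the cache is now consistent.
    """
    for response in responses:
        if not response["events"]:
            # Progress notification: nothing older is still in flight
            return response["revision"]
        for key, value in response["events"]:
            if value is None:
                cache.pop(key, None)
            else:
                cache[key] = value
    raise ConnectionError("stream ended before a progress notification arrived")


cache = {"a": "1", "b": "2"}
stream = iter([
    {"revision": 5, "events": [("a", "2")]},
    {"revision": 6, "events": [("b", None)]},
    {"revision": 6, "events": []},            # the progress notification
    {"revision": 7, "events": [("a", "3")]},  # left for the next refresh
])
revision = refresh_cache(cache, stream)
```

The function stops at the progress notification, so the later event at revision 7 stays in the stream and is only applied by the next refresh.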

etcd server bug workarounds

Unfortunately, using progress notifications in a way that is not quite the original intention means that we are often quite sensitive to etcd behaviour changes that likely appeared inconsequential to the developers. And that is despite this idea having been semi-recently implemented “officially” (see https://github.com/etcd-io/etcd/issues/19371).

For instance:

Originally the server would make no effort to properly synchronise progress notification responses with normal watch responses, which means that there were some race conditions in which a progress notification for a later revision could arrive before an earlier watch response. This was fixed, see https://github.com/etcd-io/etcd/issues/15220 .
Up to etcd version 3.5.13, there was a chance that a watcher that was asked for a progress notification immediately after having been created would never answer it. This was roughly because every watcher would be started in “unsynchronised” state as the server had not checked yet whether it had a WatchResponse to send right away. Therefore any progress notification would not be answered immediately. However, the code that would send out the delayed progress notification would only trigger when the last WatchResponse gets generated, so if there were no changes in the database, the server would just never send a progress notification until the next time the database changed, which could take arbitrarily long to happen.

To make matters worse, re-sending the progress notification request would not fix the issue in this situation, because a connection with a “delayed” progress notification would not re-check, but just ignore the request. The only good fix in this situation was to reset the connection, re-create all watchers, maybe wait a bit before issuing the progress notification request, and hope that the race would not happen.

This was “fixed” by https://github.com/etcd-io/etcd/pull/17566 , which simply had the server ignore progress notification requests. However, this means that the server now reacts if we simply repeat the progress notification request when the race happens, so this is actually easier to handle.
From etcd version 3.5.22, the server entirely stopped answering progress notifications on watchers that were still on their “start” revision (see https://github.com/etcd-io/etcd/commit/2d5a9b9863f0acc4745f1d7d8dc4054832d7404b ). This means that no amount of re-starting or re-querying would get us a progress notification.

Fortunately, all we need to do is (re-)start watchers on a slightly older revision, then proceed to just ignore any notifications that we get for “past” events.
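Ignoring “past” events boils down to a revision filter. In the sketch below, events are modelled as dictionaries carrying the revision at which the change was made. drop_past_events and the field names are hypothetical, and how far back the watch is started is not specified here.

```python
def drop_past_events(events, cache_revision):
    """Keep only events newer than the revision the cache already reflects."""
    return [event for event in events if event["revision"] > cache_revision]


# The cache is consistent with revision 10, but the watch was started earlier,
# so the server replays changes the cache has already seen.
received = [
    {"revision": 9, "key": "/a", "value": "old"},
    {"revision": 10, "key": "/b", "value": "seen"},
    {"revision": 11, "key": "/a", "value": "new"},
]
fresh = drop_past_events(received, cache_revision=10)
```

Only the event at revision 11 is applied. The replayed events at revisions 9 and 10 are discarded because the cache already accounts for them.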