Joachim Breitner

Blog

Zero-downtime upgrades of Internet Computer canisters

Published 2021-11-28 in sections English, Internet Computer.

TL;DR: Zero-downtime upgrades are possible if you stick to the basic actor model.

Background

DFINITY’s Internet Computer provides a kind of serverless compute platform, where the services are WebAssemmbly programs called “canisters”. These services run without stopping (or at least that’s what it feels like from the service’s perspective; this is called “orthogonal persistence”), and process one message after another. Messages not only come from the outside (“ingress” calls), but are also exchanged between canisters.

On top of these uni-directional messages, the system provides the concept of “inter-canister calls”, which associates a respondse message with the outgoing message, and guarantees that a response will come. This RPC-like interface allows canister developers to program in the popular async/await model, where these inter-canister calls look almost like normal function calls, and the subsequent code is suspended until the response comes back.

The problem

This is all very well, until you try to upgrade your canister, i.e. install new code to fix a bug or add a feature. Because if you used the await pattern, there may still be suspended computations waiting for the response. If you swap out the program now, the code of that suspended computation will no longer be present, and the response cannot be handled! Worse, because of an infelicity with the current system’s API, when the response comes back, it may actually corrupt your service’s state.

That is why upgrading a canister requires stopping it first, which means waiting for all outstanding calls to come back. During this time, your canister is not available for new calls (so there is downtime), and worse, the length of the downtime is at the whims of the canisters you called – they could withhold the response ad infinitum, rendering your canister unupgradeable.

Clearly, this is not acceptable for any serious application. In this post, I’ll explore some of the ways to mitigate this problem, and how to create canisters that are safely instantanously (no downtime) upgradeable.

It’s a spectrum

Some canisters are trivially upgradeable, for others all hope is lost; it depends on what the canister does and how. As an overview, here is the spectrum:

  1. A canister that never performs inter-canister calls can always be upgraded without stopping.
  2. A canister that only does one-way calls, and does them in a particular way (see below), can always be upgraded without stopping.
  3. A canister that performs calls, and where it is acceptable to simply drop outstanding repsonses, can always be upgraded without stopping, once the System API has been improved and your Canister Development Kit (CDK; Motoko or Rust) has adapted.
  4. A canister that performs calls, but uses explicit continuations to handle, responses instead of the await-convenience, based on an eventually fixed System API, can be upgradeded without stopping, and will even handle responses afterwards.
  5. A canister that uses await to do inter-canister call cannot be upgraded without stopping.

In this post I will explain 2, which is possible now, in more detail. Variant 3 and 4 only become reality if and when the System API has improved.

One-way calls

A one-way call is a call where you don’t care about the response; neither the replied data, nor possible failure conditions.

Since you don’t care about the response, you can pass an invalid continuation to the system (technical detail: a Wasm table index of -1). Because it is invalid for any (realistic) Wasm module, it will stay invalid even after an upgrade, and the problem of silent corruption mentioned above is avoided. And otherwise it’s fine for this to be invalid: it means the canister “traps” once the response comes back, which is harmeless (and possibly even cheaper than a do-nothing computation).

This requires your CDK to support this kind of call. Mostly incidential, Motoko (and Candid) actually have the concept of one-way call in their type system, namely shared functions with return type () instead of async ... (Motoko is actually older than the system, and not every prediction about what the system will provide has proven successful). So, pending this PR to be released, Motoko will implement one-way calls in this way. On Rust, you have to use the System API directly or wait for cdk-rs to provide this ability (patches welcome, happy to advise).

You might wonder: How are calls useful if I don’t get to look at the response? Of course, this is a set-back – calls with responses are useful, and await is convenient. And if you have to integrate with an existing service that only provides normal calls, you are out of luck.

But if you get to design the canister and all called canisters together, it may be possible to use only one-way messages. You’d be programming in the plain actor model now, with all its advantages (simple concurrency, easy to upgrade, general robustness).

Consider for example a token ledger canister, not unlike the ICP ledger canister. For the most part, it doesn’t have to do any outgoing calls (and thus be trivially upgradeble). But say we need to add notify functionality, where the ledger canister tells other canisters about a transaction. This is a good example for a one-way call: Maybe the ledger canister doesn’t care if that notification was received? The ICP leder does care (once it comes back successful, this particular notification cannot be sent again), but maybe your ledger can do it differently: let the other canister confirm the receip via another one-way call, instead of via the reply; or simply charge for each notification and do not worry about repeated notifications.

Maybe you want to add archiving functionality, where the ledger canister streams its data to an archive canister. There, again, instead of using successful responses to confirm receipt, the archive canister can ping the ledger canister with the latest received index directly.

Yes, it changes the programming model a bit, and all involved parties have to play together, but the gain (zero-downtime upgrades) is quite valuable, and removes a fair number of other sources of issues.

And in the future?

The above is possible with today’s Internet Computer. If the System API gets improves the way I hope it will be, you have a possible middle ground: You still don’t get to use await and instead have to write your response handler as separate functions, but this way you can call any canister again, and you get the system’s assistance in mapping responses to calls. With this in place, any canister can be rewritten to a form that supports zero-downtime upgrades, without affecting its interface or what the canister can do.

How to audit an Internet Computer canister

Published 2021-11-09 in sections English, Internet Computer.

I was recently called upon by Origyn to audit the source code of some of their Internet Computer canisters (“canisters” are services or smart contracts on the Internet Computer), which were written in the Motoko programming language. Both the application model of the Internet Computer as well as Motoko bring with them their own particular pitfalls and possible sources for bugs. So given that I was involved in the creation of both, they reached out to me.

In the course of that audit work I collected a list of things to watch out for, and general advice around them. Origyn generously allowed me to share that list here, in the hope that it will be helpful to the wider community.

Inter-canister calls

The Internet Computer system provides inter-canister communication that follows the actor model: Inter-canister calls are implemented via two asynchronous messages, one to initiate the call, and one to return the response. Canisters process messages atomically (and roll back upon certain error conditions), but not complete calls. This makes programming with inter-canister calls error-prone. Possible common sources for bugs, vulnerabilities or simply unexpected behavior are:

  • Reading global state before issuing an inter-canister call, and assuming it    to still hold when the call comes back.

  • Changing global state before issuing an inter-canister call, changing it again    in the response handler, but assuming nothing else changes the state in    between (reentrancy).

  • Changing global state before issuing an inter-canister call, and not    handling failures correctly, e.g. when the code handling the callback rolls    backs.

If you find such pattern in your code, you should analyze if a malicious party can trigger them, and assess the severity that effect

These issues apply to all canisters, and are not Motoko-specific.

Rollbacks

Even in the absence of inter-canister calls the behavior of rollbacks can be surprising. In particular, rejecting (i.e. throw) does not rollback state changes done before, while trapping (e.g. Debug.trap, assert …, out of cycle conditions) does.

Therefore, one should check all public update call entry points for unwanted state changes or unwanted rollbacks. In particular, look for methods (or rather, messages, i.e. the code between commit points) where a state change is followed by a throw.

This issues apply to all canisters, and are not Motoko-specific, although other CDKs may not turn exceptions into rejects (which don’t roll back).

Talking to malicious canisters

Talking to untrustworthy canisters can be risky, for the following (likely incomplete) reasons:

  • The other canister can withhold a response. Although the bidirectional   messaging paradigm of the Internet Computer was designed to guarantee a   response eventually, the other party can busy-loop for as long as they are   willing to pay for before responding. Worse, there are ways to deadlock a   canister.

  • The other canister can respond with invalidly encoded Candid. This will cause   a Motoko-implemented canister to trap in the reply handler, with no easy way   to recover. Other CDKs may give you better ways to handle invalid Candid, but even then you will have to worry about Candid cycle bombs that will cause your reply handler to trap. 

Many canisters do not even do inter-canister calls, or only call other trustwothy canisters. For the others, the impact of this needs to be carefully assessed.

Canister upgrade: overview

For most services it is crucial that canisters can be upgraded reliably. This can be broken down into the following aspects:

  1. Can the canister be upgraded at all?
  2. Will the canister upgrade retain all data?
  3. Can the canister be upgraded promptly?
  4. Is three a recovery plan for when upgrading is not possible?

Canister upgradeability

A canister that traps, for whatever reason, in its canister_preupgrade system method is no longer upgradeable. This is a major risk. The canister_preupgrade method of a Motoko canister consists of the developer-written code in any system func preupgrade() block, followed by the system-generated code that serializes the content of any stable var into a binary format, and then copies that to stable memory.

Since the Motoko-internal serialization code will first serialize into a scratch space in the main heap, and then copy that to stable memory, canisters with more than 2GB of live data will likely be unupgradeable. But this is unlikely the first limit:

The system imposes an instruction limit on upgrading a canister (spanning both canister_preupgrade and canister_postupgrade). This limit is a subnet configuration value, and sepearate (and likely higher) than the normal per-message limit, and not easily determined. If the canister’s live data becomes too large to be serialized within this limit, the canister becomes non-upgradeable.

This risk cannot be eliminated completely, as long as Motoko and Stable Variables are used. It can be mitigated by appropriate load testing:

Install a canister, fill it up with live data, and exercise the upgrade. If this succeeds with a live data set exceeding the expected amount of data by a margin, this risk is probably acceptable. Bonus points for adding functionality that will prevent the canister’s live data to increase above a certain size.

If this testing is to be done on a local replica, extra care needs to be taken to make sure the local replica actually performs instruction counting and has the same resource limits as the production subnet.

An alternative mitigation is to avoid canister_pre_upgrade as much as possible. This means no use of stable var (or restricted to small, fixed-size configuration data). All other data could be

  • mirrored off the canister (possibly off chain), and manually re-hydrated after an upgrade.
  • stored in stable memory manually, during each update call, using the ExperimentalStableMemory API. While this matches what high-assurance Rust canisters (e.g. the Internet Identity) do, This requires manual binary encoding of the data, and is marked experimental, so I cannot recommend this at the moment.
  • not put into a Motoko canister until Motoko has a scalable solution for stable variable (for example keeping them in stable memory permanently, with smart caching in main memory, and thus obliterating the need for pre-upgrade code.)

Data retention on upgrades

Obviously, all live data ought to be retained during upgrades. Motoko automatically ensures this for stable var data. But often canisters want to work with their data in a different format (e.g. in objects that are not shared and thus cannot be put in stable vars, such as HashMap or Buffer objects), and thus may follow following idiom:

stable var fooStable = …;
var foo = fooFromStable(fooStable);
system func preupgrade() { fooStable := fooToStable(foo); })
system func postupgrade() { fooStable := (empty); })

In this case, it is important to check that

  • All non-stable global vars, or global lets with mutable values, have a stable companion.
  • The assignments to foo and fooStable are not forgotten.
  • The fooToStable and fooFromStable form bijections.

An example would be HashMaps stored as arrays via Iter.toArray(….entries()) and HashMap.fromIter(….vals()).

It is worth pointiong out that a code view will only look at a single version of the code, but cannot check whether code changes will preserve data on upgrade. This can easily go wrong if the names and types of stable variables are changed in incompatible way. The upgrade may fail loudly in this cases, but in bad cases, the upgrade may even succeed, losing data along the way. This risk needs to be mitigated by thorough testing, and possibly backups (see below).

Prompt upgrades

Motoko and Rust canisters cannot be safely upgraded when they are still waiting for responses to inter-canister calls (the callback would eventually reach the new instance, and because of infelicities of the IC’s System API, could possibly call arbitrary internal functions). Therefore, the canister needs to be stopped before upgrading, and started again. If the inter-canister calls take a long time, this mean that upgrading may take a long time, which may be undesirable. Again, this risk is reduced if all calls are made to trustworthy canisters, and elevated when possibly untrustworthy canisters are called, directly or indirectly.

Backup and recovery

Because of the above risk around upgrades it is advisable to have a disaster recovery strategy. This could involve off-chain backups of all relevant data, so that it is possible to reinstall (not upgrade) the canister and re-upload all data.

Note that reinstall has the same issue as upgrade described above in “prompt upgrades”: It ought to be stopped first to be safe.

Note that the instruction limit for messages, as well as the message size limit, limit the amount of data returned. If the canister needs to hold more data than that, the backup query method might have to return chunks or deltas, with all the extra complexity that entails, e.g. state changes between downloading chunks.

If large data load testing is performed (as Irecommend anyways to test upgradeability), one can test whether the backup query method works within the resource limits.

Time is not strictly monotonic

The timestamps for “current time” that the Internet Computer provides to its canisters is guaranteed to be monotonic, but not strictly monotonic. It can return the same values, even in the same messages, as long as they are processed in the same block. It should therefore not be used to detect “happens-before” relations.

Instead of using and comparing time stamps to check whether Y has been performed after X happened last, introduce an explicit var y_done : Bool state, which is set to False by X and then to True by Y. When things become more complex, it will be easier to model that state via an enumeration with speaking tag names, and update this “state machine” along the way.

Another solution to this problem is to introduce a var v : Nat counter that you bump in every update method, and after each await. Now v is your canister’s state counter, and can be used like a timestamp in many ways.

While we are talking about time: The system time (typically) changes across an await. So if you do let now = Time.now() and then await, the value in now may no longer be what you want.

Wrapping arithmetic

The Nat64 data type, and the other fixed-width numeric types provide opt-in wrapping arithmetic (e.g. +%, fromIntWrap). Unless explicitly required by the current application, this should be avoided, as usually a too large or negatie value is a serious, unrecoverable logic error, and trapping is the best one can do.

Cycle balance drain attacks

Because of the IC’s “canister pays” model, all canisters are prone to DoS attacks by draining their cycle balance, and this risk needs to be taken into account.

The most elementary mitigation strategy is to monitor the cycle balance of canisters and keep it far from the (configurable) freezing threshold.

On the raw IC-level, further mitigation strategies are possible:

  • If all update calls are authenticated, perform this authentication as quickly as possible, possibly before decoding the caller’s argument. This way, a cycle drain attack by an unauthenticated attacker is less effective (but still possible).

  • Additionally, implementing the canister_inspect_message system method allows the above checks to be performed before the message even is accepted by the Internet Computer. But it does not defend against inter-canister messages and is therefore not a complete solution.

  • If an attack from an authenticated user (e.g. a stakeholder) is to be expected, the above methods are not effective, and an effective defense might require relatively involved additional program logic (e.g. per-caller statistics) to detect such an attack, and react (e.g. rate-limiting).

  • Such defenses are pointless if there is only a single method where they do not apply (e.g. an unauthenticated user registration method). If the application is inherently attackable this way, it is not worth the bother to raise defenses for other methods.

   Related: A justification why the Internet Identity does not use canister_inspect_message)

A motoko-implemented canister currently cannot perform most of these defenses: Argument decoding happens unconditionally before any user code that may reject a message based on the caller, and canister_inspect_message is not supported. Furthermore, Candid decoding is not very cycle defensive, and one should assume that it is possible to construct Candid messages that require many instructions to decode, even for “simple” argument type signatures.

The conclusion for the audited canisters is to rely on monitoring to keep the cycle blance up, even during an attack, if the expense can be born, and maybe pray for IC-level DoS protections to kick in.

Large data attacks

Another DoS attack vector exists if public methods allow untrustworthy users to send data of unlimited size that is persisted in the canister memory. Because of the translation of async-await code into multiple message handlers, this applies not only to data that is obviously stored in global state, but also local data that is live across an await point.

The effectiveness of such attacks is limited by the Internet Computer’s message size limit, which is in the order of a few megabytes, but many of those also add up.

The problem becomes much worse if a method has an argument type that allows a Candid space bomb: It is possible to encode very large vectors with all values null in Candid, so if any method has an argument of type [Null] or [?t], a small message will expand to a large value in the Motoko heap.

Other types to watch out:

  • Nat and Int: This is an unbounded natural number, and thus can be arbitrarily large. The Motoko representation will however not be much larger than the Candid encoding (so this does not qualify as a space bomb).

   It is still advisable to check if the number is reasonable in size before storing it or doing an await. For example, when it denotes an index in an array, throw early if it exceeds the size of the array; if it denotes a token amount to transfer, check it against the available balance, if it denotes time, check it against reasonable bounds.

  • Principal: A Principal is effectively a Blob. The Interface   specification says that principals are at most 29   bytes in   length, but the Motoko Candid   decoder   does not check that currently (fixed in the next version of Motoko). Until then, a Principal passed as an   argument can be large (the principal in msg.caller is system-provided and   thus safe). If you cannot wait for the fix to reach you, manually check the size of the princial (via   Principal.toBlob) before doing the await.

Shadowing of msg or caller

Don’t use the same name for the “message context” of the enclosing actor and the methods of the canister: It is dangerous to write shared(msg) actor, because now msg is in scope across all public methods. As long as these also use public shared(msg) func …, and thus shadow the outer msg, it is correct, but it if one accidentially omits or mis-types the msg, no compiler error would occur, but suddenly msg.caller would now be the original controller, likely defeating an important authorization step.

Instead, write shared(init_msg) actor or shared({caller = controller}) actor to avoid using msg.

Conclusion

If you write a “serious” canister, whether in Motoko or not, it is worth to go through the code and watch out for these patterns. Or better, have someone else review your code, as it may be hard to spot issues in your own code.

Unfortunately, such a list is never complete, and there are surely more ways to screw up your code – in addition to all the non-IC-specific ways in which code can be wrong. Still, things get done out there, so best of luck!

A mostly allocation-free optional type

Published 2021-10-31 in sections English, Internet Computer.

The Motoko programming language has a built-in data type for optional values, named ?t with values null and ?v (for v : t); this is the equivalent of Haskell’s Maybe, Ocaml’s option or Rust’s Option. In this post, I explain how Motoko represents such optional values (almost) without allocation.

I neither claim nor expect that any of this is novel; I just hope it’s interesting.

Uniform representation

The Motoko programming language, designed by Andreas Rossberg and implemented by a pretty cool team at DFINITY is a high-level language with strict semantics and a strong, structural, equi-recursive type system that compiles down to WebAssembly.

Because the type system supports polymorphism, it represents all values uniformly. Simplified for the purpose of this blog post, they are always pointers into a heap object where the first word of the heap object, the heap tag, contains information about the value:

┌─────┬───┄
│ tag │ …
└─────┴───┄

The tag is something like array, int64, blob, variant, record, …, and it has two purposes:

  • The garbage collector uses it to understand what kind of object it is looking at, so that it can move it around and follow pointers therein. Variable size objects such as arrays include the object size in a subsequent word of the heap object.

  • Some types have values that may have different shapes on the heap. For example, the ropes used in our text representation can either be a plain blob, or a concatenation node of two blobs. For these types, the tag of the heap object is inspected.

The optional type, naively

The optional type (?t) is another example of such a type: Its values can either be null, or ?v for some value v of type t, and the primitive operations on this type are the two introduction forms, an analysis function, and a projection for non-null values:

null : () -> ?t
some : t -> ?t
is_some : ?t -> bool
project : ?t -> t     // must only be used if is_some holds

It is natural to use the heap tag to distinguish the two kind of values:

  • The null value is a simple one-word heap object with just a tag that says that this is null:

    ┌──────┐
    │ null │
    └──────┘
  • The other values are represented by a two-word object: The tag some, indicating that it is a ?v, and then the payload, which is the pointer that represents the value v:

    ┌──────┬─────────┐
    │ some │ payload │
    └──────┴─────────┘

With this, the four operations can be implemented as follows:

def null():
  ptr <- alloc(1)
  ptr[0] = NULL_TAG
  return ptr

def some(v):
  ptr <- alloc(2)
  ptr[0] = SOME_TAG
  ptr[1] = v
  return ptr

def is_some(p):
  return p[0] == SOME_TAG

def project(p):
  return p[1]

The problem with this implementation is that null() and some(v) both allocate memory. This is bad if they are used very often, and very liberally, and this is the case in Motoko: For example the iterators used for the for (x in e) construct have type

type Iter<T> = { next : () -> ?T }

and would unavoidably allocate a few words on each iteration. Can we avoid this?

Static values

It is quite simple to avoid this for for null: Just statically create a single null value and use it every time:

static_null = [NULL_TAG]

def null():
  return static_null

This way, at least null() doesn’t allocate. But we gain more: Now every null value is represented by the same pointer, and since the pointer points to static memory, it does not change even with a moving garbage collector, so we can speed up is_some:

def is_some(p):
  return p != static_null

This is not a very surprising change so far, and most compilers out there can and will do at least the static allocation of such singleton constructors.

For example, in Haskell, there is only a single empty list ([]) and a single Nothing value in your program, as you can see in my videos exploring the Haskell heap.

But can we get rid of the allocation in some(v) too?

Unwrapped optional values

If we don’t want to allocate in some(v), what can we do? How about simply

def some(v):
  return v

That does not allocate! But it is also broken. At type ??Int, the values null, ?null and ??null are distinct values, but here this breaks.

Or, more formally, the following laws should hold for our four primitive operations:

  1. is_some(null()) = false
  2. v. is_some(some(v)) = true
  3. p. project(some(p)) = p

But with the new definition of some, we’d get is_some(some(null())) = false. Not good!

But note that we only have a problem if we are wrapping a value that is null or some(v). So maybe take the shortcut only then, and write the following:

def some(v):
  if v == static_null || v[0] == SOME_TAG:
    ptr <- alloc(2)
    ptr[0] = SOME_TAG
    ptr[1] = v
    return ptr
  else:
    return v

The definition of is_some can stay as it is: It is still the case that all null values are represented by static_null. But the some values are now no longer all of the same shape, so we have to change project():

def project(p):
  if p[0] == SOME_TAG:
    return p[1]
  else:
    return p

Now things fall into place: A value ?v can, in many cases, be represented the same way as v, and no allocation is needed. Only when v is null or ?null or ??null or ???null etc. we need to use the some heap object, and thus have to allocate.

In fact, it doesn’t cost much to avoid allocation even for ?null:

static_some_null = [SOME_TAG, static_null]
def some(v):
  if v == static_null:
    return static_some_null
  else if v[0] == SOME_TAG:
    ptr <- alloc(2)
    ptr[0] = SOME_TAG
    ptr[1] = v
    return ptr
  else:
    return v

So unless one nests the ? type two levels deep, there is no allocation whatsoever, and the only cost is a bit more branching in some and project.

That wasn’t hard, but quite rewarding, as one can now use idioms like the iterator shown above with greater ease.

Examples

The following tables shows the representation of various values before and after. Here […] is a pointed-to dynamically allocated heap object, {…} a statically allocated heap object, N = NULL_TAG and S = SOME_TAG.

type value before after
Null null {N} {N}
?Int null {N} {N}
?Int ?23 [S,23] 23
??Int null {N} {N}
??Int ?null [S,{N}] {S,{N}}
??Int ??23 [S,[S,23]] 23
???Int null {N} {N}
???Int ?null [S,{N}] {S,{N}}
???Int ??null [S,[S,{N}]] [S,{S,{N}}]
???Int ???23 [S,[S,[S,23]]] 23

Concluding remarks

  • If you know what parametric polymorphism is, and wonder how this encoding can work in a language that has that, note that this representation is on the low-level of the untyped run-time value representation: We don’t need to know the type of v in some(v), merely its heap representation.

  • The uniform representation in Motoko is a bit more involved: The pointers are tagged (by subtracting 1) and some scalar values are represented directly in place (shifted left by 1 bit). But this is luckily orthogonal to what I showed here.

  • Could Haskell do this for its Maybe type? Not so easily:

    • The Maybe type is not built-in, but rather a standard library-defined algebraic data type. But the compiler could feasible detect that this is option-like?

    • Haskell is lazy, so at runtime, the type Maybe could be Nothing, or Just v, or, and this is crucial, a yet to be evaluated expression, also called a thunk. And one definitely needs to distinguish between a thunk t :: Maybe a that may evaluate to Nothing, and a value Just t :: Maybe a that definitely is Just, but contains a value, which may be a thunk.

    Something like this may work for a strict Maybe type or unlifted sums like (# (# #) | a #), but may clash with other ticks in GHC, such as pointer tagging.

  • As I said above, I don’t expect this to be novel, and I am happy to add references to prior art here.

  • Given that a heap object with tag SOME_TAG now always encodes a tower ?ⁿnull for n>0, one could try to optimize that even more by just storing the n:

    ┌──────┬─────┐
    │ some │  n  │
    └──────┴─────┘

    But that seems unadvisable: It is only a win if you have deep towers, which is probably rare. Worse, now the project function would have to return such a heap object with n decremented, so now projection might have to allocate, which goes against the cost model expected by the programmer.

  • If you rather want to see code than blog posts, feel free to check out Motoko PR #2115.

  • Does this kind of stuff excite you? Motoko is open source, so your contributions may be welcome!

A Candid explainer: Quirks

Published 2021-09-14 in sections English, Internet Computer.

This is the fifth and final post in a series about the interface description language Candid.

If you made it this far, you now have a good understanding of what Candid is, what it is for and how it is used. For this final post, I’ll put the spotlight on specific aspects of Candid that are maybe surprising, or odd, or quirky. This section will be quite opinionated, and could maybe be called “what I’d do differently if I’d re-do the whole thing”.

Note that these quirks are not serious problems, and they don’t invalidate the overall design. I am writing this up not to discourage the use of Candid, but merely help interested parties to understand it better.

References in the wire format

When the work on Candid began at DFINITY, the Internet Computer was still far away from being a thing, and many fundamental aspects about it were still in the air. I particular, there was still talk about embracing capabilities as a core feature of the application model, which would be implemented as opaque references on the system level, likely building on WebAssembly’s host reference type proposal (which only landed recently), and could be used to model access permissions, custom tokens and many other things.

So Candid is designed with that in mind, and you’ll find that its wire format is not just a type table and a value table, but actually

a triple (T, M, R), where T (“type”) and M (“memory”) are sequences of bytes and R (“references”) is a sequence of references.

Also the wire format for values of function service tyeps have an extra byte to distinguish between “public references” (represented by a principal and possible a method name in the data part), and these opaque references.

Alas, references never made it into the Internet Computer, so all Candid implementations simply ignore that part of the specification. But it’s still in the spec, and if it confused you before, now you know why.

Hashed field names

Candid record and variant types look like they have textual field names:

type T = record { syndactyle : nat; trustbuster: bool }

But that is actually only true superficially. The wire format for Candid only stores hashes of field names. So the above is actually equivalent to

type T = record { 4260381820 : nat; 3504418361 : bool }

or, for that matter, to

type T = record { destroys : bool; rectum : nat }

(Yes, I used an english word list to find these hash collisions. There aren’t that many actually.)

The benefit of such hashing is that the messages are a bit smaller in most (not all) cases, but it is a big annoyance for dynamic uses of Candid. It’s the reason why tools like dfx, if they don’t know the Candid interface of a service, will print the result with just the numerical hash, letting you guess which field is which.

It also complicates host languages that derive Candid types from the host language, like Motoko, as some records (e.g. record { trustbuster: bool; destroys : int }) with field name hash collisions can not be represented in Candid, and either the host language’s type system needs to be Candid aware now (as is the case of Motoko), or serialization/deserialization will fail at runtime, or odd bugs can happen.

(More discussion of this issue).

Tuples

Many languages have a built-in notion of a tuple type (e.g. (Int, Bool)), but Candid does not have such a type. The only first class product type is records.

This means that tuples have to encoded as records somehow. Conveniently(?) record fields are just numbers after all, so the type (Int, Bool) would be mapped to the type

record { 0 : int; 1 : bool }

So tuples can be expressed. But from my experience implementing the type mappings for Motoko and Haskell this is causing headaches. To get a good experience when importing from Candid, the tools have to try to reconstruct which records may have been tuples originally, and turn them into tuples.

The main argument for the status quo is that Candid types should be canonical, and there should not be more than one product type, and records are fine, and there needs to be no anonymous product type. But that has never quite resonated with me, given the practical reality of tuple types in many host languages.

Argument sequences

Did I say that Candid does not have tuple types? Turns out it does, sort of. There is no first class anonymous product, but since functions take sequences of arguments and results, there is a tuple type right there:

func foo : (bool, int) -> (int, bool)

Again, I found that ergonomic interaction with host languages becomes relatively unwieldy by requiring functions to take and return sequences of values. This is especially true for languages where functions take one argument value or return one result type (the latter being very common). Here, return sequences of length one are turned into that type directly, longer argument sequences turn into the host language’s tuple type, and nullary argument sequences turn into the idiomatic unit type. But this means that the types (int, bool) -> () and (record { 0: int, 1: bool}) -> () may be mapped to the same host language type, which causes problems when you hope to encode all necessary Candid type information in the host language.

Another oddity with argument and result sequences is that you can give names to the entries, e.g. write

func hello : (last_name : text; first_name : text) -> ()

but these names are completely ignored! So while this looks like you can, for example, add new optional arguments in the middle, such as

func hello : (last_name : text; middle_name: opt text, first_name : text) -> ()

without breaking clients, this does not have the effect you think it has and will likely break.

My suggestion is to never put names on function arguments and result values in Candid interfaces, and for anything that might be extended with new fields or where you want to name the arguments, use a single record type as the only argument:

func hello : (record { last_name : text; first_name : text}) -> ()

This allows you to add and remove arguments more easily and reliably.

Type “shorthands”

The Candid specification defines a system of types, and then adds a number of “syntactic short-hands”. For example, if you write blob in a Candid type description, it ought to means the same as vec nat8.

My qualm with that is that it doesn’t always mean the same. A Candid type description is interpreted by a number of, say, “consumers”. Two such consumers are part of the Candid specification:

  • The specification that defines the wire format for that type
  • The upgrading (subtyping) rules

But there are more! For every host language, there is some mapping from Candid types to host language types, and also generic tools like Candid UI are consumers of the type algebra. If these were to take the Candid specification as gospel, they would be forced to treat blob and vec nat8 the same, but that would be quite unergonomic and might cause performance regressions (most language try to map blob to some compact binary data type, while vec t tends to turn into some form of array structure).

So they need to be pragmatic and treat blob and vec nat8 differently. But then, for all practical purposes, blob is not just a short-hand of vec nat8. They are different types that just happens to have the same wire representations and subtyping relations.

This affects not just blob, but also “tuples” (record { blob; int; bool }) and field “names”, as discussed above.

The value text format

For a while after defining Candid, the only implementation was in Motoko, and all the plumbing was automatic, so there was never a need for users to to explicitly handle Candid values, as all values were Motoko values. Still, for debugging and testing and such things, we eventually needed a way to print out Candid values, so the text format was defined (“To enable convenient debugging, the following grammar specifies a text format for values…”).

But soon the dfx tool learned to talk to canisters, and now users needed to enter Candid value on the command line, possibly even when talking to canisters for which the interface was not known to dfx. And, sure enough, the textual interface defined in the Candid spec was used.

Unfortunately, since it was not designed for that use case, it is rather unwieldy:

  • It is quite verbose. You have to write record { … }, not just { … }. Vectors are written vec { …; …} instead of some conventional syntax like […, …]. Variants are written as variant { error = "…"} with braces that don’t any value here, and something like #error "…" might have worked as well.

    With a bit more care, a more concise and ergonomic syntax might have been possible.

  • It wasn’t designed to be sufficient to create a Candid value from it. If you write 5 it’s unclear whether that’s a nat or an int16 or what (and all of these have different wire representations). Type annotations were later added, but are relatively unwieldy, and don’t cover all cases (e.g. a service reference with a recursive type cannot be represented in the textual format at the moment).

  • Not really the fault of the textual format, but some useful information about the types is not reflected in the type description that’s part of the wire format. In particular not the field names, and whether a value was intended to be binary data (blob) or a list of small numbers (vec nat8), so pretty-printing such values requires guesswork. The Haskell library even tries to brute-force the hash to guess the field name, if it is short or in a english word list!

In hindsight I think it was too optimistic to assume that correct static type information is always available, and instead of actively trying to discourage dynamic use, Candid might be better if we had taken these (unavoidable?) use cases into account.

Custom wire format

At the beginning of this post, I have a “Candid is …” list. The list is relatively long, and the actual wire format is just one bullet point. Yes, defining a wire format that works isn’t rocket science, and it was easiest to just make one up. But since most of the interesting meat of Candid is in other aspects (subtyping rules, host language integration), I wonder if it would have been better to use an existing, generic wire format, such as CBOR, and build Candid as a layer on top.

This would give us plenty of tools and libraries to begin with. And maybe it would have reduced barrier of entry for developers, which now led to the very strange situation that DFINITY advocates for the universal use of Candid on the Internet Computer, so that all services can smoothly interact, but two of the most important services on the Internet Computer (the registry and the ledger) use Protobuf as their primary interface format, with Candid interfaces missing or an afterthought.

The type constructor opcodes

A small quirk of the wire format: It defines op-codes for every type constructor, both primitive (bool…) and composite (vec…), and the op-codes from primitive and composite types share a common namespace. This is a left-over from earlier iterations of the design where primitive and composite types may have been encoded together. But eventually, the “type table” was introduced, and composite types are now always referenced via their index in the type table, and the type table contains only composite types. In other words: In any place, one either deals with primitive types/table indices, or composite types. I think it might have been clearer and more honest if primitive and composite types got numbered independently.

Sideways Interface Evolution

This is not a quirk of Candid itself, but rather an idiom of how you can use Candid that emerged from our solution for record extensions and upgrades.

Consider our example from before, a service with interface

service { add_user : (record { login : text; name : text }) -> () }

where you want to add an age field, which should be a number.

The “official” way of doing that is to add that field with an optional type:

service { add_user : (record { login : text; name : text; age : opt nat }) -> () }

As explained above, this will not break old clients, as the decoder will treat a missing argument as null. So far so good.

But often when adding such a field you don’t want to bother new clients with the fact that this age was, at some point in the past, not there yet. And you can do that! The trick is to distinguish between the interface you publish and the interface you implement. You can (for example in your documentation) state that the interface is

service { add_user : (record { login : text; name : text; age : nat }) -> () }

which is not a subtype of the old type, but it is the interface you want new clients to work with. And then your implementation uses the type with opt nat. Calls from old clients will come through as null, and calls from new clients will come through as opt 42.

We can see this idiom used in the Management Canister of the Internet Computer. The current documented interface only mentions a controllers : vec principal field in the settings, but the implementation still can handle both the old controller : principal and the new controllers field.

It’s probably advisable to let your CI system check that new versions of your service continue to implement all published interfaces, including past interfaces. But as long as the actual implementation’s interface is a subtype of all interfaces ever published, this works fine.

This pattern is related to when your service implements, say, http_request (so its implemented interface is a subtype of that common interface), but does not include that method in the published documentation (because clients of your service don’t need to call it).

Self-describing Services

As you have noticed, Candid was very much designed assuming that all parties always have the service type of services they want to interact with. But the Candid specification does not define how one can obtain the interface of a given service, and there isn’t really a an official way to do that on the Internet Computer.

That is unfortunate, because many interesting features depend on that: Such as writing import C "ic:7e6iv-biaaa-aaaaf-aaada-cai" in your Motoko program, and having it’s type right there. Or tools like ic.rocks, that allow you to interact with any canister right there.

One reason why we don’t really have that feature yet is because of disagreements about how dynamic that feature should be. Should you be able to just ask the canister for its interface (and allow the canister to vary the response, for example if it can change its functionality over time, even without changing the actual wasm code)? Or is the interface a static property of the code, and one should be able to query the system for that data, without the canister’s active involvement. Or, completely different, should interfaces be distributed out of band, maybe together with API documentation, or in some canister registry somewhere else?

I always leaned towards the first of these options, but not convincingly enough. The second options requires system assistance, so more components to change, more teams to be involved that maybe intrinsically don’t care a lot about this feature. And the third might have emerged as the community matures and builds such infrastructure, but that did not happen yet.

In the end I sneaked in an implementation of the first into Motoko, arguing that even if we don’t know yet how this feature will be implemented eventually, we all want the feature to exist somehow, and we really really want to unblock all the interesting applications it enables (e.g. Candid UI). That’s why every Motoko canister, and some rust canisters too, implements a method

__get_candid_interface_tmp_hack : () -> (text)

that one can use to get the Candid interface file.

The name was chosen to signal that this may not be the final interface, but like all good provisional solutions, it may last longer than intended. If that’s the case, I’m not sorry.

This concludes my blog post series about Candid, for now. If you want to know more, feel free to post your question on the DFINTY developer forum, and I’ll probably answer.

A Candid explainer: Language integration

Published 2021-09-13 in sections English, Internet Computer.

This is the forth post in a series about the interface description language Candid.

Now for something completely different: How does Candid interact with the various host languages, i.e. the actual programming languages that you write your services and clients in?

There are two facets to that question:

  1. How is the Candid interface represented inside the language?

    Some languages with rich type systems can express all relevant information about a Candid method or service within its own type system, and then the concrete serialization/deserialization code can be derived from that type (e.g. using type classes in Haskell, Traits in Rust, or built into the compiler in Motoko).

    Other languages have a less rich type system (e.g. C), no type-driven generic programming or are simply dynamically typed. In these cases, the Candid type has to be either transformed into specific code for that service by some external tool (e.g. JavaScript and TypeScript) or the Candid description has to be parsed and interpreted at runtime.

    Either approach will give rise to a type mapping between the Candid types and the host language types. Developers will likely have to know which types correspond to which, which is why the Candid manual’s section on types explicitly lists that.

  2. How is the Candid interface description produced and consumed?

    This is maybe the even more important question; what comes first: The code written in the host language, or the Candid description. There are multiple ways to tackle this, all of which have their merits, so I let’s look at some typical approaches.

Generating candid from the host language

In many case you don’t care too much about the interface of your service, and you just want to write the functionality, and get the interface definition for free. This is what you get when you write Motoko services, where the compiler calculates the Candid interface based on the Motoko types of your actor methods, and the build tool (dfx) puts that Candid file where it needs to go. You can thus develop services without ever writing or even looking at Candid type definitions.

The Candid library for Rust supports that mode as well, although you have to add some macros to your program to use it.

A downside of this model is that you have only indirect control over the generated Candid. Since it is type-driven, whenever there is more than one sensible Candid type for a given host language type, the translation tool has to make a choice, and if that does not suit you, that can be a problem.

In the case of Motoko we were able to co-design it with Candid, and their type systems are similar enough that this works well in practice. We have a specification of Candid-Motoko-type-mappings, and the type export from from Motoko to Candid is almost surjective. (Almost, because of Candid’s float32 type, which Motoko simply does not have, and because of service types with methods names that are not valid Motoko identifiers.)

Checking host language against Candid

The above works well when you (as the service developer) get to define the service’s interface as you go. But sometimes you want to develop a service that adheres to a given Candid interface. For example, in order to respond to HTTP requests in an Internet Computer service, you should provide a method http_request that implements this interface (simplified):

type HeaderField = record { text; text; };

type HttpRequest = record {
  method: text;
  url: text;
  headers: vec HeaderField;
  body: blob;
};

type HttpResponse = record {
  status_code: nat16;
  headers: vec HeaderField;
  body: blob;
};

service : {
  http_request: (request: HttpRequest) -> (HttpResponse) query;
}

Here, a suitable mode of operation is to generate the Candid description of the service that you built, and then compare it against this expected interface with a tool that implements the Candid subtyping relation. This would then complain if what you wrote was not compatible with the above interface. The didc check tool that comes with the Rust library can do that. If your service has to implement multiple such pre-defined interfaces, its actual interface will end up being a subtype of each of these interfaces.

Importing Candid

If you already have a Candid file, in particular if you are writing a client that wants to talk to an existing service, you can also import that Candid file into your language. This is a very common mode of operation for such interface descriptions languages (e.g. Protobuf). The details depend a lot on the host language, though:

  • In Motoko, you can import typed actor references: Just write import C "canister:foo" where foo is the name of another canister in your project, and your build tool (dfx) will pass the Candid interface of foo to the Motoko compiler, which then translates that Candid service type into a Motoko actor type for you to use.

    The mapping from Candid types to Motoko types is specified in IDL-Motoko.md as well, via the function i(…).

    The vision was always that you can import a reference to any canister on the Internet Computer this way (import C "ic:7e6iv-biaaa-aaaaf-aaada-cai"), and the build tool would fetch the interface automatically, but that has not been implemented yet, partly because of disagreement about how canisters expose their interface (see below). The Motoko compiler is ready, though, as described in its build tool interface specification.

    What is still missing in Motoko is a way to import a Candid type alone (without a concrete canister reference), to use the types somewhere (in function arguments, or to assert the type of the full canister).

  • The Rust library does not support importing Candid, as far as I know. If you write a service client in Rust, you have to know which Rust types map to the right Candid types, and manually get that right.

  • For JavaScript and TypeScript, the didc tool that comes with the Rust library can take a Candid file and produce JS resp. TS code that gives you an object representing the service with properly typed methods that you can easily interact with.

  • With the Haskell Candid library you can import Candid inline or from a file, and it uses metaprogramming (Template Haskell) to generate suitable Haskell types for you, which you can encode/decode at. This is for example used in the test suite of the Internet Identity, which executes that service in a simulated Internet Computer environment.

Generic and dynamic use

Finally, it’s worth mentioning that Candid can be used generically and dynamically:

  • Since Canisters can indicate their interface, a website like ic.rocks that enumerates Canisters can read that interface, and provide a fully automatically generated UI for them. For example, you can not only see the Candid interface of the Internet Identity canister at https://ic.rocks/principal/rdmx6-jaaaa-aaaaa-aaadq-cai, but actually interact with it!

  • Similarly, during development of a service, it is very useful to be able to interact with it before you have written your actual front-end. The Candid UI tool provides that functionality, and can be used during local development or on-chain. It is also integrated into the Internet Computer playground.

  • Command line tools like dfx allow you to make calls to services, even without having their actual Candid interface description around. However, such dynamic use was never part of the original design of Candid, so it is a bit rough around the edges – the textual format is a bit unwieldy, you need to get the types right, and field name in the responses may be missing.

Do you want to know why field names are missing then? Then check out the next and final post of this series, where I discuss a number of Candid’s quirks.

A Candid explainer: Opt is special

Published 2021-09-07 in sections English, Internet Computer.

This is the third post in a series about the interface description language Candid.

The record extension problem

Initially, the upgrade rules of Candid were a straight-forward application of the canonical subtyping rules. This worked and was sound, but it forbid one very commonly requested use case: Extending records in argument position.

The subtyping rule for records say that

   record { old_field : nat; new_field : int }
<: record { old_field : nat }

or, in words, that subtypes can have more field than supertypes. This makes intuitive sense: A user of a such a record value that only expects old_field to be there is not bothered if suddenly a new_field appears. But a user who expects new_field to be there is thrown off if it suddenly isn’t anymore. Therefore it is ok to extend the records returned by a service with new fields, but not to extend the record in your methods’s argument types.

But users want to extend their record types over time, also in argument position!

In fact, they often want to extend them in both positions. Consider a service with the following interface, where the CUser record appears both in argument and result position:

type CUser = record { login : text; name : text };
service C : {
  register_user : (CUser) -> ();
  get_user_data : (text) -> (CUser);
}

It seems quite natural to want to extend the record with a new field (say, the age of the user). But simply changing the definition of CUser to

type CUser = record { login : text; name : text; age : nat }

is not ok, because now register_user requires the age field, but old clients don’t provide it.

So how can we allow developers to make changes like this (because they really really want that), while keeping the soundness guarantees made by Candid (because we really really want that)? This question has bothered us for over two years, and we even created a meta issue that records the dozen approached we considered, discussed and eventually ditched.

Dynamic subtyping checks in opt

I will spare you the history lesson, though, and explain the solution we eventually came up with.

In the example above it seems unreasonable to let the developer add a field age of type nat. Since there may be old clients around, the service unavoidably has to deal with records that don’t have an age field. If the code expects an age value, what should it do in that case?

But one can argue that changing the record type to

type CUser = record { login : text; name : text; age : opt nat }

could work: If the age field is present, use the value. If the field is absent, inject a null value during decoding.

In the first-order case, this rather obvious solution would work just fine, and we’d be done. But Candid aims to solve the higher order case, and I said things get tricky here, didn’t I?

Consider another, independent service D with the following interface:

type DUser = record { login : text; name : text };
service D : {
  on_user_added : (func (DUser) -> ()) -> ()
}

This service has a method that takes a method reference, presumably with the intention of calling it whenever a new user was added to service D. And because the types line up nicely, we can compose these two services, by once calling D.on_user_added(C.register_user), so that from now on D calls C.register_user(…). These kind of service mesh-ups are central to the vision of the Internet Computer!

But what happens if the services now evolve their types in different ways, e.g. towards

type CUser = record { login : text; name : text; age : opt nat }

and

type DUser = record { login : text; name : text; age : opt variant { under_age; adult }}

Individually, these are type changes that we want to allow. But now the call from D to C may transmit a value of record { …; age = opt variant { under_age } } when C expects a natural number! This is precisely what we want to prevent by introducing types, and it seems we have lost soundness.

The best solution we could find is to make the opt type somewhat special, and apply these extra rules when decoding at an expected type of opt t.

  • If this was a record field, and the provided record value doesn’t even have a field of that name, don’t fail but instead treat this as null. This handles the first-order example above.

  • If there is a value, look at its type (which, in Candid, is part of the message).

    • If it is opt t' and t' <: t, then decode as normal.

    • If it is opt t' but t' <: t does not hold, then treat that as null.

      This should only happen in these relatively obscure higher-order cases where services evolved differently and incompatibly, and makes sure that the calls that worked before the upgrades will continue to work afterwards.

      It is not advisable to actually use that rule when changing your service’s interface. Tools that assist the developer with an upgrade check should prevent or warn about the use of this rule.

  • Not strictly required here, but since we are making opt special anyways:

    If its type t' is not of the form opt …, pretend it was opt t', and also pretend that the given value was wrapped in opt.

    This allows services to make record field in arguments that were required in the old version optional in a new version, e.g. as a way to deprecated them. So it is mildly useful, although I can report that it makes the meta-theory and implementation rather complex, in particular together with equirecursion and beasts like type O = opt O. See this discussion for a glimpse of that.

In the above I stress that we look at the type of the provided value, and not just the value. For example, if the sender sends the value opt vec {} at type opt vec nat, and the recipient expects opt vec bool, then this will decode as null, even though one could argue that the value opt vec {} could easily be understood at type opt vec bool. We initially had that, but then noticed that this still breaks soundness when there are references inside, and we have to do a full subtyping check in the decoder. This is very unfortunate, because it makes writing a Candid decoder a noticeably harder task that requires complicated graph algorithms with memoization (which I must admit Motoko has not implemented yet), but it’s the least bad solution we could find so far.

Isn’t that just dynamic typing?

You might have noticed that with these rules, t <: opt t' holds for all types t and t'. In other words, every opt … is a top type (like reserved), thus all optional types are equivalent. One could argue that we are throwing all of the benefits of static typing over board this way. But it’s not quite as bad: It’s true that decoding Candid values now involves a dynamic check inserting null values in certain situations, but as long as everyone plays by the rules (i.e. upgrades their services according to the Candid safe upgrade rules, and heeds the warning alluded to above), these certain situations will only happen in the obscurest of cases.

It is, by the way, not material to this solution that the special subtyping behavior is applied to “the” opt type, and a neighboring point in the design space would be a dedicated upgraded t type constructor. That would allow developers to use the canonical opt type with the usual subtyping rules and communicate their intent (“clients should consider this field optional” vs. “old clients don’t know about this field, but new clients should use it”) more cleanly – at the expense of having more “non-canonical” types in the type system.

To see why it is convenient if Candid has mostly just the “normal” types, read the next post, which will describe how Candid can be integrated into a host language.

A Candid explainer: Safe higher-order upgrades

Published 2021-08-30 in sections English, Internet Computer.

This is the second post in a series about the interface description language Candid.

Safe upgrades

A central idea behind Candid is that services evolve over time, and so also their interfaces evolve. As they do, it is desirable to keep the interface usable by clients who have not been updated. In particular on a blockchainy platform like the Internet Computer, where some programs are immutable and cannot be changed to accommodate changes in the interface of the services they use, this is of importance.

Therefore, Candid defines which changes to an interface are guaranteed to be backward compatible. Of course it’s compatible to add new methods to a service, but some changes to a method signature can also be ok. For example, changing

service A1 : {
  get_value : (variant { current; previous : nat })
    -> (record { value : int; last_change : nat })
}

to

service A2 : {
  get_value : (variant { current; previous : nat; default })
    -> (record { value : int; last_change : nat; committed : bool })
}

is fine: It doesn’t matter that clients that still use the old interface don’t know about the new constructor of the argument variant. And the extra field in the result record will silently be ignored by old clients.

In the Candid spec, this relation is written as A2 <: A1 (note the order!), and the formal footing this builds on is subtyping. We thus say that “it is safe to upgrade a service to a subtype”, and that A2’s interface is a subtype of A1’s.

In small examples, I often use nat and int, because every nat is also a int, but some int values are not not nat values, namely the negative ones. We say nat is a subtype of int, nat <: int. So a function that in the first returns a int can in the new version return a nat without breaking old clients. But the other direction doesn’t work: If the old version’s method had a return type of nat, then changing that to int would be a breaking change, as old clients would not be prepared to handle negative numbers.

Note that arguments of function types follow different rules. In fact, the rules are simply turned around, and in argument position (also called “negative position”), we can evolve the interface to accept supertypes. Concretely, a function that originally expected an nat can be changed to expect an int, but the other direction doesn’t work, because there might still be old clients around that send negative numbers. This reversing-of-the-rules is called contravariance.

The vision is that the developer’s tooling will warn the developer before they upgrade the code of a running service to a type that is not a subtype of the old interface.

Let me stress, though, that all this is about not breaking existing clients that use the old interface. It does not mean that a client developer who fetches the new interface for your service won’t have to change their code to make his programs compile again.

Higher order composition

So far, we considered the simple case of one service with a bunch of clients, i.e. the “first order” situation. Things get much more interesting if we have multiple services that are composed, and that pass around references to each other, and we still want everything to be nicely typed and never fail even as we upgrade services.

Therefore, Candid’s type system can express that a service’s method expects or a returns a reference to another service or method, and the type that this service or method should have. For example

service B : { add_listener : (text, func (int) -> ()) -> () }

says that the service with interface B has a method called add_listener which takes two arguments, a plain value of type text and a reference to a function that itself expects a int-typed argument.

The contravariance of the subtyping rules explained above also applies to the types of these function reference. And because the int in the above type is on the left of two function arrows, the subtyping rule direction flips twice, and it is therefore safe to upgrade the service to accept the following interface:

service B : { add_listener : (text, func (nat) -> ()) -> () }

Soundness theorem and Coq proof

The combination of these higher order features and the safe upgrade mechanism is maybe the unique selling point of Candid, but also what made its design quite tricky sometimes. And although the conventional subtyping rules work well, we wanted to do some less conventional things (more on that below), and more than once thought we had a good solution, only to notice that it did not hold water in complicated higher-order cases.

But what does it mean to hold water? I felt the need to precisely define a soundness criteria for higher order interface description languages, which you can find in the document An IDL Soundness Proposition. This is a general framework which you can instantiate with your concrete IDL language and rules, and then it tells you what you have to prove to consider your language to be sound. Intuitively, it simulates all possible ways in which services can be defined and upgraded, and in which they can pass around references to each other, and the soundness property is that then that for all messages sent between services, they can always be understood.

The document also shows, in general, that any system that builds on “canonical subtyping” in sound, as expected. That is still a generic theorem that you can instantiate with a concrete system, but now there is less to do.

Because these proofs can get tricky in corner cases, it is valuable to mechanize them in an interactive theorem prover and have the computer check the proofs. So I have created a Coq formalization that defines the soundness criteria, including the reduction to canonical subtyping. It also defines MiniCandid, a greatly simplified variant of Candid, proves various properties about it (transitivity etc.) and finally instantiates the soundness theorem.

I am particularly fond of the use of coinduction to model the equirecursive types, and the use of named cases, as we know them from Isabelle, to make the proofs a bit more readable and maintainable.

I am less fond of how much work it seems to be to extend MiniCandid with more of Candid’s types. Already adding vec was more work than it’s worth it, and I defensively blame Coq’s not-so-great support for nested recursion.

The soundness relies on subtyping, and that is all fine and well as long as we stick to canonical rules. Unfortunately, practical considerations force us to deviate from these a bit, as we see in the next post of this series.

A Candid explainer: The rough idea

Published 2021-08-29 in sections English, Internet Computer.

One of the technologies that was created by DFINITY as we built the “Internet Computer” is Candid. Candid is

  • a language to describe the interface of a service, with named methods, arguments and results,
  • a type system for these arguments and results that maps to many possible implementation languages, thus enabling interoperability between these languages,
  • a wire format (encoding) of values of these types,
  • a notion of “safe upgrade” that allows service developers to evolve their interface with the assurance that no clients break, even in the presence of higher-order composition,
  • a meta-theory of safe interface description languages and a formal proof that Candid has these properties, and
  • a bunch of tools and libraries.

In this in-depth blog post series I want to shed some light onto these aspects of Candid. The target audience is mainly anyone who wants to deeply understand Candid, e.g. researchers, implementors of Candid tools, anyone who wants to builds a better alternative, but no particular prior Candid knowledge is expected. Also, much of what is discussed here is independent of the Internet Computer. Some posts may be more theoretical or technical than others; if you are lost in one I hope you’ll rejoin for the subsequent post

Much in these posts is not relevant for developers who just want to use Candid in their projects, and also much that they want to know is missing. The Internet Computer dev docs will hopefully cater to that audience.

Blog post series overview

  1. The rough idea (this post)

    Announcing the blog post series, and very briefly outlining what an Interface Description Language even is.

  2. Safe higher-order upgrades

    Every can do first-order interface description languages. What makes Candid special is its support for higher-order service composition. This post also contains some general meta-theory and a Coq proof!

  3. Opt is special

    Extending records in argument position is surprisingly tricky. Lean why, and how we solved it.

  4. Language integration

    The point of Candid is inter-op, so let’s look the various patterns of connecting Candid to your favorite programming language.

  5. Quirks

    The maybe most interesting post in this series: Candid has a bunch of quirks (hashed field names, no tuples but sequnces, references that nobody used, and more). I’ll explain some of them, why they are there, and what else we could have done. Beware: This post will be opinionated.

The rough idea

The idea of an interface description language is easy to begin with. Something like

service A : {
  add : (int) -> ();
  get : () -> (int);
}

clearly communicates that a service called A provides two methods add and get, that add takes an integer as an argument, and that get returns such a number. Of course, this only captures the outer shape of the service, but not what it actually does – we might assume it implements a counter of sorts, but that is not part of the Candid interface description.

In order to now use the service, we also need a (low-level) transport mechanism. Candid itself does not specify that, but merely assumes that it exists, and is able to transport the name of the invoked method and a raw sequence of bytes to the service, and a response (again a raw sequence of bytes) back.

Already on the Internet Computer we have two distinct transport mechanisms used are external calls via an HTTP-based RPC interface and on-chain inter-canister calls. Both are handled by the service in the same way. Here we can see that Candid succeeds in abstracting over differences in the low-level transport mechanism. Because of this abstraction, it is possible to use Candid over other transportation mechanisms as well (conventional HTTPS, E-Mail, avian carrier). I think Candid has potential there as well, so even if you are not interested in the Internet Computer, this post may be interesting to you.

The translation of argument and result values (e.g. numbers of type int) to the raw sequence of bytes is specified by Candid, defining a wire format. For example, the type int denotes whole numbers of arbitrary size, and they are encoded using the LEB128 scheme. The Candid wire format also contains a magic number (DIDL, which is 0x4449444c in hex) and a type description (more on that later). So when passing the value 2342 to the add, the raw bytes transported will be 0x4449444c00017da612.

Of course we want to transfer more data than just integers, and thus Candid supports a fairly complete set of common basic types (nat, int, nat8, nat16, nat32, nat64, int8, int16, int32, int64, float32, float64, bool, text, blob) and composite types (vectors, optional values , records and variants). The design goal is to provide a canonical set of types – enough to express most data that you might want to pass, but no more than needed, so that different host languages can support these types easily.

The Candid type system is structural. This means that two types are the same when they are defined the same way. Imagine two different services defining the same

type User = record { name : text; user_id : nat }

then although User is defined in two places, it’s still the same type. In other words, these name type definitions are always just simple aliases, and what matters is their right-hand side.

Because Candid types can be recursive (e.g. type Peano = opt Peano), this means we have an equirecursive type system, which makes some things relatively hard.

So far, nothing too special about Candid compared to other interface definition languages (although a structural equirecursive type system is already something). In the next post we look at reference types, and how such higher order features make Candid an interesting technology.

Sweet former employee appreciation

Published 2021-08-22 in sections English, Internet Computer.

Two months ago was my last day of working as an employee for DFINITY, and as I mentioned in the previous blog post, one of my main contributions was the introduction and maintenance of the Interface Specification.

The Interface Specification as a book

Hence I was especially happy to find that my former colleague Hans Larsen has, as a farewell gift and token of appreciation, created a hard copy of the document, with 107 pages of dry technical content that, and a very sweet dedication on the back. This is especially valuable as it came from Lars personally (i.e. it’s not “just” routine HR work to keep former employees happy, which I could imagine to be a thing in big and mature corporations), and despite the fact that he himself has left DFINITY since.

I also take this as further confirmation that this specification-driven design process, despite the initial resistance and daily hurdles, is a good one. A rough list of guiding principles would be:

  • Describe the complete interface, not just signatures (schemas), but also semantics (behavior), in one document. The teams on either side of the interface should not need any other information. In particular, they should never have to peek into the other team’s code to see how it works.
  • New features or changes are first designed by writing down how they affect the interface specification, discussed on that level with the “other” teams, agreed upon, and then implemented. Treating the document as code and discussing these features on pull requests is helpful here. Depending on how quickly DFINITY becomes more open, we may get to see that happening on the currently internal ic-ref repository.
  • The interface semantics is described not just using possibly ambiguous or incomplete prose, but also a comprehensive formalism. That formalism is ideally mechanized, but pseudo-math is better than nothing.
  • The interface is implemented more than once (e.g. a prototype reference implementation, and the production implementation) to keep the implementations honest.
  • Dually, the interface has more than one consumer, one ideally being a executable test suite that is implementation-agnostic.

The similarity to the concept of “deep specifications” from the DeepSpec project is indeed striking.

One of my hopes at DFINITY was that these principles would catch on and that other components (e.g the NNS, the nodes or the ICP ledger canister) would follow suite. That did not happen, it seems. Or rather, it did not happen yet…

PS, in case you are wondering how: The rendering of the document that’s shown at at https://sdk.dfinity.org/docs/interface-spec/index.html is not great; the internal website had a different style with made navigating this large document more possible, as it was a dedicated site with the document’s nested table of content on the left.

Leaving DFINITY

Published 2021-06-19 in sections English, Internet Computer, Internet Computer.

Last Thursday was my last working day at DFINITY. There are various reasons why I felt that after almost three years the DFINITY Foundation isn’t quite the right place for me anymore, and this plan has been in the making for a while. Primarily, there are personal pull factors that strongly suggest that I’ll take a break from full time employment, so I decided to see the launch of the Internet Computer through and then leave.

DFINITY has hired some amazing people, and it was a great pleasure to work with them. I learned a lot (some Rust, a lot of Nix, and just how merciless Conway’s law is), and I dare say I had the opportunity to do some good work, contributing my part to make the Internet Computer a reality.

I am especially proud of the Interface Specification and the specification-driven design principles behind it. It even comes with a model reference implementation and acceptance test suite, and although we didn’t quite get to do formalization, those familiar with the DeepSpec project will recognize some influence of their concept of “deep specifications”.

Besides that, there is of course my work on the Motoko programming language, where I build the backend,a the Candid interoperability layer, where I helped with the formalism, formulated the a generic soundness criterion for Interface Description Languages in a higher order settings and formally verified that in Coq. Fortunately, all of this work is now Free Software or at least Open Source.

With so much work poured into this project, I continue to care about it, and you’ll see me post on the the developer forum and hack on Motoko. As the Internet Computer becomes gradually more open, I hope I can be gradually more involved again. But even without me contributing full-time I am sure that DFINITY and the Internet Computer will do well; when I left there were still plenty of smart, capable and enthusiastic people forging ahead.

So what’s next?

So far, I have rushed every professional transition in my life: When starting my PhD, when starting my postdoc, when starting my job at DFINITY, and every time I regretted it. So this time, I will take a proper break and will explore the world a bit (as far as that is possible given the pandemic). I will stresslessly contribute to various open source projects. I also hope to do more public outreach and teaching, writing more blog posts again, recording screencasts and giving talks and lectures. If you want to invite me to your user group/seminar/company/workshop, please let me know! Also, I might be up for small interesting projects in a while.

Beyond these, I have no concrete plans and am looking forward to the inspiration I might get from hiking through the Scandinavian wilderness. If you happen to stumble across my tent, please stop for a tea.