How to audit an Internet Computer canister

Published 2021-11-09 in sections English, Internet Computer.

I was recently called upon by Origyn to audit the source code of some of their Internet Computer canisters (“canisters” are services or smart contracts on the Internet Computer), which were written in the Motoko programming language. Both the application model of the Internet Computer as well as Motoko bring with them their own particular pitfalls and possible sources for bugs. So given that I was involved in the creation of both, they reached out to me.

In the course of that audit work I collected a list of things to watch out for, and general advice around them. Origyn generously allowed me to share that list here, in the hope that it will be helpful to the wider community.

Inter-canister calls

The Internet Computer system provides inter-canister communication that follows the actor model: Inter-canister calls are implemented via two asynchronous messages, one to initiate the call, and one to return the response. Canisters process messages atomically (and roll back upon certain error conditions), but not complete calls. This makes programming with inter-canister calls error-prone. Possible common sources for bugs, vulnerabilities or simply unexpected behavior are:

Reading global state before issuing an inter-canister call, and assuming it to still hold when the call comes back.
Changing global state before issuing an inter-canister call, changing it again in the response handler, but assuming nothing else changes the state in between (reentrancy).
Changing global state before issuing an inter-canister call, and not handling failures correctly, e.g. when the code handling the callback rolls backs.

If you find such pattern in your code, you should analyze if a malicious party can trigger them, and assess the severity that effect

These issues apply to all canisters, and are not Motoko-specific.

Rollbacks

Even in the absence of inter-canister calls the behavior of rollbacks can be surprising. In particular, rejecting (i.e. throw) does not rollback state changes done before, while trapping (e.g. Debug.trap, assert …, out of cycle conditions) does.

Therefore, one should check all public update call entry points for unwanted state changes or unwanted rollbacks. In particular, look for methods (or rather, messages, i.e. the code between commit points) where a state change is followed by a throw.

This issues apply to all canisters, and are not Motoko-specific, although other CDKs may not turn exceptions into rejects (which don’t roll back).

Talking to malicious canisters

Talking to untrustworthy canisters can be risky, for the following (likely incomplete) reasons:

The other canister can withhold a response. Although the bidirectional messaging paradigm of the Internet Computer was designed to guarantee a response eventually, the other party can busy-loop for as long as they are willing to pay for before responding. Worse, there are ways to deadlock a canister.
The other canister can respond with invalidly encoded Candid. This will cause a Motoko-implemented canister to trap in the reply handler, with no easy way to recover. Other CDKs may give you better ways to handle invalid Candid, but even then you will have to worry about Candid cycle bombs that will cause your reply handler to trap.

Many canisters do not even do inter-canister calls, or only call other trustwothy canisters. For the others, the impact of this needs to be carefully assessed.

Canister upgrade: overview

For most services it is crucial that canisters can be upgraded reliably. This can be broken down into the following aspects:

Can the canister be upgraded at all?
Will the canister upgrade retain all data?
Can the canister be upgraded promptly?
Is there a recovery plan for when upgrading is not possible?

Canister upgradeability

A canister that traps, for whatever reason, in its canister_preupgrade system method is no longer upgradeable. This is a major risk. The canister_preupgrade method of a Motoko canister consists of the developer-written code in any system func preupgrade() block, followed by the system-generated code that serializes the content of any stable var into a binary format, and then copies that to stable memory.

Since the Motoko-internal serialization code will first serialize into a scratch space in the main heap, and then copy that to stable memory, canisters with more than 2GB of live data will likely be unupgradeable. But this is unlikely the first limit:

The system imposes an instruction limit on upgrading a canister (spanning both canister_preupgrade and canister_postupgrade). This limit is a subnet configuration value, and sepearate (and likely higher) than the normal per-message limit, and not easily determined. If the canister’s live data becomes too large to be serialized within this limit, the canister becomes non-upgradeable.

This risk cannot be eliminated completely, as long as Motoko and Stable Variables are used. It can be mitigated by appropriate load testing:

Install a canister, fill it up with live data, and exercise the upgrade. If this succeeds with a live data set exceeding the expected amount of data by a margin, this risk is probably acceptable. Bonus points for adding functionality that will prevent the canister’s live data to increase above a certain size.

If this testing is to be done on a local replica, extra care needs to be taken to make sure the local replica actually performs instruction counting and has the same resource limits as the production subnet.

An alternative mitigation is to avoid canister_pre_upgrade as much as possible. This means no use of stable var (or restricted to small, fixed-size configuration data). All other data could be

mirrored off the canister (possibly off chain), and manually re-hydrated after an upgrade.
stored in stable memory manually, during each update call, using the ExperimentalStableMemory API. While this matches what high-assurance Rust canisters (e.g. the Internet Identity) do, this requires manual binary encoding of the data, and is marked experimental, so I cannot recommend this at the moment.
not put into a Motoko canister until Motoko has a scalable solution for stable variable (for example keeping them in stable memory permanently, with smart caching in main memory, and thus obliterating the need for pre-upgrade code.)

Data retention on upgrades

Obviously, all live data ought to be retained during upgrades. Motoko automatically ensures this for stable var data. But often canisters want to work with their data in a different format (e.g. in objects that are not shared and thus cannot be put in stable vars, such as HashMap or Buffer objects), and thus may follow following idiom:

stable var fooStable = …;
var foo = fooFromStable(fooStable);
system func preupgrade() { fooStable := fooToStable(foo); })
system func postupgrade() { fooStable := (empty); })

In this case, it is important to check that

All non-stable global vars, or global lets with mutable values, have a stable companion.
The assignments to foo and fooStable are not forgotten.
The fooToStable and fooFromStable form bijections.

An example would be HashMaps stored as arrays via Iter.toArray(….entries()) and HashMap.fromIter(….vals()).

It is worth pointiong out that a code view will only look at a single version of the code, but cannot check whether code changes will preserve data on upgrade. This can easily go wrong if the names and types of stable variables are changed in incompatible way. The upgrade may fail loudly in this cases, but in bad cases, the upgrade may even succeed, losing data along the way. This risk needs to be mitigated by thorough testing, and possibly backups (see below).

Prompt upgrades

Motoko and Rust canisters cannot be safely upgraded when they are still waiting for responses to inter-canister calls (the callback would eventually reach the new instance, and because of infelicities of the IC’s System API, could possibly call arbitrary internal functions). Therefore, the canister needs to be stopped before upgrading, and started again. If the inter-canister calls take a long time, this mean that upgrading may take a long time, which may be undesirable. Again, this risk is reduced if all calls are made to trustworthy canisters, and elevated when possibly untrustworthy canisters are called, directly or indirectly.

Backup and recovery

Because of the above risk around upgrades it is advisable to have a disaster recovery strategy. This could involve off-chain backups of all relevant data, so that it is possible to reinstall (not upgrade) the canister and re-upload all data.

Note that reinstall has the same issue as upgrade described above in “prompt upgrades”: It ought to be stopped first to be safe.

Note that the instruction limit for messages, as well as the message size limit, limit the amount of data returned. If the canister needs to hold more data than that, the backup query method might have to return chunks or deltas, with all the extra complexity that entails, e.g. state changes between downloading chunks.

If large data load testing is performed (as Irecommend anyways to test upgradeability), one can test whether the backup query method works within the resource limits.

Time is not strictly monotonic

The timestamps for “current time” that the Internet Computer provides to its canisters is guaranteed to be monotonic, but not strictly monotonic. It can return the same values, even in the same messages, as long as they are processed in the same block. It should therefore not be used to detect “happens-before” relations.

Instead of using and comparing time stamps to check whether Y has been performed after X happened last, introduce an explicit var y_done : Bool state, which is set to False by X and then to True by Y. When things become more complex, it will be easier to model that state via an enumeration with speaking tag names, and update this “state machine” along the way.

Another solution to this problem is to introduce a var v : Nat counter that you bump in every update method, and after each await. Now v is your canister’s state counter, and can be used like a timestamp in many ways.

While we are talking about time: The system time (typically) changes across an await. So if you do let now = Time.now() and then await, the value in now may no longer be what you want.

Wrapping arithmetic

The Nat64 data type, and the other fixed-width numeric types provide opt-in wrapping arithmetic (e.g. +%, fromIntWrap). Unless explicitly required by the current application, this should be avoided, as usually a too large or negatie value is a serious, unrecoverable logic error, and trapping is the best one can do.

Cycle balance drain attacks

Because of the IC’s “canister pays” model, all canisters are prone to DoS attacks by draining their cycle balance, and this risk needs to be taken into account.

The most elementary mitigation strategy is to monitor the cycle balance of canisters and keep it far from the (configurable) freezing threshold.

On the raw IC-level, further mitigation strategies are possible:

If all update calls are authenticated, perform this authentication as quickly as possible, possibly before decoding the caller’s argument. This way, a cycle drain attack by an unauthenticated attacker is less effective (but still possible).
Additionally, implementing the canister_inspect_message system method allows the above checks to be performed before the message even is accepted by the Internet Computer. But it does not defend against inter-canister messages and is therefore not a complete solution.
If an attack from an authenticated user (e.g. a stakeholder) is to be expected, the above methods are not effective, and an effective defense might require relatively involved additional program logic (e.g. per-caller statistics) to detect such an attack, and react (e.g. rate-limiting).
Such defenses are pointless if there is only a single method where they do not apply (e.g. an unauthenticated user registration method). If the application is inherently attackable this way, it is not worth the bother to raise defenses for other methods.

Related: A justification why the Internet Identity does not use canister_inspect_message)

A motoko-implemented canister currently cannot perform most of these defenses: Argument decoding happens unconditionally before any user code that may reject a message based on the caller, and canister_inspect_message is not supported. Furthermore, Candid decoding is not very cycle defensive, and one should assume that it is possible to construct Candid messages that require many instructions to decode, even for “simple” argument type signatures.

The conclusion for the audited canisters is to rely on monitoring to keep the cycle blance up, even during an attack, if the expense can be born, and maybe pray for IC-level DoS protections to kick in.

Large data attacks

Another DoS attack vector exists if public methods allow untrustworthy users to send data of unlimited size that is persisted in the canister memory. Because of the translation of async-await code into multiple message handlers, this applies not only to data that is obviously stored in global state, but also local data that is live across an await point.

The effectiveness of such attacks is limited by the Internet Computer’s message size limit, which is in the order of a few megabytes, but many of those also add up.

The problem becomes much worse if a method has an argument type that allows a Candid space bomb: It is possible to encode very large vectors with all values null in Candid, so if any method has an argument of type [Null] or [?t], a small message will expand to a large value in the Motoko heap.

Other types to watch out:

Nat and Int: This is an unbounded natural number, and thus can be arbitrarily large. The Motoko representation will however not be much larger than the Candid encoding (so this does not qualify as a space bomb).

It is still advisable to check if the number is reasonable in size before storing it or doing an await. For example, when it denotes an index in an array, throw early if it exceeds the size of the array; if it denotes a token amount to transfer, check it against the available balance, if it denotes time, check it against reasonable bounds.
Principal: A Principal is effectively a Blob. The Interface specification says that principals are at most 29 bytes in length, but the Motoko Candid decoder does not check that currently (fixed in the next version of Motoko). Until then, a Principal passed as an argument can be large (the principal in msg.caller is system-provided and thus safe). If you cannot wait for the fix to reach you, manually check the size of the princial (via Principal.toBlob) before doing the await.

Shadowing of `msg` or `caller`

Don’t use the same name for the “message context” of the enclosing actor and the methods of the canister: It is dangerous to write shared(msg) actor, because now msg is in scope across all public methods. As long as these also use public shared(msg) func …, and thus shadow the outer msg, it is correct, but it if one accidentially omits or mis-types the msg, no compiler error would occur, but suddenly msg.caller would now be the original controller, likely defeating an important authorization step.

Instead, write shared(init_msg) actor or shared({caller = controller}) actor to avoid using msg.

Conclusion

If you write a “serious” canister, whether in Motoko or not, it is worth to go through the code and watch out for these patterns. Or better, have someone else review your code, as it may be hard to spot issues in your own code.

Unfortunately, such a list is never complete, and there are surely more ways to screw up your code – in addition to all the non-IC-specific ways in which code can be wrong. Still, things get done out there, so best of luck!

Comments

Have something to say? You can post a comment by sending an e-Mail to me at <mail@joachim-breitner.de>, and I will include it here.

Go up