Blob Visibility & Access Control for Private Documents

    This is a status update on the current state of the Private Documents: Phase 1 implementation, and it also serves as the technical and architectural documentation for it.

    The names of files and other code-related things are valid at the time of writing. They might change as we continue developing these features.

    The Core Idea

    Some blobs in our system can now be private, i.e. restricted only to members of the space who have write access. The main purpose is to implement private documents with optional collaboration.

    In practice, we currently only support syncing private blobs through a server, but in the future we could support direct P2P syncing as well.

    Visibility Tracking

      For each blob we record its visibility in the blob_visibility table, which supersedes the public_blobs table (we were only tracking public blobs before).

      Some blobs are always public — Contacts, Capabilities, Profiles.

      Some blobs have explicit visibility (via the visibility field on the blob itself) — Refs, Comments.

      Other blobs inherit their visibility from the blobs that point to them, e.g. Changes, UnixFS files (DagPB). By default these blobs are neither public nor private, but when they are part of some Ref's history they transitively inherit the visibility of that Ref.

      Visibility flows through CID links between blobs (stored in the blob_links table), based on blob types and link types. The rules for this propagation are stored in the blob_visibility_rules table. These rules are not dynamic — they are simply stored in the database as an optimization for some recursive queries we do, to avoid unnecessary back-and-forth between Go and C. The rules themselves are documented in the schema — see the backend/storage/schema.sql file.

      Public blobs are always public and open to anyone. The visibility for private blobs is tracked per space. A private blob can be accessible to one or multiple spaces, e.g. branches and forks of documents, the same image being part of multiple private documents in different spaces, the same UnixFS chunks being part of different roots, and things like that.

      The code for visibility propagation is in backend/blob/index_visibility.go.
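
      To make the inheritance idea more concrete, here is a minimal, self-contained sketch of the propagation step. All the names in it are hypothetical — the real code in backend/blob/index_visibility.go works against the SQLite tables mentioned above and differs in detail.

      // Sketch only: hypothetical stand-ins for the blob_visibility,
      // blob_links, and blob_visibility_rules tables.
      package main

      import "fmt"

      type visibility string

      const (
          public  visibility = "public"
          private visibility = "private"
      )

      // A visibility record: which space can see the blob, and how.
      type record struct {
          space string
          vis   visibility
      }

      // A propagation rule: visibility flows from a blob of type blobType
      // through links of type linkType to the blobs it points to.
      type rule struct{ blobType, linkType string }

      // A CID link from one blob to another.
      type link struct{ linkType, target string }

      // propagate copies the source blob's visibility records to every linked
      // blob that matches a rule. The real system also has to guarantee that
      // the end result is the same regardless of indexing order.
      func propagate(srcCID, srcType string, links []link, rules map[rule]bool, vis map[string][]record) {
          for _, l := range links {
              if !rules[rule{srcType, l.linkType}] {
                  continue
              }
              vis[l.target] = append(vis[l.target], vis[srcCID]...)
          }
      }

      func main() {
          vis := map[string][]record{
              "ref-1": {{space: "space-a", vis: private}}, // a private Ref in space-a
          }
          rules := map[rule]bool{{"Ref", "head"}: true} // "head" is a made-up link type
          propagate("ref-1", "Ref", []link{{"head", "change-1"}}, rules, vis)
          fmt.Println(vis["change-1"]) // the Change inherits: [{space-a private}]
      }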

      Some hard rules:

        When a blob has at least one public visibility record in the visibility table — it's considered public for everybody, even if it has private visibility records as well.

        It's really important to track the visibility of blobs accurately to avoid leaking private data to unauthorized peers.

        We must ensure that visibility propagates correctly regardless of the order in which blobs are indexed — the end result must be consistent.

    Access Control

      The Dichotomy with Peer ID

        The primary cause of complexity in our access control model is that we grant access to accounts, but we enforce access on peers. The relationship between accounts and peers in our system is not always explicit, not one-to-one, and not even always visible.

        There's a way for a peer to represent an account explicitly — for example, when you have your account keys in your desktop app. However, there's also an implicit rule for peers that represent remote web servers for a given space — these peers do not belong to any account explicitly. Instead, they are designated by space owners when a web domain is specified in the siteUrl metadata field of the home document. Private blobs are synced via servers, and those servers are exactly these implicitly designated peers.

        You may have noticed that there are two dimensions to this peer dichotomy — peers <-> accounts, and peers <-> web domains from the siteUrl field.

        Overall, a peer should be able to access a blob if at least one of the following conditions is met (see the sketch after this list):

          The blob has public visibility (mind the hard rule number 1 from the previous section).

          The peer is related to a space that has access to the blob (as per blob_visibility table), explicitly or implicitly.
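
        To make the decision rule concrete, here is a minimal sketch of it in Go. All the names are hypothetical — in reality these checks go through the SQLite tables and the peer-authentication state described below.

        package main

        import "fmt"

        type visibility string

        const (
            public  visibility = "public"
            private visibility = "private"
        )

        // A row from the (hypothetical, simplified) blob_visibility table.
        type record struct {
            space string
            vis   visibility
        }

        // canAccess decides whether a peer may receive a blob, given the blob's
        // visibility records and the set of spaces the peer is related to
        // (explicitly via Authenticate, or implicitly via siteUrl).
        func canAccess(records []record, peerSpaces map[string]bool) bool {
            for _, r := range records {
                // Hard rule: a single public record makes the blob public for everybody.
                if r.vis == public {
                    return true
                }
            }
            for _, r := range records {
                if peerSpaces[r.space] {
                    return true
                }
            }
            return false
        }

        func main() {
            recs := []record{{space: "space-a", vis: private}}
            fmt.Println(canAccess(recs, map[string]bool{"space-a": true})) // true
            fmt.Println(canAccess(recs, nil))                              // false
        }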

      Bitswap

        Currently the only way a blob can leave our node is via Bitswap, so this is the place where we must be extremely careful.

        Bitswap has a feature called PeerBlockRequestFilter (search the codebase for this string to find the place where it's used; currently it's in backend/hmnet/hmnet.go), which is basically a callback that looks like this:

        func(p peer.ID, c cid.Cid) bool
        

        An application can provide this callback, which will be called every time a remote peer requests a blob from us.

        This is very inefficient though — it's called every time for every blob and peer, one by one. Bitswap network messages usually operate with batches of blobs, but this check can't take advantage of that batching, which makes the per-check database access expensive and slow. For this reason there are some tricks we do to avoid hitting the database for every single check.

        Before a remote peer tries to download a blob from us, there are a few ways they can learn about what we can provide: we could announce blobs in a push request, during the periodic syncing with RBSR, or via the on-demand discovery with RBSR. All of these are related, but not exactly the same.

      Syncing

        The first step to not leaking private information is to make sure we don't announce it to unauthorized peers when we sync with them.

        Any kind of syncing starts from the syncing.GetRelatedMaterial function. This function now has an authorizedSpaces argument which is a set of spaces for which we should collect private blobs. We always collect public blobs, but when we find a private one we check whether any space in authorizedSpaces has access to it (by checking the blob_visibility table), and if not — we skip it. Passing an empty list of authorizedSpaces will only collect public blobs.
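
        Here is a rough sketch of that filtering step. The helper names and data shapes are made up — the real syncing.GetRelatedMaterial does this with recursive SQL queries against the blob_visibility table.

        package main

        import "fmt"

        // blobInfo is a hypothetical stand-in for what the real code reads
        // from the blob_visibility table for each related blob.
        type blobInfo struct {
            cid       string
            isPublic  bool
            visibleTo []string // spaces that have access to this private blob
        }

        // collectForPeer keeps public blobs, and only those private blobs that are
        // visible to at least one of the peer's authorized spaces. An empty
        // authorizedSpaces set yields public blobs only.
        func collectForPeer(related []blobInfo, authorizedSpaces map[string]bool) []string {
            var out []string
            for _, b := range related {
                if b.isPublic {
                    out = append(out, b.cid)
                    continue
                }
                for _, space := range b.visibleTo {
                    if authorizedSpaces[space] {
                        out = append(out, b.cid)
                        break
                    }
                }
            }
            return out
        }

        func main() {
            related := []blobInfo{
                {cid: "pub-1", isPublic: true},
                {cid: "priv-1", visibleTo: []string{"space-a"}},
                {cid: "priv-2", visibleTo: []string{"space-b"}},
            }
            fmt.Println(collectForPeer(related, map[string]bool{"space-a": true})) // [pub-1 priv-1]
            fmt.Println(collectForPeer(related, nil))                              // [pub-1]
        }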

        Being careful when announcing blobs helps us simplify the Bitswap checks we talked about before — by using a simple concept of an allowlist. Every time we announce a batch of blobs to a peer we simply add this information to an internal allowlist in memory, speeding up the Bitswap filter checks considerably — we've already done the work for enforcing access control for the whole batch of blobs, no need to do it again for each request one by one.
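
        Here is a sketch of how such an allowlist can back the Bitswap filter. The allowlist type and its wiring are hypothetical; only the callback shape comes from Bitswap itself, and a real implementation would fall back to a full database check when a pair is not in the allowlist.

        package main

        import (
            "fmt"
            "sync"

            "github.com/ipfs/go-cid"
            "github.com/libp2p/go-libp2p/core/peer"
        )

        // allowlist remembers which blobs we've already announced (and therefore
        // already authorized) for which peers.
        type allowlist struct {
            mu      sync.RWMutex
            allowed map[peer.ID]map[cid.Cid]struct{}
        }

        func newAllowlist() *allowlist {
            return &allowlist{allowed: make(map[peer.ID]map[cid.Cid]struct{})}
        }

        // Announce records a batch of blobs authorized for a peer, so the
        // per-blob Bitswap check becomes a cheap in-memory lookup.
        func (a *allowlist) Announce(p peer.ID, batch []cid.Cid) {
            a.mu.Lock()
            defer a.mu.Unlock()
            m, ok := a.allowed[p]
            if !ok {
                m = make(map[cid.Cid]struct{})
                a.allowed[p] = m
            }
            for _, c := range batch {
                m[c] = struct{}{}
            }
        }

        // Filter has the PeerBlockRequestFilter shape: func(peer.ID, cid.Cid) bool.
        func (a *allowlist) Filter(p peer.ID, c cid.Cid) bool {
            a.mu.RLock()
            defer a.mu.RUnlock()
            _, ok := a.allowed[p][c]
            return ok
        }

        func main() {
            al := newAllowlist()
            p := peer.ID("not-a-real-peer-id") // illustrative only
            var c cid.Cid                      // zero value, illustrative only
            al.Announce(p, []cid.Cid{c})
            fmt.Println(al.Filter(p, c)) // true
        }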

        This is currently implemented only in the push workflow, but should probably be implemented in RBSR as well. It's a bit more complicated there, though: we don't know when an RBSR peer has finished downloading the data from us, we only know when we've finished reconciling the blob CIDs. Maybe this could be an optimization for the future.

    P2P APIs

      So, there are two places where we need to enforce access control: when we announce blobs to peers, and when we receive Bitswap requests. In both of these cases we need to know which account a peer might be related to, explicitly or implicitly.

      For the explicit flow we have a new P2P RPC — Authenticate (defined in proto/p2p/v1alpha/p2p.proto). Peers that want to reveal their explicit relationship to one or more accounts call this API to authenticate themselves as those accounts. This authentication is done by signing an ephemeral Capability blob delegating from the account key to the peer key, restricted to an explicit "audience" (the peer we're talking to) to prevent replay attacks. The client generates this ephemeral Capability, signs it, and then passes the necessary information to the remote peer, which reconstructs the Capability and verifies its signature. We enforce that this Capability is fresh, to avoid all the complexity of revocations and other permanent effects of long-term Capabilities.
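
      Below is a simplified sketch of what this handshake amounts to. The payload layout, helper names, and the use of plain ed25519 are illustrative only — the real ephemeral Capability is a proper signed blob, and the actual RPC is defined in proto/p2p/v1alpha/p2p.proto.

      package main

      import (
          "crypto/ed25519"
          "crypto/rand"
          "fmt"
          "time"
      )

      // capabilityPayload is what the account key signs: a delegation from the
      // account to a specific peer, for a specific audience, at a specific time.
      func capabilityPayload(delegatePeer, audiencePeer string, issued time.Time) []byte {
          return []byte(fmt.Sprintf("delegate=%s|audience=%s|issued=%d", delegatePeer, audiencePeer, issued.Unix()))
      }

      // verify reconstructs the payload on the receiving side, checks the account
      // signature, and enforces freshness so we never have to deal with revocation.
      func verify(accountPub ed25519.PublicKey, delegatePeer, audiencePeer string, issued time.Time, sig []byte, maxAge time.Duration) bool {
          if time.Since(issued) > maxAge {
              return false // stale Capability — reject it
          }
          return ed25519.Verify(accountPub, capabilityPayload(delegatePeer, audiencePeer, issued), sig)
      }

      func main() {
          accountPub, accountPriv, _ := ed25519.GenerateKey(rand.Reader)

          // Client side: sign an ephemeral delegation from the account key to our
          // peer key, scoped to the peer we're about to talk to (the audience).
          issued := time.Now()
          sig := ed25519.Sign(accountPriv, capabilityPayload("our-peer-id", "their-peer-id", issued))

          // Server side: reconstruct the payload, verify the signature, check freshness.
          fmt.Println(verify(accountPub, "our-peer-id", "their-peer-id", issued, sig, time.Minute))  // true
          fmt.Println(verify(accountPub, "our-peer-id", "wrong-audience", issued, sig, time.Minute)) // false
      }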

      We store the authentication information for the currently connected peers in memory (search the codebase for peerAuthStore). As soon as a peer disconnects, we forget its authentication information. This is probably suboptimal, as it can cause churn when peers come and go, but storing it blindly in the database could also lead to problems, like DoS attacks, stale information, and so on.

      For the implicit flow, we look up the spaces for the requested blob in the blob_visibility table, then for each of those spaces we look up their siteUrl, then we call those URLs on the /hm/api/config HTTP route to look up their peer ID, and then we compare whether it matches the ID of the peer we're talking to. This is pretty convoluted, and could be simplified with additional indexing and caching. We do currently cache the result of this lookup in memory, in an LRU cache, but it would benefit from more persistent storage, so we should probably store this in the database as well. The code for this lives in backend/blob/site_peer_resolver.go.
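
      A sketch of the happy path of that lookup is below. The JSON field name and the exact shapes are assumptions — the real code in backend/blob/site_peer_resolver.go also handles errors and caches the results.

      package main

      import (
          "encoding/json"
          "fmt"
          "net/http"
          "strings"
          "time"
      )

      // siteConfig is what we expect the /hm/api/config route to return.
      // The field name "peerId" is a guess for this sketch.
      type siteConfig struct {
          PeerID string `json:"peerId"`
      }

      // isSitePeer checks whether the requesting peer is the web server that the
      // space owners designated via the siteUrl metadata field.
      func isSitePeer(siteURL, requestingPeer string) (bool, error) {
          client := &http.Client{Timeout: 10 * time.Second}
          resp, err := client.Get(strings.TrimRight(siteURL, "/") + "/hm/api/config")
          if err != nil {
              return false, err
          }
          defer resp.Body.Close()

          var cfg siteConfig
          if err := json.NewDecoder(resp.Body).Decode(&cfg); err != nil {
              return false, err
          }
          return cfg.PeerID == requestingPeer, nil
      }

      func main() {
          // For each space that can see the requested blob, we'd resolve its siteUrl
          // and run this check against the peer that sent the Bitswap request.
          ok, err := isSitePeer("https://example.com", "peer-id-from-the-connection")
          fmt.Println(ok, err)
      }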

    Frontend APIs

      For private blobs to come into existence someone needs to create them first 😅. In the documents API used by the frontend we've added the visibility field to CreateDocumentChange and CreateRef requests. The visibility field was also added to the Document, DocumentInfo, and Ref types. This way we can create documents with private visibility, and know which documents are private at read time. The idea is to filter out private documents on the web for now — they will only be visible in the app.

      Comments are meant to inherit the visibility of their target document when they are created, but they do have an explicit visibility field for convenience (e.g. when we have the comment but don't have the document yet — it's good to know that the target document is private).

    Final Notes

      Some of the things described here are still missing. I will create tasks in Linear.

      Of course the UI part is also missing, although according to Private Documents Designs the initial implementation is very simple. Hopefully we can make it work with no help from the frontend team, as they are busy with other things :)