Project: Backoff Peer-Routing and Syncing

type: #Project

Problem

The tsunami of requests.
We used to sync with every Mintter peer we had ever heard about: we iterated over all of the KeyDelegation blobs we have locally and tried to call each of those peers in an attempt to pull data from them.
It turns out that most of those peers are dummy data or simply offline, so every sync interval we try to resolve their PeerIDs to IP addresses, without realizing the resolution will never succeed.
We've detected huge numbers of open connections and a lot of traffic sent over the DHT protocol due to this behavior, which was eating all the network bandwidth and breaking the internet connection.
To mitigate the problem we decided to only pull data from the peers you trust (plus all the sites, because they don't exhibit this problem: we already know the precise address of a site by doing a DNS lookup and then an HTTP call to the site's /.well-known endpoint). The assumption was that if we limit the number of peers we try to resolve, there shouldn't be so much traffic. It turns out that even trusted peers may have a lot of devices that are offline, and if the number of trusted peers is somewhat big (in the tens), the issue remains: the internet connection still goes down.

Solution

    Refactor the architecture of syncing.
      Instead of trying to sync every minute with all the "targets", spread the load more evenly and independently: instead of one loop that wakes up once a minute, run multiple worker loops that pull jobs from a queue (see the worker-pool sketch after this list).
      Prioritize targets in the queue: put peers that are currently online first, then peers whose addresses we already know from previous interactions, and so on.
      Limit the concurrency of peer-routing requests. We need to find the number of concurrent requests that doesn't cause a huge spike in traffic and open connections (see the semaphore sketch below).
      Implement a backoff! If we couldn't talk to a peer within our sync interval, increase that peer's interval progressively with the number of failed attempts (see the backoff sketch below).
    Put the built-in libp2p resource manager in place. We currently have it set to unlimited. We're experimenting with a hard limit on open connections specifically for the DHT protocol to prevent spikes from occurring (see the resource-manager sketch below).
    Potentially decouple the process of syncing from the process of discovering peers' addresses, although with the measures described above this might become somewhat less relevant.
    Expose an API call to forcefully sync with a given peer, ignoring any backoffs and queues. This might not be so important, because presumably the other peer would also have you as a trusted peer, so they'll dial you when they are online. But for peers without mutual trust it could work as a troubleshooting measure.
    We keep syncing with trusted peers only + all the sites.
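
A rough sketch of the worker-pool idea, in Go. The names here (syncJob, runScheduler, syncPeer) and the scheduling tick are made up for illustration; the real code would plug into our existing syncing service and feed failures back into the per-peer backoff bookkeeping:

```go
package syncing

import (
	"context"
	"sort"
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
)

// syncJob is one unit of work: sync with a single peer.
// The fields are hypothetical; the real job would carry more state.
type syncJob struct {
	PID       peer.ID
	Online    bool // peer is currently connected
	KnownAddr bool // we remember an address from previous interactions
	Due       time.Time
}

// runScheduler feeds due jobs to a fixed pool of workers, instead of one big
// loop that tries to sync with every target at once every minute.
func runScheduler(ctx context.Context, pending func() []syncJob, syncPeer func(context.Context, peer.ID) error, workers int) {
	jobs := make(chan syncJob)

	for i := 0; i < workers; i++ {
		go func() {
			for j := range jobs {
				// Each worker handles one peer at a time; failures feed the backoff.
				_ = syncPeer(ctx, j.PID)
			}
		}()
	}

	t := time.NewTicker(10 * time.Second) // scheduling tick, not the sync interval itself
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			close(jobs)
			return
		case <-t.C:
			due := pending() // jobs whose backoff interval has elapsed
			// Online peers first, then peers with a known address.
			sort.Slice(due, func(a, b int) bool {
				if due[a].Online != due[b].Online {
					return due[a].Online
				}
				return due[a].KnownAddr && !due[b].KnownAddr
			})
			for _, j := range due {
				select {
				case jobs <- j:
				case <-ctx.Done():
					close(jobs)
					return
				}
			}
		}
	}
}
```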
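
A minimal sketch of limiting concurrent peer-routing lookups with a semaphore. The routingLimiter wrapper is hypothetical; it simply guards FindPeer calls to whatever routing.PeerRouting implementation we use (e.g. the DHT), and the right capacity is what we need to find experimentally:

```go
package syncing

import (
	"context"

	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/routing"
)

// routingLimiter caps how many FindPeer lookups can run at once, so that a
// batch of unreachable peers doesn't translate into a burst of DHT traffic.
type routingLimiter struct {
	router routing.PeerRouting
	sem    chan struct{} // semaphore: capacity == max concurrent lookups
}

func newRoutingLimiter(r routing.PeerRouting, maxConcurrent int) *routingLimiter {
	return &routingLimiter{router: r, sem: make(chan struct{}, maxConcurrent)}
}

// FindPeer acquires a semaphore slot (or gives up when the context is done)
// before delegating to the underlying router.
func (l *routingLimiter) FindPeer(ctx context.Context, pid peer.ID) (peer.AddrInfo, error) {
	select {
	case l.sem <- struct{}{}:
		defer func() { <-l.sem }()
	case <-ctx.Done():
		return peer.AddrInfo{}, ctx.Err()
	}
	return l.router.FindPeer(ctx, pid)
}
```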
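
And a minimal sketch of the per-peer backoff itself. The base interval, the cap, and the jitter are placeholder numbers to be tuned:

```go
package syncing

import (
	"math/rand"
	"time"
)

const (
	baseSyncInterval = time.Minute   // assumed base interval; to be tuned
	maxSyncInterval  = 2 * time.Hour // assumed cap so offline peers are still retried occasionally
)

// nextSyncInterval returns how long to wait before retrying a peer,
// doubling the base interval with every consecutive failed attempt
// and adding a bit of jitter so peers don't all wake up at once.
func nextSyncInterval(failedAttempts int) time.Duration {
	d := baseSyncInterval
	for i := 0; i < failedAttempts; i++ {
		d *= 2
		if d >= maxSyncInterval {
			d = maxSyncInterval
			break
		}
	}
	jitter := time.Duration(rand.Int63n(int64(d) / 10)) // up to 10% jitter
	return d + jitter
}
```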
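
For the resource-manager point, a hedged sketch assuming a recent go-libp2p where limits are expressed via rcmgr.PartialLimitConfig (field names differ slightly between versions), and assuming the default Kademlia protocol ID /ipfs/kad/1.0.0 (our DHT may use a different one). The resource manager's protocol scope limits streams and memory rather than connections, so the cap here is expressed in streams; the concrete numbers are placeholders to experiment with:

```go
package syncing

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/protocol"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

// newHostWithDHTLimits replaces the unlimited resource manager with one that
// caps streams for the Kademlia DHT protocol, leaving everything else at the
// library defaults.
func newHostWithDHTLimits() (host.Host, error) {
	partial := rcmgr.PartialLimitConfig{
		Protocol: map[protocol.ID]rcmgr.ResourceLimits{
			"/ipfs/kad/1.0.0": {
				StreamsInbound:  64,  // placeholder
				StreamsOutbound: 64,  // placeholder
				Streams:         128, // placeholder
			},
		},
	}
	// Everything we didn't set explicitly falls back to the scaled defaults.
	limits := partial.Build(rcmgr.DefaultLimits.AutoScale())

	mgr, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(limits))
	if err != nil {
		return nil, err
	}
	return libp2p.New(libp2p.ResourceManager(mgr))
}
```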

Rabbit Holes

    Ad-hoc content routing of hypermedia entities.
      It will still work by trying to sync all the blobs from a remote peer that claims to be a provider of a certain piece of content. Improving this will be a separate project.
      After this initial sync during the resolution process, this peer won't be added to the trusted set of peers, so no further periodic sync will be performed.

No Gos

    Fancy settings for fine-tuning the syncing process.
    Subscription-based syncing for anything more granular than an entire peer.
    Partial syncing of a peer's data. For now, all of a peer's data will be synced.