Echelon: Peer-to-Peer Network Diagnosis with Network Coding

It is critical to monitor the performance and “health” of large-scale peer-to-peer applications. For example, operators of peer-to-peer live streaming applications may be interested in observing performance bottlenecks, peer failures, and network topologies. In most cases, such observations are used to diagnose potential problems in the protocol design, to troubleshoot network outages, or to improve the Quality of Service of the peer-to-peer network in general. They are not time-sensitive: observations delayed by minutes or even hours are still valuable. However, such historical and delay-tolerant observations should include measurements from peers that have already failed or departed, as peer dynamics significantly affect the health of peer-to-peer applications. We refer to such a delay-tolerant observation of a peer-to-peer application over a historical period of time as a diagnosis. In this paper, we present Echelon, a time-insensitive way to construct the diagnosis of a large-scale peer-to-peer application. Instead of relying on the conventional approach of logging servers, we leverage the power of network coding to collect application-specific measurements on each peer and disseminate them to other peers in coded form. Over time, measurements of departed peers can still be recovered, simply by probing a small subset of peers in the network. Simulation studies show that Echelon is highly configurable, bandwidth efficient, and extremely tolerant of peer dynamics, thanks to the advantages of randomized network coding.
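To make the idea of disseminating measurements "in coded form" more concrete, the following is a minimal sketch of random linear coding, not Echelon's actual protocol. It assumes, purely for brevity, that each measurement is an integer and that coding coefficients are drawn from GF(2), so combining blocks reduces to XOR; the abstract does not specify the field, block format, or dissemination schedule. The names `encode` and `decode` are hypothetical.

```python
import random


def encode(blocks, rng):
    """Return (mask, payload): a random GF(2) combination of the blocks.

    Bit i of mask is the coding coefficient of blocks[i]; payload is the
    XOR of every block whose coefficient is 1. (Assumption: integer blocks.)
    """
    n = len(blocks)
    mask = rng.getrandbits(n)
    if mask == 0:  # avoid the useless all-zero combination
        mask = 1 << rng.randrange(n)
    payload = 0
    for i, block in enumerate(blocks):
        if (mask >> i) & 1:
            payload ^= block
    return mask, payload


def decode(coded, n):
    """Recover the n original blocks by Gaussian elimination over GF(2).

    Returns None when the coded blocks do not yet span all n originals,
    in which case more peers would have to be probed.
    """
    pivots = {}  # pivot bit -> (mask, payload), kept in reduced form
    for mask, payload in coded:
        # Cancel every pivot bit already known from this new row.
        for bit, (p_mask, p_payload) in pivots.items():
            if (mask >> bit) & 1:
                mask ^= p_mask
                payload ^= p_payload
        if mask == 0:
            continue  # linearly dependent on earlier rows, adds nothing
        bit = mask.bit_length() - 1
        # Cancel the new pivot bit from the rows stored earlier.
        for b, (p_mask, p_payload) in list(pivots.items()):
            if (p_mask >> bit) & 1:
                pivots[b] = (p_mask ^ mask, p_payload ^ payload)
        pivots[bit] = (mask, payload)
    if len(pivots) < n:
        return None
    return [pivots[i][1] for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(2024)
    # Hypothetical setup: 8 measurements, 40 peers each storing one coded block.
    measurements = [rng.getrandbits(32) for _ in range(8)]
    stored = [encode(measurements, rng) for _ in range(40)]
    # Probe only a subset of peers; the rest may have departed.
    probed = rng.sample(stored, 16)
    result = decode(probed, len(measurements))
    print("recovered" if result == measurements else "need more coded blocks")
```

Because every stored block mixes many measurements, it does not matter which particular peers survive: any probed set whose coefficient vectors are linearly independent and span all originals suffices. A practical system would more likely use a larger field such as GF(2^8) to lower the chance of dependent combinations.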