A Facebook datacenter consists of a large number of servers running diverse Facebook services, which are aggregated to serve any given user request. To allow this aggregation, servers interact with each other through different traffic flows that are managed by the network fabric. The underlying connectivity powering this fabric consists of a large number of pluggable optical interconnects and On Board Optical (OBO) modules carrying production data. Connectivity at this scale requires fast and reliable detection of link failures so that they can be resolved. In the first generation of deployments, link failure detection was a slow, sequential process. Troubleshooting was equally tedious, because the available tools could characterize only one optical transceiver at a time. Furthermore, failure analysis showed that in the majority of resolved cases failed optics were not the root cause, resulting in a high No Trouble Found (NTF) rate.
In this paper we introduce a novel link failure detection and resolution method that improves on the previous method along three dimensions: faster resolution, more reliable troubleshooting, and scalable implementation. We introduce the BER Illusion Methodology (BIM), a highly scalable and resource-efficient solution that significantly reduces the time taken to troubleshoot pluggable optical interconnects. It also scales to next-generation OBO modules in Facebook datacenters, with the aim of lowering the NTF rate and making optimal use of the available resources. BIM, which is based on Open Compute Project (OCP) network switches, can troubleshoot 128 QSFP28, 64 QSFP56, or 32 OBO modules simultaneously in under 30 minutes. The tool is easy to implement and also reports transceiver diagnostics, including transmitter power, transmitter bias current, receiver power, case temperature, per-channel bit error rate (BER) results, vendor information, and manufacturing part number. This additional test data, together with a true indication of failure, helps optics suppliers gain confidence and build credibility with customers. The open-source nature and universal applicability of the tool allow other users to adopt it and customize it further for their own networking needs.