APNIC Pty Ltd.

02/08/2024 | Press release | Distributed by Public on 01/08/2024 23:02

When routing breaks your (open) DNS service

If you haven't seen it yet, this post from Cloudflare discusses a loss of service they experienced on their open DNS service in late June. The outage was caused by a combination of a hijack and route leaks. The article is an excellent explainer of what route leaks and hijacks are, and of the role that mitigation techniques like RPKI and Remotely Triggered Black-Hole (RTBH) filtering can play, as well as newer Border Gateway Protocol (BGP) features like 'only-to-customer (OTC)'.
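To make the OTC idea concrete, here is a minimal sketch of the ingress and egress rules described in RFC 9234. It is not taken from the Cloudflare post or from any router implementation: the `Route` class, the role constants and the function names are hypothetical, and the AS numbers and prefix come from the ranges reserved for documentation.

```python
from dataclasses import dataclass
from typing import Optional

# Role of the *neighbor* as seen from the local AS (role names follow RFC 9234).
PROVIDER, CUSTOMER, PEER, RS, RS_CLIENT = (
    "provider", "customer", "peer", "rs", "rs-client")


@dataclass
class Route:
    prefix: str
    otc: Optional[int] = None  # value of the OTC attribute, if present


def ingress(route: Route, neighbor_role: str, neighbor_asn: int) -> Optional[Route]:
    """Apply the OTC ingress rules; return None when the route is a leak."""
    if route.otc is not None:
        # A route already marked OTC must never come back up from below.
        if neighbor_role in (CUSTOMER, RS_CLIENT):
            return None
        # From a peer, the OTC value must be the peer's own AS number.
        if neighbor_role == PEER and route.otc != neighbor_asn:
            return None
        return route
    # Unmarked route from a provider, peer or route server: mark it.
    if neighbor_role in (PROVIDER, PEER, RS):
        return Route(route.prefix, otc=neighbor_asn)
    return route


def egress(route: Route, neighbor_role: str, local_asn: int) -> Optional[Route]:
    """Apply the OTC egress rules; return None when the route must not be sent."""
    if route.otc is not None:
        # Marked routes may only travel downwards, to customers or RS clients.
        if neighbor_role in (PROVIDER, PEER, RS):
            return None
        return route
    # Unmarked route going to a customer, peer or RS client: mark it.
    if neighbor_role in (CUSTOMER, PEER, RS_CLIENT):
        return Route(route.prefix, otc=local_asn)
    return route


# A route learned from provider AS64500 is marked on ingress ...
learned = ingress(Route("192.0.2.0/24"), PROVIDER, 64500)
# ... so the local AS (AS64501 here) refuses to pass it on to a peer.
leak_attempt = egress(learned, PEER, 64501)
```

In this sketch `leak_attempt` is `None`: once a route has been marked on its way down from a provider or across from a peer, a compliant speaker will not send it back up or sideways, which is the behaviour a route leak violates.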

The event wasn't globally visible and, while it may have affected millions of users, it was probably marginal relative to the total Internet-wide user base.

Cloudflare uses the netblock under an agreement with APNIC; the resource was delegated to APNIC Labs through the policy process that led to prop-109. Adopted and implemented in May 2014, prop-109 recognized the severe taint of this address range caused by the widespread use of the block in documentation and example configurations, as well as for some public-facing services like free Wi-Fi in fast food chains.

This event caused significant disruptions. Cloudflare first noticed reachability issues, then disabled a peering location and raised the issue with the operators of the Autonomous System (AS) involved. However, that AS continued leaking, even with a new AS-PATH.

Despite these actions, traffic to the prefix remained problematic and the impact was severe, with many customers unable to reach it or experiencing high latency. The root cause was a combination of BGP hijacking and route leaking, exacerbated by a peer's lack of filtering, which allowed invalid routes to be widely distributed. Additionally, a Tier-1 provider blackholed traffic due to an unauthorized route announcement, causing widespread traffic disruption.
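The filtering that was missing here is often built on RPKI route origin validation. The following is a minimal sketch of that classification, following the procedure in RFC 6811. The function name, the tuple layout used for ROAs and the sample data are assumptions for illustration; the prefixes and AS numbers are documentation values, not those from the incident.

```python
import ipaddress


def validate(prefix, origin_asn, roas):
    """Classify an announcement as 'valid', 'invalid' or 'not-found'.

    roas is a list of (roa_prefix, max_length, asn) tuples, a hypothetical
    in-memory stand-in for the validated payloads a router would receive.
    """
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # Only ROAs whose prefix covers the announcement are relevant.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        # A match needs the right origin AND a permitted prefix length.
        if origin_asn == roa_asn and net.prefixlen <= max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    return "invalid" if covered else "not-found"


# Hypothetical ROA: AS64500 may originate 192.0.2.0/24, and nothing longer.
ROAS = [("192.0.2.0/24", 24, 64500)]

legit = validate("192.0.2.0/24", 64500, ROAS)     # the holder's own route
hijack = validate("192.0.2.1/32", 64511, ROAS)    # more-specific, wrong origin
unknown = validate("198.51.100.0/24", 64500, ROAS)  # no covering ROA
```

A network that drops invalid routes would discard the second announcement. A route leak that preserves the legitimate origin AS still validates as valid, though, so origin validation alone does not stop leaks.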

To resolve the issue, tools like Monocle and BGP Monitoring Protocol (BMP) data helped trace and analyse the BGP updates and route propagation. Cloudflare then disabled multiple peering locations, announcements to some of its upstreams, and the impacted Tier-1 provider.
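As a rough illustration of this kind of analysis, the sketch below tallies unexpected origin ASNs in a set of observed BGP updates. It does not use Monocle's interface or the BMP wire format: the `(prefix, as_path)` tuples, the function names and the sample data are hypothetical, and it assumes each AS path is a plain space-separated list of AS numbers with no AS_SET segments.

```python
from collections import Counter


def origin_of(as_path):
    """Return the origin AS, i.e. the right-most ASN in the path."""
    return int(as_path.split()[-1])


def unexpected_origins(updates, expected_origin):
    """Count updates per (prefix, origin) whose origin is not the expected AS."""
    counts = Counter()
    for prefix, as_path in updates:
        origin = origin_of(as_path)
        if origin != expected_origin:
            counts[(prefix, origin)] += 1
    return counts


# Hypothetical updates as a route collector might have recorded them.
UPDATES = [
    ("192.0.2.0/24", "64496 64497 64500"),  # expected origin
    ("192.0.2.1/32", "64496 64511"),        # more-specific, other origin
    ("192.0.2.1/32", "64498 64499 64511"),  # same origin via another path
]

suspects = unexpected_origins(UPDATES, expected_origin=64500)
```

Here `suspects` records two updates for `("192.0.2.1/32", 64511)`, which points to where to look first. Spotting a leak, where the origin is correct but the path is not, would need a look at the intermediate ASNs as well.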

No regular delegate can handle the level of unexpected traffic that would result from announcing this prefix. However, Cloudflare, as a globally distributed service, has a huge network of BGP-speaking nodes. These nodes can accept traffic within anycast distance, allowing Cloudflare to detect misdirected packets everywhere without overwhelming any single node. Additionally, Cloudflare's DNS service benefits from being offered on a well-known address that is usually reachable from very close to the user.

The key takeaways from the incident are to adhere to MANRS standards, and to consider stricter RPKI validation and Discard Origin Authorization (DOA) objects for RTBH validation. It's a fascinating post and a timely reminder, which I encourage you to read in full.

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.