On February 16th, SuproNet, a local Czech provider, single-handedly caused a global Internet meltdown for upwards of an hour. SuproNet accomplished this feat by sending out a rather unusual routing update, one that many routers did not handle well. The result was Internet bedlam.

“What we think happened next is the Internet equivalent of a massive buffer overflow. While most of the core routers run by major ISPs fared just fine, processing the ridiculous path and sending it on, others choked. Perhaps they weren’t as well maintained or were running buggy software. These routers viewed the update as malformed and so tore down their session with whoever sent them the update. In other words, two routers that were happily exchanging traffic with each other just moments before suddenly stopped all communication. Traffic was lost, alternative paths were explored, and maybe the former cooperating routers recovered and re-established contact,” wrote Earl Zmijewski, vice president and general manager at Renesys, in a blog post.

“SuproNet (AS 47868) normally announces a single prefix, 94.125.216.0/21, to a single provider, CD-Telematika (AS 25512). On February 16th at 16:23:30 UTC, we saw this same prefix via a different provider, Sloane Park Property Trust (AS 29113), but with an AS path exceeding 255 ASNs. Such messages continued for almost exactly one hour or until 17:23:00 UTC. We observed Level 3 (AS 3356), Tiscali (AS 3257) and TeliaSonera (AS 1299) propagating most of these routes globally, with a total of 230 unique ASes ultimately sending us the problematic announcements.

This single Czech provider announcing a single prefix caused a huge increase in the global rate of updates, peaking at 107,780 updates per second. This peak occurred at 16:30:54 UTC, less than 8 minutes after the first announcement.

At Renesys, we call a prefix impacted in a given hour if it either suffers an outage or has a non-trivial amount of instability. In the hour before this event, there were 1215 impacted prefixes globally out of a total of 271,175. During the event, that number surged to 12,920 or 4.8% of all prefixes on earth. One announcement from one provider and we have a 10-fold increase in planetary routing instability for an hour. North America suffered the most, increasing from 0.35% to 4.76%, while South America suffered the least, increasing from 0.52% to 1.75%,” he added.
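The figures in the quote can be checked with a few lines of arithmetic. The sketch below only re-derives numbers from the values quoted above; the variable and function names are illustrative, and the times are treated as same-day UTC clock times.

```python
from datetime import datetime

# Figures quoted from the Renesys post above.
TOTAL_PREFIXES = 271_175
IMPACTED_BEFORE = 1_215
IMPACTED_DURING = 12_920


def share_percent(impacted: int, total: int) -> float:
    """Return impacted prefixes as a percentage of all prefixes."""
    return 100.0 * impacted / total


def clock(hms: str) -> datetime:
    """Parse an HH:MM:SS clock time (the date part is irrelevant here)."""
    return datetime.strptime(hms, "%H:%M:%S")


before_pct = share_percent(IMPACTED_BEFORE, TOTAL_PREFIXES)
during_pct = share_percent(IMPACTED_DURING, TOTAL_PREFIXES)
fold_increase = IMPACTED_DURING / IMPACTED_BEFORE

# Timestamps quoted above: first announcement, update-rate peak, last announcement.
first_seen = clock("16:23:30")
peak_delay = clock("16:30:54") - first_seen
event_duration = clock("17:23:00") - first_seen

print(f"impacted before: {before_pct:.2f}% of prefixes")
print(f"impacted during: {during_pct:.2f}% of prefixes")
print(f"increase: {fold_increase:.1f}x")
print(f"peak reached after {peak_delay}, event lasted {event_duration}")
```

The roughly 10.6x ratio matches the "10-fold increase" in the quote, and the peak comes 7 minutes 24 seconds after the first announcement.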

The IOS bug is triggered only when the inbound AS-path contains close to 255 AS numbers and the router performs inbound or outbound AS-path prepending.
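Some background on why 255 is the magic number: in BGP-4 (RFC 4271) each AS_PATH segment starts with a one-octet type and a one-octet count of AS numbers, so a single segment can hold at most 255 ASNs and a longer path has to be split across several segments. The following is a minimal sketch of that encoding, not the code of any router; `encode_as_path` is a hypothetical helper, it assumes 2-octet AS numbers, and the path of 260 repeated ASNs is made up for illustration.

```python
import struct

AS_SEQUENCE = 2              # segment type code defined in RFC 4271
MAX_ASNS_PER_SEGMENT = 255   # the per-segment count field is a single octet


def encode_as_path(asns, asn_size=2):
    """Encode a list of ASNs as one or more AS_SEQUENCE segments.

    Returns only the AS_PATH attribute value (no attribute header).
    """
    fmt = {2: "!H", 4: "!I"}[asn_size]
    out = bytearray()
    for start in range(0, len(asns), MAX_ASNS_PER_SEGMENT):
        chunk = asns[start:start + MAX_ASNS_PER_SEGMENT]
        # Segment header: type octet, then the number of ASNs that follow.
        out += struct.pack("!BB", AS_SEQUENCE, len(chunk))
        for asn in chunk:
            out += struct.pack(fmt, asn)
    return bytes(out)


# Hypothetical over-long path: one upstream ASN followed by 260 copies
# of the origin ASN (the count is illustrative, not taken from the event).
path = [29113] + [47868] * 260
encoded = encode_as_path(path)

print(len(path), "ASNs encoded in", len(encoded), "octets")
print("first segment count:", encoded[1])
print("second segment count:", encoded[2 + 255 * 2 + 1])
```

A path of 261 ASNs therefore needs two segments, one with 255 entries and one with 6. Software that assumes the whole path fits in a single one-octet count is the kind of code that can mangle such an update.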

The new bug is tracked as CSCsx73770 and affects downstream EBGP or IBGP sessions as follows:

  • When you do inbound AS-path prepending and receive a BGP update where the total length of the inbound AS-path plus the prepending exceeds 255, the AS-path in your BGP table is completely mangled. The path is sporadically advertised to IBGP or EBGP peers and kills downstream BGP sessions (the remote BGP peer sends a BGP notification because the UPDATE message is invalid).
  • When you do outbound AS-path prepending and send a BGP update where the total length of the AS-path in your BGP table, the prepended AS-path and your own AS number exceeds 255, the outbound EBGP update is incorrect and the downstream EBGP peer sends a BGP notification, resulting in a BGP session reset. IBGP peers are not affected, as IOS does not perform AS-path prepending on IBGP sessions.
  • If you don’t prepend, you’re safe. IOS marks BGP paths with an AS-path length greater than or equal to 254 as invalid and does not propagate them, so the AS-path length in the outbound update can never exceed 255 without prepending.
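The three cases above boil down to simple length checks. The sketch below models only the arithmetic described in the bullets; it is not how IOS is implemented, and the function names and constants are hypothetical.

```python
MAX_AS_PATH_LEN = 255    # limit named in the bullets above
INVALID_THRESHOLD = 254  # paths this long or longer are marked invalid


def inbound_overflows(received_len: int, inbound_prepends: int) -> bool:
    """Case 1: received AS-path plus inbound prepending exceeds 255."""
    return received_len + inbound_prepends > MAX_AS_PATH_LEN


def outbound_overflows(table_len: int, outbound_prepends: int) -> bool:
    """Case 2: table AS-path, outbound prepending and own ASN exceed 255."""
    return table_len + outbound_prepends + 1 > MAX_AS_PATH_LEN


def is_propagated(table_len: int) -> bool:
    """Case 3: paths of length >= 254 are invalid and never advertised."""
    return table_len < INVALID_THRESHOLD


# Without prepending, every path that is still propagated stays within the
# limit: the longest one has 253 ASNs, plus the router's own ASN makes 254.
safe_without_prepending = all(
    not outbound_overflows(n, 0) for n in range(256) if is_propagated(n)
)

print("inbound 250 + 6 prepends overflows:", inbound_overflows(250, 6))
print("outbound 253 + 2 prepends overflows:", outbound_overflows(253, 2))
print("safe without prepending:", safe_without_prepending)
```

Under this model, a received path of 250 ASNs overflows with six inbound prepends, while a router that never prepends cannot produce an outbound path longer than 254.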

 

Internet instability

[Figure: Global Instability by Country - Before]

[Figure: Global Instability by Country - During]