Optus routing to certain US-based networks appears to have gone bad around 2017-08-30 04:28 UTC, fluctuating wildly until ICMP RTT settled down to ~100ms worse than usual around 16:15 UTC, where it's stayed ever since. The new path is extremely congested during APAC day; I'm struggling to get 400Kbps to parts of the US west coast at present. It might be fallout from the AU<->Asia cable break, but it doesn't seem like AU<->US traffic should be directly impacted and it feels more like some peering has gone bad. Here's an ICMP RTT graph of the event: http://imgur.com/CmnR2O3 (times are UTC+10).
For one of my US servers, the usual latency from home of 180ms spiked to 450ms for several hours before dropping back to a stable 280ms. Traffic to a subset of US west and east coast hosts (eg. github.com, ebay.com.au, weebly.com as above) now gets routed via Tata (AS6453) in Japan (eg. ix-xe-10-2-1-40.tcore1.TV2-Tokyo.as6453.net). Interestingly, traffic to some other US hosts routes via SingTel (AS7473) to the US west coast and then uses Tata only within the US to reach the east coast, without touching Japan at all. The path from Optus to the SingTel US west coast routers seems to be uncongested.
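If anyone else wants to eyeball their own traceroutes for the same thing, here's a quick sketch that tags hops by reverse-DNS suffix. The suffix-to-carrier map is just illustrative (the only hostname I'm sure of is the as6453.net one from my own traces; the SingTel suffix is an assumption):

```python
# Rough sketch: classify traceroute hops by reverse-DNS suffix to spot
# which carrier each hop belongs to. Only the as6453.net suffix is taken
# from my traceroutes; the SingTel entry is an assumed example.
CARRIER_SUFFIXES = {
    ".as6453.net": "Tata (AS6453)",
    ".singtel.com": "SingTel (AS7473)",  # assumption, adjust for your traces
}

def classify_hop(hostname: str) -> str:
    """Map a traceroute hop's reverse DNS name to a carrier label."""
    for suffix, carrier in CARRIER_SUFFIXES.items():
        if hostname.lower().endswith(suffix):
            return carrier
    return "unknown"

hops = [
    "ix-xe-10-2-1-40.tcore1.TV2-Tokyo.as6453.net",  # seen on my slow path
]
for h in hops:
    print(h, "->", classify_hop(h))
```

Feed it the hostnames from `traceroute` output and any Tokyo Tata hop on a US-bound trace jumps right out.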
https://pastebin.com/MC6VNAZT has a selection of incriminating traceroutes with good and bad routes from a LAN behind Optus residential cable. Poking at Optus's routing tables (https://pastebin.com/ALbWTDh9) confirms what the traceroutes show: the fast path reaches 6453 via 7473, while the slow path reaches 6453 directly from 7474. That router picks the 6541 route to Linode Newark rather than the 6543 one I got locally, but in either case it takes 7473 to the US, so it's fast. Interestingly enough, while the surprise Japanese tv2-tcore1 shows up on the direct 7474 -> 6453 path, the route to tv2-tcore1 itself takes 7474 -> 7473 -> 6453!
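For anyone wondering why the router prefers the new path at all: with the usual attributes equal, BGP's tie-break prefers the shortest AS path, so a route learned directly from 6453 beats the same prefix learned via 7473. A toy sketch of that one tie-break (the AS paths are made up; 64512 is a placeholder origin AS, not anything real):

```python
# Minimal sketch of one BGP tie-break: shortest AS path wins when earlier
# attributes (local-pref etc.) are equal. Paths are illustrative only;
# 64512 is a placeholder (private-range) origin ASN.
def best_path(candidates):
    """Pick the candidate with the shortest AS path (simplified tie-break)."""
    return min(candidates, key=len)

via_singtel = [7473, 6453, 64512]  # old, fast route via SingTel
direct_tata = [6453, 64512]        # new, congested route via Tata in Japan

winner = best_path([via_singtel, direct_tata])
print("preferred AS path:", winner)
```

Which is exactly why an unfiltered new adjacency can hijack traffic that already had a perfectly good longer path.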
Looking up AS7474 (Optus) on https://stat.ripe.net/widget/asn-neighbours-history suggests that the 7474<->6453 adjacency indeed only appeared on the 30th. I wonder if the direct Optus AU -> Tata JP peering was added urgently to mitigate the effect on Asian connectivity of the SEA-ME-WE3 break earlier that day, but the route filter is misconfigured and doesn't exclude all North American routes?
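The widget above is backed by RIPEstat's data API, so the neighbour check can be scripted. A sketch of that (the response shape is my assumption from the asn-neighbours endpoint's documented output; the live call is left commented since it needs network access):

```python
# Sketch: pull AS7474's neighbour list from the RIPEstat data API and
# check for the suspect 6453 adjacency. Response shape assumed from the
# asn-neighbours endpoint docs.
URL = "https://stat.ripe.net/data/asn-neighbours/data.json?resource=AS7474"

def neighbour_asns(payload: dict) -> set:
    """Extract the set of neighbour ASNs from an asn-neighbours response."""
    return {n["asn"] for n in payload["data"]["neighbours"]}

# Live query (needs network):
#   import json, urllib.request
#   payload = json.load(urllib.request.urlopen(URL))
# Offline illustration with a trimmed sample payload:
sample = {"data": {"neighbours": [{"asn": 7473, "type": "right"},
                                  {"asn": 6453, "type": "right"}]}}
print("6453 adjacency present:", 6453 in neighbour_asns(sample))
```

Run the live query on two dates either side of the 30th and diff the sets; that's effectively what the history widget shows.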
There is an undersea cable fault off Hong Kong caused by the recent typhoons. Have a look at previous posts on this forum.
I'm aware, but the usual Optus transit to the US (via AS7473) does not appear to be affected (not surprising -- SEA-ME-WE3 is a long way out of the way). The problem seems to be that a new BGP adjacency with AS6453 in Tokyo is advertising routes from AS6453's North American peers to Optus, which doesn't make much sense since it's indirect and only going to exacerbate the stress on the links between Australia and Asia caused by the cable break.
A ramification of this is intermittent client connections to Rackspace-hosted Exchange servers. I spent the better part of half a day troubleshooting, only to find this routing issue causing intermittent proxy server connection errors in Optus clients' Outlook connections to Rackspace. The same setup on two other ISPs tested fine. tracert to connect.emailsrvr.com on Optus is 27 hops with mad latency; on an iiNet connection it's 17 hops with standard latency (and no proxy connection errors).
I investigated more during the worst of the congestion last night (the path is only reasonably uncongested between about 00:30 and 07:30). iperf showed 45-50% packet loss toward an affected US host but 0% on the way back -- all of the loss is on the outbound path. The return path from the US doesn't touch Asia, going directly from the US to Australia via AS7473, as traffic to the US did until last week. So all the loss is on the indirect, congested outbound route via the new AS6453 adjacency, while US hosts reply via the sensible, direct, uncongested route.
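For completeness, the loss figures are just iperf's lost/sent datagram counters. A trivial sketch of the arithmetic (the datagram counts below are illustrative; only the resulting percentages match what I measured):

```python
# Back-of-envelope check on the iperf UDP numbers: loss percentage from
# datagrams sent vs. reported lost. Counts are illustrative; the ~47%
# outbound / 0% inbound split matches what I measured.
def loss_pct(lost: int, sent: int) -> float:
    """Packet loss as a percentage of datagrams sent."""
    return 100.0 * lost / sent

outbound = loss_pct(lost=4700, sent=10000)  # AU -> US via the new Tata path
inbound = loss_pct(lost=0, sent=10000)      # US -> AU via the old direct path
print(f"outbound {outbound:.0f}% loss, inbound {inbound:.0f}% loss")
```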
I could buy that the US->AU pipes used by Optus are saturated, but it doesn't seem likely that AU->US is saturated if US->AU isn't, since Australia almost certainly downloads vastly more from the US than it uploads. So why route outbound traffic via Japan if return traffic makes it back fine without hitting Japan?
I cannot see any scenario under which a break between Australia and Singapore should route AU->US traffic through Asia. Multiple sources show that AS7474 only started peering with AS6453 on the 30th, so the problematic path was impossible before the 30th. Adding that peer on the 30th to mitigate the break and add an extra path to Asia makes sense, but using it as an alternate path to the US doesn't seem to.
[ Third try on this reply. My previous two messages got silently deleted. You can see on my profile page that two are missing. Rewording this time in case I hit some content filter. ]
This all fits with what I am seeing. Our problems appeared to start on the morning of 1st September (that's when we started getting send/receive errors), but even before that, emails were slower than I thought they should be. We can receive emails OK, though even the receiving process takes a lot longer than, say, receiving from the Optus server -- at least it doesn't time out, apart from once this morning, when it worked on the second attempt.
@Toomey - for your reference. Please ask network engineers to reconsider and rework the routing for US sites, because this one isn't working.
The bad routing to the US was fixed 50 minutes ago. The new path bypasses AS7474 entirely, going directly from AS4804 -> AS7473. route-views.optus.net.au (on AS7474) still thinks that 6453 is the best way out, but at least Optus residential connections don't. Traffic to Japan still seems to use the new AS6453 path, so they may finally just be filtering routes properly.
I wonder if our incessant complaints worked. Hopefully someone will confirm what the issue was, rather than just silently fixing it.
I've explored pretty thoroughly and everything looks good. Even routing to Asia isn't completely broken.
I can't see anything that looks like a new link, so I think we really did just spend five days, 21 hours and 9 minutes with severely degraded US connectivity because they wouldn't escalate my initial complaints last week. Support agents' claims that it was the direct fault of the SEA-ME-WE3 cable break seem thoroughly debunked.