Here are some thoughts about another topic/question I find myself discussing with network geeks and business execs alike -
Is network budgeting about saving money or increasing revenue?
Or -
Does squeezing network costs run the risk of impacting users, revenue, or whatever your important metric is?
A bit of both, though the usual focus is mostly on cost.
If your business model is related in any way to customer performance though, this can be a very short-sighted way to look at things.
At AboveNet in the late 1990s, we were aggressively going after traffic sinks (eyeball networks) and trying to peer with them directly, ideally via private interconnect, and ideally with much more total capacity than we needed at the time.
The motto was “QoS should stand for Quantity of Service, not Quality of Service”.
At first this may look pretty expensive – but if you have a positive gross margin on your services, in most cases, the expensive thing is to congest your bandwidth or allow it to be congested.
Every time we turned up substantial new private interconnects we saw traffic go up.
Why? Well, if traffic is congested there is a slight increase in traffic from retransmits, but the throughput per stream goes to hell, and the real damage happens outside the TCP stack – users just go away and find something else to do.
At Akamai, we had similar experiences with placing boxes directly into many edge networks – even for networks we peered with (or bought transit from). In many cases, though, some juggling was needed to make sure the boxes had good network connectivity – connecting from within many major networks’ data centers was WORSE than using transit to reach them, due to data center network aggregation.
In many cases, the problem is easy to find – a full transit link somewhere. For most networks, I recommend that the NOC or an eng group take a look 3-4 times/day (shift change at the NOC is a good time, especially if the NOC shifts are offset so someone is coming or going every 4 hours). Humans are really good at eyeballing pathological graphs. Take a look at the top 10 transit, peering, customer, and backbone graphs and you can often find links in need of investigation.
If the congestion is ‘behind’ one of your transit providers, though, you’ll need to get more clever.
The first avihack-ish way I did this involved a large AboveNet content-provider customer. This customer (we’ll call them ddrive) served content of a specific nature that was highly popular. As with many content-provider customers, when there were network issues they would open a ticket claiming that ‘signups are down’, which doesn’t give a network engineer much to go on.
What I wound up doing was writing a perl program for them that downloaded a daily BGP dump, looked at their signup logs (on their servers), aggregated signups by origin ASN, and then sent a report to them and to us. From that, we could occasionally find that signups were in fact down proportionally in some networks, and use that as a way to hunt for connectivity issues. And as we turned up more and more direct connectivity to those networks, signups from those networks did in fact go up proportionally.
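A minimal sketch of that kind of report (not the original program) might look like the following. It assumes a pre-digested dump with one ‘prefix origin-ASN’ pair per line and a signup log with the client IP as the second field – both invented formats, not ddrive’s or AboveNet’s actual ones – and uses the Net::Patricia module for the longest-prefix match:

    #!/usr/bin/perl
    # Sketch only: aggregate signups by origin ASN of the client address.
    # Input formats are invented for illustration:
    #   bgp-dump:   "prefix origin_asn" per line (e.g. "192.0.2.0/24 64501")
    #   signup-log: whitespace-separated, client IP in field 2
    use strict;
    use warnings;
    use Net::Patricia;                 # radix trie, does longest-prefix match

    my ($bgp_dump, $signup_log) = @ARGV;
    die "usage: $0 bgp-dump signup-log\n" unless $bgp_dump && $signup_log;

    # Load prefix -> origin ASN.
    my $pt = Net::Patricia->new;
    open my $bgp, '<', $bgp_dump or die "$bgp_dump: $!";
    while (<$bgp>) {
        my ($prefix, $asn) = split;
        $pt->add_string($prefix, $asn) if $prefix && $asn;
    }
    close $bgp;

    # Count signups per origin ASN.
    my %signups;
    open my $log, '<', $signup_log or die "$signup_log: $!";
    while (<$log>) {
        my (undef, $ip) = split;
        next unless $ip and $ip =~ /^\d+(\.\d+){3}$/;   # v4 only in this sketch
        my $asn = $pt->match_string($ip) || 'unknown';
        $signups{$asn}++;
    }
    close $log;

    # Daily report, busiest origin ASNs first.
    printf "%-10s %6d signups\n", $_, $signups{$_}
        for sort { $signups{$b} <=> $signups{$a} } keys %signups;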
AboveNet’s (and Akamai’s) business model worked very well for this, since both billed customers on a usage (vs. flat-rate) basis.
That’s one reason that I think that the flat-rate model can be unhealthy – it drives customers towards trying to fill pipes, which hurts their business and ultimately the industry.
So anyway -
How do you find out where these problems are on your network?
The easiest way I know of is with passive monitoring of the TCP statistics – the actual performance of the TCP sessions that your customers are seeing across your network.
There are three places this can be done:
First, on the client. This is messy, as the client environment can be slow. A few service providers have taken this approach; some have been bought, some are very expensive, and I don’t follow this space thoroughly, so perhaps there are better options now.
Second, ‘in the middle’. This is what you’re stuck with if you run a network and don’t interface with smart customers who have TCP statistics. To get stats at this level you’ll need to copy traffic (via optical tap or switch SPANning) to boxes that watch the traffic and do TCP state reconstruction. There are packages that do this (Argus, for example), and at a simple level you can approximate performance by just looking at the time delta between the SYN-ACK and the ACK of the SYN-ACK (there’s a rough sketch of that after the third option below). Ideally you want to look at peak throughput over short periods of time (say, 1-2 seconds) and, most importantly, the retransmit rate/ratio.
Third, and perhaps best, you can grab the TCP statistics from the kernel before it throws them away. The kernel is already accumulating this data, so it’s just a matter of getting it out. The Web100 patches for Linux do this, and some others have developed custom patches as well. There are some notable tricks with HTTP – with persistent connections, the session may sit idle for reasonable lengths of time, so the retransmit ratio and session setup timing will still be meaningful, but throughput will appear abnormally low. Ideally for that you’d snapshot the stats at the end of every transaction, which the application can do if you set up a path for it to query the TCP stats for its socket.
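To make the SYN-ACK delta idea from the second option concrete, here’s a rough sketch that watches tcpdump output on a tap/SPAN port and prints a handshake time per client. It assumes IPv4, an interface named eth0 (a placeholder), and a modern tcpdump that prints ‘Flags [S.]’ for a SYN-ACK – it’s an approximation, not a real TCP-state reconstructor like Argus:

    #!/usr/bin/perl
    # Sketch only: estimate handshake RTT from the SYN-ACK -> ACK time delta.
    use strict;
    use warnings;

    # eth0 is a placeholder for your tap/SPAN interface.
    open my $td, '-|', 'tcpdump', '-l', '-ttnn', '-i', 'eth0', 'tcp'
        or die "tcpdump: $!";

    my %synack;    # "client > server" => time we saw the server's SYN-ACK
                   # (a real version would expire stale entries)

    while (<$td>) {
        # e.g. "1700000000.123456 IP 192.0.2.1.80 > 198.51.100.2.51514: Flags [S.], ..."
        my ($ts, $src, $dst, $flags) =
            /^(\d+\.\d+) IP (\S+) > (\S+): Flags \[([^\]]+)\]/ or next;

        if ($flags eq 'S.') {
            # server -> client SYN-ACK; key by the reverse (client -> server) direction
            $synack{"$dst > $src"} = $ts;
        }
        elsif ($flags eq '.' and exists $synack{"$src > $dst"}) {
            # client -> server ACK completing the handshake
            my $ms = ($ts - delete $synack{"$src > $dst"}) * 1000;
            printf "%-22s handshake ~%.1f ms\n", $src, $ms;
        }
    }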
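And for the third option, on stock Linux you can get a useful subset of what Web100 exposes without patching anything, via getsockopt(TCP_INFO) on the socket at transaction end. A hypothetical sketch – the field offsets assume the long-stable leading portion of struct tcp_info, so check linux/tcp.h on your kernels before trusting them:

    #!/usr/bin/perl
    # Sketch only: per-socket TCP stats straight from the Linux kernel.
    use strict;
    use warnings;
    use Socket qw(IPPROTO_TCP);

    use constant TCP_INFO => 11;    # option number from linux/tcp.h

    # Call with a connected TCP socket, e.g. right after the application
    # writes the last byte of a transaction on it.
    sub tcp_stats {
        my ($sock) = @_;
        my $raw = getsockopt($sock, IPPROTO_TCP, TCP_INFO);
        return unless defined $raw and length($raw) >= 104;
        # struct tcp_info: 8 bytes of u8 fields, then a run of u32s (native order).
        my @f = unpack('C8 L24', $raw);
        return {
            rtt_ms        => $f[8 + 15] / 1000,    # tcpi_rtt, microseconds
            rttvar_ms     => $f[8 + 16] / 1000,    # tcpi_rttvar, microseconds
            lost          => $f[8 + 6],            # tcpi_lost
            total_retrans => $f[8 + 23],           # tcpi_total_retrans
        };
    }

Snapshot that at each transaction boundary and diff total_retrans between snapshots to get a per-transaction retransmit count.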
Some other approaches to consider:
For readnews, I wanted a simple passive monitoring system and played with both Argus and Web100. Being a fussy geek, I found that neither exactly suited. Generally I prefer not to run custom kernels, and I also didn’t want to set up a bank of monitoring machines and stripe traffic across them with hashing or other hackery.
So I instrumented the Usenet server software to throw a log entry when the response started and a log entry when the last \r\n.\r\n was sent, and we have some perl code that trolls the logs computing throughput per transaction, aggregating by origin AS, and throwing the results into various RRDs.
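A stripped-down sketch of the shape of that post-processing – with an invented ‘start_epoch end_epoch bytes client_ip’ log format, the same imaginary prefix-to-ASN dump as above, and one pre-created RRD per ASN holding a single GAUGE data source – would be:

    #!/usr/bin/perl
    # Sketch only: per-transaction throughput, averaged per origin ASN,
    # pushed into per-ASN RRDs (assumed to already exist under /var/rrd).
    use strict;
    use warnings;
    use Net::Patricia;
    use RRDs;

    my ($bgp_dump, $xfer_log) = @ARGV;
    die "usage: $0 bgp-dump xfer-log\n" unless $bgp_dump && $xfer_log;

    # prefix -> origin ASN, as in the signup example above.
    my $pt = Net::Patricia->new;
    open my $bgp, '<', $bgp_dump or die "$bgp_dump: $!";
    while (<$bgp>) {
        my ($pfx, $asn) = split;
        $pt->add_string($pfx, $asn) if $pfx && $asn;
    }
    close $bgp;

    # Invented log format: "start_epoch end_epoch bytes client_ip" per transaction.
    my (%kbps_sum, %count);
    open my $log, '<', $xfer_log or die "$xfer_log: $!";
    while (<$log>) {
        my ($start, $end, $bytes, $ip) = split;
        next unless $ip and $ip =~ /^\d+(\.\d+){3}$/ and $end > $start;
        my $kbps = ($bytes * 8 / 1000) / ($end - $start);
        my $asn  = $pt->match_string($ip) || 'unknown';
        $kbps_sum{$asn} += $kbps;
        $count{$asn}++;
    }
    close $log;

    # Average kbps per origin ASN into that ASN's RRD.
    for my $asn (keys %count) {
        RRDs::update("/var/rrd/asn-$asn.rrd",
                     sprintf("N:%.1f", $kbps_sum{$asn} / $count{$asn}));
        my $err = RRDs::error;
        warn "AS$asn: $err\n" if $err;
    }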
The most generally applicable approach for HTTP-based applications is pretty simple but, in my experience, rarely done – just log both the start and stop times of HTTP transactions!
Most web logs only record the time the transaction started, which doesn’t really help. With start and stop times and the object/transaction size, you have all the data you need to catch the gross badness: look for repeated throughput that’s less than N kbps (say, 200) and aggregate, alert, etc. You could even do that by POP, machine, and URL base to find applications or application components that are slow from everywhere, in addition to looking for destinations that are slow from all application segments.
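For example, with Apache you can tack %D (time taken to serve the request, in microseconds) onto the end of the LogFormat; given that plus the response size that’s already there, a quick sketch of the kind of post-processing I mean – field positions assume a stock combined log with %D appended, and the 200 kbps threshold is just an example – might be:

    #!/usr/bin/perl
    # Sketch only: flag HTTP transactions slower than a threshold and tally
    # them per URL and per client, from a combined log with %D appended.
    use strict;
    use warnings;

    my $THRESH_KBPS = 200;
    my (%slow_by_url, %slow_by_client);

    while (<>) {
        # client - - [ts] "METHOD /url HTTP/x.x" status bytes "referer" "ua" usec
        my ($client, $url, $bytes, $usec) =
            /^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" \d+ (\d+) ".*" ".*" (\d+)$/
            or next;
        next unless $usec > 0;
        my $kbps = ($bytes * 8 / 1000) / ($usec / 1_000_000);
        next if $kbps >= $THRESH_KBPS;
        $slow_by_url{$url}++;
        $slow_by_client{$client}++;
    }

    print "slow (<$THRESH_KBPS kbps) transactions by URL:\n";
    printf "  %6d  %s\n", $slow_by_url{$_}, $_
        for sort { $slow_by_url{$b} <=> $slow_by_url{$a} } keys %slow_by_url;

    print "slow transactions by client:\n";
    printf "  %6d  %s\n", $slow_by_client{$_}, $_
        for sort { $slow_by_client{$b} <=> $slow_by_client{$a} } keys %slow_by_client;

Feed the per-client side through the same origin-ASN aggregation as the earlier sketches and you get the destination-oriented view as well.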
Anyway, hope this helps or at least inspires some insight and network monitoring projects!