NPF#14: Internet bandwidth load

First, let me give you some facts about the load on our internet line at NPF#13: we had an average load of 600Mbps and a peak of just over 1Gbps, which caused some latency spikes Saturday evening after the stage show. There were 1200 participants, and we had a single pfSense routing the traffic, with a second ready as backup. With these facts in mind and a ticket sale of 2200 for NPF#14, an 83% increase from NPF#13, we expected that 2Gbps on the internet line would be sufficient. More precisely, my expectation was a load of approximately 1.1Gbps on average and a peak of about 1.8Gbps, but as we'll see in a bit, this was way off the actual measured values.
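To make the estimate concrete, here is a minimal sketch of how those expected numbers fall out of the NPF#13 figures. The only assumption added here (mine, not stated above) is that per-participant demand scales linearly with the number of tickets sold:

```python
# Rough capacity estimate: scale NPF#13 measurements linearly with ticket sales.
# Assumption (mine): per-participant demand stays constant between events.

npf13_participants = 1200
npf14_participants = 2200                          # tickets sold for NPF#14
scale = npf14_participants / npf13_participants    # ~1.83, i.e. an 83% increase

npf13_avg_mbps = 600
npf13_peak_mbps = 1000                             # "just over 1Gbps"

expected_avg_mbps = npf13_avg_mbps * scale         # ~1100 Mbps -> ~1.1Gbps
expected_peak_mbps = npf13_peak_mbps * scale       # ~1833 Mbps -> ~1.8Gbps

print(f"expected average: {expected_avg_mbps:.0f} Mbps")
print(f"expected peak:    {expected_peak_mbps:.0f} Mbps")
```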

Luckily we had four 1Gbps lines. However, due to time pressure and my underestimation of the number of crew (network technicians) needed, we only had a single line up and running at the time of opening. My hope was that we could get the other lines up and running in time, before the internet bandwidth became a problem. Unfortunately, due to other problems with the network, it wasn't until 9 o'clock that I had time to work on the second gigabit line, and by that time the first line was under pressure, especially after the stage show and at the start of the tournaments.

First WAN link load

This unfortunately forced the game admins of League of Legends and Counter-Strike: Global Offensive to postpone the tournaments until we had more bandwidth, as Riot Games (League of Legends) and Valve (Counter-Strike: Global Offensive) had released updates that evening and people weren't able to download them fast enough. After spending the better part of 4 hours, with a lot of interruptions, I finally got the second gigabit line up and running. This is where the funny part comes: it took less than 30 seconds before the second line was fully utilized. People did, however, report that their speedtest results on http://speedtest.net/ increased from 0.4Mbps to approximately 25Mbps.

Second WAN link load

This however didn't solve the bandwidth problem completely, so I started setting up the third and fourth pfSenses, with one major difference: BitNissen and I found that it's possible to extract the configuration of one pfSense and load it into another. All that was left for us to do was to make some adjustments to the public IP values. This didn't succeed on the first attempt because we missed a few things, which forced us to do some manual troubleshooting to correct the mistakes. We learned from those mistakes, and the fourth pfSense was configured in less than 10 minutes. By the time we got to this point it was 4 o'clock in the morning and the load on the WAN links had dropped considerably, so I decided to get a couple of hours of sleep before connecting them to our network, ahead of the tournaments starting again (Saturday at 10 o'clock). Around 9:30 the last two pfSenses were connected to our network, and from that point on we didn't have any problems with the internet bandwidth. As mentioned in the other blog post, NPF#14: Network for 2200 people, we ended up with 23TB of traffic in total on the WAN interfaces of our core switches, and that is in only 47 hours. Below you can see the average load on the four WAN links.
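For those curious about the cloning trick: pfSense can export its full configuration as a single XML backup file, which can then be edited and restored on another box. The snippet below is only a hypothetical sketch of the kind of adjustment we did; the element paths (`interfaces/wan/ipaddr` and friends) are assumptions about the backup layout, so verify them against your own export before restoring anything:

```python
# Hypothetical sketch: clone a pfSense config backup and swap the WAN address.
# Element names are assumptions about how the backup XML is laid out;
# check your own exported file before relying on them.
import xml.etree.ElementTree as ET

def clone_config(src_path, dst_path, new_wan_ip, new_subnet_bits):
    tree = ET.parse(src_path)                 # backup exported from the first pfSense
    root = tree.getroot()

    wan = root.find("interfaces/wan")         # WAN interface section (assumed path)
    wan.find("ipaddr").text = new_wan_ip      # public IP for the new box
    wan.find("subnet").text = str(new_subnet_bits)

    hostname = root.find("system/hostname")   # give the clone its own name
    if hostname is not None:
        hostname.text = hostname.text + "-clone"

    tree.write(dst_path, xml_declaration=True, encoding="utf-8")

# Example (made-up file names and documentation-range IP):
# clone_config("pfsense3.xml", "pfsense4.xml", "203.0.113.4", 24)
```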

Load on the different WAN links

Finally, I will just mention that on Saturday evening we measured a peak bandwidth load of 2.8Gbps and an average of just over 2.1-2.2Gbps during normal gaming hours, which was a lot more than originally expected. So I'm guessing that next year we'll need a 10Gbps or larger line in order not to run out of bandwidth, if we choose to expand further.

NPF#14: Network for 2200 people

Let me be the first to acknowledge that this year the network caused more problems than is good, and much of it probably could have been caught with more preparation. The three main problems were the internet connection, DHCP snooping on the access switches, and the configuration of the SMC switches used as access switches. On that note, let me make it clear that the first two problems were solved during the first evening, while the last problem wasn't solved completely, only to the best of our ability. The general network worked very well, and we didn't have any problems with our Cisco hardware after Friday evening, apart from a single Layer 1 problem where someone had disconnected the uplink cable to a switch.

The size of the network and its traffic make it equivalent to that of a medium-sized company. We had 2200 participants with relatively high bandwidth requirements, not to mention several streamers. This puts some pressure on the distribution and core hardware. A typical construction for obtaining the highest possible speed is what I would call a double star topology: the first star goes from the core switches/routers to the distribution switches, and the second star goes from the distribution switches to the access switches. This is what DreamHack uses, with some built-in redundancies. We, however, chose a different topology for our network, namely what I would call a loop-star topology. The top topology is a loop where the core and distribution switches/routers are connected in one or more loop(s), and the bottom topology is a star from the distribution switches to the access switches. The top topology does of course place some requirements on the core and distribution in terms of routing protocols, if you choose to use layer 3 between them as we did.

Double star and Loop-star topologies
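To make the difference between the two topologies concrete, here is a small sketch of the same distribution layer expressed as adjacency lists. The switch names are made up for illustration; in the double star every distribution switch hangs directly off the core, while in the loop-star the core and distribution switches form a ring:

```python
# Toy illustration of the two topologies; switch names are invented.
# Each dict maps a switch to the switches it has direct links to.

double_star = {
    "core":  ["dist1", "dist2", "dist3"],          # every dist switch homed to the core
    "dist1": ["core", "acc1a", "acc1b"],           # star of access switches below each dist
    "dist2": ["core", "acc2a", "acc2b"],
    "dist3": ["core", "acc3a", "acc3b"],
}

loop_star = {
    "core":  ["dist1", "dist3"],                   # core sits on the ring
    "dist1": ["core", "dist2", "acc1a", "acc1b"],  # ring neighbours + access star
    "dist2": ["dist1", "dist3", "acc2a", "acc2b"],
    "dist3": ["dist2", "core", "acc3a", "acc3b"],
}

# In the loop-star, traffic between dist2 and the core can take either side of
# the ring, which is what gives the distribution layer its redundancy.
```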

The protocol we chose for our layer 3 routing was OSPF, as the original design included some HP ProCurve switches for distribution and EIGRP is Cisco proprietary, and thus not really an option. One of the remaining possibilities was RIP, but that protocol doesn't propagate fast enough in a loop topology, so it isn't really an option either. The reason for choosing the loop topology is quite simply redundancy in the distribution layer, and the idea of doing it this way is taken from the network built for SL2012 (Spejdernes Lejr 2012, http://sl2012.dk/en), where I helped out as a network technician. The major differences from SL2012 to NPF#14 are the number of loops and the bandwidth between the distribution points: SL2012 had a single one gigabit loop and two places where the internet was connected to the loop, while we at NPF#14 had two loops, one with two gigabits for administrative purposes and one with four gigabits for the participants, and only a single place where four gigabits of internet were connected.

Our core consisted of two Cisco switches in a stack, each of the participant distribution switches was a 48-port Cisco switch, and each of the administrative distribution switches was a 24-port Cisco switch. For the internet we had four 1 gigabit lines, each of which was set up with a pfSense for NATing onto the different scopes of the internet lines, and we then used layer 3 routing on the Cisco switches to load-balance across the four lines. Finally, everything from the distribution switches to the access switches is layer 2 based, and this is also how we control the number of people on the same subnet and on the same public IP, as we don't have a public IP per participant.
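The load balancing across the four pfSenses essentially comes down to the core having four equal-cost routes to the internet and picking one per flow. The snippet below is only a toy model of that idea; the hash inputs and next-hop names are mine, not the switch's actual hardware algorithm:

```python
# Toy model of equal-cost load balancing over four WAN pfSenses.
# A real Cisco switch hashes flows in hardware; this just shows the idea that a
# given source/destination pair consistently maps to one next hop (one WAN line).
import hashlib

NEXT_HOPS = ["pfsense1", "pfsense2", "pfsense3", "pfsense4"]  # four 1Gbps lines

def pick_next_hop(src_ip: str, dst_ip: str) -> str:
    key = f"{src_ip}->{dst_ip}".encode()
    digest = hashlib.sha256(key).digest()
    index = digest[0] % len(NEXT_HOPS)        # same flow -> same WAN line
    return NEXT_HOPS[index]

print(pick_next_hop("10.20.1.37", "104.16.0.1"))   # always the same line for this flow
print(pick_next_hop("10.20.1.38", "104.16.0.1"))   # a different host may land elsewhere
```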

But enough about the construction of our network; let us take a look at the traffic volumes over the 47 hours the event lasted. First off, the four internet lines moved 23TB of data, of which 20.2TB was download and 2.8TB was upload. The amount of traffic moved through the core on the participant loop was 21.48TB, 12.9TB on one side and 8.58TB on the other. The administrative loop, on the other hand, only moved 1.37TB of data through the core switch. With a little math we get that the average loads on the two sides of the participant loop were 625Mbps and 415Mbps respectively, while the average load on the administrative loop was 66.3Mbps. The average load on the internet from the core was 1.114Gbps, with a measured peak of 2.8Gbps. In the picture below the data is placed on the connections between the switches.

Traffic on different interfaces
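For anyone wanting to reproduce the averages, converting a terabyte total over the event window into Mbps is a one-liner. The sketch below uses decimal terabytes and the 47-hour window quoted above; the small differences from the figures in the text come down to rounding and the exact measurement window used at the time:

```python
# Convert total traffic (decimal TB) over the event window into an average rate in Mbps.
EVENT_HOURS = 47

def avg_mbps(terabytes: float, hours: float = EVENT_HOURS) -> float:
    bits = terabytes * 1e12 * 8          # TB -> bits
    seconds = hours * 3600
    return bits / seconds / 1e6          # bits per second -> Mbps

for name, tb in [("internet total", 23.0),
                 ("participant loop, side A", 12.9),
                 ("participant loop, side B", 8.58),
                 ("administrative loop", 1.37)]:
    print(f"{name}: {avg_mbps(tb):.0f} Mbps")
# Prints roughly 1087, 610, 406 and 65 Mbps, close to the averages quoted above.
```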

We have learned a thing or two from this year's construction of the network: layer 3 routing between distribution points and a multi-loop topology can work for a LAN party of our size. Furthermore, we might be able to reduce the bandwidth on the loops a bit, since the average loads are fairly low, but that doesn't say anything about the peak loads we encountered Friday and Saturday evening. Therefore, for good measure, the bandwidth on the different types of loops should probably just stay the same. Depending on the growth next year, it might be reasonable to build two participant distribution loops instead of just one and maintain the bandwidth of 4Gbps on both. Also depending on growth, the internet bandwidth has to follow; we were fairly close to our limit this year.