Hacking the Hackers

by Anonymous

Or, 27 Hours of Troubleshooting and Tracking Compromised Servers, and How I Learned to Love NetFlow

Friday night, still digesting a delicious sandwich made from the previous day's Thanksgiving turkey, I received a call on my work phone...  It was two months after starting my job as a network engineer for the datacenters at my new employer.  I was on my first on-call rotation over the long Thanksgiving Day weekend.  My on-call predecessor had been dealing with some issues, but we had no hand-off, so I assumed all was well.  I went into Friday with a new phone and laptop...  I guess I was prepared!?

I discovered that evening exactly what my on-call predecessor had been dealing with - a very odd issue where anyone attempting to connect to the datacenter in Charlotte was periodically hit with significant lag and packet loss.  It would occur anywhere from every 30 minutes to every two hours, sometimes longer.  It would hit for about five to ten minutes at a time and, yes, that was long enough for people to notice and complain.  It had been going on for several days with no resolution.  None of my colleagues - nor our highly experienced, technical managers who had been promoted up through the ranks - could figure it out.

The conference bridge was frenetic and, after weeding through all the pressures and demands, I managed to get caught up on what was happening and where.  I poked around and looked at the services and devices reported to be affected.  I found nothing wrong inside the datacenter, and the problem seemed to impact many or all services in the datacenter, not just one or two or even a handful.  I dug back into my history of troubleshooting experience in networking, system support, and software development, and realized there had to be one common issue.  What was the single point that could cause all of this?

I was still learning, so I had little experience with how this datacenter connected, but I quickly discovered that, as a national datacenter, it connected directly to the company backbone.  Well, that made things a bit simpler - the only way in and out was through a redundant pair of backbone devices, and that pair was the only thing everything in this datacenter had in common.  Experience taught me - and my gut told me - it had to be the bottleneck!  I didn't cast blame, as I know how one thing can inadvertently impact another.

As this had been going on for several days already, I quickly took control of the dozens of people on the bridge, including VPs and SVPs from across the company, and the vendor, who seemed to be out of options to recommend.  We were still working on setting up fiber taps, but that had not been fully implemented, and the taps might have been in the wrong location anyway (it later turned out I was correct).  We were not collecting NetFlow data remotely yet, so I suggested on the bridge, "... Let's turn on NetFlow..."  Crickets...  The vendor then chimed in, "... Sure... that can be done!"  I think they were desperate for another option.

The backbone individual on the call was reluctant.  He wasn't sure he was allowed to implement NetFlow, nor what impact it would have on his backbone devices.  I assured him that NetFlow would stay internal to the device, would have little to no impact (no worse than what was already happening), that we could read the buffers after the event, and that we could set the collection sample rate to its lowest setting.  He contacted his management, who eventually gave the go-ahead.  I asked who could write the config for this...  Crickets...  "... O.K., tell you what, I've had some experience with NetFlow in the past.  I'll write it, but I want the vendor to review it to be sure it's fully kosher, and I'm not authorized to make changes to backbone devices (as I'm with datacenter), so someone from backbone will need to install the config."  I got agreement from both parties on that.

I sat down and researched the NetFlow config for these boxes.  I set the sample rate as low as it could go, one packet in 65,536, and designed an elegant solution, including methods to monitor and extract the data as it came in.  The buffers on the device were not large, so the data had to be pulled immediately!  The backbone individual had tools to monitor general activity on the device, so when he spotted high volumes, he'd immediately report it and I'd pull the capture.  The vendor looked over the config I had written, did not make any changes, and reported back a few hours later that it "seemed" safe.
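
For what it's worth, the "pull the buffers before they roll over" step is easy to script.  Here's a minimal Python sketch of the kind of loop I mean - the hostname and the show command are placeholders (not the actual devices or platform from this incident), and it assumes you already have SSH access in place:

#!/usr/bin/env python3
"""Repeatedly pull the small on-box NetFlow cache over SSH and save each
snapshot with a timestamp, so nothing is lost when the buffer rolls over.
The hostname and show command below are placeholders, not the real ones."""
import subprocess
import time
from datetime import datetime

ROUTER = "backbone-rtr-1.example.net"              # placeholder hostname
PULL_COMMAND = "show flow monitor FLOW-MON cache"  # placeholder - use your platform's command
INTERVAL_SECONDS = 10                              # pull faster than the cache ages out

def pull_cache() -> str:
    """Run the show command on the router via SSH and return its output."""
    result = subprocess.run(
        ["ssh", ROUTER, PULL_COMMAND],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout

while True:
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    try:
        output = pull_cache()
    except subprocess.TimeoutExpired:
        output = ""
    if output.strip():
        with open(f"netflow-cache-{stamp}.txt", "w") as fh:
            fh.write(output)
    time.sleep(INTERVAL_SECONDS)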

O.K., now we just had to wait for the next event...  We waited, and we waited for the next event...  The bridge was tense with anticipation...  I can only imagine the person from the backbone group was nervous as he was the "lookout" and every second counted.  We couldn't move from our spots, not even for nature breaks.  No one could talk, the bridge was near silent, we all waited for the signal...  At this time, I contemplated how Paul Revere felt as he waited for lanterns that night in the Christ Church tower...

Over two hours later, the backbone group reported an event on their routers!  I had prepared a set of commands to pull data from the suspect devices...  Success!!!  I captured tens of thousands of records in a few seconds.  The hammer had fallen, we were prepared, and our 27 hours of troubleshooting (and counting) were nearing an end!

I pulled the records to analyze while the backbone individual checked his logs for increased volume on interfaces.  He found the culprit!  It was a set of relatively recently turned-up interfaces for a new logical datacenter within the facility.  It was so new that I didn't even have access yet, as it had only local credentials - it had not been added to our ACS security systems yet.  I had to get someone from our design and implementation team to log into that environment and track down where they saw activity.  They found it quickly: nothing was supposed to be on it, so any activity, however slight, could be traced easily.

He tracked it down to some new VMs that had already been turned up by the systems team.  I was informed they had been instructed not to do this yet, as there was no security for systems out there...  He configured some Access Control Lists (ACLs) to protect the VMs, and the activity immediately ceased!  We'd come to a resolution of the incident.  But not the why.

I was looking at the logs and noted three odd things.

My capture rate was as low as it could be, but as best I could tell, nearly all of the packets were invalid.  That is to say, their source and destination IPs were not in our routing table.  In fact, nearly all of the source IPs were allocated to China, and nearly all of the destination IPs were allocated to Africa.  When a destination IP isn't in the routing table, the packet follows what's known as the default route.  Oddly, the two backbone peering routers did not have identical routing tables, because they carried single-homed services, meaning some IPs would only go through one router and not the other.

What this means is that if a destination could not be found on one router, the packet would follow the default route over to its redundant partner and, in theory, find its way to that single-homed service (IP) on the other router.

Well, that works fine - as long as your systems haven't been hijacked like this.
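
To make that "not in our routing table" check concrete, here's a rough Python sketch of the kind of post-processing involved.  The prefixes and flow tuples are made-up placeholders, not data from the incident - the point is simply to flag any flow whose source or destination isn't covered by a prefix you actually route, since those are the packets that will chase the default route.

#!/usr/bin/env python3
"""Flag sampled flows whose source or destination is not covered by any
prefix we actually route.  Prefixes and flows below are placeholders."""
import ipaddress

ROUTED_PREFIXES = [ipaddress.ip_network(p) for p in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "203.0.113.0/24",
)]

SAMPLED_FLOWS = [                        # (source IP, destination IP, sampled packets)
    ("10.1.2.3", "203.0.113.7", 12),     # both ends routed - normal traffic
    ("198.51.100.9", "192.0.2.44", 840), # neither end routed - follows the default route
]

def is_routed(addr: str) -> bool:
    """True if the address falls inside any prefix in our routing table."""
    ip = ipaddress.ip_address(addr)
    return any(ip in prefix for prefix in ROUTED_PREFIXES)

for src, dst, pkts in SAMPLED_FLOWS:
    if not (is_routed(src) and is_routed(dst)):
        print(f"suspect flow: {src} -> {dst} ({pkts} sampled packets)")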

So, what happened was that the packets would bounce back and forth between the redundant pair of routers over their fiber connections until the Time To Live (TTL) ran out.  A packet's TTL is set by the sender - commonly 64 or 128 on modern hosts (older systems used values like 30 or 60), and it can be as high as 255.  The field originally had two functions: it could be a timer in seconds, a hop counter, or both, with whichever expired first taking effect; in practice, every router that handles the packet decrements it by one.  A packet must reach its destination IP before the TTL hits zero.  When it doesn't, the router that drops the packet attempts to send a response back to the source IP to say the destination could not be reached.  This, incidentally, is the mechanism traceroute is built on: it sends probes with the TTL set to one and increments it by one with each round.
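
A toy simulation makes the damage easy to see.  The sketch below is pure illustration (nothing from the actual incident): it models a packet whose destination matches nothing but the default route, so the two routers hand it back and forth, decrementing the TTL each time, until one of them finally drops it and attempts the "couldn't deliver" reply.

#!/usr/bin/env python3
"""Toy model of the failure: a packet whose destination matches only the
default route gets handed back and forth between the two routers,
burning one TTL per hop, until one of them finally drops it."""

def handles_until_drop(initial_ttl: int) -> int:
    """Count how many times a router handles one doomed packet."""
    ttl = initial_ttl
    handled = 0
    while ttl > 0:
        handled += 1    # a router receives the packet...
        ttl -= 1        # ...decrements the TTL...
        # ...and, if the TTL is still above zero, hands it to its redundant
        # partner via the default route; at zero it drops it instead and
        # tries to send a "time exceeded" reply toward the (spoofed) source.
    return handled

for ttl in (30, 64, 128, 255):
    print(f"TTL {ttl:3d}: handled {handles_until_drop(ttl)} times, plus 1 attempted reply")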

So, what was happening was that there was a very low, steady trickle of data coming from the VMs, but it was so slight as to be invisible.  And at the time, we didn't know what we were looking for, so we didn't think to check for it.  But when the "events" occurred, there was a bombardment of billions of packets in a very short period of time.  You may ask, aren't billions of packets noticeable?  Not on large-scale backbone devices - at least not normally, when the packets can be forwarded in the data plane instead of being analyzed by the CPU and shipped to the redundant device over the control plane.  The data plane is handled in special chips called Application-Specific Integrated Circuits (ASICs), which are thousands of times more efficient than punting packets to the CPU and the control plane.
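
The "billions" figure is easy to sanity-check from the sampled data.  Here's a back-of-the-envelope sketch using round placeholder numbers rather than the real capture: at a one-in-65,536 sampling rate, tens of thousands of sampled records imply billions of actual packets.

#!/usr/bin/env python3
"""Back-of-the-envelope scaling of sampled NetFlow counts to actual packets.
The sampled count and burst duration are illustrative, not the real capture."""

SAMPLING_INTERVAL = 65_536    # one sampled packet per 65,536 forwarded
sampled_packets = 40_000      # roughly "tens of thousands" of sampled records
burst_seconds = 300           # the events lasted roughly five to ten minutes

estimated_packets = sampled_packets * SAMPLING_INTERVAL
estimated_pps = estimated_packets / burst_seconds

print(f"Estimated packets in the burst:  {estimated_packets:,}")
print(f"Estimated packets per second:    {estimated_pps:,.0f}")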

Our CPU and control plane were getting pummeled by these packets at a tremendous rate: they bounced back and forth trying to reach their destinations (remember the default routes pointing at one another?), and then the effort doubled as the routers tried to get back to the sources to report that the packets couldn't be delivered once the TTL expired.  That caused necessary "network" packets on these devices to be delayed or dropped, which in turn caused the boxes to delay or drop all traffic passing north and south through them.

There are bandwidth and throughput measures on devices and interfaces that can be analyzed and diagnosed, but there are few ways to monitor the overall Packets Per Second (PPS) handled by a device.  That is exactly what happened here, and why it was so incredibly difficult to track down.  Billions of packets are nothing to the forwarding ASICs, but when those packets bounce back and forth via the control plane, hitting the CPU at 30x, 60x, or more times the original rate (times two again for the return traffic), you start to see slowdowns in the traffic passing through the device.  That is why this is so dangerous.
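
Put the two effects together and the control-plane load explodes.  A minimal arithmetic sketch, again with assumed placeholder numbers: every malicious packet gets punted to the CPU roughly once per bounce (about its TTL), and the attempted replies toward the spoofed sources roughly double that.

#!/usr/bin/env python3
"""Rough amplification math - the input rate here is an assumed placeholder,
not a measured value from the incident."""

malicious_pps = 100_000   # packets/sec arriving from the compromised VMs (assumed)
ttl_bounces = 60          # each packet is handled roughly TTL times before it's dropped
reply_factor = 2          # attempted replies toward the spoofed sources bounce the same way

control_plane_pps = malicious_pps * ttl_bounces * reply_factor
print(f"Effective control-plane load: {control_plane_pps:,} packets/sec")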

Mitigation has been put in place since this event, and I've not heard of another like it.  The event was handled over a 27-hour period (we were actually allowed a three-hour break, but were then called back earlier than scheduled to continue working it).

My Turkey Day weekend ended with another two-hour incident that was nothing major - just a fiber interface buried down on a Nexus with a failing optic.  We replaced it, and it worked fine.

On Monday, during work hours, our security team contacted me and asked for the log.  I was happy to give it to them so they could do a deeper dive and maybe track down the culprits - or at least the apparent culprits, going by the IPs.

NetFlow saved the day!