Using Network Recon to Solve a Problem

by Aesun

Disclaimer: I am not a computer networking expert; I deal with networks day in and day out from a UNIX administrator's perspective and have maintained simple networks.  The hostnames and Internet Protocol (IP) addresses have been changed to protect the (more or less) innocent - namely myself.  I also want to note that I have no idea how this worked, only that it did work, although I have a few guesses as to why.

Recently, I managed to create my own networking problem out of sheer stupidity (which is how I usually manage to create technical problems for myself).  I accidentally left two Linux systems configured with identical shared IP addresses on the same Virtual Local Area Network (VLAN).  The IP addresses were being used for a software system that was not online yet; no harm, no foul.  Regardless, the systems did need to be functional for, well, functional testing.  This article details the problem I created, the rather strange way I fixed it, and, of course, the possible repercussions of what I discovered.

My project was simple: install and configure one GNU/Linux server with a collection of shared IP addresses for an application.  Then, once the first server was set up and functional, build, install, and configure a warm backup GNU/Linux system which would fire up the shared IP addresses if the primary server went offline.  The failover script worked perfectly during tests; it would even detect when the primary server came back online and drop the shared IP addresses.  During testing I came across a problem: when the secondary server came online, the addresses appeared to be fine, but the application could not find the new location.  Little did I know, this symptom was indicative of a much larger problem.  After troubleshooting for a few hours, I left the problem to work on a production issue and accidentally left the shared IP addresses configured on both systems.  Since they were not being tested at the time, I didn't really think it would matter.
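
For flavor, the failover logic amounted to something like the sketch below.  This is a reconstruction rather than the original script, and the interface name, addresses, and timings are all stand-ins:

  #!/bin/sh
  # Warm-failover sketch: if the primary stops answering pings,
  # bring up the shared addresses; when it returns, drop them.
  PRIMARY=192.168.0.10                   # primary server (example value)
  IFACE=eth0                             # interface holding the shared range
  SHARED="192.168.0.100 192.168.0.101"   # abbreviated example list

  while true; do
      if ping -c 3 -W 2 "$PRIMARY" > /dev/null 2>&1; then
          # Primary is alive - make sure we hold none of its addresses.
          for ip in $SHARED; do
              ip addr del "$ip/24" dev "$IFACE" 2> /dev/null
          done
      else
          # Primary is down - fire up the shared addresses here.
          for ip in $SHARED; do
              ip addr add "$ip/24" dev "$IFACE" 2> /dev/null
          done
      fi
      sleep 10
  done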

To illustrate the configuration, here is an example:

      ----------------------------------
      | Everything Else                |
      ----------------------------------
                        |
  -----------------------------------------------
  | GSS/CSS Device: 192.168.0.1 and 192.168.1.1 |
  -----------------------------------------------
            |                          |
------------------------- ---------------------------
| Primary: 192.168.0.10 | | Secondary: 192.168.0.11 |
------------------------- ---------------------------
|                       | |                         |
| Shared Range:         | | Shared Range:           |
| 192.168.0.100-110     | | 192.168.0.100-110       |
------------------------- ---------------------------

A few days went by as other projects took priority.  The users testing the new system held off for a while and then, lo and behold, I got a call.  The application was working with some of the IP addresses but not others.  I began ping tests, and all of the addresses were answering.  I informed the users that I was not sure what was wrong and would get back to them when I had solved the problem.  I ran the application myself and noticed that, while it failed for some of the addresses and not others, every address was still pingable.  I logged into the secondary system and fired up the application server.  Suddenly, the application started working.  It was at this point that I realized I had come across something odd and decided to start doing some network recon to see if my guess was right.

I logged into both servers using secure shell, fired up a tcpdump session targeting the application port on each one, and started pinging the IP addresses and port that the application was using from a third system.  I discovered that some packets were landing on the primary server, while others were landing on the secondary server.  I also noted that the replies from the servers were going to the same device, but when I did a domain lookup on that device, it had addresses on two different networks; one network was the same one the servers were on, while the other was a locally managed network.  I deduced (correctly) that the device was either a global or content switch.  While I thought the findings were interesting, I realized my users needed their test systems back to get their work done, so I decided to knock the shared IP addresses offline on the secondary system.  This is when the trouble started.
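
The recon setup was nothing exotic.  Assuming eth0 and an application port of 8080 (both stand-ins for the real values), it looked roughly like this:

  # On each server: watch traffic hitting the application port.
  tcpdump -n -i eth0 port 8080

  # From a third system: ping each shared address...
  ping -c 3 192.168.0.100

  # ...and poke the application port itself (netcat is one option).
  nc -z -w 2 192.168.0.100 8080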

The secondary server's shared IP addresses were offline, yet some of the IP addresses still would not work with the primary server.  My first instinct was the Address Resolution Protocol (ARP) cache; I had seen cases in the past where a stale host ARP cache caused routing problems.  The easiest cure, of course, was to clear the ARP cache on both servers.  No dice.  I then resorted to a tactic I never like to use - I rebooted both servers.  Still no dice.
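
For reference, inspecting and clearing the cache on a Linux host goes something like this (eth0 and the addresses are stand-ins):

  # Inspect the current ARP (neighbor) cache.
  ip neigh show

  # Flush every entry learned on the interface.
  ip neigh flush dev eth0

  # Or delete a single entry with the older arp tool.
  arp -d 192.168.0.1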

Again, it was time to start researching the problem to see what was happening.

I was a little out of my territory, as I had never worked in a GSS and/or CSS switched environment.  Once again, I logged in to both servers using secure shell and fired up tcpdump, but this time filtered out the secure shell traffic.  Once again, packets were split and landing on the same systems they had before.  It was at this point I realized the problem was not with the hosts or any clients.  It was definitely a network issue.  I didn't have time to track down the overworked network administrator, so I began to think up ways to solve the problem on my own.  I needed more data.
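
Filtering out one's own secure shell session is a one-liner; something like this, with eth0 again standing in for the real interface:

  # Capture everything except the SSH traffic carrying this session.
  tcpdump -n -i eth0 not port 22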

I restarted my packet sniffers in full verbose mode and noted that the packets going to the secondary server also carried its hardware - Media Access Control (MAC) - address in the headers.  I now had a working theory as to what was wrong: the switch had the wrong hardware address in its tables for those IP addresses.  Note that the incorrect path to the secondary server stayed stuck for well over an hour after the reboots.
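
Getting tcpdump to show the Ethernet headers - and with them the MAC addresses - takes the -e flag; roughly:

  # -e prints the link-level (Ethernet) header, including MAC addresses;
  # -vv raises verbosity.  eth0 and port 8080 are example values.
  tcpdump -n -e -vv -i eth0 port 8080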

I recalled from my addled brain that switches often maintain a table of IP address to hardware address mappings.  Under normal circumstances, if the hardware address changed then the switch would simply update the tables and move on.  For some reason, that had not happened in the case of this particular device.
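
The usual refresh mechanism is gratuitous ARP: when an address moves, its new owner broadcasts an unsolicited ARP announcing the new IP-to-MAC mapping, and listening devices update their tables.  On Linux, the arping tool from iputils can send one by hand (the interface and address below are examples):

  # Broadcast an unsolicited (gratuitous) ARP for a shared address,
  # asking neighbors to refresh their IP-to-MAC mappings.
  arping -U -I eth0 -c 3 192.168.0.100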

I knew what was wrong, but how to fix it?  It was a tough problem because the main interface IP address was, in fact, different on each server (which I think is part of why the problem existed in the first place).

I then remembered that, often, when aggressive network traffic fires from a particular host into (or across) a switch or router, it causes the device to run a quick check of what it knows about the host talking to it.  And what quick and easy tool might I have made sure was installed on a system with heavy network use in a network environment I was unfamiliar with?

Nmap, of course.

Using Nmap, I fired off a fingerprinting scan from the primary server, spoofing the source with the shared IP address instead of letting it default to the actual interface address, and, voilà, problem solved.
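
The command would have looked something like this; the interface, the shared address, and the target are all stand-ins:

  # OS fingerprinting scan (-O) with a spoofed source address (-S).
  # -e names the interface to send on; -Pn skips host discovery,
  # which spoofing generally requires.  All values here are examples.
  nmap -e eth0 -S 192.168.0.100 -Pn -O 192.168.0.1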

Which immediately made me wonder:

If genuine duplicate IP addresses could scramble the mappings, then what possibilities would that open up?  What if, using one tool that could change the hardware address on an interface and another that could spoof an IP address, someone persistently hit a GSS or CSS device ... ?
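
Purely hypothetically - the addresses and interface below are made up, and I make no claim this works - the ingredients would be something like:

  # Untested, hypothetical sketch - not a verified attack.
  # Swap the interface's hardware address (both values invented)...
  ip link set dev eth0 down
  ip link set dev eth0 address 02:00:00:aa:bb:cc
  ip link set dev eth0 up

  # ...then persistently hit the GSS/CSS device with spoofed-source probes.
  nmap -e eth0 -S 192.168.0.100 -Pn -O 192.168.0.1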

Unfortunately, to date, I have not had a chance to try any experiments.  I did hit up a friend of mine who is a Cisco specialist and, even though he had never used GSS and/or CSS, he agreed that not only was IP/MAC spoofing a possible issue, but ARP spoofing was as well.

The nature of what happened is telling with regard to Cisco's Content Switching Module (CSM) software.  I did a little research and found a rather long document detailing bugs in GSS switches relating to MAC addresses and the CSM software.  Several of the bugs could have been related to the behavior I witnessed.

I learned two invaluable lessons:

  1. Tools such as packet sniffers and aggressive scanners do have their place in the troubleshooting realm.  Although I had used both of them for diagnostic purposes before, in this instance I actually used them to fix a problem.
  2. Even though network systems have improved greatly over the last several decades, they can still do incredibly stupid things.

Thanks for reading and keep hacking.
