
When Implementations Fail: Firewall Edition

We’ve been running the company on a cluster of ASA 5520s for quite some time, and it was finally time to upgrade and clean up the mess that multiple rounds of management had made in those firewalls. Additionally, we had to move the secondary unit, and all the WAN equipment attached to it, to a new building on a new WAN circuit.

We went big and decided to go with ASA 5585s. These are far too large for our environment, but this is the same company that has a pair of Nexus 7018s for <1000 users. Go figure. Either way, since we had a large code gap to deal with during the migration, and a whole bunch of ancillary infrastructure components to upgrade or modify in preparation for the event, management decided that we would tackle the infrastructure design in house and hire a consultant specifically for the firewall code migration and replacement.

We managed to get almost everything in place and configured before our change window, and we were all nervous but confident (a healthy state pre-implementation, in my opinion) that this would actually work. This was a very visible project that required company-wide downtime, and a lot of people were watching. I was anxious because I can count the number of critical infrastructure projects I’ve been in a lead position on with fewer than three hands. However, we had things under control and documented, and everything was going to be OK. Right?

Wrong.

Sometimes it’s not the big mistakes that fail a project. In fact, the big mistakes are usually something glaring and obvious that can be fixed on the fly (if you have a good dynamic team that can react and act quickly). It’s the small things, the issues that pop up that nobody at my pay grade would ever be expected to foresee, the things that can only be tested in production, that actually introduce the most risk to a network implementation project.

After starting our change window at 6PM Saturday night, and spending the first few hours powering down the secondary site, verifying configs, and ironing out some inevitable last-minute questions, we were ready to actually power up the new equipment at roughly 10PM (I swear time moves faster during change windows; for the life of me, I don’t know where those four hours went). After the primary firewall was switched on and physical connectivity was verified, we couldn’t ping the internal trusted interface of the firewall. There was no reason for this. Everything was checked, double-checked, triple-checked. We had a veteran consultant looking at the problem, and he was stumped. It was a doozy. Cisco TAC wasn’t called because it would have taken too long for them to get back to us, and we were nearing the end of our window. We had everything in place and ready to go, and we were being stopped by a pesky Layer 2 issue that a room full of network engineers couldn’t resolve. (This is still unresolved; we’re currently blaming a possible ASA bug with ARP, but we never confirmed it.)
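For anyone walking into a similar problem, the basic sanity checks we were running boil down to a handful of commands on the ASA and the adjacent switch. The interface names and VLAN below are placeholders, and this is only a sketch of the kind of verification involved, not a transcript of our actual session.

    ! On the ASA: confirm the inside interface is up/up and note its MAC address
    show interface GigabitEthernet0/1
    show interface ip brief
    ! Check whether the ASA is learning ARP entries for inside hosts
    show arp
    ! On the adjacent switch: confirm the ASA's MAC shows up on the expected port and VLAN
    show mac address-table interface GigabitEthernet1/0/1
    show spanning-tree vlan 10

If everything looks clean at Layer 1 and Layer 2 and the firewall still won’t answer pings on its inside interface, you start pointing fingers at the firewall itself, which is how we ended up suspecting an ARP bug.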

Finally, at 2AM, we decided it was time to fall back to plan B. Plan B was to complete the objective of moving the secondary site to a new building with a new Internet circuit, but to install and keep using the old 5520s. I was already bummed that the main crux of the project had failed, and I was tired and ready to go home. All I had to do was turn the firewalls back on, plug in the new Internet circuit (the router was already tested and configured), and call it a night.

But the implementation gods were not finished with me.

That Internet circuit I had already tested? It wasn’t working that night. Nothing. No communication from the ISP equipment. So our secondary site was down, and we really didn’t want to go back to management with a result that was even worse than when we started. So I called AT&T support at 3AM. Calling ISP support is bad enough when I’m alert and it’s daytime. This was miserable. They finally said that the turn-up for the circuit hadn’t been completed yet, that the interface on the ISP equipment was administratively down, and that only the implementation engineer could fix that. Thank you and good night.

So fine, we don’t have the secondary internet circuit. We can at least have the secondary firewall up for internal redundancy, right?

Wrong again.

As soon as I turned the secondary firewall on (we’re at roughly 4AM now), it decided it was the active unit in the failover pair. This shouldn’t happen if heartbeats are traversing the network as designed. And since the Internet at this site was down, lots of things broke when this failover happened. To make matters even more fun, since the heartbeat was somehow not working, the primary unit also thought it was active, confusing ARP caches all over the place and generally wreaking havoc.

To cut this long story just a little bit shorter, after some investigation we discovered that our firewalls had been running split-brain for a very long time, and we had just been lucky that no failure serious enough to warrant a WAN failover had ever hit us. So now we have to fix that too.
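For context, ASA active/standby failover relies on hello packets exchanged over a dedicated failover link (and over the monitored data interfaces); if those hellos stop getting through, each unit declares itself active and you get exactly the split-brain mess described above. Below is a minimal sketch of the failover configuration on the primary unit; the interface, addresses, and key are placeholders rather than our actual config.

    ! Primary unit (the secondary uses "failover lan unit secondary")
    failover lan unit primary
    failover lan interface FOLINK GigabitEthernet0/3
    failover interface ip FOLINK 192.168.255.1 255.255.255.252 standby 192.168.255.2
    failover key <shared-key>
    failover
    ! Verify on both units: one should report Active, the other Standby Ready
    show failover

If show failover on both boxes reports Active at the same time, the hellos aren’t making it across, and that is the condition we had apparently been living with for a long time.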

It’s 6AM at this point, and we all went home to get some sleep.

After spending a few hours on Sunday trying to get the heartbeats working, we gave up. The secondary site is down, the project has failed, and we’ll have to try again at a future date.

So, lessons learned?

  • Leave yourself lots of time for these things. It doesn’t matter how much you’ve prepared and documented. And start early: ask to shift the maintenance window to an earlier hour up front instead of waiting until the end and asking for a longer window after you’ve already used your time. Also, nobody works at their best at 3AM. It’s not a great time to be troubleshooting an infrastructure problem.
  • Leave nothing (or close to nothing) for the change window other than the actual cutover. If you’re still making changes during the window, you haven’t planned well enough. I underestimated the number of configuration changes we would be making to the surrounding network, and we eventually ran out of time and were making changes during the change window, eating into our time and possibly jeopardizing the project.
  • Double- and triple-check every physical connection. While verifying the required switchports for the new firewalls, I had the web filter attached to a switch instead of directly to the firewall. It’s a small mistake, but it was a change from what we had told the consultant, and it further muddied the waters during the window. This was something that could have been avoided easily.

That all said, the most important lesson I could take away from this experience is that failures happen. Sometimes you can control the catalyst for failure, and sometimes it’s unexpected, but it will happen, so don’t be too upset about it. This field is all about learning from mistakes and trying our best to apply the lessons learned from failure to future endeavors. As long as management understands the basic nature and inevitability of failure, and as long as the mess from the failure is cleaned up (nothing should be left down), failure is healthy and part of growth. We’ll get it next time.
