When Implementations Fail: Firewall Edition

We’ve been running the company on a cluster of ASA 5520s for quite some time, and it was finally time to upgrade and clean up the mess that multiple rounds of management had made in those firewalls. Additionally, we had to move the secondary unit, and all the WAN equipment attached to it, to a new building on a new WAN circuit.

We went big and decided to go with ASA 5585s. These are far too large for our environment, but this is the same company that has a pair of Nexus 7018s for <1000 users. Go figure. Either way, since we had a large code gap to deal with during the migration, and a whole bunch of ancillary infrastructure components to upgrade or modify in preparation for the event, management decided that we would tackle the infrastructure design in house and hire a consultant specifically for the firewall code migration and replacement.

We managed to get almost everything in place and configured before our change window, and we were all nervous but confident (a healthy state pre-implementation, in my opinion) that this would actually work. This was a very visible project that required company-wide downtime, and a lot of people were watching. I was anxious because I can count the number of critical infrastructure projects I’ve been in a lead position on with less than three hands. However, we had things under control and documented, and everything was going to be OK. Right?

Wrong.

Sometimes it’s not the big mistakes that fail a project. In fact, the big mistakes are usually something glaring and obvious that can be fixed on the fly (if you have a good dynamic team that can react and act quickly). It’s the small things, the issues that pop up that nobody at my pay grade would ever be expected to foresee, the things that can only be tested in production, that actually introduce the most risk to a network implementation project.

After starting our change window at 6PM Saturday night, and spending the first few hours powering down the secondary site, verifying configs, and ironing out some inevitable last-minute questions, we were ready to actually power up the new equipment at roughly 10PM (I swear time moves faster during change windows; for the life of me, I don’t know where those four hours went). After the primary firewall was switched on and physical connectivity was verified, we couldn’t ping the internal trusted interface of the firewall. There was no reason for this. Everything was checked, double-checked, triple-checked. We had a veteran consultant looking at the problem, and he was stumped. It was a doozy. Cisco TAC wasn’t called because it would have taken too long for them to get back to us, and we were nearing the end of our window. We had everything in place and ready to go, and we were being stopped by a pesky Layer 2 issue that a room of network engineers couldn’t resolve. (This is still unresolved; we currently blame a possible ASA bug involving ARP, but we never confirmed it.)

Finally, at 2AM, we decided it was time to fall back to plan B. Plan B was to complete the objective of moving the secondary site to a new building with a new Internet circuit, but to install and keep using the old 5520s. I was already bummed that the main crux of the project had failed, and I was tired and ready to go home. All I had to do was turn the firewalls back on, plug in the new Internet circuit (the router was already tested and configured) and call it a night.

But the implementation gods were not finished with me.

That Internet circuit I had already tested? It wasn’t working that night. Nothing. No communication from the ISP equipment. So our secondary site was down, and we really didn’t want to go back to management with a result that was even worse than when we started. So I called AT&T support at 3AM. Calling ISP support is bad enough when I’m alert and it’s daytime. This was miserable. They finally said that the turn-up for the circuit hadn’t been completed yet, and that the interface on the ISP equipment was administratively down and only the implementation engineer could fix that. Thank you and good night.

So fine, we don’t have the secondary internet circuit. We can at least have the secondary firewall up for internal redundancy, right?

Wrong again.

As soon as I turned the secondary firewall on (we’re at roughly 4AM now), it decided it was the active unit in the failover pair. This shouldn’t happen if heartbeats are traversing the network as designed. And since the Internet at this site was down, lots of things broke when this failover happened. To make matters even more fun, since the heartbeat was somehow not working, the primary unit also still thought it was active, confusing ARP caches all over the place and generally wreaking havoc.

To cut this long story just a little bit shorter, after some investigation we discovered that our firewalls had been running split-brain for a very long time, and we’d just been lucky that we hadn’t had any serious failures to warrant a WAN failover. So now we have to fix that too.

It’s 6AM at this point, and we all went home to get some sleep.

After spending a few hours on Sunday trying to get the heartbeats working, we gave up. The secondary site is down, the project has failed, and we’ll have to try again at a future date.

So, lessons learned?

  • Leave yourself lots of time for these things. It doesn’t matter how much you’ve prepared and documented. And start early: ask to extend the maintenance window to an earlier hour instead of waiting until the end and asking for a longer window after you’ve already used your time. Also, nobody works at their best at 3AM. It’s not a great time for troubleshooting an infrastructure problem.
  • Leave nothing (or close to nothing) for the change window other than the actual cutover. If you’re still making changes during the window, you haven’t planned well enough. I underestimated the number of configuration changes we would be making to the surrounding network; we ran out of prep time and ended up making changes during the change window, eating into our time and possibly jeopardizing the project.
  • Double- and triple-check every physical connection. While verifying the required switchports for the new firewalls, I found the web filter attached to a switch instead of directly attached to the firewall. It’s a small mistake, but it was a change from what we had told the consultant and further muddied the waters during the window. This could easily have been avoided.

That all said, the most important lesson I could take away from this experience is that failures happen. Sometimes you can control the catalyst for failure, sometimes it’s unexpected, but it will happen, and you shouldn’t be too upset about it. This field is all about learning from mistakes and trying our best to apply the lessons learned from failure in future endeavors. As long as management understands the basic nature and inevitability of failure, and as long as the mess failure leaves behind is cleaned up (nothing should stay down), failure is healthy and a part of growth. We’ll get it next time.

Virtualizing the Sysadmin

Virtualization is a new word for an old concept. Any time we present a resource to an end-user that is simply symbolic of different physical resources “behind the curtain”, we are virtualizing. When we present a graphical interface, we are virtualizing the internal hardware that makes the interface work. Similarly, virtualization is also combining disparate resources into one logical entity (or separating one physical entity into multiple logical parts). We’ve seen this in the server world with the advent and popularization of the hypervisor layer; we’ve seen it in the networking world with many logical network resources running on fewer physical ones; and we’ve seen it in the storage world, where we present countless logical blocks of storage across a finite number of disks. Parallel to this fantastic trend of resource utilization efficiency is the progression of human virtualization.

There are different roles that a sysadmin can fill within an IT organization. In general, these roles have been heavily compartmentalized, and it hasn’t been simple to transfer skills from one to another. They can be broadly defined as Network Services, Server Services, and Storage Services. While there can be overlap between the three (after all, a given device will use all three services), the skills required to manage each role are very different, and a sysadmin would typically take a deep dive into one and never look back.

As more and more IT services become virtualized and further abstracted from their physical resources, we’re beginning to see the same phenomenon in the job of a sysadmin. Virtualization by definition means that traditional Network/Server/Storage boundaries are blurred, and the skills required to manage a converged IT infrastructure are not in line with the traditional roles the industry is used to. We’re starting to see platforms and frameworks that manage the data center as a whole instead of simply the sum of its parts. Sysadmins who at one time were able to define themselves as Storage Experts will not have the skillsets required to manage the converged datacenter. The future sysadmin will be someone who understands everything in the datacenter and how all the pieces interconnect. The future sysadmin will be a virtualization of three different sysadmins of past decades. The data center will be a unified logical entity representing various disparate physical interconnections that were previously managed by completely different skillsets.

I came into this field right as this movement was really getting started. I never felt like I needed to be boxed into one skillset, because right from the outset I was dealing with technologies that required at least a basic understanding of multiple traditional roles (Cisco UCS, Cisco Nexus 5548UP). There is some worry that this kind of “human virtualization” will make a number of sysadmins obsolete. This is accurate, and has always been the case in this business. Systems administration is in many ways a meritocracy, and those who stay ahead of the curve will be on top. In truth, it’s an exciting time to be a sysadmin, as we’re on the cusp of a large paradigm shift redefining the role altogether. There will be growing pains, but it’s ultimately a good thing, and a unified sysadmin role is just as valuable as a converged datacenter.

Exchange ActiveSync and iOS 7 (déjà vu)

I should have seen this one coming. When iOS 6 was released, we had issues with IIS resource hogging, and it looks like Apple is at it again.

We’ve had intermittent outages for ActiveSync email clients, and it turns out iOS 7 may be the culprit. This is a Microsoft KB talking about issues with iOS 7 devices having trouble syncing with Exchange.

Basically, ActiveSync on the device and IIS on the server aren’t playing nicely and are trying to access files at the same time. This causes server exceptions to be thrown back at the device. In response, iOS 7 continues to try to access the file, sometimes thousands of times before giving up.

On the server, the event log entry to look for is 4999 (MSExchange Common). We’ve got lots of these now.
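
If you want a quick count of these from PowerShell, something like this works (a sketch using Get-WinEvent; adjust the log and provider names if your event source differs):

# Count 4999 events logged by MSExchange Common in the Application log
(Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName='MSExchange Common'; Id=4999}).Count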

A different KB, talking about general issues with w3wp.exe having high CPU load, wants me to parse my IIS log files for mobile devices that are logging large numbers of RPC calls due to issues with contact sync. I downloaded the Log Parser tool (a useful little command-line tool that comes in handy when trying to make sense of huge log files), and ran it on my IIS logs for today. It came back with two abusive devices.

RPCCount of 123,509. That’s a lot, and definitely a sign of issues. Also, notice the device IDs of both phones showing iPhone 5C: iOS 7 devices.
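
If you don’t have Log Parser handy, a rough PowerShell equivalent can pull the top offenders straight out of an IIS log (a sketch; the path and date in the filename are examples, and it assumes default W3C logging where the ActiveSync DeviceId shows up in the query string):

# Tally ActiveSync requests per DeviceId and show the worst offenders
$log = 'C:\inetpub\logs\LogFiles\W3SVC1\u_ex131001.log'
Get-Content $log | ForEach-Object {
    if ($_ -match 'DeviceId=([^&\s]+)') { $Matches[1] }
} | Group-Object | Sort-Object Count -Descending |
    Select-Object -First 5 Count, Name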

What the first KB doesn’t mention is that all this redundant activity will hammer the IIS service (w3wp.exe), causing it to consume memory and CPU and to fill up the disk with huge IIS log files.

That’s 2GB worth of logs in 10 hours. At baseline, my environment produces ~200MB per day. Something is definitely flooding IIS with requests.

This will certainly cause issues, but apparently only with mobile devices. I’m not sure why Outlook Anywhere and OWA aren’t affected as well, but that could be because they aren’t as dependent on the w3wp service (?). Currently, the MS-recommended fix is to apply RU2 to our Exchange boxes (2010 SP3). Apple has yet to offer a fix for their implementation of ActiveSync in iOS 7.

Something I noticed in Powershell v3

So, let’s say you have an object that contains objects. A bunch of AD users whose first name is Bob, for example.

$persons = get-aduser -filter {givenname -eq "Bob"}

Now you want to display the list of objects, but only one property of those objects, like DisplayName. How would you do that?

In Powershell v3, it’s simple. You can just do this:

$persons.displayname

I’ve been using v3 for so long that I didn’t realize this is relatively new functionality. I wrote a script in ISE v3 that contained something along those lines, and tried to run it on a Server 2008 box. It never worked in that environment.

Finally, I realized that in v2, you had to type a few more characters to get this to work. You can do:

$persons | foreach-object{$_.displayname}

Or

$persons | select displayname

The difference between the two is that the first outputs a plain list of values (strings, in this case), while the second outputs a list of objects that each carry a DisplayName property. The v3 version is more like the first v2 option; it outputs a list of strings instead of objects. In either version, if you want objects as output, you need to pipe the original object to Select-Object or Format-Table.
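
You can see the difference for yourself with GetType() (a quick check, assuming a $persons collection like the one above):

# Both of these give bare strings
($persons | foreach-object {$_.displayname})[0].GetType().Name   # String
($persons.displayname)[0].GetType().Name                         # String (v3+)
# Select-Object gives objects that each carry a DisplayName property
($persons | select displayname)[0].GetType().Name                # PSCustomObject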

Exchange TotalDeletedItemSize and how to tame it when it runs rampant

Exchange 2010 does a lot of activity in the background to ensure that you don’t permanently lose things you didn’t mean to lose. Without going into the actual code of Exchange, suffice it to say that the steps it takes to do this can sometimes cause loops, creating copies of inconsequential messages over and over again until you’ve hit the limit (default is 30GB) of your TotalDeletedItemSize. iOS 6.1 has been known to be a culprit in causing these loops.

Typically, a user will come to me explaining that Outlook is constantly telling him that he can’t get any more mail because his Inbox is full. When a quick perusal of mailbox size shows that his Inbox isn’t remotely close to full, I go to PowerShell and run Get-MailboxStatistics for that mailbox. That’s when I’ll see this:

Not good.

Usually, in the same output, you’ll see DeletedItemCount at something like 300K or some ridiculous number.
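
The relevant numbers are easy to pull out directly (same {name} placeholder convention as the cmdlets below):

Get-MailboxStatistics {name} | fl DisplayName, ItemCount, TotalItemSize, DeletedItemCount, TotalDeletedItemSize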

In Exchange 2010, if you have Single-Item Recovery enabled, you’ll have a hidden folder called Recoverable Items. Within that folder are three folders: Deletions, Purges, and Versions. Deletions is where things go when they are hard-deleted in Outlook; this skips the Outlook Deleted Items folder, but the items can be recovered in Outlook by going to Recover Deleted Items. Purges is where things go after they get deleted from the Deletions folder (until the default retention period expires); this is inaccessible in Outlook, but it is accessible in MFCMAPI. Versions is a calendar feature that saves versions of calendar entries when they get changed. This is where a lot of the buggy loops happen (with iOS 6.1, many times), but I’ve seen it in the other folders as well.

I just had one case (the one in the screenshot above) where the TotalDeletedItemSize was 30GB but there were only a couple thousand messages in the Purges folder. That shouldn’t be even close to 30GB. To investigate this further, I ran the following against the mailbox.

Get-MailboxFolderStatistics {name} | ft name, itemsinfolder, foldersize -autosize

And look what we’ve got here. There are only 2680 items in Purges, but it somehow comes out to ~30GB.

I’ve found the culprit, but now how do I get rid of it?

There are a few cmdlets that come in very handy for any of these situations:

Set-mailbox {name} -CalendarVersionStoreDisabled $true

This takes away the versioning feature that can cause bugs.

Search-mailbox {name} -SearchDumpsterOnly -DeleteContent

This searches only the Recoverable Items (dumpster) folders of the mailbox and deletes what it finds (must have at least SP1 installed).

Start-ManagedFolderAssistant {name}

This overrides the default retention policy schedule and starts the purging of Purges immediately. In my case, this is the one that fit the bill.

I’ve read that a lot of the bugginess in Single-Item recovery is fixed in RUs to SP2. I’ll be upgrading all the way to SP3 sometime this month, so hopefully this pesky issue can be fixed.

Adding a Subnet to a Nexus/UCS/VMware Environment

So, this was a fun exercise. It started when I was trying to build a few Server 2012 lab machines to do some testing on. We realized that we were running out of IP addresses for our server vlans, so I was tasked with creating a new vlan and subnet specifically for testing and labs.

The setup is as follows:

  • Two Cisco Nexus 7000 Core Switches as VTP servers.
  • Cisco UCS Chassis holding ESXi hosts.
  • VMware environment where the servers will live.

Here’s how to add a subnet to this environment (no DHCP required) in this scenario.

  1. On the switches, I simply added a vlan to the vlan database (we’ll use 100 as an example).
    Then I added the SVI to each core with HSRP enabled for failover (a config sketch follows this list).

    On one core, this sets up the SVI to route using the already-established EIGRP topology and makes it a member of an HSRP group. On the other core, I did the same thing, with a different IP and a lower HSRP priority.

    That’s all I needed to do on the switch side. I was able to ping the SVI from elsewhere on the network.

  2. Next I went into vSphere and created a port group corresponding to the vlan I just created. Since we’re using the distributed switch model (apparently we paid a lot of money for that), I only had to do this once and all 12 ESX hosts were able to use the port group.

    I copied the settings from other port groups in the switch just to stay consistent, but it’s pretty much the default settings.

    At this point, I thought I was done. I spun up a server, gave it an address in the 10.1.100.x subnet, but alas, it was not talking. Then I realized that the ESXi boxes are sitting in UCS, and UCS also needs to know which vlan to talk on.

  3. In UCS, it’s not completely intuitive to add a vlan to the NIC. I thought it would be available in the ESX Server profile template we had set up, but it wouldn’t allow me to modify vlans in that view.

    It clearly says that the vNIC in the server profile is bound to a vNIC template elsewhere, and I was able to find that under LAN/Policies/root/vNIC Templates. Opening up the vNIC template allowed me to modify the vlans as well as add a new one.

    Adding a vlan here is simple. You give it a name and an ID. It checks for overlap, and you’re done.

    Once you’ve created the vlan, go back into the Modify vlans screen and make sure the new test vlan is checked. Once it is, you should be good to go in UCS.
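
For reference, here’s roughly what the switch side of step 1 looks like on NX-OS (a sketch with made-up addressing; the HSRP group, priorities, and EIGRP AS number will vary by environment, and the interface-vlan, hsrp, and eigrp features must already be enabled):

vlan 100
  name Lab-Test

interface Vlan100
  no shutdown
  ip address 10.1.100.2/24
  ip router eigrp 100
  hsrp 1
    ip 10.1.100.1
    priority 110

The second core gets the same block with its own address (say, 10.1.100.3/24) and a lower priority, so the pair shares the 10.1.100.1 gateway.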

Recap

Here’s what we did in a nutshell:

  • Add the vlan and SVI on the network
  • Add the port group in vSphere
  • Add the vlan in UCS

The end result is a brand new subnet, just waiting to be filled with sweet, sweet servers.

Testing Blog Posts From Office 2013

This is the first time I’ve used a real word processor to write a blog post. I installed Office 2013, and Word connected to WordPress without any real issues. The only point of contention is that it apparently sends your username and password to your provider without any encryption. At least they warn you about it.

I’ve been impressed in general with Office 2013 integration across the enterprise, from Sharepoint in the browser to cross-application rich contact support. The eye candy has been expertly crafted, and the small things that hardware acceleration provides make for an incredibly smooth experience. My personal favorite thing so far is the cursor animation. The cursor no longer jumps from character to character. It now animates and slides. It’s a small touch, but for some reason it makes the whole process classy and sleek. I haven’t gotten a chance to take a deeper dive into the new functionality, but it seems like Microsoft is really getting a lot of things right with the look and feel, much like the host of visual changes that came with Windows 8 and Server 2012.

I don’t have anything else interesting to say about Office 2013 just yet, but as I dig deeper into Outlook 2013/Exchange, I’m sure I’ll find some interesting new features to talk about.

Make UCS Even Better With Powershell

I try to use Powershell whenever I can. It’s been a huge step in the right direction for Microsoft, and it’s made a lot of sysadmins feel more comfortable managing and automating larger environments. So, when our UCS vendor told me about the Powershell module that Cisco provides to manage UCS, I was practically salivating.

Cisco provides the kit free of charge to Cisco.com members, and installing it is just a matter of copying the module files to your modules folder and firing up Powershell (an import-module command in your profile will also do the trick).
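
In my profile, that’s one line (the module name below is the one from the PowerTool version I installed; newer releases may differ):

import-module CiscoUcsPs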

Once installed, simply run

connect-ucs <hostname/IP>

After inputting your credentials at the prompt, you’re good to go. You can also put

$username = "<username>"
$pass = "<password>" | convertTo-securestring -asplaintext -force
$cred = New-object System.Management.Automation.PSCredential($username, $pass)
connect-ucs <hostname/IP> -Credential $cred

in your profile to set that up every time you open Powershell.

Powershell makes it trivial to find information about your architecture without messing with the well-designed-but-still-Java UCS Manager, and if you need to configure a lot of things on the fly, there’s a wealth of configuration cmdlets to get you started.
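
For example, a quick blade inventory is a one-liner (Get-UcsBlade ships with the module; the properties I’m selecting are the common ones, so check Get-Member if yours differ):

# List every blade with its model, serial, and state
Get-UcsBlade | select Dn, Model, Serial, OperState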

Here’s Cisco’s cheat sheet for those interested.

A Simple Powershell Script to Show RPC Connections to Exchange CAS Servers

I ran into this when I was trying to shut down a CAS box to apply patches. We have two CAS servers in a Windows NLB array, so the least disruptive way to kick everyone off a node for a reboot is to order a drainstop on that node. This keeps current connections active but doesn’t allow new ones. After a few hours, especially at the end of the day, your connections will drop off to levels acceptable for a reboot.
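
On Server 2008 R2 and up, the drainstop itself can also be issued from PowerShell (a sketch, assuming the NetworkLoadBalancingClusters module is available; older boxes can use wlbs.exe drainstop instead):

# Drain one CAS node out of the NLB array without dropping existing sessions
Import-Module NetworkLoadBalancingClusters
Stop-NlbClusterNode -HostName Server1 -Drain    # repeat for Server2 when it's that node's turn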

I wanted a Powershell method to check on the status of a Drainstop command, and I discovered the hooks Powershell has into Windows’ Performance Monitor. Here’s the script (you can run it remotely with the right permissions):

$cas = "Server1", "Server2"
$active = $cas | %{Get-Counter -computername $_ -Counter "\MSExchange RPCClientAccess\Active User Count" | select -expandproperty CounterSamples | select CookedValue }
$connection = $cas | %{Get-Counter -computername $_ -Counter "\MSExchange RPCClientAccess\Connection Count" | select -expandproperty CounterSamples | select CookedValue }
Write-host Server1
Write-host Connections............. $connection[0].CookedValue
Write-host Active Users............ $active[0].CookedValue
Write-host Server2
Write-host Connections............. $connection[1].CookedValue
Write-host Active Users............ $active[1].CookedValue

It may not be the most elegant script I’ve ever written, but it gets the job done faster than loading up perfmon every time I want to glance at this.

The CookedValue expanded property is new to me, and it makes parsing actual values out of this much simpler. Otherwise, it will just return an object that isn’t easy to stick in a readable write-host statement.

The iOS 6.1 Activesync Fiasco of 2013

Reports have started to materialize across the Internet that iOS 6.1 has a serious ActiveSync bug, in which certain calendar activity causes redundant logging on the CAS server, quickly overwhelming storage space and CPU in a matter of days.

While everyone’s pointing fingers, admins are trying to save their Exchange environments from buckling under ridiculous transaction log growth and waiting with bated breath for Apple or MS to release some kind of fix. Here’s the growth on my end.


In the meantime, there are a few different strategies IT departments are using to combat the issue:

  • Warn users not to download the update – This should be the obvious first step.
  • Block EAS from iOS 6.1 devices – This is a drastic measure, and many organizations wouldn’t allow that kind of disruption without some serious signing off from executives who all use iPhones. It is possible to filter the rule by iOS version if you’re using Powershell (see the example after this list). In OWA, you can only drill down to hardware model.
  • Fix the devices – There have been a few fixes kicked around the forums. Some say that simply turning calendar sync off and back on solves the problem. Some say a full OS wipe is required. Some say you need to delete the Exchange account and add it back. Either way, we have 256 devices running iOS 6.1, so these options aren’t very elegant or feasible (but they’ll work as a last-ditch effort).
  • Put up sandbags until a fix is released – The bug affects storage and CPU. The storage hit (for Exchange 2010 at least) is in the C:\Windows\system32\LogFiles\Exchange\W3SVC1 folder, which contains IIS logs from incoming connections. The good news is that you can delete these logs without any ill effects. If you don’t want to delete them (for security/compliance reasons), they’re text files; I was able to compress 4.1GB down to 379MB. You can also redirect these logs to different storage temporarily. I have been seeing CPU spikes, but they haven’t affected overall performance in any meaningful way (yet).
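
For the PowerShell route mentioned above, the device access rule looks something like this (a sketch; the exact DeviceOS string varies by device and build, so pull it from an affected phone with Get-ActiveSyncDevice first):

# Block anything reporting this iOS 6.1 OS string (example build shown)
New-ActiveSyncDeviceAccessRule -QueryString "iOS 6.1 10B142" -Characteristic DeviceOS -AccessLevel Block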

Hopefully, this will be a non-issue as one of the two companies releases some kind of fix. As of this writing, there hasn’t been an official communication from either company.