11:00 P.M. Tuesday
I had 98% iPhone battery left, but wanted to make sure it hit 100%. I was on-call for the next 7 days - and it was only Day 1. It seems ridiculous to think the phone's battery would die completely overnight, but I didn't want to take any chances. I was primary on-call, meaning I would get called first; if escalation was required, the secondary on-call, a senior engineer, would be contacted. But if the secondary on-call had to be contacted, it had better be not just an emergency, but a catastrophe.
See, there is a whole psychology to being on-call as a senior network security engineer. My work environment is fiercely competitive in terms of "who knows the most" about network security, and not in a negative way. The competition brings out the best in all of us: one engineer making Hide-NAT traffic work in unconventional ways, another resolving a VPN issue between two different vendors' firewalls whose traffic was being supernetted, another coming up with better ways to aggregate logs into something human-readable. It is this drive to be the best which has sharpened my colleagues into brilliant engineers. Some are CCIEs, CCNPs, some CISSPs, and some, without any certifications or college degrees, are wizards of Linux, Bash, and network security in general. Beyond wizards, some are simply Jedi Masters of network security.
An on-call shift is a slightly anxiety-filled time. On-call means that for 7 straight days, if there is any issue with any of the firewalls for our 500+ customers at any time of day, it is the on-call engineer's job to solve it. There are two people on-call at a time, Tier 1 and Tier 2. Tier 1 on-call engineers get called first during an emergency. They do their best to resolve the issue, but if they cannot, they must call the Tier 2 engineer. And a Tier 2 engineer must resolve the problem; they have nobody else to call. Most Tier 2 engineers are senior-level network security engineers who have pretty much seen it all when it comes to Cisco ASA, Checkpoint, or Palo Alto firewalls. Unfairly, perhaps, I am close friends with the other Tier 2 engineers and could call them anytime if needed, even at 3:27am on a Tuesday (somehow 3:27am seems like the time when the night is darkest, when someone is in their deepest sleep). But ultimately, it would be a failure on my part if I couldn't resolve an issue myself. Keeping this in mind, I read a few pages of Isaac Asimov's "The Gods Themselves" and drifted off to sleep.
1:23 A.M. Wednesday
The iPhone's "Night Owl" ringtone slowly went from a quiet ring to full blast. The gradual increase in volume wasn't from the phone, but from me coming out of my dream state - a dream of being the #1 CISSP platform in the world by 2020. I reached over to my nightstand and tapped the green "Answer" button.
"Hey, what's up?" - No saying "Hello" or any other formalities, I already knew what the call meant.
"Hey Luke, sorry to bother you this late, but we have an issue......" - said the nervous night shift engineer on the phone. Frankly, I was nervous too, as I do not want to be faced with an issue which I cannot resolve, but that kind of thinking isn't realistic - you just can't be expected to solve every problem every single time.
The engineer continued "We have made the scheduled change to the IPS blade of the firewall for tonight, but something went wrong. After pushing policy, the client stated they couldn't reach any of the servers behind their firewall."
Trying to eliminate the obvious problems first, I had the following Q&A session with the engineer while still waking up:
Q: Are we sure it's the firewall and not some router or switch that is blocking traffic?
A: Our NOC said the traffic is not even reaching their routers or switches, and those devices are not having any issues.
Q: Does the customer know? Are they complaining?
A: Customer knows, but they are not complaining as of right now, they just want to have it fixed by morning. The customer's network engineer is being surprisingly pleasant about the situation.
Q: Well, that's a relief. Okay...does SmartView Tracker show drops?
A: Yes, SmartView Tracker and the packet captures show nothing but drops for all types of traffic, both regular traffic and the traffic filtered by the IPS.
"Okay, let me sign on to Skype."
Sure enough, all traffic was being dropped. We pinged the servers behind the firewall and they replied, so traffic from the firewall to the servers was fine. It was traffic coming from outside the network that was being blocked - and outside traffic accounted for 98% of their connections. I logged into the command line of the firewall and issued the following Checkpoint commands:
"fw tab -t connections -s" - Checks the number of current connections in the firewall's state table. It showed 4, when it should be something over 9,000. Not good.
"fw stat" - Shows the name of the current policy. Checked this just in case the engineer somehow pushed the wrong policy to the wrong firewall. This time it showed the correct policy.
"tcpdump -ni eth0" - Performed a packet capture on the firewall to see if external traffic was even arriving to the firewall (hoping some edge router may be the problem, and out of our scope of service). Traffic was indeed hitting the firewall, but not being accepted.
"fw ctl zdebug drop" - This is to see if the firewall was dropping any of the traffic. Confirmed it was dropping ALL traffic.
Okay, I try to be a security professional at all times, but for about 25 seconds the only thing I thought was "Oh ****! ****! ****! What do I do?? This is NOT good...Management is going to start complaining any minute now and it looks bad if I don't solve this!"
This happens to me every time. Even before going on job interviews, I find myself fully prepared, but still needing to throw up a little bit in the bathroom beforehand. Nerves, I guess. Not very professional.
But then I remembered I am a CISSP and am getting paid to do this - it is my job. I signed up for this kind of work; nobody made me. I needed to lower the stress and just focus.
Okay, okay. If the firewall is dropping all traffic because of an IPS change...let me just revert the IPS change and push policy.
3:03 A.M. Wednesday
Policy installation failed.
A failed policy installation means two things. One, I won't be able to push the policy that reverts the IPS changes. Two, there is NO communication between the firewall's management server and the firewall itself - an even BIGGER problem. The firewall was denying even our own access to push customer policies. It was as if the firewall had a mind of its own.
At this point I thought of my past experiences, my past technical training, and then my CISSP training. Suddenly I could breathe. Ideas and vivid memories of what senior engineers had done before started flooding in. I now had the clarity needed to attack this problem, and two ideas.
Idea A - Reboot the Firewall
If the firewall was dropping traffic anyway, there was no harm in rebooting it. We only reboot a customer's firewall during a scheduled maintenance window and only with their permission. But by now it was 3:15am, and we weren't about to call the customer for permission. This decision required a manager's way of thinking.
"Let's reboot the firewall" - After acquiring agreement from the others on the conference call, we rebooted the firewall via console connection.
I prayed hard that once the firewall shut down to reboot, it would actually come back up. If it didn't, things would have gone from bad to worse.
It came back up as it was supposed to, phew. The Universe showed mercy, if just slightly. We still faced the same problem though, all traffic was still being dropped and we still couldn't push policy.
Idea B - Unload policy
There is something called the "Initial Policy" on Checkpoint firewalls. The Initial Policy is the default policy that comes on the firewall right out of the box. It denies all traffic except for management ports like SSH, HTTPS, and Checkpoint's Secure Internal Communication (SIC). Communication between the management server and the gateway uses SIC - so we assumed SIC had somehow broken along with all the other connections. We began unloading the current policy and falling back to the default policy with this command:
"fw unload local"
Then I typed "fw stat" and saw that the policy name was no longer the customer-specific one, but "InitialPolicy". Great. Step 1 of Idea B complete. Theoretically, this should allow me to push policy to the firewall.
Now, with a deep breath, I clicked "Install Policy" from the SmartDashboard. Phewwwwwwww - the message came back: policy installed successfully. And since we pushed the policy with the IPS changes reverted, not the version that had blocked all traffic, things started to work again.
We once again saw "Accept" connections to the servers, and the number of connections in the firewall's state table climbed rapidly from 4 to 1,200, to 5,500, to 8,900.
Things started working - it was a huge relief. Like passing the CISSP exam: almost that same feeling, when you worked hard at something, were unsure of your answers during the exam, and then saw the "Congratulations" at the end. This is the discipline the CISSP teaches you. Never mind the knowledge; it is the discipline to get things done which stays with the Certified Information Systems Security Professional.
After the issue was resolved, the night-shift engineers were to investigate the logs and see what they could find out about the failed IPS change and what made it drop all traffic. I was going to sleep and would be in late to work that morning - those who get called while on-call and work late are entitled to arrive later the next day.
I wound up going to work on time anyway, just to find out what had happened. It turned out that while configuring the "Geo-Protection" feature of the IPS, the engineer had misunderstood part of the configuration and set traffic from legitimate countries to "Prevent" instead of "Detect" in another IPS profile. I cannot reveal the full details of the issue, but this was the general inconsistency. It was not the night-shift engineer's fault, or anyone else's. It was a general failure of multiple parties to communicate with each other before implementing this complex IPS policy.
Conclusion and Relation to the CISSP
The change to the IPS was complex. A senior engineer had suggested that a bit more due diligence be performed before implementing the change; however, the customer wanted it done. There were no senior engineers available during the night shift, so the engineers on duty went ahead with the change. This push to get it done without a few extra cautionary steps and a rollback plan led to the outage. The issue was resolved within 2 hours, well within the service level agreement. A root cause analysis was researched, provided to the customer, and documented internally for future occurrences of the same type of IPS change.
CISSP Take-Away Concepts
Domain 1: Security and Risk Management
Domain 4: Communication and Network Security
Pinging servers behind the firewall
Intrusion Prevention System (IPS)
Cisco ASA, Checkpoint, Palo Alto stateful firewalls
Checking the firewall's state table
Most firewalls today use stateful filtering, not simple packet filtering
Stateful filtering firewalls remember a connection's information in a "state" table - beyond just source, destination, and port - and do not need separate firewall rules for the inbound and outbound directions of the same connection
Packet-filtering firewalls require two different rules for the same traffic: one for inbound and one for outbound
Domain 7: Security Operations
Stories of a CISSP: Low Availability
Stories of a CISSP: Change Management
Stories of a CISSP: Symmetric Key Recovery
Stories of a CISSP: Unknown Password
Stories of a CISSP: TCP Handshake
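The stateful-versus-packet-filtering distinction in the take-aways above can be sketched in a few lines of Python. This is purely illustrative - a toy model, not any vendor's implementation: a single outbound rule admits the connection, the 5-tuple is recorded in a state table, and the reply is matched against that recorded state rather than against a second, inbound rule.

```python
# Toy model of a stateful firewall's state table (illustrative only).
# One outbound rule is enough: the reply is matched by state, not by rule.
from typing import NamedTuple

class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str

class StatefulFirewall:
    def __init__(self, outbound_rules):
        self.outbound_rules = outbound_rules  # allowed (dst_ip, dst_port) pairs
        self.state_table = set()              # remembered connections (5-tuples)

    def allow(self, flow: Flow) -> bool:
        # Reply traffic: reverse the 5-tuple and look for existing state.
        reply_of = Flow(flow.dst_ip, flow.dst_port,
                        flow.src_ip, flow.src_port, flow.proto)
        if reply_of in self.state_table:
            return True
        # New outbound connection: check the rulebase, then record state.
        if (flow.dst_ip, flow.dst_port) in self.outbound_rules:
            self.state_table.add(flow)
            return True
        return False

fw = StatefulFirewall(outbound_rules={("203.0.113.10", 443)})
out = Flow("10.0.0.5", 51000, "203.0.113.10", 443, "tcp")
back = Flow("203.0.113.10", 443, "10.0.0.5", 51000, "tcp")

print(fw.allow(out))   # True  - matches the outbound rule, state recorded
print(fw.allow(back))  # True  - matched by state; no inbound rule needed
print(fw.allow(Flow("198.51.100.9", 40000, "10.0.0.5", 22, "tcp")))  # False
```

A packet-filtering firewall, having no state table, would need an explicit second rule permitting the return traffic from 203.0.113.10:443 - which is exactly the difference the take-away describes.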