
Stories of a CISSP: Low Availability


It was just supposed to be a simple software upgrade.

The requirement was zero downtime.

The client was supposed to be on the conference call for just 2 hours while we upgraded their Palo Alto firewall from version 7.0 to 7.1 to 8.0. It turned into 18 hours.

The Palo was in active/passive HA mode, and to make matters just a bit more complicated, it was located in Amazon Web Services as a virtual machine.

What Was Supposed to Happen

January 20th, 8pm - 9pm maintenance window

  1. Upgrade the secondary while it is in standby passive mode to version 7.1.

  2. Reboot the secondary firewall

  3. Confirm the secondary firewall is now running on version 7.1

  4. Fail over from the primary (active) to the secondary firewall

  5. This way we can upgrade the primary and reboot it without affecting live traffic. Live traffic will be handled by the secondary.

  6. Upgrade the primary to version 7.1

  7. Reboot the primary firewall

  8. Confirm the primary firewall is now running on version 7.1

  9. Fail over back to the primary

  10. End maintenance at 9pm
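The plan above can be sketched as a small simulation. This is only an illustration of the ordering, not anything that talks to a real firewall: the `Firewall` class and the `upgrade_and_reboot`, `fail_over`, `confirm_version`, and `rolling_upgrade` names are all hypothetical and do not come from PAN-OS or AWS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Firewall:
    """Hypothetical stand-in for one HA peer. Not a real PAN-OS or AWS API."""
    name: str
    version: str
    active: bool

    def upgrade_and_reboot(self, target: str) -> None:
        # Only ever upgrade the passive peer, so live traffic is untouched.
        if self.active:
            raise RuntimeError(f"{self.name} is active; fail over first")
        self.version = target


def fail_over(from_fw: Firewall, to_fw: Firewall) -> None:
    """Hand the active role from one peer to the other."""
    from_fw.active, to_fw.active = False, True


def confirm_version(fw: Firewall, target: str) -> None:
    """Stop the plan if the peer did not come back on the expected version."""
    if fw.version != target:
        raise RuntimeError(f"{fw.name} is on {fw.version}, expected {target}")


def rolling_upgrade(primary: Firewall, secondary: Firewall, target: str) -> List[str]:
    """Walk the planned steps in order and return a log of what was done."""
    log = []
    secondary.upgrade_and_reboot(target)   # steps 1-2
    confirm_version(secondary, target)     # step 3
    log.append(f"{secondary.name} upgraded to {target}")
    fail_over(primary, secondary)          # steps 4-5
    log.append(f"{secondary.name} is now active")
    primary.upgrade_and_reboot(target)     # steps 6-7
    confirm_version(primary, target)       # step 8
    log.append(f"{primary.name} upgraded to {target}")
    fail_over(secondary, primary)          # step 9
    log.append(f"{primary.name} is now active")
    return log
```

The sketch only models the happy path. It checks versions, as the plan does, but a version check alone says nothing about whether the HA links and interfaces are up after a reboot.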

What Actually Happened

8pm

Upgraded the secondary to version 7.1. No issues.

8:15pm

  • Failed over to the secondary (making it the active firewall)

  • Upgraded the primary firewall to version 7.1

  • Rebooted the primary (now the standby) firewall

8:20pm

  • We get an alert that ALL production traffic is down for both firewalls

  • HA links and interface links are down on both firewalls

9:00pm

Both Palo Alto TAC and AWS TAC join the call to look into the issue.

10:30pm

Both parties try troubleshooting steps, but can't figure out the issue.

2:00am

Multiple changes are executed on the firewall, including:

  • Restoring from a backup

  • Multiple reboots

  • Shutting down and starting up the interfaces

  • Shutting down and starting up the virtual machine on AWS

  • Downgrading back to version 7.0 on both firewalls

5:00am

Production firewalls are still down, and the business opens in 2 hours.

6:00am

It is decided on the fly that we will delete the entire Palo Alto instance on AWS and build a new one.

9:00am

Firewalls still down, production traffic not being protected by a firewall.

11:00am

We have partial recovery of the firewall. Conditions:

  • The HA had to be broken. The firewall is currently standalone, with no HA.

  • The root cause appears to be a specific interface on the firewall that conflicted with the subnetting in the AWS environment.

1:00pm

All traffic now being protected by the firewall. Customer has asked both Palo Alto and AWS for a root cause analysis.
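To put numbers on the timeline above, here is a quick calculation using only the timestamps listed in it. The variable names and the `hours_minutes` helper are my own, and times are stored as offsets from midnight on the day of the maintenance window so that no year has to be assumed.

```python
from datetime import timedelta

# Offsets from midnight on January 20th, taken from the timeline.
window_start = timedelta(hours=20)               # 8:00pm, maintenance begins
window_end = timedelta(hours=21)                 # 9:00pm, planned end
traffic_down = timedelta(hours=20, minutes=20)   # 8:20pm, alert fires
partial_recovery = timedelta(days=1, hours=11)   # 11:00am the next day
full_recovery = timedelta(days=1, hours=13)      # 1:00pm the next day


def hours_minutes(delta: timedelta) -> str:
    """Format a duration as 'Hh MMm'."""
    total_minutes = int(delta.total_seconds() // 60)
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"


planned = window_end - window_start
elapsed = full_recovery - window_start
outage = full_recovery - traffic_down
until_partial = partial_recovery - traffic_down

print("Planned window:        ", hours_minutes(planned))        # 1h 00m
print("Start to full recovery:", hours_minutes(elapsed))        # 17h 00m
print("Traffic down until partial recovery:", hours_minutes(until_partial))  # 14h 40m
print("Traffic down until full recovery:   ", hours_minutes(outage))         # 16h 40m
```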

CISSP Concepts

Domain 4: Communication and Network Security

  • IP Networking

      • There was a lot of network routing, subnetting, and link aggregation work involved

  • Operation of hardware

      • Although it was a virtual machine, we still had to investigate the operation of the hardware

  • Remote access

      • I was connecting to the firewall over SSH, as well as over HTTPS on port 443
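The remote access point above relies on ports 22 (SSH) and 443 (HTTPS) being reachable on the firewall's management address. A minimal sketch of a reachability check, using only Python's standard `socket` module, might look like this. The function names and the default port list are my own choices, and a successful TCP connection only shows the port is open, not that the firewall is passing production traffic.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_management_access(host: str, ports=(22, 443)) -> dict:
    """Check each management port and return {port: reachable}."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling `check_management_access("192.0.2.10")` would return a dictionary such as `{22: True, 443: False}`; the address here is a documentation placeholder, not the real firewall.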

Domain 7: Security Operations

  • Continuous monitoring

      • Making sure the firewall is still operating as normal

  • Asset Management

      • Update any documentation stating the firewall is HA to say it is now standalone

  • Service Level Agreement

      • Discuss how the SLA was breached by Palo Alto and AWS in terms of availability, uptime, and timely resolution of the issue

  • Incident management

      • Incident management began the moment the firewall stopped passing traffic. We had to detect, respond, mitigate, report, recover, remediate, and capture lessons learned.

  • Change management

      • During the disaster, no change management process was followed; all changes to the firewall were made on the fly. Because production traffic was already impacted, this was acceptable. Normally, changes to the firewall go through a change management process.

  • Implement recovery strategies

      • Restored previous image of firewall from backup

      • AWS had multiple processing sites available, which let us create a new virtual machine and recover the firewall
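The SLA point above comes down to simple arithmetic. Here is a rough sketch of it, assuming the outage ran from the 8:20pm alert to the 1:00pm full recovery (16 hours 40 minutes) and is measured against the 31 days of January. The 99.9% target is purely a hypothetical figure for illustration; the actual SLA terms are not stated in this post.

```python
from datetime import timedelta


def availability(downtime: timedelta, period: timedelta) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (1 - downtime / period)


def allowed_downtime(target_percent: float, period: timedelta) -> timedelta:
    """Maximum downtime in the period that still meets the target."""
    return period * (1 - target_percent / 100.0)


outage = timedelta(hours=16, minutes=40)   # 8:20pm to 1:00pm, from the timeline
january = timedelta(days=31)
hypothetical_target = 99.9                 # assumption, not from the real SLA

measured = availability(outage, january)
budget = allowed_downtime(hypothetical_target, january)

print(f"Availability for the month: {measured:.2f}%")
print(f"Downtime allowed at {hypothetical_target}%: {budget.total_seconds() / 60:.1f} minutes")
print(f"Target met: {measured >= hypothetical_target}")
```

Under these assumptions the month comes out at roughly 97.76% availability, against a budget of about 45 minutes of downtime for a 99.9% target.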

Thanks for reading.
