It was just supposed to be a simple software upgrade.
The requirement was zero downtime.
The client was supposed to be on the conference call for just 2 hours while their Palo Alto firewall was upgraded from version 7.0 to 7.1 and then to 8.0. It turned into 18 hours.
The Palo Alto was in active/passive HA mode, and to make matters a bit more complicated, it was running as a virtual machine in Amazon Web Services.
What Was Supposed to Happen
January 20th, 8pm - 9pm maintenance window
Upgrade the secondary to version 7.1 while it is in passive (standby) mode.
Reboot the secondary firewall
Confirm that the secondary firewall is now running version 7.1
Fail over from the active primary to the secondary firewall
This way we can upgrade the primary and reboot it without affecting live traffic. Live traffic will be handled by the secondary.
Upgrade the primary to version 7.1
Reboot the primary firewall
Confirm that the primary firewall is now running version 7.1
Fail back to the primary
End maintenance at 9pm
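The planned sequence above can be sketched as a small script. This is only an illustration of the ordering and the checks between steps, not how the upgrade was actually driven: the `Firewall` class and the `install_and_reboot` and `failover` helpers are hypothetical stand-ins that simulate state in memory and do not call any real Palo Alto or AWS API.

```python
from dataclasses import dataclass


@dataclass
class Firewall:
    """Hypothetical stand-in for one HA peer (not a real Palo Alto API)."""
    name: str
    version: str
    ha_state: str  # "active" or "passive"


def install_and_reboot(fw, target):
    """Placeholder for the real install and reboot; here it only sets the version."""
    fw.version = target


def upgrade_passive(fw, target):
    """Upgrade one peer, refusing to touch it unless it is passive."""
    if fw.ha_state != "passive":
        raise RuntimeError(f"{fw.name} is {fw.ha_state}; only the passive peer may be upgraded")
    install_and_reboot(fw, target)
    # Gate: confirm the running version before moving to the next step.
    if fw.version != target:
        raise RuntimeError(f"{fw.name} is on {fw.version}, expected {target}")


def failover(active, passive):
    """Swap HA roles so the formerly passive peer carries live traffic."""
    active.ha_state, passive.ha_state = "passive", "active"


def rolling_upgrade(primary, secondary, target):
    """Run the planned steps in order and return a log of what was done."""
    log = []
    upgrade_passive(secondary, target)
    log.append(f"{secondary.name} upgraded to {target}")
    failover(primary, secondary)
    log.append(f"{secondary.name} is now active")
    upgrade_passive(primary, target)
    log.append(f"{primary.name} upgraded to {target}")
    failover(secondary, primary)
    log.append(f"{primary.name} is now active")
    return log


primary = Firewall("primary", "7.0", "active")
secondary = Firewall("secondary", "7.0", "passive")
for entry in rolling_upgrade(primary, secondary, "7.1"):
    print(entry)
```

The point of the sketch is the gate after each upgrade: the next step only runs once the previous one has been confirmed, and the active firewall is never the one being rebooted.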
What Actually Happened
Upgraded the secondary to version 7.1. No issues.
Failed over to the secondary (making it the active firewall)
Upgraded the primary firewall to version 7.1
Rebooted the primary (now the standby) firewall
We get an alert that ALL production traffic is down for both firewalls
HA links and interface links are down on both firewalls
Both Palo Alto TAC and AWS support join the call to look into the issues.
Both parties try troubleshooting steps, but can't figure out the issue.
Multiple changes are executed on the firewall including:
Restoring from a backup
Shutting down and starting up the interfaces
Shutting down and starting up the virtual machine on AWS
Downgrade back to version 7.0 on both firewalls
Production firewalls are still down, and the business is about to open in 2 hours
It is decided on the fly that we are going to delete the entire Palo Alto instance in AWS and build a new one
Firewalls still down, production traffic not being protected by a firewall.
We have partial recovery of the firewall. Conditions:
The HA pair had to be broken; the firewall is currently standalone, with no HA.
It seems the root cause was a specific interface on the firewall that conflicted with the subnetting within the AWS environment.
All traffic now being protected by the firewall. Customer has asked both Palo Alto and AWS for a root cause analysis.
Domain 4: Network Security
There was a lot of network routing, subnetting, and link aggregation work involved
Operation of hardware
Although it was a virtual machine, we still had to investigate the operation of the hardware
I was connecting to the firewall over SSH, as well as over HTTPS (port 443)
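A quick way to confirm that both management paths are reachable is a plain TCP connect test. The following is a minimal sketch using only the Python standard library; the function names and the host name in the usage comment are hypothetical, and a successful TCP connection only shows that the port answers, not that login works.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_management_access(host: str) -> dict:
    """Check the two management ports mentioned above: SSH (22) and HTTPS (443)."""
    return {name: port_open(host, port) for name, port in (("ssh", 22), ("https", 443))}


# Usage (host name is a placeholder, not the real firewall):
# print(check_management_access("fw-mgmt.example.internal"))
```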
Domain 7: Security Operations
Making sure the firewall is still operating as normal
Update any documentation that describes the firewall as HA so that it says the firewall is now standalone
Service Level Agreement
Discuss how the SLA was breached by Palo Alto and AWS in terms of availability, uptime, and timely resolution of the issue
Incident management began the moment the firewall stopped passing traffic. We had to:
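To put the availability part of that discussion into numbers, here is a minimal sketch of the arithmetic. The figures are assumptions for illustration only: it treats the full 18 hours of the call as downtime (an upper bound), uses a 30-day month as the measurement period, and picks a 99.9% target that is hypothetical and not taken from any actual contract.

```python
def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Availability achieved over a period, as a percentage."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_hours(target_percent: float, period_hours: float) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_hours * (1.0 - target_percent / 100.0)


PERIOD_HOURS = 30 * 24   # assumption: 30-day measurement period
OUTAGE_HOURS = 18        # assumption: the whole 18-hour call counted as downtime
TARGET_PERCENT = 99.9    # hypothetical SLA target, not from the real agreement

achieved = availability_percent(OUTAGE_HOURS, PERIOD_HOURS)
budget = allowed_downtime_hours(TARGET_PERCENT, PERIOD_HOURS)
print(f"achieved {achieved:.2f}% against a target of {TARGET_PERCENT}%")
print(f"downtime budget {budget:.2f} h, actual {OUTAGE_HOURS} h")
```

Under these assumptions the month comes out at roughly 97.5% availability, while a 99.9% target would allow well under an hour of downtime.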
Detect, respond, mitigate, report, recover, perform remediation, and lessons learned
During the disaster, no change management process was followed; all changes to the firewall were made on the fly. Because production traffic was already impacted, this was acceptable. Normally, changes to the firewall go through a change management process.
Implement recovery strategies
Restored the previous image of the firewall from backup
AWS had multiple processing sites available in which to create a new virtual machine and recover the firewall
Thanks for reading.