It was just supposed to be a simple software upgrade.
The requirement was zero downtime.
The client was supposed to be on the conference call for just 2 hours while we upgraded their Palo Alto firewall from version 7.0 to 7.1 to 8.0. It turned into 18 hours.
The Palo was in active/passive HA mode, and to make matters just a bit more complicated, it was located in Amazon Web Services as a virtual machine.
What Was Supposed to Happen
January 20th, 8pm - 9pm maintenance window
Upgrade the secondary while it is in standby passive mode to version 7.1.
Reboot the secondary firewall
Confirm the secondary firewall is now running version 7.1
Fail over from the active primary to the secondary firewall
This way we can upgrade the primary and reboot it without affecting live traffic. Live traffic will be handled by the secondary.
Upgrade the primary to version 7.1
Reboot the primary firewall
Confirm the primary firewall is now running version 7.1
Fail back to the primary
End maintenance at 9pm
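The planned sequence above can be sketched as a small simulation. This is only an illustration of the ordering and the checks between steps, not how the upgrade was actually scripted: the `Firewall` class and every function name here are hypothetical stand-ins that mutate in-memory state, and none of them call a real Palo Alto or AWS API.

```python
from dataclasses import dataclass


@dataclass
class Firewall:
    """Hypothetical in-memory stand-in for one HA peer (not a real device API)."""
    name: str
    version: str
    active: bool


def upgrade_and_reboot(fw: Firewall, target: str) -> None:
    """Simulate an upgrade plus reboot; refuse if the peer carries live traffic."""
    if fw.active:
        raise RuntimeError(f"{fw.name} is active; fail over before upgrading")
    fw.version = target


def confirm_version(fw: Firewall, target: str) -> None:
    """Stop the change if the peer did not come back on the expected version."""
    if fw.version != target:
        raise RuntimeError(f"{fw.name} is on {fw.version}, expected {target}")


def fail_over(current_active: Firewall, standby: Firewall) -> None:
    """Swap roles, insisting that exactly one peer is active beforehand."""
    if not current_active.active or standby.active:
        raise RuntimeError("expected exactly one active peer before failover")
    current_active.active, standby.active = False, True


def rolling_upgrade(primary: Firewall, secondary: Firewall, target: str) -> list[str]:
    """Run the planned steps in order and return a log of what was done."""
    log = []
    upgrade_and_reboot(secondary, target)      # standby first, no traffic impact
    confirm_version(secondary, target)
    log.append(f"{secondary.name} upgraded to {target}")
    fail_over(primary, secondary)              # secondary now handles live traffic
    log.append(f"failed over to {secondary.name}")
    upgrade_and_reboot(primary, target)        # primary is now the standby
    confirm_version(primary, target)
    log.append(f"{primary.name} upgraded to {target}")
    fail_over(secondary, primary)              # return to the original roles
    log.append(f"failed back to {primary.name}")
    return log


if __name__ == "__main__":
    primary = Firewall("primary", "7.0", active=True)
    secondary = Firewall("secondary", "7.0", active=False)
    for entry in rolling_upgrade(primary, secondary, "7.1"):
        print(entry)
```

The point of the sketch is that each step is gated on a check: a peer is never upgraded while it is active, and the next step never starts until the version is confirmed.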
What Actually Happened
Upgraded the secondary to version 7.1. No issues.
Failed over to the secondary (making it the active firewall)
Upgraded the primary firewall to version 7.1
Rebooted the primary (now the standby) firewall
Both Palo Alto TAC and AWS TAC join the call to see the issues.
Both parties try troubleshooting steps, but can't figure out the issue.
Multiple changes are executed on the firewall including:
Restoring from a backup
Shutting down and starting up the interfaces
Shutting down and starting up the virtual machine on AWS
Downgrading back to version 7.0 on both firewalls
Production firewalls are still down, and the business is about to open in 2 hours
It is decided on the fly that we are going to delete the entire Palo Alto instance in AWS and build a new one
Firewalls still down, production traffic not being protected by a firewall.
We have partial recovery of the firewall. Conditions:
HA had to be broken; the firewall is currently running standalone, with no HA.
The root cause appears to have been a specific interface on the firewall that conflicted with the subnetting in the AWS environment.
All traffic now being protected by the firewall. Customer has asked both Palo Alto and AWS for a root cause analysis.
Domain 4: Communication and Network Security
Operation of hardware
Domain 7: Security Operations
Thanks for reading.