Stories of a CISSP: Low Availability

January 24, 2018

 

It was just supposed to be a simple software upgrade.  

 

The requirement was zero downtime.  

 

The client was supposed to be on the conference call for just 2 hours while we upgraded their Palo Alto firewall from version 7.0 to 7.1 and then to 8.0.  It turned into 18 hours.  

 

The Palo was in active/passive HA mode, and to make matters just a bit more complicated, it was located in Amazon Web Services as a virtual machine.  

 

What Was Supposed to Happen

 

January 20th, 8pm - 9pm maintenance window

 

  1. Upgrade the secondary to version 7.1 while it is in passive (standby) mode.  

    1. Reboot the secondary firewall

    2. Confirm that the secondary firewall is now running version 7.1 

  2. Fail over from the active primary to the secondary firewall

    1. This way we can upgrade the primary and reboot it without affecting live traffic.  Live traffic will be handled by the secondary.

  3. Upgrade the primary to version 7.1

    1. Reboot the primary firewall

    2. Confirm that the primary firewall is now running version 7.1 

  4. Fail back to the primary

  5. End maintenance at 9pm
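The planned sequence above is a rolling upgrade with a check after every step. Below is a minimal Python simulation of that ordering. It is not PAN-OS or AWS code: the `Firewall` class and the `upgrade_and_reboot`, `fail_over`, and `run_plan` helpers are hypothetical names invented for illustration, and the toy model assumes every reboot succeeds. It only shows the invariant the plan relies on, which is that exactly one firewall is active after each step.

```python
from dataclasses import dataclass


@dataclass
class Firewall:
    """Toy model of one HA peer (hypothetical, not a real device API)."""
    name: str
    version: str
    active: bool


def upgrade_and_reboot(fw: Firewall, target: str) -> None:
    """Upgrade a peer, but only while it is passive."""
    if fw.active:
        raise RuntimeError(f"{fw.name} is active; fail over before upgrading")
    fw.version = target  # in this toy model the reboot always succeeds


def fail_over(from_fw: Firewall, to_fw: Firewall) -> None:
    """Move the active role from one peer to the other."""
    from_fw.active, to_fw.active = False, True


def exactly_one_active(*pair: Firewall) -> bool:
    """True when one and only one peer is handling live traffic."""
    return sum(fw.active for fw in pair) == 1


def run_plan(primary: Firewall, secondary: Firewall, target: str) -> list[str]:
    """Run the planned steps in order, checking the invariant after each."""
    log = []
    steps = [
        ("upgrade secondary", lambda: upgrade_and_reboot(secondary, target)),
        ("fail over to secondary", lambda: fail_over(primary, secondary)),
        ("upgrade primary", lambda: upgrade_and_reboot(primary, target)),
        ("fail back to primary", lambda: fail_over(secondary, primary)),
    ]
    for label, step in steps:
        step()
        if not exactly_one_active(primary, secondary):
            raise RuntimeError(f"traffic at risk after step: {label}")
        log.append(label)
    return log


primary = Firewall("primary", "7.0", active=True)
secondary = Firewall("secondary", "7.0", active=False)
completed = run_plan(primary, secondary, "7.1")
print(completed)
```

A real runbook would replace the toy check with actual health checks on HA state and interface links, since a failure at any gate is the signal to stop rather than continue to the next step.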

 

 

What Actually Happened 

 

8pm

Upgraded the secondary to version 7.1.  No issues.  

 

8:15pm 

  • Failed over to the secondary (making it the active firewall) 

  • Upgraded the primary firewall to version 7.1 

  • Rebooted the primary (now the standby) firewall

 

8:20pm 

  • We get an alert that ALL production traffic is down for both firewalls

  • HA links and interface links are down on both firewalls 


9:00pm 

Both Palo Alto TAC and AWS support join the call to look into the issues.  

 

10:30pm 

Both parties try troubleshooting steps, but can't figure out the issue. 

 

2:00am

  1. Multiple changes are executed on the firewall including: 

    1. Restoring from a backup

    2. Multiple reboots

    3. Shutting down and starting up the interfaces 

    4. Shutting down and starting up the virtual machine on AWS

    5. Downgrading back to version 7.0 on both firewalls

 

5:00am 

Production firewalls are still down, and the business opens in 2 hours. 

 

6:00am

It is decided on the fly that we are going to delete the entire Palo Alto instance in the AWS cloud and build a new one.

 

9:00am

Firewalls still down, production traffic not being protected by a firewall. 

 

11:00am

  1. We have partial recovery of the firewall.  Conditions: 

    1. HA had to be broken; the firewall is currently running standalone, with no HA.  

    2. It seems the root cause was a specific interface on the firewall that conflicted with the subnetting in the AWS environment. 

 

 

1:00pm

All traffic now being protected by the firewall.  Customer has asked both Palo Alto and AWS for a root cause analysis. 
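To put the timeline in numbers, here is a small sketch that computes the elapsed times between the events above. The timestamps come straight from the timeline; the year 2018 is an assumption based on the post date, and the variable names are illustrative only.

```python
from datetime import datetime

# Timestamps from the timeline; the year is assumed from the post date.
window_start = datetime(2018, 1, 20, 20, 0)      # 8:00pm maintenance start
window_end = datetime(2018, 1, 20, 21, 0)        # 9:00pm planned end
outage_start = datetime(2018, 1, 20, 20, 20)     # 8:20pm all traffic down
partial_recovery = datetime(2018, 1, 21, 11, 0)  # 11:00am standalone firewall up
full_recovery = datetime(2018, 1, 21, 13, 0)     # 1:00pm all traffic protected


def hours_minutes(delta):
    """Convert a timedelta into an (hours, minutes) tuple."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


planned = hours_minutes(window_end - window_start)
to_partial = hours_minutes(partial_recovery - outage_start)
to_full = hours_minutes(full_recovery - outage_start)
start_to_full = hours_minutes(full_recovery - window_start)

for label, (h, m) in [
    ("planned window", planned),
    ("outage until partial recovery", to_partial),
    ("outage until full recovery", to_full),
    ("maintenance start until full recovery", start_to_full),
]:
    print(f"{label}: {h}h {m:02d}m")
```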

 

 

 

CISSP Concepts

 

Domain 4: Communication and Network Security

 

  • IP Networking

    • There was a lot of network routing, subnetting, and link aggregation work involved

  • Operation of hardware

    • Although it was a virtual machine, we still had to investigate the operation of the hardware

  • Remote access

    • I was connecting to the firewall over SSH, as well as over HTTPS on port 443

 

 

Domain 7: Security Operations

 

  • Continuous monitoring

    • Making sure the firewall is still operating normally

  • Asset Management

    • Update any documentation that describes the firewall as HA to say it is now standalone

  • Service Level Agreement 

    • Discuss how the SLA was breached by Palo Alto and AWS in terms of availability, uptime, and timely resolution of the issue

  • Incident management

    • Incident management began the moment the firewall stopped passing traffic.  We had to: 

      • Detect, respond, mitigate, report, recover, remediate, and capture lessons learned

  • Change management

    • During the disaster, no change management process was followed.  All changes to the firewall were being made on the fly.  As production traffic was already impacted, this was fine.  Normally, changes to the firewall go through a change management process. 

  • Implement recovery strategies

    • Restored previous image of firewall from backup 

    • AWS had multiple processing sites available in which to create a new virtual machine and recover the firewall 
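For the Service Level Agreement point above, it can help to convert the outage into an availability figure. The sketch below does that arithmetic. The outage length (8:20pm to 1:00pm) comes from the timeline, but the 30-day measurement period and the 99.9% target are assumptions for illustration only; the actual SLA terms are not stated here.

```python
PERIOD_HOURS = 30 * 24        # assumed 30-day measurement period
OUTAGE_HOURS = 16 + 40 / 60   # 8:20pm to 1:00pm, from the timeline
TARGET = 0.999                # hypothetical SLA target, not from the post


def availability(period_hours, downtime_hours):
    """Fraction of the period during which the service was up."""
    return (period_hours - downtime_hours) / period_hours


def allowed_downtime_minutes(period_hours, target):
    """Downtime budget, in minutes, that a given availability target permits."""
    return period_hours * 60 * (1 - target)


achieved = availability(PERIOD_HOURS, OUTAGE_HOURS)
budget = allowed_downtime_minutes(PERIOD_HOURS, TARGET)

print(f"availability over the period: {achieved:.2%}")
print(f"downtime budget at {TARGET:.1%}: {budget:.1f} minutes")
print(f"actual downtime: {OUTAGE_HOURS * 60:.0f} minutes")
```

Under these assumed numbers a single outage of this length uses up many times the monthly downtime budget, which is why the SLA conversation matters after an incident like this one.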

 

Thanks for reading.

 

Click here to read Stories of a CISSP: Change Management.

 

 

 

© 2013 Study Notes and Theory