1:00 PM, the IT director sent out an email about the number of phishing attacks we have been getting today.

3:50 PM, Friday, a developer on our team calls me: the widget database on production is missing data, and the same database on the QA server is too. While talking to him, I am querying the databases he is describing. The dude is crazy frantic. Calming him down was priority one. I needed more information about this event. The databases still exist on the servers. Data is missing, but only data. We have 800-900 applications that run on 30 servers.

These particular servers run our homegrown applications, and this database is a controller for all applications on the server. Seventy of our homegrown applications were down. The database also pulls data from Active Directory, so I have not ruled out an Active Directory issue.

So what caused this?

I told the developer I would look into it and call him back. He was asking me to restore from the development server. No, that will not work because of GUIDs in the tables. The developer and I hang up.

I attempt to reach my lead developer, who wrote and manages this application. No dice; he left at 3:00 and was not answering his phone.

4:10 PM, a network engineer calls me and bluntly asks what I did on this server today. All production applications are down. Yes, I was on the server: I loaded the backup agent we were using and ran a test backup. While the network engineer and I talked, I restored a copy of the QA database that was partially missing. About ten tables, and now I have the rows from the live system and the rows from noon. Has any data changed?

This system holds primarily static data and refreshes from Active Directory at 5:00 AM. The network engineer and I keep chatting.

We ran the refresh from Active Directory. It takes about nine minutes.

The network engineer helped by reaching out to management, since I had not been able to reach anyone.

4:23 PM, the network engineer and a business analyst had a three-way call with me. They are testing the applications. We have most applications working as we end the call, deciding who will chase down the lead developer. I offered, and they offered to buy me dinner and drinks. Yeah, right.

4:30 PM, I have a restore from backup sitting next to the production database, and the same in QA. I got an email suggesting one of the vendor accounts was involved in deleting the data. I called the network engineer back and had him disable the vendor account and the vendor's virtual machines, just to be sure.
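A side-by-side restore like those leaves the damaged database untouched while giving you a full copy to compare against. A minimal T-SQL sketch, with the database name, backup path, and logical file names all illustrative:

    -- Restore the backup next to the damaged database instead of over it
    RESTORE DATABASE WidgetDB_Restore
    FROM DISK = N'D:\Backups\WidgetDB_Full.bak'
    WITH MOVE N'WidgetDB'     TO N'D:\Data\WidgetDB_Restore.mdf',
         MOVE N'WidgetDB_log' TO N'D:\Logs\WidgetDB_Restore.ldf',
         RECOVERY;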

The lead developer will call in 45 minutes, when he arrives at his destination, sans computer. OK, so all I will get from him is information.

I start reviewing the data a little closer.

Database compare of rows:

Table            Prod   Restore
Users            3508      5509
Claims              0       758
Roles               0      1088
Logins              0         0
__Migration         1         2
RolePermissions     0        74
Applications       34        34
Roles               8         8
Nodes               0       477
Sitemaps            0        26
Permissions         8         8
Roles               0        11
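Per-table counts like these can be pulled quickly from partition metadata. A sketch of the kind of query that produces them, run once in each database (the metadata counts are approximate but fine for spotting emptied tables):

    -- Approximate row count per table, from partition metadata
    SELECT t.name AS TableName,
           SUM(p.rows) AS [Rows]
    FROM sys.tables AS t
    JOIN sys.partitions AS p
        ON p.object_id = t.object_id
       AND p.index_id IN (0, 1)   -- heap or clustered index only
    GROUP BY t.name
    ORDER BY t.name;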

Sure glad I did not restore from development; that would have hosed the entire system. Only one department is still working at this hour, and their applications are still down. But I have a plan, and sure, it will not break anything! And it is recoverable too. I start this action.

Kill active sessions in the database, rename the production database to _org, then rename my recovered copy to the production name.
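In SQL Server terms, that swap is only a few statements. A minimal sketch, assuming the damaged production database is named WidgetDB and the restored copy WidgetDB_Restore (both names are illustrative):

    -- Force off active sessions so the rename can proceed
    ALTER DATABASE WidgetDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    -- Park the damaged database under a new name; it stays on the server for back-out
    ALTER DATABASE WidgetDB MODIFY NAME = WidgetDB_org;
    -- Promote the restored copy to the production name so applications reconnect to it
    ALTER DATABASE WidgetDB_Restore MODIFY NAME = WidgetDB;
    -- Reopen the parked copy for later comparison
    ALTER DATABASE WidgetDB_org SET MULTI_USER;

The back-out is the mirror image: rename both databases back to where they started.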

5:30 PM, the lead developer calls. I have reviewed the data. I explain the plan and send the business analyst off to retest, and the developer as well. The lead developer approved the plan to correct the issue, which I had already carried out. While we talked over the plan, the business analyst chatted “all up, all good, thanks,” and the developer sent likewise.

5:40 PM, I got a text from the deputy director thanking the team for quick action.

5:45 PM, the lead developer and I disconnected. I spent the next 30 minutes documenting what, why, and when for our after-action report.

Process improvement and stall points

  1. Intimate knowledge of every system is impossible.
  2. Waiting for permission to proceed cannot delay action.
  3. Blame has no place in this type of action.
  4. Everyone needs to stay calm.
  5. All plans need backup and verification processes.
  6. Watch out for red herrings. The email on phishing attacks was an excellent cautionary step but required no action here.

I also got extremely lucky: I was not scheduled to coach swimming that day, so I did not have to run across town at 4:00 PM.

What is your plan when emergencies hit? I have a document I use for this. It’s a simple table on a sheet with the following columns: time, action, back-out plan, and expected result.
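As an illustration (the time and wording here are made up for this example), one row from this incident might look like:

    Time      Action                                      Back out plan                 Result expected
    4:45 PM   Swap restored copy in for the damaged DB    Rename both databases back    Applications load data again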

The key here is to have a plan and make sure you use it. Then, what is your emergency action report or post-mortem process? Just have one of each! Need help planning your emergency action plan (EAP)?
