What Happened?

Web servers and other public-facing servers are always under attack. Our systems deflect and mitigate thousands of unauthorised access attempts every day, from direct login attempts to denial-of-service attacks in which volumes of junk traffic are thrown at our servers.

It is impossible for a service to be both fully accessible to its users and fully locked down against attackers. We are always trying to tread that line, and this time we failed to get it right.

We experienced intermittent connection issues over the weekend, but some access was still available.

All connectivity and service were lost in the early hours of Monday morning when the full attack began.

An attempted ransomware attack on the main server encrypted some files before it was stopped by our anti-malware processes. This was an automated, scripted bot attack. There was no unauthorised access to customer data or to backup data, which is encrypted and stored separately. There was no evidence of any data exfiltration: no access to data stores, virtual servers, or external storage, no increase in bandwidth consumption, and no FTP/SFTP access.

To be clear, no data was viewed, accessed, or removed during this attack. 

As we had complete backups from shortly before the attack started, we decided the most secure response was to wipe the main server, removing any risk of lingering malware, trojans, or backdoors.

As per our standard operating procedure for such an event, the server's hard drives and RAID arrays were wiped and reinitialised, and reinstallation of the base operating system began.

Although the initial installation steps were successful, we hit further problems when the new operating system would not load. We troubleshot the server for potential issues or changes that were preventing the new operating system from booting.

Working with our hosting providers, we established that the attack had also resulted in a hardware fault with the RAID controller. This additional issue meant our existing server hardware was no longer viable, so provisioning of a new server was initiated.

The new server was brought online and installation of the operating system began. The first attempt failed because Microsoft's licensing servers were unavailable, but the installation and configuration were subsequently completed.

By Monday evening at 19:30, the new server was responsive and the restoration process could begin.

Our established processes did kick in, but we lost approximately 12 hours of user data when rolling back. This episode has provided a valuable learning experience, and we have started a more in-depth review of our response, looking at our successes and failures and how this might inform any future response.

We have been proud of our performance to date when it comes to cyber security, with this being our first-ever full-day outage in our twelve-year history, but, as always, there are lessons we can learn and things we can do better.

Below is the detailed timeline of events, actions taken, and lessons learned.

Timeline

07:00 Initial investigations of the affected server begin

08:00 Attack was identified and its severity assessed

08:30 Total loss of our main server was suspected

09:00 Decision was made to wipe the main server and restore from earlier backups

10:30 The affected server was wiped and the new RAID array of hard drives initialised

10:50 RAID initialisation completed and new OS installation started

11:50 New OS installation fails to start correctly

12:00 Second attempt to install new OS

12:40 Second attempt also fails

13:00 Troubleshooting starts on the hardware to try to get Windows to boot

14:30 New server is provisioned

15:40 New server built and brought online

15:40 New OS installation started on new hardware

16:15 New OS installation unsuccessful due to Microsoft licensing server unavailability

18:00 New OS configuration complete

18:30 Required software and utilities installed

19:30 Begin to restore backups

20:20 Backups restored

20:45 Applications start to be restored and access gained

22:30 All services responding normally

 

Review of the Disaster Recovery Procedure

Our full recovery plans can be viewed on our website:

https://teamkinetic.co.uk/policies/Contingency%20and%20Continuity%20Planning%20Policy

https://teamkinetic.co.uk/policies/Data%20Asset%20Protection%20and%20Resilience

In summary, we failed to meet our recovery time objective (RTO) of 2 hours because of the continued knock-on effects of hardware issues. The actual recovery process of downloading, extracting, and installing the most recent backups took close to 2 hours once the hardware and operating system platform were stable.

Incident Reporting, Communication, and Support

Once the outage began affecting our customers, we started sending out regular updates to keep them informed.

These were sent via email, as all internal messaging systems were affected. We also posted status updates on our Facebook page and in our volunteer manager groups.

We had multiple members of staff available on the phone all day to take calls and requests for support, and we believe we did a satisfactory job of keeping people up to date.

This was a major and long-lasting outage and all our affected customers are entitled to a month’s service credit that is redeemable at the next invoicing period. We know this doesn’t make up for lost time and the frustration of not having access to your applications.

What Did We Learn?

Our server health monitoring and notification system failed; it was not able to cope with the specifics of this attack, which left our network-accessible servers and systems alive but not working correctly.

Our response times during the weekend exacerbated the monitoring issues.

Our hypervisor server is our most critical single point of failure.

It takes longer to download and extract backups than it used to because they are considerably larger, so our RTO needs to be updated.

Our transactional database backups (which fill in the gaps between full backups) need to be available from off-site storage to further limit data loss in total-failure events like this.

Our hardware provider was too slow to respond and made mistakes in provisioning, compounded by lower staff numbers over the weekend, shift changes, and a lack of communication between those shifts.

Almost 70% of the time taken to restore services was spent waiting for our hardware provider to carry out their responsibilities.

Our customers are incredibly understanding and supportive, thank you!

Mitigations and Improvements

Add more sensitive monitoring, including positive monitoring that actively tells us things are OK, not just negative monitoring that reports failures.
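
To illustrate the difference, the sketch below shows a minimal positive health check in Python: it only treats the service as healthy when it receives an explicit OK, and alerts otherwise, so a server that is reachable but not working correctly still triggers an alert. The endpoint URL, check interval, and alert hook are hypothetical placeholders; the real implementation will sit inside our existing monitoring tooling.

```python
# Minimal sketch of a positive (heartbeat) health check.
# The URL, interval, and alert hook are illustrative placeholders.
import time
import urllib.request
import urllib.error

HEALTH_URL = "https://example.teamkinetic.co.uk/health"  # hypothetical endpoint
CHECK_INTERVAL_SECONDS = 60
TIMEOUT_SECONDS = 10

def send_alert(message: str) -> None:
    # Placeholder: in practice this would page the on-call engineer
    # (email, SMS, or a monitoring-service webhook).
    print(f"ALERT: {message}")

def check_once() -> None:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as response:
            body = response.read().decode("utf-8", errors="replace").strip()
            # Positive monitoring: require an explicit "OK", not just any response.
            # A server that answers but is unhealthy should still raise an alert.
            if response.status != 200 or body != "OK":
                send_alert(f"Health check returned {response.status}: {body!r}")
    except (urllib.error.URLError, TimeoutError) as exc:
        send_alert(f"Health check failed: {exc}")

if __name__ == "__main__":
    while True:
        check_once()
        time.sleep(CHECK_INTERVAL_SECONDS)
```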

Mandate two-factor authentication for UAC as well as for login.

Switch to a new hardware provider with better response times and procedures for dealing with issues.

Recalculate our RTO, bearing in mind the increased size of our systems and backups.
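
To make that recalculation concrete, here is a back-of-the-envelope sketch of how a new RTO could be estimated from backup size, download bandwidth, and restore times. All figures in the example call are hypothetical placeholders, not our actual backup size or bandwidth.

```python
# Back-of-the-envelope RTO estimate. All inputs are hypothetical placeholders.
def estimate_rto_hours(
    backup_size_gb: float,          # total size of the backups to download
    download_mbps: float,           # sustained download bandwidth, megabits/second
    extract_gb_per_hour: float,     # extraction throughput
    restore_hours: float,           # time to restore databases and applications
    platform_rebuild_hours: float,  # time to get hardware and OS stable
) -> float:
    """Estimated recovery time: rebuild platform, download, extract, restore."""
    download_hours = (backup_size_gb * 8 * 1000) / (download_mbps * 3600)
    extract_hours = backup_size_gb / extract_gb_per_hour
    return platform_rebuild_hours + download_hours + extract_hours + restore_hours

# Example with placeholder figures only (not our real numbers):
print(round(estimate_rto_hours(500, 400, 500, 1.0, 1.5), 1))
```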

Move transactional data logs to temporary off-site storage at regular intervals throughout each 24-hour period, and retain these logs for 48 hours.
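
As a rough sketch of what this could look like, the snippet below copies new transaction log backup files to an off-site location and prunes copies older than 48 hours; it would be run on a schedule several times a day. The paths and the .trn file extension are assumptions for illustration, not our production configuration.

```python
# Rough sketch: ship transaction log backups off-site and keep 48 hours of copies.
# Paths, file extension, and retention are illustrative assumptions.
import shutil
import time
from pathlib import Path

LOCAL_LOG_DIR = Path(r"D:\Backups\TransactionLogs")     # hypothetical local path
OFFSITE_DIR = Path(r"\\offsite-store\TransactionLogs")  # hypothetical off-site share
RETENTION_SECONDS = 48 * 60 * 60                        # keep 48 hours of logs

def ship_new_logs() -> None:
    """Copy any log backup not yet present off-site."""
    OFFSITE_DIR.mkdir(parents=True, exist_ok=True)
    for log_file in LOCAL_LOG_DIR.glob("*.trn"):
        destination = OFFSITE_DIR / log_file.name
        if not destination.exists():
            shutil.copy2(log_file, destination)

def prune_old_logs() -> None:
    """Delete off-site copies older than the retention window."""
    cutoff = time.time() - RETENTION_SECONDS
    for log_file in OFFSITE_DIR.glob("*.trn"):
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()

if __name__ == "__main__":
    # Intended to be run on a schedule (e.g. every few hours) by a task scheduler.
    ship_new_logs()
    prune_old_logs()
```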

Provision a duplicate server for quicker reinstatement. If we suffer a total failure or loss of the main server, we can roll back to the most recent backup within the RTO period.

Look at taking a complete backup of the virtual servers twice daily. We would need to test the impact of backups on server performance during regular access hours (currently the backup is performed at our quietest time). This would halve our potential data loss in the case of a complete failure.

Follow Up

An attack of this type can cause anxiety for our users, and it is important to us here at TeamKinetic that you feel confident in our response to this incident and trust that we have taken away the important lessons from this experience.

If you would like to speak to a member of the team, we would be only too happy to spend some time answering your questions. Feel free to use this link to arrange a call with the team.

You can also subscribe to service status updates here.