Jun 11

Author : Ingmar Verheij


Recently I had to troubleshoot a SQL server that ran nightly batch jobs for a management information system. Under normal conditions the batch took 6.5 hours, but this suddenly increased to 11.5 hours. An increase of roughly 77%!
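The relative increase can be checked with a few lines of arithmetic. This is only a small sketch; the function name `relative_increase` is made up for illustration, and the two durations are the ones quoted above.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the growth from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100.0


normal_hours = 6.5   # batch duration under normal conditions (from the post)
slow_hours = 11.5    # batch duration after the slowdown (from the post)

print(f"{relative_increase(normal_hours, slow_hours):.1f}% longer")  # prints "76.9% longer"
```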

Because of this delay the information wasn’t presented on time, which had a lot of implications. Several departments were asked what had changed in the past days; of course the answer was “nothing”.

Point in time

The delay first appeared on the 6th of June, when the job duration increased from about 400 to more than 600 minutes:

[Figure: Job duration, before]
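Pinpointing the day a job started to slow down can also be done programmatically when the durations are available as data. The following is a minimal sketch, not the method used in this investigation: `first_regression`, its parameters and the sample values in `history` are all hypothetical (only the 6th of June and the rough 400 and 600+ minute levels come from the post).

```python
from datetime import date
from statistics import median


def first_regression(durations, baseline_days=3, threshold=1.25):
    """Find the first run that is clearly slower than the baseline.

    durations: list of (date, minutes) tuples, sorted by date.
    The baseline is the median of the first `baseline_days` runs; a run
    counts as a regression when it exceeds baseline * threshold.
    Returns (day, baseline), with day=None when nothing exceeds the limit.
    """
    baseline = median(minutes for _, minutes in durations[:baseline_days])
    for day, minutes in durations[baseline_days:]:
        if minutes > baseline * threshold:
            return day, baseline
    return None, baseline


# Hypothetical sample values, chosen to resemble the chart above.
history = [
    (date(2012, 6, 2), 395),
    (date(2012, 6, 3), 402),
    (date(2012, 6, 4), 398),
    (date(2012, 6, 5), 410),
    (date(2012, 6, 6), 615),
    (date(2012, 6, 7), 640),
]

day, baseline = first_regression(history)
print(f"baseline {baseline} min, first slow run on {day}")
```

Using the median of a few earlier runs as the baseline keeps a single unusually slow night from shifting the threshold.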

VMware vSphere Client

Performance

The performance metrics of the virtual machines showed a decrease in both processor and disk performance while the network was hardly affected.

This is unexpected, since the content of the batch job was unchanged and the same applies to the infrastructure. No (major) changes were made that would justify the decrease in performance.

[Figures: APP178 performance charts for CPU, disk and network]

Storage

There was a sudden increase (of ~600 GB) in allocated disk space, with a substantial amount for snapshots. Aha!

[Figure: APP178 storage, before]

Snapshot

Unless a change is being performed (and a rollback might be required), no snapshot should be present. However, there was a snapshot called “Consolidate Helper- 0”.

[Figure: Snapshots for APP178, 2012-06-08]
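A forgotten snapshot can also be spotted from a datastore file listing. The sketch below assumes the usual naming of snapshot delta disks on a datastore (a six-digit sequence number such as `-000001` in the `.vmdk` name, optionally followed by `-delta` or `-sesparse`); verify this against your own environment. The helper `snapshot_usage`, the file names and the sizes are hypothetical.

```python
import re

# Assumed naming pattern of snapshot delta disks, e.g.
# "vm-000001.vmdk", "vm-000001-delta.vmdk" or "vm-000001-sesparse.vmdk".
SNAPSHOT_DISK = re.compile(r"-\d{6}(-delta|-sesparse)?\.vmdk$", re.IGNORECASE)


def snapshot_usage(files):
    """Return (names, total_bytes) of the files that look like snapshot disks.

    files: iterable of (file_name, size_in_bytes) tuples.
    """
    hits = [(name, size) for name, size in files if SNAPSHOT_DISK.search(name)]
    return [name for name, _ in hits], sum(size for _, size in hits)


GB = 1024 ** 3

# Hypothetical listing of a virtual machine folder.
listing = [
    ("APP178.vmdk", 1024),
    ("APP178-flat.vmdk", 200 * GB),
    ("APP178-000001.vmdk", 1024),
    ("APP178-000001-delta.vmdk", 600 * GB),
]

names, total = snapshot_usage(listing)
print(names, f"{total / GB:.0f} GB in snapshot disks")
```

Running a check like this on a schedule would flag a leftover snapshot before it grows to hundreds of gigabytes.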

This snapshot was left behind after a failed Veeam backup (as described by Jim Jones in this article).

Veeam Backup & Replication

To verify that the snapshot was indeed a leftover of a failed backup, I checked the backup log. And indeed, after a successful backup on the 4th, the backup of the 5th of June ended with a warning.

The backup on the 6th of June could not be completed at all and ended with an error:

[Screenshots: Veeam Backup & Replication job results]

Result

After removing the snapshot the storage space was reclaimed:

[Figure: APP178 storage, after]

The time required to perform the batch job was also back to normal:

[Figure: Job duration, after]

Moral of the story

Be careful with snapshots of virtual machines. The impact on performance can be dramatic, and it can take quite a while to find the cause if you’re unaware of this.


Ingmar Verheij

Ingmar Verheij works for PepperByte as a Senior Consultant. His work consists of designing, migrating and troubleshooting Microsoft and Citrix infrastructures. He works with technologies like Microsoft RDS, user environment management and (performance) monitoring. Ingmar is in the steering group of the Dutch Citrix User Group (DuCUG). RES Software named Ingmar a RES Software Valued Professional in 2014.
