06/06/2012: Internal DNS resolution failure

The problem persisted until 07/06/2012

Firstly, these problems were not related to the LINX failures of the previous week, which are explained in a separate article.

The core problem was that DNS (Domain Name System) was not replicating properly between all servers on the platform. Had name resolution failed outright, the problem would have been much easier to identify. Instead, NetBIOS (a legacy system) took over name resolution from DNS, which made it seem as though DNS was functioning properly.
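Because a fallback can hide a replication fault, one way to spot this kind of problem is to compare what each DNS server returns for the same name, rather than checking only whether the name resolves somewhere. The following is a minimal sketch in Python, assuming the per-server answers have already been collected by some other means. The function name, server names, hostnames and addresses are all hypothetical and are not taken from the platform.

```python
def find_replication_gaps(answers_by_server):
    """Return {hostname: {server: addresses}} for names the servers disagree on.

    answers_by_server maps a DNS server name to {hostname: set of addresses}.
    A hostname that a server does not know counts as an empty answer.
    """
    # Collect every hostname that any server knows about.
    hostnames = set()
    for records in answers_by_server.values():
        hostnames.update(records)

    gaps = {}
    for hostname in sorted(hostnames):
        per_server = {
            server: frozenset(records.get(hostname, ()))
            for server, records in answers_by_server.items()
        }
        # More than one distinct answer means the servers are out of step.
        if len(set(per_server.values())) > 1:
            gaps[hostname] = per_server
    return gaps


# Hypothetical data: "fileserver" was never replicated to dns-b.
answers = {
    "dns-a": {"fileserver": {"10.0.0.5"}, "dc1": {"10.0.0.1"}},
    "dns-b": {"dc1": {"10.0.0.1"}},
}
gaps = find_replication_gaps(answers)
for hostname, per_server in gaps.items():
    print(hostname, {server: sorted(addrs) for server, addrs in per_server.items()})
```

A check like this reports the disagreement even when a client can still reach the host by another route, which is the situation described above.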

A secondary problem was that the primary domain controller (DC1), used for authentication, permissions and policies, was not replicating with the secondary domain controller (DC2). (N.B. We have at least two servers for each of our front-line services to help us provide a continuous service.) When a user logs in, either of these servers should be able to authenticate the user, but the DNS failure caused the first attempt to time out before the other server was tried. As a result, users experienced a slow log-in process.
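The delay described above can be illustrated with a small simulation. This is only a sketch, assuming a client that tries the controllers one after another and waits for a full timeout before moving on. The 15-second timeout, the 0.5-second successful log-in and the function names are hypothetical values chosen for illustration, not measurements from the platform.

```python
def login_with_failover(controllers, try_login, timeout_s):
    """Try each controller in turn; return (controller_used, seconds_waited)."""
    waited = 0.0
    for controller in controllers:
        ok, elapsed = try_login(controller, timeout_s)
        waited += elapsed
        if ok:
            return controller, waited
    # No controller answered within its timeout.
    return None, waited


def simulated_try_login(controller, timeout_s):
    """Hypothetical stand-in for a real authentication attempt."""
    if controller == "DC1":
        # DC1 cannot be located, so the attempt uses up the whole timeout.
        return False, float(timeout_s)
    # DC2 answers promptly.
    return True, 0.5


result = login_with_failover(["DC1", "DC2"], simulated_try_login, timeout_s=15)
print(result)
```

In this simulation the user is still logged in by DC2, but only after the wait for DC1 has been added to the normal log-in time.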

The problems continued during the user desktop session, as the mapped drives were also taking a long time to connect and the usually continuous connection to the domain controllers was unstable.

To resolve the problem, we repaired the errors in the DNS table, which was then replicated throughout the platform network. This process is now complete and the platform is stable.

Subsequent to and separate from this, on Friday morning there was a problem with the Net2Printer Service, which caused spoolsv.exe (a printer process) to monopolise server resources. This was identified and resolved quite quickly, but the residual effect on logged-in accounts was frustrating for many.

On a positive note, we have identified and repaired the root cause of the slowness that users had been experiencing for a while. Work is still ongoing to make shared resources respond more swiftly, so the platform should improve still further.

We understand the frustration that users will have experienced and genuinely apologise for the inconvenience and problems this will have caused.
