It's only now, given that Microsoft have just (well, three days ago, actually, but I've only just noticed) published a particular
Knowledge Base article, that I feel able to make this post. I make no apologies for the fact that this post is quite lengthy and involved; it documents what was involved in diagnosing a perplexing issue in our analytics product; narrowing it down to what looked like a new and undocumented bug in Microsoft's Analysis Services product; working with Microsoft to prove the bug to them; evaluating the hotfix which they subsequently produced; through to deploying the solution in our application. Personally, I am very impressed with the speed and professionalism with which Microsoft responded, but there are some key tactics to employ when trying to convince them (or any other software vendor) that a bug exists in their product, for which they should produce a fix. Hopefully, these tactics might benefit others.
Fortunately (given the fragmented nature of my memory at the moment!) I have a comprehensive email trail, so the chronology which follows is accurate. However, here's a little background to put things in context. Exony Reporting is designed allow reporting and analysis of operations inside telephony call centres. It imports data (huge volumes of data) from various source systems, consolidates, summarises and packages that data into a form (real-time and historical) that can be used by a wide range of interested and authorised users; and it needs to do so quickly, securely and reliably. The architecture would look familiar to any Data Warehousing professional, i.e. we use a relational database to receive and pre-process the incoming data, an OLAP engine to summarise the data into a form that can be easily used, an application server to contain the business logic, a web server to provide a user interface to browser-based clients, together with inter-site replication to provide resilience to site failures and improved local performance. We've used package software (provided by Microsoft and others) and written substantial integration code to unify them. The key point here is that with a component solution like this, an issue which appears in one area may be a symptom of a problem somewhere else; finding exactly *where* that "somewhere else" is is not always a trivial undertaking!
In early July of this year, I was asked by my support colleagues to look at a problem which was periodically occurring. After completing a new installation of our product, typically a backlog of source data would be imported, processed and summarised, administrators would configure users who would be allowed access to the system, and then those users would begin to use the system, with new data being continually added. At some point, maybe hours, days or weeks after going live, CPU utilisation would suddenly rise to 100% in our application, and not return to normal levels. The application would require a forced restart, but at some point shortly thereafter, the same rise to maximum would occur. This was a little perturbing...
The problem was being shown in our application; what was it in our application code that might be causing this? I couldn't see any obvious pattern; ordinarily, the system could support many users, but once the problem occurred, even a single user could send the CPU usage to maximum. And once it had occurred, then it would recur rapidly. I hadn't seen behaviour like this in testing, but some of my colleagues had; apparently, reprocessing their Analysis Services database from the imported data made the issue go away. Which was the first clue.
High on my list of priorities was finding a single, repeatable series of actions which would reproduce the problem; for me, there's only one thing worse than a system problem, and that's an inconsistent, intermittent and non-reproducible system problem! However, I had made some progress; while I still couldn't identify what caused the first CPU spike, once it had happened, I could easily reproduce it. Once it had occurred, then attempting to access some of the Analysis Services cubes caused the CPU spike in whichever application was attempting access. For example, trying to use Analysis Manager to edit the cube would send Analysis Manager to 100% from which only a hard reset of Analysis Manager was possible; using the MDX Sample Application to issue an MDX query against a cube would have the same effect on the MDX Sample Application; using Query Analyzer to issue an MDX query by means of a pass-through query to Analysis Services would send Query Analyzer to a non-recoverable 100% CPU spike. So, I started working on the hypothesis that once the first CPU spike occurred in our application, then subsequent spikes were not restricted to our application; *any* application attempting to access part of the Analysis Services database would spike. In other words, the subsequent spikes were not the *fault* of our application, but the *symptom* of a fault that lay elsewhere.
I began to suspect corruption in the Analysis Services cubes, particularly since completely reprocessing the cubes (as my colleagues had done) meant a return to stability, at least for a while. This was bad news; the sheer volumes of data we import, together with the need to make the processed data available to users in a timely fashion meant that our architecture used incremental updating and partitioned cubes. The fact that reprocessing the cubes made the corruption go away (temporarily) wasn't really a solution, on two counts; firstly, the time taken to reprocess the cubes (each time corruption occurred) could be substantial; secondly, the staged data might no longer be available in the relational database, or in the source systems. If this problem was caused by corruption of the cubes, we needed to find and fix what was causing the corruption.
The breakthrough came when I was able to examine a clean system on which the initial CPU spike had occurred. Because of the array of software components involved, diagnostic and error logging tools at my disposal were varied: Event Viewer system and application logs; SQL Server error logs; Analysis Services processing logs; bespoke logging in our application; Performance Monitor logging; SQL Profiler traces... as anyone who has tried to diagnose system problems on a distributed server application will know, turning on all available logging will overwhelm you, unless you know what you're looking for, and roughly when it occurred. Having enabled comprehensive logging, the first spike revealed the key; once it occurred, in the Event Viewer application log I could see that the MSSQLServerOlapService (i.e. Analysis Services) had unexpectedly restarted shortly prior to the spike. Because of that, Analysis Services had produced a diagnostic memory dump. I then looked in the processing logs for Analysis Services and some custom code which we use to control Analysis Services processing. Sure enough, immediately prior to the restart and memory dump, Analysis Services threw an error while incrementally updating a dimension. After it had restarted, *any* attempt to access the dimension itself, or *any* cube which used that dimension, would cause the CPU spike in the application concerned; but *not* in Analysis Services itself.
So now the focus changed again. What was happening that caused the dimension corruption? Was our application doing something that it shouldn't... or was there a bug in Analysis Services that was causing it to corrupt dimensions?
I was getting closer to identifying a consistent series of reproduction steps, but at this point I thought it was worthwhile to check the Microsoft Knowledge Base to see if anything relevant was listed there. The article "FIX: The Analysis server computer stops responding when you process dimensions and virtual dimensions" looked promising, so on 14 July I contacted Microsoft PSS by telephone to open a Support Incident and ask for the hotfix referred to in that article. After an initial telephone consultation, so that the PSS engineer could determine if the problem we were experiencing was similar to that which the hotfix was designed to resolved, I was emailed a link from which I could download the hotfix, and commenced testing.
Sadly, the hotfix did not solve the problem. Over the next few days, I was able to reproduce the initial CPU spike three or four times; each time, it was preceded by a dimension processing error. Each time, fully reprocessing the dimension (which necessitated fully reprocessing all the cubes which used that dimension) solved the problem; but only for a while. With multiple occurrences, however, I now had more information to go on. The initial dimension processing error was not always the same dimension; but they were always parent-child dimensions. It was never on the initial processing of the dimension, but always on a subsequent incremental processing. Each dimension that failed had been successfully incrementally processed before subsequently failing.
So now, I needed to determine if it was just parent-child dimensions that were at risk, or was the problem potentially applicable to *all* changing dimensions (all parent-child dimensions are changing, but not all changing dimensions are parent-child) or in the worst case, were *all* dimensions at risk? In any event, having verified that nothing our application was doing was breaking the rules of what you can do with changing dimensions, I opened a new support incident with Microsoft PSS. (Technically, support incidents are chargeable, and must be paid for when the case is opened; however, if the incident is a request for a hotfix which is already available, there is no charge, nor is there a charge if the cause is found to be a fault with the Microsoft software.) As an MSDN Universal subscriber, I also had unused support incidents at my disposal, so on 21 July I opened the new case.
I was dealing with the same Support Engineer who had responded to the initial hotfix request (and with whom I had dealt the previous year on some replication issues); he had already impressed me with his technical abilities and resourcefulness. I explained the stage of diagnostics I had reached, and that Build 931 (the hotfix) had not helped; he knew that there were later builds of Analysis Services than the one I had tried, so he undertook to see if one could be made available to me. On 23 July, I received Build 950. Over the weekend, I tested this build, and the dimension corruption was still occurring. I was now convinced that the fault lay with a bug in the incremental processing of dimensions; I also knew that I would have to prove that the fault was not caused by our application if Microsoft were to be convinced that they should provide a fix. So, I worked on providing a reproduction scenario based only on SQL tables, simple cubes, and a series of updates to the SQL data (representing legitimate updates to the members of a changing dimension) with a series of incremental updates of the associated Analysis Services dimension. Only Microsoft utilities (Query Analyzer and Analysis Manager) were used. Having found one that was simple, consistent, and 100% guaranteed to produce the error, I emailed the details to the Support Engineer on the morning of 26 July.
Within a couple of hours, he emailed me to say that he, too, could consistently reproduce the problem on the more recent Build 960. He asked if I could provide a full description of the operational impact this issue was having on us and our customers (in case he needed to justify the effort involved in producing a code fix) which I promptly sent to him.
On 27 July, I received an email to say that the problem was being escalated within Microsoft; particularly since the issue wasn't just related to a lack of usability, but actual corruption of data leading to substantial downtime while the corrupted objects were reprocessed.
On 28 July, I was emailed to say that the issue would be tested that day on the latest build, Build 964. The issue persisted.
On 29 July, I was contacted by a new Support Engineer, who explained the five-milestone process which would be used to escalate this case. The milestones were: production of a problem report; request for a QFE (Quick Fix Engineering) hotfix; acceptance of the problem by the development team; production of a hotfix for internal testing; release of the hotfix to us for acceptance testing. It was also made clear that, even if the problem was accepted as a bug, there was no guarantee that a fix would be produced, or a commitment to when any fix might be produced.
On 30 July, I was asked to provide some of the memory dump files which occurred when Analysis Services restarted after the initial processing error, which I did. Later that day, I was informed that the case had now passed the first milestone, i.e. a problem report had been raised, and passed to a Critical Problem Resolution Engineer; by whom I was contacted later that day.
On 2 August, I was informed that a request for a QFE hotfix had been raised. This was both good news, and bad. Good, in the sense that Microsoft PSS accepted that there was a serious issue in their product, and therefore a code fix might be forthcoming at some point. Bad, because it also meant that there was no workaround which could be suggested by Microsoft or deployed by ourselves, until and unless a code fix was released.
Even while I was heavily involved in working with Microsoft to raise the profile of this issue, I was working on a mitigation strategy which we might have to adopt, until and unless we received a suitable code fix. On that day, we removed all parent-child dimensions (which were still the only dimensions which had been affected) from our application, and released a hotfix to our customers which would do the same on their live systems. The downside of this action was that the affected cubes would need to be fully reprocessed, and there would be some loss of analytical flexibility; the upside was that the threat of initial corruption and the performance issues which flowed from it would be removed. Sometimes, you need to make an unpleasant decision to prevent an even more unpleasant outcome.
Things went quiet for a week or so, until I was contacted by a new Engineer on 11 August, who informed me that milestone three had now been reached, i.e. the product development team had accepted that a product issue existed which required a code fix. Even now, though, there was still no guarantee that a code fix would be produced.
By 18 August, the development team had isolated the specific conditions which would cause the dimension corruption, and were working on a solution. On 20 August, I was informed that they had produced a code fix, which had been sent for internal testing.
On 24 August, a hotfix was made available to us; my preliminary testing confirmed that it worked, in the context of the reproduction scenarios I had identified. More extensive testing over the following days reinforced the likelihood that he issue had been fixed, and that we could deploy the fix and consider the reintroduction of the parent-child dimensions into our application.
Finally, on 27 October, the Knowledge Base article referred to at beginning of this post, and documenting the hotfix, was published. Maybe it's just me, but even though there is no name-check for me or my employers in the article, it still feels like *my* hotfix! :-)
On a more serious note, this incident probably consumed a month's effort on my part. From the moment the second support incident was raised with Microsoft, however, the baton had effectively been passed, and much less day-to-day involvement was needed from me, until the hotfix was made available. The baton could only *be* passed, however, because of all the work that was necessary to isolate the problem to a specific application, and prove that the problem was occurring because of an inherent issue in that application, and not in the code or logic inherent in our application. I think it's worth stressing that Microsoft really only became formally involved on 21 July; and five weeks later, on 24 August, a tested code fix to a major product which was four years old was made available to us. Personally, I think that is quite an impressive response.
So, what did I learn from this episode; indeed, what transferable lessons are there that I (and hopefully others) can apply in the future, both when diagnosing issues in complicated distributed architectures, and specifically when seeking assistance from a supplier like Microsoft? On the first count, I think the key points are:
1) Use all possible sources (including Event Viewer logs, Performance Monitor logs, Analysis Services processing logs, SQL Profiler traces, SQL Server error logs, own application logging, memory dumps) to identify what might be causing the problem. Particularly, look for abnormal events which coincide with other unusual occurrences;
2) Keep in mind that the symptom may not be the cause - the initial problem appeared to be runaway CPU usage in our application; it was some time before the real issues began to surface;
3) Develop a quickly repeatable automated test suite (e.g. use batch files, scripts, database backups / restores, VMWare / Virtual Server images with snapshots or repeatable resume) when trying to isolate the specific steps which consistently reproduce the problem;
4) Develop automated test cases for all possible interactions, between users and your application, even if you don't expect they will occur.
As for making it easier for a supplier like Microsoft to help you (which also means making it beyond question that the supplier has a responsibility to help you):
1) Isolate the problem to the suppliers product *only* (in this case, SQL tables, SQL queries and cube operations using Microsoft's own applications) i.e. prove that your application and logic is not a contributor to the issue;
2) Stress the impact of the problem on your customers; be reasonable, but be explicit;
3) Accept that a code fix may not occur (especially if manual or procedural workarounds can be identified); develop mitigation strategies of your own, in case a code fix is not produced, or as an interim measure until one is produced;
4) Wherever possible, communicate by email; when telephone conversations have taken place, make sure the content of the conversation is written down (and, out of courtesy, copied to the other party).
After the incident has been resolved (one way or another):
1) Make sure to thank those who have worked on your case; apart from being good manners, who knows when you might need to raise another support incident :-)
2) Give detailed feedback if contacted afterwards; after each support incident I have raised with Microsoft, I've been contacted and asked to give detailed feedback on the incident and how it was handled. No matter how busy I was when asked, I took the time to give that feedback. It is courtesy to give credit where it is due, and that feedback may help to make future support incidents run even more smoothly.
OK, that's it, I've probably broken umpteen rules about the length of blog posts, but if you've read this far, hopefully there's something here that may assist you in the future.
This blog has been migrated to new software on a different server (http://www.multidimensional.me.uk) and comments on this post on *this* blog are now closed. All existing comments have been copied to the equivalent post on the new blog. If you still wish to comment on this post, please use the equivalent post at: http://www.multidimensional.me.uk/
Read more!