Self Diagnostic Bot

SageData is based in Ottawa, Ontario, Canada

Introduction

Sometimes systems fail. The time taken to put a system back into service (MTTR, Mean Time To Recover) is very important to a business, so reducing it matters. This article describes how we are addressing this in our systems.
Our software can be installed in two ways. When it runs on our servers, diagnosis is easy, and MTTR is measured in minutes. When it runs remotely, on client servers, diagnosis is difficult, and MTTR is longer. We wanted to improve this.
Security regulations still require some clients to run their applications on local machines. Data cannot be accessed remotely, and sometimes cannot be exported. So fault-finding is difficult, and MTTR is sometimes measured in days.
Security rules mean that we cannot give our support personnel immediate access to remote data. We wondered whether we could instead build a piece of software, a bot, that would run on client hardware alongside our core software, and mimic the role of support personnel. It would examine a failed system and identify the point of failure, without external control.

Uncertainties were many. How much data would the bot have to collect? What functions would it monitor? At what level of detail? Would it slow down the core software? How would it communicate? Could unskilled client personnel recover and transmit key data without compromising security? And, most important of all, would it be able to find and fix faults?
As with vaccines, the bot had to be proven both effective and free of negative side effects.

The solution

We started by searching the Internet. Keywords were ISD (Integrated Self Diagnostics) and SSD (Software Self Diagnosis). The results showed that others had pursued this, but most assumed a permanent connection and so were not relevant.

The issue, for us and our clients, was time spent on support activities. The many causes of support calls, in approximate order of frequency, include: NFF (No Fault Found); operator error; hardware failure; configuration errors; data-related errors; and logic errors in our code. We could resolve errors quickly when we had access to the data. We needed to determine what information must be captured so that remote systems could self-diagnose simple errors, or pass data back to us for the more complicated problems.

In our systems, data capture is mostly done using handheld computers with integrated barcode readers operating autonomously - no permanent connection. The systems capture information only when each task is completed. For effective diagnosis we needed to, additionally, capture many intermediate steps, which would increase storage and transfer requirements. Some initial processing of the data could be completed on the mobile units, but most data would be passed back to the local server for more detailed analysis.

To diagnose hardware problems, or local data errors, we needed to uniquely identify each of the mobile units. We needed to verify configuration and software versions. We needed to give the front-line user the ability to forward data to us for more complex errors, within the security protocols of the client. We had to avoid degrading the response time or reliability of the existing system.

We began by reviewing all data entry points, to ensure that we were protecting against data-related errors. We then improved the error screens to ensure that messages gave a clear description of the problem and, where possible, a corrective action. We identified each mobile unit, and linked data collected back to the unit that collected it. We developed a self-check process to confirm the build numbers for the software on each mobile unit, and to verify key functions. We provided local reporting to confirm this check was being completed as scheduled.
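
As an illustration only, the build-number part of such a self-check might look like the following minimal sketch. The component names, version strings and the unit identifier format are hypothetical and are not taken from our actual software.

```python
from dataclasses import dataclass

# Hypothetical expected build numbers; real component names and values
# would come from the release being verified.
EXPECTED_BUILDS = {"scanner_app": "4.2.1", "sync_agent": "1.9.0"}


@dataclass
class SelfCheckResult:
    unit_id: str
    mismatches: dict  # component -> (expected build, build found or None)

    @property
    def ok(self) -> bool:
        # The check passes only when no component differs.
        return not self.mismatches


def self_check(unit_id, installed_builds, expected=EXPECTED_BUILDS):
    """Compare the builds installed on one mobile unit with the expected set."""
    mismatches = {}
    for component, wanted in expected.items():
        found = installed_builds.get(component)  # None if not installed
        if found != wanted:
            mismatches[component] = (wanted, found)
    return SelfCheckResult(unit_id, mismatches)


result = self_check("HH-07", {"scanner_app": "4.2.1", "sync_agent": "1.8.3"})
print(result.unit_id, "OK" if result.ok else result.mismatches)
```

Because the result carries the unit identifier, a local report can list which units ran the check and which failed it.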
To more accurately determine the origin of each error, we increased the number of transactions recorded against each task.
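
A rough sketch of the idea follows: each intermediate step is recorded against its task and the unit that captured it, so the last step reached before a failure can be found later. The class, field names and step names are assumptions made for this example, not our production schema.

```python
import json
import time


class StepLog:
    """Collects intermediate steps for the tasks performed on one mobile unit."""

    def __init__(self, unit_id):
        self.unit_id = unit_id
        self.records = []

    def record(self, task_id, step, detail=""):
        # Every record carries the unit id, so data can be traced to its source.
        self.records.append({
            "unit": self.unit_id,
            "task": task_id,
            "seq": len(self.records) + 1,
            "step": step,
            "detail": detail,
            "ts": time.time(),
        })

    def last_step(self, task_id):
        """Return the last step recorded for a task, or None if there is none."""
        steps = [r for r in self.records if r["task"] == task_id]
        return steps[-1]["step"] if steps else None

    def to_lines(self):
        # One JSON object per line, ready to pass back to the local server.
        return [json.dumps(r, sort_keys=True) for r in self.records]


log = StepLog("HH-07")
log.record("T1", "scan_item")
log.record("T1", "enter_quantity")
print(log.last_step("T1"))
```

Recording only completed tasks would show that task T1 never finished; recording the steps shows where it stopped.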

We developed a way to share client data with our internal support personnel. To avoid the risk of degrading existing system performance, we designed this as a standalone module that could be installed and operated without interfering with the normal operation of the existing software. Diagnostic data sets are much larger than the data from conventional operations, but are rarely needed, so we arranged for the transfer to be triggered manually, and only when an incident occurs.

To ensure that data could be transferred in a timely manner, we designed the return process so that it could be executed by frontline operators, with no requirement for detailed IT skills. Key system information and anonymized diagnostic data are passed back to us by repurposing techniques we had developed long ago to save storage space, in which tokens stand in for data. This satisfies the security requirements of most clients. The data transferred is a plain ASCII file, with numbers that mean nothing outside the originating system. We also enabled the generation of an export file that could readily be attached to an email.
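
The token approach can be sketched as below. This is a simplified illustration under our own assumptions (integer tokens, comma-separated fields, invented sample values), not the actual export format.

```python
class Tokenizer:
    """Maps each distinct value to a small integer token.

    The lookup table stays on the originating system; only tokens are exported.
    """

    def __init__(self):
        self._tokens = {}

    def token_for(self, value):
        # A repeated value always gets the same token, so patterns in the
        # data survive even though the values themselves are hidden.
        if value not in self._tokens:
            self._tokens[value] = len(self._tokens) + 1
        return self._tokens[value]

    def export_line(self, fields):
        return ",".join(str(self.token_for(f)) for f in fields)


def export_records(records, tokenizer):
    """Build the plain ASCII export text, one tokenized record per line."""
    return "\n".join(tokenizer.export_line(r) for r in records)


tok = Tokenizer()
print(export_records([("Alice", "Widget", "Dock 3"),
                      ("Bob", "Widget", "Dock 3")], tok))
```

Support personnel can see that two different operators handled the same item at the same location, without learning who or what they were. Only someone with access to the table on the client system can turn the numbers back into names.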

The capture of data at more points in the process lets us quickly verify or eliminate various hypotheses and identify the cause of the errors. For one infrequent random error we successfully identified the sequence of actions causing the error, and also the specific individual causing the issue. In another case we identified a specific data entry resulting in a subsequent error in an apparently unrelated process downstream.

Conclusion

The system was tested in the lab with known problems and worked as expected. We then successfully performed field testing on two live remote systems, which had known elusive problems.

The process of exporting and transferring data from remote sites was field tested and revised several times, incorporating user feedback. Where security constraints permit, we can now quickly access anonymized diagnostic data from remote sites.
Although intended for remote installations, the system also improved MTTR for our local servers, reducing the load on support personnel. It was effective for errors that occur infrequently. Diagnosing these is particularly difficult, as there is often little data, the error is often discovered long after the trigger event, and users can offer little contextual help.

If you found this useful, you might also want to review:

- an introduction to applications

- an introduction to asset management

- an introduction to WMS - warehouse management systems

- an introduction to barcode technologies
