You must have suspected that I am not just philosophizing about building an automation engine, but have already done so (or at least I have designed the concepts and we at arago have built it, and you also know the software architect behind the whole scene: Jens “Cy” Bartsch). So after introducing you to the concepts of automation engines and my ideas on the social impact of automation, I would like to give a short overview of the technology we use to actually have an engine that performs the system administration tasks mostly done manually today. We have built an engine that learns and is instructed by system administrators, increasing its operational abilities every day. We have been working on this engine since 1995 and are currently at major release 4.

As you will understand, I cannot reveal too much technical detail, but I still want to give a short look at the concepts we use. The key input to our engine is an IT infrastructure and application model based on the four-layer M-A-R-S approach described earlier. The nodes of this model are enhanced with “static” data on the node, such as software version, log file location and everything else that can be found on the subject. Of course, different data will appear in different kinds of nodes (obviously a machine node does not need a software version :-)). This model is read into the automation engine and represents a basic graph. All the nodes of the model are connected according to their interdependencies. The real-time event and monitoring information is then attached to the nodes.
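To make the idea of the model graph a little more concrete, here is a minimal sketch in Python. It is not how our engine is implemented; the class names, the node kinds ("machine", "software") and all sample values are assumptions made up purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the model (all names here are hypothetical)."""
    name: str
    kind: str                                    # illustrative node kind
    static: dict = field(default_factory=dict)   # "static" data on the node
    events: list = field(default_factory=list)   # monitoring/event data


class Model:
    """Undirected graph of nodes linked by their interdependencies."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, node):
        self.nodes[node.name] = node
        self.edges.setdefault(node.name, set())

    def connect(self, a, b):
        # An interdependency is stored in both directions.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def attach_event(self, name, event):
        # Real-time information is attached to the node it concerns.
        self.nodes[name].events.append(event)

    def neighbours(self, name):
        return sorted(self.edges[name])


model = Model()
# A machine node carries no software version; a software node does.
model.add_node(Node("host01", "machine", {"cpu_cores": 8}))
model.add_node(Node("webserver", "software",
                    {"version": "2.4", "log_file": "/var/log/web.log"}))
model.connect("webserver", "host01")
model.attach_event("host01", {"metric": "load", "value": 7.5})
```

The point of the sketch is only that static data, interdependencies and live monitoring data all hang off the same nodes, so anything travelling the graph finds them in one place.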

On this basic graph, which represents the actual IT infrastructure the engine is to work on together with all available monitoring and event data, rules connect to the nodes. These rules can be simple threshold rules or complex constructs built from conditions across many IT components. When such a rule matches, an issue object is created. An issue is a sort of “pre-incident” that tells us something may be going astray. An issue object can now travel the graph. This travel is directed by the issue's urge to collect new data in order to match an action rule that will allow the issue to perform an action, either to collect more data or to resolve the issue automatically. While travelling the graph, the issue collects more and more data from the nodes it visits and relates to other issues, which gives it access to different branches of the graph or to additional data. The travel algorithm focuses on maximizing the number of actions available to the issue. Of course, issues can also be injected into the graph directly, for example by reporting an incident.
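The following toy example sketches the life of an issue as described above: a threshold rule creates it, and it then travels the graph collecting data until an action rule matches. The real travel algorithm is not disclosed here; the greedy "visit the node that would unlock the most actions" strategy, the graph, the rule names and all values are assumptions chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field

# Hypothetical toy graph: node name -> (data on the node, neighbouring nodes)
GRAPH = {
    "webserver": ({"response_ms": 4200}, ["host01", "database"]),
    "host01": ({"load": 0.4}, ["webserver"]),
    "database": ({"connections": 500, "max_connections": 500}, ["webserver"]),
}


@dataclass
class Issue:
    origin: str
    data: dict = field(default_factory=dict)     # data collected so far
    visited: list = field(default_factory=list)  # nodes visited, in order


def threshold_rule(node_data):
    """Simple threshold rule: slow responses create an issue."""
    return node_data.get("response_ms", 0) > 1000


# Hypothetical action rules: action name -> condition on collected data.
ACTION_RULES = {
    "restart_host": lambda d: d.get("load", 0) > 10,
    "raise_connection_limit": lambda d: (
        "connections" in d and d["connections"] >= d["max_connections"]
    ),
}


def matching_actions(data):
    return sorted(name for name, cond in ACTION_RULES.items() if cond(data))


def travel(issue):
    """Greedy travel: always visit the reachable node whose data would
    make the most action rules match, and stop once any rule matches."""
    frontier = [issue.origin]
    while frontier:
        best = max(
            frontier,
            key=lambda n: len(matching_actions({**issue.data, **GRAPH[n][0]})),
        )
        frontier.remove(best)
        issue.visited.append(best)
        issue.data.update(GRAPH[best][0])
        actions = matching_actions(issue.data)
        if actions:
            return actions
        for neighbour in GRAPH[best][1]:
            if neighbour not in issue.visited and neighbour not in frontier:
                frontier.append(neighbour)
    return []


# Rules matching on a node create the issues.
issues = [Issue(name) for name, (data, _) in GRAPH.items() if threshold_rule(data)]
issue = issues[0]
actions = travel(issue)
```

In this toy setup the issue starts at the slow web server, skips the unremarkable host and ends up at the database, where the collected data finally matches an action rule.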

Compared to the top-down rule evaluation or aggregation used by so-called root cause analysis systems, an issue in our automation engine can circle in on a problem, testing the functionality it is looking for from different angles of the IT infrastructure. It thus finds the spot of the actual problem not by drill-down but by a divide-and-conquer approach, like a good system administrator would. Also, by relating issues, problems that are spread across the infrastructure and would not normally be found by system management software or a specialized administrator can be identified as such and then solved by the same mechanism. The automatic relation of issues, whenever their combined data opens up new automation actions, also creates many implicit rules: someone creating rules does not necessarily have to know all the actions that have to be taken throughout the infrastructure, because the automation engine will find all the connected actions by itself. A good example is the fact that many dependent systems require restarts after a centralized component or service broker has been changed. The people writing the rules surrounding the change of the central service broker do not know anything about the other components, and the people maintaining the dependent components do not know anything about the change processes of the central service broker. This is not a problem for the graph approach we have chosen, because the issues created by the change at the central service broker and the issues of the dependent components meet on the IT interdependency model, relate to each other, and thus derive combined or correlated actions to be taken without any explicit rule.
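The service broker example can be sketched in a few lines. Again, this is only an illustration under assumptions of my own choosing: the dependency table, the issue data and the single action rule are all hypothetical, and issues are treated as "meeting" simply when their nodes depend on each other. Note that the rule speaks only of generic facts and never names the broker or any particular component.

```python
from itertools import combinations

# Hypothetical interdependency model: node -> nodes it depends on.
DEPENDS_ON = {
    "billing": {"broker"},
    "shop": {"broker"},
    "mailserver": set(),
    "broker": set(),
}

# Issues created independently by rules from different teams.
issues = [
    {"node": "broker", "data": {"config_changed": True}},
    {"node": "billing", "data": {"connection_errors": 12}},
    {"node": "mailserver", "data": {"connection_errors": 3}},
]

# Hypothetical action rule; it names no specific component.
ACTION_RULES = {
    "restart_component": lambda d: (
        d.get("config_changed", False) and d.get("connection_errors", 0) > 0
    ),
}


def related(a, b):
    """Two issues meet if their nodes are linked in the model."""
    return b["node"] in DEPENDS_ON[a["node"]] or a["node"] in DEPENDS_ON[b["node"]]


def actions_for(data):
    return {name for name, cond in ACTION_RULES.items() if cond(data)}


def correlated_actions(issues):
    """Actions that match only on the combined data of two related issues."""
    found = []
    for a, b in combinations(issues, 2):
        if not related(a, b):
            continue
        combined = {**a["data"], **b["data"]}
        new = actions_for(combined) - actions_for(a["data"]) - actions_for(b["data"])
        for action in sorted(new):
            found.append((action, a["node"], b["node"]))
    return found


result = correlated_actions(issues)
```

Neither issue can trigger the restart on its own. Only the broker issue and the billing issue, which meet via the dependency between their nodes, unlock it together, while the mail server issue carries similar data but is not related to the broker and therefore stays out of it.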