IT Automation – All the Things We Are Talking About

Automation, Automation Technology Architect View, Business Impact of Automation 1 Comment »

Reading and writing about IT automation, I keep learning about the subject. Lately I have found that there are so many flavors of automation around the operating processes of IT that misunderstandings seem inevitable. So I want to take the opportunity here to talk about the different kinds of automation one can use in maintaining a high-quality IT environment.

Types of Automation Tasks

  1. Incident-, Problem-, Capacity- and Availability Management
    Automation engines specialized in analyzing and handling events that occur in an IT environment and that may lead to, or themselves represent, malfunctions, loss of quality and the like. Both reactive handling (automated reaction to an incoming event) and proactive handling (automated actions taken to prevent events from occurring) are targets of these engines. Automation engines that handle this “fault operating” are either embedded into the ITIL processes (see blog entry on extending ITIL with automation), like our automation engine (aAE), or are embedded into system components or management systems with a narrow scope, e.g. redundancy activation.
  2. Change Management
    Automation engines specialized in performing changes that modify or extend an IT environment automatically. Some of these engines insert an abstraction layer above the tasks that need to be performed (like adding users, restarting a component and the like), so that an administrator can perform tasks on many machines or on different platforms simply by interacting with the automation engine. An example of this kind of engine is the Puppet framework, with its very structured approach to abstraction. Other engines focus on scaling an IT environment by dynamically adding resources or automatically installing or modifying a system, as the Tivoli Provisioning Manager or VMware VirtualCenter do.

I really do hope (not just to save you some consulting fees) that this helps avoid misunderstandings when you are talking to others about automation. Even better, maybe I could point out some additional techniques you can look at to make life easier.

Who is automated “away”?

Automation, Social Impact of Automation No Comments »

As discussed before, automation in IT operations definitely has a strong social impact. It is a question of how IT professionals deal with the change that will make the difference in the end.

As I spent most of last week at an American university, I naturally had quite a few discussions on how automation impacts the lives of IT administrators. There seems to be a lot of personal discomfort (understandably). Unfortunately these personal issues get mixed up with the technical ones. Many people have asked me questions like “Do you trust the machine to stop a service, restart a machine or even allocate resources dynamically?” Well, yes I do. I have trusted my system for quite some time to allocate memory and disk space for me, and so have you. We also trust computer programs to land planes, control elevators and run life support systems in an ER. So why – WHY – should we not trust a machine to do something radical like rebooting a server?

In my opinion a machine has two major advantages over a human administrator in standard situations. First, it never executes radical commands due to a “gut feeling” (like “a reboot feels good”). Second, it documents the path it took to reach the conclusion that executing specific commands is a good idea. So you do have documentation (hello to all you SOX consultants out there), and if there really is an error, you know where to look and you will be able to change your rule set accordingly.

OK, so maybe we can solve the problem of trust through logical argument. Unfortunately some people are very resistant to logic. So another approach we sometimes take is a dry run. That means we install the automation engine, disable all execution and redirect the execution commands so that everything the engine would do is documented in a trouble ticket. As soon as administrators start copying and pasting commands out of the tickets, you know it is time to enable the real automation.
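To make the dry run idea concrete, here is a minimal sketch in Python. It is not how our engine is implemented; the names `Ticket` and `Executor` and the example command are assumptions made up for illustration. The point is only that the same code path either executes a command or writes it into a ticket.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    """Hypothetical stand-in for a trouble ticket."""
    lines: List[str] = field(default_factory=list)


@dataclass
class Executor:
    """Runs commands, or only documents them when dry_run is set."""
    dry_run: bool
    run: Callable[[str], None]  # the real execution hook (assumed)
    ticket: Ticket = field(default_factory=Ticket)

    def execute(self, command: str, reason: str) -> None:
        if self.dry_run:
            # Execution is disabled: document what would have been done.
            self.ticket.lines.append(f"WOULD RUN: {command} (reason: {reason})")
        else:
            self.run(command)
            self.ticket.lines.append(f"RAN: {command} (reason: {reason})")


executed: List[str] = []  # records what the execution hook really ran
dry = Executor(dry_run=True, run=executed.append)
dry.execute("service httpd restart", "health check failed 3 times")
```

In this sketch nothing reaches `executed` while `dry_run` is set, but the ticket already contains the exact command an administrator could copy.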

But let us get down to the actual administrators and the consequences all that automation has for them. There is this geek shirt: “Go away, or I will replace you with a very small shell script”. By the way, the guy in the picture is actually one of our administrators - one of the guys who really DO automation. I think the shirt was made to scare off users. But nowadays this is actually what will happen to administrators who do not want to be part of this changing world. In my vision of the future there will only be two kinds of administrative staff close to a data center: real IT experts (the gurus) and janitors. The experts are today's administrators who want to get rid of all the boring – I have done that about 10,000 times – tasks and deal with the exciting stuff instead. Well, the others ...

To get it straight: I actually do not think that there will be fewer jobs in IT administration in the future, mainly because IT is an ever-growing plant. I do think that there will be a lot less “boring” and unqualified work in IT – as we have seen in all other industries before.

So, is that really a bad thing? More exciting tasks, more real results, more happy administrators? I don't think so... Let's get it on, guys!

Taking a Look inside aAE (arago Automation Engine)

Automation, Automation Technology Deep Insight No Comments »

Time and again people ask me what they will see after implementing an automation engine. My answer usually was “Well, nothing really, you will see that your applications have better uptime...”, but obviously that is not what people want to hear. The whole idea of an automation engine is that things happen in the background and no one has to sit in front of some console watching lights turn red.

[Screenshot: aAE Visualizer]

Still, people want to see what is going on. And as automation is a matter of trust – the trust of system administrators and managers that such an engine will improve IT service instead of messing it up – it probably is a good idea to allow a peek under the hood of the machine. And since we are using a graph algorithm approach to find the automatic steps to be taken in order to resolve a problem, it makes sense to show a graph of the whole thing.

So that is just what we have done. In the attached screenshot you see the prototype of “aAE Visualizer”. This Java application displays the IT model and the issues and events travelling through the model. On the model graph it is possible to see where issues are created and how they travel through the engine in order to find actions to take. But this visualization application is not just a pretty way to let interested people “look at what is happening” in our automation engine; it also makes it easy to locate hot spots in an automated IT landscape. Hot spots always indicate a challenge. Either a hot spot is an error in the model – a place where problems travel in circles without finding any resolution – or it is an actual bottleneck in the IT infrastructure that is not visible from a capacity management point of view.
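One simple way to think about hot spots, as a rough sketch only: count how often issues visit each node of the model and flag the nodes that are visited unusually often. The function name `hot_spots`, the trail format and the threshold below are assumptions for illustration and say nothing about how the Visualizer itself works.

```python
from collections import Counter
from typing import List


def hot_spots(trails: List[List[str]], threshold: int) -> List[str]:
    """Return the nodes visited at least `threshold` times across all trails.

    Each trail is the (hypothetical) list of node names one issue has
    visited, in order; a node can appear several times if the issue circles.
    """
    visits = Counter(node for trail in trails for node in trail)
    return sorted(node for node, count in visits.items() if count >= threshold)


# One issue circles between "app" and "db", another passes through "app" once.
trails = [
    ["shop", "app", "db", "app", "db", "app"],
    ["mail", "app"],
]
hot = hot_spots(trails, threshold=3)
```

Whether a flagged node is a model error or a real bottleneck still has to be decided by a human looking at the graph.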

So I am very happy to announce that this visualization application will not only make my job of explaining how “automation works” easier, but will also allow our administrators to locate model problems or IT landscape problems with much less effort than before.

A CMDB that can deliver the model

Uncategorized No Comments »

As we are not really consultants for modeling IT infrastructure, we are always looking for a good way to minimize our manual effort when installing our automation engine. I actually thought it should be easy to load the necessary M-A-R-S information out of any CMDB, but so far that has proven much more difficult than expected. Most CMDBs we have looked at either did not supply the needed relationship and interdependency data or did not contain the static node information we need to bind rules. BUT yesterday we had a workshop at the IBM briefing center in Mainz to take a look at the IBM CCMDB. And you know what? It looks like we found a CMDB that actually contains all the data we need to load the IT interdependency model. Even if some organizations keep attributes we need for rule binding in Excel sheets or other strange data sources, we can load them from the IBM CCMDB through its federation technology.

But that is not the whole story. We were quite exuberant about the depth of relationships and interdependencies stored in the CMDB, but it really got amazing when we saw in an actual environment that most of the interdependencies were detected automatically. Someone at IBM actually did the work of modeling quite a few ssh connections and scripts to pull this information out of netstat and other system calls. Going through firewalls without losing the network angle seems to be a difficulty, which means that actual real-time detection of different zones of trust is not really possible, but what we are getting here is much better than anything we have seen before. It will save us about 80% of the time we spend on implementing automation for highly complex applications. The time our customers need to keep the model up to date while they change their IT landscape is probably greatly reduced as well. We will look into creating a persistent interface to the IBM CCMDB and, while we are at it, to their event bus and execution facilities as well.

You know my comments on other CMDBs and our difficulties in reading anything more than SML out of them. Normally I am quite taken aback and don't say much, but this time I am really happy.

The arago Automation Engine (aAE)

Automation Technology Deep Insight No Comments »

You must have suspected that I am not just philosophizing about building an automation engine, but have already done so (or at least I have designed the concepts and we at arago have built it – and you also know the software architect behind the whole scene – Jens “Cy” Bartsch). So after introducing you to the concepts of automation engines and my ideas on the social impact of automation, I would like to give a short overview of the technology we use to actually have an engine that performs the system administration tasks mostly done manually today. We have built an engine that learns and is instructed by system administrators to increase its operational abilities every day. Actually we have been working on developing this engine since 1995 and are currently at major release 4 of the engine.

As you will understand, I cannot reveal too much technical detail, but I still want to give a short look at the concepts we use. The key input to our engine is an IT infrastructure and application model based on the four layer M-A-R-S approach described earlier. The nodes of this model are enhanced with “static” data on the node, such as software version, log file location and everything else that can be found on the subject. Of course different data will appear in different kinds of nodes (obviously a machine node does not need a software version :-)). This model is read into the automation engine and represents a basic graph. All the nodes of the model are connected according to their interdependencies. The real-time event and monitoring information is then connected to the nodes.
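To give a feeling for what such a model graph could look like as a data structure, here is a small sketch in Python. It is not the aAE format; the class names `Node` and `Model`, the node kinds and the attribute keys are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    """One node of the IT model with its "static" data and attached events."""
    name: str
    kind: str  # e.g. "machine" or "application" (hypothetical kinds)
    static: Dict[str, str] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)


class Model:
    """Undirected graph of nodes connected by their interdependencies."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: Dict[str, Set[str]] = {}

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node
        self.edges.setdefault(node.name, set())

    def connect(self, a: str, b: str) -> None:
        # An interdependency is stored in both directions.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def attach_event(self, name: str, event: str) -> None:
        # Monitoring and event data is connected to the node it concerns.
        self.nodes[name].events.append(event)


model = Model()
model.add_node(Node("web01", "machine", {"log_file": "/var/log/messages"}))
model.add_node(Node("shop", "application", {"software_version": "2.1"}))
model.connect("shop", "web01")
model.attach_event("web01", "disk usage 97%")
```

Note that the machine node carries a log file location but no software version, while the application node does carry a version.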

Rules connect to the nodes of this basic graph, which represents the actual IT infrastructure as a model together with all the monitoring and event data the engine is to work on. These rules can be simple threshold rules or complex constructs built from conditions across many IT components. When such a rule matches, an issue object is created. An issue is sort of a “pre-incident” that tells us something may be going astray. An issue object can now travel the graph. This travel is directed by the issue's urge to collect new data in order to match an action rule that will allow the issue to perform an action – either to collect more data or to resolve the issue automatically. While travelling the graph, the issue collects more and more data from the nodes it visits and relates to other issues, which supply access to different branches of the graph or additional data. The travel algorithm focuses on achieving the maximum number of actions available to the issue. Of course issues can be injected into the graph as well – for example by reporting an incident.
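The travelling idea can be illustrated with a heavily simplified sketch. It is not our travel algorithm: it uses a plain breadth-first walk, stops at the first matching action rule and ignores the relation of issues to each other. The function `travel`, the data keys and the action name are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]
Data = Dict[str, str]
ActionRule = Tuple[str, Callable[[Data], bool]]


def travel(graph: Graph, node_data: Dict[str, Data], start: str,
           action_rules: List[ActionRule]) -> Optional[str]:
    """Walk the graph from `start`, collecting data until an action rule matches."""
    collected: Data = {}
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # The issue picks up whatever data the visited node offers.
        collected.update(node_data.get(node, {}))
        for action, matches in action_rules:
            if matches(collected):
                return action
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return None  # no action rule could be matched anywhere


graph: Graph = {"shop": {"app"}, "app": {"shop", "db"}, "db": {"app"}}
node_data = {"shop": {"http": "timeout"}, "db": {"db_disk": "full"}}
rules: List[ActionRule] = [
    ("clean_db_logs",
     lambda d: d.get("http") == "timeout" and d.get("db_disk") == "full"),
]
result = travel(graph, node_data, "shop", rules)
```

Here the issue created at the shop only matches the action rule after it has travelled to the database node and collected the disk data there.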

Compared to the top-down rule evaluation or aggregation used by so-called root cause analysis systems, an issue in our automation engine can circle in on a problem, testing the functionality it is looking for from different angles of the IT infrastructure. It thus finds the spot of the actual problem not by drilling down, but by a divide and conquer approach, like a good system administrator would. Also, by relating issues to each other, problems that are spread across the infrastructure and would not normally be found by system management software or a specialized administrator can be identified as such and then be solved by the same mechanism.

The automatic relation of issues when their combined data opens up new automation actions also creates many implicit rules, i.e. someone creating rules does not necessarily have to know all the actions that have to be taken throughout the infrastructure; the automation engine will find all the connected actions by itself. A good example is the fact that many dependent systems require restarts after a centralized component or service broker has been changed. The people writing the rules surrounding the change of the central service broker do not know anything about the other components, and the people maintaining the dependent components do not know anything about the change processes of the central service broker. This is not a problem for the graph approach we have chosen, because the issues created by the change at the central service broker meet on the IT interdependency model, relate to each other and thus derive combined or correlated actions to be taken without any explicit rule.

I/O scheme of an automation engine – or the Importance of having a correct IT Model

Automation Technology Architect View 2 Comments »

The automation engine is a computer program and as such it follows the simple scheme of “Input - Processing - Output”. The engine takes care of the processing part. So in order to talk about the quality of such a program, we have to examine the I/O scheme of the automation engine.

[Figure: automation engine I/O scheme]

There are two lanes of external input into the automation engine. First there is the model of the IT infrastructure and application landscape the automation engine is to work on. The second input stream is monitoring or event data from all the components of the infrastructure described by the first input stream. The automation engine has only one internal input or configuration stream. This stream defines the rule set of the engine in an appropriate format. This rule set will enable the “black box” to determine which action has to be executed automatically as well as the conditions permitting the execution of a certain action. This configuration stream also includes the actions themselves or links to a repository of actions - e.g. scripts written by administrators.

The automation engine will produce two output streams. One is an external stream documenting the actions taken by the automation engine to a service management or similar system as well as exporting skill management data to interface with manual operations effectively. The main output of the automation engine is a stream of commands to the components described in the IT model previously mentioned.
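The scheme described above can be written down as a function signature. The following is only a sketch under assumptions: the names `Event`, `Rule` and `process` are made up, rules are reduced to a condition on a single event, and the model is reduced to a dictionary of known nodes. It just shows the two external inputs, the rule set as configuration, and the two outputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    node: str
    metric: str
    value: float


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[Event], bool]
    action: str  # the action itself or a link to a script (assumed format)


def process(model: Dict[str, Dict[str, str]], events: List[Event],
            rules: List[Rule]) -> Tuple[List[str], List[str]]:
    """Return (commands for the components, documentation of what was done)."""
    commands: List[str] = []
    audit: List[str] = []
    for event in events:
        if event.node not in model:
            # Events that cannot be tied to the model are only documented.
            audit.append(f"ignored event for unknown node {event.node}")
            continue
        for rule in rules:
            if rule.condition(event):
                command = f"{rule.action} {event.node}"
                commands.append(command)
                audit.append(f"rule {rule.name} matched on {event.node}: {command}")
    return commands, audit


model = {"web01": {"kind": "machine"}}
events = [Event("web01", "disk_pct", 97.0), Event("ghost", "disk_pct", 99.0)]
rules = [Rule("disk_full",
              lambda e: e.metric == "disk_pct" and e.value > 95,
              "cleanup_disk")]
commands, audit = process(model, events, rules)
```

The event for the node that is missing from the model produces no command, which is a small illustration of why the quality of the model input matters so much.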

The internal input and output of the automation engine is the primary processing of the software and therefore part of the implementation of an automation engine. Proper functionality of the engine strongly relies on the quality of the external input streams (IT model and event or monitoring data). The integration of the automation engine into the manual administration processes, or rather the control of the manual processes through the automation engine, determines the effectiveness of the operational workforce and the degree of automation that can be reached.

Thus the input data and the integration of the output produced by the automation engine are issues that must not be underestimated. It is not enough to have some sort of CMDB to import components of the IT infrastructure to be operated upon. The input has to include relationship and attribute information at the necessary level of detail, and the monitoring streams have to be connected to the configuration items and their relations. Otherwise the automation engine will not operate at the desired level of effectiveness or, even worse, will generate false commands to be executed.

I have found that in many cases the configuration of the automation engine - especially hooking it up to the IT infrastructure and the available monitoring environment - is best done manually. Many CMDB implementations can be used to cross check the configuration but I am still looking for a CMDB implementation that will give an automation engine a good IT model and access to the required monitoring and event data.

First look at an automation engine

Automation, Automation Technology Architect View No Comments »

As you - hopefully - have read, automation is not magic, not even black magic. It is the execution of actions based on conditions. As this does not sound all that difficult, what do we need to integrate this concept into everyday IT maintenance life? Simple, we need some sort of machine - an automation engine - that will sit on all the IT components of our environment and execute actions if some conditions we have programmed the machine with become true.

Simple may be a pretty misleading term, though. The concept of this machine is very simple, but the automation engine has to monitor all data available in our IT environment in order to match any conditions, and it also has to find the right action to execute. So while the concept is simple, the technical problems to be solved in order to make this machine work are numerous and have to be dealt with. Let me take a glimpse at a few of the most pressing ones:

  1. Mass of data to be processed
    As you may remember, we are looking at all the system management, KPI and quality data we can get our hands on for all the IT around us. So there is a lot of data and we have to deal with all of it.
  2. Mass of conditions
    Besides all that data there are a lot of conditions that have to be evaluated against the available data series. The automation engine is a very elaborate version of a rule engine, because it is dealing with a highly interconnected logic tree (the IT model) and many conditions on a large data space. So typical approaches like decision matrices do not work as shortcuts for rule evaluation.
  3. Unknown rules
    If we wanted to put everything that needs to be executed automatically into an explicit rule, building the rule system would take a lifetime and the problem “mass of conditions” would become ever more pressing. Building implicit rules is too complicated for the user. So the automation engine has to adopt a behavior of encircling the problem. This is a divide and conquer approach, used instead of asking a user to enter every circumstance and every reaction, because this kind of “brain dump” has simply not been invented yet. I know this is very abstract, and I am sure that I will find a little more time soon to elaborate on the way an automation engine has to find the proper actions to take to solve a real life problem.

By the way, most computer systems and approaches in system management software take a simple approach to tackling these challenges. Techniques like root cause analysis or autonomic systems try to move down the dependency tree and find the problem somewhere down there. Why is this approach practical? Well, it quickly narrows down the number of possible data sources and actions that can be taken, and in that way a computer system can actually work out the resulting simple problem. And why do these approaches fall short? Well, they simply don't work with complex problems that show symptoms in some remote location of the IT environment or problems that are caused by multiple sources. Most problems in modern IT systems are of the latter kind, and therefore the common top-down approaches execute quite a few actions, but solve not nearly as many problems as an automation engine should. Or would you expect your best system administrator to simply go down the logical tree of connected systems while trying to find out why your ecommerce application is not working? No, not really, because good administrators encircle the cause of a problem, use their experience to exclude great parts of the IT environment as possible causes and then concentrate only on the “relevant” remainder.
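The difference between drilling down and encircling can be shown with a toy example. The sketch below assumes that we can test a function along several paths through the infrastructure and that a component seen on a working path can be excluded as a cause. That is a simplifying assumption, and the function name `encircle` and the component names are made up for illustration.

```python
from typing import List, Set, Tuple

Probe = Tuple[List[str], bool]  # (components on the tested path, test passed?)


def encircle(probes: List[Probe]) -> Set[str]:
    """Return components seen on failing paths but never on a working path."""
    healthy: Set[str] = set()
    failing: Set[str] = set()
    for path, ok in probes:
        (healthy if ok else failing).update(path)
    # Everything proven to work from some angle is excluded as a cause.
    return failing - healthy


probes: List[Probe] = [
    (["lb", "web1", "db"], True),      # the shop works via web1
    (["lb", "web2", "db"], False),     # but not via web2
    (["lb", "web2", "cache"], False),
]
suspects = encircle(probes)
```

A pure top-down walk from the load balancer would have to inspect every component below it, while the probes from different angles leave only web2 and the cache as candidates.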
