AIOps Service Management Suite (IG1190)

This work focuses on the AIOps Service Management layer, and more specifically on the analysis and assessment of the IT and network service management processes. The aim is to identify gaps and challenges and to define principles and guidelines for reengineering those processes, so that they are prepared to manage and govern large deployments of AI systems and AI components in CSP operations environments.

The end goal of our journey is to enable and drive the transformation of traditional service management practices into what we call AIOps Service Management.

This document series includes the following:

  • RN367 AIOps Service Management Release Notes.  This document provides guidance on how to get started using and applying the AIOps Service Management Suite by describing the contents and the correct versions that make up the solution suite.
  • IG1190 AIOps Service Management.  Large-scale deployments of AI in operations create huge operational challenges, such as how to govern, deploy, operate, control and maintain the hundreds or thousands of AI models and components which will eventually form part of the IT and network systems architecture. Unlike traditional software, AI models learn and evolve autonomously when exposed to new input data, and they are “black boxes” which are potentially even more fragile than traditional software: exposed to bias, opaque and nondeterministic by nature. To address these challenges, TM Forum and its members are leading an industry initiative called “AIOps Service Management” and are creating an industry-agreed framework focused on redesigning the multiple processes of service operations management. In this document, we take a step in this complex transformation journey of key service management processes, to prepare them to handle, operate and govern AI software at scale.
  • IG1190A AIOps Configuration Management.  The Configuration Management process allows an organization to know the current version, the configuration, and the history of any component of the system and service architecture. It is a central process in service management: it enables the control of everything in Production and in any other environment, enables the automation of activities, serves other processes, shortens delivery time, supports continuous delivery pipelines, etc.  In AIOps, this process becomes even more strategic, both to keep control of the self-evolving and dynamic AI models in Production and to automate the retrospective actions triggered to manage their AI-driven online changes. Additionally, this process is key to knowing the provenance of the AI components.

  • IG1190B AIOps Change Management.  Change Management is the service management process that controls the lifecycle of all changes to service, system, and infrastructure components in Production environments and, in general, in any controlled environment.  The Change Management process also ensures a smooth and successful transition of the changes into the target environment, minimizing risks and impacts to service quality, security and performance, preventing incidents and, when necessary, preparing the end users to adopt the new capabilities/functions brought by those changes.  In AIOps, the existing practices are challenged because AI models in Production may evolve and change autonomously and unpredictably, bypassing the change controls that have been put in place for traditional software.
  • IG1190C AIOps Release Management.  Release Management is the process that defines the strategy for which new components and changes will be deployed to Production; establishes how to package and integrate those new and updated components; plans the deployment of the new release from Development to Production environments (and, in general, to any relevant environment); and manages and supervises the deployment of the new release until its closing stage, including the coordination of any rollback and the retrospective lessons-learned process.  In AIOps, the traditional process above remains valid and applicable to changes flowing from Development to Production, but we also have the reverse scenario: online AI components may autonomously change their state in Production, and those changes may have widespread impacts that need to be managed retrospectively.
  • IG1190D AIOps Acceptance Testing.  Testing is the process that ensures the quality of all new and existing software and services against expected targets and acceptance criteria, usually prescribed by business and/or process owners.

    In this document, our focus is on the tests performed at the Deployment stage, where we usually run a final set of tests to verify and approve the readiness and acceptance of the new software for final delivery and roll-out into the Production environment.

    The deployment of AI software in Production brings new challenges, such as testing the ML training process, testing the online changes of AI models in Production, and managing new testing environments, among others.

  • IG1190E AIOps Knowledge Management.   The scope of the Knowledge Management process is to ensure that all parties involved in the software engineering and service delivery lifecycles, and in the operations value streams, are properly informed in a timely manner of any new relevant information and knowledge that is needed or expected to deliver the service in line with the agreed service levels, the organization’s policies and market regulations.
    The deployment of AI models in Production challenges the existing Knowledge Management practices as the AI models themselves may be new active players in the Knowledge Management process, becoming producers and consumers of knowledge.
  • IG1190F AIOps Monitoring and Event Management.   The Monitoring & Event Management process enables Production teams to prevent operational issues, provide mechanisms for the early detection and resolution of incidents, filter and correlate events, ensure integration with other service management processes, and automatically trigger responses to relevant events.  In AIOps, this process needs to be redesigned to cope with the opaque and dynamic nature of the AI components and to process new types of events, such as new training input data, and the related responses.
  • IG1190G AIOps Incident Management.  The primary objective of the Incident Management process is to restore services to end-users/customers and to recover normal service operations as quickly as possible and in line with the agreed SLAs, minimizing the adverse impacts on business operations and thus ensuring that the expected levels of service quality are maintained.  The Incident Management process manages the lifecycle of incidents from the detection of the incident until its closure and retrospective analysis, when applicable.  Incident Management is a key process of any operations framework to ensure the quality, availability and performance of the services and to reduce operational risks.
  • IG1190H AIOps Problem Management.  Problem Management, sometimes also called the Root Cause Analysis (RCA) process or known by other names, is widely defined as the process that maintains service quality at the expected level and improves service availability through the detection of problems, the analysis of the root causes behind new and known problems, and their corresponding resolution, consequently reducing the number of incidents and their business impacts.
  • IG1190I Application Maintenance.  Application Maintenance is a key strategic process that ensures the existing components continue to work properly, efficiently and effectively despite unexpected situations, technical/technological evolutions, changing environments, obsolescence, changed boundary conditions, new business expectations, etc.
  • IG1190J AIOps Management.  This document covers the data management process in operations where AI components have been or will be deployed at a significant scale. We call this process AIDataOps. The scope of the data operations management process is how to deploy, monitor, maintain and ultimately decommission (retire) data and the corresponding data flows in Production.

Resources Included