IT Infrastructure for AI

By definition, AI solutions process massive amounts of data, so legacy infrastructure is unlikely to provide the capabilities an AI-enabled organisation needs.

Key considerations for infrastructure design include:

·      Security - according to Gartner, traditional prevent-and-detect approaches are inadequate and a shift to a “continuous response” stance is needed. Continuous monitoring of systems and behaviour is the only way to reliably deal with threats.

·      Specialised hardware: chips with many computing cores and high-bandwidth memory (HBM), optimised for the highly parallel computation that neural networks require.

·      Systems software with hardware-efficient implementations that automatically optimise code for the underlying hardware.

·      Distributed computing frameworks that can efficiently scale out model operations across multiple nodes.

·      Data and metadata management systems to enable reliable, uniform, and reproducible pipelines for creating and managing both training and prediction data.

·      Low-latency server infrastructure that enables machines to rapidly execute actions based on real-time data and context.

·      Model interpretation, QA, debugging, and observability tools to monitor, introspect, tune, and optimize models and applications at scale.

·      End-to-end platforms that encapsulate the entire ML/AI workflow and mask complexity from end users.  

·      Third-party cloud services that can supplement an organisation’s in-house capabilities; these should be carefully evaluated.

·      Infrastructure paradigms such as AI-Defined Infrastructure (ADI). 

This last point is of particular significance. ADI is a new infrastructure paradigm that both uses AI as a tool and enables AI functions. It allows IT infrastructure to run itself, without human intervention, by:

·      Deploying resources to match workload requirements, and de-allocating them when they are no longer needed.

·      Constantly analysing the ever-changing behaviour and status of every single infrastructure component.

·      Autonomously and proactively acting based on the status of infrastructure components.
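The three behaviours above amount to a monitor-decide-act control loop. A minimal sketch in Python (the component fields and scaling thresholds are illustrative assumptions, not taken from any real ADI product):

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    load: float        # current utilisation, 0.0 to 1.0
    replicas: int      # resources currently allocated

def reconcile(component: Component, high: float = 0.8, low: float = 0.2) -> Component:
    """Autonomously act on a component's observed status."""
    if component.load > high:
        # Workload is heavy: deploy an extra resource.
        component.replicas += 1
    elif component.load < low and component.replicas > 1:
        # Resource is no longer needed: de-allocate it.
        component.replicas -= 1
    return component

# One pass of the control loop over every infrastructure component.
fleet = [Component("api", 0.95, 2), Component("batch", 0.05, 3)]
fleet = [reconcile(c) for c in fleet]
```

A real implementation would run this loop continuously against live telemetry rather than a static snapshot.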

Machine Learning Architecture

Depending on the scale of the organisation, machine learning can run on systems ranging from a single laptop to massive server farms distributed across the globe.

Regardless of scale, the machine learning process is standard –

  1. Manage data
  2. Train models
  3. Evaluate models
  4. Deploy models
  5. Make predictions
  6. Monitor predictions

We can simplify this further - 

Data preparation > Train models > Evaluate models > Apply models.
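This simplified pipeline can be sketched end-to-end with a deliberately tiny model: a one-variable least-squares fit in plain Python. The data and the model are illustrative only; a real system would use one of the frameworks described below for training and monitoring.

```python
# 1. Manage data: assemble training pairs (x, y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# 2. Train a model: closed-form least squares for y ≈ w * x.
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# 3. Evaluate the model: mean squared error on held-out data.
test_xs, test_ys = [5.0], [10.1]
mse = sum((w * x - y) ** 2 for x, y in zip(test_xs, test_ys)) / len(test_xs)

# 4. Deploy the model: here, simply expose it as a function.
def predict(x: float) -> float:
    return w * x

# 5. Make predictions on new data.
prediction = predict(6.0)

# 6. Monitor predictions: log them for later drift analysis.
prediction_log = [(6.0, prediction)]
```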

Key components that are needed for this are:

File systems for storing and processing large volumes of structured and unstructured data on clusters of commodity hardware. Apache Hadoop could be used for an in-house solution; alternatively, cloud services such as Microsoft Azure or Amazon Web Services provide equivalent storage.

Data Lake, a single store of all enterprise data.

Data Warehouse for summarising, querying and analysing large data sets, including feature vectors, outcomes and sample predictions. Apache Hive can provide this kind of capability.
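Hive exposes a SQL-like language for exactly this kind of summarising query. The shape of such a query can be illustrated with Python's built-in sqlite3 module (the table and column names are invented for illustration, and SQLite is a small stand-in, not a warehouse):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (model TEXT, outcome REAL)")
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?)",
    [("churn_v1", 0.9), ("churn_v1", 0.7), ("churn_v2", 0.4)],
)

# Summarise outcomes per model, as a warehouse query would.
rows = conn.execute(
    "SELECT model, COUNT(*), AVG(outcome) FROM predictions "
    "GROUP BY model ORDER BY model"
).fetchall()
```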

Machine Learning Frameworks provide interfaces, libraries and tools that allow data scientists to build machine learning models. Popular frameworks include Apache Spark MLlib, PyTorch and TensorFlow. Popular languages for writing machine learning algorithms are Python and R.
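These frameworks differ in detail but share a common shape: an estimator object with a training method and a prediction method. A minimal pure-Python sketch of that interface, modelled loosely on the fit/predict convention popularised by scikit-learn (the classifier itself is a toy):

```python
class MeanClassifier:
    """Toy estimator: predicts the class whose feature mean is nearest."""

    def fit(self, X: list[float], y: list[int]) -> "MeanClassifier":
        # Learn one summary statistic (the mean) per class label.
        self.means_ = {}
        for label in set(y):
            values = [x for x, lbl in zip(X, y) if lbl == label]
            self.means_[label] = sum(values) / len(values)
        return self

    def predict(self, X: list[float]) -> list[int]:
        # Assign each point to the class with the nearest learned mean.
        return [min(self.means_, key=lambda lbl: abs(x - self.means_[lbl]))
                for x in X]

model = MeanClassifier().fit([1.0, 1.2, 8.0, 8.4], [0, 0, 1, 1])
labels = model.predict([1.1, 9.0])
```

A production framework adds what this sketch omits: optimisers, hardware acceleration and serialisation, behind much the same interface.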

Because different functions are going to be run across different computers, a system for controlling clusters of computers is needed. Apache Spark is a popular cluster-computing framework used in machine learning systems.
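Spark's core idea is to partition data, process the partitions in parallel, and merge the partial results. That pattern, though not Spark's actual API, can be sketched with Python's standard-library executor (threads here stand in for the separate machines a real cluster would use), using the classic word-count job:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(partition: list[str]) -> Counter:
    """Map step: count words within one partition of documents."""
    counts = Counter()
    for doc in partition:
        counts.update(doc.split())
    return counts

def word_count(partitions: list[list[str]]) -> Counter:
    """Scatter partitions across workers, then reduce the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(count_words, partitions)
    total = Counter()
    for partial in partials:
        total += partial
    return total

totals = word_count([["big data big models"], ["big clusters"]])
```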

Each component in the system will need to send, receive and process messages from other components, so a message streaming platform is needed. Apache Kafka is one such platform and is capable of handling trillions of events a day.
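Kafka's underlying model is an append-only log that producers write to and consumers read from at their own offsets. That model, not Kafka's client API, can be sketched in a few lines of Python:

```python
class Topic:
    """A miniature append-only log, after the pattern of a Kafka topic."""

    def __init__(self) -> None:
        self._log: list[bytes] = []
        self._offsets: dict[str, int] = {}   # one read position per consumer

    def produce(self, message: bytes) -> None:
        self._log.append(message)            # events are only ever appended

    def consume(self, consumer: str) -> list[bytes]:
        """Return the messages this consumer has not yet seen."""
        start = self._offsets.get(consumer, 0)
        self._offsets[consumer] = len(self._log)
        return self._log[start:]

topic = Topic()
topic.produce(b"model_trained")
topic.produce(b"model_deployed")
batch = topic.consume("monitoring")          # both events, on first read
```

Because each consumer tracks its own offset, many components can read the same stream independently, which is what lets the components of an ML system stay loosely coupled.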

It’s essential also for people across different parts of the process and the wider organisation to have good visibility of data, processes, outcomes and system management metrics. For this, visualisation tools are required, such as Matplotlib for plotting from Python, Jupyter notebooks for combining code, text, numerical analysis and charts, and Tableau for interactive data visualisation.

Figure 40. Simplified machine learning solution architecture

Organisations have three options when deciding how to build machine learning infrastructure.

Option 1 – Internal. The advantages of this are full control and the ability to make the system do exactly what you need. Disadvantages include finding the right skills and carrying the responsibility for keeping it operational.

Option 2 – Cloud services. Companies such as Microsoft, Amazon and IBM offer machine learning as a service. Advantages of this approach include elasticity (the ability to scale up or down according to demand) and Service Level Agreements. Whilst this approach may remove some of the engineering skill requirements, machine learning is ultimately more of a mathematical discipline than an engineering one, so you will still need people who deeply understand machine learning models from a mathematical perspective. And whilst cloud-based solutions are often more cost-effective in the short term, the costs can add up over time. There is also the risk of ‘vendor lock-in’ when taking this route.

Option 3 – Hybrid. A third option is to combine the ‘best of both worlds’. An organisation may decide to farm out selected parts of the machine learning process to a cloud service provider. For example, data could be stored in Amazon Web Services to better absorb peaks and troughs in data flows, whilst the rest of the machine learning process is retained in-house.
