Nowadays, most people have heard of machine learning, deep learning, and related disciplines. Technically oriented people understand them more deeply: in this field we work with large amounts of data that go through specific processing, and the end product is a trained model that must be able to handle data it has not seen before.

The wide gap between the people who build the models and the people who put them into practice in computer systems often remains hidden. The two worlds follow different values and goals.

Data Scientists work hard to analyze, collect and process data, prepare the appropriate algorithm, and implement it. They can perform specific tests for data accuracy and determine the fundamental qualities of the model.

Whether the goal is a simple API that returns a result from a single model or a complex system, the production environment will differ from Jupyter Notebook or Google Colab. Most online courses and materials focus on training and evaluating models, but few reveal what to do with those models next.

On the other hand, for a software system to function, many more components are needed. The model’s qualities are essential, but other details give a system a finished look. Reliability, scalability, maintainability, performance, availability, and testability are the non-functional (quality) properties of the software system. The model’s predictions relate more to its functional properties.

One of the critical areas for improvement is so-called model inference: the process of feeding real data points into a machine learning model to obtain a result. This process is at the core of any software system that uses machine learning, so optimizing it pays off in the long term.

machine learning inference

The concept of MLOps

To bridge the gap between data scientists and software engineers, the concept of MLOps (Machine Learning Operations) has emerged. MLOps is a set of practices for deploying and maintaining machine learning models in real software systems. Simply put, MLOps covers all the actions that follow the construction of the model.

Much attention is paid to building the models themselves. Training and evaluation are considered the most time-consuming part of the data science life cycle, and the work usually ends there. Little significance is attached to everything that happens after the trained ML model is deployed.

Over time, however, it has become increasingly clear that deployed machine learning models tend to perform worse and worse. For example, models that initially achieve high accuracy during validation fail to make correct predictions for new user data entering the system.

This calls for a set of techniques that optimize every process in the chain: the model itself should be easily portable and easy to deploy in the future; real-time testing processes should report changes in the input data; and optimization processes should keep the software system stable while increasing its usability.

What type of problems arise

Adequacy of the model in a test environment, but inapplicability in production

Low throughput and high latency are problems that data scientists often do not attach much importance to, but they become a massive obstacle when the model becomes widely available. In a test environment, the main focus is on the model’s predictions; in a real software system, what matters is the adequacy of the model together with high throughput and low latency. This is the root of one of the fundamental problems: reducing the gap between the test environment and the production system. In this article, we will look at practical methods for achieving this goal through some optimizations.
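To make throughput and latency concrete, here is a minimal sketch of how they could be measured around any prediction call. The names measure_latencies, summarize, and fake_predict are made up for this illustration, and the numbers only describe sequential calls from a single client, so treat it as a starting point rather than a full load test.

```python
import math
import statistics
import time


def measure_latencies(predict, inputs):
    """Call predict once per input and return each call's duration in seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        predict(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Reduce per-call latencies to mean, 95th percentile, and throughput."""
    if not latencies:
        raise ValueError("at least one latency is required")
    ordered = sorted(latencies)
    # Nearest-rank percentile: the smallest value that covers 95% of the calls.
    rank = math.ceil(0.95 * len(ordered))
    total = sum(ordered)
    return {
        "mean_s": statistics.mean(ordered),
        "p95_s": ordered[rank - 1],
        # Sequential throughput: completed calls per second of busy time.
        "throughput_rps": len(ordered) / total if total > 0 else float("inf"),
    }


def fake_predict(x):
    # Hypothetical stand-in for a real model call.
    time.sleep(0.001)
    return x * 2


if __name__ == "__main__":
    print(summarize(measure_latencies(fake_predict, range(20))))
```

Looking at the 95th percentile next to the mean is a deliberate choice: a few slow requests can hide behind a good average but are exactly what users notice.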

Overfitting, Underfitting and input data change in time

The model, trained on some training data, may not adapt well enough to the new data entering the system. For example, in a book recommendation system based on readers’ preferences, tastes may change over time. The intervals can be short (1–2 weeks) or long (1–2 years or more).

The model may then fail to generalize to data it has never seen before. If there are discrepancies between the training data and the actual data, the ML model will not perform well outside the domain in which it was trained. If not corrected, this will lead to inaccurate forecasts and poor performance.
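One simple way to notice such a discrepancy is to compare the distribution of a feature in the training data with the same feature in recent production data. The sketch below uses the Population Stability Index (PSI), which is my choice for illustration and not something prescribed by any of the tools in this article. The bin edges and sample values are invented.

```python
import bisect
import math


def bin_shares(values, edges):
    """Return the share of values that falls into each bin between edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # bisect locates the bin; out-of-range values go to the outer bins.
        i = bisect.bisect_right(edges, v) - 1
        i = min(max(i, 0), len(counts) - 1)
        counts[i] += 1
    # A small floor avoids log(0) when a bin is empty.
    return [max(c / len(values), 1e-6) for c in counts]


def population_stability_index(reference, current, edges):
    """PSI is 0 for identical binned distributions and grows as they diverge."""
    ref = bin_shares(reference, edges)
    cur = bin_shares(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    edges = [0, 2, 4, 6]                      # assumed bins for a rating feature
    training = [1, 2, 2, 3, 3, 3, 4, 4, 5]    # made-up training sample
    recent = [4, 4, 5, 5, 5, 5, 3, 4, 5]      # made-up production sample
    print(population_stability_index(training, recent, edges))
```

Which PSI value should trigger retraining is a judgment call that depends on the feature and the business impact, so any threshold should be validated on your own data.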

MLOps engineers must also ensure that real-world data entering the system is processed with the same techniques as the training data. In particular, the same cleaning and pre-processing steps should be applied to new data so that model predictions stay consistent.
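A straightforward way to enforce this is to keep the cleaning logic in a single function that both the training pipeline and the serving code import. The sketch below assumes a hypothetical book record with "title" and "pages" fields; the field names and cleaning rules are invented for the example.

```python
def preprocess(record):
    """Single source of truth for cleaning, shared by training and serving."""
    title = (record.get("title") or "").strip().lower()
    pages = record.get("pages")
    valid = isinstance(pages, (int, float)) and pages > 0
    return {
        "title": title,
        # Missing or invalid page counts become 0 in both code paths.
        "pages": int(pages) if valid else 0,
    }


def build_training_rows(raw_records):
    """Training path: clean every historical record the same way."""
    return [preprocess(r) for r in raw_records]


def handle_request(raw_record, predict):
    """Serving path: clean the incoming record, then call the model."""
    return predict(preprocess(raw_record))


if __name__ == "__main__":
    rows = build_training_rows([{"title": "  Dune ", "pages": 412}])
    print(rows)
    # The lambda is a hypothetical stand-in for a real model.
    print(handle_request({"title": " DUNE"}, lambda features: features["title"]))
```

Because both paths call the same function, a change to the cleaning rules cannot silently apply to training but not to production.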

Lack of testing (and tools for that)

Testing is a significant step, but at the moment there are not many platforms and ways to test ML models. Moreover, when the model encounters data that it has never seen (which is the real-life scenario), the results can be very different from what was expected. For example, if we have trained a model to recognize and classify road signs, predictions may change in the presence of tree branches, fog, and other unexpected conditions. That is why the testing of machine learning models is a promising area.
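Even without a dedicated platform, a basic robustness check can be scripted: perturb the inputs slightly and count how often the prediction stays the same. The sketch below is only an illustration. Random pixel noise is a crude stand-in for conditions such as fog, and toy_classifier is a made-up placeholder for a real model.

```python
import random


def add_noise(pixels, strength, rng):
    """Shift each pixel by a bounded random amount, clipped to 0..255."""
    return [min(255, max(0, p + rng.randint(-strength, strength))) for p in pixels]


def stability_rate(classify, samples, strength, trials=20, seed=0):
    """Share of perturbed inputs whose label matches the clean prediction."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    same = 0
    total = 0
    for pixels in samples:
        expected = classify(pixels)
        for _ in range(trials):
            same += classify(add_noise(pixels, strength, rng)) == expected
            total += 1
    return same / total


def toy_classifier(pixels):
    # Hypothetical stand-in for a real model: labels by average brightness.
    return "bright" if sum(pixels) / len(pixels) > 127 else "dark"


if __name__ == "__main__":
    images = [[200] * 16, [30] * 16, [128] * 16]
    print(stability_rate(toy_classifier, images, strength=40))
```

A rate well below 1.0 on realistic perturbations suggests the model relies on fragile features and deserves a closer look before deployment.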


Another vital element is adequate monitoring of all events related to the model, the intelligent system, and the hardware. For example, server load, the number of requests per second, and the size and type of data must all be analyzed and monitored to detect possible problems early. There are many ready-made monitoring systems and frameworks, but good self-made options can be built on document-oriented databases (MongoDB, Elasticsearch) and column-oriented databases (ClickHouse, BigQuery), each of which has its own specifics, advantages, and disadvantages.
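As an illustration of the self-made route, the sketch below keeps a sliding window of request events in memory and derives a few of the metrics mentioned above. The RequestMonitor class is hypothetical; in a real setup the snapshots would be written to one of the databases listed here, which is left out to keep the example self-contained.

```python
from collections import deque


class RequestMonitor:
    """Keeps recent request events and summarizes them over a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = deque()  # (timestamp, payload_bytes, latency_s)

    def record(self, timestamp, payload_bytes, latency_s):
        self.events.append((timestamp, payload_bytes, latency_s))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()

    def snapshot(self, now):
        self._evict(now)
        n = len(self.events)
        if n == 0:
            return {"requests_per_second": 0.0,
                    "avg_payload_bytes": 0.0,
                    "max_latency_s": 0.0}
        return {
            "requests_per_second": n / self.window,
            "avg_payload_bytes": sum(e[1] for e in self.events) / n,
            "max_latency_s": max(e[2] for e in self.events),
        }


if __name__ == "__main__":
    monitor = RequestMonitor(window_seconds=10.0)
    monitor.record(0.0, 100, 0.2)
    monitor.record(5.0, 300, 0.4)
    print(monitor.snapshot(now=6.0))
```

Timestamps are passed in explicitly instead of being read from the clock, which keeps the class easy to test.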

Differences in programming languages

In addition, DevOps and data teams are sometimes unable to help with implementation, often due to conflicting priorities or a lack of understanding of what is needed for ML purposes. In many cases, the ML model and its input functions are written in Python, which is popular with data professionals, while the IT team is better acquainted with a language such as Java or C#. This means that engineers must take the Python code and translate it into another language to execute it within their infrastructure. Furthermore, deploying the model requires additional processing to bring the input data into a format that the ML model can accept, and this extra work adds to the burden on the engineers.
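One common way to soften the language barrier is to keep the model behind a serving endpoint and exchange plain JSON, so that a Java or C# service only needs an HTTP client. The sketch below builds and parses bodies in the shape used by the TensorFlow Serving REST predict API ("instances" in the request, "predictions" in the response). The helper names are made up, and no network call is made here.

```python
import json


def build_predict_request(images):
    """Serialize a batch of inputs (nested lists of numbers) as a JSON body."""
    return json.dumps({"instances": images})


def parse_predict_response(body):
    """Extract the list of predictions from a JSON response body."""
    return json.loads(body)["predictions"]


if __name__ == "__main__":
    # A made-up 2x2 single-channel "image" as a batch of one.
    request_body = build_predict_request([[[0.0, 0.5], [1.0, 0.25]]])
    print(request_body)

    # A made-up response with class probabilities for that one image.
    print(parse_predict_response('{"predictions": [[0.1, 0.9]]}'))
```

The same two JSON shapes can be produced and consumed from any language, which moves the translation effort from the model code to a small, well-defined contract.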

Model specificity

Almost always, a model serves one task and is very narrowly specialized. The data for its training must be carefully selected and correspond as closely as possible to the data it will work on later. But how can a system be created that performs more than one task and covers different subject areas?

Ways to optimize

One way to optimize is OpenVINO™, so I decided to experiment with optimizing our current models, which are served through TensorFlow Serving.

OpenVINO™ is an open-source toolkit for optimizing and deploying deep learning inference. Its primary purposes are:

  • boosting deep learning performance in computer vision, automatic speech recognition, natural language processing, and other common tasks;
  • using models trained with popular frameworks such as TensorFlow and PyTorch, as in our case at Scaleflex;
  • reducing resource demands and streamlining the deployment of new models or their variants.

openvino serving

The specific experiment was the conversion of a TensorFlow image classification model to the OpenVINO™ format.

Model Optimizer from the OpenVINO™ toolkit

The Model Optimizer from the OpenVINO™ toolkit is used to optimize the model itself. It is a cross-platform command-line (CLI) tool that facilitates the transition between the training and deployment environments, performs static model analysis, and adjusts deep learning models for optimal performance on the target device. The Model Optimizer is run on the finished model, in our case a TensorFlow model in the following format:

model format

After executing the specific model processing commands, the output files look like this:

In other words, the following derivative files are obtained:

  • .xml: describes the network topology;
  • .bin: contains the binary data for weights and biases;
  • .mapping: provides a mapping of the I/O layers from the original model.

The whole process of converting files becomes more apparent with the following scheme:

OpenVino optimization process

A drawback worth mentioning, from my point of view, is the CLI tool itself. It is written in Python, which is not very convenient in certain situations, and personally I prefer lower-level languages for CLI tools. Go would be a good choice for this kind of tool and would make installing, switching versions, and using help commands more convenient.

The exact transformation commands as well as the supported topologies of the model can be found in this documentation.

OpenVINO™ results

Okay, we’ve transformed the model into OpenVINO™, and now what? It is time to look at the real benefits of these actions.

First of all, even before comparing the other parameters, we would like to make sure that the results of the predictions of the model itself are the same (or similar).

A short test comparing the predictions of the standard TensorFlow model and the derived OpenVINO™ model shows that the predictions are adequate:

Results with OpenVino serving
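A check of this kind could be scripted roughly as follows. This is a sketch and not the exact test we ran: it assumes both services return one list of class probabilities per image, and the sample numbers are invented.

```python
def top1(probabilities):
    """Index of the most likely class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def compare_predictions(reference_batch, candidate_batch):
    """Compare two batches of class probabilities, image by image."""
    if not reference_batch or len(reference_batch) != len(candidate_batch):
        raise ValueError("batches must be non-empty and of equal length")
    agree = 0
    max_diff = 0.0
    for ref, cand in zip(reference_batch, candidate_batch):
        agree += top1(ref) == top1(cand)
        max_diff = max(max_diff, max(abs(r - c) for r, c in zip(ref, cand)))
    return {
        # Share of images for which both models pick the same class.
        "top1_agreement": agree / len(reference_batch),
        # Largest gap between any pair of corresponding probabilities.
        "max_abs_diff": max_diff,
    }


if __name__ == "__main__":
    original = [[0.10, 0.90], [0.80, 0.20]]    # made-up outputs, model A
    converted = [[0.12, 0.88], [0.79, 0.21]]   # made-up outputs, model B
    print(compare_predictions(original, converted))
```

Small numeric differences are expected after conversion, so top-1 agreement is usually the more meaningful of the two numbers.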

In terms of the size of the model itself:

  • schema_recognition model in TensorFlow format: 1.12 GB
  • schema_recognition model after OpenVino optimizer: 184 MB

This is a significant improvement, roughly a sixfold reduction in size. Although not directly relevant to the customers of the software platform, this reduction simplifies the DevOps work of porting and deploying the model and its subsequent versions.

To compare inference speed, we use a simple Python script that does the following:

  • takes a set of random images;
  • sends them sequentially to the TensorFlow Serving service and measures the inference time;
  • then sends them sequentially to the OpenVINO™ serving service and measures the inference time;
  • repeats all this for enough iterations to make the difference clear.
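The steps above can be sketched as follows. This is not the original script: the two send functions are placeholders that would wrap the actual HTTP or gRPC calls to each service, and the names time_service and compare_services are invented for the illustration.

```python
import time


def time_service(send, images, iterations):
    """Send every image sequentially, repeated `iterations` times.

    Returns the total elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for _ in range(iterations):
        for image in images:
            send(image)
    return time.perf_counter() - start


def compare_services(send_first, send_second, images, iterations=10):
    """Time both services on the same images, one after the other."""
    first_seconds = time_service(send_first, images, iterations)
    second_seconds = time_service(send_second, images, iterations)
    return {"first_seconds": first_seconds, "second_seconds": second_seconds}


if __name__ == "__main__":
    # Placeholders for real clients; the sleeps only simulate inference time.
    def send_to_tensorflow_serving(image):
        time.sleep(0.002)

    def send_to_openvino_serving(image):
        time.sleep(0.001)

    images = [b"fake-image-bytes"] * 5
    print(compare_services(send_to_tensorflow_serving,
                           send_to_openvino_serving, images, iterations=3))
```

Using the same images in the same order for both services keeps the comparison fair; the measured time includes network overhead, which is what a real client would experience.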

The results, along with the servers we used, can be seen here:

Server 1

CPU: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz

RAM: 32 GB

Python: 3.6.9

PIP: 21.1.2


  • TensorFlow serving: 653.43 seconds
  • OpenVino serving: 504.89 seconds
  • Difference: 148.54 seconds, 22.73% improvement with OpenVino

Server 2

CPU: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz

RAM: 252 GB

Python: 3.7.6

PIP: 21.1.2


  • TensorFlow serving: 544.60 seconds
  • OpenVino serving: 288.38 seconds
  • Difference: 256.22 seconds, 47.05% improvement with OpenVino
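For reference, the difference and the percentage on both servers are computed relative to the TensorFlow Serving time. A small sketch of that arithmetic, with a made-up function name:

```python
def improvement(baseline_seconds, optimized_seconds):
    """Return (seconds saved, percentage saved relative to the baseline)."""
    saved = baseline_seconds - optimized_seconds
    return round(saved, 2), round(100 * saved / baseline_seconds, 2)


if __name__ == "__main__":
    print(improvement(653.43, 504.89))  # Server 1 -> (148.54, 22.73)
    print(improvement(544.60, 288.38))  # Server 2 -> (256.22, 47.05)
```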


In this article, we looked at the concept of MLOps and showed some methods for improving the processes associated with providing a stable environment for machine learning models.

In the current state of machine learning systems, it is becoming increasingly important to think about the big picture and the software system as a whole. Choosing a suitable algorithm, training data, and framework is of course fundamental. Still, optimization, stability, traceability, and fault tolerance are what give the system a finished look worthy of presentation to end customers.


Last Update: 26/12/2023