Deploying a deep neural network is a challenging task. But luckily, there are mature tools that help us do the job optimally. 

In this article, you will discover how to use TFServing to deploy TensorFlow models.

Let’s get started.

What is TensorFlow Serving

It is a high-performance framework for deploying machine-learning models to production environments. Its main goal is to handle inference without loading your model from disk on each request: the model is loaded once and kept in memory.

It is built primarily for TensorFlow models, but extending it to other model types is also possible. TFServing is a good choice if you plan to expose your models as endpoint services, which is a typical scenario.

Some of the built-in features:

  • multiple models can be served with a few simple configurations;
  • each model can have separate versions that your customers can consume;
  • deployment of a new model typically involves only tiny changes in the config file;
  • runs on different hardware, for example, on CPU and GPU;
  • can be configured to run in Docker with docker-compose, providing a level of isolation;
  • gRPC and REST API are supported.

Deploy a model

Let’s now look at a real example of model deployment with TensorFlow Serving.

Figure: Serving a model with TF. Sample model lifecycle; this tutorial focuses on the serving part.

First, we will need a model in the SavedModel format. We will use a shallow Keras model trained on the CIFAR-10 dataset for this tutorial. Keep in mind that we focus on the deployment part rather than on the model itself.

import tensorflow as tf
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical
from keras.optimizers import Adam

# Load CIFAR-10 dataset
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# Convert class vectors to binary class matrices
train_labels = to_categorical(train_labels, 10)
test_labels = to_categorical(test_labels, 10)

# Define the model
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
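    Flatten(),  # flatten the feature maps before the dense layers
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10, batch_size=64)

# Save the model in the TensorFlow SavedModel format
# (on Keras 3, model.export('result_model') should produce the same format)
tf.saved_model.save(model, export_dir='result_model')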

The last line of the script saves the model to disk in the result_model folder.

For the first example, we fetch the official Docker image:

# Pull the official image
docker pull tensorflow/serving

Then, we need to copy the model to the right place, in this case the /path/to/your/model/ folder. One more detail is essential for serving to work: TensorFlow Serving expects each model version in its own numbered subfolder.

Create a version folder inside your model root, such as 1 in /path/to/your/model/, and place all the files produced by the SavedModel export there. Finally, run the serving container.

# Run the container
docker run -p 8501:8501 --name=tf_serving_container \
-v /path/to/your/model/:/models/my_model \
-e MODEL_NAME=my_model \
-t tensorflow/serving


  • -p 8501:8501 maps port 8501 of the container to port 8501 on your host machine.
  • --name=tf_serving_container gives the container a name.
  • -v /path/to/your/model/:/models/my_model mounts the model directory to the container. Remember to add the version of the model, as a folder.
  • -e MODEL_NAME=my_model sets the environment variable MODEL_NAME to the name of your model.
  • -t allocates a pseudo-terminal, and tensorflow/serving is the Docker image to run. On some local environments, especially on Apple Silicon Macs (M1), the official image may not run and you need an ARM-compatible one. For example, the community image emacski/tensorflow-serving:latest-linux_arm64 can be used for local tests.
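
Preparing the versioned folder that the volume mount expects can also be scripted. The sketch below copies an exported SavedModel directory into a numbered subfolder of the model root. The helper name publish_version is hypothetical, and the paths are the placeholders used above.

```python
import shutil
from pathlib import Path


def publish_version(export_dir, model_root, version):
    """Copy an exported SavedModel into <model_root>/<version>/.

    Hypothetical helper: it only arranges files so that each version
    lives in a numerically named subfolder of the model root.
    """
    source = Path(export_dir)
    if not (source / "saved_model.pb").is_file():
        raise FileNotFoundError(f"{source} does not look like a SavedModel export")

    target = Path(model_root) / str(int(version))
    if target.exists():
        # Refuse to overwrite a version that may already be served.
        raise FileExistsError(f"version folder already exists: {target}")

    shutil.copytree(source, target)
    return target
```

For the example in this tutorial, the call would be publish_version("result_model", "/path/to/your/model", 1). Refusing to overwrite an existing version folder is a design choice of this sketch, so that a served version is never modified in place.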

Everything should work now. When inspecting the logs, you should see something like the following (the model in this run was named cf):

# Logs from tensorflow serving container
2023-12-25 05:01:41.510900: I external/tf_serving/tensorflow_serving/core/] Successfully loaded servable version {name: cf version: 1}
2023-12-25 05:01:41.511943: I external/tf_serving/tensorflow_serving/model_servers/] Finished adding/updating models
2023-12-25 05:01:41.511996: I external/tf_serving/tensorflow_serving/model_servers/] Using InsecureServerCredentials
2023-12-25 05:01:41.512008: I external/tf_serving/tensorflow_serving/model_servers/] Profiler service is enabled
2023-12-25 05:01:41.513766: I external/tf_serving/tensorflow_serving/model_servers/] Running gRPC ModelServer at

Your model should be accessible on:

# TensorFlow Serving URL (assuming local deployment on port 8501)
url = 'http://localhost:8501/v1/models/my_model:predict'
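
A request against this endpoint could look like the sketch below. It uses only the standard library and relies on the REST predict API accepting a JSON body with an "instances" list and answering with a "predictions" list. The function names build_predict_payload and predict are hypothetical. The pixel scaling mirrors the normalization used during training.

```python
import json
import urllib.request

URL = "http://localhost:8501/v1/models/my_model:predict"


def build_predict_payload(images):
    """Build the JSON body for the REST predict endpoint.

    `images` is a list of images given as nested lists (height x width x 3)
    with raw 0-255 pixel values. They are scaled to [0, 1] because the
    model was trained on scaled inputs.
    """
    scaled = [
        [[[channel / 255.0 for channel in pixel] for pixel in row] for row in image]
        for image in images
    ]
    return json.dumps({"instances": scaled}).encode("utf-8")


def predict(images, url=URL, timeout=10):
    """POST the images to TensorFlow Serving and return the predictions list."""
    request = urllib.request.Request(
        url,
        data=build_predict_payload(images),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

With the container from above running, predict([image]) for a single 32x32x3 image should return one list of 10 class scores, since the last layer of the model is a 10-way softmax.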

Deploy a new version of the same model

In production-grade ML systems, releasing a new version of a model is common. TFServing makes this very easy: you just copy the new files into a dedicated folder. As you saw in the previous part, we created a separate folder that holds a specific version of the model.

Let’s do this again, but this time with a new version of the model. I removed the MaxPooling2D layers and retrained the network. Remember that we are focused on deploying the model, not on getting the most accurate one. Removing the pooling layers is certainly not the best idea when developing a SOTA model, but for this tutorial it is perfectly fine.

I created a folder named 2, pasted the new version into it, and ran the container again. By default, TensorFlow Serving serves the highest version number it finds. You should see success messages as the model loads:

# Logs from tensorflow serving container
2023-12-25 05:30:44.978028: I external/tf_serving/tensorflow_serving/core/] Successfully loaded servable version {name: cf version: 2}
2023-12-25 05:30:44.978955: I external/tf_serving/tensorflow_serving/model_servers/] Finished adding/updating models
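
To check from the client side which version is live, the REST API offers a model status endpoint (GET /v1/models/my_model) and a predict URL pinned to one version (/v1/models/my_model/versions/2:predict). The sketch below builds those URLs and reads the status document. The helper names are hypothetical, and the host and port assume the local deployment from above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8501/v1/models"


def predict_url(model_name, version=None):
    """Return the predict URL, optionally pinned to one model version."""
    if version is None:
        return f"{BASE_URL}/{model_name}:predict"
    return f"{BASE_URL}/{model_name}/versions/{int(version)}:predict"


def available_versions(status):
    """Extract the versions in state AVAILABLE from a model status response."""
    return sorted(
        int(entry["version"])
        for entry in status.get("model_version_status", [])
        if entry.get("state") == "AVAILABLE"
    )


def fetch_status(model_name, timeout=10):
    """GET the model status document from the running server."""
    with urllib.request.urlopen(f"{BASE_URL}/{model_name}", timeout=timeout) as response:
        return json.loads(response.read())
```

Note that with the default version policy only the latest version stays loaded, so a request pinned to version 1 is expected to fail once version 2 is available unless the model configuration asks for specific versions.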

Deploy multiple models

Creating a docker-compose.yml file along with a TensorFlow Serving model configuration file is a great way to manage and deploy your models using Docker Compose.

This setup allows you to define and run multi-container Docker applications. Below is an example of how you can set this up for serving a TensorFlow model.

version: '3'
services:
  tensorflow-serving:  # service name is illustrative
    image: tensorflow/serving
    ports:
      - "8501:8501"
    volumes:
      - ./models:/models
      - ./models_config:/models_config
    command: --model_config_file=/models_config/models.config --model_config_file_poll_wait_seconds=60
    environment:
      - MODEL_NAME=my_model


  • Port 8501 is exposed for REST API requests.
  • Two volumes are mounted:
    • ./models is the local directory containing your TensorFlow models.
    • ./models_config is the local directory containing your model configuration file.
  • The command field specifies the path to the configuration file inside the container and sets a polling interval for updating models without restarting the container.
  • MODEL_NAME is an environment variable setting the default model to be served.

We also need a TensorFlow Serving model configuration file (models.config):

model_config_list {
  config {
    name: 'my_model'
    base_path: '/models/my_model'
    model_platform: 'tensorflow'
  }
}

The directory structure should look something like this:

├── docker-compose.yml
├── models
│   └── my_model
│       ├── 1
│       │   ├── saved_model.pb
│       │   ├── assets
│       │   └── variables
│       └── 2
│           ├── saved_model.pb
│           ├── assets
│           └── variables
└── models_config
    └── models.config

  • The models directory contains the TensorFlow SavedModel files, with one numbered folder per version.
  • The models_config directory contains the TensorFlow Serving model configuration file.
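
To actually serve more than one model, models.config needs one config entry per model inside model_config_list. The sketch below generates such a file from a mapping of model names to base paths. The function name render_models_config and the second model name other_model are hypothetical and only serve as an illustration.

```python
def render_models_config(models):
    """Render a model_config_list in text format, one config entry per model.

    `models` maps a model name to its base path inside the container.
    """
    entries = []
    for name, base_path in models.items():
        entries.append(
            "  config {\n"
            f"    name: '{name}'\n"
            f"    base_path: '{base_path}'\n"
            "    model_platform: 'tensorflow'\n"
            "  }"
        )
    return "model_config_list {\n" + "\n".join(entries) + "\n}\n"
```

Calling render_models_config({"my_model": "/models/my_model", "other_model": "/models/other_model"}) and writing the result to models_config/models.config would describe two models. Each of them would then be reachable under its own name in the URL, for example /v1/models/other_model:predict.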

Advanced configurations

For a more advanced setup with TensorFlow Serving using Docker Compose, you can include additional configurations to handle multiple models, enable batching, logging, and monitoring, and fine-tune the performance. Here’s an enhanced version of the docker-compose.yml file with these advanced configurations:

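version: '3'
services:
  tensorflow-serving:  # service name is illustrative
    image: tensorflow/serving
    ports:
      - "8501:8501"  # REST port
      - "8500:8500"  # gRPC port
    volumes:
      - ./models:/models
      - ./models_config:/models_config
      - ./batching_config:/batching_config
    # The file locations passed to the flags below are assumptions; adjust them to your layout
    command: >
      --model_config_file=/models_config/models.config
      --enable_batching=true
      --batching_parameters_file=/batching_config/batching_parameters.txt
      --monitoring_config_file=/models_config/monitoring.config
    environment:
      - MODEL_NAME=my_model
    restart: on-failure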


  • gRPC Port: Port 8500 is exposed for gRPC requests. TensorFlow Serving supports both REST and gRPC APIs.
  • Batching: Enable request batching by setting --enable_batching=true. Batching can improve throughput at the cost of increased latency.
  • Batching Parameters File: Specify a batching parameters file using --batching_parameters_file. This file contains settings for batch sizes and timeouts.
  • Monitoring: Enable monitoring with --monitoring_config_file. This file can be used to configure monitoring settings, such as Prometheus metrics.
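  • Parallelism: Configure TensorFlow’s session, intra-op, and inter-op parallelism for performance tuning, for example with the --tensorflow_session_parallelism, --tensorflow_intra_op_parallelism, and --tensorflow_inter_op_parallelism flags.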
  • Restart Policy: The restart: on-failure policy ensures that the container restarts if it fails.

An example batching parameters file (batching_parameters.txt) could look like this:

max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 10000 }

And monitoring configuration (monitoring.config):

prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}

These configurations make your TensorFlow Serving deployment more robust, efficient, and suitable for handling complex scenarios with multiple models and high-throughput requirements. Remember to adjust the file paths and parameters according to your specific needs and infrastructure.
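
With the monitoring configuration enabled, the metrics are exposed in the Prometheus text format on the REST port under the configured path. The sketch below fetches and parses them with the standard library. The function names are hypothetical, the URL assumes the local deployment on port 8501, and the parser is deliberately simplified.

```python
import urllib.request

METRICS_URL = "http://localhost:8501/monitoring/prometheus/metrics"


def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {sample_name_with_labels: value}.

    Simplified on purpose: comment lines (# HELP, # TYPE) are skipped and
    optional trailing timestamps are not handled.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


def fetch_metrics(url=METRICS_URL, timeout=10):
    """Download and parse the metrics exposed by TensorFlow Serving."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))
```

In a real setup a Prometheus server would scrape this path directly; a small client like this is mostly useful for a quick check that monitoring is switched on.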


In this tutorial you learned how to set up inference with TensorFlow Serving. It helps with standardizing how your models serve predictions, and it is a very popular approach for good reasons.

Last Update: 26/12/2023