Deploying a deep neural network is a challenging task, but luckily there are mature tools that help us do the job well.
In this article, you will discover how to use TensorFlow Serving (TF Serving) to deploy TensorFlow models.
Let’s get started.
What is TensorFlow Serving?
TensorFlow Serving is a high-performance system for deploying machine-learning models into production environments. Its main goal is to handle inference efficiently, without reloading your model from disk on each request.
It is built primarily for TensorFlow models, but extending it to other model types is also possible. TF Serving is a good choice if you plan to expose your models as endpoint services, which is a typical scenario.
Some of the built-in features:
- multiple models can be served with a few simple configurations;
- each model can have separate versions that your customers can consume;
- deployment of a new model typically involves only tiny changes in the config file;
- runs on different hardware, for example, on CPU and GPU;
- can be configured to run in Docker with docker-compose, providing a level of encapsulation;
- both gRPC and REST APIs are supported.
Deploy a model
Let’s now see a real example of model deployment with TensorFlow Serving.
First, we will need a model in the SavedModel format. We will train a shallow Keras model on the CIFAR-10 dataset for this tutorial. It is essential to mention that we focus on the deployment part rather than on the model itself.
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
# Load the CIFAR-10 dataset
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
# Convert class vectors to binary class matrices
train_labels = to_categorical(train_labels, 10)
test_labels = to_categorical(test_labels, 10)
# Define the model
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer=Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=10, batch_size=64)
# Save the model in the TensorFlow SavedModel format
tf.saved_model.save(model, export_dir='result_model')
On the last line of the script, we save the model to disk.
For the first example, we can fetch the official Docker image:
# Pull the official image
docker pull tensorflow/serving
Then we need to copy the model to the right place, in this case the /path/to/your/model/ folder. There is one more thing that is important for the serving to work: inside the model root, create a folder for the model version, for example 1 in /path/to/your/model/, and place all the SavedModel files there.
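Instead of copying the files around by hand, you can also export the model directly into a versioned folder from Python. A minimal sketch, reusing the model object from the training script above (the path is the same placeholder used in the commands below):
# Export straight into the version subfolder that TensorFlow Serving expects
# (placeholder path; adjust it to your environment)
import tensorflow as tf
tf.saved_model.save(model, export_dir='/path/to/your/model/1')
Finally, run the serving container: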
# Run the container
docker run -p 8501:8501 --name=tf_serving_container \
  -v /path/to/your/model/:/models/my_model \
  -e MODEL_NAME=my_model \
  -t tensorflow/serving
Where:
- -p 8501:8501 maps port 8501 of the container to port 8501 on your host machine.
- --name=tf_serving_container gives the container a name.
- -v /path/to/your/model/:/models/my_model mounts the model directory into the container. Remember to include the version folder inside it.
- -e MODEL_NAME=my_model sets the environment variable MODEL_NAME to the name of your model.
- -t tensorflow/serving specifies the Docker image to use. On local environments, especially on a Mac M1, you may need to choose a different image; for example, emacski/tensorflow-serving:latest-linux_arm64 works for local tests on ARM machines.
Everything should work now, and when inspecting the logs you should see something like:
# Logs from tensorflow serving container
2023-12-25 05:01:41.510900: I external/tf_serving/tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: cf version: 1}
2023-12-25 05:01:41.511943: I external/tf_serving/tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2023-12-25 05:01:41.511996: I external/tf_serving/tensorflow_serving/model_servers/server.cc:133] Using InsecureServerCredentials
2023-12-25 05:01:41.512008: I external/tf_serving/tensorflow_serving/model_servers/server.cc:383] Profiler service is enabled
2023-12-25 05:01:41.513766: I external/tf_serving/tensorflow_serving/model_servers/server.cc:409] Running gRPC ModelServer at 0.0.0.0:8500
Your model should now be accessible at:
# TensorFlow Serving URL (assuming local deployment on port 8501)
url = 'http://localhost:8501/v1/models/my_model:predict'
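To sanity-check the deployment, we can send a request to the predict endpoint. A minimal sketch using the requests library and a single CIFAR-10 test image, assuming the container above is running on localhost:
# Send one normalized CIFAR-10 image to the REST predict endpoint
import json
import requests
from tensorflow.keras.datasets import cifar10
(_, _), (test_images, _) = cifar10.load_data()
sample = (test_images[:1] / 255.0).tolist()  # shape (1, 32, 32, 3)
url = 'http://localhost:8501/v1/models/my_model:predict'
response = requests.post(url, data=json.dumps({"instances": sample}))
print(response.json())  # {'predictions': [[...ten class scores...]]}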
Deploy a new version of the same model
In production-grade ML systems, releasing a new version of a model is a common task. TF Serving makes this very easy: we just copy the new files into a dedicated folder. As you saw in the previous part, we created a separate folder holding a specific version of the model.
Let’s do this again, but this time with a new version of the model. I removed the MaxPooling2D layers and retrained the network. Remember that we are focused on deploying the model, not on getting the most accurate one; removing the pooling layers is certainly not the best idea when developing state-of-the-art models, but for this tutorial it is perfectly fine.
I created a folder named 2, copied the new version into it, and ran the container again. You should see success messages as the models load:
# Logs from tensorflow serving container
2023-12-25 05:30:44.978028: I external/tf_serving/tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: cf version: 2}
2023-12-25 05:30:44.978955: I external/tf_serving/tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
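By default, TensorFlow Serving serves only the latest version of a model, but the REST API also lets clients target a specific version explicitly. A minimal sketch, reusing the request pattern from the previous example:
# Query version 2 explicitly; drop the /versions/2 part to hit the latest version
import json
import requests
from tensorflow.keras.datasets import cifar10
(_, _), (test_images, _) = cifar10.load_data()
sample = (test_images[:1] / 255.0).tolist()
url = 'http://localhost:8501/v1/models/my_model/versions/2:predict'
response = requests.post(url, data=json.dumps({"instances": sample}))
print(response.json())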
Deploy multiple models
Creating a docker-compose.yml file along with a TensorFlow Serving model configuration file is a great way to manage and deploy your models with Docker Compose.
This setup allows you to define and run multi-container Docker applications. Below is an example of how you can set this up for serving TensorFlow models.
version: '3'
services:
  tensorflow_serving:
    image: tensorflow/serving
    ports:
      - "8501:8501"
    volumes:
      - ./models:/models
      - ./models_config:/models_config
    command: --model_config_file=/models_config/models.config --model_config_file_poll_wait_seconds=60
    environment:
      - MODEL_NAME=my_model
Where:
- Port 8501 is exposed for REST API requests.
- Two volumes are mounted: ./models is the local directory containing your TensorFlow models, and ./models_config is the local directory containing your model configuration file.
- The command field specifies the path to the configuration file inside the container and sets a polling interval for updating models without restarting the container.
- MODEL_NAME is an environment variable setting the default model to be served.
We also need a TensorFlow Serving model configuration file (models.config):
model_config_list {
  config {
    name: 'my_model'
    base_path: '/models/my_model'
    model_platform: 'tensorflow'
  }
}
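Since this section is about serving multiple models, note that each additional model simply gets its own config entry in the same file. Below is a sketch with a hypothetical second model named my_other_model (name and path are placeholders); each model then needs its own subfolder under ./models:
model_config_list {
  config {
    name: 'my_model'
    base_path: '/models/my_model'
    model_platform: 'tensorflow'
  }
  config {
    # Hypothetical second model; adjust name and base_path to your setup
    name: 'my_other_model'
    base_path: '/models/my_other_model'
    model_platform: 'tensorflow'
  }
}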
The directory structure should look something like this:
.
├── docker-compose.yml
├── models
│   └── my_model
│       ├── 1
│       │   ├── saved_model.pb
│       │   ├── assets
│       │   └── variables
│       └── 2
│           ├── saved_model.pb
│           ├── assets
│           └── variables
└── models_config
    └── models.config
The models directory contains the TensorFlow SavedModel files, and the models_config directory contains the TensorFlow Serving model configuration file.
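Once the stack is up (docker compose up), you can verify which versions are loaded through the model status endpoint of the REST API. A minimal sketch, assuming the compose file above:
# Check which versions of my_model are loaded and available
import requests
status = requests.get('http://localhost:8501/v1/models/my_model')
print(status.json())  # lists each loaded version and its state, e.g. AVAILABLE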
Advanced configurations
For a more advanced setup with TensorFlow Serving using Docker Compose, you can include additional configurations to handle multiple models, enable batching, logging, and monitoring, and fine-tune the performance. Here’s an enhanced version of the docker-compose.yml
file with these advanced configurations:
version: '3'
services:
  tensorflow_serving:
    image: tensorflow/serving
    ports:
      - "8501:8501"
      - "8500:8500"  # gRPC port
    volumes:
      - ./models:/models
      - ./models_config:/models_config
      - ./batching_config:/batching_config
    command: >
      --model_config_file=/models_config/models.config
      --enable_batching=true
      --batching_parameters_file=/batching_config/batching_parameters.txt
      --monitoring_config_file=/models_config/monitoring.config
      --tensorflow_session_parallelism=2
      --tensorflow_intra_op_parallelism=2
      --tensorflow_inter_op_parallelism=2
    environment:
      - MODEL_NAME=my_model
    restart: on-failure
Where:
- gRPC port: port 8500 is exposed for gRPC requests. TensorFlow Serving supports both REST and gRPC APIs.
- Batching: request batching is enabled with --enable_batching=true. Batching can improve throughput at the cost of increased latency.
- Batching parameters file: --batching_parameters_file points to a file containing settings for batch sizes and timeouts.
- Monitoring: --monitoring_config_file enables monitoring; the file can be used to configure monitoring settings, such as Prometheus metrics.
- Parallelism: TensorFlow’s session, intra-op, and inter-op parallelism are configured for performance tuning.
- Restart policy: the restart: on-failure policy ensures that the container restarts if it fails.
An example batching parameters file (batching_parameters.txt) could look like this:
max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 10000 }
And the monitoring configuration (monitoring.config):
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
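With this configuration, TensorFlow Serving exposes Prometheus-formatted metrics on the REST port under the configured path. A minimal sketch to inspect them, assuming the deployment above on localhost:
# Fetch the Prometheus metrics exposed by TensorFlow Serving
import requests
metrics = requests.get('http://localhost:8501/monitoring/prometheus/metrics')
print(metrics.text[:500])  # show the first few metric lines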
These configurations make your TensorFlow Serving deployment more robust, efficient, and suitable for handling complex scenarios with multiple models and high-throughput requirements. Remember to adjust the file paths and parameters according to your specific needs and infrastructure.
Conclusion
In this tutorial, you learned how to set up inference with TensorFlow Serving. It helps standardize how your models serve predictions, and it is a very popular approach for good reasons.
Use the links below for even more advanced configurations or a deeper understanding of the settings:
- https://www.tensorflow.org/tfx/serving/serving_config
- https://www.tensorflow.org/tfx/serving/architecture
- https://www.tensorflow.org/tfx/serving/serving_kubernetes