AI Model Serving

In this article I give an overview of the AI model serving approaches we tried and used in our production environment.

Serving different trained models in a production environment can be the tricky part of building machine learning or deep learning solutions. Before building any AI model or starting a training run, you have to be clear about what kind of problem you want to solve. Specify what you want to achieve, define your goals, and do some deeper research on how to solve your problem with machine learning or deep learning approaches.

Personally, I think it is not that difficult to train a first baseline model on some training data and achieve pretty good results. However, one of the main challenges in AI products is actually deploying and using different models in a real production environment, either in a standalone application or in a distributed architecture.

Model Serving Solutions

There are different deployment solutions for machine learning or deep learning models. Choosing the best option depends on the underlying use case and where the models should be served. Below I go into more detail on the options we considered:

Embed trained models into an application (standalone app, dockerized service or serverless cloud function)
Inference server which serves multiple models at once (TensorFlow Serving, TorchServe, NVIDIA Triton Inference Server, etc.)

Of course more solutions exist, for example frameworks such as BentoML, Cortex or Seldon Core that package single models as individual containerized web services.

At aptone we started with the first approach by embedding our models into a single application and using them directly in the business logic. We started with a serverless cloud function written in Python on Google Cloud. The main goal was to ensure high scalability of the service and to reduce costs by scaling to zero when no inference was requested. We quickly found several disadvantages that could be problematic for our use case, which is why we did not pursue the cloud function approach:

Cold starts of the application, which led to long delays on the first run
Hard to test or debug locally in case of error handling
A kind of vendor lock-in to the target cloud infrastructure
No model versioning

The second approach was to implement a microservice which also embeds the models for inference into the business logic. We could then containerize the service, deploy it in a Kubernetes cluster and scale it as needed. We saw some improvements but still had some drawbacks:

With each new model the Docker image gets bigger and bigger.
There is no version control for the models, so every time a model changes the application has to be changed as well.
Parallelizing the application gives me headaches because the service is very stateful and loads each model into memory for inference.

While trying to speed up the analysis microservice by parallelizing requests, we finally introduced a separate inference server by extracting the model serving from the business logic. There are different open source serving frameworks available, like TensorFlow Serving, TorchServe, BentoML, Cortex, Seldon Core and many more. The best fit for our case is currently the NVIDIA Triton Inference Server. It was amazingly easy to configure, implement and deploy in our production environment. Here are the main benefits for us:

Extract model serving from business logic
Independent scaling of the inference server
Simple model versioning and configuration
Ready-to-use Docker image which loads models from any model repository
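
To make the versioning and repository points more concrete, here is a minimal sketch that creates a model repository skeleton in the layout Triton expects: one directory per model, a config.pbtxt next to numbered version subdirectories, and the model file inside each version directory. The model name text_classifier, the tensor names and shapes, and the ONNX Runtime backend are assumptions for illustration, not our real configuration.

```python
from pathlib import Path
import tempfile

# Assumed example configuration; names, dims and backend are illustrative only.
CONFIG_PBTXT = """\
name: "text_classifier"
backend: "onnxruntime"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 128 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
"""


def create_repository(root, model_name, versions):
    """Create <root>/<model_name>/config.pbtxt plus one directory per version."""
    model_dir = Path(root) / model_name
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(CONFIG_PBTXT)
    for version in versions:
        version_dir = model_dir / str(version)
        version_dir.mkdir(exist_ok=True)
        # Empty placeholder; a real repository holds the exported model here.
        (version_dir / "model.onnx").touch()
    return model_dir


def list_layout(root):
    """Return all paths below root, relative to root, in sorted order."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as repo:
        create_repository(repo, "text_classifier", versions=[1, 2])
        for path in list_layout(repo):
            print(path)
```

Shipping a new model version then means adding a new numbered directory to the repository instead of rebuilding the application image.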

The downside of this approach is that we have to take care of another service that is also tightly coupled to our prediction microservice. However, we could improve the performance and are more flexible when deploying new models in production.
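
How a prediction microservice can talk to the inference server is sketched below. Triton exposes an HTTP/REST inference API based on the KServe v2 protocol; the sketch only builds the request with the Python standard library and does not send it. The server address http://localhost:8000, the model name text_classifier and the tensor name input are assumptions for illustration.

```python
import json
import urllib.request


def build_infer_request(base_url, model_name, version, tensor_name, rows):
    """Build a POST request for /v2/models/<name>/versions/<version>/infer."""
    payload = {
        "inputs": [
            {
                "name": tensor_name,
                "shape": [len(rows), len(rows[0])],
                "datatype": "FP32",
                # Tensor data flattened in row-major order.
                "data": [value for row in rows for value in row],
            }
        ]
    }
    url = f"{base_url}/v2/models/{model_name}/versions/{version}/infer"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_infer_request(
        "http://localhost:8000", "text_classifier", 1, "input",
        [[0.1, 0.2], [0.3, 0.4]],
    )
    print(request.full_url)
    # Sending the request needs a running server, for example:
    # with urllib.request.urlopen(request) as response:
    #     outputs = json.load(response)["outputs"]
```

Because the business logic only knows a URL, a model name and a version, the inference server can be scaled or updated without touching the microservice code. The price is the coupling mentioned above: tensor names and shapes have to match on both sides.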

Copyright © 2022 NVIDIA: Triton inference server architecture

Conclusion

There are many serving frameworks available for a lot of different use cases. Depending on your requirements for scalability, high availability and budget, it is up to you to decide which architecture fits best to solve your problem in the target production environment. The NVIDIA Triton Inference Server is currently a good fit for our solution and takes care of many of the struggles we faced in the early stages. We improved our AI prediction and analysis service and achieved a significant speed-up by introducing parallelization. But it definitely depends on the underlying problem and what fits your product and solution best.

About the author

Bastian Werner

Bastian Werner is one of the co-founders of aptone. He is an AI enthusiast and has played in several bands as a drummer.