7  Deploy and monitoring

Author

phonchi

Published

April 10, 2023

Open In Colab


7.1 Setup

!pip install bentoml -qq
!pip install pyngrok -qq
!pip install PyYAML -U -qq
!pip install streamlit -qq
!pip install gradio -qq
!pip install evidently -qq
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bentoml 1.0.17 requires starlette<0.26, but you have starlette 0.26.1 which is incompatible.

Notice that you may need to restart the kernel after the above installations.

# Scientific computing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib import cm
%matplotlib inline

# Modeling
from sklearn import datasets
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import ensemble
import tensorflow as tf

# Deploy
import bentoml
import gradio as gr

# Monitoring
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset, RegressionPreset

# Helper library
from pyngrok import ngrok, conf
import getpass

# Other system library
from pathlib import Path
import requests
import os
import json
import sys
import zipfile
import io
from datetime import datetime, time

Here are some useful tips for working with this notebook: https://amitness.com/2020/06/google-colaboratory-tips/ and https://stackoverflow.com/questions/59741453/is-there-a-general-way-to-run-web-applications-on-google-colab.

ngrok is a reverse proxy tool that opens secure tunnels from public URLs to localhost, perfect for exposing local web servers, building webhook integrations, enabling SSH access, testing chatbots, demoing from your own machine, and more. In this lab, we will use https://pyngrok.readthedocs.io/en/latest/integrations.html. However, for a production environment, it is recommended to use a cloud service such as AWS, GCP, or Azure; see here or https://pycaret.gitbook.io/docs/get-started/functions/deploy#deploy_model for more details.

print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
conf.get_default().auth_token = getpass.getpass()
Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth
··········
# Setup a tunnel to the port 8050
public_url = ngrok.connect(8050)
public_url
<NgrokTunnel: "http://ebc0-35-234-170-255.ngrok-free.app" -> "http://localhost:8050">
if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. Neural nets can be very slow without a GPU.")
    if "google.colab" in sys.modules:
        print("Go to Runtime > Change runtime and select a GPU hardware "
              "accelerator.")
    if "kaggle_secrets" in sys.modules:
        print("Go to Settings > Accelerator and select GPU.")

7.2 Deploying TensorFlow models to TensorFlow Serving (TFS) on a remote server

You could create your own microservice using any technology you want (e.g., using the Flask library), but why reinvent the wheel when you can just use TF Serving?

7.2.1 Exporting SavedModels

TensorFlow provides a simple tf.keras.models.save_model() function to export models to the SavedModel format. All you need to do is give it the model and an export path that encodes the model’s name and version number, and the function will save the model’s computation graph and its weights:

# Load and split the MNIST dataset
mnist = tf.keras.datasets.mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = mnist
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 1s 0us/step

It’s usually a good idea to include all the preprocessing layers in the final model you export so that it can ingest data in its natural form once it is deployed to production. This avoids having to take care of preprocessing separately within the application that uses the model. Bundling the preprocessing steps within the model also makes it simpler to update them later on and limits the risk of mismatch between a model and the preprocessing steps it requires!

# Build & train an MNIST model (also handles image preprocessing)

tf.random.set_seed(42)
tf.keras.backend.clear_session()
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28], dtype=tf.uint8),
    tf.keras.layers.Rescaling(scale=1 / 255),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
Epoch 1/10
1719/1719 [==============================] - 11s 3ms/step - loss: 0.6814 - accuracy: 0.8237 - val_loss: 0.3684 - val_accuracy: 0.9020
Epoch 2/10
1719/1719 [==============================] - 8s 4ms/step - loss: 0.3509 - accuracy: 0.9018 - val_loss: 0.2974 - val_accuracy: 0.9164
Epoch 3/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2992 - accuracy: 0.9157 - val_loss: 0.2617 - val_accuracy: 0.9266
Epoch 4/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2680 - accuracy: 0.9248 - val_loss: 0.2395 - val_accuracy: 0.9338
Epoch 5/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2449 - accuracy: 0.9321 - val_loss: 0.2214 - val_accuracy: 0.9380
Epoch 6/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.2268 - accuracy: 0.9370 - val_loss: 0.2087 - val_accuracy: 0.9424
Epoch 7/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2117 - accuracy: 0.9409 - val_loss: 0.1945 - val_accuracy: 0.9488
Epoch 8/10
1719/1719 [==============================] - 6s 3ms/step - loss: 0.1987 - accuracy: 0.9444 - val_loss: 0.1866 - val_accuracy: 0.9504
Epoch 9/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1872 - accuracy: 0.9476 - val_loss: 0.1768 - val_accuracy: 0.9534
Epoch 10/10
1719/1719 [==============================] - 6s 3ms/step - loss: 0.1773 - accuracy: 0.9503 - val_loss: 0.1685 - val_accuracy: 0.9534
<keras.callbacks.History at 0x7f2b0c0efc70>
X_new = X_test[:3]  # pretend we have 3 new digit images to classify
np.round(model.predict(X_new), 2)
1/1 [==============================] - 0s 81ms/step
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.99, 0.  , 0.  ],
       [0.  , 0.  , 0.98, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.98, 0.01, 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.  ]],
      dtype=float32)

Now to version the model, you just need to create a subdirectory for each model version:

model_name = "my_mnist_model"
model_version = "0001"
model_path = Path(model_name) / model_version
tf.keras.models.save_model(
    model,
    model_path,
    overwrite=True,
    include_optimizer=True,
    save_format="tf",
    signatures=None,
    options=None
)

A SavedModel represents a version of your model. It is stored as a directory containing a saved_model.pb file, which defines the computation graph (represented as a serialized protocol buffer), and a variables subdirectory containing the variable values. For models containing a large number of weights, these variable values may be split across multiple files. A SavedModel also includes an assets subdirectory that may contain additional data, such as vocabulary files, class names, or some example instances for this model.

for root, dirs, files in os.walk(model_name):
    indent = '    ' * root.count(os.sep)
    print('{}{}/'.format(indent, os.path.basename(root)))
    for filename in files:
        print('{}{}'.format(indent + '    ', filename))
my_mnist_model/
    0001/
        keras_metadata.pb
        fingerprint.pb
        saved_model.pb
        assets/
        variables/
            variables.data-00000-of-00001
            variables.index

As you might expect, you can load a SavedModel using the tf.keras.models.load_model() function.

saved_model = tf.keras.models.load_model(model_path)
np.round(saved_model.predict(X_new), 2)
1/1 [==============================] - 0s 51ms/step
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.99, 0.  , 0.  ],
       [0.  , 0.  , 0.98, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.98, 0.01, 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.  ]],
      dtype=float32)

TensorFlow also comes with a small saved_model_cli command-line tool to inspect SavedModels:

!saved_model_cli show --dir {model_path} --all
2023-04-09 04:55:26.330899: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['flatten_input'] tensor_info:
        dtype: DT_UINT8
        shape: (-1, 28, 28)
        name: serving_default_flatten_input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['dense_1'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 10)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict
The MetaGraph with tag set ['serve'] contains the following ops: {'Placeholder', 'StringJoin', 'VarHandleOp', 'StaticRegexFullMatch', 'DisableCopyOnRead', 'Softmax', 'NoOp', 'Cast', 'StatefulPartitionedCall', 'Mul', 'Const', 'AddV2', 'ShardedFilename', 'Select', 'MatMul', 'RestoreV2', 'Pack', 'BiasAdd', 'ReadVariableOp', 'AssignVariableOp', 'Identity', 'Reshape', 'SaveV2', 'Relu', 'MergeV2Checkpoints'}
2023-04-09 04:55:29.198144: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

Concrete Functions:
  Function Name: '__call__'
    Option #1
      Callable with:
        Argument #1
          flatten_input: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='flatten_input')
        Argument #2
          DType: bool
          Value: False
        Argument #3
          DType: NoneType
          Value: None
    Option #2
      Callable with:
        Argument #1
          inputs: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='inputs')
        Argument #2
          DType: bool
          Value: False
        Argument #3
          DType: NoneType
          Value: None
    Option #3
      Callable with:
        Argument #1
          inputs: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='inputs')
        Argument #2
          DType: bool
          Value: True
        Argument #3
          DType: NoneType
          Value: None
    Option #4
      Callable with:
        Argument #1
          flatten_input: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='flatten_input')
        Argument #2
          DType: bool
          Value: True
        Argument #3
          DType: NoneType
          Value: None

  Function Name: '_default_save_signature'
    Option #1
      Callable with:
        Argument #1
          flatten_input: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='flatten_input')

  Function Name: 'call_and_return_all_conditional_losses'
    Option #1
      Callable with:
        Argument #1
          inputs: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='inputs')
        Argument #2
          DType: bool
          Value: False
        Argument #3
          DType: NoneType
          Value: None
    Option #2
      Callable with:
        Argument #1
          flatten_input: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='flatten_input')
        Argument #2
          DType: bool
          Value: True
        Argument #3
          DType: NoneType
          Value: None
    Option #3
      Callable with:
        Argument #1
          flatten_input: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='flatten_input')
        Argument #2
          DType: bool
          Value: False
        Argument #3
          DType: NoneType
          Value: None
    Option #4
      Callable with:
        Argument #1
          inputs: TensorSpec(shape=(None, 28, 28), dtype=tf.uint8, name='inputs')
        Argument #2
          DType: bool
          Value: True
        Argument #3
          DType: NoneType
          Value: None

A SavedModel contains one or more metagraphs. When you pass a tf.keras model, by default the function saves a simple SavedModel: it saves a single metagraph tagged “serve”, which contains two signature definitions, an initialization function (called __saved_model_init_op) and a default serving function (called serving_default). When saving a tf.keras model, the default serving function corresponds to the model’s call() function, which of course makes predictions.
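
As a quick sanity check, the serving signature can also be called directly from Python with the low-level SavedModel API. This is a minimal sketch (not part of the original flow), reusing the model_path and X_new variables from the cells above:

loaded = tf.saved_model.load(str(model_path))  # my_mnist_model/0001
serve_fn = loaded.signatures["serving_default"]
# The signature returns a dict keyed by the output name ("dense_1" here)
outputs = serve_fn(tf.constant(X_new))
np.round(list(outputs.values())[0].numpy(), 2)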

7.2.2 Serve your model with TensorFlow Serving (server side)

There are many ways to install TF Serving: using the system’s package manager, using a Docker image, installing from source, and more. Since Colab/Kaggle runs on Ubuntu, we can use Ubuntu’s apt package manager like this:

if "google.colab" in sys.modules or "kaggle_secrets" in sys.modules:
    url = "https://storage.googleapis.com/tensorflow-serving-apt"
    src = "stable tensorflow-model-server tensorflow-model-server-universal"
    !echo 'deb {url} {src}' > /etc/apt/sources.list.d/tensorflow-serving.list
    !curl '{url}/tensorflow-serving.release.pub.gpg' | apt-key add -
    !apt update -q && apt-get install -y tensorflow-model-server
    %pip install -q -U tensorflow-serving-api
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2943  100  2943    0     0  17414      0 --:--:-- --:--:-- --:--:-- 17414
OK
Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:2 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Get:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:5 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease [3,622 B]
Hit:6 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease
Hit:7 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease
Get:8 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Get:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease [1,581 B]
Hit:10 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu focal InRelease
Hit:11 http://ppa.launchpad.net/ubuntugis/ppa/ubuntu focal InRelease
Get:12 https://storage.googleapis.com/tensorflow-serving-apt stable InRelease [3,026 B]
Get:13 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [2,060 kB]
Get:14 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [28.5 kB]
Get:15 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [1,324 kB]
Get:16 http://archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 Packages [31.3 kB]
Get:17 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [3,069 kB]
Get:18 http://archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages [2,199 kB]
Get:19 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages [972 kB]
Get:20 https://storage.googleapis.com/tensorflow-serving-apt stable/tensorflow-model-server amd64 Packages [340 B]
Get:21 https://storage.googleapis.com/tensorflow-serving-apt stable/tensorflow-model-server-universal amd64 Packages [348 B]
Fetched 10.0 MB in 3s (3,259 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
24 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  tensorflow-model-server
0 upgraded, 1 newly installed, 0 to remove and 24 not upgraded.
Need to get 414 MB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 https://storage.googleapis.com/tensorflow-serving-apt stable/tensorflow-model-server amd64 tensorflow-model-server all 2.11.1 [414 MB]
Fetched 414 MB in 13s (30.9 MB/s)
Selecting previously unselected package tensorflow-model-server.
(Reading database ... 122349 files and directories currently installed.)
Preparing to unpack .../tensorflow-model-server_2.11.1_all.deb ...
Unpacking tensorflow-model-server (2.11.1) ...
Setting up tensorflow-model-server (2.11.1) ...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bentoml 1.0.17 requires starlette<0.26, but you have starlette 0.26.1 which is incompatible.

The code above starts by adding TensorFlow’s package repository to Ubuntu’s list of package sources. Then it downloads TensorFlow’s public GPG key and adds it to the package manager’s key list so it can verify TensorFlow’s package signatures. Next, it uses apt to install the tensorflow-model-server package. Lastly, it installs the tensorflow-serving-api library, which we will need to communicate with the server.

If tensorflow_model_server is installed (e.g., if you are running this notebook in Colab/Kaggle), then the following two cells will start the server. If your OS is Windows, you may need to run the tensorflow_model_server command in a terminal and replace ${MODEL_DIR} with the full path to the my_mnist_model directory. This is where we start running TensorFlow Serving and load our model. After it loads, we can start making inference requests using REST. There are some important parameters:

  • port: The port that you’ll use for gRPC requests.
  • rest_api_port: The port that you’ll use for REST requests.
  • model_name: You’ll use this in the URL of REST requests. It can be anything.
  • model_base_path: This is the path to the directory where you’ve saved your model.
os.environ["MODEL_DIR"] = str(model_path.parent.absolute())
%%bash --bg
nohup tensorflow_model_server \
     --port=8500 \
     --rest_api_port=8050 \
     --model_name=my_mnist_model \
     --model_base_path="${MODEL_DIR}" > server.log 2>&1

The %%bash --bg magic command executes the cell as a bash script, running it in the background. The > server.log 2>&1 part redirects the standard output and standard error to the server.log file. And that’s it! TF Serving is now running in the background, and its logs are saved to server.log.
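
Because the server starts in the background, the next cells can race it before the model has finished loading. The helper below is an optional sketch (not in the original notebook) that polls TF Serving’s model-status REST endpoint until the default version reports AVAILABLE:

from time import sleep  # avoid shadowing the datetime.time class imported earlier

def wait_for_tf_serving(url="http://localhost:8050/v1/models/my_mnist_model", retries=30):
    # Poll the model-status endpoint exposed by TF Serving's REST API
    for _ in range(retries):
        try:
            resp = requests.get(url)
            if resp.ok and resp.json()["model_version_status"][0]["state"] == "AVAILABLE":
                return True
        except requests.exceptions.ConnectionError:
            pass  # server not up yet
        sleep(1)
    return False

wait_for_tf_serving()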

!tail server.log
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...

7.2.3 Querying TF Serving through the REST API (client side)

Let’s start by creating the query. It must contain the name of the function signature you want to call, and of course the input data. Since the request must use the JSON format, we have to convert the input images from a NumPy array to a Python list:

input_data_json = json.dumps({
    "signature_name": "serving_default",
    "instances": X_new.tolist(),
})

Note that the JSON format is 100% text-based:

repr(input_data_json)[:1500] + "..."
'\'{"signature_name": "serving_default", "instances": [[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 84, 185, 159, 151, 60, 36, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 222, 254, 254, 254, 254, 241, 198, 198, 198, 198, 198, 198, 198, 198, 170, 52, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 67, 114, 72, 114, 163, 227, 254, 225, 254, 254, 254, 250, 229, 254, 254, 140, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17, 66, 14, 67, 67, 67, 59, 21, 236, 254, 106, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 83, 253, 209, 18, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 233, 255, 83, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 129, 254, 238, 44, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 249, 254, 62, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...'

Now let’s send the input data to TF Serving by sending an HTTP POST request. This can be done easily using the requests library:

SERVER_URL = 'http://localhost:8050/v1/models/my_mnist_model:predict'
response = requests.post(SERVER_URL, data=input_data_json)
response.raise_for_status() # raise an exception in case of error
response = response.json()
response
{'predictions': [[2.062969e-05,
   3.84173156e-08,
   0.000472674757,
   0.00438946532,
   2.47851091e-07,
   7.01864119e-05,
   1.34316258e-09,
   0.994873822,
   7.89433e-06,
   0.000165013989],
  [0.00119448744,
   0.000342271611,
   0.983010888,
   0.0104818037,
   4.77538409e-09,
   0.00223252643,
   0.00219411566,
   8.47914272e-09,
   0.000543907867,
   1.48177925e-09],
  [7.8709978e-05,
   0.975474656,
   0.00764368149,
   0.00409595482,
   0.000257658307,
   0.00130979472,
   0.00135653175,
   0.00625853334,
   0.00292111211,
   0.000603225373]]}

The response is a dictionary containing a single “predictions” key. The corresponding value is the list of predictions. This list is a Python list, so let’s convert it to a NumPy array and round the floats it contains to the second decimal:

y_proba = np.array(response["predictions"])
y_proba.round(2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.99, 0.  , 0.  ],
       [0.  , 0.  , 0.98, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.98, 0.01, 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.  ]])

For more information, please refer to https://github.com/tensorflow/serving, which also covers the usage of gRPC.
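
The REST API is convenient, but gRPC is more efficient for large inputs since it exchanges compact protocol buffers. The sketch below (not part of the original lab) uses the tensorflow-serving-api package installed earlier to query the gRPC port 8500 we opened above, reusing X_new and the first model:

from tensorflow_serving.apis.predict_pb2 import PredictRequest
from tensorflow_serving.apis import prediction_service_pb2_grpc
import grpc

request = PredictRequest()
request.model_spec.name = "my_mnist_model"
request.model_spec.signature_name = "serving_default"
# The input name "flatten_input" comes from the serving signature shown earlier
request.inputs["flatten_input"].CopyFrom(tf.make_tensor_proto(X_new))

channel = grpc.insecure_channel("localhost:8500")  # gRPC port set with --port above
predict_service = prediction_service_pb2_grpc.PredictionServiceStub(channel)
response = predict_service.Predict(request, timeout=10.0)

output_name = model.output_names[0]  # "dense_1"
y_proba = tf.make_ndarray(response.outputs[output_name])
y_proba.round(2)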

7.2.4 Deploying a new model version

Now let’s create a new model version and export a SavedModel to the my_mnist_model/0002 directory, just like earlier:

# Change the architecture

np.random.seed(42)
tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28], dtype=tf.uint8),
    tf.keras.layers.Rescaling(scale=1 / 255),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))
Epoch 1/10
1719/1719 [==============================] - 10s 5ms/step - loss: 0.7847 - accuracy: 0.7836 - val_loss: 0.3468 - val_accuracy: 0.9080
Epoch 2/10
1719/1719 [==============================] - 8s 5ms/step - loss: 0.3279 - accuracy: 0.9055 - val_loss: 0.2753 - val_accuracy: 0.9238
Epoch 3/10
1719/1719 [==============================] - 6s 3ms/step - loss: 0.2742 - accuracy: 0.9211 - val_loss: 0.2347 - val_accuracy: 0.9360
Epoch 4/10
1719/1719 [==============================] - 8s 4ms/step - loss: 0.2394 - accuracy: 0.9312 - val_loss: 0.2126 - val_accuracy: 0.9418
Epoch 5/10
1719/1719 [==============================] - 11s 7ms/step - loss: 0.2134 - accuracy: 0.9391 - val_loss: 0.1931 - val_accuracy: 0.9442
Epoch 6/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1934 - accuracy: 0.9448 - val_loss: 0.1759 - val_accuracy: 0.9532
Epoch 7/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1772 - accuracy: 0.9493 - val_loss: 0.1660 - val_accuracy: 0.9556
Epoch 8/10
1719/1719 [==============================] - 8s 5ms/step - loss: 0.1640 - accuracy: 0.9528 - val_loss: 0.1617 - val_accuracy: 0.9550
Epoch 9/10
1719/1719 [==============================] - 6s 4ms/step - loss: 0.1526 - accuracy: 0.9574 - val_loss: 0.1501 - val_accuracy: 0.9592
Epoch 10/10
1719/1719 [==============================] - 7s 4ms/step - loss: 0.1431 - accuracy: 0.9597 - val_loss: 0.1411 - val_accuracy: 0.9600
model_version = "0002"
model_name = "my_mnist_model"
model_path = os.path.join(model_name, model_version)

tf.keras.models.save_model(
    model,
    model_path,
    overwrite=True,
    include_optimizer=True,
    save_format="tf",
    signatures=None,
    options=None
)
for root, dirs, files in os.walk(model_name):
    indent = '    ' * root.count(os.sep)
    print('{}{}/'.format(indent, os.path.basename(root)))
    for filename in files:
        print('{}{}'.format(indent + '    ', filename))
my_mnist_model/
    0002/
        keras_metadata.pb
        fingerprint.pb
        saved_model.pb
        assets/
        variables/
            variables.data-00000-of-00001
            variables.index
    0001/
        keras_metadata.pb
        fingerprint.pb
        saved_model.pb
        assets/
        variables/
            variables.data-00000-of-00001
            variables.index

At regular intervals (the delay is configurable), TensorFlow Serving checks for new model versions. If it finds one, it will automatically handle the transition gracefully: by default, it will answer pending requests (if any) with the previous model version, while handling new requests with the new version. As soon as every pending request has been answered, the previous model version is unloaded. You can see this at work in the TF Serving logs:

SERVER_URL = 'http://localhost:8050/v1/models/my_mnist_model:predict'
            
response = requests.post(SERVER_URL, data=input_data_json)
response.raise_for_status()
response = response.json()
y_proba = np.array(response["predictions"])
y_proba.round(2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.99, 0.  , 0.  ],
       [0.  , 0.  , 0.98, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.98, 0.01, 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.  ]])
!pgrep tensorflow
6207
!kill $(pgrep tensorflow)
!pgrep tensorflow

As you can see, TF Serving makes it quite simple to deploy new models. Moreover, if you discover that version 2 does not work as well as you expected, then rolling back to version 1 is as simple as removing the my_mnist_model/0002 directory.

You can also refer to https://github.com/microsoft/ML-For-Beginners/blob/main/3-Web-App/1-Web-App/README.md or https://github.com/rodrigo-arenas/fast-ml-deploy, which use Flask and FastAPI and may offer more flexibility.

If you would like to deploy to GCP Vertex AI, see here.

7.3 Deploy a REST API server using BentoML on a remote server

To begin with BentoML, you will need to save your trained models with the BentoML API in its model store (a local directory managed by BentoML). The model store is used for managing all your trained models locally as well as accessing them for serving.

7.3.1 Train a classifier model using the iris dataset

# Load training data
iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train the model
clf = svm.SVC(gamma='scale')
clf.fit(X_train, y_train)
SVC()
y_pred = clf.predict(X_test)
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
clf.predict([[5.9, 3.0, 5.1, 1.8]])
array([2])

Save the clf model with BentoML. We begin by saving a trained model instance to BentoML’s local model store. The local model store is used to version your models as well as control which models are packaged with your bento. Note that a wide range of model types can be saved via BentoML.

It is possible to use pre-trained models directly with BentoML or import existing trained model files to BentoML. Learn more about it from Preparing Models.

# Save model to the BentoML local model store
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
print(f"Model saved: {saved_model}")
Model saved: Model(tag="iris_clf:ne2yncwwssscuasc")

Models saved can be accessed via bentoml models CLI command:

!bentoml models list
 Tag                        Module           Size      Creation Time       
 iris_clf:ne2yncwwssscuasc  bentoml.sklearn  5.32 KiB  2023-04-09 05:07:22 

To verify that the saved model can be loaded correctly:

loaded_model = bentoml.sklearn.load_model("iris_clf:latest")
# model = bentoml.sklearn.load_model("iris_clf:wewrqnwn2s6ucasc") #we can instead load specific version of model
loaded_model.predict([[5.9, 3.0, 5.1, 1.8]])
array([2])

In BentoML, the recommended way of running ML model inference in serving is via Runner:

# Create a Runner instance:
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

# Runner#init_local initializes the model in current process, this is meant for development and testing only:
iris_clf_runner.init_local()

# This should yield the same result as the loaded model:
iris_clf_runner.predict.run([[5.9, 3.0, 5.1, 1.8]])
WARNING:bentoml._internal.runner.runner:'Runner.init_local' is for debugging and testing only. Make sure to remove it before deploying to production.
array([2])

In this example, bentoml.sklearn.get() creates a reference to the saved model in the model store, and to_runner() creates a Runner instance from the model. The Runner abstraction gives BentoServer more flexibility in terms of how to schedule the inference computation, how to dynamically batch inference calls, and how to better take advantage of all available hardware resources.

7.3.2 Create a BentoML Service for serving the model

Services are the core components of BentoML, where the serving logic is defined. With the model saved in the model store, we can define the service by creating a Python file service.py with the following contents:

%%writefile service.py
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

# Load the runner for the latest ScikitLearn model we just saved
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

# Create the iris_classifier service with the ScikitLearn runner
# Multiple runners may be specified if needed in the runners array
# When packaged as a bento, the runners here will be included
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

# Create an API function with pre- and post-processing logic with your new "svc" annotation
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    # Define pre-processing logic
    result = iris_clf_runner.predict.run(input_series)
    # Define post-processing logic
    return result
Writing service.py

In this example, we defined the input and output type to be numpy.ndarray. More options, such as pandas.DataFrame, JSON and PIL.image, are also supported. The svc.api decorator adds a function to the bentoml.Service object’s APIs list. The input and output parameters take an IO Descriptor object, which specifies the API function’s expected input/output types and is used for generating HTTP endpoints. Inside the API function, users can define any business logic, feature fetching, and feature transformation code. Model inference calls are made directly through runner objects, which are passed into the bentoml.Service(name=.., runners=[..]) call when creating the service object.
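
For example, an equivalent service that accepts a pandas.DataFrame instead of a NumPy array could be defined as in the sketch below (a hypothetical variant, not the service.py used in this lab):

import pandas as pd
import bentoml
from bentoml.io import PandasDataFrame, NumpyNdarray

iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier_df", runners=[iris_clf_runner])

@svc.api(input=PandasDataFrame(), output=NumpyNdarray())
def classify_df(input_df: pd.DataFrame):
    # The runner accepts the four iris feature columns directly
    return iris_clf_runner.predict.run(input_df)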

BentoML Server runs the Service API in an ASGI web serving layer and puts Runners in a separate worker process pool managed by BentoML. The ASGI web serving layer exposes REST endpoints for inference APIs, such as POST /predict, and common infrastructure APIs, such as GET /metrics for monitoring. You can also use other ASGI apps like FastAPI or WSGI apps like Flask; see here.

We now have everything we need to serve our first request. Launch the server in debug mode by running the bentoml serve command in the current working directory. Using the --reload option allows the server to reflect any changes made to the service.py module without restarting:

!nohup bentoml serve ./service.py:svc --reload --port 8050 &
nohup: appending output to 'nohup.out'

We can then send requests to the newly started service with any HTTP client:

requests.post(
    "http://127.0.0.1:8050/classify",
    headers={"content-type": "application/json"},
    data="[[5.9, 3, 5.1, 1.8]]"
    ).text
'[2]'
!pgrep bentoml
12658
!kill $(pgrep bentoml)

7.3.3 Build and Deploy Bentos 🍱

Once we are happy with the service definition, we can build the model and service into a bento. A bento is the distribution format for a service: a self-contained archive that contains all the source code, model files and dependency specifications required to run the service. Check out Building Bentos for more details.

To build a Bento, first create a file named bentofile.yaml in your project directory:

%%writefile bentofile.yaml
service: "service.py:svc"  # A convention for locating your service: <YOUR_SERVICE_PY>:<YOUR_SERVICE_ANNOTATION>
description: "file: ./README.md"
labels:
    owner: nsysu-math608
    stage: demo
include:
 - "*.py"  # A pattern for matching which files to include in the bento
python:
  packages:
   - scikit-learn  # Additional libraries to be included in the bento
   - pandas
  lock_packages: False
Writing bentofile.yaml
%%writefile README.md
This is an iris classifier built in math608
Writing README.md

Next, use the bentoml build CLI command in the same directory to build a bento.

!bentoml build
Building BentoML service "iris_classifier:udz3qngwswfdyasc" from build context "/content".
Packing model "iris_clf:ne2yncwwssscuasc"


Successfully built Bento(tag="iris_classifier:udz3qngwswfdyasc").

Possible next steps:

 * Containerize your Bento with `bentoml containerize`:
    $ bentoml containerize iris_classifier:udz3qngwswfdyasc

 * Push to BentoCloud with `bentoml push`:
    $ bentoml push iris_classifier:udz3qngwswfdyasc

Bentos built will be saved in the local bento store, which you can view using the bentoml list CLI command.

!bentoml list
 Tag                     Size       Creation Time        Path                   
 iris_classifier:udz3q…  18.37 KiB  2023-04-09 05:16:05  ~/bentoml/bentos/iris… 

We can serve bentos from the bento store using the bentoml serve --production CLI command. Using the --production option will serve the bento in production mode.

%%bash --bg
nohup bentoml serve iris_classifier:latest \
     --production \
     --port 8050 > bentoml.log 2>&1

Here is another way to query the server:

!curl \
  -X POST \
  -H "content-type: application/json" \
  --data "[[5.9,3,5.1,1.8]]" \
  http://127.0.0.1:8050/classify
[2]

The Bento directory contains all the code, files, models and configs required for running this service. BentoML standardizes this file structure, which enables serving runtimes and deployment tools to be built on top of it. By default, Bentos are managed under the ~/bentoml/bentos directory:

path ="/root/bentoml/bentos/iris_classifier/"
for root, dirs, files in os.walk(path):
    indent = ' ' * root.count(os.sep)
    print('{}{}/'.format(indent, os.path.basename(root)))
    for filename in files:
        print('{}{}'.format(indent + ' ', filename))
     /
      latest
     udz3qngwswfdyasc/
      bento.yaml
      README.md
      models/
       iris_clf/
        latest
        ne2yncwwssscuasc/
         saved_model.pkl
         model.yaml
      src/
       service.py
       __pycache__/
        service.cpython-39.pyc
      env/
       docker/
        entrypoint.sh
        Dockerfile
       python/
        install.sh
        requirements.txt
        version.txt
      apis/
       openapi.yaml
!pgrep bentoml
14130
!kill $(pgrep bentoml)

For more information, please refer to https://docs.bentoml.org/en/latest/index.html.

7.4 Deploy a web-based application on a local computer using Streamlit

Streamlit’s simple and focused API lets you build incredibly rich and powerful tools. It contains a large number of elements and components that you can use.

There are a few ways to display data (tables, arrays, data frames) in Streamlit apps. Below, st.write() can be used to write anything from text, plots to tables. In addition, when you’ve got the data or model into the state that you want to explore, you can add in widgets like st.slider(), st.button() or st.selectbox(). Finally, Streamlit makes it easy to organize your widgets in a left panel sidebar with st.sidebar. Each element that’s passed to st.sidebar is pinned to the left, allowing users to focus on the content in your app while still having access to UI controls. For example, if you want to add a selectbox and a slider to a sidebar, use st.sidebar.slider and st.sidebar.selectbox instead of st.slider and st.selectbox:

%%writefile iris-app.py
import streamlit as st
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

st.write("""
# Simple Iris Flower Prediction App

This app predicts the **Iris flower** type!
""")

st.sidebar.header('User Input Parameters')

def user_input_features():
    sepal_length = st.sidebar.slider('Sepal length', 4.3, 7.9, 5.4)
    sepal_width = st.sidebar.slider('Sepal width', 2.0, 4.4, 3.4)
    petal_length = st.sidebar.slider('Petal length', 1.0, 6.9, 1.3)
    petal_width = st.sidebar.slider('Petal width', 0.1, 2.5, 0.2)
    data = {'sepal_length': sepal_length,
            'sepal_width': sepal_width,
            'petal_length': petal_length,
            'petal_width': petal_width}
    features = pd.DataFrame(data, index=[0])
    return features

df = user_input_features()

st.subheader('User Input parameters')
st.write(df)

iris = datasets.load_iris()
X = iris.data
Y = iris.target

clf = RandomForestClassifier()
clf.fit(X, Y)

prediction = clf.predict(df)
prediction_proba = clf.predict_proba(df)

st.subheader('Class labels and their corresponding index number')
st.write(iris.target_names)

st.subheader('Prediction')
st.write(iris.target_names[prediction])

st.subheader('Prediction Probability')
st.write(prediction_proba)
Writing iris-app.py
%%bash --bg 
streamlit run iris-app.py --server.port 8050 > debug.log 2>&1

As soon as you run the script as shown above, a local Streamlit server will spin up; on a local machine your app would open in a new browser tab, while here we access it through the ngrok tunnel. The app is your canvas, where you’ll draw charts, text, widgets, tables, and more.

!tail debug.log
public_url
<NgrokTunnel: "http://c5cb-35-234-170-255.ngrok-free.app" -> "http://localhost:8050">

Click the link above to access the web app. For more information, please refer to https://github.com/streamlit/streamlit.

!pgrep streamlit
14947
!kill $(pgrep streamlit)

7.5 Deploy a web-based application on a local computer using Gradio

Image classification models are a perfect fit for Gradio’s image input component, so in this section we will build a web demo to classify images using Gradio. We will be able to build the whole web application in Python.

7.5.1 Setting up the Image Classification Model

First, we will need an image classification model. For this tutorial, we will use a pretrained MobileNet model, as it is easily downloadable from Keras. You can use a different pretrained model or train your own.

!wget https://hf.space/embed/abidlabs/keras-image-classifier/file/banana.jpg
!wget https://hf.space/embed/abidlabs/keras-image-classifier/file/car.jpg
--2023-04-09 05:21:23--  https://hf.space/embed/abidlabs/keras-image-classifier/file/banana.jpg
Resolving hf.space (hf.space)... 54.159.43.68, 18.204.155.216, 54.81.158.24, ...
Connecting to hf.space (hf.space)|54.159.43.68|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://abidlabs-keras-image-classifier.hf.space/file/banana.jpg [following]
--2023-04-09 05:21:23--  https://abidlabs-keras-image-classifier.hf.space/file/banana.jpg
Resolving abidlabs-keras-image-classifier.hf.space (abidlabs-keras-image-classifier.hf.space)... 34.196.131.200, 54.156.168.251, 34.195.4.197
Connecting to abidlabs-keras-image-classifier.hf.space (abidlabs-keras-image-classifier.hf.space)|34.196.131.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28437 (28K) [image/jpeg]
Saving to: ‘banana.jpg’

banana.jpg          100%[===================>]  27.77K  --.-KB/s    in 0.09s   

2023-04-09 05:21:23 (325 KB/s) - ‘banana.jpg’ saved [28437/28437]

--2023-04-09 05:21:24--  https://hf.space/embed/abidlabs/keras-image-classifier/file/car.jpg
Resolving hf.space (hf.space)... 54.159.43.68, 18.204.155.216, 54.81.158.24, ...
Connecting to hf.space (hf.space)|54.159.43.68|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://abidlabs-keras-image-classifier.hf.space/file/car.jpg [following]
--2023-04-09 05:21:24--  https://abidlabs-keras-image-classifier.hf.space/file/car.jpg
Resolving abidlabs-keras-image-classifier.hf.space (abidlabs-keras-image-classifier.hf.space)... 34.196.131.200, 54.156.168.251, 34.195.4.197
Connecting to abidlabs-keras-image-classifier.hf.space (abidlabs-keras-image-classifier.hf.space)|34.196.131.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79626 (78K) [image/jpeg]
Saving to: ‘car.jpg’

car.jpg             100%[===================>]  77.76K   457KB/s    in 0.2s    

2023-04-09 05:21:24 (457 KB/s) - ‘car.jpg’ saved [79626/79626]
inception_net = tf.keras.applications.MobileNetV2()
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5
14536120/14536120 [==============================] - 1s 0us/step

7.5.2 Defining a predict function

Next, we will need to define a function that takes in the user input, which in this case is an image, and returns the prediction. The prediction should be returned as a dictionary whose keys are class names and whose values are confidence probabilities. We will load the class names from this text file.

In the case of our pretrained model, it will look like this:

# Download human-readable labels for ImageNet.
response = requests.get("https://git.io/JJkYN")
labels = response.text.split("\n")

def classify_image(inp):
    inp = inp.reshape((-1, 224, 224, 3))
    inp = tf.keras.applications.mobilenet_v2.preprocess_input(inp)
    prediction = inception_net.predict(inp).flatten()
    confidences = {labels[i]: float(prediction[i]) for i in range(1000)}
    return confidences

Let’s break this down. The function takes one parameter:

  • inp: the input image as a NumPy array

Then, the function adds a batch dimension, passes it through the model, and returns:

  • confidences: the predictions, as a dictionary whose keys are class labels and whose values are confidence probabilities
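
Before wiring this into Gradio, we can sanity-check the function on one of the downloaded images. This is just an illustrative check (not in the original notebook), assuming the banana.jpg file fetched above:

from PIL import Image

img = np.array(Image.open("banana.jpg").resize((224, 224)))  # MobileNet expects 224x224 RGB
top3 = sorted(classify_image(img).items(), key=lambda kv: kv[1], reverse=True)[:3]
top3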

7.5.3 Creating a Gradio Interface

Now that we have our predictive function set up, we can create a Gradio Interface around it. In this case, the input component is a drag-and-drop image component. To create this input, we can use the gr.Image class, which creates the component and handles the preprocessing to convert the image to a NumPy array. We will instantiate the class with a parameter that automatically resizes the input image to 224 pixels by 224 pixels, which is the size that MobileNet expects.

The output component will be a “label”, which displays the top labels in a nice form. Since we don’t want to show all 1,000 class labels, we will customize it to show only the top 3.

Finally, we’ll add one more parameter, the examples, which allows us to prepopulate our interfaces with a few predefined examples. The code for Gradio looks like this:

gr.Interface(fn=classify_image, 
             inputs=gr.Image(shape=(224, 224), label="Input image"),
             outputs=gr.Label(num_top_classes=3, label="Prediction Probabilities"),
             examples=["banana.jpg", "car.jpg"],
             description="Please upload an image",
             title="Classification using MobileNet",
             ).launch(server_port=8050)
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.

Gradio can automatically produce shareable links (see the share option in launch()), but you can also access the web app through our port as follows:

public_url
<NgrokTunnel: "http://ebc0-35-234-170-255.ngrok-free.app" -> "http://localhost:8050">

You can see that there is a flagged directory, which collects data from users who try the model. To close the Gradio app, just call the close_all() function.

gr.close_all()
Closing server running on port: 8050

For more information, please refer to https://github.com/gradio-app/gradio.

7.6 Deploy a web-based application using TensorFlow.js

TensorFlow.js is a WebGL-accelerated JavaScript library for training and deploying ML models. The TensorFlow.js project includes a tensorflowjs_converter tool that can convert a TensorFlow SavedModel or a Keras model file to the TensorFlow.js format: a directory containing a set of sharded weight files in binary format and a model.json file that describes the model’s architecture and links to the weight files. This format is optimized to be downloaded efficiently on the web.
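
For instance, converting the SavedModel we exported earlier would look roughly like this (a sketch, assuming the tensorflowjs package installs cleanly in this environment; the output directory name is arbitrary and the result is a web-friendly graph model by default):

!pip install tensorflowjs -qq
!tensorflowjs_converter --input_format=tf_saved_model my_mnist_model/0002 my_mnist_tfjs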

Users can then download the model and run predictions in the browser using the TensorFlow.js library. Here is a code snippet to give you an idea of what the JavaScript API looks like:

import * as tf from '@tensorflow/tfjs';
const model = await tf.loadLayersModel('https://example.com/tfjs/model.json');
const image = tf.fromPixels(webcamElement);
const prediction = model.predict(image);

For more information, please refer to https://github.com/tensorflow/tfjs.

7.7 Deploy a mobile application using TensorFlow Lite

Once again, doing justice to this topic would require a whole book. If you want to learn more about TensorFlow Lite, check out the O’Reilly book Practical Deep Learning for Cloud, Mobile, and Edge or refer to https://www.tensorflow.org/lite.
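
Still, the basic conversion step is short. The following minimal sketch (assuming the SavedModel exported earlier in this notebook) uses the TFLiteConverter to produce a .tflite flatbuffer that can be bundled with a mobile or edge app:

converter = tf.lite.TFLiteConverter.from_saved_model("my_mnist_model/0002")
tflite_model = converter.convert()
with open("my_mnist_model.tflite", "wb") as f:
    f.write(tflite_model)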

7.8 Monitoring shift with Evidently

7.8.1 The task at hand: bike demand forecasting

We took a Kaggle dataset on Bike Sharing Demand. Our goal is to predict the volume of bike rentals on an hourly basis. To do that, we have some data about the season, weather, and day of the week.

content = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip").content
with zipfile.ZipFile(io.BytesIO(content)) as arc:
    raw_data = pd.read_csv(arc.open("hour.csv"), header=0, sep=',', parse_dates=['dteday'], index_col='dteday')
raw_data.index = raw_data.apply(
    lambda row: datetime.combine(row.name, time(hour=int(row['hr']))), axis = 1)
raw_data.head()
instant season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
2011-01-01 00:00:00 1 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0 3 13 16
2011-01-01 01:00:00 2 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0 8 32 40
2011-01-01 02:00:00 3 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0 5 27 32
2011-01-01 03:00:00 4 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0 3 10 13
2011-01-01 04:00:00 5 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0 0 1 1

7.8.2 Train a model

We trained a random forest model using data for the four weeks of January. Let’s imagine that in practice, we just started the data collection, and that was all the data available. The performance of the trained model looked acceptable, so we decided to give it a go.

We further assume that we only learn the ground truth (the actual demand) at the end of each week. That is a realistic assumption in real-world machine learning. Integrating and updating different data sources is not always straightforward. Even after the actual event has occurred! Maybe the daily usage data is stored locally and is only sent and merged in the database once per week.

To run it, we prepare our performance data as a Pandas DataFrame. It should include:

  • Model application logs: the features that went into the model and the corresponding prediction; and
  • Ground truth data: the actual number of bikes rented each hour as our “target.”

Once we train the model, we take our training dataset together with the generated predictions and specify it as the “reference” data. We can select this period directly from the DataFrame since it has a datetime index:

reference = raw_data.loc['2011-01-01 00:00:00':'2011-01-28 23:00:00']

target = 'cnt'
prediction = 'prediction'
numerical_features = ['temp', 'atemp', 'hum', 'windspeed', 'hr', 'weekday']
categorical_features = ['season', 'holiday', 'workingday']
reference.head()
instant season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
2011-01-01 00:00:00 1 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0 3 13 16
2011-01-01 01:00:00 2 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0 8 32 40
2011-01-01 02:00:00 3 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0 5 27 32
2011-01-01 03:00:00 4 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0 3 10 13
2011-01-01 04:00:00 5 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0 0 1 1
reference[numerical_features + categorical_features].shape
(618, 9)
regressor = ensemble.RandomForestRegressor(random_state = 0, n_estimators = 50)
regressor.fit(reference[numerical_features + categorical_features], reference[target])
RandomForestRegressor(n_estimators=50, random_state=0)
ref_prediction = regressor.predict(reference[numerical_features + categorical_features])
reference['prediction'] = ref_prediction

We also map the columns to show Evidently what each column contains and perform a correct analysis:

column_mapping = ColumnMapping()

column_mapping.target = target
column_mapping.prediction = prediction
column_mapping.numerical_features = numerical_features
column_mapping.categorical_features = categorical_features

By default, Evidently uses the index as an x-axis in plots. In this case, it is datetime, so we do not need to add anything else explicitly. Otherwise, we would have to specify it in our column mapping.
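
For illustration only: if the timestamp lived in a regular column rather than the index, say a hypothetical column named "timestamp", we would point Evidently at it explicitly:

# "timestamp" is an assumed column name, not present in this dataset
column_mapping_with_ts = ColumnMapping(
    target=target,
    prediction=prediction,
    numerical_features=numerical_features,
    categorical_features=categorical_features,
    datetime="timestamp",
)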

Next, we call a corresponding report for regression models.

regression_perfomance = Report(metrics=[RegressionPreset()])
regression_perfomance.run(current_data=reference, reference_data=None, column_mapping=column_mapping)
# You can also specify the metrics see https://docs.evidentlyai.com/reference/all-metrics
#the_report = Report(metrics=[
#    RegressionQualityMetric(),
#    RegressionErrorPlot(),
#    RegressionErrorDistribution(),
#    DataDriftPreset(stattest=anderson_stat_test, stattest_threshold=0.9),
#])

And display the results right in the Jupyter notebook.

regression_perfomance.show()
Output hidden; open in https://colab.research.google.com to view.

We also save it as a .html file to be able to share it easily.

!mkdir -p reports
regression_perfomance.save_html('reports/regression_performance_at_training.html')
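Besides HTML, the same report object can be exported programmatically, which is convenient for automated checks in a pipeline. A small sketch (method names as in the Evidently releases around this version; consult the docs if yours differs):

# Export the report for programmatic checks instead of visual inspection
report_dict = regression_perfomance.as_dict()  # nested dict of metric results
report_json = regression_perfomance.json()     # same content as a JSON string
print(list(report_dict.keys()))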

We can see that the model quality is fine, given that we only trained on four weeks of data! The error is symmetric and distributed around zero. There is no obvious under- or over-estimation.
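As a quick cross-check of the numbers shown in the report, we can compute the headline metrics directly with scikit-learn (purely a sanity check; the report already contains these values):

from sklearn.metrics import mean_absolute_error

# Mean error close to zero means no systematic bias; MAE summarizes the typical error size
errors = reference['prediction'] - reference[target]
print('Mean error:', errors.mean())
print('MAE:', mean_absolute_error(reference[target], reference['prediction']))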

We will continue to treat this training-period performance as our “reference.” It gives us a good sense of the quality we can expect from the model in production, so we can contrast future performance against this benchmark.

7.8.3 The first week in production

Observing the model in production has straightforward goals. We want to detect if something goes wrong, ideally in advance. We also want to diagnose the root cause and quickly understand how to address it. Maybe the model degrades too fast and we need to retrain it more often? Perhaps the error is too high and we need to adapt and rebuild the model? Which new patterns are emerging?

In our case, we simply start by checking how well the model performs outside the training data. Our first week becomes what would have otherwise been a holdout dataset.

For demonstration purposes, we generated all predictions for several weeks ahead in a single batch. In reality, we would run the model sequentially as the data comes in.
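In such a sequential setup, the weekly scoring job might look roughly like the loop below. This is only a sketch: the week boundaries are illustrative, and a production job would read each batch from wherever the new data lands.

# Illustrative weekly scoring loop; in reality each batch arrives from a pipeline or database
import pandas as pd

week_starts = pd.date_range('2011-01-29', periods=4, freq='7D')
for start in week_starts:
    end = start + pd.Timedelta(days=7) - pd.Timedelta(hours=1)
    batch = raw_data.loc[start:end]
    if batch.empty:
        continue
    preds = regressor.predict(batch[numerical_features + categorical_features])
    # In production we would log preds together with the input features for later monitoring
    print(start.date(), '->', end.date(), 'scored', len(batch), 'rows')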

Let’s start by comparing the performance in the first week to what we saw in training. The first 28 days are our reference dataset; the next 7 days are the production period.

current = raw_data.loc['2011-01-29 00:00:00':'2011-02-28 23:00:00'].copy()
current_prediction = regressor.predict(current[numerical_features + categorical_features])
current['prediction'] = current_prediction
regression_perfomance = Report(metrics=[RegressionPreset()])
regression_perfomance.run(current_data=current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00'], 
                          reference_data=reference,
                          column_mapping=column_mapping)

regression_perfomance.show()
Output hidden; open in https://colab.research.google.com to view.

The error has slightly increased and is leaning towards underestimation. Let’s check if there is any statistical change in our target. To do that, we will generate the Target Drift report.

target_drift = Report(metrics=[TargetDriftPreset()])
target_drift.run(current_data=current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00'],
                 reference_data=reference,
                 column_mapping=column_mapping)

target_drift.show()
Output hidden; open in https://colab.research.google.com to view.

We can see that the distribution of the actual number of bikes rented remains sufficiently similar. To be more precise, the similarity hypothesis is not rejected. No drift is detected. The distributions of our predictions did not change much either.

Despite this, a rational decision would be to update the model by including the new week’s data. That way, the model can continue to learn and we can probably reduce the error. For the sake of demonstration, however, we will keep the original model and watch how quickly things go wrong.

7.8.4 The second week: failing to keep up

Once again, we benchmark our new week against the reference dataset.

regression_perfomance = Report(metrics=[RegressionPreset()])
regression_perfomance.run(current_data=current.loc['2011-02-07 00:00:00':'2011-02-14 23:00:00'], 
                          reference_data=reference,
                          column_mapping=column_mapping)

regression_perfomance.show()
Output hidden; open in https://colab.research.google.com to view.

At first glance, the model performance in the second week does not differ much. MAE remains almost the same, but the skew towards underestimation continues to grow. It seems that the error is not random! To learn more, we move to the plots. We can see that the model catches the overall daily trends just fine, so it has learned something useful. But at peak hours, the actual demand tends to be higher than predicted.

In the error distribution plot, we can see how it became “wider,” as we have more predictions with a high error. The shift to the left is visible, too. In some extreme instances, the error reaches 40 to 80 bikes, a range that was unseen previously.

Let’s check our target as well.

target_drift = Report(metrics=[TargetDriftPreset()])
target_drift.run(current_data=current.loc['2011-02-07 00:00:00':'2011-02-14 23:00:00'],
                 reference_data=reference,
                 column_mapping=column_mapping)

target_drift.show()
Output hidden; open in https://colab.research.google.com to view.

Things are getting interesting!

We can see that the target distribution is now different: the similarity hypothesis is rejected. Put simply, people are renting more bikes, and this is a statistically significant change from our training period.

But the distribution of our predictions does not keep up! That is an obvious example of model decay: something new is happening in the world, but the model misses these new patterns.

It is tempting to investigate further. Is there anything in the data that can explain this change? If there is some new signal, retraining would likely help the model keep up. The Target Drift report has a section that helps us explore the relationship between the features and the target (or model predictions). When browsing through the individual features, we can check whether any new patterns stand out. We know that the predictions did not change, so we only look at the relationships with the target. For example, there is a shift towards higher temperatures (measured in Celsius) with a corresponding increase in rented bikes.

Maybe the model would pick up these patterns in retraining. But for now, we simply move on to the next week without any updates.

7.8.5 Week 3: when things go south

regression_perfomance = Report(metrics=[RegressionPreset()])
regression_perfomance.run(current_data=current.loc['2011-02-15 00:00:00':'2011-02-21 23:00:00'], 
                          reference_data=reference,
                          column_mapping=column_mapping)

regression_perfomance.show()
Output hidden; open in https://colab.research.google.com to view.

Okay, now things do look bad. In week 3, we face a major quality drop. Both absolute and percentage errors grew significantly. If we look at the plots, the model predictions are visibly scattered. We also face a new data segment with high demand that the model fails to predict. But even within the known range of target values, the model now makes errors. Things have changed since training, and we can see that the model does not extrapolate well: the predicted demand stays within the known range, while the actual values are peaking.

If we zoom in on specific days, it seems that the error is higher during specific (active) hours of the day. We are doing just fine from 10 pm to 6 am!

In our example, we particularly want to understand the segment where the model underestimates the target function. The Error Bias table gives us more details. We sort it by the “Range%” field. If the values of a specific feature are significantly different in the group where the model under- or over-estimates, this feature will rank high. In our case, we can see that the extreme errors depend on the “temp” (temperature) and “atemp” (feels-like temperature) features.

After this quick analysis, we have a more specific idea of the model’s performance and its weaknesses. The model faces new, unusually high demand. Given how it was trained, it tends to underestimate it. On top of that, these errors are not at all random: at the very least, they are related to the temperature we observe. The higher it is, the larger the underestimation. This suggests new weather-related patterns that the model could not have learned before. Days got warmer, and the model went rogue.
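We can reproduce the gist of this check with plain pandas. A rough sketch (the 5% cut-off is an arbitrary choice for illustration):

# Compare the temperature in the most under-estimated hours against the whole week
week3 = current.loc['2011-02-15 00:00:00':'2011-02-21 23:00:00'].copy()
week3['error'] = week3['prediction'] - week3['cnt']  # negative error = under-estimation

worst = week3.nsmallest(int(len(week3) * 0.05), 'error')  # the 5% most under-estimated hours
print('Mean temp, whole week:', round(week3['temp'].mean(), 3))
print('Mean temp, worst segment:', round(worst['temp'].mean(), 3))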

If we run a target drift report, we will also see a relevant change in the linear correlations between the features and the target. Temperature and humidity stand out.

target_drift = Report(metrics=[TargetDriftPreset()])
target_drift.run(current_data=current.loc['2011-02-15 00:00:00':'2011-02-21 23:00:00'],
                 reference_data=reference,
                 column_mapping=column_mapping)

target_drift.show()
Output hidden; open in https://colab.research.google.com to view.

We should retrain as soon as possible and do this often until we learn all the patterns. If we are not comfortable with frequent retraining, we might choose an algorithm that is more suitable for time series or is better in extrapolation.
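A simple course correction is to fold the newly labelled weeks into the training window and refit. A minimal sketch (in practice we would also re-evaluate the refreshed model against a holdout period before promoting it):

# Sketch of a periodic refit: extend the training window with the weeks whose actuals have arrived
labelled_so_far = raw_data.loc['2011-01-01 00:00:00':'2011-02-21 23:00:00']

updated_regressor = ensemble.RandomForestRegressor(random_state=0, n_estimators=50)
updated_regressor.fit(labelled_so_far[numerical_features + categorical_features],
                      labelled_so_far[target])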

7.8.6 Data Drift

In practice, once we receive the ground truth, we can indeed course-correct quickly. Had we retrained the model after week one, it would have likely ended less dramatically. But what if we do not have the ground truth available? Can we catch such decay in advance?

In this case, we can analyze data drift. It does not require actuals, since we are not calculating the error; instead, the goal is to see whether the input data has changed.

Once again, let’s compare the first week of production to our data in training. We can, of course, look at all our features. But we can also conclude that categorical features (like “season,” “holiday” and “workingday”) are not likely to change. Let’s look at numerical features only!

We specify these features so that the tool applies the correct statistical test. It would be Kolmogorov-Smirnov in this case.
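To make the mechanics concrete, this amounts to a two-sample Kolmogorov-Smirnov test per numerical feature. A minimal illustration with scipy (Evidently applies its own defaults and thresholds, so treat this only as an approximation of what the preset computes):

from scipy import stats

# Compare each numerical feature in week 1 against the training period
week1 = current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00']
for feature in numerical_features:
    statistic, p_value = stats.ks_2samp(reference[feature], week1[feature])
    print(f'{feature}: p-value = {p_value:.4f}')

With that intuition in place, let’s run the preset itself.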

column_mapping = ColumnMapping()

column_mapping.numerical_features = numerical_features
data_drift = Report(metrics = [DataDriftPreset()])
data_drift.run(current_data = current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00'],
               reference_data = reference,
               column_mapping=column_mapping)

data_drift.show()
Output hidden; open in https://colab.research.google.com to view.

The data drift report compares the distributions of each feature in the two datasets. It automatically picks an appropriate statistical test or metric based on the feature type and volume. It then returns p-values or distances and visually plots the distributions. You can also adjust the drift detection method or thresholds, or pass your own.
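For example, following the pattern from the commented-out snippet earlier, the preset accepts a statistical test and threshold. The string name below is an assumption about the registered stattests; check the drift-method reference for your Evidently version:

# 'psi' (Population Stability Index) is assumed to be a registered stattest name in this Evidently version
data_drift_custom = Report(metrics=[DataDriftPreset(stattest='psi', stattest_threshold=0.2)])
data_drift_custom.run(current_data=current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00'],
                      reference_data=reference,
                      column_mapping=column_mapping)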

Showing the default report above gives us the answer: there is already a statistical change in the feature distributions during the first week.

Let’s zoom in on our usual suspect: temperature. The report gives us two views of how the feature distribution evolves with time. We can notice how the observed temperature becomes higher day by day. The values clearly drift out of the green corridor (one standard deviation from the mean) that we saw in training. Looking at the steady growth, we can suspect an upward trend.
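The same corridor check is easy to approximate numerically. A quick sketch that flags the hours in week 1 where the observed temperature leaves the training mean plus or minus one standard deviation:

# Flag hours where temperature falls outside the training-period mean +/- 1 std band
temp_mean, temp_std = reference['temp'].mean(), reference['temp'].std()
first_week = current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00']
outside = first_week[(first_week['temp'] < temp_mean - temp_std) |
                     (first_week['temp'] > temp_mean + temp_std)]
print(f'{len(outside)} of {len(first_week)} hours fall outside the training corridor')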

As we checked earlier, we did not detect drift in the model predictions after week one. Given that our model is not good at extrapolating, we should not really expect it. Such prediction drift might still happen and would signal issues like broken input data, and we would be more likely to observe it with a more sensitive model. Regardless, data drift alone provides excellent early monitoring: it lets us detect the change and react to it before the ground truth arrives.

For more information, please refer to https://github.com/evidentlyai/evidently, https://github.com/SeldonIO/alibi-detect, https://github.com/great-expectations/great_expectations or https://github.com/whylabs/whylogs.

7.9 References

  1. https://github.com/ageron/handson-ml2/blob/master/19_training_and_deploying_at_scale.ipynb
  2. https://github.com/bentoml/BentoML
  3. https://github.com/streamlit/streamlit
  4. https://raw.githubusercontent.com/dataprofessor/code/master/streamlit/part2/iris-ml-app.py
  5. https://gradio.app/image-classification-in-tensorflow/
  6. https://evidentlyai.com/blog/tutorial-1-model-analytics-in-production