External Model
You can get the code for running this guide from the Getting Started guide.
First, import all the required modules:
import previsionio as pio
import yaml
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.neighbors import KNeighborsClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import numpy as np
import logging
Set up your account token (see Using The API) and some parameters for your project, such as its name and the names of the datasets.
Note that you must always create a Project to host datasets and experiments.
import os
from os.path import join
from dotenv import load_dotenv

load_dotenv()

PROJECT_NAME = "Sklearn models Comparison"
TRAINSET_NAME = "fraud_train"
HOLDOUT_NAME = "fraud_holdout"
INPUT_PATH = join("data", "assets")
TARGET = 'fraude'


pio.client.init_client(
    token=os.environ['PIO_MASTER_TOKEN'],
    prevision_url=os.environ['DOMAIN'])
Create a new project, or reuse an existing one:
projects_list = pio.Project.list()

# Create a new Project or reuse an existing one
if PROJECT_NAME not in [p.name for p in projects_list]:
    project = pio.Project.new(name=PROJECT_NAME, description="An experiment using ")
else:
    project = [p for p in projects_list if p.name == PROJECT_NAME][0]
Add the datasets to the project, or get the existing ones if already uploaded (a dataset is automatically uploaded to your account when you create it):
datasets_list = project.list_datasets()
dataset_names = [d.name for d in datasets_list]

if TRAINSET_NAME in dataset_names:
    train = [d for d in datasets_list if d.name == TRAINSET_NAME][0]
else:
    train = project.create_dataset(file_name=join(INPUT_PATH, "trainset_fraud.csv"), name=TRAINSET_NAME)

if HOLDOUT_NAME in dataset_names:
    test = [d for d in datasets_list if d.name == HOLDOUT_NAME][0]
else:
    test = project.create_dataset(file_name=join(INPUT_PATH, "holdout_fraud.csv"), name=HOLDOUT_NAME)
Be careful to convert the data to the right type before building your training set:
train_data = train.data.astype(np.float32)
test_data = test.data.astype(np.float32)

X_train = train_data.drop(TARGET, axis=1)
y_train = train_data[TARGET]
Then train some classifiers (you may upload many models at once) and create a YAML file to hold the models' configuration.
classifiers = [
    {
        "name": "lrsklearn",
        "algo": LogisticRegression(max_iter=3000)
    },
    {
        "name": "knnsk",
        "algo": KNeighborsClassifier(3)
    }
]

initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]


config = {}
config["class_names"] = [str(c) for c in set(y_train)]
config["input"] = [str(feature) for feature in X_train.columns]
with open(join(INPUT_PATH, 'logreg_fraude.yaml'), 'w') as f:
    yaml.dump(config, f)
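For reference, here is roughly what the generated config file looks like. The class names come from the float-typed target (hence `'0.0'` and `'1.0'`), while the feature names below are purely illustrative placeholders, not the actual columns of the fraud dataset:

```yaml
class_names:
- '0.0'
- '1.0'
input:
- feature_a
- feature_b
```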
scikit-learn Pipelines are supported, so you may build any pipeline you want as long as you provide the right config file. Convert each of your models to an ONNX file once fitted:
for clf in classifiers:
    clr = make_pipeline(OrdinalEncoder(), clf["algo"])
    clr.fit(X_train, y_train)

    onx = convert_sklearn(clr, initial_types=initial_type)
    with open(join(INPUT_PATH, f'{clf["name"]}_logreg_fraude.onnx'), 'wb') as f:
        f.write(onx.SerializeToString())
Finally, use the Project's create_external_classification method to upload all your models at once into the same experiment.
Note
You can upload many ONNX files in the same experiment in order to benchmark them. To do that, you must provide a list of tuples, one for each ONNX file, with:
a name
the path to your ONNX file
the path to your config file (often the same for each model)
external_models = [(clf["name"],
                    join(INPUT_PATH, f'{clf["name"]}_logreg_fraude.onnx'),
                    join(INPUT_PATH, 'logreg_fraude.yaml')) for clf in classifiers]

exp = project.create_external_classification(experiment_name='fraud_sklearn_comparison',
                                             dataset=train,
                                             holdout_dataset=test,
                                             target_column=TARGET,
                                             external_models=external_models)