External Model
You can get the code for running this guide from the Getting Started guide.
First, import all the required modules:
import previsionio as pio
import yaml
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.neighbors import KNeighborsClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import numpy as np
import logging
Set up your account token (see Using The API) and some parameters for your project, such as its name and the names of the datasets.
Note that you must always create a Project to host datasets and experiments.
import os
from os.path import join
from dotenv import load_dotenv

load_dotenv()

PROJECT_NAME = "Sklearn models Comparison"
TRAINSET_NAME = "fraud_train"
HOLDOUT_NAME = "fraud_holdout"
INPUT_PATH = join("data", "assets")
TARGET = 'fraude'


pio.client.init_client(
    token=os.environ['PIO_MASTER_TOKEN'],
    prevision_url=os.environ['DOMAIN'])
Create a new project, or reuse an existing one:
projects_list = pio.Project.list()

# Create a new Project or reuse an existing one
if PROJECT_NAME not in [p.name for p in projects_list]:
    project = pio.Project.new(name=PROJECT_NAME, description="An experiment using ")
else:
    project = [p for p in projects_list if p.name == PROJECT_NAME][0]
Add the datasets to the project, or get the existing ones if already uploaded (a dataset is automatically uploaded to your account when you create it):
datasets_list = project.list_datasets()
dataset_names = [d.name for d in datasets_list]

if TRAINSET_NAME in dataset_names:
    train = [d for d in datasets_list if d.name == TRAINSET_NAME][0]
else:
    train = project.create_dataset(file_name=join(INPUT_PATH, "trainset_fraud.csv"), name=TRAINSET_NAME)

if HOLDOUT_NAME in dataset_names:
    test = [d for d in datasets_list if d.name == HOLDOUT_NAME][0]
else:
    test = project.create_dataset(file_name=join(INPUT_PATH, "holdout_fraud.csv"), name=HOLDOUT_NAME)
Be careful to convert the data to the right type before building your training set:
train_data = train.data.astype(np.float32)
test_data = test.data.astype(np.float32)

X_train = train_data.drop(TARGET, axis=1)
y_train = train_data[TARGET]
Then train some classifiers (you may upload many models at once) and create a YAML file to hold the models' configuration.
classifiers = [
    {
        "name": "lrsklearn",
        "algo": LogisticRegression(max_iter=3000)
    },
    {
        "name": "knnsk",
        "algo": KNeighborsClassifier(3)
    }
]

initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]


config = {}
config["class_names"] = [str(c) for c in set(y_train)]
config["input"] = [str(feature) for feature in X_train.columns]
with open(join(INPUT_PATH, 'logreg_fraude.yaml'), 'w') as f:
    yaml.dump(config, f)
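For reference, here is roughly what the generated config file looks like. The class names come from the float-typed target (hence `'0.0'` and `'1.0'`), while the feature names below are purely illustrative placeholders, not the actual columns of the fraud dataset:

```yaml
class_names:
- '0.0'
- '1.0'
input:
- feature_a
- feature_b
```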
scikit-learn Pipelines are supported, so you may build any pipeline you want as long as you provide the right config file. Convert each of your models to an ONNX file once fitted:
for clf in classifiers:
    clr = make_pipeline(OrdinalEncoder(), clf["algo"])
    clr.fit(X_train, y_train)

    onx = convert_sklearn(clr, initial_types=initial_type)
    with open(join(INPUT_PATH, f'{clf["name"]}_logreg_fraude.onnx'), 'wb') as f:
        f.write(onx.SerializeToString())
Finally, use the Project's create_external_classification method to upload all your models at once into the same experiment.
Note
You can upload many ONNX files in the same experiment in order to benchmark them. To do that, you must provide a list of tuples, one for each ONNX file, with:
a name
the path to your ONNX file
the path to your config file (often the same for each model)
external_models = [(clf["name"],
                    join(INPUT_PATH, f'{clf["name"]}_logreg_fraude.onnx'),
                    join(INPUT_PATH, 'logreg_fraude.yaml')) for clf in classifiers]

exp = project.create_external_classification(experiment_name='fraud_sklearn_comparison',
                                             dataset=train,
                                             holdout_dataset=test,
                                             target_column=TARGET,
                                             external_models=external_models)