These are the activities for this lesson:
- Train a machine learning model using Python libraries in a Jupyter Notebook
- Incorporate a trained machine learning model into a Streamlit web app
WORKING WITH MACHINE LEARNING MODELS
Another major feature of Jupyter Notebooks, Python, and Streamlit is the ability to train machine learning models and make predictions.
If you are new to Artificial Intelligence, you might want to view the AI lessons in this curriculum to learn the basics before jumping into the more advanced coding involved here. You can use a user-friendly machine learning model platform like Teachable Machine to make a model and still incorporate it into a Python web app.
If you have had some experience with artificial intelligence and working with datasets using Jupyter Notebooks, this is a good next step for you.
In this lesson, you will learn about some of the Python machine learning libraries, and some of the different machine learning models you can create using Python.
To review, creating a machine learning model involves three main parts.
DATASET → FINDS PATTERNS WITH LEARNING ALGORITHM → PREDICTION!
The dataset is your input for the model. It can include text, images, sounds, or poses. We worked with text and numerical data in Unit 2 using Jupyter Notebooks. We will continue to work with text data in this lesson, in the form of a spreadsheet.
Finding patterns is essentially building the machine learning model from the dataset. Python has many libraries that will build an AI model from data. In this curriculum, we will use many of the functions from the scikit-learn package. In addition to the libraries it provides, the scikit-learn website contains lots of excellent information about machine learning and the process of building models. It is a great resource to learn more!
Once you have created your model, the model can be used to predict an outcome based on new information. Once again, Python supplies libraries that enable this.
PREPROCESSING DATA
Before your dataset can be sent to the algorithm for building the model, it must be preprocessed, or “cleaned” so that the model building algorithm can work with it and make the most accurate model possible. In fact, the bulk of the work for creating a machine learning model is in the preprocessing. You will need to look carefully at the data to decide what is important, what can be left out, and what needs to be cleaned up.
What does preprocessing involve? With a text-based dataset, here are some things to deal with.
Null Values
Sometimes the dataset contains blank or null values, especially if the data is survey data. One option is to eliminate any rows that have null data.
However, if the number of samples is low, you may not want to eliminate them. Another option is to replace a null value with some other value. It could be zero, or it could be the average of all the other values for that field.
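Here is a minimal sketch of handling null values with pandas. The file name and the "bmi" column are just examples; substitute the names from your own dataset.

```python
import pandas as pd

# Load the dataset (the file name here is just an example)
df = pd.read_csv("stroke_data.csv")

# See how many null values each column has
print(df.isnull().sum())

# Option 1: drop every row that contains a null value
df_no_nulls = df.dropna()

# Option 2: replace nulls in a numeric column (e.g. "bmi") with the column's mean
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
```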
Outliers
Sometimes the data contains one or two samples that are very different from the rest of the data. This might skew the model. You don't want the outliers to affect how the model is created, so often outliers are eliminated from the dataset.
For example, you might have a dataset where 95% of the samples are people aged 10-30, but you have a few random samples where the people are over 50. Since the vast majority of samples are in the 10-30 age group, you can consider removing the samples from the older age group.
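Continuing with the df DataFrame from the sketch above, removing outliers can be as simple as filtering rows. The "age" column and the cutoff of 50 are only illustrative; choose them based on what you see in your own data.

```python
# Keep only the rows whose age falls in the range most samples occupy
# (the column name and cutoff are illustrative values)
df_filtered = df[df["age"] <= 50]

print(f"Removed {len(df) - len(df_filtered)} outlier rows")
```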
Standardization
Often the numbers in a large dataset vary widely, depending on what features are represented. For example, you might have age, which varies from 0 to 70, but also salary, which ranges from 0 to 500,000! The scales are very different, so one feature might count more in the model than another.
To fix this, you can standardize the data so it's on a single scale. scikit-learn provides a StandardScaler, which rescales each feature so the mean is 0 and the standard deviation is 1.
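Here is a minimal sketch of standardizing numeric columns with StandardScaler, again using the df from the earlier sketches. The column names are assumptions; substitute the numeric features in your own dataset.

```python
from sklearn.preprocessing import StandardScaler

# The column names here are just examples of numeric features
numeric_cols = ["age", "avg_glucose_level", "bmi"]

# Rescale each column so its mean is 0 and its standard deviation is 1
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```

If you scale your training data, keep the fitted scaler so you can apply the same scaling to new inputs later in your app.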
Another thing to consider is a balanced number of samples for each class or label. You want to make sure there are about the same number of samples for each class in your dataset.
Encoding
AI likes numbers, not necessarily words. So it helps to have all of the data converted to numbers. scikit-learn provides Encoder functions, so you can easily convert a range of possible text values to a range of numbers.
An example could be activity levels, with sample values sedentary, light, moderate, high. Those responses could be encoded to be values 0, 1, 2, and 3, which is much easier for the model building algorithm to handle.
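A minimal sketch of that encoding, assuming a hypothetical "activity_level" column containing those four text values:

```python
from sklearn.preprocessing import OrdinalEncoder

# Map the ordered text values to 0, 1, 2, 3 (the column name is hypothetical)
encoder = OrdinalEncoder(categories=[["sedentary", "light", "moderate", "high"]])
df[["activity_level"]] = encoder.fit_transform(df[["activity_level"]])
```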
SPLITTING DATA
Once you have preprocessed your data, you need to split it into a training set and a testing set. The training set will be used to train and create the model. Then you will test your model using the test set to see how it performs.
There are standard ways to split it (usually 75% for training and 25% for testing), but you can split it however you want. Again, scikit-learn provides functions so this is all automated for you.
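A minimal sketch using scikit-learn's train_test_split, assuming the target column is named "stroke" (adjust to your dataset's target):

```python
from sklearn.model_selection import train_test_split

# X holds the feature columns, y holds the target we want to predict
X = df.drop(columns=["stroke"])
y = df["stroke"]

# 75% of the rows go to training, 25% to testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```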
CREATING THE MODEL
The next step is creating the model. A big decision to make is, which algorithm do I use? There are many different supervised learning algorithms to choose from, and it is hard to know which one to use. A good process is to try out several different algorithms and then see which one gives you the best accuracy.
The first step is to decide whether you need a classification algorithm or a regression algorithm. That depends on what you are trying to predict.
Classification algorithms are used to predict discrete targets or classes. For example, classifying email as spam or not spam would be a classification problem.
Regression algorithms are used to predict something that is along a continuous range. One example would be predicting how much salary a person will be paid. The prediction is a range of numbers and the output could be any value along that range.
Here are just a few of the popular types of model creation algorithms.
Classification
- Decision Tree
- Random Forest
- K-Nearest Neighbors
- Naive Bayes
- Logistic Regression
- Support Vector Machine
Regression
- Linear Regression
- Ridge Regression
- Lasso Regression
- Polynomial Regression
- Bayesian Linear Regression
- Support Vector Regression
Note that some algorithms come in both forms. For example, there is a Decision Tree Classifier as well as a Decision Tree Regressor, and a Support Vector Machine handles classification while Support Vector Regression handles regression.
So, how do you decide which one to use? You can research what other data scientists use in different situations to see what might work for your model. You should also try out some different algorithms with your data and find which one provides the best accuracy. You can also tweak parameters for a particular algorithm to see if it returns a more accurate model.
scikit-learn provides functions for all of these algorithms, so it's straightforward to create the model.
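For example, here is a minimal sketch of training two different classifiers on the same training data so you can compare them later (the algorithm choices are just examples):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Train a decision tree and a k-nearest neighbors model on the same data
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
```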
EVALUATING THE MODEL
You want your model to be the best it can be, so you need to evaluate its performance. Two common sources of error in evaluating a model are bias and variance.
Bias is the difference between the model’s predicted value and the correct value.
Variance is how much the predictions change when different data is used.
You want to achieve a good balance between bias and variance.
High bias -> underfitting.
Underfitting happens when the model is too simple to capture the underlying patterns in the training data. This can happen if there is not enough data, not enough features (columns), or too much noise in the data. If a model performs poorly on both the training data and the testing data, that signals underfitting.
High variance -> overfitting.
Overfitting happens if you train a model on one set of data and it performs very well, but if you then present it with new data, it does not perform well at all. This can happen if the model is overly complex and it tries to fit too closely to the training data. The model may predict very well on the training data, but then performs poorly on testing data.
One technique to check the performance of a model is cross-validation.
Cross-validation means training your model several times, using different splits of training/testing data each time. Your dataset is split into several folds, or subsets. Then one fold is held out as the validation or test set, and the remaining folds are used to train it. This is performed several times, so each time the training and test sets change.
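A minimal sketch of cross-validation with scikit-learn, reusing the tree_model and the X and y from the earlier sketches:

```python
from sklearn.model_selection import cross_val_score

# Train and evaluate the model 5 times, each time holding out a different fold
scores = cross_val_score(tree_model, X, y, cv=5)

print(scores)          # the accuracy for each fold
print(scores.mean())   # the average accuracy across all folds
```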
scikit-learn also provides a metrics library so you can easily get performance scores for your models.
- accuracy score = correct predictions/total predictions
- precision = true positives/(true positives + false positives)
- recall = true positives/(true positives + false negatives)
- F1 score = (2 x precision x recall)/(precision + recall)
- specificity = true negatives/(true negatives + false positives)
- confusion matrix – shows true positive, true negative, false positive, and false negative counts
By checking metrics from various algorithms, you can choose the best model.
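Here is a minimal sketch of computing some of those metrics for one of the models from the earlier sketches:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Predict on the held-out test set and score the predictions
y_pred = tree_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```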
PREDICTING!
Once you have a model that you are satisfied with, you then want to use it in your app.
It is common practice to do the preprocessing, creation, and evaluation of your model using Python, in an environment like Jupyter Notebooks. From there, you can export your model as a file.
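One common way to export the model is Python's pickle module. A minimal sketch, assuming you are saving the tree_model from the earlier sketches (the file name is just an example):

```python
import pickle

# Save the trained model to a file so the Streamlit app can load it later
with open("stroke_model.pkl", "wb") as f:
    pickle.dump(tree_model, f)
```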
Then, within your Streamlit app, you can load the model and use it to make predictions.
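A minimal sketch of what that could look like in a Streamlit app. The two inputs here are placeholders; a real app would collect every feature the model was trained on, in the same order, and apply the same preprocessing (such as scaling) before predicting.

```python
import pickle
import streamlit as st

# Load the model that was exported from the notebook
with open("stroke_model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("Stroke Risk Predictor")

# Placeholder inputs; a real app needs all of the model's features
age = st.number_input("Age", min_value=0, max_value=100, value=30)
glucose = st.number_input("Average glucose level", min_value=0.0, value=100.0)

if st.button("Predict"):
    prediction = model.predict([[age, glucose]])
    st.write("Predicted stroke risk:", int(prediction[0]))
```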
In this lesson’s activities, you will go through this entire process using a stroke risk dataset. You will see how to preprocess the data, create models using different algorithms, and then use a model in a simple Streamlit app to predict the risk of stroke, given some input characteristics.
ACTIVITY 1: TRAIN AN AI MODEL IN JUPYTER NOTEBOOK
Explore a Stroke Risk Dataset to build an AI model
- Download a stroke prediction dataset from Kaggle.
- Work with the data in a Jupyter Notebook to:
  - Review the data
  - Preprocess the data to prepare it for the model
  - Create some different models
  - Evaluate and choose a model for your app
  - Export the model
CHALLENGE
Try out a different model than the ones in the Jupyter Notebook.
- Research the scikit-learn website for other classification algorithms, and look at other model-building examples on Kaggle.
- Choose one algorithm and add the code to your notebook to create the model.
- Use the scikit-learn metrics to check for accuracy.
How does your model perform? Is it better than any of the other algorithms in the notebook?
ACTIVITY 2: BUILD A PREDICTION APP
Use your model in a Streamlit App
CHALLENGE
The Iris dataset is a classic dataset that classifies iris flowers into 3 species (setosa, versicolor, and virginica) based on the petal and sepal dimensions.
- Do some research on the dataset to learn its features and targets.
- You can download the dataset and create your own model or use this model (pickle file) created using K-nearest neighbors. Note that no scaler was needed for this dataset. The pickle file contains just the model.
- Import the model and create a Streamlit app to predict the iris species based on the four dataset features.
TECHNOVATION INSPIRATION
Here are some pretty amazing examples from Technovation participants who used Python and Streamlit to build web apps incorporating machine learning models.
T.E.D.D.Y – Text-based Early Distress Detector for Youth, by Team TEDDY of the USA, assists teachers and counselors in the early detection of students’ mental-health concerns. TEDDY uses AI to identify sentences expressing negative sentiment or showing the language patterns expected of individuals with depression. Then, students may be referred to a counselor for support.
REFLECTION
You have gone through the entire process of preprocessing a dataset, building several models, and evaluating and choosing one to use in an app. That is A LOT to learn in one lesson!
REVIEW OF KEY TERMS
- Preprocessing – taking a dataset and making sure the data in it is suitable to train a machine learning model with
- Classification algorithm – an algorithm used to train a machine learning model that will classify or predict discrete values
- Regression algorithm – algorithm used to train a machine learning model to predict a value on a continuous range
- Bias – the difference between the model’s predicted value and the correct value, due to incorrect assumptions that simplify the model
- Variance – the amount of variability in model predictions, when a model is unable to generalize when faced with new data
- Overfitting – when the model fits the training data so closely that it cannot predict well on new data, caused by high variance in the model
- Underfitting – when a model is simplified too much and does not perform well on either training or testing data, caused by high bias or assumptions in the model
ADDITIONAL RESOURCES
Machine Learning
- Geeks for Geeks Machine Learning Tutorial – gives a great introduction and overview of some of the machine learning process and terminology
- Simplilearn’s Scikit-Learn tutorial – goes through a wine quality dataset to practice using scikit-learn in a Jupyter Notebook
- Machine Learning with scikit-learn – a full YouTube playlist by Data School
Streamlit
- Build a Machine Learning Web App from Scratch with Patrick Loeber – another example of building a model and using it in a web app, from start to finish
- Image Classifier with Streamlit, if your model is trained on images