Artificial Intelligence 3: Datasets


This lesson will help you earn points for selected technology in the Demo video section of the rubric.

In this lesson, you will…

  • Learn about different types of datasets
  • Start to plan the dataset for your project

Key Terms and Concepts

  • Datasets - large sets of data that are used to teach AI to recognize patterns and predict something
  • Microcontroller - small computer on a single integrated chip, used in larger computers and other systems such as appliances, vehicles, and robots
  • Sensor - a device that detects changes in the environment and is used to monitor that information within an electronic system
  • Class - a label given to the examples in a dataset; an AI model learns to sort new inputs into these classes

The first step in creating an AI model that can classify something is to plan the dataset.

You want to make sure you have lots of data to train your AI model. In general, the more examples you give the model, the better it will perform. The data should also be balanced between your different classes or labels: aim for approximately the same number of examples for each class, so the model is not biased toward one class over another.
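The balance check described above can be sketched in a few lines of Python. This is only an illustration: the labels, the function names, and the 20% tolerance are assumptions made for the example, not part of the lesson.

```python
from collections import Counter


def class_counts(labels):
    """Count how many examples carry each class label."""
    return Counter(labels)


def is_balanced(labels, tolerance=0.2):
    """Return True if the smallest class has at least
    (1 - tolerance) times as many examples as the largest class.
    The 20% tolerance is an assumed rule of thumb."""
    counts = class_counts(labels)
    if not counts:
        return False  # an empty dataset cannot be balanced
    smallest = min(counts.values())
    largest = max(counts.values())
    return smallest >= (1 - tolerance) * largest


# Hypothetical labels for a face mask project
labels = ["mask"] * 60 + ["no_mask"] * 55
print(class_counts(labels))
print(is_balanced(labels))
```

If the check fails, collect more examples for the smaller class rather than throwing away examples from the larger one.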

As you gather training examples for your dataset, remember to set aside a portion of them to test the trained model. Testing with examples that were not used in training shows whether your model has learned well enough to make predictions about examples it has never seen. A good way to do this is to create two folders on your computer, one labelled Training Data and one labelled Testing Data. The bulk of your examples should go towards training the model, but a small portion, 10-20%, should be kept aside to test for accuracy.
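Here is a minimal sketch of that split in Python. The file names, the function name, and the fixed random seed are assumptions for illustration; the code only divides a list of names and does not move any files.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples and set aside a portion for testing.

    Returns (training_data, testing_data). The seed is fixed so the
    same split is produced every time the code runs.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical image file names for one class
image_files = [f"mask_{i:03d}.jpg" for i in range(100)]
training_data, testing_data = split_dataset(image_files, test_fraction=0.2)
print(len(training_data), len(testing_data))  # 80 20
```

Splitting each class separately in this way keeps both the Training Data and Testing Data folders balanced.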

You want to provide diverse examples. For example, say you are creating an AI model to detect if someone is wearing a face mask or not. You should gather images that reflect varied examples:

  • Different types and colors of masks
  • Different people - genders, ethnicities, ages
  • Different backgrounds - indoors, outdoors, light, dark
  • Different head angles
  • Different placement of head in frame - close, far, left side, right side
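One way to check for the variety listed above is to write a short note about each image and count how often each value appears. The sketch below assumes hypothetical notes and attribute names, and an arbitrary 30% threshold for calling a value rare.

```python
from collections import Counter

# Hypothetical notes describing each image in the mask class
image_notes = [
    {"mask_color": "blue", "background": "indoors", "distance": "close"},
    {"mask_color": "blue", "background": "indoors", "distance": "close"},
    {"mask_color": "purple", "background": "outdoors", "distance": "far"},
    {"mask_color": "blue", "background": "indoors", "distance": "far"},
]


def tally(notes, attribute):
    """Count how many images have each value of one attribute."""
    return Counter(note[attribute] for note in notes)


def underrepresented(notes, attribute, min_share=0.3):
    """List values that make up less than min_share of the images."""
    counts = tally(notes, attribute)
    total = sum(counts.values())
    return sorted(value for value, count in counts.items()
                  if count / total < min_share)


print(tally(image_notes, "mask_color"))
print(underrepresented(image_notes, "mask_color"))
```

A tally like this only reports values that appear at least once, so you still need to ask yourself which kinds of examples are missing entirely.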

 

What if you only trained your model using images of white men wearing blue surgical masks for your mask class? What happens when a woman of color wearing a purple mask uses your model? How do you think her image will be classified? Will your model perform well or not?

Photo credit: A Tiny CNN Architecture for Medical Face Mask Detection for Resource-Constrained Endpoints by Puranjay Mohan, Aditya Jyoti Paul, and Abhay Chirani

And of course, a dataset must contain the right kind of data. You can train AI models using numbers, text, images, and sounds. Make sure you choose the data type that is right for your project!

Determining what goes into your dataset gives you immense power! You must be careful to use lots of data, varied data, and the right type of data. Otherwise, your AI model will not be very accurate, and it could make bad predictions or take the wrong action. Taking the time to collect data that makes for a healthy dataset is critical to a successful model.

 

As a reminder, there are three possible ways to collect data for training an AI model: gathering it from your community, collecting it with sensors, or using public datasets.

Activity: Planning your Dataset

  1. Choose what data you want to collect. Decide what information you will need to train your model to work for your project. Think about types of data (sound, images, text, numbers).
  2. Decide where you will collect the data for your dataset. Will it be community, sensors, or public datasets?
  3. How will you collect the data? Give more detail behind your answer to #2.
  4. Decide what the classes or labels will be for your model. What will your model predict, and what categories does it need? You will need examples for each class you include.
  5. Decide on a target number of examples for each class. 50 per class should be a minimum.
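As you collect data, you can track your progress toward the target from step 5. The sketch below is one possible approach; the class names and counts are hypothetical, and the default target is the 50-per-class minimum mentioned above.

```python
MIN_PER_CLASS = 50  # minimum suggested in step 5


def remaining_to_collect(collected, target=MIN_PER_CLASS):
    """For each class, return how many more examples are needed
    to reach the target (0 if the target is already met)."""
    return {name: max(0, target - count)
            for name, count in collected.items()}


# Hypothetical number of examples gathered so far for each class
collected = {"mask": 62, "no_mask": 35}
print(remaining_to_collect(collected))  # {'mask': 0, 'no_mask': 15}
```

Pass a higher target if your plan calls for more than the minimum, for example remaining_to_collect(collected, target=100).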

 

Once you’ve got your dataset planning started, fill in more details:

  • If gathering from the community, start your survey or questions.
  • If gathering with sensors, list the sensors you will need.
  • If gathering from public datasets, start listing possible datasets you can use.

Additional Resources: Advanced Integrations

Hardware and Sensors

This video gives good information on the microcontroller hardware we recommend for projects using sensors: Sparkfun Workshop: Microcontrollers and Machine Learning

For a comprehensive list of sensors, check out this Wikipedia article.

 

Public datasets

If you choose to go with a public dataset, these sites will be invaluable for your project.