AI: Datasets

  • Learn about different types of datasets
  • Start to plan the dataset for your project's AI model

These are the activities for this lesson:

HEALTHY DATASETS

The first step in creating an AI model that can classify something is to plan the dataset.

Healthy datasets (lots of data, different examples of data, the right kind of data) lead to correct actions or decisions.

AI NEEDS DATA

Keep the following qualities in mind when gathering examples for your dataset.

QUANTITY

The more examples you can give the model, the better it will perform. Provide at least 50 examples for each class.
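One way to check this guideline is to count the examples in each class before training. The sketch below assumes your labels are stored in a plain Python list; the function name `classes_below_minimum` and the sample labels are made up for illustration.

```python
from collections import Counter

MIN_PER_CLASS = 50  # minimum suggested in this lesson


def classes_below_minimum(labels, minimum=MIN_PER_CLASS):
    """Return {class_name: count} for every class with fewer than `minimum` examples."""
    counts = Counter(labels)
    return {name: n for name, n in counts.items() if n < minimum}


# Hypothetical labels for a face mask project
labels = ["mask"] * 62 + ["no_mask"] * 35
print(classes_below_minimum(labels))  # {'no_mask': 35}
```

An empty result means every class that appears in the list has reached the minimum. A class with no examples at all will not show up, so compare the result against your planned list of classes too.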

BALANCE

You should have about the same number of examples for each class, in order to prevent bias for one over the other.
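A rough way to measure balance is to compare the smallest class with the largest one. This is only a sketch: the function name `balance_ratio`, the sample labels, and the 0.8 warning threshold are assumptions chosen for illustration, not rules from this lesson.

```python
from collections import Counter


def balance_ratio(labels):
    """Return smallest class count divided by largest class count (1.0 = perfectly balanced)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Hypothetical labels for a two-class project
labels = ["mask"] * 60 + ["no_mask"] * 45
ratio = balance_ratio(labels)
print(f"balance ratio: {ratio:.2f}")  # balance ratio: 0.75

if ratio < 0.8:  # threshold is an assumption, adjust for your project
    print("Classes look unbalanced; consider collecting more of the smaller class.")
```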

TEST DATA

Keep a portion of your examples separate to test the trained model. You will need some examples that were not used to train the model to test if your model is accurate.
About 10-20% of your data should be set aside as test data.
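Setting test data aside can be as simple as shuffling the examples and cutting off a slice. The sketch below uses only Python's standard library; the function name `split_dataset`, the file names, and the seed value are hypothetical, and 20% is picked from the upper end of the suggested range.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and return (train, test) lists."""
    shuffled = list(examples)               # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for real examples
examples = [f"image_{i:03d}.jpg" for i in range(100)]
train, test = split_dataset(examples, test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

Running the split once per class, instead of on the whole dataset at once, keeps the test set as balanced as the training set.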

DIVERSITY

You also want to include varied examples.

For example, say you are creating an AI model to detect if someone is wearing a face mask or not. You should gather images that reflect varied examples:

 

  • Different types and colors of masks
  • Different people – genders, ethnicities, ages
  • Different backgrounds – indoors, outdoors, light, dark
  • Different head angles
  • Different placement of head in frame – close, far, left side, right side

What if you trained your model using only images of white men wearing blue surgical masks for your mask class? What happens when a woman of color wearing a purple mask uses your model? How do you think she will be classified? Will your model perform well or not?
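If you jot down a few notes about each example as you collect it, a quick tally can reveal gaps in diversity. The sketch below assumes each image is described by a small dictionary; the attribute names (`mask_color`, `background`), the sample values, and the function name `tally` are all hypothetical.

```python
from collections import Counter

# Hypothetical notes about each image in the "mask" class
examples = [
    {"mask_color": "blue", "background": "indoors"},
    {"mask_color": "blue", "background": "indoors"},
    {"mask_color": "blue", "background": "outdoors"},
    {"mask_color": "purple", "background": "indoors"},
]


def tally(examples, attribute):
    """Count how many examples have each value of `attribute`."""
    return Counter(example[attribute] for example in examples)


for attribute in ("mask_color", "background"):
    print(attribute, dict(tally(examples, attribute)))
```

A value that appears only once or twice, or not at all, points to the kind of example you may still need to collect.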

TYPES OF DATA

A dataset must also be the right kind of data. Make sure you choose the data type that is right for your project! The options are:

Numbers

statistical data, demographic information, sensor data

Text

messages, social media posts, books, articles, websites

Sound

music, recordings, voices

Images

faces, places ... anything!

AI GIVES YOU POWER

Determining what goes into your dataset gives you immense power!

Be careful to use lots of data, different data, and the right type of data.

Otherwise, your AI model may:

  • not be very accurate
  • make bad predictions
  • take the wrong action

Taking the time to collect data that will make for a healthy dataset is critical to a successful model.

GATHERING DATA

There are 3 ways to collect data for training your model: from your community, from sensors, or from public datasets.

MORE ON SENSORS

There are many low cost sensors that can connect to small microcontrollers and provide your project with data. Here are some sensors that could be used.

Camera

Speedometer

Microphone

Light sensor

Pressure sensor

Air quality sensor

Infrared Thermometer

Proximity sensor

ACTIVITY: PLAN YOUR DATASET

Estimated time: 45 minutes

Follow the instructions in the worksheet to outline:

  • What data you want to collect.
  • Where you will collect the data for your dataset. Will it be community, sensors, or public datasets?
  • How will you collect the data? What will the classes or labels be for your model?
  • How many examples for each class? 50 per class should be a minimum.
Open worksheet

Best practices: Encourage students to think about the problems they have in their day-to-day lives. Is there a dataset that relates to those problems? Are there any sensors in the items around them? What kind of information are these sensors gathering? How could they use that information? (For example, the new Google phone has a temperature sensor.)

Guiding questions to ask students: Does your city have an “Open Data” portal? Examples: New York City and Edmonton, Canada.

Mentor tips are provided by support from AmeriCorps.

REFLECTION

You now have a plan for your dataset! As you start to gather the examples for your dataset, keep them safe and well organized.

Don’t forget to keep a portion of the dataset for testing! About 10-20% should be kept separate for testing.

REVIEW OF KEY TERMS

  • Datasets – large sets of data that are used to teach AI to recognize patterns and predict something

  • Sensor – a device that detects changes in the environment and is used to monitor that information within an electronic system

  • Microcontroller – small computer on a single integrated chip, used in larger computers and other systems such as appliances, vehicles, and robots

ADDITIONAL RESOURCES

Hardware and Sensors


For a comprehensive list of sensors, check out this Wikipedia article.

This video gives good information on the microcontroller hardware we recommend for projects using sensors.

This video tutorial shows you how to access a public dataset on Kaggle.