- Learn about different types of datasets
- Start planning the dataset for your project's AI model, which will predict something
These are the activities for this lesson:
HEALTHY DATASETS
The first step in creating an AI model that can classify something is to plan the dataset.
Healthy Datasets
Lots of data
Different examples of data
The right kind of data
Correct actions or decisions
AI NEEDS DATA
Keep the following qualities in mind when gathering examples for your dataset.
QUANTITY
The more examples you can give the model, the better it will perform. Provide at least 50 examples for each class.
BALANCE
You should have about the same number of examples for each class to prevent bias toward one class over another.
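The quantity and balance guidelines above can be checked with a short script. This is a minimal sketch, not part of the lesson's tools: the function name `check_dataset`, the example labels, and the 1.5x imbalance limit are assumptions chosen for illustration. Only the minimum of 50 examples per class comes from this lesson.

```python
from collections import Counter

MIN_PER_CLASS = 50    # minimum recommended in this lesson
MAX_IMBALANCE = 1.5   # assumption: largest class at most 1.5x the smallest


def check_dataset(labels, min_per_class=MIN_PER_CLASS, max_imbalance=MAX_IMBALANCE):
    """Return a list of warnings about class size and class balance."""
    counts = Counter(labels)
    warnings = []
    # Quantity: every class needs enough examples.
    for name, count in sorted(counts.items()):
        if count < min_per_class:
            warnings.append(f"{name}: only {count} examples, need at least {min_per_class}")
    # Balance: compare the largest class with the smallest one.
    if counts:
        smallest = min(counts.values())
        largest = max(counts.values())
        if largest > smallest * max_imbalance:
            warnings.append(f"unbalanced: largest class has {largest}, smallest has {smallest}")
    return warnings


# Hypothetical labels for a face mask project
labels = ["mask"] * 120 + ["no_mask"] * 40
for warning in check_dataset(labels):
    print(warning)
```

With these made-up labels, the script reports that the `no_mask` class is too small and that the two classes are unbalanced.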
TEST DATA
Keep a portion of your examples separate to test the trained model. You will need some examples that were not used to train the model to test if your model is accurate.
About 10-20% of your data should be test data.
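One way to set test data aside is to shuffle the examples in each class and hold back a fixed fraction. The sketch below assumes your examples are stored as (item, label) pairs. The function name `split_train_test` and the file names are hypothetical. Splitting each class separately is a design choice that keeps the test set as balanced as the training set.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Split a list of (item, label) pairs into train and test lists.

    Each class is split separately so the test set keeps the same balance.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_label = {}
    for item, label in examples:
        by_label.setdefault(label, []).append(item)

    train, test = [], []
    for label, items in by_label.items():
        items = items[:]  # copy so the caller's list is not reordered
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_fraction))
        test += [(item, label) for item in items[:n_test]]
        train += [(item, label) for item in items[n_test:]]
    return train, test


# Hypothetical file names for a face mask project
examples = [(f"mask_{i}.jpg", "mask") for i in range(50)]
examples += [(f"no_mask_{i}.jpg", "no_mask") for i in range(50)]
train, test = split_train_test(examples, test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

The examples in `test` are never shown to the model during training, so they give an honest check of how accurate it is.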
DIVERSITY
You also want to include varied examples.
For example, say you are creating an AI model to detect if someone is wearing a face mask or not. You should gather images that reflect varied examples:
- Different types and colors of masks
- Different people – genders, ethnicities, ages
- Different backgrounds – indoors, outdoors, light, dark
- Different head angles
- Different placement of head in frame – close, far, left side, right side
What if you only trained your model using images of white men with blue surgical masks for your mask class? What happens when a woman of color wearing a purple mask uses your model? How do you think she will be classified? Will your model perform well or not?
TYPES OF DATA
A dataset must also be the right kind of data. Make sure you choose the data type that is right for your project! The options are:
- Numbers: statistical data, demographic information, sensor data
- Text: messages, social media posts, books, articles, websites
- Sound: music, recordings, voices
- Images: faces, places ... anything!
AI GIVES YOU POWER
Determining what goes into your dataset gives you immense power!
Be careful to use lots of data, different data, and the right type of data.
Otherwise, your AI model might:
- not be very accurate
- make bad predictions
- take the wrong action
Taking the time to collect data that will make for a healthy dataset is critical to a successful model.
GATHERING DATA
There are 3 ways to collect data for training your model: from your community, from public datasets, and from sensors.
If your project focuses directly on your community, your community might be the logical place to get the data you need. Make sure you have permission to use the data!
How will you collect data in your community?
- take pictures?
- ask community members to give you pictures?
- record sounds?
- use a survey?
- interview community members?
If you are going to need lots and lots of data for your model, you might look into public datasets. There are many datasets available online that can provide you with large amounts of data quickly.
Here are some good dataset sites:
Be sure you review the data to make sure it fits the criteria above for a healthy dataset.
You most likely will also have to make some changes to the data to fit your needs. For example, tools like Teachable Machine require images that are square, so you might need to edit the dataset images to fit the correct dimensions for the tool you are using.
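As an illustration of this kind of change, the sketch below works out where to crop a rectangular image so that the result is square. The function name `center_square_box` is hypothetical, and cropping from the center is only one possible choice. The lesson does not name an image library. If you use Pillow, its `Image.crop` method accepts a box in this (left, top, right, bottom) form.

```python
def center_square_box(width, height):
    """Return (left, top, right, bottom) for the largest centered square."""
    side = min(width, height)        # the square can be no bigger than the short side
    left = (width - side) // 2       # trim equally from the left and right
    top = (height - side) // 2       # trim equally from the top and bottom
    return (left, top, left + side, top + side)


print(center_square_box(640, 480))  # (80, 0, 560, 480)
```

Check a few cropped images by eye, because a center crop can cut off a subject that sits near the edge of the frame.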
You can also collect data using sensors connected to a microcontroller. Microcontrollers are small computers on a single integrated circuit that are used to control devices like automobile engines and home appliances. Some microcontrollers have built-in sensors. Many have options to connect sensors to them.
Each of the three recommended microcontrollers below has its own particular features, and each may require a different programming language to work with your project. Some of the tools, like App Inventor, have extensions you can add to use these devices with those tools. All three devices have recently added AI capability, so check out what is possible!
MORE ON SENSORS
There are many low cost sensors that can connect to small microcontrollers and provide your project with data. Here are some sensors that could be used.
- Camera
- Speedometer
- Microphone
- Light sensor
- Pressure sensor
- Air quality sensor
- Infrared thermometer
- Proximity sensor
ACTIVITY: PLAN YOUR DATASET
Follow the instructions in the worksheet to outline:
- What data you want to collect.
- Where you will collect the data for your dataset. Will it be community, sensors, or public datasets?
- How you will collect the data, and what the classes or labels will be for your model.
- How many examples you will collect for each class. 50 per class should be the minimum.
Mentor Tip
Best practices: Encourage the students to think about the problems that they have in their day-to-day lives. Is there a dataset that relates to those problems? Are there any sensors in the items around you? What kind of information are these sensors gathering? How could you use those sensors? (For example, the new Google phone has a temperature sensor.)
Guiding Questions to ask students: Does your city have an “Open Data” portal? Examples: NYC and Edmonton, Canada.
Mentor tips are provided by support from AmeriCorps.
REFLECTION
You now have a plan for your dataset! As you start to gather the examples for your dataset, keep them safe and well organized.
Don’t forget to keep a portion of the dataset for testing! About 10-20% should be kept separate for testing.
REVIEW OF KEY TERMS
Datasets – large sets of data that are used to teach AI to recognize patterns and predict something
Sensor – a device that detects changes in the environment and is used to monitor that information within an electronic system
Microcontroller – small computer on a single integrated chip, used in larger computers and other systems such as appliances, vehicles, and robots
ADDITIONAL RESOURCES
Hardware and Sensors
For a comprehensive list of sensors, check out this Wikipedia article.
This video gives good information on the microcontroller hardware we recommend for projects using sensors.
This video tutorial shows you how to access a public dataset on Kaggle.