Machine Learning Coding Tutorial 3. What Makes a Good Feature?

In the previous tutorial, we used a decision tree as the classifier. A classifier is only as good as the features you provide.

That means coming up with good features is one of your most important jobs in machine learning.

1. Dog Classifier

Imagine we want to write a classifier to tell the difference between two types of dogs: greyhounds and Labradors.

Here we’ll use one feature: the dog’s height in inches. Greyhounds are usually taller than Labradors.

2. Coding

Let’s head into Python for a programmatic example.

Create a Python file named dogs.py and write the following code in it.

Please read the comments carefully to understand what the code does.

"""
GoodTecher Machine Learning Coding Tutorial
http://www.goodtecher.com

Machine Learning Coding Tutorial 3. What Makes a Good Feature?

The program simulates one measurement (a dog's height) for two breeds
and displays the distribution of heights for each breed as a histogram
"""

import numpy as np
import matplotlib.pyplot as plt

# creates 500 Greyhounds and 500 Labradors
number_of_greyhounds = 500
number_of_labradors = 500

# Greyhounds are on average 28 inches tall
# Labradors are on average 24 inches tall
# Let's say height is normally distributed,
# with a standard deviation of 4 inches for both breeds.
# The following code generates an array of 500 Greyhound heights
# and an array of 500 Labrador heights
greyhounds_heights = 28 + 4 * np.random.randn(number_of_greyhounds)
labradors_heights = 24 + 4 * np.random.randn(number_of_labradors)

# visualize in histogram,
# Greyhounds are in red, Labradors are in blue
plt.hist([greyhounds_heights, labradors_heights], stacked=True, color=['r', 'b'])
plt.show()

Run the program with the following command in Terminal (Mac) or Command Prompt (Windows):

python dogs.py

A pop-up window showing a histogram should appear.

3. Explanation

On the left side of the histogram, a dog is most likely a Labrador. On the other hand, if we go all the way to the right of the histogram and look at a dog that is 35 inches tall, we can be confident it is a greyhound.

In the middle, the two distributions overlap, so the probabilities of the two breeds are close and height alone cannot tell them apart.
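To put rough numbers on these three regions, here is a minimal sketch that computes the probability that a dog of a given height is a greyhound. It assumes the same distributions the simulation uses (means of 28 and 24 inches, a standard deviation of 4 inches) and equal numbers of each breed. The helper names normal_pdf and prob_greyhound are made up for this illustration.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def prob_greyhound(height, grey_mean=28.0, lab_mean=24.0, std=4.0):
    """P(greyhound | height), assuming equal numbers of each breed."""
    grey = normal_pdf(height, grey_mean, std)
    lab = normal_pdf(height, lab_mean, std)
    return grey / (grey + lab)


# One height from the left, the middle, and the right of the histogram
for height in (20, 26, 35):
    print(f"{height} inches: P(greyhound) = {prob_greyhound(height):.2f}")
```

Under these assumptions a 26-inch dog is a coin flip (probability 0.5), while a 35-inch dog comes out at roughly 0.9 in favor of greyhound.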

So height is a useful feature, but it’s not perfect. That’s why in machine learning you almost always need multiple features. Otherwise, you could just write an if statement instead of bothering with a classifier.
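The "just write an if statement" idea can be tried directly. The sketch below regenerates heights the same way dogs.py does and scores a one-line height rule. The threshold of 26 inches (the midpoint between the two means), the fixed random seed, and the function name classify_by_height are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Same assumed distributions as in dogs.py
greyhounds_heights = 28 + 4 * rng.standard_normal(500)
labradors_heights = 24 + 4 * rng.standard_normal(500)


def classify_by_height(height, threshold=26.0):
    """One-feature rule: a plain if statement on height."""
    if height > threshold:
        return "greyhound"
    return "labrador"


# Count how many simulated dogs the rule labels correctly
correct = sum(classify_by_height(h) == "greyhound" for h in greyhounds_heights)
correct += sum(classify_by_height(h) == "labrador" for h in labradors_heights)
accuracy = correct / (len(greyhounds_heights) + len(labradors_heights))
print(f"accuracy of the height rule: {accuracy:.2f}")
```

With these distributions the rule is right only about 69% of the time on average, which is why a single imperfect feature is not enough.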

Ideal features are

  • informative
  • independent
  • easy to understand

informative

For example, the eye color of a dog is useless for telling what type of dog it is.

independent

For example, height in inches and height in centimeters are redundant: they carry exactly the same information.
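One quick way to spot this kind of redundancy is to measure the correlation between two features. This is a small sketch on simulated heights (the mean of 26 inches and the seed are arbitrary choices for the example); a correlation of 1 means the second feature adds nothing new.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Simulated dog heights in inches, then the same heights in centimeters
height_in = 26 + 4 * rng.standard_normal(1000)
height_cm = height_in * 2.54

# Pearson correlation between the two features
correlation = np.corrcoef(height_in, height_cm)[0, 1]
print(f"correlation between inches and centimeters: {correlation:.3f}")
```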

easy to understand

For example, to estimate the time it takes to fly from one city to another, the distance between the two cities is a better feature than the longitude and latitude of each city.
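Turning raw coordinates into a distance is a small piece of feature engineering. The sketch below uses the haversine formula, which treats the Earth as a sphere, to compute a great-circle distance. The function name great_circle_km and the rounded city coordinates are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate (latitude, longitude) pairs, rounded for the example
new_york = (40.71, -74.01)
los_angeles = (34.05, -118.24)

distance = great_circle_km(*new_york, *los_angeles)
print(f"New York to Los Angeles: about {distance:.0f} km")
```

A single distance value relates to flight time far more directly than four separate coordinate values do, so a classifier or regressor has much less to figure out on its own.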
