Iris (Classification)
The Iris dataset is one of the more famous classification problems, and it ships with scikit-learn. We can load it directly from the datasets submodule.
Loading the Data
from sklearn.datasets import load_iris
data = load_iris()
Doing so gives us a Bunch object
type(data)
sklearn.utils.Bunch
Which is basically a dictionary, with attribute-style access to its keys layered on top
data.__class__.__bases__
(dict,)
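Beyond the dict interface, a Bunch also lets you read each key as an attribute. A minimal sketch, assuming a reasonably recent scikit-learn; the names bunch, by_key, and by_attr are only for illustration:

```python
from sklearn.datasets import load_iris

bunch = load_iris()

# Bunch subclasses dict, so ordinary dictionary access works
by_key = bunch['data']

# ...and the same array is also reachable as an attribute
by_attr = bunch.data

print(isinstance(bunch, dict))
print(by_key is by_attr)
```

Both spellings return the very same array, so which one to use is a matter of taste.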
Inspecting the Data
Let’s look at the keys
data.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
The data and target keys are just NumPy arrays
print(type(data['data']), data['data'].shape)
print(type(data['target']), data['target'].shape)
<class 'numpy.ndarray'> (150, 4)
<class 'numpy.ndarray'> (150,)
Whereas feature_names holds the column names for the data array
print(data['feature_names'])
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
And target_names holds the string labels for the integer codes in the target array.
print(data['target_names'])
['setosa' 'versicolor' 'virginica']
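Because the integer codes in target double as positions in target_names, NumPy integer-array indexing can translate the whole array in one step. A small sketch (the variable names labels, names, and counts are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# each code in target is used as an index into target_names
labels = data['target_names'][data['target']]

# first few translated labels
print(labels[:3])

# count how many samples fall under each name
names, counts = np.unique(labels, return_counts=True)
print(dict(zip(names, counts)))
```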
Using the Data
Sklearn
The data’s already split into X and y, so let’s assign it as such.
X = data['data']
y = data['target']
Done deal.
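From here, X and y can go straight into an estimator. The following is only a sketch: the k-nearest-neighbors model, the 25% test split, and the random_state are assumptions picked for illustration, not something the dataset prescribes. It also uses return_X_y=True, which skips the Bunch entirely.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# return_X_y=True hands back (X, y) directly instead of a Bunch
X, y = load_iris(return_X_y=True)

# hold out a quarter of the rows, keeping class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# illustrative model choice; any classifier with fit/score would do
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```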
Pandas
A bit trickier. Basically, we want to merge our X and y together
import numpy as np
values = np.c_[X, y]
Then stuff those into a DataFrame
import pandas as pd
df = pd.DataFrame(values)
df.head()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
And label the data accordingly
cols = data['feature_names'] + ['flower_names']
df.columns = cols
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower_names |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
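The flower_names values show up as 0.0 rather than 0 because np.c_ packs everything into one float array, so the integer codes get upcast. If that matters, one alternative (a sketch, with df_alt as an illustrative name) is to name the feature columns at construction time and attach the target as its own column:

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# name the four feature columns when the frame is built
df_alt = pd.DataFrame(data['data'], columns=data['feature_names'])

# adding the target separately preserves its integer dtype
df_alt['flower_names'] = data['target']

print(df_alt.dtypes)
```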
And if we wanted to un-encode the flower_names column, we’d make a dictionary mapping number to flower name.
# verbose, but also generic
d = dict(zip(range(len(data['target_names'])), data['target_names']))
d
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}
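For what it's worth, enumerate builds the same number-to-name mapping a little more tersely. A sketch, using d_alt as an illustrative name so it doesn't clobber d:

```python
from sklearn.datasets import load_iris

data = load_iris()

# enumerate pairs each name with its position, which is its integer code
d_alt = dict(enumerate(data['target_names']))
print(d_alt)
```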
And map it over the flower_names column.
df['flower_names'] = df['flower_names'].map(d)
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower_names |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
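As a possible shortcut, newer scikit-learn releases (0.23 and later, as far as I can tell) accept as_frame=True and return pandas objects directly. The sketch below assumes such a version is installed. It also swaps the dictionary lookup for pd.Categorical.from_codes, a different technique that keeps the labels as a categorical column; df2 is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects
data = load_iris(as_frame=True)

# 'frame' holds the four feature columns plus a 'target' column
df2 = data['frame'].copy()

# from_codes turns the integer codes into a labelled categorical column
df2['flower_names'] = pd.Categorical.from_codes(
    df2['target'].to_numpy(), categories=data['target_names']
)

print(df2.head())
```

Note that the target column is called target here rather than flower_names, so the integer codes and the readable labels end up side by side.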
Some More Description
The DESCR key gives a pretty good overview of what we’re dealing with
print(data['DESCR'])
Iris Plants Database
====================
Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris
The famous Iris database, first used by Sir R.A Fisher
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
References
----------
- Fisher,R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
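The summary statistics in that description are easy to recompute from the arrays themselves. The sketch below does that, reading "Class Correlation" as the Pearson correlation between each feature and the integer class code (an assumption on my part). It then checks the "linearly separable" remark using a single petal-length threshold. Small differences from the printed table are possible, since the bundled data and description may vary between scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X, y = data['data'], data['target']

# recompute Min / Max / Mean / SD / Class Correlation per feature
for i, name in enumerate(data['feature_names']):
    col = X[:, i]
    corr = np.corrcoef(col, y)[0, 1]  # Pearson correlation with the class code
    print(f"{name:<18} {col.min():4.1f} {col.max():4.1f} "
          f"{col.mean():5.2f} {col.std(ddof=1):5.2f} {corr:8.4f}")

# setosa (code 0) versus the rest, compared on petal length alone
petal_length = X[:, 2]
setosa_max = petal_length[y == 0].max()
others_min = petal_length[y != 0].min()
print(setosa_max, others_min)
```

If the largest setosa petal length is below the smallest petal length of the other two classes, any threshold between the two values separates setosa from the rest.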