# Iris (Classification)

The Iris dataset is one of the more famous classification problems, and scikit-learn ships with a copy that we can load via its `datasets` submodule.

```python
from sklearn.datasets import load_iris

data = load_iris()
```

Doing so gives us a `Bunch` object.

```python
type(data)
```
```
sklearn.utils.Bunch
```


This is basically a dictionary with some extras bolted on:

```python
data.__class__.__bases__
```
```
(dict,)
```
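One of those extras is attribute-style access: every key in the `Bunch` is also reachable as an attribute. A quick check (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

data = load_iris()

# dict-style and attribute-style access return the same underlying arrays
print((data['data'] == data.data).all())  # True
print(list(data.target_names))            # ['setosa', 'versicolor', 'virginica']
```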


## Inspecting the Data

Let’s look at the keys

```python
data.keys()
```
```
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
```


The `data` and `target` keys hold plain NumPy arrays

```python
print(type(data['data']), data['data'].shape)
print(type(data['target']), data['target'].shape)
```
```
<class 'numpy.ndarray'> (150, 4)
<class 'numpy.ndarray'> (150,)
```


Whereas `feature_names` holds exactly what the name suggests: the column names for `data`

```python
print(data['feature_names'])
```
```
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```


And `target_names` holds the string labels that the integers in the `target` array encode.

```python
print(data['target_names'])
```
```
['setosa' 'versicolor' 'virginica']
```
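Each integer in `target` indexes into `target_names` (0 is setosa, and so on). As a quick sanity check, `np.bincount` confirms the dataset is balanced at 50 samples per class:

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# count how many rows fall into each encoded class
print(np.bincount(data['target']))  # [50 50 50]
```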


## Using the Data

### Sklearn

The data already comes split into features and labels, so we just assign X and y accordingly.

```python
X = data['data']
y = data['target']
```

Done deal.
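From here, X and y can feed straight into any estimator. As a minimal sketch (the choice of k-nearest neighbors and a 75/25 split are arbitrary illustrations, not part of the walkthrough above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
X, y = data['data'], data['target']

# hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```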

### Pandas

This is a bit trickier: first, we want to merge X and y into a single array

```python
import numpy as np

values = np.c_[X, y]
```
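`np.c_` concatenates along the second axis, treating the 1-D y as a column, so the (150, 4) feature matrix picks up a fifth column. A quick shape check with stand-in arrays of the same dimensions:

```python
import numpy as np

# toy stand-ins with the same shapes as X and y
X = np.zeros((150, 4))
y = np.zeros(150)

values = np.c_[X, y]
print(values.shape)  # (150, 5)
```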

Then stuff those into a DataFrame

```python
import pandas as pd

df = pd.DataFrame(values)
df.head()
```
```
     0    1    2    3    4
0  5.1  3.5  1.4  0.2  0.0
1  4.9  3.0  1.4  0.2  0.0
2  4.7  3.2  1.3  0.2  0.0
3  4.6  3.1  1.5  0.2  0.0
4  5.0  3.6  1.4  0.2  0.0
```

And label the data accordingly

```python
cols = data['feature_names'] + ['flower_names']

df.columns = cols
df.head()
```
```
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  flower_names
0                5.1               3.5               1.4               0.2           0.0
1                4.9               3.0               1.4               0.2           0.0
2                4.7               3.2               1.3               0.2           0.0
3                4.6               3.1               1.5               0.2           0.0
4                5.0               3.6               1.4               0.2           0.0
```

And if we wanted to decode the `flower_names` column back into strings, we'd build a dictionary mapping each number to its flower name.

```python
# verbose, but also generic
d = dict(zip(range(len(data['target_names'])), data['target_names']))
d
```
```
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}
```


And map it over the `flower_names` column.

```python
df['flower_names'] = df['flower_names'].map(d)
df.head()
```
```
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) flower_names
0                5.1               3.5               1.4               0.2        setosa
1                4.9               3.0               1.4               0.2        setosa
2                4.7               3.2               1.3               0.2        setosa
3                4.6               3.1               1.5               0.2        setosa
4                5.0               3.6               1.4               0.2        setosa
```
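The same decoding works without building a dictionary at all: since `target_names` is itself a NumPy array, fancy indexing with the integer targets produces the string labels in one step.

```python
from sklearn.datasets import load_iris

data = load_iris()

# index the names array with the integer-coded targets
labels = data['target_names'][data['target']]
print(labels[:3])  # ['setosa' 'setosa' 'setosa']
```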

## Some More Description

The `DESCR` key gives a pretty good overview of what we're dealing with

```python
print(data['DESCR'])
```
```
Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al's AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
```