Getting JSON Handling Right
Motivation
I’m currently going through the Deployment of ML Models course on Udemy and in the sixth module, they start implementing a testing suite to ensure stability in a package comprised of pandas
, sklearn
, and numpy
.
I almost glossed over this section until I realized that I wasn’t totally up to speed what they were doing as far as formatting goes. I’ve written a ton about pd.read_csv()
and felt pretty comfortable about that.
However, what about .predict()
methods of my models? Specifically, what about the code that gets passed to them?
My Pipeline
object had a bunch of Transformer
objects chained together, spilling into a simple Linear_model
.
Pipeline(steps=[('categorical_imputer', CategoricalImputer()),
...
('Linear_model', Lasso(alpha=0.005, random_state=0))])
So in order for my pipeline to execute correctly, I needed my input data to be able to serve column values by key (e.g. impute these columns). Hence, I needed it to be a DataFrame
.
However, the prediction function that they used didn’t take a DataFrame
, but instead a stringed-JSON, which was converted into a DataFrame
within the function.
def make_prediction(*, input_data) -> dict:
"""Make a prediction using the saved model pipeline."""
data = pd.read_json(input_data)
prediction = _price_pipe.predict(data[config.FEATURES])
output = np.exp(prediction)
response = {"predictions": output}
return response
This made enough sense– some serialized data stream wasn’t going to come to me in a DataFrame
format. I should expect JSON.
So the trick was going from old, reliable pd.read_csv()
import pandas as pd
df = pd.read_csv('data/housing.csv')
df.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 80 columns
…into a JSON format that would unpack correctly into a DataFrame
once make_prediction()
got ahold of it.
Lots of JSON formats
So at this point, I needed to figure out how to de-serialize my DataFrame
into JSON, such that pd.read_json()
didn’t choke on it.
Little did I know, there were a bunch of different ways that this function will JSON-ify your data
orient : string
Indication of expected JSON string format.
* The format of the JSON string
- 'split' : dict like {'index' -> [index],
'columns' -> [columns], 'data' -> [values]}
- 'records' : list like
[{column -> value}, ... , {column -> value}]
- 'index' : dict like {index -> {column -> value}}
- 'columns' : dict like {column -> {index -> value}}
- 'values' : just the values array
- 'table' : dict like {'schema': {schema}, 'data': {data}}
describing the data, and the data component is
like ``orient='records'``.
The docs were very dense, so I figured it was worth enumerating the different options here.
split
orient='split'
basically de-constructs the dataset by its indexes
from json import loads
split = df.to_json(orient='split')
split[:500]
'{"columns":["Id","MSSubClass","MSZoning","LotFrontage","LotArea","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","OverallQual","OverallCond","YearBuilt","YearRemodAdd","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","MasVnrArea","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinSF1","BsmtFinType2","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","Hea'
res = loads(split)
res.keys()
dict_keys(['columns', 'index', 'data'])
You get the row and column indexes
display(res['columns'][:5])
display(res['index'][:5])
['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea']
[0, 1, 2, 3, 4]
As well as the data itself.
print(res['data'][:2])
[[1461, 20, 'RH', 80.0, 11622, 'Pave', None, 'Reg', 'Lvl', 'AllPub', 'Inside', 'Gtl', 'NAmes', 'Feedr', 'Norm', '1Fam', '1Story', 5, 6, 1961, 1961, 'Gable', 'CompShg', 'VinylSd', 'VinylSd', 'None', 0, 'TA', 'TA', 'CBlock', 'TA', 'TA', 'No', 'Rec', 468, 'LwQ', 144, 270, 882, 'GasA', 'TA', 'Y', 'SBrkr', 896, 0, 0, 896, 0, 0, 1, 0, 2, 1, 'TA', 5, 'Typ', 0, None, 'Attchd', 1961, 'Unf', 1, 730, 'TA', 'TA', 'Y', 140, 0, 0, 0, 120, 0, None, 'MnPrv', None, 0, 6, 2010, 'WD', 'Normal'], [1462, 20, 'RL', 81.0, 14267, 'Pave', None, 'IR1', 'Lvl', 'AllPub', 'Corner', 'Gtl', 'NAmes', 'Norm', 'Norm', '1Fam', '1Story', 6, 6, 1958, 1958, 'Hip', 'CompShg', 'Wd Sdng', 'Wd Sdng', 'BrkFace', 108, 'TA', 'TA', 'CBlock', 'TA', 'TA', 'No', 'ALQ', 923, 'Unf', 0, 406, 1329, 'GasA', 'TA', 'Y', 'SBrkr', 1329, 0, 0, 1329, 0, 0, 1, 1, 3, 1, 'Gd', 6, 'Typ', 0, None, 'Attchd', 1958, 'Unf', 1, 312, 'TA', 'TA', 'Y', 393, 36, 0, 0, 0, 0, None, None, 'Gar2', 12500, 6, 2010, 'WD', 'Normal']]
Which has the same shape as the original dataset
# Columns
display(len(res['data']))
# by row
display(len(res['data'][0]))
9
80
df.shape
(9, 80)
And is easy enough to reconstruct the original
pd.DataFrame(res['data'],
columns=res['columns'],
index=res['index'])
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | None | Reg | Lvl | AllPub | ... | 120 | 0 | None | MnPrv | None | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | 0 | None | None | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | 0 | None | MnPrv | None | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | 0 | None | None | None | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | None | IR1 | HLS | AllPub | ... | 144 | 0 | None | None | None | 0 | 1 | 2010 | WD | Normal |
5 | 1466 | 60 | RL | 75.0 | 10000 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | 0 | None | None | None | 0 | 4 | 2010 | WD | Normal |
6 | 1467 | 20 | RL | NaN | 7980 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | 0 | None | GdPrv | Shed | 500 | 3 | 2010 | WD | Normal |
7 | 1468 | 60 | RL | 63.0 | 8402 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | 0 | None | None | None | 0 | 5 | 2010 | WD | Normal |
8 | 1469 | 20 | RL | 85.0 | 10176 | Pave | None | Reg | Lvl | AllPub | ... | 0 | 0 | None | None | None | 0 | 2 | 2010 | WD | Normal |
9 rows × 80 columns
However, it fails because pandas
can’t find a uniform key length in split
try:
pd.read_json(split)
except Exception as e:
print(e)
arrays must all be same length
For obvious reasons
for key in res.keys():
print(key, len(res[key]))
columns 80
index 9
data 9
index
index
creates a simple dictionary of indexes
idx = df.to_json(orient='index')
idx[:500]
'{"0":{"Id":1461,"MSSubClass":20,"MSZoning":"RH","LotFrontage":80.0,"LotArea":11622,"Street":"Pave","Alley":null,"LotShape":"Reg","LandContour":"Lvl","Utilities":"AllPub","LotConfig":"Inside","LandSlope":"Gtl","Neighborhood":"NAmes","Condition1":"Feedr","Condition2":"Norm","BldgType":"1Fam","HouseStyle":"1Story","OverallQual":5,"OverallCond":6,"YearBuilt":1961,"YearRemodAdd":1961,"RoofStyle":"Gable","RoofMatl":"CompShg","Exterior1st":"VinylSd","Exterior2nd":"VinylSd","MasVnrType":"None","MasVnrAr'
res = loads(idx)
res.keys()
dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8'])
And nested underneath that is each column of data, per row.
from itertools import islice
for k, v in islice(res['0'].items(), 5):
print(k.ljust(25), res['0'][k])
print('...')
Id 1461
MSSubClass 20
MSZoning RH
LotFrontage 80.0
LotArea 11622
...
This is almost right– just lopsided
pd.read_json(idx).T
1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | SaleType | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 896 | 0 | 0 | None | 2 | 1Fam | TA | No | 468 | 144 | ... | WD | 120 | Pave | 5 | 882 | AllPub | 140 | 1961 | 1961 | 2010 |
1 | 1329 | 0 | 0 | None | 3 | 1Fam | TA | No | 923 | 0 | ... | WD | 0 | Pave | 6 | 1329 | AllPub | 393 | 1958 | 1958 | 2010 |
2 | 928 | 701 | 0 | None | 3 | 1Fam | TA | No | 791 | 0 | ... | WD | 0 | Pave | 6 | 928 | AllPub | 212 | 1997 | 1998 | 2010 |
3 | 926 | 678 | 0 | None | 3 | 1Fam | TA | No | 602 | 0 | ... | WD | 0 | Pave | 7 | 926 | AllPub | 360 | 1998 | 1998 | 2010 |
4 | 1280 | 0 | 0 | None | 2 | TwnhsE | TA | No | 263 | 0 | ... | WD | 144 | Pave | 5 | 1280 | AllPub | 0 | 1992 | 1992 | 2010 |
5 | 763 | 892 | 0 | None | 3 | 1Fam | TA | No | 0 | 0 | ... | WD | 0 | Pave | 7 | 763 | AllPub | 157 | 1993 | 1994 | 2010 |
6 | 1187 | 0 | 0 | None | 3 | 1Fam | TA | No | 935 | 0 | ... | WD | 0 | Pave | 6 | 1168 | AllPub | 483 | 1992 | 2007 | 2010 |
7 | 789 | 676 | 0 | None | 3 | 1Fam | TA | No | 0 | 0 | ... | WD | 0 | Pave | 7 | 789 | AllPub | 0 | 1998 | 1998 | 2010 |
8 | 1341 | 0 | 0 | None | 2 | 1Fam | TA | Gd | 637 | 0 | ... | WD | 0 | Pave | 5 | 1300 | AllPub | 192 | 1990 | 1990 | 2010 |
9 rows × 80 columns
columns
Columns is basically the same thing, but in reverse
cols = df.to_json(orient='columns')
cols[:500]
'{"Id":{"0":1461,"1":1462,"2":1463,"3":1464,"4":1465,"5":1466,"6":1467,"7":1468,"8":1469},"MSSubClass":{"0":20,"1":20,"2":60,"3":60,"4":120,"5":60,"6":20,"7":60,"8":20},"MSZoning":{"0":"RH","1":"RL","2":"RL","3":"RL","4":"RL","5":"RL","6":"RL","7":"RL","8":"RL"},"LotFrontage":{"0":80.0,"1":81.0,"2":74.0,"3":78.0,"4":43.0,"5":75.0,"6":null,"7":63.0,"8":85.0},"LotArea":{"0":11622,"1":14267,"2":13830,"3":9978,"4":5005,"5":10000,"6":7980,"7":8402,"8":10176},"Street":{"0":"Pave","1":"Pave","2":"Pave",'
res = loads(cols)
res.keys()
dict_keys(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])
Each column key houses dicts
of the form (index, data)
from itertools import islice
for k, v in islice(res['Id'].items(), 5):
print(k.ljust(25), res['Id'][k])
print('...')
0 1461
1 1462
2 1463
3 1464
4 1465
...
This would work fine
pd.read_json(cols)
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | None | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | None | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | None | None | 0 | 1 | 2010 | WD | Normal |
5 | 1466 | 60 | RL | 75.0 | 10000 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 4 | 2010 | WD | Normal |
6 | 1467 | 20 | RL | NaN | 7980 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | GdPrv | Shed | 500 | 3 | 2010 | WD | Normal |
7 | 1468 | 60 | RL | 63.0 | 8402 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 5 | 2010 | WD | Normal |
8 | 1469 | 20 | RL | 85.0 | 10176 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 2 | 2010 | WD | Normal |
9 rows × 80 columns
values
Returns a simple nested list
of the values– no columns or index
values = df.to_json(orient='values')
values[:500]
'[[1461,20,"RH",80.0,11622,"Pave",null,"Reg","Lvl","AllPub","Inside","Gtl","NAmes","Feedr","Norm","1Fam","1Story",5,6,1961,1961,"Gable","CompShg","VinylSd","VinylSd","None",0,"TA","TA","CBlock","TA","TA","No","Rec",468,"LwQ",144,270,882,"GasA","TA","Y","SBrkr",896,0,0,896,0,0,1,0,2,1,"TA",5,"Typ",0,null,"Attchd",1961,"Unf",1,730,"TA","TA","Y",140,0,0,0,120,0,null,"MnPrv",null,0,6,2010,"WD","Normal"],[1462,20,"RL",81.0,14267,"Pave",null,"IR1","Lvl","AllPub","Corner","Gtl","NAmes","Norm","Norm","1F'
res = loads(values)
type(res)
list
print(res[0][:10])
print(res[1][:10])
[1461, 20, 'RH', 80.0, 11622, 'Pave', None, 'Reg', 'Lvl', 'AllPub']
[1462, 20, 'RL', 81.0, 14267, 'Pave', None, 'IR1', 'Lvl', 'AllPub']
Also works for creating a DataFrame
, however, the Pipeline
object can’t select by column name.
pd.read_json(values)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | None | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | None | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | None | None | 0 | 1 | 2010 | WD | Normal |
5 | 1466 | 60 | RL | 75.0 | 10000 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 4 | 2010 | WD | Normal |
6 | 1467 | 20 | RL | NaN | 7980 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | GdPrv | Shed | 500 | 3 | 2010 | WD | Normal |
7 | 1468 | 60 | RL | 63.0 | 8402 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 5 | 2010 | WD | Normal |
8 | 1469 | 20 | RL | 85.0 | 10176 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 2 | 2010 | WD | Normal |
9 rows × 80 columns
default / records
By default, the function turns each column into its own dict, which in turn has each row/index as a nested dict.
This is what we want.
from json import loads
default = df.to_json()
default[:500]
'{"Id":{"0":1461,"1":1462,"2":1463,"3":1464,"4":1465,"5":1466,"6":1467,"7":1468,"8":1469},"MSSubClass":{"0":20,"1":20,"2":60,"3":60,"4":120,"5":60,"6":20,"7":60,"8":20},"MSZoning":{"0":"RH","1":"RL","2":"RL","3":"RL","4":"RL","5":"RL","6":"RL","7":"RL","8":"RL"},"LotFrontage":{"0":80.0,"1":81.0,"2":74.0,"3":78.0,"4":43.0,"5":75.0,"6":null,"7":63.0,"8":85.0},"LotArea":{"0":11622,"1":14267,"2":13830,"3":9978,"4":5005,"5":10000,"6":7980,"7":8402,"8":10176},"Street":{"0":"Pave","1":"Pave","2":"Pave",'
Columns
loads(default).keys()
dict_keys(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])
Rows/Indexes
loads(default)['Id']
{'0': 1461,
'1': 1462,
'2': 1463,
'3': 1464,
'4': 1465,
'5': 1466,
'6': 1467,
'7': 1468,
'8': 1469}
Works great.
pd.read_json(default)
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | None | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | None | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | None | None | 0 | 1 | 2010 | WD | Normal |
5 | 1466 | 60 | RL | 75.0 | 10000 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 4 | 2010 | WD | Normal |
6 | 1467 | 20 | RL | NaN | 7980 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | GdPrv | Shed | 500 | 3 | 2010 | WD | Normal |
7 | 1468 | 60 | RL | 63.0 | 8402 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 5 | 2010 | WD | Normal |
8 | 1469 | 20 | RL | 85.0 | 10176 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | None | None | 0 | 2 | 2010 | WD | Normal |
9 rows × 80 columns
The only difference between calling with the default None
value and records
is that records
will first sort your columns.
recs = df.to_json(orient='records')
pd.read_json(recs)
1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | SaleType | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 896 | 0 | 0 | NaN | 2 | 1Fam | TA | No | 468 | 144 | ... | WD | 120 | Pave | 5 | 882 | AllPub | 140 | 1961 | 1961 | 2010 |
1 | 1329 | 0 | 0 | NaN | 3 | 1Fam | TA | No | 923 | 0 | ... | WD | 0 | Pave | 6 | 1329 | AllPub | 393 | 1958 | 1958 | 2010 |
2 | 928 | 701 | 0 | NaN | 3 | 1Fam | TA | No | 791 | 0 | ... | WD | 0 | Pave | 6 | 928 | AllPub | 212 | 1997 | 1998 | 2010 |
3 | 926 | 678 | 0 | NaN | 3 | 1Fam | TA | No | 602 | 0 | ... | WD | 0 | Pave | 7 | 926 | AllPub | 360 | 1998 | 1998 | 2010 |
4 | 1280 | 0 | 0 | NaN | 2 | TwnhsE | TA | No | 263 | 0 | ... | WD | 144 | Pave | 5 | 1280 | AllPub | 0 | 1992 | 1992 | 2010 |
5 | 763 | 892 | 0 | NaN | 3 | 1Fam | TA | No | 0 | 0 | ... | WD | 0 | Pave | 7 | 763 | AllPub | 157 | 1993 | 1994 | 2010 |
6 | 1187 | 0 | 0 | NaN | 3 | 1Fam | TA | No | 935 | 0 | ... | WD | 0 | Pave | 6 | 1168 | AllPub | 483 | 1992 | 2007 | 2010 |
7 | 789 | 676 | 0 | NaN | 3 | 1Fam | TA | No | 0 | 0 | ... | WD | 0 | Pave | 7 | 789 | AllPub | 0 | 1998 | 1998 | 2010 |
8 | 1341 | 0 | 0 | NaN | 2 | 1Fam | TA | Gd | 637 | 0 | ... | WD | 0 | Pave | 5 | 1300 | AllPub | 192 | 1990 | 1990 | 2010 |
9 rows × 80 columns
Which we can (painstakingly) show by de-serializing default_json
, sorting the keys, reserializing and comparing recs
default_json = loads(default)
sorted_default_json = {k: default_json[k] for k in sorted(default_json)}
sorted_default_json.keys()
dict_keys(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation', 'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea', 'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id', 'KitchenAbvGr', 'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig', 'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning', 'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold', 'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PavedDrive', 'PoolArea', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'ScreenPorch', 'Street', 'TotRmsAbvGrd', 'TotalBsmtSF', 'Utilities', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd', 'YrSold'])
from json import dumps
# .fillna('') because NULL doesn't play well w ==
a = pd.read_json(dumps(sorted_default_json)).fillna('')
b = pd.read_json(recs).fillna('')
Woof.
table
Finally, orient='table'
takes records
and does one better
table = df.to_json(orient='table')
table[:500]
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"Id","type":"integer"},{"name":"MSSubClass","type":"integer"},{"name":"MSZoning","type":"string"},{"name":"LotFrontage","type":"number"},{"name":"LotArea","type":"integer"},{"name":"Street","type":"string"},{"name":"Alley","type":"number"},{"name":"LotShape","type":"string"},{"name":"LandContour","type":"string"},{"name":"Utilities","type":"string"},{"name":"LotConfig","type":"string"},{"name":"LandSlope","type":"string"},{"name":"N'
The data
value is the same as if we’d done orient='records'
res = loads(table)
res.keys()
dict_keys(['schema', 'data'])
display(len(res['data']))
display(len(loads(recs)))
9
9
Except for the fact that data
holds onto the Index
column
len(res['data'][0].keys())
81
len(loads(recs)[0].keys())
80
(set(res['data'][0].keys())
-
(set(loads(recs)[0])))
{'index'}
but schema
gives us a simple schema definition, like we might expect work working with PySpark. Which is pretty cool.
loads(table)['schema']
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'Id', 'type': 'integer'},
{'name': 'MSSubClass', 'type': 'integer'},
{'name': 'MSZoning', 'type': 'string'},
{'name': 'LotFrontage', 'type': 'number'},
{'name': 'LotArea', 'type': 'integer'},
{'name': 'Street', 'type': 'string'},
{'name': 'Alley', 'type': 'number'},
{'name': 'LotShape', 'type': 'string'},
{'name': 'LandContour', 'type': 'string'},
{'name': 'Utilities', 'type': 'string'},
{'name': 'LotConfig', 'type': 'string'},
{'name': 'LandSlope', 'type': 'string'},
{'name': 'Neighborhood', 'type': 'string'},
{'name': 'Condition1', 'type': 'string'},
{'name': 'Condition2', 'type': 'string'},
{'name': 'BldgType', 'type': 'string'},
{'name': 'HouseStyle', 'type': 'string'},
{'name': 'OverallQual', 'type': 'integer'},
{'name': 'OverallCond', 'type': 'integer'},
{'name': 'YearBuilt', 'type': 'integer'},
{'name': 'YearRemodAdd', 'type': 'integer'},
{'name': 'RoofStyle', 'type': 'string'},
{'name': 'RoofMatl', 'type': 'string'},
{'name': 'Exterior1st', 'type': 'string'},
{'name': 'Exterior2nd', 'type': 'string'},
{'name': 'MasVnrType', 'type': 'string'},
{'name': 'MasVnrArea', 'type': 'integer'},
{'name': 'ExterQual', 'type': 'string'},
{'name': 'ExterCond', 'type': 'string'},
{'name': 'Foundation', 'type': 'string'},
{'name': 'BsmtQual', 'type': 'string'},
{'name': 'BsmtCond', 'type': 'string'},
{'name': 'BsmtExposure', 'type': 'string'},
{'name': 'BsmtFinType1', 'type': 'string'},
{'name': 'BsmtFinSF1', 'type': 'integer'},
{'name': 'BsmtFinType2', 'type': 'string'},
{'name': 'BsmtFinSF2', 'type': 'integer'},
{'name': 'BsmtUnfSF', 'type': 'integer'},
{'name': 'TotalBsmtSF', 'type': 'integer'},
{'name': 'Heating', 'type': 'string'},
{'name': 'HeatingQC', 'type': 'string'},
{'name': 'CentralAir', 'type': 'string'},
{'name': 'Electrical', 'type': 'string'},
{'name': '1stFlrSF', 'type': 'integer'},
{'name': '2ndFlrSF', 'type': 'integer'},
{'name': 'LowQualFinSF', 'type': 'integer'},
{'name': 'GrLivArea', 'type': 'integer'},
{'name': 'BsmtFullBath', 'type': 'integer'},
{'name': 'BsmtHalfBath', 'type': 'integer'},
{'name': 'FullBath', 'type': 'integer'},
{'name': 'HalfBath', 'type': 'integer'},
{'name': 'BedroomAbvGr', 'type': 'integer'},
{'name': 'KitchenAbvGr', 'type': 'integer'},
{'name': 'KitchenQual', 'type': 'string'},
{'name': 'TotRmsAbvGrd', 'type': 'integer'},
{'name': 'Functional', 'type': 'string'},
{'name': 'Fireplaces', 'type': 'integer'},
{'name': 'FireplaceQu', 'type': 'string'},
{'name': 'GarageType', 'type': 'string'},
{'name': 'GarageYrBlt', 'type': 'integer'},
{'name': 'GarageFinish', 'type': 'string'},
{'name': 'GarageCars', 'type': 'integer'},
{'name': 'GarageArea', 'type': 'integer'},
{'name': 'GarageQual', 'type': 'string'},
{'name': 'GarageCond', 'type': 'string'},
{'name': 'PavedDrive', 'type': 'string'},
{'name': 'WoodDeckSF', 'type': 'integer'},
{'name': 'OpenPorchSF', 'type': 'integer'},
{'name': 'EnclosedPorch', 'type': 'integer'},
{'name': '3SsnPorch', 'type': 'integer'},
{'name': 'ScreenPorch', 'type': 'integer'},
{'name': 'PoolArea', 'type': 'integer'},
{'name': 'PoolQC', 'type': 'number'},
{'name': 'Fence', 'type': 'string'},
{'name': 'MiscFeature', 'type': 'string'},
{'name': 'MiscVal', 'type': 'integer'},
{'name': 'MoSold', 'type': 'integer'},
{'name': 'YrSold', 'type': 'integer'},
{'name': 'SaleType', 'type': 'string'},
{'name': 'SaleCondition', 'type': 'string'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}