Getting JSON Handling Right

Motivation

I’m currently going through the Deployment of ML Models course on Udemy, and in the sixth module they start implementing a testing suite to ensure stability in a package built on pandas, sklearn, and numpy.

I almost glossed over this section until I realized that I wasn’t totally up to speed on what they were doing as far as formatting goes. I’ve written a ton about pd.read_csv() and felt pretty comfortable with that.

However, what about the .predict() methods of my models? Specifically, what about the data that gets passed to them?

My Pipeline object had a bunch of Transformer objects chained together, spilling into a simple Linear_model.

Pipeline(steps=[('categorical_imputer', CategoricalImputer()),
                 ...
                ('Linear_model', Lasso(alpha=0.005, random_state=0))])

So in order for my pipeline to execute correctly, I needed my input data to be able to serve column values by key (e.g. impute these columns). Hence, I needed it to be a DataFrame.

However, the prediction function that they used didn’t take a DataFrame, but instead a JSON string, which was converted into a DataFrame inside the function.

def make_prediction(*, input_data) -> dict:
    """Make a prediction using the saved model pipeline."""

    data = pd.read_json(input_data)
    prediction = _price_pipe.predict(data[config.FEATURES])
    output = np.exp(prediction)
    response = {"predictions": output}

    return response

This made enough sense: a serialized data stream wasn’t going to come to me as a DataFrame. I should expect JSON.

So the trick was going from old, reliable pd.read_csv()

import pandas as pd

df = pd.read_csv('data/housing.csv')
df.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

…into a JSON format that would unpack correctly into a DataFrame once make_prediction() got ahold of it.

Lots of JSON formats

So at this point, I needed to figure out how to serialize my DataFrame into JSON, such that pd.read_json() didn’t choke on it.

Little did I know, there are a bunch of different ways that DataFrame.to_json() will JSON-ify your data

        orient : string
            Indication of expected JSON string format.

            * The format of the JSON string

              - 'split' : dict like {'index' -> [index],
                'columns' -> [columns], 'data' -> [values]}
              - 'records' : list like
                [{column -> value}, ... , {column -> value}]
              - 'index' : dict like {index -> {column -> value}}
              - 'columns' : dict like {column -> {index -> value}}
              - 'values' : just the values array
              - 'table' : dict like {'schema': {schema}, 'data': {data}}
                describing the data, and the data component is
                like ``orient='records'``.

The docs were very dense, so I figured it was worth enumerating the different options here.

split

orient='split' basically de-constructs the dataset into its column index, row index, and values

from json import loads

split = df.to_json(orient='split')
split[:500]
'{"columns":["Id","MSSubClass","MSZoning","LotFrontage","LotArea","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","OverallQual","OverallCond","YearBuilt","YearRemodAdd","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","MasVnrArea","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinSF1","BsmtFinType2","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","Hea'
res = loads(split)
res.keys()
dict_keys(['columns', 'index', 'data'])

You get the row and column indexes

display(res['columns'][:5])
display(res['index'][:5])
['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea']



[0, 1, 2, 3, 4]

As well as the data itself.

print(res['data'][:2])
[[1461, 20, 'RH', 80.0, 11622, 'Pave', None, 'Reg', 'Lvl', 'AllPub', 'Inside', 'Gtl', 'NAmes', 'Feedr', 'Norm', '1Fam', '1Story', 5, 6, 1961, 1961, 'Gable', 'CompShg', 'VinylSd', 'VinylSd', 'None', 0, 'TA', 'TA', 'CBlock', 'TA', 'TA', 'No', 'Rec', 468, 'LwQ', 144, 270, 882, 'GasA', 'TA', 'Y', 'SBrkr', 896, 0, 0, 896, 0, 0, 1, 0, 2, 1, 'TA', 5, 'Typ', 0, None, 'Attchd', 1961, 'Unf', 1, 730, 'TA', 'TA', 'Y', 140, 0, 0, 0, 120, 0, None, 'MnPrv', None, 0, 6, 2010, 'WD', 'Normal'], [1462, 20, 'RL', 81.0, 14267, 'Pave', None, 'IR1', 'Lvl', 'AllPub', 'Corner', 'Gtl', 'NAmes', 'Norm', 'Norm', '1Fam', '1Story', 6, 6, 1958, 1958, 'Hip', 'CompShg', 'Wd Sdng', 'Wd Sdng', 'BrkFace', 108, 'TA', 'TA', 'CBlock', 'TA', 'TA', 'No', 'ALQ', 923, 'Unf', 0, 406, 1329, 'GasA', 'TA', 'Y', 'SBrkr', 1329, 0, 0, 1329, 0, 0, 1, 1, 3, 1, 'Gd', 6, 'Typ', 0, None, 'Attchd', 1958, 'Unf', 1, 312, 'TA', 'TA', 'Y', 393, 36, 0, 0, 0, 0, None, None, 'Gar2', 12500, 6, 2010, 'WD', 'Normal']]

Which has the same shape as the original dataset

# number of rows
display(len(res['data']))

# number of columns per row
display(len(res['data'][0]))
9



80
df.shape
(9, 80)

And is easy enough to reconstruct the original

pd.DataFrame(res['data'],
             columns=res['columns'],
             index=res['index'])
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave None Reg Lvl AllPub ... 120 0 None MnPrv None 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave None IR1 Lvl AllPub ... 0 0 None None Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave None IR1 Lvl AllPub ... 0 0 None MnPrv None 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave None IR1 Lvl AllPub ... 0 0 None None None 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave None IR1 HLS AllPub ... 144 0 None None None 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave None IR1 Lvl AllPub ... 0 0 None None None 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave None IR1 Lvl AllPub ... 0 0 None GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave None IR1 Lvl AllPub ... 0 0 None None None 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave None Reg Lvl AllPub ... 0 0 None None None 0 2 2010 WD Normal

9 rows × 80 columns

However, passing this straight to pd.read_json() fails, because pandas can’t find a uniform key length in split

try:
    pd.read_json(split)
except Exception as e:
    print(e)
arrays must all be same length

For obvious reasons

for key in res.keys():
    print(key, len(res[key]))
columns 80
index 9
data 9
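
That said, pd.read_json() isn’t actually stuck here. If we tell it which layout to expect, it reconstructs the frame just fine. A quick sketch, reusing the split string from above:

# passing the orient explicitly lets pandas line the three keys back up
round_trip = pd.read_json(split, orient='split')
round_trip.shape   # back to (9, 80), same as df.shape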

index

orient='index' creates a simple dictionary keyed by row index

idx = df.to_json(orient='index')
idx[:500]
'{"0":{"Id":1461,"MSSubClass":20,"MSZoning":"RH","LotFrontage":80.0,"LotArea":11622,"Street":"Pave","Alley":null,"LotShape":"Reg","LandContour":"Lvl","Utilities":"AllPub","LotConfig":"Inside","LandSlope":"Gtl","Neighborhood":"NAmes","Condition1":"Feedr","Condition2":"Norm","BldgType":"1Fam","HouseStyle":"1Story","OverallQual":5,"OverallCond":6,"YearBuilt":1961,"YearRemodAdd":1961,"RoofStyle":"Gable","RoofMatl":"CompShg","Exterior1st":"VinylSd","Exterior2nd":"VinylSd","MasVnrType":"None","MasVnrAr'
res = loads(idx)
res.keys()
dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8'])

And nested underneath each index key are that row’s values, by column.

from itertools import islice

for k, v in islice(res['0'].items(), 5):
    print(k.ljust(25), v)
    
print('...')
Id                        1461
MSSubClass                20
MSZoning                  RH
LotFrontage               80.0
LotArea                   11622
...

This is almost right, just lopsided: with the default orient, pd.read_json() reads the row keys as columns, so we flip it back with .T

pd.read_json(idx).T
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 896 0 0 None 2 1Fam TA No 468 144 ... WD 120 Pave 5 882 AllPub 140 1961 1961 2010
1 1329 0 0 None 3 1Fam TA No 923 0 ... WD 0 Pave 6 1329 AllPub 393 1958 1958 2010
2 928 701 0 None 3 1Fam TA No 791 0 ... WD 0 Pave 6 928 AllPub 212 1997 1998 2010
3 926 678 0 None 3 1Fam TA No 602 0 ... WD 0 Pave 7 926 AllPub 360 1998 1998 2010
4 1280 0 0 None 2 TwnhsE TA No 263 0 ... WD 144 Pave 5 1280 AllPub 0 1992 1992 2010
5 763 892 0 None 3 1Fam TA No 0 0 ... WD 0 Pave 7 763 AllPub 157 1993 1994 2010
6 1187 0 0 None 3 1Fam TA No 935 0 ... WD 0 Pave 6 1168 AllPub 483 1992 2007 2010
7 789 676 0 None 3 1Fam TA No 0 0 ... WD 0 Pave 7 789 AllPub 0 1998 1998 2010
8 1341 0 0 None 2 1Fam TA Gd 637 0 ... WD 0 Pave 5 1300 AllPub 192 1990 1990 2010

9 rows × 80 columns
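
Alternatively, rather than transposing after the fact, we can tell pd.read_json() up front that this is index-oriented JSON. A sketch, with the caveat that the columns still may not come back in their original order:

# interpret the outer keys as the row index directly
pd.read_json(idx, orient='index')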

columns

orient='columns' is basically the same thing, but in reverse: keyed by column name first, then by row index

cols = df.to_json(orient='columns')
cols[:500]
'{"Id":{"0":1461,"1":1462,"2":1463,"3":1464,"4":1465,"5":1466,"6":1467,"7":1468,"8":1469},"MSSubClass":{"0":20,"1":20,"2":60,"3":60,"4":120,"5":60,"6":20,"7":60,"8":20},"MSZoning":{"0":"RH","1":"RL","2":"RL","3":"RL","4":"RL","5":"RL","6":"RL","7":"RL","8":"RL"},"LotFrontage":{"0":80.0,"1":81.0,"2":74.0,"3":78.0,"4":43.0,"5":75.0,"6":null,"7":63.0,"8":85.0},"LotArea":{"0":11622,"1":14267,"2":13830,"3":9978,"4":5005,"5":10000,"6":7980,"7":8402,"8":10176},"Street":{"0":"Pave","1":"Pave","2":"Pave",'
res = loads(cols)
res.keys()
dict_keys(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])

Each column key houses a dict mapping row index to value.

from itertools import islice

for k, v in islice(res['Id'].items(), 5):
    print(k.ljust(25), v)
    
print('...')
0                         1461
1                         1462
2                         1463
3                         1464
4                         1465
...

This would work fine

pd.read_json(cols)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv None 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv None 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN None None 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave NaN IR1 Lvl AllPub ... 0 0 NaN GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave NaN Reg Lvl AllPub ... 0 0 NaN None None 0 2 2010 WD Normal

9 rows × 80 columns
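
Being explicit about the orient doesn’t hurt either. This sketch should give the same result as the call above, since 'columns' is what read_json assumes for a DataFrame anyway:

# spell out the orient; equivalent to the default for a DataFrame
pd.read_json(cols, orient='columns')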

values

orient='values' returns a simple nested list of the values, with no columns or index

values = df.to_json(orient='values')
values[:500]
'[[1461,20,"RH",80.0,11622,"Pave",null,"Reg","Lvl","AllPub","Inside","Gtl","NAmes","Feedr","Norm","1Fam","1Story",5,6,1961,1961,"Gable","CompShg","VinylSd","VinylSd","None",0,"TA","TA","CBlock","TA","TA","No","Rec",468,"LwQ",144,270,882,"GasA","TA","Y","SBrkr",896,0,0,896,0,0,1,0,2,1,"TA",5,"Typ",0,null,"Attchd",1961,"Unf",1,730,"TA","TA","Y",140,0,0,0,120,0,null,"MnPrv",null,0,6,2010,"WD","Normal"],[1462,20,"RL",81.0,14267,"Pave",null,"IR1","Lvl","AllPub","Corner","Gtl","NAmes","Norm","Norm","1F'
res = loads(values)
type(res)
list
print(res[0][:10])
print(res[1][:10])
[1461, 20, 'RH', 80.0, 11622, 'Pave', None, 'Reg', 'Lvl', 'AllPub']
[1462, 20, 'RL', 81.0, 14267, 'Pave', None, 'IR1', 'Lvl', 'AllPub']

This also works for creating a DataFrame; however, with the column names gone, the Pipeline object can’t select columns by name.

pd.read_json(values)
0 1 2 3 4 5 6 7 8 9 ... 70 71 72 73 74 75 76 77 78 79
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv None 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv None 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN None None 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave NaN IR1 Lvl AllPub ... 0 0 NaN GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave NaN Reg Lvl AllPub ... 0 0 NaN None None 0 2 2010 WD Normal

9 rows × 80 columns
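
If the column order is known out of band, the names can be bolted back on by hand. A sketch that borrows the names from df.columns:

# reattach the lost column names so downstream code can select by key again
pd.DataFrame(loads(values), columns=df.columns)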

default / records

By default (orient='columns' for a DataFrame), to_json() turns each column into its own dict, which in turn maps each row index to its value.

This is what we want.

from json import loads

default = df.to_json()
default[:500]
'{"Id":{"0":1461,"1":1462,"2":1463,"3":1464,"4":1465,"5":1466,"6":1467,"7":1468,"8":1469},"MSSubClass":{"0":20,"1":20,"2":60,"3":60,"4":120,"5":60,"6":20,"7":60,"8":20},"MSZoning":{"0":"RH","1":"RL","2":"RL","3":"RL","4":"RL","5":"RL","6":"RL","7":"RL","8":"RL"},"LotFrontage":{"0":80.0,"1":81.0,"2":74.0,"3":78.0,"4":43.0,"5":75.0,"6":null,"7":63.0,"8":85.0},"LotArea":{"0":11622,"1":14267,"2":13830,"3":9978,"4":5005,"5":10000,"6":7980,"7":8402,"8":10176},"Street":{"0":"Pave","1":"Pave","2":"Pave",'

Columns

loads(default).keys()
dict_keys(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])

Rows/Indexes

loads(default)['Id']
{'0': 1461,
 '1': 1462,
 '2': 1463,
 '3': 1464,
 '4': 1465,
 '5': 1466,
 '6': 1467,
 '7': 1468,
 '8': 1469}

Works great.

pd.read_json(default)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv None 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv None 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN None None 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave NaN IR1 Lvl AllPub ... 0 0 NaN GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave NaN Reg Lvl AllPub ... 0 0 NaN None None 0 2 2010 WD Normal

9 rows × 80 columns

The JSON itself looks different (the default is keyed by column, while records gives a list with one dict per row), but once pd.read_json() reconstructs them, the only practical difference is that the records version comes back with its columns sorted.

recs = df.to_json(orient='records')

pd.read_json(recs)
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 896 0 0 NaN 2 1Fam TA No 468 144 ... WD 120 Pave 5 882 AllPub 140 1961 1961 2010
1 1329 0 0 NaN 3 1Fam TA No 923 0 ... WD 0 Pave 6 1329 AllPub 393 1958 1958 2010
2 928 701 0 NaN 3 1Fam TA No 791 0 ... WD 0 Pave 6 928 AllPub 212 1997 1998 2010
3 926 678 0 NaN 3 1Fam TA No 602 0 ... WD 0 Pave 7 926 AllPub 360 1998 1998 2010
4 1280 0 0 NaN 2 TwnhsE TA No 263 0 ... WD 144 Pave 5 1280 AllPub 0 1992 1992 2010
5 763 892 0 NaN 3 1Fam TA No 0 0 ... WD 0 Pave 7 763 AllPub 157 1993 1994 2010
6 1187 0 0 NaN 3 1Fam TA No 935 0 ... WD 0 Pave 6 1168 AllPub 483 1992 2007 2010
7 789 676 0 NaN 3 1Fam TA No 0 0 ... WD 0 Pave 7 789 AllPub 0 1998 1998 2010
8 1341 0 0 NaN 2 1Fam TA Gd 637 0 ... WD 0 Pave 5 1300 AllPub 192 1990 1990 2010

9 rows × 80 columns

Which we can (painstakingly) show by de-serializing default, sorting the keys, re-serializing, and comparing the result against recs

default_json = loads(default)

sorted_default_json = {k: default_json[k] for k in sorted(default_json)}
sorted_default_json.keys()
dict_keys(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation', 'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea', 'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id', 'KitchenAbvGr', 'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig', 'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning', 'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold', 'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PavedDrive', 'PoolArea', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'ScreenPorch', 'Street', 'TotRmsAbvGrd', 'TotalBsmtSF', 'Utilities', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd', 'YrSold'])
from json import dumps

# .fillna('') because NaN values don't play well with ==
a = pd.read_json(dumps(sorted_default_json)).fillna('')
b = pd.read_json(recs).fillna('')

a.equals(b)

Woof.
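
A less painstaking spot-check, if the column ordering is all we care about, might look like this sketch:

# compare just the column order of the records round trip
list(pd.read_json(recs).columns) == sorted(df.columns)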

table

Finally, orient='table' takes records and goes one better

table = df.to_json(orient='table')
table[:500]
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"Id","type":"integer"},{"name":"MSSubClass","type":"integer"},{"name":"MSZoning","type":"string"},{"name":"LotFrontage","type":"number"},{"name":"LotArea","type":"integer"},{"name":"Street","type":"string"},{"name":"Alley","type":"number"},{"name":"LotShape","type":"string"},{"name":"LandContour","type":"string"},{"name":"Utilities","type":"string"},{"name":"LotConfig","type":"string"},{"name":"LandSlope","type":"string"},{"name":"N'

The data value is the same as if we’d done orient='records'

res = loads(table)
res.keys()
dict_keys(['schema', 'data'])
display(len(res['data']))
display(len(loads(recs)))
9



9

Except that data holds onto the index as an extra key in each record

len(res['data'][0].keys())
81
len(loads(recs)[0].keys())
80
set(res['data'][0].keys()) - set(loads(recs)[0].keys())
{'index'}
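
(Side note, assuming a reasonably recent pandas: to_json() also takes index=False, supported only for the 'split' and 'table' orients, if we’d rather not ship the index at all.)

# drop the index from the serialized payload entirely
no_index = df.to_json(orient='table', index=False)
len(loads(no_index)['data'][0].keys())   # back to 80 keys per record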

The schema key, meanwhile, gives us a simple schema definition, like we might expect from working with PySpark, which is pretty cool.

loads(table)['schema']
{'fields': [{'name': 'index', 'type': 'integer'},
  {'name': 'Id', 'type': 'integer'},
  {'name': 'MSSubClass', 'type': 'integer'},
  {'name': 'MSZoning', 'type': 'string'},
  {'name': 'LotFrontage', 'type': 'number'},
  {'name': 'LotArea', 'type': 'integer'},
  {'name': 'Street', 'type': 'string'},
  {'name': 'Alley', 'type': 'number'},
  {'name': 'LotShape', 'type': 'string'},
  {'name': 'LandContour', 'type': 'string'},
  {'name': 'Utilities', 'type': 'string'},
  {'name': 'LotConfig', 'type': 'string'},
  {'name': 'LandSlope', 'type': 'string'},
  {'name': 'Neighborhood', 'type': 'string'},
  {'name': 'Condition1', 'type': 'string'},
  {'name': 'Condition2', 'type': 'string'},
  {'name': 'BldgType', 'type': 'string'},
  {'name': 'HouseStyle', 'type': 'string'},
  {'name': 'OverallQual', 'type': 'integer'},
  {'name': 'OverallCond', 'type': 'integer'},
  {'name': 'YearBuilt', 'type': 'integer'},
  {'name': 'YearRemodAdd', 'type': 'integer'},
  {'name': 'RoofStyle', 'type': 'string'},
  {'name': 'RoofMatl', 'type': 'string'},
  {'name': 'Exterior1st', 'type': 'string'},
  {'name': 'Exterior2nd', 'type': 'string'},
  {'name': 'MasVnrType', 'type': 'string'},
  {'name': 'MasVnrArea', 'type': 'integer'},
  {'name': 'ExterQual', 'type': 'string'},
  {'name': 'ExterCond', 'type': 'string'},
  {'name': 'Foundation', 'type': 'string'},
  {'name': 'BsmtQual', 'type': 'string'},
  {'name': 'BsmtCond', 'type': 'string'},
  {'name': 'BsmtExposure', 'type': 'string'},
  {'name': 'BsmtFinType1', 'type': 'string'},
  {'name': 'BsmtFinSF1', 'type': 'integer'},
  {'name': 'BsmtFinType2', 'type': 'string'},
  {'name': 'BsmtFinSF2', 'type': 'integer'},
  {'name': 'BsmtUnfSF', 'type': 'integer'},
  {'name': 'TotalBsmtSF', 'type': 'integer'},
  {'name': 'Heating', 'type': 'string'},
  {'name': 'HeatingQC', 'type': 'string'},
  {'name': 'CentralAir', 'type': 'string'},
  {'name': 'Electrical', 'type': 'string'},
  {'name': '1stFlrSF', 'type': 'integer'},
  {'name': '2ndFlrSF', 'type': 'integer'},
  {'name': 'LowQualFinSF', 'type': 'integer'},
  {'name': 'GrLivArea', 'type': 'integer'},
  {'name': 'BsmtFullBath', 'type': 'integer'},
  {'name': 'BsmtHalfBath', 'type': 'integer'},
  {'name': 'FullBath', 'type': 'integer'},
  {'name': 'HalfBath', 'type': 'integer'},
  {'name': 'BedroomAbvGr', 'type': 'integer'},
  {'name': 'KitchenAbvGr', 'type': 'integer'},
  {'name': 'KitchenQual', 'type': 'string'},
  {'name': 'TotRmsAbvGrd', 'type': 'integer'},
  {'name': 'Functional', 'type': 'string'},
  {'name': 'Fireplaces', 'type': 'integer'},
  {'name': 'FireplaceQu', 'type': 'string'},
  {'name': 'GarageType', 'type': 'string'},
  {'name': 'GarageYrBlt', 'type': 'integer'},
  {'name': 'GarageFinish', 'type': 'string'},
  {'name': 'GarageCars', 'type': 'integer'},
  {'name': 'GarageArea', 'type': 'integer'},
  {'name': 'GarageQual', 'type': 'string'},
  {'name': 'GarageCond', 'type': 'string'},
  {'name': 'PavedDrive', 'type': 'string'},
  {'name': 'WoodDeckSF', 'type': 'integer'},
  {'name': 'OpenPorchSF', 'type': 'integer'},
  {'name': 'EnclosedPorch', 'type': 'integer'},
  {'name': '3SsnPorch', 'type': 'integer'},
  {'name': 'ScreenPorch', 'type': 'integer'},
  {'name': 'PoolArea', 'type': 'integer'},
  {'name': 'PoolQC', 'type': 'number'},
  {'name': 'Fence', 'type': 'string'},
  {'name': 'MiscFeature', 'type': 'string'},
  {'name': 'MiscVal', 'type': 'integer'},
  {'name': 'MoSold', 'type': 'integer'},
  {'name': 'YrSold', 'type': 'integer'},
  {'name': 'SaleType', 'type': 'string'},
  {'name': 'SaleCondition', 'type': 'string'}],
 'primaryKey': ['index'],
 'pandas_version': '0.20.0'}
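
So, circling back to make_prediction(): the default to_json() output (equivalently, orient='columns') is the one that round-trips through pd.read_json() with column names intact, which is exactly what the Pipeline needs in order to select its features. Roughly, the test-side call becomes a sketch like this, with make_prediction() and df as defined above:

# serialize the test frame the way the prediction function expects
payload = df.to_json()                      # default orient='columns'
result = make_prediction(input_data=payload)
result['predictions']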