Getting JSON Handling Right

Motivation

I’m currently going through the Deployment of ML Models course on Udemy, and in the sixth module they start implementing a testing suite to ensure stability in a package built on pandas, sklearn, and numpy.

I almost glossed over this section until I realized that I wasn’t totally up to speed on what they were doing as far as formatting goes. I’ve written a ton about pd.read_csv() and felt pretty comfortable with that.

However, what about the .predict() methods of my models? Specifically, what about the data that gets passed to them?

My Pipeline object had a bunch of Transformer objects chained together, spilling into a simple Linear_model.

Pipeline(steps=[('categorical_imputer', CategoricalImputer()),
                 ...
                ('Linear_model', Lasso(alpha=0.005, random_state=0))])

So in order for my pipeline to execute correctly, I needed my input data to be able to serve column values by key (e.g. impute these columns). Hence, I needed it to be a DataFrame.

However, the prediction function that they used didn’t take a DataFrame, but instead a JSON string, which was converted into a DataFrame inside the function.

def make_prediction(*, input_data) -> dict:
    """Make a prediction using the saved model pipeline."""

    data = pd.read_json(input_data)
    prediction = _price_pipe.predict(data[config.FEATURES])
    output = np.exp(prediction)
    response = {"predictions": output}

    return response

This made enough sense: a serialized data stream wasn’t going to come to me as a DataFrame. I should expect JSON.

So the trick was going from old, reliable pd.read_csv()

import pandas as pd

df = pd.read_csv('data/housing.csv')
df.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

…into a JSON format that would unpack correctly into a DataFrame once make_prediction() got ahold of it.

Lots of JSON formats

So at this point, I needed to figure out how to serialize my DataFrame into JSON, such that pd.read_json() didn’t choke on it.

Little did I know, there are a bunch of different ways that DataFrame.to_json() will JSON-ify your data

        orient : string
            Indication of expected JSON string format.

            * The format of the JSON string

              - 'split' : dict like {'index' -> [index],
                'columns' -> [columns], 'data' -> [values]}
              - 'records' : list like
                [{column -> value}, ... , {column -> value}]
              - 'index' : dict like {index -> {column -> value}}
              - 'columns' : dict like {column -> {index -> value}}
              - 'values' : just the values array
              - 'table' : dict like {'schema': {schema}, 'data': {data}}
                describing the data, and the data component is
                like ``orient='records'``.

The docs were very dense, so I figured it was worth enumerating the different options here.

split

orient='split' basically de-constructs the dataset into its column index, row index, and values

from json import loads

split = df.to_json(orient='split')
split[:500]
'{"columns":["Id","MSSubClass","MSZoning","LotFrontage","LotArea","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","OverallQual","OverallCond","YearBuilt","YearRemodAdd","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","MasVnrArea","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinSF1","BsmtFinType2","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","Hea'
res = loads(split)
res.keys()
dict_keys(['columns', 'index', 'data'])

You get the row and column indexes

display(res['columns'][:5])
display(res['index'][:5])
['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea']



[0, 1, 2, 3, 4]

As well as the data itself.

print(res['data'][:2])
[[1461, 20, 'RH', 80.0, 11622, 'Pave', None, 'Reg', 'Lvl', 'AllPub', 'Inside', 'Gtl', 'NAmes', 'Feedr', 'Norm', '1Fam', '1Story', 5, 6, 1961, 1961, 'Gable', 'CompShg', 'VinylSd', 'VinylSd', 'None', 0, 'TA', 'TA', 'CBlock', 'TA', 'TA', 'No', 'Rec', 468, 'LwQ', 144, 270, 882, 'GasA', 'TA', 'Y', 'SBrkr', 896, 0, 0, 896, 0, 0, 1, 0, 2, 1, 'TA', 5, 'Typ', 0, None, 'Attchd', 1961, 'Unf', 1, 730, 'TA', 'TA', 'Y', 140, 0, 0, 0, 120, 0, None, 'MnPrv', None, 0, 6, 2010, 'WD', 'Normal'], [1462, 20, 'RL', 81.0, 14267, 'Pave', None, 'IR1', 'Lvl', 'AllPub', 'Corner', 'Gtl', 'NAmes', 'Norm', 'Norm', '1Fam', '1Story', 6, 6, 1958, 1958, 'Hip', 'CompShg', 'Wd Sdng', 'Wd Sdng', 'BrkFace', 108, 'TA', 'TA', 'CBlock', 'TA', 'TA', 'No', 'ALQ', 923, 'Unf', 0, 406, 1329, 'GasA', 'TA', 'Y', 'SBrkr', 1329, 0, 0, 1329, 0, 0, 1, 1, 3, 1, 'Gd', 6, 'Typ', 0, None, 'Attchd', 1958, 'Unf', 1, 312, 'TA', 'TA', 'Y', 393, 36, 0, 0, 0, 0, None, None, 'Gar2', 12500, 6, 2010, 'WD', 'Normal']]

Which has the same shape as the original dataset

# number of rows
display(len(res['data']))

# number of columns per row
display(len(res['data'][0]))
9



80
df.shape
(9, 80)

And is easy enough to reconstruct the original

pd.DataFrame(res['data'],
             columns=res['columns'],
             index=res['index'])
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave None Reg Lvl AllPub ... 120 0 None MnPrv None 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave None IR1 Lvl AllPub ... 0 0 None None Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave None IR1 Lvl AllPub ... 0 0 None MnPrv None 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave None IR1 Lvl AllPub ... 0 0 None None None 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave None IR1 HLS AllPub ... 144 0 None None None 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave None IR1 Lvl AllPub ... 0 0 None None None 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave None IR1 Lvl AllPub ... 0 0 None GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave None IR1 Lvl AllPub ... 0 0 None None None 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave None Reg Lvl AllPub ... 0 0 None None None 0 2 2010 WD Normal

9 rows × 80 columns

However, passing this straight to pd.read_json() fails, because pandas can’t find a uniform key length in split

try:
    pd.read_json(split)
except Exception as e:
    print(e)
arrays must all be same length

For obvious reasons

for key in res.keys():
    print(key, len(res[key]))
columns 80
index 9
data 9
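
That said, pd.read_json() isn’t actually stuck here. If we tell it which layout to expect, it reconstructs the frame just fine. A quick sketch, reusing the split string from above:

# passing the orient explicitly lets pandas line the three keys back up
round_trip = pd.read_json(split, orient='split')
round_trip.shape   # back to (9, 80), same as df.shape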

index

orient='index' creates a simple dictionary keyed by row index

idx = df.to_json(orient='index')
idx[:500]
'{"0":{"Id":1461,"MSSubClass":20,"MSZoning":"RH","LotFrontage":80.0,"LotArea":11622,"Street":"Pave","Alley":null,"LotShape":"Reg","LandContour":"Lvl","Utilities":"AllPub","LotConfig":"Inside","LandSlope":"Gtl","Neighborhood":"NAmes","Condition1":"Feedr","Condition2":"Norm","BldgType":"1Fam","HouseStyle":"1Story","OverallQual":5,"OverallCond":6,"YearBuilt":1961,"YearRemodAdd":1961,"RoofStyle":"Gable","RoofMatl":"CompShg","Exterior1st":"VinylSd","Exterior2nd":"VinylSd","MasVnrType":"None","MasVnrAr'
res = loads(idx)
res.keys()
dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8'])

And nested underneath each index key are that row’s values, by column.

from itertools import islice

for k, v in islice(res['0'].items(), 5):
    print(k.ljust(25), v)
    
print('...')
Id                        1461
MSSubClass                20
MSZoning                  RH
LotFrontage               80.0
LotArea                   11622
...

This is almost right, just lopsided: with the default orient, pd.read_json() reads the row keys as columns, so we flip it back with .T

pd.read_json(idx).T
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 896 0 0 None 2 1Fam TA No 468 144 ... WD 120 Pave 5 882 AllPub 140 1961 1961 2010
1 1329 0 0 None 3 1Fam TA No 923 0 ... WD 0 Pave 6 1329 AllPub 393 1958 1958 2010
2 928 701 0 None 3 1Fam TA No 791 0 ... WD 0 Pave 6 928 AllPub 212 1997 1998 2010
3 926 678 0 None 3 1Fam TA No 602 0 ... WD 0 Pave 7 926 AllPub 360 1998 1998 2010
4 1280 0 0 None 2 TwnhsE TA No 263 0 ... WD 144 Pave 5 1280 AllPub 0 1992 1992 2010
5 763 892 0 None 3 1Fam TA No 0 0 ... WD 0 Pave 7 763 AllPub 157 1993 1994 2010
6 1187 0 0 None 3 1Fam TA No 935 0 ... WD 0 Pave 6 1168 AllPub 483 1992 2007 2010
7 789 676 0 None 3 1Fam TA No 0 0 ... WD 0 Pave 7 789 AllPub 0 1998 1998 2010
8 1341 0 0 None 2 1Fam TA Gd 637 0 ... WD 0 Pave 5 1300 AllPub 192 1990 1990 2010

9 rows × 80 columns
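
Alternatively, rather than transposing after the fact, we can tell pd.read_json() up front that this is index-oriented JSON. A sketch, with the caveat that the columns still may not come back in their original order:

# interpret the outer keys as the row index directly
pd.read_json(idx, orient='index')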

columns

orient='columns' is basically the same thing, but in reverse: keyed by column name first, then by row index

cols = df.to_json(orient='columns')
cols[:500]
'{"Id":{"0":1461,"1":1462,"2":1463,"3":1464,"4":1465,"5":1466,"6":1467,"7":1468,"8":1469},"MSSubClass":{"0":20,"1":20,"2":60,"3":60,"4":120,"5":60,"6":20,"7":60,"8":20},"MSZoning":{"0":"RH","1":"RL","2":"RL","3":"RL","4":"RL","5":"RL","6":"RL","7":"RL","8":"RL"},"LotFrontage":{"0":80.0,"1":81.0,"2":74.0,"3":78.0,"4":43.0,"5":75.0,"6":null,"7":63.0,"8":85.0},"LotArea":{"0":11622,"1":14267,"2":13830,"3":9978,"4":5005,"5":10000,"6":7980,"7":8402,"8":10176},"Street":{"0":"Pave","1":"Pave","2":"Pave",'
res = loads(cols)
res.keys()
dict_keys(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])

Each column key houses a dict mapping row index to value.

from itertools import islice

for k, v in islice(res['Id'].items(), 5):
    print(k.ljust(25), v)
    
print('...')
0                         1461
1                         1462
2                         1463
3                         1464
4                         1465
...

This would work fine

pd.read_json(cols)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv None 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv None 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN None None 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave NaN IR1 Lvl AllPub ... 0 0 NaN GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave NaN Reg Lvl AllPub ... 0 0 NaN None None 0 2 2010 WD Normal

9 rows × 80 columns
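
Being explicit about the orient doesn’t hurt either. This sketch should give the same result as the call above, since 'columns' is what read_json assumes for a DataFrame anyway:

# spell out the orient; equivalent to the default for a DataFrame
pd.read_json(cols, orient='columns')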

values

orient='values' returns a simple nested list of the values, with no columns or index

values = df.to_json(orient='values')
values[:500]
'[[1461,20,"RH",80.0,11622,"Pave",null,"Reg","Lvl","AllPub","Inside","Gtl","NAmes","Feedr","Norm","1Fam","1Story",5,6,1961,1961,"Gable","CompShg","VinylSd","VinylSd","None",0,"TA","TA","CBlock","TA","TA","No","Rec",468,"LwQ",144,270,882,"GasA","TA","Y","SBrkr",896,0,0,896,0,0,1,0,2,1,"TA",5,"Typ",0,null,"Attchd",1961,"Unf",1,730,"TA","TA","Y",140,0,0,0,120,0,null,"MnPrv",null,0,6,2010,"WD","Normal"],[1462,20,"RL",81.0,14267,"Pave",null,"IR1","Lvl","AllPub","Corner","Gtl","NAmes","Norm","Norm","1F'
res = loads(values)
type(res)
list
print(res[0][:10])
print(res[1][:10])
[1461, 20, 'RH', 80.0, 11622, 'Pave', None, 'Reg', 'Lvl', 'AllPub']
[1462, 20, 'RL', 81.0, 14267, 'Pave', None, 'IR1', 'Lvl', 'AllPub']

This also works for creating a DataFrame; however, with the column names gone, the Pipeline object can’t select columns by name.

pd.read_json(values)
0 1 2 3 4 5 6 7 8 9 ... 70 71 72 73 74 75 76 77 78 79
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv None 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv None 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN None None 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave NaN IR1 Lvl AllPub ... 0 0 NaN GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave NaN Reg Lvl AllPub ... 0 0 NaN None None 0 2 2010 WD Normal

9 rows × 80 columns
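
If the column order is known out of band, the names can be bolted back on by hand. A sketch that borrows the names from df.columns:

# reattach the lost column names so downstream code can select by key again
pd.DataFrame(loads(values), columns=df.columns)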

default / records

By default (orient='columns' for a DataFrame), to_json() turns each column into its own dict, which in turn maps each row index to its value.

This is what we want.

from json import loads

default = df.to_json()
default[:500]
'{"Id":{"0":1461,"1":1462,"2":1463,"3":1464,"4":1465,"5":1466,"6":1467,"7":1468,"8":1469},"MSSubClass":{"0":20,"1":20,"2":60,"3":60,"4":120,"5":60,"6":20,"7":60,"8":20},"MSZoning":{"0":"RH","1":"RL","2":"RL","3":"RL","4":"RL","5":"RL","6":"RL","7":"RL","8":"RL"},"LotFrontage":{"0":80.0,"1":81.0,"2":74.0,"3":78.0,"4":43.0,"5":75.0,"6":null,"7":63.0,"8":85.0},"LotArea":{"0":11622,"1":14267,"2":13830,"3":9978,"4":5005,"5":10000,"6":7980,"7":8402,"8":10176},"Street":{"0":"Pave","1":"Pave","2":"Pave",'

Columns

loads(default).keys()
dict_keys(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'])

Rows/Indexes

loads(default)['Id']
{'0': 1461,
 '1': 1462,
 '2': 1463,
 '3': 1464,
 '4': 1465,
 '5': 1466,
 '6': 1467,
 '7': 1468,
 '8': 1469}

Works great.

pd.read_json(default)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv None 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv None 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN None None 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave NaN IR1 Lvl AllPub ... 0 0 NaN GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave NaN IR1 Lvl AllPub ... 0 0 NaN None None 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave NaN Reg Lvl AllPub ... 0 0 NaN None None 0 2 2010 WD Normal

9 rows × 80 columns

The JSON itself looks different (the default is keyed by column, while records gives a list with one dict per row), but once pd.read_json() reconstructs them, the only practical difference is that the records version comes back with its columns sorted.

recs = df.to_json(orient='records')

pd.read_json(recs)
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 896 0 0 NaN 2 1Fam TA No 468 144 ... WD 120 Pave 5 882 AllPub 140 1961 1961 2010
1 1329 0 0 NaN 3 1Fam TA No 923 0 ... WD 0 Pave 6 1329 AllPub 393 1958 1958 2010
2 928 701 0 NaN 3 1Fam TA No 791 0 ... WD 0 Pave 6 928 AllPub 212 1997 1998 2010
3 926 678 0 NaN 3 1Fam TA No 602 0 ... WD 0 Pave 7 926 AllPub 360 1998 1998 2010
4 1280 0 0 NaN 2 TwnhsE TA No 263 0 ... WD 144 Pave 5 1280 AllPub 0 1992 1992 2010
5 763 892 0 NaN 3 1Fam TA No 0 0 ... WD 0 Pave 7 763 AllPub 157 1993 1994 2010
6 1187 0 0 NaN 3 1Fam TA No 935 0 ... WD 0 Pave 6 1168 AllPub 483 1992 2007 2010
7 789 676 0 NaN 3 1Fam TA No 0 0 ... WD 0 Pave 7 789 AllPub 0 1998 1998 2010
8 1341 0 0 NaN 2 1Fam TA Gd 637 0 ... WD 0 Pave 5 1300 AllPub 192 1990 1990 2010

9 rows × 80 columns

Which we can (painstakingly) show by de-serializing default, sorting the keys, re-serializing, and comparing the result against recs

default_json = loads(default)

sorted_default_json = {k: default_json[k] for k in sorted(default_json)}
sorted_default_json.keys()
dict_keys(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation', 'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea', 'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id', 'KitchenAbvGr', 'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig', 'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning', 'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold', 'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PavedDrive', 'PoolArea', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'ScreenPorch', 'Street', 'TotRmsAbvGrd', 'TotalBsmtSF', 'Utilities', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd', 'YrSold'])
from json import dumps

# .fillna('') because NaN values don't play well with ==
a = pd.read_json(dumps(sorted_default_json)).fillna('')
b = pd.read_json(recs).fillna('')

a.equals(b)

Woof.
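
A less painstaking spot-check, if the column ordering is all we care about, might look like this sketch:

# compare just the column order of the records round trip
list(pd.read_json(recs).columns) == sorted(df.columns)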

table

Finally, orient='table' takes records and goes one better

table = df.to_json(orient='table')
table[:500]
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"Id","type":"integer"},{"name":"MSSubClass","type":"integer"},{"name":"MSZoning","type":"string"},{"name":"LotFrontage","type":"number"},{"name":"LotArea","type":"integer"},{"name":"Street","type":"string"},{"name":"Alley","type":"number"},{"name":"LotShape","type":"string"},{"name":"LandContour","type":"string"},{"name":"Utilities","type":"string"},{"name":"LotConfig","type":"string"},{"name":"LandSlope","type":"string"},{"name":"N'

The data value is the same as if we’d done orient='records'

res = loads(table)
res.keys()
dict_keys(['schema', 'data'])
display(len(res['data']))
display(len(loads(recs)))
9



9

Except that data holds onto the index as an extra key in each record

len(res['data'][0].keys())
81
len(loads(recs)[0].keys())
80
set(res['data'][0].keys()) - set(loads(recs)[0].keys())
{'index'}
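
(Side note, assuming a reasonably recent pandas: to_json() also takes index=False, supported only for the 'split' and 'table' orients, if we’d rather not ship the index at all.)

# drop the index from the serialized payload entirely
no_index = df.to_json(orient='table', index=False)
len(loads(no_index)['data'][0].keys())   # back to 80 keys per record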

The schema key, meanwhile, gives us a simple schema definition, like we might expect from working with PySpark, which is pretty cool.

loads(table)['schema']
{'fields': [{'name': 'index', 'type': 'integer'},
  {'name': 'Id', 'type': 'integer'},
  {'name': 'MSSubClass', 'type': 'integer'},
  {'name': 'MSZoning', 'type': 'string'},
  {'name': 'LotFrontage', 'type': 'number'},
  {'name': 'LotArea', 'type': 'integer'},
  {'name': 'Street', 'type': 'string'},
  {'name': 'Alley', 'type': 'number'},
  {'name': 'LotShape', 'type': 'string'},
  {'name': 'LandContour', 'type': 'string'},
  {'name': 'Utilities', 'type': 'string'},
  {'name': 'LotConfig', 'type': 'string'},
  {'name': 'LandSlope', 'type': 'string'},
  {'name': 'Neighborhood', 'type': 'string'},
  {'name': 'Condition1', 'type': 'string'},
  {'name': 'Condition2', 'type': 'string'},
  {'name': 'BldgType', 'type': 'string'},
  {'name': 'HouseStyle', 'type': 'string'},
  {'name': 'OverallQual', 'type': 'integer'},
  {'name': 'OverallCond', 'type': 'integer'},
  {'name': 'YearBuilt', 'type': 'integer'},
  {'name': 'YearRemodAdd', 'type': 'integer'},
  {'name': 'RoofStyle', 'type': 'string'},
  {'name': 'RoofMatl', 'type': 'string'},
  {'name': 'Exterior1st', 'type': 'string'},
  {'name': 'Exterior2nd', 'type': 'string'},
  {'name': 'MasVnrType', 'type': 'string'},
  {'name': 'MasVnrArea', 'type': 'integer'},
  {'name': 'ExterQual', 'type': 'string'},
  {'name': 'ExterCond', 'type': 'string'},
  {'name': 'Foundation', 'type': 'string'},
  {'name': 'BsmtQual', 'type': 'string'},
  {'name': 'BsmtCond', 'type': 'string'},
  {'name': 'BsmtExposure', 'type': 'string'},
  {'name': 'BsmtFinType1', 'type': 'string'},
  {'name': 'BsmtFinSF1', 'type': 'integer'},
  {'name': 'BsmtFinType2', 'type': 'string'},
  {'name': 'BsmtFinSF2', 'type': 'integer'},
  {'name': 'BsmtUnfSF', 'type': 'integer'},
  {'name': 'TotalBsmtSF', 'type': 'integer'},
  {'name': 'Heating', 'type': 'string'},
  {'name': 'HeatingQC', 'type': 'string'},
  {'name': 'CentralAir', 'type': 'string'},
  {'name': 'Electrical', 'type': 'string'},
  {'name': '1stFlrSF', 'type': 'integer'},
  {'name': '2ndFlrSF', 'type': 'integer'},
  {'name': 'LowQualFinSF', 'type': 'integer'},
  {'name': 'GrLivArea', 'type': 'integer'},
  {'name': 'BsmtFullBath', 'type': 'integer'},
  {'name': 'BsmtHalfBath', 'type': 'integer'},
  {'name': 'FullBath', 'type': 'integer'},
  {'name': 'HalfBath', 'type': 'integer'},
  {'name': 'BedroomAbvGr', 'type': 'integer'},
  {'name': 'KitchenAbvGr', 'type': 'integer'},
  {'name': 'KitchenQual', 'type': 'string'},
  {'name': 'TotRmsAbvGrd', 'type': 'integer'},
  {'name': 'Functional', 'type': 'string'},
  {'name': 'Fireplaces', 'type': 'integer'},
  {'name': 'FireplaceQu', 'type': 'string'},
  {'name': 'GarageType', 'type': 'string'},
  {'name': 'GarageYrBlt', 'type': 'integer'},
  {'name': 'GarageFinish', 'type': 'string'},
  {'name': 'GarageCars', 'type': 'integer'},
  {'name': 'GarageArea', 'type': 'integer'},
  {'name': 'GarageQual', 'type': 'string'},
  {'name': 'GarageCond', 'type': 'string'},
  {'name': 'PavedDrive', 'type': 'string'},
  {'name': 'WoodDeckSF', 'type': 'integer'},
  {'name': 'OpenPorchSF', 'type': 'integer'},
  {'name': 'EnclosedPorch', 'type': 'integer'},
  {'name': '3SsnPorch', 'type': 'integer'},
  {'name': 'ScreenPorch', 'type': 'integer'},
  {'name': 'PoolArea', 'type': 'integer'},
  {'name': 'PoolQC', 'type': 'number'},
  {'name': 'Fence', 'type': 'string'},
  {'name': 'MiscFeature', 'type': 'string'},
  {'name': 'MiscVal', 'type': 'integer'},
  {'name': 'MoSold', 'type': 'integer'},
  {'name': 'YrSold', 'type': 'integer'},
  {'name': 'SaleType', 'type': 'string'},
  {'name': 'SaleCondition', 'type': 'string'}],
 'primaryKey': ['index'],
 'pandas_version': '0.20.0'}
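
So, circling back to make_prediction(): the default to_json() output (equivalently, orient='columns') is the one that round-trips through pd.read_json() with column names intact, which is exactly what the Pipeline needs in order to select its features. Roughly, the test-side call becomes a sketch like this, with make_prediction() and df as defined above:

# serialize the test frame the way the prediction function expects
payload = df.to_json()                      # default orient='columns'
result = make_prediction(input_data=payload)
result['predictions']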