## Split Train & Test and Normalization

**Create overall random Train and Test split.**

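```python
# train_test_split lives in sklearn.model_selection
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20).
from sklearn.model_selection import train_test_split

# Randomly assign 70% of the rows to train and the rest to test.
df_train, df_test = train_test_split(df_all, train_size=0.7)
```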
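The split above is random, so each run produces a different partition. A minimal sketch of making it repeatable follows, assuming a fixed `random_state` is acceptable for the analysis. The `df_toy` frame and the seed value are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_all: a period column 't' and one feature.
df_toy = pd.DataFrame({
    't': np.arange(10),
    'FF_PE': np.linspace(10.0, 19.0, 10),
})

# random_state is an added assumption (not in the original call);
# it fixes the shuffle so the same rows land in train on every run.
train_a, test_a = train_test_split(df_toy, train_size=0.7, random_state=42)
train_b, test_b = train_test_split(df_toy, train_size=0.7, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```

Fixing the seed makes downstream numbers, such as the standardization statistics, comparable between runs.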

We leave P_PRICE out as a predictor because it already enters many of the ratio features. On its own (that is, outside a ratio such as price to earnings), it is also unlikely to be a good predictor: the intuition is that price is informative in relative terms, either as a ratio or as a change, such as the momentum captured by trailing six-month returns (P_PRICE_RETURNS_PR).

```python
# Drop the raw price column from both splits and from the feature list.
df_train = df_train.drop('P_PRICE', axis=1)
df_test = df_test.drop('P_PRICE', axis=1)
feature_names.remove('P_PRICE')
feature_names
```

`Out[85]:`

```
['P_PRICE_RETURNS',
 'P_PRICE_RETURNS_PR',
 'FF_EPS',
 'FF_BPS',
 'FF_PE',
 'FF_PBK',
 'FF_PSALES',
 'FF_ENTRPR_VAL_SALES',
 'FF_ENTRPR_VAL_EBITDA_OPER',
 'FF_FREE_PS_CF',
 'FE_RATING',
 'FF_ROE',
 'FF_ROA',
 'FG_DIV_YLD',
 'FG_EPS_LTG',
 'FF_EPS_BASIC_GR']
```

**Normalize (i.e., standardize) features and response.**

Note that we are not overly concerned with outliers. Our data set consists of 30 well-established stocks collected from a well-established source, FactSet, and we do not want to discard out-of-character data, since it may carry real meaning that we want to model. That said, with more time we would still do outlier analysis as part of a comprehensive data science process, because it could reveal an undetected data collection issue.
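As a rough illustration of what such an outlier screen could look like, the sketch below counts, per feature, the values lying more than three standard deviations from the mean. The helper `count_extreme_values`, the threshold of 3, and the toy data are all assumptions made for illustration. This analysis was not run on the FactSet data.

```python
import numpy as np
import pandas as pd

def count_extreme_values(df, columns, z_threshold=3.0):
    """Count, per column, the rows whose |z-score| exceeds z_threshold."""
    z = (df[columns] - df[columns].mean()) / df[columns].std()
    return (z.abs() > z_threshold).sum()

# Hypothetical toy data; one extreme value is planted in 'FF_PE'.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'FF_PE': rng.normal(15.0, 2.0, size=200),
    'FF_ROE': rng.uniform(5.0, 15.0, size=200),
})
toy.loc[0, 'FF_PE'] = 150.0

extreme_counts = count_extreme_values(toy, ['FF_PE', 'FF_ROE'])
print(extreme_counts)
```

A flagged value is only a prompt for inspection. Whether it reflects a data collection problem or a real market event still has to be judged case by case.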

```python
# Compute the standardization statistics from the training set only.
df_train_means = df_train.mean()
df_train_std = df_train.std()
t_periods = df_all['t'].unique()

# Apply the training means and standard deviations to both splits.
for f in feature_names:
    df_train.loc[:, f] = (df_train[f] - df_train_means[f]) / df_train_std[f]
    df_test.loc[:, f] = (df_test[f] - df_train_means[f]) / df_train_std[f]
```
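The key point in the code above is that the test set is scaled with the training means and standard deviations, so no information from the test set leaks into the transformation. The self-contained sketch below shows the same idea on toy data. The helper name `standardize_with_train_stats` and the toy values are hypothetical, and the helper works on copies rather than modifying the frames in place.

```python
import pandas as pd

def standardize_with_train_stats(train, test, columns):
    """Return standardized copies of train and test using training statistics only."""
    means = train[columns].mean()
    stds = train[columns].std()  # pandas default is the sample std (ddof=1)
    train_z = train.copy()
    test_z = test.copy()
    train_z[columns] = (train[columns] - means) / stds
    test_z[columns] = (test[columns] - means) / stds
    return train_z, test_z

# Hypothetical toy values, not taken from the FactSet data.
toy_train = pd.DataFrame({'FF_PE': [10.0, 12.0, 14.0, 16.0],
                          'FF_ROE': [1.0, 2.0, 3.0, 6.0]})
toy_test = pd.DataFrame({'FF_PE': [11.0, 19.0],
                         'FF_ROE': [2.0, 4.0]})

train_z, test_z = standardize_with_train_stats(
    toy_train, toy_test, ['FF_PE', 'FF_ROE'])
print(train_z.mean())  # approximately 0 for every column
print(test_z.mean())   # generally not 0, since training statistics were used
```

On the training set the standardized means are zero up to floating-point error and the standard deviations are one. On the test set they generally are not, which is expected.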

```python
# Check that each standardized training feature has mean ~0 and std 1.
for f in feature_names:
    print('standardized (train)', f, 'mean', df_train[f].mean(), 'std', df_train[f].std())
```

```
standardized (train) P_PRICE_RETURNS mean 3.53509109111e-16 std 1.0
standardized (train) P_PRICE_RETURNS_PR mean 3.96155771327e-16 std 1.0
standardized (train) FF_EPS mean 3.19268421224e-15 std 1.0
standardized (train) FF_BPS mean 1.23005662093e-16 std 1.0
standardized (train) FF_PE mean 5.64803935543e-17 std 1.0
standardized (train) FF_PBK mean -6.11349595286e-16 std 1.0
standardized (train) FF_PSALES mean 1.41527001758e-15 std 1.0
standardized (train) FF_ENTRPR_VAL_SALES mean 1.25619312131e-15 std 1.0
standardized (train) FF_ENTRPR_VAL_EBITDA_OPER mean 1.12775749938e-15 std 1.0
standardized (train) FF_FREE_PS_CF mean 4.91868450731e-16 std 1.0
standardized (train) FE_RATING mean 1.25770646102e-14 std 1.0
standardized (train) FF_ROE mean -5.81545393851e-16 std 1.0
standardized (train) FF_ROA mean 2.14933890791e-15 std 1.0
standardized (train) FG_DIV_YLD mean 1.73432696775e-15 std 1.0
standardized (train) FG_EPS_LTG mean 3.4043094622e-15 std 1.0
standardized (train) FF_EPS_BASIC_GR mean 6.30800526452e-16 std 1.0
```

```python
# Run the same check on the test set, which was scaled with the training statistics.
for f in feature_names:
    print('standardized (test)', f, 'mean', df_test[f].mean(), 'std', df_test[f].std())
```

Because the test set is standardized with the training means and standard deviations rather than its own, its per-feature means and standard deviations will generally differ somewhat from 0 and 1.