## The Model

### Model Description

A simpler but still effective model, built on fewer features and thus of lower dimensionality, should in theory generalize better to new data. For this reason we chose to experiment with the Lasso.

Lasso regression performs L1 regularization: it adds a penalty equal to the sum of the absolute values of the coefficients:

`Minimization objective = LS Obj + α * (sum of absolute value of coefficients)`

LASSO stands for Least Absolute Shrinkage and Selection Operator. The name itself doesn't reveal much, but it contains two key words: ‘absolute‘ and ‘selection‘.

Let's consider the former first and return to the latter later.

Lasso regression performs L1 regularization, i.e. it adds the sum of the absolute values of the coefficients to the optimization objective. Thus, lasso regression optimizes the following:

`Objective = RSS + α * (sum of absolute value of coefficients)`

Here, α (alpha) plays the same role as in ridge regression, balancing the trade-off between minimizing RSS and shrinking coefficient magnitudes. As with ridge, α can take various values; briefly:

- α = 0: same coefficients as simple linear regression
- α = ∞: all coefficients zero (same logic as before)
- 0 < α < ∞: coefficients between zero and those of simple linear regression

So far this looks very similar to ridge, but hang on with me and you'll know the difference by the time we finish.
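To make the shrinkage-and-selection behavior concrete, here is a minimal sketch (on synthetic data, not this project's stock features) comparing Lasso and Ridge coefficients as α grows. Lasso's L1 penalty drives irrelevant coefficients exactly to zero, while Ridge's L2 penalty only shrinks them toward zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
# the response depends on only 3 of the 10 features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.randn(200)

for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Lasso zeroes out noise features entirely; Ridge only shrinks them
    print(f"alpha={alpha}: lasso zeros={np.sum(lasso.coef_ == 0)}, "
          f"ridge zeros={np.sum(ridge.coef_ == 0)}")
```

As α increases, Lasso's count of exactly-zero coefficients grows, which is why it doubles as a feature-selection tool; Ridge's coefficients merely get smaller.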

This is an experiment, a “toy” example if you will, developed by the team over several weeks to exercise several course topics: data visualization, dimensionality reduction via Lasso regression, and cross-validation to tune hyperparameters.

This is not a rigorous financial treatment of this type of return modeling as outlined in the literature: for example, it makes no allowance for the risk-free rate, macro-market trends, or the inclusion of stocks of different shapes and sizes, among many other financial-modeling details that a comprehensive treatment with more time would account for.

Traditionally, features for such a fundamental factor return model are selected by various well-documented techniques, for example feature IC analysis (https://en.wikipedia.org/wiki/Information_coefficient), which measures how correlated each stock feature is with stock returns. In this experiment, however, we try using the Lasso, with its coefficient shrinkage, to narrow our set of features and attain a parsimonious model.
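As a rough illustration of an IC screen (the feature and return arrays below are synthetic stand-ins, not this project's data), the information coefficient is commonly computed as the Spearman rank correlation between a feature's values and subsequent returns:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.RandomState(42)
n_stocks = 500
# hypothetical cross-section: feature exposures and next-period returns
feature = rng.randn(n_stocks)
returns = 0.3 * feature + rng.randn(n_stocks)  # feature carries some signal

# IC = Spearman rank correlation between the feature and returns
ic, p_value = spearmanr(feature, returns)
print(f"IC = {ic:.3f} (p = {p_value:.3g})")
```

A feature with a consistently high absolute IC across periods would be kept; the Lasso approach used here instead lets the penalty decide which features survive.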

The general background knowledge for this project is from the Harvard University Extension course MGMT E-2690 Quantitative Equity Investing (Fall 2009 semester), which summarized factor models and potential features one might use but did not discuss the Lasso; using it is a design element we introduce and experiment with here.

### Designate X (i.e., feature matrix), and y (i.e., response)

We are trying to predict (really, decompose) the monthly price returns (`P_PRICE_RETURNS`), so it is time to designate our X and y data for the regression modeling.

```python
df_xtrain = df_train.drop('P_PRICE_RETURNS', axis=1)
df_xtest = df_test.drop('P_PRICE_RETURNS', axis=1)
df_ytrain = df_train['P_PRICE_RETURNS']
df_ytest = df_test['P_PRICE_RETURNS']
```

```python
df_xtrain.head()
```

```python
df_ytrain.head()
```

```
idx
201110MMM    1.575203
201111AAPL  -1.244065
201106GS    -1.217132
201012XOM    0.682883
201504AAPL  -0.134872
Name: P_PRICE_RETURNS, dtype: float64
```

```python
df_xtest.head()
```

```python
df_ytest.head()
```

```
idx
201405BA     0.736549
201301JPM    1.152699
201101GS    -0.725493
201103GE    -0.988315
201012GS     1.147736
Name: P_PRICE_RETURNS, dtype: float64
```

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

def cv_optimize_lasso(X, y, n_folds=4):
    """Grid-search the Lasso alpha via k-fold cross-validation."""
    clf = Lasso()
    parameters = {"alpha": [1e-8, 1e-6, 1e-5, 5e-5, 1e-4, 5e-4,
                            1e-3, 1e-2, 1e-1, 1.0]}
    # scoring was "mean_absolute_error" before scikit-learn 0.18
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds,
                      scoring="neg_mean_absolute_error")
    gs.fit(X, y)
    return gs
```

```python
# this is our response variable
feature_names.remove('P_PRICE_RETURNS')
feature_names
```

```
['P_PRICE_RETURNS_PR',
 'FF_EPS',
 'FF_BPS',
 'FF_PE',
 'FF_PBK',
 'FF_PSALES',
 'FF_ENTRPR_VAL_SALES',
 'FF_ENTRPR_VAL_EBITDA_OPER',
 'FF_FREE_PS_CF',
 'FE_RATING',
 'FF_ROE',
 'FF_ROA',
 'FG_DIV_YLD',
 'FG_EPS_LTG',
 'FF_EPS_BASIC_GR']
```

```python
df_xtrain[feature_names].head()
```