PREDICTIVE ANALYTICS FOR EQUITIES – THE DATA – PART 2

FactSet Financial Data

All financial data is courtesy of FactSet (http://www.factset.com) and its Excel data query plugin, for which we were granted an academic license. FactSet also offers an API, which would have let us collect far more data within the project's time constraints, but it is not available under the academic license.

We create a DataFrame of samples from an Excel workbook composed of multiple sheets. Each sheet holds monthly data, running consecutively from 11/30/2010 to 10/30/2015, for one of the 15 stock-related features over the past five years, covering the 30 stocks of the DJIA (https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average). Once assembled, there are 1,800 data samples (30 stocks × 60 months), each containing 15 features.
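As a rough illustration of the assembly step, here is a minimal sketch that reshapes one sheet into per-sample rows. It assumes each sheet is laid out with month-end dates as rows and tickers as columns, which the text above does not confirm. The helper name sheet_to_long and the toy values are hypothetical, not the FactSet data.

```python
import pandas as pd

def sheet_to_long(sheet: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Reshape one wide sheet (rows = dates, columns = tickers)
    into long format: one row per (date, ticker) sample."""
    wide = sheet.rename_axis("date").reset_index()
    # melt keeps missing values as NaN instead of dropping them
    return wide.melt(id_vars="date", var_name="ticker", value_name=feature)

# Toy stand-in for one sheet of the workbook (values are made up).
# With a real file, pd.read_excel(path, sheet_name=None) returns a dict
# of DataFrames keyed by sheet name; the path would be your own workbook.
dates = pd.to_datetime(["2015-08-31", "2015-09-30", "2015-10-30"])
returns_sheet = pd.DataFrame(
    {"MMM": [1.0, -0.5, 2.0], "AXP": [0.3, 0.1, -1.2]}, index=dates
)

response = sheet_to_long(returns_sheet, "P_PRICE_RETURNS")
print(response.shape)               # rows = dates x tickers
print(response["ticker"].unique())  # tickers found in the sheet
```

With 60 monthly rows and 30 ticker columns, the same reshape would yield the 1,800 samples mentioned above.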

Response Variable (FactSet):

P_PRICE_RETURNS: Price Returns with dividends reinvested

Features (FactSet):

P_PRICE_RETURNS_PR: Price Returns over a Predefined Range - "Momentum" as indicated by trailing 6 months returns.
FF_EPS: Earnings per Share
FF_BPS: Book Value per Share
FF_PE: Price to Earnings ratio
FF_PBK: Price to Book Value ratio
FF_PSALES: Price to Sales ratio
FF_ENTRPR_VAL_SALES: Enterprise Value to Sales ratio
FF_ENTRPR_VAL_EBITDA_OPER: Enterprise Value to EBITDA (Earnings Before Interest, Taxation, Depreciation, and Amortization)
FF_FREE_PS_CF: Free Cash Flow per Share
FE_RATING: Average Rating as a quantitative value as estimated by various default sources (brokers, etc.)
FF_ROE: Return on Average Total Equity
FF_ROA: Return on Average Assets
FG_DIV_YLD: Dividend Yield
FG_EPS_LTG: Long Term Growth Rate - EPS - Mean
FF_EPS_BASIC_GR: EPS - Extras - Before Growth - 1 Year Growth

Output (the 30 DJIA tickers):

array([u'MMM', u'AXP', u'AAPL', u'BA', u'CAT', u'CVX', u'CSCO', u'KO',
u'DD', u'XOM', u'GE', u'GS', u'HD', u'INTC', u'IBM', u'JNJ', u'JPM',
u'MCD', u'MRK', u'MSFT', u'NKE', u'PFE', u'PG', u'TRV', u'UNH',
u'UTX', u'VZ', u'V', u'WMT', u'DIS'], dtype=object)

Let’s append the features to the DataFrame:
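One way to do this, sketched under the assumption that the response and each feature have already been reshaped to long format with date and ticker columns, is a left join on those two keys. The helper append_features and the toy values are hypothetical.

```python
import pandas as pd

KEYS = ["date", "ticker"]

def append_features(samples: pd.DataFrame, feature_frames) -> pd.DataFrame:
    """Left-join each long-format feature frame onto the samples."""
    out = samples.copy()
    for frame in feature_frames:
        # validate guards against duplicate (date, ticker) rows
        out = out.merge(frame, on=KEYS, how="left", validate="one_to_one")
    return out

# Toy data (made-up values): 2 months x 2 tickers
dates = pd.to_datetime(["2015-09-30", "2015-09-30", "2015-10-30", "2015-10-30"])
tickers = ["MMM", "AXP", "MMM", "AXP"]
samples = pd.DataFrame(
    {"date": dates, "ticker": tickers, "P_PRICE_RETURNS": [1.0, 0.3, -0.5, 0.1]}
)
eps = pd.DataFrame({"date": dates, "ticker": tickers, "FF_EPS": [7.5, 5.4, 7.6, 5.5]})
# FF_PE has no row for AXP on 2015-10-30, so the join leaves a NaN there
pe = pd.DataFrame(
    {"date": dates[:3], "ticker": tickers[:3], "FF_PE": [20.1, 13.0, 19.8]}
)

samples = append_features(samples, [eps, pe])
print(samples.shape)
```

A left join keeps every sample row even when a feature has no value for that stock and month, which is what leaves missing values to deal with afterwards.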

Let’s replace any missing feature values with that feature's mean across all tickers for the same month:

(1800, 21)
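The month-wise mean imputation described above can be sketched as follows. This is a minimal example, assuming a long-format DataFrame with a date column; the helper fill_with_monthly_mean and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_monthly_mean(samples, feature_cols, date_col="date"):
    """Fill NaNs in each feature with that feature's mean over all
    tickers sharing the same date. Returns a new DataFrame."""
    out = samples.copy()
    # transform("mean") returns a frame aligned row-by-row with `out`;
    # the mean ignores NaNs, so only observed values contribute
    monthly_mean = out.groupby(date_col)[feature_cols].transform("mean")
    out[feature_cols] = out[feature_cols].fillna(monthly_mean)
    return out

# Toy data (made-up values): AAPL is missing FF_PE for 2015-09-30
samples = pd.DataFrame({
    "date": pd.to_datetime(["2015-09-30"] * 3 + ["2015-10-30"] * 3),
    "ticker": ["MMM", "AXP", "AAPL"] * 2,
    "FF_PE": [10.0, 20.0, np.nan, 12.0, 18.0, 30.0],
})

filled = fill_with_monthly_mean(samples, ["FF_PE"])
print(filled)
```

If a feature is missing for every ticker in a given month, the monthly mean is itself NaN and the gap remains, so it is worth checking for leftover NaNs after this step.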
