Replicate Fama French 5-Factor Model from publicly available data sources

As an equity quantitative analyst, you have recurring positioning analysis tasks. Your most effective approach is to model your object of study (usually stocks, portfolios or indexes) and decompose its behavior into common risk factors. You can create your own “factor model” or use existing models. There are robust models generated by data providers like MSCI or Qontigo, which are quite expensive but necessary for large institutions. Otherwise you have lighter academic models like the famous five-factor model developed by Fama-French.

In this article, we will try to reproduce the last one.

The FF-5 model is freely available but it is published with a 2 month lag, which is why it is interesting to create a proxy to recover these missing months, in order to produce a live analysis.

One difficulty that immediately arises is the sourcing of data. Academic research often use CRSP database. But it is reserved for academic practitioners and not free. We can wonder if we can reproduce, or at least have very close results, with publicly available data. For that, we will leverage the python library yfinance which interacts with Yahoo Finance APIs to download all the financial data we will need.

For the automatic download of Fama French original data, please refer to this previous article.

import pandas as pd
import numpy as np
import yfinance as yf
import requests
from tqdm import tqdm
import statsmodels.api as sm

We start with a list containing around 2500 tickers of US stocks (~99% of US market cap).


We load Fama French factor returns, these are the timeseries that we will try to replicate.


We have to specify a few dates to request the data. Following Fama French website methodology, portfolios are made each year at the end of June, with most recent fundamental data available at the end of last calendar year. Here we create the June 2023 baskets, to assess replication from July 2023 to maximum June 2024. Last calendar year is 2022.

start_date = "2021-12-31"
end_date = "2024-03-29"
start_year_y1 = "2022-01-01"
end_year_y1 = "2022-12-30"
cutoff_year = "2022-12"
cutoff_date = "2023-06-30"
cutoff_date_y1 = "2022-06-30"
last_ff_date = "2024-01-31"

risk_free_rate = data_ff.RF.iloc[-1] # we take the last available risk free rate number

We request, from yahoo api, the following metrics: operating income (or net income if missing), total assets, stockholder’s equity, number of shares and shares close prices.

operating_income = pd.DataFrame() # needed for RMW
total_assets = pd.DataFrame() # needed for CMA
book_value = pd.DataFrame() # needed for HML
nb_shares = pd.DataFrame(index=pd.date_range(start_year_y1, cutoff_date, freq="D")) # number of stock's shares
share_price = pd.DataFrame() # prices

s = requests.Session()

for this_ticker in tqdm(ticker_list.SYMBOL):
    stock = yf.Ticker(this_ticker, session=s)
    if stock.financials.empty: # check if ticker exists, otherwise skip stock
    # operating income
    try: # standard formula
        if operating_income.empty:
            operating_income = stock.financials.loc["Operating Income"].rename(this_ticker)
            operating_income = pd.concat([operating_income, stock.financials.loc["Operating Income"].rename(this_ticker)], axis=1)
        try: # if operating income metric is missing, we take net income
            if operating_income.empty:
                operating_income = (stock.financials.loc['Net Income']).rename(this_ticker)
                operating_income = pd.concat([operating_income, (stock.financials.loc['Net Income']).rename(this_ticker)], axis=1) 
        except: # otherwise skip this stock
    # total asset
    if total_assets.empty:
        total_assets = stock.balance_sheet.loc["Total Assets"].rename(this_ticker)
        total_assets = pd.concat([total_assets, stock.balance_sheet.loc["Total Assets"].rename(this_ticker)], axis=1)
    # book value
    if book_value.empty:
        book_value = stock.balance_sheet.loc["Stockholders Equity"].rename(this_ticker)
        book_value = pd.concat([book_value, stock.balance_sheet.loc["Stockholders Equity"].rename(this_ticker)], axis=1)
    # number of shares
    try: #first method
        nb_shares_new = stock.get_shares_full(start=start_year_y1, end=cutoff_date).rename(this_ticker).drop_duplicates().tz_localize(None)
        nb_shares.loc[nb_shares_new.index, this_ticker] = nb_shares_new
    except: # if no data, test second method
        nb_shares_new = stock.balance_sheet.loc["Ordinary Shares Number"].rename(this_ticker)
        nb_shares.loc[:, this_ticker] = nb_shares_new.loc[nb_shares_new.index.isin(nb_shares.index)]   
    # price of share
    if share_price.empty:
        share_price = stock.history(start=start_date, end=end_date).Close.rename(this_ticker)
        share_price = pd.concat([share_price, stock.history(start=start_date, end=end_date).Close.rename(this_ticker)], axis=1)
share_price.index = pd.to_datetime(share_price.index.strftime("%Y%m%d"))
share_return = (share_price / share_price.shift(1) - 1).fillna(0) # compute price returns

Now with the downloaded raw data, we can compute the fundamental metrics & ratios.

factor_metric = pd.DataFrame(index=ticker_list.SYMBOL, columns=["HML", "SMB", "RMW", "CMA"], dtype="float")
# SMB: price x nb shares = market cap
market_cap = share_price.loc[cutoff_date] * nb_shares.infer_objects().ffill().loc[cutoff_date]
factor_metric.loc[market_cap.index, "SMB"] = market_cap
# HML: book to market ratio
book_market = book_value.infer_objects().ffill().loc[cutoff_year] / (share_price.loc[end_year_y1] * nb_shares.infer_objects().ffill().loc[end_year_y1])
factor_metric.loc[book_market.columns, "HML"] = book_market.values
# RMW: operating income / book value
operating_profit = operating_income.infer_objects().ffill().loc[cutoff_year] / book_value.infer_objects().ffill().loc[cutoff_year]
factor_metric.loc[operating_profit.columns, "RMW"] = operating_profit.values
# CMA: total asset 1Y growth
investment = total_assets.infer_objects().ffill().loc[cutoff_date] / total_assets.infer_objects().ffill().loc[cutoff_date_y1] - 1
factor_metric.loc[investment.index, "CMA"] = investment

We filter the investment universe to keep only stocks that have positive book to market ratios. We compute metrics thresholds to map stocks into the long and short legs.

filtered_metric = factor_metric[factor_metric.HML > 0].copy()

# Define thresholds
mktcap_threshold = filtered_metric.SMB.quantile(0.5)

high_threshold =  filtered_metric.HML.quantile(0.7)
low_threshold =  filtered_metric.HML.quantile(0.3)

robust_threshold =  filtered_metric.RMW.quantile(0.7)
weak_threshold =  filtered_metric.RMW.quantile(0.3)

aggress_threshold =  filtered_metric.CMA.quantile(0.7)
conserv_threshold =  filtered_metric.CMA.quantile(0.3)

We create 4 baskets for each factor (HML, RMW, CMA). No need for SMB, as it is a combination of these.

basket_compo = filtered_metric.copy().drop(columns=["SMB"])
basket_compo['HML'] = None
basket_compo['HML'] = basket_compo['HML'].astype(str)
basket_compo.loc[(filtered_metric.SMB < mktcap_threshold) & (filtered_metric.HML >= high_threshold), "HML"] = "small_high"
basket_compo.loc[(filtered_metric.SMB < mktcap_threshold) & (filtered_metric.HML <= low_threshold), "HML"] = "small_low"
basket_compo.loc[(filtered_metric.SMB >= mktcap_threshold) & (filtered_metric.HML >= high_threshold), "HML"] = "big_high"
basket_compo.loc[(filtered_metric.SMB >= mktcap_threshold) & (filtered_metric.HML <= low_threshold), "HML"] = "big_low"
basket_compo['RMW'] = None
basket_compo['RMW'] = basket_compo['RMW'].astype(str)
basket_compo.loc[(filtered_metric.SMB < mktcap_threshold) & (filtered_metric.RMW >= robust_threshold), "RMW"] = "small_robust"
basket_compo.loc[(filtered_metric.SMB < mktcap_threshold) & (filtered_metric.RMW <= weak_threshold), "RMW"] = "small_weak"
basket_compo.loc[(filtered_metric.SMB >= mktcap_threshold) & (filtered_metric.RMW >= robust_threshold), "RMW"] = "big_robust"
basket_compo.loc[(filtered_metric.SMB >= mktcap_threshold) & (filtered_metric.RMW <= weak_threshold), "RMW"] = "big_weak"
basket_compo['CMA'] = None
basket_compo['CMA'] = basket_compo['CMA'].astype(str)
basket_compo.loc[(filtered_metric.SMB < mktcap_threshold) & (filtered_metric.CMA >= aggress_threshold), "CMA"] = "small_aggress"
basket_compo.loc[(filtered_metric.SMB < mktcap_threshold) & (filtered_metric.CMA <= conserv_threshold), "CMA"] = "small_conserv"
basket_compo.loc[(filtered_metric.SMB >= mktcap_threshold) & (filtered_metric.CMA >= aggress_threshold), "CMA"] = "big_aggress"
basket_compo.loc[(filtered_metric.SMB >= mktcap_threshold) & (filtered_metric.CMA <= conserv_threshold), "CMA"] = "big_conserv"

We have all the data needed to compute returns for long and short baskets for each factor. In the FF methodology factors are composed in June and weights are kept constant every day until June Y+1.

basket_weight = filtered_metric.copy().drop(columns=["SMB"])
basket_hml = pd.DataFrame(index=share_return.index, columns=["small_high", "small_low", "big_high", "big_low"])
basket_rmw = pd.DataFrame(index=share_return.index, columns=["small_robust", "small_weak", "big_robust", "big_weak"])
basket_cma = pd.DataFrame(index=share_return.index, columns=["small_aggress", "small_conserv", "big_aggress", "big_conserv"])

basket_weight["HML"] = np.nan
for bskt in basket_hml.columns:
  	# compute weight, which is market cap weighted
    basket_weight.loc[basket_compo.HML == bskt, "HML"] = filtered_metric[basket_compo.HML == bskt].SMB / filtered_metric[basket_compo.HML == bskt].SMB.sum()
    ret = share_return.loc[:, basket_compo[basket_compo.HML == bskt].index] # get returns of selected stocks
    wgt = basket_weight[basket_compo.HML == bskt].HML.values # get weighting of selected stocks
    basket_hml[bskt] = ret @ wgt.T # compute basket weighted performance

# same for RMW and CMA
# ...

# MARKET: We market cap weight the investment universe and compute historical weights
basket_weight["MKT_WEIGHT"] = filtered_metric.SMB / filtered_metric.SMB.sum()
raw_wgt = (basket_weight["MKT_WEIGHT"].values * (1+share_return.loc[:, basket_weight.index].shift(1).fillna(0)/100).cumprod())
wgt = (raw_wgt.T / raw_wgt.sum(axis=1)).T

As we have the returns of all baskets, we can compute the factors (long/short) performances. Factors are immunized against Size effect, so we subtract the mean of small & big long leg with the mean of small & big short leg. For SMB, we use the other factors to compute it.

ff_proxy = pd.DataFrame(index=share_return.index)
ff_proxy["MKT_proxy"] = (share_return.loc[:, basket_weight.index] * wgt).sum(axis=1) - risk_free_rate/100
ff_proxy["HML_proxy"] = 0.5 * (basket_hml.small_high + basket_hml.big_high) - 0.5 * (basket_hml.small_low + basket_hml.big_low)
ff_proxy["RMW_proxy"] = 0.5 * (basket_rmw.small_robust + basket_rmw.big_robust) - 0.5 * (basket_rmw.small_weak + basket_rmw.big_weak)
ff_proxy["CMA_proxy"] = 0.5 * (basket_cma.small_conserv + basket_cma.big_conserv) - 0.5 * (basket_cma.small_aggress + basket_cma.big_aggress)
ff_proxy["SMB_proxy"] = (1/3) * 0.5 * ( (basket_hml.small_high + basket_hml.small_low + basket_rmw.small_robust + basket_rmw.small_weak + basket_cma.small_conserv + basket_cma.small_aggress)
                        - (basket_hml.big_high + basket_hml.big_low + basket_rmw.big_robust + basket_rmw.big_weak + basket_cma.big_conserv + basket_cma.big_aggress))

We can now compare the computed proxy timeseries with official FF timeseries.

comp_rmw = pd.merge(ff_proxy.RMW_proxy[:last_ff_date], data_ff.RMW[cutoff_date:]/100, how='left', left_index=True, right_index=True).dropna()
corr_rmw = comp_rmw.corr().iloc[0,1]
beta_rmw = (comp_rmw.cov() / comp_rmw.RMW.var()).iloc[0,1]
versus FF Factor
versus FF Factor

We have pretty good correlations versus the original FF5 factors (as we did not do much refinement on data filtering). We just have to adjust for leverage (i.e. beta) to correctly scale our proxies.

ff_proxy_adjusted = ff_proxy.copy()
ff_proxy_adjusted["MKT_proxy"] = ff_proxy_adjusted["MKT_proxy"] / beta_mkt
ff_proxy_adjusted["HML_proxy"] = ff_proxy_adjusted["HML_proxy"] / beta_hml
ff_proxy_adjusted["SMB_proxy"] = ff_proxy_adjusted["SMB_proxy"] / beta_smb
ff_proxy_adjusted["CMA_proxy"] = ff_proxy_adjusted["CMA_proxy"] / beta_cma
ff_proxy_adjusted["RMW_proxy"] = ff_proxy_adjusted["RMW_proxy"] / beta_rmw

To test the quality of our proxies, we can upload a US index timeseries that we will regress against our 2 factor sets.

# we load an index total return timeserie
index_usa = pd.read_csv("index_usa.csv")
index_usa.dte_Date = pd.to_datetime(index_usa.dte_Date, format="%Y-%m-%d")
index_usa = index_usa.set_index("dte_Date")

start_date = "2023-06-30"
end_date = "2024-01-31"


y_ff = index_usa[start_date:end_date]
X_ff = data_ff.drop(columns=["RF"])[start_date:end_date]
y_ff = y_ff[y_ff.index.isin(X_ff.index)]
X_ff = X_ff[X_ff.index.isin(y_ff.index)]
model_ff = sm.OLS(y_ff, sm.add_constant(X_ff/100))
res_ff =


y = index_usa[start_date:end_date]
X = ff_proxy_adjusted.drop(columns=["RF"])[start_date:end_date]
y = y[y.index.isin(X.index)]
X = X[X.index.isin(y.index)]
model = sm.OLS(y, sm.add_constant(X))
res =

We got pretty similar results, near R² scores and close estimated betas.

Now we can check if the residuals are close too.


comp_resid = pd.merge(res_ff.resid.rename("ff"), res.resid.rename("proxy"), how="inner", left_index=True, right_index=True)
print("corr: ", comp_resid.corr().iloc[0,1])
print("beta: ", (comp_resid.cov() / comp_resid.ff.var()).iloc[0,1])

Correlation is near 98%!

To conclude, we denoised quite similarly the index timeseries with both proxy and FF factor sets. We can therefore use these proxy factors to fill in the missing data over 2 months. This will allow us to use this factor model for live investing/trading models. Free data here does not degrade much modeling performance.