Sanjiv R. Das
%pylab inline
import pandas as pd
Populating the interactive namespace from numpy and matplotlib
#Import the SBA Loans dataset
sba = pd.read_csv("DSTMAA_data/SBA.csv")
print(sba.columns)
print(sba.shape)
sba.head()
Index(['LoanID', 'GrossApproval', 'SBAGuaranteedApproval', 'subpgmdesc', 'ApprovalFiscalYear', 'InitialInterestRate', 'TermInMonths', 'ProjectState', 'BusinessType', 'LoanStatus', 'RevolverStatus', 'JobsSupported'], dtype='object') (527700, 12)
LoanID | GrossApproval | SBAGuaranteedApproval | subpgmdesc | ApprovalFiscalYear | InitialInterestRate | TermInMonths | ProjectState | BusinessType | LoanStatus | RevolverStatus | JobsSupported | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 733784 | 50000 | 25000 | FA$TRK (Small Loan Express) | 2006 | 11.25 | 84 | IN | CORPORATION | CANCLD | 1 | 4 |
1 | 733785 | 35000 | 17500 | FA$TRK (Small Loan Express) | 2006 | 12.00 | 84 | IL | CORPORATION | CANCLD | 0 | 3 |
2 | 733786 | 15000 | 7500 | FA$TRK (Small Loan Express) | 2006 | 12.00 | 84 | WV | INDIVIDUAL | CANCLD | 0 | 4 |
3 | 733787 | 16000 | 13600 | Community Express | 2006 | 11.50 | 84 | MD | CORPORATION | PIF | 0 | 1 |
4 | 733788 | 16000 | 13600 | Community Express | 2006 | 11.50 | 84 | MD | CORPORATION | CANCLD | 0 | 1 |
The dependent variable is the guaranteed amount as a percentage of the gross loan approved.
#Create the dependent variable
Y = sba.SBAGuaranteedApproval.astype("float")/sba.GrossApproval.astype("float")
X = sba[['GrossApproval', 'ApprovalFiscalYear', 'InitialInterestRate', 'TermInMonths',
'RevolverStatus','JobsSupported']]
x1 = pd.get_dummies(sba.subpgmdesc)
x1 = x1.drop(x1.columns[0],axis=1)
X = pd.concat([X,x1],axis=1)
X.head()
GrossApproval | ApprovalFiscalYear | InitialInterestRate | TermInMonths | RevolverStatus | JobsSupported | Community Advantage Initiative | Community Express | Contract Guaranty | EXPORT IMPORT HARMONIZATION | ... | Gulf Opportunity | International Trade - Sec, 7(a) (16) | Lender Advantage Initiative | Patriot Express | Revolving Line of Credit Exports - Sec. 7(a) (14) | Rural Lender Advantage | Seasonal Line of Credit | Small Asset Based | Small General Contractors - Sec. 7(a) (9) | Standard Asset Based | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 50000 | 2006 | 11.25 | 84 | 1 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 35000 | 2006 | 12.00 | 84 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 15000 | 2006 | 12.00 | 84 | 0 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 16000 | 2006 | 11.50 | 84 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 16000 | 2006 | 11.50 | 84 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 22 columns
x1 = pd.get_dummies(sba.BusinessType)
x1 = x1.drop(x1.columns[0],axis=1)
X = pd.concat([X,x1],axis=1)
X.head()
GrossApproval | ApprovalFiscalYear | InitialInterestRate | TermInMonths | RevolverStatus | JobsSupported | Community Advantage Initiative | Community Express | Contract Guaranty | EXPORT IMPORT HARMONIZATION | ... | Lender Advantage Initiative | Patriot Express | Revolving Line of Credit Exports - Sec. 7(a) (14) | Rural Lender Advantage | Seasonal Line of Credit | Small Asset Based | Small General Contractors - Sec. 7(a) (9) | Standard Asset Based | INDIVIDUAL | PARTNERSHIP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 50000 | 2006 | 11.25 | 84 | 1 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 35000 | 2006 | 12.00 | 84 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 15000 | 2006 | 12.00 | 84 | 0 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 16000 | 2006 | 11.50 | 84 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 16000 | 2006 | 11.50 | 84 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 24 columns
#SAVE TO PICKLE
import pickle
SBAdata = {'sba': X}
pickle.dump(SBAdata, open( "DSTMAA_data/SBAdata.p", "wb" ))
#Run regression
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X,Y)
yhat = reg.predict(X)
from sklearn.metrics import mean_squared_error, r2_score
# The coefficients
print('Coefficients: \n', reg.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y,yhat))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Y,yhat))
Coefficients: [-1.57772017e-08 -2.93254437e-03 -3.36804150e-03 -5.01226990e-05 -4.24578025e-04 -2.01392544e-05 9.66450791e-02 1.18528756e-01 4.73548794e-02 -1.64573288e-02 -2.36558347e-01 6.51446091e-02 1.21989162e-01 1.85657957e-01 6.89418169e-02 1.02387596e-01 1.72723592e-01 1.02752017e-01 4.41276709e-02 1.05984173e-01 5.08275851e-02 3.04713903e-02 2.42619720e-03 -6.42487723e-04] Mean squared error: 0.00 Variance score: 0.92
import statsmodels.api as sm
import statsmodels.formula.api as smf
X = sm.add_constant(X, prepend=True)
results = smf.OLS(Y,X).fit()
print(results.summary())
/home/srdas/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py:52: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead. return getattr(obj, method)(*args, **kwds)
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.921 Model: OLS Adj. R-squared: 0.921 Method: Least Squares F-statistic: 2.575e+05 Date: Wed, 24 Apr 2019 Prob (F-statistic): 0.00 Time: 13:46:37 Log-Likelihood: 8.9683e+05 No. Observations: 527700 AIC: -1.794e+06 Df Residuals: 527675 BIC: -1.793e+06 Df Model: 24 Covariance Type: nonrobust ===================================================================================================================== coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------------------------------------- const 6.6665 0.058 115.637 0.000 6.554 6.780 GrossApproval -1.578e-08 1.54e-10 -102.493 0.000 -1.61e-08 -1.55e-08 ApprovalFiscalYear -0.0029 2.86e-05 -102.639 0.000 -0.003 -0.003 InitialInterestRate -0.0034 3.32e-05 -101.516 0.000 -0.003 -0.003 TermInMonths -5.012e-05 1.14e-06 -43.935 0.000 -5.24e-05 -4.79e-05 RevolverStatus -0.0004 0.000 -2.610 0.009 -0.001 -0.000 JobsSupported -2.014e-05 1.57e-06 -12.865 0.000 -2.32e-05 -1.71e-05 Community Advantage Initiative 0.0966 0.004 24.790 0.000 0.089 0.104 Community Express 0.1185 0.004 31.434 0.000 0.111 0.126 Contract Guaranty 0.0474 0.004 10.792 0.000 0.039 0.056 EXPORT IMPORT HARMONIZATION -0.0165 0.011 -1.448 0.148 -0.039 0.006 FA$TRK (Small Loan Express) -0.2366 0.004 -62.960 0.000 -0.244 -0.229 Guaranty 0.0651 0.004 17.326 0.000 0.058 0.073 Gulf Opportunity 0.1220 0.004 31.621 0.000 0.114 0.130 International Trade - Sec, 7(a) (16) 0.1857 0.004 45.123 0.000 0.178 0.194 Lender Advantage Initiative 0.0689 0.004 18.275 0.000 0.062 0.076 Patriot Express 0.1024 0.004 27.063 0.000 0.095 0.110 Revolving Line of Credit Exports - Sec. 7(a) (14) 0.1727 0.004 43.418 0.000 0.165 0.181 Rural Lender Advantage 0.1028 0.004 26.899 0.000 0.095 0.110 Seasonal Line of Credit 0.0441 0.005 8.256 0.000 0.034 0.055 Small Asset Based 0.1060 0.007 15.688 0.000 0.093 0.119 Small General Contractors - Sec. 7(a) (9) 0.0508 0.005 9.417 0.000 0.040 0.061 Standard Asset Based 0.0305 0.004 7.862 0.000 0.023 0.038 INDIVIDUAL 0.0024 0.000 14.331 0.000 0.002 0.003 PARTNERSHIP -0.0006 0.000 -1.456 0.145 -0.002 0.000 ============================================================================== Omnibus: 239343.300 Durbin-Watson: 1.555 Prob(Omnibus): 0.000 Jarque-Bera (JB): 13407205.167 Skew: 1.403 Prob(JB): 0.00 Kurtosis: 27.533 Cond. No. 5.52e+08 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 5.52e+08. This might indicate that there are strong multicollinearity or other numerical problems.