Sanjiv R. Das
%pylab inline
import pandas as pd
Populating the interactive namespace from numpy and matplotlib
#Import the SBA Loans dataset
sba = pd.read_csv("DSTMAA_data/SBA.csv")
print(sba.columns)
print(sba.shape)
sba.head()
Index(['LoanID', 'GrossApproval', 'SBAGuaranteedApproval', 'subpgmdesc',
'ApprovalFiscalYear', 'InitialInterestRate', 'TermInMonths',
'ProjectState', 'BusinessType', 'LoanStatus', 'RevolverStatus',
'JobsSupported'],
dtype='object')
(527700, 12)
| LoanID | GrossApproval | SBAGuaranteedApproval | subpgmdesc | ApprovalFiscalYear | InitialInterestRate | TermInMonths | ProjectState | BusinessType | LoanStatus | RevolverStatus | JobsSupported | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 733784 | 50000 | 25000 | FA$TRK (Small Loan Express) | 2006 | 11.25 | 84 | IN | CORPORATION | CANCLD | 1 | 4 |
| 1 | 733785 | 35000 | 17500 | FA$TRK (Small Loan Express) | 2006 | 12.00 | 84 | IL | CORPORATION | CANCLD | 0 | 3 |
| 2 | 733786 | 15000 | 7500 | FA$TRK (Small Loan Express) | 2006 | 12.00 | 84 | WV | INDIVIDUAL | CANCLD | 0 | 4 |
| 3 | 733787 | 16000 | 13600 | Community Express | 2006 | 11.50 | 84 | MD | CORPORATION | PIF | 0 | 1 |
| 4 | 733788 | 16000 | 13600 | Community Express | 2006 | 11.50 | 84 | MD | CORPORATION | CANCLD | 0 | 1 |
The dependent variable is the guaranteed amount as a percentage of the gross loan approved.
#Create the dependent variable
Y = sba.SBAGuaranteedApproval.astype("float")/sba.GrossApproval.astype("float")
X = sba[['GrossApproval', 'ApprovalFiscalYear', 'InitialInterestRate', 'TermInMonths',
'RevolverStatus','JobsSupported']]
x1 = pd.get_dummies(sba.subpgmdesc)
x1 = x1.drop(x1.columns[0],axis=1)
X = pd.concat([X,x1],axis=1)
X.head()
| GrossApproval | ApprovalFiscalYear | InitialInterestRate | TermInMonths | RevolverStatus | JobsSupported | Community Advantage Initiative | Community Express | Contract Guaranty | EXPORT IMPORT HARMONIZATION | ... | Gulf Opportunity | International Trade - Sec, 7(a) (16) | Lender Advantage Initiative | Patriot Express | Revolving Line of Credit Exports - Sec. 7(a) (14) | Rural Lender Advantage | Seasonal Line of Credit | Small Asset Based | Small General Contractors - Sec. 7(a) (9) | Standard Asset Based | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50000 | 2006 | 11.25 | 84 | 1 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 35000 | 2006 | 12.00 | 84 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15000 | 2006 | 12.00 | 84 | 0 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 16000 | 2006 | 11.50 | 84 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 16000 | 2006 | 11.50 | 84 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 22 columns
x1 = pd.get_dummies(sba.BusinessType)
x1 = x1.drop(x1.columns[0],axis=1)
X = pd.concat([X,x1],axis=1)
X.head()
| GrossApproval | ApprovalFiscalYear | InitialInterestRate | TermInMonths | RevolverStatus | JobsSupported | Community Advantage Initiative | Community Express | Contract Guaranty | EXPORT IMPORT HARMONIZATION | ... | Lender Advantage Initiative | Patriot Express | Revolving Line of Credit Exports - Sec. 7(a) (14) | Rural Lender Advantage | Seasonal Line of Credit | Small Asset Based | Small General Contractors - Sec. 7(a) (9) | Standard Asset Based | INDIVIDUAL | PARTNERSHIP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50000 | 2006 | 11.25 | 84 | 1 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 35000 | 2006 | 12.00 | 84 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15000 | 2006 | 12.00 | 84 | 0 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 16000 | 2006 | 11.50 | 84 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 16000 | 2006 | 11.50 | 84 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 24 columns
#SAVE TO PICKLE
import pickle
SBAdata = {'sba': X}
pickle.dump(SBAdata, open( "DSTMAA_data/SBAdata.p", "wb" ))
#Run regression
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X,Y)
yhat = reg.predict(X)
from sklearn.metrics import mean_squared_error, r2_score
# The coefficients
print('Coefficients: \n', reg.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y,yhat))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Y,yhat))
Coefficients: [-1.57772017e-08 -2.93254437e-03 -3.36804150e-03 -5.01226990e-05 -4.24578025e-04 -2.01392544e-05 9.66450791e-02 1.18528756e-01 4.73548794e-02 -1.64573288e-02 -2.36558347e-01 6.51446091e-02 1.21989162e-01 1.85657957e-01 6.89418169e-02 1.02387596e-01 1.72723592e-01 1.02752017e-01 4.41276709e-02 1.05984173e-01 5.08275851e-02 3.04713903e-02 2.42619720e-03 -6.42487723e-04] Mean squared error: 0.00 Variance score: 0.92
import statsmodels.api as sm
import statsmodels.formula.api as smf
X = sm.add_constant(X, prepend=True)
results = smf.OLS(Y,X).fit()
print(results.summary())
/home/srdas/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py:52: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead. return getattr(obj, method)(*args, **kwds)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.921
Model: OLS Adj. R-squared: 0.921
Method: Least Squares F-statistic: 2.575e+05
Date: Wed, 24 Apr 2019 Prob (F-statistic): 0.00
Time: 13:46:37 Log-Likelihood: 8.9683e+05
No. Observations: 527700 AIC: -1.794e+06
Df Residuals: 527675 BIC: -1.793e+06
Df Model: 24
Covariance Type: nonrobust
=====================================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------------------------
const 6.6665 0.058 115.637 0.000 6.554 6.780
GrossApproval -1.578e-08 1.54e-10 -102.493 0.000 -1.61e-08 -1.55e-08
ApprovalFiscalYear -0.0029 2.86e-05 -102.639 0.000 -0.003 -0.003
InitialInterestRate -0.0034 3.32e-05 -101.516 0.000 -0.003 -0.003
TermInMonths -5.012e-05 1.14e-06 -43.935 0.000 -5.24e-05 -4.79e-05
RevolverStatus -0.0004 0.000 -2.610 0.009 -0.001 -0.000
JobsSupported -2.014e-05 1.57e-06 -12.865 0.000 -2.32e-05 -1.71e-05
Community Advantage Initiative 0.0966 0.004 24.790 0.000 0.089 0.104
Community Express 0.1185 0.004 31.434 0.000 0.111 0.126
Contract Guaranty 0.0474 0.004 10.792 0.000 0.039 0.056
EXPORT IMPORT HARMONIZATION -0.0165 0.011 -1.448 0.148 -0.039 0.006
FA$TRK (Small Loan Express) -0.2366 0.004 -62.960 0.000 -0.244 -0.229
Guaranty 0.0651 0.004 17.326 0.000 0.058 0.073
Gulf Opportunity 0.1220 0.004 31.621 0.000 0.114 0.130
International Trade - Sec, 7(a) (16) 0.1857 0.004 45.123 0.000 0.178 0.194
Lender Advantage Initiative 0.0689 0.004 18.275 0.000 0.062 0.076
Patriot Express 0.1024 0.004 27.063 0.000 0.095 0.110
Revolving Line of Credit Exports - Sec. 7(a) (14) 0.1727 0.004 43.418 0.000 0.165 0.181
Rural Lender Advantage 0.1028 0.004 26.899 0.000 0.095 0.110
Seasonal Line of Credit 0.0441 0.005 8.256 0.000 0.034 0.055
Small Asset Based 0.1060 0.007 15.688 0.000 0.093 0.119
Small General Contractors - Sec. 7(a) (9) 0.0508 0.005 9.417 0.000 0.040 0.061
Standard Asset Based 0.0305 0.004 7.862 0.000 0.023 0.038
INDIVIDUAL 0.0024 0.000 14.331 0.000 0.002 0.003
PARTNERSHIP -0.0006 0.000 -1.456 0.145 -0.002 0.000
==============================================================================
Omnibus: 239343.300 Durbin-Watson: 1.555
Prob(Omnibus): 0.000 Jarque-Bera (JB): 13407205.167
Skew: 1.403 Prob(JB): 0.00
Kurtosis: 27.533 Cond. No. 5.52e+08
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.52e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
