How Can We Predict the Profit of a Startup?
Today we will continue our tutorial on linear regression.
As you know, in simple linear regression we have a single independent variable and a single dependent variable, but in multiple linear regression, several independent variables together determine the dependent variable.
For example, in simple linear regression we saw that the salary of an employee depends on the number of years of experience, but it can also depend on the level of studies, on how much the employee knows, and so on. Since several variables help predict the salary, we need multiple linear regression.
Objective
We want our model to predict the profit of a startup based on these independent variables, to help investors decide which companies to invest in so they can maximize their profit.
Dataset
The dataset (can be found here) that we use in this model contains data about 50 startups. It has 5 columns: “R&D Spend”, “Administration”, “Marketing Spend”, “State”, and “Profit”.
The first 3 columns indicate how much each startup spends on research and development, on administration, and on marketing; the “State” column indicates which state the startup is based in; and the last column states the profit made by the startup.
-
Step 1: Import Libraries
We will use 4 libraries:
- NumPy
import numpy as np
- Pandas
import pandas as pd
- Matplotlib
import matplotlib.pyplot as plt
- Scikit-learn
from sklearn.linear_model import LinearRegression
-
Step 2: Load the Dataset
We will load the data into a pandas DataFrame. Here X contains all the independent variables, which are “R&D Spend”, “Administration”, “Marketing Spend”, and “State”, and y is the dependent variable, which is “Profit”.
So for X, we specify:
X = dataset.iloc[:, :-1].values
and for y, we specify:
y = dataset.iloc[:, 4].values
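The loading step itself is not shown above, so here is a minimal sketch. The inline DataFrame (with a few rows of illustrative values) stands in for `pd.read_csv('50_Startups.csv')`; the filename is an assumption, and in practice you would read the full 50-row file instead.

```python
import pandas as pd

# Stand-in for: dataset = pd.read_csv('50_Startups.csv')
# (a few illustrative rows; the real file has 50)
dataset = pd.DataFrame({
    'R&D Spend':       [165349.20, 162597.70, 153441.51, 144372.41],
    'Administration':  [136897.80, 151377.59, 101145.55, 118671.85],
    'Marketing Spend': [471784.10, 443898.53, 407934.54, 383199.62],
    'State':           ['New York', 'California', 'Florida', 'New York'],
    'Profit':          [192261.83, 191792.06, 191050.39, 182901.99],
})

# All columns except the last are the independent variables
X = dataset.iloc[:, :-1].values
# The last column (index 4) is the dependent variable, Profit
y = dataset.iloc[:, 4].values
```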
-
Step 3: Convert text variable to numbers
We can see that our dataset has a categorical variable, “State”, which we need to encode. Here the “State” variable is at index 3. We use the OneHotEncoder and ColumnTransformer classes to convert the text to numbers.
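A minimal sketch of this encoding step, using a small hand-made feature matrix in place of the real X (the values are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative feature matrix; column 3 ('State') is categorical
X = np.array([
    [165349.20, 136897.80, 471784.10, 'New York'],
    [162597.70, 151377.59, 443898.53, 'California'],
    [153441.51, 101145.55, 407934.54, 'Florida'],
], dtype=object)

# One-hot encode column 3; pass the numeric columns through unchanged.
# The dummy columns are placed first in the transformed output.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=float)

# Avoid the dummy variable trap: drop the first dummy column
X = X[:, 1:]
```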
After running the encoding step, we can see that 3 dummy variables have been added, since we had 3 different states. Now, we have to remove one of the dummy variables. You can read about the dummy variable trap and why we need to remove one of the dummy variables.
-
Step 4: Split dataset - training set and test set
Next, we have to split the dataset into a training set and a test set. We will use the training set for training the model and then check the performance of the model on the test set. For this we will use the train_test_split method from sklearn's model_selection module.
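A sketch of the split, on stand-in arrays (the shapes are illustrative, not the real dataset's). An 80/20 split with a fixed random_state is assumed, which is the usual choice in this kind of tutorial:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 samples with 5 features each
X = np.arange(50, dtype=float).reshape(10, 5)
y = np.arange(10, dtype=float)

# Hold out 20% of the rows as the test set;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```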
-
Step 5: Fit our model to training set
This is a very simple step. We will be using the LinearRegression class from the library sklearn.linear_model.
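The fitting step can be sketched as follows; the toy training data (an exact linear relation y = 2·x0 + 3·x1) stands in for the startup features so the fit is easy to verify:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: y = 2*x0 + 3*x1 (stand-in for the startup features)
X_train = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 2.0], [4.0, 5.0]])
y_train = 2 * X_train[:, 0] + 3 * X_train[:, 1]

# Fit an ordinary least squares model to the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)
```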
-
Step 6: Predict the test set
Using the regressor we trained in the previous step, we will now predict the results of the test set and compare the predicted values with the actual values. Putting them side by side, our model did pretty well.
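A sketch of the prediction and comparison, again on toy data (y = 2·x) standing in for the trained startup model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit on toy data where y = 2*x
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([2.0, 4.0, 6.0])
regressor = LinearRegression().fit(X_train, y_train)

# Predict the test set and stack predictions next to the actual values
X_test = np.array([[4.0], [5.0]])
y_test = np.array([8.0, 10.0])
y_pred = regressor.predict(X_test)
comparison = np.column_stack((y_pred, y_test))
print(comparison)
```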
-
Step 7: Backward Elimination
In the model that we just built, we used all the independent variables, but it's possible that some independent variables are more significant than others and have a greater impact on the profit, while some are not significant at all, meaning that if we remove them from the model we may get better predictions. The first step is to add a column of 1's to our X dataset as the first column, because the statsmodels OLS class does not add an intercept term by itself.
Now we will start the backward elimination process. Since we will be creating a new optimal matrix of features, we will call it X_opt. This will contain only the independent features that are significant in predicting profit. Next we create a new regressor of the OLS class (Ordinary Least Squares) from the statsmodels library. It takes 2 arguments:
- endog: the dependent variable.
- exog: the matrix containing all independent variables.
Let's examine the output: x1 and x2 are the 2 dummy variables we added for state, x3 is R&D Spend, x4 is Administration, and x5 is Marketing Spend. We have to look for the highest p-value above our significance level of 0.05, which in this case is 0.99 (99%) for x2. So we have to remove x2 (the 2nd dummy variable for state), which has index 2.
Now we will repeat the process, each time removing the independent variable with the highest p-value and refitting the model.
Finally, we are left with only 1 independent variable, which is “R&D Spend”.
Conclusion
Update: note that sklearn's LinearRegression does not select features on its own; it fits all the features it is given. If you want automatic selection of the most relevant features instead of manual backward elimination, you can use a regularized model such as Lasso or sklearn's feature_selection utilities.
- If you want to share your models with me on Kaggle here is the link to our dataset: 50 Startups Data - Kaggle.
- Here is the Full Source Code.