Linear Regression: Predict Car Price

Alex Lim
5 min readJun 4, 2021

Made by:
1)Alex Sander Lim(0706021910004)
2)Terrence Pramono(0706021910003)

Photo by Erik Mclean on Unsplash

The cost of a new car’s prices is determined by the manufacturer of the car’s company with extra cost of tax according to different places in the world. As time pass by, the cost of the new car slowly become cheaper and when the cars become really old or become antique is a whole different story where the prices of those antique cars can jump really high.

In some cases, customers want to buy a new car but they don’t want the newest car, rather it’s an old-produced car. In other cases, some customers like to collect antique cars and interested in investing in old cars, so the goal of this project is to help people that want to try the machine learning method or people that are interested to buy antique cars or rather people that want to predict car price. We will use linear regression to predict the car price. With expectation, the prediction of this machine learning model might bring a benefit to people that want to learn machine learning nor people that like to collect antique cars, or just normal people that interested in predicting car prices and want to learn predicting it.

The data in this article is used from Manish Kumar.

Introduction

supervised machine learning is defined by its use of labeled datasets to train algorithms to predict the result accurately. Supervised machine learning can be separated into 2 types which are classification and regression. To know when to use classification problems and regression problems, we can see the value to determine what algorithm to use. In classification algorithm are used to predict discrete values (ex: male or female, yes or no, dead or alive, etc.) while regression algorithm is used to predict continuous values (ex: price, temperature, speed, etc.).

Linear regression is one of the regression algorithms, it is used when we want to predict a specific value of a variable based on value from different variables. The variable that will be predicted is called dependent values while the variable that is used to predict the other variable’s value is called the independent variable. In this article, the dependent variable is the price while another variable is independent.

Preprocessing

Import Library:

Import Dataset and print it to see all the data from the dataset.

Exploratory Data Analysis

Start with checking all the null values that exist in the data

From the result of df.info(), we can see that null values do not exist hence the data is ok to be processed.

EDA Univariate, using df.describe() we can see characteristic of each column in the dataset. The purpose of showing these data is to see, is there any 0 value on the characteristic of each column. To determine the outlier from these data, there are requirements needed however characteristic of a car has too many factors hence it is hard to determine this data is an outlier or not.

The unique value is 147 which is really high compared to the total of data which is 205. Since 71.7% of the data is unique, we can’t use it to analyze. To organize the data, a new column is created with the name ‘Brand’ which the purpose of this new column is to represent the car name in a group of categories.

Next is the process of making a new column ‘Brand’ which represents the car name.

Since some of the data are categorical data or string-typed data, LabelEncoder is used to convert the string values to numeric values.

The result of label encoding:

EDA Multivariate, Heatmap is used here to see the correlation between independent value and dependent value(price).

From the result above there is a high correlation between 2 independent values, but since our purpose in this article is to predict a car price, We are going to drop a column that has a low correlation with the dependent variable (Price).

Linear Regression Model

In the linear regression model, the first thing to do is to split the dataset into independent(X) and dependent(y) Variables.

After splitting we are applying a linear regression model.

Dropping data with a high p-value with the value of α=0.1

Here is the comparison between Y_pred and y_test with purpose to see how close the result of the prediction with the actual score.

Conclusion

1. The main factors that affect the price of a car are drivewheel, engine size, bore ratio, and horsepower.

2. The Linear Regression model created to predict the car price has 3533.2 RMSE Value and 18.6 MAPE Value.

Reference

Regression vs Classification in Machine Learning — Javatpoint. www.javatpoint.com. (n.d.). https://www.javatpoint.com/regression-vs-classification-in-machine-learning.

By: IBM Cloud Education. (n.d.). What is Supervised Learning? IBM. https://www.ibm.com/cloud/learn/supervised-learning#:~:text=Supervised%20learning%2C%20also%20known%20as,data%20or%20predict%20outcomes%20accurately.

Stilt. (2020, November 27). THIS is how much tax you will pay on used car [2021]. Stilt Blog. https://www.stilt.com/blog/2020/11/how-much-tax-will-i-pay-for-a-used-car/.

Linear Regression Analysis using SPSS Statistics. Linear Regression Analysis in SPSS Statistics — Procedure, assumptions and reporting the output. (n.d.). https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php.

--

--