HDSC Stage G OSP: Predicting Car Prices

Jan 9, 2021

Navigating the used car marketspace with the predictive power of machine learning

This was a data science internship assignment done as a team with the following members: Olufunmilayo Aforijiku, Sooter Saalu, Amoo Eno, Micky Nnamdi, Fiyinfoba Ogunkeye, Ekemini Umanah, Sophia Jack, Adu Aanuoluwapo, Patrick Ogbonna, Oluwasayo Akinkunmi, Chukwuemeka Omeh, Echefu Charles, Toluwanimi Olorunnisola, Adeola Abiola, Samuel Adeapin, and Gloria Agunanna.

Link to the GitHub project

Tools Used: Python, Tableau, Git

Introduction

In many countries, it is common to shop for used cars rather than buy new ones. This decision is mostly driven by the cost of new vehicles, or sometimes by the belief that used cars work better or last longer than new ones. However, the price of a given car is often ambiguous: the resale price differs from seller to seller, and some sellers expect or require long negotiation before the final sale. Apart from these subjective factors, physical features such as the age, manufacturer, model, and mileage also play a part in the price of a used car.

This project looks at predicting the price of used cars based on these features.

Dataset

The dataset used in this project comes from Kaggle and contains data scraped from Craigslist, an American classified advertisements website with one of the world’s largest collections of used vehicles for sale.

The dataset had information on 458,213 cars, with 25 columns describing the features of each car, the location it was being sold from, and when it was advertised for sale.

As most of the information in the dataset was entered by users, there were a large number of missing values and input errors, which required a lot of cleaning.

Percentage of missing values in each column
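A per-column missing-value percentage like the one charted above can be computed in a couple of lines with pandas. The sketch below uses a tiny made-up DataFrame, not the real Kaggle data; the column names are borrowed from features mentioned in this article and the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings data (illustrative values only, not the real dataset).
listings = pd.DataFrame({
    "price": [4500, 12000, 8900, 15500],
    "condition": ["good", np.nan, np.nan, "excellent"],
    "cylinders": [np.nan, "6 cylinders", np.nan, np.nan],
    "odometer": [120000.0, 45000.0, np.nan, 30000.0],
})

# isna() marks missing cells; the column-wise mean of those booleans
# is the fraction of missing values in each column.
missing_pct = (listings.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Sorting in descending order puts the sparsest columns first, which makes it easier to decide which ones to fill and which to drop.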

Our cleaning process began with an in-depth exploration of the dataset features, which led to the discovery that the description column contained some of the missing information on the condition of the car, as well as some other physical features, in an unstructured format.
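To give a sense of what pulling structured values out of free text can look like, here is a minimal sketch using regular expressions. The keyword list, the patterns, and the helper names (`extract_from_description`, `fill_missing`) are assumptions made for illustration; the team's actual extraction rules are not described in this post.

```python
import re

# Hypothetical keyword list and patterns, chosen for illustration only.
CONDITION_PATTERN = re.compile(r"\b(excellent|like new|good|fair|salvage)\b", re.IGNORECASE)
CYLINDER_PATTERN = re.compile(r"\b(\d{1,2})[\s-]?(?:cyl|cylinders?)\b", re.IGNORECASE)


def extract_from_description(description):
    """Return any condition / cylinder hints found in a free-text description."""
    found = {}
    condition = CONDITION_PATTERN.search(description)
    if condition:
        found["condition"] = condition.group(1).lower()
    cylinders = CYLINDER_PATTERN.search(description)
    if cylinders:
        found["cylinders"] = int(cylinders.group(1))
    return found


def fill_missing(row):
    """Fill only fields that are missing (None); user-entered values are left intact."""
    hints = extract_from_description(row.get("description") or "")
    return {key: (hints.get(key) if value is None else value) for key, value in row.items()}


row = {
    "condition": None,
    "cylinders": None,
    "description": "2012 pickup in excellent condition, 6-cyl engine, clean title",
}
print(fill_missing(row))
```

Only missing fields are touched, so a value the seller actually entered is never overwritten by a guess from the description.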

This called for meticulous data extraction as we cleaned and processed the data, filling missing values with information from the description column when available and dropping rows and columns when needed. Outliers and input errors were also removed using quartiles and granular data exploration. At the end of this process, we had a clean dataset of 186,923 rows out of the initial 458,213 entries.

Our next step was to analyse the dataset, exploring its features and seeing how they relate to price.

Some categories of used cars stand out: diesel, pickup, 4-wheel-drive, and "other" transmission vehicles outsell their counterparts on average.
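One common quartile-based rule is the interquartile range (IQR) filter. This is a sketch under assumptions: the 1.5 multiplier and the helper name `drop_iqr_outliers` are illustrative choices, not necessarily the thresholds used in the project.

```python
import pandas as pd


def drop_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    within = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[within]


# Toy prices: a $1 placeholder listing and an implausible $9,999,999 entry
# sit among otherwise ordinary values (made-up numbers).
listings = pd.DataFrame({"price": [1, 4500, 5200, 6100, 7000, 7400, 8900, 9999999]})
cleaned = drop_iqr_outliers(listings, "price")
print(cleaned["price"].tolist())
```

A rule like this catches extreme values on both ends, but it cannot tell a typo from a genuinely rare car, which is why manual exploration is still needed alongside it.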
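A comparison like this can be made by grouping on a categorical column and averaging the price. The sketch assumes the comparison is on average listing price and uses made-up numbers; the same pattern applies to `type`, `drive`, or `transmission`.

```python
import pandas as pd

# Made-up listings for illustration only.
listings = pd.DataFrame({
    "fuel": ["diesel", "gas", "gas", "diesel", "gas"],
    "price": [30000, 12000, 15000, 26000, 9000],
})

# Average price per fuel type, highest first.
avg_price_by_fuel = listings.groupby("fuel")["price"].mean().sort_values(ascending=False)
print(avg_price_by_fuel)
```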

Modelling

To predict the price of a used car, we selected a few features as important to the price variable: region, year, manufacturer, model, cylinders, fuel, odometer, title_status, transmission, drive, type, paint_color, and state.

As there were a number of categorical variables in our dataset, an encoder was first applied to them, while the numerical variables were standardized. This encoded and standardized dataset was then tested with several modelling algorithms: Linear Regression, Decision Trees, Random Forest, Extra Trees Regressor, CatBoost, LGBM, XGB Regressor, and XGBRF Regressor.

Of all the models used, the best performer was the Extra Trees Regressor, which had the highest R² score (0.928) and the lowest RMSE (2912.007).

Taking a closer look at our Extra Trees Regressor, we explored which features or variables had the most influence on our predictions.
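The encode, scale, and compare workflow can be sketched with scikit-learn as follows. Several things here are assumptions: the post does not say which encoder was used (an `OrdinalEncoder` stands in), the data is synthetic, only four of the features are included, and only three of the listed algorithms are compared to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in data (not the Craigslist dataset).
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "year": rng.integers(2000, 2021, n),
    "odometer": rng.uniform(5_000, 250_000, n),
    "fuel": rng.choice(["gas", "diesel"], n),
    "drive": rng.choice(["4wd", "fwd", "rwd"], n),
})
data["price"] = (
    1_500 * (data["year"] - 1995)
    - 0.02 * data["odometer"]
    + 4_000 * (data["fuel"] == "diesel")
    + rng.normal(0, 500, n)
)

categorical = ["fuel", "drive"]
numerical = ["year", "odometer"]

# Encode categorical columns and standardize numerical ones in one step.
preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical),
    ("num", StandardScaler(), numerical),
])

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

X_train, X_test, y_train, y_test = train_test_split(
    data[categorical + numerical], data["price"], test_size=0.25, random_state=0
)

scores = {}
for name, model in models.items():
    pipeline = Pipeline([("preprocess", preprocess), ("model", model)])
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)  # R² on the held-out split
    print(f"{name}: R² = {scores[name]:.3f}")
```

Wrapping the preprocessing and the model in a single `Pipeline` means the encoder and scaler are fitted on the training split only, so nothing from the test split leaks into training.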
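For readers unfamiliar with the two metrics: R² measures the share of price variance the model explains (1.0 is a perfect fit), while RMSE is the typical prediction error in the same units as price. A minimal sketch computing both from their definitions, on made-up numbers and with illustrative function names:

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse_manual(y_true, y_pred):
    """RMSE = square root of the mean squared prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up actual and predicted prices.
actual = [5000, 12000, 8000, 20000]
predicted = [5500, 11000, 9000, 19500]

r2 = r2_manual(actual, predicted)
rmse = rmse_manual(actual, predicted)
print(f"R² = {r2:.3f}, RMSE = {rmse:.1f}")
```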

Feature importance for the Extra Trees Regressor

The distance a used car has travelled and its age were the top two features in our model, highlighting their importance in the buying and selling of used cars.
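Rankings like this can be read from a fitted scikit-learn tree ensemble through its `feature_importances_` attribute, which holds impurity-based importances that sum to 1. The sketch below uses synthetic data with a deliberately irrelevant `noise` column; it illustrates the mechanism and does not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: price depends on year and odometer, but not on "noise".
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "odometer": rng.uniform(5_000, 250_000, n),
    "year": rng.integers(2000, 2021, n),
    "noise": rng.normal(size=n),  # deliberately unrelated to price
})
y = 2_000 * (X["year"] - 1995) - 0.04 * X["odometer"] + rng.normal(0, 500, n)

model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its column name and rank from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```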

Deployment

We deployed our best-performing model through Google AI Platform so that it can be served by a web app.
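Before a scikit-learn model can be hosted anywhere, it usually has to be serialized to a file. The sketch below shows that step with `joblib` on a toy model. The file name, the temporary directory, and the training data are all assumptions for illustration; the team's actual export and upload steps are not described in this post.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Toy training data: [year, odometer] -> price (made-up values).
X = np.array([[2010, 120000], [2015, 60000], [2018, 30000], [2005, 200000]], dtype=float)
y = np.array([6000, 14000, 21000, 3000], dtype=float)
model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X, y)

# Serialize the fitted model to disk (hypothetical file name and location).
export_dir = tempfile.mkdtemp()
model_path = os.path.join(export_dir, "model.joblib")
joblib.dump(model, model_path)

# Reload it to confirm the saved artifact produces the same predictions.
restored = joblib.load(model_path)
print(restored.predict(X[:1]))
```

Reloading the file and comparing predictions is a cheap sanity check before handing the artifact to a hosting service.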

Testing Deployment on Google AI Platform

Summary

A machine learning solution benefits both buyers and sellers in the marketplace, as having an idea of the actual market price helps eliminate ambiguity in price.

There is a need to test this model on other marketplace datasets to avoid biases unique to Craigslist or the United States. Our next steps involve surveying Nigerian marketplace data and exploring used car images as a source of additional features.

Thank you for reading. Feedback is welcome. ✔