
0. Introduction

Hi everyone! The second project of the Data Science Bootcamp, run in partnership with Istanbul Data Science Academy and Hepsiburada, has been completed with presentations, and so far it has been a very instructive experience. Below I explain that second project: a smartphone price prediction project built with regression and web scraping in Python, using NumPy, pandas, Matplotlib, seaborn, requests, Beautiful Soup, scikit-learn, and the time module.

First of all, you can visit the project’s GitHub repository here.

 

1. Problem Statement

We are a data analytics consulting team serving marketplace domains. Our client has asked us to build a model that predicts actual smartphone prices.

For this purpose, we built a variety of regression models using data that we scraped from the web.

 

2. Methodology

We defined the following roadmap as our project methodology:

Data Collection with Web Scraping

Data Cleaning and Transformation

Exploratory Data Analysis (EDA)

Feature Selection and Modeling

Model Prediction

Interpreting Results

 

3. Data Collection with Web Scraping

We scraped the required data from the website using the following functions that we developed.

Figure 1: Web Scraping Process

getAndParseURL()

Using this function, we get the parsed HTML of the URL passed to it.

getPageLinks()

Using this function, we recorded all listing-page links in an initially empty list.

getProductLinks()

Using this function, we recorded all product links from every page in an initially empty list.

getDataFrame()

Using this function, we extracted the pre-selected features of every product and recorded them in an initially empty list.

getResult()

Using this function, we built the entire dataframe from those records and columns simply by passing the starting URL to it.
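
A minimal sketch of how these five functions can fit together is given below. The function names match the list above; the URL pattern, CSS selectors, and feature columns are illustrative assumptions, not the exact ones used in the project.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def getAndParseURL(url):
    """Fetch the HTML of a URL and return it as a parsed BeautifulSoup object."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    return BeautifulSoup(response.text, "html.parser")


def getPageLinks(base_url, page_count):
    """Build the list of listing-page URLs (pagination scheme assumed)."""
    return [f"{base_url}?sayfa={i}" for i in range(1, page_count + 1)]


def getProductLinks(page_links):
    """Collect every product URL from every listing page."""
    product_links = []
    for page in page_links:
        soup = getAndParseURL(page)
        for a in soup.select("a.product-link"):  # assumed selector
            product_links.append(a["href"])
        time.sleep(1)  # be polite to the server
    return product_links


def getDataFrame(product_links):
    """Visit each product page and record the pre-selected features."""
    records = []
    for link in product_links:
        soup = getAndParseURL(link)
        records.append({
            "Price": soup.select_one("span.price").text,  # assumed selectors
            "RAM": soup.select_one("td.ram").text,
            "Storage": soup.select_one("td.storage").text,
        })
        time.sleep(1)
    return pd.DataFrame(records)


def getResult(base_url, page_count=10):
    """Run the whole pipeline and return the final dataframe."""
    pages = getPageLinks(base_url, page_count)
    products = getProductLinks(pages)
    return getDataFrame(products)
```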

 

So, if we call the getResult() function after defining all five functions, it returns the entire dataframe. Our data overview is as follows:

 

Figure 2: Scraped Data Overview

 

4. Data Cleaning and Transformation

After we scraped the data and transformed it into a dataframe, we applied some transformations to the columns, such as extracting only the numeric part of string values using split and replace functions, as in the example below.
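
Here is a minimal sketch of that string-to-numeric cleaning; the column names and raw value formats are illustrative assumptions.

```python
import pandas as pd

# Toy values standing in for the scraped strings; formats are assumptions.
df = pd.DataFrame({"RAM": ["6 GB", "8 GB"], "Price": ["4.500,00 TL", "7.250,00 TL"]})

# Keep only the number in front of the unit, e.g. "6 GB" -> 6
df["RAM"] = df["RAM"].str.split(" ").str[0].astype(int)

# Strip the currency suffix and normalize separators, e.g. "4.500,00 TL" -> 4500.0
df["Price"] = (
    df["Price"]
    .str.replace(" TL", "", regex=False)
    .str.replace(".", "", regex=False)   # thousands separator
    .str.replace(",", ".", regex=False)  # decimal separator
    .astype(float)
)
```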

Finally, we applied the following methods to handle missing values:

Filling with mode

Filling with mean

Filling with the true value found by looking it up

Dropping rows

An example of the missing-value handling code is given below.
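
A minimal sketch covering the four strategies listed above; the column names, toy values, and looked-up fill value are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the scraped data; columns are assumptions.
df = pd.DataFrame({
    "Charge_Type": ["USB-C", None, "USB-C", "Micro-USB"],
    "Battery": [4000.0, np.nan, 5000.0, 3000.0],
    "CPU_Model": [None, "Snapdragon 855", "Exynos 9820", "Helio G90T"],
    "Price": [5000.0, 6500.0, np.nan, 3500.0],
})

df["Charge_Type"] = df["Charge_Type"].fillna(df["Charge_Type"].mode()[0])  # fill with mode
df["Battery"] = df["Battery"].fillna(df["Battery"].mean())                 # fill with mean
df.loc[0, "CPU_Model"] = "Snapdragon 855"  # fill with the true value we looked up
df = df.dropna(subset=["Price"])           # drop rows missing the target
```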

After data cleaning and transformation, the distribution of the target variable (PRICE) is as follows:

 

Figure 3: PRICE Distribution Plot

We decided that products with a price below 500 were outliers, so we filtered them out, as in the one-line sketch below. For the detailed data cleaning and transformation process, please click here.
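
A one-line sketch of that filter, assuming the target column is named Price:

```python
df = df[df["Price"] >= 500]  # drop products priced under 500 as outliers
```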

 

5. Exploratory Data Analysis (EDA)

We plotted some features against the target variable, PRICE, to observe whether there is any linear relationship, as follows:

 

Figure 4: RELEASE YEAR vs PRICE
Figure 5: STORAGE CAPACITY vs PRICE

 

Figure 6: RAM vs PRICE
Figure 7: CPU THICKNESS vs PRICE
Figure 8: SCREEN TECH vs PRICE
Figure 9: CHARGE TYPE vs PRICE

As seen in the bar charts above, these six features have positive or negative, linear or polynomial relationships with the target variable PRICE. You can also see the correlation heatmap below:

Figure 10: Correlation Heatmap

 

6. Feature Selection and Modeling

We sorted the dataset by product release year so that we could test the built model on the most up-to-date data. In the modeling process, we applied the one-hot encoding technique to the following features (see the sketch after the list):

Screen_Tech

CPU_Model

GPU_Model

Operating_System

Charge_Type
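
A minimal sketch of that encoding step, continuing with the cleaned dataframe df. The feature names match the list above; using pandas get_dummies here is an assumption, as the project may have used scikit-learn’s OneHotEncoder instead.

```python
import pandas as pd

categorical = ["Screen_Tech", "CPU_Model", "GPU_Model", "Operating_System", "Charge_Type"]
df = pd.get_dummies(df, columns=categorical, drop_first=True)  # one binary column per category
```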

We used the following method in the model-building process:

Figure 11: Train-Validation-Test Split

The code below shows the building, predicting, evaluating, and K-Fold cross-validation process of a linear regression model with the train-validation-test split method.
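
A minimal sketch of that process, continuing with the encoded dataframe df. Splitting with shuffle=False keeps the most recent products (the data is sorted by release year) in the validation and test sets; the exact split call used in the project is an assumption.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X = df.drop(columns=["Price"])
y = df["Price"]

# 60% train, 20% validation, 20% test; shuffle=False preserves the release-year order.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, shuffle=False)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, shuffle=False)

lr = LinearRegression()
lr.fit(X_train, y_train)

val_pred = lr.predict(X_val)
print("Validation R2:", r2_score(y_val, val_pred))
print("Validation MSE:", mean_squared_error(y_val, val_pred))

# 10-fold cross-validation on the training portion
print("CV mean R2:", cross_val_score(lr, X_train, y_train, cv=10, scoring="r2").mean())
```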

 

7. Model Prediction and Interpreting Results

We used the R2 score and MSE as evaluation metrics in model prediction and computed them for the train-validation-test (60%-20%-20%) split, the train-test (80%-20%) split, and 10-fold cross-validation.

Figure 12: Model Prediction Results

As seen in the prediction results table above, we did not run into overfitting or underfitting problems. Feature selection and data quality are the most important parts of a machine learning project for avoiding these problems, so we need to be careful at those stages. Finally, we selected the ridge regression model, as it was the best one according to the evaluation metrics.
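
A minimal sketch of fitting and evaluating the selected ridge model on the held-out test set, reusing the split from the previous sketch; alpha=1.0 is an assumed regularization strength, not the tuned value.

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

ridge = Ridge(alpha=1.0)  # assumed alpha; the project likely tuned this
ridge.fit(X_train, y_train)

test_pred = ridge.predict(X_test)
print("Test R2:", r2_score(y_test, test_pred))
print("Test MSE:", mean_squared_error(y_test, test_pred))
```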

 

8. Conclusion

This is the end of the article. In it, I have tried to explain the second project of our Data Science Bootcamp in detail. As a reminder, you can visit the project’s GitHub repository here. If you wish, you can follow me on Medium. Hope to see you in my next article…