Football Player Ranking With Classification and Streamlit App

Oct 9, 2022

5 min read


 


 

0. Introduction

Hi everyone, the third project of the Data Science Bootcamp, run in partnership with Istanbul Data Science Academy and Hepsiburada, has been completed with presentations, and so far it has been a very enlightening experience. Below, I explain the third project: football player ranking with classification using Python, paired with NumPy, pandas, Matplotlib, seaborn, scikit-learn, XGBoost, LightGBM, CatBoost, Streamlit, Tableau, and SQLite.

First of all, you can visit the project's GitHub repository here and the Streamlit app's GitHub repository here.

 

1. Problem Statement

We are a sports analytics consulting team serving the football domain. Our client has asked us to build a classification model that identifies the most talented football players.

For this purpose, we built a variety of classification models using data read from an SQLite database.

 

2. Methodology

We defined the following roadmap as the project methodology:

Data Reading from DB with sqlite3

Data Cleaning and Transformation

Exploratory Data Analysis (EDA) with Tableau

Feature Selection and Modeling

Model Prediction

Interpreting Results

Football Player Ranking Streamlit App

 

3. Data Reading from DB with SQLite3

Using the following functions that we developed, we read all the tables that had previously been written to the database via sqlite3.

getTableNames()

Using this function, we can get all table names from the SQLite DB.

getAllTables()

Using this function, we can get all table records from the SQLite DB.

mergeAllData()

Using this function, we can get necessary table records from DB and merge them.

cleanAndgetCSV()

Using this function, we can write the merged and cleaned data to CSV for use in model building in the next steps.
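Since the original code is shown only as an image, here is a sketch of what the first two helpers might look like; the DB file name and table layout are assumptions, and the real project also defines mergeAllData() and cleanAndgetCSV() on top of these:

```python
import sqlite3

import pandas as pd

DB_PATH = "database.sqlite"  # assumed DB file name


def getTableNames():
    """Return the names of all tables in the SQLite DB."""
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    return [row[0] for row in rows]


def getAllTables():
    """Load every table into a dict of pandas DataFrames keyed by table name."""
    with sqlite3.connect(DB_PATH) as conn:
        return {
            name: pd.read_sql_query(f"SELECT * FROM {name}", conn)
            for name in getTableNames()
        }
```

`sqlite_master` is SQLite's built-in catalog table, so `getTableNames()` works on any SQLite file without knowing its schema in advance.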

So, if we call the cleanAndgetCSV() function after defining all four functions, it writes the data to CSV, which we can then read with pandas. The following is an overview of our data:

Figure 2: Read Data Overview

4. Data Cleaning and Transformation

After reading the data and converting it to a DataFrame, we made some transformations on the columns, such as extracting specific date parts from DateTime values and typecasting some columns using the split and astype functions, like this:
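As the original snippet is only a screenshot, here is a small sketch of those transformations on hypothetical sample rows mirroring the shape of the player data:

```python
import pandas as pd

# Hypothetical sample rows standing in for the real player data
df = pd.DataFrame({
    "date": ["2016-02-18 00:00:00", "2015-09-21 00:00:00"],
    "birthday": ["1987-03-29 00:00:00", "1989-12-15 00:00:00"],
    "height": ["182.88", "170.18"],
})

# Keep only the date part of the DateTime string with split ...
df["date"] = df["date"].str.split(" ").str[0]
# ... extract a specific part such as the birth year and typecast it ...
df["birth_year"] = df["birthday"].str.split("-").str[0].astype(int)
# ... and typecast a numeric column that was stored as text
df["height"] = df["height"].astype(float)
```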

Finally, we applied the following methods to handle "Unknown" values:

Filling with mode

Filling with mean

As an example, the "Unknown" value handling code looks like this:
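A minimal sketch of both fills on hypothetical columns (the column names are assumptions, not the project's actual schema):

```python
import pandas as pd

# Hypothetical columns containing "Unknown" placeholders
df = pd.DataFrame({
    "preferred_foot": ["right", "Unknown", "left", "right"],
    "vision": ["54", "Unknown", "68", "70"],
})

# Categorical column: fill "Unknown" with the mode of the known values
mode_value = df.loc[df["preferred_foot"] != "Unknown", "preferred_foot"].mode()[0]
df["preferred_foot"] = df["preferred_foot"].replace("Unknown", mode_value)

# Numeric column: coerce "Unknown" to NaN, then fill with the mean
df["vision"] = pd.to_numeric(df["vision"], errors="coerce")
df["vision"] = df["vision"].fillna(df["vision"].mean())
```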

 

5. Exploratory Data Analysis (EDA)

We visualized some features in Tableau against the target variable, OVERALL_RATING, to observe whether there are any informative patterns, as follows:

Figure 3: Overall_rating vs Vision

Figure 4: Overall_rating vs Finishing

Figure 5: Overall_rating vs Potential

Figure 6: Overall_rating vs Reactions

Figure 7: Overall_rating vs Age vs Date

As seen in the bar and time-series charts above, these five features show positive or negative informative patterns with the target variable OVERALL_RATING. You can also see the correlation heatmap below:

Figure 8: Correlation Heatmap

 

6. Feature Selection and Modeling

We built our model with 19 features in total and used the following method in the model-building process:

Figure 9: Train-Validation-Test Split

Our data is imbalanced, with far more 0 labels than 1s. Therefore, we applied oversampling to the training dataset with the SMOTE function, as in this code:

Figure 10: Oversampling in Imbalanced Data

The following code shows the building, prediction, evaluation, and K-Fold cross-validation process for many models using the train-test split method:
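A simplified sketch of that loop using scikit-learn models on synthetic data; the real project also trained XGBoost, LightGBM, and CatBoost models in the same way:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the 19-feature dataset
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    # Fit on the training split and evaluate on the held-out test split
    model.fit(X_train, y_train)
    test_f1 = f1_score(y_test, model.predict(X_test))
    # 5-fold cross-validation on the training split for a more stable score
    cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()
    results[name] = (test_f1, cv_f1)
    print(f"{name}: test F1={test_f1:.3f}, 5-fold CV F1={cv_f1:.3f}")
```

Looping over a dict of estimators like this keeps the comparison fair, since every model sees exactly the same splits.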

 

7. Model Prediction and Interpreting Results

We used precision, recall, and F1 score as evaluation metrics in model prediction, computed for both the train-test (80%-20%) split and K-Fold cross-validation.
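As a tiny worked example of the three metrics (the labels below are made up for illustration):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: TP = 2, FP = 1, FN = 1
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```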

Figure 11: Model Prediction Results

As seen in the prediction results table above, the algorithms that use ensemble methodology gave better test results than the others. Finally, we selected the XGBoost model, as it was the best according to the evaluation metrics.

 

8. Football Player Ranking Streamlit App

To serve predictions from the trained model, we developed an interactive web app using Streamlit and deployed it to a live web environment with Heroku; a sample figure from the app is given below. You can test the live Streamlit app here, or visit the Streamlit app's GitHub repository here.

Figure 12: Streamlit App

 

9. Conclusion

This is the end of the article. In it, I have tried to explain in detail the third project of our Data Science Bootcamp. As a reminder, you can visit the project's GitHub repository here. If you wish, you can follow me on Medium. Hope to see you in my next article…
