Exploratory Data Analysis WomenTechWomenYes Annual Gala
0. Introduction
Hi everyone! The first project of the Data Science Bootcamp, run in partnership with Istanbul Data Science Academy, has wrapped up with presentations, and so far it has been a very enlightening experience. Below I explain that first project: an exploratory data analysis of the MTA turnstile data and other supporting datasets using Python with NumPy, pandas, matplotlib, seaborn, and datetime.
First of all, you can visit the project’s GitHub repository from here.
1.a. Problem Statement
As we mentioned, we are interested in harnessing the power of data and analytics to optimize the effectiveness of our street team work, which is a significant portion of our fundraising efforts.
WomenTechWomenYes (WTWY) has an annual gala at the beginning of the summer each year. As we are a new and inclusive organization, we try to do double duty with the gala both to fill our event space with individuals passionate about increasing the participation of women in technology and to concurrently build awareness and reach.
To this end, we place street teams at entrances to subway stations. The street teams collect email addresses and those who sign up are sent free tickets to our gala.
Where we’d like to solicit your engagement is to use MTA subway data, which as I’m sure you know is available freely from the city, to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.
1.b. Problem Statement
“How should WTWY place street teams most effectively?”
Gather maximum email addresses
Gain ideal attendees for WTWY
Find financial contributors
Consider size and time constraints
2. Methodology
We gathered New York City MTA turnstile data from January 2, 2021 to March 13, 2021, a span of 10 weeks. As the name suggests, the dataset provides information on every turnstile at every station managed by the MTA. Regularly scheduled audits provide the number of entries and exits within four-hour periods, with a few exceptions: some audits occur outside the regularly scheduled intervals due to planning or troubleshooting activities, and some audits are missed. In total, we analyzed 2,092,870 rows of turnstile data. Our methodology can be summarized in 3 basic steps:
Finding Top 5 NYC Stations by Total Traffic
Filtering Top 5 NYC Stations by Household Income
Suggesting most appropriate NYC Stations by Hourly Traffic
3. Data Cleaning & Processing
We used the following 10 weeks of MTA data.
Figure 1: Original MTA Dataset
After that, we did some cleaning on the columns, such as dropping columns and stripping blank characters from the column names.
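The column cleanup might look something like the sketch below. The single sample row is made up, and the choice of which columns to drop (LINENAME, DIVISION, DESC) is our assumption for illustration, not necessarily what the project dropped. The padded EXITS header stands in for the blank characters mentioned above.

```python
import pandas as pd

# One made-up row for illustration; the EXITS header is padded with
# trailing spaces to mimic the blank characters in the raw column names.
raw = pd.DataFrame(
    [["A002", "R051", "02-00-00", "59 ST", "NQR456W", "BMT",
      "01/02/2021", "03:00:00", "REGULAR", 7511653, 2558871]],
    columns=["C/A", "UNIT", "SCP", "STATION", "LINENAME", "DIVISION",
             "DATE", "TIME", "DESC", "ENTRIES", "EXITS   "],
)

# Strip stray whitespace from every column name.
mta = raw.rename(columns=str.strip)

# Drop columns assumed (for this sketch) to be unnecessary for traffic analysis.
mta = mta.drop(columns=["LINENAME", "DIVISION", "DESC"])

# Combine DATE and TIME into one timestamp column (DATETIME is our own name).
mta["DATETIME"] = pd.to_datetime(
    mta["DATE"] + " " + mta["TIME"], format="%m/%d/%Y %H:%M:%S"
)
```

Having a single DATETIME column makes the later weekday and time-of-day groupings easier.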
Each turnstile is defined by 4 components (“C/A”, “Unit”, “SCP”, “Station”), represented as 4 separate columns in the dataset. We therefore had to group the rest of the data by these four components to analyze each turnstile.
As seen in the MTA dataset, ENTRIES and EXITS are the cumulative entry and exit register values for a given turnstile. Since we aggregate by day rather than by audit period for the first analysis, we subtracted each day's first (smallest) counter value from its last (largest) counter value, separately for entries and exits.
We also summed entries and exits into a single column, TRAFFIC, and appended weekday attributes to make it easier to analyze the data by weekday later. Our results were put into mta_entries_exits.
Figure 2: Daily Total Traffic Dataset
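As a rough sketch of the daily aggregation described above, the snippet below takes the largest minus the smallest cumulative reading per turnstile and day, then adds TRAFFIC and a weekday name. The readings are made up, and the WEEKDAY column name is our assumption. Note that max minus min only works if a counter never resets or runs backwards within a day.

```python
import pandas as pd

# Made-up cumulative readings for one turnstile over two days.
mta = pd.DataFrame({
    "C/A": ["A002"] * 4,
    "UNIT": ["R051"] * 4,
    "SCP": ["02-00-00"] * 4,
    "STATION": ["59 ST"] * 4,
    "DATE": ["01/02/2021", "01/02/2021", "01/03/2021", "01/03/2021"],
    "ENTRIES": [1000, 1150, 1150, 1400],
    "EXITS": [500, 560, 560, 700],
})

# The four columns that together identify a single turnstile.
turnstile = ["C/A", "UNIT", "SCP", "STATION"]

# Per turnstile and day: largest counter value minus smallest counter value.
mta_entries_exits = (
    mta.groupby(turnstile + ["DATE"])[["ENTRIES", "EXITS"]]
       .agg(lambda s: s.max() - s.min())
       .reset_index()
)

# Combine both directions into one TRAFFIC column and add the weekday name.
mta_entries_exits["TRAFFIC"] = (
    mta_entries_exits["ENTRIES"] + mta_entries_exits["EXITS"]
)
mta_entries_exits["WEEKDAY"] = pd.to_datetime(
    mta_entries_exits["DATE"], format="%m/%d/%Y"
).dt.day_name()
```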
As the snippet in Figure 3 shows, there are no null values in the dataset, but we still investigated irrelevant and outlier values in order to clean it.
Figure 3: Data Cleaning
You can look at the box plot of the ENTRIES and EXITS columns.
Figure 4: Box plot of ENTRIES and EXITS columns
We decided that more than 17,000 total entries or exits is not plausible, so we removed those rows as outliers.
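The null check and the outlier filter could be sketched as follows. We assume here that the 17,000 cutoff applies to each turnstile's daily ENTRIES and EXITS values. The sample rows are made up, and the last one imitates a counter glitch.

```python
import pandas as pd

# Made-up daily totals per turnstile; the last row mimics a counter glitch.
mta_entries_exits = pd.DataFrame({
    "STATION": ["59 ST", "59 ST", "86 ST", "86 ST"],
    "ENTRIES": [150, 250, 900, 2_500_000],
    "EXITS": [60, 140, 700, 300],
})

# Count missing values in each column.
null_counts = mta_entries_exits.isnull().sum()

# Keep only rows where both daily counts are at or below the cutoff
# (assumption: the cutoff is per turnstile per day).
MAX_DAILY_COUNT = 17_000
within_limit = (
    (mta_entries_exits["ENTRIES"] <= MAX_DAILY_COUNT)
    & (mta_entries_exits["EXITS"] <= MAX_DAILY_COUNT)
)
cleaned = mta_entries_exits[within_limit].copy()

# Recompute TRAFFIC from the cleaned counts.
cleaned["TRAFFIC"] = cleaned["ENTRIES"] + cleaned["EXITS"]
```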
4. Exploratory Data Analysis & Key Findings
The Top 5 NYC Metro Stations by Total Traffic are as follows:
34 ST-PENN STA
34 ST-HERALD SQ
86 ST
125 ST
GRD CNTRL-42 ST
Figure 5: Top 5 Stations By Total Traffic
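A ranking like this can be produced by summing TRAFFIC per station and taking the five largest totals. The sketch below uses placeholder station names and made-up numbers rather than the project's data.

```python
import pandas as pd

# Made-up daily traffic rows; station names are placeholders.
daily = pd.DataFrame({
    "STATION": ["A", "A", "B", "C", "D", "E", "F", "F"],
    "TRAFFIC": [100, 200, 250, 50, 400, 10, 300, 300],
})

# Total traffic per station over the whole period.
station_totals = daily.groupby("STATION")["TRAFFIC"].sum()

# The five busiest stations, largest first.
top5 = station_totals.nlargest(5)
```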
We noticed that all 5 of these stations are in the same region, Manhattan. Using an external dataset, we found that Manhattan has the highest average household income in NYC, so we decided to focus on these 5 stations.
Figure 6: NYC Regional Average Income
Ranked by average income, the same 5 stations are ordered as follows:
86 ST
GRD CNTRL-42 ST
34 ST-PENN STA
34 ST-HERALD SQ
125 ST
Figure 7: Top 5 Stations Average Income
We also made an assumption: it would be helpful to find stations where the number of tourists is low (people who will not be around to attend the gala) and most riders are native New Yorkers (people who will). Stations that are used primarily for commuting will have many more local residents than those that are popular tourist locations. For this reason, the selected stations should show a large difference between weekday and weekend total traffic.
We can extract day-based total traffic for every station as follows:
Figure 8: First Station’s Day-Based Traffic Dataset
Figure 9: First Station’s Day-Based Traffic
Similarly, we combined the 5 stations, and all of them show the same pattern, which suggests that they are not tourist stations.
Figure 10: 5 Stations Weekday Traffic
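The weekday versus weekend comparison behind this check can be sketched as below. The two station names and all traffic numbers are made up, and the column names IS_WEEKEND, WEEKDAY_AVG, WEEKEND_AVG and WEEKDAY_MINUS_WEEKEND are our own.

```python
import pandas as pd

# One made-up week of daily totals for two hypothetical stations.
dates = pd.date_range("2021-01-04", periods=7, freq="D")  # Monday to Sunday
daily = pd.DataFrame({
    "STATION": ["COMMUTER ST"] * 7 + ["TOURIST ST"] * 7,
    "DATE": list(dates) * 2,
    "TRAFFIC": [1000, 1000, 1000, 1000, 1000, 400, 400,
                600, 600, 600, 600, 600, 700, 700],
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
daily["IS_WEEKEND"] = daily["DATE"].dt.dayofweek >= 5

# Average daily traffic per station, split by weekday and weekend.
weekday_avg = daily[~daily["IS_WEEKEND"]].groupby("STATION")["TRAFFIC"].mean()
weekend_avg = daily[daily["IS_WEEKEND"]].groupby("STATION")["TRAFFIC"].mean()

comparison = pd.DataFrame({
    "WEEKDAY_AVG": weekday_avg,
    "WEEKEND_AVG": weekend_avg,
})

# A large positive gap suggests a commuter-heavy station.
comparison["WEEKDAY_MINUS_WEEKEND"] = (
    comparison["WEEKDAY_AVG"] - comparison["WEEKEND_AVG"]
)
```

In this toy data the commuter-style station has a large positive gap, while the tourist-style station has a negative one.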
In addition, we analyzed the dataset in 4-hour periods to make better recommendations for street teams.
We also computed each station's total traffic per 4-hour period, which gave the dataframe below.
Figure 11: 5 Stations Hourly Traffic Dataset
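Because the counters are cumulative, traffic per audit period can be obtained by differencing consecutive readings of the same turnstile. The sketch below shows one way to do that, with made-up readings. The DATETIME, DAY and HOUR columns are our own names, and labeling each period by the hour of its closing audit is an assumption.

```python
import pandas as pd

# Made-up cumulative readings for one turnstile on one day.
mta = pd.DataFrame({
    "C/A": ["A002"] * 4,
    "UNIT": ["R051"] * 4,
    "SCP": ["02-00-00"] * 4,
    "STATION": ["59 ST"] * 4,
    "DATETIME": pd.to_datetime([
        "2021-01-04 03:00", "2021-01-04 07:00",
        "2021-01-04 11:00", "2021-01-04 15:00",
    ]),
    "ENTRIES": [1000, 1010, 1200, 1350],
    "EXITS": [500, 520, 600, 640],
})

turnstile = ["C/A", "UNIT", "SCP", "STATION"]
mta = mta.sort_values(turnstile + ["DATETIME"])

# Difference between consecutive readings of the same turnstile.
period = mta.groupby(turnstile)[["ENTRIES", "EXITS"]].diff()
mta["TRAFFIC"] = period["ENTRIES"] + period["EXITS"]

# Label each period by the day and hour of the audit that closes it.
mta["DAY"] = mta["DATETIME"].dt.day_name()
mta["HOUR"] = mta["DATETIME"].dt.hour

# The first reading of each turnstile has no previous value, so drop it,
# then total the traffic per station, day and period.
station_hourly = (
    mta.dropna(subset=["TRAFFIC"])
       .groupby(["STATION", "DAY", "HOUR"], as_index=False)["TRAFFIC"]
       .sum()
)
```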
Figure 12: 5 Stations Traffic By DayHours
As the last step, we visualized the above dataframe with a heatmap to interpret the results more clearly.
Figure 13: Stations DayHour Heatmap
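A heatmap like this is usually drawn from a pivoted table with stations as rows and time periods as columns. The sketch below builds such a table from made-up numbers and leaves the plotting call as a comment, since it needs seaborn and matplotlib installed. The colormap and annotation settings are our own choices.

```python
import pandas as pd

# Made-up period totals for two of the stations.
station_hourly = pd.DataFrame({
    "STATION": ["86 ST"] * 4 + ["125 ST"] * 4,
    "HOUR": [7, 11, 15, 19] * 2,
    "TRAFFIC": [300, 200, 250, 400, 150, 180, 220, 260],
})

# Rows are stations, columns are audit hours, cells are total traffic.
heatmap_data = station_hourly.pivot_table(
    index="STATION", columns="HOUR", values="TRAFFIC", aggfunc="sum"
)

# The busiest period for each station, read straight off the table.
busiest_hour = heatmap_data.idxmax(axis=1)

# Optional plotting step:
# import seaborn as sns
# sns.heatmap(heatmap_data, annot=True, fmt=".0f", cmap="YlOrRd")
```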
5. Recommendations & Future Works
After the data cleaning, processing, exploratory analysis, and feature engineering process, I have recommended the following strategy to business development and street teams.
Figure 14: Recommendation Table for Business
We can list the following steps for future work:
Use the 2020 census data when it is published
Look at the education and unemployment levels of the people around the stations
Search for companies that empower women in tech fields
6. Conclusion
This is the end of the article, in which I tried to explain in detail the first project of our Data Science Bootcamp. As a reminder, you can visit the project’s GitHub repository from here. Hope to see you in my next article…