- Project Overview
- About Dataset
- Goal of the Project
- Data Sourcing
- Tools
- Data Cleaning/Preparation
- Exploratory Data Analysis
- Data Analysis
- Results/Findings
- Recommendation
- Limitation
This project focuses on cleaning and transforming data to ensure its quality, using a dataset that covers 120 years (1896 to 2016) of Olympic Games with information about athletes and medal results.
We focused on practicing the summary statistics and data visualization techniques that we learned in the Udemy course.
In general, this dataset is popular for exploring how the Olympics have evolved over time, including the participation and performance of different genders and countries across various sports and events.
- Examine/clean the dataset
- Explore distributions of single numerical and categorical features via statistics and plots
- Explore relationships of multiple features via statistics and plots
Check out the original source if you are interested in using this data for other purposes. Download Here
- Python: data cleaning, pivot_table, and visualization. Download Here
In the initial stage of data cleaning, we performed the following tasks:
- Data Loading and Inspection
- Handling and treating missing values
- Handling the outliers
- Pivot Table for summarization
- Seaborn & Matplotlib for Data Visualization
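This README does not record exactly how the missing values and outliers were treated, so the following is only a minimal sketch of the steps listed above, run on a made-up sample rather than the real file. The group-wise median fill, the 1.5 × IQR rule, and the `Age_outlier` column name are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Olympics data; column names follow those used in this README.
olympics = pd.DataFrame({
    "Sex":    ["F", "F", "F", "M", "M", "M"],
    "Age":    [22.0, np.nan, 25.0, 27.0, 30.0, 96.0],
    "Height": [168.0, 170.0, np.nan, 178.0, 180.0, 176.0],
    "Weight": [60.0, 62.0, 61.0, np.nan, 75.0, 74.0],
})

# Inspection: count missing values per column.
missing_before = olympics.isna().sum()

# Missing values (assumed strategy): fill each numeric column
# with the median of the athlete's Sex group.
numeric_cols = ["Age", "Height", "Weight"]
cleaned = olympics.copy()
cleaned[numeric_cols] = cleaned.groupby("Sex")[numeric_cols].transform(
    lambda s: s.fillna(s.median())
)

# Outliers (assumed rule): flag ages further than 1.5 * IQR from the quartiles.
# Rows are flagged rather than dropped so extreme but genuine ages are kept.
q1, q3 = cleaned["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned["Age_outlier"] = ~cleaned["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(missing_before)
print(cleaned)
```

Flagging instead of deleting is a deliberate choice here: very young or very old athletes may be real records, so it seems safer to review them before removing anything.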
- What are the average Age, Height, and Weight of female versus male Olympic athletes?
olympics.pivot_table(index='Sex', values=['Age', 'Height', 'Weight'], aggfunc='mean')
- What are the minimum, average, and maximum Age, Height, and Weight of athletes in each Year?
olympics.pivot_table(index='Year', values=['Age', 'Height', 'Weight'], aggfunc=['min', 'mean', 'max'])
- What are the minimum, average, median, and maximum Age of athletes for different Season and Sex combinations?
olympics.pivot_table(index=['Season', 'Sex'], values='Age', aggfunc=['min', 'mean', 'median', 'max'])
- What are the average Age of athletes and the numbers of unique Team, Sport, and Event values for different Season and Sex combinations?
olympics.pivot_table(index=['Season', 'Sex'], values=['Age', 'Team', 'Sport', 'Event'], aggfunc={'Age': 'mean', 'Team': 'nunique', 'Sport': 'nunique', 'Event': 'nunique'})
- Is there any relationship between Weight and Height when split by Sex?
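For the last question, one possible check is the Pearson correlation between Height and Weight computed separately for each Sex. The sketch below uses made-up values, so the printed numbers are not results from the dataset.

```python
import pandas as pd

# Made-up sample; column names follow those used in this README.
olympics = pd.DataFrame({
    "Sex":    ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Height": [160.0, 165.0, 170.0, 175.0, 170.0, 176.0, 182.0, 190.0],
    "Weight": [52.0, 58.0, 61.0, 68.0, 65.0, 72.0, 80.0, 88.0],
})

# Pearson correlation between Height and Weight, one value per Sex group.
corr_by_sex = pd.Series({
    sex: group["Height"].corr(group["Weight"])
    for sex, group in olympics.groupby("Sex")
})

print(corr_by_sex)
```

A visual counterpart would be a scatter plot such as `sns.scatterplot(data=olympics, x='Height', y='Weight', hue='Sex')`.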
Some interesting code/features we worked with:
import seaborn as sns
sns.catplot(data=olympics, y='Year', hue='Sex', kind='count', col='Season')
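The count plot above draws one bar per Year and Sex within each Season. As a rough sketch, the same counts can be tabulated with `pd.crosstab`, which makes the numbers behind the bars easy to inspect. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up sample; column names follow those used in this README.
olympics = pd.DataFrame({
    "Year":   [2012, 2012, 2012, 2014, 2014, 2016, 2016, 2016],
    "Season": ["Summer", "Summer", "Summer", "Winter", "Winter",
               "Summer", "Summer", "Summer"],
    "Sex":    ["M", "M", "F", "M", "F", "M", "F", "F"],
})

# Rows: (Season, Year); columns: Sex; cells: number of athlete entries.
counts = pd.crosstab(
    index=[olympics["Season"], olympics["Year"]],
    columns=olympics["Sex"],
)

print(counts)
```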
This analysis examined various aspects of athletes participating in the Olympics. Here are the key findings:
1. Average Athlete Demographics:
   There's a noticeable difference in average age, height, and weight between male (M) and female (F) athletes:
   - Female: Age: 23.75, Height: 168.53 cm, Weight: 61.01 kg
   - Male: Age: 26.30, Height: 177.67 cm, Weight: 74.54 kg
2. Age Distribution:
   The youngest athlete recorded was 10 years old (1896), and the oldest was 96 (1932).
3. Minimum Age by Season and Gender:
   Interestingly, both Summer and Winter seasons have the same minimum age (11.0 years) for female athletes.
4. Missing Data:
   The dataset contains some events with missing information on sports and teams.
5. Minimum Age by Gender and Medal:
   The minimum age for female medalists ranges from 24 to 25 years old. Male medalists' minimum age falls within the range of 23 to 26 years old.
6. Correlation Between Height and Weight:
   A positive correlation exists between athletes' height and weight, indicating taller athletes tend to be heavier (based on gender).
7. Gender Participation Imbalance:
   The dataset reveals a significantly higher number of male athletes compared to females.
8. Participation Trends:
   Both male and female athlete participation has increased over time, with a more significant rise observed in the Summer Olympics compared to Winter Olympics.

This analysis provides valuable insights into the demographics and participation patterns of athletes in the Olympic Games. Further investigation could explore specific sports, medal distributions, or trends across different regions.
Addressing Gender Imbalance (Finding 7):
- Encourage and actively support participation of young girls in sports programs at early ages to bridge the participation gap.
- Implement initiatives to raise awareness about gender equity in sports and combat social biases that discourage female participation.
- Consider establishing specific quota systems for female athletes in certain sports, ensuring their representation.
Improving Data Quality (Finding 4):
- Collaborate with Olympic data collectors and organizations to implement stricter data collection procedures to minimize missing information.
- Investigate methods to impute missing data responsibly and ethically, considering potential biases and limitations.
- Deep-dive into specific sports to identify factors influencing participation trends, analyzing variations in age, gender, and other demographics.
- Analyze the relationship between athletes' physical attributes and their performance in different sports, considering various factors like training regimes and technical skills.
- Explore the correlation between participation trends and socio-economic factors across different regions, identifying potential inequalities and opportunities for promoting broader athlete representation.
These recommendations aim to address the identified issues, promote gender equality, improve data quality, and foster further research for deeper understanding of athlete participation patterns and trends in the Olympics. By implementing these suggestions, the Olympic Games can strive towards greater inclusivity, fairness, and representation for all athletes.
- The analysis is limited by the specific data provided, which may not encompass all Olympic athletes across all years and sports. This could introduce biases or skew the findings towards certain demographics or events.
- The analysis focuses solely on average demographics and minimum age, overlooking the full distribution of these variables across different age groups and sports.
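One possible way to look past averages and minimums is to compute several quantiles per group. The sketch below does this for Age by Sex on made-up values, so the output is illustrative only and not a result from the dataset.

```python
import pandas as pd

# Made-up sample; column names follow those used in this README.
olympics = pd.DataFrame({
    "Sex": ["F"] * 5 + ["M"] * 5,
    "Age": [15.0, 20.0, 23.0, 26.0, 40.0, 18.0, 22.0, 26.0, 30.0, 55.0],
})

# One row per Sex, one column per quantile of Age.
age_quantiles = (
    olympics.groupby("Sex")["Age"]
    .quantile([0.05, 0.25, 0.5, 0.75, 0.95])
    .unstack()
)

print(age_quantiles)
```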