Chapter 2 Data sources
This project focuses on the topic of the Olympics. Our data are derived from Sports Viz Sunday, which is a platform for people who are interested in sport and exploring the sports data and visualization in the world.
The following options were chosen for the datasets:
Data on medals & countries from the Summer and Winter Olympics dating back to 1896, with two main sheets of ‘Olympics’, ‘Olympics Athletes and Event’ on medal events and info of each athlete.
Data on Winter Olympics, with three main sheets of ‘Winter Olympic Medals’, ‘Alpine Skiing’ and ‘Winter Olympic Medal History’.
Data on Paralympics, with two main sheets of ‘ParalymicsData’ and ‘UKparalymics’
Since the goal is mainly focused on evaluating the evolution of Olympic Games by different factors and aspects over the years, the ‘Olympics Athletes and Event’ dataset was chosen. It not only includes data of the summer and winter Olympics, but also has specific athlete information, such as their age, weight and height. Based on these variables, we can analyze whether their age, weight and height are related to the competition results, while other dataset failed to provide the information needed. Moreover, since the 2022 Winter Olympics was held this year, The Alpine Skiing,one of the interesting winter Olympics events, was investigated. THerefore, the ‘Alpine Skiing’ dataset was also selected.
Here are some basic information about the dataset ‘Olympic Athletes and Events’:
- Brief introduction about this dataset:
This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. It includes not only the year and city of each Olympics games, but also each athlete’s basic information.
- Types of variables
Categorical Variables: ID, Name, Sex, Team, NOC, Games, Year, Season, City, Sport, Event, and medal Numerical Variables: Age, Height, Weight
- Number of records:
271,116 Rows, 15 Columns
Here are some basic information about our dataset ‘alpine skiing’:
- Brief introduction about this dataset:
This is a dataset about the Alpine skiing Event in the Winter Olympic Games, including all the Alpine skiing Games from Athens 1896 to Rio 2016. The dominant information we would use in this dataset is Event, Season, Gender, and Nation, with many missing values in time-seconds related terms. Still, luckily we only won’t pay much attention to these time-related columns.
- Types of variables
Categorical Variables: Race-ID, Codex, Event, Place, Place_Country, Place_region, Date, Season, Athlete, Nation, Birthyear, Gender, Bib(starter_no), Rank, Time_total_datetime, Difference_Total_datetime, Difference_Total_seconds, Difference_R2_datetime, Difference_R2_seconds, Difference_R1_datetime, Difference_R1_seconds, Time_R2_seconds, Time_R1_datetime, Time_R1_seconds, FIS Points, WC Points
Numerical Variables: Gender, Time_total_seconds, Latitude, Longitude
- Number of records:
118,584 Rows, 30 Columns
Notation about ‘Olympic Athletes and Events’ dataset:
A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered. However, the Winter and Summer Games were held in the same year up until 1992. After 1992, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on.
Between 1912-1920, and 1936-1948, there are two long periods without any Games, corresponding to the first World War and the Second World War.