Kona Run Data Analysis Part 1
Background
Entalpi, the company that works with Gustav Iden and Kristian Blummenfelt, recently released some raw data from the run at Kona 2022.
I am going to take a basic look at some of the released data in a series of Jupyter Notebooks as a way of furthering my skills.
The data and more info can be found on their GitHub here: https://github.com/entalpi-no/kona-2022
Import necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
from math import cos, asin, sqrt, pi
Read CSVs
gustav = pd.read_csv('gustav.csv')
kristian = pd.read_csv('kristian.csv')
View the data
kristian.head()
| | datetime | latitude | longitude | speed | elevation | heartrate | cadence | core_temperature | skin_temperature | stride_length |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-10-08 21:27:05+00:00 | 19.639484 | -155.997351 | 0.000 | 9.6 | 138.0 | 0.0 | 38.860001 | NaN | NaN |
1 | 2022-10-08 21:27:06+00:00 | 19.639443 | -155.997342 | 0.000 | 9.4 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN |
2 | 2022-10-08 21:27:07+00:00 | 19.639391 | -155.997383 | 0.000 | 9.4 | 138.0 | 0.0 | 38.860001 | 34.200001 | NaN |
3 | 2022-10-08 21:27:08+00:00 | 19.639349 | -155.997340 | 0.000 | 9.4 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN |
4 | 2022-10-08 21:27:09+00:00 | 19.639317 | -155.997337 | 1.148 | 9.2 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN |
Parse dates
# infer_datetime_format is deprecated since pandas 2.0; the format is now inferred by default
kristian['dateparsed'] = pd.to_datetime(kristian['datetime'])
gustav['dateparsed'] = pd.to_datetime(gustav['datetime'])
Sample graph
sns.lineplot(x=kristian['dateparsed'], y=kristian['speed'])
<AxesSubplot:xlabel='dateparsed', ylabel='speed'>
We can see from the above graph that the watch was started shortly before the run and stopped quite a while after the race ended. To keep only the data we want, we will calculate the distance from the start and select only the rows where the total distance is less than ~43 kilometers.
An alternative is to select only the moments when speed is above 2, which looks like it would cover the same scenario. But since we want data points in relation to distance anyway, the distance calculation serves both purposes.
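A minimal sketch of that speed-based alternative, using a tiny made-up frame instead of the real files. The threshold of 2 comes from the text above; the name `SPEED_THRESHOLD` and the sample values are only for illustration.

```python
import pandas as pd

# Made-up speeds standing in for the real 'speed' column:
# idle, running with one slow sample, idle again
sample = pd.DataFrame({'speed': [0.0, 0.0, 3.1, 3.4, 1.5, 3.2, 0.0, 0.0]})

# Boolean mask: True where the recorded speed exceeds the threshold
SPEED_THRESHOLD = 2
moving = sample[sample['speed'] > SPEED_THRESHOLD]
print(moving)
```

Note that a mask like this also drops any slow sample in the middle of the run (index 4 here), which a cut on total distance does not.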
Prepare to calculate distance run
In order to calculate the distance we will use the haversine formula. We need the current latitude/longitude point and the previous one in the same row, so we will use the pandas shift function to accomplish this on both data sets.
kristian['shiftlat'] = kristian['latitude'].shift(periods=1)
kristian['shiftlong'] = kristian['longitude'].shift(periods=1)
gustav['shiftlat'] = gustav['latitude'].shift(periods=1)
gustav['shiftlong'] = gustav['longitude'].shift(periods=1)
# view the sample data
kristian.head()
| | datetime | latitude | longitude | speed | elevation | heartrate | cadence | core_temperature | skin_temperature | stride_length | shiftlat | shiftlong | dateparsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-10-08 21:27:05+00:00 | 19.639484 | -155.997351 | 0.000 | 9.6 | 138.0 | 0.0 | 38.860001 | NaN | NaN | NaN | NaN | 2022-10-08 21:27:05+00:00 |
1 | 2022-10-08 21:27:06+00:00 | 19.639443 | -155.997342 | 0.000 | 9.4 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639484 | -155.997351 | 2022-10-08 21:27:06+00:00 |
2 | 2022-10-08 21:27:07+00:00 | 19.639391 | -155.997383 | 0.000 | 9.4 | 138.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639443 | -155.997342 | 2022-10-08 21:27:07+00:00 |
3 | 2022-10-08 21:27:08+00:00 | 19.639349 | -155.997340 | 0.000 | 9.4 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639391 | -155.997383 | 2022-10-08 21:27:08+00:00 |
4 | 2022-10-08 21:27:09+00:00 | 19.639317 | -155.997337 | 1.148 | 9.2 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639349 | -155.997340 | 2022-10-08 21:27:09+00:00 |
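To see what the shift is doing in isolation, here is a small sketch on three made-up latitude readings (the values are invented, not taken from the data set). Each row ends up holding its own value next to the previous row's value, and the first row has no predecessor, so it gets NaN.

```python
import pandas as pd

# Three made-up latitude readings
lat = pd.Series([19.6394, 19.6395, 19.6397], name='latitude')

pairs = pd.DataFrame({
    'latitude': lat,
    'shiftlat': lat.shift(periods=1),  # previous row's value; the first row has none
})
print(pairs)
```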
Create a distance formula and apply it to both data sets
I took this straight from Stack Overflow.
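For reference, the function that follows is the haversine formula written with cosines. With latitudes \varphi and longitudes \lambda in radians and R the Earth's radius, the usual form is:

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
```

The code relies on the identity \sin^2(x/2) = (1 - \cos x)/2, which is where the `0.5 - cos(...)/2` and `(1 - cos(...))/2` terms come from. The constant 12742 is 2R in kilometers for R = 6371 km, so the result is in kilometers.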
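def distance(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    p = pi/180  # degrees to radians
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    return 12742 * asin(sqrt(a))  # 12742 km = 2 * Earth's mean radius (6371 km)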
kristian['distance'] = kristian.apply(lambda row: distance(row['shiftlat'], row['shiftlong'], row['latitude'], row['longitude']), axis=1)
gustav['distance'] = gustav.apply(lambda row: distance(row['shiftlat'], row['shiftlong'], row['latitude'], row['longitude']), axis=1)
kristian.head()
| | datetime | latitude | longitude | speed | elevation | heartrate | cadence | core_temperature | skin_temperature | stride_length | shiftlat | shiftlong | dateparsed | distance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-10-08 21:27:05+00:00 | 19.639484 | -155.997351 | 0.000 | 9.6 | 138.0 | 0.0 | 38.860001 | NaN | NaN | NaN | NaN | 2022-10-08 21:27:05+00:00 | NaN |
1 | 2022-10-08 21:27:06+00:00 | 19.639443 | -155.997342 | 0.000 | 9.4 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639484 | -155.997351 | 2022-10-08 21:27:06+00:00 | 0.004745 |
2 | 2022-10-08 21:27:07+00:00 | 19.639391 | -155.997383 | 0.000 | 9.4 | 138.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639443 | -155.997342 | 2022-10-08 21:27:07+00:00 | 0.007174 |
3 | 2022-10-08 21:27:08+00:00 | 19.639349 | -155.997340 | 0.000 | 9.4 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639391 | -155.997383 | 2022-10-08 21:27:08+00:00 | 0.006531 |
4 | 2022-10-08 21:27:09+00:00 | 19.639317 | -155.997337 | 1.148 | 9.2 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639349 | -155.997340 | 2022-10-08 21:27:09+00:00 | 0.003554 |
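The row-wise apply works, but the same formula can also be written with NumPy so that it runs on whole columns at once. This is only a sketch of that variation, not the method used in this notebook; the function name `haversine_km` and the constant name are my own.

```python
import numpy as np

EARTH_DIAMETER_KM = 12742  # same constant as in the function above (2 * 6371 km)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, NumPy arrays or pandas Series (degrees)."""
    p = np.pi / 180  # degrees to radians
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return EARTH_DIAMETER_KM * np.arcsin(np.sqrt(a))

# First two GPS fixes from Kristian's table above
d = haversine_km(19.639484, -155.997351, 19.639443, -155.997342)
print(float(d))

# With the frames from this notebook, the apply could be replaced by something like:
# kristian['distance'] = haversine_km(kristian['shiftlat'], kristian['shiftlong'],
#                                     kristian['latitude'], kristian['longitude'])
```

The two fixes come out roughly 4.7 meters apart, in line with the first distance in the table. The coordinates displayed in the table are rounded, so the digits will not match exactly.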
View total distance
kristian['distance'].sum()
43.43302431102427
Not bad. Looks like we have 43.433 kilometers. A marathon is 42.195 kilometers, and Kristian's own Strava workout shows 43.098 kilometers.
Add a column for cumulative distance
kristian['totaldistance'] = kristian['distance'].cumsum()
gustav['totaldistance'] = gustav['distance'].cumsum()
kristian.head()
| | datetime | latitude | longitude | speed | elevation | heartrate | cadence | core_temperature | skin_temperature | stride_length | shiftlat | shiftlong | dateparsed | distance | totaldistance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-10-08 21:27:05+00:00 | 19.639484 | -155.997351 | 0.000 | 9.6 | 138.0 | 0.0 | 38.860001 | NaN | NaN | NaN | NaN | 2022-10-08 21:27:05+00:00 | NaN | NaN |
1 | 2022-10-08 21:27:06+00:00 | 19.639443 | -155.997342 | 0.000 | 9.4 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639484 | -155.997351 | 2022-10-08 21:27:06+00:00 | 0.004745 | 0.004745 |
2 | 2022-10-08 21:27:07+00:00 | 19.639391 | -155.997383 | 0.000 | 9.4 | 138.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639443 | -155.997342 | 2022-10-08 21:27:07+00:00 | 0.007174 | 0.011919 |
3 | 2022-10-08 21:27:08+00:00 | 19.639349 | -155.997340 | 0.000 | 9.4 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639391 | -155.997383 | 2022-10-08 21:27:08+00:00 | 0.006531 | 0.018450 |
4 | 2022-10-08 21:27:09+00:00 | 19.639317 | -155.997337 | 1.148 | 9.2 | 139.0 | 0.0 | 38.860001 | 34.200001 | NaN | 19.639349 | -155.997340 | 2022-10-08 21:27:09+00:00 | 0.003554 | 0.022004 |
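The NaN in the first row is expected: cumsum skips missing values when adding up, but leaves the missing row itself as NaN. A small sketch on made-up distances shows this, along with an optional variant (my own addition, not used in this notebook) that starts the total at 0 instead.

```python
import numpy as np
import pandas as pd

# Made-up per-sample distances in km; the first is NaN because it has no previous point
distance = pd.Series([np.nan, 0.005, 0.007, 0.0, 0.006])

running_total = distance.cumsum()        # NaN is skipped in the sum but stays NaN in its own row
from_zero = distance.fillna(0).cumsum()  # optional: start the total at 0 instead

print(pd.DataFrame({'distance': distance,
                    'cumsum': running_total,
                    'from_zero': from_zero}))
```

Rows 2 and 3 end up with the same cumulative value because no distance was covered between them, which is exactly the pattern used in the next step to spot samples recorded while standing still.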
Remove extra timestamps
As mentioned before, to remove the extra timestamps we are going to look for samples where the cumulative distance did not increase. This has a downside: if the athlete had stopped during the run, we would lose those samples as well. However, the speed graph from earlier shows that this does not happen during the run itself.
Later we will use the other method of selecting only speeds above 2 and see how the two selection methods compare.
First we will build a series that counts how many samples share each cumulative distance.
vals = kristian['totaldistance'].value_counts()
vals[vals > 2].sort_index()
0.263526       12
0.265077        3
43.049955       4
43.057251      13
43.084558       9
43.089010      19
43.089105       4
43.090711       8
43.091035       8
43.096696      16
43.101549       4
43.103504      12
43.110125      10
43.110393       3
43.113543       3
43.126005      10
43.126706      22
43.260007      21
43.260102       8
43.260367       5
43.291533      12
43.293714      15
43.301594      11
43.302578       8
43.302712       3
43.303331       3
43.303420       5
43.303604       9
43.306179      15
43.315430      16
43.315564      10
43.315653       3
43.315743       9
43.316295       6
43.317058      13
43.317242       9
43.317332       3
43.317812       6
43.318231       7
43.321947       3
43.322077       3
43.322166       5
43.322435      10
43.323163       3
43.323252       6
43.323482       5
43.325534      26
43.325623       3
43.326070    1183
43.326160       3
43.422613      11
43.422708       5
43.422797       4
43.425766       3
43.426207       3
43.433024     163
Name: totaldistance, dtype: int64
Here, listing every cumulative distance that is shared by more than 2 samples, we can see that from 43.049955 kilometers onward he was picking up a lot of samples while barely moving.
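The cutoff can also be picked programmatically rather than read off the listing. The sketch below uses made-up numbers, and both the helper name `find_cutoff` and the 42 km floor (there to skip brief early pauses such as the one near 0.26 km) are my own assumptions, not part of the original analysis.

```python
import pandas as pd

def find_cutoff(total_distance, min_repeats=2, floor_km=42.0):
    """Smallest cumulative distance beyond `floor_km` shared by more than `min_repeats` samples."""
    counts = total_distance.value_counts()          # NaN values are ignored by default
    repeated = counts[counts > min_repeats].index   # distances recorded more than min_repeats times
    late = repeated[repeated > floor_km]            # ignore pauses early in the run
    return late.min() if len(late) else None

# Made-up cumulative distances: a short pause early on, then idling after the finish
total = pd.Series([0.2, 0.2, 0.2, 10.0, 30.0, 42.9, 43.05, 43.05, 43.05, 43.05, 43.3])
cutoff = find_cutoff(total)
running = total[total < cutoff]
print(cutoff, len(running))
```

Applied to the listing above, the same rule would pick 43.049955, since that is the first repeated distance past 42 kilometers.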
For Gustav this is not necessary as his data is a lot more concise.
sns.lineplot(x=gustav['dateparsed'], y=gustav['speed'])
<AxesSubplot:xlabel='dateparsed', ylabel='speed'>
Finally, we keep only Kristian's running data by selecting the rows below the 43.049955 kilometer cutoff.
k_running = kristian.loc[kristian['totaldistance'] < 43.049955]
sns.lineplot(x=k_running['dateparsed'], y=k_running['speed'])
<AxesSubplot:xlabel='dateparsed', ylabel='speed'>