
Curling Shot by Shot Analysis


I had previously written curling off as a boring sport, but the stones aligned and it piqued my interest a couple of times this year. It is still quite a slow game, but that slowness lets you study the position and make your own 'armchair skip' calls while everybody stands around looking at the rocks. This new interest, along with the fact that curling is essentially a turn-based game, led me to explore it through the data it generates.

I was able to find an article written by Jordan Myslik on his personal website showing that someone else had had a similar idea. I avoided fully reading the article before conducting my project, but I did use his information on where to access the World Curling Federation (WCF) data, as well as the idea of parsing the PDF documents as XML. Reading the article afterwards, it seems that our methods were very similar, likely because there is a fairly 'clear' way to retrieve and parse this data, but there is no doubt that my work is at the very least inspired by Jordan's.

The general process of this project is outlined below; all code is available on GitHub.

Rough steps of the process:

2021-01-18 Update:

The method I used to determine the hammer colour was flawed to the point of being useless.

Gathering Data

Finding the Documents


The data used in Jordan Myslik's article, as well as my own, is available from the World Curling Federation (WCF) via a path that can't be intentional: https://odf2.worldcurling.co/data/. Nonetheless, it holds directories containing results and other data for a number (all? most?) of tournaments from 2016 to 2019. Ideally I would love to get the data up to the present day, but it is not available from this source.

The data of most interest is the 'Shot by Shot' documents stored deep within the nested directory structure. These documents contain representations of the positions of all of the stones after each throw (henceforth referred to as frames or positions). These shot-by-shot summaries are not available for every match, but in the end I was able to gather 2168 of them, 2165 of which were useful (1 was corrupt and the other 2 were forfeited matches).

Aside: it turns out that these shot-by-shot documents, and much of the other data, are generated and made available by CURLIT Curling Information Technology. They are the official results provider for Olympic curling and many other top tournaments, but unfortunately I have not found a way to gather the data directly from this source (though I haven't tried particularly hard). They provide their own CURLIT Coach Tool that uses all of their own data, and this is what people should actually be using if they are interested in curling analysis.

The trickiest part of retrieving this data was sifting through the documents to find the correct directory/path structure. The repository is clearly used for more than just storing these PDF documents, so there are plenty of other files to work around.

After some searching, and checking that the structure was consistent, I found the following path structure:

/data/<event>/<match_type>/<session>/<...>Game - Shot_by_Shot<...>.pdf
      

where:

Within the session directories are the summary documents I am looking for, all containing the string "Game - Shot_by_Shot", though I narrowed the match to "Shot_by_Shot" to avoid any silliness that might arise from matching on the spaces. With this, I wrote a function to navigate the directory structure and find the paths of all files matching my query. It loosely follows the pseudocode below.

for each directory in /data/
  for each sub-directory matching one of the match types (Men's Teams, ...)
    for each sub-directory
      if name contains 'Shot_by_Shot'
        add path to list
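The filtering core of that search can be sketched in Python, with the HTTP crawling factored out. This is a minimal sketch: `MATCH_TYPES` and the example filenames are illustrative, not taken from the actual repository listing.

```python
# Hypothetical match-type directory names; the real listing may differ.
MATCH_TYPES = ("Men's Teams", "Women's Teams", "Mixed Doubles")

def find_shot_by_shot(paths):
    """Keep only PDF paths under a known match type whose name contains
    the 'Shot_by_Shot' marker (matching the underscored form to avoid
    any space-related silliness)."""
    return [
        p for p in paths
        if p.endswith(".pdf")
        and "Shot_by_Shot" in p
        and any(mt in p for mt in MATCH_TYPES)
    ]

paths = [
    "/data/WMCC 2016/Men's Teams/Session 1/Game - Shot_by_Shot.pdf",
    "/data/WMCC 2016/Men's Teams/Session 1/Game - Line_Score.pdf",
]
print(find_shot_by_shot(paths))
```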
      

Splitting the searching and the downloading of documents into separate steps made the process simpler to test, and added a layer of protection in case a download failed halfway through, making it easier to pick up where it left off.

Downloading the Data


This process is quite simple: it just loops through the list of paths saved in the previous step and requests each indicated document. I introduced a bit of a delay so as not to DoS their servers, and after an hour I had all of the PDFs. As previously mentioned, one PDF document was corrupted; I went back to find the problem, but it turns out the original file itself is corrupt and there is nothing I can do about it, so I simply blacklisted it during parsing.

for each path in list:
  if file already downloaded:
    continue
  download file
  save file
  sleep(delay)
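The download loop above can be sketched as follows. This is an assumption-laden sketch, not the project's exact code: the `fetch` parameter is injectable purely so the loop can be exercised without touching the network, and `base_url` is whatever prefix the saved paths are relative to.

```python
import time
from pathlib import Path
from urllib.request import urlopen  # stdlib; requests would work equally well

def download_all(paths, base_url, out_dir, delay=2.0, fetch=None):
    """Download each document, skipping files that already exist so a
    failed run can resume where it left off."""
    fetch = fetch or (lambda url: urlopen(url).read())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in paths:
        target = out / Path(path).name
        if target.exists():        # already downloaded: pick up where we left off
            continue
        target.write_bytes(fetch(base_url + path))
        time.sleep(delay)          # politeness delay; don't hammer the server
```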
      

Parsing the Data


An example page of the PDF documents can be seen below. Each page comprises 1 to 16 positions/frames along with some throw/shot information, topped off with a header containing match, team, and scoring information.

Shot by Shot PDF Example

The format of the page is relatively consistent, but the content in the header is regularly shuffled around and there are a few different notations used for the throw/shot information. Additionally:

- the direction of play switches with each end,
- the houses vary in colour,
- some frames/positions are an even number of pixels wide while others are odd,
- the representations of the stones vary in size,
- 'ghost' shells are left behind where a stone has been taken out,
- sometimes there are 'shoot-outs', which have a completely different format,
- Mixed Doubles games have pre-positioned stones that have to be considered,
- Mixed Doubles games use two different methods to lay out the positions,
- you can't assume where the pre-positioned stones are, because they are occasionally offset,
- there is no explicit statement of who has the hammer,
- you cannot assume that there will be a stone in a frame/position,
- you cannot assume that the centre of the house won't be covered,
- the gender of the player is not stated in mixed matches,
- the score isn't always in the same format and isn't always a number, and
- the XML generated from the PDF documents isn't always valid.

The parsing process is a nightmare, and borderline impossible if not for two saving graces: the positioning of the text elements relative to each other is fairly consistent, and the stone colours are always the same shades of red and yellow.

The parsing of the text and other elements will be explored in the following sections.

Parsing Text


Parsing the text was the most annoying part of the whole procedure. Once I was able to get the text information out of the PDF format it was fairly smooth sailing, but getting to that point was the issue. On my way to simply using the same method Jordan Myslik did, I tried borb, pypdf, pdfminer, and PyMuPDF. All of them either failed to parse the document correctly or didn't expose the data I needed.

The method I settled on was using the Python os library to invoke the pdftohtml command-line tool on each PDF document. This is by far the most reliable method I found, but it still has its problems. First, it occasionally just outright fails to create an XML document; this was resolved by adding a single retry on failure, which has been rock solid over thousands of documents. The other issue was that in certain cases the XML it produced was invalid due to unterminated <span> elements. I resolved this by opening the XML files as text, stripping out the erroneous elements, and then parsing them as XML.
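The strip-then-parse repair can be sketched as below. This is a minimal version under the assumption that the span tags carry no information we need; the regex and the tiny broken example are illustrative.

```python
import re
import xml.etree.ElementTree as ET

def parse_pdftohtml_xml(raw: str) -> ET.Element:
    """pdftohtml's XML output occasionally contains unterminated <span>
    elements. Stripping every span tag (open or close) at the text level
    before parsing leaves the <text> elements, which hold the data we
    care about, intact."""
    cleaned = re.sub(r"</?span[^>]*>", "", raw)
    return ET.fromstring(cleaned)

# A tiny invalid example: the <span> is never closed.
broken = '<page number="1"><text top="52" left="637">St. <span class="x">Jakobshalle</text></page>'
root = parse_pdftohtml_xml(broken)
print(root.find("text").text)
```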

Once I had those issues sorted out, all of the text elements can be read as children of their respective page, with attributes indicating the x and y location of the top-left corner of each element. An excerpt can be seen below:

<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
  ...
  <text top="52" left="637" width="100" height="12" font="6">St. Jakobshalle Basel</text>
  <text top="52" left="196" width="86" height="12" font="7">Basel, Switzerland</text>
  <text top="111" left="500" width="238" height="17" font="13">Round Robin Session 1 - Sheet D</text>
  <text top="35" left="446" width="291" height="17" font="19">World Men's Curling Championship 2016</text>
  <text top="160" left="359" width="177" height="20" font="23">Game - Shot by Shot</text>
  <text top="116" left="196" width="81" height="12" font="27">SAT 2 APR 2016</text>
  <text top="129" left="196" width="77" height="12" font="31">Start Time 14:00</text>
  <text top="35" left="196" width="86" height="17" font="32">WMCC 2016</text>
  <text top="227" left="795" width="49" height="17" font="33">End 1</text>
  <text top="227" left="67" width="111" height="17" font="35">GER - Germany</text>
  <text top="227" left="449" width="122" height="17" font="38">SUI - Switzerland</text>
  <text top="228" left="280" width="8" height="17" font="1">0</text>
  <text top="227" left="663" width="8" height="17" font="1">0</text>
  <text top="228" left="310" width="8" height="17" font="1">0</text>
  <text top="227" left="693" width="8" height="17" font="1">1</text>
  <text top="228" left="295" width="9" height="17" font="1">+</text>
  <text top="227" left="678" width="9" height="17" font="1">+</text>
  <text top="254" left="46" width="8" height="15" font="3">1</text>
  <text top="501" left="166" width="5" height="10" font="4">4</text>
  <text top="501" left="148" width="15" height="10" font="4">Out</text>
  <text top="501" left="54" width="21" height="10" font="39">Front</text>
  <text top="488" left="54" width="79" height="10" font="42">SUI: GEMPELER S</text>
  <text top="254" left="183" width="8" height="15" font="3">2</text>
  <text top="501" left="303" width="5" height="10" font="4">4</text>
  <text top="501" left="285" width="15" height="10" font="4">Out</text>
  <text top="501" left="191" width="21" height="10" font="43">Draw</text>
  <text top="488" left="191" width="87" height="10" font="50">GER: SCHWEIZER S</text>
  <text top="254" left="320" width="8" height="15" font="3">3</text>
  <text top="501" left="440" width="5" height="10" font="4">4</text>
  <text top="501" left="422" width="15" height="10" font="4">Out</text>
  <text top="501" left="328" width="78" height="10" font="57">Promotion Take-out</text>
  <text top="488" left="328" width="79" height="10" font="60">SUI: GEMPELER S</text>
  ...
</page>
  <page number="2" position="absolute" top="0" left="0" height="1263" width="892">
          ...
      

The keen-eyed reader might notice that the ordering of the text elements is not particularly intuitive, and upon inspection of multiple XML files it becomes clear that element order is not a reliable way to identify the desired information. The worst case is the header, which contains an assortment of match information including the event, location, date, time, and sheet identifier. Directly under the header is a section with the team names, current match scores, and the number of points scored in the end. Finally, the main body of the PDF is 1-16 sets of player name, throw number, throw type, throw rating, and sometimes the throw spin.

The main technique I used to pick the correct information out of the document was identifying elements with unique characteristics (position or contents) and then using them as a datum to find other elements relative to their position. A summary of the identification methods is below:

The header information is checked on every page so that, even if one page had an anomaly in conversion, it is likely that at least one page converted correctly. Each element of the XML document is checked against the above conditions, and the matching information is saved into lists of tuples which are later converted into dataframes for output. In this case, they are then used as input to a SQLite database.
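The datum technique can be illustrated on a cut-down version of the excerpt above: the 'End N' label acts as the datum, and the team names are found by matching its row (vertical position) rather than relying on element order. The row tolerance and the ' - ' heuristic here are illustrative choices, not the project's exact rules.

```python
import xml.etree.ElementTree as ET

PAGE = """<page number="1">
  <text top="227" left="795">End 1</text>
  <text top="227" left="67">GER - Germany</text>
  <text top="227" left="449">SUI - Switzerland</text>
  <text top="254" left="46">1</text>
</page>"""

def teams_from_end_row(root, tolerance=2):
    elems = [(int(t.get("top")), int(t.get("left")), t.text) for t in root.iter("text")]
    # Locate the datum element: the 'End N' label.
    end_top = next(top for top, _, txt in elems if txt and txt.startswith("End "))
    # Collect same-row elements that look like team names, left to right.
    row = sorted(
        (left, txt) for top, left, txt in elems
        if abs(top - end_top) <= tolerance and " - " in (txt or "")
    )
    return [txt for _, txt in row]

print(teams_from_end_row(ET.fromstring(PAGE)))
```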

Parsing Positions


Going into this project I expected this task to be the most difficult, but once I found the correct method it was relatively straightforward. My first crack at identifying the locations of the circles was OpenCV's HoughCircles, which seemed perfectly reasonable as 'circle' is right there in the name. It took a couple of minutes to get my first 'results', but I then spent hours trying to get it to correctly recognize circles in the colour masks I had created. Every combination of minimum radius, maximum radius, and tuning parameters produced an image that recognized a few of the stones but scattered erroneous circles around the rest of the image. I tried increasing and decreasing the image resolution, as well as parsing only a single position, but none of it produced useful results.

Luckily I was able to fall back on ol' reliable findContours, which returns a list of all sets of connected pixels in an image, known as contours. This was extremely effective, as I could reliably distinguish the stones from the background (ice/house); however, it leaves the actual identification and classification of the circles up to me. I selected only the full-size stones (ignoring the small representations) by filtering on the area of the contour as well as the dimensions of its bounding box.
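The area/bounding-box filter can be sketched as a pure function over contour statistics (in a real OpenCV pipeline these would come from `cv2.contourArea` and `cv2.boundingRect`). The expected radius and tolerance here are hypothetical values; the real ones depend on the render resolution.

```python
import math

def is_full_size_stone(area, w, h, r_px=12, tol=0.25):
    """Decide whether a contour is a full-size stone rather than one of
    the small representations: its bounding box and area must both be
    close to those of a circle of the expected radius. r_px and tol are
    illustrative, not the project's actual thresholds."""
    expected_area = math.pi * r_px ** 2
    expected_side = 2 * r_px
    box_ok = all(abs(s - expected_side) <= tol * expected_side for s in (w, h))
    area_ok = abs(area - expected_area) <= tol * expected_area
    return box_ok and area_ok

# A ~24px-wide blob passes; a ~10px small representation does not.
print(is_full_size_stone(452, 24, 24), is_full_size_stone(78, 10, 10))
```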

The small stone representations do not go unused, though: their locations and count are used to identify the direction of play and which team has the hammer. This is complicated by the different format of Mixed Doubles matches, but that just requires a few extra cases.

The locations of the stones are complicated by the presence of the 'ghost' stones that have been removed from the house, as shown in the images below. These sometimes overlap to form strange shapes, or are interpreted as part of the same contour as a stone. In these cases I simply reject the contour, which results in some missing stones but is likely better than placing stones in incorrect or invalid positions. There is potential for using the stone locations from the previous position to assist in identification, but it becomes very difficult because the stones aren't individually identified. Given a lot more time, I am sure I could sort something out using the previous locations, the type of throw, and the rating of the throw to identify some of the stones.


Another interesting aspect of the position images is that the most recently thrown stone has a thicker border, resulting in a slightly smaller stone fill/contour. I added the contour size information to my database in an update, and can now fairly reliably distinguish the thrown stone from the stones that were already in play.

The individual frames are then identified by finding bounding boxes of a particular size. The exact location of the house within each frame is determined using the direction of play and the size of the frame. The positions of all stones in each frame are then calculated relative to the centre of the house. The result is a list of stones along with their x and y displacement from the centre of the house, their colour, and their size; this is the main output of the image-parsing function.
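The final coordinate conversion can be sketched as below. The sign convention for flipping the axis when the direction of play reverses is an assumption for illustration, not the project's exact one.

```python
def stones_relative_to_house(stones, house_centre, direction):
    """Convert absolute pixel positions within a frame into displacements
    from the house centre, flipping the axes when the direction of play
    reverses so displacements mean the same thing every end.
    (The 'up'/'down' convention here is hypothetical.)"""
    cx, cy = house_centre
    sign = 1 if direction == "up" else -1
    return [
        {"dx": sign * (x - cx), "dy": sign * (y - cy), "colour": colour, "size": size}
        for (x, y, colour, size) in stones
    ]

print(stones_relative_to_house([(110, 95, "red", 450)], (100, 100), "up"))
```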

Parsing Summary


I had a couple of small sets of PDFs for testing that took only a few minutes, but reading the data from all 2165 PDFs takes roughly 4.5 hours, at a rate of 8 documents per minute. This does include the next step of ingesting the data into a SQLite database (2.8% of the total time), but overall I am satisfied with this. It may seem long, but in theory it only has to be done once, and 64% of the duration is spent converting the PDFs to XML and to high-resolution images. I did some testing with lower-resolution conversions, but this significantly impacted the accuracy and completeness of the stone detection due to the small size of the stones.

Populating SQLite Database

Schema


The schema I designed splits the events, matches, and ends in a fairly straightforward manner. I then create a record in the throw table for each throw made in an end, and a separate record for each position in the end. I first considered storing the throw and position data in a single record/table, but that schema is unable to store a position that is not associated with a throw. A pre-positioned set of stones could be stored that way, but it would have null values for all of the throw data and just complicate parsing.
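A schema along these lines might look like the sketch below. Table and column names here are assumptions, not necessarily the project's exact ones; the point is that keeping positions separate from throws is what lets a pre-positioned set of stones exist without a matching throw record.

```python
import sqlite3

SCHEMA = """
CREATE TABLE events    (event_id    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE matches   (match_id    INTEGER PRIMARY KEY,
                        event_id    INTEGER REFERENCES events);
CREATE TABLE ends      (end_id      INTEGER PRIMARY KEY,
                        match_id    INTEGER REFERENCES matches,
                        end_number  INTEGER);
CREATE TABLE throws    (throw_id    INTEGER PRIMARY KEY,
                        end_id      INTEGER REFERENCES ends,
                        throw_type  TEXT, rating INTEGER);
CREATE TABLE positions (position_id INTEGER PRIMARY KEY,
                        end_id      INTEGER REFERENCES ends,
                        throw_id    INTEGER REFERENCES throws);  -- NULL when pre-positioned
CREATE TABLE stones    (stone_id    INTEGER PRIMARY KEY,
                        position_id INTEGER REFERENCES positions,
                        dx REAL, dy REAL, colour TEXT, size REAL);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```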

SQLite Schema

Populating Database


The database is populated in a similarly nested manner to the identification of the PDF documents. It takes the various dataframes and lists from the parsing stage and follows the pseudocode structure below.

for each event:
  Create event record
  for each match:
    Create match record
    for each end:
      Create end record
      for each throw:
        Create player record if required
        Create throw record
      for each position:
        Create position record
        for each stone:
          Create stone record
      

One slightly interesting part of this section is the creation of player records. Since the summary PDFs include Men's, Women's, and Mixed matches, the gender of a player cannot always be determined. It is obvious when parsing a Men's or Women's match, but a Mixed Doubles or Mixed Teams match gives no indication of a player's gender. Using the logic of the pseudocode below, I add and update the player records as they are found: if a player is first seen in a mixed game (with unknown gender) but later appears in a Men's/Women's game, the record is updated with the new data.

if player exists in db:
  if (player gender is now known) and (player record has unknown gender):
    Update player record with gender
else:
  Create player record whether gender is known or not
      

This system mostly ignores the complicated web that is curling teams and just sticks to nationality.
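That gender back-filling logic amounts to an upsert, sketched below against a minimal, illustrative table layout (the real schema and player key are likely different).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT PRIMARY KEY, gender TEXT)")

def upsert_player(conn, name, gender=None):
    """Create the player if unseen; if they already exist with unknown
    gender and a Men's/Women's match has now revealed it, fill it in.
    (Table layout is illustrative, not the project's exact schema.)"""
    row = conn.execute("SELECT gender FROM players WHERE name = ?", (name,)).fetchone()
    if row is None:
        conn.execute("INSERT INTO players VALUES (?, ?)", (name, gender))
    elif gender is not None and row[0] is None:
        conn.execute("UPDATE players SET gender = ? WHERE name = ?", (gender, name))

upsert_player(conn, "GEMPELER S")       # first seen in a mixed match: gender unknown
upsert_player(conn, "GEMPELER S", "M")  # later seen in a Men's match: record updated
```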

Fun with Graphs


With all of the data collected and stored in the database, it is easy to query and sort for analysis. Below are a number of plots I have generated, along with some notes. The Jupyter notebook containing all of these plots, and a few extras, can be found on GitHub.

A lot of the plots are best displayed at an extremely wide aspect ratio, but my website isn't really made for this, so if they are illegibly small you can open the image in a new tab.

Heatmaps

Heatmap of Thrown Stone by Throw Number


Heatmap of Stone Positions by Throw Number


Heatmap of Thrown Stone by Throw Number - Mixed Doubles


Heatmap of Stone Positions by Throw Number - Mixed Doubles


Throw Type Heatmap


Summaries

End by End Summary


Match Summary


Stone Location by Throw Class


Throw Ratings

Rating by Shot Type


Rating by Shot Class


Average Rating by Throw Number


Average Rating by Stone Count


Percentage of Throw Type by Throw Number


Player Rating vs. Throw Count


Assorted plots. If this held more useful and up-to-date information, I might be bothered to create some sort of interactive app/plot to highlight players/teams.



Scores

Team Match Score Difference


Team End Scores