December 5, 2019
In 2018, NASCAR began releasing telemetry data gathered from all racing circuit vehicles to teams across its series, including each car’s speed, RPM, brake position, and temperature. Although every team has access to the data for analysis, teams often receive it as much as a day after it is recorded. The client, a racing organization, sought near real-time access to the data and its potential insights in order to gain a competitive advantage.
According to NASCAR guidelines, cars have a limited amount of practice time they can spend on a track. Not only is this practice time valuable for teams, but the data collected from these practice runs is crucial for improving various aspects of race performance.
Toyota Racing Development (TRD) collects telemetry data emitted by NASCAR vehicles during practices, and by agreement with NASCAR, TRD must share all of this data with all NASCAR competitors. Currently, TRD makes the telemetry data available to racing teams via a public object storage location (an AWS S3 bucket). However, the client wanted to collect the data directly from onboard sensor devices via an existing AWS Kinesis stream managed by TRD.
The client wanted a tool that would deliver insights within a few minutes of data collection to teams’ engineers. The engineers would gain competitive insights almost immediately, allowing them to consult amongst themselves and with drivers to coordinate techniques and mechanical changes during their practice runs instead of at a later time. To gain these insights, the client engaged with Levvel to explore data architecture and analytic options on AWS.
The client chose to engage Levvel based on a need to expand their cloud and big data expertise. The client’s existing technical staff included a number of experienced mechanical engineers with programming and mathematical skills, but it did not have the desired knowledge of vehicle telemetry data ingestion, visualization, and analysis. Levvel was brought on to build an ingestion pipeline and harness it to replicate, in near real-time, some of the analysis the client’s engineers were performing the day after practices and competitions.
Before the engagement, the client was loading race data into an on-premises MySQL database as it arrived in S3 from a Kinesis Firehose, which meant that new data arrived at most once every few minutes. In order to use the telemetry data immediately while cars were practicing on the track, the client set a maximum latency goal of 15 seconds. To achieve this, Levvel first set about getting access to TRD’s source Kinesis streams via the Kinesis Client Library (KCL). However, since the KCL required access to DynamoDB tables in the TRD account that could not be shared, Levvel instead created a custom consumer using the AWS Python SDK, backed by DynamoDB tables in the client’s own AWS account that store checkpoints marking the consumer’s position in the stream. This was also a more appropriate approach for the client, since they had plans to staff more Python talent.
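As a sketch of what such a custom consumer might look like, the boto3-based loop below reads one Kinesis shard and checkpoints its position in a DynamoDB table. The key layout, attribute names, and record-handling callback are illustrative assumptions, not the client’s actual implementation:

```python
def checkpoint_key(stream_name, shard_id):
    # One checkpoint row per (stream, shard) pair in the DynamoDB table.
    return f"{stream_name}#{shard_id}"

def consume_shard(stream_name, shard_id, table_name, handle_record):
    """Read one Kinesis shard, checkpointing progress to DynamoDB."""
    import boto3  # AWS SDK for Python; requires AWS credentials at runtime

    kinesis = boto3.client("kinesis")
    table = boto3.resource("dynamodb").Table(table_name)

    # Resume after the last checkpointed sequence number, if one exists.
    saved = table.get_item(
        Key={"lease_key": checkpoint_key(stream_name, shard_id)}
    ).get("Item")
    if saved:
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=saved["sequence_number"],
        )["ShardIterator"]
    else:
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",  # start from the oldest record
        )["ShardIterator"]

    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=500)
        for record in batch["Records"]:
            handle_record(record["Data"])
        if batch["Records"]:
            # Checkpoint once per batch so a restart resumes where it left off.
            table.put_item(Item={
                "lease_key": checkpoint_key(stream_name, shard_id),
                "sequence_number": batch["Records"][-1]["SequenceNumber"],
            })
        iterator = batch.get("NextShardIterator")
```

This mirrors what the KCL does internally: the DynamoDB row plays the role of the KCL’s lease/checkpoint table, but lives in the client’s account.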
After obtaining the data directly from the source stream, the question became how and where to store it in a way that was more useful than the MySQL database and the S3 dump of raw JSON files. The best option was to funnel the data into a client-owned Kinesis stream, which opened the door to other Kinesis services such as Firehose and Kinesis Analytics. Although Kinesis Analytics does not appear in the final product, the client is now well-poised to build multiple streaming applications on top of all of their competitive data. Owning the data in their own stream had the added advantage of a longer replay window than TRD was willing to provide: Levvel set the stream’s retention period to the maximum so the client could replay races and practices for a full week.
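Extending a stream’s retention window is a single SDK call; a minimal sketch, with a hypothetical stream name (Kinesis retention is configured in hours, so a full week is 168):

```python
def retention_hours(days):
    # Kinesis retention is configured in hours; 7 days -> 168 hours.
    return days * 24

def extend_retention(stream_name, days=7):
    """Raise a stream's retention from the 24-hour default to `days` days."""
    import boto3  # requires AWS credentials at runtime
    boto3.client("kinesis").increase_stream_retention_period(
        StreamName=stream_name,
        RetentionPeriodHours=retention_hours(days),
    )
```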
Another challenge Levvel faced was deciding what to do with the historical data. It spanned the last three years and was distributed across roughly 750,000 small JSON files, which made it prohibitively slow to process and rendered it effectively useless. A Python Spark job was written to consume the entire historical dataset, convert it to Parquet, then partition and write it to a data warehouse where analytic applications could ingest it quickly. Once compacted into a few dozen larger Parquet files, a reasonably sized EMR cluster could read the data into memory in a matter of seconds, whereas reading the raw files with the same cluster would take hours. As an added bonus, the same Spark job could be reused to append new data to the warehouse and repartition it by date.
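A compaction job along these lines might be sketched as follows; the paths, the per-sample timestamp field, and the file-sizing heuristic are assumptions for illustration, not the client’s actual schema:

```python
def target_partitions(total_bytes, target_file_bytes=256 * 1024 * 1024):
    # Aim for roughly 256 MB Parquet files; always at least one partition.
    return max(1, total_bytes // target_file_bytes)

def compact_to_parquet(input_path, output_path, partitions):
    """Rewrite many small JSON files as a few date-partitioned Parquet files."""
    from pyspark.sql import SparkSession, functions as F  # needs a Spark runtime

    spark = SparkSession.builder.appName("telemetry-compaction").getOrCreate()
    df = spark.read.json(input_path)  # e.g. hundreds of thousands of small files
    # Derive a date column from a per-sample timestamp field (assumed name).
    df = df.withColumn("event_date", F.to_date(F.col("timestamp")))
    (df.repartition(partitions)
       .write.mode("append")          # the same job can append new data later
       .partitionBy("event_date")
       .parquet(output_path))
```

Partitioning by date is what lets the same job double as the incremental loader: each new day of data simply lands in a new `event_date=` directory.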
To actually use the data, the client needed an application that read from the Kinesis stream and updated visualizations in near real-time. Levvel chose to build a simple Shiny application in the statistical programming language R, based on the flexibility and simplicity of R’s graphics packages and the skill set and tools already in use by the client.
As data was emitted from the cars on the track, TRD’s Kinesis stream stored it and the Python code, running on EMR, collected it. The Python code wrote each record to both the client’s Kinesis stream and HDFS. The client’s stream was connected to Firehose for delivery to S3 and to Kinesis Analytics for any future streaming applications. Meanwhile, the data was displayed in a number of visualizations in the Shiny app, mirroring the static analysis previously produced by the client’s engineers. Options in the Shiny app allowed users to replay any race or practice retained on the stream from the last seven days, or to watch data arrive in near real-time.
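The fan-out write to the client-owned stream could be sketched as below; the `car_number` partition key and the sample layout are hypothetical, chosen so each car’s records stay ordered within a single shard:

```python
import json

def telemetry_partition_key(sample):
    # Partitioning by car keeps each car's records ordered within one shard.
    return str(sample.get("car_number", "unknown"))

def forward_sample(sample, stream_name):
    """Forward one decoded telemetry sample to the client-owned stream."""
    import boto3  # requires AWS credentials at runtime
    boto3.client("kinesis").put_record(
        StreamName=stream_name,
        Data=json.dumps(sample).encode("utf-8"),
        PartitionKey=telemetry_partition_key(sample),
    )
```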
The client went from day-after analysis of static data strewn across nearly a million JSON files to live visualizations of streaming data refreshed every few seconds. Plenty of work remains to be done, but the depth of analysis now possible is limited only by the client’s priorities rather than by any inability to process data. Queries can be written against the raw streaming data in Python or R, applications can be built with Kinesis Analytics, and the EMR cluster can be resized and tuned to accommodate additional data sources and future analytic uses. Even without the live streaming machinery of Kinesis and EMR, the client can also run ad hoc queries on Athena as needed.
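An ad hoc Athena query against the Parquet warehouse might be issued like this; the database, table, and column names are illustrative, not the client’s actual schema:

```python
def top_speed_query(database, table, event_date):
    # Hypothetical schema: a telemetry table partitioned by event_date.
    return (
        f"SELECT car_number, MAX(speed) AS top_speed "
        f"FROM {database}.{table} "
        f"WHERE event_date = DATE '{event_date}' "
        f"GROUP BY car_number ORDER BY top_speed DESC"
    )

def run_query(sql, output_location):
    """Submit the SQL to Athena; results land in the given S3 location."""
    import boto3  # requires AWS credentials at runtime
    return boto3.client("athena").start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
```

Because Athena reads the partitioned Parquet directly from S3, queries like this run without any cluster standing by, which keeps the cost of occasional analysis low.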
The stakes are high in modern sporting events, and technology is front and center when it comes to earning and maintaining a competitive edge. The client’s new positioning allows them to take full advantage of their data when it is fresh and most valuable, opening the door to almost unlimited analytic possibilities.