Formula 1 AI PitWall

CS 7641 - Summer 25 - Group 4

Analysis: Driver DNA

Using unsupervised machine learning to move beyond lap times and quantify the unique 'fingerprint' of each driver's on-track performance.

Phase 1: High-Level Profiling (The "What" and "Who")

We began by aggregating telemetry from every lap into a set of Key Performance Indicators (KPIs). Using PCA and K-Means clustering, we successfully profiled all drivers into distinct, high-level archetypes such as "Aggressive Brakers," "Smooth Operators," and "Straight-Line Specialists," answering the question of what their overall style is.

Phase 2: Granular Event Analysis (The "How" and "Why")

We then moved from high-level KPIs to the raw telemetry traces. Using time-series clustering on braking events, we identified the canonical "input signatures" for different drivers, revealing the specific V-shaped vs. U-shaped speed profiles that define their technique. This granular analysis answers how drivers achieve their performance.

Phase 3: Codifying Insights (The "Rules")

Finally, we used Association Rule Mining to formalize our findings. By converting telemetry into "if-then" rules, we quantitatively confirmed the driving logic for different circuits and styles. This codifies the why behind on-track decisions and validates the archetypes and signatures discovered in the previous phases.

Understanding telemetry data

Creating Circuit Layouts from GeoJSON and Telemetry - Sanity Check

Limitations of GeoJSON Data

We found that GeoJSON and telemetry data agree on the overall track layout, but the GeoJSON data is sparse and fails to capture the finer detail of a track's turns. Telemetry also provides elevation coordinates, which help us downstream in understanding a track's elevation changes.

Finding Corners with Code: A Clustering Approach

We run DBSCAN on the high-curvature (X, Y) coordinates of the track layout, so the number of turns is discovered automatically rather than specified in advance.
Process:

  1. Calculate turning angle at each coordinate point along the track layout.
  2. Filter for points with high curvature to create a dataset of potential corner locations.
  3. Cluster these corner points using DBSCAN; the number of dense clusters found equals the number of turns.
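
The three steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes NumPy and scikit-learn are available, and the function names (turning_angles, find_corners, demo_track), the angle threshold, and the DBSCAN parameters eps and min_samples are hypothetical choices that would need tuning on real telemetry.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def turning_angles(xy):
    """Absolute heading change (radians) at each interior point of a polyline."""
    segments = np.diff(xy, axis=0)                       # vectors between consecutive points
    heading = np.arctan2(segments[:, 1], segments[:, 0])
    dtheta = np.diff(heading)
    # Wrap into [-pi, pi) so crossing the +/-pi boundary is not read as a sharp turn.
    dtheta = (dtheta + np.pi) % (2 * np.pi) - np.pi
    return np.abs(dtheta)


def find_corners(xy, angle_threshold=0.05, eps=30.0, min_samples=3):
    """Steps 1-3: turning angle, high-curvature filter, DBSCAN clustering.

    angle_threshold (radians), eps (track units) and min_samples are
    assumed values, not taken from the project.
    """
    angles = turning_angles(xy)
    interior = xy[1:-1]                                  # points that have a turning angle
    corner_pts = interior[angles > angle_threshold]
    if len(corner_pts) == 0:
        return corner_pts, np.array([], dtype=int), 0
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(corner_pts)
    n_turns = len(set(labels) - {-1})                    # label -1 marks DBSCAN noise
    return corner_pts, labels, n_turns


def demo_track(straight=400.0, step=10.0):
    """Synthetic rounded square with four 90-degree corners (made-up test data)."""
    pts, pos, heading = [], np.array([0.0, 0.0]), 0.0
    for _ in range(4):
        for _ in range(int(straight // step)):           # straight section
            pos = pos + step * np.array([np.cos(heading), np.sin(heading)])
            pts.append(pos)
        for _ in range(18):                              # 18 steps of 5 degrees = 90 degrees
            heading += np.radians(5)
            pos = pos + 4.0 * np.array([np.cos(heading), np.sin(heading)])
            pts.append(pos)
    return np.array(pts)


corner_pts, labels, n_turns = find_corners(demo_track())
print(n_turns)
```

On the synthetic track, each corner produces a tight group of high-curvature points separated from the next group by a long straight, which is the situation in which a density-based method such as DBSCAN can count turns without being told how many to expect.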

High-Level Profiling (The "What" and "Who")

The first phase of our project was dedicated to answering the most fundamental questions: What are the distinct archetypes of racetracks on the calendar, and who are the drivers that fit into different stylistic profiles? To achieve this, we needed to transform the raw, high-frequency telemetry data containing thousands of data points per lap into a simple, comparable format.

In the plot above, the rookies and newer drivers are classified as "Straight-Line Specialists," while the more experienced drivers fall into the "Balanced Style" cluster. Norris and Piastri, who currently have the best cars on the grid, share one cluster: "Smooth and Consistent." This is in line with what we see in the sport today.

Feature Engineering: Creating a "Performance Fingerprint"

The core of this phase was feature engineering. We aggregated the complex time-series data for each individual lap into a single, fixed-length vector of Key Performance Indicators (KPIs). This process creates a concise "performance fingerprint" that represents the essence of that lap. The key KPIs we engineered include:

  • Power & Aggression Metrics: These quantify how a driver utilizes the car's powertrain and brakes.
    • Throttle_Full_Pct: The percentage of the lap spent at 100% throttle.
    • Braking_Intensity: The peak negative G-force achieved during braking zones.
    • RPM_Avg & RPM_Std_Dev: The mean and standard deviation of engine RPM, indicating usage patterns.
  • Cornering & Handling Metrics: These describe how the car and driver behave through corners.
    • Total_G_Mean: The average combined longitudinal and lateral G-force, showing how consistently the driver operates at the limit of adhesion.
  • Strategy Metrics: These capture elements related to on-track strategy.
    • DRS_Uptime_Pct: The percentage of the lap where the Drag Reduction System was active.
    • Gear_Changes_per_Lap: The total number of gear shifts, often indicating how "busy" a track is.
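
As an illustration of how one lap could be collapsed into such a fingerprint, here is a minimal sketch using only the Python standard library. The input layout (a list of equally spaced samples with the keys throttle, rpm, gear, drs, long_g, and lat_g) is an assumption made for the example and not the project's actual schema; the KPI names follow the list above.

```python
from math import hypot
from statistics import mean, pstdev


def lap_kpis(samples):
    """Aggregate one lap of telemetry samples into a fixed-length KPI dict.

    Each sample is assumed to be a dict with keys: throttle (0-100), rpm,
    gear, drs (bool), long_g and lat_g. Samples are assumed to be equally
    spaced in time, so sample fractions approximate lap-time percentages.
    """
    n = len(samples)
    rpms = [s["rpm"] for s in samples]
    gear_changes = sum(
        1 for prev, cur in zip(samples, samples[1:]) if prev["gear"] != cur["gear"]
    )
    return {
        "Throttle_Full_Pct": 100.0 * sum(s["throttle"] >= 100 for s in samples) / n,
        # Most negative longitudinal g, i.e. the hardest braking moment.
        "Braking_Intensity": min(s["long_g"] for s in samples),
        "RPM_Avg": mean(rpms),
        "RPM_Std_Dev": pstdev(rpms),
        # Magnitude of the combined longitudinal and lateral g vector.
        "Total_G_Mean": mean(hypot(s["long_g"], s["lat_g"]) for s in samples),
        "DRS_Uptime_Pct": 100.0 * sum(bool(s["drs"]) for s in samples) / n,
        "Gear_Changes_per_Lap": gear_changes,
    }


# Tiny made-up lap with four samples, for illustration only.
lap = [
    {"throttle": 100, "rpm": 11000, "gear": 7, "drs": True, "long_g": 0.5, "lat_g": 0.0},
    {"throttle": 100, "rpm": 12000, "gear": 8, "drs": True, "long_g": 0.0, "lat_g": 0.0},
    {"throttle": 0, "rpm": 9000, "gear": 6, "drs": False, "long_g": -4.0, "lat_g": 3.0},
    {"throttle": 50, "rpm": 8000, "gear": 5, "drs": False, "long_g": 0.0, "lat_g": 2.0},
]
fingerprint = lap_kpis(lap)
```

Stacking one such dict per lap gives the KPI matrix that the clustering stage consumes.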

Algorithms: Discovering the Archetypes

With a KPI matrix established for all laps, we employed a two-stage unsupervised learning pipeline to discover the hidden patterns:

  1. Principal Component Analysis (PCA): To handle the complexity of our many KPIs (which are often correlated), we first applied PCA. This technique distilled our features into a few, more meaningful "axes of performance." For example, it might combine throttle, DRS, and RPM metrics into a single "Straight-Line Performance" component. This crucial step reduces noise and makes the subsequent clustering far more robust and interpretable.
  2. K-Means Clustering: Using the simplified principal component scores, we then applied the K-Means algorithm. We ran this process twice:
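    • For Tracks: By aggregating the KPIs per circuit, we clustered the tracks into categories such as "Power & Speed Circuits."
    • For Drivers: By averaging the KPIs for each unique driver over the season, we clustered the drivers. This revealed distinct stylistic profiles, such as "Aggressive Brakers" and "Smooth & Consistent Operators."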
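
A two-stage pipeline of this kind can be sketched with scikit-learn as below. This is an illustrative sketch under stated assumptions, not the project's implementation: the standardization step, the number of components, the number of clusters, and the function name cluster_profiles are all choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cluster_profiles(kpi_matrix, n_components=2, n_clusters=3, seed=0):
    """Scale KPIs, project onto principal components, then run K-Means.

    kpi_matrix has one row per driver (or track) and one column per KPI.
    Scaling is an assumed preprocessing step: without it, large-valued
    KPIs such as RPM would dominate the principal components.
    """
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
    )
    labels = pipeline.fit_predict(kpi_matrix)
    explained = pipeline.named_steps["pca"].explained_variance_ratio_
    return labels, explained


# Made-up KPI rows: three groups of ten "drivers" with distinct profiles.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 0.0], [0.0, 10.0, 10.0]])
kpis = np.vstack([c + rng.normal(0.0, 0.1, size=(10, 3)) for c in centers])
labels, explained = cluster_profiles(kpis)
```

The explained variance ratio indicates how much of the KPI variation the retained components keep, which is a useful check before trusting the clusters built on them.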

The outcome of Phase 1 was a foundational, high-level understanding of our dataset. We successfully transformed millions of raw data points into a clear, quantitative classification of both the circuits and the competitors, setting the stage for the more granular event analysis in Phase 2.

Granular Event Analysis (The "How" and "Why")

While Phase 1 provided a high-level overview of driver and track styles, Phase 2 dives into the raw, high-frequency telemetry to answer the crucial questions of "how" and "why." The goal was to move beyond aggregated KPIs and analyze the specific, millisecond-level events and input shapes that define on-track performance.

Anomaly Detection: Finding the Critical Moments

The first step was to identify moments within a lap that deviated from the norm. These "anomalies" are not necessarily errors; they represent the most complex and information-rich parts of the lap, such as heavy braking zones or moments of instability.

  • Algorithm Used: We employed an Isolation Forest, an efficient algorithm that isolates each point with random splits; points that are separated after only a few splits are treated as outliers. The model learns the normal operating envelope of a car at a specific track and then assigns an anomaly_score to every telemetry timestamp.
  • Feature Engineering: To provide context to these anomalies, we engineered a time_gain_loss feature by comparing each lap to a fast reference lap. This allowed us to see if an anomalous event resulted in a time gain or loss.
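
The two ingredients above might look like the following sketch, assuming scikit-learn's IsolationForest and NumPy. The feature columns, the sign conventions (higher score means more anomalous, positive time_gain_loss means time gained against the reference), and the function names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def anomaly_scores(telemetry, seed=0):
    """Return one anomaly score per telemetry row (higher = more anomalous).

    telemetry is a 2-D array with one row per timestamp and one column per
    channel (for example speed, throttle, brake).
    """
    forest = IsolationForest(n_estimators=200, random_state=seed).fit(telemetry)
    # score_samples is higher for normal points, so negate it.
    return -forest.score_samples(telemetry)


def time_gain_loss(lap_sector_times, reference_sector_times):
    """Per micro-sector delta against a reference lap.

    Positive values mean the lap was faster than the reference in that
    micro-sector (an assumed sign convention).
    """
    return np.asarray(reference_sector_times) - np.asarray(lap_sector_times)


# Made-up data: 200 ordinary samples plus one extreme sample at the end.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 3))
data = np.vstack([normal, [[10.0, 10.0, 10.0]]])
scores = anomaly_scores(data)
delta = time_gain_loss([30.2, 25.0], [30.0, 25.5])
```

Taking the maximum score within each micro-sector and pairing it with that micro-sector's delta yields the two coordinates used in a four-quadrant plot.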

Analysis of the Four-Quadrant Plots

By plotting the maximum anomaly score against the time gained or lost for every micro-sector of a lap, we created an automated performance review. The results for the 2022 Bahrain Grand Prix at Sakhir, where Charles Leclerc won and Daniel Ricciardo struggled, illustrate the technique:

  • Charles Leclerc (Winner): His plot shows a dense cloud of points in the "Consistent Overperformance" and "Brilliance/Opportunity" quadrants. This is the data-driven fingerprint of a driver who was not only consistently faster than his reference but was also successfully pushing the limits to find extra time—the exact profile you'd expect from the race winner.
  • Daniel Ricciardo (Finished 14th): His plot tells the opposite story. The data is heavily concentrated in the "Consistent Underperformance" quadrant, indicating a fundamental lack of pace. Furthermore, his high-anomaly events are skewed towards the "Mistakes/Errors" quadrant, suggesting that when he tried to push, it often resulted in a time loss. This visualizes the narrative of a driver struggling to get comfortable with the car.
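
One way to read the four quadrants is as a simple two-way split on anomaly score and time delta. The sketch below labels each micro-sector accordingly. The mapping of quadrant names to score and delta ranges is inferred from the descriptions above, and the anomaly threshold of 0.6 is a hypothetical value, not one taken from the project.

```python
from collections import Counter


def quadrant(max_anomaly_score, time_gain, anomaly_threshold=0.6):
    """Label one micro-sector by anomaly level and time gained or lost.

    time_gain > 0 is assumed to mean the driver was faster than the
    reference; anomaly_threshold is an assumed cut-off.
    """
    high_anomaly = max_anomaly_score >= anomaly_threshold
    if time_gain >= 0:
        return "Brilliance/Opportunity" if high_anomaly else "Consistent Overperformance"
    return "Mistakes/Errors" if high_anomaly else "Consistent Underperformance"


def quadrant_summary(points):
    """Count micro-sectors per quadrant for (max_anomaly_score, time_gain) pairs."""
    return Counter(quadrant(score, gain) for score, gain in points)


# Made-up micro-sector values for illustration.
micro_sectors = [(0.2, 0.10), (0.8, 0.30), (0.3, -0.20), (0.9, -0.50), (0.1, 0.05)]
summary = quadrant_summary(micro_sectors)
```

Comparing these counts between two drivers gives a compact numeric version of the visual contrast described for Leclerc and Ricciardo.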

Time-Series Clustering: Deconstructing Driver Signatures

To understand the "how" behind different driving styles, we moved beyond single-point anomalies to analyze the *shape* of driver inputs over time. We focused on braking events to identify the canonical "braking signatures" that define a driver's technique.

  • Algorithm Used: Instead of using specialized time-series libraries, we developed a robust method using K-Means Clustering. We first engineered descriptive features for each braking event's speed trace (e.g., initial speed, speed drop, shape metric). We then clustered these features to group the braking events into distinct types.
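
To make the feature step concrete, here is a minimal standard-library sketch that turns one braking event's speed trace into descriptive features that could then be fed to K-Means. The exact features the project used are not specified beyond the examples above, so the shape_metric defined here (the mean of the normalized speed between entry and minimum speed) is a hypothetical stand-in: a perfectly linear deceleration scores 0.5, and a trace that holds speed longer before dropping scores higher.

```python
def braking_features(speed_trace):
    """Describe one braking event by a few scalar features.

    speed_trace is a list of speeds (km/h) sampled at equal intervals,
    starting at the braking point.
    """
    entry_speed = speed_trace[0]
    min_speed = min(speed_trace)
    i_min = speed_trace.index(min_speed)
    speed_drop = entry_speed - min_speed
    decel_phase = speed_trace[: i_min + 1]          # entry down to minimum speed
    if speed_drop > 0:
        # Normalize so 1.0 is the entry speed and 0.0 is the minimum speed.
        normalized = [(v - min_speed) / speed_drop for v in decel_phase]
    else:
        normalized = [0.0]
    return {
        "initial_speed": entry_speed,
        "min_speed": min_speed,
        "speed_drop": speed_drop,
        "exit_speed": speed_trace[-1],
        "shape_metric": sum(normalized) / len(normalized),
    }


# Made-up traces: a sharp linear stop and a gentler, curved slowdown.
linear_event = braking_features([300, 250, 200, 150, 100, 150])
curved_event = braking_features([200, 195, 185, 170, 150, 160])
```

Because every event is reduced to the same fixed-length vector, events of different durations can be clustered with ordinary K-Means without time-series alignment.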

Analysis of the Braking Signature Plots

By averaging the speed traces for all events within each cluster, we visualized the canonical braking profiles. The analysis successfully identified two primary signatures:

  • Signature 1 ("U-Shaped"): A braking profile with a higher entry and exit speed and a curved deceleration. This represents braking for medium-to-high-speed corners where maintaining momentum is key.
  • Signature 2 ("V-Shaped"): A profile with a lower entry speed and a much sharper, more linear deceleration to a lower minimum speed. This is the classic signature for a slow, tight corner where the focus is on maximizing braking in a straight line.

Codifying Insights (The "Rules")

The final phase of our analysis aimed to formalize the patterns we observed into a concrete set of "if-then" rules. The goal was to move beyond descriptive statistics and codify the unwritten "grammar" of how to drive a specific circuit. This allows us to quantitatively confirm the driving logic required for optimal performance.

Feature Engineering & Algorithm

Association Rule Mining requires data in a "market basket" format, where each row is a transaction and the columns are discrete items. To achieve this, we transformed our continuous telemetry data:

  • Feature Engineering (Discretization): We converted continuous channels like Speed and RPM into categorical bins (e.g., Speed_Low, Speed_Medium, Speed_High). Each telemetry timestamp, with its collection of binned states, became a "transaction."
  • Algorithm Used: We employed the FP-Growth algorithm, an efficient method for discovering frequent patterns in large datasets. We searched for rules with high confidence (the probability that the "then" part holds given that the "if" part holds) and high lift (how much more often the "if" and "then" items appear together than would be expected if they were independent).
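
The discretization step and the two rule metrics can be illustrated with the standard library alone. Note that this sketch does not implement FP-Growth: it evaluates a single, given rule by direct counting, which is enough to show what confidence and lift measure. The bin edges, item names, and function names are hypothetical.

```python
def speed_bin(speed_kmh):
    """Map a continuous speed to a categorical item (assumed bin edges)."""
    if speed_kmh < 120:
        return "Speed_Low"
    if speed_kmh < 250:
        return "Speed_Medium"
    return "Speed_High"


def to_transaction(sample):
    """Turn one telemetry timestamp into a set of discrete items."""
    throttle_item = "Throttle_Full" if sample["throttle"] >= 100 else "Throttle_Partial"
    return frozenset({speed_bin(sample["speed"]), throttle_item, f"Gear_{sample['gear']}"})


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_if = sum(antecedent <= t for t in transactions)
    n_then = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    if n_if == 0 or n_then == 0:
        raise ValueError("antecedent or consequent never occurs")
    confidence = n_both / n_if                      # P(then | if)
    return {
        "support": n_both / n,
        "confidence": confidence,
        "lift": confidence / (n_then / n),          # P(then | if) / P(then)
    }


# Made-up telemetry samples for illustration.
samples = [
    {"speed": 320, "throttle": 100, "gear": 8},
    {"speed": 315, "throttle": 100, "gear": 8},
    {"speed": 290, "throttle": 100, "gear": 7},
    {"speed": 60, "throttle": 40, "gear": 2},
]
transactions = [to_transaction(s) for s in samples]
metrics = rule_metrics(transactions, {"Throttle_Full", "Speed_High"}, {"Gear_8"})
```

A lift above 1 means the "then" items occur more often when the "if" items are present than they do overall; a frequent-pattern miner such as FP-Growth searches for all such rules efficiently instead of testing them one at a time.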

Analysis of the Rule Visualization Plots

To validate our findings, we visualized where the most characteristic rules for two very different tracks—Monza and Monaco—were active on the circuit map. The results perfectly align with the real-world understanding of these tracks, as shown in the Mercedes AMG F1 circuit guides.

  • Monza (The "Temple of Speed"): Our model discovered a key high-speed rule: IF (Throttle_Full, Speed_Very_High) THEN (Gear_8, RPM_High). When we plotted where this rule was active, it perfectly highlighted the four main straights of the Monza circuit. This aligns exactly with the Mercedes map, which shows these sections are taken in 7th or 8th gear at maximum speed. The model successfully learned and visualized the defining characteristic of a power circuit.
  • Monaco (The Technical Challenge): For Monaco, the model found a defining low-speed rule: IF (Speed_Low, Throttle_Partial) THEN (Gear_2, RPM_Low). The visualization shows this rule is active exclusively in the tightest sections of the track: the Grand Hotel Hairpin (Turn 6), the Nouvelle Chicane (Turns 10-11), and the Rascasse/Antony Noghès complex (Turns 18-19). This again matches the Mercedes guide, which shows these corners are taken in 1st, 2nd, or 3rd gear at the lowest speeds on the entire F1 calendar.

This final step provides a powerful visual validation of our entire project. It demonstrates that our unsupervised learning pipeline has not only found statistical patterns but has successfully learned and codified the fundamental, real-world driving logic of two iconic and vastly different Formula 1 circuits.

Putting it All Together

Conclusion: From Raw Data to the Racing Line

This project successfully demonstrates a comprehensive, multi-phase framework for deconstructing Formula 1 performance using unsupervised machine learning. By systematically moving from a high-level overview to a granular, event-based analysis, we have transformed millions of raw telemetry points into a rich, interpretable narrative of driver skill, car behavior, and track characteristics.

A Structured Approach to Unsupervised Analysis

Our three-phase approach created a virtuous cycle, where the insights from each stage provided the context for the next:

  • In Phase 1, we answered the "what" and "who" by using PCA and K-Means clustering to distill complex data into clear, high-level archetypes. We successfully classified tracks into categories like "Power & Speed Circuits" and drivers into profiles such as "Aggressive Brakers," providing a foundational understanding of the performance landscape.
  • In Phase 2, we investigated the "how" and "why" behind these profiles. Anomaly detection pinpointed the most critical moments of a lap, while our Four-Quadrant analysis contextualized them as either mistakes or moments of brilliance. Furthermore, by clustering the shape of speed traces, we deconstructed driver techniques into their fundamental "signatures," revealing the visual difference between a "V-shaped" and "U-shaped" braking style.
  • Finally, in Phase 3, we codified our findings by applying Association Rule Mining. This transformed our observations into a set of concrete, "if-then" rules that represent the underlying driving logic for a specific circuit. Visualizing these rules on the track map provided a powerful validation, confirming that our model had successfully learned the distinct demands of vastly different circuits like Monza and Monaco.

Ultimately, this project shows that by combining feature engineering, clustering, anomaly detection, and rule mining, it is possible to move beyond simple lap times. We have created a framework that can quantify driver style, identify the key moments that define a race, and codify the very "rules" of what it takes to be fast, transforming raw data into strategic intelligence.
