Transformers as a game changer for autonomous driving, and one stock pick
An academic look at the implementation of transformers in autonomous driving systems, and a detailed analysis of one name with a lot of upside potential
Introduction: tech adoption cycles and stock picking
Technological transitions can move very rapidly, with adoption going from zero to maturity in under ten years. Good recent examples are smartphone usage, social media apps and streaming services, and looking at history, adoption cycles are only speeding up:
But this isn’t purely a modern phenomenon: more than 100 years ago, the streets of New York already went from horse and carriage to being packed with automobiles in a timespan of only ten years:
When these technological transitions are on the verge of taking off, the market frequently underestimates their potential in a massive way. A few examples are Tesla, Nvidia, Amazon, Microsoft, ASML, Shopify, Netflix and ServiceNow; the list is long. Thinking about barriers to entry and long-term market shares is a good way to separate the wheat from the chaff: companies without a sufficient moat won’t be able to seize the new opportunity, as competitors quickly emerge and erode returns in the industry despite high levels of growth. Secondly, you want to pick new technologies that have a reasonably high chance of making it, or at least size your position accordingly; having a large weighting in a long shot is obviously foolish.
Tech is clearly an extremely lucrative ground for stock pickers, although not an easy one to navigate, as in-depth knowledge of the underlying technology and industry is usually required. The robotaxi and autonomous driving market, however, is a space where investors have met disappointment over the last eight years. From 2016 to 2018, some of the largest tech companies promised us that autopilots were just around the corner, with robotaxis due to launch shortly after. Obviously this didn’t happen, and by now everyone seems to have forgotten about the space, myself included: until last year, I hadn’t looked at it for years.
But this is another beauty of stock picking: the large-cap tech names exposed to this theme still did enormously well in the subsequent years, as their core businesses performed extremely well despite the robotaxi moonshot not coming through. In the second part of this note, we will look for names with a similarly asymmetric return profile, i.e. limited downside if we’re met with disappointment again, but large upside if the robotaxi opportunity can be successfully rolled out this time around.
Transformers as a game changer in autonomous driving
The reason for taking a fresh look at this space is that AI technology has evolved considerably since 2016-2018, which we’ll now go through.
Firstly, with the rise of ChatGPT, innovation is clearly still progressing at an impressive pace in the world of AI. The architecture underlying ChatGPT, the transformer, was only conceived by researchers at Google in 2017. Transformers have a mathematical mechanism for remembering context over a much wider time span, similar to how, when you’re reading a text, you remember context from what you’ve read before.
Previous AI algorithms, such as recurrent neural networks (RNNs), effectively only made use of the sentence they had just read. So the fact that basic convolutional neural nets combined with RNNs and lines of hand-written C++ code weren’t sufficient to drive cars in 2018 doesn’t mean the game is over: AI algorithms are still evolving rapidly, and the addition of better context-aware methodologies in particular will obviously provide a big help.
Let’s say a pedestrian is crossing the street behind a parked truck. If the car’s cameras had already detected the pedestrian earlier, the transformer can remember that a pedestrian is still making his way behind the truck. With a convolutional model this isn’t possible, as it purely analyzes the images it is being fed. An RNN will do better, but its time span is limited: if you’re held up somewhere for a longer period, the transformer will still be able to remember the important context, whereas the RNN will have dropped it from memory.
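To make the mechanism concrete, here is a minimal sketch of self-attention over a history of camera frames. It is a toy in plain NumPy, not anyone’s actual driving stack: every frame is reduced to a single feature vector, and attention gives the current frame direct access to every earlier frame, however long ago it occurred.

```python
# Toy sketch of scaled dot-product self-attention over a history of camera frames.
# Purely illustrative: real perception stacks use learned projections, multiple
# attention heads and positional information, none of which are included here.
import numpy as np

def self_attention(X):
    """X: (num_frames, d) per-frame features. Returns context-mixed features."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                            # frame-to-frame similarity
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)   # softmax over the whole history
    return weights @ X                                       # each frame becomes a weighted mix of all frames

rng = np.random.default_rng(0)
history = rng.normal(size=(100, 16))    # 100 frames of 16-dim features
history[3] += 5.0                       # frame 3: the pedestrian was clearly visible here
contextualized = self_attention(history)
# The latest frame attends directly to frame 3, no matter how many frames have
# passed since; an RNN would have to carry that information forward step by step
# and typically loses it, while a per-frame CNN never had access to it at all.
```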
The other advantage is that transformers can absorb much larger amounts of data. Again, this is hugely important for understanding context. Let’s say there are two dogs on opposite sides of the road. If the model has seen situations like this before, it knows that these dogs are likely to start interacting with each other, for example with one dog suddenly running across the street. The vehicle can automatically start slowing down to prepare for a possible halt.
A recent paper from the self-driving team at Google illustrates how transformers can be used to model the behavior of agents in the vehicle’s surroundings:
“Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture employs attention to combine features across road elements, agent interactions, and time steps. We evaluate our approach on autonomous driving datasets for both marginal and joint motion prediction, and achieve state of the art performance across two popular datasets. We show that our model can unify a variety of motion prediction tasks from joint motion predictions to conditioned prediction.”
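To give a feel for the masking idea described in that abstract, below is a highly simplified sketch in PyTorch. It is my own illustrative toy, not the authors’ architecture: agent trajectories are flattened into tokens, future timesteps are hidden behind a mask flag, and a standard transformer encoder is asked to fill them in (positional and agent embeddings are omitted for brevity).

```python
# Toy masked motion prediction: hide future agent positions and let a transformer
# predict them jointly, attending across both agents and timesteps.
import torch
import torch.nn as nn

class ToyMotionPredictor(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)   # per token: (x, y, is_masked flag)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)    # predict (x, y) for every token

    def forward(self, tokens):
        # tokens: (batch, num_agents * num_timesteps, 3); self-attention mixes
        # information across all agents and all timesteps in one pass
        return self.head(self.encoder(self.embed(tokens)))

# Two agents, ten timesteps each; the last four timesteps of each agent are
# masked (positions zeroed, flag set to 1) and must be predicted jointly.
tokens = torch.randn(1, 2 * 10, 3)
tokens[:, :, 2] = 0.0
for agent in range(2):
    tokens[:, agent * 10 + 6 : (agent + 1) * 10, :2] = 0.0  # hide the future positions
    tokens[:, agent * 10 + 6 : (agent + 1) * 10, 2] = 1.0   # mark them as masked
predicted_xy = ToyMotionPredictor()(tokens)                 # (1, 20, 2) filled-in trajectories
```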
A recent paper on arXiv summarizes current developments in transformers and autonomous driving. Below I’ve highlighted the most relevant paragraphs, which should give you a good idea of how transformers are being used:
“Transformers have revolutionized Natural Language Processing (NLP), with models like BERT, GPT, and T5 setting new standards in language understanding (Alaparthi 20, Radford 18, Raffel 20). Their impact extends beyond NLP, as the Computer Vision (CV) community adopts Transformers for visual data processing. This shift from traditional Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to Transformers in CV signifies their growing influence, with early implementations in image recognition and object detection (Dosovitskiy 20, Carion 2020, Zhu 20) showing promising outcomes.
In Autonomous Driving (AD), Transformers are transforming a range of critical tasks, including object detection (Mao 23), lane detection (Han 22) and segmentation (Ando 23, Cakir 22), and can be combined with reinforcement learning (Seo, Vecchietti) to execute complex path finding. They excel in processing spatial and temporal data, outperforming traditional CNNs and RNNs in complex functions like scene graph generation (Liu 22) and tracking (Zhang 23). The self-attention mechanism of Transformers provides a more comprehensive understanding of dynamic driving environments, essential for the safe navigation of autonomous vehicles.
Transformers in Autonomous Driving function as advanced feature extractors, differing from CNNs by integrating information across larger visual fields for global scene understanding. Their capability to process data in parallel offers significant computational efficiency, essential for real-time processing in autonomous vehicles. The global perspective and efficiency make the Transformer highly advantageous for Autonomous Driving technology, enhancing system capabilities.
The application of Vision Transformer models has led to significant progress in 3D and general perception tasks within autonomous driving. Initial models such as DETR (Carion 2020) adopted an innovative method to object detection by framing it as a set prediction issue, employing pre-defined boxes and utilizing the Hungarian algorithm to predict sets of objects. This methodology was further refined in Deformable DETR (Zhu20), which incorporated deformable attention for improved query clarity and faster convergence. DETR3D (wang 2022) extended these principles to 3D object detection, transforming LiDAR data into 3D voxel representations. Additionally, Vision Transformers like FUTR (Gong 2022) and FUTR3D (Chen 2023) have broadened their scope to include multimodal fusion, effectively processing inputs from various sensors to enhance the overall perception capabilities.
Transformers are increasingly pivotal in autonomous driving, notably in prediction, planning, and decision-making. This progression marks a significant shift towards end-to-end deep neural network models that integrate the entire autonomous driving pipeline, encompassing perception, planning, and control into a unified system. This holistic approach reflects a substantial evolution from traditional models, indicating a move towards more comprehensive and integrated solutions in autonomous vehicle technology
In trajectory and behavior prediction, Transformer-based models like VectorNet (Gao 2020), TNT (Zhao 2021), DenseTNT (Gu 2021), mmTransformer (Liu 2021) and AgentFormer (Yuan 2021) have addressed the limitations of standard CNN models, particularly in long-range interaction modeling and feature extraction. VectorNet enhances the depiction of spatial relationships by employing a hierarchical graph neural network, which is used for high-definition maps and agent trajectory representation. TNT and DenseTNT refine trajectory prediction, with DenseTNT introducing anchor-free prediction capabilities. The mmTransformer leverages a stacked architecture for simplified, multimodal motion prediction. AgentFormer uniquely allows direct inter-agent state influence over time, preserving crucial temporal and interactional information. WayFormer (Nayakanti 2023) further addresses the complexities of static and dynamic data processing with its innovative fusion strategies, enhancing both efficiency and quality in data handling.
End-to-end models in autonomous driving have evolved significantly, particularly in planning and decision-making. TransFuser (Chitta 2022, Lai 2023) exemplifies this evolution with its use of multiple Transformer modules for comprehensive data processing and fusion. NEAT (Chitta 2021) introduces a novel mapping function for BEV coordinates, compressing 2D image features into streamlined representations. Building upon this, InterFuser (Shao 2023) proposes a unified architecture for multimodal sensor data fusion, enhancing safety and decision-making accuracy. MMFN (Zhang 2022) expands the range of data types to include HD maps and radar, exploring diverse fusion techniques. STP3 (Hu 2022) and UniAD (Hu 2023) further contribute to this field, with STP3 focusing on temporal data integration and UniAD reorganizing tasks for more effective planning. These models collectively mark a significant stride towards integrated, efficient, and safer autonomous driving systems, demonstrating the transformative impact of Transformer technology in this domain.”
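One concrete piece of the survey that is easy to illustrate is DETR’s set-prediction step: the model emits a fixed set of candidate boxes, and the Hungarian algorithm matches each ground-truth object to exactly one candidate before the loss is computed. The sketch below uses SciPy’s assignment solver with a deliberately simplistic cost (distance between box centers); the real DETR matching cost also includes class probabilities and box overlap.

```python
# Bipartite matching of predicted boxes to ground-truth objects, as used in
# DETR-style set prediction. Costs here are just L1 distances between box
# centers, a simplification of the real matching cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

predicted_centers = np.array([[0.2, 0.3], [0.8, 0.1], [0.5, 0.5], [0.9, 0.9]])  # N = 4 object queries
ground_truth      = np.array([[0.78, 0.12], [0.48, 0.52]])                      # 2 real objects in the scene

# Cost matrix: rows = ground-truth objects, columns = predictions
cost = np.abs(ground_truth[:, None, :] - predicted_centers[None, :, :]).sum(-1)
gt_idx, pred_idx = linear_sum_assignment(cost)   # optimal one-to-one assignment

for g, p in zip(gt_idx, pred_idx):
    print(f"ground-truth object {g} matched to prediction {p} (cost {cost[g, p]:.3f})")
# Predictions left unmatched are trained towards a "no object" class.
```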
Clearly the pace of innovation in this field is extremely high, with a large number of these developments having been introduced only over the last few years.
An example of how transformers can be used for route planning of the autonomous vehicle while predicting the movements of the surrounding objects:
An example of how transformers can be used to detect the key objects in the surroundings of the autonomous vehicle:
The key challenge in autonomous driving is the difficulty for the vehicle to understand the full context of its surroundings: recognizing, for example, that a particular lane is reserved for trams, and handling a wide variety of other complex city situations, such as street workers walking around with road signs, animals on the road, an ambulance that needs to pass through, a person on the road waving instructions with his hands, a bicycle being transported on the back of a car, or images of persons or animals displayed on vehicles.
There is an extremely wide variety of edge cases. Transformers will clearly be a big help in understanding this context, although it remains to be seen whether they will be fully sufficient.
In the meantime, autonomous vehicles (AVs) are clearly performing better in the field. With FSD 12, Tesla has moved to a single end-to-end neural network for autonomous driving, including transformers to process vision input. This network was trained on millions of video clips compiled from the best drivers in the Tesla fleet and replaced over 300,000 lines of C++ code. The user reviews I’ve seen have been universally positive, with for example both Brad Gerstner and Bill Gurley comparing it to a ChatGPT or iPhone moment. At the same time, robotaxis are expanding operations in more and more cities.
The final factor that is different now compared to the failed rollout of 2018 is that, thanks to our friends at ASML, TSMC and Nvidia, hardware has simply gotten much better. The big advantage of better hardware for autonomous driving is that it allows larger models to be trained and run for inference, letting the vehicle pick up far more of the necessary context from its surroundings. This increases the probability that the vehicle will continuously make the correct decision, which should lead to strong outperformance versus human drivers over time.
Lastly, there is a movement against robotaxis as people see them as unsafe. However, human drivers on the road produce a plethora of dangers: speeding, distracted driving, drunk driving, careless driving, running red lights and stop signs, cognitively impaired drivers, tailgating, road rage, etc. The US National Highway Traffic Safety Administration reports that in 2022, more than 289,000 people were injured as a result of distracted driving alone, such as drivers using their phone while driving.
Long term, if the data shows that autonomous vehicles have become 10x safer than humans, governments could well start considering a ban on humans operating a vehicle on the road, similar to how smoking has been banned from public indoor environments. Not too long ago you could smoke on a train, in a plane, in the office and in college hallways, but not anymore.
In conclusion, putting the three factors together: ongoing improvements in the underlying mathematical algorithms are giving autonomous vehicles better context awareness and better decision-making capabilities, we are indeed seeing better results in the field, and we keep getting more powerful semiconductor hardware, which allows for larger, more context-aware models. So we could be at the start of a classic sigmoid-shaped adoption pattern for new technologies (see the first chart we went through in this article): slow at first, with an accelerating slope thereafter until high adoption rates have been reached.
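As a purely illustrative aside, that S-curve can be written as a logistic function. The midpoint year and steepness below are arbitrary placeholders, not a forecast:

```python
# Logistic (sigmoid) adoption curve: slow start, accelerating middle, saturation.
# Parameters are made up for illustration only.
import numpy as np

def adoption_share(year, midpoint=2030, steepness=0.8, ceiling=1.0):
    """Fraction of the addressable market adopted by a given year."""
    return ceiling / (1.0 + np.exp(-steepness * (year - midpoint)))

for y in range(2024, 2037, 3):
    print(y, f"{adoption_share(y):.0%}")   # roughly 1% in 2024, 50% in 2030, 99% in 2036
```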
For premium subscribers, we’ll review:
Which companies are currently best positioned in the autonomous driving and robotaxi spaces.
And we’ll also go into one name in detail:
The company currently has an interesting collection of assets that provides a growing, cash-generative business while being cheaply valued.
Its robotaxis are scaling up as we speak.
We’ll calculate a possible size for its robotaxi market by 2030 under reasonable assumptions, such as robotaxis taking a 15% share of the number of miles driven, and show how much upside this can give in the share price; a bare-bones sketch of the mechanics follows below.
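To show the mechanics only, here is a skeleton of that type of calculation. The 15% mileage share is the assumption quoted above; every other number is a placeholder I have made up for illustration, to be replaced by the assumptions laid out in the premium section.

```python
# Skeleton of a robotaxi market-size calculation. Only the 15% mileage share
# comes from the text above; all other inputs are illustrative placeholders.
US_VEHICLE_MILES_2030 = 3.2e12   # placeholder: annual US vehicle miles (roughly 3 trillion today)
ROBOTAXI_MILE_SHARE   = 0.15     # assumption quoted in the text
REVENUE_PER_MILE      = 1.00     # placeholder: dollars charged per robotaxi mile
COMPANY_MILE_SHARE    = 0.10     # placeholder: this company's share of robotaxi miles
OPERATING_MARGIN      = 0.20     # placeholder
EARNINGS_MULTIPLE     = 20       # placeholder

robotaxi_revenue_pool = US_VEHICLE_MILES_2030 * ROBOTAXI_MILE_SHARE * REVENUE_PER_MILE
company_operating_profit = robotaxi_revenue_pool * COMPANY_MILE_SHARE * OPERATING_MARGIN
implied_segment_value = company_operating_profit * EARNINGS_MULTIPLE
print(f"Implied robotaxi segment value: ${implied_segment_value / 1e9:.0f}bn")
```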