Nvidia - opportunities & risks in the AI big bang
Thoughts post Q2
“The most troublesome thing is Nvidia’s GPU chips, we never know how much we can get.”
— a Chinese server manufacturer speaking to the Financial Times.
The current boom in AI training
Nvidia’s datacenter performance has been nothing short of extraordinary. Whereas the company previously guided for around $8 billion of datacenter revenues this quarter, revenues came in above $10 billion as suppliers were able to ramp capacity better than expected.
And the party isn’t over yet, Nvidia’s CFO: “Demand for our datacenter platform is tremendous and broad-based across industries and customers. Our demand visibility extends into next year. Our supply over the next several quarters will continue to ramp as we lower cycle times and work with our supply partners to add capacity.”
Currently, demand is coming especially from the large cloud and internet companies, but that is also a function of their longstanding relationships with Nvidia, which is now giving them preferential access to its products during the supply crunch. Smaller players, on the other hand, are struggling to obtain GPU allocations. Nvidia’s CFO detailed some of this on the call:
“Our cloud service providers drove exceptionally strong demand for HGX systems in the quarter as they undertake a generational transition to upgrade their datacenter infrastructure for the new era of accelerated computing. The NVIDIA HGX platform is the culmination of nearly 2 decades of full-stack innovation across silicon, systems, interconnects, networking, software and algorithms. Instances powered by the NVIDIA H100 Tensor Core GPUs are now generally available at AWS, Microsoft Azure, and several GPU cloud providers, with others on the way shortly.
Consumer internet companies also drove the very strong demand. Their investments in datacenter infrastructure purpose-built for AI are already generating significant returns. For example, Meta recently highlighted that since launching Reels, AI recommendations have driven a more than 24% increase in time spent on Instagram.
Enterprises are also racing to deploy generative AI, driving strong consumption of NVIDIA-powered instances in the cloud as well as demand for on-premise infrastructure.”
Later on the call she added some more details: “Our large cloud service providers are contributing a little bit more than 50% of our revenue within Q2. And the next largest category will be our consumer internet companies. And then the last piece of that will be our enterprise and high-performance computing customers.”
Nvidia’s latest HGX platform is a combination of eight H100 GPUs, the current state of the art, connected with NVLink over four NVSwitches. 32 of these platforms can be networked together, giving a total of 256 GPUs able to act as one unit.
Jensen Huang, Nvidia’s founder, adding some color: “We call it H100 as if it's a chip that comes off of a fab. But H100s go out really as HGXs sent to the world's hyperscalers and they're really quite large system components. The HGX is 35,000 parts, 70 pounds, nearly 1 trillion transistors in combination. It takes a robot to build, well, many robots to build because it's 70 pounds to lift. And it takes a supercomputer to test a supercomputer.”
I uploaded a video of this process on my Twitter, but here’s also an image of the final result:
The main reason for Nvidia’s GPU shortages is that there simply isn’t enough manufacturing capacity available, especially in HBM (3D stacked DRAM) and CoWoS (advanced packaging). Nvidia’s CFO on the conference call: “Our supply partners have been exceptional in ramping capacity to support our needs. We have also developed and qualified additional capacity and suppliers for key steps in the manufacturing process such as CoWoS packaging. We expect supply to increase each quarter through next year.”
A recent TrendForce analysis sheds some more light on this:
“According to a report from Taiwan’s Commercial Times, NVIDIA is aggressively establishing a non-TSMC CoWoS supply chain. Sources in the supply chain reveal that UMC is proactively expanding silicon interposer capacity, doubling it in advance, and now planning to further increase production by over two times. The monthly capacity for silicon interposers will surge from the current 3 kwpm (thousand wafers per month) to 10 kwpm, potentially aligning its capacity with TSMC’s next year, significantly alleviating the supply strain in the CoWoS process.
A prior report from Nomura Securities highlighted NVIDIA’s efforts since the end of Q2 this year to construct a non-TSMC supply chain. Key players include UMC for wafer fabrication, Amkor and SPIL for packaging and testing.
Addressing these expansion rumors, UMC affirms that growth in advanced packaging demand is an inherent trend and future focus, asserting their evaluation of capacity options and not ruling out the possibility of continuous enlargement of silicon interposer capabilities.”
WikiChip illustrating how HBM is bonded onto a die in a CoWoS package:
While capacity is increasing, which should alleviate shortages during the coming quarters, Nvidia is also launching a new AI chip which won’t need CoWoS. Jensen Huang explains how this chip is targeted at certain workloads:
“The L40S is really designed for a different type of application. H100 is designed for large-scale language models and processing, just very large models and a great deal of data. L40S' focus is to be able to fine-tune models, fine-tune pre-trained models, and it'll do that incredibly well. It has a transform engine. It's got a lot of performance and you can get multiple GPUs in a server.”
According to Nvidia’s tests, for these types of workloads the L40S outperforms the H100’s predecessor, the A100:
On Nvidia’s AI software offering, the company gave a list of examples of how large enterprises are using its tools to build AI systems. A few that stood out: “With the NVIDIA NeMo platform for developing large language models, enterprises will be able to make custom LLMs (large language models) for advanced AI services, including chatbots, search and summarization, right from the Snowflake Data Cloud. AI Copilot and assistants are set to create new multi-hundred billion dollar market opportunities for our customers. WPP, the world's largest marketing and communication services organization, is developing a content engine using NVIDIA Omniverse to enable artists and designers to integrate generative AI into 3D content creation.”
Long term, I suspect that competitors such as AMD, but also the hyperscalers with their custom-designed accelerators, will be able to take a further share of the AI training market. AMD will already have a window over the coming twelve months with the launch of the MI300, which should be able to compete for some AI training workloads, especially as AMD has been busy building out its software ecosystem, for example its integration with PyTorch. Nvidia, for its part, is aiming to move from a two-year product development cadence to one of 18 months, continuously ramping up the capabilities of its hardware to maintain its dominance in the AI training market.
That said, Nvidia’s AI capabilities are much more versatile than the GPU side alone. We’ve already discussed NVLink and the company’s networking expertise above, the latter obtained via the Mellanox acquisition. Software is part of the equation too, and this goes much further than integrations with PyTorch and TensorFlow. The company sits on top of a wide software ecosystem enabling its hardware to run large-scale AI training. Jensen detailed this on the call:
“So we have a runtime called NVIDIA AI Enterprise. This is one part of our software stack. And this is the runtime that just about every company uses for the end-to-end machine learning, from data processing, the training of any model on any framework you'd like to do, the deployment, and the scaling it out into a datacenter. It could be a scale-out for a hyperscale datacenter. It could be a scale-out for an enterprise datacenter, for example on VMware. So this runtime called NVIDIA AI Enterprise has something like 4,500 software packages, software libraries, and has something like 10,000 dependencies among each other. And that runtime is continuously updated and optimized for our stack.”
All this means that, at least within the coming years, I expect Nvidia to be able to hold on to what is a very dominant position in the AI training market. Currently their estimated market share is around 75%, with Broadcom (which co-develops Google’s TPU) also holding a significant share, and the remaining small part of the market fragmented among a variety of players. Longer term, I suspect strong competitors, like the ones mentioned above, will be able to take their share as well. A lot of these dynamics were detailed in a previous note on the industry which I’ve linked here.
How long will the current AI big bang last?
“The thing about these numbers that's so remarkable is the amount of demand that remains unfulfilled. Talking to some of your customers, there's a demand in some cases for multiples of what people are getting.”
— Joseph Moore, Morgan Stanley semiconductor analyst
The obvious question is how long the current boom will last. Take Elon Musk, for example, who has been trying to purchase 14,000 GPUs for his new AI startup, meant to compete with Google and ChatGPT down the line; presumably this AI will be built into the X app at some stage. I’m hypothesizing about demand here, but say the startup buys 14,000 GPUs in the first year. In the second year, capacity may need to be expanded somewhat and some failed GPUs replaced, so perhaps the startup buys 3,000 GPUs. Year 3 might see a similar pattern, with perhaps 5,000 GPUs purchased. As the average lifespan of a datacenter GPU is around three years, in year 4 Nvidia again sees a big order come in. And as the AI startup is a success and is now training a much larger language model, this order is much larger, totalling 28,000 GPUs. Overall, the GPU order flow of this startup might look something like this:
Currently the industry is really in year 1, and as there is a massive shortage of Nvidia GPUs, this initial order flow will extend well into year 2. Extrapolating this basic example to the wider AI demand environment, cyclical pullbacks in orders are extremely likely. As investors know, the semiconductor industry is notoriously cyclical, and Nvidia has been no exception:
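The hypothetical order pattern above can be captured in a toy model. This is purely illustrative: the initial order size, expansion rate, refresh multiple, and three-year lifespan are the assumptions from my example, not Nvidia data.

```python
# Toy model of the lumpy GPU order flow described above.
# All parameters are the article's hypothetical assumptions.

LIFESPAN = 3  # assumed average datacenter GPU lifespan in years

def order_flow(initial, expansion_rate, refresh_multiple, years):
    """Annual GPU orders: a big initial buildout, small expansion and
    replacement orders in between, and a much larger refresh order once
    the first cohort of GPUs ages out."""
    orders = [initial]
    for year in range(2, years + 1):
        if (year - 1) % LIFESPAN == 0:
            # the cohort bought LIFESPAN years ago is retired and
            # replaced at a multiple of its original size
            buy = orders[year - 1 - LIFESPAN] * refresh_multiple
        else:
            # modest capacity expansion plus replacing failed units
            buy = round(sum(orders) * expansion_rate)
        orders.append(buy)
    return orders

print(order_flow(initial=14_000, expansion_rate=0.21,
                 refresh_multiple=2, years=4))
# → [14000, 2940, 3557, 28000]
```

The point is the shape, not the exact numbers: small orders in between, then a year-4 spike larger than the intervening years combined.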
I suspect Nvidia’s revenue trend going forward will follow a similar pattern, i.e. an upward trajectory with cyclical pullbacks along the way. The steepness of the revenue trendline will depend on how much AI applications find their way into the real world. Already with the currently available technology, AI is clearly proving useful in a plethora of fields, from assisting in coding, to medical imaging analysis, to computer graphics design and autonomous vehicles. An image of Tesla’s FSD driving through a city:
Long term, I suspect that we’re entering the AI century, and that this new technology will find its way into tons of applications. Shorter term, how much datacenter capex can we expect to go into Nvidia’s GPUs? Jensen provides some hints: “The world has something along the lines of about $1 trillion worth of datacenters installed in the cloud, in enterprise and otherwise. … Call it, $250 billion of capital spend each year.”
Overall, Microsoft and Google alone are about one-third of this number. Consensus capex estimates for these names might be too low, although GPU purchases are also being financed by reducing spend elsewhere.
This means that if Nvidia is going to do around $55 to $60 billion in datacenter revenues over the coming twelve months, that would be around 23% of overall global datacenter capex. That is an extraordinarily high number and not sustainable for long periods of time. Prior to this AI boom, Nvidia’s datacenter revenues were around 6% of datacenter capex. The only way such elevated investments would be sustainable is if one or more viral LLM applications find their way into the real world and start generating tens of billions of dollars in revenues. I’m thinking of Google’s Search Generative Experience or Microsoft’s Copilot as potential contenders here. Generating some images with Midjourney to post on Twitter won’t do it.
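A quick back-of-the-envelope check of these figures, using Jensen’s roughly $250bn annual capex figure and the midpoint of the $55-60bn revenue estimate (both are estimates, not reported data):

```python
# Back-of-the-envelope check of the capex-share claims above.
# Both inputs are estimates from the article, not reported figures.

global_dc_capex = 250e9   # Jensen's ~$250bn annual datacenter capex figure
nvda_dc_revenue = 57.5e9  # midpoint of the $55-60bn forward revenue estimate

share = nvda_dc_revenue / global_dc_capex
print(f"Nvidia DC revenue as share of global DC capex: {share:.0%}")  # → 23%

# The ~6% pre-boom baseline implies a much lower sustainable run-rate
baseline = 0.06 * global_dc_capex
print(f"Pre-boom 6% baseline run-rate: ${baseline / 1e9:.0f}bn")  # → $15bn
```

The gap between the ~$15bn pre-boom run-rate and the ~$57bn forward estimate is what makes a cyclical pullback plausible absent big LLM revenue streams.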
Snowflake’s CEO Frank Slootman added some color here as well during their conference call on Wednesday. The Evercore analyst asked: “Frank, just as you spoke to a lot of executives at the Summit, do they recognize the fact that the road to AI does require perhaps a heavier level of investment than they were thinking 12 months ago?”
To which Slootman responded: “The reality is they don't really know yet in any real definitive terms what this is going to take. A lot of people, they have characterized their foray into language models as experimental, exploratory, and they're sort of trying to get their arms around how big a bread box is this. So it's going to take a while before we get a real read on what the level of investment is that people are going to stomach to do this. One of the great things about search historically has been that search also had a very potent business model to go with to pay for it (Google). And we cannot sort of unleash AI and have no business model to pay for it, and people will get tired of that really, really quick.”
This analysis is correct in my view: there will be lots of experimentation going on, including by startups, and many of these efforts won’t make it. However, I also believe there will be plenty of cases where winners develop successful AI applications, and these will drive the long-term growth of the industry.
Inference to provide the next leg of growth?
The inference market for AI, i.e. running previously trained models to use them in the real world, is even bigger than the one for training, although estimates vary. This market is split between inference workloads running in the datacenter and those at the edge of the network, i.e. in smartphones, automobiles, surveillance cameras, robots, etc. This should remain an attractive growth market going forward.
However, the chip market for AI inference is much more competitive than that for AI training; all kinds of chips are used here, such as GPUs, CPUs, and FPGAs. As a result, a wider variety of semiconductor companies will be able to compete in this arena, including Intel, AMD, Broadcom, Apple, NXP, Lattice, and possibly a variety of startups.
Due to the introduction of large language models, which are truly massive compared to more traditional neural networks, it is likely that this market will need much stronger chips than before. This could provide an angle for Nvidia to grow share here as well. Jensen Huang discussed the benefits and disadvantages of running smaller, distilled versions of large language models at the edge during the Q2 call:
“What happens is you create these large language models and you create them as large as you can, and then you derive from it smaller versions of the model, essentially teacher-student models. It's a process called distillation. And so you start from a very large model, and it has a large amount of generality and generalization and what's called zero-shot capability. And so for a lot of applications and questions or skills that you haven't trained it specifically on, these large language models miraculously has the capability to perform them. That's what makes it so magical. On the other hand, you would like to have these capabilities in all kinds of computing devices, and so what you do is you distill them down. These smaller models might have excellent capabilities on a particular skill, but they don't generalize as well.”
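The teacher-student distillation Jensen describes typically trains the small model to match the large model’s temperature-softened output distribution. Below is a minimal, framework-free sketch of the core loss; this is the generic textbook formulation, not Nvidia’s implementation, and the logits are made-up numbers.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a higher T gives softer distributions,
    exposing more of the teacher's 'dark knowledge' about wrong classes."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the teacher's softened outputs and the
    student's — the core objective in teacher-student distillation."""
    p = softmax(teacher_logits, T)  # soft targets from the large model
    q = softmax(student_logits, T)  # the small model's predictions
    # scaled by T^2 so gradients stay comparable across temperatures
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T * T

teacher = [4.0, 1.0, 0.2]            # hypothetical teacher logits
print(distillation_loss(teacher, teacher))              # → 0.0 (perfect match)
print(distillation_loss([0.1, 3.0, 1.0], teacher) > 0)  # → True (diverging student)
```

A student that exactly matches the teacher incurs zero loss; the further its distribution drifts, the larger the penalty.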
So if demand grows to run more versatile models at the edge, this increases the need for more powerful chips there. Nvidia has been launching a series of GPUs aimed at the inference market, and I suspect the pace of innovation will only pick up from here.
For example, there was the recent launch of the L4 series GPUs, which have much lower price points than the A100s and H100s. The below analysis from Nvidia was carried out for AI video workloads, with the L4 massively outperforming CPUs:
The company has also been active in powering autonomous vehicles with its Thor superchip. Nvidia’s efforts in autonomous driving go much further than purely supplying the hardware, and include a large virtual world in which to train autonomous driving models.
China remains a risk
Nvidia is already prohibited from selling its most advanced GPUs into China, so it is selling slower versions instead. From Reuters:
“Nvidia has created variants of its chips for the Chinese market that are slowed down to meet US rules. Industry experts told Reuters the newest one - the Nvidia H800 announced in March - will likely take 10% to 30% longer to carry out some AI tasks and could double some costs compared with Nvidia's fastest US chips. Even the slowed Nvidia chips represent an improvement for Chinese firms. Tencent Holdings, one of China's largest tech companies, in April estimated that systems using Nvidia's H800 will cut the time it takes to train its largest AI system by more than half, from 11 days to four days. ‘The AI companies that we talk to seem to see the handicap as relatively small and manageable,’ said Charlie Chai, a Shanghai-based analyst with 86Research.”
Additionally, there is a Chinese black market where original Nvidia GPUs are selling at double or more the usual prices. Some high-end chips are flowing into Russia as well at even higher markups via small trading firms.
The obvious risk is that the Biden administration puts further restrictions on Nvidia’s sales into China, which would be a long-term headwind to revenues. The Chinese market is not only large but also becoming technologically advanced, and my best guess is that it will be an attractive, high-growth market for AI applications. The company’s CFO discussed this risk on the call:
“We believe the current regulation is achieving the intended results. We do not anticipate that additional export restrictions on our datacenter GPUs, if adopted, would have an immediate material impact to our financial results. However, over the long term, restrictions prohibiting the sale of our datacenter GPUs to China, if implemented, will result in a permanent loss of an opportunity for the US industry to compete and lead in one of the world's largest markets.”
Overall, the company disclosed that around 20 to 25% of datacenter revenues currently stem from China. This is no surprise, as Chinese internet giants like Alibaba, Tencent, and Baidu aim to do their AI training on Nvidia accelerators. All of these are working on advanced AI models, from large language models to autonomous driving, so their GPU demands are obviously substantial. And there is a long list of other Chinese companies using AI, from Hikvision (surveillance cameras) to ByteDance (TikTok).
Financials - share price at time of analysis is $470, ticker NVDA on the NASDAQ
Given that the company continues to see a tight market with growth extending into next year, I modelled in three more quarters of datacenter revenue growth. However, as current GPU demand is likely nothing short of excessive, I modelled in a cyclical correction starting from Q2 next year. Once the large cloud and internet players have the GPU allocations they need, we’ll likely see a drop-off in demand until we get into a replacement or further expansion cycle. However, the market is already anticipating this to a large extent. At the time of writing, Nvidia trades on 33x next-twelve-months GAAP EPS on my numbers, an undemanding multiple if you believe that AI will be a compelling growth story during the coming decades and that Nvidia will remain a kingpin underpinning this transition.
For the first three quarters, I’m above Wall Street’s consensus numbers, but subsequently I’m below, on the expectation of a cyclical pullback. The sell side’s numbers are somewhat uninspired, simply modelling a 20% annual revenue growth rate continuing, well, into perpetuity. Semiconductor history 101 suggests that is unlikely.
Even the last bear capitulated:
I compiled a basket of quality-growth tech companies, those typically seen by analysts as high-quality businesses dominating their niches while exposed to attractive growth rates. This illustrates well that Nvidia’s valuation is by no means a stretch. Despite having the best projected growth rates and some of the best margins, the shares trade on a non-demanding next-twelve-months PE. If Nvidia hits those projected growth rates, clearly there is a lot of upside, with the shares likely to compound at Nvidia’s EPS growth rate, i.e. north of 30% per annum.
Relative to the last three years, Nvidia’s PE is now at the bottom of its range, while numbers are still being upgraded.
Overall, somewhat of a mixed bag here. The long-term story should obviously be attractive for Nvidia, although I’m expecting a cyclical pullback at some stage. Looking at valuation, however, the market is already anticipating this to some extent. After all, if the market believed the sell side’s numbers, this stock would be trading on 80x or so, like Shopify, giving a share price above $1,100.
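The arithmetic behind that figure, using the 33x multiple and the $470 share price cited earlier in this note:

```python
# Implied share price at a higher multiple, from the article's own inputs.
price = 470.0        # share price at the time of analysis
ntm_pe = 33.0        # my next-twelve-months GAAP PE estimate

implied_eps = price / ntm_pe
print(f"Implied NTM EPS: ${implied_eps:.2f}")  # → $14.24

bull_multiple = 80.0  # Shopify-like multiple if consensus were believed
print(f"At {bull_multiple:.0f}x: ${implied_eps * bull_multiple:.0f}")  # → $1139
```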
If you enjoy research like this, hit the like button and subscribe. Also, please share a link to this post on social media or with colleagues with a positive comment; it will help the publication grow.
I’m also regularly discussing technology investments on my Twitter.
Disclaimer - This article is not advice to buy or sell the mentioned securities; it is purely for informational purposes. While I’ve aimed to use accurate and reliable information in writing this, it cannot be guaranteed that all information used is of such nature. The views expressed in this article may change over time without notice. The mentioned securities’ future performances remain uncertain, with both upside and downside scenarios possible.