
AI on the PC: Why Not?

Prasad Alluri | January 2024

How AI is transforming the PC landscape and what it means for memory and storage
 

AI is everywhere. You cannot get through a day without hearing or seeing AI in action. From smart assistants to self-driving cars, AI is changing the way we interact with the world. But what about the personal computer? Can AI make your PC smarter, faster, and more personalized? In this blog, we will explore how AI is transforming the PC landscape and what it means for memory and storage. At CES 2024, all the buzz was about AI: more than 50% of the coverage at the show was related to AI.

AI is powered by large language models (LLMs), models developed using the vast amounts of unlabeled text that humans have been accumulating. The natural language queries that return human-like responses are built on neural networks with billions of parameters, and in some cases multiple networks linked together to generate content. Some of the most popular examples are ChatGPT and DALL-E, which can produce realistic and creative text and images based on user input. These models are impressive, but they also require a lot of computing power and data to run. That is why most of them are hosted in the cloud, where they can access the massive hardware infrastructure and network bandwidth they need.

However, the cloud is not the only place where AI can happen. There are many reasons why moving some AI processing to the edge, i.e., the devices on the user end, can be beneficial. For example, edge AI can reduce latency, improve privacy, save network costs, and enable offline capability. Imagine if you could use your PC to generate high-quality content, edit photos and videos, transcribe speech, filter noise, recognize faces, and more, without relying on the cloud. Wouldn't that be great?
 

Why the PC?
 

Of course, PCs are not the only devices that can benefit from edge AI. Smartphones, tablets, smartwatches, and other gadgets can also leverage AI to enhance their features and performance. But the PC has some unique advantages that make it a suitable platform for edge AI. First, PCs have large screens, which can display more information and provide a better user experience. Second, PCs have large batteries, which can support longer and more intensive AI tasks. Third, PCs have powerful compute, which can handle more complex and demanding AI models.

These advantages are not going unnoticed by chip makers and software developers. Companies such as Intel, AMD, Qualcomm, MediaTek, and Nvidia are embedding increasingly powerful neural processing engines and/or integrated graphics in their PC CPUs and chipsets, which can deliver tens of TOPS (trillions of operations per second) of AI performance. Microsoft has also stated that Windows 11 will be released this year with optimizations that take advantage of these embedded AI engines in CPUs. That should not be a surprise considering the push Microsoft is giving Copilot, a feature that uses AI to help users write code, debug errors, and suggest improvements. Some of these players are also working with ISVs to enable AI-optimized applications: enhanced video conferencing, photo editing features, speech-to-text conversion, background and noise suppression, and facial recognition, just to name a few. Whether these applications under development will impress anyone, or whether the killer application is yet to come, is still speculation. But the key questions remain. How can we run AI models on the PC efficiently and effectively? And…
 

What does it mean for the hardware capabilities of the PC?
 

One of the main challenges of running AI models on the PC is model size. AI models, especially LLMs, can have billions or even trillions of parameters, which require a lot of memory and storage to store and load. For example, our internal experiments show that a 70-billion-parameter Llama2 model with 4-bit precision, a state-of-the-art LLM for natural language generation, takes about 42GB of memory for loading and inferencing, with an output speed of 1.4 tokens per second. That is far more memory than a typical PC has. This, in essence, states the problem and sets the direction for the future. There will be function-specific models that enable size reduction while maintaining accuracy. A bifurcation is likely: large 70-billion-parameter-class models can be used with premium systems with large memory and storage, running fine-tuned applications such as chat completion optimized for dialogue use cases. In addition, a local on-device personal assistant may also need a large-parameter model. A model with fewer than 10B parameters can be used with mainstream devices, conceivably consuming a smaller increment of memory to host (~2GB), and can serve language tasks such as text completion, list completion, and classification.
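The memory figures above follow from simple arithmetic: a model's working footprint is roughly its parameter count times the bytes per parameter, plus runtime overhead for the KV cache and activations. A minimal sketch (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def model_memory_gb(n_params: float, bits_per_param: float,
                    overhead: float = 0.2) -> float:
    """Rough working-set estimate for a quantized model: weight bytes
    plus a fractional allowance for KV cache and activations."""
    weight_bytes = n_params * bits_per_param / 8
    return weight_bytes * (1 + overhead) / 1e9

# 70B parameters at 4-bit precision: 35 GB of weights alone,
# ~42 GB with overhead -- consistent with the Llama2 figure above.
print(round(model_memory_gb(70e9, 4), 1))  # 42.0

# A sub-10B mainstream model lands in the low single-digit GB range.
print(round(model_memory_gb(3e9, 4), 1))   # 1.8
```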

Model size clearly has an implication for memory, at least for the size of PC memory. Bandwidth and energy efficiency are equally important. The PC (specifically mobile) transition from DDR to LPDDR helps on both of these dimensions. For example, LPDDR5X consumes 44-54% less power during active use and 86% less power during self-refresh compared to DDR5, and LPDDR5 delivers 6.4 Gb/s per pin versus 4.8 Gb/s for DDR5. All of this points to a quicker transition to LPDDR5 if AI is to penetrate the PC quickly. There are research and development efforts to improve energy efficiency by moving some of the processing into the memory itself. That is likely to take a long time, if it happens at all. The industry needs to converge on a common set of primitives to offload to memory, and that determines the software stack that needs to be developed. A given set of primitives may not be optimal for all applications. So, let us say that for the moment, processing-in-memory for the PC has more questions than answers.
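Why does bandwidth matter so much? LLM decoding is typically memory-bandwidth-bound: generating each token requires streaming essentially the full set of weights through the processor once, so peak token rate is bounded by bandwidth divided by model size. A back-of-the-envelope sketch (the 60 GB/s system bandwidth is an assumed value for illustration, not a product specification):

```python
def decode_tokens_per_s(model_size_gb: float, mem_bw_gb_s: float) -> float:
    """Upper bound on decode speed when inference is memory-bound:
    each generated token streams the full weight set once."""
    return mem_bw_gb_s / model_size_gb

# A 42 GB model on a hypothetical ~60 GB/s laptop memory subsystem:
print(round(decode_tokens_per_s(42, 60), 1))  # 1.4
```

That bound lines up with the 1.4 tokens per second we observed for the 70B Llama2 model, which is why faster LPDDR generations translate directly into faster on-device generation.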

The bigger question is where the sweet spot for AI models will land. If model sizes remain relatively large, is there a way to reduce the reliance on memory and push part of the model into storage? If that happens, the model rotation will need to be accommodated by increased storage bandwidth. This may increase the proliferation of Gen5 PCIe storage into mainstream PCs, or perhaps accelerate the introduction of Gen6 PCIe storage. A recent Apple paper on this same topic1, "LLM in a flash: Efficient Large Language Model Inference with Limited Memory" by Alizadeh et al., proposes a method to run large language models (LLMs) on devices where the model exceeds the available DRAM capacity. The authors suggest storing the model parameters in flash memory and bringing them into DRAM on demand, and they propose methods to optimize data transfer volume and enhance read throughput to significantly improve inference speed. The paper's primary metric for evaluating the various flash loading strategies is latency, dissected into three distinct components: the I/O cost of loading from flash, the overhead of managing memory with newly loaded data, and the compute cost of inference operations.
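The storage-bandwidth argument can be made concrete. When a slice of weights has to be paged in from the SSD on demand, the I/O component of latency scales inversely with link bandwidth, which is what makes each PCIe generation matter. A sketch with rough sequential-read rates (the ~7 GB/s and ~14 GB/s drive figures are approximate assumptions for Gen4 x4 and Gen5 x4 NVMe, not measured values):

```python
def page_in_time_s(weights_gb: float, storage_gb_s: float) -> float:
    """I/O time to stream a slice of model weights from SSD into DRAM."""
    return weights_gb / storage_gb_s

# Paging in 10 GB of weights during a model rotation:
print(round(page_in_time_s(10, 7), 2))   # 1.43  (Gen4-class drive)
print(round(page_in_time_s(10, 14), 2))  # 0.71  (Gen5-class drive)
```

Halving the page-in time with each PCIe generation is what could make on-demand weight loading feel seamless rather than like an application reload.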

AI capabilities will continue to evolve. The current integration of embedded NPUs into CPUs and discrete GPUs is a start. AI accelerator cards from Kinara, Memryx, and Hailo are an alternate implementation for offloading AI workloads on the PC. Models may also evolve toward function-specific variants that are smaller and optimized for specific tasks. These models will need to be rotated from storage to memory on demand, but the implications for storage are similar to those of running a large model.

Some of the advantages of discrete NPUs are:

  • They can handle complex AI models and tasks with lower power consumption and heat generation than the CPU and GPU.
  • They can provide faster and more accurate AI performance for image recognition, generative AI, chatbots, and other applications.
  • They can complement the existing CPU and GPU capabilities and enhance the overall AI experience for users.

Lenovo claims that in its ThinkCentre Neo Ultra desktop, launching in June 2024, these cards offer more power-efficient and capable AI processing than current CPU and GPU solutions.2

TOPS alone, as a figure of merit, can be misleading. In the end, what matters is the number of inferences per unit time, accuracy, and energy efficiency. For generative AI, that can be the number of tokens per second, or completing a Stable Diffusion image in less than a few seconds. Measuring these in an industry-accepted way will require benchmark development. Case in point: I visited the booths and demos of the CPU vendors and discrete NPU players at CES, and every demo claimed superiority of its implementation in one aspect or another.

There is certainly a lot of enthusiasm around the introduction of AI into the PC space. PC OEMs view this as a stimulus to refresh PCs and an increased share of higher-value content in them. Intel is touting enablement of 100 million AI PCs by 2025, which is almost 30% of the overall PC TAM. Whatever the adoption rate may be, as a consumer, there is something to look forward to in 2024.

References

1. Alizadeh et al., "LLM in a flash: Efficient Large Language Model Inference with Limited Memory," arXiv, 2023.

Vice President and General Manager, Client Storage, Storage Business Unit

Prasad Alluri

Prasad Alluri is the vice president and general manager for Client Storage in the Storage Business Unit. Prior to that, he was our vice president of Corporate Strategy & Incubation Programs. Previously, he worked at Micron and Intel, where he held various positions in product development, product planning, and strategy. Prasad obtained his bachelor's degree from the Indian Institute of Technology (Bombay), a Ph.D. from Arizona State University, and an MBA from the Wharton School of Business. Outside work, Prasad enjoys playing poker and hiking.
