The Smart TV in Your LivingRoom Is a Node in the AIScraping Economy
Comments
🇺🇸 미국 · "NODE" · 총 10건
필터 보기현재 지수
50.0
0 = 부정 우세
50 = 중립
100 = 긍정 우세
최근 7일 기준 12,058건을 분석한 결과, 뉴스 심리지수는 50.0(균형)입니다. 긍정 1건(0.0%)·중립 12,056건(100.0%)·부정 1건(0.0%)이며, 중립 비중이 뚜렷하게 높습니다. 성향 지수는 종합 19.0(중도 균형)입니다.
Comments
While company representatives were tight-lipped about the exact technical details of their offering, they explained that a flexible, software-based system would allow individual member-nations to connect their sensors to another nation’s command nodes.
Direct-to-cell technology uses LEO satellites as spaceborne cell towers. It delivers LTE services to existing smartphones without hardware changes, bridging global coverage gaps. What Attendees will Learn How DTC works as a spaceborne cell tower — LEO satellites carry LTE eNodeB payloads in regenerative mode. How they serve unmodified phones using quasi-earth-fixed multi-beam antennas. How the satellite compensates for Doppler shift and time delay on thenetwork side. Why Doppler shift and round-trip time are critical challenges — A LEO satellite’s high velocity causes carrier frequency offsets in OFDMA systems. Pre-compensation at a reference point helps, but cell-edge users still face residual Doppler. How spectrum sharing and regulation shape DTC deployment — DTC has no dedicated spectrum allocation. It relies on spectrum sharing between terrestrial and satellite operators or re-farmed MSS bands. How national regulations like the FCC SCS framework govern access. Where DTC fits in the evolution toward 5G NTN and 6G — DTC is an interim technology offering fast time-to-market satellite services. It bridges the gap until 3GPP NR-NTN matures. How NR-NTN will bring purpose-built NTN features and international spectrum frameworks. Download this free whitepaper now!
I have been an application-specific IC (ASIC) designer for almost three decades. Over that time, I’ve moved through the full academic trajectory, from graduate student to full professor; later, I transitioned to industry after an unsuccessful stint at entrepreneurship. When I made the switch to the private sector in 2019, I began focusing on a critically important aspect of the electronic industry: silicon intellectual property. As much as 80 percent of the physical area in today’s most advanced chips is occupied by blocks that aren’t made for specific products or even designed by the consumer-facing companies that built them. Instead, chipmakers draw heavily on established silicon IP from companies like Arm, Cadence, Rambus, Synopsys, and the company I work for, Silicon Creations. Throughout my career, I’ve designed chips for very different purposes, including enabling the research program in my academic lab and expanding the IP portfolio of my company. When I joined Silicon Creations, I had no idea how differently the industry approaches IC design and encountered a steep learning curve. Initially, it seemed that much of my two decades of academic research and training did not directly translate to the role. I had to learn new skills and adopt a new mindset. Today, demand for ASICs is rapidly growing, driven by the need for specialized chips in the automotive sector, AI applications, and more. By one market estimate, the ASIC market is expected to grow from US $23.4 billion to $38.8 billion by 2033, and the semiconductor industry as a whole is projected to hit $1 trillion by 2030. The industry needs more chip designers—but if you’re coming from an academic background as I did, there are a few things you’ll need to know. Different goals lead to different strategies The differences between industry and academe begin with a divergence in purpose. In academia, my primary objective was to generate new knowledge: to propose a novel circuit technique, validate an unconventional architecture, or explore the limits of performance in a given domain. A successful chip is one that demonstrates a concept. In industry, it is not nearly enough to prove that something can work. The goal is to ensure that it works reliably, repeatedly, and at scale. Success is measured not by novelty but by whether the silicon meets specifications, yields as expected in production, and supports a competitive product delivered on schedule. This leads to a stark contrast in risk tolerance. Academic designs often deliberately push into unproven territory, where even partial success can yield valuable insight. In industry, however, we systematically minimize risk. The cost of failure makes first-time silicon success a central requirement—especially at advanced technology nodes, where the lithography masks used to transfer circuit designs onto silicon wafers alone can cost tens of millions of dollars. As a result, industry design flows are built around eliminating uncertainty through conservative margins, extensive validation, and careful reuse of proven solutions. “Academia explores the design space, asking what is possible, while industry exploits it, determining what is viable at scale.” This paradigm has existed since the 1970s, when application-specific chip design was established. However, the gulf between academia and industry has expanded since the mid-2010s, when FinFET technology, a 3D architecture using vertical “fins” of silicon, was widely adopted in industry. System designs are also becoming increasingly modular with the advent of chiplets. This fundamentally altered the economics and complexity of ASIC development, with design costs rising by almost an order of magnitude. Initiatives like Taiwan Semiconductor Manufacturing Co.’s University FinFET Program and new government-funded chip-design hubs now let some well-resourced universities design for more advanced architectures, but the technology is still out of reach for many academics. What the industry-academia split means in practice Consider a startup developing an ASIC. Its engineering team may have deep expertise in a particular algorithm, sensor interface, or system architecture, the features that define its competitive advantage. But it is unlikely to possess world-class expertise in every supporting function. Developing each of these blocks internally would require significant time, capital, and specialized talent. Doing so could delay market entry beyond the startup’s viability. Even large semiconductor companies face similar constraints. Advanced-node development demands intense focus. Allocating a team to redesign a standard interface block that has already been implemented elsewhere may be difficult to justify when differentiation lies at the system level, such as an inference chip’s ability to speed up neural network computations. The time it takes to move a new chip from conception to market and risk mitigation, not self-sufficiency, govern most decisions about in-house development versus outsourcing. The economics of advanced IC manufacturing reinforce this reality. When the development cost of a leading-edge chip reaches hundreds of millions of dollars, minimizing risk becomes a central design imperative. In this context, silicon IP emerged as a practical solution. Similar to how software developers rely on preexisting libraries rather than writing every function from scratch, ASIC designers license predesigned, preverified silicon blocks—such as processor cores, memory interfaces, and security engines—from highly specialized IP vendors. These blocks can then be integrated into larger, increasingly complex systems. Design scope, verification, and time horizons With the use of silicon IP, industry is able to widen the scope of its designs. Academic efforts tend to focus on block-level innovation: a new analog-to-digital converter architecture or an ultralow-noise amplifier, for instance. These designs typically abstract away many of the complexities of bringing a chip to market, such as packaging constraints, long-term reliability, and manufacturing yield. In industry, the focus shifts to system-level integration. Modern systems on chips, or SoCs, incorporate dozens or even hundreds of functional blocks. Managing signal integrity, timing, firmware interaction, and system-level validation becomes as critical as the design of any individual block. Verification philosophy also diverges sharply. In academia, the goal of verification is to demonstrate that the concept works under nominal conditions, which may not always reflect how it would perform in real applications. Even if only a fraction of fabricated chips from a multiproject wafer operates correctly, the design may still be considered a success if it validates the underlying idea. At my academic lab for instance, we used to receive 40 chips from a TSMC prototyping service and started testing them in batches of five. If the first five or 10 chips proved functional, we had already collected more than enough data for a publication. If some of them failed, we weren’t required to mention this when publishing the results. In industry, verification is exhaustive, critical, and often dominates the development schedule. Failures are measured in parts per million, and even rare anomalies are carefully analyzed and documented to identify root causes and prevent recurrence. When I started at Silicon Creations, I was surprised by the level of detail and scrutiny designs face. Differences in time horizons and economic constraints reinforce each of these contrasts. Academic projects operate on flexible timelines aligned with research and funding cycles. If I missed a deadline, I just had to wait for the next cycle. Industry projects are driven by fixed product schedules and market windows, frequently targeting costly leading-edge nodes to achieve competitive performance, power, and area efficiency. Missing a deadline can negate the value of an entire design and may have major financial consequences along the entire supply chain. In essence, academia explores the design space, asking what is possible, while industry exploits it, determining what is viable at scale. Both are indispensable, but they operate under fundamentally different definitions of success. As ASIC complexity continues to grow, understanding both perspectives will be essential for the next generation of engineers navigating the evolving semiconductor landscape. This article appears in the June 2026 print issue.
Comments
This sponsored article is brought to you by Wetour Robotics. A field technician on a wind turbine, harness clipped, both hands on a wrench, needs to send a command to the diagnostic device hanging at her belt. A logistics worker on a loading dock, gloves on, eyes on the pallet, needs to redirect a connected lift. A person using an assistive mobility device on a crowded street wants to nudge it forward without taking out a phone or speaking aloud. None of these moments call for a smarter robot. They call for a smarter way to be heard by the machines that already exist. The industry has been building from one side The past three years of Physical AI have been a story of remarkable progress on the robot side of the loop. Companies like Boston Dynamics, Figure, and Unitree have advanced actuators, locomotion, and dexterity to a level that would have seemed implausible a decade ago. Google DeepMind’s Gemini Robotics has redefined what vision-language-action models can do in unstructured settings. The trajectory of the hardware and the foundation models is real, and it is accelerating. But there is another side to this loop, and it has been treated as a solved problem for too long. The interface between humans and machines has defaulted, for 40 years, to three input modalities: screens, buttons, and voice. Each of those assumes the user can stop, look down, and translate intent into structured commands. That assumption breaks the moment the work moves into a real environment. On a turbine. On a dock. On a sidewalk. In any setting where hands are occupied, eyes are committed, or speaking is impractical, the conventional interface stack quietly fails. Spatial Intent Fusion is the simultaneous processing of three streams of human-centered information, namely spatial position, visual context, and gestural intent: Your body is the interface. The bottleneck on the human side of the loop is becoming as important as the one on the machine side. And solving it requires a different question. Not how do we make the robot more capable, but how do we let the human participate in the computing system as naturally as the robot already does. Wetour Robotics’ bet: put the human back into the computing loop Wetour Robotics is betting that the next architectural leap in Physical AI is not about making the robot more capable. It is about making the human a first-class node in the computing network, with the same kind of low-latency, high-fidelity participation that connected devices already enjoy. Wetour Robotics’ engineers frame the problem this way: a wristband that recognizes a gesture is not enough. A camera that recognizes a scene is not enough. The information a human carries about what they are about to do is distributed across multiple channels, including where their body is in space, what their eyes are attending to, and what their muscles are preparing to do, and any single channel observed in isolation is ambiguous. Reconstructing intent reliably means fusing those channels at the operating system level, with latency low enough that the loop feels closed rather than mediated. This approach has a name. Wetour Robotics calls it Spatial Intent Fusion: the simultaneous processing of three streams of human-centered information, namely spatial position, visual context, and gestural intent, fused into a single real-time command for any connected physical device. It is the technical implementation behind a simpler positioning statement the company uses externally: your body is the interface. Orchestra is a portable intelligent hub running the operating system that handles sensor fusion, intent inference, command translation, and safety arbitration. The reference compute platform is NVIDIA Jetson Orin Nano Super, which provides enough on-device inference capacity to keep the entire control loop at the edge, with no cloud dependency on the critical path. Wetour Robotics The architecture: three layers, four engines, one loop Orchestra is not a single device but a layered platform, designed from the start to be sensor-flexible and actuator-agnostic. The architecture decomposes into three perception layers and four coordination engines. Orchestra itself is the local compute and orchestration core: a portable intelligent hub running the operating system that handles sensor fusion, intent inference, command translation, and safety arbitration. The reference compute platform is NVIDIA Jetson Orin Nano Super, which provides enough on-device inference capacity to keep the entire control loop at the edge, with no cloud dependency on the critical path. Edge inference is non-negotiable for this application. Full-chain latency from biosignal acquisition to actuator command is held under 100 milliseconds, the envelope inside which closed-loop control feels natural rather than laggy. VisionLink handles visual and spatial perception. Cameras feed into vision models that identify objects, estimate distances, and track environmental context. VisionLink is designed not as a passive recognition layer but as a real-time command generator: its outputs feed directly into Orchestra OS to be fused with biosignal data. Conductor is the biosignal pipeline. It ingests raw surface electromyographic (sEMG) data from a wrist-worn device, classifies temporal patterns into discrete gestures or continuous control signals, and outputs actuator commands. The technically interesting property of sEMG for this use case is that the signal precedes visible motion. Motor unit action potentials appear at the skin surface roughly 50 to 80 milliseconds before a finger completes the corresponding gesture. Wetour Robotics calls this property pre-motion intent sensing, and it is what allows Orchestra to anticipate user intent rather than react to it. On top of the three perception layers, Orchestra OS runs four coordination engines. The Perception Engine ingests and normalizes raw sensor streams. The Intent Engine performs Spatial Intent Fusion across modalities, resolving what the user is trying to do given where they are, what they are looking at, and what their hand is signaling. The Orchestration Engine translates intent into device-specific command sequences for any connected actuator. The Safety Engine arbitrates conflicting commands, enforces operational envelopes, and gates execution against runtime safety conditions. Wetour Robotics The trade-offs we’re honest about No system that bridges the human body and the digital world is finished. Three engineering challenges remain open, and the company addresses each with a deliberate trade-off rather than a claim of having fully solved it. Baseline stability of sEMG under motion. In a stationary user, continuous gesture recognition from sEMG is reliable. Once the user is walking, climbing, or otherwise moving, motion artifacts and electrode drift degrade the signal in ways that are difficult to fully compensate for. Rather than overpromise on continuous control in dynamic settings, Orchestra defaults to a smaller set of robust discrete gestures in complex operating environments, and reserves continuous control modes for contexts where the signal-to-noise ratio supports them. Miniaturization of edge AI compute. Running the Orchestra control loop entirely at the edge requires real on-device inference, which has historically meant trading off between compute capacity, battery life, and form factor. Wetour Robotics’ approach has been a compact carrier board paired with a thermal design and a battery module sized for all-day wearability. The result is a hub that travels with the user rather than tethering them to a desk, and that performs the full perception-to-actuation loop without offloading to the cloud. Heterogeneity of third-party device protocols. The actuator side of the loop is a fragmented landscape. Different manufacturers expose different command interfaces, different communication stacks, and different safety conventions, and a Physical AI operating system has to integrate with all of them. Wetour Robotics uses an AI-agent layer to negotiate connection and protocol translation adaptively, so that Orchestra OS can ingest data from a wide range of devices, run them through neural network models that infer human intent, and emit the right command on the right protocol for the device on the other end. Why this matters, and why it helps the rest of the field The history of computing is a history of interface revolutions. Command lines gave way to graphical user interfaces, which gave way to touch, which gave way to voice. Each transition expanded who could participate in the system and what they could do with it. The next transition is not about a new screen or a new microphone. It is about treating the human body itself as a participant in the computing network, capable of contributing intent at the same speed and fidelity that any other connected node can. The history of computing is a history of interface revolutions. The next transition is not about a new screen or a new microphone — it is about treating the human body itself as a participant in the computing network. This path is not a competitor to the work being done on humanoid robots, foundation models for embodied AI, and dexterous manipulation. It is the missing complement to that work. The hardest open problem for humanoid systems is the data: every natural interaction between a human and the physical world is a potential training signal, and most of those interactions are currently invisible to any computing system. As more humans become first-class nodes in the loop, those interactions become observable, structured, and ultimately useful for training the next generation of embodied AI, including the humanoid robots being developed today. In other words: putting the human back into the computing loop is not just about better interfaces for individual users. It is about generating the kind of grounded, in-the-wild human-machine interaction data that the broader Physical AI ecosystem will need to keep advancing. The robot side and the human side of the loop are not two competing futures. They are two halves of the same one. That is what Wetour Robotics means when it says: Your body is the interface. Learn more at wetourrobotics.com.
Comments
A guide to ten technological components — from THz communications and AI/ML to reconfigurable intelligent surfaces — poised to define 6G wireless networks. What Attendees will Learn Which frequencies 6G will use — Understand why THz bands (above 100 GHz) and the7–24 GHz range are under consideration, what challenges CMOS technology faces at sub-THz frequencies, and how new semiconductor approaches aim to close the output-power gap for future link budgets. How AI/ML and joint communications and sensing reshape the air interface — how auto encoder-based end-to-end learning can replace traditional signal-processing blocks, and how a single waveform may serve both data transmission and radar-like environmental sensing. What reconfigurable intelligent surfaces and photonics bring to the radio environment— Explore how programmable metamaterial panels can steer and shape electromagnetic waves, and how visible light communications and all-photonics networks extend capacity and lower latency. How ultra-massive MIMO, full-duplex, and new network topologies enable a true 3D“network of networks” — Understand how antenna arrays with vastly more elements, simultaneously transmit/receive on the same frequency, and non-terrestrial nodes converge to deliver ubiquitous, high-capacity 6G coverage. Download this free whitepaper now!
On April 25, armed groups launched near-simultaneous attacks against military installations and key strategic sites across Mali. Claimed by Jama’at Nusrat al-Islam wal-Muslimin, a jihadist group, and conducted in coordination with Tuareg separatist forces from the Front de libération de l’Azawad, the attacks targeted multiple nodes across the country’s security architecture simultaneously, from the capital Bamako to Gao, Mopti, and Kidal.While the attacks themselves were a shock, they should be understood as the logical endpoint of a deteriorating security trajectory that Western governments have watched from an ever-increasing distance. The geographic scope of the attacks, the targeting of senior officials, The post Western Withdrawal, Jihadist Expansion: How the Sahel Became Ground Zero for Global Terrorism appeared first on War on the Rocks.
When it comes to AI models, size matters. Even though some artificial-intelligence experts warn that scaling up large language models (LLMs) is hitting diminishing performance returns, companies are still coming out with ever larger AI tools. Meta’s latest Llama release had a staggering 2 trillion parameters that define the model. As models grow in size, their capabilities increase. But so do the energy demands and the time it takes to run the models, which increases their carbon footprint. To mitigate these issues, people have turned to smaller, less capable models and using lower-precision numbers whenever possible for the model parameters. But there is another path that may retain a staggeringly large model’s high performance while reducing the time it takes to run an energy footprint. This approach involves befriending the zeros inside large AI models. For many models, most of the parameters—the weights and activations—are actually zero, or so close to zero that they could be treated as such without losing accuracy. This quality is known as sparsity. Sparsity offers a significant opportunity for computational savings: Instead of wasting time and energy adding or multiplying zeros, these calculations could simply be skipped; rather than storing lots of zeros in memory, one need only store the nonzero parameters. Unfortunately, today’s popular hardware, like multicore CPUs and GPUs, do not naturally take full advantage of sparsity. To fully leverage sparsity, researchers and engineers need to rethink and re-architect each piece of the design stack, including the hardware, low-level firmware, and application software. In our research group at Stanford University, we have developed the first (to our knowledge) piece of hardware that’s capable of calculating all kinds of sparse and traditional workloads efficiently. The energy savings varied widely over the workloads, but on average our chip consumed one-seventieth the energy of a CPU, and performed the computation on average eight times as fast. To do this, we had to engineer the hardware, low-level firmware, and software from the ground up to take advantage of sparsity. We hope this is just the beginning of hardware and model development that will allow for more energy-efficient AI. What is sparsity? Neural networks, and the data that feeds into them, are represented as arrays of numbers. These arrays can be one-dimensional (vectors), two-dimensional (matrices), or more (tensors). A sparse vector, matrix, or tensor has mostly zero elements. The level of sparsity varies, but when zeroes make up more than 50 percent of any type of array, it can stand to benefit from sparsity-specific computational methods. In contrast, an object that is not sparse—that is, it has few zeros compared with the total number of elements—is called dense. Sparsity can be naturally present, or it can be induced. For example, a social-network graph will be naturally sparse. Imagine a graph where each node (point) represents a person, and each edge (a line segment connecting the points) represents a friendship. Since most people are not friends with one another, a matrix representing all possible edges will be mostly zeros. Other popular applications of AI, such as other forms of graph learning and recommendation models, contain naturally occurring sparsity as well. Beyond naturally occurring sparsity, sparsity can also be induced within an AI model in several ways. Two years ago, a team at Cerebras showed that one can set up to 70 to 80 percent of parameters in an LLM to zero without losing any accuracy. Cerebras demonstrated these results specifically on Meta’s open-source Llama 7B model, but the ideas extend to other LLM models like ChatGPT and Claude. The case for sparsity Sparse computation’s efficiency stems from two fundamental properties: the ability to compress away zeros and the convenient mathematical properties of zeros. Both the algorithms used in sparse computation and the hardware dedicated to them leverage these two basic ideas. First, sparse data can be compressed, making it more memory efficient to store “sparsely”—that is, in something called a sparse data type. Compression also makes it more energy efficient to move data when dealing with large amounts of it. This is best understood by an example. Take a four-by-four matrix with three nonzero elements. Traditionally, this matrix would be stored in memory as is, taking up 16 spaces. This matrix can also be compressed into a sparse data type, getting rid of the zeros and saving only the nonzero elements. In our example, this results in 13 memory spaces as opposed to 16 for the dense, uncompressed version. These savings in memory increase with increased sparsity and matrix size. In addition to the actual data values, compressed data also requires metadata. The row and column locations of the nonzero elements also must be stored. This is usually thought of as a “fibertree”: The row labels containing nonzero elements are listed and linked to the column labels of the nonzero elements, which are then linked to the values stored in those elements. In memory, things get a bit more complicated still: The row and column labels for each nonzero value must be stored as well as the “segments” that indicate how many such labels to expect, so the metadata and data can be clearly delineated from one another. In a dense, noncompressed matrix data type, values can be accessed either one at a time or in parallel, and their locations can be calculated directly with a simple equation. However, accessing values in sparse, compressed data requires looking up the coordinates of the row index and using that information to “indirectly” look up the coordinates of the column index before finally reaching the value. Depending on the actual locations of the sparse data values, these indirect lookups can be extremely random, making the computation data-dependent and requiring the allocation of memory lookups on the fly. Second, two mathematical properties of zero let software and hardware skip a lot of computation. Multiplying any number by zero will result in a zero, so there’s no need to actually do the multiplication. Adding zero to any number will always return that number, so there’s no need to do the addition either. In matrix-vector multiplication, one of the most common operations in AI workloads, all computations except those involving two nonzero elements can simply be skipped. Take, for example, the four-by-four matrix from the previous example and a vector of four numbers. In dense computation, each element of the vector must be multiplied by the corresponding element in each row and then added together to compute the final vector. In this case, that would take 16 multiplication operations and 16 additions (or four accumulations). In sparse computation, only the nonzero elements of the vector need be considered. For each nonzero vector element, indirect lookup can be used to find any corresponding nonzero matrix element, and only those need to be multiplied and added. In the example shown here, only two multiplication steps will be performed, instead of 16. The trouble with GPUs and CPUs Unfortunately, modern hardware is not well suited to accelerating sparse computation. For example, say we want to perform a matrix-vector multiplication. In the simplest case, in a single CPU core, each element in the vector would be multiplied sequentially and then written to memory. This is slow, because we can do only one multiplication at a time. So instead people use CPUs with vector support or GPUs. With this hardware, all elements would be multiplied in parallel, greatly speeding up the application. Now, imagine that both the matrix and vector contain extremely sparse data. The vectorized CPU and GPU would spend most of their efforts multiplying by zero, performing completely ineffectual computations. Newer generations of GPUs are capable of taking some advantage of sparsity in their hardware, but only a particular kind, called structured sparsity. Structured sparsity assumes that two out of every four adjacent parameters are zero. However, some models benefit more from unstructured sparsity—the ability for any parameter (weight or activation) to be zero and compressed away, regardless of where it is and what it is adjacent to. GPUs can run unstructured sparse computation in software, for example, through the use of the cuSparse GPU library. However, the support for sparse computations is often limited, and the GPU hardware gets underutilized, wasting energy-intensive computations on overhead. Petra Péterffy When doing sparse computations in software, modern CPUs may be a better alternative to GPU computation, because they are designed to be more flexible. Yet, sparse computations on the CPU are often bottlenecked by the indirect lookups used to find nonzero data. CPUs are designed to “prefetch” data based on what they expect they’ll need from memory, but for randomly sparse data, that process often fails to pull in the right stuff from memory. When that happens, the CPU must waste cycles calling for the right data. Apple was the first to speed up these indirect lookups by supporting a method called an array-of-pointers access pattern in the prefetcher of their A14 and M1 chips. Although innovations in prefetching make Apple CPUs more competitive for sparse computation, CPU architectures still have fundamental overheads that a dedicated sparse computing architecture would not, because they need to handle general-purpose computation. Other companies have been developing hardware that accelerates sparse machine learning as well. These include Cerebras’s Wafer Scale Engine and Meta’s Training and Inference Accelerator (MTIA). The Wafer Scale Engine, and its corresponding sparse programming framework, have shown incredibly sparse results of up to 70 percent sparsity on LLMs. However, the company’s hardware and software solutions support only weight sparsity, not activation sparsity, which is important for many applications. The second version of the MTIA claims a sevenfold sparse compute performance boost over the MTIA v1. However, the only publicly available information regarding sparsity support in the MTIA v2 is for matrix multiplication, not for vectors or tensors. Although matrix multiplications take up the majority of computation time in most modern ML models, it’s important to have sparsity support for other parts of the process. To avoid switching back and forth between sparse and dense data types, all of the operations should be sparse. Onyx Instead of these halfway solutions, our team at Stanford has developed a hardware accelerator, Onyx, that can take advantage of sparsity from the ground up, whether it’s structured or unstructured. Onyx is the first programmable accelerator to support both sparse and dense computation; it’s capable of accelerating key operations in both domains. To understand Onyx, it is useful to know what a coarse-grained reconfigurable array (CGRA) is and how it compares with more familiar hardware, like CPUs and field-programmable gate arrays (FPGAs). CPUs, CGRAs, and FPGAs represent a trade-off between efficiency and flexibility. Each individual logic unit of a CPU is designed for a specific function that it performs efficiently. On the other hand, since each individual bit of an FPGA is configurable, these arrays are extremely flexible, but very inefficient. The goal of CGRAs is to achieve the flexibility of FPGAs with the efficiency of CPUs. CGRAs are composed of efficient and configurable units, typically memory and compute, that are specialized for a particular application domain. This is the key benefit of this type of array: Programmers can reconfigure the internals of a CGRA at a high level, making it more efficient than an FPGA but more flexible than a CPU. The Onyx chip, built on a coarse-grained reconfigurable array (CGRA), is the first (to our knowledge) to support both sparse and dense computations. Olivia Hsu Onyx is composed of flexible, programmable processing element (PE) tiles and memory (MEM) tiles. The memory tiles store compressed matrices and other data formats. The processing element tiles operate on compressed matrices, eliminating all unnecessary and ineffectual computation. The Onyx compiler handles conversion from software instructions to CGRA configuration. First, the input expression—for instance, a sparse vector multiplication—is translated into a graph of abstract memory and compute nodes. In this example, there are memories for the input vectors and output vectors, a compute node for finding the intersection between nonzero elements, and a compute node for the multiplication. The compiler figures out how to map the abstract memory and compute nodes onto MEMs and PEs on the CGRA, and then how to route them together so that they can transfer data between them. Finally, the compiler produces the instruction set needed to configure the CGRA for the desired purpose. Since Onyx is programmable, engineers can map many different operations, such as vector-vector element multiplication, or the key tasks in AI, like matrix-vector or matrix-matrix multiplication, onto the accelerator. We evaluated the efficiency gains of our hardware by looking at the product of energy used and the time it took to compute, called the energy-delay product (EDP). This metric captures the trade-off of speed and energy. Minimizing just energy would lead to very slow devices, and minimizing speed would lead to high-area, high-power devices. Onyx achieves up to 565 times as much energy-delay product over CPUs (we used a 12-core Intel Xeon CPU) that utilize dedicated sparse libraries. Onyx can also be configured to accelerate regular, dense applications, similar to the way a GPU or TPU would. If the computation is sparse, Onyx is configured to use sparse primitives, and if the computation is dense, Onyx is reconfigured to take advantage of parallelism, similar to how GPUs function. This architecture is a step toward a single system that can accelerate both sparse and dense computations on the same silicon. Just as important, Onyx enables new algorithmic thinking. Sparse acceleration hardware will not only make AI more performance- and energy efficient but also enable researchers and engineers to explore new algorithms that have the potential to dramatically improve AI. The future with sparsity Our team is already working on next-generation chips built off of Onyx. Beyond matrix multiplication operations, machine learning models perform other types of math, like nonlinear layers, normalization, the softmax function, and more. We are adding support for the full range of computations on our next-gen accelerator and within the compiler. Since sparse machine learning models may have both sparse and dense layers, we are also working on integrating the dense and sparse accelerator architecture more efficiently on the chip, allowing for fast transformation between the different data types. We’re also looking at ways to manage memory constraints by breaking up the sparse data more effectively so we can run computations on several sparse accelerator chips. We are also working on systems that can predict the performance of accelerators such as ours, which will help in designing better hardware for sparse AI. Longer term, we’re interested in seeing whether high degrees of sparsity throughout AI computation will catch on with more model types, and whether sparse accelerators become adopted at a larger scale. Building the hardware to unstructured sparsity and optimally take advantage of zeros is just the beginning. With this hardware in hand, AI researchers and engineers will have the opportunity to explore new models and algorithms that leverage sparsity in novel and creative ways. We see this as a crucial research area for managing the ever-increasing runtime, costs, and environmental impact of AI.