How's Linear so fast? A technical breakdown
Comments
๐บ๐ธ ๋ฏธ๊ตญ ยท IT/๊ธฐ์ ยท "LINEAR" ยท ์ด 5๊ฑด
ํํฐ ๋ณด๊ธฐํ์ฌ ์ง์
50.0
0 = ๋ถ์ ์ฐ์ธ
50 = ์ค๋ฆฝ
100 = ๊ธ์ ์ฐ์ธ
์ต๊ทผ 7์ผ ๊ธฐ์ค 11,305๊ฑด์ ๋ถ์ํ ๊ฒฐ๊ณผ, ๋ด์ค ์ฌ๋ฆฌ์ง์๋ 50.0(๊ท ํ)์ ๋๋ค. ๊ธ์ 1๊ฑด(0.0%)ยท์ค๋ฆฝ 11,303๊ฑด(100.0%)ยท๋ถ์ 1๊ฑด(0.0%)์ด๋ฉฐ, ์ค๋ฆฝ ๋น์ค์ด ๋๋ ทํ๊ฒ ๋์ต๋๋ค. ์ฑํฅ ์ง์๋ ์ข ํฉ 19.0(์ค๋ ๊ท ํ)์ ๋๋ค.
Comments
Comments
Electrons are great. We use them to move vehicles, illuminate cities, and, of course, compute. But computation is not confined to the world of electronics. And shifting to alternative nonelectronic realms can unlock unique advantages: Photonic chips, for instance, process information with light while generating little heat. Another compelling alternative is fluidics, which uses pressurized gases or liquids to build logic circuits. Pioneered in the 1960s but sidelined by microchips, the field reemerged in the 1990s as โmicrofluidics.โ This approach aims to shrink laboratories onto a single chip by creating microscopic fluid channels with integrated micropneumatic control systems. Today, there is a second fluidic revival, this time in the domain of soft robotics. Scaling microfluidic designs up to the millimeter-scale range (millifluidics) enables the higher flow rates necessary to drive robotic actuators. These robots exploit the nonlinear behaviors of soft materials to create lifelike motion and safer interactions, often utilizing pressurized air. By building systems that โthinkโ with the same air that powers them, we can drastically reduce the need for bulky electronic-to-pneumatic interfaces. This is the focus of my Soiboi Studio robotics lab. With millifluidic logic, I have steadily scaled the complexity of my designs. What began with a simple oscillator has most recently evolved into a clock featuring a soft, four-digit, seven-segment display. What Is Millifluidics? Building on microfluidics research from the early 2000s and recent developments from the Grover Lab at the University of California, Riverside, Iโve developed millifluidic devices using standard 3D printing and silicone casting. The basic architecture is simple: A flexible membrane is sandwiched between rigid layers embedded with networks of air channels. Just as electronics rely on differing voltage potentials, these fluidic circuits operate on the pressure difference between atmospheric pressure (logical 0) and a near-vacuum at around โ60 kilopascals of relative pressure (logical 1). Using negative pressure means the membrane is pulled into openings. This creates robust seals that allow me to replicate electronic building blocks. A cast silicone membrane forms the face of the clock [top], while behind it sits 3D-printed millifluidic blocks [middle rows]. An Arduino Uno controls driver boards that operate solenoids, which are connected to valves that are attached to a vacuum pump [bottom row].James Provost While fluidic resistors are easily realized by adjusting the channel geometry, the heart of the system is a valve that mimics a metal-oxide-semiconductor field-effect transistor, or MOSFET. This vacuum โtransistorโ features a flow layer with two chambers (the source and drain) divided by a central valve seat and a control layer containing a cavity (the gate). A membrane runs between the control and flow layers and normally prevents airflow between the source and drain chambers. To switch the transistor on, a vacuum is applied to the gate chamber, sucking the membrane into the cavity and lifting it off the seat. This opens a path for airflow, equivalent to closing an electric circuit. By adding a small aperture to the membrane, I created a check valveโthe fluidic equivalent of a diode. By combining transistors and resistive โpull-downโ channels, I can build a full suite of logic gates. The original microfluidic designs that inspired me were fabricated from etched glass and milled acrylic. Adapting them for a standard 3D printer required reengineering the logic elements and mastering two critical fabrication techniques. First, I need airtight prints, yet printed plastic is notoriously porous. By printing at elevated temperatures, slow speeds, and slight overextrusion, I was able to fill microscopic gaps. When youโre using transparent filament, thereโs a handy visual indicator: The more transparent the plastic appears, the lower its porosity. Second, I used glass for my print bed. By printing the upper and lower chambers directly against this bed, I got the interface surface to become mirror smooth. This finish is essential for creating reliable, airtight seals. A 0.3-millimeter silicone membrane is placed between the layers and secured with screws. How Does the Soft Clock Work? The clockface is a cast silicone membrane. Each digit segment is formed by a small underlying cavity. When air is evacuated from this cavity, the membrane is sucked inward to create a concave hollow; when atmospheric pressure is restored, the silicone pops back flush with the surface. The result is a mesmerizing, organic motion. The โbrainโ of the clock is an Arduino Uno, while the fluidics significantly reduce the hardware footprint. A four-digit, seven-segment display with two separator dots would require 29 solenoid valves to control directly. My clock needs just 11 valves. A pneumatic transistor is off when its upper control chamber is at atmospheric pressure [top]. When air is removed from the control chamber, it lifts a membrane, which allows air to flow between lower flow chambers and turns the transistor on [bottom]. James Provost To understand how it works, consider a standard electronic four-digit, seven-segment LED display. This also uses 11 pins to drive its digits. (In clockface displays, an additional pin is required to drive the separator dots.) Every digit is connected to a shared data bus with seven lines, one per segment. The four control lines select individual digits. Only one digit is illuminated at time, and strobing the digits at least 50 times per second creates the illusion that all four are simultaneously illuminated. Such high-speed switching is not possible with air. Instead, I rely on memory. Each segment acts like a capacitor: By evacuating its cavity (logic 1), you โchargeโ the segment; by restoring atmospheric pressure (logic 0), you discharge it. Hence, each digit acts as an independent 7-bit memory. If the system is sufficiently airtight, the segments maintain their state for several seconds. Like the electronic display, the system utilizes a seven-line data bus. Each line connects to a solenoid valve that provides either vacuum or atmospheric pressure. To selectively address the individual digits, I placed a fluidic transistor between each segment and its data line. All the transistorsโ control inputs for a given digit are combined into one โwrite enableโ line connected to its own solenoid valve. Activating this valve allows me to write data into the corresponding digitโs memory. The clock updates one digit per second, meaning a full cycle across the face takes 4 seconds. This cycle also drives the separator dots: A set of fluidic diodes connects the enable lines to the dotsโ cavities. Consequently, as each digit is addressed, the dots pulse automatically. This display is more than a clock; it is a soft robot that happens to tell time. By offloading computation to the same air that powers movement, the clock approaches a new class of machines that are simpler, lighter, and more integrated. Iโm now developing a guide for getting started with vacuum-powered logic and may release a refined version of this clock in the future. Watching the silicone skin morph serves as a fascinating reminder that not all logic needs silicon; sometimes, all you need is flexible silicone and a flow of air. This article appears in the June 2026 print issue as โThe Soft Clock.โ
When Ana Inรชs Inรกcio goes to work at the Netherlands Organization for Applied Scientific Research (TNO) in The Hague, she thinks about signals most people never notice: radio waves moving between satellites, sensors, and future wireless networks. The integrated circuits the research scientist designs lay the foundation for next-generation RF sensor systems critical to advancing radar technologies. Ana Inรชs Inรกcio EMPLOYER Netherlands Organization for Applied Scientific Research, TNO TITLE Scientist IEEE MEMBER GRADE Senior member ALMA MATER University of Aveiro, in Portugal Those invisible RF signals are only part of what earned the IEEE senior member her global recognition. Inรกcio recently received the IEEEโEta Kappa Nu Outstanding Young Professional Award for โleadership in IEEE Young Professionals, fostering innovation and inclusivity, and pioneering advancements in RF sensor systems, bridging technical excellence with impactful community engagement.โ The recognition from IEEEโs honor society reflects a career built along two parallel paths: advancing RF circuit design while helping engineers worldwide build professional communities. โIโve always liked building things,โ Inรกcio says. โSometimes that means circuits; sometimes it means helping people connect and grow together.โ That blend of technical innovation and global leadership gives her work impact far beyond the laboratory. EE lessons at the kitchen table Inรกcio grew up in Vales do Rio, a rural village near Covilhรฃ in central Portugal. The region was known for farming and textiles, she says. Many residents worked in the textile industry, including her grandfather, who repaired machinery such as industrial looms. He became her first engineering teacher without ever holding the formal title. Through correspondence courses delivered by mail, he taught himself electrical systems. At home, he explained electricity to his granddaughter while he repaired the householdโs appliances and wiring. โHe would show me why something broke and how we could fix it,โ she recalls. It sparked her curiosity. Her mother was a tailor who later managed other tailors. Her father left his factory job to attend culinary school and now cooks at an elder-care facility. Curiosity was a trait that ran through the family. By high school, Inรกcio was drawn equally to mathematics and physics and to biology and geology, she says. Encouragement from teachers and an uncle, an engineer, ultimately steered her toward electronics engineering. Conducting research on integrated circuits In 2008 she enrolled in an integrated masterโs degree program in electrical and telecommunications engineering at the Universidade de Aveiro in Portugal, a five-year degree that combined undergraduate and graduate studies. An opportunity to study abroad changed her path. In 2012 she moved to the Netherlands to study at Eindhoven University of Technology (TU/e) through a six-month European exchange program with UAveiro. A professor encouraged her to stay on, so she completed her final year of masters in the Netherlands. She focused on techniques to improve the linearization of RF power amplifiers at Thales. The company, based in Hengelo, Netherlands, designs and produces electronics for defense and security. She earned her masterโs degree from UAveiro in 2013. After graduating, she joined the integrated circuit design group at the University of Twente, in The Netherlands, conducting collaborative research as part of a nationally funded program on linearization techniques for RF front-end systems. The experience introduced her to international research culture and persuaded her to pursue a career abroad, she says. Engineering the future of wireless Inรกcio joined TNO in 2018 as a junior scientist and innovator: her first professional industry job. Today she designs integrated RF front-end systemsโthe circuits that allow devices to transmit and receive wireless signals. The components sit at the core of modern communications, enabling sensor networks, satellite links, and emerging 6G technologies. Her work aims to tackle a central challenge: getting greater performance from smaller chips. โAs communication evolves, we need more bandwidth to transfer more data at higher speeds,โ she says. โThe question is how much complexity you can integrate into one system while keeping it efficient.โ Unlike commercial lab environments, which reuse established designs, research projects often start from scratch. Each transmit-receive chainโthe signal path that converts digital data to radio waves and back againโis tailored to specific requirements. Her work focuses on improving key circuit characteristics including linearity (ensuring that the signals that go out of the antenna are not distorted) as well as noise reduction (so design blocks can be optimized). Advanced design techniques help devices communicate more reliably while consuming less energy, a critical need for large sensor networks such as the Internet of Things, she says. Artificial intelligence is beginning to influence her field, she says: โAI is already helping us work faster. The real challenge is learning how to use it to make better designs, not just quicker ones.โ A parallel vocation with IEEE While her technical career flourished in research labs, an additional journey unfolded through IEEE. Inรกcio joined the organization in 2009 as a student after discovering UAveiroโs student branch. What began as curiosity evolved into a long-term leadership path. She advanced through roles within Region 8โcovering Europe, Africa, and the Middle Eastโone of the organizationโs most culturally diverse regions. She was the student branchโs vice chair, and the regionโs student representative for more than 22,000 IEEE members. She also served as the Young Professionals Affinity Group chair for the IEEE Benelux Section, which encompasses Belgium, the Netherlands, and Luxembourg. Currently, she serves as the immediate past chair of the Region 8 Young Professionals Committee, and vice chair and IEEE Member and Geographical Activities representative on the IEEE Young Professionals Committee. In those roles, she represents close to 135,000 IEEE members. In addition, she is an active member of the IEEE Microwave Theory and Technology Society, currently serving as its Young Professionals liaison. Her involvement with IEEE has boosted her professional confidence, she says. โIEEE didnโt directly give me promotions at my day job, but it gave me leadership skills, networking opportunities, and the ability to work with people from everywhere,โ she says. Those experiences now shape her collaborations at TNO, where international teamwork is essential. The IEEE-HKN Outstanding Young Professional Award recognizes that combination of technical excellence and community impact, she says. Looking back, Inรกcio sees a clear thread connecting her childhood curiosity, her international career, and her IEEE leadership: Engineering, she says, is ultimately about people as much as it is about technology.
When it comes to AI models, size matters. Even though some artificial-intelligence experts warn that scaling up large language models (LLMs) is hitting diminishing performance returns, companies are still coming out with ever larger AI tools. Metaโs latest Llama release had a staggering 2 trillion parameters that define the model. As models grow in size, their capabilities increase. But so do the energy demands and the time it takes to run the models, which increases their carbon footprint. To mitigate these issues, people have turned to smaller, less capable models and using lower-precision numbers whenever possible for the model parameters. But there is another path that may retain a staggeringly large modelโs high performance while reducing the time it takes to run an energy footprint. This approach involves befriending the zeros inside large AI models. For many models, most of the parametersโthe weights and activationsโare actually zero, or so close to zero that they could be treated as such without losing accuracy. This quality is known as sparsity. Sparsity offers a significant opportunity for computational savings: Instead of wasting time and energy adding or multiplying zeros, these calculations could simply be skipped; rather than storing lots of zeros in memory, one need only store the nonzero parameters. Unfortunately, todayโs popular hardware, like multicore CPUs and GPUs, do not naturally take full advantage of sparsity. To fully leverage sparsity, researchers and engineers need to rethink and re-architect each piece of the design stack, including the hardware, low-level firmware, and application software. In our research group at Stanford University, we have developed the first (to our knowledge) piece of hardware thatโs capable of calculating all kinds of sparse and traditional workloads efficiently. The energy savings varied widely over the workloads, but on average our chip consumed one-seventieth the energy of a CPU, and performed the computation on average eight times as fast. To do this, we had to engineer the hardware, low-level firmware, and software from the ground up to take advantage of sparsity. We hope this is just the beginning of hardware and model development that will allow for more energy-efficient AI. What is sparsity? Neural networks, and the data that feeds into them, are represented as arrays of numbers. These arrays can be one-dimensional (vectors), two-dimensional (matrices), or more (tensors). A sparse vector, matrix, or tensor has mostly zero elements. The level of sparsity varies, but when zeroes make up more than 50 percent of any type of array, it can stand to benefit from sparsity-specific computational methods. In contrast, an object that is not sparseโthat is, it has few zeros compared with the total number of elementsโis called dense. Sparsity can be naturally present, or it can be induced. For example, a social-network graph will be naturally sparse. Imagine a graph where each node (point) represents a person, and each edge (a line segment connecting the points) represents a friendship. Since most people are not friends with one another, a matrix representing all possible edges will be mostly zeros. Other popular applications of AI, such as other forms of graph learning and recommendation models, contain naturally occurring sparsity as well. Beyond naturally occurring sparsity, sparsity can also be induced within an AI model in several ways. Two years ago, a team at Cerebras showed that one can set up to 70 to 80 percent of parameters in an LLM to zero without losing any accuracy. Cerebras demonstrated these results specifically on Metaโs open-source Llama 7B model, but the ideas extend to other LLM models like ChatGPT and Claude. The case for sparsity Sparse computationโs efficiency stems from two fundamental properties: the ability to compress away zeros and the convenient mathematical properties of zeros. Both the algorithms used in sparse computation and the hardware dedicated to them leverage these two basic ideas. First, sparse data can be compressed, making it more memory efficient to store โsparselyโโthat is, in something called a sparse data type. Compression also makes it more energy efficient to move data when dealing with large amounts of it. This is best understood by an example. Take a four-by-four matrix with three nonzero elements. Traditionally, this matrix would be stored in memory as is, taking up 16 spaces. This matrix can also be compressed into a sparse data type, getting rid of the zeros and saving only the nonzero elements. In our example, this results in 13 memory spaces as opposed to 16 for the dense, uncompressed version. These savings in memory increase with increased sparsity and matrix size. In addition to the actual data values, compressed data also requires metadata. The row and column locations of the nonzero elements also must be stored. This is usually thought of as a โfibertreeโ: The row labels containing nonzero elements are listed and linked to the column labels of the nonzero elements, which are then linked to the values stored in those elements. In memory, things get a bit more complicated still: The row and column labels for each nonzero value must be stored as well as the โsegmentsโ that indicate how many such labels to expect, so the metadata and data can be clearly delineated from one another. In a dense, noncompressed matrix data type, values can be accessed either one at a time or in parallel, and their locations can be calculated directly with a simple equation. However, accessing values in sparse, compressed data requires looking up the coordinates of the row index and using that information to โindirectlyโ look up the coordinates of the column index before finally reaching the value. Depending on the actual locations of the sparse data values, these indirect lookups can be extremely random, making the computation data-dependent and requiring the allocation of memory lookups on the fly. Second, two mathematical properties of zero let software and hardware skip a lot of computation. Multiplying any number by zero will result in a zero, so thereโs no need to actually do the multiplication. Adding zero to any number will always return that number, so thereโs no need to do the addition either. In matrix-vector multiplication, one of the most common operations in AI workloads, all computations except those involving two nonzero elements can simply be skipped. Take, for example, the four-by-four matrix from the previous example and a vector of four numbers. In dense computation, each element of the vector must be multiplied by the corresponding element in each row and then added together to compute the final vector. In this case, that would take 16 multiplication operations and 16 additions (or four accumulations). In sparse computation, only the nonzero elements of the vector need be considered. For each nonzero vector element, indirect lookup can be used to find any corresponding nonzero matrix element, and only those need to be multiplied and added. In the example shown here, only two multiplication steps will be performed, instead of 16. The trouble with GPUs and CPUs Unfortunately, modern hardware is not well suited to accelerating sparse computation. For example, say we want to perform a matrix-vector multiplication. In the simplest case, in a single CPU core, each element in the vector would be multiplied sequentially and then written to memory. This is slow, because we can do only one multiplication at a time. So instead people use CPUs with vector support or GPUs. With this hardware, all elements would be multiplied in parallel, greatly speeding up the application. Now, imagine that both the matrix and vector contain extremely sparse data. The vectorized CPU and GPU would spend most of their efforts multiplying by zero, performing completely ineffectual computations. Newer generations of GPUs are capable of taking some advantage of sparsity in their hardware, but only a particular kind, called structured sparsity. Structured sparsity assumes that two out of every four adjacent parameters are zero. However, some models benefit more from unstructured sparsityโthe ability for any parameter (weight or activation) to be zero and compressed away, regardless of where it is and what it is adjacent to. GPUs can run unstructured sparse computation in software, for example, through the use of the cuSparse GPU library. However, the support for sparse computations is often limited, and the GPU hardware gets underutilized, wasting energy-intensive computations on overhead. Petra Pรฉterffy When doing sparse computations in software, modern CPUs may be a better alternative to GPU computation, because they are designed to be more flexible. Yet, sparse computations on the CPU are often bottlenecked by the indirect lookups used to find nonzero data. CPUs are designed to โprefetchโ data based on what they expect theyโll need from memory, but for randomly sparse data, that process often fails to pull in the right stuff from memory. When that happens, the CPU must waste cycles calling for the right data. Apple was the first to speed up these indirect lookups by supporting a method called an array-of-pointers access pattern in the prefetcher of their A14 and M1 chips. Although innovations in prefetching make Apple CPUs more competitive for sparse computation, CPU architectures still have fundamental overheads that a dedicated sparse computing architecture would not, because they need to handle general-purpose computation. Other companies have been developing hardware that accelerates sparse machine learning as well. These include Cerebrasโs Wafer Scale Engine and Metaโs Training and Inference Accelerator (MTIA). The Wafer Scale Engine, and its corresponding sparse programming framework, have shown incredibly sparse results of up to 70 percent sparsity on LLMs. However, the companyโs hardware and software solutions support only weight sparsity, not activation sparsity, which is important for many applications. The second version of the MTIA claims a sevenfold sparse compute performance boost over the MTIA v1. However, the only publicly available information regarding sparsity support in the MTIA v2 is for matrix multiplication, not for vectors or tensors. Although matrix multiplications take up the majority of computation time in most modern ML models, itโs important to have sparsity support for other parts of the process. To avoid switching back and forth between sparse and dense data types, all of the operations should be sparse. Onyx Instead of these halfway solutions, our team at Stanford has developed a hardware accelerator, Onyx, that can take advantage of sparsity from the ground up, whether itโs structured or unstructured. Onyx is the first programmable accelerator to support both sparse and dense computation; itโs capable of accelerating key operations in both domains. To understand Onyx, it is useful to know what a coarse-grained reconfigurable array (CGRA) is and how it compares with more familiar hardware, like CPUs and field-programmable gate arrays (FPGAs). CPUs, CGRAs, and FPGAs represent a trade-off between efficiency and flexibility. Each individual logic unit of a CPU is designed for a specific function that it performs efficiently. On the other hand, since each individual bit of an FPGA is configurable, these arrays are extremely flexible, but very inefficient. The goal of CGRAs is to achieve the flexibility of FPGAs with the efficiency of CPUs. CGRAs are composed of efficient and configurable units, typically memory and compute, that are specialized for a particular application domain. This is the key benefit of this type of array: Programmers can reconfigure the internals of a CGRA at a high level, making it more efficient than an FPGA but more flexible than a CPU. The Onyx chip, built on a coarse-grained reconfigurable array (CGRA), is the first (to our knowledge) to support both sparse and dense computations. Olivia Hsu Onyx is composed of flexible, programmable processing element (PE) tiles and memory (MEM) tiles. The memory tiles store compressed matrices and other data formats. The processing element tiles operate on compressed matrices, eliminating all unnecessary and ineffectual computation. The Onyx compiler handles conversion from software instructions to CGRA configuration. First, the input expressionโfor instance, a sparse vector multiplicationโis translated into a graph of abstract memory and compute nodes. In this example, there are memories for the input vectors and output vectors, a compute node for finding the intersection between nonzero elements, and a compute node for the multiplication. The compiler figures out how to map the abstract memory and compute nodes onto MEMs and PEs on the CGRA, and then how to route them together so that they can transfer data between them. Finally, the compiler produces the instruction set needed to configure the CGRA for the desired purpose. Since Onyx is programmable, engineers can map many different operations, such as vector-vector element multiplication, or the key tasks in AI, like matrix-vector or matrix-matrix multiplication, onto the accelerator. We evaluated the efficiency gains of our hardware by looking at the product of energy used and the time it took to compute, called the energy-delay product (EDP). This metric captures the trade-off of speed and energy. Minimizing just energy would lead to very slow devices, and minimizing speed would lead to high-area, high-power devices. Onyx achieves up to 565 times as much energy-delay product over CPUs (we used a 12-core Intel Xeon CPU) that utilize dedicated sparse libraries. Onyx can also be configured to accelerate regular, dense applications, similar to the way a GPU or TPU would. If the computation is sparse, Onyx is configured to use sparse primitives, and if the computation is dense, Onyx is reconfigured to take advantage of parallelism, similar to how GPUs function. This architecture is a step toward a single system that can accelerate both sparse and dense computations on the same silicon. Just as important, Onyx enables new algorithmic thinking. Sparse acceleration hardware will not only make AI more performance- and energy efficient but also enable researchers and engineers to explore new algorithms that have the potential to dramatically improve AI. The future with sparsity Our team is already working on next-generation chips built off of Onyx. Beyond matrix multiplication operations, machine learning models perform other types of math, like nonlinear layers, normalization, the softmax function, and more. We are adding support for the full range of computations on our next-gen accelerator and within the compiler. Since sparse machine learning models may have both sparse and dense layers, we are also working on integrating the dense and sparse accelerator architecture more efficiently on the chip, allowing for fast transformation between the different data types. Weโre also looking at ways to manage memory constraints by breaking up the sparse data more effectively so we can run computations on several sparse accelerator chips. We are also working on systems that can predict the performance of accelerators such as ours, which will help in designing better hardware for sparse AI. Longer term, weโre interested in seeing whether high degrees of sparsity throughout AI computation will catch on with more model types, and whether sparse accelerators become adopted at a larger scale. Building the hardware to unstructured sparsity and optimally take advantage of zeros is just the beginning. With this hardware in hand, AI researchers and engineers will have the opportunity to explore new models and algorithms that leverage sparsity in novel and creative ways. We see this as a crucial research area for managing the ever-increasing runtime, costs, and environmental impact of AI.