Hello, DRAM design engineer here. Really informative video, and explained in such an easy to understand way! Love it! Just a quick comment, DRAM memory is generally pronounced "DEE-ram" and Via in TSVs is pronounced "VEE-ah". It's confusing and not intuitive, but hopefully this helps for your future videos!
Your ability to break down information into bite-sized pieces that the average person can understand is remarkable. I worked in the electronic test equipment market. Great job.
Informative and concise; thank you. I notice you pronounce SRAM as "ess-ram" (which has always made sense to me because of the acronym's origins as a *"Static"* extension or iteration of the much older acronym/technology of RAM for Random Access Memory), but you also pronounce DRAM as "dram." (I say "dee-ram" because, again, it's a variation on ol' trusty rusty RAM.) Unfortunately, "dram" is already a word in use outside computing, in at least two other science fields, as a noun for: 1) A unit of weight in the US Customary System equal to 1/16 of an ounce or 27.34 grains (1.77 grams). 2) A unit of apothecary weight equal to 1/8 of an ounce or 60 grains (3.89 grams). 3) A small draft. Not like bad or wrong, but maybe worth noting for future usage. Again, excellent work; enlightening. Keep it up.
@slypear Hm. I suppose I'd say "ee-DEE-ram." The core technology, Random Access Memory (RAM), persists and evolves. Similarly, there are variants of radar (RAdio Detection And Ranging) like UHF and VHF radar, and lasers (Light Amplification by Stimulated Emission of Radiation), where the root acronym remains (though the Gamma Ray laser has only recently been taken beyond the theoretical). In fairness, ROM (Read-Only Memory) became "EE-prom" in its Electrically Erasable Programmable variation. I'm not sure that technology is still in use with the widespread and cheap availability of Flash memory, so this point may be so far out of use as to be moot. ¯\_(ツ)_/¯
I can explain the resistive memory outlined at 12:05. In electronics, there are parallel and series circuits. Resistances in series add together, meaning that if you connect the resistances from two memory banks, the resulting resistance can be written to a third memory bank. No logic required. I mean of course there’s logic required, but the memory itself is the input and the output. I have no clue how the memory chips work, but the idea is that you can use the properties of resistors to do addition for you.
I think the resistive memory is implemented with a memristor. When current flows one way, the resistance increases. When current flows the other way, resistance decreases.
Wouldn't that require a constant current source and an op amp to add the resulting voltage? Then an ADC to convert back to binary. I'm not sure if CMOS op amps are possible.
@Syrinx11 CMOS op amps (and mixed types) have been produced since the early to mid-1970s ... See types like LF157 or CA3130, etc. The concept you are thinking about in your comment is not suitable for memory chips or processing units: too big, not highly integrable, and a horrible vision of extra parts and different technologies that all have to be populated on a die. Also: on-die resistors are FAT ("well resistor")! The precision ones are even fatter, more expensive and more time-consuming to produce and calibrate (we are talking about orders of magnitude in contrast to the current pure memory silicon production processes). [1] "Most modern CMOS processes can guarantee the accuracy of on-chip resistors and capacitors to only within ±25%." [2] Temperature coefficient ranging from 1500 to 2500 ppm. [3] Etching and lateral diffusion errors in well resistors, & etching errors in polysilicon resistors. [4] The size ... just for your imagination: in the two-digit µm ballpark, even if we are talking about very small resistors. See also: the datasheets; "IC Op-Amps Through the Ages", 2000, Thomas H. Lee; "A CMOS Op Amp Story - Part 1 - Analog Footsteps", 2017, by Todd Nelson; "Adaptive Techniques for Mixed Signal System on Chip", pp 67-94, Springer [1-4]. I hope this helps. So, back to the drawing board? Thanks for your thoughts, Syrinx :)
I have memories of a project of "modified DRAM chips with internal logic units" around Y2K, I saw a paper probably from MIT but I don't remember whether it was implemented. It looked promising for certain kinds of massively parallel operations such as cellular automata simulations 🙂
This reminds me of the Connection Machine (CM), made by Thinking Machines Corporation back in the 1980s. The CM had a large number of single-bit processors with a few kilobits of memory each. They were interconnected in a high-dimensional hypercube. Lower dimensional connections were on-chip and higher dimensions went off-chip. It was programmed in a language called *Lisp. I remember that it seemed way ahead of its time.
We are now engineering systems which we cannot ever understand, millions of weighted matrices developing novel solutions. We are living during the dawn of something monumental.
Thinking Machines was founded by two guys and one of them was Richard Feynman's son. Feynman helped them to solve some problem using integer differential equations, and since neither of the two could understand it they were reluctant to use the solution ... in the end they used it and it worked. Thinking Machines created the massively parallel computer technology that ended up killing Cray Supercomputers. I believe, btw, that CUDA cores are something like that: tiny CPUs with a bit of memory for each one. One thing I don't understand is why they don't use static memory to solve the problem ... does it consume too much power?
Circuit-level CIM has one major limitation that I wish you had discussed: its susceptibility to PVT (Process, Voltage, Temperature). When storing weights in SRAM cells and applying 1 or 0 to the Word Line (WL) to perform the MAC operation (the WL multiplies with the weight, and the current on the Bit Line (BL) is then the sum of all of the cells in the column), we are performing an analog operation. The BL current will depend on the process variation, supply voltage, and ambient temperature. That is, at two different temperatures or supply voltages (battery voltage changes), we will get different results, even with the same die. This makes it unsuitable for "Edge AI" applications. Between two chips or two different columns, we will also get different results, because of the process variation. The accuracy is significantly limited by this. With an analog WL driven by a DAC, the problem is exacerbated even further. Granted, I do not know what sort of accuracy AI models really require, but I imagine it is much greater than what can be offered by CIM in current CMOS processes. Of course, larger processes decrease variation, but the density suffers. The nice thing about conventional computing is that our accuracy does not depend on PV, only our speed. I think integrating DRAM dies with conventional CMOS dies is likely the way forward.
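The PVT concern above can be sketched with a small simulation. This is only a toy model under loud assumptions: per-cell process variation is modelled as a Gaussian gain around 1.0, and a supply change is modelled as a single linear scale factor on the bit-line current. Real cells are not this simple, and `make_die`, `analog_mac`, and the 5% sigma are hypothetical choices for illustration.

```python
import random

def ideal_mac(inputs, weights):
    """Exact digital multiply-accumulate for one column."""
    return sum(x * w for x, w in zip(inputs, weights))

def make_die(n_cells, sigma, rng):
    """Hypothetical per-cell current gains; 1.0 means a perfectly nominal cell."""
    return [rng.gauss(1.0, sigma) for _ in range(n_cells)]

def analog_mac(inputs, weights, cell_gains, supply_scale=1.0):
    """Sum the per-cell currents on the bit line, then scale by the supply."""
    current = sum(x * w * g for x, w, g in zip(inputs, weights, cell_gains))
    return current * supply_scale

rng = random.Random(0)
n = 256
inputs = [rng.randint(0, 1) for _ in range(n)]    # word-line bits
weights = [rng.randint(0, 1) for _ in range(n)]   # stored weight bits
die_a = make_die(n, 0.05, rng)                    # assumed 5% cell-to-cell spread
die_b = make_die(n, 0.05, rng)

print("exact:", ideal_mac(inputs, weights))
print("die A, nominal supply:", round(analog_mac(inputs, weights, die_a), 2))
print("die A, supply -3%:", round(analog_mac(inputs, weights, die_a, 0.97), 2))
print("die B, nominal supply:", round(analog_mac(inputs, weights, die_b), 2))
```

Even in this simplified model, the same inputs and weights give a different "sum" per die and per supply level, whereas `ideal_mac` always returns the same integer.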
1. Battery voltage changes shouldn't affect CPU/memory voltages except when battery voltage forces DVFS (insufficient power -> enter lower power state). If battery voltage changes do seriously affect CPU/memory voltage at any other time, it is a bad design. 2. Datacenter coolers can be designed to keep chip temperature relatively constant (for example, liquid cooler + piping similar to an ICE vehicle cooling system).
@volodumurkalunyak4651 1. I agree that the supply voltage to the SRAM cells would not change much across battery SoC, but it will change in small amounts. Voltage regulator outputs are dependent on the input in the real world, and using ones that are more isolated increases cost. So yes, voltage is the least likely to vary drastically, but the BL current is also very sensitive to it. 2. Yes, data centers will provide stable voltage and temperature. But the accuracy is still much worse than conventional computing due to PV, and so it raises the question -- are clients that use data center computing willing to accept inaccuracies when compared to conventional data centers? It's a big tradeoff that I'm not equipped to answer. However, I think the Edge AI application is kinda buzzword BS.
Yes, which is why I'd recommend not using analog or multi-level digital logic (logic which uses more than 2 voltage states) at all, even for neural networks or other kinds of systolic array computing. & it's worse than that: multi-level digital logic is not just difficult to store to & load from SRAM, it'd also severely complicate arithmetic in case you'd thought of not converting it back into binary but directly building multi-level ALUs. There are two ways one might go about it: 1) using quasi-analog arithmetic built out of operational amplifiers, which a) requires a completely different semiconductor process because of the mixed analog-digital processing & b) has all the inaccuracies & reproducibility problems of analog circuits, or 2) actually building arithmetic out of digital multi-level logic devices, which for n-level digital logic requires a superlinear number of different transistor varieties and a superlinear total number of transistors for the m possible logic circuits (where m itself grows superlinearly with n), while also increasing susceptibility to process variation by at least a superlinear amount in n and superlinearly increasing wiring & interconnect complexity.
Example: ternary digital logic, which is digital logic with 3 different voltage levels, when implemented using CMOS-like ternary logic doesn't just require 3 as opposed to 2 different kinds of transistors; you'd actually need to reliably build 4 different kinds of transistors (increased susceptibility to process variation) for any logic gate. A ternary inverter in such a process isn't built out of 2 transistors but out of at least 4, with increased wiring complexity within the gate and 3 instead of 2 power lines. Ternary also doesn't just have 2 different unary gates (buffer & inverter, out of which you'd functionally only need the inverter to build all unary gates) but at least 6 different unary gates (out of which you'd functionally need at least 2 different ones to build all unary gates). & this gets even worse if you want more than just unary gates: multi-input combinatorial ternary gates require even more transistors & wiring than their binary counterparts, & much more so than the more compact representation of their ternary state would give back. & these disadvantages all get worse as you go from ternary to quaternary to quinary ... digital logic, so that it's practically impossible to have any efficiency gains by replacing binary digital logic circuits with any other kind. OK, you could use a non-CMOS-like n-ary digital logic, but these all have the problem of having to statically draw power, which drastically decreases power efficiency while only partially reducing the aforementioned problems.
Analogue computing is a false start for CIM. Yes, digital CMOS multiply circuits require some space, but they're on the same process as fast SRAM; they just need to be tightly coupled together and ignore all other logic functions (which is what Google and Nvidia tensor cores implement: theoretically, only a MAC circuit and a register page). There's some complexity in the control circuits: caches for instructions, input, output and synchronization logic. You need that bit of complexity with the analogue circuits anyway, and you don't have to build analogue comparators and DACs, which don't scale at all on small manufacturing nodes.
As an aside, the book "Content Addressable Parallel Processors" by Caxton C. Foster (1976) discussed the ability to have a mass memory do computation. It is bit serial, but is parallel to all memory locations, meaning that you can do things like multiply all memory cells by the same number and similar operations. It's a good read.
Hey, I learned about that book in college! Software engineering/systems manager here, though I do freelance mostly. I wonder if it would be possible to have a standard GPU combined with a mass memory system. Basically you build both systems separately but combine them on the die so you could theoretically go back and forth between both systems. It would be quite bulky; however, it would mitigate only having to utilize one design. Using aerogel as a frame could also significantly reduce heat output and therefore increase overall efficiency. Just a thought.
@Kyrator88 *"why would aerogel decrease heat output? Being a superb insulator wouldn't it reduce heat dissipation?"* And thus reduce heat output... for a period of time... :P
One thing that always fascinated me was the use of content addressable memory. As I recall, we were using it for decoding micro-code back in the bit-slice mini-computer days. It seems that that approach of combining logic and memory would be an interesting approach to today's AI problem.
This sort of "content addressable memory" was called a "programmable logic array". These were made out of memory cells (ROM or EEPROM), but wired in a way which allowed them to perform logic operations, called "sum of products". So the memory cells stored the configuration of the device and performed the computations at the same time. The problem is, this only worked with nonvolatile memory cells, which are slow to program and can only survive a limited number of write cycles. Also, this technique cannot be easily scaled. When memory capacity gets bigger, the transistors become smaller and will have many errors and defects. For memory this is not a problem, because one can just make the memory slightly larger to give some space to use error-correcting codes. (This is the reason why flash drives and SSDs are cheap: they are actually slightly defective, but the defects are hidden!) So this technique cannot be used for AI.
I was hoping you were going to get into some of the software solutions, instead of exclusively the hardware solutions, that have allowed today's neural networks to grow 1000x in size while VRAM has only increased 3x in the same timeframe. Stuff like how there have been great advancements in the ability of multiple GPUs to communicate with each other efficiently to perform backpropagation, which has allowed neural networks to be trained on many GPUs at a time. At first, a neural network could only be trained all on one GPU, but then the NN got too big to fit onto a single GPU, so we figured out how to have a single layer on each GPU. Then the NN got too big for that, so we had to figure out how to have parts of each layer on each GPU. Each step along the way required innovation on the side of machine learning engineers to build exponentially larger neural networks while GPU VRAM just isn't keeping up.
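The "parts of each layer on each GPU" step described above can be sketched in plain Python. This is a minimal illustration of splitting one layer's weight matrix by rows, assuming a layer is just a matrix-vector product; the "devices" here are ordinary lists, and `split_rows` and `matvec` are hypothetical helper names, not any framework's API.

```python
def matvec(rows, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def split_rows(rows, n_devices):
    """Cut the weight matrix into contiguous row blocks, one per device."""
    chunk = (len(rows) + n_devices - 1) // n_devices
    return [rows[i:i + chunk] for i in range(0, len(rows), chunk)]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]   # one layer: 4 outputs, 2 inputs
x = [1, -1]                            # activations from the previous layer

shards = split_rows(W, n_devices=2)                # each shard holds half the rows
partials = [matvec(shard, x) for shard in shards]  # would run on separate GPUs
y = [v for part in partials for v in part]         # gather the output slices

print(y == matvec(W, x))  # the sharded result matches the unsharded one
```

Each shard only needs memory for its own rows, at the cost of sending `x` to every device and gathering the slices afterwards, which is the communication the comment refers to.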
I'm glad you mentioned this, just like 2 weeks ago they were able to write data on DNA for the first time which enables you to have massively higher levels of data compression, if you were to combine this technology with the gpus, this would solve the problem for raw processing capacity.
What do we need such powerful neural networks for, actually? Like, there's stuff we *could* do, but should we? Like, maybe we need to give society time to adapt to new technologies, instead of rushing it all in one go?
@TheVoiceofTheProphetElizer I'm a software dev, and my experience is that the more efficient and compact you write code, the less people understand what it does or is meant to do. Sad, but true. I've ended up re-writing code to be less compact and less efficient simply because I got sick of having to explain what it does, and based on what logic, every time a new developer came across that specific piece of code. People on average are simply not good at computing. This was in C#; when I wrote in assembly containing raw machine code it was worse. I got zero points in school tests because the teacher didn't understand why my code gave the correct answer. I've been a software dev for many years now, and write most code to be explicitly human readable just to save time in explaining, except when it's code I don't want others to mess with.
It really sucks that the '08 crash killed a lot of the proposed solutions the large vendors were looking at to address these issues. Look into HP's solution (optical computing/rack-scale memory) and Sun's solutions that they were putting R&D money into before the '08 crash caused all these companies to essentially abandon their labs.
It's honestly stupid cus after 2008, all that VC money was wasted on stupid shit like Javascript libraries and whatnot instead of anything actually useful.
I'm currently working on a project using NTC and PTC thermistors to store analog values, with the idea that they will respond differently to frequency of access and will also affect neighbouring cells, much like a neural net.
ASIANOMETRY IS THE BEST EDUCATIONAL CHANNEL ON YouTube, NO CONTEST!!! Your channel truly stands out like a diamond in the rough. There is plenty of stuff I like and watch on YT, but your channel is on an entirely different level. You dive deep into complicated subjects over and over, and always do it in a way that is easy to understand. Other channels go deep too, but I frequently find large chunks of the video go over my head because I don't have a PhD. Every single time I watch your vids, I not only learn new things, but by the end of the video I UNDERSTAND the subject you bring up and feel smarter. Can't sing your praises enough! Take care!!!
Some researchers have tried to understand what a neural network does to an image when trained for recognition and classification without a pre-set algorithm. The results are startling; the network gets fixated on the tiny differences and their patterns at the boundaries of the image, and other odd parameters that a programmer would never consider viable. The fact is that the system works, but relies on some preconditions that can fail all of a sudden. There is a long way to go in designing a reliable neural network, but there is also something to learn about how numerous the intrinsic and unknown pre-conditions in human perception are...
This oddly sounds like my autistic mind. I'll meet a person but couldn't recognize their face, and yet I will *_forever_* remember that they have a freckle on their left ear lobe. Now I want to note _everybody's_ left ear.... My brain isn't advanced enough to function in society yet it does more on 25W than I can with this computer.
That's a naive take. Neural nets don't see images beyond the 1st layer; you could consider them masks, though all you can actually see is what looks like random noise most of the time. Beyond that 1st layer everything is subjective: neural nets tune themselves in multiple dimensions through the linear combination of non-linear functions (neurons), which means that in every step during the learning process the network evolves in different "timings" pointing to the optimal combination of values that connect the image to its class. There are cases where a network is biased to some particular feature of the scene, which can be anything from an average value of the green color to a watermark that's present in every forest picture. There are also many ways to treat biased models, starting from curating the dataset to making a second net that tries to fool your classifier. These biases aren't unique to neural nets; humans do it all the time. The "human error threshold" is around 70%, which means that the average human fails a task 30% of the time. This is a rough standard as it depends on the task; medical diagnosis, for example, is much worse. This is useful when making a neural net, as you already have a minimum viable product if you can reach these numbers on a representative test set. The state of the art of neural nets has been at an all-time high: by adding complexity you can learn more features of a dataset and train more complex tasks beyond tracking your face in an Instagram filter. An example of an academic benchmark is the German Traffic Sign Detection Benchmark, where the top model reached 97% last time I checked. There are some images that humans can't classify yet a neural net can, because it doesn't have to rely on simple geometric analysis like the human brain does when detecting patterns, and it can do it in parallel.
TLDR: Neural nets can see as far and beyond what a human can except when they can't, aren't powerful enough or just end up learning either nothing or too specific dumb details, yet the possibility of making general AI is getting scarily closer
@eumim8020 Thanks for your contribution, though have you heard of paragraphs? They're kind of important to reading a lot of text if you have bad eyesight. Really interesting and scary stuff. Where'd you learn this?
@elecbaguette College lol I'm writing this as I go to bed lmao. I'm majoring in the AI field and I fell in love with Neural Networks. Really weird dealing with minimizing the number of dimensions of reality, the truthiness value in statistics, information theory and how to make the calculation of 2*3 run faster, all at the same time
They haven’t. Along with discussing FP as ‘high precision,’ we are not looking at something super well researched. FP is like the computer version of scientific notation; it’s fast but gives up some accuracy in favor of scale.
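To make the "scientific notation" comparison above concrete, here is a small Python sketch using only the standard library. It relies on Python floats being IEEE 754 double precision, which is true on common platforms; the variable names are arbitrary.

```python
import math

# A float is stored as mantissa * 2**exponent, much like scientific notation.
mantissa, exponent = math.frexp(6.5)
print(mantissa, exponent)                # 0.8125 3, since 0.8125 * 2**3 == 6.5

# Accuracy is limited: 0.1 and 0.2 have no exact binary representation.
sum_is_exact = (0.1 + 0.2 == 0.3)
print(sum_is_exact)                      # False

# Scale is large, but resolution shrinks as magnitude grows:
big = 2.0 ** 53
absorbed = (big + 1.0 == big)
print(absorbed)                          # True: the +1 is lost to rounding
```

The last check shows the trade the comment describes: the format reaches huge magnitudes, but at that scale it can no longer distinguish neighbouring integers.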
Language models are not the only task requiring huge memory. Another example is genome scaffold assembly (which takes millions of DNA sequence snippets to produce a complete genome of an organism).
There's no such thing as "enough memory" when it comes to science or technology. Breakthroughs in memory just make the industry even hungrier for more.
4:45 "Everything new" has always been available only to governments and rich people. Consider your cell phone. That technology, when new, cost hundreds of thousands of dollars. Now you can ask a friend if they have a phone they are not using and get it for free. 1 TB drives used to be $25,000 and now I can buy an SSD, which didn't even exist technologically at that time, for $30. That was in 30 years' time.
I'm a mechanical engineer, and not particularly savvy in the EE domain, although I love me some first principles. The first thing that comes to mind when I think about compute-in-memory is the problem of locality. Our universe is pretty clearly local, and if you store all your stuff, then want to compute using that stuff, you have to bring relevant stuff near each other. Unless there's a way to bypass that locality requirement, it's hard to see how any clever computational or logical algorithm could avoid the problem of read/transmit/write times. There are obviously a lot of ways to reduce that cycle time, but I just don't see how compute happens without the process of read/transmit/write to memory. Maybe some fucking outrageous entanglement situation, but that would include like 10^10 (or wildly more) particles, and that's just not on the table right now. That would be a good goal, to have a computer that can simultaneously solve every problem contained within its memory.
Late reply, but you might be interested in a new intersection of physics principles and computing. Memory-induced long-range order (something physicists talk about) can be used to somewhat bypass this locality issue and perform hard computations in memory. Called "MemComputing", it utilizes a dynamical systems view of computing and it's all quite fresh! There is a good book by the physicist Max Di Ventra on the subject.
Just as a note, MRAM is actively commercialized in chip and subsystems today as an integrated NVM. ARM's Cryptoisland and TiempoSecure's Tesic are two security chips that integrate MRAMs, as examples.
@royalwins2030 Can't. The minimum required to rebuild a world would be a lathe. It can craft virtually any requisite tool and most components of other machines. Humanity would never stoop to dark ages. There is a small chance if we stop using paper books and go full digital, but that's a future problem.
There is a large hurdle of hardware ecosystem that really means the big companies are always going to control the AI landscape. As anybody who has had to suffer through TensorRT or any AI edge deployment framework knows, the closer you get to the hardware the more ecosystem support you need.
It occurs to me that perhaps much lower precision like 4 bit may be sufficient for AI purposes making the compute units much simpler and cheaper to build. Shedding all the exponential complexity that comes with higher precision would also greatly reduce computing time and power requirements. Something to think about.
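As a rough illustration of what 4-bit precision means for a set of weights, here is a minimal sketch of symmetric 4-bit quantization in Python. The scheme (one shared scale per list, signed codes in the range -8..7) is an assumption chosen for simplicity, and `quantize_int4` / `dequantize` are hypothetical names, not a specific library's method.

```python
def quantize_int4(values):
    """Map floats to signed 4-bit codes (-8..7) with one shared scale factor."""
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / 7
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)

print(codes)
print([round(r, 3) for r in restored])
```

Each weight now fits in 4 bits plus one shared scale, and the worst-case rounding error per weight is about half a quantization step; whether that is "sufficient for AI purposes" depends on the model, which this sketch does not try to answer.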
Awesome video! Hearing dram made my head spin too. Then I realized we say R-A-M as ram, so add a "D" and it's dram. Ok. But then you said s-ram and I flipped my desk.
A DRAM (dynamic RAM) cell is made of a FET transistor and a small capacitor connected to one of its source/drain terminals (the gate is driven by the word line). Since the capacitance is small for reasons of speed, the capacitor loses its charge within milliseconds, and the DRAM has to be constantly "refreshed" (usually with a read cycle) so it can keep the bit properly memorised. An SRAM (static RAM) cell works on a completely different principle. It doesn't use a capacitor to memorise the value of the bit (0 or 1) and doesn't need a refresh cycle. An SRAM memory cell is basically an RS flip-flop (Set - Reset) which keeps the set level until it is reset. Therefore, instead of a single transistor, each SRAM cell is made of four to six transistors. So the SRAM cell takes more chip space, but it can run at the same clock speed as the logic circuitry; moreover, SRAM is much less susceptible to data errors and interference. The Mars probes use exclusively SRAM memory in their onboard computers. SRAM represents the ideal operating computer memory, but it takes six transistors instead of one for each memory cell...
What if, instead of always needing more bandwidth between computation units and memory, we reduced the model complexity to fewer parameters while maintaining the same performance? This topic has been researched a lot, specifically for IoT devices. Google "SqueezeNet" and "optimal number of features for classification"!
I have seen a concept proposed to use analog computing elements in a neural network application. The concept being that a MAC instruction can be implemented using a single analog element (e.g., transistor) and it can achieve several bits of accuracy. Also with neural networks the need for absolute accuracy is not there.
13:50 It's like a backwards instruction set! Rather than the CPU waiting for data from memory, the memory can do quick simple operations more efficiently.
I don't know much about computers, but I have been following the growth of AI in chess. Conventional, brute force chess engines are still the most powerful tools for analyzing a chess game (Stockfish), but the newcomer using AI, known as Leela-Chess, is a very powerful number two and has beaten Stockfish many times in computer chess tournaments. Its play style isn't as accurate as Stockfish, but it's very fun, creative and plays a bit like a strong human with great intuition. But I still prefer Stockfish for its accuracy.
An outsider's way to perform compute within memory: Sample memory-strings A & B are each 8-bit, each bit addressable. A holds "01101001" B holds "00010111" When you want to add two bit-strings, you can feed each pair of bits into these two memories; "x:1, y:1, carry:0" goes to the memory-address (1,1,0), and that location in A yields the correct addition - "0", while that same location on B yields the correct carry - "1". Fundamentally, we let the memory act as a *look-up table* for any obscenely complex function we choose, and we can change that function just by updating the memory, which sounds better than an FPGA. And, each address in memory can actually be a row of data, for same input, many outputs.
We already have look up tables in programming. Bringing it lower level will require extensive configurability or a set algorithm. Either way, you're getting an FPGA or an ASIC. Remember that the precursor to the FPGA, the CPLD, was literally a memory platform retooled for computation.
@dfsilversurfer For inference on massive neural networks, we already wait 30 seconds or more. Compute-in-memory is specifically for neural network applications; low precision, high parallelization, latency can wait. You definitely wouldn't need a custom chip just to run an app where you text your friend.
I don't get it. What is the memory-address format? How do the locations on A and B get the proper values for the result and carry respectively? Can you rephrase your solution to be more precise? Please use the bit values stored in A and B as in your original example so I can better follow your logic.
@vaakdemandante8772 Sure! Each memory-slot, A & B, has a set of addresses, and at those addresses are held a bit, like this:
Address | Value Stored in {A,B}
000 | {0,0}
001 | {1,0}
010 | {1,0}
011 | {0,1}
100 | {1,0}
101 | {0,1}
110 | {0,1}
111 | {1,1}
This uses only 16 bits of memory, and a *NON* programmable addressing, so it's simpler than the CPLD mentioned above, while still giving us a full adder!
Now, if we had two arbitrarily long bit-strings that we want to add, called X & Y, then we can feed the first pair of bits from those strings into this memory-logic, using those bits *as the address that we will look-in, for A & B*. For example, if the first bits from our arbitrarily long strings are both "1", while we have *no carry-bit yet*, then that combined address is "011", meaning "0 carry-bits so far, and a 1 for string X and a 1 for string Y." At that address, "011", the value stored by A is "0", which is the correct answer for "what value should be stored in that addition's binary slot?" (1 + 1 in binary leaves a 0 in the one's column.) At the same time, that *same address* points to the value "1" in B, which is the correct answer for "what value should be *carried* to the next binary digit?" (1 + 1 leaves a 1 in the two's-column.)
I mentioned "A & B" as separate vectors, but you would really just have a 2 x 8 block of memory, and pull both bits, A's & B's, at once. That is just a simplistic example. Realistically, memory would be in much larger blocks, such as 16 x 64, for massive, nuanced, bizarre logics to be held as fast look-up tables. This isn't a cure-all best-ever; it's for massive parallel operations, reconfigurable on the fly, with an eye toward advanced machine learning.
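The look-up-table adder described in this exchange can be written out as a short Python sketch. The two 8-bit strings are the ones given in the comments (A holds the sum bit, B holds the carry bit), and the address is formed as carry, X bit, Y bit. The function name `lut_add` and the MSB-first string convention are assumptions added for the example.

```python
SUM_LUT = "01101001"    # memory A: sum bit at each 3-bit address
CARRY_LUT = "00010111"  # memory B: carry-out bit at each 3-bit address

def lut_add(x_bits: str, y_bits: str) -> str:
    """Add two equal-length bit strings (MSB first) using only table lookups."""
    carry = 0
    out = []
    # Walk from the least significant bit to the most significant bit.
    for xb, yb in zip(reversed(x_bits), reversed(y_bits)):
        address = (carry << 2) | (int(xb) << 1) | int(yb)  # carry, X, Y
        out.append(SUM_LUT[address])          # look up the sum bit in A
        carry = int(CARRY_LUT[address])       # look up the carry bit in B
    out.append(str(carry))                    # final carry becomes the top bit
    return "".join(reversed(out))

print(lut_add("0110", "0111"))  # 6 + 7 = 13 -> "01101"
```

No arithmetic is performed on the data bits themselves; swapping in different table contents would give a different 3-input, 2-output function, which is the reconfigurability the comment is pointing at.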
Memory latency is a big factor in CPU design. Tremendous effort has gone into caches, branch prediction, out of order execution, and prefetching to mitigate issues of memory latency.
Excellent video, but I must point out a few inaccuracies. First of all, let me introduce myself. I've been working with Computer Engineering for over 20 years, and I'm currently working in a company that makes chips for accelerating AI applications using exactly a non-Von Neumann architecture.
1. GPUs also don't use a Von Neumann architecture, otherwise, they'd be terrible for AI. They might not be as good as architectures tailored for AI (like Google's TPU).
2. At-memory computing is great for all applications, not just edge ones. It allows for more performance and for saving a lot of power. On an edge device, these translate to longer battery life, and in a data center, they translate into millions of dollars in cost savings.
3. DRAM is one type of memory; there's also SRAM.
4. The eDRAM is not actually a problem. Yes, they're hungrier and larger, but they're also much faster. The L1 cache in modern CPUs is never implemented with DRAM cells, that's why it's small, but it also means it can run at the same clock rates as the rest of the CPU. That's the same reason why CPUs have a cache hierarchy that goes from faster and smaller to slower and larger (L1 -> L2 -> L3).
5. The problem with exotic technologies like R-RAM is that we'll need entirely new production methods, and that's a big problem because the industry will need to see that as the way forward in order to invest in those new fabs. Until non-conventional memory technologies are considered a somewhat safe _commercial_ bet they won't catch on.
The industry differentiates between in-memory and at-memory computing:
In-memory means that computation happens, quite literally, in the memory cells. That minimizes data movement, but the solutions here are analog. That requires converting from digital to analog and then back to digital. This double conversion negates many savings obtained by in-memory computing.
Excellent video about in-memory computing: ua-cam.com/video/IgF3OX8nT0w/v-deo.html&ab_channel=Veritasium At-memory computing means that data movement is reduced by keeping data very close to where computing takes place. What you have is _distributed_ memory. It's an all-digital solution that uses standard tools and standard manufacturing methods. That's where most of the industry is headed right now. Disclaimer: that's the architecture my company adopted for its chips. In a bit of futurism, I think that because AI is such an important application, enough R&D money will be poured into in-memory computing that it'll eventually win out. But I don't see that happening for another 7 to 10 years. Other than those, the video is really excellent and well-researched. The memory wall explanation is spot-on.
Saying that every computer is Von Neumann is loosely true. Most microcontrollers are still modified Harvard architecture computers with a separate program memory bus whereas modern PC and server processors operate like modHarvard only when the program is not exceeding cache size and the memory bus is only used for data transfers. IMO using general purpose VN architectures is the first issue here.
There would also need to be massive increases in onboard cache. It is interesting how we have already hit size limits on DRAM, and we are reaching similar limits on CPUs too. I suspect we will then see one fix improve both: multiple layers. Routing has long been the biggest source of delay, so if you could stack a second layer we would see much faster chips. The problem then is heat....
I'm just glad something has finally spurred all of the giants to get off their asses and fear being left behind again. I'm guessing the mad rush to push memory and processing to ludicrous speed will have (and has already had) some knock on benefits for the average consumer. One thing I don't like is how the PC is becoming less and less viable. Things are being pushed towards mobile and data centers in functionally the same way that things evolved into terminals (or thin clients) and mainframes (or supercomputers) in the 80's and 90's.
I grew up in computing as a systems programmer on Univac multiprocessing systems in the 70s. Back then, we were all aware of the contention for memory access. Cache memory and memory banks eased the problem somewhat. Years after leaving Univac I developed a radical technology. In this technology a thousand processor cores can access a common memory without conflict. It has a few drawbacks. Implementing stacks would be unlikely. Instructions are somewhat slower than is common today. Cache memory would not be practical. Pipelining would also not work. Here is how I devised this plan. I went back and looked at how computing evolved. Back then, the speed of data through a wire was considered instantaneous. I began this trip with the idea that transmission was slow. How would the inventors have dealt with that? I began implementing a solution with a small gate array that I estimated would compete with an i5. But I got hit with a massive heart problem. I simply don't know what to do with it because I'm a nobody.
@@achannel7553 Well, it's not as great as it sounds. They are only 8-bit processors, and on-chip memory is somewhat limited. External standard memory would not work, as specialized memory is required.
See, the entire concept of "computing memory" already exists as you mention it, and it is called "cache memory" on the same processor chip. Expanding the concept and replacing the entire DRAM bank with cache memory is challenging, but it can be helped by multi-core chip design and by the use of chiplets, AMD-style. I don't see ReRAM taking over the market; there is more to come. If you recall the memory oscilloscope tubes from the '60s, which were a wonder at the time, the same concept can be applied at chip level, where the transistors on the chip can change their functions based on the application of an electric field. In the end, the same transistor can work as a logic gate or as a memory cell, as commanded by a supervisory circuit. This improvement would require a redesign of all software EDA tools, and you know that can take a decade. Thank you for this pioneering video of yours on the most challenging computer design problem of the century. Regards, Anthony
Doing basic logic operations on entire memory rows completely changes how we store and process data. For instance, you would store an array of numbers with all the highest bits stuffed together, and all the lowest bits stuffed together, and so on, which is quite the opposite of the current practice of storing all the bits of each number together.
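To make the layout described above concrete, here is a minimal sketch in Python. The helper names (`to_bit_planes`, `from_bit_planes`) are made up for illustration, and a Python int stands in for one memory row; real row-level hardware would of course look nothing like this.

```python
def to_bit_planes(values, width):
    """Transpose a list of unsigned ints into `width` bit planes.

    planes[b] is an int whose bit i holds bit b of values[i], so one
    plane plays the role of one memory row holding "all the b-th bits".
    """
    planes = []
    for b in range(width):
        row = 0
        for i, v in enumerate(values):
            row |= ((v >> b) & 1) << i
        planes.append(row)
    return planes


def from_bit_planes(planes, count):
    """Inverse of to_bit_planes: rebuild the original numbers."""
    return [
        sum(((planes[b] >> i) & 1) << b for b in range(len(planes)))
        for i in range(count)
    ]


values = [5, 3, 6, 1]          # 4-bit numbers: 0101, 0011, 0110, 0001
planes = to_bit_planes(values, width=4)

# One row-wide AND answers "is the number odd AND is bit 1 set?" for
# every element at once, without touching the numbers individually.
odd_and_bit1 = planes[0] & planes[1]
```

Here `odd_and_bit1` has only bit 1 set, because `values[1]` (3) is the only element that is odd and has bit 1 set. The point of the layout is that a single row operation processes one bit position of the whole array.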
I think you fundamentally misunderstand how in-chip memory (sram) works. Each bit in a register is operated on in parallel for each word in all CPUs. And each bit, from each word, from each wave-front in GPUs/SIMD processors.
Another option might be to change how AIs work in general. Rather than each neuron checking every neuron on the next layer, you could limit them to only talking to the ones next to them or 1-2 neurons away. It would mean we would need larger AI models, but I'd imagine it would be much easier to work with in a real-life 3D space. You could feed data in one side and have it come out the other. You would still need to somehow allow it to be trainable and to have the data read from each neuron if you ever wanted to copy the chip's content into a software model. Given the massive increase in performance this would offer, even if the processing and memory storage is slower, it would still be better. Essentially, you could make bigger and bigger AIs by just making the chip larger and giving it more layers. It would gain massive benefits the larger the AI you want to work with is: rather than one neuron at a time like with standard models, you'd be working with one layer at a time no matter how large it is. The potential gains should be obvious if the technical challenges could be worked out.
I believe current models are based on how the human cerebellum works. Limited input, massive bandwidth and parallelization in between to derive a small but coherent output from those small inputs.
Super informative about openai gpt and how all this is processed. (hope I've given you the keywords you need ) the depth you go into things is always astounding and no, I don't expect you to explain that wizardry. Thanks for making these, you're awesome.
Great video.. reminds me of the structure of the old SIMD ICL/AMT DAPs where AMT was Active Memory Technology.. the first machine I ever wrote a back prop ANN program in parallel fortran. For the right problem it was very efficient.. love to see one on a single die with enough memory of course.
For low-memory-latency deep learning accelerator applications, there are also innovations at the system protocol level through CXL (Compute Express Link), allowing for memory sharing and memory scaling (of multiple memory types). Great topic. Yes, device, circuit, and systems architecture innovations are needed for next-gen AI.
It's kinda funny how often, theoretical abstractions are quite literally turned into real products, while the abstraction itself accommodates a much, much wider class of possible implementations. For instance, nothing in the von-Neumann architecture says that the CPU or the memory need to be a physically separate and monolithic device. It's just that they can be considered that conceptually for theoretical analysis.
I can help you explain it further. A component that changes its resistance and then retains it is called a memristor, the last discovered of the quartet of fundamental electrical components, which also comprises the resistor, capacitor and inductor. Basically, a memristor is a resistor that remembers its resistance.
I'm reminded that it's time to reconsider: 1) Transport-Triggered-Architecture 2) Flow-Based-Programming and 3) Systolic-Arrays in light of the proposed solution vectors to the "memory wall" issue outlined here. I believe clockless designs are appropriate for power-saving modalities.
For a text on this see "Content Addressable Parallel Processors" (1976) by Caxton Foster. One of the earliest air traffic control computers, Staran, was one of these.
This is where the tesla super computer design is a brilliant dance around the memory wall … looking forward to the results coming later this year when the first pod is completed
When you talk about high precision and low precision human brains early in the video, I immediately thought of a TechTechPotato video about IBM making an architecture for lower precision but “wider” compute. Basically they are almost like PCI E graphics cards but for server computing only.
Your videos are excellent because of the content but mostly because you don't have background music, noise, and kettledrums interfering with the narration.
12:22, I think it might use the laws involving resistance (like those from circuits in parallel, series, etc) to cleverly infer from some 'measured' resistance the addition of data stored within the memory. It sounds similar to what analog computers do, and how they can be used at lower costs to implement trained neural networks for inference.
Doubt it. Give me an example of a pure analog computer. Those measured resistances are far too imprecise. And why would you do it if you can fit a billion transistors on the size of a stamp?
@@computerfis Check out Veritasium's video on analog computers. He shows modern examples of analog computers which are used in inference with trained neural network models. As for the reason why these are used, the video goes into it, but in short it is cost. Both cost of production and energy cost.
CAM (content-addressable memory) aka "associative memory" was on the drawing boards in the mid-00s, with an eye toward AI, and was essentially a hardware implementation of the associative array construct found in most programming languages. It did find use in network routers for fast hardware lookup of routing table entries.
Would you consider making a video about Neuromorphic Computing? The in-situ computing you covered still tries to implement the traditional gate sets like AND and XOR. Neuromorphic Computing seems not to fall under the directions you covered.
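For anyone who hasn't met the idea, a rough software model of a CAM might look like the sketch below. The class and its methods are hypothetical, it only does exact matches (real router hardware typically needs prefix matching, which this does not attempt), and the loop only models the result of the lookup, not the fact that hardware compares every cell at once.

```python
class ContentAddressableMemory:
    """Toy software model of a CAM: search by content, get back addresses."""

    def __init__(self, size):
        self.cells = [None] * size   # None marks an empty cell

    def write(self, address, word):
        self.cells[address] = word

    def search(self, key):
        # Real hardware compares the key against every cell in the same
        # cycle; this loop only models the result, not the parallelism.
        return [addr for addr, word in enumerate(self.cells) if word == key]


cam = ContentAddressableMemory(size=4)
cam.write(0, "10.0.0.0/8")
cam.write(2, "192.168.1.0/24")
cam.write(3, "10.0.0.0/8")
matches = cam.search("10.0.0.0/8")   # addresses 0 and 3 hold this word
```

Compare that with ordinary RAM, where you supply the address and get the content back; here it is the other way around.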
You can't do HPC on the cloud....cloud architecture specifically prevents this.... Now you _can_ do HPC on a custom cluster. It's more difficult because you have to manipulate the instructions to keep the memory as close to _every_ processor as possible! But this _can be done!_
I feel the need to mention here that I consider (along with others, I'm sure) that the proper term is "stored program computer". Von Neumann had nothing to do with the invention of the stored program computer design, which was due to Eckert and Mauchly, the designers of ENIAC. Von Neumann simply took notes on the design and published it under his own name, which led to a misinterpretation that he (ahem) declined to correct.
Two things I wish you had mentioned that could be utilized to help out with this process. First is the number of cores that exist on a CPU. The way that they are accessed is also very critical, and several companies are designing CPUs that will be able to work hand in hand with GPUs in the way that their core systems are designed, with multiple companies now designing CPUs with over 100 cores per chip. Second is the use of quantum computing in processing logic gates, among other things. IBM's photon chip tech combined with quantum computing is going to revolutionize the way we use computers at a scale that is hard to imagine. Exciting times we live in.
It would be interesting to start that development with feature rich memory - not completely computation units but memory that can perform xor write and block copy that should integrate with existing technology quite well
David Patterson of UCB did not invent RISC. That was IBM with the 801 minicomputer project. A fellow named Glenn Myers wrote a book about what they learned designing it called "The Semantic Gap". Patterson became a disciple of the philosophy and did a great deal to spread the word.
I'd be extremely careful with statements like "the human brain runs on low precision". The understanding that led to our current-day nodal computational fuzzy logic (i.e. 'Neural Networks') is primitive in the extreme when compared with the flexibility an actual network of actual neurons exhibits. Neurons are not restricted to a pure binary state; they are capable of far more. Through the base setup of different types of neurons and their ability to react differently to different stimuli, certain things like communication are apparently hardwired into our brains. That's the reason why we still haven't been able to even come close to the computational efficiency of a "simple" Petri dish with a few thousand single-type neurons. It's a bit annoying to anyone who knows both the maths-based simplistic nodal networks and the research into how our brains function, because it's really an apples-to-rocks comparison. I mean, neural networks don't even support information backflow at this point, and that was a known feature of neurons even before the term neural network was coined...
There is a simple solution... albeit one which holds the real possibility of radically changing the entire computer industry. Memristors. Memristors instead of transistors, were we able to achieve high-volume production, would be the only thing we needed. Memristors can replace the CPU, GPU, RAM, and SSD. You can intermix compute and memory trivially easy with memristors. And you can change whether a section of a memristor grid is computational or memory storage (all non-volatile, and all as fast as registers) on the fly with microsecond timings on changing it. For reasons I have thus far failed to understand, those using memristor technology have focused almost exclusively upon using them for neuromorphic accelerators. That is a tremendous waste. If you could efficiently mass-produce large volumes of memristors at small scales, you could ignore the neuromorphic aspects and instead replace the entire CPU, GPU, RAM, and SSD industries wholesale with a single product. Their design, at the production level anyway, would also be very simple, similar to how SSDs are just big grids of NAND gates compared to the monstrous complexity of mechanical hard drives. Price-fixing tactics have artificially preserved the existence of mechanical hard drive companies, and we might end up seeing that with memristors as well, which would be very sad, and would continue holding us back just to keep some current billionaires rich.
Well put-together seminar. You have presented some real hardware design, a bit of the physical constraints, and marketing issues. I bet, though, that the problem lies not only in an obsolete architecture but also in flawed algorithms that spend a lot of resources on tasks that can be skipped, not recalculated, or just inferred by other means, which altogether leads to fewer steps in intermediate computations and lower storage needs to do the very same job. Be it in the CS field or even the brute maths field.
I suspect, and we have seen this already with more and more optimized ASICs, that we will end up with a world where AI models are no longer being run on generic hardware but are created in silicon entirely, with memory directly attached (chiplet designs) or on die, as mentioned. The cost to build models with tens or hundreds of trillions of parameters means that producing a dedicated chip for this purpose and cramming a few thousand of them in a rack is not really cost-prohibitive anymore. The bigger question I think we should ask is what the future of these models is. So far they are fun and curious, but they are not really adding much value compared to their cost to build. As much smarter people than me have pointed out, the current inference AI "revolution" feels a lot like the Web3 movement, where there is a lot of money sloshing about with VCs if you mention you are doing something with this great new tech. Yet no one seems to have built a convincing argument for this tech when it comes to making the kind of returns in the next 3 to 5 years that a VC would want to see from at least one of their bets in this space. It could very well be that in the near future the money will dry up and the big innovations and the whole new world that people are predicting simply do not materialize. This does not mean that the technological obstacles should not be bested: our memory wall problem is a real, serious issue that is affecting way more than just AI, so the sooner this is solved the better. But it would probably mean that the need to solve this quickly will be felt a lot less by the industry, and the actual "solution/workaround/innovation" needed to slay that dragon is way further off than we currently hope.
this is wonderful information. a great breakdown. my thoughts were large parallel access to memory (no memory controller with any addressing required). and lots of sram. it seems that broadcasting out data is one thing, but needing the results of a lot of data in a timely manner is the real core of the problem. my gut feeling is the true solution to this is quadrillions of simple input->output units that have baked in logic tables. and no intermediary memory controller. no row/column access. just extreme connectivity. extreme parallel connectivity. and 3D dies the size of a dinner plate.
2:46 - economics again: the brain is the way it is because of the economics (of the world); computers still have to refine this. The brain doesn't have to perform such operations (it performs them sort of implicitly, or on the fly). :) 4:09 - they mention economics themselves :) 7:00 - nice wordplay, smal - smal ;D 8:00 - I came up with something like this myself at university :) you can use PCM memory :) 12:00 - resistive RAM; so I was right after all xD, I was xD en.wikipedia.org/wiki/Resistive_random-access_memory 12:43 - you simply heat it up or cool it down, and that determines whether the state is crystalline or amorphous :)
And just like that, we're back to cutting-edge computers the size of houses.
got to restart somewhere
@@ianrajkumar the computer hardware ouroboros
Back where I started then, with an ICL 1900 that took two huge rooms.
Floor area four times the size of my house.
Hello, DRAM design engineer here. Really informative video, and explained in such an easy to understand way! Love it!
Just a quick comment, DRAM memory is generally pronounced "DEE-ram" and Via in TSVs is pronounced "VEE-ah". It's confusing and not intuitive, but hopefully this helps for your future videos!
that isn't confusing at all
Confusing? Seems a straightforward pronunciation.
Thanks for your comment tho!
Ground all my gears this whole video
Thank you! After watching this I started thinking I was pronouncing it wrong but I'm glad to see DEE-ram is in fact the correct pronunciation.
This was killing me for the whole video.
Your ability to break down information into bite-sized pieces that the average person can understand is remarkable. I worked in the electronic test equipment market. Great job.
concise, brief, clear. It's perfect delivery without clutter. the visual style matches. love it!
exactly why i subscribed on this video
it's easy to divide and multiply if you use the binary math system, it's been used by every major civilization since the pyramids and probably longer.
it's also what computers use
Yes!! Exactly
Informative and concise; thank you.
I notice you pronounce SRAM as "ess-ram" (which has always made sense to me because of the acronym's origins as a *"Static"* extension or iteration of the much older acronym/technology of RAM for Random Access Memory), but you also pronounce DRAM as "dram." (I say "dee-ram" because, again, it's a variation on ol' trusty rusty RAM.)
Unfortunately, "dram" is already a word in use outside computing - but in at least two other science fields - as a noun for:
1) A unit of weight in the US Customary System equal to 1/16 of an ounce or 27.34 grains (1.77 grams).
2) A unit of apothecary weight equal to 1/8 of an ounce or 60 grains (3.89 grams).
3) A small draft.
Not like bad or wrong, but maybe worth noting for future usage.
Again, excellent work; enlightening. Keep it up.
How do you pronounce eDRAM?
@@slypear Hm. I suppose I'd say "ee-DEE-ram." The core technology, Random Access Memory (RAM) persists and evolves. Similarly, there are variants of radar (RAdio Detection And Ranging) like UHF and VHF radar, and lasers (Light Amplification by Stimulated Emission of Radiation,) where the root acronym remains (though the Gamma Ray laser has only recently been taken beyond the theoretical.)
In fairness, ROM (Read-Only Memory) became "EE-prom" in its Electrically Erasable Programmable variation. I'm not sure that technology is still in use with the widespread and cheap availability of Flash memory, so this point may be so far out of use as to be moot.
¯\_(ツ)_/¯
@@stevejordan7275 Good points, thanks!
You're on another level bro. I love it. Beautifully presented, in-depth and voiced perfectly. This channel rips.
A.G.I WILL BE MAN'S LAST INVENTION
@cody orr4 is it though?
its a purely bot channel
Voiced perfectly? 🤔 "In situ" is pronounced _in sit-yu,_ not _in see-tu._
@@mistycloud4455 It's so very close, can't wait to see what the future holds!
I can explain the resistive memory outlined at 12:05. In electronics, there are parallel and series circuits. Resistances in series add together, meaning that if you connect the resistances from two memory banks, the resulting resistance can be written to a third memory bank. No logic required. I mean of course there’s logic required, but the memory itself is the input and the output. I have no clue how the memory chips work, but the idea is that you can use the properties of resistors to do addition for you.
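A back-of-the-envelope version of that idea, as a sketch only: the encoding (value n stored as n times a unit resistance, the hypothetical `UNIT_OHMS`) is my assumption, not how any real resistive memory chip is known to work, and it ignores noise and tolerance entirely.

```python
def series(*resistances):
    """Resistances in series add directly."""
    return sum(resistances)


def parallel(*resistances):
    """Resistances in parallel: the conductances (1/R) add instead."""
    return 1.0 / sum(1.0 / r for r in resistances)


# Hypothetical encoding: a stored value n is represented as n * UNIT_OHMS.
UNIT_OHMS = 1000.0
a, b = 3, 4

total_ohms = series(a * UNIT_OHMS, b * UNIT_OHMS)
decoded_sum = round(total_ohms / UNIT_OHMS)   # read the sum back as a number
```

With ideal resistors `decoded_sum` comes out as 7, i.e. the wiring itself did the addition. Whether real cells are precise enough for this is exactly what the replies argue about.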
this will come true when someone can implement a hybrid digital - analog system
I think the resistive memory is implemented with a memristor. When current flows one way, the resistance increases. When current flows the other way, resistance decreases.
Wouldn't that require a constant current source and an op amp to add the resulting voltage? Then an ADC to convert back to binary. I'm not sure if CMOS op amps are possible.
current flow through resistance = noise
more resistive elements = more noise
... just as a reminder. There is NO free lunch in nature.
@@Syrinx11 CMOS op amps (and mixed types) have been produced since the early to mid-1970s ... See types like LF157 or CA3130, etc.
The concept you are thinking about in your comment is not suitable for memory chips or processing units. It is too big, not highly integratable, and would mean the horrible prospect of introducing extra parts and different technologies that all have to be populated on a die. Also: on-die resistors are FAT ("well resistor")! The precision ones are even fatter, and more expensive and time-consuming to produce and calibrate (we are talking orders of magnitude compared with current pure memory silicon production processes).
[1] "Most modern CMOS processes can guarantee the accuracy of on-chip resistors and capacitors to only within ±25%."
[2] temperature coefficient ranging from 1500 to 2500 ppm
[3] Etching and lateral diffusion errors in well resistors, & Etching errors in polysilicon resistors
[4] The size ... just for your imagination: in the two-digit µm ballpark, if we are talking about very small resistors.
see also: the datasheets; "IC Op-Amps Through the Ages". 2000 Thomas H. Lee; "A CMOS Op Amp Story - Part 1 - Analog Footsteps", 2017 by Todd Nelson; "Adaptive Techniques for Mixed Signal System on Chip", pp 67-94, Springer [1-4].
I hope this helps. And back to the drawing board, or? Thanks for your thoughts, Syrinx:)
I have memories of a project of "modified DRAM chips with internal logic units" around Y2K, I saw a paper probably from MIT but I don't remember whether it was implemented. It looked promising for certain kinds of massively parallel operations such as cellular automata simulations 🙂
A.G.I WILL BE MAN'S LAST INVENTION
@@mistycloud4455 you underestimate humanity...
@@mistycloud4455 fuckin hope so...
Not only do I hope agi is man's last invention, I hope it gets rid of the current state of things
@@jasenq6986 Be the change you want to see in the world. Hope is useless.
This reminds me of the Connection Machine (CM), made by Thinking Machines Corporation back in the 1980s. The CM had a large number of single-bit processors with a few kilobits of memory each. They were interconnected in a high-dimensional hypercube.
Lower dimensional connections were on-chip and higher dimensions went off-chip. It was programmed in a language called *Lisp. I remember that it seemed way ahead of its time.
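For readers who haven't seen a hypercube interconnect: number the nodes in binary, and link two nodes whenever their numbers differ in exactly one bit. A small sketch of that rule follows (function names are mine, and this says nothing about how the CM actually routed messages):

```python
def hypercube_neighbors(node, dimensions):
    """Return the nodes whose address differs from `node` in exactly one bit."""
    return [node ^ (1 << d) for d in range(dimensions)]


def hop_distance(a, b):
    """Minimum hops between two nodes = number of differing address bits."""
    return bin(a ^ b).count("1")


# A 4-dimensional hypercube has 2**4 = 16 nodes, each with 4 links.
neighbors = hypercube_neighbors(0b0101, dimensions=4)
farthest = hop_distance(0b0000, 0b1111)
```

Node 0b0101 is linked to 0b0100, 0b0111, 0b0001 and 0b1101, and no two nodes are ever more than `dimensions` hops apart. That is why the network diameter grows only logarithmically with the number of processors.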
We are now engineering systems which we cannot ever understand, millions of weighted matrices developing novel solutions. We are living during the dawn of something monumental.
Thinking Machines was founded by two guys and one of them was Richard Feynman's son. Feynman helped them to solve some problem using integer differential equations, and since neither of the two could understand it they were reluctant to use the solution ... in the end they used it and it worked. Thinking Machines created the massively parallel computer technology that ended up killing Cray Supercomputers.
I believe, btw, that CUDA cores are something like that - tiny CPUs with a bit of memory for each one.
One thing I don't understand is why don't they use static memory to solve the problem ... does it consume too much power?
@@ElectronFieldPulse So you're doing nothing.
@@barreiros5077 - I never claimed ownership. I meant it in the way of progress of mankind.
@@barreiros5077 Shit blood
Circuit-level CIM has one major limitation that I wish you had discussed: its susceptibility to PVT (Process, Voltage, Temperature). When storing weights in SRAM cells and applying 1 or 0 to the Word Line (WL) to perform the MAC operation (the WL multiplies with the weight, and the current on the Bit Line (BL) is then the sum of all of the cells in the column), we are performing an analog operation. The BL current will depend on the process variation, supply voltage, and ambient temperature. That is, at two different temperatures or supply voltages (battery voltage changes), we will get different results, even with the same die. This makes it unsuitable for "Edge AI" applications. Between two chips or two different columns, we will also get different results, because of the process variation. The accuracy is significantly limited by this. With an analog WL driven by a DAC, the problem is exacerbated even further. Granted, I do not know what sort of accuracy AI models really require, but I imagine it is much greater than what can be offered by CIM in current CMOS processes. Of course, larger processes decrease variation, but the density suffers. The nice thing about conventional computing is that our accuracy does not depend on PV, only our speed. I think integrating DRAM dies with conventional CMOS dies is likely the way forward.
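To illustrate the point numerically, here is a toy model of one bit line. Everything about the variation model is an assumption on my part (a Gaussian per-cell gain with a made-up 5% sigma, via the hypothetical `varied_gains` helper); it is only meant to show why two "identical" columns disagree, not to predict real silicon.

```python
import random


def bitline_current(inputs, weights, cell_gain):
    """Sum of per-cell currents on one bit line (arbitrary units).

    inputs    -- 0/1 values applied to the word lines
    weights   -- 0/1 values stored in the cells of this column
    cell_gain -- current contributed by each active cell; 1.0 is ideal
    """
    return sum(x * w * g for x, w, g in zip(inputs, weights, cell_gain))


def varied_gains(n, sigma, rng):
    """Hypothetical PVT model: each cell's current deviates by ~sigma."""
    return [rng.gauss(1.0, sigma) for _ in range(n)]


rng = random.Random(0)                 # fixed seed for repeatability
inputs = [1, 0, 1, 1, 0, 1, 1, 1]
weights = [1, 1, 1, 0, 0, 1, 1, 0]

ideal = bitline_current(inputs, weights, [1.0] * 8)   # what digital logic gives
chip_a = bitline_current(inputs, weights, varied_gains(8, 0.05, rng))
chip_b = bitline_current(inputs, weights, varied_gains(8, 0.05, rng))
```

`ideal` is exactly 4.0, while `chip_a` and `chip_b` each land somewhere near 4 and differ from one another. The ADC reading the bit line has to resolve the intended sum despite that spread, and the spread grows with the number of active cells in the column.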
It's also not much faster, only more energy efficient.
1. Battery voltage changes shouldn't affect CPU/memory voltages except when battery voltage forces DVFS (insufficient power -> enter lower power state). If battery voltage changes do seriously affect CPU/memory voltage at any other time, it is a bad design.
2. Datacenter coolers can be designed to keep chip temperature relatively constant (for example, liquid cooler + piping similar to an ICE vehicle cooling system).
@@volodumurkalunyak4651 1. I agree, that the supply voltage to the SRAM cells would not change much across battery SoC, but it will change in small amounts. Voltage regulator outputs are dependent on the input in the real world and using ones that are more isolated increases cost. But yes voltage is the least likely to vary drastically, but also the BL current is very sensitive to it.
2. Yes, data centers will provide stable voltage and temperature. But the accuracy is still much worse than conventional computing due to PV, and so it begs the question -- are clients that use data center computing willing to accept inaccuracies when compared to conventional data centers? It's a big tradeoff that I'm not equipped to answer. However, I think the Edge AI application is kinda buzzword BS.
Yes, which is why I'd recommend not using analog or multi-level digital logic (logic which uses more than 2 voltage states) at all, even for neural networks or other kinds of systolic array computing. & it's worse than that: multi-level digital logic is not just difficult to store to & load from SRAM, it would also severely complicate arithmetic if you were thinking of not converting it back into binary but building multi-level ALUs directly. That holds for both ways one might go about it: 1) quasi-analog arithmetic built out of operational amplifiers a) requires a completely different semiconductor process, because it mixes analog & digital processing, & b) has all the inaccuracies & reproducibility problems of analog circuits; 2) arithmetic built out of digital multi-level logic devices requires, for n-level logic, a number of different transistor types that grows superlinearly in n, a total transistor count that grows superlinearly in n for each of the m possible logic circuits (where m itself grows superlinearly in n), a susceptibility to process variation that grows at least superlinearly in n, & superlinearly growing wiring & interconnect complexity.
Example: ternary digital logic (digital logic with 3 different voltage levels), when implemented in a CMOS-like process, doesn't just require 3 as opposed to 2 different kinds of transistors; you'd actually need to reliably build 4 different kinds of transistors (increased susceptibility to process variation) for any logic gate. A ternary inverter in such a process isn't built out of 2 transistors but out of at least 4, with increased wiring complexity within the gate & 3 instead of 2 power lines. Ternary also doesn't just have 2 different unary gates (buffer & inverter, of which you'd functionally only need the inverter to build all unary gates) but at least 6 (of which you'd functionally need at least 2 different ones to build all unary gates). & this gets even worse if you want more than just unary gates: multi-input combinatorial ternary gates require even more transistors & wiring than their binary counterparts, far more than the more compact ternary representation saves. & these disadvantages all get worse as you go from ternary to quaternary to quinary ... digital logic, so that it's practically impossible to gain any efficiency by replacing binary digital logic circuits with any other kind. OK, you could use a non-CMOS-like n-ary digital logic, but these all have the problem of drawing power statically, which drastically decreases power efficiency while only partially reducing the aforementioned problems.
Analogue computing is a false start for CIM. Yes, digital CMOS multiply circuits require some space, but they're on the same process as fast SRAM; they just need to be tightly coupled together and ignore all other logic functions (which is what Google and Nvidia tensor cores implement: theoretically, only a MAC circuit and a register page). There's some complexity in the control circuits: caches for instructions, input, output and synchronization logic. You need that bit of complexity with the analogue circuits anyway, and you don't have to build analogue comparators and DACs, which don't scale at all on small manufacturing nodes.
As an aside, the book "Content Addressable Parallel Processors" by Caxton C. Foster (1976) discussed the ability to have a mass memory do computation. It is bit serial, but is parallel to all memory locations, meaning that you can do things like multiply all memory cells by the same number and similar operations. It's a good read.
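"Bit serial but parallel to all memory locations" can be sketched in a few lines. This is my own illustration, not code from the book: it adds a constant rather than multiplying (multiplication would be repeated shifted adds of the same kind), the helper names are made up, and a Python int stands in for one bit position across all words.

```python
def pack(values, width):
    """planes[b] holds bit b of every word (bit i belongs to word i)."""
    return [sum(((v >> b) & 1) << i for i, v in enumerate(values))
            for b in range(width)]


def unpack(planes, count):
    """Rebuild the list of words from the bit planes."""
    return [sum(((p >> i) & 1) << b for b, p in enumerate(planes))
            for i in range(count)]


def add_constant_bit_serial(planes, constant, count):
    """Add `constant` to every word at once, one bit position per step.

    Results wrap modulo 2**len(planes).
    """
    all_ones = (1 << count) - 1
    carry = 0                      # one carry bit per word
    result = []
    for b, plane in enumerate(planes):
        # Broadcast bit b of the constant to every word.
        k = all_ones if (constant >> b) & 1 else 0
        result.append(plane ^ k ^ carry)                 # full-adder sum
        carry = (plane & k) | (carry & (plane ^ k))      # full-adder carry
    return result


words = [1, 7, 12, 15]
planes = pack(words, width=4)
summed = unpack(add_constant_bit_serial(planes, constant=3, count=4), count=4)
```

The loop runs once per bit of the word width (4 steps here) no matter how many words there are, which is the trade the comment describes: serial in bits, parallel in memory locations. `summed` is `[4, 10, 15, 2]`, the last entry having wrapped around at 16.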
Interesting.
And then we got a little problem: how to estimate total processing power of such a device, especially for hard-to-parallel tasks?
Hey, I learned about that book in college! Software engineering/systems manager here, tho I do freelance mostly. I wonder if it would be possible to have a standard GPU combined with a mass memory system. Basically you build both systems separately but combine them on the die, so you could theoretically go back and forth between both systems. It would be quite bulky, however it would mitigate having to commit to only one design. Using aerogel as a frame could also significantly reduce heat output and therefore increase overall efficiency. Just a thought.
@@xanderunderwoods3363 why would aerogel decrease heat output? Being a superb insulator wouldn't it reduce heat dissipation?
@@Kyrator88 *"why would aerogel decrease heat output? Being a superb insulator wouldn't it reduce heat dissipation?"*
And thus reduce heat output... for a period of time... :P
One thing that always fascinated me was the use of content addressable memory. As I recall, we were using it for decoding micro-code back in the bit-slice mini-computer days. It seems that that approach of combining logic and memory would be an interesting approach to today's AI problem.
This sort of "content addressable memory" was called a "programmable logic array". These were made out of memory cells (ROM or EEPROM), but wired in a way which allowed them to perform logic operations, called "sum of products". So the memory cells stored the configuration of the device and performed the computations at the same time. The problem is, this only worked with nonvolatile memory cells, which are slow to program and can only survive a limited number of write cycles. Also, this technique cannot be easily scaled. When memory capacity gets bigger, the transistors become smaller and will have many errors and defects. For memory this is not a problem, because one can just make the memory slightly larger to leave room for error-correcting codes (this is the reason why flash drives and SSDs are cheap: they are actually slightly defective, but the defects are hidden!). So this technique cannot be used for AI.
@@atmel9077 PLAs came out much later.
I was hoping you were going to get into some of the software solutions that today's neural networks have been able to implement to allow 1000x increases in deep learning architectures while the VRAM has only increased 3x in the same timeframe, instead of exclusively the hardware solutions. Stuff like how there have been great advancements in the ability of multiple GPUs to communicate with each other efficiently to perform backpropagation, which has allowed neural networks to be trained on many GPUs at a time. At first, a neural network could only be trained all on one GPU, but then the NN got too big to fit onto a single GPU, so we figured out how to have a single layer on each GPU, but then the NN got too big for that, so we had to figure out how to have parts of each layer on each GPU. Each step along the way required innovation on the side of machine learning engineers to build exponentially larger neural networks while the GPU VRAM just isn't keeping up.
I'm glad you mentioned this, just like 2 weeks ago they were able to write data on DNA for the first time which enables you to have massively higher levels of data compression, if you were to combine this technology with the gpus, this would solve the problem for raw processing capacity.
@@xanderunderwoods3363 what are the read and write times on DNA when used as memory?
What do we need such powerful neural networks, actually?
Like, there's stuff we *could* do, but should we? Like, maybe we need to give society time to adapt to new technologies, instead of rushing it all in one go?
Does SAM/BAR help with this stuff? It allows access to ALL video memory at 1 time instead of 256mb blocks.
@@TheVoiceofTheProphetElizer I'm a software dev, my experience is that the more efficient and compact you write code, the less people understand what it does or is meant to do. Sad, but true. I've ended up re-writing code to be less compact and less efficient simply because I got sick of having to explain what it does based on what logic every time a new developer came across that specific piece of code. People on average are simply not good at computing. This was in C#, when I wrote in assembly contained raw machine code it was worse, I got zero points in school tests because the teacher didn't understand why my code gave the correct answer. I'm a software dev since many years now, and write most code to be explicitly human readable just to save time in explaining, except when it's code I don't want others to mess with.
It really sucks that the '08 crash killed a lot of the proposed solutions the large vendors were looking at to address these issues. Look into HP's solution (optical computing/rack-scale memory) and Sun's solutions: they were putting R&D money into these before the '08 crash caused all these companies to essentially abandon their labs.
It's honestly stupid cus after 2008, all that VC money was wasted on stupid shit like Javascript libraries and whatnot instead of anything actually useful.
I'm currently working on a project using NTC and PTC thermistors to store analog values, with the idea that they will respond differently to frequency of access and will also affect neighbouring cells, much like a neural net.
Thanks! Great content
ASIANOMETRY IS THE BEST EDUCATIONAL CHANNEL ON YOUTUBE, NO CONTEST!!!
Your channel truly stands out like a diamond in the rough. There is plenty of stuff I like and watch on YT, but your channel is on an entire different level. You dive deep into complicated subjects over and over, and always do it in a way that is easy to understand. Other channels go deep too, but I frequently find large chunks of the video goes over my head because I don't have a PhD. Every single time I watch your vids, I not only learn new things, but by the end of the video I UNDERSTAND the subject you bring up and feel smarter. Can't sing your praises enough! Take care!!!
it's great, but I think in a contest between Asianometry and Steve Brunton, I'd pick Steve Brunton. granted, both are favorites.
There are plenty of great education focused channels - 3Blue1Brown, Veritasium, Steve Mould
Let's not overhype shall we?
@@aravindpallippara1577 Let's accept that people have their own opinions shall we?
Nope.
Some researchers have tried to understand what a neural network does to an image when trained for recognition and classification without a pre-set algorithm.
The results are startling; the network gets fixated on the tiny differences and their patterns at the boundaries of the image, and other odd parameters that a programmer would never consider viable.
The fact is that the system works, but relies on some preconditions that can fail all of a sudden.
There is a long way to go in designing a reliable neural network, but there also is something to learn on how numerous are the intrinsic and unknown pre-conditions existing in human perception...
This oddly sounds like my autistic mind.
I'll meet a person but couldn't recognize their face, and yet I will *_forever_* remember that they have a freckle on their left ear lobe.
Now I want to note _everybody's_ left ear....
My brain isn't advanced enough to function in society yet it does more on 25W than I can with this computer.
That's a naive take. Neural nets don't see images beyond the 1st layer; you could consider them masks, tho all you can actually see is what looks like random noise most of the time. Beyond that 1st layer everything is subjective. Neural nets tune themselves in multiple dimensions through the linear combination of non-linear functions (neurons), which means that in every step during the learning process the network evolves in different "timings", pointing to the optimal combination of values that connect the image to its class. There are cases where a network is biased to some particular feature of the scene, which can be anything from an average value of the green color to a watermark that's present in every forest picture. There are also many ways to treat biased models, from curating the dataset to making a second net that tries to fool your classifier. These biases aren't unique to neural nets, humans do it all the time. The "human error threshold" is around 70%, meaning that the average human fails a task 30% of the time. This is a crude standard as it depends on the task (medical diagnosis, for example, is much worse), but it is useful when making a neural net, as you already have a minimum viable product if you can reach this on a representative test set. The state of the art of neural nets is at an all-time high: by adding complexity you can learn more features of a dataset and train more complex tasks beyond tracking your face in an Instagram filter. One academic benchmark example is the German Traffic Sign Detection Benchmark, where the top model reached 97% last time I checked. There are some images that humans can't classify yet a neural net can, because it doesn't have to rely on simple geometric analysis like the human brain does when detecting patterns, and can do it in parallel.
TLDR: Neural nets can see as far and beyond what a human can except when they can't, aren't powerful enough or just end up learning either nothing or too specific dumb details, yet the possibility of making general AI is getting scarily closer
@@eumim8020 Thanks for your contribution, though have you heard of paragraphs? They're kind of important to reading a lot of text if you have bad eyesight.
Really interesting and scary stuff.
Where'd you learn this?
that's a lot more true of nets that are not trained to be adversarially robust.
@@elecbaguette College lol
I'm writing this as I go to bed lmao.
I'm majoring in the AI field and I fell in love with Neural Networks.
Really weird dealing with minimizing the number of dimensions of reality, the truthiness value in statistics, information theory and how to make the calculation of 2*3 run faster, all at the same time
When did people start pronouncing DRAM as a whole word instead of just saying D-Ram?
RN Bch deal.
I hear you. We don't all use one syllable
But it probably makes sense
But the pronunciation of In Situ is definitely not correct
They haven’t. Along with discussing FP as ‘high precision,’ we are not looking at something super well researched. FP is like the computer version of scientific notation; it’s fast but gives up some accuracy in favor of scale.
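To make the "scientific notation" comparison concrete, here is a minimal Python sketch, using only the standard `struct` module, that rounds a number to 32-bit floating point and back. The helper names `to_float32` and `relative_error` are made up for illustration. It shows the trade: the relative error stays tiny at any scale, but a large integer such as 2**24 + 1 can no longer be stored exactly.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


def relative_error(x):
    """Relative error introduced by storing x in only 32 bits."""
    return abs(to_float32(x) - x) / abs(x)


# 2**24 + 1 needs 25 significant bits; float32 keeps only 24, so it rounds.
big_integer = float(2**24 + 1)
print(to_float32(big_integer) == big_integer)

# Scale is cheap: very large and very small values keep a similar relative error.
for value in (0.1, 12345.678, 1e30):
    print(value, relative_error(value))
```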
Similarly via rhymes with Mama-mea that's a big via.
Language models are not the only task requiring huge memory. Another example is genome scaffold assembly (which takes millions of DNA sequence snippets to produce a complete genome of an organism).
There's no such a thing as "enough memory" when it comes to science or technology. Breakthroughs in memory just make the industry even more hungry for more.
*4:45** "Everything new" has always been available to only governments and rich people. Consider your cell phone. That technology when new was hundreds of thousands of dollars. Now you can ask a friend if they have a phone they are not using and get it for free.*
*1 TB drives used to be $25,000 and now I can buy a SSD, which didn't even exist technologically at that time, for $30.*
*That was 30 years time.*
there is no channel like this on youtube. keep up the good work. it is really appreciated !
Wow, this channel has grown considerably. Very well-made content, thank you, and congrats!!
I'm a mechanical engineer, and not particularly savvy in the EE domain, although I love me some first principles. The first thing that comes to mind when I think about compute-in-memory is the problem of locality. Our universe is pretty clearly local, and if you store all your stuff, then want to compute using that stuff, you have to bring relevant stuff near each other. Unless there's a way to bypass that locality requirement, it's hard to see how any clever computational or logical algorithm could avoid the problem of read/transmit/write times. There are obviously a lot of ways to reduce that cycle time, but I just don't see how compute happens without the process of read/transmit/write to memory. Maybe some fucking outrageous entanglement situation, but that would include like 10^10 (or wildly more) particles, and that's just not on the table right now. That would be a good goal, to have a computer that can simultaneously solve every problem contained within its memory.
Late reply, but you might be interested in a new intersection of physics principles and computing. Memory-induced long-range order (something physicists talk about) can be used to somewhat bypass this locality issue and perform hard computations in memory. Called "MemComputing", it utilizes a dynamical systems view of computing and it's all quite fresh! There is a good book by the physicist Max Di Ventra on the subject.
hbm3e ok?
3:14 Why does widening a highway not much help with traffic?
Choke points. Got get them chokes, ya know?
So ReRAM is what Veritasium talks about in his analog computing video (specifically what Mythic AI are doing)? Seemed really promising.
10:30 was just wondering if you were going to mention that disconnect, glad you did!
Just as a note, MRAM is actively commercialized in chip and subsystems today as an integrated NVM. ARM's Cryptoisland and TiempoSecure's Tesic are two security chips that integrate MRAMs, as examples.
The machines that run our society today are truly amazing. Great video!
No. Today our society is run by machines that are truely amkazing SHUT UPDSF SUDTPSDIFGDPETFR
How long would it take you to recreate it in the woods with only a hatchet?
@@royalwins2030 Cant. The minimum required to rebuild a world would be a lathe. Can craft virtually any requisite tool and most components of other machines. Humanity would never stoop to dark ages. There is a small chance if we stop using paper books and go full digital but thats a future problem.
There is a large hurdle of hardware ecosystem that really means the big companies are always going to control the AI landscape. As anybody who has had to suffer through tensor RT or any AI edge deployment framework knows, the closer you get to the hardware the more ecosystem support you need.
It occurs to me that perhaps much lower precision like 4 bit may be sufficient for AI purposes making the compute units much simpler and cheaper to build. Shedding all the exponential complexity that comes with higher precision would also greatly reduce computing time and power requirements. Something to think about.
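A rough Python sketch of what 4-bit storage could look like, assuming simple symmetric quantization with one shared scale per group of weights. The function names and the [-7, 7] code range are choices made for this illustration, not any particular framework's scheme.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit codes in [-7, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    # Round to the nearest code and clamp to the 4-bit range.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.31, 0.12, 0.0, -0.55]
codes, scale = quantize_int4(weights)
approx = dequantize(codes, scale)
for w, c, a in zip(weights, codes, approx):
    print(f"{w:+.2f} -> code {c:+d} -> {a:+.3f}")
```

Each weight ends up at most half a step (scale / 2) away from its original value, which is the kind of error the comment suggests a neural network may be able to tolerate.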
Awesome video!
Hearing dram made my head spin too. Then I realized we say R-A-M as ram, so add a "D" and it's dram. Ok. But then you said s-ram and I flipped my desk.
A DRAM (dynamic RAM) cell is made of one FET access transistor and a small capacitor connected to its source/drain; the gate is driven by the word line. Since the capacitance is small for reasons of speed, the capacitor loses its charge within milliseconds, and the DRAM has to be constantly "refreshed" (with a read cycle usually), so it can keep the bit properly memorised.
An SRAM (static RAM) cell works on a completely different principle. It doesn't use a capacitor to memorise the value of the bit (0 or 1) and doesn't need a refresh cycle.
An SRAM memory cell is basically an RS flip-flop (Set - Reset), which keeps the set level until it is reset. Therefore, instead of a single transistor, each SRAM cell is made of four to six transistors. So the SRAM cell takes more chip space, but it can run at the same clock speed as the logic circuitry; moreover, SRAM is much less susceptible to data errors and interference. The Mars probes use exclusively SRAM memory in their onboard computers.
The SRAM represents the ideal operating computer memory, but it takes six transistors instead of one for each memory cell...
@@rayoflight62 thanks ray that is a great explanation of the differences and I certainly appreciated and easily understood.
@@rayoflight62 im pretty sure he was talking about pronunciation
😂
12:37 could you try to explain it further than that?
😢
The simple material is physically modified vs dram using capacitors.
what if instead of always needing more bandwidth between computation units and memory, we reduced the model complexity to fewer parameters, while maintaining the same performance? This topic has been researched a lot, specifically for IoT devices, google squeezenet and optimal number of features for classification!
Old adage, perfection is not when there's nothing more to add, but when there's nothing left to take away. -some guy I can't be bothered to google.
@@Captaintrippz .... - Antoine de Saint-Exupéry
I have seen a concept proposed to use analog computing elements in a neural network application. The concept being that a MAC instruction can be implemented using a single analog element (e.g., transistor) and it can achieve several bits of accuracy. Also with neural networks the need for absolute accuracy is not there.
Great content, you have a gift for explaining technical concepts in an easy to follow manner.
13:50 It's like a backwards instruction set! Rather than the CPU waiting for data from memory, the memory can do quick simple operations more efficiently.
I don't know much about computers, but I have been following the growth of AI in chess. Conventional, brute force chess engines are still the most powerful tools for analyzing a chess game (Stockfish), but the newcomer using AI, known as Leela-Chess, is a very powerful number two and has beaten Stockfish many times in computer chess tournaments. Its play style isn't as accurate as Stockfish, but it's very fun, creative and plays a bit like a strong human with great intuition. But I still prefer Stockfish for its accuracy.
An outsider's way to perform compute within memory:
Sample memory-strings A & B are each 8-bit, each bit addressable.
A holds "01101001"
B holds "00010111"
When you want to add two bit-strings, you can feed each pair of bits into these two memories; "x:1, y:1, carry:0" goes to the memory-address (1,1,0), and that location in A yields the correct addition - "0", while that same location on B yields the correct carry - "1".
Fundamentally, we let the memory act as a *look-up table* for any obscenely complex function we choose, and we can change that function just by updating the memory, which sounds better than an FPGA. And, each address in memory can actually be a row of data, for same input, many outputs.
Latency?
We already have look up tables in programming. Bringing it lower level will require extensive configurability or a set algorithm. Either way, you're getting an FPGA or an ASIC. Remember that the precursor to the FPGA, the CPLD, was literally a memory platform retooled for computation.
@@dfsilversurfer For inference on massive neural networks, we already wait 30 seconds or more. Compute-in-memory is specifically for neural network applications; low precision, high parallelization, latency can wait. You definitely wouldn't need a custom chip just to run an app where you text your friend.
I don't get it. What is the memory-address format? How do the locations on A and B get the proper values for the result and carry respectively? Can you rephrase your solution to be more precise. Please use the bit values stored in A and B as in your original example so I can better follow your logic.
@@vaakdemandante8772 Sure! Each memory-slot, A & B, has a set of addresses, and at those addresses are held a bit, like this:
Address Value Stored in {A,B}
000 {0,0}
001 {1,0}
010 {1,0}
011 {0,1}
100 {1,0}
101 {0,1}
110 {0,1}
111 {1,1}
This uses only 16 bits of memory, and a *NON* programmable addressing, so it's simpler than the CPLD mentioned above, while still giving us a full adder!
Now, if we had two arbitrarily long bit-strings that we want to add, called X & Y, then we can feed the first pair of bits from those strings into this memory-logic, using those bits *as the address that we will look-in, for A & B* . For example, if the first bits from our arbitrarily long strings are both "1", while we have *no carry-bit yet* , then that combined address is "011", meaning "0 carry-bits so far, and a 1 for string X and a 1 for string Y." At that address, "011", the value stored by A is "0" which is the correct answer for "what value should be stored in that addition's binary slot?" (1 + 1 in binary leaves a 0 in the one's column.) At the same time, that *same address* points to the value "1" in B, which is the correct answer for "what value should be *carried* to the next binary digit?" (1 + 1 leaves a 1 in the two's-column.) I mentioned "A & B" as separate vectors, but you would really just have a 2 x 8 block of memory, and pull both bits, A's & B's, at once.
That is just a simplistic example. Realistically, memory would be in much larger blocks, such as 16 x 64, for massive, nuanced, bizarre logics to be held as fast look-up tables. This isn't a cure-all best-ever; it's for massive parallel operations, reconfigurable on the fly, with an eye toward advanced machine learning.
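For anyone who wants to try the table above, here is a small Python sketch of the same look-up-table adder. The two strings are the A and B vectors from the original comment, and the address is (carry, x, y) as described; the function names are invented for this example.

```python
# Look-up-table full adder using the 2 x 8 table from the comments above.
# Address bits are (carry_in, x, y), most significant bit first.
SUM_TABLE = "01101001"    # vector A: the sum bit stored at each address
CARRY_TABLE = "00010111"  # vector B: the carry-out bit stored at each address


def lut_full_adder(carry_in, x, y):
    """Return (sum_bit, carry_out) by reading both tables at one address."""
    address = (carry_in << 2) | (x << 1) | y
    return int(SUM_TABLE[address]), int(CARRY_TABLE[address])


def lut_add(x_bits, y_bits):
    """Add two equal-length bit strings (MSB first), one column at a time."""
    carry = 0
    out = []
    # Walk from the least significant column to the most significant one.
    for xb, yb in zip(reversed(x_bits), reversed(y_bits)):
        s, carry = lut_full_adder(carry, int(xb), int(yb))
        out.append(str(s))
    out.append(str(carry))  # the final carry becomes the top bit
    return "".join(reversed(out))


print(lut_full_adder(0, 1, 1))  # address "011": sum 0, carry 1
print(lut_add("0110", "0011"))  # 6 + 3
```

No arithmetic operator ever touches the operands: every output bit is just a memory read, so swapping the two table strings swaps the function being computed.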
Memory latency is a big factor in CPU design. Tremendous effort has gone into caches, branch prediction, out of order execution, and prefetching to mitigate issues of memory latency.
all of which are fundamental security issues.
At 0:15 - This is the spot where I was already lost beyond help! Got a smart cookie here.
Your videos are pure liquid gold. You refine and explain information in a unique way. Thank you for providing this type of content.
Velveeta
Idols always do one thing, fall!
Excellent video, but I must point out a few inaccuracies. First of all, let me introduce myself. I've been working with Computer Engineering for over 20 years, and I'm currently working in a company that makes chips for accelerating AI applications using exactly a non-Von Neumann architecture.
1. GPUs also don't use a Von Neumann architecture, otherwise, they'd be terrible for AI. They might not be as good as architectures tailored for AI (like Google's TPU).
2. At-memory computing is great for all applications, not just edge ones. It allows for more performance and for saving a lot of power. On an edge device, these translate to longer battery life, and in a data center, they translate into millions of dollars in cost savings.
3. DRAM is one type of memory; there's also SRAM.
4. The eDRAM is not actually a problem. Yes, they're hungrier and larger, but they're also much faster. The L1 cache in modern CPUs is never implemented with DRAM cells, that's why it's small, but it also means it can run at the same clock rates as the rest of the CPU. That's the same reason why CPUs have a cache hierarchy that goes from faster and smaller to slower and larger (L1 -> L2 -> L3).
5. The problem with exotic technologies like R-RAM is that we'll need entirely new production methods, and that's a big problem because the industry will need to see that as the way forward in order to invest in those new fabs. Until non-conventional memory technologies are considered a somewhat safe _commercial_ bet they won't catch on.
The industry differentiates between in-memory and at-memory computing:
In-memory means that computation happens, quite literally, in the memory cells. That minimizes data movement, but the solutions here are analog. That requires converting from digital to analog and then back to digital. This double conversion negates many savings obtained by in-memory computing. Excellent video about in-memory computing: ua-cam.com/video/IgF3OX8nT0w/v-deo.html&ab_channel=Veritasium
At-memory computing means that data movement is reduced by keeping data very close to where computing takes place. What you have is _distributed_ memory. It's an all-digital solution that uses standard tools and standard manufacturing methods. That's where most of the industry is headed right now. Disclaimer: that's the architecture my company adopted for its chips.
In a bit of futurism, I think that because AI is such an important application, eventually enough R&D money will be poured into in-memory computing that it'll eventually win out. But I don't see that happening for another 7 to 10 years.
Other than those, the video is really excellent and well-researched. The memory wall explanation is spot-on.
Saying that every computer is Von Neumann is loosely true. Most microcontrollers are still modified Harvard architecture computers with a separate program memory bus whereas modern PC and server processors operate like modHarvard only when the program is not exceeding cache size and the memory bus is only used for data transfers.
IMO using general purpose VN architectures is the first issue here.
There would also need to be massive increases in onboard cache. It is interesting how we have already hit size limits on dram, we are reaching similar limits on cpu's too, I suspect then we will see the fix improving both - multiple layers. Today the routing has for a long time been the biggest delay so if you could double layer then we would see much faster chips.
Problem then is heat....
"Many of these limitations tie back to memory and how we use it" - Pretty much any computing problem be like
13:17 - "in sitch ooh" or "in sit yoo". It's a bastardization of "in situation" so it's however you pronounce situation minus the -ation.
I'm just glad something has finally spurred all of the giants to get off their asses and fear being left behind again. I'm guessing the mad rush to push memory and processing to ludicrous speed will have (and has already had) some knock on benefits for the average consumer.
One thing I don't like is how the PC is becoming less and less viable. Things are being pushed towards mobile and data centers in functionally the same way that things evolved into terminals (or thin clients) and mainframes (or supercomputers) in the 80's and 90's.
2:33 Where did you get your 320 count for A100/80's?
Thanks for the video!
I grew up in computing as a systems programmer on Univac multiprocessing systems in the 70s. Then, we were all aware of the contention of memory access. Cache memory and memory banks eased the problem somewhat. Years after leaving Univac I developed a radical technology. In this technology a thousand processor cores can access a common memory without conflict. It has a few drawbacks. Implementing stacks would be unlikely. Instructions are somewhat slower than are common today. Cache memory would not be practical. Pipelining would also not work. Here is how I devised this plan. I went back and looked at how computing evolved. Then, the speed of data through a wire was considered instantaneous. I began this trip with the idea that transmission was slow. How would the inventors deal with that? I began implementing a solution with a small gate array that I estimated would compete with an i5. But I got hit with a massive heart problem. I simply don’t know what to do with it because I’m a nobody.
@@achannel7553
Well, it's not as great as it sounds. They are only 8-bit processors. And on-chip memory is somewhat limited. External standard memory would not work, as specialized memory is required.
See, the entire concept of "computing memory" already exists as you mention it, and it is called "cache memory" on the same processor chip.
Expanding the concept, and replacing the entire DRAM bank with cache memory is challenging, but it can be helped by the multi-core chip design, and by the use of chiplets AMD-style.
I don't see ReRam taking up the market, there is more to come. If you recall the memory oscilloscope tubes from the '60s, which were a wonder at the time, the same concept can be applied at chip level, where the transistors on the chip can change their functions based on the application of an electric field. In the end, the same transistor can work as a logic gate or as a memory cell, as commanded by a supervisory circuit.
This improvement would require a redesign of all software EDA tools, and you know that can take a decade.
Thank you for this pioneering video of yours on the most challenging computer design problem of the century.
Regards,
Anthony
Doing basic logic operations on entire memory rows completely changes how we store and process data. For instance, you would store an array of numbers with all the highest bits stuffed together, and all the lowest bits stuffed together, and so on, which is quite the opposite of the current practice of storing all bits of each single number together.
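A toy Python sketch of that bit-plane layout, with each "memory row" modelled as one Python integer whose bit i belongs to element i. All function names here are invented for the illustration, and real in-memory hardware would look quite different. Adding a constant to every element then takes a handful of row-wide logic operations per bit position, no matter how many elements there are.

```python
def to_bit_planes(values, width):
    """Transpose numbers into bit planes: plane k packs bit k of every value."""
    planes = []
    for k in range(width):
        plane = 0
        for i, v in enumerate(values):
            plane |= ((v >> k) & 1) << i
        planes.append(plane)
    return planes


def from_bit_planes(planes, count):
    """Inverse of to_bit_planes: rebuild the list of numbers."""
    return [
        sum(((plane >> i) & 1) << k for k, plane in enumerate(planes))
        for i in range(count)
    ]


def add_constant(planes, constant, count):
    """Bit-serial add of one constant to all elements using row-wide logic."""
    all_ones = (1 << count) - 1
    carry = 0  # one carry bit per element, packed into a single row
    result = []
    for k, plane in enumerate(planes):
        c_row = all_ones if (constant >> k) & 1 else 0
        result.append(plane ^ c_row ^ carry)  # sum bit for every element
        carry = (plane & c_row) | (plane & carry) | (c_row & carry)
    return result  # overflow wraps modulo 2**width


values = [3, 5, 10, 15]
planes = to_bit_planes(values, width=4)
print(from_bit_planes(add_constant(planes, 1, len(values)), len(values)))
```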
I think you fundamentally misunderstand how in-chip memory (sram) works. Each bit in a register is operated on in parallel for each word in all CPUs. And each bit, from each word, from each wave-front in GPUs/SIMD processors.
I wonder if we'll ever be building PCs with "AI Cards", much like Video Cards, Sound Cards etc
Sounds very interesting!
There are already plenty of examples of this, like Apple M1
@@grizzomble Yes there are, and in the coming years we are going to see much more of those chips.
@@grizzomble what OP is saying more like upgradeable pcie add-on card that boosts AI rather than SOC
another option might be to change how AIs work in general. rather than each neuron checking every neuron on the next layer, you could limit them to only talking to the ones next to them or 1-2 neurons away. it would mean we would need larger AI models, but I'd imagine it would be much easier to work with in a real-life 3D space. you could feed data in one side and have it come out the other. it would still need to somehow be trainable, and to have the data read from each neuron if you ever wanted to copy the chip's content into a software model. Given the massive increase in performance this would offer, even if the processing and memory storage is slower, it would still be better. Essentially, you could make bigger and bigger AIs by just making the chip larger and giving it more layers. it would gain massive benefits the larger the AI is you want to work with: rather than one neuron at a time like with standard models, you'd be working with one layer at a time no matter how large it is. the potential gains should be obvious if the technical challenges could be worked out.
I believe current models are based on how the human cerebellum works. Limited input, massive bandwidth and parallelization in between to derive a small but coherent output from those small inputs.
Super informative about openai gpt and how all this is processed. (hope I've given you the keywords you need ) the depth you go into things is always astounding and no, I don't expect you to explain that wizardry. Thanks for making these, you're awesome.
The video doesn't talk about that, thanks for messing up everyone's search results 😑
Great video.. reminds me of the structure of the old SIMD ICL/AMT DAPs where AMT was Active Memory Technology.. the first machine I ever wrote a back prop ANN program in parallel fortran. For the right problem it was very efficient.. love to see one on a single die with enough memory of course.
Hello wonderful person reading this comment
You're an Anton fan aren't you.... Hello back at you, wonderful person.
Damn... You're my you tube subscription clone!
@@kayakMike1000 it's a bot subscribed to all big enough channels and looking for platform time before being used in a scam.
Why hello there wonderful person who wrote that comment
Hellooooooo
3:12 how does widening a highway not improve traffic much?
It's pronounced D-ram not Dram
Jon says S-ram correctly.
I think he's just trolling. 😉
Da-ram
@@jimurrata6785 yeah at this point I'm convinced he does it on purpose
For low memory latency deep learning accelerator applications, there are also innovations at the system protocol level through CXL (Compute Express Link), allowing for memory sharing and memory scaling (of multiple memory types). Great topic. Yes, device, circuit, and systems architecture innovations are needed for next gen AI.
I wish he had included something on CXL as that should be a game changer for all the problems he mentions right?
You lost me at dram, say after me. deeram
It's kinda funny how often theoretical abstractions are quite literally turned into real products, while the abstraction itself accommodates a much, much wider class of possible implementations. For instance, nothing in the von Neumann architecture says that the CPU or the memory needs to be a physically separate and monolithic device. It's just that they can be considered that conceptually for theoretical analysis.
DRAM is pronounced as D-RAM not “drain” …
I can help you explain it further. A device that changes its resistance and then keeps that resistance is called a memristor, which was the last discovered of the quartet of fundamental electrical components, which also comprises the resistor, capacitor and inductor. Basically, a memristor is a resistor that remembers its resistance.
I'm reminded that it's time to reconsider: 1) Transport-Triggered-Architecture 2) Flow-Based-Programming and 3) Systolic-Arrays in light of the proposed solution vectors to the "memory wall" issue outlined here. I believe clockless designs are appropriate for power-saving modalities.
6:01 I think it would be better to put all numbers in the same terms, so 128x, 20x, and 1.3x
For a text on this see "Content Addressable Parallel Processors" (1976) by Caxton Foster. One of the earliest air traffic control computers, Staran, was one of these.
This is where the Tesla supercomputer design is a brilliant dance around the memory wall … looking forward to the results coming later this year when the first pod is completed
My thinking about electronics changes every time with your new video! THANKS FOR MAKING VIDEOS WITH GREAT EXPLANATION AND THOUGHTS!
Great video man, really helpful to get an overview of the hardware limitations of current systems. thanks.
When you talk about high precision and low precision human brains early in the video, I immediately thought of a TechTechPotato video about IBM making an architecture for lower precision but “wider” compute. Basically they are almost like PCIe graphics cards but for server computing only.
As someone working on a machine learning project...that is so many levels beneath the stuff you're talking about, this video is really eye opening.
Going to use this as motivation to finish my work.
Your videos are excellent because of the content but mostly because you don't have background music, noise, and kettledrums interfering with the narration.
12:22, I think it might use the laws involving resistance (like those from circuits in parallel, series, etc) to cleverly infer from some 'measured' resistance the addition of data stored within the memory. It sounds similar to what analog computers do, and how they can be used at lower costs to implement trained neural networks for inference.
Doubt it. Give me an example of a pure analog computer. Those measured resistances are far too imprecise. And why would you do it if you can fit a billion transistors on the size of a stamp?
@@computerfis Check out Veritasium's video on analog computers. He shows modern examples of analog computers which are used in inference with trained neural network models.
As for the reason why these are used, the video goes into it, but in short it is cost. Both cost of production and energy cost.
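For anyone curious what "letting the resistances do the addition" from this thread could look like, here is a small idealized Python sketch. It only applies the standard series/parallel rules plus Ohm's and Kirchhoff's laws; it is not a description of how any particular memory chip works, it ignores the noise and precision problems raised above, and the function names are made up for the example.

```python
# Idealized sketch of computing with resistances. No noise, no wire
# resistance, no device variation: all of that is assumed away here.

def series_resistance(resistances):
    """Resistances in series simply add."""
    return sum(resistances)


def parallel_resistance(resistances):
    """Resistances in parallel: the conductances (1/R) add."""
    return 1.0 / sum(1.0 / r for r in resistances)


def crossbar_currents(voltages, conductances):
    """Currents collected on each output column of a resistive crossbar.

    conductances[i][j] is the conductance (in siemens) of the cell joining
    input row i to output column j. By Ohm's law each cell passes
    V_i * G_ij, and by Kirchhoff's current law the column wire sums them,
    so each column yields a dot product without any digital adder.
    """
    columns = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(columns)
    ]


if __name__ == "__main__":
    print(series_resistance([100.0, 220.0]))
    print(parallel_resistance([100.0, 100.0]))
    print(crossbar_currents([1.0, 2.0], [[0.5, 0.0], [0.25, 1.0]]))
```

The crossbar function is the interesting part for neural networks: the stored conductances play the role of weights, the applied voltages are the inputs, and the column currents are the weighted sums a layer needs.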
CAM (content-addressable memory), aka "associative memory", was on the drawing boards in the mid-00s, with an eye toward AI, and was essentially a hardware implementation of the associative array construct found in most programming languages. It did find use in network routers for fast hardware lookup of routing table entries.
CAM is as old as computing. It's not news.
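A rough software model of the router use case mentioned in this thread, as a sketch only. The class name `TernaryCam` and the helpers `ip` and `prefix_mask` are made up for the example. Real hardware compares every stored entry at the same time; the loop below just stands in for that. Picking the match with the most mask bits is also a simplification of how real devices prioritise entries.

```python
# Sketch of a ternary content-addressable memory (TCAM) used for
# longest-prefix routing lookups. All names are hypothetical.

class TernaryCam:
    def __init__(self):
        self.entries = []  # list of (value, mask, payload)

    def add(self, value, mask, payload):
        """Store an entry; mask bits set to 0 are "don't care" positions."""
        self.entries.append((value, mask, payload))

    def search(self, key):
        """Return the payload of the most specific matching entry, or None.

        Hardware checks all entries in parallel; this loop emulates it.
        """
        best = None
        for value, mask, payload in self.entries:
            if (key & mask) == (value & mask):
                specificity = bin(mask).count("1")
                if best is None or specificity > best[0]:
                    best = (specificity, payload)
        return None if best is None else best[1]


def ip(a, b, c, d):
    """Pack four octets into a 32-bit integer address."""
    return (a << 24) | (b << 16) | (c << 8) | d


def prefix_mask(length):
    """32-bit mask with the top `length` bits set."""
    return (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF


if __name__ == "__main__":
    cam = TernaryCam()
    cam.add(ip(10, 0, 0, 0), prefix_mask(8), "port1")
    cam.add(ip(10, 1, 0, 0), prefix_mask(16), "port2")
    cam.add(0, prefix_mask(0), "default")
    print(cam.search(ip(10, 1, 2, 3)))
```

The point of the comparison with an associative array is that you hand the memory the content (the destination address) and get back what is stored against it, rather than handing it a numeric location.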
Would you consider making a video about Neuromorphic Computing? The in-situ computing you covered still tries to implement the traditional gate sets like AND and XOR. Neuromorphic Computing seems not to fall under these directions you covered.
What about neuromorphic chips? Are they some kind of processing memory also?
Just stumbled on your channel. This is a great video, can't wait to watch your future ones.
You can't do HPC on the cloud....cloud architecture specifically prevents this....
Now you _can_ do HPC on a custom cluster. It's more difficult because you have to manipulate the instructions to keep the memory as close to _every_ processor as possible! But this _can be done!_
I feel the need to mention here that I consider (along with others, I'm sure) that the proper term is "stored program computer". Von Neumann had nothing to do with the invention of the stored program computer design, which was due to Eckert and Mauchly, the designers of ENIAC. Von Neumann simply took notes on the design and published it under his own name, which led to a misinterpretation that he (ahem) declined to correct.
Bro, you are exactly what I have been looking for. Solid work and I am now a devout subscriber. Seriously genius content here.
Two things I wish you had mentioned that could be utilized to help out with this process. First is the number of cores that exist on a CPU, but the way they are accessed is also very critical. Several companies are designing CPUs that will be able to work hand in hand with GPUs in the way their core systems are designed, with multiple companies now designing CPUs with over 100 cores per chip. Second is the use of quantum computing in processing logic gates, among other things. IBM's photon chip tech combined with quantum computing is going to revolutionize the way we use computers at a scale that is hard to imagine. Exciting times we live in.
It would be interesting to start that development with feature rich memory - not completely computation units but memory that can perform xor write and block copy that should integrate with existing technology quite well
David Patterson of UCB did not invent RISC. That was IBM with the 801 minicomputer project. A fellow named Glenn Myers wrote a book about what they learned designing it called "The Semantic Gap". Patterson became a disciple of the philosophy and did a great deal to spread the word.
This video is great dude, keep it up! Would be nice if you included references
I'd be extremely careful with statements like "the human brain runs on low precision".
The understanding that led to our current-day nodal computational fuzzy logic (i.e. 'Neural Networks') is primitive in the extreme when compared with the flexibility an actual network of actual neurons exhibits. Not only are neurons not restricted to a pure binary state, they do far more. Through the base setup of different types of neurons and their ability to react differently to different stimuli, certain things like communication are apparently hardwired into our brains. That's the reason why we still haven't been able to even come close to the computational efficiency of a "simple" Petri dish with a few thousand single-type neurons.
It's a bit annoying to anyone who knows both the maths based simplistic nodal networks and the research into how our brains function because it's really an apples to rocks comparison.
I mean, neural networks don't even support information backflow at this point, and that was a known feature of neurons even before the term neural network was coined...
Even the newest neuromorphic hardware lags way behind the human nervous system.
There is a simple solution... albeit one which holds the real possibility of radically changing the entire computer industry. Memristors. Memristors instead of transistors, were we able to achieve high-volume production, would be the only thing we needed. Memristors can replace the CPU, GPU, RAM, and SSD. You can intermix compute and memory trivially easy with memristors. And you can change whether a section of a memristor grid is computational or memory storage (all non-volatile, and all as fast as registers) on the fly with microsecond timings on changing it. For reasons I have thus far failed to understand, those using memristor technology have focused almost exclusively upon using them for neuromorphic accelerators. That is a tremendous waste. If you could efficiently mass-produce large volumes of memristors at small scales, you could ignore the neuromorphic aspects and instead replace the entire CPU, GPU, RAM, and SSD industries wholesale with a single product. Their design, at the production level anyway, would also be very simple, similar to how SSDs are just big grids of NAND gates compared to the monstrous complexity of mechanical hard drives. Price-fixing tactics have artificially preserved the existence of mechanical hard drive companies, and we might end up seeing that with memristors as well, which would be very sad, and would continue holding us back just to keep some current billionaires rich.
Well put together seminar. You have presented some real hardware design, a bit of the physical constraints, and marketing issues. I bet, though, that the problem lies not only in an obsolete architecture but also in flawed algorithms that spend a lot of resources on tasks that can be skipped, not recalculated, or just inferred by other means, which altogether leads to fewer steps in intermediate computations and lower storage needs to do the very same job. Be it in the CS field or even the brute maths field.
Wait. . What? Adding a lane doesn’t help traffic? Who knew?
Fawk, man. . . The breadth and depth of your knowledge is stunning!
There is none like you out there, thanks for all the juicy info.
I suspect, and we have seen this already with more and more optimized ASICs, that we will end up with a world where AI models are no longer being run on generic hardware but are created in silicon entirely, with memory directly attached (chiplet designs) or on die, as mentioned. The cost to build models with tens or hundreds of trillions of parameters means that producing a dedicated chip for this purpose and cramming a few thousand of them in a rack is not really cost prohibitive anymore.
The bigger question I think we should ask is what the future of these models is. So far they are fun and curious, but they are not really adding much value compared to their cost to build. As much smarter people than me have pointed out, the current inference AI "revolution" feels a lot like the Web3 movement, where there is a lot of money sloshing about with VCs if you mention you are doing something with this great new tech. Yet no one seems to have built a convincing argument for this tech when it comes to making the kind of returns in the next 3 to 5 years that a VC would want to see from at least one of their bets in this space.
It could very well be that in the near future the money will dry up and the big innovations and the whole new world that people are predicting simply do not materialize.
This does not mean that the technological obstacles should not be overcome, though. Our memory wall problem is a real, serious issue that affects far more than just AI, so the sooner this is solved the better. But it would probably mean that the need to solve this quickly will be felt a lot less by the industry, and that the actual "solution/workaround/innovation" needed to slay that dragon is way further off than we currently hope.
Absolutely amazing content. I really can't sing your praises enough.
Would be interesting to know your background.
Thanks for this cool video, and happy New Year!
This is wonderful information. A great breakdown. My thoughts were large parallel access to memory (no memory controller with any addressing required) and lots of SRAM. It seems that broadcasting out data is one thing, but needing the results of a lot of data in a timely manner is the real core of the problem. My gut feeling is the true solution to this is quadrillions of simple input->output units that have baked-in logic tables. And no intermediary memory controller. No row/column access. Just extreme connectivity. Extreme parallel connectivity.
and 3D dies the size of a dinner plate.
2:46 - and economics again: the brain is the way it is because of the economics (of the world); computers still have to perfect this. The brain doesn't have to perform such operations (it performs them implicitly, so to speak, or on the fly). :)
4:09 - they mention economics themselves :)
7:00 - nice play on words, small - small ;D
8:00 - I came up with something like this myself at university :) PCM memory could be used :)
12:00 - resistive RAM; so I was right after all xD, I was xD
en.wikipedia.org/wiki/Resistive_random-access_memory
12:43 - they are simply heated up or cooled down, and that determines whether the state is crystalline or amorphous :)
I like when your channel picture pops up at the end of videos after a long hard think. I would give that deer some Gardetto's.
The original CM-1 Connection Machine had a similar design: up to 2^16 1-Bit ALUs, each integrated with 1024 bits of memory.
Impressive summary as always. Thanks for all your hard work!
Great tutorial and relevant conclusion about deep learning expectations. Thank you for your serious work and investigation.