hu-po
  • 355
  • 666 169
Simon (2024 Gemini Competition Submission Video)
Views: 1,411

Videos

gemini 1m tokens
1.1K views · 9 months ago
Beyond Surface Statistics: LDM deep dive
2.4K views · 1 year ago
Speech2Speech AI Conversation App
2.5K views · 1 year ago
What is RLHF?
5K views · 1 year ago
Robots using LLMs
4.2K views · 1 year ago
Visual ChatGPT
1.2K views · 1 year ago
MathPrompter
828 views · 1 year ago
Vid2Avatar
1K views · 1 year ago
StyleGAN-T
1.8K views · 1 year ago
Speech to Calendar with OpenAI's Whisper
540 views · 1 year ago
Inside LLMs
1.6K views · 1 year ago
Robotic Microbe Farms
335 views · 1 year ago
Polars vs Pandas
2.1K views · 1 year ago
AI Platforms and Markets
602 views · 1 year ago

COMMENTS

  • @thivuxhale · 1 day ago

    1:48:20 If we already have enough innovations in research to reach AGI, would I make a bigger impact by going into industry rather than research? It feels like, doing research these days, you have a really small chance of creating something impactful and fundamental; most of the research is incremental.

  • @deathfighter1111 · 1 day ago

    In equation 18 the notation is wrong: when passing from n_i to a_k it should be a_i; when you do the summation you can see the author made a mistake with the notation.

  • @wolpumba4099 · 2 days ago

    *Visual Autoregressive Modeling: A New Approach to Image Generation*
    * 4:13 Introduction: The paper "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" introduces a novel method for image generation.
    * 4:21 Key Idea: Unlike traditional autoregressive models that predict the next token in a sequence, this approach predicts the next *scale* or *resolution* of an image.
    * 6:28 Context: The first author, a former intern at ByteDance, is involved in a legal dispute with the company regarding alleged disruption of internal model training.
    * 9:13 Performance: The model achieves state-of-the-art results on the ImageNet 256x256 benchmark, particularly in Fréchet Inception Distance (FID) and Inception Score, with significantly faster inference speed.
    * 15:06 Traditional Approach: Current methods typically convert images into a 1D sequence of tokens using a raster scan order, feeding them into models like transformers.
    * 17:48 Proposed Method: This paper introduces a hierarchical, multi-scale approach, akin to how convolutional neural networks (CNNs) process images, eliminating the need for positional embeddings used in traditional models.
    * 19:13 Analogy to CNNs: The multi-scale approach is analogous to how CNNs use receptive fields to progressively aggregate information across layers, a concept inspired by the human visual system.
    * 23:59 Advantages: This approach offers better results, a well-written paper, and a conceptually simple yet effective idea, contributing to its recognition as the best paper at a major conference.
    * 27:43 Tokenization: Uses a standard VQ-VAE (Vector Quantized Variational Autoencoder) to convert images into discrete tokens.
    * 41:35 Core Innovation: The main innovation lies in how these tokens are processed - not in a linear sequence, but in a multi-scale hierarchy.
    * 54:22 Implementation Detail: Different resolutions of the token map are achieved through interpolation, a technique to estimate values between known data points (see the sketch after this summary).
    * 56:05 Key Takeaway: This method demonstrates that simpler, more intuitive approaches can outperform complex ones, and it is likely to be widely adopted in various applications, including image and video generation.
    * 59:08 Efficiency: Parallel processing at each resolution level, similar to how CNNs operate on GPUs, leads to a 20x speedup compared to traditional autoregressive models.
    * 1:01:49 Complexity Analysis: The time complexity is reduced from O(n^6) for traditional models to O(n^4) for the new approach, making it more scalable.
    * 1:02:45 Shared Codebook: Interestingly, the same vocabulary (codebook) of tokens is used across all scales, which is counterintuitive but contributes to the model's effectiveness.
    * 1:12:55 Scaling Laws: The paper demonstrates scaling laws, meaning that increasing model size predictably improves performance, a crucial property for training larger and more powerful models.
    * 1:20:23 Conclusion: The paper's success is attributed to both luck (choosing the right idea) and skill (well-written paper, good figures, and strong results).
    * 1:33:16 Complexity Proof: The video discusses the mathematical proof of the model's time complexity, highlighting the clever use of geometric series to simplify the analysis.
    * 1:39:31 Limitations: The discussion acknowledges the limitations of current language models in understanding and reasoning about the physical world, as exemplified by the "mosquito test."
    * 1:42:41 Future Work: Potential future directions include improving the tokenizer, applying the method to text-to-image and video generation, and exploring its use in other domains beyond images.
    I used gemini-exp-1206 on rocketrecap dot com to summarize the transcript. Input tokens: 42402 Output tokens: 868
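
To make the next-scale idea above concrete, here is a minimal sketch of building a coarse-to-fine pyramid of token maps, assuming a toy PyTorch codebook, nearest-neighbour quantization, and an illustrative scale schedule; it is not the paper's VQ-VAE or training code.

```python
import torch
import torch.nn.functional as F

def multiscale_token_maps(latent, codebook, scales=(1, 2, 4, 8, 16)):
    """Quantize a feature map into a coarse-to-fine pyramid of token maps.

    latent:   (C, H, W) continuous feature map from an encoder.
    codebook: (K, C) embedding vectors shared across all scales (as in the paper).
    Returns a list of (s, s) index maps, one per scale.
    """
    C, H, W = latent.shape
    residual = latent.clone()
    token_maps = []
    for s in scales:
        # Downsample the current residual to the target scale.
        small = F.interpolate(residual[None], size=(s, s), mode="area")[0]
        # Nearest-codebook assignment (flatten spatial dims -> (s*s, C)).
        flat = small.permute(1, 2, 0).reshape(-1, C)
        dists = torch.cdist(flat, codebook)          # (s*s, K)
        idx = dists.argmin(dim=1).reshape(s, s)      # token map at this scale
        token_maps.append(idx)
        # Up-interpolate the quantized map back to full resolution and
        # subtract it, so finer scales only model the remaining detail.
        quant = codebook[idx].permute(2, 0, 1)[None]
        up = F.interpolate(quant, size=(H, W), mode="bilinear", align_corners=False)[0]
        residual = residual - up
    return token_maps
```

The transformer is then trained to predict each scale's token map conditioned on all coarser ones, which is what allows every token within a scale to be generated in parallel.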

  • @EobardUchihaThawne · 2 days ago

    How does it handle image inputs? In the figure they show something like s e1 1 2 3 4 e2 1 2 .... 9 .... Is it flattening the image?

  • @MilesBellas · 2 days ago

    "Loop holes" = why the UK tax system is deliberately undermining the country, creating polarization of capital.......rich people operate from zero tax islands with 100% law enforcement protection, totaling BILLIONS..........while simultaneously ram raiding small family shops for selling untaxed cigs ie The Spider's Web documentary

  • @MilesBellas · 2 days ago

    "Human evaluations liked mine most..... because I am a the human evaluating it !" 😀🤣

  • @xx1slimeball · 2 days ago

    cool vid, cool paper

  • @mlcat · 2 days ago

    The same guys (and the same GitHub account) who made VAR also released a new text-to-image paper & model a few days ago, called Infinity

  • @spirobel2.0 · 3 days ago

    banger stream

  • @hjups · 3 days ago

    The shifting idea isn't new either; it was proposed in the SD3 paper, where s = sqrt(m/n). From experience, I have noticed shifting has an impact at low step counts, but I have never seen it have an effect like they showed in Figure 11. Maybe the linear-quadratic schedule is simply a bad idea that starts to show up at low step count, or their result is a product of the distillation method? Interestingly, when applying classic CFG, shifting the timestep schedule at inference tends to negatively impact eval metrics. Regarding video quality, I think you are downplaying the issue with consistency. You could scale up the pixel fidelity and make the models cheaper, but then you'll still have the weird cases where the spaghetti isn't eaten, or gymnasts with limbs folding into and out of each other (Sora has this problem). The dataset alone won't fix this issue since it's fundamental to the way information is processed within the network. Scaling helps, but that also comes with increasing cost. I like to think about it via the following analogy: a transformer is the FPGA equivalent of deep learning - it can implement any function as long as you have enough "gates". However, you wouldn't want to train a big DNN on an FPGA; you would instead favor a custom solution tailored to the type of data flow (e.g. GPU, TPU, or CGRA).
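
For reference, a minimal sketch of the SD3-style resolution-dependent timestep shift the comment refers to (the s = sqrt(m/n) factor); the exact formulation used in the paper under discussion may differ.

```python
import math

def shift_timestep(t: float, m: int, n: int) -> float:
    """Shift a flow-matching timestep t in [0, 1].

    m: number of image/video tokens at the target resolution.
    n: number of tokens at the base resolution the schedule was tuned for.
    Larger m (higher resolution) pushes sampling toward higher noise levels.
    """
    s = math.sqrt(m / n)
    return s * t / (1 + (s - 1) * t)

# Example: a 1024x1024 image has 4x the tokens of a 512x512 one (m/n = 4, s = 2),
# so the midpoint t = 0.5 is shifted to 2*0.5 / (1 + 1*0.5) ≈ 0.667.
```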

  • @wolpumba4099 · 4 days ago

    *Video Generation: Exploring the Latest Advancements and Future Implications*
    * 0:01 Introduction: The discussion starts with a focus on the "Hunyuan Video" paper by Tencent, highlighting advancements in open-source video generation.
    * 3:27 Hunyuan Video: This paper details a systematic framework for large video generative models, similar to Meta's MovieGen, but without data specifics. It emphasizes an open-source 13 billion parameter model capable of generating high-quality videos.
    * 5:11 Replicate Demo: The speaker demonstrates the model's capabilities using Replicate, comparing its output to Google's Imagen 3 and highlighting its ability to handle dynamic scenes.
    * 6:41 Will Smith Spaghetti Benchmark: The speaker tests the model with the "Will Smith eating spaghetti" benchmark, noting its limitations in capturing complex mouth movements.
    * 7:14 Data Filtering and Model Architecture: Discussion on the importance of data curation, model architecture, progressive model scaling, and efficient infrastructure, emphasizing the significance of data filtering in model performance.
    * 10:48 Data Filtering Techniques: Detailed explanation of data filtering techniques, including splitting raw videos, identifying clear frames, deduplicating similar clips, and applying K-means for concept resampling.
    * 12:23 Hierarchical Data Filtering Pipeline: Description of a multi-stage filtering pipeline involving automated and manual filtering to construct training datasets.
    * 14:34 Structured Captioning: Introduction of a novel in-house vision-language model for generating structured captions, enabling synthetic data creation and diverse caption generation.
    * 16:54 Prompt Rewrite Model: Explanation of a prompt rewrite model designed to adapt user prompts to model-preferred structured formats, facilitating non-expert users in generating high-quality videos.
    * 21:07 VideoGen-of-Thought: Reference to a paper on a collaborative framework for multi-shot video generation, highlighting the potential of combining high-level prompts with per-shot descriptions and identity-preserving embeddings.
    * 23:06 3D Variational Autoencoder (VAE): Explanation of the 3D VAE's role in compressing pixel-space videos into a compact latent space for efficient generation.
    * 24:49 Spatial-Temporal Tiling Strategy: Description of a technique for generating videos in tiles to enable high-resolution output on a single GPU.
    * 26:39 Dual-Stream to Single-Stream Hybrid Model: Explanation of a model design that processes video and text tokens independently in the dual-stream phase and concatenates them in the single-stream phase for effective multimodal information fusion.
    * 27:59 Rope Embeddings: Visualization and explanation of rope embeddings for encoding spatial information in video generation.
    * 29:23 Global Guidance with Text Encoders: Discussion on using the final non-padded token of text features as global guidance for video generation.
    * 31:10 Model Scaling Laws: Exploration of scaling laws relating model parameters, computation, and the number of tokens to training loss.
    * 33:58 Flow Matching: Overview of flow matching for predicting the movement from noise to image distribution in diffusion models (a generic training-step sketch follows this summary).
    * 35:03 Image and Video Training Strategies: Discussion on pre-training strategies involving lower to higher resolutions and categorizing training data into buckets for optimized GPU resource utilization.
    * 37:04 Prompt Rewrite Stages: Detailed explanation of the three stages of prompt rewriting: multilingual input adaptation, rephrasing, and simplification.
    * 39:29 Model Acceleration with Time Shifting: Introduction of a time-shifting strategy to improve inference efficiency by focusing on earlier time steps in the generation process.
    * 43:01 Text-Guided Distillation: Explanation of a technique to distill classifier-free guidance into a single student model, accelerating inference by 2X.
    * 45:19 Future of Consumer GPUs: Discussion on the potential decline of consumer GPUs due to Nvidia's focus on data center GPUs, contrasted with the effectiveness of model distillation for running models on less powerful hardware.
    * 47:28 o1 Pro and Sora Pricing: Speculation on the pricing of OpenAI's o1 Pro and Sora, suggesting a bundle approach to justify the $200/month cost.
    * 50:46 Political and Economic Factors: Analysis of the potential impact of political changes on OpenAI's strategy, suggesting a need for rapid fundraising and hype generation.
    * 53:21 Maximum Perceptible Audio-Visual Input: Discussion on the limits of human perception for audio-visual content, suggesting an upper bound for video generation quality.
    * 56:28 RLHF and Reasoning Models: Discussion on the challenges of applying Reinforcement Learning from Human Feedback (RLHF) to reasoning models and the potential for models to develop non-human languages.
    * 1:00:14 World Models: Exploration of the concept of world models conditioned on actions, with a vision of future models conditioned on neural activity for immersive VR experiences.
    * 1:02:56 IP and Data Ownership: Discussion on the complexities of intellectual property and data ownership in the context of training video generation models on copyrighted material.
    * 1:04:29 Slow-Motion Artifacts: Explanation of slow-motion artifacts in video generation as a consequence of data filtering that removes fast-motion and blurred videos.
    * 1:09:14 Distributed Training Clusters: Brief mention of Tencent's distributed training infrastructure and its use of parallelism for large-scale model training.
    * 1:11:54 Human Evaluations: Critical assessment of human evaluations in research papers, suggesting potential biases and the need for careful interpretation.
    * 1:12:53 Audio Generation and Avatar Driving: Discussion on additional features of the Hunyuan model, including audio generation and upper-body talking avatar generation, with speculation on their inclusion in future OpenAI products.
    * 1:17:04 Export Restrictions and Loopholes: Explanation of potential loopholes for circumventing export restrictions on high-performance GPUs to China.
    * 1:19:16 X-Prompt Paper: Introduction of a paper on in-context image generation using multimodal prompts, suggesting a future direction for video generation models.
    * 1:21:09 Speculation on Future OpenAI Products: Guesses on potential OpenAI product releases, including audio generation and video-based world models.
    * 1:23:29 Unified Image and Video Generation: Argument for a unified model for image and video generation, simplifying research and development.
    * 1:24:24 3D Asset Generation: Discussion on the potential obsolescence of traditional 3D asset generation in favor of implicit representations within neural networks.
    * 1:28:32 Usefulness of Short Video Clips: Critique of the limited utility of short video clips generated by current models, advocating for more advanced features like movie generation and avatar control.
    * 1:31:41 Robotics and Generative AI: Discussion on the impact of generative AI on robotics, highlighting significant improvements in robot capabilities.
    * 1:33:03 Conclusion and Farewell: The speaker concludes the stream, thanking participants and suggesting a future stream to analyze OpenAI's product announcements.
    I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.05 Input tokens: 35727 Output tokens: 1595
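
Since flow matching comes up in the summary above, here is a generic, hedged sketch of a rectified-flow training step; conventions for the direction of t vary between papers, and the model(x_t, t) signature is assumed for illustration rather than taken from the paper.

```python
import torch

def flow_matching_loss(model, x1, t=None):
    """x1: a batch of clean latents, shape (B, ...).

    The model is trained to predict the constant velocity x1 - x0 along the
    straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1.
    """
    x0 = torch.randn_like(x1)                       # pure-noise endpoint
    if t is None:
        t = torch.rand(x1.shape[0], device=x1.device)
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dims
    x_t = (1 - t_b) * x0 + t_b * x1                 # point on the straight path
    v_target = x1 - x0                              # target velocity field
    v_pred = model(x_t, t)                          # network predicts velocity
    return torch.nn.functional.mse_loss(v_pred, v_target)
```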

  • @karamhaider1788 · 6 days ago

    Can you please provide us with the algorithm and code of the walking and balancing robot? I want to read them to learn. Thanks

  • @devinbrown9925 · 7 days ago

    Summary at 1:45:34

  • @fadelazidan7731 · 9 days ago

    meaw?

  • @oraz. · 9 days ago

    2:05 starting tone

  • @prasukjain8107 · 9 days ago

    How many papers do you read every day?

  • @arashakbari6986 · 10 days ago

    These long videos are super useful!!!

  • @warthunderfather · 15 days ago

    FUCK YEAH PANTERAAAAA

  • @croko2240 · 15 days ago

    Really helpful 👍

  • @viddeshk8020 · 16 days ago

    Why don't you explain code too? I believe that would be great. 🎉

  • @Zoronoa01 · 16 days ago

    The horn is getting more stylish 😂

  • @Zoronoa01 · 16 days ago

    I am very sad that when the stream is live I am at work and can't attend :(

  • @SuperSoloSquad · 18 days ago

    thanks hupo

  • @wolpumba4099 · 18 days ago

    *Visual Reasoning and the Future of AI: A Stream Summary*
    * 0:00 Stream Introduction: Host introduces the theme of "visual reasoning" and the use of Google Illuminate to create AI-generated podcasts summarizing the discussed papers.
    * 1:27 Vision Encoder Scaling Laws: Just as with large language models, vision encoders are continually improving, showing a strong correlation between scale and performance.
    * 10:09 Inference Optimization Nuances: Inference for vision-language models presents a unique challenge. Balancing language model size and visual token count is crucial and highly task-specific. Tasks like OCR benefit from a higher number of tokens, while visual reasoning tasks might achieve optimal performance with fewer, even just one.
    * 11:48 GUI Agents: The Future of AI Interaction? The future of AI might be dominated by GUI agents, interacting with existing user interfaces rather than relying on specialized APIs. This is due to the widespread use of GUIs and the inherent efficiency of leveraging existing systems.
    * 26:53 The Dawn of GUI Agents: An exploration of the paper "Dawn of a GUI Agent" reveals successes and failures of agents interacting with software like Microsoft Word and the game Hearthstone.
    * 36:55 Structured Reasoning and Self-Improvement: "LLaVA-o1" employs a structured, hardcoded approach to reasoning, demonstrating better performance through step-by-step analysis. This method can be further enhanced by training on self-generated data.
    * 42:18 Self-Improvement Through Consistency: "Large Language Models Can Self-Improve in Long-Context Reasoning" shows how language models can enhance their performance by analyzing the consistency of their own outputs and fine-tuning based on that analysis.
    * 50:14 Generative World Exploration and Imagining the Future: The "Generative World Explorer" paper explores an agent's ability to imagine future scenarios to make better decisions. This is achieved through a generative video model that envisions potential outcomes.
    * 1:06:14 The Arms Race of Speed and Reasoning: The future likely holds an arms race between optimizing hardware for faster token processing (tokens per second) and the development of ever more complex reasoning chains that require more tokens to process.
    * 1:23:17 Stream Summary: A final summary highlights the key takeaways from the discussed papers, emphasizing the ongoing improvements in vision encoders, the complex landscape of inference optimization, the rise of GUI agents, the potential for self-improving AI, and the future interplay between speed and reasoning.
    I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.05 Input tokens: 36990 Output tokens: 542

  • @ProgramerSalar-eo7ct · 19 days ago

    nice tutorial very helpful

  • @alirezaahmadi5018 · 20 days ago

    I watched your saved stream for the first time in my life and it's awesome; I enjoyed it so much. Please continue with a heavy booster, man. We love this work. Good luck.

  • @shutthefuckupbiatch · 20 days ago

    Dude, This video is out of this world good

  • @danishjavaid6769 · 22 days ago

    How can I download the 3D model after training? When I export it, it looks dotted and scattered, but during training it looks awesome.

  • @SuperSoloSquad · 22 days ago

    love your video!

  • @KingBobertron · 22 days ago

    Was watching this video but you ignored your cat, so I ain't watching none of your videos now

  • @Zoronoa01 · 22 days ago

    This is so informative please keep them coming thank you so much

  • @rafaykhattak483 · 23 days ago

    Have they released weights for BlueLM-V-3B?

  • @VR_Wizard · 23 days ago

    About Bitcoin: it costs a lot of energy and we have a finite Earth. There is no point in getting rich on Bitcoin if the world fails in some areas because of it. Your personal wealth grows, but it also shrinks by making everyone else poorer; not poorer by taking their money, but by destroying their world, their lives, and their wealth. We are richer when we raise all boats; Bitcoin is the opposite, it runs on destruction and taking from others. Of course there are many destructive, bad things in the world, so Bitcoin is not the only bad one. The point I am making: don't think Bitcoin or other cryptos are the future of humanity, at least not in their current form. Maybe in a more equal, fair world they would play an important role in keeping the world fair, but in the current unfair world they are not really helpful and seem to do more harm than good. AI is the same: it can be used for good and is used for good, but the bad use cases can also be seen everywhere already. Humans are just very imperfect creatures, not always acting in their own best interest or with human survival in mind.

  • @sue_green · 23 days ago

    God I love your streams man! Thank you, thank you so much for what you're doing

  • @thivuxhale · 24 days ago

    1:11 starting horn

  • @sino-cici · 25 days ago

    I just want to point out that for a larger dimension index, the wavelength is larger and the frequency is lower; note there is a minus sign in the frequency equation.
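
For context, the standard RoPE frequency definition the comment appears to refer to (assuming the usual formulation; the paper's notation may differ):

```latex
% Standard RoPE frequencies (note the negative exponent the comment points to):
\omega_i = \theta^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2}-1, \qquad \theta = 10000 .
% The corresponding wavelength grows with the dimension index i:
\lambda_i = \frac{2\pi}{\omega_i} = 2\pi\,\theta^{2i/d},
% so larger i means lower frequency and longer wavelength, as the comment says.
```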

  • @mountassirel7447 · 27 days ago

    Awesome vid, keep up the good work man !

  • @wolpumba4099 · 27 days ago

    *TokenFormer: Rethinking Transformer Scaling - A Deep Dive into Attention, Scaling, and Knowledge Representation*
    * 2:40 TokenFormer Intro: The TokenFormer paper proposes a novel Transformer architecture where linear projections are replaced with "P attention" - a mechanism that treats model parameters as tokens. This approach facilitates a "crystallization" of intelligence, starting with a small model and incrementally adding more "model parameter tokens" during training, reducing computational costs.
    * 3:30 Attention as the Core: TokenFormer uses the attention mechanism not only for processing input tokens but also for interactions between input tokens and model parameters, essentially replacing all MLP layers and linear projections with attention.
    * 7:00 Linear Projections: Standard linear projections in transformers involve multiplying a vector input by a matrix of weights to produce a new vector output. TokenFormer replaces these fixed-size weight matrices with learnable "model parameter tokens."
    * 8:47 Natively Scalable Architecture: By treating model parameters as tokens, TokenFormer allows for dynamic expansion during training. This means models can start small and gradually increase in size, leading to faster training and better utilization of compute resources.
    * 18:28 Queries, Keys, and Values: A detailed explanation of the core attention mechanism is provided, using analogies of questions and answers. Queries represent what a token is looking for, keys represent what a token contains, and values are the information that gets added to tokens based on the agreement between their queries and keys.
    * 30:03 P Attention in Depth: The P attention layer utilizes cross attention between input tokens (acting as queries) and learnable parameter tokens (acting as keys and values). This enables seamless scaling by expanding parameter tokens over time (see the sketch after this summary).
    * 41:04 Language as an Indexing System: Drawing on insights from Francois Chollet, the idea of language as an indexing system for memories is explored. This suggests that language tokens, through attention mechanisms, guide the retrieval of information stored in high-dimensional vector spaces.
    * 49:40 The Johnson-Lindenstrauss Lemma: The lemma explains how high-dimensional spaces offer an exponentially increasing number of orthogonal or near-orthogonal directions. This property is crucial for storing a vast amount of information and allows neural networks to become more sophisticated as their parameter count (and the dimensionality of their internal representations) grows.
    * 58:15 Temperature and Creativity: High temperature in token sampling leads to more randomness and "creative" outputs. This randomness allows the model to explore areas of the high-dimensional space further away from the actual training data, leading to both creativity and hallucinations.
    * 1:06:00 Random Input Order: The paper "Randomized Autoregressive Visual Generation" demonstrates that randomizing input order for images doesn't impact performance significantly. This aligns with the order-invariant nature of attention mechanisms and suggests the importance of understanding the specific structure and redundancy of different data types.
    * 1:11:59 Hallucinations as Extrapolation: Hallucinations arise when a model interpolates or extrapolates information in the high-dimensional space and lands on a point not grounded in the training data. This issue is akin to exploring uncharted territories in the model's knowledge space, leading to potentially inaccurate outputs.
    * 1:13:46 Variance as a Measure of Uncertainty: High variance in token probabilities can indicate potential hallucinations, as the model is less certain about the next token based on the input. This variance could be used as a signal to flag potentially unreliable outputs.
    * 1:26:52 Conclusion: TokenFormer showcases the potential of replacing all linear projections with attention mechanisms, leading to more efficient scaling and offering a different perspective on the role of attention and MLPs in knowledge representation. It also emphasizes the importance of understanding high-dimensional spaces and the nature of information storage in neural networks.
    I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.05 Input tokens: 36589 Output tokens: 819
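
To make the parameters-as-tokens idea concrete, here is a minimal sketch of a Pattention-style layer. Note that the real TokenFormer uses a modified, GeLU-based normalization rather than the plain softmax used here, and the grow() helper is an illustrative assumption, not the paper's API.

```python
import torch
import torch.nn as nn

class PAttention(nn.Module):
    def __init__(self, dim: int, num_param_tokens: int):
        super().__init__()
        # Learnable "model parameter tokens" acting as keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) input tokens act as queries.
        scores = x @ self.key_params.t() / self.key_params.shape[-1] ** 0.5
        weights = scores.softmax(dim=-1)            # (batch, seq, num_param_tokens)
        return weights @ self.value_params          # (batch, seq, dim)

    def grow(self, extra_tokens: int):
        # The scaling trick from the discussion: append new parameter tokens
        # (here zero-initialized) without touching the existing ones.
        dim = self.key_params.shape[1]
        self.key_params = nn.Parameter(
            torch.cat([self.key_params.data, torch.zeros(extra_tokens, dim)]))
        self.value_params = nn.Parameter(
            torch.cat([self.value_params.data, torch.zeros(extra_tokens, dim)]))
```

Because the "weight matrix" is now just a set of key/value token pairs, adding capacity is a concatenation rather than a re-initialization, which is what makes the incremental, crystallization-style scaling possible.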

  • @ShaohuaDong · 28 days ago

    I think FLOPs refers to the total number of floating-point operations of a network, while FLOPS refers to the computational throughput of a GPU.
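
A toy calculation illustrating the distinction (illustrative numbers, not from the video):

```latex
% FLOPs (total work) divided by FLOPS (rate) gives wall-clock time, e.g.:
\text{time} \;\approx\; \frac{\text{FLOPs}}{\text{FLOPS}\times\text{utilization}}
\;=\; \frac{10^{21}}{10^{14}\times 0.5}\ \text{s}
\;=\; 2\times10^{7}\ \text{s} \;\approx\; 231\ \text{days on a single GPU}.
```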

  • @sajjadshahabodini · 1 month ago

    ❤❤

  • @PrathameshKatkar-w6j · 1 month ago

    How about quantizing the less important MLP layers?

  • @visionnoob580 · 1 month ago

    Thank you always!!! You explained it so well in easy English. As a non-native Eng speaker, I can understand quite well!

  • @litguru4748 · 1 month ago

    Summary of the entire field of current AI research starts at 56:40

  • @404_Trader · 1 month ago

    Well done reviewing the paper I appreciate it 🎉

  • @aiarchitecture6427 · 1 month ago

    I don't know if it's in your line of interest, but a review of ControlNet/T2I-Adapter/Uni-ControlNet/LoRAdapter/B-LoRA would be great 😊 It's getting confusing for me and probably for other people interested in diffusion too

    • @qimingwang9557 · 1 month ago

      Sounds goood!! I'm really interested in this and want to know more about it. If you can do a video explaining the CN training I would be very grateful

  • @ipsdon · 1 month ago

    What visualization tool are you using?

  • @alexijohansen · 1 month ago

    The depth in this stream is great!

  • @nigelwan2841 · 1 month ago

    Journey to the West is not that wrong though

  • @wolpumba4099 · 1 month ago

    Summary starts at 1:26:54

    • @wolpumba4099 · 1 month ago

      *TokenFormer: Exploring Transformers, Attention, and Knowledge Storage* This stream delves into the TokenFormer paper, a novel transformer architecture that uses the attention mechanism for interactions between tokens and model parameters. The key takeaways include:
      * 0:00 Introduction: The stream begins with greetings, platform testing, and mentions of related papers like "Randomized Autoregressive Visual Generation" and "Differential Transformer."
      * 2:40 TokenFormer Overview: TokenFormer replaces every part of a transformer block with a new attention mechanism (P-Attention) where model weights are treated as tokens. This allows the model to scale by adding new learnable tokens.
      * 7:00 Attention Deep Dive: A detailed explanation of the attention mechanism is provided, drawing upon concepts from Andrej Karpathy's "Let's Build GPT" and ThreeBlueOneBrown's "Transformers" videos. Queries are analogous to questions, keys are answers, and the attention matrix represents the agreement between them.
      * 26:55 Parameterized Attention (P-Attention): P-Attention employs cross-attention between input tokens (queries) and model parameter tokens (keys and values), enabling flexible model scaling by adding new key-value parameter pairs.
      * 34:38 Knowledge Storage in Transformers: The traditional view suggests that knowledge is primarily stored in the MLP layers of a transformer. However, TokenFormer's ability to achieve high performance without MLPs challenges this view and raises questions about how knowledge is truly encoded.
      * 49:26 Johnson-Lindenstrauss Lemma: This lemma highlights the exponential growth of nearly orthogonal vectors in high-dimensional spaces. This property is crucial for understanding how neural networks can store vast amounts of information (stated formally after this comment).
      * 1:06:00 Randomized Autoregressive Visual Generation: This paper explores randomizing the input sequence order during training for visual tasks, demonstrating that the order of image patches doesn't significantly impact performance due to the redundancy of visual data.
      * 1:11:35 Hallucinations and Uncertainty: The discussion delves into the problem of hallucinations in language models, comparing it to the concept of high entropy in decision-making. Hallucinations arise when models extrapolate or interpolate in regions of high-dimensional space where little or no training data exists.
      * 1:26:52 Conclusion: The stream concludes with a summary of TokenFormer and its implications for understanding transformer architectures, the attention mechanism, and knowledge storage in high-dimensional spaces.
      I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.05 Input tokens: 37025 Output tokens: 529
      Timestamps for the stream conclusion: *TokenFormer and the Nature of Transformer Layers*
      * 1:27:51 TokenFormer Introduction: The TokenFormer paper proposes replacing linear projections within Transformer blocks with "p-attention," a cross-attention mechanism between input tokens and a sequence of tokens representing model parameters.
      * 1:27:59 Incremental Model Growth: TokenFormer allows for incremental addition of model parameter tokens during training, enabling efficient scaling from smaller models and faster training.
      * 1:28:19 Crystallization Analogy: The incremental growth of the model is likened to crystallization, starting from a small core and expanding outward.
      * 1:28:55 Exploring Transformer Layers: The discussion shifts to the roles of different Transformer layers, questioning whether factual knowledge resides in the feedforward network (MLP) or if self-attention is primarily for communication.
      * 1:29:38 High-Dimensional Spaces and Knowledge: The video suggests that neural networks leverage high-dimensional spaces to store concepts orthogonally, similar to the method of loci used by animals and humans for memory.
      * 1:30:19 Language as Indexing: Language is presented as a powerful tool for indexing, storing, and retrieving information in these high-dimensional spaces, highlighting its importance for LLMs.
      * 1:31:05 MLP vs. Attention for Knowledge: The video argues that the attention mechanism might be superior to MLPs for knowledge storage and retrieval due to its ability to encode queries, keys, and values separately.
      * 1:31:50 Emergence of Information: The effectiveness of these mechanisms is attributed to the emergent property of information in high-dimensional spaces, where increased dimensionality allows for greater storage capacity.
      * 1:32:13 Anamorphic Illusion Analogy: An anamorphic illusion (a distorted image that appears correct from a specific angle) is used as an analogy to illustrate how high-dimensional spaces can store information in unexpected ways.
      * 1:32:45 Recommended Resources: The video concludes by recommending Transformer visualizers and specific YouTube videos for further learning, particularly those by ThreeBlueOneBrown.
      I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.0184 Input tokens: 12895 Output tokens: 454
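
Since both summaries lean on the Johnson-Lindenstrauss lemma, here is its textbook statement as background (hedged; not the stream's exact wording):

```latex
% Johnson-Lindenstrauss lemma (standard form): for any 0 < eps < 1 and any set X
% of n points in R^D, there exists a linear map f: R^D -> R^k with
%   k = O(eps^{-2} log n)
% such that all pairwise distances are preserved up to a factor (1 +/- eps):
\forall\, u, v \in X:\quad
(1-\varepsilon)\,\lVert u-v\rVert^2 \;\le\; \lVert f(u)-f(v)\rVert^2 \;\le\; (1+\varepsilon)\,\lVert u-v\rVert^2 .
% The related fact used in the discussion: the number of pairwise nearly-orthogonal
% unit vectors that fit in R^d grows exponentially with d.
```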

  • @sathishkumar-ch4sx · 1 month ago

    this is an awesome stream, I learned a lot. Thanks for doing this.