Test-Time Adaptation: A New Frontier in AI

  • Published Dec 24, 2024

COMMENTS • 49

  • @MachineLearningStreetTalk
    @MachineLearningStreetTalk  25 days ago +9

    REFS:
    [0:02:25] **Academic position and research focus at ETH Zurich** | Jonas Hübotter is a doctoral researcher in the Learning and Adaptive Systems Group at ETH Zurich, working with Professor Andreas Krause on machine learning and local learning. | Jonas Hübotter
    jonhue.github.io/
    [0:02:50] **The Pile benchmark dataset for language model evaluation** | The Pile is an 825 GiB English text corpus used for training large-scale language models, consisting of 22 diverse high-quality subsets including academic writing, Stack Exchange, and other sources. | Leo Gao et al.
    arxiv.org/abs/2101.00027
    [0:05:52] **Framework for making machine learning accessible through teaching-focused approach** | Machine Teaching: A New Paradigm for Building Machine Learning Systems - Microsoft Research paper introducing the concept of machine teaching as a discipline focused on teachers rather than learners | Patrice Y. Simard et al.
    arxiv.org/abs/1707.06742
    [0:07:35] **Foundational paper introducing RAG architecture combining pre-trained models with explicit memory access** | RAG (Retrieval-Augmented Generation) paper by Patrick Lewis et al. introducing the concept of combining parametric and non-parametric memory for language generation | Patrick Lewis et al.
    arxiv.org/abs/2005.11401
    [0:09:50] **Comprehensive documentation of The Pile dataset including its mathematical components** | The Pile dataset including DeepMind Mathematics component, containing school-level math questions and other diverse text data | Stella Biderman et al.
    arxiv.org/pdf/2201.07311
    [0:11:25] **Survey paper analyzing knowledge conflicts in LLMs between pre-training and in-context information** | Research on conflicts between in-context learning and pre-training knowledge in large language models | Chen, Zhixiu and Wang, Yuchen and Zhang, Zhihao and Wang, Xu and Li, Zhiwei
    arxiv.org/html/2403.08319v2
    [0:13:40] **Study of ant foraging rules and pheromone trail network properties** | Research on ant colony foraging behavior and pheromone trail networks | Czaczkes, Tomer J
    pmc.ncbi.nlm.nih.gov/articles/PMC3291321/
    [0:16:05] **Theory of instrumental convergence in superintelligent AI systems** | Instrumental convergence thesis in AI safety, discussing how superintelligent AI systems might develop predictable sub-goals regardless of their final goals | Nick Bostrom
    nickbostrom.com/superintelligentwill.pdf
    [0:18:45] **Seminal paper defining universal intelligence and its relationship to compression** | Marcus Hutter's fundamental work on universal intelligence and its relationship to compression, particularly in his collaboration with Shane Legg defining machine intelligence | Shane Legg and Marcus Hutter
    arxiv.org/pdf/0712.3329.pdf
    [0:20:50] **Paper connecting active inference, free energy principle, and maximum entropy methods in machine learning** | Discussion of active inference as a form of maximum entropy inverse reinforcement learning, which relates to the paper 'The Free Energy Principle for Perception and Action: A Deep Learning Perspective' discussing the relationship between active inference and maximum entropy methods | Pietro Mazzaglia et al.
    www.mdpi.com/1099-4300/24/2/301/pdf
    [0:23:10] **Paper explaining how active inference leads to autonomous organization in biological systems** | Discussion of emergence of self-sustaining behaviors through active inference relates to 'The Markov blankets of life: autonomy, active inference and the free energy principle', which explores how active inference leads to autonomous behavior | Karl J. Friston
    royalsocietypublishing.org/doi/10.1098/rsif.2017.0792
    [0:23:30] **Technical framework for implementing intentional behavior in active inference agents** | Active Inference and Intentional Behaviour (2024) discusses how active inference frameworks can be used to create AI systems with constrained agency and specific preferences. | Karl J. Friston
    arxiv.org/html/2312.07547v2
    [0:24:10] **Research on genetic constraints and behavioral plasticity in intelligence** | The Paradox of Intelligence: Heritability and Malleability Coexist in Hidden Gene-Environment Interplay (2018) explores how genetic constraints interact with environmental plasticity. | Bruno Sauce, Louis D. Matzel
    www.ncbi.nlm.nih.gov/pmc/articles/PMC5754247/
    [0:26:55] **Foundational work establishing dual-process theory of cognition with System 1/2 framework** | System 1 (fast, intuitive, and emotional) and System 2 (slower, deliberative, and logical) thinking framework from 'Thinking, Fast and Slow'. Context: Discussion of cognitive architectures and their applicability to AI systems. | Daniel Kahneman
    www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555
    [0:28:55] **Analysis of computational mechanisms behind in-context learning versus fine-tuning** | Computational differences between in-context learning and fine-tuning, with ICL requiring forward computation for each token while fine-tuning uses back-propagation. Context: Discussion of efficiency in different learning approaches. | Wei et al.
    arxiv.org/pdf/2212.10559
    [0:30:55] **Foundational paper introducing nearest neighbor pattern classification** | Early work on nearest neighbor methods in the 1950s for pattern recognition and classification | Cover, T., Hart, P.
    ieeexplore.ieee.org/document/1053964

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  25 days ago +2

      PART 2:
      [0:32:05] **Fundamental work establishing theoretical framework for transductive learning** | Vladimir Vapnik's work on transductive inference and statistical learning theory | Vladimir Vapnik
      www.springer.com/gp/book/9780387987804
      [0:35:35] **Leading researcher in conformal prediction and machine learning at Royal Holloway** | Reference to Vladimir Vovk at Royal Holloway University, pioneer of conformal prediction | Vladimir Vovk
      pure.royalholloway.ac.uk/en/persons/vladimir-vovk
      [0:36:30] **Neuroscientist exploring consciousness and its relationship with emotional processing** | Reference to Mark Solms' work on consciousness and its relationship with ambiguity processing | Mark Solms
      ua-cam.com/video/CmuYrnOVmfk/v-deo.html
      [0:40:00] **Foundational paper establishing active inference as a model of agency and choice behavior** | Karl Friston's active inference model of agency, which describes how biological systems maintain their state through prediction and action | Karl Friston
      www.frontiersin.org/journals/human-neuroscience/articles/10.3389/fnhum.2013.00598/full
      [0:41:45] **Novel approach for efficient test-time adaptation of language models** | SIFT (Sparse Inference Fine-Tuning) paper discussing local distribution learning in language models | Jonas Hübotter et al.
      arxiv.org/pdf/2410.08020
      [0:43:35] **Research on improving LLM performance through test-time adaptation using nearest neighbors** | Test-Time Training on Nearest Neighbors for Large Language Models (2024). The paper discusses how updating models at test time with relevant data can improve performance, aligning with the speaker's points about local learning benefits. | Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig
      arxiv.org/html/2305.18466v3
      [0:48:25] **Survey of active learning techniques addressing domain shift and multi-domain sampling** | Concept of active learning addressing distribution shift in machine learning systems. Active learning systems continuously retrain on shifting data distributions to maintain model performance over time. | Shayne Longpre et al.
      arxiv.org/abs/2202.00254
      [0:50:55] **Research on combining retrieval and fine-tuning for in-context learning models** | Discussion of nearest neighbor retrieval and fine-tuning approach for local model adaptation. This relates to the naive approach mentioned in the conversation about retrieving nearest neighbors for fine-tuning. | Thomas et al.
      arxiv.org/abs/2406.05207
      [0:54:05] **Original RoBERTa paper introducing the improved BERT-based model for NLP tasks** | RoBERTa (Robust Optimized BERT Approach) is a robustly optimized BERT pretraining approach that improves on BERT's masking strategy and training methodology. It's commonly used for generating embeddings in information retrieval tasks. | Yinhan Liu et al.
      arxiv.org/pdf/1907.11692.pdf
      [0:58:45] **Comprehensive guide to deep learning that includes detailed discussion of fine-tuning practices** | Deep Learning with Python by François Chollet discusses the challenges and best practices of fine-tuning neural networks, particularly regarding learning rate selection and gradient steps | François Chollet
      www.amazon.com/Learning-Python-Second-Fran%C3%A7ois-Chollet/dp/1617296864
      [1:01:55] **Research paper examining the Linear Representation Hypothesis in neural networks** | Linear Representation Hypothesis (LRH) in neural networks, which posits that networks encode concepts as directions in activation space | Róbert Csordás et al.
      arxiv.org/abs/2408.10920
      [1:03:10] **ML researcher specializing in mechanistic interpretability of neural networks** | Neel Nanda - Machine Learning Researcher at DeepMind, previously at Anthropic, known for work in mechanistic interpretability | Neel Nanda
      www.neelnanda.io/about
      [1:05:40] **Foundational paper introducing LIME for model interpretability** | LIME (Local Interpretable Model-agnostic Explanations) - A technique for explaining predictions of any classifier using local linear approximations | Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
      arxiv.org/abs/1602.04938
      [1:09:35] **Seminal paper on using influence functions for understanding black-box model predictions** | Influence Functions in machine learning as described in 'Understanding Black-box Predictions via Influence Functions' by Koh & Liang. The paper demonstrates how influence functions can trace model predictions back to training data. | Pang Wei Koh
      arxiv.org/abs/1703.04730
      [1:11:45] **Comprehensive overview of dataset security vulnerabilities including data poisoning and backdoor attacks** | Data poisoning attacks in machine learning security, where training data manipulation can create backdoors and vulnerabilities in ML systems | Micah Goldblum et al.
      arxiv.org/abs/2012.10544
      [1:16:05] **Fundamental textbook covering Bayesian linear regression with closed-form solutions** | Bayesian Linear Regression as described in 'Pattern Recognition and Machine Learning' by Bishop. The text discusses closed-form solutions for posterior computation with Gaussian priors and likelihood. Context: Speaker explains how linear surrogate models with Gaussian priors enable tractable posterior computation. | Christopher M. Bishop
      www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738
      [1:17:10] **Paper demonstrating Bayesian neural networks for uncertainty quantification** | Discussion of uncertainty quantification in neural networks using Bayesian methods, referencing 'Bayesian Deep Convolutional Encoder-Decoder Networks for Surrogate Modeling and Uncertainty Quantification'. Context: Speaker contrasts traditional neural networks with Bayesian approaches for uncertainty estimation. | Yinhao Zhu
      arxiv.org/abs/1801.06879
      [1:18:55] **Comprehensive review of variational inference methods including closed-form solutions with conjugate priors** | Closed-form Bayesian inference with Gaussian distributions (conjugate priors), as detailed in 'Variational Inference: A Review for Statisticians'. The paper discusses how conjugate priors lead to analytically tractable posterior distributions, particularly in the case of Gaussian distributions. | David M. Blei, Alp Kucukelbir, Jon D. McAuliffe
      arxiv.org/pdf/1601.00670
      [1:26:15] **MindsAI's breakthrough in ARC challenge using test-time fine-tuning** | MindsAI team's achievement in the ARC (Abstraction and Reasoning Corpus) Challenge, reaching 54.5% performance using test-time fine-tuning approach in late 2024 | Mohamed Osman & MindsAI Team
      www.reddit.com/r/singularity/comments/1gexvmj/new_arcagi_high_score_by_mindsai_545_prize_goal/
      [1:29:50] **Research on active inference for collaborative AI systems in unknown environments** | Active Inference in distributed AI systems as described in 'Collaborative AI Teaming in Unknown Environments via Active Goal Inference'. The paper discusses how active inference can be used in distributed AI systems for collaborative tasks. | Jaya Krishna Thota et al.
      arxiv.org/pdf/2403.15341
      [1:32:55] **Introduction of OpenAI o1 model with novel inference-time scaling properties** | OpenAI o1 model's inference-time scaling capabilities, introduced in September 2024, showing performance improvements with both train-time and test-time compute allocation | OpenAI
      openai.com/index/learning-to-reason-with-llms/
      [1:33:55] **Framework for active inference and uncertainty minimization in AI systems** | Active inference framework for AI systems, focusing on uncertainty minimization through predictive coding and exploration | Abdelrahman Sharafeldin
      www.sciencedirect.com/science/article/pii/S2666389924000977
      [1:36:05] **Theoretical analysis of convergence in uncertainty-based active learning** | Research on convergence guarantees in uncertainty-based active learning, discussing how selecting informative data points based on uncertainty reduction can lead to optimal convergence | Yingzhen Yang et al.
      arxiv.org/pdf/2312.13927
      [1:37:50] **Information-theoretic analysis of transductive learning generalization bounds** | Discussion of transductive learning theory and its relationship to inductive learning, particularly relevant to the proposed hybrid approach | Huayi Tang et al.
      arxiv.org/abs/2311.04561
      [1:38:50] **Research paper establishing theoretical foundations of transductive learning in machine learning** | Discussion of transductive learning in machine learning context, referencing the theoretical framework of transductive vs inductive learning approaches | Mathieu Chalvidal
      arxiv.org/abs/2302.00328
      [1:40:35] **Foundational paper establishing scaling laws for neural language models** | Reference to scaling laws in language model training, discussing the relationship between compute budget and model size | Jared Kaplan et al.
      arxiv.org/abs/2001.08361
      [1:42:20] **Latest Apple Silicon chip optimized for ML workloads** | Apple M4 chip, announced in May 2024, represents significant advancement in local ML processing capabilities for MacBook Pro line | Apple Inc.
      www.apple.com/newsroom/2024/05/apple-introduces-m4-chip/
      [1:42:40] **Advanced language model with improved performance and speed** | Claude 3.5 Sonnet by Anthropic, released in 2024, providing enhanced capabilities for model verification and complex reasoning tasks | Anthropic
      www.anthropic.com/news/claude-3-5-sonnet
      [1:45:45] **Information-based transductive active learning research with applications to safe exploration** | Reference to transactive fine-tuning and its future impact in AI systems, connecting to active learning and uncertainty estimation methods | Jonas Hübotter et al.
      arxiv.org/pdf/2405.05890

    • @Iophiel
      @Iophiel 24 days ago +2

      Good references! Much quality! Craftsmanship is appreciated!!

    • @fiseticamente
      @fiseticamente 20 days ago +2

      @@MachineLearningStreetTalk Thank you for the awesome videos and the super nice references feature

  • @a1marky421
    @a1marky421 21 days ago +5

    That "Google Earth" analogy was amazing it instantly flipped a switch in my brain and everything started making more sense.

    • @Thomas_basiv
      @Thomas_basiv 18 days ago

      At which point does the analogy start, sorry?

    • @ErikR-f1c
      @ErikR-f1c 2 days ago +1

      37:52

    • @Thomas_basiv
      @Thomas_basiv 2 days ago +1

      @@ErikR-f1c thank you kindly

  • @KevinKreger
    @KevinKreger 24 days ago +2

    Tim looks great without hair! Jonas is a great guest. Thanks for bringing him back❤

  • @Charles-Darwin
    @Charles-Darwin 24 days ago +4

    ETH Zurich's work on ANYmal (their four-legged, dog-like robot platform) always amazes with its improvements

  • @dr.mikeybee
    @dr.mikeybee 24 days ago +9

    Reason can emerge in a connectionist system, but loops and algorithms can only be unrolled to a depth less than the model's own depth. So we need System 2 thinking for longer algorithms.

  • @sonOfLiberty100
    @sonOfLiberty100 24 days ago +15

    I'm quebono100, one of your first subscribers. You guys are still killing it. Such good, unique work that you are doing. Thank you

  • @luke.perkin.online
    @luke.perkin.online 24 days ago +1

    So much groundwork to cover, a bit snoozy, then around the hour mark it really starts to get meaty. Good questioning, Tim! Jonas is a machine, so eloquent!

  • @BuFu1O1
    @BuFu1O1 24 days ago +1

    1:00:00 You could apply dropout only to the biases, and train only the biases when you fine-tune on your test data; disable it otherwise.
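    (Illustrative only, not from the video: a minimal PyTorch sketch of the bias-only tuning idea above, BitFit-style, assuming a HuggingFace-style model whose forward pass returns a .loss; all names and hyperparameters are placeholders.)

    ```python
    import torch

    def tune_biases_at_test_time(model, batches, lr=1e-4, steps=5):
        # Freeze everything except bias terms, then take a few gradient
        # steps on data retrieved for the current test instance.
        for name, p in model.named_parameters():
            p.requires_grad = name.endswith("bias")
        opt = torch.optim.Adam(
            [p for p in model.parameters() if p.requires_grad], lr=lr)
        model.train()
        for _ in range(steps):
            for batch in batches:            # e.g. retrieved nearest neighbours
                loss = model(**batch).loss   # assumes HF-style causal-LM outputs
                opt.zero_grad()
                loss.backward()
                opt.step()
        model.eval()
        return model
    ```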

  • @RohanKumar-vx5sb
    @RohanKumar-vx5sb 24 days ago +2

    Great talk! I want to hear your opinion on whether agentic AI is a form of test-time compute, since you're literally answering the request over a longer time than a single current non-TTC LLM call.

  • @mathematicalninja2756
    @mathematicalninja2756 24 days ago +2

    We used similar techniques in our math lab and got a 7% improvement on MATH level 5 with a 7B model

  • @for-ever-22
    @for-ever-22 10 days ago

    Amazing as usual !!

  • @ricosrealm
    @ricosrealm 24 days ago

    That was an interesting point about the amortization of SFT vs in-context learning; SFT can parallelize the operations across the batch, whereas in-context learning has to operate sequentially on all of the examples.

  • @earleyelisha
    @earleyelisha 24 days ago +1

    How does the current incarnation differ from Continual/Lifelong/Incremental learning?
    Sutton, LeCun, M. Mitchell, I. Rish, and others have alluded to this, yet the field doesn't seem to focus on this fundamental gap at all.

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  24 days ago +2

      Jonas talks about this in his paper: "Retrieval and active learning can be seen as two extreme ends of a spectrum: retrieval selects relevant but potentially redundant data, while active learning selects diverse but potentially irrelevant data." Also, Jonas is advocating for transductive inference, where we go from "particular to particular", i.e. build a new model for each prediction from the data (and models) we have access to.

    • @earleyelisha
      @earleyelisha 24 days ago

      @ Ty! Looking forward to reading it!

    • @bradleyangusmorgan7005
      @bradleyangusmorgan7005 8 days ago

      @@MachineLearningStreetTalk Transductive inference, although not the same, reminds me of the Thousand Brains theory.

  • @memegazer
    @memegazer 19 days ago

    I think it is relatively simple to bridge the gap
    if the benchmarks are simulated as multimodal:
    take general reasoning questions,
    put the model in a virtual space,
    and adaptation will probably reveal itself as some fundamental function

    • @memegazer
      @memegazer 19 days ago

      I guess this speculation raises a question about what "test time" means as a more "bare metal" substrate in terms of "real time"

  • @kneau
    @kneau 24 days ago

    22:47 through 23:22
    23:36 through 24:01
    24:53 through 25:52
    25:58 through 26:44
    As someone forever living the aftermath of numerous severe to moderate TBI and ABI - living with an "acquired communication disorder" - as one can see, I am taking notes.

    • @kneau
      @kneau 24 days ago

      I see my error of "severe to moderate." Correct terminology is "moderate to severe." That said, a snapshot of illogical sequencing (symptom manifestation) is more valuable to me than an edited comment.
      This reply being a clear exception; totally editing it for the sole purpose of removing a line break.
      Second edit to remove four words because... "aesthetic."
      *deadpan jazz hands*

    • @kneau
      @kneau 24 days ago

      36:42 through 37:16
      Oh! 38:38 through 39:11 reminds me of my neurofeedback sessions. The image on the screen is fuzzy but if you "do it correctly" the image becomes increasingly clear. Once clear, keep it clear.

    • @kneau
      @kneau 24 days ago

      40:15 "Situational computation" sends my mind to perceptual adaptation. Not sure if there's value in that, might be a "rhythmic association thing"

    • @kneau
      @kneau 24 days ago

      42:50 I think, "...really big base model.." is my chosen cue to revisit the remainder of this video at a later point in time.

    • @justindankert7725
      @justindankert7725 21 days ago

      are you okay?

  • @brandonheaton6197
    @brandonheaton6197 21 days ago

    this dude is with it. Great interview

  • @jeanpaulniko
    @jeanpaulniko 20 days ago

    For the future: an amortized form of transductive active fine-tuning

  • @jeanpaulniko
    @jeanpaulniko 21 days ago

    I think we need to redefine consciousness by exploring dimensionality, or maybe not try to give it a definition at all

  • @ajudicator
    @ajudicator 24 days ago

    So basically dreambooth?

  • @Carl-md8pc
    @Carl-md8pc 24 days ago +1

    Thanks

  • @marshallmcluhan33
    @marshallmcluhan33 24 days ago +1

    The million dollar question hehe

  • @surfaceoftheoesj
    @surfaceoftheoesj 24 days ago

    Well said

  • @burnytech
    @burnytech 24 days ago

  • @Hopyboby
    @Hopyboby 24 days ago

    giga chad with a giga brain. nice combo.

  • @MylaSavannahHazel
    @MylaSavannahHazel 23 days ago

    aitutorialmaker AI fixes this. "Test-Time Adaptation in AI"

  • @LostInTheRush
    @LostInTheRush 22 days ago

    why do I want to make out with this guy

  • @deter3
    @deter3 24 days ago +2

    I do not see any meaningful impact from this talk or paper.
    1. Embeddings are not a proven method for measuring relevance. Say you want to compare similar law cases, which requires similarity along multiple dimensions; embeddings do not work there at all.
    2. A super highly flexible use case only exists in theory. In the real world we all have certain fixed use cases, and multiple QLoRA adapters that can be instantly plugged in are good enough.
    I do not care how fancy your theory is: if you cannot guarantee that embeddings lead to successful retrieval in highly specialized cases, then your method is useless. For normal use cases, LoRA offers a compelling alternative because it's lightweight, adaptable, and allows for storing multiple adapters for different tasks. This avoids retraining or storing multiple fine-tuned models. Real-world applications will go for multiple QLoRA adapters, since their use cases are always fixed; a super highly flexible use case only exists in theory.
    Now, tell me: what is the meaningfulness of your paper or method?

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  24 days ago +1

      This is the paper we discussed - arxiv.org/pdf/2410.08020
      You might want to get Claude to explain it to you to fill in some of your gaps in understanding.
      The embeddings are used to retrieve nearest neighbours (to the test instance); then a local model is constructed (which loosely resembles a kernel ridge regression model) to iteratively select the data points which maximise information gain, i.e. balancing relevance and diversity. Embedding-only search does indeed suck; that's the whole point of this research. The key insight here is selecting an optimal set of examples using a local surrogate model to fine-tune the source model, but the fine-tuning itself is not a key part of the discussion.
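      (A rough, unofficial sketch of that selection loop as described above: a kernel surrogate over embeddings, greedily adding whichever candidate most reduces predictive uncertainty at the test embedding, which is what trades relevance off against redundancy. All names here are illustrative; see the paper for the actual SIFT algorithm.)

      ```python
      import numpy as np

      def cosine_kernel(A, B):
          # Cosine-similarity kernel between two sets of embeddings.
          A = A / np.linalg.norm(A, axis=-1, keepdims=True)
          B = B / np.linalg.norm(B, axis=-1, keepdims=True)
          return A @ B.T

      def select_for_local_finetuning(query, cands, k=10, noise=0.1):
          """Greedily pick k candidate embeddings that minimise the surrogate's
          posterior variance at `query`; similar points help, redundant ones don't."""
          chosen = []
          for _ in range(k):
              best_i, best_var = None, np.inf
              for i in range(len(cands)):
                  if i in chosen:
                      continue
                  S = cands[chosen + [i]]
                  K = cosine_kernel(S, S) + noise * np.eye(len(S))
                  kq = cosine_kernel(query[None, :], S)
                  # Posterior variance at the query under a kernel-ridge/GP surrogate.
                  var = (cosine_kernel(query[None, :], query[None, :])
                         - kq @ np.linalg.solve(K, kq.T)).item()
                  if var < best_var:
                      best_i, best_var = i, var
              chosen.append(best_i)
          return chosen  # indices into cands, in selection order
      ```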

  • @economicsanity2895
    @economicsanity2895 24 days ago

    It would be much more accessible for a general audience if you at least explained the "big" words that you'd like to use. For example, it would be good if you explained what transduction means before talking more about other things.

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  24 days ago +3

      I think we do explain it and show a figure, but perhaps a little way into the show. Most ML models are "inductive": you train on data and build a general-purpose decision function which you re-use in many future situations. Transduction (test-time learning is a form of transduction) is when, in every prediction situation, you use data (usually test data, or retrieved data "related" to the test data) to build a new model on the fly for the sole purpose of that prediction. MLST is pitched at a technical audience, but I appreciate we could do better at explaining things.
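      (If it helps, here is a toy contrast of the two regimes; the data and model choices are purely illustrative and not from the episode.)

      ```python
      import numpy as np
      from sklearn.linear_model import Ridge
      from sklearn.neighbors import NearestNeighbors

      rng = np.random.default_rng(0)
      X_train = rng.normal(size=(500, 8))
      y_train = X_train @ rng.normal(size=8) + 0.1 * rng.normal(size=500)
      X_test = rng.normal(size=(5, 8))

      # Inductive: fit one general-purpose model once, reuse it for every query.
      global_model = Ridge().fit(X_train, y_train)
      inductive_preds = global_model.predict(X_test)

      # Transductive: for each query, retrieve related data and fit a throwaway
      # local model whose only purpose is that single prediction.
      nn = NearestNeighbors(n_neighbors=50).fit(X_train)
      transductive_preds = []
      for x in X_test:
          idx = nn.kneighbors(x[None, :], return_distance=False)[0]
          local_model = Ridge().fit(X_train[idx], y_train[idx])
          transductive_preds.append(float(local_model.predict(x[None, :])[0]))
      ```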

    • @economicsanity2895
      @economicsanity2895 24 days ago

      @@MachineLearningStreetTalk Oh, with this kind of content I tend to treat it like a podcast, whereby I usually listen rather than watch the video. But it would be much better, subjectively, if you could explain them in words as well.

    • @devilsolution9781
      @devilsolution9781 24 days ago +1

      ​@@MachineLearningStreetTalk You literally explained what it was after you initially mentioned the word. Ignore that dude

    • @huehuecoyotl2
      @huehuecoyotl2 23 days ago

      I'm definitely part of the non-technical audience. Listening to these interviews without a technical background is sort of like listening to a foreign language podcast, it's doable, but it takes a bit of effort initially. You can still get a lot out of it, but you may need to pause and look up a term here and there for a while to follow along. Just keep ChatGPT/Claude in a second tab! But after you do that a few times, you'll hear certain terms and concepts come up again and again in interviews and you'll be able to follow along with progressively less difficulty. By the way, the Mark Solms interview is much more approachable for those without a computer science/machine learning background.