8-bit Methods for Efficient Deep Learning with Tim Dettmers

  • Published 18 Jun 2024
  • Tim Dettmers (PhD candidate, University of Washington) presents "8-bit Methods for Efficient Deep Learning" in this Cohere For AI Technical Talk.
    Abstract: Large language models are effective tools for many tasks but are difficult to train and run inference on because of their size. Moving from 32-bit models to 16-bit models resulted in considerable efficiency gains that made training and inference of large models easier. Can we train and run inference in 8-bit to make further gains? In this talk, Tim will show that 8-bit inference and training can be used without degrading performance while improving efficiency. To make 8-bit methods work, it is essential to understand how quantization precision affects model performance and training stability as we scale the model size. He will talk about how these factors change with scale and how we need to adjust 8-bit methods to make them work. In particular, he will speak about 8-bit optimizers for training and Int8 inference for large language models with up to 175B parameters. These methods make training and inference more efficient and make large models more accessible to researchers. (A brief usage sketch of these two methods follows this description.)
    Learn more about Tim and his work at timdettmers.com/
    Learn more about Cohere For AI at cohere.for.ai.
  • Science & Technology
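
A minimal usage sketch of the two methods the abstract mentions, via the bitsandbytes and Hugging Face transformers libraries. The checkpoint name and hyperparameters below are placeholders, not from the talk:

```python
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Int8 inference (LLM.int8()): weights are quantized to int8 at load time;
# outlier feature dimensions are handled in higher precision by the library.
model_int8 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",                                    # placeholder checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 8-bit optimizer: drop-in Adam replacement that keeps the optimizer state
# (first and second moments) in 8 bits with block-wise quantization.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)
```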

COMMENTS • 8

  • @yacinegaci2831 • 22 days ago

    Very informative video, thanks.
    In the slide where you explain the use of INT4 quantization + LoRA, you said that you pass the inputs through the frozen 4-bit quantized pre-trained model and fine-tune only the adapters. My question is: do you dequantize the int4 weights of the pre-trained model to fp16, or are the computations carried out in int4 (so that the input would need to be quantized to int4 as well)?
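
As I understand the QLoRA setup described in that slide, the frozen 4-bit weights are dequantized to a 16-bit compute type for each matrix multiply (the inputs stay in 16-bit; nothing is computed in pure int4), and only the LoRA adapters receive gradients. A rough sketch of that forward pass follows; the uint8 storage, linspace codebook, and block size are simplifications of mine, not the actual NF4 kernel:

```python
import torch

class QLoRALinear(torch.nn.Module):
    """Sketch of a QLoRA-style linear layer: the frozen base weight is stored
    as 4-bit codes and dequantized to fp16 just for the matmul; only the LoRA
    adapters A and B are trained."""

    def __init__(self, in_features, out_features, rank=16, blocksize=64):
        super().__init__()
        n = out_features * in_features  # assumed divisible by blocksize
        # Frozen 4-bit weight: codes (kept in uint8 here for simplicity),
        # per-block absmax scales, and a 16-entry codebook (stand-in for NF4).
        self.register_buffer("qweight", torch.randint(0, 16, (n,), dtype=torch.uint8))
        self.register_buffer("absmax", torch.rand(n // blocksize))
        self.register_buffer("codebook", torch.linspace(-1.0, 1.0, 16))
        self.shape, self.blocksize = (out_features, in_features), blocksize
        # Trainable LoRA adapters in fp16 (B starts at zero, as in LoRA).
        self.lora_A = torch.nn.Parameter(
            torch.randn(rank, in_features, dtype=torch.float16) * 0.02)
        self.lora_B = torch.nn.Parameter(
            torch.zeros(out_features, rank, dtype=torch.float16))

    def dequantize(self):
        # Look up each 4-bit code, rescale per block, and cast to fp16.
        w = self.codebook[self.qweight.long()].view(-1, self.blocksize)
        w = w * self.absmax.unsqueeze(1)
        return w.view(self.shape).to(torch.float16)

    def forward(self, x):                        # x: fp16 activations
        w16 = self.dequantize()                  # int4 -> fp16; no gradient flows into w16
        base = torch.nn.functional.linear(x, w16)
        lora = torch.nn.functional.linear(
            torch.nn.functional.linear(x, self.lora_A), self.lora_B)
        return base + lora
```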

  • @raynardzhang4986 • 1 year ago

    Why doesn't this video have any comments? The explanation of how to experiment with this problem is beautiful. Please publish more videos like this.

  • @wayne5676 • 1 year ago

    Amazing talk! Thanks!

  • @shahrohit1990 • 11 months ago

    I think one of the important findings here is that as we go up in model size we see a lot of outliers even though we have normalization layers. So if we improve the training process, could we actually do better at quantization?
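
For context, the Int8 method from the talk deals with exactly those outliers through a mixed-precision decomposition: the few feature dimensions whose activations exceed a magnitude threshold are multiplied in higher precision, and everything else goes through an int8 matmul. A simplified sketch of that idea (the threshold value and the plain absmax quantization here are illustrative; the real kernel uses vector-wise scaling and an actual int8 GEMM):

```python
import torch

def int8_matmul_with_outliers(x, w, threshold=6.0):
    """Sketch of the mixed-precision decomposition used for Int8 inference:
    feature dimensions of x containing any |value| >= threshold are multiplied
    in higher precision; the rest are quantized to the int8 range, multiplied,
    and rescaled. x: [tokens, d_in], w: [d_in, d_out]."""
    outlier_cols = x.abs().amax(dim=0) >= threshold              # [d_in] bool

    # High-precision path for the few outlier feature dimensions.
    out = x[:, outlier_cols].float() @ w[outlier_cols, :].float()

    # Int8 path for everything else: absmax scaling, round, matmul, rescale.
    # (A real kernel would run an int8 GEMM with int32 accumulation.)
    x_sub, w_sub = x[:, ~outlier_cols].float(), w[~outlier_cols, :].float()
    sx = (x_sub.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)  # per row of x
    sw = (w_sub.abs().amax(dim=0, keepdim=True) / 127.0).clamp_min(1e-8)  # per column of w
    xi = (x_sub / sx).round().clamp(-127, 127)
    wi = (w_sub / sw).round().clamp(-127, 127)
    return (out + (xi @ wi) * sx * sw).to(x.dtype)
```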

  • @heejuneAhn • 11 months ago

    Please explain the implementation more; the theory is quite straightforward, in fact.

  • @wayne5676 • 1 year ago

    @8:09 Should it be the opposite, in the sense that more bits for the exponent + fewer bits for the fraction => good for big numbers, bad for small numbers? Since the range that can be covered is bigger, it should be good for big numbers.
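
For context on the exponent/fraction trade-off being asked about: bfloat16 spends 8 bits on the exponent and 7 on the fraction, while float16 spends 5 and 10, so bf16 covers a much larger range but at coarser precision around any given value. This can be checked directly:

```python
import torch

# float16: 5 exponent bits, 10 fraction bits; bfloat16: 8 exponent bits, 7 fraction bits.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # max / tiny are set by the exponent bits (range); eps, the gap between 1.0
    # and the next representable value, is set by the fraction bits (precision).
    print(f"{dtype}: max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.2e}")
```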

  • @wayne5676 • 1 year ago

    Can someone illustrate why 10011001 is -6.06e-3? In particular, why does 00 correspond to 1e-2, and 1001 to 0.1 + 0.9*9/16?
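
Here is a small sketch that reproduces the decoding the comment's own numbers imply: the first bit is the sign, the run of zeros before the first 1 sets a 10^-k exponent, and the bits after that indicator 1 form a linear fraction offset into [0.1, 1.0). This is my reading of the dynamic-exponent scheme, not necessarily the exact implementation from the talk:

```python
def decode_dynamic(byte_str):
    """Decode an 8-bit dynamic-exponent code: sign bit, then zeros counted
    before the first 1 give a 10^-k exponent, then the remaining bits are a
    linear fraction mapped to 0.1 + 0.9 * value / 2**num_bits (sketch only)."""
    sign = -1.0 if byte_str[0] == "1" else 1.0
    rest = byte_str[1:]
    k = 0
    while k < len(rest) and rest[k] == "0":   # count leading zeros -> exponent 10^-k
        k += 1
    linear_bits = rest[k + 1:]                # skip the indicator '1'
    levels = 2 ** len(linear_bits)
    frac = 0.1 + 0.9 * int(linear_bits, 2) / levels if linear_bits else 1.0
    return sign * frac * 10.0 ** (-k)

print(decode_dynamic("10011001"))   # -> -0.0060625, i.e. about -6.06e-3
```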

  • @heejuneAhn • 11 months ago

    GPTQ is far faster than bitsandbytes, in fact.