This is an important video. It's really exciting seeing open source be so innovative, but the clickbait headlines are, more often than not, false promises.
I honestly don't know how much of it is true. But I wanted to get the word out :)
You're right, this video is clickbait, because it's not true. I've just about stopped using ChatGPT, simply because my models can answer questions that ChatGPT from OpenAI can't answer.
@@jonconnor6697 What models are your favourites and for which tasks?
@@jonconnor6697 Which model are you using, man?
Increasing the model parameters will definitely increase factual accuracy. I think LLMs are like efficient data structures that can query and store a lot of data in compressed form. Patterns exist in the English language that LLMs can learn and generalize, but patterns won't exist in factual data; facts have to be memorized by the LLM's parameters in an efficient way. This might be why ChatGPT was trained with 175 billion parameters: so it can balance learning language context with memorizing facts. Since LLaMA has only 65 billion parameters, it might not perform as well in terms of factual accuracy compared to ChatGPT.
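A quick sense of scale for that parameter-count argument. This is just raw weight storage at 16-bit precision (the 2 bytes-per-parameter figure is an assumption; quantized models use less):

```python
# Raw weight storage: parameters * bytes per parameter.
# At fp16 (2 bytes/param), 175B vs 65B is roughly a 2.7x gap
# in how much "compressed data" the weights can hold.
def weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in gigabytes."""
    return n_params * bytes_per_param / 1e9

gpt3_gb = weight_gb(175e9)   # ~350 GB of fp16 weights
llama_gb = weight_gb(65e9)   # ~130 GB of fp16 weights
```

Of course, parameters don't map one-to-one onto facts, so this is only an upper-bound intuition, not a measure of factual capacity.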
Seeing how Falcon tops the benchmarks yet underperforms when tested with common questions, I wonder how much training goes into optimizing for the common benchmarks instead of improving the system in general.
Even GPT-4 falls apart rather quickly with the right prompts. In particular, it does not have a full-blown concept of things, only very rudimentary glimpses of what things mean. Exactly the same happens with image generation models like DALL-E and Stable Diffusion: this is why most cannot generate text, why they usually generate people with more than two arms and deformed hands, and why they cannot generate a crowd where everybody is doing something meaningful instead of repeating a common theme.
I was wondering about something similar. As a teacher, I observed that instead of teaching students general and field-specific problem solving, we simply teach them how to pass tests, which is not necessarily the same thing, depending on the test. I am worried we might do the same with LLMs: training them to score high on benchmarks and therefore making false promises about their capabilities. But this is why I love the field of AI so much. Maybe by finding a way to better benchmark LLMs, we will also learn how to improve our school system(s). By learning about AI and being honest with ourselves, we might reveal great knowledge about us as a species.
Profound comment!
We already know how to improve our school systems.
@@rashedulkabir6227 Can you elaborate?
Thank you. This mirrors my experience with the LLaMA variants.
It seems that fine-tuning can allow for better imitation, but does not make up for high-quality pre-training.
This is most obvious when giving novel instructions for output in a specific format (for example, when using the model in a multi-modal environment, generating code, etc.).
I wonder if this is simply a limitation of model size.
Smaller models may be made "as good as" or better than GPT-3.5 on domain-specific tasks, but it may be impossible to have a smaller model that is better at general task completion.
Very interesting, thanks!
I'm wondering if these models might still perform mostly as well as stated on RAG (retrieval-augmented generation) tasks. In that scenario the chatbot doesn't have to rely on deep knowledge stored inside it; instead, we want the model to master the skill of (finding and) presenting certain bits of information in a nice, readable, structured way, maybe with a specific tone. That's pretty much exactly what fine-tuning on "imitation data" (ChatGPT chats) will achieve, right?
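A minimal sketch of what that RAG setup looks like. The word-overlap retriever and the prompt template here are hypothetical stand-ins for whatever the real system uses; the point is that the facts live in the context, not in the model's weights:

```python
# Minimal RAG sketch: retrieve the best-matching snippet by word overlap,
# then put it in the prompt so the model only has to present the facts,
# not recall them from its parameters.
DOCS = [
    "LLaMA was released by Meta AI in February 2023.",
    "Falcon-40B was trained on the RefinedWeb dataset.",
]

def retrieve(query: str, docs: list[str] = DOCS) -> str:
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query: str) -> str:
    """Assemble a grounded prompt from the retrieved context."""
    context = retrieve(query)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

With a setup like this, a fine-tuned small model mainly needs the presentation skill the imitation data teaches, which is exactly the comment's point.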
Here's the thing. There is nothing new about this paper. The fact of the matter is: larger model = more space to store facts. The way a model answers questions has nothing to do with what it knows. Unfortunately people don't seem to understand this basic concept.
If you want an open-source model that knows as much as ChatGPT, just train a 175B-parameter model for a few months.
However, the way things are moving, models don't have to know a lot out of the box, because you can give them tools to fetch knowledge. This approach is better than just asking a big model, because the chance of it making something up is much lower.
But even with retrieval tools, don't you think that the size of the model and the data still contribute to the reading comprehension part?
@@wenhanzhou5826 Then just specialize the model in the comprehension part more
@@wenhanzhou5826 It does, but only to some extent. Because of how the attention mechanism works, a phenomenon called in-context learning allows the model to learn tasks with how you prompt it. But if you ask the model to perform a very complex task, even if you give everything it needs to complete it, it still may need some capabilities that only large models would have. So it depends on the task too.
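That in-context learning behavior can be illustrated with a toy few-shot prompt (the country/capital demonstrations below are just made-up examples, not from the video or paper):

```python
# Few-shot prompt construction: the task (country -> capital) is never
# stated explicitly; the model has to infer it from the demonstrations
# in the prompt. That inference is what in-context learning refers to.
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a prompt from demonstration pairs plus one open query."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\nQ: {query}\nA:"

prompt = few_shot_prompt([("France", "Paris"), ("Japan", "Tokyo")], "Italy")
```

As the reply says, whether the model actually completes this correctly still depends on capabilities the prompt cannot supply, which is where model size comes back in.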
@@ppbroAI How would you do that, though? I thought reading comprehension was an emergent phenomenon of next-token prediction.
I really want models that will output a ranked list of the actual sources used to generate a response. I understand this would make it more expensive to run but I feel there are many use cases that would justify the added expense.
To be honest, I don't find the findings controversial at all. We have known since "ancient times" that neural networks are statistical models that reflect patterns in the data, and that more data gives better performance. Yes, the quality of the data does matter, but it has always seemed too good to be true that you could get away with so little fine-tuning compute compared to the pre-training.
Why would using the larger model's output be any better than using the same raw data?
But it makes sense. Other LLMs can't compete with ChatGPT because their learning scope is limited to their original training data plus the fine-tuning. These models will be good at specific tasks versus the general tasks that ChatGPT can handle, because it was trained on the public web.
1littlecoder,
one claim certainly was fake: some people on YouTube claimed that Vicuna has "90% of ChatGPT 4 quality". That is false. The paper from the producers of Vicuna claimed that Vicuna has "90% of ChatGPT 3.5 Turbo quality", which is the *free version* of ChatGPT. GPT-4 is *much* better than GPT-3.5.
The best currently free of cost LLM is "Wizard-Vicuna 13B". That one is not commercially usable, but only for private use.
For a comparison of LLM quality, look at the Hugging Face LLM leaderboard. There you can see that "Wizard-Vicuna 13B" beats many much larger models. What should also be considered in comparisons is that uncensored models perform way better than the censored versions of the same models.
Can't say I agree. I've been running Guanaco 33B locally and have gotten much better inference results than with any of the Vicuna models.
@@Atreyuwu The smallest version of Guanaco 33B (the 4-bit quantized GPTQ version) has a 17 GB file size. How does one run that locally? LLMs usually need about 200% of the file size in memory.
@@Viewable11 I am also able to run it locally on CPU. I am using Koboldcpp and offloading 15 layers to my 8 GB GPU. I do have 64 GB of normal RAM, though. After processing the context (Koboldcpp does this smartly by reusing part of it), it generates about two completion tokens per second.
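A back-of-the-envelope way to check those numbers. Both the "200% of file size" rule of thumb and the even per-layer split are rough assumptions; real per-layer sizes and runtime overhead vary:

```python
# Rough memory planner for running a quantized model with partial
# GPU offload, as in the Koboldcpp setup above. Assumes weights are
# split evenly across layers, which is only approximately true.
def memory_split_gb(file_gb: float, n_layers: int, gpu_layers: int,
                    overhead: float = 2.0) -> tuple[float, float]:
    """Return (gpu_gb, ram_gb) needed, budgeting overhead * file_gb
    total (the ~200% rule of thumb mentioned in the thread)."""
    total = file_gb * overhead
    per_layer = total / n_layers
    gpu = per_layer * gpu_layers
    return gpu, total - gpu

# Guanaco 33B, 4-bit GPTQ: ~17 GB file; layer count (~60) is a guess.
gpu_gb, ram_gb = memory_split_gb(17.0, 60, 15)
```

Under these assumptions, 15 offloaded layers land around 8.5 GB on the GPU and ~25 GB in system RAM, which is roughly consistent with an 8 GB card plus 64 GB of RAM handling the rest.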
They are popular not because they are close to corporate closed-source models. People use them because of privacy, local usage, and of course because open-source models have unfiltered versions that aren't dumbed down.
Closed-source RLHF training has demonstrably made the responses more appealing to more people, but far less accurate.
Factuality is difficult with LLMs. However, they look at small models only. Parameter count influences reasoning capacity, and through that, factuality. According to a university presentation from a while ago, I think by OpenAI staff, RLHF can also improve factuality, by aligning output probability with a kind of "internal knowledge certainty" of the model.
Of course, smaller fine-tuned open-source models and ChatGPT do not have the same performance level.
I'm not surprised by their finding that too much fine-tuning data can even hurt.
I'm not sure SFT is always only imitation, as OpenAI also claims when distinguishing it from RLHF.
New SFT approaches with little, carefully selected data (LIMA, or Evol-Instruct, which boosts reasoning capacity) on large open-source models could indeed be useful enough to replace ChatGPT in some end applications, where the model's reasoning is sufficient for factuality.
It does not train the model to self-evaluate its own uncertainty, though.
Nice detailed reasoning!
Great find
Thank you
Good video!
Thanks
Thanks for the leg work.
That ninety percent? Getting to ninety percent of an expert's level takes ten percent of the effort. The last ten percent takes the other ninety percent of the effort. A ten percent gap may be impossible to close.
Didn't watch the video yet:
- Some models include LLM benchmark test data in their training set, so they're not as good as they seem.
- There are too many models, too many "combinations", and no real way to benchmark them (since some used test data for training, AND using a model to evaluate another model isn't very reliable yet). The paper you're reading at first only tests models of 13B and below; we have 40B to 65B models that are much better.
(You can train a 65B model for $20; comparing it to OpenAI's GPT-3.5 is relevant because, no matter how good they are, $20 compared to millions... is amazing.)
- Last point: they innovate more and more, so I guess we will get even better models thanks to innovations, and also because new GPUs will come out, making server-grade cloud GPUs even cheaper.
Honestly, I'm already sick of what goes on every day in the LLM space: every single day some crap comes out claiming to be better crap than the previous crap. I applaud LoRA and all the new innovations the open-source community has come up with recently, but now it just seems that anybody with access to GPUs is training some derivative of the LLaMA model on variations of the same crappy datasets that circulate online. This madness has to stop.
An eye opener!!!!
Please share your LinkedIn url/profile