This is an important video. It's really exciting seeing open source be so innovative, but the clickbait headlines are, more often than not, false promises.
I honestly don't know how much of it is true. But I wanted to get the word out :)
You're right, this video is clickbait, because it's not true. I've just about stopped using ChatGPT, simply because my models can answer questions that ChatGPT from OpenAI can't answer.
@@jonconnor6697 What models are your favourites and for which tasks?
@@jonconnor6697 Which model are you using, man?
Increasing the model parameters will definitely increase factual accuracy. I think LLMs are like efficient data structures that can query and store a lot of data in compressed form. Patterns exist in the English language that LLMs can learn and generalize, but patterns won't exist in factual data; facts have to be memorized by the LLM's parameters in an efficient way. This might be why ChatGPT was trained with 175 billion parameters: so it can balance learning language context with memorizing facts. Since LLaMA has only 65 billion parameters, it might not perform as well in terms of factual accuracy compared to ChatGPT.
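A quick sense of scale for that parameter-count argument. This is just raw weight storage at 16-bit precision (the 2 bytes-per-parameter figure is an assumption; quantized models use less):

```python
# Raw weight storage: parameters * bytes per parameter.
# At fp16 (2 bytes/param), 175B vs 65B is roughly a 2.7x gap
# in how much "compressed data" the weights can hold.
def weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in gigabytes."""
    return n_params * bytes_per_param / 1e9

gpt3_gb = weight_gb(175e9)   # ~350 GB of fp16 weights
llama_gb = weight_gb(65e9)   # ~130 GB of fp16 weights
```

Of course, parameters don't map one-to-one onto facts, so this is only an upper-bound intuition, not a measure of factual capacity.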
Seeing how Falcon tops the benchmarks yet underperforms when tested with common questions, I wonder how much training goes into optimizing for the common benchmarks instead of improving the system in general.
Even GPT-4 falls apart rather quickly with the right prompts. In particular, it does not have a full-blown concept of things, only very rudimentary glimpses of what things mean. Exactly the same happens with image generation models like DALL-E and Stable Diffusion: this is why most cannot generate text, why they usually generate people with more than two arms and deformed hands, and why they cannot generate a crowd where everybody is doing something meaningful instead of repeating a common theme.
I was wondering about something similar. As a teacher, I observed that instead of teaching students general and field-specific problem solving, we simply teach them how to pass tests, which is not necessarily the same thing, depending on the test. I am worried we might do the same with LLMs: training them to score high on benchmarks and therefore making false promises about their capabilities. But this is why I love the field of AI so much. Maybe by finding a way to better benchmark LLMs, we will also learn how to improve our school system(s). By learning about AI and being honest with ourselves, we might reveal great knowledge about us as a species.
Profound comment!
We already know how to improve our school systems.
@@rashedulkabir6227 Can you elaborate?
Thank you. This mirrors my experience with the LLaMA variants.
It seems that fine-tuning can allow for better imitation, but does not make up for high-quality pre-training.
This is most obvious when giving novel instructions for output in a specific format (for example, when using the model in a multi-modal environment, generating code, etc.).
I wonder if this is simply a limitation of model size.
Smaller models may be made "as good as" or better than GPT-3.5 on domain-specific tasks, but it may be impossible to have a smaller model that is better at general task completion.
Very interesting, thanks!
I'm wondering if these models might still perform mostly as well as stated on RAG (retrieval-augmented generation) tasks. In that scenario the chatbot doesn't have to rely on deep knowledge stored inside it; instead, we want the model to master the skill of (finding and) presenting certain bits of information in a nice, readable, structured way, maybe with a specific tone. That's pretty much exactly what fine-tuning on "imitation data" (ChatGPT chats) will achieve, right?
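A minimal sketch of what that RAG setup looks like. The word-overlap retriever and the prompt template here are hypothetical stand-ins for whatever the real system uses; the point is that the facts live in the context, not in the model's weights:

```python
# Minimal RAG sketch: retrieve the best-matching snippet by word overlap,
# then put it in the prompt so the model only has to present the facts,
# not recall them from its parameters.
DOCS = [
    "LLaMA was released by Meta AI in February 2023.",
    "Falcon-40B was trained on the RefinedWeb dataset.",
]

def retrieve(query: str, docs: list[str] = DOCS) -> str:
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query: str) -> str:
    """Assemble a grounded prompt from the retrieved context."""
    context = retrieve(query)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

With a setup like this, a fine-tuned small model mainly needs the presentation skill the imitation data teaches, which is exactly the comment's point.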
Here's the thing. There is nothing new about this paper. The fact of the matter is: larger model = more space to store facts. The way a model answers questions has nothing to do with what it knows. Unfortunately people don't seem to understand this basic concept.
If you want an open-source model that knows as much as ChatGPT, just train a 175B-parameter model for a few months.
However, the way things are moving, models don't have to know a lot out of the box, because you can give them tools to fetch knowledge. This approach is better than just asking a big model, because the chance of it making something up is much lower.
But even with retrieval tools, don't you think that the size of the model and the data still contribute to the reading comprehension part?
@@wenhanzhou5826 Then just specialize the model in the comprehension part more
@@wenhanzhou5826 It does, but only to some extent. Because of how the attention mechanism works, a phenomenon called in-context learning allows the model to learn tasks with how you prompt it. But if you ask the model to perform a very complex task, even if you give everything it needs to complete it, it still may need some capabilities that only large models would have. So it depends on the task too.
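That in-context learning behavior can be illustrated with a toy few-shot prompt (the country/capital demonstrations below are just made-up examples, not from the video or paper):

```python
# Few-shot prompt construction: the task (country -> capital) is never
# stated explicitly; the model has to infer it from the demonstrations
# in the prompt. That inference is what in-context learning refers to.
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a prompt from demonstration pairs plus one open query."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\nQ: {query}\nA:"

prompt = few_shot_prompt([("France", "Paris"), ("Japan", "Tokyo")], "Italy")
```

As the reply says, whether the model actually completes this correctly still depends on capabilities the prompt cannot supply, which is where model size comes back in.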
@@ppbroAI How would you do that, though? I thought reading comprehension was an emergent phenomenon of next-token prediction.
I really want models that will output a ranked list of the actual sources used to generate a response. I understand this would make it more expensive to run but I feel there are many use cases that would justify the added expense.
To be honest, I don't find the findings controversial at all. We have known since "ancient times" that neural networks are statistical models that reflect patterns in the data, and that more data gives better performance. Yes, the quality of the data does matter, but it has always seemed too good to be true that you could get away with so little fine-tuning compute compared to the pre-training.
Why would using the larger model's output be any better than using the same raw data?
But it makes sense. Other LLMs can't compete with ChatGPT because their learning scope is limited to their original training data plus the fine-tuning. These models will be good at specific tasks versus the general tasks that ChatGPT can handle, because it was trained on the public web.
1littlecoder,
one claim certainly was fake: some people on YouTube claimed that Vicuna has "90% of ChatGPT 4 quality". That is false. The paper from the producers of Vicuna claimed that Vicuna has "90% of ChatGPT 3.5 Turbo quality", which is the *free version* of ChatGPT. GPT-4 is *much* better than GPT-3.5.
The best currently free of cost LLM is "Wizard-Vicuna 13B". That one is not commercially usable, but only for private use.
For a comparison of LLM quality, look at the Hugging Face LLM leaderboard. There you can see that "Wizard-Vicuna 13B" beats many much larger models. What should also be considered in comparisons is that uncensored models perform way better than the censored versions of the same models.
Can't say I agree. I've been running Guanaco 33B locally and have gotten much better inference results than with any of the Vicuna models.
@@Atreyuwu The smallest version of Guanaco 33B (the 4-bit quantized GPTQ version) has a 17 GB file size. How does one run that locally? LLMs usually need about 200% of the file size in memory.
@@Viewable11 I am also able to run it locally on CPU. I am using Koboldcpp and offloading 15 layers to my 8 GB GPU. I do have 64 GB of normal RAM, though. After processing the context (Koboldcpp does this smartly by reusing part of it), it generates about two completion tokens per second.
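A back-of-the-envelope way to check those numbers. Both the "200% of file size" rule of thumb and the even per-layer split are rough assumptions; real per-layer sizes and runtime overhead vary:

```python
# Rough memory planner for running a quantized model with partial
# GPU offload, as in the Koboldcpp setup above. Assumes weights are
# split evenly across layers, which is only approximately true.
def memory_split_gb(file_gb: float, n_layers: int, gpu_layers: int,
                    overhead: float = 2.0) -> tuple[float, float]:
    """Return (gpu_gb, ram_gb) needed, budgeting overhead * file_gb
    total (the ~200% rule of thumb mentioned in the thread)."""
    total = file_gb * overhead
    per_layer = total / n_layers
    gpu = per_layer * gpu_layers
    return gpu, total - gpu

# Guanaco 33B, 4-bit GPTQ: ~17 GB file; layer count (~60) is a guess.
gpu_gb, ram_gb = memory_split_gb(17.0, 60, 15)
```

Under these assumptions, 15 offloaded layers land around 8.5 GB on the GPU and ~25 GB in system RAM, which is roughly consistent with an 8 GB card plus 64 GB of RAM handling the rest.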
They are popular not because they are close to corporate closed-source models. People use them because of privacy, local usage, and of course because open-source models have unfiltered versions that aren't dumbed down.
Closed-source RLHF training has demonstrably made the responses more appealing to more people, but far less accurate.
Factuality is difficult with LLMs. However, they look at small models only. Parameter count influences reasoning capacity, and through that, factuality. According to a university presentation from a while ago, I think by OpenAI staff, RLHF can also improve factuality, by aligning output probability with a kind of "internal knowledge certainty" of the model.
Of course, smaller fine-tuned open-source models and ChatGPT do not have the same performance level.
I'm not surprised by their finding that too much fine-tuning data can even hurt.
I'm not sure SFT is always only imitation, as OpenAI also claims when distinguishing it from RLHF.
New SFT approaches with little, carefully selected data (LIMA, or Evol-Instruct, which boosts reasoning capacity) on large open-source models could indeed be useful enough to replace ChatGPT in some end applications, where the model's reasoning is sufficient for factuality.
It does not train the model to self-evaluate its own uncertainty, though.
Nice detailed reasoning!
Great find
Thank you
Good video!
Thanks
Thanks for the leg work.
That ninety percent? Getting to ninety percent of an expert's level takes ten percent of the effort. The last ten percent takes the other ninety percent of the effort. A ten percent gap may be impossible to close.
Didn't watch the video yet:
- Some models include LLM benchmark test data in their training set, so they're not as good as they seem.
- There are too many models, too many "combinations", and no real way to benchmark them (since some used test data for training, AND using a model to evaluate another model isn't very reliable yet). The paper you're reading at first only tests models of 13B and below; we have 40B to 65B models that are much better.
(You can train a 65B model for $20; comparing it to OpenAI's GPT-3.5 is relevant because, no matter how good they are, $20 compared to millions... is amazing.)
- Last point: they innovate more and more, so I guess we will get even better models thanks to innovations, and also because new GPUs will come out, making server-grade cloud GPUs even cheaper.
Honestly, I'm already sick of what goes on every day in the LLM space: every single day some crap comes out claiming to be better crap than the previous crap. I applaud LoRA and all the new innovations the open-source community has come up with recently, but now it just seems that anybody with access to GPUs is training some derivative of the LLaMA model on variations of the same crappy datasets that circulate online. This madness has to stop.
An eye opener!!!!
Please share your LinkedIn url/profile