I'm digging your videos man, you're doing really great work
Thanks @henryholloway5656!
Thanks for the review and commentary
Thanks!
That ending about the Adobe measuring tool was priceless - it's like the guy who picked a car lock with a tennis ball.
I've had only one session with Claude 2. But I've already found deficiencies that I never see in the chat agent based on GPT-4. Claude 2 contradicts itself within single responses. Furthermore, when I tried to nudge it in the right direction, it instead generated worse responses.
I'll mention also that I've come to doubt that multiple-choice questions are appropriate in testing AIs. For humans, performance on multiple-choice tests is predictive of performance on more open-ended tasks. We don't have a good basis for believing the same about agents based on large language models.
Thanks for sharing your perspective.
I agree with your point that it's unclear to what degree exams (particularly those designed with the constraints of human memory in mind) are predictive of broader performance for LLMs.
Really appreciate your excellent selection of excerpts from technical papers that I never even knew existed. Your interaction with the AI is also very instructive. I believe that it is now possible to upload up to five PDF files at once, as well as documents in several other formats. I thought that the 100k token limit was for the combination of input and response, but it may be only for input, whereas the token limit for the response is 4k.
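For what it's worth, when calling the model through the API the response length is capped separately by an explicit parameter, whatever the overall window covers. A minimal sketch, assuming the 2023 anthropic Python SDK and its (legacy) completions endpoint:

```python
# Minimal sketch (my assumption: the 2023 anthropic Python SDK with the
# legacy completions endpoint). The response cap is set explicitly via
# max_tokens_to_sample, independently of how long the prompt is.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.completions.create(
    model="claude-2",
    # A long document can go in the prompt, up to the 100k context window.
    prompt=f"{anthropic.HUMAN_PROMPT} Summarise: <long document here>{anthropic.AI_PROMPT}",
    max_tokens_to_sample=4000,  # explicit cap on the generated response
)
print(response.completion)
```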
Thanks @ram49967!
Thanks
You're most welcome.
Your videos are awesome. Can't wait for more.
Thanks @AkarshanBiswas - much appreciated.
really enjoy your walkthrough!
Thanks @JL-zI6ot!
8:22 Thanks for this insight
Thanks @juliangawrongsky9339.
Keep going!
Thanks for the encouragement!
awesome
Thanks!
Keep it up man!
Thanks! I'll do my best
Love your humor lol! gj!
Thanks @TheManinBlack9054!
Claude 2 is an awesome model, but it has serious problems with hallucinations. That could probably be fixed by giving it data access via web browsing, and for data files they need to do some fine-tuning to make sure Claude doesn't invent data that isn't found in the files provided. Providing references in Claude 2's responses would probably help. If I were Anthropic, I would treat this as the most pressing concern for Claude at the moment. On a positive note, I feel like for an estimated 174B parameters, Claude 2 comes very close to GPT-4 - which is said to be a mixture of experts, making it even more impressive.
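Something like this prompt pattern might help with invented file data in the meantime - my own sketch, not an Anthropic feature, assuming the anthropic Python SDK:

```python
# Hypothetical prompt pattern (my sketch, not an official Anthropic feature)
# for nudging Claude 2 to ground answers in a provided file and cite it.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

document = open("report.txt").read()  # hypothetical data file

prompt = (
    f"{anthropic.HUMAN_PROMPT} Here is a document:\n<doc>\n{document}\n</doc>\n"
    "Answer using ONLY the document above. Quote the exact supporting sentence "
    "for every claim, and reply 'not in the document' if the answer is missing.\n"
    f"Question: What does the report give as the main risk factor?{anthropic.AI_PROMPT}"
)

response = client.completions.create(
    model="claude-2",
    prompt=prompt,
    max_tokens_to_sample=500,
)
print(response.completion)
```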
Thanks for sharing your experience with the model!
Hope we never reach the end
I'm not entirely sure I understand your meaning, but I hope so too.
@@SamuelAlbanie1 12:46
For me, so far it's been very unwilling to return full code in its answers (longer code), which is quite annoying. And it won't even try refactoring larger codebases - it just returns summaries and pseudocode.
Thanks for sharing your experience. I've mainly used it for summarisation so far - less so for coding. It's interesting to hear that Claude 2 may be less well-suited for that use-case.
@@SamuelAlbanie1 For summarizing large texts it's been great, but a bit of a battle code-wise; also, the code quality is not as good as GPT-4's in most cases. At least in my experience. Though it sometimes has more interesting ideas as to what to do (i.e., reviewing/creating a ticket for given code).
I have yet to try Sourcegraph though; it looks like it may be a good competitor to Copilot Chat (which is pretty useless running on 3.5).
As much as I love my country and its people, AI unavailability makes me want to leave Russia.
Seems like it was planned to force Russia out of the AI race.
Have you experimented with SberBank GigaChat? (I saw it announced, but haven't tried it myself.)
That was an expected condition. A 👌 standard day.
I think that's a reasonable assessment at this time.
That was a habitual 📍 moment. A commonplace event.
Thanks for the question. I don't think we can read too much into the homogeneity of the context from the figure - it's primarily aimed at demonstrating that the loss continues to trend downwards.
That being said, intuitively it seems highly plausible that extending to a significantly larger context window may diminish the model's ability to pick up on details within the window (relative to an alternative that uses a similar compute budget but a more compact window). I think it's an open question though.
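To make the compute-budget intuition concrete, here's a rough back-of-the-envelope sketch under my own assumptions (counting only the quadratic self-attention matmuls and ignoring the linear projections, which scale only linearly in context length):

```python
# Back-of-the-envelope: how the quadratic self-attention term scales with
# context length at fixed model size. Assumptions are mine (not from the
# Claude 2 report): ~4 * n^2 * d FLOPs per layer per sequence for the
# QK^T and attention-weighted-V matmuls, ignoring linear projections.

def attn_flops(n_ctx: int, d_model: int) -> float:
    return 4.0 * n_ctx**2 * d_model

d_model = 8192  # hypothetical hidden size
for n_ctx in (8_000, 100_000):
    print(f"context {n_ctx:>7,}: ~{attn_flops(n_ctx, d_model):.2e} attention FLOPs/seq")

# Going from 8k to 100k context scales this term by (100/8)^2 ≈ 156x,
# which is why a fixed compute budget forces a trade-off between context
# length and capacity spent elsewhere.
```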