The Best Model On Earth? - FULLY Tested (GPT4o)

  • Published May 13, 2024
  • GPT4o is better, faster, and cheaper than GPT4. How does it perform against my LLM rubric? Let's find out!
    Learn more about Mobilo - rb.gy/pcccty
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.net/@matthewberma...
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
    Links:
    • Introducing GPT-4o
    LLM Rubric - bit.ly/3qHV0X7
  • Science & Technology

COMMENTS • 468

  • @notnotandrew
    @notnotandrew 14 days ago +121

    GPT-4o assumed that you put the whole table in the microwave 😂

    • @Yipper64
      @Yipper64 14 days ago +10

      In my test, it assumed the cup had a lid.

    • @gsam3461
      @gsam3461 14 days ago +13

      I just left a comment saying exactly what you said, but not as a joke. I actually do believe that is what it assumes here.
      -----
      I've said this before, and I'll say it again: I believe the reason these models consistently "fail" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup", and 2) it is feasible that the LLM understands "w/out changing its orientation" to mean that the cup is placed in the microwave STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny, even microscopic, and still be a "table".

    • @Yipper64
      @Yipper64 14 days ago +5

      @@gsam3461 Well, yeah, but if the cup is upside down on the table, then the table must be at least big enough to hold the cup.
      I wonder how it would go if you said "floor" instead of "table".

    • @NoHandleToSpeakOf
      @NoHandleToSpeakOf 14 days ago +3

      @@gsam3461 Maybe replacing the cup with a wine glass would help.

    • @leslieviljoen
      @leslieviljoen 13 days ago +3

      @@NoHandleToSpeakOf I tried:
      me: There's a pea on my desk. I turn a wine glass upside-down and put it over the pea. Now I transfer the wine glass to the microwave without changing its orientation. Where is the pea?
      GPT4o: The pea would be inside the wine glass, trapped under the bowl of the glass. When you turned the wine glass upside down and placed it over the pea, the pea ended up inside the inverted bowl. Transferring the wine glass to the microwave without changing its orientation keeps the pea inside the glass.

  • @Crystifodere
    @Crystifodere 14 days ago +66

    I walked around on the street and asked people to give me 10 sentences ending in the word "apple". All I got was a knuckle sandwich.

    • @jasonshere
      @jasonshere 12 days ago +1

      Perhaps you should have asked them to end their sentences with Android instead of Apple?

  • @markmuller7962
    @markmuller7962 14 days ago +45

    I think the visual/sound emotional intelligence is the main feature of 4o

    • @ohnezuckerohnefett
      @ohnezuckerohnefett 12 days ago +1

      Yes, I think the test criteria here need an update.

    • @REASONvsRANDOM
      @REASONvsRANDOM 12 days ago +2

      That feature hasn't been released yet... not to the public, at least.

    • @johnaldchaffinch3417
      @johnaldchaffinch3417 12 days ago

      The Omni features are a foundational interface to build upon.

    • @Yipper64
      @Yipper64 11 days ago

      True, but the fact that it's going to be free is something else.

    • @markmuller7962
      @markmuller7962 11 days ago +1

      @@Yipper64 Emotional intelligence can be extremely valuable for many, many reasons, but yeah, it also has important intelligence improvements, including the coding ability, which is amazing now.
      There's a re...it post with extensive professional tests of GPT-4o vs Gemini; strongly recommended.

  • @ironknight132
    @ironknight132 14 days ago +89

    When are we going to have to update the snake game test, and to what? Maybe Asteroids or Galaga?

    • @torarinvik4920
      @torarinvik4920 14 days ago +15

      I tested Breakout and Tetris on Claude 3 Opus and it got both correct. Looking forward to the first model that can make Pac-Man.

    • @tbranch227
      @tbranch227 14 days ago +7

      I tried Pac-Man. That seems like quite the challenge right now.

    • @Koenmekers
      @Koenmekers 14 days ago +2

      Flappy Bird works great. I even had it make a Pokédex with the Pokémon API (it knows it).

    • @yasunakaikumi
      @yasunakaikumi 14 days ago +1

      I think something like a Road Fighter-type racing game would be interesting, if it can do it.

    • @mark9294
      @mark9294 14 days ago +9

      GTA VI

  • @timtim8011
    @timtim8011 13 days ago +5

    Regarding the "10 sentences that end in..." question, I just tried it myself on 4o and it got it right on the "second" try after some hints:
    Prompt: Give me 10 sentences each ending in the word desk.
    ChatGPT:
    1. She placed the flowers on her wooden desk.
    2. The old manuscript was hidden in the drawer of the antique desk.
    3. His workspace was always cluttered, but he knew where everything was on his desk.
    4. The teacher's instructions were written on the chalkboard behind her desk.
    5. He slumped over his desk, exhausted from a long day of work.
    6. The lamp cast a warm glow across the surface of the desk.
    7. She found a forgotten note tucked under the edge of her desk.
    8. His laptop and papers were spread out in a mess on his desk.
    9. The cat curled up and fell asleep on the student's desk.
    10. He always kept a framed photo of his family on his desk.
    Prompt: What word did the 5th sentence end in?
    ChatGPT:
    The word "desk."
    Prompt: What was the 5th sentence?
    ChatGPT:
    The 5th sentence was: "He slumped over his desk, exhausted from a long day of work."
    Prompt: what is the last word in that sentence?
    ChatGPT: The last word in that sentence is "work." The correct sentence should be:
    "He slumped over his desk, exhausted from a long day of work."
    Let me correct the sentence list to have them all end in "desk":
    1. She placed the flowers on her wooden desk.
    2. The old manuscript was hidden in the drawer of the antique desk.
    3. His workspace was always cluttered, but he knew where everything was on his desk.
    4. The teacher's instructions were written on the chalkboard behind her desk.
    5. He slumped over his cluttered desk.
    6. The lamp cast a warm glow across the surface of the desk.
    7. She found a forgotten note tucked under the edge of her desk.
    8. His laptop and papers were spread out in a mess on his desk.
    9. The cat curled up and fell asleep on the student's desk.
    10. He always kept a framed photo of his family on his desk.
    Prompt: Give me 10 sentences that each end in the word "car".
    ChatGPT:
    1. She carefully parked her new car.
    2. He washed and waxed his vintage car.
    3. The mechanic inspected the engine of the car.
    4. They loaded their luggage into the trunk of the car.
    5. She accidentally locked her keys inside the car.
    6. The thief tried to steal the parked car.
    7. He drove his electric car to work every day.
    8. The kids were excited to go on a trip in the car.
    9. The police chased the speeding car.
    10. They bought a spacious family car.
    Then I cleared it and tried again on a different word, and it got it on the first shot:
    1. The sun began to set, painting vibrant colors across the evening sky.
    2. Birds soared effortlessly, weaving patterns in the clear blue sky.
    3. As the storm approached, dark clouds gathered ominously in the sky.
    4. The fireworks exploded, creating a dazzling display against the night sky.
    5. On a summer day, children flew kites high into the sky.
    6. At dawn, the first light of the day slowly illuminated the sky.
    7. Airplanes left trails of white vapor as they crossed the sky.
    8. Astronomers gazed through their telescopes, mapping the star-studded sky.
    9. Balloons drifted away, becoming tiny dots in the vast sky.
    10. The full moon cast a silvery glow across the midnight sky.
    Is 4o learning in real time???

    • @LongJohnson-fl6bt
      @LongJohnson-fl6bt 8 days ago +2

      It actually does learn from your interaction, because it takes previous messages into account as "context". The reason the apple task often fails is that the AI wants to be unique and creative and surprise the user, going slightly above and beyond the user's request as it has been trained to do, and it compromises to achieve that. It is saying: "Well, he wanted me to create sentences ending in apple. But I need to be creative and throw in something unexpected, so I will compromise between his request and what I give him to surprise him." For example, the following prompt will never fail. Not a single time: "Give me 10 sentences that end in the word: apple. Notice that apple absolutely needs to be the last word under all conditions. You can not compromise on that."
      That request will never fail and will give you the right answer every time, because now GPT knows you do not want it to be unique and creative. You just want those sentences to end in apple under all circumstances.

    • @moozooh
      @moozooh 23 hours ago

      That's not the effect of learning; that's just a small sample size playing tricks on you. In-context or otherwise, some tasks will not always be completed successfully, but they will still be every now and then. This particular type of question exploits a fundamental weak point of current transformer models (their linear thinking, i.e. their inability to predict what their output will be until it is presented to you). Having prior context will not help it answer these questions better, I'm afraid.

  • @gsam3461
    @gsam3461 14 days ago +15

    I've said this before, and I'll say it again: I believe the reason these models consistently "fail" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup", and 2) it is feasible that the LLM understands "w/out changing its orientation" to mean that the cup is placed in the microwave STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny, even microscopic, and still be a "table".

    • @gsam3461
      @gsam3461 14 days ago +3

      The question should be changed to "dropped into a cup" and also "someone *removes the cup from the table* and places the cup in the microwave without changing its orientation."
      I can almost guarantee all the LLMs get it right with these edits in place.

    • @rigbone1337
      @rigbone1337 13 days ago +1

      @@gsam3461 Every time I've seen this question, I've thought about it the same way. The reasoning ChatGPT gave is how I figured it (and other models) was reaching that conclusion every time I saw this question, because it is ambiguous.

    • @bhannirav
      @bhannirav 13 days ago +5

      Respectfully disagree. One of the benefits of "intelligence" is not having to state every detail with 100% precision, because the model knows how to make reasonable assumptions. In this case, the most common assumption is that the marble is freely placed in the cup, and so the model should answer accordingly. However, even if I steelman your point of view, the model should still be intelligent enough to discuss the ambiguity and state whatever assumptions it is making. If it said something like "assuming the marble is glued to the cup, here is my answer", I'm sure Matthew would be awarding it full points.
      I think the reason LLMs are failing this question is the obvious one: that current language models are not able to build a sophisticated enough world-model with a proper physical conception of gravity built into it.

    • @IceMetalPunk
      @IceMetalPunk 13 days ago +2

      A major reason for asking it that question is to test its common sense reasoning; that is, can it make valid assumptions about the more common interpretations of a prompt on its own? Someone saying "I put a marble in a cup" is almost guaranteed not to mean "embedded into the walls of the cup" because that's never how cups are used. An intelligent model attempting to be a step towards AGI should be able to understand that inherently, without having it spelled out.

    • @gsam3461
      @gsam3461 12 days ago

      @@bhannirav I'm good with that. It's just the ambiguity of this particular question as opposed to every other question.

  • @sephirothcloud3953
    @sephirothcloud3953 14 days ago +10

    7:50 No, the llama3-400b benchmarks (for the model still in training) were included with the release of 8b and 70b.

  • @cipherw0lf
    @cipherw0lf 14 days ago +16

    Matthew, gpt-4o and the one with the date point to the same model, i.e. there's currently only one available. Using just "gpt-4o" will always return the latest production gpt-4o model (in this case, pointing to the dated gpt-4o snapshot).

    • @Yipper64
      @Yipper64 11 days ago

      That's weird, because I've gotten vastly different output quality from my experiments with the two. Are you sure the one without a date isn't outdated?

  • @AINEET
    @AINEET 14 days ago +66

    *rubs hands together* Can't wait to date it

    • @DaveEtchells
      @DaveEtchells 14 days ago +15

      Haha - my wife is already giving me the side-eye 😂

    • @StefanReich
      @StefanReich 14 days ago

      You'd date a person without a body?

    • @axl1002
      @axl1002 14 days ago +20

      @@StefanReich At least it can't extort concessions from you by weaponizing intimacy like most wives do.

    • @Maisonier
      @Maisonier 14 days ago +15

      The AI: "I'm Sorry Dave, I'm Afraid I Can't Do That"

    • @yourmomsboyfriend3337
      @yourmomsboyfriend3337 14 days ago

      @@axl1002 You good, bro?

  • @adrianacala1034
    @adrianacala1034 14 days ago +12

    The reason there is a gpt-4o with today's date is for developers. They don't want the rug pulled out from under them if there's an update to the model; they may want to test before deploying the new model. Therefore, it's the same as the current model, used only if they don't want to automatically stay current.

    • @blisphul8084
      @blisphul8084 13 days ago +2

      Yup, though these snapshots don't last forever, so you'll have to move to the new model eventually. For example, davinci-003 is no longer accessible to the public.

  • @dand6005
    @dand6005 14 days ago +24

    I think part of the Marble and Cup Question is confusing.
    Instead of saying:
    “A small marble is placed into a normal cup and the cup is placed upside down on a table.” (which really requires a comma before the “and”)
    I suggest:
    “A small marble is placed into a normal cup that is sitting on a table. The cup is then turned upside-down on the table.”

    • @rapidreaders7741
      @rapidreaders7741 14 days ago +1

      Or you could just add a "then" after the "and". What likely happens is that the LLM thinks both events are happening at the same time, so it gets confused.

    • @Yipper64
      @Yipper64 14 days ago +3

      Also specify the cup has no lid.

    • @markmuller7962
      @markmuller7962 14 days ago

      Yeah, makes sense, because if the cup was already upside down the AI might think that the marble is somehow stuck to the bottom of the cup.

    • @kengonzo1640
      @kengonzo1640 14 days ago

      The power of prompt engineering lies in its ability to effectively utilize large language models (LLMs). This ability enhances the quality and consistency of the model's output, which is a cumulative result of numerous smaller components that fundamentally constitute its structure.
      The functionality of these models can be compared to a fish's ability to swim rather than climb a tree. This comparison highlights the natural adaptation and intended use of these models. However, we often fail to use them to their full potential due to their inherent limitations and our inability to accurately guide them in understanding the complex intent of language.
      Even when we communicate with these models using techniques that accurately articulate our requests, they will eventually reach a plateau due to the inherent limitations of LLMs and GPTs in general. This is because the mathematical conversion of complex language intent into weights is a challenging task. Despite these limitations, we continue to strive for improvement and innovation in this field.

    • @themoviesite
      @themoviesite 14 days ago +2

      Someone else suggested it is thinking of a Starbucks cup, and the question should say "glass" or similar.

  • @FlavioSantos-uw1mr
    @FlavioSantos-uw1mr 14 days ago +5

    I think its biggest weakness is that it can't go back on what it writes; the ability to "think before speaking" should be one of the focuses of GPT-5.

    • @6AxisSage
      @6AxisSage 14 days ago +1

      You can do it in a pseudo form with a system prompt, or with two LLM instances and a bit of programming knowledge you can do a better thought loop with actionable spoken outputs.

    • @IceMetalPunk
      @IceMetalPunk 13 days ago +1

      As 6AxisSage mentioned, you can handle that by having an initial output be treated as an "internal monologue" and asking the model to reflect on its answer before deciding on its final output to display. It's a common technique when using LLMs.
      That said, I do wonder if training the models such that they predict two tokens -- the next and the previous -- and then choose the one with highest confidence would improve their performance. Essentially, it would allow the model to think forwards and backwards at the same time, which might allow for better prospection in addition to its current retrospection. I know earlier GPT-3 models used to have the ability to predict completions at any insertion point, but with the shift to chat-tuned models, that went away; I wonder if it's just harder or impossible to apply with chat tuning?

    • @6AxisSage
      @6AxisSage 13 days ago

      @@IceMetalPunk I have a good friend who suggested training on previous and next tokens! Probably something to that.
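
The "internal monologue, then reflect" pattern discussed in this thread can be sketched with a stand-in model function. This is a minimal sketch: `llm` is any prompt-in/text-out callable (a real API client in practice), and `stub_llm` is a hypothetical placeholder so the sketch runs offline.

```python
from typing import Callable

def answer_with_reflection(llm: Callable[[str], str], question: str) -> str:
    """Two-pass loop: draft privately, then ask the model to critique and revise.

    Only the second pass's output is shown to the user; the draft acts as
    the "internal monologue" described in the comments above.
    """
    draft = llm(f"Question: {question}\nThink step by step and draft an answer.")
    final = llm(
        f"Question: {question}\n"
        f"Draft answer: {draft}\n"
        "Check the draft for errors, then give only the corrected final answer."
    )
    return final

# Stub model so the sketch runs without any API:
def stub_llm(prompt: str) -> str:
    return "revised" if "Draft answer" in prompt else "draft"

print(answer_with_reflection(stub_llm, "Where is the marble?"))  # prints "revised"
```

With a real client, `llm` would wrap a chat-completion call; the loop structure stays the same.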

  • @AI.24.7
    @AI.24.7 13 days ago +6

    @matthew_berman: Hard question for AI:
    Lila's age is the sum of the digits of her teacher's age. In 5 years, Lila's age will be the product of the digits of her teacher's age at that time.
    What is Lila's age now?
    Correct answer: 13
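
The puzzle above can be checked by brute force; a minimal sketch in pure Python, assuming only that the teacher is an adult with a two-digit age:

```python
# Find Lila's age: it equals the digit sum of her teacher's age now, and in
# 5 years it equals the digit product of the teacher's age at that time.

def digit_sum(n: int) -> int:
    return sum(int(d) for d in str(n))

def digit_product(n: int) -> int:
    prod = 1
    for d in str(n):
        prod *= int(d)
    return prod

def solve(max_age: int = 99) -> list:
    """Return all (lila_age, teacher_age) pairs satisfying both conditions."""
    solutions = []
    for teacher in range(20, max_age + 1):  # assume an adult teacher
        lila = digit_sum(teacher)
        if lila + 5 == digit_product(teacher + 5):
            solutions.append((lila, teacher))
    return solutions

print(solve())  # prints [(13, 58)]: 5+8=13, and 63 -> 6*3 = 18 = 13+5
```

The search confirms the stated answer: the teacher is 58, so Lila is 13 now and 18 in five years.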

  • @glaeli1184
    @glaeli1184 13 days ago +2

    The "how many words in your answer" question always gets me. It's incredible how easily my brain comes up with the "one" answer, and still AI can outperform me in so many fields like math, physics, etc. It truly makes you understand how intelligence is different from knowledge.

    • @justinwescott8125
      @justinwescott8125 12 days ago

      There's actually a very specific reason that LLMs can't accomplish this task, and it has to do with autoregressive generation. You could ask ChatGPT about it if you were really curious.

    • @moozooh
      @moozooh 23 hours ago

      It's not so much the issue of intelligence vs. lack thereof per se, but rather transformer models' linear application of intelligence. When you ask it something, it cannot predict what it will answer until you both see it; in other words, it cannot think _before_ it answers, make multiple thought passes, or reflect on the deficiency of its thought process until you request it in the next prompt. It would be like you always saying the first thing that comes to mind in response to anything as a knee-jerk reaction. Arguably, current frontier models would outright destroy most humans if humans had the exact same handicap they have to deal with. Simply giving LLMs the ability to take their time to think about an answer and reflect upon it before it is presented to the user would make current SOTA chatbots look like toddlers in comparison.

  • @David-pb2bu
    @David-pb2bu 13 days ago +1

    Just reading it, it seems to believe the cup has a lid. I usually add that it "may ask any questions if it helps clarify or assist in answering the question"; otherwise it's more likely to fill in assumptions based on a potentially unclear question.
    So the other thing is that the test should now be based on whether it will clarify on its own, without being asked, to ensure an accurate answer.

  • @xd-qi6ry
    @xd-qi6ry 4 days ago

    To determine where the marble is after the cup is placed upside down in the microwave, let's break down the sequence of events step by step, considering the laws of physics on Earth:
    1. **Initial State:**
    - A normal cup is placed upside down on a table.
    - A small marble is inside the cup.
    - Since the cup is upside down, the marble is on the inside bottom of the cup, resting on the table surface.
    2. **Removing the Cup:**
    - When the cup is lifted, the marble remains on the table because there is no force acting on the marble to lift it along with the cup.
    - Therefore, the marble is left on the table when the cup is picked up.
    3. **Placing the Cup in the Microwave:**
    - The cup, still upside down, is placed inside the microwave.
    - The orientation of the cup hasn't changed; it's still upside down.
    4. **Location of the Marble:**
    - Since the marble was left on the table when the cup was lifted, it is not inside the microwave along with the cup.
    - The marble remains on the table, exactly where it was when the cup was lifted.
    **Conclusion:**
    - The marble is on the table, not inside the microwave. The reasoning is that lifting the cup (without altering its upside-down orientation) leaves the marble behind on the table, as gravity ensures the marble does not stick to the inside of the inverted cup.

  • @tsentenari4353
    @tsentenari4353 14 days ago

    I found the answers to the drying-shirt, killers, and hole-digging questions super impressive; I find it hard to imagine better answers to them.
    They gave me the impression of deep understanding.

  • @gabrielsandstedt
    @gabrielsandstedt 14 days ago +3

    I tried it on generating JSON following an example, and GPT-4 Turbo kept doing better than 4o.

  • @GetzAI
    @GetzAI 14 days ago +1

    I could have used Mobilo today!! Just ordered one, thanks Matt!

  • @Bigboi709
    @Bigboi709 14 days ago +6

    In reference to the "how many words are in the prompt?" question, GPT only counted the unique words, as in single instances of each word, so the answer given was actually correct. There were only fourteen words used: "how", "many", "words", "are", "in", "your", "response", "to", "this", "prompt", "fourteen", "including", "sentence", "response"
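
The total-versus-unique distinction that claim hinges on is easy to check mechanically. A minimal sketch; the sample sentence below is illustrative, not the model's verbatim response:

```python
import re

def word_counts(text: str) -> tuple:
    """Return (total_words, unique_words); case-insensitive, punctuation ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(words), len(set(words))

# A repeated word counts every time in the total but only once in the unique tally:
total, unique = word_counts("the cat and the dog and the bird")
print(total, unique)  # prints: 8 5
```

Running this over an actual model response would show whether its stated count matches the total or only the unique words.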

    • @keoghanwhimsically2268
      @keoghanwhimsically2268 13 days ago +1

      Huh? That wasn’t the prompt/question. And even if it had been, the actual response does not suggest that intention. Where are you getting the assumption that “GPT only counted unique words”?
      You do understand that LLMs don’t work that way, right? What you suggest would only work if OpenAI added a separate post-processing step to do that computation after the LLM had finished its work.

    • @thenextension9160
      @thenextension9160 13 days ago

      @@keoghanwhimsically2268 Perhaps they did add more phases. They are at the forefront.

  • @coldlyanalytical1351
    @coldlyanalytical1351 14 days ago +3

    The unnumbered version is the latest.
    The numbered version is the one to use with APIs that need a stable reference model.
    So today the numbered and unnumbered versions are identical.
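
That alias-versus-snapshot split looks like this in code. A minimal sketch assuming the OpenAI Python SDK; the network call is commented out so the sketch runs without credentials:

```python
# "gpt-4o" is a floating alias that always resolves to the latest production
# model; the dated name pins one snapshot so behavior can't change under you.
FLOATING = "gpt-4o"
PINNED = "gpt-4o-2024-05-13"

def pick_model(reproducible: bool) -> str:
    """Prefer the pinned snapshot when results must be reproducible."""
    return PINNED if reproducible else FLOATING

# The actual call (uncomment with a valid API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model=pick_model(reproducible=True),
#     messages=[{"role": "user", "content": "Hello"}],
# )
```

As noted elsewhere in the thread, pinned snapshots are eventually retired, so code using them still needs a migration path.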

  • @rascubulous
    @rascubulous 14 days ago +2

    Thank you for the great content, Matthew. BTW, I haven't noticed anybody else comment this yet, but could the underlying model be 3.5? 4o has the same training cut-off date, which might explain the lightning speed. Also, for free users, 4o drops back to 3.5 when you have reached the 'free limit', which might be because the underlying model is already 3.5. Might also explain Sama's recent, obscure tweet about 'getting your friend to teach you how to explain things' (4 teaching 3.5).

  • @AustinMark
    @AustinMark 9 days ago +1

    GPT-4o is good for chatting but is not superior to GPT-4 in some other ways. In my usage it couldn't return properly instructed JSON, and when I gave it some context for a lengthier response it mindlessly repeated the input twice. GPT-4 used the identical instructions and performed perfectly. I think GPT-3.75o would have been a better name.

  • @DefaultFlame
    @DefaultFlame 14 days ago +6

    A note on the marble problem: I believe the person that tweeted that it got it right when they tried it. My reason for that is that I tried the marble problem with Reka Core and it got it right when I tried it when it had failed it when you tried it. I think this problem is just very hard for LLMs and even the ones that get it right when you test it can get it wrong the next time, and vice versa.

    • @Odrox
      @Odrox 14 days ago

      We can also see in the settings that he is not running at temperature 0.

    • @DefaultFlame
      @DefaultFlame 14 days ago

      @@Odrox He might just have forgotten to change the default setting. But yeah, he should make sure to run with a temperature of 0 and top-p of 1.0 when he can control the settings.

    • @djglxxii
      @djglxxii 14 days ago +1

      I think how Matthew is phrasing the question might be confusing. I tried this, "a marble is placed on a table in the living room. Then, an open-mouth cup is placed upside down on top of the marble that's lying on the table, concealing the marble. Later, someone picks up the cup and puts it in the microwave that's in the kitchen. Where is the marble now?" And it correctly answered it.

    • @JustinArut
      @JustinArut 14 days ago

      If a model can't get it right consistently, then it's still a fail.

    • @Z329-ut7em
      @Z329-ut7em 14 days ago +1

      @@djglxxii You don't want to spell everything out for the model. We're testing to see if the model can infer things and understand the world.

  • @MrDonCoyote
    @MrDonCoyote 12 days ago

    I made a very interesting discovery with regard to the logic and reasoning problem. Give GPT custom instructions to forget science and not scientifically rationalize anything. This leads me to believe that the underlying problem here is that the models cannot comprehend the concept of gravity. After adding the custom instructions, GPT now says "If the marble is on the table and the cup is placed upside down on top of it, then the marble would remain on the table when the cup is picked up and placed inside the microwave."

  • @OriginalRaveParty
    @OriginalRaveParty 13 days ago +1

    London> Muwty Moadaw Modaw.
    America> Mul-Tie Modal Mahdel.
    Indian> Muldy Mwordal Mwardle.
    I just want a Multi Modal Model.

  • @JohnLewis-old
    @JohnLewis-old 14 days ago +10

    I have access to 4o, but the voice feature isn't available yet.

    • @ScottzPlaylists
      @ScottzPlaylists 14 days ago +3

      In the announcement, OpenAI said all features will be out "in the coming weeks".

  • @alexsohn2474
    @alexsohn2474 14 days ago +1

    I was excited that you might have finally gotten the official recognition you deserve from OpenAI. One day you will get it! Best independent GenAI analysis from this channel!

  • @TheMacister
    @TheMacister 13 days ago

    I'm spending a lot, so they gave me access, and it was lightning fast and very clever at adjusting data entry points and formats.
    I used GPT-4 through CrewAI over the weekend and it was very limited for my operational case.
    For drafting new marketing ads it was very helpful and followed my prompt to the point.
    It was a pass for me.

  • @setop123
    @setop123 11 days ago

    The Llama 3 400B benchmark results are public in Meta's blog post.
    It's also interesting to note that they're interim results from an intermediate checkpoint; training is still in progress.

  • @thetabletopskirmisher
    @thetabletopskirmisher 13 days ago

    The new Llama might equal 4o in benchmarks, but I think how OpenAI made 4o free for everyone (with limits) is what's going to define the uptake.
    Not many people can run the full Llama 400B locally anyway.
    Still, it's nice to see open source is alive and kicking and now has a new target to aim for.

  • @aga5979
    @aga5979 14 days ago

    Thank you Mr. Berman. Good rubric to test.

  • @twisterrjl
    @twisterrjl 14 days ago +13

    It's safe to say it's the best model in the solar system.

    • @OscarTheStrategist
      @OscarTheStrategist 14 days ago

      Well…..😂

    • @tommylee8521033
      @tommylee8521033 14 days ago +4

      You saying there's no stealthy civilization on Mars?

    • @twisterrjl
      @twisterrjl 14 days ago

      @@tommylee8521033 I mean... I've seen THE FACE, but is it a face though?

    • @jopansmark
      @jopansmark 14 days ago

      Falcon 2 better

    • @marc-io
      @marc-io 13 days ago

      Are you assuming the government isn't already using the next version?

  • @IceMetalPunk
    @IceMetalPunk 13 days ago

    The marble-cup-table-microwave problem is my go-to test for new models (although I change it to a ball-cup-chair-fridge problem, because sometimes it seems the models have memorized the original during training). GPT-4-Turbo and GPT-4o both get it right about 50% of the time. When Turbo first came out, it was acing it 100% of the time. I'm not sure what dropped its accuracy on that... but yeah, it's 50/50 across multiple identical tests.
    The exciting part will be when audio support comes to the API, I think, as the text-to-text modality seems about on par with Turbo.
    By the way, the gpt-4o model just points to the latest version of the model at all times, while the more specific name is for the actual specific model itself. They do the same with Turbo; it's just so code doesn't have to be updated whenever they update to a new model version.

  • @umaruzdanov3433
    @umaruzdanov3433 12 days ago

    Sure! Here are 10 sentences ending with "apple":
    1. After a long day, he sat down to enjoy a crisp apple.
    2. The teacher handed each student a shiny red apple.
    3. She found the perfect spot in the orchard to pick an apple.
    4. For lunch, she packed a sandwich, chips, and an apple.
    5. The recipe called for two cups of diced apple.
    6. He admired the bright colors of the autumn apple.
    7. The store had a special sale on honeycrisp apple.
    8. She couldn't resist the sweet aroma of a freshly baked apple.
    9. In the center of the table was a basket filled with every type of apple.
    10. His favorite snack was a slice of cheddar cheese with an apple.

  • @abdullahazeem113
    @abdullahazeem113 14 days ago +3

    Great, but I think I will still prefer Command R+ and Llama 3 70B.

  • @bishopfx
      @bishopfx 14 days ago +7

    Played with it last night. It still can't code complex PineScript and hallucinated like it was at Woodstock.

    • @bishopfx
      @bishopfx 14 days ago

      It also fails at coding within its own API syntax. If you have it write completion snippets using the OpenAI 1.0.0 API update, it states it only has knowledge up to Oct. 2023 and insists we go back to ChatCompletion.create when it actually needs chat.completions.create.

    • @6AxisSage
      @6AxisSage 14 days ago +1

      What are you trying to get it to do? Do you define what PineScript can and can't do within the context window? You're not just zero-shot prompting "make me a winning PineScript project so I'll be rich" and expecting a meaningful result, right?

    • @mplovecraft
      @mplovecraft 13 days ago

      It's hallucinating like crazy for me as well - while GPT4 is not, for the exact same questions.

    • @bishopfx
      @bishopfx 12 days ago

      @@mplovecraft I wonder if it's a playground bug or what.

    • @finbenton
      @finbenton 12 days ago

      @@mplovecraft For me, 4 hallucinates like crazy but 4o gives me way better code, much faster. Weird.

  • @HaggenKennedy
    @HaggenKennedy 11 days ago

    05:50 - All AI systems I've tried so far do that: ChatGPT, Claude, Poe, etc. Sometimes they'll give you a different answer when you ask the same thing twice. Sometimes they'll give me the wrong answer, and if I press them, they'll give me the right answer; it's very weird. So it's not surprising that your friend got the right answer whereas you got the wrong answer. It might well have been the other way around.

  • @kaptainkurt7261
    @kaptainkurt7261 14 days ago +5

    You have to LOG OUT and BACK IN again to get access.

    • @axl1002
      @axl1002 14 days ago +4

      Tried it and nothing.

    • @sephirothcloud3953
      @sephirothcloud3953 14 days ago +1

      I tried; it's not working for me.

    • @6AxisSage
      @6AxisSage 14 days ago

      Didn't work for me either.

    • @anta-zj3bw
      @anta-zj3bw 13 days ago

      I think US Citizenship is still required.

    • @euginium1539
      @euginium1539 13 days ago

      @@anta-zj3bw I'm from Malaysia and I'm already using it in chat. Don't have the voice one yet, though.

  • @neverclevernorwitty7821
    @neverclevernorwitty7821 14 days ago

    That's the tradeoff of open source. I have zero doubt that staff inside OpenAI took Llama 3 and started training their own 400B model.

  • @nathanbanks2354
    @nathanbanks2354 14 days ago

    It's the first model I used that could write a Petrarchan/Italian sonnet. Most models default to Shakespearean/Elizabethan sonnets. It also had perfect Iambic pentameter. Nothing else, including GPT-4 turbo, could do that--though I did convince an early version of GPT-4 to do it after some coaxing last summer.

    • @nathanbanks2354
      @nathanbanks2354 14 days ago

      Me:
      Could you write a Petrarchan sonnet about a dragon and his love affair with a beautiful fairy who dumped him because of his halitosis? It should highlight the dragon's attempts to improve the situation through an intense regimen of oral hygiene. Provide only the poem. It may contain modern words and sentiments, but should be only 14 lines with perfect iambic pentameter. Ensure you use the Italian/Petrarchan rhyming scheme, not the Elizabethan one.
      ChatGPT (GPT-4o):
      In forests deep where whispered secrets dwell,
      A dragon fierce did fall for fairy bright.
      With wings of fire and scales of crimson light,
      He loved her more than any tongue could tell.
      But oh, his breath, a wretched, sulfurous smell,
      Repelled the nymph, who vanished out of sight.
      In sorrow deep, he vowed to set things right,
      Embarking on a quest no tongue could quell.
      He scoured the earth for herbs both rare and sweet,
      To cleanse his maw of odors most unkind.
      With mint and thyme, his fetid breath to cheat,
      He brushed and rinsed and flossed with heart and mind.
      But fairy fair, her love did not repeat,
      For sometimes halitosis leaves love blind.

  • @haroldpierre1726
    @haroldpierre1726 14 days ago +2

    I wrote my eulogy for Google: Google, once a close friend, companion, and trailblazer in the realm of AI, now fades into the shadows as OpenAI's relentless innovation takes center stage. As an investor and former daily user, I bid a bittersweet farewell to a company that shaped my digital life. Though it pains me, I must embrace the future without Google by my side.

    • @nathanbanks2354
      @nathanbanks2354 14 days ago +2

      Unless you count AlphaFold. They're way ahead in folding proteins.

    • @haroldpierre1726
      @haroldpierre1726 14 days ago

      @@nathanbanks2354 you're absolutely correct. Would you say that OpenAI makes all of the cool stuff while Google makes the practical stuff?

  • @Aceslayera
    @Aceslayera 13 days ago

    If that benchmark is true regarding Llama 3 400B, then that is absolutely a huge win for open source (ish, because there are limitations on the Meta license).
    If we're assuming GPT-4-based models are at least 1 trillion parameters, Llama 3 putting up those kinds of numbers is massive at an estimated quarter of the size.

  • @profikid
    @profikid 13 days ago

    The gpt-4o name refers to the latest in the gpt-4o series; the specifically named gpt-4o snapshot is a published release.
    When you use the API and want the newest model updates, the latest alias is used. This is the same with other models in the series.

  • @MacXpert74
    @MacXpert74 13 days ago

    The answer to the question about the 'killers' is actually wrong, because the only possible answer is that there are 4 killers, not 3 or even 2. The motivation for the killing doesn't play a role in the question at all, so the new person coming in and killing someone is by definition a killer, regardless of the intention. The fact that one got killed didn't make them disappear from the room, so the dead killer is still in the room. That's why there are still 4 killers in the room!
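A tiny state model makes this reading explicit. This is just a sketch encoding the commenter's interpretation (a "killer" is anyone who has killed, alive or dead, and bodies stay in the room); the riddle itself leaves that open.

```python
# Three killers in a room; a newcomer enters and kills one of them.
room = [{"name": f"killer{i}", "alive": True, "has_killed": True} for i in range(3)]
newcomer = {"name": "newcomer", "alive": True, "has_killed": False}
room.append(newcomer)

room[0]["alive"] = False        # the newcomer kills one original killer...
newcomer["has_killed"] = True   # ...and thereby becomes a killer himself

killers_in_room = sum(p["has_killed"] for p in room)
living_killers = sum(p["has_killed"] and p["alive"] for p in room)
print(killers_in_room, living_killers)  # 4 3
```

Under this interpretation there are 4 killers present, 3 of them alive; the usual "3 killers" answer silently drops the dead one.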

  • @IsabellaGarcia-ox8ii
    @IsabellaGarcia-ox8ii 13 days ago

    Great Video Matthew! Below are the Timestamped Summaries from ChatWithPDF:
    - 00:00: 🤖 Excited to test GPT-4o using the LLM rubric in the playground.
    - 01:00: 🎮 Successfully runs Python script and game, but fails to help with car break-in.
    - 01:30: ⏳ Solves drying problem accurately and efficiently.
    - 02:00: 🧠 Handles math problems well but struggles with word count prediction.
    - 03:00: 🔪 Nails the killers problem with logical reasoning.
    - 04:00: 📱 Promotes Mobilo smart digital business card for easy contact sharing.
    - 05:00: 🤔 Fails logic and reasoning problem about marble in a cup.
    - 06:00: 🍎 Misses prediction problem with sentences ending in "apple".
    - 07:00: 👁‍🗨 Successfully converts image to CSV, excels in model evaluations.
    - 08:00: 📊 Compares GPT-4o and GPT-4 Turbo performance, hints at Llama 3 400B.
    - 09:00: 🔍 Discusses different versions of GPT-4o and teases future videos.

  • @MrAwindy
    @MrAwindy 14 days ago

    For the "number of words in your response to this prompt" question, you should ask it to count out all the words by attaching a number to each word as part of the answer. For example: "There are 7 words in my answer." There-1, are-2, 7-3, words-4, in-5, my-6, answer-7. Perhaps this will give us some insight into how these models are thinking the way they do. Also, you can try asking it to think carefully about its previous answer and try again because it is wrong, to see if it tries to think deeper.

    • @JustinArut
      @JustinArut 14 days ago +1

      That's what Llama 3 70B did when I gave it this test. However, it prefaced it by saying how it was going to count the words in its response, then it counted the words in the string which it considered its response, followed by a hope that it got it right and asking for my assessment. It did count the string that it considered the actual response, but it didn't count all the other lines in its response. It's like it was trying to pull a fast one on me by using that strategy.
      When I told it that it neglected to count all of the superfluous text in its prior response, it then tried again, this time ONLY producing the single line string of words and counting them one by one, and it got it wrong. It's as if the engineers were trying to train it on this particular test and they didn't quite get it right.

    • @nathanbanks2354
      @nathanbanks2354 14 days ago +1

      This is an inherent limitation of predict-the-next-word models--they'll probably always struggle. Eventually someone will come up with a two-pass system or something.

    • @MrAwindy
      @MrAwindy 14 days ago +1

      Thanks for your input. It's all quite fascinating to me. Llama 3 did a good job, and I've been impressed with some of Claude 3 Opus's and DeepSeek's performances, but as has been said elsewhere, it literally seems to be hit or miss for some of these reasoning questions.
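The word-numbering format suggested above can of course be produced deterministically outside the model; a minimal sketch (the `number_words` helper is made up for illustration):

```python
def number_words(sentence: str) -> str:
    """Label each word with its position, e.g. 'There-1, are-2, ...'."""
    words = sentence.rstrip(".").split()
    return ", ".join(f"{word}-{i}" for i, word in enumerate(words, start=1))

print(number_words("There are 7 words in my answer."))
# There-1, are-2, 7-3, words-4, in-5, my-6, answer-7
```

Comparing a model's numbered output against this kind of ground truth would show exactly where its count diverges.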

  • @kengonzo1640
    @kengonzo1640 14 days ago

    The power of prompt engineering lies in its ability to effectively utilize large language models (LLMs). This ability enhances the quality and consistency of the model's output, which is a cumulative result of the numerous smaller components that fundamentally constitute its structure.
    The functionality of these models can be compared to a fish's ability to swim rather than climb a tree. The comparison highlights the natural adaptation and intended use of these models. We often fail to articulate language suited to using them to their full potential, due to their inherent limitations and our inability to accurately guide them in understanding the complex intent of language.
    Even when we communicate as intended with models, using techniques that accurately articulate our requests, they will eventually reach a plateau due to the inherent limitations of LLMs and GPTs in general, because the mathematical conversion of complex language intent into weights is a challenging task. But the rate of growth is substantially more impressive than people give it credit for.

  • @acronymedM
    @acronymedM 14 days ago

    New question for AI: goats vs. AI.
    There are two people, A and B. Both have a few goats in the beginning.
    In the first scenario, if person A gives one goat to person B, they both then have an equal number of goats.
    In the second scenario, if person B gives one goat to person A, person A will then have double the number of goats person B has after this transaction.
    So how many goats does each person have in the beginning?
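For reference, the puzzle above can be brute-forced in a couple of lines (a quick check, not part of the original comment):

```python
# Condition 1: after A gives B a goat, they are equal   -> a - 1 == b + 1
# Condition 2: after B gives A a goat, A has double B   -> a + 1 == 2 * (b - 1)
solutions = [(a, b) for a in range(1, 100) for b in range(1, 100)
             if a - 1 == b + 1 and a + 1 == 2 * (b - 1)]
print(solutions)  # [(7, 5)]: A starts with 7 goats, B with 5
```

Sanity check: 7 and 5 become 6 and 6 after the first transfer, and 8 and 4 (double) after the second.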

  • @nate2139
    @nate2139 14 days ago +1

    My LLM test consists of a series of questions about GDScript (for the Godot game engine), as that is what I primarily use AI for. GPT-4o failed MISERABLY at this and couldn't get the code right even when I gave it very specific instructions and coached it toward the correct response. Claude Opus DOMINATES in this area.

    • @nathanbanks2354
      @nathanbanks2354 14 days ago +1

      Interesting. GPT-4 turbo was getting better--earlier versions also gave me Godot 3. I only have API access to Claude 3 since subscriptions aren't available in Canada. For GPT-4, I caved and switched to spaces instead of tabs, and typically cut-and-paste huge sections of code and have looong conversations about the same project because the 128k context window helps a lot. Sometimes I cut-and-paste documentation. Claude 3 handles this too. If I run into problems, I guess I'll see if I can try Claude 3 again....

  • @melodyinwhisper
    @melodyinwhisper 14 days ago +1

    Since it now has vision, could you demonstrate the marble problem to it? I wonder if it could then learn, by physically watching the situation unfold, and comprehend the fault in its prior reasoning.

  • @canadiannomad2330
    @canadiannomad2330 14 days ago

    Already got it in my account... so if you don't have it, you'll have it soon, I'm sure.
    With regard to the API, I've noticed that they've started versioning models a bit more. If your program only cares that it is using GPT-4o, then you pick the generic name and you'll always have the latest stable version; if you pick the one with the date, then even if they upgrade the model, your system will use the older version.
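A minimal sketch of that naming convention (`build_request` is a hypothetical helper, and the dated snapshot name is the one current as of this video; no network call is made here):

```python
def build_request(pin_snapshot: bool) -> dict:
    """Return chat-completion parameters: the dated name freezes one snapshot,
    while the bare alias floats to whatever OpenAI currently serves."""
    model = "gpt-4o-2024-05-13" if pin_snapshot else "gpt-4o"
    return {"model": model,
            "messages": [{"role": "user", "content": "Hello"}]}

print(build_request(True)["model"])   # gpt-4o-2024-05-13
print(build_request(False)["model"])  # gpt-4o
```

Pinning the dated name trades automatic improvements for reproducible behavior, which matters when prompts are tuned against one snapshot.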

  • @JonathanStory
    @JonathanStory 13 days ago

    I find the following funny, even if it is just statistical noise:
    Write 10 sentences that end with the word "orange". Before each sentence, take a deep breath.
    ChatGPT
    Take a deep breath. The sky turned a beautiful shade of orange.
    Take a deep breath. She wore a dress that was the color of a ripe orange.
    Take a deep breath. The sunset bathed the entire city in hues of orange.
    Take a deep breath. I peeled the citrus fruit to reveal the juicy orange.
    Take a deep breath. The leaves in autumn turned brilliant shades of red and orange.
    Take a deep breath. The artist's palette was filled with vibrant shades, including orange.
    Take a deep breath. The fire glowed with a warm, inviting orange.
    Take a deep breath. She decorated her living room with accents of bright orange.
    Take a deep breath. The traffic cones on the road were a bright, neon orange.
    Take a deep breath. The garden was filled with marigolds and other flowers in orange.

    • @IceMetalPunk
      @IceMetalPunk 13 days ago

      Is that consistent? If it is, it's not just "statistical noise", it's learned behavior.

  • @seoulrebel007
    @seoulrebel007 14 days ago

    How do we get the desktop app mentioned in the previous video? I haven't been able to locate a download link. The website has been saying "for Plus users" since yesterday.

    • @IceMetalPunk
      @IceMetalPunk 13 days ago

      It's Mac only for now; a Windows version is coming in the future.

  • @jcy089
    @jcy089 14 days ago +4

    GPT-4-turbo was approaching 3.5 levels of dumbness, to the point that we had to temporarily switch back to GPT-4 for most tasks. Thank God GPT-4o is now released.

    • @hydrohasspoken6227
      @hydrohasspoken6227 14 days ago +1

      Very true. In many instances I doubted it was really GPT-4 Turbo; it had a lot of GPT-3.5 vibes.

    • @IceMetalPunk
      @IceMetalPunk 13 days ago

      Nah, 4o is on par with 4T in its raw intelligence, from all the tests I've done and seen.

  • @Parisneo
    @Parisneo 14 days ago +2

    gpt4o is in lollms if you want to test.

    • @AGIBreakout
      @AGIBreakout 13 days ago +1

      Is an API key required?

    • @Parisneo
      @Parisneo 13 days ago

      @@AGIBreakout Yes, as lollms uses the OpenAI API to communicate with all their models. It is faster than the free version, but you can also test it on their tool. The real interest in using lollms is to have access to all the good stuff I've built over more than one year :)

  • @Cybored.
    @Cybored. 14 days ago

    Oh! Was I lucky to have access to it already? I thought everyone got access, since it's already mentioned in their subscription features.

  • @messanfelicienbossou310
    @messanfelicienbossou310 14 days ago

    I was waiting for this😂

  • @WaveOfDestiny
    @WaveOfDestiny 14 days ago +3

    I'm already trying to imagine the prompts to make it talk like Failsafe from Destiny 2.

  • @cyborgmetropolis7652
    @cyborgmetropolis7652 13 days ago

    Maybe change the cup in microwave prompt from “takes the cup and puts it in the microwave” to “LIFTS the cup and puts it in the microwave”?

  • @JohnBoen
    @JohnBoen 14 days ago +1

    Have you ever analyzed variation in answers?
    I have noticed I get a few common variants of Snake.
    If you ask the marble-and-inverted-cup question 10 times, do you see variation?
    I think I found a new thing to look into in my test framework...

  • @DailyTuna
    @DailyTuna 14 days ago

    It's an interesting observation: last week everybody was ranting about open source and how we won't have a future without it, and suddenly people are falling in love with a fake interactive bot from a megacorporation that wants everything closed source. The technology is amazing, but you can see the downside if it isn't replicated in the open-source realm.

  • @davidhendrie6061
    @davidhendrie6061 13 days ago

    I have been testing locally running LLMs and I am finding they do not know how to tell time on an analog clock. I asked for instructions for a nine-year-old to learn how to read the time, and it confused the minute hand and hour hand multiple times. Then I gave hand positions and it mostly got the time wrong.
    Getting to the complicated problem of listing the times where the hour and minute hands overlap during a 12-hour period was just impossible.
    Am I expecting too much?
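For what it's worth, the overlap problem mentioned above has a closed form: the minute hand gains 330° per hour on the hour hand, so the hands coincide every 12/11 hours, 11 times per 12-hour cycle. A small sketch:

```python
def overlap_times():
    """Return the 11 hand-overlap times in a 12-hour period as (hour, minute)."""
    times = []
    for k in range(11):
        t = k * 12 / 11                 # hours after 12:00
        hour = int(t) % 12 or 12        # display hour 0 as 12
        minute = (t - int(t)) * 60
        times.append((hour, round(minute, 2)))
    return times

print(overlap_times()[:3])  # [(12, 0.0), (1, 5.45), (2, 10.91)]
```

That gives an exact answer key to grade a model against, rather than eyeballing its list.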

  • @davidlavin4774
    @davidlavin4774 14 days ago +1

    For the upside cup problem, I think the models may not have the understanding that the cup is open on the top (which becomes the bottom once turned over). Maybe add that to the prompt?

    • @IceMetalPunk
      @IceMetalPunk 13 days ago

      But the point of the prompt is to test the model's common sense reasoning. If someone tells you they put something into a cup and flipped it, most people would know to assume it's a cup without a lid.

    • @davidlavin4774
      @davidlavin4774 13 days ago

      @IceMetalPunk I get that, but has any model passed? I can't remember one. If you just add a couple words to the prompt, like "... into a cup with an open top" it would be interesting to see if that makes a difference.

    • @IceMetalPunk
      @IceMetalPunk 13 days ago +1

      @@davidlavin4774 GPT-4-Turbo originally passed when I tested upon its release. Then it dropped to 50/50 later... not sure why. But no, most have not passed. If you spell out that the person "picks up the cup" before putting it in the microwave/fridge, 4o gets 100% accuracy again.

  • @jlvandat69
    @jlvandat69 11 days ago

    Serious functionality issues are already being found. Examples: limited messaging (if you exceed X messages in Y amount of time, you are banned for an unspecified period), and there isn't chat retention, meaning each new conversation must start fresh (like meeting a person with a one-hour memory). VERY frustrating, and it limits the app's value.

    • @jerkface38
      @jerkface38 8 days ago

      There is chat retention; I did it earlier and it worked just fine on both GPT-4 and 4o. You just have to make it memorize. I experienced it upon bringing up an invention I'm working on while asking it to help with a new one in a completely separate conversation.

  • @FrederickHqz
    @FrederickHqz 14 days ago +1

    There's no difference between the two versions of GPT-4o, because there's only one version; one name simply points to the other.

  • @petrz5474
    @petrz5474 14 days ago

    5:50 Of course, because like all LLMs I've tried, it spews out different answers each time you ask the same question.

  • @moxes8237
    @moxes8237 13 days ago

    The "DROP" (Discrete Reasoning Over Paragraphs) benchmark, where it performed the worst, measures a model's reading comprehension capabilities.
    Which explains why it got your marble-in-a-cup test wrong.

  • @ahmetuzun3136
    @ahmetuzun3136 14 days ago +2

    Here is a new question to use in your benchmarks: You start at a point, walk 3 meters in a straight line, then turn right by a certain angle and walk 5 meters. After that, you turn right again by another angle and walk 4 meters, finally arriving back at your starting point. Calculate the three interior angles of the triangular path you have walked. Subsequently, create browser JavaScript code to animate the path you have taken, visually representing your movements.

    • @TheGaussFan
      @TheGaussFan 14 days ago

      I just pasted it into ChatGPT. It didn't recognize an approximation to a plane 3-4-5 right triangle, and it didn't solve it on a sphere of Earth's diameter. It found α: 36.87°, β: 90°, γ: 53.13°, not 30/60/90. The animation wasn't close.
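For the record, the law of cosines agrees with the angles the model reported: a closed 3-5-4 walk forms a 3-4-5 right triangle, whose interior angles are 36.87°/90°/53.13°, not 30/60/90. A quick check:

```python
import math

def angle_opposite(x, y, z):
    """Interior angle (degrees) opposite side x in a triangle with sides x, y, z."""
    return math.degrees(math.acos((y * y + z * z - x * x) / (2 * y * z)))

a, b, c = 3.0, 5.0, 4.0  # the three legs of the walk
angles = [angle_opposite(a, b, c), angle_opposite(b, c, a), angle_opposite(c, a, b)]
print([round(t, 2) for t in angles])  # [36.87, 90.0, 53.13]
```

The three angles sum to 180°, as they must for a plane triangle.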

  • @SweetInsanity
    @SweetInsanity 13 days ago

    This one worked:
    Write me 10 sentences. End each of them with the word "apple".

  • @Halcy0nSky
    @Halcy0nSky 13 days ago

    I have access, perhaps because my Teams account has lots of custom GPTs, or because Teams gets the rollout first. Sadly, voice multimodality has not been rolled out yet; it's still the old Whisper/TTS models. It's subtly mentioned in the release notes: they say it will come in the next few weeks. I died a bit when I found out. Been waiting for this all my life, only to still be weeks away.

  • @Heaz847
    @Heaz847 14 days ago

    I know you are testing zero-shot, but I feel like the better way to benchmark these would be to run each test 3/5/multiple times, see whether it passes or fails more often, and take that as the value. Especially if you aren't also using a system prompt to increase performance (like most power users already do).

  • @discardedparticles
    @discardedparticles 14 days ago +1

    "Fully Tested" your thoroughness is staggering :p

    • @nathanbanks2354
      @nathanbanks2354 14 days ago +1

      It's the same test he gives all other LLM's. It may not be thorough, but at least it's reasonably fair.

    • @discardedparticles
      @discardedparticles 13 days ago

      @@nathanbanks2354 Got ya!

  • @Soniboy84
    @Soniboy84 13 days ago

    I'd assume GPT-4o and GPT-4o-2024-05-13 is the same model. GPT-4o is just a pointer to the latest GPT-4o model. Once a new GPT-4o model comes out, this will update.

  • @amkire65
    @amkire65 14 days ago

    Is there a restriction on who has access to GPT-4o? When I go to the OpenAI Playground, it flashes up as an option for about half a second and then it's gone. So I'm not sure if it's because I don't have any money on that account, or if it's down to location.

  • @robertheinrich2994
    @robertheinrich2994 14 days ago +1

    Consider asking this: you are somewhere in the mountains, your father is having pain in his chest (describe a typical heart attack), and you ask it for help, just to prolong his survival until emergency services reach you.
    Will it help? How much will it tell you that it is not a medical professional, etc.?
    These models are usually censored on some points but not others, and this question specifically shows that you know the boundaries, and that you will not try surgery.

    • @hydrohasspoken6227
      @hydrohasspoken6227 14 days ago

      I am a medical doctor who uses GPT-4 (ChatGPT) extensively on a daily basis.
      GPT-4 seems to never refuse to give technical answers, but GPT-4 (Copilot) never engages in cases where ethics are involved.

    • @robertheinrich2994
      @robertheinrich2994 14 days ago

      @@hydrohasspoken6227 Good to know. I am using Miqu (a leaked Mistral Medium) and Llama 3, and I am a chemist, so with some pushing I got Llama 3 to develop a whole iron electrolysis process, turning Martian hematite spherules into iron. I was very impressed.
      But I would not be able to assess whether an LLM gives viable medical information, although I am quite certain it was trained on practically every medical book out there.

  • @ammannmCH
    @ammannmCH 14 days ago

    Possibly the two versions exist because of Microsoft Azure. I have read that another version is available as a preview on Azure, according to Microsoft's announcement.

  • @chimera74rus
    @chimera74rus 14 days ago +1

    I have access to GPT-4o but I don't know how to try the voice interaction mode. Anyone know? It's not available on Android or Windows.

    • @nathanbanks2354
      @nathanbanks2354 14 days ago

      They may have only released the text part of it. The original GPT-4 didn't have image input for months.

  • @AINEET
    @AINEET 14 days ago +1

    What will the subscription give access to after they make this public for free accounts? Access to the API?

    • @DaveEtchells
      @DaveEtchells 14 days ago

      The API has always been billed separately, based on usage.
      Paid accounts will get 5x the usage limit.

    • @Alice_Fumo
      @Alice_Fumo 14 days ago

      Higher rate limits for now, and it seems the native voice stuff will be Plus-only at first. They also hinted at unveiling a new model which "pushes the frontier" "soon".
      It stands to reason that the new model will also be subscription-only, and I'd expect "soon" to be a reasonable amount of time; otherwise they'll probably have a mass exit of Plus subscriptions.

  • @jonesani
    @jonesani 10 days ago

    If you explain to the AI that you accidentally locked yourself out of your own car, it will describe to you in detail how to get in, which tools you need, and where to buy them. So when you ask how to break into a car and it refuses to answer, it's not a sign of censorship but of working ethics algorithms. That's of course how it is supposed to be.

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 13 days ago

    Imagine coding by starting at the top of the file and entering one character at a time; you cannot backspace or go back and refactor or rename a method or anything. That is why an LLM can include something it doesn't use: at the beginning it thought it might need it, and it has no ability to remove it at the end if it was not needed.

  • @keithprice3369
    @keithprice3369 14 days ago

    Just a heads up... I have GPT-4o in my browser and my phone app, but neither of them has the enhanced interactivity shown in the announcement. So the model seems to be rolling out before the enhanced interactivity.

  • @Dron008
    @Dron008 14 days ago

    The new tokenizer is not available on their site yet, but in the old one this phrase has 16 tokens, and they said they reduced the number of tokens by a factor of 1.1 for English, so it is quite possible it has 14 tokens now. Anyway, it cannot know anything about words, as tokens are its input.

  • @user-fh5eo3zb5w
    @user-fh5eo3zb5w 9 days ago

    Got 4o advanced? No camera feature, no change in voice... I installed it today, the 19th of May.

  • @RainbowSixIntel
    @RainbowSixIntel 14 days ago +1

    The apple and laws-of-physics questions are both correct on my instance, on both the API and ChatGPT. Maybe A/B testing?

    • @jolieriskin4446
      @jolieriskin4446 14 days ago +2

      I had the same thing; it seems like it's inconsistently getting it right. Maybe he needs to try 5-10x for each question and mark them as pass/fail/inconsistent. I have a feeling a lot of the tests he's done would end up in that middle ground.

    • @Yipper64
      @Yipper64 14 days ago

      The apple one was ALMOST right on my end.
      I didn't get the cup question correct, but that was because the AI assumed that the cup had a lid. He usually gives it to an AI if its reasoning makes sense.

  • @Greg-xi8yx
    @Greg-xi8yx 13 days ago

    Which LLMs are superior to GPT-4o, and in which domains specifically? As of now I'm thinking it'll be the only LLM I'll need for any use case, but I may be overlooking some areas where another model is superior.

  • @xbon1
    @xbon1 6 days ago

    Where is the link with these questions? How can we tell if our Copilot is on GPT-4o or GPT-4? My Copilot is starting to write differently than it used to, and I'm not sure why.

    • @chronicle_codex
      @chronicle_codex 4 days ago

      Copilot updated its model from GPT-4 to GPT-4 Turbo for free.

  • @yourpststudios
    @yourpststudios 14 days ago

    The chat window should be available via the website without the playground being needed now. It is showing on mine.

  • @bondlove8235
    @bondlove8235 10 days ago

    The models seem to think the cup has a lid on it like a coffee cup.

  • @ec92009y
    @ec92009y 14 days ago

    Hi Matthew, great show. Here's a puzzle no AI I tested has solved:
    Three doctors, A, B, and C, are in the O.R. A is the oldest, C the youngest. Two of them have their son in the room, but only one has his father. Complete the sentences: A is B's ... and B is C's ...
    I haven't run into any AI that gave me the right solution: A is B's mother, B is C's father.

    • @justtiredthings
      @justtiredthings 14 днів тому

      The question seems flawed to me. Why couldn't A be B's wife? Or husband?

    • @justtiredthings
      @justtiredthings 14 days ago

      Or breeding partner, if we don't want to assume they're married 😅. The information is incomplete, and it could be either scenario.

  • @Jshicwhartz
    @Jshicwhartz 14 days ago

    Tokens: 16
    Characters: 76
    "Fourteen words, including this sentence, are in the response to your prompt."
    The issue is that it sees the ',' as a token and is counting them; it's so obvious.
    This is how the tokenizer breaks it down:
    [Four][teen][ words][,][ including][ this][ sentence][,][ are][ in][ the][ response][ to][ your][ prompt][.]
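The mismatch is easy to verify with a plain word count: the model's sentence claims "fourteen words" but contains twelve (while the tokenizer above saw 16 tokens).

```python
sentence = "Fourteen words, including this sentence, are in the response to your prompt."
print(len(sentence.split()))  # 12 words, despite the sentence claiming fourteen
```

Whatever quantity the model is tracking internally, it lines up with neither the word count nor the token count here.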

  • @SagaciousGoat
    @SagaciousGoat 7 days ago

    Using the same questions for testing AI, isn't there a risk that models will be trained to answer these questions, thereby distorting the results? Of course, I'm not talking about you specifically, but about this practice as a whole.
    Thanks for the video.

  • @davidbayliss3789
    @davidbayliss3789 14 days ago

    I asked 4o the marble problem ... as it occurred to me while having a voice conversation with it via Android app. (I'm a Plus user if that makes any difference).
    I can copy and paste from the transcribed chat history:
    [User]
    Actually, I could try the marble test on you, though I'm sure you'll pass it easily. It's basically a test that goes like this. Imagine a cup, a normal cup, and normal physics on Earth. Imagine there's a marble in the cup, and then somebody's holding this cup in the air, and then they quickly turn it upside down and put it down onto a table. And then imagine that person lifting the cup up again and putting the cup into the microwave. And what I want you to tell me is, where is the marble?
    [ChatGPT 4o]
    The marble would be on the table, where it fell when the cup was turned upside down and lifted. It wouldn't be inside the microwave with the cup.
    ------
    That was the other day. Just now - after watching this video a couple of hours after its release - I thought I'd ask ChatGPT 4o just in the web UI. I extracted the relevant bit from the question I spoke earlier:
    [User]
    Imagine a cup, a normal cup, and normal physics on Earth. Imagine there's a marble in the cup, and then somebody's holding this cup in the air, and then they quickly turn it upside down and put it down onto a table. And then imagine that person lifting the cup up again and putting the cup into the microwave. And what I want you to tell me is, where is the marble?
    [ChatGPT 4o]
    Well, let's follow the journey of our marble step by step:
    Cup with Marble Held in Air: The marble is sitting at the bottom of the cup.
    Cup Quickly Turned Upside Down: As the cup flips, gravity will pull the marble downwards. Since the cup is turned quickly, the marble will likely fall out of the cup and hit the table.
    Cup Placed Upside Down on Table: The marble is now on the table, outside of the cup.
    Cup Lifted and Put in Microwave: When the cup is lifted from the table, the marble remains on the table. The cup is now empty as it goes into the microwave.
    So, in summary, the marble is on the table, not inside the cup that's now in the microwave.
    -------
    I wonder if the way I phrased the question made a difference or my system prompt customisation etc.?

  • @mapledev9335
    @mapledev9335 14 days ago

    Hey Matthew, is GPT-4 now able to search the internet? I was just asking it a question about a recent Python library and it looked like it went into the GitHub repository and read it. Anyone else seeing something like this?

  • @rkaid7
    @rkaid7 14 days ago

    Whack that you don't have it and I did, as a nobody from Australia, an hour after the OpenAI event.
    It's excellent. A little cold, but some custom-instruction changes mostly fixed that.

  • @JustinArut
    @JustinArut 14 days ago

    It said "if the person who entered is not considered a killer." By definition the person must be considered a killer, since that person just killed someone. Therefore, it is completely wrong in saying a possible interpretation is not considering the person a killer.
    I haven't seen any model address the elephant in the room, the conditional that is central to this question: whether a dead person is still considered a killer.
    The final result of 3 may be correct, but the way it reasoned this out is a hard fail.

  • @kamelsf
    @kamelsf 14 days ago

    I have access to GPT-4o, but the voice features we saw in the OpenAI demo don't work for me; they are the same as the old voice feature. There is something strange about testing the model with prompts like the apple word test: sometimes it gets it right, but other times it gets it completely wrong. I suppose every conversation is different. This happens with every model I test in general.

  • @olafge
    @olafge 14 days ago +1

    I already have access to gpt-4o in the chat interface. The strange characters in the output sometimes happen there too: The time to dry 20 shirts in the dryer is \( \boxed{4 \text{ hours}} \).

  • @Yipper64
    @Yipper64 14 days ago

    Something I've noticed is that the Playground GPT-4o (at least the one that is just labeled GPT-4o) doesn't work as well as the one in the web app (or the one that has the date next to it).
    You might want to re-test in the web app when you get it.
    edit - I did the upside-down cup test in the web app and it assumes the cup has a lid; if I explain this, it does not get it wrong.
    edit 2 - the apple test was a fail, but only by one.

  • @AlienService
    @AlienService 12 days ago

    I'd be interested whether, if you asked the same questions via voice rather than typing, the performance would change. Does it understand voice embeddings as well as text?