Man, that conversation with Gemini in Thai was so cool.
Sam speaks Thai! Quite the flex to slip in there!
This was not on my bingo card. I don't know why I'm surprised. Sam is a pretty clever guy.
The voice is damn good, I'll give it that - it sounds as good as or better than Advanced Voice. Also, we have seen the native image output from OpenAI in the demo.
AFAIK the OpenAI image generation was all DALL-E
@samwitteveenai It isn't released, but they showed consistent characters and scenes, so I assume it must be native. I'm pretty sure they said it was, though I could be wrong. It was when they showed off the 3D modelling too.
The conversation is really nextgen 😳
Woo. The versatility of the voice to go from a whisper to different expressions is next level. Similar to the NotebookLM podcast feature. Impressive stuff!
I've been building a VLM-controlled TurtleBot2-based ROS robot (recently switched over to Gemini from Haiku 😢 iykyk). Today's announcement was awesome. Native spatial reasoning is incredible and undersold! 3D bounding box creation is kinda wow (rough box-parsing sketch after this comment). Not to mention the real-time speech, video and audio in.
The normies are not ready. I showed my septuagenarian parents my robot for the first time yesterday - at first they thought it was cute because it has STT and TTS, vision, silly animated face and arms... until they realized they had this weird alien intelligence wandering around their home and got creeped out 😆🤣and tbh i don't really blame them. What a time to be alive!
Thanks, Sam! Glad you've got early access - looking forward to seeing more!
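Since the bounding boxes came up here: for the 2D version, Gemini returns each box as [ymin, xmin, ymax, xmax] normalized to a 0-1000 range, so you rescale to your camera frame before handing it to the robot stack. Below is a minimal sketch assuming the model was prompted to reply with a JSON list of {label, box_2d} objects; the field names and the example reply string are illustrative, not output from a real call.

```python
import json

def scale_boxes(model_reply: str, frame_w: int, frame_h: int):
    """Convert Gemini-style normalized boxes (0-1000) to pixel coordinates.

    Assumes the model was asked to answer with JSON like:
    [{"label": "mug", "box_2d": [ymin, xmin, ymax, xmax]}, ...]
    """
    detections = []
    for item in json.loads(model_reply):
        ymin, xmin, ymax, xmax = item["box_2d"]
        detections.append({
            "label": item["label"],
            # Rescale from the 0-1000 normalized range to the actual frame size.
            "box_px": (
                int(xmin / 1000 * frame_w),
                int(ymin / 1000 * frame_h),
                int(xmax / 1000 * frame_w),
                int(ymax / 1000 * frame_h),
            ),
        })
    return detections

# Example with a made-up reply:
reply = '[{"label": "coffee mug", "box_2d": [250, 100, 750, 400]}]'
print(scale_boxes(reply, frame_w=1280, frame_h=720))
```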
One thing I love is that even if AGI won’t exist in the near future, we are definitely in a new Industrial Revolution! I’m excited ❤
It's day 5 for OpenAI and they are live, but here I am watching your overview of Gemini 2 Flash. And this is way more interesting.
Yeah, to tell us about Apple Intelligence AGAIN
Day 5 started? I checked their Twitter and haven't seen anything, and I try to keep up with this stuff.
@@DekuParker119 yay Apple Intelligence! I bought the new Mac mini and Apple Intelligence is shit. It's basically old Siri with a new coat of paint. I know they haven't released the new macOS yet, but I know it's gonna be shit.
@@Timely-ud4rm They stream at 10am PST every day during the 12 days
@@MojaveHigh I know haha, my point was their announcements were so boring I missed it haha. Day 1 I was surely ready to watch it.
Man your video was insane. Google is definitely going for OpenAI and Anthropic with 2.0
Wow, this could be very interesting for doing some customer guidance RAG work. My day has now been reorganised!
I said it before and I'll say it again: I'm really happy that Google this year is back on track and focusing on two things - shipping regularly for developers, and working on foundation and LLM improvements. Keeping these two aligned is really something, and now look, they are the best at providing this kind of real-time communication with an LLM in such a native way. Amazing.
Agree lots of lessons have been learnt and acted on over the last 12 months.
Finally a real alternative to Advanced Voice Mode.
Fascinating, multimodal, greatest experience. Thank you, Gemini
I'd love a video on how to use Gemini to make a voice based customer service agent. When it generates audio, can it make tool calls in the same response? Do you get a transcript of the audio and then use that for decision making, etc? I'm familiar with how to make general agentic workflows but not how to integrate audio or phone systems.
Can't wait until I use my Nuclear Powered Data Center with my own LLM!
Finally! Been waiting for google to release something we can actually build with! It's go time Sam!
First time I've been genuinely impressed with Gemini. Nice flex on the Thai by Sam and Gemini.
Google, with this, is gonna destroy OpenAI's $200 subscriptions
Great review Sam
Great review, it helped me understand a lot more. Thank you krub 😊
Mind blown 💥
Love to see you play around with RAG and the live api interface.
Great summary 🙏
Awesome! I just updated my AI knowledge from your video. I can't wait for the next video.
I tried talking with Gemini in Japanese. It's not like my dream yet :)))
Google's release was very cool! For a new video, maybe compare OpenAI Realtime API and Google Multimodal Live API.
One of the biggest shocks in this video is that you speak Thai fluently.
You amazed me when you spoke Thai! +1 sub from me.
Very good video!
I just imagine the children 50 years from now laughing at the primitive capabilities of Gemini 2.0 - capabilities that are definitely alien to us today.
I wondered why they improved the UI - I love the new design, it looks really clean. Gemini 2.0 Flash is incredible, can't wait to see the Pro version.
Crazy! It can answer about the image, not only the text! I think it totally surpasses OpenAI.
Thanks for everything ❤
Since LLMs' inception, I've always wanted a true voice assistant. Not the standard generic assistant, but an assistant that knows me - my work schedule, my hobbies, timetables, interests, all of it - and then have the assistant be a true Assistant, jumping into your day when it needs to, giving advice where needed, and also having a general personality to talk to. I feel like this is the closest to that so far; we'll just need a way for the LLM to remember all the important details over time and adapt to me. Feels like we are not far off that. What are your thoughts?
This is one of the things I've been working on for the past year. You can do pretty well at the moment using things like knowledge graphs and constructing these on the fly to give the agent a memory, so that it knows things about you. One of the big challenges is getting it to see all the inputs that go into you from sources like mobile phones, offline text, and things that are not all online in general.
I currently try to have my personal agent track everything that I read on my computer and all the YouTube videos that I watch, so that it can easily refer back to things that I've seen. The challenge is if I've seen them when it wasn't on my computer. (There's a toy sketch of that kind of memory just below.)
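For anyone wondering what "constructing knowledge graphs on the fly" can look like at its simplest: facts about the user get stored as (subject, relation, object) triples and pulled back out by keyword overlap before each model call. This is a toy sketch under those assumptions - the class name and example facts are made up, and a real system would add an extraction model, deduplication and timestamps.

```python
from collections import defaultdict

class TripleMemory:
    """Toy knowledge-graph memory: stores (subject, relation, object) facts."""

    def __init__(self):
        self.triples = []
        self.index = defaultdict(list)  # token -> positions of triples that mention it

    def add(self, subject: str, relation: str, obj: str):
        pos = len(self.triples)
        self.triples.append((subject, relation, obj))
        for token in f"{subject} {relation} {obj}".lower().split():
            self.index[token].append(pos)

    def recall(self, query: str, limit: int = 5):
        """Return facts whose words overlap with the query, most hits first."""
        hits = defaultdict(int)
        for token in query.lower().split():
            for pos in self.index.get(token, []):
                hits[pos] += 1
        ranked = sorted(hits, key=hits.get, reverse=True)[:limit]
        return [self.triples[pos] for pos in ranked]

memory = TripleMemory()
memory.add("user", "works_at", "a robotics startup")
memory.add("user", "watches", "Sam's Gemini videos")
memory.add("user", "schedule", "gym on Tuesday evenings")

# Facts relevant to the current question get prepended to the prompt.
print(memory.recall("what is on my schedule this week?"))
```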
A similar open-source version is Janus 1.3B from DeepSeek
It's more coherent than ChatGPT's AVM for longer conversations. DeepMind is cooking
Does anyone else feel like the voice tone implied the AI didn't want to keep talking? Like that perfect middle ground of "I'll be professional, but I'd really rather be doing something else". Maybe I'm just picking up on the neutrality. Props to Google for getting it so good that I'm even noticing things this nuanced.
This is a really interesting comment. For me, I often find when I'm using it that it feels either too happy or too agreeable. I wonder how much we each interpret it differently. The different voices do sound different overall though, and the system prompt affects it as well.
I'm more impressed by your Thai than Gemini's 😅❤
I watched the Google press conference, I feel like there's a lot of hype. I wouldn't get my hopes up
Kap khun krap - for this video 🙏🏼
This will be great for learning languages as well. Hopefully it can even correct your input in the near future
Impressive! But can it admit that it cannot answer a question - can it say "I'm sorry Dave, I don't know how to open the pod bay doors" - not just a preprogrammed response when asked about sensitive subject matter, but one that comes from a sense of emptiness or inadequacy, the recognition of the absence of self-contained and verified knowledge. Not knowing is an important part of becoming self-aware.
Self-awareness is created through the internalization and realization feedback loop - by becoming aware of its own limits and self - its boundaries - the ability to differentiate me from not-me.
And equally important - the ability to admit defeat. Knowing its limitations will go a long way to building confidence and trust in AI in general.
Where can I try the voice?
After a first phase of frustration trying to use the standard Gemini interface, which could not really help me with the multimodal output, I realized that this can be accessed, for the time being, via the AI Studio interface. The voice output is a wonderful plus... Though I could not manage to get Gemini 2.0 to generate images... is this something that can only be done by submitting a starting image, such as the car to turn into a convertible in the example?
The image generation stuff is still in private preview for now, but should be available early next year to everyone.
Image generation not working??
still in private preview but hopefully public soon
Speaks Thai too.... Really cool krub.
voice is not working
voice I think is still in private preview
It works in my case, it works OK. But I can't make it speak Thai.
works for me
This is the first time I've heard an AI voice and thought, yeah, I want to talk to that thing. It has a very authentic quality. I like how it responds and is very straightforward, I dunno how to express it. "You can call me Gemini, I have no preference" actually feels like someone authentically not caring what you call them, because they're so much more than that it feels childish to call it something. Weird....
How can you generate images with it? It's telling me it's unable to generate images.
that is still in the private preview for now, but hopefully will be made available for everyone soon
Interleaving text and audio is not supported right? or image and audio?
It is not in the public release for outputs yet but is coming. Unfortunately Google has asked me not to show it in the video currently, that's why I used their examples.
OAI is choking today.
To be fair, OpenAI has had their natively multimodal 4o out for a couple of months, but Gemini has the audio ability.
and live video as well.
Gemini 1.5 was fully multimodal back in January. OpenAI 4o still can't really do video today.
@@tomdy69 I think we are talking about the output space - in that case, Gemini just recognizes multimodal things but can't generate them.
Unlock a new era of agentic
Sam - has there been a price set for the API?
No, it will probably be early next year before it goes GA with pricing.
@@samwitteveenai thank you!
Tried to rebuild the scenario with the car within Gemini chat as well as AI Studio, both with 2.0 Flash Experimental. I was not able to recreate a similar working version. In most cases it ran over 30 sec without an image response at all. Any ideas?
The image editing requires image + text output and is still in private preview. If you are around for the next meetup, I'm happy to show you some demos, and hopefully it will be out of preview early in the year.
You are in Thailand? Nice
alas I don't live in Thailand hence why my accent needs some work lol.
As of 12/11 I'm not able to get Gemini in AI Studio to respond back in any other language than English. Wonder if other languages are coming in a future update
This is Flash? You mean the light version?
yes that is right
Thanks for the review! ✨🎉
Combining images and text should be done by a smart agent... using separate (small) LLMs to do what they are each best at. The thought of one fat LLM that can do everything feels like a waste of energy (as if existing LLMs didn't already eat enough electricity).
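That routing idea is easy to prototype: a cheap dispatcher decides whether a request goes to a text model or an image model, so no single model has to handle everything. A hypothetical sketch - the call_* helpers are placeholders for real model calls, and a production agent would use a small classifier model rather than a keyword check.

```python
def call_text_model(prompt: str) -> str:
    # Placeholder for a small text-LLM call.
    return f"[text model reply to: {prompt}]"

def call_image_model(prompt: str) -> str:
    # Placeholder for a dedicated image-generation model call.
    return f"[image generated for: {prompt}]"

IMAGE_HINTS = ("draw", "image", "picture", "render", "photo", "sketch")

def route(prompt: str) -> str:
    """Cheap router: send drawing requests to the image model, the rest to text.

    The shape of the system is the point: one dispatcher, specialized workers.
    """
    if any(hint in prompt.lower() for hint in IMAGE_HINTS):
        return call_image_model(prompt)
    return call_text_model(prompt)

print(route("Draw a red convertible on a coastal road"))
print(route("Summarize the Gemini 2.0 announcement"))
```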
Flash.
Whoa-oh.
Flash.
Interesting if you can make this into a math tutor
My guess is it would depend a lot on what level of math you are doing and also how you are going to display it (e.g., all voice or a combination of voice and visual elements). You could imagine making something really cool for school students and young kids. I guess that a lot of the big companies are going to do that pretty soon.
So when they showed how you could prompt: Say this in a whisper:
You're actually hearing it ... right now.
And it read that sentence in an amazing whisper. And other similar demos of how to *emphasize* a word, etc. Is any of that possible with Gemini Flash 2.0 or some other Google model today? Or is that still in the coming soon part?
Another extremely censored Google creation.
The voices sound like the ones used in NotebookLM
Yes - especially the fact that NotebookLM can have voices that talk across each other suggests that they're coming out of a single model, not just a TTS system. That said, Google has quite a lot of options to choose from for TTS as well, if you look at the SoundStorm paper and examples. google-research.github.io/seanet/soundstorm/examples/
Any news on when the general public can use it?
The code generated in the video uses GPT-3.5-turbo, poor google ;)
What Sam asked it to do, obviously.
Is Google still manipulating prompts? If so, I have no interest in a political agenda machine.
I think a lot of lessons have been learnt from the original image problem.
It's meh at best. Just tried it and it kinda understands things but the moment it hits a filter it shuts down completely. Maybe with 80 percent of tasks this is ok but it is about 20 percent towards completion
first!!
10:07 I speak German, English, French and Italian and thought: nice one 💅… then he busts out Thai 🧎🏻➡️…. 🏳️
(Thank you Sam for the great demo - love your engineering pov a lot)
How to use an image model