Gemini 1.5 seems like a truly gigantic leap in LLMs... probably the first time I've been wowed since the release of GPT-4.
On arXiv, this paper (posted around the same day as the Gemini update and OpenAI's Sora) might be one reason, or the reason, they moved when they did. It follows two others from these students last fall... it's just odd how major it is and how the timelines line up. It's limited by needing more compute, but the accuracy at such a huge context is what's astounding. [Since links won't post here, just search "world model on million length video ring attention".] 'Ring Attention' might be the RA in Sora.
This was awesome. Thank you Sir Sam.
Good job ❤ very exciting progress!
Sam - Great video! More Google content, please. New features made Gemini useful in my workflows.
I will probably make a few more about this and possibly some new stuff from Google
Excellent - I work for a pharmaceutical company, and a longer context window plus high data-retrieval accuracy is exactly what I need. What is your opinion on new RAG systems built on these new models? @samwitteveenai
I still think RAG is relevant for most uses for now, but this majorly unlocks things that couldn't be done before without many calls to the LLM and patterns like MapReduce. I think models like this one unlock a lot for agents, which I would like to show at some point. I can totally see how this kind of model can help serious work like you are doing, much more than just chat-with-a-bot stuff.
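To make the contrast concrete, here's a rough sketch (my own illustration, not anything from the video) of the old map-reduce summarization pattern next to a single long-context call; `call_llm` is a hypothetical helper standing in for whichever model client you use.

```python
# Rough sketch: map-reduce summarization vs. one long-context call.
# `call_llm` is a hypothetical helper that sends a prompt to some LLM
# and returns its reply as a string.

def map_reduce_summary(chunks, call_llm):
    # "Map": summarize each chunk with its own LLM call.
    partial = [call_llm(f"Summarize this section:\n{chunk}") for chunk in chunks]
    # "Reduce": merge the partial summaries with one more call.
    return call_llm("Combine these summaries into one:\n" + "\n".join(partial))

def long_context_summary(document, call_llm):
    # With a ~1M-token window the whole document can go in a single call.
    return call_llm(f"Summarize this document:\n{document}")
```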
Sure, I'll still be using RAG. Having a longer context window gives me the chance to be more strategic and flexible with how I organize content chunks. This means I won't just be breaking things up every 1000 tokens without any thought. :) @samwitteveenai
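For example, here is a minimal sketch of what "more strategic" chunking could look like, assuming plain-text documents whose section headings start with "#"; the boundary rule would depend on your own data.

```python
# Minimal sketch: chunk along document structure instead of a blind
# fixed-size split. Assumes plain-text docs whose section headings
# start with "#"; adapt the boundary rule to your own documents.

def chunk_by_section(text: str, max_chars: int = 8000) -> list[str]:
    chunks, current, size = [], [], 0
    for line in text.splitlines():
        new_section = line.startswith("#")
        if current and (new_section or size > max_chars):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```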
I hope they give access to this model soon :/
When will it be released to the public?
Great video and model ( not seen yet)
Please, anything on data analysis tasks? (CSV, XLS, ...)
I am planning to do one on code so let me try and put it in there
Very interesting and informative! I am wondering how this would work for literature-review-type workflows. Say you choose a technical topic like text-to-video and upload 5-10 key relevant papers (like the ones HF summarized after the Sora release): how well would the model perform at synthesizing the papers? An even crazier task would be to add an existing literature review paper as an example, so it becomes 5-10 papers with a one-shot prompt. If the model can reason through this, the implications would be huge.
I have played with it with single papers and found it to be very interesting. I might give it a shot with a group of connected papers, interesting idea.
Awesome video! thanks for being the hero we needed! Keep going forward and enjoy Singapore!
I wonder if it could be good at coding / making coding agents for people who don't know code at all.
I think you really need to understand code to know when LLMs are doing simple, dumb things, but that level of understanding doesn't need to be super deep. Learning the basics of coding is still a very good skill to have, and it also improves how you think about these things.
Could you try a narrative video? This would be really useful to understand the model's capacity to understand semantics of juxtaposed images.
I'm a film editor. I think you could choose a classic narrative short video from YouTube (with cleared rights) and try different levels of questioning (high-level narrative comprehension, emotion, a character's emotional arc, etc.).
Why blur the release date of the video? It was the 16th of Feb, if you're wondering.
Certainly not intentional, my guess is the editor was blurring my email and that blur stayed on the screen.
Can you do a video with audio summarization? Feed it a large audio file and ask for a per-timestamp summary?
The current release doesn't support audio yet, but you can do the timestamp summaries based on a whisper transcript etc.
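A rough sketch of that workaround: transcribe with Whisper, bucket the segments into fixed time windows, then summarize each window. This assumes the openai-whisper package; `summarize` is a hypothetical helper for whichever LLM you call.

```python
# Rough sketch: Whisper gives start/end times per segment, which you can
# bucket into windows and summarize. `summarize` is a hypothetical LLM helper.
import whisper

def timestamped_summary(audio_path, summarize, window_s=300):
    model = whisper.load_model("base")
    segments = model.transcribe(audio_path)["segments"]  # each has start/end/text
    buckets = {}
    for seg in segments:
        window_start = int(seg["start"] // window_s) * window_s
        buckets.setdefault(window_start, []).append(seg["text"])
    return {f"{start // 60:02d}:{start % 60:02d}": summarize(" ".join(texts))
            for start, texts in sorted(buckets.items())}
```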
@samwitteveenai Which AI is best at converting audio? Can it be implemented or merged? The two together would be amazing. Thanks
Can you please do a video on Gemini 1.5 pro reading an entire college level science textbook?? That would be so awesome!
I applied for the beta.
How long till I get access?
I got access in 1 week, and I am just a normal developer. Maybe scientists get access really fast.
Informative vid. Thanks
Legal doc review basically fully automated at this point.
Can you upload a storybook or novel and ask about the characterization in some new book that may have just been released? So excited about this. Can't wait to try it out.
happy to try, but need to find something that is a new book. Any suggestions?
Any sense of whether it could understand a video with no captions or words spoken in the video? Like maybe 30 seconds of a stream in a snowstorm?
Can you share your Gemini Chat like ChatGPT allows you to ❓
What were the costs to process all that video multiple times ❓
Keep up the good work 👍
This is in Google AI Studio, not the consumer interface. In gemini.google.com you can do all of those. The 1.5 models will come there over time.
I am making an automated video editor using GPT Vision and another speech-to-text API. It does work this way, but I would like to see what Gemini can do!
Can you please test whether Gemini 1.5 can act as a professional video editor and output timestamps for where to place zoom in/out effects, emojis, or sound effects?
Nice! Could you try something harder, say: show it a security-camera video of a break-in and ask it to describe what happens in the video (without mentioning the break-in).
Got a link to footage like that?
Hi Sam, do you have to upload the video manually every time?
No, once you upload it you can query it many times for that session. It may lose it from session to session, though. I think they are looking at how best to handle this for the UI and API going forward.
How long of a response can you get out of it? Could it describe a full video like a normal human does, and if so, how long of a video? Will it ever be able to work with audio and video at once?
Thanks for this video
Why can't I access Gemini 1.5, even though I'm using Gemini Advanced? Is it not released publicly, or only in some countries?
It is not in Gemini Advanced yet, not sure if it will come to that or when.
You could've asked Gemini to return the timestamp with its responses, so that you could then verify whether it was said around the timestamp it returned. That way you'd actually have a higher likelihood of checking whether it was really said in the video.
This video was made almost a year ago. Back then, timestamps weren't working as well on that version of Gemini. If you look at my recent videos, I made one on a Gemini Flash 2.0 video analyzer where I did exactly what you're talking about and got timestamps back.
@@samwitteveenai Ahh okay nice!
When will the waitlist be approved 😢😢😢
I think they have started to approve some people for the waitlist as of yesterday
Sam, how can I get Gemini to learn game strategy from video + sound in a tennis game?
currently the audio version is not out but hopefully soon.
@samwitteveenai How can I best DM you for advice and help on my project?
Not taking audio in is bizarre. any ideas why not?
This video is quite old now. The current version should be able to handle audio.
Isn't there a way to play with it on their Vertex AI platform?
not yet but I think it is coming.
Sam is on a like train right now
Oriol said that they are working on improving the speed
I think Oriol and his team will improve a number of things about this. Don't forget this one is just the Pro.
In the future, will the Gemini 1.5 model with 1 million tokens be available for free?
The applications for the general public are huge. One that comes to mind is police officers having to analyze hours of CCTV footage.
Yes there are lots of security applications that I suspect Google isn't too keen on talking about.
Hi, what's the pricing of this API?
not publicly announced yet sorry.
@samwitteveenai So what are you paying for when you use the model? Is it free at the moment?
@hashiromer7668 I'd also like to know what price you paid for this demo, or is that classified?
ooh man things would be so much different
Is there an API for Gemini 1.5?
It looks like once you are able to access 1.5 in AI Studio, you can also query it through the API.
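A minimal sketch of what that can look like with the Python SDK once your key has access; the model name string ("gemini-1.5-pro-latest" here) is an assumption, so check the name AI Studio shows for your account.

```python
# Minimal sketch of calling Gemini 1.5 Pro through the Python SDK.
# The model name string is an assumption; use the one AI Studio lists.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")
response = model.generate_content("Summarize the key ideas of ring attention.")
print(response.text)
```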
Man, can you imagine how long beam search would take with this?
First reaction: yeah, agree. Second: maybe not as long as you think; it depends on how they are doing the attention, etc.
The video icon is dimmed in my Google AI Studio.
Yeah it currently only works on the 1.5 models
They still haven't given access to 99% of people; it feels like they gave it only to paid promoters 😔
In my experience, Gemini models tend to hallucinate info not contained in the prompt much more often than other models; I'd do more tests on this. In most of the tests you're showing, it's very hard to tell whether Gemini is actually retrieving correct info or just guessing. Since LLMs are very good at lying, one has to be very careful about what you ask a model: I'd never trust any of the results shown in this video, because the risk of the model making stuff up is too high.
Gemini 1.5 Pro has nearly perfect recall.
Great video. However, the video that was uploaded was probably not the one that would best demonstrate its potential. Frame-by-frame analysis using conventional entity extraction could have yielded similar results, since the context is written as text on the slides. Something like sports analytics, where motion is what's being tested, might have been a bigger stretch.
I agree that sports or some other form of action would show a different kind of analysis: action identification, understanding motion, etc. I still think having it make a set of notes from an hour's worth of slides is a pretty impressive feat. Will look into some ideas for sports/action too, though.
It would be interesting to see if it can figure out cause and effect. Or object permanence.
Let it read the Zig documentation and ask it data structures and algorithms questions.
If it can learn a language from a grammar book, it should be able to solve DS/algo LeetCode problems in Zig.
Yeah, this will definitely change how videos are consumed and how video essays are planned out, idk.
Give it some downloaded viral TikTok video that has some stoic narration and have it generate similar text in the same style and length, at least 3 different texts.
Upload one of your coding tutorial videos and ask it to extract the code and explain it; that will be a true test of intelligence.
Already been done before on other videos. Do some searching and you will see.
Upload a public domain novel and ask it to write a new chapter, prologue, or epilogue.
When people say that RAG is not dead, that’s like Bill Gates saying “640K ought to be enough for anybody” back in 1981.
Just because inference time over 1M tokens is 1 minute now, why assume that from now on until the end of the Universe, inference times across 1M tokens will remain constant at 1 minute? Since when did things ever stagnate like that in digital technology?
It’s kind of baffling to me.
For most big companies, 10M tokens is a drop in the ocean compared to the data they need to RAG over. RAG will still be around for the foreseeable future for serious applications.
@@samwitteveenai I guess let’s agree to disagree on that one then. 😄
Thanks for another really awesome demo video! I really appreciate it. 🙌🏻
Agree to disagree, but I really appreciate you chiming in. I don't want this place to just be people who agree. I do totally agree with you that pricing will go down and speeds will get faster over time.
I agree that speeds will improve, but there will always be a need for more speed and capability, and using these two approaches together will achieve that. Even though we have bigger storage needs than we thought and cheaper storage than we ever imagined, we still compress, and we still distribute storage.
@@dusanbosnjakovic6588 Yeah, probably there will be some kind of semantic router in most AI apps judging which kind of retrieval will make the most sense for each particular query.
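Something like that router can be as simple as embedding the query and comparing it to an exemplar description of each retrieval strategy; a minimal sketch, with `embed` standing in for whatever hypothetical embedding function you'd use.

```python
# Minimal sketch of a semantic router: embed the query, compare it to an
# exemplar description of each retrieval strategy, pick the closest one.
# `embed` is a hypothetical function that returns a vector for a string.
import numpy as np

ROUTES = {
    "rag":          "find a specific fact somewhere in a large document store",
    "long_context": "reason over one whole document from start to finish",
}

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(query: str, embed) -> str:
    q = embed(query)
    return max(ROUTES, key=lambda name: cosine(q, embed(ROUTES[name])))
```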
First like
I subscribed after the Gemini 1.5 Pro video.
I have one doubt: what is the output token length of Gemini 1.5 Pro?
not yet public but I talk about it in this vid
Is 1.5 as oppressively woke as the public 1.0?