StreamingLLM - Extend Llama 2 to 4 million tokens & 22x faster inference?

  • Published 6 Oct 2023
  • It's hard to get an LLM to generate a large amount of content or take in large inputs. To solve this, StreamingLLM extends Llama-2 & Falcon up to 4 million tokens, with 22x faster inference than your standard LLM ⚡️
    Now you can even generate a whole book with an LLM!
    🔗 Links
    - Follow me on twitter: / jasonzhou1993
    - Join my AI email list: www.ai-jason.com/
    - My discord: / discord
    - StreamingLLM Github: github.com/mit-han-lab/stream...
    👋🏻 About Me
    My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
    #llama2 #meta #gpt #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #largelanguagemodels #largelanguagemodel #chatgpt #gpt4 #machinelearning
  • Science & Technology
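
A minimal sketch of the mechanism behind those numbers, as the paper describes it (this is an illustration, not the repo's actual implementation): StreamingLLM keeps the first few "attention sink" tokens plus a rolling window of recent tokens in the KV cache, so memory and latency stay roughly constant no matter how long the stream gets. The names and sizes below are illustrative.

```python
# Hedged sketch of StreamingLLM's eviction policy, not the actual
# mit-han-lab code: always keep the first `n_sinks` "attention sink"
# entries plus a sliding window of the most recent entries.

def evict_kv_cache(cache: list, n_sinks: int = 4, window: int = 2048) -> list:
    """Keep the first `n_sinks` and the last `window` cached entries."""
    if len(cache) <= n_sinks + window:
        return cache                     # cache not full yet, keep everything
    return cache[:n_sinks] + cache[-window:]

# A 10,000-token stream is served from a constant-size cache, which is
# where the flat memory use and faster long-stream inference come from.
cache = list(range(10_000))              # stand-in for per-layer KV pairs
cache = evict_kv_cache(cache)
assert len(cache) == 4 + 2048
```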

COMMENTS • 38

  • @arthurbm32
    @arthurbm32 7 months ago +23

    I really liked the format of explaining new concepts brought up in scientific papers; keep it going, Jason, love your channel! You're one of the most unique AI content creators I know.

  • @elmflor4365
    @elmflor4365 7 months ago +1

    This is the third video of yours I've seen today, and you are so consistent at providing value with your words. Thank you, my new AI guru 🙏

  • @BorutDelFabbro
    @BorutDelFabbro 7 months ago

    Great job at providing information about new developments, Jason! Thanks!

  • @jeffsteyn7174
    @jeffsteyn7174 7 months ago +1

    Oh, this is great. I started playing around with streaming; the part I was missing was keeping the initial context. Nice find.

  • @jasonfinance
    @jasonfinance 7 months ago +1

    Thank you for the great content as always, Jason. Could you maybe elaborate a bit more on the specific use cases of StreamingLLM in the future?

  • @autonomousreviews2521
    @autonomousreviews2521 7 months ago

    Thank you for the succinct share :)

  • @fab_spaceinvaders
    @fab_spaceinvaders 7 months ago

    As usual, inspiring, accurate, and up to date. Ty sir

  • @heagandev
    @heagandev 7 months ago

    I like these short videos as well. Time is most precious these days

  • @CuriousJayDiscover
    @CuriousJayDiscover 7 months ago

    Love your channel

  • @Dron008
    @Dron008 7 months ago +4

    I don't understand how it can help even for books. Will it forget everything in the middle of the book? I try to compare it to how the human brain works. When we read a book we (usually) don't remember each word; instead we create internal visual images, which compress the book very effectively. Supposedly these images are like tokens, or maybe like embeddings, and don't occupy much space in memory. Is it possible to implement something like this for LLMs? They would need to learn while "reading the book", converting the text into multimodal embeddings, or even tracing an approximate path through embedding space that they could analyze later. Not sure how it should be implemented.

    • @edwardprybylko4192
      @edwardprybylko4192 6 months ago

      The subconscious mind doesn't necessarily recite word for word on demand. But we can read something a second time with familiarity with certain phrases, which is quite amazing; particularly if we're a biological artificial intelligence à la The Matrix.

  • @Shaunmcdonogh-shaunsurfing
    @Shaunmcdonogh-shaunsurfing 7 months ago

    I LOVE THIS CHANNEL!!!

  • @jzam5426
    @jzam5426 7 months ago

    Are the middle tokens summarized or contextualized in any way, or is that information just lost as more data is added?

  • @DarrenAllatt
    @DarrenAllatt 1 month ago

    Here's my idea:
    It's going to need to use prompt compression and RAG.
    Let's start with prompt compression: basically, compress the user's input before feeding it to GPT-4/Opus, have the main LLM respond in compressed format, then have a decompressor at the end.
    Now here's where RAG comes into it.
    The compressed output goes into a temporary RAG store, forming a working, temporary knowledge base.
    Then we insert a RAG query between compression and passing the compressed query to the main LLM.
    This RAG query needs to search the working knowledge base for context relevant to the question, which gets appended as context when fed to the main LLM.
    You could probably have a separate RAG knowledge base in this process that stores the must-remember information.
    The process would look like this (sketched in code after the diagram):
    Input
    |
    Compressor LM
    |
    RAG Working Memory
    |
    Main LLM
    |
    Decompressor LM
    |
    Output
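
A rough, runnable sketch of the pipeline above. Every helper here is a placeholder (the compressor, retriever, main LLM, and decompressor are toy functions, not real model or library calls); a real build would swap in LLM calls and a vector store.

```python
# Toy sketch of the Compressor -> RAG -> Main LLM -> Decompressor pipeline.
# compress/decompress are trivial transforms and retrieve is naive
# keyword overlap; all of them are stand-ins for real model calls.

def compress(text: str) -> str:
    return " ".join(w for w in text.split() if len(w) > 3)   # Compressor LM

def decompress(text: str) -> str:
    return text                                              # Decompressor LM

def retrieve(memory: list[str], query: str, k: int = 2) -> list[str]:
    score = lambda m: sum(w in m for w in query.split())     # RAG query
    return sorted(memory, key=score, reverse=True)[:k]

def main_llm(query: str, context: list[str]) -> str:
    return f"[answer to '{query}' using {len(context)} retrieved chunks]"

def answer(user_input: str, working_memory: list[str]) -> str:
    compressed = compress(user_input)
    context = retrieve(working_memory, compressed)           # working KB lookup
    reply = main_llm(compressed, context)
    working_memory.append(reply)                             # temporary RAG store
    return decompress(reply)

memory: list[str] = []
print(answer("What did chapter three say about attention sinks?", memory))
```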

  • @mehdichallakh4294
    @mehdichallakh4294 7 months ago

    Hey Jason, amazing stuff, thank you for your hard work.
    What about storing chat history in vector databases for long-term memory, in the context of a chatbot for example?

    • @KCM25NJL
      @KCM25NJL 7 months ago +1

      A better method might be to prune factoids from the data that is about to be cut and, instead of storing them in a vector DB, throw them into a knowledge graph like TypeDB, which you can set up with some pretty complex rules governing edges (links between nodes). Or, instead of using complex rules, which can be a little arcane to set up, run the factoids through a smaller LLM for categorisation and then send them to the knowledge graph, where the rules for linking can be much simpler, e.g. link by category and/or link temporally. In the end you might have something like a giant mind-map-like structure of facts that the streaming LLM could use to make novel leaps between data points that a vector DB might struggle with alone.

    • @mehdichallakh4294
      @mehdichallakh4294 7 months ago

      @KCM25NJL Thank you. This field progresses so fast, and relevant resources and approaches are very scarce. Definitely going to learn more about TypeDB and knowledge graphs. For now I manage memory simply, with separate YAML files for partial/integral transcript retrieval, and I segment my exchanges into separate discussions, but a long-term, high-capacity memory will eventually become a necessity. I don't want to rush it too soon though, as new methods spawn almost every day lol. Like a computer (and a human brain), I believe the optimal system will look like short-term & long-term memory working together, RAM and HDD. The hardest part is managing such a system, and therefore I keep an eye open for function calling embedded in a model. For now, if I'm not mistaken, only GPT-4 coupled with MemGPT gets close to this.
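
A minimal sketch of the knowledge-graph idea in this thread, using a plain in-memory structure rather than TypeDB's actual API (not shown here); `categorize` is a toy stand-in for the smaller categorisation LLM the comment proposes.

```python
# Factoids are tagged by a stand-in classifier, then linked two ways:
# by shared category and by temporal adjacency (previous fact).
from dataclasses import dataclass, field

@dataclass
class Node:
    fact: str
    category: str
    links: list[int] = field(default_factory=list)  # indices of linked nodes

graph: list[Node] = []

def categorize(fact: str) -> str:
    return "finance" if "$" in fact else "general"   # toy classifier

def add_factoid(fact: str) -> None:
    node = Node(fact, categorize(fact))
    for i, other in enumerate(graph):
        if other.category == node.category:          # link by category
            node.links.append(i)
    if graph and len(graph) - 1 not in node.links:
        node.links.append(len(graph) - 1)            # link temporally
    graph.append(node)

add_factoid("User's budget is $500.")
add_factoid("User prefers dark mode.")
add_factoid("Subscription costs $20/month.")
print(graph[2].links)  # node 0 via category, node 1 via temporal adjacency
```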

  • @KarlJuhl
    @KarlJuhl 7 months ago +2

    Hold on Jason, the attention sink breakthrough does not necessarily mean a larger context window, does it?

    • @AIJasonZ
      @AIJasonZ 7 months ago +2

      Sorry, I just realised the title was misleading, so I've updated it!
      Yep, you're right! The context window doesn't actually change, but the LLM can effectively take in more context.

    • @user-ce7vu3ct3y
      @user-ce7vu3ct3y 7 months ago

      Yeah, it still means we can use it for faster inference even with a smaller context window, right?

  • @aldorodriguez7310
    @aldorodriguez7310 7 months ago

    Which LLM has the largest token limit for expanding the context length of a chat?

  • @voxyloids8723
    @voxyloids8723 2 months ago

    The more I chat with the LLM, the slower the answers get. Can I increase speed while saving the dialog? Thank you

  • @jaysonp9426
    @jaysonp9426 7 months ago +3

    Something to consider: when removing the text, maybe feed it to a vector store first so that you keep the data; then, if the model needs to remember something that's no longer in context, it can still retrieve the original text.
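
A toy sketch of this archive-on-evict idea. The bag-of-words "embedding" and the in-memory list are stand-ins; a real build would use an embedding model and a vector database.

```python
# Archive evicted text before the cache drops it, so it stays retrievable.
from collections import Counter

archive: list[tuple[Counter, str]] = []

def embed(text: str) -> Counter:
    return Counter(text.lower().split())     # toy bag-of-words "embedding"

def on_evict(text: str) -> None:
    archive.append((embed(text), text))      # save before eviction

def recall(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    overlap = lambda entry: sum((entry[0] & q).values())
    return [t for _, t in sorted(archive, key=overlap, reverse=True)[:k]]

on_evict("The hero buried the key under the old oak tree.")
print(recall("where was the key hidden?"))
```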

  • @aldorodriguez7310
    @aldorodriguez7310 7 months ago

    Which model has the largest context window token limit as of today?

  • @shadownight117
    @shadownight117 7 months ago

    I would like to see an LLM have a tool that lets it access and modify its own vector database for long-term memory storage. Its current context would be short-term memory, but it would store and retrieve information from its own vector database for long-term memory. Do you think this could be a possible solution?

  • @ilyasshynbergen7693
    @ilyasshynbergen7693 7 months ago

    Is there still no solution for extending the context?

  • @GrimGriz
    @GrimGriz 7 months ago

    Islands of meaning in the stream of thought:
    a conversational heartbeat at which the conversation is encapsulated in a metaphoric narrative.
    A memory-island visit spawns new queries into the dataset based on interpretation of the metaphoric narrative, re-encapsulated at the conclusion of the current session as a new memory-island narrative.

  • @juancasas5532
    @juancasas5532 7 months ago

    i luv u

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 5 months ago

    Ask the LLM to summarize stuff when it's running low on context; that should be better than just remembering the first bit and a recent window.
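
A sketch of this summarize-on-overflow idea; `summarize` here is a placeholder for an actual LLM summarization call.

```python
# When the history exceeds the budget, merge the two oldest chunks into
# one model-written summary instead of dropping them outright.

def summarize(text: str) -> str:
    return text[:60] + "..."                  # stand-in for an LLM call

def compact(history: list[str], max_chunks: int = 8) -> list[str]:
    while len(history) > max_chunks:
        history[:2] = [summarize(history[0] + " " + history[1])]
    return history

history = [f"chunk {i}" for i in range(12)]
print(len(compact(history)))                  # 8 chunks remain
```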

  • @unicornist
    @unicornist 7 months ago

    Wait, wait! You finished too quickly. I wish you had elaborated more on what can't be accomplished; I didn't quite get why, or the difference in what we can achieve with long-form content.

  • @holdthetruthhostage
    @holdthetruthhostage 7 months ago

    I think the problem with LLMs is that they aren't micro-shrinking the text so it takes up less space & memory, which would let them remember more as the text shrinks. There's also a lack of segmenting & categorization of conversations, so that only what matters gets brought up.

    • @KCM25NJL
      @KCM25NJL 7 months ago

      Condensing information like this, sometimes referred to as Sparse Priming Representation, has its merits, but it tends to break down the longer the conversation goes on. I'm currently working on an SPR method for populating a knowledge graph that automatically links and categorizes SPR data via a ChatGPT plugin. But as with all things AI, it may be out of date before I ever get it finished :)

  • @Afzal3000
    @Afzal3000 7 months ago +1

    Jason, use these LLMs and AI tools to make practical projects that everyone can understand and implement, like the AI girlfriend project.
    These videos on particular use cases are good, but videos like "making (AI project) using (this LLM/AI tool)" would be more helpful.

  •  7 months ago

    Or you could just download more RAM...

  • @timeTegus
    @timeTegus 7 months ago

    It does not have a 4M context. It still has the same old context window; stop spreading fake news.