AI agent + Vision = Incredible

  • Published 1 Jun 2024
  • A step-by-step tutorial on how to build a vision-powered AI agent with AutoGen + LLaVA + Stable Diffusion, and a breakdown of the 160-page analysis of GPT-4V capabilities
    🤘 Get 15% off on sceneXplain via my code AIJASON : go.jina.ai/scenexplainjason
    🔗 Links
    - Follow me on twitter: / jasonzhou1993
    - Join my AI email list: www.ai-jason.com/
    - My discord: / discord
    - sceneXplain: go.jina.ai/scenexplainjason
    - Vision-agent Github: github.com/JayZeeDesign/visio...
    ⏱️ Timestamps
    0:00 Intro
    1:15 What is a multi-modal model
    2:12 GPT4V ability break down
    4:34 sceneXplain
    6:00 Visual prompt techniques
    10:53 Use cases
    13:00 Build vision agent #1 - Setup
    14:20 Build vision agent #2 - Use Llava model
    15:58 Build vision agent #3 - Use Stable diffusion
    16:52 Build vision agent #4 - Set agent system via autogen
    18:53 Build vision agent #5 - Demo
    👋🏻 About Me
    My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
    #gpt4 #autogen #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #chatgpt #largelanguagemodels #largelanguagemodel #bestaiagent #chatgpt #agentgpt #agent #babyagi #llava #stablediffusion
  • Science & Technology

COMMENTS • 100

  • @AIJasonZ
    @AIJasonZ  7 months ago +12

    Which vision-enabled agent do you want to see me build? Leave a comment and let me know! 🤖

    • @jasonfinance
      @jasonfinance 7 months ago +8

      Would love to see an AI agent that can control a browser!

    • @Nodeagent
      @Nodeagent 7 months ago +2

      Yes browser control would be hot. Also image manipulation for things like precise mockups for customers - useful for Ecom stores who sell personalized goods

    • @loualibarca
      @loualibarca 7 months ago +4

      Successfully getting this done with a local llm would be interesting to see.

    • @PeterAustin666
      @PeterAustin666 7 months ago

      mine.

    • @dawidzurawski8870
      @dawidzurawski8870 7 months ago

      I would like to see an agent that can read old handwritten documents and turn them into PDFs

  • @T33KS
    @T33KS 7 months ago +20

    Your content has the right amount of abstraction, making your videos short, sweet, and appealing to a wide audience (it's not a course).
    But at the same time it has the right amount of technical detail for devs and engineers to replicate what you are demonstrating.
    Thank you for this great content

  • @asithakoralage628
    @asithakoralage628 7 months ago +4

    Hi Jason, yet another great video, I learned a lot from your channel. Thanks for sharing your knowledge.

  • @vakman9497
    @vakman9497 7 months ago +3

    Hey bro, good job on that thumbnail! I didn't even realize it was one of your videos. I honestly thought I was clicking on a VICE video lmao.

  • @craigcasee7183
    @craigcasee7183 7 months ago

    I've been needing to see a video like this where someone strings together some ai with code, glad to see. I want to add eye tracking to ar and ai vision. It would be nice to quickly ask questions in the real world. And the automation aspect is very nice for you to share, plus continue to make informative instructional demonstrative amazing videos like this! Thank you!

  • @frankchangshow
    @frankchangshow 7 months ago

    I really appreciate you and the videos you're creating, AI Jason. They are helping me a lot in learning this space.

  • @SamuelHollis
    @SamuelHollis 7 months ago +9

    🎯 Key Takeaways for quick navigation:
    00:00 🌐 Introduction to AI Vision Integration
    - The video begins with an introduction to the integration of AI agents and vision capabilities.
    - AI agents with vision power can revolutionize various applications, from web design to answering complex questions and enabling general-purpose robots.
    02:06 📸 Multimodal Models and Their Potential
    - Multimodal models can process not only text but also images, audio, and videos, enabling them to understand different types of data and their relationships.
    - GPT-4 Vision (GPT-4V) can handle various image types, including photographs, text within images, diagrams, tables, and floor plans, unlocking numerous use cases.
    04:36 🧠 Understanding GPT-4V's Abilities
    - GPT-4V demonstrates impressive out-of-the-box performance, such as identifying objects, recognizing people, counting objects, and even understanding perspective.
    - However, it also has limitations and can make mistakes, particularly in tasks like text extraction and chart interpretation.
    06:49 🚀 Promoting GPT-4V's Performance
    - Different prompting techniques can be used to improve GPT-4V's performance in image-related tasks.
    - Techniques include providing detailed text instructions, setting performance expectations, using few-shot prompts, and visual referring prompts.
    09:19 🌟 Expanding Use Cases with GPT-4V
    - GPT-4V's ability to understand the relationship between multiple images opens up new possibilities, such as calculating costs from images or determining the sequence of images in a task.
    - It can also facilitate interactions through visual annotations, allowing users to point or circle objects for AI understanding.
    11:44 🤖 Building Autonomous AI Agents
    - GPT-4V's capabilities make it possible to create autonomous AI agents that can continuously improve image generation and perform tasks like desktop automation.
    - These agents have potential applications in various industries, from architecture and engineering to customer support and medical diagnosis.
    Made with HARPA AI
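The prompting techniques listed in the summary above (detailed text instructions, setting performance expectations, few-shot prompts) can be sketched as plain prompt assembly. This is a minimal illustration only: `build_visual_prompt` and its parameters are hypothetical names, it builds just the text half of a prompt, and it does not call GPT-4V, LLaVA, or any real API.

```python
from typing import Iterable, Optional, Tuple


def build_visual_prompt(
    task: str,
    role: Optional[str] = None,
    steps: Iterable[str] = (),
    examples: Iterable[Tuple[str, str]] = (),
) -> str:
    """Assemble the text that would accompany an image (hypothetical helper).

    role     -- expectation setting, e.g. "an expert in counting"
    steps    -- detailed instructions, rendered as a numbered list
    examples -- few-shot pairs of (image reference, expected answer)
    """
    parts = []
    if role:
        parts.append(f"You are {role}.")
    parts.append(task)
    for number, step in enumerate(steps, start=1):
        parts.append(f"{number}. {step}")
    for image_ref, answer in examples:
        parts.append(f"Example: {image_ref} -> {answer}")
    return "\n".join(parts)


if __name__ == "__main__":
    # The file name and answer below are made-up placeholders.
    print(
        build_visual_prompt(
            "Count the apples in the image.",
            role="an expert in counting",
            steps=["Count row by row.", "Report the total."],
            examples=[("image_1.png", "3 apples")],
        )
    )
```

How the resulting text is attached to an image depends on the model client being used, so that part is deliberately left out.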

  • @MattLuceen
    @MattLuceen 7 months ago +1

    This is exactly what I needed. Thank you.

  • @markksantos
    @markksantos 7 months ago

    You're the best. PLEASE POST MORE OFTEN!

  • @ryzikx
    @ryzikx 7 months ago

    Great stuff I was looking for vision autogen tutorials

  • @cliffordramsey2500
    @cliffordramsey2500 7 months ago

    Thank you for this clever integration of tools!

  • @leu2304
    @leu2304 7 months ago

    This channel is real gold! Thank you so much

  • @Hisma01
    @Hisma01 7 months ago

    Great content. You have a new sub. Keep up the great work!

  • @skanderbegvictor6487
    @skanderbegvictor6487 6 months ago +1

    Wow this content is great. Subscribed

  • @moberpriller
    @moberpriller 7 months ago +1

    Thanks for the great content!

  • @Ychuah_1997
    @Ychuah_1997 7 months ago +1

    Chatgpt: I can't count apples...
    Prompt: You are an expert in counting!
    Chatgpt: Giving the correct answer :)
    These prompts are just fascinating - and great content as usual!

  • @ultimategolfarchives4746
    @ultimategolfarchives4746 7 months ago +3

    Always providing incredible content. 👍 👍👍

  • @PrincepsPolycap
    @PrincepsPolycap 7 months ago

    Notification enabled for that parse automation!

  • @AI-Wire
    @AI-Wire 7 months ago +1

    Great job, Jason. In the future could you please consider showing how to use these tools without paying for any API keys. For example, using PaLM API or some of the open source models. This is because building projects at scale is cost prohibitive using recursive tools like Autogen.

  • @leandrogoethals6599
    @leandrogoethals6599 6 months ago

    How can I use an uncensored Stable Diffusion variant with this?
    Great vid by the way can't wait for what u do next!
    Also could it be that the discord invite link is broken? Can't wait to join!

  • @KCM25NJL
    @KCM25NJL 7 months ago +1

    I cannot even begin to imagine the API costs for running things like these on frontier models right now. As impressive as it is, you'll need a real profitable use case if you wanna use it like this.

  • @joxxen
    @joxxen 7 months ago

    Really nice video. I for one would love it if the agents could run Stable Diffusion on a local machine. Any chance you want to create a video about that?

  • @kodeengatai1347
    @kodeengatai1347 7 months ago

    Thanks mate, great stuff. I would really be interested in agents that can generate video based on prompts, even if the agents first need to be trained on sample videos.

  • @krisograbek
    @krisograbek 7 months ago +1

    Would it be possible to build a similar agent that improves illustrated short stories for kids? That way it would improve both the images and the text provided in the stories...
    BTW, I've been learning so much from you, Jason! Your channel is a gem!
    As a fellow YouTuber, you make me feel small...

  • @user-ug3pf3uw6x
    @user-ug3pf3uw6x 7 months ago

    You are the best!

  • @pocoso
    @pocoso 7 months ago +3

    First! Good tutorial man

  • @lifeofdean3647
    @lifeofdean3647 7 months ago

    very good man :))

  • @GabrielVeda
    @GabrielVeda 7 months ago

    Brilliant

  • @GlenBland
    @GlenBland 6 months ago

    I would love to see one video that summarizes the most popular libraries and APIs for LLMs, along with which work best together and which have replaced older ones. Include: AutoGPT, MemGPT, ChromaDB, LangChain, Ollama, Pinecone, etc.

  • @spicer41282
    @spicer41282 7 months ago

    My Request Please...
    Can you apply this GPT4V agent to the following?
    Take a photo of a simple shed and analyze its size, the pitch of the roof, and perhaps how much wood was used to build it.
    Thank you for considering this and testing the multimodal capabilities with this use case.

  • @amandamate9117
    @amandamate9117 7 months ago

    Can you write agents that operate a headless browser? Within this browser, one window can utilize GPT-4's website features designed for Plus users, while another window can generate images using DALL-E 3. These images can then be uploaded for review in the same headless browser session. Although you'll be limited to 50 prompts every 3 hours, this setup should still be sufficient for most use cases. Additionally, this approach allows you to conduct user interface analysis or other tasks without incurring API costs.

  • @JosephDefendre
    @JosephDefendre 7 months ago +1

    This is nuts. AutoGen is a game changer.

  • @aliyousefi9735
    @aliyousefi9735 7 months ago

    AI Jason is da man

  • @Joy_jester
    @Joy_jester 1 month ago

    Hey can u do an agent where it has to do instruction following in a simulator? I think that will be a very practical and interesting application

  • @carterjames199
    @carterjames199 7 months ago +1

    I think another good video would be comparing these different agent creation frameworks. Feels like I see another one every day. I would specifically like to hear your opinion on AutoGen vs SuperAGI.

  • @bbproperties-oq5vu
    @bbproperties-oq5vu 7 months ago

    Hi Jason, this is really good. Can you upload a browser automation video? I am really interested in it.

  • @stereotyp9991
    @stereotyp9991 7 months ago

    I'm always hitting the token limit after just a few posts of the agent. Is there a way to work around this?

  • @darkbelg
    @darkbelg 7 months ago

    For what I'm trying to do, LLaVA isn't yet as good as GPT-4V. GPT-4V has once again raised the bar for me. And now the waiting begins for an API.

  • @nashvillebrandon
    @nashvillebrandon 7 months ago

    Would be awesome to give the agent the ability to do inpainting!

  • @popfizz311
    @popfizz311 7 months ago

    Can you use this feature with the GPT-4 API?

  • @ward_jl
    @ward_jl 7 months ago

    So interesting. Is it possible to get the code to experiment with it?

    • @AIJasonZ
      @AIJasonZ  7 months ago

      Yep it is in the description

  • @ibrahimhalouane8130
    @ibrahimhalouane8130 7 months ago

    How about a SuperAgent that can create other agents by its own to perform a complex task?

  • @yasinyaqoobi
    @yasinyaqoobi 7 months ago

    Great video as always. Can you please move your head to the bottom right? It cuts off a lot of the content. :)

  • @itshuskai
    @itshuskai 7 months ago

    Now to really test it, see if it can pass the "Are you a robot?" prompts lol.

  • @brando2818
    @brando2818 7 months ago

    How do you fine-tune LLaVA?

  • @georgecochran4091
    @georgecochran4091 7 months ago +1

    OK, you know how in the game No Man's Sky you get an analysis visor to scan the environment and save data on plants, rocks, and animals? Something like that for IRL. I would be collecting data all the time.

  • @jp00738
    @jp00738 7 months ago

    hahaha oh man, you are a legend.

  • @spookyrays2816
    @spookyrays2816 7 months ago +1

    Create a bot that can read and visually react to output, so that it can create a deep-learning-style feedback loop, improving upon itself until it no longer can.

    • @jtjames79
      @jtjames79 7 months ago +1

      I was thinking AutoGen, an artist agent, and an editor agent.
      I don't know how to do it, but theoretically it should work.
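The artist/editor idea above can be sketched as a simple generate-and-review loop. This is only an illustration under stated assumptions: `refine_image`, `fake_artist`, and `fake_editor` are hypothetical names, the two stubs stand in for an image generator (such as Stable Diffusion) and a vision model (such as LLaVA), and nothing here uses AutoGen's actual API.

```python
from typing import Callable, Tuple


def refine_image(
    goal: str,
    generate: Callable[[str], str],
    review: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Alternate between an artist and an editor until the editor approves.

    generate(prompt) returns an image reference (path, URL, or any handle).
    review(goal, image) returns (approved, feedback).
    Returns the last image and the number of rounds used.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    prompt = goal
    image = ""
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        image = generate(prompt)
        approved, feedback = review(goal, image)
        if approved:
            break
        # Fold the editor's feedback into the next prompt for the artist.
        prompt = f"{goal}. Revision note: {feedback}"
    return image, rounds_used


# Stub "agents" so the loop can run without any model; real ones would
# call an image generator and a vision model here (assumption).
def fake_artist(prompt: str) -> str:
    return f"<image for: {prompt}>"


def fake_editor(goal: str, image: str) -> Tuple[bool, str]:
    if "Revision note" in image:
        return True, ""
    return False, "add more detail"


if __name__ == "__main__":
    print(refine_image("a horse in a field", fake_artist, fake_editor))
```

Capping the loop with `max_rounds` keeps an overnight run from iterating (and spending tokens) indefinitely when the editor never approves.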

  • @markksantos
    @markksantos 7 months ago

    make a video about memgpt

  • @jtjames79
    @jtjames79 7 months ago

    I want to be able to use AutoGen or something like that, to set up adversarial agents to use Stable Diffusion for me. So I can ask for an image before I go to bed, and by morning it'll have worked out something.

    • @jtjames79
      @jtjames79 7 months ago

      I should have just kept watching, instead of commenting before watching.

  • @matthewboyd8689
    @matthewboyd8689 7 months ago +1

    They need to make it be able to work on less information and make correct deductions that aren't in its training data before trying to make it more generalized. Otherwise it will just compound hyperbolically the amount of information they need to be able to understand as much as a human can.

  • @brisonvsn
    @brisonvsn 7 months ago

    Can agents browse and interact with the internet yet?

  • @olivMertens
    @olivMertens 7 months ago

    Could you give the source for the file and examples shown in this video ?

    • @olivMertens
      @olivMertens 6 months ago

      So I found it myself:
      arxiv.org/pdf/2309.17421.pdf ;)

  • @pissmilker2313
    @pissmilker2313 7 months ago +11

    Our obsolescence as human beings isn't to be feared, but celebrated. Rejoice!

    • @raresmircea
      @raresmircea 7 months ago +4

      There’s gonna be a long time until AI will be conscious & match my subtlety. But even then, this take would still be so myopic. Have birds, elephants & dolphins "become obsolete" since humans arrived? Have your mother & brother "become obsolete" because that Indian boy was found to have a huge IQ? These kinds of extreme opinions, desires & manifestations that most people have often betray some unmet need, and I’m sorry for that.

    • @greengoblin9567
      @greengoblin9567 7 months ago +1

      @raresmircea We don’t need the AI to be conscious. We just need it to be more intelligent.

    • @arpitkumar2981
      @arpitkumar2981 7 months ago

      @greengoblin9567 yes

  • @defaultdefault812
    @defaultdefault812 7 months ago

    It got the speedometer right - just equated the wrong measurement circle to MPH.

  • @SkyJensen
    @SkyJensen 7 months ago

    Full website builder. Full website builder. Full Website Builder

  • @aghasaad2962
    @aghasaad2962 7 months ago

    GPT4V will soon be able to take research papers, write code, write a thesis, get a job, then marry... wait, what? That's what humans are for...

  • @psychxx7146
    @psychxx7146 7 months ago

    « 2023 »

  • @AntonioRonde
    @AntonioRonde 7 months ago +1

    There were too many basics in the video. I enjoyed your videos where you provided a more in-depth review, like in the AutoGen video.

    • @AIJasonZ
      @AIJasonZ  7 months ago +1

      Thanks for the feedback - is there a specific area where you would like to see me dive deeper?

  • @Huru_
    @Huru_ 7 months ago +1

    I wonder what kind of results you'd get if you were to feed that model some proper English...

    • @soulspawn
      @soulspawn 7 months ago +2

      Well, it generated human hands because it has been tasked to create palms instead of hooves (see manager reply @20:04 ). I'd call this a win. 👀

    • @Huru_
      @Huru_ 7 months ago +1

      @soulspawn Didn't say it wasn't one. Just giving pointers for optimization. Also, I wasn't even going that deep. Just regular ass grammar and complete sentences for starters...

    • @PeterAustin666
      @PeterAustin666 7 months ago

      @Huru_ wastes tokens

    • @AIJasonZ
      @AIJasonZ  7 months ago +2

      I honestly didn’t know "palm" is specifically for humans, hah 😂😂 thanks, will try again

    • @Huru_
      @Huru_ 7 months ago +1

      @AIJasonZ Lol, that's why you need to read your Manager's input.

  • @brytonkalyi277
    @brytonkalyi277 7 months ago

    I believe we are meant to be like Jesus in our hearts and not in our flesh. But be careful of AI, for it is just our flesh and that is it. It knows only things of the flesh (our fleshly desires) and cannot comprehend things of the spirit such as peace of heart (which comes from obeying God's Word). Whereas we are a spirit and we have a soul but live in the body (in the flesh). When you go to bed it is your flesh that sleeps but your spirit never sleeps (otherwise you have died physically); that is why you have dreams. More so, true love that endures and lasts is a thing of the heart (when I say 'heart', I mean 'spirit'). But fake love, pretentious love, love with expectations, love for classic reasons, love for material reasons and love for selfish reasons, that is a thing of our flesh. In the beginning God said let us make man in our own image, according to our likeness. Take note, God is Spirit and God is Love. As Love He is the source of it. We also know that God is Omnipotent, for He creates out of nothing and He has no beginning and has no end. That means our love is but a shadow of God's Love. True love looks around to see who is in need of your help, your smile, your possessions, your money, your strength, your quality time. Love forgives and forgets. Love wants for others what it wants for itself. Take note, love works in conjunction with other spiritual forces such as faith and patience. We should let the Word of God be the standard of our lives, not AI. If not, God will let us face AI on our own and it will cast the truth down to the ground, enslave us and make us worship it. We can only destroy ourselves, but with God all things are possible. God knows us better because He is our Creator and He knows our beginning and our end. Our proof text is taken from the book of John 5:31-44, Daniel 7-9, Revelation 13-15, Matthew 24-25 and Luke 21. Let us watch and pray... God bless you as you share this message with others.