Stable Diffusion AI Audiobook Player With Real Time Transcription Prompt And Image Generation

  • Published Sep 5, 2024
  • Proof of concept of an AI audiobook player with real-time image generation.
    Join the Discord, or don't: / discord (it's a really small and quiet server, but you can reach me there if you have any questions or ideas)
    Tools used:
    story : #chatgpt
    voice : #elevenlabs
    Real time:
    transcription : #whisper
    prompt generation : #ollama #meta #llama
    image generation: #stablediffusion #automatic1111
    GPU used for the demo: 4070 Ti with 12 GB VRAM
    There is still a lot to optimize. I'm pretty sure I can squeeze better performance out of prompt generation (it took way too long). On top of that, the prompts were not great and image consistency was pretty bad.
    The UI is basic, but it's just a proof of concept.
    Models used :
    github.com/ope...
    ai.meta.com/bl...
    civitai.com/mo...
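
The real-time chain in the description (transcribed text, then prompt generation, then image generation) could look roughly like the sketch below. It is not the author's code, which has not been published. It assumes Ollama is serving its HTTP API on the default port, that AUTOMATIC1111 was started with the `--api` flag, and that a model tagged `llama3` is installed; the model tag, instruction wording, step count and image size are illustrative guesses. The Whisper step is left out, so the functions start from text that has already been transcribed.

```python
import json
from typing import Callable
from urllib import request

# Assumed default local endpoints; adjust them to your own setup.
OLLAMA_URL = "http://localhost:11434/api/generate"
A1111_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def text_to_image_prompt(text: str, post: Callable = post_json,
                         model: str = "llama3") -> str:
    """Ask the local LLM to turn a transcribed passage into an image prompt."""
    instruction = (
        "Describe one image that illustrates this passage as a short, "
        "comma-separated Stable Diffusion prompt:\n" + text
    )
    # "stream": False makes Ollama return a single JSON object.
    reply = post(OLLAMA_URL, {"model": model, "prompt": instruction, "stream": False})
    return reply["response"].strip()


def prompt_to_image_b64(prompt: str, post: Callable = post_json) -> str:
    """Request one image from AUTOMATIC1111 and return it as a base64 string."""
    payload = {"prompt": prompt, "steps": 20, "width": 512, "height": 512}
    reply = post(A1111_URL, payload)
    return reply["images"][0]


def illustrate(text: str, post: Callable = post_json) -> str:
    """Run both steps for one transcribed passage."""
    return prompt_to_image_b64(text_to_image_prompt(text, post), post)
```

The `post` parameter exists only so the two network calls can be swapped for a stub when trying the flow without a GPU. In a real player both calls would run in a background thread so audio playback is not blocked while the image is generated.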

COMMENTS • 33

  • @222inverter 1 month ago

    Wow... looks really cool... keep up your great work...

  • @chama6775 1 month ago +2

    Wow, amazing, great work! Hope you get more visibility 👌 Another step toward unlimited generated entertainment... But awesome work

  • @kaizen9554 1 month ago

    Great project. Hopefully we can get our hands on it when it's shared. Cheers 🥂

    • @roundycreations 1 month ago

      I will try to push an early version to GitHub ASAP with a simplified installation, cheers!

  • @SpencerThayer 1 month ago

    An impressive first step.

  • @playthisnote 1 month ago +3

    Nice, similar to what I made in Python.

    • @roundycreations 1 month ago +1

      It's in Python. I will probably rewrite it in the Unity engine with C# for more flexibility.

  • @joshuaam7701 1 month ago

    Amazing workflow. The images didn't always match, but it has amazing potential.

    • @roundycreations 1 month ago

      Yes, the prompt generator is not aware of the context of the whole book, which would be the next challenge. Consistency and a perfect match to the current situation in the story would be an even harder challenge :)
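
One cheap partial step toward context awareness is to hand the prompt generator the last few transcribed passages along with the current one. The sketch below is only an illustration of that idea, not something the author describes. The class name, the window size and the instruction wording are all hypothetical, and a short window is still far from whole-book awareness.

```python
from collections import deque


class RollingContext:
    """Keeps the last few transcribed passages as short-term story context."""

    def __init__(self, max_segments: int = 3):
        # A deque with maxlen silently drops the oldest passage when full.
        self._recent = deque(maxlen=max_segments)

    def build_prompt(self, segment: str) -> str:
        """Return an LLM instruction for `segment`, then remember the segment."""
        lines = []
        if self._recent:
            lines.append("Previously in the story:")
            lines.extend(f"- {s}" for s in self._recent)
        lines.append("Current passage:")
        lines.append(segment)
        lines.append("Write one short image prompt for the current passage.")
        self._recent.append(segment)
        return "\n".join(lines)
```

The returned string would replace the bare passage sent to the local LLM. A larger window gives more context but also a longer prompt, which adds to the generation time the description already calls too long.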

  • @BeAsYouAre108 1 month ago +3

    Wow. Please create a tutorial on how to do this.

    • @roundycreations 1 month ago +6

      I will improve it a bit and upload it to GitHub some day. For now it's just bare-bones.

  • @johnstarfire 1 month ago

    This is brilliant and shows how to combine several AIs to do things, and this will be the future. Think of when it becomes possible to generate videos with consistent characters: it would generate movies from books. Maybe we'll need the power of quantum computers, but we are seeing where we are going.

    • @roundycreations 1 month ago

      And a couple bazillion GBs of VRAM would be handy as well

  • @CrudelyMade 1 month ago

    I wonder if a top model like Grok might be able to read through and generate prompts per paragraph while keeping the whole book concept in mind. Those prompts might then be commented out, so they are not read aloud but are still processed by the image engine. "On the fly" would require a very fast LLM, but if the graphic book can be pre-developed by the engines, this is a visual story book generator that could do some pretty intense stuff. Like... if the LLM pre-read the book, created prompts for all the characters, and stored sample images of each character for reference throughout the book... I think you have most of the cogs. It's just very impressive to think where this can end up in a year.

    • @roundycreations 1 month ago

      Yeah, I'm working on it :) GPT-4o mini was released with a very cheap API, so I'm probably going to use it to manage the concept of the book, style, consistency, plot and character development; it will then pass that data to the local LLM for prompt generation. It wouldn't be fully local then, but the current setup already maxes out the GPU, so not much can be added to the workflow in terms of local computation. Optionally, 4o mini can also do the prompt generation. That would save about 7-8 seconds of GPU work, and I could use the spare power for SD with TurboXL models plus some ControlNets, AnimateDiff or some picture interpolation... a rabbit hole in general :) In case you haven't noticed, it doesn't do TTS: I load an audiobook that was generated beforehand, so I also need to transcribe it on the fly.
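
The split discussed in this thread, book-level notes prepared ahead of time and then handed to the per-passage prompt step, might be sketched as below. Everything here is hypothetical: the `BookBible` structure, the simple name matching and the example descriptions are illustrations of the idea, not the author's design. The notes could be written by hand or produced by a larger model during a pre-read.

```python
from dataclasses import dataclass, field


@dataclass
class BookBible:
    """Book-level notes prepared before playback starts."""
    style: str
    # Character name -> fixed visual description reused in every prompt.
    characters: dict[str, str] = field(default_factory=dict)


def enrich_prompt(scene_prompt: str, passage: str, bible: BookBible) -> str:
    """Append the stored look of each character named in the passage, then the style."""
    parts = [scene_prompt]
    lowered = passage.lower()
    for name, look in bible.characters.items():
        # Plain substring match; real text would need aliases and pronoun handling.
        if name.lower() in lowered:
            parts.append(look)
    parts.append(bible.style)
    return ", ".join(parts)


# Hypothetical notes for a Brothers Grimm story.
bible = BookBible(
    style="watercolor illustration",
    characters={
        "Hansel": "boy with a green cap",
        "Gretel": "girl with a red scarf",
    },
)
print(enrich_prompt("children in a forest", "Gretel looked back.", bible))
```

Repeating the same description for a character keeps the text side of the prompt stable. It does not by itself guarantee that the generated faces match from image to image.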

    • @CrudelyMade 1 month ago

      @roundycreations well, your efforts are appreciated. :-) I would love to see the same story illustrated in pixel art or anime style, especially if they're european stories like the brothers grimm stuff.
      once that works decently, will be nice to have LLM agents code games based on the stories with the different graphic styles, leading to different gameplay concepts based on the same basic story.
      like.. hansel and gretel would be very different games if they were pixel style or action anime style. :-) my brain is years in the future enjoying things that might never be made. :-D

    • @roundycreations 1 month ago

      @CrudelyMade "I would love to see the same story illustrated in pixel art or anime style, especially if they're european stories like the brothers grimm stuff." It's just a matter of which SD model is used, so that shouldn't be a big deal, I'd say. The challenge is the consistency, and translating the context of the book into correct prompts for each paragraph.
      "will be nice to have LLM agents code games based on the stories with the different graphic styles, leading to different gameplay concepts based on the same basic story" I think we need to wait a little bit more. For now an LLM can maybe code Flappy Bird without errors lol.
      But looking at the speed of everything now, we might wake up one day and it will be there.

    • @CrudelyMade 1 month ago

      @roundycreations it'll be there because people like you are making the building blocks. ;-) I work in tech, I know we're years away. and it's fascinating to see early development of concepts that'll end up in much greater things.
      then I can say, "I used 8 inch floppy disks!" and "I remember when the guy first automated decent on the fly image generation for stories!"
      your efforts are also great examples of how things can work together, and these concepts can often be applied to other projects, as it's easier to see outside the box when you watch someone outside the box. :-)
      "one day... we'll have a box so big, the whole universe will be inside of it.. and then we'll climb out of the box."

    • @roundycreations 1 month ago +1

      I don't really make these blocks, people way smarter than me do ;) but you don't need to know how a Lego brick is made to build a Lego castle, I guess

  • @aimademerich 1 month ago

    Phenomenal

  • @therobotocracy 1 month ago

    Nice!

  • @ziad_jkhan 1 month ago

    Hopefully, it's an open-source project

    • @roundycreations 1 month ago

      Yes, but I haven't released the source yet, as it's a big mess for now

    • @ziad_jkhan 1 month ago

      @roundycreations Well, that could actually be a reason to open-source it and ask others to help clean up the code if they find it useful

  • @therobotocracy 1 month ago

    Do you have a discord or something like that?