Image Captioning using CNN and RNN | Image Captioning using deep learning

  • Published 1 Oct 2024
  • In this video, I have explained how to perform Image Captioning using CNN-RNN Architectures.
    GitHub: github.com/Aar...
    Email id: aarohisingla1987@gmail.com
    There are two main components: a CNN encoder and an RNN decoder; combining these two architectures gives a powerful model for image captioning (a minimal sketch follows this description).
    The dataset used in this tutorial is the COCO dataset.
    #nlp #computervision #deeplearning
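
    For readers who want to see the idea in code: a minimal sketch of a CNN-encoder/RNN-decoder captioner in PyTorch. The class names, the ResNet-50 backbone, and the hyperparameters are illustrative assumptions, not necessarily the exact code from the video or repo.

      # Minimal CNN-encoder / LSTM-decoder sketch for image captioning (PyTorch).
      # All names and sizes are illustrative, not the video's exact code.
      import torch
      import torch.nn as nn
      import torchvision.models as models

      class EncoderCNN(nn.Module):
          def __init__(self, embed_size):
              super().__init__()
              resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
              self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the FC head
              self.fc = nn.Linear(resnet.fc.in_features, embed_size)

          def forward(self, images):
              with torch.no_grad():                  # keep the pretrained backbone frozen
                  feats = self.backbone(images)      # (B, 2048, 1, 1)
              return self.fc(feats.flatten(1))       # (B, embed_size)

      class DecoderRNN(nn.Module):
          def __init__(self, embed_size, hidden_size, vocab_size):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, embed_size)
              self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
              self.fc = nn.Linear(hidden_size, vocab_size)

          def forward(self, features, captions):
              # Teacher forcing: the image feature acts as the first input "token".
              emb = self.embed(captions[:, :-1])                   # (B, T-1, E)
              inputs = torch.cat([features.unsqueeze(1), emb], 1)  # (B, T, E)
              out, _ = self.lstm(inputs)
              return self.fc(out)                                  # (B, T, vocab_size)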

COMMENTS • 33

  • @nakulmali1413
    @nakulmali1413 4 months ago +1

    Ma'am, first of all, thanks for all your valuable videos. Please upload a video on image classification using YOLOv5 on a custom dataset in Google Colab; I am facing issues with it.

    • @CodeWithAarohi
      @CodeWithAarohi  4 months ago

      I have a video on YOLOv5 image classification. You can use that code; it will work on Colab as well. The only thing you need to change is the paths, wherever required: ua-cam.com/video/PwIQc06gnCI/v-deo.html

  • @vikramvikky3830
    @vikramvikky3830 1 month ago

    Hello ma'am, can you please make a video on installing and setting up CUDA? I have tried 10 times, but it's not working. Please help me with this. 😢

    • @CodeWithAarohi
      @CodeWithAarohi  1 month ago

      1- Download the CUDA Toolkit installer: developer.nvidia.com/cuda-downloads
      2- Run the installer.
      3- After installation, CUDA's binaries are located in the bin directory, e.g., C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin.
      4- Verify the installation: open a Command Prompt and run nvcc --version to check that it was successful.
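
      If the toolkit installs but a deep learning framework still cannot see the GPU, a quick sanity check from Python can help. A minimal sketch, assuming a CUDA-enabled PyTorch build is installed:

        # Quick check that PyTorch can see CUDA (assumes a CUDA-enabled PyTorch build).
        import torch

        print(torch.__version__)            # PyTorch version
        print(torch.version.cuda)           # CUDA version PyTorch was built against
        print(torch.cuda.is_available())    # True if the GPU is usable
        if torch.cuda.is_available():
            print(torch.cuda.get_device_name(0))  # name of the installed NVIDIA GPU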

  • @angelospapadopoulos7679
    @angelospapadopoulos7679 4 months ago +3

    Amazing once again. Would love to see video captioning now!!!

    • @CodeWithAarohi
      @CodeWithAarohi  4 months ago +1

      Will do soon

    • @satvik4225
      @satvik4225 3 months ago

      @@CodeWithAarohi wheeeeennnnnnnn

    • @satvik4225
      @satvik4225 3 months ago

      Can you use transformers for captioning?
      How does ChatGPT understand images so well? Does it use transformers for this?
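
      Transformers are indeed widely used for captioning today (this video covers the CNN-RNN approach, and ChatGPT's internals are not public, though it is generally described as a multimodal transformer). For comparison, a minimal sketch using a pretrained BLIP model from Hugging Face, assuming the transformers and Pillow packages are installed:

        # Hedged sketch: transformer-based captioning with a pretrained BLIP model.
        # Not part of this video's code; assumes `transformers` and `Pillow` are installed.
        from PIL import Image
        from transformers import BlipProcessor, BlipForConditionalGeneration

        processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
        model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

        image = Image.open("example.jpg").convert("RGB")   # any local image
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        print(processor.decode(out[0], skip_special_tokens=True))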

  • @LeeYeon-e9x
    @LeeYeon-e9x 2 months ago

    I have a question: do we feed the feature vector only at timestep 0, not the <start> token?
    Then, from timestep 1 onward, do we feed the actual ground truth, i.e., the GT token from the previous timestep?
    And at test time, do we also feed only the feature vector at timestep 0, so the first token is generated at timestep 0?
    So we do not need to feed the <start> token in the test phase?
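
    In the variant where the image feature is the very first LSTM input (as in the sketch after the video description), the step-0 output is already the first word, so no <start> token is needed at inference time. A minimal greedy-decoding sketch under that assumption, reusing the illustrative DecoderRNN names from above (the vocab dict is hypothetical):

      # Greedy decoding when the image feature is fed at step 0 (illustrative names).
      import torch

      @torch.no_grad()
      def greedy_caption(encoder, decoder, image, vocab, max_len=20):
          feat = encoder(image.unsqueeze(0))           # (1, E) image feature
          inputs, states = feat.unsqueeze(1), None     # step 0: feed the feature itself
          tokens = []
          for _ in range(max_len):
              out, states = decoder.lstm(inputs, states)
              logits = decoder.fc(out.squeeze(1))      # (1, vocab_size)
              tok = logits.argmax(-1)                  # greedy pick
              if tok.item() == vocab["<end>"]:         # stop once <end> is generated
                  break
              tokens.append(tok.item())
              inputs = decoder.embed(tok).unsqueeze(1)  # feed the prediction back in
          return tokens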

  • @krishanlakhotia814
    @krishanlakhotia814 11 days ago

    Is it a good project for CV?

  • @alexbosneanu1979
    @alexbosneanu1979 5 months ago

    Hi Aarohi,
    I truly appreciate the videos you create; they cover very useful topics and are easy to understand thanks to your explanations 😃👍
    I have a question related to a project I've been considering:
    I possess a substantial collection of images and video clips from my garden, and I'd like to organize them in a way that allows me to track various elements such as trees, flowers, and other permanent features.
    My goal is to build a database with keywords for each image and video to keep everything organized. This will make it easier to filter and retrieve specific items (files) for various projects later on.
    By creating a model trained on labeled images, I hope to automatically assign the appropriate keywords to new additions in my collection. Is this possible?
    It would be fantastic if you could create some tutorials around this concept. 🙏
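
    What this comment describes is multi-label image tagging, and it is certainly possible. A minimal sketch of the inference side, assuming a classifier has already been fine-tuned on labeled garden photos (the label list, weight file, and threshold are all hypothetical placeholders):

      # Hedged sketch: auto-assigning keywords with a fine-tuned multi-label classifier.
      # The labels, weight file, and threshold below are hypothetical placeholders.
      import torch
      import torch.nn as nn
      import torchvision.models as models
      import torchvision.transforms as T
      from PIL import Image

      LABELS = ["tree", "flower", "pond", "fence"]            # your own keyword vocabulary
      model = models.resnet18()
      model.fc = nn.Linear(model.fc.in_features, len(LABELS))
      model.load_state_dict(torch.load("garden_tagger.pt"))   # weights you trained earlier
      model.eval()

      preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

      def keywords_for(path, threshold=0.5):
          x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
          with torch.no_grad():
              probs = torch.sigmoid(model(x))[0]              # one probability per label
          return [lbl for lbl, p in zip(LABELS, probs) if p > threshold]

      print(keywords_for("new_photo.jpg"))                    # e.g., ['tree', 'flower']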

  • @sanathspai3210
    @sanathspai3210 4 months ago

    Hi Aarohi,
    Could you please share the dataset link? I am confused about the labels for the images; it is not clearly explained in the video.

  • @Sunil-ez1hx
    @Sunil-ez1hx 4 months ago

    Superb content ma’am. Your videos are amazing. Thank you very much 👍👍✌️✌️👏👏👏

  • @arnavthakur5409
    @arnavthakur5409 5 months ago

    Amazing video. Such an incredible way of explaining. Thank you so much, ma'am.

  • @malikfahadsarwar2281
    @malikfahadsarwar2281 2 months ago

    Do we have to pad all captions to the same length so that the LSTM can process them in batches? Kindly explain if you can.

    • @CodeWithAarohi
      @CodeWithAarohi  2 months ago

      Yes, padding is necessary to ensure that all captions have the same length, facilitating efficient batch processing and consistent input dimensions for the LSTM. By using padding and masking techniques, you can handle sequences of varying lengths without compromising the model's performance.
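
      A minimal sketch of that padding-and-masking pattern in PyTorch (the token ids are illustrative; pad_sequence pads the batch and CrossEntropyLoss's ignore_index drops pad positions from the loss):

        # Pad variable-length captions, then mask the padding out of the loss.
        import torch
        import torch.nn as nn
        from torch.nn.utils.rnn import pad_sequence

        PAD_IDX = 0
        captions = [torch.tensor([4, 9, 7, 3]),    # illustrative token ids, <end> = 3
                    torch.tensor([4, 5, 3])]
        batch = pad_sequence(captions, batch_first=True, padding_value=PAD_IDX)
        # batch -> [[4, 9, 7, 3],
        #           [4, 5, 3, 0]]   (second caption padded to length 4)

        criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
        vocab_size = 12
        logits = torch.randn(batch.numel(), vocab_size)   # stand-in for model outputs
        loss = criterion(logits, batch.reshape(-1))       # pad positions add no loss

      Because ignore_index skips every position whose target is PAD_IDX, predictions aligned with GT padding (everything after the GT's <end>) contribute nothing to the loss, which is the behavior asked about in the follow-up below.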

    • @malikfahadsarwar2281
      @malikfahadsarwar2281 2 months ago

      @@CodeWithAarohi
      Since we are working with batches, some captions might produce the <end> token before others in the same batch, and due to batch processing we might keep generating tokens for those captions even after <end> has been produced. To handle this, do we do the following: use the ground truth (GT) tokens up to the <end> token and compare them with the corresponding predicted tokens, so that when <end> appears in the GT we compute the loss only up to that point between the GT and the predicted caption (PC)? Since pad tokens follow the GT's <end>, the ignore index would skip the GT pad positions and the corresponding PC tokens, so even if the predicted caption continues beyond <end>, those extra tokens are ignored. Is my understanding correct, and does it match the general approach? Kindly explain if you can.

  • @pinocchio200
    @pinocchio200 4 months ago

    Great! Can you make something similar for use in an Android app?

  • @soravsingla8782
    @soravsingla8782 5 months ago

    Great work👏👏👏

  • @kishand.r8765
    @kishand.r8765 4 months ago

    Please provide the code for the model.

    • @CodeWithAarohi
      @CodeWithAarohi  4 months ago

      github.com/AarohiSingla/Image-Captioning/blob/main/Training.ipynb

  • @RajKumar-bl8ox
    @RajKumar-bl8ox 5 months ago

    Thank you.

  • @nunoalexandre6408
    @nunoalexandre6408 5 months ago

    Love it!!!!!!!!!!!!

  • @pifordtechnologiespvtltd5698
    @pifordtechnologiespvtltd5698 5 months ago

    Really nice