What is TF-IDF for Beginners (Topic Modeling in Python for DH 02.01)

  • Published 24 Jan 2025

COMMENTS • 31

  • @waelhussein4606
    @waelhussein4606 1 month ago

    Great videos, thanks! It’s important to understand that IDF reduces the weight of common words that frequently appear in most documents within the corpus, as these words contribute little to document classification. Conversely, it highlights less common words, making them more important for distinguishing the documents in which they appear.
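To make the point about common versus rare words concrete, here is a minimal sketch. It assumes the plain textbook definition idf(t) = ln(N / df(t)) and a made-up three-sentence corpus; neither is taken from the video, and libraries often use a smoothed variant instead.

```python
import math

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
# A set per document is enough: IDF only cares whether a term occurs at all.
tokenized = [set(doc.split()) for doc in corpus]


def idf(term):
    """Textbook inverse document frequency: ln(N / number of docs containing term)."""
    df = sum(term in doc for doc in tokenized)
    return math.log(len(tokenized) / df)


idf_the = idf("the")    # appears in every document
idf_cat = idf("cat")    # appears in two of three documents
idf_bird = idf("bird")  # appears in only one document

for term, value in [("the", idf_the), ("cat", idf_cat), ("bird", idf_bird)]:
    print(f"{term}: {value:.3f}")
```

Under this definition a word found in every document gets an IDF of zero, and the weight grows as the word becomes rarer across the corpus.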

  • @olucasharp
    @olucasharp 1 year ago +3

    What a treasure this is! ⚡Many thanks!
    So interesting and I've even managed to use some of the ideas at work already 😀

  • @pritamsarkar2075
    @pritamsarkar2075 2 years ago

    this channel is a beauty

  • @amirrahimi6979
    @amirrahimi6979 4 years ago +3

    This is really useful. Thank you.

  • @nazmusas
    @nazmusas 8 months ago

    You are the best. You are so cool.

  • @oliviern.2095
    @oliviern.2095 2 years ago +1

    very clear sir

  • @mehmetkaya4330
    @mehmetkaya4330 2 years ago +1

    Thank you so much!!

  • @stevedavis3813
    @stevedavis3813 4 years ago +1

    This is great! A++

  • @olucasharp
    @olucasharp 1 year ago +1

    Hi, I have a question, if I may (I'm still trying to figure out the best ways to use all these different methods). I collected data on requirements for data analytics roles in finance from a job postings website and wanted to get a sense of which requirements (skills, knowledge) are most in demand. I'm now exploring all the methods you explain using this corpus, but it seems that for summarizing a bunch of similar job requirement descriptions it's probably better to use something like keyword extraction (mostly trigrams). So would KeyBERT be your choice?
    Sorry for the long question )

    • @python-programming
      @python-programming  1 year ago

      I think KeyBERT may be a great option. Out of the box, it will do a lot. It really depends on the data, though. No two corpora are exactly the same. It will require a bit of experimentation.

    • @olucasharp
      @olucasharp 1 year ago

      @@python-programming huge thanks for your comment, indeed, from the result I get I can better understand where to go further) Looking forward to hearing more from you on this channel on these existing topics and the ways to use them in different contexts. Thanks!

  • @КристинаДолганова-к6т
    @КристинаДолганова-к6т 3 months ago +1

    Hi. Thank you for your video. Have you compared TF-IDF that you calculated with the one that Python gives? I use Google Colab
    When I've calculated I had 0 for "on". TF-IDF = 1/7 * lg (2/2) = 0
    But Python gives 0.3
    from sklearn.feature_extraction.text import TfidfVectorizer
    documents = ["The cat is laying on the carpet", "The carpet is on the floor"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    print("tokens:", feature_names)
    print("matrix:")
    print(X.toarray().round(2))
    Output:
    tokens: ['carpet' 'cat' 'floor' 'is' 'laying' 'on' 'the']
    matrix:
    [[0.3  0.42 0.   0.3  0.42 0.3  0.6 ]
     [0.33 0.   0.47 0.33 0.   0.33 0.67]]
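The gap between the hand calculation and the library output above most likely comes from the formula, not from an error. With default settings, scikit-learn's TfidfVectorizer uses raw term counts for TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then L2-normalises each document row, so a word that appears in every document still gets a non-zero weight. The sketch below reproduces both numbers with the standard library only; the function names are illustrative, and whitespace splitting is assumed to match the vectorizer's tokenisation for these two sentences.

```python
import math
from collections import Counter

documents = ["The cat is laying on the carpet", "The carpet is on the floor"]
# Assumption: lowercasing plus whitespace splitting gives the same tokens
# as the vectorizer's default tokeniser for these particular sentences.
tokenized = [doc.lower().split() for doc in documents]
n_docs = len(tokenized)


def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in doc for doc in tokenized)


def textbook_tfidf(term, tokens):
    """Relative term frequency times log10(N / df), as in the hand calculation."""
    tf = tokens.count(term) / len(tokens)
    return tf * math.log10(n_docs / document_frequency(term))


def sklearn_style_tfidf(tokens):
    """Raw counts times smoothed IDF, followed by L2 normalisation of the row."""
    weights = {}
    for term, count in Counter(tokens).items():
        idf = math.log((1 + n_docs) / (1 + document_frequency(term))) + 1
        weights[term] = count * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


hand_value = textbook_tfidf("on", tokenized[0])
library_like = sklearn_style_tfidf(tokenized[0])

print(hand_value)                     # textbook score for "on" in document 1
print(round(library_like["on"], 2))   # smoothed, normalised score for "on"
```

Both results are valid TF-IDF scores under different conventions, so values from different tools should only be compared when the same variant is used.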

  • @ry2743
    @ry2743 10 months ago

    If I have tweets, is this the best method to use for them?

  • @ANUbhav918
    @ANUbhav918 3 years ago +1

    Good

  • @feroncia
    @feroncia 2 years ago +1

    If we only have one document that compiles all our text, will TF-IDF be useful?

    • @python-programming
      @python-programming  2 years ago

      Yea it can still tell you the most common words within that document, but for that I would use KeyBERT
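The single-document case can be sketched as follows. With only one document, every term has a document frequency of 1, so the IDF factor is the same constant for all terms and the TF-IDF ranking collapses to a plain word-count ranking. The example sentence is made up, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is the scikit-learn default, used here as an assumption about which variant is in play.

```python
import math
from collections import Counter

# Hypothetical single document, invented for illustration only.
document = "data analysis needs data and more data analysis"
counts = Counter(document.split())

n_docs = 1
df = 1  # every term occurs in the one and only document
smoothed_idf = math.log((1 + n_docs) / (1 + df)) + 1  # same constant for every term

# Multiplying every count by the same constant cannot change the ordering.
scores = {term: count * smoothed_idf for term, count in counts.items()}
ranking_by_tfidf = sorted(scores, key=scores.get, reverse=True)
ranking_by_count = [term for term, _ in counts.most_common()]

print(ranking_by_tfidf)
print(ranking_by_count)
```

Splitting the text into several smaller documents (paragraphs, chapters, postings) is one way to give the IDF factor something to discriminate on.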

  • @dwisetyoaji5007
    @dwisetyoaji5007 3 years ago

    Sir, how can I access the website? I want to read some more of it. Thanks.

  • @ayanjain3106
    @ayanjain3106 3 years ago +1

    Wouldn't the IDF score be the same for all documents? Why do we need to multiply it by the TF score every time if we just want comparisons?

    • @python-programming
      @python-programming  3 years ago +1

      Great question. Not all docs in a corpus will contain a given word. The IDF places a proportional assessment on that word, weighing its density in a single document against all relevant docs in the corpus. If you compared TF alone, you would not get a sense of the doc's larger place in the corpus.

    • @ayanjain3106
      @ayanjain3106 3 years ago +1

      @@python-programming Got it! Thank You!

    • @python-programming
      @python-programming  3 years ago +1

      @@ayanjain3106 no problem!

    • @ANUbhav918
      @ANUbhav918 3 years ago +1

      You can say that you are comparing after normalizing

    • @khadimhussainmalik3284
      @khadimhussainmalik3284 1 year ago

      The corpus may contain various types of documents, such as newspapers, which will enable us to understand the extent to which the term varies across different kinds of documents.
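The exchange above can be illustrated with a small sketch. For a given term the IDF is indeed one corpus-wide value, but it differs between terms, while TF differs between documents; the product is what lets you compare different words on a common scale. The three short documents and the textbook formulas (relative frequency for TF, ln(N / df) for IDF) are assumptions made for illustration.

```python
import math

# Hypothetical toy corpus, invented for illustration only.
corpus = {
    "doc_a": "whale whale whale ship sea",
    "doc_b": "whale ship ship sea sea",
    "doc_c": "garden tea tea sea",
}
tokenized = {name: text.split() for name, text in corpus.items()}
n_docs = len(tokenized)


def idf(term):
    """One value per term for the whole corpus: ln(N / df)."""
    df = sum(term in tokens for tokens in tokenized.values())
    return math.log(n_docs / df)


def tf(term, tokens):
    """One value per term per document: relative frequency."""
    return tokens.count(term) / len(tokens)


def tfidf_by_document(term):
    weight = idf(term)  # computed once, reused for every document
    return {name: tf(term, tokens) * weight for name, tokens in tokenized.items()}


whale_scores = tfidf_by_document("whale")  # in two of three documents
sea_scores = tfidf_by_document("sea")      # in every document

print(whale_scores)
print(sea_scores)
```

Judged by TF alone, "sea" looks as prominent in doc_b as "ship" does; the IDF factor is what marks it as uninformative because it occurs everywhere.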

  • @mehmetkaya4330
    @mehmetkaya4330 2 years ago +1

    Thank you so much!!