An interesting and great presentation. Thanks for sharing.
Glad you enjoyed it! 😊
I think tokenization should be performed before embedding, but in your video the BERTopic diagram shows embedding as the first step. That's confusing to me.
BERTopic starts with embeddings, which create numerical representations of the documents (by default using a sentence-embedding model). It then reduces the dimensionality of these representations and clusters the result to identify topics. Only after clustering does it tokenize the text to create a document-term matrix. Finally, it uses a class-based TF-IDF to identify the most representative words for each topic.
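To make the order concrete, here's roughly what that pipeline looks like when you wire up the components explicitly (a minimal sketch; the component choices mirror BERTopic's documented defaults, and the 20 Newsgroups data is just a stand-in corpus):

```python
# Minimal sketch of the BERTopic pipeline, components wired explicitly
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic(
    # 1. Embed documents into dense vectors
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
    # 2. Reduce the dimensionality of those vectors
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
    # 3. Cluster the reduced vectors into topics
    hdbscan_model=HDBSCAN(min_cluster_size=10, prediction_data=True),
    # 4. Tokenize only now, per cluster, to build the document-term matrix
    vectorizer_model=CountVectorizer(stop_words="english"),
    # 5. Class-based TF-IDF then scores the representative words (built in)
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```

So tokenization does happen, just after clustering rather than before: the clusters are formed on the embeddings, and the tokens are only used to describe each cluster.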
@windowviews150 Thanks for the explanation.
Excellent content. Just what I was looking for! Any tips on how to optimize the topic modeling process using GPT models from OpenAI?
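One common option is to keep the clustering pipeline as-is and let a GPT model write the topic labels via BERTopic's OpenAI representation. A minimal sketch (assumes the openai package, an OPENAI_API_KEY in your environment, and a recent BERTopic release; the model name is illustrative):

```python
# Sketch: GPT-generated topic labels via BERTopic's OpenAI representation
import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:1000]

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True)

# Embedding, reduction, and clustering are unchanged;
# only the topic descriptions are generated by the GPT model
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```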
Hello! I'm a researcher at Politecnico di Milano and the University of South Australia. I'm trying to do the same thing; maybe we can have a chat!
@giacomocassano1439 Sure, how can I reach you?
Could you give an example of how to merge this topic modeling output with our original dataset for further analysis and report creation?
Hey Josh, did you find a way to do this?
Topic -1 is the outlier topic (documents the clusterer couldn't assign) and should usually be ignored.
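In case it helps anyone with the same question, a minimal sketch of joining the topics back onto the source data (assumes pandas and a DataFrame with a "text" column; all the column names here are illustrative):

```python
# Sketch: merge BERTopic output back into the original dataset
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

df = pd.DataFrame({"text": fetch_20newsgroups(subset="all").data[:1000]})

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(df["text"].tolist())

# Each document's topic id lines up with the original row order
df["topic"] = topics

# Add readable topic names/keywords for reporting
doc_info = topic_model.get_document_info(df["text"].tolist())
df["topic_name"] = doc_info["Name"].values

# Drop topic -1 (the outlier bucket) before building reports
report_df = df[df["topic"] != -1]
print(report_df["topic_name"].value_counts().head())
```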