The Best Way to do Topic Modeling in Python - Top2Vec Introduction and Tutorial

Python Tutorials for Digital Humanities

Published May 8, 2022
Join this channel to get access to perks:
/ @python-programming
If you enjoy this video, please subscribe.
✅Be my Patron: / wjbmattingly
✅PayPal: www.paypal.com/cgi-bin/webscr...
If there's a specific video you would like to see or a tutorial series, let me know in the comments and I will try and make it.
If you liked this video, check out www.PythonHumanities.com, where I have Coding Exercises, Lessons, on-site Python shells where you can experiment with code, and a text version of the material discussed here.
You can follow me at:
/ wjb_mattingly
Science & Technology

COMMENTS • 96

@justinhuang8034 2 years ago ⁺⁹
You're killing it lately with these videos. Keep up the great work.
@python-programming 2 years ago ⁺⁴
Thanks! That is great to hear. I am trying out a new style this month to see if subscribers like it.
@jesusmtz29 2 years ago
@@python-programming New subscriber here. Love your style of presentation.
@dankchan420 2 years ago ⁺⁸
I am a new subscriber and this .. was .. simply .. great! I wish there were more Top2Vec videos (ranging from beginner to advanced). Keep up the excellent work. *hint* *hint* 🙂
@python-programming 2 years ago ⁺²
Thanks! Great to hear! I will be making more in the future.
@Kylbigel 1 year ago ⁺¹
Exactly what I needed, thank you!
@Adrian_Marmy 1 year ago ⁺²
Dude, this video is awesome. Breaking things down seems to be your superpower... 👌
@python-programming 1 year ago ⁺¹
Thanks so much! I always wanted a superpower. Since this video came out, I think BERTopic is a bit better. It has more features and is a bit more accessible to beginners now too. It also has a thriving community.
@Adrian_Marmy 1 year ago
@@python-programming Wow, awesome of you to comment this. I will have a look at it :-)
@TheAbdallahk 1 year ago ⁺¹
Wow, this is amazing. Thank you so much!
@python-programming 1 year ago ⁺¹
No problem!! So happy to hear it is useful!
@SonnyGeorgeVlogs 2 years ago ⁺¹
Great video. Glad to have stumbled on it.
@python-programming 2 years ago
Thanks!
@sjoerdbraaksma9358 1 year ago ⁺¹
This is such a great find! What I am wondering is: can you train a BERT sentence transformer on a large set of documents spanning several projects, then have top2vec use these embeddings to make a topic model for each project (so basically, for each subset of the larger corpus)?
@abasisadegh 5 months ago ⁺¹
Thank you very much for this video, man. Is there a way to use pyLDAvis visualizations with top2vec?
@rush19772112 2 years ago ⁺¹
Dr Mattingly, I wish your channel the best, and CONGRATULATIONS.
You've been a GREAT help with your videos in understanding Pandas. Topic modeling is an area of INTEREST to me, especially everything related to the social sciences, and especially LDA. Looking forward to seeing your video tutorial.
Needless to say how grateful I am to you, because you HELPED ME to UNDERSTAND by showing step-by-step Pandas tutorials. If you could do the same with the Latent Dirichlet Allocation algorithm, that would be marvelous. Even though you do have some code already written in a past video tutorial, I am still not quite there in how to apply it in a project with texts in languages other than English, such as Greek or Hebrew.
Looking forward to seeing your video tutorial.
Kind wishes,
Christos Bardas
@python-programming 2 years ago
Thank you so much for your very kind comment! It means a lot to me. I will see if I can put together a video for topic modeling with non-English texts. Would Latin be alright? I don't have Greek or Hebrew unfortunately.
@rush19772112 2 years ago
@@python-programming Any non-English language would be OK. What I can't work out in the video about LDA is how to transform data. For instance, how to make a dataset of texts (from historical data) in PDF format, tokenize the words, and take all the necessary steps to run the LDA algorithm. But please build everything from scratch as you did in the Pandas series. Your videos in the Pandas series have proved an inspiration for me, so I'm truly grateful, Dr Mattingly!
I'm looking forward to an LDA one as well!
Kind regards,
Christos Bardas
@juanmanuelaguiar3368 1 year ago
Great video, very clear!
Do you know how Top2Vec deals with outliers? There is no 'outlier topic' at the end and all the documents seem to be assigned a topic. (I have BERTopic in mind, where there is a -1 topic with the outliers.)
@AlexAlexanderIII 1 year ago ⁺¹
Great video.
@python-programming 1 year ago
Thanks!!
@BispensGipsGebis 1 year ago ⁺¹
You, my Sir, are awesome.
@python-programming 1 year ago
Thanks!
@sarasharick5209 2 years ago ⁺¹
I just started my first data science role and there’s a project coming up with a topic modeling aspect to it. Looking forward to this video.
@python-programming 2 years ago
Awesome! Glad to hear it. I hope it helps out a lot.
@cuneyttyler4922 1 year ago
Nice video. But when I listed the words for each topic, it showed stop words only. Isn't it supposed to remove them in the preprocessing stage?
@RedCloudServices 9 months ago
How do you filter stop words, and how does this compare to BERTopic?
@fetchthebattleaxe 2 years ago ⁺²
Great video! Do you know if top2vec has options for when you have a dataset too large to fit into RAM? I have a dataset that is something like 9 GB of text that I've been trying to topic model with different methods, so I'd be curious to try this out. But I probably can't just load the whole thing into a list and pass it in.
@python-programming 2 years ago ⁺¹
Thanks! Great question. I have not personally tried it with a dataset that large just yet. What are your computer's specs? Do you have a CUDA-accelerated GPU?
@fetchthebattleaxe 2 years ago
@@python-programming
CPU: AMD Ryzen 7 3700X 8 core
RAM: 16 GB
GPU: RTX 2070 Super
The GPU does have CUDA installed and I've used it for deep learning a bit. But the GPU itself only has 8 GB of VRAM and I've run into CUDA memory issues before. Though admittedly I have no idea how memory needs are shared between CPU and GPU.
Either way, I'll probably try this library on a random slice of the full data to see if it shows promise. Thanks for drawing my attention to it!
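A minimal sketch of the "random slice" idea from the comment above, assuming the corpus is stored as one document per line (the thread does not say how the 9 GB is laid out). It uses reservoir sampling, so the file is streamed once and only `k` documents are ever held in memory. The function name `sample_documents` and the file layout are illustrative assumptions, not part of top2vec.

```python
import random
from typing import Iterable, List


def sample_documents(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Reservoir-sample up to k non-empty lines from a stream of lines."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    seen = 0  # number of non-empty documents read so far
    for line in lines:
        doc = line.strip()
        if not doc:
            continue  # skip blank lines
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k documents.
            reservoir.append(doc)
        else:
            # Keep the new document with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = doc
    return reservoir
```

With a hypothetical file `corpus.txt`, usage would look like `with open("corpus.txt", encoding="utf-8") as f: docs = sample_documents(f, 100_000)`, and `docs` is then a plain list of strings small enough to pass to a topic model.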
@patrykkoakowski4357 1 year ago
How did you force the code to run on CPU?
@tonyberber 1 year ago
I'm getting this error:
from top2vec import Top2Vec
ImportError: cannot import name 'Top2Vec' from partially initialized module 'top2vec' (most likely due to a circular import)
Any ideas are appreciated.
I'm using an M1 Mac.
@prabhacar 2 years ago ⁺¹
Brilliant stuff! Thanks! Just a small comment: I am quite visual and I learn better with pictures. In future, if possible, please include some visualizations of the topic modeling.
@python-programming 2 years ago
Great idea! Thanks for the feedback!
@amrmoursi7303 1 year ago
Thanks, I wish your channel the best, and CONGRATULATIONS.
How can we evaluate a topic modeling algorithm like top2vec or BERTopic?
Thanks in advance.
@JayShankarpure 2 years ago ⁺¹
Hi sir, I checked out your NER playlist and had a question. How can we calculate the accuracy of an NER model?
@python-programming 2 years ago
Hi. I am glad you are watching the video. You analyze the Precision, Recall, and F-Score during training, but this only lets you know how the model is performing during the training process. To gather proper metrics, you need to structure a formal test with a held-out set and monitor the results. I have a video on it here: ua-cam.com/video/k1FtpADlusE/v-deo.html
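A minimal sketch of the held-out evaluation described in the reply above, assuming entities are compared as exact `(start, end, label)` matches. That matching rule, the tuple layout, and the function name `entity_prf` are assumptions for illustration; the video may score entities differently.

```python
from typing import Iterable, Sequence, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def entity_prf(
    gold_docs: Iterable[Sequence[Entity]],
    pred_docs: Iterable[Sequence[Entity]],
) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 over a held-out set."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)  # predicted and correct
        fp += len(pred_set - gold_set)  # predicted but wrong
        fn += len(gold_set - pred_set)  # missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy held-out set: two documents with made-up annotations.
gold = [[(0, 4, "PER"), (10, 16, "LOC")], [(3, 8, "ORG")]]
pred = [[(0, 4, "PER")], [(3, 8, "PER")]]
print(entity_prf(gold, pred))
```

The key point is that `gold` must come from documents the model never saw during training; otherwise the numbers only repeat what the training loop already reports.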
@JayShankarpure 2 years ago ⁺¹
@@python-programming Got it, thanks sir. Actually I am making a stock research platform called Shodh, which involves some advanced NLP. I would love to get your guidance on a few of the topics I am working on. Can we connect sometime soon? Thanks.
@python-programming 2 years ago
Sure! I do consultation, just fill out the form on my website wjbmattingly.com
@dankchan420 2 years ago ⁺⁴
Can you show how to compare it to LDA with topic information gain? Or coherence score? Something I’m curious to see.
@edadila 2 years ago ⁺²
I need this too! Thanks for the great video by the way👏
@python-programming 2 years ago ⁺²
Great idea for a new video! Thanks!
@boubacarbah1455 2 years ago
Hello, I'm trying to reproduce your exercise, but I run into a problem when I try to import Top2Vec with "from top2vec import Top2Vec". I get this error: no module named "llvmlite.binding.dylib", and I could not fix it. Do you have a solution?
@bben4507 1 year ago
Similar here, but I got: OSError: Could not load shared object file: libllvmlite.dylib
@sinabaghaei3504 1 year ago ⁺¹
So do you suggest working with Top2Vec rather than LDA? I mean, do you think doing those manual changes in implementing LDA and data preprocessing is worth it, or should we stick to Top2Vec? By the way, your videos are awesome and I am really interested in going deep into topic modeling.
@python-programming 1 year ago ⁺¹
For most tasks, it makes more sense to use the newer methods applied in Top2Vec, BERTopic, or LeetTopic than doing traditional LDA Topic Modeling. That said, there are times that LDA may make more sense. It just depends on the problem that you are trying to solve. I have not had to use LDA in a while because the results from transformer-based topic modeling are far superior.
@j0shm0o1 2 years ago ⁺¹
Thanks for a great video! I installed top2vec and tried importing it. I get the following error: No module named 'wordcloud.query_integral_image'. Any ideas?
@python-programming 2 years ago
Thanks! Interesting question. Did you create a new environment? I am wondering if you have an older version of wordcloud installed in your base?
@j0shm0o1 2 years ago ⁺¹
This got resolved when I created a new environment.
@python-programming 2 years ago
@@j0shm0o1 excellent!
@dynahmhyte 6 months ago ⁺¹
ValueError: Documents need to be a list of strings (I get this when I type model = Top2Vec(docs))
@python-programming 6 months ago
Perhaps a few of your items are NaN values or ints or floats?
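A minimal sketch of how to check for the problem suggested in the reply above: it separates real, non-empty strings from everything else (NaN, `None`, numbers, blank strings) and reports where the bad items sit. The helper name `keep_string_documents` is hypothetical, and whether to drop or convert non-string items is a judgment call; this sketch drops them.

```python
from typing import Any, List, Sequence, Tuple


def keep_string_documents(
    items: Sequence[Any],
) -> Tuple[List[str], List[Tuple[int, Any]]]:
    """Split items into usable string documents and (index, value) rejects."""
    docs: List[str] = []
    dropped: List[Tuple[int, Any]] = []
    for index, item in enumerate(items):
        if isinstance(item, str) and item.strip():
            docs.append(item)
        else:
            # NaN is a float, so it lands here along with None and ints.
            dropped.append((index, item))
    return docs, dropped


raw = ["arma virumque cano", float("nan"), 42, None, "  ", "nunc est bibendum"]
docs, dropped = keep_string_documents(raw)
print(len(docs), "kept;", "dropped positions:", [i for i, _ in dropped])
```

If the documents come from a pandas column (an assumption; the commenter does not say), the same check can be run on `column.tolist()` before handing `docs` to the model.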
@kosemekars 2 years ago ⁺²
Great vid, as always. I'm interested in creating my own WordNet dataset, any ideas where I should start?
@python-programming 2 years ago ⁺¹
You are too kind! Thanks. That is a very interesting question that I have never gotten before. I have never attempted something precisely like that before (so take what I say with a grain of salt), but I have worked with similar problems that were very domain-specific. I used a combination of heuristics and FastText embeddings to generate a sort of weakly supervised approach to forming a knowledge tree based on semantic and syntactic meaning. Does this help?
@kosemekars 2 years ago ⁺¹
@@python-programming Thanks for the illuminating answer. Do you think that a graph-based approach (using something like NetworkX) could be helpful? Basically starting from a lexicon or dictionary and mapping the relations.
@python-programming 2 years ago ⁺¹
No problem! Indeed I do. That was actually exactly how I graphed them out. Also use word vectors and use the similarity to calculate the weights of the edges in the graph. That may help.
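A minimal sketch of the edge-weighting idea in the reply above: cosine similarity between word vectors becomes the weight of the edge between two words. The two-dimensional vectors are made up for illustration (real ones would come from FastText or a similar model), and the threshold value is an assumption.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(
    vectors: Dict[str, Sequence[float]], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return (word_a, word_b, weight) for every pair at or above threshold."""
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        weight = cosine(vectors[a], vectors[b])
        if weight >= threshold:
            edges.append((a, b, weight))
    return edges


# Toy vectors, invented for this example only.
toy_vectors = {"rex": [1.0, 0.0], "regina": [0.8, 0.6], "mare": [0.0, 1.0]}
edges = similarity_edges(toy_vectors)
print(edges)
```

The resulting triples have the shape that NetworkX's `Graph.add_weighted_edges_from` accepts, so they can be loaded into a graph directly if NetworkX is installed.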
@kosemekars 2 years ago ⁺¹
@@python-programming Very interesting. Thanks.
@python-programming 2 years ago
No problem!
@moemarocha3893 1 year ago
Hi! Anyone here having trouble importing Top2Vec due to problems with the NumPy version? I have tried most of the possible solutions I found on Stack Overflow but nothing works...
@radoslavkoynov322 1 year ago ⁺¹
Try a clean environment using venv.
@jubinamarie 1 year ago ⁺¹
This and your other top2vec videos are awesome! This is exactly what I needed. I have a question for you. I use other tools (e.g., Tableau) and would want to export topic data from Jupyter Notebooks to use elsewhere. I figured out how to export the DF to Excel with a column added for the topic numbers, but can't for the life of me figure out how to get columns with the other information, such as the document scores, maybe the top 10 words for each topic. The inability to move all the data out is holding me back. Hope you can help. Thank you!
@python-programming 1 year ago
I ran into these same issues, and that's why I created LeetTopic with a colleague. It does a lot of the same things as Top2Vec but returns a df with all the data you want.
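A minimal sketch of the export asked about above, assuming you already hold four parallel sequences: the documents, a topic number per document, a score per document, and the topic's words per document. In Top2Vec these would typically come from a call such as `get_documents_topics`, but treat that as an assumption to check against the version you have installed. The helper name `write_topic_table` and the column names are hypothetical.

```python
import csv
from typing import Iterable, Sequence, TextIO


def write_topic_table(
    out: TextIO,
    documents: Sequence[str],
    topic_nums: Sequence[int],
    topic_scores: Sequence[float],
    topic_words: Sequence[Iterable[str]],
    top_n: int = 10,
) -> None:
    """Write one CSV row per document: text, topic, score, top topic words."""
    writer = csv.writer(out)
    writer.writerow(["document", "topic", "score", "top_words"])
    for doc, num, score, words in zip(
        documents, topic_nums, topic_scores, topic_words
    ):
        # int()/float() also accept NumPy scalars, which models often return.
        writer.writerow([doc, int(num), float(score), " ".join(list(words)[:top_n])])
```

Opening a file with `open("topics.csv", "w", newline="", encoding="utf-8")` and passing it as `out` gives a CSV that Tableau or Excel can read, with the score and top words in their own columns.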
@thepresistence5935 1 year ago
Try BERTopic.
@speedTurtle 10 months ago ⁺¹
Bro is the NLP Gawd.
@python-programming 10 months ago
Thanks so much!
@AndrewPeverells 2 years ago ⁺¹
Hello Dr Mattingly, great guide as always!
I'm in need of help though. I'll post this here, so maybe other people who have the same issue can solve it, but if you prefer I can send you a PM.
I'm trying to feed my model 2 kinds of lists:
1. ["arma", "virumque", "cano", "troiae"...]
2. ["arma virumque cano troiae qui primus ab oris..."]
(From what I get in the documentation, the first one should be the way to go, as it processes lists of strings.)
When trying to build my model, I get these two types of errors. For the first one:
"Exception in thread Thread-171:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/apevere/.local/lib/python3.8/site-packages/gensim/models/word2vec.py", line 1163, in _worker_loop
tally, raw_tally = self._do_train_job(data_iterable, alpha, thread_private_mem)
File "/home/apevere/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 424, in _do_train_job
tally += train_document_dbow(
File "gensim/models/doc2vec_inner.pyx", line 358, in gensim.models.doc2vec_inner.train_document_dbow
TypeError: Cannot convert list to numpy.ndarray" (and it gets stuck loading)
For the second one:
"hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._kd_tree.BinaryTree.query()
k must be less than or equal to the number of training points"
Do you know how to solve it?
@python-programming 2 years ago
Thanks!
And great question. I think we can work on it here, that way others can get the benefit of hearing about the issue and potential solutions. First, unlike other topic modeling approaches, with top2vec you do not need to tokenize your text, so a list of docs (as strings) is what you want to give the model. I have not tried to give it a list of lists yet, but from what I can see from your first example, you appear to just be giving the model a list of words.
In this scenario, you would typically want to give it a list of lists with each sublist being the tokens (words) from each document. Does that make sense?
I suspect this is the origin of the error, but I would need to see more of your code to address it properly. If you want, DM me on Twitter with a larger snippet and I will respond here with a better answer.
@AndrewPeverells 2 years ago ⁺¹
@@python-programming Thank you for your quick answer!
Yes, it does make sense. As with a pretty consistent share of coding-related problems, it's an issue of data types and how to properly handle them.
Now though I'm a bit lost. As a test, I'm trying to feed my model this list of strings:
" lst = [["arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"],
["nunc", "est", "bibendum", "nunc", "pede", "libero", "pulsare", "tellus"],
["uiuamus", "mea", "lesbia", "atque", "amemus"]] "
(Yes, I'm working with Latin!)
The error for model = Top2Vec(lst) now is: "ValueError: Documents need to be a list of strings"
Isn't it, like you said, a list of lists, with each sublist being strings (the tokens)? Am I missing something terribly basic, because I'm a complete beginner at coding?
@python-programming 2 years ago
@@AndrewPeverells No problem! Happy to help. Can you try giving it a list of sentences rather than a list of lists of tokens and see if that helps? Also, can you paste your whole code here so I can see how you are loading your data? Also, what OS are you using?
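A minimal sketch of the conversion suggested in the reply above: each tokenized document is joined back into one string, so the model receives a flat list of strings instead of a list of lists. The helper name `to_document_strings` is hypothetical; the sample data is the `lst` from the comment.

```python
from typing import List, Sequence, Union


def to_document_strings(
    corpus: Sequence[Union[str, Sequence[str]]],
) -> List[str]:
    """Return one string per document, joining token lists with spaces."""
    docs = []
    for item in corpus:
        if isinstance(item, str):
            docs.append(item)  # already a document string
        else:
            docs.append(" ".join(item))  # token list -> single string
    return docs


lst = [
    ["arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"],
    ["nunc", "est", "bibendum", "nunc", "pede", "libero", "pulsare", "tellus"],
    ["uiuamus", "mea", "lesbia", "atque", "amemus"],
]
docs = to_document_strings(lst)
print(docs)
```

This only fixes the data type. A three-document corpus is still very small for a topic model, so further errors at training time would not be surprising.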
@AndrewPeverells 2 years ago
@@python-programming Now I tried with a simple list of sentences, and it gave me another error.
I'll paste the whole code, although it's very short:
>> from top2vec import Top2Vec
>> lst = ["arma virumque cano troiae qui primus ab oris", "nunc est bibendum, nunc pede libero pulsare tellus", "uiuamus mea lesbia atque amemus"]
>> model = Top2Vec(lst)
Error:
"RuntimeError Traceback (most recent call last)
/tmp/ipykernel_1573/2552625371.py in
----> 1 model = Top2Vec(lst)
~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, ngram_vocab, ngram_vocab_args, embedding_model, embedding_model_path, embedding_batch_size, split_documents, document_chunker, chunk_length, max_num_chunks, chunk_overlap_ratio, chunk_len_coverage_ratio, sentencizer, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
524 logger.info('Creating joint document/word embedding')
525 self.embedding_model = 'doc2vec'
--> 526 self.model = Doc2Vec(**doc2vec_args)
527
528 self.word_vectors = self.model.wv.get_normed_vectors()
[...]
RuntimeError: you must first build vocabulary before training the model"
I'm working in Jupyter Notebook, from an Ubuntu terminal environment for Windows.
@AndrewPeverells 2 years ago ⁺¹
Update
I think I found the issue for this. It's the size of the corpus. If I raise the number of documents (each being a whole sentence) in my corpus, it stops giving me the error. I went for at least 15 documents.
Now it gives me another error though, and I'm quite lost.
Code:
>> from top2vec import Top2Vec
>> lst = ["document1", "document2", "document3", ... "document17"]
>> model = Top2Vec(lst)
Error:
"~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, ngram_vocab, ngram_vocab_args, embedding_model, embedding_model_path, embedding_batch_size, split_documents, document_chunker, chunk_length, max_num_chunks, chunk_overlap_ratio, chunk_len_coverage_ratio, sentencizer, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
682
683 # create topic vectors
--> 684 self._create_topic_vectors(cluster.labels_)
685
686 # deduplicate topics
~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in _create_topic_vectors(self, cluster_labels)
857 unique_labels.remove(-1)
858 self.topic_vectors = self._l2_normalize(
--> 859 np.vstack([self.document_vectors[np.where(cluster_labels == label)[0]]
860 .mean(axis=0) for label in unique_labels]))
861
in vstack(*args, **kwargs)
~/.local/lib/python3.8/site-packages/numpy/core/shape_base.py in vstack(tup)
280 if not isinstance(arrs, list):
281 arrs = [arrs]
--> 282 return _nx.concatenate(arrs, 0)
283
284
in concatenate(*args, **kwargs)
ValueError: need at least one array to concatenate"
I really don't know what this is all about.
@malikrumi1206 2 years ago ⁺¹
Do you mean that Top2Vec requires *actual sentences*? What about paragraphs? Paragraphs with more than one topic inside them?
@python-programming 2 years ago ⁺¹
Great question. You can use any length text, but if you are using BERT, you want to keep it under 512 tokens. (Double check my number.) If your texts have frequently overlapping subjects, you can plot the texts, see where that overlap occurs visually, and assign labels accordingly. Say topic 3 shares features of topics 1 and 2. It would theoretically be plotted between the two, with a pull towards the one it shares the most overlap with. But yes, you can use sentences or paragraphs. Either should be fine.
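A minimal sketch of one way to keep texts under a length limit like the one mentioned above: split each document into overlapping chunks of whitespace-separated words. Word counts are only a rough stand-in for BERT's subword tokens, so the default of 200 words is a conservative assumption rather than a documented value, and `chunk_words` is a hypothetical helper. The `Top2Vec.__init__` signature shown in the traceback earlier in this thread lists `split_documents` and `chunk_length` parameters, which suggests the library can also chunk documents itself.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
```

Each chunk can then be treated as its own document, which also helps with paragraphs that mix several topics.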
@malikrumi1206 2 years ago ⁺¹
Great! Thanks.
@python-programming 2 years ago
No problem!
@TC-bv4on 2 years ago ⁺¹
Working on topic modeling for legal opinions. Have you tried BERT?
@python-programming 2 years ago ⁺¹
I have. It works very nicely. There is a library that wraps around HuggingFace BERT models. It is called BERTopic, but top2vec does the same things and a bit more. Just specify the BERT model.
@TC-bv4on 2 years ago ⁺²
@@python-programming Awesome! Thanks. I know there is a Legal BERT that is pretrained on legal materials, so idk if there is a way to specify it. Also hoping to supplement it with a citation network, because you really can’t understand an opinion without understanding its citations. If you have any ideas I’m all ears!
Btw your channel is so needed. Hope it keeps growing while staying helpful and non-youtubey.
@python-programming 2 years ago
@@TC-bv4on You should be able to point to the Legal BERT. Thanks so much for your kind words about my channel! If you want to see some legal content, let me know.
@TC-bv4on 2 years ago
I personally would, but idk, I might be the only one. Law is super far behind as far as technology goes.
@wenqianzhou9174 2 years ago ⁺¹
How about BERTopic?
@python-programming 2 years ago ⁺²
I will do a video on that.
@khalifakhalifa610 1 year ago
@@python-programming Can't wait for your BERTopic video. Your style is just amazing, kudos!!!
@jordoobodi 1 year ago
4:20
Which is it!?
"Each word in that document, type, all the items of that vector, all the documents.."
@avi2923 1 year ago
This is really amazing work. I am a Product Manager with an eye for analytics. However, I am limited by my knowledge of the available techniques. I really cannot thank you enough for explaining it so well and making the knowledge so accessible for business people like us.
@babyroo555 1 year ago ⁺¹
Any R coders here?
