Topic modeling with R and tidy data principles

Julia Silge

Published 15 Oct 2024

COMMENTS • 93

@learningstuffs5718 4 years ago ⁺³
I am learning R, just got past the basics, and am trying to implement it in projects. Your channel is a fantastic place for people like me to learn; please keep teaching. Thank you.
@XJRULO 4 years ago ⁺¹
I took one or two of your DataCamp courses, but making this available for free is remarkable, nice work. Thanks a lot!!!
@samuelholt7775 4 years ago ⁺²⁰
Please do more! This was a brilliant introduction with perfect pace, I learned so much in less than 30 min! Hopefully this tip helps you as much as this demonstration helped me: Ctrl+Shift+M (or Cmd+Shift+M) is a handy shortcut in RStudio that inserts the pipe, %>%. Thank me later ;)
@donataamato3418 4 months ago
THANK YOU so much!!!
@toshiyukihasumi825 2 years ago
Thank you so much for your video. It's the ONLY tutorial I've found that talks about STM! Please keep them coming and truly appreciate your video!
@JuliaSilge 2 years ago ⁺¹
I've also got this blog/screencast that demonstrates how to use STM: juliasilge.com/blog/spice-girls/
@lightspd714 5 years ago ⁺²
Julia you are a great teacher. I love your text mining with R book but it is nice to see the concepts come to life in video.
@Mrsandis89 3 years ago
Julia, you’re an angel. I have to do my dissertation with STM and, thanks to you, I can literally complete it in 2 weeks, before my April 4th deadline.
@djcfb2889 3 years ago
Wow! This is probably the best R tutorial I've seen like forever!
@happylearning-gp 2 years ago
Excellent contribution, so fast, very clear, error-free, well explained
@hesamseraj 2 years ago
Again, very helpful. I hope you keep sharing more videos on any new topic that interests you.
@jadesweeney1690 3 years ago
This was so helpful to me during my research placement on tidytext data mining, thank you!
@gabriellakountourides6726 3 years ago
You directed me to topic modeling after I asked a question on Stack Overflow, thank you so much! Thank you for this amazing, amazing resource!
@terraflops 3 years ago
****PLEASE ZOOM IN **** for the future, please! I _Love_ this, thank you so much!
@Dawgs10100 3 years ago ⁺¹
Thank you for this great video. I hope there are more to come! :)
@entrepreneuriatrecherchesetcon 1 year ago
Nice presentation. I suggest increasing the font size via Tools, general settings, Appearance, choosing for instance 16 or 18. The code will be clearer.
@mxm8900 4 months ago
Wow great video. I have nothing to do with text analysis, but I still watched the whole video
@RosieOutdoors 5 years ago
Thank you so much for this video. As a complete newcomer to r and topic modelling, this was so well explained.
@robertc2121 6 years ago
Julia, this is amazing. Love your book, and I had been tempted by DataCamp for months before finally signing up because of your course. What a help they both have been. Thank you!!
@JuliaSilge 6 years ago ⁺²
HA you are so welcome! I'm really glad these resources are helpful. 👍
@DanTaninecz 6 years ago
Great work. Very clear video. This type of solid instruction is all too rare in data science. Generally this type of stuff is just dumped on the user.
@prabhacar 2 years ago
thanks for such a nice explanation. loved the demo!
@Mrsandis89 3 years ago
And of course, I’ve read your work. You’re brilliant.
@abdulrahmanabdulkadri4825 4 years ago ⁺¹
This is great and very helpful! I would like to ask, how might we know which documents fall under which topic? Might there also be a data visualization for this? We only see how many documents fall under which topic, but not specifically which document.
@JuliaSilge 4 years ago ⁺¹
Yes, check out the topic modeling section of the workshop I taught at rstudio::conf this year:
github.com/rstudio-conf-2020/text-mining
@abdulrahmanabdulkadri4825 4 years ago
@@JuliaSilge Amazing! Thank you very much!
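One way to see which document falls under which topic is to look at the per-document-per-topic probabilities (gamma). The sketch below is not taken from the linked workshop. It uses a small hand-made data frame, `td_gamma`, as a stand-in for what `tidy(topic_model, matrix = "gamma")` from tidytext returns for an stm model (columns `document`, `topic`, `gamma`); the document names and values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for tidy(topic_model, matrix = "gamma"):
# one row per document-topic pair, gamma sums to 1 within a document.
td_gamma <- tibble::tribble(
  ~document, ~topic, ~gamma,
  "Story A",      1,   0.90,
  "Story A",      2,   0.10,
  "Story B",      1,   0.20,
  "Story B",      2,   0.80,
  "Story C",      1,   0.35,
  "Story C",      2,   0.65
)

# Keep the highest-gamma topic for each document
top_topic <- td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

top_topic
```

For a visual version, a histogram of `gamma` faceted by `topic` shows how strongly documents are associated with each topic.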
@edutimqiu1168 3 years ago
Amazing work, incredibly helpful. All the best!
@jianzhang9157 4 years ago
I really like your Introduction! It's great.
@englianhu 6 years ago
I used quanteda for my professional certificate a few years ago.
The tidytext and stm packages that you introduce will be more suitable for natural language processing. 😉
@vikrantnag86 4 years ago ⁺¹
Thank you Julia. Can you please share some knowledge on how to do sentiment analysis in R? It would be very helpful.
@vm2321 3 years ago
She's written a book about it bro lol
Here's the link www.tidytextmining.com/
@botswithabeat 6 years ago ⁺¹
Great video! I am hoping to do some topic modeling on some 19th-century German texts with your approach. I still am unsure what I will do to import German stop words, but I will do some digging.
One critique: it is difficult to type along while you are talking, especially when you are entering things into the console so quickly. Maybe slow down by 5%.
Thanks a lot for the great website and video.
@botswithabeat 6 years ago
Thanks a lot for the quick reply and very useful info!
@hkia7893 3 years ago
You can reduce the playback speed
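On the German stop words question above, one possible approach (a sketch, not the reply given in the thread) is tidytext's `get_stopwords()`, which can return a German list when the stopwords package is installed. The example sentence is made up, and a modern stop word list may not cover 19th-century spellings.

```r
library(dplyr)
library(tidytext)

# Toy German text (made up for illustration)
german_text <- tibble(
  document = 1,
  text = "Der Hund und die Katze schlafen in dem Garten"
)

# Snowball stop word list for German; needs the stopwords package installed
german_stops <- get_stopwords(language = "de")

tidy_german <- german_text %>%
  unnest_tokens(word, text) %>%          # lowercases and splits into words
  anti_join(german_stops, by = "word")   # drops rows whose word is a stop word

tidy_german
```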
@RajatSrivatava 6 years ago
Hi ma'am, your presentation and teaching skills are so good. Thanks so much.
@knowledgeispower7007 4 years ago ⁺¹
Thank you so much for this video. I’m very new to R and to STM. I’m working on a paper and trying to analyze press releases to formulate my hypotheses and find relevant topics. The press releases are stored in a Word document. Could you please guide me on where to start and how to go about this? I’m trying to find latent variables and I heard that STM is a great model to use for this purpose. I appreciate your help 🙏
@JuliaSilge 4 years ago
The first thing you need to do is read the Word files into R, because Word files are a special format that requires specific handling. One package I like for dealing with Word and other Office files is officer: davidgohel.github.io/officer/
You can look at some of the other options folks use here: stackoverflow.com/questions/50439684/how-to-extract-plain-text-from-docx-file-using-r
@knowledgeispower7007 4 years ago
@@JuliaSilge thank you so much for your prompt response and for the resources you provided 🙏 I will definitely try them
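As a rough illustration of the officer suggestion above, the sketch below writes a tiny .docx file (so it runs on its own), reads it back with `read_docx()` and `docx_summary()`, and tokenizes the paragraphs with tidytext. The file contents and the `document` column are made up; with real press releases, `path` would point to an existing file.

```r
library(officer)
library(dplyr)
library(tidytext)

# Create a small .docx file so the example is self-contained
path <- tempfile(fileext = ".docx")
doc <- read_docx()
doc <- body_add_par(doc, "The company announced record quarterly earnings.")
doc <- body_add_par(doc, "A new product line will launch next year.")
print(doc, target = path)

# Read the file back and keep one row per non-empty paragraph
press_release <- docx_summary(read_docx(path)) %>%
  filter(content_type == "paragraph", text != "") %>%
  transmute(document = basename(path), text)

# Tidy format: one word per row, ready for counting or topic modeling
tidy_release <- press_release %>%
  unnest_tokens(word, text)

tidy_release
```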
@emilierademakers70 6 years ago
Hi Julia, thanks for sharing this tutorial! It was exactly what I needed. I am working on recovering latent dimensions in job descriptions and I am using R topic modelling to gain insight. I have two questions.
1. I first started working on my data using the Text Mining in R and got acquainted with the LDA methods. I see there are similarities with the stm package; however, the documentation states that without covariates (which is what I am doing at the moment), STM reduces to a logistic-normal topic model, often called the Correlated Topic Model. What would you say are the main differences between the CTM and LDA? And apart from it being fast (indeed!), what would you say is the main motivation for using the stm package (with spectral initialization)?
2. Would you recommend first filtering out synonyms using e.g. the wordnet package in R? Or should the co-occurrence of these words with other words in documents solve this more or less?
Many many thanks!
Emilie
@JuliaSilge 6 years ago ⁺²
I don't think you need to filter out synonyms before implementing topic modeling, because that is one of the things topic modeling is doing, during the modeling process, finding the latent topics. Related, you might want to even consider whether stemming is useful for your domain space: transacl.org/ojs/index.php/tacl/article/view/868
I have had consistent, excellent results with STM, which is one of the reasons I recommend it to folks. LDA models are based on the Dirichlet distribution (if you draw a sample from a Dirichlet distribution, you get a positive vector that sums to one); these models are based on priors over topics/words, then you solve for (approximate) posterior. CTM is a different approach, which models that one topic can be correlated with another (LDA assumes they are independent). Instead of Dirichlet, it uses the logistic normal distribution, as I understand it. If you want to read the original paper for CTM, it is here:
arxiv.org/pdf/0708.3601.pdf
As far as spectral initialization, it is a good place to start and nice for getting quick and reasonable results. If I need something very robust, then I do all the work that is laid out in the stm package vignette. I am working on some tidy tooling around that, and hope to get it out sometime soon!
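To make the Dirichlet versus logistic normal distinction above concrete, here is a small base R sketch. It draws one vector of topic proportions each way: normalized gamma variates for the Dirichlet, and a softmax of a correlated Gaussian vector for the logistic normal. The number of topics, `alpha`, and the covariance matrix `Sigma` are arbitrary illustrative choices, and this is a simplification rather than how stm or the CTM paper parameterizes the model.

```r
set.seed(123)
K <- 3  # number of topics (illustrative)

# Dirichlet draw: normalize independent gamma variates
alpha <- rep(0.5, K)
g <- rgamma(K, shape = alpha, rate = 1)
theta_dirichlet <- g / sum(g)

# Logistic normal draw: sample a correlated Gaussian vector,
# then map it to proportions with the softmax function
Sigma <- matrix(c(1.0, 0.8, 0.0,
                  0.8, 1.0, 0.0,
                  0.0, 0.0, 1.0), nrow = K)  # topics 1 and 2 positively correlated
eta <- as.vector(rnorm(K) %*% chol(Sigma))
theta_logistic_normal <- exp(eta) / sum(exp(eta))

sum(theta_dirichlet)        # proportions sum to one
sum(theta_logistic_normal)  # proportions sum to one
```

The covariance matrix is what lets the logistic normal express that two topics tend to appear together; a single Dirichlet parameter vector has no equivalent of the off-diagonal entries.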
@dr.tarunsengupta6248 2 years ago
The gutenbergr package is not available in the new version of R. Please change the code accordingly so that the analysis can be done from any text or PDF document.
@kaswin6527 6 years ago ⁺¹
Fabulous explanation ever seen ..
Thank you sooooo much
@sonabaghdasaryan1198 6 years ago ⁺²
Hi, an amazing video. But still I have a problem from the very beginning: I get an error while downloading gutenbergr. Error: No package with the name gutengergr. Which RStudio version do you use in this video? Thx in beforehand ^^
@sonabaghdasaryan1198 6 years ago ⁺¹
Everything is fine, thx. After restarting my computer my code is running ^^ Julia, u r great ^^u inspired me to do TM ..
@TerezaS 4 years ago
Thank you so much for this video! And I love your book :)) If you consider doing more videos, I would love aspect-based sentiment analysis as a topic :))))
@entrepreneuriatrecherchesetcon 1 year ago
@Tereza S look at my video on sentiment analysis of many documents ua-cam.com/video/rU97L9Tu7Dg/v-deo.html
@swazy1777 4 years ago
You are an amazing teacher!
@pe66o 5 years ago
Dear Julia - how can I create a topic model , when I have dataset as follows - Column1 word , Column 2 frequency of the word in the texts, Column 3 Main class and Column 4 the subclass? The topics should be classes and the subclasses. I made already a dictionary with the classes and subclasses. Thank you
@delando983 5 years ago ⁺¹
Nice video!! I am getting an error, not sure if it's me... more likely it is :(
sherlock_tf_idf %>%
+ mutate(word = reorder(word, tf_idf, story)) %>%
+ ggplot(aes(word, tf_idf, fill = story)) +
+ geom_col(alpha = 0.8, show.legend = FALSE) +
+ facet_wrap(~ story, scales = "free", ncol = 3) +
+ scale_x_reordered() +
+ coord_flip() +
+ theme(strip.text=element_text(size=11)) +
+ labs(x = NULL, y = "tf-idf",
+ title = "Highest tf-idf words in Sherlock Holmes short stories",
+ subtitle = "Individual stories focus on different characters and narrative elements")
Error in mutate_impl(.data, dots) :
Evaluation error: object 'FUN' of mode 'function' was not found.
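A likely cause of this error, judging from the pasted code: base R's `reorder()` treats its third argument as a summary function (`FUN`), so passing `story` there fails. `scale_x_reordered()` is meant to be paired with tidytext's `reorder_within()`, which takes the grouping variable as its third argument. The sketch below uses a made-up stand-in for `sherlock_tf_idf` so that it runs on its own.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Made-up stand-in for the sherlock_tf_idf data frame from the video
sherlock_tf_idf <- tibble(
  story  = rep(c("Story A", "Story B"), each = 3),
  word   = c("holmes", "hat", "goose", "holmes", "snake", "bell"),
  tf_idf = c(0.010, 0.030, 0.050, 0.020, 0.060, 0.040)
)

# reorder_within() orders words by tf_idf separately inside each story
plot_data <- sherlock_tf_idf %>%
  mutate(word = reorder_within(word, tf_idf, story))

p <- ggplot(plot_data, aes(word, tf_idf, fill = story)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ story, scales = "free", ncol = 3) +
  scale_x_reordered() +   # strips the story suffix from the axis labels
  coord_flip() +
  labs(x = NULL, y = "tf-idf")

p
```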
@hkia7893 3 years ago
Thanks Julia for this interesting implementation of topic modelling
So in the end we get 6 topics with probability of 7 words each. And we do not know which story belongs to which topics.... 🤔
@JuliaSilge 3 years ago ⁺¹
If you look at the gamma probabilities, you can see how the stories are related to topics. Check out the plot "Distribution of document probabilities for each topic" here: juliasilge.com/blog/sherlock-holmes-stm/
@hkia7893 3 years ago
@@JuliaSilge thanks, I'm gonna check that out...
@morzaq123 6 years ago
Amazing Video. Looking Forward to more videos on Text Mining
5 years ago ⁺¹
That was an awesome teaching, thanks so much!
@stewartli5395 6 years ago
great insights in a tidy way. like it very much. thanks.
@lrschm 3 years ago
Awesome video - super helpful! :)
@avijitnandy6662 6 years ago
Maam we need more videos like this.
@bistanz 3 years ago
Thanks for the video! One small question. Don't you need Sherlock %>% filter(!is.na(story)) to remove all NA rows?
@JuliaSilge 3 years ago
It's been a while since I looked at this, but I don't believe there are any NA rows, at least as of how the data was formatted when I originally created this video/post back in 2018. You can see that in the tf-idf plot, no NA story facet: juliasilge.com/blog/sherlock-holmes-stm/
@bistanz 3 years ago
@@JuliaSilge Thanks for replying. Don't we select only the top 10 words in each document to plot tf-idf? Oh! In the end, NA is not that frequent. You are right, we may not need to remove NAs. Thanks again for the amazing material.
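One possible reason no NA stories show up, assuming the `story` column is built along the lines below (a sketch on made-up lines, not necessarily the exact code from the video): `filter()` keeps only rows where the condition is `TRUE`, so a comparison against a missing `story` evaluates to `NA` and that row is dropped without an explicit `!is.na()` check.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up lines standing in for the raw Project Gutenberg text
sherlock_raw <- tibble(text = c(
  "Front matter before any story",
  "THE ADVENTURES OF SHERLOCK HOLMES",
  "Table of contents",
  "ADVENTURE I. A SCANDAL IN BOHEMIA",
  "To Sherlock Holmes she is always the woman."
))

sherlock <- sherlock_raw %>%
  # Mark heading lines, then carry each heading down to the lines below it
  mutate(story = ifelse(str_detect(text, "ADVENTURE"), text, NA)) %>%
  fill(story) %>%
  # Rows whose story is NA make this condition NA, so they are dropped too
  filter(story != "THE ADVENTURES OF SHERLOCK HOLMES")

sherlock
```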
@Jaji1948 2 years ago
Resolution too low. Can’t read the screen. Can you send me a link to a higher res version?
@GustavoMontanha 4 years ago
thanks julia, loved it
@celloharper 6 years ago
Thanks for the video. Please post more. How does one find your blog?
@shilpasuresh641 4 years ago
How do you text mine a lot of urls stored in a CSV file ? or in other words topic modeling
@davidizquierdogomez 5 years ago
Hello Julia... very nice video, thanks a lot. I have a question: in my network graph of bigrams, I get nodes without names. Does it mean that I haven't cleaned the white spaces properly? Thank you very much.
@davidizquierdogomez 5 years ago
Thanks for the response... I double checked and it is not a problem related to white spaces. I coded an igraph of bigrams and I get bigrams that sit alone in two-node associations. Instead of a bigram, there is a number on the empty node...
@dinohadjiyannis3225 4 months ago
Julia, if I'm using a topic model on YouTube comments to determine which video best explains topic modeling, how can I decide if your video or another video should be suggested? I see the model ranks comments with "gamma." If each comment is linked to a video ID, and based on gamma some or all comments rank highly in a hypothetical "topic modeling" topic, what then? Can we infer that your video is the best?
@JuliaSilge 4 months ago ⁺¹
HAHA I can't tell if this is serious or not 🙈
In case it is, I will say that since topic modeling is unsupervised ML, it can't be used in a straightforward way to evaluate better/worse (you are not predicting a label). Instead, like you say, you could compare the relative proportion of certain topics (like, say, a topic that seems to be mostly about topic modeling) in one video's comments compared to others, and make an evaluation of videos based on that.
@dinohadjiyannis3225 4 months ago
@@JuliaSilge
If I can "cluster" comments related to topic modeling and find that the most relevant ones are linked to your video ID (based on beta, which will give you the top word probabilities), your video will appear with the highest relevance to that topic (based on gamma). This means your video is the most representative of that specific topic. But wait..
Then, if I manually compare, say, the top 10 most relevant videos and see that your video (which is at the top) also has a lot of likes, comments, engagement, and perhaps a great sentiment (after computing it) compared to the other 9, I can conclude that your video is the "best" and would recommend it.
Does this make sense, or am I misinterpreting the gamma/beta.
***Assume I have concatenated all the comments for each video into one corpus, and each corpus is linked to a video ID.
@JuliaSilge 3 months ago
@@dinohadjiyannis3225 I think that makes sense! Sounds to me like you are interpreting correctly. 👍
@dinohadjiyannis3225 3 months ago
@@JuliaSilge A big thanks to you for replying, given that this video is 6 years old. 🥇
@srisreshtan1471 4 years ago
When I am trying to install the 'Guttenberger' package, I am getting a message package ‘guttenberger’ is not available (for R version 3.6.3)
@JuliaSilge 4 years ago ⁺¹
I think you're dealing with some typos there; there's just one "t" and no "e" at the end: cran.r-project.org/package=gutenbergr
@srisreshtan1471 4 years ago
Yes. My mistake. Apologies. Thanks for correcting it.
@jacobbonsell4776 6 years ago
Is there a way to get the frequency counts next to the betas in the topic-word distribution? I wanted to either use mutate or join somehow but I don't know where to retrieve the counts.
@jacobbonsell4776 6 years ago
Thank you
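For the question above about putting frequency counts next to the betas, one possible approach (a sketch, not the reply from the thread) is to compute word counts from the tidy text and join them onto the topic-word table. Both data frames below are made-up stand-ins: `td_beta` for the output of `tidy(topic_model)` (columns `topic`, `term`, `beta`) and `word_counts` for something like `count(word)` on the tokenized text.

```r
library(dplyr)

# Hypothetical stand-in for tidy(topic_model): one row per topic-term pair
td_beta <- tibble(
  topic = c(1, 1, 2, 2),
  term  = c("holmes", "street", "holmes", "door"),
  beta  = c(0.05, 0.02, 0.04, 0.03)
)

# Hypothetical corpus-wide word counts
word_counts <- tibble(
  word = c("holmes", "street", "door"),
  n    = c(460, 80, 120)
)

# Attach the overall frequency of each term to its beta rows
beta_with_counts <- td_beta %>%
  left_join(word_counts, by = c("term" = "word"))

beta_with_counts
```

Note that `n` here is the count across the whole corpus, not a per-topic count, so the same value repeats for a term that appears under several topics.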
@janidelemmanuelcastaneda8318 4 years ago
Awesome content
@paulmm6878 3 years ago
I love your videos 😃 greetings from Ecuador ✌️
@Yi-cu7ie 4 years ago
Hi, thank you for your video, which helps me a lot. I have a question. I have raw text in PDF and Word form; how could I convert it to a data frame like sherlock_raw and sherlock in the program? Thank you so much for your time and consideration!!!
@JuliaSilge 4 years ago ⁺¹
For PDFs, my favorite tool for reading text into R is the pdftools package: docs.ropensci.org/pdftools/
I have less experience reading in .docx files, but I have occasionally used the textreadr package: github.com/trinker/textreadr
Good luck!
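As a rough sketch of the pdftools route mentioned above: `pdf_text()` returns one string per page, which can be put in a data frame and tokenized like the Sherlock Holmes text. To keep the example runnable on its own, it first writes a one-page PDF with base R graphics; the sentence and the `document` and `page` column names are made up for illustration.

```r
library(pdftools)
library(dplyr)
library(tidytext)

# Create a one-page PDF so the example is self-contained
# (in practice, `path` would point to an existing file)
path <- tempfile(fileext = ".pdf")
pdf(path)
plot.new()
text(0.5, 0.5, "Sherlock Holmes examined the letter")
dev.off()

# pdf_text() returns a character vector with one element per page
pages <- pdf_text(path)

pdf_df <- tibble(
  document = basename(path),
  page = seq_along(pages),
  text = pages
)

# One word per row, the same shape as the tidy text used for topic modeling
tidy_pdf <- pdf_df %>%
  unnest_tokens(word, text)

tidy_pdf
```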
@odhiambogigs2829 5 years ago
nice work....this was very helpful
@PatriciaRiosblog 5 years ago
Hi julia would stm work nowadays for twitter or facebook content? thanks
@JuliaSilge 5 years ago
Yep! This example shows using stm for topic modeling with long documents (books) but this approach also works with shorter documents. If you want to see an example of this, I have a blog here implementing topic modeling with Hacker News posts: juliasilge.com/blog/evaluating-stm/
@ilCapotasto 6 years ago ⁺¹
cast_dfm has been moved from quanteda to tidytext, correct?
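For reference, `cast_dfm()` is exported by tidytext and returns a quanteda document-feature matrix (so quanteda needs to be installed). A minimal sketch with made-up word counts:

```r
library(dplyr)
library(tidytext)   # cast_dfm() lives here; it needs quanteda installed

# Made-up word counts per story, standing in for count(story, word)
word_counts <- tibble(
  story = c("Story A", "Story A", "Story B", "Story B"),
  word  = c("holmes", "goose", "holmes", "snake"),
  n     = c(12, 5, 9, 7)
)

# Cast the tidy counts to a quanteda document-feature matrix
sherlock_dfm <- word_counts %>%
  cast_dfm(story, word, n)

dim(sherlock_dfm)  # documents x features

# With a real corpus, a dfm like this can be passed to stm, e.g.
# stm::stm(sherlock_dfm, K = 6, init.type = "Spectral", verbose = FALSE)
```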
@justinwallace1304 6 years ago
Ol
@2108966 6 years ago
Julia you are amazing!!! Thanks!!!
@biaoyang6207 5 years ago
Great! Thanks for sharing!
@bbbbraveheart 2 years ago
thank you so much~~~~
@dianaszabo3875 3 years ago
Thank you :)
@puspa_indah 5 years ago
How to calculate theta and beta in structural topic modeling manually? does anyone know the formula or concept?
@puspa_indah 5 years ago
@Julia Silge yes, I've already checked that paper, but I didn't find specific information related to the formula I mentioned. Is the algorithm for estimating the theta and beta matrices similar across topic modeling methods (e.g. LDA, CTM, STM, etc.)? Thanks for the previous reply btw :)
@PaulYoung-r8g 1 year ago
Great
@renatacavalcanti8297 4 years ago
A more than perfect video
