I am not just liking this but want to thank you for your time to show this. It is awesome Jeff!
As someone who is just starting out in the research domain and has to work with wiki dumps, this was a godsend. THANKS a ton, you just saved me tons of time and mental stress. Did I say thanks yet? THANKS A TON.
You, sir, get a like, a subscribe, and notifications enabled, and I am sharing your channel on my Twitter space.
I am using PySpark with this for my language model. Thanks so much for this!! I needed it!
Thank you for another great video, Jeff. Not only is it useful but, as the zombie apocalypse **has** been on my mind lately, it is also very timely. 😁
As others have already commented, I also think it would be nice to see the same process in spark. Keep up the great work.
I took a look at the content of your channel and it is very impressive. Please keep doing this!
Thank you Jeff - your video provides a really structured example.
Thanks a lot for your videos. I'd love to see more on how to deal with big data in Python. Best regards.
Five-star video 👏👏👏. It would be nice to see the same process using big data tech like HDFS, Spark, etc.
You're amazing. Just what I needed
Interesting video, keep it up!
I'm a beginner at this, so I will try the code after the file finishes downloading =). Thanks for it!
Hello Mr. Heaton. I wonder, can we get the 'text' data from the dataset into the CSV too?
I get FileNotFoundError: [Errno 2] No such file or directory, although it created the 2 CSV files in the directory.
The 3 CSV files**
Thank you so much.
I am working on this right now.
For the output, I need to generate a new XML file after filtering the wiki. I tried to use the module, but I was told "ElementTree is not a streaming writer". What do you recommend?
I have seen lxml used for that before, but have not done it myself.
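If it helps, here is a rough sketch of what that could look like with lxml's incremental writer. The file names and the keep_page() filter are placeholders made up for illustration, not anything from the video:

```python
# Sketch: stream-filter a Wikipedia dump into a new XML file with lxml.
from lxml import etree

INPUT = "enwiki-latest-pages-articles.xml"   # hypothetical input path
OUTPUT = "filtered.xml"                      # hypothetical output path

def keep_page(page):
    # Hypothetical filter: drop redirect pages, keep everything else.
    return not any(etree.QName(child).localname == "redirect" for child in page)

with etree.xmlfile(OUTPUT, encoding="utf-8") as xf:
    with xf.element("pages"):
        # iterparse streams the dump; "{*}page" matches <page> in any namespace.
        for _, page in etree.iterparse(INPUT, tag="{*}page"):
            if keep_page(page):
                xf.write(page)               # serialize this <page> immediately
            # Free the parsed element (and earlier siblings) to keep memory flat.
            page.clear()
            while page.getprevious() is not None:
                del page.getparent()[0]
```

etree.xmlfile serializes each element as soon as it is written, so the whole output document never has to sit in memory the way it would with plain ElementTree.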
Helpful!
Hi there, thank you for the video, but there's an issue: when I use your code, it won't fill the redirect column for some reason. Could you help me with this problem?
Let me have a look at that!
@HeatonResearch And another thing that I wanted to do is to grab the text of each article and attach it to the table as a separate column for each title. Could you give me some pointers or tips on how I can do this, please? It would help a lot. I've been trying to do it, but without success.
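In case it helps, here is a minimal, self-contained sketch of one way to do that. This is not the code from the video, and the file names are assumptions; it simply streams the dump and writes the raw wikitext next to each title:

```python
# Sketch: extract title + article text from a Wikipedia dump into a CSV.
import csv
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml"   # hypothetical input path
OUT = "wiki_text.csv"                       # hypothetical output path

def localname(tag):
    # Strip the '{namespace}' prefix that every MediaWiki dump tag carries.
    return tag.rsplit("}", 1)[-1]

with open(OUT, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "text"])
    title, text = None, None
    for event, elem in ET.iterparse(DUMP, events=("end",)):
        name = localname(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            writer.writerow([title, text])
            title, text = None, None
            elem.clear()   # release the finished <page> to keep memory flat
```

The same loop can be extended to pull other fields (id, timestamp, redirect) into extra columns.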
thanks so much!
Thanks for the video! It would be awesome to see this processed with Spark.
Yes, that is coming. Once you start to add any NLP functions on that Wikipedia text, the process can take weeks without Spark.
Has a Spark implementation been made since?
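For anyone who wants to experiment in the meantime, here is a rough PySpark sketch. It assumes a CSV of extracted pages (title and text columns) is available locally; the file name and column names are guesses, not the video's output:

```python
# Sketch: load extracted Wikipedia pages in PySpark and run a simple per-article step.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wiki-nlp").getOrCreate()

pages = (spark.read
         .option("header", True)
         .option("multiLine", True)   # article text contains embedded newlines
         .option("escape", '"')
         .csv("wiki_text.csv"))       # hypothetical path

# Example NLP-style step: whitespace token count per article, computed in parallel.
counts = pages.withColumn("n_tokens", F.size(F.split(F.col("text"), r"\s+")))
counts.select("title", "n_tokens").show(10, truncate=False)
```

Once the DataFrame is loaded, any heavier per-article NLP step can replace the token count, and Spark will spread the work across cores or a cluster.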
Thank you for this amazing tutorial. It's very informative. Can you please explain how to create a dataset of topics from the Wikipedia dump, say, retrieving 100 topics for example?
My question is, how can we crawl Wikipedia to get documents and images? Thanks in advance.
3:53 Funny you say that...
You can also torrent it; it's much faster to download.