Processing Large XML Wikipedia Dumps that won't fit in RAM in Python without Spark

  • Published 10 Sep 2024
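
The technique in question is a streaming parse: xml.etree.ElementTree.iterparse yields each element as it completes, so the dump never has to fit in RAM. A minimal sketch of that pattern, not the video's exact code (the filename and the export-0.10 namespace are assumptions; check the root element of your dump):

    import bz2
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check your dump's root element

    count = 0
    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        # iterparse accepts any file-like object, so the compressed dump
        # can be streamed without decompressing it to disk first.
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                count += 1
                elem.clear()  # discard the finished page to keep memory bounded
    print(count, "pages")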

COMMENTS • 27

  • @opalkabert
    @opalkabert 5 years ago +12

    I am not just liking this; I want to thank you for taking the time to show this. It is awesome, Jeff!

  • @biologyigcse
    @biologyigcse 4 years ago +6

    As a person who is just starting out in the research domain and has to work with wiki dumps, this was a godsend. THANKS a ton, you just saved me tons of time and mental stress. Did I say thanks yet? THANKS A TON.
    You sir, get a like, subscribe, notification enabling, and I am sharing your channel on my Twitter space.

  • @noneyahbiz6976
    @noneyahbiz6976 20 days ago

    I am using PySpark with this for my language model. Thanks so much for this!! I needed it!

  • @BiancaAguglia
    @BiancaAguglia 5 years ago +4

    Thank you for another great video, Jeff. Not only is it useful but, as the zombie apocalypse **has** been on my mind lately, it is also very timely. 😁
    As others have already commented, I also think it would be nice to see the same process in Spark. Keep up the great work.

  • @sadiko3000
    @sadiko3000 5 years ago

    I took a look at the content of your channel and it is very impressive. Please keep doing this!

  • @mariagraetsch3700
    @mariagraetsch3700 4 years ago

    Thank you Jeff - your video provides a really structured example.

  • @DanielWeikert
    @DanielWeikert 5 years ago +2

    Thanks a lot for your videos. I'd love to see more on how to deal with big data in Python. Best regards.

  • @tonym5857
    @tonym5857 5 years ago +1

    *stars video* 👏👏👏 It would be nice to see the same process using big data tech like HDFS, Spark, etc.

  • @woetotheconquered3451
    @woetotheconquered3451 2 years ago

    You're amazing. Just what I needed

  • @mariumbegum7325
    @mariumbegum7325 1 year ago

    Interesting video, keep it up!

  • @paulowiz
    @paulowiz 3 years ago

    I'm a beginner at this; I'll try the code after the file download =). Thanks for it!

  • @nonenogood
    @nonenogood 1 year ago

    Hello Mr. Heaton. I wonder, can we get the 'text' data from the dataset into CSV too?
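
    One possible approach, sketched under assumptions (the namespace and filenames may differ for your dump): extend the same streaming pass to write the revision text as an extra CSV column. csv.writer quotes embedded commas and newlines, though full articles exceed the csv module's default field limit when the file is read back.

        import csv
        import xml.etree.ElementTree as ET

        NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # confirm against your dump

        # Full articles exceed csv's default 131072-char field limit when
        # read back, so raise it if you plan to round-trip the file.
        csv.field_size_limit(1_000_000_000)

        with open("text.csv", "w", newline="", encoding="utf-8") as out:
            w = csv.writer(out)
            w.writerow(["title", "text"])
            for event, elem in ET.iterparse("enwiki-latest-pages-articles.xml", events=("end",)):
                if elem.tag == NS + "page":
                    title = elem.findtext(NS + "title")
                    text = elem.findtext(NS + "revision/" + NS + "text") or ""
                    w.writerow([title, text])
                    elem.clear()  # keep memory bounded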

  • @sarasmith1647
    @sarasmith1647 1 year ago

    I get FileNotFoundError: [Errno 2] No such file or directory, although it created the 2 CSV files in the directory.

  • @lisanoorarida4009
    @lisanoorarida4009 4 years ago

    Thank you so much.
    I am working on this right now.
    For the output, I need to generate a new XML file after filtering the wiki. I tried to use the module, but apparently "ElementTree is not a streaming writer". What do you recommend?

    • @HeatonResearch
      @HeatonResearch  4 years ago

      I have seen lxml used for that before, but have not done it myself.
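
      A minimal sketch of incremental writing with lxml's xmlfile API (tag and variable names are illustrative, not from the video):

          from lxml import etree

          # xmlfile streams elements to disk as they are written, so the
          # output document never has to fit in memory.
          with etree.xmlfile("filtered.xml", encoding="utf-8") as xf:
              xf.write_declaration()
              with xf.element("pages"):
                  for i in range(3):  # stand-in for a real stream of filtered pages
                      page = etree.Element("page")
                      page.text = "example %d" % i
                      xf.write(page)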

  • @RollingcoleW
    @RollingcoleW 1 year ago

    Helpful!

  • @tamastarisnyas1191
    @tamastarisnyas1191 3 years ago

    Hi there, thank you for the video, but there's an issue: when I use your code, it won't fill the redirect column for some reason. Could you help me with this problem?

    • @HeatonResearch
      @HeatonResearch  3 years ago +1

      Let me have a look at that!

    • @tamastarisnyas1191
      @tamastarisnyas1191 3 years ago

      @@HeatonResearch Another thing I wanted to do is grab the text of each article and attach it to the table as a separate column for each title. Could you give me some pointers or tips on how I can do this, please? It would help a lot. I've been trying, but without success.
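
      On the redirect column: in the dump format a redirect is an empty <redirect title="..."/> element, so the target must be read from the title attribute, not from element text; in the real dump every tag is also prefixed by the export-schema namespace. A minimal, self-contained sketch (the sample fragment is illustrative and omits the namespace):

          import xml.etree.ElementTree as ET

          # A tiny page shaped like the dump's (the real dump is namespaced).
          page = ET.fromstring(
              "<page>"
              "<title>Pichilemu, Chile</title>"
              "<redirect title='Pichilemu'/>"
              "<revision><text>#REDIRECT [[Pichilemu]]</text></revision>"
              "</page>"
          )

          # The redirect target lives in the 'title' attribute of an empty
          # element; reading the element's text yields nothing, which is
          # the usual cause of a blank redirect column.
          r = page.find("redirect")
          redirect = r.get("title") if r is not None else ""

          # The article body sits at revision/text; it can be written out
          # as one more column in the same CSV row as the title.
          text = page.findtext("revision/text") or ""
          print(redirect)  # -> Pichilemu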

  • @quackcharge
    @quackcharge 3 years ago

    thanks so much!

  • @victoriar8179
    @victoriar8179 4 years ago +2

    Thanks for the video! It would be awesome to see this same process with Spark.

    • @HeatonResearch
      @HeatonResearch  4 years ago +2

      Yes, that is coming. Once you start to add any NLP functions on that Wikipedia text, the process can take weeks without Spark.

  • @saleem801
    @saleem801 4 years ago

    Has a Spark implementation been made since?

  • @rohitreddy3609
    @rohitreddy3609 3 years ago

    Thank you for this amazing tutorial. It's very informative. Can you please explain how to create a dataset of topics from a Wikipedia dump, say retrieving 100 topics, for example?
    My question is: how can we crawl Wikipedia to get documents and images? Thanks in advance.

  • @Knightmare535
    @Knightmare535 4 years ago +1

    3:53 Funny you say that...

  • @623-x7b
    @623-x7b 4 years ago

    You can also torrent it; it's much faster to download.