Turning multiple CSV files into a single pandas data frame

  • Published 14 Feb 2022
  • Data sets often come as multiple CSV files. How can you read those into a single pandas data frame? In this video, I show you how to use a Python "for" loop, or even a list comprehension, along with the "glob" module and the pd.concat function, to accomplish this. (A minimal sketch of the approach appears just below.)
    The Jupyter notebook for this video is at github.com/reuven/youTube-not..., and my free weekly newsletter about Python and software engineering is at BetterDevelopersWeekly.com/.
  • Science & Technology
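
    A minimal sketch of the approach described in the video, assuming the CSV files live in a hypothetical data/ directory:

      import glob
      import pandas as pd

      # Collect the matching filenames, read each CSV into its own
      # data frame, then stack everything into a single data frame.
      all_dfs = [pd.read_csv(one_filename)
                 for one_filename in glob.glob('data/*.csv')]
      df = pd.concat(all_dfs).reset_index(drop=True)  # renumber the rows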

COMMENTS • 58

  • @sloughpacman
    @sloughpacman 2 years ago +3

    Your material has the habit of provoking thought, which often leads me off on exploratory tangents for a couple of hours. Thanks for wasting my day, Reuven 😀

  • @ManjeeriG
    @ManjeeriG 3 months ago +1

    Thank you so much for this tutorial. I have gone through multiple tutorials, but this one is the easiest and (of course) smartest way to combine multiple CSVs into one. 😊😊 You got a new subscriber.

    • @ReuvenLerner
      @ReuvenLerner  3 months ago +1

      I'm delighted that it helped!

  • @learner7273
    @learner7273 1 year ago

    I never knew you could have a list of dataframes. Thank you!

  • @matthewhsu7299
    @matthewhsu7299 1 year ago +1

    Learned something new today. Thank you so much!

  • @Merqos
    @Merqos 2 years ago +2

    Thank you so much for this example! This made me save so much time!

    • @ReuvenLerner
      @ReuvenLerner  2 years ago +1

      I'm so happy to hear it!

    • @Merqos
      @Merqos 2 years ago +1

      @@ReuvenLerner Yeah, really! I was told to extract data from a vessel where all the data came in Excel documents at 10-minute intervals. That meant within a day I had to open 144 files and collect them into one sheet. I love you and the internet, man, thanks again!

  • @33samogo
    @33samogo 1 year ago +1

    Excellent, worked fine for me also, thank you!

  • @gabbyf2906
    @gabbyf2906 1 year ago +1

    Great video, helped a lot!

  • @silvajonatan
    @silvajonatan 6 months ago +1

    Hi, fantastic information. Very didactic. Thanks a lot.

  • @israaaljowder9751
    @israaaljowder9751 1 year ago +1

    Very informative video, thanks!

  • @Tshediso1
    @Tshediso1 2 years ago +1

    Thanks very much!

  • @gam3rman85
    @gam3rman85 7 months ago +1

    Helpful, thanks!

  • @tedfitzgerald4202
    @tedfitzgerald4202 1 year ago +1

    Really super video.

  • @traal
    @traal 2 years ago +1

    Would this also work with a generator expression passed into pd.concat? That would look nicer and probably save some memory. 😊

    • @ReuvenLerner
      @ReuvenLerner  2 years ago +1

      It would definitely work. But I'm not sure how much memory it would save, because at the end of the day, we still get all of the data from the three smaller data frames. And the way that behind-the-scenes memory is managed, I'm going to guess (and might be wrong about this) that the usage would be the same. But it would probably be wise to experiment with this; I'll try to find some time to do that, and if I have interesting results, I'll report them on this channel.
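
      For reference, a sketch of the generator-expression version; pd.concat accepts any iterable of data frames, and the data/*.csv pattern is illustrative:

        import glob
        import pandas as pd

        # Same result as the list-comprehension version; pd.concat
        # consumes the generator, so the memory savings may be modest.
        df = pd.concat(pd.read_csv(one_filename)
                       for one_filename in glob.glob('data/*.csv'))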

  • @varunsharma3485
    @varunsharma3485 2 years ago +1

    Worked for me. Earlier I was using Excel Power Query to join multiple CSVs and then importing them into pandas, but Excel has a limit of roughly a million rows. This tutorial is very helpful. Thanks, and subscribed. One issue I'm facing with this code: I want to add each individual CSV's filename as a column in the concatenated dataframe, which is missing here. Any tips?

    • @ReuvenLerner
      @ReuvenLerner  2 years ago

      I'm delighted to know that the video helped!
      BTW, the maximum number of rows in Excel is (in)famous. It even led to the loss of a bunch of COVID data in the UK, when a whole batch of inputs was silently dropped.
      Thanks for joining me here, and I'll see you around...
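
      On the filename-as-column question above, one common approach (not shown in the video) is to tag each data frame as it's read; the column name source_file here is hypothetical:

        import glob
        import pandas as pd

        # .assign broadcasts the scalar filename down the new column
        all_dfs = [pd.read_csv(one_filename).assign(source_file=one_filename)
                   for one_filename in glob.glob('data/*.csv')]
        df = pd.concat(all_dfs)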

  • @awaisraza2285
    @awaisraza2285 1 year ago +1

    I have a main folder, and in that main folder there are 35 subfolders with names like S01, S02... S35. Each folder has a dataset of the same structure. How can I concatenate that data into one dataframe?

    • @ReuvenLerner
      @ReuvenLerner  1 year ago

      You can! Look up the "recursive" parameter in glob.glob, and you'll see that you can get all of the files that match patterns across multiple subdirectories.
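
      A sketch of that, with main_folder standing in for the actual directory; "**" together with recursive=True matches files at any depth:

        import glob
        import pandas as pd

        # Matches main_folder/S01/*.csv, main_folder/S02/*.csv, and so on
        filenames = glob.glob('main_folder/**/*.csv', recursive=True)
        df = pd.concat(pd.read_csv(one_filename) for one_filename in filenames)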

  • @mikej62
    @mikej62 1 year ago +1

    Nice!!! I've seen several ways to solve this problem, and this is the most efficient I've personally come across!
    Question for you about In [13]: there will be times I want to use this approach because I want to verify the number of rows in each file... how would you do it? TIA!

    • @ReuvenLerner
      @ReuvenLerner  1 year ago +1

      Glad to hear it helped! If you want to verify the number of rows in a data frame, the fastest/easiest way is to ask for len(df.index). You can have an "if" statement in your "for" loop checking how many rows there are in the data frame, and thus know how many you retrieved.
      I don't believe that there's a way to check the number of rows in a file without reading it into a data frame first, at least with read_csv.
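
      A sketch of that check inside the loop, assuming the same data/*.csv layout as in the video:

        import glob
        import pandas as pd

        all_dfs = []
        for one_filename in glob.glob('data/*.csv'):
            new_df = pd.read_csv(one_filename)
            print(f'{one_filename}: {len(new_df.index)} rows')
            if len(new_df.index) > 0:   # keep only non-empty data frames
                all_dfs.append(new_df)
        df = pd.concat(all_dfs)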

    • @mikej62
      @mikej62 1 year ago +1

      @@ReuvenLerner Gotcha!
      What I did was add another print statement, print(len(new_df.index)), before "all_dfs.append(new_df)". I wanted to see how you would approach it.
      All for the purpose of documenting what I started with.
      Cheers!!

  • @pankajgoikar4158
    @pankajgoikar4158 1 year ago +1

    Just awesome.... :)

  • @tomtaetz20
    @tomtaetz20 2 years ago +2

    How would you go about this when the CSV files don't all contain the same column names (some different, some the same)?

    • @ReuvenLerner
      @ReuvenLerner  2 years ago +2

      Ooh, that makes it more interesting!
      If all of the input data frames are guaranteed to contain the columns you want, then you can just select columns with double square brackets + a list of column names in the top line of the comprehension. That is, you can say pd.read_csv(one_filename)[['email', 'phone']].
      But if they have different columns, and/or you have to do transformations, then this technique gets much messier. Perhaps you could call a function in the top line of the comprehension. And the function could find the columns it wants and output them. But it won't be quite as elegant.
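
      A sketch of the first case, with 'email' and 'phone' as the example column names from the reply above:

        import glob
        import pandas as pd

        # Select just the shared columns as each file is read
        all_dfs = [pd.read_csv(one_filename)[['email', 'phone']]
                   for one_filename in glob.glob('data/*.csv')]
        df = pd.concat(all_dfs)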

    • @tomtaetz20
      @tomtaetz20 2 years ago +1

      Awesome! Thank you, I will give it a try

    • @anushkajoshi8427
      @anushkajoshi8427 3 months ago

      I was literally looking for this exact question in the comments before posting mine.

  • @koosschutter1675
    @koosschutter1675 8 months ago +1

    Where can I download some of these large files for testing? I want to split or combine some CSV files, but my basic laptop can't load everything into memory, and I need all the columns. ChatGPT and I came up with appending files, which is slow. I'm not a data scientist; I just want to split and combine CSV files and test my and my PC's capabilities.

    • @ReuvenLerner
      @ReuvenLerner  8 months ago +1

      You can download the (large!) files from Pandas Workout from here: files.lerner.co.il/pandas-workout-data.zip

    • @koosschutter1675
      @koosschutter1675 8 months ago +1

      @@ReuvenLerner Thank you very much.

  • @regal7548
    @regal7548 3 months ago +1

    What if the datasets don't have anything in common? Say one is geological data, one is survey data, one is market analysis, and each of them has a massive number of null values. Also, the unique IDs are different: for example, one table has SLT20284903 and some others just have numbers. What do we do then?

    • @ReuvenLerner
      @ReuvenLerner  3 months ago

      Then you shouldn't be combining them, in this way or any other way! My assumption for this video was that you have a data set broken up across a number of CSV files, each with the same column names and dtypes. You want to take these multiple files and turn them into a single data frame. Pandas provides us with the "pd.concat" function, which is good for such things; the problem is how to read the files into pandas quickly and easily.
      If you have geological data, survey data, and market analysis, then *perhaps* they have some factors in common. But you don't want them in the same data frame. Rather, read each into its own data frame, and use "join" or "merge" to combine them.
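
      A hedged sketch of that last suggestion; the filenames and the site_id join key are purely hypothetical:

        import pandas as pd

        geo_df = pd.read_csv('geological.csv')
        survey_df = pd.read_csv('survey.csv')

        # Combine on a shared key instead of stacking rows with pd.concat
        combined = geo_df.merge(survey_df, on='site_id', how='outer')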

    • @regal7548
      @regal7548 3 months ago +1

      @@ReuvenLerner ok.. thank you

  • @FletchersHeating
    @FletchersHeating 1 year ago

    Thanks for this video!
    Is there a tutorial or information on how to do the same, but keeping multiple data frames? i.e., one CSV = one dataframe.
    Many thanks

    • @ReuvenLerner
      @ReuvenLerner  1 year ago

      Glad it helped! The same code would basically work for what you want, if you don't then concatenate the data frames together at the end. You'll end up with a list of data frames, each from a separate CSV file, with which you can do whatever you want.
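
      In other words, something along these lines (the data/*.csv pattern is illustrative):

        import glob
        import pandas as pd

        # One data frame per CSV file, kept separate in a list
        all_dfs = [pd.read_csv(one_filename)
                   for one_filename in glob.glob('data/*.csv')]
        # all_dfs[0], all_dfs[1], ... can now each be used independently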

  • @guocity
    @guocity 4 months ago +1

    That's really helpful. What about reading multiple CSV files, where reading each new CSV file ignores repeated rows?

    • @ReuvenLerner
      @ReuvenLerner  4 months ago

      To read multiple CSV files, you need to run read_csv multiple times. And I think that if you want to ignore repeated rows, that's something you have to do after creating the data frame, not during the loading process.
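
      A sketch of that after-the-fact approach, using drop_duplicates (which keeps the first occurrence by default); the path pattern is illustrative:

        import glob
        import pandas as pd

        df = pd.concat(pd.read_csv(one_filename)
                       for one_filename in glob.glob('data/*.csv'))
        # Remove rows that appear more than once across the files
        df = df.drop_duplicates().reset_index(drop=True)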

  • @nvduk3
    @nvduk3 1 year ago +1

    If one of the CSV files in between is blank, then it breaks the for loop. How can I avoid that?

    • @ReuvenLerner
      @ReuvenLerner  1 year ago +1

      My best suggestion is to omit any data frame with zero rows from the output list. There might well be better solutions, though!

    • @nvduk3
      @nvduk3 1 year ago +1

      @@ReuvenLerner Yes, the video was of immense help. I got the concept, but I'm facing an issue: I'm trying to run it over 2300+ .csv files, and many of them in between are blank, so it can't find the column names defined in the loop and stops there.
      Manually deleting them is time-consuming & kinda stupid lol. I'm sorry, my Python is only at an intermediate level, but I will try to skip the ones where it can't find a matching column name, as you mentioned. Thanks a lot again!

    • @ReuvenLerner
      @ReuvenLerner  1 year ago +1

      @@nvduk3 Oh, right - if the file is completely blank, then you can't select columns. That is a problem! Unfortunately, that's a tough one to solve. Maybe you could write a function that loads the first 5 lines, and checks to see if there are any columns. If such a function returns True/False, then you can use it in the "if" of your comprehension, and only truly process those that have the columns you want. Yes, you'll end up reading (at least part of) each file twice, but that might still be best. I'm not sure.
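
      One possible way to skip completely blank files: read_csv raises pandas.errors.EmptyDataError when a file has no columns to parse, so a small helper (the name read_if_nonempty is made up for this sketch) can filter them out:

        import glob
        import pandas as pd

        def read_if_nonempty(one_filename):
            try:
                return pd.read_csv(one_filename)
            except pd.errors.EmptyDataError:   # the file is completely blank
                return None

        all_dfs = [one_df
                   for one_df in (read_if_nonempty(f)
                                  for f in glob.glob('data/*.csv'))
                   if one_df is not None]
        df = pd.concat(all_dfs)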

    • @nvduk3
      @nvduk3 1 year ago +1

      @@ReuvenLerner Skipping the column names didn't work, but skipping the blank sheets entirely worked somehow after a bit of trial and error. Thanks a lot, I really appreciate it 👍🏽

  • @KingofDiamonds117
    @KingofDiamonds117 2 years ago +1

    I can't get this to work for me. I need help.

    • @ReuvenLerner
      @ReuvenLerner  2 years ago

      What are you doing, and what error do you get?

    • @KingofDiamonds117
      @KingofDiamonds117 2 years ago

      @@ReuvenLerner I tried doing this:
      import glob
      data = glob.glob['channel1/*.csv']
      and got:
      TypeError Traceback (most recent call last)
      ~\AppData\Local\Temp/ipykernel_13700/502398532.py in <module>
      1 import glob
      ----> 2 data = glob.glob['channel1/*.csv']
      TypeError: 'function' object is not subscriptable

    • @ReuvenLerner
      @ReuvenLerner  2 years ago

      @@KingofDiamonds117 "function object is not subscriptable" means: you're using [] on a function, and you really should be using (). Try that!
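
      That is, the corrected line from the traceback above:

        import glob
        data = glob.glob('channel1/*.csv')   # parentheses, not brackets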

    • @KingofDiamonds117
      @KingofDiamonds117 2 years ago +1

      @@ReuvenLerner I'm not getting any errors this time, thanks. I have poor eyesight so it's difficult for me to see properly.