Importing a Pandas Dataframe to a Database in Python [For Your Data Science Project]

  • Published 21 Sep 2024

COMMENTS • 105

  • @IrakliChitishvili
    @IrakliChitishvili 3 years ago +6

    How the hell is this channel not more popular?

    • @stratascratch
      @stratascratch  3 years ago +1

      I know right?! I guess not many people like to watch me code...

    • @IrakliChitishvili
      @IrakliChitishvili 3 years ago +4

      @@stratascratch I think it's just a matter of time before your channel and the work you do take off.

    • @flipcase
      @flipcase 3 years ago +1

      I second that!

  • @andrealves3104
    @andrealves3104 1 month ago

    I was having weird issues inserting data with the pandas to_sql method into a PostgreSQL database on AWS while passing it through a SQLAlchemy engine.
    For now, adding it manually through a SQL query while iterating row by row in pandas worked fine, both for my project's purposes and for the video's goal.
    All in all, thank you very much for the overall explanation.
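
The row-by-row fallback the commenter describes can be sketched as follows. This is a minimal, hedged example: it uses the stdlib `sqlite3` module as a stand-in for a Postgres connection (with `psycopg2` the placeholders would be `%s` instead of `?`), and the `videos` table and columns are made up.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"video_id": ["a1", "b2"], "views": [100, 250]})

conn = sqlite3.connect(":memory:")  # stand-in for a psycopg2 connection
conn.execute("CREATE TABLE videos (video_id TEXT PRIMARY KEY, views INTEGER)")

# Insert one row at a time -- slower than a bulk load for big frames,
# but easy to debug when to_sql misbehaves
for _, row in df.iterrows():
    conn.execute(
        "INSERT INTO videos (video_id, views) VALUES (?, ?)",
        (row["video_id"], int(row["views"])),
    )
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
print(count)  # 2
```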

  • @toilinginobscurity3091
    @toilinginobscurity3091 2 years ago +3

    There's really no rhyme or reason in the universe. This channel and these tutorials are gold, but instead I am bombarded by ML channels that play with toy datasets and brand themselves as all-you-need clickbait. You, Sir, are a saint.

    • @stratascratch
      @stratascratch  2 years ago

      haha =) I know exactly what you mean. Too many of those channels around.

  • @prateek2159
    @prateek2159 3 years ago +18

    Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For Your Data Science Project", and I really want you to continue making videos for this particular series, because there's literally no one on YouTube with such guidance on DS projects and I have been looking for one for a very long time. I have my placements just after 12 months and I really want to make a full-stack data science project. Thank you.

    • @stratascratch
      @stratascratch  3 years ago +6

      Thanks for the kind words! I have created a series of these videos, from importing a pd df to a db to grabbing data from an API. I think most of the videos in the series should get you started on grabbing and managing data. Are there other types of skills you're interested in?

  • @Networkprofessor
    @Networkprofessor 1 year ago +2

    Just what I needed! This will take you to a whole new level. Thank you.

    • @stratascratch
      @stratascratch  1 year ago

      You are welcome. Thank you for the nice comment. 😀

  • @jaquelinerazzo
    @jaquelinerazzo 2 years ago

    I already played back the video like 4 times just to watch the ads, because I don't know how else to thank you! Great channel, great video. :)

  • @awds121
    @awds121 2 years ago +1

    Hey Nate! Great video! I have some feedback. First, the videos are very helpful and you should definitely continue doing more videos like this. Here are some ideas: how to schedule scripts to run on AWS; create tables, analyze data, and create a dashboard. Second, the playlist that includes this video is unordered. It is not hard to figure out which video should be watched first, but it wouldn't hurt to sort it in the order you want the videos to be watched. Third, AWS doesn't include PostgreSQL in the free tier (at least not anymore). Just a heads up. Keep up the good work!

  • @zynkers401
    @zynkers401 2 years ago

    Great video Nate! You really have the spirit for teaching. The vocabulary and content are easy to understand and follow. I hope StrataScratch continues its exponential growth.

    • @stratascratch
      @stratascratch  2 years ago +1

      Thanks for the kind words! And thanks for watching!

  • @RenatoCosta1
    @RenatoCosta1 1 year ago +1

    Hey Nate, amazing stuff! I think it would be great if you could do two more videos covering the topics you've mentioned in your video about the only project one would need to get a job in data science. I guess the next videos would be about machine learning modelling and the last one on how to deploy it online. That would be awesome! Thanks anyway

    • @stratascratch
      @stratascratch  1 year ago +2

      We're planning to publish 1 data project a month! When they are released, it'd be great if you could provide feedback. For example, did we dive deep enough into the technical aspects of the project? Did we dive too deep? Is the video too long/too short? etc. This will help us iterate on the videos so we can provide something of value for all viewers. Thanks for the feedback!

  • @its_me7363
    @its_me7363 3 years ago

    A much-awaited video for me... thanks for your time and effort, Nate

    • @stratascratch
      @stratascratch  3 years ago

      Thanks for watching! I just updated the description with a link to the github repo that has the final version of this project. I'd definitely take a look at that and follow along with the notebook. I felt a bit too rushed on this video trying to explain all the concepts. But if there's anything to take away from this video, it's (1) how to properly structure your functions and (2) how to solve for memory & performance issues when potentially dealing with millions and billions of rows of data.
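
The memory-and-performance point above can be sketched with pandas' `chunksize` option, which streams a file in fixed-size pieces instead of loading everything at once. A minimal example, assuming a hypothetical two-column CSV held in memory:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk
csv_data = io.StringIO("video_id,views\na1,100\nb2,250\nc3,75\n")

# Process the file in fixed-size chunks so memory use stays bounded,
# even if the real file has millions of rows
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    # in a real pipeline each chunk would be inserted into the db here
    total += len(chunk)

print(total)  # 3
```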

    • @its_me7363
      @its_me7363 3 years ago

      @@stratascratch How do industry people manage large data stored in a database while exploring it? What are the best practices for data scientists or analysts in these cases?

    • @stratascratch
      @stratascratch  3 years ago +1

      @@its_me7363 The dbs are optimized to handle large amounts of data with speed. You definitely won't be using any Postgres or MySQL dbs. You'll be using Hive or Greenplum or Snowflake. Also, depending on what you are trying to do, you will likely move away from pandas and onto PySpark for some work. All of this is taught when you join the company. Most people know that not all entry-level DS have used these tools.

    • @its_me7363
      @its_me7363 3 years ago

      @@stratascratch Are these dbs similar to PostgreSQL? And what do you mean by moving away from pandas... does industry not use pandas in its projects?

    • @stratascratch
      @stratascratch  3 years ago

      @@its_me7363 Yes, the dbs are similar to Postgres. There are just minor differences in syntax, but it's easy to learn. The industry uses pandas, but when dealing with large amounts of data you sometimes want to work with something that can handle it, and that's PySpark.

  • @javierjdaza
    @javierjdaza 3 years ago +2

    Brother, I never miss a video. This is pure gold!! Keep doing this please. Can I connect with you on LinkedIn?
    Thanks bro, you're the best

    • @stratascratch
      @stratascratch  3 years ago

      Thanks so much for watching! And glad you like my videos =) Yes, please connect with me on LinkedIn. You can search for StrataScratch and I should pop up (I think?). If not, let me know.

    • @javierjdaza
      @javierjdaza 3 years ago

      @@stratascratch I can't find you. :(

    • @stratascratch
      @stratascratch  3 years ago

      @@javierjdaza Sorry about that. Here you go! linkedin.com/in/nathanrosidi

  • @JeffersonCanedo
    @JeffersonCanedo 2 years ago +1

    I'm so glad I found this video. Thank you!

  • @quadrialli3715
    @quadrialli3715 1 year ago

    Great and detailed explanation

  • @PATRICKCHUAD
    @PATRICKCHUAD 3 years ago

    Thanks Nate for your video on pandas. I will apply this to my project in Python. Thanks.

    • @stratascratch
      @stratascratch  3 years ago +1

      Great! Happy to help, thanks for watching!

  • @TripPiper
    @TripPiper 2 years ago

    If your connection to AWS fails, try creating a new inbound security group rule that allows all access.
    But also learn how to create a connection with TCP or some form of security, because in production you'd never have inbound connections open to all access

    • @denyseperezdevera7188
      @denyseperezdevera7188 1 year ago

      Hello! Do you have any idea why it returns "database "database-1" does not exist"?

  • @codingstyle9480
    @codingstyle9480 2 years ago +11

    Hi, thank you for the video. I have a question: how would you go about updating your database in real time without manually running the code? Perhaps we could design the code so that it is triggered when something changes in the source data, inserting or updating data automatically in real time (perhaps with some delay, of course)

    • @muhammedmarong6645
      @muhammedmarong6645 2 years ago

      Hey, did you figure that out?

    • @pashlenenaidoo3695
      @pashlenenaidoo3695 2 years ago

      I would also like to know if you have achieved this

    • @stratascratch
      @stratascratch  2 years ago +1

      You would need something like Airflow or Jenkins to schedule the trigger.

    • @caioriet458
      @caioriet458 2 years ago

      I would like to know if you solved this problem!

    • @stratascratch
      @stratascratch  2 years ago +1

      @@caioriet458 Yes, I left a comment. You'll need to use a scheduler like Airflow in order to keep updating your db without manually running the code. The scheduler will run it for you.

  • @bodhid3091
    @bodhid3091 2 years ago +1

    Good video. A couple of questions:
    Why do you store the new rows into another dataframe rather than calling insert_into_table directly?
    Any reason you don't use the postgresql INSERT ON CONFLICT option to create an upsert, combining the insert and update into one call?
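
The upsert the commenter mentions can be sketched like this. It is shown against stdlib `sqlite3`, which accepts the same `ON CONFLICT ... DO UPDATE` clause as PostgreSQL (SQLite 3.24+, bundled with modern Python); the `videos` table is hypothetical, and with psycopg2 the placeholders would be `%s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (video_id TEXT PRIMARY KEY, views INTEGER)")
conn.execute("INSERT INTO videos VALUES ('a1', 100)")

# Upsert: insert a new row, or update views when the video already exists,
# combining the INSERT and UPDATE passes into one statement
upsert = """
    INSERT INTO videos (video_id, views) VALUES (?, ?)
    ON CONFLICT (video_id) DO UPDATE SET views = excluded.views
"""
conn.execute(upsert, ("a1", 150))  # existing id -> update
conn.execute(upsert, ("b2", 200))  # new id -> insert
conn.commit()

rows = conn.execute("SELECT video_id, views FROM videos ORDER BY video_id").fetchall()
print(rows)  # [('a1', 150), ('b2', 200)]
```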

  • @jessinthamathew3032
    @jessinthamathew3032 3 years ago +1

    Hi Nate! Really appreciate your content, its relevance and quality. So much to learn with each video! I wanted to ask why you opted for PostgreSQL specifically? Can I use MS SQL Server too?

    • @stratascratch
      @stratascratch  3 years ago +2

      I'm just used to postgres and it's open source. MS SQL is fine to learn as well. In the future, the platform will be able to handle both db engines but currently it only uses postgres. Thanks for watching!

  • @hogthebuilder
    @hogthebuilder 2 years ago

    2:20 in and I know that this is about to be the video for me.

  • @MartinoxxHD
    @MartinoxxHD 2 years ago +1

    Nice vid, I actually learned new things! One question: what's the point of creating a temporary DF for the new videos and then UPDATE-ing them in a new for loop? Can't we just UPDATE and INSERT the values in the same for loop?

    • @stratascratch
      @stratascratch  2 years ago +1

      I hold the new videos in a temp df so that I only need to upload the new videos or replace existing videos with new metrics. I'm trying to avoid updating all videos, for performance. Then I do the update and insert all at once, but only for the new vids. Hope that makes sense. Thanks for watching

  • @GauravKumar-xl1fs
    @GauravKumar-xl1fs 3 years ago +1

    Please make more videos in this series and on pipelines

    • @stratascratch
      @stratascratch  3 years ago

      Glad you liked the videos. Will do if there are more views!

  • @williamausenka9251
    @williamausenka9251 2 years ago +1

    Great video! One question, what ML implementation would you do with that data? I'm trying to come up with something, but no success so far.

  • @noobshady
    @noobshady 2 years ago

    That was really helpful.

  • @skyace173
    @skyace173 2 years ago

    Hey Nate! Great video! How long would you expect this to take if it were a project at work?

    • @stratascratch
      @stratascratch  2 years ago

      This is probably a 1-2 day project depending on what is being asked of me. But it shouldn't take too long if you already know what you're doing.

  • @elextures
    @elextures 5 months ago

    I'm having a problem with the channel id. In CHANNEL_ID I put my own channel, but each time I press response, I get a different channel

  • @wisjnujudho3152
    @wisjnujudho3152 2 years ago +1

    Hi Nate, is there any alternative to AWS RDS, a cloud database server which doesn't require a credit card even on the free tier? Alternatively, is it possible (maybe) to use Heroku Postgres as the database? Thx

    • @stratascratch
      @stratascratch  2 years ago +1

      Hi Wisjnu, I found this article helpful.
      You might want to check it out: www.lastweekinaws.com/blog/10-free-cloud-databases-you-should-consider-and-1-you-shouldnt/

  • @khairulhakimi508
    @khairulhakimi508 2 years ago

    I'm new to this field. I have one noob question. Why do people (DA, DE, DS) use Python (pandas DataFrame) for data ingestion to transfer to SQL? Why not just upload the CSV into SSMS using a query or the import flat file wizard?

    • @stratascratch
      @stratascratch  2 years ago +1

      It's basically to allow processing or manipulation of the column names and data, if you want to do that.

  • @suprithm3802
    @suprithm3802 2 years ago

    I have a CLOB data type; I am not able to update data, but the append function is working fine

  • @SensiStarToaster
    @SensiStarToaster 2 years ago

    Nice video. But writing SQL directly gets messy to maintain. Why not use something like SQLAlchemy to separate your Python code from having to worry about the details of the specific SQL implementation, and make it independent of the type of database manager used?

    • @stratascratch
      @stratascratch  2 years ago +1

      Yes, you can definitely do that. I haven't used SQLAlchemy but I have seen the documentation and agree with you. It could be a much better solution.

  • @ruthfussee5503
    @ruthfussee5503 2 years ago

    Hi Nate, thank you for this video. Super useful; I was scanning the internet for such a solution but could not find it. I'm wondering: if there are many columns (e.g. 100), wouldn't it be inefficient to type in row['column1'], ..., row['column100']? Is there a way we can put the column information in a tuple, list, or dictionary and pass all of it through, e.g. after SET, or as vars_to_update?
    Another issue is where the Excel column names have spaces (different from the SQL columns); how do I then write the code? I keep getting syntax errors.
    Lastly, in the real world the Excel files sometimes don't follow the template we stipulated. How should I design a useful validation check?
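
All three questions above can be sketched in a few lines: build the column list and placeholders from the frame instead of typing them, normalize Excel-style headers, and fail early if the file doesn't match the template. The table name `people` and the expected columns are made up; with psycopg2 and untrusted column names, `psycopg2.sql.Identifier` is the safer way to quote identifiers.

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ada"], "Total Views": [100]})

# Normalize Excel-style headers (spaces, mixed case) into SQL-friendly names
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Validation check: fail early if the file doesn't match the template
expected = {"first_name", "total_views"}
missing = expected - set(df.columns)
assert not missing, f"file is missing columns: {missing}"

# Build the column and placeholder lists once, instead of typing 100 names
cols = ", ".join(df.columns)
placeholders = ", ".join(["%s"] * len(df.columns))
query = f"INSERT INTO people ({cols}) VALUES ({placeholders})"
print(query)  # INSERT INTO people (first_name, total_views) VALUES (%s, %s)
```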

  • @mcmoodoo
    @mcmoodoo 2 years ago

    Why not just use pandas.to_sql("table_name", engine), which will create the table based on the dataframe object? You can also pass a types dictionary to specify the exact types for the columns, like varchar and integer.

    • @stratascratch
      @stratascratch  2 years ago +1

      Yes, you can do that. It doesn't always work the way I want when I use to_sql(), and I often like to have QA checks along the way, so my method helps with that. But there are multiple ways to do what I'm doing. I would choose the option that allows you to get your work done!
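
For reference, the `to_sql` + `dtype` approach from the question looks roughly like this. The sketch runs against stdlib `sqlite3` so it is self-contained (pandas accepts a raw sqlite3 connection with string dtypes); against Postgres you would pass a SQLAlchemy engine and `sqlalchemy.types` objects instead of strings. The `videos` table is made up.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"video_id": ["a1", "b2"], "views": [100, 250]})

conn = sqlite3.connect(":memory:")

# to_sql creates the table from the frame; dtype pins the column types
df.to_sql("videos", conn, index=False,
          dtype={"video_id": "TEXT", "views": "INTEGER"})

rows = conn.execute("SELECT video_id, views FROM videos ORDER BY video_id").fetchall()
print(rows)  # [('a1', 100), ('b2', 250)]
```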

  • @reality8221
    @reality8221 2 years ago

    Hi Nate, thanks for the great videos. Wouldn't it be simpler to insert it as part of the first for loop, instead of creating a df and then iterating through it once again? Would like to know your thoughts.

    • @stratascratch
      @stratascratch  2 years ago

      My thinking with the temp df was to only update the new videos and metrics, and leave the rest alone. I think that's what I did in this project. Hope that makes sense.

  • @suprithm3802
    @suprithm3802 2 years ago

    The code is not updating; it's only appending the new values

  • @denyseperezdevera7188
    @denyseperezdevera7188 1 year ago

    Hi Nate, thank you so much for these amazing resources! Do you know why, when I try to connect to the database, I get "OperationalError: FATAL: database "database-1" does not exist"?

    • @stratascratch
      @stratascratch  1 year ago +1

      Do you have a db? Did you create one for the project? You need to add your credentials.

    • @denyseperezdevera7188
      @denyseperezdevera7188 1 year ago +1

      Just solved my issue! I had switched the DB instance identifier with the DB username.

    • @artyomashigov
      @artyomashigov 22 days ago

      @@denyseperezdevera7188 Oh thanks, I had the same issue; my username and dbname are the same

  • @yashgupte3044
    @yashgupte3044 2 years ago

    I'm getting a connection timed out error while connecting to the RDS instance. I have allowed all traffic in the inbound rules of the security group.

    • @stratascratch
      @stratascratch  2 years ago

      Hi Yash, this link could help troubleshoot: aws.amazon.com/premiumsupport/knowledge-center/rds-oracle-connection-errors/

    • @jakubjankovic4144
      @jakubjankovic4144 2 years ago

      Have you managed to solve it? I am having the same issue

    • @llamashockz
      @llamashockz 2 years ago

      Having the same issue for hours; nothing seems to fix it

    • @jakubjankovic4144
      @jakubjankovic4144 2 years ago

      @@llamashockz I ended up using ElephantSQL instead of AWS to host the db and it resolved the issue for me.

  • @hoanglam2814
    @hoanglam2814 2 years ago

    Hi Nate, thanks for your videos. I have to retrieve data from almost 100 APIs and your video helps a lot. But normally it takes me almost 10 hours to update all the data/records each day.
    Yesterday I came across some articles about webhooks, and they mentioned that webhooks push data, whereas we have to pull data from an API. What do you think about using webhooks to update data and insert new data into the table? It would only retrieve data when events happen, instead of retrieving all the data as when pulling through the API.
    For the API, I think I can use it to retrieve old data from last year.

    • @stratascratch
      @stratascratch  2 years ago +1

      How many records do you have where it would take 10 hrs a day? If it takes that long, I wouldn't implement this method. Seems like you have a lot of data. I would use Airflow and set up DAGs to get those jobs scheduled.
      Without an understanding of what type of data you're trying to move and for what purpose, I can't tell you if webhooks are the right way

    • @hoanglam2814
      @hoanglam2814 2 years ago

      @@stratascratch I retrieve about 40 thousand records each day; these are orders from a CRM system. My company helps around 100 clients manage their business, and each of them has one account in this CRM provided by a third party (which means 100 API sources). Actually, we are building an analytics system to work on it (including an ETL process and dashboard), but until then I have to collect data from the 100 APIs one by one. So I tried to implement your method and it works.
      But, yeah, like you said, it took too much time. I also did research on Airflow, but I don't know the difference between using Airflow and using a cron tool like Power Automate/Zapier. For me, Power Automate/Zapier is much easier to learn.

    • @stratascratch
      @stratascratch  2 years ago +1

      @@hoanglam2814 The difference between Airflow and something like Zapier is that Zapier is good for low-volume jobs that a non-technical person can set up. Airflow is the industry standard and can help manage huge databases and data lakes.
      For your use case, an automated ETL process where you collect data from APIs seems like it should work just fine for 40K records. Just make sure you automate the pulls, since there are 100 of them. You can try webhooks if you'd like; I just don't have any experience using webhooks for your use case.

    • @hoanglam2814
      @hoanglam2814 2 years ago

      @@stratascratch Thank you for your short and concise answer :)) It is really helpful!

  • @acsrr4288
    @acsrr4288 2 years ago +1

    What library should we use to load data to an Azure cloud DB?

    • @stratascratch
      @stratascratch  2 years ago

      You can use various libraries and frameworks to connect.

    • @acsrr4288
      @acsrr4288 2 years ago

      @@stratascratch Can you recommend a couple for us to explore? Thx

  • @Digital-Light
    @Digital-Light 1 year ago

    I'm getting this error: "DatatypeMismatch: column "upload_date" is of type date but expression is of type double precision
    LINE 3: ... VALUES('mTL23Gd-T3g','Engagement_Ring_Animation','NaN'::flo..." Please help

    • @stratascratch
      @stratascratch  1 year ago +1

      The error is saying that you have a dtype mismatch in the upload_date column. So clean up the values in the column and get it to a dtype that the db is expecting.

    • @Digital-Light
      @Digital-Light 1 year ago +1

      @@stratascratch Thank you for the reply! I love your videos.

  • @amadysvlog3449
    @amadysvlog3449 2 years ago

    Great video! Thank you! But a question:
    How do I get access to the Amazon RDS free databases? I created an account on AWS but can't find any database-yt.

    • @stratascratch
      @stratascratch  2 years ago +1

      You'll need to create your own db. That's a whole topic by itself. But you can try it by going on AWS and then creating a db in the RDS service.

  • @andrew6233
    @andrew6233 2 years ago

    Newbie question: what is the database manager you show at 11:00? Thank you so much for these videos

    • @stratascratch
      @stratascratch  2 years ago +1

      It's Datagrip!

    • @stratascratch
      @stratascratch  2 years ago +1

      It's my favorite BTW. I hate the in-browser db managers.

    • @andrew6233
      @andrew6233 2 years ago

      @@stratascratch Thanks! It looks really nice + convenient to work with compared to what I've been using.

  • @Davidkiania
    @Davidkiania 2 years ago

    Hi Nate, any chance you might show us how to automatically run the script? Is it a cron job, or how does that work? Thanks

    • @stratascratch
      @stratascratch  2 years ago

      Yeah, it's essentially a cron/scheduled job. I use Airflow at work but I've seen others use Jenkins.

  • @andrewraad5626
    @andrewraad5626 2 years ago

    Hey Nate, thanks for this video. How do you deal with foreign keys? My CSV file has strings instead of the foreign keys needed to upload to a specific table that uses foreign key constraints. What do you suggest?

    • @stratascratch
      @stratascratch  2 years ago

      I'm not sure how to fix your situation. I would just change the strings to FKs. It seems some data manipulation will be needed for your issue.

  • @ShahnazMalik.
    @ShahnazMalik. 2 years ago

    Inserting or updating a table with millions of records in a SQL database takes hours to complete. What is the best solution? Please advise.
    Thank you.

    • @stratascratch
      @stratascratch  2 years ago +1

      That's normal. Sometimes jobs just take hours to complete. I would optimize for ensuring that the process doesn't break, and if it does, that you don't lose all the work. One way to do this is to batch the updates/inserts, so that if the process does break halfway through, you have at least updated/inserted half the records.
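
The batching idea above can be sketched as follows, with stdlib `sqlite3` standing in for the real database; the batch size, table, and sample rows are made up. Committing per batch means a crash loses at most one uncommitted batch, not the whole run.

```python
import sqlite3

rows = [(f"id{i}", i * 10) for i in range(10)]  # stand-in for millions of records

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (video_id TEXT PRIMARY KEY, views INTEGER)")

BATCH = 4
inserted = 0
for start in range(0, len(rows), BATCH):
    batch = rows[start:start + BATCH]
    conn.executemany("INSERT INTO videos VALUES (?, ?)", batch)
    conn.commit()  # commit per batch: a failure loses at most one batch
    inserted += len(batch)

print(inserted)  # 10
```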

  • @hardikvig2723
    @hardikvig2723 1 year ago

    Where do I get the name of the database?

    • @stratascratch
      @stratascratch  1 year ago

      It's your db that you need to setup so it can be any name you give it.

  • @nqomuleya
    @nqomuleya 2 years ago

    InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')