
Solve Data Science Tasks In Python (Saving Hours) (automating import CSV files to database)

  • Published 18 Aug 2024
  • Solving real-world data science tasks: automating this common task will save you hours of work. I'll cover how to write a Python script to import a CSV or Excel file into a database. In this video I'll specifically show you how to import a CSV file into a Postgres database on AWS. Importing CSV files into a database is a common task in data science and analytics, and it can be done entirely in Python using pandas dataframes, numpy, the os library, and the Postgres database wrapper psycopg2.
    Here's the github repo: github.com/Str...
    This Python script is completely manual. Part 2, where I refactor the script so it can handle multiple files without changing a line of Python code, can be found here ( • Automating Your Data S... ).
    This is also my first Python video, so please leave a comment and let me know what you think. Did I cover the concepts in enough detail? Do you want more detail? Less? Was the code too small to read? Any feedback would be appreciated!
    Topics Covered (a minimal code sketch follows this list):
    - How to read your CSV file into a pandas dataframe
    - How to clean the table name and column headers using methods like replace() and lower()
    - How to convert your pandas dataframe to a CSV file
    - How to create a database table and import your CSV file to a postgres database on Amazon Web Services (AWS) in python
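    A minimal sketch of that full flow, with placeholder file names and credentials and a simplified dtype-to-Postgres mapping (an assumption, not the exact code from the video):

    import pandas as pd
    import psycopg2

    # Read the CSV into a pandas dataframe
    df = pd.read_csv('your_file.csv')

    # Clean the table name and column headers
    table_name = 'customer_contracts'
    df.columns = [c.strip().lower().replace(' ', '_').replace('$', '') for c in df.columns]

    # Map pandas dtypes to Postgres column types
    replacements = {'object': 'varchar', 'float64': 'float', 'int64': 'int',
                    'datetime64[ns]': 'timestamp', 'timedelta64[ns]': 'varchar'}
    col_str = ', '.join('{} {}'.format(n, d) for (n, d)
                        in zip(df.columns, df.dtypes.astype(str).replace(replacements)))

    # Connect (placeholder credentials) and create the table
    conn = psycopg2.connect(host='your-host.rds.amazonaws.com', dbname='your_db',
                            user='your_user', password='your_password')
    cursor = conn.cursor()
    cursor.execute('DROP TABLE IF EXISTS {}; CREATE TABLE {} ({});'
                   .format(table_name, table_name, col_str))

    # Write the cleaned dataframe back to CSV and stream it into the table
    df.to_csv(table_name + '.csv', index=False)
    with open(table_name + '.csv') as f:
        cursor.copy_expert('COPY {} FROM STDIN WITH CSV HEADER'.format(table_name), f)
    conn.commit()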
    ______________________________________________________________________
    Resources:
    - How to set up Anaconda: docs.anaconda....
    - How to install psycopg2: anaconda.org/a...
    👉 Subscribe to my channel: bit.ly/2GsFxmA
    👉 Practice real data science interview questions: platform.strat...
    ______________________________________________________________________
    Timestamps:
    Intro: (0:00)
    Outlining steps & import python libraries: (2:23)
    Reading your CSV file as a pandas dataframe: (4:31)
    Cleaning your table name: (6:03)
    Cleaning your column names: (10:25)
    Generating SQL query to create database table: (12:40)
    Connecting to your database: (19:04)
    Generating SQL queries to import CSV file to database: (21:53)
    Double checking file was imported correctly: (30:50)
    Conclusion: (31:47)
    ______________________________________________________________________
    Contact:
    If you have any questions, comments, or feedback, please leave them here!
    Feel free to also email me at nathan@stratascratch.com
    ______________________________________________________________________

COMMENTS • 106

  • @stratascratch
    @stratascratch  3 years ago +10

    Part 2 where I fully automate the notebook is here ua-cam.com/video/TDwy1lSjEZo/v-deo.html

  • @sahej97
    @sahej97 3 years ago +6

    Love this series of real-world, on-the-job tasks

  • @prateek2159
    @prateek2159 3 years ago +9

    Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For Your Data Science Project", and I really want you to continue making videos for it, because there's literally no one on UA-cam with that kind of guidance on DS projects and I have been looking for it for a very long time. My placements are just 12 months away and I really want to build a full-stack data science project. Thank you.

  • @hameedmulani21
    @hameedmulani21 1 year ago +2

    One of the best tutorials I have ever seen

  • @brentsimpson3791
    @brentsimpson3791 1 year ago

    Mate! I'm a newbie python 'programmer' and this video has to be one of the most useful practical examples I've found. It'll take me a while to absorb it all but THANK YOU!😊
    Brent

  • @lettalkaboutit
    @lettalkaboutit 1 year ago +1

    Your video is fantastic and it helped me, but here is where I struggle, and I don't know if you can help with the following: my entire CSV file needs to go into the results table, but the dataframe created from the CSV has columns that contain foreign keys. The CSV is a mix of values from different tables; each column of my dataframe represents a table. My use case: 1/ First step: I loop through each column and compare each value against the values in the corresponding table in the database; if a value does not exist, I create a SQL INSERT query to add it. After getting all the new values into the database, I do step 2.
    2/ Second step: I fetch the values from each corresponding table in the database, and in each column I replace the data with the table ID (foreign key), then send the whole dataframe to the database.
    Can you help me achieve this please?
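    One possible sketch of that two-step flow, assuming (hypothetically) that each dataframe column has a same-named lookup table with columns (id, name) and an existing psycopg2 conn/cursor; adapt the names to the real schema:

    # Hypothetical sketch: swap raw values for foreign keys, column by column
    for col in df.columns:
        for value in df[col].unique():
            # Step 1: insert values missing from the lookup table
            cursor.execute('SELECT id FROM {} WHERE name = %s'.format(col), (value,))
            if cursor.fetchone() is None:
                cursor.execute('INSERT INTO {} (name) VALUES (%s)'.format(col), (value,))
        conn.commit()
        # Step 2: map each raw value to its table ID (foreign key)
        cursor.execute('SELECT name, id FROM {}'.format(col))
        df[col] = df[col].map(dict(cursor.fetchall()))
    # df now holds foreign keys and can be loaded into the results table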

  • @austincelestine1925
    @austincelestine1925 2 years ago +1

    Holy cow, I wish I had come across this earlier. This helped a lot with importing these CSV files into MySQL. The 2nd video was great too, except my formatting was different for my job. Great work!!! Absolutely love your content, it really helps!

    • @stratascratch
      @stratascratch  2 years ago +1

      Thank you and glad you found this vid useful!

  • @mickyelmb
    @mickyelmb 2 years ago +2

    Excellent tutorial. I was able to follow along in Jupyter with no issues. Thank you for posting it.

  • @alanamantinodasilva4362
    @alanamantinodasilva4362 1 year ago +2

    Awesome. Thank you for this content.

  • @luzestrada9536
    @luzestrada9536 1 year ago +1

    You are the best! 💯 Thank you so much for sharing this valuable information 👏

  • @stratascratch
    @stratascratch  3 years ago +3

    Timestamps:
    Intro: (0:00)
    Outlining steps & import python libraries: (2:23)
    Reading your CSV file as a pandas dataframe: (4:31)
    Cleaning your table name: (6:03)
    Cleaning your column names: (10:25)
    Generating SQL query to create database table: (12:40)
    Connecting to your database: (19:04)
    Generating SQL queries to import CSV file to database: (21:53)
    Double checking file was imported correctly: (30:50)
    Conclusion: (31:47)

  • @rashadnelson1873
    @rashadnelson1873 1 year ago +1

    Thanks again for the tutorial:
    I ran this code:
    # Upload to database
    SQL_STATEMENT = """
    COPY fast_food_data FROM STDIN WITH
    CSV
    HEADER
    DELIMITER AS ','
    """
    cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
    print('File copied to database')
    I received this error:
    InFailedSqlTransaction Traceback (most recent call last)
    Input In [34], in ()
    1 # Upload to database
    3 SQL_STATEMENT = """
    4 COPY fast_food_data FROM STDIN WITH
    5 CSV
    6 HEADER
    7 DELIMITER AS ','
    8 """
    ---> 10 cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
    11 print('File copied to database')
    InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
    Currently stuck. Bummer. I'll keep working it.
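    That error usually means an earlier statement in the same transaction failed, and Postgres then ignores everything until you roll back. A minimal recovery sketch, assuming the same conn, cursor, and my_file objects:

    conn.rollback()   # clear the aborted transaction
    my_file.seek(0)   # rewind the file handle before retrying
    cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
    conn.commit()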

  • @zynkers401
    @zynkers401 2 years ago

    These series are the best! Thanks Nate

  • @arpitakar3384
    @arpitakar3384 1 year ago +1

    I loved it man, that's a serious effort bro

  • @exili
    @exili 3 years ago +1

    Extremely helpful video and a very useful tool, glad I found this on Reddit.

    • @stratascratch
      @stratascratch  3 years ago

      I'm glad you found it useful. You should also take a look at part 2 where I talk about automating the entire process (ua-cam.com/video/TDwy1lSjEZo/v-deo.html). Part 3 will come out soon and will cover using software development techniques to polish off the code and share with your team.

  • @gregNFL
    @gregNFL 2 years ago

    10/10 Clear, concise.

  • @Davidkiania
    @Davidkiania 2 years ago

    This is really good thanks. You’ve saved me lots of manual hours.

  • @TheSwil325
    @TheSwil325 3 years ago

    I really appreciate the quality of your videos! 💯

  • @smsyrandomacts1725
    @smsyrandomacts1725 3 years ago

    Simply amazing knowledge transfer series. Great job Nate 👍

  • @radhakrishnanaik5656
    @radhakrishnanaik5656 2 years ago

    Hey Nate, your way of explaining is really very good. I would like to request a video on fetching a database table into a CSV file with Python code, in the form of a pipeline.

  • @RedShipsofSpainAgain
    @RedShipsofSpainAgain 3 years ago +1

    Great video Nate. A few suggestions:
    6:58 It is clearer to use a variable name like "file_name" rather than the more ambiguous name "file".
    8:31 To improve readability, consider putting each .replace() call on a separate line followed by a line-continuation backslash. And perhaps consider consolidating all the replace statements into a single replace() call using, say, a regex.
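    For example, a single-regex version of that cleanup might look like this (a sketch, not the exact replacements from the video):

    import re

    # Lowercase, collapse whitespace/dashes into underscores, drop everything else
    def clean_name(name):
        name = re.sub(r'[\s\-/\\]+', '_', name.strip().lower())
        return re.sub(r'[^a-z0-9_]', '', name)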

    • @stratascratch
      @stratascratch  3 years ago +1

      Thanks so much for your input. I'll keep those in mind in my next python video. Keep the feedback coming. Also, if you have any requests on topics, I'd be happy to try to cover them on this channel.

    • @RedShipsofSpainAgain
      @RedShipsofSpainAgain 3 years ago +1

      @@stratascratch happy to help and offer constructive feedback! Admittedly my advice is opinionated, but I've found adapting certain details like these really help to prevent headaches for debugging and for future development of your code. Sort of like a set of "best practices". Love your videos, btw. Keep up the great work Nate.

    • @stratascratch
      @stratascratch  3 years ago +1

      @@RedShipsofSpainAgain Thanks so much for the kind words. I'm hoping to get into more python code but it takes sooo long to record compared to SQL stuff. Hope to have another python series in Jan or Feb. Your "best practices" are great advice and I'll keep these in mind as I build my solutions out.

  • @akshayjadiya
    @akshayjadiya 3 years ago +2

    Great Video Nate! Can you make a video playlist on how to get started with cloud services or provide relevant reading material for that here in comments?

    • @stratascratch
      @stratascratch  3 years ago +4

      Absolutely! I'd start simple and try to create a database on AWS first. If you're not going to be creating apps, there's no reason to use the other services like EC2. Here's a nice overview on how to create one (medium.com/edureka/rds-aws-tutorial-for-aws-solution-architects-eec7217774dd). In reality it probably won't be so simple. But a lot of learning this stuff is trial and error. My videos always seem simple but there are actually hours and hours of troubleshooting that never make it to the video. Maybe I should add a few minutes of "outtakes" showing all the errors I get in my next video =)

  • @StefanoVerugi
    @StefanoVerugi 1 year ago +2

    As a beginner I find this video useful and well explained, thanks for posting it
    To remove non alphanumerical characters can't you use re.sub from re?
    like: print(re.sub("[^a-zA-Z.0-9]+","",string))
    more or less you have the same result, anything wrong with this method?
    cheers

    • @stratascratch
      @stratascratch  1 year ago +2

      Thanks for the kind words. Yes, you can use re.sub just fine!

    • @cubano100pct
      @cubano100pct 1 year ago

      @@stratascratch There is also the issue of collapsing whitespace into a single _ character, then removing every other character that isn't a-z, 0-9, or _:
      tablename = re.sub(r'[\x00-\x1F\s\-]+', '_', filename.lower())
      tablename = re.sub(r'[^0-9a-z_]*', '', tablename)

  • @adithya5604
    @adithya5604 1 year ago +1

    Thank you so much 🎉

  • @sureshkumarramachandran6993
    @sureshkumarramachandran6993 3 years ago +1

    Great job!

  • @Jay-eh5rw
    @Jay-eh5rw 2 years ago

    Thank you very much. Great help:)

  • @cristiansoto7581
    @cristiansoto7581 1 year ago

    Hi Nate, thank you for this valuable information. Could you make more videos about this?
    I'm subscribed to StrataScratch and it's amazing.

  • @junaidmalik9593
    @junaidmalik9593 2 years ago

    Fantastic video, thanks Nate, you are awesome.
    I just struggled understanding the col_str part at 17:50 of the video, maybe because I am new to this!

  • @gambu4810
    @gambu4810 2 years ago +3

    Hi Nate. Your work is so good, I can't believe my eyes. Can you please let me know how to make a connection to SQL Server instead? It would really help if I knew how to do this in SQL Server Management Studio instead of Postgres.

    • @ihhdina
      @ihhdina 6 months ago +1

      I also need help with importing CSV to SQL.

  • @javierjdaza
    @javierjdaza 3 years ago

    How the fuck does this video only have 1,000+ views? You deserve more visibility man, you're a fucking genius and great at explaining things. Go ahead, you got a new sub.

    • @stratascratch
      @stratascratch  3 years ago

      I guess not a lot of people like to see me code =) But, honestly, thanks for the kind words. I'll keep trying to code and provide thoughtful explanations to my solutions and approaches. This python series took a really long time and I had probably like 20 pages of notes that I covered across the 3 videos.

  • @sergeitokarev6033
    @sergeitokarev6033 2 years ago

    If someone has problems with connecting to the database, just edit your VPC Security Group and allow all IPs. It helped in my case.

    • @stratascratch
      @stratascratch  2 years ago

      Wonderful suggestion!

    • @tacy91
      @tacy91 2 years ago

      Yea. Please can you help? How do I edit it? I've been trying since yesterday. Thanks

  • @komaldiwe5667
    @komaldiwe5667 2 years ago +1

    Hi Nate, how can we do cursor.copy_expert(sql=SQL_STATEMENT, file=my_file) for MySQL???
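    copy_expert() is psycopg2-specific, so it won't exist on a MySQL cursor. One rough MySQL equivalent is LOAD DATA LOCAL INFILE, sketched here with placeholder credentials and file/table names (it assumes local_infile is enabled on the server):

    import mysql.connector

    conn = mysql.connector.connect(host='localhost', database='your_db',
                                   user='your_user', password='your_password',
                                   allow_local_infile=True)
    cursor = conn.cursor()
    cursor.execute("""
        LOAD DATA LOCAL INFILE 'customer_contracts.csv'
        INTO TABLE customer_contracts
        FIELDS TERMINATED BY ',' ENCLOSED BY '"'
        IGNORE 1 LINES
    """)
    conn.commit()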

  • @sasongkosan8832
    @sasongkosan8832 2 years ago

    Very good :) Nice tutorial...

  • @yassw1995
    @yassw1995 2 years ago +1

    Hi Nate, this video is amazing and concise, thank you so much. However, I'm trying to learn this using as few advanced Python libraries as possible for a class, so I can't use pandas or the like. I'm limiting myself to the mysql.connector and csv libraries. For this reason, MySQLCursor does not accept the "cursor.copy_expert" syntax. What would be a good alternative method to insert the contents of the CSV file we saved during that step into the SQL database?
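    Without pandas, a plain csv-plus-executemany sketch could look like this (placeholder file, table, and credentials; it assumes the table already exists and its columns line up with the CSV):

    import csv
    import mysql.connector

    conn = mysql.connector.connect(host='localhost', database='your_db',
                                   user='your_user', password='your_password')
    cursor = conn.cursor()
    with open('your_file.csv', newline='') as f:
        reader = csv.reader(f)
        header = next(reader)  # skip the header row
        placeholders = ', '.join(['%s'] * len(header))
        cursor.executemany('INSERT INTO your_table VALUES ({})'.format(placeholders),
                           list(reader))
    conn.commit()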

  • @sd0n0158
    @sd0n0158 2 years ago +1

    Do you know if there is a workaround for Redshift without using an S3 bucket? It doesn't allow COPY FROM STDIN.

    • @stratascratch
      @stratascratch  1 year ago

      No, I don't know of such a workaround. Have you tried asking on AWS re:Post or Stack Overflow?

  • @chyang0107
    @chyang0107 1 year ago

    Hello Nate, I appreciate your valuable work! Could you share your insights on two questions? First, what is 'STDIN' within the SQL statement """Copy Customer_contract from STDIN With ...""" around (28:51)?
    Second, if I would like to insert values into the table from an Excel file, would my SQL statement be
    """
    Copy Customer_contract from STDIN With
    Excel
    Header
    """
    Am I on the right track?
    Thanks,
    Hank
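    For what it's worth: STDIN there means psycopg2 streams your local file object through the database connection, rather than Postgres reading a path on its own server. COPY has no Excel format, so the usual workaround is converting the sheet to CSV first. A sketch, assuming an existing conn/cursor and placeholder file names:

    import pandas as pd

    # Convert the Excel sheet to CSV, then COPY the CSV as usual
    df = pd.read_excel('customer_contracts.xlsx')
    df.to_csv('customer_contracts.csv', index=False)
    with open('customer_contracts.csv') as f:
        cursor.copy_expert('COPY customer_contract FROM STDIN WITH CSV HEADER', f)
    conn.commit()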

  • @CawlCawl1
    @CawlCawl1 3 years ago +2

    Amazing video - thank you! I am using MS SQL Server Management Studio, and thus have to use pyodbc, and I get the error "'pyodbc.Cursor' object has no attribute 'copy_expert'" on the line of code "cursor.copy_expert(SQL_STATEMENT, file=my_file)". Is there any alternative code I can use since I'm using pyodbc?

    • @stratascratch
      @stratascratch  3 years ago +1

      If you're going to use pyodbc, you'll probably need to INSERT your values row by row rather than using something like copy_expert(), which takes the entire file and does the INSERT all at once. Here's the documentation on INSERTs for pyodbc (docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-3-proof-of-concept-connecting-to-sql-using-pyodbc?view=sql-server-ver15). Every database is different and requires a different Python library. Hope this helps.
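      A rough sketch of that row-by-row INSERT with pyodbc, using fast_executemany to batch the round trips (placeholder DSN and file, and a hypothetical 3-column table; adjust the ? placeholders to your column count):

      import csv
      import pyodbc

      conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                            'SERVER=your_server;DATABASE=your_db;'
                            'UID=your_user;PWD=your_password')
      cursor = conn.cursor()
      cursor.fast_executemany = True  # send inserts in batches, not one per round trip
      with open('your_file.csv', newline='') as f:
          reader = csv.reader(f)
          next(reader)  # skip the header row
          cursor.executemany('INSERT INTO your_table VALUES (?, ?, ?)', list(reader))
      conn.commit()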

  • @prabirkumarsahoo6368
    @prabirkumarsahoo6368 1 year ago

    Could you tell me where you copied the 'AWS connection string' from?

  • @darshpancholi4919
    @darshpancholi4919 2 years ago

    I use a MacBook M1 and I am unable to import the psycopg2 library in my Jupyter notebook. I have tried almost all the solutions provided on YouTube and Stack Overflow. Can someone help me out?

  • @sai6837
    @sai6837 1 year ago +1

    Hi, I am using MySQL... what should I use instead of copy_expert?

    • @stratascratch
      @stratascratch  1 year ago

      I don't use MySQL that often in these situations, but shouldn't execute() or executemany() work?

    • @sai6837
      @sai6837 1 year ago +1

      @@stratascratch no worries I tried and it worked

  • @AndyLeal-pz7xw
    @AndyLeal-pz7xw 1 year ago

    This is incredibly helpful. I have a question: I'm using paramiko to connect to an SFTP server to read CSV files from there; how would I go about setting the cwd to the SFTP server? I can only navigate to local directories currently.

    • @stratascratch
      @stratascratch  1 year ago

      Thanks for the kind words! Is there a way to navigate to the directory of the SFTP server? It might be a permissions issue that you can have your IT dept resolve. If they are able to allow you to access the SFTP server via your script, then you're all set. Otherwise, you'll need to manually move the files.

    • @AndyLeal-pz7xw
      @AndyLeal-pz7xw 1 year ago

      @@stratascratch Thanks for the quick feedback! I checked permissions and everything seemed fine. I ended up switching to pysftp and it worked on the first try! keep up the great work!

  • @BelowIce
    @BelowIce 2 years ago

    Love this video. Can you share why you copy the dataframe to a csv, and then import the csv to the database? Is it possible to directly reference the values in the dataframe and use those to populate the insert statements?

    • @stratascratch
      @stratascratch  2 years ago +1

      You can definitely just upload the df to the database. I used a function that required a csv to import to a db. But there are other approaches like you mentioned that would likely work as well.
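      For the direct route, one common sketch is pandas' to_sql() with an SQLAlchemy engine (placeholder connection string and table name):

      from sqlalchemy import create_engine

      engine = create_engine('postgresql://your_user:your_password@your-host:5432/your_db')
      df.to_sql('customer_contracts', engine, if_exists='replace', index=False)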

  • @jerrywang1550
    @jerrywang1550 2 years ago +1

    The connecting-to-AWS part is difficult

    • @stratascratch
      @stratascratch  2 years ago

      I hear you, could be tricky.

    • @tacy91
      @tacy91 2 years ago

      Yeah. I've been trying for 2 days. Is there any solution for that? The error: could not connect to server: Connection timed out.

    • @tacy91
      @tacy91 2 years ago

      @Andrew Levinton Yeah I did....I don't know what else to do. Thanks for your reply🙏

  • @thelatecomer4245
    @thelatecomer4245 2 years ago

    Hey, can you give a solution for exporting a table from SQL Server to a local file with a multi-character delimiter?

  • @CodingNinja-lt8ge
    @CodingNinja-lt8ge 1 year ago

    Hi Nate,
    Thanks for the wonderful explanation. I tried to create the SQL table with your data and code using a Jupyter notebook. The code runs fine but it failed to create the table in the Postgres database. I am wondering what the reason could be. In pgAdmin I found this under the SQL tab. I would appreciate your hints to help me figure out why I am not getting the desired result although the code works:
    Database: aw
    -- DROP DATABASE IF EXISTS aw;
    CREATE DATABASE aw
    WITH
    OWNER = postgres
    ENCODING = 'UTF8'
    LC_COLLATE = 'English_Finland.1252'
    LC_CTYPE = 'English_Finland.1252'
    TABLESPACE = pg_default
    CONNECTION LIMIT = -1
    IS_TEMPLATE = False;

    • @stratascratch
      @stratascratch  1 year ago

      Maybe you need to commit the transaction to the db? It's a Postgres thing to have to do that.

  • @arpittrivedi6636
    @arpittrivedi6636 2 years ago +1

    What about Excel?

    • @stratascratch
      @stratascratch  2 years ago

      That can also work. But for now, we can use Python.

  • @MichelUNITED
    @MichelUNITED 3 years ago +1

    Hello, I ran this code:
    col_str = ", ".join("{} {}".format(n, d) for (n, d) in zip(dta.columns, dta.dtypes.replace(replacements)))
    The error received was that it cannot compare types ndarray(dtype=object) and str. Please help.

    • @stratascratch
      @stratascratch  3 years ago

      Check out the notebook here: github.com/Strata-Scratch/csv_to_db_automation Hopefully that helps you troubleshoot.

    • @stratascratch
      @stratascratch  3 years ago +1

      But the error sounds like you have an object dtype which is expected but you're trying to compare that to a string. It leads me to believe that your "col_str" might be wrong? Or your zip() code? Hard to say without really seeing your code in its entirety.

    • @MichelUNITED
      @MichelUNITED 3 years ago

      @@stratascratch Hello, thank you for the reply. I did exactly what you did; I followed your lines one after the other.
      datype_str = ", ".join("{} {}".format(n, d) for (n, d) in zip(dta.columns, dta.dtypes.replace(replacements)))
      This is the line. Sorry, but I don't see any kind of comparing going on.

    • @stratascratch
      @stratascratch  3 years ago

      @@MichelUNITED I believe there might be something wrong with the dta values or column headers. You should also check your replacements array too to make sure you're getting the same output as I did. See here for info stackoverflow.com/questions/53478932/cannot-compare-types-ndarraydtype-int64-and-str

    • @MichelUNITED
      @MichelUNITED 3 years ago

      @@stratascratch Can I send you my code to have a look at, please? I hope you see this. I am working on something related to your video.
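      A likely fix for that exact error: newer pandas versions refuse to compare dtype objects against the string keys in replacements, so convert the dtypes to strings first. A sketch using the same variable names:

      replacements = {'object': 'varchar', 'float64': 'float', 'int64': 'int',
                      'datetime64[ns]': 'timestamp', 'timedelta64[ns]': 'varchar'}
      col_str = ", ".join("{} {}".format(n, d) for (n, d)
                          in zip(dta.columns, dta.dtypes.astype(str).replace(replacements)))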

  • @martithon
    @martithon 2 years ago

    [SOLVED] At 16:10 I was getting a " ', ' KeyError " and had no idea how to solve it. Any tips?
    edit: I had misplaced the closing bracket of the join function

  • @its_me7363
    @its_me7363 3 years ago

    Thanks for these amazingly helpful videos, something different from other YouTubers. Can we use VS Code to connect to the AWS database and write queries? I don't want to download another piece of software just for the database.

    • @stratascratch
      @stratascratch  3 years ago +1

      I think you can use VS Code to connect to a db but I've never used that platform before so I don't know how. But seeing as how VS Code is for developers, it's likely you can use it.

    • @its_me7363
      @its_me7363 3 years ago

      @@stratascratch I am trying to use a VS Code extension named SQLTools but it is not able to connect to the AWS RDS PostgreSQL database I created as you showed.

    • @stratascratch
      @stratascratch  3 years ago +1

      @@its_me7363 You might need to find another platform that allows you to connect to an AWS RDS. Another reason could be that the database you created isn't configured right. There could be a firewall in place or closed ports. Try to see if you can connect to the RDS another way.

  • @full_bladder
    @full_bladder 1 year ago

    You did not blur out your password. If it was an important db or a password you've used for other accounts, change it.

  • @mariusbotezatu6365
    @mariusbotezatu6365 1 year ago

    Try the same thing with 1M rows in the CSV file :)

    • @stratascratch
      @stratascratch  1 year ago +1

      It'll break! You'll need to import in batches. I think there's a parameter for batching in one of the import functions in python. =)
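      The batching parameter referred to above is most likely chunksize in pandas (an assumption); a sketch of batched loading with an SQLAlchemy engine and placeholder names:

      import pandas as pd
      from sqlalchemy import create_engine

      engine = create_engine('postgresql://your_user:your_password@your-host:5432/your_db')
      # Read and load the big CSV in 100k-row batches instead of all at once
      for chunk in pd.read_csv('big_file.csv', chunksize=100_000):
          chunk.to_sql('customer_contracts', engine, if_exists='append', index=False)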

  • @limichelle6895
    @limichelle6895 3 years ago

    Hi, I have a question. How do I upload only the newest CSV file in a folder? Thanks

    • @stratascratch
      @stratascratch  3 years ago

      Sorry, not sure what you mean?

    • @limichelle6895
      @limichelle6895 3 years ago

      @@stratascratch Hi, thanks for getting back to me. So I have a folder that gets a new CSV file every single day. I would like to upload only the newest data to MySQL. Can you help me? Thanks

    • @stratascratch
      @stratascratch  3 years ago +1

      @@limichelle6895 You can just upload it to Colab every day. That's the manual way. Otherwise, you could automate it by specifying the name of the CSV file in your code. If there's a new CSV file every day, I would make sure that file has the current date at the end of its name. Then have your code find the file name with the current date. The script will find the file and import it.

    • @limichelle6895
      @limichelle6895 3 years ago

      And my CSV files come with a new date every day, for example 280721, 290721 and so on
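      A sketch of that file-picking logic, either parsing a date stamp like 280721 out of the name or simply taking the most recently modified CSV (hypothetical folder and naming scheme):

      import glob
      import os
      import pandas as pd

      # Pick the most recently modified CSV in the folder
      newest = max(glob.glob('your_folder/*.csv'), key=os.path.getmtime)
      df = pd.read_csv(newest)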

  • @huseyindansman7415
    @huseyindansman7415 2 years ago

    It's really slow to insert big data into an MS SQL Server db using the pyodbc library; any idea you have to make it faster would be really useful.

    • @stratascratch
      @stratascratch  2 years ago

      I feel like it's always a slow process no matter what db engine. There are engines that allow you to multi-thread the processes but otherwise, batch processing is the only thing that comes to mind. There's usually a parameter setting for batch loading in whatever python library you're using to import your data

    • @huseyindansman7415
      @huseyindansman7415 2 years ago

      @@stratascratch Thank you so much for your reply, but I found a way to make it faster using a different parameter, inserting 16 thousand rows within 20 or 30 seconds.

  • @rashadnelson1873
    @rashadnelson1873 1 year ago +2

    Good video. I'm presently looking for something that will process a massive dataset with thousands of rows of data, then upload that data into a database. I've yet to run across a tutorial that processes and uploads a massive dataset into a database. I'm running into some interesting errors.
    1st Code Attempt: df = pd.read_csv('fast-food-data-2020.csv')
    1st Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    2nd Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', encoding='UTF-8')
    2nd Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 52676: invalid continuation byte
    3rd Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', encoding='latin-1')
    3rd Error: DtypeWarning: Columns (12,13,14,15,16,17,18,19,20,21,22,24,25,26,27,28,29,30,37,49,52) have mixed types. Specify dtype option on import or set low_memory=False.
    df = pd.read_csv('fast-food-data-2020.csv', encoding='latin-1')
    4th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', error_bad_lines=False, index_col=False, dtype='unicode')
    4th Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    5th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', low_memory=False)
    5th Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    6th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', engine='python')
    6th Code Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3524: invalid continuation byte
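    The 0xe9 byte is 'é' in Latin-1, so the file simply isn't UTF-8; the latin-1 attempt was on the right track and just needs to be combined with an explicit dtype to silence the mixed-type warning. A sketch:

    # latin-1 fixes the decode error; dtype=str avoids the mixed-type warning
    df = pd.read_csv('fast-food-data-2020.csv', encoding='latin-1', dtype=str)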

  • @bennguyen1313
    @bennguyen1313 5 months ago

    What if the data file is not in a perfect tabular / csv format? For example, I have a large text file like this:
    #Time (HHH:MM:SS): 002:34:02
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    #Time (HHH:MM:SS): 002:34:03
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    How can I get just the data from DT00 through DT07 into an array, without doing lots of preprocessing to scrub out the repeating #Time headers that appear throughout the file?
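    One sketch: since every header line starts with '#', pandas can drop them via the comment parameter and split rows on whitespace. Mapping the ragged trailing tokens to DT00-DT07 still depends on which status fields your rows actually contain, so treat the slice below as a guess to adjust:

    import pandas as pd

    # Skip the repeating '#' header blocks and split rows on whitespace
    df = pd.read_csv('bus_log.txt', comment='#', sep=r'\s+', header=None)
    # If the DT words are always the trailing tokens, slice them off the end;
    # tune the count once you know which status fields can be absent per row
    dt_data = df.iloc[:, -8:].to_numpy()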