Sir, I have a requirement where I have reusable code to run for different files and need to pass the filename to the code from blob storage, as a parameter. Can you help me?
Raviteja.. I have already covered groupby and orderby.. the ones you have mentioned are RDD functions, and Spark is making DataFrame functions primary going forward. Not sure if you really need to learn the RDD functions, as 98% of the time the DataFrame functions are easy and will do the job
Hi Sir, instead of a string regex, how do I do a numeric regex? For example: a username field has abc12def, and I need only the characters, i.e. abcdef. Could you please help me?
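A standard regex handles this; there is nothing special about "numeric". A plain-Python sketch below (the same pattern should also work in Spark as regexp_replace(col, "[0-9]+", ""), though that part is an untested assumption here):

```python
import re

# Remove every run of digits, keeping only the letters.
cleaned = re.sub(r"[0-9]+", "", "abc12def")
print(cleaned)  # abcdef
```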
Hi Imran.. Have you seen the playlist below, where I am adding videos in sequence? Please see if it helps ua-cam.com/play/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI.html
I used this to convert string to integer:
from pyspark.sql.types import IntegerType
df = df.withColumn("loan_amnt", df["loan_amnt"].cast(IntegerType()))
I can see in the schema that loan_amnt is now changed to int type, but when I run the command below:
quantileProbs = [0.25, 0.5, 0.75, 0.9]
relError = 0.05
df_sel.stat.approxQuantile("annual_inc", quantileProbs, relError)
I get the error: "java.lang.IllegalArgumentException: requirement failed: Quantile calculation for column annual_inc with data type StringType is not supported." Can you please help here?
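One thing worth checking in the snippet above: the cast is applied to loan_amnt, but approxQuantile is called on annual_inc, which is still a string, hence the StringType error; the column you query is the one that must be numeric. A plain-Python sketch of the idea, using a hypothetical annual_inc sample and a crude nearest-rank quantile (not Spark's approxQuantile algorithm):

```python
# Cast the strings to numbers first, then take quantiles on the casted values.
raw_annual_inc = ["31000", "50000", "72000", "90000"]   # hypothetical sample
values = sorted(float(v) for v in raw_annual_inc)

def nearest_rank(sorted_vals, p):
    # Crude nearest-rank lookup, just to illustrate why numbers are required.
    idx = min(int(p * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]

quants = [nearest_rank(values, p) for p in (0.25, 0.5, 0.75, 0.9)]
print(quants)  # [50000.0, 72000.0, 90000.0, 90000.0]
```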
Raghu, is the file path correct that you loaded? Can you check if it is similar to below or missing something:
# File location and type
file_location = "/FileStore/tables/LoanStats_2018Q4.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)
display(df)
@@AIEngineeringLife never mind, I used the "Create Table in Notebook" option that Databricks provides and it worked.. strange.. thanks for your reply
Hi, when I ran df_sel.stat.cov('annual_inc', 'loan_amnt') I got this error: "java.lang.IllegalArgumentException: requirement failed: Currently covariance calculation for columns with dataType string not supported." I realised loan_amnt and annual_inc are showing as string in the schema. I followed all the steps as per your video; can you tell me what I missed? In your video the schema commands looked like they showed integers, but when I ran the schema command it shows these 2 columns as string, hence the error.
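For what it's worth, the error means exactly what it says: covariance needs numeric columns, so both columns would have to be cast (e.g. with .cast('double')) before calling cov. What the computation does, in plain Python on two small made-up columns:

```python
# Sample covariance of two numeric columns; string values must be converted first.
annual_inc = [float(v) for v in ["1", "2", "3"]]  # the cast-equivalent step
loan_amnt  = [float(v) for v in ["2", "4", "6"]]

n = len(annual_inc)
mean_x = sum(annual_inc) / n
mean_y = sum(loan_amnt) / n
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(annual_inc, loan_amnt)) / (n - 1)
print(cov)  # 2.0
```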
Biswajit.. Have you tried iterating over the columns and checking for null in each column in Scala? That is what I am doing in Python as well. I think the map function can do that; I will try it out and paste the exact syntax later in the week. Any reason for using Scala? From Spark 2.3 onwards pyspark is almost on an equal footing with Scala
Hello Sir, for the 2018Q4 data we have to slice the original Loan.csv, which contains (2260668, 145), right? Kaggle gave me a 2 GB zip file. These are the steps I did on my local machine:
df = read the whole (2260668, 145) file
LoanStats_2018Q4 = df[(df['issue_d']=="Oct-2018") | (df['issue_d']=="Nov-2018") | (df['issue_d']=="Dec-2018")]
LoanStats_2018Q4.shape  # (128412, 145)
LoanStats_2018Q4.to_csv('/path/LoanStats_2018Q4.csv', index=False)
Then I will upload this to Databricks
I just ran it with a subset so users need not wait on the video for every instruction, but in your case you can use it all or subset it as well.. The whole file would have made my video run for an additional hour :)
The dataset I have for Lending Club has noise in the first row, and the header starts from row 2. I am not able to skip the first row and set the 2nd row as the header; any input on how to do this?
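One common approach, sketched in plain Python below: drop the first line, then parse the rest with the header taken from what was row 2. In Spark, the analogous trick would be to read the file as text, filter out the first line, and feed the remaining lines to spark.read.csv; that Spark part is an untested assumption, and the file contents here are made up:

```python
import csv
import io

# A file whose first row is noise and whose real header sits on row 2.
raw = "Notes offered by Prospectus\nid,loan_amnt\n1,1000\n2,2000\n"

lines = raw.splitlines()[1:]                      # skip the noise row
reader = csv.DictReader(io.StringIO("\n".join(lines)))
rows = list(reader)                               # header is now id,loan_amnt
print(rows[0]["loan_amnt"])  # 1000
```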
The describe and null-count output is not readable most of the time; doesn't that pose a big problem in industry projects? I have a dataset with hundreds of columns, so how do I view describe or the null counts for all of them in Spark?
Hello Sir. I have given file_location = r"C:/Users/dipanja/Desktop/data science/LoanStats_2018.csv", which is the path of the LoanStats csv file on my system, but while trying to execute it I get the error 'java.io.IOException: No FileSystem for scheme: C'. Can you please help me fix this?
We are using this command: df = spark.read.format(). I haven't worked on Spark, but from the syntax I can say this is Spark's method of reading a DataFrame. We are typing this command in a Jupyter notebook, which by default is Python-compatible; to use other languages we have to use a magic command at the top. Then how are we able to use Spark in Python? Is this pyspark, or something else?
Ajeet, pyspark is enabled by default in the notebook, so you get the Python packages loaded in Databricks by default; for other languages we need the magic command. I did not get your question completely, though
@@AIEngineeringLife You solved my doubt, though. My question was "how are we using Spark in Python without using a magic command?", and as your answer suggested, it's pyspark that we are using and not Spark directly.
@@hemanthdevarapati519 yes, it is lazy evaluation. It will get loaded to cache when I call the first subsequent action below. I thought I might have an action somewhere further down; is that not so? I might not be doing it explicitly with the cache command
Hi RK, I have mentioned it in my FAQ at the link below, along with the scenarios for which I will be sharing notebooks www.linkedin.com/pulse/course-launch-scaling-accelerating-machine-learning-srinivasan/ In some cases I will share it at my git link a few months after the video. Sorry in case you don't get the notebook immediately after a video
I have an entire course on Apache Spark which is free.. Why do you want to pay for mentorship when I have covered all that is required - ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html . Just practice along with the videos and you should be good
Manoj. Can you check the pinned comment of this video? I have given the link to the dataset there. This dataset is huge, so I could not push it to git due to the size limit
Thank you so much for the reply. I want to get trained under your guidance; could you help me? Could you please tell me the order in which to watch your video lectures, as I am a beginner?
Manoj.. If you go to my channel and then the playlists tab, you can see multiple playlists; pick the area of your interest. To start with, you can learn from the end-to-end ML playlist, which talks about the lifecycle of ML projects. It is purely theory but good to know before getting into the details
Thank you so much for your response, but I want to become end-to-end full stack, so please help me with the order of your playlists to follow. I am from a banking background, so please do help me with the transition
Manoj.. I do not have video coverage of the basics of ML.. so I would suggest going through Andrew Ng's ML course on Coursera; that will be helpful, and once done you can check my courses on NLP, Computer Vision and Time Series
Many have asked for the file I used for this video- You can download it from here -
drive.google.com/file/d/1e6phh7Df8mzYoE-sBXPVJklnSt_wHwkq/view?usp=sharing
Remove the last 2 lines from the csv file
Is it the %sql magic command that makes SQL statements available? In that case, is it only possible to use SQL when using Databricks, making SQL unavailable for Python scripts? Please correct me and also share any input you may have
Hello Sir, did you set any permissions on this file? I am unable to open it. I tried to open it in OneDrive Office online; it says conversion error.
@@harshithag5769 Nope, it is open for all. Can you open it in Google Drive and see?
@@AIEngineeringLife I don't have Excel on my PC. I tried opening it through OneDrive Live and Google Sheets; both say there is an error opening the file. I was able to open the other files you have provided in the github repository, but when I tried to upload this file to a dataset in Databricks it threw an error.
How do I see the entity relationship diagram in Databricks or pyspark, just as we see it in MySQL? Please help me with this.
Thank you so much, Sir. Millions of blessings from every student who watches this. I was looking for real resources to learn Spark, and your content saved a lot of effort from being wasted and put it in the right direction. Thanks a lot, and please never stop creating such wonderful content for us.
You are welcome, Himanshu, and thanks for such nice and encouraging words; they drive me to create more such content :)
This is HUGE! Gems of wisdom for a Machine learning aspirant. Excellent. Thank you very much.
I've been trying to get up to speed on Databricks and Spark for two weeks now, and I just learned 10x as much in 1 hour as I did in the previous 2 weeks. Thank you!
Glad to know Kenny this was useful. All the best on your learning journey
So how is the Databricks service?
I mean if I use it, what's their billing policy?
Pay per use
Pay per activity
Pay per minutes/hours of use
Pay per data size
Can you let me know?
Thanks for the great tutorial. The Data Science community needs more people like you SS :)
This is the best video content I have ever seen on UA-cam with respect to real-time scenarios.... Thanks a lot, Sir. Please do more to help us..
Truly love your channel! Such a wealth of information and brilliantly explained. Thank you for providing this real world example. It was exactly what I needed to elevate my spark skills. You're a terrific instructor.
Tremendous effort and knowledge can be seen in your video. Thank you
Superb video. In 40 minutes, you covered pretty much everything :) . Please upload more videos
I have a complete course on Apache Spark in the playlists section of my youtube channel. Have you seen it?
@@AIEngineeringLife Yep. seen and subscribed as well :)
Awesome. It's not a video series, it's an entire course, I must say. I really appreciate your hard work and teaching technique, thanks. Sir, keep it up.
One request, I think from many students like me: please upload the notebook, Sir, so that it saves a little time too. Thanks
Thanks Shubham.. :) .. The code is already available in my git repo - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife
Hi thank you for your video.
Just wanted to ask if this video is about data profiling or data wrangling using Pyspark?
@@norpriest521 This video is more on profiling/cleaning of data, but I have detailed videos on wrangling in my Apache Spark playlist
@@AIEngineeringLife
I couldn't find the video regarding data wrangling in your list.
Could you please let me know the title of the video?
@@norpriest521 It is named as data engineering in this playlist. There are 2 parts to it ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html
Amazing explanation!! Thank you very much!
Excellent content, precise and to the point
Thanks very much for the detailed hands-on. It helps!
This is amazing, waiting for your next video on Spark Analysis and cleaning
Next part on EDA using spark is here
ua-cam.com/video/X6OkT2YPZVs/v-deo.html
Wonderful video, Sir; I was looking for such content for many days. Thanks a ton
👍
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show() and
df.filter(df.col.isNull()).count() should return the same result, right?
The first command is giving 0 nulls, whereas the second is giving some nulls for the same column. Can you please help?
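They don't have to match: isnan() only flags floating-point NaN values, while isNull() flags SQL NULLs, and a column can contain one without the other; the usual fix is to test both conditions, e.g. count(when(isnan(c) | col(c).isNull(), c)). A plain-Python illustration of why the two counts can differ:

```python
import math

# A column can hold a float NaN, a SQL-style NULL (None here), or both.
column = [1.0, float("nan"), None, 4.0]

nan_count  = sum(1 for v in column if v is not None and math.isnan(v))  # NaN only
null_count = sum(1 for v in column if v is None)                        # NULL only
print(nan_count, null_count)  # 1 1
```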
Now, this is what I was looking for :)
This is amazing. Thank you very much!
Thank you so much sir for this detailed video.
These videos are awesome and a great help. Thanks. You are doing a wonderful job for people like us who have just entered the industry.
I just wanted to ask: I have worked on ML models using Python but have not worked with Apache Spark. Will I face any difficulty doing the same thing here in Spark?
Nope. It should be a smooth transition. Remember the Spark ML pipeline is inspired by scikit-learn pipelines, so the process is similar. The only time-consuming part will be understanding distributed architectures, which might take time
A suggestion: when you load the dataset, if it is not the same as the one shared on Kaggle, please also let us know what transformations and filtering you performed, so that we can get the same or similar results as we follow along.
I am sorry if the dataset is not the same.. I did not do any transformation; rather, I downloaded it from Lending Club directly at the link below
www.lendingclub.com/info/statistics.action
Earlier the download was open to all, but sometime after I downloaded it they made it sign-in based, and hence I referred to Kaggle thinking it should be similar. From my end I did not make any changes to the dataset I got from Lending Club. Are you facing any particular issue? A few people have reached out in the past for clarification and were able to execute all the commands successfully
Great stuff, really enjoying the hands-on videos.
I have one input, not a big constraint; I guess in the last part, when you are creating the permanent table, the data frame should be df_sel_final instead of df_sel
Thank you Aniket.. You are right.. Maybe did it in a hurry.. Good catch :)
Sir, I had a question: why are we creating a temporary table every time for SQL functions? In pyspark, the main advantage is that we can use the SQL-style functions directly on the dataframe as well,
for example: loan_dfsel.groupby("loan_status").count().orderBy(col("Count").desc()).show(), where 'loan_dfsel' is a dataframe.
Please enlighten me if I'm wrong....
Tanisha.. Is your question why I am using dataframe functions rather than SQL functions in pyspark? If so, then yes, SQL is an easy way of processing data in Spark, but for iterative processing dataframe functions are very powerful and simple. Typically in projects we use a combination of SQL and df functions. In this case I wanted to show dataframe functions, but in future videos I have covered SQL as well. Underneath, both SQL and df functions compile to the same plan, so performance should not differ
Hi, thank you for the informative videos. I'm just getting started with Spark, and the code seems easy to understand. What other aspects of Spark should I read through for a better understanding?
Thank you for the tutorial. I am just curious: while dealing with the revol_util column, we find the average while the column is a string, use it to replace "null" values, and then cast the column to "double". Would there be a difference if we cast the values to "double" first, then took the average and replaced the nulls? Hoping to get your insights on this.
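One way to see why the order usually should not change the result: values that cannot be parsed as numbers become null either way, and the average is taken over the remaining parseable values. A plain-Python sketch with hypothetical revol_util-style values (this deliberately ignores edge cases such as Spark's implicit string-to-double casts inside avg):

```python
def to_double(s):
    # Mimic casting a string column to double: unparseable values become None.
    try:
        return float(s)
    except (TypeError, ValueError):
        return None

raw = ["10.5", "20.0", None, "n/a"]

# Cast first, then average the non-null values (avg() skips nulls).
parsed = [to_double(s) for s in raw]
valid = [v for v in parsed if v is not None]
avg = sum(valid) / len(valid)
print(avg)  # 15.25
```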
Nicely explained
Thanks for your work, Could you please upload all this series video in playlist?
Check this playlist. It is available in it, and I will be adding upcoming videos to it as well
ua-cam.com/play/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI.html
Thanks for the amazing video. Would you provide a link to the notebook for practice?
Thanks.. Should be in this repo - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife Thank you so much
thanks for this tutorial
Thank you, Sir, for responding to my comments and clearing my doubt. I have one more doubt: I am using the regexp_replace function, and when changing the order of the strings I see 2 different outputs. With 'Years' in 1st place it is trimmed completely
from the output, but if I interchange them, putting 'Years' in 2nd place and 'Year' in first, the 'S' is not trimmed in the output. Please refer to the screenshot :)
Kush.. I did not get any screen snapshot here
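Without the screenshot, this looks like regex alternation order at work: alternatives are tried left to right, so with 'year' listed first it matches inside 'years' and leaves the trailing 's' behind. Python's re follows the same rule as the regex engine behind regexp_replace, so a plain-Python demonstration:

```python
import re

s = "10 years"

# "years" tried first: the whole word is removed.
removed_whole = re.sub(r"years|year", "", s)   # "10 "

# "year" tried first: it wins inside "years", stranding the final "s".
stranded_s = re.sub(r"year|years", "", s)      # "10 s"
print(repr(removed_whole), repr(stranded_s))
```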
@@AIEngineeringLife sir, I keep getting attribute errors like "'NoneType' object has no attribute 'groupby'"
awesome !! very useful
I need to connect to Azure Data Lake and load the data here in Databricks. Do you have any document that supports this?
Kunai.. Nope, I have not done any on Azure yet
Where did you create the spark session? I don't see the initialization of the spark variable. Can you please explain more on this?
that is awesome content, big thanks, man !
All the topics you have covered in this Spark series.. how close are they to real-time projects (at MNCs like IBM, CTS, Google etc.)? Just asking
Hema, the "Master Spark" course I have on my channel was made to bring out the real-world scenarios one faces in industry. It takes a use-case-based approach rather than a function- or API-based approach. Many professionals working with Spark have also benefited from this course, as they were able to upskill themselves in the specific areas they had to work on. I am saying this not just because I created it; you can compare the coverage with other courses and pick one that works for you
@@AIEngineeringLife Thanks, Srivatsav! Looking forward to learning more from your channel
Thank you for the dataset
Hi Sir, I'm not understanding the exact purpose of using Spark. As per my understanding, in a one-word answer, Spark is used for data analysis or data preparation; am I correct?
Spark is used for an end-to-end pipeline, starting from data processing (cleaning, preparation) through machine learning and advanced analytics. The reason we need Spark is that as your input data grows, typical tools like Pandas start failing to handle the volume and computation. Spark can work on TBs of data, whereas Pandas is limited to a few GBs if you are looking at large-scale ML computation
@@AIEngineeringLife Thank you so much sir finally my doubt is cleared
Hello Sir, in this video you said the data type needs to be changed manually when using the 'corr' & 'cov' functions. Please help me: how can we change the data type?
Kush.. you can use a custom schema when you load the data, or after loading you can use CAST or astype to change from one data type to another
@@AIEngineeringLife Thank you, Sir, for guiding me at each step.. I have done it and am now able to CAST the datatype.
Thank you very much for the very informative videos.
Could you please let us know what programming language(s) are used in this video?
Is it Spark, or Scala, or pyspark, or pysql? (I don't know any of these.)
I only know Python, including NumPy and Pandas.
So would you recommend knowing the relevant languages as a prerequisite, so that I feel at ease when a real-world problem is given? Any courses you recommend are also fine.
Thank you.
Most of my Spark videos are on pyspark and Spark SQL. Python is a good start, as the pyspark syntax is similar to pandas with slight variations; the only difference is that it is distributed. You can check my entire Spark course on youtube to learn Spark
ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html
@@AIEngineeringLife Ok Sure. Thank you for the swift response.
For Pandas in Python, do we have something dedicated like regexp_extract, or something that cleans data from within the values, or does conventional regex have to be employed?
Check this
ua-cam.com/video/0V8bQ70HM0U/v-deo.html
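To the pandas question above: yes, the .str accessor has dedicated regex helpers (str.extract, str.replace, str.contains and friends), so conventional re loops are rarely needed. A small sketch:

```python
import pandas as pd

s = pd.Series(["abc12def", "xy34z"])

letters = s.str.replace(r"\d+", "", regex=True)   # drop the digit runs
digits  = s.str.extract(r"(\d+)")[0]              # pull the first digit run

print(letters.tolist())  # ['abcdef', 'xyz']
print(digits.tolist())   # ['12', '34']
```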
@@AIEngineeringLife Thanks :)
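Besides the video linked above, a quick pointer: Pandas does have dedicated vectorised regex helpers on the `.str` accessor, similar in spirit to Spark's regexp_extract/regexp_replace, so a manual loop with the `re` module is rarely needed. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series(["36 months", "60 months"])

# Like Spark's regexp_extract: pull the numeric part out of each value
months = s.str.extract(r"(\d+)", expand=False).astype(int)

# Like Spark's regexp_replace: clean a pattern out of the values
cleaned = s.str.replace(r"\s*months", "", regex=True)

print(months.tolist())   # [36, 60]
print(cleaned.tolist())  # ['36', '60']
```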
Hello @aiengineering, I have a question: is it possible to run an Oracle MERGE statement in the Oracle database (Oracle tables) using Python libraries, for example via "spark.sql.write.format('jdbc').options"...?
Sir, I would like to put this Lending Club type problem/project on my resume.
Can I consider the same columns even for other projects of the same kind?
Shaik, you can, but if you are looking to expand it with external datapoints then you can check my video - ua-cam.com/video/Rk_nGgsPQII/v-deo.html
In that video I show how you can use external data sources and combine them with a Lending Club kind of dataset.
Sir, how do I count nulls in PySpark like the Pandas command?
How do I delete an entire column?
Venkat, it is there in my video in case you missed it:
from pyspark.sql.functions import col, count, isnan, when
df_sel.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_sel.columns]).show()
To delete a column you can use df.drop("column_name").
What's the 3rd argument in the regexp_replace function?
It is the replacement string - whatever you want every match of the pattern (the second argument) to be replaced with.
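In other words, regexp_replace(column, pattern, replacement) takes the replacement string as its third argument. Python's re.sub uses the same three pieces (in a different order), which makes a handy mental model; a tiny example with made-up data:

```python
import re

# Spark:   regexp_replace(col("term"), "[^0-9]", "")
# Python:  re.sub(pattern, replacement, string)
# The replacement ("" below) is what every match gets swapped for.
assert re.sub(r"[^0-9]", "", "36 months") == "36"
```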
Hello Sir,
I appreciate your effort and time to teach us.
I am facing a "Job aborted" error when trying to create a permanent table at the end of the analysis. Is there a workaround for this?
Hi Dinesh.. Thank you.. Can you please paste the error you are getting?
@@AIEngineeringLife
org.apache.spark.SparkException: Job aborted.
Py4JJavaError: An error occurred while calling o3407.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:201)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:555)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:216)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:175)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:126)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:150)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:191)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:117)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:115)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1$$anonfun$apply$1.apply(SQLExecution.scala:112)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:217)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:98)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:74)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:169)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:508)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:487)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:430)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 76.0 failed 1 times, most recent failure: Lost task 0.0 in stage 76.0 (TID 974, localhost, executor driver): java.rmi.RemoteException: com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 17327); nested exception is:
com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 17327)
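The QUOTA_EXCEEDED message at the bottom points at the fix: the Community Edition DBFS quota (10,000 files / 10 GB) has been hit, so files need to be cleaned up with dbutils.fs before saveAsTable can succeed. A hedged sketch - the paths are examples, and dbutils only exists inside a Databricks notebook:

```python
# See what is taking up space (path is an example)
display(dbutils.fs.ls("/FileStore/tables/"))

# Remove files you no longer need; recurse=True deletes a whole directory
dbutils.fs.rm("/FileStore/tables/old_upload.csv")
dbutils.fs.rm("/user/hive/warehouse/old_table/", recurse=True)
```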
Sir, I have a requirement where I have reusable code to run for different files, and I need to pass the filename from blob storage to the code as a parameter. Can you help me?
What is the problem you are facing? You can pass the filename as a runtime parameter to Spark and trigger multiple Spark jobs with different file names.
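A minimal sketch of that reusable-job pattern. The script name, argument handling and the Spark read are illustrative assumptions, not code from the video; the job would be launched as e.g. spark-submit process_file.py <blob_path>:

```python
import sys

def build_input_path(argv):
    """Return the input file path passed on the command line."""
    if len(argv) < 2:
        raise ValueError("usage: process_file.py <file_path>")
    return argv[1]

# Inside the job body you would then read it (Spark only), e.g.:
# df = spark.read.format("csv").option("header", "true") \
#          .load(build_input_path(sys.argv))
```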
Sir, can you make a tutorial on functions like groupByKey, sortByKey, reduceByKey, join...?
Raviteja.. I have already covered groupBy and orderBy. The ones you mentioned are RDD functions, and Spark is making DataFrame functions primary going forward. I am not sure you really need to learn the RDD functions, as 98% of the time the DataFrame functions are easier and will do the job.
Hi Sir,
Instead of a string regex, how do I do a numeric regex?
For example: a username having the value abc12def,
where I need only the characters, i.e. abcdef.
Could you please help me?
You can search for the [0-9] regex pattern and replace it with an empty string. That way the output contains only alphabets.
@@AIEngineeringLife thank you it worked🙂
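For reference, the same digit-stripping idea in plain Python's re module, with a hedged PySpark equivalent in a comment (the column name is illustrative):

```python
import re

# PySpark equivalent (illustrative):
# df.withColumn("username", regexp_replace(col("username"), "[0-9]", ""))
# Every digit matched by [0-9] is replaced with an empty string.
assert re.sub(r"[0-9]", "", "abc12def") == "abcdef"
```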
I have another clarification sir.. Does Spark work with incremental loads? I have searched many sites but can't find a proper solution.
@@revathis2844 It is not straightforward in regular Spark, but Databricks has a functionality called Delta Lake. Check it out.
@@AIEngineeringLife thank you sir
How do I validate a schema in Spark against records in text files, if every record follows a different schema, and then separate the records per schema?
Hi sir, will you please make a proper playlist for this tutorial? Because it's very confusing otherwise.
Hi Imran.. Have you seen the playlist below, where I am adding the videos in sequence? Please see if it helps:
ua-cam.com/play/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI.html
I used this to convert string to integer:
from pyspark.sql.types import IntegerType
df = df.withColumn("loan_amnt", df["loan_amnt"].cast(IntegerType()))
I can see in the schema that loan_amnt has now changed to int type, but when I run the command below:
quantileProbs = [0.25, 0.5, 0.75, 0.9]
relError = 0.05
df_sel.stat.approxQuantile("annual_inc", quantileProbs, relError)
I am getting the error: "java.lang.IllegalArgumentException: requirement failed: Quantile calculation for column annual_inc with data type StringType is not supported."
Can you please help here?
Mohit, I see you have done the cast for loan_amnt but are using annual_inc in the quantile. Can you cast annual_inc as well and see?
@@AIEngineeringLife hey, I did the casting for both already; still getting the same error :/
@@AIEngineeringLife it worked now :) thank you.
Sir, do you have a git repo for the code used in this project? If yes, please share.
Yes, but all modules might not be there, as I have uploaded only selected ones. Below is the link:
github.com/srivatsan88/UA-camLI
@@AIEngineeringLife thankx for sharing sir.
I don't find the code for the Spark program. Please upload it if possible; it would really be a great help.
I'm trying to import the file and create the df as mentioned, and I get the below error. Can you please suggest what I missed?
Error in SQL statement: ParseException:
mismatched input 'file_location' expecting {'(', 'CONVERT', 'COPY', 'OPTIMIZE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
== SQL ==
file_location = "/FileStore/tables/LoanStats_2018Q4-2.csv"
^^^
file_type = "csv"
Raghu, is the file path you loaded correct? Can you check whether it is similar to the below or if something is missing?
# File location and type
file_location = "/FileStore/tables/LoanStats_2018Q4.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
@@AIEngineeringLife yes I double checked and the filepath is correct...
@@AIEngineeringLife never mind, I used the "Create Table in Notebook" option that Databricks provides and it worked.. strange.. thanks for your reply
@@raghuramsharma2603 Great, it worked. That is how I got the loading part as well - I used the Databricks-provided one :). All the best for the remaining tutorial
@@AIEngineeringLife Thank you...great work uploading these videos very helpful...
Hi, when I was running df_sel.stat.cov('annual_inc', 'loan_amnt') I got this error: "java.lang.IllegalArgumentException: requirement failed: Currently covariance calculation for columns with dataType string not supported."
I realised loan_amnt and annual_inc are showing as string in the schema. I followed all the steps as per your video; can you tell me what I missed? Your schema commands in the video showed these two columns as integer, but when I ran the schema command they show as string, hence the error.
Can you tell me how, in between the code, I can change a specific column's schema from string to integer? What exact code should I execute?
Mohit, I see from your other message that you have figured it out. You have to do a cast to convert the datatype.
Do you have these commands written somewhere?
You can check my git repo - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife thanks
What would be the alternative Scala code for this?
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
Biswajit.. have you tried iterating over the columns and checking for nulls in each column in Scala? That is what I am doing in Python as well. I think the map function can do it. I will try it out and paste the exact syntax later in the week.
Any reason for using Scala? As of Spark 2.3 and above, PySpark is almost on an equal footing with Scala.
@@AIEngineeringLife We have been using Scala for all data pipeline jobs as it is faster than Python.
Hello Sir,
For the 2018Q4 data we have to slice the original Loan.csv, which contains (2260668, 145), right? Kaggle gave me a 2GB zip file.
Following are the steps I did on my local machine:
df=read the whole (2260668, 145) file
LoanStats_2018Q4=df[(df['issue_d']=="Oct-2018") | (df['issue_d']=="Nov-2018") | (df['issue_d']=="Dec-2018")]
LoanStats_2018Q4.shape
(128412, 145)
LoanStats_2018Q4.to_csv('/path/LoanStats_2018Q4.csv', index = False)
Then I will upload this to Data Bricks
I just ran it with a subset so users need not wait on the video for every instruction, but in your case you can use it all or subset it as well. The whole file would have made my video run for an additional hour :)
How do I use a subset of the loan data? The original dataset is too large (2GB) and takes time to upload to Databricks.
Bharadwaj.. best is to download it and then split it in Spark or Unix.
The dataset I have for Lending Club has noise in the first row and the header starts from row 2. I am not able to skip the first row and set the 2nd row as the header. Any input on how to do this?
Is skiprows in read_csv not working for you?
@@AIEngineeringLife i didn't know about it, will see if it works
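For anyone hitting the same issue, a tiny sketch of what skiprows does, with made-up data mimicking a noise line above the real header:

```python
from io import StringIO
import pandas as pd

# First physical line is noise; the real header sits on line 2
raw = StringIO("some noise text\nloan_amnt,term\n1000,36 months\n")

# skiprows=1 drops the first line, so line 2 becomes the header
df = pd.read_csv(raw, skiprows=1)
print(list(df.columns))  # ['loan_amnt', 'term']
```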
Can you show a more mechanical way for feature selection.
Usman.. Are you referring to manual feature selection?
@@AIEngineeringLife yes by using any feature importance technique.
Can you please share the link to the csv?
thanks!
The describe and null-count output is not readable most of the time; doesn't that pose a big problem in industry projects? I have a dataset of hundreds of columns, so how do I view describe or null counts for all of them in Spark?
Sahil.. In Databricks we can use the formatted output, but in regular Spark, yes, it is a problem. In some cases we load the result into a table and view it there to understand it.
@@AIEngineeringLife okay thanks!
And how do we handle delta or incremental loads in PySpark?
I was actually not planning to cover ingesting of data to show incremental load, but will see if I can in the future.
sir can you give the link to dataset
Venkata, it is from kaggle - www.kaggle.com/wendykan/lending-club-loan-data
Hello Sir. I have given file_location = r"C:/Users/dipanja/Desktop/data science/LoanStats_2018.csv", which is the path of the LoanStats csv file on my system, but while trying to execute it I am getting the error 'java.io.IOException: No FileSystem for scheme: C'. Can you please help me fix this?
Are you using Databricks Spark, or Spark on your local system?
@@AIEngineeringLife hi sir. i resolved the issue. thanks!
can you make one video for pyspark on google cloud
Will try to do it as part of cloud series. Spark job is same but will show how to run it on cloud dataproc
@@AIEngineeringLife thank you
Sir can you please share this whole notebook
It is in this folder - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife Thankq sir
Where can I access this code?
We are using this command: df = spark.read.format(). I haven't worked on Spark, but from the syntax I can say that this is Spark's method of reading a DataFrame. We are typing this command in a notebook which by default is Python-compatible; to use other languages we have to put a magic command at the top. Then how are we able to use Spark in Python? Is this PySpark, or something else?
Ajeet, PySpark is enabled by default in the notebook, so you get the Python packages loaded in Databricks by default; for other languages we need the magic command. I did not get your question completely though.
@@AIEngineeringLife You solved my doubt though. My question was "how are we using Spark in Python without using a magic command?", and as your answer suggested, it is PySpark that we are using and not Spark directly.
I believe if you are doing this exercise in Python, you could have used spark.read.load rather than the Scala-style spark.read.format syntax.
Can we have the notebook in github or somewhere?
Yes.. You can check it against Spark course in below git repo - github.com/srivatsan88
By the looks of it, Databricks is using Zeppelin-like notebooks.
Yes Hemanth, it is pretty similar to Zeppelin, but I think Databricks has their own custom one that resembles it.
One question: shouldn't we use an action after df.cache() to actually cache the data, since it works on lazy evaluation? Something like df.cache().count()?
@@hemanthdevarapati519 Yes, it is lazy evaluation. It will get loaded into the cache when I call a subsequent action for the first time further below. I thought I might have some action down somewhere - is that not the case? I might not be doing it explicitly with the cache command.
@@AIEngineeringLife Yeah, that makes sense. It was a very intuitive video. I enjoyed every bit of it.
Thank you for all your efforts Srivatsan. (Y)
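For anyone following the cache discussion above, the pattern looks like this (Spark only; df stands for whatever DataFrame is being analysed, so treat it as a sketch):

```python
# cache() is lazy: it only marks the DataFrame for caching
df.cache()

# The first action that scans the data actually materialises the cache;
# count() is a common choice because it touches every row
df.count()

# Later actions can now read from the cached data
df.show(5)
```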
Sir, please share the notebook.
Hi RK, I have mentioned it in my FAQ at the link below, including the scenarios in which I will be sharing notebooks.
www.linkedin.com/pulse/course-launch-scaling-accelerating-machine-learning-srinivasan/
In some cases I will share it in my git repo a few months after the video. Sorry if you don't get the notebook immediately after the video in those cases.
Hi sir, I would like to learn Spark for a DE role. Can you mentor me? I am looking for a paid mentor.
I have an entire course on Apache Spark which is free - why do you want to pay for mentorship when I have covered all that is required? ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html . Just practice along with the videos and you should be good.
Unable to get the data from git, please help.
Manoj, can you check the pinned comment of this video? I have given the link to the dataset there. This dataset is huge, so I could not push it to git due to the size limit.
Thank you so much for the reply. I want to get trained under your guidance - could you please tell me the order in which to watch your video lectures, as I am a beginner?
Manoj.. if you go to my channel and then the playlists tab, you can see multiple playlists. Pick an area of your interest. To start with, you can learn from the end-to-end ML playlist, which talks about the lifecycle of ML projects. It is purely theory but good to know before getting into the details.
Thank you so much for your response, but I want to become an end-to-end full-stack data scientist, so please help me with the order of your playlists to follow. I am from a banking background, so please do help me with the transition.
Manoj.. I do not have video coverage on the basics of ML, so I would suggest going through the Coursera Andrew Ng ML course - that will be helpful - and once done you can check my courses on NLP, Computer Vision and Time Series.
The cluster I am creating is taking forever... anyone else having this problem? :(
As a piece of advice, could you please speak more slowly? It is difficult to understand you.