Sir, I have a requirement where I have reusable code to run for different files and need to pass the filename to the code from blob storage, as a parameter. Can you help me?
Raviteja.. I have already covered groupby and orderby.. the ones you have mentioned are RDD functions, and Spark is making DataFrame functions primary going forward. Not sure if you really need to learn the RDD functions, as 98% of the time the DataFrame functions are easy and will do the job
Hi Sir, instead of a string regex, how do I do a numeric regex? For example: a username field has abc12def, and I need only the characters, i.e. abcdef. Could you please help me?
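A standard regex handles this; there is nothing special about "numeric". A plain-Python sketch below (the same pattern should also work in Spark as regexp_replace(col, "[0-9]+", ""), though that part is an untested assumption here):

```python
import re

# Remove every run of digits, keeping only the letters.
cleaned = re.sub(r"[0-9]+", "", "abc12def")
print(cleaned)  # abcdef
```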
Hi Imran.. Have you seen the playlist below, where I am adding videos in sequence? Please see if it helps ua-cam.com/play/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI.html
I used this to convert string to integer:
from pyspark.sql.types import IntegerType
df = df.withColumn("loan_amnt", df["loan_amnt"].cast(IntegerType()))
I can see in the schema that loan_amnt is now changed to int type, but when I run the command below:
quantileProbs = [0.25, 0.5, 0.75, 0.9]
relError = 0.05
df_sel.stat.approxQuantile("annual_inc", quantileProbs, relError)
I get the error: "java.lang.IllegalArgumentException: requirement failed: Quantile calculation for column annual_inc with data type StringType is not supported." Can you please help here?
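One thing worth checking in the snippet above: the cast is applied to loan_amnt, but approxQuantile is called on annual_inc, which is still a string, hence the StringType error; the column you query is the one that must be numeric. A plain-Python sketch of the idea, using a hypothetical annual_inc sample and a crude nearest-rank quantile (not Spark's approxQuantile algorithm):

```python
# Cast the strings to numbers first, then take quantiles on the casted values.
raw_annual_inc = ["31000", "50000", "72000", "90000"]   # hypothetical sample
values = sorted(float(v) for v in raw_annual_inc)

def nearest_rank(sorted_vals, p):
    # Crude nearest-rank lookup, just to illustrate why numbers are required.
    idx = min(int(p * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]

quants = [nearest_rank(values, p) for p in (0.25, 0.5, 0.75, 0.9)]
print(quants)  # [50000.0, 72000.0, 90000.0, 90000.0]
```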
Raghu, is the file path correct that you loaded? Can you check if it is similar to below or missing something:
# File location and type
file_location = "/FileStore/tables/LoanStats_2018Q4.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)
display(df)
@@AIEngineeringLife never mind, I used the "Create Table in Notebook" option that Databricks provides and it worked.. strange.. thanks for your reply
Hi, when I ran df_sel.stat.cov('annual_inc', 'loan_amnt') I got this error: "java.lang.IllegalArgumentException: requirement failed: Currently covariance calculation for columns with dataType string not supported." I realised loan_amnt and annual_inc are showing as string in the schema. I followed all the steps as per your video; can you tell me what I missed? In your video the schema commands looked like they showed integers, but when I ran the schema command it shows these 2 columns as string, hence the error.
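For what it's worth, the error means exactly what it says: covariance needs numeric columns, so both columns would have to be cast (e.g. with .cast('double')) before calling cov. What the computation does, in plain Python on two small made-up columns:

```python
# Sample covariance of two numeric columns; string values must be converted first.
annual_inc = [float(v) for v in ["1", "2", "3"]]  # the cast-equivalent step
loan_amnt  = [float(v) for v in ["2", "4", "6"]]

n = len(annual_inc)
mean_x = sum(annual_inc) / n
mean_y = sum(loan_amnt) / n
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(annual_inc, loan_amnt)) / (n - 1)
print(cov)  # 2.0
```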
Biswajit.. Have you tried iterating over the columns and checking for null in each column in Scala? That is what I am doing in Python as well. I think the map function can do that; I will try it out and paste the exact syntax later in the week. Any reason for using Scala? From Spark 2.3 onwards pyspark is almost on an equal footing with Scala
Hello Sir, for the 2018Q4 data we have to slice the original Loan.csv, which contains (2260668, 145), right? Kaggle gave me a 2 GB zip file. These are the steps I did on my local machine:
df = read the whole (2260668, 145) file
LoanStats_2018Q4 = df[(df['issue_d']=="Oct-2018") | (df['issue_d']=="Nov-2018") | (df['issue_d']=="Dec-2018")]
LoanStats_2018Q4.shape  # (128412, 145)
LoanStats_2018Q4.to_csv('/path/LoanStats_2018Q4.csv', index=False)
Then I will upload this to Databricks
I just ran it with a subset so users need not wait on the video for every instruction, but in your case you can use it all or subset it as well.. The whole file would have made my video run for an additional hour :)
The dataset I have for Lending Club has noise in the first row, and the header starts from row 2. I am not able to skip the first row and set the 2nd row as the header; any input on how to do this?
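One common approach, sketched in plain Python below: drop the first line, then parse the rest with the header taken from what was row 2. In Spark, the analogous trick would be to read the file as text, filter out the first line, and feed the remaining lines to spark.read.csv; that Spark part is an untested assumption, and the file contents here are made up:

```python
import csv
import io

# A file whose first row is noise and whose real header sits on row 2.
raw = "Notes offered by Prospectus\nid,loan_amnt\n1,1000\n2,2000\n"

lines = raw.splitlines()[1:]                      # skip the noise row
reader = csv.DictReader(io.StringIO("\n".join(lines)))
rows = list(reader)                               # header is now id,loan_amnt
print(rows[0]["loan_amnt"])  # 1000
```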
The describe and null-count output is not readable most of the time; doesn't that pose a big problem in industry projects? I have a dataset with hundreds of columns, so how do I view describe or the null counts for all of them in Spark?
Hello Sir. I have given file_location = r"C:/Users/dipanja/Desktop/data science/LoanStats_2018.csv", which is the path of the LoanStats csv file on my system, but while trying to execute it I get the error 'java.io.IOException: No FileSystem for scheme: C'. Can you please help me fix this?
We are using this command: df = spark.read.format(). I haven't worked on Spark, but from the syntax I can say this is Spark's method of reading a DataFrame. We are typing this command in a Jupyter notebook, which by default is Python-compatible; to use other languages we have to use a magic command at the top. Then how are we able to use Spark in Python? Is this pyspark, or something else?
Ajeet, pyspark is enabled by default in the notebook, so you get the Python packages loaded in Databricks by default; for other languages we need the magic command. I did not get your question completely, though
@@AIEngineeringLife You solved my doubt, though. My question was "how are we using Spark in Python without using a magic command?", and as your answer suggested, it's pyspark that we are using and not Spark directly.
@@hemanthdevarapati519 yes, it is lazy evaluation. It will get loaded to cache when I call the first subsequent action below. I thought I might have an action somewhere further down; is that not so? I might not be doing it explicitly with the cache command
Hi RK, I have mentioned it in my FAQ at the link below, along with the scenarios for which I will be sharing notebooks www.linkedin.com/pulse/course-launch-scaling-accelerating-machine-learning-srinivasan/ In some cases I will share it at my git link a few months after the video. Sorry in case you don't get the notebook immediately after a video
I have an entire course on Apache Spark which is free.. Why do you want to pay for mentorship when I have covered all that is required - ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html . Just practice along with the videos and you should be good
Manoj. Can you check the pinned comment of this video? I have given the link to the dataset there. This dataset is huge, so I could not push it to git due to the size limit
Thank you so much for the reply. I want to get trained under your guidance; could you help me? Could you please tell me the order in which to watch your video lectures, as I am a beginner?
Manoj.. If you go to my channel and then the playlists tab, you can see multiple playlists; pick the area of your interest. To start with, you can learn from the end-to-end ML playlist, which talks about the lifecycle of ML projects. It is purely theory but good to know before getting into the details
Thank you so much for your response, but I want to become end-to-end full stack, so please help me with the order of your playlists to follow. I am from a banking background, so please do help me with the transition
Manoj.. I do not have video coverage of the basics of ML.. so I would suggest going through Andrew Ng's ML course on Coursera; that will be helpful, and once done you can check my courses on NLP, Computer Vision and Time Series
Many have asked for the file I used for this video- You can download it from here -
drive.google.com/file/d/1e6phh7Df8mzYoE-sBXPVJklnSt_wHwkq/view?usp=sharing
Remove the last 2 lines from the csv file
Is it the %sql magic command that makes SQL statements available? In that case, is it only possible to use SQL when using Databricks, making SQL unavailable for Python scripts? Please correct me and also share any input you may have
Hello Sir, did you set any permissions on this file? I am unable to open it. I tried to open it in OneDrive Office online; it says conversion error.
@@harshithag5769 Nope, it is open for all. Can you open it in Google Drive and see?
@@AIEngineeringLife I don't have Excel on my PC. I tried opening it through OneDrive Live and Google Sheets; both say there is an error opening the file. I was able to open the other files you have provided in the github repository, but when I tried to upload this file to a dataset in Databricks it threw an error.
How do I see the entity relationship diagram in Databricks or pyspark, just as we see it in MySQL? Please help me with this.
Thank you so much, Sir. Millions of blessings from every student who watches this. I was looking for real resources to learn Spark, and your content saved a lot of effort from being wasted and put it in the right direction. Thanks a lot, and please never stop creating such wonderful content for us.
You are welcome, Himanshu, and thanks for such nice and encouraging words; they drive me to create more such content :)
This is HUGE! Gems of wisdom for a Machine learning aspirant. Excellent. Thank you very much.
I've been trying to get up to speed on Databricks and Spark for two weeks now, and I just learned 10x as much in 1 hour as I did in the previous 2 weeks. Thank you!
Glad to know Kenny this was useful. All the best on your learning journey
So how is the Databricks service?
I mean if I use it, what's their billing policy?
Pay per use
Pay per activity
Pay per minutes/hours of use
Pay per data size
Can you let me know?
Thanks for the great tutorial. The Data Science community needs more people like you SS :)
This is the best video content I have ever seen on UA-cam with respect to real-time scenarios.... Thanks a lot, Sir. Please do more to help us..
Truly love your channel! Such a wealth of information and brilliantly explained. Thank you for providing this real world example. It was exactly what I needed to elevate my spark skills. You're a terrific instructor.
Tremendous effort and knowledge can be seen in your video. Thank you
Superb video. In 40 minutes, you covered pretty much everything :) . Please upload more videos
I have a complete course on Apache Spark in the playlists section of my youtube channel. Have you seen it?
@@AIEngineeringLife Yep. seen and subscribed as well :)
Awesome. It's not a video series, it's an entire course, I must say. I really appreciate your hard work and teaching technique, thanks. Sir, keep it up.
One request, I think from many students like me: please upload the notebook, Sir, so that it saves a little time too. Thanks
Thanks Shubham.. :) .. The code is already available in my git repo - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife
Hi thank you for your video.
Just wanted to ask if this video is about data profiling or data wrangling using Pyspark?
@@norpriest521 This video is more on profiling/cleaning of data, but I have detailed videos on wrangling in my Apache Spark playlist
@@AIEngineeringLife
I couldn't find the video regarding data wrangling in your list.
Could you please let me know the title of the video?
@@norpriest521 It is named as data engineering in this playlist. There are 2 parts to it ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html
Amazing explanation!! Thank you very much!
Excellent content, precise and to the point
Thanks very much for the detailed hands-on. It helps!
This is amazing, waiting for your next video on Spark Analysis and cleaning
Next part on EDA using spark is here
ua-cam.com/video/X6OkT2YPZVs/v-deo.html
Wonderful video, Sir; I was looking for such content for many days. Thanks a ton
👍
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show() and
df.filter(df.col.isNull()).count() should return the same result, right?
The first command is giving 0 nulls, whereas the second is giving some nulls for the same column. Can you please help?
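They don't have to match: isnan() only flags floating-point NaN values, while isNull() flags SQL NULLs, and a column can contain one without the other; the usual fix is to test both conditions, e.g. count(when(isnan(c) | col(c).isNull(), c)). A plain-Python illustration of why the two counts can differ:

```python
import math

# A column can hold a float NaN, a SQL-style NULL (None here), or both.
column = [1.0, float("nan"), None, 4.0]

nan_count  = sum(1 for v in column if v is not None and math.isnan(v))  # NaN only
null_count = sum(1 for v in column if v is None)                        # NULL only
print(nan_count, null_count)  # 1 1
```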
Now, this is what I was looking for :)
This is amazing. Thank you very much!
Thank you so much sir for this detailed video.
These videos are awesome and a great help. Thanks. You are doing a wonderful job for people like us who have just entered the industry.
I just wanted to ask: I have worked on ML models using Python but have not worked with Apache Spark. Will I face any difficulty doing the same thing here in Spark?
Nope. It should be a smooth transition. Remember the Spark ML pipeline is inspired by scikit-learn pipelines, so the process is similar. The only time-consuming part will be understanding distributed architectures, which might take time
A suggestion: when you load the dataset, if it is not the same as the one shared on Kaggle, please also let us know what transformations and filtering you performed, so that we can get the same or similar results as we follow along.
I am sorry if the dataset is not the same.. I did not do any transformation; rather, I downloaded it from Lending Club directly at the link below
www.lendingclub.com/info/statistics.action
Earlier the download was open to all, but sometime after I downloaded it they made it sign-in based, and hence I referred to Kaggle thinking it should be similar. From my end I did not make any changes to the dataset I got from Lending Club. Are you facing any particular issue? A few people have reached out in the past for clarification and were able to execute all the commands successfully
Great stuff, really enjoying the hands-on videos.
I have one input, not a big constraint; I guess in the last part, when you are creating the permanent table, the data frame should be df_sel_final instead of df_sel
Thank you Aniket.. You are right.. Maybe did it in a hurry.. Good catch :)
Sir, I had a question: why are we creating a temporary table every time for SQL functions? In pyspark, the main advantage is that we can use the SQL-style functions directly on the dataframe as well,
for example: loan_dfsel.groupby("loan_status").count().orderBy(col("Count").desc()).show(), where 'loan_dfsel' is a dataframe.
Please enlighten me if I'm wrong....
Tanisha.. Is your question why I am using dataframe functions rather than SQL functions in pyspark? If so, then yes, SQL is an easy way of processing data in Spark, but for iterative processing dataframe functions are very powerful and simple. Typically in projects we use a combination of SQL and df functions. In this case I wanted to show dataframe functions, but in future videos I have covered SQL as well. Underneath, both SQL and df functions compile to the same plan, so performance should not differ
Hi, thank you for the informative videos. I'm just getting started with Spark, and the code seems easy to understand. What other aspects of Spark should I read through for a better understanding?
Thank you for the tutorial. I am just curious: while dealing with the revol_util column, we find the average while the column is a string, use it to replace "null" values, and then cast the column to "double". Would there be a difference if we cast the values to "double" first, then took the average and replaced the nulls? Hoping to get your insights on this.
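One way to see why the order usually should not change the result: values that cannot be parsed as numbers become null either way, and the average is taken over the remaining parseable values. A plain-Python sketch with hypothetical revol_util-style values (this deliberately ignores edge cases such as Spark's implicit string-to-double casts inside avg):

```python
def to_double(s):
    # Mimic casting a string column to double: unparseable values become None.
    try:
        return float(s)
    except (TypeError, ValueError):
        return None

raw = ["10.5", "20.0", None, "n/a"]

# Cast first, then average the non-null values (avg() skips nulls).
parsed = [to_double(s) for s in raw]
valid = [v for v in parsed if v is not None]
avg = sum(valid) / len(valid)
print(avg)  # 15.25
```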
Nicely explained
Thanks for your work, Could you please upload all this series video in playlist?
Check this playlist. It is available in it, and I will be adding upcoming videos to it as well
ua-cam.com/play/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI.html
Thanks for the amazing video. Would you provide a link to the notebook for practice?
Thanks.. Should be in this repo - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife Thank you so much
thanks for this tutorial
Thank you, Sir, for responding to my comments and clearing my doubt. I have one more doubt: I am using the regexp_replace function, and when changing the order of the strings I see 2 different outputs. With 'Years' in 1st place it is trimmed completely
from the output, but if I interchange them, putting 'Years' in 2nd place and 'Year' in first, the 'S' is not trimmed in the output. Please refer to the screenshot :)
Kush.. I did not get any screen snapshot here
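Without the screenshot, this looks like regex alternation order at work: alternatives are tried left to right, so with 'year' listed first it matches inside 'years' and leaves the trailing 's' behind. Python's re follows the same rule as the regex engine behind regexp_replace, so a plain-Python demonstration:

```python
import re

s = "10 years"

# "years" tried first: the whole word is removed.
removed_whole = re.sub(r"years|year", "", s)   # "10 "

# "year" tried first: it wins inside "years", stranding the final "s".
stranded_s = re.sub(r"year|years", "", s)      # "10 s"
print(repr(removed_whole), repr(stranded_s))
```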
@@AIEngineeringLife sir, I keep getting attribute errors like "'NoneType' object has no attribute 'groupby'"
awesome !! very useful
I need to connect to Azure Data Lake and load the data here in Databricks. Do you have any document that supports this?
Kunai.. Nope, I have not done any on Azure yet
Where did you create the spark session? I don't see the initialization of the spark variable. Can you please explain more on this?
that is awesome content, big thanks, man !
All the topics you have covered in this Spark series.. how close are they to real-time projects (at MNCs like IBM, CTS, Google etc.)? Just asking
Hema, the "Master Spark" course I have on my channel was made to bring out the real-world scenarios one faces in industry. It takes a use-case-based approach rather than a function- or API-based approach. Many professionals working with Spark have also benefited from this course, as they were able to upskill themselves in the specific areas they had to work on. I am saying this not just because I created it; you can compare the coverage with other courses and pick one that works for you
@@AIEngineeringLife Thanks, Srivatsav! Looking forward to learning more from your channel
Thank you for the dataset
Hi Sir, I'm not understanding the exact purpose of using Spark. As per my understanding, in a one-word answer, Spark is used for data analysis or data preparation; am I correct?
Spark is used for an end-to-end pipeline, starting from data processing (cleaning, preparation) through machine learning and advanced analytics. The reason we need Spark is that as your input data grows, typical tools like Pandas start failing to handle the volume and computation. Spark can work on TBs of data, whereas Pandas is limited to a few GBs if you are looking at large-scale ML computation
@@AIEngineeringLife Thank you so much sir finally my doubt is cleared
Hello Sir, in this video you said the data type needs to be changed manually when using the 'corr' & 'cov' functions. Please help me: how can we change the data type?
Kush.. you can use a custom schema when you load the data, or after loading you can use CAST or astype to change from one data type to another
@@AIEngineeringLife Thank you, Sir, for guiding me at each step.. I have done it and am now able to CAST the datatype.
Thank you very much for the very informative videos.
Could you please let us know what programming language(s) are used in this video?
Is it Spark, or Scala, or pyspark, or pysql? (I don't know any of these.)
I only know Python, including NumPy and Pandas.
So would you recommend knowing the relevant languages as a prerequisite, so that I feel at ease when a real-world problem is given? Any courses you recommend are also fine.
Thank you.
Most of my Spark videos are on pyspark and Spark SQL. Python is a good start, as the pyspark syntax is similar to pandas with slight variations; the only difference is that it is distributed. You can check my entire Spark course on youtube to learn Spark
ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html
@@AIEngineeringLife Ok Sure. Thank you for the swift response.
For Pandas in Python, do we have something dedicated like regexp_extract, or something that cleans data from within the values, or does conventional regex have to be employed?
Check this
ua-cam.com/video/0V8bQ70HM0U/v-deo.html
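To the pandas question above: yes, the .str accessor has dedicated regex helpers (str.extract, str.replace, str.contains and friends), so conventional re loops are rarely needed. A small sketch:

```python
import pandas as pd

s = pd.Series(["abc12def", "xy34z"])

letters = s.str.replace(r"\d+", "", regex=True)   # drop the digit runs
digits  = s.str.extract(r"(\d+)")[0]              # pull the first digit run

print(letters.tolist())  # ['abcdef', 'xyz']
print(digits.tolist())   # ['12', '34']
```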
@@AIEngineeringLife Thanks :)
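Besides the video linked above, a quick pointer: Pandas does have dedicated vectorised regex helpers on the `.str` accessor, similar in spirit to Spark's regexp_extract/regexp_replace, so a manual loop with the `re` module is rarely needed. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series(["36 months", "60 months"])

# Like Spark's regexp_extract: pull the numeric part out of each value
months = s.str.extract(r"(\d+)", expand=False).astype(int)

# Like Spark's regexp_replace: clean a pattern out of the values
cleaned = s.str.replace(r"\s*months", "", regex=True)

print(months.tolist())   # [36, 60]
print(cleaned.tolist())  # ['36', '60']
```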
Hello @aiengineering, I have a question: is it possible to run an Oracle MERGE statement in the Oracle database (Oracle tables) using Python libraries, for example via "spark.sql.write.format('jdbc').options"...?
Sir, I would like to put this Lending Club type problem/project on my resume.
Can I consider the same columns even for other projects of the same kind?
Shaik, you can, but if you are looking to expand it with external datapoints then you can check my video - ua-cam.com/video/Rk_nGgsPQII/v-deo.html
In that video I show how you can use external data sources and combine them with a Lending Club kind of dataset.
Sir, how do I count nulls in PySpark like the Pandas command?
How do I delete an entire column?
Venkat, it is there in my video in case you missed it:
from pyspark.sql.functions import col, count, isnan, when
df_sel.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_sel.columns]).show()
To delete a column you can use df.drop("column_name").
What's the 3rd argument in the regexp_replace function?
It is the replacement string - whatever you want every match of the pattern (the second argument) to be replaced with.
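In other words, regexp_replace(column, pattern, replacement) takes the replacement string as its third argument. Python's re.sub uses the same three pieces (in a different order), which makes a handy mental model; a tiny example with made-up data:

```python
import re

# Spark:   regexp_replace(col("term"), "[^0-9]", "")
# Python:  re.sub(pattern, replacement, string)
# The replacement ("" below) is what every match gets swapped for.
assert re.sub(r"[^0-9]", "", "36 months") == "36"
```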
Hello Sir,
I appreciate your effort and time to teach us.
I am facing a "Job aborted" error when trying to create a permanent table at the end of the analysis. Is there a workaround for this?
Hi Dinesh.. Thank you.. Can you please paste the error you are getting?
@@AIEngineeringLife
org.apache.spark.SparkException: Job aborted.
Py4JJavaError: An error occurred while calling o3407.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:201)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:555)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:216)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:175)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:126)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:150)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:191)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:117)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:115)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1$$anonfun$apply$1.apply(SQLExecution.scala:112)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:217)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:98)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:74)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:169)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:508)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:487)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:430)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 76.0 failed 1 times, most recent failure: Lost task 0.0 in stage 76.0 (TID 974, localhost, executor driver): java.rmi.RemoteException: com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 17327); nested exception is:
com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 17327)
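The QUOTA_EXCEEDED message at the bottom points at the fix: the Community Edition DBFS quota (10,000 files / 10 GB) has been hit, so files need to be cleaned up with dbutils.fs before saveAsTable can succeed. A hedged sketch - the paths are examples, and dbutils only exists inside a Databricks notebook:

```python
# See what is taking up space (path is an example)
display(dbutils.fs.ls("/FileStore/tables/"))

# Remove files you no longer need; recurse=True deletes a whole directory
dbutils.fs.rm("/FileStore/tables/old_upload.csv")
dbutils.fs.rm("/user/hive/warehouse/old_table/", recurse=True)
```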
Sir, I have a requirement where I have reusable code to run for different files, and I need to pass the filename from blob storage to the code as a parameter. Can you help me?
What is the problem you are facing? You can pass the filename as a runtime parameter to Spark and trigger multiple Spark jobs with different file names.
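A minimal sketch of that reusable-job pattern. The script name, argument handling and the Spark read are illustrative assumptions, not code from the video; the job would be launched as e.g. spark-submit process_file.py <blob_path>:

```python
import sys

def build_input_path(argv):
    """Return the input file path passed on the command line."""
    if len(argv) < 2:
        raise ValueError("usage: process_file.py <file_path>")
    return argv[1]

# Inside the job body you would then read it (Spark only), e.g.:
# df = spark.read.format("csv").option("header", "true") \
#          .load(build_input_path(sys.argv))
```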
Sir, can you make a tutorial on functions like groupByKey, sortByKey, reduceByKey, join...?
Raviteja.. I have already covered groupBy and orderBy. The ones you mentioned are RDD functions, and Spark is making DataFrame functions primary going forward. I am not sure you really need to learn the RDD functions, as 98% of the time the DataFrame functions are easier and will do the job.
Hi Sir,
Instead of a string regex, how do I do a numeric regex?
For example: a username having the value abc12def,
where I need only the characters, i.e. abcdef.
Could you please help me?
You can search for the [0-9] regex pattern and replace it with an empty string. That way the output contains only alphabets.
@@AIEngineeringLife thank you it worked🙂
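For reference, the same digit-stripping idea in plain Python's re module, with a hedged PySpark equivalent in a comment (the column name is illustrative):

```python
import re

# PySpark equivalent (illustrative):
# df.withColumn("username", regexp_replace(col("username"), "[0-9]", ""))
# Every digit matched by [0-9] is replaced with an empty string.
assert re.sub(r"[0-9]", "", "abc12def") == "abcdef"
```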
I have another clarification sir.. Does Spark work with incremental loads? I have searched many sites but can't find a proper solution.
@@revathis2844 It is not straightforward in regular Spark, but Databricks has a functionality called Delta Lake. Check it out.
@@AIEngineeringLife thank you sir
How do I validate a schema in Spark against records in text files, if every record follows a different schema, and then separate the records per schema?
Hi sir, will you please make a proper playlist for this tutorial? Because it's very confusing otherwise.
Hi Imran.. Have you seen the playlist below, where I am adding the videos in sequence? Please see if it helps:
ua-cam.com/play/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI.html
I used this to convert string to integer:
from pyspark.sql.types import IntegerType
df = df.withColumn("loan_amnt", df["loan_amnt"].cast(IntegerType()))
I can see in the schema that loan_amnt has now changed to int type, but when I run the command below:
quantileProbs = [0.25, 0.5, 0.75, 0.9]
relError = 0.05
df_sel.stat.approxQuantile("annual_inc", quantileProbs, relError)
I am getting the error: "java.lang.IllegalArgumentException: requirement failed: Quantile calculation for column annual_inc with data type StringType is not supported."
Can you please help here?
Mohit, I see you have done the cast for loan_amnt but are using annual_inc in the quantile. Can you cast annual_inc as well and see?
@@AIEngineeringLife hey, I did the casting for both already; still getting the same error :/
@@AIEngineeringLife it worked now :) thank you.
Sir, do you have a git repo for the code used in this project? If yes, please share.
Yes, but all modules might not be there, as I have uploaded only selected ones. Below is the link:
github.com/srivatsan88/UA-camLI
@@AIEngineeringLife thankx for sharing sir.
I don't find the code for the Spark program. Please upload it if possible; it would really be a great help.
I'm trying to import the file and create the df as mentioned, and I get the below error. Can you please suggest what I missed?
Error in SQL statement: ParseException:
mismatched input 'file_location' expecting {'(', 'CONVERT', 'COPY', 'OPTIMIZE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
== SQL ==
file_location = "/FileStore/tables/LoanStats_2018Q4-2.csv"
^^^
file_type = "csv"
Raghu, is the file path you loaded correct? Can you check whether it is similar to the below or if something is missing?
# File location and type
file_location = "/FileStore/tables/LoanStats_2018Q4.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
@@AIEngineeringLife yes I double checked and the filepath is correct...
@@AIEngineeringLife never mind, I used the "Create Table in Notebook" option that Databricks provides and it worked.. strange.. thanks for your reply
@@raghuramsharma2603 Great, it worked. That is how I got the loading part as well - I used the Databricks-provided one :). All the best for the remaining tutorial
@@AIEngineeringLife Thank you...great work uploading these videos very helpful...
Hi, when I was running df_sel.stat.cov('annual_inc', 'loan_amnt') I got this error: "java.lang.IllegalArgumentException: requirement failed: Currently covariance calculation for columns with dataType string not supported."
I realised loan_amnt and annual_inc are showing as string in the schema. I followed all the steps as per your video; can you tell me what I missed? Your schema commands in the video showed these two columns as integer, but when I ran the schema command they show as string, hence the error.
Can you tell me how, in between the code, I can change a specific column's schema from string to integer? What exact code should I execute?
Mohit, I see from your other message that you have figured it out. You have to do a cast to convert the datatype.
Do you have these commands written somewhere?
You can check my git repo - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife thanks
What would be the alternative Scala code for this?
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
Biswajit.. have you tried iterating over the columns and checking for nulls in each column in Scala? That is what I am doing in Python as well. I think the map function can do it. I will try it out and paste the exact syntax later in the week.
Any reason for using Scala? As of Spark 2.3 and above, PySpark is almost on an equal footing with Scala.
@@AIEngineeringLife We have been using Scala for all data pipeline jobs as it is faster than Python.
Hello Sir,
For the 2018Q4 data we have to slice the original Loan.csv, which contains (2260668, 145), right? Kaggle gave me a 2GB zip file.
Following are the steps I did on my local machine:
df=read the whole (2260668, 145) file
LoanStats_2018Q4=df[(df['issue_d']=="Oct-2018") | (df['issue_d']=="Nov-2018") | (df['issue_d']=="Dec-2018")]
LoanStats_2018Q4.shape
(128412, 145)
LoanStats_2018Q4.to_csv('/path/LoanStats_2018Q4.csv', index = False)
Then I will upload this to Data Bricks
I just ran it with a subset so users need not wait on the video for every instruction, but in your case you can use it all or subset it as well. The whole file would have made my video run for an additional hour :)
How do I use a subset of the loan data? The original dataset is too large (2GB) and takes time to upload to Databricks.
Bharadwaj.. best is to download it and then split it in Spark or Unix.
The dataset I have for Lending Club has noise in the first row and the header starts from row 2. I am not able to skip the first row and set the 2nd row as the header. Any input on how to do this?
Is skiprows in read_csv not working for you?
@@AIEngineeringLife i didn't know about it, will see if it works
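For anyone hitting the same issue, a tiny sketch of what skiprows does, with made-up data mimicking a noise line above the real header:

```python
from io import StringIO
import pandas as pd

# First physical line is noise; the real header sits on line 2
raw = StringIO("some noise text\nloan_amnt,term\n1000,36 months\n")

# skiprows=1 drops the first line, so line 2 becomes the header
df = pd.read_csv(raw, skiprows=1)
print(list(df.columns))  # ['loan_amnt', 'term']
```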
Can you show a more mechanical way for feature selection.
Usman.. Are you referring to manual feature selection?
@@AIEngineeringLife yes by using any feature importance technique.
Can you please share the link to the csv?
thanks!
The describe and null-count output is not readable most of the time; doesn't that pose a big problem in industry projects? I have a dataset of hundreds of columns, so how do I view describe or null counts for all of them in Spark?
Sahil.. In Databricks we can use the formatted output, but in regular Spark, yes, it is a problem. In some cases we load the result into a table and view it there to understand it.
@@AIEngineeringLife okay thanks!
And how do we handle delta or incremental loads in PySpark?
I was actually not planning to cover ingesting of data to show incremental load, but will see if I can in the future.
sir can you give the link to dataset
Venkata, it is from kaggle - www.kaggle.com/wendykan/lending-club-loan-data
Hello Sir. I have given file_location = r"C:/Users/dipanja/Desktop/data science/LoanStats_2018.csv", which is the path of the LoanStats csv file on my system, but while trying to execute it I am getting the error 'java.io.IOException: No FileSystem for scheme: C'. Can you please help me fix this?
Are you using Databricks Spark, or Spark on your local system?
@@AIEngineeringLife hi sir. i resolved the issue. thanks!
can you make one video for pyspark on google cloud
Will try to do it as part of cloud series. Spark job is same but will show how to run it on cloud dataproc
@@AIEngineeringLife thank you
Sir can you please share this whole notebook
It is in this folder - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife Thankq sir
Where can I access this code?
We are using this command: df = spark.read.format(). I haven't worked on Spark, but from the syntax I can say that this is Spark's method of reading a DataFrame. We are typing this command in a notebook which by default is Python-compatible; to use other languages we have to put a magic command at the top. Then how are we able to use Spark in Python? Is this PySpark, or something else?
Ajeet, PySpark is enabled by default in the notebook, so you get the Python packages loaded in Databricks by default; for other languages we need the magic command. I did not get your question completely though.
@@AIEngineeringLife You solved my doubt though. My question was "how are we using Spark in Python without using a magic command?", and as your answer suggested, it is PySpark that we are using and not Spark directly.
I believe if you are doing this exercise in Python, you could have used spark.read.load rather than the Scala-style spark.read.format syntax.
Can we have the notebook in github or somewhere?
Yes.. You can check it against Spark course in below git repo - github.com/srivatsan88
By the looks of it, Databricks is using Zeppelin-like notebooks.
Yes Hemanth, it is pretty similar to Zeppelin, but I think Databricks has their own custom one that resembles it.
One question: shouldn't we use an action after df.cache() to actually cache the data, since it works on lazy evaluation? Something like df.cache().count()?
@@hemanthdevarapati519 Yes, it is lazy evaluation. It will get loaded into the cache when I call a subsequent action for the first time further below. I thought I might have some action down somewhere - is that not the case? I might not be doing it explicitly with the cache command.
@@AIEngineeringLife Yeah, that makes sense. It was a very intuitive video. I enjoyed every bit of it.
Thank you for all your efforts Srivatsan. (Y)
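For anyone following the cache discussion above, the pattern looks like this (Spark only; df stands for whatever DataFrame is being analysed, so treat it as a sketch):

```python
# cache() is lazy: it only marks the DataFrame for caching
df.cache()

# The first action that scans the data actually materialises the cache;
# count() is a common choice because it touches every row
df.count()

# Later actions can now read from the cached data
df.show(5)
```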
Sir, please share the notebook.
Hi RK, I have mentioned it in my FAQ at the link below, including the scenarios in which I will be sharing notebooks.
www.linkedin.com/pulse/course-launch-scaling-accelerating-machine-learning-srinivasan/
In some cases I will share it in my git repo a few months after the video. Sorry if you don't get the notebook immediately after the video in those cases.
Hi sir, I would like to learn Spark for a DE role. Can you mentor me? I am looking for a paid mentor.
I have an entire course on Apache Spark which is free - why do you want to pay for mentorship when I have covered all that is required? ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html . Just practice along with the videos and you should be good.
Unable to get the data from git, please help.
Manoj, can you check the pinned comment of this video? I have given the link to the dataset there. This dataset is huge, so I could not push it to git due to the size limit.
Thank you so much for the reply. I want to get trained under your guidance - could you please tell me the order in which to watch your video lectures, as I am a beginner?
Manoj.. if you go to my channel and then the playlists tab, you can see multiple playlists. Pick an area of your interest. To start with, you can learn from the end-to-end ML playlist, which talks about the lifecycle of ML projects. It is purely theory but good to know before getting into the details.
Thank you so much for your response, but I want to become an end-to-end full-stack data scientist, so please help me with the order of your playlists to follow. I am from a banking background, so please do help me with the transition.
Manoj.. I do not have video coverage on the basics of ML, so I would suggest going through the Coursera Andrew Ng ML course - that will be helpful - and once done you can check my courses on NLP, Computer Vision and Time Series.
The cluster I am creating is taking forever... anyone else having this problem? :(
As a piece of advice, could you please speak more slowly? It is difficult to understand you.