Data Analysis with Python for Excel Users

  • Published Dec 22, 2024

COMMENTS • 143

  • @farmakoxeris
    @farmakoxeris 6 years ago +3

    EXCELLENT! Of course, I have many, many questions, but it's up to me to dive into the pandas documentation to find the answers.

  • @bharani3412
    @bharani3412 6 years ago +6

    Very clearly explained, and I hope it will be helpful for many of us. Thanks for doing the video.

  • @phytasea
    @phytasea 7 years ago +60

    >>Import Korean_Language as kl
    >>comment = kl.convert(Thank you So much ^^.)
    >>print( comment)
    >> "정말 감사함니다 ^^. "

    • @annodude1189
      @annodude1189 5 years ago +2

      I actually tried to run this in python.
      smh.

  • @ameyajalali9555
    @ameyajalali9555 4 years ago +7

    If data_file.ix[:, 's1':'s4'] doesn't work, try replacing it with data_file.loc[:, 's1':'s4']
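For anyone hitting the `.ix` error, here is a minimal sketch of the label-based and position-based replacements. The small DataFrame is a made-up stand-in for the video's data file; only the column names `time` and `s1` to `s4` come from the comments.

```python
import pandas as pd

# Hypothetical stand-in for the video's data file (values are invented)
data_file = pd.DataFrame({
    "time": [100.0, 101.0, 102.0],
    "s1": [1.0, 2.0, 3.0],
    "s2": [2.0, 3.0, 4.0],
    "s3": [3.0, 4.0, 5.0],
    "s4": [4.0, 5.0, 6.0],
})

# Label-based: all rows, columns 's1' through 's4' (the end label is included)
sensors_by_label = data_file.loc[:, "s1":"s4"]

# Position-based: all rows, columns 1 up to (but not including) 5
sensors_by_position = data_file.iloc[:, 1:5]

print(sensors_by_label.equals(sensors_by_position))
```

`.loc` is usually the closer match to what the video does, since the slice is written with column names.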

  • @John-qt5em
    @John-qt5em 7 years ago +4

    Is there a better way of renaming the average column? I did it this way, which works:
    >> list(result)
    >>['time', 's1', 's2', 's3', 's4', 0]
    >>result.columns=['time', 's1', 's2', 's3', 's4', 'average']

  • @annodude1189
    @annodude1189 5 years ago +6

    If there were 2 subscription buttons I'd hit them both.
    Newfound love!

    • @apm
      @apm  5 years ago +1

      Thanks, Anno!

  • @gajju3152
    @gajju3152 5 years ago +1

    That was really a useful video......Basics of Pandas for file handling........cool stuff

  • @CaliFlower
    @CaliFlower 8 years ago +9

    thanks for this video. quick question, this use case seems absurdly easy to do in excel. why might one use numpy/pandas for this in the real world?

    • @apm
      @apm  7 years ago +22

      +Cali Flower, Excel is probably the best tool if you are doing it one time and the data set is small. Excel is going to be difficult to use for large or complex data sets where a few lines of code in Python will do the same thing and much faster. If you need to repeat the same analysis on multiple data sets then Python is also a clear winner.

    • @khoathivanle9249
      @khoathivanle9249 7 years ago +5

      Excellent answer.

    • @whiteroommenace
      @whiteroommenace 6 years ago +3

      1. Automation and scalability.
      2. Python and its packages are easily usable across various OS platforms, most importantly Linux.
      3. You save MS Office license costs.

    •  6 years ago +2

      Bigger datasets will bog down Excel. I work with 40k rows in Excel. Very slow with the simplest formulas.

    • @fumezflori
      @fumezflori 6 years ago +1

      I use a DB of 900k lines with formulas. It is best not to keep the formula in each cell where it is needed but instead only in the first row. Don't get me wrong, I will try to implement Python in the future, but for now this is what I'm doing.
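The point made earlier in this thread about repeating the same analysis on multiple data sets can be sketched roughly as follows. The function name `add_sensor_average` and the two small data sets are invented for illustration; in practice each DataFrame would probably come from its own `pd.read_csv` call.

```python
import pandas as pd

def add_sensor_average(df):
    """Return a copy of df with a row-wise average of columns s1 through s4."""
    out = df.copy()
    out["average"] = out.loc[:, "s1":"s4"].mean(axis=1)
    return out

# Hypothetical data sets standing in for several CSV files
datasets = {
    "run_a": pd.DataFrame({"time": [0, 1], "s1": [1.0, 2.0], "s2": [1.0, 2.0],
                           "s3": [1.0, 2.0], "s4": [1.0, 2.0]}),
    "run_b": pd.DataFrame({"time": [0, 1], "s1": [2.0, 4.0], "s2": [4.0, 6.0],
                           "s3": [6.0, 8.0], "s4": [8.0, 10.0]}),
}

# The same analysis is applied to every data set with one line
results = {name: add_sensor_average(df) for name, df in datasets.items()}

for name, df in results.items():
    print(name)
    print(df)
```

Once the analysis lives in a function, adding a tenth or hundredth data set costs nothing extra, which is hard to match with copy and paste in a spreadsheet.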

  • @tanyasinha2897
    @tanyasinha2897 4 years ago +2

    Can we use, e.g., content.head(3) to print 3 rows?

    • @apm
      @apm  4 years ago

      Yes, that is correct.

  • @eduardolpz386
    @eduardolpz386 5 years ago +1

    Not sure if anyone ran into this issue, but Pandas needs the openpyxl module in order to work with Excel files.
    You don't have to import it, just make sure it's installed in your environment:
    pip install openpyxl

    • @apm
      @apm  5 years ago +1

      Excellent tip! Thanks for including this.
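A rough sketch of the openpyxl tip above: it writes a tiny made-up DataFrame to a temporary .xlsx file and reads it back, but only if openpyxl is available. The file name `example.xlsx` and the data are assumptions for illustration, and the engine is named explicitly so the dependency is visible.

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd

# Invented data, only for the round trip
df = pd.DataFrame({"time": [0.5, 1.5], "s1": [1.25, 2.25]})

if importlib.util.find_spec("openpyxl") is None:
    # pandas cannot read or write .xlsx files with this engine until it is installed
    print("openpyxl is not installed; run: pip install openpyxl")
    round_trip = None
else:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.xlsx"
        df.to_excel(path, index=False, engine="openpyxl")
        round_trip = pd.read_excel(path, engine="openpyxl")
    print(round_trip)
```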

  • @yamensaban4187
    @yamensaban4187 6 years ago +2

    The replacements for the .ix indexer are the .iloc and .loc indexers

    • @apm
      @apm  6 years ago +1

      Thanks for the tip!

    • @yamensaban4187
      @yamensaban4187 6 years ago

      No, thank you. It was very helpful to me.
      Is there any video like this but for data visualization?

    • @apm
      @apm  6 years ago

      Here are basic tutorials on generating plots in Python: apmonitor.com/che263/index.php/Main/PythonPlots

  • @三弟-r7q
    @三弟-r7q 3 years ago +1

    You use np.mean for the average, but how do I do +, -, * and / on specific data?

    • @apm
      @apm  3 years ago

      You can do operations on data with single values x[2,4]*y[3] or on matrices with X@Y or X+Y.
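The operations mentioned in the reply above might look like this with NumPy. The arrays `x`, `y`, `X` and `Y` are filled with made-up numbers; only the expressions `x[2,4]*y[3]`, `X@Y` and `X+Y` come from the reply.

```python
import numpy as np

x = np.arange(15.0).reshape(3, 5)        # 3x5 array holding 0.0 .. 14.0
y = np.array([10.0, 20.0, 30.0, 40.0])

# Operation on single values: row 2, column 4 of x times element 3 of y
single = x[2, 4] * y[3]                  # 14.0 * 40.0

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[5.0, 6.0], [7.0, 8.0]])

elementwise_sum = X + Y                  # element by element
elementwise_diff = X - Y
elementwise_prod = X * Y                 # NOT a matrix product
elementwise_quot = X / Y
matrix_product = X @ Y                   # rows of X times columns of Y

print(single)
print(matrix_product)
```

Note the difference between `X * Y` (element by element) and `X @ Y` (matrix multiplication); mixing them up is a common source of wrong results.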

  • @TheYasin67
    @TheYasin67 7 years ago

    I'm a beginner. I tried to rewrite your code in PyCharm, but after 07:03 none of your code can be run in PyCharm!
    I cannot get the same results. Is it possible to rewrite your code in PyCharm?

    • @apm
      @apm  7 years ago

      Could you download the source files from apmonitor.com/che263/uploads/Main/python_with_pandas.zip - make sure you unzip the files first before running the script. You may be missing the data file. Additional methods and code are available at apmonitor.com/che263/index.php/Main/PythonDataAnalysis

  • @RyanAlexander87
    @RyanAlexander87 4 years ago +1

    Does this work with Python 3? Just tried it w/ Pandas 1.0.1 and getting error on this line.
    sensors = data_file.ix[:,'s1':'s4']
    ---------------------------------------------------------------------------
    AttributeError Traceback (most recent call last)
    in
    ----> 1 sensors = data_file.ix[:,'s1':'s4']
    2 print(sensors[0:6])
    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
    5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
    5273 return self[name]
    -> 5274 return object.__getattribute__(self, name)
    5275
    5276 def __setattr__(self, name: str, value) -> None:
    AttributeError: 'DataFrame' object has no attribute 'ix'

    • @apm
      @apm  4 years ago +1

      Try "sensors = data_file.loc[:, 's1':'s4']" instead. Here is the complete code that works with later versions of Pandas: apmonitor.com/che263/index.php/Main/PythonDataAnalysis?action=sourceblock&num=2

  • @mattf160
    @mattf160 4 years ago +1

    Hey there, I was just trying to follow along here and the code entered in cell 5 throws an error for me 'DataFrame' object has no attribute 'ix'. Any idea why and how to fix and proceed. Sooooper new to this.

    • @apm
      @apm  4 years ago

      Try "sensors = data_file.loc[:, 's1':'s4']" instead. Pandas updated their code. Here is the full source code: apmonitor.com/che263/index.php/Main/PythonDataAnalysis?action=sourceblock&num=2 that you can find with more examples at apmonitor.com/che263/index.php/Main/PythonDataAnalysis

  • @ashwinipadhy6397
    @ashwinipadhy6397 8 years ago

    If I want to read a .txt file, can I use the same pd.read_csv? Also, I have data like the following:
    REGION SITE_NAME FT_SHIPPED_LAST_HOUR READY_TO_SHIP TOTAL_FT_BACKLOG FT_BKLG_NEXT_CPT FT_BKLG_NEXT_TO_NEXT_CPT SNAPSHOT_DATETIME
    North QNAA 77 24 33 8 25 2016-08-04 10:00:00
    North QNAB 13 0 3 3 0 2016-08-04 10:00:00
    North QNAC 0 0 0 0 0 2016-08-04 11:00:00
    North QNAD 0 15 0 0 0 2016-08-04 12:00:00
    If I want to select only the 10 am data, how can I do that?

    • @rrc
      @rrc 8 years ago

      Yes, you can read strings, headers, numbers, and dates with Pandas. Pandas is like Microsoft Excel for Python but with scripting features to process your data. Once you've imported the data, you can slice it to get a subset such as 10am data.

    • @ashwinipadhy6397
      @ashwinipadhy6397 8 years ago

      Yes, I have done it like this:
      import datetime
      import numpy as np
      import pandas as pd
      data_file = pd.read_csv('D_HRLY_FT_DATA.txt',sep='\t',index_col=False)
      date_time = datetime.datetime.strptime('2016-08-04 11:00:00', "%Y-%m-%d %H:%M:%S")
      date_time1 = datetime.datetime.strptime('2016-08-04 12:00:00', "%Y-%m-%d %H:%M:%S")
      dff = data_file[data_file.SNAPSHOT_DATETIME==str(date_time)]
      dff1 = data_file[data_file.SNAPSHOT_DATETIME==str(date_time1)]
      Now dff has the data for 11 and dff1 has the data for 12, but what if I want to compare the data from 11 am to 12 pm and do some calculation?

    • @apm
      @apm  8 years ago

      You can index and select data in Pandas. Here are some examples: pandas.pydata.org/pandas-docs/stable/indexing.html I also like to use Numpy for array slicing. Here is a brief tutorial: ua-cam.com/video/mOZ0UCeuRX4/v-deo.html
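One possible way to do the time selection asked about in this thread, sketched with a few of the rows from the question typed in by hand. Converting the column with `pd.to_datetime` first is a suggestion, not something the thread prescribes, and the `READY_TO_SHIP` comparison at the end is just an example calculation.

```python
import pandas as pd

# A few columns of the rows shown in the question, entered by hand
data_file = pd.DataFrame({
    "SITE_NAME": ["QNAA", "QNAB", "QNAC", "QNAD"],
    "FT_SHIPPED_LAST_HOUR": [77, 13, 0, 0],
    "READY_TO_SHIP": [24, 0, 0, 15],
    "SNAPSHOT_DATETIME": ["2016-08-04 10:00:00", "2016-08-04 10:00:00",
                          "2016-08-04 11:00:00", "2016-08-04 12:00:00"],
})

# Parse the text timestamps so they can be compared as dates, not strings
data_file["SNAPSHOT_DATETIME"] = pd.to_datetime(data_file["SNAPSHOT_DATETIME"])

# Rows whose snapshot hour is 10
ten_am = data_file[data_file["SNAPSHOT_DATETIME"].dt.hour == 10]

# Rows between 11:00 and 12:00 (both ends included)
start = pd.Timestamp("2016-08-04 11:00:00")
end = pd.Timestamp("2016-08-04 12:00:00")
window = data_file[data_file["SNAPSHOT_DATETIME"].between(start, end)]

# Example calculation: change in READY_TO_SHIP from 11:00 to 12:00
ready_by_time = data_file.groupby("SNAPSHOT_DATETIME")["READY_TO_SHIP"].sum()
change = ready_by_time.loc[end] - ready_by_time.loc[start]

print(ten_am)
print(window)
print(change)
```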

  • @John-qt5em
    @John-qt5em 7 years ago

    You have to press SHIFT+ENTER to activate each line of code (for example at 5:00)

    • @apm
      @apm  7 years ago +1

      +John2000, thanks for the tip on the shortcut key combination. CTRL+ENTER is another one. One activates and gives a new cell below while the other just activates.

    • @John-qt5em
      @John-qt5em 7 years ago +1

      APMonitor.com, thank you. I didn't know about CTRL+ENTER. Very useful!

  • @navinmech540
    @navinmech540 4 years ago +1

    How do I import and work with the Excel file in Python, not in Jupyter Notebook?

    • @apm
      @apm  4 years ago

      It is the same code as in the Python notebook but you just need to save a text file as myScript.py and run it with "python myScript.py".

  • @vaibhavoberoi7
    @vaibhavoberoi7 5 years ago +1

    Thanks a lot for the information. Please suggest a way to assign a name to the 'avg_name' column in the saved excel/csv sheet.

    • @apm
      @apm  5 years ago

      result.columns.values[-1] = 'avg_name'
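An alternative sketch for naming the last column before saving, using `rename`, which returns a new DataFrame instead of editing the column labels in place. The small `result` frame is invented; the label `0` for the unnamed average column and the name `avg_name` come from the comments.

```python
import pandas as pd

# Hypothetical result with an unnamed average column (label 0)
result = pd.DataFrame({"time": [0.0, 1.0], "s1": [1.0, 2.0], 0: [1.0, 2.0]})

# Look up the current label of the last column, then rename only that one
last_label = result.columns[-1]
result = result.rename(columns={last_label: "avg_name"})

# With no file name, to_csv returns the CSV text instead of writing a file
csv_text = result.to_csv(index=False)
print(csv_text)
```

Passing a file name such as `result.to_csv('result.csv', index=False)` would write the same text to disk with the new header.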

  • @terr104
    @terr104 6 years ago

    @APMonitor.com what is the software or device you are using for handwriting.. looks pretty sleek

    • @apm
      @apm  6 years ago

      Here is one of the devices that I use: ua-cam.com/video/YLRVZXedSlc/v-deo.html I created that video a while back and have newer computers but the idea is the same.

    • @terr104
      @terr104 6 years ago

      Thank you so much, you are awesome

  • @jellyjams7217
    @jellyjams7217 6 years ago

    How do you know to use NumPy when doing the average? Do all mathematical computations of your data require a reference to NumPy?

    • @apm
      @apm  6 years ago

      Numpy is a very common package for data analysis in Python. It is considered a base package that is also used by many other packages for mathematical calculations, such as calculating averages. There is also Scipy, Pandas and others: apmonitor.com/che263/index.php/Main/PythonDataAnalysis

  • @stuartbriscar7287
    @stuartbriscar7287 7 years ago +1

    I don't understand anything that you are talking about, but the script worked... so... cool :)

    • @apm
      @apm  7 years ago

      +Stuart Briscar, the same example in Numpy may be a little easier to understand. apmonitor.com/che263/index.php/Main/PythonDataAnalysis

  • @vaibhavoberoi7
    @vaibhavoberoi7 5 years ago +1

    Hi. Would this work for Python 3.x also? Or are there any changes?

    • @apm
      @apm  5 years ago

      Yes, this should work. The only minor change between Python 2.7 and Python 3+ is to use parentheses for the print statements.

  • @ayushchaurasia5547
    @ayushchaurasia5547 4 years ago +1

    this has been of great help. thanks

  • @lecadou
    @lecadou 7 years ago +1

    Best Jupyter tut

  • @nischalkarki2960
    @nischalkarki2960 6 years ago

    Hi, I cannot import a csv file from my desktop. Do I need to save the csv file in a particular location? I am using the Enthought Canopy code editor.

    • @apm
      @apm  6 years ago +1

      Save the csv file in the same directory location as your script file. You may also try installing Python 3.7 if you want to try another distribution. ua-cam.com/video/bXWlyOMYpRE/v-deo.html

    • @ConsulthinkProgrammer
      @ConsulthinkProgrammer 5 years ago

      Thanks, sir @apm, it works :)

  • @TsunamicBlaze
    @TsunamicBlaze 7 years ago

    So how do I go about creating a header for a column of data that was computed with Python? For instance, when you had avg_row, the header by default is 0, but how can you change that to something like "Average"?

    • @apm
      @apm  7 years ago

      +Kevin Le, you can either write the header and then append to it or else use the header argument. There is some help here: docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html

    • @SamHopperton
      @SamHopperton 7 years ago +1

      Hi Kevin, I did it like this:
      result = result.rename(columns={'time': 'Time', 's1':'Sensor 1', 's2':'Sensor 2','s3':'Sensor 3', 's4':'Sensor 4', 0: 'Average'})
      You can name just one of the columns or all of the columns with this method

    • @apm
      @apm  7 years ago

      Thanks for this response - I just realized that my previous response was for the NumPy version of this video at ua-cam.com/video/Tq6rCWPdXoQ/v-deo.html Thanks for providing the appropriate response for the Pandas version.

  • @willseltenright2086
    @willseltenright2086 5 years ago +1

    Awesome! Thank you for the upload

  • @yamax87
    @yamax87 5 years ago +1

    Thanks-great help. At 9.30-9.40, my subtracted time data keeps coming out as 0. I presume this is due to rounding issues? (I'm using python repl). Is there any way to correct for this?

    • @apm
      @apm  5 years ago

      The numbers are different but I need to display more digits to see the difference. Here is one way to see the difference: print('{0:.15f}'.format(val)) where val is the floating point number.
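To illustrate the reply above, a tiny sketch with a made-up value of `val`: printed with few digits it looks like zero, while 15 digits after the decimal point reveal the difference.

```python
# Hypothetical small time difference (value invented for illustration)
val = 1.23456e-07

# Four digits after the decimal point: the difference is invisible
short_text = '{0:.4f}'.format(val)
print(short_text)   # 0.0000

# Fifteen digits after the decimal point: the difference shows up
long_text = '{0:.15f}'.format(val)
print(long_text)    # 0.000000123456000
```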

  • @cristianscript5649
    @cristianscript5649 6 years ago

    How can I insert a column after the last column, instead of a row at the bottom? Thanks

    • @apm
      @apm  6 years ago

      You can insert a new column with:
      df.insert(col_num, "new_col_name", data)
      You could use pandas.concat() to add a new row. DataFrame.append() also did this in older versions but was removed in pandas 2.0.
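A short sketch of both operations from the reply above, on an invented two-row DataFrame. The column position `1` and the values are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({"time": [0.0, 1.0], "s1": [1.0, 2.0]})

# Insert a new column at position 1 (between 'time' and 's1'); modifies df in place
df.insert(1, "new_col_name", [10.0, 20.0])

# Add a new row at the bottom by concatenating a one-row DataFrame
new_row = pd.DataFrame([{"time": 2.0, "new_col_name": 30.0, "s1": 3.0}])
df = pd.concat([df, new_row], ignore_index=True)

print(df)
```

To put the new column after the last existing one, `df.insert(len(df.columns), ...)` or a plain assignment like `df["new_col_name"] = data` should both work.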

  • @ankurpatel9673
    @ankurpatel9673 8 years ago

    Could someone explain why time - time[0] resets the time back to 0? I'm confused as to why this resets the time instead of the column just being 0

    • @apm
      @apm  8 years ago

      +A Patel, the value of time[0] is a single value from the very first row. It subtracts this value from all of the other time values so that the time sequence starts at zero and not the other large value at the beginning of the original data file. Python subtracts the time[0] from time[0], time[1], time[2], .... to the end and returns a new vector that starts at zero. Let me know if this isn't clear.

    • @ashdhuri9494
      @ashdhuri9494 6 years ago

      This is simply called a standardization technique. You take the 1st value as a standard value and the other values are adjusted accordingly.
      Here the 1st value, i.e. time[0], is taken as the standard value and subtracted from each value of the 'time' column for adjustment.
      Hence the line is
      time = time - time[0]
      Hope you understood! :)
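The subtraction discussed in this thread can be tried on a small made-up Series. The single value `time[0]` is subtracted from every element, so only the first result is zero and the spacing between values is unchanged.

```python
import pandas as pd

# Hypothetical time stamps that start at a large value
time = pd.Series([1000.5, 1001.5, 1003.0])

# time[0] is one number (the first row); it is subtracted from every element
elapsed = time - time[0]

print(elapsed.tolist())   # [0.0, 1.0, 2.5]
```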

  • @MrLenzi1983
    @MrLenzi1983 7 years ago

    Sorry, I am a beginner. I'd like to learn more about this routine. I tried to follow along and got this message: "FileNotFoundError: File b'data_with_headers.csv' does not exist"
    I think I should extract the files into a certain directory. Could you guys help me out?

    • @apm
      @apm  7 years ago

      You can download the data_with_headers.csv file from the zipped archive at apmonitor.com/che263/uploads/Main/python_with_pandas.zip - don't forget to extract the folder (right click...extract to...).
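When the file still cannot be found after extracting, a quick check like the following may help. It only uses the standard library and the file name from the error message; where the file should live depends on how the script or notebook is started.

```python
from pathlib import Path

csv_path = Path("data_with_headers.csv")   # file name from the error message

# A relative file name is looked up in the current working directory
print("Working directory:", Path.cwd())

found = csv_path.exists()
if found:
    print("Found:", csv_path.resolve())
else:
    print("Not found. Copy the file into the working directory "
          "or pass its full path to pd.read_csv.")
```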

  • @farmakoxeris
    @farmakoxeris 6 years ago

    Yesterday I installed Jupyter. The NumPy and pandas packages must be installed separately, right?

    • @apm
      @apm  6 years ago

      Yes, packages are installed separately. However, if you pip install a package that has a dependency then the dependency will be automatically installed as well.

    • @farmakoxeris
      @farmakoxeris 6 years ago

      Thanks

  • @ashdhuri9494
    @ashdhuri9494 6 years ago

    Can I replace lines [8] and [9] with
    my_data = pd.DataFrame(time, sensors, avg_row)

    • @apm
      @apm  6 years ago

      Yes, that syntax also works. You would want to assign it to the variable result, however:
      result = pd.DataFrame(time,sensors,avg_row)
      because that is used later. You can download the source code from apmonitor.com/che263/uploads/Main/python_with_pandas.zip

    • @ashdhuri9494
      @ashdhuri9494 6 years ago +1

      APMonitor.com ya correct sir :)
      Thank you
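A word of caution on the question above about combining `time`, `sensors` and `avg_row`: the pandas documentation lists the positional parameters of `pd.DataFrame` as data, index, columns, so passing three data objects positionally may not place them side by side. A sketch using `pd.concat` with `axis=1` is shown here; the input values are invented, and the variable names follow the comments.

```python
import pandas as pd

# Hypothetical stand-in for the video's data file
data_file = pd.DataFrame({
    "time": [100.0, 101.0, 102.0],
    "s1": [1.0, 2.0, 3.0],
    "s2": [2.0, 3.0, 4.0],
    "s3": [3.0, 4.0, 5.0],
    "s4": [4.0, 5.0, 6.0],
})

time = data_file["time"] - data_file["time"][0]   # start the clock at zero
sensors = data_file.loc[:, "s1":"s4"]             # sensor columns only
avg_row = sensors.mean(axis=1)                    # row-wise average (unnamed Series)

# Place the three pieces side by side; the unnamed Series gets the label 0
result = pd.concat([time, sensors, avg_row], axis=1)
result = result.rename(columns={0: "average"})

print(result)
```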

  • @rajshah9031
    @rajshah9031 7 years ago +2

    You are a good teacher

  • @macmacc6377
    @macmacc6377 7 years ago +1

    thank you sir...i am going into python but still thinking if web or data science which will help me...am a computer science student ..thank you

    • @apm
      @apm  7 years ago

      Most of my background is in data science - you can start with my course at apmonitor.com/che263 or check out some courses at Coursera or Udemy. Web programming is also valuable but does require a different skill set.

  • @kinggames5517
    @kinggames5517 3 years ago +1

    great explanation, really appreciated (y)

  • @henryfox6012
    @henryfox6012 6 years ago

    Will all of the lines of code work in Python 3?

    • @apm
      @apm  6 years ago

      Yes, this should also work in Python 3.

  • @GThomas748
    @GThomas748 7 years ago +1

    Very clearly explained example. Thanks!

  • @SamHopperton
    @SamHopperton 7 years ago +2

    Great video, thanks!

  • @BiancaAguglia
    @BiancaAguglia 6 years ago +1

    Nice job. Thank you for posting this.

  • @mohammedhasnain8959
    @mohammedhasnain8959 6 years ago

    Hi sir,
    How do I plot a PDF and a CDF in an IPython notebook?

    • @apm
      @apm  6 years ago

      Here is some help on plotting a CDF: stackoverflow.com/questions/9378420/how-to-plot-cdf-in-matplotlib-in-python

  • @nehahule7159
    @nehahule7159 6 years ago

    How can we upload .gz file?

    • @apm
      @apm  6 years ago

      You can use the gzip package to extract the compressed files in Python: docs.python.org/3/library/gzip.html
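A minimal sketch of reading a gzip-compressed CSV, following the reply above. It first creates a small compressed file in a temporary folder so the example is self-contained; the file name `data_with_headers.csv.gz` and its contents are made up. The second option relies on `pd.read_csv` inferring the compression from the `.gz` extension.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

csv_text = "time,s1\n0,1.5\n1,2.5\n"   # invented contents

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "data_with_headers.csv.gz"

    # Create a compressed file to read back
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        f.write(csv_text)

    # Option 1: decompress with the gzip package, then parse with pandas
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        via_gzip = pd.read_csv(f)

    # Option 2: let pandas infer the compression from the .gz extension
    via_pandas = pd.read_csv(gz_path)

print(via_gzip)
```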

  • @MG-fg5hs
    @MG-fg5hs 6 years ago +5

    Thank you so much; prose and informative.
    Darn it, where is that second like button !

    • @apm
      @apm  6 years ago +1

      Thanks for the positive feedback.

  • @miller1520
    @miller1520 6 years ago +1

    Thank you. This is excellent.

  • @tanchienhao
    @tanchienhao 8 years ago +2

    great video! could you do one for tensorflow? subbed!

    • @apm
      @apm  4 years ago

      Here is content on TensorFlow: apmonitor.com/do/index.php/Main/DeepLearning

  • @viallykazadimutombo225
    @viallykazadimutombo225 6 years ago +1

    Great video, thank you so much.

  • @rajeshkhanna8276
    @rajeshkhanna8276 6 years ago

    How do we remove particular emails that are contained within the same column? I have tried a lot, but nothing works. Could you suggest something?
    I have to remove a single domain: for "python@example.com", I need to remove @example.com from my mails column.
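The question above can be read two ways, so this sketch shows both: dropping the rows with that domain, and stripping the domain text from each address. The column name `mails` and the address `python@example.com` come from the question; the other addresses are invented.

```python
import pandas as pd

df = pd.DataFrame({"mails": ["python@example.com", "user@other.org",
                             "admin@example.com"]})

# Reading 1: drop every row whose address ends with the domain
is_example = df["mails"].str.endswith("@example.com")
rows_without_domain = df[~is_example]

# Reading 2: keep all rows but strip the domain text from each address
# (regex=False treats the pattern as plain text, so '.' is not a wildcard)
local_parts = df["mails"].str.replace("@example.com", "", regex=False)

print(rows_without_domain)
print(local_parts.tolist())
```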

  • @keithlyons2383
    @keithlyons2383 6 years ago +1

    Thanks for the video!

  • @ctea1233
    @ctea1233 6 years ago

    NameError Traceback (most recent call last)
    in
    1 # load data file
    ----> 2 data_file = pd.read_cav('data_with_headers.cav')
    NameError: name 'pd' is not defined

    • @apm
      @apm  6 years ago

      You need to import pandas first as:
      import pandas as pd
      Also make sure that the file extension is correct. Should it be .csv instead of .cav?

  • @jamelstringer6734
    @jamelstringer6734 7 years ago +1

    Very good! Thanks

  • @nikola4294
    @nikola4294 4 years ago +1

    Thank you very much!

  • @ahmedbadal3795
    @ahmedbadal3795 5 years ago

    I can't find the source code. Please upload it again on your site so we can code along with you. Thanks for the video

    • @apm
      @apm  5 years ago

      Here it is: apmonitor.com/che263/index.php/Main/PythonDataAnalysis

  • @FB-tr2kf
    @FB-tr2kf 7 years ago

    I keep getting the following error:
    AttributeError: module 'pandas' has no attribute 'read. Is this because my csv is not in the same place as the model that I am running, and how do I fix it?

    • @apm
      @apm  7 years ago +1

      Make sure it is 'read_csv' with an underscore and not 'read csv' with a space. Here is some additional help if that doesn't work: stackoverflow.com/questions/40554657/module-pandas-has-no-attribute-read-csv

    • @FB-tr2kf
      @FB-tr2kf 7 years ago

      Perfect. thank you and great work!

  • @johnrverno1
    @johnrverno1 7 years ago +1

    Great video!

    • @apm
      @apm  7 years ago

      +John V, thanks!

  • @methuselah12
    @methuselah12 7 years ago +1

    Thank you so much for this

  • @iolanda5707
    @iolanda5707 5 years ago +1

    Thank you!

  • @prblmchild83
    @prblmchild83 7 years ago +1

    Thanks a lot!

  • @arunprakash4435
    @arunprakash4435 6 years ago +1

    Kudos To The Master!!!!!!

  • @khoathivanle9249
    @khoathivanle9249 7 years ago +1

    Excellent

  • @Anu_was_here
    @Anu_was_here 7 years ago

    6:47 My sub-woofer went crazy here

    • @apm
      @apm  7 years ago

      My headphones don't do the same but I definitely hear some sort of low frequency impact in the background.

  • @JoseAlvarez-dl3hm
    @JoseAlvarez-dl3hm 6 years ago

    Hi nice video, really amazing the way you teach. Anyway I keep getting this error:
    Traceback (most recent call last):
    File "C:\Users\JOSE\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\__init__.py", line 26, in
    from pandas._libs import (hashtable as _hashtable,
    File "C:\Users\JOSE\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\_libs\__init__.py", line 4, in
    from .tslib import iNaT, NaT, Timestamp, Timedelta, OutOfBoundsDatetime
    File "pandas\_libs\tslib.pyx", line 67, in init pandas._libs.tslib
    ImportError: DLL load failed: No se encontró el proceso especificado.
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
    File "import_with_pandas.py", line 3, in
    import pandas as pd
    File "C:\Users\JOSE\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\__init__.py", line 35, in
    "the C extensions first.".format(module))
    ImportError: C extension: DLL load failed: No se encontró el proceso especificado. not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.
    And I do not see how to solve this, I hope you can give me some ideas, thanks

    • @apm
      @apm  6 years ago +1

      It looks like there is a problem with your Pandas installation. You may have downloaded the source of Pandas but have not compiled it. I recommend that you use "pip install pandas" instead of trying to compile it yourself. There is additional information on how to pip install a package at apmonitor.com/pdc/index.php/Main/InstallPython (see 3rd video). You may consider uninstalling Python and start over with either an Anaconda distribution or else the Python.org distribution.

    • @JoseAlvarez-dl3hm
      @JoseAlvarez-dl3hm 6 years ago

      Thanks it worked, although I had to do the installation all over again but this time I did it with anaconda after I uninstalled all python packages. Cheers

  • @conradohernanvillagil2764
    @conradohernanvillagil2764 7 years ago +1

    Amazing!!

  • @dynamicbdg2123
    @dynamicbdg2123 7 years ago +1

    thanks

  • @jonathanpacheco5506
    @jonathanpacheco5506 7 years ago +2

    master!!

  • @vijaykumar-yq7sf
    @vijaykumar-yq7sf 6 years ago +1

    Great

  • @shekhnasim1541
    @shekhnasim1541 6 years ago +1

    Nice

  • @loongyan5595
    @loongyan5595 7 years ago +2

    Thumbs up

  • @kisore7921
    @kisore7921 8 years ago

    thx

  • @msnhantoeic422
    @msnhantoeic422 6 years ago +1

    Thanks so much!

  • @alaaeltayeb5794
    @alaaeltayeb5794 6 years ago +1

    best

  • @arsegacom
    @arsegacom 2 years ago

    Hacked i guess

    • @apm
      @apm  2 years ago

      ?