Hi Braden - I've been wrestling with how to speed up some conditional changes to a large 1M-row x 100-column DataFrame - the entire df is transformed to True/False according to the condition, and the condition is a function of each column's quantile. I had to use iteritems so I could go column-wise. I didn't need df.index but used np.where. It took 2.89 secs for the 1M x 100 DataFrame - I expected it to take more like 30 secs, but still, wow! Thank you for your video - it's the best.
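A fully vectorized sketch of that kind of per-column quantile test, in case it helps others (assuming a numeric DataFrame and the 90th percentile as the cutoff; the comparison broadcasts column-wise, so no iteritems loop is needed):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1_000, 100)))  # small stand-in for the 1M x 100 frame

# df.quantile(0.9) returns one quantile per column, and the comparison
# broadcasts down each column across the whole frame at once
mask = df > df.quantile(0.9)  # True/False frame with the same shape as df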
Thank you a lot. I have two files with huge data to compare against one another, and it has taken me over 2 hours to loop through and do all the checks. Tomorrow I'll try one of your techniques.
What was the result?
Savage! Best loop explanation on youtube!
Not just improvements with regard to large data sets, but also for operations that need to happen many times per second.
Finally I get a good reason to master numpy, thank you so much!
It's taken me a long time to realize how amazing numpy is. Now if I know how to do something in numpy I will!
Very good tutorial. If we want to access the previous row inside an apply function, how can we do that? If I have an index defined, can I do row.index - 1? And if I don't have an index defined, how do we know the current location of the current row?
Thank you!
I've been struggling with filling in a recommendation column based on checks of three other columns for weeks! Your iloc_example was the piece that I was missing.
Very cool showing the performance differences, waiting for your Lambda video
Thanks very much, just saved me a huge amount of time analysing :D
You're welcome!
You are making these time comparisons on different columns. Although the times may be similar, I feel showing the calculation comparisons on one particular column would have been a better way.
I'm trying to flatten a JSON using pandas. I'm trying to create a generic script that can continuously iterate over and flatten the dicts and lists, but I'm still struggling with it. If you can help with this, can you give me an idea of how to achieve it?
I was wondering what's the most "pandas" (and fast) way to do the following.
Each row of my table is associated with a patient, and each column with a certain connectivity measure (200 of them). For each column n with label tag[n] I compute the average avg[n] = df[tag[n]].mean() and the standard deviation std[n] = sqrt(df[tag[n]].var()).
I would like to run the following pseudocode:
for row in df:
    for n in range(number of columns):
        row[n] = (row[n] - avg[n]) / std[n]
How'd you do that?
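A vectorized sketch of that normalization, assuming df holds only the numeric measure columns: mean() and std() each return one value per column, and the arithmetic broadcasts across all rows, replacing both loops.

import pandas as pd

# small stand-in for the patient x measure table
df = pd.DataFrame({'m1': [1.0, 2.0, 3.0], 'm2': [10.0, 20.0, 30.0]})

# same z-score as the pseudocode: (value - column mean) / column std
df = (df - df.mean()) / df.std()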
How can we change pandas & numpy code into PySpark? Or how can we boost the performance of pandas and numpy probability or matrix calculations?
Great stuff! What approach would you recommend when not comparing to a static value, but to another DataFrame, where I just want to test if df1['col'] == df2['col']?
Thanks again for sharing the knowledge....
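A minimal sketch for the two-DataFrame case (made-up frames; the first comparison needs matching indexes, the second ignores them):

import pandas as pd

df1 = pd.DataFrame({'col': [1, 2, 3]})
df2 = pd.DataFrame({'col': [1, 0, 3]})

# index-aligned comparison
match = df1['col'] == df2['col']

# purely positional comparison via the underlying arrays
match_by_position = df1['col'].to_numpy() == df2['col'].to_numpy()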
this was a great video. Thanks for the clear demonstration of all these different options and the pros/cons
Thank you! I'm glad you found it helpful!
@@ChartExplorers ur stuff is acc quality content. Props to you bro
@@ultimategamer7859 Thanks!!
I want to know how to write a large DataFrame (35K rows, 1200 columns) to csv. The data is string and numeric. It takes 3 min to write. Is there any way to make it faster?
This is a good question. I'll need to look into this in more depth... Here are a couple options that should speed up the output process:
1) Ensure your pandas is the most up to date
2) Try using numpy's savetxt numpy.org/doc/stable/reference/generated/numpy.savetxt.html
3) Consider using pickle instead of csv (this might not be possible depending on your situation). It works much like csv: write with df.to_pickle('file_name.pkl') and read back with pd.read_pickle('file_name.pkl').
How did you find out/learn which code runs faster? I hope my question makes sense.
Great video. Helped a bunch.
Any idea how to apply these methods to multiple columns with changing output, writing the results into the same column?
Thanks. Possibly, could you provide me with an example? If it is easier you could join my discord server here (so our discussion is not confined to small comment boxes). discord.gg/EEajBNpeK9
Finally a clear and concise explanation. Many thanks!!
By the way, if I need to make a calculation between two sequential rows, like A[i] + B[i] / C[i+1] * D[i+1], how could this be done with the fast methods?
Thanks again
Thanks! This is a great question. You should be able to accomplish this with the pandas shift() method. I'll work on an example I can share with you tomorrow! 😀
@@ChartExplorers Eagerly waiting for it.
@@damon1236 Thanks for the reminder! Alright here is what you can do for a vectorized solution.
df['new_col'] = (df['A'] + df['B']) / (df['C'].shift(-1) * df['D'].shift(-1))
The first part:
(df['A'] + df['B'])
is straightforward: it adds the two columns together.
The second part
(df['C'].shift(-1) * df['D'].shift(-1))
uses the shift method to move all the values in the column up one (I know it is weird to use -1 to shift up haha).
We take the sum from the first part and divide it by the second part to get our answer :)
Wow... thanks!!! I'll stay tuned here for many other little treasures... think about JSON manipulation with very nested structures next time please 😉
@@oswaldocastro9600 Good idea! thanks for your support :)
I will be sure to use np.where more often from now on
Thank you
This changed my life. Thanks!
Mine Too!! 😛
Great presentation. Exactly what I was after.
Excellent, bro. Can you please help me with my situation? I have to first split a column on the ',' separator, then check each string: if the string is an A value it should be placed in column A, else if it is a B value it should be set in column B.
Thanks ALI. Could you give me some examples of what the first 5 rows look like from the column you need to split and filter.
For example:
row0: 'a, b'
row1: ' b,a'
row2: 'b,b'
row3: 'a,b'
row4: 'a,b,c'
@@ChartExplorers thanks for the reply. Below is an example: ColA and ColB are auto-generated based on Col1.
        Col1           ColA       ColB
Row1:   29A,45B        29A        45B
Row2:   1,33A,44B      33A        44B
Row3:   12A,35A,2B     12A,35A    2B
@@malikhan0123 Try this out and please let me know how it works. If you have a very large df and want to optimize this code further then jump onto this discord server for my channel and we can discuss things further. Or if you want further clarification on what the code is doing.
discord.gg/mCRwew6n
import pandas as pd
import re
# Create a function to get the A or B values.
# If there is more than one A or B match, the matches
# are joined into a single comma-separated string, so
# cell_value comes back as a string rather than a list.
def get_a_or_b(a_or_b, x):
r = re.compile(f".*{a_or_b}", re.I) # re.I means it is case insensitive
list_ab = list(filter(r.match, x))
cell_value = ','.join([str(elem) for elem in list_ab])
return cell_value
# create practice DF
df = pd.DataFrame({'need_to_split':['29A,45B','1,33A,44B','12A,35A,2B']})
# split the column on ,
df['split'] = df['need_to_split'].str.split(',')
# Create col_a by passing A to the function we created
df['col_a'] = df['split'].apply(lambda x : get_a_or_b('A', x))
# Create col_b by passing B to the function we created
df['col_b'] = df['split'].apply(lambda x : get_a_or_b('B', x))
@@ChartExplorers Thanks a lot Brayden, it's working on my dataset. I just need to clean the data before feeding it in, as this throws an error if a cell is empty. I haven't used the re library before; can you make a video on it and explain this code? Thank you once again.
@@malikhan0123 I have written it down to create a video on regular expressions using the re library which I should be able to get out in the next 10 days or so. I'll either make a video or do a write up that explains what the code is doing in the example I gave you. Happy Analyzing!
I want a new column, say B, in which I want the values of the next row of A.
E.g. if column A has 1,2,3,4,5, I will get 2,3,4,5 in column B. Can you help me with that?
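One possible sketch using shift with a negative offset; the last row of B comes out NaN, since A has no next row:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
df['B'] = df['A'].shift(-1)  # B: 2, 3, 4, 5, NaN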
Super comparison.
Thanks!
How do I select only the first row of the DataFrame where the condition is satisfied? E.g. if the condition is satisfied at the second row and again in later rows, I need to ignore the later ones. In your example (convert to 1 if the value is greater than 0.5), if the condition is satisfied in subsequent rows those also need to be ignored; select only the first row where the condition is satisfied.
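One possible sketch for grabbing only the first row that satisfies the condition (idxmax on a boolean Series returns the label of the first True; the any() guard covers the no-match case):

import pandas as pd

df = pd.DataFrame({'a': [0.2, 0.7, 0.9, 0.3]})
cond = df['a'] > 0.5

if cond.any():
    first_label = cond.idxmax()  # label of the first True, here 1
    first_row = df.loc[first_label]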
I noticed that the focus was on specific columns. What if I want results from all the columns?
Hi, I have a .csv file with a text column and multiple rows. I wish to execute a function on each row of the text and have a resultant column as well. Can I use the iterations methods shown in the video to accomplish this?
Hi Blessen Koshy, it depends on what method you are using and what function you are applying to your text column. Can you provide more details?
@@ChartExplorers thanks for the quick response. I have a column with rows consisting of text (say 200-500 chars on average). I'd like to manipulate data - like executing a summarize function, append text, remove text etc. and create a new column to compare changes. Is this straightforward? Please let me know if you'd need to take this conversation via a private communication thread.
How can we find the first non-null value in every row of a DataFrame? Please reply.
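One possible sketch: back-filling across columns pulls each row's first non-null value into the first column position:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 5, np.nan],
                   'b': [2, np.nan, np.nan],
                   'c': [7, 8, 9]})

# bfill along the columns moves each row's first non-null value
# into position 0; all-NaN rows stay NaN
first_non_null = df.bfill(axis=1).iloc[:, 0]  # 2.0, 5.0, 9.0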
can you exclude the 'else' from the list comprehension? I can't get it to work without the else part.
You can, check out example 1 on this website. If that doesn't work you can send me your list comprehension and I'll see if I can figure it out.
@@ChartExplorers thanks, which website?
@@DJBONEZDIEZ haha, well that's awkward I didn't paste in the website!
stackoverflow.com/questions/12709062/python-lambda-with-if-but-without-else
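For anyone reading later, the gist of that example: a trailing if in a comprehension filters items and takes no else, while an if/else value expression must come before the for:

nums = [1, -2, 3, -4]

kept = [x for x in nums if x > 0]            # filter only: [1, 3]
mapped = [x if x > 0 else 0 for x in nums]   # map with else: [1, 0, 3, 0]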
Very cool, thanks for taking the time to make this.
This was enlightening. Many thanks!
EXTREMELY HELPFUL!
Glad it helped!
May I clarify: at 1:57, is the returned row data a pandas Series or a dictionary? Thank you. Fantastic guide.
Good question. row is a pandas Series.
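A quick sketch to confirm it yourself:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
for idx, row in df.iterrows():
    print(type(row))  # <class 'pandas.core.series.Series'>
    break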
Can you do multiple ifs in the last one?
Hey Brayden, super helpful video! I noticed though, after running some tests, that using "np.where" does not like to work when using int values in the dataframe. I found that initializing a numpy array beforehand and then using that instead of directly calling df['column_name'].values in the "np.where" method works, but I'm not exactly sure why. If you have a reasoning behind this I'd love to hear it!
Can you give an example?
Thanks. Can you explain it once again for beginners, please?
How can we update rows dynamically across all columns, rather than mentioning the name of each column? I have the same scenario, but the number of columns is 4000+.
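One possible sketch for applying a condition to every column at once, without naming any of them (assuming a numeric frame and the video's greater-than-0.5 rule):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 4)))  # small stand-in for a 4000+ column frame

# np.where runs on the whole 2-D block in one shot
df[:] = np.where(df > 0.5, 1, 0)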
Useful content nicely explained! Thank you for sharing.
Thanks!
Thanks, very useful. Please can you talk about geopandas and spatial data?
I'm glad you found this useful. Geospatial data is a little out of my wheel house, but I'll see what I can do 😀
@@ChartExplorers Thanks a lot for your interaction, and keep up the good work.
You are awesome! Such useful content!
Thanks, I'm glad you liked it!!
Thank you for this, very helpful. I'm trying to apply this in a particular way. I have two DataFrames: A and B. DataFrame A contains 37 rows and 1 column, called Indicator. DataFrame B has many rows and 2 columns: ID and Indicator. What I want to do is iterate over DataFrame A's Indicator column and count the number of IDs in DataFrame B that match that indicator. I was successful in iterating and doing the nested group-by; however, the group-by value does not reset and only returns the very first iteration for all 37 indicators. What's missing?
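One loop-free sketch of that count, with column names assumed from the description: count the IDs per indicator in B once, then line the counts up against A.

import pandas as pd

df_a = pd.DataFrame({'Indicator': ['x', 'y', 'z']})
df_b = pd.DataFrame({'ID': [1, 2, 3, 4],
                     'Indicator': ['x', 'x', 'y', 'w']})

counts = df_b.groupby('Indicator')['ID'].count()
# indicators with no match in B get 0
df_a['id_count'] = df_a['Indicator'].map(counts).fillna(0).astype(int)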
Thanks
What if, instead of a particular column (A or B, ...), I want to loop over each row to find the largest value in each row?
In your example we could use a built in method (max). I think you will find this video particularly helpful ua-cam.com/video/n2ff3rcLgnE/v-deo.html
Great video, thanks!
How about a string? Which method will be fastest?
I'm glad you brought this up. This is something that I should have discussed in the video. If possible use pandas' built-in string methods (.str). When this is not possible, try to use np.where() or np.select() (these methods still work with strings). Most other things you should be able to cover with .apply and a lambda function. And if none of those work, use .iloc - but you should never have to use df.iterrows, even with strings.
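A small sketch of those string options side by side (column name and labels are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'bob', 'Carol']})

# pandas built-in string methods
df['upper'] = df['name'].str.upper()

# np.where works fine with string conditions and string outputs
df['starts_a'] = np.where(df['name'].str.lower().str.startswith('a'),
                          'A-name', 'other')

# apply + lambda as the general fallback
df['initial'] = df['name'].apply(lambda s: s[0].upper())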
@@ChartExplorers Thanks for answering so promptly. This has been driving me crazy. Turns out there's been a "standoff" between Numpy vs other camps. One of the posts at this (warning: lengthy) discussion at StackOverflow explains:
stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur
"There is a disagreement between Numpy and native python on what should happen when you compare a strings to numpy's numeric types. Notice the left operand is python's turf, a primitive string, and the middle operation is python's turf, but the right operand is numpy's turf. Should you return a Python style Scalar or a Numpy style ndarray of Boolean? Numpy says ndarray of bool, Pythonic developers disagree. Classic standoff.
Should it be elementwise comparison or Scalar if item exists in the array?
If your code or library is using the in or == operators to compare python string to numpy ndarrays, they aren't compatible, so when if you try it, it returns a scalar, but only for now. The Warning indicates that in the future this behavior might change so your code pukes all over the carpet if python/numpy decide to do adopt Numpy style."
Submitted Bug reports:
Numpy and Python are in a standoff, for now the operation returns a scalar, but in the future it may change.
github.com/numpy/numpy/issues/6784
github.com/pandas-dev/pandas/issues/7830"
That's amazing thanks!
Thanks for the video bro..
Hi! Thanks for such a nice explanation. Can you please make a video on how to iterate over rows and find the max/min value of each row across multiple columns? Suppose I have a DataFrame showing Jan - Dec (in rows) data for 5 scenarios (in columns). Now I would like to know which scenario has the max/min value for each row. Thanks in advance.
Hi Niladri, Thanks for the suggestion! I'll start working on that video right away. I'll post it here in the comments when I'm finished.
Here is a video I created on how to get the min & max values of each row. It also discusses how to get the index location of those values as well. ua-cam.com/video/n2ff3rcLgnE/v-deo.html
@@ChartExplorers Awesome! And THANK YOU for considering my request. Wish you all the very best 👍
Thanks for this great video! Would you be able to use one of these faster methods to loop over an array or DataFrame in a specific number of rows, then create an XML file, and start the process again? What I am doing is building XML files from a large DataFrame (it can be an array or series or whatever). The data is just one column; each row is being put into the XML file as a value. I want to do 50K at a time, write the file, and then continue, with the last file not having exactly 50K but whatever is left. Thanks,
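A possible sketch of just the chunking part; the XML step here uses pandas' own to_xml (available in pandas 1.3+) as a stand-in for however the files are really built:

import pandas as pd

s = pd.Series(range(120_000), name='value')  # stand-in for your column
chunk_size = 50_000

for file_no, start in enumerate(range(0, len(s), chunk_size)):
    # the last chunk simply holds whatever rows are left
    chunk = s.iloc[start:start + chunk_size]
    chunk.to_frame().to_xml(f'output_{file_no}.xml', index=False)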
I need help in Python, can we connect?
How can I iterate through a column?
very very useful..thanks
I'm glad you found it helpful!
Great explanation, thanks!
Excellent thanks a lot mate
You're very welcome!
Great video!
I'm just wondering what the drawbacks are? Why is it mentioned on SO that "iterrows should only be used in very, very specific cases"?(stackoverflow.com/a/24871316/11901732) Is it appropriate to be used in sequential search such as the case of a for-loop? Thanks
Hi Harper, good question. The drawback to iterrows is that it is SLOW... like really slow. It's not that iterrows is bad (even though the way some people talk about it you would think it was the worst thing you could ever use).
Code that runs slow is not really a big deal if you have a small data set. In fact, there have been times when I have spent more time coming up with a vectorized solution than it would have taken to write and run an iterrows solution.
You have probably heard about code being Pythonic (or writing code in the Python style and using Python methods to solve problems). In the same way some code is not "Pythonic" using iterrows is not very "data scientistic."
I think the moral of the story is that using iterrows is fine if you only work with small datasets, are not going to reuse the code, and don't do anything computationally expensive. Some people would disagree with me on this point, but there is no sense spending 5 minutes optimizing code that you are only going to use once to get a 30 second speed increase. On the other hand, if you want to keep improving as a programmer / analyst / data scientist I think we need to constantly be learning and implementing faster better solutions - even if there is a bit of a learning curve.
Hi, thank you so much for the video. I was working on a python simulation where I iterate through rows like in your first example and that takes me over 3 hours to run. Can't wait to try with your methods.
How would you go about dealing with a cell whose value depends on a cell from the previous iteration, like A2 = 1 if B1 > 5?
I'm so curious how this will turn out for you. Please let me know.
I am interpreting "a cell at the previous iteration" to mean the cell above it? Is that correct?
If you are trying to compare to the previous row you can use the pandas shift method pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html
import numpy as np
import pandas as pd

# Practice dataframe
df = pd.DataFrame({"col1": [6, 2, 5, 13, 6],
"col2": [3, 2, 8, 6, 4],
"col3": [7, 2, 6, 9, 5]})
# Shift values by one
previous_values = df['col1'].shift(1, fill_value=0)
# Compare previous values to current value if the
# previous value was greater than 5 set it equal
# to 1 else set it equal to the value that it currently is
df['col1'] = np.where(previous_values > 5, 1, df['col1'])
# Example for entire dataframe not just a column
previous_values = df.shift(1)
new_values = np.where(previous_values > 5, 1, df)
df = pd.DataFrame(new_values, columns=df.columns)
There are a couple things to note. When we use shift there will be a row of missing values. If we shift the rows up by one, the bottom row will be missing. If we shift the rows down by one, the top row will be missing. We can fill this in with whatever value is most appropriate, or leave it as NaN if your analysis allows NaN.
We are using np.where, which takes the condition we are evaluating (is the previous value greater than 5). Then we pass in the value we want the cell to be if the condition is true (in this case 1). Note, we also need to pass in the value we want the cell to be if the condition is not true. In this example I pass in the original cell value (do not change the value).
In your example you said A2 = 1 if B1 > 5. If you wanted to compare the current cell (A2) to the previous value of another column, you could just change the column we use when we create the previous_values variable.
# create previous values for col2 instead of col1
previous_values = df['col2'].shift(1, fill_value=0)
Let me know if I explained things well enough (these topics are more advanced). And please let me know if you get a boost in performance!
@@ChartExplorers Thank you so much for the detailed answer !! I love that channel, definitely subscribing !!
I just tried it with shift. Tell me if I am wrong, but I think in my example below the interdependency makes it impossible without a loop. There are some dependencies on values on rows above, some on values on the same row. And I feel like you can't fill a whole column without completing the rows one by one. I added a more detailed example below.
Please let me know if you see something wrong in my implementation of .shift. I am also gonna try using NumPy arrays and also with Numba for just in time compiling.
The table looks like this:
Columns: date, prices, qty, collateral, LTV, margin call, margin release, margin termination
Date, price, and qty are inputs.
Margin call, release, termination are booleans.
Cash collateral is 0 at first then its value derives from the previous line eg: if margin call at T-1 is 1: =price(t-1)*qty(t-1)-cash collateral(t-1)
LTV is 45% at first then some percentage of price and cash collateral (based on T, not T-1)
Margin calls, release, and termination are 0 or 1 based on some functions of cash collateral, price, LTV.
So before I had something like (works but slow - I am running that for a thousand rows and redo that for 200 stocks on 3000 dates. So like 600M iterations - so far 4 hours):
df.loc[0, 'cash']=0
df.loc[0, 'LTV'] = 0.45
df.loc[0, 'margin call'] = 1 if df.loc[0, 'LTV'] > 0.5 else 0
...
for i in range(1, len(df)):
    if df.loc[i-1, 'margincall'] == 1:
        df.loc[i, 'cash'] = df.loc[i-1, 'LTV'] * ...
    elif ...
        ...
    df.loc[i, 'LTV'] = 100 - df.loc[i, 'cash'] / ...
Now I have something like this:
if df['margincall'].shift(1, fill_value=0) == 1:
    df['cash'] = df['LTV'].shift(1, fill_value=0.45) * ...
elif ...
    ...
df['LTV'] = 100 - df['cash'] / ...
Thanks again for your answer !! And your videos !!
Hi @V1ct0r 6l0r1eux, this may be a case where you need to loop. I have to head out to work and then I have a few other things I need to get done today. After that I will try to see if there is anything I can figure out. Would you be able to send me the first 50 rows or so of your data? If you are not allowed to for whatever reason, could you make a fake df with the columns you described above, with some fake data in it, and send it my way? I want to make sure I am practicing on the best data possible. If you can do that, my email is bradonvalgardson@gmail.com.
@@ChartExplorers Thanks, looks like it might be that one mandatory looping case indeed. I made it work by switching to numpy arrays. About 10 times faster. Still dealing with a couple issues to use numba which should cut the load by another 3/5 times. It's fast enough for what I need it to do now so don't spend time on this but for sake of curiosity I can send you the code tomorrow at your email. Thanks again for replying so fast and making those awesome videos.
Thank you so much!!
Thanks! :)
You're Welcome :)
Great video! Lambda & NumPy methods are time efficient, but how would we use conditions based on different columns in these methods? For example: I'm trying to create a sequence number based on other variables, using "k" as an incrementor when a condition is met. I'm not able to incorporate this with lambda/NumPy, but I can easily do it with index/iterrows().
My Code:
c5=df_temp['DESCRIPTION'].str.match(pat=r'ISIN|IS INf')
cumm=0
df_temp['Seq']=0
for i, row in df_temp.iterrows():
    if c5[i] == True:
        df_temp['Seq'][i] = cumm
        cumm = cumm + 1
    else:
        df_temp['Seq'][i] = cumm
Thanks! I like this question. There are some situations where you need to iterate through a DataFrame (this might be one of them). I did some digging and I couldn't find anything that uses an incrementor with lambda or np.where, np.select. I'm going to keep looking into this, but for now this is the best solution I came up with.
Note: When you are testing if something is true and you have a Boolean (True/False) variable you don't need to use "== True".
Also, there is a shorthand for cumm = cumm + 1: you can instead use cumm += 1 (you probably knew this already, but just in case).
If you use df.index instead of df.iterrows() you should get a slight performance boost.
c5 = df_temp['DESCRIPTION'].str.match(pat=r'ISIN|IS INf')
cumm=0
df_temp['Seq']=0
for i in df_temp.index:
    if c5[i]:  # removed == True
        df_temp.at[i, 'Seq'] = cumm  # .at avoids chained assignment
        cumm += 1
    else:
        df_temp.at[i, 'Seq'] = cumm
@@ChartExplorers Exactly! About 90% of the time when I have to loop, it's always related to iterating over a dataframe & lambda/np just doesn't do the job. Thanks for looking it up, let me know if you come up with something.
Thanks a lot for your note. I'm aware of both of them but I'm intentionally using them as I have to deploy the code & it's easier to explain that way to the client.
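One more thought on the incrementor: if the counter really just tracks how many matches appeared in earlier rows, that specific pattern can be vectorized with a shifted cumulative sum. A sketch, assuming that reading of the loop:

import pandas as pd

df_temp = pd.DataFrame({'DESCRIPTION': ['ISIN a', 'other', 'IS INf b', 'other']})
c5 = df_temp['DESCRIPTION'].str.match(pat=r'ISIN|IS INf')

# Seq[i] = number of matches strictly above row i, which is
# exactly what the loop's counter produces
df_temp['Seq'] = c5.shift(1, fill_value=False).cumsum()  # 0, 1, 1, 2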
@@ChartExplorers To add to this topic: let's assume you have one operation that requires 'iteration', and other operations that can use 'apply'. Is it better to piggyback all operations onto one loop, or is it better to use 'apply' or 'where' whenever possible? BTW: I like the presentation a lot, nicely done!
Thank you so much
Awesome!
thank you!!!
Cool bro ❤
where's the lambda video!!??
Comparing 2 different columns, I need to create a binary column. How do I do it?
Hi Subhankar. Thanks for the question. Are your columns numeric? If so you can just compare the columns directly df['binary'] = df['col1'] > df['col2'] (you can replace the > with any comparison operator). Will this work in your situation?
Hi Subhankar Roy, I made this video addressing your question ua-cam.com/video/2xwto0MK_9U/v-deo.html
Magic
You just missed the itertuples...
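True, itertuples was not covered; it is usually much faster than iterrows because it yields lightweight namedtuples instead of building a Series for every row. A quick sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

for row in df.itertuples(index=False):
    print(row.a + row.b)  # attribute access on a namedtuple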
manufacturer = (Ford, Ford, Ford, Chevrolet, Chevrolet, Chevrolet, Hyundai, Kia, nan, nan, nan, nan, nan, nan, nan)
model = (Focus, Fiesta, F150, Captiva, Bolt, Silverado1500, Tucson, Sorento, F150, Fiesta, F150, Silverado1500, Captiva, F150, Focus, Tucson)
Here is a sample of the dataset I'm working with. The code I posted does a good job, but it's very slow. The model Focus has the manufacturer Ford at first, but afterward it has nan, so the code is scanning and picking a name from previously declared observations.
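One possible vectorized approach: build a model-to-manufacturer lookup from the rows where the manufacturer is known, then fill the gaps with map (a sketch, assuming each model maps to a single manufacturer):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'manufacturer': ['Ford', 'Ford', 'Chevrolet', np.nan, np.nan],
    'model': ['Focus', 'F150', 'Captiva', 'F150', 'Focus'],
})

# lookup built once from the rows with a known manufacturer
lookup = (df.dropna(subset=['manufacturer'])
            .drop_duplicates('model')
            .set_index('model')['manufacturer'])

df['manufacturer'] = df['manufacturer'].fillna(df['model'].map(lookup))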
dude... I thought you were going to actually show how to make changes to the data frame with a loop, not which type of change is faster..... too bad.
Is that not what I do? Each example shows how to change the dataframe.
If you were thinking something more helpful I would love to get your feedback so I can create better content.
its just so painful
What's painful?
@@ChartExplorers pandas DataFrames. I've been struggling with an assignment that requires me to use pandas dfs for over a week.
@@iskandarrasimoon734 Oh man! that stinks. Let me know if you have any specific questions. If I can I'd be happy to try and help you.
@@ChartExplorers Is there any way I could contact you?
@@iskandarrasimoon734 if you have discord discord.gg/EEajBNpeK9 or my email is bradonvalgardson@gmail.com