Sum all elements in a column in pandas - python-3.x

I have a data in one column in Python dataframe.
1-2 3-4 8-9
4-5 6-2
3-1 4-2 1-4
The need is to sum all the data available in that column.
I tried to apply below logic but it's not working for list of list.
lst=[]
str='5-7 6-1 6-3'
str2 = str.split(' ')
for ele in str2:
lst.append(ele.split('-'))
print(lst)
sum(lst)
Can anyone please let me know the simplest method ?
My expected result should be:
27
17
15

I think we can do a split
df.col.str.split(' |-').map(lambda x : sum(int(y) for y in x))
Out[149]:
0 27
1 17
2 15
Name: col, dtype: int64
Or
pd.DataFrame(df.col.str.split(' |-').tolist()).astype(float).sum(1)
Out[156]:
0 27.0
1 17.0
2 15.0
dtype: float64

Using pd.Series.str.extractall:
df = pd.DataFrame({"col":['1-2 3-4 8-9', '4-5 6-2', '3-1 4-2 1-4']})
print (df["col"].str.extractall("(\d+)")[0].astype(int).groupby(level=0).sum())
0 27
1 17
2 15
Name: 0, dtype: int32

Use .str.extractall and sum on a level:
df['data'].str.extractall('(\d+)').astype(int).sum(level=0)
Output:
0
0 27
1 17
2 15

A for loop works fine here, and should be performant, since we are dealing with strings:
Using #HenryYik's sample data:
df.assign(sum_ = [sum(int(n) for n in ent
if n.isdigit())
for ent in df.col])
Out[1329]:
col sum_
0 1-2 3-4 8-9 27
1 4-5 6-2 17
2 3-1 4-2 1-4 15
I hazard that it will be faster taking it out and working within Python, before returning back to the pandas dataframe.

Related

Computing 10d rolling average on a descending date column in pandas [duplicate]

Suppose I have a time series:
In[138] rng = pd.date_range('1/10/2011', periods=10, freq='D')
In[139] ts = pd.Series(randn(len(rng)), index=rng)
In[140]
Out[140]:
2011-01-10 0
2011-01-11 1
2011-01-12 2
2011-01-13 3
2011-01-14 4
2011-01-15 5
2011-01-16 6
2011-01-17 7
2011-01-18 8
2011-01-19 9
Freq: D, dtype: int64
If I use one of the rolling_* functions, for instance rolling_sum, I can get the behavior I want for backward looking rolling calculations:
In [157]: pd.rolling_sum(ts, window=3, min_periods=0)
Out[157]:
2011-01-10 0
2011-01-11 1
2011-01-12 3
2011-01-13 6
2011-01-14 9
2011-01-15 12
2011-01-16 15
2011-01-17 18
2011-01-18 21
2011-01-19 24
Freq: D, dtype: float64
But what if I want to do a forward-looking sum? I've tried something like this:
In [161]: pd.rolling_sum(ts.shift(-2, freq='D'), window=3, min_periods=0)
Out[161]:
2011-01-08 0
2011-01-09 1
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
Freq: D, dtype: float64
But that's not exactly the behavior I want. What I am looking for as an output is:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
ie - I want the sum of the "current" day plus the next two days. My current solution is not sufficient because I care about what happens at the edges. I know I could solve this manually by setting up two additional columns that are shifted by 1 and 2 days respectively and then summing the three columns, but there's got to be a more elegant solution.
Why not just do it on the reversed Series (and reverse the answer):
In [11]: pd.rolling_sum(ts[::-1], window=3, min_periods=0)[::-1]
Out[11]:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
Freq: D, dtype: float64
I struggled with this then found an easy way using shift.
If you want a rolling sum for the next 10 periods, try:
df['NewCol'] = df['OtherCol'].shift(-10).rolling(10, min_periods = 0).sum()
We use shift so that "OtherCol" shows up 10 rows ahead of where it normally would be, then we do a rolling sum over the previous 10 rows. Because we shifted, the previous 10 rows are actually the future 10 rows of the unshifted column. :)
Pandas recently added a new feature which enables you to implement forward looking rolling. You have to upgrade to pandas 1.1.0 to get the new feature.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
ts.rolling(window=indexer, min_periods=1).sum()
Maybe you can try bottleneck module. When ts is large, bottleneck is much faster than pandas
import bottleneck as bn
result = bn.move_sum(ts[::-1], window=3, min_count=1)[::-1]
And bottleneck has other rolling functions, such as move_max, move_argmin, move_rank.
Try this one for a rolling window of 3:
window = 3
ts.rolling(window).sum().shift(-window + 1)

Finding which rows have duplicates in a .csv, but only if they have a certain amount of duplicates

I am trying to determine which sequential rows have at least 50 duplicates within one column. Then I would like to be able to read which rows have the duplicates in a summarized manner, ie
start end total
9 60 51
200 260 60
I'm trying to keep the start and end separate so I can call on them independently later.
I have this to open the .csv file and read its contents:
df = pd.read_csv("BN4 A4-F4, H4_row1_column1_watershed_label.csv", header=None)
df.groupby(0).filter(lambda x: len(x) > 0)
Which gives me this:
0
0 52.0
1 65.0
2 52.0
3 52.0
4 52.0
... ...
4995 8.0
4996 8.0
4997 8.0
4998 8.0
4999 8.0
5000 rows × 1 columns
I'm having a number of problems with this. 1) I'm not sure I totally understand the second function. It seems like it is supposed to group the numbers in my column together. This code:
df.groupby(0).count()
gives me this:
0
0.0
1.0
2.0
3.0
4.0
...
68.0
69.0
70.0
71.0
73.0
65 rows × 0 columns
Which I assume means that there are a total of 65 different unique identities in my column. This just doesn't tell me what they are or where they are. I thought that's what this one would do
df.groupby(0).filter(lambda x: len(x) > 0)
but if I change the 0 to anything else then it screws up my generated list.
Problem 2) I think in order to get the number of duplicates in a sequence, and which rows they are in, I would probably need to use a for loop, but I'm not sure how to build it. So far, I've been pulling my hair out all day trying to figure it out but I just don't think I know Python well enough yet.
Can I get some help, please?
UPDATE
Thanks! So this is what I have thanks to #piterbarg:
#function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
df2 = (df
.reset_index()
.shift(periods=-1)
.groupby((df[0].diff() != 0).cumsum()) #if the diff between a row and the prev row is not 0, increase cumulative sum
.agg({0 : 'mean', 'index':['first','last',len]})) #mean is the behavior category
df3 = (df2.where(df2[('index','len')]>49)
.dropna() #drop N/A
.astype(int) #type = int
.reset_index(drop = True))
print(df3)
out:
0 index
mean first last len
0 7 32 87 56
1 19 277 333 57
2 1 785 940 156
3 30 4062 4125 64
4 29 4214 4269 56
5 7 4450 4599 150
6 1 4612 4775 164
7 7 4778 4882 105
8 8 4945 4999 56
The current issue is trying to make it so the dataframe includes the last row of my .csv. If anyone happens to see this, I would love your input!
Let's start by mocking a df:
import numpy as np
np.random.seed(314)
df=pd.DataFrame({0:np.random.randint(10,size = 5000)})
# make sure we have a couple of large blocks
df.loc[300:400,0] = 5
df.loc[600:660,0] = 4
First we identify where the changes to the consecutive numbers occur, and groupby each of such groups. We record where it starts, where it finishes, and the size of each group
df2 = (df.reset_index()
.groupby((df[0].diff() != 0).cumsum())
.agg({'index':['first','last',len]})
)
Then we only pick those groups that are longer than 50
(df2.where(df2[('index','len')]>50)
.dropna()
.astype(int)
.reset_index(drop = True)
)
output:
index
first last len
0 300 400 101
1 600 660 61
For your question as to what df.groupby(0).filter(lambda x: len(x) > 0) does, as far as I can tell it does nothing. It groups by different values in column 0 and then discard those groups whose size is 0, which is none of them by definition. So this returns your full df
Edit
Your code is not quite right, should be
def behavior():
df2 = (df.reset_index()
.groupby((df[0].diff() != 0).cumsum())
.agg({0 : 'mean', 'index':['first','last',len]}))
df3 = (df2.where(df2[('index','len')]>50)
.dropna()
.astype(int)
.reset_index(drop = True))
print(df3)
note that we define and return df3 not df2, and also I amended the code to return the value that is repeated in the mean column (sorry names are not very intuitive but you can change them if you want)
first is the index when the repetition starts, last is the last index, and len is how many elements there.
#function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
df2 = (df.reset_index()
.groupby((df[0].diff() != 0).cumsum()) #if the diff between a row and the prev row is not 0, increase cumulative sum
.agg({0 : 'mean', 'index':['first','last',len]})) #mean is the behavior category
.shift(-1)
df3 = (df2.where(df2[('index','len')]>49)
.dropna() #drop N/A
.astype(int) #type = int
.reset_index(drop = True))
print(df3)
yields this:
0 index
mean first last len
0 7 31 86 56
1 19 276 332 57
2 1 784 939 156
3 31 4061 4124 64
4 29 4213 4268 56
5 8 4449 4598 150
6 1 4611 4774 164
7 8 4777 4881 105
8 8 4944 4999 56
Which I love. I did notice that the group with 56x duplicates of '7' actually starts on row 32, and ends on row 87 (just one later in both cases, and the pattern is consistent throughout the sheet). Am I right in believing that this can be fixed with the shift() function somehow? I'm toying around with this still :D

Pandas shift datetimeindex takes too long time running

I have a running time issue with shifting a large dataframe with datetime index.
Example using created dummy data:
df = pd.DataFrame({'col1':[0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,'col3':list(np.random.randint(0,100000,14*10**5)),'col2':list(pd.date_range('2020-01-01','2020-08-01',freq='M'))*2*10**5})
df.col3=df.col3.astype(str)
df.drop_duplicates(subset=['col3','col2'],keep='first',inplace=True)
If I shift not using datetimeindex, it only takes about 12s:
%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s
But when I use datetimeindex, as that situation that I need, it takes about 40 minutes:
%%time
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s
In my situation, I need the data from shift(1) until shift(6) and merge them with original data by col2 and col3. So I use for looping and merge.
Is there any solution for this? Thanks for your answer, will appreciate so much any respond.
Ben's answer solves it:
%%time
tmp=df1[['col1','col3', 'col2']].assign(col2 = lambda x: x['col2'] + MonthEnd(2)).set_index(['col3', 'col2']).add_suffix(f'_{2}').fillna(0).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 5.94 s
also implement to the looping:
%%time
res=(pd.concat([df1.assign(col2 = lambda x: x['col2'] + MonthEnd(i)).set_index(['col3', 'col2']).add_suffix(f'_{i}') for i in range(0,7)],axis=1).fillna(0)).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 1min 44s
Actually, my real data is already using MonthEnd(0) so I just use loop in range(1,7). I also implement to multiple columns so I don't use astype and implement reindex because I use left merge.
The two operations are slightly different, and the results are not the same because your data (at least the dummy here) is not ordered and especially if you have missing dates for some col3 values. That said, the time difference seems enormous. So I think you should go a bit differently.
One way is to add X MonthEnd to col2 for X from 0 to 6, use concat all of them, after set_index the col3 and col2, add_suffix to keep track of the "shift" value. fillna and convert the dtype to original one. The rest is mostly cosmetic depending on your needs.
from pandas.tseries.offsets import MonthEnd
res = (
pd.concat([
df.assign(col2 = lambda x: x['col2'] + MonthEnd(i))
.set_index(['col3', 'col2'])
.add_suffix(f'_{i}')
for i in range(0,7)],
axis=1)
.fillna(0)
# depends on your original data
.astype(df['col1'].dtype)
# if you want a left merge ordered like original df
#.reindex(pd.MultiIndex.from_frame(df[['col3','col2']]))
# if you want col2 and col3 back as columns
# .reset_index()
)
Note that concat does a outer join by default, so you end up with month that where not in your original data and col1_0 is actually the original data with my random numbers.
print(res.head(10))
col1_0 col1_1 col1_2 col1_3 col1_4 col1_5 col1_6
col3 col2
0 2020-01-31 7 0 0 0 0 0 0
2020-02-29 8 7 0 0 0 0 0
2020-03-31 2 8 7 0 0 0 0
2020-04-30 3 2 8 7 0 0 0
2020-05-31 4 3 2 8 7 0 0
2020-06-30 12 4 3 2 8 7 0
2020-07-31 13 12 4 3 2 8 7
2020-08-31 0 13 12 4 3 2 8
2020-09-30 0 0 13 12 4 3 2
2020-10-31 0 0 0 13 12 4 3
This is an issue with groupby + shift. The problem is that if you specify an axis other than 0 or a frequency it falls back to a very slow loop over the groups. If neither of those are specified it's able to use a much faster path, which is why you see an order of magitude difference between the performance.
The relevant code in for DataFrame.GroupBy.shift is:
def shift(self, periods=1, freq=None, axis=0, fill_value=None):
"""..."""
if freq is not None or axis != 0:
return self.apply(lambda x: x.shift(periods, freq, axis, fill_value))
Previously this issue extended to specifying a fill_value

loops application in dataframe to find output

I have the following data:
dict={'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238]...................}
and
df= pd.Dataframe(dict)
In this manner I have 20 columns with 5 numerical entry in each column
I want to have a new column where the value should come as the following logic:
0 A[0]*B[0]+A[0]*C[0] + A[0]*D[0].......
1 A[1]*B[1]+A[1]*C[1] + A[1]*D[1].......
2 A[2]*B[2]+A[2]*B[2] + A[2]*D[2].......
I tried in the following manner but manually I can not put 20 columns, so I wanted to know the way to apply a loop to get the desired output
:
lst=[]
for i in range(0,5):
j=df.A[i]*df.B[i]+ df.A[i]*df.C[i]+.......
lst.append(j)
i=i+1
A potential solution is the following. I am only taking the example you posted but is works fine for more. Your data is df
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column, D by first subsetting your dataframe
add = df.loc[:, df.columns != 'A']
and then take the sum over all multiplications of the columns in D with column A in the following way:
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200

Assigning integers to dataframe fields ` OverflowError: Python int too large to convert to C unsigned long`

I have a dataframe df that looks like this:
var val
0 clump_thickness 5
1 unif_cell_size 1
2 unif_cell_shape 1
3 marg_adhesion 1
4 single_epith_cell_size 2
5 bare_nuclei 1
6 bland_chrom 3
7 norm_nucleoli 1
8 mitoses 1
9 class 2
11 unif_cell_size 4
12 unif_cell_shape 4
13 marg_adhesion 5
14 single_epith_cell_size 7
15 bare_nuclei 10
17 norm_nucleoli 2
20 clump_thickness 3
25 bare_nuclei 2
30 clump_thickness 6
31 unif_cell_size 8
32 unif_cell_shape 8
34 single_epith_cell_size 3
35 bare_nuclei 4
37 norm_nucleoli 7
40 clump_thickness 4
43 marg_adhesion 3
50 clump_thickness 8
51 unif_cell_size 10
52 unif_cell_shape 10
53 marg_adhesion 8
... ... ...
204 single_epith_cell_size 5
211 unif_cell_size 5
215 bare_nuclei 7
216 bland_chrom 7
217 norm_nucleoli 10
235 bare_nuclei -99999
257 norm_nucleoli 6
324 single_epith_cell_size 8
I want to create a new column that holds the values of the var and val columns, converted to a number. I wrote the following code:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
When I run this code I get the following error:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4384, in _apply_standard
result = Series(results)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 205, in __init__
default=np.nan)
File "pandas/_libs/src/inference.pyx", line 1701, in pandas._libs.lib.fast_multiget (pandas/_libs/lib.c:68371)
File "pandas/_libs/src/inference.pyx", line 1165, in pandas._libs.lib.maybe_convert_objects (pandas/_libs/lib.c:58498)
OverflowError: Python int too large to convert to C unsigned long
I don't understand why. If I run
for column in df['var'].unique():
for value in df['val'].unique():
if int.from_bytes('{}{}'.format(column, value).encode(), 'little') > maximum:
maximum = int.from_bytes('{}{}'.format(column, value).encode(), 'little')
print(int.from_bytes('{}{}'.format(column, value), 'little'))
print()
print(maximum)
I get the following result:
65731626445514392434127804442952952931
67060854441299308307031611503233297507
65731626445514392434127804442952952931
68390082437084224179935418563513642083
69719310432869140052839225623793986659
73706994420223887671550646804635020387
16399285238650560638676108961167827102819
67060854441299308307031611503233297507
72377766424438971798646839744354675811
75036222416008803544454453864915364963
69719310432869140052839225623793986659
16399285238650560638676108961167827102819
76365450411793719417358260925195709539
68390082437084224179935418563513642083
76365450411793719417358260925195709539
73706994420223887671550646804635020387
83632281929131549175300318205721294812263623257187
71048538428654055925743032684074331235
75036222416008803544454453864915364963
72377766424438971798646839744354675811
277249955343544548646026928445812341
256480767909405238131904943128931957
266865361626474893388965935787372149
287634549060614203903087921104252533
64059424565585367137514643836585471605
261673064767940065760435439458152053
282442252202079376274557424775032437
.....
60968996531299
69179002195346541894528099
58769973275747
62068508159075
59869484903523
6026341019714892551838472781928948268513458935618931750446847388019
Based on these results I would say that the conversion to integers works fine. Furthermore, the largest created integer is not so big that it should cause problems when being inserted into the dataframe right?
Question: How can I successfully create a new column with the newly created integers? What am I doing wrong here?
Edit: Although bws's solution
str(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
solves the error, I now have a new problem: the ids are all unique.. I don't understand why this happens but I suddenly have 3000 unique ids, while there are only 92 unique var/val combinations.
I dont know the why. Maybe lamda use by default int in front of int64?
I have a workaround that maybe is useful for you.
Convert the result to string (object):df['id'] = df.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis = 1)
This is interesting to know: https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html
uint64 Unsigned integer (0 to 18446744073709551615)
edit:
After read the last link I asume that when you use a loop, you are using the int python type, not the int that use pandas (come from numpy). So, when you work with a Dataframe you are using the types that numpy provide...
Int type from numpy come from Object so I think that the correct way to work with large integer is use object.
Its my conclusion but maybe I am wrong.
Edit second question:
Simple example works:
d2 = {'val': [2, 1, 1, 2],
'var': ['clump_thickness', 'unif_cell_size', 'unif_cell_size', 'clump_thickness']
}
df2 = pd.DataFrame(data=d2)
df2['id'] = df2.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis = 1)
Result of df2:
print (df2)
val var id
0 2 clump_thickness 67060854441299308307031611503233297507
1 1 unif_cell_size 256480767909405238131904943128931957
2 1 unif_cell_size 256480767909405238131904943128931957
3 2 clump_thickness 67060854441299308307031611503233297507

Resources