Increasing iteration speed - python-3.x

Good afternoon,
I'm iterating through a huge Dataframe (104062 x 20) with the following code:
import pandas as pd
df_tot = pd.read_csv("C:\\Users\\XXXXX\\Desktop\\XXXXXXX\\LOGS\\DF_TOT.txt", header=None)
df_tot = df_tot.replace("\[", "", regex=True)
df_tot = df_tot.replace("\]", "", regex=True)
df_tot = df_tot.replace("\'", "", regex=True)
i = 0
while i < len(df_tot):
to_compare = df_tot.iloc[i].tolist()
for j in range(len(df_tot)):
if to_compare == df_tot.iloc[j].tolist():
if i == j:
print('Matched itself.')
else:
print('MATCH FOUND - row: {} --- match row: {}'.format(i,j))
i += 1
I am looking to optimize time spent for each iteration as much as possible, since this code iterates 104062(^2) times. (More or less ten billions iterations).
With my computing power, time spent comparing to_compare in the whole DF is around 26 seconds.
I want to clarify that in case it would be needed, the whole code could be changed with faster constructs.
As usual, Thanks in advance.

as far as i understand, You just want to find duplicated rows.
Sample data(2 last rows are duplicated):
In [1]: df = pd.DataFrame([[1,2], [3,4], [5,6], [7,8], [1,2], [5,6]], columns=['a', 'b'])
df
Out[1]:
a b
0 1 2
1 3 4
2 5 6
3 7 8
4 1 2
5 5 6
This will return all duplicated rows:
In [2]: df[df.duplicated(keep=False)]
Out[2]:
a b
0 1 2
2 5 6
4 1 2
5 5 6
And indexes, grouped by duplicated row:
In [3]: df[df.duplicated(keep=False)].reset_index().groupby(list(df.columns), as_index=False)['index'].apply(list)
Out[3]: a b
1 2 [0, 4]
5 6 [2, 5]
You can also just remove duplicates from dataframe:
In [4]: df.drop_duplicates()
Out[4]:
a b
0 1 2
1 3 4
2 5 6
3 7 8

Related

Pandas Adding Column Maximum to the Original Dataframe [duplicate]

I have a dataframe with columns A,B. I need to create a column C such that for every record / row:
C = max(A, B).
How should I go about doing this?
You can get the maximum like this:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]]
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]].max(axis=1)
0 1
1 8
2 3
and so:
>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If you know that "A" and "B" are the only columns, you could even get away with
>>> df["C"] = df.max(axis=1)
And you could use .apply(max, axis=1) too, I guess.
#DSM's answer is perfectly fine in almost any normal scenario. But if you're the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for <0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.
For example, you can use ndarray.max() along the first axis.
# Data borrowed from #DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
A B
0 1 -2
1 2 8
2 3 1
df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns,
# df['C'] = df.values.max(1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If your data has NaNs, you will need numpy.nanmax:
df['C'] = np.nanmax(df.values, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:
df['C'] = np.maximum.reduce(df['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames)—and happen to be a shade faster than DataFrame.max. I imagine this difference roughly remains constant, and is due to internal overhead (indexing alignment, handling NaNs, etc).
The graph was generated using perfplot. Benchmarking code, for reference:
import pandas as pd
import perfplot
np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))
perfplot.show(
setup=lambda n: pd.concat([df_] * n, ignore_index=True),
kernels=[
lambda df: df.assign(new=df.max(axis=1)),
lambda df: df.assign(new=df.values.max(1)),
lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
],
labels=['df.max', 'np.max', 'np.maximum.reduce', 'np.nanmax'],
n_range=[2**k for k in range(0, 15)],
xlabel='N (* len(df))',
logx=True,
logy=True)
For finding max among multiple columns would be:
df[['A','B']].max(axis=1).max(axis=0)
Example:
df =
A B
timestamp
2019-11-20 07:00:16 14.037880 15.217879
2019-11-20 07:01:03 14.515359 15.878632
2019-11-20 07:01:33 15.056502 16.309152
2019-11-20 07:02:03 15.533981 16.740607
2019-11-20 07:02:34 17.221073 17.195145
print(df[['A','B']].max(axis=1).max(axis=0))
17.221073

Pandas aggregate column and keep header

I have code which works but gives me data without header is there a way I can write this code so header is not removed? I know one way will be to add back header, but is there a better way?
My code:
df = pd.read_csv(“_data.csv",skiprows=[0], header=None)
df = df.groupby([2])[10].sum().astype(float)
Data:
A B
1 2
1 1
2 3
2 4
I have data like above trying to get this result:
A B
1 3
2 7
Try to use the function reset_index after the sum:
data = [{'a': 1, 'b': 2},{'a': 1, 'b': 1},{'a': 2, 'b': 3},{'a': 2, 'b': 4}]
df = pd.DataFrame(data)
df
a b
0 1 2
1 1 1
2 2 3
3 2 4
df.groupby('a').sum().reset_index()
a b
0 1 3
1 2 7
You should specify the separator (several spaces in your case) and that the header is the first row (=0, with python indexing), than groupby the column you want.
df = pd.read_csv("_data.csv", sep='\s*', header=0)
A B
0 1 2
1 1 1
2 2 3
3 2 4
df = df.groupby(['A']).sum()
B
A
1 3
2 7

Sum and collapse two rows in pandas if two values are equal (order does not matter)

I am analyzing a dataset that has an Origin ID (Column A), a Destination ID (Column B), and how many trips have happened between them (Column Count). Now I want to sum the A-B trips with the B-A trips. This sum is the total number of trips between A and B.
Here is how my data looks like (it is not necessarily ordered in the same way):
In [1]: group_station = pd.DataFrame([[1, 2, 100], [2, 1, 200], [4, 6, 5] , [6, 4, 10], [1, 4, 70]], columns=['A', 'B', 'Count'])
Out[2]:
A B Count
0 1 2 100
1 2 1 200
2 4 6 5
3 6 4 10
4 1 4 70
And I want the following output:
A B C
0 1 2 300
1 4 6 15
4 1 4 70
I have tried groupby and setting the index to both variables with no success. Right now I am doing a very inefficient double loop, that is too slow for the size of my dataset.
If it helps this is the code for the double loop (I removed some efficiency modifications to make it more clear):
# group_station is the dataframe
collapsed_group_station = np.zeros(len(group_station), 3))
for i, row in enumerate(group_station.iterrows()):
start_id = row[0][0]
end_id = row[0][1]
count = row[1][0]
for check_row in group_station.iterrows():
check_start_id = check_row[0][0]
check_end_id = check_row[0][1]
check_time = check_row[1][0]
if start_id == check_end_id and end_id == check_start_id:
new_group_station[i][0] = start_id
new_group_station[i][1] = end_id
new_group_station[i][2] = time + check_time
break
I have ideas of how to make this code more efficient, but I wanted to know if there is a way of doing it without looping.
You can using np.sort with groupby.sum()
import numpy as np; import pandas as pd
group_station[['A','B']]=np.sort(group_station[['A','B']],axis=1)
group_station.groupby(['A','B'],as_index=False).Count.sum()
Out[175]:
A B Count
0 1 2 300
1 1 4 70
2 4 6 15

Getting all rows where for column 'C' the entry is larger than the preceding element in column 'C'

How can I select all rows of a data frame where a condition is met according to a column, which has to do with the relationship between every 2 entries of that column. To give the specific example, lets say I have a DataFrame:
>>>df = pd.DataFrame({'A': [ 1, 2, 3, 4],
'B':['spam', 'ham', 'egg', 'foo'],
'C':[4, 5, 3, 4]})
>>> df
A B C
0 1 spam 4
1 2 ham 5
2 3 egg 3
3 4 foo 4
>>>df2 = df[ return every row of df where C[i] > C[i-1] ]
>>> df2
A B C
1 2 ham 5
3 4 foo 4
There is plenty of great information about slicing and indexing in the pandas docs and here, but this is a bit more complicated, I think. I could also be going about it wrong. What I'm looking for is the rows of data where the value stored in C is no longer monotonously declining.
Any help is appreciated!
Use boolean indexing with compare by shifted column values:
print (df[df['C'] > df['C'].shift()])
A B C
1 2 ham 5
3 4 foo 4
Detail:
print (df['C'] > df['C'].shift())
0 False
1 True
2 False
3 True
Name: C, dtype: bool
If want all monotonously declining rows compare diff of column:
print (df[df['C'].diff() > 0])
A B C
1 2 ham 5
3 4 foo 4

Python/Pandas return column and row index of found string

I've searched previous answers relating to this but those answers seem to utilize numpy because the array contains numbers. I am trying to search for a keyword in a sentence in a dataframe ('Timeframe') where the full sentence is 'Timeframe for wave in ____' and would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for but I am avoiding a hard code for dynamic reasons. Is there a way to return the [34,0] when I search the dataframe for the keyword 'Timeframe'
EDIT:
For check index need contains with boolean indexing, but then there are possible 3 values:
df = pd.DataFrame({'A':['Timeframe for wave in ____', 'a', 'c']})
print (df)
A
0 Timeframe for wave in ____
1 a
2 c
def check(val):
a = df.index[df['A'].str.contains(val)]
if a.empty:
return 'not found'
elif len(a) > 1:
return a.tolist()
else:
#only one value - return scalar
return a.item()
print (check('Timeframe'))
0
print (check('a'))
[0, 1]
print (check('rr'))
not found
Old solution:
It seems you need if need numpy.where for check value Timeframe:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,'Timeframe'],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 Timeframe 0 4 b
a = np.where(df.values == 'Timeframe')
print (a)
(array([5], dtype=int64), array([2], dtype=int64))
b = [x[0] for x in a]
print (b)
[5, 2]
In case you have multiple columns where to look into you can use following code example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],["a","b","Timeframe for wave in____","d"],[5,6,7,8]])
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask==True)
result = [find_result[0][0], find_result[1][0]]
Then output for df and result would be:
>>> df
0 1 2 3
0 1 2 3 4
1 a b Timeframe for wave in____ d
2 5 6 7 8
>>> result
[1, 2]

Resources