How to add a column to a pandas DataFrame based on other columns, for a large dataset? - python-3.x

I have a CSV file that contains 1,000,000 rows and columns like the following.
col1  col2  col3  ...  col20
   0     1     0  ...     10
   0     1     0  ...     20
   1     0     0  ...     30
   0     1     0  ...     40
   0     0     1  ...     50
 ...
I want to add a new column called col1_col2_col3, like the following.
col1  col2  col3  col1_col2_col3  ...  col20
   0     1     0               2  ...     10
   0     1     0               2  ...     20
   1     0     0               1  ...     30
   0     1     0               2  ...     40
   0     0     1               3  ...     50
 ...
I have loaded the data file into a pandas DataFrame and then tried the following.
for idx, row in df.iterrows():
    if df.loc[idx, 'col1'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 1
    elif df.loc[idx, 'col2'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 2
    elif df.loc[idx, 'col3'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 3
The above solution works. However, my code takes a very long time to run. Is there a faster way to create col1_col2_col3?

Here's one way using multiplication. The idea is to multiply each column by 1, 2 or 3 depending on which column it is, then keep the nonzero values:
df['col1_col2_col3'] = df[['col1','col2','col3']].mul([1,2,3]).mask(lambda x: x==0).bfill(axis=1)['col1'].astype(int)
N.B. It assumes that each row has exactly one nonzero value across ['col1', 'col2', 'col3'].
Output:
   col1  col2  col3  ...  col20  col1_col2_col3
0     0     1     0  ...     10               2
1     0     1     0  ...     20               2
2     1     0     0  ...     30               1
3     0     1     0  ...     40               2
4     0     0     1  ...     50               3
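Broken into steps, the same pipeline is easier to follow; this is a sketch equivalent to the one-liner above:
tmp = df[['col1', 'col2', 'col3']].mul([1, 2, 3])  # each 0/1 flag becomes 0, 1, 2 or 3
tmp = tmp.mask(tmp == 0)                           # turn the zeros into NaN
# back-fill across the row so 'col1' ends up holding the single nonzero label
df['col1_col2_col3'] = tmp.bfill(axis=1)['col1'].astype(int)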

You can use NumPy's argmax:
df.assign(
    col1_col2_col3=df[['col1', 'col2', 'col3']].to_numpy().argmax(axis=1) + 1
)
   col1  col2  col3  col20  col1_col2_col3
0     0     1     0     10               2
1     0     1     0     20               2
2     1     0     0     30               1
3     0     1     0     40               2
4     0     0     1     50               3
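One caveat: argmax returns 0 on an all-zero row, which this approach would mislabel as 1. If such rows can occur in your data, a hedged sketch that gives them 0 instead:
flags = df[['col1', 'col2', 'col3']].to_numpy()
labels = flags.argmax(axis=1) + 1      # 1, 2 or 3 for the column holding the 1
labels[flags.sum(axis=1) == 0] = 0     # all-zero rows get 0 rather than a spurious 1
df = df.assign(col1_col2_col3=labels)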

Related

Count the number of non-zero columns in a given set of columns of a data frame - pandas

I have a df as shown below
df:
Id  Jan20  Feb20  Mar20  Apr20  May20  Jun20  Jul20  Aug20  Sep20  Oct20  Nov20  Dec20  Amount
 1     20      0      0     12      1      3      1      0      0      2      2      0     100
 2      0      0      2      1      0      2      0      0      1      0      0      0     500
 3      1      2      1      2      3      1      1      2      2      3      1      1     300
From the above, I would like to calculate an Activeness value, which is the per-row count of non-zero values across the month columns given below.
'Jan20', 'Feb20', 'Mar20', 'Apr20', 'May20', 'Jun20', 'Jul20',
'Aug20', 'Sep20', 'Oct20', 'Nov20', 'Dec20'
Expected Output:
Id  Jan20  Feb20  Mar20  Apr20  May20  Jun20  Jul20  Aug20  Sep20  Oct20  Nov20  Dec20  Amount  Activeness
 1     20      0      0     12      1      3      1      0      0      2      2      0     100           7
 2      0      0      2      1      0      2      0      0      1      0      0      0     500           4
 3      1      2      1      2      3      1      1      2      2      3      1      1     300          12
I tried the code below:
df['Activeness'] = pd.Series(
    index=df.index,
    data=np.count_nonzero(df[['Jan20', 'Feb20', 'Mar20', 'Apr20', 'May20', 'Jun20',
                              'Jul20', 'Aug20', 'Sep20', 'Oct20', 'Nov20', 'Dec20']],
                          axis=1))
which works well, but I would like to know whether there is a faster method.
You can try:
df['Activeness'] = df.filter(like='20').ne(0).sum(axis=1)
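Note that filter(like='20') keeps every column whose name contains '20', which here happens to be exactly the twelve month columns. If other such columns could appear, spelling the list out is a safer sketch:
month_cols = ['Jan20', 'Feb20', 'Mar20', 'Apr20', 'May20', 'Jun20',
              'Jul20', 'Aug20', 'Sep20', 'Oct20', 'Nov20', 'Dec20']
df['Activeness'] = df[month_cols].ne(0).sum(axis=1)  # count non-zero months per row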

Writing a Function on a DataFrame in Pandas

I have data in Excel with two columns, 'Peak Value' and 'Label'. I want to add values to the 'Label' column based on the 'Peak Value' column.
So, Input looks like below
Peak Value:  0  0  0  88  0  0  88  0  0  88  0
Label:       0  0  0   0  0  0   0  0  0   0  0
Whenever the value in 'Peak Value' is greater than zero, 'Label' should be incremented to 1, and that 1 should replace all the zeros below it. At the next value greater than zero the label should be incremented to 2, which then replaces the following zeros with 2.
So, the output will look like this:
Peak Value:  0  0  0  88  0  0  88  0  0  88  0
Label:       0  0  0   1  1  1   2  2  2   3  3
and so on....
I tried writing a function, but I am only able to put a 1 wherever 'Peak Value' is greater than 0.
def funct(row):
    if row['Peak Value'] > 0:
        val = 1
    else:
        val = 0
    return val

df['Label'] = df.apply(funct, axis=1)
Maybe you could try using cumsum and ffill:
import numpy as np
df['Labels'] = (df['Peak Value'] > 0).groupby(df['Peak Value']).cumsum()
df['Labels'] = df['Labels'].replace(0, np.nan).ffill().replace(np.nan, 0).astype(int)
Output:
    Peak Value  Labels
0            0       0
1            0       0
2            0       0
3           88       1
4            0       1
5            0       1
6           88       2
7            0       2
8            0       2
9           88       3
10           0       3
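For this data, where every positive peak opens a new block, a shorter sketch appears to give the same labels, since the cumulative sum simply counts the peaks seen so far:
df['Label'] = (df['Peak Value'] > 0).cumsum()  # 0 0 0 1 1 1 2 2 2 3 3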

Comparing two different-sized pandas DataFrames and finding the row indices with equal values

I need some help with comparing two pandas DataFrames.
I have two dataframes
The first dataframe is
df1 =
a b c d
0 1 1 1 1
1 0 1 0 1
2 0 0 0 1
3 1 1 1 1
4 1 0 1 0
5 1 1 1 0
6 0 0 1 0
7 0 1 0 1
and the second dataframe is
df2 =
a b c d
0 1 1 1 1
1 1 0 1 0
2 0 0 1 0
I want to find the row indices of dataframe 1 (df1) where the entire row is the same as a row in dataframe 2 (df2). My expected result would be
0
3
4
6
The above indices do not need to be in any particular order; all I want is the indices of dataframe 1 (df1).
Is there a way to do this without using a for loop?
Thanks
Tommy
You can use merge:
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] == 'both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')
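One caveat to hedge: the positions line up only because the merged frame has exactly one row per row of df1. If df2 could contain duplicate rows, matching rows of df1 would be repeated and the index would shift, so de-duplicating df2 first is a safer sketch:
idx = (df1.merge(df2.drop_duplicates(), indicator=True, how='left')
          .loc[lambda x: x['_merge'] == 'both']
          .index)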

Splitting a each column value into different columns [duplicate]

This question already has answers here:
Convert pandas DataFrame column of comma separated strings to one-hot encoded
(3 answers)
Closed 4 years ago.
I have a survey response sheet with questions that can have multiple answers, selected using a set of checkboxes.
When I get the data from the response sheet and import it into pandas I get this:
             Timestamp            Sports you like  Age
0  23/11/2013 13:22:30   Football, Chess, Cycling   15
1  23/11/2013 13:22:34                   Football   25
2  23/11/2013 13:22:39          Swimming,Football   22
3  23/11/2013 13:22:45              Chess, Soccer   27
4  23/11/2013 13:22:48                     Soccer   30
There can be any number of sport values in the sports column (later rows have basketball, volleyball, etc.), and there are other columns as well. I'd like to compute statistics on the answers to the question (how many people liked Football, etc.). The problem is that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.
Is there a simple way within pandas to convert this sort of data frame into one with multiple columns called Sports-Football, Sports-Volleyball, Sports-Basketball, each of them boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this.
What I need is a new dataframe that looks like this (along with the Age column):
             Timestamp  Sports-Football  Sports-Chess  Sports-Cycling  ....
0  23/11/2013 13:22:30                1             1               1
1  23/11/2013 13:22:34                1             0               0
2  23/11/2013 13:22:39                1             0               0
3  23/11/2013 13:22:45                0             1               0
I got to this point and can't proceed further:
df['Sports you like'].str.split(r',\s*')
which splits the values into lists, but the first element may be any sport; I need a 1 in the Football column if the user likes Football, and a 0 otherwise.
The problem is the separator ',\s*', so the solution is to add str.split and str.join before str.get_dummies:
df1 = (df.pop('Sports you like').str.split(r',\s*')
         .str.join('|')
         .str.get_dummies()
         .add_prefix('Sports-'))
df = df.join(df1)
print(df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
Or use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
s = df.pop('Sports you like').str.split(r',\s*')
df1 = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_).add_prefix('Sports-')
print(df1)
Sports-Chess Sports-Cycling Sports-Football Sports-Soccer \
0 1 1 1 0
1 0 0 1 0
2 0 0 1 0
3 1 0 0 1
4 0 0 0 1
Sports-Swimming
0 0
1 0
2 1
3 0
4 0
df = df.join(df1)
print(df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
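With the dummy columns in place, the per-sport statistics the question asks for reduce to a column sum; a minimal sketch, with counts taken from the five sample rows:
likes = df.filter(like='Sports-').sum().sort_values(ascending=False)
print(likes)  # Sports-Football 3, Sports-Chess 2, Sports-Soccer 2, ...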

How to iterate through 'nested' dataframes without 'for' loops in pandas (python)?

I'm trying to check the cartesian distance between each pair of points in one dataframe and the scattered points in another dataframe, to see whether the input gets above a threshold 'distance' from my checking points.
I have this working with nested for loops, but it is painfully slow (~7 min for 40k input rows, each checked against ~180 other rows, plus some overhead operations).
Here is what I'm attempting in vectorized form: 'for every pair of points (a, b) from df1, if the distance to ANY point (d, e) from df2 is > threshold, print "yes" into df1.c, next to the input points'.
..but I'm getting unexpected behavior from this. With the given data, all but one of the distances are > 1, yet only the first row of df1.c gets 'yes'.
Thanks for any ideas - the problem is probably in the 'df1.loc...' line:
import numpy as np
from pandas import DataFrame
inp1 = [{'a':1, 'b':2, 'c':0}, {'a':1,'b':3,'c':0}, {'a':0,'b':3,'c':0}]
df1 = DataFrame(inp1)
inp2 = [{'d':2, 'e':0}, {'d':0,'e':3}, {'d':0,'e':4}]
df2 = DataFrame(inp2)
threshold = 1
df1.loc[np.sqrt((df1.a - df2.d) ** 2 + (df1.b - df2.e) ** 2) > threshold, 'c'] = "yes"
print(df1)
print(df2)
a b c
0 1 2 yes
1 1 3 0
2 0 3 0
d e
0 2 0
1 0 3
2 0 4
Here is an idea to help you get started...
Source DFs:
In [170]: df1
Out[170]:
c x y
0 0 1 2
1 0 1 3
2 0 0 3
In [171]: df2
Out[171]:
x y
0 2 0
1 0 3
2 0 4
Helper DF with cartesian product:
In [172]: x = df1[['x','y']].reset_index() \
     ...:         .assign(k=0) \
     ...:         .merge(df2.assign(k=0).reset_index(), on='k', suffixes=['1','2']) \
     ...:         .drop(columns='k')
In [173]: x
Out[173]:
index1 x1 y1 index2 x2 y2
0 0 1 2 0 2 0
1 0 1 2 1 0 3
2 0 1 2 2 0 4
3 1 1 3 0 2 0
4 1 1 3 1 0 3
5 1 1 3 2 0 4
6 2 0 3 0 2 0
7 2 0 3 1 0 3
8 2 0 3 2 0 4
Now we can calculate the distance:
In [169]: x.eval("D=sqrt((x1 - x2)**2 + (y1 - y2)**2)", inplace=False)
Out[169]:
index1 x1 y1 index2 x2 y2 D
0 0 1 2 0 2 0 2.236068
1 0 1 2 1 0 3 1.414214
2 0 1 2 2 0 4 2.236068
3 1 1 3 0 2 0 3.162278
4 1 1 3 1 0 3 1.000000
5 1 1 3 2 0 4 1.414214
6 2 0 3 0 2 0 3.605551
7 2 0 3 1 0 3 0.000000
8 2 0 3 2 0 4 1.000000
or filter:
In [175]: x.query("sqrt((x1 - x2)**2 + (y1 - y2)**2) > @threshold")
Out[175]:
index1 x1 y1 index2 x2 y2
0 0 1 2 0 2 0
1 0 1 2 1 0 3
2 0 1 2 2 0 4
3 1 1 3 0 2 0
5 1 1 3 2 0 4
6 2 0 3 0 2 0
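To finish the original task from here, one hedged way is to take the index1 values that survive the filter and flag the corresponding rows of df1:
far = x.query("sqrt((x1 - x2)**2 + (y1 - y2)**2) > @threshold")
df1.loc[far['index1'].unique(), 'c'] = 'yes'  # 'yes' wherever ANY df2 point is farther than threshold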
Try using the SciPy implementations; they are surprisingly fast:
scipy.spatial.distance.pdist
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
or
scipy.spatial.distance_matrix
https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.spatial.distance_matrix.html
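For two different point sets, scipy.spatial.distance_matrix computes all pairwise distances in one call; a minimal sketch, assuming the df1/df2 column names from the question:
from scipy.spatial import distance_matrix

d = distance_matrix(df1[['a', 'b']].to_numpy(), df2[['d', 'e']].to_numpy())
df1.loc[(d > threshold).any(axis=1), 'c'] = 'yes'  # ANY distance above the threshold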
