Create counter column based on values in 2 dataframe columns - python-3.x

I am looking to create a counter column based on row values in 2 dataframe columns, represented here as Col1 and Col2.
An example of the dataset is as follows:
Col1 Col2
a 0
a 0
a 0
a 1
a 0
a 0
a 0
a 1
a 1
b 0
b 0
b 1
b 1
b 0
b 0
Col1 is an identification variable, and I want the counter to start over when a new identification variable comes along (so when 'a' switches to 'b', the counter returns to 0).
Col2 indicates a new input in the data: when a 1 appears, a new input starts, and the 0s after it correspond to measurements within that input. Each time a 1 appears, I want the counter to increment by 1. Each time the 1 returns to a 0, I also want the counter to increment by 1. Based on the above dataset, I want the output to look like the following in Col3:
Col1 Col2 Col3
a 0 0
a 0 0
a 0 0
a 1 1
a 0 2
a 0 2
a 0 2
a 1 3
a 1 4
b 0 0
b 0 0
b 1 1
b 1 2
b 0 3
b 0 3
So basically, every time Col2 switches between 0 and 1, and each time a 1 appears, I want the counter to increment; while Col2 stays at 0, I want the counter to hold its value. And every time Col1 changes to a new ID (in this case, from 'a' to 'b'), I want the counter to start over at 0.
I've mainly been doing this with conditional statements, but there are a ton of them, and I need to run this on a large dataset, which would take hours. Is there a quick and easy way to do this with conditions on both columns? Or does anyone have suggestions on transformations of this data that would make a categorization like this easier?
I understand this is a slightly confusing request, so please let me know if there is anything I can do to make it clearer.
Thanks!
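For reference, a minimal sketch that reconstructs the sample frame shown above:

import pandas as pd

df = pd.DataFrame({
    'Col1': list('aaaaaaaaa') + list('bbbbbb'),
    'Col2': [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})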

Within each Col1 group, flag every row that is a 1 or that differs from the previous row, then take the cumulative sum of the flags:
import numpy as np

df.assign(Col4=df.groupby('Col1').Col2.apply(
    lambda x: pd.Series(np.r_[False, (x.values[1:] == 1)
                              | (x.values[1:] != x.values[:-1])].cumsum())).values)
Col1 Col2 Col3 Col4
0 a 0 0 0
1 a 0 0 0
2 a 0 0 0
3 a 1 1 1
4 a 0 2 2
5 a 0 2 2
6 a 0 2 2
7 a 1 3 3
8 a 1 4 4
9 b 0 0 0
10 b 0 0 0
11 b 1 1 1
12 b 1 2 2
13 b 0 3 3
14 b 0 3 3
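An equivalent, fully vectorized sketch (my variant, not part of the original answer) that avoids apply by comparing each value to the previous value in its group via shift:

# flag rows that are 1 or differ from the previous row in the same group;
# the first row of each group (prev is NaN) never increments
prev = df.groupby('Col1')['Col2'].shift()
inc = (df['Col2'].eq(1) | df['Col2'].ne(prev)) & prev.notna()
df['Col4'] = inc.groupby(df['Col1']).cumsum()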

Related

How to add a column in pandas dataframe based on other columns for large dataset?

I have a CSV file that contains 1,000,000 rows and columns like following.
col1 col2 col3...col20
0 1 0 ... 10
0 1 0 ... 20
1 0 0 ... 30
0 1 0 ... 40
0 0 1 ... 50
................
I want to add a new column called col1_col2_col3 like following.
col1 col2 col3 col1_col2_col3 ...col20
0 1 0 2 ... 10
0 1 0 2 ... 20
1 0 0 1 ... 30
0 1 0 2 ... 40
0 0 1 3 ... 50
.................
I have loaded the data file into a pandas data frame and then tried the following:
for idx, row in df.iterrows():
    if df.loc[idx, 'col1'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 1
    elif df.loc[idx, 'col2'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 2
    elif df.loc[idx, 'col3'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 3
The above solution works. However, it takes a very long time to run. Is there any way to create col1_col2_col3 faster?
Here's one way using multiplication. The idea is to multiply each column by 1, 2 or 3 depending on which column it is, then keep the nonzero values:
df['col1_col2_col3'] = df[['col1','col2','col3']].mul([1,2,3]).mask(lambda x: x==0).bfill(axis=1)['col1'].astype(int)
N.B. This assumes that each row has exactly one nonzero value across ['col1', 'col2', 'col3'].
Output:
col1 col2 col3 ... col20 col1_col2_col3
0 0 1 0 ... 10 2
1 0 1 0 ... 20 2
2 1 0 0 ... 30 1
3 0 1 0 ... 40 2
4 0 0 1 ... 50 3
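For readability, the same chain can be unpacked step by step (a sketch of what each call does):

tmp = df[['col1', 'col2', 'col3']].mul([1, 2, 3])  # each set flag becomes 1, 2 or 3
tmp = tmp.mask(tmp == 0)                           # zeros become NaN
df['col1_col2_col3'] = tmp.bfill(axis=1)['col1'].astype(int)  # first non-NaN per row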
You can use NumPy's argmax:
df.assign(
    col1_col2_col3=df[['col1', 'col2', 'col3']].to_numpy().argmax(axis=1) + 1
)
col1 col2 col3 col20 col1_col2_col3
0 0 1 0 10 2
1 0 1 0 20 2
2 1 0 0 30 1
3 0 1 0 40 2
4 0 0 1 50 3
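Note that argmax returns the first maximum, so a row with no flag set at all would silently get 1. If such rows can occur, a sketch using np.select (my addition, not from the original answer) makes the fallback explicit:

import numpy as np

# pick 1/2/3 for whichever flag column is set; rows with no flag fall back to 0
conditions = [df['col1'].eq(1), df['col2'].eq(1), df['col3'].eq(1)]
df['col1_col2_col3'] = np.select(conditions, [1, 2, 3], default=0)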

How to return all rows that have an equal number of 0 and 1 values?

I have a dataframe with 50 columns, each containing either 0 or 1. How do I return all rows that have an equal number (a tie) of 0s and 1s (25 "0"s and 25 "1"s)?
An example on a 4 columns:
A B C D
1 1 0 0
1 1 1 0
1 0 1 0
0 0 0 0
Based on the above example, it should return the first and the third rows:
A B C D
1 1 0 0
1 0 1 0
Because you have four columns, a tie means exactly two 1s per row, which is the same as the row mean being 0.5. So, please try:
df[df.mean(1).eq(0.5)]
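The same idea generalizes to the 50-column case, since a tie means the 1s fill exactly half of the columns; an equivalent sketch using the row sum:

# keep rows whose 1s account for exactly half of the columns
df[df.sum(axis=1) == df.shape[1] / 2]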

pandas assign value in multiple columns based on value in one

I have a dataset like this,
sample = {'Theme': ['never give a ten','interaction speed','no feedback,premium'],
'cat1': [0,0,0],
'cat2': [0,0,0],
'cat3': [0,0,0],
'cat4': [0,0,0]
}
df = pd.DataFrame(sample, columns=['Theme','cat1','cat2','cat3','cat4'])
df
Theme cat1 cat2 cat3 cat4
0 never give a ten 0 0 0 0
1 interaction speed 0 0 0 0
2 no feedback,premium 0 0 0 0
Now, I need to replace the values in the cat columns based on the value in Theme. If the Theme column contains 'never give a ten', set cat1 to 1; similarly, if it contains 'interaction speed', set cat2 to 1; if it contains 'no feedback', set cat3 to 1; and for 'premium', set cat4 to 1.
In this sample I have provided 4 categories, but in total I have 21. I could write an if word in string check 21 times, but I am looking for an efficient way to write this as a function that goes through the logic for every row and updates the corresponding columns. Can anyone help, please?
Thanks in advance.
It is possible to create indicator columns from the categories with Series.str.get_dummies (the column names come out sorted):
df1 = df['Theme'].str.get_dummies(',')
print (df1)
interaction speed never give a ten no feedback premium
0 0 1 0 0
1 1 0 0 0
2 0 0 1 1
If you need the Theme column in the output, add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print (df11)
Theme interaction speed never give a ten no feedback \
0 never give a ten 0 1 0
1 interaction speed 1 0 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
If the order of the columns is important, add DataFrame.reindex:
# remove possible duplicates while keeping the original order
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print (df2)
never give a ten interaction speed no feedback premium
0 1 0 0 0
1 0 1 0 0
2 0 0 1 1
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print (df2)
Theme never give a ten interaction speed no feedback \
0 never give a ten 1 0 0
1 interaction speed 0 1 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
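To write these indicators back into the original cat1..cat4 columns, one option is to rename the dummy columns and update the frame in place (a sketch; the keyword-to-column mapping below is my assumption from the question text):

# assumed mapping from Theme keyword to target category column
mapping = {'never give a ten': 'cat1', 'interaction speed': 'cat2',
           'no feedback': 'cat3', 'premium': 'cat4'}
dummies = df['Theme'].str.get_dummies(',').rename(columns=mapping)
df.update(dummies)  # overwrites matching cat columns, aligned by name and index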

Comparing two different-sized pandas DataFrames to find the row indices with equal values

I need some help with comparing two pandas dataframes.
I have two dataframes
The first dataframe is
df1 =
a b c d
0 1 1 1 1
1 0 1 0 1
2 0 0 0 1
3 1 1 1 1
4 1 0 1 0
5 1 1 1 0
6 0 0 1 0
7 0 1 0 1
and the second dataframe is
df2 =
a b c d
0 1 1 1 1
1 1 0 1 0
2 0 0 1 0
I want to find the indices of the rows in dataframe 1 (df1) whose entire row is the same as a row in dataframe 2 (df2). My expected result would be:
0
3
4
6
The above indices do not need to be in any particular order; all I want are the indices from dataframe 1 (df1).
Is there a way without using for loop?
Thanks
Tommy
You can use merge with indicator=True:
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] == 'both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')
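A merge-free alternative (my sketch, not from the original answer) that keeps df1's original index even if df2 contains duplicate rows:

# treat each row as a tuple and test membership among df2's row tuples
mask = df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))
df1.index[mask]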

Increment values in a column based on another column (Pandas)

I have a DataFrame containing three columns:
The incrementor
The incremented
Other
I would like to lengthen the DataFrame in a particular way. For each row, I want to add a number of rows depending on the incrementor, and in these rows the incremented increases by one each time, while the other is just replicated.
I made a small example which makes it more clear:
df = pd.DataFrame([[2,1,3], [5,20,0], ['a','b','c']]).transpose()
df.columns = ['incrementor', 'incremented', 'other']
df
incrementor incremented other
0 2 5 a
1 1 20 b
2 3 0 c
The desired output is:
incrementor incremented other
0 2 5 a
1 2 6 a
2 1 20 b
3 3 0 c
4 3 1 c
5 3 2 c
Is there a way to do this elegantly and efficiently with Pandas? Or is there no way to avoid looping?
First, get the repeated rows on incrementor using Index.repeat and .loc:
In [1029]: dff = df.loc[df.index.repeat(df.incrementor.astype(int))]
Then, modify incremented with cumcount
In [1030]: dff.assign(
incremented=dff.incremented + dff.groupby(level=0).incremented.cumcount()
).reset_index(drop=True)
Out[1030]:
incrementor incremented other
0 2 5 a
1 2 6 a
2 1 20 b
3 3 0 c
4 3 1 c
5 3 2 c
Details
In [1031]: dff
Out[1031]:
incrementor incremented other
0 2 5 a
0 2 5 a
1 1 20 b
2 3 0 c
2 3 0 c
2 3 0 c
In [1032]: dff.groupby(level=0).incremented.cumcount()
Out[1032]:
0 0
0 1
1 0
2 0
2 1
2 2
dtype: int64
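The two steps can also be written as a single chain (a sketch; the astype(int) calls are needed because the transposed sample frame holds object dtype):

out = (df.loc[df.index.repeat(df['incrementor'].astype(int))]
         .assign(incremented=lambda d: d['incremented'].astype(int)
                                       + d.groupby(level=0).cumcount())
         .reset_index(drop=True))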
