Create counter column based on values in 2 dataframe columns - python-3.x

I am looking to create a counter column based on row values in 2 dataframe columns, represented here as Col1 and Col2.
An example of the dataset is as follows:
Col1 Col2
a 0
a 0
a 0
a 1
a 0
a 0
a 0
a 1
a 1
b 0
b 0
b 1
b 1
b 0
b 0
Col1 is an identification variable, and I want the counter to start over when a new identification variable comes along (so when 'a' switches to 'b', the counter returns to 0).
Col2 indicates a new input in the data: when a 1 appears, a new input starts, and the 0s after it correspond to measurements within that input. Each time a 1 appears, I want the counter to increment by 1. Each time the 1 returns to a 0, I also want the counter to increment by 1. Based on the above dataset, I want the output to look like the following in Col3:
Col1 Col2 Col3
a 0 0
a 0 0
a 0 0
a 1 1
a 0 2
a 0 2
a 0 2
a 1 3
a 1 4
b 0 0
b 0 0
b 1 1
b 1 2
b 0 3
b 0 3
So basically, every time Col2 switches between 0 and 1, and each time a 1 appears, I want the counter to increment; while Col2 stays at 0, I want the counter to hold its value. And every time Col1 changes to a new ID (in this case, from 'a' to 'b'), I want the counter to start over at 0.
I've mainly been doing this with conditional statements, but there are a ton of them, and I need to run this on a large dataset, which would take hours. Is there a quick and easy way to do this with conditions on both columns? Or does anyone have suggestions on transformations of this data that would make a categorization like this easier?
I understand this is a slightly confusing request, so please let me know if there is anything I can do to make it clearer.
Thanks!
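For reference, a minimal sketch that reconstructs the sample frame shown above:

import pandas as pd

df = pd.DataFrame({
    'Col1': list('aaaaaaaaa') + list('bbbbbb'),
    'Col2': [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})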

Within each Col1 group, flag every row that is a 1 or that differs from the previous row, then take the cumulative sum of the flags:
import numpy as np

df.assign(Col4=df.groupby('Col1').Col2.apply(
    lambda x: pd.Series(np.r_[False, (x.values[1:] == 1)
                              | (x.values[1:] != x.values[:-1])].cumsum())).values)
Col1 Col2 Col3 Col4
0 a 0 0 0
1 a 0 0 0
2 a 0 0 0
3 a 1 1 1
4 a 0 2 2
5 a 0 2 2
6 a 0 2 2
7 a 1 3 3
8 a 1 4 4
9 b 0 0 0
10 b 0 0 0
11 b 1 1 1
12 b 1 2 2
13 b 0 3 3
14 b 0 3 3
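An equivalent, fully vectorized sketch (my variant, not part of the original answer) that avoids apply by comparing each value to the previous value in its group via shift:

# flag rows that are 1 or differ from the previous row in the same group;
# the first row of each group (prev is NaN) never increments
prev = df.groupby('Col1')['Col2'].shift()
inc = (df['Col2'].eq(1) | df['Col2'].ne(prev)) & prev.notna()
df['Col4'] = inc.groupby(df['Col1']).cumsum()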

Related

How to add a column in pandas dataframe based on other columns for large dataset?

I have a CSV file that contains 1,000,000 rows and columns like following.
col1 col2 col3...col20
0 1 0 ... 10
0 1 0 ... 20
1 0 0 ... 30
0 1 0 ... 40
0 0 1 ... 50
................
I want to add a new column called col1_col2_col3 like following.
col1 col2 col3 col1_col2_col3 ...col20
0 1 0 2 ... 10
0 1 0 2 ... 20
1 0 0 1 ... 30
0 1 0 2 ... 40
0 0 1 3 ... 50
.................
I have loaded the data file into a pandas data frame and then tried the following:
for idx, row in df.iterrows():
    if df.loc[idx, 'col1'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 1
    elif df.loc[idx, 'col2'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 2
    elif df.loc[idx, 'col3'] == 1:
        df.loc[idx, 'col1_col2_col3'] = 3
The above solution works. However, it takes a very long time to run. Is there any way to create col1_col2_col3 faster?
Here's one way using multiplication. The idea is to multiply each column by 1, 2 or 3 depending on which column it is, then keep the nonzero values:
df['col1_col2_col3'] = df[['col1','col2','col3']].mul([1,2,3]).mask(lambda x: x==0).bfill(axis=1)['col1'].astype(int)
N.B. This assumes that each row has exactly one nonzero value across ['col1', 'col2', 'col3'].
Output:
col1 col2 col3 ... col20 col1_col2_col3
0 0 1 0 ... 10 2
1 0 1 0 ... 20 2
2 1 0 0 ... 30 1
3 0 1 0 ... 40 2
4 0 0 1 ... 50 3
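For readability, the same chain can be unpacked step by step (a sketch of what each call does):

tmp = df[['col1', 'col2', 'col3']].mul([1, 2, 3])  # each set flag becomes 1, 2 or 3
tmp = tmp.mask(tmp == 0)                           # zeros become NaN
df['col1_col2_col3'] = tmp.bfill(axis=1)['col1'].astype(int)  # first non-NaN per row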
You can use NumPy's argmax:
df.assign(
    col1_col2_col3=df[['col1', 'col2', 'col3']].to_numpy().argmax(axis=1) + 1
)
col1 col2 col3 col20 col1_col2_col3
0 0 1 0 10 2
1 0 1 0 20 2
2 1 0 0 30 1
3 0 1 0 40 2
4 0 0 1 50 3
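Note that argmax returns the first maximum, so a row with no flag set at all would silently get 1. If such rows can occur, a sketch using np.select (my addition, not from the original answer) makes the fallback explicit:

import numpy as np

# pick 1/2/3 for whichever flag column is set; rows with no flag fall back to 0
conditions = [df['col1'].eq(1), df['col2'].eq(1), df['col3'].eq(1)]
df['col1_col2_col3'] = np.select(conditions, [1, 2, 3], default=0)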

How to return all rows that have an equal number of 0 and 1 values?

I have a dataframe with 50 columns, each containing either 0 or 1. How do I return all rows that have an equal number (a tie) of 0s and 1s (25 "0"s and 25 "1"s)?
An example on a 4 columns:
A B C D
1 1 0 0
1 1 1 0
1 0 1 0
0 0 0 0
Based on the above example, it should return the first and the third rows:
A B C D
1 1 0 0
1 0 1 0
Because you have four columns, a tie means exactly two 1s per row, which is the same as the row mean being 0.5. So, please try:
df[df.mean(1).eq(0.5)]
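The same idea generalizes to the 50-column case, since a tie means the 1s fill exactly half of the columns; an equivalent sketch using the row sum:

# keep rows whose 1s account for exactly half of the columns
df[df.sum(axis=1) == df.shape[1] / 2]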

pandas assign value in multiple columns based on value in one

I have a dataset like this,
sample = {'Theme': ['never give a ten','interaction speed','no feedback,premium'],
'cat1': [0,0,0],
'cat2': [0,0,0],
'cat3': [0,0,0],
'cat4': [0,0,0]
}
df = pd.DataFrame(sample, columns=['Theme','cat1','cat2','cat3','cat4'])
df
Theme cat1 cat2 cat3 cat4
0 never give a ten 0 0 0 0
1 interaction speed 0 0 0 0
2 no feedback,premium 0 0 0 0
Now, I need to replace the values in the cat columns based on the value in Theme. If the Theme column contains 'never give a ten', set cat1 to 1; similarly, if it contains 'interaction speed', set cat2 to 1; if it contains 'no feedback', set cat3 to 1; and for 'premium', set cat4 to 1.
In this sample I have provided 4 categories, but in total I have 21. I could write an if word in string check 21 times, but I am looking for an efficient way to write this as a function that goes through the logic for every row and updates the corresponding columns. Can anyone help, please?
Thanks in advance.
It is possible to create indicator columns from the categories with Series.str.get_dummies (the column names come out sorted):
df1 = df['Theme'].str.get_dummies(',')
print (df1)
interaction speed never give a ten no feedback premium
0 0 1 0 0
1 1 0 0 0
2 0 0 1 1
If you need the Theme column in the output, add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print (df11)
Theme interaction speed never give a ten no feedback \
0 never give a ten 0 1 0
1 interaction speed 1 0 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
If the order of the columns is important, add DataFrame.reindex:
# remove possible duplicates while keeping the original order
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print (df2)
never give a ten interaction speed no feedback premium
0 1 0 0 0
1 0 1 0 0
2 0 0 1 1
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print (df2)
Theme never give a ten interaction speed no feedback \
0 never give a ten 1 0 0
1 interaction speed 0 1 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
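To write these indicators back into the original cat1..cat4 columns, one option is to rename the dummy columns and update the frame in place (a sketch; the keyword-to-column mapping below is my assumption from the question text):

# assumed mapping from Theme keyword to target category column
mapping = {'never give a ten': 'cat1', 'interaction speed': 'cat2',
           'no feedback': 'cat3', 'premium': 'cat4'}
dummies = df['Theme'].str.get_dummies(',').rename(columns=mapping)
df.update(dummies)  # overwrites matching cat columns, aligned by name and index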

Comparing two different-sized pandas DataFrames to find the row indices with equal values

I need some help with comparing two pandas dataframes.
I have two dataframes
The first dataframe is
df1 =
a b c d
0 1 1 1 1
1 0 1 0 1
2 0 0 0 1
3 1 1 1 1
4 1 0 1 0
5 1 1 1 0
6 0 0 1 0
7 0 1 0 1
and the second dataframe is
df2 =
a b c d
0 1 1 1 1
1 1 0 1 0
2 0 0 1 0
I want to find the indices of the rows in dataframe 1 (df1) whose entire row is the same as a row in dataframe 2 (df2). My expected result would be:
0
3
4
6
The above indices do not need to be in any particular order; all I want are the indices from dataframe 1 (df1).
Is there a way without using for loop?
Thanks
Tommy
You can use merge with indicator=True:
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] == 'both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')
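A merge-free alternative (my sketch, not from the original answer) that keeps df1's original index even if df2 contains duplicate rows:

# treat each row as a tuple and test membership among df2's row tuples
mask = df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))
df1.index[mask]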

Increment values in a column based on another column (Pandas)

I have a DataFrame containing three columns:
The incrementor
The incremented
Other
I would like to lengthen the DataFrame in a particular way. For each row, I want to add a number of rows depending on the incrementor, and in these rows the incremented increases by one each time, while the other is just replicated.
I made a small example which makes it more clear:
df = pd.DataFrame([[2,1,3], [5,20,0], ['a','b','c']]).transpose()
df.columns = ['incrementor', 'incremented', 'other']
df
incrementor incremented other
0 2 5 a
1 1 20 b
2 3 0 c
The desired output is:
incrementor incremented other
0 2 5 a
1 2 6 a
2 1 20 b
3 3 0 c
4 3 1 c
5 3 2 c
Is there a way to do this elegantly and efficiently with Pandas? Or is there no way to avoid looping?
First, get the repeated rows on incrementor using Index.repeat and .loc:
In [1029]: dff = df.loc[df.index.repeat(df.incrementor.astype(int))]
Then, modify incremented with cumcount
In [1030]: dff.assign(
incremented=dff.incremented + dff.groupby(level=0).incremented.cumcount()
).reset_index(drop=True)
Out[1030]:
incrementor incremented other
0 2 5 a
1 2 6 a
2 1 20 b
3 3 0 c
4 3 1 c
5 3 2 c
Details
In [1031]: dff
Out[1031]:
incrementor incremented other
0 2 5 a
0 2 5 a
1 1 20 b
2 3 0 c
2 3 0 c
2 3 0 c
In [1032]: dff.groupby(level=0).incremented.cumcount()
Out[1032]:
0 0
0 1
1 0
2 0
2 1
2 2
dtype: int64
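The two steps can also be written as a single chain (a sketch; the astype(int) calls are needed because the transposed sample frame holds object dtype):

out = (df.loc[df.index.repeat(df['incrementor'].astype(int))]
         .assign(incremented=lambda d: d['incremented'].astype(int)
                                       + d.groupby(level=0).cumcount())
         .reset_index(drop=True))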
