Pandas: How to increment a new column based on increment and consecutive properties of 2 other columns? - python-3.x

I'm currently working on a bulk data pre-processing framework in pandas and since I'm relatively new to pandas, I can't seem to solve this problem:
Given: A dataset with 2 columns: col_1, col_2
Required: A new column req_col such that its value is incremented if
a. the values in col_1 are not consecutive, OR
b. the value of col_2 is incremented
NOTE:
col_2 always starts from 1 and always increases in value, and values are never missing (always consecutive), e.g. 1,1,2,2,3,3,4,5,6,6,6,7,8,8,9...
col_1 always starts from 0 and always increases in value, but some values can be missing (need not be consecutive), e.g. 0,1,2,2,3,6,6,6,10,10,10...
EXPECTED ANSWER:
col_1 col_2 req_col #Changes in req_col explained below
0 1 1
0 1 1
0 2 2 #because col_2 value has incremented
1 2 2
1 2 2
3 2 3 #because '3' is not consecutive to '1' in col_1
3 3 4 #because of increment in col_2
5 3 5 #because '5' is not consecutive to '3' in col_1
6 4 6 #because of increment in col_2 and so on...
6 4 6

Try:
df['req_col'] = (df['col_1'].diff().gt(1) |  # col_1 is not consecutive
                 df['col_2'].diff().ne(0)    # col_2 has a jump
                ).cumsum()
Output:
0 1
1 1
2 2
3 2
4 2
5 3
6 4
7 5
8 6
9 6
dtype: int32
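
For reference, a self-contained version of the same idea, with col_1 and col_2 copied from the expected-answer table above:
import pandas as pd

df = pd.DataFrame({'col_1': [0, 0, 0, 1, 1, 3, 3, 5, 6, 6],
                   'col_2': [1, 1, 2, 2, 2, 2, 3, 3, 4, 4]})

# A new value starts whenever col_1 jumps by more than 1 or col_2 changes;
# the NaN diff on the first row also compares unequal to 0, so the count starts at 1.
df['req_col'] = (df['col_1'].diff().gt(1) | df['col_2'].diff().ne(0)).cumsum()
print(df)   # req_col: 1 1 2 2 2 3 4 5 6 6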

Related

pandas compare 1 row value with every other row value and create a matrix

DF in hand
Steps I want to perform:
compare A001 data with A002, A003,...A00N
for every value that matches raise a counter by 1
do not increment the count if NA
repeat for row A002 with all other rows
create a matrix using the index with total count of matching values
DF creation:
data = {'name':['A001', 'A002', 'A003',
                'A004','A005','A006','A007','A008'],
        'Q1':[2,1,1,1,2,1,1,5],
        'Q2':[4,4,4,2,4,2,5,4],
        'Q3':[2,2,3,2,2,3,2,2],
        'Q4':[5,3,5,2,3,2,4,5],
        'Q5':[2,2,3,2,2,2,2,2]}
df = pd.DataFrame(data)
df.at[7, 'Q3'] = None
desired output
thanks in advance.
IIUC,
df = pd.DataFrame({'name':['A001', 'A002', 'A003', 'A004','A005','A006','A007','A008'],
'Q1':[2,1,1,1,2,1,1,5],
'Q2':[4,4,4,2,4,2,5,4],
'Q3':[2,2,3,2,2,3,2,2],
'Q4':[5,3,5,2,3,2,4,5],
'Q5':[2,2,3,2,2,2,2,2]})
dfm = df.merge(df, how='cross').set_index(['name_x','name_y'])
dfm.columns = dfm.columns.str.split('_', expand=True)
df_out = dfm.stack(0).apply(pd.to_numeric, errors='coerce').diff(axis=1).eq(0).sum(axis=1).groupby(level=[0,1]).sum().unstack()
output:
name_y A001 A002 A003 A004 A005 A006 A007 A008
name_x
A001 5 3 2 2 4 1 2 4
A002 3 5 2 3 4 2 3 3
A003 2 2 5 1 1 2 1 2
A004 2 3 1 5 2 4 3 2
A005 4 4 1 2 5 1 2 3
A006 1 2 2 4 1 5 2 1
A007 2 3 1 3 2 2 5 2
A008 4 3 2 2 3 1 2 5
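
If the answer columns are all numeric, the same matrix can also be built with plain numpy broadcasting; this is just an alternative sketch (not part of the answer above), assuming the df defined in the answer. NaN compares unequal to everything, so missing answers are simply not counted:
import numpy as np
import pandas as pd

vals = df.set_index('name').to_numpy(dtype=float)
# Compare every row with every other row and count equal answers per pair.
matches = (vals[:, None, :] == vals[None, :, :]).sum(axis=2)
out = pd.DataFrame(matches, index=df['name'].to_numpy(), columns=df['name'].to_numpy())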

Replacing str by int for all the columns of dataframe without making dictionary for each column

Suppose I have the following dataframe,
d = {'col1':['a','b','c','a','c','c','c','c','c','c'],
'col2':['a1','b1','c1','a1','c1','c1','c1','c1','c1','c1'],
'col3':[1,2,3,2,3,3,3,3,3,3]}
data = pd.DataFrame(d)
I want to go through categorical columns and replace strings with integers. The usual way of doing this is to do:
col1 = {'a': 1,'b': 2, 'c':3}
data.col1 = [col1[item] for item in data.col1]
That is, make a dictionary for each categorical column and do the replacement. But if you have many columns, building a dictionary for each one is time consuming, so is there a better way? In this example col1 has only 3 distinct values, but with many more we would have to write the whole mapping by hand (say {'a': 1,'b': 2, 'c':3, ..., 'z':26}). What is the most efficient way to go through all the categorical columns and replace the strings with numbers without making dictionaries column by column?
First get only the object columns with DataFrame.select_dtypes, then apply factorize to each of them with DataFrame.apply:
cols = data.select_dtypes(object).columns
data[cols] = data[cols].apply(lambda x: pd.factorize(x)[0]) + 1
print (data)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
3 1 1 2
4 3 3 3
5 3 3 3
6 3 3 3
7 3 3 3
8 3 3 3
9 3 3 3
If possible, you could avoid the apply by using a dictionary comprehension in the assign expression (I feel a dictionary comprehension is going to be more efficient; I may be wrong):
values = {col: data[col].factorize()[0] + 1
for col in data.select_dtypes(object)}
data.assign(**values)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
3 1 1 2
4 3 3 3
5 3 3 3
6 3 3 3
7 3 3 3
8 3 3 3
9 3 3 3
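
Another alternative sketch, not from the answers above: cast the object columns to category dtype and use the category codes, starting again from the original data. Note that category codes follow the sorted categories, while factorize numbers values in order of first appearance; for this data both give the same result.
cols = data.select_dtypes(object).columns
# cat.codes starts at 0, so add 1 to match the output above.
data[cols] = data[cols].apply(lambda s: s.astype('category').cat.codes + 1)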

Taking different records from groups using group by in pandas

Suppose I have dataframe like this
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
Now I want to keep all records from each group except the last 3. That means I want to drop the last 3 records from every group. How can I do it using pandas groupby? This is dummy data.
Use GroupBy.cumcount with ascending=False to count each group's rows from the end, then keep the rows where the counter is greater than 2 with Series.gt (2 rather than 3 because the counter starts at 0):
df = df[df.groupby('id').cumcount(ascending=False).gt(2)]
print (df)
id value
3 2 1
Details:
print (df.groupby('id').cumcount(ascending=False))
0 2
1 1
2 0
3 3
4 2
5 1
6 0
7 0
8 0
dtype: int64
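
In recent pandas versions GroupBy.head also accepts a negative n, which keeps everything except the last |n| rows of each group; a one-line sketch, assuming the same df:
# Drop the last 3 rows of every group.
df.groupby('id').head(-3)
#    id  value
# 3   2      1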

Increment values in a column based on another column (Pandas)

I have DataFrame containing three columns:
The incrementor
The incremented
Other
I would like to lengthen the DataFrame in a particular way. For each row, I want to add a number of rows depending on the incrementor, and in these added rows we increment the incremented, while the "other" is just replicated.
I made a small example which makes it more clear:
df = pd.DataFrame([[2,1,3], [5,20,0], ['a','b','c']]).transpose()
df.columns = ['incrementor', 'incremented', 'other']
df
incrementor incremented other
0 2 5 a
1 1 20 b
2 3 0 c
The desired output is:
incrementor incremented other
0 2 5 a
1 2 6 a
2 1 20 b
3 3 0 c
4 3 1 c
5 3 2 c
Is there a way to do this elegantly and efficiently with Pandas? Or is there no way to avoid looping?
First get repeated rows on incrementor using repeat and .loc
In [1029]: dff = df.loc[df.index.repeat(df.incrementor.astype(int))]
Then, modify incremented with cumcount
In [1030]: dff.assign(
incremented=dff.incremented + dff.groupby(level=0).incremented.cumcount()
).reset_index(drop=True)
Out[1030]:
incrementor incremented other
0 2 5 a
1 2 6 a
2 1 20 b
3 3 0 c
4 3 1 c
5 3 2 c
Details
In [1031]: dff
Out[1031]:
incrementor incremented other
0 2 5 a
0 2 5 a
1 1 20 b
2 3 0 c
2 3 0 c
2 3 0 c
In [1032]: dff.groupby(level=0).incremented.cumcount()
Out[1032]:
0 0
0 1
1 0
2 0
2 1
2 2
dtype: int64
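
Putting the two steps together, a self-contained sketch of the same approach:
import pandas as pd

df = pd.DataFrame([[2, 1, 3], [5, 20, 0], ['a', 'b', 'c']]).transpose()
df.columns = ['incrementor', 'incremented', 'other']

# Repeat each row 'incrementor' times, then add a per-original-row counter
# (0, 1, 2, ...) so the repeated 'incremented' values count upwards.
dff = df.loc[df.index.repeat(df['incrementor'].astype(int))]
out = dff.assign(
    incremented=dff['incremented'] + dff.groupby(level=0).cumcount()
).reset_index(drop=True)
print(out)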

How to add string to all values in a column of pandas DataFrame

Say you have a DataFrame with columns:
col_1 col_2
1 a
2 b
3 c
4 d
5 e
How would you change the values of col_2 so that new value = current value + 'new'?
Use +:
df.col_2 = df.col_2 + 'new'
print (df)
col_1 col_2
0 1 anew
1 2 bnew
2 3 cnew
3 4 dnew
4 5 enew
Thanks hooy for another solution:
df.col_2 += 'new'
Or assign:
df = df.assign(col_2 = df.col_2 + 'new')
print (df)
col_1 col_2
0 1 anew
1 2 bnew
2 3 cnew
3 4 dnew
4 5 enew
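
If the column held numbers instead of strings (like col_1 here), it would need to be cast to str first; a small sketch:
# Cast the numeric column to string so + concatenates instead of raising a TypeError.
df.col_1.astype(str) + 'new'   # gives '1new', '2new', ...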
