Indexing based on multiple columns - python-3.x

I'm new to Python, and below is an ongoing data engineering issue I'm currently trying to resolve.
Table structure
Data:
Index 1:
Is sequential and increments by 1 as rows are added.
Index 2: The problem <<-- how to tabulate index 2
This depends on the values stored in the columns [A,B,C,D,E]. If the value remains the same, we need to assign a single index to those rows.
e.g. rows 1, 2, 3 have 567 as the value for A, B, C respectively.
Therefore, index 2 is 100 for these 3 rows.
Record types :
1 - A
2 - B
3 - C
4 - D
5 - E
Code
import pandas as pd

data = [(100, 100, 1, 567, '', '', '', ''),
        (101, 100, 2, '', 567, '', '', ''),
        (102, 100, 3, '', '', 567, '', ''),
        (103, 101, 3, '', '', 568, '', ''),
        (104, 101, 4, '', '', '', 568, ''),
        (105, 101, 5, '', '', '', '', 568)]
# Creates the data frame
df = pd.DataFrame(data, columns=['index1', 'index2', 'record_type', 'A', 'B', 'C', 'D', 'E'], dtype=str)
# Combines columns A,B,C,D,E into one value per row, joined with '$' so the values can be stacked
df['combined'] = df[['A', 'B', 'C', 'D', 'E']].stack().groupby(level=0).agg('$'.join)
# Cleans the column 'combined'
df['combined_cleaned'] = df['combined'].replace({r'\$': ''}, regex=True)
I'm attempting to use the combined_cleaned column to calculate index2.
Not sure if this is the right approach; open to suggestions.

A few assumptions here, but they seem to fit your problem.
If there is only ever one value across those columns for each row, you can take the max along the row and then find consecutive groups by checking whether that Series is equal to itself shifted.
We add 99 because by definition the counting will start at 1, but you seem to want 100.
val_cols = ['A', 'B', 'C', 'D', 'E']
s = df[val_cols].apply(pd.to_numeric, errors='coerce').max(axis=1)
#0 567.0
#1 567.0
#2 567.0
#3 568.0
#4 568.0
#5 568.0
#dtype: float64
df['index2'] = s.ne(s.shift()).cumsum() + 99
print(df)
  index1 record_type    A    B    C    D    E  index2
0    100           1  567                        100
1    101           2       567                   100
2    102           3            567              100
3    103           3            568              101
4    104           4                 568         101
5    105           5                      568    101
If, instead of a single value, 'record_type' points to the appropriate column, you can use numpy indexing.
import numpy as np
arr = df[val_cols].to_numpy()
idx = df['record_type'].astype(int).to_numpy()
vals = arr[np.arange(len(arr)), idx-1]
#array(['567', '567', '567', '568', '568', '568'], dtype=object)
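From here, a minimal sketch (my continuation, not part of the original answer) of turning vals into index2 with the same shift-and-compare idea used above:
# group consecutive equal values in `vals`, offset so the count starts at 100
s = pd.Series(vals, index=df.index)
df['index2'] = s.ne(s.shift()).cumsum() + 99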

The combined_cleaned column could be generated directly using
cols = ['A', 'B', 'C','D','E']
df[cols].replace('', np.nan).apply(lambda x: x.dropna().item(), axis=1)

You can also try with stack followed by factorize:
cols = ['A', 'B', 'C','D','E']
s = pd.factorize(df[cols].replace('',np.nan).stack())[0]
df['index2_new'] = int(df['index1'].iat[0]) + s
print(df)
  index1 index2 record_type    A    B    C    D    E  index2_new
0    100    100           1  567                             100
1    101    100           2       567                        100
2    102    100           3            567                   100
3    103    101           3            568                   101
4    104    101           4                 568              101
5    105    101           5                      568         101
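One difference worth noting (my note, not from the answer): factorize reuses the same code whenever a value reappears, while the shift/cumsum approach starts a new group each time the value changes, so the two only agree when equal values are consecutive. A tiny sketch of the distinction:
s = pd.Series(['567', '568', '567'])
print(pd.factorize(s)[0])                 # [0 1 0] -> repeated value keeps its code
print(s.ne(s.shift()).cumsum().tolist())  # [1, 2, 3] -> every change starts a new group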

Related

Pandas DataFrame: Same operation on multiple sets of columns

I want to do the same operation on multiple sets of columns of a DataFrame.
Since "for-loops" are frowned upon I'm searching for a decent alternative.
An example:
df = pd.DataFrame({
'a': [1, 11, 111],
'b': [222, 22, 2],
'a_x': [10, 80, 30],
'b_x': [20, 20, 60],
})
This is a simple for-loop approach. It's short and quite readable.
cols = ['a', 'b']
for col in cols:
    df[f'{col}_res'] = df[[col, f'{col}_x']].min(axis=1)
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
This is an alternative (w/o for-loop), but I feel that the additional complexity is not really for the better.
cols = ['a', 'b']

def res_df(df, col, name):
    res = pd.Series(
        df[[col, f'{col}_x']].min(axis=1), index=df.index, name=name)
    return res

res = [res_df(df, col, f'{col}_res') for col in cols]
df = pd.concat([df, pd.concat(res, axis=1)], axis=1)
Does anyone have a better/more pythonic solution?
Thanks!
UPDATE 1
Inspired by the proposal from mozway I find the following solution quite appealing.
Imho it's short, readable and generic, since the particular operation can be swapped into a function and the list comprehension applies the function to the given sets of columns.
import numpy as np

def operation(s1, s2):
    # fill in any operation on pandas Series
    # e.g. res = s1 * s2 / (s1 + s2)
    res = np.minimum(s1, s2)
    return res

df = df.join(
    [operation(df[f'{col}'], df[f'{col}_x']).rename(f'{col}_res') for col in cols]
)
You can use numpy.minimum after setting the arrays to identical column names:
cols = ['a', 'b']
cols2 = [f'{x}_x' for x in cols]
df = df.join(np.minimum(df[cols],
                        df[cols2].set_axis(cols, axis=1))
               .add_suffix('_res'))
output:
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
or, using rename as suggested in the other answer:
cols = ['a', 'b']
cols2 = {f'{x}_x': x for x in cols}
df = df.join(np.minimum(df[cols],
                        df[list(cols2)].rename(columns=cols2))
               .add_suffix('_res'))
One idea is to rename the column names using a dictionary, select the columns given by the list cols, and then group by column name and aggregate with min, sum, max, or a custom function:
cols = ['a', 'b']
suffix = '_x'
d = {f'{x}{suffix}':x for x in cols}
print (d)
{'a_x': 'a', 'b_x': 'b'}
print (df.rename(columns=d)[cols])
a a b b
0 1 10 222 20
1 11 80 22 20
2 111 30 2 60
df1 = df.rename(columns=d)[cols].groupby(axis=1,level=0).min().add_suffix('_res')
print (df1)
a_res b_res
0 1 20
1 11 20
2 30 2
Last, add the result to the original DataFrame:
df = df.join(df1)
print (df)
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
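A small caveat from me (not part of the original answer): DataFrame.groupby(axis=1) is deprecated in recent pandas releases, so an equivalent that avoids it is to transpose, group on the index, and transpose back:
# same result without axis=1 (a sketch for newer pandas)
df1 = (df.rename(columns=d)[cols]
         .T.groupby(level=0).min()
         .T.add_suffix('_res'))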

Python for-loop to change row value based on a condition works correctly but does not change the values on pandas dataframe?

I am just getting into Python, and I am trying to make a for-loop that iterates over every row, randomly selects two columns on each iteration based on a given condition, and changes their values. The for-loop runs without any problems; however, the changes are not reflected in the dataframe.
A reproducible example:
df= pd.DataFrame({'A': [10,40,10,20,10],
'B': [10,10,50,40,50],
'C': [10,20,10,10,10],
'D': [10,30,10,10,50],
'E': [10,10,40,10,10],
'F': [2,3,2,2,3]})
df:
A B C D E F
0 10 10 10 10 10 2
1 40 10 20 30 10 3
2 10 50 10 10 40 2
3 20 40 10 10 10 2
4 10 50 10 50 10 3
This is my for-loop; it iterates over all rows and checks whether the value in column F equals 2; if so, it randomly selects two columns with value 10 and changes them to 100.
for index, i in df.iterrows():
    if i['F'] == 2:
        i[i==10].sample(2, axis=0)+100
        print(i[i==10].sample(2, axis=0)+100)
This is the output of the loop:
E 110
C 110
Name: 0, dtype: int64
C 110
D 110
Name: 2, dtype: int64
C 110
D 110
Name: 3, dtype: int64
This is what the dataframe is expected to look like:
df:
A B C D E F
0 10 10 110 10 110 2
1 40 10 20 30 10 3
2 10 50 110 110 40 2
3 20 40 110 110 10 2
4 10 50 10 50 10 3
However, the values in the dataframe do not change. Any idea what's going wrong?
This line:
i[i==10].sample(2, axis=0)+100
.sample returns a new object, and the result of the expression is never assigned back, so the original dataframe (df) is not updated at all.
Try this:
for index, i in df.iterrows():
    if i['F'] == 2:
        cond = (i == 10)
        # You can only sample 2 columns if there are at
        # least 2 columns meeting the condition
        if cond.sum() >= 2:
            idx = i[cond].sample(2).index
            # iterrows yields copies, so write back into df rather than into i
            df.loc[index, idx] += 100
            print(df.loc[index, idx])
You should not modify the original df in place. Make a copy and iterate:
df2 = df.copy()
for index, i in df.iterrows():
    if i['F'] == 2:
        s = i[i==10].sample(2, axis=0)+100
        df2.loc[index, i.index.isin(s.index)] = s
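Note (my addition): because .sample is random, each run picks different columns. If you want reproducible runs while testing, sample accepts a random_state seed; for example, the sampling line inside the loop above could become:
# reproducible variant of the sampling step (42 is an arbitrary seed)
s = i[i == 10].sample(2, axis=0, random_state=42) + 100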

How to replace character values with integers in a column of a pandas dataframe?

I am working on a large dataset. The problem I am facing is that there are columns that should contain only integer values; however, as the dataset is uncleaned, a few rows have 'characters' along with the integers. Here I am trying to illustrate the problem with a small pandas dataframe example.
I have the following dataframe:
Index  l1  l2    l3
0      1   123   23
1      2   Z3V   343
2      3   321   21
3      4   AZ34  345
4      5   432   3
The dataframe code:
l1,l2,l3 = [1,2,3,4,5], [123, 'Z3V', 321, 'AZ34', 432], [23,343,21,345,3]
data = pd.DataFrame(zip(l1,l2,l3), columns=['l1', 'l2', 'l3'])
print(data)
Here, as you can see, column 'l2' has 'characters' along with integers at row indexes 1 and 3. I want to find such rows in this particular column and print them. Later I want to replace them with integer values like 100 or similar; the replacement number will differ per value, for example replacing instances of 'Z3V' with 100 and instances of 'AZ34' with 101. My point is to replace the character-containing values with integers. If 'Z3V' occurs again in the 'l2' column, it should again be replaced with 100.
Expected output:
Index  l1  l2   l3
0      1   123  23
1      2   100  343
2      3   321  21
3      4   101  345
4      5   432  3
As you can see, the two instances where there were characters have been replaced with 100 and 101, respectively.
How do I get this expected output?
You could do:
import pandas as pd
import numpy as np
# setup
l1, l2, l3 = [1, 2, 3, 4, 5, 6], [123, 'Z3V', 321, 'AZ34', 432, 'Z3V'], [23, 343, 21, 345, 3, 3]
data = pd.DataFrame(zip(l1, l2, l3), columns=['l1', 'l2', 'l3'])
# find all non numeric values across the whole DataFrame
mask = data.applymap(np.isreal)
rows, cols = np.where(~mask)
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
# apply the replacements
res = data.replace(replacements)
print(res)
Output
l1 l2 l3
0 1 123 23
1 2 101 343
2 3 321 21
3 4 100 345
4 5 432 3
5 6 101 3
Note that I added an extra row to verify the desired behaviour; now the data DataFrame looks like:
l1 l2 l3
0 1 123 23
1 2 Z3V 343
2 3 321 21
3 4 AZ34 345
4 5 432 3
5 6 Z3V 3
By changing this line:
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
you can change the replacement values as you see fit.
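If only a single column such as l2 needs cleaning, a smaller variant (my sketch, not from the answer) is to find the non-numeric entries with pd.to_numeric and build the mapping from those alone:
# values in l2 that cannot be parsed as numbers
bad = data['l2'][pd.to_numeric(data['l2'], errors='coerce').isna()]
replacements = {k: i for i, k in enumerate(pd.unique(bad), 100)}  # first-seen order
data['l2'] = data['l2'].replace(replacements)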

Add two more columns to csv file based on matching values of other csv

I have two csv files
csv1:
csv2:
What I need to do is:
Get each value of column c of csv1 and match it with column number of csv2.
If any row of csv2 matches that number, then add a new column c_text to csv1 that contains the value of the text column for the matching row of csv2.
Repeat the above process for column d of csv1 and add a new column d_text to csv1.
Here is what I need at the end:
I'm new to pandas. How can I do this using pandas?
You can use apply():
csv1['c_text'] = csv1['c'].apply(lambda x: csv2[csv2['number']==x]['text'].values[0])
csv1['d_text'] = csv1['d'].apply(lambda x: csv2[csv2['number']==x]['text'].values[0])
Yields:
a b c d c_text d_text
0 1 4 101 201 val1 val4
1 2 5 105 202 val2 val5
2 3 6 107 203 val3 val6
In terms of an option using merge(), this will yield the same output:
csv1 = csv1.merge(csv2, left_on='c', right_on='number', how='left')
csv1 = csv1.merge(csv2, left_on='d', right_on='number', how='left')
csv1 = csv1.rename(columns={'text_x': 'c_text', 'text_y': 'd_text'})[['a','b','c','d','c_text','d_text']]
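Another option I sometimes use (not from the answer) is to build a lookup Series from csv2 and use map, which avoids filtering csv2 once per row and simply yields NaN where there is no match:
# map each value of c and d through a number -> text lookup
lookup = csv2.set_index('number')['text']
csv1['c_text'] = csv1['c'].map(lookup)
csv1['d_text'] = csv1['d'].map(lookup)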
Here's something that will do the trick:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c':[101, 105, 107], 'd':[201, 202, 203]})
df2 = pd.DataFrame({'number': [101, 105, 107, 201, 202, 203, 205, 2010, 310], 'text': ["val_{x}".format(x=y + 1) for y in range(9)]})
df1
a b c d
0 1 4 101 201
1 2 5 105 202
2 3 6 107 203
df2
number text
0 101 val_1
1 105 val_2
2 107 val_3
3 201 val_4
4 202 val_5
5 203 val_6
6 205 val_7
7 2010 val_8
8 310 val_9
merged = df1.merge(df2, left_on='c', right_on='number', how='left')
merged
a b c d number text
0 1 4 101 201 101 val_1
1 2 5 105 202 105 val_2
2 3 6 107 203 107 val_3
output = merged.merge(df2, left_on='d', right_on='number', how='left')[['a', 'b', 'c', 'd', 'text_x', 'text_y']]
output
a b c d text_x text_y
0 1 4 101 201 val_1 val_4
1 2 5 105 202 val_2 val_5
2 3 6 107 203 val_3 val_6
What you want is the merge functionality of Pandas. Assuming you have imported the Pandas module with the shorthand name like import pandas as pd, then:
csv1_with_text_col = pd.merge(csv1, csv2, left_on='c', right_on='number', how='left')
This will give you a new dataframe, csv1_with_text_col, with the columns from csv2 merged into csv1 where csv1['c'] == csv2['number']. Additionally, by specifying how='left', only rows from the left dataframe, csv1, will be kept.
You can then merge this new dataframe, csv1_with_text_col, with csv2 again but with left_on='d'.
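A minimal sketch of that second merge plus a rename (the text_x/text_y names come from pandas' default suffixes for overlapping columns):
# second merge on column 'd', then tidy up the column names
result = (csv1_with_text_col
          .merge(csv2, left_on='d', right_on='number', how='left')
          .rename(columns={'text_x': 'c_text', 'text_y': 'd_text'})
          [['a', 'b', 'c', 'd', 'c_text', 'd_text']])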

Summing up certain rows in a pandas dataframe

I have a pandas dataframe with 1000 rows and 10 columns. I am looking to aggregate rows 100-1000 and replace them with just one row, where the index value is '>100' and the column values are the sums of rows 100-1000 for each column. Any ideas on a simple way of doing this? Thanks in advance.
Say I have the below
a b c
0 1 10 100
1 2 20 100
2 3 60 100
3 5 80 100
and I want it replaced with
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
You could use loc (ix in older pandas), but it shows a SettingWithCopyWarning:
ind = 1
mask = df.index > ind
df1 = df[~mask]
df1.loc['>1', :] = df[mask].sum()
In [69]: df1
Out[69]:
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
To set it without the warning you could do it with pd.concat. Maybe not elegant due to the two transposes, but it works:
ind = 1
mask = df.index > ind
df1 = pd.concat([df[~mask].T, df[mask].sum()], axis=1).T
df1.index = df1.index.tolist()[:-1] + ['>{}'.format(ind)]
In [36]: df1
Out[36]:
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
Some demonstrations:
In [37]: df.index > ind
Out[37]: array([False, False, True, True], dtype=bool)
In [38]: df[mask].sum()
Out[38]:
a 8
b 140
c 200
dtype: int64
In [40]: pd.concat([df[~mask].T, df[mask].sum()], axis=1).T
Out[40]:
a b c
0 1 10 100
1 2 20 100
0 8 140 200
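As a side note (my addition, not from the answers): the SettingWithCopyWarning in the first approach can also be avoided by taking an explicit copy before adding the summary row:
ind = 1
mask = df.index > ind
df1 = df[~mask].copy()                       # explicit copy, no warning
df1.loc['>{}'.format(ind)] = df[mask].sum()  # append the summed row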
