How to replace rows with character value by integers in a column in pandas dataframe? - python-3.x

I am working with a large dataset. Some columns should contain only integer values, but because the data is uncleaned, a few rows contain strings that mix characters with digits. The small pandas dataframe below illustrates the problem.
I have the following dataframe:
Index  l1  l2    l3
0      1   123   23
1      2   Z3V   343
2      3   321   21
3      4   AZ34  345
4      5   432   3
The code that builds it:
import pandas as pd
l1,l2,l3 = [1,2,3,4,5], [123, 'Z3V', 321, 'AZ34', 432], [23,343,21,345,3]
data = pd.DataFrame(zip(l1,l2,l3), columns=['l1', 'l2', 'l3'])
print(data)
As you can see, column 'l2' contains strings that mix characters with digits at row indices 1 and 3. I want to find such rows in this column and print them. I then want to replace each such value with an integer, using a different integer for each distinct string: for example, every instance of 'Z3V' becomes 100 and every instance of 'AZ34' becomes 101. If 'Z3V' occurs again later in column 'l2', it should be replaced with 100 there too.
Expected output :
Index  l1  l2   l3
0      1   123  23
1      2   100  343
2      3   321  21
3      4   101  345
4      5   432  3
As you can see, the two values that contained characters have been replaced with 100 and 101 respectively.
How can I get this expected output?

You could do:
import pandas as pd
import numpy as np
# setup
l1, l2, l3 = [1, 2, 3, 4, 5, 6], [123, 'Z3V', 321, 'AZ34', 432, 'Z3V'], [23, 343, 21, 345, 3, 3]
data = pd.DataFrame(zip(l1, l2, l3), columns=['l1', 'l2', 'l3'])
# find all non numeric values across the whole DataFrame
mask = data.applymap(np.isreal)
rows, cols = np.where(~mask)
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
# apply the replacements
res = data.replace(replacements)
print(res)
Output
l1 l2 l3
0 1 123 23
1 2 101 343
2 3 321 21
3 4 100 345
4 5 432 3
5 6 101 3
Note that I added an extra row (a repeated 'Z3V') to verify the desired behaviour, so the data DataFrame now looks like:
l1 l2 l3
0 1 123 23
1 2 Z3V 343
2 3 321 21
3 4 AZ34 345
4 5 432 3
5 6 Z3V 3
By changing this line:
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
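you can change the replacement values as you see fit. Because np.unique returns its values sorted, 'AZ34' is assigned 100 and 'Z3V' is assigned 101 here.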
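For the first part of the question (finding and printing the offending rows in a single column), a minimal sketch could use pd.to_numeric with errors='coerce', which turns anything that cannot be parsed as a number into NaN. The start value 100 and the choice to number the strings in order of first appearance are assumptions taken from the expected output in the question, not part of the answer above.

```python
import pandas as pd

l1, l2, l3 = [1, 2, 3, 4, 5, 6], [123, 'Z3V', 321, 'AZ34', 432, 'Z3V'], [23, 343, 21, 345, 3, 3]
data = pd.DataFrame(zip(l1, l2, l3), columns=['l1', 'l2', 'l3'])

# values in 'l2' that cannot be parsed as numbers become NaN
bad = pd.to_numeric(data['l2'], errors='coerce').isna()

# print the offending rows
print(data[bad])

# number the distinct strings in order of first appearance,
# starting at 100 (assumed start value)
uniques = data.loc[bad, 'l2'].unique()
mapping = {value: 100 + k for k, value in enumerate(uniques)}

# replace only the offending cells, then convert the column to integers
res = data.copy()
res['l2'] = res['l2'].map(lambda v: mapping.get(v, v)).astype(int)
print(res)
```

With this ordering 'Z3V' maps to 100 and 'AZ34' to 101, and the repeated 'Z3V' in the last row receives 100 again.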

Related

How to split dataframe by column value condition, pandas

I want to split a dataframe into different lists based on a condition on a column value.
Here is an example dataframe.
df=pd.DataFrame({'flag_1':[1,2,3,1,2,500,498,495,1,1,1,1,1,500,440,430,2,3,4,4],'dd':[1,1,1,7,7,7,8,8,8,1,1,1,7,7,7,8,8,8,5,7]})
df_out
df_out=pd.DataFrame({'flag_1':[500,498,495,500,440,430],'dd':[7,8,8,7,7,8]})
Try this:
grp = (df['flag_1']<500).cumsum()
pd.concat({n: g[1:] for n, g in tuple(df.groupby(grp)) if len(g) > 1}, ignore_index=True)
Output:
flag_1 dd
0 500 7
1 598 8
2 595 8
3 500 7
4 540 7
5 5430 8
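The question does not state the condition explicitly, so the following is only a sketch under an assumption: rows are kept when flag_1 exceeds a cut-off, and each consecutive run of kept rows becomes its own piece. The name THRESHOLD and its value 400 are hypothetical, chosen so that the question's sample data reproduces its df_out.

```python
import pandas as pd

df = pd.DataFrame({'flag_1': [1, 2, 3, 1, 2, 500, 498, 495, 1, 1, 1, 1, 1, 500, 440, 430, 2, 3, 4, 4],
                   'dd': [1, 1, 1, 7, 7, 7, 8, 8, 8, 1, 1, 1, 7, 7, 7, 8, 8, 8, 5, 7]})

THRESHOLD = 400  # hypothetical cut-off, not stated in the question

keep = df['flag_1'] > THRESHOLD
# a new run starts whenever the mask changes value
run_id = keep.ne(keep.shift()).cumsum()

# one DataFrame per consecutive run of kept rows
pieces = [g for _, g in df[keep].groupby(run_id[keep])]

# concatenating the pieces gives a single frame
df_out = pd.concat(pieces, ignore_index=True)
print(df_out)
```

Here pieces holds the separate runs, and df_out is their concatenation.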

Python for-loop to change row value based on a condition works correctly but does not change the values on pandas dataframe?

I am just getting into Python, and I am trying to write a for-loop that iterates over every row, randomly selects two columns on each iteration based on a given condition, and changes their values. The loop runs without errors; however, the dataframe itself is not changed.
A reproducible example:
df= pd.DataFrame({'A': [10,40,10,20,10],
'B': [10,10,50,40,50],
'C': [10,20,10,10,10],
'D': [10,30,10,10,50],
'E': [10,10,40,10,10],
'F': [2,3,2,2,3]})
df:
A B C D E F
0 10 10 10 10 10 2
1 40 10 20 30 10 3
2 10 50 10 10 40 2
3 20 40 10 10 10 2
4 10 50 10 50 10 3
This is my for-loop. It iterates over all rows and, where the value in column F equals 2, randomly selects two columns whose value is 10 and adds 100 to them.
for index, i in df.iterrows():
    if i['F'] == 2:
        i[i==10].sample(2, axis=0)+100
        print(i[i==10].sample(2, axis=0)+100)
This is the output of the loop:
E 110
C 110
Name: 0, dtype: int64
C 110
D 110
Name: 2, dtype: int64
C 110
D 110
Name: 3, dtype: int64
This is what the dataframe is expected to look like:
df:
A B C D E F
0 10 10 110 10 110 2
1 40 10 20 30 10 3
2 10 50 110 110 40 2
3 20 40 110 110 10 2
4 10 50 10 50 10 3
However, the columns in the dataframe are not changed. Any idea what's going wrong?
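This line:
i[i==10].sample(2, axis=0)+100
.sample returns a new Series, and the result of adding 100 is never assigned to anything, so the original dataframe (df) is not updated at all. In addition, the row yielded by iterrows may be a copy rather than a view, so it is safer to write the change back through df.loc.
Try this:
for index, i in df.iterrows():
    if i['F'] == 2:
        cond = (i == 10)
        # You can only sample 2 columns if at
        # least 2 columns meet the condition
        if cond.sum() >= 2:
            idx = i[cond].sample(2).index
            # write to df itself; changing i might only change a copy
            df.loc[index, idx] += 100
            print(df.loc[index, idx])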
You should not modify the original df in place. Make a copy and iterate:
df2 = df.copy()
for index, i in df.iterrows():
    if i['F'] == 2:
        s = i[i==10].sample(2, axis=0)+100
        df2.loc[index,i.index.isin(s.index)] = s
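The two answers can be combined into one sketch: read rows from the original frame, check that at least two candidates exist, and write the result into a copy. The random_state argument is an assumption added only to make the run repeatable; it can be dropped for truly random picks.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 40, 10, 20, 10],
                   'B': [10, 10, 50, 40, 50],
                   'C': [10, 20, 10, 10, 10],
                   'D': [10, 30, 10, 10, 50],
                   'E': [10, 10, 40, 10, 10],
                   'F': [2, 3, 2, 2, 3]})

df2 = df.copy()  # write into a copy, read from the original
for index, row in df.iterrows():
    if row['F'] != 2:
        continue
    tens = row[row == 10]                # entries of this row equal to 10
    if len(tens) >= 2:                   # sampling 2 needs at least 2 candidates
        # random_state only makes the example repeatable (assumed, optional)
        chosen = tens.sample(2, random_state=0).index
        df2.loc[index, chosen] += 100

print(df2)
```

Each row with F equal to 2 ends up with exactly two values of 110, while the other rows are untouched.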

Indexing based on multiple columns

I'm new to python and below mentioned is an ongoing data engineering issue I'm currently trying to resolve.
Table structure
Data:
Index 1 :
Is sequential and would increment by 1 as rows are added.
Index 2 : The problem <<-- To tabulate index 2
This is dependent on values stored in the columns [A,B,C,D,E]. If the value remains the same, we need to assign a single index for these rows.
eg: Rows 1,2,3 have 567 as a value for A,B,C respectively.
Therefore, index 2 is 100 for these 3 rows.
Record types :
1 - A
2 - B
3 - C
4 - D
5 - E
Code
data = [(100, 100, 1 , 567,'','','','') ,
(101, 100, 2 , '',567,'','','') ,
(102, 100, 3 , '','',567,'','') ,
(103, 101, 3 , '','',568,'','') ,
(104, 101, 4 , '','','',568,'') ,
(105, 101, 5 , '','','','',568) ]
#Creates the data frame
df = pd.DataFrame( data, columns = ['index1' , 'index2', 'record_type' , 'A','B','C','D','E'], dtype=str)
#Combines columns A,B,C,D,E and adds a $ where ever it is null in order to stack these values
df['combined'] = df[['A', 'B', 'C','D','E']].stack().groupby(level=0).agg('$'.join)
# Cleans the column 'combined'
df['combined_cleaned']= df['combined'].replace({r'\$':''}, regex = True)
Attempting to use the combined_cleaned column to calculate index2.
Not sure if this is the right approach, open to suggestions.
A few assumptions here, but seem to fit your problem.
If there is only ever 1 value over those columns for each row then you can take the max along the row, and then find consecutive groups checking whether that Series is equal to itself, shifted.
We add 99 because by definition the counting will start at 1, but you seem to want 100.
val_cols = ['A', 'B', 'C', 'D', 'E']
s = df[val_cols].apply(pd.to_numeric).max(1)
#0 567.0
#1 567.0
#2 567.0
#3 568.0
#4 568.0
#5 568.0
#dtype: float64
df['index2'] = s.ne(s.shift()).cumsum() + 99
print(df)
  index1 record_type    A    B    C    D    E  index2
0    100           1  567                         100
1    101           2       567                    100
2    102           3            567               100
3    103           3            568               101
4    104           4                 568          101
5    105           5                      568     101
If instead of a single value, 'record_type' points to the appropriate column you can use numpy indexing.
import numpy as np
arr = df[val_cols].to_numpy()
idx = df['record_type'].astype(int).to_numpy()
vals = arr[np.arange(len(arr)), idx-1]
#array(['567', '567', '567', '568', '568', '568'], dtype=object)
The combined_cleaned column could be generated directly using
cols = ['A', 'B', 'C','D','E']
df[cols].replace('', np.nan).apply(lambda x: x.dropna().item(), axis=1)
You can also try with stack followed by factorize:
cols = ['A', 'B', 'C','D','E']
s = pd.factorize(df[cols].replace('',np.nan).stack())[0]
df['index2_new'] = int(df['index1'].iat[0]) + s
print(df)
  index1 index2 record_type    A    B    C    D    E  index2_new
0    100    100           1  567                             100
1    101    100           2       567                        100
2    102    100           3            567                   100
3    103    101           3            568                   101
4    104    101           4                 568              101
5    105    101           5                      568         101
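As a sketch of how the record_type lookup and the shift/cumsum step fit together, the example below picks the value that record_type points to in each row and then numbers consecutive runs of equal values. It assumes that record types 1 to 5 map to columns A to E in that order and that index2 should start at 100, as in the question; the frame is rebuilt here without the existing index2 column.

```python
import numpy as np
import pandas as pd

data = [(100, 1, 567, '', '', '', ''),
        (101, 2, '', 567, '', '', ''),
        (102, 3, '', '', 567, '', ''),
        (103, 3, '', '', 568, '', ''),
        (104, 4, '', '', '', 568, ''),
        (105, 5, '', '', '', '', 568)]
df = pd.DataFrame(data, columns=['index1', 'record_type', 'A', 'B', 'C', 'D', 'E'])

val_cols = ['A', 'B', 'C', 'D', 'E']
arr = df[val_cols].to_numpy()
# record_type 1..5 -> column position 0..4 (assumed mapping)
pos = df['record_type'].astype(int).to_numpy() - 1

# pick, for every row, the value in the column its record_type points to
picked = pd.Series(arr[np.arange(len(df)), pos], index=df.index)

# a new group starts whenever the picked value differs from the previous row
df['index2'] = picked.ne(picked.shift()).cumsum() + 99
print(df[['index1', 'record_type', 'index2']])
```

This only groups consecutive rows; if the same value could reappear later and should reuse its earlier index, the factorize approach is the better fit.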

loops application in dataframe to find output

I have the following data:
dict={'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238]...................}
and
df= pd.DataFrame(dict)
In this way I have 20 columns with 5 numerical entries in each column.
I want a new column whose values follow this logic:
0 A[0]*B[0]+A[0]*C[0] + A[0]*D[0].......
1 A[1]*B[1]+A[1]*C[1] + A[1]*D[1].......
2 A[2]*B[2]+A[2]*C[2] + A[2]*D[2].......
I tried the following, but I cannot write out all 20 columns by hand, so I would like to know how to use a loop to get the desired output:
lst=[]
for i in range(0,5):
    j=df.A[i]*df.B[i]+ df.A[i]*df.C[i]+.......
    lst.append(j)
    i=i+1
A potential solution is the following. I am only using the example you posted, but it works for more columns too. Your data is df:
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column, D, by first selecting every column except A:
add = df.loc[:, df.columns != 'A']
and then multiplying column A by the row-wise sum of those columns, which equals the sum of the products of A with each column in add:
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200
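To see why the factored form matches the logic requested in the question, the sketch below computes both versions on the three-column example and compares them. The column name result is chosen here only for illustration; the same code works unchanged with 20 columns.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 233, 29, 2],
                   'C': [10, 20, 3040, 230, 238]})

others = df.drop(columns='A')

# literal form of the requested logic: A*B + A*C + ... for every other column
explicit = others.mul(df['A'], axis=0).sum(axis=1)

# factored form used in the answer: A * (B + C + ...)
factored = df['A'] * others.sum(axis=1)

df['result'] = explicit
print(df)
print((explicit == factored).all())
```

No explicit loop over rows or columns is needed, because both mul and sum operate on whole columns at once.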

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns are representative of multiple encounters with the customer, with an underscore and the number encounter. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro_2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split each name into a tuple of (column_name, prefix, number), sort on the 2nd and 3rd items of the tuple, then keep the first item.
# adjust "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
Luckily, there is a one-liner in pandas that sorts the columns by name:
df = df.reindex(sorted(df.columns), axis=1)
For example, let's say you had this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
'ID': [2, 0, 0, 0],
'Prod3': [10, 2, 1, 8],
'Prod1': [2, 4, 8, 0],
'Prod_1': [2, 4, 8, 0],
'Pre7': [2, 0, 0, 0],
'Pre2': [10, 2, 1, 8],
'Pre_2': [10, 2, 1, 8],
'Pre_9': [10, 2, 1, 8]}
)
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then use
df = df.reindex(sorted(df.columns), axis=1)
The dataframe will then look like:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
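As you can see, this is a plain alphabetical sort of the column names. Within each prefix, the names without an underscore come first, because digits sort before '_', followed by the names with an underscore ordered by the characters after it. Because the comparison is on strings, multi-digit numbers are not ordered numerically (for example, 'pro_10' would come before 'pro_2'), and the prefixes themselves are also sorted alphabetically.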
You need to split your column names on '_' and then convert the number to int:
import numpy as np
import pandas as pd
c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']
df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
For the second case, you need human (natural) sorting:
import re
def atoi(text):
    return int(text) if text.isdigit() else text
def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]
df.reindex(sorted(df.columns, key=natural_keys), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name:
cols_fixed = df.columns[:3].tolist() # change index no based on your df
cols_variable = df.columns[3:].tolist() # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x : int(x.split('_')[1])) # sort by the number after '_'
cols_new = cols_fixed + cols_variable # both are plain lists, so + concatenates them
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND number part of the column names:
cols_fixed = df.columns[:3].tolist() # change index no based on your df
cols_variable = df.columns[3:].tolist() # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x : (x.split('_')[0], int(x.split('_')[1]))) # sort by prefix, then by number
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])
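Since the number of leading customer columns and encounter columns grows over time, a sketch that detects the numbered columns by pattern, rather than by position, may be easier to maintain. It assumes every encounter column ends in an underscore followed by digits and that all other columns should stay in front; the names PATTERN and split_name are introduced here for illustration.

```python
import re
import pandas as pd

cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
df = pd.DataFrame([range(len(cols))], columns=cols)

PATTERN = re.compile(r'^(.*)_(\d+)$')  # prefix, underscore, trailing number

def split_name(name):
    """Return (prefix, number) for names like 'pro_10', or None without a numeric suffix."""
    m = PATTERN.match(name)
    return (m.group(1), int(m.group(2))) if m else None

fixed = [c for c in df.columns if split_name(c) is None]         # id, dob, gender
numbered = [c for c in df.columns if split_name(c) is not None]

# group by encounter number; the stable sort keeps the original prefix order
by_number = fixed + sorted(numbered, key=lambda c: split_name(c)[1])
# group by prefix first, then by encounter number
by_prefix = fixed + sorted(numbered, key=split_name)

print(df[by_number].columns.tolist())
print(df[by_prefix].columns.tolist())
```

Because Python's sort is stable, sorting on the number alone leaves the prefixes within each encounter in the order they first appear in the frame.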
