pandas move to correspondent column based on value of other column - python-3.x

Im trying to move the f1_am, f2_am, f3_am to the correspondent column based on the values of f1_ty, f2_ty, f3_ty
I started adding new columns to the dataframe based on unique values from the _ty using sets, but I'm trying to figure it out how to move the _am values to were it belongs
Looked for the option of group by and pivot but the result exploded my mind....
I would appreciate some guidance.
Below the code.
import pandas as pd
import numpy as np
data = {
'mem_id': ['A', 'B', 'C', 'A', 'B', 'C']
, 'date_inf': ['01/01/2019', '01/01/2019', '01/01/2019', '02/01/2019', '02/01/2019', '02/01/2019']
, 'f1_ty': ['ABC', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI']
, 'f1_am': [100, 20, 57, 44, 15, 10]
, 'f2_ty': ['DEF', 'DEF', 'DEF', 'GHI', 'ABC', 'XYZ']
, 'f2_am':[20, 30, 45, 66, 14, 21]
, 'f3_ty': ['XYZ', 'GHI', 'OPQ', 'OPQ', 'XYZ', 'DEF']
, 'f3_am':[20, 30, 45, 66, 14, 21]
}
df = pd.DataFrame (data)
#distinct values in columns using sets
distinct_values = sorted(list(set(df['f1_ty'])|set(df['f2_ty'])|set(df['f3_ty'])))
# add distinct values as new columns in the DataFrame
new_df = df.reindex(columns = np.append( df.columns.values, distinct_values))
So this would be my starting point and my wanted result.

Here is a try, thanks for the interesting problem (rename colujmns to make compatible to wide_to_long() followed by unstack() while dropping extra levels:
m=df.set_index(['mem_id','date_inf']).rename(columns=lambda x: ''.join(x.split('_')[::-1]))
n=(pd.wide_to_long(m.reset_index(),['tyf','amf'],['mem_id','date_inf'],'v')
.droplevel(-1).set_index('tyf',append=True).unstack(fill_value=0).reindex(m.index))
final=n.droplevel(0,axis=1).rename_axis(None,axis=1).reset_index()
print(final)
mem_id date_inf ABC DEF GHI OPQ XYZ
0 A 01/01/2019 100 20 0 0 20
1 B 01/01/2019 20 30 30 0 0
2 C 01/01/2019 57 45 0 45 0
3 A 02/01/2019 44 0 66 66 0
4 B 02/01/2019 14 0 15 0 14
5 C 02/01/2019 0 21 10 0 21

Related

Removing rows from DataFrame based on different conditions applied to subset of a data

Here is the Dataframe I am working with:
You can create it using the snippet:
my_dict = {'id': [1,2,1,2,1,2,1,2,1,2,3,1,3, 3],
'category':['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
'value' : [1, 12, 34, 12, 12 ,34, 12, 35, 34, 45, 65, 55, 34, 25]
}
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on the condition: for category a, the count of values should be 2 and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for individual categories and start removing id's manually by using the code:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But, I don't want to do it manually since there are multiple categories. Is there a better way of doing it by considering all the categories at once and remove id's based on the condition listed in a list/dictionary?
If need filter MultiIndex Series - s by dictionary use Index.get_level_values with Series.map and get equal values per groups in boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print (s[s.eq(s.index.get_level_values(0).map(d))])
category id
a 2 2
3 2
b 1 3
2 3
Name: value, dtype: int64
If need filter original DataFrame:
s = x.groupby(['category','id'])['value'].transform('count')
print (s)
0 3
1 2
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 3
10 1
11 3
12 2
13 2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print (x[s.eq(x['category'].map(d))])
id category value
1 2 a 12
2 1 b 34
3 2 b 12
4 1 b 12
5 2 b 34
7 2 a 35
8 1 b 34
9 2 b 45
12 3 a 34
13 3 a 25

Concate 2 dfs by a condition

I have 2 dfs
import pandas as pd
list_columns = ['Number', 'Name', 'Age']
list_data = [
[121, 'John', 25],
[122, 'Sam', 26]
]
df1 = pd.DataFrame(columns=list_columns, data=list_data)
Number Name Age
0 121 John 25
1 122 Sam 26
and
list_columns = ['Number', 'Name', 'Age']
list_data = [
[121, 'John', 31],
[122, 'Sam', 29],
[123, 'Andrew', 28]
]
df2 = pd.DataFrame(columns=list_columns, data=list_data)
Number Name Age
0 121 John 31
1 122 Sam 29
2 123 Andrew 28
In the end I want to take the missing values from df2 and add them into df1 bassed on the column Number.
In the above case in df1 I am missing only the Number 123, and I want to move only the data from this line to df1, so it will lok like
|Number|Name | Age|
| 121 |John | 25 |
| 122 |Sam | 26 |
| 123 |Andrew| 28 |
I tried to use concat with keep= 'First' but I am afraid that if a have lot of data it will alterate the existing data in df1(I want to add only missing data based on Number).
Is there a better way of achieving this?
this how I tried to concat
pd.concat([df1,df2]).drop_duplicates(['Number'],keep='first')
Use DataFrame.set_index on df1 and df2 to set the index as column Number and use DataFrame.combine_first:
df = (
df1.set_index('Number').combine_first(
df2.set_index('Number')).reset_index()
)
Result:
Number Name Age
0 121 John 25.0
1 122 Sam 26.0
2 123 Andrew 28.0

Getting columns by list of substring values

I have dataframe which is mentioned below, i have large data wanted to create diffrent data frame from substring values of column
df
ID ex_srr123 ex2_srr124 ex3_srr125 ex4_srr1234 ex23_srr5323
san 12 43 0 34 0
mat 53 0 34 76 656
jon 82 223 23 32 21
jack 0 12 2 0 0
i have a list of substring of column
coln1=['srr123', 'srr124']
coln2=['srr1234','srr5323']
I wanted
df2=
ID ex_srr123 ex2_srr12
san 12 43
mat 53 0
jon 82 223
jack 0 12
I tried
df2=df[coln1]
i didn't get what i wanted please help me how can i get desire output
Statically
df2 = df.filter(regex="srr123$|srr124$").copy()
Dynamically
coln1 = ['srr123', 'srr124']
df2 = df.filter(regex=f"{coln1[0]}$|{coln1[1]}$").copy()
The $ signifies the end of the string, so that the column ex4_srr1234 isn't also included in your result.
Look into the filter method
df.filter(regex="srr123|srr124").copy()
I am making a few assumptions:
'ID' is a column and not the index.
The third column in df2 should read 'ex2_srr124' instead of 'ex2_srr12'.
You do not want to include columns of 'df' in 'df2' if the substring does not match everything after the underscore (since 'srr123' is a substring of 'ex4_srr1234' but you did not include it in 'df2').
# set the provided data frames
df = pd.DataFrame([['san', 12, 43, 0, 34, 0],
['mat', 53, 0, 34, 76, 656],
['jon', 82, 223, 23, 32, 21],
['jack', 0, 12, 2, 0, 0]],
columns = ['ID', 'ex_srr123', 'ex2_srr124', 'ex3_srr125', 'ex4_srr1234', 'ex23_srr5323'])
# set the list of column-substrings
coln1=['srr123', 'srr124']
coln2=['srr1234','srr5323']
I suggest to solve this as follows:
# create df2 and add the ID column
df2 = pd.DataFrame()
df2['ID'] = df['ID']
# iterate over each substring in a list of column-substrings
for substring in coln1:
# iterate over each column name in the df columns
for column_name in df.columns.values:
# check if column name ends with substring
if substring == column_name[-len(substring):]:
# assign the new column to df2
df2[column_name] = df[column_name]
This yields the desired dataframe df2:
ID ex_srr123 ex2_srr124
0 san 12 43
1 mat 53 0
2 jon 82 223
3 jack 0 12
df.filter(regex = '|'.join(['ID'] + [col+ '$' for col in coln1])).copy()
ID ex_srr123 ex2_srr124
0 san 12 43
1 mat 53 0
2 jon 82 223
3 jack 0 12

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns are representative of multiple encounters with the customer, with an underscore and the number encounter. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split, create an tuple of (column_name, prefix, number) and sorted based on the 2nd and 3rd item of the tuple, then retrieved the first item.
# adjust "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
Luckily there is a one liner in python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For Example lets say you had this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
'ID': [2, 0, 0, 0],
'Prod3': [10, 2, 1, 8],
'Prod1': [2, 4, 8, 0],
'Prod_1': [2, 4, 8, 0],
'Pre7': [2, 0, 0, 0],
'Pre2': [10, 2, 1, 8],
'Pre_2': [10, 2, 1, 8],
'Pre_9': [10, 2, 1, 8]}
)
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then used
df = df.reindex(sorted(df.columns), axis=1)
Then the dataframe will then look like:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
As you can see, the columns without underscore will come first, followed by an ordering based on the number after the underscore. However this also sorts of the column names, so the column names that come first in the alphabet will be first.
You need to split you column on '_' then convert to int:
c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']
df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
Next case, you need human sorting:
import re
def atoi(text):
return int(text) if text.isdigit() else text
def natural_keys(text):
'''
alist.sort(key=natural_keys) sorts in human order
http://nedbatchelder.com/blog/200712/human_sorting.html
(See Toothy's implementation in the comments)
'''
return [ atoi(c) for c in re.split(r'(\d+)', text) ]
df.reindex(sorted(df.columns, key = lambda x:natural_keys(x)), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name
cols_fixed = df.columns[:3] # change index no based on your df
cols_variable = df.columns[3:] # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x : int(x.split('_')[1])) # split based on the number after '_'
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND number part of the column names
cols_fixed = df.columns[:3] # change index no based on your df
cols_variable = df.columns[3:] # change index no based on your df
cols_variable = sorted(cols_variable)
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])

Pandas: How to build a column based on another column which is indexed by another one?

I have this dataframe presented below. I tried a solution below, but I am not sure if this is a good solution.
import pandas as pd
def creatingDataFrame():
raw_data = {'code': [1, 2, 3, 2 , 3, 3],
'Region': ['A', 'A', 'C', 'B' , 'A', 'B'],
'var-A': [2,4,6,4,6,6],
'var-B': [20, 30, 40 , 50, 10, 20],
'var-C': [3, 4 , 5, 1, 2, 3]}
df = pd.DataFrame(raw_data, columns = ['code', 'Region','var-A', 'var-B', 'var-C'])
return df
if __name__=="__main__":
df=creatingDataFrame()
df['var']=np.where(df['Region']=='A',1.0,0.0)*df['var-A']+np.where(df['Region']=='B',1.0,0.0)*df['var-B']+np.where(df['Region']=='C',1.0,0.0)*df['var-C']
I want the variable var assumes values of column 'var-A', 'var-B' or 'var-C' depending on the region provided by region 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try with lookup
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)

Resources