Change the structure of column name - python-3.x

I have the columns as:
id_no  2021-05-19 00:00:00  2021-05-20 00:00:00  decider
100                     20                   20      878
200                     64                   38      917
Here id_no is the index and the rest are columns.
I want the output as:
id_no  2021-05-19  2021-05-20  decider
100            20          20      878
200            64          38      917
I tried converting the column names, but the names are not getting changed; they stay in datetime format, except for the population column. I tried the code below:
for (columnName, columnData) in df.iteritems():
    columnName = pd.to_datetime(columnName)

We can try a string slice, provided no other column name is longer than 10 characters:
df.columns = df.columns.astype(str).str[:10]
df
Out[356]:
   id_no  2021-05-19  2021-05-20  decider
0    100          20          20      878
1    200          64          38      917

Changing a loop variable changes only... the loop variable, not the column name! You must create a list of strings representing the new column names, and make it the new column index:
new_columns = [df.columns[0]] + \
              pd.to_datetime(df.columns[1:-1]).astype(str).tolist() + \
              [df.columns[-1]]
df.columns = new_columns
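As a further sketch of the same idea (an editor's addition, not part of the original answer): rename also accepts a callable, which avoids assuming the date columns sit between the first and last positions. The startswith('20') test is an assumption about what the date-like names look like:
# sketch: rename with a callable; assumes date-like names start with '20'
df = df.rename(columns=lambda c: str(c)[:10] if str(c).startswith('20') else c)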

You can just assign a list of names to the columns attribute of your df.
data = {'id_no': {0: 100, 1: 200},
        '2021-05-19 00:00:00': {0: 20, 1: 64},
        '2021-05-20 00:00:00': {0: 20, 1: 38},
        'decider': {0: 878, 1: 917}}
df = pd.DataFrame(data)
df.columns = ['id_no', '2021-05-19', '2021-05-20', 'decider'] # simple solution
# edit, you can use a list comprehension with conditional
df.columns = [str(x)[0:10] if x[0] == '2' else x for x in df.columns]
Output:
   id_no  2021-05-19  2021-05-20  decider
0    100          20          20      878
1    200          64          38      917

Related

Pandas DataFrame: Same operation on multiple sets of columns

I want to do the same operation on multiple sets of columns of a DataFrame.
Since "for-loops" are frowned upon I'm searching for a decent alternative.
An example:
df = pd.DataFrame({
    'a': [1, 11, 111],
    'b': [222, 22, 2],
    'a_x': [10, 80, 30],
    'b_x': [20, 20, 60],
})
This is a simple for-loop approach. It's short and quite readable.
cols = ['a', 'b']
for col in cols:
    df[f'{col}_res'] = df[[col, f'{col}_x']].min(axis=1)

     a    b  a_x  b_x  a_res  b_res
0    1  222   10   20      1     20
1   11   22   80   20     11     20
2  111    2   30   60     30      2
This is an alternative (without a for-loop), but I feel the additional complexity is not really for the better.
cols = ['a', 'b']

def res_df(df, col, name):
    res = pd.Series(
        df[[col, f'{col}_x']].min(axis=1), index=df.index, name=name)
    return res

res = [res_df(df, col, f'{col}_res') for col in cols]
df = pd.concat([df, pd.concat(res, axis=1)], axis=1)
Does anyone have a better/more pythonic solution?
Thanks!
UPDATE 1
Inspired by the proposal from mozway I find the following solution quite appealing.
Imho it's short, readable, and generic, since the particular operation can be swapped out inside a function and the list comprehension applies that function to the given sets of columns.
def operation(s1, s2):
    # fill in any operation on pandas Series
    # e.g. res = s1 * s2 / (s1 + s2)
    res = np.minimum(s1, s2)
    return res

df = df.join(
    [operation(df[f'{col}'], df[f'{col}_x']).rename(f'{col}_res') for col in cols]
)
You can use numpy.minimum after setting the arrays to identical column names:
cols = ['a', 'b']
cols2 = [f'{x}_x' for x in cols]
df = df.join(np.minimum(df[cols],
                        df[cols2].set_axis(cols, axis=1))
               .add_suffix('_res'))
output:
     a    b  a_x  b_x  a_res  b_res
0    1  222   10   20      1     20
1   11   22   80   20     11     20
2  111    2   30   60     30      2
or, using rename as suggested in the other answer:
cols = ['a', 'b']
cols2 = {f'{x}_x': x for x in cols}
df = df.join(np.minimum(df[cols],
                        df[list(cols2)].rename(columns=cols2))
               .add_suffix('_res'))
One idea is to rename the column names via a dictionary, select the columns by the list cols, and then group by the column names and aggregate with min, sum, max, or a custom function:
cols = ['a', 'b']
suffix = '_x'
d = {f'{x}{suffix}':x for x in cols}
print (d)
{'a_x': 'a', 'b_x': 'b'}
print (df.rename(columns=d)[cols])
     a   a    b   b
0    1  10  222  20
1   11  80   22  20
2  111  30    2  60
df1 = df.rename(columns=d)[cols].groupby(axis=1,level=0).min().add_suffix('_res')
print (df1)
   a_res  b_res
0      1     20
1     11     20
2     30      2
Last, add it to the original DataFrame:
df = df.join(df1)
print (df)
     a    b  a_x  b_x  a_res  b_res
0    1  222   10   20      1     20
1   11   22   80   20     11     20
2  111    2   30   60     30      2
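As an editor's sketch, the whole recipe above can be wrapped into a reusable helper (pairwise_min is a hypothetical name; this assumes a pandas version where groupby(axis=1) is still available, as it is deprecated in pandas 2.x):
def pairwise_min(df, cols, suffix='_x', res_suffix='_res'):
    # map e.g. 'a_x' -> 'a' so each pair shares one label, then reduce per label
    d = {f'{c}{suffix}': c for c in cols}
    res = (df.rename(columns=d)[cols]
             .groupby(level=0, axis=1)
             .min()
             .add_suffix(res_suffix))
    return df.join(res)

df = pairwise_min(df, ['a', 'b'])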

Python for-loop to change row value based on a condition works correctly but does not change the values on pandas dataframe?

I am just getting into Python, and I am trying to write a for-loop that iterates over every row, randomly selects two columns on each iteration based on a given condition, and changes their values. The for-loop runs without any problems; however, the results don't show up in the dataframe.
A reproducible example:
df = pd.DataFrame({'A': [10, 40, 10, 20, 10],
                   'B': [10, 10, 50, 40, 50],
                   'C': [10, 20, 10, 10, 10],
                   'D': [10, 30, 10, 10, 50],
                   'E': [10, 10, 40, 10, 10],
                   'F': [2, 3, 2, 2, 3]})
df:
    A   B   C   D   E  F
0  10  10  10  10  10  2
1  40  10  20  30  10  3
2  10  50  10  10  40  2
3  20  40  10  10  10  2
4  10  50  10  50  10  3
This is my for-loop; it iterates over all rows, checks whether the value in column F equals 2, and if so randomly selects two columns with value 10 and adds 100 to them.
for index, i in df.iterrows():
    if i['F'] == 2:
        i[i==10].sample(2, axis=0) + 100
        print(i[i==10].sample(2, axis=0) + 100)
This is the output of the loop:
E 110
C 110
Name: 0, dtype: int64
C 110
D 110
Name: 2, dtype: int64
C 110
D 110
Name: 3, dtype: int64
This is what the dataframe is expected to look like:
df:
    A   B    C    D    E  F
0  10  10  110   10  110  2
1  40  10   20   30   10  3
2  10  50  110  110   40  2
3  20  40  110  110   10  2
4  10  50   10   50   10  3
However, the values in the dataframe do not change. Any idea what's going wrong?
This line:
i[i==10].sample(2, axis=0)+100
.sample returns a new object, so the original dataframe (df) was not updated at all. On top of that, iterrows yields a copy of each row, so even an in-place assignment to i would not reach df.
Try this:
for index, i in df.iterrows():
    if i['F'] == 2:
        cond = (i == 10)
        # You can only sample 2 values if there are at
        # least 2 values meeting the condition
        if cond.sum() >= 2:
            idx = i[cond].sample(2).index
            # i is only a copy of the row, so write back through df.loc
            df.loc[index, idx] += 100
            print(df.loc[index, idx])
You should not modify the original df in place. Make a copy and iterate:
df2 = df.copy()
for index, i in df.iterrows():
    if i['F'] == 2:
        s = i[i == 10].sample(2, axis=0) + 100
        df2.loc[index, i.index.isin(s.index)] = s
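An alternative sketch without an explicit loop, using apply (bump_two is a hypothetical helper; it assumes every row with F == 2 has at least two cells equal to 10):
def bump_two(row):
    # work on a copy; apply reassembles the returned rows into a new frame
    row = row.copy()
    if row['F'] == 2:
        idx = row[row == 10].sample(2).index
        row.loc[idx] += 100
    return row

df = df.apply(bump_two, axis=1)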

how to multiply pandas data

I'm a beginner.
I have a pandas dataframe as below. I would like to multiply the elements of each column: a by 100, b by 200, c by 300. Can someone help me understand how to do that?
There are n columns.
Thank you.
index       a   b   c
2021-01-01  22  20  18
2021-01-02  25  29   7
2021-01-03  15  30  20
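For reference, a minimal reconstruction of this frame (an assumption: 'index' is an ordinary column of date strings, matching the answers' output below):
import pandas as pd

df = pd.DataFrame({'index': ['2021-01-01', '2021-01-02', '2021-01-03'],
                   'a': [22, 25, 15],
                   'b': [20, 29, 30],
                   'c': [18, 7, 20]})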
Create a dictionary and apply the operation to your dataframe:
coeff = {'a': 100, 'b': 200, 'c': 300}
df.update(df[coeff.keys()].mul(pd.Series(coeff), axis=1))
>>> df
        index     a     b     c
0  2021-01-01  2200  4000  5400
1  2021-01-02  2500  5800  2100
2  2021-01-03  1500  6000  6000
Alternative with a list:
df[['a', 'b', 'c']] *= [100, 200, 300]
Assuming your dataframe is called df, it is as simple as this (if I understand correctly):
df.a = df.a * 100
df.b = df.b * 200
df.c = df.c * 300
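Since the question mentions n columns, the same idea scales if the factors live in a dict; a sketch:
coeff = {'a': 100, 'b': 200, 'c': 300}  # extend with more columns as needed
for col, factor in coeff.items():
    df[col] = df[col] * factor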

Indexing based on multiple columns

I'm new to Python, and below is an ongoing data engineering issue I'm currently trying to resolve.
Table structure
Data:
Index 1:
Sequential; increments by 1 as rows are added.
Index 2: the problem (this is what needs to be tabulated)
It depends on the values stored in the columns [A, B, C, D, E]. If the value stays the same, we need to assign a single index to those rows.
E.g. rows 1, 2, 3 have 567 as the value of A, B, C respectively.
Therefore, index 2 is 100 for these 3 rows.
Record types :
1 - A
2 - B
3 - C
4 - D
5 - E
Code
data = [(100, 100, 1, 567, '', '', '', ''),
        (101, 100, 2, '', 567, '', '', ''),
        (102, 100, 3, '', '', 567, '', ''),
        (103, 101, 3, '', '', 568, '', ''),
        (104, 101, 4, '', '', '', 568, ''),
        (105, 101, 5, '', '', '', '', 568)]

# Create the data frame
df = pd.DataFrame(data, columns=['index1', 'index2', 'record_type', 'A', 'B', 'C', 'D', 'E'], dtype=str)

# Combine columns A,B,C,D,E, adding a $ wherever a value is null, in order to stack these values
df['combined'] = df[['A', 'B', 'C', 'D', 'E']].stack().groupby(level=0).agg('$'.join)

# Clean the 'combined' column
df['combined_cleaned'] = df['combined'].replace({r'\$': ''}, regex=True)
Attempting to use the combined_cleaned column to calculate index2.
Not sure if this is the right approach, open to suggestions.
A few assumptions here, but they seem to fit your problem.
If there is only ever one value across those columns in each row, you can take the max along the row and then find consecutive groups by checking whether that Series is equal to itself shifted.
We add 99 because the counting will by definition start at 1, but you seem to want it to start at 100.
val_cols = ['A', 'B', 'C', 'D', 'E']
s = df[val_cols].apply(pd.to_numeric).max(1)
#0 567.0
#1 567.0
#2 567.0
#3 568.0
#4 568.0
#5 568.0
#dtype: float64
df['index2'] = s.ne(s.shift()).cumsum() + 99
print(df)
  index1 record_type    A    B    C    D    E  index2
0    100           1  567                         100
1    101           2       567                    100
2    102           3            567               100
3    103           3            568               101
4    104           4                 568          101
5    105           5                      568     101
If, instead of there being a single value per row, 'record_type' points to the appropriate column, you can use numpy indexing.
import numpy as np
arr = df[val_cols].to_numpy()
idx = df['record_type'].astype(int).to_numpy()
vals = arr[np.arange(len(arr)), idx-1]
#array(['567', '567', '567', '568', '568', '568'], dtype=object)
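From here, the same shift-and-cumsum trick from above would produce index2; a sketch continuing from vals:
# group consecutive equal values into a single running index
s = pd.Series(vals, index=df.index)
df['index2'] = s.ne(s.shift()).cumsum() + 99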
The combined_cleaned column could be generated directly using
cols = ['A', 'B', 'C','D','E']
df[cols].replace('', np.nan).apply(lambda x: x.dropna().item(), axis=1)
You can also try with stack followed by factorize:
cols = ['A', 'B', 'C','D','E']
s = pd.factorize(df[cols].replace('',np.nan).stack())[0]
df['index2_new'] = int(df['index1'].iat[0]) + s
print(df)
  index1 index2 record_type    A    B    C    D    E  index2_new
0    100    100           1  567                             100
1    101    100           2       567                        100
2    102    100           3            567                   100
3    103    101           3            568                   101
4    104    101           4                 568              101
5    105    101           5                      568         101

Getting columns by list of substring values

I have the dataframe mentioned below. I have a lot of data and want to create a different dataframe from substring values of the columns.
df
  ID  ex_srr123  ex2_srr124  ex3_srr125  ex4_srr1234  ex23_srr5323
 san         12          43           0           34             0
 mat         53           0          34           76           656
 jon         82         223          23           32            21
jack          0          12           2            0             0
I have lists of column substrings:
coln1=['srr123', 'srr124']
coln2=['srr1234','srr5323']
I want:
df2 =
  ID  ex_srr123  ex2_srr12
 san         12         43
 mat         53          0
 jon         82        223
jack          0         12
I tried:
df2 = df[coln1]
I didn't get what I wanted. Please help me with how I can get the desired output.
Statically
df2 = df.filter(regex="srr123$|srr124$").copy()
Dynamically
coln1 = ['srr123', 'srr124']
df2 = df.filter(regex=f"{coln1[0]}$|{coln1[1]}$").copy()
The $ signifies the end of the string, so that the column ex4_srr1234 isn't also included in your result.
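If the list can be any length, the same pattern can be built with a join rather than indexing each element (a sketch):
# generalize the regex to any number of substrings
pattern = '|'.join(f'{c}$' for c in coln1)   # 'srr123$|srr124$'
df2 = df.filter(regex=pattern).copy()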
Look into the filter method
df.filter(regex="srr123|srr124").copy()
I am making a few assumptions:
'ID' is a column and not the index.
The third column in df2 should read 'ex2_srr124' instead of 'ex2_srr12'.
You do not want to include columns of 'df' in 'df2' if the substring does not match everything after the underscore (since 'srr123' is a substring of 'ex4_srr1234' but you did not include it in 'df2').
# set up the provided data frame
df = pd.DataFrame([['san', 12, 43, 0, 34, 0],
                   ['mat', 53, 0, 34, 76, 656],
                   ['jon', 82, 223, 23, 32, 21],
                   ['jack', 0, 12, 2, 0, 0]],
                  columns=['ID', 'ex_srr123', 'ex2_srr124', 'ex3_srr125',
                           'ex4_srr1234', 'ex23_srr5323'])

# set the lists of column-substrings
coln1 = ['srr123', 'srr124']
coln2 = ['srr1234', 'srr5323']
I suggest solving this as follows:
# create df2 and add the ID column
df2 = pd.DataFrame()
df2['ID'] = df['ID']

# iterate over each substring in the list of column-substrings
for substring in coln1:
    # iterate over each column name in the df columns
    for column_name in df.columns.values:
        # check if the column name ends with the substring
        if substring == column_name[-len(substring):]:
            # assign the new column to df2
            df2[column_name] = df[column_name]
This yields the desired dataframe df2:
     ID  ex_srr123  ex2_srr124
0   san         12          43
1   mat         53           0
2   jon         82         223
3  jack          0          12
df.filter(regex = '|'.join(['ID'] + [col+ '$' for col in coln1])).copy()
     ID  ex_srr123  ex2_srr124
0   san         12          43
1   mat         53           0
2   jon         82         223
3  jack          0          12
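For completeness, the same one-liner works for the second list from the question (df3 is a hypothetical name):
df3 = df.filter(regex='|'.join(['ID'] + [col + '$' for col in coln2])).copy()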
