How to move content of a column to a column next to it - python-3.x

The dataframe looks like this:
import pandas as pd
df = pd.DataFrame ({
'Name':['Brian','John','Adam'],
'HomeAddr':[12,32,44],
'Age':['M','M','F'],
'Genre': ['NaN','NaN','NaN']
})
The current output is:
Name HomeAddr Age Genre
0 Brian 12 M NaN
1 John 32 M NaN
2 Adam 44 F NaN
I would like to shift the contents of the HomeAddr and Age columns one column to the right. Below is a sample of the expected output.
Name HomeAddr Age Genre
0 Brian NaN 12 M
1 John NaN 32 M
2 Adam NaN 44 F
I tried with .shift() but it doesn't work.
import pandas as pd
df = pd.DataFrame ({
'Name':['Brian','John','Adam'],
'HomeAddr':[12,32,44],
'Age':['M','M','F'],
'Genre': ['NaN','NaN','NaN']
})
df['HomeAddr'] = df['HomeAddr'].shift(-1)
print(df)
Name HomeAddr Age Genre
0 Brian 32.0 M NaN
1 John 44.0 M NaN
2 Adam NaN F NaN
Any ideas guys? Thank you!

Use DataFrame.shift with axis=1, but it is necessary to convert the columns to strings first to avoid missing values, then convert the numeric columns back:
df.loc[:, 'HomeAddr':] = df.loc[:, 'HomeAddr':].astype(str).shift(1, axis=1)
df['Age'] = pd.to_numeric(df['Age'])
print (df)
Name HomeAddr Age Genre
0 Brian NaN 12 M
1 John NaN 32 M
2 Adam NaN 44 F
Another out-of-the-box solution:
import numpy as np
df = df.drop('Genre', axis=1).rename(columns={'HomeAddr':'Age', 'Age':'Genre'})
df.insert(1, 'HomeAddr', np.nan)
print (df)
Name HomeAddr Age Genre
0 Brian NaN 12 M
1 John NaN 32 M
2 Adam NaN 44 F
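For only two or three columns, the same result can also be reached with plain column assignment. The following is a minimal sketch, assuming the column names from the question; it is not taken from either answer. The columns are assigned from right to left so that nothing is overwritten before it has been copied, and Age keeps its integer dtype because no string conversion is involved.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Brian', 'John', 'Adam'],
    'HomeAddr': [12, 32, 44],
    'Age': ['M', 'M', 'F'],
    'Genre': ['NaN', 'NaN', 'NaN'],
})

# Work from the rightmost column to the left so that each source
# column is copied before it is overwritten.
df['Genre'] = df['Age']
df['Age'] = df['HomeAddr']
df['HomeAddr'] = np.nan  # the vacated column becomes missing

print(df)
```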

Related

how to change data frame row to next row in pandas

I am a new Python user, and my goal is to take each name and move it onto the rows that follow it.
import pandas as pd
import numpy as np
df = pd.DataFrame({"1": ['Alfred', 'car', 'bike','Alex','car'],
"2": [np.nan, 'Ford', 'Giant',np.nan,'Toyota'],
"3": [pd.NaT, pd.Timestamp("2018-01-01"),
pd.Timestamp("2018-07-01"),np.nan,pd.Timestamp("2021-01-01")]})
1 2 3
0 Alfred NaN NaT
1 car Ford 2018-01-01
2 bike Giant 2018-07-01
3 Alex NaN NaT
4 car Toyota 2021-01-01
My desired result is shown below:
df = pd.DataFrame({"transportation": ['car', 'bike','car'],
"Mark": ['Ford', 'Giant','Toyota'],
"BuyDate":[pd.Timestamp("2018-01-01"),
pd.Timestamp("2018-07-01"),pd.Timestamp("2021-01-01")],
"Name":['Alfred','Alfred','Alex']
})
transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
I have tried searching for a method, but could not solve this.
Thanks for reading my post and for the help.
Thanks to mozway, jezrael and mcsoini for the help. It works, and I am going to study these different methods.
Joseph Assaker, I have a question about your answer: when I run the code below, it raises an error. Am I missing something?
j = 0
for i in range(1, df.shape[0]):
    if df.loc[i][1] is np.nan:
        running_name = df.loc[i][0]
        continue
    new_df.loc[j] = list(df.loc[i]) + [running_name]
    j += 1
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14216/1012510729.py in <module>
4 running_name = df.loc[i][0]
5 continue
----> 6 new_df.loc[j] = list(df.loc[i]) + [running_name]
7 j += 1
NameError: name 'running_name' is not defined
The idea is to forward-fill the names (the values in rows where Mark is missing) into a new Name column, then drop those name rows using the same mask:
df.columns = ["Transportation", "Mark", "BuyDate"]
m = df["Mark"].notna()
df["Name"] = df["Transportation"].mask(m).ffill()
df = df[m].reset_index(drop=True)
print(df)
Transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
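To make the mask-then-forward-fill step easier to follow, here is a small sketch that applies it to a standalone Series. The names `values`, `has_mark`, `names_only` and `filled` are illustrative and do not appear in the answer.

```python
import pandas as pd

# First column of the question's frame: names mixed with vehicle types.
values = pd.Series(['Alfred', 'car', 'bike', 'Alex', 'car'])
# True where the "Mark" column holds a value, i.e. the row is a vehicle row.
has_mark = pd.Series([False, True, True, False, True])

# mask() replaces entries with NaN wherever the condition is True,
# so only the name rows keep their value.
names_only = values.mask(has_mark)

# ffill() carries each name down to the rows that follow it.
filled = names_only.ffill()

print(filled.tolist())
```

Filtering with `has_mark` afterwards keeps only the vehicle rows, each now paired with its owner's name.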
You can do this using a helper column and then a forward fill:
# rename columns
df.columns = ["transportation", "Mark", "BuyDate"]
# assumption: the rows where "Mark" is NaN define the name for the following rows
df["is_name"] = df["Mark"].isna()
# create a new column which is NaN everywhere except for the name rows
df["name"] = np.where(df.is_name, df["transportation"], np.nan)
# do a forward fill to extend the names to all rows
df["name"] = df["name"].ffill()
# filter by non-name rows and drop the temporary is_name column
df = df.loc[~df.is_name].drop("is_name", axis=1)
print(df)
Out:
transportation Mark BuyDate name
1 car Ford 2018-01-01 Alfred
2 bike Giant 2018-07-01 Alfred
4 car Toyota 2021-01-01 Alex
You could use this pipeline:
m = df.iloc[:,1].notna()
(df.assign(Name=df.iloc[:,0].mask(m).ffill())  # add new column
   .loc[m]  # keep only the rows with info
   # below: rework df to fit output
   .rename(columns={'1': 'transportation', '2': 'Mark', '3': 'BuyDate'})
   .reset_index(drop=True)
)
output:
transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
You can do this like so:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"1": ['Alfred', 'car', 'bike','Alex','car'],
... "2": [np.nan, 'Ford', 'Giant',np.nan,'Toyota'],
... "3": [pd.NaT, pd.Timestamp("2018-01-01"),
... pd.Timestamp("2018-07-01"),np.nan,pd.Timestamp("2021-01-01")]})
>>>
>>> df
1 2 3
0 Alfred NaN NaT
1 car Ford 2018-01-01
2 bike Giant 2018-07-01
3 Alex NaN NaT
4 car Toyota 2021-01-01
>>>
>>> new_df = pd.DataFrame(columns=['Transportation', 'Mark', 'BuyDate', 'Name'])
>>>
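>>> j = 0
>>> for i in range(df.shape[0]):  # start at row 0 so the first name row sets running_name
...     if pd.isna(df.iloc[i, 1]):  # a missing value in column "2" marks a name row
...         running_name = df.iloc[i, 0]
...         continue
...     new_df.loc[j] = list(df.loc[i]) + [running_name]
...     j += 1
...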
>>> new_df
Transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
>>>

Convert one dataframe's format and check if each row exists in another dataframe in Python

Given a small dataset df1 as follows:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create a quarterly date range from 2019-q2 to 2021-q1 as column names, then check whether each df1 row's year and quarter exist in df2 for each city.
If they exist, return y for that cell; otherwise, return NaN.
The final result will look like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert to bool, replace False and True to be the correct values, reindex to add missing columns, and cleanup axes and index to get exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
    pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
    .astype(bool)  # Counts to boolean
    .replace({False: np.nan, True: 'y'})  # Fill values
    .reindex(columns=col_names.strftime('%Y-q%q'))  # Add missing columns
    .rename_axis(columns=None)  # Cleanup axis name
    .reset_index()  # reset index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
'year': [2019, 2020, 2020, 2020, 2020, 2021],
'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})
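The question asks for columns running from 2019-q2 to 2021-q1, while the output above starts at 2019-q1. A possible variant, sketched here under the assumption that `pd.period_range` is an acceptable way to build the labels, produces exactly the requested range. The names `periods`, `labels`, `counts` and `marks` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1'],
})

# Quarterly periods from 2019Q2 through 2021Q1, formatted like "2019-q2".
periods = pd.period_range('2019Q2', '2021Q1', freq='Q')
labels = [f'{p.year}-q{p.quarter}' for p in periods]

# Count occurrences per city and quarter, adding the quarters with no data.
key = df['year'].astype(str) + '-' + df['quarter']
counts = pd.crosstab(df['city'], key).reindex(columns=labels, fill_value=0)

# Start from a frame filled with 'y' and blank out the cells with no rows.
marks = pd.DataFrame('y', index=counts.index, columns=counts.columns)
marks = marks.where(counts > 0)

result = marks.rename_axis(columns=None).reset_index()
print(result)
```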

Groupby id and get each string from an id, in each different column

Hello, I just want to group the elements by id and show each string in a separate column.
Original dataframe:
id|elements|
1|a
1|b
1|c
1|d
2|a
2|b
2|b
3|a
3|a
3|b
3|c
3|c
3|c
Desired output:
id|column1|column2|column3|column4|column5|column6|
1|a|b|c|d| | |
2|a|b|b| | | |
3|a|a|b|c|c|c|
Any ideas? Thank you very much in advance
Given your original data frame, you can simply do:
df.groupby('id').apply(lambda x: x['element'].to_list()).apply(pd.Series)
Output:
0 1 2 3 4 5
id
1 a b c d NaN NaN
2 a b b NaN NaN NaN
3 a a b c c c
If you do not want id to be the index, use .reset_index().
Try this
import pandas as pd
import numpy as np
F = {'id': [1,1,1,1,2,2,2,3,3,3,3,3], 'element': ['a','b','c','d','a','b','b','a','a','b','c','c']}
df = pd.DataFrame(data = F)
df2 = df.set_index('id').stack().groupby(level=[0,1]).apply(list).unstack()
df3 = pd.DataFrame(df2["element"].to_list(), columns=['element1', 'element2','element3', 'element4','element5'])
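An alternative that avoids building Python lists is to number the rows within each id and pivot on that number. This is only a sketch, assuming a column named `element` as in the answers above; the helper column `pos` and the `column` prefix are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2],
    'element': ['a', 'b', 'c', 'd', 'a', 'b', 'b'],
})

# cumcount() numbers the rows inside each id group starting at 0;
# adding 1 gives positions 1, 2, 3, ...
df['pos'] = df.groupby('id').cumcount() + 1

# Each position becomes its own column; shorter groups are padded with NaN.
wide = df.pivot(index='id', columns='pos', values='element').add_prefix('column')

print(wide.reset_index())
```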

pandas groupby and widen dataframe with ordered columns

I have a long form dataframe that contains multiple samples and time points for each subject. The number of samples and timepoint can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id":[1,1,1,2,2,3],
"sample":["A", "B", "C", "D", "E", "F"],
"timepoint":[19,11,8,6,2,12],
"time_order":[3,2,1,2,1,1]
})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
.unstack()
.sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
Another option (this relies on the rows within each subject already being sorted by descending time_order):
a=test_df.iloc[:,:3].groupby('subject_id').last().add_suffix('1')
b=test_df.iloc[:,:3].groupby('subject_id').nth(-2).add_suffix('2')
c=test_df.iloc[:,:3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b,c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN

DataFrame difference between rows based on multiple columns

I am trying to calculate the difference between rows based on multiple columns. The data set is very large and I am pasting dummy data below that describes the problem:
I want to calculate the daily difference in weight at a pet+name level. So far I have only come up with the solution of concatenating these columns and creating a MultiIndex based on the new column and the date column. But I think there should be a better way. In the real dataset I have more than 3 columns that I use to calculate the row difference.
df['pet_name']=df.pet + df.name
df.set_index(['pet_name','date'],inplace = True)
df.sort_index(inplace=True)
df['diffs']=np.nan
for idx in df.index.levels[0]:
    df.diffs[idx] = df.weight[idx].diff()
Based on your description, you can try groupby:
df['pet_name']=df.pet + df.name
df.groupby('pet_name')['weight'].diff()
Use groupby by 2 columns:
df.groupby(['pet', 'name'])['weight'].diff()
All together:
#convert dates to datetimes
df['date'] = pd.to_datetime(df['date'])
#sorting
df = df.sort_values(['pet', 'name','date'])
#get differences per groups
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
Sample:
np.random.seed(123)
N = 100
L = list('abc')
df = pd.DataFrame({'pet': np.random.choice(L, N),
'name': np.random.choice(L, N),
'date': pd.Series(pd.date_range('2015-01-01', periods=int(N/10)))
.sample(N, replace=True),
'weight':np.random.rand(N)})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['pet', 'name','date'])
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name','date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()
print (df.head(20))
date name pet weight diffs pet_name diffs1
1 2015-01-02 a a 0.105446 NaN aa NaN
2 2015-01-03 a a 0.845533 NaN aa NaN
2 2015-01-03 a a 0.980582 0.135049 aa 0.135049
2 2015-01-03 a a 0.443368 -0.537214 aa -0.537214
3 2015-01-04 a a 0.375186 NaN aa NaN
6 2015-01-07 a a 0.715601 NaN aa NaN
7 2015-01-08 a a 0.047340 NaN aa NaN
9 2015-01-10 a a 0.236600 NaN aa NaN
0 2015-01-01 b a 0.777162 NaN ab NaN
2 2015-01-03 b a 0.871683 NaN ab NaN
3 2015-01-04 b a 0.988329 NaN ab NaN
4 2015-01-05 b a 0.918397 NaN ab NaN
4 2015-01-05 b a 0.016119 -0.902279 ab -0.902279
5 2015-01-06 b a 0.095530 NaN ab NaN
5 2015-01-06 b a 0.894978 0.799449 ab 0.799449
5 2015-01-06 b a 0.365719 -0.529259 ab -0.529259
5 2015-01-06 b a 0.887593 0.521874 ab 0.521874
7 2015-01-08 b a 0.792299 NaN ab NaN
7 2015-01-08 b a 0.313669 -0.478630 ab -0.478630
7 2015-01-08 b a 0.281235 -0.032434 ab -0.032434
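As a smaller, hand-checkable illustration of grouping by pet and name only, here is a sketch with made-up data (the pets, names and weights are invented for the example). Note that once date is also one of the group keys, as in the sample above, a difference is only computed between rows that share the same date, which is why most of that output is NaN.

```python
import pandas as pd

# Hypothetical data: two animals weighed on consecutive days, rows unsorted.
df = pd.DataFrame({
    'pet': ['cat', 'cat', 'cat', 'dog', 'dog'],
    'name': ['tom', 'tom', 'tom', 'rex', 'rex'],
    'date': pd.to_datetime(['2015-01-03', '2015-01-01', '2015-01-02',
                            '2015-01-01', '2015-01-02']),
    'weight': [4.5, 4.0, 4.25, 10.0, 9.5],
})

# Sort so that consecutive rows within a group are consecutive dates.
df = df.sort_values(['pet', 'name', 'date'])

# diff() restarts for every (pet, name) pair; the first row of each is NaN.
df['diffs'] = df.groupby(['pet', 'name'])['weight'].diff()

print(df)
```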
