Concatenate 2 dataframes. I would like to combine duplicate columns

The following code can be used as an example of the problem I'm having:
import pandas as pd
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df3=pd.concat([df1,df2], axis=1)
print(df3)
The result I get from this concatenation is:
B B
1 10 NaN
2 11 NaN
3 12 NaN
4 NaN 10
5 NaN 11
6 NaN 12
I would like to have:
B
1 10
2 11
3 12
4 10
5 11
6 12
I know that I can concatenate along axis=0. Unfortunately, that only solves the problem for this little example. The actual code I'm working with is more complex. Concatenating along axis=0 causes the index to be duplicated. I don't want that either.
EDIT:
People have asked me to give a more complex example to describe why simply removing 'axis=1' doesn't work. Here is a more complex example, first with axis=1 INCLUDED:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2], axis=1)
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3], axis=1)
print(df4)
This gives me:
B B C
1 10 NaN 20
2 11 NaN 21
3 12 NaN 22
4 NaN 10 NaN
5 NaN 11 NaN
6 NaN 12 NaN
I would like to have:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
Now here is an example with axis=1 REMOVED:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2])
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3])
print(df4)
This gives me:
B C
A
1 10 NaN
2 11 NaN
3 12 NaN
4 10 NaN
5 11 NaN
6 12 NaN
1 NaN 20
2 NaN 21
3 NaN 22
I would like to have:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
Sorry it wasn't very clear. I hope this helps.

Here is a two-step process for the example provided after the 'EDIT' point. Start by creating the dictionaries:
import pandas as pd
dic = {'A':['1','2','3'], 'B':['10','11','12']}
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
dic3 = {'A':['1','2','3'], 'C':['20','21','22']}
Step 1: convert each dictionary to a data frame, with index 'A', and concatenate (along axis=0):
t = pd.concat([pd.DataFrame(dic).set_index('A'),
pd.DataFrame(dic2).set_index('A'),
pd.DataFrame(dic3).set_index('A')])
Step 2: concatenate non-null elements of col 'B' with non-null elements of col 'C' (you could put this in a list comprehension if there are more than two columns). Now we concatenate along axis=1:
result = pd.concat([
t.loc[ t['B'].notna(), 'B' ],
t.loc[ t['C'].notna(), 'C' ],
], axis=1)
print(result)
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
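With more than two columns, the second step can also be done by collapsing the duplicated index labels. The following is a minimal sketch under the assumption that each index label has at most one non-null value per column; it relies on GroupBy.first skipping missing values. The names frames, stacked and collapsed are illustrative only.

```python
import pandas as pd

dic = {'A': ['1', '2', '3'], 'B': ['10', '11', '12']}
dic2 = {'A': ['4', '5', '6'], 'B': ['10', '11', '12']}
dic3 = {'A': ['1', '2', '3'], 'C': ['20', '21', '22']}

# Step 1 as above: stack all frames along axis=0 (index labels repeat).
frames = [pd.DataFrame(d).set_index('A') for d in (dic, dic2, dic3)]
stacked = pd.concat(frames)

# Collapse repeated index labels: first() keeps the first non-null
# value per column within each label.
collapsed = stacked.groupby(level=0).first()
print(collapsed)
```

If a label could carry two different non-null values in the same column, this silently keeps the first one, so it is only appropriate when the frames do not conflict.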

Edited:
Concatenating along axis=1 appends new columns, while axis=0 (the default) appends new rows to the existing columns. So concatenate df1 and df2 along axis=0 first, then add df3 along axis=1.
See the solution below:
import pandas as pd
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2])
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3],axis=1) # C is a new column, so use axis=1 here
print(df4)
Output:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
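When it is not known in advance which frames share columns and which share index labels, one possible alternative is DataFrame.combine_first, which takes the union of both index and columns and fills gaps in the left frame from the right one. This is a sketch, not part of the answer above; the name combined is illustrative.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['10', '11', '12']}).set_index('A')
df2 = pd.DataFrame({'A': ['4', '5', '6'], 'B': ['10', '11', '12']}).set_index('A')
df3 = pd.DataFrame({'A': ['1', '2', '3'], 'C': ['20', '21', '22']}).set_index('A')

# Fold the frames together; values already present on the left win,
# missing cells are filled from the right.
combined = reduce(lambda left, right: left.combine_first(right), [df1, df2, df3])
print(combined)
```

Because the left frame takes priority, the order of the list matters whenever two frames hold different values for the same cell.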

Related

How to add value to specific index that is out of bounds

I have a list array
list = [[0, 1, 2, 3, 4, 5],[0],[1],[2],[3],[4],[5]]
Say I add [6, 7, 8] to the first row as the header for my three new columns, what's the best way to add values in these new columns, without getting index out of bounds? I've tried first filling all three columns with "" but when I add a value, it then pushes the "" out to the right and increases my list size.
Would it be any easier to use a Pandas dataframe? Are you allowed "gaps" in a Pandas dataframe?
Based on the OP's comment, I think a pandas DataFrame is the more appropriate solution. You cannot have 'gaps', but you can have NaN values, like this:
import numpy as np
import pandas as pd
# create sample data
a = np.arange(1, 6)
df = pd.DataFrame(zip(*[a]*5))
print(df)
output:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
To add empty columns:
# add new columns, not empty but filled w/ nan
df[5] = df[6] = df[7] = float('nan')
# fill a single value in column 7, row index 4
df.loc[4, 7] = 123
print(df)
output:
0 1 2 3 4 5 6 7
0 1 1 1 1 1 NaN NaN NaN
1 2 2 2 2 2 NaN NaN NaN
2 3 3 3 3 3 NaN NaN NaN
3 4 4 4 4 4 NaN NaN NaN
4 5 5 5 5 5 NaN NaN 123.0
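As a variation on the same idea, the new columns can be created in one call with DataFrame.reindex, and a single cell can then be set by label with loc. This is a small sketch using the same sample data; the column labels 5 to 7 are simply the next integers after the existing ones.

```python
import numpy as np
import pandas as pd

# Same sample data as above: five identical columns holding 1..5.
a = np.arange(1, 6)
df = pd.DataFrame({col: a for col in range(5)})

# Extend the column labels to 0..7; the new columns are filled with NaN.
df = df.reindex(columns=range(8))

# Set one cell by row label and column label.
df.loc[4, 7] = 123
print(df)
```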

Replace and remove duplicates string elements from one column in Python

Given a small dataset as follows:
id room area room_vector
0 1 A-102 world 01 , 02, 03, 04
1 2 NaN 24 A; B; C
2 3 B309 NaN s01, s02 , s02
3 4 C·102 25 E2702-2703,E2702-2703
4 5 E_1089 hello 03,05,06
5 6 27 NaN 05-08,09,10-12, 05-08
6 7 27 NaN NaN
I need to manipulate room_vector column with the following logic:
(1) remove whitespace and replace ; with ,;
(2) remove duplicates, keeping one copy of each value, separated by ,.
For the first one, I've tried:
df['room_vector'] = df['room_vector'].str.replace([' ', ';'], '')
Out:
TypeError: unhashable type: 'list'
How could I get the expected result as follows:
id room area room_vector
0 1 A-102 world 01,02,03,04
1 2 NaN 24 A,B,C
2 3 B309 NaN s01,s02
3 4 C·102 25 E2702-2703
4 5 E_1089 hello 03,05,06
5 6 27 NaN 05-08,09,10-12
6 7 27 NaN NaN
Many thanks.
The idea is to remove whitespace, split on , or ; with Series.str.split, and then remove duplicates while preserving the original order by building a dictionary from the values and joining its keys. This is applied only to lists; any other value (such as NaN) is returned unchanged:
f = lambda x: ','.join(dict.fromkeys(x).keys()) if isinstance(x, list) else x
df['room_vector'] = df['room_vector'].str.replace(' ', '').str.split('[,;]').apply(f)
print(df)
id room area room_vector
0 1 A-102 world 01,02,03,04
1 2 NaN 24 A,B,C
2 3 B309 NaN s01,s02
3 4 C·102 25 E2702-2703
4 5 E_1089 hello 03,05,06
5 6 27 NaN 05-08,09,10-12
6 7 27 NaN NaN
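The same logic can also be written as a plain function applied per cell, which makes the individual steps easier to follow. This is a sketch on the room_vector column only; the helper name dedupe_tokens is hypothetical, and dropping empty tokens (for example from a trailing comma) is an extra assumption not stated in the question.

```python
import re

import numpy as np
import pandas as pd


def dedupe_tokens(value):
    """Normalise separators and drop repeated tokens, keeping first-seen order."""
    if not isinstance(value, str):
        return value  # leave NaN and other non-strings untouched
    # Remove spaces, split on ',' or ';', and skip empty tokens.
    tokens = [tok for tok in re.split(r"[,;]", value.replace(" ", "")) if tok]
    # dict.fromkeys keeps insertion order, so duplicates are dropped in place.
    return ",".join(dict.fromkeys(tokens))


df = pd.DataFrame({"room_vector": [
    " 01 , 02, 03, 04", " A; B; C", "s01, s02 , s02",
    "E2702-2703,E2702-2703", "03,05,06", "05-08,09,10-12, 05-08", np.nan,
]})
df["room_vector"] = df["room_vector"].map(dedupe_tokens)
print(df)
```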

pandas groupby and widen dataframe with ordered columns

I have a long form dataframe that contains multiple samples and time points for each subject. The number of samples and timepoint can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id":[1,1,1,2,2,3],
"sample":["A", "B", "C", "D", "E", "F"],
"timepoint":[19,11,8,6,2,12],
"time_order":[3,2,1,2,1,1]
})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack to pivot, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
.unstack()
.sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
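Another option, assuming the rows of each subject are already sorted from latest to earliest time point, is to take the last, second-to-last and third-to-last row of each group. Note that in pandas 2.0 and later nth behaves as a filter and keeps the original row index, so this may need adjusting there:
a=test_df.iloc[:,:3].groupby('subject_id').last().add_suffix('1')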
b=test_df.iloc[:,:3].groupby('subject_id').nth(-2).add_suffix('2')
c=test_df.iloc[:,:3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b,c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN
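If the data does not come with a time_order column, it can be derived per subject before widening. The following sketch assumes the order should follow timepoint ascending within each subject; the column name order is made up for the illustration, and the widening step is the set_index/unstack approach from the first answer.

```python
import pandas as pd

test_df = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 3],
    "sample": ["A", "B", "C", "D", "E", "F"],
    "timepoint": [19, 11, 8, 6, 2, 12],
})

# Assumption: rank time points within each subject to get 1, 2, 3, ...
test_df["order"] = (
    test_df.groupby("subject_id")["timepoint"].rank(method="first").astype(int)
)

# Widen: one row per subject, one (column, order) pair per original cell.
wide = (
    test_df.set_index(["subject_id", "order"])
    .unstack()
    .sort_index(level=[1, 0], axis=1)
)
# Flatten the MultiIndex columns into names like 'sample1', 'timepoint1'.
wide.columns = [f"{name}{order}" for name, order in wide.columns]
wide = wide.reset_index()
print(wide)
```

Using method="first" breaks ties by row position, so two samples with the same timepoint still get distinct order values.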

Pandas append returns DF with NaN values

I'm appending data from a list to pandas df. I keep getting NaN in my entries.
Based on what I've read I think I might have to mention the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index = True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
Calling append() in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame and then use loc to insert each row at a given index label.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
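From the DataFrame.append documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html):
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the replacement.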
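The advice quoted from the documentation could look roughly like the sketch below: collect one dictionary per track in a plain list and build the DataFrame once at the end. The function fetch_features is a hypothetical stand-in for sp.audio_features (assumed here to return a list of dictionaries for a batch of ids), and the feature names and values are made up.

```python
import pandas as pd


def fetch_features(batch):
    """Hypothetical stand-in for sp.audio_features: one dict per track id."""
    return [{"id": track_id, "energy": 0.5, "tempo": 100.0} for track_id in batch]


ids = [f"track{n}" for n in range(120)]

rows = []
# Step through the ids in batches of 50; the last batch may be shorter.
for start in range(0, len(ids), 50):
    rows.extend(fetch_features(ids[start:start + 50]))

# Build the DataFrame once; the dictionary keys become the column names.
features_df = pd.DataFrame(rows)
print(features_df.head())
```

Passing dictionaries rather than bare value lists keeps each value under its matching column name, which avoids the all-NaN rows described in the question.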

Copy and Paste Values Based on a Condition in Python

I am trying to populate column 'C' with values from column 'A' based on conditions in column 'B'. Example: if column 'B' equals 'nan', then the row under column 'C' should equal the row in column 'A'. If column 'B' does NOT equal 'nan', then leave column 'C' as is (i.e. 'nan'). The values that were copied from column 'A' to 'C' should then be removed from column 'A'.
Original Dataset:
index A B C
0 6 nan nan
1 6 nan nan
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 3 nan nan
8 4 nan nan
Output:
index A B C
0 nan nan 6
1 nan nan 6
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 nan nan 3
8 nan nan 4
Below is what I have tried so far, but it's not working.
def impute_unit(cols):
    Legal_Block = cols[0]
    Legal_Lot = cols[1]
    Legal_Unit = cols[2]
    if pd.isnull(Legal_Lot):
        return 3
    else:
        return Legal_Unit
bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
                                            'Legal_Unit']].apply(impute_unit, axis = 1)
Seems like you need
df['C'] = np.where(df.B.isna(), df.A, df.C)
df['A'] = np.where(df.B.isna(), np.nan, df.A)
A different, maybe fancy way to do it would be to swap A and C values only when B is np.nan
m = df.B.isna()
df.loc[m, ['A', 'C']] = df.loc[m, ['C', 'A']].values
In other words, change
bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
'Legal_Unit']].apply(impute_unit, axis = 1)
for
bk_Final_tax['Legal_Unit'] = np.where(bk_Final_tax.Legal_Lot.isna(), bk_Final_tax.Legal_Block, bk_Final_tax.Legal_Unit)
bk_Final_tax['Legal_Block'] = np.where(bk_Final_tax.Legal_Lot.isna(), np.nan, bk_Final_tax.Legal_Block)
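For reference, here is the np.where approach run end to end on the small A/B/C dataset from the question. It is a sketch with the sample values typed in by hand; the mask name m follows the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [6, 6, 9, 9, 2, 2, 3, 3, 4],
    "B": [np.nan, np.nan, 3, 3, 8, 8, 4, np.nan, np.nan],
    "C": np.nan,
})

# Rows where B is missing are the ones whose A value moves to C.
m = df["B"].isna()

# Fill C first, while A still holds the original values, then blank A.
df["C"] = np.where(m, df["A"], df["C"])
df["A"] = np.where(m, np.nan, df["A"])
print(df)
```

The order of the two assignments matters: if A were cleared first, there would be nothing left to copy into C.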
