Concatenate 2 dfs by a condition - python-3.x

I have 2 dfs
import pandas as pd
list_columns = ['Number', 'Name', 'Age']
list_data = [
[121, 'John', 25],
[122, 'Sam', 26]
]
df1 = pd.DataFrame(columns=list_columns, data=list_data)
Number Name Age
0 121 John 25
1 122 Sam 26
and
list_columns = ['Number', 'Name', 'Age']
list_data = [
[121, 'John', 31],
[122, 'Sam', 29],
[123, 'Andrew', 28]
]
df2 = pd.DataFrame(columns=list_columns, data=list_data)
Number Name Age
0 121 John 31
1 122 Sam 29
2 123 Andrew 28
In the end I want to take the missing values from df2 and add them to df1 based on the column Number.
In the above case df1 is missing only Number 123, and I want to move only the data from that line into df1, so it will look like:
|Number|Name  |Age|
|121   |John  |25 |
|122   |Sam   |26 |
|123   |Andrew|28 |
I tried to use concat with keep='first', but I am afraid that with a lot of data it will alter the existing data in df1 (I want to add only missing data based on Number).
Is there a better way of achieving this?
This is how I tried the concat:
pd.concat([df1, df2]).drop_duplicates(['Number'], keep='first')

Use DataFrame.set_index on df1 and df2 to set the index as column Number and use DataFrame.combine_first:
df = (
    df1.set_index('Number').combine_first(
        df2.set_index('Number')
    ).reset_index()
)
Result:
Number Name Age
0 121 John 25.0
1 122 Sam 26.0
2 123 Andrew 28.0
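Note that combine_first upcasts Age to float, because aligning the two frames introduces NaNs before they are filled in. If integer ages matter, a cast at the end restores them (a minimal sketch, assuming no NaNs remain after combining):

# restore the integer dtype after combining (sketch)
df['Age'] = df['Age'].astype(int)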

Related

Pandas: Merging rows into one

I have the following table:
Name  Age  Data_1  Data_2
Tom   10   Test
Tom   10           Foo
Anne  20           Bar
How can I merge these rows to get this output:

Name  Age  Data_1  Data_2
Tom   10   Test    Foo
Anne  20           Bar
I tried this code (and some other variants: agg, groupby on other fields, et cetera):
import pandas as pd
data = [['tom', 10, 'Test', ''], ['tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df = df.groupby("Name").sum()
print(df)
But I only get something like this:

      Age Data_1 Data_2
Name
Anne   20           Bar
tom    20   Test    Foo
Just a groupby and a sum will do: grouping by both Name and Age keeps the age as a key instead of summing it, and summing object columns concatenates the strings.
df.groupby(['Name', 'Age']).sum().reset_index()
   Name  Age Data_1 Data_2
0  Anne   20           Bar
1   tom   10   Test    Foo
Use this if the empty cells are NaN :
(df.set_index(['Name', 'Age'])
   .stack()
   .groupby(level=[0, 1, 2])
   .apply(''.join)
   .unstack()
   .reset_index()
)
Otherwise, add the line df.replace('', np.nan, inplace=True) before the code above (with numpy imported as np).
# Output
Name Age Data_1 Data_2
0 Anne 20 NaN Bar
1 Tom 10 Test Foo
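For reference, a self-contained run of the empty-string case (a sketch; the data matches the question above):

import numpy as np
import pandas as pd

data = [['Tom', 10, 'Test', ''], ['Tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])

# turn empty strings into NaN so stack() drops them
df.replace('', np.nan, inplace=True)
out = (df.set_index(['Name', 'Age'])
         .stack()
         .groupby(level=[0, 1, 2])
         .apply(''.join)
         .unstack()
         .reset_index())
print(out)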

Getting columns by list of substring values

I have the dataframe shown below. I have a lot of data and want to create different dataframes from substring values of the columns.
df
ID    ex_srr123  ex2_srr124  ex3_srr125  ex4_srr1234  ex23_srr5323
san   12         43          0           34           0
mat   53         0           34          76           656
jon   82         223         23          32           21
jack  0          12          2           0            0
I have lists of column substrings:
coln1 = ['srr123', 'srr124']
coln2 = ['srr1234', 'srr5323']
I want:
df2 =
ID    ex_srr123  ex2_srr12
san   12         43
mat   53         0
jon   82         223
jack  0          12
I tried
df2 = df[coln1]
but I didn't get what I wanted. Please help me get the desired output.
Statically
df2 = df.filter(regex="srr123$|srr124$").copy()
Dynamically
coln1 = ['srr123', 'srr124']
df2 = df.filter(regex=f"{coln1[0]}$|{coln1[1]}$").copy()
The $ signifies the end of the string, so that the column ex4_srr1234 isn't also included in your result.
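For a substring list of any length, the same anchored pattern can be built with a join (a sketch):

# build "srr123$|srr124$" from the list, then filter on it
pattern = "$|".join(coln1) + "$"
df2 = df.filter(regex=pattern).copy()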
Look into the filter method
df.filter(regex="srr123|srr124").copy()
I am making a few assumptions:
'ID' is a column and not the index.
The third column in df2 should read 'ex2_srr124' instead of 'ex2_srr12'.
You do not want to include columns of 'df' in 'df2' if the substring does not match everything after the underscore (since 'srr123' is a substring of 'ex4_srr1234' but you did not include it in 'df2').
# set the provided data frame
df = pd.DataFrame([['san', 12, 43, 0, 34, 0],
                   ['mat', 53, 0, 34, 76, 656],
                   ['jon', 82, 223, 23, 32, 21],
                   ['jack', 0, 12, 2, 0, 0]],
                  columns=['ID', 'ex_srr123', 'ex2_srr124', 'ex3_srr125', 'ex4_srr1234', 'ex23_srr5323'])

# set the lists of column-substrings
coln1 = ['srr123', 'srr124']
coln2 = ['srr1234', 'srr5323']
I suggest solving this as follows:
# create df2 and add the ID column
df2 = pd.DataFrame()
df2['ID'] = df['ID']

# iterate over each substring in the list of column-substrings
for substring in coln1:
    # iterate over each column name in the df columns
    for column_name in df.columns.values:
        # check if the column name ends with the substring
        if substring == column_name[-len(substring):]:
            # assign the matching column to df2
            df2[column_name] = df[column_name]
This yields the desired dataframe df2:
ID ex_srr123 ex2_srr124
0 san 12 43
1 mat 53 0
2 jon 82 223
3 jack 0 12
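An equivalent, more idiomatic sketch of the same loop uses str.endswith with a list comprehension:

# keep 'ID' plus any column whose name ends with one of the substrings
cols = ['ID'] + [c for c in df.columns if any(c.endswith(s) for s in coln1)]
df2 = df[cols].copy()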
df.filter(regex='|'.join(['ID'] + [col + '$' for col in coln1])).copy()
ID ex_srr123 ex2_srr124
0 san 12 43
1 mat 53 0
2 jon 82 223
3 jack 0 12

pandas move to correspondent column based on value of other column

I'm trying to move the f1_am, f2_am, f3_am values to the corresponding columns based on the values of f1_ty, f2_ty, f3_ty.
I started adding new columns to the dataframe based on unique values from the _ty columns using sets, but I'm trying to figure out how to move the _am values to where they belong.
I looked at the groupby and pivot options, but the result exploded my mind...
I would appreciate some guidance.
Below the code.
import pandas as pd
import numpy as np
data = {
    'mem_id': ['A', 'B', 'C', 'A', 'B', 'C'],
    'date_inf': ['01/01/2019', '01/01/2019', '01/01/2019', '02/01/2019', '02/01/2019', '02/01/2019'],
    'f1_ty': ['ABC', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI'],
    'f1_am': [100, 20, 57, 44, 15, 10],
    'f2_ty': ['DEF', 'DEF', 'DEF', 'GHI', 'ABC', 'XYZ'],
    'f2_am': [20, 30, 45, 66, 14, 21],
    'f3_ty': ['XYZ', 'GHI', 'OPQ', 'OPQ', 'XYZ', 'DEF'],
    'f3_am': [20, 30, 45, 66, 14, 21]
}
df = pd.DataFrame(data)
# distinct values in the _ty columns using sets
distinct_values = sorted(set(df['f1_ty']) | set(df['f2_ty']) | set(df['f3_ty']))

# add the distinct values as new columns in the DataFrame
new_df = df.reindex(columns=np.append(df.columns.values, distinct_values))
So this would be my starting point for building the desired result.
Here is a try (thanks for the interesting problem): rename the columns to make them compatible with wide_to_long(), then unstack() while dropping the extra levels:
m = df.set_index(['mem_id', 'date_inf']).rename(columns=lambda x: ''.join(x.split('_')[::-1]))
n = (pd.wide_to_long(m.reset_index(), ['tyf', 'amf'], ['mem_id', 'date_inf'], 'v')
       .droplevel(-1).set_index('tyf', append=True).unstack(fill_value=0).reindex(m.index))
final = n.droplevel(0, axis=1).rename_axis(None, axis=1).reset_index()
print(final)
mem_id date_inf ABC DEF GHI OPQ XYZ
0 A 01/01/2019 100 20 0 0 20
1 B 01/01/2019 20 30 30 0 0
2 C 01/01/2019 57 45 0 45 0
3 A 02/01/2019 44 0 66 66 0
4 B 02/01/2019 14 0 15 0 14
5 C 02/01/2019 0 21 10 0 21
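A simpler alternative sketch with the same result (modulo row order, which pivot_table sorts by the keys) stacks the ty/am pairs into long form and pivots, assuming each fN_ty/fN_am pair lines up as in the data above:

pairs = [('f1_ty', 'f1_am'), ('f2_ty', 'f2_am'), ('f3_ty', 'f3_am')]
# stack each ty/am pair into one long frame with common column names
long_df = pd.concat(
    df[['mem_id', 'date_inf', ty, am]].rename(columns={ty: 'ty', am: 'am'})
    for ty, am in pairs
)
# spread the ty values into columns, summing the am values, 0 where absent
result = (long_df.pivot_table(index=['mem_id', 'date_inf'], columns='ty',
                              values='am', aggfunc='sum', fill_value=0)
                 .reset_index())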

Converge columns in dataframe to single column in specific order

Python newb here. I have two columns in a dataframe; we'll call them dat1 and dat2:
dat1 dat2
0 123 20
1 456 30
2 789 10
3 123 10
4 456 20
5 789 30
I need to convert that into a single column like so:
10
789
123
20
123
456
30
456
789
or in terms of columns, [dat2,dat1,dat1,dat2,dat1,dat1,dat2,dat1,dat1]
I made up this terrible code:
mylist = []
unique = dp['dat2'].unique()
for each in unique:
    mylist.append(each)
    for x in dp:
        mylist.append(dp[dp['dat2'] == each])
and I get the output below:
20
dat1 dat2
0 123 20
4 456 20
dat1 dat2
0 123 20
4 456 20
30
dat1 dat2
1 456 30
5 789 30
dat1 dat2
1 456 30
5 789 30
10
dat1 dat2
2 789 10
3 123 10
dat1 dat2
2 789 10
3 123 10
I'm basically trying to replicate the function of a pivot table in Excel. Any help would be really appreciated.
Thanks
# sort the values by the second column
dp = dp.sort_values(by='dat2')

# create a list which will collect the results
my_data = []

# loop over the unique values of the 2nd column
for d2 in dp.dat2.unique():
    # insert the value into the list
    my_data.append(d2)
    # grep the dat1 data from the table, where dat2 == d2
    for i in dp.dat1[dp.dat2 == d2]:
        my_data.append(i)

my_data
[10, 789, 123, 20, 123, 456, 30, 456, 789]
Using pd.concat you can concatenate the dataframe columns:
import pandas as pd

df = {'id1': [11, 22, 33], 'id2': [77, 88, 99]}
df = pd.DataFrame(df)
print(pd.concat([df['id1'], df['id2']]))
Since it looks like you are grouping the values based on column dat2 and adding elements from dat1 after each dat2 element, I would use DataFrame.groupby:
import pandas as pd

dat1 = [123, 456, 789, 123, 456, 789]
dat2 = [20, 30, 10, 10, 20, 30]
df = pd.DataFrame(list(zip(dat1, dat2)), columns=['dat1', 'dat2'])

grouped = df.groupby(['dat2']).agg({'dat1': list}).reset_index()
dict_list = grouped.to_dict('records')

new_data_list = []
for single_dict in dict_list:
    new_data_list.append(single_dict['dat2'])
    new_data_list += single_dict['dat1']

print(new_data_list)
Result:
[10, 789, 123, 20, 123, 456, 30, 456, 789]
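A compact variant of the same flattening step (a sketch; it reuses grouped from above):

from itertools import chain

new_data_list = list(chain.from_iterable(
    [row['dat2']] + row['dat1'] for _, row in grouped.iterrows()
))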

Add two more columns to csv file based on matching values of other csv

I have two csv files, csv1 and csv2. What I need to do is:
Get each value of column c of the csv1 file and match it with the number column of csv2.
If any row of csv2 matches that number, add a new column c_text to csv1 containing the value of the text column for the matching row of csv2.
Repeat the above process for column d of csv1, adding a new column d_text to csv1.
Sample data and the expected result are shown in the answers below.
I am new to pandas. How can I do this using pandas?
You can use apply():
csv1['c_text'] = csv1['c'].apply(lambda x: csv2[csv2['number'] == x]['text'].values[0])
csv1['d_text'] = csv1['d'].apply(lambda x: csv2[csv2['number'] == x]['text'].values[0])
Yields:
a b c d c_text d_text
0 1 4 101 201 val1 val4
1 2 5 105 202 val2 val5
2 3 6 107 203 val3 val6
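One caveat: .values[0] raises an IndexError if a value in c or d has no match in csv2. A map-based sketch sidesteps that (unmatched values become NaN), assuming the values in csv2['number'] are unique:

# build a number -> text lookup and map both columns through it
lookup = csv2.set_index('number')['text']
csv1['c_text'] = csv1['c'].map(lookup)
csv1['d_text'] = csv1['d'].map(lookup)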
In terms of an option using merge(), this will yield the same output:
csv1 = csv1.merge(csv2, left_on='c', right_on='number', how='left')
csv1 = csv1.merge(csv2, left_on='d', right_on='number', how='left')
csv1 = csv1.rename(columns={'text_x': 'c_text', 'text_y': 'd_text'})[['a','b','c','d','c_text','d_text']]
Here's something that will do the trick:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [101, 105, 107], 'd': [201, 202, 203]})
df2 = pd.DataFrame({'number': [101, 105, 107, 201, 202, 203, 205, 2010, 310],
                    'text': ["val_{x}".format(x=y + 1) for y in range(9)]})
df1
a b c d
0 1 4 101 201
1 2 5 105 202
2 3 6 107 203
df2
number text
0 101 val_1
1 105 val_2
2 107 val_3
3 201 val_4
4 202 val_5
5 203 val_6
6 205 val_7
7 2010 val_8
8 310 val_9
merged = df1.merge(df2, left_on='c', right_on='number', how='left')
merged
a b c d number text
0 1 4 101 201 101 val_1
1 2 5 105 202 105 val_2
2 3 6 107 203 107 val_3
output = merged.merge(df2, left_on='d', right_on='number', how='left')[['a', 'b', 'c', 'd', 'text_x', 'text_y']]
output
a b c d text_x text_y
0 1 4 101 201 val_1 val_4
1 2 5 105 202 val_2 val_5
2 3 6 107 203 val_3 val_6
What you want is the merge functionality of Pandas. Assuming you have imported the Pandas module with the shorthand name like import pandas as pd, then:
csv1_with_text_col = pd.merge(csv1, csv2, left_on='c', right_on='number', how='left')
This will give you a new dataframe, csv1_with_text_col, with the columns from csv2 merged into csv1 where csv1['c'] == csv2['number']. Additionally, by specifying how='left', only rows from the left dataframe, csv1, will be kept.
You can then merge this new dataframe, csv1_with_text_col, with csv2 again but with left_on='d'.
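A sketch of that second merge (the suffixes are a choice here, not from the original answer; the final names match the c_text/d_text columns requested in the question):

both = pd.merge(csv1_with_text_col, csv2,
                left_on='d', right_on='number', how='left',
                suffixes=('_c', '_d'))
# rename the two text columns and drop the helper number columns
both = (both.rename(columns={'text_c': 'c_text', 'text_d': 'd_text'})
            .drop(columns=['number_c', 'number_d']))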
