Concatenate 2 dfs by a condition - python-3.x

I have 2 dfs
import pandas as pd
list_columns = ['Number', 'Name', 'Age']
list_data = [
[121, 'John', 25],
[122, 'Sam', 26]
]
df1 = pd.DataFrame(columns=list_columns, data=list_data)
Number Name Age
0 121 John 25
1 122 Sam 26
and
list_columns = ['Number', 'Name', 'Age']
list_data = [
[121, 'John', 31],
[122, 'Sam', 29],
[123, 'Andrew', 28]
]
df2 = pd.DataFrame(columns=list_columns, data=list_data)
Number Name Age
0 121 John 31
1 122 Sam 29
2 123 Andrew 28
In the end I want to take the missing values from df2 and add them to df1 based on the column Number.
In the above case df1 is missing only Number 123, and I want to move only the data from that line into df1, so it will look like:
|Number|Name  |Age|
|121   |John  |25 |
|122   |Sam   |26 |
|123   |Andrew|28 |
I tried to use concat with keep='first', but I am afraid that with a lot of data it will alter the existing data in df1 (I want to add only missing data based on Number).
Is there a better way of achieving this?
This is how I tried the concat:
pd.concat([df1, df2]).drop_duplicates(['Number'], keep='first')

Use DataFrame.set_index on df1 and df2 to set the index as column Number and use DataFrame.combine_first:
df = (
    df1.set_index('Number').combine_first(
        df2.set_index('Number')
    ).reset_index()
)
Result:
Number Name Age
0 121 John 25.0
1 122 Sam 26.0
2 123 Andrew 28.0
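Note that combine_first upcasts Age to float, because aligning the two frames introduces NaNs before they are filled in. If integer ages matter, a cast at the end restores them (a minimal sketch, assuming no NaNs remain after combining):

# restore the integer dtype after combining (sketch)
df['Age'] = df['Age'].astype(int)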

Related

Pandas: Merging rows into one

I have the following table:
Name  Age  Data_1  Data_2
Tom   10   Test
Tom   10           Foo
Anne  20           Bar
How can I merge these rows to get this output:

Name  Age  Data_1  Data_2
Tom   10   Test    Foo
Anne  20           Bar
I tried this code (and some other variants: agg, groupby on other fields, et cetera):
import pandas as pd
data = [['tom', 10, 'Test', ''], ['tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df = df.groupby("Name").sum()
print(df)
But I only get something like this:

      Age Data_1 Data_2
Name
Anne   20           Bar
tom    20   Test    Foo
Just a groupby and a sum will do: grouping by both Name and Age keeps the age as a key instead of summing it, and summing object columns concatenates the strings.
df.groupby(['Name', 'Age']).sum().reset_index()
   Name  Age Data_1 Data_2
0  Anne   20           Bar
1   tom   10   Test    Foo
Use this if the empty cells are NaN :
(df.set_index(['Name', 'Age'])
   .stack()
   .groupby(level=[0, 1, 2])
   .apply(''.join)
   .unstack()
   .reset_index()
)
Otherwise, add the line df.replace('', np.nan, inplace=True) before the code above (with numpy imported as np).
# Output
Name Age Data_1 Data_2
0 Anne 20 NaN Bar
1 Tom 10 Test Foo
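For reference, a self-contained run of the empty-string case (a sketch; the data matches the question above):

import numpy as np
import pandas as pd

data = [['Tom', 10, 'Test', ''], ['Tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])

# turn empty strings into NaN so stack() drops them
df.replace('', np.nan, inplace=True)
out = (df.set_index(['Name', 'Age'])
         .stack()
         .groupby(level=[0, 1, 2])
         .apply(''.join)
         .unstack()
         .reset_index())
print(out)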

Getting columns by list of substring values

I have the dataframe shown below. I have a lot of data and want to create different dataframes from substring values of the columns.
df
ID    ex_srr123  ex2_srr124  ex3_srr125  ex4_srr1234  ex23_srr5323
san   12         43          0           34           0
mat   53         0           34          76           656
jon   82         223         23          32           21
jack  0          12          2           0            0
I have lists of column substrings:
coln1 = ['srr123', 'srr124']
coln2 = ['srr1234', 'srr5323']
I want:
df2 =
ID    ex_srr123  ex2_srr12
san   12         43
mat   53         0
jon   82         223
jack  0          12
I tried
df2 = df[coln1]
but I didn't get what I wanted. Please help me get the desired output.
Statically
df2 = df.filter(regex="srr123$|srr124$").copy()
Dynamically
coln1 = ['srr123', 'srr124']
df2 = df.filter(regex=f"{coln1[0]}$|{coln1[1]}$").copy()
The $ signifies the end of the string, so that the column ex4_srr1234 isn't also included in your result.
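For a substring list of any length, the same anchored pattern can be built with a join (a sketch):

# build "srr123$|srr124$" from the list, then filter on it
pattern = "$|".join(coln1) + "$"
df2 = df.filter(regex=pattern).copy()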
Look into the filter method
df.filter(regex="srr123|srr124").copy()
I am making a few assumptions:
'ID' is a column and not the index.
The third column in df2 should read 'ex2_srr124' instead of 'ex2_srr12'.
You do not want to include columns of 'df' in 'df2' if the substring does not match everything after the underscore (since 'srr123' is a substring of 'ex4_srr1234' but you did not include it in 'df2').
# set the provided data frame
df = pd.DataFrame([['san', 12, 43, 0, 34, 0],
                   ['mat', 53, 0, 34, 76, 656],
                   ['jon', 82, 223, 23, 32, 21],
                   ['jack', 0, 12, 2, 0, 0]],
                  columns=['ID', 'ex_srr123', 'ex2_srr124', 'ex3_srr125', 'ex4_srr1234', 'ex23_srr5323'])

# set the lists of column-substrings
coln1 = ['srr123', 'srr124']
coln2 = ['srr1234', 'srr5323']
I suggest solving this as follows:
# create df2 and add the ID column
df2 = pd.DataFrame()
df2['ID'] = df['ID']

# iterate over each substring in the list of column-substrings
for substring in coln1:
    # iterate over each column name in the df columns
    for column_name in df.columns.values:
        # check if the column name ends with the substring
        if substring == column_name[-len(substring):]:
            # assign the matching column to df2
            df2[column_name] = df[column_name]
This yields the desired dataframe df2:
ID ex_srr123 ex2_srr124
0 san 12 43
1 mat 53 0
2 jon 82 223
3 jack 0 12
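An equivalent, more idiomatic sketch of the same loop uses str.endswith with a list comprehension:

# keep 'ID' plus any column whose name ends with one of the substrings
cols = ['ID'] + [c for c in df.columns if any(c.endswith(s) for s in coln1)]
df2 = df[cols].copy()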
df.filter(regex='|'.join(['ID'] + [col + '$' for col in coln1])).copy()
ID ex_srr123 ex2_srr124
0 san 12 43
1 mat 53 0
2 jon 82 223
3 jack 0 12

pandas move to correspondent column based on value of other column

I'm trying to move the f1_am, f2_am, f3_am values to the corresponding columns based on the values of f1_ty, f2_ty, f3_ty.
I started adding new columns to the dataframe based on unique values from the _ty columns using sets, but I'm trying to figure out how to move the _am values to where they belong.
I looked at the groupby and pivot options, but the result exploded my mind...
I would appreciate some guidance.
Below the code.
import pandas as pd
import numpy as np
data = {
    'mem_id': ['A', 'B', 'C', 'A', 'B', 'C'],
    'date_inf': ['01/01/2019', '01/01/2019', '01/01/2019', '02/01/2019', '02/01/2019', '02/01/2019'],
    'f1_ty': ['ABC', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI'],
    'f1_am': [100, 20, 57, 44, 15, 10],
    'f2_ty': ['DEF', 'DEF', 'DEF', 'GHI', 'ABC', 'XYZ'],
    'f2_am': [20, 30, 45, 66, 14, 21],
    'f3_ty': ['XYZ', 'GHI', 'OPQ', 'OPQ', 'XYZ', 'DEF'],
    'f3_am': [20, 30, 45, 66, 14, 21]
}
df = pd.DataFrame(data)
# distinct values in the _ty columns using sets
distinct_values = sorted(set(df['f1_ty']) | set(df['f2_ty']) | set(df['f3_ty']))

# add the distinct values as new columns in the DataFrame
new_df = df.reindex(columns=np.append(df.columns.values, distinct_values))
So this would be my starting point for building the desired result.
Here is a try (thanks for the interesting problem): rename the columns to make them compatible with wide_to_long(), then unstack() while dropping the extra levels:
m = df.set_index(['mem_id', 'date_inf']).rename(columns=lambda x: ''.join(x.split('_')[::-1]))
n = (pd.wide_to_long(m.reset_index(), ['tyf', 'amf'], ['mem_id', 'date_inf'], 'v')
       .droplevel(-1).set_index('tyf', append=True).unstack(fill_value=0).reindex(m.index))
final = n.droplevel(0, axis=1).rename_axis(None, axis=1).reset_index()
print(final)
mem_id date_inf ABC DEF GHI OPQ XYZ
0 A 01/01/2019 100 20 0 0 20
1 B 01/01/2019 20 30 30 0 0
2 C 01/01/2019 57 45 0 45 0
3 A 02/01/2019 44 0 66 66 0
4 B 02/01/2019 14 0 15 0 14
5 C 02/01/2019 0 21 10 0 21
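A simpler alternative sketch with the same result (modulo row order, which pivot_table sorts by the keys) stacks the ty/am pairs into long form and pivots, assuming each fN_ty/fN_am pair lines up as in the data above:

pairs = [('f1_ty', 'f1_am'), ('f2_ty', 'f2_am'), ('f3_ty', 'f3_am')]
# stack each ty/am pair into one long frame with common column names
long_df = pd.concat(
    df[['mem_id', 'date_inf', ty, am]].rename(columns={ty: 'ty', am: 'am'})
    for ty, am in pairs
)
# spread the ty values into columns, summing the am values, 0 where absent
result = (long_df.pivot_table(index=['mem_id', 'date_inf'], columns='ty',
                              values='am', aggfunc='sum', fill_value=0)
                 .reset_index())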

Converge columns in dataframe to single column in specific order

Python newb here. I have two columns in a dataframe; we'll call them dat1 and dat2:
dat1 dat2
0 123 20
1 456 30
2 789 10
3 123 10
4 456 20
5 789 30
I need to convert that into a single column like so:
10
789
123
20
123
456
30
456
789
or in terms of columns, [dat2,dat1,dat1,dat2,dat1,dat1,dat2,dat1,dat1]
I made up this terrible code:
mylist = []
unique = dp['dat2'].unique()
for each in unique:
    mylist.append(each)
    for x in dp:
        mylist.append(dp[dp['dat2'] == each])
and I get the output below:
20
dat1 dat2
0 123 20
4 456 20
dat1 dat2
0 123 20
4 456 20
30
dat1 dat2
1 456 30
5 789 30
dat1 dat2
1 456 30
5 789 30
10
dat1 dat2
2 789 10
3 123 10
dat1 dat2
2 789 10
3 123 10
I'm basically trying to replicate the function of a pivot table in Excel. Any help would be really appreciated.
Thanks
# sort the values by the second column
dp = dp.sort_values(by='dat2')

# create a list which will collect the results
my_data = []

# loop over the unique values of the 2nd column
for d2 in dp.dat2.unique():
    # insert the value into the list
    my_data.append(d2)
    # grep the dat1 data from the table, where dat2 == d2
    for i in dp.dat1[dp.dat2 == d2]:
        my_data.append(i)

my_data
[10, 789, 123, 20, 123, 456, 30, 456, 789]
Using pd.concat you can concatenate the dataframe columns:
import pandas as pd

df = {'id1': [11, 22, 33], 'id2': [77, 88, 99]}
df = pd.DataFrame(df)
print(pd.concat([df['id1'], df['id2']]))
Since it looks like you are grouping the values based on column dat2 and adding elements from dat1 after each dat2 element, I would use DataFrame.groupby:
import pandas as pd

dat1 = [123, 456, 789, 123, 456, 789]
dat2 = [20, 30, 10, 10, 20, 30]
df = pd.DataFrame(list(zip(dat1, dat2)), columns=['dat1', 'dat2'])

grouped = df.groupby(['dat2']).agg({'dat1': list}).reset_index()
dict_list = grouped.to_dict('records')

new_data_list = []
for single_dict in dict_list:
    new_data_list.append(single_dict['dat2'])
    new_data_list += single_dict['dat1']

print(new_data_list)
Result:
[10, 789, 123, 20, 123, 456, 30, 456, 789]
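A compact variant of the same flattening step (a sketch; it reuses grouped from above):

from itertools import chain

new_data_list = list(chain.from_iterable(
    [row['dat2']] + row['dat1'] for _, row in grouped.iterrows()
))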

Add two more columns to csv file based on matching values of other csv

I have two csv files, csv1 and csv2. What I need to do is:
Get each value of column c of the csv1 file and match it with the number column of csv2.
If any row of csv2 matches that number, add a new column c_text to csv1 containing the value of the text column for the matching row of csv2.
Repeat the above process for column d of csv1, adding a new column d_text to csv1.
Sample data and the expected result are shown in the answers below.
I am new to pandas. How can I do this using pandas?
You can use apply():
csv1['c_text'] = csv1['c'].apply(lambda x: csv2[csv2['number'] == x]['text'].values[0])
csv1['d_text'] = csv1['d'].apply(lambda x: csv2[csv2['number'] == x]['text'].values[0])
Yields:
a b c d c_text d_text
0 1 4 101 201 val1 val4
1 2 5 105 202 val2 val5
2 3 6 107 203 val3 val6
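One caveat: .values[0] raises an IndexError if a value in c or d has no match in csv2. A map-based sketch sidesteps that (unmatched values become NaN), assuming the values in csv2['number'] are unique:

# build a number -> text lookup and map both columns through it
lookup = csv2.set_index('number')['text']
csv1['c_text'] = csv1['c'].map(lookup)
csv1['d_text'] = csv1['d'].map(lookup)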
In terms of an option using merge(), this will yield the same output:
csv1 = csv1.merge(csv2, left_on='c', right_on='number', how='left')
csv1 = csv1.merge(csv2, left_on='d', right_on='number', how='left')
csv1 = csv1.rename(columns={'text_x': 'c_text', 'text_y': 'd_text'})[['a','b','c','d','c_text','d_text']]
Here's something that will do the trick:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [101, 105, 107], 'd': [201, 202, 203]})
df2 = pd.DataFrame({'number': [101, 105, 107, 201, 202, 203, 205, 2010, 310],
                    'text': ["val_{x}".format(x=y + 1) for y in range(9)]})
df1
a b c d
0 1 4 101 201
1 2 5 105 202
2 3 6 107 203
df2
number text
0 101 val_1
1 105 val_2
2 107 val_3
3 201 val_4
4 202 val_5
5 203 val_6
6 205 val_7
7 2010 val_8
8 310 val_9
merged = df1.merge(df2, left_on='c', right_on='number', how='left')
merged
a b c d number text
0 1 4 101 201 101 val_1
1 2 5 105 202 105 val_2
2 3 6 107 203 107 val_3
output = merged.merge(df2, left_on='d', right_on='number', how='left')[['a', 'b', 'c', 'd', 'text_x', 'text_y']]
output
a b c d text_x text_y
0 1 4 101 201 val_1 val_4
1 2 5 105 202 val_2 val_5
2 3 6 107 203 val_3 val_6
What you want is the merge functionality of Pandas. Assuming you have imported the Pandas module with the shorthand name like import pandas as pd, then:
csv1_with_text_col = pd.merge(csv1, csv2, left_on='c', right_on='number', how='left')
This will give you a new dataframe, csv1_with_text_col, with the columns from csv2 merged into csv1 where csv1['c'] == csv2['number']. Additionally, by specifying how='left', only rows from the left dataframe, csv1, will be kept.
You can then merge this new dataframe, csv1_with_text_col, with csv2 again but with left_on='d'.
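A sketch of that second merge (the suffixes are a choice here, not from the original answer; the final names match the c_text/d_text columns requested in the question):

both = pd.merge(csv1_with_text_col, csv2,
                left_on='d', right_on='number', how='left',
                suffixes=('_c', '_d'))
# rename the two text columns and drop the helper number columns
both = (both.rename(columns={'text_c': 'c_text', 'text_d': 'd_text'})
            .drop(columns=['number_c', 'number_d']))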
