Python Pandas Merge data from different Dataframes on specific index and create new one


My code is given below. I have two data frames, a and b. I want to create a new data frame c by merging the rows at specific index positions from a and b.
import pandas as pd
a = [10,20,30,40,50,60]
b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
a = pd.DataFrame(a,columns=['Voltage'])
b = pd.DataFrame(b,columns=['Current'])
c = pd.merge(a,b,left_index=True, right_index=True)
print(c)
The actual output is:
   Voltage  Current
0       10      0.1
1       20      0.2
2       30      0.3
3       40      0.4
4       50      0.5
5       60      0.6
I don't want all the rows, only the rows at specific index positions, something like:
c =
   Voltage  Current
0       30      0.3
1       40      0.4
How can I modify c = pd.merge(a,b,left_index=True, right_index=True) so that c contains only the third and fourth rows, with the new index order shown above?

Use iloc to select rows by position, and call reset_index with drop=True on both DataFrames to get a default index:
Solution 1, with concat:
c = pd.concat([a.iloc[2:4].reset_index(drop=True),
b.iloc[2:4].reset_index(drop=True)], axis=1)
Or use merge:
c = pd.merge(a.iloc[2:4].reset_index(drop=True),
b.iloc[2:4].reset_index(drop=True),
left_index=True,
right_index=True)
print(c)
Voltage Current
0 30 0.3
1 40 0.4
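To see why reset_index(drop=True) is needed, here is a minimal sketch that rebuilds the two frames from the question and compares the index before and after the reset. The names sliced and c are only illustrative; resetting once after the merge is a variation on the answer above, not a different technique.

```python
import pandas as pd

a = pd.DataFrame([10, 20, 30, 40, 50, 60], columns=['Voltage'])
b = pd.DataFrame([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], columns=['Current'])

# iloc[2:4] selects the third and fourth rows (the stop position is exclusive).
sliced = pd.merge(a.iloc[2:4], b.iloc[2:4], left_index=True, right_index=True)
print(sliced.index.tolist())  # the original labels 2 and 3 are kept

# Resetting the index after the merge relabels the rows 0 and 1.
c = sliced.reset_index(drop=True)
print(c)
```

Without the reset, the merged frame keeps the labels 2 and 3, which is why the answer resets the index to get 0 and 1.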

Related

Dataframe column value based on aggregation of several columns

Say I have a pandas dataframe as below:
A B C
1 4 0.1
2 3 0.5
4 1 0.7
5 2 0.2
7 5 0.6
I want to loop through the rows in the dataframe and, for each row, perform an aggregation on columns A and B as:
Agg = row[A] / (row[A] + row[B])
A B C Agg
1 4 0.1 0.2
2 3 0.5 0.4
4 1 0.7 0.8
5 2 0.2 0.7
7 5 0.6 0.6
For all values of Agg > 0.6, get their corresponding column C values into a list, i.e. 0.7 and 0.2 in this case.
Last step is to get the minimum of the list i.e. min(list) = 0.2 in this instance.

We could use vectorized operations: add for addition, rdiv for division (for A/(A+B)), gt for greater than comparison and loc for the filtering:
out = df.loc[df['A'].add(df['B']).rdiv(df['A']).gt(0.6), 'C'].min()
We could also derive the same result using query much more concisely:
out = df.query('A/(A+B)>0.6')['C'].min()
Output:
0.2
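As a self-contained check, the steps from the question (ratio, filter, list, minimum) can be sketched with plain operators. The frame is rebuilt from the question's table; ratio, mask and selected are illustrative names, and the unrounded ratio is used for the comparison.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 4, 5, 7],
                   'B': [4, 3, 1, 2, 5],
                   'C': [0.1, 0.5, 0.7, 0.2, 0.6]})

# Step 1: A / (A + B) for every row, computed column-wise.
ratio = df['A'] / (df['A'] + df['B'])

# Step 2: keep the C values of the rows whose ratio exceeds 0.6.
mask = ratio > 0.6
selected = df.loc[mask, 'C'].tolist()

# Step 3: minimum of that list (min raises ValueError if nothing matched).
out = min(selected)
print(selected, out)
```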

Instead of writing an explicit loop, you can define an aggregate function and apply it across all rows.
def aggregate(row):
    # Share of A in the row total A + B
    return row["A"] / (row["A"] + row["B"])
# Round to one decimal place to match the table in the question
df["Agg"] = df.apply(aggregate, axis=1).round(1)
df.loc[df["Agg"] > 0.6, "C"].min()
Output:
0.2

merge dataframes on multiple columns ignoring order

I have the following dataframes:
df1=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3]})
df2=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'X':[0.4,0.5,0.6]})
I would like to merge these two dataframes on fr and to, ignoring the order of fr and to, i.e., (2,5) is the same as (5,2). The desired output is:
dfO=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
or
dfO=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
I can do the following:
pd.merge(df1,df2,on=['fr','to'],how='left')
However, as expected, the X value of the second row is NaN.
Thank you for your help.

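You need to sort each (fr, to) pair with numpy first:
import numpy as np
# Sort the two key columns within each row so that (5, 2) becomes (2, 5)
df1[['fr','to']] = np.sort(df1[['fr','to']].values, axis=1)
df2[['fr','to']] = np.sort(df2[['fr','to']].values, axis=1)
out = df1.merge(df2, how='left')
out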
Out[44]:
fr to R X
0 1 4 0.1 0.4
1 2 5 0.2 0.5
2 3 6 0.3 0.6

You can create a temporary key column and then join on it:
df1['tmp'] = df1.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
df2['tmp'] = df2.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
Merging on that key gives the result you expect:
pd.merge(df1,df2[['tmp', 'X']],on=['tmp'], how='left').drop(columns=['tmp'])
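Both answers change df1 and df2 in place. If the original frames should stay untouched, one possible sketch is to normalise copies before merging. The helper normalize_pairs is a hypothetical name introduced here for illustration; it assumes the two key columns hold comparable values such as integers.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'fr': [1, 2, 3], 'to': [4, 5, 6], 'R': [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({'fr': [1, 5, 3], 'to': [4, 2, 6], 'X': [0.4, 0.5, 0.6]})

def normalize_pairs(df, cols=('fr', 'to')):
    """Return a copy in which, for every row, cols[0] <= cols[1]."""
    out = df.copy()
    # Sort the two key values within each row, so (5, 2) becomes (2, 5).
    out[list(cols)] = np.sort(out[list(cols)].to_numpy(), axis=1)
    return out

merged = normalize_pairs(df1).merge(normalize_pairs(df2),
                                    on=['fr', 'to'], how='left')
print(merged)
```

Because the helper works on copies, df2 still holds the pair (5, 2) after the merge.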

Locate dataframe rows where values are outside bounds specified for each column

I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
[10, 1000, 1, 1000, 50])
# Example df
a b c d e
0 5 3 0.3 17 12
1 12 50 0.5 2 31
2 9 982 0.2 321 21
3 1 3 1.2 92 48
# Expected output with bounds given above
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad but still seems less than ideal):
for i, col in enumerate(df.columns):
    df = df.query(f'({bounds[0][i]} < {col}) & ({col} < {bounds[1][i]})')
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?

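One way is to use pandas.DataFrame.apply with pandas.Series.between:
# Map each column name to its (lower, upper) pair
bounds = dict(zip(df.columns, zip(*bounds)))
# between is inclusive on both ends by default
new_df = df[~df.apply(lambda x: ~x.between(*bounds[x.name])).any(axis=1)]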
print(new_df)
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
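Since the question asks for array operations, the same filter can also be sketched with NumPy broadcasting, with no loop over columns. This assumes all columns are numeric and that the bounds are listed in the same order as the columns; lower, upper and inside are illustrative names. The comparison is strict, as in the question's own attempt.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5, 12, 9, 1],
                   'b': [3, 50, 982, 3],
                   'c': [0.3, 0.5, 0.2, 1.2],
                   'd': [17, 2, 321, 92],
                   'e': [12, 31, 21, 48]})

# Row 0 of the array holds the lower bounds, row 1 the upper bounds.
lower, upper = np.array([[0.1, 1, 0.1, 5, 10],
                         [10, 1000, 1, 1000, 50]])

# Shape (n, k) compared against shape (k,): each column gets its own bound.
values = df.to_numpy()
inside = (values > lower) & (values < upper)

# Keep only the rows where every column is within its bounds.
new_df = df[inside.all(axis=1)]
print(new_df)
```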

pandas.series.split(' ',expand =True) With Column Names

I have a Pandas Data Frame with two string columns, which I would like to split on space, like this:
df =
A B
0.1 0.5 0.01 ... 0.3 0.1 0.4 ...
I would like to split both these columns and form new columns for as many values, which result out of the split.
So, the result:
df =
A1 A2 A3 ... B1 B2 B3
0.1 0.5 0.01 ... 0.3 0.1 0.4
Currently, I am doing:
df = df.join(df['A'].str.split(' ', expand=True))
df = df.join(df['B'].str.split(' ', expand=True))
But, I get the following error:
columns overlap but no suffix specified
I guess this is because the column names of the first and second split overlap?
So, my question is how to split multiple columns by providing column names or suffixes for multiple splits?

Use DataFrame.add_prefix to name the new columns after the column being split:
df = df.join(df['A'].str.split(expand = True).add_prefix('A'))
df = df.join(df['B'].str.split(expand = True).add_prefix('B'))
print (df)
              A            B   A0   A1    A2   B0   B1   B2
0  0.1 0.5 0.01  0.3 0.1 0.4  0.1  0.5  0.01  0.3  0.1  0.4
Another idea is to use a list comprehension:
cols = ['A','B']
df1 = pd.concat([df[c].str.split(expand=True).add_prefix(c) for c in cols], axis=1)
print (df1)
A0 A1 A2 B0 B1 B2
0 0.1 0.5 0.01 0.3 0.1 0.4
And to add all the original columns:
df = df.join(df1)
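add_prefix numbers the new columns from 0 (A0, A1, ...), while the question asks for A1, A2, A3. A minimal sketch that produces 1-based names and also converts the pieces to floats could look like this. It assumes every row splits into the same number of numeric tokens; parts and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['0.1 0.5 0.01'], 'B': ['0.3 0.1 0.4']})

parts = []
for col in ['A', 'B']:
    # With no separator given, str.split splits on any whitespace.
    split = df[col].str.split(expand=True)
    # Rename the default 0, 1, 2 labels to A1, A2, A3 (or B1, B2, B3).
    split.columns = [f'{col}{i + 1}' for i in range(split.shape[1])]
    # The pieces are still strings, so convert them to floats.
    parts.append(split.astype(float))

result = pd.concat(parts, axis=1)
print(result)
```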

Split pandas columns into two with column MultiIndex

I need to split DataFrame columns into two and add an additional value to the new column. The twist is that I need to lift the original column names up one level and add two new column names.
Given a DataFrame h:
>>> import pandas as pd
>>> h = pd.DataFrame({'a': [0.6, 0.4, 0.1], 'b': [0.2, 0.4, 0.7]})
>>> h
a b
0 0.6 0.2
1 0.4 0.4
2 0.1 0.7
I need to lift the original column names up one level and add two new column names. The result should look like this:
>>> # some stuff...
         a                 b
  expected received expected received
0      0.6        1      0.2        1
1      0.4        1      0.4        1
2      0.1        1      0.7        1
I've tried this:
>>> h['a1'] = [1, 1, 1]
>>> h['b1'] = [1, 1, 1]
>>> t = [('f', 'expected'),('f', 'received'), ('g', 'expected'), ('g', 'received')]
>>> h.columns = pd.MultiIndex.from_tuples(t)
>>> h
         f                 g
  expected received expected received
0      0.6      0.2        1        1
1      0.4      0.4        1        1
2      0.1      0.7        1        1
This just renames the columns but does not align them properly. I think the issue is there's no link between a1 and b1 to the expected and received columns.
How do I lift the original column names up one level and add two new column names?

I would use concat with keys, then swaplevel:
h1=h.copy()
h1[:]=1
pd.concat([h,h1],keys=['expected', 'received'],axis=1).\
swaplevel(0,1,axis=1).\
sort_index(level=0,axis=1)
Out[233]:
         a                 b
  expected received expected received
0      0.6      1.0      0.2      1.0
1      0.4      1.0      0.4      1.0
2      0.1      1.0      0.7      1.0
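An alternative sketch builds one small two-column frame per original column and concatenates them from a dict, so the dict keys become the top column level without swaplevel or sort_index. The names pieces and result are illustrative, and the constant 1 for received is taken from the question.

```python
import pandas as pd

h = pd.DataFrame({'a': [0.6, 0.4, 0.1], 'b': [0.2, 0.4, 0.7]})

pieces = {}
for name in h.columns:
    # Original values go under 'expected'; the scalar 1 is broadcast
    # over the index to fill 'received'.
    pieces[name] = pd.DataFrame({'expected': h[name], 'received': 1},
                                index=h.index)

# Passing a dict to concat uses its keys as the outer column level.
result = pd.concat(pieces, axis=1)
print(result)
```

Here received stays an integer column, whereas h1[:] = 1 in the answer above keeps the float dtype of h.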
