pandas.series.split(' ',expand =True) With Column Names

I have a pandas DataFrame with two string columns that I would like to split on spaces, like this:
df =
               A                B
0.1 0.5 0.01 ...  0.3 0.1 0.4 ...
I would like to split both columns and create one new column for each value that results from the split.
So, the result would be:
df =
A1  A2  A3   ...  B1  B2  B3
0.1 0.5 0.01 ...  0.3 0.1 0.4
Currently, I am doing:
df = df.join(df['A'].str.split(' ', expand=True))
df = df.join(df['B'].str.split(' ', expand=True))
But I get the following error:
columns overlap but no suffix specified
I guess this is because the column names produced by the first and second splits overlap?
So my question is: how can I split multiple columns while providing column names or suffixes for each split?

Use DataFrame.add_prefix to name the new columns after the column they were split from:
df = df.join(df['A'].str.split(expand = True).add_prefix('A'))
df = df.join(df['B'].str.split(expand = True).add_prefix('B'))
print (df)
              A            B   A0   A1    A2   B0   B1   B2
0  0.1 0.5 0.01  0.3 0.1 0.4  0.1  0.5  0.01  0.3  0.1  0.4
Another idea is to use a list comprehension:
cols = ['A','B']
df1 = pd.concat([df[c].str.split(expand=True).add_prefix(c) for c in cols], axis=1)
print (df1)
A0 A1 A2 B0 B1 B2
0 0.1 0.5 0.01 0.3 0.1 0.4
And to keep all the original columns as well:
df = df.join(df1)
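If the 1-based names from the question (A1, A2, ...) are wanted rather than the 0-based ones that add_prefix produces, the columns can be renamed explicitly. A minimal sketch, assuming a hypothetical helper named split_columns and a one-row sample built from the question's data:

```python
import pandas as pd

# One-row sample mirroring the question's data (values are strings).
df = pd.DataFrame({"A": ["0.1 0.5 0.01"], "B": ["0.3 0.1 0.4"]})


def split_columns(frame, cols):
    """Split each column in `cols` on whitespace and name the pieces
    <col>1, <col>2, ... (hypothetical helper, not part of pandas)."""
    parts = []
    for c in cols:
        part = frame[c].str.split(expand=True)
        # Replace the default 0, 1, 2, ... labels with 1-based names.
        part.columns = [f"{c}{i + 1}" for i in range(part.shape[1])]
        parts.append(part)
    return pd.concat(parts, axis=1)


result = split_columns(df, ["A", "B"])
print(result)
```

The pieces are still strings after the split; if every piece is numeric, result.astype(float) should convert them.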

Related

Dataframe column value based on aggregation of several columns

Say I have a pandas dataframe as below:
A B C
1 4 0.1
2 3 0.5
4 1 0.7
5 2 0.2
7 5 0.6
I want to loop through the rows in the dataframe and, for each row, perform an aggregation on columns A and B as:
Agg = row[A] / (row[A] + row[B])
which gives (rounded to one decimal place):
A B C Agg
1 4 0.1 0.2
2 3 0.5 0.4
4 1 0.7 0.8
5 2 0.2 0.7
7 5 0.6 0.6
For all values of Agg > 0.6, get their corresponding column C values into a list, i.e. 0.7 and 0.2 in this case.
Last step is to get the minimum of the list i.e. min(list) = 0.2 in this instance.

We could use vectorized operations: add for addition, rdiv for division (for A/(A+B)), gt for greater than comparison and loc for the filtering:
out = df.loc[df['A'].add(df['B']).rdiv(df['A']).gt(0.6), 'C'].min()
We could also derive the same result using query much more concisely:
out = df.query('A/(A+B)>0.6')['C'].min()
Output:
0.2
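For reference, here is a self-contained sketch of the same computation on the question's data, keeping the intermediate list the question asks for. The variable names (ratio, selected, via_query) are only illustrative:

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "A": [1, 2, 4, 5, 7],
    "B": [4, 3, 1, 2, 5],
    "C": [0.1, 0.5, 0.7, 0.2, 0.6],
})

# Vectorized A / (A + B); no explicit row loop is needed.
ratio = df["A"] / (df["A"] + df["B"])

# C values of the rows whose ratio exceeds 0.6, as a plain list.
selected = df.loc[ratio > 0.6, "C"].tolist()
result = min(selected)

# The query-based form should give the same minimum.
via_query = df.query("A / (A + B) > 0.6")["C"].min()

print(selected, result, via_query)
```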

Instead of iterating, you can define an aggregate function and apply it across all rows:
def aggregate(row):
    return row["A"] / (row["A"] + row["B"])
df["Agg"] = round(df.apply(aggregate, axis=1), 1)
df[df["Agg"] > 0.6]["C"].min()
Output:
0.2
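One caveat when rounding before the comparison: a ratio slightly above 0.6 can round down to 0.6 and drop out of the filter. For the question's data both approaches agree, but the sketch below uses made-up data (not from the question) where they differ:

```python
import pandas as pd

# Made-up data: 16 / (16 + 9) = 0.64 and 4 / (4 + 1) = 0.8.
df = pd.DataFrame({"A": [16, 4], "B": [9, 1], "C": [0.3, 0.7]})

ratio = df["A"] / (df["A"] + df["B"])

# Comparing the unrounded ratio keeps both rows.
exact_min = df.loc[ratio > 0.6, "C"].min()

# Rounding first turns 0.64 into 0.6, which no longer passes "> 0.6".
rounded_min = df.loc[ratio.round(1) > 0.6, "C"].min()

print(exact_min, rounded_min)
```

If the rounded column is only wanted for display, it may be safer to filter on the unrounded values and round afterwards.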

Pandas - interpolate over values in index

I have the following Pandas dataframe:
a0 a1 a2 a3
0.2 0.46 15.85 124.06 -380.04
0.4 0.21 28.20 -53.17 87.97
0.6 1.10 -5.55 167.76 -417.72
0.8 0.82 6.11 16.90 -70.86
1.0 1.00 0.00 0.00 0.00
Which is made by:
import pandas as pd
df = pd.DataFrame(data={'a0': [0.46,0.21,1.10,0.82,1],
                        'a1': [15.85,28.20,-5.55,6.11,0],
                        'a2': [124.06,-53.17,167.76,16.90,0],
                        'a3': [-380.04,87.97,-417.72,-70.86,0]},
                  index=pd.Series(['0.2', '0.4', '0.6','0.8','1.0']))
a0, a1, a2, a3 are polynomial coefficients from a fit y = a0 + a1*x + a2*x^2 + a3*x^3.
Five fits have been made for five ratios Ht/H; these ratios form the index.
I want to return the values of a0 to a3 for a specified Ht/H ratio.
For example, if I specify Ht/H = 0.9, I want to get a0 = 0.91, a1 = 3.05, a2 = 8.45, a3 = -35.43.

First, I notice that your index currently holds strings, and interpolation needs a numeric index. So do:
df.index = pd.to_numeric(df.index)
Then reindex to include the new value and interpolate along the index:
import numpy as np
s = 0.9
# create a new index that includes the new value
new_idx = np.unique(list(df.index) + [s])
df.reindex(new_idx).interpolate('index').loc[s]
Output:
a0 0.910
a1 3.055
a2 8.450
a3 -35.430
Name: 0.9, dtype: float64
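As an alternative to reindexing, the same linear interpolation can be done column by column with numpy.interp. This is a different technique from the one above, sketched under the assumption that the index is numeric and sorted in increasing order; coefficients_at is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

# Data from the question, built with a numeric index from the start.
df = pd.DataFrame(
    {
        "a0": [0.46, 0.21, 1.10, 0.82, 1.0],
        "a1": [15.85, 28.20, -5.55, 6.11, 0.0],
        "a2": [124.06, -53.17, 167.76, 16.90, 0.0],
        "a3": [-380.04, 87.97, -417.72, -70.86, 0.0],
    },
    index=[0.2, 0.4, 0.6, 0.8, 1.0],
)


def coefficients_at(frame, ratio):
    """Linearly interpolate every column at `ratio` (hypothetical helper).

    Assumes frame.index is numeric and increasing, as np.interp requires.
    """
    x = frame.index.to_numpy()
    return pd.Series(
        {col: np.interp(ratio, x, frame[col].to_numpy()) for col in frame.columns},
        name=ratio,
    )


coeffs = coefficients_at(df, 0.9)
print(coeffs)
```

Note that np.interp clamps to the end values outside the index range instead of returning NaN, which may or may not be what is wanted.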

merge dataframes on multiple columns ignoring order

I have the following dataframes:
df1=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3]})
df2=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'X':[0.4,0.5,0.6]})
I would like to merge these two dataframes on fr and to, ignoring the order of fr and to, i.e., (2,5) is the same as (5,2). The desired output is:
dfO=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
or
dfO=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
I can do the following:
pd.merge(df1,df2,on=['fr','to'],how='left')
However, as expected, the X value of the second row is NaN.
Thank you for your help.

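You need to sort each (fr, to) pair with NumPy first, so that both frames store the pair in the same order:
import numpy as np
df1[['fr','to']] = np.sort(df1[['fr','to']].values, axis=1)
df2[['fr','to']] = np.sort(df2[['fr','to']].values, axis=1)
out = df1.merge(df2, how='left')
print(out)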
fr to R X
0 1 4 0.1 0.4
1 2 5 0.2 0.5
2 3 6 0.3 0.6
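The assignment above overwrites fr and to in the original frames. If they need to stay untouched, the sorting can be done on copies. A minimal sketch, assuming a hypothetical helper named normalize_pair:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"fr": [1, 2, 3], "to": [4, 5, 6], "R": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"fr": [1, 5, 3], "to": [4, 2, 6], "X": [0.4, 0.5, 0.6]})


def normalize_pair(frame, left="fr", right="to"):
    """Return a copy in which each row has left <= right (hypothetical helper)."""
    out = frame.copy()
    # Sort the two key columns row-wise so (5, 2) and (2, 5) become identical.
    out[[left, right]] = np.sort(out[[left, right]].to_numpy(), axis=1)
    return out


merged = normalize_pair(df1).merge(normalize_pair(df2), on=["fr", "to"], how="left")
print(merged)
```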

You can create a temporary key column and then join on it:
df1['tmp'] = df1.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
df2['tmp'] = df2.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
This gives the result you expect:
pd.merge(df1,df2[['tmp', 'X']],on=['tmp'], how='left').drop(columns=['tmp'])

Split pandas columns into two with column MultiIndex

I need to split DataFrame columns into two and add an additional value to the new column. The twist is that I need to lift the original column names up one level and add two new column names.
Given a DataFrame h:
>>> import pandas as pd
>>> h = pd.DataFrame({'a': [0.6, 0.4, 0.1], 'b': [0.2, 0.4, 0.7]})
>>> h
a b
0 0.6 0.2
1 0.4 0.4
2 0.1 0.7
I need to lift the original column names up one level and add two new column names. The result should look like this:
>>> # some stuff...
         a                 b
  expected received expected received
0      0.6        1      0.2        1
1      0.4        1      0.4        1
2      0.1        1      0.7        1
I've tried this:
>>> h['a1'] = [1, 1, 1]
>>> h['b1'] = [1, 1, 1]
>>> t = [('f', 'expected'),('f', 'received'), ('g', 'expected'), ('g', 'received')]
>>> h.columns = pd.MultiIndex.from_tuples(t)
>>> h
         f                 g
  expected received expected received
0      0.6      0.2        1        1
1      0.4      0.4        1        1
2      0.1      0.7        1        1
This just renames the columns but does not align them properly. I think the issue is there's no link between a1 and b1 to the expected and received columns.
How do I lift the original column names up one level and add two new column names?

I would use concat with keys, then swaplevel and sort_index:
h1=h.copy()
h1[:]=1
pd.concat([h,h1],keys=['expected', 'received'],axis=1).\
swaplevel(0,1,axis=1).\
sort_index(level=0,axis=1)
Out[233]:
a b
expected received expected received
0 0.6 1.0 0.2 1.0
1 0.4 1.0 0.4 1.0
2 0.1 1.0 0.7 1.0
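In the output above the received values are floats (1.0) because h1 inherits the float dtype of h. A possible alternative, sketched here under the assumption that integer 1 is wanted as in the question, is to build one small frame per original column and let concat use the dict keys as the top level:

```python
import pandas as pd

h = pd.DataFrame({"a": [0.6, 0.4, 0.1], "b": [0.2, 0.4, 0.7]})

# One two-column frame per original column; the scalar 1 is broadcast
# over the rows and keeps an integer dtype.
pieces = {
    col: pd.DataFrame({"expected": h[col], "received": 1})
    for col in h.columns
}

# Dict keys become the outer level of the column MultiIndex.
result = pd.concat(pieces, axis=1)
print(result)
```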

Python Pandas Merge data from different Dataframes on specific index and create new one

My code is given below: I have two data frames a,b. I want to create a new data frame c by merging a specific index data of a, b frames.
import pandas as pd
a = [10,20,30,40,50,60]
b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
a = pd.DataFrame(a,columns=['Voltage'])
b = pd.DataFrame(b,columns=['Current'])
c = pd.merge(a,b,left_index=True, right_index=True)
print(c)
The actual output is:
Voltage Current
0 10 0.1
1 20 0.2
2 30 0.3
3 40 0.4
4 50 0.5
5 60 0.6
I don't want all the rows, only the rows at specific index positions, something like:
c =
   Voltage  Current
0       30      0.3
1       40      0.4
How can I modify the c = pd.merge(a,b,left_index=True, right_index=True) code so that c contains only the third and fourth rows, with a new index as shown above?

Use iloc to select rows by position, and add reset_index with drop=True to get a default index in both DataFrames.
Solution 1, with concat:
c = pd.concat([a.iloc[2:4].reset_index(drop=True),
               b.iloc[2:4].reset_index(drop=True)], axis=1)
Or use merge:
c = pd.merge(a.iloc[2:4].reset_index(drop=True),
             b.iloc[2:4].reset_index(drop=True),
             left_index=True,
             right_index=True)
print(c)
Voltage Current
0 30 0.3
1 40 0.4
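Because a and b share the same default index here, another option is to combine them first and slice once afterwards. A short sketch of that variant, using the question's data:

```python
import pandas as pd

a = pd.DataFrame([10, 20, 30, 40, 50, 60], columns=["Voltage"])
b = pd.DataFrame([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], columns=["Current"])

# Join on the shared index, take the third and fourth rows by position,
# then renumber the result from 0.
c = a.join(b).iloc[2:4].reset_index(drop=True)
print(c)
```

This relies on both frames being aligned on the same index; if they are not, slicing each frame before combining (as above) is the safer route.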
