I have a Pandas Data Frame with two string columns, which I would like to split on space, like this:
df =
A B
0.1 0.5 0.01 ... 0.3 0.1 0.4 ...
I would like to split both these columns and form new columns for as many values, which result out of the split.
So, the result:
df =
A1 A2. A3 ... B1 B2 B3
0.1 0.5 0.01 ... 0.3 0.1 0.4
Currently, I am doing:
df = df.join(df['A'].str.split(' ', expand = True)
df = df.join(df['B'].str.split(' ', expand = True)
But, I get the following error:
columns overlap but no suffix specified
This is because I guess columns names of 1st and 2nd split overlap?
So, my question is how to split multiple columns by providing column names or suffixes for multiple splits?
Use DataFrame.add_prefix for columns names by splitted column:
df = df.join(df['A'].str.split(expand = True).add_prefix('A'))
df = df.join(df['B'].str.split(expand = True).add_prefix('B'))
print (df)
A B A0 A1 A2 B0 B1 B2
0 0.1 0.5 0.01 0.3 0.1 0.4 0.1 0.5 0.01 0.3 0.1 0.4
Another idea is use list comprehension:
cols = ['A','B']
df1 = pd.concat([df[c].str.split(expand=True).add_prefix(c) for c in cols], axis=1)
print (df1)
A0 A1 A2 B0 B1 B2
0 0.1 0.5 0.01 0.3 0.1 0.4
And for add all original columns:
df = df.join(df1)
Related
Say I have a pandas dataframe as below:
A B C
1 4 0.1
2 3 0.5
4 1 0.7
5 2 0.2
7 5 0.6
I want to loop through the rows in the dataframe, and for each row perform on aggregation on columns A and B as:
Agg = row[A] / row[A] + row[B]
A B C Agg
1 4 0.1 0.2
2 3 0.5 0.4
4 1 0.7 0.8
5 2 0.2 0.7
7 5 0.6 0.6
For all values of Agg > 0.6, get their corresponding column C values into a list, i.e. 0.7 and 0.2 in this case.
Last step is to get the minimum of the list i.e. min(list) = 0.2 in this instance.
We could use vectorized operations: add for addition, rdiv for division (for A/(A+B)), gt for greater than comparison and loc for the filtering:
out = df.loc[df['A'].add(df['B']).rdiv(df['A']).gt(0.6), 'C'].min()
We could also derive the same result using query much more concisely:
out = df.query('A/(A+B)>0.6')['C'].min()
Output:
0.2
Instead of iterating, you can try creating an aggregate function and apply it across all rows.
def aggregate(row):
return row["A"] / (row["A"] + row["B"])
df["Agg"] = round(df.apply(aggregate, axis = 1), 1)
df[df["Agg"] > 0.6]["C"].min()
Output -
0.2
I have the following Pandas dataframe:
a0 a1 a2 a3
0.2 0.46 15.85 124.06 -380.04
0.4 0.21 28.20 -53.17 87.97
0.6 1.10 -5.55 167.76 -417.72
0.8 0.82 6.11 16.90 -70.86
1.0 1.00 0.00 0.00 0.00
Which is made by:
import pandas as pd
df = pd.DataFrame(data={'a0': [0.46,0.21,1.10,0.82,1],
'a1': [15.85,28.20,-5.55,6.11,0],
'a2': [124.06,-53.17,167.76,16.90,0],
'a3': [-380.04,87.97,-417.72,-70.86,0]},
index=pd.Series(['0.2', '0.4', '0.6','0.8','1.0']))
a0,a1,a2,a3 are polynomial coefficients from a fit y= a0 + a1x + a2x^2 + a3*x^3.
5 fits have been made for 5 ratios Ht/H, these ratios are on the indices.
I want to return values for a0.. a3 for specified Ht/H ratio.
For example, if I specify Ht/H= 0.9, I want to get a0= 0.91, a1= 3.05,a2= 8.45,a3= -35.43.
First I notice that your index is currently strings, and you want numeric for interpolation. So do:
df.index = pd.to_numeric(df.index)
Let's try reindex:
s = 0.9
# create new index that includes the new value
new_idx = np.unique(list(df.index) + [s])
df.reindex(new_idx).interpolate('index').loc[s]
Output:
a0 0.910
a1 3.055
a2 8.450
a3 -35.430
Name: 0.9, dtype: float64
I have the following dataframes:
df1=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3]})
df2=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'X':[0.4,0.5,0.6]})
I would like to merge these two dataframes on fr and to, ignoring the order of fr and to, i.e., (2,5) is the same as (5,2). The desired output is:
dfO=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
or
dfO=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
I can do the following:
pd.merge(df1,df2,on=['fr','to'],how='left')
However, as expected, the X value of the second row is NaN.
Thank you for your help.
You need do numpy sort first
df1[['fr','to']] = np.sort(df1[['fr','to']].values,1)
df2[['fr','to']] = np.sort(df2[['fr','to']].values,1)
out = df1.merge(df2,how='left')
out
Out[44]:
fr to R X
0 1 4 0.1 0.4
1 2 5 0.2 0.5
2 3 6 0.3 0.6
You can create a temp field and then join on it
df1['tmp'] = df1.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
df2['tmp'] = df2.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
This will give the result that you expect
pd.merge(df1,df2[['tmp', 'X']],on=['tmp'], how='left').drop(columns=['tmp'])
I need to split DataFrame columns into two and add an additional value to the new column. The twist is that I need to lift the original column names up one level and add two new column names.
Given a DataFrame h:
>>> import pandas as pd
>>> h = pd.DataFrame({'a': [0.6, 0.4, 0.1], 'b': [0.2, 0.4, 0.7]})
>>> h
a b
0 0.6 0.2
1 0.4 0.4
2 0.1 0.7
I need to lift the original column names up one level and add two new column names. The result should look like this:
>>> # some stuff...
a b
expected received expected received
0 0.6 1 0.2 1
1 0.4 1 0.4 1
2 0.1 1 0.7 1
I've tried this:
>>> h['a1'] = [1, 1, 1]
>>> h['b1'] = [1, 1, 1]
>>> t = [('f', 'expected'),('f', 'received'), ('g', 'expected'), ('g', 'received')]
>>> h.columns = pd.MultiIndex.from_tuples(t)
>>> h
f g
expected received expected received
0 0.6 0.2 1 1
1 0.4 0.4 1 1
2 0.1 0.7 1 1
This just renames the columns but does not align them properly. I think the issue is there's no link between a1 and b1 to the expected and received columns.
How do I lift the original column names up one level and add two new column names?
I am using concat with keys , then swaplevel
h1=h.copy()
h1[:]=1
pd.concat([h,h1],keys=['expected', 'received'],axis=1).\
swaplevel(0,1,axis=1).\
sort_index(level=0,axis=1)
Out[233]:
a b
expected received expected received
0 0.6 1.0 0.2 1.0
1 0.4 1.0 0.4 1.0
2 0.1 1.0 0.7 1.0
My code is given below: I have two data frames a,b. I want to create a new data frame c by merging a specific index data of a, b frames.
import pandas as pd
a = [10,20,30,40,50,60]
b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
a = pd.DataFrame(a,columns=['Voltage'])
b = pd.DataFrame(b,columns=['Current'])
c = pd.merge(a,b,left_index=True, right_index=True)
print(c)
The actual output is:
Voltage Current
0 10 0.1
1 20 0.2
2 30 0.3
3 40 0.4
4 50 0.5
5 60 0.6
I don't want all the rows. But, specific index rows something like:
c = Voltage Current
0 30 0.3
1 40 0.4
How to modify c = pd.merge(a,b,left_index=True, right_index=True) code so that, I only want those specific third and fourth rows in c with new index order as given above?
Use iloc for select rows by positions and add reset_index with drop=True for default index in both DataFrames:
Solution1 with concat:
c = pd.concat([a.iloc[2:4].reset_index(drop=True),
b.iloc[2:4].reset_index(drop=True)], axis=1)
Or use merge:
c = pd.merge(a.iloc[2:4].reset_index(drop=True),
b.iloc[2:4].reset_index(drop=True),
left_index=True,
right_index=True)
print(c)
Voltage Current
0 30 0.3
1 40 0.4