How to use an integer list to find rows in pd.DataFrame with non-integer indices - python-3.x

How can I make this work?
import pandas as pd
L = [1,3,5]
df = pd.DataFrame([1,2,3,4,5,6,7], index=[0.1,0.2,0.3,0.4,0.5,0.6,0.7])
print(df[0])
print(df[0].loc(L))
I would like to have this output format:
0.2 2
0.4 4
0.6 6

I think what you want is .iloc, which selects rows by integer position:
df.iloc[L]
Out[477]:
0
0.2 2
0.4 4
0.6 6
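The distinction matters here: .loc selects by index label (and uses square brackets, not parentheses, which is why df[0].loc(L) fails), while .iloc selects by integer position. A minimal sketch with the frame above:

```python
import pandas as pd

L = [1, 3, 5]
df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7],
                  index=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])

# .iloc is positional: rows 1, 3 and 5, regardless of the float labels.
print(df.iloc[L])

# .loc would need the actual index labels instead:
print(df.loc[[0.2, 0.4, 0.6]])
```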

Related

Dataframe column value based on aggregation of several columns

Say I have a pandas dataframe as below:
A B C
1 4 0.1
2 3 0.5
4 1 0.7
5 2 0.2
7 5 0.6
I want to loop through the rows in the dataframe, and for each row perform an aggregation on columns A and B as:
Agg = row[A] / (row[A] + row[B])
A B C Agg
1 4 0.1 0.2
2 3 0.5 0.4
4 1 0.7 0.8
5 2 0.2 0.7
7 5 0.6 0.6
For all values of Agg > 0.6, get their corresponding column C values into a list, i.e. 0.7 and 0.2 in this case.
Last step is to get the minimum of the list i.e. min(list) = 0.2 in this instance.
We could use vectorized operations: add for the addition, rdiv for the division (to get A/(A+B)), gt for the greater-than comparison, and loc for the filtering:
out = df.loc[df['A'].add(df['B']).rdiv(df['A']).gt(0.6), 'C'].min()
We could also derive the same result using query much more concisely:
out = df.query('A/(A+B)>0.6')['C'].min()
Output:
0.2
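Put together with the sample data, a runnable version of both approaches (they agree):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 4, 5, 7],
                   'B': [4, 3, 1, 2, 5],
                   'C': [0.1, 0.5, 0.7, 0.2, 0.6]})

# Vectorized: A/(A+B) > 0.6, then take the minimum of the matching C values.
out = df.loc[df['A'].add(df['B']).rdiv(df['A']).gt(0.6), 'C'].min()
print(out)  # 0.2

# Same result via query:
out_q = df.query('A/(A+B) > 0.6')['C'].min()
```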
Instead of iterating, you can try creating an aggregate function and apply it across all rows.
def aggregate(row):
    return row["A"] / (row["A"] + row["B"])

df["Agg"] = round(df.apply(aggregate, axis=1), 1)
df[df["Agg"] > 0.6]["C"].min()
Output -
0.2

Easy way to convert a list of strings to a numpy array

I am working with data from World Ocean Database (WOD), and somehow I ended up with a list that looks like this one:
idata =
[' 1, 0.0,0, , 6.2386,0, , 33.2166,0, ,\n',
' 2, 5.0,0, , 6.2385,0, , 33.2166,0, ,\n',
' 3, 10.0,0, , 6.2306,0, , 33.2175,0, ,\n',
' 4, 15.0,0, , 6.2359,0, , 33.2176,0, ,\n',
' 5, 20.0,0, , 6.2387,0, , 33.2175,0, ,\n']
Is there an easy way to convert this structure into a numpy array or some friendlier format? Ultimately I just want to load the columns into a pandas DataFrame.
You could use a combination of string manipulation (i.e. strip() and split()) and list comprehensions:
import numpy as np
idata = [
' 1, 0.0,0, , 6.2386,0, , 33.2166,0, ,\n',
' 2, 5.0,0, , 6.2385,0, , 33.2166,0, ,\n',
' 3, 10.0,0, , 6.2306,0, , 33.2175,0, ,\n',
' 4, 15.0,0, , 6.2359,0, , 33.2176,0, ,\n',
' 5, 20.0,0, , 6.2387,0, , 33.2175,0, ,\n']
ll = [[float(x.strip()) for x in s.split(',') if x.strip()] for s in idata]
print(np.array(ll))
# [[ 1. 0. 0. 6.2386 0. 33.2166 0. ]
# [ 2. 5. 0. 6.2385 0. 33.2166 0. ]
# [ 3. 10. 0. 6.2306 0. 33.2175 0. ]
# [ 4. 15. 0. 6.2359 0. 33.2176 0. ]
# [ 5. 20. 0. 6.2387 0. 33.2175 0. ]]
which can also be fed to a Pandas dataframe constructor:
import pandas as pd
df = pd.DataFrame(ll)
print(df)
# 0 1 2 3 4 5 6
# 0 1.0 0.0 0.0 6.2386 0.0 33.2166 0.0
# 1 2.0 5.0 0.0 6.2385 0.0 33.2166 0.0
# 2 3.0 10.0 0.0 6.2306 0.0 33.2175 0.0
# 3 4.0 15.0 0.0 6.2359 0.0 33.2176 0.0
# 4 5.0 20.0 0.0 6.2387 0.0 33.2175 0.0
You might split the values by comma, strip the parts, and add the resulting rows to a DataFrame as follows:
import pandas as pd
data = [[item.strip() for item in line.split(',')] for line in idata]
df = pd.DataFrame(data)
To safely convert the DataFrame to numeric values, pd.to_numeric can be applied column-wise; errors='coerce' turns the blank-string fields into NaN instead of raising a parse error:
df = df.apply(pd.to_numeric, errors='coerce')
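End to end, assuming the blank fields should become NaN, a sketch with a shortened idata:

```python
import pandas as pd

idata = [' 1, 0.0,0, , 6.2386,0, , 33.2166,0, ,\n',
         ' 2, 5.0,0, , 6.2385,0, , 33.2166,0, ,\n']

# Split on commas and strip whitespace; blank fields become empty strings.
data = [[item.strip() for item in line.split(',')] for line in idata]
df = pd.DataFrame(data)

# errors='coerce' turns the empty strings into NaN instead of raising.
df = df.apply(pd.to_numeric, errors='coerce')
```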
try:
    from io import StringIO  # Python 3
except ImportError:
    from StringIO import StringIO  # Python 2
import pandas as pd
df = pd.read_csv(StringIO(''.join(idata)), index_col=0, header=None, sep=r',\s*', engine='python')
print(df)
# prints:
# 1 2 3 4 5 6 7 8 9 10
# 0
# 1 0.0 0 NaN 6.2386 0 NaN 33.2166 0 NaN NaN
# 2 5.0 0 NaN 6.2385 0 NaN 33.2166 0 NaN NaN
# 3 10.0 0 NaN 6.2306 0 NaN 33.2175 0 NaN NaN
# 4 15.0 0 NaN 6.2359 0 NaN 33.2176 0 NaN NaN
# 5 20.0 0 NaN 6.2387 0 NaN 33.2175 0 NaN NaN
Remove the header=None if you can include an initial row of idata that actually specifies helpful column labels. Remove sep=r',\s*', engine='python' if you're happy for the blank columns to contain blank string objects instead of NaN.

Split pandas columns into two with column MultiIndex

I need to split DataFrame columns into two and add an additional value to the new column. The twist is that I need to lift the original column names up one level and add two new column names.
Given a DataFrame h:
>>> import pandas as pd
>>> h = pd.DataFrame({'a': [0.6, 0.4, 0.1], 'b': [0.2, 0.4, 0.7]})
>>> h
a b
0 0.6 0.2
1 0.4 0.4
2 0.1 0.7
I need to lift the original column names up one level and add two new column names. The result should look like this:
>>> # some stuff...
a b
expected received expected received
0 0.6 1 0.2 1
1 0.4 1 0.4 1
2 0.1 1 0.7 1
I've tried this:
>>> h['a1'] = [1, 1, 1]
>>> h['b1'] = [1, 1, 1]
>>> t = [('f', 'expected'),('f', 'received'), ('g', 'expected'), ('g', 'received')]
>>> h.columns = pd.MultiIndex.from_tuples(t)
>>> h
f g
expected received expected received
0 0.6 0.2 1 1
1 0.4 0.4 1 1
2 0.1 0.7 1 1
This just renames the columns but does not align them properly. I think the issue is there's no link between a1 and b1 to the expected and received columns.
How do I lift the original column names up one level and add two new column names?
Use concat with keys, then swaplevel and sort_index:
h1 = h.copy()
h1[:] = 1
(pd.concat([h, h1], keys=['expected', 'received'], axis=1)
   .swaplevel(0, 1, axis=1)
   .sort_index(level=0, axis=1))
Out[233]:
a b
expected received expected received
0 0.6 1.0 0.2 1.0
1 0.4 1.0 0.4 1.0
2 0.1 1.0 0.7 1.0
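An equivalent way to build the same frame, as a sketch: construct a dict keyed by column tuples, which pandas turns into a MultiIndex directly, and let the scalar 1 broadcast down each 'received' column.

```python
import pandas as pd

h = pd.DataFrame({'a': [0.6, 0.4, 0.1], 'b': [0.2, 0.4, 0.7]})

# Tuple keys become MultiIndex columns; the scalar 1 is broadcast
# over the index for each 'received' column.
data = {}
for col in h.columns:
    data[(col, 'expected')] = h[col]
    data[(col, 'received')] = 1
out = pd.DataFrame(data)
```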

Python Pandas Merge data from different Dataframes on specific index and create new one

I have two data frames, a and b, and I want to create a new data frame c by merging rows at specific indices of a and b. My code is given below:
import pandas as pd
a = [10,20,30,40,50,60]
b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
a = pd.DataFrame(a,columns=['Voltage'])
b = pd.DataFrame(b,columns=['Current'])
c = pd.merge(a,b,left_index=True, right_index=True)
print(c)
The actual output is:
Voltage Current
0 10 0.1
1 20 0.2
2 30 0.3
3 40 0.4
4 50 0.5
5 60 0.6
I don't want all the rows, only specific ones, something like:
c = Voltage Current
0 30 0.3
1 40 0.4
How can I modify c = pd.merge(a, b, left_index=True, right_index=True) so that c contains only the third and fourth rows, renumbered from 0 as shown above?
Use iloc to select rows by position, and add reset_index with drop=True to restore the default index in both DataFrames:
Solution1 with concat:
c = pd.concat([a.iloc[2:4].reset_index(drop=True),
               b.iloc[2:4].reset_index(drop=True)], axis=1)
Or use merge:
c = pd.merge(a.iloc[2:4].reset_index(drop=True),
             b.iloc[2:4].reset_index(drop=True),
             left_index=True,
             right_index=True)
print(c)
Voltage Current
0 30 0.3
1 40 0.4
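A third option, sketched here: since both frames share the default RangeIndex, you can join first and slice once.

```python
import pandas as pd

a = pd.DataFrame([10, 20, 30, 40, 50, 60], columns=['Voltage'])
b = pd.DataFrame([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], columns=['Current'])

# Join on the shared index, keep positions 2-3, renumber from 0.
c = a.join(b).iloc[2:4].reset_index(drop=True)
```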

pandas equivalent of R's cbind (combine vectors side by side as columns)

Suppose I have two dataframes:
import pandas
....
....
test1 = pandas.DataFrame([1,2,3,4,5])
....
....
test2 = pandas.DataFrame([4,2,1,3,7])
....
I tried test1.append(test2) but it is the equivalent of R's rbind.
How can I combine the two as two columns of a dataframe similar to the cbind function in R?
test3 = pd.concat([test1, test2], axis=1)
test3.columns = ['a','b']
(But see the detailed answer by #feng-mai, below)
There is a key difference between concat(axis=1) in pandas and cbind() in R:
concat attempts to merge/align by index, and there is no concept of an index in an R dataframe. If the two pandas dataframes' indexes are misaligned, the results differ from cbind (even when they have the same number of rows). You need to either make sure the indexes align or drop/reset them.
Example:
import pandas as pd
test1 = pd.DataFrame([1,2,3,4,5])
test1.index = ['a','b','c','d','e']
test2 = pd.DataFrame([4,2,1,3,7])
test2.index = ['d','e','f','g','h']
pd.concat([test1, test2], axis=1)
0 0
a 1.0 NaN
b 2.0 NaN
c 3.0 NaN
d 4.0 4.0
e 5.0 2.0
f NaN 1.0
g NaN 3.0
h NaN 7.0
pd.concat([test1.reset_index(drop=True), test2.reset_index(drop=True)], axis=1)
0 1
0 1 4
1 2 2
2 3 1
3 4 3
4 5 7
pd.concat([test1.reset_index(), test2.reset_index(drop=True)], axis=1)
index 0 0
0 a 1 4
1 b 2 2
2 c 3 1
3 d 4 3
4 e 5 7
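The reset_index pattern can be wrapped in a small helper; the name cbind here is just for illustration. A sketch:

```python
import pandas as pd

def cbind(*dfs):
    # R-style cbind: ignore the indexes entirely and align by position.
    return pd.concat([d.reset_index(drop=True) for d in dfs], axis=1)

test1 = pd.DataFrame([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
test2 = pd.DataFrame([4, 2, 1, 3, 7], index=['d', 'e', 'f', 'g', 'h'])
test3 = cbind(test1, test2)  # 5 rows, 2 columns, no NaN padding
```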
