Avoiding explicit for-loop in Python with pandas dataframe - python-3.x

I would like to find a better way to carry out the following process.
#import packages
import pandas as pd
I have defined a pandas dataframe.
# Create dataframe
data = {'name': ['Jason', 'Jason', 'Tina', 'Tina', 'Tina'],
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data)
After the dataframe is created, I want to add an extra column to the dataframe. This column contains the rank based on the values in the coverage column for every name separately.
#Add column with ranks based on 'coverage' for every name separately.
df_end = pd.DataFrame()
for person_names in df.groupby('name').groups:
    one_name = df.groupby('name').get_group(person_names)
    one_name['coverageRank'] = one_name['coverage'].rank()
    df_end = df_end.append(one_name)
Is it possible to achieve this simple task in a simpler way? Maybe without using the for-loop?

I think you need DataFrameGroupBy.rank:
df['coverageRank'] = df.groupby('name')['coverage'].rank()
print (df)
coverage name reports coverageRank
0 25 Jason 4 1.0
1 94 Jason 24 2.0
2 57 Tina 31 1.0
3 62 Tina 2 2.0
4 70 Tina 3 3.0
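If you need a different tie-breaking scheme or ordering, rank also accepts method and ascending arguments; the settings below are purely illustrative, not part of the original question:
# illustrative only: dense ranks, highest coverage ranked first
df['coverageRank'] = df.groupby('name')['coverage'].rank(method='dense', ascending=False)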

Related

Generate conditional lists of lists in Pandas, "Pythonically"

I want to generate a conditional list of lists. The number of embedded lists is determined by the number of unique conditions, and each embedded list contains values from a given condition.
I can generate this list of lists using a for-loop. See the code below. However, I am looking for a faster and more Pythonic (i.e., no for-loop) approach.
import pandas as pd
from random import randint
example_conditions = ["A","A","B","B","B","C","D","D","D","D"]
example_values = [randint(-100,100) for _ in example_conditions ]
df = pd.DataFrame({
    "conditions": example_conditions,
    "values": example_values
})
lol = []
for condition in df["conditions"].unique():
    sublist = df.loc[df["conditions"] == condition]["values"].values.tolist()
    lol.append(sublist)
Thanks!
Try:
x = df.groupby("conditions")["values"].agg(list).to_list()
print(x)
Prints:
[[-1, 78], [33, 74, -79], [59], [-32, -2, 52, -66]]
Input dataframe:
conditions values
0 A -1
1 A 78
2 B 33
3 B 74
4 B -79
5 C 59
6 D -32
7 D -2
8 D 52
9 D -66
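Note that groupby sorts the group keys by default, which in this example happens to coincide with the order of first appearance; if you need to guarantee the original order instead, sort=False is one way (a sketch, same API with just a flag added):
# keep groups in order of first appearance rather than sorted key order
x = df.groupby("conditions", sort=False)["values"].agg(list).to_list()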

Apply a function to every row of a dataframe and store the data to a list/Dataframe in Python

I have the following simplified version of the code:
import pandas as pd
def myFunction(portf, Val):
    mydata = {portf: [Val, Val * 2, Val * 3, Val * 4]}
    df = pd.DataFrame(mydata, columns=[portf])
    return df
data = {'Portfolio': ['Book1', 'Book2', 'Book1', 'Book2'],
        'Value': [10, 5, 6, 11]}
df_input = pd.DataFrame(data, columns=['Portfolio', 'Value'])
df_output = myFunction(df_input['Portfolio'][0], df_input['Value'][0])
df_output1 = myFunction(df_input['Portfolio'][1], df_input['Value'][1])
df_output2 = myFunction(df_input['Portfolio'][2], df_input['Value'][2])
df_output3 = myFunction(df_input['Portfolio'][3], df_input['Value'][3])
What I would like is to concatenate all the df_output results into a single list, or even better into a single dataframe, in an efficient way, as the real df_input dataframe will have 100+ columns.
I tried to apply the following:
df_input.apply(lambda row : myFunction(row['Portfolio'], row['Value']), axis = 1)
However, all the results end up in a single column.
Any idea how to achieve that?
Thanks
You can use pd.concat to store all results in a single dataframe:
pd.concat([myFunction(row['Portfolio'], row['Value'])
           for _, row in df_input.iterrows()], axis=1)
First you build a list of pd.DataFrames with a list comprehension (you could also use a normal loop). Then you concat all DataFrames along axis=1.
Output:
Book1 Book2 Book1 Book2
0 10 5 6 11
1 20 10 12 22
2 30 15 18 33
3 40 20 24 44
You mentioned df_input has many more columns in the original dataframe. To account for this you need another loop (minimal example):
data = {'Portfolio': ['Book1', 'Book2', 'Book1', 'Book2'],
        'Value': [10, 5, 6, 11]}
df_input = pd.DataFrame(data, columns=['Portfolio', 'Value'])
df_input['Value2'] = df_input['Value'] * 100
pd.concat([myFunction(row['Portfolio'], row[col])
           for col in df_input.columns if col != 'Portfolio'
           for (_, row) in df_input.iterrows()], axis=1)
Output:
Book1 Book2 Book1 Book2 Book1 Book2 Book1 Book2
0 10 5 6 11 1000 500 600 1100
1 20 10 12 22 2000 1000 1200 2200
2 30 15 18 33 3000 1500 1800 3300
3 40 20 24 44 4000 2000 2400 4400
You might want to rename the columns or aggregate the resulting dataframe in some other way. But for this I had to guess (and I try not to guess in the face of ambiguity).
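One possible way to keep the duplicated Book1/Book2 column names apart, assuming a MultiIndex on the columns is acceptable, is to pass keys to pd.concat; this is only a sketch of one option:
# tag each result with its source row number so duplicate column names stay distinguishable
result = pd.concat([myFunction(row['Portfolio'], row['Value'])
                    for _, row in df_input.iterrows()],
                   axis=1, keys=df_input.index)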

Pandas Create DataFrame with ColumnNames from a list

Considering the following list made up of sub-lists as elements, I need to create a pandas dataframe.
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
The desired output is as follows, with the first element of each sub-list becoming a column name in the dataframe.
tom nick juli
0 10 15 14
Is there a way by which this output can be achieved?
Best Regards.
Use dictionary comprehension and pass to DataFrame constructor:
print ({x[0]: x[1:] for x in data})
{'tom': [10], 'nick': [15], 'juli': [14]}
df = pd.DataFrame({x[0]: x[1:] for x in data})
print (df)
tom nick juli
0 10 15 14
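The same comprehension also covers sub-lists with more than one value, since x[1:] keeps everything after the name; the extra numbers below are made up purely for illustration:
data2 = [['tom', 10, 20], ['nick', 15, 25], ['juli', 14, 24]]  # hypothetical wider input
print(pd.DataFrame({x[0]: x[1:] for x in data2}))
   tom  nick  juli
0   10    15    14
1   20    25    24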
You could also use dict + extended iterable unpacking:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
result = pd.DataFrame(dict((column, values) for column, *values in data))
print(result)
Output
tom nick juli
0 10 15 14
We can also do:
pd.DataFrame(data).set_index(0).T
0 tom nick juli
1 10 15 14
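The transpose leaves the leftover axis name 0 and a row label of 1 behind; if a completely clean frame is wanted, one way (same data, purely cosmetic) is:
clean = pd.DataFrame(data).set_index(0).T.reset_index(drop=True)
clean.columns.name = None  # drop the leftover axis name "0"
print(clean)
   tom  nick  juli
0   10    15    14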

How do I perform inter-row operations within a pandas.dataframe

How do I write the nested for loop to access every other row with respect to a row within a pandas.dataframe?
I am trying to perform some operations between rows in a pandas.dataframe
The operation for my example code is calculating Euclidean distances between each row with each other row.
The results are then saved into a list of the form
[(row_reference, name, dist)].
I understand how to access each row in a pandas.dataframe using df.iterrows(), but I'm not sure how to access every other row with respect to the current row in order to perform the inter-row operation.
import pandas as pd
import numpy
import math
df = pd.DataFrame([{'name': "Bill", 'c1': 3, 'c2': 8}, {'name': "James", 'c1': 4, 'c2': 12},
                   {'name': "John", 'c1': 12, 'c2': 26}])
# Euclidean distance function where x1=c1_row1, x2=c1_row2, y1=c2_row1, y2=c2_row2
def edist(x1, x2, y1, y2):
    dist = math.sqrt(math.pow((x1 - x2), 2) + math.pow((y1 - y2), 2))
    return dist
# Calculate Euclidean distance for one row (e.g. Bill) against each other row
# (e.g. "James" and "John"). Save results to a list (N_name, dist).
all_results = []
for index, row in df.iterrows():
    results = []
    # secondary loop to look for OTHER rows with respect to the current row
    # results.append([row2['name'], edist()])
    all_results.append([index, results])
I hope to perform some operation edist() on all rows with respect to the current row/index.
I expect the loop to do the following:
In[1]:
result = []
result.append(['James',edist(3,4,8,12)])
result.append(['John',edist(3,12,8,26)])
results_all=[]
results_all.append([0,result])
result2 = []
result2.append(['John',edist(4,12,12,26)])
result2.append(['Bill',edist(4,3,12,8)])
results_all.append([1,result2])
result3 = []
result3.append(['Bill',edist(12,3,26,8)])
result3.append(['James', edist(12,4,26,12)])
results_all.append([2,result3])
results_all
With the following expected resulting output:
OUT[1]:
[[0, [['James', 4.123105625617661], ['John', 20.12461179749811]]],
[1, [['John', 16.1245154965971], ['Bill', 4.123105625617661]]],
[2, [['Bill', 20.12461179749811], ['James', 16.1245154965971]]]]
If your data is not too long, you can check out scipy's distance_matrix:
from scipy.spatial import distance_matrix

all_results = pd.DataFrame(distance_matrix(df[['c1', 'c2']], df[['c1', 'c2']]),
                           index=df['name'],
                           columns=df['name'])
Output:
name Bill James John
name
Bill 0.000000 4.123106 20.124612
James 4.123106 0.000000 16.124515
John 20.124612 16.124515 0.000000
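If scipy is not available, the same matrix can be built with plain NumPy broadcasting; a minimal sketch using the c1/c2 columns above:
import numpy as np

pts = df[['c1', 'c2']].to_numpy()
# pairwise differences via broadcasting, then the Euclidean norm over the last axis
dists = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
all_results = pd.DataFrame(dists, index=df['name'], columns=df['name'])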
Consider shift and avoid any row-wise looping. Because the calculation is straightforward arithmetic, you can run the expression directly on the columns with the help of NumPy for vectorized calculation.
import numpy as np
df = (df.assign(c1_shift=lambda x: x['c1'].shift(1),
                c2_shift=lambda x: x['c2'].shift(1))
      )
df['dist'] = np.sqrt(np.power(df['c1'] - df['c1_shift'], 2) +
                     np.power(df['c2'] - df['c2_shift'], 2))
print(df)
# name c1 c2 c1_shift c2_shift dist
# 0 Bill 3 8 NaN NaN NaN
# 1 James 4 12 3.0 8.0 4.123106
# 2 John 12 26 4.0 12.0 16.124515
Should you want every pairwise combination of rows, consider a cross join of the dataframe with itself and query out the reverse duplicates:
df = (pd.merge(df.assign(key=1), df.assign(key=1), on="key")
        .query("name_x < name_y")
        .drop(columns=['key'])
      )
df['dist'] = np.sqrt(np.power(df['c1_x'] - df['c1_y'], 2) +
                     np.power(df['c2_x'] - df['c2_y'], 2))
print(df)
# name_x c1_x c2_x name_y c1_y c2_y dist
# 1 Bill 3 8 James 4 12 4.123106
# 2 Bill 3 8 John 12 26 20.124612
# 5 James 4 12 John 12 26 16.124515
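On pandas 1.2 or newer, the dummy key can be replaced by merge's how="cross"; this is just an alternative spelling of the same cross join, starting again from the original three-column df:
# cross join without a helper key (requires pandas >= 1.2)
pairs = df.merge(df, how="cross").query("name_x < name_y")
pairs['dist'] = np.sqrt(np.power(pairs['c1_x'] - pairs['c1_y'], 2) +
                        np.power(pairs['c2_x'] - pairs['c2_y'], 2))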

How to sort and update a column in a pandas dataframe. In the updated dataframe I want to concat a new dataframe

I want to sort a column of a pandas DataFrame.
Not just sort it, but return a dataframe with the sorted column.
To that sorted and updated DataFrame I then want to concat a single-column DataFrame.
import pandas as pd

students = {'name': ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10']}
marks = {'grade': [45, 78, 12, 14, 48, 43, 47, 98, 35, 80]}
df_1 = pd.DataFrame(students)
df_2 = pd.DataFrame(marks)
df = pd.concat([df_1, df_2], axis=1)
df_25to_75 = df
df_25to_75.sort_values(['grade'], inplace=True)
lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
a = pd.concat([df_25to_75, pd.DataFrame({'no.s': lst})], axis=1)
You can create your DataFrame more simply, and then what you want is rank. Feel free to sort at the end if necessary, though sorting isn't needed to get the ranking.
import pandas as pd
df = pd.DataFrame({**students, **marks})
df['no.s'] = df.grade.rank(method='dense').astype(int)
name grade no.s
0 s1 45 5
1 s2 78 8
2 s3 12 1
3 s4 14 2
4 s5 48 7
5 s6 43 4
6 s7 47 6
7 s8 98 10
8 s9 35 3
9 s10 80 9
Your original issue is that although you sort the DataFrame, the original index remains bound to the rows. Thus, when you then assign a new Series of 1-10, it aligns on the original index, not on the row ordering.
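A minimal sketch of the sort-then-attach route the question was aiming for, assuming it is acceptable to reset the index after sorting (so the new column aligns with the row order rather than the old index):
df_sorted = df.sort_values('grade').reset_index(drop=True)
df_sorted['no.s'] = range(1, len(df_sorted) + 1)  # positions 1-10 in sorted order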
