Use KDTree/KNN to Return Closest Neighbors - python-3.x

I have two python pandas dataframes. One contains all NFL Quarterbacks' College Football statistics since 2007 and a label on the type of player they are (Elite, Average, Below Average). The other dataframe contains all of the college football qbs' data from this season along with a prediction label.
I want to run some sort of analysis to determine the two closest NFL comparisons for every college football QB based on their labels. I'd like to add the two comparable QBs as two new columns to the second dataframe.
The feature names in both dataframes are the same. Here is what the dataframes look like:
Player    Year  Team  GP  Comp %  YDS   TD  INT  Label
Player A  2020  ASU   12  65.5    3053  25  6    Average
For the example above, I'd like to find the two closest neighbors to Player A that also have the label "Average" from the first dataframe.
The way I thought of doing this was to use SciPy's KDTree and run a query against it:
from scipy.spatial import KDTree

tree = KDTree(nfl[features], leafsize=nfl[features].shape[0] + 1)
closest = []
for row in college.iterrows():
    distances, ndx = tree.query(row[features], k=2)
    closest.append(ndx)
print(closest)
However, the print statement returned an empty list. Is this the right way to solve my problem?

iterrows() yields tuples of (index, Series), where index is the index of the row and the Series holds that row's values, indexed by the column names (see below).
As you have it, row is being stored as that whole tuple, so row[features] won't really do anything. What you're really after is the Series holding the features and values, i.e. row[1]. So you can either index into the tuple directly, or unpack it in your loop with for idx, row in df.iterrows():. Then you can work with the Series row.
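For example, a minimal, made-up illustration of that unpacking:
import pandas as pd

demo = pd.DataFrame({'GP': [12], 'YDS': [3053]}, index=['Player A'])
for idx, row in demo.iterrows():
    print(idx)        # 'Player A' -- the row's index label
    print(row['GP'])  # 12 -- row is a Series keyed by column names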
Scikit-learn is a good package to use here (it's built on SciPy, so you'll notice similar syntax). You'll have to adapt the code to your specifications (e.g. filter to only the "Average" players; if you are one-hot encoding the category columns, you may need to add those to the features; etc.), but to give you an idea, the code below builds a KDTree and then takes each row in the college dataframe to find the 2 closest rows in the nfl dataframe. (I made these dataframes up just for the example; the nfl one is roughly accurate, but the college one is completely made up.) I have it print out the names, but as you can see with print(closest), the raw index arrays are there for you too.
import pandas as pd
from sklearn.neighbors import KDTree

nfl = pd.DataFrame([['Tom Brady', '1999', 'Michigan', 11, 61.0, 2217, 16, 6, 'Average'],
                    ['Aaron Rodgers', '2004', 'California', 12, 66.1, 2566, 24, 8, 'Average'],
                    ['Peyton Manning', '1997', 'Tennessee', 12, 60.2, 3819, 36, 11, 'Average'],
                    ['Drew Brees', '2000', 'Purdue', 12, 60.4, 3668, 26, 12, 'Average'],
                    ['Dan Marino', '1982', 'Pitt', 12, 58.5, 2432, 17, 23, 'Average'],
                    ['Joe Montana', '1978', 'Notre Dame', 11, 54.2, 2010, 10, 9, 'Average']],
                   columns=['Player', 'Year', 'Team', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label'])

college = pd.DataFrame([['Joe Smith', '2019', 'Illinois', 11, 55.6, 1045, 15, 7, 'Average'],
                        ['Mike Thomas', '2019', 'Wisconsin', 11, 67, 2045, 19, 11, 'Average'],
                        ['Steve Johnson', '2019', 'Nebraska', 12, 57.3, 2345, 9, 19, 'Average']],
                       columns=['Player', 'Year', 'Team', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label'])

features = ['GP', 'Comp %', 'YDS', 'TD', 'INT']

tree = KDTree(nfl[features], leaf_size=nfl[features].shape[0] + 1)
closest = []
for idx, row in college.iterrows():
    X = row[features].values.reshape(1, -1)
    distances, ndx = tree.query(X, k=2, return_distance=True)
    closest.append(ndx)
    collegePlayer = college.loc[idx, 'Player']
    closestPlayers = [nfl.loc[x, 'Player'] for x in ndx[0]]
    print('%s closest to: %s' % (collegePlayer, closestPlayers))
print(closest)
Output:
Joe Smith closest to: ['Joe Montana', 'Tom Brady']
Mike Thomas closest to: ['Joe Montana', 'Tom Brady']
Steve Johnson closest to: ['Dan Marino', 'Tom Brady']
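Not part of the answer above, but since the question also asks to match on labels and add the comparables as new columns: one possible extension is to build one tree per label. This is a sketch assuming the same nfl/college frames and features list; the comp_1/comp_2 column names are made up here, and it assumes each label has at least 2 NFL players.
closest_by_label = {}
for label in nfl['Label'].unique():
    subset = nfl[nfl['Label'] == label].reset_index(drop=True)
    closest_by_label[label] = (KDTree(subset[features]), subset)

college['comp_1'] = None
college['comp_2'] = None
for idx, row in college.iterrows():
    tree, subset = closest_by_label[row['Label']]
    _, ndx = tree.query(row[features].values.reshape(1, -1).astype(float), k=2)
    # positional indices map back to rows of the label-filtered subset
    college.loc[idx, ['comp_1', 'comp_2']] = subset.loc[ndx[0], 'Player'].tolist()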


How to join two dataframes with multiple overlap in pyspark

Hi, I have a dataset of multiple households where all people within households have been matched between two datasources. The dataframe therefore consists of a 'household' col and two person cols (one for each datasource). However, some people (like Jonathan or Peter below) were not able to be matched and so have a blank second person column.
Household  Person_source_A  Person_source_B
1          Oliver           Oliver
1          Jonathan
1          Amy              Amy
2          David            Dave
2          Mary             Mary
3          Lizzie           Elizabeth
3          Peter
As the dataframe is gigantic, my aim is to take a sample of the unmatched individuals and then output a df that has all people within households where a sampled unmatched person exists. I.e. if my random sample includes Jonathan but not Peter, then I would keep only household 1 in the output.
My issue is that I've filtered to take the sample and am now stuck making progress. Some combination of join and agg/groupBy will work, but I'm struggling. I add a flag to the sampled unmatched names to identify them, which I think is helpful...
My code:
from pyspark.sql.functions import col, lit

# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take a random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)
# add a flag to the sampled unmatched persons
df_unmatched_sample = df_unmatched_sample.withColumn('sample_flag', lit('1'))
As it pertains to your intent:
I just want to reduce my dataframe to only show the full households of
households where an unmatched person exists that has been selected by
a random sample out of all unmatched people
Using your existing approach, you could use a join on the Household of the sample records:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take a random sample of 10% and keep just the sampled households
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()
desired_df = df.join(df_unmatched_sample, ["Household"], "inner")
Edit 1
In response to the OP's comment:
Is there a slightly different way that keeps a flag to identify the
sampled unmatched person (as there are some households with more than
one unmatched person)?
A left join on your existing dataset, after adding the flag column to your sample, may help you achieve this, e.g.:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take a random sample of 10% and flag the sampled persons
df_unmatched_sample = df_unmatched.sample(0.1).withColumn('sample_flag', lit('1'))
desired_df = (
    df.alias("dfo").join(
        df_unmatched_sample.alias("dfu"),
        [
            col("dfo.Household") == col("dfu.Household"),
            col("dfo.per_A") == col("dfu.per_A"),
            col("dfo.per_B").isNull()
        ],
        "left"
    )
)
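For completeness, here is a self-contained sketch combining both ideas (restrict to sampled households and keep the flag) on a toy version of the table above; the local SparkSession, the seed, and the 50% fraction are only for the example.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame(
    [(1, 'Oliver', 'Oliver'), (1, 'Jonathan', None), (1, 'Amy', 'Amy'),
     (2, 'David', 'Dave'), (2, 'Mary', 'Mary'),
     (3, 'Lizzie', 'Elizabeth'), (3, 'Peter', None)],
    ['Household', 'per_A', 'per_B'])

# sample the unmatched people and flag them
sampled = (df.filter(col('per_A').isNotNull() & col('per_B').isNull())
             .sample(fraction=0.5, seed=42)
             .withColumn('sample_flag', lit('1')))

# keep every member of a household that contains a sampled person ...
households = sampled.select('Household').distinct()
full_households = df.join(households, ['Household'], 'inner')

# ... and carry the flag through with a left join on household + person
desired_df = full_households.join(
    sampled.select('Household', 'per_A', 'sample_flag'),
    ['Household', 'per_A'], 'left')
desired_df.show()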

How to find row and column names of the n-th highest values in a 2D array in Excel?

I want to find the row and column names of the n-th highest values in a 2D array in Excel.
My array has a header row (the Coins) and a header column (the Markets). The data itself displays whether a coin is supported on the market and, if so, what the approximate return on investment (ROI) will be in percent.
Example
An example of the array could look like this:
ROI       Coin A  Coin B  Coin C
Market 1  N/A     7.8%    5.7%
Market 2  0.4%    6.8%    N/A
Market 3  0.45%   7.6%    12.3%
Pay attention: some values are set to N/A (or is there a better way to display that a market doesn't support a specific coin? I don't want to enter 0%, as that makes it harder to spot whether a coin is supported by the market. I also don't want to leave the field blank, because then I don't know if I already checked that market for that coin.)
Preferred output
The output for the example table from above with n=3 should then look like this (from high ROI to low):
Coin  Market  ROI
C     3       12.3%
B     1       7.8%
A     3       0.45%
Requirements
Each coin must only be shown once. So, for example, Coin B must not be listed twice in the Top 3 output (once for Market 1: 7.8% and once for Market 3: 7.6%).
What I tried
So I thought about how to split that problem up into smaller parts. I think it comes down to these main parts (a sketch of the combined logic follows at the end):
find the header/row name: here I found something to find the column name for the highest value per row, but I wasn't able to adapt it to a working solution for a 2D array
find the max in a 2D array: here they describe how to find the max value in a 2D array, but not how to find the n-th highest values
find the n-th highest values: here is a good explanation of how to find the highest n values of a 1D array, but not how to apply that to a 2D array
only include each coin once
So I really tried to solve this myself, but I struggle with putting these different parts together.
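The Excel formulas themselves are what the question asks for, but to make the combined logic concrete, here is a minimal sketch of the same algorithm in pandas (flatten the grid, drop N/A, rank by ROI, keep each coin once, take the top n), using the example table above:
import pandas as pd

roi = pd.DataFrame({'Coin A': [None, 0.4, 0.45],
                    'Coin B': [7.8, 6.8, 7.6],
                    'Coin C': [5.7, None, 12.3]},
                   index=['Market 1', 'Market 2', 'Market 3'])

top_n = (roi.stack()                  # flatten to (Market, Coin) pairs; N/A cells drop out
            .rename('ROI').reset_index()
            .rename(columns={'level_0': 'Market', 'level_1': 'Coin'})
            .sort_values('ROI', ascending=False)
            .drop_duplicates('Coin')  # each coin appears only once, keeping its best market
            .head(3))
print(top_n)  # C/Market 3/12.3, B/Market 1/7.8, A/Market 3/0.45 -- the preferred output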

Summary statistics for each group and transpose using pandas

I have a dataframe like as shown below
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,12,12,12],
                   'time': [0,0,0,1,2,3,4,4,0,0,1],
                   'value': [101,102,np.nan,120,143,153,160,170,96,97,99]})
What I would like to do is
a) Get the summary statistics for each subject for each time point (ex: 0hr, 1hr, 2hr etc)
b) Please note that NA rows shouldn't be counted as a separate record/row when computing the mean
I was trying the below
for i in df['person_id'].unique():
    df[df['person_id'].isin([i])].time.unique()
val_mean = df.groupby(['person_id', 'time'])['value'].mean()
val_stddev = df['value'].std()
But I couldn't get the expected output
I expect my output to have one row per person for each time point (e.g. 0hr, 1hr, 2hr, 3hr, etc.). Please note that NA rows shouldn't be counted as a separate record/row when computing the mean.
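For reference, a minimal sketch of that aggregation in pandas (mean and std skip NaN by default, so the np.nan row at time 0 is not counted):
summary = (df.groupby(['person_id', 'time'])['value']
             .agg(['mean', 'std'])
             .reset_index())
print(summary)  # one row per person per time point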

How to force plotly plots to start at the correct point on the x axis?

I'm plotting the sales numbers (amount) per week YYYYWW per product product_name.
All the data appears on the graph; however, some of the products are showing incorrectly. If Product A only started having sales figures in 2019 (i.e. no sales figures for the whole of 2018), then I want the line for that product to be zero in 2018 and begin showing values from 2019.
What's happening instead is Product A is showing the line graph from the origin of the graph. So week 1 of sales is at YYYYWW 201801 instead.
Is there a more efficient way to solve this than to place zero values for the product with a list comprehension?
import plotly.graph_objs as go
import plotly.offline as pyo
data = [go.Scatter(x=sorted(df.YYYYWW.unique().astype(str)),
y=list(df.loc[df.product_name == 'Product A',
['amount','YYYYWW']].groupby('YYYYWW').sum().amount),
mode='lines+markers',
)
]
pyo.plot(data)
The values in x are: 201801, 201802, ... 201920
The values in y are:
YYYYWW amount
2019/15 454.32
2019/16 1131.15
2019/17 1152.96
2019/18 2822.77
2019/19 3580.86
2019/20 2265.06
Solved it!
My x values should be taken from the same per-product subset of the dataframe as my y values (i below is the product name in my plotting loop):
x = df.loc[df.product_name == i].YYYYWW.unique().astype(str)
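Expanded a little, the fix might look like this (a sketch assuming df has YYYYWW, amount and product_name columns as above): build x and y from the same per-product slice so they stay aligned.
data = []
for name in df.product_name.unique():
    # aggregate weekly sales for this product only
    sub = (df.loc[df.product_name == name, ['YYYYWW', 'amount']]
             .groupby('YYYYWW').sum().reset_index()
             .sort_values('YYYYWW'))
    data.append(go.Scatter(x=sub.YYYYWW.astype(str),
                           y=sub.amount,
                           mode='lines+markers',
                           name=name))
pyo.plot(data)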

How to understand python data frames better

Being a beginner in Python, I often face this problem: let's say I am working with a dataframe and want to execute an operation on one of the columns, such as removing the decimal point from a value or extracting the month from a date column. The solutions I find online are generally shown with a single value or data point, like this:
>>> a = 11.0
>>> int(a)
11
Now, the same solution can't be applied to a dataframe or a column. Again, if I want to combine a time with a date:
>>> from datetime import date, datetime
>>> d = date.today()
>>> d
datetime.date(2018, 3, 30)
>>> datetime.combine(d, datetime.min.time())
datetime.datetime(2018, 3, 30, 0, 0)
In the same manner, this solution cannot be used for a dataframe; it will throw an error. Obviously I have a gap in my knowledge here; I am not able to make these work in terms of dataframes. Can you please point me towards a topic that might help me understand these problems in terms of dataframes, or maybe show an example of how it's done?
You should have a look at the pandas library for manipulating dataframes: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
This is an example of applying a function to each value of a given column:
import pandas as pd

def myFunction(a_string):
    return a_string.upper()

data = pd.read_csv('data.csv')
print(data)
data['City'] = data['City'].apply(myFunction)
print(data)
Data at the beginning:
Name City Age
Robert Paris 32
Max Dallas 24
Raj Delhi 27
Data after:
Name City Age
Robert PARIS 32
Max DALLAS 24
Raj DELHI 27
Here myFunction uppercases the string, but apply can be used the same way for other kinds of operations.
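To tie this back to the two examples in the question: each single-value operation has a column-wise (vectorized) counterpart in pandas. A small sketch (the price and date column names are made up for illustration):
import pandas as pd

df = pd.DataFrame({'price': [11.0, 24.5, 7.9],
                   'date': pd.to_datetime(['2018-03-30', '2018-04-02', '2018-05-15'])})

df['price_int'] = df['price'].astype(int)   # int(a) for a whole column
df['month'] = df['date'].dt.month           # extract the month from a date column
df['midnight'] = df['date'].dt.normalize()  # datetime.combine(d, datetime.min.time()) per row
print(df)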
Hope that helps.
