How to join two dataframes with multiple overlap in pyspark - apache-spark

Hi, I have a dataset of multiple households where all people within households have been matched between two data sources. The dataframe therefore consists of a 'household' column and two person columns (one for each data source). However, some people (like Jonathan or Peter below) were not able to be matched, and so have a blank second person column.
Household   Person_source_A   Person_source_B
1           Oliver            Oliver
1           Jonathan
1           Amy               Amy
2           David             Dave
2           Mary              Mary
3           Lizzie            Elizabeth
3           Peter
As the dataframe is gigantic, my aim is to take a sample of the unmatched individuals and then output a df that has all people within the households that contain a sampled unmatched person. I.e. say my random sample includes Jonathan but not Peter; then only household 1 would appear in the output.
My issue is that I've filtered to take the sample and am now stuck making further progress. Some combination of join and agg/groupBy will work, but I'm struggling. I add a flag to the sampled unmatched names to identify them, which I think is helpful...
My code:
from pyspark.sql.functions import col, lit

# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)
# add flag of sampled unmatched persons
df_unmatched_sample = df_unmatched_sample.withColumn('sample_flag', lit('1'))

As it pertains to your intent:
I just want to reduce my dataframe to only show the full households of
households where an unmatched person exists that has been selected by
a random sample out of all unmatched people
Using your existing approach, you could use a join on the Household of the sampled records:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()
# keep every person in the sampled households
desired_df = df.join(df_unmatched_sample, ["Household"], "inner")
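If you don't need any columns from the sample itself, a left semi join expresses the same household filter a little more directly (a sketch using the same names as above; with a semi join the distinct() is not strictly needed):
# a semi join keeps the rows of df whose Household appears in the sample,
# and returns only the columns of df
desired_df = df.join(df_unmatched_sample, ["Household"], "left_semi")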
Edit 1
In response to the OP's comment:
Is there a slightly different way that keeps a flag to identify the
sampled unmatched person (as there are some households with more than
one unmatched person)?
A left join on your existing dataset after adding the flag column to your sample may help you achieve this, e.g.:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1).withColumn('sample_flag', lit('1'))
desired_df = (
    df.alias("dfo").join(
        df_unmatched_sample.alias("dfu"),
        [
            col("dfo.Household") == col("dfu.Household"),
            col("dfo.per_A") == col("dfu.per_A"),
            col("dfo.per_B").isNull()
        ],
        "left"
    )
)
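This left join keeps every row of df and sets sample_flag only on the rows that matched a sampled unmatched person. If you also want to cut the output down to just the households that contain at least one sampled person, one option (a sketch, assuming the aliases above; taking the max of the string flag works because '1' > '0') is a window over Household:
from pyspark.sql import Window
from pyspark.sql.functions import col, coalesce, lit, max as spark_max

# keep the original columns plus a 0/1 flag
flagged = desired_df.select(
    "dfo.*",
    coalesce(col("dfu.sample_flag"), lit("0")).alias("sample_flag")
)
# keep only households containing at least one flagged person
w = Window.partitionBy("Household")
result = (
    flagged
    .withColumn("hh_flag", spark_max("sample_flag").over(w))
    .filter(col("hh_flag") == "1")
    .drop("hh_flag")
)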

Related

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1. This real dataframe has 100k records.
For example, column A is related to the 3rd and 4th rows of text, because it is labeled as 1. In the same way, A is not related to the 1st and 2nd rows of text because it is labeled as 0.
What I need to do is to sample this dataframe so that I have the same, or about the same, number of occurrences of each feature.
In this case, the feature C has only one occurrence, so I need to filter all the other columns so that I have one text with A, one text with B, one text with C, etc.
The best would be: I can set, for example, n=100, meaning I want to sample so that I have 100 records with each of the features.
This is a multilabel training dataset and it is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 values. I just want to have ABOUT the same number of 1s and 0s.
For example, with a final dataset of 1k records, I would like to have all columns from A to the final column, and all these columns with about the same numbers of 1 and 0. To accomplish this I will need to randomly discard rows (text and id only).
The approach I was trying was to look at the feature with the lowest counts of 1 and 0 and then use this value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum value as a threshold to filter the text column. I don't know how to do this filtering step.
Thanks
The exact output you expect is unclear, but assuming you want to get 1 random row per letter with a 1, you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
.set_index(['id', 'text'])
.replace(0, float('nan'))
.stack()
.groupby(level=-1).sample(n=1)
.reset_index()
)
NB: you can rename the columns if needed (see the sketch after the output below)
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
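For instance, to name the reshaped columns explicitly (the names label and value here are just illustrative), a rename at the end of the same pipeline does it:
out = (df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))
 .stack()
 .groupby(level=-1).sample(n=1)
 .reset_index()
 .rename(columns={'level_2': 'label', 0: 'value'})
)
For the n-records-per-feature variant mentioned in the question, the same pipeline works with a larger n in sample (passing replace=True when a label has fewer than n rows).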

Summary statistics for each group and transpose using pandas

I have a dataframe like as shown below
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12],
                   'time': [0, 0, 0, 1, 2, 3, 4, 4, 0, 0, 1],
                   'value': [101, 102, np.nan, 120, 143, 153, 160, 170, 96, 97, 99]})
What I would like to do is
a) Get the summary statistics for each subject for each time point (e.g. 0hr, 1hr, 2hr, etc.)
b) Please note that NA rows shouldn't be counted as separate records/rows when computing the mean
I was trying the below
for i in df['person_id'].unique():
    df[df['person_id'].isin([i])].time.unique()
val_mean = df.groupby(['person_id', 'time'])['value'].mean()
val_stddev = df['value'].std()
But I couldn't get the expected output
I expect my output to be as shown below, with one row for each time point (e.g. 0hr, 1hr, 2hr, 3hr, etc.). Please note that NA rows shouldn't be counted as separate records/rows when computing the mean.
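The expected output table itself isn't shown here, but a minimal sketch of the grouped aggregation (assuming mean and standard deviation are the statistics wanted; pandas skips NaN values by default, which covers point b):
summary = (df
 .groupby(['person_id', 'time'])['value']
 .agg(['mean', 'std'])
 .reset_index()
)
# summary.T gives the transposed, wide layout if that is preferred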

Use KDTree/KNN Return Closest Neighbors

I have two Python pandas dataframes. One contains all NFL quarterbacks' college football statistics since 2007 and a label for the type of player they are (Elite, Average, Below Average). The other dataframe contains all of the college football QBs' data from this season along with a prediction label.
I want to run some sort of analysis to determine the two closest NFL comparisons for every college football QB based on their labels. I'd like to add the two comparable QBs as two new columns to the second dataframe.
The feature names in both dataframes are the same. Here is what the dataframes look like:
Player     Year   Team   GP   Comp %   YDS    TD   INT   Label
Player A   2020   ASU    12   65.5     3053   25   6     Average
For the example above, I'd like to find the two closest neighbors to Player A that also have the label "Average" from the first dataframe.
The way I thought of doing this was to use Scipy's KDTree and run a query tree:
tree = KDTree(nfl[features], leafsize=nfl[features].shape[0]+1)
closest = []
for row in college.iterrows():
    distances, ndx = tree.query(row[features], k=2)
    closest.append(ndx)
print(closest)
However, the print statement returned an empty list. Is this the right way to solve my problem?
.iterrows() will return tuples of (index, Series), where index is obviously the index of the row, and the Series holds the feature values, indexed by the column names (see below).
As you have it, row is being stored as that whole tuple, so row[features] won't really do anything. What you're really after is the Series holding the features and values, i.e. row[1]. So you can either call that directly, or just unpack the tuple in your loop by doing for idx, row in df.iterrows():. Then you can call on that Series row.
Scikit-learn is a good package to use here (it's actually built on SciPy, so you'll notice the same syntax). You'll have to edit the code to your specifications (e.g. filter to only the "Average" players; if you one-hot encode the category columns you may need to add those to the features; etc.), but to give you an idea (I made up these dataframes just for the example... the nfl one is actually accurate, but the college one is completely made up), you can see below how to build the KDTree and then take each row in the college dataframe to see which 2 rows it's closest to in the nfl dataframe. I have it print out the names, but as you can see with print(closest), the raw arrays are there for you.
import pandas as pd
from sklearn.neighbors import KDTree

nfl = pd.DataFrame([['Tom Brady', '1999', 'Michigan', 11, 61.0, 2217, 16, 6, 'Average'],
                    ['Aaron Rodgers', '2004', 'California', 12, 66.1, 2566, 24, 8, 'Average'],
                    ['Peyton Manning', '1997', 'Tennessee', 12, 60.2, 3819, 36, 11, 'Average'],
                    ['Drew Brees', '2000', 'Purdue', 12, 60.4, 3668, 26, 12, 'Average'],
                    ['Dan Marino', '1982', 'Pitt', 12, 58.5, 2432, 17, 23, 'Average'],
                    ['Joe Montana', '1978', 'Notre Dame', 11, 54.2, 2010, 10, 9, 'Average']],
                   columns=['Player', 'Year', 'Team', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label'])
college = pd.DataFrame([['Joe Smith', '2019', 'Illinois', 11, 55.6, 1045, 15, 7, 'Average'],
                        ['Mike Thomas', '2019', 'Wisconsin', 11, 67, 2045, 19, 11, 'Average'],
                        ['Steve Johnson', '2019', 'Nebraska', 12, 57.3, 2345, 9, 19, 'Average']],
                       columns=['Player', 'Year', 'Team', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label'])

features = ['GP', 'Comp %', 'YDS', 'TD', 'INT']

tree = KDTree(nfl[features], leaf_size=nfl[features].shape[0]+1)
closest = []
for idx, row in college.iterrows():
    # reshape the feature Series into the 2D array the tree expects
    X = row[features].values.reshape(1, -1)
    distances, ndx = tree.query(X, k=2, return_distance=True)
    closest.append(ndx)
    collegePlayer = college.loc[idx, 'Player']
    closestPlayers = [nfl.loc[x, 'Player'] for x in ndx[0]]
    print('%s closest to: %s' % (collegePlayer, closestPlayers))
print(closest)
Output:
Joe Smith closest to: ['Joe Montana', 'Tom Brady']
Mike Thomas closest to: ['Joe Montana', 'Tom Brady']
Steve Johnson closest to: ['Dan Marino', 'Tom Brady']
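To then attach the two comparisons as new columns on the college dataframe, as the question asks, one option is to map the stored index arrays back to player names (the column names closest_1 and closest_2 are just illustrative):
# each ndx in closest has shape (1, 2): the indices of the 2 nearest nfl rows
comps = pd.DataFrame(
    [[nfl.loc[x, 'Player'] for x in ndx[0]] for ndx in closest],
    columns=['closest_1', 'closest_2'],
    index=college.index,
)
college = college.join(comps)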

Excel multi tiered lists in data validation using alphanumeric fields

I am using multi-tiered lists and data validation, and have an issue when the fields are alphanumeric, e.g.
Col 1        Col 2
100 Dairy    101 Milk
             102 Cheese
200 Bakery   201 Bread
             202 Cake
If you choose '100 Dairy' in the drop-down list for Col 1, Col 2 will not give you the 2 fields to choose from; however, if you remove one of the sets of numbers, then data validation will provide the 2 options in Col 2.
In Column 2's data validation I am using =INDIRECT(A1)
How can I have both numbers and letters in both of my drop-down lists?
Any assistance appreciated
A Name (named range) cannot start with a digit or contain spaces. So, in your validation formula, you have to change your entry in column A into the actual name that is being used for your dependent dropdown.
If you used the Excel Create Name wizard, then something like:
=INDIRECT(SUBSTITUTE(" " & $A$1," ","_"))
That will only work if ALL of your dependent lists start with a digit. If not, you will need a different algorithm, e.g.:
=INDIRECT(SUBSTITUTE(IF(ISNUMBER(-LEFT($A$1))," ","")&$A$1," ","_"))
If you made up your own names, again, you will need a different algorithm.

I need to find a way to show the number of relationships between values in two different excel columns

OK, I have two columns in Excel that contain city names. I need a ranking of how many times a relationship between two cities occurs. For example, the ranking for the data below should be as follows: #1 is Austin to Dallas with 3 occurrences, #2 is Chicago to Boston with 2 occurrences, and #3 is Chicago to New York with 1 occurrence.
sample data set
You can use a COUNTIFS formula to check specific city pairs.
For example:
Row 1 = Headers
Column A = Origin City
Column B = Destination City
Data in your table should be in A2:B7
You can use:
=COUNTIFS($A$2:$A$7,"Austin",$B$2:$B$7,"Dallas")
=COUNTIFS($A$2:$A$7,"Chicago",$B$2:$B$7,"Boston")
=COUNTIFS($A$2:$A$7,"Chicago",$B$2:$B$7,"New York")
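If you'd rather not hard-code each city pair, a helper column can count every row's own pair, which you can then rank or sort; this assumes a helper formula placed in C2 and filled down:
=COUNTIFS($A$2:$A$7,A2,$B$2:$B$7,B2)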
