How to compare "raw" joins to the output of deep feature synthesis in Featuretools?

Is it possible to get the results someone would get from deep feature synthesis, but without any aggregations?
I have some small datasets, and I want to be able to compare the "processed" outputs of deep feature synthesis with the "raw" joined data.
For example, this aggregate collapses the resulting df down to 1 row per customer:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum"],
    trans_primitives=[],
)
fm.head()
I'd love to not have that "sum" happening, so that I get a resulting dataframe with multiple rows per customer. But I can't simply swap agg_primitives=["sum"] for agg_primitives=[], because then I get:
AssertionError: No features can be generated from the specified primitives. Please make sure the primitives you are using are compatible with the variable types in your data.
I expect the answer is "what you want is not possible in featuretools".
Thank you!

You can get the output without any aggregations by setting the agg_primitives parameter to an empty list in your call to ft.dfs; similarly, you can disable transformations by passing an empty list to trans_primitives. The key is to make the child dataframe the target: with no primitives, the parent columns are simply joined onto each child row as direct features, so you keep one row per child record. (With customers as the target and no primitives, there are no features to build, which is why you saw the AssertionError.)
Here is an example of how you would do this using one of the Featuretools demo EntitySets:
import featuretools as ft
es = ft.demo.load_retail()
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="order_products",
    agg_primitives=[],
    trans_primitives=[],
)
order_id product_id quantity unit_price total orders.customer_name orders.country orders.cancelled
order_product_id
0 536365 85123A 6 4.2075 25.245 Andrea Brown United Kingdom False
1 536365 71053 6 5.5935 33.561 Andrea Brown United Kingdom False
2 536365 84406B 8 4.5375 36.300 Andrea Brown United Kingdom False
3 536365 84029G 6 5.5935 33.561 Andrea Brown United Kingdom False
4 536365 84029E 6 5.5935 33.561 Andrea Brown United Kingdom False
One thing to note: by default, Featuretools only includes certain column types in the output, specifically numeric, boolean, and categorical columns. If you want to include all column types, simply include return_types="all" in the call to ft.dfs.
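For example, a minimal sketch of the same call with all column types returned, using the demo EntitySet loaded above:
fm_all, features_all = ft.dfs(
    entityset=es,
    target_dataframe_name="order_products",
    agg_primitives=[],
    trans_primitives=[],
    return_types="all",  # also keep datetime, text, and other column types
)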

Related

Fastest way to check for multiple string matches in a dataframe column [duplicate]

I am currently trying to find string matches in a dataframe that has a list of actors and the movies that they acted in.
my_favourite_actors = ['Clint Eastwood','Morgan Freeman','Al Pacino']
Actor | Movie
Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown | The Shawshank Redemption
Marlon Brando, Al Pacino, James Caan | The Godfather
Christian Bale, Heath Ledger, Aaron Eckhart, Gary Oldman, Maggie Gyllenhaal, Morgan Freeman | The Dark Knight
Henry Fonda, Lee Cobb, Martin Balsam | 12 Angry Men
Liam Neeson, Ralph Fiennes, Ben Kingsley | Schindler's List
Elijah Wood, Viggo Mortensen, Ian McKellen | The Lord of the Rings: The Return of the King
John Travolta, Uma Thurman, Samuel Jackson | Pulp Fiction
Clint Eastwood, Eli Wallach, Lee Van Cleef | The Good, the Bad and the Ugly
Brad Pitt, Edward Norton, Meat Loaf | Fight Club
Leonardo DiCaprio, Joseph Gordon-Levitt, | Inception
I am currently using the following approach to do the string matching, but it's taking a very long time since the whole dataset has almost 100K rows.
def favourite_actor(movie_dataset):
    for actor in my_favourite_actors:
        # rows whose 'Actor' string contains this favourite actor
        movie_index = movie_dataset.loc[movie_dataset['Actor'].str.contains(actor, case=False)].index
        # flag those rows by index label
        movie_dataset.loc[movie_index, "_IsActorFound"] = 1
Rows where one of my favourite actors is found should get the value 1 in the adjacent '_IsActorFound' column.
What would be an optimal, fast way to do this string matching? My current code takes an extremely long time to execute.
You could use the apply function as follows:
def find_actor(s, actors):
    # check actors one by one and stop as soon as one is found
    for actor in actors:
        if actor.lower() in s.lower():
            return 1
    return 0

df['_IsActorFound'] = df['Actor'].apply(find_actor, actors=my_favourite_actors)
The advantage is that it stops checking as soon as one of the actors is found. Note that apply is reasonable to use here, because str.contains() is not vectorized under the hood for string data either.
Use:
df['Actor'].str.contains('|'.join(my_favourite_actors), regex=True, case=False)
Output
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: Actor, dtype: bool
Explanation
Create a regex on the fly by joining the list with |, then use the .str.contains() accessor in pandas; | is regex alternation, so a row is True if any one element of the list matches.
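One caveat: if any of the names could contain regex metacharacters, escaping the list entries first is safer. A minimal sketch, also converting the boolean result to the 0/1 flag the question asked for:
import re

# escape each name so it matches literally, then join with | (alternation)
pattern = '|'.join(re.escape(actor) for actor in my_favourite_actors)
df['_IsActorFound'] = df['Actor'].str.contains(pattern, case=False).astype(int)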

How join two dataframes with multiple overlap in pyspark

Hi, I have a dataset of multiple households where all people within each household have been matched between two data sources. The dataframe therefore consists of a 'household' column and two person columns (one for each data source). However, some people (like Jonathan or Peter below) were not able to be matched and so have a blank second person column.
Household | Person_source_A | Person_source_B
1 | Oliver | Oliver
1 | Jonathan |
1 | Amy | Amy
2 | David | Dave
2 | Mary | Mary
3 | Lizzie | Elizabeth
3 | Peter |
As the dataframe is gigantic, my aim is to take a sample of the unmatched individuals and then output a df that has all people within the households where sampled unmatched people exist. I.e. say my random sample includes Jonathan but not Peter; then I would keep only household 1 in the output.
My issue is that I've filtered to take the sample and am now stuck making progress. Some combination of join and agg/groupBy will work, but I'm struggling. I add a flag to the sampled unmatched names to identify them, which I think is helpful...
My code:
from pyspark.sql.functions import col, lit

# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)
# add flag of sampled unmatched persons
df_unmatched_sample = df_unmatched_sample.withColumn('sample_flag', lit('1'))
As it pertains to your intent:
I just want to reduce my dataframe to only show the full households of
households where an unmatched person exists that has been selected by
a random sample out of all unmatched people
Using your existing approach, you could use a join on the Household of the sampled records:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10%, keeping only the distinct household ids
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()
# keep every row of the original df whose household appears in the sample
desired_df = df.join(df_unmatched_sample, ["Household"], "inner")
Edit 1
In response to the OP's comment:
Is there a slightly different way that keeps a flag to identify the sampled unmatched person (as there are some households with more than one unmatched person)?
A left join on your existing dataset, after adding the flag column to your sample, may help you achieve this, e.g.:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull() & col('per_B').isNull())
# take random sample of 10% and flag the sampled rows
df_unmatched_sample = df_unmatched.sample(0.1).withColumn('sample_flag', lit('1'))
desired_df = (
    df.alias("dfo").join(
        df_unmatched_sample.alias("dfu"),
        [
            col("dfo.Household") == col("dfu.Household"),
            col("dfo.per_A") == col("dfu.per_A"),
            col("dfo.per_B").isNull()
        ],
        "left"
    )
)
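If you then want the original columns plus a clean flag, you could select from the aliased dataframes; a minimal sketch, assuming the column names from the snippets above:
from pyspark.sql.functions import col

# keep the original columns and turn the joined flag's nulls into '0'
result = (
    desired_df
    .select("dfo.*", col("dfu.sample_flag"))
    .fillna({"sample_flag": "0"})
)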

SQL Select and Group by in Pandas

Track Actor Movie
1 Katherine Hepburn Guess Who's Coming to Dinner
2 Katherine Hepburn Guess Who's Coming to Dinner
3 Katherine Hepburn On Golden Pond
4 Katherine Hepburn The Lion in Winter
5 Bette Davis What Ever Happened to Baby Jane?
6 Bette Davis The Letter
7 Bette Davis The Letter
...
100 Omar Shariff Lawrence of Arabia
I need to write code in Python to select all the actors who have starred in more than one movie and append their names to a list.
I'm looking for a Python equivalent of the following SQL query:
SELECT Actor, count(DISTINCT Movie)
FROM table
GROUP BY Actor
HAVING count(DISTINCT Movie) > 1
You can use the drop_duplicates() method for DISTINCT movie values:
df = df.drop_duplicates(subset=['Actor', 'Movie'])
Now, for grouping and aggregating, use the groupby() method and chain the agg() method to it:
result = df.groupby('Actor').agg(count=('Movie', 'count'))
Finally, make use of boolean masking to check your condition (count > 1):
result = result[result['count'] > 1]
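Since the goal was to append the names to a list, the index of that filtered result holds them. A brief sketch of pulling out the list, plus a compact alternative using nunique() on the original (non-deduplicated) df:
actors_list = result.index.tolist()  # actors with more than one distinct movie

# compact alternative: count distinct movies per actor directly
counts = df.groupby('Actor')['Movie'].nunique()
actors_list = counts[counts > 1].index.tolist()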

Extracting the data between parentheses and putting the resulting value in another column

I would like to extract the data between parentheses from the below dataframe and put the resulting value in a new column. If there are no parentheses in the column data, then we can leave the new column empty.
Data
0 The city is far (RANDOM)
1 Omega Fatty Acid is good for health
2 Name of the fruit is (MANGO)
3 The producer had given man good films (GOOD)
4 This summer has a very good (Offer)
We can use str.extract with a regex capture group that matches everything between parentheses (using a raw string so the backslashes are taken literally):
df['Newcol'] = df['Data'].str.extract(r'\((.*)\)')
Data Newcol
0 The city is far (RANDOM) RANDOM
1 Omega Fatty Acid is good for health NaN
2 Name of the fruit is (MANGO) MANGO
3 The producer had given man good films (GOOD) GOOD
4 This summer has a very good (Offer) Offer
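Note that str.extract keeps only the first match per row. If a cell could contain several parenthesized values, str.findall is one option; a minimal sketch using the non-greedy .*? so each pair of parentheses is captured separately:
# collect every parenthesized value in each row into a list
df['AllMatches'] = df['Data'].str.findall(r'\((.*?)\)')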

Excel Vlookup Multiple Values

I am looking for a VLOOKUP formula that returns multiple matches using two lookup values. I am currently trying to use the concatenate method, but I haven't quite figured it out. The table needs to return all of the multiple matches, not just one; currently, it's only returning the last match.
For example, let's say I have a list of multiple cities and states. The cities differ, but the states repeat across cities. I want to return the number of people in each city.
City State #OfPeople
Albany NY 10
Orlando FL 5
Tampa FL 3
Seattle WA 1
Queens NY 8
So I concatenated the city and state column.
Join City State #OfPeople
Albany-NY Albany NY 10
Orlando-FL Orlando FL 5
Tampa-FL Tampa FL 3
Seattle-WA Seattle WA 1
Queens-NY Queens NY 8
The purpose of this is to create an updated log of people in each city as time progresses. I want to have a grand total of people in each column. (I know this requires another formula; I'm just focused on returning multiple matches for now.) However, I don't want to overwrite the existing data. Hopefully I explained this well; this is just an example of a larger project I'm working on. I need to be able to build on this list, which is why it's important that I be able to return matches multiple times.
Join City State #OfPeople Total
Albany-NY Albany NY 10 10
Orlando-FL Orlando FL 5 15
Tampa-FL Tampa FL 3 18
Seattle-WA Seattle WA 1 19
Queens-NY Queens NY 8 27
Any help would be greatly appreciated!
Considering you're trying to get grand totals based on multiple criteria, I would suggest using the SUMIFS() / COUNTIFS() functions rather than focusing on finding the matching row itself.
However, if you do need a multiple-criteria lookup for some reason, an INDEX() + MATCH() combination can do the job.
The table needs to return all of the multiple matches not just one.
Currently, its only returning the last match
You'll need to use SUMIFS() if there are multiple records for the same city/state combo in your people lookup.
=SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2, criteria2], ...)
Let's assume that you have a Cities tab and a People tab, and that there are ten cities for which you want to return the total number of people.
Cities Tab definition
City range: 'Cities'!A$1:A$10
State range: 'Cities'!B$1:B$10
People Tab definition
City range: 'People'!A$1:A$100
State range: 'People'!B$1:B$100
#OfPeople range: 'People'!C$1:C$100
Drop this formula in the first row of your cities tab, drag down the entire range of cities.
=SUMIFS('People'!C$1:C$100, 'People'!A$1:A$100, 'Cities'!A1, 'People'!B$1:B$100, 'Cities'!B1)
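For the running Total column shown in the asker's example, an expanding SUM is one simple option; this is a sketch assuming the #OfPeople values sit in D2:D6 with headers in row 1:
=SUM(D$2:D2)
The anchored D$2 keeps the top of the range fixed while the unanchored D2 grows as the formula is dragged down, producing the cumulative totals 10, 15, 18, 19, 27.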
