Fastest way to check for multiple string matches in a dataframe column [duplicate] - python-3.x

This question already has an answer here:
Finding multiple exact string matches in a dataframe column using PANDAS
(1 answer)
Closed 8 months ago.
I am currently trying to find string matches in a dataframe that holds a list of actors and the movies they acted in.
my_favourite_actors = ['Clint Eastwood','Morgan Freeman','Al Pacino']
Actor | Movie
Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown | The Shawshank Redemption
Marlon Brando, Al Pacino, James Caan | The Godfather
Christian Bale, Heath Ledger, Aaron Eckhart, Gary Oldman, Maggie Gyllenhaal, Morgan Freeman | The Dark Knight
Henry Fonda, Lee Cobb, Martin Balsam | 12 Angry Men
Liam Neeson, Ralph Fiennes, Ben Kingsley | Schindler's List
Elijah Wood, Viggo Mortensen, Ian McKellen | The Lord of the Rings: The Return of the King
John Travolta, Uma Thurman, Samuel Jackson | Pulp Fiction
Clint Eastwood, Eli Wallach, Lee Van Cleef | The Good, the Bad and the Ugly
Brad Pitt, Edward Norton, Meat Loaf | Fight Club
Leonardo DiCaprio, Joseph Gordon-Levitt, | Inception
I am currently using the following approach for the string matching, but it's taking a very long time since the dataset has almost 100K rows.
def favourite_actor(movie_dataset):
    for actor in my_favourite_actors:
        # Rows where this actor appears in the comma-separated Actor string
        movie_index = movie_dataset.loc[movie_dataset['Actor'].str.contains(actor, case=False)].index
        movie_dataset.loc[movie_index, "_IsActorFound"] = 1
For the rows where one of my favourite actors is found, the function sets the value of the adjacent column ['_IsActorFound'] to 1.
What would be an optimal and fast way to do this string matching, given that my current code takes an extremely long time to execute?

You could use the apply function as follows:
def find_actor(s, actors):
    # Lowercase the row once, then stop at the first favourite actor found.
    s = s.lower()
    for actor in actors:
        if actor in s:
            return 1
    return 0

df['Actor'].apply(find_actor, actors=[a.lower() for a in my_favourite_actors])
The advantage is that it stops checking as soon as one of the actors is found. Note that using apply is acceptable for strings here, because str.contains() is also not vectorized under the hood.
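To match the question's _IsActorFound column, the result of apply can be assigned straight back to the dataframe; a minimal sketch, assuming df is the movie dataframe from the question:

# Lowercase the search terms once, up front, since find_actor compares lowercased strings.
actors_lower = [a.lower() for a in my_favourite_actors]
df['_IsActorFound'] = df['Actor'].apply(find_actor, actors=actors_lower)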

Use -
df['Actor'].str.contains('|'.join(my_favourite_actors), regex=True, case=False)
Output
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: Actor, dtype: bool
Explanation
Create a regex on the fly by joining the list with |, then use the .str.contains() accessor in pandas. In a regex, | is alternation, so the result is True if any one element of the list matches.
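One caveat with building the pattern this way: if a name ever contains regex metacharacters (a period, parentheses, etc.), it should be escaped first. A minimal sketch using the standard library's re.escape:

import re

# Escape each name so it is matched literally, then join with the alternation operator.
pattern = '|'.join(map(re.escape, my_favourite_actors))
df['Actor'].str.contains(pattern, regex=True, case=False)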

Related

How to compare "raw" joins to the output of deep feature synthesis in Featuretools?

Is it possible to get the results someone would get from deep feature synthesis, but without any aggregations?
I have some small datasets, and I want to be able to compare the "processed" outputs of deep feature synthesis with the "raw" joined data.
For example, this aggregate collapses the resulting df down to 1 row per customer:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum"],
    trans_primitives=[],
)
fm.head()
I'd love to not have that "sum" happening, so that I get a resulting dataframe with multiple rows per customer. But I can't swap out agg_primitives=["sum"], for agg_primitives=[], because I get:
AssertionError: No features can be generated from the specified primitives. Please make sure the primitives you are using are compatible with the variable types in your data.
I expect the answer is "what you want is not possible in featuretools".
Thank you!
If you want to see the output without any aggregations performed you can simply set the agg_primitives parameter to an empty list in your call to ft.dfs. Similarly, you can disable transformations by passing an empty list to trans_primitives.
Here is an example of how you would do this using one of the Featuretools demo EntitySets:
import featuretools as ft
es = ft.demo.load_retail()
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="order_products",
    agg_primitives=[],
    trans_primitives=[],
)
                 order_id product_id  quantity  unit_price   total orders.customer_name  orders.country orders.cancelled
order_product_id
0                  536365     85123A         6      4.2075  25.245         Andrea Brown  United Kingdom            False
1                  536365      71053         6      5.5935  33.561         Andrea Brown  United Kingdom            False
2                  536365     84406B         8      4.5375  36.300         Andrea Brown  United Kingdom            False
3                  536365     84029G         6      5.5935  33.561         Andrea Brown  United Kingdom            False
4                  536365     84029E         6      5.5935  33.561         Andrea Brown  United Kingdom            False
One thing to note: by default, Featuretools only includes certain column types in the output, specifically numeric, boolean, and categorical columns. If you want to include all column types, simply include return_types="all" in the call to ft.dfs.
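As a sketch of that option (the call mirrors the example above, with only the return_types argument from the note added):

fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="order_products",
    agg_primitives=[],
    trans_primitives=[],
    return_types="all",  # include all column types, not just numeric/boolean/categorical
)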

Excel Lambda function: iterating For Loop

So, I am working on another issue and need to check a 12 x 6 Excel range for errors. If there is an error, I want it to build a new 12 x 6 range within the function and then check that for errors. I am at the very beginning and very new to LAMBDA functions in Excel (but I have the basics). I also have the limitation of not being able to use VBA (which I know would be way simpler and cleaner).
So I created a function LoopTest in Name Manager and then in "refers to":
=LAMBDA(X,Y,
    IF(Y<=11=TRUE,
        IF(X<=6=TRUE,
            LoopTest(X+1,Y),
            IF(Y=11,
                "TEST SUCCESS",
                LoopTest(0,Y+1)
            )
        )
    )
)
Then =LoopTest(0,0)
This seems to be working correctly (although Excel doesn't really allow intermediate testing of the function). So now I assume I can loop through a range with INDEX(array,X,Y) and check the cells for errors.
The only problem is that I can only handle one array/table/range at a time. I need to figure out how to create a test array the first time through and then pass it back each time until the test fails or fully succeeds (at which point it returns the successful range). I am leaning towards the LET() function to define some more variables and hide them behind some IF statements (I haven't used IFS, but have seen others use it successfully). I haven't checked the following formula, but the general flow should be correct.
=LAMBDA(X,Y,Array1,
    IF(Y<=11=TRUE,
        IF(X<=6=TRUE,
            IF(ISERROR(INDEX(Array1,X,Y))=FALSE,
                LoopTest(X+1,Y,Array1), 'IF True continue checking Array1
                Array1 = NEWARRAY 'IF False I NEED A WAY TO CREATE A NEW ARRAY AND BEGIN CHECKING IT
            IF(Y=11,
                Array1 'IF True Return the fully checked Array1
                IF(ISERROR(INDEX(Array1,X,Y))=FALSE,
                    LoopTest(0,Y+1,Array1) 'IF True continue checking Array1
                    Array1 = NEWARRAY 'IF False I NEED A WAY TO CREATE A NEW ARRAY AND BEGIN CHECKING IT
                )
            )
        )
    )
)
The purpose is to take a range of names with a bunch of qualifications, like:
Adam
Bill
Camp
Doug
Earl
Fred
Gabe
Hall
Ivan
Kobe
Lane
Mike
And create a range that is unique, similar to Sudoku (unique both horizontally and vertically):
Gabe Earl Fred Doug Bill Ivan
Adam Gabe Bill Lane Mike Camp
Mike Hall Kobe Bill Doug Gabe
Fred Doug Gabe Camp Kobe Mike
Camp Kobe Lane Mike Ivan Fred
Bill Lane Ivan Fred Gabe Adam
Doug Camp Adam Earl Hall Lane
Earl Adam Hall Ivan Fred Bill
Lane Ivan Mike Adam Earl Hall
Ivan Mike Camp Kobe Lane Earl
Hall Bill Doug Gabe Camp Kobe
Kobe Fred Earl Hall Adam Doug
With 6 positions and 12 names, it will fail more often than it succeeds (my guess is around 100 iterations per valid solution), but I want it to keep iterating until the LAMBDA finds a valid solution. The simple approach of just grabbing names randomly for the table, based on what came from above and to the left, is about 50/50 on finding a valid solution.

SQL Select and Group by in Pandas

Track  Actor              Movie
1      Katherine Hepburn  Guess Who's Coming to Dinner
2      Katherine Hepburn  Guess Who's Coming to Dinner
3      Katherine Hepburn  On Golden Pond
4      Katherine Hepburn  The Lion in Winter
5      Bette Davis        What Ever Happened to Baby Jane?
6      Bette Davis        The Letter
7      Bette Davis        The Letter
...
100    Omar Shariff       Lawrence of Arabia
I need to write code in Python to select all the actors that have starred in more than one movie and append their names to a list.
A Python equivalent of the following SQL query:
SELECT Actor, count(DISTINCT Movie)
FROM table
GROUP by Actor
HAVING count(DISTINCT Movie) > 1
You can use the drop_duplicates() method for DISTINCT movie values:
df = df.drop_duplicates(subset=['Actor', 'Movie'])
Now, for grouping and aggregating, use the groupby() method and chain the agg() method to it:
result = df.groupby('Actor').agg(count=('Movie', 'count'))
Finally, make use of boolean masking to check your condition (count > 1):
result = result[result['count'] > 1]
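Since the question asks for the names in a list, one more step finishes the job; a minimal sketch, reusing the result variable from above:

# The actor names are the index of `result` after the groupby; pull them into a plain list.
actors_with_multiple_movies = result.index.tolist()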

Extracting the data between parentheses and putting the resulting value in another column

I would like to extract the data between parentheses from the below dataframe and put the resulting value in a new column. If there are no parentheses in the column data, the new column can be left empty.
Data
0 The city is far (RANDOM)
1 Omega Fatty Acid is good for health
2 Name of the fruit is (MANGO)
3 The producer had given man good films (GOOD)
4 This summer has a very good (Offer)
We can use str.extract with a regex capture group that matches everything between the parentheses:
df['Newcol'] = df['Data'].str.extract(r'\((.*)\)')
Data Newcol
0 The city is far (RANDOM) RANDOM
1 Omega Fatty Acid is good for health NaN
2 Name of the fruit is (MANGO) MANGO
3 The producer had given man good films (GOOD) GOOD
4 This summer has a very good (Offer) Offer
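One hedged caveat: .* is greedy, so if a row contained more than one parenthesized group (not the case in this sample), it would capture everything from the first ( to the last ). A non-greedy variant avoids that:

df['Newcol'] = df['Data'].str.extract(r'\((.*?)\)')  # .*? stops at the first closing parenthesis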

How to find row with specific column value only

I am trying to figure out the names that have only a specific column value and nothing else.
I have tried filtering the rows according to the column value, but that isn't what I want: I want the names who only went to eat pizza.
Since John only had pizza, my code should return John and not Peter.
Your description is not clear. At first, it looks like a simple .loc will be enough. However, after viewing your picture of sample data, I realized it is not that simple. To get what you want, you need to identify the names having only one unique Restaurant value and pick them. To do this, use nunique per name via transform, check that it eq(1), and assign the result to a mask m. Finally, slice with m to get your desired output:
Your sample data:
In [512]: df
Out[512]:
    Name Restaurant
0   john      pizza
1  peter        kfc
2   john      pizza
3  peter      pizza
4  peter        kfc
5  peter      pizza
6   john      pizza
m = df.groupby('Name').Restaurant.transform('nunique').eq(1)
df[m]
Out[513]:
   Name Restaurant
0  john      pizza
2  john      pizza
6  john      pizza
If you want to show only one row per name, just chain an additional .drop_duplicates():
df[m].drop_duplicates()
Out[515]:
   Name Restaurant
0  john      pizza
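One hedged refinement: the mask alone keeps any name tied to a single restaurant, whatever that restaurant is. If the goal is strictly the names whose only restaurant is pizza, the mask can be combined with an equality check (column names as in the sample above):

# Names whose rows are all one restaurant AND that restaurant is pizza.
only_pizza = df[m & df['Restaurant'].eq('pizza')]['Name'].unique()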
