SQL Select and Group by in Pandas - python-3.x

Track  Actor              Movie
1      Katherine Hepburn  Guess Who's Coming to Dinner
2      Katherine Hepburn  Guess Who's Coming to Dinner
3      Katherine Hepburn  On Golden Pond
4      Katherine Hepburn  The Lion in Winter
5      Bette Davis        What Ever Happened to Baby Jane?
6      Bette Davis        The Letter
7      Bette Davis        The Letter
...
100    Omar Shariff       Lawrence of Arabia
I need to write Python code that selects all the actors who have starred in more than one movie and appends their names to a list.
In other words, I need a Python equivalent of the following SQL query:
SELECT Actor, count(DISTINCT Movie)
FROM table
GROUP by Actor
HAVING count(DISTINCT Movie) > 1

You can use the drop_duplicates() method to get DISTINCT movie values:
df = df.drop_duplicates(subset=['Actor', 'Movie'])
Now, for grouping and aggregating, use the groupby() method and chain the agg() method to it:
result = df.groupby('Actor').agg(count=('Movie', 'count'))
Finally, use boolean masking to check your condition (count > 1):
result = result[result['count'] > 1]
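The three steps above can also be collapsed with nunique(), which counts distinct values per group directly. This is an alternative sketch, not the answer's own method. The sample rows are a made-up subset shaped like the question's table, and the names movie_counts and actors are illustrative.

```python
import pandas as pd

# Small made-up sample in the same shape as the question's table.
df = pd.DataFrame({
    "Actor": ["Katherine Hepburn", "Katherine Hepburn", "Katherine Hepburn",
              "Bette Davis", "Bette Davis", "Omar Shariff"],
    "Movie": ["Guess Who's Coming to Dinner", "Guess Who's Coming to Dinner",
              "On Golden Pond", "The Letter", "The Letter", "Lawrence of Arabia"],
})

# Equivalent of COUNT(DISTINCT Movie) ... GROUP BY Actor
movie_counts = df.groupby("Actor")["Movie"].nunique()

# Equivalent of HAVING COUNT(DISTINCT Movie) > 1, collected into a plain list
actors = movie_counts[movie_counts > 1].index.tolist()
print(actors)
```

Because nunique() ignores repeated (Actor, Movie) rows, no separate drop_duplicates() call is needed, and .index.tolist() produces the list of names the question asks for.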

Related

Fastest way to check for multiple string matches in a dataframe column [duplicate]

I am trying to find string matches in a dataframe that has a list of actors and the movies they acted in.
my_favourite_actors = ['Clint Eastwood','Morgan Freeman','Al Pacino']
Actor | Movie
Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown | The Shawshank Redemption
Marlon Brando, Al Pacino, James Caan | The Godfather
Christian Bale, Heath Ledger, Aaron Eckhart, Gary Oldman, Maggie Gyllenhaal, Morgan Freeman | The Dark Knight
Henry Fonda, Lee Cobb, Martin Balsam | 12 Angry Men
Liam Neeson, Ralph Fiennes, Ben Kingsley | Schindler's List
Elijah Wood, Viggo Mortensen, Ian McKellen | The Lord of the Rings: The Return of the King
John Travolta, Uma Thurman, Samuel Jackson | Pulp Fiction
Clint Eastwood, Eli Wallach, Lee Van Cleef | The Good, the Bad and the Ugly
Brad Pitt, Edward Norton, Meat Loaf | Fight Club
Leonardo DiCaprio, Joseph Gordon-Levitt, | Inception
I am currently using the following approach to do the string matching, but it takes a very long time because the whole dataset has almost 100K rows.
def favourite_actor(movie_dataset):
    for actor in my_favourite_actors:
        movie_index = movie_dataset.loc[movie_dataset['Actor'].str.contains(actor, case=False)].index
        movie_dataset["_IsActorFound"].iloc[movie_index] = 1
Rows that contain one of my favourite actors get the value 1 in the adjacent column ['_IsActorFound'].
What is an optimal, fast way to do this string matching, given that my current code takes an extremely long time to execute?
You could use the apply function as follows:
def find_actor(s, actors):
    s = s.lower()  # lower-case the cell once so the match is case-insensitive
    for actor in actors:
        if actor in s:
            return 1
    return 0

# the names must be lower-cased too; a list has no .lower() method
df['Actor'].apply(find_actor, actors=[a.lower() for a in my_favourite_actors])
The advantage is that the loop stops as soon as one of the actors is found. Note that for strings the apply function is fine to use, because str.contains() is also not vectorized under the hood.
Use:
df['Actor'].str.contains('|'.join(my_favourite_actors), regex=True, case=False)
Output
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: Actor, dtype: bool
Explanation
Build the regex on the fly from the list, then pass it to the .str.contains() accessor in pandas. The | is regex alternation, so a row is True if any one element of the list matches.
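As a minimal sketch of how the joined-regex approach could fill the question's _IsActorFound column, the example below uses a four-row subset of the sample data. Escaping the names with re.escape and passing na=False are additions of this sketch, not part of the answer. They guard against names containing regex metacharacters and against missing values.

```python
import re
import pandas as pd

my_favourite_actors = ["Clint Eastwood", "Morgan Freeman", "Al Pacino"]

# Four rows taken from the question's sample table.
df = pd.DataFrame({
    "Actor": [
        "Morgan Freeman, Tim Robbins, Bob Gunton",
        "Marlon Brando, Al Pacino, James Caan",
        "Henry Fonda, Lee Cobb, Martin Balsam",
        "Clint Eastwood, Eli Wallach, Lee Van Cleef",
    ],
    "Movie": ["The Shawshank Redemption", "The Godfather",
              "12 Angry Men", "The Good, the Bad and the Ugly"],
})

# Escape each name so characters such as "." or "(" are matched literally.
pattern = "|".join(re.escape(name) for name in my_favourite_actors)

# One pass over the column instead of one pass per actor;
# na=False treats missing values as "no match".
found = df["Actor"].str.contains(pattern, case=False, regex=True, na=False)

# Store the flag as 1/0, as in the question.
df["_IsActorFound"] = found.astype(int)
print(df[["Movie", "_IsActorFound"]])
```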

In a table, function to get all cell values that meet condition in a different column

Number of patients | Name of doctor
5 | Ann
5 | John
3 | Ellen
5 | Dennis
1 | Janis
In the table above, I'd like a function (no code, please) that returns the names of all and only the doctors who currently treat 5 patients:
Number of patients | Name of doctor
5 | Ann
5 | John
5 | Dennis
Assume there are many more columns in the table.
Thanks!
If you have Excel 365, use the FILTER() function:
=FILTER(A2:B6,A2:A6=5)
For older versions, you can use the formula below:
=IFERROR(INDEX($A$2:$B$6,AGGREGATE(15,6,ROW($1:$5)/($A$2:$A$6=5),ROW(1:1)),COLUMN(A$1)),"")

Extracting the data between parentheses and putting the resulting value in another column

I would like to extract the data between parentheses from the dataframe below and put the resulting value in a new column. If the column data contains no parentheses, the new value can be left empty.
Data
0 The city is far (RANDOM)
1 Omega Fatty Acid is good for health
2 Name of the fruit is (MANGO)
3 The producer had given man good films (GOOD)
4 This summer has a very good (Offer)
We can use str.extract with a regex group that captures everything between the parentheses:
df['Newcol'] = df['Data'].str.extract(r'\((.*)\)')
                                           Data  Newcol
0                      The city is far (RANDOM)  RANDOM
1           Omega Fatty Acid is good for health     NaN
2                  Name of the fruit is (MANGO)   MANGO
3  The producer had given man good films (GOOD)    GOOD
4           This summer has a very good (Offer)   Offer
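A slightly more defensive variant of the same idea is sketched below. It assumes that only the first parenthesised group is wanted, and that missing values should become empty strings as the question suggests. The fourth row is a hypothetical addition to show the difference between `[^)]*` and the greedy `.*`.

```python
import pandas as pd

df = pd.DataFrame({"Data": [
    "The city is far (RANDOM)",
    "Omega Fatty Acid is good for health",
    "Name of the fruit is (MANGO)",
    "A (first) and a (second) group",  # hypothetical row with two groups
]})

# [^)]* stops at the first closing parenthesis, whereas .* would run to the last one.
# expand=False returns a Series rather than a one-column DataFrame.
df["Newcol"] = df["Data"].str.extract(r"\(([^)]*)\)", expand=False)

# Rows without parentheses come back as NaN; replace them with empty strings.
df["Newcol"] = df["Newcol"].fillna("")
print(df)
```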

How to find rows with a specific column value only

I am trying to find the names that have only one specific column value and nothing else.
I have tried filtering the rows by the column value, but that isn't what I want.
I want the names of people who only had pizza, so my code should return john and not peter, because only john had nothing but pizza.
Your description is not clear. At first glance, a simple .loc looks sufficient, but your picture of the sample data shows it is not that simple. You need to identify the names, duplicated or not, that have only one distinct Restaurant value and select them. To do this, count the distinct values per name with nunique, check that the count equals 1 with eq(1), and assign the result to a mask m. Finally, index the dataframe with m to get the desired output:
Your sample data:
In [512]: df
Out[512]:
    Name Restaurant
0   john      pizza
1  peter        kfc
2   john      pizza
3  peter      pizza
4  peter        kfc
5  peter      pizza
6   john      pizza
m = df.groupby('Name').Restaurant.transform('nunique').eq(1)
df[m]
Out[513]:
   Name Restaurant
0  john      pizza
2  john      pizza
6  john      pizza
If you want to show only one row per name, chain an additional .drop_duplicates():
df[m].drop_duplicates()
Out[515]:
   Name Restaurant
0  john      pizza
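The nunique mask above keeps any name with exactly one distinct Restaurant value, whatever that value is. If the requirement is specifically "only pizza", one possible sketch is to check that every row for a name equals the target value. The variable names target, only_target and names are illustrative.

```python
import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({
    "Name": ["john", "peter", "john", "peter", "peter", "peter", "john"],
    "Restaurant": ["pizza", "kfc", "pizza", "pizza", "kfc", "pizza", "pizza"],
})

target = "pizza"  # the value each name must have exclusively

# For each name, True only when every one of its rows equals the target.
only_target = df["Restaurant"].eq(target).groupby(df["Name"]).all()

# Keep the names whose flag is True.
names = only_target[only_target].index.tolist()
print(names)
```

With this check, someone who only ever went to kfc would be excluded, whereas the nunique mask would include them.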

Load item cost from an inventory table

I have an Inventory Sheet that contains a bunch of data about products I have for sale. I have a sheet for each month where I load in my individual sales. In order to calculate my cost of sales, I enter my product cost for each sale manually. I would like a formula to load the cost automatically, using the product name as a search term.
Inventory Item | Cost        Sold Item | Sale Price | Cost
Product 1      | 2.99        Product 3 | 16.99      | X
Product 2      | 4.99        Product 3 | 14.57      | X
Product 3      | 6.99        Product 1 | 7.99       | X
So basically I am looking to "solve for X".
In addition, the product names in the two tables have different lengths. For example, one item in my Inventory table may be "This is a very, very long product name that goes on and on for up to 120 characters", while in my products sold table it is truncated to the first 40 characters of the product name. So the formula should only search for the first 40 characters of the product name.
Due to the complicated nature of this, I haven't been able to search for a sufficient solution, since I don't really know exactly where to start to quickly explain it.
UPDATE:
The product names of my Inventory List, and the product names of my items sold aren't matching. I thought I could just search for the left-most 40 characters, but this is not the case.
Here is a sample of products I have in my Inventory List:
Ford Focus 2000 thru 2007 (Haynes Repair Manual) by Haynes, Max
Franklin Brass D2408PC Futura, Bath Hardware Accessory, Tissue Paper Holder, ...
Fuji HQ T-120 Recordable VHS Cassette Tapes ( 12 pack ) (Discontinued by Manu...
Fundamentals of Three Dimensional Descriptive Geometry [Paperback] by Slaby, ...
GE Lighting 40184 Energy Smart 55-Watt 2-D Compact Fluorescent Bulb, 250-Watt...
Get Set for School: Readiness & Writing Pre-K Teacher's Guide (Handwriting Wi...
Get the Edge: A 7-Day Program To Transform Your Life [Audiobook] [Box set] by...
Gift Basket Wrap Bag - 2 Pack - 22" x 30" - Clear [Kitchen]
GOLDEN GATE EDITION 100 PIECE PUZZLE [Toy]
Granite Ware 0722 Stainless Steel Deluxe Food Mill, 2-Quart [Kitchen]
Guess Who's Coming, Jesse Bear [Paperback] by Carlstrom, Nancy White; Degen, ...
Guide to Culturally Competent Health Care (Purnell, Guide to Culturally Compe...
Guinness World Records 2002 [Illustrated] by Antonia Cunningham; Jackie Fresh...
Hawaii [Audio CD] High Llamas
And then here is a sample of the product names in my Sold list:
My Interactive Point-and-Play with Disne...
GE Lighting 40184 Energy Smart 55-Watt 2...
Crayola® Sidewalk Chalk Caddy with Chalk...
Crayola® Sidewalk Chalk Caddy with Chalk...
First Look and Find: Merry Christmas!
Sesame Street Point-and-Play and 10-Book...
Disney Mickey Mouse Board Game - Duck Du...
Nordic Ware Microwave Vegetable and Seaf...
SmartGames BACK 2 BACK
I have played around with searching for the left-most characters, minus 3. This did not work correctly. I have also switched the [range lookup] between TRUE and FALSE, but this has also not worked in a predictable way.
Use the VLOOKUP function, and wrap the lookup_value argument in the LEFT function so that only the leading characters are compared.
For example, LEFT(E2, 9) truncates the Sold Item value to its first 9 characters before it is looked up in the Inventory Item column.
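For readers working in pandas rather than Excel, the same truncate-then-look-up idea might be sketched as follows. The longer inventory names, the column names and PREFIX_LEN are all hypothetical. The costs and sale prices come from the question's small table.

```python
import pandas as pd

# Hypothetical inventory with full-length names; costs from the question.
inventory = pd.DataFrame({
    "Item": ["Product 1 long name", "Product 2 long name", "Product 3 long name"],
    "Cost": [2.99, 4.99, 6.99],
})

# Sold items carry only a truncated name.
sold = pd.DataFrame({
    "Item": ["Product 3", "Product 3", "Product 1"],
    "SalePrice": [16.99, 14.57, 7.99],
})

PREFIX_LEN = 9  # plays the role of the 9 in LEFT(E2, 9)

# Build the same truncated key on both sides.
inventory["Key"] = inventory["Item"].str[:PREFIX_LEN]
sold["Key"] = sold["Item"].str[:PREFIX_LEN]

# A left merge keeps every sale; validate raises an error if two inventory
# items share a key, which would make the lookup ambiguous.
result = sold.merge(inventory[["Key", "Cost"]], on="Key",
                    how="left", validate="many_to_one")
print(result[["Item", "SalePrice", "Cost"]])
```

As with the VLOOKUP approach, this only works when the truncated prefix is unique in the inventory and the sold names really are prefixes of the inventory names.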

Resources