SOLR Combined and weighted results - search

I have the following task: Query SOLR and return a weighted list based on multiple conditions.
Example:
I have documents with the following fields, they mostly represent movies:
name, genre, actors, director
I want to return 20 documents sorted on the following condition
The document shares 1 actor and is from the same director (5 points)
The document shares 2 or more actors (3 points)
The document shares the director (3 points)
The document is of the same genre and shares an actor (2 points)
The document is of the same genre (1 point)
Then take these 4 movies:
Id: 1
Name: Harry Potter and the Philosopher's Stone
Genre: Adventure
Director: Chris Columbus
Actors: Daniel Radcliffe, Rupert Grint, Emma Watson
Id: 2
Name: My Week with Marilyn
Genre: Drama
Director: Simon Curtis
Actors: Michelle Williams, Eddie Redmayne, Emma Watson
Id: 3
Name: Percy Jackson & the Olympians: The Lightning Thief
Genre: Adventure
Director: Chris Columbus
Actors: Logan Lerman, Brandon T. Jackson, Alexandra Daddario
Id: 4
Name: Harry Potter and the Chamber of Secrets
Genre: Adventure
Director: Chris Columbus
Actors: Daniel Radcliffe, Rupert Grint, Emma Watson
I want to query SOLR like this: return a list of relevant movies based on movie id==4
The returned result should be:
Id: 1, points: 14 (matches all 5 conditions)
Id: 3, points: 4 (matches condition 3 and 5)
Id: 2, points: 0 (matches 0 conditions)
Is there any way to do this directly within SOLR?
As always thanks in advance :)

You can return weighted results with the DisMax Query Parser; this is called boosting. You can give varying weights to the fields in your documents with the qf ("query fields") parameter, as in the following example. You'll have to modify it to come up with your own formula, but you should be able to get close. Start by tweaking the numbers in the boost; you might end up needing some more advanced Function Queries.
From your example, where you want to find documents that match #4:
?q=Genre:"Adventure" Director:"Chris Columbus" Actors:("Daniel Radcliffe" "Rupert Grint" "Emma Watson")&defType=dismax&qf=Director^2.0+Actors^1.5+Genre^1.0&fl=*,score
//Get everything that matches #4
?q=Genre:"Adventure" Director:"Chris Columbus" Actors:("Daniel Radcliffe" "Rupert Grint" "Emma Watson")
//use dismax
&defType=dismax
//boost some fields with the "query fields" parameter
//this will make a match on director worth the most
//each actor will be worth a little bit less, but 2+ actors will be more
//all matches will be added together to create a score similar to your example
&qf=Director^2.0+Actors^1.5+Genre^1.0
//Make sure you can see the score for debugging
&fl=*,score
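If you build the request from code, the parameters above have to be URL-encoded. Here is a minimal Python sketch of that step. The host, port and core name (localhost:8983, "movies") are placeholders I made up, and the parameter values are simply copied from the answer above, not checked against a live Solr instance.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/movies/select"


def build_related_query(genre, director, actors):
    """Assemble the q/defType/qf/fl parameters described in the answer."""
    # Wrap every actor name in double quotes so it is sent as a phrase.
    actor_terms = " ".join(f'"{name}"' for name in actors)
    params = {
        "q": f'Genre:"{genre}" Director:"{director}" Actors:({actor_terms})',
        "defType": "dismax",
        # Field boosts copied from the answer; tune them for your own data.
        "qf": "Director^2.0 Actors^1.5 Genre^1.0",
        "fl": "*,score",
    }
    # urlencode escapes quotes, colons and carets and turns spaces into '+'.
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_related_query(
    "Adventure",
    "Chris Columbus",
    ["Daniel Radcliffe", "Rupert Grint", "Emma Watson"],
)
print(url)
```

The printed URL can be pasted into a browser or handed to any HTTP client.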

I don't think there is a way to do this with Solr out of the box. You could check out http://solr-ra.tgels.com/ to see if this might be something better suited to your needs or maybe show you how to make your own ranking algorithm.
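If the exact point scheme matters more than doing it inside Solr, one fallback is to fetch candidate documents and score them in your own code. This is only a sketch of the rules from the question in plain Python, not a Solr feature. The Movie class and function names are made up for illustration, and I am assuming "shares 1 actor" means "at least one", which is consistent with the expected 14 points for Id 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    id: int
    genre: str
    director: str
    actors: frozenset


def score(reference, candidate):
    """Add up the points of every condition from the question that holds."""
    shared = len(reference.actors & candidate.actors)
    same_director = reference.director == candidate.director
    same_genre = reference.genre == candidate.genre
    points = 0
    if shared >= 1 and same_director:  # assumption: "1 actor" = at least one
        points += 5
    if shared >= 2:
        points += 3
    if same_director:
        points += 3
    if same_genre and shared >= 1:
        points += 2
    if same_genre:
        points += 1
    return points


def related(movies, reference_id, limit=20):
    """Return (id, points) pairs for the other movies, best match first."""
    reference = next(m for m in movies if m.id == reference_id)
    scored = [(m.id, score(reference, m)) for m in movies if m.id != reference_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


hp_cast = frozenset({"Daniel Radcliffe", "Rupert Grint", "Emma Watson"})
movies = [
    Movie(1, "Adventure", "Chris Columbus", hp_cast),
    Movie(2, "Drama", "Simon Curtis",
          frozenset({"Michelle Williams", "Eddie Redmayne", "Emma Watson"})),
    Movie(3, "Adventure", "Chris Columbus",
          frozenset({"Logan Lerman", "Brandon T. Jackson", "Alexandra Daddario"})),
    Movie(4, "Adventure", "Chris Columbus", hp_cast),
]

print(related(movies, 4))  # [(1, 14), (3, 4), (2, 0)]
```

With the four movies from the question this reproduces the expected ranking for id 4.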

Related

Fastest way to check for multiple string matches in a dataframe column [duplicate]

I am currently trying to find a string match from a dataframe that has list of actors and the movies that they acted in.
my_favourite_actors = ['Clint Eastwood','Morgan Freeman','Al Pacino']
Actor | Movie
Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown | The Shawshank Redemption
Marlon Brando, Al Pacino, James Caan | The Godfather
Christian Bale, Heath Ledger, Aaron Eckhart, Gary Oldman, Maggie Gyllenhaal, Morgan Freeman | The Dark Knight
Henry Fonda, Lee Cobb, Martin Balsam | 12 Angry Men
Liam Neeson, Ralph Fiennes, Ben Kingsley | Schindler's List
Elijah Wood, Viggo Mortensen, Ian McKellen | The Lord of the Rings: The Return of the King
John Travolta, Uma Thurman, Samuel Jackson | Pulp Fiction
Clint Eastwood, Eli Wallach, Lee Van Cleef | The Good, the Bad and the Ugly
Brad Pitt, Edward Norton, Meat Loaf | Fight Club
Leonardo DiCaprio, Joseph Gordon-Levitt, | Inception
I am currently using the following approach to do the string matching, but it takes a very long time since the whole dataset has almost 100K rows.
def favourite_actor(movie_dataset):
    for actor in my_favourite_actors:
        # labels of the rows whose Actor string contains this actor (case-insensitive)
        movie_index = movie_dataset.loc[movie_dataset['Actor'].str.contains(actor, case=False)].index
        # assign through .loc with row labels to avoid chained assignment
        movie_dataset.loc[movie_index, "_IsActorFound"] = 1
Rows that contain one of my favourite actors get the value 1 in the adjacent column ['_IsActorFound'].
What is an optimal, fast way to do this string match, given that my current code takes an extremely long time to execute?
You could use the apply function as follows:
def find_actor(s, actors):
    # actors must already be lower-cased; return as soon as one matches
    s = s.lower()
    for actor in actors:
        if actor in s:
            return 1
    return 0
df['_IsActorFound'] = df['Actor'].apply(find_actor, actors=[a.lower() for a in my_favourite_actors])
The advantage is that it only checks until one of the actors is found. Please note that for strings the apply function is OK to use, because str.contains() is also not vectorized under the hood.
Use -
df['Actor'].str.contains('|'.join(my_favourite_actors), regex=True, case=False)
Output
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: Actor, dtype: bool
Explanation
Create a regex on the fly from the list, then use the .str.contains() accessor in pandas. The | means the result is True if any one element of the list matches.
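One caveat with joining raw names into a pattern: a name containing a regex metacharacter (a dot, a parenthesis, and so on) would change the meaning of the pattern. A small sketch of the same idea with re.escape applied to every name follows. The four-row frame is just a cut-down copy of the data in the question, and the _IsActorFound column name is taken from there.

```python
import re

import pandas as pd

my_favourite_actors = ['Clint Eastwood', 'Morgan Freeman', 'Al Pacino']

# Cut-down sample of the question's data.
df = pd.DataFrame({
    'Actor': [
        'Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown',
        'Marlon Brando, Al Pacino, James Caan',
        'Henry Fonda, Lee Cobb, Martin Balsam',
        'Clint Eastwood, Eli Wallach, Lee Van Cleef',
    ],
    'Movie': [
        'The Shawshank Redemption',
        'The Godfather',
        '12 Angry Men',
        'The Good, the Bad and the Ugly',
    ],
})

# Escape each name so it is matched literally, then OR the names together.
pattern = '|'.join(re.escape(name) for name in my_favourite_actors)

# One pass over the column; the boolean result is stored as 0/1.
df['_IsActorFound'] = (
    df['Actor'].str.contains(pattern, regex=True, case=False).astype(int)
)

print(df[['Movie', '_IsActorFound']])
```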

How to group similar news text / article in pandas dataframe

I have a pandas dataframe of news articles. Suppose it looks like this:
id | news title | keywords | publcation date | content
1 | Congress Wants to Beef Up Army Effort to Develop Counter-Drone Weapons | USA,Congress,Drone,Army | 2020-12-10 | SOME NEWS CONTENT
2 | Israel conflict: The range and scale of Hamas' weapons ... | Israel,Hamas,Conflict | 2020-12-10 | NEWS CONTENT
3 | US Air Force progresses testing of anti-drone laser weapons | USA,Air Force,Weapon,Dron | 2020-10-10 | NEWS CONTENT
4 | Hamas fighters display weapons in Gaza after truce with Israel | Hamas,Gaza,Israel,Weapon,Truce | 2020-11-10 | NEWS CONTENT
Now, how can I group similar rows based on the news content and sort each group by publication date?
Note: the content may be a summary of the news.
The result should display as:
Group1
id | news title | keywords | publcation date | content
3 | US Air Force progresses testing of anti-drone laser weapons | USA,Air Force,Weapon,Dron | 2020-10-10 | NEWS CONTENT
1 | Congress Wants to Beef Up Army Effort to Develop Counter-Drone Weapons | USA,Congress,Drone,Army | 2020-12-10 | SOME NEWS CONTENT
Group2
id | news title | keywords | publcation date | content
4 | Hamas fighters display weapons in Gaza after truce with Israel | Hamas,Gaza,Israel,Weapon,Truce | 2020-11-10 | NEWS CONTENT
2 | Israel conflict: The range and scale of Hamas' weapons ... | Israel,Hamas,Conflict | 2020-12-10 | NEWS CONTENT
It's a little bit complicated. I chose an easy way to measure similarity, but you can change the function as you wish.
You can also use https://pypi.org/project/pyjarowinkler/ in the is_similar function instead of the "set" intersection I used. The function can be much more complicated than the one I wrote.
I used two applies: the first one only populates "grps". It will work without the first one, but the result is more accurate on the second pass.
You can also change range(3,-1,-1) to start from a higher number for more accuracy.
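def is_similar(txt1, txt2, level=0):
    # similar if the two keyword collections share more than `level` items
    return len(set(txt1) & set(txt2)) > level

grps = {}  # group id -> set of all keywords seen in that group

def get_grp_id(row):
    row_words = row['keywords'].split(',')
    if len(grps.keys()) == 0:
        grps[1] = set(row_words)
        return 1
    else:
        # try the strictest overlap first, then relax it step by step
        for level in range(3, -1, -1):
            for grp in grps:
                if is_similar(grps[grp], row_words, level):
                    grps[grp] = grps[grp] | set(row_words)
                    return grp
        # no existing group matched: open a new one with the next id
        grp = max(grps) + 1
        grps[grp] = set(row_words)
        return grp

df.apply(get_grp_id, axis=1)  # first pass only fills grps
df['grp'] = df.apply(get_grp_id, axis=1)
df = df.sort_values(['grp', 'publcation date'])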
this is the output
if you want to split it into separate df let me know

SQL Select and Group by in Pandas

Track Actor Movie
1 Katherine Hepburn Guess Who's Coming to Dinner
2 Katherine Hepburn Guess Who's Coming to Dinner
3 Katherine Hepburn On Golden Pond
4 Katherine Hepburn The Lion in Winter
5 Bette Davis What Ever Happened to Baby Jane?
6 Bette Davis The Letter
7 Bette Davis The Letter
...
100 Omar Shariff Lawrence of Arabia
I need to write Python code that selects all the actors who have starred in more than one movie and appends their names to a list.
I am looking for a Python equivalent of the following SQL query:
SELECT Actor, count(DISTINCT Movie)
FROM table
GROUP by Actor
HAVING count(DISTINCT Movie) > 1
You can use the drop_duplicates() method to get DISTINCT movie values per actor:
df=df.drop_duplicates(subset=['Actor','Movie'])
Now, for grouping and aggregating, use the groupby() method and chain the agg() method to it:
result=df.groupby('Actor').agg(count=('Movie','count'))
Finally, use boolean masking to apply your condition (count>1):
result=result[result['count']>1]
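As a side note, the drop_duplicates + count steps can be folded into a single nunique() call, which counts distinct movies per actor directly. A short sketch follows, using the rows visible in the question as sample data. Since the question asks for a list of names, the last step pulls them out of the index.

```python
import pandas as pd

# Sample rows taken from the table in the question.
df = pd.DataFrame({
    'Track': [1, 2, 3, 4, 5, 6, 7, 100],
    'Actor': ['Katherine Hepburn'] * 4 + ['Bette Davis'] * 3 + ['Omar Shariff'],
    'Movie': [
        "Guess Who's Coming to Dinner",
        "Guess Who's Coming to Dinner",
        'On Golden Pond',
        'The Lion in Winter',
        'What Ever Happened to Baby Jane?',
        'The Letter',
        'The Letter',
        'Lawrence of Arabia',
    ],
})

# COUNT(DISTINCT Movie) ... GROUP BY Actor
movie_counts = df.groupby('Actor')['Movie'].nunique()

# HAVING COUNT(DISTINCT Movie) > 1, then keep only the actor names
actors = movie_counts[movie_counts > 1].index.tolist()

print(actors)  # ['Bette Davis', 'Katherine Hepburn']
```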

Attribute extraction from Metadata using Python3

Here is the input:
Magic Bullet Magic Bullet MBR-1701 17-Piece Express Mixing Set 4.1 out of 5 stars 3,670 customer reviews | 300 answered questions List Price: $59.99 Price: $47.02 & FREE Shipping You Save: $12.97 (22%) Only 5 left in stock - order soon. Ships from and sold by shopincaldwell. This fits your . Enter your model number to make sure this fits. 17-piece high-speed mixing system chops, whips, blends, and more Includes power base, 2 blades, 2 cups, 4 mugs, 4 colored comfort lip rings, 2 sealed lids, 2 vented lids, and recipe book Durable see-through construction; press down for results in 10 seconds or less Microwave- and freezer-safe cups and mugs; dishwasher-safe parts Product Built to North American Electrical Standards 23 new from $43.95 4 used from $35.00 There is a newer model of this item: Magic Bullet Blender, Small, Silver, 11 Piece Set $34.00 (2,645) In Stock.
Expected Output:
Title: "Magic Bullet MBR-1701 17-Piece Express Mixing Set",
Customer Rating : "4.1",
customer reviews : "3670",
List Price : $59.99,
Offer Price : $47.02,
Shipping : FREE Shipping
Can anyone please help me?
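One possible starting point is a handful of regular expressions anchored on the fixed phrases in the text ("out of 5 stars", "customer reviews", "List Price:", "Price:"). This is only a sketch that assumes every product blob follows the same layout as the sample. The function name and the rule that drops a repeated leading brand name from the title are my own guesses at what the expected output implies.

```python
import re

# Shortened copy of the sample input from the question.
text = (
    "Magic Bullet Magic Bullet MBR-1701 17-Piece Express Mixing Set "
    "4.1 out of 5 stars 3,670 customer reviews | 300 answered questions "
    "List Price: $59.99 Price: $47.02 & FREE Shipping You Save: $12.97 (22%)"
)


def first_group(pattern, text):
    """Return the first capture group of pattern, or None if it is absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def extract_attributes(text):
    rating = re.search(r'(\d+(?:\.\d+)?) out of 5 stars', text)
    # Assumption: the title is everything before the star rating.
    title = text[:rating.start()].strip() if rating else None
    if title:
        # Collapse a leading phrase that is repeated twice (the brand name).
        title = re.sub(r'^(.+?) \1\b', r'\1', title)
    reviews = first_group(r'([\d,]+) customer reviews', text)
    return {
        'Title': title,
        'Customer Rating': rating.group(1) if rating else None,
        'Customer Reviews': reviews.replace(',', '') if reviews else None,
        'List Price': first_group(r'List Price: (\$[\d.,]+)', text),
        # The lookbehind skips the "Price:" that belongs to "List Price:".
        'Offer Price': first_group(r'(?<!List )Price: (\$[\d.,]+)', text),
        'Shipping': first_group(r'& (FREE Shipping)', text),
    }


attributes = extract_attributes(text)
for key, value in attributes.items():
    print(f'{key}: {value}')
```

Fields that are missing from a given blob come back as None rather than raising an error.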

Load item cost from an inventory table

I have an Inventory Sheet that contains a bunch of data about products I have for sale. I have a sheet for each month where I load in my individual sales. In order to calculate my cost of sales, I enter my product cost for each sale manually. I would like a formula to load the cost automatically, using the product name as a search term.
Inventory Item | Cost          Sold Item | Sale Price | Cost
Product 1      | 2.99          Product 3 | 16.99      | X
Product 2      | 4.99          Product 3 | 14.57      | X
Product 3      | 6.99          Product 1 | 7.99       | X
So basically I am looking to "solve for X".
In addition to this, the product names in the two tables are actually different lengths. For example, one item in my Inventory Table may be "This is a very, very long product name that goes on and on for up to 120 characters", while in my products sold table it will be truncated to the first 40 characters of the product name. So the formula should only search for the first 40 characters of the product name.
Due to the complicated nature of this, I haven't been able to search for a sufficient solution, since I don't really know exactly where to start to quickly explain it.
UPDATE:
The product names of my Inventory List, and the product names of my items sold aren't matching. I thought I could just search for the left-most 40 characters, but this is not the case.
Here is a sample of products I have in my Inventory List:
Ford Focus 2000 thru 2007 (Haynes Repair Manual) by Haynes, Max
Franklin Brass D2408PC Futura, Bath Hardware Accessory, Tissue Paper Holder, ...
Fuji HQ T-120 Recordable VHS Cassette Tapes ( 12 pack ) (Discontinued by Manu...
Fundamentals of Three Dimensional Descriptive Geometry [Paperback] by Slaby, ...
GE Lighting 40184 Energy Smart 55-Watt 2-D Compact Fluorescent Bulb, 250-Watt...
Get Set for School: Readiness & Writing Pre-K Teacher's Guide (Handwriting Wi...
Get the Edge: A 7-Day Program To Transform Your Life [Audiobook] [Box set] by...
Gift Basket Wrap Bag - 2 Pack - 22" x 30" - Clear [Kitchen]
GOLDEN GATE EDITION 100 PIECE PUZZLE [Toy]
Granite Ware 0722 Stainless Steel Deluxe Food Mill, 2-Quart [Kitchen]
Guess Who's Coming, Jesse Bear [Paperback] by Carlstrom, Nancy White; Degen, ...
Guide to Culturally Competent Health Care (Purnell, Guide to Culturally Compe...
Guinness World Records 2002 [Illustrated] by Antonia Cunningham; Jackie Fresh...
Hawaii [Audio CD] High Llamas
And then here is a sample of the product names in my Sold list:
My Interactive Point-and-Play with Disne...
GE Lighting 40184 Energy Smart 55-Watt 2...
Crayola® Sidewalk Chalk Caddy with Chalk...
Crayola® Sidewalk Chalk Caddy with Chalk...
First Look and Find: Merry Christmas!
Sesame Street Point-and-Play and 10-Book...
Disney Mickey Mouse Board Game - Duck Du...
Nordic Ware Microwave Vegetable and Seaf...
SmartGames BACK 2 BACK
I have played around with searching for the left-most characters, minus 3. This did not work correctly. I have also switched the [range lookup] between TRUE and FALSE, but this has also not worked in a predictable way.
Use the VLOOKUP function. Augment the lookup_value parameter with the LEFT function.
        
In the above example, LEFT(E2, 9) truncates the Sold Item name to its first 9 characters before it is looked up in the Inventory Item column.
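To make the prefix-matching idea concrete outside the spreadsheet, here is a rough Python sketch of the same logic: strip the trailing "..." from both names, then look for the inventory name that starts with the sold name. The product names are taken from the samples in the question, but the costs are invented for illustration and the function names are hypothetical.

```python
def normalise(name):
    """Lower-case a name and drop a trailing '...' left by truncation."""
    name = name.strip()
    if name.endswith('...'):
        name = name[:-3]
    return name.strip().lower()


def lookup_cost(sold_name, inventory):
    """Return the cost of the single inventory item matching the prefix.

    Returns None when nothing matches or when the prefix is ambiguous.
    """
    prefix = normalise(sold_name)
    matches = [
        cost for item, cost in inventory.items()
        if normalise(item).startswith(prefix)
    ]
    return matches[0] if len(matches) == 1 else None


# Names from the question's Inventory List; the costs are made-up values.
inventory = {
    'GE Lighting 40184 Energy Smart 55-Watt 2-D Compact Fluorescent Bulb, 250-Watt...': 4.99,
    'GOLDEN GATE EDITION 100 PIECE PUZZLE [Toy]': 2.99,
    'Hawaii [Audio CD] High Llamas': 6.99,
}

print(lookup_cost('GE Lighting 40184 Energy Smart 55-Watt 2...', inventory))
print(lookup_cost('SmartGames BACK 2 BACK', inventory))
```

Returning None for ambiguous prefixes is a deliberate choice here, so that two products sharing the same first 40 characters are flagged for manual entry instead of silently getting the wrong cost.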
