FuzzyWuzzy: Matching an exact substring against longer strings gives multiple 100-score matches

I am using FuzzyWuzzy in Python to match bank names against a list. I noticed that when the bank name is an exact substring of several longer names, every one of those matches gets a score of 100.
See the example below:
from fuzzywuzzy import fuzz, process
list1 = [...]  # many bank names
ratio = process.extract('BANK OF INDIA', list1, limit=3, scorer=fuzz.token_set_ratio)
print(ratio)
OUTPUT
[('AGRICULTURAL BANK OF INDIA LIMITED', 100), ('BANK OF INDIA LIMITED', 100), ('INDUSTRIAL AND COMMERCIAL BANK OF INDIA LIMITED', 100)]
The correct match should be 'BANK OF INDIA LIMITED', the shortest of the matching names. How can I extract this match?
Thank you
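One possible way to break the tie, sketched under the assumption that "correct" means the shortest name among the top-scoring candidates, is to post-process the (name, score) pairs that process.extract returns. The helper name best_match is hypothetical, and the input below is simply the output printed in the question, so the sketch runs without FuzzyWuzzy installed.

```python
# (name, score) pairs exactly as printed in the question's output.
matches = [
    ('AGRICULTURAL BANK OF INDIA LIMITED', 100),
    ('BANK OF INDIA LIMITED', 100),
    ('INDUSTRIAL AND COMMERCIAL BANK OF INDIA LIMITED', 100),
]


def best_match(pairs):
    """Return the pair with the highest score; ties go to the shortest name.

    Hypothetical helper: the "shortest name wins" rule is an assumption
    taken from the question, not FuzzyWuzzy behaviour.
    """
    return max(pairs, key=lambda pair: (pair[1], -len(pair[0])))


best = best_match(matches)
print(best)
```

When using this with a real list, it may help to raise limit so that every 100-score candidate is included before the tie-break is applied.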

Related

Splitting strings into two different columns pandas

I have the data frame below, called df. It has a Location column in which each value is a list of comma-separated strings.
Expected output
I need to split the last two strings into multiple columns.
Example Input:
['122 Grenfell Street', 'Adelaide CBD', '5000 Adelaide', 'Australia']
Example Output:
df['Country']: Australia
df['City'] : 5000 Adelaide
I need to do the same for all the rows.
I tried the code below:
df['Country'] = df['Location'].str.split(',', expand = True)
The above code does not work. I tried the approaches from other posts without success.
Convert the column to a list with tolist(), then build a DataFrame from it with pd.DataFrame.
Say sample data is:
df=pd.DataFrame({'text':[['122 Grenfell Street', 'Adelaide CBD', '5000 Adelaide', 'Australia']]})
Extract list elements into columns:
df[['Street','Area','City','Country']] = pd.DataFrame(df.text.tolist(), index= df.index)
                                                text               Street  \
0  [122 Grenfell Street, Adelaide CBD, 5000 Adela...  122 Grenfell Street
           Area           City    Country
0  Adelaide CBD  5000 Adelaide  Australia
Use Series.str.extract with the following regex pattern (this assumes the Location column holds the string form of each list rather than actual list objects):
df[['City', 'Country']] = df['Location'].str.extract(r"'([^,']+?)'\s*,\s*'([^'\]]+)'\s*\]")
Result:
# print(df)
                                            Location           City    Country
0  [122 Grenfell Street, Adelaide CBD, 5000 Adela...  5000 Adelaide  Australia
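If the Location column holds real Python lists (an assumption; the question does not show the dtype), the last two elements can also be picked by position. The sketch below uses the sample row from the question and the .str indexer, which works element-wise on lists as well as strings.

```python
import pandas as pd

# Sample row from the question; Location is assumed to hold actual lists.
df = pd.DataFrame({
    'Location': [['122 Grenfell Street', 'Adelaide CBD', '5000 Adelaide', 'Australia']]
})

# .str[i] indexes into each list, so negative indices count from the end
# and do not depend on how many address parts a row has.
df['City'] = df['Location'].str[-2]
df['Country'] = df['Location'].str[-1]

print(df[['City', 'Country']])
```

Because the indices are negative, rows with a different number of leading address parts should still yield the last two entries.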

Display the same indexed value from 3 lists in python stock script

Objective
My goal is to have 3 lists for stocks and to display the entries at the same index position in each list, where the index is chosen based on the user's input.
My 3 lists
Price = [37.10, 46.18, 51.76, 145.64]
Symbol = ['T', 'KO', 'ABT', 'LOW']
Name = ['ATT', 'COCA COLA COMPANY', 'ABBOTT LABORATORIES', 'LOWES COMPANIES INC']
My script in full will ask the user for an investment amount and if below $1000 it'll invest in the cheapest stock.
If over $1000, it'll invest in the most expensive stock.
So if under $1000, it should invest all the money into ATT at $37.10 per share.
If over $1000, it should invest all the money into LOW at $145.64 a share.
The output should display the purchase amount of shares for the given stock it invested in.
Output to user
How much are you wanting to invest today? 1001.57
Below are the stocks available for purchase:
Price | Symbol | Name
$ 37.1 | T | ATT
$ 46.18 | KO | COCA COLA COMPANY
$ 51.76 | ABT | ABBOTT LABORATORIES
$ 145.64 | LOW | LOWES COMPANIES INC
This will purchase 26.99 shares of 'T:ATT' at $37.1 per share.
This will purchase 6.87 shares of 'LOW:LOWES COMPANIES INC' at $145.64 per share.
My code is as follows:
investing = float(input("How much are you wanting to invest today? "))
print("\nBelow are the stocks available for purchase:")
print("\nLast Trade | Symbol | Name ")
for pr, sy, na in sorted(zip(stock_price, stock_symbol, stock_name)):
    print('$', pr, '|', sy, '|', na)
cheapest_stock = min(stock_price)
expensive_stock = max(stock_price)
# Truncate the share count to two decimal places.
purchase_low = (investing / cheapest_stock) // 0.01 / 100
purchase_hi = (investing / expensive_stock) // 0.01 / 100
# The symbol and name are still typed out by hand here.
print(f"\nThis will purchase {purchase_low} shares of 'T:ATT' at ${cheapest_stock} per share.")
print(f"\nThis will purchase {purchase_hi} shares of 'LOW:LOWES COMPANIES INC' at ${expensive_stock} per share.")
Problem
I cannot figure out how to take the low value found by min(stock_price), or the high value found by max(stock_price), and display the matching stock symbol and stock name from the other lists.
So I currently have them typed out manually in the print statements above.
Get the index of the min/max element in the Price list, then read the element at that index from the other lists.
For example, if you want the symbol for the lowest price, the code below gives the required result.
index = Price.index(min(Price))  # index of the minimum element in the Price list
print(Symbol[index])  # prints the element of Symbol at that index
# or, written in one line:
Symbol[Price.index(min(Price))]
Use index() to get the index of an element.
lis.index("foo")  # gives the index of the element "foo" in the list lis
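An alternative to looking up the index is to zip the three lists so each price stays attached to its symbol and name. The following is a minimal sketch, not the asker's full script: pick_stock is a hypothetical helper, the symbol and name spellings follow the question's sample output, and an amount of exactly $1000 is treated as "over" because the question does not say which way it goes.

```python
# Lists from the question (spellings follow its sample output).
Price = [37.10, 46.18, 51.76, 145.64]
Symbol = ['T', 'KO', 'ABT', 'LOW']
Name = ['ATT', 'COCA COLA COMPANY', 'ABBOTT LABORATORIES', 'LOWES COMPANIES INC']


def pick_stock(investing):
    """Return (price, symbol, name, shares) for the amount invested.

    Hypothetical helper: below $1000 picks the cheapest stock, otherwise
    the most expensive one (the exactly-$1000 case is an assumption).
    """
    rows = list(zip(Price, Symbol, Name))  # one (price, symbol, name) tuple per stock
    chooser = min if investing < 1000 else max
    price, symbol, name = chooser(rows, key=lambda row: row[0])
    shares = investing / price // 0.01 / 100  # truncate to 2 decimals, as in the question
    return price, symbol, name, shares


price, symbol, name, shares = pick_stock(1001.57)
print(f"This will purchase {shares} shares of '{symbol}:{name}' at ${price} per share.")
```

Keeping the rows together means the print statement no longer needs any hand-typed symbol or name.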

How to properly understand pandas dataframe merge (how, left_on, right_on)? [duplicate]

I have been trying to wrap my head around merge for a while:
I have the following dataframes:
import pandas as pd
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
I understand that I can merge them in more ways than one:
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
pd.merge(student_df, staff_df, how='left', left_on='Name', right_on='Name')
pd.merge(staff_df, student_df, how='right', left_on='Name', right_on='Name')
pd.merge(student_df, staff_df, how='right', left_on='Name', right_on='Name')
Each produces a slightly different output. Can someone guide me on the proper way to understand how each output is constructed?
Specifically,
Why are the role and school columns always between location_y?
When is the role column beside the name column and when is the school column beside the name column?
I would hold off asking about using left_index and right_on in the same merge statement.
Thanks.
I suggest going through the documentation to understand the merge operation properly. It is well documented with examples, and I couldn't think of a much simpler explanation. See the pandas documentation for merging.
From documentation
left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
Why are the role and school columns always between location_y?
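The columns appear sorted, but merge itself does not sort them: it keeps the left DataFrame's columns in their existing order, followed by the right DataFrame's non-key columns, and adds the _x/_y suffixes to overlapping names. The alphabetical order seen here comes from building each DataFrame from a list of dicts, which older pandas versions sorted by column name (recent versions keep the dict key order instead). To check this, rename a column in the second DataFrame passed to pd.merge so that it starts with a letter earlier than L.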
pd.merge(student_df, staff_df, how='left', left_on='Name', right_on='Name')
             Location_x   Name       School         Location_y            Role
0  1024 Billiard Avenue  James     Business  Washington Avenue          Grader
1  Fraternity House #22   Mike          Law                NaN             NaN
2   512 Wilson Crescent  Sally  Engineering  Washington Avenue  Course liasion
Example output if Role is renamed to Bole:
             Location_x   Name       School            Bole         Location_y
0  1024 Billiard Avenue  James     Business          Grader  Washington Avenue
1  Fraternity House #22   Mike          Law             NaN                NaN
2   512 Wilson Crescent  Sally  Engineering  Course liasion  Washington Avenue
Instead of the two parameters left_on and right_on, you can use on when the key column has the same name in both DataFrames, i.e.
pd.merge(student_df, staff_df, how='left', on='Name')
When is the role column beside the name column and when is the school column beside the name column?
It depends on the order in which you pass the DataFrames. If you specify staff_df first, the columns of student_df are appended after those of staff_df, so Role will be beside the Name column. Similarly, if you specify student_df first, School will be beside the Name column.
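The effect of argument order can be checked directly. The sketch below rebuilds the two frames from the question with an explicit columns= argument (my addition, so the column order does not depend on the pandas version) and prints the merged column order for both argument orders.

```python
import pandas as pd

# Frames from the question; columns= pins the order (an addition for this sketch).
staff_df = pd.DataFrame(
    [{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
     {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
     {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}],
    columns=['Name', 'Role', 'Location'])
student_df = pd.DataFrame(
    [{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
     {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
     {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}],
    columns=['Name', 'School', 'Location'])

# Left columns come first, then the right frame's non-key columns.
staff_first = pd.merge(staff_df, student_df, how='left', on='Name')
student_first = pd.merge(student_df, staff_df, how='left', on='Name')

print(list(staff_first.columns))
print(list(student_first.columns))

# how='left' keeps every row of the left frame, in the left frame's order.
print(staff_first['Name'].tolist())
print(student_first['Name'].tolist())
```

Swapping how='left' for how='right' changes which frame's rows are all kept, but the column order still follows the order of the arguments.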

Load item cost from an inventory table

I have an Inventory Sheet that contains a bunch of data about products I have for sale. I have a sheet for each month where I load in my individual sales. In order to calculate my cost of sales, I enter my product cost for each sale manually. I would like a formula to load the cost automatically, using the product name as a search term.
Inventory Item | Cost      Sold Item | Sale Price | Cost
Product 1      | 2.99      Product 3 | 16.99      | X
Product 2      | 4.99      Product 3 | 14.57      | X
Product 3      | 6.99      Product 1 | 7.99       | X
So basically I am looking to "solve for X".
In addition to this, the product names in the two tables are actually different lengths. For example, one item in my Inventory table may be "This is a very, very long product name that goes on and on for up to 120 characters", while in my products sold table it is truncated to the first 40 characters of the product name. So the formula should only search for the first 40 characters of the product name.
Due to the complicated nature of this, I haven't been able to search for a sufficient solution, since I don't really know exactly where to start to quickly explain it.
UPDATE:
The product names of my Inventory List, and the product names of my items sold aren't matching. I thought I could just search for the left-most 40 characters, but this is not the case.
Here is a sample of products I have in my Inventory List:
Ford Focus 2000 thru 2007 (Haynes Repair Manual) by Haynes, Max
Franklin Brass D2408PC Futura, Bath Hardware Accessory, Tissue Paper Holder, ...
Fuji HQ T-120 Recordable VHS Cassette Tapes ( 12 pack ) (Discontinued by Manu...
Fundamentals of Three Dimensional Descriptive Geometry [Paperback] by Slaby, ...
GE Lighting 40184 Energy Smart 55-Watt 2-D Compact Fluorescent Bulb, 250-Watt...
Get Set for School: Readiness & Writing Pre-K Teacher's Guide (Handwriting Wi...
Get the Edge: A 7-Day Program To Transform Your Life [Audiobook] [Box set] by...
Gift Basket Wrap Bag - 2 Pack - 22" x 30" - Clear [Kitchen]
GOLDEN GATE EDITION 100 PIECE PUZZLE [Toy]
Granite Ware 0722 Stainless Steel Deluxe Food Mill, 2-Quart [Kitchen]
Guess Who's Coming, Jesse Bear [Paperback] by Carlstrom, Nancy White; Degen, ...
Guide to Culturally Competent Health Care (Purnell, Guide to Culturally Compe...
Guinness World Records 2002 [Illustrated] by Antonia Cunningham; Jackie Fresh...
Hawaii [Audio CD] High Llamas
And then here is a sample of the product names in my Sold list:
My Interactive Point-and-Play with Disne...
GE Lighting 40184 Energy Smart 55-Watt 2...
Crayola® Sidewalk Chalk Caddy with Chalk...
Crayola® Sidewalk Chalk Caddy with Chalk...
First Look and Find: Merry Christmas!
Sesame Street Point-and-Play and 10-Book...
Disney Mickey Mouse Board Game - Duck Du...
Nordic Ware Microwave Vegetable and Seaf...
SmartGames BACK 2 BACK
I have played around with searching for the left-most characters, minus 3. This did not work correctly. I have also switched the [range lookup] between TRUE and FALSE, but this has also not worked in a predictable way.
Use the VLOOKUP function. Augment the lookup_value parameter with the LEFT function.
        
For example, with the lookup_value written as LEFT(E2, 9), the Sold Item name in E2 is truncated to its first 9 characters before it is looked up in the Inventory Item column.
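The lookup logic can be sketched in Python to make the matching rule explicit. This is only an illustration of the idea, not a replacement for the spreadsheet formula: lookup_cost is a hypothetical helper, the product names are taken from the samples in the question, and the costs are made-up placeholder values. It strips the trailing "..." from a truncated sold name and matches it as a prefix of the inventory names, which roughly corresponds to an exact-match VLOOKUP whose lookup_value ends in a * wildcard.

```python
# Inventory names from the question; the costs are hypothetical placeholders.
inventory = {
    'GE Lighting 40184 Energy Smart 55-Watt 2-D Compact Fluorescent Bulb, 250-Watt...': 4.99,
    'Hawaii [Audio CD] High Llamas': 2.99,
}


def lookup_cost(sold_name, inventory, prefix_len=40):
    """Return the cost of the first inventory item starting with the sold name's prefix.

    Hypothetical helper; returns None when nothing matches.
    """
    prefix = sold_name
    if prefix.endswith('...'):
        prefix = prefix[:-3]  # drop the truncation marker
    prefix = prefix[:prefix_len]  # compare at most prefix_len characters
    for name, cost in inventory.items():
        if name.startswith(prefix):
            return cost
    return None


print(lookup_cost('GE Lighting 40184 Energy Smart 55-Watt 2...', inventory))
print(lookup_cost('SmartGames BACK 2 BACK', inventory))
```

A sold item with no inventory entry comes back as None, which mirrors the #N/A a lookup formula would show for the same case.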

Excel Charts for table with many duplicates

I have a list of names with 1000 entries, of which maybe 750 are unique. There are other attributes, like location and position. Can I create a pivot table that would show me simple stats like X number of unique people, X number unique in location 1, location 2, location 3, and finally X number of positions in location 1, position 2/location 1, position 3/location 2...?
Name Location Title
Smith, Bob UK Sales Manager
Smith, Bob UK Plant Manger
Jones, Keith UK Sales Manager
Jones, Keith UK Plant Foreman
White, Derick Denver Sales Manager
Brown, Frank Boston Supply Chain
Black, Jay Denver Sales Manager
Smith, Jeff Denver Sales Manager
Gonzalez, Al UK
Gonzalez, Al UK Staging Area Manager
Bright, Susan Denver Legel Secretary
Bright, Susan Denver Paralegal
Bright, Susan Denver Executive Assistant
Bright, Susan Denver Press Secretary
Alf, Jeff Denver VP, Sales
Green, Burt Boston VP, Sales
Jones, Chuck Denver Plant Foreman
Alten, Cory Denver Sales Manager
Clark, Jerry Boston Plant Foreman
Romo, Tom Denver Sales Manager
You may want to consider using COUNTIF/COUNTIFS formulas in columns adjacent to the data. For example, to count unique people, create a helper column (call it column X for this example), enter =COUNTIF(Column A, A1) and copy it down for all the rows. Then =COUNTIF(Column X, 1) gives the number of names that appear exactly once. To count distinct names, including those that are repeated, one option is to fill the helper column with =1/COUNTIF(Column A, A1) and sum it.
You can use the COUNTIFS function for more complicated counting logic to answer the other questions.
I have done similar counting of unique entries using pivot tables.
I create an additional column that has a 1 if the row is the first occurrence of the "key" (i.e. the name). The formula is similar to this (assume column A has the "key"):
=IF(ROW(A1) = MATCH(A1,A:A,0),1,0)
This formula is then copied down for every row (or if it's a proper table, it's auto copied!)
The idea is that the MATCH returns the row number of the first occurrence of "key". If it is the same as the current row, then count 1. If the values aren't the same, it's a duplicate, so give it a 0 value.
When you then do a pivot table sum on this value, it adds up to the number of unique entries (i.e. unique names in a region).
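The first-occurrence flag can be sketched in Python to show what the pivot table ends up summing. This is an illustration of the logic only, using a few rows copied from the question's table; the variable names are hypothetical.

```python
from collections import defaultdict

# A few (name, location, title) rows copied from the question's table.
rows = [
    ('Smith, Bob', 'UK', 'Sales Manager'),
    ('Smith, Bob', 'UK', 'Plant Manger'),
    ('Jones, Keith', 'UK', 'Sales Manager'),
    ('White, Derick', 'Denver', 'Sales Manager'),
    ('Brown, Frank', 'Boston', 'Supply Chain'),
]

names = [name for name, _, _ in rows]

# 1 where the row is the first occurrence of its name, else 0.
# list.index plays the role of MATCH(A1, A:A, 0) in the formula.
first_flags = [1 if names.index(name) == i else 0 for i, name in enumerate(names)]
unique_people = sum(first_flags)

# Summing the flag per location is what the pivot table does.
per_location = defaultdict(int)
for (_, location, _), flag in zip(rows, first_flags):
    per_location[location] += flag

print(first_flags)
print(unique_people)
print(dict(per_location))
```

Note that the flag is keyed on the name alone, so a person listed under two locations would only be counted in the location of their first row.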
