Matching "fuzzy" data based on several inputs - search

I have a search and matching problem:
Inputs
In my database, I have thousands of names, in addition to some other matching characteristics: a few columns of numerical data, and a few columns of other text that help identify each specific company.
A prospective client has about 500 company names, along with sparsely populated additional characteristics, as described above, for each of those names.
Current Process
In the past, the process has been a manual one: try to match each name given by the client by searching through the database, find a name "like" the one reported to me, and then verify that the additional characteristics match up. However, the main issue is that the reported names are not the same: they often contain abbreviations or only parts of the name stored in my database, and the additional characteristics may be incomplete or only partially matching as well.
Automation
I want to automate this process since it happens frequently. The optimal solution would input one company from the client list along with any of the additional characteristics they filled in for it, and then try to find the top 5 matches in my database.
I've never used Lucene or Sphinx, but they seem to be more document-driven. Is there a way to format these inputs so those libraries work for this problem, or, failing that, what other software tools exist that would work?

To Lucene, a 'document' can easily be a row in a table, and I think you will like the fuzzy~ search and search-hit scoring capabilities.
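As a rough illustration of the candidate-matching step alone, here is a minimal Python sketch using only the standard library; the company names are made up, and Lucene's fuzzy~ queries, per-field boosts, and hit scoring go well beyond this:

import difflib

# `database_names` stands in for the names column in the database.
# The additional characteristics (numeric columns, other text) would
# be used to verify or re-rank these candidates afterwards.
database_names = [
    "Acme Industrial Supply Co.",
    "Acme Industries",
    "Apex Industrial Supplies",
    "Zenith Manufacturing",
]

def top_matches(client_name, candidates, n=5, cutoff=0.4):
    # Compare case-insensitively; get_close_matches ranks candidates
    # by difflib's similarity ratio and returns the best n.
    lookup = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(client_name.lower(), list(lookup), n=n, cutoff=cutoff)
    return [lookup[h] for h in hits]

print(top_matches("ACME Ind. Supply", database_names))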

Related

Multiple analyzers for a single field in a search index of Azure Cognitive Search

We need two different types of search (based on user input), partial and exact, for a few of our fields, and to meet that requirement we need two different analyzers per field to produce the required output.
Now, the problem is that I'm not able to configure 2 analyzers for a single field. The only option for me is to create two different indexes altogether and then query the respective index based on the user input, but clearly this is not the right solution: it is not scalable, the data is mostly redundant, and it takes almost double the space.
I'm trying to create a duplicate field in the same index with a different analyzer and use one or the other based on the user input, but I'm not sure how to configure that in the index, since the field name is what is searched against at query time. Is there a possibility for me to have 2 different fields with different names, which actually point to one field but have different analyzers?
You can have 2 different fields with different names that actually point to one field with two different analyzers. This can be done using field mappings in the indexer definition.
I created an index with two new fields named cont01 and cont02.
These two new fields both point to the field merged_content but are assigned different analyzers.
In the indexer definition I configured field mappings from merged_content to each of the new fields, ran the indexer, and both fields came back populated and searchable.
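For reference, the relevant piece of the indexer definition looks roughly like this (the REST JSON expressed as a Python dict; the indexer, data source, and index names are placeholders):

indexer_definition = {
    "name": "my-indexer",               # hypothetical
    "dataSourceName": "my-datasource",  # hypothetical
    "targetIndexName": "my-index",      # hypothetical
    "fieldMappings": [
        # The same source field is mapped to both new fields; each
        # target field declares its own analyzer in the index schema.
        {"sourceFieldName": "merged_content", "targetFieldName": "cont01"},
        {"sourceFieldName": "merged_content", "targetFieldName": "cont02"},
    ],
}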

Searching strings between unrelated tables

I'm trying to categorize my service requests based on keywords which can be found either in the request description or the request solution. My list of keywords is independent of my database, so I can't come up with the code necessary to look up various text strings from one table in the columns of another (non-related) table.
My first table (called Comp_build) has 4 columns (one each for Install, Troubleshoot, Security, and Instructions); below is a sample of it with keywords:
Install       Troubleshoot   Security          Instructions
request       error          Security driven   create automated
onboarding    fail                             package
deploy
My second table (called Comp_determination) has 3 columns: Category (calculated column), SR Description, SR Solution
I've played around with:
SWITCH
TRUE
IF
FIND
SEARCH
SUMX
CALCULATE
This is the closest I've been able to find (https://powerpivotpro.com/2014/01/containsx-finding-if-a-value-in-table-1-has-a-matching-value-in-table-2/), but the results aren't consistent. It seems to work correctly if I'm looking up my Install keywords, but if I try it with the Security keywords, my entire Category column is populated with "Security".
category = IF(SUMX(comp_build,FIND(comp_build[Install],comp_determination[SR Description],,0))>0,"Installations","")
I want to identify all cases based on keywords and assign each a request category. Since we receive essentially 4 major categories of requests, I have keywords for each category, located in a separate table, each in its own column. Additionally, I can search for the keywords in two places: either the request description or the request solution.
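A likely cause, though this is an assumption based on the sample table rather than anything visible in the measure: blank cells in the keyword columns. DAX's FIND appears to treat a blank search string as found at position 1 (as Excel's FIND does), so a sparsely populated column like Security would "match" every description, while the fully populated Install column behaves. A Python sketch of the intended logic, with hypothetical data, that skips blanks:

# Mirrors the Comp_build sample above; "" stands in for a blank cell.
comp_build = {
    "Installations": ["request", "onboarding", "deploy"],
    "Troubleshoot": ["error", "fail", ""],
    "Security": ["Security driven", "", ""],
    "Instructions": ["create automated", "package", ""],
}

def categorize(description, solution):
    text = f"{description} {solution}".lower()
    for category, keywords in comp_build.items():
        # Skipping blanks is the crucial step: a blank keyword is
        # "found" in every string, which would label every row with
        # the first sparsely populated category.
        if any(kw.lower() in text for kw in keywords if kw):
            return category
    return ""

print(categorize("Server error during package deployment", ""))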

The implication of #search.score in Azure Search Service

I understand the reason for having a scoring profile and boosting results based on some fields, e.g. distance, rating, etc. To me, that's most likely applicable to structured documents like JSON files. The scenario I cannot make sense of is when the indexer feeds the search index from, say, an MS Word or PDF document in Azure blob storage: the index has just two fields, "id" and "content", and I don't know how the search score applies to them.
For example, there are two documents with different contents. I searched for a keyword, and the same keyword, found in both, resulted in two different scores for the two MS Word documents. My challenge is: why should this score be different when both documents contain the same keyword?
The score is determined by many factors: for example, how often the query terms occur in each document, the length of each document, and the number of searchable fields in which the query terms were found. In your example the documents have different lengths, so naturally they'll have different scores. HTH.
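For intuition, here is a toy, length-normalized TF-IDF-style calculation in Python. It is not Azure's actual formula (the service uses a TF-IDF/BM25-family similarity), and the file names and contents are invented, but it shows why the same keyword can score higher in a shorter document:

import math

# Both documents contain "contract" exactly once, yet the shorter
# one scores higher because the term makes up more of it.
docs = {
    "short.docx": "contract renewal terms and pricing",
    "long.docx": "this longer report mentions contract once among many other words",
}

def score(term, text):
    words = text.split()
    tf = words.count(term) / len(words)     # length-normalized term frequency
    df = sum(term in d.split() for d in docs.values())
    idf = math.log(len(docs) / df) + 1      # smoothed inverse document frequency
    return tf * idf

for name, text in docs.items():
    print(name, round(score("contract", text), 3))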

SAS - Reducing list of unique names to a smaller list of generic names based on string similarity within the list

I have a list of 18,000 unique entities in SAS. In most cases, individual entities refer back to a broader general entity but differ in spelling. For example, "BANK OF MADEUPBANK" and "BK OF MADEUPBANK" would exist in the list as unique entities even though they are in fact the same entity.
My goal is to group the entities that are actually the same under a generic customer name. For example, the generic name "BANK OF MADEUPBANK" would apply to "BK OF MADEUPBANK", "BANK OF MADEUP", "BK OF MAD UP", etc. This is straightforward to achieve programmatically with various string searches for some common, large entities. However, there's no way to apply a string search for obscure entities that I wouldn't even know are in the list without reading through all 18,000.
I'm wondering if there is a way to conduct the logical process I used on this made-up bank to catch all related instances within the list of 18,000, but without knowing in advance what the ultimate general entity is. Is there a technique that can scroll through the list of names and group them based on similarity?
Thanks!
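SAS itself has edit-distance functions such as COMPGED and SPEDIS that fit this problem. As a language-agnostic sketch of the grouping idea (hypothetical names, greedy single-pass clustering), in Python:

import difflib

# Each name joins the first group whose representative is similar
# enough, otherwise it starts a new group. Real entity resolution
# would normalize first (strip punctuation, expand "BK" -> "BANK")
# and use a better-tuned distance.
names = [
    "BANK OF MADEUPBANK",
    "BK OF MADEUPBANK",
    "BANK OF MADEUP",
    "BK OF MAD UP",
    "SOME OTHER ENTITY",
]

groups = {}  # representative name -> members
for name in names:
    rep = difflib.get_close_matches(name, list(groups), n=1, cutoff=0.7)
    if rep:
        groups[rep[0]].append(name)
    else:
        groups[name] = [name]

for rep, members in groups.items():
    print(rep, "->", members)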

How can I delete records from a table that have certain criteria

Rookie question, I know.
I have a table with about 10 fields, one of which is a category field. I need this field to exist because of the multiple types of categories. However, one category in this field is wrong and is duplicating results.
So, can I delete all records in the table that have "Type320" in the CatDescription field, and how? I want to keep everything else as it is in this table; I just need to get rid of the records that have that value in that one field.
Thanks very much!
EDIT: Thanks for the answer; I did not know how to do this, so this is very helpful.
However, this is more complicated than I thought. The raw data I am supplied carries these duplicate records (duplicates only in certain circumstances, but they are easy to isolate). This raw data is given to me on a monthly basis in several spreadsheet forms.
It all relates to these ID numbers and has about 10 fields (xls columns). As I said before, one of these is the Category Description field (sorry, this is not a lookup). In certain places a record automatically duplicates itself on output because, in the database this comes from, it has to have this subcategory for one particular "type".
So... every time there is a duplication, every single bit of information in all fields is exactly the same, with the exception of this CatDescription (one is Type320, and the duplicated record's type is "Type321"). However, there are some instances where Type321 is valid on its own (in which case there is no matching data row with a Type320 CatDescription). By matching, I mean all data in all fields of a particular record.
A very clear absolute: if all fields (the data within) of a record with a Type320 CatDescription match all fields of a record with a Type321 CatDescription, then I can delete the record containing the Type321 CatDescription. This is true because this is the only situation where the duplication occurs; normally not all of this should match.
This allows all unique records with Type320 and Type321 data (that do not match exactly) to stay, just as they should. This makes sense to me (and hopefully to you too :/), but can it be done, and how?
Thanks, because this is way over my head. I would rather know how to do it in Access, but an xls solution is equally appreciated. Heck, I would do it in PPT if it would get the job done! :)
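As an aside, the matching rule described in the edit above can be sketched outside Access. A rough Python/pandas version, where the file name and the CatDescription column name are taken from the question and everything else is hypothetical:

import pandas as pd

# Drop a Type321 row only when a Type320 row matches it on every
# field except CatDescription; standalone Type321 rows survive.
df = pd.read_excel("monthly_data.xlsx")  # hypothetical monthly spreadsheet

other_cols = [c for c in df.columns if c != "CatDescription"]

# All "other field" value tuples that occur on Type320 rows.
type320_keys = set(
    df.loc[df["CatDescription"] == "Type320", other_cols].apply(tuple, axis=1)
)

is_duplicate_321 = (df["CatDescription"] == "Type321") & df[other_cols].apply(
    tuple, axis=1
).isin(type320_keys)

cleaned = df[~is_duplicate_321]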
I would try one of these two queries:
DELETE FROM table WHERE CatDescription LIKE '%Type320%';
DELETE FROM table WHERE CatDescription LIKE '*Type320*';
That is because the Access database engine may be using * (ANSI-89 Query Mode, e.g. DAO) instead of % (ANSI-92 Query Mode, e.g. OLE DB/ADO) for the wildcards.
Alternatively, this works regardless of the ANSI Query Mode:
DELETE FROM table WHERE CatDescription ALIKE '%Type320%';
Note the Access database engine's ALIKE keyword is not officially supported.
Does the CatDescription field look up values from another table? Is it a query of those tables that creates what you call duplicate results?
If so, be careful about blaming the table that has CatDescription. Check the look-up table to see if Type320 is found there in duplicate.
If you don't have the problem isolated correctly, then you're likely to delete good records while not fixing the problem.
