SAS - Reducing list of unique names to a smaller list of generic names based on string similarity within the list

I have a list of 18,000 unique entities in SAS. In most cases, individual entities refer back to a broader general entity but differ in spelling. For example, "BANK OF MADEUPBANK" and "BK OF MADEUPBANK" exist in the list as unique entities even though they are in fact the same entity. My goal is to group the entities that are actually the same under a generic customer name. For example, the generic name "BANK OF MADEUPBANK" would apply to "BK OF MADEUPBANK", "BANK OF MADEUP", "BK OF MAD UP", etc. For some common, large entities this is straightforward to achieve programmatically with various string searches. However, I can't write a string search for obscure entities that I wouldn't even know are in the list without reading through all 18,000. I'm wondering if there is a way to apply the logic I used on this made-up bank to catch all related instances within the list of 18,000, but without knowing in advance what the general entity is. Is there a technique that can go through the list of names and group them based on similarity?
Thanks!
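
One way to sketch the kind of grouping described above is a greedy pass that compares each name against the representatives of the groups found so far. The example below is in Python rather than SAS and uses the standard library's difflib ratio as the similarity measure; the function names, the 0.75 threshold, and the "longest name becomes the generic name" rule are all assumptions for illustration, not an established method. SAS has its own edit-distance functions (for example COMPGED and COMPLEV) that could play the role of similarity() here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def group_names(names, threshold=0.75):
    """Greedily put each name in the first group whose representative is similar enough.

    The threshold is an assumed starting point and needs tuning on real data.
    """
    groups = {}  # representative (generic) name -> list of member names
    # Visit longer names first so the fullest spelling becomes the representative.
    for name in sorted(names, key=len, reverse=True):
        for rep in groups:
            if similarity(name, rep) >= threshold:
                groups[rep].append(name)
                break
        else:
            # No existing group is close enough: start a new one.
            groups[name] = [name]
    return groups


names = [
    "BANK OF MADEUPBANK",
    "BK OF MADEUPBANK",
    "BANK OF MADEUP",
    "ACME WIDGETS INC",
    "ACME WIDGETS",
]
for generic, members in group_names(names).items():
    print(generic, "<-", members)
```

Each name is compared only against group representatives, not against every other name, which keeps the work manageable for a list of 18,000. The resulting groups would still need a manual review, since a single threshold will both miss heavy abbreviations and occasionally merge unrelated names.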

Related

Multiple analyzers for a single field in a search index of Azure Cognitive Search

We need two different types of search (chosen based on user input), partial and exact, for a few of our fields. To support this, we need two different analyzers for each of those fields.
The problem is that I'm not able to configure two analyzers for a single field. The only option I see is to create two different indexes and query the appropriate one based on the user input. That is clearly not the right solution: it is not scalable, the data is mostly redundant, and it takes almost double the space.
I'm trying to create a duplicate field in the same index with a different analyzer and pick which field to query based on the user input, but I'm not sure how to configure that in the index. The field name is what is searched against at query time. Is it possible to have two fields with different names that actually point to one field but have different analyzers?

You can have two fields with different names that point to one field but use two different analyzers. This can be done using field mappings in the indexer definition.
I created an index with two new fields named cont01 and cont02.
These two new fields point to the field merged_content, each with a different analyzer.
In the indexer definition, I configured field mappings from merged_content to the two new fields.
I then ran the indexer to populate them.
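
The answer's screenshots are not available here, so the following is only a sketch of what such definitions might look like, written as Python dictionaries that serialize to the JSON sent to the service. The field names cont01, cont02, and merged_content come from the answer; the key field "id", the analyzer choices ("standard.lucene" and "whitespace"), and the helper field_for() are assumptions for illustration. If merged_content is produced by a skillset rather than read directly from the data source, the mappings may belong under outputFieldMappings instead.

```python
import json

# Index fields: two searchable copies of the same content, each with its own analyzer.
# The analyzer names are placeholders, not the settings used in the answer.
index_fields = [
    {"name": "id", "type": "Edm.String", "key": True},
    {"name": "cont01", "type": "Edm.String", "searchable": True, "analyzer": "standard.lucene"},
    {"name": "cont02", "type": "Edm.String", "searchable": True, "analyzer": "whitespace"},
]

# Indexer field mappings: the same source field feeds both target fields.
indexer_definition = {
    "fieldMappings": [
        {"sourceFieldName": "merged_content", "targetFieldName": "cont01"},
        {"sourceFieldName": "merged_content", "targetFieldName": "cont02"},
    ]
}


def field_for(search_mode: str) -> str:
    """Pick which index field to query for a given (hypothetical) search mode."""
    return "cont01" if search_mode == "partial" else "cont02"


print(json.dumps(indexer_definition, indent=2))
print(field_for("exact"))
```

At query time, the application would then restrict the search to whichever field matches the user's chosen mode.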

Arcade expression to calculate unique IDs in ArcGIS Pro

I have a field that I want to be automatically populated with a unique ID whenever a new record is created. I'm pretty rusty at working with Arcade and Attribute Rules. I've figured out how to make a number automatically populate through a sequence and attribute rule, but I don't know how to make the rule take into account the values already in the table.
Using NextSequenceValue, the rule will add an ID that is unique to the new sequence that I created, but it is not unique from the other records that already have IDs. This is an old dataset with loads of different IDs that don't necessarily follow a predictable pattern, otherwise I would just choose my sequence start appropriately (some IDs are in the 100s, some in the 1000s, some even 100,000s, etc.).
I basically want to perform a check where the rule assesses if the ID is unique and if it's not, it adds 1 or something until it is unique.
I tried using a sequence but it doesn't take into account already existing IDs so they aren't truly unique.
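
The check described above (take a candidate ID and add 1 until it no longer collides) can be sketched as follows. This is plain Python to show the logic only, not Arcade; the function name and the sample IDs are made up, and in an attribute rule the existing IDs would have to be read from the table itself.

```python
def next_unique_id(candidate, existing_ids):
    """Return the first ID at or above `candidate` that is not already in use."""
    used = set(existing_ids)  # set lookup keeps each membership check cheap
    while candidate in used:
        candidate += 1
    return candidate


# Hypothetical legacy IDs with no predictable pattern.
existing = [100, 101, 1000, 100000]
print(next_unique_id(100, existing))
print(next_unique_id(5, existing))
```

An alternative with the same effect is to start the sequence above the largest existing ID, which avoids the per-record lookup but leaves the gaps between old IDs unused.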

The implication of @search.score in Azure Search Service

I understand the reason for having a scoring profile and boosting results based on some fields, e.g. distance, rating, etc. To me, that is mostly applicable to structured documents like JSON files. The scenario I cannot make sense of is when the indexer indexes, say, an MS Word or PDF document in Azure Blob storage. We have two entries, "id" and "content", and I don't know how the search score would apply to them.
For example, there are two documents with different contents. I searched for a keyword, and the same keyword was found in both documents, yet the two MS Word documents got two different scores. My question is why the score should differ when both documents contain the same keyword.

The score is determined by many factors, for example, the count of terms in each document, and the number of searchable fields in which query terms were found. In your example, the documents have different lengths, so naturally they'll have different scores. HTH.
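
To illustrate why document length alone changes the score, here is a small sketch of a BM25-style score for a single term. It assumes BM25-like ranking with the common default parameters k1 = 1.2 and b = 0.75; it is a simplified textbook formula, not the service's exact implementation, and the document lengths are invented.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.2, b=0.75):
    """Simplified BM25 score of one query term in one document."""
    # Rarer terms get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency is damped and normalized by document length relative to the average.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm


# Two documents that each contain the keyword once, but differ in length.
short_doc = bm25_term_score(tf=1, doc_len=100, avg_doc_len=550, n_docs=2, docs_with_term=2)
long_doc = bm25_term_score(tf=1, doc_len=1000, avg_doc_len=550, n_docs=2, docs_with_term=2)
print(short_doc > long_doc)
```

Under this formula, a single occurrence of the keyword counts for more in the shorter document, so the two documents get different scores even though both contain the term.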

Sharepoint 2010 Exclusive columns?

Does anyone know a way to have two columns in a SP2010 list that are exclusive? I need to ensure that there is only a value for ONE column, not both.
Basically I need the following structure:
Category List ->
SubCategory List (with a lookup to Category) ->
Value (with a lookup to SubCategory).
But, if there is no SubCategory, use a lookup to Category. One or the other must be used, not both.

Either a list Event Receiver (SPItemEventReceiver) or a custom workflow should be able to enforce these semantics.
I do not believe the model itself can represent such relationships. An alternative might be to have different content types, for which only one of the columns applies (to each).
Happy coding.

Matching "fuzzy" data based on several inputs

I have a search and matching problem:
Inputs
In my database, I have thousands of names, in addition to some other matching characteristics: a few columns of numerical data, and a few columns of other text that helps identify this specific company.
A prospective client has about 500 company names, and then sparsely populated additional characteristics as mentioned above for each of the names.
Current Process
In the past, the process has been a manual one: try to match each name given by the client by searching through the database, find a name "like" the one reported to me, and then verify that the additional characteristics match up. The main issue is that the reported names are not the same as the ones in my database. They often contain abbreviations or only parts of the stored name, and the additional characteristics may be incomplete or only partially matching as well.
Automation
I want to automate this process since it happens frequently. The optimal solution would input one company from the client list along with any of the additional characteristics they filled in for it, and then try to find the top 5 matches in my database.
I've never used Lucene or Sphinx, but they seem to be more document driven. Is there a way to format these inputs so those libraries work for this problem, or instead, what other software tools exist that would work?

To Lucene, a 'document' can easily be a row in a table and I think you will like the fuzzy~ search and search hit scoring capabilities.
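
As a rough illustration of the "top 5 matches" idea, the sketch below ranks database names against one client name using Python's standard difflib module. It is not Lucene and does not reproduce its fuzzy~ operator or scoring; the function name, the 0.6 cutoff, and the company names are made up, and the additional characteristics mentioned in the question are not used.

```python
from difflib import get_close_matches


def top_matches(client_name, database_names, n=5, cutoff=0.6):
    """Return up to `n` database names most similar to `client_name`, best first."""
    # Compare in upper case, but return the names as stored in the database.
    lookup = {name.upper(): name for name in database_names}
    hits = get_close_matches(client_name.upper(), list(lookup), n=n, cutoff=cutoff)
    return [lookup[h] for h in hits]


# Hypothetical database rows (names only).
database = [
    "Acme Widgets Inc",
    "Acme Widget Company",
    "Globex Corporation",
    "Initech LLC",
]
print(top_matches("ACME WIDGETS", database))
```

The candidates returned this way could then be checked against the numerical and text characteristics to confirm or reject each match.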
