String matching keywords and key phrases in Python - python-3.x

I am trying to perform a smart dynamic lookup with strings in Python for an NLP-like task. I have a large number of similarly structured sentences, and I would like to parse each one and tokenize parts of it. For example, I first parse a string such as "bob goes to the grocery store".
I take this string, split it into words, and my goal is to look up matching words in a keyword list. Let's say I have a list of single keywords such as "store" and a list of keyword phrases such as "grocery store".
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
for word in sample.split():
    # do dynamic length lookups
    pass
Now the issue is this: sometimes my sentences might be simply "bob goes to the store" instead of "bob goes to the grocery store".
I want to find the keyword "store" for sure but if there are descriptive words such as "grocery" or "computer" before the word store I would like to capture that as well. That is why I have the keyphrases list as well. I am trying to figure out a way to basically capture a keyword at the very least then if there are words related to it that might be a possible "phrase" I want to capture those too.
Maybe an alternative is to have some sort of adjective list instead of a phrase list of multiple words?
How could I go about doing this sort of variable-length lookup, where I look at more than just a single word once a keyword is captured? Or is there an entirely different method I should be considering?

Here is how you can use a nested for loop and a formatted string:
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery', 'computer', 'coffee']
for kw in keywords:
    for kp in keyphrases:
        if f"{kp} {kw}" in sample:
            print(f"{kp} {kw}")  # Do something with the matched phrase
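If you also want the bare keyword when no descriptive word precedes it, one possible sketch is to walk the words, stop at each keyword, and look backwards for the longest known phrase that ends there. The function name find_keyword_matches is made up for illustration, and the sketch assumes whitespace-separated, punctuation-free sentences like the ones in the question.

```python
def find_keyword_matches(sentence, keywords, keyphrases):
    """Return, for each keyword hit, the longest keyphrase ending there,
    or the bare keyword when no keyphrase matches."""
    words = sentence.lower().split()
    keyword_set = set(keywords)
    # Store phrases as tuples of words so windows of words can be compared directly.
    phrase_set = {tuple(p.lower().split()) for p in keyphrases}
    max_len = max((len(p) for p in phrase_set), default=1)

    matches = []
    for i, word in enumerate(words):
        if word not in keyword_set:
            continue
        best = word  # fall back to the single keyword
        # Try the longest window first, shrinking down to two words.
        for n in range(min(max_len, i + 1), 1, -1):
            candidate = tuple(words[i - n + 1 : i + 1])
            if candidate in phrase_set:
                best = " ".join(candidate)
                break
        matches.append(best)
    return matches


keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
print(find_keyword_matches('bob goes to the grocery store', keywords, keyphrases))
print(find_keyword_matches('bob goes to the store', keywords, keyphrases))
```

The first call should report the phrase and the second only the keyword. Punctuation handling and phrases whose keyword is not the last word are left out of this sketch.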

Related

Arango wildcard query

I am working on building a simple Arango query where, if the user enters "foo bar" (starting to type Foo Barber), the query returns results. The issue I am running into is going from a normal single space-separated string (i.e. imagine LET str = "foo barber" at the top) to having multiple wildcard queries like those shown below.
Also, open to other queries that would work for this, i.e. LIKE, PHRASE or similar.
The goal is when we have a single string like 'foo bar', search results are returned for Foo Barber and similar.
FOR doc IN movies
  SEARCH PHRASE(doc.name, [
    {WILDCARD: ["%foo%"]},
    {WILDCARD: ["%bar%"]}
  ], "text_en")
  RETURN doc
If you want to find Black Knight but not Knight Black if the search phrase is black kni, then you should probably avoid tokenizing Analyzers such as text_en.
Instead, create a norm Analyzer that removes diacritics and allows for case-insensitive searching. In arangosh:
var analyzers = require("@arangodb/analyzers");
analyzers.save("norm_en", "norm", {"locale": "en_US.utf-8", "accent": false, "case": "lower"}, []);
Add the Analyzer in the View definition for the desired field (should be title and not name, shouldn't it?). You should then be able to run queries like:
FOR doc IN movies SEARCH ANALYZER(STARTS_WITH(doc.title, TOKENS("Black Kni", "norm_en")[0]), "norm_en") RETURN doc
FOR doc IN movies SEARCH ANALYZER(LIKE(doc.title, TOKENS("Black Kni%", "norm_en")[0]), "norm_en") RETURN doc
FOR doc IN movies SEARCH ANALYZER(LIKE(doc.title, CONCAT(TOKENS(SUBSTITUTE("Black Kni", ["%", "_"], ["\\%", "\\_"]), "norm_en")[0], "%")), "norm_en") RETURN doc
The search phrase Black Kni is normalized to black kni and then used for a prefix search, either using STARTS_WITH() or LIKE() with a trailing wildcard %. The third example escapes user-entered wildcard characters.
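For intuition only, here is roughly what the normalize-then-prefix-match logic amounts to, sketched in plain Python rather than AQL. The helpers normalize and prefix_search are hypothetical, and the accent stripping only approximates what the norm Analyzer does inside ArangoDB.

```python
import unicodedata


def normalize(text):
    """Lowercase the text and strip diacritics (an approximation of a norm Analyzer)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.lower()


def prefix_search(titles, phrase):
    """Return the titles whose normalized form starts with the normalized phrase."""
    needle = normalize(phrase)
    return [title for title in titles if normalize(title).startswith(needle)]


titles = ["Black Knight", "Knight Black", "Bláck Knights"]
print(prefix_search(titles, "Black Kni"))
```

Because the whole title is treated as one string instead of a bag of tokens, "Knight Black" is not returned for "Black Kni".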

Code about replacing certain words in discord.js

I was trying to make the bot replace multiple words in one sentence with another word.
ex: User will say "Today is a great day"
and the bot shall answer "Today is a bad night"
the words "great" and "day" were replaced by the words "bad" and "night" in this example.
I've been searching in order to find a similar code, but unfortunately all I could find is "word-blacklisting" scripts.
I tried to do some coding with it, but I am not an expert with node.js and the code is written really badly. It's not even worth showing, really.
The user will say some sentence, and the bot will recognize some predetermined words in the sentence and replace them with other words that I decide in the script.
We can use String.replace() combined with Regular Expressions to match and replace single words of your choosing.
Consider this example:
function antonyms(string) {
  return string
    .replace(/(?<![A-Z])terrible(?![A-Z])/gi, 'great')
    .replace(/(?<![A-Z])today(?![A-Z])/gi, 'tonight')
    .replace(/(?<![A-Z])day(?![A-Z])/gi, 'night');
}
const original = 'Today is a tErRiBlE day.';
console.log(original);
const altered = antonyms(original);
console.log(altered);
const testStr = 'Daylight is approaching.'; // Note that this contains 'day' *within* a word.
const testRes = antonyms(testStr); // The lookarounds in the regex prevent replacement.
console.log(testRes); // If this isn't the desired behavior, you can remove them.
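The same idea can also be done in a single pass with one lookup table, which avoids one replacement feeding into the next. Below is a small sketch of that variant in Python (not discord.js), using the words from the question; REPLACEMENTS and swap_words are illustrative names, and \b word boundaries stand in for the lookarounds above.

```python
import re

# Illustrative mapping taken from the question; keys must be lowercase.
REPLACEMENTS = {"great": "bad", "day": "night"}

# One alternation of all keys, matched only as whole words, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b",
    re.IGNORECASE,
)


def swap_words(sentence):
    """Replace every whole-word match with its mapped word in a single pass."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0).lower()], sentence)


print(swap_words("Today is a great day"))
print(swap_words("Daylight is approaching."))
```

Words such as "Today" and "Daylight" are left alone because "day" is not a whole word there.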

FuzzyWuzzy for very similar records in Python

I have a dataset with which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy in this way
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
Where t is the string and dev2 is the list to compare against. My problem is that some records are very similar, and the scorers provided by FuzzyWuzzy seem to be lacking. I've tested token_sort, token_set, partial_token sort and set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one wanted is obviously the second one, but the first one, which is a different league, is closer character by character.
This happens with teams as well. If, say, I have the string Buchtholz, I obtain Buchtholz II before I get TSV Buchtholz.
My main guess now would be to try and weight the presence and absence of several characters more heavily, like single capital letters at the end of the string, so if there is a difference in the letter or an absence it is weighted as less close. Or for () and special characters.
I don't know if there is a way to take this into account or you guys have a better approach to get the string that really matches.
Similarity matches often require knowledge of the data being analysed. i.e. it is not just a blind single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut off scores and working toward more exclusive/pessimistic approaches with higher cut off scores until you have a clear winner. If you know more about the text you're analyzing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions the numbers sequences were more important than the text. e.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000" the 250 and 2000 part of the string are important, otherwise I get a "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON" which is a better result.
I put the similarity match process through two steps: 1. Get a bunch of similar matches using an optimistic matching scorer (token_set_ratio) 2. get the number sequences of these results and pass them through another round of matching with a more strict scorer (token_sort_ratio). Doing this gave me the better result in the example I showed above.
Below are some blocks of code that could be of assistance.
Here's a function to get the numbers from a description. (In your case you might use this to exclude numbers from your string instead?)
def get_numbers_from_string(description):
    # keep digits, '.' and '-'; turn everything else into a space
    numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
    # collapse runs of whitespace into single spaces
    numbers = ' '.join(numbers.split())
    return numbers
and here is a portion of the code I used to put the description match through two rounds:
try:
    # get close match from goods move that has material numbers
    df_material = pd.DataFrame(
        process.extract(description, corpus_material, scorer=fuzz.token_set_ratio),
        columns=['Similar Text', 'Score'],
    )
    if df_material['Score'][df_material['Score'] >= cut_off_accuracy_materials].count() >= 1:
        similar_text = df_material['Similar Text'].iloc[0]
        score = df_material['Score'].iloc[0]
        if nr_description_numbers > 4:
            # if there are multiple matches found, then get best number combination match
            df_material = df_material[df_material['Score'] >= cut_off_accuracy_materials]
            new_corpus = list(df_material['Similar Text'])
            new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
            df_material['numbers'] = new_corpus
            df_numbers = pd.DataFrame(
                process.extract(description_numbers, new_corpus, scorer=fuzz.token_sort_ratio),
                columns=['numbers', 'Score'],
            )
            similar_text = df_material['Similar Text'][df_material['numbers'] == df_numbers['numbers'].iloc[0]].iloc[0]
            nr_score = df_numbers['Score'].iloc[0]
# (the excerpt ends here; the matching except clause is not shown)
hope it helps, and good luck
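For readers who want to try the two-round idea without the surrounding variables, here is a self-contained sketch using only the standard library. It does not use FuzzyWuzzy: optimistic_score and strict_number_score are simplified stand-ins (token containment and set overlap of numbers) for token_set_ratio and token_sort_ratio, and the cut-off of 60 is an arbitrary assumption.

```python
import re


def word_tokens(text):
    """Upper-cased alphanumeric tokens of a description."""
    return set(re.findall(r"[A-Z0-9]+", text.upper()))


def number_tokens(text):
    """All digit runs found in a description."""
    return set(re.findall(r"\d+", text))


def optimistic_score(query, candidate):
    """Round 1 stand-in: share of query tokens occurring anywhere in the candidate (0-100)."""
    query_tokens = word_tokens(query)
    if not query_tokens:
        return 0.0
    haystack = candidate.upper()
    hits = sum(token in haystack for token in query_tokens)
    return 100 * hits / len(query_tokens)


def strict_number_score(query, candidate):
    """Round 2 stand-in: overlap of the two number sets (0-100)."""
    q, c = number_tokens(query), number_tokens(candidate)
    if not q and not c:
        return 100.0
    return 100 * len(q & c) / len(q | c)


def two_round_match(query, corpus, cut_off=60):
    """Shortlist with the optimistic scorer, then pick the best number match."""
    shortlist = [c for c in corpus if optimistic_score(query, c) >= cut_off]
    if not shortlist:
        return None
    return max(shortlist, key=lambda c: strict_number_score(query, c))


corpus = [
    "SLURRY VALVE 50MM RAGMAX 2000",
    "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON",
    "PUMP IMPELLER 300MM",
]
print(two_round_match("SLURRY VALVE 250MM RAGMAX 2000", corpus))
```

With real data you would swap the two scoring functions for the FuzzyWuzzy scorers and tune the cut-off per round.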

How to extract relationships from a text

I am new to NLP and need guidance on how I can solve this problem.
I am currently working on a filtering technique where I need to brand data in a database as either correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that follows those conditions should be branded as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[1-9]%`
How can I achieve the above result using the Stanford NLP? (or any other NLP library).
This doesn't look like a machine learning problem; it's a simple parser. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
    "are bigger than" : ">",
    "are smaller than" : "<",
    "contain" : "LIKE",
    # ...
}
value_desc = {
    "numeric characters" : "%[1-9]%",
    # ...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?
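As a rough illustration of the lexicon approach, the sketch below parses the four example sentences from the question with plain string operations. The function name parse_condition is made up, the handling of the connecting words "which"/"that" and of "values in <column>" targets is an assumption based only on those four sentences, and the %[1-9]% pattern is copied from the question as-is (note that [0-9] would also catch zeros).

```python
import re

# Lexicons taken from the example above; extend them for your own rules.
COLUMN_TRIGGER = "Values in the column"
RELATIONS = {
    "are bigger than": ">",
    "are smaller than": "<",
    "contain": "LIKE",
}
VALUE_DESCRIPTIONS = {"numeric characters": "%[1-9]%"}


def parse_condition(line):
    """Turn one filtering sentence into a 'column operator target' string."""
    line = line.strip()
    if not line.startswith(COLUMN_TRIGGER):
        raise ValueError(f"unrecognised sentence: {line!r}")
    column, _, rest = line[len(COLUMN_TRIGGER):].strip().partition(" ")
    # Drop the connecting word ("which" / "that") before the relationship.
    rest = re.sub(r"^(which|that)\s+", "", rest.strip())
    for phrase, symbol in RELATIONS.items():
        if rest.startswith(phrase):
            target = rest[len(phrase):].strip()
            break
    else:
        raise ValueError(f"unknown relationship in: {line!r}")
    # "values in StartDate" names another column, so keep only the column name.
    if target.startswith("values in "):
        target = target[len("values in "):]
    # Map descriptive targets ("numeric characters") to their symbolic form.
    target = VALUE_DESCRIPTIONS.get(target, target)
    return f"{column} {symbol} {target}"


rules = [
    "Values in the column ID which are bigger than 99",
    "Values in the column Cash which are smaller than 10000",
    "Values in the column EndDate that are smaller than values in StartDate",
    "Values in the column Name that contain numeric characters",
]
for rule in rules:
    print("if", parse_condition(rule))
```

Sentences that do not fit the lexicons raise a ValueError, which makes it easy to spot rules that need a new lexicon entry.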

Replace words marked with offsets

I have a sentence like this:
"My name is Bond. It's a fake name."
and I have to replace some words in a list with offsets of each word:
name, 29-33; Bond, 11-15; name, 3-7
In addition, each word must be replaced with a specific word:
name -> noun
Bond -> proper
I have to obtain this output:
"My noun is proper. It's a fake noun."
I tried to manage the offsets with a post-offset variable that I update after each replacement, but that does not work because the list is unordered. Note that the find method does not work either, because names are repeated. Is there an algorithm to do it? Is there a vectorized implementation (String, Numpy, NLTK) that computes it in one step?
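Check this one:
string = "My name is Bond. It's a fake name."
replacements = {"name": "noun", "Bond": "proper"}
words = string.split(" ")  # break the string into words
Now traverse the list and replace the matching words. Assigning to the loop variable would not change the list, so write back by index, and set trailing punctuation aside before comparing:
for idx, word in enumerate(words):
    core = word.rstrip(".,!?")  # "Bond." -> "Bond"
    if core in replacements:
        words[idx] = replacements[core] + word[len(core):]
result = " ".join(words)  # join the list back into a string
Now the list values are changed and join() turns the list back into a string. Note that this matches by word rather than by offset, so it replaces every occurrence of a listed word.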
For more Please refer to this website https://www.tutorialspoint.com/python/python_strings.htm
This page contains all the data related to string processing in python.
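If the offsets have to be honoured, one possible approach is to apply the replacements from the highest start offset to the lowest, so that earlier offsets stay valid even though the replacement words have different lengths. The sketch below assumes the offsets are start-inclusive and end-exclusive, which matches the example in the question; replace_by_offsets is an illustrative name.

```python
def replace_by_offsets(text, spans, mapping):
    """Replace each (word, start, end) span with mapping[word].

    Spans may come in any order; they are applied right to left so that a
    replacement never shifts the offsets of spans still to be processed.
    """
    for word, start, end in sorted(spans, key=lambda span: span[1], reverse=True):
        if text[start:end] != word:
            raise ValueError(f"offsets {start}-{end} do not point at {word!r}")
        text = text[:start] + mapping[word] + text[end:]
    return text


sentence = "My name is Bond. It's a fake name."
spans = [("name", 29, 33), ("Bond", 11, 15), ("name", 3, 7)]
mapping = {"name": "noun", "Bond": "proper"}
print(replace_by_offsets(sentence, spans, mapping))
```

Sorting takes care of the unordered list, and the sanity check catches offsets that do not line up with the expected word.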
