Sphinx sentence-like query search

My task is to find a similar sentence in a database collection.
Could you advise me which query type to use?
Sample:
Search: Welcome to the first sample code.
And let's say the following sentences are fine for my query:
Dbase:
...
Welcome in first movie ...
This is first sample code ...
Welcome!
...
Thanks

If I got it correctly, each sentence in the DB that includes one or more words from the search query is fine.
In this case, you have to use the SPH_MATCH_ANY mode, or SPH_MATCH_EXTENDED2 with the | (OR) operator.
See the documentation on matching modes and on the extended query syntax.
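To see the semantics concretely, here is a minimal Python sketch of the "match any word" behavior — a local simulation of what SPH_MATCH_ANY does, not Sphinx itself; the sentences are the ones from the question plus one made-up non-matching row:

```python
import re

# A row matches if it shares at least one word with the query.
query = "Welcome to the first sample code"
rows = [
    "Welcome in first movie",
    "This is first sample code",
    "Welcome!",
    "Nothing relevant here",
]

terms = set(query.lower().split())
hits = [row for row in rows if terms & set(re.findall(r"\w+", row.lower()))]
print(hits)  # the first three rows match, the last does not
```

Note that this also shows why short words like "to" and "the" are a problem: any sentence containing them would match too, which is what the options below address.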
If you want to exclude such words as "to", "the" and other short words, you have several options:
1) If you are sure that every word shorter than 4 letters should be excluded, add the following line to your sphinx.conf file:
min_word_len = 4
See the min_word_len documentation for details.
2) If you want to exclude specific words, use the stopwords file(s).
Add the following lines to sphinx.conf:
# path to a plain-text file with the words to be excluded (whitespace-separated)
stopwords = /usr/local/sphinx/configuration/stopwords.txt
See the stopwords documentation for details.
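For illustration, a hypothetical stopwords.txt could simply list the short words from the question, separated by whitespace:

```
to the a an in is
```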
Finally, note that I have only covered the basics here; all of this is clearly explained in the documentation, and my examples are taken from there as well.

Related

Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads which solved the issue. regex -
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example, words that look like 's[lt]a[lt]e'. The matching words are 'slate', 'stale', and 'state', but I want to limit the count of 'l' and 't' in the matched word, which means the output should be 'slate' and 'stale'. One obvious solution is the regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario, and the use of positive lookaheads above seemed like a starting point, but I am unable to arrive at a solution.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regex that did not work -
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
Demo on regex101
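The same check can be run locally with Python's re module (a quick sketch; the word list is just the sample words from the question):

```python
import re

# Lookaheads require the letters s, l, a, t, e somewhere in the word;
# the tail s[lt]a[lt]e then enforces the positional pattern.
pattern = re.compile(r"(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e")
words = ["slate", "stale", "state", "steal", "tesla"]
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)  # ['slate', 'stale']
```

'state' fits the positional pattern but fails the (?=.*l) lookahead; 'steal' and 'tesla' fail the positional tail.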
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.
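Anchoring that form with $ turns "at least n occurrences" into "exactly n occurrences". Combining two such anchored lookaheads solves the original slate/stale case — this is my own sketch building on the answer's form, not something from the question:

```python
import re

# Each anchored lookahead requires an exact character count:
# exactly one 't' and exactly one 'l' anywhere in the word.
pattern = re.compile(r"(?=(?:[^t]*t){1}[^t]*$)(?=(?:[^l]*l){1}[^l]*$)s[lt]a[lt]e")
for w in ["slate", "stale", "state"]:
    print(w, bool(pattern.fullmatch(w)))  # state fails: it has two t's
```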

Extracting a string via regex in Pandas for a large dataset

We have a csv file which contains log entries in each row.
We need to extract the thread names from each log entry into a separate column.
What would be the fastest way to do this?
The approach below (string functions) also seems to take a lot of time for large datasets.
We have CSV files with a minimum of 100K entries in each file.
This is the piece of code which extracts the thread name:
df['thread'] = df.message.str.extract(pat = '(\[(\w+.)+?\]|$)')[0]
Below is a sample log entry, from which we are picking out:
[c.a.j.sprint_planning_resources.listener.RunAsyncEvent]
using the regex above.
2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed: com.atlassian.jira.event.issue.IssueEvent#5c8703d0[issue=ABC-61381,comment=<null>,worklog=<null>,changelog=[GenericEntity:ChangeGroup][issue,1443521][author,JIRAUSER39166][created,2020-12-01 05:07:36.377][id,15932782],eventTypeId=2,sendMail=true,params={eventsource=action, baseurl=https://min.com},subtasksUpdated=true,spanningOperation=Optional.empty]
Does anyone know a better/faster method?
The \[(\w+.)+?\] pattern is very inefficient and may cause catastrophic backtracking: it nests quantifiers, and the unescaped . matches any character, so it also matches everything \w does.
You can use
df['thread'] = df['message'].str.extract(r'\[(\w+(?:\.\w+)*)]', expand=False).fillna("")
See this regex demo. Note there is no need adding $ as an alternative since .fillna("") will replace the NA with an empty string.
The regex matches
\[ - a [ char
(\w+(?:\.\w+)*) - Capturing group 1: one or more word chars, followed by zero or more sequences of a . and one or more word chars
] - a ] char.
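A quick check of the extraction on a tiny, made-up DataFrame (the column name message is from the question; the log line is abbreviated):

```python
import pandas as pd

df = pd.DataFrame({"message": [
    "2020-12-01 05:07:36,485-0500 ... [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed",
    "a log line with no thread tag",
]})
# Extract the dotted thread name; rows without a match become "".
df["thread"] = df["message"].str.extract(r"\[(\w+(?:\.\w+)*)]", expand=False).fillna("")
print(df["thread"].tolist())
```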
Your regex takes a whopping 8,572 steps to complete, see https://regex101.com/r/5c3vi7/1
You can use this regex to significantly cut down the regex processing to 4 steps:
\[[^\]]+\]
Do notice the absence of the /g modifier
https://regex101.com/r/6522P8/1
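Checked locally with Python's re module — note that this simpler pattern keeps the surrounding brackets and grabs the first bracketed chunk it finds (the line below is an abbreviated version of the sample log entry):

```python
import re

line = "... /browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed"
# [^\]]+ matches everything up to the next closing bracket, with no backtracking.
m = re.search(r"\[[^\]]+\]", line)
print(m.group())  # [c.a.j.sprint_planning_resources.listener.RunAsyncEvent]
```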

Want to find all results containing specific pattern in Azure Search explorer

I want to find all records containing the pattern "170629-2" in Azure Search explorer. I tried the
query string: customOfferId eq "170629-2*"
which only gives one result back, the exact match "170629-2", but I do not get the records with the patterns "170629-20", "170629-21" or "170629-201".
Two things.
1) You can't use the standard analyzer, as it will break your "words" in two parts:
e.g. 170629-20 will be broken into 170629 and a separate entry 20.
2) You can use a regex and specify the pattern you want:
170629-2+.*
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_regex
PS: use &queryType=full to allow regex

Search a string and list all sentences matching that string

I am trying to write Scala code for the following use case:
search for a string in a text file and list only the sentences that have a match for this string.
I tried using the following:
val fileContents = Source.fromFile("/Users/sc/Documents/Scala_Code/input.txt").getLines.mkString
val sentence = fileContents.filter(line => fileContents.contains("string to search"))
This lists the entire text file even if there is only one match. I need just the sentences that have a match.
I'd appreciate it if someone could provide some input.
I think it's hard to reliably describe a sentence with a regex. Nevertheless, here's my suggestion:
for all sentences (in case you want to pattern match on them):
"""\A?\b((?!\?+"?|!+"?|\.+)(.|\n))+(\Z|\?+"?|!+"?|\.+)""".r.findAllIn(fileContents.mkString) //.toSeq
For a specific string (for example you):
"""\A?\b((?!\?+"?|!+"?|\.+)(.|\n))+(\Z|\?+"?|!+"?|\.+)""".r.findAllIn(fileContents.mkString).toIterator.withFilter(_.contains("you")) //.toSeq
toSeq (or toList) is useful for checking on small amounts of data...
You can test it here: https://scalafiddle.io/sf/0znMzyi/8
Hope it helps.
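The same idea is easy to sketch outside Scala as well: split the text into sentences on terminal punctuation, then keep only the sentences containing the search string. A simplified Python sketch with made-up input — as noted above, real sentence splitting has many edge cases this ignores:

```python
import re

text = "Welcome to the sample. Do you like it? This is another sentence!"
# Split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)
hits = [s for s in sentences if "you" in s]
print(hits)  # ['Do you like it?']
```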

Azure Search: Keyword tokenizer doesn't work with multi-word search

I have fields in an index with [Analyzer(<name>)] applied. This analyzer is of type CustomAnalyzer with tokenizer = Keyword. I assume it treats both the field value and the search text as one term each. E.g.
ClientName = My Test Client (in the index, kept as 1 term). Search term = My Test Client (also 1 term). Result = match.
But surprisingly that's not the case until I apply a phrase search (enclosing the term in double quotes). Does anyone know why, and how to solve it? I'd rather have the search term treated as a whole without the quoting.
Regards,
Sergei.
This is expected behavior. Query text is processed first by the query parser, and only individual query terms go through lexical analysis. When you issue a phrase query, the whole expression between quotes is treated as a single phrase term and goes through lexical analysis as one unit. You can find a complete explanation of this process here: How full text search works in Azure Search.
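A rough illustration of that order of operations — plain Python standing in for the parser; the actual Azure Search pipeline is more involved:

```python
# The query parser splits on whitespace BEFORE lexical analysis, so a
# keyword analyzer never sees the full phrase unless it is quoted.
indexed_term = "My Test Client"            # keyword analyzer: one token
unquoted_terms = "My Test Client".split()  # parser output: three terms
quoted_terms = ["My Test Client"]          # phrase query: one term

print(indexed_term in unquoted_terms)  # False -> no match without quotes
print(indexed_term in quoted_terms)    # True  -> phrase query matches
```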
