Phrase search with slop in Lucene

I am having trouble understanding how this can be a hit.
In my index I have:
wa wb wc wd
And my search term is:
"wd wc wb wa"~6
How can the second query be rearranged into the first with only 6 re-arrangements? My initial assumption was that this needed slop 8 minimum to be a hit (move wa 3 positions left, move wd 3 positions right, move wc 1 position right, move wb 1 position left), but I actually get a hit with slop 6 or more.
Thank you.

The edit distance also includes delete and insert operations. In your case, the following 6 operations can be made to reach a match:
Move wb right
Move wb right
Delete wc
Delete wd
Insert wc
Insert wd
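Another way to arrive at 6, as a minimal sketch. It assumes (this is not stated in the answers here) that the sloppy phrase matcher subtracts each term's offset in the query from its position in the document and requires the spread between the largest and smallest of those values to be at most the slop. The name required_slop is made up for illustration, and the sketch only handles terms that occur exactly once in the document.

```python
def required_slop(query_terms, doc_terms):
    """Smallest slop needed for a match, under the assumption described above."""
    # Position of each term in the document (assumes every term occurs once).
    doc_pos = {term: i for i, term in enumerate(doc_terms)}
    # Shift each document position back by the term's offset in the query.
    shifted = [doc_pos[term] - offset for offset, term in enumerate(query_terms)]
    # An exact in-order phrase gives identical shifted values, so the spread is 0.
    return max(shifted) - min(shifted)


print(required_slop("wd wc wb wa".split(), "wa wb wc wd".split()))  # 6
```

For the query above the shifted values are 3, 1, -1 and -3, so the spread is 3 - (-3) = 6, which agrees with the hit at slop 6.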

The statement in https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop-- that "quick fox" and "the fox is quick" would be at an edit distance of 3 seems to be inaccurate.
The documentation does not say that the edit distance used by Lucene is the Levenshtein distance (insertion, deletion, substitution, all of weight 1), but in a test a "quick fox" PhraseQuery with a slop of 2 was enough to hit the text "the fox is quick" (1 deletion + 1 insertion).

Related

Macro to split up text into different rows depending on a keyword

I have a varying number of rows of text which I paste into excel like those two below. The content will vary slightly but the overall structure will stay the same:
Now I need to split these up and would therefore like a macro which searches for the word "maturity" and selects this word and all the text on the right side of the word and moves it one cell to the right.
I tried splitting it up via text to row, but the position of the word varies and splitting it up via space or comma destroys the rest of the data.
example:
1/ Worst Of Put K UN, WS UQ, XYZ YX maturity 22May2019, 80% strike, size q€7M (€ quanto), BID
2/ Worst Of Put xyz xy, TSLA UQ, KK BK maturity 20Nov2021, 100% strike, size €3.5M (€ quanto), BID
the macro should keep "2/ Worst Of Put xyz xy, TSLA UQ, KK BK" in one cell and move "maturity 20Nov2021, 100% strike, size €3.5M (€ quanto), BID" one cell to the right.
Many thanks for the help,
Ragnar

Step 1) Do a Find/Replace on your data. Find maturity, replace with ~maturity. [Note: This assumes you won't have ~ anywhere in the strings. Use another character if you have ~ somewhere.]
Step 2) Highlight your data, go to Text to Columns, and split on a delimiter ~
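Outside Excel, the same split can be sketched in a few lines of Python. This is only an illustration of the idea (cut each line at the first occurrence of the keyword), not the macro that was asked for; the function name split_at_keyword is made up here.

```python
def split_at_keyword(text, keyword="maturity"):
    """Split text into (part before keyword, keyword plus everything after it)."""
    before, found, after = text.partition(keyword)
    if not found:
        # Keyword missing: leave the whole line in the first cell.
        return text, ""
    return before.rstrip(), found + after


line = "2/ Worst Of Put xyz xy, TSLA UQ, KK BK maturity 20Nov2021, 100% strike, size €3.5M (€ quanto), BID"
left, right = split_at_keyword(line)
print(left)
print(right)
```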

6502 BASIC removing a character

How do I delete a character in 6502 BASIC? Like a reverse print...
Would there be a way to channel the del key code into a program? I have tried looking at the reference but there is nothing there for a command.

As OldBoyCoder suggests, it can be done - but the character depends on the platform.
10 PRINT "AB";
20 PRINT "C"
Output on Commodore PET or Apple II:
ABC
For Commodore PET add:
15 PRINT CHR$(20);
Output:
AC
On the Apple II you'd need to use 8 instead of 20 because the I/O firmware is different.

You can also use the string manipulation commands, such as LEFT$, RIGHT$ and MID$. Here's an untested example from memory (for the Commodore PET and other Commodore 8-bit machines; it probably also works on other variants of BASIC):
0 A$="ABC"
1 PRINT LEFT$(A$,2): REM PRINTS AB
2 PRINT RIGHT$(A$,2): REM PRINTS BC
3 PRINT MID$(A$,2,1): REM PRINTS B
Otherwise, you can over-write characters as long as you know where they are located on the screen.
0 D$=CHR$(17): REM CURSOR DOWN
1 U$=CHR$(145): REM CURSOR UP
2 R$=CHR$(29): REM CURSOR RIGHT
3 L$=CHR$(157): REM CURSOR LEFT
4 PRINT "ABC";L$;" "
I'm using CHR$ codes here for clarity. If you open a string with a double-quote, you can press the UP/DOWN and LEFT/RIGHT keys, which will show up as reverse-video characters. Printing these characters does the same thing as moving the cursor around with the cursor keys.

How to count a keyword in a series of text strings within multiple cells in Excel?

I've found similar ideas on here by using SEARCH or FIND in Excel, but those seem to be more about finding the location of the keyword, rather than counting how many times it comes up.
I have a CSV of a shot list. Each shot is associated with a sequence, and each shot has a set of "tags" (this is the text string). Please see below for an example:
There are two main keywords I'd like to keep track of: "dog" and "fox". There are multiple shots per sequence, and my goal is to figure out how many shots per sequence have the "dog" tag and how many have the "fox" tag. The formula I need would be for the columns highlighted yellow, and I have manually entered the first few entries to give an idea of what number should be there. Once those are filled in, I can then count the ratio per sequence of which ones are tagged more for "dog" or "fox".
I can't use text-to-columns in Excel to easily break down the text string column, because each one contains a different series of tags (somewhat demonstrated by my sample text).
I've figured out a simple formula to count what I want if the text column only had "dog" or "fox" in it, but I can't figure out how to get Excel to find one word within a text string and count it.
=SUMIFS(D:D,B:B,1,F:F,"dog")
1 being the sequence number, and the rest of the columns are referencing my larger data sheet.
Any help would be much appreciated!!
Edit:
Sheet in text form here (sorry about formatting, can't upload a file from work ATM):
COUNTER SAMPLE DATA

Sequence  Total Fox  Total Dog  Total Entries  Ratio Fox  Ratio Dog
1         2          2          4              0.5        0.5
2         3          2          5              0.6        0.4
3                               4
4                               2
5                               3

Sequence  Shot     Text
1         mov_101  The quick brown fox
2         mov_102  jumps over the lazy dog
3         mov_103  The fox and the hound
4         mov_104  fox news
5         mov_105  I am a dog
1         mov_106  The fox and the hound
2         mov_107  jumps over the lazy dog
3         mov_108  The fox and the hound
4         mov_109  jumps over the lazy dog
5         mov_110  I am a dog
1         mov_111  jumps over the lazy dog
3         mov_112  The fox and the hound
5         mov_113  The fox and the hound
2         mov_114  jumps over the lazy dog
2         mov_115  fox news
1         mov_116  I am a dog
3         mov_117  I am a dog
2         mov_118  The fox and the hound

You were close: use COUNTIFS instead of SUMIFS to get the count per sequence, and put "*" wildcards around the words fox and dog so that any surrounding text is allowed.
Here is the formula that I've used to get fox count:
=COUNTIFS($H:$H,$A2,$J:$J,"*fox*")
Place this formula in cell B2 and drag it down.
In the same way, the following formula will get you the dog count per sequence:
=COUNTIFS($H:$H,$A2,$J:$J,"*dog*")
Place this formula in cell C2 and drag it down.
So I tried to replicate your data and this is what I've used:
Let me know if you have any doubts.
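For anyone who wants to check the numbers outside Excel, here is a rough Python equivalent of the two-condition count: same sequence number, and the keyword appears anywhere in the text. The function name count_tagged and the choice of rows are just for illustration; the rows are copied from the sample data in the question.

```python
# (sequence, text) pairs taken from the question's sample data.
rows = [
    (1, "The quick brown fox"),
    (2, "jumps over the lazy dog"),
    (3, "The fox and the hound"),
    (1, "The fox and the hound"),
    (1, "jumps over the lazy dog"),
    (1, "I am a dog"),
]


def count_tagged(rows, sequence, keyword):
    """Count rows of the given sequence whose text contains keyword (ignoring case)."""
    keyword = keyword.lower()
    return sum(1 for seq, text in rows if seq == sequence and keyword in text.lower())


print(count_tagged(rows, 1, "fox"))  # 2
print(count_tagged(rows, 1, "dog"))  # 2
```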

Someone will probably have a better solution than this, but I've used it before when looking for a similar function and couldn't find one.
=(LEN([textcell]) - LEN(SUBSTITUTE([textcell], [wordcell], ""))) / LEN([wordcell])
This compares the length of the original string with the length of the string after the search word has been removed. Dividing the difference by the length of the word gives the number of occurrences that were removed.
So given the following content :
fox dog search
1 0 The quick brown fox
0 1 jumps over the lazy dog
The formula on A2 is
=(LEN($C2) - LEN(SUBSTITUTE($C2,A$1, ""))) / LEN(A$1)
Dollar signs not required, but made it so I could copy the formula to all 4 cells.
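The length-difference trick is easy to try outside Excel too. A small Python sketch of the same arithmetic follows; the function name count_occurrences is made up here. Like the formula, it counts substring matches, so "fox" is also counted inside "foxes".

```python
def count_occurrences(text, word):
    """Count occurrences of word in text via the difference in string lengths."""
    removed = text.replace(word, "")
    # Every removed occurrence shortens the string by len(word) characters.
    return (len(text) - len(removed)) // len(word)


print(count_occurrences("The quick brown fox", "fox"))      # 1
print(count_occurrences("jumps over the lazy dog", "fox"))  # 0
```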

If your Sequence column is E, and the column with text is F, you could use this formula:
=SUMPRODUCT(--(NOT(ISERROR(SEARCH(B$1,$F$2:$F$6)))),--($E$2:$E$6=$A2))
This creates two arrays, one that's a sequence of 1's and 0's where 1 is that the text contains B1 ("fox" or "dog"), and another that is 1 for sequence matching and 0 for not sequence matching.
Then it multiplies and sums the arrays so you only get the count of when both conditions match.
The formula is in cells B2:C3 in my example:
Picture of sample data I used:

finding the shortest subsegment

I wanted to know which algorithm I should apply.
There is a sentence given and a list of words. We have to find the first shortest subsegment that contains all the words in the list of words.
e.g.:
Sentence - this is the best problem i have ever solved
List of words -
is
best
this
The answer should be:
this is the best
If there are many such sub segments then we have to print the one that contains the smallest number of words and appears first in sentence.

Here is my approach to solving the above problem.
1. Take two pointers, head and tail, both pointing to 0.
Move the head pointer until the word it points to is a valid keyword; mark that position as head.
2. Move the tail pointer until the segment contains all the given keywords at least once; mark that position as tail.
This is the first valid subsegment containing all the keywords; calculate its length.
3. Check the frequency of the word at head. If it is greater than 1, move the head pointer forward to a word in the sentence that is a valid keyword and has a frequency of 1.
4. Check whether all the keywords are still present. If yes, calculate the length and store it as the minimum subsegment.
5. If the segment does not contain all the valid keywords, move the tail pointer until all keywords are found and calculate the length as (tail-head+1); if it is greater than the current minimum, ignore it.
6. Continue this process until the last keyword of the given sentence.
The complexity of the above approach is O(n).
For example, let us take this sentence:
Hi this is a funny world this is a good experience with this world
and I need to find 3 keywords:
this
is
world
First, consider two hash tables, namely required and obtained.
Store all the required keywords in the required table.
Start with head and tail at 0. "Hi" is not a valid keyword, so move head forward.
The next word, "this", is a valid keyword, so set the count to 1 and store its position as head. Now head is 1.
Now move the tail pointer. The next word is "is", which is a valid keyword, so increment the count.
"a" and "funny" are not valid keywords, so move tail on to "world".
"world" is a valid keyword, so the count is 3 and tail is 5. Whenever count == number of required keywords (3 in our case), the segment contains all the valid keywords.
Its length is (5-1+1)=5.
Now check the frequency of the word at head. It is 1, so if we moved the head pointer we would no longer have a valid segment.
So move the tail pointer to the next word, "this". Update the frequency of "this" from 1 to 2; the counter becomes 4.
Now we can move the head pointer to the next keyword, "is". The counter goes back to 3, because the first "this" is no longer inside the segment.
The count is 3 again, so calculate the length again: it is 5.
The frequency of the keyword at head, "is", is 1, so move the tail pointer to the next keyword, "is". Now the frequency of "is" is more than 1, so move the head pointer until we reach a valid keyword with frequency 1. That keyword is "world"; head is at 5 and tail is at 7.
The counter is 3, so the length is 7-5+1, which is 3. This is the minimum length found so far.
Now move tail until the frequency of the keyword at head is more than 1. Tail ends up at 13.
Now move head from 5 to 6 and calculate the length: 13-6+1 is 8, so ignore it.
Tail cannot move any further, so print the words from min_head to min_tail as the final result.
in our case the answer is
world this is
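The walkthrough above can be sketched in Python roughly as follows. This is one possible sliding-window implementation, not necessarily identical step for step to the description: it keeps an obtained table of keyword counts inside the window and shrinks from the head whenever every keyword is present. The function name is made up, matching is exact (case-sensitive), and positions are 0-based.

```python
def shortest_subsegment(sentence, keywords):
    """Return the first shortest run of words containing every keyword, or None."""
    words = sentence.split()
    required = set(keywords)
    if not required:
        return None
    obtained = {}   # keyword -> number of occurrences inside the current window
    best = None     # (head, tail) of the shortest window seen so far
    head = 0
    for tail, word in enumerate(words):
        if word in required:
            obtained[word] = obtained.get(word, 0) + 1
        # While the window holds every keyword, record it and shrink from the head.
        while head <= tail and len(obtained) == len(required):
            # Strict "<" keeps the earliest window among equally short ones.
            if best is None or tail - head < best[1] - best[0]:
                best = (head, tail)
            left = words[head]
            if left in required:
                obtained[left] -= 1
                if obtained[left] == 0:
                    del obtained[left]
            head += 1
    if best is None:
        return None
    return " ".join(words[best[0]:best[1] + 1])


text = "Hi this is a funny world this is a good experience with this world"
print(shortest_subsegment(text, ["this", "is", "world"]))  # world this is
```

Each word enters the window once and leaves it at most once, so the running time is linear in the number of words.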

Consider the following simple approach:
Make a dictionary mapping (an enumeration) of each word in the sentence to its position, like this:
this[1] is[2] the[3] best[4] problem[5] i[6] have[7] ever[8] solved[9]
This assumes all the words in the sentence are distinct.
Now take the words from the list one at a time, look up each one's position, and keep track of the maximum and minimum positions. In this case they would be 4 and 1, respectively.
Return the part of the sentence between those limits.
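A short Python sketch of this idea, under the same assumption that every word in the sentence is distinct (with repeated words the dictionary keeps only the last position, and the result is no longer guaranteed to be the shortest). The function name is made up and positions are 0-based here.

```python
def span_by_positions(sentence, keywords):
    """Return the words between the smallest and largest keyword positions."""
    words = sentence.split()
    # Word -> position; only meaningful when all words are distinct.
    position = {word: i for i, word in enumerate(words)}
    hits = [position[k] for k in keywords]
    return " ".join(words[min(hits):max(hits) + 1])


print(span_by_positions("this is the best problem i have ever solved",
                        ["is", "best", "this"]))  # this is the best
```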

Probability question: Estimating the number of attempts needed to exhaustively try all possible placements in a word search

Would it be reasonable to systematically try all possible placements in a word search?
Grids commonly have dimensions of 15*15 (15 cells wide, 15 cells tall) and contain about 15 words to be placed, each of which can be placed in 8 possible directions. So in general it seems like you can calculate all possible placements by the following:
width*height*8_directions_to_place_word*number of words
So for such a grid it seems like we only need to try 15*15*8*15 = 27,000 which doesn't seem that bad at all. I am expecting some huge number so either the grid size and number of words is really small or there is something fishy with my math.

Formally speaking, assuming that x is the number of rows and y is the number of columns, you should sum the possibilities for every direction and every word.
Inputs are: x, y, l (average length of a word), n (total words)
so you have:
Horizontally, a word can start anywhere from 0 to x-l when going right, or from l to x when going left, for each row: 2x(x-l)
The same approach is used for vertical words: they can go from 0 to y-l going down or from l to y going up. So it's 2y(y-l)
For diagonal words you should consider all possible start positions x*y and subtract l^2, since a rectangle of the field can't be used. As before, you multiply by 4 since there are 4 possible directions: 4*(x*y - l^2).
Then you multiply the whole result by the number of words included:
total = n*(2*x*(x-l)+2*y*(y-l)+4*(x*y-l^2))
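As a sanity check, the placements of a single word can also be counted by brute force. The sketch below is not the closed-form expression above; it simply tries every start cell and each of the 8 directions and keeps the placements that stay inside the grid, so its result may differ from the formula. The function name and the example word length of 5 are assumptions for illustration.

```python
# The 8 directions as (row step, column step) pairs.
DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def count_placements(rows, cols, length):
    """Count (start cell, direction) pairs where a word of this length fits."""
    count = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                # Only the last letter needs checking; the cells between start
                # and end are in bounds whenever both ends are.
                end_r = r + dr * (length - 1)
                end_c = c + dc * (length - 1)
                if 0 <= end_r < rows and 0 <= end_c < cols:
                    count += 1
    return count


print(count_placements(15, 15, 5))  # 1144
```

For a 15*15 grid this can never exceed 15*15*8 = 1,800 placements per word, which is the per-word figure behind the 27,000 estimate in the question.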
