Find identical phrases in strings - string

I have sets of texts. The texts are rather short (up to 500 characters). The sets consist of up to 10 texts.
I am trying to find phrases, as long as possible, that occur in most of the texts. In other words, I am looking for identical substrings. The longer the phrase and the more texts it occurs in, the better.
Example:
"The red brown fox jumps over the lazy dog"
"The red haired girl smokes brown cigars"
"Where the yellow fox jumps over the haywire"
"All the boys like a red brown fox"
"A girl like a fox jumps over the dead boys grave"
Phrases (one-word phrases omitted):
"red brown fox", length 13, occurrence 2
"fox jumps over the", length 18, occurrence 3
"The red", length 7, occurrence 2
Phrases like "brown fox" or "fox jumps" are omitted, because they are subphrases of longer phrases.
I am looking for an algorithm to find those phrases.

Finding the longest common substring is a common DP algorithm, explained pretty well here: https://www.geeksforgeeks.org/longest-common-substring-dp-29/ .
After that, to find the number of occurrences of a substring in the set of texts, you can simply use code like this:
# texts is the list of input strings
substring = "red brown fox"
n = 0
for text in texts:
    if substring in text:
        n = n + 1
print(substring, n)
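The DP mentioned above, for two strings only, can be sketched roughly as follows. This is a minimal illustration, not the code from the linked page; the function name longest_common_substring is made up for this example, and it works on characters, so the result may include a leading or trailing space.

```python
def longest_common_substring(a, b):
    """Return one longest substring shared by a and b (character level)."""
    best_len, best_end = 0, 0
    # prev[j] = length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix that ended one character earlier.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


print(longest_common_substring("abcdxyz", "xyzabcd"))
```

This runs in O(len(a) * len(b)) time per pair of texts, which is cheap for texts of up to 500 characters.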

Every substring is a prefix of some suffix, so traversing a generalised suffix tree ought to let us compare paths in two ways: by how many leaves from different texts lie below them, which indicates how many texts share the substring, and by the length of the path from the root, which indicates the length of the shared substring.
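For inputs this small (up to 10 texts of up to 500 characters), a brute-force enumeration of word n-grams may be enough. The sketch below is not a suffix tree; it is a simpler stand-in that counts, for every phrase of at least two words, how many texts contain it, then drops phrases contained in a longer shared phrase. The function name common_phrases and its parameters are assumptions made for this example. Matching is case-sensitive and assumes words are separated by whitespace.

```python
from collections import defaultdict


def common_phrases(texts, min_words=2, min_texts=2):
    """Map each maximal shared phrase to the number of texts containing it."""
    # For every word n-gram, remember which texts it occurs in.
    seen = defaultdict(set)
    for idx, text in enumerate(texts):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + min_words, len(words) + 1):
                seen[tuple(words[i:j])].add(idx)

    shared = {p: len(ids) for p, ids in seen.items() if len(ids) >= min_texts}

    def contained(short, long):
        n = len(short)
        return any(long[k:k + n] == short for k in range(len(long) - n + 1))

    # Assumption: a phrase is dropped whenever a longer shared phrase
    # contains it, even if the shorter one occurs in more texts.
    return {
        " ".join(p): count
        for p, count in shared.items()
        if not any(len(q) > len(p) and contained(p, q) for q in shared)
    }


texts = [
    "The red brown fox jumps over the lazy dog",
    "The red haired girl smokes brown cigars",
    "Where the yellow fox jumps over the haywire",
    "All the boys like a red brown fox",
    "A girl like a fox jumps over the dead boys grave",
]
for phrase, count in common_phrases(texts).items():
    print(repr(phrase), len(phrase), count)
```

On the example texts this reports the three phrases listed in the question plus "like a", which is shared by the last two texts.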

Related

How to convert negations and single words with same repetitive letter

I have a data frame that has a column with text data in it. I want to remove words that mean nothing and convert negations like "isn't" to "is not". Otherwise, when I remove the punctuation, "isn't" becomes "isn t", and when I then remove words shorter than 2 letters, the "t" is deleted completely. So, I want to do the following 3 tasks:
1) convert negations like "isn't" to "is not"
2) remove words that mean nothing
3) remove words shorter than 2 letters
For example, the df column looks similar to this:
user_id  text data column
1        it's the coldest day
2        they aren't going
3        aa
4        how are you jkhf
5        v
6        ps
7        jkhf
The output should be:
user_id  text data column
1        it is the coldest day
2        they are not going
3
4        how are you
5
6
7
How to implement this?
def is_repetitive(w):
    """Predicate, true for words like jj or aaaaa."""
    w = str(w)  # caller should have provided a single word as input
    return len(w) > 1 and all(c == w[0] for c in w[1:])
Feed all words in the corpus to that function,
to accumulate a list of repetitive words.
Then add such words to your list of stop words.
1) Use spaCy's or NLTK's lemmatization tools to convert the strings (though they do other things as well, such as converting plurals to singular, so you may end up needing to write your own code to do this).
2) Use the stop words from NLTK or spaCy to remove the obvious stop words. Alternatively, feed them your own list of stop words (their default stop words are things like is, a, the).
3) Use a basic filter: if len < 2, remove the row.
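Putting the suggestions above together, here is a minimal sketch using only the standard library instead of spaCy or NLTK. The CONTRACTIONS and CUSTOM_STOP_WORDS tables are tiny hypothetical examples that would need to be extended for real data, and clean is a name made up for this illustration. It could be applied to a pandas column with something like df["text"].apply(clean), assuming the column holds strings.

```python
import re

# Hypothetical lookup tables; extend them for real data.
# Note that "it's" can also mean "it has"; this sketch ignores that.
CONTRACTIONS = {"isn't": "is not", "aren't": "are not", "it's": "it is"}
CUSTOM_STOP_WORDS = {"jkhf", "ps"}


def is_repetitive(w):
    """Predicate, true for words like jj or aaaaa."""
    return len(w) > 1 and all(c == w[0] for c in w[1:])


def clean(text):
    """Expand contractions, then drop short, repetitive and junk words."""
    kept = []
    for word in text.lower().split():
        # Expand before stripping punctuation, so "isn't" never becomes "isn t".
        expanded = CONTRACTIONS.get(word, word)
        for part in expanded.split():
            part = re.sub(r"[^\w\s]", "", part)  # strip remaining punctuation
            if len(part) < 2 or is_repetitive(part) or part in CUSTOM_STOP_WORDS:
                continue
            kept.append(part)
    return " ".join(kept)


for row in ["it's the coldest day", "they aren't going", "aa", "how are you jkhf"]:
    print(repr(clean(row)))
```

The length filter also removes legitimate one-letter words such as "a", which matches the rule stated in the question but may not always be wanted.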

How to count a keyword in a series of text strings within multiple cells in Excel?

I've found similar ideas on here by using SEARCH or FIND in Excel, but those seem to be more about finding the location of the keyword, rather than counting how many times it comes up.
I have a CSV of a shot list. Each shot is associated with a sequence, and each shot has a set of "tags" (this is the text string). Please see below for an example:
There are two main keywords I'd like to keep track of: "dog" and "fox". There are multiple shots per sequence, and my goal is to figure out how many shots per sequence have the "dog" tag and how many have the "fox" tag. The formula I need would be for the columns highlighted yellow, and I have manually entered the first few entries to give an idea of what number should be there. Once those are filled in, I can then calculate the ratio per sequence of shots tagged "dog" versus "fox".
I can't use text-to-columns in Excel to easily break down the text string column, because each one contains a different series of tags (somewhat demonstrated by my sample text).
I've figured out a simple formula to count what I want if the text column only had "dog" or "fox" in it, but I can't figure out how to get Excel to find one word within a text string and count it.
=SUMIFS(D:D,B:B,1,F:F,"dog")
1 being the sequence number, and the rest of the columns are referencing my larger data sheet.
Any help would be much appreciated!!
Edit:
Sheet in text form here (sorry about the formatting, I can't upload a file from work ATM):
COUNTER
Sequence  Total Fox  Total Dog  Total Entries  Ratio Fox  Ratio Dog
1         2          2          4              0.5        0.5
2         3          2          5              0.6        0.4
3                               4
4                               2
5                               3
SAMPLE DATA
Sequence  Shot     Text
1         mov_101  The quick brown fox
2         mov_102  jumps over the lazy dog
3         mov_103  The fox and the hound
4         mov_104  fox news
5         mov_105  I am a dog
1         mov_106  The fox and the hound
2         mov_107  jumps over the lazy dog
3         mov_108  The fox and the hound
4         mov_109  jumps over the lazy dog
5         mov_110  I am a dog
1         mov_111  jumps over the lazy dog
3         mov_112  The fox and the hound
5         mov_113  The fox and the hound
2         mov_114  jumps over the lazy dog
2         mov_115  fox news
1         mov_116  I am a dog
3         mov_117  I am a dog
2         mov_118  The fox and the hound
You were close: you need to use COUNTIFS instead of SUMIFS to get the count per sequence, and put "*" wildcards around the words fox and dog so that the surrounding text is allowed.
Here is the formula that I've used to get fox count:
=COUNTIFS($H:$H,$A2,$J:$J,"*fox*")
Place this formula in cell B2 and drag it down.
Same way, following formula will get you the dog count per sequence:
=COUNTIFS($H:$H,$A2,$J:$J,"*dog*")
Place this formula in cell C2 and drag it down.
I replicated your data to test these formulas.
Let me know if you have any doubts.
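For readers who want to check the numbers outside Excel, the logic of the COUNTIFS formula can be mirrored in Python. This is only an illustration of the counting rule, not an Excel feature; count_shots is a hypothetical helper, and the rows are a subset of the sample data (sequences 1 and 4). The comparison is lowercased because COUNTIFS wildcard matching is case-insensitive.

```python
rows = [
    (1, "mov_101", "The quick brown fox"),
    (4, "mov_104", "fox news"),
    (1, "mov_106", "The fox and the hound"),
    (4, "mov_109", "jumps over the lazy dog"),
    (1, "mov_111", "jumps over the lazy dog"),
    (1, "mov_116", "I am a dog"),
]


def count_shots(rows, sequence, keyword):
    """Count rows of one sequence whose text contains the keyword."""
    return sum(
        1
        for seq, _shot, text in rows
        if seq == sequence and keyword.lower() in text.lower()
    )


for seq in (1, 4):
    print(seq, count_shots(rows, seq, "fox"), count_shots(rows, seq, "dog"))
```

Like "*fox*" in Excel, this is a substring match, so a tag such as "foxglove" would also be counted.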
Someone will probably have a better solution than this, but I've used it before when looking for a similar function and couldn't find one.
=(LEN([textcell]) - LEN(SUBSTITUTE([textcell], [wordcell], ""))) / LEN([wordcell])
What this does is compare the length of the original string with the length of the string after the search word has been removed. Dividing the difference by the length of the word gives you how many occurrences were removed.
So given the following content :
fox  dog  search
1    0    The quick brown fox
0    1    jumps over the lazy dog
The formula on A2 is
=(LEN($C2) - LEN(SUBSTITUTE($C2,A$1, ""))) / LEN(A$1)
The dollar signs are not required, but they let me copy the formula to all 4 cells.
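The arithmetic behind the LEN/SUBSTITUTE trick can be checked with a few lines of Python. This is just the same calculation restated; count_occurrences is a made-up name. Like SUBSTITUTE, str.replace is case-sensitive.

```python
def count_occurrences(text, word):
    """Characters removed, divided by the word length, gives the count."""
    removed = len(text) - len(text.replace(word, ""))
    return removed // len(word)


print(count_occurrences("The quick brown fox", "fox"))
print(count_occurrences("The fox and the fox", "fox"))
```

Unlike the COUNTIFS approach, this counts every occurrence inside a cell, so a cell that mentions the word twice contributes 2.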
If your Sequence column is E, and the column with text is F, you could use this formula:
=SUMPRODUCT(--(NOT(ISERROR(SEARCH(B$1,$F$2:$F$6)))),--($E$2:$E$6=$A2))
This creates two arrays, one that's a sequence of 1's and 0's where 1 is that the text contains B1 ("fox" or "dog"), and another that is 1 for sequence matching and 0 for not sequence matching.
Then it multiplies and sums the arrays so you only get the count of when both conditions match.
In my example, the formula is in cells B2:C3.

How do I replace a text string with a number, based on a key word contained in the cell

I have a string variable with short text strings. I want to replace all the text strings with numbers based on key words contained inside the individual cells.
Example: Some cells state "I like cats", while others state "I dont like the smell of wet dog".
I want to assign the value 1 to all cells containing the word cat, and the number 2 to all cells containing the word dog.
How do I do this?
This will put 1 in NewVar when "cat" appears in OldVar, 2 for "dog", 3 for "mouse":
do repeat wrd="cat" "dog" "mouse"/val= 1 2 3.
if index(OldVar, wrd)>0 NewVar=val.
end repeat.
This is only good if there will never be a cat AND a dog in the same sentence. If you do have such cases you should go this way:
do repeat wrd="cat" "dog" "mouse"/NewVar=cat dog mouse.
compute NewVar=char.index(OldVar, wrd)>0.
end repeat.
This will create a new variable for each of the possible words, putting 1 in cases where the word appears in OldVar, 0 when it doesn't.
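To make the difference between the two approaches concrete, here is the same logic sketched in Python rather than SPSS syntax. The names KEYWORDS, single_code and indicators are hypothetical. The first function mirrors the first DO REPEAT (later keywords overwrite earlier ones), the second mirrors the one-variable-per-word version.

```python
KEYWORDS = {"cat": 1, "dog": 2, "mouse": 3}


def single_code(text):
    """One code per text; a later keyword overwrites an earlier one."""
    code = None  # stays None when no keyword is found
    for word, value in KEYWORDS.items():
        if word in text:
            code = value
    return code


def indicators(text):
    """One 0/1 flag per keyword, so several keywords can coexist."""
    return {word: int(word in text) for word in KEYWORDS}


print(single_code("I like cats"))
print(single_code("a cat chasing a dog"))
print(indicators("a cat chasing a dog"))
```

With a text that mentions both a cat and a dog, single_code returns only 2, while indicators keeps both flags.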
Apparently you have to open a syntax window and enter this command:
COMPUTE newvar=CHAR.INDEX(UPCASE(VAR1),"ABCD")>0.
newvar is the name of the new variable.
VAR1 is the name of the variable to be searched.
ABCD is the text to be searched for. NOTE: This must be in CAPITAL letters.
newvar will receive a value of 1 if the text is found.

String matching algorithm design

Given a text t[1...n] and k patterns p1, p2, ..., pk, each of length m, with n = 2m, over the alphabet [0, Sigma-1], design an efficient algorithm to find all locations i in t where any of the patterns pj matches.
So I have a string t = "1 2 3 4 5 2 2 9" and the pattern p = "4 5 2 2". I know there will be m+1 locations I can find a pattern (either from "1 2 3 4", "2 3 4 5", etc...).
Then we have k characters in the pattern, so the big-O comes out to be O(k(m+1)).
My algorithm would be to search through the string checking each character with the characters in the pattern. That will run me k iterations for m+1 locations.
Hopefully, I'm explaining it correctly. I just want to know if I'm doing it right and if there are any flaws in my logic. Thank you!
"My algorithm would be to search through the string checking each character with the characters in the pattern. That will run me k iterations for m+1 locations."
That means that for each pattern you can do it in O(m+1), right?
There are algorithms that can achieve this performance, but your brute-force one does not. You have m+1 locations, and for each location you need to check up to m characters, so the total complexity for each pattern is O(m(m+1)), and O(k·m(m+1)) for all k patterns.
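The counting argument above can be seen directly in code. The sketch below shows the brute-force search next to one possible faster variant that relies on all patterns having the same length m: store the patterns in a hash set and look up each window. The set-based variant is an assumption of this example, not something proposed in the thread, and the function names are made up. Positions are 0-based here, while the question uses 1-based indices.

```python
def brute_force(t, patterns):
    """Return start positions where any pattern matches, by direct comparison."""
    m = len(patterns[0])
    hits = []
    for i in range(len(t) - m + 1):  # n - m + 1 windows, i.e. m + 1 when n = 2m
        for p in patterns:  # k patterns
            # Up to m character comparisons per pattern and window.
            if all(t[i + j] == p[j] for j in range(m)):
                hits.append(i)
                break
    return hits


def set_lookup(t, patterns):
    """Same result, using one hash lookup per window instead of k comparisons."""
    m = len(patterns[0])
    wanted = {tuple(p) for p in patterns}
    return [i for i in range(len(t) - m + 1) if tuple(t[i:i + m]) in wanted]


t = [1, 2, 3, 4, 5, 2, 2, 9]
patterns = [[4, 5, 2, 2], [9, 9, 9, 9]]
print(brute_force(t, patterns))
print(set_lookup(t, patterns))
```

Building each window tuple still costs O(m), so the set-based version does roughly O(km) work to build the set plus O(m(m+1)) expected work for the windows; a rolling hash would be needed to avoid the per-window factor of m.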

Why is Vim's d operator misbehaving?

A few days ago I decided to use the Vim text editor... Playing around with vimtutor, I found something very odd with the d operator. Vim session:
Case 1
before: The Quick Red Fox Jumps Over the Lazy Brown Dog
after :    The Quick Red F Jumps Over the Lazy Brown Dog
results: as expected.
Case 2
Placing the cursor in the last character of a word.
before: The Quick Red Fox Jumps Over the Lazy Brown Dog
after :    The Quick Red Fo Over the Lazy Brown Dog
results: de deletes the "x Jumps" substring.
Case 3
Placing the cursor in the last character of the last word.
before: The Quick Red Fox Jumps Over the Lazy Brown Dog
after :    The Quick Red Fox Jumps Over the Lazy Brown Do
results: as expected.
Please note that:
In all cases I'm using the de command.
after: reflects the changes after applying the de command.
The highlighted part represent the cursor position in the editor.
Questions:
Is this a bug?
Am I doing something wrong?
What is happening?
Vim version: version 7.3.50; Modified by Gentoo-7.3.50
When the cursor is already at the end of a word, de deletes to the end of the next word.
d is an operator command. It accepts a motion command (e or others).
When you press e at the end of a word, you can see that the behavior is consistent.
When you press e, Vim takes you to the end of the word. If the cursor is on the x in Fox, you are already there, so e takes you to the end of the next word.
Thus, de will delete Jumps as well.
Keep in mind that de is two commands: d deletes, and e moves to the end of a word.
Issuing e at the end of one word jumps to the end of the next word, so de deletes from the current position to the end of the next word. You might want to try dw or daw instead.
See also :help e and :help d
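The behavior described in the answers can be imitated with a toy model in Python. This is a deliberately simplified sketch and not how Vim is implemented: words are taken to be runs of non-space characters (Vim's real word definition also separates punctuation), only a single line is handled, and the case where e cannot move is modeled so that it matches Case 3 in the question. The function names are made up.

```python
def e_motion(line, cur):
    """Toy model of e: always step forward, then run to the end of a word."""
    i = cur + 1
    while i < len(line) and line[i].isspace():
        i += 1  # skip the gap between words
    if i >= len(line):
        return cur  # nothing further on this line; stay put
    while i + 1 < len(line) and not line[i + 1].isspace():
        i += 1
    return i


def de(line, cur):
    """Delete from the cursor through the position e would move to."""
    end = e_motion(line, cur)
    return line[:cur] + line[end + 1:]


line = "The Quick Red Fox Jumps Over the Lazy Brown Dog"
fox = line.index("Fox")
print(de(line, fox + 1))        # cursor on "o": inside the word
print(de(line, fox + 2))        # cursor on "x": already at the end of the word
print(de(line, len(line) - 1))  # cursor on the final "g"
```

Because e always moves at least one character forward, starting on the last letter of Fox lands on the last letter of Jumps, which is why de removes both.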
