How to calculate word co-occurence - string

I have a string of characters of length 50 say representing a sequence abbcda.... for alphabets taken from the set A={a,b,c,d}.
I want to calculate how many times b is followed by another b (n-grams) where n=2.
Similarly, how many times a particular character is repeated thrice n=3 consecutively, say in the input string abbbcbbb etc so here the number of times b occurs in a sequence of 3 letters is 2.

To find the number of non-overlapping 2-grams you can use
numel(regexp(str, 'b{2}'))
and for 3-grams
numel(regexp(str, 'b{3}'))
to count overlapping 2-grams use positive lookahead
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
EDIT
In order to find number of occurrences of an arbitrary sequence use the first element in first parenthesis and the rest after equality sign, to find ba use
numel(regexp(str, '(b)(?=a)'))
to find bda use
numel(regexp(str, '(b)(?=da)'))

Building on the proposal by Magla:
str = 'abcdabbcdaabbbabbbb'; % for example
index_single = ismember(str, 'b');
index_digram = index_single(1:end-1)&index_single(2:end);
index_trigram = index_single(1:end-2)&index_single(2:end-1)&index_single(3:end);

You may try this piece of code that uses ismember (doc).
%generate string (50 char, 'a' to 'd')
str = char(floor(97 + (101-97).*rand(1,50)))
%digram case
index_digram = ismember(str, 'aa');
%trigram case
index_trigram = ismember(str, 'aaa');
EDIT
Probabilities can be computed with
proba = sum(index_digram)/length(index_digram);

this will find all n-grams and count them:
numberOfGrams = 5;
s = char(floor(rand(1,1000)*4)+double('a'));
ngrams = cell(1);
for n = 2:numberOfGrams
strLength = size(s,2)-n+1;
indices = repmat((1:strLength)',1,n)+repmat(1:n,strLength,1)-1;
grams = s(indices);
gramNumbers = (double(grams)-double('a'))*((ones(1,n)*n).^(0:n-1))';
[uniqueGrams, gramInd] = unique(gramNumbers);
count=hist(gramNumbers,uniqueGrams);
ngrams(n) = {struct('gram',grams(gramInd,:),'count',count)};
end
edit:
the result will be:
ngrams{n}.gram %a list of all n letter sequences in the string
ngrams{n}.count(x) %the number of times the sequence ngrams{n}.gram(x) appears

Related

How to find set of shortest subsequences with minimal collisions from set of strings

I've got a list of strings like
Foobar
Foobaron
Foot
barstool
barfoo
footloose
I want to find the set of shortest possible sub-sequences that are unique to each string in the set; the characters in each sub-sequence do not need to be adjacent, just in order as they appear in the original string. For the example above, that would be (along other possibilities)
Fb (as unique to Foobar as it gets; collision with Foobaron unavoidable)
Fn (unique to Foobaron, no other ...F...n...)
Ft (Foot)
bs (barstool)
bf (barfoo)
e (footloose)
Is there an efficient way to mine such sequences and minimize the number of colliding strings (when collisions can't be avoided, e.g. when strings are substrings of other strings) from a given array of strings? More precisely, chosing the length N, what is the set of sub-sequences of up to N characters each that identify the original strings with the fewest number of collisions.
I would'nt really call that 'efficient', but you can do better than totally dumb like that:
words = ['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose']
N = 2
n = len(words)
L = max([len(word) for word in words])
def generate_substrings(word, max_length=None):
if max_length is None:
max_length = len(word)
set_substrings = set()
set_substrings.add('')
for charac in word:
new_substr_list = []
for substr in set_substrings:
new_substr = substr + charac
if len(new_substr) <= max_length:
new_substr_list.append(new_substr)
set_substrings.update(new_substr_list)
return set_substrings
def get_best_substring_for_each(string_list=words, max_length=N):
all_substrings = {}
best = {}
for word in string_list:
for substring in generate_substrings(word, max_length=max_length):
if substring not in all_substrings:
all_substrings[substring] = 0
all_substrings[substring] = all_substrings[substring] + 1
for word in string_list:
best_score = len(string_list) + 1
best[word] = ''
for substring in generate_substrings(word=word, max_length=max_length):
if all_substrings[substring] < best_score:
best[word] = substring
best_score = all_substrings[substring]
return best
print(get_best_substring_for_each(words, N))
This program prints the solution:
{'barfoo': 'af', 'Foobar': 'Fr', 'Foobaron': 'n', 'footloose': 'os', 'barstool': 'al', 'Foot': 'Ft'}
This can still be improved easily by a constant factor, for instance by storing the results of generate_substringsinstead of computing it twice.
The complexity is O(n*C(N, L+N)), where n is the number of words and L the maximum length of a word, and C(n, k) is the number of combinations with k elements out of n.
I don't think (not sure though) that you can do much better in the worst case, because it seems hard not to enumerate all possible substrings in the worst case (the last one to be evaluated could be the only one with no redundancy...). Maybe in average you can do better...
You could use a modification to the longest common subsequence algorithm. In this case you are seeking the shortest unique subsequence. Shown below is part of a dynamic programming solution which is more efficient than a recursive solution. The modifications to the longest common subsequence algorithm are described in the comments below:
for (int i = 0; i < string1.Length; i++)
for (int j = 0; j < string2.Length; j++)
if (string1[i-1] != string2[j-1]) // find characters in the strings that are distinct
SUS[i][j] = SUS[i-1][j-1] + 1; // SUS: Shortest Unique Substring
else
SUS[i][j] = min(SUS[i-1][j], SUS[i][j-1]); // find minimum size of distinct strings
You can then put this code in a function and call this function for each string in your set to find the length of the shortest unique subsequence in the set.
Once you have the length of the shortest unique subsequence, you can backtrack to print the subsequence.
You should use modified Trie structure, insert strings to a trie in a way that :
Foo-bar-on
-t
bar-stool
-foo
The rest is straightforward, just choose correct compressed node[0] char
That Radix tree should help

Efficiently counting the number of substrings of a digit string that are divisible by k?

We are given a string which consists of digits 0-9. We have to count number of sub-strings divisible by a number k. One way is to generate all the sub-strings and check if it is divisible by k but this will take O(n^2) time. I want to solve this problem in O(n*k) time.
1 <= n <= 100000 and 2 <= k <= 1000.
I saw a similar question here. But k was fixed as 4 in that question. So, I used the property of divisibility by 4 to solve the problem.
Here is my solution to that problem:
int main()
{
string s;
vector<int> v[5];
int i;
int x;
long long int cnt = 0;
cin>>s;
x = 0;
for(i = 0; i < s.size(); i++) {
if((s[i]-'0') % 4 == 0) {
cnt++;
}
}
for(i = 1; i < s.size(); i++) {
int f = s[i-1]-'0';
int s1 = s[i] - '0';
if((10*f+s1)%4 == 0) {
cnt = cnt + (long long)(i);
}
}
cout<<cnt;
}
But I wanted a general algorithm for any value of k.
This is a really interesting problem. Rather than jumping into the final overall algorithm, I thought I'd start with a reasonable algorithm that doesn't quite cut it, then make a series of modifications to it to end up with the final, O(nk)-time algorithm.
This approach combines together a number of different techniques. The major technique is the idea of computing a rolling remainder over the digits. For example, let's suppose we want to find all prefixes of the string that are multiples of k. We could do this by listing off all the prefixes and checking whether each one is a multiple of k, but that would take time at least Θ(n2) since there are Θ(n2) different prefixes. However, we can do this in time Θ(n) by being a bit more clever. Suppose we know that we've read the first h characters of the string and we know the remainder of the number formed that way. We can use this to say something about the remainder of the first h+1 characters of the string as well, since by appending that digit we're taking the existing number, multiplying it by ten, and then adding in the next digit. This means that if we had a remainder of r, then our new remainder is (10r + d) mod k, where d is the digit that we uncovered.
Here's quick pseudocode to count up the number of prefixes of a string that are multiples of k. It runs in time Θ(n):
remainder = 0
numMultiples = 0
for i = 1 to n: // n is the length of the string
remainder = (10 * remainder + str[i]) % k
if remainder == 0
numMultiples++
return numMultiples
We're going to use this initial approach as a building block for the overall algorithm.
So right now we have an algorithm that can find the number of prefixes of our string that are multiples of k. How might we convert this into an algorithm that finds the number of substrings that are multiples of k? Let's start with an approach that doesn't quite work. What if we count all the prefixes of the original string that are multiples of k, then drop off the first character of the string and count the prefixes of what's left, then drop off the second character and count the prefixes of what's left, etc? This will eventually find every substring, since each substring of the original string is a prefix of some suffix of the string.
Here's some rough pseudocode:
numMultiples = 0
for i = 1 to n:
remainder = 0
for j = i to n:
remainder = (10 * remainder + str[j]) % k
if remainder == 0
numMultiples++
return numMultiples
For example, running this approach on the string 14917 looking for multiples of 7 will turn up these strings:
String 14917: Finds 14, 1491, 14917
String 4917: Finds 49,
String 917: Finds 91, 917
String 17: Finds nothing
String 7: Finds 7
The good news about this approach is that it will find all the substrings that work. The bad news is that it runs in time Θ(n2).
But let's take a look at the strings we're seeing in this example. Look, for example, at the substrings found by searching for prefixes of the entire string. We found three of them: 14, 1491, and 14917. Now, look at the "differences" between those strings:
The difference between 14 and 14917 is 917.
The difference between 14 and 1491 is 91
The difference between 1491 and 14917 is 7.
Notice that the difference of each of these strings is itself a substring of 14917 that's a multiple of 7, and indeed if you look at the other strings that we've matched later on in the run of the algorithm we'll find these other strings as well.
This isn't a coincidence. If you have two numbers with a common prefix that are multiples of the same number k, then the "difference" between them will also be a multiple of k. (It's a good exercise to check the math on this.)
So this suggests another route we can take. Suppose that we find all prefixes of the original string that are multiples of k. If we can find all of them, we can then figure out how many pairwise differences there are among those prefixes and potentially avoid rescanning things multiple times. This won't find everything, necessarily, but it will find all substrings that can be formed by computing the difference of two prefixes. Repeating this over all suffixes - and being careful not to double-count things - could really speed things up.
First, let's imagine that we find r different prefixes of the string that are multiples of k. How many total substrings did we just find if we include differences? Well, we've found k strings, plus one extra string for each (unordered) pair of elements, which works out to k + k(k-1)/2 = k(k+1)/2 total substrings discovered. We still need to make sure we don't double-count things, though.
To see whether we're double-counting something, we can use the following technique. As we compute the rolling remainders along the string, we'll store the remainders we find after each entry. If in the course of computing a rolling remainder we rediscover a remainder we've already computed at some point, we know that the work we're doing is redundant; some previous scan over the string will have already computed this remainder and anything we've discovered from this point forward will have already been found.
Putting these ideas together gives us this pseudocode:
numMultiples = 0
seenRemainders = array of n sets, all initially empty
for i = 1 to n:
remainder = 0
prefixesFound = 0
for j = i to n:
remainder = (10 * remainder + str[j]) % k
if seenRemainders[j] contains remainder:
break
add remainder to seenRemainders[j]
if remainder == 0
prefixesFound++
numMultiples += prefixesFound * (prefixesFound + 1) / 2
return numMultiples
So how efficient is this? At first glance, this looks like it runs in time O(n2) because of the outer loops, but that's not a tight bound. Notice that each element can only be passed over in the inner loop at most k times, since after that there aren't any remainders that are still free. Therefore, since each element is visited at most O(k) times and there are n total elements, the runtime is O(nk), which meets your runtime requirements.

Matlab. Find the indices of a cell array of strings with characters all contained in a given string (without repetition)

I have one string and a cell array of strings.
str = 'actaz';
dic = {'aaccttzz', 'ac', 'zt', 'ctu', 'bdu', 'zac', 'zaz', 'aac'};
I want to obtain:
idx = [2, 3, 6, 8];
I have written a very long code that:
finds the elements with length not greater than length(str);
removes the elements with characters not included in str;
finally, for each remaining element, checks the characters one by one
Essentially, it's an almost brute force code and runs very slowly. I wonder if there is a simple way to do it fast.
NB: I have just edited the question to make clear that characters can be repeated n times if they appear n times in str. Thanks Shai for pointing it out.
You can sort the strings and then match them using regular expression. For your example the pattern will be ^a{0,2}c{0,1}t{0,1}z{0,1}$:
u = unique(str);
t = ['^' sprintf('%c{0,%d}', [u; histc(str,u)]) '$'];
s = cellfun(#sort, dic, 'uni', 0);
idx = find(~cellfun('isempty', regexp(s, t)));
I came up with this :
>> g=#(x,y) sum(x==y) <= sum(str==y);
>> h=#(t)sum(arrayfun(#(x)g(t,x),t))==length(t);
>> f=cellfun(#(x)h(x),dic);
>> find(f)
ans =
2 3 6
g & h: check if number of count of each letter in search string <= number of count in str.
f : finally use g and h for each element in dic

Removing minimal letters from a string 'A' to remove all instances of string 'B'

If we have string A of length N and string B of length M, where M < N, can I quickly compute the minimum number of letters I have to remove from string A so that string B does not occur as a substring in A?
If we have tiny string lengths, this problem is pretty easy to brute force: you just iterate a bitmask from 0 to 2^N and see if B occurs as a substring in this subsequence of A. However, when N can go up to 10,000 and M can go up to 1,000, this algorithm obviously falls apart quickly. Is there a faster way to do this?
Example: A=ababaa B=aba. Answer=1.Removing the second a in A will result in abbaa, which does not contain B.
Edit: User n.m. posted a great counter example: aabcc and abc. We want to remove the single b, because removing any a or c will create another instance of the string abc.
Solve it with dynamic programming. Let dp[i][j] the minimum operator to make A[0...i-1] have a suffix of B[0...j-1] as well as A[0...i] doesn't contain B, dp[i][j] = Infinite to index the operator is impossible. Then
if(A[i-1]=B[i-1])
dp[i][j] = min(dp[i-1][j-1], dp[i-1][j])
else dp[i][j]=dp[i-1][j]`,
return min(A[N][0],A[N][1],...,A[N][M-1]);`
Can you do a graph search on the string A. This is probably too slow for large N and special input but it should work better than an exponential brute force algorithm. Maybe a BFS.
I'm not sure this question is still of someone interest, but I have an idea that maybe could work.
once we decided that the problem is not to find the substring, is to decide which letter is more convenient to remove from string A, the solution to me appears pretty simple: if you find an occurrence of B string into A, the best thing you can do is just remove a char that is inside the string, closed to the right bondary...let say the one previous the last. That's why if you have a substring that actually end how it starts, if you remove a char at the beginning you just remove one of the B occurencies, while you can actually remove two at once.
Algorithm in pseudo cose:
String A, B;
int occ_posit = 0;
N = B.length();
occ_posit = A.getOccurrencePosition(B); // pseudo function that get the first occurence of B into A and returns the offset (1° byte = 1), or 0 if no occurence present.
while (occ_posit > 0) // while there are B into A
{
if (B.firstchar == B.lastchar) // if B starts as it ends
{
if (A.charat[occ_posit] == A.charat[occ_posit+1])
A.remove[occ_posit - 1]; // no reason to remove A[occ_posit] here
else
A.remove[occ_posit]; // here we remove the last char, so we could remove 2 occurencies at the same time
}
else
{
int x = occ_posit + N - 1;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x]; // B does not ends as it starts, so if there are overlapping instances they overlap by more than one char. Removing the first that is not equal to the char following B instance, we kill both occurrencies at once.
}
}
Let's explain with an example:
A = "123456789000987654321"
B = "890"
read this as a table:
occ_posit: 123456789012345678901
A = "123456789000987654321"
B = "890"
first occurrence is at occ_posit = 8. B does not end as it starts, so it get into the second loop:
int x = 8 + 3 - 1 = 10;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x];
the while find that A.charat11 matches A.charat[10] (="0"), so x become 9 and then while exits as A.charat[10] does not match A.charat9. A then become:
A = "12345678000987654321"
with no more occurencies in it.
Let's try with another:
A = "abccccccccc"
B = "abc"
first occurrence is at occ_posit = 1. B does not end as it starts, so it get into the second loop:
int x = 1 + 3 - 1 = 3;
while (A.charat[x + 1] == A.charat[x])
x--; // find the first char different from the last one
A.remove[x];
the while find that A.charat4 matches A.charat[3] (="c"), so x become 2 and then while exits as A.charat[3] does not match A.charat2. A then become:
A = "accccccccc"
let's try with overlapping:
A = "abcdabcdabff"
B = "abcdab"
the algorithm results in: A = "abcdacdabff" that has no more occurencies.
finally, one letter overlap:
A = "abbabbabbabba"
B = "abba"
B end as it starts, so it enters the first if:
if (A.charat[occ_posit] == A.charat[occ_posit+1])
A.remove[occ_posit - 1]; // no reason to remove A[occ_posit] here
else
A.remove[occ_posit]; // here we remove the last char, so we could remove 2 occurencies at the same time
that lets the last "a" of B instance to be removed. So:
1° step: A= "abbbbabbabba"
2° step: A= "abbbbabbbba" and we are done.
Hope this helps
EDIT: pls note that the algotirhm must be corrected a little not to give error when you are close to the A end with your search, but this is just an easy programming issue.
Here's a sketch I've come up with.
First, if A contains any symbols that are not found in B, split up A into a bunch of smaller strings containing only those characters found in B. Apply the algorithm on each of the smaller strings, then glue them back together to get the total result. This really functions as an optimization.
Next, check if A contains any of B. If there isn't, you're done. If A = B, then delete all of them.
I think a relatively greedy algorithm may work.
First, mark all of the symbols in A which belong to at least one occurrence of B. Let A = aabcbccabcaa, B = abc. Bolding indicates these marked characters:
a abc bcc abc aa. If there's an overlap, mark all possible. This operation is naively approximately (A-B) operations, but I believe it can be done in around (A/B) operations.
Consider the deletion of each marked letter in A: a abc bcc abc aa.
Check whether the deletion of that marked letter decreases the number of marked letters. You only need to check the substrings which could possibly be affected by the deletion of the letter. If B has a length of 4, only the substrings starting at the following locations would need to be deleted if x were being checked:
-------x------
^^^^
Any further left or right will exist regardless of the presence of x.
For instance:
Marking the [a] in the following string: a [a]bc bcc abc aa.
Its deletion yields abcbccabcaa, which when marked produces abc bcc abc aa, which has an equal number of marked characters. Since only the relative number is required for this operation, it can be done in approximately 2B time for each selected letter. For each, assign the relative difference between the two. Pick an arbitrary one which is maximal and delete it. Repeat until done. Each pass is roughly up to 2AB operations, for a maximum of A passes, giving a total time of about 2A^2 B.
In the above example, these values are assigned:
aabcbccabcaa
033 333
So arbitrarily deleting the first marked b gives you: aacbccabcaa. If you repeat the process, you get:
aacbccabcaa
333
The final result is done.
I believe the algorithm is correctly minimal. I think it is true that whenever A requires only one deletion, the algorithm must be optimal. In that case, the letter which reduces the most possible matches (ie: all of them) should be best. I can come up with no such proof, though. I'd be interested in finding any counter-examples to optimality.
Find the indeces of each substring in the main string.
Then using a dynamic programming algorithm (so memoize intermediate values), remove each letter that is part of a substring from the main string, add 1 to the count, and repeat.
You can find the letters, because they are within the indeces of each match index + length of B.
A = ababaa
B = aba
count = 0
indeces = (0, 2)
A = babaa, aabaa, abbaa, abbaa, abaaa, ababa
B = aba
count = 1
(2nd abbaa is memoized)
indeces = (1), (1), (), (), (0), (0, 2)
answer = 1
You can take it a step further, and try to memoize the substring match indeces of substrings, but that might not actually be a performance gain.
Not sure on the exact bounds, but shouldn't take too long computationally.

Optimizing count of occurrence of a string

I have to count how often a certain string is contained in a cell-array. The problem is the code is way to slow it takes almost 1 second in order to do this.
uniqueWordsSize = 6; % just a sample number
wordsCounter = zeros(uniqueWordsSize, 1);
uniqueWords = unique(words); % words is a cell-array
for i = 1:uniqueWordsSize
wordsCounter(i) = sum(strcmp(uniqueWords(i), words));
end
What I'm currently doing is to compare every word in uniqueWords with the cell-array words and use sum in order to calculate the sum of the array which gets returned by strcmp.
I hope someone can help me to optimize that.... 1 second for 6 words is just too much.
EDIT: ismember is even slower.
You can drop the loop completely by using the third output of unique together with hist:
words = {'a','b','c','a','a','c'}
[uniqueWords,~,wordOccurrenceIdx]=unique(words)
nUniqueWords = length(uniqueWords);
counts = hist(wordOccurrenceIdx,1:nUniqueWords)
uniqueWords =
'a' 'b' 'c'
wordOccurrenceIdx =
1 2 3 1 1 3
counts =
3 1 2
tricky way without using explicit fors..
clc
close all
clear all
Paragraph=lower(fileread('Temp1.txt'));
AlphabetFlag=Paragraph>=97 & Paragraph<=122; % finding alphabets
DelimFlag=find(AlphabetFlag==0); % considering non-alphabets delimiters
WordLength=[DelimFlag(1), diff(DelimFlag)];
Paragraph(DelimFlag)=[]; % setting delimiters to white space
Words=mat2cell(Paragraph, 1, WordLength-1); % cut the paragraph into words
[SortWords, Ia, Ic]=unique(Words); %finding unique words and their subscript
Bincounts = histc(Ic,1:size(Ia, 1));%finding their occurence
[SortBincounts, IndBincounts]=sort(Bincounts, 'descend');% finding their frequency
FreqWords=SortWords(IndBincounts); % sorting words according to their frequency
FreqWords(1)=[];SortBincounts(1)=[]; % dealing with remaining white space
Freq=SortBincounts/sum(SortBincounts)*100; % frequency percentage
%% plot
NMostCommon=20;
disp(Freq(1:NMostCommon))
pie([Freq(1:NMostCommon); 100-sum(Freq(1:NMostCommon))], [FreqWords(1:NMostCommon), {'other words'}]);

Resources