Optimizing count of occurrences of a string

I have to count how often a certain string is contained in a cell array. The problem is that the code is way too slow: it takes almost 1 second to do this.
uniqueWordsSize = 6; % just a sample number
wordsCounter = zeros(uniqueWordsSize, 1);
uniqueWords = unique(words); % words is a cell-array
for i = 1:uniqueWordsSize
wordsCounter(i) = sum(strcmp(uniqueWords(i), words));
end
What I'm currently doing is comparing every word in uniqueWords with the cell array words using strcmp, and then using sum to add up the logical array that strcmp returns.
I hope someone can help me optimize this... 1 second for 6 words is just too much.
EDIT: ismember is even slower.

You can drop the loop completely by using the third output of unique together with hist:
words = {'a','b','c','a','a','c'}
[uniqueWords,~,wordOccurrenceIdx]=unique(words)
nUniqueWords = length(uniqueWords);
counts = hist(wordOccurrenceIdx,1:nUniqueWords)
uniqueWords =
'a' 'b' 'c'
wordOccurrenceIdx =
1 2 3 1 1 3
counts =
3 1 2

A tricky way without using explicit for loops:
clc
close all
clear all
Paragraph=lower(fileread('Temp1.txt'));
AlphabetFlag=Paragraph>=97 & Paragraph<=122; % flag lowercase letters
DelimFlag=find(AlphabetFlag==0); % treat non-letter characters as delimiters
WordLength=[DelimFlag(1), diff(DelimFlag)];
Paragraph(DelimFlag)=[]; % remove the delimiters
Words=mat2cell(Paragraph, 1, WordLength-1); % cut the paragraph into words
[SortWords, Ia, Ic]=unique(Words); % find unique words and their subscripts
Bincounts = histc(Ic,1:size(Ia, 1)); % count their occurrences
[SortBincounts, IndBincounts]=sort(Bincounts, 'descend'); % sort counts in descending order
FreqWords=SortWords(IndBincounts); % sort words according to their frequency
FreqWords(1)=[];SortBincounts(1)=[]; % discard the empty "word" coming from the remaining white space
Freq=SortBincounts/sum(SortBincounts)*100; % frequency percentage
%% plot
NMostCommon=20;
disp(Freq(1:NMostCommon))
pie([Freq(1:NMostCommon); 100-sum(Freq(1:NMostCommon))], [FreqWords(1:NMostCommon), {'other words'}]);


How to remove kth element in O(1) time complexity

Given a string, I need to repeatedly remove the smallest character and return the sum of the indices of the removed characters.
Suppose the string is 'abcab'. I need to remove the first 'a', at index 1.
We are left with 'bcab'. Now remove 'a' again, which is the smallest character in the remaining string and is at index 3.
We are left with 'bcb'.
In the same way, remove 'b' at index 1, then remove 'b' again from 'cb' at index 2, and finally remove 'c'.
The total of all indices is 1+3+1+2+1=8.
The question is simple, but we need to do it in O(n). For that I need to remove the kth element in O(1). In Python, del list[index] has time complexity O(n).
How can I delete in constant time using Python?
Edit
This is the exact question
You are given a string S of size N. Assume that count is equal to 0.
Your task is to remove all the N elements of string S by performing the following operation N times:
• In a single operation, select the alphabetically smallest character in S, say c. Remove c from S and add its index to count. If multiple characters such as c exist, then select the one that has the smallest index.
Print the value of count.
Note: Consider 1-based indexing.
Solve the problem for T test cases.
Input format
• The first line of the input contains an integer T denoting the number of test cases.
• The first line of each test case contains an integer N denoting the size of string S.
• The second line of each test case contains a string S.
Output format
For each test case, print a single line containing one integer denoting the value of count.
Constraints
• 1 < T, N < 10^5
• S contains only lowercase English letters.
• Sum of N over all test cases does not exceed 10
Sample input 1
5
abcab
Sample Output 1
8
Explanation
The operations occur in the following order:
Current string S = 'abcab'. The alphabetically smallest character of S is 'a'. As there are 2 occurrences of 'a', we choose the first occurrence. Its index 1 will be added to count and 'a' will be removed. Therefore, S becomes 'bcab'.
'a' will be removed from S ('bcab') and 3 will be added to count.
The first occurrence of 'b' will be removed from S ('bcb') and 1 will be added to count.
'b' will be removed from S ('cb') and 2 will be added to count.
'c' will be removed from S ('c') and 1 will be added to count.
If you follow your procedure of repeatedly removing the first occurrence of the smallest character, then each character's index -- when you remove it -- is the number of preceding larger characters in the original string plus one.
So what you really need to do is find, for each character, the number of preceding larger characters, and then add up all those counts.
There are only 26 characters, so you can do this as you go with 26 counters.
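A quick brute-force sketch, purely to illustrate that claim on the example string (quadratic on purpose; the function name is illustrative, not the optimized 26-counter version):
def removal_indices(s):
    # removal index of each character = 1 + number of preceding strictly larger characters
    return [1 + sum(prev > ch for prev in s[:i]) for i, ch in enumerate(s)]

print(removal_indices('abcab'))       # [1, 1, 1, 3, 2]
print(sum(removal_indices('abcab')))  # 8
The per-character values are listed in original string order; their sum matches the 1+3+1+2+1 = 8 from the example.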
Please link to the original problem statement, or copy/paste exactly what it says, without trying to explain it. As is, what you're asking for is impossible.
Forget deleting: if what you're asking for were possible, sorting would be worst-case O(n) (remove the minimum remaining n times, at O(1) cost for each), but it's well known that comparison-based sorting cannot do better than worst-case O(n log n).
One bet: the original problem statement doesn't require that you delete anything - but instead that you return the result as if you had deleted.
With one pass over the input
Putting together various ideas, the final index of a character is one more than the number of larger characters seen before it. So it's possible to do this in one left-to-right pass over the input, using O(1) storage and O(n) time, while deleting nothing:
def crunch(s):
    neq = [0] * 26
    result = 0
    orda = ord('a')
    for ch in map(ord, s):
        ch -= orda
        result += sum(neq[i] for i in range(ch + 1, 26)) + 1
        neq[ch] += 1
    return result
For your original:
>>> crunch('abcab')
8
But it's also possible to process arbitrary iterables one character at a time:
>>> from itertools import repeat, chain
>>> crunch(chain(repeat('y', 1000000), 'xz'))
2000002
x is originally at (1-based) index 1000001, which accounts for half the result. Then each of a million 'y's is conceptually deleted, each at index 1. Finally 'z' is at index 1, for a grand total of 2000002.
Looks like you're only interested in the resulting sum of indices and don't need to simulate this algorithm step by step.
In which case you could compute the result in the following way:
For each letter from 'a' to 'z':
1. Have a counter of already removed letters, set to 0.
2. Iterate over the string, and if you encounter the current letter, add current_index - already_removed_counter to the result.
2a. If you encounter the current or an earlier (smaller) letter, increase the counter, as that letter has already been removed.
The time complexity is 26 * O(n), which is O(n).
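A runnable Python sketch of these steps (the function name is illustrative):
def sum_of_removal_indices(s):
    res = 0
    for c in 'abcdefghijklmnopqrstuvwxyz':
        cnt = 0  # letters at or before the current one that are already "removed"
        for idx, ch in enumerate(s, start=1):  # 1-based indexing, as in the problem
            if ch <= c:
                if ch == c:
                    res += idx - cnt  # index of this occurrence at the moment it is removed
                cnt += 1
    return res

print(sum_of_removal_indices('abcab'))  # 8
It makes 26 passes over the string without ever mutating it, so the whole thing stays O(n).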
Since there are only 26 distinct characters in the string, we can take each character separately and linearly traverse the string to find all its occurrences. Keep a counter of how many characters have been found. Each time an occurrence of the given character is found, display its index decreased by the counter. Before switching to a new character, remove all the occurrences of the previous one; this can be done in linear time.
res = 0
for c in 'a' .. 'z'
    cnt = 0
    for idx = 1 .. len(s)
        if s[idx] = c
            print idx - cnt
            res += idx - cnt
            cnt++
    removeAll(s, c)
return res
where
removeAll(s,c):
    i = 1
    cnt = 0
    n = len(s)
    while (i < n)
        if s[i + cnt] = c
            cnt++
            n--
        else
            s[i] = s[i + cnt]
            i++
    len(s) = n
It prints the elements of the sum to better illustrate what's going on.
Edit:
An updated version, based on Igor's answer, that does not require actually removing elements. The complexity is the same, i.e. O(n).
res = 0
for c in 'a' .. 'z'
    cnt = 0
    for idx = 1 .. len(s)
        if s[idx] <= c
            if s[idx] = c
                print idx - cnt
                res += idx - cnt
            cnt++
return res

How can I get the sum of all combined word lengths to print as a single value?

I am having trouble computing the total of all the word counts. I have tried various methods; the length per word is correct, I just can't get a total.
file=open(r"sheSaid.txt","r+")
from collections import Counter
wordcount = Counter(file.read().split())
for word in file.read().split(' '):
    word = word.rstrip(".""',?!")
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
    #print (word.rstrip(".""..',?!"),wordcount)
for item in wordcount.items(): print("{}\t{}".format(*item))
wordCount = len(wordcount)
#can count all word lengths just fine
print ("The total word count is:", word, wordCount) # when I use Len() Or #Count I cannot get the sum of all wordCounts?
print ("The total length of all words are:","total_of_all_word_counts?")
#I need the sum to complete this
print ("The avg length is:", wordcount/total_of_all_word_counts?)
file.close();
Something like this will do. It's not entirely clear what you are looking for, but this will count the words in the text file, calculate the character count across all words, and then the average characters per word:
file = open(r"sheSaid.txt","r+")
file_contents = file.read().split() # Reads the file and split by spaces to create a list of words
file.close() # Always good idea to close the file after done with it
words_count = len(file_contents)
characters_count = len(''.join(file_contents))
average_characters = characters_count / words_count
You may want to make an extra effort and handle some other cases, like removing characters such as ! and . from the words; that will be straightforward. This is just a starting point to build upon; only you know the corner cases and what will be in the text file.
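For example, here is a minimal sketch that strips punctuation before counting, using the standard string module (the file name is the one from the question; adapt as needed):
import string

with open("sheSaid.txt") as f:
    # strip leading/trailing punctuation from each whitespace-separated token
    words = [w.strip(string.punctuation) for w in f.read().split()]

words = [w for w in words if w]  # drop tokens that were pure punctuation
words_count = len(words)
characters_count = len(''.join(words))
average_characters = characters_count / words_count
print(words_count, characters_count, average_characters)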

Find the location of multiple strings in a cell array of strings

I have 2 questions regarding searching for strings in MATLAB.
If I have to find a string in a cell array of strings, I can do the following to get the location of 'PO' in the cell array:
find(strcmpi({'PO','FOO','PO1','FOO1','PO1','PO'},'PO'))
% 1 6
But I really want to search for multiple strings ({'PO1', 'PO'}) at the same time (not using a for loop). What is the best way to do this?
Is there any function like histc() which can tell me how many times each string has occurred? Again, for one string I could do:
length(strfind({'PO','FOO','PO1','FOO1','PO1','PO'},'PO'))
But this obviously doesn't work for multiple strings at a time.
If you want to find multiple strings, then just use the second output of ismember instead to tell you which string it is. If you really need case-insensitive matching, I've added the upper call to force all inputs to be upper-case. You can omit this if you think it's already uppercase.
data = {'PO','FOO','PO1','FOO1','PO1','PO', 'PO'};
[tf, inds] = ismember(upper(data), {'PO1', 'PO'});
% 2 0 1 0 1 2 2
You can then use the second output to determine which string was found where:
% PO1 Occurrences
find(inds == 1)
% 3 5
% PO Occurrences
find(inds == 2)
% 1 6 7
If you want the equivalent of histc, you can use accumarray to do that. We can pass it all of the values of inds that are non-zero (i.e. the ones that you were actually searching for).
accumarray(inds(tf).', ones(sum(tf), 1))
% 2 3
If instead you want to get the histogram of all strings (not just the ones you're searching for) you could do the following:
[strings, ~, inds] = unique(data, 'stable');
occurrences = accumarray(inds, ones(size(inds)));
% 'PO' [3]
% 'FOO' [1]
% 'PO1' [2]
% 'FOO1' [1]

How to calculate word co-occurrence

I have a string of characters of length 50, say, representing a sequence abbcda... with letters taken from the set A = {a,b,c,d}.
I want to calculate how many times b is followed by another b (n-grams with n=2).
Similarly, how many times a particular character is repeated three times consecutively (n=3); for example, in the input string abbbcbbb the number of times b occurs in a sequence of 3 letters is 2.
To find the number of non-overlapping 2-grams you can use
numel(regexp(str, 'b{2}'))
and for 3-grams
numel(regexp(str, 'b{3}'))
to count overlapping 2-grams use positive lookahead
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
EDIT
In order to find the number of occurrences of an arbitrary sequence, put the first element in the first parentheses and the rest after the equals sign of the lookahead. To find ba use
numel(regexp(str, '(b)(?=a)'))
to find bda use
numel(regexp(str, '(b)(?=da)'))
Building on the proposal by Magla:
str = 'abcdabbcdaabbbabbbb'; % for example
index_single = ismember(str, 'b');
index_digram = index_single(1:end-1)&index_single(2:end);
index_trigram = index_single(1:end-2)&index_single(2:end-1)&index_single(3:end);
You may try this piece of code that uses ismember (doc).
%generate string (50 char, 'a' to 'd')
str = char(floor(97 + (101-97).*rand(1,50)))
%digram case
index_digram = ismember(str, 'aa');
%trigram case
index_trigram = ismember(str, 'aaa');
EDIT
Probabilities can be computed with
proba = sum(index_digram)/length(index_digram);
This will find all n-grams and count them:
numberOfGrams = 5;
s = char(floor(rand(1,1000)*4)+double('a'));
ngrams = cell(1);
for n = 2:numberOfGrams
    strLength = size(s,2)-n+1;
    indices = repmat((1:strLength)',1,n)+repmat(1:n,strLength,1)-1;
    grams = s(indices);
    gramNumbers = (double(grams)-double('a'))*(26.^(0:n-1))'; % encode each n-gram as a unique base-26 number
    [uniqueGrams, gramInd] = unique(gramNumbers);
    count = hist(gramNumbers,uniqueGrams);
    ngrams(n) = {struct('gram',grams(gramInd,:),'count',count)};
end
Edit:
The result will be:
ngrams{n}.gram %a list of all n letter sequences in the string
ngrams{n}.count(x) %the number of times the sequence ngrams{n}.gram(x) appears

Trying to read a text file...but not getting all the contents

I am trying to read a file with the following format, which repeats itself (I have cut out the data, even for the first repetition, because it is too long):
1.00 'day' 2011-01-02
'Total Velocity Magnitude RC - Matrix' 'm/day'
0.190189 0.279141 0.452853 0.61355 0.757833 0.884577
0.994502 1.08952 1.17203 1.24442 1.30872 1.36653
1.41897 1.46675 1.51035 1.55003 1.58595 1.61824
Download the actual file with the complete data here
This is my code which I am using to read the data from the above file:
fid = fopen(file_name); % open the file
dotTXT_fileContents = textscan(fid,'%s','Delimiter','\n'); % read it as string ('%s') into one big array, row by row
dotTXT_fileContents = dotTXT_fileContents{1};
fclose(fid); %# don't forget to close the file again
%# find rows containing 'Total Velocity Magnitude RC - Matrix' 'm/day'
data_starts = strmatch('''Total Velocity Magnitude RC - Matrix'' ''m/day''',...
dotTXT_fileContents); % data_starts contains the line numbers wherever 'Total Velocity Magnitude RC - Matrix' 'm/day' is found
ndata = length(data_starts); % total no. of data values will be equal to the corresponding no. of '** K' read from the .txt file
%# loop through the file and read the numeric data
for w = 1:ndata-1
%# read lines containing numbers
tmp_str = dotTXT_fileContents(data_starts(w)+1:data_starts(w+1)-3); % stores the content from file dotTXT_fileContents of the rows following the row containing 'Total Velocity Magnitude RC - Matrix' 'm/day' in form of string
%# convert strings to numbers
tmp_str = tmp_str{:}; % store the content of the string which contains data in form of a character
%# assign output
data_matrix_grid_wise(w,:) = str2num(tmp_str); % convert the part of the character containing data into number
end
To give you an idea of the pattern of data in my text file, these are some results from the code:
data_starts =
2
1672
3342
5012
6682
8352
10022
ndata =
7
Therefore, my data_matrix_grid_wise should contain 1672-2-2-1 (for a new line) = 1667 rows. However, I am getting this as the result:
data_matrix_grid_wise =
Columns 1 through 2
0.190189000000000 0.279141000000000
0.423029000000000 0.616590000000000
0.406297000000000 0.604505000000000
0.259073000000000 0.381895000000000
0.231265000000000 0.338288000000000
0.237899000000000 0.348274000000000
Columns 3 through 4
0.452853000000000 0.613550000000000
0.981086000000000 1.289920000000000
0.996090000000000 1.373680000000000
0.625792000000000 0.859638000000000
0.547906000000000 0.743446000000000
0.562903000000000 0.759652000000000
Columns 5 through 6
0.757833000000000 0.884577000000000
1.534560000000000 1.714330000000000
1.733690000000000 2.074690000000000
1.078000000000000 1.277930000000000
0.921371000000000 1.080570000000000
0.934820000000000 1.087410000000000
Where am I wrong? In my final result, I should get data_matrix_grid_wise composed of 10000 elements instead of 36 elements. Thanks.
Update: How can I include the number before 'day', i.e. 1, 2, 3, etc., on the line just before data_starts(w)? I am using this within the loop but it doesn't seem to work:
days_str = dotTXT_fileContents(data_starts(w)-1);
days_str = days_str{1};
days(w,:) = sscanf(days_str(w-1,:), '%d %*s %*s', [1, inf]);
The problem is in the line tmp_str = tmp_str{:}; MATLAB has strange behaviour when handling chars. A short solution for you is to replace the last two statements with the following two lines:
y = cell2mat( cellfun(@(z) sscanf(z,'%f'),tmp_str,'UniformOutput',false));
data_matrix_grid_wise(w,:) = y;
The problem is with the last 2 statements. When you do tmp_str{:}, you convert the cell array to a comma-separated list of strings. If you assign this list to a single variable, only the first string is assigned, so tmp_str will now hold only the first row of data.
Here is what you can do instead of last 2 lines:
tmp_mat = cellfun(@str2num, tmp_str, 'uniformoutput',0);
data_matrix_grid_wise(w,:) = cell2mat(tmp_mat);
However, you will have a problem with the concatenation (cell2mat), since not all of your rows have the same number of columns. It's up to you how to solve that.
