Kusto String Difference - string

I need help finding the difference between two strings. For example, the difference between the strings outlook and outlooka should be "a"; even just the number of characters that differ would work fine.
I am okay with converting the strings to array and calculating the set difference as well.
Any help is much appreciated. Thank you.
I am trying to identify homoglyph domains with minor changes.

This query counts the occurrences of each character in each string and returns the differences (count in str2 minus count in str1).
datatable(id:int, str1:string, str2:string)
[
     1 ,"outlook" ,"outlooka"
    ,2 ,"outlook" ,"outlok"
    ,3 ,"outlook" ,"outllooook"
    ,4 ,"outlook" ,"lookout"
]
| mv-apply c = extract_all("(.)", strcat(str1, str2)) to typeof(string)
          ,s = array_concat(repeat("1", strlen(str1)), repeat("2", strlen(str2))) to typeof(string) on
(
    summarize count_diff = countif(s == 2) - countif(s == 1) by c
    | summarize char_diff = make_bag_if(bag_pack(c, count_diff), count_diff != 0)
)
id  str1     str2        char_diff
1   outlook  outlooka    {"a":1}
2   outlook  outlok      {"o":-1}
3   outlook  outllooook  {"o":2,"l":1}
4   outlook  lookout     {}
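For readers who want to check the logic outside Kusto, here is a minimal Python sketch of the same per-character count difference. The function name char_diff is only an illustration borrowed from the column name above, and it is not part of the query. Like the query, it ignores character order, so an anagram such as "lookout" yields an empty result.

```python
from collections import Counter

def char_diff(str1, str2):
    """Return {char: count in str2 - count in str1}, omitting zero differences."""
    c1, c2 = Counter(str1), Counter(str2)
    return {
        ch: c2[ch] - c1[ch]
        for ch in sorted(set(c1) | set(c2))
        if c2[ch] != c1[ch]
    }

print(char_diff("outlook", "outlooka"))
print(char_diff("outlook", "lookout"))
```

Summing the absolute values of the result gives a single "number of differing characters" figure, if that is all that is needed.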

Related

Making one string the anagram of other

I have a problem where two strings of same length are given, and I have to tell how many letters I have to change in the first string to make it an anagram of the second.
Here is what I did:
count = 0
Mutable_str = ''.join(sorted("hhpddlnnsjfoyxpci"))
Ref_str = ''.join(sorted("ioigvjqzfbpllssuj"))
i = 0
while i < len(Mutable_str):
    if Mutable_str[i] != Ref_str[i]:
        count += 1
    i += 1
print(count)
My algorithm in this case returned 16 as result. But the correct answer is 10. Can someone tell me what is wrong in my code?
Thank you very much!
You need to use str.count
You need to add up, for each character, how many more times it occurs in the second string than in the first. This can be done with str.count(c), where c ranges over the distinct characters of the second string (obtained with set()). Taking max() of each difference with 0 ensures that a negative difference (a character the first string has in surplus) doesn't affect the total.
So as you can see, it boils down to one neat little one-liner:
def changes(s1, s2):
    return sum(max(0, s2.count(c) - s1.count(c)) for c in set(s2))
and some tests:
>>> changes("hhpddlnnsjfoyxpci", "ioigvjqzfbpllssuj")
10
>>> changes("abc", "bcd")
1
>>> changes("jimmy", "bobby")
4
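As an alternative sketch (not part of the answer above), the same total can be computed with collections.Counter, which counts each string once instead of calling str.count per character. Subtracting two Counters keeps only positive counts, which plays the role of max(0, ...). The name changes_counter is made up for this example.

```python
from collections import Counter

def changes_counter(s1, s2):
    """Number of letters of s1 that must change to make it an anagram of s2."""
    # Counter subtraction drops zero and negative counts, so only the
    # characters that s2 has in surplus remain.
    surplus = Counter(s2) - Counter(s1)
    return sum(surplus.values())

print(changes_counter("hhpddlnnsjfoyxpci", "ioigvjqzfbpllssuj"))
```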

Find the location of multiple strings in a cell array of strings

I have 2 questions regarding searching for strings in MATLAB.
If I have to find a string in a cell array of strings I can do the following to get the location of 'PO' in the cell array
find(strcmpi({'PO','FOO','PO1','FOO1','PO1','PO'},'PO'))
% 1 6
But, I really want to search for multiple strings ({'PO1', 'PO'}) at the same time (not using a for loop). What is the best way to do this?
Is there any function like histc() which can tell me how many times the string has occurred. Again for one string, I could do:
length(strfind({'PO','FOO','PO1','FOO1','PO1','PO'},'PO'))
But this obviously doesn't work for multiple strings at a time.
If you want to find multiple strings, then just use the second output of ismember instead to tell you which string it is. If you really need case-insensitive matching, I've added the upper call to force all inputs to be upper-case. You can omit this if you think it's already uppercase.
data = {'PO','FOO','PO1','FOO1','PO1','PO', 'PO'};
[tf, inds] = ismember(upper(data), {'PO1', 'PO'});
% 2 0 1 0 1 2 2
You can then use the second output to determine which string was found where:
% PO1 Occurrences
find(inds == 1)
% 3 5
% PO Occurrences
find(inds == 2)
% 1 6 7
If you want the equivalent of histc, you can use accumarray to do that. We can pass it all of the values of inds that are non-zero (i.e. the ones that you were actually searching for).
accumarray(inds(tf).', ones(sum(tf), 1))
% 2 3
If instead you want to get the histogram of all strings (not just the ones you're searching for) you could do the following:
[strings, ~, inds] = unique(data, 'stable');
occurrences = accumarray(inds, ones(size(inds)));
% 'PO' [3]
% 'FOO' [1]
% 'PO1' [2]
% 'FOO1' [1]
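For comparison only, here is a rough Python sketch of the same lookup: one pass over the data that records where each search string occurs, from which the per-string counts follow. The helper name locate is hypothetical, positions are 1-based to match the MATLAB output above, and matching is case-insensitive via upper().

```python
def locate(data, targets):
    """Map each target (upper-cased) to the 1-based positions where it occurs."""
    positions = {t.upper(): [] for t in targets}
    for i, item in enumerate(data, start=1):
        key = item.upper()
        if key in positions:
            positions[key].append(i)
    return positions

data = ["PO", "FOO", "PO1", "FOO1", "PO1", "PO", "PO"]
found = locate(data, ["PO1", "PO"])
counts = {name: len(pos) for name, pos in found.items()}
print(found)
print(counts)
```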

Does Python have a string contains how many substring method, allowing for overlap?

I want to count how many times a substring occurs in a long string. How can I do this in Python?
"12212"
contains 2x "12"
How can I get this count?
It must allow for overlapping substrings; for instance "1111" contains 3 "11" substrings.
"12121" contains 2 "121" substrings.
"1111".count("11")
will return 2. It does not count any overlaps.
Strings have a count method.
You can do
s = '12212'
s.count('12') # this equals 2
Edit, since the question changed: the approach below was posted as a comment by tobias_k.
To count with overlap,
count_all = lambda string, sub: sum(string[i:i+len(sub)] == sub for i in range(len(string) - len(sub) + 1))
This can be called with,
count_all('1111', '11') # this returns 3
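Another way to sketch the overlapping count, assuming a non-empty substring, is to call str.find repeatedly and restart the search one character after each hit instead of after the whole match. The function name count_overlapping is only illustrative.

```python
def count_overlapping(string, sub):
    """Count occurrences of sub in string, allowing overlaps (sub must be non-empty)."""
    count = 0
    start = string.find(sub)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping matches are found.
        start = string.find(sub, start + 1)
    return count

print(count_overlapping("1111", "11"))
```

This avoids slicing at every index, which may matter for long strings with few matches.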

Optimizing count of occurrence of a string

I have to count how often a certain string is contained in a cell array. The problem is that the code is way too slow: it takes almost 1 second to do this.
uniqueWordsSize = 6; % just a sample number
wordsCounter = zeros(uniqueWordsSize, 1);
uniqueWords = unique(words); % words is a cell-array
for i = 1:uniqueWordsSize
    wordsCounter(i) = sum(strcmp(uniqueWords(i), words));
end
What I'm currently doing is to compare every word in uniqueWords with the cell-array words and use sum in order to calculate the sum of the array which gets returned by strcmp.
I hope someone can help me to optimize that.... 1 second for 6 words is just too much.
EDIT: ismember is even slower.
You can drop the loop completely by using the third output of unique together with hist:
words = {'a','b','c','a','a','c'}
[uniqueWords,~,wordOccurrenceIdx]=unique(words)
nUniqueWords = length(uniqueWords);
counts = hist(wordOccurrenceIdx,1:nUniqueWords)
uniqueWords =
'a' 'b' 'c'
wordOccurrenceIdx =
1 2 3 1 1 3
counts =
3 1 2
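To make the idea concrete outside MATLAB, here is a small Python sketch of the same three steps: find the sorted unique words, map every word to the index of its unique entry, then tally the indices. The function name is hypothetical, and the indices are 0-based, unlike MATLAB's 1-based output above.

```python
def unique_with_counts(words):
    """Return (sorted unique words, per-word index into them, count per unique word)."""
    unique_words = sorted(set(words))
    index_of = {w: i for i, w in enumerate(unique_words)}
    occurrence_idx = [index_of[w] for w in words]  # 0-based indices
    counts = [0] * len(unique_words)
    for i in occurrence_idx:
        counts[i] += 1
    return unique_words, occurrence_idx, counts

print(unique_with_counts(["a", "b", "c", "a", "a", "c"]))
```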
Here is a tricky way that avoids explicit for loops:
clc
close all
clear all
Paragraph=lower(fileread('Temp1.txt'));
AlphabetFlag=Paragraph>=97 & Paragraph<=122; % finding alphabets
DelimFlag=find(AlphabetFlag==0); % considering non-alphabets delimiters
WordLength=[DelimFlag(1), diff(DelimFlag)];
Paragraph(DelimFlag)=[]; % removing the delimiters
Words=mat2cell(Paragraph, 1, WordLength-1); % cut the paragraph into words
[SortWords, Ia, Ic]=unique(Words); %finding unique words and their subscript
Bincounts = histc(Ic,1:size(Ia, 1));%finding their occurrence counts
[SortBincounts, IndBincounts]=sort(Bincounts, 'descend');% finding their frequency
FreqWords=SortWords(IndBincounts); % sorting words according to their frequency
FreqWords(1)=[];SortBincounts(1)=[]; % dealing with remaining white space
Freq=SortBincounts/sum(SortBincounts)*100; % frequency percentage
%% plot
NMostCommon=20;
disp(Freq(1:NMostCommon))
pie([Freq(1:NMostCommon); 100-sum(Freq(1:NMostCommon))], [FreqWords(1:NMostCommon), {'other words'}]);

Concatenate variable-length-strings as table

I am fetching an array of data from SQL, and then concatenating them as strings for display. The function looks like this:
function FetchTopStats( Conn, iLimit )
    local sToReturn = "\tS.No. \t UserName \t Score\n\t"
    SQLQuery = assert( Conn:execute( string.format( [[SELECT username, totalcount FROM chatstat ORDER BY totalcount DESC LIMIT %d]], iLimit ) ) )
    DataArray = SQLQuery:fetch ({}, "a")
    i = 1
    while DataArray do
        sToReturn = sToReturn..tostring( i ).."\t"..DataArray.username.." \t "..DataArray.totalcount.."\n\t"
        DataArray = SQLQuery:fetch ({}, "a")
        i = i + 1
    end
    return sToReturn
end
This gives me an output like:
S.No. UserName Score
1 aim 6641
2 time 5021
3 Shepard 4977
and so on. I am thinking of using a string.format function, to have a display as follows:
S.No.   UserName    Score
1       aim         6641
2       time        5021
3       Shepard     4977
But I am totally out of ideas on how to achieve this. The only option that comes to mind is checking the string length of each username and then adding \t accordingly. I would like to keep that as a last resort.
Well, you need to either find out the maximum length of username thus making the algorithm 2-pass, or limit it to some arbitrary (but reasonable) size and unconditionally chop off the tails of too long strings. Once you have the column width, you can use left or right aligned format strings:
> print(string.format("|%-10d|%-20s|%10d|", 1, "Shepard", 9000))
|1         |Shepard             |      9000|
Also, for large tables consider using table.concat to build the final output: it is considerably faster than repeatedly appending strings (see Chapter 11.6 of Programming in Lua (PiL) for an explanation).
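The two-pass approach can be sketched as follows. This is written in Python rather than Lua purely as an illustration; the function name format_table and the fixed widths for the number and score columns are assumptions, not part of the original code. The first pass finds the widest username, the second formats each row with that width, and the lines are joined once at the end (the counterpart of table.concat).

```python
def format_table(rows):
    """rows: list of (username, score) tuples, already sorted by score."""
    # Pass 1: the username column must fit the header and the longest name.
    name_width = max([len("UserName")] + [len(name) for name, _ in rows])
    # Pass 2: left-align number and name, right-align the score.
    lines = [f"{'S.No.':<6} {'UserName':<{name_width}} {'Score':>6}"]
    for i, (name, score) in enumerate(rows, start=1):
        lines.append(f"{i:<6} {name:<{name_width}} {score:>6}")
    # Join once instead of appending to a growing string.
    return "\n".join(lines)

table = format_table([("aim", 6641), ("time", 5021), ("Shepard", 4977)])
print(table)
```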

Resources