String splitting based on unknown delimiters (rather, LENGTH of delimiter) - string

I have a somewhat esoteric problem. My program needs to decode Morse code.
The catch is that I need to handle any characters. Any run of characters that adheres to my system and can correspond to a letter should be accepted. That is, the letter "Q" is represented by "- - . -", but my program should accept any string of characters (terminated by the appropriate new-character signal) with the same pattern as Q, for example "dj ir j kw" (long long short long).
There is a danger of falling out of sync, so I need to implement a "new character" signal. I chose this to be "xxxx", i.e. any 4 letters. For white, blank space symbol, I chose "xxxxxx", 6 chars.
Long story short: how can I split the string to be decoded into readable characters based on the length of the delimiter (4 contiguous symbols), since I can't know in advance which letters will make up the new-character delimiter?

The question is not very clearly worded.
For instance, here you show a space as the delimiter between the parts of the symbol Q:
for example "dj ir j kw" (long long short long)
Later you say:
For white, blank space symbol, I chose "xxxxxx", 6 chars.
Is that the symbol for whitespace, or the delimiter you use within a symbol (such as Q, above)? Your post doesn't say.
In this case, as always, an example is worth a thousand words. You should have shown a few examples of possible input and how you'd like them parsed.
If what you meant was that "dj ir j kw jfkl abpzoq jfkl dj ir j kw" should be decoded as "Q Q", and you just want to know how to match tokens by their length, then the question is easy. There are a million ways you could do that.
In Lua, I'd do it in two passes. First, convert the message into a string containing only the length of each chunk of consecutive characters (this assumes every chunk is shorter than 10 characters, so that each length is a single digit):
message = 'dj ir j kw jfkl abpzoq jfkl dj ir j kw'
message = message:gsub('(%S+)%s*', function(s) return #s end)
print(message) --> 22124642212
Then split on the number 4 to get each group:
for group in message:gmatch('[^4]+') do
  print(group)
end
Which gives you:
2212
6
2212
So you could convert something like this:
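function translate(message)
  local lengthToLetter = {
    ['2212'] = 'Q',
    [   '6'] = ' ',
  }
  local translation = {}
  -- replace every chunk of non-space characters by its length
  message = message:gsub('(%S+)%s*', function(s) return #s end)
  -- a 4 is the new-character signal; everything between 4s is one symbol
  for group in message:gmatch('[^4]+') do
    table.insert(translation, lengthToLetter[group] or '?')
  end
  return table.concat(translation)
end
-- pass the original text, not the already-converted digit string
print(translate('dj ir j kw jfkl abpzoq jfkl dj ir j kw')) --> Q Q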
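For comparison, here is a minimal sketch of the same length-based idea in Python rather than Lua. The table LENGTHS_TO_LETTER and the parameter newchar_len are illustrative names that do not come from the question, and the table only covers the two symbols used in the example. It keeps the lengths as a tuple of integers instead of a digit string, so chunks of 10 or more characters stay unambiguous.

```python
# Hypothetical lookup table: tuple of chunk lengths -> decoded character.
LENGTHS_TO_LETTER = {
    (2, 2, 1, 2): "Q",
    (6,): " ",
}


def translate(message, newchar_len=4):
    """Decode a message by looking only at the length of each chunk."""
    lengths = [len(chunk) for chunk in message.split()]
    decoded, current = [], []
    # A trailing sentinel flushes the last group even when the message
    # does not end with a new-character signal.
    for n in lengths + [newchar_len]:
        if n == newchar_len:
            if current:
                decoded.append(LENGTHS_TO_LETTER.get(tuple(current), "?"))
                current = []
        else:
            current.append(n)
    return "".join(decoded)


print(translate("dj ir j kw jfkl abpzoq jfkl dj ir j kw"))  # Q Q
```

Unknown groups come out as "?", mirroring the fallback in the Lua version.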

This will split a string by any len continuous occurrences of char, which may be a character or pattern character class (such as %s), or of any character (i.e. .) if char is not passed.
It does this by using backreferences in the pattern passed to string.find, e.g. (.)%1%1%1 to match any character repeated four times.
The rest is just a bog-standard string splitter; the only real Lua peculiarity here is the choice of pattern.
-- split str, using (char * len) as the delimiter
-- leave char blank to split on len repetitions of any character
local function splitter(str, len, char)
  -- build pattern to match len continuous occurrences of char
  -- "(x)%1%1%1%1" would match "xxxxx" etc.
  local delim = "("..(char or ".")..")" .. string.rep("%1", len-1)
  local pos, out = 1, {}
  -- loop through the string, find the pattern,
  -- and string.sub the rest of the string into a table
  while true do
    local m1, m2 = string.find(str, delim, pos)
    -- no sign of the delimiter; add the rest of the string and bail
    if not m1 then
      out[#out+1] = string.sub(str, pos)
      break
    end
    out[#out+1] = string.sub(str, pos, m1-1)
    pos = m2+1
    -- string ends with the delimiter; bail
    if m2 == #str then
      break
    end
  end
  return out
end
-- and the result?
local unpack = table.unpack or unpack -- unpack moved to table.unpack in Lua 5.2
print(unpack(splitter("dfdsfsdfXXXXXsfsdfXXXXXsfsdfsdfsdf", 5)))
-- dfdsfsdf, sfsdf, sfsdfsdfsdf
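The backreference trick carries over to other regex engines. As a rough sketch in Python (not Lua), the function below builds the pattern (.)\1{len-1} and walks the matches by hand; the name split_on_run is made up for this example. One difference from the Lua splitter above: when the text ends with the delimiter, this version returns a trailing empty string.

```python
import re


def split_on_run(text, length, char="."):
    """Split text on `length` consecutive repetitions of one character.

    `char` is a regex fragment for a single character; the default "."
    accepts any character, as long as it repeats.
    """
    # (char) captures one character, \1{length-1} demands the same
    # character again, giving a run of exactly `length` equal characters.
    delim = re.compile(r"(%s)\1{%d}" % (char, length - 1), re.DOTALL)
    parts, pos = [], 0
    for match in delim.finditer(text):
        parts.append(text[pos:match.start()])
        pos = match.end()
    parts.append(text[pos:])
    return parts


print(split_on_run("dfdsfsdfXXXXXsfsdfXXXXXsfsdfsdfsdf", 5))
# ['dfdsfsdf', 'sfsdf', 'sfsdfsdfsdf']
```

re.split is avoided here because it would also return the captured character from each delimiter.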

Related

How to determine the end of string in VFP 9?

In some programming languages, such as C for example, the end of string may be marked as a separate null terminator symbol.
How do I determine if the current symbol is the end of string?
Currently I use some string functions' calls, but I guess it may be performed easier.
*the string's end
IF ISBLANK(SUBSTR(str, pos, 1)) == .T. AND CHR(32) != SUBSTR(str, pos, 1)
RETURN .T.
ENDIF
There's no need to worry about C-style string termination in VFP.
Assuming you don't care what the last character is then from your example:
return (pos = len(str))
If you want to ignore spaces:
return (pos = len(alltrim(str)))
VFP strings are not ASCIIZ strings as in C. A VFP string can contain any ASCII character, including character 0, chr(0), which is the string terminator in C-style languages.
Normally the end of a string in VFP is simply its length. However, although it is not clear from your question, sometimes you get a string from a C source (e.g. a Win32 call) in which multiple string values are separated by chr(0). You can easily parse such a string into multiple strings with a function like alines(), e.g.:
? ALines(laLines, "hello"+Chr(0)+"there",1+4,Chr(0)) && prints 2
display memory like laLines && shows two string values
You could also use string functions such as at(), occurs(), asc() ... to locate or count that character.
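The same situation can be illustrated outside VFP. The short Python sketch below (Python strings are also length-based rather than NUL-terminated) shows that a chr(0) in the middle is an ordinary character and that splitting on it recovers the individual values; the variable names are made up for this example.

```python
# A string holding two values separated by a NUL character,
# as might come back from a C API.
raw = "hello\0there"

# The NUL is counted like any other character: 5 + 1 + 5.
print(len(raw))  # 11

# Splitting on the NUL recovers the separate values.
parts = raw.split("\0")
print(parts)  # ['hello', 'there']
```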

Convert string S to another string T by performing exactly K operations (append to / delete from the end of the string S)

I am trying to solve a problem. But I am missing some corner case. Please help me. The problem statement is:
You have a string, S , of lowercase English alphabetic letters. You can perform two types of operations on S:
Append a lowercase English alphabetic letter to the end of the string.
Delete the last character in the string. Performing this operation on an empty string results in an empty string.
Given an integer, k, and two strings, s and t , determine whether or not you can convert s to t by performing exactly k of the above operations on s.
If it's possible, print Yes; otherwise, print No.
Examples
Input Output
hackerhappy Yes
hackerrank
9
5 delete operations (h,a,p,p,y) and 4 append operations (r,a,n,k)
aba Yes
aba
7
4 delete operations (delete on empty = empty) and 3 append operations
I tried in this way (C language):
int sl = strlen(s); int tl = strlen(t); int diffi=0;
int i;
for(i=0;s[i]&&t[i]&&s[i]==t[i];i++); //going till matching
diffi=i;
((sl-diffi+tl-diffi<=k)||(sl+tl<=k))?printf("Yes"):printf("No");
Please help me to solve this.
Thank You
You also need the number of leftover operations to be even, because the only way to waste operations is to append a letter and then delete it again.
so maybe:
// c language - strcmp(s,t) returns 0 if s==t.
if(strcmp(s,t))
((sl-diffi+tl-diffi<=k && (k-(sl-diffi+tl-diffi))%2==0)||(sl+tl<=k))?printf("Yes"):printf("No");
else
if(sl+tl<=k||k%2==0) printf("Yes"); else printf("No");
You can also do it another way, using binary search.
Take the shorter string and take a substring (the pattern) of length/2.
1. Do a binary search (by character) on both strings. If you get a match, append length/4 more characters to the pattern; if it still matches, add length/2^n more each time; otherwise append one character to the original pattern (of length/2) and try again.
2. If you get a mismatch for the pattern of length/2, reduce the length of the pattern to length/4, and if you get a match, append the next character.
Now repeat steps 1 and 2.
Let n1 and n2 be the number of mismatched characters in the two strings. If n1+n2 <= k then the answer is Yes;
otherwise the answer is No.
Example:
s1=Hackerhappy
s2=Hackerrank
pattern=Hacker // length = 10 (s2 is smaller and length of s2=10 length/2 =5)
//Do a binary search of the pattern you will get a match by steps 1 and 2
n1 number of mismatched characters is 5
n2 number of mismatched characters is 4
Now n1+n2 <= k // because we need that many operations to make the two strings equal.
So Yes
This should work for all cases:
int sl = strlen(s); int tl = strlen(t); int diffi=0;
int i,m;
for(i=0;s[i]&&t[i]&&s[i]==t[i];i++); //going till matching
diffi=i;
m = sl+tl-2*diffi;
((k>=m&&(k-m)%2==0)||(sl+tl<=k))?printf("Yes"):printf("No");
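To make the reasoning behind that condition explicit, here is a small sketch of the same check in Python instead of C. The function name can_convert is invented for this example. The minimum number of operations deletes the part of s after the common prefix and appends the rest of t; any spare operations must either come in append/delete pairs or be enough to wipe s completely, since deleting from an empty string is a free no-op.

```python
def can_convert(s, t, k):
    """Return True if s can become t in exactly k append/delete-last steps."""
    # Length of the common prefix of s and t.
    common = 0
    for a, b in zip(s, t):
        if a != b:
            break
        common += 1

    # Fewest operations: delete s's differing tail, append t's tail.
    minimum = (len(s) - common) + (len(t) - common)

    # With at least len(s) + len(t) operations we can empty s, burn any
    # surplus as deletes on the empty string, then build t.
    if k >= len(s) + len(t):
        return True

    # Otherwise surplus operations must be append-then-delete pairs.
    return k >= minimum and (k - minimum) % 2 == 0


print(can_convert("hackerhappy", "hackerrank", 9))  # True
print(can_convert("aba", "aba", 7))                 # True
print(can_convert("aba", "aba", 5))                 # False
```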

Go's LeftStr, RightStr, SubStr

I believe there are no LeftStr(str,n) (take at most the first n characters), RightStr(str,n) (take at most the last n characters) and SubStr(str,pos,n) (take the first n characters after pos) functions in Go, so I tried to write my own:
// take at most n first characters
func Left(str string, num int) string {
	if num <= 0 {
		return ``
	}
	if num > len(str) {
		num = len(str)
	}
	return str[:num]
}

// take at most last n characters
func Right(str string, num int) string {
	if num <= 0 {
		return ``
	}
	max := len(str)
	if num > max {
		num = max
	}
	num = max - num
	return str[num:]
}
But I believe those functions will give incorrect output when the string contains Unicode characters. What is the fastest solution for those functions? Is a for range loop the only way?
As already mentioned in the comments, combining characters, modifying runes, and other multi-rune "characters" can cause difficulties.
Anyone interested in Unicode handling in Go should probably read the Go Blog articles "Strings, bytes, runes and characters in Go" and "Text normalization in Go".
In particular, the latter talks about the golang.org/x/text/unicode/norm package, which can help in handling some of this.
You can consider several levels of increasingly accurate (or increasingly Unicode-aware) ways of splitting off the first (or last) "n characters" of a string.
1. Just use n bytes.
This may split in the middle of a rune but is O(1), is very simple, and in many cases you know the input consists of only single byte runes.
E.g. str[:n].
2. Split after n runes.
This may split in the middle of a character. This can be done easily, but at the expense of copying and converting with just string([]rune(str)[:n]).
You can avoid the conversion and copying by using the unicode/utf8 package's DecodeRuneInString (and DecodeLastRuneInString) functions to get the length of each of the first n runes in turn and then return str[:sum] (O(n), no allocation).
3. Split after the n'th "boundary".
One way to do this is to use norm.NFC.FirstBoundaryInString(str) repeatedly or norm.Iter to find the byte position to split at and then return str[:pos].
Consider the displayed string "cafés", which could be represented in Go code as "cafés", "caf\u00E9s", or "caf\xc3\xa9s", all of which result in the identical six bytes. Alternatively, it could be represented as "cafe\u0301s" or "cafe\xcc\x81s", both of which result in the identical seven bytes.
The first "method" above may split those into "caf\xc3"+"\xa9s" and "cafe\xcc"+"\x81s".
The second may split them into "caf\u00E9"+"s" ("café"+"s") and "cafe"+"\u0301s" ("cafe"+"́s").
The third should split them into "caf\u00E9"+"s" and "cafe\u0301"+"s" (both shown as "café"+"s").
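The three levels can be tried out on the "cafés" example with a short sketch. It is written in Python rather than Go, the function names are invented here, and the third function is only an approximation: it keeps combining marks (as reported by unicodedata.combining) attached to their base character, which is much simpler than the boundary rules implemented by the norm package.

```python
import unicodedata


def left_bytes(s, n):
    # Level 1: cut after n UTF-8 bytes; may cut inside a code point.
    return s.encode("utf-8")[:n]


def left_code_points(s, n):
    # Level 2: cut after n code points; may strand a combining mark.
    return s[:n]


def left_combining_aware(s, n):
    # Level 3 (approximation): count only base characters and keep any
    # combining marks that follow the last one taken.
    if n <= 0:
        return ""
    count, end = 0, 0
    for i, ch in enumerate(s):
        if unicodedata.combining(ch) == 0:
            if count == n:
                break
            count += 1
        end = i + 1
    return s[:end]


composed = "caf\u00e9s"      # é as one code point
decomposed = "cafe\u0301s"   # e followed by a combining acute accent

print(left_bytes(composed, 4))              # b'caf\xc3'
print(left_code_points(decomposed, 4))      # cafe
print(left_combining_aware(decomposed, 4) == "cafe\u0301")  # True
```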

Standard ML string to a list

Is there a way in ML to take a string and output a list of its substrings, where the separator is a space, newline or EOF, while keeping quoted strings inside the string intact?
EX) hello world "my id" is 5555
-> [hello, world, my id, is, 5555]
I am then working on tokenizing these into:
->[word, word, string, word, int]
Sure you can! Here's the idea:
If we take a string like "Hello World, \"my id\" is 5555", we can split it at the quote marks, ignoring the spaces for now. This gives us ["Hello World, ", "my id", " is 5555"]. The important thing to notice here is that the list contains three elements - an odd number. As long as the string only contains pairs of quotes (as it will if it's properly formatted), we'll always get an odd number of elements when we split at the quote marks.
A second important thing is that all the even-numbered elements of the list will be strings that were unquoted (if we start counting from 0), and the odd-numbered ones were quoted. That means that all we need to do is tokenize the ones that were unquoted, and then we're done!
I put some code together - you can continue from there:
fun foo s =
  let
    (* String.fields keeps empty fields, so the odd/even argument also
       holds when the string starts or ends with a quote mark *)
    val quoteSep = String.fields (fn c => c = #"\"") s
    (* Char.isSpace covers spaces, tabs and newlines *)
    val spaceSep = String.tokens Char.isSpace
    fun sepEven [] = []
      | sepEven [x] = spaceSep x (* no more quotes in the string *)
      | sepEven (x::y::xs) = spaceSep x @ (y :: sepEven xs) (* x was unquoted, y was quoted *)
  in
    if length quoteSep mod 2 = 0
    then raise Fail "uneven number of quote marks" (* something is wrong! *)
    else sepEven quoteSep
  end
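The split-at-the-quotes idea can also be sketched in a few lines of Python (used here only for illustration; the function name tokenize is made up). Pieces at even positions were outside quotes and are split on whitespace; pieces at odd positions were inside quotes and are kept whole.

```python
def tokenize(line):
    """Split on whitespace, but keep double-quoted parts intact."""
    pieces = line.split('"')
    # Balanced quotes always produce an odd number of pieces.
    if len(pieces) % 2 == 0:
        raise ValueError("uneven number of quote marks")
    tokens = []
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            tokens.extend(piece.split())  # unquoted: split on whitespace
        else:
            tokens.append(piece)          # quoted: keep as one token
    return tokens


print(tokenize('hello world "my id" is 5555'))
# ['hello', 'world', 'my id', 'is', '5555']
```

This does not handle escaped quote marks inside a quoted part; for that, a real lexer is the safer route.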
String.tokens brings you halfway there. But if you really want to handle quotes like you are sketching then there is no way around writing an actual lexer. MLlex, which comes with SML/NJ and MLton (but is usable with any SML) could help. Or you just write it by hand, which should be easy enough in this case as well.

Finding mean of ascii values in a string MATLAB

The string I am given is as follows:
scrap1 =
a le h
ke fd
zyq b
ner i
You'll notice there are 2 blank spaces (ASCII 32) in each row. I need to find the mean ASCII value in each column without taking the spaces (32) into account. So first I would convert to ASCII codes with double(scrap1), but then how do I find the mean without taking the spaces into account?
If it's only the ASCII 32 you want to omit:
d = double(scrap1);
result = mean(d(d~=32)); %// logical indexing to remove unwanted value, then mean
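Note that the snippet above yields a single mean over the whole array. If a separate mean per column is wanted, as the question asks, the logic can be sketched as follows. This is Python rather than MATLAB, the function name column_means is invented, and the sample rows are made-up data, not the scrap1 array from the question.

```python
def column_means(rows, skip=" "):
    """Mean character code of each column, ignoring the `skip` character."""
    means = []
    # zip(*rows) walks the equal-length rows column by column.
    for column in zip(*rows):
        codes = [ord(ch) for ch in column if ch != skip]
        # A column made up entirely of skipped characters has no mean.
        means.append(sum(codes) / len(codes) if codes else float("nan"))
    return means


# Hypothetical 2x3 character array with one space in each row.
print(column_means(["ab ", "c d"]))  # [98.0, 98.0, 100.0]
```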
You can remove the intermediate spaces in the string with scrap1(scrap1 == ' ') = ''; This replaces any space in the input with an empty string. Then you can do the conversion to double and average the result. See here for other methods.
You could probably also use a regular expression to find the spaces and drop them, using the pattern "\s":
noSpaces = regexprep(scrap1(:).', '\s', ''); % flatten to a row vector, then delete all whitespace
% You can read more about regular expressions by typing doc regexp.
