Is there a way in ML to take a string and return a list of its substrings, where the separator is a space, newline, or EOF, while keeping quoted strings inside the input intact?
EX) hello world "my id" is 5555
-> [hello, world, my id, is, 5555]
I am working on tokenizing these and then classifying them into:
->[word, word, string, word, int]
Sure you can! Here's the idea:
If we take a string like "Hello World, \"my id\" is 5555", we can split it at the quote marks, ignoring the spaces for now. This gives us ["Hello World, ", "my id", " is 5555"]. The important thing to notice here is that the list contains three elements - an odd number. As long as the string only contains pairs of quotes (as it will if it's properly formatted), we'll always get an odd number of elements when we split at the quote marks.
A second important thing is that all the even-numbered elements of the list will be strings that were unquoted (if we start counting from 0), and the odd-numbered ones were quoted. That means that all we need to do is tokenize the ones that were unquoted, and then we're done!
I put some code together - you can continue from there:
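fun foo s =
  let
    (* String.fields keeps empty fields, so the even/odd positions stay
       aligned even when the string starts or ends with a quote mark;
       String.tokens would drop them and break the parity check *)
    val quoteSep = String.fields (fn c => c = #"\"") s
    (* Char.isSpace covers spaces, tabs and newlines *)
    val spaceSep = String.tokens Char.isSpace
    fun sepEven [] = []
      | sepEven [x] = spaceSep x (* last (or only) unquoted piece *)
      | sepEven (x::y::xs) = spaceSep x @ (y :: sepEven xs) (* x was unquoted, y was quoted *)
  in
    if length quoteSep mod 2 = 0
    then raise Fail "uneven number of quote marks" (* something is wrong! *)
    else sepEven quoteSep
  end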
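For comparison, here is the same split-at-the-quotes idea as a minimal Python sketch. The function name split_keeping_quotes is made up for this illustration, and it assumes quotes are never escaped or nested:

```python
def split_keeping_quotes(s):
    """Split s on whitespace, but keep double-quoted sections whole."""
    # str.split keeps empty pieces, so even positions are always unquoted
    # text and odd positions are always quoted text.
    parts = s.split('"')
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced quote marks")
    tokens = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            tokens.extend(part.split())  # unquoted: split on spaces, tabs, newlines
        else:
            tokens.append(part)          # quoted: keep verbatim
    return tokens


print(split_keeping_quotes('hello world "my id" is 5555'))
# ['hello', 'world', 'my id', 'is', '5555']
```

An even number of pieces means an odd number of quote marks, which is why that case is rejected.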
String.tokens brings you halfway there. But if you really want to handle quotes like you are sketching then there is no way around writing an actual lexer. MLlex, which comes with SML/NJ and MLton (but is usable with any SML) could help. Or you just write it by hand, which should be easy enough in this case as well.
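To give a rough idea of what "writing it by hand" could look like, here is a small character-by-character lexer sketched in Python rather than SML. The token kinds ('word', 'string', 'int') follow the question; the function name lex and the rule that an int is a run of digits are assumptions made for this sketch, and escaped quotes are not handled:

```python
def lex(text):
    """Return (kind, value) pairs; kind is 'word', 'int', or 'string'."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                      # separators are skipped
        elif ch == '"':
            end = text.find('"', i + 1)
            if end == -1:
                raise ValueError("unterminated string starting at index %d" % i)
            tokens.append(("string", text[i + 1:end]))
            i = end + 1
        else:
            start = i
            # consume until whitespace, a quote, or end of input
            while i < len(text) and not text[i].isspace() and text[i] != '"':
                i += 1
            lexeme = text[start:i]
            kind = "int" if lexeme.isdigit() else "word"
            tokens.append((kind, lexeme))
    return tokens


print(lex('hello world "my id" is 5555'))
# [('word', 'hello'), ('word', 'world'), ('string', 'my id'), ('word', 'is'), ('int', '5555')]
```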
I am looking to extract only the letters from a given string, but my code does exactly the opposite.
s= "A man, a plan, a canal: Panama"
newS = ''.join(re.findall("[^a-zA-Z]*", s))
print(newS)  # my o/p: , , :
expected o/p string is:
"A man a plan a canal Panama"
Your regular expression is inverting the match - that's what the caret symbol (^) does inside square brackets (negated character class). You first need to remove that.
Next, you should be matching a sequence of one or more characters (+) rather than zero or more characters (*) -- using * will match the empty string, which you don't want in this case.
Finally your join should join with a space to get the intended output, rather than an empty string -- which won't retain the spaces between the words.
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
Though not essential in this case, it's advisable to use raw strings (the r prefix) for regular expressions.
Full working code:
import re
s = 'A man, a plan, a canal: Panama'
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
print(newS)
I am trying to write a program in Haskell to split a string by delimiter.
I have studied different examples provided by other users. One example is the code posted below.
split :: String -> [String]
split [] = [""]
split (c:cs)
    | c == ',' = "" : rest
    | otherwise = (c : head rest) : tail rest
  where
    rest = split cs
Sample Input: "1,2,3".
Sample Output: ["1","2","3"].
I have been trying to modify the code so that the output would be something like ["1", ",", "2", ",", "3"], which includes the delimiter in the output as well, but I just cannot succeed.
For example, I changed the line:
| c == ',' = "" : rest
into:
| c == ',' = "," : rest
But the result becomes ["1,","2,","3"].
What is the problem and in which part I have had a misunderstanding?
If you're trying to write this function "for real" instead of writing the character-by-character recursion for practice, I think a clearer method is to use the break function from Data.List. The following expression:
break (==',') str
breaks the string into a tuple (a,b) where the first part consists of the initial "comma-free" part, and the second part is either more string starting with the comma or else empty if there's no more string.
This makes the definition of split clear and straightforward:
split str = case break (==',') str of
    (a, ',':b) -> a : split b
    (a, "")    -> [a]
You can verify that this handles split "" (which returns [""]), so there's no need to treat that as a special case.
This version has the added benefit that the modification to include the delimiter is also easy to understand:
split2 str = case break (==',') str of
    (a, ',':b) -> a : "," : split2 b
    (a, "")    -> [a]
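For readers who don't know Haskell, the same shape can be sketched in Python, where str.partition plays roughly the role of break: it returns the part before the first delimiter, the delimiter itself (or an empty string if there is none), and the remainder. The name split_keep_delim is invented for this sketch:

```python
def split_keep_delim(s, delim=","):
    """Split s at delim and keep each delimiter as its own list element."""
    out = []
    while True:
        head, sep, s = s.partition(delim)
        out.append(head)
        if not sep:          # no delimiter left: head was the last piece
            return out
        out.append(sep)


print(split_keep_delim("1,2,3"))  # ['1', ',', '2', ',', '3']
```

Like split2, this returns a one-element list containing the empty string for empty input.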
Note that I've written the patterns in these functions in more detail than is necessary to make it absolutely clear what's going on, and this also means that Haskell does a duplicate check on each comma. For this reason, some people might prefer:
split str = case break (==',') str of
    (a, _:b) -> a : split b
    (a, _)   -> [a]
or, if they still wanted to document exactly what they were expecting in each case branch:
split str = case break (==',') str of
    (a, _comma:b) -> a : split b
    (a, _empty)   -> [a]
Instead of altering code in the hope that it matches the expectations, it is usually better to understand the code fragment first.
split :: String -> [String]
split [] = [""]
split (c:cs) | c == ',' = "" : rest
             | otherwise = (c : head rest) : tail rest
  where rest = split cs
First of all, we had better analyze what split does. The first clause simply says "The split of an empty string is a list with one element, the empty string". This seems reasonable. The first guard of the second clause states: "In case the head of the string is a comma, we produce a list where the first element is an empty string, followed by the split of the remainder of the string." The last guard says "In case the first character of the string is not a comma, we prepend that character to the first item of the split of the remaining string, followed by the remaining elements of the split of the remaining string". Mind that split returns a list of strings, so head rest is a string.
So if we want to add the delimiter to the output, then we need to add it as a separate string in the output of split. Where? In the first guard. We should not return "," : rest, because the recursion prepends the preceding characters to the head of that list, which is how "1," comes about. Instead, the comma has to go in as a separate string behind an empty head. So the result is:
split :: String -> [String]
split [] = [""]
split (c:cs) | c == ',' = "" : "," : rest
             | otherwise = (c : head rest) : tail rest
  where rest = split cs
That example code is poor style. Never use head and tail unless you know exactly what you're doing (these functions are unsafe, partial functions). Also, equality comparisons are usually better written as dedicated patterns.
With that in mind, the example becomes:
split :: String -> [String]
split "" = [""]
split (',':cs) = "" : split cs
split (c:cs) = (c:cellCompletion) : otherCells
  where cellCompletion : otherCells = split cs
(Strictly speaking, this is still unsafe because the match cellCompletion:otherCells is non-exhaustive, but at least it happens in a well-defined place which will give a clear error message if anything goes wrong.)
Now IMO, this makes it quite a bit clearer what's actually going on here: with "" : split cs, the intent is not really to add an empty cell to the result. Rather, it is to add a cell which will be filled up by calls further up in the recursion stack. This happens because those calls deconstruct the deeper result again, with the pattern match cellCompletion : otherCells = split cs, i.e. they pop off the first cell again and prepend the actual cell contents.
So, if you change that to "," : split cs, the effect is just that all cells you build will already be pre-terminated with a , character. That's not what you want.
Instead you want to add an additional cell that won't be touched anymore. That needs to be deeper in the result then:
split (',':cs) = "" : "," : split cs
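If it helps to see the three variants side by side, here is a direct Python transliteration of the recursion (a sketch only; the parameter comma_cells is made up so that one function can play all three versions, and it stands for whatever the comma clause puts in front of the recursive result):

```python
def split_cells(s, comma_cells):
    """Recursive split; comma_cells is what a comma contributes to the result."""
    if s == "":
        return [""]
    c, rest = s[0], split_cells(s[1:], comma_cells)
    if c == ",":
        return comma_cells + rest
    # not a comma: fill the first (still open) cell with this character
    return [c + rest[0]] + rest[1:]


print(split_cells("1,2,3", [""]))       # original:    ['1', '2', '3']
print(split_cells("1,2,3", [","]))      # the attempt: ['1,', '2,', '3']
print(split_cells("1,2,3", ["", ","]))  # the fix:     ['1', ',', '2', ',', '3']
```

In the second call the "," cell is still the first cell when the caller prepends its character, which is exactly how "1," comes about.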
I have a somewhat esoteric problem. My program wants to decode morse code.
The point is, I will need to handle any character. Any random characters that adhere to my system and can correspond to a letter should be accepted. Meaning, the letter "Q" is represented by "- - . -", but my program will treat any string of characters (separated by appropriate newchar signal) to be accepted as Q, for example "dj ir j kw" (long long short long).
There is a danger of falling out of sync, so I will need to implement a "new character" signal. I chose this to be "xxxx" as in 4 letters. For white, blank space symbol, I chose "xxxxxx", 6 chars.
Long story short, how can I split the string that is to be decoded into readable characters based on the length of the delimiter (4 continuous symbols), since I can't really deterministically know what letters will make up the newchar delimiter?
The question is not very clearly worded.
For instance, here you show space as a delimiter between parts of the symbol Q:
for example "dj ir j kw" (long long short long)
Later you say:
For white, blank space symbol, I chose "xxxxxx", 6 chars.
Is that the symbol for whitespace, or the delimiter you use within a symbol (such as Q, above)? Your post doesn't say.
In this case, as always, an example is worth a thousand words. You should have shown a few examples of possible input and shown how you'd like them parsed.
If what you mean was that "dj ir j kw jfkl abpzoq jfkl dj ir j kw" should be decoded as "Q Q", and you just want to know how to match tokens by their length, then... the question is easy. There's a million ways you could do that.
In Lua, I'd do it in two passes. First, convert the message into a string containing only the length of each chunk of consecutive characters:
message = 'dj ir j kw jfkl abpzoq jfkl dj ir j kw'
message = message:gsub('(%S+)%s*', function(s) return #s end)
print(message) --> 22124642212
Then split on the number 4 to get each group
for group in message:gmatch('[^4]+') do
    print(group)
end
Which gives you:
2212
6
2212
So you could convert something like this:
function translate(message)
    local lengthToLetter = {
        ['2212'] = 'Q',
        [   '6'] = ' ',
    }
    local translation = {}
    message = message:gsub('(%S+)%s*', function(s) return #s end)
    for group in message:gmatch('[^4]+') do
        table.insert(translation, lengthToLetter[group] or '?')
    end
    return table.concat(translation)
end
print(translate(message))
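The same two-pass idea, sketched in Python for comparison. The lookup table only contains the two entries used above (a real decoder would need the full alphabet), and, as in the Lua version, a chunk of ten or more characters would produce a multi-digit length and confuse the split:

```python
# Only the two entries from the example above; everything else maps to '?'.
LENGTHS_TO_LETTER = {"2212": "Q", "6": " "}


def translate(message):
    # Pass 1: replace every whitespace-separated chunk by its length.
    lengths = "".join(str(len(chunk)) for chunk in message.split())
    # Pass 2: split on the 4-long "new character" signal, dropping empty groups.
    groups = [g for g in lengths.split("4") if g]
    return "".join(LENGTHS_TO_LETTER.get(g, "?") for g in groups)


print(translate("dj ir j kw jfkl abpzoq jfkl dj ir j kw"))  # Q Q
```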
This will split a string by any len continuous occurrences of char, which may be a character or pattern character class (such as %s), or of any character (i.e. .) if char is not passed.
It does this by using backreferences in the pattern passed to string.find, e.g. (.)%1%1%1 to match any character repeated four times.
The rest is just a bog-standard string splitter; the only real Lua peculiarity here is the choice of pattern.
-- split str, using (char * len) as the delimiter
-- leave char blank to split on len repetitions of any character
local function splitter(str, len, char)
    -- build pattern to match len continuous occurrences of char
    -- "(x)%1%1%1%1" would match "xxxxx" etc.
    local delim = "("..(char or ".")..")" .. string.rep("%1", len-1)
    local pos, out = 1, {}
    -- loop through the string, find the pattern,
    -- and string.sub the rest of the string into a table
    while true do
        local m1, m2 = string.find(str, delim, pos)
        -- no sign of the delimiter; add the rest of the string and bail
        if not m1 then
            out[#out+1] = string.sub(str, pos)
            break
        end
        out[#out+1] = string.sub(str, pos, m1-1)
        pos = m2+1
        -- string ends with the delimiter; bail
        if m2 == #str then
            break
        end
    end
    return out
end
-- and the result?
print(table.concat(splitter("dfdsfsdfXXXXXsfsdfXXXXXsfsdfsdfsdf", 5), ", "))
-- dfdsfsdf, sfsdf, sfsdfsdfsdf
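The backreference trick carries over to other regex flavours. As a rough Python sketch (the function name split_on_run is made up, and char is assumed to be a single regex atom with no capture groups of its own):

```python
import re


def split_on_run(text, run_length, char="."):
    """Split text on run_length consecutive repetitions of the same character."""
    # "(.)" captures one character; each "\1" demands the same character again.
    pattern = "(" + char + ")" + r"\1" * (run_length - 1)
    pieces = re.split(pattern, text, flags=re.DOTALL)
    # With one capture group, re.split interleaves the captured character
    # between the pieces, so every second element is the text we want.
    return pieces[::2]


print(split_on_run("dfdsfsdfXXXXXsfsdfXXXXXsfsdfsdfsdf", 5))
# ['dfdsfsdf', 'sfsdf', 'sfsdfsdfsdf']
```

One difference from the Lua function above: if the text ends with the delimiter, this sketch returns a trailing empty string instead of stopping early.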
This question is related to my question about Roxygen.
I want to write a new function that does word wrapping of strings, similar to strwrap or stringr::str_wrap, but with the following twist: Any elements (substrings) in the string that are enclosed in quotes must not be allowed to wrap.
So, for example, using the following sample data
test <- "function(x=123456789, y=\"This is a long string argument\")"
cat(test)
function(x=123456789, y="This is a long string argument")
strwrap(test, width=40)
[1] "function(x=123456789, y=\"This is a long"
[2] "string argument\")"
I want the desired output of a newWrapFunction(x, width=40, ...) to be:
desired <- c("function(x=123456789, ", "y=\"This is a long string argument\")")
desired
[1] "function(x=123456789, "
[2] "y=\"This is a long string argument\")"
identical(desired, newWrapFunction(test, width=40))
[1] TRUE
Can you think of a way to do this?
PS. If you can help me solve this, I will propose this code as a patch to roxygen2. I have identified where this patch should be applied and will acknowledge your contribution.
Here's what I did to get strwrap so it would not break single quoted sections on spaces:
A) Pre-process the "even" sections after splitting by the single-quotes by substituting "~|~" for the spaces:
Define new function strwrapqt
....
zz <- strsplit(x, "\'") # will be only working on even numbered sections
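for (i in seq_along(zz)) {
  # seq_len(...) * 2 gives the even indices and is empty when there are none,
  # whereas seq(2, 1, by=2) would raise an error for a string without quotes
  for (evens in seq_len(length(zz[[i]]) %/% 2) * 2) {
    zz[[i]][evens] <- gsub("[ ]", "~|~", zz[[i]][evens])
  }
}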
zz <- unlist(zz)
.... insert just before
z <- lapply(strsplit) ...........
Then at the end replace all the "~|~" with spaces. It might be necessary to do a lot more thinking about the other sorts of whitespace "events" to get a fully regular treatment.
....
y <- gsub("~\\|~", " ", y)
....
Edit: Tested @joran's suggestion. Matching single and double quotes would be a difficult task with the methods I am using, but if one were willing to consider any quote as equally valid as a separator target, one could just use zz <- strsplit(x, "\'|\"") as the splitting criterion in the code above.
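To show the placeholder trick end to end outside of R, here is a minimal Python sketch built on the standard textwrap module. It protects double-quoted sections (as in the question's test string) rather than single-quoted ones, the name wrap_keeping_quotes is made up, and it assumes the placeholder character never occurs in the input:

```python
import re
import textwrap

PLACEHOLDER = "\x00"  # assumption: this character never appears in the input


def wrap_keeping_quotes(text, width=40):
    """Wrap text at width, never breaking inside a double-quoted section."""
    # Step 1: hide the spaces inside "..." so the wrapper cannot break there.
    protected = re.sub(r'"[^"]*"',
                       lambda m: m.group(0).replace(" ", PLACEHOLDER),
                       text)
    # Step 2: ordinary wrapping; over-long chunks are left whole.
    lines = textwrap.wrap(protected, width=width,
                          break_long_words=False, break_on_hyphens=False)
    # Step 3: put the spaces back.
    return [line.replace(PLACEHOLDER, " ") for line in lines]


test = 'function(x=123456789, y="This is a long string argument")'
for line in wrap_keeping_quotes(test):
    print(line)
# function(x=123456789,
# y="This is a long string argument")
```

Unlike the desired R output in the question, textwrap drops the trailing space at the end of the first line.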
I have a big string (a base64 encoded image) that is 1050 characters long. How can I build one big string out of smaller pieces, like this in C?
function GetIcon()
return "Bigggg string 1"\
"continuation of string"\
"continuation of string"\
"End of string"
According to Programming in Lua 2.4 Strings:
We can delimit literal strings also by matching double square brackets [[...]]. Literals in this bracketed form may run for several lines, may nest, and do not interpret escape sequences. Moreover, this form ignores the first character of the string when this character is a newline. This form is especially convenient for writing strings that contain program pieces; for instance,
page = [[
<HTML>
<HEAD>
<TITLE>An HTML Page</TITLE>
</HEAD>
<BODY>
Lua
[[a text between double brackets]]
</BODY>
</HTML>
]]
This is the closest thing to what you are asking for, but using the above method keeps the newlines embedded in the string, so this will not work directly.
You can also do this with string concatenation (using ..):
value = "long text that" ..
" I want to carry over" ..
"onto multiple lines"
Most answers here solve this issue at run-time and not at compile-time.
Lua 5.2 introduces the escape sequence \z to solve this problem elegantly without incurring any run-time expense.
> print "This is a long \z
>> string with \z
>> breaks in between, \z
>> and is spanning multiple lines \z
>> but still is a single string only!"
This is a long string with breaks in between, and is spanning multiple lines but still is a single string only!
\z skips all subsequent whitespace characters in a string literal [1], up to the first non-space character. This works within a single-line literal too.
> print "This is a simple \z string"
This is a simple string
From Lua 5.2 Reference Manual
The escape sequence '\z' skips the following span of white-space characters, including line breaks; it is particularly useful to break and indent a long literal string into multiple lines without adding the newlines and spaces into the string contents.
1: All escape sequences, including \z, work only on short literal strings ("…", '…') and, understandably, not on long literal strings ([[...]], etc.)
I'd put all chunks in a table and use table.concat on it. This avoids the creation of new strings at every concatenation. For example (without counting overhead for strings in Lua):
-- bytes used
foo="1234".. -- 4 = 4
"4567".. -- 4 + 4 + 8 = 16
"89ab" -- 16 + 4 + 12 = 32
-- | | | \_ grand total after concatenation on last line
-- | | \_ second operand of concatenation
-- | \_ first operand of concatenation
-- \_ total size used until last concatenation
As you can see, this explodes pretty rapidly. It's better to:
foo=table.concat{
"1234",
"4567",
"89ab"}
Which will take about 3*4+12=24 bytes.
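The bookkeeping above can be replayed with a few lines of Python. This only reproduces the counting model used in this answer (each pairwise concatenation pays for its new operand plus the freshly built result); it is not a measurement of what a real Lua implementation allocates, and the function names are invented for the sketch:

```python
def bytes_pairwise(chunks):
    """Total bytes under the model: operand + new result at every '..' step."""
    total = 0
    built = 0                      # length of the string built so far
    for index, chunk in enumerate(chunks):
        if index == 0:
            total += len(chunk)    # the first literal on its own
        else:
            total += len(chunk) + (built + len(chunk))
        built += len(chunk)
    return total


def bytes_single_concat(chunks):
    """Total bytes under the model: all chunks once, plus one final result."""
    return sum(len(c) for c in chunks) + len("".join(chunks))


chunks = ["1234", "4567", "89ab"]
print(bytes_pairwise(chunks))       # 32
print(bytes_single_concat(chunks))  # 24
```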
Have you tried the string.sub(s, i [, j]) function?
You may like to look here:
http://lua-users.org/wiki/StringLibraryTutorial
This:
return "Bigggg string 1"\
"continuation of string"\
"continuation of string"\
"End of string"
C/C++ syntax causes the compiler to see it all as one large string. It is generally used for readability.
The Lua equivalent would be:
return "Bigggg string 1" ..
"continuation of string" ..
"continuation of string" ..
"End of string"
Do note that the C/C++ syntax is compile-time, while the Lua equivalent likely does the concatenation at runtime (though the compiler could theoretically optimize it). It shouldn't be a big deal though.