How to split a string into a list of words in Tcl, ignoring multiple spaces?

Basically, I have a string that consists of multiple, space-separated words. The thing is, however, that there can be multiple spaces instead of just one separating the words. This is why [split] does not do what I want:
split "a    b"
gives me this:
{a {} {} {} b}
instead of this:
{a b}
Searching Google, I found a page on the Tcler's wiki, where a user asked more or less the same question.
One proposed solution would look like this:
split [regsub -all {\s+} "a    b" " "]
which seems to work for simple strings. But a test string such as [string repeat " " 4] (used string repeat because StackOverflow strips multiple spaces) will result in regsub returning " ", which split would again split up into {{} {}} instead of an empty list.
Another proposed solution was this one, to force a reinterpretation of the given string as a list:
lreplace "a list with many spaces" 0 -1
But if there's one thing I've learned about TCL, it is that you should never use list functions (starting with l) on strings. And indeed, this one will choke on strings containing special characters (namely { and }):
lreplace "test \{a b\}" 0 -1
returns test {a b} instead of test \{a b\} (which would be what I want, every space-separated word split up into a single element of the resulting list).
Yet another solution was to use a 'filter':
proc filter {cond list} {
    set res {}
    foreach element $list {if [$cond $element] {lappend res $element}}
    set res
}
You'd then use it like this:
filter llength [split "a list with many spaces"]
Again, same problem. This would call llength on a string, which might contain special characters (again, { and }) - passing it "\{a b\}" would result in TCL complaining about an "unmatched open brace in list".
I managed to get it to work by modifying the given filter function, adding a {*} in front of $cond in the if, so I could use it with string length instead of llength, which seemed to work for every possible input I've tried to use it on so far.
Is this solution safe to use as it is now? Would it choke on some special input I didn't test so far? Or, is it possible to do this right in a simpler way?

The easiest way is to use regexp -all -inline to select and return all words. For example:
# The RE matches any non-empty sequence of non-whitespace characters
set theWords [regexp -all -inline {\S+} $theString]
If you instead define words as sequences of alphanumeric characters (plus underscore), use {\w+} as the regular expression.

You can use regexp instead. From the Tcler's Wiki page on split:
Splitting by whitespace: the pitfalls
split { abc def  ghi}
{} abc def {} ghi
Usually, if you are splitting by whitespace and do not want those blank fields, you are better off doing:
regexp -all -inline {\S+} { abc def  ghi}
abc def ghi
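For readers coming from other languages, the same idea can be sketched in Python. This is only an analogue of the Tcl regexp -all -inline {\S+} approach, not Tcl itself; the helper name split_words is made up for illustration. It shows that collecting runs of non-whitespace avoids the empty fields, returns an empty list for an all-space string, and treats braces as ordinary characters.

```python
import re

def split_words(text):
    """Return the maximal runs of non-whitespace characters in text."""
    return re.findall(r"\S+", text)

# Splitting on a single space keeps one empty field per extra space.
naive = "a    b".split(" ")

# Collecting non-whitespace runs does not.
words = split_words("a    b")
only_spaces = split_words(" " * 4)
with_braces = split_words("test {a  b}")

print(naive)
print(words, only_spaces, with_braces)
```

Because the pattern only looks at whitespace, no input is ever interpreted as a list, which is the property the question asks for.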

Related

Lua - match only words outside {} braces in string and replace or append the words with substring

I have various strings with forms similar to:
This is a sentence outside braces{sentence{} with some words. {This is a
sentence inside braces with some words.}{This is a second sentence
inside braces.} Maybe some more words here for another sentence.
With Lua, I want to only match specific words in the string which are outside the "{}" braces. For example, I might want to match the word "sentence" outside the braces but not the occurrences of "sentence" inside the braces. I want to only match the bolded occurrences of the word not the italicized ones.
How to do it?
EDIT: What if I want to append to or replace the matched words while keeping the substrings inside the braces intact?
Example: append "word" to sentence:
This is a sentenceword outside braces{sentence{} with some words. {This is a
sentence inside braces with some words.}{This is a second sentence
inside braces.} Maybe some more words here for another sentenceword.
The simplest way to do this is to replace every brace-delimited section with a zero-length string in a temporary variable, which you can then search for whatever you like.
You can easily do this using Lua's pattern matching and the following simple gsub code:
local tempStr = startStr:gsub("{.-}","")
The .- is the part that makes it grab everything between the { and the nearest }, and gsub then replaces it all with an empty string.
Edit: The issue with the above method, as DarkWiiPlayer has pointed out, is that the first open brace matches with the first close brace, which is incorrect when braces are nested.
The way around that is to use balanced braces (%b) as DarkWiiPlayer has recommended in his answer, like so:
local tempStr = startStr:gsub("%b{}","")
local function weird_match(word, str)
    return str:gsub("%b{}", ''):match(word)
end
1. Replace balanced pairs of { and } with the empty string
2. Find the desired pattern (word) in the resulting string
3. Return the matched word (or its captures, if it has any)
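The answers above only find the word; the EDIT in the question also asks to append to it while leaving the braced text untouched. One possible way to do that is sketched below in Python rather than Lua. It assumes the braces are balanced, and the function name replace_outside_braces is hypothetical. Instead of deleting the braced groups, it tracks nesting depth and applies the replacement only to text at depth 0.

```python
def replace_outside_braces(text, word, replacement):
    """Replace `word` only where it occurs outside {...} groups."""
    pieces = []    # finished segments, in original order
    outside = []   # characters collected at nesting depth 0
    depth = 0
    for ch in text:
        if ch == "{":
            if depth == 0:
                # Leaving depth 0: flush the outside text with the replacement applied.
                pieces.append("".join(outside).replace(word, replacement))
                outside = []
            depth += 1
            pieces.append(ch)
        elif ch == "}" and depth > 0:
            depth -= 1
            pieces.append(ch)
        elif depth > 0:
            pieces.append(ch)      # inside braces: copy unchanged
        else:
            outside.append(ch)     # outside braces: candidate for replacement
    pieces.append("".join(outside).replace(word, replacement))
    return "".join(pieces)

sample = "A sentence here {a sentence {nested sentence}} and a sentence."
result = replace_outside_braces(sample, "sentence", "sentenceword")
print(result)
```

Unlike Lua's %b{}, this sketch treats everything after an unmatched { as being inside braces, so the two approaches can differ on unbalanced input.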

java String.format - how to put a space between two characters

I am looking for a way to use a formatter to put a space between two characters. I thought it would be easy with a string formatter.
Here is what I am trying to accomplish:
given "AB", it will produce "A B".
Here is what I have tried so far:
"AB".format("%#s")
but this keeps returning "AB"; I want "A B". I thought the number sign could be used for a space.
I also tried this:
"26".format("%#d"), but it still prints "26".
Is there any way to do this with String.format?
It is kind of possible with the string formatter although not directly with a pattern.
jshell> String.format("%1$c %2$c", "AB".chars().boxed().toArray())
$10 ==> "A B"
We need to turn the string into an object array so it can be passed in as varargs and the formatter pattern can extract characters based on index (1$ and 2$) and format them as characters (c).
A much simpler regex solution is the following which scales to any number of characters:
jshell> "ABC^&*123".replaceAll(".", "$0 ").trim()
$3 ==> "A B C ^ & * 1 2 3"
Every single character is replaced with itself ($0) followed by a space. The extra trailing space is then removed with the trim() call.
I could not find a way to do this using String#format. But here is a way to accomplish this using regex replacement:
String input = "AB";
String output = input.replaceAll("(?<=[A-Z])(?=[A-Z])", " ");
System.out.println(output);
The regex pattern (?<=[A-Z])(?=[A-Z]) will match every position in between two capital letters, and a space is inserted at that point. The above code prints:
A B
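For comparison, here is a rough Python sketch of the two ideas above: spacing out every character, and inserting a space only between two capital letters with lookarounds. It is an analogue of the Java answers, not a replacement for them, and both function names are invented for this example.

```python
import re

def space_out(text):
    """Put one space between every pair of adjacent characters."""
    return " ".join(text)

def space_between_capitals(text):
    """Insert a space only at positions that sit between two capital letters."""
    return re.sub(r"(?<=[A-Z])(?=[A-Z])", " ", text)

print(space_out("AB"))
print(space_out("ABC^&*123"))
print(space_between_capitals("ABc"))
```

The join version needs no trimming because it never adds a trailing space.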

How to match a part of string before a character into one variable and all after it into another

I have a problem with splitting a string into two parts on a special character.
For example:
12345#data
or
1234567#data
The first part has 5-7 characters and is separated by "#" from the second part, which holds other data (characters, numbers, it doesn't matter what).
I need to store two parts on each side of # in two variables:
x = 12345
y = data
without "#" character.
I was looking for some Lua string function like splitOn("#") or substring until character, but I haven't found that.
Use string.match and captures.
Try this:
s = "12345#data"
a,b = s:match("(.+)#(.+)")
print(a,b)
See this documentation:
First of all, although Lua does not have a split function in its standard library, it does have string.gmatch, which can be used instead of a split function in many cases. Unlike a split function, string.gmatch takes a pattern to match the non-delimiter text, instead of the delimiters themselves.
It is easily achievable with the help of a negated character class with string.gmatch:
local example = "12345#data"
for i in string.gmatch(example, "[^#]+") do
print(i)
end
The [^#]+ pattern matches one or more characters other than # (so it effectively "splits" the string on a single-character delimiter).
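As a point of comparison, a split into exactly two variables can be sketched in Python with str.partition. This is not Lua, and the helper name split_once is an assumption made for the example. Note one difference from the (.+)#(.+) pattern above: .+ is greedy, so the Lua match breaks at the last #, while partition breaks at the first.

```python
def split_once(text, delimiter="#"):
    """Split text at the first delimiter into (before, after)."""
    head, sep, tail = text.partition(delimiter)
    if not sep:
        # partition returns an empty separator when the delimiter is absent
        raise ValueError(f"no {delimiter!r} found in {text!r}")
    return head, tail

x, y = split_once("12345#data")
print(x, y)
```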

AWK - enclose found strings with symbols in one command

I have a large body of text and I print only lines that contain one of several strings. Each line can contain more than one string.
Example of the rule:
(house|mall|building)
I want to mark the found string for making the result easier to read.
Example of the result I want:
New record: Two New York houses under contract for nearly $5 million each.
New record: Two New York #house#s under contract for nearly $5 million each.
I know I can find the location, trim, add marker, add string etc.
I am asking if there is a way to mark the found string in one command.
Thanks.
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html
gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression ...
sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of the
extended regular expression ERE in string in and return the number of
substitutions. An ampersand ( '&' ) appearing in the string repl shall
be replaced by the string from in that matches the ERE ...
BEGIN {
    r = "house|mall|building"
    s = "Two New York houses under contract for nearly $5 million each."
    gsub(r, "#&#", s)
    print s
}
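The role of & in the awk replacement string (it stands for the matched text) has a close counterpart in most regex libraries. A minimal Python sketch of the same marking step follows; it is an analogue, not awk, and the name mark_matches and the marker parameter are made up here.

```python
import re

PATTERN = re.compile(r"house|mall|building")

def mark_matches(line, marker="#"):
    """Wrap every match of PATTERN in the marker, leaving the rest of the line as is."""
    return PATTERN.sub(lambda m: f"{marker}{m.group(0)}{marker}", line)

marked = mark_matches("Two New York houses under contract for nearly $5 million each.")
print(marked)
```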

How can I grep nth column of tab delimited file in Groovy?

My source file is tab delimited and I need to grep the 4th column of values. How can I do this in Groovy? Here's my code which doesn't work. Is it even close?
def tab_file = new File('source_file.tab')
tab_file.eachline { line -> println line.grep('\t\t\t\t'}
You could split by tab character, that would give you an array you can index into to get the column:
groovy:000> s = "aaa\tbbb\tccc\tddd\teee";
===> aaa bbb ccc ddd eee
groovy:000> s.split("\\t")[3]
===> ddd
Something like the following should work:
tab_file.eachLine { line ->
    println ((line =~ /([^\t]*\t){3}([^\t]*)/)[0][2])
}
Explanation:
The =~ operator creates a java.util.regex.Matcher object using the pattern on the right-hand side. Groovy then lets you implicitly execute find() via the array subscript operator. If your regex has groups in it, each result is a List with the whole match as element 0 and the groups as further elements. So [0][2] is the first match of the regex (zero-indexed), and specifically its 2nd group. (By the way, if there were no groups in the regex, the result would just be a String with the match.)
Update/Aside:
I was just looking into the grep() functionality added to Object, as I was curious. I'm not sure I see the utility outside of collection types, but when applied to Strings it doesn't do what you might expect: it appears to loop through the characters in the string and compare each character against the passed-in String, collecting matches in a list. If your passed-in String is longer than one character, you'll never get a match, as the single character under inspection in each iteration will never equal the whole passed-in String (in your example, no single \t equals "\t\t\t\t").
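The split-and-index idea from the first answer carries over to most languages. Below is a small Python sketch of it, not Groovy; the function name nth_column is hypothetical, and returning None for short lines is a design choice of this sketch rather than something the answers above specify.

```python
def nth_column(line, n, delimiter="\t"):
    """Return the n-th (1-based) field of a delimited line, or None if the line is too short."""
    fields = line.rstrip("\n").split(delimiter)
    return fields[n - 1] if len(fields) >= n else None

fourth = nth_column("aaa\tbbb\tccc\tddd\teee", 4)
print(fourth)
```

Splitting on the exact delimiter keeps empty columns in place, so the index still refers to the right field when earlier columns are blank.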

Resources