Scala: split string via pattern matching - string

Is it possible to split a string into lexemes, somehow like
"user#domain.com" match {
case name :: "#" :: domain :: "." :: zone => doSmth(name, domain, zone)
}
In other words, in the same manner as lists...

Yes, you can do this with Scala's Regex functionality.
I found an email regex on this site, feel free to use another one if this doesn't suit you:
[-0-9a-zA-Z.+_]+#[-0-9a-zA-Z.+_]+\.[a-zA-Z]{2,4}
The first thing we have to do is add parentheses around groups:
([-0-9a-zA-Z.+_]+)#([-0-9a-zA-Z.+_]+)\.([a-zA-Z]{2,4})
With this we have three groups: the part before the #, between # and ., and finally the TLD.
Now we can create a Scala regex from it and then use Scala's pattern matching unapply to get the groups from the Regex bound to variables:
val Email = """([-0-9a-zA-Z.+_]+)#([-0-9a-zA-Z.+_]+)\.([a-zA-Z]{2,4})""".r
Email: scala.util.matching.Regex = ([-0-9a-zA-Z.+_]+)#([-0-9a-zA-Z.+_]+)\.([a-zA-Z]{2,4})
"user#domain.com" match {
case Email(name, domain, zone) =>
println(name)
println(domain)
println(zone)
}
// user
// domain
// com
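The match above has no fallback case, so an input that does not fit the pattern throws a MatchError. A minimal sketch of a safer variant, assuming you want an Option back (EmailSplit and splitEmail are hypothetical names used only for illustration):

```scala
import scala.util.matching.Regex

object EmailSplit {
  // Same pattern as above, with '#' as the separator used in the question.
  val Email: Regex = """([-0-9a-zA-Z.+_]+)#([-0-9a-zA-Z.+_]+)\.([a-zA-Z]{2,4})""".r

  // Hypothetical helper: returns the three groups, or None when the
  // whole input does not match the pattern.
  def splitEmail(s: String): Option[(String, String, String)] = s match {
    case Email(name, domain, zone) => Some((name, domain, zone))
    case _                         => None
  }
}
```

With this, EmailSplit.splitEmail("user#domain.com") yields Some(("user", "domain", "com")), and an input such as "not an email" yields None.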

Starting with Scala 2.13, it's possible to pattern match a String by unapplying a string interpolator:
val s"$user#$domain.$zone" = "user#domain.com"
// user: String = "user"
// domain: String = "domain"
// zone: String = "com"
If you are expecting malformed inputs, you can also use a match statement:
"user#domain.com" match {
case s"$user#$domain.$zone" => Some((user, domain, zone))
case _ => None
}
// Option[(String, String, String)] = Some(("user", "domain", "com"))

In my experience regex is quite inefficient, so I wouldn't advise it.
You CAN get closer to list-style matching by calling .toList on your string to turn it into a List[Char]. Any parts you extract will then be Char or List[Char] values; to turn a List[Char] back into a String, use .mkString. Note that :: binds a single head element, so the pattern from the question would not capture multi-character parts as written. I'm also not sure how efficient this is.
I have benchmarked basic string operations (like substring, indexOf, etc.) against regex for various use cases, and regex is usually an order of magnitude or two slower. And of course regex is hideously unreadable.
UPDATE: The best thing to do is to use Parsers, either the native Scala ones, or Parboiled2
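For the basic string operations mentioned above (indexOf, substring), a minimal sketch for the question's input could look like this. PlainSplit and splitAddress are hypothetical names, and splitting at the first '#' and the last '.' is an assumption about the intended format:

```scala
object PlainSplit {
  // Hypothetical helper: splits at the first '#' and at the last '.'.
  // Returns None if either separator is missing or any part would be empty.
  def splitAddress(s: String): Option[(String, String, String)] = {
    val hash = s.indexOf('#')
    val dot  = s.lastIndexOf('.')
    if (hash > 0 && dot > hash + 1 && dot < s.length - 1)
      Some((s.substring(0, hash), s.substring(hash + 1, dot), s.substring(dot + 1)))
    else
      None
  }
}
```

PlainSplit.splitAddress("user#domain.com") gives Some(("user", "domain", "com")); inputs without a '#' or without a '.' after it give None.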

Related

Match Operator prettier syntax in Groovy

I'm very new to Groovy (here working on a Jenkinsfile).
One of my coworkers uses a match operator to check a condition, but I find it unreadable and hard to maintain.
Original Match Operator:
PROJECT_NAME = 'projectA' // Normally user input from Jenkins params
if ( "${PROJECT_NAME}" ==~ /projectA|projectB|projectC|projectD/) { // The real line is 300 Char long
// Do stuff
}
There are 15 projects in total; I've shortened the line because it was too long. Every time he needs to add a project name, he appends it at the start or end of his regex.
Also, those project names are already in a list defined earlier.
projects = ['projectA',
'projectB',
'projectC',
'projectD']
Could there be a way to use this list to build the regex?
Here is what I tried:
string_regex = "/"
for (project in projects) {
string_regex = string_regex + project + "|"
}
string_regex = string_regex.substring(0, string_regex.length() - 1)
string_regex = string_regex + "/"
print "${string_regex}\n"
if ("${PROJECT_NAME}" ==~ string_regex) {
print "Well Done you did it\n"
// Do stuff
}
But sadly it doesn't seem to work. Is that because I'm using a string?
EDIT: I found out that I could use the contains method of a list in Groovy. In my case, it fixes my original problem. But I'm still curious about how to build such a regex from strings.
if (projects.contains(PROJECT_NAME)) {
// Do stuff
}
You can join your projects and then turn the string into a regexp via Pattern.compile(). For good measure use Pattern.quote() to safe-guard against chars in your project names with "meaning" in regexp.
import java.util.regex.Pattern
def projects = ['projectA',
'projectB',
'projectC',
'projectD']
def re = Pattern.compile(projects.collect{ Pattern.quote it }.join("|"))
['projectA', 'projectX'].each{
println it ==~ re
}
// -> true
// -> false
For what it's worth, I came to like the Groovy matching operators for their compact syntax. If you learn about them and practice for a short bit, you will probably get to like them, too.
Regardless, for a simple check of whether a String is part of a list, there is a much simpler way in Groovy than using a full-blown regexp: the in operator, e.g.:
def projects = ['projectA',
'projectB',
'projectC',
'projectD']
['projectA', 'projectX'].each {
println "${it} is ${it in projects ? 'IN' : 'NOT IN'} the project list"
}
which yields e.g.:
projectA is IN the project list
projectX is NOT IN the project list
More info on that operator and many other aspects of the Groovy language from the always excellent MrHaki here
Of course, if you need to account for case differences, etc... you have to massage the code a bit, but at some point, a regexp might be warranted.
If you already have a collection, you should nearly always use a collection operator. E.g. replace
if ( "${PROJECT_NAME}" ==~ /projectA|projectB|projectC|projectD/) {
with:
if (PROJECT_NAME in projects) {
Much easier to read and understand, no? 😉

How to split/tokenize a string by given requirements?

I have this string that I need to split up into tokens.
thorak={name="Thorak"}
The result should look something like this:
["thorak", "=", "{", "name", "=", "Thorak", "}"]
My thought
I thought about having different RegEx rules running over it but am a bit unsure how to do this properly.
Consider this rule array:
["^(\w+)", "^=", "^{", "^(\w+)", "^=", "^(\w+)", "^}"]
Given a RegEx rule that only matches words, ^(\w+), I would apply it to the string.
It should match with the thorak string and there I have my first token.
To have this work in a loop I might do the following:
Match a RegEx rule
Save the matching string in an array
Remove the matching string from the iterated string (to have RegEx rules running over the next parts)
Repeat until the String is empty OR no rule was able to be applied
This is my first time doing more involved work on strings, so I wonder what nifty tricks exist to make this easier.
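The loop described above can be sketched as follows. This is only an illustration under stated assumptions: the rule set is hypothetical (three general rules rather than the seven position-specific ones in the array), a rule with a capture group yields group 1 as the token so that the quotes around Thorak are dropped, and Tokenizer and tokenize are made-up names.

```scala
import scala.util.matching.Regex

object Tokenizer {
  // Hypothetical rule set. Each rule is tried at the start of the remaining input.
  val rules: List[Regex] = List(
    "\\w+".r,         // bare words such as thorak or name
    "\"(\\w+)\"".r,   // quoted word; group 1 is the word without quotes
    "[={}]".r         // single-character symbols
  )

  // Returns Right(tokens) when the whole input is consumed,
  // or Left(remaining input) when no rule applies.
  def tokenize(input: String): Either[String, List[String]] = {
    @annotation.tailrec
    def loop(rest: String, acc: List[String]): Either[String, List[String]] =
      if (rest.isEmpty) Right(acc.reverse)
      else rules.iterator.map(_.findPrefixMatchOf(rest)).collectFirst { case Some(m) => m } match {
        case Some(m) =>
          // Use the capture group if the rule has one, else the whole match.
          val token = if (m.groupCount >= 1) m.group(1) else m.matched
          // Remove the matched prefix and continue with the rest.
          loop(rest.substring(m.end), token :: acc)
        case None =>
          Left(rest)
      }
    loop(input, Nil)
  }
}
```

For the input thorak={name="Thorak"} this returns Right(List("thorak", "=", "{", "name", "=", "Thorak", "}")). Returning Left with the unconsumed remainder makes the "no rule was able to be applied" case visible to the caller.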

How to move part of the string after exact word to another field in logstash?

Let's imagine I have log file like the following:
My custom exception ST: java.lang.RuntimeException: Text of this dummy err.
My final goal is to put everything after ST: into a new field called ST, and to remove the ST: prefix itself.
I'm trying to use the pattern, but it doesn't work.
filter {
grok {
match => { "message" => "(?<newField>(?<=ST)(?s)(.*$))" }
}
}
Grok is based on Oniguruma regex library. To make . match any char with an Oniguruma regex, you need to pass (?m) inline modifier, not (?s) (as in PCRE and some other regex engines).
By placing the (?<=ST) positive lookbehind inside the named capturing group, you require ST to appear immediately before the current location, but what follows ST in your input is a colon and then a space. It makes sense to just move ST: out of the named group:
"(?m)ST: (?<newField>.*)"
^^^^^^^^
The ST: and space will get matched and consumed, newField group will hold the rest of string in it.
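The "consume the prefix, capture the rest" idea can also be tried outside Logstash. The sketch below uses Scala on top of java.util.regex, which is an assumption made purely for illustration: in that engine the flag that lets . match line breaks is (?s), unlike the (?m) needed for Oniguruma. StField and extract are hypothetical names.

```scala
object StField {
  // (?s) lets '.' match line terminators in java.util.regex.
  // unanchored allows the pattern to match anywhere in the message.
  private val AfterSt = """(?s)ST: (.*)""".r.unanchored

  // Hypothetical helper: everything after the first "ST: ", if present.
  def extract(message: String): Option[String] = message match {
    case AfterSt(rest) => Some(rest)
    case _             => None
  }
}
```

Applied to the sample log line, StField.extract returns Some("java.lang.RuntimeException: Text of this dummy err."), and a message without "ST: " returns None.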
You can use a specific regex like that:
^My custom exception ST: %{GREEDYDATA:ST}
Or a more generic one:
%{GREEDYDATA} \bST\b: %{GREEDYDATA:ST}
Always try to use the most specific regex you can.

scala using string interpolation for string replacement

scala 2.11.6
val fontColorMap = Map( "Good" -> "#FFA500", "Bad" -> "#0000FF")
val content = "Good or Bad?"
"(Bad|Good)".r.replaceFirstIn(content,s"""<font color="${fontColorMap("$1")}">$$1</font>""")
I want to replace the string using a regex. In this case
$$1 can fetch the matched string, but I don't know how to do the same inside ${}.
Also, I know that Scala will translate the interpolation
into something like this
new StringContext("""<font color=""",""">$$1</font>""").s(fontColorMap("$1"))
Thus it will fail.
But, is there any way I can handle this gracefully?
You can use the version of replaceAllIn that takes a function:
"(Bad|Good)".r.replaceAllIn(content, m =>
s"""<font color="${fontColorMap(m.matched)}">${m.matched}</font>"""
)
where m is of type scala.util.matching.Regex.Match.
There doesn't seem to be a version of replaceFirstIn that does the same thing though.
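If only the first match should be replaced, one way to get function-based replacement is to find the first match and splice the result in by hand. This is a minimal sketch; ReplaceFirst, replaceFirstWith and highlightFirst are hypothetical names, not methods of scala.util.matching.Regex:

```scala
import scala.util.matching.Regex

object ReplaceFirst {
  val fontColorMap = Map("Good" -> "#FFA500", "Bad" -> "#0000FF")

  // Hypothetical helper: replaces only the first match with f(match).
  // The result is spliced in directly, so '$' in it needs no escaping.
  def replaceFirstWith(regex: Regex, content: String)(f: Regex.Match => String): String =
    regex.findFirstMatchIn(content) match {
      case Some(m) => content.substring(0, m.start) + f(m) + content.substring(m.end)
      case None    => content // no match: return the input unchanged
    }

  // The question's example expressed with the helper above.
  def highlightFirst(content: String): String =
    replaceFirstWith("(Bad|Good)".r, content) { m =>
      s"""<font color="${fontColorMap(m.matched)}">${m.matched}</font>"""
    }
}
```

ReplaceFirst.highlightFirst("Good or Bad?") returns <font color="#FFA500">Good</font> or Bad?, leaving the second match untouched.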
This seems to be caused by evaluation order: Scala's StringContext interpolation is evaluated first, before the regex replacement substitutes group references such as $1, so fontColorMap is looked up with the literal key "$1". One option is to fetch the value first and only then do the replacement, like:
"(Bad|Good)".r.findFirstIn(content).map(key => {
val value = fontColorMap(key)
content.replaceFirst(key, s"""<font color="$value">$key</font>""")
}).get
> <font color="#FFA500">Good</font> or Bad?

String includes many substrings in ScalaTest Matchers

I need to check that one string contains many substrings. The following works
string should include ("seven")
string should include ("eight")
string should include ("nine")
but it takes three almost duplicated lines. I'm looking for something like
string should contain allOf ("seven", "eight", "nine")
However, this doesn't work: the assertion fails even though the string definitely contains these substrings.
How can I perform such an assertion in one line?
Try this:
string should (include("seven") and include("eight") and include("nine"))
You can always create a custom matcher:
it should "..." in {
"str asd dsa ddsd" should includeAllOf ("r as", "asd", "dd")
}
import org.scalatest.matchers.{MatchResult, Matcher}
def includeAllOf(expectedSubstrings: String*): Matcher[String] =
new Matcher[String] {
def apply(left: String): MatchResult =
MatchResult(expectedSubstrings forall left.contains,
s"""String "$left" did not include all of those substrings: ${expectedSubstrings.map(s => s""""$s"""").mkString(", ")}""",
s"""String "$left" contained all of those substrings: ${expectedSubstrings.map(s => s""""$s"""").mkString(", ")}""")
}
See http://www.scalatest.org/user_guide/using_matchers#usingCustomMatchers for more details.

Resources