Scala split string to tuple


I would like to split a string on whitespace that has 4 elements:
1 1 4.57 0.83
and I am trying to convert the lines into a List[(String,String,Point)], such that the first two tokens are the first two elements of each tuple and the last two tokens form the Point. I am doing the following, but it doesn't seem to work:
Source.fromFile(filename).getLines.map(string => {
val split = string.split(" ")
(split(0), split(1), split(2))
}).map{t => List(t._1, t._2, t._3)}.toIterator

How about this:
scala> case class Point(x: Double, y: Double)
defined class Point
scala> s43.split("\\s+") match { case Array(i, j, x, y) => (i.toInt, j.toInt, Point(x.toDouble, y.toDouble)) }
res00: (Int, Int, Point) = (1,1,Point(4.57,0.83))

You could use pattern matching to extract what you need from the array:
case class Point(pts: Seq[Double])
val lines = List("1 1 4.34 2.34")
val coords = lines.map(_.split("\\s+")).collect {
  case Array(s1, s2, points @ _*) => (s1, s2, Point(points.map(_.toDouble)))
}

You are not converting the third and fourth tokens into a Point, nor are you converting the lines into a List. Also, you are not rendering each element as a Tuple3, but as a List.
The following should be more in line with what you are looking for.
case class Point(x: Double, y: Double) // Simple point class
Source.fromFile(filename).getLines.map(line => {
val tokens = line.split("""\s+""") // Use a regex to avoid empty tokens
(tokens(0), tokens(1), Point(tokens(2).toDouble, tokens(3).toDouble))
}).toList // Convert from an Iterator to List

case class Point(pts: Seq[Double])
val lines = "1 1 4.34 2.34"
val splitLines = lines.split("\\s+") match {
  case Array(s1, s2, points @ _*) => (s1, s2, Point(points.map(_.toDouble)))
}
And for the curious, the @ in pattern matching binds a variable to the pattern, so points @ _* binds the variable points to the pattern _*. And _* matches the rest of the array, so points ends up being a Seq[String].
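A minimal sketch of the @ _* binding described above, wrapped so that a line with fewer than two tokens yields None instead of a MatchError. The helper name parseLine is hypothetical and not part of the answers; non-numeric coordinates would still throw a NumberFormatException from toDouble.

```scala
// Point holds however many coordinates follow the first two tokens.
case class Point(pts: Seq[Double])

// parseLine is a hypothetical helper name used only for this illustration.
def parseLine(line: String): Option[(String, String, Point)] =
  line.trim.split("\\s+") match {
    // `points @ _*` binds the remaining tokens (possibly none) to `points`.
    case Array(s1, s2, points @ _*) =>
      Some((s1, s2, Point(points.map(_.toDouble))))
    // Fewer than two tokens: the first case does not match, so return None.
    case _ => None
  }

val parsed = parseLine("1 1 4.34 2.34")
```

Returning an Option is a design choice for the sketch; it lets callers use flatMap over the lines of a file and silently skip malformed ones.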

There are ways to convert a Tuple to a List or Seq. One way is:
scala> (1,2,3).productIterator.toList
res12: List[Any] = List(1, 2, 3)
But as you can see, the element type is Any and not Int.
To keep the element types, you can use HList from shapeless:
https://github.com/milessabin/shapeless

Related

scala - string parsing without Regex

I have various types of strings like the following:
sales_data_type
saledatatypes
sales_data.new.metric1
sales_data.type.other.metric2
sales_data.type3.metric3
I'm trying to parse them to get a substring with a word before and after the last dot. For example: new.metric1, other.metric2, type3.metric3. If a word doesn't contain dots, it has to be returned as is: sales_data_type, saledatatypes.
With a Regex it may be done this way:
val infoStr = "sales_data.type.other.metric2"
val pattern = ".*?([^.]+\\.)?([^.]+)$"
println(infoStr.replaceAll(pattern, "$1$2"))
// prints other.metric2
// for saledatatypes just prints nullsaledatatypes ??? but seems to work
I want to find a way to achieve this with Scala, without using Regex in order to expand my understanding of Scala features. Will be grateful for any ideas.

One-liner:
dataStr.split('.').takeRight(2).mkString(".")
takeRight(2) will take the last 2 if there are 2 to take, else it will take the last, and only, 1. mkString(".") will re-insert the dot only if there are 2 elements for the dot to go between, else it will leave the string unaltered.
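As a small check of the behaviour described above, the one-liner can be applied to the five inputs from the question. The function name lastTwoSegments is a hypothetical label for this sketch; strings with empty segments or trailing dots are not considered here.

```scala
// lastTwoSegments is a hypothetical name for the one-liner above.
def lastTwoSegments(s: String): String =
  s.split('.').takeRight(2).mkString(".")

// The sample inputs from the question.
val inputs = Vector(
  "sales_data_type",
  "saledatatypes",
  "sales_data.new.metric1",
  "sales_data.type.other.metric2",
  "sales_data.type3.metric3"
)

// Strings without a dot pass through unchanged; the others keep their last two parts.
val outputs = inputs.map(lastTwoSegments)
```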

Here's one with lots of scala features for you.
val string = "head.middle.last"
val split = string.split('.') // Array(head, middle, last)
val result = split.toSeq match {
case Seq(word) ⇒ word
case _ :+ before :+ after ⇒ s"$before.$after"
}
println(result) // middle.last
First we split the string on your . and get individual parts.
Then we pattern match those parts, first to check if there is only one (in which case we just return it), and second to grab the last two elements in the seq.
Finally we put a . back in between those last two using string interpolation.

One way of doing it:
val value = "sales_data.type.other.metric2"
val elems = value.split("\\.").toList
elems match {
case _:+beforeLast:+last => s"${beforeLast}.${last}"
case _ => throw new NoSuchElementException
}

for(s<-strs) yield {val s1 = s.split('.');
if(s1.size>=2)s1.takeRight(2).mkString(".") else s }
or
for(s<-strs) yield { val s1 = s.split('.');
if(s1.size>=2)s1.init.last+'.'+s1.last else s }
In Scala REPL:
scala> val strs = Vector("sales_data_type","saledatatypes","sales_data.new.metric1","sales_data.type.other.metric2","sales_data.type3.metric3")
strs: scala.collection.immutable.Vector[String] = Vector(sales_data_type, saledatatypes, sales_data.new.metric1, sales_data.type.other.metric2, sales_data.type3.metric3)
scala> for(s<-strs) yield { val s1 = s.split('.');if(s1.size>=2)s1.takeRight(2).mkString(".") else s }
res62: scala.collection.immutable.Vector[String] = Vector(sales_data_type, saledatatypes, new.metric1, other.metric2, type3.metric3)
scala> for(s<-strs) yield { val s1 = s.split('.');if(s1.size>=2)s1.init.last+'.'+s1.last else s }
res60: scala.collection.immutable.Vector[String] = Vector(sales_data_type, saledatatypes, new.metric1, other.metric2, type3.metric3)

Use Scala's match, like this:
def getFormattedStr(str:String):String={
str.contains(".") match{
case true=>{
val arr=str.split("\\.")
val len=arr.length
len match{
case 1=>str
case _=>arr(len-2)+"."+arr(len-1)
}
}
case _=>str
}
}

Scala String Similarity

I have Scala code that computes the similarity between the strings in a set and returns all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String], z.reverse)) {
  case ((acc, zt), zz) =>
    // Pair the (possibly extended) accumulator with the shrinking tail
    (if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc else zz :: acc, zt.tail)
}._1
I'll try to explain what is going on here:
This uses a fold over the reversed input data, starting from an empty list of Strings (to accumulate results) and the (reverse of the) remaining input data (to compare against; I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry).
If there is a match, just the existing accumulator (labelled acc) is passed through; otherwise, the current entry (zz) is added to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a shrinking set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is that if, say, the 1st, 4th and 8th strings are similar, I get only the 1st string. Instead, I should get the set (1st, 4th, 8th); then, if the 2nd, 5th, 14th and 21st strings are similar, I should get the set (2nd, 5th, 14th, 21st).

If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-Tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]],z.reverse,List.empty[String])) {
case ((acc, zt, scanned), zz) =>
val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
(if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
case other if similarity(s, other) < threshold => other
}.toSet ).values.toList
All of this assumes that the function:
def f(a: String, b: String): Boolean = similarity(a, b) < threshold
is symmetric (commutative) and transitive, i.e.:
f(a, b) && f(a, c) implies f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))
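For comparison, here is a sketch of a third, different technique (not one of the two options above): a greedy single pass that puts each string into the first existing group whose first member is similar to it. It reuses the test definitions of similarity and threshold from above; the name groupSimilar is hypothetical. If the relation is not transitive, the grouping depends on input order.

```scala
// Test similarity from above: strings are similar if they start with the same character.
def similarity(s1: String, s2: String): Int = if (s1.head == s2.head) 0 else 100
val threshold = 1

// groupSimilar is a hypothetical name; each group is represented by its first member.
def groupSimilar(items: List[String]): List[List[String]] =
  items.foldLeft(Vector.empty[List[String]]) { (groups, s) =>
    groups.indexWhere(g => similarity(g.head, s) < threshold) match {
      case -1 => groups :+ List(s)                      // no similar group yet: start one
      case i  => groups.updated(i, groups(i) :+ s)      // append to the matching group
    }
  }.toList

val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
val grouped = groupSimilar(z)
```

Each string is compared only against one representative per group, so the number of similarity calls is bounded by (number of strings) x (number of groups).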

Substitute substring within a string bidirectionally [duplicate]

Given a string M that contains terms A and B, I would like to substitute every A for B and every B for A to form M'. Naively one would try replacing A by B and then subsequently B by A, but in that case M' contains only A. I can think of replacing the terms and recording their positions so that the terms do not get replaced again. This works when we only have A and B to replace. But if we need to substitute more than 2 terms and they are of different lengths, it gets tricky.
So I thought about doing this:
We are given M as the input string and R = [(x1, y1), (x2, y2), ... (xn, yn)] as the terms to replace, where we replace xi by yi for all i.
With M, initialize L = [(M, false)] to be a list of (string * boolean) tuples, where false means that this string has not been replaced.
Search for occurrences of xi in each member L(i) of L whose second term is false. Partition L(i) into [(pre, false), (xi, false), (post, false)], and map it to [(pre, false), (yi, true), (post, false)], where pre and post are the strings before and after xi. Flatten L.
Repeat the above until R is exhausted.
Concatenate the first elements of the tuples of L to form M'.
Is there a more efficient way of doing this?
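The steps in the question can be sketched roughly as follows in Scala. This is one reading of the described procedure, not a tuned implementation; the names substituteAll and splitOn are hypothetical, and empty search terms are simply skipped.

```scala
// A segment is a piece of text plus a flag; true means it is already a replacement.
def substituteAll(m: String, r: List[(String, String)]): String = {
  // Split one unreplaced string around every occurrence of x, marking each y as done.
  def splitOn(s: String, x: String, y: String): List[(String, Boolean)] = {
    val i = s.indexOf(x)
    if (x.isEmpty || i < 0) List((s, false))
    else (s.substring(0, i), false) :: (y, true) :: splitOn(s.substring(i + x.length), x, y)
  }

  // Process the pairs of R in order; segments flagged true are never touched again.
  val segments = r.foldLeft(List((m, false))) { case (l, (x, y)) =>
    l.flatMap {
      case (s, false) => splitOn(s, x, y)
      case done       => List(done)
    }
  }
  // Concatenate the text of all segments to form M'.
  segments.map(_._1).mkString
}

val swapped = substituteAll("ab", List("a" -> "b", "b" -> "a"))
```

Because the pairs are applied one after another, earlier pairs in R take priority when two search terms overlap.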

Here's a regex solution:
var M = 'foobazbar123foo match';
var pairs = {
'foo': 'bar',
'bar': 'foo',
'baz': 'quz',
'no': 'match'
};
var re = new RegExp(Object.keys(pairs).join('|'),'g');
alert(M.replace(re, function(m) { return pairs[m]; }));
Note: This is a demonstration / POC. A real implementation would have to handle proper escaping of the lookup strings.
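A hedged Scala sketch of the same single-pass idea, with the escaping that the note above mentions. It uses Regex.quote for the lookup strings and Regex.quoteReplacement for the replacements; the name swapAll is hypothetical, keys are assumed to be non-empty, and sorting longer keys first is an added assumption so that a key is not shadowed by one of its own prefixes.

```scala
import scala.util.matching.Regex

// swapAll is a hypothetical name; pairs maps each search term to its replacement.
def swapAll(text: String, pairs: Map[String, String]): String =
  if (pairs.isEmpty) text
  else {
    // Quote every key so regex metacharacters are matched literally.
    val alternation = pairs.keys.toSeq
      .sortBy(k => -k.length)
      .map(k => Regex.quote(k))
      .mkString("|")
    val re = new Regex(alternation)
    // Each match is looked up once, so a replacement is never replaced again.
    re.replaceAllIn(text, m => Regex.quoteReplacement(pairs(m.matched)))
  }

val swappedText = swapAll(
  "foobazbar123foo match",
  Map("foo" -> "bar", "bar" -> "foo", "baz" -> "quz", "no" -> "match")
)
```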

Another approach is to replace strings by intermediate temporary strings (symbols) and then replace symbols by their original counterparts. So the transformation 'foo' => 'bar' can be transformed in two steps as, say, 'foo' => '___1' => 'bar'. The other transformation 'bar' ==> 'foo' will then become 'bar' ==> '___2' ==> 'foo'. This will prevent the mixup you describe.
Sample python code for the same example as the other answer follows:
import re
def substr(string):
    repdict = {"foo": "bar", "bar": "foo", "baz": "quz", "no": "match"}
    tmpdict = dict()
    count = 0
    for left, right in repdict.items():
        # Give each side of the pair its own temporary symbol
        tmpleft = "___" + str(count)
        tmpdict[tmpleft] = right
        count = count + 1
        tmpright = "___" + str(count)
        tmpdict[tmpright] = left
        count = count + 1
        string = re.sub(left, tmpleft, string)
        string = re.sub(right, tmpright, string)
    # Replace the temporary symbols by their final values
    for tmpleft, tmpright in tmpdict.items():
        string = re.sub(tmpleft, tmpright, string)
    print(string)
>>> substr("foobazbar123foo match")
barquzfoo123bar no

scala assigning string and array of values

I'm trying to assign a string followed by an array of scores.
I defined some categories
case class CategoryScore( //Define Category Score class
val food: Int,
val tech: Int,
val service: Int,
val fashion: Int)
and mapped them to some keys so that a String such as the name of a product would be followed by the case class of scores.
var keywordscores:Map[String, CategoryScore] = Map() //keyword scores
keywordscores += ("amazon",CategoryScore(1,9,1,4)) //Tried to add score for a string, does not work
Am I missing something here?

scala> keywordscores += ("amazon" -> CategoryScore(1,9,1,4))
or (note the extra parenthesis)
scala> keywordscores += (("amazon", CategoryScore(1,9,1,4)))
The reason for that is that + is defined as +(kvs: (A, B)*): Map[A, B], meaning it can take any number of (key,value) pairs, leading to += (k,v) being ambiguous.
The a -> b notation removes this ambiguity (and it's much nicer to read).
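A short sketch of both spellings on an immutable Map, written with + rather than += so that every line is a definition. The "ebay" entry and its scores are made-up illustration data, not from the question.

```scala
case class CategoryScore(food: Int, tech: Int, service: Int, fashion: Int)

val noScores = Map.empty[String, CategoryScore]

// The arrow builds the (key, value) pair, so one pair of parentheses is enough.
val withArrow = noScores + ("amazon" -> CategoryScore(1, 9, 1, 4))

// With a plain tuple, the inner parentheses make the pair explicit.
// ("ebay" and its scores are hypothetical sample data.)
val withTuple = withArrow + (("ebay", CategoryScore(2, 7, 3, 5)))

// updated replaces the value for an existing key and returns a new Map.
val rescored = withTuple.updated("amazon", CategoryScore(0, 10, 1, 4))
```

With a var holding an immutable Map, keywordscores += pair is shorthand for keywordscores = keywordscores + pair, so the same two spellings apply.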

Entries are added to a Map like this:
keywordscores += ("amazon" -> CategoryScore(1,9,1,4))

With a mutable Map you can also update/insert entries as follows:
val keywordscores: collection.mutable.Map[String, CategoryScore] = collection.mutable.Map()
keywordscores("amazon") = CategoryScore(1,9,1,4)
Here a new entry with key "amazon" is inserted; a subsequent call with the same key will update the value.

Detecting the index in a string that is not printable character with Scala

I have a method that detects the index in a string that is not printable as follows.
def isPrintable(v:Char) = v >= 0x20 && v <= 0x7E
val ba = List[Byte](33,33,0,0,0)
ba.zipWithIndex.filter { v => !isPrintable(v._1.toChar) } map {v => v._2}
> res115: List[Int] = List(2, 3, 4)
The first element of the result list is the index, but I wonder if there is a simpler way to do this.

If you want an Option[Int] of the first non-printable character (if one exists), you can do:
ba.zipWithIndex.collectFirst{
case (char, index) if (!isPrintable(char.toChar)) => index
}
> res4: Option[Int] = Some(2)
If you want all the indices like in your example, just use collect instead of collectFirst and you'll get back a List.
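Side by side, the two variants look like this; a minimal sketch that reuses isPrintable and ba from the question (the value names firstBad and allBad are only illustrative).

```scala
def isPrintable(v: Char): Boolean = v >= 0x20 && v <= 0x7E

val ba = List[Byte](33, 33, 0, 0, 0)

// collectFirst stops at the first pair that matches the guarded pattern.
val firstBad: Option[Int] = ba.zipWithIndex.collectFirst {
  case (b, i) if !isPrintable(b.toChar) => i
}

// collect keeps the index of every pair that matches.
val allBad: List[Int] = ba.zipWithIndex.collect {
  case (b, i) if !isPrintable(b.toChar) => i
}
```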

For getting only the first index that meets the given condition:
ba.indexWhere(v => !isPrintable(v.toChar))
(it returns -1 if nothing is found)

You can use a regular expression directly to find unprintable characters by Unicode general category (\p{C} covers control and other non-printing code points).
This way you can filter your string with such a pattern, for instance:
val text = "this is \n sparta\t\r\n!!!"
text.zipWithIndex.filter(_._1.toString.matches("\\p{C}")).map(_._2)
> res3: Vector(8, 16, 17, 18)
As a result you get a Vector with the indices of all unprintable characters in the String.

If only the first occurrence of a non-printable char is desired
The span method applied to a List delivers two sublists: the first contains the leading elements that satisfy a condition, the second starts with the first element that fails it. In this case consider,
val (l,r) = ba.span(b => isPrintable(b.toChar))
l: List(33, 33)
r: List(0, 0, 0)
To get the index of the first non printable char,
l.size
res: Int = 2
If all the occurrences of non-printable chars are desired
Consider partitioning a given List by a criterion. For instance, for
val ba2 = List[Byte](33,33,0,33,33)
val (l,r) = ba2.zipWithIndex.partition(b => isPrintable(b._1.toChar))
l: List((33,0), (33,1), (33,3), (33,4))
r: List((0,2))
where r includes tuples with non printable chars and their position in the original List.

I am not sure whether a list of indexes or tuples is needed, and I am not sure whether 'ba' needs to be a list of bytes or starts off as a string.
for { i <- 0 until ba.length if !isPrintable(ba(i).toChar) } yield i
And here is a tail-recursive version, because people need performance :)
def getNonPrintable(ba: List[Byte]): List[Int] = {
  import scala.annotation.tailrec
  import scala.collection.mutable.ListBuffer
  val buffer = ListBuffer[Int]()
  @tailrec
  def go(xs: List[Byte], cur: Int): ListBuffer[Int] = {
    xs match {
      case Nil => buffer
      case y :: ys => {
        // Record the index of each non-printable byte
        if (!isPrintable(y.toChar)) buffer += cur
        go(ys, cur + 1)
      }
    }
  }
  go(ba, 0)
  buffer.toList
}
