Scala/Spark efficient partial string match - string

I am writing a small program in Spark using Scala and have run into a problem. I have a List/RDD of single-word strings and a List/RDD of sentences that may or may not contain words from the single-word list. For example:
val singles = Array("this", "is")
val sentence = Array("this Date", "is there something", "where are something", "this is a string")
I want to select the sentences that contain one or more of the words from singles, so that the result looks something like:
output[(this, Array(this Date, this is a String)),(is, Array(is there something, this is a string))]
I thought of two approaches: splitting each sentence and filtering with .contains, or splitting and formatting the sentences into an RDD and using .join for an RDD intersection. I have around 50 single words and 5 million sentences; which method would be faster? Are there any other solutions? Could you also help me with the code? Mine returns no results, although it compiles and runs without error.

You can create a set of required keys, look up the keys in sentences and group by keys.
val singles = Array("this", "is")
val sentences = Array("this Date",
                      "is there something",
                      "where are something",
                      "this is a string")
val rdd = sc.parallelize(sentences) // create RDD
val keys = singles.toSet            // words required as keys
val result = rdd.flatMap { sen =>
    val words = sen.split(" ").toSet
    val common = keys & words       // intersect
    common.map(x => (x, sen))       // map as key -> sen
  }
  .groupByKey.mapValues(_.toArray)  // group values for a key
  .collect                          // get rdd contents as array
// result:
// Array((this, Array(this Date, this is a string)),
//       (is, Array(is there something, this is a string)))
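To check the logic without a cluster, the same flatMap / intersect / group steps can be run on plain Scala collections. This is only a local sketch: `MatchSketch` and `matchSentences` are names made up for illustration, and splitting on whitespace is an assumption about how words are delimited.

```scala
object MatchSketch {
  // For each sentence, emit a (word, sentence) pair for every key it contains,
  // then group the pairs by word. Sentence order is preserved within each group.
  def matchSentences(singles: Seq[String], sentences: Seq[String]): Map[String, Seq[String]] = {
    val keys = singles.toSet
    sentences
      .flatMap { sen =>
        val words = sen.split("\\s+").toSet
        (keys & words).toSeq.map(k => (k, sen))
      }
      .groupBy(_._1)
      .map { case (k, pairs) => (k, pairs.map(_._2)) }
  }

  def main(args: Array[String]): Unit = {
    val singles = Seq("this", "is")
    val sentences = Seq("this Date", "is there something", "where are something", "this is a string")
    matchSentences(singles, sentences).foreach(println)
  }
}
```

In the Spark version the small key set travels to the executors inside the closure; for larger lookup sets, `sc.broadcast` is the usual alternative.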

I've just tried to solve your problem and I've ended up with this code:
// true if the word s occurs in the array of tokens l
def check(s: String, l: Array[String]): Boolean = l.contains(s)
val singles = sc.parallelize(Array("this", "is"))
val sentence = sc.parallelize(Array("this Date", "is there something", "where are something", "this is a string"))
val result = singles.cartesian(sentence)
  .filter(x => check(x._1, x._2.split(" ")))
  .groupByKey()
  .map(x => (x._1, x._2.mkString(", "))) // pay attention here (*)
result.foreach(println)
The last map line (*) is there only because without it I get output containing CompactBuffer, like this:
(is,CompactBuffer(is there something, this is a string))
(this,CompactBuffer(this Date, this is a string))
With that map line (which calls mkString) I get more readable output:
(is,is there something, this is a string)
(this,this Date, this is a string)
Hope it could help in some way.
FF

Related

How do I remove a substring/character from a string in Scala?

I am writing a program in which I need to filter a string. So I have a map of characters, and I want the string to filter out all characters that are not in the map. Is there a way for me to do this?
Let's say we have the string and map:
str = "ABCDABCDABCDABCDABCD"
Map('A' -> "A", 'D' -> "D")
Then I want the string to be filtered down to:
str = "BCBCBCBCBC"
Also, if I find a given substring in the string, is there a way I can replace that with a different substring?
So for example, if we have the string:
"The number ten is even"
Could we replace that with:
"The number 10 is even"
Filtering the string against the map takes a single filterNot call:
val str = "ABCDABCDABCDABCDABCD"
val m = Map('A' -> "A", 'D' -> "D")
str.filterNot(elem => m.contains(elem))
A more functional alternative, as recommended in the comments:
str.filterNot(m.contains)
Output
scala> str.filterNot(elem => m.contains(elem))
res3: String = BCBCBCBCBC
To replace elements in the String:
"The number ten is even".replace("ten", "10")
Output
scala> val s = "The number ten is even"
s: String = The number ten is even
scala> s.replace("ten", "10")
res4: String = The number 10 is even
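Both operations can be wrapped in small helpers, as sketched below. The names `FilterSketch`, `dropChars`, and `literalVsRegex` are hypothetical, and the sketch assumes a `Set[Char]` is enough when only the map's keys matter. It also shows that `replace` treats its first argument literally, whereas `replaceAll` interprets it as a regular expression.

```scala
object FilterSketch {
  // Drop every character that appears in `unwanted`.
  // A Set[Char] is itself a Char => Boolean, so it can be passed directly.
  def dropChars(str: String, unwanted: Set[Char]): String =
    str.filterNot(unwanted)

  // replace substitutes a literal substring; replaceAll takes a regex,
  // so "." there matches every character.
  def literalVsRegex(s: String): (String, String) =
    (s.replace(".", "-"), s.replaceAll(".", "-"))

  def main(args: Array[String]): Unit = {
    val m = Map('A' -> "A", 'D' -> "D")
    println(dropChars("ABCDABCDABCDABCDABCD", m.keySet))
    println(literalVsRegex("a.b.c"))
  }
}
```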

How to use match on String value with Scala?

I'm trying to iterate over a String value to change each occurrence of certain characters.
For example, I want "1" to become "one", "2" to become "two", etc.
I've done this:
override def toString = {
val mapXX = init.map(_.clone);
var returnVALUE = mapXX.map(_.mkString).mkString("\n")
for(c <- returnVALUE){
c match {
case 1 => "one";
case 2 => "two";
...
case _ => "";
}
}
returnVALUE
}
}
It didn't change anything: my list is displayed exactly as before.
Does anyone know how to iterate over each character of a String value in order to replace each character with something else?
Thanks
It's not completely clear what you're doing. Try
returnVALUE.map {
case '1' => "one"
case '2' => "two"
case '3' => "three"
// ...
case _ => " "
}.mkString
and this should be the last line of toString.
String#map accepts a function from Char to something (e.g. to String).
If returnVALUE is "1 2 3" then this produces "one two three".
When the last line is returnVALUE this means you return the original value of returnVALUE, not the modified value.
A for comprehension without the yield clause doesn't create any results. It can only be used for side effects, which good Scala programmers try to avoid.
Maybe something like this.
val numberNames = Map(0 -> "zero", 1 -> "one", 2 -> "two").withDefaultValue("too big")
val result = List(2,0,1,4).map(numberNames)
//result: List[String] = List(two, zero, one, too big)
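The point about yield can be illustrated with a short sketch that combines it with a lookup map. `DigitNames` and `spell` are illustrative names, and keeping unmapped characters unchanged is an assumption rather than something the question specifies.

```scala
object DigitNames {
  val names: Map[Char, String] = Map('1' -> "one", '2' -> "two", '3' -> "three")

  // for ... yield builds a new collection (here a sequence of Strings);
  // characters without an entry in `names` are kept as they are.
  def spell(s: String): String =
    (for (c <- s) yield names.getOrElse(c, c.toString)).mkString

  def main(args: Array[String]): Unit =
    println(spell("1 2 3"))
}
```

Because the result of the for expression is returned rather than discarded, the caller sees the modified text instead of the original string.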

How to get the count of first character in all the words into an RDD?

I'm new to Spark and I am trying to count how many words start with each letter.
I have the following input file.
sales file:
Liverpool,100,Red
Leads United,100,Blue
ManUnited,100,Red
Chelsea,300,Blue
I got the word count by doing the below steps.
val input = sc.textFile("salesfile")
val words = input.flatMap(word => word.split(","))
val wCount = words.map(words => (words,1))
val result = wCount.reduceByKey((x,y) => x+y)
result.collect().foreach(println)
I'm getting the word count with the above code.
But I can't work out the logic to take the first letter of each word into an RDD. Can anyone tell me how to do it?
Assuming you want to ignore numbers:
val words = input.flatMap(word => word.split(","))
// "Liverpool","100","Red","Leads United", etc. -- includes numbers
val wCount = words.filter(word => Character.isLetter(word.head)) // ignores numbers
.map(word => (word.head, 1)) // gets the first letter of each word
val result = wCount.reduceByKey((x, y) => x + y)
result.collect().foreach(println)
val words = input.flatMap(word => word.split(","))
//note: your words will be the Array("Liverpool","100","Red","Leads United",....)
//idk if that's what you're looking for, but that's the example that was provided
//words(0) gets the first char from each string
val lCount = words.map(word => (word(0), 1))
val result = lCount.reduceByKey((x, y) => x + y)
scala> result.collect().foreach(println)
(R,2)
(1,3)
(3,1)
(B,2)
(C,1)
(L,2)
(M,1)
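The counting logic can be tried locally on plain collections before running it on an RDD. The sketch below assumes the four sample lines from the question and skips fields that are empty or do not start with a letter; `FirstLetterCount` and `countFirstLetters` are made-up names.

```scala
object FirstLetterCount {
  // Split each line on commas, keep fields that start with a letter,
  // then count the fields per first character.
  def countFirstLetters(lines: Seq[String]): Map[Char, Int] =
    lines
      .flatMap(_.split(","))
      .filter(w => w.nonEmpty && Character.isLetter(w.head))
      .groupBy(_.head)
      .map { case (c, ws) => (c, ws.size) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("Liverpool,100,Red", "Leads United,100,Blue",
                    "ManUnited,100,Red", "Chelsea,300,Blue")
    countFirstLetters(lines).foreach(println)
  }
}
```

The `nonEmpty` check guards against `head` throwing on an empty field, which the RDD versions above would also need if the input can contain consecutive commas.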

Find the intersection of two strings in order with Scala

I'm trying to find the intersection of two strings in order with Scala. I'm pretty new to Scala, but I feel like this should be a one-liner. I've tried using both map and foldLeft, and have yet to attain the correct answer.
Given two strings, return a list of characters that are the same in order. For instance, "abcd", "acc" should return "a", and "abcd", "abc" should return "abc".
Here are the two functions I've tried so far
(str1 zip str2).map{ case(a, b) => if (a == b) a else ""}
and
(str1 zip str2).foldLeft(""){case(acc,n) => if (n._1 == n._2) acc+n._1.toString else ""}
What I want to do is something like this
(str1 zip str2).map{ case(a, b) => if (a == b) a else break}
but that doesn't work.
I know that I can do this with multiple lines and a for loop, but this feels like a one liner. Can anyone help?
Thanks
(str1 zip str2).takeWhile( pair => pair._1 == pair._2).map( _._1).mkString
Testing it out in the scala REPL:
scala> val str1 = "abcd"
str1: String = abcd
scala> val str2 = "abc"
str2: String = abc
scala> (str1 zip str2).takeWhile( pair => pair._1 == pair._2).map( _._1).mkString
res26: String = abc
Edited to pass both test cases
scala> (str1 zip "acc").takeWhile( pair => pair._1 == pair._2).map( _._1).mkString
res27: String = a
This is not at all efficient, but it is obvious:
def lcp(str1:String, str2:String) =
(str1.inits.toSet intersect str2.inits.toSet).maxBy(_.length)
lcp("abce", "abcd") //> res0: String = abc
lcp("abcd", "bcd") //> res1: String = ""
(take the longest of the intersection of all of the prefixes of string 1 with all of the prefixes of string 2)
Alternatively, to avoid zipping the entire strings:
(s1, s2).zipped.takeWhile(Function.tupled(_ == _)).unzip._1.mkString
Here it is:
scala> val (s1, s2) = ("abcd", "bcd")
s1: String = abcd
s2: String = bcd
scala> Iterator.iterate(s1)(_.init).find(s2.startsWith).get
res1: String = ""
scala> val (s1, s2) = ("abcd", "abc")
s1: String = abcd
s2: String = abc
scala> Iterator.iterate(s1)(_.init).find(s2.startsWith).get
res2: String = abc
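For comparison, here is an index-based sketch of the same common-prefix idea that avoids building intermediate pairs or prefix sets. `PrefixSketch` and `commonPrefix` are hypothetical names, not taken from any of the answers above.

```scala
object PrefixSketch {
  // Count the leading positions where both strings agree,
  // then take that many characters from the first string.
  def commonPrefix(a: String, b: String): String = {
    val limit = math.min(a.length, b.length)
    val n = (0 until limit).takeWhile(i => a(i) == b(i)).length
    a.take(n)
  }

  def main(args: Array[String]): Unit = {
    println(commonPrefix("abcd", "abc"))
    println(commonPrefix("abcd", "acc"))
  }
}
```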

Split String into alternating words (Scala)

I want to split a String into alternating words. There will always be an even number of words.
e.g.
val text = "this here is a test sentence"
should transform to some ordered collection type containing
"this", "is", "test"
and
"here", "a", "sentence"
I've come up with
val (l1, l2) = text.split(" ").zipWithIndex.partition(_._2 % 2 == 0) match {
case (a,b) => (a.map(_._1), b.map(_._1))}
which gives me the right results as two Arrays.
Can this be done more elegantly?
scala> val s = "this here is a test sentence"
s: java.lang.String = this here is a test sentence
scala> val List(l1, l2) = s.split(" ").grouped(2).toList.transpose
l1: List[java.lang.String] = List(this, is, test)
l2: List[java.lang.String] = List(here, a, sentence)
So, how about this:
scala> val text = "this here is a test sentence"
text: java.lang.String = this here is a test sentence
scala> val Reg = """\s*(\w+)\s*(\w+)""".r
Reg: scala.util.matching.Regex = \s*(\w+)\s*(\w+)
scala> (for(Reg(x,y) <- Reg.findAllIn(text)) yield(x,y)).toList.unzip
res8: (List[String], List[String]) = (List(this, is, test),List(here, a, sentence))
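Another possible variant, sketched here under the assumption that words are separated by single spaces, pairs the words with `grouped(2)` and then unzips the pairs. `AlternateWords` and `alternate` are illustrative names. Unlike `transpose`, which expects all groups to have the same size, the `collect` step simply drops a trailing unpaired word.

```scala
object AlternateWords {
  // Pair up consecutive words, then split the pairs into two lists.
  def alternate(text: String): (List[String], List[String]) =
    text.split(" ")
      .grouped(2)
      .collect { case Array(a, b) => (a, b) } // ignores a trailing single word
      .toList
      .unzip

  def main(args: Array[String]): Unit =
    println(alternate("this here is a test sentence"))
}
```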
