Scala: How to count occurrences of unique items in a certain index?

I have a list that is formatted like the lists below:
List(List(21, Georgetown, Male),List(29, Medford, Male),List(18, Manchester, Male),List(27, Georgetown, Female))
And I need to count the occurrences of each unique town name, then return the town name and the number of times it was counted. But I only want to return the one town that had the most occurrences. So if I applied the function to the list above, I would get
(Georgetown, 2)
I'm coming from Java, so I know how to do this process in a longer way, but I want to utilize some of Scala's built-in methods.

scala> val towns = List(
| List(21, "Georgetown", "Male"),
| List(29, "Medford", "Male"),
| List(18, "Manchester", "Male"),
| List(27, "Georgetown", "Female"))
towns: List[List[Any]] = ...
scala> towns.map({ case List(a, b, c) => (b, c) }).groupBy(_._1).mapValues(_.length).maxBy(_._2)
res0: (Any, Int) = (Georgetown,2)

This is a pretty weird structure, but a way to do it would be with:
val items: List[List[Any]] = List(
  List(List(21, "Georgetown", "Male")),
  List(List(29, "Medford", "Male")),
  List(List(18, "Manchester", "Male")),
  List(List(27, "Georgetown", "Female"))).map(_.flatten)
val results = items.foldLeft(Map[String, Int]()) {
  (acc, item) =>
    val key = item(1).asInstanceOf[String]
    val count = acc.getOrElse(key, 0)
    acc + (key -> (count + 1))
}
println(results)
Which produces:
Map(Georgetown -> 2, Medford -> 1, Manchester -> 1)
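To get just the single most frequent town, as the question asks, you can take the maximum entry of that map; a small follow-up using the results value built above:
// Pick the entry with the highest count from the results map above.
val top = results.maxBy(_._2) // (Georgetown,2)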

Related

Can any one implement CombineByKey() instead of GroupByKey() in Spark in order to group elements?

I am trying to group the elements of an RDD that I have created. One simple but expensive way is to use groupByKey(), but recently I learned that combineByKey() can do this work more efficiently. My RDD is very simple. It looks like this:
(1,5)
(1,8)
(1,40)
(2,9)
(2,20)
(2,6)
val grouped_elements = first_RDD.groupByKey().mapValues(x => x.toList)
the result is:
(1,List(5,8,40))
(2,List(9,20,6))
I want to group them based on the first element (the key).
Can anyone help me do it with the combineByKey() function? I am really confused by combineByKey().
To begin with, take a look at the API docs:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
So it accepts three functions which I have defined below
scala> val createCombiner = (v:Int) => List(v)
createCombiner: Int => List[Int] = <function1>
scala> val mergeValue = (a:List[Int], b:Int) => a.::(b)
mergeValue: (List[Int], Int) => List[Int] = <function2>
scala> val mergeCombiners = (a:List[Int],b:List[Int]) => a.++(b)
mergeCombiners: (List[Int], List[Int]) => List[Int] = <function2>
Once you define these, you can use them in your combineByKey call as below:
scala> val list = List((1,5),(1,8),(1,40),(2,9),(2,20),(2,6))
list: List[(Int, Int)] = List((1,5), (1,8), (1,40), (2,9), (2,20), (2,6))
scala> val temp = sc.parallelize(list)
temp: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:30
scala> temp.combineByKey(createCombiner,mergeValue, mergeCombiners).collect
res27: Array[(Int, List[Int])] = Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))
Please note that I tried this out in the Spark shell, so you can see the outputs below the commands executed. They should help build your understanding.
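For reference, the same call can also be written with the three functions passed inline. This is just a sketch of the identical logic, using the temp RDD defined above; element order within each list may differ:
// Sketch: the same combineByKey call with the three functions inlined.
val grouped = temp.combineByKey(
  (v: Int) => List(v),                    // createCombiner: start a list for a key's first value
  (acc: List[Int], v: Int) => v :: acc,   // mergeValue: add a value within a partition
  (a: List[Int], b: List[Int]) => a ++ b  // mergeCombiners: concatenate lists across partitions
)
grouped.collect() // e.g. Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))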

Using Apache Spark to find frequent contiguous sequences

How can I use Apache Spark to find frequent contiguous sequences in a string?
Try taking the initial string, splitting it into unique substrings of different lengths, then broadcasting the initial string over them and filtering for matches. Something like this should work in a spark-shell:
val s = "AATTGTGTGTGTGATTTTTTAATG" //your string
val s_broadcast = sc.broadcast(s) //broadcast version
val A = 2 // min length of substring
val B = 3 // max length of substring
val C = 3 // min support
val L = s.size //length of the string
sc.parallelize(
for{
i <- A to B
j <- 0 to (L - i)
} yield (j,i+j)
) // generating paris of substrings
.map{case(j,i)=>s_broadcast.value.substring(j,i)}
.distinct // if optimization is needed, this step is a place to start
.filter(x=>s_broadcast.value.indexOf(x*C)>=0)
.collect
.map(_*C)
EDITED:
As an afterthought, here is code which will return the LONGEST substrings. The previous code has C fixed; this one tries the longest repetition.
val s = "AATTGTGTGTGTGTGATTTTTTAATG" //your string
val s_broadcast = sc.broadcast(s) //broadcast version
val A = 2 // min length of substring
val B = 3 // max length of substring
val C = 3 // min support
val L = s.size //length of the string
sc.parallelize(
for{
i <- A to B
j <- 0 to (L - i)
} yield (j,i+j)
) // generating paris of substrings
.map{case(j,i)=>s_broadcast.value.substring(j,i)}
.distinct // if optimization is needed, this step is a place to start
.flatMap(x=>
for{
v <- C to L/A
} yield x->v
) //making "AB"->3 pairs, which will result in search for "ABABAB"
.filter{case(x,v)=>s_broadcast.value.indexOf(x*v)>=0}
.groupByKey //grouping same substrings of different length
.map{case(k,v)=>k->v.max} //getting longer substring
.collect //bringing substring to the driver
.map{case(k,v)=>k*v}
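If it helps to see the core check in isolation, here is a minimal, Spark-free sketch of the repetition test used in the filter step above (variable names mirror the ones in the snippet):
// Does the candidate substring, repeated C times, occur contiguously in s?
val s = "AATTGTGTGTGTGATTTTTTAATG"
val candidate = "GT"
val C = 3
s.contains(candidate * C) // true, because "GTGTGT" occurs in s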

extract or filter MapType of Spark DataFrame

I have a DataFrame that contains various columns.
One column contains a Map[Integer,Integer[]].
It looks like { 2345 -> [1,34,2]; 543 -> [12,3,2,5]; 2 -> [3,4]}
Now what I need to do is filter out some keys.
I have a Set of Integers (javaIntSet) in Java with which I should filter such that
col(x).keySet.isin(javaIntSet)
i.e. the above map should only contain the keys 2 and 543 but not the other two, and should look like {543 -> [12,3,2,5]; 2 -> [3,4]} after filtering.
Documentation of how to use the Java Column class is sparse.
How do I extract the col(x) such that I can just filter it in Java and then replace the cell data with a filtered map? Or are there any useful functions of columns I am overlooking?
Can I write a UDF2<Map<Integer, Integer[]>, Set<Integer>, Map<Integer, Integer[]>>?
I can write a UDF1<String, String>, but I am not so sure how it works with more complex parameters.
Generally the javaIntSet has only a dozen and usually fewer than 100 values. The Map usually also has only a handful of entries (0-5).
I have to do this in Java (unfortunately), but I am familiar with Scala. A Scala answer that I can translate to Java myself would already be very helpful.
You don't need a UDF. Might be cleaner with one, but you could just as easily do it with DataFrame.explode:
case class MapTest(id: Int, map: Map[Int, Int])
val mapDf = Seq(
  MapTest(1, Map((1, 3), (2, 10), (3, 2))),
  MapTest(2, Map((1, 12), (2, 333), (3, 543)))
).toDF("id", "map")
mapDf.show
+---+--------------------+
| id| map|
+---+--------------------+
| 1|Map(1 -> 3, 2 -> ...|
| 2|Map(1 -> 12, 2 ->...|
+---+--------------------+
Then you can use explode:
mapDf.explode($"map") {
  case Row(map: Map[Int, Int] @unchecked) =>
    val newMap = map.filter(m => m._1 != 1) // <-- do filtering here
    Seq(Tuple1(newMap))
}.show
+---+--------------------+--------------------+
| id| map| _1|
+---+--------------------+--------------------+
| 1|Map(1 -> 3, 2 -> ...|Map(2 -> 10, 3 -> 2)|
| 2|Map(1 -> 12, 2 ->...|Map(2 -> 333, 3 -...|
+---+--------------------+--------------------+
If you did want to do the UDF, it would look like this:
val mapFilter = udf[Map[Int, Int], Map[Int, Int]](map => {
  val newMap = map.filter(m => m._1 != 1) // <-- do filtering here
  newMap
})
mapDf.withColumn("newMap", mapFilter($"map")).show
+---+--------------------+--------------------+
| id| map| newMap|
+---+--------------------+--------------------+
| 1|Map(1 -> 3, 2 -> ...|Map(2 -> 10, 3 -> 2)|
| 2|Map(1 -> 12, 2 ->...|Map(2 -> 333, 3 -...|
+---+--------------------+--------------------+
DataFrame.explode is a little more complicated, but ultimately more flexible. For example, you could divide the original row into two rows: one containing the map with the elements filtered out, the other containing the reverse, i.e. the elements that were filtered.
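If you do go the UDF route from Scala, another option is to close over the set of keys to keep instead of passing it in as a second column. This is only a rough sketch under stated assumptions: df, the "map" column name, and keepKeys stand in for the asker's DataFrame, column, and javaIntSet, and spark.implicits._ is assumed to be in scope as in the answer above.
// Hedged sketch: filter the map column by a key set captured in the UDF's closure.
import org.apache.spark.sql.functions.udf
val keepKeys: Set[Int] = Set(2, 543) // stands in for the javaIntSet from the question
val keepOnly = udf((m: Map[Int, Seq[Int]]) =>
  m.filter { case (k, _) => keepKeys.contains(k) })
df.withColumn("filteredMap", keepOnly($"map")).show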

Document Count of a Word in Spark/Scala

I have a text variable which is an RDD of String in Scala:
val data = sc.parallelize(List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good."))
I have another variable in Scala Map(given below)
// list of words for which the doc count needs to be found; initial doc count is 1
val dictionary = Map("good" -> 1, "working" -> 1, "posting" -> 1)
I want to do a document count of each of the dictionary terms and get the output in key value format
My output should be like below for the above data.
(good,2)
(working,1)
(posting,1)
What I have tried is:
dictionary.map { case(k,v) => k -> k.r.findFirstIn(data.map(line => line.trim()).collect().mkString(",")).size}
I am getting counts as 1 for all the words.
Please help me in fixing the above line
Thanks in advance.
Why not use flatMap to create the dictionary and then query that?
val dictionary = data.flatMap {case line => line.split(" ")}.map {case word => (word, 1)}.reduceByKey(_+_)
If I collect this in the REPL I get the following result:
res9: Array[(String, Int)] = Array((here,1), (good.,1), (good,2), (here.,1), (You,1), (working,1), (today.You,1), (boy.Are,1), (are,2), (a,2), (posting,1), (i,1), (boy.,1), (also,1), (I,1), (am,2), (you,1))
Obviously you would need to do a better split than in my simple example.
First of all, your dictionary should be a Set, because in a general sense you need to map the set of terms to the number of documents which contain them.
So your data should look like:
scala> val docs = List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good.")
docs: List[String] = List(i am a good boy.Are you a good boy., You are also working here., I am posting here today.You are good.)
Your dictionary should look like:
scala> val dictionary = Set("good", "working", "posting")
dictionary: scala.collection.immutable.Set[String] = Set(good, working, posting)
Then you have to implement your transformation; for the simplest logic, using the contains function, it might look like:
scala> dictionary.map(k => k -> docs.count(_.contains(k))) toMap
res4: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
For a better solution, I'd recommend implementing a specific function for your requirements
(String, String) => Boolean
to determine the presence of the term in the document:
scala> def foo(doc: String, term: String): Boolean = doc.contains(term)
foo: (doc: String, term: String)Boolean
Then final solution will look like:
scala> dictionary.map(k => k -> docs.count(d => foo(d, k))) toMap
res3: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
The last thing you have to do is calculate the result map using SparkContext. First you have to define what data you want to parallelize. Let's assume we want to parallelize the collection of documents; then the solution might look like the following:
def merge(m1: Map[String, Int], m2: Map[String, Int]) =
  m1 ++ m2.map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) }

val docsRDD = sc.parallelize(List(
  "i am a good boy.Are you a good boy.",
  "You are also working here.",
  "I am posting here today.You are good."
))
docsRDD.mapPartitions(_.map(doc => dictionary.collect {
  case term if doc.contains(term) => term -> 1
})).map(_.toMap).reduce { case (m1, m2) => merge(m1, m2) }
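Under the same assumptions (the docsRDD and dictionary Set above), a shorter sketch pushes the per-document membership check into flatMap and lets reduceByKey do the merging instead of combining maps by hand:
// Emit each dictionary term at most once per document, then sum per term.
val docCounts = docsRDD
  .flatMap(doc => dictionary.filter(term => doc.contains(term)))
  .map(term => term -> 1)
  .reduceByKey(_ + _)
docCounts.collect() // e.g. Array((good,2), (working,1), (posting,1))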

Scala - modify strings in a list based on their number of occurrences

Another Scala newbie question since I am not getting how to achieve this in a functional way (mostly coming from a scripting language background):
I have a list of strings:
val foodList = List("banana-name", "orange-name", "orange-num", "orange-name", "orange-num", "grape-name")
and where they are duplicated, I'd like to add an incrementing number into the string and get that in a list similar to the input list, like so:
List("banana-name", "orange1-name", "orange1-num", "orange2-name", "orange2-num", "grape-name")
I've grouped them up to get counts for them with:
val freqs = foodList.groupBy(identity).mapValues(v => List.range(1, v.length + 1))
Which gives me:
Map(orange-num -> List(1, 2), banana-name -> List(1), grape-name -> List(1), orange-name -> List(1, 2))
The order of the list is important (it should be in the original order of foodList), so I know it's problematic for me to use a Map at this point. The closest I feel I have gotten to a solution is:
foodList.map { l =>
  if (freqs(l).length > 1) {
    freqs(l).map(n =>
      l.split("-")(0) + n.toString + "-" + l.split("-")(1))
  } else {
    l
  }
}
This of course gives me a wonky output, since I am mapping over the whole list of frequencies stored for each word in freqs:
List(banana-name, List(orange1-name, orange2-name), List(orange1-num, orange2-num), List(orange1-name, orange2-name), List(orange1-num, orange2-num), grape-name)
How is this done in a Scala fp way without resorting to clumsy for loops and counters?
If the indices are important, sometimes it's best to keep track of them explicitly using zipWithIndex (very similar to Python's enumerate):
foodList.zipWithIndex.groupBy(_._1).values.toList.flatMap {
  // if only one entry in this group, don't change the values
  // x is actually a tuple; could write case (str, idx) :: Nil => (str, idx) :: Nil
  case x :: Nil => x :: Nil
  // case where there are duplicate strings
  case xs => xs.zipWithIndex.map {
    // idx is the index in the original list, n is the index in the new list, i.e. the count
    case ((str, idx), n) =>
      // destructuring assignment, like Python's (fruit, suffix) = ...
      val Array(fruit, suffix) = str.split("-")
      // string interpolation, returning a tuple
      (s"$fruit${n + 1}-$suffix", idx)
  }
  // We now have our list of (string, index) pairs;
  // sort them and map to a list of just strings
}.sortBy(_._2).map(_._1)
Efficient and simple:
val food = List("banana-name", "orange-name", "orange-num",
  "orange-name", "orange-num", "grape-name")

def replaceName(s: String, n: Int) = {
  val tokens = s.split("-")
  tokens(0) + n + "-" + tokens(1)
}

val indicesMap = scala.collection.mutable.HashMap.empty[String, Int]
val res = food.map { name =>
  val n = indicesMap.getOrElse(name, 1)
  indicesMap += (name -> (n + 1))
  replaceName(name, n)
}
Here is an attempt to provide what you expected with foldLeft:
foodList
  .foldLeft((List[String](), Map[String, Int]())) { // initial value: (output list, seen map)
    (a /* accumulator: (list, map) */, v /* value from the list */) =>
      if (a._2.isDefinedAt(v)) // already seen
        (s"$v+${a._2(v)}" :: a._1, a._2.updated(v, a._2(v) + 1))
      else
        (v :: a._1, a._2.updated(v, 1))
  }
  ._1      // select the list
  .reverse // because we built it in the opposite order
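For the exact output shown in the question (numbering only the duplicated names while keeping the original order), here is a hedged sketch that combines the counting ideas above into a single ordered pass; counts is precomputed so unique names are left untouched:
val counts = foodList.groupBy(identity).mapValues(_.size)
val renamed = foodList
  .foldLeft((List.empty[String], Map.empty[String, Int])) { case ((out, seen), name) =>
    val n = seen.getOrElse(name, 0) + 1 // how many times we've seen this name so far
    val next =
      if (counts(name) > 1) {           // only duplicated names get a number
        val Array(fruit, suffix) = name.split("-")
        s"$fruit$n-$suffix"
      } else name
    (next :: out, seen.updated(name, n))
  }
  ._1
  .reverse
// renamed: List(banana-name, orange1-name, orange1-num, orange2-name, orange2-num, grape-name)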
