word counting program not producing desired output in spark - apache-spark

I'm writing the code for word count in spark but it giving me the output as an array and some time the rdd after using the map:-
Array[(String, Int)] = Array((Welcome,1), (Programmings,1), (Spark,1), (in,1), (Saaransh,1))
I've already tried the code ->
val f = sc.textFile("/root/Desktop/BigData/ScalaProgram/WordCount.txt")
val fm = f.flatMap(x => x.split(" ")).map(y => (y,1)).reduceByKey((a, b) => a+b).collect
val i = f.flatMap(x => x.split(" "))
val j = i.map(y => (y,1)).reduceByKey((a, b)=> a+b)
I want the output as a single integer which represents the total number of words in a single file.

I find the wording a little confusing, but if this is the question:
I want the output as a singles integer which is a number of words in a
file.
then this is all you need:
val fileRDD = sc.textFile("/FileStore/tables/some.txt")
val count_words_in_single_file = fileRDD.flatMap(x => x.split(" ")).map(y => (y,1)).map(w => (w._2)).sum
This does it per a single file as input, if you have multiple files per input, then the solution differs again requiring sc.wholeTextFiles with name of file gotten, unless you want to count all words over all files.
You may want to consider datasets in future.

Related

How to find all two pairs of sets and elements in a collection using MapReduce in Spark?

I have a collection of sets, each set contains many items. I want to retrieve all pairs of sets and elements using Spark where each pair after reduce processing will contains two items and two sets
for example:
If I have this list of sets
Set A={1,2,3,4 }
Set B={1,2,4,5}
Set C= {2,3,5,6}
The map process will be:
(A,1)
(A,2)
(A,3)
(B,1)
(B,2)
(B,4)
(B,5)
(C,2)
(C,3)
(C,5)
(C,6)
The target result after reduce is:
(A B, 1 2) // since 1 2 exist in both A and B
(A B, 1 4)
(A B, 2 4)
(A C,2 3)
(B C,2 5)
here (A B,1 3) not in the result because 1 3 not exists in B
Could you help me to solve this problem in Spark in one map and one reduce functions in any language ( Python, Scala, or Java)?
Lets break this problem into multiple parts, I consider the transformation from input lists to map output trivial. So let us start from there,
That you have a list of (String, int) looking like
("A", 1)
("A", 2)
....
Lets forget you need 2 integer elements in result set first, and lets solve for getting intersection set between any 2 keys from the mapped output.
Result from your input would look like
(AB, Set(1,2,4))
(BC, Set(2,5))
(AC, Set(2,3))
To do this, first, extract all keys from your mapped output (mappedOutput) that is an RDD of (String, Int), convert to set, and get all combinations of 2 elements (I am using a stupid method here, a good way to do this that scales would be to use a combination generator)
val combinations = mappedOutput.map(x => x._1).collect.toSet
.subsets.filter(x => x.size == 2).toList
.map(x => x.mkString(""))
output would be List(ab,ac,bc), these combination codes will serve as keys to be joined.
convert mapped output to list of set key (a,b,c) => set of elements
val step1 = mappedOutput.groupByKey().map(x => (x._1, x._2.toSet))
Attach combination codes as key to step1
val step2 = step1.map(x => combinations.filter(y => y.contains(x._1)).map(y => (y, x))).flatMap(x => x)
output would be (ab, (a, set of elements in a)), (ac, (a, set of elements in a)) etc. Because of the filter, we will not attach combination code bc to set a.
Now obtain the result I want using a reduce
val result = step2.reduceByKey((a, b) => ("", a.intersect(b))).map(x => (x._1, x._2._2))
So we have the output I mentioned we want at the start now. What is left is to transform this result to what you need, which is very simple to do.
val transformed = result.map(x => x._2.subsets.filter(x => x.size == 2).map(y => (x._1, y.mkString(" ")))).flatMap(x => x)
end :)

Spark Accumulator value when the spark context is reading a folder with 100 files?

Spark program to count the accumulator value which is initialized at 0 and is going to be incremented by 1, when the program is reading a folder with 100 files?
val myaccumulator = sc.accumulator(0)
val inputRDD= sc.wholeTextFiles("/path/to/100Files")
inputRDD.foreach(f => myaccumulator + f.count)
<console>:29: error: value count is not a member of (String, String)
inputRDD.foreach(f => myaccumulator + f.count)
^
If you just want to count the lines in your files, you don't need anything fancy. This would do:
sc.textFile("path/to/dir/containing/the/files").count
If you absolutely want to use an accumulator, you can do it this way:
val myaccumulator = sc.accumulator(0)
sc.textFile("path/to/dir/containing/the/files").foreach(_ => myaccumulator += 1)
If you absolutely want to use wholeTextFile (which puts the whole content of each file in a single String), any of the following would count the lines:
sc.wholetextFiles("path/to/dir/containing/the/files")
.map(_._2.split("\\n").size)
.reduce(_+_)
or with an accumulator
val myaccumulator = sc.accumulator(0)
sc.wholeTextFiles
.foreach(x => myaccumulator += x._2.split("\\n").size)

How to do non-random Dataset splitting on Apache Spark?

I know I can do random splitting with randomSplit method:
val splittedData: Array[Dataset[Row]] =
preparedData.randomSplit(Array(0.5, 0.3, 0.2))
Can I split the data into consecutive parts with some 'nonRandomSplit method'?
Apache Spark 2.0.1.
Thanks in advance.
UPD: data order is important, I'm going to train my model on data with 'smaller IDs' and test it on data with 'larger IDs'. So I want to split data into consecutive parts without shuffling.
e.g.
my dataset = (0,1,2,3,4,5,6,7,8,9)
desired splitting = (0.8, 0.2)
splitting = (0,1,2,3,4,5,6,7), (8,9)
The only solution I can think of is to use count and limit, but there probably is a better one.
This is the solution I've implemented: Dataset -> Rdd -> Dataset.
I'm not sure whether it is the most effective way to do it, so I'll be glad to accept a better solution.
val count = allData.count()
val trainRatio = 0.6
val trainSize = math.round(count * trainRatio).toInt
val dataSchema = allData.schema
// Zipping with indices and skipping rows with indices > trainSize.
// Could have possibly used .limit(n) here
val trainingRdd =
allData
.rdd
.zipWithIndex()
.filter { case (_, index) => index < trainSize }
.map { case (row, _) => row }
// Can't use .limit() :(
val testRdd =
allData
.rdd
.zipWithIndex()
.filter { case (_, index) => index >= trainSize }
.map { case (row, _) => row }
val training = MySession.createDataFrame(trainingRdd, dataSchema)
val test = MySession.createDataFrame(testRdd, dataSchema)

Collect RDD to one node in sorted order

I have a large RDD which needs to be written to a single file on disk, one line for each element, the lines sorted in some defined order. So I was thinking of sorting the RDD, collect one partition at a time in the driver, and appending to the output file.
Couple of questions:
After rdd.sortBy(), do I have the guarantee that partition 0 will contain the first elements of the sorted RDD, partiton 1 will contain the next elements of the sorted RDD, and so on? (I'm using the default partitioner.)
e.g.
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (p <- sortedRdd.partitions) {
val index = p.index
val partitionRdd = sortedRdd mapPartitionsWithIndex { case (i, values) => if (i == index) values else Iterator() }
val partition = partitionRdd.collect()
partition foreach { e =>
// Append element e to file
}
}
I understand that rdd.toLocalIterator is a more efficient way of fetching all partitions, one at a time. So same question: do I get the elements in the order given by .sortBy()?
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (e <- sortedRdd.toLocalIterator) {
// Append element e to file
}

Document Count of a Word in Spark/Scala

I have a text variable which is an RDD of String in scala
val data = sc.parallelize(List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good."))
I have another variable in Scala Map(given below)
//list of words for which doc count needs to be found,initial doc count is 1
val dictionary = Map( """good""" -> 1,"""working""" -> 1,"""posting""" -> 1 ).
I want to do a document count of each of the dictionary terms and get the output in key value format
My output should be like below for the above data.
(good,2)
(working,1)
(posting,1)
What i have tried is
dictionary.map { case(k,v) => k -> k.r.findFirstIn(data.map(line => line.trim()).collect().mkString(",")).size}
I am getting counts as 1 for all the words.
Please help me in fixing the above line
Thanks in advance.
Why not use flatMap to create the dictionary and then you can query that.
val dictionary = data.flatMap {case line => line.split(" ")}.map {case word => (word, 1)}.reduceByKey(_+_)
If I collect this in the REPL I get the following result:
res9: Array[(String, Int)] = Array((here,1), (good.,1), (good,2), (here.,1), (You,1), (working,1), (today.You,1), (boy.Are,1), (are,2), (a,2), (posting,1), (i,1), (boy.,1), (also,1), (I,1), (am,2), (you,1))
Obviously you would need to do a better split than in my simple example.
First of all your dictionary should be a Set, because in general sense you need to map the Set of terms to the number of documents which contain them.
So your data should look like:
scala> val docs = List("i am a good boy.Are you a good boy.","You are also working here.","I am posting here today.You are good.")
docs: List[String] = List(i am a good boy.Are you a good boy., You are also working here., I am posting here today.You are good.)
Your dictionary should look like:
scala> val dictionary = Set("good", "working", "posting")
dictionary: scala.collection.immutable.Set[String] = Set(good, working, posting)
Then you have to implement your transformation, for the simplest logic of the contains function it might look like:
scala> dictionary.map(k => k -> docs.count(_.contains(k))) toMap
res4: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
For better solution I'd recommend you to implement specific function for your requirements
(String, String) => Boolean
to determine the presence of the term in the document:
scala> def foo(doc: String, term: String): Boolean = doc.contains(term)
foo: (doc: String, term: String)Boolean
Then final solution will look like:
scala> dictionary.map(k => k -> docs.count(d => foo(d, k))) toMap
res3: scala.collection.immutable.Map[String,Int] = Map(good -> 2, working -> 1, posting -> 1)
The last thing you have to do is to calculate the result map using SparkContext. First of all you have to define what data you want to have parallelised. Let's assume we want to parallelize the collection of the documents, then solution might be like following:
val docsRDD = sc.parallelize(List(
"i am a good boy.Are you a good boy.",
"You are also working here.",
"I am posting here today.You are good."
))
docsRDD.mapPartitions(_.map(doc => dictionary.collect {
case term if doc.contains(term) => term -> 1
})).map(_.toMap) reduce { case (m1, m2) => merge(m1, m2) }
def merge(m1: Map[String, Int], m2: Map[String, Int]) =
m1 ++ m2 map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) }

Resources