I have a Spark Streaming application that reads a Kafka stream and inserts the data into a database.
This is the code snippet:
eventDStream.foreachRDD { (rdd, time) =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // First example, which works: count successful inserts with an accumulator
  val accumulator = streamingContext.sparkContext.longAccumulator
  rdd.foreachPartition { records =>
    val count = Consumer.process(records)
    accumulator.add(count)
  }
  println(s"accumulated ${accumulator.value}")

  // Do the same but aggregate the counts with fold -- does not work
  val results = rdd.mapPartitions(records => Consumer.processIterator(records))
  val x = results.fold(0)(_ + _)
  println(x)

  eventDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
In the first part, I used foreachPartition with an accumulator to count the number of successful inserts.
In the second part, I computed an RDD[Int] that represents the number of successful inserts in each partition, and aggregated the result using the fold function.
But the second part always prints 0, while the first part does exactly what I want.
Can you explain why?
Thanks
Related
How can I write a Spark program that counts with an accumulator, initialized at 0 and incremented as the program reads a folder with 100 files?
val myaccumulator = sc.accumulator(0)
val inputRDD = sc.wholeTextFiles("/path/to/100Files")
inputRDD.foreach(f => myaccumulator + f.count)
<console>:29: error: value count is not a member of (String, String)
inputRDD.foreach(f => myaccumulator + f.count)
^
If you just want to count the lines in your files, you don't need anything fancy. This would do:
sc.textFile("path/to/dir/containing/the/files").count
If you absolutely want to use an accumulator, you can do it this way:
val myaccumulator = sc.accumulator(0)
sc.textFile("path/to/dir/containing/the/files").foreach(_ => myaccumulator += 1)
If you absolutely want to use wholeTextFiles (which puts the whole content of each file in a single String), any of the following would count the lines:
sc.wholeTextFiles("path/to/dir/containing/the/files")
  .map(_._2.split("\\n").size)
  .reduce(_ + _)
or with an accumulator
val myaccumulator = sc.accumulator(0)
sc.wholeTextFiles("path/to/dir/containing/the/files")
  .foreach(x => myaccumulator += x._2.split("\\n").size)
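As a side note, sc.accumulator is deprecated since Spark 2.0 in favor of the typed accumulator API. A minimal sketch of the same line count with longAccumulator (the path is a placeholder):

// Typed accumulator API, available since Spark 2.0
val lineCount = sc.longAccumulator("lines")
sc.wholeTextFiles("path/to/dir/containing/the/files")
  .foreach(x => lineCount.add(x._2.split("\\n").size))
println(lineCount.value)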
I know I can do random splitting with the randomSplit method:
val splittedData: Array[Dataset[Row]] =
  preparedData.randomSplit(Array(0.5, 0.3, 0.2))
Can I split the data into consecutive parts with some 'nonRandomSplit method'?
I'm using Apache Spark 2.0.1.
Thanks in advance.
UPD: Data order is important; I'm going to train my model on the data with 'smaller IDs' and test it on the data with 'larger IDs', so I want to split the data into consecutive parts without shuffling.
e.g.
my dataset = (0,1,2,3,4,5,6,7,8,9)
desired splitting = (0.8, 0.2)
splitting = (0,1,2,3,4,5,6,7), (8,9)
The only solution I can think of is to use count and limit, but there probably is a better one.
This is the solution I've implemented: Dataset -> RDD -> Dataset.
I'm not sure whether it is the most efficient way to do it, so I'll be glad to accept a better solution.
val count = allData.count()
val trainRatio = 0.6
val trainSize = math.round(count * trainRatio).toInt
val dataSchema = allData.schema

// Zipping with indices and skipping rows with indices >= trainSize.
// Could have possibly used .limit(n) here
val trainingRdd =
  allData
    .rdd
    .zipWithIndex()
    .filter { case (_, index) => index < trainSize }
    .map { case (row, _) => row }

// Can't use .limit() :(
val testRdd =
  allData
    .rdd
    .zipWithIndex()
    .filter { case (_, index) => index >= trainSize }
    .map { case (row, _) => row }

val training = MySession.createDataFrame(trainingRdd, dataSchema)
val test = MySession.createDataFrame(testRdd, dataSchema)
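A possible generalization (my own sketch, not part of the original answer): a hypothetical consecutiveSplit helper that splits a DataFrame into any number of consecutive parts by ratio, assuming the rows already have a stable order (e.g. sorted by ID). Boundary rows may shift by one due to rounding.

import org.apache.spark.sql.DataFrame

def consecutiveSplit(df: DataFrame, ratios: Array[Double]): Array[DataFrame] = {
  val total = df.count()
  val schema = df.schema
  val indexed = df.rdd.zipWithIndex().cache()
  // Cumulative boundaries: ratios (0.8, 0.2) over 10 rows -> bounds (0, 8, 10)
  val bounds = ratios.scanLeft(0L)((acc, r) => acc + math.round(total * r))
  bounds.sliding(2).map { case Array(lo, hi) =>
    val part = indexed
      .filter { case (_, index) => index >= lo && index < hi }
      .map { case (row, _) => row }
    df.sparkSession.createDataFrame(part, schema)
  }.toArray
}

// Usage, matching the example in the question:
// val Array(training, test) = consecutiveSplit(allData, Array(0.8, 0.2))

For the training half alone, allData.limit(trainSize) should also work, as the comment in the code above suggests.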
I have 2 inputs, where the first input is a stream (say input1) and the second one is a batch (say input2).
I want to figure out whether the keys in the first input match a single row or more than one row in the second input.
The further transformations/logic depend on the number of matching rows, i.e. whether a single row matches or multiple rows match (for at least one key in the first input):
if (single row matches) {
  // do something
} else {
  // do something else
}
This is the code I have tried so far:
val input1Pair = streamData.map(x => (x._1, x))
val input2Pair = input2.map(x => (x._1, x))
val joinData = input1Pair.transform { x => input2Pair.leftOuterJoin(x) }
val result = joinData.mapValues {
  case (v, Some(a)) => 1L
  case (v, None)    => 0L
}.reduceByKey(_ + _).filter(_._2 > 1)
When I do result.print, it prints nothing if all the keys match only one row in input2.
Given that a DStream may contain multiple RDDs, I'm not sure how to figure out whether the DStream is empty. If that is possible, then I can do an if check.
There's no function to determine if a DStream is empty, as a DStream represents a collection over time. From a conceptual perspective, an empty DStream would be a stream that never has data and that would not be very useful.
What can be done is to check whether a given microbatch has data or not:
dstream.foreachRDD { rdd => if (rdd.isEmpty) { ... } }
Please note that at any given point in time, there's only one RDD.
I think that the actual question is how to check the number of matches between the reference RDD and the data in the DStream. Probably the easiest way would be to intersect both collections and check the intersection size:
val intersectionDStream = streamData.transform { rdd => rdd.intersection(input2) }
intersectionDStream.foreachRDD { rdd =>
  if (rdd.count > 1) {
    // ...do stuff with the matches
  } else {
    // ...do otherwise
  }
}
We could also place the RDD-centric transformations within the foreachRDD operation:
streamData.foreachRDD { rdd =>
  val matches = rdd.intersection(input2)
  if (matches.count > 1) {
    // ...do stuff with the matches
  } else {
    // ...do otherwise
  }
}
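Note that intersection compares whole records, not just keys. If the goal is to branch on how many input2 rows each stream key matches, a join-based sketch per micro-batch could look like the following (it reuses the names from the question and assumes the stream keys are unique within a batch):

val input1Pair = streamData.map(x => (x._1, x))
val input2Pair = input2.map(x => (x._1, x))

input1Pair.foreachRDD { rdd =>
  // The join emits one record per (stream row, matching input2 row) pair,
  // so a key that matches two input2 rows contributes a count of 2.
  val matchCounts = rdd.join(input2Pair)
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)

  if (matchCounts.filter(_._2 > 1).isEmpty()) {
    // every matching key matched exactly one row in input2
  } else {
    // at least one key matched multiple rows
  }
}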
I have a large RDD which needs to be written to a single file on disk, one line per element, with the lines sorted in some defined order. So I was thinking of sorting the RDD, collecting one partition at a time in the driver, and appending to the output file.
A couple of questions:
After rdd.sortBy(), do I have the guarantee that partition 0 will contain the first elements of the sorted RDD, partition 1 the next elements, and so on? (I'm using the default partitioner.)
e.g.
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (p <- sortedRdd.partitions) {
  val index = p.index
  val partitionRdd = sortedRdd.mapPartitionsWithIndex {
    case (i, values) => if (i == index) values else Iterator()
  }
  val partition = partitionRdd.collect()
  partition.foreach { e =>
    // Append element e to file
  }
}
I understand that rdd.toLocalIterator is a more efficient way of fetching all partitions, one at a time. So same question: do I get the elements in the order given by .sortBy()?
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (e <- sortedRdd.toLocalIterator) {
  // Append element e to file
}
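For what it's worth, as far as I can tell sortBy range-partitions the data, so partition 0 holds the smallest elements, partition 1 the next ones, and so on, and toLocalIterator walks the partitions in index order. A minimal, self-contained sketch of the toLocalIterator variant (sample data and output path are placeholders):

import java.io.PrintWriter
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SortedExport").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6), numSlices = 3)
val sortedRdd = rdd.sortBy(identity)

// toLocalIterator pulls one partition at a time into the driver,
// so only a single partition needs to fit in driver memory.
val writer = new PrintWriter("/tmp/sorted-output.txt") // placeholder path
try {
  for (e <- sortedRdd.toLocalIterator) {
    writer.println(e)
  }
} finally {
  writer.close()
}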
I am using Spark Streaming to read data from a Kafka cluster. I want to sort a DStream of pairs and get only the top N. So far I have sorted it using:
val result = ds.reduceByKeyAndWindow((x: Double, y: Double) => x + y,
  Seconds(windowInterval), Seconds(batchInterval))
val sorted = result.transform(rdd => rdd.sortBy(_._2, false))
sorted.print()
My questions are:
How do I get only the top N elements from the DStream?
The transform operation is applied RDD by RDD, so will the result be sorted across the elements of all RDDs? If not, how can I achieve that?
You can use the transform method on the DStream, sort each RDD, take the top n elements into a list, and then filter the original RDD to keep only the elements contained in that list.
Note: Both RDD and DStream are immutable, so any transformation returns a new RDD or DStream and does not change the original one. Also, transform is applied per micro-batch, so this yields the top n within each batch RDD, not across the whole stream.
val n = 10
val topN = result.transform { rdd =>
  val list = rdd.sortBy(_._2, false).take(n)
  rdd.filter(list.contains)
}
topN.print()
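As an aside (not from the original answer), RDD.top can compute the same list without a full shuffle sort, keeping only n candidates per partition:

val topN = result.transform { rdd =>
  // top(n) with an ordering on the value gives the n largest pairs by value
  val list = rdd.top(n)(Ordering.by(_._2))
  rdd.filter(list.contains)
}
topN.print()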