I know I can do random splitting with randomSplit method:
val splittedData: Array[Dataset[Row]] =
  preparedData.randomSplit(Array(0.5, 0.3, 0.2))
Can I split the data into consecutive parts with some 'nonRandomSplit' method?
I'm using Apache Spark 2.0.1.
Thanks in advance.
UPD: Data order is important; I'm going to train my model on the rows with 'smaller IDs' and test it on the rows with 'larger IDs', so I want to split the data into consecutive parts without shuffling.
e.g.
my dataset = (0,1,2,3,4,5,6,7,8,9)
desired splitting = (0.8, 0.2)
splitting = (0,1,2,3,4,5,6,7), (8,9)
The only solution I can think of is to use count and limit, but there probably is a better one.
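For what it's worth, the count/limit idea only covers the head of the data cleanly (a sketch, assuming preparedData preserves its order):

// Take the first 80% with limit; there is no built-in way to skip those rows
// afterwards, which is what makes the tail awkward.
val total = preparedData.count()
val trainSize = math.round(total * 0.8).toInt
val training = preparedData.limit(trainSize)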
This is the solution I've implemented: Dataset -> RDD -> Dataset.
I'm not sure whether it is the most effective way to do it, so I'll be glad to accept a better solution.
val count = allData.count()
val trainRatio = 0.6
val trainSize = math.round(count * trainRatio).toInt
val dataSchema = allData.schema

// Zip with indices and keep only rows with index < trainSize.
// Could have possibly used .limit(n) here.
val trainingRdd =
  allData
    .rdd
    .zipWithIndex()
    .filter { case (_, index) => index < trainSize }
    .map { case (row, _) => row }

// Can't use .limit() for the tail :( there is no way to skip the first
// trainSize rows, so filter on the index again.
val testRdd =
  allData
    .rdd
    .zipWithIndex()
    .filter { case (_, index) => index >= trainSize }
    .map { case (row, _) => row }

val training = MySession.createDataFrame(trainingRdd, dataSchema)
val test = MySession.createDataFrame(testRdd, dataSchema)
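A quick sanity check of the resulting split (just a sketch):

// The two parts should add up to the original row count,
// with training holding roughly trainRatio of the rows.
println(s"total=${allData.count()}, training=${training.count()}, test=${test.count()}")
assert(training.count() + test.count() == allData.count())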
Related
I have a Spark Streaming application that reads a Kafka stream and inserts the data into a database.
This is the code snippet:
eventDStream.foreachRDD { (rdd, time) =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // First example, which works: count successful inserts with an accumulator.
  val accumulator = streamingContext.sparkContext.longAccumulator
  rdd.foreachPartition { records =>
    val count = Consumer.process(records)
    accumulator.add(count)
  }
  println(s"accumulated ${accumulator.value}")

  // Do the same, but aggregate the counts with fold; this does not work.
  val results = rdd.mapPartitions(records => Consumer.processIterator(records))
  val x = results.fold(0)(_ + _)
  println(x)

  eventDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
In the first part, I used foreachPartition with an accumulator to count the number of successful inserts.
In the second part, I computed an RDD[Int] that represents the number of successful inserts in each partition and aggregated the result using the fold function.
But the second part always prints 0, while the first part does exactly what I want.
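For reference, this is the behaviour I expect from the fold, shown on a plain toy RDD (just a sketch, unrelated to the Kafka consumer):

// mapPartitions emits one count per partition; fold then sums them on the driver.
val counts = streamingContext.sparkContext
  .parallelize(1 to 10, numSlices = 4)
  .mapPartitions(records => Iterator(records.size))
println(counts.fold(0)(_ + _)) // 10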
Can you show me why?
Thanks
I have two different RDDs, apply a foreach on both of them, and notice a difference that I cannot resolve.
First one:
val data = Array(("CORN",6), ("WHEAT",3),("CORN",4),("SOYA",4),("CORN",1),("PALM",2),("BEANS",9),("MAIZE",8),("WHEAT",2),("PALM",10))
val rdd = sc.parallelize(data,3) // NOT sorted
rdd.foreach{ x => {
println (x)
}}
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[103] at parallelize at command-325897530726166:8
Works fine in this sense.
Second one:
rddX.foreach { x =>
  val prod = x(0)
  val vol = x(1)
  val prt = counter
  val cnt = counter * 100
  println(prt, cnt, prod, vol)
}
rddX: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[128] at rdd at command-686855653277634:51
Works fine.
Question: why can I not do val prod = x(0) on the first example, as in the second case? And how could I do that with foreach? Or would I always need to use map for the first case? Is this due to Row internals in the second example?
As you can see, the difference is in the datatypes.
The first one is RDD[(String, Int)].
This is an RDD of Tuple2 containing (String, Int), so you access the first value (a String) as val prod = x._1 and the second (an Int) as x._2.
Since it is a Tuple, you can't access it as val prod = x(0).
The second one is RDD[org.apache.spark.sql.Row], which can be accessed as
val prod = x.getString(0) or val prod = x(0)
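A small sketch contrasting the two access styles, reusing rdd and rddX from the question:

// Tuple2: access positionally with ._1/._2, or pattern-match on the pair.
rdd.foreach { case (prod, vol) =>
  println(s"$prod $vol")
}

// Row: access by index, optionally with a typed getter (assuming column 0 is a String).
rddX.foreach { x =>
  val prod = x.getString(0)
  val vol = x(1)
  println(s"$prod $vol")
}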
I hope this helped!
From the RDD below, I would like to create a pair RDD.
val line = sc.parallelize(Array("2,SMITH,AARON"))
I used the below code:
val pair = line.map(x => (x.split(",")(0).toInt, x))
The output generated is Array[(Int, String)] = Array((2,2,SMITH,AARON))
But I would like the desired output to be Array[(Int, String)] = Array((2,SMITH,AARON))
Please help me out.
I am a newbie.
Just take the rest:
val pair = line.map(x => x.split(",") match {
  case Array(id, rest @ _*) => (id.toInt, rest.mkString(","))
})
An easy way to do this is to split and take the pieces by position:
line.map { r =>
  val split = r.split(",")
  (split(0).toInt, split.tail.mkString(","))
}
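Either way, collecting the result (a quick check, with the mapped RDD bound to pair as in the first snippet) gives the desired output:

pair.collect().foreach(println)
// (2,SMITH,AARON)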
I have an RDD with n partitions and I would like to split it into k RDDs in such a way that
rdd = rdd_1.union(rdd_2).union(rdd_3)...union(rdd_k)
So for example, if n=10 and k=2, I would like to end up with 2 RDDs where rdd_1 is composed of 5 partitions and rdd_2 is composed of the other 5 partitions.
What is the most efficient way to do this in Spark?
You can try something like this:
val rdd: RDD[T] = ???
val k: Int = ???

val n = rdd.partitions.size

val rdds = (0 until n)                       // Create a Seq of partition indices
  .grouped(n / k)                            // group it into fixed-size buckets
  .map(idxs => (idxs.head, idxs.last))       // take the first and the last index of each bucket
  .map { case (min, max) =>
    rdd.mapPartitionsWithIndex(
      // If the partition index is in [min, max], keep its iterator,
      // otherwise return an empty one.
      (i, iter) => if (i >= min && i <= max) iter else Iterator()
    )
  }
If the input RDD has complex dependencies, you should cache it before applying this.
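A minimal usage sketch of the same idea, wrapped in a hypothetical helper (assumes sc is in scope and that k divides the partition count evenly):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper wrapping the snippet above.
def splitByPartitions[T: ClassTag](rdd: RDD[T], k: Int): Seq[RDD[T]] =
  (0 until rdd.partitions.size)
    .grouped(rdd.partitions.size / k)
    .map(idxs => (idxs.head, idxs.last))
    .map { case (min, max) =>
      rdd.mapPartitionsWithIndex((i, iter) => if (i >= min && i <= max) iter else Iterator())
    }
    .toSeq

// n = 10 partitions, k = 2 => two RDDs of 5 partitions each.
val rdd = sc.parallelize(1 to 100, numSlices = 10)
val pieces = splitByPartitions(rdd, 2)
assert(pieces.map(_.count()).sum == rdd.count()) // together the pieces cover the original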
I have a large RDD which needs to be written to a single file on disk, one line per element, with the lines sorted in some defined order. So I was thinking of sorting the RDD, collecting one partition at a time in the driver, and appending to the output file.
Couple of questions:
After rdd.sortBy(), do I have the guarantee that partition 0 will contain the first elements of the sorted RDD, partition 1 will contain the next elements, and so on? (I'm using the default partitioner.)
e.g.
val rdd = ???
val sortedRdd = rdd.sortBy(???)

for (p <- sortedRdd.partitions) {
  val index = p.index
  val partitionRdd = sortedRdd.mapPartitionsWithIndex {
    case (i, values) => if (i == index) values else Iterator()
  }
  val partition = partitionRdd.collect()
  partition.foreach { e =>
    // Append element e to file
  }
}
I understand that rdd.toLocalIterator is a more efficient way of fetching all partitions, one at a time. So same question: do I get the elements in the order given by .sortBy()?
val rdd = ???
val sortedRdd = rdd.sortBy(???)

for (e <- sortedRdd.toLocalIterator) {
  // Append element e to file
}