Converting Dataframe to RDD reduces partitions - apache-spark

In our code, the DataFrame was created as:
DataFrame DF = hiveContext.sql("select * from table_instance");
When I convert my DataFrame to an RDD and try to get its number of partitions with
RDD<Row> newRDD = Df.rdd();
System.out.println(newRDD.getNumPartitions());
It reduces the number of partitions to 1 (1 is printed in the console). Originally my DataFrame had 102 partitions.
UPDATE:
While reading, I repartitioned the DataFrame:
DataFrame DF = hiveContext.sql("select * from table_instance").repartition(200);
and then converted it to an RDD, which gave me 200 partitions.
Does JavaSparkContext have a role to play in this? When we convert a DataFrame to an RDD, is the default minimum partitions flag also considered at the SparkContext level?
UPDATE:
I made a separate sample program in which I read the exact same table into a DataFrame and converted it to an RDD. No extra stage was created for the RDD conversion, and the partition count was also correct. I am now wondering what I am doing differently in my main program.
Please let me know if my understanding is wrong here.

It basically depends on the implementation of hiveContext.sql(). Since I am new to Hive, my guess is that hiveContext.sql either doesn't know how to, or is not able to, split the data present in the table.
For example, when you read a text file from HDFS, the Spark context considers the number of HDFS blocks used by that file to determine the number of partitions.
What you did with repartition is the obvious solution for these kinds of problems. (Note: repartition may cause a shuffle operation if a proper partitioner is not used; a HashPartitioner is used by default.)
Coming to your doubt, hiveContext may consider the default minimum partitions property. But relying on the default property is not going to solve all your problems: for instance, if your Hive table's size increases, your program still uses the default number of partitions.
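As a quick sanity check (a minimal spark-shell/Scala sketch; the same calls exist on the Java API, and Spark 1.6+ is assumed for getNumPartitions), you can print the partition count straight after the read and again after an explicit repartition:

// Table name taken from the question above
val df = hiveContext.sql("select * from table_instance")
println(df.rdd.getNumPartitions)                  // driven by the table's input splits
println(df.repartition(200).rdd.getNumPartitions) // 200, at the cost of a shuffle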
Update: Avoid shuffle during repartition
Define your custom partitioner:
public class MyPartitioner extends HashPartitioner {

    private final int partitions;

    public MyPartitioner(int partitions) {
        super(partitions); // HashPartitioner has no no-args constructor
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return this.partitions;
    }

    @Override
    public int getPartition(Object key) {
        if (key instanceof String) {
            return super.getPartition(key);
        } else if (key instanceof Integer) {
            return (Integer.valueOf(key.toString()) % this.partitions);
        } else if (key instanceof Long) {
            return (int) (Long.valueOf(key.toString()) % this.partitions);
        }
        // TODO ... add more types; fall back to hash partitioning in the meantime
        return super.getPartition(key);
    }
}
Use your custom partitioner:
JavaPairRDD<Long, SparkDatoinDoc> pairRdd = hiveContext.sql("select * from table_instance")
    .toJavaRDD()
    .mapToPair(/* TODO ... expose the column as key */);
pairRdd = pairRdd.partitionBy(new MyPartitioner(200));
// ... rest of processing

Related

How to improve the performance of my Spark job loading data into a Cassandra table?

I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1 with Java 8, and Apache Cassandra 3.0.
I have my spark-submit / Spark cluster environment set up as below to load 2 billion records.
--executor-cores 3
--executor-memory 9g
--num-executors 5
--driver-cores 2
--driver-memory 4g
I am using Cassandra 6 node cluster with below settings :
cassandra.output.consistency.level=ANY
cassandra.concurrent.writes=1500
cassandra.output.batch.size.bytes=2056
cassandra.output.batch.grouping.key=partition
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
cassandra.connection.keep_alive_ms=30000
cassandra.read.timeout_ms=600000
I am loading data into Cassandra tables using a Spark DataFrame.
After reading into a Spark dataset, I am grouping by certain columns as below.
Dataset<Row> dataDf = //read data from source i.e. hdfs file which are already partitioned based "load_date", "fiscal_year" , "fiscal_quarter" , "id", "type","type_code"
Dataset<Row> groupedDf = dataDf.groupBy("id","type","value" ,"load_date","fiscal_year","fiscal_quarter" , "create_user_txt", "create_date")
groupedDf.write().format("org.apache.spark.sql.cassandra")
.option("table","product")
.option("keyspace", "dataload")
.mode(SaveMode.Append)
.save();
Cassandra table(
PRIMARY KEY (( id, type, value, item_code ), load_date)
) WITH CLUSTERING ORDER BY ( load_date DESC )
Basically I am grouping by the "id", "type", "value", "load_date" columns. As the other columns ("fiscal_year", "fiscal_quarter", "create_user_txt", "create_date") should also be available for storing into the Cassandra table, I have to include them in the groupBy clause as well.
1) Frankly speaking, I don't know how to get those columns after the groupBy into the resultant dataframe, i.e. groupedDf, to store. Any advice on how to tackle this, please?
2) With the above process/steps, my Spark job for loading is pretty slow due to a lot of shuffling, i.e. read shuffle and write shuffle.
What should I do here to improve the speed?
While reading from the source (into dataDf), do I need to do anything to improve performance? The source is already partitioned.
Do I still need to do any partitioning? If so, what is the best way/approach given the above Cassandra table?
HDFS file columns
"id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date","last_update_date","create_user_txt","update_user_txt"
Pivoting
I am using groupBy due to pivoting as below
Dataset<Row> pivot_model_vals_unpersist_df = model_vals_df
    .groupBy("id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date")
    .pivot("type_code")
    .agg(first(/* business logic */));
Please advise.
Your advice/feedback is highly appreciated.
So, as I gathered from the comments, your task is:
Take 2B rows from HDFS.
Save these rows into Cassandra with some conversion.
The schema of the Cassandra table is not the same as the schema of the HDFS dataset.
First of all, you definitely don't need group by. GROUP BY doesn't group columns; it groups rows, invoking some aggregate function like sum, avg, max, etc. The semantics are similar to SQL's GROUP BY, so it's not your case. What you really need is to make your "to save" dataset fit the desired Cassandra schema.
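For illustration, a minimal Scala sketch of that idea (hedged: it assumes the selected column names line up with the Cassandra table from the question; the Java route below goes through an RDD instead):

import org.apache.spark.sql.SaveMode

// Select exactly the columns the table expects -- no groupBy needed -- and append.
val toSave = dataDf.select("id", "type", "value", "item_code", "load_date",
  "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date")

toSave.write
  .format("org.apache.spark.sql.cassandra")
  .option("table", "product")
  .option("keyspace", "dataload")
  .mode(SaveMode.Append)
  .save()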
In Java this is a little bit trickier than in Scala. First, I suggest defining a bean that represents a Cassandra row.
public class MyClass {
    // Remember to declare a no-args constructor
    public MyClass() { }

    private Long id;
    private String type;
    // other fields, getters, setters, etc.
}
Your dataset is a Dataset&lt;Row&gt;; you need to convert it into a JavaRDD&lt;MyClass&gt;. So, you need a converter.
public class MyClassFabric {
    public static MyClass fromRow(Row row) {
        MyClass myClass = new MyClass();
        myClass.setId(row.getLong(row.fieldIndex("id"))); // Row accessors take a column index
        // ....
        return myClass;
    }
}
As a result, we would have something like this:
// requires: import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;
JavaRDD<MyClass> rdd = dataDf.toJavaRDD().map(MyClassFabric::fromRow);
javaFunctions(rdd).writerBuilder("keyspace", "table", mapToRow(MyClass.class))
    .saveToCassandra();
For additional info you can take a look at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md

Spark RDD does not get processed on multiple nodes

I have a use case where I create an RDD from a Hive table. I wrote business logic that operates on every row in the Hive table. My assumption was that when I create the RDD and run a map over it, it utilises all my Spark executors. But what I see in my logs is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Any clue where I went wrong?
As specified here by @jaceklaskowski:
By default, a partition is created for each HDFS partition, which by
default is 64MB (from Spark’s Programming Guide).
If your input data is less than 64MB (and you are using HDFS) then by default only one partition will be created.
Spark will use all nodes when the data is big enough to span multiple partitions.
Could there be a possibility that your data is skewed?
To rule out this possibility, do the following and rerun the code.
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
< do some computation on each row>
}
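To confirm whether skew (or a single-partition read) is what you are hitting, a minimal sketch like this prints the row count per partition (reusing flow from the original code, before the repartition):

val partitionSizes = flow.rdd
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()
partitionSizes.foreach { case (idx, n) => println(s"partition $idx -> $n rows") }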
Further, if your map logic depends on a particular column, you can do the below:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
< do some computation on each row>
}
A good partition column could be a date column.

Use of partitioners in Spark

Hi, I have a question about partitioning in Spark. In the Learning Spark book, the authors say that partitioning can be useful, for example during PageRank (page 66), and they write:
since links is a static dataset, we partition it at the start with
partitionBy(), so that it does not need to be shuffled across the
network
Now I'm focused on this example, but my questions are general:
Why does a partitioned RDD not need to be shuffled?
partitionBy() is a wide transformation, so it will produce a shuffle anyway, right?
Could someone illustrate a concrete example and what happens in each single node when partitionBy happens?
Thanks in advance
Why does a partitioned RDD not need to be shuffled?
When the author does:
val links = sc.objectFile[(String, Seq[String])]("links")
.partitionBy(new HashPartitioner(100))
.persist()
He's partitioning the data set into 100 partitions where each key will be hashed to a given partition (pageId in the given example). This means that the same key will be stored in a single given partition. Then, when he does the join:
val contributions = links.join(ranks)
All chunks of data with the same pageId should already be located on the same executor, avoiding the need for a shuffle between different nodes in the cluster.
partitionBy() is a wide transformation, so it will produce a shuffle
anyway, right?
Yes, partitionBy produces a ShuffledRDD[K, V, V]:
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  if (self.partitioner == Some(partitioner)) {
    self
  } else {
    new ShuffledRDD[K, V, V](self, partitioner)
  }
}
Could someone illustrate a concrete example and what happens in each
single node when partitionBy happens?
Basically, partitionBy will do the following:
It will hash the key modulo the number of partitions (100 in this case), and since it relies on the fact that the same key will always produce the same hash code, it will place all data for a given id (in our case, pageId) in the same partition, so that when you join, all the data is already available in that partition, avoiding the need for a shuffle.
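As a small, hedged illustration of that (toy data, not from the question), you can see exactly which partition each pageId lands in, and that a join with a co-partitioned RDD adds no shuffle:

import org.apache.spark.HashPartitioner

val links = sc.parallelize(Seq(
  ("pageA", Seq("pageB", "pageC")),
  ("pageB", Seq("pageA")),
  ("pageC", Seq("pageA", "pageB"))
)).partitionBy(new HashPartitioner(4))

// Print which partition each key was hashed into
links.mapPartitionsWithIndex { (idx, iter) =>
  iter.map { case (pageId, _) => s"$pageId -> partition $idx" }
}.collect().foreach(println)

// mapValues preserves the partitioner, so this join is shuffle-free:
val ranks = links.mapValues(_ => 1.0)
println(links.join(ranks).toDebugString)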

Demultiplexing RDD onto multiple ORC tables

I'm trying to convert data stored in S3 as JSON-per-line text files to a structured, columnar format like ORC or Parquet on S3.
The source files contain data of multiple schemes (e.g. HTTP request, HTTP response, ...), which need to be parsed into different Spark DataFrames of the correct type.
Example schemas:
val Request = StructType(Seq(
StructField("timestamp", TimestampType, nullable=false),
StructField("requestId", LongType),
StructField("requestMethod", StringType),
StructField("scheme", StringType),
StructField("host", StringType),
StructField("headers", MapType(StringType, StringType, valueContainsNull=false)),
StructField("path", StringType),
StructField("sessionId", StringType),
StructField("userAgent", StringType)
))
val Response = StructType(Seq(
StructField("timestamp", TimestampType, nullable=false),
StructField("requestId", LongType),
StructField("contentType", StringType),
StructField("contentLength", IntegerType),
StructField("statusCode", StringType),
StructField("headers", MapType(keyType=StringType, valueType=StringType, valueContainsNull=false)),
StructField("responseDuration", DoubleType),
StructField("sessionId", StringType)
))
I got that part working fine; however, writing the data back out to S3 as efficiently as possible seems to be an issue at the moment.
I tried 3 approaches:
muxPartitions from the silex project
caching the parsed S3 input and looping over it multiple times
making each scheme type a separate partition of the RDD
In the first case, the JVM ran out of memory and in the second one the machine ran out of disk space.
The third I haven't thoroughly tested yet, but it does not seem like an efficient use of processing power (as only one node of the cluster (the one holding this particular partition) would actually be writing the data back out to S3).
Relevant code:
val allSchemes = Schemes.all().keys.toArray
if (false) {
import com.realo.warehouse.multiplex.implicits._
val input = readRawFromS3(inputPrefix) // returns RDD[Row]
.flatMuxPartitions(allSchemes.length, data => {
val buffers = Vector.tabulate(allSchemes.length) { j => ArrayBuffer.empty[Row] }
data.foreach {
logItem => {
val schemeIndex = allSchemes.indexOf(logItem.logType)
if (schemeIndex > -1) {
buffers(schemeIndex).append(logItem.row)
}
}
}
buffers
})
allSchemes.zipWithIndex.foreach {
case (schemeName, index) =>
val rdd = input(index)
writeColumnarToS3(rdd, schemeName)
}
} else if (false) {
// Naive approach
val input = readRawFromS3(inputPrefix) // returns RDD[Row]
.persist(StorageLevel.MEMORY_AND_DISK)
allSchemes.foreach {
schemeName =>
val rdd = input
.filter(x => x.logType == schemeName)
.map(x => x.row)
writeColumnarToS3(rdd, schemeName)
}
input.unpersist()
} else {
class CustomPartitioner extends Partitioner {
override def numPartitions: Int = allSchemes.length
override def getPartition(key: Any): Int = allSchemes.indexOf(key.asInstanceOf[String])
}
val input = readRawFromS3(inputPrefix)
.map(x => (x.logType, x.row))
.partitionBy(new CustomPartitioner())
.map { case (logType, row) => row }
.persist(StorageLevel.MEMORY_AND_DISK)
allSchemes.zipWithIndex.foreach {
case (schemeName, index) =>
val rdd = input
.mapPartitionsWithIndex(
(i, iter) => if (i == index) iter else Iterator.empty,
preservesPartitioning = true
)
writeColumnarToS3(rdd, schemeName)
}
input.unpersist()
}
Conceptually, I think the code should have 1 output DStream per scheme type, and the input RDD should pick and place each processed item onto the correct DStream (with batching for better throughput).
Does anyone have any pointers as to how to implement this? And/or is there a better way of tackling this problem?
Given that the input is JSON, you can read it into a dataframe of strings (each line being a single string). Then you can extract the type from each JSON record (either by using a UDF or by using a function such as get_json_object or json_tuple).
Now you have two columns: the type and the original JSON. You can now use the dataframe writer's partitionBy option when writing the dataframe out. This results in a directory for each type, and the contents of each directory are the original JSON lines.
Now you can read each type with its own schema.
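A minimal sketch of that approach (hedged: the discriminator field name and S3 paths are placeholders, and a SparkSession named spark is assumed):

import org.apache.spark.sql.functions.{col, get_json_object}

// Each input line becomes one row with a single "value" column holding the raw JSON.
val raw = spark.read.text("s3://bucket/input/")
val typed = raw.withColumn("logType", get_json_object(col("value"), "$.logType"))

// One output directory per logType, each containing the original JSON lines.
typed.write.partitionBy("logType").text("s3://bucket/by-type/")

// Later, read one type back with its own schema, e.g.:
// spark.read.schema(Request).json("s3://bucket/by-type/logType=request/")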
You can also do a similar thing with an RDD, using a map which turns the input RDD into a pair RDD with the key being the type and the value being the JSON converted to the target schema. Then you can use partitionBy and mapPartitions to save each partition to a file, or you can use reduceByKey to write to different files (e.g. by using the key to set the filename).
You could also take a look at Write to multiple outputs by key Spark - one Spark job
Note that I assumed here that the goal is to split to files. Depending on your specific use case, other options might be viable. For example, if your different schemas are close enough, you can create a super schema which encompasses all of them and create the dataframe directly from that. Then you can either work on the dataframe directly or use the dataframe partitionBy to write the different subtypes to different directories (but this time already saved as Parquet).
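And a rough sketch of that super-schema variant, reusing the Request/Response definitions from the question (hedged: the de-duplication of shared fields and the input path are illustrative, with the same assumed spark session as above):

import org.apache.spark.sql.types.StructType

// Keep one StructField per name across both schemas.
val superSchema = StructType(
  (Request.fields ++ Response.fields).groupBy(_.name).values.map(_.head).toSeq)

val all = spark.read.schema(superSchema).json("s3://bucket/input/")
// Work on `all` directly, or write it out partitioned by a type column as above.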
This is what I came up with eventually:
I use a custom partitioner to partition the data based on their scheme plus the hashcode of the row.
The reasoning here is that we want to be able to only process certain partitions, yet still allow all nodes to participate (for performance reasons). So we don't spread the data over just 1 partition, but over X partitions (with X being the number of nodes times 2, in this example).
Then for each scheme, we prune the partitions we don't need and thus we will only process the ones we do.
Code example:
def process(date : ReadableInstant, schemesToProcess : Array[String]) = {
// Tweak this based on your use case
val DefaultNumberOfStoragePartitions = spark.sparkContext.defaultParallelism * 2
class CustomPartitioner extends Partitioner {
override def numPartitions: Int = schemesToProcess.length * DefaultNumberOfStoragePartitions
override def getPartition(key: Any): Int = {
// This is tightly coupled with how `input` gets transformed below
val (logType, rowHashCode) = key.asInstanceOf[(String, Int)]
(schemesToProcess.indexOf(logType) * DefaultNumberOfStoragePartitions) + Utils.nonNegativeMod(rowHashCode, DefaultNumberOfStoragePartitions)
}
/**
* Internal helper function to retrieve all partition indices for the given key
* @param key input key
* @return
*/
private def getPartitions(key: String): Seq[Int] = {
val index = schemesToProcess.indexOf(key) * DefaultNumberOfStoragePartitions
index until (index + DefaultNumberOfStoragePartitions)
}
/**
* Returns an RDD which only traverses the partitions for the given key
* @param rdd base RDD
* @param key input key
* @return
*/
def filterRDDForKey[T](rdd: RDD[T], key: String): RDD[T] = {
val partitions = getPartitions(key).toSet
PartitionPruningRDD.create(rdd, x => partitions.contains(x))
}
}
val partitioner = new CustomPartitioner()
val input = readRawFromS3(date)
.map(x => ((x.logType, x.row.hashCode), x.row))
.partitionBy(partitioner)
.persist(StorageLevel.MEMORY_AND_DISK_SER)
// Initial stage: caches the processed data + gets an enumeration of all schemes in this RDD
val schemesInRdd = input
.map(_._1._1)
.distinct()
.collect()
// Remaining stages: for each scheme, write it out to S3 as ORC
schemesInRdd.zipWithIndex.foreach {
case (schemeName, index) =>
val rdd = partitioner.filterRDDForKey(input, schemeName)
.map(_._2)
.coalesce(DefaultNumberOfStoragePartitions)
writeColumnarToS3(rdd, schemeName)
}
input.unpersist()
}

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an RDD[(String, String)] of (ID number, data row) pairs:
val mapRdd = rdd.map{ x=> (x.split("\\t+")(1), x)}
A way that works, but is not performant, is to collect the ID numbers, filter the RDD for each ID, and save the RDD of values with the same ID as a text file.
val ids = rdd.keys.distinct.collect
ids.foreach({ id =>
val dataRows = mapRdd.filter(_._1 == id).values
dataRows.saveAsTextFile(id)
})
I also tried groupByKey and reduceByKey so that each tuple in the RDD contains a unique ID number as the key and a string of the combined data rows, separated by new lines, for that ID number. I want to iterate through the RDD only once using foreach to save the data, but it can't give me the values as an RDD:
groupedRdd.foreach({ tup =>
val data = sc.parallelize(List(tup._2)) //nested RDD does not work
data.saveAsTextFile(tup._1)
})
Essentially, I want to split an RDD into multiple RDDs by an ID number and save the values for that ID number into their own location.
I think this problem is similar to
Write to multiple outputs by key Spark - one Spark job
Please refer to the answer there.
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateActualKey(key: Any, value: Any): Any =
NullWritable.get()
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
key.asInstanceOf[String]
}
object Split {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Split" + args(1))
val sc = new SparkContext(conf)
sc.textFile("input/path")
.map(a => (k, v)) // Your own implementation
.partitionBy(new HashPartitioner(num))
.saveAsHadoopFile("output/path", classOf[String], classOf[String],
classOf[RDDMultipleTextOutputFormat])
sc.stop()
}
}
Just saw a similar answer above, but actually we don't need customized partitioners. MultipleTextOutputFormat will create a file for each key. It is OK that multiple records with the same key fall into the same partition.
new HashPartitioner(num), where num is the number of partitions you want. In case you have a large number of different keys, you can set num to a large value. That way, each partition will not open too many HDFS file handles.
You can directly call saveAsTextFile on the grouped RDD; here it will save the data based on partitions. I mean, if you have 4 distinct IDs, and you specified the grouped RDD's number of partitions as 4, then Spark stores each partition's data in one file (so you can have only one file per ID). You can even see the data as iterables of each ID in the filesystem.
This will save the data per user ID
val mapRdd = rdd.map{ x=> (x.split("\\t+")(1),
x)}.groupByKey(numPartitions).saveAsObjectFile("file")
If you need to retrieve the data again based on user ID, you can do something like
val userIdLookupTable = sc.objectFile("file").cache() //could use persist() if the data is too big for memory
val data = userIdLookupTable.lookup(id) //note this returns a sequence, in this case you can just get the first one
Note that there is no particular reason to save to a file in this case; I just did it since the OP asked for it. That being said, saving to a file does allow you to load the RDD at any time after the initial grouping has been done.
One last thing: lookup is faster than a filter approach for accessing IDs, but if you're willing to go off a pull request from Spark, you can check out this answer for a faster approach.
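For reference, the two access patterns being compared (a small sketch reusing userIdLookupTable from above):

// lookup uses the RDD's partitioner (when present) to read only the relevant partition
val viaLookup = userIdLookupTable.lookup(id)

// filter scans every partition looking for matching keys
val viaFilter = userIdLookupTable.filter { case (k, _) => k == id }.values.collect()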
