Spark mapPartitions Issue - apache-spark

I am using Spark's mapPartitions on my DataFrame, and the use case is that I should submit one job (either calling a Lambda or sending an SQS message) for each partition.
I am partitioning on a custom formatted date column, and logging the number of partitions before and after shows that the partitioning works as expected.
However, when I look at the total number of jobs submitted, it is more than the number of partitions. For some of the partitions there are two or three jobs!
Here is the code I am using:
val yearMonthQueryRDD = yearMonthQueryDF.rdd.mapPartitions(
  partition => {
    val partitionObjectList = new java.util.ArrayList[String]()
    logger.info("partitionIndex = {}", TaskContext.getPartitionId())
    val partitionCounter: AtomicLong = new AtomicLong(0)
    val partitionSize: AtomicLong = new AtomicLong(0)
    val paritionColumnName: AtomicReference[String] = new AtomicReference[String]()
    // Iterate the objects in a given partition
    val updatedPartition = partition.map(record => {
      import yearMonthQueryDF.sparkSession.implicits._
      partitionCounter.set(partitionCounter.get() + 1)
      val recordSizeInt = Integer.parseInt(record.getAs("object_size"))
      val recordSize: Long = recordSizeInt.toLong
      partitionObjectList.add(record.getAs("object_key"))
      paritionColumnName.set(record.getAs("partition_column_name"))
      record
    }).toList
    logger_ref.info("No.of Elements in Partition [" + paritionColumnName.get() + "] are =[" + partitionCounter.get() + "] Total Size=[" + partitionSize.get() + "]")
    // Submit a Job for the partition
    // jobUtil.submitJob(paritionColumnName.get(), partitionObjectList, partitionSize.get())
    updatedPartition.toIterator
  }
)
Another thing that is making debugging harder is that the logging statements inside mapPartitions() are not found in the container error logs. Since they are executed on each worker node, not on the master node, I expected to find them in the container logs rather than in the master node logs. I still need to figure out why I am only seeing stderr logs but not stdout logs on the containers.
Thanks
Sateesh

Related

How to use spark to write to HBase using multi-thread

I'm using Spark to write data to HBase, but at the writing stage only one executor and one core are executing.
I wonder why my code is not writing in parallel, and what I should do to make it write faster.
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)

def hbaseWrite(ss: SparkSession, tableList: List[String], df: DataFrame): Unit = {
  val tableName = tableList(0)
  val rowKeyName = tableList(4)
  val rowKeyType = tableList(5)
  hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
  // Write to HBase
  val sc = ss.sparkContext
  sc.hadoopConfiguration.addResource(hbaseConf)
  val columns = df.columns
  val result = df.rdd.mapPartitions(par => {
    par.map(row => {
      var rowkey: String = ""
      if ("String".equals(rowKeyType)) {
        rowkey = row.getAs[String](rowKeyName)
      } else if ("Long".equals(rowKeyType)) {
        rowkey = row.getAs[Long](rowKeyName).toString
      }
      val put = new Put(Bytes.toBytes(rowkey))
      for (name <- columns) {
        var value = row.get(row.fieldIndex(name))
        if (value != null) {
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(name), Bytes.toBytes(value.toString))
        }
      }
      (new ImmutableBytesWritable, put)
    })
  })
  val job = Job.getInstance(sc.hadoopConfiguration)
  job.setOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setOutputValueClass(classOf[Result])
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
You may not be able to control how many parallel executors write to HBase from a single job, but you can start multiple Spark jobs from a multi-threaded client program.
For example, you can have a shell script which triggers multiple spark-submit commands to induce parallelism. Each Spark job works on one set of data, independent of the others, and pushes it into HBase.
This can also be done with the Spark Java/Scala SparkLauncher API combined with the Java concurrency API (e.g. the Executor framework), as sketched after the snippet below.
val sparkLauncher = new SparkLauncher
// Set Spark properties. Only basic ones are shown here; they will be overridden if properties are set in the main class.
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("MasterType like yarn or local[*]")
  .setDeployMode("set deploy mode like cluster")
  .setConf("spark.executor.cores", "2")

// Launch the Spark application
val sparkLauncher1 = sparkLauncher.startApplication()

// Get the application id
val jobAppId = sparkLauncher1.getAppId

// Get the status of the launched job. This loop will continually print statuses like SUBMITTED, RUNNING, etc.
while (true) {
  println(sparkLauncher1.getState().toString)
}
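Here is a minimal sketch of the Executor-framework variant mentioned above. The jar path, main class, master settings and the idea of passing each launched job its own data slice as an application argument are placeholders for illustration, not part of the original answer:

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object ParallelHBaseLoads {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(3)
    // Hypothetical independent slices of the input; each launched job handles one slice.
    val dataSets = Seq("slice1", "slice2", "slice3")

    dataSets.foreach { slice =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          val handle: SparkAppHandle = new SparkLauncher()
            .setAppResource("/path/to/jar/to/be/executed") // placeholder
            .setMainClass("MainClassName")                 // placeholder
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setConf("spark.executor.cores", "2")
            .addAppArgs(slice)                             // tell the job which slice it owns
            .startApplication()

          // Poll until the application reaches a terminal state instead of looping forever.
          while (!handle.getState.isFinal) {
            TimeUnit.SECONDS.sleep(10)
          }
          println(s"$slice finished with state ${handle.getState}")
        }
      })
    }

    pool.shutdown()
    pool.awaitTermination(2, TimeUnit.HOURS)
  }
}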
However, the challenge is to track each of these jobs for failures and automatic recovery. This can be tricky, especially when partial data has already been written into HBase, i.e. a job fails to process the complete set of data assigned to it. You may have to clean that data out of HBase automatically before retriggering the job.

Spark RDD does not get processed on multiple nodes

I have a use case wherein I create an RDD from a Hive table. I wrote business logic that operates on every row in the Hive table. My assumption was that when I create the RDD and run a map over it, it utilises all my Spark executors. But what I see in my logs is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code:
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Any clue where I went wrong?
As specified here by @jaceklaskowski:
By default, a partition is created for each HDFS partition, which by
default is 64MB (from Spark’s Programming Guide).
If your input data is less than 64MB (and you are using HDFS) then by default only one partition will be created.
Spark will use all nodes when the input data is large enough to produce multiple partitions.
Could there be a possibility that your data is skewed?
To rule out this possibility, do the following and rerun the code.
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Further, if your map logic depends on a particular column, you can do the following:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
< do some computation on each row>
}
A good partition column could be a date column.
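To confirm whether you are in the single-partition case, you can print the partition count before and after repartitioning. A minimal sketch, reusing the table name from the question:

val flow = hiveContext.sql("select * from humsdb.t_flow")
// How many partitions (and hence parallel map tasks) the RDD currently has
println(s"partitions before repartition: ${flow.rdd.partitions.length}")

val repartitioned = flow.repartition(200)
println(s"partitions after repartition: ${repartitioned.rdd.partitions.length}")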

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application which needs to read data from Kafka. I created a Kafka topic where the producer was posting messages, and I verified from the console consumer that the messages were successfully posted.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
Following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in Kafka topic
  ssc.start()
  ssc.awaitTermination()
}

private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is " + z)
  //edit
  lines.foreachRDD(rdd => rdd.map(println)
    // <-- Why does this **not** print?
  )
}
Any suggestions on how to resolve this issue?
EDIT
I have used
lines.foreachRDD(rdd => rdd.map(println))
in the actual code as well, but that is also not working. I set the retention period as mentioned in the post Kafka spark directStream can not get data, but the problem still exists.
Your process method is just a continuation of the DStream pipeline; it has no output operator, so nothing gets the pipeline executed every batch interval.
You can "see" it by reading the signature of count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So, you have a DStream of Kafka records that you transform into a DStream of single values (the result of count), but nothing that outputs it (to the console or any other sink).
You have to end the pipeline using an output operator as described in the official documentation Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so that execution can start. Spark Streaming's DStream by design has no notion of being an output dstream; it is DStreamGraph that knows about, and can differentiate between, input and output dstreams.
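As a minimal sketch (reusing the names from the question), ending the pipeline with output operators such as print, or a foreachRDD that performs an action, would look like this:

import org.apache.spark.streaming.dstream.DStream

private def process(lines: DStream[String]): Unit = {
  // count() is just a transformation; print() is the output operator that triggers execution
  lines.count().print()

  // To print the records themselves, run an action inside foreachRDD instead of the lazy rdd.map(println).
  // Note that this output appears in the executors' stdout, not on the driver.
  lines.foreachRDD(rdd => rdd.foreach(record => println(record)))
}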

Handle database connection inside spark streaming

I am not sure I understand correctly how Spark handles database connections, and how to reliably run a large number of database update operations inside Spark without potentially screwing up the Spark job. This is a code snippet I have been using (simplified for illustration):
val driver = new MongoDriver
val hostList: List[String] = conf.getString("mongo.hosts").split(",").toList
val connection = driver.connection(hostList)
val mongodb = connection(conf.getString("mongo.db"))
val dailyInventoryCol = mongodb[BSONCollection](conf.getString("mongo.collections.dailyInventory"))

val stream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()))

def processRDD(rddElem: RDD[(String, String)]): Unit = {
  val df = rddElem.map(line => {
    ...
  }).flatMap(x => x).toDF()

  if (!isEmptyDF(df)) {
    var mongoF: Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]] = Seq()

    val dfF2 = df.groupBy($"CountryCode", $"Width", $"Height", $"RequestType", $"Timestamp").agg(sum($"Frequency")).collect().map(row => {
      val countryCode = row.getString(0); val width = row.getInt(1); val height = row.getInt(2)
      val requestType = row.getInt(3); val timestamp = row.getLong(4); val frequency = row.getLong(5)
      val endTimestamp = timestamp + 24 * 60 * 60 // next day

      val updateOp = dailyInventoryCol.updateModifier(BSONDocument("$inc" -> BSONDocument("totalFrequency" -> frequency)), false, true)

      val f: Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult] =
        dailyInventoryCol.findAndModify(BSONDocument("width" -> width, "height" -> height, "country_code" -> countryCode, "request_type" -> requestType,
          "startTs" -> timestamp, "endTs" -> endTimestamp), updateOp)
      f
    })

    mongoF = mongoF ++ dfF2

    // Split into small chunks to avoid drying out the MongoDB connection
    val futureList: List[Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]]] = mongoF.grouped(200).toList
    futureList.foreach(seqF => {
      Await.result(Future.sequence(seqF), 40.seconds)
    })
  }
}

stream.foreachRDD(processRDD(_))
Basically, I am using ReactiveMongo (Scala), and for each RDD I convert it into a DataFrame, group/extract the necessary data, and then fire a large number of database update queries against Mongo. I want to ask:
I am using Mesos to deploy Spark on 3 servers and have one more server for the Mongo database. Is this the correct way to handle database connections? My concern is whether the database connection/pool is opened at the beginning of the Spark job and maintained properly (surviving timeouts and network failovers) for the whole lifetime of the job (weeks, months...), and whether it is closed when each batch finishes. Given that batches might be scheduled on different servers, does that mean each batch opens a different set of DB connections?
What happens if an exception occurs while executing queries? Will the Spark job for that batch fail, but the next batch keep running?
If there are too many queries (2000+) to run against the Mongo database, and the execution time exceeds the configured Spark batch duration (2 minutes), will that cause a problem? I noticed that with my current setup, after about 2-3 days all of the batches are queued up as "Process" on the Spark Web UI (if I disable the Mongo update part, I can run for a week without problems) and none of them is able to exit properly, which basically hangs up all batch jobs until I restart/resubmit the job.
Thanks a lot. I would appreciate it if you could help me address the issue.
Please read the "Design Patterns for using foreachRDD" section in http://spark.apache.org/docs/latest/streaming-programming-guide.html. This will clear up your doubts about how connections should be created and used.
Secondly, I would suggest keeping the direct update operations separate from your Spark job. A better way would be for your Spark job to process the data and then post it to a Kafka queue, and then have another dedicated process/job which reads the data from that Kafka queue and performs the insert/update operations on MongoDB.
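For the first point, here is a minimal sketch of the pattern from that section of the guide, applied to the question's stream; createMongoConnection and upsert are hypothetical helpers standing in for the ReactiveMongo calls above:

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // The connection is created on the executor that owns this partition,
    // not on the driver, and is reused for all records in the partition.
    val connection = createMongoConnection() // hypothetical helper
    records.foreach(record => upsert(connection, record)) // hypothetical helper
    connection.close()
  }
}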

apache spark running task on each rdd

I have an RDD which is distributed across multiple machines in a Spark environment. I would like to execute a function on each worker machine on this RDD.
I do not want to collect the RDD and then execute the function on the driver. The function should be executed separately on each executor, on its own part of the RDD.
How can I do that?
Update (adding code)
I am running all this in spark shell
import org.apache.spark.sql.cassandra.CassandraSQLContext
import java.util.Properties
val cc = new CassandraSQLContext(sc)
val rdd = cc.sql("select * from sams.events where appname = 'test'");
val df = rdd.select("appname", "assetname");
Here I have a df with 400 rows. I need to save this df to a SQL Server table. When I try to use the df.write method, it gives me errors which I have posted in a separate thread:
spark dataframe not appending to the table
I can open a DriverManager connection and insert rows, but that would be done in the driver module of Spark:
import java.sql._
import com.microsoft.sqlserver.jdbc.SQLServerDriver

val connectionUrl = "jdbc:sqlserver://localhost:1433;" +
  "databaseName=AdventureWorks;user=MyUserName;password=*****;"
val con = DriverManager.getConnection(connectionUrl)

// create a Statement from the connection
val statement = con.createStatement()

// insert the data
statement.executeUpdate("INSERT INTO Customers " + "VALUES (1001, 'Simpson', 'Mr.', 'Springfield', 2001)")
I need to do this writing on the executor machines. How can I achieve this?
In order to set up connections from workers to other systems, we should use rdd.foreachPartition(iter => ...).
foreachPartition lets you execute an operation for each partition, giving you access to the data of the partition as a local iterator.
With enough data per partition, the time spent setting up resources (like DB connections) is amortized by using them over a whole partition.
Abstract example:
rdd.foreachPartition { iter =>
  // set up the db connection once per partition
  val dbconn = Driver.connect(ip, port)
  iter.foreach { element =>
    val query = makeQuery(element)
    dbconn.execute(query)
  }
  dbconn.close()
}
It's also possible to create singleton resource managers that manage those resources for each JVM of the cluster, as sketched below. See also this answer for a complete example of such a local resource manager: spark-streaming and connection pool implementation
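A minimal sketch of that per-JVM singleton idea, reusing the placeholder names (Driver, ip, port, makeQuery) from the snippet above:

object ConnectionManager {
  // Initialized lazily, once per executor JVM, and shared by all tasks running in it.
  lazy val dbconn = Driver.connect(ip, port)
}

rdd.foreachPartition { iter =>
  iter.foreach { element =>
    ConnectionManager.dbconn.execute(makeQuery(element))
  }
  // The connection is deliberately not closed here; its lifecycle belongs to the executor JVM.
}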
