I have a use case where I have to read Parquet files and publish the records to a Kafka topic.
I read the Parquet files using:
spark.read.parquet(/* path to parquet file */)
Then I sort this DataFrame on the timestamp (a use-case-specific requirement to preserve order).
Finally, I do the following (binaryFiles below is the DataFrame containing the sorted records from the Parquet file):
binaryFiles.coalesce(1).foreachPartition(partition => {
  val producer = new KafkaProducer[String, Array[Byte]](properties)
  partition.foreach(file => {
    try {
      // column 2 is used as the target partition, column 0 as the key, column 1 as the value
      val producerRecord = new ProducerRecord[String, Array[Byte]](
        targetTopic, file.getAs[Integer](2), file.getAs[String](0), file.getAs[Array[Byte]](1))
      producer.send(producerRecord, new Callback {
        override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
          if (e != null) {
            println("Error while producing: " + e)
            producer.close()
          }
        }
      })
      producer.flush()
    } catch {
      case unknown: Throwable => println("Exception obtained with record. Key: " + unknown)
    }
  })
  producer.close()
  println("Closing the producer for this partition")
})
While designing the failover strategy for this scenario, one case I am trying to cater to is: what if the node that runs the Kafka producer goes down?
When the job is restarted, it will read the Parquet file from the start and publish all the records to the same topic again.
How can I overcome this and implement some sort of checkpointing like the one Spark Streaming provides?
PS: I cannot use Spark Structured Streaming, as it does not preserve the order of the messages.
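For illustration only, a minimal sketch of one possible manual checkpoint (not from the original post): persist the timestamp of the last successfully published record to a small file, and on restart filter the sorted DataFrame to records after that timestamp. The checkpoint path, the name sortedDf, and the assumption that the sort timestamp is strictly increasing are mine.
// Hypothetical sketch: resume from the last successfully published timestamp.
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.col

val checkpointPath = new Path("/tmp/parquet-to-kafka-checkpoint")   // hypothetical location
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Read the last committed timestamp, if any.
val lastPublished: Option[Long] =
  if (fs.exists(checkpointPath)) {
    val reader = new java.io.BufferedReader(
      new java.io.InputStreamReader(fs.open(checkpointPath), StandardCharsets.UTF_8))
    try Some(reader.readLine().trim.toLong) finally reader.close()
  } else None

// Skip records that were already published before the crash.
// `sortedDf` stands for the sorted DataFrame from the question.
val toPublish = lastPublished
  .map(ts => sortedDf.filter(col("timestamp") > ts))
  .getOrElse(sortedDf)

// Call this after a send has succeeded (e.g. in the producer callback):
def commitTimestamp(ts: Long): Unit = {
  val out = fs.create(checkpointPath, true)   // overwrite the previous checkpoint
  try out.write(ts.toString.getBytes(StandardCharsets.UTF_8)) finally out.close()
}
Committing once per record adds an HDFS round trip per message; in practice one would commit every N records or once per partition, accepting a few duplicates after a crash (at-least-once).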
Related
I have a Spark Streaming job (2.1.1 with Cloudera 5.12), with Kafka input and HDFS output (in Parquet format).
The problem is, I'm getting LeaseExpiredException randomly (not in every mini-batch):
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/qoe_fixe/data_tv/tmp/cleanData/_temporary/0/_temporary/attempt_20180629132202_0215_m_000000_0/year=2018/month=6/day=29/hour=11/source=LYO2/part-00000-c6f21a40-4088-4d97-ae0c-24fa463550ab.snappy.parquet (inode 135532024): File does not exist. Holder DFSClient_attempt_20180629132202_0215_m_000000_0_-1048963677_900 does not have any open files.
I'm using the Dataset API for writing to HDFS:
if (!InputWithDatePartition.rdd.isEmpty() ) InputWithDatePartition.repartition(1).write.partitionBy("year", "month", "day","hour","source").mode("append").parquet(cleanPath)
My job fails after a few hours because of this error.
Two jobs writing to the same directory share the same _temporary folder.
So when the first job finishes, this code is executed (FileOutputCommitter class):
public void cleanupJob(JobContext context) throws IOException {
  if (hasOutputPath()) {
    Path pendingJobAttemptsPath = getPendingJobAttemptsPath();
    FileSystem fs = pendingJobAttemptsPath
        .getFileSystem(context.getConfiguration());
    // if job allow repeatable commit and pendingJobAttemptsPath could be
    // deleted by previous AM, we should tolerate FileNotFoundException in
    // this case.
    try {
      fs.delete(pendingJobAttemptsPath, true);
    } catch (FileNotFoundException e) {
      if (!isCommitJobRepeatable(context)) {
        throw e;
      }
    }
  } else {
    LOG.warn("Output Path is null in cleanupJob()");
  }
}
It deletes pendingJobAttemptsPath (_temporary) while the second job is still running.
This may be helpful:
Multiple spark jobs appending parquet data to same base path with partitioning
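For illustration only, one workaround often suggested for this situation (not from the original post, and with assumed partition values) is to have each job write directly into its own partition subdirectory, so concurrent jobs no longer share the _temporary folder under the common base path:
// Hypothetical sketch: write this job's batch straight into the partition directory it owns,
// so each write gets its own _temporary folder instead of sharing the one under cleanPath.
// Assumes every row in the batch belongs to the same partition values.
val partitionSubdir = "year=2018/month=6/day=29/hour=11/source=LYO2"   // assumption: derived per batch

InputWithDatePartition
  .drop("year", "month", "day", "hour", "source")   // the values are already encoded in the path
  .repartition(1)
  .write
  .mode("append")
  .parquet(s"$cleanPath/$partitionSubdir")
Readers would then pass the basePath option (e.g. spark.read.option("basePath", cleanPath).parquet(...)) to recover the partition columns, or the columns can simply be kept in the data instead of dropped.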
I am using the strategy provided in the Spark Streaming documentation for committing to Kafka itself. My flow is like so:
Topic A --> Spark Stream [foreachRDD process -> send to Topic B] commit offset to Topic A
JavaInputDStream<ConsumerRecord<String, Request>> kafkaStream = KafkaUtils.createDirectStream(
    streamingContext,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.<String, Request>Subscribe(inputTopics, kafkaParams)
);

kafkaStream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    rdd.foreachPartition(
        consumerRecords -> {
            OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
            System.out.println(String.format("%s %d %d %d", o.topic(), o.partition(), o.fromOffset(), o.untilOffset()));
            consumerRecords.forEachRemaining(record -> doProcess(record));
        });
    ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
});
So let's say the RDD gets 10 events from Topic A, and while processing each of them I send a new event to Topic B. Now suppose one of those sends fails. I don't want to commit that particular offset to Topic A. Topics A and B have the same number of partitions, N, so each RDD partition should be consuming from the same Kafka partition. What would be the best strategy to keep processing? How can I reset the stream to retry those events from Topic A until they succeed? I know I can't just keep processing that partition and commit, because committing would automatically move the offset and the failed record would never be processed again.
I don't know if it is possible for the stream/RDD to keep retrying the same messages for that partition only while the other partitions/RDDs keep working. If I throw an exception from that particular RDD, what happens to my job? Does it fail? Do I need to restart it manually? With regular consumers you can retry/recover, but I am not sure what happens with Streaming.
This is what I came up with; it takes the input data and then sends a request to the output topic. The producer has to be created inside the foreachPartition loop, otherwise Spark will try to serialize it and send it to all the workers. Notice the response is sent asynchronously, which means I am using at-least-once semantics in this system.
kafkaStream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    rdd.foreachPartition(
        partition -> {
            OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
            System.out.println(String.format("%s %d %d %d", o.topic(), o.partition(), o.fromOffset(), o.untilOffset()));
            // Print statements in this section are shown in the executor's stdout logs
            KafkaProducer<String, Response> producer = new KafkaProducer<>(producerConfig(o.partition()));
            partition.forEachRemaining(record -> {
                System.out.println("request: " + record.value());
                Response data = new Response …
                // As a debugging technique, users can write to DBFS to verify that records are being written out
                // dbutils.fs.put("/tmp/test_kafka_output",data,true)
                ProducerRecord<String, Response> message = new ProducerRecord<>(outputTopic, null, data);
                Future<RecordMetadata> result = producer.send(message);
                try {
                    RecordMetadata metadata = result.get();
                    System.out.println(String.format("offset='%d' partition='%d' topic='%s' timestamp='%d'",
                        metadata.offset(), metadata.partition(), metadata.topic(), metadata.timestamp()));
                } catch (InterruptedException | ExecutionException e) {
                    e.printStackTrace();
                }
            });
            producer.close();
        });
    ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
});
I'm on a project using Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. The message flow into Kafka is about 4000-6000 messages per second.
When using the HDFS file system as the sink destination, it works just fine. When using the foreach JDBC writer, a huge delay builds up over time. I think the lag is caused by the foreach loop.
The JDBC sink class (standalone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  // Called once per partition and epoch: open the connection and prepare the statement
  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  // Called once per row: bind the values and issue the insert
  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.executeUpdate()
  }

  // Called at the end of the partition/epoch: commit and release the connection
  def close(errorOrNull: Throwable): Unit = {
    connection.commit()
    connection.close()
  }
}
The sink part:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "namenode:9092")
  .option("fetch.message.max.bytes", "50000000")
  .option("kafka.max.partition.fetch.bytes", "50000000")
  .option("subscribe", "rawdb.raw_data")
  .option("startingOffsets", "latest")
  .load()
  .select($"value".as[Array[Byte]])
  .map(avroDeserialize(_))
  .filter(some logic).select(some logic)
  .writeStream.format("csv")
  .option("checkpointLocation", "/user/root/chk")
  .option("path", "/user/root/testdir")
  .start()
If I change the last lines
.writeStream.format("csv")...
into the JDBC foreach sink as follows:
val url = "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user"
val pwd = "password"
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag shows up.
I guess the problem is most likely caused by the foreach loop mechanics: it does not work in batch mode, dealing with several thousand rows at a time. As an Oracle DBA, I have also fine-tuned the Oracle database side; mostly the database is waiting on idle events. Excessive commits are already avoided by setting connection.setAutoCommit(false). Any suggestion will be much appreciated.
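For illustration only, a minimal sketch (not from the original post) of what batching inside the same ForeachWriter could look like, assuming the same four-column rows and insert statement as the JDBCSink above:
// Hypothetical variant of JDBCSink that accumulates rows with addBatch()
// and flushes them once per partition in close(). Same schema assumed.
class BatchedJDBCSink(url: String, user: String, pwd: String)
    extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {

  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName("oracle.jdbc.driver.OracleDriver")
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.addBatch()          // queue the row instead of executing it immediately
  }

  def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      statement.executeBatch()    // one round trip for the whole partition
      connection.commit()
    }
    connection.close()
  }
}
For very large partitions one would flush every few thousand rows inside process() rather than holding everything until close().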
Although I don't have an actual profile of what's taking the longest time in your application, I would assume it is due to the fact that using ForeachWriter will effectively close and re-open your JDBC connection on each run, because that's how ForeachWriter works.
I would advise that instead of using it, you write a custom Sink for JDBC where you control how the connection is opened or closed.
There is an open pull request to add a JDBC driver to Spark, which you can take a peek at to see a possible approach to the implementation.
Problem solved by injecting the result into another Kafka topic, then writing another program that reads from the new topic and writes the rows into the database in batches.
I think in the next Spark release they might provide a JDBC sink with a parameter to set the batch size.
The main code is as follows.
Write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
Read the topic and write to the database; I'm using a c3p0 connection pool:
lines.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = ConnManager.getManager.getConnection
      val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => insertIntoDB(ps, record))
        ps.executeBatch()
        conn.commit()
      } catch {
        case e: Exception => {} // do some log
      } finally {
        ps.close()
        conn.close()
      }
    })
  }
})
Have you tried using a trigger?
I noticed that when I didn't use a trigger, my foreach sink opened and closed the connection to the database several times:
writeStream.foreach(writer).start()
But when I used a trigger, the foreach sink only opened and closed the connection once, processing for example 200 queries, and when the micro-batch ended it closed the connection until a new micro-batch was received:
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so I think Spark is using one partition. I don't know if this solution works the same with multiple Spark partitions, but my conclusion here is that foreach processes the whole micro-batch at a time (row by row) in the process method and doesn't call open() and close() for every row, as a lot of people think.
I have a Spark Streaming job that does some aggregations on an incoming Kafka stream and saves the results in Hive. However, I have about 5 Spark SQL queries to run on the incoming data, which can be run concurrently as there is no dependency in transformations among these 5; if possible, I would like to run them concurrently without waiting for the first SQL to end. They all go to separate Hive tables. For example:
// This is the Kafka inbound stream
// Code in Consumer
val stream = KafkaUtils.createDirectStream[..](...)
val metric1= Future {
computeFuture(stream, dataframe1, countIndex)
}
val metric2= Future {
computeFuture(stream, dataframe2, countIndex)
}
val metric3= Future {
computeFirstFuture(stream, dataframe3, countIndex)
}
val metric4= Future {
computeFirstFuture(stream, dataframe4, countIndex)
}
metric1.onFailure {
case e => logger.error(s"Future failed with an .... exception", e)
}
metric2.onFailure {
case e => logger.error(s"Future failed with an .... exception", e)
}
....and so on
On doing the above, the actions in the Futures appear sequential (from the Spark web UI). How can I enforce concurrent execution? Using Spark 2.0, Scala 2.11.8. Do I need to create separate Spark sessions using .newSession()?
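For illustration only, a sketch of one setup commonly used for this (not from the original post): back the Futures with an ExecutionContext that has enough threads, and enable FAIR scheduling so the concurrently submitted jobs share executor resources. The thread count and pool name below are assumptions.
// Hypothetical sketch: Futures only overlap if their ExecutionContext has spare threads;
// the default global pool may be saturated or serialize the work.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// One thread per concurrent query (5 here, matching the number of SQLs in the question).
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(5))

// FAIR scheduling is enabled at submit time, e.g. --conf spark.scheduler.mode=FAIR;
// each thread can then optionally tag its jobs with a named pool:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "metricsPool")   // pool name is an assumption
With the implicit ec in scope, the existing Future { computeFuture(...) } blocks are scheduled on this pool; whether the resulting Spark jobs actually overlap also depends on having enough free executor cores.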
We have a Spark Streaming job writing data to Amazon DynamoDB using foreachRDD, but it is very slow: our consumption rate is 10,000/sec and writing 10,000 records takes 35 min. This is the code piece (tempRequestsWithState is a DStream):
tempRequestsWithState.foreachRDD { rdd =>
  if ((rdd != null) && (rdd.count() > 0) && (!rdd.isEmpty())) {
    rdd.foreachPartition {
      case (topicsTableName, hashKeyTemp, attributeValueUpdate) => {
        val client = new AmazonDynamoDBClient()
        val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
        try client.updateItem(request)
        catch {
          case se: Exception => println("Error executing updateItem!\nTable ", se)
        }
      }
      case null =>
    }
  }
}
From research I learnt that using foreachPartition and creating a connection per partition will help, but I'm not sure how to go about writing the code for it. I would greatly appreciate it if someone could help with this. Any other suggestion to speed up the writes is also greatly appreciated.
Each partition is handled by an executor (a JVM process), so inside foreachPartition you can write code to initialize a DB connection and write to the database. In the code given, the line with the first case() is where you write that code. As you get more partitions, if you have multiple executors writing to the DB, this will be done in parallel.
rdd.foreachPartition { partition =>
  // initialize database cnx
  // write to db
  // close connection
}
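For illustration only, a minimal sketch of that pattern (not from the original post), reusing the names from the question; the assumption that each element is a (topicsTableName, hashKeyTemp, attributeValueUpdate) tuple comes from the question's code:
// Hypothetical sketch: one DynamoDB client per partition, reused for every record
// in that partition, instead of one client per record.
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest

tempRequestsWithState.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreachPartition { partition =>
      // one client per partition (i.e. per task), created on the executor
      val client = AmazonDynamoDBClientBuilder.standard().build()
      partition.foreach { case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
        val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
        try client.updateItem(request)
        catch {
          case se: Exception => println("Error executing updateItem on table " + topicsTableName + ": " + se)
        }
      }
      client.shutdown()
    }
  }
}
Each partition now creates one client and reuses it for all of its records, and partitions on different executors write in parallel.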
It is better to use a single partition to write to the DB and a singleton to initialize the connection, to decrease the number of DB connections; inside the foreachPartition function, use batched writes to increase the number of inserted lines per call.
rdd.repartition(1).foreachPartition { partition =>
  // get singleton instance cnx
  // write with batches to db
  // close connection
}
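For illustration only, a minimal sketch of the singleton idea (not from the original post): a lazy val inside an object is initialized once per executor JVM, so repeated tasks on the same executor reuse the client instead of reconnecting. The holder name is an assumption.
// Hypothetical singleton holder: created lazily on first use on each executor,
// then shared by every task that runs in that JVM.
import com.amazonaws.services.dynamodbv2.{AmazonDynamoDB, AmazonDynamoDBClientBuilder}

object DynamoClientHolder {
  lazy val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.standard().build()
}

rdd.repartition(1).foreachPartition { partition =>
  val client = DynamoClientHolder.client   // reused across batches on this executor
  partition.foreach { record =>
    // build and send the write request for `record` here (batching where the API allows)
  }
}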