I'm running a Spark application which receives input every 3 minutes; after processing, the output may land in one of two different folders depending on the result of the process.
In one case, I have to send an email after the streaming process to confirm that it has finished, but Spark ignores the function (my email-sending function) that I put between start() and awaitTermination() to send an email after each stream.
Code below:
val completeExternalFile = dfToWrite
  .coalesce(1)
  .writeStream
  .format("csv")
  .outputMode("append")
  .option("checkpointLocation", destFilePathFileNameMetadata + checkpointSuffix)
  .option("compression", "uncompressed")
  .option("header", "true")
  .option("path", destFilePathFileNameEnrichi)
  .trigger(Trigger.ProcessingTime(20.seconds))
  .start()
// here I call my email function
val completeExternalFile1 = completeExternalFile.awaitTermination()
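For reference, a minimal sketch (not part of the original snippet) of one way to run driver-side logic such as sending an email after every micro-batch, using foreachBatch (available in Spark 2.4+); sendEmail is a hypothetical placeholder for the real email function:
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// Hypothetical helper standing in for the email-sending function.
def sendEmail(batchId: Long): Unit = println(s"batch $batchId finished")

val query = dfToWrite
  .coalesce(1)
  .writeStream
  .option("checkpointLocation", destFilePathFileNameMetadata + checkpointSuffix)
  .trigger(Trigger.ProcessingTime(20.seconds))
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Write the batch much as the csv sink would.
    batch.write
      .mode(SaveMode.Append)
      .option("header", "true")
      .csv(destFilePathFileNameEnrichi)
    // Runs on the driver after the batch's write completes.
    sendEmail(batchId)
  }
  .start()

query.awaitTermination()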
I'm new to Spark and currently battling a problem related to saving the result of a Spark stream to a file after the context time. The issue is: I want a query to run for 60 seconds and save all the input it reads in that time to a file, and I also want to be able to define the file name for future processing.
Initially I thought the code below would be the way to go:
sc.socketTextStream("localhost", 12345)
  .foreachRDD(rdd -> {
      rdd.saveAsTextFile("./test");
  });
However, after running it, I realized that it saved a different file for each batch it read (imagine random numbers arriving at a random pace on that port: if in one second it read one value, the file would contain one number; if it read more, the file would contain all of them), instead of writing just one file with all the values from that 60-second timeframe. On top of that, I wasn't able to name the file, since the argument passed to saveAsTextFile is the desired directory.
So I would like to ask if there is any Spark-native solution, so that I don't have to solve it with "Java tricks" like this:
sc.socketTextStream("localhost", 12345)
  .foreachRDD(rdd -> {
      PrintWriter out = new PrintWriter("./logs/votes[" + dtf.format(LocalDateTime.now().minusMinutes(2)) + "," + dtf.format(LocalDateTime.now()) + "].txt");
      List<String> l = rdd.collect();
      for (String voto : l)
          out.println(voto + " " + dtf.format(LocalDateTime.now()));
      out.close();
  });
I searched the Spark documentation for similar problems but was unable to find a solution :/
Thanks for your time :)
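As an aside, a minimal DStream-based sketch of one possible native approach (in Scala, a sketch not taken from the original answers): collect each 60-second window into one RDD with window() and coalesce it to a single part file. Note that saveAsTextFile always writes a directory, so only the directory name can be controlled:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedSocketSave").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

ssc.socketTextStream("localhost", 12345)
  .window(Seconds(60), Seconds(60)) // one RDD covering each 60-second window
  .foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty)
      // coalesce(1) yields a single part-00000 inside a directory named after the window end time
      rdd.coalesce(1).saveAsTextFile(s"./logs/votes-${time.milliseconds}")
  }

ssc.start()
ssc.awaitTermination()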
Below is a template to consume socket stream data using the new Spark APIs.
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object ReadSocket {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess

    // Start reading from the socket
    val dfStream = spark.readStream
      .format("socket")
      .option("host", "127.0.0.1") // Replace with your socket host
      .option("port", "9090")
      .load()

    dfStream.writeStream
      .trigger(Trigger.ProcessingTime("1 minute")) // Triggers every 1 minute
      .outputMode(OutputMode.Append) // Each batch processes the data that arrived in the last minute
      .foreachBatch((ds, id) => {
        ds.foreach(row => { // Iterate over the dataset
          // Put your file-generation logic here
          println(row.getString(0)) // That's your record
        })
      }).start().awaitTermination()
  }
}
For a code explanation, please read the inline comments.
Java Version
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
public class ReadSocketJ {

    public static void main(String[] args) throws StreamingQueryException {
        SparkSession spark = Constant.getSparkSess();

        Dataset<Row> lines = spark
                .readStream()
                .format("socket")
                .option("host", "127.0.0.1") // Replace with your socket host
                .option("port", "9090")
                .load();

        lines.writeStream()
                .trigger(Trigger.ProcessingTime("5 seconds"))
                .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (v1, v2) -> {
                    v1.as(Encoders.STRING())
                            .collectAsList().forEach(System.out::println);
                }).start().awaitTermination();
    }
}
I am trying to read data from a file uploaded to a server path, do some manipulation, and then save the data to a DB2 database. We have around 300K records, which may increase further in the future, so we are trying to do all the manipulation, as well as the write to DB2, inside foreachPartition. Below are the steps followed to do so.
Create spark context as global and static.
static SparkContext sparkContext = new SparkContext();
static JavaSparkContext jc = sparkContext.getJavaSparkContext();
static SparkSession sc = sparkContext.getSparkSession();
Create a dataset from the file present on the server:
Dataset<Row> dataframe = sparkContext.getSparkSession().read()
        .option("delimiter", "|")
        .option("header", "false")
        .option("inferSchema", "false")
        .schema(SchemaClass.generateSchema())
        .csv(filePath)
        .withColumn("ID", monotonically_increasing_id())
        .withColumn("Partition_Id", spark_partition_id());
Call foreachPartition:
dataframe.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> _row) throws Exception {
        List<SparkPositionDto> li = new ArrayList<>();
        while (_row.hasNext()) {
            Row row = _row.next();
            PositionDto positionDto = AnotherClass.method(row, 1);
            SparkPositionDto spd = copyToSparkDto(positionDto);
            if (spd != null) {
                li.add(spd);
            }
        }
        System.out.println("Writing via Spark : List size : " + li.size());
        JavaRDD<SparkPositionDto> finalRdd = jc.parallelize(li);
        Dataset<Row> dfToWrite = sc.createDataFrame(finalRdd, SparkPositionDto.class);
        System.out.println("Writing Data");
        if (dfToWrite != null) {
            dfToWrite.write()
                    .format("jdbc")
                    .option("url", "jdbc:msdb2://" + "Database_Name" + ";useKerberos=true")
                    .option("driver", "DRIVER_NAME")
                    .option("dbtable", "TABLE_NAME")
                    .mode(SaveMode.Append)
                    .save();
        }
    }
});
The weird observation is that when I run this code outside foreachPartition for a small set of data, it works fine, and in my Spark cluster just one driver and one application run. But when the same code runs inside foreachPartition, I can see one driver and two applications, with one app in the RUNNING state and the other WAITING. If I set the number of partitions to 5 in my schema, then 5 applications can be seen running. It runs continuously with nothing in the logs; it seems to have gotten stuck somewhere.
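Note that SparkSession/JavaSparkContext are driver-side objects and are not meant to be used inside foreachPartition, which runs on the executors; the usual pattern is to write each partition with a plain JDBC connection instead of building new RDDs/DataFrames there. A minimal sketch of that pattern (in Scala for brevity; the URL, credentials, table, and column mapping are hypothetical):
import java.sql.DriverManager
import org.apache.spark.sql.Row

val jdbcUrl = "jdbc:db2://host:50000/Database_Name" // hypothetical connection details

// .rdd.foreachPartition avoids the Scala/Java overload ambiguity of Dataset.foreachPartition
dataframe.rdd.foreachPartition { rows: Iterator[Row] =>
  // one connection and one prepared statement per partition, opened on the executor
  val conn = DriverManager.getConnection(jdbcUrl, "user", "password")
  conn.setAutoCommit(false)
  val ps = conn.prepareStatement("insert into TABLE_NAME (COL1, COL2) values (?, ?)")
  try {
    rows.foreach { row =>
      ps.setString(1, row.getString(0)) // hypothetical column mapping
      ps.setString(2, row.getString(1))
      ps.addBatch() // accumulate instead of one round trip per row
    }
    ps.executeBatch()
    conn.commit()
  } finally {
    ps.close()
    conn.close()
  }
}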
I'm on a project using Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. The message flow into Kafka is about 4000-6000 messages per second.
When using the HDFS file system as the sink destination, it just works fine. When using a foreach JDBC writer, a huge delay builds up over time. I think the lag is caused by the foreach loop.
The JDBC sink class (standalone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {

  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.executeUpdate()
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.commit()
    connection.close()
  }
}
The sink part:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "namenode:9092")
  .option("fetch.message.max.bytes", "50000000")
  .option("kafka.max.partition.fetch.bytes", "50000000")
  .option("subscribe", "rawdb.raw_data")
  .option("startingOffsets", "latest")
  .load()
  .select($"value".as[Array[Byte]])
  .map(avroDeserialize(_))
  .filter(some logic).select(some logic)
  .writeStream.format("csv")
  .option("checkpointLocation", "/user/root/chk")
  .option("path", "/user/root/testdir")
  .start()
If I change the last line
.writeStream.format("csv")...
into the JDBC foreach sink as follows:
val url = "jdbc:oracle:thin:#(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user";
val pwd = "password";
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag shows up.
I guess the problem is most likely caused by the foreach loop mechanics: it doesn't work in batch mode, dealing with, say, several thousand rows at a time. As an Oracle DBA as well, I have fine-tuned the Oracle database side; mostly the database is waiting on idle events. Excessive commits are already avoided by setting connection.setAutoCommit(false). Any suggestions will be much appreciated.
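For illustration, a hedged sketch of how the same ForeachWriter could batch its inserts: accumulate with addBatch in process() and flush once per partition in close(), instead of one executeUpdate round trip per row. This is only a sketch of the batching idea, not a tested fix:
import org.apache.spark.sql.{ForeachWriter, Row}

class BatchedJDBCSink(url: String, user: String, pwd: String) extends ForeachWriter[Row] {
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _

  def open(partitionId: Long, version: Long): Boolean = {
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(
      "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) " +
        "values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)")
    true
  }

  def process(value: Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.addBatch() // accumulate; a real implementation would likely flush every N rows
  }

  def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      statement.executeBatch() // one round trip per partition
      connection.commit()
    }
    connection.close()
  }
}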
Although I don't have an actual profile of what's taking the longest time in your application, I would assume it is due to the fact that using ForeachWriter will effectively close and re-open your JDBC connection on each run, because that's how ForeachWriter works.
I would advise that, instead of using it, you write a custom Sink for JDBC where you control how the connection is opened and closed.
There is an open pull request to add a JDBC driver to Spark, which you can take a peek at to see a possible approach to the implementation.
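On newer Spark versions (2.4+, so not the 2.2 used here), foreachBatch gives similar per-batch control without a custom Sink: the whole micro-batch is handed to the built-in JDBC writer in one go. A minimal sketch, assuming streamingDF stands for the filtered/selected stream above and reusing the same url/user/pwd; note the sequence-generated id column would have to be handled by a DB-side default or trigger:
import org.apache.spark.sql.{DataFrame, SaveMode}

streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // one JDBC write (and connection lifecycle) per micro-batch
    batchDF.write
      .format("jdbc")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("url", url)
      .option("user", user)
      .option("password", pwd)
      .option("dbtable", "sparkdb.t_cf")
      .mode(SaveMode.Append)
      .save()
  }
  .option("checkpointLocation", "/user/root/chk")
  .start()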
Problem solved by injecting the result into another Kafka topic, then writing another program that reads from the new topic and writes the records into the database in batches.
I think in the next Spark release they might provide a JDBC sink with some parameter for setting the batch size.
The main code is as follows:
Write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
Read the topic and write to the database; I'm using a c3p0 connection pool:
lines.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = ConnManager.getManager.getConnection
      val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          insertIntoDB(ps, record)
        })
        ps.executeBatch()
        conn.commit()
      } catch {
        case e: Exception => {}
        // do some log
      } finally {
        ps.close()
        conn.close()
      }
    })
  }
})
Have you tried using a trigger?
I noticed that when I didn't use a trigger, my foreach sink opened and closed the connection to the database several times.
writeStream.foreach(writer).start()
But when I used a trigger, the foreach sink only opened and closed the connection once, processing for example 200 queries, and when the micro-batch ended it closed the connection until a new micro-batch was received.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so I think Spark is using one partition. I don't know if this solution works the same with multiple Spark partitions, but my conclusion here is that foreach processes the whole micro-batch at a time (row by row) in the process method and doesn't call open() and close() for every row, as a lot of people think.
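To see this behaviour for yourself, a small hedged sketch: a do-nothing ForeachWriter that only logs its lifecycle, attached to a hypothetical streaming DataFrame df. open/close fire once per partition per micro-batch, while process fires once per row:
import org.apache.spark.sql.{ForeachWriter, Row}
import org.apache.spark.sql.streaming.Trigger

val probeWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = {
    println(s"open: partition=$partitionId version=$version")
    true
  }
  def process(value: Row): Unit = () // called once per row
  def close(errorOrNull: Throwable): Unit = println(s"close: error=$errorOrNull")
}

df.writeStream
  .trigger(Trigger.ProcessingTime("3 seconds"))
  .foreach(probeWriter)
  .start()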
The simple program below reads from a Kafka stream and writes to a CSV file every 5 minutes; it's Spark streaming. Is there a way I can invoke a Java function after each micro-batch in the driver program (not in the executor)?
I agree it's not good practice to call arbitrary code in the stream, but this is a special case where we have low-volume data. Please advise. Thanks.
public static void main(String[] args) throws Exception {
    if (args.length == 0)
        throw new Exception("Usage program configFilename");

    String configFilename = args[0];
    addShutdownHook();
    ConfigLoader.loadConfig(configFilename);

    sparkSession = SparkSession
            .builder()
            .appName(TestKafka.class.getName())
            .master(ConfigLoader.getValue("master")).getOrCreate();
    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel(ConfigLoader.getValue("logLevel"));
    SQLContext sqlCtx = sparkSession.sqlContext();
    System.out.println("Spark context established");

    DataStreamReader kafkaDataStreamReader = sparkSession.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", ConfigLoader.getValue("brokers"))
            .option("group.id", ConfigLoader.getValue("groupId"))
            .option("subscribe", ConfigLoader.getValue("topics"))
            .option("failOnDataLoss", false);
    Dataset<Row> rawDataSet = kafkaDataStreamReader.load();
    rawDataSet.printSchema();
    rawDataSet.createOrReplaceTempView("rawEventView1");

    rawDataSet = rawDataSet.withColumn("rawEventValue", rawDataSet.col("value").cast("string"));
    rawDataSet.printSchema();
    rawDataSet.createOrReplaceTempView("eventView1");

    sqlCtx.sql("select * from eventView1")
            .writeStream()
            .format("csv")
            .option("header", "true")
            .option("delimiter", "~")
            .option("checkpointLocation", ConfigLoader.getValue("checkpointPath"))
            .option("path", ConfigLoader.getValue("recordsPath"))
            .outputMode(OutputMode.Append())
            .trigger(ProcessingTime.create(Integer.parseInt(ConfigLoader.getValue("kafkaProcessingTime")),
                    TimeUnit.SECONDS))
            .start()
            .awaitTermination();
}
You should be able to achieve this with something like this:
rawDataSet.map { value => mySideEffect(); value }
This will call the function mySideEffect every time a micro-batch is received from Kafka. However, I would not recommend doing this; a better approach would be to watch the folder where you persist the CSV, or simply to check the web UI. Considering a micro-batch happens every few seconds at most, you would be swamped by emails. If you want to make sure the streaming application is up, you can query the Spark REST API every few seconds and make sure it's still up:
https://spark.apache.org/docs/latest/monitoring.html
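Alternatively (a sketch of a different technique, not mentioned in the answer above, in Scala for brevity): Structured Streaming exposes a StreamingQueryListener whose callbacks run on the driver after each micro-batch, which matches the "invoke a function in the driver program" requirement. afterBatchHook is a hypothetical stand-in for the Java function to call:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

sparkSession.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // runs on the driver once per completed micro-batch
    println(s"batch ${event.progress.batchId} read ${event.progress.numInputRows} rows")
    // afterBatchHook()  // hypothetical driver-side function
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})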
I have a Flume stream which I want to store in HDFS via Spark. Below is the Spark code that I am running:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePull {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        "Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }

    val batchInterval = Milliseconds(60000)
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)
    val stream = FlumeUtils.createPollingStream(ssc, "localhost", 9999)

    stream.map(x => x + "!!!!")
      .saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")

    ssc.start()
    ssc.awaitTermination()
  }
}
When I start my Spark streaming job, it does store output in HDFS, but the output is something like this:
[root@sandbox ~]# hadoop fs -cat /user/root/spark/flume_Map_-1459450380000._Mapout/part-00000
org.apache.spark.streaming.flume.SparkFlumeEvent@1b9bd2c9!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@33fd3a48!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@35fd67a2!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@f9ed85f!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@58f4cfc!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@307373e!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@4ebbc8ff!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@a8905bb!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@29d73d64!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@71ff85b1!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@3ea261ef!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@16cbb209!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@17157890!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@29e41c7!!!!
It is storing the Flume event object instead of the data coming from Flume. How do I get the data out of it?
Thanks
You need to extract the underlying buffer from the SparkFlumeEvent and save that. For example, if your event body is a String:
stream.map(x => new String(x.event.getBody.array) + "!!!!")
.saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")