How to get data from regularly appended log file in Apache Spark? - apache-spark

I have one Apache access log file which has some data and it is continuously increasing. I want to analyze that data using Apache Spark Streaming API.
And Spark is new for me and i created one program in which ,i use jssc.textFileStream(directory) function to get log data. but its not work as per my requirement.
please suggest me some approaches to analyze that log file using spark.
Here is my code.
SparkConf conf = new SparkConf()
.setMaster("spark://192.168.1.9:7077")
.setAppName("log streaming")
.setSparkHome("/usr/local/spark")
.setJars(new String[] { "target/sparkstreamingdemo-0.0.1.jar" });
StreamingContext ssc = new StreamingContext(conf, new Duration(5000));
DStream<String> filerdd = ssc.textFileStream("/home/user/logs");
filerdd.print();
ssc.start();
ssc.awaitTermination();
This code does not return any data from existing files. This is only work when i create a new file but when i update that new file, program again does not return updated data.

If the file is modified in real-time you can use Tailer from Apache Commons IO.
That's the simpliest sample:
public void readLogs(File f, long delay) {
TailerListener listener = new MyTailerListener();
Tailer tailer = new Tailer(f, listener, delay);
// stupid executor impl. for demo purposes
Executor executor = new Executor() {
public void execute(Runnable command) {
command.run();
}
};
executor.execute(tailer);
}
public class MyTailerListener extends TailerListenerAdapter {
public void handle(String line) {
System.out.println(line);
}
}
The code above may be used as a log reader for Apache Flume and applied as a source. Then you need to configure Flume sink to redirect collected logs to Spark stream and apply Spark for analyzing data from Flume stream (http://spark.apache.org/docs/latest/streaming-flume-integration.html)
More details about Flume setup in this post: real time log processing using apache spark streaming

Related

slow performance of spark

I am new to Spark. I have a requirement where I need to integrate Spark with Web Service. Any request to a Web service has to be processed using Spark and send the response back to the client.
I have created a small dummy service in Vertx, which accepts request and processes it using Spark. I am using Spark in cluster mode (1 master, 2 slaves, 8 core, 32 Gb each, running on top of Yarn and Hdfs)
public class WebServer {
private static SparkSession spark;
private static void createSparkSession(String masterUrl) {
SparkContext context = new SparkContext(new SparkConf().setAppName("Web Load Test App").setMaster(masterUrl)
.set("spark.hadoop.fs.default.name", "hdfs://x.x.x.x:9000")
.set("spark.hadoop.fs.defaultFS", "hdfs://x.x.x.x:9000")
.set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
.set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
.set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName())
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "hdfs://x.x.x.x:9000/spark-logs")
.set("spark.history.provider", "org.apache.spark.deploy.history.FsHistoryProvider")
.set("spark.history.fs.logDirectory", "hdfs://x.x.x.x:9000/spark-logs")
.set("spark.history.fs.update.interval", "10s")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//.set("spark.kryoserializer.buffer.max", "1g")
//.set("spark.kryoserializer.buffer", "512m")
);
spark = SparkSession.builder().sparkContext(context).getOrCreate();
}
public static void main(String[] args) {
Vertx vertx = Vertx.vertx();
String masterUrl = "local[2]";
if(args.length == 1) {
masterUrl = args[0];
}
System.out.println("Master url: " + masterUrl);
createSparkSession(masterUrl);
WebServer webServer = new WebServer();
webServer.start(vertx);
}
private void start(Vertx vertx) {
int port = 19090;
Router router = Router.router(vertx);
router.route("/word_count*").handler(BodyHandler.create());
router.post("/word_count").handler(this::calWordFrequency);
vertx.createHttpServer()
.requestHandler(router::accept)
.listen(port,
result -> {
if (result.succeeded()) {
System.out.println("Server started # " + port);
} else {
System.out.println("Server failed to start # " + port);
}
});
}
private void calWordFrequency(RoutingContext routingContext) {
WordCountRequest wordCountRequest = routingContext.getBodyAsJson().mapTo(WordCountRequest.class);
List<String> words = wordCountRequest.getWords();
Dataset<String> wordsDataset = spark.createDataset(words, Encoders.STRING());
Dataset<Row> wordCounts = wordsDataset.groupBy("value").count();
List<String> result = wordCounts.toJSON().collectAsList();
routingContext.response().setStatusCode(300).putHeader("content-type", "application/json; charset=utf-8")
.end(Json.encodePrettily(result));
}
}
When I make a post request with payload size of about 5kb, then it is taking around 6 seconds to complete the request and the response back. I feel it is very slow.
However, if I carry a simple example of reading file from Hbase and performing transformation and displaying result, it is very fast. I am able to processes a file of 8Gb file in 2mins.
Eg:
logFile="/spark-logs/single-word-line.less.txt"
master_node = 'spark://x.x.x.x:7077'
spark = SparkSession.builder.master(master_node).appName('Avi-load-test').getOrCreate()
log_data = spark.read.text(logFile)
word_count = (log_data.groupBy('value').count())
print(word_count.show())
What is the reason for my application to run so slow? Any pointers would be really helpful. Thank you in advance.
Spark processing is asynchronous, you are using it as part of a synchronous flow. You can do that but can't expect the processing to be finished. We have implemented similar use case - we have a REST service which triggers a spark job. The implementation does return spark job id(takes some time). And we have another end point to get job status using the Job Id, but we didn't implement a flow to return data from spark job through REST srvc and it is not recommended.
Our flow
REST API <-> Spark Job <-> HBase/Kafka
The REST endpoint triggers a Spark Job and the Spark job reads data from HBase and does the processing and write data back to HBase and Kafka. The REST API is called by a different and they receive the data that is processed through Kafka.
I think you need to rethink your architecture.

How to save Spark Stream data to file

I'm new to Spark and currently battling a problem related to save the result of a Spark Stream to file after Context time. So the issue is: I want a query to run for 60seconds and save all the input it reads in that time to a file and also be able to define the file name for future processing.
Initially I thought the code below would be the way to go:
sc.socketTextStream("localhost", 12345)
.foreachRDD(rdd -> {
rdd.saveAsTextFile("./test");
});
However, after running, I realized that not only it saved a different file for each input read - (imagine that I have random numbers generating at random pace at that port), so if in one second it read 1 the file would contain 1 number, but if it read more the file would have them, instead of writing just one file with all the values from that 60s timeframe - but also I wasn't able to name the file, since the argument inside saveAsTextFile was the desired directory.
So I would like to ask if there is any spark native solution so I don't have to solve it by "java tricks", like this:
sc.socketTextStream("localhost", 12345)
.foreachRDD(rdd -> {
PrintWriter out = new PrintWriter("./logs/votes["+dtf.format(LocalDateTime.now().minusMinutes(2))+","+dtf.format(LocalDateTime.now())+"].txt");
List<String> l = rdd.collect();
for(String voto: l)
out.println(voto + " "+dtf.format(LocalDateTime.now()));
out.close();
});
I searched the spark documentation of similar problems but was unable to find a solution :/
Thanks for your time :)
Below is the template to consume socket stream data using new Spark APIs.
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
object ReadSocket {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
//Start reading from socket
val dfStream = spark.readStream
.format("socket")
.option("host","127.0.0.1") // Replace it your socket host
.option("port","9090")
.load()
dfStream.writeStream
.trigger(Trigger.ProcessingTime("1 minute")) // Will trigger 1 minute
.outputMode(OutputMode.Append) // Batch will processed for the data arrived in last 1 minute
.foreachBatch((ds,id) => { //
ds.foreach(row => { // Iterdate your data set
//Put around your File generation logic
println(row.getString(0)) // Thats your record
})
}).start().awaitTermination()
}
}
For code explanation please read read the code inline comments
Java Version
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
public class ReadSocketJ {
public static void main(String[] args) throws StreamingQueryException {
SparkSession spark = Constant.getSparkSess();
Dataset<Row> lines = spark
.readStream()
.format("socket")
.option("host", "127.0.0.1") // Replace it your socket host
.option("port", "9090")
.load();
lines.writeStream()
.trigger(Trigger.ProcessingTime("5 seconds"))
.foreachBatch((VoidFunction2<Dataset<Row>, Long>) (v1, v2) -> {
v1.as(Encoders.STRING())
.collectAsList().forEach(System.out::println);
}).start().awaitTermination();
}
}

Writing to DB2 database gets stuck inside foreachpartition() from my Java code on Spark Cluster with 20 worker nodes configured

I am trying to read data from a file uploaded on a server path, do some manipulation and then save this data to DB2 database. We have around 300K records which may increase further in future so we are trying to do all the manipulation as well write to DB2 inside foreachpartition. Below are the steps followed to do so.
Create spark context as global and static.
static SparkContext sparkContext = new SparkContext();
static JavaSparkContext jc = sparkContext.getJavaSparkContext();
static SparkSession sc = sparkContext.getSparkSession();
Create a dataset of file present on server
Dataset<Row> dataframe = sparkContext.getSparkSession().read().option("delimiter","|").option("header","false").option("inferSchema","false").schema(SchemaClass.generateSchema()).csv(filePath).withColumn("ID",monotonically_increasing_id())).withColumn("Partition_Id",spark_partition_id());
Calling foreachpartition
dataframe.foreachpartition(new ForeachPartitionFunction<Row>()){
#Override
public void call(Iterator<Row> _row) throws Exception{
List<SparkPositionDto> li = new ArrayList<>();
while(_row.hasNext()){
PositionDto positionDto = AnotherClass.method(row,1);
SparkPositionDto spd = copyToSparkDto(positionDto);
if(spd != null){
li.add(spd);
}
}
System.out.println("Writing via Spark : List size : "+li.size());
JavaRDD<SparkPositionDto> finalRdd = jc.parallelize(li);
Dataset<Row> dfToWrite = sc.createDataFrame(finalRDD, SparkPositionDto.class);
System.out.println("Writing Data");
if(dfToWrite != null){
dfToWrite.write()
.format("jdbc")
.option("url","jdbc:msdb2://"+"Database_Name"+";useKerberos=true")
.option("driver","DRIVER_NAME")
.option("dbtable","TABLE_NAME")
.mode(SaveMode.Append)
.save();
}
}
}
The weird observation is that when i run this code outside foreachpartition for a small set of data, it works fine and in my spark cluster just 1 driver and 1 application runs, but when the same code is running inside foreachpartition, I could see 1 driver and 2 applications running with 1 app in running state and other in waiting. If I add numberOfPartitions as 5 in my schema then 5 applications can be seen running. It is running continuously, nothing in logs, seems it got stuck somewhere.

What is the best way to restart spark streaming application?

I basically want to write an event callback in my driver program which will restart the spark streaming application on arrival of that event.
My driver program is setting up the streams and the execution logic by reading configurations from a file.
Whenever the file is changed (new configs added) the driver program has to do the following steps in a sequence,
Restart,
Read the config file (as part of the main method) and
Set up the streams
What is the best way to achieve this?
In some cases you may want to reload streaming context dynamically (for example to reloading of streaming operations).
In that cases you may (Scala example):
val sparkContext = new SparkContext()
val stopEvent = false
var streamingContext = Option.empty[StreamingContext]
val shouldReload = false
val processThread = new Thread {
override def run(): Unit = {
while (!stopEvent){
if (streamingContext.isEmpty) {
// new context
streamingContext = Option(new StreamingContext(sparkContext, Seconds(1)))
// create DStreams
val lines = streamingContext.socketTextStream(...)
// your transformations and actions
// and decision to reload streaming context
// ...
streamingContext.get.start()
} else {
if (shouldReload) {
streamingContext.get.stop(stopSparkContext = false, stopGracefully = true)
streamingContext.get.awaitTermination()
streamingContext = Option.empty[StreamingContext]
} else {
Thread.sleep(1000)
}
}
}
streamingContext.get.stop(stopSparkContext =true, stopGracefully = true)
streamingContext.get.awaitTermination()
}
}
// and start it in separate thread
processThread.start()
processThread.join()
or in python:
spark_context = SparkContext()
stop_event = Event()
spark_streaming_context = None
should_reload = False
def process(self):
while not stop_event.is_set():
if spark_streaming_context is None:
# new context
spark_streaming_context = StreamingContext(spark_context, 0.5)
# create DStreams
lines = spark_streaming_context.socketTextStream(...)
# your transformations and actions
# and decision to reload streaming context
# ...
self.spark_streaming_context.start()
else:
# TODO move to config
if should_reload:
spark_streaming_context.stop(stopSparkContext=False, stopGraceFully=True)
spark_streaming_context.awaitTermination()
spark_streaming_context = None
else:
time.sleep(1)
else:
self.spark_streaming_context.stop(stopGraceFully=True)
self.spark_streaming_context.awaitTermination()
# and start it in separate thread
process_thread = threading.Thread(target=process)
process_thread.start()
process_thread.join()
If you want to prevent you code from crashes and restart streaming context from the last place use checkpointing mechanism.
It allow you to restore your job state after failure.
The best way to Restart the Spark is actually according to your environment.But it is always suggestible to use spark-submit console.
You can background the spark-submit process like any other linux process, by putting it into the background in the shell. In your case, the spark-submit job actually then runs the driver on YARN, so, it's baby-sitting a process that's already running asynchronously on another machine via YARN.
Cloudera blog
One way that we explored recently (in a spark meetup here) was to achieve this by using Zookeeper in Tandem with Spark. This in a nutshell uses Apache Curator to watch for changes on Zookeeper (changes in config of ZK this can be triggered by your external event) that then causes a listener to restart.
The referenced code base is here , you will find that a change in config causes the Watcher (a spark streaming app) to reboot after a graceful shutdown and reload changes. Hope this is a pointer!
I am currently solving this issue as follows,
Listen to external events by subscribing to a MQTT topic
In the MQTT callback, stop the streaming context ssc.stop(true,true) which will gracefully shutdown the streams and underlying
spark config
Start the spark application again by creating a spark conf and
setting up the streams by reading the config file
// Contents of startSparkApplication() method
sparkConf = new SparkConf().setAppName("SparkAppName")
ssc = new StreamingContext(sparkConf, Seconds(1))
val myStream = MQTTUtils.createStream(ssc,...) //provide other options
myStream.print()
ssc.start()
The application is built as Spring boot application
In Scala, stopping sparkStreamingContext may involve stopping SparkContext. I have found that when a receiver hangs, it is best to restart the SparkCintext and the SparkStreamingContext.
I am sure the code below can be written much more elegantly, but it allows for the restarting of SparkContext and SparkStreamingContext programatically. Once this is done, you can restart your receivers programatically as well.
package coname.utilobjects
import com.typesafe.config.ConfigFactory
import grizzled.slf4j.Logging
import coname.conameMLException
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
object SparkConfProviderWithStreaming extends Logging
{
val sparkVariables: mutable.HashMap[String, Any] = new mutable.HashMap
}
trait SparkConfProviderWithStreaming extends Logging{
private val keySSC = "SSC"
private val keyConf = "conf"
private val keySparkSession = "spark"
lazy val packagesversion=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.packagesversion")
lazy val sparkcassandraconnectionhost=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparkcassandraconnectionhost")
lazy val sparkdrivermaxResultSize=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparkdrivermaxResultSize")
lazy val sparknetworktimeout=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparknetworktimeout")
#throws(classOf[conameMLException])
def intitializeSpark(): Unit =
{
getSparkConf()
getSparkStreamingContext()
getSparkSession()
}
#throws(classOf[conameMLException])
def getSparkConf(): SparkConf = {
try {
if (!SparkConfProviderWithStreaming.sparkVariables.get(keyConf).isDefined) {
logger.info("\n\nLoading new conf\n\n")
val conf = new SparkConf().setMaster("local[4]").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
SparkConfProviderWithStreaming.sparkVariables.put(keyConf, conf)
logger.info("Loaded new conf")
getSparkConf()
}
else {
logger.info("Returning initialized conf")
SparkConfProviderWithStreaming.sparkVariables.get(keyConf).get.asInstanceOf[SparkConf]
}
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
#throws(classOf[conameMLException])
def killSparkStreamingContext
{
try
{
if(SparkConfProviderWithStreaming.sparkVariables.get(keySSC).isDefined)
{
SparkConfProviderWithStreaming.sparkVariables -= keySSC
SparkConfProviderWithStreaming.sparkVariables -= keyConf
}
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
#throws(classOf[conameMLException])
def getSparkStreamingContext(): StreamingContext = {
try {
if (!SparkConfProviderWithStreaming.sparkVariables.get(keySSC).isDefined) {
logger.info("\n\nLoading new streaming\n\n")
SparkConfProviderWithStreaming.sparkVariables.put(keySSC, new StreamingContext(getSparkConf(), Seconds(6)))
logger.info("Loaded streaming")
getSparkStreamingContext()
}
else {
SparkConfProviderWithStreaming.sparkVariables.get(keySSC).get.asInstanceOf[StreamingContext]
}
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
def getSparkSession():SparkSession=
{
if(!SparkSession.getActiveSession.isDefined)
{
SparkSession.builder.config(getSparkConf()).getOrCreate()
}
else
{
SparkSession.getActiveSession.get
}
}
}

Mid-Stream Changing Configuration with Check-Pointed Spark Stream

I have a Spark streaming / DStream app like this:
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...
// Start the context
context.start()
context.awaitTermination()
Where my context uses a configuration file where I can pull items with methods like appConf.getString. So I actually use:
val context = StreamingContext.getOrCreate(
appConf.getString("spark.checkpointDirectory"),
() => createStreamContext(sparkConf, appConf))
where val sparkConf = new SparkConf()....
If I stop my app and change configuration in the app file, these changes are not picked up unless I delete the checkpoint directory contents. For example, I would like to change spark.streaming.kafka.maxRatePerPartition or spark.windowDurationSecs dynamically. (EDIT: I kill the app, change the configuration file and then restart the app.) How can I do dynamically change these settings or enforce a (EDITED WORD) configuration change without trashing my checkpoint directory (which is about to include checkpoints for state info)?
How can I do dynamically change these settings or enforce a configuration change without trashing my checkpoint directory?
If dive into the code for StreamingContext.getOrCreate:
def getOrCreate(
checkpointPath: String,
creatingFunc: () => StreamingContext,
hadoopConf: Configuration = SparkHadoopUtil.get.conf,
createOnError: Boolean = false
): StreamingContext = {
val checkpointOption = CheckpointReader.read(
checkpointPath, new SparkConf(), hadoopConf, createOnError)
checkpointOption.map(new StreamingContext(null, _, null)).getOrElse(creatingFunc())
}
You can see that if CheckpointReader has checkpointed data in the class path, it uses new SparkConf() as a parameter, as the overload doesn't allow for passing of a custom created SparkConf. By default, SparkConf will load any settings declared either as an environment variable or passed to the classpath:
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
import SparkConf._
/** Create a SparkConf that loads defaults from system properties and the classpath */
def this() = this(true)
So one way of achieving what you want is instead of creating a SparkConf object in the code, you can pass the parameters via spark.driver.extraClassPath and spark.executor.extraClassPath to spark-submit.
Do you create your Streaming Context the way the docs suggest, by using StreamingContext.getOrCreate, which takes a previous checkpointDirectory as an argument?
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...
// Start the context
context.start()
context.awaitTermination()
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
It can not be done adding/updating spark configurations when you are restoring from checkpoint directory. You can find spark checkpointing behaviour in documentation:
When the program is being started for the first time, it will create a new StreamingContext, set up all the streams and then call start().
When the program is being restarted after failure, it will re-create a StreamingContext from the checkpoint data in the checkpoint directory
So if you use checkpoint directory then on restart of job it will re-create a StreamingContext from checkpoint data which will have old sparkConf.

Resources