slow performance of spark - apache-spark

I am new to Spark. I have a requirement where I need to integrate Spark with a web service. Any request to the web service has to be processed using Spark, and the response sent back to the client.
I have created a small dummy service in Vert.x, which accepts a request and processes it using Spark. I am using Spark in cluster mode (1 master, 2 slaves, 8 cores and 32 GB each, running on top of YARN and HDFS).
public class WebServer {

    private static SparkSession spark;

    private static void createSparkSession(String masterUrl) {
        SparkContext context = new SparkContext(new SparkConf().setAppName("Web Load Test App").setMaster(masterUrl)
                .set("spark.hadoop.fs.default.name", "hdfs://x.x.x.x:9000")
                .set("spark.hadoop.fs.defaultFS", "hdfs://x.x.x.x:9000")
                .set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
                .set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
                .set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName())
                .set("spark.eventLog.enabled", "true")
                .set("spark.eventLog.dir", "hdfs://x.x.x.x:9000/spark-logs")
                .set("spark.history.provider", "org.apache.spark.deploy.history.FsHistoryProvider")
                .set("spark.history.fs.logDirectory", "hdfs://x.x.x.x:9000/spark-logs")
                .set("spark.history.fs.update.interval", "10s")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                //.set("spark.kryoserializer.buffer.max", "1g")
                //.set("spark.kryoserializer.buffer", "512m")
        );
        spark = SparkSession.builder().sparkContext(context).getOrCreate();
    }

    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        String masterUrl = "local[2]";
        if (args.length == 1) {
            masterUrl = args[0];
        }
        System.out.println("Master url: " + masterUrl);
        createSparkSession(masterUrl);
        WebServer webServer = new WebServer();
        webServer.start(vertx);
    }

    private void start(Vertx vertx) {
        int port = 19090;
        Router router = Router.router(vertx);
        router.route("/word_count*").handler(BodyHandler.create());
        router.post("/word_count").handler(this::calWordFrequency);
        vertx.createHttpServer()
                .requestHandler(router::accept)
                .listen(port, result -> {
                    if (result.succeeded()) {
                        System.out.println("Server started # " + port);
                    } else {
                        System.out.println("Server failed to start # " + port);
                    }
                });
    }

    private void calWordFrequency(RoutingContext routingContext) {
        WordCountRequest wordCountRequest = routingContext.getBodyAsJson().mapTo(WordCountRequest.class);
        List<String> words = wordCountRequest.getWords();
        Dataset<String> wordsDataset = spark.createDataset(words, Encoders.STRING());
        Dataset<Row> wordCounts = wordsDataset.groupBy("value").count();
        List<String> result = wordCounts.toJSON().collectAsList();
        routingContext.response().setStatusCode(300).putHeader("content-type", "application/json; charset=utf-8")
                .end(Json.encodePrettily(result));
    }
}
When I make a POST request with a payload of about 5 KB, it takes around 6 seconds to complete the request and return the response. I feel that is very slow.
However, if I run a simple example of reading a file from HBase, performing a transformation, and displaying the result, it is very fast. I am able to process an 8 GB file in 2 minutes.
E.g.:
logFile="/spark-logs/single-word-line.less.txt"
master_node = 'spark://x.x.x.x:7077'
spark = SparkSession.builder.master(master_node).appName('Avi-load-test').getOrCreate()
log_data = spark.read.text(logFile)
word_count = (log_data.groupBy('value').count())
print(word_count.show())
Why does my application run so slowly? Any pointers would be really helpful. Thank you in advance.

Spark processing is asynchronous; you are using it as part of a synchronous flow. You can do that, but you can't expect the processing to be finished within the request. We have implemented a similar use case: a REST service that triggers a Spark job. The implementation returns the Spark job ID (which takes some time), and we have another endpoint to get the job status using that job ID. We deliberately did not implement a flow that returns data from the Spark job through the REST service, and it is not recommended.
Our flow:
REST API <-> Spark Job <-> HBase/Kafka
The REST endpoint triggers a Spark job; the Spark job reads data from HBase, does the processing, and writes the results back to HBase and Kafka. The REST API is called by a different service, which receives the processed data through Kafka.
I think you need to rethink your architecture.
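For illustration only, here is a minimal sketch of that hand-off (not the actual implementation described above; the jar path, main class and argument are placeholders). The HTTP handler launches the Spark job with SparkLauncher and returns just the application id, while the results flow back later through HBase/Kafka:
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object JobTrigger {
  // Hedged sketch: launch the job asynchronously and hand back only its id.
  def triggerJob(inputPath: String): String = {
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/processing-job.jar")   // placeholder jar
      .setMainClass("com.example.ProcessingJob")       // placeholder main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .addAppArgs(inputPath)
      .startApplication()

    // Wait until the cluster manager assigns an application id, then return it to the caller.
    while (handle.getAppId == null && !handle.getState.isFinal) Thread.sleep(100)
    handle.getAppId
  }
}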

Related

Writing to DB2 database gets stuck inside foreachpartition() from my Java code on Spark Cluster with 20 worker nodes configured

I am trying to read data from a file uploaded to a server path, do some manipulation, and then save the data to a DB2 database. We have around 300K records, which may increase further in the future, so we are trying to do all the manipulation as well as the write to DB2 inside foreachPartition. Below are the steps followed.
Create the Spark context as global and static:
static SparkContext sparkContext = new SparkContext();
static JavaSparkContext jc = sparkContext.getJavaSparkContext();
static SparkSession sc = sparkContext.getSparkSession();
Create a Dataset from the file present on the server:
Dataset<Row> dataframe = sparkContext.getSparkSession().read()
    .option("delimiter", "|")
    .option("header", "false")
    .option("inferSchema", "false")
    .schema(SchemaClass.generateSchema())
    .csv(filePath)
    .withColumn("ID", monotonically_increasing_id())
    .withColumn("Partition_Id", spark_partition_id());
Call foreachPartition:
dataframe.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> _row) throws Exception {
        List<SparkPositionDto> li = new ArrayList<>();
        while (_row.hasNext()) {
            PositionDto positionDto = AnotherClass.method(_row.next(), 1);
            SparkPositionDto spd = copyToSparkDto(positionDto);
            if (spd != null) {
                li.add(spd);
            }
        }
        System.out.println("Writing via Spark : List size : " + li.size());
        JavaRDD<SparkPositionDto> finalRdd = jc.parallelize(li);
        Dataset<Row> dfToWrite = sc.createDataFrame(finalRdd, SparkPositionDto.class);
        System.out.println("Writing Data");
        if (dfToWrite != null) {
            dfToWrite.write()
                .format("jdbc")
                .option("url", "jdbc:msdb2://" + "Database_Name" + ";useKerberos=true")
                .option("driver", "DRIVER_NAME")
                .option("dbtable", "TABLE_NAME")
                .mode(SaveMode.Append)
                .save();
        }
    }
});
The weird observation is that when I run this code outside foreachPartition for a small set of data, it works fine and my Spark cluster runs just 1 driver and 1 application. But when the same code runs inside foreachPartition, I can see 1 driver and 2 applications, with one app in the running state and the other waiting. If I set numberOfPartitions to 5, then 5 applications can be seen running. It runs continuously, with nothing in the logs; it seems to be stuck somewhere.

spark 2.2 struct streaming foreach writer jdbc sink lag

I'm in a project using Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. The message flow into Kafka is about 4000-6000 messages per second.
When using the HDFS file system as the sink destination, it works fine. When using the foreach JDBC writer, a huge delay builds up over time. I think the lag is caused by the foreach loop.
The JDBC sink class (standalone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.executeUpdate()
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.commit()
    connection.close
  }
}
The sink part:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "namenode:9092")
  .option("fetch.message.max.bytes", "50000000")
  .option("kafka.max.partition.fetch.bytes", "50000000")
  .option("subscribe", "rawdb.raw_data")
  .option("startingOffsets", "latest")
  .load()
  .select($"value".as[Array[Byte]])
  .map(avroDeserialize(_))
  .filter(some logic).select(some logic)
  .writeStream.format("csv").option("checkpointLocation", "/user/root/chk").option("path", "/user/root/testdir").start()
If I change the last line
.writeStream.format("csv")...
into the JDBC foreach sink as follows:
val url = "jdbc:oracle:thin:#(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user";
val pwd = "password";
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag shows up.
I guess the problem is most likely caused by the foreach loop mechanics: it does not operate in batch mode, dealing with, say, several thousand rows in one batch. As an Oracle DBA as well, I have fine-tuned the Oracle database side; mostly the database is waiting on idle events. Excessive commits are already avoided by setting connection.setAutoCommit(false). Any suggestion would be much appreciated.
Although I don't have an actual profile of what's taking the longest time in your application, I would assume it is due to the fact that using ForeachWriter will effectively close and re-open your JDBC connection on each run, because that's how ForeachWriter works.
I would advise that instead of using it, you write a custom Sink for JDBC where you control how the connection is opened and closed.
There is an open pull request to add a JDBC driver to Spark which you can take a peek at to see a possible approach to the implementation.
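As a rough sketch of the batching idea (this is not the custom Sink from that pull request; it simply reuses the table and columns from the question's JDBCSink), the existing ForeachWriter can accumulate rows with addBatch() and flush them once per partition instead of issuing one executeUpdate() per row:
class BatchedJDBCSink(url: String, user: String, pwd: String)
    extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {

  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  // same insert statement as the question's JDBCSink
  val insertSql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(insertSql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.addBatch()          // accumulate instead of executing per row
  }

  def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      statement.executeBatch()    // one round trip per partition
      connection.commit()
    }
    connection.close()
  }
}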
Problem solved by injecting the result into another Kafka topic, then writing another program that reads from the new topic and writes the rows into the database in batches.
I think in the next Spark release they might provide a JDBC sink with a parameter for setting the batch size.
The main code is as follows:
Write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
Read the topic and write to the database; I'm using a c3p0 connection pool:
lines.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = ConnManager.getManager.getConnection
      val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          insertIntoDB(ps, record)
        })
        ps.executeBatch()
        conn.commit()
      } catch {
        case e: Exception => {}
        // do some log
      } finally {
        ps.close()
        conn.close()
      }
    })
  }
})
Have you tried using a trigger?
I noticed that when I didn't use a trigger, my foreach sink opened and closed the connection to the database several times.
writeStream.foreach(writer).start()
But when I used a trigger, the foreach sink only opened and closed the connection one time, processing for example 200 queries, and when the micro-batch ended it closed the connection until a new micro-batch was received.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so I think Spark is using one partition. I don't know if this solution works the same with multiple Spark partitions, but my conclusion here is that foreach processes the whole micro-batch (row by row) in the process method and doesn't call open() and close() for every row, as a lot of people think.

How to set the Spark streaming receiver frequency?

My requirement is to process the hourly data of a stock market.
i.e., get the data from the source once per streaming interval and process it via a DStream.
I have implemented a custom receiver to scrape/monitor the website by implementing the onStart() and onStop() methods, and it is working.
Challenges encountered:
The receiver thread fetches the data continuously, i.e., multiple times per interval.
Unable to coordinate the receiver and the DStream execution interval.
Options I tried:
Making the receiver thread sleep for a few seconds (equal to the streaming interval).
In this case the data is not the latest data at processing time.
class CustomReceiver(interval: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart() {
    new Thread("Website Scrapper") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
  }

  /** Scrape the website and store data until the receiver is stopped */
  private def receive() {
    println("Entering receive:" + new Date());
    try {
      while (!isStopped) {
        val scriptsLTP = StockMarket.getLiveStockData()
        for ((script, ltp) <- scriptsLTP) {
          store(script + "," + ltp)
        }
        println("sent data")
        System.out.println("going to sleep:" + new Date());
        Thread.sleep(3600 * 1000);
        System.out.println("awaken from sleep:" + new Date());
      }
      println("Stopped receiving")
      restart("Trying to connect again")
    } catch {
      case t: Throwable =>
        restart("Error receiving data", t)
    }
    println("Exiting receive:" + new Date());
  }
}
How can I make the Spark Streaming receiver run in sync with the DStream processing?
This use case doesn't seem like a good fit for Spark Streaming. The interval is long enough to treat this as a regular batch job instead. That way, we can make better use of the cluster resources.
I would rewrite it as a Spark job by parallelizing the target tickers, using mapPartitions to use the executors as distributed web scrapers, and then processing as intended.
Then schedule the Spark job to run each hour, with cron or with more advanced alternatives such as Chronos, at the exact times wanted.
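A minimal sketch of that batch job (the ticker list, the fetchQuote helper, and the output path are hypothetical placeholders):
import org.apache.spark.sql.SparkSession

object HourlyStockScrape {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("hourly-stock-scrape").getOrCreate()

    val tickers = Seq("AAA", "BBB", "CCC")        // placeholder ticker symbols

    val quotes = spark.sparkContext
      .parallelize(tickers, numSlices = 4)        // spread tickers across executors
      .mapPartitions { part =>
        // placeholder scraping helper; replace with the real website call
        def fetchQuote(ticker: String): String = s"$ticker,<last-traded-price>"
        part.map(fetchQuote)
      }

    // one output directory per run; downstream processing continues as intended
    quotes.saveAsTextFile(s"/data/quotes/run-${System.currentTimeMillis()}")
    spark.stop()
  }
}
Scheduling the driver with cron (for example, 0 * * * * spark-submit ...) makes each hourly run scrape exactly once.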

Limit apache spark job running duration

I want to submit the job in a cluster environment with a timeout parameter. Is there a way to make Spark kill a running job if it exceeds the allowed duration?
As of Spark 2.1.0, there is no built-in solution (it would be a very good feature to add!).
You can play with the speculation feature to re-launch long tasks, and with spark.task.maxFailures to kill tasks that are re-launched too many times.
But this is absolutely not clean; Spark is missing a real "circuit breaker" to stop a long-running task (such as the classic newbie SELECT * FROM DB).
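For reference, a hedged sketch of those settings on a SparkConf (the threshold values are illustrative, not recommendations):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // re-launch tasks that look slow
  .set("spark.speculation.interval", "1s")     // how often to check for slow tasks
  .set("spark.speculation.multiplier", "2")    // "slow" means 2x the median task duration
  .set("spark.speculation.quantile", "0.9")    // fraction of tasks that must finish first
  .set("spark.task.maxFailures", "3")          // abort the job after 3 failures of a task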
On the other hand, you could use the Spark web UI REST API:
1) Get running jobs: GET http://SPARK_CLUSTER_PROD/api/v1/applications/application_1502112083252_1942/jobs?status=running
(this gives you an array with a submissionTime field that you can use to find long-running jobs)
2) Kill the job: POST http://SPARK_CLUSTER_PROD/stages/stage/kill/?id=23881&terminate=true for each of the job's stages.
I believe Spark also has a hidden API that you can try to use.
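A minimal sketch of those two calls from Scala (SPARK_CLUSTER_PROD, the application id and the stage id are the placeholders from above; a real watchdog would parse the JSON and pick the stages of jobs whose submissionTime is too old):
import scala.io.Source

object JobWatchdog {
  val base = "http://SPARK_CLUSTER_PROD"

  // 1) list the running jobs of an application
  def runningJobs(appId: String): String =
    Source.fromURL(s"$base/api/v1/applications/$appId/jobs?status=running").mkString

  // 2) kill one stage of a job that has run too long
  def killStage(stageId: Int): Unit = {
    val conn = new java.net.URL(s"$base/stages/stage/kill/?id=$stageId&terminate=true")
      .openConnection().asInstanceOf[java.net.HttpURLConnection]
    conn.setRequestMethod("POST")
    println(s"Kill stage $stageId -> HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}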
You can use the YARN REST API to kill the Spark application from your service. I am using the following code to stop a long-running Spark application. It uses the httpclient library.
def killApplication(applicationId: String): Boolean = {
  val appKillPut = new HttpPut(s"http://xx.xx.xx.xx:8088/ws/v1/cluster/apps/$applicationId/state")
  val json = new JSONObject(Map("state" -> "KILLED"))

  val params = new StringEntity(json.toString(), "UTF-8")
  params.setContentType("application/json")

  appKillPut.addHeader("Content-Type", "application/json")
  appKillPut.addHeader("Accept", "*/*")
  appKillPut.setEntity(params)
  println(s"Request payload ${json.toString}")

  val client: CloseableHttpClient = HttpClientBuilder.create().build()
  val response: CloseableHttpResponse = client.execute(appKillPut)
  val responseBody = EntityUtils.toString(response.getEntity)
  println(s"Response payload ${responseBody}")

  val statusCode: Int = response.getStatusLine.getStatusCode
  if (statusCode == 200 || statusCode == 201 || statusCode == 202) {
    println(s"Successfully stopped the application : ${applicationId}")
    true
  } else {
    false
  }
}
Hope this helps.
Ravi

How to get data from regularly appended log file in Apache Spark?

I have one Apache access log file which has some data and is continuously growing. I want to analyze that data using the Apache Spark Streaming API.
Spark is new to me. I created a program in which I use the jssc.textFileStream(directory) function to get the log data, but it does not work as per my requirement.
Please suggest some approaches to analyze that log file using Spark.
Here is my code.
SparkConf conf = new SparkConf()
    .setMaster("spark://192.168.1.9:7077")
    .setAppName("log streaming")
    .setSparkHome("/usr/local/spark")
    .setJars(new String[] { "target/sparkstreamingdemo-0.0.1.jar" });
StreamingContext ssc = new StreamingContext(conf, new Duration(5000));
DStream<String> filerdd = ssc.textFileStream("/home/user/logs");
filerdd.print();
ssc.start();
ssc.awaitTermination();
This code does not return any data from existing files. It only works when I create a new file, but when I update that new file, the program again does not return the updated data.
If the file is modified in real time, you can use Tailer from Apache Commons IO.
Here is the simplest sample:
public void readLogs(File f, long delay) {
    TailerListener listener = new MyTailerListener();
    Tailer tailer = new Tailer(f, listener, delay);
    // stupid executor impl. for demo purposes
    Executor executor = new Executor() {
        public void execute(Runnable command) {
            command.run();
        }
    };
    executor.execute(tailer);
}

public class MyTailerListener extends TailerListenerAdapter {
    public void handle(String line) {
        System.out.println(line);
    }
}
The code above may be used as a log reader for Apache Flume and applied as a source. Then you need to configure a Flume sink to redirect the collected logs to a Spark stream and apply Spark for analyzing the data from the Flume stream (http://spark.apache.org/docs/latest/streaming-flume-integration.html).
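A minimal sketch of the Spark side of that Flume integration, assuming a push-based Avro sink on localhost:41414 and a 5-second batch interval (both placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeLogStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flume-log-stream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // receive events pushed by the Flume Avro sink and turn each event body into a text line
    val flumeStream = FlumeUtils.createStream(ssc, "localhost", 41414)
    val lines = flumeStream.map(event => new String(event.event.getBody.array()))
    lines.count().map(c => s"Received $c log lines").print()

    ssc.start()
    ssc.awaitTermination()
  }
}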
More details about the Flume setup are in this post: real time log processing using apache spark streaming
