Spark partition issue with Stanford NLP - apache-spark

I am trying to process millions of data with spark/scala integrated with stanford NLP (3.4.1).
Since I am using social media data I have to use NLP for the themes generation (pos tagging) and Sentiment calculation.
I have to deal with Twitter data and NON Twitter data separately.So I have two classes that deal with Twitter/Non Twitter
I am using lasy val initialization from each class for loading the stanfordNLP
features: Seq[String] = Seq("tokenize","ssplit","pos","parse","sentiment")
val props = new Properties()
props.put("annotators", features.mkString(", "))
props.put("pos.model", "tagger/gate-EN-twitter.model")
props.put("parse.model", "tagger/englishSR.ser.gz");
val pipeline = new StanfordCoreNLP(props)
Note: For above Twitter I am using different pos model and shift reduce parse model for parsing. The reason I use shift reduce parser is for some of the junk
data at rum time the default PCFG model takes lot of time for processing and getting some Exception.Shift reduce parser will take around 15 seconds at load time and its faster at run time while processing the
data.
NonTwitter class
features: Seq[String] = Seq("tokenize","ssplit","pos","parse","sentiment")
val props = new Properties()
props.put("annotators", features.mkString(", "))
props.put("parse.model", "tagger/englishSR.ser.gz");
Here I am using the default pos model and shift reduce parser
Problem:
Currently we am running with 8 Nodes with 6 cores and I can run with 48 partition. For to process millions of data
with the above configuration with lesser partition it works fine for me.
8 Nodes and 6 cores we have almost 48 partition and if I ran with 42 Number of partition it takes around 1 hr to finish the processing.
with the current configuration I need to scale it to 200 partition
8 Nodes and 6 cores we have almost 48 partition and if we ran with 200 Number of partition it takes around 2 hr and finally throwing some exception saying the one node is lost
or java.lang.IllegalArgumentException: annotator "sentiment" requires annotator "binarized_trees" etc etc.
The problem is only if we scale up the number of partition to 200 with 8 Nodes and 6 cores which we have only 48 cores.
I have the suspect that its because of loading the shift reduce parser loading at each partition. I thought of loading this class at one time and then do the Broadcast but standforndNLP class is not searializable so I cannot broadcast.
The reason we need to scale to 200 partition is it will run quickly with lesser time to process this data.

I don't understand the details of your message with apparently different results with different numbers of workers/partitions, but what you want to do is load the appropriate CoreNLP once on each worker. (And you can have two different CoreNLP pipelines loaded in one program.)
Have you tried the approach that Evan Sparks describes on this thread? http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-td19654.html

Related

Does Spark configs spark.streaming.receiver.maxRate has any effect in a Kafka Beam pipeline

I was wondering if somebody has any experience with rate limiting in Beam KafkaIO component when the runner is a SparkRunner. The versions I am using are:Beam 2.29, Spark 3.2.0 and Kafka client 2.5.0?
I have the Beam parameter maxRecordsPerBatch set to a large number, 100000000. But even when the pipeline stops for 45 minutes, this value is never hit. But when there is a high burst of data above the normal, the Kafka lag increases till it eventually catches up. In the SparkUI I see that parameter batchIntervalMillis=300000 (5 min) is not reached, batches take a maximum of 3 min. It looks like the KafkaIO stops reading at some point, even when the lag is very large. My Kafka parameters --fetchMaxWaitMs=1000
--maxPollRecords=5000 should be able to bring plenty of data. Specially because KafkaIO creates one consumer per partition. In my system there are multiple topics with a total of 992 partitions and my spark.default.parallelism=600. Some partitions have very little data, while others have a large number. Topics are per region and when a region goes down the data is sent through another region/topic. That is when the lag happens.
Does the configuration values for spark.streaming.receiver.maxRate and spark.streaming.receiver.maxRatePerPartition plus spark.streaming.backpressure.enabled play any role at all?
For what I have seen, it looks like Beam controls the whole reading from Kafka with the operator KafkaIO. This component creates its own consumers, therefore the rate of the consumer can only be set by using consumer configs which include fetchMaxWaitMs and maxPollRecords.
The only way those Spark parameters could have any effect if in the rest of the pipeline after the IO source. But I am not sure.
So I finally figure out how it all works. First I want to state that the Spark configuration values: spark.streaming.receiver.maxRate, spark.streaming.receiver.maxRatePerPartition, spark.streaming.backpressure.enabled do not play a factor in Beam because they only work if you are using the source operators from Spark to read from Kafka. Since Beam has its own operator KafkaIO they do not play a role.
So Beam has a set of parameters defined in the class SparkPipelineOptions that are used in the SparkRunner to setup reading from Kafka. Those parameters are:
#Description("Minimum time to spend on read, for each micro-batch.")
#Default.Long(200)
Long getMinReadTimeMillis();
#Description(
"A value between 0-1 to describe the percentage of a micro-batch dedicated to reading from UnboundedSource.")
#Default.Double(0.1)
Double getReadTimePercentage();
Beam create a SourceDStream object that it will pass to spark to use as a source to read from Kafka. In this class the method boundReadDuration returns the result of calculating the larger of two reading values:
proportionalDuration and lowerBoundDuration. The first one is calculated by multiplying BatchIntervalMillis from readTimePercentage. The second is just the value in mills from minReadTimeMillis. Below is the code from SourceDStream. The time duration returned from this function will be used to read from Kafka alone the rest of the time will be allocated to the other tasks in the pipeline.
Last but no least the following parameter also control how many records are process during a batch maxRecordsPerBatch. The pipeline would not process more than those records in a single batch.
private Duration boundReadDuration(double readTimePercentage, long minReadTimeMillis) {
long batchDurationMillis = ssc().graph().batchDuration().milliseconds();
Duration proportionalDuration = new Duration(Math.round(batchDurationMillis * readTimePercentage));
Duration lowerBoundDuration = new Duration(minReadTimeMillis);
Duration readDuration = proportionalDuration.isLongerThan(lowerBoundDuration) ? proportionalDuration: lowerBoundDuration;
LOG.info("Read duration set to: " + readDuration);
return readDuration;
}

Spark window function on dataframe with large number of columns

I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~ 500 (500 features in ML parlance). The total number of rows in the dataset is ~ 160 millions.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function(fillna style form python pandas), where each empty feature value gets set with the previously available value for that column, per Id and Date.
I am trying to achieve this with the following spark 2.2.1 code:
val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) //first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
originalDF.withColumn(columnToFill, coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on a 4 m4.large instances on Amazon EMR, with spark 2.2.1. and dynamic allocation enabled.
The job runs for over 2h without completing.
Am I doing something wrong, at the code level? Given the size of the data, and the instances, I would assume it should finish in a reasonable amount of time? And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill
threshold of
4096 rows, switching to
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is theresome other setting I should know about? Those 2 lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the fold left logic without using withColumn calls. Apparently they can be very slow for large number of columns, and I was also getting stackoverflow errors because of that.
I would be curious to know why this massive difference - and what exactly happens behind the scenes with the query plan execution, which makes repeated withColumns calls so slow.
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)

Where are my extra Spark tasks coming from

I have a Spark program that is training several ML algorithms. The code that generates the final stage of my job looks like this (in Kotlin):
val runConfigs = buildOptionsCrossProduct(opts)
log.info("Will run {} different configurations.", runConfigs.size)
val runConfigsRdd: JavaRDD<RunConfiguration> = sc.parallelize(runConfigs)
// Create an RDD mapping window size to the score for that window size.
val accuracyRdd = runConfigsRdd.mapToPair { runConfig: RunConfiguration ->
runSingleOptionSet(runConfig, opts, trainingBroadcast, validBroadcast) }
accuracyRdd.saveAsTextFile(opts.output)
runConfigs is a list of 18 items. The log line right after the configs are generated shows:
17/02/06 19:23:20 INFO SparkJob: Will run 18 different configurations.
So I'd expect at most 18 tasks as there should be at most one task per stage per partition (at least that's my understanding). However, the History server reports 80 tasks most of which finish very quickly and, not surprisingly, produce no output:
There are in fact 80 output files generated with all but 18 of them being empty. My question is, what are the other 80 - 18 = 62 tasks in this stage doing and why did they get generated?
You use SparkContext.parallelize without providing numSlices argument so Spark is using defaultParallelism which is probably 80. In general parallelize tries to spread data uniformly between partitions but it doesn't remove empty ones so if you want to avoid executing empty task you should set numSlices to a number smaller or equal to runConfigs.size.

Bad performance with window function in streaming job

I use Spark 2.0.2, Kafka 0.10.1 and the spark-streaming-kafka-0-8 integration. I want to do the following:
I extract features in a streaming job out of NetFlow connections and than apply the records to a k-means model. Some of the features are simple ones which are calculated directly from the record. But I also have more complex features which depend on records from a specified time window before. They count how many connections in the last second were to the same host or service as the current one. I decided to use the SQL window functions for this.
So I build window specifications:
val hostCountWindow = Window.partitionBy("plainrecord.ip_dst").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
val serviceCountWindow = Window.partitionBy("service").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
And a function which is called to extract this features on every batch:
def extractTrafficFeatures(dataset: Dataset[Row]) = {
dataset
.withColumn("host_count", count(dataset("plainrecord.ip_dst")).over(hostCountWindow))
.withColumn("srv_count", count(dataset("service")).over(serviceCountWindow))
}
And use this function as follows
stream.map(...).map(...).foreachRDD { rdd =>
val dataframe = rdd.toDF(featureHeaders: _*).transform(extractTrafficFeatures(_))
...
}
The problem is that this has a very bad performance. A batch needs between 1 and 3 seconds for a average input rate of less than 100 records per second. I guess it comes from the partitioning, which produces a lot of shuffling?
I tried to use the RDD API and countByValueAndWindow(). This seems to be much faster, but the code looks way nicer and cleaner with the DataFrame API.
Is there a better way to calculate these features on the streaming data? Or am I doing something wrong here?
Relatively low performance is to be expected here. Your code has to shuffle and sort data twice, once for:
Window
.partitionBy("plainrecord.ip_dst")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
and once for:
Window
.partitionBy("service")
.orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
This will have a huge impact on the runtime and if these are the hard requirements you won't be able to do much better.

Spark cores & tasks concurrency

I've a very basic question about spark. I usually run spark jobs using 50 cores. While viewing the job progress, most of the times it shows 50 processes running in parallel (as it is supposed to do), but sometimes it shows only 2 or 4 spark processes running in parallel. Like this:
[Stage 8:================================> (297 + 2) / 500]
The RDD's being processed are repartitioned on more than 100 partitions. So that shouldn't be an issue.
I have an observations though. I've seen the pattern that most of the time it happens, the data locality in SparkUI shows NODE_LOCAL, while other times when all 50 processes are running, some of the processes show RACK_LOCAL.
This makes me doubt that, maybe this happens because the data is cached before processing in the same node to avoid network overhead, and this slows down the further processing.
If this is the case, what's the way to avoid it. And if this isn't the case, what's going on here?
After a week or more of struggling with the issue, I think I've found what was causing the problem.
If you are struggling with the same issue, the good point to start would be to check if the Spark instance is configured fine. There is a great cloudera blog post about it.
However, if the problem isn't with configuration (as was the case with me), then the problem is somewhere within your code. The issue is that sometimes due to different reasons (skewed joins, uneven partitions in data sources etc) the RDD you are working on gets a lot of data on 2-3 partitions and the rest of the partitions have very few data.
In order to reduce the data shuffle across the network, Spark tries that each executor processes the data residing locally on that node. So, 2-3 executors are working for a long time, and the rest of the executors are just done with the data in few milliseconds. That's why I was experiencing the issue I described in the question above.
The way to debug this problem is to first of all check the partition sizes of your RDD. If one or few partitions are very big in comparison to others, then the next step would be to find the records in the large partitions, so that you could know, especially in the case of skewed joins, that what key is getting skewed. I've wrote a small function to debug this:
from itertools import islice
def check_skewness(df):
sampled_rdd = df.sample(False,0.01).rdd.cache() # Taking just 1% sample for fast processing
l = sampled_rdd.mapPartitionsWithIndex(lambda x,it: [(x,sum(1 for _ in it))]).collect()
max_part = max(l,key=lambda item:item[1])
min_part = min(l,key=lambda item:item[1])
if max_part[1]/min_part[1] > 5: #if difference is greater than 5 times
print 'Partitions Skewed: Largest Partition',max_part,'Smallest Partition',min_part,'\nSample Content of the largest Partition: \n'
print (sampled_rdd.mapPartitionsWithIndex(lambda i, it: islice(it, 0, 5) if i == max_part[0] else []).take(5))
else:
print 'No Skewness: Largest Partition',max_part,'Smallest Partition',min_part
It gives me the smallest and largest partition size, and if the difference between these two is more than 5 times, it prints 5 elements of the largest partition, to should give you a rough idea on what's going on.
Once you have figured out that the problem is skewed partition, you can find a way to get rid of that skewed key, or you can re-partition your dataframe, which will force it to get equally distributed, and you'll see now all the executors will be working for equal time and you'll see far less dreaded OOM errors and processing will be significantly fast too.
These are just my two cents as a Spark novice, I hope Spark experts can add some more to this issue, as I think a lot of newbies in Spark world face similar kind of problems far too often.

Resources