Jobs Queue MapR - linux

I'd like to know whether it's possible to get the number of pending jobs in a MapR queue.
I tried the mapred queue -info <job-queue-name> [-showJobs] command, but it doesn't give the result I'm looking for.
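If the CLI output isn't enough, one option may be the Hadoop MapReduce client API that MapR's Hadoop packages also ship. The sketch below counts the jobs in the PREP (submitted but not yet running) state for a queue; the queue name is a placeholder and I haven't verified this against a MapR cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.JobStatus;
    import org.apache.hadoop.mapreduce.QueueInfo;

    public class PendingJobsInQueue {
        public static void main(String[] args) throws Exception {
            // Picks up mapred-site.xml / yarn-site.xml from the classpath.
            Configuration conf = new Configuration();
            Cluster cluster = new Cluster(conf);

            QueueInfo queue = cluster.getQueue("my.queue.name"); // placeholder queue name
            int pending = 0;
            for (JobStatus status : queue.getJobStatuses()) {
                if (status.getState() == JobStatus.State.PREP) { // submitted, not yet running
                    pending++;
                }
            }
            System.out.println("Pending jobs in queue: " + pending);
            cluster.close();
        }
    }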

Related

Possible reasons that spark waits and does not schedule tasks to run?

This might be a very generic question, but I hope someone can offer a hint. I've found that my Spark job sometimes seems to hit a "pause" many times:
The nature of the job is: read ORC files (from a Hive table), filter on certain columns, no joins, then write out to another Hive table.
There were 64K tasks in total for my job/stage (FileScan orc, followed by Filter and Project).
The application has 500 executors, each with 4 cores. Initially, about 2,000 tasks were running concurrently and things looked good.
After a while, I noticed the number of running tasks dropped all the way down to around 100. Many cores/executors were just waiting with nothing to do. (I checked the logs of these waiting executors: there were no errors, all assigned tasks had finished on them, and they were just waiting.)
After about 3-5 minutes, these waiting executors suddenly got tasks assigned and were working happily again.
Any particular reason this could be? The application is run from spark-shell (--master yarn --deploy-mode client, with the number of executors, sizes, etc. specified).
Thanks!
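For reference, a rough sketch of the kind of job described above - read an ORC-backed Hive table, filter on a few columns, project, and write to another table. The table names, column names, and filter predicate are placeholders, and the original was run from spark-shell rather than as a compiled application:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;

    public class OrcFilterCopyJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("orc-filter-copy")
                    .enableHiveSupport()
                    .getOrCreate();

            Dataset<Row> source = spark.table("source_db.events");       // FileScan orc
            Dataset<Row> result = source
                    .filter(col("event_type").equalTo("click"))          // Filter (placeholder predicate)
                    .select("user_id", "event_time", "event_type");      // Project

            result.write().mode(SaveMode.Append).saveAsTable("target_db.events_filtered");
            spark.stop();
        }
    }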

Spark - any way to execute a hook after a worker finishes processing a partition?

I'm no expert in Spark, so my apologies if I'm way off.
We are using Apache Spark to process different sections of a large file simultaneously. We don't need any aggregation of the results. The problem we are facing is that the worker processes records one by one, and we'd like to process them in groups. We can collect them into groups, but the last group will never be processed, as we get no signal from Spark that it is handling the last record. Is there a way to get Spark to call something after processing of a partition is completed, so that we could finish processing the last group?
Or maybe there is a totally different way of approaching this?
We are using Java, should you decide to provide some code examples.
Thanks
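One way to get an "end of partition" hook, sketched below under the assumption that you process an RDD of strings in groups of a fixed size (the group size and the processGroup method are placeholders): foreachPartition hands your code the whole partition as an iterator, so once the iterator is exhausted you know you have seen the last record and can flush the final group.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;

    public class GroupedPartitionProcessing {

        // Hypothetical placeholder for whatever processes one group of records.
        static void processGroup(List<String> group) {
            // ... send the group to the downstream system ...
        }

        public static void process(JavaRDD<String> records) {
            final int groupSize = 100; // assumed batch size

            records.foreachPartition((Iterator<String> it) -> {
                List<String> group = new ArrayList<>(groupSize);
                while (it.hasNext()) {
                    group.add(it.next());
                    if (group.size() == groupSize) {
                        processGroup(group);
                        group = new ArrayList<>(groupSize);
                    }
                }
                // Iterator exhausted: this is the end-of-partition "hook",
                // so the last (possibly partial) group is not lost.
                if (!group.isEmpty()) {
                    processGroup(group);
                }
            });
        }
    }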

Order of messages with Spark Executors

I have a Spark Streaming application which streams data from Kafka. I rely heavily on the order of the messages and hence have just one partition created in the Kafka topic.
I am deploying this job in cluster mode.
My question is: since I am executing this in cluster mode, I can have more than one executor picking up tasks, so will I lose the order of the messages received from Kafka in that case? If not, how does Spark guarantee order?
You wouldn't get distributed processing power with a single partition, so instead use multiple partitions, and I would suggest attaching a sequence number to every message, either a counter or a timestamp.
If you don't have a timestamp within the message, Kafka provides a way to extract the message timestamp, and you can use it to order events by timestamp and then process them in sequence.
Refer to the answer on how to extract a timestamp from a Kafka message.
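As an illustration, with the spark-streaming-kafka-0-10 integration (broker-side timestamps require Kafka 0.10+), each ConsumerRecord already carries the message timestamp. A minimal sketch, with the topic name and Kafka parameters left as placeholders:

    import java.util.Arrays;
    import java.util.Collection;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    import scala.Tuple2;

    public class TimestampedStream {

        // kafkaParams must contain bootstrap.servers, key/value deserializers,
        // group.id, etc. (omitted here).
        public static JavaPairDStream<Long, String> build(
                JavaStreamingContext jssc, Map<String, Object> kafkaParams) {

            Collection<String> topics = Arrays.asList("events"); // placeholder topic

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

            // Key each value by the broker-assigned record timestamp so that
            // downstream stages can order or sequence events by it.
            return stream.mapToPair(record ->
                new Tuple2<>(record.timestamp(), record.value()));
        }
    }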
To maintain order, using a single partition is the right choice; here are a few other things you can try:
Turn off speculative execution:
spark.speculation - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
Adjust your batch interval/size so that each batch can finish processing without any lag (see the sketch below).
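A minimal configuration sketch covering both suggestions; the 10-second batch interval is only an illustrative value:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class OrderedStreamContext {
        public static JavaStreamingContext create() {
            SparkConf conf = new SparkConf()
                    .setAppName("ordered-kafka-stream")
                    // Never re-launch a slow task elsewhere; a speculative copy
                    // could otherwise complete out of order.
                    .set("spark.speculation", "false");

            // Pick a batch interval large enough that each batch finishes
            // before the next one is due (10 seconds is just an example).
            return new JavaStreamingContext(conf, Durations.seconds(10));
        }
    }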
Cheers!

Could FAIR scheduling mode make Spark Streaming jobs that read from different topics run in parallel?

I use Spark 2.1 and Kafka 0.9.
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish.
According to this, if I have multiple jobs from multiple threads in the case of Spark Streaming (one topic per thread), is it possible for multiple topics to run simultaneously if I have enough cores in my cluster, or would it just do a round robin across pools but run only one job at a time?
Context:
I have two topics, T1 and T2, each with 1 partition. I have configured a pool with schedulingMode set to FAIR. I have 4 cores registered with Spark. Each topic has two actions (hence two jobs, so 4 jobs in total across topics). Let's say J1 and J2 are jobs for T1, and J3 and J4 are jobs for topic T2. What Spark does in FAIR mode is execute J1, J3, J2, J4, but at any time only one job is executing. Since each topic has only one partition, only one core is being used and the other 3 are just idle. This is something I don't want.
Is there any way I can avoid this?
If I have multiple jobs from multiple threads...is it possible that multiple topics can run simultaneously
Yes. That's the purpose of FAIR scheduling mode.
As you may have noticed, I removed "Spark Streaming" from your question since it does not contribute in any way to how Spark schedules Spark jobs. It does not really matter whether you start your Spark jobs from a "regular" application or a Spark Streaming one.
Quoting Scheduling Within an Application (highlighting mine):
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
And then comes the quote you used to ask the question, which should now be clearer.
it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources.
So, speaking about Spark Streaming, you'd have to configure FAIR scheduling mode, and Spark Streaming's JobScheduler should then submit Spark jobs per topic in parallel (I haven't tested it out myself, so it's more theory than practice).
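A minimal sketch of that setup, with hypothetical pool names and trivial stand-in jobs: the scheduler mode is set to FAIR, and each topic's jobs are submitted from a dedicated thread that sets spark.scheduler.pool for itself:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FairPoolsFromThreads {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("fair-pools-demo")
                    .set("spark.scheduler.mode", "FAIR");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Stand-ins for the jobs triggered by the T1 and T2 consumers.
            Thread t1Jobs = new Thread(() -> {
                sc.setLocalProperty("spark.scheduler.pool", "pool_t1"); // per-thread property
                sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();      // J1
                sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();      // J2
            });
            Thread t2Jobs = new Thread(() -> {
                sc.setLocalProperty("spark.scheduler.pool", "pool_t2");
                sc.parallelize(Arrays.asList(5, 6, 7, 8)).count();      // J3
                sc.parallelize(Arrays.asList(5, 6, 7, 8)).count();      // J4
            });

            t1Jobs.start();
            t2Jobs.start();
            t1Jobs.join();
            t2Jobs.join();
            sc.stop();
        }
    }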
I think the fair scheduler alone will not help, as it's the Spark Streaming engine that takes care of submitting the Spark jobs, and it normally does so sequentially.
There's an undocumented configuration parameter in Spark Streaming: spark.streaming.concurrentJobs, which is set to 1 by default. It controls the parallelism level of jobs submitted to Spark.
By increasing this value, you may see parallel processing of the different Spark stages of your streaming job.
I would think that by combining this configuration with the fair scheduler in Spark, you would be able to achieve controlled parallel processing of the independent topic consumers. However, this is mostly uncharted territory.
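A sketch of combining the two settings; the value 2 is illustrative and, as noted, spark.streaming.concurrentJobs is undocumented:

    import org.apache.spark.SparkConf;

    public class ConcurrentStreamingConf {
        public static SparkConf build() {
            return new SparkConf()
                    .setAppName("parallel-topic-consumers")
                    .set("spark.scheduler.mode", "FAIR")
                    // Undocumented: lets the streaming JobScheduler run more than
                    // one job at a time (the default is 1).
                    .set("spark.streaming.concurrentJobs", "2");
        }
    }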

New directStream API reads topic's partitions sequentially. Why?

I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
I have a Kafka topic with 8 partitions.
I am running the streaming job on YARN with 8 executors, each with 1 core (--num-executors 8 --executor-cores 1).
I noticed that Spark reads all the topic's partitions in one executor sequentially - this is obviously not what I want.
I want Spark to read all partitions in parallel.
How can I achieve that?
Thank you in advance.
An initial communication to Kafka occurs at job creation, solely to set the offsets of the KafkaRDD - more specifically, the offsets for each partition that makes up the KafkaRDD across the cluster.
Those offsets are then used to fetch the data on each executor once the job is actually executed. Depending on what you noticed, you may have seen that initial communication (from the driver). If you have seen all your jobs executing on the same executor, then something other than just using Kafka would be going wrong.
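To see where the data is actually read, it can help to log the partitioning of each batch. The sketch below uses the spark-streaming-kafka-0-10 API (the older 0.8 createDirectStream maps Kafka partitions to RDD partitions the same way); the topic name and Kafka parameters are placeholders. With 8 Kafka partitions you should see 8 RDD partitions per batch, whose tasks the scheduler can spread across the 8 single-core executors.

    import java.util.Arrays;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class DirectStreamPartitions {
        public static void run(Map<String, Object> kafkaParams) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("direct-stream-partitions");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                    KafkaUtils.createDirectStream(
                            jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(
                                    Arrays.asList("my_topic"), kafkaParams)); // placeholder topic

            stream.foreachRDD(rdd ->
                    // One RDD partition per Kafka partition; with 8 Kafka partitions
                    // this should print 8, and the 8 tasks can run in parallel.
                    System.out.println("Partitions in this batch: " + rdd.getNumPartitions()));

            jssc.start();
            jssc.awaitTermination();
        }
    }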
