View number of nodes used in Hive queries - Azure

I need to view the number of nodes used in my HDInsight cluster while running Hive queries. How can I view this while my queries are running? I know the Ambari view provides this, but where can I get the exact number of nodes and the storage used? Thanks

After you run the job, review the current JobTracker log and you may see entries like this -
2014-01-23 20:14:59,136 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201401221948_0006 = 1395667. Number of splits = 7
2014-01-23 20:14:59,137 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201401221948_0006_m_000000 has split on node:/fd0/ud0/localhost
2014-01-23 20:14:59,137 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201401221948_0006_m_000001 has split on node:/fd0/ud0/localhost
......
If you see Number of splits = 1, there will be one map task and you know that only one node will be utilized.
When Number of splits > 1, you will see a map task created for each split, with TaskTracker node info like this -
2014-01-23 20:14:59,153 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201401221948_0006_m_000008_0' to tip task_201401221948_0006_m_000008, for tracker 'tracker_workernode7:127.0.0.1/127.0.0.1:49200'
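If you want an exact count instead of eyeballing the log, the tracker names in those "Adding task ... for tracker" lines can be de-duplicated. Below is a minimal, hypothetical sketch in plain Java (it assumes you have copied the JobTracker log to a local file and takes the path as the first argument; the regex is based only on the sample lines above):
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class CountJobNodes {
    public static void main(String[] args) throws Exception {
        // Matches the tracker name in lines like:
        //   ... for tracker 'tracker_workernode7:127.0.0.1/127.0.0.1:49200'
        Pattern tracker = Pattern.compile("for tracker 'tracker_([^:']+)");
        Set<String> nodes = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = tracker.matcher(line);
            if (m.find()) {
                nodes.add(m.group(1)); // e.g. workernode7
            }
        }
        System.out.println("Distinct worker nodes used: " + nodes.size() + " " + nodes);
    }
}
Each distinct tracker corresponds to one worker node that actually ran a task for the job, which is the number you are after; the storage used still has to come from Ambari or your storage account metrics.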

Related

Hazelcast Jet Cluster Processes duplicates

I have deployed 3 spring boot apps with Hazelcast Jet embedded. The nodes recognize each other and run as a cluster. I have the following code: A simple reading from CSV and write to a file. But Jet writes duplicates to the file sink. To be precise, Jet processes total entries in the CSV multiplied by the number of nodes. So, if I have 10 entries in the source and 3 nodes, I see 3 files in the sink each having all 10 entries. I just want to process one record once and only once. Following is my code:
Pipeline p = Pipeline.create();
BatchSource<List<String>> source = Sources.filesBuilder("files")
        .glob("*.csv")
        .build(path -> Files.lines(path).skip(1).map(line -> split(line)));
p.readFrom(source)
        .writeTo(Sinks.filesBuilder("out").build());
instance.newJob(p).join();
If it's a shared file system, then the sharedFileSystem attribute in FileSourceBuilder must be set to true.
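For reference, a sketch of how that flag fits into the builder from the question (same hypothetical split helper as in your code; only the sharedFileSystem call is new):
BatchSource<List<String>> source = Sources.filesBuilder("files")
        .glob("*.csv")
        .sharedFileSystem(true) // all members see the same directory, so the files are divided among them instead of read by everyone
        .build(path -> Files.lines(path).skip(1).map(line -> split(line)));
If the directory is not shared and each member only has its own local slice of the data, leave the flag at its default instead.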

Elasticsearch cluster size/architecture

I've been trying to set up an Elasticsearch cluster for processing log data from some 3D printers.
We have more than 850K documents generated each day for 20 machines, and each of them has its own index.
Right now we have 16 months of data, which makes about 410M records to index in each Elasticsearch index.
We are processing the data from CSV files with Spark and writing to an Elasticsearch cluster of 3 machines, each with 16GB of RAM and 16 CPU cores.
But each time we reach about 10-14M docs/index we get a network error.
Job aborted due to stage failure: Task 173 in stage 9.0 failed 4 times, most recent failure: Lost task 173.3 in stage 9.0 (TID 17160, wn21-xxxxxxx.ax.internal.cloudapp.net, executor 3): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[X.X.X.X:9200]]
I'm sure this is not a network error; it's just that Elasticsearch cannot handle more indexing requests.
To solve this, I've tried to tweak many Elasticsearch parameters, such as refresh_interval, to speed up indexing and get rid of the error, but nothing worked. After monitoring the cluster, we think that we should scale it up.
We also tried to tune the Elasticsearch Spark connector, but with no result.
So I'm looking for the right way to choose the cluster size. Are there any guidelines on how to choose your cluster size? Any highlights would be helpful.
NB: we are interested mainly in indexing data, since we have only one or two queries to run on the data to get some metrics.
I would start by trying to split up the indices by month (or even day) and then search across an index pattern. Example: sensor_data_2018.01, sensor_data_2018.02, sensor_data_2018.03 etc. And search with an index pattern of sensor_data_*
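If you are writing from Spark through the elasticsearch-hadoop connector, that monthly split can be done at write time with a dynamic, date-formatted index name. A hedged sketch using the Java API (the field name "timestamp", the document contents and the ES address are made-up placeholders):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import java.util.*;

public class MonthlyIndexWrite {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("monthly-index-write")
                .set("es.nodes", "X.X.X.X:9200"); // your Elasticsearch endpoint
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("timestamp", "2018-01-15T10:00:00"); // hypothetical field used to pick the index
            doc.put("temperature", 201.5);               // hypothetical sensor value
            JavaRDD<Map<String, Object>> docs = sc.parallelize(Collections.singletonList(doc));
            // {timestamp|yyyy.MM} is resolved per document, so records land in
            // sensor_data_2018.01, sensor_data_2018.02, ... and can be queried
            // together with the index pattern sensor_data_*
            JavaEsSpark.saveToEs(docs, "sensor_data_{timestamp|yyyy.MM}/doc");
        }
    }
}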
Some things which will impact what cluster size you need will be:
How many documents
Average size of each document
How many messages/second are being indexed
Disk IO speed
I think your cluster should be good enough to handle that amount of data. We have a cluster with 3 nodes (8CPU / 61GB RAM each), ~670 indices, ~3 billion documents, ~3TB data and have only had indexing problems when the indexing rate exceeds 30,000 documents/second. Even then only the indexing of a few documents will fail and can be successfully retried after a short delay. Our implementation is also very indexing heavy with minimal actual searching.
I would check the Elasticsearch server logs and see if you can find a more detailed error message. Possibly look for RejectedExecutionExceptions. Also check the cluster health and node stats when you start to receive the failures, which might shed some more light on what's occurring. If possible, implement a retry with backoff when failures start to occur to give ES time to catch up with the load.
Hope that helps a bit!
This is a network error saying the data node is lost. Maybe a crash; you can check the Elasticsearch logs to see what's going on.
The most important thing to understand with elasticsearch-hadoop is how the work is parallelized:
1 Spark partition per 1 Elasticsearch shard
The important thing is sharding; this is how you load-balance the work with Elasticsearch. Also, refresh_interval must be > 30 seconds, and you should disable replication while indexing; this is very basic configuration tuning, and I am sure you can find plenty of advice about that in the documentation.
With Spark, you can check in the web UI (port 4040) how the work is split into tasks and partitions, which helps a lot. You can also monitor the network bandwidth between Spark and ES, and the ES node stats.
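To make the "very basic configuration tuning" concrete, here is a hedged sketch of the connector-side write settings that usually matter for heavy indexing (Java API; the values are placeholders to adjust, and refresh_interval / replica count are index settings changed on the Elasticsearch side, not through the connector):
import org.apache.spark.SparkConf;

// Connector-side bulk tuning for elasticsearch-hadoop; values below are illustrative, not recommendations.
SparkConf conf = new SparkConf()
        .setAppName("es-bulk-indexing")
        .set("es.nodes", "X.X.X.X:9200")
        .set("es.batch.size.entries", "5000")     // documents per bulk request
        .set("es.batch.size.bytes", "5mb")        // or cap each bulk request by size
        .set("es.batch.write.retry.count", "6")   // retry rejected bulk requests...
        .set("es.batch.write.retry.wait", "30s"); // ...after waiting, instead of failing the task
Smaller bulk requests plus retries give Elasticsearch room to apply back pressure instead of dropping connections, which is consistent with the rejection-style failures described above.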

Where are my extra Spark tasks coming from?

I have a Spark program that is training several ML algorithms. The code that generates the final stage of my job looks like this (in Kotlin):
val runConfigs = buildOptionsCrossProduct(opts)
log.info("Will run {} different configurations.", runConfigs.size)
val runConfigsRdd: JavaRDD<RunConfiguration> = sc.parallelize(runConfigs)
// Create an RDD mapping window size to the score for that window size.
val accuracyRdd = runConfigsRdd.mapToPair { runConfig: RunConfiguration ->
    runSingleOptionSet(runConfig, opts, trainingBroadcast, validBroadcast) }
accuracyRdd.saveAsTextFile(opts.output)
runConfigs is a list of 18 items. The log line right after the configs are generated shows:
17/02/06 19:23:20 INFO SparkJob: Will run 18 different configurations.
So I'd expect at most 18 tasks, as there should be at most one task per stage per partition (at least that's my understanding). However, the History server reports 80 tasks, most of which finish very quickly and, not surprisingly, produce no output.
There are in fact 80 output files generated, with all but 18 of them being empty. My question is: what are the other 80 - 18 = 62 tasks in this stage doing, and why were they generated?
You use SparkContext.parallelize without providing the numSlices argument, so Spark uses defaultParallelism, which is probably 80. In general, parallelize tries to spread data uniformly between partitions, but it doesn't remove empty ones, so if you want to avoid executing empty tasks you should set numSlices to a number smaller than or equal to runConfigs.size.
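Concretely, that means passing the count explicitly; a sketch with the Java API (the Kotlin call from the question is identical apart from the val and the size property syntax):
// One partition per configuration, so 18 tasks and (with one result per configuration) no empty part files.
JavaRDD<RunConfiguration> runConfigsRdd = sc.parallelize(runConfigs, runConfigs.size());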

How does the Apache Spark scheduler split files into tasks?

At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals; page 17 of his slides shows a stage being split into 4 tasks.
Here I want to know three things about how a stage is split into tasks:
In the example above, it seems that the number of tasks is based on the number of files; am I right?
If I'm right in point 1, and there were just 3 files under the names directory, would it just create 3 tasks?
If I'm right in point 2, what if there is just one very large file? Does this stage become just 1 task? And what if the data is coming from a streaming source?
Thanks a lot; I'm confused about how a stage is split into tasks.
You can configure the # of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
Spark will divide the work into relatively even sizes (*). Large files will be broken down accordingly - you can see the actual number of partitions with:
rdd.partitions.size
So, no, you will not end up with a single worker chugging away for a long time on a single file.
(*) If you have very small files then that may change this processing. But in any case large files will follow this pattern.
The split occurs in two stages:
Firstly, HDFS splits the logical file into 64MB or 128MB blocks when the file is loaded.
Secondly, Spark schedules a map task to process each block.
There is a fairly complex internal scheduling process, as there are three copies of each block stored on three different servers, and, for large logical files, it may not be possible to run all the tasks at once. The way this is handled is one of the major differences between Hadoop distributions.
When all the map tasks have run, the collector, shuffle and reduce tasks can then be run.
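A quick way to see the block-to-partition mapping described above (a sketch with the Java API; the path and the minPartitions value are illustrative, and sc is assumed to be a JavaSparkContext):
// For a splittable format, each HDFS block becomes at least one partition, i.e. one map task.
// minPartitions can only ask Spark to split further; it never merges blocks.
JavaRDD<String> lines = sc.textFile("hdfs:///data/large-file.txt", 8);
System.out.println("partitions = " + lines.getNumPartitions());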
Stage: a new stage gets created when a wide transformation occurs.
Task: tasks get created based on the partitions in a worker.
See the following link for more explanation: How DAG works under the covers in RDD?
Question 1: in the example above, it seems that the number of tasks is based on the number of files; am I right?
Answer: it's not based on the number of files, it's based on your Hadoop blocks (0.gz, 1.gz are blocks of data saved or stored in HDFS).
Question 2: if I'm right in point 1, and there were just 3 files under the names directory, would it just create 3 tasks?
Answer: by default the block size in Hadoop is 64MB, and each block of data is treated as a partition in Spark.
Note: number of partitions = number of tasks, which is why it created 3 tasks.
Question 3: what if there is just one very large file? Does this stage become just 1 task? And what if the data is coming from a streaming data source?
Answer: no, the very large file will be partitioned, and, as I answered for question 2, the number of tasks created is based on the number of partitions.

Index initializer warning during bootstrap

I'm trying to simultaneously add 4 nodes to my current 2-node DC. I have Vnodes turned off as per Datastax suggestion. Right after the major index build in each node, the following warning is printed several times in the logs:
WARN [SolrSecondaryIndex ks.cf index initializer.] 2014-06-20
09:39:59,904 CassandraUtil.java (line 108) Error Operation timed out -
received only 3 responses. on attempt 1 out of 4 with CL QUORUM...
I understand what it means. But why is Cassandra expecting the nodes to fulfill the CL when these nodes are still bootstrapping? More importantly, how does the warning affect the bootstrap? I noticed that the nodes are not doing any index build or streaming anymore; but they also remained in "Active - Joining" state. Is there any chance that they will finish? What should I do?
I'm using DSE 4.0.3. All existing and new nodes in the DC are Search nodes. I pre-computed the tokens using the python program for MurMur3Partitioner.
EDIT:
Although nodetool compactionstats does not show any on-going index build in the nodes, for some reason, I still see a lot of these lines in the logs:
INFO [IndexPool backpressure thread-0] 2014-06-20 12:30:31,346 IndexPool.java (line 472) Throttling at 26 index requests per second with target total queue size at 40
INFO [IndexPool backpressure thread-0] 2014-06-20 12:30:34,169 IndexPool.java (line 428) Back pressure is active with total index queue size 18586 and average processing time 2770
EDIT:
Interestingly, I found the following lines in each node after digging through the log files:
INFO [main] 2014-06-20 09:39:48,588 StorageService.java (line 1036) Bootstrap completed! for the tokens [node token]
INFO [SolrSecondaryIndex ks.cf index initializer.] 2014-06-20 11:32:07,833 AbstractSolrSecondaryIndex.java (line 411) Reindexing 1417116631 commit log updates for core ks.cf
Based from these lines, I feel a lot safer that the bootstrap actually completed and that the nodes are simply re-indexing their data. I don't know, though, why the re-indexing process is not being shown in nodetool compactionstats.
It appears the bootstrap completed, and the DSE Search system is running normally.
why the re-indexing process is not being shown in nodetool compactionstats
DSE Search is not generally exposed via Cassandra command-line tools. The log output should show the indexing as having completed; were you able to verify that?
