I was trying to run Hadoop in single-node pseudo-distributed mode on an EC2 Ubuntu machine for a simple task. However, my program gets stuck at the running-jobs stage. I have attached my Linux screen and the ResourceManager page here. Any ideas are appreciated. Thanks.
Add 1: The other thing I found is that the NodeManager disappears when I type jps (it was there the first time I typed jps, but disappears later).
Add 2: I checked the NodeManager log and noticed that it was shut down because the minimum allocation was not satisfied, even though I had changed the scheduler minimum allocation to 128 MB and 1 vcore in yarn-site.xml.
Finally solved my own problem above.
As added to the problem statement above, I later found that my NodeManager didn't start properly (the first time I typed jps, the NodeManager was there, as shown in the attached picture; however, it disappeared seconds after that. I found this by typing jps many times).
I checked the NodeManager log and found the error "doesn't satisfy minimum allocations, Sending SHUTDOWN signal to the NodeManager".
I searched on the internet and found that the solution is to add the following properties to the yarn-site.xml file:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>1</value>
</property>
Then restart the cluster with start-dfs.sh and start-yarn.sh. If you go to hostname:8088 (the ResourceManager UI), you should see that the total memory and total vcores are now larger than 0, and you can run an application.
PS: My application was initially stuck at map 0% and reduce 0%. Then I simply changed 1024 to 4096 and 1 to 2 above, and I could successfully run a MapReduce program on my EC2 instance (single-node pseudo-distributed).
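As a quick way to verify the same numbers the hostname:8088 page shows, the ResourceManager's cluster-metrics REST endpoint can be polled from a script. A minimal sketch (the hostname is a placeholder and the default web port 8088 is assumed):

import json
import urllib.request

# Cluster-wide metrics exposed by the YARN ResourceManager web service.
# Replace "hostname" with the address of the EC2 instance.
url = "http://hostname:8088/ws/v1/cluster/metrics"

with urllib.request.urlopen(url) as resp:
    metrics = json.load(resp)["clusterMetrics"]

# If these are still 0, the NodeManager has not registered and jobs will
# sit at map 0% / reduce 0%.
print("total MB:    ", metrics["totalMB"])
print("total vcores:", metrics["totalVirtualCores"])
print("active nodes:", metrics["activeNodes"])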
I am getting a kernel error while trying to retrieve data from an API that spans 100 pages. The data size is huge, but the code runs fine when executed on Google Colab or on my local machine.
The error I see in a window is:
Kernel Restarting
The kernel appears to have died. It will restart automatically.
I am using an ml.m5.xlarge machine with 1000 GB of storage allocated, and there are no pre-saved datasets in the instance. Also, the expected data size is around 60 GB, split into multiple datasets of 4 GB each.
Can anyone help?
I think you could try not to load all the data into memory at once, or switch to a beefier instance type. According to https://aws.amazon.com/sagemaker/pricing/instance-types/, ml.m5.xlarge has only 16 GB of memory.
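If the notebook is currently accumulating all 100 pages in memory before saving, a streaming approach that writes each page to disk as it arrives keeps resident memory close to one page's worth. A rough sketch, assuming a paginated JSON API and the requests package; the URL, page parameter, and filenames are placeholders:

import json
import requests

# Placeholder endpoint; substitute the real API URL and any auth headers.
BASE_URL = "https://api.example.com/data"

for page in range(1, 101):
    resp = requests.get(BASE_URL, params={"page": page}, timeout=60)
    resp.raise_for_status()

    # Write each page straight to disk instead of appending it to a
    # growing in-memory list, so only one page is held in RAM at a time.
    with open(f"page_{page:03d}.json", "w") as out:
        json.dump(resp.json(), out)

# The saved files can later be processed one at a time (or with a chunked
# reader) rather than loaded all at once.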
We have a Spark 1.6.1 application that takes input from two Kafka topics and writes the result to another Kafka topic. The application receives some large (approximately 1 MB) files on the first input topic and some simple conditions from the second input topic. If the condition is satisfied, the file is written to the output topic; otherwise it is held in state (we use mapWithState).
The logic works fine for a small number (a few hundred) of input files, but then fails with org.apache.spark.rpc.RpcTimeoutException and a recommendation to increase spark.rpc.askTimeout. After increasing it from the default (120s) to 300s, the job ran fine for longer but crashed with the same error after 1 hour. After changing the value to 500s, the job ran fine for more than 2 hours.
Note: We are running the Spark job in local mode, and Kafka is also running locally on the same machine. Also, sometimes I see the warning "[2016-09-06 17:36:05,491] [WARN] - [org.apache.spark.storage.MemoryStore] - Not enough space to cache rdd_2123_0 in memory! (computed 2.6 GB so far)".
Now, 300s seemed like a large enough timeout considering everything is running locally. But is there a way to arrive at an ideal timeout value, instead of just using 500s or higher based on testing? I have seen cases that crashed even with 800s and suggestions to use 60000s.
I was facing the same problem. I found this page saying that under heavy workloads it is wise to set spark.network.timeout (which controls all the timeouts, including the RPC one) to 800. For the moment, that solved my problem.
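For reference, the setting can be applied where the context is built. A minimal sketch, assuming a PySpark entry point (if the original application is Scala or Java, the equivalent is a --conf spark.network.timeout=800s flag on spark-submit or a SparkConf.set call there):

from pyspark import SparkConf, SparkContext

# Raise the umbrella network timeout; it also covers spark.rpc.askTimeout
# unless that one is set explicitly. 800s mirrors the value mentioned above.
conf = (SparkConf()
        .setAppName("kafka-state-app")   # placeholder app name
        .setMaster("local[*]")           # local mode, as in the question
        .set("spark.network.timeout", "800s"))

sc = SparkContext(conf=conf)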
I am using the community edition of MemSQL. I got this error while I was running a query today, so I just restarted my cluster and the error went away:
memsql-ops cluster-restart
But what happened, and what should I do in the future to avoid this error?
NOTE
I do not want to buy the Enterprise edition.
Question
Is this a problem of availability?
I got this error when experimenting with performance.
The VM had 24 CPUs and 25 nodes: 1 master aggregator and 24 leaf nodes.
I reduced the VM to 4 CPUs and restarted the cluster.
Not all of the leaves recovered.
All except 4 recovered in under 5 minutes.
20 minutes later, 4 leaf nodes were still not connected.
From MySQL/MemSQL prompt:
use db;
show partitions;
I noticed that some partitions (ordinals 0-71 in my case) had NULL instead of a Host, Port, and Role defined.
In the MemSQL Ops UI (http://server:9000 > Settings > Config > Manual Cluster Control), I checked "ENABLE MANUAL CONTROL" while I tried to run various commands, with no real benefit.
Then, 15 minutes later, I unchecked the box; MemSQL Ops tried attaching all the leaf nodes again and was finally successful.
Perhaps a cluster restart would have done the same thing.
This happened because a leaf in your cluster has failed a health check heartbeat for some reason (loss of network connectivity, hardware failure, OS issue, machine overloaded, out of memory, etc.) and its partitions are no longer accessible to query. MemSQL Community Edition only supports redundancy 1 so there are no other copies of the data on the failed leaf node in your cluster (thus the error about missing a partition of data - MemSQL can't complete a query that needs to read data on any partitions on the problem leaf).
Given that a restart repaired things, the most likely answer is that the Linux "out of memory" killer took out the leaf process: see the MemSQL Linux OOM killer docs.
You can also check the tracelog on the leaf that ran into issues to see if there is any clue there about what happened (it's usually at /var/lib/memsql/leaf_3306/tracelogs/memsql.log).
-Adam
I too have faced this error; in my case it was because some of the slave partitions had no corresponding masters. My error message looked like:
ERROR 1772 (HY000) at line 1: Leaf Error (10.0.0.112:3306): Partition database `<db_name>_0` can't be promoted to master because it is provisioning replication
My memsql> SHOW PARTITIONS; command returned the following.
The approach I followed was to drop each such partition (where the Role was either Slave or NULL):
DROP PARTITION <db_name>:4 ON "10.0.0.193":3306;
..
DROP PARTITION <db_name>:46 ON "10.0.0.193":3306;
And then created a new partition in place of each dropped one:
CREATE PARTITION <db_name>:4 ON "10.0.0.193":3306;
..
CREATE PARTITION <db_name>:46 ON "10.0.0.193":3306;
And this was the result of memsql> SHOW PARTITIONS; after that.
You can refer to the MemSQL documentation regarding partitions here if the above steps don't solve your problem.
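If many partitions are affected, the DROP/CREATE statements above can be generated by a small script instead of being typed by hand. A rough sketch, assuming the pymysql package (MemSQL is wire-compatible with MySQL); the host, credentials, database name, and target leaf are placeholders, and it only prints the statements rather than executing them:

import pymysql
import pymysql.cursors

DB = "db_name"                 # placeholder database name
TARGET = '"10.0.0.193":3306'   # leaf to recreate the partitions on (placeholder)

# Connect to the master aggregator over the MySQL protocol.
conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="",
                       database=DB, cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("SHOW PARTITIONS")
    for row in cur.fetchall():
        # Column names follow the output described above (Ordinal, Host,
        # Port, Role); adjust if your version labels them differently.
        if row.get("Role") in (None, "Slave"):
            ordinal = row["Ordinal"]
            host, port = row.get("Host"), row.get("Port")
            if host and port:
                print(f'DROP PARTITION {DB}:{ordinal} ON "{host}":{port};')
            print(f'CREATE PARTITION {DB}:{ordinal} ON {TARGET};')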
I was hitting the same problem. Using the following command on the master node solved it:
REBALANCE PARTITIONS ON db_name
Optionally you can force it using FORCE:
REBALANCE PARTITIONS ON db_name FORCE
And to see the list of operations the rebalance is going to execute, use the above command with EXPLAIN:
EXPLAIN REBALANCE PARTITIONS ON db_name [FORCE]
I've put up a test cluster of four nodes. Severely underpowered(!): OK CPU, only 2 GB of RAM, shared non-SSD storage. Hey, it's a test :)
I just kept it running for three days. No data going in or out; everything was just idle. It is connected to OpsCenter.
This morning, we found that one of the nodes went down around 2 am last night. The OS didn't go down (it was responding to pings). The Cassandra log around that time is:
INFO [MemtableFlushWriter:114] 2014-07-29 02:07:34,952 Memtable.java:360 - Completed flushing /var/lib/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-107-Data.db (686 bytes) for commitlog position ReplayPosition(segmentId=1406304454537, position=29042136)
INFO [ScheduledTasks:1] 2014-07-29 02:08:24,227 GCInspector.java:116 - GC for ParNew: 276 ms for 1 collections, 648591696 used; max is 1040187392
Next entry is:
INFO [main] 2014-07-29 09:18:41,661 CassandraDaemon.java:102 - Hostname: xxxxx
i.e., when we restarted the node through OpsCenter.
Does that mean it crashed on GC, or that GC finished and something else crashed? Is there some other log I should be looking at?
Note: In opscenter eventlog, we see this:
7/29/2014, 2:15am Warning Node reported as being down: xxxxxxx
I appreciate that the nodes are underpowered, but given that they are completely idle, it shouldn't crash, should it?
Using 2.1.0-rc4 btw.
My guess is that your node was shut down by the OOM killer. Because Linux overcommits RAM, when the system is under heavy memory pressure it may kill applications to recover memory for the OS. With 2 GB of total RAM this can happen very easily.
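One way to confirm this is to look for OOM-killer entries in the kernel log around 2 am on the node that died. A small sketch, assuming an Ubuntu-style /var/log/syslog (the path and exact message wording vary by distribution):

# Scan the system log for OOM-killer activity.
LOG = "/var/log/syslog"  # may be /var/log/messages, or use journalctl -k, on other distros

with open(LOG, errors="replace") as log:
    for line in log:
        if "oom-killer" in line or "Out of memory" in line:
            print(line.rstrip())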
I had been using a 6 GB dataset, with each source record being ~1 KB in length, when I accidentally added an index on a column that I am pretty sure has 100% cardinality.
I tried dropping the index from cqlsh, but by that point the two-node cluster had gone into a runaway death spiral, with the load average surpassing 20 on each node, and cqlsh hung on the drop command for 30 minutes. Since this was just a test setup, I shut down and destroyed the cluster and restarted.
This is a fairly disconcerting problem, as it makes me fear a scenario where a junior developer on a production cluster sets an index on a similarly high-cardinality column. I scanned through the documentation and looked at the options in nodetool, but there didn't seem to be anything along the lines of "abort job" or "abort building index".
Test environment:
2x m1.xlarge EC2 instances with 2 RAID 0 ephemeral disks
Dataset was 6GB, 1KB per record.
My question in summary: Is it possible to abort the process of building a secondary index, and/or is it possible to stop or postpone running builds (indexing, compaction) until a later date?
nodetool -h node_address stop index_build
See: http://www.datastax.com/docs/1.2/references/nodetool#nodetool-stop