Hadoop NameNode startup failure due to InconsistentFSStateException - Azure

I am setting up a Hadoop (v1.1.1) cluster on Windows Azure. I am trying to launch the NameNode process using:
service hadoop-namenode start
However, I consistently get the following error, which is related to the storage directory being wiped when the VM reboots. I moved this directory so it would not be deleted each time, but the error still occurs. Any help would be gratefully received.
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/10.77.42.61
STARTUP_MSG: args = []
STARTUP_MSG: version = 1.1.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108; compiled by 'hortonfo' on Mon Nov 19 10:44:13 UTC 2012
************************************************************/
2012-12-13 09:38:54,102 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2012-12-13 09:38:54,222 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2012-12-13 09:38:54,230 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
2012-12-13 09:38:54,230 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2012-12-13 09:38:54,675 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2012-12-13 09:38:54,714 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered.
2012-12-13 09:38:54,720 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source NameNode registered.
2012-12-13 09:38:54,804 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 64-bit
2012-12-13 09:38:54,810 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 2.475 MB
2012-12-13 09:38:54,810 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^18 = 262144 entries
2012-12-13 09:38:54,810 INFO org.apache.hadoop.hdfs.util.GSet: recommended=262144, actual=262144
2012-12-13 09:38:54,890 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hdfs
2012-12-13 09:38:54,895 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2012-12-13 09:38:54,895 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2012-12-13 09:38:54,915 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
2012-12-13 09:38:54,915 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2012-12-13 09:38:55,429 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
2012-12-13 09:38:55,465 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2012-12-13 09:38:55,471 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /hadoop/name
2012-12-13 09:38:55,474 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /hadoop/name is in an inconsistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:411)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:379)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:277)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:529)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1403)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1412)
2012-12-13 09:38:55,476 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /hadoop/name is in an inconsistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)

Change the permissions of the directory you have specified as the value of the "dfs.name.dir" property in your hdfs-site.xml file to 755, and also change the owner of this directory to the user that runs the NameNode. BTW, were you able to do a successful format?
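A minimal sketch of that fix, run here against a scratch path so it is safe to try as-is; on the real node, point NAME_DIR at the dfs.name.dir value from hdfs-site.xml (e.g. /hadoop/name) and run it as root:

```shell
#!/bin/sh
# Scratch path for the demo; on the real node use the dfs.name.dir value
# from hdfs-site.xml (e.g. /hadoop/name).
NAME_DIR="${NAME_DIR:-/tmp/hadoop-name-demo}"

mkdir -p "$NAME_DIR"            # recreate it if the reboot wiped it
chmod 755 "$NAME_DIR"           # owner rwx, group/other r-x
chown "$(id -un)" "$NAME_DIR"   # owner = the user that runs the NameNode

ls -ld "$NAME_DIR"              # verify before retrying the service start
```

Note that if the directory really was wiped, it no longer contains a valid fsimage, so the NameNode will also need a fresh `hadoop namenode -format` before it can start (which discards any previous HDFS metadata).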

Related

How to check from the logs when Cassandra was shut down or started/restarted?

nodetool status shows some nodes as down, and the application is not able to reach the Cassandra nodes. When I connect to the Cassandra nodes and check the logs, there are some errors in them (system.log and debug.log).
I don't understand how to check when Cassandra was started/restarted and when it was shut down. Is there any way to check this from the logs? If so, which log, and how?
In the logs folder of your Cassandra installation you should find the system.log file. The default location for the log files is /var/log/cassandra.
When Cassandra starts the output looks like this:
INFO [main] 2018-08-18 19:10:27,162 YamlConfigurationLoader.java:89 - Configuration location: file:/home/20171127/.ccm/3113/node1/conf/cassandra.yaml
INFO [main] 2018-08-18 19:10:27,696 Config.java:495 - Node configuration:[allocate_tokens_for_keyspace=null; authenticator=AllowAllAuthenticator; ...]
INFO [main] 2018-08-18 19:10:27,698 DatabaseDescriptor.java:367 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO [main] 2018-08-18 19:10:27,699 DatabaseDescriptor.java:425 - Global memtable on-heap threshold is enabled at 123MB
INFO [main] 2018-08-18 19:10:27,700 DatabaseDescriptor.java:429 - Global memtable off-heap threshold is enabled at 123MB
WARN [main] 2018-08-18 19:10:27,974 DatabaseDescriptor.java:550 - Only 30.922GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots
INFO [main] 2018-08-18 19:10:28,025 RateBasedBackPressure.java:123 - Initialized back-pressure with high ratio: 0.9, factor: 5, flow: FAST, window size: 2000.
INFO [main] 2018-08-18 19:10:28,026 DatabaseDescriptor.java:729 - Back-pressure is disabled with strategy org.apache.cassandra.net.RateBasedBackPressure{high_ratio=0.9, factor=5, flow=FAST}.
INFO [main] 2018-08-18 19:10:28,265 JMXServerUtils.java:246 - Configured JMX server at: service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:7100/jmxrmi
INFO [main] 2018-08-18 19:10:28,282 CassandraDaemon.java:473 - Hostname: 2VVg5
INFO [main] 2018-08-18 19:10:28,284 CassandraDaemon.java:480 - JVM vendor/version: OpenJDK 64-Bit Server VM/1.8.0_181
INFO [main] 2018-08-18 19:10:28,285 CassandraDaemon.java:481 - Heap size: 495.000MiB/495.000MiB
INFO [main] 2018-08-18 19:10:28,287 CassandraDaemon.java:486 - Code Cache Non-heap memory: init = 2555904(2496K) used = 4178816(4080K) committed = 4194304(4096K) max = 251658240(245760K)
INFO [main] 2018-08-18 19:10:28,289 CassandraDaemon.java:486 - Metaspace Non-heap memory: init = 0(0K) used = 18530200(18095K) committed = 19005440(18560K) max = -1(-1K)
INFO [main] 2018-08-18 19:10:28,289 CassandraDaemon.java:486 - Compressed Class Space Non-heap memory: init = 0(0K) used = 2092504(2043K) committed = 2228224(2176K) max = 1073741824(1048576K)
INFO [main] 2018-08-18 19:10:28,290 CassandraDaemon.java:486 - Par Eden Space Heap memory: init = 41943040(40960K) used = 41943040(40960K) committed = 41943040(40960K) max = 41943040(40960K)
INFO [main] 2018-08-18 19:10:28,291 CassandraDaemon.java:486 - Par Survivor Space Heap memory: init = 5242880(5120K) used = 5242880(5120K) committed = 5242880(5120K) max = 5242880(5120K)
INFO [main] 2018-08-18 19:10:28,293 CassandraDaemon.java:486 - CMS Old Gen Heap memory: init = 471859200(460800K) used = 669336(653K) committed = 471859200(460800K) max = 471859200(460800K)
When Cassandra shuts down properly, the log entries look like this:
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:23,661 Gossiper.java:1559 - Announcing shutdown
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:23,662 StorageService.java:2289 - Node /127.0.0.1 state jump to shutdown
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:25,664 MessagingService.java:992 - Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.1] 2018-08-18 19:28:25,665 MessagingService.java:1346 - MessagingService has terminated the accept() thread
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:25,729 HintsService.java:220 - Paused hints dispatch
Depending on how Cassandra stopped, it is possible that there are no log entries showing that it stopped.
Other interesting files that you could look into are debug.log and gc.log.
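To answer the "how do I check" part concretely, you can grep system.log for the start and shutdown markers shown above. A sketch; it builds a two-line sample from the log lines quoted in this answer so it is runnable anywhere, but on a real node you would grep /var/log/cassandra/system.log directly:

```shell
#!/bin/sh
# Demo input: two lines quoted earlier in this answer. On a real node,
# grep /var/log/cassandra/system.log instead of this sample file.
SAMPLE=/tmp/cassandra-system-sample.log
cat > "$SAMPLE" <<'EOF'
INFO [main] 2018-08-18 19:10:27,162 YamlConfigurationLoader.java:89 - Configuration location: file:/home/20171127/.ccm/3113/node1/conf/cassandra.yaml
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:23,661 Gossiper.java:1559 - Announcing shutdown
EOF

# Startup marker: the configuration-loading line printed early in every start.
# Shutdown marker: the gossip announcement from the shutdown hook.
grep -nE "Configuration location:|Announcing shutdown" "$SAMPLE"
```

Each match's timestamp tells you when that start or (clean) shutdown happened; as noted above, an unclean stop may leave no shutdown marker at all.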

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might have stumbled across a bug and wanted to get other people's input. I am running a pyspark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
Here is what I am seeing:
After 60 seconds without a heartbeat message from a worker, the master prints this message to its log:
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code and, from what I can tell, as long as the executor is alive it should keep sending heartbeat messages, since it uses a ThreadUtils.newDaemonSingleThreadScheduledExecutor for this.
One other thing I noticed while running top on one of the workers is that the executor JVM seems to be suspended throughout this process. There are as many Python processes as I specified in the SPARK_WORKER_CORES env variable, and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing this same issue; increasing the timeout intervals worked.
Excerpt from the start-all.sh logs:
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add the following configs to $SPARK_HOME/conf/spark-defaults.conf:
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
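For reference, here is one way to append those settings in a single step. This is a sketch that uses a scratch SPARK_HOME so it is safe to run as-is; set SPARK_HOME to your real installation first. Per the Spark configuration docs, spark.network.timeout must stay larger than spark.executor.heartbeatInterval, as it is here:

```shell
#!/bin/sh
# Scratch dir for the demo; point SPARK_HOME at your real installation.
SPARK_HOME="${SPARK_HOME:-/tmp/spark-home-demo}"
mkdir -p "$SPARK_HOME/conf"

# Raise the timeouts so heavy Python transformations don't get the
# worker/executor declared dead; network.timeout > executor.heartbeatInterval.
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.network.timeout            50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout             5000
EOF

cat "$SPARK_HOME/conf/spark-defaults.conf"
```

Restart the master and workers after changing the file so the new timeouts take effect.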

Cassandra stopped working after nodetool repair

After running the "nodetool repair" command, the Cassandra node went down and did not start again.
INFO [main] 2016-10-19 12:44:50,244 ColumnFamilyStore.java:405 - Initializing system_schema.aggregates
INFO [main] 2016-10-19 12:44:50,247 ColumnFamilyStore.java:405 - Initializing system_schema.indexes
INFO [main] 2016-10-19 12:44:50,248 ViewManager.java:139 - Not submitting build tasks for views in keyspace system_schema as storage service is not initialized
Cassandra version 3.7
I turned the node back on and it's fine now, but it took too long to start (more than 30 minutes).
INFO [main] 2016-10-19 15:32:48,348 ColumnFamilyStore.java:405 - Initializing system_schema.indexes
INFO [main] 2016-10-19 15:32:48,354 ViewManager.java:139 - Not submitting build tasks for views in keyspace system_schema as storage service is not initialized
INFO [main] 2016-10-19 16:07:36,529 ColumnFamilyStore.java:405 - Initializing system_distributed.parent_repair_history
INFO [main] 2016-10-19 16:07:36,546 ColumnFamilyStore.java:405 - Initializing system_distributed.repair_history
Now I'm trying to figure out why it is so slow.

Can't connect slaves to master in Spark

I am using 4 instances on Compute Engine, each running Spark set up with Cloudera Manager. I have no problems starting the master and connecting in my local browser, and it connects as spark://instance-1:7077. When I run start-slave on the remaining instances I get no errors, until I look in the log:
16/05/02 13:10:18 INFO worker.Worker: Started daemon with process name: 12612@instance-2.c.cluster1-1294.internal
16/05/02 13:10:18 INFO worker.Worker: Registered signal handlers for [TERM, HUP, INT]
16/05/02 13:10:18 INFO spark.SecurityManager: Changing view acls to: root
16/05/02 13:10:18 INFO spark.SecurityManager: Changing modify acls to: root
16/05/02 13:10:18 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with mod$
16/05/02 13:10:19 INFO util.Utils: Successfully started service 'sparkWorker' on port 60270.
16/05/02 13:10:19 INFO worker.Worker: Starting Spark worker 10.142.0.3:60270 with 2 cores, 6.3 GB RAM
16/05/02 13:10:19 INFO worker.Worker: Running Spark version 1.6.0
16/05/02 13:10:19 INFO worker.Worker: Spark home: /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
16/05/02 13:10:19 ERROR worker.Worker: Failed to create work directory /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/work
If I use mkdir to create 'work', it throws an error saying the directory already exists:
mkdir: cannot create directory ‘work’: File exists
The file does exist, and when I locate it with ls it is highlighted in red with a black background. Any help would be appreciated.
Maybe this is a permissions issue.
Try this:
sudo chown -R your_userName:your_groupName /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
Now change the mode of the above path:
sudo chmod 777 /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
Also, all the slaves must be able to ssh to each other and talk to one another.
And copy all the Spark configuration files to the slave nodes as well.
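If the permissions change does not help: the combination of mkdir reporting "File exists" while ls shows the entry red on a black background usually indicates that work is a dangling symlink (red-on-black is the default ls --color style for orphaned links). A small reproduction of that symptom in a scratch directory, not the real CDH path:

```shell
#!/bin/sh
# Reproduce the "exists but not accessible" symptom in a scratch directory.
d=/tmp/spark-work-demo
rm -rf "$d" && mkdir -p "$d" && cd "$d" || exit 1

ln -s /nonexistent/target work   # dangling symlink; "ls --color" shows it red

mkdir work 2>/dev/null \
  || echo "mkdir failed: the link already occupies the name"

[ -e work ] \
  || echo "work is not accessible: -e follows the link to a missing target"
```

If that matches your case, remove the stale link with rm work and restart the worker so it can recreate the directory with the right ownership.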

Cannot start Cassandra with "bin/cassandra -f"

I have a problem using Cassandra: I can start it with "bin/cassandra", but I cannot start it with "bin/cassandra -f". Does anyone know the reason?
Here are the detailed info:
root@server1:~/cassandra# bin/cassandra -f
INFO 10:51:31,500 JNA not found. Native methods will be disabled.
INFO 10:51:31,740 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO 10:51:32,043 Deleted /var/lib/cassandra/data/system/LocationInfo-61-Data.db
INFO 10:51:32,044 Deleted /var/lib/cassandra/data/system/LocationInfo-62-Data.db
INFO 10:51:32,052 Deleted /var/lib/cassandra/data/system/LocationInfo-63-Data.db
INFO 10:51:32,053 Deleted /var/lib/cassandra/data/system/LocationInfo-64-Data.db
INFO 10:51:32,063 Sampling index for /var/lib/cassandra/data/system/LocationInfo-65-Data.db
INFO 10:51:32,117 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-5-Data.db
INFO 10:51:32,118 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-6-Data.db
INFO 10:51:32,120 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-7-Data.db
INFO 10:51:32,131 Replaying /var/lib/cassandra/commitlog/CommitLog-1285869561954.log
INFO 10:51:32,143 Finished reading /var/lib/cassandra/commitlog/CommitLog-1285869561954.log
INFO 10:51:32,145 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1286301092145.log
INFO 10:51:32,153 Standard2 has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1286301092145.log', position=121)
INFO 10:51:32,155 Enqueuing flush of Memtable-Standard2#1811560891(29 bytes, 1 operations)
INFO 10:51:32,156 Writing Memtable-Standard2#1811560891(29 bytes, 1 operations)
INFO 10:51:32,200 Completed flushing /var/lib/cassandra/data/Keyspace1/Standard2-8-Data.db
INFO 10:51:32,203 Compacting [org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-5-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-6-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-7-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-8-Data.db')]
INFO 10:51:32,214 Recovery complete
INFO 10:51:32,214 Log replay complete
INFO 10:51:32,230 Saved Token found: 47408016217042861442279446207060121025
INFO 10:51:32,230 Saved ClusterName found: Test Cluster
INFO 10:51:32,231 Saved partitioner not found. Using org.apache.cassandra.dht.RandomPartitioner
INFO 10:51:32,250 LocationInfo has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1286301092145.log', position=345)
INFO 10:51:32,250 Enqueuing flush of Memtable-LocationInfo#1120194637(95 bytes, 2 operations)
INFO 10:51:32,251 Writing Memtable-LocationInfo#1120194637(95 bytes, 2 operations)
INFO 10:51:32,307 Completed flushing /var/lib/cassandra/data/system/LocationInfo-66-Data.db
INFO 10:51:32,316 Starting up server gossip
INFO 10:51:32,329 Compacted to /var/lib/cassandra/data/Keyspace1/Standard2-9-Data.db. 1670/1440 bytes for 6 keys. Time: 125ms.
INFO 10:51:32,366 Binding thrift service to /172.24.0.80:9160
INFO 10:51:32,369 Cassandra starting up...
I can't see any problems in that log; it ends with "Cassandra starting up...". (-f is short for 'foreground': it runs Cassandra attached to the terminal instead of as a background daemon.)
