I am unable to run accumulo after i upgraded the JDK to 8. I have changed the environment variables.
I am able to start hadoop with out any Issues and i can access the data node on http://localhost:50075 (I am running all on my local machine).
And this is what i am getting in the accumulo logs
Thread "gc" died Can't tell if Accumulo is initialized; can't read instance id at hdfs://localhost:9000/accumulo/instance_id.
Here is the console output from accumulo start script
/usr/local/accumulo-1.6.1/bin$ ./start-all.sh
Starting monitor on localhost
WARN : Max open files on localhost is 1024, recommend 32768
Starting tablet servers .... done
Starting tablet server on localhost
WARN : Max open files on localhost is 1024, recommend 32768
2016-07-09 21:06:09,723 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose set to false in hdfs-site.xml: data loss is possible on system reset or power loss
2016-07-09 21:06:09,726 [server.Accumulo] INFO : Attempting to talk to zookeeper
2016-07-09 21:06:09,844 [server.Accumulo] INFO : Zookeeper connected and initialized, attemping to talk to HDFS
2016-07-09 21:06:09,915 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:06:09,915 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 1.0 seconds
2016-07-09 21:06:10,917 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:06:10,917 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 2.0 seconds
2016-07-09 21:06:12,918 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:06:12,918 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 4.0 seconds
2016-07-09 21:06:16,919 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:06:16,919 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 8.0 seconds
2016-07-09 21:06:24,920 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:06:24,921 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 16.0 seconds
2016-07-09 21:06:40,923 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:06:40,923 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 32.0 seconds
2016-07-09 21:07:12,926 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:07:12,926 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 60.0 seconds
2016-07-09 21:08:12,929 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:08:12,929 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 60.0 seconds
2016-07-09 21:09:12,932 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:09:12,932 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 60.0 seconds
2016-07-09 21:10:12,935 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:10:12,935 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 60.0 seconds
2016-07-09 21:11:12,938 [server.Accumulo] WARN : Waiting for the NameNode to leave safemode
2016-07-09 21:11:12,938 [server.Accumulo] INFO : Backing off due to failure; current sleep period is 60.0 seconds
Not sure what i am doing wrong..Appreciate any help.
Thanks
The error message is telling you that the NameNode is in SafeMode. This is often because HDFS is aware that it is missing blocks to back the files that you had stored in HDFS on your local filesystem. Visit the NameNode UI at http://localhost:50070 for more reasons as to why the NameNode will not automatically leave safemode.
After you get the NameNode out of safe mode, verify that you still have the directory /accumulo in HDFS (e.g. hdfs dfs -ls /accumulo).
This directory (and the beneath structure) is created by running accumulo init. If that directory is not present, that means you deleted that HDFS directory (intentionally or not).
Make sure that you have HDFS configured to write to a directory which is not wiped on restart of your machine (e.g. avoid /tmp).
Related
I think I might of stumbled across a bug and wanted to get other people's input. I am running a pyspark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in python inside a flatMap and the driver keeps killing the workers.
Here is what am I seeing:
The master after 60s of not seeing any heartbeat message from the workers it prints out this message to the log:
Removing worker [worker name] because we got no heartbeat in 60
seconds
Removing worker [worker name] on [IP]:[port]
Telling app of
lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code and from what I can tell, as long as the executor is alive it should send a heartbeat message back as it is using a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing that I noticed while I was running top on one of the workers, is that the executor JVM seems to be suspended throughout this process. There are as many python processes as I specified in the SPARK_WORKER_CORES env variable and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing this same issue, increasing interval worked.
Excerpt from Logs start-all.sh logs
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add following configs to $SPARK_HOME/conf/spark-defaults.conf
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
I am trying to run a spark job via Mesos
it throws an exception
WARN MesosExternalShuffleClient: Unable to register app
6c1b7274-960f-47ef-9fa7-1dd06b05d4f1-0010
with external shuffle service.
Please manually remove shuffle data after driver exit
Error:java.lang.RuntimeException:
java.lang.UnsupportedOperationException: Unexpected message:
org.apache.spark.network.shuffle.protocol.mesos.RegisterDriver#88d86a24
"
and the logs in Stderr as
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
INFO DiskBlockManager: Shutdown hook called
E0911 05:32:34.711486 6619 process.cpp:951] Failed to accept socket: future discarded**
In Spark-Defaults.conf
spark.mesos.coarse true
spark.network.timeout 3600s
spark.shuffle.io.connectionTimeout 3600s
Who is killing my application..?
I was loading a large CSV into Cassandra using cassandra-loader.
The VM ran out of disk space during this process and crashed. I allocated more disk space to the VM and tried starting cassandra but it refused to start due to problems with SSTables and commit log.
I could not run nodetool repair as it is only works when the node is online.
I ran sstablescrub which took about 1 hour to finish. So I thought it might have fixed it.
But I still get this error in system.log
ERROR [SSTableBatchOpen:4] 2015-10-23 18:57:45,035 SSTableReader.java:506 - Corrupt sstable /var/lib/cassandra/data/keyspace1/location-777a33d0772911e597a98b820c5778a4/la-1709-big=[TOC.txt, CompressionInfo.db, Statistics.db, Digest.adler32, Data.db, Index.db, Filter.db]; skipping table
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:125) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:86) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.util.CompressedSegmentedFile$Builder.metadata(CompressedSegmentedFile.java:142) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:101) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:187) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:179) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.sstable.format.SSTableReader.load(SSTableReader.java:703) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.sstable.format.SSTableReader.load(SSTableReader.java:664) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:458) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:363) ~[apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.io.sstable.format.SSTableReader$4.run(SSTableReader.java:501) ~[apache-cassandra-2.2.3.jar:2.2.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_60]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_60]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
Caused by: java.io.EOFException: null
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340) ~[na:1.8.0_60]
at java.io.DataInputStream.readUTF(DataInputStream.java:589) ~[na:1.8.0_60]
at java.io.DataInputStream.readUTF(DataInputStream.java:564) ~[na:1.8.0_60]
at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:96) ~[apache-cassandra-2.2.3.jar:2.2.3]
... 15 common frames omitted
INFO [main] 2015-10-23 18:57:45,193 ColumnFamilyStore.java:382 - Initializing system_auth.role_permissions
INFO [main] 2015-10-23 18:57:45,201 ColumnFamilyStore.java:382 - Initializing system_auth.resource_role_permissons_index
INFO [main] 2015-10-23 18:57:45,213 ColumnFamilyStore.java:382 - Initializing system_auth.roles
INFO [main] 2015-10-23 18:57:45,233 ColumnFamilyStore.java:382 - Initializing system_auth.role_members
INFO [main] 2015-10-23 18:57:45,240 ColumnFamilyStore.java:382 - Initializing system_traces.sessions
INFO [main] 2015-10-23 18:57:45,252 ColumnFamilyStore.java:382 - Initializing system_traces.events
INFO [main] 2015-10-23 18:57:45,265 ColumnFamilyStore.java:382 - Initializing simplex.songs
INFO [main] 2015-10-23 18:57:45,276 ColumnFamilyStore.java:382 - Initializing simplex.playlists
INFO [pool-2-thread-1] 2015-10-23 18:57:45,289 AutoSavingCache.java:187 - reading saved cache /var/lib/cassandra/saved_caches/KeyCache-ca.db
INFO [pool-2-thread-1] 2015-10-23 18:57:45,313 AutoSavingCache.java:163 - Completed loading (25 ms; 36 keys) KeyCache cache
INFO [main] 2015-10-23 18:57:45,351 CommitLog.java:168 - Replaying /var/lib/cassandra/commitlog/CommitLog-5-1445578022702.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022703.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022704.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022705.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022706.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022707.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022708.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022709.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022710.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022712.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022713.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022714.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022715.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022716.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022717.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022719.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022720.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022721.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022722.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022723.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022724.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022725.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022727.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022728.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022730.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022731.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022732.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022733.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022734.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022736.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022738.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022740.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022741.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022743.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022744.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022745.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022746.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022748.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022749.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022750.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022751.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022752.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022753.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022755.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022756.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022758.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022759.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022760.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022761.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022763.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022764.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022765.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022766.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022767.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022768.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022769.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022770.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022771.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022772.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022773.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022774.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022775.log, /var/lib/cassandra/commitlog/CommitLog-5-1445578022776.log, /var/lib/cassandra/commitlog/CommitLog-5-1445588991268.log, /var/lib/cassandra/commitlog/CommitLog-5-1445589094722.log, /var/lib/cassandra/commitlog/CommitLog-5-1445589149527.log, /var/lib/cassandra/commitlog/CommitLog-5-1445595828633.log, /var/lib/cassandra/commitlog/CommitLog-5-1445595898055.log, /var/lib/cassandra/commitlog/CommitLog-5-1445596033717.log, /var/lib/cassandra/commitlog/CommitLog-5-1445596400441.log, /var/lib/cassandra/commitlog/CommitLog-5-1445596601854.log, /var/lib/cassandra/commitlog/CommitLog-5-1445598032544.log, /var/lib/cassandra/commitlog/CommitLog-5-1445598758663.log, /var/lib/cassandra/commitlog/CommitLog-5-1445601112953.log, /var/lib/cassandra/commitlog/CommitLog-5-1445601937334.log, /var/lib/cassandra/commitlog/CommitLog-5-1445601985416.log, /var/lib/cassandra/commitlog/CommitLog-5-1445604504389.log, /var/lib/cassandra/commitlog/CommitLog-5-1445606516196.log
ERROR [main] 2015-10-23 18:59:05,091 JVMStabilityInspector.java:78 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Mutation checksum failure at 4110758 in CommitLog-5-1445578022776.log
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:622) [apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:492) [apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:388) [apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:147) [apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:189) [apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:169) [apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:273) [apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:513) [apache-cassandra-2.2.3.jar:2.2.3]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:622) [apache-cassandra-2.2.3.jar:2.2.3]
Exiting due to error while processing commit log during initialization.
How do I fix this? This is test data, I am okay with losing it. How would one handle this in production to avoid data loss?
Tried setting disk_failure_policy: ignore so that I could run nodetool repair once the server is up. But the server does not start even with this setting.
I am operating a single node and replication factor is 1. Would having more nodes and a >1 replication factor enabled me to fix an issue like this without data loss?
I am using Cassandra 2.2.3
Since you don't care about the data, removing files from \data\commitlogs should be easiest solution.
Having some replication would surely help you to fix this without data loss but it would come with a price.
Despite all your effort you cannot manage to recover your corrupted sstable. So you decide to remove it from your file system to start Cassandra again. If you do not have replication your data is lost. But if you have replication on the cluster, you can possibly fetch the data from other nodes. That is what nodetool repair do !
So nodetool repair does not repair corrupted sstable. Basicallynodetool repair compare tables from node to node to find missing or inconsistent data and then repair it. You can find more information on how it works here.
However nodetool repair is very expensive, it is long and uses a lot of cpu, disk and network. There is this good post about repair benefits and drawbacks.
This is how I fixed the problem with commit logs.
You should only do this if you don't care about preserving the state of your commit logs.
Try to restart cassandra using
sudo systemctl restart cassandra
Then I check
systemctl status cassandra
and see that the status is 'exited' so there is a problem. Check the logs for cassandra using
sudo less /var/log/cassandra/system.log
and see org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /var/lib/cassandra/commitlog/CommitLog-6-1498210233635.log
Because I don't care about preserving the state of Cassandra I delete all of the commit logs and it now boots up fine
sudo rm /var/lib/cassandra/commitlog/CommitLog*
sudo systemctl restart cassandra
systemctl status cassandra (should confirm that it it now running)
Simply goes in log directory in cassandra and delete the log files.
It work fine....
I am trying to start cassandra so I did
sudo ./cassandra
I came across this
Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException: node24.nise.local: node24.nise.local
so I did what was mentioned as on problem on starting cassandra link and i changed the /etc/hosts file.
Then the starting process got stuck after this:
INFO 22:27:14,227 CFS(Keyspace='system', ColumnFamily='local') liveRatio is 33.904761904761905 (just-counted was 33.904761904761905). calculation took 110ms for 3 cells
INFO 22:27:14,260 Enqueuing flush of Memtable-local#726006040(84/840 serialized/live bytes, 4 ops)
INFO 22:27:14,262 Writing Memtable-local#726006040(84/2848 serialized/live bytes, 4 ops)
INFO 22:27:14,280 Completed flushing /var/lib/cassandra/data/system/local/system-local-jb-50-Data.db (116 bytes) for commitlog position ReplayPosition(segmentId=1401859631027, position=500327)
WARN 22:27:14,327 setting live ratio to maximum of 64.0 instead of Infinity
INFO 22:27:14,327 Enqueuing flush of Memtable-local#1689909512(10100/101000 serialized/live bytes, 259 ops)
INFO 22:27:14,328 CFS(Keyspace='system', ColumnFamily='local') liveRatio is 64.0 (just-counted was 64.0). calculation took 0ms for 0 cells
INFO 22:27:14,350 Writing Memtable-local#1689909512(10100/101000 serialized/live bytes, 259 ops)
INFO 22:27:14,386 Completed flushing /var/lib/cassandra/data/system/local/system-local-jb-51-Data.db (5278 bytes) for commitlog position ReplayPosition(segmentId=1401859631027, position=512328)
INFO 22:27:14,493 Node localhost/127.0.0.1 state jump to normal
No other line was executed after this . Can anyone help in letting me know why did this happen exactly.
I was getting same error ..
you just need to do give command in command prompt
hostname localhost (or the hostname of where cassandra is running)
This believe it will solve your problem
I think after this statement
INFO 22:27:14,493 Node localhost/127.0.0.1 state jump to normal
your server running normally, to verify do jps and check that CassandraDaemon is running or not.
I want to install Spark Standlone mode to a Cluster with my two virtual machines.
With the version of spark-0.9.1-bin-hadoop1, I execute spark-shell successfully in each vm. I follow the offical document to make one vm(ip:xx.xx.xx.223) as both Master and Worker and to make the other(ip:xx.xx.xx.224) as Worker only.
But the 224-ip vm cannot connect the 223-ip vm.
Followed is 223(Master)'s master log:
[#tc-52-223 logs]# tail -100f spark-root-org.apache.spark.deploy.master.Master-1-tc-52-223.out
Spark Command: /usr/local/jdk/bin/java -cp :/data/test/spark-0.9.1-bin-hadoop1/conf:/data/test/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip 10.11.52.223 --port 7077 --webui-port 8080
log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/04/14 22:17:03 INFO Master: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/04/14 22:17:03 INFO Master: Starting Spark master at spark://10.11.52.223:7077
14/04/14 22:17:03 INFO MasterWebUI: Started Master web UI at http://tc-52-223:8080
14/04/14 22:17:03 INFO Master: I have been elected leader! New state: ALIVE
14/04/14 22:17:06 INFO Master: Registering worker tc-52-223:20599 with 1 cores, 4.0 GB RAM
14/04/14 22:17:06 INFO Master: Registering worker tc_52_224:21371 with 1 cores, 4.0 GB RAM
14/04/14 22:17:06 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisteredWorker] from Actor[akka://sparkMaster/user/Master#1972530850] to Actor[akka://sparkMaster/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/04/14 22:17:26 INFO Master: Registering worker tc_52_224:21371 with 1 cores, 4.0 GB RAM
14/04/14 22:17:26 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorkerFailed] from Actor[akka://sparkMaster/user/Master#1972530850] to Actor[akka://sparkMaster/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/04/14 22:17:46 INFO Master: Registering worker tc_52_224:21371 with 1 cores, 4.0 GB RAM
14/04/14 22:17:46 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorkerFailed] from Actor[akka://sparkMaster/user/Master#1972530850] to Actor[akka://sparkMaster/deadLetters] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker#tc_52_224:21371 got disassociated, removing it.
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker#tc_52_224:21371 got disassociated, removing it.
14/04/14 22:18:06 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.11.52.224%3A61550-1#646150938] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker#tc_52_224:21371 got disassociated, removing it.
14/04/14 22:18:06 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster#10.11.52.223:7077] -> [akka.tcp://sparkWorker#tc_52_224:21371]: Error [Association failed with [akka.tcp://sparkWorker#tc_52_224:21371]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker#tc_52_224:21371]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: tc_52_224/10.11.52.224:21371
]
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker#tc_52_224:21371 got disassociated, removing it.
14/04/14 22:18:06 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster#10.11.52.223:7077] -> [akka.tcp://sparkWorker#tc_52_224:21371]: Error [Association failed with [akka.tcp://sparkWorker#tc_52_224:21371]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker#tc_52_224:21371]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: tc_52_224/10.11.52.224:21371
]
14/04/14 22:18:06 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster#10.11.52.223:7077] -> [akka.tcp://sparkWorker#tc_52_224:21371]: Error [Association failed with [akka.tcp://sparkWorker#tc_52_224:21371]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker#tc_52_224:21371]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: tc_52_224/10.11.52.224:21371
]
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker#tc_52_224:21371 got disassociated, removing it.
14/04/14 22:19:03 WARN Master: Removing worker-20140414221705-tc_52_224-21371 because we got no heartbeat in 60 seconds
14/04/14 22:19:03 INFO Master: Removing worker worker-20140414221705-tc_52_224-21371 on tc_52_224:21371
Followed is 223(Worker)'s worker log:
14/04/14 22:17:06 INFO Worker: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/04/14 22:17:06 INFO Worker: Starting Spark worker tc-52-223:20599 with 1 cores, 4.0 GB RAM
14/04/14 22:17:06 INFO Worker: Spark home: /data/test/spark-0.9.1-bin-hadoop1
14/04/14 22:17:06 INFO WorkerWebUI: Started Worker web UI at http://tc-52-223:8081
14/04/14 22:17:06 INFO Worker: Connecting to master spark://xx.xx.52.223:7077...
14/04/14 22:17:06 INFO Worker: Successfully registered with master spark://xx.xx.52.223:7077
Followed is 224(Worker)'s work log:
Spark Command: /usr/local/jdk/bin/java -cp :/data/test/spark-0.9.1-bin-hadoop1/conf:/data/test/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://10.11.52.223:7077 --webui-port 8081
========================================
log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/04/14 22:17:06 INFO Worker: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/04/14 22:17:06 INFO Worker: Starting Spark worker tc_52_224:21371 with 1 cores, 4.0 GB RAM
14/04/14 22:17:06 INFO Worker: Spark home: /data/test/spark-0.9.1-bin-hadoop1
14/04/14 22:17:06 INFO WorkerWebUI: Started Worker web UI at http://tc_52_224:8081
14/04/14 22:17:06 INFO Worker: Connecting to master spark://xx.xx.52.223:7077...
14/04/14 22:17:26 INFO Worker: Connecting to master spark://xx.xx.52.223:7077...
14/04/14 22:17:46 INFO Worker: Connecting to master spark://xx.xx.52.223:7077...
14/04/14 22:18:06 ERROR Worker: All masters are unresponsive! Giving up.
Followed is my spark-env.sh:
JAVA_HOME=/usr/local/jdk
export SPARK_MASTER_IP=tc-52-223
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=4g
export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
export SPARK_LOCAL_IP=tc-52-223
I have googled many solutions, but they cant work.
Please help me.
I'm not sure if this is the same issue I encountered but you may want to try setting SPARK_MASTER_IP the same as what spark binds to. In your example is looks like it would be 10.11.52.223 and not tc-52-223.
It should be the same as what you see when you visit the master node web UI on 8080. Something like: Spark Master at spark://ec2-XX-XX-XXX-XXX.compute-1.amazonaws.com:7077
If you are getting a "Connection refused" exception, You can resolve it by checking
=> Master is running on the specific host
netstat -at | grep 7077
You will get something similar to:
tcp 0 0 akhldz.master.io:7077 *:* LISTEN
If that is the case, then from your worker machine do a
host akhldz.master.io ( replace akhldz.master.io with your master host.If something goes wrong, then add a host entry in your /etc/hosts file)
telnet akhldz.master.io 7077 ( If this is not connecting, then your worker wont connect either. )
=> Adding Host entry in /etc/hosts
Open /etc/hosts from your worker machine and add the following entry (example)
192.168.100.20 akhldz.master.io
PS :In the above case Pillis was having two ip addresses having same host name
eg:
192.168.100.40 s1.machine.org
192.168.100.41 s1.machine.org
Hope that help.
There's a lot of answers and possible solutions, and this question is a bit old, but in the interest of completeness, there is a known Spark bug about hostnames resolving to IP addresses. I am not presenting this as the complete answer in all cases, but I suggest trying with a baseline of just using all IPs, and only use the single config SPARK_MASTER_IP. With just those two practices I get my clusters to work and all the other configs, or using hostnames, just seems to muck things up.
So in your spark-env.sh get rid of SPARK_WORKER_IP and change SPARK_MASTER_IP to an IP address, not a hostname.
I have treated this more at length in this answer.
For more completeness here's part of that answer:
Can you ping the box where the Spark master is running? Can you ping
the worker from the master? More importantly, can you password-less
ssh to the worker from the master box? Per 1.5.2 docs you need to be
able to do that with a private key AND have the worker entered in the
conf/slaves file. I copied the relevant paragraph at the end.
You can get a situation where the worker can contact the master but
the master can't get back to the worker so it looks like no connection
is being made. Check both directions.
I think the slaves file on the master node, and the password-less ssh can lead to similar errors to what you are seeing.
Per the answer I crosslinked, there's also an old bug but it's not clear how that bug was resolved.
set the port for spark worker also, Eg.: SPARK_WORKER_PORT=5078 ... check thespark-installation link for correct installation
basically your ports are blocked so communication from master to worker is cut down. check here https://spark.apache.org/docs/latest/configuration.html#networking
In the "Networking" section, you can see some of the ports are by default random. You can set them to your choice like below:
val conf = new SparkConf()
.setMaster(master)
.setAppName("namexxx")
.set("spark.driver.port", "51810")
.set("spark.fileserver.port", "51811")
.set("spark.broadcast.port", "51812")
.set("spark.replClassServer.port", "51813")
.set("spark.blockManager.port", "51814")
.set("spark.executor.port", "51815")
I my case, I could overcome the problem as "adding entry of hostname and IP adres of localhost to /etc/hosts file" as follows:
For a cluster, master has the /etc/hosts content as follows:
127.0.0.1 master.yourhost.com localhost localhost4 localhost.localdomain
192.168.1.10 slave1.yourhost.com
192.168.1.9 master.yourhost.com **# this line solved the problem**
Then I also do the SAME THING on slave1.yourhost.com machine.
Hope this helps..
I had faced same issue . you can resolve it by below procedure ,
first you should go to /etc/hosts file and comment 127.0.1.1 address .
then you should go towards spark/sbin directory , then you should started spark session by these command ,
./start-all.sh
or you can use ./start-master.sh and ./start-slave.sh for the same . Now if you will run spark-shell or pyspark or any other spark component then it will automatically create spark context object sc for you .