giraph.numInputThreads: execution time for the "input superstep" is the same using 1 or 8 threads, how is this possible? - multithreading

I'm doing a BFS search through the Wikipedia (Spanish edition) site. I converted the dump into a file that can be read by Giraph.
Using 1 worker, a 1 GB file took 452 seconds. I executed Giraph with this command:
/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true
Container logs:
16/08/24 21:17:02 INFO master.BspServiceMaster: generateVertexInputSplits: Got 8 input splits for 1 input threads
16/08/24 21:17:02 INFO master.BspServiceMaster: createVertexInputSplits: Starting to write input split data to zookeeper with 1 threads
16/08/24 21:17:02 INFO master.BspServiceMaster: createVertexInputSplits: Done writing input split data to zookeeper
16/08/24 21:17:02 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Done - Found 1 responses of 1 needed to start superstep -1
16/08/24 21:17:02 INFO netty.NettyClient: Using Netty without authentication.
16/08/24 21:17:02 INFO netty.NettyClient: connectAllAddresses: Successfully added 1 connections, (1 total connected) 0 failed, 0 failures total.
16/08/24 21:17:02 INFO partition.PartitionUtils: computePartitionCount: Creating 1, default would have been 1 partitions.
...
16/08/24 21:25:40 INFO netty.NettyClient: stop: Halting netty client
16/08/24 21:25:40 INFO netty.NettyClient: stop: reached wait threshold, 1 connections closed, releasing resources now.
16/08/24 21:25:43 INFO netty.NettyClient: stop: Netty client halted
16/08/24 21:25:43 INFO netty.NettyServer: stop: Halting netty server
16/08/24 21:25:43 INFO netty.NettyServer: stop: Start releasing resources
16/08/24 21:25:44 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
16/08/24 21:25:47 INFO netty.NettyServer: stop: Netty server halted
16/08/24 21:25:47 INFO bsp.BspService: process: masterElectionChildrenChanged signaled
16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898 seconds.
16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took 452.531 seconds.
16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took 64.376 seconds.
16/08/24 21:25:47 INFO master.MasterThread: superstep 1: Took 1.591 seconds.
16/08/24 21:25:47 INFO master.MasterThread: shutdown: Took 6.609 seconds.
16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006 seconds.
As you can see, the first line tells us that the input superstep is running with only one thread, and it took 452 seconds to finish.
I did another test using giraph.numInputThreads=8, trying to run the input superstep with 8 threads:
/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.numInputThreads=8
The result was the following:
16/08/24 21:54:00 INFO master.BspServiceMaster: generateVertexInputSplits: Got 8 input splits for 8 input threads
16/08/24 21:54:00 INFO master.BspServiceMaster: createVertexInputSplits: Starting to write input split data to zookeeper with 1 threads
16/08/24 21:54:00 INFO master.BspServiceMaster: createVertexInputSplits: Done writing input split data to zookeeper
...
16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093 seconds.
16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took 891.339 seconds.
16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took 66.635 seconds.
16/08/24 22:10:07 INFO master.MasterThread: superstep 1: Took 1.837 seconds.
16/08/24 22:10:07 INFO master.MasterThread: shutdown: Took 6.605 seconds.
16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512 seconds.
So, my question is, how can it be that Giraph takes 452 seconds with one input thread and 891 seconds with eight? It should be exactly the opposite, right?
The cluster used for this was 1 master and 1 slave, both r3.8xlarge EC2 instances on AWS.

The problem is related to HDFS access. Eight threads end up hitting the same resource, which can only be read sequentially, so they contend with each other. For best performance, giraph.numInputThreads should be 1 or 2.
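For illustration, this is a sketch of the same job as above with the input parallelism explicitly dialed back to a single thread (only the final custom argument changes):
/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.numInputThreads=1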

Related

JMeter logs are missing

I am not able to see log messages either in the logger view or in jmeter.log. I am writing a simple JMeter Groovy script [JSR223 listener].
org.apache.jorphan.logging.LoggingManager;
private static final Logger log = LoggingManager.getLoggerForClass();
log.info("your log message");
log.error("your error message");
println("something");
System.Out.Println("other thing");
Logs
2021-01-19 01:23:28,999 INFO o.a.j.e.StandardJMeterEngine: Running the test!
2021-01-19 01:23:29,015 INFO o.a.j.s.SampleEvent: List of sample_variables: []
2021-01-19 01:23:29,015 INFO o.a.j.g.u.JMeterMenuBar: setRunning(true, *local*)
2021-01-19 01:23:29,050 INFO o.a.j.e.StandardJMeterEngine: Starting ThreadGroup: 1 : Thread Group
2021-01-19 01:23:29,050 INFO o.a.j.e.StandardJMeterEngine: Starting 1 threads for group Thread Group.
2021-01-19 01:23:29,050 INFO o.a.j.e.StandardJMeterEngine: Thread will continue on error
2021-01-19 01:23:29,066 INFO o.a.j.t.ThreadGroup: Starting thread group... number=1 threads=1 ramp-up=1 delayedStart=false
2021-01-19 01:23:29,066 INFO o.a.j.t.ThreadGroup: Started thread group number 1
2021-01-19 01:23:29,066 INFO o.a.j.e.StandardJMeterEngine: All thread groups have been started
2021-01-19 01:23:29,066 INFO o.a.j.t.JMeterThread: Thread started: Thread Group 1-1
2021-01-19 01:23:29,066 INFO o.a.j.t.JMeterThread: Thread is done: Thread Group 1-1
2021-01-19 01:23:29,066 INFO o.a.j.t.JMeterThread: Thread finished: Thread Group 1-1
2021-01-19 01:23:29,066 INFO o.a.j.e.StandardJMeterEngine: Notifying test listeners of end of test
2021-01-19 01:23:29,066 INFO o.a.j.g.u.JMeterMenuBar: setRunning(false, *local*)
I am expecting to see the log and error messages at least, but none come through. Also, I tried a JMeter 5.4.1 snapshot and got the same behavior. I am on Windows OS / JMeter 5.4. Any clue what I might be missing?
You don't need the first two lines; log is a pre-defined variable for the JSR223 Test Elements. See the "Top 8 JMeter Java Classes You Should Be Using with Groovy" article for more details.
The last line should look like System.out.println("other thing");
The last two lines print not into the log file but to STDOUT.
You should see an error message, something like "unable to resolve class Logger"; if you don't see it, your listener is not being executed at all.
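Putting that advice together, a JSR223 Groovy listener body along these lines should get messages into jmeter.log (a minimal sketch; the message text is just illustrative):
// log is pre-defined for JSR223 elements, no import or LoggingManager needed
log.info("your log message")
log.error("your error message")
// these two go to STDOUT (the console JMeter was launched from), not jmeter.log
println("something")
System.out.println("other thing")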

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might have stumbled across a bug and wanted to get other people's input. I am running a PySpark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
Here is what I am seeing:
After 60 seconds of not seeing any heartbeat message from the workers, the master prints this message to its log:
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code and from what I can tell, as long as the executor is alive it should send a heartbeat message back as it is using a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing I noticed while running top on one of the workers is that the executor JVM seems to be suspended throughout this process. There are as many Python processes as I specified in the SPARK_WORKER_CORES env variable, and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing this same issue; increasing the interval worked.
Excerpt from the start-all.sh logs:
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add the following configs to $SPARK_HOME/conf/spark-defaults.conf:
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
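If editing spark-defaults.conf is inconvenient, the application-level settings can also be passed per job on the command line (a sketch; app.py is a placeholder for your application):
spark-submit \
  --conf spark.network.timeout=50000 \
  --conf spark.executor.heartbeatInterval=5000 \
  app.py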

Spark Jobserver fails just by receiving a job request

Jobserver 0.7.0 has 4 GB of RAM available and 10 GB for the context, and the system has 3 more GB free. The context had been running for a while, and at the moment it received a request it failed without any error. The request is the same as other ones it had processed while it was up; it is not a special one. The following excerpt is from the Jobserver log: as you can see, the last successful job finished at 03:08:23,341, and when the next one was received the driver commanded a shutdown.
[2017-05-16 03:08:23,340] INFO output.FileOutputCommitter [] [] - Saved output of task 'attempt_201705160308_0321_m_000199_0' to file:/value_iq/spark-warehouse/spark_cube_users_v/tenant_id=7/_temporary/0/task_201705160308_0321_m_000199
[2017-05-16 03:08:23,340] INFO pred.SparkHadoopMapRedUtil [] [] - attempt_201705160308_0321_m_000199_0: Committed
[2017-05-16 03:08:23,341] INFO he.spark.executor.Executor [] [] - Finished task 199.0 in stage 321.0 (TID 49474). 2738 bytes result sent to driver
[2017-05-16 03:39:02,195] INFO arseGrainedExecutorBackend [] [] - Driver commanded a shutdown
[2017-05-16 03:39:02,239] INFO storage.memory.MemoryStore [] [] - MemoryStore cleared
[2017-05-16 03:39:02,254] INFO spark.storage.BlockManager [] [] - BlockManager stopped
[2017-05-16 03:39:02,363] ERROR arseGrainedExecutorBackend [] [] - RECEIVED SIGNAL TERM
[2017-05-16 03:39:02,404] INFO k.util.ShutdownHookManager [] [] - Shutdown hook called
[2017-05-16 03:39:02,412] INFO k.util.ShutdownHookManager [] [] - Deleting directory /tmp/spark-556033e2-c456-49d6-a43c-ef2cd3494b71/executor-b3ceaf84-e66a-45ed-acfe-1052ab1de2f8/spark-87671e4f-54da-47d7-a077-eb5f75d07e39
The Spark worker just logs the following:
17/05/15 19:25:54 INFO ExternalShuffleBlockResolver: Registered executor AppExecId{appId=app-20170515192550-0004, execId=0} with ExecutorShuffleInfo{localDirs=[/tmp/spark-556033e2-c456-49d6-a43c-ef2cd3494b71/executor-b3ceaf84-e66a-45ed-acfe-1052ab1de2f8/blockmgr-eca888c0-4e63-421c-9e61-d959ee45f8e9], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/05/16 03:39:02 INFO Worker: Asked to kill executor app-20170515192550-0004/0
17/05/16 03:39:02 INFO ExecutorRunner: Runner thread for executor app-20170515192550-0004/0 interrupted
17/05/16 03:39:02 INFO ExecutorRunner: Killing process!
17/05/16 03:39:02 INFO Worker: Executor app-20170515192550-0004/0 finished with state KILLED exitStatus 0
17/05/16 03:39:02 INFO Worker: Cleaning up local directories for application app-20170515192550-0004
17/05/16 03:39:07 INFO ExternalShuffleBlockResolver: Application app-20170515192550-0004 removed, cleanupLocalDirs = true
17/05/16 03:39:07 INFO ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=app-20170515192550-0004, execId=0}'s 1 local dirs
And the Master log:
17/05/16 03:39:02 INFO Master: Received unregister request from application app-20170515192550-0004
17/05/16 03:39:02 INFO Master: Removing app app-20170515192550-0004
17/05/16 03:39:02 INFO Master: 157.97.107.150:33928 got disassociated, removing it.
17/05/16 03:39:02 INFO Master: 157.97.107.150:55444 got disassociated, removing it.
17/05/16 03:39:02 WARN Master: Got status update for unknown executor app-20170515192550-0004/0
Before receiving this request Spark wasn't executing any other job; the context was using 5.3 GB/10 GB and the driver 1.3 GB/4 GB.
What does "Driver commanded a shutdown" mean?
Is there any log property that can be changed to see more details in the logs?
How can a simple request just break the context?

Unable to start Cassandra

I am trying to start Cassandra, so I ran:
sudo ./cassandra
I came across this error:
Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException: node24.nise.local: node24.nise.local
so I did what was mentioned in the linked "problem on starting cassandra" question and changed the /etc/hosts file.
Then the starting process got stuck after this:
INFO 22:27:14,227 CFS(Keyspace='system', ColumnFamily='local') liveRatio is 33.904761904761905 (just-counted was 33.904761904761905). calculation took 110ms for 3 cells
INFO 22:27:14,260 Enqueuing flush of Memtable-local#726006040(84/840 serialized/live bytes, 4 ops)
INFO 22:27:14,262 Writing Memtable-local#726006040(84/2848 serialized/live bytes, 4 ops)
INFO 22:27:14,280 Completed flushing /var/lib/cassandra/data/system/local/system-local-jb-50-Data.db (116 bytes) for commitlog position ReplayPosition(segmentId=1401859631027, position=500327)
WARN 22:27:14,327 setting live ratio to maximum of 64.0 instead of Infinity
INFO 22:27:14,327 Enqueuing flush of Memtable-local#1689909512(10100/101000 serialized/live bytes, 259 ops)
INFO 22:27:14,328 CFS(Keyspace='system', ColumnFamily='local') liveRatio is 64.0 (just-counted was 64.0). calculation took 0ms for 0 cells
INFO 22:27:14,350 Writing Memtable-local#1689909512(10100/101000 serialized/live bytes, 259 ops)
INFO 22:27:14,386 Completed flushing /var/lib/cassandra/data/system/local/system-local-jb-51-Data.db (5278 bytes) for commitlog position ReplayPosition(segmentId=1401859631027, position=512328)
INFO 22:27:14,493 Node localhost/127.0.0.1 state jump to normal
No other line was printed after this. Can anyone help me understand why exactly this happened?
I was getting the same error.
You just need to run this command in the command prompt:
hostname localhost (or the hostname of the machine where Cassandra is running)
I believe this will solve your problem.
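Equivalently, you can map the unresolvable host name from the error to the loopback address in /etc/hosts. A sketch of such an entry, using the host name from the question above:
127.0.0.1    localhost node24.nise.local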
I think that after this statement:
INFO 22:27:14,493 Node localhost/127.0.0.1 state jump to normal
your server is running normally. To verify, run jps and check whether CassandraDaemon is running.
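For example, output along these lines (the PID is just illustrative) would mean the daemon is up:
$ jps
12345 CassandraDaemon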

Cannot start Cassandra with "bin/cassandra -f"

I have a problem using Cassandra: I can start it with "bin/cassandra", but cannot start it with "bin/cassandra -f". Does anyone know the reason?
Here is the detailed info:
root@server1:~/cassandra# bin/cassandra -f
INFO 10:51:31,500 JNA not found. Native methods will be disabled.
INFO 10:51:31,740 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO 10:51:32,043 Deleted /var/lib/cassandra/data/system/LocationInfo-61-Data.db
INFO 10:51:32,044 Deleted /var/lib/cassandra/data/system/LocationInfo-62-Data.db
INFO 10:51:32,052 Deleted /var/lib/cassandra/data/system/LocationInfo-63-Data.db
INFO 10:51:32,053 Deleted /var/lib/cassandra/data/system/LocationInfo-64-Data.db
INFO 10:51:32,063 Sampling index for /var/lib/cassandra/data/system/LocationInfo-65-Data.db
INFO 10:51:32,117 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-5-Data.db
INFO 10:51:32,118 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-6-Data.db
INFO 10:51:32,120 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-7-Data.db
INFO 10:51:32,131 Replaying /var/lib/cassandra/commitlog/CommitLog-1285869561954.log
INFO 10:51:32,143 Finished reading /var/lib/cassandra/commitlog/CommitLog-1285869561954.log
INFO 10:51:32,145 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1286301092145.log
INFO 10:51:32,153 Standard2 has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1286301092145.log', position=121)
INFO 10:51:32,155 Enqueuing flush of Memtable-Standard2#1811560891(29 bytes, 1 operations)
INFO 10:51:32,156 Writing Memtable-Standard2#1811560891(29 bytes, 1 operations)
INFO 10:51:32,200 Completed flushing /var/lib/cassandra/data/Keyspace1/Standard2-8-Data.db
INFO 10:51:32,203 Compacting [org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-5-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-6-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-7-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-8-Data.db')]
INFO 10:51:32,214 Recovery complete
INFO 10:51:32,214 Log replay complete
INFO 10:51:32,230 Saved Token found: 47408016217042861442279446207060121025
INFO 10:51:32,230 Saved ClusterName found: Test Cluster
INFO 10:51:32,231 Saved partitioner not found. Using org.apache.cassandra.dht.RandomPartitioner
INFO 10:51:32,250 LocationInfo has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1286301092145.log', position=345)
INFO 10:51:32,250 Enqueuing flush of Memtable-LocationInfo#1120194637(95 bytes, 2 operations)
INFO 10:51:32,251 Writing Memtable-LocationInfo#1120194637(95 bytes, 2 operations)
INFO 10:51:32,307 Completed flushing /var/lib/cassandra/data/system/LocationInfo-66-Data.db
INFO 10:51:32,316 Starting up server gossip
INFO 10:51:32,329 Compacted to /var/lib/cassandra/data/Keyspace1/Standard2-9-Data.db. 1670/1440 bytes for 6 keys. Time: 125ms.
INFO 10:51:32,366 Binding thrift service to /172.24.0.80:9160
INFO 10:51:32,369 Cassandra starting up...
I can't see any problems. (-f is short for 'foreground'.)
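In other words, the only difference between the two invocations is where the process runs; a quick sketch:
bin/cassandra        # forks into the background and detaches from the terminal
bin/cassandra -f     # stays in the foreground, logging to the console; Ctrl+C stops it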
