I am running a Stanford CoreNLP server:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 50000
It seems that it only uses one core when processing texts. Is it possible to run the Stanford CoreNLP server in a multithreaded way, so that it utilizes more than one core?
This is correct; every request to the server only uses one core. You can get parallelism by making multiple server requests at once. These will run in parallel up to the number of cores on the server (or the value of -threads passed to the server executable), and after that jobs will queue up in a thread pool.
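For instance, here is a rough Java sketch of firing several requests at the server concurrently; the annotators, documents, and pool size are placeholders, not part of the original answer:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCoreNlpClient {

    // Annotate one document by POSTing it to the server started on port 9001.
    static String annotate(String text) throws Exception {
        String props = URLEncoder.encode(
                "{\"annotators\":\"tokenize,ssplit,pos\",\"outputFormat\":\"json\"}", "UTF-8");
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9001/?properties=" + props).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) > 0; ) {
                buf.write(chunk, 0, n);
            }
            return buf.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> documents = Arrays.asList("First document.", "Second document.", "Third document.");
        // One request per document; the server runs them in parallel up to its -threads value.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<>();
        for (String doc : documents) {
            results.add(pool.submit(() -> annotate(doc)));
        }
        for (Future<String> result : results) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }
}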
The PM2 process manager allows you to launch Node.js processes in fork and cluster modes.
I understand that cluster mode lets you launch n processes, where n is the number of cores in the machine. HTTP, TCP, or UDP load is then automatically balanced between these processes.
I am wondering whether this load balancing also happens for AMQP messaging traffic.
I have a bunch (around 10) of JavaScript scripts that consume messages via RabbitMQ (which implements the AMQP protocol). These scripts are launched by PM2 in cluster mode on a 4-core machine, which brings us to 4 instances per script.
Is the cluster mode making any difference in the previous scenario?
Is there some form of load balancing taking place when using RabbitMQ?
I'm setting up functional tests for applications running with Spark Streaming and Kafka. The steps to be done are:
Start zookeeper server
Start kafka server
Start message producer to feed kafka with necessary data
Start Spark Streaming application
Wait for 5 minutes
Stop message producer
Stop Spark Streaming application
Stop kafka server
Stop zookeeper server
Validate output
What is the professional way to do this other than simple bash script?
I think this is quite a general question, not strictly related to Spark Streaming and Kafka. Maybe there are some testing frameworks which support setting up the environment, running multiple processes in parallel, and data validation/assertions.
Unfortunately, there is no all-in-one framework out there.
One-line answer would be: use docker-compose with the simplest unit-testing or gherkin-based framework of your choice.
Considering the steps above as:
Start the env
Generate Kafka messages / Validate
Shut down the env
Docker Compose would be the best choice for steps #1 and #3.
version: '2'
services:
  kafka:
    # this container already has Zookeeper built in
    image: spotify/kafka
    ports:
      - 2181:2181
      - 9092:9092
  # this is just a mock Spark container; you'll have to replace it with a
  # Docker container that can host your Spark app
  spark:
    image: epahomov/docker-spark:lightweighted
    depends_on:
      - kafka
The idea of the compose file is that you can start your env with one command:
docker-compose up
And the environment setup will be pretty much portable across dev machines and build servers.
For step #2, any test framework will do.
The scenario would look like:
Start the environment / Make sure it's started
Start generating messages
Make assertions / Sleep my sweet thread
Shut down the env
Talking about frameworks:
Scala: ScalaTest. There you get a good spectrum of async assertions and parallel processing.
Python: Behave (be careful with multiprocessing there) or a unit-testing framework such as pytest.
Do not let the name "unit-testing framework" confuse you.
Only the test environment defines whether a test becomes a unit, module, system, or integration test, not the tool.
If a person uses a unit-test framework and writes
MyZookeeperConnect("192.168.99.100:2181") in it, it's not a unit test anymore; even a unit-test framework can't help it :)
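As an illustration of step #2 with a plain unit-testing framework, here is a rough JUnit sketch against the compose environment above; the topic name, bootstrap address, and the output check are placeholders for your own application:

import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.junit.Test;

import static org.junit.Assert.assertTrue;

public class StreamingSmokeTest {

    @Test
    public void producesMessagesAndValidatesOutput() throws Exception {
        // The environment is assumed to be up already (docker-compose up).
        // Generate messages against the Kafka broker exposed by the compose file.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                producer.send(new ProducerRecord<>("input-topic", "key-" + i, "value-" + i));
            }
        }

        // "Sleep my sweet thread" while the streaming app processes the data.
        TimeUnit.MINUTES.sleep(5);

        // Make assertions against whatever sink the streaming app writes to
        // (HDFS path, output topic, database, ...) - this check is a placeholder.
        assertTrue("streaming output should exist", checkOutputSink());
    }

    private boolean checkOutputSink() {
        // placeholder: query the real output location of your Spark Streaming app here
        return true;
    }
}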
To glue steps #1, #2, and #3 together, simple bash would be my choice.
Consider using the Citrus test framework (http://citrusframework.org/), which could be the all-in-one test framework for you.
Zookeeper access: check
Docker integration: check
Kafka integration via Apache Camel: check
Waiting for x period of time: check
Validating outcome: check
Also consider using the Fabric8 Docker Maven plugin (https://github.com/fabric8io/docker-maven-plugin) to set up the Docker test environment before the Citrus tests are executed within the same build run.
Here is an example for both tools working together for automated integration testing: https://github.com/christophd/citrus-samples/tree/master/sample-docker
I am using Cloudera 5.4.1 with Spark 1.3.0. When I go to spark history server, I can see list of completed jobs and list of incomplete jobs.
However, many of the jobs listed as incomplete are ones that were killed.
So how does one see the list of "running" jobs, not the ones that were killed?
Also, how does one kill a running Spark job using the application ID from the history server?
The following is from the Cloudera documentation:
To access the web application UI of a running Spark application, open http://spark_driver_host:4040 in a web browser. If multiple applications are running on the same host, the web application binds to successive ports beginning with 4040 (4041, 4042, and so on). The web application is available only for the duration of the application.
See the Cloudera documentation for CDH 5.4.x and CDH 5.9.x.
Answer to your second question:
You can use yarn CLI to kill the Spark application.
Ex: yarn application -kill <application ID>
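For the first question, the YARN CLI can also list the currently running applications (as opposed to the killed ones), and you can take the application ID from there:
Ex: yarn application -list -appStates RUNNING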
I have a Spark streaming application running in a yarn-cluster mode reading from a Kafka topic.
I want to connect JConsole or Java VisualVM to these remote processes in a Cloudera distribution to gather some performance benchmarks.
How would I go about doing that?
The way I've done this is to set/add the following property (this also enables Java Flight Recorder):
spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=0
If you have only one worker running on each box, you can set the port to a fixed value. If you have multiple, then you need to go with port 0 and then use lsof to find which port got assigned.
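For example, one way to pass that property when submitting (a sketch; the master, deploy mode, and jar name are placeholders for your own setup):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=0" \
  your-streaming-app.jar
Once you know the executor host and the assigned JMX port, point JConsole or VisualVM at host:port.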
I have a Java-based server accepting client requests, and the requests are CPU-bound jobs with no dependencies between them. My server is equipped with a thread pool whose number of threads equals the number of processors (or cores) in the system, but server performance is low and client requests wait for thread availability. Can a cluster help me in this scenario? I want to use a cluster and distribute the jobs to nodes so that clients' wait time can be eliminated. Please help me in this regard. Also, tell me which framework I should use. Can RMI help me? Should I use Hazelcast?
You can use the distributed ExecutorService (Hazelcast's IExecutorService) to distribute your operations to the different nodes and offload them to each node's own thread pool.
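A minimal Hazelcast 3.x-style sketch of that idea (the executor name and the job body are placeholders):

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

public class DistributedJobs {

    // The task must be serializable so it can be shipped to another member.
    static class CpuBoundJob implements Callable<Long>, Serializable {
        private final long input;

        CpuBoundJob(long input) {
            this.input = input;
        }

        @Override
        public Long call() {
            // placeholder for the real CPU-bound work
            long result = 0;
            for (long i = 0; i < input; i++) {
                result += i * i;
            }
            return result;
        }
    }

    public static void main(String[] args) throws Exception {
        // Each node in the cluster runs a Hazelcast member like this one.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Submit to a cluster-wide executor; Hazelcast picks a member to run each task.
        IExecutorService executor = hz.getExecutorService("jobs");
        Future<Long> future = executor.submit(new CpuBoundJob(1_000_000L));
        System.out.println("result = " + future.get());

        hz.shutdown();
    }
}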
There are some pretty good compute grid frameworks that will do the job. You can start by googling "java grid computing" or "java cluster computing". To name a few:
JPPF
GridGain
HTCondor
Hadoop
Unicore
etc ...