How to limit the number of services launched on one CoreOS node - coreos

Here is my problem:
fleetctl list-units
UNIT MACHINE
processing-node@1.service X.Y.Z.86
processing-node@10.service X.Y.Z.150
processing-node@11.service X.Y.Z.48
processing-node@12.service X.Y.Z.48
processing-node@13.service X.Y.Z.48
processing-node@14.service X.Y.Z.86
processing-node@15.service X.Y.Z.82
processing-node@16.service X.Y.Z.48
processing-node@2.service X.Y.Z.248
processing-node@3.service X.Y.Z.48
processing-node@4.service X.Y.Z.85
processing-node@5.service X.Y.Z.48
processing-node@6.service X.Y.Z.48
processing-node@7.service X.Y.Z.48
processing-node@8.service X.Y.Z.87
processing-node@9.service X.Y.Z.248
worker-cache@1.service X.Y.Z.248
worker-cache@2.service X.Y.Z.222
worker-cache@3.service X.Y.Z.87
worker-cache@4.service X.Y.Z.150
worker-cache@5.service X.Y.Z.82
worker-cache@6.service X.Y.Z.85
worker-cache@7.service X.Y.Z.48
worker-cache@8.service X.Y.Z.86
The cluster is composed of ten machines. The worker-cache units need a lot of computing power, so they exclude each other in the service file:
tail -2 worker-cache@.service
[X-Fleet]
Conflicts=worker-cache@*
So we have only one worker-cache unit per node. The processing-node units need less power and can be spawned on the same machines as the worker-cache units, but I would like to have at most two of them per machine, which is definitely not the case at the moment:
processing-node@11.service X.Y.Z.48
processing-node@12.service X.Y.Z.48
processing-node@13.service X.Y.Z.48
processing-node@16.service X.Y.Z.48
processing-node@3.service X.Y.Z.48
processing-node@5.service X.Y.Z.48
processing-node@6.service X.Y.Z.48
processing-node@7.service X.Y.Z.48
Is there a way to do that?
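For what it's worth, fleet's Conflicts only expresses mutual exclusion between unit-name patterns, so one possible workaround (just a sketch, with hypothetical template names processing-node-a@ and processing-node-b@) is to split the processing-node template into two groups that each conflict only with themselves. At most one instance of each group can then land on a machine, giving at most two processing-node units per machine:
tail -2 processing-node-a@.service
[X-Fleet]
Conflicts=processing-node-a@*
The drawback is that you have to spread instances across the two templates yourself when you submit them.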

Related

Pyspark erroring with an AM Container limit error

All,
We have Apache Spark v3.12 + YARN on AKS (SQL Server 2019 BDC). We ran Python code refactored to PySpark, which resulted in the error below:
Application application_1635264473597_0181 failed 1 times (global
limit =2; local limit is =1) due to AM Container for
appattempt_1635264473597_0181_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: [2021-11-12 15:00:16.915]Container
[pid=12990,containerID=container_1635264473597_0181_01_000001] is
running 7282688B beyond the 'PHYSICAL' memory limit. Current usage:
2.0 GB of 2 GB physical memory used; 4.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1635264473597_0181_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES)
FULL_CMD_LINE
|- 13073 12999 12990 12990 (python3) 7333 112 1516236800 235753
/opt/bin/python3
/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp/3677222184783620782
|- 12999 12990 12990 12990 (java) 6266 586 3728748544 289538
/opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1
-Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp
-Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001
org.apache.spark.deploy.yarn.ApplicationMaster --class
org.apache.livy.rsc.driver.RSCDriverBootstrapper --properties-file
/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties
|- 12990 12987 12990 12990 (bash) 0 0 4304896 775 /bin/bash -c
/opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1
-Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp
-Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001
org.apache.spark.deploy.yarn.ApplicationMaster --class
'org.apache.livy.rsc.driver.RSCDriverBootstrapper' --properties-file
/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties
1>
/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stdout
2>
/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stderr
[2021-11-12 15:00:16.921]Container killed on request. Exit code is 143
[2021-11-12 15:00:16.940]Container exited with a non-zero exit code
143.
For more detailed output, check the application tracking page:
https://sparkhead-0.mssql-cluster.everestre.net:8090/cluster/app/application_1635264473597_0181 Then click on links to logs of each attempt.
. Failing the application.
The default settings are as below, and there are no runtime overrides:
"settings": {
"spark-defaults-conf.spark.driver.cores": "1",
"spark-defaults-conf.spark.driver.memory": "1664m",
"spark-defaults-conf.spark.driver.memoryOverhead": "384",
"spark-defaults-conf.spark.executor.instances": "1",
"spark-defaults-conf.spark.executor.cores": "2",
"spark-defaults-conf.spark.executor.memory": "3712m",
"spark-defaults-conf.spark.executor.memoryOverhead": "384",
"yarn-site.yarn.nodemanager.resource.memory-mb": "12288",
"yarn-site.yarn.nodemanager.resource.cpu-vcores": "6",
"yarn-site.yarn.scheduler.maximum-allocation-mb": "12288",
"yarn-site.yarn.scheduler.maximum-allocation-vcores": "6",
"yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent": "0.34".
}
Is the AM Container mentioned here the Application Master container or the Application Manager (of YARN)? If it is the Application Master, does that mean that in cluster mode the driver and the Application Master run in the same container?
What runtime parameter do I change to make the PySpark code run successfully?
Thanks,
grajee
Likely you don't need to change any settings. Exit code 143 could mean a lot of things, including that you ran out of memory. To test whether you ran out of memory, I'd reduce the amount of data you are using and see if your code starts to work. If it does, it's likely you ran out of memory and should consider refactoring your code. In general I suggest trying code changes first before making Spark config changes.
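For reference, a back-of-the-envelope check (mine, not part of the original answer): in YARN cluster mode the driver runs inside the AM container, and from the posted defaults the container size works out to driver memory plus overhead, which matches the 2 GB limit in the log:
spark-defaults-conf.spark.driver.memory         = 1664 MB
spark-defaults-conf.spark.driver.memoryOverhead =  384 MB
AM container size                               = 2048 MB = 2 GB ("2.0 GB of 2 GB physical memory used")
So anything that pushes the Python and JVM driver processes past roughly 2 GB of resident memory gets the container killed, which is consistent with the -104 and 143 exit codes in the diagnostics.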
For an understanding of how spark driver works on yarn, here's a reasonable explanation: https://sujithjay.com/spark/with-yarn

Logstash not loading config pipeline on startup

I installed the ELK stack (Elasticsearch / Logstash / Kibana, version 5.6) on a CentOS 7 box using the RPM method, and have the services enabled to run on startup. I verify it is running after reboot like so:
ps aux | grep logstash
logstash 744 4.6 2.2 3381144 363308 ? SNsl 09:06 0:48 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -Djava.awt.headless=true -Dfile.encoding=UTF-8 -XX:+HeapDumpOnOutOfMemoryError -Xmx1g -Xms256m -Xss2048k -Djffi.boot.library.path=/usr/share/logstash/vendor/jruby/lib/jni -Xbootclasspath/a:/usr/share/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/usr/share/logstash/vendor/jruby -Djruby.lib=/usr/share/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main /usr/share/logstash/lib/bootstrap/environment.rb logstash/runner.rb --path.settings /etc/logstash
I don't see anything in that process command line regarding my pipeline configurations. In my settings file I have path.config: /etc/logstash/conf.d/*.conf.
This is modified from the default path.config: /etc/logstash/conf.d/, which according to the documentation does the exact same thing (and I've tried both).
The 3 files I have are valid, as I can ingest data manually using:
/usr/share/logstash/bin/logstash -f '/etc/logstash/conf.d/{fileA,fileB,fileC}.conf' (or I can run them individually).
I want these ingestion pipelines to start when my server reboots, and I believe I have my configuration set correctly. I even re-ran the system-install script for good measure. Any ideas?
I ended up manually changing the systemd config for logstash by going to /etc/systemd/system/logstash.service and changing:
ExecStart=/usr/share/logstash/bin/logstash "--path.settings" "/etc/logstash"
to:
ExecStart=/usr/share/logstash/bin/logstash "--path.settings" "/etc/logstash" "-f" "/etc/logstash/conf.d/{fileA,fileB,fileC}.conf"
After rebooting, I discovered the process was indeed loading these pipelines.
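A side note that is not from the original answer: if you edit the unit file in place like this, systemd needs to reload its unit definitions before the change takes effect. A reboot (as done here) picks it up, but without one you would run:
sudo systemctl daemon-reload
sudo systemctl restart logstash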

Why does my worker node fail after submitting a topology?

Here is my minimal multi-container (Storm 1.1.0) setup on Kubernetes on Azure, using Azure Container Service:
- 1 zookeeper container
- 1 nimbus container (UI is running on the same container)
- 1 worker container
All containers are connected properly; I can see that in the Storm UI and the ZooKeeper logs.
But when I deploy the test topology, the worker node fails (the supervisor process waits for the worker to start and then restarts it). It seems like the worker process is not started by the supervisor.
Here is my storm.yaml config for the worker node:
storm.zookeeper.servers:
- "10.0.xx.xxx"
storm.zookeeper.port: 2181
nimbus.seeds: ["10.0.xxx.xxx"]
storm.local.dir: "/storm/datadir"
Could you please help me out?
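One general Storm 1.x debugging step that may help (an assumption on my part; the paths come from a default Storm layout and are not given in the question): the supervisor and worker logs on the worker container usually say why a worker never came up.
# inside the worker container, relative to the Storm installation directory
tail -f logs/supervisor.log
ls logs/workers-artifacts/    # per-topology, per-port worker.log files in Storm 1.x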

Spark master and worker seem to run on different JVM version

In standalone mode, the master process uses /usr/bin/java, which resolves to JVM 1.8, and the worker process uses /usr/lib/jvm/java/bin/java, which resolves to 1.7. In my Spark application I'm using some APIs introduced in 1.8.
Looking at the stack trace, one line that comes up is Caused by: java.lang.NoClassDefFoundError: Could not initialize class SomeClassDefinedByMe, which internally creates an instance from java.time, which I believe is only in JDK 1.8.
How do I force the worker to use JVM 1.8?
Update:
For now I renamed /usr/lib/jvm/java/bin/java and created a link that points to /usr/bin/java. This solved the problem, but I would still like to know why the two processes use different binary locations and where this is set.
On each worker node, edit ${SPARK_HOME}/conf/spark-env.sh and define the appropriate JAVA_HOME, pointing at the JDK installation directory rather than at the java binary itself, e.g.
export JAVA_HOME=/usr/java/jdk1.8.0_92
That file is sourced by ${SPARK_HOME}/bin/load-spark-env.sh which is invoked by each and every Spark command-line utility:
${SPARK_HOME}/bin/spark-shell via ${SPARK_HOME}/bin/spark-class
${SPARK_HOME}/bin/spark-submit via ${SPARK_HOME}/bin/spark-class
...
${SPARK_HOME}/sbin/start-slave.sh
...
Side note: Linux alternatives is the standard way to define which JVM is on top of your PATH...
Typical setup with a "fixed" setting, not relying on the priority set by the OpenJDK RPM install:
$ ls -AFl $(which java)
lrwxrwxrwx. 1 root root 22 Feb 15 16:06 /usr/bin/java -> /etc/alternatives/java*
$ alternatives --display java | grep -v slave
java - status is manual.
link currently points to /usr/java/jdk1.8.0_92/jre/bin/java
/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java - priority 18091
/usr/lib/jvm/jre-1.6.0-openjdk.x86_64/bin/java - priority 16000
/usr/java/jdk1.8.0_92/jre/bin/java - priority 18092
Current `best' version is /usr/java/jdk1.8.0_92/jre/bin/java.
...provided that $PATH is defined properly for the Linux account that launches the Spark slaves!
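For completeness, and assuming the standard RHEL/CentOS alternatives tool rather than anything Spark-specific, switching the java link to the JDK 8 binary shown above looks like:
# point the java alternative at a specific binary
sudo alternatives --set java /usr/java/jdk1.8.0_92/jre/bin/java
# or choose interactively from the installed candidates
sudo alternatives --config java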

How to restart an Erlang node?

I have pushed a sample to GitHub.
It is an example application from the book Programming Erlang;
you can follow the README.md.
My question: when the application sellaprime has started and I run
./bin/sp restart
the node simply goes down and does not come back up. Why is that?
The Erlang documentation says:
The system is restarted inside the running Erlang node, which means that the emulator is not restarted. All applications are taken down smoothly, all code is unloaded, and all ports are closed before the system is booted again in the same way as initially started. The same BootArgs are used again.
What does "emulator is not restarted" mean?
If I want to restart a node, what is the right way to do it?
By the way, is there any API to get the current release version, similar to
application:which_applications()
It looks like your sp init script, which uses the nodetool script, should call init:restart() for you. If it does, but your node shuts down instead, check your logs for any possible errors (perhaps one of your applications cannot handle a restart?).
Using init:restart() is the way to do it though. Here's an example: start an Erlang node with a name (in this case, test):
$ erl -sname test
Erlang/OTP 18 [erts-7.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V7.0 (abort with ^G)
(test@host)1> hello.
hello
(test@host)2>
Temporarily start another node that will make an RPC call to the first node:
$ erl -sname other -noinput -noshell -eval "rpc:call('test@host', init, restart, [])" -s init stop
$
Observe the original node being restarted:
(test@host)2> Erlang/OTP 18 [erts-7.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V7.0 (abort with ^G)
(test@host)1>
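As for the side question about the current release version (not covered above, so take this as a pointer rather than a definitive recipe): erlang:system_info(otp_release) returns the OTP release string, and on a node booted from an OTP release with SASL running, release_handler:which_releases() lists the installed releases and their status.
%% OTP release string, e.g. "18" -- always available
erlang:system_info(otp_release).
%% Installed releases and their status -- requires SASL / a proper OTP release
release_handler:which_releases().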
