How do I start ArangoDB cluster in active failover mode? - arangodb

I am trying to follow this guide, but I can't get it to work: https://docs.arangodb.com/devel/Manual/Deployment/ActiveFailover/ManualStart.html
ArangoDB version 3.3.7.
I am trying to start 3 nodes (3 agents). I believe only 2 dbservers are officially supported, but I have heard that 3 dbservers should also work.

Ok, it wasn't easy, but here are the commands. Note that you probably don't want to expose the database to the world without authentication (and maybe traffic encryption).
Server 1 (172.31.54.123):
arangod --server.endpoint tcp://172.31.54.123:8531 --server.storage-engine rocksdb --database.directory /data/agent --uid arangodb --gid arangodb --server.jwt-secret verysecret --log.file /data/agent.log --log.force-direct false --foxx.queues false --server.statistics false --agency.activate true --agency.size 3 --agency.supervision true --agency.my-address tcp://172.31.54.123:8531 --agency.endpoint tcp://172.31.63.137:8531 --agency.endpoint tcp://172.31.48.49:8531
arangod --server.endpoint tcp://172.31.54.123:8529 --server.storage-engine rocksdb --database.directory /data/arangodb --uid arangodb --gid arangodb --server.jwt-secret verysecret --log.file /data/arangod.log --log.force-direct false --foxx.queues false --server.statistics true --replication.automatic-failover true --cluster.my-address tcp://172.31.54.123:8529 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://172.31.54.123:8531 --cluster.agency-endpoint tcp://172.31.63.137:8531 --cluster.agency-endpoint tcp://172.31.48.49:8531
Server 2 (172.31.63.137):
arangod --server.endpoint tcp://172.31.63.137:8531 --server.storage-engine rocksdb --database.directory /data/agent --uid arangodb --gid arangodb --server.jwt-secret verysecret --log.file /data/agent.log --log.force-direct false --foxx.queues false --server.statistics false --agency.activate true --agency.size 3 --agency.supervision true --agency.my-address tcp://172.31.63.137:8531 --agency.endpoint tcp://172.31.63.137:8531 --agency.endpoint tcp://172.31.48.49:8531
arangod --server.endpoint tcp://172.31.63.137:8529 --server.storage-engine rocksdb --database.directory /data/arangodb --uid arangodb --gid arangodb --server.jwt-secret verysecret --log.file /data/arangod.log --log.force-direct false --foxx.queues false --server.statistics true --replication.automatic-failover true --cluster.my-address tcp://172.31.63.137:8529 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://172.31.54.123:8531 --cluster.agency-endpoint tcp://172.31.63.137:8531 --cluster.agency-endpoint tcp://172.31.48.49:8531
Server 3 (172.31.48.49):
arangod --server.endpoint tcp://172.31.48.49:8531 --server.storage-engine rocksdb --database.directory /data/agent --uid arangodb --gid arangodb --server.jwt-secret verysecret --log.file /data/agent.log --log.force-direct false --foxx.queues false --server.statistics false --agency.activate true --agency.size 3 --agency.supervision true --agency.my-address tcp://172.31.48.49:8531 --agency.endpoint tcp://172.31.63.137:8531 --agency.endpoint tcp://172.31.48.49:8531
arangod --server.endpoint tcp://172.31.48.49:8529 --server.storage-engine rocksdb --database.directory /data/arangodb --uid arangodb --gid arangodb --server.jwt-secret verysecret --log.file /data/arangod.log --log.force-direct false --foxx.queues false --server.statistics true --replication.automatic-failover true --cluster.my-address tcp://172.31.48.49:8529 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://172.31.54.123:8531 --cluster.agency-endpoint tcp://172.31.63.137:8531 --cluster.agency-endpoint tcp://172.31.48.49:8531
IP addresses will of course be different, but overall this should work.
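Since the six commands differ only in the IP addresses, they can be generated from a list of hosts. The following is a minimal sketch, not an official tool: the helper names (endpoint, agent_args, single_server_args) are hypothetical, only a subset of the flags from the commands above is reproduced, and it assumes that every agent may list all agent endpoints (including its own) via --agency.endpoint, which differs slightly from the hand-written commands.

```python
AGENT_PORT = 8531   # port used for the agents in the commands above
SERVER_PORT = 8529  # port used for the single servers in the commands above


def endpoint(ip, port):
    """Format an endpoint the way arangod expects it."""
    return f"tcp://{ip}:{port}"


def agent_args(my_ip, all_ips):
    """Argument list for the agent on one host (subset of the flags above)."""
    args = [
        "arangod",
        "--server.endpoint", endpoint(my_ip, AGENT_PORT),
        "--database.directory", "/data/agent",
        "--agency.activate", "true",
        "--agency.size", str(len(all_ips)),
        "--agency.supervision", "true",
        "--agency.my-address", endpoint(my_ip, AGENT_PORT),
    ]
    # Assumption: each agent is told about every agent endpoint.
    for ip in all_ips:
        args += ["--agency.endpoint", endpoint(ip, AGENT_PORT)]
    return args


def single_server_args(my_ip, all_ips):
    """Argument list for the active-failover single server on one host."""
    args = [
        "arangod",
        "--server.endpoint", endpoint(my_ip, SERVER_PORT),
        "--database.directory", "/data/arangodb",
        "--replication.automatic-failover", "true",
        "--cluster.my-address", endpoint(my_ip, SERVER_PORT),
        "--cluster.my-role", "SINGLE",
    ]
    for ip in all_ips:
        args += ["--cluster.agency-endpoint", endpoint(ip, AGENT_PORT)]
    return args


ips = ["172.31.54.123", "172.31.63.137", "172.31.48.49"]
for ip in ips:
    print(" ".join(agent_args(ip, ips)))
    print(" ".join(single_server_args(ip, ips)))
```

The remaining flags (storage engine, uid/gid, JWT secret, log file and so on) would need to be appended to both lists before the output matches the commands above.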

Please use the ArangoDB starter to start an ActiveFailover deployment, as described here:
https://docs.arangodb.com/devel/Manual/Deployment/ActiveFailover/UsingTheStarter.html
Or use it to start a cluster, where you may employ an arbitrary number of db servers and coordinators:
https://docs.arangodb.com/devel/Manual/Tutorials/Starter


Pyspark session configuration overwriting

I recently came across the program below. After the new configuration values are assigned to a variable, why is the Spark context stopped?
Usually the Spark session is stopped at the end of the program, so why is it stopped here, right after the new configuration is declared?
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()
default_conf = spark.sparkContext._conf.getAll()
print(default_conf)
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'),
                                        ('spark.app.name', 'Spark Updated Conf'),
                                        ('spark.executor.cores', '4'),
                                        ('spark.cores.max', '4'),
                                        ('spark.driver.memory', '4g')])
spark.sparkContext.stop()
spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config(conf=conf) \
    .getOrCreate()
default_conf = spark.sparkContext._conf.get("spark.cores.max")
print("updated configs ", default_conf)
I am trying to understand why this is done.
I think it's only example code to show that you cannot modify the config of a running session. You can create the session, get the config that was generated, then stop it and recreate it with updated values.
Otherwise, it doesn't make any sense, because you'd just use the config you needed from the beginning...
Note, however, that the default_conf variable at the end holds only the spark.cores.max setting.
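For comparison, here is a minimal sketch of the "use the config you needed from the beginning" approach. The names DESIRED_CONF and build_session are hypothetical, and pyspark is imported inside the function so the snippet can be loaded without Spark; actually calling build_session assumes a working PySpark installation.

```python
# Settings taken from the example above, applied before the session exists.
DESIRED_CONF = [
    ("spark.executor.memory", "4g"),
    ("spark.executor.cores", "4"),
    ("spark.cores.max", "4"),
]


def build_session(app_name, conf_pairs):
    """Create a session with every setting applied up front, so no stop/recreate is needed."""
    from pyspark.sql import SparkSession  # lazy import: requires PySpark to be installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf_pairs:
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (requires PySpark):
# spark = build_session("MyApp", DESIRED_CONF)
# print(spark.sparkContext.getConf().get("spark.cores.max"))
```

Note that getOrCreate() returns an already running session if one exists, which is why the original example stops the context before rebuilding.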

Spark Memory Management Calculation

I am new to Spark. I am using an AWS cluster of r5a.4xlarge instances with a minimum of 1 worker and a maximum of 16. Each instance has 128 GB of memory and 16 cores.
I have set spark.executor.cores to 5.
By the usual memory management calculation, the memory per executor is around 42 GB. After subtracting 10% for overhead memory, the net memory available is around 37 GB.
I have kept off-heap memory enabled. Whenever I try to use the Spark configuration below, I get the following error.
Error updating cluster for job S2C_ER_ENROLMENT_PREM_DTL_HISTORY_TEST
Specified heap memory (37888 MB) and off heap memory (73404 MB) is above the maximum executor memory (97871 MB) allowed for node type r5a.4xlarge.
I have three questions about the above error message.
Why is the maximum executor memory shown as 97871 MB? This is nowhere near my calculated result.
Why is the off-heap memory 73404 MB when I did not set it explicitly?
How is the off-heap memory calculated?
Below is the configuration I was trying to use.
spark.broadcast.compress true
spark.dynamicAllocation.enabled true
yarn.fail-fast true
spark.shuffle.reduceLocality.enabled true
spark.task.cpus 5
spark.dynamicAllocation.shuffleTracking.enabled true
mapreduce.shuffle.listen.queue.size 2048
spark.rpc.message.maxSize 1024
spark.storage.memoryFraction 0.8
spark.files.useFetchCache true
spark.scheduler.mode FAIR
spark.shuffle.compress true
spark.sql.adaptive.coalescePartitions.initialPartitionNum 3
spark.sql.adaptive.coalescePartitions.enabled true
mapreduce.shuffle.max.connections 100
spark.executor.cores 5
spark.executor.memory 37G
spark.driver.memory 37G
spark.driver.cores 5
spark.executor.instances 29
spark.sql.adaptive.coalescePartitions.parallelismFirst false
spark.storage.replication.proactive true
spark.sql.adaptive.skewJoin.enabled true
spark.network.timeout 300000
spark.broadcast.blockSize 128m
spark.sql.adaptive.coalescePartitions.minPartitionSize 256M
spark.akka.frameSize 1024
spark.speculation true
spark.cleaner.periodicGC.interval 12000
mapreduce.reduce.shuffle.memory.limit.percent 85
spark.logConf false
spark.executor.heartbeatInterval 200000
spark.worker.cleanup.enabled true
spark.sql.adaptive.enabled true
spark.sql.adaptive.advisoryPartitionSizeInBytes 1024M
spark.shuffle.io.preferDirectBufs true
mapreduce.map.log.level ERROR
spark.sql.adaptive.skewJoin.skewedPartitionFactor 6
spark.default.parallelism 40
spark.hadoop.databricks.fs.perfMetrics.enable false
mapreduce.shuffle.max.threads 99
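
The arithmetic in the question and in the error message can be reproduced with a few lines. This is only a sketch of the rule-of-thumb estimate described above plus a check of the numbers the platform reported; it does not explain how the platform derives the 97871 MB limit or the 73404 MB off-heap value, and all variable names are illustrative.

```python
# Figures taken from the question.
node_memory_gb = 128
node_cores = 16
cores_per_executor = 5
overhead_fraction = 0.10  # the 10% overhead the question subtracts

# Rule-of-thumb estimate: how many executors fit on a node, and what each gets.
executors_per_node = node_cores // cores_per_executor
memory_per_executor_gb = node_memory_gb / executors_per_node
heap_estimate_gb = memory_per_executor_gb * (1 - overhead_fraction)

# Figures taken from the error message.
heap_mb = 37 * 1024        # spark.executor.memory 37G expressed in MB
off_heap_mb = 73404
max_executor_mb = 97871

requested_mb = heap_mb + off_heap_mb
excess_mb = requested_mb - max_executor_mb
off_heap_headroom_mb = max_executor_mb - heap_mb

print(f"executors per node: {executors_per_node}")
print(f"memory per executor: {memory_per_executor_gb:.1f} GB")
print(f"heap after overhead: {heap_estimate_gb:.1f} GB")
print(f"requested heap + off-heap: {requested_mb} MB")
print(f"over the limit by: {excess_mb} MB")
print(f"off-heap room left with a 37G heap: {off_heap_headroom_mb} MB")
```

The estimate lands at roughly 38 GB, in line with the figure in the question. The check shows that the rejection comes from the sum of heap and off-heap memory, not from the heap size on its own.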

Repair command #6 failed with error Nothing to repair for (1838038121817533879,1854751957995458118] in xyz - aborting

The nodetool repair command is failing for some tables. The Cassandra version is 3.11.6. I have the following questions:
Is this really a problem? What is the impact if we ignore this error?
How can we get rid of this token range for a keyspace?
What is the possible reason that it is complaining about this token range?
Here is the error trace:
[2020-11-12 16:33:46,506] Starting repair command #6 (d5fa7530-2504-11eb-ab07-59621b514775), repairing keyspace solutionkeyspace with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 256, pull repair: false, ignore unreplicated keyspaces: false)
[2020-11-12 16:33:46,507] Repair command #6 failed with error Nothing to repair for (1838038121817533879,1854751957995458118] in solutionkeyspace - aborting
[2020-11-12 16:33:46,507] Repair command #6 finished with error
error: Repair job has failed with the error message: [2020-11-12 16:33:46,507] Repair command #6 failed with error Nothing to repair for (1838038121817533879,1854751957995458118] in solutionkeyspace - aborting
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2020-11-12 16:33:46,507] Repair command #6 failed with error Nothing to repair for (1838038121817533879,1854751957995458118] in solutionkeyspace - aborting
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
Schema definition
CREATE KEYSPACE solutionkeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '1', 'datacenter2': '1'} AND durable_writes = true;
CREATE TABLE solutionkeyspace.schemas (
    namespace text PRIMARY KEY,
    avpcontainer map<text, text>,
    schemacreationcql text,
    status text,
    version text
) WITH bloom_filter_fp_chance = 0.001
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 10800
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.05
    AND speculative_retry = '99PERCENTILE';
nodetool status output
bash-5.0# nodetool -p 7199 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 172.16.0.68 5.71 MiB 256 ? 6a4d7b51-b57b-4918-be2f-3d62653b9509 rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
It looks like you are running Cassandra version 3.11.9+ where this new behaviour was introduced.
https://issues.apache.org/jira/browse/CASSANDRA-15160
You will not see this error if you add the following option to the nodetool repair command line:
-iuk
or
--ignore-unreplicated-keyspaces
This change would have broken people's repair scripts when they upgrade from an earlier version of Cassandra to 3.11.9+. We certainly noticed it.
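For repair scripts that need updating, the flag can be added where the command is assembled. A minimal sketch, assuming the script builds the nodetool call as an argument list; the function name build_repair_command is hypothetical, and the port and keyspace are the ones shown in the question.

```python
def build_repair_command(keyspace, port=7199, ignore_unreplicated=True):
    """Return the nodetool argument list for repairing one keyspace."""
    command = ["nodetool", "-p", str(port), "repair"]
    if ignore_unreplicated:
        # Long form of -iuk, as named in the answer above.
        command.append("--ignore-unreplicated-keyspaces")
    command.append(keyspace)
    return command


print(" ".join(build_repair_command("solutionkeyspace")))
```

The resulting list can be handed to subprocess.run on a host where nodetool is installed.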

Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin#bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin#bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
[biadmin#bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
test_pokes.py
However, I encounter the error:
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
End of LogType:stdout
If I run spark-submit standalone, I can see the table exists ok:
[biadmin#bi4c-xxxxxx-mastermanager ~]$ spark-submit test_pokes.py
…
…
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1)
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311')
…
…
See my previous question related to this issue: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
This question is similar to this other question: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that question I am using HiveContext.
Update: see here for the final solution https://stackoverflow.com/a/41272260/1033422
This is because the spark-submit job is unable to find the hive-site.xml, so it cannot connect to the Hive metastore. Please add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.
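One way to confirm whether the file actually reached the driver is to log its presence at the top of the script. This is a diagnostic sketch only: the helper name find_hive_site is hypothetical, and it assumes that files passed with --files are placed in the working directory of the YARN container, and that SPARK_CONF_DIR (if set) is another place worth checking.

```python
import os


def find_hive_site(directories, filename="hive-site.xml"):
    """Return the first existing path to the config file, or None if it is missing."""
    for directory in directories:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Assumption: the container working directory is where --files are localized.
search_dirs = [os.getcwd()]
if os.environ.get("SPARK_CONF_DIR"):
    search_dirs.append(os.environ["SPARK_CONF_DIR"])

print("hive-site.xml found at:", find_hive_site(search_dirs))
```

If this prints None in cluster mode but a path in client mode, the missing file is the likely cause of the "Table not found" error.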
It looks like you are affected by this bug: https://issues.apache.org/jira/browse/SPARK-15345.
I had a similar issue with Spark 1.6.2 and 2.0.0 on HDP-2.5.0.0:
My goal was to create a dataframe from a Hive SQL query, under these conditions:
python API,
cluster deploy-mode (driver program running on one of the executor nodes)
use YARN to manage the executor JVMs (instead of a standalone Spark master instance).
The initial tests gave these results:
spark-submit --deploy-mode client --master local ... => WORKING
spark-submit --deploy-mode client --master yarn ... => WORKING
spark-submit --deploy-mode cluster --master yarn .... => NOT WORKING
In case #3, the driver running on one of the executor nodes could not find the database. The error was:
pyspark.sql.utils.AnalysisException: 'Table or view not found: `database_name`.`table_name`; line 1 pos 14'
Fokko Driesprong's answer listed above worked for me.
With the command listed below, the driver running on the executor node was able to access a Hive table in a database other than default:
$ /usr/hdp/current/spark2-client/bin/spark-submit \
--deploy-mode cluster --master yarn \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml \
/path/to/python/code.py
The python code I have used to test with Spark 1.6.2 and Spark 2.0.0 is:
(Change SPARK_VERSION to 1 to test with Spark 1.6.2. Make sure to update the paths in the spark-submit command accordingly)
SPARK_VERSION=2
APP_NAME = 'spark-sql-python-test_SV,' + str(SPARK_VERSION)
def spark1():
    from pyspark.sql import HiveContext
    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    hc = HiveContext(sc)
    query = 'select * from database_name.table_name limit 5'
    df = hc.sql(query)
    printout(df)
def spark2():
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(APP_NAME).enableHiveSupport().getOrCreate()
    query = 'select * from database_name.table_name limit 5'
    df = spark.sql(query)
    printout(df)
def printout(df):
    print('\n########################################################################')
    df.show()
    print(df.count())
    df_list = df.collect()
    print(df_list)
    print(df_list[0])
    print(df_list[1])
    print('########################################################################\n')
def main():
    if SPARK_VERSION == 1:
        spark1()
    elif SPARK_VERSION == 2:
        spark2()
if __name__ == '__main__':
    main()
For me the accepted answer did not work.
(--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml)
Adding the code below at the top of the script solved it.
import findspark
findspark.init('/usr/share/spark-2.4') # for 2.4

Apache Spark: How to force usage of certain number of cores?

I want to run spark-shell in yarn mode with a certain number of cores.
The command I use is as follows:
spark-shell --num-executors 25 --executor-cores 4 --executor-memory 1G \
--driver-memory 1G --conf spark.yarn.executor.memoryOverhead=2048 --master yarn \
--conf spark.driver.maxResultSize=10G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
-i input.scala
input.scala looks something like this
import java.io.ByteArrayInputStream
// Plaintext sum on 10M rows
def aggrMapPlain(iter: Iterator[Long]): Iterator[Long] = {
  var res = 0L
  while (iter.hasNext) {
    val cur = iter.next
    res = res + cur
  }
  List[Long](res).iterator
}
val pathin_plain = <some file>
val rdd0 = sc.sequenceFile[Int, Long](pathin_plain)
val plain_table = rdd0.map(x => x._2).cache
plain_table.count
0 to 200 foreach { i =>
  println("Plain - 10M rows - Run "+i+":")
  plain_table.mapPartitions(aggrMapPlain).reduce((x,y)=>x+y)
}
On executing this the Spark UI first spikes to about 40 cores, and then settles at 26 cores.
On recommendation of this I changed the following in my yarn-site.xml
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>101</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>101</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>102400</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>102400</value>
</property>
But I still cannot force spark to use 100 cores, which I need as I am doing benchmarking against earlier tests.
I am using Apache Spark 1.6.1.
Each node on the cluster including the driver has 16 cores and 112GB of memory.
They are on Azure (hdinsight cluster).
2 driver nodes + 7 worker nodes.
I'm unfamiliar with Azure, but I guess YARN is YARN, so you should make sure that you have
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
in capacity-scheduler.xml.
(See this similar question and answer)
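The numbers in the question are consistent with a scheduler that counts containers and not cores. The sketch below only does the arithmetic; the reading that 26 corresponds to 25 executor containers plus one application master, each reported as a single vcore when memory is the only resource considered, is an assumption and is not confirmed by the question.

```python
# Figures taken from the question.
num_executors = 25
executor_cores = 4
worker_nodes = 7
cores_per_node = 16

requested_cores = num_executors * executor_cores  # what the command asks for
cluster_cores = worker_nodes * cores_per_node     # physical cores on the workers
containers = num_executors + 1                    # executors plus one application master (assumed)

print(f"requested cores: {requested_cores}")
print(f"cores available on workers: {cluster_cores}")
print(f"containers, if each is reported as 1 vcore: {containers}")
```

With these figures the request for 100 cores fits within the 112 cores of the worker nodes, so the cluster size itself should not be the limiting factor.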
