VoltDB node failed when RAM usage reached about 10 GB

I am trying to use VoltDB. The VM has 16 vCPUs and 32 GB of RAM, but I set Xmx to 6 GB. The node failed while I was loading data into VoltDB. I estimated the size of the data following the official documentation; it is about 500 MB. But the process took up about 10 GB of RAM and then failed. Does anyone know the reason?
comment: I am using the default configuration of VoltDB and running it in Docker.

VoltDB dev here. Without looking at the log files, it is hard to say what the root cause was. The tail of the log file usually includes error messages that indicate why the cluster has gone down. If you can post the log file, we can take a look and help you pinpoint the issue. If you want, you can also post on our public forum at https://forum.voltdb.com or join our Slack channel at http://chat.voltdb.com/.
Ning

Related

System Requirement for Spark In Production

Could someone please help me with the system requirements for running Spark in a production environment?
I am trying to set up an environment for batch processing of data coming from a Kafka producer.
The amount of data processed daily is in the terabytes.
The data comes from HDFS, and the persistence layer is also HDFS.
The information I have so far:
4-8 disks per node, configured without RAID (just as separate mount points).
Allocate at most 75% of the memory to Spark; leave the rest for the operating system and buffer cache.
A 10 Gigabit or faster network is the best way to make these applications faster.
At least 8-16 cores per machine.
Please share your knowledge if you have used Spark in production.
Thanks a ton
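As a rough illustration of the 75% memory guideline above, the split for one node can be worked out as follows. The 64 GB node size is a hypothetical value chosen for the example, not a figure from the question.

```python
def spark_memory_budget_gb(total_ram_gb, spark_fraction=0.75):
    """Split a node's RAM between Spark and the OS/buffer cache."""
    if not 0 < spark_fraction <= 1:
        raise ValueError("spark_fraction must be in (0, 1]")
    spark_gb = total_ram_gb * spark_fraction
    return spark_gb, total_ram_gb - spark_gb


# Hypothetical node with 64 GB of RAM.
spark_gb, os_gb = spark_memory_budget_gb(64)
print(f"Spark: {spark_gb:.0f} GB, OS and buffer cache: {os_gb:.0f} GB")
```

The Spark share would then be divided among the executors on that node, leaving the remainder untouched for the operating system.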

Cloudera execution problem: Initial job has not accepted any resources

I'm trying to fetch some data from Cloudera's Quick Start Hadoop distribution (a Linux VM for us) on our SAP HANA database using SAP Spark Controller. Every time I trigger the job in HANA, it gets stuck and I see the following warning being logged continuously every 10-15 seconds in SPARK Controller's log file, unless I kill the job.
WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Although it's logged like a warning it looks like it's a problem that prevents the job from executing on Cloudera. From what I read, it's either an issue with the resource management on Cloudera, or an issue with blocked ports. In our case we don't have any blocked ports so it must be the former.
Our Cloudera is running a single node and has 16GB RAM with 4 CPU cores.
Looking at the overall configuration I have a bunch of warnings, but I can't determine if they are relevant to the issue or not.
Here's also how the RAM is distributed on Cloudera
It would be great if you can help me pinpoint the cause for this issue because I've been trying various combinations of things over the past few days without any success.
Thanks,
Dimitar
You're trying to use the Cloudera Quickstart VM for a purpose beyond its capacity. It's really meant for someone to play around with Hadoop and CDH and should not be used for any production-level work.
Your Node Manager only has 5GB of memory to use for compute resources. In order to do any work, you need to create an Application Master (AM) and a Spark executor, and then have memory in reserve for your executors, which you won't have on a Quickstart VM.
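A rough way to check this is to add up the containers the job needs and compare the total with what the Node Manager can hand out. The sketch below assumes Spark's default per-container memory overhead on YARN (the larger of 384 MB and 10% of the requested memory) and a YARN minimum allocation of 1024 MB; the AM and executor sizes are hypothetical and would have to be read from the actual Spark Controller configuration.

```python
import math


def container_mb(requested_mb, min_allocation_mb=1024):
    """Memory YARN reserves for one Spark container (assumed defaults)."""
    # Assumed overhead rule: max(384 MB, 10% of the requested memory).
    overhead_mb = max(384, int(requested_mb * 0.10))
    total_mb = requested_mb + overhead_mb
    # YARN rounds each request up to a multiple of the minimum allocation.
    return math.ceil(total_mb / min_allocation_mb) * min_allocation_mb


def fits(node_manager_mb, am_mb, executor_mb, num_executors):
    """Return (fits?, total MB needed) for one AM plus the executors."""
    needed_mb = container_mb(am_mb) + num_executors * container_mb(executor_mb)
    return needed_mb <= node_manager_mb, needed_mb


# 5 GB Node Manager from the answer; the other sizes are hypothetical.
ok, needed_mb = fits(node_manager_mb=5 * 1024, am_mb=1024,
                     executor_mb=2048, num_executors=2)
print(f"needed={needed_mb} MB, fits={ok}")
```

If the total does not fit, YARN never grants the containers and the scheduler keeps logging the "Initial job has not accepted any resources" warning.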

Buffer/cache exhaustion Spark standalone inside a Docker container

I have a very weird memory issue (which is what a lot of people will most likely say ;-)) with Spark running in standalone mode inside a Docker container. Our setup is as follows: we have a Docker container in which a Spring Boot application runs Spark in standalone mode. This Spring Boot app also contains a few scheduled tasks (managed by Spring). These tasks trigger Spark jobs. The Spark jobs scrape a SQL database, shuffle the data a bit, and then write the results to a different SQL table (writing the results doesn't go through Spark). Our current data set is very small (the table contains a few million rows).
The problem is that the Docker host (a CentOS VM) that runs the Docker container crashes after a while because memory gets exhausted. I have currently limited Spark's memory usage to 512M (I set both executor and driver memory), and in the Spark UI I can see that the largest job only takes about 10 MB of memory. I know that Spark runs best if it has 8GB of memory or more available. I have tried that as well, but the results are the same.
After digging a bit further I noticed that Spark eats up all the buffer/cache memory on the machine. Clearing this manually by forcing Linux to drop caches (echo 2 > /proc/sys/vm/drop_caches, which clears the dentries and inodes) makes the cache usage drop considerably, but if I don't keep doing this regularly the cache usage slowly climbs until all memory is used by buffer/cache.
Does anyone have an idea what I might be doing wrong / what is going on here?
Big thanks in advance for any help!
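To see which part of the buffer/cache figure is growing, the relevant counters can be read from /proc/meminfo. This is a minimal sketch, assuming a Linux host; it reports the page cache (Cached), Buffers, and the reclaimable slab (SReclaimable), the last being where the dentries and inodes freed by `echo 2 > /proc/sys/vm/drop_caches` are accounted.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_breakdown(fields):
    """Return page cache, buffers and reclaimable slab in MiB."""
    keys = ("Cached", "Buffers", "SReclaimable")
    return {key: fields.get(key, 0) / 1024 for key in keys}


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(cache_breakdown(parse_meminfo(f.read())))
    except FileNotFoundError:
        print("/proc/meminfo is not available on this system")
```

Running it periodically on the Docker host shows whether the growth sits in the page cache or in the slab.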

GPDB: Out of memory at segment

We are facing an OOM error when trying to execute multiple SQL query sessions via a scheduled job.
Detailed error:
The error message is: org.postgresql.util.PSQLException:ERROR: Out of memory (seg6 slice5 sungpmsh0:40002 pid=13610)
Detail: VM protect failed to allocate 65584 bytes from system, VM Protect 5835 MB available
What we tried:
After reading the Pivotal support docs, we did some basic troubleshooting.
Test 1: Validated two memory parameters. Current settings in GPDB:
GPDB vmprotect limit: 8 GB
GPDB statement_mem: based on the vmprotect limit. From what we read, it governs the memory available for running a query in a segment.
Test 2: Tuned the SQL queries. What else should we tune here? Please advise.
Sources:
https://discuss.pivotal.io/hc/en-us/articles/201947018-Pivotal-Greenplum-GPDB-Memory-Configuration
https://discuss.pivotal.io/hc/en-us/articles/204268778-What-are-VM-Protect-failed-to-allocate-d-bytes-d-MB-available-error-
But we are still getting the same OOM error.
Do we need to increase the vmprotect limit? If yes, by how much should we increase it?
How do we handle concurrency in GPDB?
How much swap do we need to add when we are already running with 30 GB of RAM?
Currently we have added 15 GB of swap. Is that OK?
What is the query to identify host connections to the Greenplum database?
Thanks in advance
Do we need to increase the vmprotect limit? if Yes, then by which amount should we increase it?
There is a nice calculator on setting gp_vmem_protect_limit on Greenplum.org. The setting depends on how much memory, swap, and segments per host you have.
http://greenplum.org/calc/
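For a quick estimate without the calculator, the arithmetic can be sketched as below. It assumes the sizing formula given in the Greenplum documentation for hosts with less than 256 GB of RAM, gp_vmem = ((SWAP + RAM) - (7.5 GB + 0.05 * RAM)) / 1.7, divided by the number of primary segments that may be active on a host. The segment count of 4 is hypothetical, since the question does not state it.

```python
def gp_vmem_protect_limit_mb(ram_gb, swap_gb, acting_primary_segments):
    """Estimate gp_vmem_protect_limit (MB per segment) for one host.

    Assumes the documented formula for hosts with less than 256 GB RAM.
    """
    if acting_primary_segments < 1:
        raise ValueError("need at least one primary segment per host")
    gp_vmem_gb = ((swap_gb + ram_gb) - (7.5 + 0.05 * ram_gb)) / 1.7
    return int(gp_vmem_gb * 1024 / acting_primary_segments)


# 30 GB RAM and 15 GB swap come from the question; 4 segments is assumed.
limit_mb = gp_vmem_protect_limit_mb(ram_gb=30, swap_gb=15,
                                    acting_primary_segments=4)
print(f"gp_vmem_protect_limit ~ {limit_mb} MB per segment")
```

The result should be treated as a starting point and cross-checked against the calculator, because it changes sharply with the number of segments per host.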
You can get OOM errors for several reasons:
Bad query
Bad table distribution (skew)
Bad settings (like gp_vmem_protect_limit)
Not enough resources (RAM)
How to handle concurrency at gpdb?
More RAM, fewer segments per host, and workload management to limit the number of concurrent queries running.
How much swap we need to add here when we are already running with 30 GB RAM. currently, we have added 15GB swap here? is that ok ?
Only 30GB of RAM? That is pretty small. You can add more swap but it will slow down the queries compared to real RAM. I wouldn't use much more than 8GB of swap.
I recommend using 256GB of RAM or more especially if you are worried about concurrency.
What is the query to identify host connection with Greenplum database
select * from pg_stat_activity;

Cassandra compaction tasks stuck

I'm running DataStax Enterprise in a cluster consisting of 3 nodes. They all run on the same hardware: 2-core Intel Xeon 2.2 GHz, 7 GB RAM, 4 TB RAID-0.
This should be enough for running a cluster with a light load, storing less than 1 GB of data.
Most of the time everything is fine, but sometimes the running tasks related to the Repair Service in OpsCenter get stuck; this makes that node unstable and increases its load.
However, if the node is restarted, the stuck tasks don't show up and the load returns to normal levels.
Because we don't have much data in our cluster, we use the min_repair_time parameter defined in opscenterd.conf to delay the repair service so that it doesn't complete too often.
It seems a little weird that tasks marked as "Complete" and showing a progress of 100% don't go away. We've waited hours for them to disappear, but they won't; the only fix we've found is to restart the nodes.
Edit:
Here's the output from nodetool compactionstats
Edit 2:
I'm running under Datastax Enterprise v. 4.6.0 with Cassandra v. 2.0.11.83
Edit 3:
This is the output from dstat on a node that is behaving normally
This is the output from dstat on a node with a stuck compaction
Edit 4:
Output from iostat on a node with a stuck compaction; see the high "iowait"
Azure storage
Azure divides disk resources among storage accounts under an individual user account. There can be many storage accounts in an individual user account.
For the purposes of running DSE [or Cassandra], it is important to note that a single storage account should not be shared between more than two nodes if DSE [or Cassandra] is configured like the examples in the scripts in this document. This document configures each node to have 16 disks. Each disk has a limit of 500 IOPS. This yields 8000 IOPS when configured in RAID-0. So, two nodes will hit 16,000 IOPS and three would exceed the limit.
See details here
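The arithmetic behind the two-node recommendation can be written out as follows. The per-disk and per-node figures come from the quoted text; the 20,000 IOPS per storage account limit is an assumption (the quote only says that three nodes exceed "the limit") and should be checked against the current Azure documentation.

```python
DISKS_PER_NODE = 16
IOPS_PER_DISK = 500
# Assumption: the per-account limit is not stated in the quoted text.
ACCOUNT_IOPS_LIMIT = 20_000


def node_iops(disks=DISKS_PER_NODE, iops_per_disk=IOPS_PER_DISK):
    """Aggregate IOPS one node can request when its disks are striped."""
    return disks * iops_per_disk


def max_nodes_per_account(limit=ACCOUNT_IOPS_LIMIT):
    """How many such nodes fit under one storage account's IOPS limit."""
    return limit // node_iops()


for nodes in (1, 2, 3):
    total = nodes * node_iops()
    status = "within" if total <= ACCOUNT_IOPS_LIMIT else "over"
    print(f"{nodes} node(s): {total} IOPS, {status} the assumed limit")
print("max nodes per storage account:", max_nodes_per_account())
```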
This issue was under investigation for a long time and we have now found a solution. We aren't sure what the underlying problem was; we have a clue, but nothing can be confirmed.
We had set up a RAID-0 (striping) array consisting of four disks of 1 TB each. We should have seen roughly 4x the IOPS of a single disk when using the stripe, but we didn't, so something was clearly wrong with the RAID setup.
We used multiple utilities to confirm that the CPU was waiting on IO most of the time whenever we considered a node "stuck". Clearly something with the IO, most probably our RAID setup, was causing this. We tried a few different mdadm settings, but didn't manage to solve the problem with the RAID setup.
We started investigating Azure Premium Storage (which is still in preview). This lets you attach disks to VMs whose underlying physical storage is SSD. So we said: SSDs => more IOPS, let's give this a try. We did not set up any RAID with the SSDs; we use a single SSD disk per VM.
We've been running the cluster for almost 3 days now and have stress tested it a lot, but haven't been able to reproduce the issues.
We didn't get down to the real cause, but the conclusion is that one or both of the following must have been the underlying cause of our problems:
Disks too slow (writes > IOPS)
RAID set up incorrectly, which caused the disks to misbehave
These two problems go hand in hand, and most likely we were simply setting up the disks the wrong way. However, SSDs = more power to the people, so we will definitely continue using SSDs.
If anyone experiences the same problems that we had on Azure with RAID-0 on large disks, don't hesitate to add to this.
Part of the problem you have is that you do not have a lot of memory on those systems and it is likely that even with only 1GB of data per node, your nodes are experiencing GC pressure. Check in the system.log for errors and warnings as this will provide clues as to what is happening on your cluster.
The rollups_60 table in the OpsCenter schema contains the lowest (minute level) granularity time series data for all your Cassandra, OS, and DSE metrics. These metrics are collected regardless of whether you have built charts for them in your dashboard so that you can pick up historical views when needed. It may be that this table is outgrowing your small hardware.
You can try tuning OpsCenter to avoid this kind of issue. Here are some options for configuration in your opscenterd.conf file:
Adding keyspaces (for example the opsc keyspace) to your ignored_keyspaces setting
You can also decrease the TTL on this table by tuning the 1min_ttl setting
Sources:
Opscenter Config DataStax docs
Metrics Config DataStax Docs
