I have data consisting of 100 nodes and 165 relations to be inserted into one keyspace. My Grakn image has 4 CPU cores and 3 GB of memory. While inserting the data I get this error:
[grpc-request-handler-4] ERROR grakn.core.server.Grakn - Uncaught exception at thread [grpc-request-handler-4] java.lang.OutOfMemoryError: Java heap space
I noticed that the container used 346% CPU but only 1.46 GB of RAM. The log also contained this related failure:
Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 1, use getErrors() for more: Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=3cb85440): io.netty.channel.ChannelException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel)
Could you please help me with this?
It sounds like Cassandra ran out of memory. Currently, Grakn spawns two processes: one for Cassandra and one for the Grakn server. You can increase your memory limits with the following flags (Unix):
SERVER_JAVAOPTS=-Xms1G STORAGE_JAVAOPTS=-Xms2G ./grakn server start
This would give the server 1 GB and the storage engine (Cassandra) 2 GB of memory, for instance. 3 GB may be a bit on the low end once your data grows, so keep these flags in mind :)
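If you still hit the limit, you can also cap the maximum heap sizes explicitly. A minimal sketch, assuming the SERVER_JAVAOPTS/STORAGE_JAVAOPTS variables are passed straight through to the two JVMs (the sizes here are only an illustration; adjust them to what the host actually has):
SERVER_JAVAOPTS="-Xms1G -Xmx1G" STORAGE_JAVAOPTS="-Xms2G -Xmx2G" ./grakn server start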
Related
NOTE: I found the root cause in application code using Hazelcast: a job that starts executing after 15 minutes and retrieves almost the entire data set. So the issue is NOT in Hazelcast itself; I am leaving the question here in case anyone else sees the same side effect or has similarly wrong code.
What can cause heavy traffic between 2 Hazelcast nodes (v3.12.12, also tried 4.1.1)?
The cluster holds maps with a lot of data; no entries are added or removed during that time, only map values are updated.
Java 11, memory usage 1.5 GB out of 12 GB, no full GCs identified.
According to JFR, the high IO comes from:
com.hazelcast.internal.networking.nio.NioThread.processTaskQueue()
A chart of network IO (attached to the original post) shows traffic jumping from 15 to 60 MB roughly 15 minutes after start. From the application's perspective nothing changed after these 15 minutes.
This smells like garbage collection; you are most likely running into long GC pauses. Check your GC logs, which you can enable using verbose GC settings on all members. If there are back-to-back GCs, then you should do one or both of the following (see the example flags after this list):
increase the heap size
tune your GC; I'd look into G1 (with -XX:MaxGCPauseMillis set to a reasonable number) and/or ZGC.
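As a rough illustration on Java 11, the member JVMs could be started with something like the flags below. The heap size and pause target are placeholders to tune for your workload; -Xlog:gc* is the unified GC logging switch, and ZGC still needs the experimental-options flag on Java 11:
-Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=gc.log
-Xmx12g -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xlog:gc*:file=gc.log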
I am getting a kernel error while trying to retrieve data from an API that spans 100 pages. The data size is huge, but the code runs well when executed on Google Colab or on a local machine.
The error I see in a window is:
Kernel Restarting
The kernel appears to have died. It will restart automatically.
I am using an ml.m5.xlarge machine with a memory allocation of 1000GB and there are no pre-saved datasets in the instance. Also, the expected data size is around 60 GB split into multiple datasets of 4 GB each.
Can anyone help?
I think you could try not loading all the data into memory, or switch to a beefier instance type. According to https://aws.amazon.com/sagemaker/pricing/instance-types/, ml.m5.xlarge has 15GB of memory, which is well short of the ~60 GB you expect to retrieve. One option is to stream each page to disk as it arrives, as sketched below.
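A minimal sketch of that approach, assuming a paginated HTTP API; the endpoint, the page parameter, and the output file names are placeholders, and the requests library is assumed to be available in the notebook:

import requests

BASE_URL = "https://example.com/api/data"  # hypothetical endpoint -- substitute the real API

# Stream each of the 100 pages straight to disk instead of accumulating
# everything in the kernel's memory.
for page in range(1, 101):
    resp = requests.get(BASE_URL, params={"page": page}, stream=True)
    resp.raise_for_status()
    with open(f"data_page_{page}.json", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            out.write(chunk)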
I'm using Apache Cassandra 3.0.6 in a 4-node cluster, RF=3, consistency level ONE, heap 16 GB.
I'm getting this INFO message in system.log:
INFO [SharedPool-Worker-1] 2017-03-14 20:47:14,929 NoSpamLogger.java:91 - Maximum memory usage reached (536870912 bytes), cannot allocate chunk of 1048576 bytes
I don't know exactly which memory it means. I tried increasing file_cache_size_in_mb from 512 to 1024 in cassandra.yaml, but the extra 512 MB filled up immediately and the application's recording stopped again, with the same INFO message:
INFO [SharedPool-Worker-5] 2017-03-16 06:01:27,034 NoSpamLogger.java:91 - Maximum memory usage reached (1073741824 bytes), cannot allocate chunk of 1048576 bytes
Please suggest if anyone has faced the same issue. Thanks!!
Bhargav
As far as I can tell with Cassandra 3.11, no matter how large you set file_cache_size_in_mb, you will still get this message. The cache fills up, and writes this useless message. It happens in my case whether I set it to 2GB or 20GB. This may be a bug in the cache eviction strategy, but I can't tell.
The log message indicates that the node's off-heap cache is full because the node is busy servicing reads.
The 536870912 bytes in the log message is 512 MB, which is the default value of file_cache_size_in_mb.
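For reference, that default corresponds to this line in cassandra.yaml (raising it only buys a larger cache; as noted above, it does not make the message go away):
file_cache_size_in_mb: 512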
It is fine to see occasional occurrences of the message in the logs, which is why it is logged at INFO level. But if it gets logged repeatedly, it is an indicator that the node is getting overloaded, and you should consider increasing the capacity of your cluster by adding more nodes.
For more info, see my post on DBA Stack Exchange -- What does "Maximum memory usage reached" mean in the Cassandra logs?. Cheers!
I have a workstation with 250 GB of RAM and a 4 TB SSD. MemSQL has a table that contains 1 billion records, each with 44 columns, amounting to about 500 GB of data. When I run the following query on that table:
SELECT count(*) ct,name,age FROM research.all_data group by name having count(*) >100 order by ct desc
I got the following error:
MemSQL code generation has failed
I restarted the server, and after that I got another error:
Not enough memory available to complete the current request. The request was not processed
I gave the server a maximum memory of 220 GB and max_table_memory of 190 GB.
Why could that error happen?
Why is MemSQL consuming 140 GB of memory even though I am using a columnstore?
For "MemSQL code generation has failed", check the tracelog (http://docs.memsql.com/docs/trace-log) on the MemSQL node where the error was hit for more details - this can mean a lot of different things.
MemSQL needs memory to process query results, hold some metadata, etc., even though columnstore data lives on disk. Check the MemSQL status info to see what is using memory: https://knowledgebase.memsql.com/hc/en-us/articles/208759276-What-is-using-memory-on-my-leaves-.
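If it helps, the per-node breakdown that article walks through can be pulled with the standard status command (run it on the node in question; the memory-related counters in the output are what the article explains):
SHOW STATUS EXTENDED;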
Setup:
We have a 3-node Cassandra cluster with around 850 GB of data on each node. The Cassandra data directory sits on an LVM volume (currently consisting of 3 drives: 800 GB + 100 GB + 100 GB), and there is a separate (non-LVM) volume for cassandra_logs.
Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1
Issue:
After adding the 3rd (100 GB) volume to the LVM on each node, disk I/O went very high on all the nodes and they go down quite often. The servers also become inaccessible and we need to reboot them; they don't stay stable and we have to reboot every 10-15 minutes.
Other Info:
We have the DSE-recommended server settings (vm.max_map_count, file descriptors) configured on all nodes
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:
Pool Name       Active   Pending   Completed   Blocked   All time blocked
FlushWriter          0         0          22         0                  8
FlushWriter          0         0          80         0                  6
FlushWriter          0         0          38         0                  9
The column I'm concerned about is the All Time Blocked. As a ratio to completed, you have a lot of blocking. The flushwriter is responsible for flushing memtables to the disk to keep the JVM from running out of memory or creating massive GC problems. The memtable is an in-memory representation of your tables. As your nodes take more writes, they start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
When flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see the requests starting to queue up and eventually the node will OOM.
Compaction might be running as well. Compaction is a long sequential read of SSTables into memory and then a long sequential flush of the merge sorted results. More sequential IO.
So all these operations on disk are sequential. Not random IOPs. If your disk is not able to handle simultaneous sequential read and write, IOWait shoots up, requests get blocked and then Cassandra has a really bad day.
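A quick way to confirm this on the affected nodes is to watch per-device utilization and I/O wait while the cluster is under load (standard sysstat tool; the 5-second interval is arbitrary):
iostat -x 5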
You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over on sequential load. Your easiest solution in the short term is to add more nodes to spread out the load. The medium term is to find some ways to optimize your stack for sequential disk loads, but that will eventually fail. Long term is get your data on real disks and off shared storage.
I have told consulting clients for years: when using Cassandra, "if your storage has an Ethernet plug, you are doing it wrong." Good rule of thumb.