I am getting a heap memory issue while running DSBULK load - cassandra

I have unloaded more than 100 CSV files in a folder. When I try to load those files to cassandra using DSBULK load and specifying the the folder location of all these files, I get the below error
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "unVocity-parsers input reading thread"
I wanted to see if anyone else has faced it and how it has been resolved.

Here are a few things you can try:
You can pass any JVM option or system property to the dsbulk executable using the DSBULK_JAVA_OPTS env var. See this page for more. Set the allocated memory to a higher value if possible.
You can throttle dsbulk using the -maxConcurrentQueries option. Start with -maxConcurrentQueries 1; then raise the value to get the best throughput possible without hitting the OOM error. More on this here.

Related

Java Heap Space issue in Grakn 1.6.0

i have a data of 100 nodes and 165 relations to be inserted into one keyspace. My grakn image have 4 core CPU and 3 GB Memory. While i try to insert the data i am getting an error [grpc-request-handler-4] ERROR grakn.core.server.Grakn - Uncaught exception at thread [grpc-request-handler-4] java.lang.OutOfMemoryError: Java heap space. It was noticed that the image used 346 % CPU and 1.46 GB RAM only. Also a finding for the issue in log was Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 1, use getErrors() for more: Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=3cb85440): io.netty.channel.ChannelException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel)
Could you please help me with this?
It sounds like Cassandra ran out of memory - currently, Grakn spawns to processes: one for Cassandra and one for Grakn server. You can increase your memory limit with the following flags (unix):
SERVER_JAVAOPTS=-Xms1G STORAGE_JAVAOPTS=-Xms2G ./grakn server start
this would give the server 1GB, and the storage engine (cassandra) 2gb of memory, for instance. 3 GB may be a bit on the low end once your data grows so keep these flags in mind :)

AWS Sagemaker Kernel appears to have died and restarts

I am getting a kernel error while trying to retrieve the data from an API that includes 100 pages. The data size is huge but the code runs well when executed on Google Colab or on local machine.
The error I see in a window is-
Kernel Restarting
The kernel appears to have died. It will restart automatically.
I am using an ml.m5.xlarge machine with a memory allocation of 1000GB and there are no pre-saved datasets in the instance. Also, the expected data size is around 60 GB split into multiple datasets of 4 GB each.
Can anyone help?
I think you could try not to load all the data into memory, or try to switch to a beefier instance type. According to https://aws.amazon.com/sagemaker/pricing/instance-types/ ml.m5.xlarge has 15GB memory.
Jun

Azure Data Factory pipeline into compressed Parquet file: “java.lang.OutOfMemoryError:Java heap space”

I have a pipeline that reads data from an MS SQL Server and stores them into a file in a BLOB container in Azure Storage. The file has Parquet (or Apache Parquet, as it is also called) format.
So, when the “sink” (output) file is stored in a compressed way (snappy, or gzip – does not matter) AND the file is large enough (more than 50 Mb), the pipeline failed. The message was the following:
"errorCode": "2200",
"message": "Failure happened on 'Sink' side.
ErrorCode=UserErrorJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space\ntotal entry:11\r\njava.util.ArrayDeque.doubleCapacity(Unknown Source)\r\njava.util.ArrayDeque.addFirst(Unknown Source)\r\njava.util.ArrayDeque.push(Unknown Source)\r\norg.apache.parquet.io.ValidatingRecordConsumer.endField(ValidatingRecordConsumer.java:108)\r\norg.apache.parquet.example.data.GroupWriter.writeGroup(GroupWriter.java:58)\r\norg.apache.parquet.example.data.GroupWriter.write(GroupWriter.java:37)\r\norg.apache.parquet.hadoop.example.GroupWriteSupport.write(GroupWriteSupport.java:87)\r\norg.apache.parquet.hadoop.example.GroupWriteSupport.write(GroupWriteSupport.java:37)\r\norg.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)\r\norg.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:292)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetBatchWriter.addRows(ParquetBatchWriter.java:60)\r\n,Source=Microsoft.DataTransfer.Common,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
"failureType": "UserError",
"target": "Work_Work"
}
The "Work_Work" is the name of a Copy Data activity in the pipeline.
If I turn the compression off (the generated BLOB file is uncompressed), the error does not happen.
Is this the error described in link the
“…If you copy data to/from Parquet format using Self-hosted
Integration Runtime and hit error saying "An error occurred when
invoking java, message: java.lang.OutOfMemoryError:Java heap space",
you can add an environment variable _JAVA_OPTIONS in the machine that
hosts the Self-hosted IR to adjust the min/max heap size for JVM to
empower such copy, then rerun the pipeline….”?
If it is, have I understood correctly that I have to do the following:
To go to a server where the “Self-hosted Integration Runtime” (still have no idea what it is) and increase the max heap size for JVM. Is this correct?
If it is, my next question is: how large the max heap size should be? My pipeline can generate a file whose size will be 30 GB.
What “max heap size” can guarantee that such a file will not cause the fail?
If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable _JAVA_OPTIONS in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline.
Example: set variable _JAVA_OPTIONS with value -Xms256m -Xmx16g. The flag Xms specifies the initial memory allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool. This means that JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx amount of memory. By default, ADF use min 64MB and max 1G.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/format-parquet
Hope this helps.

Pentaho | GC overhead limit exceeded

I want to insert data from xlsx file into table. Excel has around 1,20,000 records. But while running transformation, I am getting below error:
GC overhead limit exceeded
I have changed in spoon.bat.
Xmx2g -XX:MaxPermSize=1024m
But still I am getting this error.
Can someone please help on this?
In my case , additional to adding the Xms and Xmx parameters(which didnt solve it completly) i added the option -XX:-UseGCOverheadLimit to spoon.sh and the problem solved.
Yes it work I increate the memory to 4GB issue is fixed.
if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms2048m" "-Xmx4096m"
This is a known bug with apache POI.
the xlsx input step is not able to read big files.
I usually turn my files into CVS in such cases.
here is the jira case.
http://jira.pentaho.com/browse/PDI-5269
The environment variable PENTAHO_DI_JAVA_OPTIONS is used to add option to the starting of jre. Mine is set to "-Xms512m -Xmx3000M -XX:MaxPermSize=256m"
Split the file into two different files, save half of the data in one file and the remaining in another. It works perfectly.

Rate limit with Apache Spark GCS connector

I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows:
java.io.IOException: Error inserting: bucket: *****, object: *****
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 Too Many Requests
{
"code" : 429,
"errors" : [ {
"domain" : "usageLimits",
"message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
"reason" : "rateLimitExceeded"
} ],
"message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:472)
... 3 more
Anyone knows any solution for that?
Is there a way to control the read/write rate of Spark?
Is there a way to increase the rate limit for my Google Project?
Is there a way to use local Hard-Disk for temp files that don't have
to be shared with other slaves?
Thanks!
Unfortunately, the usage of GCS when set as the DEFAULT_FS can pop up with high rates of directory-object creation whether using it for just intermediate directories or for final input/output directories. Especially for using GCS as the final output directory, it's difficult to apply any Spark-side workaround to reduce the rate of redundant directory-creation requests.
The good news is that most of these directory requests are indeed redundant, just because the system is used to being able to essentially "mkdir -p", and cheaply return true if the directory already exists. In our case, it's possible to fix it on the GCS-connector side by catching these errors and then just checking whether the directory indeed got created by some other worker in a race condition.
This should be fixed now with https://github.com/GoogleCloudPlatform/bigdata-interop/commit/141b1efab9ef23b6b5f5910d8206fcbc228d2ed7
To test, just run:
git clone https://github.com/GoogleCloudPlatform/bigdata-interop.git
cd bigdata-interop
mvn -P hadoop1 package
# Or or Hadoop 2
mvn -P hadoop2 package
And you should find the files "gcs/target/gcs-connector-*-shaded.jar" available for use. To plug it into bdutil, simply gsutil cp gcs/target/gcs-connector-*shaded.jar gs://<your-bucket>/some-path/ and then edit bdutil/bdutil_env.sh for Hadoop 1 or bdutil/hadoop2_env.sh to change:
GCS_CONNECTOR_JAR='https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.4.1-hadoop2.jar'
To instead point at your gs://<your-bucket>/some-path/ path; bdutil automatically detects that you're using a gs:// prefixed URI and will do the right thing during deployment.
Please let us know if it fixes the issue for you!
Have you tried to set the spark.local.dir config parameter and attach a disk (preferable SSD) for that tmp space to your Google Compute Engine instances?
https://spark.apache.org/docs/1.2.0/configuration.html
You can not change the rate limiting for your project, what you would have to use is a back-off algorithm once the limit is reached. Since you mentioned most of the reads/writes are for tmp files, try to configure Spark to use local disks for that.

Resources