I want to insert data from an xlsx file into a table. The Excel file has around 120,000 records. But while running the transformation, I get the error below:
GC overhead limit exceeded
I have changed the JVM memory settings in spoon.bat to:
-Xmx2g -XX:MaxPermSize=1024m
But I am still getting this error.
Can someone please help with this?
In my case, in addition to adding the -Xms and -Xmx parameters (which didn't solve it completely), I added the option -XX:-UseGCOverheadLimit to spoon.sh and the problem was solved.
Yes, it works. I increased the memory to 4 GB and the issue is fixed.
if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms2048m" "-Xmx4096m"
This is a known bug with Apache POI.
The xlsx input step is not able to read big files.
I usually turn my files into CSV in such cases.
Here is the JIRA case:
http://jira.pentaho.com/browse/PDI-5269
The environment variable PENTAHO_DI_JAVA_OPTIONS is used to pass options to the JVM at startup. Mine is set to "-Xms512m -Xmx3000M -XX:MaxPermSize=256m".
Split the file into two separate files, with half of the data in one and the rest in the other. It works perfectly.
I have unloaded more than 100 CSV files into a folder. When I try to load those files into Cassandra using dsbulk load, specifying the folder location of all these files, I get the error below:
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "unVocity-parsers input reading thread"
I wanted to see if anyone else has faced it and how it has been resolved.
Here are a few things you can try:
You can pass any JVM option or system property to the dsbulk executable using the DSBULK_JAVA_OPTS env var. See this page for more. Set the allocated memory to a higher value if possible.
You can throttle dsbulk using the -maxConcurrentQueries option. Start with -maxConcurrentQueries 1; then raise the value to get the best throughput possible without hitting the OOM error. More on this here.
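As a hedged illustration of combining both suggestions (the keyspace, table, CSV folder path, and heap size below are placeholders to adapt to your environment):
# my_keyspace, my_table, /path/to/csv_folder, and the 4g heap are placeholders
export DSBULK_JAVA_OPTS="-Xmx4g"
dsbulk load -k my_keyspace -t my_table -url /path/to/csv_folder -maxConcurrentQueries 1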
Has anyone encountered the same issue with sequelize-oracle or sequelize? My sequelize-oracle query only returns 100 rows even when the limit is more than 100.
That is quite weird and I hope someone can help me. Thanks!! :D
It's likely the library you're using has a dependency on an older version of node-oracledb. In the past, the default for oracledb.maxRows was 100. You could try setting that to an arbitrarily high number, just be sure you have enough memory in Node.js to handle the data returned from the query.
In later versions, the default value was changed to 0 and the code was updated to interpret that as unlimited.
I am attempting to run a simple Cassandra database COPY script, like the example below (or some variation that is very similar):
COPY my_keyspace_name.my_table_name TO 'cassandra_dump/my_keyspace_name.my_table_name.csv' WITH HEADER=true AND PAGETIMEOUT=40 AND PAGESIZE=20 AND DELIMITER='|';
It works on every table except my largest one. In that case, I get an error saying it cannot allocate enough memory. The table is nowhere near as large as the error message claims (it holds less than 1 GB of data).
749314 rows exported to 1 files in 9 minutes and 11.240 seconds.
./dump_cassandra.sh: xmalloc: ../../.././lib/sh/strtrans.c:63: cannot allocate 18446744072166431589 bytes (6442528768 bytes allocated)", "stdout_lines": ["[Thu May 17 13:41:47 UTC 2018] Executing the following query:", "COPY my_keyspace_name.my_table_name TO 'cassandra_dump/my_keyspace_name.my_table_name.csv' WITH HEADER=true AND PAGETIMEOUT=40 AND PAGESIZE=20 AND DELIMITER='|';"
This answer seemed promising, but unfortunately it does not work for me.
Is there something I am missing that is preventing me from running a successful COPY on a large (relatively speaking) table?
--
EDIT: This error seems to be environmental. I have had mixed results on different servers with nearly identical amounts of data.
Setting MAXOUTPUTSIZE splits the backup data across multiple files and does not cause this error to occur:
COPY my_keyspace_name.my_table_name TO 'cassandra_dump/my_keyspace_name.my_table_name.csv' WITH HEADER=true AND PAGETIMEOUT=40 AND MAXOUTPUTSIZE=100000 AND DELIMITER='|';
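For completeness, a hypothetical sketch of how a dump script such as dump_cassandra.sh might run that COPY non-interactively through cqlsh (the host name is a placeholder):
# "my_cassandra_host" is a placeholder; credentials may be needed in your setup
cqlsh my_cassandra_host -e "COPY my_keyspace_name.my_table_name TO 'cassandra_dump/my_keyspace_name.my_table_name.csv' WITH HEADER=true AND PAGETIMEOUT=40 AND MAXOUTPUTSIZE=100000 AND DELIMITER='|';"
With MAXOUTPUTSIZE=100000, the roughly 750,000 exported rows are written to several numbered CSV files instead of a single large one.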
I have a Liferay 6.2 server that has been running for years and is starting to take a lot of database space, despite limited actual content.
Table            Size    Number of rows
---------------------------------------
DLFileRank       5 GB    16 million
DLFileEntry      90 MB   60,000
JournalArticle   2 GB    100,000
The size of the DLFileRank table seems abnormally big to me (if it is totally normal, please let me know).
While the file ranking feature of Liferay is nice to have, we would not really mind resetting it if it halves the size of the database.
Question: Would a DELETE FROM DLFileRank be safe? (stop Liferay, run that SQL command, maybe set dl.file.rank.enabled=false in portal-ext.properties, start Liferay again)
Is there any better way to do it?
Bonus if there is a way to keep recent ranking data and throw away only the old data (not a strong requirement).
Wow. According to the documentation here (Ctrl-F rank), I'd not have expected the number of entries to be so high - did you configure those values differently?
Set the interval in minutes on how often CheckFileRankMessageListener
will run to check for and remove file ranks in excess of the maximum
number of file ranks to maintain per user per file. Defaults:
dl.file.rank.check.interval=15
Set this to true to enable file rank for document library files.
Defaults:
dl.file.rank.enabled=true
Set the maximum number of file ranks to maintain per user per file.
Defaults:
dl.file.rank.max.size=5
And according to the implementation of CheckFileRankMessageListener, it should be enough to just trigger DLFileRankLocalServiceUtil.checkFileRanks() yourself (e.g. through the scripting console). Why you accumulate that large number of files is beyond me...
As you might know, I can never be quoted as saying that direct database manipulation is the way to go - in fact, I refuse to think about the problem that way.
I get "too many open files" during the shuffle phase of my Spark job. Why is my job opening so many files? What steps can I take to try to make my job succeed.
This has been answered on the Spark user list:
The best way is definitely just to increase the ulimit if possible,
this is sort of an assumption we make in Spark that clusters will be
able to move it around.
You might be able to hack around this by decreasing the number of
reducers [or cores used by each node] but this could have some performance implications for your
job.
In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidation will help decrease the total
number of files created but the number of file handles open at any
time doesn't change so it won't help the ulimit problem.
-Patrick Wendell
The default ulimit is 1024, which is ridiculously low for large-scale applications. HBase recommends raising it up to 64K; modern Linux systems don't seem to have trouble with this many open files.
Use
ulimit -a
to see your current limits, including the maximum number of open files.
ulimit -n <new-limit>
can temporarily change the maximum number of open files for the current shell; you need to update the system configuration files and per-user limits to make the change permanent. On CentOS and RedHat systems, those settings live in
/etc/sysctl.conf
/etc/security/limits.conf
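For example, a hedged sketch of a permanent per-user limit in /etc/security/limits.conf (the user name sparkuser is a placeholder for whichever account runs the Spark workers):
# "sparkuser" is a placeholder account name; 65536 follows the 64K figure above
sparkuser  soft  nofile  65536
sparkuser  hard  nofile  65536
A fresh login session is needed for the new limit to take effect.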
Another solution for this error is reducing the number of partitions.
Check to see if you've got a lot of partitions with:
someBigSDF.rdd.getNumPartitions()
Out[]: 200
# count() here is pyspark.sql.functions.count, so import it first
from pyspark.sql.functions import count
# if you need to persist the repartition, do it like this
someBigSDF = someBigSDF.repartition(20)
# if you just need it for one transformation/action,
# you can do the repartition inline like this
someBigSDF.repartition(20).groupBy("SomeDt").agg(count("SomeQty")).orderBy("SomeDt").show()