POI out of memory when reading PPT file - apache-poi

I've been working on a project that uses apache-poi to read .PPT files and change some attributes of the SlideShowDocInfoAtom record in the file.
I can read the file using HSLFSlideShow. However, with a large PPT file (e.g. over 1GB) and my application's JVM max heap size restricted to 2GB, POI throws an OutOfMemoryError.
After reading the source code, I know it creates a byte array when reading one of the streams in the file. In the 1GB file, the PowerPoint Document stream can be up to 1GB, so the byte array consumes 1GB of memory, which somehow causes the JVM to crash.
So, is there any way to read a large PPT file without enlarging the JVM heap size? I only want to read some doc info from the file, and I don't want to read large blocks such as audio or video into memory.

Related

Why am I still getting OOM errors with Apache POI using SXSSFWorkbook?

We were using the XSSFWorkbook class to build an Excel report, but began getting OOM errors as the included data increased. I've updated the code to use the SXSSFWorkbook class instead, as I read it is much lighter on memory because data is 'flushed to disk' rather than being stored in memory.
I am able to add all of the data to the workbook, but when trying to write the data to a file (OutputStream) I am getting Java heap space errors once again. I'm not sure if I am doing something wrong here, and I'm not sure I understand what is meant by the records being 'flushed to disk'. Should I be providing a file reference before loading the data into the workbook so it can be flushed to the actual file? Or should I be doing something else when writing the workbook to the output stream?
I've done some testing and it seems I only get the OOM error when setting styles on the cells. If I forgo the styling, the file builds without running out of memory.
The output xlsx file will have 8 sheets, each with up to 35,000 rows.

NodeJS how to seek a large gzipped file?

I have a large GZIP-ed file. I want to read a few bytes from a specific offset of uncompressed data.
For example, I have a file whose original size is 10GB. Gzipped, it is 1GB. I want to read a few bytes at the 5GB offset of the uncompressed data from that 1GB gzipped file.
You will need to read all of the first 5 GB in order to get just those bytes.
If you are frequently accessing just a few bytes from the same large gzip file, then it can be indexed for more rapid random access. You would read the entire file once to build the index. See zran.h and zran.c.
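To illustrate why the first 5 GB has to be read, here is a minimal Python sketch (not Node, and not the zran index approach) that decompresses and discards data until it reaches the requested offset. The function name and the `chunk_size` parameter are made up for illustration.

```python
import gzip

def read_at_uncompressed_offset(path, offset, length, chunk_size=1 << 20):
    """Return up to `length` bytes starting at `offset` in the uncompressed data."""
    with gzip.open(path, "rb") as f:
        remaining = offset
        # gzip has no random access: decompress and throw away everything
        # before the offset, one chunk at a time, so memory use stays small.
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                raise EOFError("offset is past the end of the uncompressed data")
            remaining -= len(chunk)
        return f.read(length)
```

The cost grows with the offset, which is why an index built in a single pass pays off when the same file is queried repeatedly.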

Pentaho Data Integration - Excel Writer Output File Size

Is PDI inefficient at writing Excel xlsx files with the Microsoft Excel Writer step?
An Excel data file produced by Pentaho seems to be three times the size it would be if the data were transformed manually. Is this inefficiency expected, or is there a workaround for it?
A CSV file of the same transformed output is far smaller. Have I configured something wrong?
xlsx files should normally be smaller in size than CSV, since they consist of XML data compressed in ZIP files. Pentaho's Microsoft Excel Writer uses org.apache.poi.xssf.streaming.SXSSFWorkbook and org.apache.poi.xssf.usermodel.XSSFWorkbook to write xlsx files, and they create compressed files so this should not be your issue.
To check, you could open the files with a zip utility and look at the entry sizes and compression rate to see if there is a bug. You could also try opening the file in Excel and re-saving it to see whether that gives a smaller size, which could indicate an inefficiency.
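If no zip utility is at hand, the same check can be sketched in Python with the standard `zipfile` module, since an xlsx file is a ZIP archive. The function name `compression_report` is hypothetical.

```python
import zipfile

def compression_report(path_or_file):
    """Return (entry name, uncompressed bytes, compressed bytes) per archive entry."""
    with zipfile.ZipFile(path_or_file) as zf:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in zf.infolist()
        ]

def print_report(path_or_file):
    """Print one line per entry with its sizes and compressed/uncompressed ratio."""
    for name, raw, packed in compression_report(path_or_file):
        ratio = packed / raw if raw else 1.0
        print(f"{name}: {raw} -> {packed} bytes ({ratio:.0%})")
```

Calling `print_report("output.xlsx")` (the file name is a placeholder) on both the Pentaho output and the re-saved file would show whether the sheet XML itself is larger or whether it is merely compressed less.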

How to convert xml file of stack overflow dump to csv file

I have a Stack Overflow data dump file in .xml format, nearly 27GB, and I want to convert it to a .csv file. Can somebody suggest tools to convert XML to CSV, or a Python program?
Use one of the Python XML modules to parse the .xml file. Unless you have much more than 27GB of RAM, you will need to do this incrementally, so limit your choices accordingly. Use the csv module to write the .csv file.
Your real problem is this. CSV files are lines of fields. They represent a rectangular table. XML files, in general, can represent more complex structures: hierarchical databases and/or multiple tables. So your real problem is to understand the data dump format well enough to extract records to write to the .csv file.
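A minimal sketch of the incremental approach, using `xml.etree.ElementTree.iterparse` and the `csv` module. It assumes each record is a `<row .../>` element directly under the root with its fields stored as attributes; the function name and the column names in the usage note are illustrative, so check them against the actual dump.

```python
import csv
import xml.etree.ElementTree as ET

def xml_rows_to_csv(xml_source, csv_file, columns, row_tag="row"):
    """Stream row elements from xml_source and write the chosen attributes as CSV.

    Returns the number of rows written. Assumes rows are direct children of
    the root element and carry their fields as XML attributes.
    """
    writer = csv.writer(csv_file)
    writer.writerow(columns)
    count = 0
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == row_tag:
            writer.writerow([elem.get(col, "") for col in columns])
            count += 1
            root.clear()  # drop already-processed rows so memory stays flat
    return count
```

Usage would look something like `xml_rows_to_csv("Posts.xml", open("posts.csv", "w", newline=""), ["Id", "Title"])`. Because only one row is held in memory at a time, the file size does not dictate the RAM needed.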
I have written a PySpark function to convert the .xml to .csv. XmltoCsv_StackExchange is the GitHub repo. I used it to convert 1 GB of XML within 2-3 minutes on a minimal 2-core, 2 GB RAM Spark setup. It can convert a 27GB file too; just increase minPartitions from 4 to around 128 in this line.
raw = (sc.textFile(fileName, 4))

NodeJS: How to handle process out of memory error on large xlsx file parsing

I am using Node.js to parse xlsx files cell by cell, and the parsed cell values are stored in MongoDB.
It works fine for small Excel files of less than 3MB. But for files larger than 3MB, the Node application crashes with the error "CALL_AND_RETRY_2 Allocation failed - process out of memory".
Used technologies:
Nodejs: v0.8.22,
MongoDB: 2.2.4
System Config:
OS: Ubuntu 12.04,
Memory: 4GB,
Processor: Intel I5
My steps to parse and store the xlsx data in MongoDB:
1. Unzip the uploaded xlsx file.
2. Read the styles, shared strings, sheets, cells of each sheet, and defined names from the extracted XML files, and save those values into a JS object.
3. Save the values into MongoDB collections by iterating over the JS object.
As far as I know, step 2 is causing the out-of-memory error, because I am storing all of the xlsx values in a single JS object.
Please suggest a way to change the above process, or some other way to handle this situation.
Thanks.
You could try to start node with
node --max-old-space-size=3000 app
to increase the max memory to 3 GB. However, the default memory limit of Node is 512 MB on 32-bit systems and 1 GB on 64-bit (according to https://github.com/joyent/node/wiki/FAQ). If you hit these limits when parsing a 3 MB Excel file, that sounds seriously odd and might be a memory leak. Maybe you want to post the code?
By the way, Node 0.8 is not exactly the latest and greatest. Maybe you should also try updating to a more recent version.
