I need to parse a big Excel spreadsheet (approximately 20 sheets) sheet by sheet with ColdFusion. The cfspreadsheet tag fails when processing large amounts of data with [java.lang.OutOfMemoryError: GC overhead limit exceeded]. Using the Apache POI user API directly behaves the same way:
<cfscript>
pkg = CreateObject("java", "org.apache.poi.openxml4j.opc.OPCPackage").open(JavaCast("string", fileName));
// error on next line
wb = CreateObject("java", "org.apache.poi.xssf.usermodel.XSSFWorkbook").Init(pkg);
</cfscript>
I tried to use the Apache POI event API instead of the user API, but ran into problems with Java inheritance. Has anyone used XSSF and SAX (the event API) to process big spreadsheets in ColdFusion?
In the end I succeeded with CF + the Apache POI event API + Mark Mandel's JavaLoader.cfc. Thank you @Leigh and @barnyr for all your help. I implemented the Excel parser in Java using XSSF and the SAX event API, and now it works, and works very fast. This wasn't easy, because the template to parse wasn't simple, and as was noted in the comments, increasing the heap size may be cheaper.
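For later readers, the core of that approach in plain Java looks roughly like this minimal sketch of XSSFReader plus a SAX content handler (the class name and handler body are illustrative, not my actual template logic):

import java.io.InputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class BigXlsxParser {
    public static void main(String[] args) throws Exception {
        // Streams each sheet's XML through SAX instead of materializing the whole workbook
        try (OPCPackage pkg = OPCPackage.open(args[0])) {
            XSSFReader reader = new XSSFReader(pkg);
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            StylesTable styles = reader.getStylesTable();
            XSSFReader.SheetIterator sheets = (XSSFReader.SheetIterator) reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheet = sheets.next()) {
                    System.out.println("Sheet: " + sheets.getSheetName());
                    XMLReader parser = XMLReaderFactory.createXMLReader();
                    parser.setContentHandler(new XSSFSheetXMLHandler(styles, strings,
                            new XSSFSheetXMLHandler.SheetContentsHandler() {
                        public void startRow(int rowNum) { }
                        public void endRow(int rowNum) { }
                        public void cell(String ref, String value, XSSFComment comment) {
                            System.out.println(ref + " = " + value); // one cell at a time, low memory
                        }
                        public void headerFooter(String text, boolean isHeader, String tag) { }
                    }, false)); // false = emit cached formula results, not formula strings
                    parser.parse(new InputSource(sheet));
                }
            }
        }
    }
}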
PROBLEM: I've hit a troubleshooting wall and am hoping for suggestions on what to check to get past an issue I'm having with an intranet site I'm working on. When reading data from a spreadsheet using NPOI (C#), sometimes (not all the time) the row reading stops after just ten rows.
Sorry for the very long post, but I'm not sure what is or isn't useful. The primary reason for posting here is that I don't know the right question to ask the Great Google Machine.
I have an intranet site where I'm reading an XLSX file and pushing its contents into an Oracle table. As you can tell by the subject line, I'm using NPOI. For the most part, it's just working, but only sometimes...
In Oracle, I have a staging table, which is truncated and is supposed to be filled with data from the spreadsheet.
In my app (ASPX), users upload their spreadsheet to the server (this just works), then the app calls a WebMethod that truncates data from the Oracle staging table (this just works), then another WebMethod is called that is supposed to read data from the spreadsheet and load the staging table (this, kinda works).
It's this "kinda works" piece that I need help with.
The spreadsheet has 170 data rows. When I run the app in VS, it reads/writes all 170 records most of the time, but sometimes it reads just 10. When I run the app from the web server, it fails the first time (I haven't been able to catch a specific error); the second and subsequent times, it reads just ten records from the spreadsheet and successfully loads all ten. I've checked the file uploaded to the server, and it does have 170 data records.
Whether the process reads 10 records or 170 records, there are no error messages and no indication why it stopped reading after just ten. (I'll mention here that the file today has 170 but tomorrow could have 180 or 162, so it's not fixed).
So, I've described what it's supposed to do and what it's actually doing. I think it's time for a code snippet.
/* snowSource below is the path/filename assembled separately */
/* SnowExcelFormat below is a class that basically maps row data to a specific data class */
IWorkbook workbook;
try
{
    using (FileStream file = new FileStream(snowSource, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        workbook = WorkbookFactory.Create(file);
    }
    var importer = new Mapper(workbook);
    var items = importer.Take<SnowExcelFormat>(0);
    /* At this point, items should have 170 rows, but sometimes it contains only 10 with no indication why. */
    /* I don't see anything in the workbook or importer objects that sheds any light on what's happening. */
Again, this works perfectly fine most of the time when running from VS, which tells me the code is workable. When running it on the web server, it fails the first time I try the process, but subsequently it runs, picking up only those first 10 records and ignoring the rest. Also, all the data that's read (10 or 170) is successfully inserted into the staging table, which tells me that Oracle is perfectly okay with the data, its format, and this process. All I need is to figure out why my code doesn't read all the data from Excel.
I have verified numerous times that the local DLL and webserver DLL are the same. And I'm reading the same Excel file.
I'm hitting a serious wall here and have run out of ideas on how to troubleshoot where the code is failing when it fails. I don't know if there's something limiting the memory available to the FileStream object, causing it to stop reading the file prematurely - I didn't run across anything that looked like a resource limiter. I don't know if there's something limiting the number of rows pulled by the importer.Take method. Any suggestions would be appreciated.
I faced the same issue on some files, and after analyzing the problem, this is what worked for me.
importer.Take<SnowExcelFormat>(0) has three parameters, and one of them is maxErrorRows. Its default value is 10.
Your parsing stops once it has hit more than 10 error rows; at that point the function stops reading.
What you have to do is set maxErrorRows explicitly instead of taking the default value of 10, e.g. importer.Take<SnowExcelFormat>(0, maxErrorRows: int.MaxValue) (the exact limit to pass is up to you).
I have referred to all the Stack Overflow posts related to reading XLSB files using Apache POI.
I tried many ways to read an XLSB file using the available links/examples mentioned in those posts, but I keep ending up with issues.
I am using the latest Apache POI 3.17 and used the code mentioned in:
Link: Exception reading XLSB File Apache POI java.io.CharConversionException
Section: the post by "Gagravarr"
I am getting the following errors:
The method getLocale() is undefined for the type XSSFBEventBasedExcelExtractor
The method getFormulasNotResults() is undefined for the type XSSFBEventBasedExcelExtractor
The constructor XSSFEventBasedExcelExtractor.SheetTextExtractor() is not visible
The method getIncludeSheetNames() is undefined for the type XSSFBEventBasedExcelExtractor
... etc.
I checked the base class XSSFEventBasedExcelExtractor in poi-ooxml-3.17.jar (source files), and I can find implementations of all these methods.
I wanted to know whether this is a known issue. Does it mean that there is no working example available for reading XLSB files in Java?
I hope this query is not a duplicate.
Recently, I studied how to use POI to read XLSB.
If you just want to read an XLSB file, you can use the Apache test example code at the following link:
https://svn.apache.org/repos/asf/poi/trunk/src/ooxml/testcases/org/apache/poi/xssf/eventusermodel/TestXSSFBReader.java
In fact, XLSB uses .bin parts instead of .xml parts inside the package.
If you want to do more with XLSB files, you can read the following document:
https://msdn.microsoft.com/en-us/library/office/cc313133(v=office.12).aspx
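For what it's worth, a minimal sketch like the one below should compile cleanly against a consistent set of POI 3.17 jars (the file name is illustrative); if methods like getIncludeSheetNames() show up as undefined, that often indicates an older poi or poi-ooxml jar sitting earlier on the classpath:

import org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor;

public class XlsbDump {
    public static void main(String[] args) throws Exception {
        // Streams the binary .bin sheet parts of an .xlsb through the event API
        XSSFBEventBasedExcelExtractor extractor =
                new XSSFBEventBasedExcelExtractor("example.xlsb"); // illustrative path
        try {
            extractor.setIncludeSheetNames(true); // inherited from XSSFEventBasedExcelExtractor
            System.out.println(extractor.getText());
        } finally {
            extractor.close();
        }
    }
}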
I am using ColdFusion to export a fairly small number of rows (around 1000) but a large number of columns (around 300) to Excel. It is a multi-sheet Excel file, with at least two of the sheets having the large number of columns. Using cfspreadsheet throws a Java heap error, and updating the JVM settings shows no improvement. What is the best way to export to Excel without causing the Java heap error?
Edit: I have tried a few ways to fix the issue within the program. I am using the XML Workbook format inside cfsavecontent to build the multiple sheets and render the result using cfcontent. In this case, cfcontent might be using a large amount of memory, resulting in the heap space error.
<cfsavecontent variable="REQUEST.xmlData">
<cfoutput>
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook>
...other contents
</Workbook>
</cfoutput>
</cfsavecontent>
For the second workaround, I am using queryNew() to build the contents and dumping the final result into an Excel file using <cfspreadsheet action="write">. For subsequent sheets, I am using <cfspreadsheet action="update">. The ultimate goal is to serve the Excel file using <cflocation url="excelPath">, but in this case cfspreadsheet update takes forever and throws an out-of-memory error.
If updating the JVM is not an option, what other ways would you suggest to overcome the memory issues?
I'm a little late to the party...
From what I can tell, cfspreadsheet tries to materialize the entire file in memory before flushing to disk. With cfsavecontent you're doing this explicitly.
You're familiar with building Workbook XML, so all that your cfsavecontent approach needs is streaming to disk.
This can be accomplished by digging into the underlying Java libraries. java.io.FileWriter can append to a file without keeping the entire file in memory:
var append = true;
var newline = chr(10); // CFML string literals have no \n escape, so build newlines explicitly
var writer = createObject("java", "java.io.FileWriter").init(filename, append);
try {
    writer.append("<?xml version=""1.0""?>" & newline);
    writer.append("<?mso-application progid=""Excel.Sheet""?>" & newline);
    writer.append("<Workbook>" & newline);
    // Build your workbook in chunks
    // for (var row in query)
    //     writer.append(markup)
    writer.append("</Workbook>");
} finally {
    writer.close();
}
From testing, I believe FileWriter flushes regularly, so I've omitted explicit flush() calls, but I can't find any documentation confirming that. I never saw memory usage get very high, but YMMV.
I am using streams to download an Excel file. Please find the code below:
$response = new \Zend\Http\Response\Stream();
$response->setStream(fopen($myfile, 'r'));
$response->setStatusCode(200);
$headers = new \Zend\Http\Headers();
$headers->addHeaderLine("Content-Type: application/vnd.ms-excel; charset=UTF-8")
->addHeaderLine('Content-Disposition', 'attachment; filename=my.xls')
->addHeaderLine( "Content-Transfer-Encoding: binary" )
->addHeaderLine('Content-Length', filesize($myfile));
$response->setHeaders($headers);
The generated file is fine, but when I force the user to download that same file, two empty lines appear at the start of the downloaded Excel file. I did some research and suspected the HTTP version and header lines, but I don't think that's it, because changing them still leaves those two empty lines at the start of the Excel report.
Please note there is no empty space at the start of the content.
Any idea why this happens?
Mind that the streamed response does not in fact provide a real stream context to the client. It just buffers the stream internally and sends out the response in one go.
That being said, I have created a controller plugin to send attachments from a file path or directly from binary data assigned to a variable. It's in my common Soflomo\Common library. I haven't had the issues you describe, and I use some more headers than you do.
Tell me if that piece of code works for you. One of the differences is that you use the size of the original file as the size of the response. I am not sure, but this might cause a mismatch with the cached streamed response. Try to just grab the contents and do a strlen() on that content.
Is there any way to read or write both Excel 2003 and 2007 formats using Apache POI? I know that we can use an HSSF workbook for the 2003 format and XSSF for 2007 (correct me if I'm wrong). But is there any way to read both formats through a single workbook interface, rather than handling each separately?
Yes, you can do it. In fact, it's fairly widely documented on the Apache POI website!
If you already have code that uses HSSF, then you should follow the HSSF to SS converting guide for help on updating your code to be general across the two formats.
If you don't have any code yet, then follow the User API guide to get started - all the code in that is general for both formats. You can also look at the Quick Guide for some specific problems and how to solve them in the general way.
Use
WorkbookFactory.create(in);
Based on the javadoc, it
Creates the appropriate HSSFWorkbook / XSSFWorkbook from the given
InputStream.
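For example, a minimal sketch along those lines (the file name is illustrative); once the factory has sniffed the format, the same org.apache.poi.ss.usermodel code handles both .xls and .xlsx:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ReadEither {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("report.xls")) { // works for .xlsx too
            Workbook wb = WorkbookFactory.create(in); // picks HSSF or XSSF by sniffing the header
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                System.out.println("Row " + row.getRowNum() + " has " + row.getLastCellNum() + " cells");
            }
        }
    }
}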
Try Workbook wb = WorkbookFactory.create(pkg);, where pkg is an OPCPackage.
It should work. However, if the XSSF file is too big, you will get an OutOfMemoryError, and in that case you should use the event user model to read your file. You can check the extension of the file path to decide which route to take, like the following:
private boolean isXLS(String inputPath) {
    // Compare the last three characters of the path against "xls" (case-insensitive)
    String tmp = inputPath.substring(inputPath.length() - 3);
    return tmp.equalsIgnoreCase("XLS");
}
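Tying it together, here is a hypothetical helper using isXLS (openSmallWorkbook is an illustrative name, not a POI API); the event-API branch mentioned above is where a SAX-style handler would go for huge .xlsx files:

import java.io.FileInputStream;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// Hypothetical dispatch: eager loading is fine for small files only
private Workbook openSmallWorkbook(String path) throws Exception {
    if (isXLS(path)) {
        return new HSSFWorkbook(new FileInputStream(path)); // binary .xls
    }
    // OOXML .xlsx; for very large files skip this and use the event user model instead
    return new XSSFWorkbook(OPCPackage.open(path));
}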
Read the How-to for more information about the event user model.