I have a requirement to do the following:
1) Make a copy of a huge Excel file (roughly 1400 x 1400 cells).
2) Read the copied file, add new rows and columns, and edit existing cells at the same time.
3) This is a standalone program, not something running on a server. I need a low memory footprint and fast performance.
I have done some reading and found the following:
1) There is no API to copy such a huge file.
2) SXSSF can be used for writing but not for reading.
3) XSSF with SAX (the event API) can be used for reading but not for editing. If I read everything and store it as objects, I will again run into memory issues.
Can you please help me figure out how to do this?
Assuming you have enough memory to use XSSF/SAX for reading and SXSSF for writing, let me suggest the following solution.
1) Read the file using XSSF/SAX. For each row, create an object with the row data and immediately write it out into a file using ObjectOutputStream or any other output format you find convenient. You will create a separate file for each row. And there will only be 1 row object in memory, because you can keep modifying the same object with each row's data.
2) Make whatever modifications you need to. For rows that need to be modified, read the corresponding file back into your row object, modify as needed, and write it back out. For new rows, simply set the data in your row object and write it out to a new file.
3) Use SXSSF to reassemble your spreadsheet by reading 1 row object file at a time and storing it in your output spreadsheet.
That way, you will only have 1 row in memory at a time.
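Here is a minimal sketch of that idea: a simple Serializable row holder, one spool file per row, and an SXSSF pass at the end. The class names, the file names (row_0.ser and so on) and the 100-row window are all made up for illustration and would need adapting to your real data:
import java.io.*;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

// single reusable row object; refill it with each row's data as you read
class RowData implements Serializable {
    private static final long serialVersionUID = 1L;
    String[] cells;
}

public class RowSpool {
    // spool the row object out to its own file
    static void writeRow(RowData row, int rowIndex) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(
                new FileOutputStream("row_" + rowIndex + ".ser"))) {
            oos.writeObject(row);
        }
    }

    // read one row file back in when it needs to be modified or re-emitted
    static RowData readRow(int rowIndex) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new FileInputStream("row_" + rowIndex + ".ser"))) {
            return (RowData) ois.readObject();
        }
    }

    // reassemble the spreadsheet with SXSSF, keeping only a small window of rows in memory
    static void reassemble(int rowCount) throws Exception {
        SXSSFWorkbook wb = new SXSSFWorkbook(100);   // keep at most 100 rows in memory
        Sheet sheet = wb.createSheet("data");
        for (int r = 0; r < rowCount; r++) {
            RowData data = readRow(r);
            Row row = sheet.createRow(r);
            for (int c = 0; c < data.cells.length; c++) {
                row.createCell(c).setCellValue(data.cells[c]);
            }
        }
        try (FileOutputStream out = new FileOutputStream("output.xlsx")) {
            wb.write(out);
        }
        wb.dispose();   // delete the temporary files SXSSF created behind the scenes
    }
}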
If there is so much data that 'Out of Memory' or 'GC overhead limit exceeded' errors occur, the data can first be written out to an XML file. The sheet inside the Excel file can then be replaced with that XML file, so memory usage stays minimal.
Inside an .xlsx file the sheets are stored as XML. Using java.util.zip.ZipFile, each entry can be identified, and the XML for the sheet can be replaced with the generated XML so that the expected data ends up in the Excel sheet (see the substitution sketch after the writer class below).
The following class can be used to create the sheet XML:
import java.io.IOException;
import java.io.Writer;
import org.apache.poi.ss.util.CellReference;

/**
 * Writes a minimal SpreadsheetML worksheet directly as XML.
 */
public class XmlSpreadsheetWriter {
    private final Writer _out;
    private int _rownum;

    public XmlSpreadsheetWriter(Writer out) {
        _out = out;
    }

    public void beginSheet() throws IOException {
        _out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
                "<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">");
        _out.write("<sheetData>\n");
    }

    public void endSheet() throws IOException {
        _out.write("</sheetData>");
        _out.write("</worksheet>");
    }

    public void insertRow(int rownum) throws IOException {
        _out.write("<row r=\"" + (rownum + 1) + "\">\n");
        this._rownum = rownum;
    }

    public void endRow() throws IOException {
        _out.write("</row>\n");
    }

    public void createCell(int columnIndex, String value, int styleIndex) throws IOException {
        String ref = new CellReference(_rownum, columnIndex).formatAsString();
        _out.write("<c r=\"" + ref + "\" t=\"inlineStr\"");
        _out.write(" s=\"" + styleIndex + "\"");
        _out.write(">");
        _out.write("<is><t>" + value + "</t></is>");   // note: value should be XML-escaped
        _out.write("</c>");
    }

    public void createCell(int columnIndex, double value, int styleIndex) throws IOException {
        String ref = new CellReference(_rownum, columnIndex).formatAsString();
        _out.write("<c r=\"" + ref + "\" t=\"n\"");
        _out.write(" s=\"" + styleIndex + "\"");
        _out.write(">");
        _out.write("<v>" + value + "</v>");
        _out.write("</c>");
    }

    public void createEmptyCell(int columnIndex, int styleIndex) throws IOException {
        String ref = new CellReference(_rownum, columnIndex).formatAsString();
        _out.write("<c r=\"" + ref + "\" t=\"n\"");
        _out.write(" s=\"" + styleIndex + "\"");
        _out.write(">");
        _out.write("<v></v>");
        _out.write("</c>");
    }
}
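To get the generated XML back into a real .xlsx, the usual trick is to copy every entry of a template workbook with java.util.zip and substitute the one sheet entry. This is only a sketch; the entry name xl/worksheets/sheet1.xml assumes the data lives in the first sheet:
import java.io.*;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class SheetSubstituter {
    public static void substitute(File templateXlsx, File sheetXml,
                                  String entryName, File outXlsx) throws IOException {
        try (ZipFile zip = new ZipFile(templateXlsx);
             ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(outXlsx))) {
            // copy every entry except the sheet we are replacing
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.getName().equals(entryName)) {
                    zos.putNextEntry(new ZipEntry(entry.getName()));
                    copy(zip.getInputStream(entry), zos);
                }
            }
            // append the freshly written sheet XML under the original entry name
            zos.putNextEntry(new ZipEntry(entryName));
            copy(new FileInputStream(sheetXml), zos);
        }
    }

    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) > 0; ) {
            out.write(buf, 0, n);
        }
        in.close();
    }
}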
If memory is the problem with the number of records you mentioned (1400 x 1400), then getting at the underlying XML data and processing it yourself might be the way to go. It may not be the most convenient solution, but it will definitely address your low-memory requirement. The POI site itself points to this approach:
"If memory footprint is an issue, then for XSSF, you can get at the underlying XML data, and process it yourself. This is intended for intermediate developers who are willing to learn a little bit of low level structure of .xlsx files, and who are happy processing XML in java. Its relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read a XLSX file with a relatively small memory footprint."
Source: http://poi.apache.org/spreadsheet/how-to.html
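For completeness, a rough sketch of that low-level read path: the XSSF event API hands you each sheet part as a raw XML stream, which you can feed through a plain SAX handler so only the current row needs to live in memory. The file name and the empty handler body are placeholders:
import java.io.File;
import java.io.InputStream;
import java.util.Iterator;
import javax.xml.parsers.SAXParserFactory;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SheetSaxReader {
    public static void main(String[] args) throws Exception {
        OPCPackage pkg = OPCPackage.open(new File("big.xlsx"));
        try {
            XSSFReader reader = new XSSFReader(pkg);

            // plain SAX parser; namespace-aware so localName is "row"/"c"/"v"
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                          String qName, Attributes attrs) {
                    if ("row".equals(localName)) {
                        // a new row starts here: collect its cells, process, then forget it
                    }
                }
            });

            // stream each worksheet part, one at a time
            Iterator<InputStream> sheets = reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheet = sheets.next()) {
                    parser.parse(new InputSource(sheet));
                }
            }
        } finally {
            pkg.close();
        }
    }
}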
Related
I want to instantiate a large number of StringProperty fields to hold text values (>100,000). All in all my code performs well so far. I'm still trying to optimize it as much as possible to harness the full capabilities of my weak CPU (Intel Atom N2600, 1.6 GHz, 2 GB RAM).
I'm calling the following method 100,000 times, and it takes a few seconds until all values are stored in my array of StringProperty.
public void setData(int row, int numberOfCols, String[][] data) {
    this.dataValue = new StringProperty[numberOfCols];
    for (int i = 0; i < numberOfCols; i++) {
        dataValue[i] = new SimpleStringProperty(data[row][i]);
    }
}
Is the method above good enough for instantiating the fields and setting the values?
Any alternative ideas on how to tweak the method above?
I am trying to pass a String as the value in my mapper, but I am getting an error that it is not Writable. How do I resolve this?
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String TempString = value.toString();
    String[] SingleRecord = TempString.split("\t");
    // using Integer.parseInt to calculate profit
    int Amount = Integer.parseInt(SingleRecord[7]);
    int Asset = Integer.parseInt(SingleRecord[8]);
    int SalesPrice = Integer.parseInt(SingleRecord[9]);
    int Profit = Amount * (SalesPrice - Asset);
    String ValueProfit = String.valueOf(Profit);
    String ValueOne = String.valueOf(one);
    custID.set(SingleRecord[2]);
    data.set(ValueOne + ValueProfit);
    context.write(custID, data);
}
Yahoo's tutorial says:
Objects which can be marshaled to or from files and across the network must obey a particular interface, called Writable, which allows Hadoop to read and write the data in a serialized form for transmission.
From the Cloudera site:
The key and value classes must be serializable by the framework and hence must implement the Writable interface. Additionally, the key classes must implement the WritableComparable interface to facilitate sorting.
So you need an implementation of Writable to write it as a value into the context. Hadoop ships with a few stock classes such as IntWritable. The String counterpart you are looking for is the Text class. It can be used like this:
context.write(custID, new Text(data));
OR
Text outValue = new Text();
outValue.set(data);
context.write(custID, outValue);
In case you need specialized functionality in the value class, you may implement Writable yourself (not a big deal after all). However, it seems like Text is just enough for you.
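If you ever do need a specialized value class, a bare-bones Writable is not much code. The class and fields below are hypothetical, just to show the shape:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// hypothetical value type carrying a count and a profit figure
public class ProfitWritable implements Writable {
    private int count;
    private long profit;

    public ProfitWritable() { }              // Hadoop needs a no-arg constructor

    public ProfitWritable(int count, long profit) {
        this.count = count;
        this.profit = profit;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeLong(profit);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
        profit = in.readLong();
    }
}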
Also, you haven't shown where data is declared in your map function; going by the code above it should be a Text. TextWritable is wrong, just use Text there as well.
I need to read XLSM file metadata. For files smaller than 4 MB, the following instructions work correctly:
try {
    OPCPackage pkg = OPCPackage.open(new FileInputStream("C:\\Path to file.xlsm"));
    XSSFWorkbook documento = new XSSFWorkbook(pkg);
    POIXMLProperties poixmlProperties = documento.getProperties();
    ...
} catch (Exception ex) {
    ...
} finally {
    ...
}
For files larger than 4 MB the second line (XSSFWorkbook documento = new XSSFWorkbook(pkg)) never completes; execution jumps directly to the finally block without reporting any error.
Firstly, don't use an InputStream when you have a File object! Using an InputStream means POI has to buffer the whole thing into memory, while with a File it can leave it on disk until needed. As covered in the POI documentation, open it instead with:
OPCPackage pkg = OPCPackage.open(new File("file.xlsx"));
Secondly, you seem to be opening up the whole Excel structure and parsing it into memory, when all you want is the metadata. Don't! Firing up XSSFWorkbook takes quite a lot of processing and memory, which you don't need if all you care about is the metadata, which lives in a separate part of the file.
Instead, just load the properties themselves, and work with that. Your final code would be something like:
OPCPackage pkg = OPCPackage.open(new File("file.xlsx"));
POIXMLProperties props = new POIXMLProperties(pkg);
System.out.println("The title is " + props.getCorePart().getTitle());
That ought to load only a fraction of the file's total size into memory.
I need to grab a large amount of data from one set of tables and bulk-insert it into another set... Unfortunately the source tables are ALL varchar(max) and I would like the destination columns to have the correct types. Some tables have millions of rows... and (for far too pointless political reasons to go into) we can't use SSIS.
On top of that, some "bool" values are stored as "Y/N", some as "0/1", some as "T/F", some as "true/false" and finally some as "on/off".
Is there a way to overload IDataReader to perform the type conversion? It would need to be on a per-column basis, I guess?
An alternative (and possibly the best solution) is to put a mapper in place (perhaps AutoMapper or a custom one) and use EF to load one object and map it into the other. This would provide a lot of control, but would also require a lot of boilerplate code for every property :(
In the end I wrote a base wrapper class to hold the SqlDataReader, implementing each IDataReader method simply by delegating to the corresponding SqlDataReader method.
Then I inherit from that base class and override GetValue on a per-case basis, looking for the column names that need translating:
public override object GetValue(int i)
{
    var landingColumn = GetName(i);
    string landingValue = base.GetValue(i).ToString();
    object stagingValue = null;

    switch (landingColumn)
    {
        // FromStringDate() and ToDecimal() are custom extension methods
        case "D4DTE": stagingValue = landingValue.FromStringDate(); break;
        case "D4BRAR": stagingValue = landingValue.ToDecimal(); break;
        default:
            stagingValue = landingValue;
            break;
    }

    return stagingValue;
}
It works well, is extensible, and is very fast thanks to SqlBulkCopy. OK, so there's a small maintenance overhead, but since the source columns will very rarely change, this doesn't really affect anything.
I have an application that requires mappings between string values, so essentially a container that can hold key-value pairs. Instead of using a dictionary or a name-value collection I used a resource file that I access programmatically in my code. I understand resource files are meant for localization scenarios, multi-language implementations and the like. However, I like their strongly typed nature, which ensures that if a key is removed or renamed, the code referencing it no longer compiles.
However, I would like to know if there are any important cons to using a *.resx file for simple key-value pair storage instead of a more traditional programmatic type.
There are two cons I can think of off the top of my head:
it requires an I/O operation to read a key/value pair, which may result in a significant performance decrease,
if you let the standard .NET logic resolve resource loading, it will always try to find the file corresponding to the CultureInfo.CurrentUICulture property; this could be problematic if you decide that you actually want multiple resx files (i.e. one per language), and could result in even further performance degradation.
BTW, couldn't you just create a helper class or structure containing properties, like this:
public static class GlobalConstants
{
    private const int _SomeInt = 42;
    private const string _SomeString = "Ultimate answer";

    public static int SomeInt
    {
        get { return _SomeInt; }
    }

    public static string SomeString
    {
        get { return _SomeString; }
    }
}
You can then access these properties exactly the same way as resource files (I am assuming that you're used to this style):
textBox1.Text = GlobalConstants.SomeString;
textBox1.Top = GlobalConstants.SomeInt;
Maybe it is not the best thing to do, but I firmly believe this is still better than using a resource file for that...