POI 3.7 - writing 50K+ rows using SS usermodel - apache-poi

I am trying to create a sheet with 50K+ rows using POI 3.7, but the generation is maxing out at 32,768 rows. I understand that the createRow() function takes an int, whose limit is 2 raised to the power of 31, minus 1; since 32,768 is exactly 2^15 (the range of a signed short), a short somewhere may be the real limitation here.
Has anyone else faced this issue, and if so, is there a workaround that can be used to generate a sheet with possibly 60K or 80K rows? Please let me know if you have any feedback/ideas on how to generate 50K+ rows using the SS usermodel.
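For reference, the plain SS-usermodel loop has no problem with row counts in this range. Below is a minimal sketch (assuming XSSF, i.e. .xlsx output, which allows 1,048,576 rows per sheet; the binary .xls format handled by HSSF caps at 65,536). The row index is an int end to end, so if your build stops at exactly 32,768 it is worth checking for a short variable in the calling code:

    import java.io.FileOutputStream;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;

    public class ManyRows {
        public static void main(String[] args) throws Exception {
            Workbook wb = new XSSFWorkbook();      // .xlsx: up to 1,048,576 rows per sheet
            Sheet sheet = wb.createSheet("data");
            for (int r = 0; r < 80000; r++) {      // int index, well past 32,768
                Row row = sheet.createRow(r);
                row.createCell(0).setCellValue(r);
            }
            FileOutputStream out = new FileOutputStream("many-rows.xlsx");
            wb.write(out);
            out.close();
        }
    }

Note that XSSF in POI 3.7 builds the whole workbook in memory, so wide sheets at this row count may need a larger heap; later versions (3.8+) add a streaming SXSSF model for exactly this case, but it is not available in 3.7.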

Related

OLE DB Source to Excel Destination - Process is Stuck

I have the following Data Flow Task setup (see image).
It takes the correct number of rows from the OLE DB Source and passes everything through the Data Conversion item. However, the process then gets stuck at 10,104 out of the 29,379 rows between the Sort and Excel Destination items (I'm sorting alphabetically by one column only).
Why is it getting stuck and what can I do to move it out of this rut?
Thanks
I'd need to see the properties on your Sort transformation, but maybe this could be the issue: make sure the "Remove rows with duplicate sort values" option isn't checked.
Thanks.
Gav
The issue was that when inserting into an Excel destination, the maximum size for each column is 255 characters, but the values coming from the mapped SQL Server column averaged more than 700 characters.
So it was necessary to set the maximum size of the large column in the Data Conversion item to 255 to match the Excel column limit. SSIS then truncates the values to fit.

Merge very large hive Tables (11 to be precise) using Spark

I am basically substituting for another programmer.
Problem Description:
There are 11 Hive tables, each with 8 to 11 columns. All of these tables have around 5 columns whose names are the same but whose values differ.
For example, Table A has mobile_no, date, and duration columns, and so does Table B, but the values are not the same. The other columns have different names from table to table.
In all tables the data types are simple ones: string, integer, double. String values are at most 100 characters.
Each table contains around 50 million rows. The requirement is to merge these 11 tables, keeping their columns as they are, into one big table.
Our Spark cluster has 20 physical servers, each with 36 cores (72 counting virtualization) and 512 GB of RAM. The Spark version is 2.2.x.
I have to merge these efficiently in terms of both memory and speed.
Can you help me with this problem?
N.B.: please let me know if you have questions.
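No answer is recorded for this thread, but one common way to tackle this kind of merge is to compute the union of all column names, pad each table with NULLs for the columns it lacks, and union the aligned results. Here is a minimal sketch against the Spark 2.2 Java API, with hypothetical database and table names; the padded columns are cast to string purely for illustration, and real code should cast each one to that column's actual type:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.lit;

    public class MergeHiveTables {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("merge-hive-tables")
                    .enableHiveSupport()
                    .getOrCreate();

            // Hypothetical names; substitute the real 11 Hive tables.
            List<String> tables = Arrays.asList("db.table_a", "db.table_b" /* , ... */);

            // Pass 1: collect the union of all column names, preserving first-seen order.
            Set<String> allCols = new LinkedHashSet<>();
            List<Dataset<Row>> dfs = new ArrayList<>();
            for (String t : tables) {
                Dataset<Row> df = spark.table(t);
                allCols.addAll(Arrays.asList(df.columns()));
                dfs.add(df);
            }

            // Pass 2: project every table onto the full column list in one fixed
            // order, filling missing columns with NULL, then union everything.
            Dataset<Row> merged = null;
            for (Dataset<Row> df : dfs) {
                Set<String> own = new LinkedHashSet<>(Arrays.asList(df.columns()));
                List<Column> projection = new ArrayList<>();
                for (String c : allCols) {
                    projection.add(own.contains(c)
                            ? col(c)
                            : lit(null).cast("string").as(c)); // use the real type per column
                }
                Dataset<Row> aligned = df.select(projection.toArray(new Column[0]));
                merged = (merged == null) ? aligned : merged.union(aligned);
            }

            merged.write().mode("overwrite").saveAsTable("db.merged_table");
        }
    }

Each source table is read only once, and since union in Spark 2.2 resolves columns by position rather than by name, the fixed projection order above is what keeps the same-named columns lined up across all 11 tables.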

Access query returns a different result when read from Excel

I have a query in an ACCDB that works fine in Access.
I can successfully copy/paste its data to Excel.
However, from Excel, if I try to insert a Pivot Table using an External Data Source pointing to the very same query, some numeric fields get weird formatting and some calculated numeric columns (formulas in the query) have their values divided by 100 compared to the source.
I've never seen that behaviour. Any suggestions?
The whole MS Office setup is 2010.
What I have already done in the source query (without visible improvement):
used CCur() to make sure the figures are in a coherent data type
set the Format property of those culprit columns to "Standard"
The behaviour is exactly the same on other PCs in the same bank.
I was able to solve the problem, which was due to two different bugs, probably in JetOLEDB.
Like is not handled properly by Excel
The query contained some formulae using Like:
iif(someField Like "XX*";0;anotherField).
Changing this to iif(Left(somefield;2) = "XX";0;anotherField) resolved the calculation differences between Excel and Access.
Reference to another calculated column is handled differently
Say you have 2 query columns:
Rate: i.Rate *100 (i is a table alias)
Amount: Rate*Price
Access calculates Amount using the Rate calculated column, while Excel uses the Rate field from table i. Therefore I had to change the Amount expression to:
Rate: i.Rate * 100
Amount: i.Rate * 100 * Price
since Excel does not seem to consistently use the calculated Rate and instead reads Rate from the table (i.Rate).
Use the query in Access to first make a table (via a Make Table query) in Access, then import that table into Excel.

What is the best way to filter a large list in Excel?

I have a table in Excel that I want to filter. It will have a maximum of 1 million rows and 80 columns. All the calculations etc. are done programmatically in arrays to cut down processing time. However, I also want to filter the results to display only certain rows based on one column's value, followed by a top 5% based on another filter value.
When I first built the sheet, it was limited to 65,000 results, so there were no problems with the size of the data set; I just invoked the worksheet filter functions from code and did it that way. Can I do it that way with a larger data set, or is there a way to filter an array the way you would a dataset on a sheet?
Thanks
As already mentioned by everyone, Excel 2007 will take you to a million rows, but it's slower than the Excel 2003 I presume you're using at the moment, so filtering with it wouldn't be advisable.
Along with MySQL, MS Access is also an option.
You really should put that data in an Access table and use Excel's Database Query to do the job. Since it can also filter retrieved data based on a cell value, it's a great combination.
Storing the data in a database brings you another interesting option (depending on what you want to do): to query your database using PowerPivot.
Although using a relational DB would be preferable in many ways, if you don't have any formulas then filtering your data (1 million rows by 80 columns) in Excel will be reasonably fast (under 1 or 2 seconds depending on the sort of filtering, probably faster than an unindexed DB table), assuming you have enough RAM. If you do have formulas, you will probably need to be in Manual calculation mode to avoid the filtering process triggering multiple recalculations.
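On the "filter an array in code" part of the question, the two-stage filter (keep rows matching one column's value, then the top 5% by another) is cheap to do in memory in any language. Here is an illustrative sketch in Java with hypothetical field names, just to show the shape of the algorithm:

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class TwoStageFilter {
        // Hypothetical row type standing in for one worksheet row.
        static class DataRow {
            final String category;   // the column filtered on first
            final double score;      // the column used for the top-5% cut
            DataRow(String category, double score) {
                this.category = category;
                this.score = score;
            }
        }

        // Keep rows whose category matches, then the top 5% of those by score.
        static List<DataRow> filter(List<DataRow> rows, String wanted) {
            List<DataRow> matched = rows.stream()
                    .filter(r -> wanted.equals(r.category))
                    .sorted(Comparator.comparingDouble((DataRow r) -> r.score).reversed())
                    .collect(Collectors.toList());
            int keep = Math.max(1, (int) Math.ceil(matched.size() * 0.05));
            return matched.subList(0, Math.min(keep, matched.size()));
        }
    }

The same two steps translate directly to a VBA array: filter into a scratch array, sort it by the second column, then take the first 5% of the rows.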

Can you nest Excel data tables?

I have an Excel workbook that utilises a data table (A).
I now want to create another data table (B) that effectively sits on top of the other data table. That is, each "iteration" of B calls A.
This approach fails, although I cannot find any documentation about data tables indicating that it shouldn't work.
Basically I'd like to know if anyone has tried this before and whether I am missing something?
Is there a workaround? Do you know of any documentation that spells out whether and why this is not supported?
No.
I tried this at length some years ago in both xl03 and xl07, and my conclusion was that it can't be done: each data table seems to be an independent one-off run, and they don't talk to each other if you try to link them.
I couldn't find any documentation on this issue either, whether about the mechanism or from anyone else looking at a similar problem.
I want to share my experience using data tables.
We found a workaround for this problem.
Suppose you have two input variables A and B that need to run through a data table and produce one or more results.
What we've done is:
Enumerate every combination of A and B (in our case, binary combinations) and assign an id to each combination (e.g. A=0 & B=0 => id=1).
You then run a single data table with a length of A*B rows.
The drawback is the time it takes to calculate all that data (7 minutes with 25 data tables, plus 2 data tables of 8,000 rows each).
Hope it helps!
