It seems that recent versions of PrimeFaces added new types for the DataExporter (see ExporterType in the current PrimeFaces 6.2 docs).
I can't find anything on the web regarding the new Apache POI XLSX and XLSXSTREAM types. Can somebody explain the difference between the two? Is one more efficient than the other? Are there limitations on the size of the exported data?
I can explain. They both produce exactly the same XLSX files, which is the Office Open XML format of Excel files, using Apache POI. There is no limit on size that I am aware of.
The big difference is how they get created.
XLSX - builds the whole document in memory.
XLSXSTREAM - processes rows and lets them be garbage collected as it goes, so it's memory efficient.
From the POI Docs:
SXSSF (package: org.apache.poi.xssf.streaming) is an API-compatible streaming extension of XSSF to be used when very large spreadsheets have to be produced, and heap space is limited. SXSSF achieves its low memory footprint by limiting access to the rows that are within a sliding window.
Basically, if you don't care about your server resources, use XLSX; if you have many users downloading Excel files and JVM memory is important to you, use XLSXSTREAM.
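To make the difference concrete, here is a minimal sketch in plain Apache POI (not the PrimeFaces exporter itself) of the two workbook types these exporter modes correspond to; the file names, row count, and 100-row window are illustrative values:

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileOutputStream;

public class ExporterTypesSketch {
    public static void main(String[] args) throws Exception {
        // XLSX-style export: the whole workbook lives on the heap until it is written.
        try (XSSFWorkbook inMemory = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream("in-memory.xlsx")) {
            fill(inMemory.createSheet("data"));
            inMemory.write(out);
        }

        // XLSXSTREAM-style export: only a sliding window of rows (here 100) is kept
        // on the heap; older rows are flushed to a temp file as new ones are created.
        SXSSFWorkbook streaming = new SXSSFWorkbook(100);
        try (FileOutputStream out = new FileOutputStream("streaming.xlsx")) {
            fill(streaming.createSheet("data"));
            streaming.write(out);
        } finally {
            streaming.dispose(); // delete the temporary files backing the stream
        }
    }

    private static void fill(Sheet sheet) {
        for (int r = 0; r < 10_000; r++) {
            Row row = sheet.createRow(r);
            row.createCell(0).setCellValue("row " + r);
        }
    }
}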
Question Purpose
Sorting parquet files provides a number of benefits:
more efficient filtering using file metadata
more efficient compression rate
There may be other benefits as well; there is a lot of discussion about this on the Internet. For this reason, this question is not about why to sort. Rather, it is about how to sort, which the available Internet links explain only minimally (roughly 30% of the way), while the challenges of sorting the data are not mentioned at all. The purpose of this question is to get help from everyone who is expert and experienced in this field and to determine the best method (based on cost and benefit) for sorting.
A brief explanation of the Apache parquet library
Before discussing Spark, I will explain the tool used to produce parquet files. The parquet-mr library (I use Java as an example, but this can probably be extended to other languages) writes to disk and memory at the same time when we create a parquet file. It also has a method called getDataSize() that returns the exact final size of the file after it is completely closed on disk, so we can use it to achieve the following two conditions when writing parquet files:
Do not produce parquet files with a small size (which is not good for query engines)
Produce all parquet files with a certain minimum size or a fixed size (for example, 1 GB per file)
Since this library writes to disk and memory at the same time, it does not allow the data to be sorted unless all of it is first sorted in memory and then handed to the library, which is not possible with large volumes of data. We also implicitly assume that the data is being generated as a stream that we intend to store. (With a fixed data set, the problem stated in this question would be meaningless, because the whole data could be sorted once and the problem would be over. But we assume there is a flow of data, in which case it is important to have an optimal way to sort it.)
One advantage of the Apache parquet library mentioned above is that we can fix the exact size of the output parquet file. In my opinion this is an advantage: for example, if I know that the Hadoop block size is 128 MB and the parquet row-group size is 128 MB, I can fix the parquet file size at 1 GB. Then I know that every parquet file will consist of 8 blocks, HDFS storage will be used best, and all parquet files will be the same size. (In HDFS, when the block size is 128 MB, a smaller file still takes up the same amount of space.) This may not be an advantage for everyone, and I would be happy for experienced people to critique it if needed.
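For example, here is a rough sketch of how getDataSize() could drive size-controlled writing with parquet-mr; the schema, file naming, and 1 GB threshold are made up for illustration, and the check is approximate because the exact size is only known after close():

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

/** Illustrative only: rolls to a new parquet file once the writer reports roughly 1 GB of data. */
public class RollingParquetWriter {
    private static final long TARGET_BYTES = 1L << 30; // ~1 GB target, adjust as needed
    private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
            "message event { required int64 ts; required binary payload (UTF8); }");

    private final SimpleGroupFactory groups = new SimpleGroupFactory(SCHEMA);
    private ParquetWriter<Group> writer;
    private int fileIndex = 0;

    public void write(long ts, String payload) throws Exception {
        if (writer == null) {
            writer = ExampleParquetWriter.builder(new Path("part-" + fileIndex++ + ".parquet"))
                    .withType(SCHEMA)
                    .build();
        }
        writer.write(groups.newGroup().append("ts", ts).append("payload", payload));
        // getDataSize() is an estimate of bytes written plus bytes buffered; the exact size
        // is only known after close(), so rolled files land near the target, not exactly on it.
        if (writer.getDataSize() >= TARGET_BYTES) {
            writer.close();
            writer = null;
        }
    }

    public void close() throws Exception {
        if (writer != null) writer.close();
    }
}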
Parquet File Sorting Challenges
One point before we start: we are looking for permanent data sorting, because the sorted data will be used by the next thousands of queries. The descriptions above have already identified some of the challenges of sorting, but I will describe all of the challenges below:
The parquet tools themselves do not sort the data. So one way is to keep all the data in memory and, after sorting it, give it to the parquet library to be written to the parquet file. This method has two drawbacks: 1) it is not possible to keep all the data in memory; 2) because all the data is in memory, the size of the parquet file is not known before writing and may end up less than or more than 1 GB (or any other target), so the advantage of a fixed parquet file size is lost.
Suppose we do this sorting in a separate batch process instead of in real time on the stream. If we want to use the parquet library, we still have the problem of bringing all the data into memory for sorting, which is not possible. So let's say we use a tool like Spark for sorting. A specific cost we pay here is that cluster resources are used for sorting, and in practice each record is written twice (once when the parquet file is first written and once for the sorting). The next point is that even if we ignore these two issues, after sorting the data, and depending on the other columns in the parquet file, the compression ratio for that particular column and for the whole data may change, increasing or decreasing. For this reason, after the sorted parquet file is written, small files may be created or the fixed size (for example, 1 GB) may change. Unfortunately, Spark does not provide a way to control the output file size (it may not be possible in practice), so if we want to restore the fixed file size we may need methods such as the one in the linked question, which are not free (they cause the file to be written several more times, on top of the cluster resources consumed, and the exact file size will still not be fixed): How do you control the size of the output file?
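For reference, this is roughly what the Spark route looks like in the Java API; the paths, column name, and partition count are illustrative assumptions, and as noted above Spark offers no direct control over the resulting file sizes:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class SortParquetJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("sort-parquet").getOrCreate();

        Dataset<Row> input = spark.read().parquet("hdfs:///data/unsorted/"); // hypothetical path

        // Range-partition by the sort key so each output file covers a contiguous key range,
        // then sort inside each partition; this is the usual "globally sorted output" pattern.
        input.repartitionByRange(64, col("sortKey"))   // 64 partitions is a guess, not a rule
             .sortWithinPartitions("sortKey")
             .write()
             .mode("overwrite")
             .parquet("hdfs:///data/sorted/");

        spark.stop();
    }
}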
Maybe there is no other way and the only options are the ones mentioned above. In that case, I would be happy for experts to state this explicitly so that others know there is currently no other way.
Challenges In Summary
In general, we observed two types of problems in these solutions:
How to sort at a reasonable cost and in a reasonable time (on streaming data)
How to keep the size of the parquet files fixed
For this reason, although it is said everywhere that sorting is very good (and surveys, both on the Internet and my own, show that it really is useful), its methods and challenges are hardly mentioned anywhere. I ask experienced and expert friends in this field to help me in this direction (hoping it will help others as well), and if any approaches or points are missing from this explanation, please mention them.
Sorry if there are typos in some parts due to my weak English. Thanks.
I am using BIRT runtime 4.8.0 in a Java project for generating Excel reports. The Excel report has 1k columns and can have 10k to 50k rows (the result of one query, mapped to one table). I am using the spudsoft ExcelEmitter for rendering the static Excel reports.
Data source: Impala jdbc connection, using 1 dataset with 1 query
The issue is that it takes 6 to 7 GB of Java heap space just to render 10k rows in this report, because it loads everything into memory and then writes it to the file.
Is there any way to reduce the memory footprint (predictable heap space usage, preferably under 3 GB) while rendering the Excel sheets (options like pagination of query results, rendering the file in parts, etc.)?
I solved it with a newer version of the spudsoft emitter, which switches Apache POI from XSSF to SXSSF:
ExcelEmitter.ExtractMode
Experimental feature! When set to true, the emitter should run faster for XLSX files, but with a limited feature set:
Images will be omitted.
Merged cells are not allowed.
Structured headers and footers are not supported. See ExcelEmitter.StructuredHeader.
https://www.eclipse.org/forums/index.php/m/1804253/#msg_1804253
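For completeness, here is a hedged sketch of passing that option when rendering programmatically, based on the standard BIRT render-option API; the exact handling of the key is described in the spudsoft documentation linked above, and the engine/design objects are assumed to come from your existing setup:

import org.eclipse.birt.report.engine.api.IReportEngine;
import org.eclipse.birt.report.engine.api.IReportRunnable;
import org.eclipse.birt.report.engine.api.IRunAndRenderTask;
import org.eclipse.birt.report.engine.api.RenderOption;

public class ExtractModeExample {
    // engine and design are assumed to come from your existing BIRT setup
    static void render(IReportEngine engine, IReportRunnable design) throws Exception {
        RenderOption options = new RenderOption();
        options.setOutputFormat("xlsx");                              // spudsoft emitter output format
        options.setOutputFileName("report.xlsx");
        options.setOption("ExcelEmitter.ExtractMode", Boolean.TRUE);  // option key from the spudsoft docs

        IRunAndRenderTask task = engine.createRunAndRenderTask(design);
        task.setRenderOption(options);
        task.run();
        task.close();
    }
}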
Use the code below to set the limit to 15K. This resolved my problem.
reportContext.getAppContext().put("MAX_PAGE_BREAK_INTERVAL", 15000);
I was able to generate Excel files for a large dataset (on the order of 50k rows and 1k columns) by using the Apache POI streaming APIs directly. The Aspose APIs are another good tool for doing this.
Using the POI streaming APIs, you can render an Excel file with on the order of 50k rows and 1k columns in about a minute or two, with under 2 GB of peak RAM usage.
So if you extend the spudsoft Excel emitter to use the POI streaming APIs, it can be handled with BIRT as well.
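As a rough illustration of that direct POI streaming usage for a result set of this shape, here is a sketch; the window size, cell values, and file name are placeholders, and in practice the inner loop would iterate your JDBC ResultSet:

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

import java.io.FileOutputStream;

public class LargeReportWriter {
    public static void main(String[] args) throws Exception {
        SXSSFWorkbook wb = new SXSSFWorkbook(200);   // keep ~200 rows on the heap at a time
        wb.setCompressTempFiles(true);               // gzip the temp files POI spills to disk
        SXSSFSheet sheet = wb.createSheet("report");

        for (int r = 0; r < 50_000; r++) {           // in practice: iterate the JDBC ResultSet
            Row row = sheet.createRow(r);
            for (int c = 0; c < 1_000; c++) {
                row.createCell(c).setCellValue("r" + r + "c" + c);
            }
        }

        try (FileOutputStream out = new FileOutputStream("report.xlsx")) {
            wb.write(out);
        } finally {
            wb.dispose();                            // remove the temporary spill files
        }
    }
}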
A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it, but Excel can't seem to manage this.
Please how do I open this file?
Clearly I am missing a trick!
Please help!
There are a number of other Stackoverflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with that sort of size. The best solution I've found is BigQuery on Google Cloud Platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is that you'll have to learn SQL to do even the simplest things with the data.
Can you not tell Excel to only "open" the first 10 lines of the file? This would allow you to inspect the format and then use some database functions on the contents.
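If you go that route, here is a small sketch (Java, with placeholder file names) that copies just the first lines into a file Excel can open comfortably:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvPreview {
    public static void main(String[] args) throws Exception {
        int linesToKeep = 10_000; // header plus enough rows to inspect the format
        try (BufferedReader in = Files.newBufferedReader(Paths.get("huge.csv"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("preview.csv"))) {
            String line;
            int count = 0;
            while (count < linesToKeep && (line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
                count++;
            }
        }
    }
}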
Another thing that can affect whether you can open a large Excel file is the resources and capacity of the computer. That's a huge file, and you need a lot of on-disk swap space (the page file, in Windows terms) plus memory to open a file of that size. So one thing you can do is find another computer with more memory and resources, or increase the swap space on your computer. If you are on Windows, just google how to increase your page file.
This is a common problem. The typical solutions are
Insert your .csv file into a SQL database such as MySQL or PostgreSQL.
Process your data using Python or R.
Find a data hub for your data. For example, Acho Studio.
The problem with the first solution is that you'll have to design a table schema and find a server to host the database. You also need to write server-side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data will put a lot of stress on your local computer. A data hub is much easier, but its costs may vary.
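To give an idea of option 1, here is a hedged sketch that streams the CSV into a SQLite table with JDBC batch inserts; the table layout, column count, and naive comma split are assumptions, and a real CSV parser plus a server database such as MySQL or PostgreSQL would follow the same pattern:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvToDatabase {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:big.db");
             BufferedReader in = Files.newBufferedReader(Paths.get("huge.csv"))) {
            conn.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS rows (col1 TEXT, col2 TEXT, col3 TEXT)");
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO rows VALUES (?, ?, ?)")) {
                String line = in.readLine();              // skip the header row
                int batched = 0;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",", -1);     // naive split; use a CSV parser for real data
                    for (int i = 0; i < 3; i++) ps.setString(i + 1, i < f.length ? f[i] : null);
                    ps.addBatch();
                    if (++batched % 10_000 == 0) ps.executeBatch(); // flush in chunks
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}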
I have a use case where I need to process a huge Excel file in a fraction of a second, which was not possible. Hence, I wish to store selected information from the Excel file in memory so that my application can read it from memory instead of loading the Excel file every time. By the way, I am using Groovy to develop the application. My questions are as follows:
What is an in-memory data structure? How can I use one in Groovy?
What happens when multiple processes running on different nodes want to access the in-memory data structure?
Any pointer/link will be very helpful.
Just use Apache POI; it will load the workbook into memory (example).
They will each need to load a copy, or you will need to do something clever.
See above
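To expand the first point a little, here is a minimal sketch of loading selected cells once into a plain in-memory map with Apache POI (shown in Java, which Groovy can call directly); the file, sheet index, and key/value columns are assumptions:

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class ExcelCache {
    // Load the sheet once at startup; afterwards reads hit the map, not the file.
    public static Map<String, Double> load(File xlsx) throws Exception {
        Map<String, Double> cache = new HashMap<>();
        try (Workbook wb = WorkbookFactory.create(xlsx)) {
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                if (row.getCell(0) == null || row.getCell(1) == null) continue;
                cache.put(row.getCell(0).getStringCellValue(),   // column A: key (assumed text)
                          row.getCell(1).getNumericCellValue()); // column B: value (assumed numeric)
            }
        }
        return cache;
    }
}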
I am working on a project within Excel and am starting to prepare my document for future performance-related problems. The Excel file contains large amounts of data and large numbers of images, which are all in sets, i.e. 40 images belong to one function of the program, another 50 belong to another, etc., and only one set is used at a time.
This file is only going to get bigger as the number of jobs/functions it has to handle increases. Now, I could just make multiple Excel files and let the user choose which one is appropriate for the job, but it has been requested that this all be done from one file.
Bearing this in mind, I started thinking about ways to build such a large file while keeping its performance high, and had an idea which I am not sure is possible. The idea is to have multiple protected workbooks, each containing the information for one job "set", and a main workbook which accesses these files depending on the user's inputs. This results in many Excel files which take time to download initially, but in use it should eliminate the performance issues, since the computer only has to access a subset of these files.
From what I understand, this is sort of like what DLLs are for, but I am not sure whether the same can be done in Excel and, if it can, whether the performance increase would be significant.
If anyone has any other suggestions or elegant solutions on how this can be done, please let me know.
Rather than saving data such as images in the Excel file itself, write your macro to load the appropriate images from files, and have your users select which routine to run. This way, you load only the files you need. If your data is text/numbers, you can store it in a CSV or, if the data gets very large, use a Microsoft Access database and retrieve the data using the ADODB library.
Inserting Images: How to insert a picture into Excel at a specified cell position with VBA
More on using ADODB: http://msdn.microsoft.com/en-us/library/windows/desktop/ms677497%28v=vs.85%29.aspx