Is PDI inefficient at writing Excel xlsx files with the Microsoft Excel Writer step?
An Excel file produced by a Pentaho transformation seems to be about three times the size of the same data transformed manually. Is this inefficiency expected, or is there a workaround for it?
A CSV file of the same transformed output is far smaller. Have I configured something wrong?
xlsx files should normally be smaller than the equivalent CSV, since they consist of XML data compressed inside a ZIP container. Pentaho's Microsoft Excel Writer uses org.apache.poi.xssf.streaming.SXSSFWorkbook and org.apache.poi.xssf.usermodel.XSSFWorkbook to write xlsx files, and both produce compressed files, so compression itself should not be your issue.
To investigate, inspect the file with a zip utility and look at the entry sizes and compression ratios, which may reveal a bug. You could also open the file in Excel and re-save it; if that produces a noticeably smaller file, it points to an inefficiency in the writer.
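As a quick check, a short Python script can list the entries inside the xlsx container and their uncompressed versus compressed sizes (a minimal sketch; the file name is a placeholder):

import zipfile

# "output.xlsx" is a placeholder for the file produced by the transformation
with zipfile.ZipFile("output.xlsx") as zf:
    for info in zf.infolist():
        ratio = info.compress_size / info.file_size if info.file_size else 0
        print(f"{info.filename}: {info.file_size} -> {info.compress_size} bytes ({ratio:.0%})")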
For a project I need to frequently, but non-periodically, append about a thousand or more data files (tabular data) to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the result file to extract subsets of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of a single CSV/Parquet file, or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of very different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file; sometimes the data is so big that several files are appended.
And because input files may be appended many times by different users and from different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV/Parquet files and noticed that for each append, a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark that contains all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the result as a new file doesn't seem very efficient.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to a single file. They append "into a directory" and create new files, yes.
From Spark, you can use coalesce and repartition to create larger write batches.
As you mentioned, you want Parquet, so write that; it will also give you smaller file sizes in HDFS.
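A minimal sketch of that idea in PySpark, assuming a SparkSession named spark; the paths and the target partition count are placeholders you would tune to your desired file size:

# Read the accumulated small part-files back, compact them, and rewrite as Parquet.
df = spark.read.csv('/user/applepy/pyspark_partition/uuid.csv')
df.coalesce(8).write.mode('overwrite').parquet('/user/applepy/pyspark_partition/uuid_compacted.parquet')

# For later batches, repartition before appending so each append adds only a few
# reasonably sized files instead of one file per source partition.
new_batch.repartition(1).write.mode('append').parquet('/user/applepy/pyspark_partition/uuid_compacted.parquet')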
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest/ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.
I have a .csv file with around 2 million rows, and I want to add a new column. The problem is, I could only manage to do that by losing a lot of data (basically everything above ~1.1 million rows). When I used a connection to the external file (so that I could read all rows) and made changes to it in Power Query, the changes were not saved back to the .csv file.
You can apply one of several solutions:
Using a text editor that can handle huge files, split the csv file into smaller chunks. Apply the modifications to each chunk, then join the chunks again to get the desired file.
Write a "small" program yourself that reads the csv line by line, applies the modification, and writes the resulting data to a second file (see the sketch after this list).
Maybe some other software can handle a csv of that size. Or patch LibreOffice for this purpose, to handle 2,000,000+ lines - the source code is available :)
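A minimal Python sketch of the second option; the file names and compute_new_value() are placeholders for whatever your actual modification is:

import csv

def compute_new_value(row):
    # Derive the new column from the existing fields of one row (placeholder logic).
    return row[0].upper()

with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    writer.writerow(header + ["new_column"])
    for row in reader:
        writer.writerow(row + [compute_new_value(row)])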
We have a csv file that is being used like a database, and an ETL script that takes input Excel files and transforms them into the same format to append to the csv file.
The script reads the csv file into a dataframe and appends the new input dataframe to the end, and then uses to_csv to overwrite the old csv file.
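For reference, the pattern looks roughly like this (an illustrative sketch with placeholder file names, not the actual script):

import pandas as pd

# Read the existing "database" csv, append the newly transformed rows, and overwrite it.
existing = pd.read_csv("database.csv")
new_rows = pd.read_excel("input.xlsx")  # assumed to already match the csv's columns
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_csv("database.csv", index=False)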
The problem is that since we updated to a new version of Python (downloaded with Anaconda), the output csv file grows larger and larger every time we append data to it. The more lines the original csv has when it is read into the script (and then written out with the newly appended data), the more the output file size is inflated. The actual number of rows and the data in the csv files are fine; it's just the file size itself that is unusually large.
Does anyone know if updating to a new version of Python could have broken this process?
Is Python storing data in the csv file that we cannot see?
Any ideas or help is appreciated! Thank you.
I am trying to generate very large Microsoft Excel files in a browser application. While there are JavaScript libraries that allow me to generate XLSX files from the browser, the issue with them is that they require all of the document contents to be loaded in memory before writing them, which puts an upper bound on how much I can store in a single file before the browser crashes. Thus I would like to have a write stream that allows me to write data sequentially into an Excel file using something like StreamSaver.js.
Doing such a thing with CSV would be trivial:
// Write each row as one CSV line into the output stream
for (let i = 0; i < paginatedRequest.length; i++) {
  writer.write(paginatedRequest[i].join(",") + "\n");
}
The approach above would allow me to write an extremely large number of CSV rows to an output stream without having to store all of the data in memory. My question is: is this technically feasible to do with an XLSX file?
My main concern here is that internally XLSX files are ZIP archives, so my first idea was to use an uncompressed ZIP archive and stream writes to it, but every file inside a ZIP archive comes with a header which indicates its size and I can't possibly know that beforehand. Is there a workaround that I could possibly use for this?
Lastly, if that is not possible, are there any other streamable spreadsheet formats which can be opened in Excel and "look nice"? (There is a flat OpenDocument specification with the .fods extension, so I could stream writes to such a file. Sadly, Microsoft Office does not support flat OpenDocument files.)
A possible solution would be to generate a small, static XLSX file which imports an external CSV file using Excel's Data Model. Since generating a streaming CSV file is almost trivial, that could be a feasible solution. However, it's somewhat unsatisfactory:
It's rather annoying to have the user download two files instead of one (or a compressed file that they'd need to uncompress).
Excel does not support relative paths to external CSV files, so we'd also need a macro to update the path every time the file is opened (if this is feasible at all). This requires the user to accept the use of macros, which comes with a security warning and is not terribly nice for them.
I have a Stack Overflow data dump file in .xml format, nearly 27 GB, and I want to convert it to a .csv file. Can somebody please suggest a tool or a Python program to convert the xml to csv?
Use one of the Python xml modules to parse the .xml file. Unless you have much more than 27 GB of RAM, you will need to do this incrementally, so limit your choices accordingly. Use the csv module to write the .csv file.
Your real problem is this: csv files are lines of fields; they represent a rectangular table. Xml files, in general, can represent more complex structures: hierarchical databases and/or multiple tables. So your real problem is to understand the data dump format well enough to extract records to write to the .csv file.
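If the dump follows the usual Stack Exchange layout of one <row .../> element per record (as in Posts.xml), a minimal incremental sketch using xml.etree.ElementTree and csv could look like this; the file names and column names are placeholders:

import csv
import xml.etree.ElementTree as ET

columns = ["Id", "CreationDate", "Score"]  # pick the attributes you need

with open("Posts.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(columns)
    # iterparse streams the file element by element instead of loading 27 GB at once
    for event, elem in ET.iterparse("Posts.xml", events=("end",)):
        if elem.tag == "row":
            writer.writerow([elem.get(c, "") for c in columns])
            elem.clear()  # release memory for the processed element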
I have written a PySpark function to parse the .xml into .csv. XmltoCsv_StackExchange is the GitHub repo. I used it to convert 1 GB of xml within 2-3 minutes on a minimal 2-core, 2 GB RAM Spark setup. It can convert a 27 GB file too; just increase minPartitions from 4 to around 128 in this line:
raw = (sc.textFile(fileName, 4))  # 4 = minPartitions; increase to ~128 for a 27 GB file