How to convert xml file of stack overflow dump to csv file - python-3.x

I have a Stack Overflow data dump file in .xml format, nearly 27 GB, and I want to convert it to a .csv file. Can somebody suggest a tool or Python program to convert the XML to CSV?

Use one of the Python XML modules to parse the .xml file. Unless you have much more than 27 GB of RAM, you will need to do this incrementally, so limit your choices accordingly. Use the csv module to write the .csv file.
Your real problem is this: CSV files are lines of fields; they represent a rectangular table. XML files, in general, can represent more complex structures: hierarchical databases and/or multiple tables. So your real problem is to understand the data dump format well enough to extract records to write to the .csv file.
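As a minimal sketch of the incremental approach, assuming a dump file like Posts.xml where each record is a self-closing <row .../> element with its fields as attributes (the function name and field list here are hypothetical):

```python
import csv
import xml.etree.ElementTree as ET

def xml_rows_to_csv(xml_path, csv_path, fields):
    """Stream <row .../> elements from a Stack Overflow dump into a CSV.

    iterparse yields elements as they are completed, so only one
    record needs to be in memory at a time. `fields` is the list of
    row attributes to keep as CSV columns.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(fields)
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "row":
                writer.writerow([elem.get(f, "") for f in fields])
            elem.clear()  # release the element's contents to bound memory
```

Dump files such as Posts.xml have a single root element (e.g. <posts>) wrapping millions of <row> records, so filtering on the "row" tag skips the root's own end event.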

I have written a PySpark function to convert the .xml to .csv. XmltoCsv_StackExchange is the GitHub repo. I used it to convert 1 GB of XML within 2-3 minutes on a minimal 2-core, 2 GB RAM Spark setup. It can convert a 27 GB file too; just increase minPartitions from 4 to around 128 in this line:
raw = (sc.textFile(fileName, 4))
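The repo's actual parser isn't shown here, but because each dump record sits on its own line, a map function over those lines only needs to pull out the attributes. A hypothetical, plain-Python version of that per-line extraction could look like:

```python
import re

# name="value" pairs inside a <row ... /> line
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_row_line(line):
    """Extract attribute name/value pairs from one '<row ... />' dump line.

    Returns a dict, or None for lines that are not row records (the XML
    declaration and the opening/closing root tags). Note that real dumps
    XML-escape attribute values, so a production version should also
    unescape them (e.g. with xml.sax.saxutils.unescape).
    """
    line = line.strip()
    if not line.startswith("<row"):
        return None
    return dict(ATTR_RE.findall(line))
```

In Spark this function would be applied with `raw.map(parse_row_line)` followed by a filter dropping the None results.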

Related

Continuous appending of data on existing tabular data file (CSV, parquet) using PySpark

For a project I need to frequently, but non-periodically, append about one thousand or more tabular data files to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the result file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended in one new file. Sometimes the data is so big that several files are appended.
And because input files may be appended many times by different users and from different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV/Parquet files and noticed that for each append, a new file was generated. For example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the data to a new file doesn't seem very efficient.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you mentioned, you wanted Parquet, so write that instead. That will give you even smaller file sizes in HDFS.
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest/ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.
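In Spark, the compaction suggested above is just `df.coalesce(n).write...` before writing. The effect, fewer but larger files so the NameNode tracks fewer objects, can be illustrated outside Spark with a hypothetical stdlib-only sketch that merges many small part files into one:

```python
import glob
import os

def compact_part_files(part_dir, out_path, header=None):
    """Merge many small CSV part files into one larger file.

    Mirrors what coalesce/repartition achieve in Spark: the same rows
    end up in far fewer files, easing NameNode memory pressure.
    """
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_path, "w", encoding="utf-8") as out:
        if header:
            out.write(header + "\n")
        for p in parts:
            with open(p, encoding="utf-8") as f:
                out.write(f.read())
    return len(parts)
```

A periodic compaction job like this (run natively in Spark, not with local file I/O) is a common complement to increasing the NameNode heap.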

Spark output JSON vs Parquet file size discrepancy

New Spark user here. I wasn't able to find any information about file size comparison between JSON and Parquet output of the same DataFrame via Spark.
Testing with a very small data set for now, doing a df.toJSON().collect() and then writing to disk creates a 15 kb file, but doing a df.write.parquet creates 105 files at around 1.1 kb each. Why is the total file size so much larger with Parquet in this case than with JSON?
Thanks in advance
What you're doing with df.toJSON().collect() is getting a single JSON string from all your data (15 kb in your case) and saving that to disk; this is not scalable for the situations where you'd want to use Spark in any way.
For saving Parquet you are using Spark's built-in function, and it seems that for some reason you have 105 partitions (probably the result of the manipulations you did), so you get 105 files. Each of these files has the overhead of the file structure and probably stores only 0, 1 or 2 records. If you want to save a single file you should coalesce(1) before you save (again, this is just for the toy example you have), so you'd get one file. Note that it still might be larger due to the file format overhead (i.e. the overhead might still be larger than the compression benefit).
Conan, it is very hard to answer your question precisely without knowing the nature of the data (you don't even tell us the number of rows in your DataFrame). But let me speculate.
First: text files containing JSON usually take more space on disk than Parquet, at least when one stores millions to billions of rows. The reason is that Parquet is a highly optimized column-based storage format which uses a binary encoding to store your data.
Second: I would guess that you have a very small DataFrame with 105 partitions (and probably 105 rows). When you store something that small, the disk footprint should not bother you, but if it does, you need to be aware that each Parquet file has a relatively sizeable header describing the data you store.
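The per-file overhead argument can be demonstrated without Parquet at all: any container format with fixed per-file metadata makes many tiny files larger in total than one file holding the same rows. A sketch using gzip as a stand-in (the row contents are made up for illustration):

```python
import gzip

# 105 tiny records, echoing the 105 near-empty Parquet part files
rows = [("%d,value-%d" % (i, i)).encode() for i in range(105)]

# All rows in one container: header/footer overhead is paid once,
# and the compressor sees all the redundancy at once.
one_file = len(gzip.compress(b"\n".join(rows)))

# One row per container: the fixed overhead is paid 105 times.
many_files = sum(len(gzip.compress(r)) for r in rows)

assert many_files > one_file
```

The same reasoning applies to Parquet, whose per-file header and footer metadata are considerably larger than gzip's, which is why coalescing before writing helps so much for small data.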

Create recordio file for linear regression

I'm using AWS SageMaker to run linear regression on a CSV dataset. I have made some tests, and with my sample dataset, which is 10% of the full dataset, the CSV file ends up at 1.5 GB in size.
Now I want to run the full dataset, but I'm facing issues with the 15 GB file. When I compress the file with Gzip, it ends up at only 20 MB. However, SageMaker only supports Gzip on "Protobuf-Recordio" files. I know I can make RecordIO files with im2rec, but that seems to be intended for image files for image classification. I'm also not sure how to generate the protobuf file.
To make things even worse(?) :) I'm generating the dataset in Node.
I would be very grateful to get some pointers in the right direction how to do this.
This link https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html has useful information if you are willing to use a Python script to transform your data.
The actual code from the SDK is https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/amazon/common.py
Basically, you could load your CSV data into an NDArray (in batches, so that you can write to multiple files), and then use the common.py module linked above to convert it to RecordIO-protobuf. You should be able to write the buffer containing the RecordIO-protobuf data into a file.
Thanks

Streaming writes to a Microsoft Excel file

I am trying to generate very large Microsoft Excel files in a browser application. While there are JavaScript libraries which allow me to generate XLSX files from the browser, the issue with them is that they require all of the document contents to be loaded in memory before writing them, which puts an upper bound on how much I can store in a single file before the browser crashes. Thus I would like to have a write stream that allows me to write data sequentially into an Excel file using something like StreamSaver.js.
Doing such a thing with CSV would be trivial:
for (let i = 0; i < paginatedRequest.length; i++) {
  writer.write(paginatedRequest[i].join(",") + "\n");
}
The approach above would allow me to write an extremely large number of CSV rows to an output stream without having to store all of the data in memory. My question is: is this technically feasible to do with an XLSX file?
My main concern here is that internally XLSX files are ZIP archives, so my first idea was to use an uncompressed ZIP archive and stream writes to it, but every file inside a ZIP archive comes with a header which indicates its size and I can't possibly know that beforehand. Is there a workaround that I could possibly use for this?
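There is a workaround in the ZIP format itself: a local file header can set general-purpose flag bit 3, in which case the sizes and CRC follow the entry's data in a trailing "data descriptor", so a writer never needs to know an entry's size up front. This is what streaming ZIP writers rely on. A sketch of streaming worksheet rows into a ZIP entry, shown in Python since the browser-side equivalent depends on which JS ZIP library you pick (note a complete xlsx also needs the other package parts, such as [Content_Types].xml and xl/workbook.xml):

```python
import io
import zipfile

buf = io.BytesIO()  # stands in for the output stream
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    # The entry's size is not known when writing starts; zipfile
    # fixes up the header afterwards on seekable streams, or emits
    # a data descriptor on non-seekable ones.
    with zf.open("xl/worksheets/sheet1.xml", "w", force_zip64=True) as sheet:
        sheet.write(b"<worksheet><sheetData>")
        for r in range(1, 4):  # rows arrive one at a time
            cell = '<row r="%d"><c t="inlineStr"><is><t>row %d</t></is></c></row>' % (r, r)
            sheet.write(cell.encode())
        sheet.write(b"</sheetData></worksheet>")
```

force_zip64 is set because a very large sheet can exceed the 4 GB limits of the classic ZIP fields, and the decision must be made before the data is written.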
Lastly, if not possible, are there any other streamable spreadsheet formats which can be opened in Excel and "look nice"? (There is a flat OpenDocument specification with the .fods extension, so I could stream writes to such a file. Sadly, Microsoft Office does not support flat OpenDocument files.)
A possible solution would be to generate a small, static XLSX file which imports an external CSV file using Excel's Data Model. Since generating a streaming CSV file is almost trivial, that could be a feasible solution. However, it's somewhat unsatisfactory:
It's rather annoying to have the user download two files instead of one (or a compressed file that they'd need to uncompress).
Excel does not support relative paths to external CSV files, so we'd also need a macro to ensure that the path is updated every time the file is opened (if this is feasible at all). This requires the user to accept the usage of macros, which comes with a security warning and is not terribly nice for them.

Pentaho Data Integration - Excel Writer Output File Size

Is PDI inefficient in terms of writing Excel xlsx files with the Microsoft Excel Writer?
A transformed Excel data file output by Pentaho seems to be about three times the size it would be if the data were transformed manually. Is this inefficiency expected, or is there a workaround for it?
A CSV file of the same transformed output is much smaller. Have I configured something wrong?
xlsx files should normally be smaller in size than CSV, since they consist of XML data compressed in ZIP files. Pentaho's Microsoft Excel Writer uses org.apache.poi.xssf.streaming.SXSSFWorkbook and org.apache.poi.xssf.usermodel.XSSFWorkbook to write xlsx files, and they create compressed files so this should not be your issue.
You could inspect the files with a zip utility to see the entry sizes and compression rates, to check whether there is a bug. You could also try opening the file in Excel and re-saving it; if that produces a smaller file, it could indicate an inefficiency in the writer.
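Since an xlsx file is a ZIP archive, that inspection can be scripted. A small sketch (the function name is hypothetical) that reports each entry's uncompressed size, compressed size and compression ratio:

```python
import zipfile

def zip_entry_report(path_or_file):
    """List (name, uncompressed, compressed, ratio) for each entry
    of a ZIP archive such as an xlsx file, to spot entries that are
    stored inefficiently or left uncompressed."""
    report = []
    with zipfile.ZipFile(path_or_file) as zf:
        for info in zf.infolist():
            ratio = info.compress_size / info.file_size if info.file_size else 0.0
            report.append((info.filename, info.file_size, info.compress_size, round(ratio, 2)))
    return report
```

Running this on the PDI output and on the manually saved file would show whether the size difference comes from poor compression or simply from more verbose sheet XML.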
