How to get parquet file schema in Node JS AWS Lambda?

Is there any way to read a parquet file schema from Node.JS?
If yes, how?
I saw that there is a library, parquetjs, but from its documentation it looks like it can only read and write the contents of the file.

After some investigation, I found that parquetjs-lite can do this. It does not read the whole file, just the footer, and then extracts the schema from it.
It works with a cursor, and from what I saw there are two s3.getObject calls: one for the size and one for the actual data.
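For reference, here is a minimal sketch of that approach. It assumes parquetjs-lite's ParquetReader.openS3 helper and an AWS SDK v2 S3 client; the bucket and key are placeholders, so double-check the exact API against the version you install.
// Sketch: read only the footer of a Parquet file on S3 and log its schema.
// Assumes parquetjs-lite exposes ParquetReader.openS3 and a schema property;
// bucket and key below are placeholders.
const AWS = require('aws-sdk');
const parquet = require('parquetjs-lite');

exports.handler = async () => {
  const s3 = new AWS.S3();
  const reader = await parquet.ParquetReader.openS3(s3, {
    Bucket: 'my-bucket',          // placeholder
    Key: 'path/to/file.parquet',  // placeholder
  });
  try {
    // reader.schema is the schema reconstructed from the footer;
    // its fields map column names to their Parquet types.
    console.log(JSON.stringify(reader.schema.fields, null, 2));
    return Object.keys(reader.schema.fields);
  } finally {
    await reader.close();
  }
};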

Related

Avro append a record with non-existent schema and save as an avro file?

I have just started using Avro and I'm using fastavro library in Python.
I prepared a schema and saved data with this one.
Now, I need to append new data (a JSON response from an API call) and save it with a non-existent schema to the same Avro file.
How shall I proceed to add the JSON response with no predefined schema and save it to the same Avro file?
Thanks in advance.
Avro files, by definition, already have a schema within them.
You could read that schema first and then continue to append data, or you can read the entire file into memory, append your data, and then overwrite the file.
Either option requires you to convert the JSON into Avro (or at least a Python dict), though.

Converting 2TB of gzipped multiline JSONs to NDJSONs

For my research I have a dataset of about 20,000 gzipped multiline JSON files (~2TB, all with the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam, I'm convinced that the first step would be to convert this dataset to NDJSON. Most books and tutorials assume you are working with some newline-delimited file.
What is the best way to go about converting this data?
I've tried just launching a large instance on gcloud and using gunzip and jq to do this. Not surprisingly, it seems that this will take a long time.
Thanks in advance for any help!
Apache Beam supports decompressing files if you use TextIO.
But the delimiter remains the newline.
For multiline JSON, you can read each complete file in parallel, convert the JSON string to a POJO, and then reshuffle the data to utilize parallelism.
So the steps would be:
Get the file list > Read individual files > Parse file content to JSON objects > Reshuffle > ...
You can get the file list with FileSystems.match("gs://my_bucket").metadata().
Read individual files with Compression.detect(fileResourceId.getFilename()).readDecompressed(FileSystems.open(fileResourceId))
Converting to NDJSON is not necessary if you use sc.wholeTextFiles. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the filename and ._2 is the content of the file.
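Whichever engine you pick, the per-file transformation is the same: decompress, parse, and emit one JSON object per line. A rough Node sketch of that step, assuming each .gz file holds a single JSON array that fits in memory (the file names are placeholders):
// Rough sketch: convert one gzipped JSON-array file to NDJSON.
// Assumes each .gz holds a single JSON array that fits in memory;
// input/output paths are placeholders.
const fs = require('fs');
const zlib = require('zlib');

async function toNdjson(inPath, outPath) {
  const chunks = [];
  await new Promise((resolve, reject) => {
    fs.createReadStream(inPath)
      .pipe(zlib.createGunzip())
      .on('data', (c) => chunks.push(c))
      .on('end', resolve)
      .on('error', reject);
  });
  const records = JSON.parse(Buffer.concat(chunks).toString('utf8'));
  // One JSON document per line = NDJSON.
  await fs.promises.writeFile(outPath, records.map((r) => JSON.stringify(r)).join('\n') + '\n');
}

toNdjson('input-0001.json.gz', 'output-0001.ndjson').catch(console.error);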

use readFileStream to read a changing file

I have an application that streams data to a file, can I use Node.js to read the file while it's being streamed to?
I tried using createReadStream, but it only read one chunk and then the stream ended.
You could try watching for file changes with fs.watchFile(filename[, options], listener) or node-watch. On each file change you could just read the last lines with read-last-lines, although I'm not sure how efficient that would be.
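As a rough sketch of that watch-then-read idea using only core fs (a ranged createReadStream in place of read-last-lines; the file name is a placeholder):
// Sketch: follow a file that another process keeps appending to.
// On each size change, read only the newly written bytes.
const fs = require('fs');

const FILE = 'stream.log'; // placeholder: the file being written to

fs.watchFile(FILE, { interval: 500 }, (curr, prev) => {
  if (curr.size <= prev.size) return; // nothing new (or the file was truncated)
  fs.createReadStream(FILE, { start: prev.size, end: curr.size - 1 })
    .setEncoding('utf8')
    .on('data', (chunk) => process.stdout.write(chunk));
});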

Parquet file format on S3: which is the actual Parquet file?

Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame to S3:
myDF.write.mode(SaveMode.Overwrite)
.parquet("s3n://com.example.mybucket/mydata.parquet")
When I go to com.example.mybucket on S3 I actually see a directory called "mydata.parquet", as well as a file called "mydata.parquet_$folder$"!!! If I go into the mydata.parquet directory I see two files under it:
_SUCCESS; and
part-<big-UUID>.snappy.parquet
Whereas I was just expecting to see a single file called mydata.parquet living in the root of the bucket.
Is something wrong here (if so, what?!?) or is this expected with the Parquet file format? If it's expected, which is the actual Parquet file that I should read from:
mydata.parquet directory?; or
mydata.parquet_$folder$ file?; or
mydata.parquet/part-<big-UUID>.snappy.parquet?
Thanks!
The mydata.parquet/part-<big-UUID>.snappy.parquet is the actual parquet data file. However, often tools like Spark break data sets into multiple part files, and expect to be pointed to a directory that contains multiple files. The _SUCCESS file is a simple flag indicating that the write operation has completed.
According to the API, saving the Parquet file writes it inside the folder you provide. _SUCCESS is an indication that the process completed successfully.
S3 creates those $folder$ entries when you commit writes directly to S3: the output is written to temporary folders and then copied to the final destination inside S3, because S3 has no concept of rename.
Look at s3-dist-cp and also the DirectCommitter if performance is an issue.
The $folder$ marker is used by s3n/Amazon's EMRFS to indicate an "empty directory"; ignore it.
The _SUCCESS file is, as the others note, a 0-byte file; ignore it.
All other .parquet files in the directory are the output; the number you end up with depends on the number of tasks executed on the input.
When Spark uses a directory (tree) as a source of data, all files beginning with _ or . are ignored; s3n will strip out those $folder$ entries too. So if you use the path for a new query, it will only pick up that Parquet file.

Can AWS Lambda write CSV to response?

Like the question says, I would like to know if it is possible to return the response of a Lambda function in CSV format. I already know that it is possible to write JSON objects as such, but for my current project, CSV format is necessary. I have only seen discussion of writing CSV files to S3, but that is not what we need for this project.
This is an example of what I would like to have displayed in a response:
year,month,day,hour
2017,10,11,00
2017,10,11,01
2017,10,11,02
2017,10,11,03
2017,10,11,04
2017,10,11,05
2017,10,11,06
2017,10,11,07
2017,10,11,08
2017,10,11,09
Thanks!
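For what it's worth, with an API Gateway proxy integration a Node.js Lambda can return CSV directly by setting the Content-Type header. A minimal sketch using the rows above (the proxy integration setup is assumed):
// Sketch: Node.js Lambda behind an API Gateway proxy integration
// returning CSV in the response body (assumes proxy integration is configured).
exports.handler = async () => {
  const rows = [
    ['year', 'month', 'day', 'hour'],
    ['2017', '10', '11', '00'],
    ['2017', '10', '11', '01'],
    ['2017', '10', '11', '02'],
  ];
  const csv = rows.map((r) => r.join(',')).join('\n');

  return {
    statusCode: 200,
    headers: { 'Content-Type': 'text/csv' },
    body: csv,
  };
};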
