Avro append a record with non-existent schema and save as an avro file? - python-3.x

I have just started using Avro and I'm using the fastavro library in Python.
I prepared a schema and saved data with it.
Now I need to append new data (a JSON response from an API call) that has no predefined schema to the same Avro file.
How should I proceed to add the JSON response without a predefined schema and save it to the same Avro file?
Thanks in advance.

Avro files, by definition, already have a schema embedded in them.
You could read that schema first and then continue to append data, or you could read the entire file into memory, append your data, and overwrite the file.
Either option requires you to convert the JSON into Avro (or at least a Python dict), though.
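A minimal sketch of the first option using fastavro, assuming the existing file is named "data.avro" (a hypothetical name) and the new records can be shaped to fit the schema already stored in the file:
from fastavro import reader, writer

# Read the writer schema back out of the existing Avro file.
with open("data.avro", "rb") as fo:
    existing_schema = reader(fo).writer_schema

# The parsed JSON response must become a list of dicts that conform to
# existing_schema; the record below is only a placeholder.
new_records = [{"field": "value"}]

# Opening in "a+b" mode makes fastavro append blocks to the existing file
# instead of writing a new header.
with open("data.avro", "a+b") as fo:
    writer(fo, existing_schema, new_records)
If the response genuinely cannot fit the stored schema, the file has to be rewritten with a new (for example, evolved or union) schema, because a single Avro object container file carries exactly one writer schema.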

Related

How to store a schema in file and in which file format for databricks autoloader?

I am using Databricks Auto Loader. Here, the table schema will be dynamic for the incoming data. I have to store the schema in some file and read it in Auto Loader during readStream.
How can I store the schema in a file, and in which format?
Can the file be read using the schema option or the "cloudFiles.schemaLocation" option?
spark.readStream.format("cloudFiles")
    .schema("<schema>")
    .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
    .option("cloudFiles.format", "parquet")
    .load("<path_to_source_data>")
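A minimal sketch of one possible approach, assuming the schema is captured once as JSON from an existing DataFrame (existing_df, the schema file path, and spark as the active session are all assumptions here); note that cloudFiles.schemaLocation is used by Auto Loader for schema inference and evolution tracking, which is separate from supplying a schema via .schema():
import json
from pyspark.sql.types import StructType

# One-time step: persist the schema of a known-good DataFrame as JSON.
with open("/dbfs/schemas/source_schema.json", "w") as f:
    f.write(existing_df.schema.json())

# In the streaming job: rebuild a StructType from the stored JSON and
# pass it to .schema().
with open("/dbfs/schemas/source_schema.json") as f:
    stored_schema = StructType.fromJson(json.load(f))

stream = (spark.readStream.format("cloudFiles")
    .schema(stored_schema)
    .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
    .option("cloudFiles.format", "parquet")
    .load("<path_to_source_data>"))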

Read avro file with bytes schema in spark

I am trying to read some Avro files into a Spark DataFrame and have the below situation:
The avro file schema is defined as
Schema(
    org.apache.avro.Schema.create(org.apache.avro.Schema.Type.BYTES),
    "ByteBlob", "1.0");
The file has a nested JSON structure stored against a simple bytes schema in the Avro file.
I can't seem to find a way to read this into a DataFrame in Spark. Any pointers on how I can read files like these?
Output from avro-tools:
hadoop jar avro-tools/avro-tools-1.10.2.jar getmeta /projects/syslog_paranoids/encrypted/dhr/complete/visibility/zeeklog/202207251345/1.0/202207251351/stg-prd-dhrb-edg-003.data.ne1.yahoo.com_1658690707314_zeeklog_1.0_202207251349_202207251349_6c64f2210c568092c1892d60b19aef36.6.avro
avro.schema "bytes"
avro.codec deflate
The tojson function within avro-tools is able to read the file properly and returns the JSON contained in the file.
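A minimal sketch of one way this could be attempted in PySpark, assuming the spark-avro package is on the classpath and that it exposes the top-level bytes value as a single column (the file path is a placeholder, and the column name is taken from df.columns[0] rather than assumed):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, schema_of_json

spark = SparkSession.builder.appName("bytes-avro").getOrCreate()

# Load the Avro file; each row should hold the raw bytes value.
df = spark.read.format("avro").load("/path/to/bytes_schema_file.avro")
bytes_col = df.columns[0]

# Cast the bytes to a string and infer the JSON structure from a sample
# row; an explicit StructType could be supplied instead of schema_of_json.
sample = df.select(col(bytes_col).cast("string").alias("raw")).first()["raw"]
parsed = df.select(from_json(col(bytes_col).cast("string"), schema_of_json(sample)).alias("data"))
parsed.select("data.*").show()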

How to get parquet file schema in Node JS AWS Lambda?

Is there any way to read a parquet file schema from Node.JS?
If yes, how?
I saw that there is a library, parquetjs, but from its documentation it appears it can only read and write the contents of the file.
After some investigation, I've found that parquetjs-lite can do that. It does not read the whole file, just the footer, and then it extracts the schema from it.
It works with a cursor and, from what I saw, there are two s3.getObject calls: one for the size and one for the actual data.

How to save files in same directory using saveAsNewAPIHadoopFile spark scala

I am using Spark Streaming and I want to save each batch of the stream locally in Avro format. I have used saveAsNewAPIHadoopFile to save data in Avro format. This works well, but it overwrites the existing file: the next batch's data overwrites the old data. Is there any way to save the Avro files for each batch into a common directory? I tried adding some Hadoop job configuration properties to add a prefix to the file name, but none of the properties worked.
dstream.foreachRDD { rdd =>
  rdd.saveAsNewAPIHadoopFile(
    path,
    classOf[AvroKey[T]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[T]],
    job.getConfiguration()
  )
}
Try this -
You can split your process into two steps:
Step-01: Write the Avro file using saveAsNewAPIHadoopFile to <temp-path>
Step-02: Move the file from <temp-path> to <actual-target-path>
This will solve your problem for now. I will share my thoughts if I manage to fulfil this scenario in one step instead of two.
Hope this is helpful.

Using Pyspark how to convert Text file to CSV file

I am a new learner of PySpark. I have a requirement in my project to read a JSON file with a schema and convert it to a CSV file.
Can someone help me with how to proceed with this request using PySpark?
You can load the JSON and write the CSV with a SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("ETL").getOrCreate()
df = spark.read.json("path-to-txt")
df.write.csv("path-to-csv")
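If the JSON needs to be read with an explicit schema, as the question mentions, a StructType can be supplied before reading; the field names below are placeholders, not taken from the question:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local").appName("ETL").getOrCreate()

# Placeholder schema; replace the fields with the real ones.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.read.schema(schema).json("path-to-txt")
df.write.csv("path-to-csv", header=True)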
