Read avro file with bytes schema in spark - apache-spark

I am trying to read some Avro files into a Spark DataFrame and have the below situation:
The avro file schema is defined as
Schema(
    org.apache.avro.Schema.create(org.apache.avro.Schema.Type.BYTES),
    "ByteBlob", "1.0");
The file contains a nested JSON structure stored as a plain bytes value in the Avro file.
I can't seem to find a way to read this into a DataFrame in Spark. Any pointers on how I can read files like these?
Output from avro-tools:
hadoop jar avro-tools/avro-tools-1.10.2.jar getmeta /projects/syslog_paranoids/encrypted/dhr/complete/visibility/zeeklog/202207251345/1.0/202207251351/stg-prd-dhrb-edg-003.data.ne1.yahoo.com_1658690707314_zeeklog_1.0_202207251349_202207251349_6c64f2210c568092c1892d60b19aef36.6.avro
avro.schema "bytes"
avro.codec deflate
The tojson function within avro-tools is able to read the file properly and returns the JSON contained in the file.
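One possible approach, sketched below and not verified: if the spark-avro reader accepts a top-level "bytes" schema (an assumption), each record should surface as a single binary column that can be cast to a string and parsed as JSON. The package version, paths, and column handling are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.appName("read-bytes-avro").getOrCreate()

# Assumes the spark-avro package is on the classpath, e.g.
# --packages org.apache.spark:spark-avro_2.12:3.3.0 (hypothetical version)
raw = spark.read.format("avro").load("/path/to/bytes-avro/*.avro")

# With a top-level bytes schema the file should load as a single binary column;
# cast it to string so each row is the embedded JSON document.
json_strings = raw.select(col(raw.columns[0]).cast("string").alias("json_str"))

# Let Spark infer the JSON structure, then expand it into proper columns.
inferred = spark.read.json(json_strings.rdd.map(lambda r: r.json_str)).schema
parsed = json_strings.select(from_json("json_str", inferred).alias("data")).select("data.*")
parsed.printSchema()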

Related

How to store a schema in a file, and in which file format, for Databricks Auto Loader?

I am using Databricks Auto Loader. Here, the table schema will be dynamic for the incoming data. I have to store the schema in some file and read it in Auto Loader during readStream.
How can I store the schema in a file, and in which format?
Can the file be read using the schema option or the "cloudFiles.schemaLocation" option?
spark.readStream.format("cloudFiles") \
    .schema("<schema>") \
    .option("cloudFiles.schemaLocation", "<path_to_checkpoint>") \
    .option("cloudFiles.format", "parquet") \
    .load("<path_to_source_data>")
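For illustration, here is a minimal sketch of one way to persist a schema and feed it back to the stream. The file paths and the existing DataFrame df are hypothetical, and it assumes storing the StructType as JSON text is acceptable for your setup.

import json
from pyspark.sql.types import StructType

# Persist the schema of an existing DataFrame (df is hypothetical) as JSON text.
with open("/dbfs/schemas/my_table_schema.json", "w") as f:
    f.write(df.schema.json())

# Later, load it back and pass it to the Auto Loader stream via .schema(...).
with open("/dbfs/schemas/my_table_schema.json") as f:
    stored_schema = StructType.fromJson(json.load(f))

stream = (
    spark.readStream.format("cloudFiles")
    .schema(stored_schema)
    .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
    .option("cloudFiles.format", "parquet")
    .load("<path_to_source_data>")
)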

Avro append a record with non-existent schema and save as an avro file?

I have just started using Avro and I'm using the fastavro library in Python.
I prepared a schema and saved data with it.
Now, I need to append new data (a JSON response from an API call) and save it with a non-existent schema to the same Avro file.
How shall I proceed to add the JSON response with no predefined schema and save it to the same Avro file?
Thanks in advance.
Avro files, by definition, already have a schema within them.
You could read that schema first and then continue to append data, or you could read the entire file into memory, append your data, and overwrite the file.
Either option requires you to convert the JSON into Avro (or at least a Python dict), though.
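As a rough sketch of the first option (read the embedded schema, then append records that fit it); the file name and api_response_text are placeholders, and the append-by-opening-in-'a+b' pattern follows fastavro's documented usage as I understand it:

import json
from fastavro import reader, writer, parse_schema

# Read the schema that is already embedded in the existing Avro file.
with open("data.avro", "rb") as fo:
    existing_schema = reader(fo).writer_schema

# The new JSON payload must fit that schema; otherwise the whole file
# has to be rewritten with a merged schema.
new_record = json.loads(api_response_text)  # api_response_text is hypothetical

# Append by opening the file in 'a+b' mode and writing with the same schema.
with open("data.avro", "a+b") as fo:
    writer(fo, parse_schema(existing_schema), [new_record])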

How to read sample records from a parquet file in S3?

I have 100s of parquet files in S3 and I want to check whether all of them were created properly, i.e. the downstream system should be able to read these parquet files without any issue. Before the downstream system reads these files, I want my Python script to read a sample of 10 records from each parquet file.
I am using the below syntax to read the parquet file:
import io
import boto3
import pandas as pd

# get_object is a client call, not a resource call
s3 = boto3.client('s3')
result = s3.get_object(Bucket="my bucket", Key="my file location")

# Parquet is binary, so keep the body as bytes instead of decoding it to text
buffer = io.BytesIO(result["Body"].read())
df = pd.read_parquet(buffer)
I need your input on how to read a sample of records, not all the records, from each parquet file. Thank you.
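One way to do this (a sketch, assuming pyarrow is available; the bucket/key names and the helper function are hypothetical) is to read only the first batch of each file instead of the whole table:

import io
import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def sample_records(bucket, key, n=10):
    # Download the object and read only the first n rows of the parquet file.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    pf = pq.ParquetFile(io.BytesIO(body))
    batch = next(pf.iter_batches(batch_size=n))
    return batch.to_pandas()

# Hypothetical usage over a list of keys:
# for key in keys:
#     print(sample_records("my bucket", key))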

Using PySpark, how to convert a text file to a CSV file

I am a new learner of PySpark. I got a requirement in my project to read a JSON file with a schema and convert it to a CSV file.
Can someone help me with how to proceed with this request using PySpark?
You can load the JSON with SparkSession and write it out as CSV from the resulting DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("ETL").getOrCreate()
df = spark.read.json("path-to-txt")
df.write.csv("path-to-csv")
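Since the question mentions reading the JSON with a schema, a small sketch with an explicit schema (column names here are made up) could look like this:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical fields; replace with the actual structure of your JSON.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.read.schema(schema).json("path-to-txt")
df.write.csv("path-to-csv")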

Spark SQL : Handling schema evolution

I want to read 2 Avro files of the same data set, but with schema evolution:
first Avro file schema: {String, String, Int}
second Avro file schema (after evolution): {String, String, Long}
(the Int field has evolved to Long)
I want to read these two Avro files into a DataFrame using Spark SQL.
To read Avro files I am using Databricks' 'spark-avro':
https://github.com/databricks/spark-avro
How can I do this efficiently?
Spark version: 2.0.1
Scala: 2.11.8
PS: In this example I have mentioned only 2 files, but in the actual scenario files are generated daily, so there are more than 1000 such files.
Thank you in advance :)
Would using a union like
{string, string, [int, long]}
be a valid solution for you? It should allow reading both new and old files.
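If rewriting the files with a union schema is not an option, another workaround (a sketch only; column names and paths are hypothetical) is to read the old and new files separately, promote the evolved column to long, and union the two DataFrames:

old_df = spark.read.format("com.databricks.spark.avro").load("/path/old/*.avro")
new_df = spark.read.format("com.databricks.spark.avro").load("/path/new/*.avro")

# Promote the int column to long so both schemas line up, then union.
combined = old_df.withColumn("col3", old_df["col3"].cast("long")).union(new_df)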
