We are trying to implement badRecordsPath when reading events from an Event Hub. As an example, to try to get it working, I have put in a schema that should make the events fail:
eventStreamDF = (spark.readStream
    .format("eventhubs")
    .options(**eventHubsConf)
    .option("badRecordsPath", "/tmp/badRecordsPath/test1")
    .schema(badSchema)
    .load()
)
Yet this never fails and always reads the events. Is this the expected behaviour of readStream for Event Hubs on Databricks? Currently the workaround is to check the inferred schema against our own schema.
The schema of the data in Event Hubs is fixed (see the docs), and the same is true for Kafka: the actual payload is always encoded as a binary field named body, and it's up to the developer to decode that binary payload according to the "contract" between the producer(s) of the data and its consumers. So even if you specify a schema and the badRecordsPath option, they aren't used.
You will need to implement a function that decodes the data from JSON (or whatever format is used) and returns null if the data is broken, and then filter on null values to split the stream into two substreams, one for good and one for bad data, as sketched below.
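A minimal Scala sketch of that split, assuming a streaming DataFrame named eventStreamDF with the Event Hubs body column and a hypothetical two-field payload schema; from_json returns null when the payload cannot be parsed as JSON matching the schema:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Hypothetical payload schema -- replace with the real contract between producer and consumers.
val payloadSchema = StructType(Seq(
  StructField("deviceId", StringType),
  StructField("temperature", DoubleType)
))

// Decode the binary `body` column; from_json yields null for payloads it cannot parse.
val decoded = eventStreamDF
  .withColumn("json", col("body").cast("string"))
  .withColumn("payload", from_json(col("json"), payloadSchema))

val goodData = decoded.filter(col("payload").isNotNull)
val badData  = decoded.filter(col("payload").isNull)   // route these to your own "bad records" sink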
Related
What should be done in a Spark Structured Streaming job so that it can read a multi-event Kafka topic?
I am trying to read a topic that has multiple types of events, and every event may have a different schema. How does a streaming job determine the type of an event, or which schema to use for it?
Kafka DataFrames always carry the key and value as raw bytes. You use a UDF (or built-in functions) to deserialize the key/value columns.
For example, assuming the data is JSON, you first cast the bytes to a string, then you can use get_json_object to extract/filter specific fields.
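A short Scala sketch of that approach, assuming the Kafka source DataFrame is called kafkaDF and that the payload has a top-level type field (both names are illustrative assumptions):

import org.apache.spark.sql.functions._

// value holds the raw Kafka bytes; cast to string, then pull out fields with get_json_object.
val events = kafkaDF
  .withColumn("json", col("value").cast("string"))
  .withColumn("eventType", get_json_object(col("json"), "$.type"))

val clicks = events.filter(col("eventType") === "click")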
If the data is in another format, you could use Kafka record headers added by the producer (you'll need to add those in your own producer code) to designate what event type each record is, then filter on those and add logic for processing the different sub-dataframes (see the sketch below). Alternatively, you could wrap the binary data in a more consistent envelope such as the CloudEvents spec, which includes a type field and nested binary content that then needs further deserialization.
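A sketch of the header-based routing, assuming a Spark 3.x Kafka source with includeHeaders enabled and a hypothetical eventType header set by the producer; broker address, topic, and event type names are placeholders:

import org.apache.spark.sql.functions._

val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker address
  .option("subscribe", "multi-event-topic")           // placeholder topic name
  .option("includeHeaders", "true")
  .load()

// headers is an array<struct<key,value>>; pull out the (hypothetical) eventType header.
val typed = kafkaDF.withColumn(
  "eventType",
  expr("cast(filter(headers, h -> h.key = 'eventType')[0].value as string)")
)

// Route each event type to its own parsing logic.
val orders   = typed.filter(col("eventType") === "order")
val payments = typed.filter(col("eventType") === "payment")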
We have a system where we just need the Avro raw bytes, without the single object encoding header, and also not the Avro container format. Just the bytes please. We pass the identifier of the schema to the deserializer via another path.
I can't find the correct call in the Avro Rust sources or documentation
Any hints?
It turns out to be pretty trivial with avro_rs:
to_avro_datum(schema, avro_record)
As said above, one needs to identify the writer schema, which in the above link was done via an HTTP header. Internally we use both the Confluent wire format and single object encoding to keep the data together with the schema.
Because with this code it's not possible: the Writer constructor automatically serialises the schema into the buffer. The only way I can think of is to write our own W that automatically drops everything we don't need.
I need to achieve the following, and am having difficulty coming up with an approach to accomplish it due to my inexperience with Spark:
Read data from .json.gz files stored in S3.
Each file includes a partial day of Google Analytics data with the schema as specified in https://support.google.com/analytics/answer/3437719?hl=en.
File names are in the pattern ga_sessions_20170101_Part000000000000_TX.json.gz where 20170101 is a YYYYMMDD date specification and 000000000000 is an incremental counter when there are multiple files for a single day (which is usually the case).
An entire day of data is therefore composed of multiple files with incremental "part numbers".
There are generally 3 to 5 files per day.
All fields in the JSON files are stored with quote (") delimiters, regardless of the data type specified in the aforementioned schema documentation. The data frame that results from reading the files (via sqlContext.read.json) therefore has every field typed as string, even though some are actually integer, boolean, or other data types.
Convert the all-string data frame to a properly typed data frame according to the schema specification.
My goal is to have the data frame typed properly so that when it is saved in Parquet format the data types are correct.
Not all fields in the schema specification are present in every input file, or even every day's worth of input files (the schema may have changed over time). The conversion will therefore need to be dynamic, converting the data types of only the fields actually present in the data frame.
Write the data in the properly typed data frame back to S3 in Parquet format.
The data should be partitioned by day, with each partition stored in a separate folder named "partition_date=YYYYMMDD" where "YYYYMMDD" is the actual date associated with the data (from the original input file names).
I don't think the number of files per day matters. The goal is simply to have partitioned Parquet format data that I can point Spectrum at.
I have been able to read and write the data successfully, but have been unsuccessful with several aspects of the overall task:
I don't know how to approach the problem to ensure that I'm effectively utilizing the AWS EMR cluster to its full potential for parallel/distributed processing, either in reading, converting, or writing the data. I would like to size up the cluster as needed to accomplish the task within whatever time frame I choose (within reason).
I don't know how to best accomplish the data type conversion. Not knowing which fields will or will not be present in any particular batch of input files requires dynamic code to retype the data frame. I also want to make sure this task is distributed effectively and isn't done inefficiently (I'm concerned about creating a new data frame as each field is retyped).
I don't understand how to manage partitioning of the data appropriately.
Any help working through an overall approach would be greatly appreciated!
If your input JSONs have a fixed schema, you can specify the DataFrame schema manually, marking fields as optional (nullable). Refer to the official guide.
If all the values are wrapped in "", you can read them as strings and later cast them to the required types.
I don't know how to approach the problem to ensure that I'm effectively...
Use the DataFrame API to read the input; most likely the defaults will be good for this task. If you hit a performance issue, attach the Spark job timeline.
I don't know how to best accomplish the data type conversion...
Use the column.cast(DataType) method.
For example, say you have two JSON records:
{"foo":"firstVal"}
{"foo":"val","bar":"1"}
If you want to read 'foo' as a string and 'bar' as an integer, you can write something like this:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import session.implicits._   // enables the 'colName symbol syntax

val schema = StructType(
  StructField("foo", StringType, true) ::
  StructField("bar", StringType, true) :: Nil
)

val df = session.read
  .format("json")
  .option("path", s"${yourPath}")
  .schema(schema)
  .load()

val withCast = df.select('foo, 'bar cast IntegerType)
withCast.show()
withCast.write.format("parquet").save(s"${outputPath}")
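Going back to the question's other requirements, casting only the fields that are actually present and partitioning by day, here is a hedged sketch; gaDF, the field-to-type map, and the file-name regex are assumptions based on the question, not a tested implementation:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// gaDF: the all-string DataFrame read from the ga_sessions_*.json.gz files (placeholder name).
// Hypothetical mapping of known schema fields to their target types -- extend as needed.
val targetTypes: Map[String, DataType] = Map(
  "visitNumber" -> LongType,
  "visitId"     -> LongType
)

// Cast only the columns that are actually present in this batch's DataFrame.
val typed = targetTypes.foldLeft(gaDF) { case (acc, (name, dataType)) =>
  if (acc.columns.contains(name)) acc.withColumn(name, col(name).cast(dataType)) else acc
}

// Derive partition_date from the source file name, e.g. ga_sessions_20170101_Part000000000000_TX.json.gz
// (the regex is an assumption based on the naming pattern described in the question).
val withDate = typed.withColumn(
  "partition_date",
  regexp_extract(input_file_name(), "ga_sessions_(\\d{8})_", 1)
)

// partitionBy writes each day into a folder named partition_date=YYYYMMDD under the output path.
withDate.write.mode("append").partitionBy("partition_date").parquet(s"${outputPath}")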
I have a case where Kafka producers send data twice a day. These producers read all the data from the database/files and send it to Kafka, so the messages sent each day are mostly duplicates of the previous run. I need to deduplicate the messages and write them to persistent storage using Spark Streaming. What is the best way to remove the duplicate messages in this case?
Each duplicate message is a JSON string in which only the timestamp field is updated.
Note: I can't change the Kafka producer to send only the new data/messages; it's already installed on the client machine and was written by someone else.
For deduplication, you need to store information somewhere about what was already processed (for example, the unique IDs of messages).
To store this information you can use:
Spark checkpoints. Pros: available out of the box. Cons: if you update the source code of the app, you need to clean the checkpoints, and as a result you will lose that information. This solution can work if the requirements for deduplication are not strict.
Any database. For example, if you are running in a Hadoop environment, you can use HBase. For every message you do a 'get' (to check that it wasn't processed before), and mark it in the DB as sent once it has really been sent, as sketched below.
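A minimal Scala sketch of that check with the HBase client API, assuming a hypothetical processed_messages table and that each message carries a unique ID:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical table keyed by the message's unique ID.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("processed_messages"))

// 'get' before processing: has this message been seen already?
def isDuplicate(messageId: String): Boolean =
  table.exists(new Get(Bytes.toBytes(messageId)))

// Mark the message as sent only after it has really been written downstream.
def markProcessed(messageId: String): Unit =
  table.put(new Put(Bytes.toBytes(messageId))
    .addColumn(Bytes.toBytes("meta"), Bytes.toBytes("sent"), Bytes.toBytes("1")))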
You can change the topic configuration to use log compaction. With compaction, a record with the same key will be overwritten/updated in the Kafka log, so you only get the latest value for a key from Kafka.
You can read more about compaction here.
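If you manage topic settings from code, here is a sketch of switching a topic to compaction with the Kafka AdminClient (broker address and topic name are placeholders); remember that compaction only deduplicates records that share the same key:

import java.util.{Arrays, HashMap, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
val admin = AdminClient.create(props)

// Set cleanup.policy=compact so Kafka keeps only the latest record per key.
val resource = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic")
val op = new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET)

val configs = new HashMap[ConfigResource, java.util.Collection[AlterConfigOp]]()
configs.put(resource, Arrays.asList(op))
admin.incrementalAlterConfigs(configs).all().get()
admin.close()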
You could try to use mapWithState. Check my answer.
A much simpler approach would be to solve this on the Kafka side. Have a look at Kafka's log compaction feature. It will deduplicate the records for you, provided the records have the same unique key.
https://kafka.apache.org/documentation/#compaction
You can use a key-value datastore where the key is a combination of the fields excluding the timestamp field, and the value is the actual JSON.
As you poll the records, create the key-value pair and write it to a datastore that handles UPSERTs (insert + update), or check whether the key already exists in the datastore and drop the message if it does:
if (Datastore.get(key) != null) {
    // already seen -- drop the message
} else {
    // new record -- write it to the datastore
    Datastore.put(key, value)
}
I suggest you look at HBase (which handles UPSERTs) and Redis (an in-memory KV datastore often used for lookups).
Have you looked into this:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
You can try using the dropDuplicates() method.
If you have more than one column that needs to be used to determine duplicates, you can use dropDuplicates(String[] colNames) to pass them.
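A minimal Structured Streaming sketch, assuming the stream has already been parsed into columns; parsedStream, eventTime, and recordId are placeholder names for your parsed stream, timestamp column, and business key:

// Keep 24 hours of state for deduplication; tune the watermark to your duplicate window.
val deduped = parsedStream
  .withWatermark("eventTime", "24 hours")
  .dropDuplicates("recordId")               // or dropDuplicates("fieldA", "fieldB", ...)

deduped.writeStream
  .format("parquet")
  .option("path", "/data/deduped")                     // placeholder output path
  .option("checkpointLocation", "/data/checkpoints")   // placeholder checkpoint path
  .start()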
I am building a telemetry processing pipeline for our project. I have Avro-encoded data coming in, and with the use of the Schema Registry I am decoding the Avro data as GenericRecord based on the schema ID. I am planning to run Spark jobs for further downstream processing. But what is the best way to handle the data model in Spark jobs? All examples point to using result.get("fieldname"), but is that the suggested way?
The benefit of using GenericRecord is that it abstracts the Schema Registry-related details away from the consumer, so you don't have to fetch the schema ID from the payload, make a GET call to the Confluent Schema Registry to get the Avro schema, and then do the deserialization yourself. I don't know of any performance impact because of this, but would surely love to know if there is any.
On the other hand, if you wish to use your own Avro byte-array serializer/deserializer, you need some knowledge of the structure of the Avro payload. E.g. you have to parse the payload to validate the magic byte, extract the 4-byte schema ID, fetch the schema, and so on. You might also want to implement an in-memory cache of already retrieved schemas, because it's a good idea to reduce the number of HTTP calls to the Schema Registry. More details on this can be found here.
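A hedged Scala sketch of that manual path, assuming the Confluent wire format (one magic byte followed by a 4-byte schema ID); fetchSchema is a placeholder for whatever Schema Registry client you use:

import java.nio.ByteBuffer
import scala.collection.concurrent.TrieMap
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Small in-memory cache of schemas by ID to avoid repeated HTTP calls to the registry.
val schemaCache = TrieMap.empty[Int, Schema]

def deserialize(payload: Array[Byte], fetchSchema: Int => Schema): GenericRecord = {
  val buffer = ByteBuffer.wrap(payload)
  require(buffer.get() == 0, "Unknown magic byte")   // Confluent wire format starts with magic byte 0
  val schemaId = buffer.getInt()                     // followed by a 4-byte schema ID
  val schema = schemaCache.getOrElseUpdate(schemaId, fetchSchema(schemaId))
  val reader = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get()
    .binaryDecoder(payload, 5, payload.length - 5, null)   // the Avro datum starts after the 5-byte header
  reader.read(null, decoder)
}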