I am trying to access the JSON data from tweets in my Kafka topic. When creating the schema in Spark Structured Streaming, is it necessary to explicitly specify each and every key from the Twitter API? Can I not access only the fields I want to analyse, such as the text field alone?
While recommended, the schema is optional. You should be able to do something like this:

import org.apache.spark.sql.functions.{col, get_json_object}

kafkaDf
  .select(col("value").cast("string").as("value"))            // Kafka value bytes -> JSON string
  .select(get_json_object(col("value"), "$.text").as("text")) // pull out only the text field
https://spark.apache.org/docs/latest/api/sql/index.html#get_json_object
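If you prefer a typed column over pulling strings out one field at a time, from_json also accepts a partial schema and simply ignores any keys you leave out. A minimal sketch, assuming the same kafkaDf as above (the field names are illustrative):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Declare only the fields you care about; the rest of the tweet JSON is ignored
val tweetSchema = new StructType()
  .add("text", StringType)
  .add("lang", StringType)

kafkaDf
  .select(col("value").cast("string").as("value"))
  .select(from_json(col("value"), tweetSchema).as("tweet"))
  .select(col("tweet.text"), col("tweet.lang"))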
Related
The application is streaming from a Kafka topic that receives messages with different structures. The only way to know which structure a message has is to use the message key. Is there a way to apply a schema based on the message type, for example by using a function in the from_json call and passing the key into that function?
I am implementing some Spark Structured Streaming transformations from a Parquet data source. In order to read the data into a streaming DataFrame, one has to specify the schema (it cannot be automatically inferred). The schema is really complex, and manually writing the schema code would be a very tedious task.
Can you suggest a workaround? Currently I create a batch DataFrame beforehand (using the same data source), let Spark infer the schema, save the schema to a Scala object, and use it as the input for the Structured Streaming reader.
I don't think this is a reliable or well-performing solution. Please suggest how to generate the schema code automatically, or how to persist the schema in a file and reuse it.
From the docs:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
You could also open a shell, read one of the parquet files with automatic schema inference enabled, and save the schema to JSON for later reuse. You only have to do this one time, so it might be faster / more efficient than doing the similar-sounding workaround you're using now.
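A rough sketch of that one-time step (the paths are placeholders, and a SparkSession named spark is assumed): infer the schema from a batch read, persist it with schema.json, and rebuild it with DataType.fromJson when starting the stream.

import org.apache.spark.sql.types.{DataType, StructType}
import java.nio.file.{Files, Paths}

// One-off: let a batch read infer the schema, then persist it as JSON
val inferred = spark.read.parquet("/data/events/").schema
Files.write(Paths.get("/tmp/events-schema.json"), inferred.json.getBytes("UTF-8"))

// In the streaming job: load the JSON back and reuse it
val schemaJson = new String(Files.readAllBytes(Paths.get("/tmp/events-schema.json")), "UTF-8")
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

val streamDf = spark.readStream.schema(schema).parquet("/data/events/")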
Suppose we subscribe to 2 topics in the stream, one carrying Avro and the other plain strings. Is it possible to deserialize dynamically based on the topic name?
In theory, yes.
The Deserializer interface accepts the topic name as a parameter, which you could perform a check against.
However, getting access to this in Spark would require your own UDF wrapper around it.
Ultimately, I think it would be better to define a separate streaming dataframe for each topic/format, or to simply produce the string messages as Avro-encoded as well.
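A rough sketch of the two-dataframe option (topic names, bootstrap servers, and the Avro schema are placeholders; from_avro comes from the external spark-avro package, and the import path shown is the Spark 3.x one):

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

def readTopic(topic: String) = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", topic)
  .load()

// Plain-string topic: casting the value bytes is enough
val stringDf = readTopic("string_topic")
  .select(col("value").cast("string").as("value"))

// Avro topic: decode the value bytes with the writer schema, supplied as a JSON string
val avroSchemaJson = """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"}]}"""
val avroDf = readTopic("avro_topic")
  .select(from_avro(col("value"), avroSchemaJson).as("record"))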
I need to export data from Hive to Kafka topics based on some events in another Kafka topic. I know I can read data from Hive in a Spark job using HQL and write it to Kafka from Spark, but is there a better way?
This can be achieved using Spark Streaming. The steps are outlined below, with a rough sketch in code after the list:
Create a Spark Streaming job which connects to the required topic and fetches the required data-export information.
From the stream, do a collect and get your data-export requirements into driver variables.
Create a DataFrame using the specified condition.
Write the DataFrame into the required topic using KafkaUtils.
Provide a polling interval based on your data volume and Kafka write throughput.
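A compact sketch of that flow, written here with Structured Streaming's foreachBatch rather than the DStream/KafkaUtils API (topic names, the Hive table, and the brokers are placeholders, and a Hive-enabled SparkSession named spark is assumed):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Stream the control topic that carries the export requests
val control = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "export_events")
  .load()
  .select(col("value").cast("string").as("request"))

// Per micro-batch: collect the requests on the driver, query Hive, write to Kafka
val exportBatch: (DataFrame, Long) => Unit = (batch, _) => {
  val requests = batch.collect().map(_.getString(0))
  requests.foreach { req =>
    val hiveDf = spark.sql(s"SELECT * FROM events WHERE export_key = '$req'")  // HQL is illustrative
    hiveDf.selectExpr("to_json(struct(*)) AS value")
      .write.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "export_out")
      .save()
  }
}

control.writeStream.foreachBatch(exportBatch).start()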
Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
Otherwise, I would re-evaluate other tools, because Hive is slow. Couchbase and Cassandra offer much better CDC features for ingestion into Kafka. Or rewrite the upstream applications that insert into Hive in the first place so that they write straight into Kafka instead, from which you can join with other topics, for example.
I'd like to process a stream of data coming from RabbitMQ. Specifically, it's a list of changes, and I'd like to filter out changes that have already been made. To do that, I'd need to compare the new data against the existing data in a Cassandra database.
Is it okay to do that within a Spark streaming transformation? Is there some more idiomatic approach I should be considering?