Retrieve a String-type column from a Spark Dataset as a String variable, to pass it as the 'key' for a Redis cache - apache-spark

I am trying to use Spark Streaming to read data from a Kafka topic.
The message from Kafka is JSON, which I am storing in the value column of the dataset as a String (see below).
**Sample message (just a sample; the actual JSON is more complex):**
{
"Name": "Bauddhik",
"Profession": "Developer"
}
Dataset<Row> df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic1")
.load()
.selectExpr("CAST(value AS STRING)");
Now, as my Dataset has a value column with the entire JSON, I need to pick one of the fields to use as the key when storing in Redis. Suppose the field is "Name" from the JSON.
So, first I did the select below to pull out the "name" field as a new column in my dataframe.
Dataset<Row> df1 = df.select(functions.col("value"), functions.get_json_object(functions.col("value"), "$['name']").as("name"));
This works fine, and now my df1 looks like:
Value | name
<Json> | Bauddhik
Now I want this to be inserted into the Redis cache with the key as 'Bauddhik' and the value as the entire JSON. So I am using the foreachBatch option below to persist to Redis.
df1.writeStream().foreachBatch(
    new VoidFunction2<Dataset<Row>, Long>() {
        @Override
        public void call(Dataset<Row> dataset, Long batchId) {
            dataset.write()
                .format("org.apache.spark.sql.redis")
                .option("key.column", **<hereistheissue>**)
                .option("table", "test")
                .mode(SaveMode.Overwrite)
                .save();
        }
    }).start();
If you look at the above code (hereistheissue), I need to pass the key as Bauddhik, which I derived earlier as a separate column in the DataFrame.
I am not able to retrieve the name column as a String so that I can pass it to the Redis cache as the key. I have tried using map and df.head().getString(1), but nothing seems to be working.
Can anyone please guide me on how I can read a column from a dataset as a String and pass it to the key option while writing to the Redis cache?
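A minimal sketch of one way to do this, assuming the spark-redis DataFrame writer: its key.column option takes a column name rather than a literal value, so it can point at the derived name column inside foreachBatch and each row is keyed by its own name (e.g. "Bauddhik") without collecting anything to the driver. The sketch is in Scala (the Java API is analogous); the table name is kept from the question.
import org.apache.spark.sql.{DataFrame, SaveMode}

// Write each micro-batch to Redis. key.column names the DataFrame column whose
// value becomes the Redis key for that row (here the derived "name" column).
def writeToRedis(batch: DataFrame, batchId: Long): Unit = {
  batch.write
    .format("org.apache.spark.sql.redis")
    .option("table", "test")
    .option("key.column", "name")
    .mode(SaveMode.Append) // or SaveMode.Overwrite, as in the question
    .save()
}

val query = df1.writeStream
  .foreachBatch(writeToRedis _)
  .start()

query.awaitTermination()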

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from Kafka and should use these values as parameters to query data from a Cassandra database. My question is: how do I use the data that is in the input dataframe (Kafka) as "where" parameters in a Cassandra "select" without getting the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
.readStream
.format("kafka")
.options(
Map("kafka.bootstrap.servers"-> kafka_bootstrap,
"subscribe" -> kafka_topic,
"startingOffsets"-> "latest",
"fetchOffset.numRetries"-> "5",
"kafka.group.id"-> groupId
))
.load()
I get this error whenever I try to store the dataframe values in a variable to use as a parameter.
This is the method I created to try to convert the data into variables. With this, Spark gives the error I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
val messageDS: Dataset[Message] = messageToProcess.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
val mensagem = messageToProcess
mensagem
}
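For context, a hedged sketch of what .as[Message] above assumes: a bean or case class with an encoder in scope. The field list below is hypothetical, inferred only from the x.country access.
// Hypothetical shape; the real Message fields are not shown in the question.
case class Message(country: String)

// The implicit encoders needed by .as[Message] come from the session:
// import spark.implicits._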
When you need to use data in Kafka to query data in Cassandra, such an operation is a typical join between two datasets: you don't need to call .collect to find entries, you just do the join. And it's quite a common pattern to enrich data in Kafka with data from an external dataset, and Cassandra provides low-latency operations for it.
Your code could look like the following (you'll need to configure the so-called DirectJoin; see below):
import spark.implicits._
import org.apache.spark.sql.cassandra._

val df = spark.readStream.format("kafka")
  .options(Map(...)).load()
// ... decode the data in Kafka into columns

val cassdata = spark.read.cassandraFormat("table", "keyspace").load

val joined = df.join(cassdata, cassdata("pk") === df("some_column"))

val processed = ... // process the joined data
val query = processed.writeStream ... // output the data somewhere, ending with .start()
query.awaitTermination()
I have a detailed blog post on how to perform efficient joins with data in Cassandra.
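A sketch of the session setup this assumes, using Spark Cassandra Connector 2.5+ configuration keys (these come from the connector documentation, not from the answer above; the app name and host are placeholders):
import org.apache.spark.sql.SparkSession

// Enables the connector's Catalyst extensions so the join against Cassandra
// can run as a DirectJoin instead of a full table scan.
val spark = SparkSession.builder()
  .appName("kafka-cassandra-enrichment")
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .config("spark.cassandra.sql.directJoinSetting", "on") // or "auto" / "off"
  .getOrCreate()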
As the error message suggests, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes; see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming dataset, which is not allowed. To achieve this goal you can use a foreachBatch output sink to collect the rows you need at each microbatch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
// Now microBatchDf is no longer a streaming dataframe
// you can check with microBatchDf.isStreaming
val messageDS: Dataset[Message] = microBatchDf.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
// ...
}.start()

How to convert Row to Dictionary in foreach() in pyspark?

I have a dataframe generated from Spark which I want to use for writeStream and also want to save in a database.
I have the following code:
output = (
spark_event_df
.writeStream
.outputMode('update')
.foreach(writerClass(**job_config_data))
.trigger(processingTime="2 seconds")
.start()
)
output.awaitTermination()
As I am using foreach(), writerClass gets a Row, and I cannot convert it into a dictionary in Python.
How can I get a Python datatype (preferably a dictionary) from the Row in my writerClass, so that I can manipulate it according to my needs and save it into the database?
If you're just looking to save to a database as part of your stream, you could do that using foreachBatch and the built-in JDBC writer. Just do your transformations to shape your data according to the desired output schema, then:
def writeBatch(input, batch_id):
    (input
        .write
        .format("jdbc")
        .option("url", url)
        .option("dbtable", tbl)
        .mode("append")
        .save())

output = (spark_event_df
    .writeStream
    .foreachBatch(writeBatch)
    .start())

output.awaitTermination()
If you absolutely need custom logic for writing to your database that is not supported by the built-in JDBC writer, then you should use the DataFrame foreachPartition method to write your rows in bulk rather than one at a time. If you're using this method, you can convert the Row objects into a dict by calling asDict().

How to enrich data of a streaming query and write the result to Elasticsearch?

For a given dataset (originalData) I'm required to map the values and then prepare a new dataset combining the search results from elasticsearch.
Dataset<Row> orignalData = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers","test")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.load();
Dataset<Row> esData = JavaEsSparkSQL
.esDF(spark.sqlContext(), "spark_correlation/doc");
esData.createOrReplaceTempView("es_correlation");
List<SGEvent> listSGEvent = new ArrayList<>();
orignalData.foreach((ForeachFunction<Row>) row -> {
SGEvent event = new SGEvent();
String sourceKey=row.get(4).toString();
String searchQuery = "select id from es_correlation where es_correlation.key='"+sourceKey+"'";
Dataset<Row> result = spark.sqlContext().sql(searchQuery);
String id = null;
if (result != null) {
result.show();
id = result.first().toString();
}
event.setId(id);
event.setKey(sourceKey);
listSGEvent.add(event);
});
Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<Row> finalData = spark.createDataset(listSGEvent, eventEncoderSG).toDF();
finalData
.writeStream()
.outputMode(OutputMode.Append())
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.start("spark_index/doc").awaitTermination();
Spark throws the following exception:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
Is my approach towards combining Elasticsearch values with a Dataset valid? Is there any better solution for this?
There are a couple of issues here.
As the exception says orignalData is a streaming query (streaming Dataset) and the only way to execute it is to use writeStream.start(). That's one issue.
You did call writeStream.start(), but on another query, finalData, which is not streaming but batch. That's another issue.
For "enrichment" cases like yours, you can use a streaming join (Dataset.join operator) or one of DataStreamWriter.foreach and DataStreamWriter.foreachBatch. I think DataStreamWriter.foreachBatch would be more efficient.
public DataStreamWriter<T> foreachBatch(VoidFunction2<Dataset<T>,Long> function)
(Java-specific) Sets the output of the streaming query to be processed using the provided function. This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous). In every micro-batch, the provided function will be called with (i) the output rows as a Dataset and (ii) the batch identifier. The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
Not only would you get all the data of a streaming micro-batch in one shot (the first input argument of type Dataset<T>), but also a way to submit another Spark job (across executors) based on the data.
The pseudo-code could look as follows (I'm using Scala as I'm more comfortable with the language):
val dsWriter = originalData.writeStream.foreachBatch { case (data, batchId) =>
// make sure the data is small enough to collect on the driver
// Otherwise expect OOME
// It'd also be nice to have a Java bean to convert the rows to proper types and names
val localData = data.collect
// Please note that localData is no longer Spark's Dataset
// It's a local Java collection
// Use Java Collection API to work with the localData
// e.g. using Scala
// You're mapping over localData (for a single micro-batch)
// And creating finalData
// I'm using the same names as your code to be as close to your initial idea as possible
val finalData = localData.map { row =>
// row is the old row from your original code
// do something with it
// e.g. using Java
String sourceKey=row.get(4).toString();
...
}
// Time to save the data processed to ES
// finalData is a local Java/Scala collection not Spark's DataFrame!
// Let's convert it to a DataFrame (and leverage the Spark distributed platform)
// Note that I'm almost using your code, but it's a batch query not a streaming one
// We're inside foreachBatch
finalData
.toDF // Convert a local collection to a Spark DataFrame
.write // this creates a batch query
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.save("spark_index/doc") // save (not start) as it's a batch query inside a streaming query
}
dsWriter is a DataStreamWriter and you can now start it to start the streaming query.
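For completeness, starting it could look like this (a trivial sketch; the variable name is arbitrary):
// Start the streaming query defined by the foreachBatch sink above
// and block until it terminates.
val query = dsWriter.start()
query.awaitTermination()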
I was able to achieve the actual solution by using SQL joins.
Please refer to the code below.
Dataset<Row> orignalData = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers","test")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.load();
orignalData.createOrReplaceTempView("stream_data");
Dataset<Row> esData = JavaEsSparkSQL
.esDF(spark.sqlContext(), "spark_correlation/doc");
esData.createOrReplaceTempView("es_correlation");
Dataset<Row> joinedData = spark.sqlContext().sql("select * from stream_data,es_correlation where es_correlation.key=stream_data.key");
// Or
/* By using Dataset Join Operator
Dataset<Row> joinedData = orignalData.join(esFirst, "key");
*/
Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<SGEvent> finalData = joinedData.map((MapFunction<Row, SGEvent>) row -> {
SGEvent event = new SGEvent();
event.setId(row.get(0).toString());
event.setKey(row.get(3).toString());
return event;
},eventEncoderSG);
finalData
.writeStream()
.outputMode(OutputMode.Append())
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.start("spark_index/doc").awaitTermination();

How to save kafka data into different location based on a column value in spark structured streaming?

I have a use case in which I am consuming data from Kafka using Spark Structured Streaming. I have multiple topics to subscribe to, and based on the topic name the dataframe should be dumped to a defined location (a different location for each topic). I looked into whether this could be solved with some kind of split/filter function on the Spark dataframe, but could not find any.
As of now I am only subscribed to one topic, and I am using my own method to dump the data to a location in parquet format. Here is the code I am currently using:
def save_as_parquet(cast_dataframe: DataFrame, output_path: String, checkpointLocation: String): Unit = {
val query = cast_dataframe.writeStream
.format("parquet")
.option("failOnDataLoss",true)
.option("path",output_path)
.option("checkpointLocation",checkpointLocation)
.start()
.awaitTermination()
}
When I am subscribed to multiple topics, this cast_dataframe will have values from different topics. I wish to dump the data from each topic only to its assigned location. How can this be done?
As explained in the official documentation, the Dataset to be written might contain an optional topic column, which can be used for message routing:
* The topic column is required if the “topic” configuration option is not specified.
The value column is the only required option. If a key column is not specified then a null valued key column will be automatically added (see Kafka semantics on how null valued key values are handled). If a topic column exists then its value is used as the topic when writing the given row to Kafka, unless the “topic” configuration option is set i.e., the “topic” configuration option overrides the topic column.
According to the documentation each Row from Kafka source has the following schema:
Column | Type
key | binary
value | binary
topic | string
... | ...
Assuming you are reading from multiple topics using the source option
val kafkaInputDf = spark.readStream.format("kafka").[...]
  .option("subscribe", "topic1,topic2,topic3")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic")
you can then apply a filter on the topic column to split the data accordingly:
val df1 = kafkaInputDf.filter(col("topic") === "topic1")
val df2 = kafkaInputDf.filter(col("topic") === "topic2")
val df3 = kafkaInputDf.filter(col("topic") === "topic3")
Then you can sink those three streaming Dataframes df1, df2 and df3 into their required sinks, as sketched below. As this will create three streaming queries running in parallel, it is important that each writeStream gets its own checkpoint location.
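A minimal sketch of what that could look like for a parquet sink (all output paths and checkpoint locations below are placeholders):
// One query per topic, each with its own output path and checkpoint location.
val q1 = df1.writeStream.format("parquet")
  .option("path", "/data/topic1")
  .option("checkpointLocation", "/checkpoints/topic1")
  .start()

val q2 = df2.writeStream.format("parquet")
  .option("path", "/data/topic2")
  .option("checkpointLocation", "/checkpoints/topic2")
  .start()

val q3 = df3.writeStream.format("parquet")
  .option("path", "/data/topic3")
  .option("checkpointLocation", "/checkpoints/topic3")
  .start()

// Block until any of the three queries terminates.
spark.streams.awaitAnyTermination()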

writing corrupt data from kafka / json datasource in spark structured streaming

In Spark batch jobs I usually have a JSON datasource written to a file and can use the corrupt-column features of the DataFrame reader to write the corrupt data out to a separate location, and write the valid data to another location, both from the same job. (The data is written as parquet.)
But in Spark Structured Streaming I'm first reading the stream in via Kafka as a string and then using from_json to get my DataFrame. from_json uses JsonToStructs, which uses FailFast mode in the parser and does not return the unparsed string as a column in the DataFrame (see the note in the ref below). So how can I write corrupt data that doesn't match my schema, and possibly invalid JSON, to another location using SSS?
Finally, in the batch case the same job can write both dataframes. But Spark Structured Streaming requires special handling for multiple sinks. So in Spark 2.3.1 (my current version), how should both the corrupt and the valid streams be written out properly?
Ref: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Expression-JsonToStructs.html
val rawKafkaDataFrame=spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", config.broker)
.option("kafka.ssl.truststore.location", path.toString)
.option("kafka.ssl.truststore.password", config.pass)
.option("kafka.ssl.truststore.type", "JKS")
.option("kafka.security.protocol", "SSL")
.option("subscribe", config.topic)
.option("startingOffsets", "earliest")
.load()
val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
// does not provide a corrupt column or way to work with corrupt
jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
When you convert from string to JSON, if the string cannot be parsed with the schema provided, from_json returns null. You can filter on the null values and select the original string. Something like this:
val jsonDF = jsonDataFrame.withColumn("json", from_json(col("value"), schema))
val invalidJsonDF = jsonDF.filter(col("json").isNull).select("value")
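Building on that split, a hedged sketch of routing the valid and invalid records to two different locations as two separate streaming queries (the output paths and checkpoint locations are placeholders; each query needs its own checkpoint):
import org.apache.spark.sql.functions.col

// Counterpart of invalidJsonDF above: rows whose JSON parsed successfully.
val validJsonDF = jsonDF.filter(col("json").isNotNull).select("json.*")

val validQuery = validJsonDF.writeStream
  .format("parquet")
  .option("path", "/data/valid")
  .option("checkpointLocation", "/checkpoints/valid")
  .start()

val corruptQuery = invalidJsonDF.writeStream
  .format("parquet")
  .option("path", "/data/corrupt")
  .option("checkpointLocation", "/checkpoints/corrupt")
  .start()

spark.streams.awaitAnyTermination()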
I was just trying to figure out the _corrupt_record equivalent for structured streaming as well. Here's what I came up with; hopefully it gets you closer to what you're looking for:
// add a status column to partition our output by
// optional: only keep the unparsed json if it was corrupt
// writes up to 2 subdirs: 'out.par/status=OK' and 'out.par/status=CORRUPT'
// additional status codes for validation of nested fields could be added in similar fashion
df.withColumn("struct", from_json($"value", schema))
.withColumn("status", when($"struct".isNull, lit("CORRUPT")).otherwise(lit("OK")))
.withColumn("value", when($"status" <=> lit("CORRUPT"), $"value"))
.write
.partitionBy("status")
.parquet("out.par")
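The snippet above ends with a batch write; in a streaming query the same idea could be expressed with writeStream and partitionBy (the checkpoint path is a placeholder; imports assumed: org.apache.spark.sql.functions._ and spark.implicits._):
df.withColumn("struct", from_json($"value", schema))
  .withColumn("status", when($"struct".isNull, lit("CORRUPT")).otherwise(lit("OK")))
  .withColumn("value", when($"status" <=> lit("CORRUPT"), $"value"))
  .writeStream
  .partitionBy("status") // still writes out.par/status=OK and out.par/status=CORRUPT
  .format("parquet")
  .option("path", "out.par")
  .option("checkpointLocation", "/tmp/checkpoint/out_par")
  .start()
  .awaitTermination()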
