Does Spark 2.2.0 support Streaming Self-Joins? - apache-spark

I understand that joins between two different streaming DataFrames are not supported in Spark 2.2.0, but I am trying to do a self-join, so there is only one stream. Below is my code:
val jdf = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "join_test")
.option("startingOffsets", "earliest")
.load();
jdf.printSchema
which prints the following:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Now I run the join query below after reading through this SO post
jdf.as("jdf1").join(jdf.as("jdf2"), $"jdf1.key" === $"jdf2.key")
And I get the following Exception
org.apache.spark.sql.AnalysisException: cannot resolve '`jdf1.key`' given input columns: [timestamp, value, partition, timestampType, topic, offset, key];;
'Join Inner, ('jdf1.key = 'jdf2.key)
:- SubqueryAlias jdf1
: +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession#f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]
+- SubqueryAlias jdf2
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession#f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]

I think it makes no difference whether you join the same streaming DataFrame with itself or join two different streaming DataFrames; either way it is a stream-stream join, which Spark 2.2.0 does not support.
There are two ways to work around it.
First, you can join a static DataFrame with a streaming DataFrame: read the topic once as batch data and once as a streaming DataFrame, then join the two (see the sketch below).
Second, you can use Kafka Streams, which supports joining streaming data.
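Here is a minimal sketch of the first workaround, assuming the same join_test topic as in the question; the column aliases and casts are illustrative:
import spark.implicits._

// Read the topic once as a batch (static) DataFrame and once as a stream.
// A stream-static join is supported in Spark 2.2.0; a stream-stream join is not.
val staticDF = spark
  .read                                   // batch read of Kafka, available since 2.2.0
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()
  .select($"key".cast("string").as("key"), $"value".cast("string").as("staticValue"))

val streamDF = spark
  .readStream                             // streaming read, as in the question
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()
  .select($"key".cast("string").as("key"), $"value".cast("string").as("streamValue"))

val joined = streamDF.join(staticDF, Seq("key"))   // inner stream-static join on key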

Related

How can I send my structured streaming dataframe to Kafka?

Hello everyone!
I'm trying to send my structured streaming dataframe to one of my Kafka topics, detection.
This is the schema of the structured streaming dataframe:
root
|-- timestamp: timestamp (nullable = true)
|-- Sigma: string (nullable = true)
|-- time: string (nullable = true)
|-- duration: string (nullable = true)
|-- SourceComputer: string (nullable = true)
|-- SourcePort: string (nullable = true)
|-- DestinationComputer: string (nullable = true)
|-- DestinationPort: string (nullable = false)
|-- protocol: string (nullable = true)
|-- packetCount: string (nullable = true)
|-- byteCount: string (nullable = true)
But when I try to send the dataframe with this method:
dfwriter=df \
.selectExpr("CAST(value AS STRING)") \
.writeStream \
.format("kafka") \
.option("checkpointLocation", "/Documents/checkpoint/logs") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("failOnDataLoss", "false") \
.option("topic", detection) \
.start()
Then I get this error:
pyspark.sql.utils.AnalysisException: cannot resolve 'value' given input columns: [DestinationComputer, DestinationPort, Sigma, SourceComputer, SourcePort, byteCount, duration, packetCount, processName, protocol, time, timestamp]; line 1 pos 5;
If I send a dataframe with just the value column it works; I receive the data on my Kafka topic consumer.
Any idea how to send my dataframe with all of its columns?
Thank you!
Your dataframe has no value column, as the error says.
You'd need to "embed" all columns under a single value column, then serialize it with a function like to_json, not CAST(.. AS STRING).
In PySpark, that'd be something like to_json(struct("*")).alias("value") within a select query.
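A minimal Scala sketch of the same idea (the question's code is PySpark; the topic name, checkpoint path, and failOnDataLoss option are copied from the question, the rest is illustrative):
import org.apache.spark.sql.functions.{struct, to_json}

// Pack every column into one JSON string column named "value",
// which is the column the Kafka sink expects.
val query = df
  .select(to_json(struct("*")).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/Documents/checkpoint/logs")
  .option("failOnDataLoss", "false")
  .option("topic", "detection")
  .start()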
Similar question - Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe

How to write result of streaming query to multiple database tables?

I am using Spark Structured Streaming and reading from a Kafka topic. The goal is to write the message to multiple tables in a PostgreSQL database.
The message schema is:
root
|-- id: string (nullable = true)
|-- name: timestamp (nullable = true)
|-- comment: string (nullable = true)
|-- map_key_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Writing to a single table (after dropping map_key_value) works with the write code below:
message.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.write.format("jdbc").option("url", "url")
.option("user", "username")
.option("password", "password")
.option(JDBCOptions.JDBC_TABLE_NAME, "table_1")
.mode(SaveMode.Append).save();
}.outputMode(OutputMode.Append()).start().awaitTermination()
I want to write the message to two DB tables: table 1 (id, name, comment), while table 2 needs to have the map_key_value.
You will need N streaming queries for N sinks; table 1 and table 2 each count as a separate sink.
writeStream does not currently write to JDBC, so you should use the foreachBatch operator, as in the sketch below.
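A minimal sketch under that constraint: one foreachBatch query (and one checkpoint location) per table. The JDBC options mirror the question; the table names, checkpoint paths, and the explode of the map column are assumptions for illustration.
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.explode
import spark.implicits._

// Append one micro-batch to the given JDBC table.
def writeJdbc(batchDF: DataFrame, table: String): Unit =
  batchDF.write.format("jdbc")
    .option("url", "url")
    .option("user", "username")
    .option("password", "password")
    .option("dbtable", table)
    .mode(SaveMode.Append)
    .save()

// Query 1: table_1 takes id, name, comment
val q1 = message.select("id", "name", "comment")
  .writeStream
  .option("checkpointLocation", "/tmp/checkpoints/table_1")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => writeJdbc(batchDF, "table_1") }
  .outputMode("append")
  .start()

// Query 2: table_2 takes the map column, exploded here into key/value rows
val q2 = message.select($"id", explode($"map_key_value"))
  .writeStream
  .option("checkpointLocation", "/tmp/checkpoints/table_2")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => writeJdbc(batchDF, "table_2") }
  .outputMode("append")
  .start()

spark.streams.awaitAnyTermination()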

spark read orc with specific columns

I have an ORC file; when read with the option below it reads all the columns.
val df= spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
But I want to read only two columns from that file. Is there any way to read only two columns (id, name) while loading the ORC file?
Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can declare any data transformations in your code without any immediate effect. Only after an action is called does Spark start doing the job, and Spark is smart enough not to do extra work, so only the selected columns are actually read.
So you can write it like this:
val inDF: DataFrame = spark.read.orc("/some/path/")
import spark.implicits._
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()

Parquet data and partition issue in Spark Structured streaming

I am using Spark Structured Streaming; my DataFrame has the following schema:
root
|-- data: struct (nullable = true)
| |-- zoneId: string (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- timeSinceLast: long (nullable = true)
|-- date: date (nullable = true)
How can I do a writeStream in Parquet format, write the data (zoneId, deviceId, timeSinceLast; everything except date), and partition the data by date? I tried the following code and the partitionBy clause did not work:
val query1 = df1
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("data.zoneId")
.start()
If you want to partition by date then you have to use the date column in the partitionBy() method:
val query1 = df1
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("date")
.start()
In case you want the data partitioned into a <year>/<month>/<day> directory structure, you should make sure that the date column is of DateType and then create appropriately formatted columns:
val df = dataset.withColumn("date", dataset.col("date").cast(DataTypes.DateType))
df.withColumn("year", functions.date_format(df.col("date"), "YYYY"))
.withColumn("month", functions.date_format(df.col("date"), "MM"))
.withColumn("day", functions.date_format(df.col("date"), "dd"))
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("year", "month", "day")
.start()
I think you should try the repartition method, which can take two kinds of arguments:
column name
number of wanted partitions.
I suggest using repartition("date") to partition your data by date.
A great link on the subject: https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

Spark Exception when converting a MySQL table to parquet

I'm trying to convert a remote MySQL table to a parquet file using Spark 1.6.2.
The process runs for 10 minutes, filling up memory, then starts with these messages:
WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;#dac44da,BlockManagerId(driver, localhost, 46158))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
At the end it fails with this error:
ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriverActorSystem-scheduler-1] shutting down ActorSystem [sparkDriverActorSystem]
java.lang.OutOfMemoryError: GC overhead limit exceeded
I'm running it in a spark-shell with these commands:
spark-shell --packages mysql:mysql-connector-java:5.1.26,org.slf4j:slf4j-simple:1.7.21 --driver-memory 12G
val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://.../table").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "...").option("user", "...").option("password", "...").load()
dataframe_mysql.saveAsParquetFile("name.parquet")
I have limited the max executor memory to 12G. Is there a way to force writing the parquet file in "small" chunks, freeing memory?
It seems like the problem is that you have no partitioning defined when you read your data with the JDBC connector.
Reading from JDBC isn't distributed by default, so to enable distribution you have to set manual partitioning. You need a column that is a good partitioning key, and you have to know its distribution up front.
This is apparently what your data looks like:
root
|-- id: long (nullable = false)
|-- order_year: string (nullable = false)
|-- order_number: string (nullable = false)
|-- row_number: integer (nullable = false)
|-- product_code: string (nullable = false)
|-- name: string (nullable = false)
|-- quantity: integer (nullable = false)
|-- price: double (nullable = false)
|-- price_vat: double (nullable = false)
|-- created_at: timestamp (nullable = true)
|-- updated_at: timestamp (nullable = true)
order_year seems like a good candidate to me (you seem to have ~20 years according to your comments).
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = ???
val driver: String = ???
val connectionUrl: String = ???
val query: String = ???
val userName: String = ???
val password: String = ???
// Manual partitioning
val partitionColumn: String = "order_year"
val options: Map[String, String] = Map("driver" -> driver,
"url" -> connectionUrl,
"dbtable" -> query,
"user" -> userName,
"password" -> password,
"partitionColumn" -> partitionColumn,
"lowerBound" -> "0",
"upperBound" -> "3000",
"numPartitions" -> "300"
)
val df = sqlContext.read.format("jdbc").options(options).load()
PS: partitionColumn, lowerBound, upperBound, numPartitions:
These options must all be specified if any of them is specified.
Now you can save your DataFrame to parquet.
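For example (a sketch reusing the df from above; the file name mirrors the question, and DataFrame.write.parquet replaces the deprecated saveAsParquetFile):
// Each JDBC partition is written as its own parquet part file,
// instead of the whole table being pulled through a single connection.
df.write.parquet("name.parquet")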
