Writing Spark streaming PySpark dataframe to Cassandra overwrites table instead of appending - apache-spark

I'm running a 1-node cluster of Kafka, Spark, and Cassandra, all locally on the same machine.
From a simple Python script I'm streaming some dummy data into a Kafka topic every 5 seconds. Then, using Spark Structured Streaming, I'm reading this data stream (one row at a time) into a PySpark DataFrame with startingOffsets = latest. Finally, I'm trying to append each row to an already existing Cassandra table.
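For context, a minimal sketch of the kind of dummy producer described above (not from the original post; it assumes the kafka-python package, a broker on localhost:9092, and the topic1 topic used later):
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: str(k).encode("utf-8"),
    value_serializer=lambda v: str(v).encode("utf-8"),
)

i = 0
while True:
    # one dummy record every 5 seconds, matching the cadence in the question
    producer.send("topic1", key=i, value=random.randint(0, 100))
    producer.flush()
    i += 1
    time.sleep(5)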
I've been following (How to write streaming Dataset to Cassandra?) and (Cassandra Sink for PySpark Structured Streaming from Kafka topic).
One row of data is successfully written into the Cassandra table, but my problem is that it gets overwritten every time rather than a new row being appended. What might I be doing wrong?
Here's my code:
CQL DDL for creating kafkaspark keyspace followed by randintstream table in Cassandra:
DESCRIBE keyspaces;
CREATE KEYSPACE kafkaspark
  WITH REPLICATION = {
    'class' : 'SimpleStrategy',
    'replication_factor' : 1
  };
USE kafkaspark;
CREATE TABLE randIntStream (
    key int,
    value int,
    topic text,
    partition int,
    offset bigint,
    timestamp timestamp,
    timestampType int,
    PRIMARY KEY (partition, topic)
);
Launch the PySpark shell:
./bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
Read latest message from Kafka topic into streaming DataFrame:
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("startingOffsets","latest").option("subscribe","topic1").load()
Some transformations and checking schema:
df2 = df.withColumn("key", df["key"].cast("string")).withColumn("value", df["value"].cast("string"))
df3 = df2.withColumn("key", df2["key"].cast("integer")).withColumn("value", df2["value"].cast("integer"))
df4 = df3.withColumnRenamed("timestampType","timestamptype")
df4.printSchema()
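For reference, df4.printSchema() should show roughly the following (reconstructed; the actual output is not included in this post):
root
 |-- key: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestamptype: integer (nullable = true)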
Function for writing to Cassandra:
def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="randintstream", keyspace="kafkaspark") \
        .mode("append") \
        .save()
Finally, query to write to Cassandra from Spark:
query = df4.writeStream \
    .trigger(processingTime="5 seconds") \
    .outputMode("update") \
    .foreachBatch(writeToCassandra) \
    .start()
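As an aside (not part of the original question): when running this as a standalone script rather than in the PySpark shell, you would typically also set a checkpoint location and block on the query; the checkpoint path below is a placeholder.
query = (df4.writeStream
    .trigger(processingTime="5 seconds")
    .outputMode("update")
    .option("checkpointLocation", "/tmp/kafka2cassandra-ckpt")  # placeholder path
    .foreachBatch(writeToCassandra)
    .start())
query.awaitTermination()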
SELECT * on the table in Cassandra (screenshot not included; it showed only a single row):

If the row is always rewritten in Cassandra, then you may have an incorrect primary key in the table - you need to make sure that every row has a unique primary key. If you're creating the Cassandra table from Spark, then by default it just takes the first column as the partition key, and that alone may not be unique.
Update after the schema was provided:
Yes, that's the case I was referring to - you have a primary key of (partition, topic), but every row you read from a given partition of that topic will have the same value for the primary key, so it will overwrite the previous versions. You need to make your primary key unique - for example, add the offset or timestamp columns to the primary key (although timestamp may not be unique if you have data produced within the same millisecond).
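For example, here is a sketch using the DataStax Python driver (not part of the original answer; any key layout that is unique per record works) that recreates the table with the Kafka offset in the primary key, so every record maps to its own row:
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("kafkaspark")
session.execute("DROP TABLE IF EXISTS randintstream")
session.execute("""
    CREATE TABLE randintstream (
        key int,
        value int,
        topic text,
        partition int,
        offset bigint,
        timestamp timestamp,
        timestamptype int,
        PRIMARY KEY ((topic, partition), offset)
    )
""")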
P.S. Also, with connector 3.0.0 you don't need foreachBatch (you will also need to provide a checkpointLocation; the path below is a placeholder):
df4.writeStream \
    .trigger(processingTime="5 seconds") \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="randintstream", keyspace="kafkaspark") \
    .outputMode("update") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start()
P.P.S. If you just want to move data from Kafka into Cassandra, you may consider using the DataStax Kafka Connector, which can be much more lightweight than Spark.

Related

How to add realtime timestamp column in spark DF while writing from kafka to db using spark structured streaming?

I want the exact timestamp of when a particular row was written into the DB.
Something like this should work:
from pyspark.sql import functions as F

# "sourceFile" is just the new column name; rename it to something like
# "write_timestamp" to reflect that it stores the write time.
(df.withColumn("sourceFile", F.current_timestamp())
    .write
    .format("your database")  # placeholder: use your actual data source format, e.g. "jdbc"
    .mode("overwrite")
    .save())
Alternatively, add a DEFAULT NOW() timestamp value to the database table's row/record schema. Spark cannot capture the exact timestamp at which the row was actually written to the database, only the time at which the executor got the batch for the database writer.

How to update specific set of Cassandra columns from Spark Dataframe using Datastax connector

I have a Cassandra table with a few columns, and I want to update one of them (and also, what about multiple columns?) from Spark 2.4.0. But if I don't provide all the columns, the records don't get updated.
Cassandra schema:
rowkey,message,number,timestamp,name
1,hello,12345,12233454,ABC
The point is that the Spark DataFrame contains the rowkey along with the updated timestamp that has to be written to the Cassandra table.
I tried to select the columns right after the options, but it seems there's no such method.
finalDF.select("rowkey","current_ts")
.withColumnRenamed("current_ts","timestamp")
.write
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "table_data", "keyspace" -> "ks_data"))
.mode("overwrite")
.option("confirm.truncate","true")
.save()
Say finalDF is:
rowkey,current_ts
1,12233999
Then, after the update, the Cassandra table should hold:
rowkey,message,number,timestamp,name
1,hello,12345,12233999,ABC
I'm using the DataFrame API, so the RDD approach cannot be used. How can I do this? Cassandra version 3.11.3, DataStax connector 2.4.0-2.11.
Clarification: SaveMode is used to specify the expected behavior of saving a DataFrame to a data source (not only for C*, but for any data source). The available options are:
SaveMode.ErrorIfExists
SaveMode.Append
SaveMode.Overwrite
SaveMode.Ignore
In this case, since you already have data and you want to append to it, you have to use SaveMode.Append:
import org.apache.spark.sql.SaveMode

finalDF.select("rowkey", "current_ts")
  .withColumnRenamed("current_ts", "timestamp")
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table_data", "keyspace" -> "ks_data"))
  .mode(SaveMode.Append)
  .option("confirm.truncate", "true") // only relevant for SaveMode.Overwrite; can be dropped here
  .save()
See the Spark docs on SaveMode.
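For the record, a rough PySpark equivalent of the snippet above (a sketch; table and column names are taken from the question): because Cassandra treats inserts as upserts, appending just the primary key plus the changed column updates only that column for the matching row.
(finalDF.select("rowkey", "current_ts")
    .withColumnRenamed("current_ts", "timestamp")
    .write
    .format("org.apache.spark.sql.cassandra")
    .options(table="table_data", keyspace="ks_data")
    .mode("append")
    .save())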

unable to insert into hive partitioned table from spark

I created an external partitioned table in Hive.
In the logs it shows numInputRows, which means the query is working and sending data. But when I connect to Hive using beeline and query it (select * or count(*)), it's always empty.
def hiveOrcSetWriter[T](event_stream: Dataset[T])(implicit spark: SparkSession): DataStreamWriter[T] = {
  import spark.implicits._
  val hiveOrcSetWriter: DataStreamWriter[T] = event_stream
    .writeStream
    .partitionBy("year", "month", "day")
    .format("orc")
    .outputMode("append")
    .option("compression", "zlib")
    .option("path", _table_loc)
    .option("checkpointLocation", _table_checkpoint)
  hiveOrcSetWriter
}
What could the issue be? I'm unable to figure it out.
MSCK REPAIR TABLE tablename
This goes and checks the location of the table and adds partitions if new ones exist.
Add this step to your Spark process in order to query the data from Hive.
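A minimal PySpark sketch of that step (it assumes a Hive-enabled SparkSession and that tablename is the external table backed by the streaming output path):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("refresh-hive-partitions")  # hypothetical app name
    .enableHiveSupport()
    .getOrCreate())

# Re-register any partition directories the streaming job has written,
# so that beeline/Hive queries start returning rows.
spark.sql("MSCK REPAIR TABLE tablename")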
Your streaming job is writing new partitions to the table location, but the Hive metastore is not aware of them.
When you run a select query on the table, Hive checks the metastore to get the list of table partitions. Since the information in the metastore is outdated, the data doesn't show up in the result.
You need to run
ALTER TABLE <TABLE_NAME> RECOVER PARTITIONS
from Hive/Spark to update the metastore with the new partition info.

Spark Structured Streaming Writestream to Hive ORC Partioned External Table

I am trying to use Spark Structured Streaming - writeStream API to write to an External Partitioned Hive table.
CREATE EXTERNAL TABLE `XX`(
  `a` string,
  `b` string,
  `c` string,
  `happened` timestamp,
  `processed` timestamp,
  `d` string,
  `e` string,
  `f` string)
PARTITIONED BY (
  `year` int, `month` int, `day` int)
CLUSTERED BY (d)
INTO 6 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.compression.strategy'='SPEED',
  'orc.create.index'='true',
  'orc.encoding.strategy'='SPEED');
and in Spark code,
val hiveOrcWriter: DataStreamWriter[Row] = event_stream
  .writeStream
  .outputMode("append")
  .format("orc")
  .partitionBy("year", "month", "day")
  //.option("compression", "zlib")
  .option("path", _table_loc)
  .option("checkpointLocation", _table_checkpoint)
I see that with a non-partitioned table, records are inserted into Hive. However, when using a partitioned table, the Spark job does not fail or raise exceptions, but records are not inserted into the Hive table.
I'd appreciate comments from anyone who has dealt with similar problems.
Edit:
Just discovered that the .orc files are indeed written to HDFS, with the correct partition directory structure, e.g. /_table_loc/_table_name/year/month/day/part-0000-0123123.c000.snappy.orc
However
select * from 'XX' limit 1; (or where year=2018)
returns no rows.
The InputFormat and OutputFormat for the Table 'XX' are org.apache.hadoop.hive.ql.io.orc.OrcInputFormat and
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat respectively.
This feature isn't provided out of the box in Structured Streaming. In normal (batch) processing you would use dataset.write.saveAsTable(table_name), but that method isn't available for streams.
After processing and saving the data in HDFS, you can manually update the partitions (or use a script that does this on a schedule; a rough sketch follows the commands below):
If you use Hive
MSCK REPAIR TABLE table_name
If you use Impala
ALTER TABLE table_name RECOVER PARTITIONS
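As a rough illustration of the "script on a schedule" option mentioned above (the table name XX and the interval are assumptions):
import time

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("partition-repair-schedule")  # hypothetical app name
    .enableHiveSupport()
    .getOrCreate())

while True:
    # Hive: MSCK REPAIR TABLE; for Impala use ALTER TABLE XX RECOVER PARTITIONS
    spark.sql("MSCK REPAIR TABLE XX")
    time.sleep(300)  # arbitrary 5-minute interval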

How do I save spark.writeStream results in hive?

I am using spark.readStream to read data from Kafka and running an explode on the resulting dataframe.
I am trying to save the result of the explode in a Hive table and I am not able to find any solution for that.
I tried the following method but it doesn't work (it runs but I don't see any new partitions created)
val query = tradelines.writeStream
  .outputMode("append")
  .format("memory")
  .option("truncate", "false")
  .option("checkpointLocation", checkpointLocation)
  .queryName("tl")
  .start()
sc.sql("set hive.exec.dynamic.partition.mode=nonstrict;")
sc.sql("INSERT INTO TABLE default.tradelines PARTITION (dt) SELECT * FROM tl")
Check HDFS for the dt partitions on the file system. You need to run MSCK REPAIR TABLE on the Hive table to see the new partitions.
If you aren't doing anything special with Spark, then it's worth pointing out that Kafka Connect HDFS is capable of registering Hive partitions directly from Kafka.
