Split single DStream into multiple Hive tables - apache-spark

I am working on a Kafka + Spark Streaming project. Spark Streaming receives data from Kafka; the data is in JSON format. Sample input:
{
"table": "tableA",
"Product_ID": "AGSVGF.upf",
"file_timestamp": "2018-07-26T18:58:08.4485558Z000000000000000",
"hdfs_file_name": "null_1532631600050",
"Date_Time": "2018-07-26T13:45:01.0000000Z",
"User_Name": "UBAHTSD"
}
{
"table": "tableB",
"Test_ID": "FAGS.upf",
"timestamp": "2018-07-26T18:58:08.4485558Z000000000000000",
"name": "flink",
"time": "2018-07-26T13:45:01.0000000Z",
"Id": "UBAHTGADSGSCVDGHASD"
}
One JSON string is one message. There are 15 types of JSON string, distinguished by the table column. Now I want to save these 15 different JSON types into Apache Hive. So I have created a DStream, filtered the RDD on the basis of the table column, and saved each subset into Hive. The code works fine, but sometimes a batch takes much longer to process than the Spark batch interval. I have limited the input using spark.streaming.kafka.maxRatePerPartition=10 and repartitioned the RDD into 9 partitions, but on the Spark UI it shows an unknown stage.
Here is my code.
val dStream = dataStream.transform(rdd => rdd.repartition(9)).map(_._2)
dStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val sparkContext = rdd.sparkContext
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    val hiveContext = getInstance(sparkContext)

    val tableA = rdd.filter(_.contains("tableA"))
    if (!tableA.isEmpty()) {
      HiveUtil.tableA(hiveContext.read.json(tableA))
      tableA.unpersist(true)
    }

    val tableB = rdd.filter(_.contains("tableB"))
    if (!tableB.isEmpty()) {
      HiveUtil.tableB(hiveContext.read.json(tableB))
      tableB.unpersist(true)
    }

    .....
    .... up to 15 tables
    ....

    val tableK = rdd.filter(_.contains("tableK"))
    if (!tableK.isEmpty()) {
      HiveUtil.tableK(hiveContext.read.json(tableK))
      tableK.unpersist(true)
    }
  }
}
How can I optimise the code?
Thank you.

Purely from a management perspective, I would suggest you parameterize your job to accept the table name, then run 15 separate Spark applications. Also ensure that the Kafka consumer group is different for each application.
This way, you can more easily monitor which Spark job is not performing as well as the others, and a skew of data towards one table won't cause issues for the others.
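A minimal sketch of that parameterization, assuming the direct Kafka DStream API from the question; the object name, the group-id convention and the HiveUtil.write helper are illustrative, not part of the original answer:
object SingleTableJob {
  def main(args: Array[String]): Unit = {
    val tableName = args(0)                      // e.g. "tableA", passed at spark-submit time
    val groupId   = s"hive-writer-$tableName"    // a distinct consumer group per application

    // ... build the StreamingContext and the direct Kafka stream with a kafkaParams map that
    // sets "group.id" -> groupId, then keep only this table's messages:
    // dStream.filter(_.contains(tableName)).foreachRDD { rdd => HiveUtil.write(tableName, rdd) }
  }
}
Each application can then be submitted 15 times with a different table-name argument.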
It's not clear what the Kafka message keys are, but if the messages are produced with the table name as the key, then Spark can scale along with the Kafka partitions, and you're guaranteed all messages for each table will be in order.
Overall, I would actually use Kafka Connect or StreamSets for writing to HDFS/Hive, so you don't have to write code or tune Spark settings.

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from Kafka and should use these values as parameters to query data from a Cassandra database. My question is: how do I use the data in the input dataframe (from Kafka) as "where" parameters in a Cassandra "select" without getting the error below?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
  .readStream
  .format("kafka")
  .options(
    Map("kafka.bootstrap.servers" -> kafka_bootstrap,
        "subscribe" -> kafka_topic,
        "startingOffsets" -> "latest",
        "fetchOffset.numRetries" -> "5",
        "kafka.group.id" -> groupId
    ))
  .load()
I get this error whenever I try to store the dataframe values in a variable to use as a parameter.
This is the method I created to try to convert the data into variables; with it, Spark gives the error mentioned above:
def processData(messageToProcess: DataFrame): DataFrame = {
  val messageDS: Dataset[Message] = messageToProcess.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  val mensagem = messageToProcess
  mensagem
}
When you need to use data in Kafka to query data in Cassandra, such an operation is a typical join between two datasets - you don't need to call .collect to find entries, you just do the join. And it's quite a typical thing to enrich data in Kafka with data from an external dataset, and Cassandra provides low-latency operations for that.
Your code could look like the following (you'll need to configure the so-called DirectJoin, see the link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._

val df = spark.readStream.format("kafka")
  .options(Map(...)).load()
// ... decode data in Kafka into columns

val cassdata = spark.read.cassandraFormat("table", "keyspace").load

val joined = df.join(cassdata, cassdata("pk") === df("some_column"))
val processed = ...   // process joined data

val query = processed.writeStream ... .start()   // output data somewhere
query.awaitTermination()
I have a detailed blog post on how to perform efficient joins with data in Cassandra.
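For reference, a minimal sketch of enabling the connector extensions that the direct join relies on, assuming the DataStax Spark Cassandra Connector 2.5+ is on the classpath (the application name and host are placeholders):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-cassandra-join")
  // registers Cassandra-specific Catalyst rules, including the direct join optimization
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .config("spark.cassandra.connection.host", "cassandra-host")   // placeholder host
  .getOrCreate()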
As the error message suggests, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes; see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to call messageDS.collect() on a streaming dataset, which is not allowed. To achieve this you can use a foreachBatch output sink to collect the rows you need at each micro-batch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
  // Now microBatchDf is no longer a streaming dataframe
  // you can check with microBatchDf.isStreaming
  val messageDS: Dataset[Message] = microBatchDf.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  // ...
}

Kafka delete (tombstone) not updating max aggregate in Spark Structured Streaming

I am prototyping calculating aggregations in a Spark Structured Streaming (Spark 3.0) job and publishing the updates to Kafka. I need to calculate the all-time max date and max percentage (no windowing) for each group. The code seems fine except for Kafka tombstone records (deletes) in the source stream. The stream receives a Kafka record with a valid key and a null value, but the max aggregate continues to include the deleted record in the calculation. What are the best options to have this recalculated without the deleted records when a delete is consumed from Kafka?
Example
Messages produced:
<"user1|1", {"user": "user1", "pct":30, "timestamp":"2021-01-01 01:00:00"}>
<"user1|2", {"user": "user1", "pct":40, "timestamp":"2021-01-01 02:00:00"}>
<"user1|2", null>
Spark code snippet:
val usageStreamRaw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", usageTopic)
  .load()

val usageStream = usageStreamRaw
  .select(col("key").cast(StringType).as("key"),
          from_json(col("value").cast(StringType), valueSchema).as("json"))
  .selectExpr("key", "json.*")

val usageAgg = usageStream.groupBy("user")
  .agg(
    max("timestamp").as("maxTime"),
    max("pct").as("maxPct")
  )

val sq = usageAgg.writeStream
  .outputMode("update")
  .option("truncate", "false")
  .format("console")
  .start()
sq.awaitTermination()
For user1 the resulting maxPct is 40, but it should be 30 after the deletion. Is there a good way to do this with Spark Structured Streaming?
You could make use of the Kafka timestamp in each message (aliased here so it doesn't clash with the timestamp field inside the JSON value):
val usageStream = usageStreamRaw
  .select(col("key").cast(StringType).as("key"),
          from_json(col("value").cast(StringType), valueSchema).as("json"),
          col("timestamp").as("kafka_timestamp"))
  .selectExpr("key", "json.*", "kafka_timestamp")
Then select only the latest value for each key, and filter out null values, before applying your aggregation on the maximum time and pct.
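A minimal sketch of that idea using foreachBatch, assuming the kafka_timestamp alias above; note it only deduplicates within each micro-batch, so keeping the latest value per key across batches would still need separate state handling:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

usageStream.writeStream.foreachBatch { (batchDf: DataFrame, batchId: Long) =>
  val latestPerKey = batchDf
    .withColumn("rn", row_number().over(
      Window.partitionBy("key").orderBy(col("kafka_timestamp").desc)))
    .filter(col("rn") === 1)            // keep only the newest record per Kafka key
    .filter(col("pct").isNotNull)       // drop tombstones (null payloads)

  latestPerKey.groupBy("user")
    .agg(max("timestamp").as("maxTime"), max("pct").as("maxPct"))
    .show(false)                        // or write to the sink of your choice
}.start()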

Multiple operations/aggregations on the same Dataframe/Dataset in Spark Structured Streaming

I use Spark 2.3.2.
I'm receiving data from Kafka. I must do multiple aggregations on the same data, and then all the aggregation results will go to the same database (columns or tables may be changed). For example:
val kafkaSource = spark.readStream.format("kafka") ...
val agg1 = kafkaSource.groupBy().agg ...
val agg2 = kafkaSource.groupBy().mapGroupsWithState() ...
val agg3 = kafkaSource.groupBy().mapGroupsWithState() ...
But when I try to call writeStream for each aggregation result:
agg1.writeStream.foreach(...).start()
agg2.writeStream.foreach(...).start()
agg3.writeStream.foreach(...).start()
Spark reads the source data independently in each writeStream. Is this efficient?
Can I do multiple aggregations with one writeStream? If it is possible, is that efficient?
Every writeStream operation results in a new streaming query, and every streaming query will read from the source and execute the entire query plan. Unlike with DStreams, there is no cache/persist option available.
In Spark 2.4, a new API, foreachBatch, was introduced to solve this kind of scenario in a more efficient manner.
Caching can then be used to avoid multiple reads:
kafkaSource.writeStream.foreachBatch((df, id) => {
  df.persist()
  val agg1 = df.groupBy().agg ...
  val agg2 = df.groupBy().mapGroupsWithState() ...
  val agg3 = df.groupBy().mapGroupsWithState() ...
  // ... write each aggregation result to the database within this micro-batch ...
  df.unpersist()
}).start()
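For completeness, a hedged sketch of what the elided pieces might look like; the concrete aggregations, column names and the JDBC sink options are illustrative assumptions, not part of the original answer:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.max

// placeholder sink configuration (illustrative only)
val jdbcOptions = Map(
  "url" -> "jdbc:postgresql://db-host/db",
  "user" -> "user",
  "password" -> "secret")

kafkaSource.writeStream.foreachBatch { (df: DataFrame, batchId: Long) =>
  df.persist()
  // stand-ins for the real aggregations; "someKey"/"someValue" assume the Kafka value
  // has already been decoded into columns upstream
  val agg1 = df.groupBy("someKey").count()
  val agg2 = df.groupBy("someKey").agg(max("someValue").as("maxValue"))
  // each result is written with a normal batch writer inside the same micro-batch
  agg1.write.mode("append").format("jdbc").options(jdbcOptions + ("dbtable" -> "agg1_results")).save()
  agg2.write.mode("append").format("jdbc").options(jdbcOptions + ("dbtable" -> "agg2_results")).save()
  df.unpersist()
  ()
}.start()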

How to do a fast insertion of the data in a Kafka topic inside a Hive Table?

I have a Kafka topic in which I have received around 500k events.
Currently, I need to insert those events into a Hive table.
Since events are time-driven, I decided to use the following strategy:
1) Define a path inside HDFS, which I call users. Inside this path there will be several Parquet files, each one corresponding to a certain date, e.g. 20180412, 20180413, 20180414, etc. (format YYYYMMDD).
2) Create a Hive table and use the date in the format YYYYMMDD as a partition. The idea is to use each of the files inside the users HDFS directory as a partition of the table, by simply adding the corresponding Parquet file through the commands:
ALTER TABLE users DROP IF EXISTS PARTITION
(fecha='20180412') ;
ALTER TABLE users ADD PARTITION
(fecha='20180412') LOCATION '/users/20180412';
3) Read the data from the Kafka topic by iterating from the earliest event, get the date value in the event (inside the parameter dateCliente), and, given that date value, insert the value into the corresponding Parquet file.
4) In order to accomplish point 3, I read each event and save it into a temporary HDFS file, which I then read with Spark and convert into a DataFrame.
5) Using Spark, I insert the DataFrame values into the Parquet file.
The code follows this approach:
val conf = ConfigFactory.parseResources("properties.conf")
val brokersip = conf.getString("enrichment.brokers.value")
val topics_in = conf.getString("enrichment.topics_in.value")

val spark = SparkSession
  .builder()
  .master("yarn")
  .appName("ParaTiUserXY")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

val properties = new Properties
properties.put("key.deserializer", classOf[StringDeserializer])
properties.put("value.deserializer", classOf[StringDeserializer])
properties.put("bootstrap.servers", brokersip)
properties.put("auto.offset.reset", "earliest")
properties.put("group.id", "UserXYZ2")

// Schema to transform the values from the Kafka topic into JSON
val my_schema = new StructType()
  .add("longitudCliente", StringType)
  .add("latitudCliente", StringType)
  .add("dni", StringType)
  .add("alias", StringType)
  .add("segmentoCliente", StringType)
  .add("timestampCliente", StringType)
  .add("dateCliente", StringType)
  .add("timeCliente", StringType)
  .add("tokenCliente", StringType)
  .add("telefonoCliente", StringType)

val consumer = new KafkaConsumer[String, String](properties)
consumer.subscribe(util.Collections.singletonList("geoevents"))

val fs = {
  val conf = new Configuration()
  FileSystem.get(conf)
}

val temp_path: Path = new Path("hdfs:///tmp/tmpstgtopics")
if (fs.exists(temp_path)) {
  fs.delete(temp_path, true)
}

while (true) {
  val records = consumer.poll(100)
  for (record <- records.asScala) {
    val data = record.value.toString
    val dataos: FSDataOutputStream = fs.create(temp_path)
    val bw: BufferedWriter = new BufferedWriter(new OutputStreamWriter(dataos, "UTF-8"))
    bw.append(data)
    bw.close

    val data_schema = spark.read.schema(my_schema).json("hdfs:///tmp/tmpstgtopics")
    val fechaCliente = data_schema.select("dateCliente").first.getString(0)

    if (fechaCliente < date) {
      data_schema.select("longitudCliente", "latitudCliente", "dni", "alias",
          "segmentoCliente", "timestampCliente", "dateCliente", "timeCliente",
          "tokenCliente", "telefonoCliente")
        .coalesce(1).write.mode(SaveMode.Append)
        .parquet("/desa/landing/parati/xyusers/" + fechaCliente)
    }
    else {
      break
    }
  }
}
consumer.close()
However, this method takes around 1 second to process each record in my cluster, which would mean it will take around 6 days to process all the events I have.
Is this the optimal way to insert all of the events from a Kafka topic into a Hive table?
What other alternatives exist, or what upgrades could I make to my code in order to speed it up?
Other than the fact that you're not using Spark Streaming correctly to poll from Kafka (you wrote a vanilla Scala Kafka consumer with a while loop), and that coalesce(1) will always be a bottleneck because it forces one executor to collect the records, I'll just say you're really reinventing the wheel here.
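For comparison only, a minimal Structured Streaming sketch of the same pipeline (the schema is truncated, the checkpoint directory is a placeholder, and the paths are taken from the question), which avoids both the hand-rolled consumer loop and the per-record temporary file:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("GeoeventsToParquet").getOrCreate()
import spark.implicits._

val my_schema = new StructType()
  .add("dateCliente", StringType)
  .add("longitudCliente", StringType)   // ... remaining fields as in the question

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokersip)   // brokersip as read from properties.conf above
  .option("subscribe", "geoevents")
  .option("startingOffsets", "earliest")
  .load()
  .select(from_json($"value".cast("string"), my_schema).as("j"))
  .select("j.*")

parsed.writeStream
  .format("parquet")
  .partitionBy("dateCliente")                        // one Hive-style directory per date
  .option("path", "/desa/landing/parati/xyusers")    // base path from the question
  .option("checkpointLocation", "/tmp/chk-xyusers")  // placeholder checkpoint directory
  .start()
  .awaitTermination()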
What other alternatives exist?
Of those that I know of, which are all open source:
Gobblin (replaces Camus) by LinkedIn
Kafka Connect with the HDFS Sink Connector (built into Confluent Platform, but it also builds from source on GitHub)
StreamSets
Apache NiFi
Secor by Pinterest
Of those listed, it would be beneficial for you to have JSON- or Avro-encoded Kafka messages rather than a flat string. That way, you can drop the files as-is into a Hive SerDe and not parse them while consuming them. If you cannot edit the producer code, make a separate Kafka Streams job that takes the raw string data, parses it, then writes to a new topic of Avro or JSON.
If you choose Avro (which you really should, for Hive support), you can use the Confluent Schema Registry. Or, if you're running Hortonworks, they offer a similar registry.
Hive on Avro operates far better than on text or JSON. Avro can easily be transformed into Parquet, and I believe each of the above options offers at least Parquet support, while others can also do ORC (Kafka Connect doesn't do ORC at this time).
Each of the above also supports some level of automatic Hive partition generation based on the Kafka record time.
You can improve the parallelism by increasing the partitions of the Kafka topic and having one or more consumer groups with multiple consumers, consuming one-to-one with each partition.
As cricket_007 mentioned, you can use one of the open-source frameworks, or you can add more consumers to the consumer group consuming the topic to off-load the data.

Realtime database streaming using Apache Spark and Kafka

I am designing a Spark Streaming application with Kafka. I have a few questions, as follows.
I am streaming data from RDBMS tables into Kafka and using a Spark consumer to consume the messages and process them with Spark SQL.
Questions:
1. I am streaming data from a table into Kafka as (key = table name, value = table data in the form of JSON records) -- is this a correct architecture?
2. In the Spark consumer I am trying to consume data using DStream.foreachRDD(x => transformation on RDD x) -- I am having an issue with this (it says a transformation within a transformation is not allowed). I am trying to extract the keys within the foreachRDD function to get the table names, transform x.values with a map function to convert the JSON back to a normal string, and then save each record with Spark SQL.
Is this architecture and design OK for database streaming, and how can I solve the transformation-within-transformation issue?
Regards,
Piyush Kansal
I have a similar use case.
I use NiFi to get the data from RDBMS views and put it into Kafka. I have a topic for each view in the Oracle database, with multiple partitions. Using NiFi, the data is converted into JSON format and put into Kafka.
Is there any requirement to use the same Kafka topic for all table data?
The code below is used to persist the data into Cassandra.
val msg = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, topicsSet)

/* Process records for each RDD */
Holder.log.info("Spark foreach starts")
val data = msg.map(_._2)
data.foreachRDD(rdd => {
  if (rdd.toLocalIterator.nonEmpty) {
    val messageDfRdd = sqlContext.read.json(rdd)
    var data2 = messageDfRdd.map(p => employee(p.getLong(1), p.getString(4), p.getString(0),
      p.getString(2), p.getString(3), p.getString(5)))
    // code to save to Cassandra
  }
})
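To address the transformation-within-transformation issue from the question, a hedged sketch (assuming, as in the question, that the Kafka key carries the table name; the saveAsTable sink is only an illustration):
msg.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // rdd contains (tableName, jsonString) pairs; collect the small set of table names on the driver
    val tableNames = rdd.keys.distinct().collect()
    tableNames.foreach { t =>
      // plain RDD transformations chained on the driver side, no nesting inside another transformation
      val jsonForTable = rdd.filter { case (k, _) => k == t }.values
      val df = sqlContext.read.json(jsonForTable)
      df.write.mode("append").saveAsTable(t)   // or save to Cassandra / any other sink
    }
  }
}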
