How to write DataFrame (built from RDD inside foreach) to Kafka? - apache-spark

I'm trying to write a DataFrame from Spark to Kafka and I couldn't find any solution out there. Can you please show me how to do that?
Here is my current code:
activityStream.foreachRDD { rdd =>
  val activityDF = rdd
    .toDF()
    .selectExpr(
      "timestamp_hour", "referrer", "action",
      "prevPage", "page", "visitor", "product", "inputProps.topic as topic")
  val producerRecord = new ProducerRecord(topicc, activityDF)
  kafkaProducer.send(producerRecord) // <--- this shows an error
}
type mismatch; found: org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.DataFrame] (which expands to) org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]; required: org.apache.kafka.clients.producer.ProducerRecord[Nothing,String]. Error occurred in an application involving default arguments.

Do collect on activityDF to get the records as a local collection (not a Dataset[Row]) and save them to Kafka.
Note that collect returns a collection of records, so you have to iterate over it, e.g.
val activities = activityDF.collect()
// the following is pure Scala and has nothing to do with Spark
activities.foreach { a: Row =>
  val pr: ProducerRecord = // map a to pr
  kafkaProducer.send(pr)
}
Use pattern matching on Row to destructure it to fields/columns, e.g.
activities.foreach { case Row(timestamp_hour, referrer, action, prevPage, page, visitor, product, topic) =>
  // ...transform the fields to a ProducerRecord pr
  kafkaProducer.send(pr)
}
PROTIP: I'd strongly suggest using a case class and transforming the DataFrame (= Dataset[Row]) to a Dataset[YourCaseClass].
See Spark SQL's Row and Kafka's ProducerRecord docs.
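For the case-class route, here is a minimal sketch; the Activity case class, its field names, and the String key/value types of the producer are assumptions on my part, not taken from your code:
import org.apache.kafka.clients.producer.ProducerRecord

// Assumed case class matching the columns selected above
// (define it at top level so an Encoder can be derived)
case class Activity(timestamp_hour: Long, referrer: String, action: String,
                    prevPage: String, page: String, visitor: String,
                    product: String, topic: String)

import spark.implicits._ // assuming your SparkSession is available as `spark`

activityDF.as[Activity].collect().foreach { a =>
  // key/value encoding is just an example; kafkaProducer is assumed to be a KafkaProducer[String, String]
  val pr = new ProducerRecord[String, String](a.topic, a.visitor, a.toString)
  kafkaProducer.send(pr)
}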
As Joe Nate pointed out in the comments:
If you do "collect" before writing to any endpoint, it's going to make all the data aggregate at the driver and then make the driver write it out. 1) Can crash the driver if too much data (2) no parallelism in write.
That's 100% correct. I wished I had said it :)
You may want to use the approach as described in Writing Stream Output to Kafka instead.
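The gist of that approach is to write from the executors instead of collecting to the driver. A hedged sketch of a per-partition producer (the broker address, topic name, and JSON encoding are assumptions, not part of the original code):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

activityDF.toJSON.rdd.foreachPartition { records =>
  // One producer per partition, created on the executor (not serialized from the driver)
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  try {
    records.foreach { json =>
      producer.send(new ProducerRecord[String, String]("activities", json)) // assumed topic name
    }
  } finally {
    producer.close()
  }
}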

Related

Nullability in Spark sql schemas is advisory by default. What is best way to strictly enforce it?

I am working on a simple ETL project which reads CSV files, performs
some modifications on each column, then writes the result out as JSON.
I would like downstream processes which read my results
to be confident that my output conforms to
an agreed schema, but my problem is that even if I define
my input schema with nullable=false for all fields, nulls can sneak
in and corrupt my output files, and there seems to be no (performant) way I can
make Spark enforce 'not null' for my input fields.
This seems to be a feature, as stated below in Spark, The Definitive Guide:
when you define a schema where all columns are declared to not have
null values, Spark will not enforce that and will happily let null
values into that column. The nullable signal is simply to help Spark
SQL optimize for handling that column. If you have null values in
columns that should not have null values, you can get an incorrect
result or see strange exceptions that can be hard to debug.
I have written a little check utility to go through each row of a dataframe and
raise an error if nulls are detected in any of the columns (at any level of
nesting, in the case of fields or subfields like map, struct, or array.)
I am wondering, specifically: DID I RE-INVENT THE WHEEL WITH THIS CHECK UTILITY? Are there any existing libraries or
Spark techniques that would do this for me (ideally in a better way than what I implemented)?
The check utility and a simplified version of my pipeline appear below. As presented, the call to the
check utility is commented out. If you run without the check utility enabled, you will see this result in
/tmp/output.csv:
cat /tmp/output.csv/*
(one + 1),(two + 1)
3,4
"",5
The first value on the second line after the header should be a number, but it is an empty string
(which is how Spark writes out the null, I guess). This output would be problematic for
downstream components that read my ETL job's output: those components just want integers.
Now, I can enable the check by uncommenting the line
//checkNulls(inDf)
When I do this I get an exception that informs me of the invalid null value and prints
out the entirety of the offending row, like this:
java.lang.RuntimeException: found null column value in row: [null,4]
One Possible Alternate Approach Given in Spark/Definitive Guide
Spark, The Definitive Guide mentions the possibility of doing this:
<dataframe>.na.drop()
But this would (AFAIK) silently drop the bad records rather than flagging the bad ones.
I could then do a "set subtract" on the input before and after the drop, but that seems like
a heavy performance hit to find out what is null and what is not. At first glance, I'd
prefer my method.... But I am still wondering if there might be some better way out there.
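For reference, a rough sketch of that before/after comparison using except (just to illustrate the idea; I would expect it to be costly for the reason above):
// Sketch only: identify the rows that na.drop() would discard
val cleaned  = inDf.na.drop()
val rejected = inDf.except(cleaned) // rows containing at least one null

if (rejected.head(1).nonEmpty) {
  throw new RuntimeException(
    s"found rows with null values:\n${rejected.collect().mkString("\n")}")
}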
The complete code is given below. Thanks!
package org
import java.io.PrintWriter
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// before running, do: rm -rf /tmp/out* /tmp/foo*
object SchemaCheckFailsToExcludeInvalidNullValue extends App {
  import NullCheckMethods._

  //val input = "2,3\n\"xxx\",4" // this will be dropped as malformed
  val input = "2,3\n,4" // BUT.. this will be let through

  new PrintWriter("/tmp/foo.csv") { write(input); close }

  lazy val sparkConf = new SparkConf()
    .setAppName("Learn Spark")
    .setMaster("local[*]")
  lazy val sparkSession = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()
  val spark = sparkSession

  val schema = new StructType(
    Array(
      StructField("one", IntegerType, nullable = false),
      StructField("two", IntegerType, nullable = false)
    )
  )

  val inDf: DataFrame =
    spark.
      read.
      option("header", "false").
      option("mode", "dropMalformed").
      schema(schema).
      csv("/tmp/foo.csv")

  //checkNulls(inDf)

  val plusOneDf = inDf.selectExpr("one+1", "two+1")
  plusOneDf.show()

  plusOneDf.
    write.
    option("header", "true").
    csv("/tmp/output.csv")
}
object NullCheckMethods extends Serializable {

  def checkNull(columnValue: Any): Unit = {
    if (columnValue == null)
      throw new RuntimeException("got null")
    columnValue match {
      case item: Seq[_] =>
        item.foreach(checkNull)
      case item: Map[_, _] =>
        item.values.foreach(checkNull)
      case item: Row =>
        item.toSeq.foreach {
          checkNull
        }
      case default =>
        println(
          s"bad object [ $default ] of type: ${default.getClass.getName}")
    }
  }

  def checkNulls(row: Row): Unit = {
    try {
      row.toSeq.foreach {
        checkNull
      }
    } catch {
      case err: Throwable =>
        throw new RuntimeException(
          s"found null column value in row: ${row}")
    }
  }

  def checkNulls(df: DataFrame): Unit = {
    df.foreach { row => checkNulls(row) }
  }
}
You can use the built-in Row method anyNull to split the dataframe and process both splits differently:
val plusOneNoNulls = plusOneDf.filter(!_.anyNull)
val plusOneWithNulls = plusOneDf.filter(_.anyNull)
If you don't plan to have a manual null-handling process, using the built-in DataFrame.na methods is simpler, since they already implement all the usual ways to automatically handle nulls (i.e. drop the rows or fill the nulls with default values).
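For example, a couple of the built-in options, applied to plusOneDf from the snippet above:
// Drop every row that contains a null in any column
val noNulls = plusOneDf.na.drop()

// Or keep all rows and replace nulls with a default value instead
val filled = plusOneDf.na.fill(0)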

Convert a Spark SQL batch source to structured streaming sink

I am trying to convert an org.apache.spark.sql.sources.CreatableRelationProvider into an org.apache.spark.sql.execution.streaming.Sink by simply implementing addBatch(...) so that it calls createRelation(...). However, there is a df.rdd call in createRelation(...), which causes the following error:
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
I was trying to look into how org.apache.spark.sql.execution.streaming.FileStreamSink handles this, since it also needs to get an RDD from the DataFrame in a streaming job. It seems to play the trick of using df.queryExecution.executedPlan.execute() to generate the RDD instead of calling .rdd.
However, things do not seem to be that simple:
It seems the output ordering might need to be taken care of - https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L159
Might be some eager execution concerns? (not sure)
https://issues.apache.org/jira/browse/SPARK-20865
More details of the issue I am running into can be found here
Wondering what would be the idiomatic way to do this conversion?
Dataset.rdd() creates a new plan that just breaks the incremental planning. Because StreamExecution uses the existing plan to collect metrics and update the watermark, we should never create a new plan. Otherwise, metrics and the watermark are updated in the new plan, and StreamExecution cannot retrieve them.
Here is an example of the code in Scala to convert column values in Structured Streaming:
// conversionFunctions is assumed here: one (InternalRow, Int) => Any function per output column
val convertedRows: RDD[Row] = df.queryExecution.toRdd.mapPartitions { iter: Iterator[InternalRow] =>
  iter.map { row =>
    val convertedValues: Array[Any] = new Array(conversionFunctions.length)
    var i = 0
    while (i < conversionFunctions.length) {
      convertedValues(i) = conversionFunctions(i)(row, i)
      i += 1
    }
    Row.fromSeq(convertedValues)
  }
}

How to join dstream and JDBCRDD with checkpointing enabled?

We have a Spark Streaming job with checkpointing enabled. It executes correctly the first time, but throws the exception below when restarted from the checkpoint.
org.apache.spark.SparkException: RDD transformations and actions can
only be invoked by the driver, not inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x) is invalid because
the values transformation and count action cannot be performed inside
of the rdd1.map transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
at org.apache.spark.rdd.RDD.union(RDD.scala:565)
at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
Please suggest any workaround for this issue.
Sample app below:
String URL = "jdbc:oracle:thin:" + USERNAME + "/" + PWD + "@//" + CONNECTION_STRING;
Map<String, String> options = ImmutableMap.of(
    "driver", "oracle.jdbc.driver.OracleDriver",
    "url", URL,
    "dbtable", "READINGS_10K",
    "fetchSize", "10000");
DataFrame OracleDB_DF = sqlContext.load("jdbc", options);
JavaPairRDD<String, Row> OracleDB_RDD = OracleDB_DF.toJavaRDD()
    .mapToPair(x -> new Tuple2(x.getString(0), x));
Dstream.transformToPair(rdd ->
    rdd.mapToPair(record ->
        new Tuple2<>(record.getKey().toString(), record))
        .join(OracleDB_RDD)) // <-- PairRDD.join inside DStream transformation
    .print();
Spark version 1.6, running in yarn cluster mode.
Let me start with the question I'm sure you must've already been asking yourself too.
How big is the OracleDB_RDD?
If it's small enough, it could act as a fact table and be broadcast first. That in turn would make your solution not only work but also be efficient.
(That's why working with Spark SQL 2.0 these days makes this and similar questions obsolete, as that's the sort of optimization the query optimizer does for you.)
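A rough Scala sketch of the broadcast idea (illustrative only; variable names follow the snippet above, the first column is assumed to be the join key, and record.getKey comes from the original lambda):
// Collect the small lookup table to the driver and broadcast it to the executors
val oracleMap = OracleDB_DF.rdd
  .map(row => row.getString(0) -> row)
  .collectAsMap()
val oracleMapBc = sc.broadcast(oracleMap)

// Map-side lookup against the broadcast value inside the DStream (no RDD join needed)
dstream.map { record =>
  val key = record.getKey.toString
  (key, (record, oracleMapBc.value.get(key)))
}.print()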
If it's large, you have to create the DataFrame inside the foreach action (as described in DataFrame and SQL Operations), or create your own DStream that returns an RDD for the join between DStreams (see ConstantInputDStream).
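And a rough sketch of the "recreate the DataFrame per micro-batch" variant (again illustrative only; SQLContext.getOrCreate is the usual way to get a context back after a checkpoint restart, and options is the same JDBC options map as above, as a Scala Map):
import org.apache.spark.sql.SQLContext

dstream.foreachRDD { rdd =>
  // Re-load the JDBC table inside the batch, so no external RDD is captured by the checkpointed DStream code
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  val oracleRDD = sqlContext.read.format("jdbc").options(options).load()
    .rdd
    .map(row => (row.getString(0), row))

  val joined = rdd
    .map(record => (record.getKey.toString, record))
    .join(oracleRDD)

  joined.take(10).foreach(println)
}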

Spark Streaming - how to use reduceByKey within a partition on the Iterator

I am trying to consume a Kafka DirectStream, process the RDDs for each partition, and write the processed values to a DB. When I try to perform reduceByKey (per partition, that is, without the shuffle), I get the following error. Usually on the driver node we can use sc.parallelize(Iterator) to solve this issue, but I would like to solve it in Spark Streaming.
value reduceByKey is not a member of Iterator[((String, String), (Int, Int))]
Is there a way to perform transformations on Iterator within the partition?
myKafkaDS
  .foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val commonIter = rdd.mapPartitionsWithIndex ( (i,iter) => {
      val offset = offsetRanges(i)
      val records = iter.filter(item => {
        (some_filter_condition)
      }).map(r1 => {
        // Some processing
        ((field2, field2), (field3, field4))
      })
      val records.reduceByKey((a,b) => (a._1+b._1, a._2+b._2)) // Getting reduceByKey() is not a member of Iterator
      // Code to write to DB
      Iterator.empty // I just want to store the processed records in DB. So returning empty iterator
    })
  }
Is there a more elegant way to do this(process kafka RDDs for each partition and store them in a DB)?
So... we cannot use Spark transformations within mapPartitionsWithIndex. However, plain Scala collection methods like groupBy and reduce helped me solve this issue, for example as sketched below.
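A minimal sketch of that idea, assuming the ((String, String), (Int, Int)) record shape from the error message (this is my reconstruction, not the poster's actual code):
// Inside mapPartitionsWithIndex: `records` is an Iterator[((String, String), (Int, Int))]
val reduced: Map[(String, String), (Int, Int)] =
  records.toSeq
    .groupBy { case (key, _) => key }
    .map { case (key, group) =>
      key -> group.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    }

// `reduced` can now be written to the DB; the partition function can still return Iterator.empty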
Your records value is an Iterator and not an RDD, hence you cannot invoke reduceByKey on it.
Syntax issues:
1) The reduceByKey logic looks OK; remove the val before the statement (if it is not a typo) and attach reduceByKey() after the map:
.map(r1 => {
  // Some processing
  ((field2, field2), (field3, field4))
}).reduceByKey((a,b) => (a._1+b._1, a._2+b._2))
2) Add iter.next after the end of each iteration.
3) iter.empty is wrongly placed; put it after coming out of mapPartitionsWithIndex().
4) Add an iterator condition for safety:
val commonIter = rdd.mapPartitionsWithIndex((i, iter) =>
  if (i == 0 && iter.hasNext) {
    ....
  } else iter, true)

How to effectively read millions of rows from Cassandra?

I have the hard task of reading millions of rows from a Cassandra table. Actually, this table contains around 40~50 million rows.
The data is actually internal URLs for our system and we need to fire all of them. To fire them, we are using Akka Streams and it has been working pretty well, applying back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as a stream using Akka Streams. We are using phantom-dsl, which provides a publisher for a specific table. But it does not read everything, only a small portion; actually it stops reading after the first 1 million rows.
Reading using Spark by a specific date. Our table is modeled like a time-series table, with year, month, day, minutes... columns. Right now we are selecting by day, so Spark will not fetch too much to be processed at once, but it is a pain to select all those days.
The code is the following:
val cassandraRdd =
  sc
    .cassandraTable("keyspace", "my_table")
    .select("id", "url")
    .where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
Unfortunately I can't iterate over the partitions to get less data; I have to use a collect because it complains that the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] = Http().cachedHostConnectionPool[String](host, port).async

val source =
  Source
    .actorRef[CassandraRow](10000000, OverflowStrategy.fail)
    .map(row => makeUrl(row.getString("id"), row.getString("url")))
    .map(url => HttpRequest(uri = url) -> url)

val ref = Flow[(HttpRequest, String)]
  .via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
  .to(Sink.actorRef(httpHandlerActor, IsDone))
  .runWith(source)

cassandraRdd.collect().foreach { row =>
  ref ! row
}
I would like to know if any of you have experience reading millions of rows to do something other than aggregation and the like.
I have also thought about reading everything and sending it to a Kafka topic, where I would receive it using streaming (Spark or Akka), but the problem would be the same: how to load all that data effectively?
EDIT
For now, I'm running on a cluster with a reasonable amount of memory (100 GB), doing a collect and iterating over it.
Also, this is quite different from getting big data with Spark and analyzing it using things like reduceByKey, aggregateByKey, etc.
I need to fetch and send everything over HTTP =/
So far it works the way I did it, but I'm afraid this data will grow to a point where fetching everything into memory makes no sense.
Streaming this data, fetching in chunks, would be the best solution, but I haven't found a good approach for it yet.
In the end, I'm thinking of using Spark to get all that data, generating a CSV file, and using Akka Stream IO to process it; this way I would avoid keeping a lot of things in memory, since it takes hours to process every million rows.
Well, after spending some time reading, talking with other people and doing tests, the result could be achieved with the following code sample:
val sc = new SparkContext(sparkConf)

val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
  .select("key", "value")
  .as((key: String, value: String) => (key, value))
  .partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
  .cache()

cassandraRdd
  .groupByKey()
  .foreachPartition { partition =>
    partition.foreach { row =>
      implicit val system = ActorSystem()
      implicit val materializer = ActorMaterializer()
      val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")
      val source = Source.fromIterator { () => row._2.toIterator }
      source
        .map { str =>
          myActor ! Count
          str
        }
        .to(Sink.actorRef(myActor, Finish))
        .run()
    }
  }

sc.stop()

class MyActor(system: ActorSystem) extends Actor {
  var count = 0

  def receive = {
    case Count =>
      count = count + 1
    case Finish =>
      println(s"total: $count")
      system.shutdown()
  }
}

case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of partitions and a partitioner using the partitionBy and groupByKey methods.
Use cache to prevent a data shuffle that would make Spark move large amounts of data across nodes, using a lot of IO, etc.
Create the whole actor system with its dependencies, as well as the stream, inside the foreachPartition method. Here is a trade-off: you could have only one ActorSystem, but then you would have to make bad use of .collect as I wrote in the question. By creating everything inside, you keep the ability to run things inside Spark, distributed across your cluster.
Finish each actor system at the end of the iterator using Sink.actorRef with a message to kill it (Finish).
Perhaps this code could be improved even more, but so far I'm happy not to use .collect anymore and to work only inside Spark.
