How to effectively read millions of rows from Cassandra? - apache-spark

I have the hard task of reading millions of rows from a Cassandra table. The table currently holds about 40 to 50 million rows.
The data consists of internal URLs for our system, and we need to fire all of them. To fire them we are using Akka Streams, which has been working pretty well, applying back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as a stream using Akka Streams. We are using phantom-dsl, which provides a publisher for a specific table. But it does not read everything, only a small portion: it stops reading after the first 1 million rows.
Reading with Spark by a specific date. Our table is modeled as a time series table, with year, month, day, minutes... columns. Right now we are selecting by day, so Spark does not fetch too much data to process at once, but selecting all those days one by one is a pain.
The code is the following:
val cassandraRdd =
  sc
    .cassandraTable("keyspace", "my_table")
    .select("id", "url")
    .where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
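The day-by-day selection could be driven from a generated list of dates instead of being written out by hand. The following is only a sketch in plain Scala: it uses java.time rather than the Joda-style date in the snippet above, and the names DayRange, days and whereArgs are hypothetical helpers, not part of the Spark Cassandra connector.

```scala
import java.time.LocalDate

object DayRange {
  // All calendar days from `start` to `end`, both inclusive.
  def days(start: LocalDate, end: LocalDate): List[LocalDate] =
    Iterator.iterate(start)(d => d.plusDays(1)).takeWhile(d => !d.isAfter(end)).toList

  // The three bind values for "year = ? and month = ? and day = ?".
  def whereArgs(d: LocalDate): (Int, Int, Int) =
    (d.getYear, d.getMonthValue, d.getDayOfMonth)
}
```

Each tuple returned by DayRange.whereArgs could then be passed to the .where(...) call above, one query per day, so that only one day of rows is in flight at a time.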
Unfortunately I can't iterate over the partitions to get less data at a time; I have to use collect, because otherwise it complains that the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] =
  Http().cachedHostConnectionPool[String](host, port).async
val source =
  Source
    .actorRef[CassandraRow](10000000, OverflowStrategy.fail)
    .map(row => makeUrl(row.getString("id"), row.getString("url")))
    .map(url => HttpRequest(uri = url) -> url)
val ref = Flow[(HttpRequest, String)]
  .via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
  .to(Sink.actorRef(httpHandlerActor, IsDone))
  .runWith(source)
cassandraRdd.collect().foreach { row =>
  ref ! row
}
I would like to know if any of you have experience reading millions of rows in order to do something other than aggregation and the like.
I have also thought about reading everything and sending it to a Kafka topic, which I would then consume using streaming (Spark or Akka), but the problem would be the same: how do I load all that data effectively?
EDIT
For now, I'm running on a cluster with a reasonable amount of memory (100GB), doing a collect and iterating over the result.
Also, this is quite different from loading big data with Spark and analyzing it with things like reduceByKey, aggregateByKey, etc.
I need to fetch and send everything over HTTP =/
So far it is working the way I did it, but I'm afraid the data will keep growing to the point where fetching everything into memory makes no sense.
Streaming this data, fetching it in chunks, would be the best solution, but I haven't found a good approach for this yet.
In the end, I'm thinking of using Spark to get all the data, generating a CSV file and using Akka Stream IO to process it. This way I would avoid keeping a lot of things in memory, since it takes hours to process every million rows.
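The "fetching in chunks" idea can be sketched in plain Scala, independent of Spark or Akka. This is only an illustration under assumptions: Chunked and processInChunks are hypothetical names, and the iterator stands in for whatever lazily yields the rows (Spark's RDD.toLocalIterator, for instance, returns an Iterator and pulls one partition at a time to the driver instead of the whole data set).

```scala
object Chunked {
  // Feeds `rows` to `send` in groups of at most `chunkSize` elements and
  // returns how many rows were handled in total. Only one chunk is
  // materialized at a time, provided `rows` itself is lazy.
  def processInChunks[A](rows: Iterator[A], chunkSize: Int)(send: Seq[A] => Unit): Long = {
    var total = 0L
    rows.grouped(chunkSize).foreach { chunk =>
      send(chunk)
      total += chunk.size
    }
    total
  }
}
```

A call such as Chunked.processInChunks(rowIterator, 1000)(chunk => chunk.foreach(fire)) would hand at most 1000 rows at a time to a hypothetical fire function.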

Well, after spending some time reading, talking with other people and running tests, I found that the result can be achieved with the following code sample:
val sc = new SparkContext(sparkConf)
val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
  .select("key", "value")
  .as((key: String, value: String) => (key, value))
  .partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
  .cache()
cassandraRdd
  .groupByKey()
  .foreachPartition { partition =>
    partition.foreach { row =>
      implicit val system = ActorSystem()
      implicit val materializer = ActorMaterializer()
      val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")
      val source = Source.fromIterator { () => row._2.toIterator }
      source
        .map { str =>
          myActor ! Count
          str
        }
        .to(Sink.actorRef(myActor, Finish))
        .run()
    }
  }
sc.stop()
class MyActor(system: ActorSystem) extends Actor {
  var count = 0
  def receive = {
    case Count =>
      count = count + 1
    case Finish =>
      println(s"total: $count")
      system.shutdown()
  }
}
case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of partitions and a partitioner using the partitionBy and groupByKey methods.
Use cache so that the repartitioned data is kept and the shuffle does not have to be repeated, since a shuffle makes Spark move large amounts of data across nodes, with high IO and so on.
Create the whole actor system with its dependencies, as well as the stream, inside the foreachPartition method. Here is a trade-off: you can have only one ActorSystem, but then you have to make bad use of .collect as I wrote in the question. By creating everything inside, however, you can still run things inside Spark, distributed across your cluster.
Finish each actor system at the end of the iterator by using Sink.actorRef with a message to kill it (Finish).
Perhaps this code could be improved further, but so far I'm happy not to use .collect anymore and to work only inside Spark.

Related

Spark converting dataframe to RDD takes a huge amount of time, lazy execution or real issue?

In my spark application, I am loading data from Solr into a dataframe, running an SQL query on it, and then writing the resulting dataframe to MongoDB.
I am using spark-solr library to read data from Solr and mongo-spark-connector to write results to MongoDB.
The problem is that it is very slow: for datasets as small as 90 rows in an RDD, the Spark job takes around 6 minutes to complete (4 nodes, 96GB RAM, 32 cores each).
I am sure that reading from Solr and writing to MongoDB is not slow because outside Spark they perform very fast.
When I inspect running jobs/stages/tasks on application master UI, it always shows a specific line in this function as taking 99% of the time:
override def exportData(spark: SparkSession, result: DataFrame): Unit = {
  try {
    val mongoWriteConfig = configureWriteConfig
    MongoSpark.save(result.withColumn("resultOrder", monotonically_increasing_id())
      .rdd
      .map(row => {
        implicit val formats: DefaultFormats.type = org.json4s.DefaultFormats
        val rMap = Map(row.getValuesMap(row.schema.fieldNames.filterNot(_.equals("resultOrder"))).toSeq: _*)
        val m = Map[String, Any](
          "queryId" -> queryId,
          "queryIndex" -> opIndex,
          "resultOrder" -> row.getAs[Long]("resultOrder"),
          "result" -> rMap
        )
        Document.parse(Serialization.write(m))
      }), mongoWriteConfig);
  } catch {
    case e: SparkException => handleMongoException(e)
  }
}
The line .rdd is shown to take most of the time to execute. Other stages take a few seconds or less.
I know that converting a dataframe to an rdd is not an inexpensive call but for 90 rows it should not take this long. My local standalone spark instance can do it in a few seconds.
I understand that Spark executes transformations lazily. Does that mean that the operations before the .rdd call are what take a long time, and it's just a display issue in the application master UI? Or is it really the DataFrame-to-RDD conversion that takes too long? What could cause this?
By the way, the SQL queries run on the dataframe are pretty simple ones, just a single group by etc.
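As a loose analogy in plain Scala (this is not Spark, and it does not diagnose the job above), a lazy pipeline does no work where it is defined; the cost shows up at the point where the result is finally consumed. LazyDemo and its members are hypothetical names used only for this illustration.

```scala
object LazyDemo {
  // Counts how many elements the mapping step has actually processed.
  var evaluated = 0

  // Builds a lazy pipeline; the map function does not run until
  // the returned iterator is consumed.
  def pipeline(input: Seq[Int]): Iterator[Int] =
    input.iterator.map { x =>
      evaluated += 1
      x * 2
    }
}
```

Calling LazyDemo.pipeline(Seq(1, 2, 3)) leaves evaluated at 0; only a consuming call such as toList runs the mapping step, so any timing would be attributed to that later call.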

Convert a Spark SQL batch source to structured streaming sink

Trying to convert an org.apache.spark.sql.sources.CreatableRelationProvider into a org.apache.spark.sql.execution.streaming.Sink by simply implementing addBatch(...) which calls the createRelation(...) but there is a df.rdd in the createRelation(...), which causes the following error:
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
I was looking into how org.apache.spark.sql.execution.streaming.FileStreamSink, which also needs to get an RDD from a dataframe in a streaming job, handles this. It seems to use the trick of calling df.queryExecution.executedPlan.execute() to generate the RDD instead of calling .rdd.
However, things do not seem to be that simple:
It seems the output ordering might need to be taken care of - https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L159
Might be some eager execution concerns? (not sure)
https://issues.apache.org/jira/browse/SPARK-20865
More details of the issue I am running into can be found here
Wondering what would be the idiomatic way to do this conversion?
Dataset.rdd() creates a new plan, which breaks the incremental planning. Because StreamExecution uses the existing plan to collect metrics and update the watermark, we should never create a new plan. Otherwise, metrics and the watermark are updated in the new plan, and StreamExecution cannot retrieve them.
Here is an example of the code in Scala to convert column values in Structured Streaming:
val convertedRows: RDD[Row] = df.queryExecution.toRdd.mapPartitions { iter: Iterator[InternalRow] =>
  iter.map { row =>
    val convertedValues: Array[Any] = new Array(conversionFunctions.length)
    var i = 0
    while (i < conversionFunctions.length) {
      convertedValues(i) = conversionFunctions(i)(row, i)
      i += 1
    }
    Row.fromSeq(convertedValues)
  }
}

lag function in Spark structured streaming

I am using Spark 2.3 Structured Streaming and trying to use the 'lag' function. However, it looks like lag is not supported in Structured Streaming.
val output = spark.sql("SELECT temperature, time, lag(temperature, 1) OVER (ORDER BY time) AS PrevTemp FROM InputTable")
Get this error:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets; line 1 pos 0;
Is there an alternate way to achieve this 'lag' functionality with structured streaming?
Thanks!
As far as I know, there isn't.
You could probably work around it with mapGroupsWithState, for example:
case class PayLoad(event_time: java.sql.Timestamp, data: String)
def mappingFunction(key: java.sql.Timestamp, values: Iterator[PayLoad], state: GroupState[PayLoad]): PayLoad = {
  ??? // Work with values iterator
}
val temperature: DataFrame = ???
temperature
  .withColumn("event_time", org.apache.spark.sql.functions.current_timestamp())
  .as[PayLoad]
  .groupByKey(_.event_time)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(mappingFunction)
You don't need to keep state, but this way you have access to the values iterator and can implement whatever logic you need on it.
Keep in mind that in this case all the data of a micro batch goes to one partition, which with a huge payload may lead to huge latencies or even OOM (the same applies to OVER (ORDER BY time)).
Hope it helps.
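The logic that would sit inside such a mapping function can be sketched in plain Scala: pair each reading with the one before it, and carry the last reading over to the next batch. This is only an illustration under assumptions; Lag, Reading, WithPrev and withLag are hypothetical names, and in Spark the carried value would be what gets stored in the GroupState.

```scala
object Lag {
  final case class Reading(time: Long, temperature: Double)
  final case class WithPrev(time: Long, temperature: Double, prevTemp: Option[Double])

  // Pairs every reading in `batch` (after sorting by time) with the
  // temperature that came before it. `last` is the final reading of the
  // previous batch, if any; the returned Option is the reading to carry
  // into the next call.
  def withLag(last: Option[Reading], batch: Seq[Reading]): (Seq[WithPrev], Option[Reading]) = {
    val sorted = batch.sortBy(_.time)
    val prevs = last.map(_.temperature) +: sorted.map(r => Option(r.temperature))
    val out = sorted.zip(prevs).map { case (r, p) => WithPrev(r.time, r.temperature, p) }
    (out, sorted.lastOption.orElse(last))
  }
}
```

The first reading ever seen gets prevTemp = None, which mirrors the NULL that SQL's lag returns for the first row.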

How to write DataFrame (built from RDD inside foreach) to Kafka?

I'm trying to write a DataFrame from Spark to Kafka and I couldn't find any solution out there. Can you please show me how to do that?
Here is my current code:
activityStream.foreachRDD { rdd =>
  val activityDF = rdd
    .toDF()
    .selectExpr(
      "timestamp_hour", "referrer", "action",
      "prevPage", "page", "visitor", "product", "inputProps.topic as topic")
  val producerRecord = new ProducerRecord(topicc, activityDF)
  kafkaProducer.send(producerRecord) // <--- this shows an error
}
type mismatch; found : org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.DataFrame] (which expands to) org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] required: org.apache.kafka.clients.producer.ProducerRecord[Nothing,String] Error occurred in an application involving default arguments.
Do collect on the activityDF to get the records (not Dataset[Row]) and save them to Kafka.
Note that you'll end up with a collection of records after collect so you probably have to iterate over it, e.g.
val activities = activityDF.collect()
// the following is pure Scala and has nothing to do with Spark
activities.foreach { a: Row =>
  val pr: ProducerRecord = // map a to pr
  kafkaProducer.send(pr)
}
Use pattern matching on Row to destructure it to fields/columns, e.g.
activities.foreach { case Row(timestamp_hour, referrer, action, prevPage, page, visitor, product, topic) =>
  // ...transform a to ProducerRecord
  kafkaProducer.send(pr)
}
PROTIP: I'd strongly suggest using a case class and transform DataFrame (= Dataset[Row]) to Dataset[YourCaseClass].
See Spark SQL's Row and Kafka's ProducerRecord docs.
As Joe Nate pointed out in the comments:
If you do "collect" before writing to any endpoint, it's going to make all the data aggregate at the driver and then make the driver write it out. (1) It can crash the driver if there is too much data, and (2) there is no parallelism in the write.
That's 100% correct. I wished I had said it :)
You may want to use the approach as described in Writing Stream Output to Kafka instead.

In Spark Streaming, how to reload a non-stream lookup RDD after n batches

Suppose I have a streaming context that performs a lot of steps, and at the end each micro batch looks up or joins against a preloaded RDD. I have to refresh that preloaded RDD every 12 hours. How can I do this? To my understanding, anything I do that is not related to the streaming context is not replayed, so how do I get this called from one of the streaming RDDs? I need to make only one call, no matter how many partitions the DStream has.
This is possible by re-creating the external RDD at the time it needs to be reloaded. It requires defining a mutable variable to hold the RDD reference that's active at a given moment in time. Within the dstream.foreachRDD we can then check for the moment when the RDD reference needs to be refreshed.
This is an example of what that would look like:
val stream:DStream[Int] = ??? //let's say that we have some DStream of Ints
// Some external data as an RDD of (x,x)
def externalData():RDD[(Int,Int)] = sparkContext.textFile(dataFile)
  .flatMap{line => try { Some((line.toInt, line.toInt)) } catch {case ex:Throwable => None}}
  .cache()
// this mutable var will hold the reference to the external data RDD
var cache:RDD[(Int,Int)] = externalData()
// force materialization - useful for experimenting, not needed in reality
cache.count()
// a var to count iterations -- use to trigger the reload in this example
var tick = 1
// reload frequency
val ReloadFrequency = 5
stream.foreachRDD{ rdd =>
  if (tick == 0) { // will reload the RDD every 5 iterations
    // unpersist the previous RDD, otherwise it will linger in memory, taking up resources.
    cache.unpersist(false)
    // generate a new RDD
    cache = externalData()
  }
  // join the DStream RDD with our reference data, do something with it...
  val matches = rdd.keyBy(identity).join(cache).count()
  updateData(dataFile, (matches + 1).toInt) // so I'm adding data to the static file in order to see when the new records become alive
  tick = (tick + 1) % ReloadFrequency
}
streaming.start
Before arriving at this solution, I looked into playing with the persist flag of the RDD, but it didn't work as expected. It looks like unpersist() does not force re-materialization of the RDD when it is used again.
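Since the question asks for a refresh every 12 hours rather than every n batches, the tick counter could be swapped for a time check. The following is a minimal sketch in plain Scala under assumptions: Reload and Refreshable are hypothetical names, the holder is meant to live on the driver and be queried inside foreachRDD, and for an RDD the load function would also have to unpersist the previous one, as in the example above.

```scala
object Reload {
  // Holds a value produced by `load` and reloads it once `ttlMillis`
  // have elapsed since the last load. `clock` is injectable so that the
  // behaviour can be checked without waiting.
  final class Refreshable[A](load: () => A, ttlMillis: Long,
                             clock: () => Long = () => System.currentTimeMillis()) {
    private var loadedAt: Long = clock()
    private var value: A = load()

    // Returns the current value, reloading it first if it is older than the TTL.
    def get(): A = {
      val now = clock()
      if (now - loadedAt >= ttlMillis) {
        value = load()
        loadedAt = now
      }
      value
    }
  }
}
```

With a TTL of 12 * 60 * 60 * 1000L milliseconds, calling get() once per micro batch would trigger at most one reload per 12 hours, regardless of how many partitions the batch has.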
