scala joinWithCassandraTable result to dataframe - apache-spark

I'm using Datastax spark-Cassandra-connector to access some data in Cassandra.
My requirement is to Join an RDD with a Cassandra table, fetch the result and store it in the hive table.
Im using joinWithCassandraTable to join the cassadra table. After the join the resuting RDD looks like below
com.datastax.spark.connector.rdd.CassandraJoinRDD[org.apache.spark.sql.Row,
com.datastax.spark.connector.CassandraRow] =
CassandraJoinRDD[17] at RDD at CassandraRDD.scala:19
I tried below steps to convert to the data frame but none of the approaches is working.
val data=joinWithRDD.map{
case(_, cassandraRow) => Row(cassandraRow.columnValues:_*)
}
sqlContext.createDataFrame(data,schema)
I'm getting below error
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
Can you please help me in converting joinWithCassandraTable to a dataframe?

As I see, you're using dataframe on the left side of the join. Instead of using joinWithCassandraTable that uses RDD API, I recommend to take the Spark Cassandra Connector 2.5.x (2.5.1 is the latest) that has support for join in the Dataframe API, and use it directly. It's really easy, you just need to start your job with --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions to activate this functionality, after that, code is just using normal joins on dataframes:
val parsed = ...some dataframe...
val cassandra = spark.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "stock_info", "keyspace" -> "test"))
.load
// we can use left join to detect what data is incorrect - if we don't have some data in the
// Cassandra, then symbol field will be null, so we can detect such entries, and do something with that
// we can omit the joinType parameter, in that case, we'll process only data that are in the Cassandra
val joined = parsed.join(cassandra, cassandra("symbol") === parsed("ticker"), "left")
.drop("ticker")
Full source code with README is here.

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from kafka, and should use these values ​​as a parameter to process data from a cassandra database. My question is how do I use the data that is in the input dataframe (kafka), as "where" parameters in cassandra "select" without taking the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
.readStream
.format("kafka")
.options(
Map("kafka.bootstrap.servers"-> kafka_bootstrap,
"subscribe" -> kafka_topic,
"startingOffsets"-> "latest",
"fetchOffset.numRetries"-> "5",
"kafka.group.id"-> groupId
))
.load()
I get this error whenever I try to store the dataframe values ​​in a variable to use as a parameter.
This is the method I created to try to convert the data into variables. With that the spark give the error that I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
val messageDS: Dataset[Message] = messageToProcess.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
val mensagem = messageToProcess
mensagem
}
When you need to use data in Kafka to query data in Cassandra, then such operation is a typical join between two datasets - you don't need to call .collect to find entries, you just do the join. And it's quite typical thing - to enrich data in Kafka with data from the external dataset, and Cassandra provides low-latency operations.
Your code could look as following (you'll need to configure so-called DirectJoin, see link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._
val df = spark.readStream.format("kafka")
.options(Map(...)).load()
... decode data in Kafka into columns
val cassdata = spark.read.cassandraFormat("table", "keyspace").load
val joined = df.join(cassdata, cassdata("pk") === df("some_column"))
val processed = ... process joined data
val query = processed.writeStream.....output data somewhere...start()
query.awaitTermination()
I have detailed blog post on how to perform efficient joins with data in Cassandra.
As the error message suggest, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes, see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming dataset, which is not allowed. To achieve this goal you can use a foreachBatch output sink to collect the rows you need at each microbatch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
// Now microBatchDf is no longer a streaming dataframe
// you can check with microBatchDf.isStreaming
val messageDS: Dataset[Message] = microBatchDf.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
// ...
}

Spark Streaming: Read from HBase by received stream keys?

What is the best way to compare received data in Spark Streaming to existing data in HBase?
We receive data from kafka as DStream, and before writing it down to HBase we must scan HBase for data based on received keys from kafka, do some calculation (based on new vs old data per key) and then write down to HBase.
So if I receive record (key, value_new), I must get from HBase (key, value_old), so I can compare value_new vs value_old.
So the logic would be:
Dstream from Kafka -> Query HBase by DStream keys -> Some calculations
-> Write to HBase
My "naïve" approach was to use Phoenix Spark Connector to read and left join to new data based on key as a way to filter out keys not in the current micro-batch. So I would get a DF with (key, value_new, value_old) and from here I can compare inside partition.
JavaInputDStream<ConsumerRecord<String, String>> kafkaDStream = KafkaUtils.createDirectStream(...);
// use foreachRDD in order to use Phoenix DF API
kafkaDStream.foreachRDD((rdd, time) -> {
// Get the singleton instance of SparkSession
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
JavaPairRDD<String, String> keyValueRdd = rdd.mapToPair(record -> new Tuple2<>(record.key(), record.value()));
// TO SLOW FROM HERE
Dataset<Row> oldDataDF = spark
.read()
.format("org.apache.phoenix.spark")
.option("table", PHOENIX_TABLE)
.option("zkUrl", PHOENIX_ZK)
.load()
.withColumnRenamed("JSON", "JSON_OLD")
.withColumnRenamed("KEY_ROW", "KEY_OLD");
Dataset<Row> newDF = toPhoenixTableDF(spark, keyValueRdd); //just a helper method to get RDD to DF (see note bellow)
Dataset<Row> newAndOld = newDF.join(oldDataDF, oldDataDF.col("KEY_OLD").equalTo(newDF.col("KEY_ROW")), "left");
/// do some calcs based on new vs old values and then write to Hbase ...
});
PROBLEM: getting data from HBase based on a list of keys from received DStream RDD using the above approach is too slow for streaming.
What can be a performant way to do so?
Side note:
Method toPhoenixTableDF is just a helper to transform the received RDD to DF:
private static Dataset<Row> toPhoenixTableDF(SparkSession spark, JavaPairRDD<String, String> keyValueRdd) {
JavaRDD<phoenixTableRecord> tmp = keyValueRdd.map(x -> {
phoenixTableRecord record = new phoenixTableRecord();
record.setKEY_ROW(x._1);
record.setJSON(x._2);
return record;
});
return spark.createDataFrame(tmp, phoenixTableRecord.class);
}
The solution is to use the spark hbase connector for batch get and put.
You can find the source code here with good examples.
https://github.com/apache/hbase-connectors/tree/master/spark
As well as in HBase documentation (spark session).
This library uses plain Java/Scala Hbase api, so you have control over operations, but manages for you the connection pool through an hbaseContext object broadcasted to executors, which is really great. It provides simple wrappers for Hbase operations, but, if needed, we can just use its foreach/mapPartition and gain control over logic, while having access to a managed connection.

How to collect a streaming dataset (to a Scala value)?

How can I store a dataframe value to a scala variable ?
I need to store values from the below dataframe (assuming column "timestamp" producing same values) to a variable and later I need to use this variable somewhere
i have tried following
val spark =SparkSession.builder().appName("micro").
enableHiveSupport().config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("spark.sql.streaming.checkpointLocation", "hdfs://dff/apps/hive/warehouse/area.db").
getOrCreate()
val xmlSchema = new StructType().add("id", "string").add("time_xml", "string")
val xmlData = spark.readStream.option("sep", ",").schema(xmlSchema).csv("file:///home/shp/sourcexml")
val xmlDf_temp = xmlData.select($"id",unix_timestamp($"time_xml", "dd/mm/yyyy HH:mm:ss").cast(TimestampType).as("timestamp"))
val collect_time = xmlDf_temp.select($"timestamp").as[String].collect()(0)
its thorwing error saying following:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Is there any way i can store some dataframe values to a variable and use later?
is there any way i can store some dataframe values to a variable and use later ?
That's not possible in Spark Structured Streaming since a streaming query never ends and so it is not possible to express collect.
and later I need to use this variable somewhere
This "later" has to be another streaming query that you could join together and produce a result.

How to do stateless aggregations in spark using Structured Streaming 2.3.0 without using flatMapsGroupWithState?

How to do stateless aggregations in spark using Structured Streaming 2.3.0 without using flatMapsGroupWithState or Dstream API? looking for a more declarative way
Example:
select count(*) from some_view
I want the output to just count whatever records are available in each batch but not aggregate from the previous batch
To do stateless aggregations in spark using Structured Streaming 2.3.0 without using flatMapsGroupWithState or Dstream API, you can use following code-
import spark.implicits._
def countValues = (_: String, it: Iterator[(String, String)]) => it.length
val query =
dataStream
.select(lit("a").as("newKey"), col("value"))
.as[(String, String)]
.groupByKey { case(newKey, _) => newKey }
.mapGroups[Int](countValues)
.writeStream
.format("console")
.start()
Here what we are doing is-
We added one column to our datastream - newKey. We did this so that we can do a groupBy over it, using groupByKey. I have used a literal string "a", but you can use anything. Also, you need to select anyone column from the available columns in datastream. I have selected value column for this purpose, you can select anyone.
We created a mapping function - countValues, to count the values aggregated by groupByKey function by writing it.length.
So, in this way, we can count whatever records are available in each batch but not aggregating from the previous batch.
I hope it helps!

Spark DataTables: where is partitionBy?

A common Spark processing flow we have is something like this:
Loading:
rdd = sqlContext.parquetFile("mydata/")
rdd = rdd.map(lambda row: (row.id,(some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPatitions())
Processing by id (this is why we do the partitionBy above!)
rdd.reduceByKey(....)
rdd.join(...)
However, Spark 1.3 changed sqlContext.parquetFile to return DataFrame instead of RDD, and it no longer has the partitionBy, getNumPartitions, and reduceByKey methods.
What do we do now with partitionBy?
We can replace the loading code with something like
rdd = sqlContext.parquetFile("mydata/").rdd
rdd = rdd.map(lambda row: (row.id,(some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPatitions())
df = rdd.map(lambda ...: Row(...)).toDF(???)
and use groupBy instead of reduceByKey.
Is this the right way?
PS. Yes, I understand that partitionBy is not necessary for groupBy et al. However, without a prior partitionBy, each join, groupBy &c may have to do cross-node operations. I am looking for a way to guarantee that all operations requiring grouping by my key will run local.
It appears that, since version 1.6, repartition(self, numPartitions, *cols) does what I need:
.. versionchanged:: 1.6
Added optional arguments to specify the partitioning columns.
Also made numPartitions optional if partitioning columns are specified.
Since DataFrame provide us an abstraction of Table and Column over RDD, the most convenient way to manipulate DataFrame is to use these abstraction along with the specific table manipulations methods that DataFrame enables us.
On a DataFrame, we could:
transform the table schema with select() \ udf() \ as()
filter rows out by filter() or where()
fire an aggregation through groupBy() and agg()
or other analytic job using sample() \ join() \ union()
persist your result using saveAsTable() \ saveAsParquet() \ insertIntoJDBC()
Please refer to Spark SQL and DataFrame Guide for more details.
Therefore, a common job looks like:
val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
And for your specific requirements, this could look like:
val t = sqlContext.parquetFile()
t.filter().select().groupBy().agg()

Resources