add HBase Timestamp using Phoenix-Spark API - apache-spark

How do I add an HBase timestamp using Phoenix-Spark, similar to the HBase API:
Put(rowkey, timestamp.getMillis)
This is my code:
val rdd = processedRdd.map(r => Row.fromSeq(r))
val dataframe = sqlContext.createDataFrame(rdd, schema)
dataframe.save("org.apache.phoenix.spark", SaveMode.Overwrite,
Map("table" -> HTABLE, "zkUrl" -> zkQuorum))

You can use the following:
dataframe.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")
  .option("table", "HTable")
  .option("zkUrl", "xxxx:0000")
  .save()

This feature is currently not supported in Phoenix. You may need to use the HBase API instead of Phoenix.
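For reference, a minimal sketch of that HBase-API route (the column family, qualifier and DataFrame column names below are hypothetical): each Put is built with an explicit timestamp, which is exactly what the Phoenix-Spark writer does not expose.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Write each partition with the plain HBase client so the cell timestamp can be set explicitly.
dataframe.rdd.foreachPartition { rows =>
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("HTABLE"))
  rows.foreach { row =>
    val ts = row.getAs[Long]("EVENT_TIME")                     // hypothetical timestamp column
    val put = new Put(Bytes.toBytes(row.getAs[String]("ROWKEY")), ts)
    put.addColumn(Bytes.toBytes("0"), Bytes.toBytes("VALUE"),  // hypothetical family/qualifier
      ts, Bytes.toBytes(row.getAs[String]("VALUE")))
    table.put(put)
  }
  table.close()
  conn.close()
}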

Related

Spark Streaming: Read from HBase by received stream keys?

What is the best way to compare received data in Spark Streaming to existing data in HBase?
We receive data from Kafka as a DStream, and before writing it to HBase we must scan HBase for data based on the keys received from Kafka, do some calculation (based on new vs old data per key) and then write to HBase.
So if I receive record (key, value_new), I must get from HBase (key, value_old), so I can compare value_new vs value_old.
So the logic would be:
Dstream from Kafka -> Query HBase by DStream keys -> Some calculations
-> Write to HBase
My "naïve" approach was to use Phoenix Spark Connector to read and left join to new data based on key as a way to filter out keys not in the current micro-batch. So I would get a DF with (key, value_new, value_old) and from here I can compare inside partition.
JavaInputDStream<ConsumerRecord<String, String>> kafkaDStream = KafkaUtils.createDirectStream(...);

// use foreachRDD in order to use the Phoenix DataFrame API
kafkaDStream.foreachRDD((rdd, time) -> {
    // Get the singleton instance of SparkSession
    SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());

    JavaPairRDD<String, String> keyValueRdd = rdd.mapToPair(record -> new Tuple2<>(record.key(), record.value()));

    // TOO SLOW FROM HERE
    Dataset<Row> oldDataDF = spark
            .read()
            .format("org.apache.phoenix.spark")
            .option("table", PHOENIX_TABLE)
            .option("zkUrl", PHOENIX_ZK)
            .load()
            .withColumnRenamed("JSON", "JSON_OLD")
            .withColumnRenamed("KEY_ROW", "KEY_OLD");

    Dataset<Row> newDF = toPhoenixTableDF(spark, keyValueRdd); // just a helper method to turn the RDD into a DF (see note below)

    Dataset<Row> newAndOld = newDF.join(oldDataDF, oldDataDF.col("KEY_OLD").equalTo(newDF.col("KEY_ROW")), "left");

    // do some calcs based on new vs old values and then write to HBase ...
});
PROBLEM: getting data from HBase based on a list of keys from received DStream RDD using the above approach is too slow for streaming.
What can be a performant way to do so?
Side note:
Method toPhoenixTableDF is just a helper to transform the received RDD into a DF:
private static Dataset<Row> toPhoenixTableDF(SparkSession spark, JavaPairRDD<String, String> keyValueRdd) {
    JavaRDD<phoenixTableRecord> tmp = keyValueRdd.map(x -> {
        phoenixTableRecord record = new phoenixTableRecord();
        record.setKEY_ROW(x._1);
        record.setJSON(x._2);
        return record;
    });
    return spark.createDataFrame(tmp, phoenixTableRecord.class);
}
The solution is to use the Spark HBase connector for batch get and put.
You can find the source code, with good examples, here:
https://github.com/apache/hbase-connectors/tree/master/spark
as well as in the HBase documentation (the Spark chapter).
This library uses the plain Java/Scala HBase API, so you keep control over the operations, but it manages the connection pool for you through an HBaseContext object broadcast to the executors, which is really great. It provides simple wrappers for HBase operations, but, if needed, you can just use its foreachPartition/mapPartitions variants and keep full control over the logic while still having access to a managed connection.
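A rough Scala sketch of what that looks like with the connector's HBaseContext (the table name, column family, qualifier, batch size and the mergedRdd variable are assumptions; keyValueRdd is taken to be an RDD of (key, newJson) pairs):
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Get, Put, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

// Created once, outside foreachRDD; picks up hbase-site.xml (or set hbase.zookeeper.quorum on the conf).
val hbaseContext = new HBaseContext(spark.sparkContext, HBaseConfiguration.create())

// Batch-get the old JSON for the keys of the current micro-batch (multi-gets of 100 rows each).
val oldData = hbaseContext.bulkGet[(String, String), (String, String)](
  TableName.valueOf("PHOENIX_TABLE"),
  100,
  keyValueRdd,
  kv => new Get(Bytes.toBytes(kv._1)),
  (result: Result) =>
    (Bytes.toString(result.getRow),                             // null for keys that do not exist yet
     Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("JSON")))))

// ... compare new vs old per key, then batch-put the merged values back.
hbaseContext.bulkPut[(String, String)](
  mergedRdd,
  TableName.valueOf("PHOENIX_TABLE"),
  kv => new Put(Bytes.toBytes(kv._1))
    .addColumn(Bytes.toBytes("0"), Bytes.toBytes("JSON"), Bytes.toBytes(kv._2)))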

scala joinWithCassandraTable result to dataframe

I'm using Datastax spark-Cassandra-connector to access some data in Cassandra.
My requirement is to Join an RDD with a Cassandra table, fetch the result and store it in the hive table.
I'm using joinWithCassandraTable to join the Cassandra table. After the join, the resulting RDD looks like below:
com.datastax.spark.connector.rdd.CassandraJoinRDD[org.apache.spark.sql.Row,
com.datastax.spark.connector.CassandraRow] =
CassandraJoinRDD[17] at RDD at CassandraRDD.scala:19
I tried the steps below to convert it to a DataFrame, but none of the approaches is working.
val data = joinWithRDD.map {
  case (_, cassandraRow) => Row(cassandraRow.columnValues: _*)
}
sqlContext.createDataFrame(data, schema)
I'm getting the error below:
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
Can you please help me convert the joinWithCassandraTable result to a DataFrame?
As I see it, you're using a dataframe on the left side of the join. Instead of joinWithCassandraTable, which uses the RDD API, I recommend taking Spark Cassandra Connector 2.5.x (2.5.1 is the latest), which supports joins in the Dataframe API, and using it directly. It's really easy: you just need to start your job with --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions to activate this functionality, and after that the code just uses normal joins on dataframes:
val parsed = ...some dataframe...
val cassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "stock_info", "keyspace" -> "test"))
  .load
// we can use a left join to detect which data is incorrect - if we don't have some data in
// Cassandra, then the symbol field will be null, so we can detect such entries and do something with them
// we can omit the joinType parameter; in that case we'll process only data that is in Cassandra
val joined = parsed.join(cassandra, cassandra("symbol") === parsed("ticker"), "left")
  .drop("ticker")
Full source code with README is here.
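As an optional sanity check (a sketch, not part of the original answer): when the extensions are active and the join key is the table's partition key, the physical plan should show a Cassandra Direct Join node instead of a full scan of stock_info.
// Inspect the physical plan; with CassandraSparkExtensions enabled the join on the
// partition key is rewritten into per-key lookups ("Cassandra Direct Join").
joined.explain()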

How to do stateless aggregations in Spark using Structured Streaming 2.3.0 without using flatMapGroupsWithState?

How do I do stateless aggregations in Spark using Structured Streaming 2.3.0, without using flatMapGroupsWithState or the DStream API? I'm looking for a more declarative way.
Example:
select count(*) from some_view
I want the output to just count whatever records are available in each batch, not aggregate over previous batches.
To do stateless aggregations in Spark with Structured Streaming 2.3.0, without flatMapGroupsWithState or the DStream API, you can use the following code:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

// Count the values in each group; the key itself is ignored.
def countValues = (_: String, it: Iterator[(String, String)]) => it.length

val query =
  dataStream
    .select(lit("a").as("newKey"), col("value"))
    .as[(String, String)]
    .groupByKey { case (newKey, _) => newKey }
    .mapGroups[Int](countValues)
    .writeStream
    .format("console")
    .start()
Here is what we are doing:
We add one column to our dataStream - newKey - so that we can group over it using groupByKey. I have used the literal string "a", but you can use anything. You also need to select one of the columns available in the dataStream; I have selected the value column for this purpose, but you can pick any one.
We create a mapping function, countValues, which counts the values gathered by groupByKey simply with it.length.
In this way we count whatever records are available in each batch, without aggregating anything from previous batches.
I hope it helps!

Spark SQL - Custom Datatype UUID

I am trying to convert a column in the Dataset from varchar to UUID using a custom datatype in Spark SQL, but I see the conversion is not happening. Please let me know if I am missing anything here.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val secdf = sc.parallelize(Array(
  ("85d8b889-c793-4f23-93e9-ea18db640039", "Revenue"),
  ("85d8b889-c793-4f23-93e9-ea18db640038", "Income:123213"))).toDF("id", "report")

val metadataBuilder = new MetadataBuilder()
metadataBuilder.putString("database.column.type", "uuid")
metadataBuilder.putLong("jdbc.type", java.sql.Types.OTHER)
val metadata = metadataBuilder.build()
val secReportDF = secdf.withColumn("id", col("id").as("id", metadata))
As a workaround, since we are not able to cast to UUID in Spark SQL, I added the property stringtype=unspecified to the Postgres JDBC connection, which solved my issue inserting UUIDs through Spark JDBC.
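A minimal sketch of that workaround (the database URL, table name and credentials below are hypothetical): with stringtype=unspecified the PostgreSQL JDBC driver sends string parameters as untyped values, so the server casts the id column to uuid on insert.
// Append the dataframe over JDBC; the uuid cast happens server-side thanks to stringtype=unspecified.
secReportDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb?stringtype=unspecified")
  .option("dbtable", "sec_report")
  .option("user", "postgres")
  .option("password", "postgres")
  .mode("append")
  .save()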

Spark dataframe returning only structure when connected to Phoenix query server

I am connecting to HBase (ver 1.2) via the Phoenix (4.11) query server from Spark 2.2.0, but the dataframe returns only the table structure with empty rows, though data is present in the table.
Here is the code I am using to connect to queryserver.
// --jars phoenix-4.11.0-HBase-1.2-thin-client.jar
val prop = new java.util.Properties
prop.setProperty("driver", "org.apache.phoenix.queryserver.client.Driver")
val url = "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF"
val d1 = spark.sqlContext.read.jdbc(url,"TABLE1",prop)
d1.show()
Can anyone please help me solve this issue? Thanks in advance.
If you are using Spark 2.2, the better approach would be to load the data directly via Phoenix as a dataframe. This way you provide only the ZooKeeper URL, and you can supply a predicate so that you load only the data required rather than the entire table.
import org.apache.phoenix.spark._
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

val configuration = new Configuration()
configuration.set("hbase.zookeeper.quorum", "localhost:2181")

val spark = SparkSession.builder().master("local").enableHiveSupport().getOrCreate()
val df = spark.sqlContext.phoenixTableAsDataFrame("TABLE1", Seq("COL1", "COL2"),
  predicate = Some("\"COL1\" = 1"), conf = configuration)
Read this for more info on getting a table as an RDD and saving DataFrames and RDDs.
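For the save path mentioned above, a small sketch (assuming the same imports and that the target is the same TABLE1): the org.apache.phoenix.spark import also adds a saveToPhoenix method to DataFrames.
// Write the (possibly transformed) dataframe back to Phoenix via the same ZooKeeper quorum.
df.saveToPhoenix(Map("table" -> "TABLE1", "zkUrl" -> "localhost:2181"))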
