Connecting to HBase through Spark Phoenix Connector - apache-spark

I am trying to load an HBase table through the Spark SQL JDBC connector.
I am able to get the schema of the table:
val port = s"${configuration.get(ZOOKEEPER_CLIENT_PORT, "2181")}"
val znode = s"${configuration.get(ZOOKEEPER_ZNODE_PARENT, "/hbase")}"
val zkUrl = s"${configuration.get(ZOOKEEPER_QUORUM, "localhost")}"
val url = s"jdbc:phoenix:$zkUrl:$port:$znode"
val props = new Properties()
val table ="SOME_Metrics_Test"
props.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
val df = spark.read.jdbc(url, getEscapedFullTableName(table), props)
If I do df.printSchema I can see the schema of the table:
scala> df.printSchema
root
|-- PK: string (nullable = false)
|-- status: string (nullable = true)
|-- other_Status: string (nullable = true)
But when I do df.show I get this error:
org.apache.phoenix.schema.TableNotFoundException: ERROR 1012 (42M03): Table undefined. tableName=SOME_Metrics_Test
at org.apache.phoenix.query.ConnectionQueryServicesImpl.getAllTableRegions(ConnectionQueryServicesImpl.java:542)
at org.apache.phoenix.iterate.BaseResultIterators.getParallelScans(BaseResultIterators.java:480)
Any idea why this error occurs and what I can do to resolve it?
When starting the Spark shell I added phoenix-4.7.0-HBase-1.1-client-spark.jar and hbase-site.xml to the spark-shell command.

Try this, using the phoenix-spark data source instead of plain JDBC:
val phoenixDF = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "my_table")
  .option("zkUrl", "0.0.0.0:2181")
  .load()
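The phoenix-spark jar and hbase-site.xml still need to be on the spark-shell classpath, as in the question. Below is a minimal sketch of what this could look like for the table from the question, assuming the same ZooKeeper settings that were read from the configuration; the double quotes around the table name are only needed if it was created as a case-sensitive identifier:
// Sketch only: the quorum/port/znode values mirror the defaults used in the question.
val quorum = "localhost"   // ZOOKEEPER_QUORUM
val port   = "2181"        // ZOOKEEPER_CLIENT_PORT
val znode  = "/hbase"      // ZOOKEEPER_ZNODE_PARENT
val zkUrl  = s"$quorum:$port:$znode"

val metricsDF = spark.read
  .format("org.apache.phoenix.spark")
  // Phoenix upper-cases unquoted identifiers, so a mixed-case table name
  // usually has to be double-quoted to be found.
  .option("table", "\"SOME_Metrics_Test\"")
  .option("zkUrl", zkUrl)
  .load()

metricsDF.show()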

Related

Spark Cassandra Write UDT With Case-Sensitive Names Fails

A Spark connector write fails with a java.lang.IllegalArgumentException: udtId is not a field defined in this definition error when using case-sensitive field names.
I need the fields in the Cassandra table to maintain case, so I have used quotes to create them.
My Cassandra schema:
CREATE TYPE my_keyspace.my_udt (
"udtId" text,
"udtValue" text
);
CREATE TABLE my_keyspace.my_table (
"id" text PRIMARY KEY,
"someCol" text,
"udtCol" list<frozen<my_udt>>
);
My Spark DataFrame schema is
root
|-- id: string (nullable = true)
|-- someCol: string (nullable = true)
|-- udtCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- udtId: string (nullable = true)
| | |-- udtValue: string (nullable = true)
Are there any other options to get this write to work, other than defining my UDT with lowercase names? Making them lowercase would force me to invoke case-management code everywhere this is used, and I'd like to avoid that.
Because I couldn't write successfully, I haven't tried reading yet. Is this an issue with reads as well?
You need to upgrade to Spark Cassandra Connector 2.5.0. I can't find the specific commit that fixes it or the specific Jira that mentions it; I suspect it was fixed in the DataStax version first and then released as part of the merge announced here.
Here is how it works in SCC 2.5.0 + Spark 2.4.6, while it fails with SCC 2.4.2 + Spark 2.4.6:
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> val data = spark.read.cassandraFormat("my_table", "test").load()
data: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> val data2 = data.withColumn("id", concat(col("id"), lit("222")))
data2: org.apache.spark.sql.DataFrame = [id: string, someCol: string ... 1 more field]
scala> data2.write.cassandraFormat("my_table", "test").mode("append").save()
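For completeness, here is a hedged sketch of pulling in the upgraded connector and writing a freshly built row with the case-sensitive UDT fields. The package coordinate assumes a Scala 2.11 build of Spark 2.4, the literal values are made up, and the struct-to-UDT mapping by field name is exactly what the 2.5.0 upgrade is expected to handle:
// Launch the shell with the 2.5.0 connector (pick the artifact matching your Scala version):
// spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0

import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.functions.{array, lit, struct}
import spark.implicits._

// The struct field names must match the quoted Cassandra identifiers exactly.
val fresh = Seq(("row-3", "some value")).toDF("id", "someCol")
  .withColumn("udtCol", array(struct(lit("u1").as("udtId"), lit("v1").as("udtValue"))))

fresh.write.cassandraFormat("my_table", "test").mode("append").save()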

error: overloaded method value createDataFrame

I tried to create an Apache Spark DataFrame:
val valuesCol = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
valuesCol: Seq[(String, String)] = List((Male,2019-09-06), (Female,2019-09-06), (Male,2019-09-07))
Schema
val someSchema = List(StructField("sex", StringType, true),StructField("date", DateType, true))
someSchema: List[org.apache.spark.sql.types.StructField] = List(StructField(sex,StringType,true), StructField(date,DateType,true))
It does not work
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
I get this error:
<console>:30: error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[(String, String)], org.apache.spark.sql.types.StructType)
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
Should I change date formatting in valuesCol? What actually causes this error?
With import spark.implicits._ you can convert a Seq into a DataFrame in place:
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF() // <--- Here
Explicitly setting column names:
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
For the desired schema, you can either cast the column or use a different element type:
//Cast
Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
.select($"sex", $"date".cast(DateType))
.printSchema()
//Types
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
Seq(
("Male", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Female", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Male", new java.sql.Date(format.parse("2019-09-07").getTime)))
.toDF("sex", "date")
.printSchema()
//Output
root
|-- sex: string (nullable = true)
|-- date: date (nullable = true)
Regarding your question: since your RDD's element type is known, Spark creates the schema from it automatically.
val rdd: RDD[(String, String)] = spark.sparkContext.parallelize(valuesCol)
spark.createDataFrame(rdd)
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
You can also specify your valuesCol as a Seq of Row instead of a Seq of tuples:
val valuesCol = Seq(
  Row("Male", "2019-09-06"),
  Row("Female", "2019-09-06"),
  Row("Male", "2019-09-07"))

Can't read CSV string using PySpark

The scenario is: EventHub -> Azure Databricks (using PySpark)
File format: CSV (quoted, pipe-delimited, custom schema)
I am trying to read CSV strings coming from Event Hub. Spark successfully creates the dataframe with the proper schema, but the dataframe ends up empty after every message.
I managed to do some tests outside the streaming environment, and when getting the data from a file all goes well, but it fails when the data comes from a string.
So I found some links to help me on this, but none worked:
can-i-read-a-csv-represented-as-a-string-into-apache-spark-using-spark-csv?rq=1
Pyspark - converting json string to DataFrame
Right now I have the code below:
schema = StructType([StructField("Decisao",StringType(),True), StructField("PedidoID",StringType(),True), StructField("De_LastUpdated",StringType(),True)])
body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])
df = spark.read \
.option("header", "true") \
.option("mode","FAILFAST") \
.option("delimiter","|") \
.schema(schema) \
.csv(csvData)
df.show()
Is that even possible to do with CSV files?
You can construct the schema like this via Row and a split on the | delimiter:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Row
body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])
schemaDF = csvData\
.map(lambda x: x.split("|"))\
.map(lambda x: Row(x[0],\
x[1],\
x[2],\
x[3],\
x[4]))\
.toDF(["Decisao", "PedidoID", "De_LastUpdated", "col4", "col5"])
for i in schemaDF.take(1): print(i)
Row(Decisao='DECISAO', PedidoID='PEDIDOID', De_LastUpdated='DE_LASTUPDATED\r\n"asdasdas"', col4='"1015905177"', col5='"sdfgsfgd"')
schemaDF.printSchema()
root
|-- Decisao: string (nullable = true)
|-- PedidoID: string (nullable = true)
|-- De_LastUpdated: string (nullable = true)
|-- col4: string (nullable = true)
|-- col5: string (nullable = true)

How to convert dataframe datatypes to String?

I have a Hive table with Date and Timestamp data types. I am creating a DataFrame using the Java code below:
SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("SAMPLE_APP");
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table("testdb.tbl1");
Dataframe schema:
df.printSchema
root
|-- c_date: date (nullable = true)
|-- c_timestamp: timestamp (nullable = true)
I want to convert these columns to String. How can I achieve this?
I need this because of this issue: Spark csv data validation failed for date and timestamp data types of Hive
You can do the following:
df.withColumn("c_date", df.col("c_date").cast(StringType))
In Scala, we generally cast data types like this:
df.select($"date".cast(StringType).as("new_date"))

Spark Exception when converting a MySQL table to parquet

I'm trying to convert a remote MySQL table to a parquet file using Spark 1.6.2.
The process runs for 10 minutes, filling up memory, then starts emitting these messages:
WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;#dac44da,BlockManagerId(driver, localhost, 46158))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at the end fails with this error:
ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriverActorSystem-scheduler-1] shutting down ActorSystem [sparkDriverActorSystem]
java.lang.OutOfMemoryError: GC overhead limit exceeded
I'm running it in a spark-shell with these commands:
spark-shell --packages mysql:mysql-connector-java:5.1.26,org.slf4j:slf4j-simple:1.7.21 --driver-memory 12G
val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://.../table").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "...").option("user", "...").option("password", "...").load()
dataframe_mysql.saveAsParquetFile("name.parquet")
I have limited the max executor memory to 12G. Is there a way to force writing the parquet file in "small" chunks, freeing memory?
It seems the problem is that you have no partitioning defined when you read your data with the JDBC connector.
Reading from JDBC isn't distributed by default, so to enable distribution you have to set manual partitioning. You need a column which is a good partitioning key, and you have to know its distribution up front.
This is apparently what your data looks like:
root
|-- id: long (nullable = false)
|-- order_year: string (nullable = false)
|-- order_number: string (nullable = false)
|-- row_number: integer (nullable = false)
|-- product_code: string (nullable = false)
|-- name: string (nullable = false)
|-- quantity: integer (nullable = false)
|-- price: double (nullable = false)
|-- price_vat: double (nullable = false)
|-- created_at: timestamp (nullable = true)
|-- updated_at: timestamp (nullable = true)
order_year seems like a good candidate (you appear to have ~20 years of data, according to your comments).
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = ???
val driver: String = ???
val connectionUrl: String = ???
val query: String = ???
val userName: String = ???
val password: String = ???
// Manual partitioning
val partitionColumn: String = "order_year"
val options: Map[String, String] = Map(
  "driver" -> driver,
  "url" -> connectionUrl,
  "dbtable" -> query,
  "user" -> userName,
  "password" -> password,
  "partitionColumn" -> partitionColumn,
  "lowerBound" -> "0",
  "upperBound" -> "3000",
  "numPartitions" -> "300"
)
val df = sqlContext.read.format("jdbc").options(options).load()
PS: partitionColumn, lowerBound, upperBound, numPartitions:
These options must all be specified if any of them is specified.
Now you can save your DataFrame to parquet.
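With the read split across JDBC partitions, each partition can be written out independently instead of buffering the whole table at once. A minimal sketch follows (the output path is an assumption); in Spark 1.6, df.write.parquet(...) is the non-deprecated replacement for saveAsParquetFile:
// Write the partitioned DataFrame as parquet; partitions are written independently,
// so no single task has to pull the entire table at once.
df.write
  .mode("overwrite")
  .parquet("hdfs:///tmp/name.parquet")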
