Hive table creation in HDP using Apache Spark job - apache-spark

I have written the following Scala program in Eclipse to read a CSV file from a location in HDFS and then save that data into a Hive table [I am using the HDP 2.4 sandbox running on VMware on my local machine]:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object HDFS2HiveFileRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("HDFS2HiveFileRead")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    println("loading data")
    val loadDF = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ",")
      .load("hdfs://192.168.159.129:8020/employee.csv")
    println("data loaded")
    loadDF.printSchema()

    println("creating table")
    loadDF.write.saveAsTable("%s.%s".format("default", "tblEmployee2"))
    println("table created")

    val selectQuery = "SELECT * FROM default.tblEmployee2"
    println("selecting data")
    val result = hiveContext.sql(selectQuery)
    result.show()
  }
}
When I run this program from Eclipse using
Run As -> Scala Application
it shows me the following results on the Eclipse console:
loading data
data loaded
root
|-- empid: string (nullable = true)
|-- empname: string (nullable = true)
|-- empage: string (nullable = true)
creating table
17/06/29 13:27:08 INFO CatalystWriteSupport: Initialized Parquet
WriteSupport with Catalyst schema: { "type" : "struct", "fields" :
[ {
"name" : "empid",
"type" : "string",
"nullable" : true,
"metadata" : { } }, {
"name" : "empname",
"type" : "string",
"nullable" : true,
"metadata" : { } }, {
"name" : "empage",
"type" : "string",
"nullable" : true,
"metadata" : { } } ] } and corresponding Parquet message type: message spark_schema { optional binary empid (UTF8); optional
binary empname (UTF8); optional binary empage (UTF8); }
table created
selecting data
+-----+--------+------+
|empid| empname|empage|
+-----+--------+------+
| 1201| satish| 25|
| 1202| krishna| 28|
| 1203| amith| 39|
| 1204| javed| 23|
| 1205| prudvi| 23|
+-----+--------+------+
17/06/29 13:27:14 ERROR ShutdownHookManager: Exception while deleting
Spark temp dir:
C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10
java.io.IOException: Failed to delete:
C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10
This shows that the CSV data has been loaded from the desired HDFS location [present in HDP] and that a table named tblEmployee2 has also been created in Hive, as I can read it and see the results in the console. I can even read this table again and again by running any Spark job against it.
BUT the issue is that as soon as I log into my HDP 2.4 sandbox through PuTTY and try to look at this table in Hive:
1) I cannot see this table there.
2) I assume this code creates a managed/internal table in Hive, so the CSV file at the given HDFS location should also be moved from its original location to the Hive metastore location, but this is not happening.
3) I can also see a metastore_db folder being created in my Eclipse workspace; does that mean tblEmployee2 is being created on my local/Windows machine?
4) How can I resolve this issue and have my code create the Hive table in HDP? Is there any configuration I am missing?
5) Why am I getting the last error in my execution?
Any quick response/pointer would be appreciated.
UPDATE: After thinking about it a lot, when I added
hiveContext.setConf("hive.metastore.uris","thrift://192.168.159.129:9083")
the code moved a bit further, but some permission-related issues started appearing. I can now see this table [tblEmployee2] in Hive's default database on the VMware sandbox, but Spark SQL persists it in its own format by itself:
17/06/29 22:43:21 WARN HiveContext$$anon$2: Could not persist `default`.`tblEmployee2` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
Hence I am still not able to use HiveContext the way I want, and my above-mentioned issues 2-5 still persist.
Regards,
Bhupesh

You are running Spark in local mode:
val conf = new SparkConf()
.setAppName("HDFS2HiveFileRead")
.setMaster("local")
In local mode, when you call saveAsTable, Spark tries to create the table on your local machine, which is why a metastore_db folder appears in your Eclipse workspace. Change your configuration to run in YARN mode so the job talks to the cluster's Hive metastore; a minimal sketch follows the link below.
You can refer to the URL below for details:
http://www.coding-daddy.xyz/node/7
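If the goal is to have the table created in the HDP sandbox's metastore rather than in a local metastore_db, a minimal sketch could look like the following. It assumes the thrift URI from the question's update and that the job is submitted with spark-submit --master yarn from inside the sandbox instead of being run locally from Eclipse.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HDFS2HiveFileRead {
  def main(args: Array[String]): Unit = {
    // No setMaster("local") here: pass --master yarn to spark-submit instead,
    // so the job runs inside the cluster and talks to its Hive metastore.
    val conf = new SparkConf().setAppName("HDFS2HiveFileRead")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Metastore URI taken from the question's update.
    hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083")

    val loadDF = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs://192.168.159.129:8020/employee.csv")

    // CREATE TABLE ... AS SELECT through HiveContext produces a plain Hive
    // managed table, avoiding the Spark-SQL-specific table format.
    loadDF.registerTempTable("employee_staging")
    hiveContext.sql("CREATE TABLE default.tblEmployee2 AS SELECT * FROM employee_staging")
  }
}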

Related

How to access BigQuery using Spark which is running outside of GCP

I'm trying to connect my Spark job, which runs in a private datacenter, to BigQuery. I created a service account, got its private JSON key, and gained read access to the dataset I want to query. But when I try integrating with Spark, I receive User does not have bigquery.tables.create permission for dataset xxx:yyy. Do we need the create-table permission just to read data from a table using BigQuery?
Below is the response that gets printed on the console:
{
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "Access Denied: Dataset xxx:yyy: User does not have bigquery.tables.create permission for dataset xxx:yyy.",
"reason" : "accessDenied"
} ],
"message" : "Access Denied: Dataset xxx:yyy: User does not have bigquery.tables.create permission for dataset xxx:yyy.",
"status" : "PERMISSION_DENIED"
}
Below is the Spark code with which I'm trying to access BigQuery:
object ConnectionTester extends App {
val session = SparkSession.builder()
.appName("big-query-connector")
.config(getConf)
.getOrCreate()
session.read
.format("bigquery")
.option("viewsEnabled", true)
.load("xxx.yyy.table1")
.select("col1")
.show(2)
private def getConf : SparkConf = {
val sparkConf = new SparkConf
sparkConf.setAppName("biq-query-connector")
sparkConf.setMaster("local[*]")
sparkConf.set("parentProject", "my-gcp-project")
sparkConf.set("credentialsFile", "<path to my credentialsFile>")
sparkConf
}
}
For reading regular tables there's no need for the bigquery.tables.create permission. However, the code sample you've provided hints that the table is actually a BigQuery view. BigQuery views are logical references; they are not materialized on the server side, and in order for Spark to read them they first need to be materialized into a temporary table. Creating this temporary table is what requires the bigquery.tables.create permission; a sketch of the relevant connector options follows.
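A sketch of how the pieces fit together with the spark-bigquery connector: viewsEnabled turns on view materialization, and materializationDataset names a dataset where the service account does hold bigquery.tables.create (my_scratch_dataset below is a placeholder).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("big-query-connector")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .format("bigquery")
  .option("parentProject", "my-gcp-project")
  .option("credentialsFile", "<path to my credentialsFile>")
  // Needed for views: the connector materializes the view into a temporary
  // table in this dataset before reading it.
  .option("viewsEnabled", "true")
  .option("materializationDataset", "my_scratch_dataset")
  .load("xxx.yyy.table1")

df.select("col1").show(2)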
Check the code below.
Credential:
val credentials = """
| {
| "type": "service_account",
| "project_id": "your project id",
| "private_key_id": "your private_key_id",
| "private_key": "-----BEGIN PRIVATE KEY-----\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n-----END PRIVATE KEY-----\n",
| "client_email": "xxxxx#company.com",
| "client_id": "111111111111111111111111111",
| "auth_uri": "https://accounts.google.com/o/oauth2/auth",
| "token_uri": "https://oauth2.googleapis.com/token",
| "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
| "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxxxx40vvvvvv.iam.gserviceaccount.com"
| }
| """
Encode it as Base64 and pass it to the Spark conf.
def base64(data: String) = {
import java.nio.charset.StandardCharsets
import java.util.Base64
Base64.getEncoder.encodeToString(data.getBytes(StandardCharsets.UTF_8))
}
spark.conf.set("credentials",base64(credentials))
spark.read
  .option("parentProject", "parentProject")
  .option("table", "dataset.table")
  .format("bigquery")
  .load()

Cannot convert Catalyst type IntegerType to Avro type ["null","int"]

I have a Spark Structured Streaming process built with PySpark that reads an Avro message from a Kafka topic, makes some transformations, and loads the data as Avro into a target topic.
I use the ABRIS package (https://github.com/AbsaOSS/ABRiS) to serialize/deserialize the Avro from Confluent, integrating with Schema Registry.
The schema contains integer columns as follows:
{
"name": "total_images",
"type": [
"null",
"int"
],
"default": null
},
{
"name": "total_videos",
"type": [
"null",
"int"
],
"default": null
},
The process raises the following error: Cannot convert Catalyst type IntegerType to Avro type ["null","int"].
I've tried to make the columns nullable but the error persists.
If someone has a suggestion I would appreciate it.
I burned hours on this one.
Actually, it is unrelated to the ABRiS dependency (the behaviour is the same with the native spark-avro APIs).
There may be several root causes, but in my case (Spark 3.0.1, Scala with Dataset) it was related to the encoder and a wrong type in the case class holding the data.
In short, an Avro field defined with "type": ["null","int"] cannot be mapped to a Scala Int; it needs to be Option[Int].
Using the following code:
test("Avro Nullable field") {
val schema: String =
"""
|{
| "namespace": "com.mberchon.monitor.dto.avro",
| "type": "record",
| "name": "TestAvro",
| "fields": [
| {"name": "strVal", "type": ["null", "string"]},
| {"name": "longVal", "type": ["null", "long"]}
| ]
|}
""".stripMargin
val topicName = "TestNullableAvro"
val testInstance = TestAvro("foo",Some(Random.nextInt()))
import sparkSession.implicits._
val dsWrite:Dataset[TestAvro] = Seq(testInstance).toDS
val allColumns = struct(dsWrite.columns.head, dsWrite.columns.tail: _*)
dsWrite
.select(to_avro(allColumns,schema) as 'value)
.write
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("topic", topicName)
.save()
val dsRead:Dataset[TestAvro] = sparkSession.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.load()
.select(from_avro(col("value"), schema) as 'Metric)
.select("Metric.*")
.as[TestAvro]
assert(dsRead.collect().contains(testInstance))
}
It fails if the case class is defined as follows:
case class TestAvro(strVal:String,longVal:Long)
Cannot convert Catalyst type LongType to Avro type ["null","long"].
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type LongType to Avro type ["null","long"].
at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:219)
at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$1(AvroSerializer.scala:239)
It works properly with:
case class TestAvro(strVal:String,longVal:Option[Long])
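Applied to the schema fragment from the question, the same rule would give a case class along these lines (the class name is hypothetical, for illustration only):

// Avro ["null","int"] fields map to Option[Int], not Int.
case class MediaCounts(
  total_images: Option[Int],
  total_videos: Option[Int]
)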
By the way, it would be more than nice to have support for SpecificRecord within Spark encoders (you can use Kryo, but it is less efficient). As it stands, in order to use a typed Dataset efficiently with my Avro data, I need to create additional case classes (which duplicate my SpecificRecords).

How to convert messages from socket streaming source to custom domain object?

I'm very new to Spark Streaming. I have a Spark 2.2 standalone cluster running with one worker. I'm using a socket source and trying to read the incoming stream into an object called MicroserviceMessage.
val message = spark.readStream
.format("socket")
.option("host", host)
.option("port", port)
.load()
val df = message.as[MicroserviceMessage].flatMap(microserviceMessage =>
microserviceMessage.DataPoints.map(datapoint => (datapoint, microserviceMessage.ServiceProperties, datapoint.EpochUTC)))
.toDF("datapoint", "properties", "timestamp")
I'm hoping this will produce a DataFrame with the columns "datapoint", "properties", and "timestamp".
The data I'm pasting into my netcat terminal looks like this (this is what I'm trying to read in as a MicroserviceMessage):
{
"SystemType": "mytype",
"SystemGuid": "6c84fb90-12c4-11e1-840d-7b25c5ee775a",
"TagType": "Raw Tags",
"ServiceType": "FILTER",
"DataPoints": [
{
"TagName": "013FIC003.PV",
"EpochUTC": 1505247956001,
"ItemValue": 25.47177,
"ItemValueStr": "NORMAL",
"Quality": "Good",
"TimeOffset": "P0000"
},
{
"TagName": "013FIC003.PV",
"EpochUTC": 1505247956010,
"ItemValue": 26.47177,
"ItemValueStr": "NORMAL",
"Quality": "Good",
"TimeOffset": "P0000"
}
],
"ServiceProperties": [
{
"Key": "OutputTagName",
"Value": "FI12102.PV_CL"
},
{
"Key": "OutputTagType",
"Value": "Cleansing Flow Tags"
}
]
}
Instead what I see is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`SystemType`' given input columns: [value];
MicroserviceMessage case class looks like this:
case class DataPoints
(
TagName: String,
EpochUTC: Double,
ItemValue: Double,
ItemValueStr: String,
Quality: String,
TimeOffset: String
)
case class ServiceProperties
(
Key: String,
Value: String
)
case class MicroserviceMessage
(
SystemType: String,
SystemGuid: String,
TagType: String,
ServiceType: String,
DataPoints: List[DataPoints],
ServiceProperties: List[ServiceProperties]
)
EDIT:
After reading this post I was able to start the job by doing
val messageEncoder = Encoders.bean(classOf[MicroserviceMessage])
val df = message.select($"value").as(messageEncoder).map(
msmg => (msmg.ServiceType, msmg.SystemGuid)
).toDF("service", "guid")
But this causes issues when I start sending data.
Caused by: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/LambdaDeserialize
Full stacktrace
This:
message.as[MicroserviceMessage]
is incorrect as explained by the error message:
cannot resolve 'SystemType' given input columns: [value];
Data that comes from the socket source is just a string (or a string plus a timestamp). To make it usable as a strongly typed Dataset you have to parse it, for example with org.apache.spark.sql.functions.from_json, as sketched below.
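A minimal sketch of that approach, assuming each socket line carries one complete JSON document and that the case classes, the spark session, and the message streaming DataFrame from the question are in scope:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Derive a schema from the question's MicroserviceMessage case class and
// use it to parse the socket source's single string column ("value").
val schema = Encoders.product[MicroserviceMessage].schema

val parsed = message
  .select(from_json($"value", schema).as("msg"))
  .select("msg.*")
  .as[MicroserviceMessage]

The flatMap over DataPoints from the question can then be applied to parsed instead of message.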
The reason for the exception
Caused by: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/LambdaDeserialize
is that you compiled your Spark Structured Streaming application with Scala 2.12.4 (or another version in the 2.12 stream), which is not supported by Spark 2.2.
From the scaladoc of scala.runtime.LambdaDeserializer:
This class is only intended to be called by synthetic $deserializeLambda$ method that the Scala 2.12 compiler will add to classes hosting lambdas.
Spark 2.2 supports up to and including Scala 2.11.12 with 2.11.8 being the most "blessed" version.
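For example, a build.sbt along these lines keeps the build on a supported Scala version (the Spark artifact version shown is illustrative):

// build.sbt: compile against a Scala version that Spark 2.2 supports
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
)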

Creating Hive table on top of parquet files is failing - where am I going wrong?

I'm saving a dataframe to parquet files. The schema generated looks like this:
org.apache.spark.sql.parquet.row.metadata{
"type":"struct",
"fields":[
{
"name":"DCN",
"type":"string",
"nullable":true,
"metadata":{}
},
{
"name":"EDW_id",
"type":"string",
"nullable":true,
"metadata":{}
},
{
"name":"recievedTimestamp",
"type":"string",
"nullable":true,
"metadata":{}
},
{
"name":"recievedDate",
"type":"date",
"nullable":true,
"metadata":{}
},
{
"name":"rule",
"type":"string",
"nullable":true,
"metadata":{}
}
]}
The dataframe is being generated in a spark program; when I run it via spark-submit and display the dataframe I can see there are several hundred records. I'm saving the df to parquet like so:
df.write.format("parquet").mode(SaveMode.Overwrite).save('/home/my/location')
And creating an external table in hive like so:
CREATE EXTERNAL TABLE schemaname.tablename (
DCN STRING,
EDW_ID STRING,
RECIEVEDTIMESTAMP STRING,
RECIEVEDDATE STRING,
RULE STRING)
STORED AS PARQUET
LOCATION '/home/my/location';
The table is being created successfully, but it is not being populated with any data - when I query it, 0 records are returned. Can anyone spot what I'm doing wrong? This is using Hive 1.1 and Spark 1.6.
Hive requires a jar file for handling Parquet files.
1. First download parquet-hive-bundle-1.5.0.jar.
2. Include the jar path in hive-site.xml:
<property>
<name>hive.jar.directory</name>
<value>/home/hduser/hive/lib/parquet-hive-bundle-1.5.0.jar</value>
</property>
The Hive metastore is case-insensitive and stores all column names in lower case, whereas Parquet stores them as-is. Try recreating the table and the files so that the column names match in case; one way to do that from the Spark side is sketched below.
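A minimal sketch, assuming df is the DataFrame from the question: lower-case the column names before writing, so the Parquet schema matches the lower-cased names kept by the metastore.

import org.apache.spark.sql.SaveMode

// Rename every column to lower case so the Parquet footer agrees with the
// column names the Hive metastore stores.
val dfLower = df.toDF(df.columns.map(_.toLowerCase): _*)

dfLower.write.format("parquet").mode(SaveMode.Overwrite).save("/home/my/location")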

Trying to deserialize Avro in Spark with specific type

I have some Avro classes that I generated, and I am now trying to use them in Spark. So I imported my Avro-generated Java class, "twitter_schema", and refer to it when I deserialize. It seems to work, but I get a cast exception at the end.
My Schema:
$ more twitter.avsc
{ "type" : "record", "name" : "twitter_schema", "namespace" :
"com.miguno.avro", "fields" : [ {
"name" : "username",
"type" : "string",
"doc" : "Name of the user account on Twitter.com" }, {
"name" : "tweet",
"type" : "string",
"doc" : "The content of the user's Twitter message" }, {
"name" : "timestamp",
"type" : "long",
"doc" : "Unix epoch time in seconds" } ], "doc:" : "A basic schema for storing Twitter messages" }
My code:
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.avro.mapred.AvroKey
import org.apache.hadoop.io.NullWritable
import org.apache.avro.mapred.AvroInputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import com.miguno.avro.twitter_schema
val path = "/app/avro/data/twitter.avro"
val conf = new Configuration
var avroRDD = sc.newAPIHadoopFile(path,classOf[AvroKeyInputFormat[twitter_schema]],
classOf[AvroKey[ByteBuffer]], classOf[NullWritable], conf)
var avroRDD = sc.hadoopFile(path,classOf[AvroInputFormat[twitter_schema]],
classOf[AvroWrapper[twitter_schema]], classOf[NullWritable], 5)
avroRDD.map(l => {
//transformations here
new String(l._1.datum.username)
}
).first
And I get an error on the last line:
scala> avroRDD.map(l => {
| new String(l._1.datum.username)}).first
<console>:30: error: overloaded method constructor String with alternatives:
(x$1: StringBuilder)String <and>
(x$1: StringBuffer)String <and>
(x$1: Array[Byte])String <and>
(x$1: Array[Char])String <and>
(x$1: String)String
cannot be applied to (CharSequence)
new String(l._1.datum.username)}).first
What am I doing wrong? I don't understand the error.
Is this the right way to deserialize? I read about Kryo, but it seems to add complexity, and I read about the Spark SQL context accepting Avro in 1.2, but that sounds like a performance hog/workaround. Does anyone have best practices for this?
Thanks,
Matt
I think your problem is that Avro has deserialized the string into a CharSequence, but Spark expected a Java String. Avro has three ways to deserialize strings in Java: into CharSequence, into String, and into Utf8 (an Avro class for storing strings, somewhat like Hadoop's Text).
You control this by adding the "avro.java.string" property to your Avro schema, as in the sketch below. Possible values are (case-sensitive): "String", "CharSequence", "Utf8". There may be a way to control it dynamically through the input format as well, but I don't know exactly how.
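For example, the username field from the schema above could be declared like this so that Avro hands back java.lang.String directly (a sketch; the same change would apply to the other string fields):

{
  "name" : "username",
  "type" : { "type" : "string", "avro.java.string" : "String" },
  "doc" : "Name of the user account on Twitter.com"
}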
OK, since CharSequence is the interface that String implements, I can keep my Avro schema the way it was and just turn the Avro string into a String via toString(), i.e.:
scala> avroRDD.map(l => {
| new String(l._1.datum.get("username").toString())
| } ).first
res2: String = miguno
