I have connection details as below, and I need to read two tables from different schemas (Employees and HR):
sfparams = {
    "sfURL": "123.com",
    "sfUser": "a",
    "sfPassword": "b",
    "sfDatabase": "Test",
    "sfSchema": "Employee",
    "sfWarehouse": "PD1"
}
How do I read data from the HR schema?
To read data from a different schema, provide the two-part name, i.e., schema_name.table_name:
# Snowflake Spark connector source name
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfparams) \
    .option("query", "select * from HR.table1") \
    .load()
Related
I tried creating a Hive table on Avro using an avsc spec file, and I need to rename some of the columns. I used aliases, but it doesn't seem to work: the columns are returned as null when I query the table.
SPARK DATAFRAME TO SAVE DATA
val data = Seq(("john", "adams"), ("john", "smith"))
val columns = Seq("fname", "lname")
import spark.sqlContext.implicits._
val df = data.toDF(columns: _*)
df.write.format("avro").save("/test")
AVSC Spec file
{
  "type" : "record",
  "name" : "test",
  "doc" : " import of test",
  "fields" : [ {
    "name" : "first_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "aliases" : [ "fname" ],
    "columnName" : "fname",
    "sqlType" : "12"
  }, {
    "name" : "last_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "aliases" : [ "lname" ],
    "columnName" : "lname",
    "sqlType" : "12"
  } ],
  "tableName" : "test"
}
EXTERNAL HIVE TABLE
create external table test
STORED AS AVRO
LOCATION '/test'
TBLPROPERTIES ('avro.schema.url'='/test.avsc');
HIVE QUERY
SELECT last_name from test;
This returns null even though there is data in the Avro files under the original name, i.e., lname.
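A quick way to check what Hive is matching against is to read the files back and inspect the field names they were written with; a small diagnostic sketch, assuming the same spark-avro package used for the write above:
// The DataFrame was written with its column names, so the files under /test
// should carry fname/lname, while the avsc expects first_name/last_name.
val written = spark.read.format("avro").load("/test")
written.printSchema()
written.show()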
I'm trying to use the to_avro() function to create Avro records. However, I'm not able to encode multiple columns, as some columns are simply lost after encoding. A simple example to recreate the problem:
val schema = StructType(List(
  StructField("entity_type", StringType),
  StructField("entity", StringType)
))
val rdd = sc.parallelize(Seq(
  Row("PERSON", "John Doe")
))
val df = sqlContext.createDataFrame(rdd, schema)
df
  .withColumn("struct", struct(col("entity_type"), col("entity")))
  .select("struct")
  .collect()
  .foreach(println)
// prints [[PERSON, John Doe]]

df
  .withColumn("struct", struct(col("entity_type"), col("entity")))
  .select(to_avro(col("struct")).as("value"))
  // entitySchema is the Avro schema JSON shown below, passed as a string
  .select(from_avro(col("value"), entitySchema).as("entity"))
  .collect()
  .foreach(println)
// prints [[, PERSON]]
My schema looks like this
{
  "type" : "record",
  "name" : "Entity",
  "fields" : [ {
    "name" : "entity_type",
    "type" : "string"
  }, {
    "name" : "entity",
    "type" : "string"
  } ]
}
What's interesting is that if I change the column order in the struct, the result becomes [, John Doe].
I'm using Spark 2.4.5. According to the Spark documentation: "to_avro() can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka."
It works after changing the field types from "string" to ["string", "null"]. Not sure whether this behavior is intended, though.
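For reference, a minimal sketch of the round trip with the adjusted reader schema (assuming the Spark 2.4 spark-avro module, where from_avro/to_avro live in org.apache.spark.sql.avro):
import org.apache.spark.sql.avro.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}

// Reader schema with union ("null"-able) field types, matching how Spark
// encodes nullable struct fields.
val entitySchema = """
{
  "type" : "record",
  "name" : "Entity",
  "fields" : [
    { "name" : "entity_type", "type" : [ "string", "null" ] },
    { "name" : "entity", "type" : [ "string", "null" ] }
  ]
}
"""

df.withColumn("struct", struct(col("entity_type"), col("entity")))
  .select(to_avro(col("struct")).as("value"))
  .select(from_avro(col("value"), entitySchema).as("entity"))
  .show(false)
// should now show both fields: [PERSON, John Doe]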
I have a Spark-Kafka Structured Streaming pipeline. It listens to a topic which may contain JSON records with varying schemas.
Now I want to resolve the schema based on the key (x_y), and then apply it to the value portion to parse the JSON record.
So here the key's 'y' part indicates the schema type.
I tried to get the schema string from a UDF and then pass it to the from_json() function, but it fails with this exception:
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of `schema`
Code used:
df.withColumn("data_type", element_at(split(col("key").cast("string"), "_"), 1))
  .withColumn("schema", schemaUdf($"data_type"))
  .select(from_json(col("value").cast("string"), col("schema")).as("data"))
Schema demo:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "name",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "firstname",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
UDF used:
// mapper is a Jackson ObjectMapper; the file holds the Spark schema as JSON (see the demo above)
lazy val fetchSchema = (fileName: String) => {
  DataType.fromJson(mapper.readTree(new File(fileName)).toString)
}
val schemaUdf = udf[DataType, String](fetchSchema)
Note: I am not using the Confluent schema registry feature.
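For what it's worth, a rough sketch of one way around that constraint, assuming the set of schema types is known up front; from_json only accepts a literal schema, so each known type gets its own call (the schemaFor map, the type names, and the DDL strings below are hypothetical placeholders):
import org.apache.spark.sql.functions.{col, element_at, from_json, split, when}
import org.apache.spark.sql.types.StructType

// Resolve each schema on the driver, keyed by the data_type taken from the key.
val schemaFor: Map[String, StructType] = Map(
  "x" -> StructType.fromDDL("name STRUCT<firstname: STRING>"),
  "y" -> StructType.fromDDL("id INT, amount DOUBLE")
)

val withType = df
  .withColumn("data_type", element_at(split(col("key").cast("string"), "_"), 1))

// One from_json per known type, each with a literal schema; rows of other
// types get null in that column.
val parsed = schemaFor.foldLeft(withType) { case (acc, (tpe, schema)) =>
  acc.withColumn(s"data_$tpe",
    when(col("data_type") === tpe, from_json(col("value").cast("string"), schema)))
}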
I have updated the avsc file to rename a column, like this:
"fields" : [ {
  "name" : "department_id",
  "type" : [ "null", "int" ],
  "default" : null
}, {
  "name" : "office_name",
  "type" : [ "null", "string" ],
  "default" : null,
  "aliases" : [ "department_name" ],
  "columnName" : "department_name"
}
However, in my Avro file the columns are like department_id: 10, department_name: "maths".
Now when I query like below,
select office_name from t
it always returns null values. Will it not return the value from department_name in the Avro data? Is there a way to have multiple names for a column in an avsc file?
From the Cloudera community: "we recommend to use the original name rather than the aliased name of the field in the table, as the Avro aliases are stripped during loading into Spark."
Schema with aliases:
val schema = new Schema.Parser().parse(new File("../spark-2.4.3-bin-hadoop2.7/examples/src/main/resources/user.avsc"))
schema: org.apache.avro.Schema = {"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string","aliases":["customer_name"],"columnName":"customer_name"},{"name":"favorite_color","type":["string","null"],"aliases":["color"],"columnName":"color"}]}
Spark stripping the aliases:
val usersDF = spark.read.format("avro").option("avroSchema",schema.toString).load("../spark-2.4.3-bin-hadoop2.7/examples/src/main/resources/users.avro")
usersDF: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string]
I guess you can go with Spark's built-in features to rename a column (see the sketch below), but if you find any other workaround, let me know as well.
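For example, a minimal sketch of that rename with the DataFrame API, using the alias names from the schema above:
// Read with the original Avro field names, then rename in Spark.
val renamed = usersDF
  .withColumnRenamed("name", "customer_name")
  .withColumnRenamed("favorite_color", "color")
renamed.printSchema()
withColumnRenamed returns the DataFrame unchanged if the source column doesn't exist, so it is safe to chain.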
I have a DataFrame in Databricks which I want to use to create a graph in Cosmos, with one row in the DataFrame equating to 1 vertex in Cosmos.
When I write to Cosmos I can't see any properties on the vertices, just a generated id.
Get data:
data = spark.sql("select * from graph.testgraph")
Configuration:
writeConfig = {
    "Endpoint": "******",
    "Masterkey": "******",
    "Database": "graph",
    "Collection": "TestGraph",
    "Upsert": "true",
    "query_pagesize": "100000",
    "bulkimport": "true",
    "WritingBatchSize": "1000",
    "ConnectionMaxPoolSize": "100",
    "partitionkeydefinition": "/id"
}
Write to Cosmos:
data.write \
    .format("com.microsoft.azure.cosmosdb.spark") \
    .options(**writeConfig) \
    .save()
Below is working code to insert records into Cosmos DB.
Go to the site below, click on the download option, and select the uber jar:
https://search.maven.org/artifact/com.microsoft.azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/jar then add it as a dependency:
spark-shell --master yarn --executor-cores 5 --executor-memory 10g --num-executors 10 --driver-memory 10g --jars "path/to/jar/dependency/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar" --packages "com.google.guava:guava:18.0,com.google.code.gson:gson:2.3.1,com.microsoft.azure:azure-documentdb:1.16.1"
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val data = Seq(
  Row(2, "Abb"),
  Row(4, "Bcc"),
  Row(6, "Cdd")
)
val schema = List(
  StructField("partitionKey", IntegerType, true),
  StructField("name", StringType, true)
)
val DF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
// "partitionkeydefinition" points at the partitionKey column defined in the schema above
val writeConfig = Map(
  "Endpoint" -> "https://*******.documents.azure.com:443/",
  "Masterkey" -> "**************",
  "Database" -> "db_name",
  "Collection" -> "collection_name",
  "Upsert" -> "true",
  "query_pagesize" -> "100000",
  "bulkimport" -> "true",
  "WritingBatchSize" -> "1000",
  "ConnectionMaxPoolSize" -> "100",
  "partitionkeydefinition" -> "/partitionKey")
DF.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(writeConfig).save()
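To double-check that the properties landed, you can read the collection back through the same connector; a rough sketch reusing the connection settings from the write config above (readConfig is just an illustrative name):
// Read back and confirm the partitionKey and name properties are present.
val readConfig = Map(
  "Endpoint" -> "https://*******.documents.azure.com:443/",
  "Masterkey" -> "**************",
  "Database" -> "db_name",
  "Collection" -> "collection_name")

val readBack = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(readConfig).load()
readBack.printSchema()
readBack.show()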