I have a DataFrame in Databricks which I want to use to create a graph in Cosmos, with one row in the DataFrame equating to 1 vertex in Cosmos.
When I write to Cosmos I can't see any properties on the vertices, just a generated id.
Get data:
data = spark.sql("select * from graph.testgraph")
Configuration:
writeConfig = {
    "Endpoint": "******",
    "Masterkey": "******",
    "Database": "graph",
    "Collection": "TestGraph",
    "Upsert": "true",
    "query_pagesize": "100000",
    "bulkimport": "true",
    "WritingBatchSize": "1000",
    "ConnectionMaxPoolSize": "100",
    "partitionkeydefinition": "/id"
}
Write to Cosmos:
data.write \
    .format("com.microsoft.azure.cosmosdb.spark") \
    .options(**writeConfig) \
    .save()
Below is working code to insert records into Cosmos DB.
Go to the site below, click on the download option, select the uber.jar, and then add it as a dependency:
https://search.maven.org/artifact/com.microsoft.azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/jar
spark-shell --master yarn --executor-cores 5 --executor-memory 10g --num-executors 10 --driver-memory 10g --jars "path/to/jar/dependency/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar" --packages "com.google.guava:guava:18.0,com.google.code.gson:gson:2.3.1,com.microsoft.azure:azure-documentdb:1.16.1"
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val data = Seq(
  Row(2, "Abb"),
  Row(4, "Bcc"),
  Row(6, "Cdd")
)
val schema = List(
  StructField("partitionKey", IntegerType, true),
  StructField("name", StringType, true)
)
val DF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
val writeConfig = Map(
  "Endpoint" -> "https://*******.documents.azure.com:443/",
  "Masterkey" -> "**************",
  "Database" -> "db_name",
  "Collection" -> "collection_name",
  "Upsert" -> "true",
  "query_pagesize" -> "100000",
  "bulkimport" -> "true",
  "WritingBatchSize" -> "1000",
  "ConnectionMaxPoolSize" -> "100",
  "partitionkeydefinition" -> "/partitionKey"
)
DF.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(writeConfig).save()
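For reference, here is a minimal PySpark sketch of the same write, mirroring the original question's setup (the endpoint, key, and database/collection values are placeholders); the key point is that the DataFrame carries a column matching partitionkeydefinition:
# PySpark sketch of the Scala answer above; placeholder credentials, same azure-cosmosdb-spark jar required.
data = spark.createDataFrame(
    [(2, "Abb"), (4, "Bcc"), (6, "Cdd")],
    ["partitionKey", "name"]
)

writeConfig = {
    "Endpoint": "https://*******.documents.azure.com:443/",
    "Masterkey": "**************",
    "Database": "db_name",
    "Collection": "collection_name",
    "Upsert": "true",
    "bulkimport": "true",
    "WritingBatchSize": "1000",
    "partitionkeydefinition": "/partitionKey"
}

data.write \
    .format("com.microsoft.azure.cosmosdb.spark") \
    .mode("overwrite") \
    .options(**writeConfig) \
    .save()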
I am working on a pipeline in Databricks > Workflows > Delta Live Tables and having an issue with the streaming part.
Expectations:
One bronze table reads the JSON files with Auto Loader (cloudFiles) in streaming mode (spark.readStream)
One silver table reads and flattens the bronze table in streaming mode (dlt.read_stream)
Result:
When taking the root location as the source (load /*, several hundred files): the pipeline starts, but the number of rows/files appended is not updated in the graph until the bronze part is completed. Then the silver part starts, the number of files/rows never updates either, and the pipeline terminates with a memory error.
When taking a very small number of files (/specific_folder among hundreds): the pipeline runs well and terminates with no error, but again, the number of rows/files appended is not updated in the graph until each part is completed.
This led me to the conclusion that the pipeline does not seem to run in streaming mode.
Maybe I am missing something about the config or how to properly run a DLT pipeline, and I would need your help on this please.
Here is the configuration of the pipeline:
{
  "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::xxxxxxxxxxxx:instance-profile/iam_role_example"
      },
      "autoscale": {
        "min_workers": 1,
        "max_workers": 10,
        "mode": "LEGACY"
      }
    }
  ],
  "development": true,
  "continuous": false,
  "channel": "CURRENT",
  "edition": "PRO",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/user_example#xxxxxx.xx/dms/bronze_job"
      }
    }
  ],
  "name": "01-landing-task-1",
  "storage": "dbfs:/pipelines/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "configuration": {
    "SCHEMA": "example_schema",
    "RAW_MOUNT_NAME": "xxxx",
    "DELTA_MOUNT_NAME": "xxxx",
    "spark.sql.parquet.enableVectorizedReader": "false"
  },
  "target": "landing"
}
Here is the code of the pipeline (the query for the silver table actually contains many more columns, ~30, extracted with get_json_object):
import dlt
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.window import Window
RAW_MOUNT_NAME = spark.conf.get("RAW_MOUNT_NAME")
SCHEMA = spark.conf.get("SCHEMA")
SOURCE = spark.conf.get("SOURCE")
TABLE_NAME = spark.conf.get("TABLE_NAME")
PRIMARY_KEY_PATH = spark.conf.get("PRIMARY_KEY_PATH")
@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_bronze",
    table_properties={
        "quality": "bronze"
    }
)
def bronze_job():
    load_path = f"/mnt/{RAW_MOUNT_NAME}/{SOURCE}/5e*"
    return spark \
        .readStream \
        .format("text") \
        .option("encoding", "UTF-8") \
        .load(load_path) \
        .select("value", "_metadata") \
        .withColumnRenamed("value", "json") \
        .withColumn("id", F.expr(f"get_json_object(json, '$.{PRIMARY_KEY_PATH}')")) \
        .withColumn("_etl_timestamp", F.col("_metadata.file_modification_time")) \
        .withColumn("_metadata", F.col("_metadata").cast(T.StringType())) \
        .withColumn("_etl_operation", F.lit("U")) \
        .withColumn("_etl_to_delete", F.lit(False)) \
        .withColumn("_etl_file_name", F.input_file_name()) \
        .withColumn("_etl_job_processing_timestamp", F.current_timestamp()) \
        .withColumn("_etl_table", F.lit(f"{TABLE_NAME}")) \
        .withColumn("_etl_partition_date", F.to_date(F.col("_etl_timestamp"), "yyyy-MM-dd")) \
        .select("_etl_operation", "_etl_timestamp", "id", "json", "_etl_file_name", "_etl_job_processing_timestamp", "_etl_table", "_etl_partition_date", "_etl_to_delete", "_metadata")
@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_silver",
    table_properties={
        "quality": "silver",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def silver_job():
    df = dlt.read_stream(f"{SCHEMA}_{TABLE_NAME}_bronze").where("_etl_table == 'extraction'")
    return df.select(
        df.id.alias('medium_id'),
        F.get_json_object(df.json, '$.request').alias('request_id'))
Thank you very much for your help!
I have connection details as below and I need to read two tables from different schemas (Employees and HR):
sfparams = {
    "sfURL": "123.com",
    "sfUser": "a",
    "sfPassword": "b",
    "sfDatabase": "Test",
    "sfSchema": "Employee",
    "sfWarehouse": "PD1"
}
How do I read data from the HR schema?
To read data from a different schema, provide a two-part name, i.e., schema_name.table_name:
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfparams) \
    .option("query", "select * from HR.table1") \
    .load()
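Alternatively, since connector options can be overridden per read, you can keep sfparams as-is and just override sfSchema for the HR read. A sketch, assuming the HR table is called table1 (SNOWFLAKE_SOURCE_NAME is typically "net.snowflake.spark.snowflake"):
# Override the schema for this read instead of using a two-part name.
# "table1" is an assumed table name in the HR schema.
df_hr = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfparams) \
    .option("sfSchema", "HR") \
    .option("dbtable", "table1") \
    .load()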
Given JSON data like this:
{
  "id": "34cx34fs987",
  "time_series": [
    {
      "time": "2020090300: 00: 00",
      "value": 342342.12
    },
    {
      "time": "2020090300: 00: 05",
      "value": 342421.88
    },
    {
      "time": "2020090300: 00: 10",
      "value": 351232.92
    }
  ]
}
I get the JSON from Kafka:
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
df = spark.readStream.format("kafka")...
How can I manipulate df to get a DataFrame as shown below:
id time value
34cx34fs987 20200903 00:00:00 342342.12
34cx34fs987 20200903 00:00:05 342421.88
34cx34fs987 20200903 00:00:10 351232.92
Using Scala:
If you define your schema as
val schema: StructType = new StructType()
  .add("id", StringType)
  .add("time_series", ArrayType(new StructType()
    .add("time", StringType)
    .add("value", DoubleType)
  ))
you can then make use of Spark SQL built-in functions from_json and explode
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = df
  .selectExpr("CAST(value as STRING) as json")
  .select(from_json('json, schema).as("data"))
  .select(col("data.id").as("id"), explode(col("data.time_series")).as("time_series"))
  .select(col("id"), col("time_series.time").as("time"), col("time_series.value").as("value"))
Your output will then be:
+-----------+-----------------+---------+
|id |time |value |
+-----------+-----------------+---------+
|34cx34fs987|20200903 00:00:00|342342.12|
|34cx34fs987|20200903 00:00:05|342421.88|
|34cx34fs987|20200903 00:00:10|351232.92|
+-----------+-----------------+---------+
Sample code in PySpark (assuming df already holds the parsed data with id and time_series columns):
import pyspark.sql.functions as f

df2 = df.select("id", f.explode("time_series").alias("col"))
df2.select("id", "col.time", "col.value").show()
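For completeness, a sketch of the full PySpark pipeline starting from the raw Kafka value, mirroring the Scala code above (df is the streaming DataFrame from readStream; parsed and ts are illustrative names):
import pyspark.sql.functions as f
import pyspark.sql.types as t

# Same schema as the Scala version: an id plus an array of (time, value) structs.
schema = (
    t.StructType()
    .add("id", t.StringType())
    .add("time_series", t.ArrayType(
        t.StructType()
        .add("time", t.StringType())
        .add("value", t.DoubleType())
    ))
)

parsed = (
    df.selectExpr("CAST(value AS STRING) AS json")
    .select(f.from_json("json", schema).alias("data"))
    .select(f.col("data.id").alias("id"), f.explode(f.col("data.time_series")).alias("ts"))
    .select("id", f.col("ts.time").alias("time"), f.col("ts.value").alias("value"))
)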
I have a Spark Structured Streaming process built with PySpark that reads Avro messages from a Kafka topic, applies some transformations, and loads the data as Avro into a target topic.
I use the ABRIS package (https://github.com/AbsaOSS/ABRiS) to serialize/deserialize the Avro from Confluent, integrating with Schema Registry.
The schema contains integer columns as follows:
{
  "name": "total_images",
  "type": [
    "null",
    "int"
  ],
  "default": null
},
{
  "name": "total_videos",
  "type": [
    "null",
    "int"
  ],
  "default": null
},
The process raises the following error: Cannot convert Catalyst type IntegerType to Avro type ["null","int"].
I've tried to convert the columns to be nullable but the error persists.
If someone has a suggestion, I would appreciate it.
I burned hours on this one.
Actually, it is unrelated to the ABRiS dependency (the behaviour is the same with the native spark-avro APIs).
There may be several root causes, but in my case (Spark 3.0.1, Scala with Dataset) it was related to the encoder and a wrong type in the case class holding the data.
In short, an Avro field defined with "type": ["null","int"] cannot be mapped to a Scala Int; it needs Option[Int].
Using the following code:
test("Avro Nullable field") {
val schema: String =
"""
|{
| "namespace": "com.mberchon.monitor.dto.avro",
| "type": "record",
| "name": "TestAvro",
| "fields": [
| {"name": "strVal", "type": ["null", "string"]},
| {"name": "longVal", "type": ["null", "long"]}
| ]
|}
""".stripMargin
val topicName = "TestNullableAvro"
val testInstance = TestAvro("foo",Some(Random.nextInt()))
import sparkSession.implicits._
val dsWrite:Dataset[TestAvro] = Seq(testInstance).toDS
val allColumns = struct(dsWrite.columns.head, dsWrite.columns.tail: _*)
dsWrite
.select(to_avro(allColumns,schema) as 'value)
.write
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("topic", topicName)
.save()
val dsRead:Dataset[TestAvro] = sparkSession.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.load()
.select(from_avro(col("value"), schema) as 'Metric)
.select("Metric.*")
.as[TestAvro]
assert(dsRead.collect().contains(testInstance))
}
It fails if the case class is defined as follows:
case class TestAvro(strVal:String,longVal:Long)
Cannot convert Catalyst type LongType to Avro type ["null","long"].
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type LongType to Avro type ["null","long"].
at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:219)
at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$1(AvroSerializer.scala:239)
It works properly with:
case class TestAvro(strVal:String,longVal:Option[Long])
Btw, it would be more than nice to have support for SpecificRecord within Spark encoders (you can use Kryo, but it is inefficient).
As it stands, in order to use typed Datasets efficiently with my Avro data, I need to create additional case classes (which are duplicates of my SpecificRecords).
I am new to Spark and trying to explore Spark Structured Streaming. I will be consuming messages from Kafka (nested JSON) and filtering them based on certain conditions on the JSON attributes. Every message satisfying the filter should then be pushed to Cassandra.
I have read the documentation on the Spark Cassandra connector:
https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
Dataset<Row> df = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .load();

df.selectExpr("CAST(value AS STRING)");
I only need a few of the many attributes present in this nested JSON. How do I apply a schema on top of it, so that I can use Spark SQL for filtering?
For the sample JSON, I need to persist name, age, experience, hobby_name, hobby_experience for players whose sum of playing frequency is more than 5.
{
  "name": "Tom",
  "age": "24",
  "gender": "male",
  "hobbies": [{
    "name": "Tennis",
    "experience": 5,
    "places": [{
      "city": "London",
      "frequency": 4
    }, {
      "city": "Sydney",
      "frequency": 3
    }]
  }]
}
I am relatively new to Spark, so please forgive me if this is a repeat. Also, I am looking for a solution in Java.
You can specify your schema like so:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// hobbies and places are JSON arrays, so they are modeled as ArrayType of StructType
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.StringType, true),
    DataTypes.createStructField("gender", DataTypes.StringType, true),
    DataTypes.createStructField("hobbies", DataTypes.createArrayType(DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("experience", DataTypes.IntegerType, true),
        DataTypes.createStructField("places", DataTypes.createArrayType(DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("city", DataTypes.StringType, true),
            DataTypes.createStructField("frequency", DataTypes.IntegerType, true)
        })), true)
    })), true)
});
And then use the schema to create your dataframe as needed:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

df.select(from_json(col("value"), schema).as("data"))
    .select(
        col("data.name").as("name"),
        col("data.hobbies.name").as("hobbies_name"));