Writing Edges to Cosmos DB using Cosmos Spark - apache-spark

I have followed this tutorial and created graph vertices in a Cosmos DB, but I can't find any documentation on how to create edges between the vertices using the same approach. Is this possible?
For reference, this is the code used to create the vertices.
spark.createDataFrame([("cat-alive", "Schrodinger cat", 2, True), ("cat-dead", "Schrodinger cat", 2, False)]) \
    .toDF("id", "name", "age", "isAlive") \
    .write \
    .format("cosmos.oltp") \
    .options(**cfg) \
    .mode("APPEND") \
    .save()
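For context, cfg is the Cosmos DB Spark connector configuration dictionary passed to options(); a minimal sketch (the endpoint, key, database, and container values are placeholders):
cfg = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",  # placeholder
    "spark.cosmos.accountKey": "<account-key>",          # placeholder
    "spark.cosmos.database": "<database-name>",          # placeholder
    "spark.cosmos.container": "<graph-container-name>"   # placeholder
}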

Figured this out: the edge needs specific column headers (_sink and _vertexId become the "to" and "from" vertex IDs).
spark.createDataFrame([(True, "edge_label", "id_whatever_you_choose", "", "cat-alive", "", "cat-dead")]) \
    .toDF("_isEdge", "label", "id", "_vertexLabel", "_sink", "_sinkLabel", "_vertexId") \
    .write \
    .format("cosmos.oltp") \
    .options(**cfg) \
    .mode("APPEND") \
    .save()

Related

apache spark pivot dynamically

The PIVOT clause is available in Apache Spark SQL, but it expects an expression_list, which works when you know in advance which columns to expect. I would like to pivot on columns dynamically instead.
In my use case, I would need to retrieve the list of values in a query and then pass it into the IN clause.
I just don't see how to do this beyond dynamically building the SQL string and filling in the expression_list as a parameter, essentially building a template and then executing the query.
That would be fine if I could do it in a Databricks notebook (in my case); however, I'm writing a query in Databricks SQL, which only accepts SQL, so I have no way to build an SQL string.
I would really appreciate any advice on how to resolve this.
%sql
DROP TABLE person;
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
  (100, 'John', 30, 1, 'Street 1'),
  (200, 'Mary', NULL, 1, 'Street 2'),
  (300, 'Mike', 80, 3, 'Street 3'),
  (400, 'Dan', 50, 4, 'Street 4');

SELECT * FROM person
PIVOT (
  SUM(age) AS a
  FOR name IN ('John', 'Dan')
)
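For what it's worth, if the query can run from a notebook rather than pure Databricks SQL, the DataFrame API's pivot() infers the distinct values when no list is supplied, which avoids hard-coding the IN clause. A minimal sketch against the person table above (grouping by class is just an example choice):
%py
# pivot() without an explicit value list scans the name column for its distinct values
pivoted = spark.table("person").groupBy("class").pivot("name").sum("age")
pivoted.show()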

Change the datatype of a column in delta table

Is there a SQL command that I can easily use to change the data type of an existing column in a Delta table? I need to change the column data type from BIGINT to STRING. Below is the SQL command I'm trying to use, but no luck.
%sql ALTER TABLE [TABLE_NAME] ALTER COLUMN [COLUMN_NAME] STRING
Error I'm getting:
org.apache.spark.sql.AnalysisException
ALTER TABLE CHANGE COLUMN is not supported for changing column 'bam_user' with type
'IntegerType' to 'bam_user' with type 'StringType'
SQL doesn't support this, but it can be done in Python:
from pyspark.sql.functions import col

# set dataset location and columns with new types
table_path = '/mnt/dataset_location...'
types_to_change = {
    'column_1': 'int',
    'column_2': 'string',
    'column_3': 'double'
}

# load into a dataframe and change the types
df = spark.read.format('delta').load(table_path)
for column in types_to_change:
    df = df.withColumn(column, col(column).cast(types_to_change[column]))

# save the dataframe with the new types, overwriting the schema
df.write.format("delta").mode("overwrite").option("overwriteSchema", True).save("dbfs:" + table_path)
There is no option to change the data type of a column or drop a column in place. You can read the data into a dataframe, modify the data type with withColumn() (and drop columns with drop()), and overwrite the table.
There is no real way to do this using SQL, unless you copy the data to a different table altogether. That approach means inserting the data into a new table, dropping the original and re-creating it with the new structure, and is therefore risky.
The way to do this in Python is as follows:
Let's say this is your table:
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
  (100, 'John', 30, 1, 'Street 1'),
  (200, 'Mary', NULL, 1, 'Street 2'),
  (300, 'Mike', 80, 3, 'Street 3'),
  (400, 'Dan', 50, 4, 'Street 4');
You can check the table structure using the following:
DESCRIBE TABLE person
If you need to change the id to string, this is the code:
%py
from pyspark.sql.functions import col

df = spark.read.table("person")
df1 = df.withColumn("id", col("id").cast("string"))
df1.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("person")
A couple of pointers: the format is parquet in this table, which is the default for Databricks, so you can omit the format() line (note that Python is very sensitive about whitespace, so keep the line continuations).
Regarding Databricks:
If the format is "delta" you must specify it.
Also, if the table is partitioned, it's important to include that in the code, for example:
df1.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("col_to_partition1", "col_to_partition2") \
    .option("overwriteSchema", "true") \
    .save(table_location)
where table_location is the path where the Delta table is saved.
(some of this answer is based on this)
Suppose you want to change the data type of column "column_name" in table "delta_table_name":
from pyspark.sql.functions import col

spark.read.table("delta_table_name") \
    .withColumn("column_name", col("column_name").cast("new_data_type")) \
    .write.format("delta").mode("overwrite").option("overwriteSchema", True).saveAsTable("delta_table_name")
Read the table using Spark.
Use the withColumn method to transform the column you want.
Write the table back with mode overwrite and overwriteSchema set to True.
Reference: https://docs.databricks.com/delta/update-schema.html#explicitly-update-schema-to-change-column-type-or-name
from pyspark.sql import functions as F

spark.read.table("<TABLE NAME>") \
    .withColumn("<COLUMN NAME>", F.col("<COLUMN NAME>").cast("<DATA TYPE>")) \
    .write.format("delta").mode("overwrite").option("overwriteSchema", True).saveAsTable("<TABLE NAME>")

Databricks schema enforcement issues

As suggested in the article about schema enforcement, a declared schema helps detect issues early.
The two issues described below, however, are preventing me from creating a descriptive schema.
Comments on a table column are seen as a difference in the schema
# Get data
test_df = spark.createDataFrame([('100000146710',)], ['code'])

# ... save
test_df.write.format("delta").mode("append").save('/my_table_location')

# Create table: ... BOOM
spark.sql("""
CREATE TABLE IF NOT EXISTS my_table (
  code STRING COMMENT 'Unique identifier'
) USING DELTA LOCATION '/my_table_location'
""")
This will fail with AnalysisException: The specified schema does not match the existing schema at /my_table_location. The only solution I found is to drop the column comments.
Not null struct field shows as nullable
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
    StructField("code", StringType(), False)
])

json_df = (spark.read
    .schema(json_schema)
    .json('/my_input.json')
)
json_df.printSchema()
will show
root
|-- code: string (nullable = true)
So despite the schema declaration stating that a field is not null, the field shows as nullable in the dataframe. Because of this, adding a NOT NULL constraint on the table column will trigger the AnalysisException error.
Any comments or suggestions are welcome.
With the execution of
test_df.write.format("delta").mode("append").save('/my_table_location')
You have already created a new Delta table with its specific schema as defined by test_df. This new table delta.`/my_table_location` already has the schema of code STRING.
If you would like to create a comment within the schema, perhaps first create the table as you would like it defined, e.g.
spark.sql("""
CREATE TABLE my_table
code STRING COMMENT 'unique identifier'
USING DELTA LOCATION '/my_table_location'
""")
And then insert your data from your test_df into it, e.g.
test_df.createOrReplaceTempView("test_df_view")
spark.sql("""
INSERT INTO my_table (code) SELECT code FROM test_df_view
""")

Count & Filter in spark

In Spark one often performs a filter operation before a map, to make sure that the map is possible. See the example below:
bc_ids = sc.broadcast(ids)
new_ids = users.filter(lambda x: x.id in bc_ids.value).map(lambda x: bc_ids.value[x.id])
If you want to know how many users you filtered out, how can you do this efficiently? I would prefer not to use:
count_before = users.count()
new_ids = users.filter(lambda x: x.id in bc_ids.value).map(lambda x: bc_ids.value[x.id])
count_after = new_ids.count()
The question is related to 1, but in contrast it is not about Spark SQL.
In Spark one often performs a filter operation before a map, to make sure that the map is possible.
The reason to perform filter() before map() is to process only the necessary data.
Answer to your question:
val base = sc.parallelize(Seq(1, 2, 3, 4, 5, 6, 7))
println(base.filter(_ == 7).count())
println(base.filter(_ != 7).count())
The first line gives you the count of the filtered result, and the second line gives you how many values were filtered out. If you are working against cached and partitioned data, this can be done efficiently.
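Applied to the PySpark code in the question, the same idea could look like this (a sketch; it still runs two count actions, but against a cached RDD so the source is not recomputed):
# cache so the two count() actions below don't recompute users from scratch
users.cache()
kept = users.filter(lambda x: x.id in bc_ids.value)
new_ids = kept.map(lambda x: bc_ids.value[x.id])
filtered_out = users.count() - kept.count()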

Selecting columns not present in the dataframe

So, I am creating a dataframe from an XML file. It has some information on a dealer, and then a dealer has multiple cars: each car is a sub-element of the cars element and is represented by a value element, and each cars.value element has various car attributes. So I use an explode function to create one row per car for a dealer, as follows:
exploded_dealer = df.select('dealer_id',explode('cars.value').alias('a_car'))
And now I want to get various attributes of cars.value
I do it like this:
car_details_df = exploded_dealer.select('dealer_id','a_car.attribute1','a_car.attribute2')
And that works fine. But sometimes the cars.value elements don't have all the attributes I specify in my query. So for example some cars.value elements might have only attribute1, and then I will get the following error when running the above code:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'attribute2'
given input columns: [dealer_id,attribute1];"
How do I ask Spark to execute the same query anyway, but just return None for attribute2 if it is not present?
UPDATE: I read my data as follows:
initial_file_df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='dealer').load('<xml file location>')
exploded_dealer = initial_file_df.select('financial_data', explode('cars.value').alias('a_car'))
Since you already make specific assumptions about the schema, the best thing you can do is define it explicitly with nullable optional fields and use it when importing the data.
Let's say you expect documents similar to:
<rows>
  <row>
    <id>1</id>
    <objects>
      <object>
        <attribute1>...</attribute1>
        ...
        <attributeN>...</attributeN>
      </object>
    </objects>
  </row>
</rows>
where attribute1, attribute2, ..., attributeN may not all be present in a given batch, but you can define a finite set of choices and corresponding types. For simplicity, let's say there are only two options:
{("attribute1", StringType), ("attribute2", LongType)}
You can define the schema as:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("objects", StructType([
        StructField("object", StructType([
            StructField("attribute1", StringType(), True),
            StructField("attribute2", LongType(), True)
        ]), True)
    ]), True),
    StructField("id", LongType(), True)
])
and use it with reader:
spark.read.schema(schema).option("rowTag", "row").format("xml").load(...)
It will be valid for any subset of attributes ({∅, {attribute1}, {attribute2}, {attribute1, attribute2}}). At the same time it is more efficient than depending on schema inference.
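With the explicit schema in place, selecting an attribute that is absent from a given document no longer fails; it simply comes back as null. A sketch against the example schema above (the file path is a placeholder):
df = spark.read.schema(schema).option("rowTag", "row").format("xml").load("/my_input.xml")
# rows whose <object> lacks attribute2 show null in that column instead of raising AnalysisException
df.select("id", "objects.object.attribute1", "objects.object.attribute2").show()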
