How to reliably obtain partition columns of delta table - apache-spark

I need to obtain the partitioning columns of a Delta table, but DESCRIBE delta.`my_table` returns different results on Databricks and locally in PyCharm.
Minimal example:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

delta_table_path = "c:/temp_delta_table"
partition_column = ["rs_nr"]

schema = StructType([
    StructField("rs_nr", StringType(), False),
    StructField("event_category", StringType(), True),
    StructField("event_counter", IntegerType(), True)])

data = [{'rs_nr': '001', 'event_category': 'event_01', 'event_counter': 1},
        {'rs_nr': '002', 'event_category': 'event_02', 'event_counter': 2},
        {'rs_nr': '003', 'event_category': 'event_03', 'event_counter': 3},
        {'rs_nr': '004', 'event_category': 'event_04', 'event_counter': 4}]

sdf = spark.createDataFrame(data=data, schema=schema)
sdf.write.format("delta").mode("overwrite").partitionBy(partition_column).save(delta_table_path)

df_descr = spark.sql(f"DESCRIBE delta.`{delta_table_path}`")
df_descr.toPandas()
On Databricks, this shows the partition column(s):
col_name data_type comment
0 rs_nr string None
1 event_category string None
2 event_counter int None
3 # Partition Information
4 # col_name data_type comment
5 rs_nr string None
But when running this locally in PyCharm, I get the following different output:
col_name data_type comment
0 rs_nr string
1 event_category string
2 event_counter int
3
4 # Partitioning
5 Part 0 rs_nr
Parsing both types of return value seems ugly to me, so is there a reason that this is returned like this?
Setup:
In PyCharm:
pyspark = 3.2.3
delta-spark = 2.0.0
In Databricks:
DBR 11.3 LTS
Spark = 3.3.0 (I just noted that this differs, I will test if 3.3.0 works locally in the meantime)
Scala = 2.12
In PyCharm, I create the connection using:
def get_spark():
    spark = SparkSession.builder.appName('schema_checker')\
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")\
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")\
        .config("spark.sql.catalogImplementation", "in-memory")\
        .getOrCreate()
    return spark

If you're using Python, then instead of executing a SQL command that is harder to parse, it's better to use the Python API. The DeltaTable instance has a detail function that returns a DataFrame with details about the table (doc), and this DataFrame has a partitionColumns column that is an array of strings with the partition column names. So you can just do:
from delta.tables import *
detailDF = DeltaTable.forPath(spark, delta_table_path).detail()
partitions = detailDF.select("partitionColumns").collect()[0][0]
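Wrapped in a small helper (the function name here is just for illustration), this reads the same way on Databricks and locally; for the table written in the question it should return ['rs_nr']:
from delta.tables import DeltaTable

def get_partition_columns(spark, path):
    # Read the table details and pull out the list of partition column names.
    return DeltaTable.forPath(spark, path).detail().select("partitionColumns").collect()[0][0]

print(get_partition_columns(spark, delta_table_path))  # expected: ['rs_nr']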

Related

how to print a "dictionary" of StringType() in the form of a table with pyspark

Hi,
The above column is part of a table that I am working with in Databricks. What I wish to do is to turn the "ecommerce" column into a table of its own. In this case, it means that I would have a new table with "detail", "products", etc. as columns. Currently "ecommerce" is a StringType.
I have tried using spark dictionary creation, tabulate and other methods, but without success.
The code that I have currently is:
def ecommerce_wtchk_dlt():
    df = dlt.read_stream("wtchk_dlt")
    ddf = df.select(col("ecommerce"))
    header = ddf[0].keys()
    rows = [x.values() for x in ddf]
    dddf = tabulate.tabulate(rows, header)
    return dddf
Whenever I try to forcefully set the type of ecommerce to MapType, I get an error saying that since the original data source is StringType I can only use the same type.
I have reproduced the above and was able to achieve your requirement in this case by using from_json, json_tuple and explode.
This is my sample data with the same format as yours.
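(The original sample screenshot isn't included here. A minimal stand-in DataFrame with the assumed shape, a single ecommerce column holding a JSON string with a detail.products array, would look like the sketch below; the values are made up for illustration.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample: one "ecommerce" JSON string containing a detail.products array.
sample_json = ('{"detail": {"products": [{"name": "item_1", "id": "1", "price": "10", '
               '"brand": "b1", "category": "c1", "variant": "v1"}]}}')
df = spark.createDataFrame([(sample_json,)], ["ecommerce"])
df.show(truncate=False)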
Code:
from pyspark.sql import functions as F
from pyspark.sql.types import *

df2 = df.select(F.json_tuple(df["ecommerce"], "detail")).toDF("detail") \
        .select(F.json_tuple(F.col("detail"), "products")).toDF("products")
print("products : ")
df2.show()

schema = ArrayType(StructType([
    StructField("name", StringType()),
    StructField("id", StringType()),
    StructField("price", StringType()),
    StructField("brand", StringType()),
    StructField("category", StringType()),
    StructField("variant", StringType())
]))

final_df = df2.withColumn("products", F.from_json("products", schema)) \
    .select(F.explode("products").alias("products")) \
    .select("products.*")
print("Final dataframe : ")
final_df.show()

Why does Spark SQL require double percent symbols for like conditions?

When I am using a like condition in Spark SQL, it seems that it requires the use of 2 percent symbols %%.
However, I could not find any documentation on this in the Spark SQL docs. I am curious as to why my set-up might be causing this requirement.
https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-like.html
Example data (product_table):
id        product_type                region  location  measurement
43635665  ORANGE - Blood Orange       EU      FRA       30.5
78960788  APPLE GrannySmith           NA      USA       16.0
12312343  APPLE [Organic Washington]  NA      CAN       7.1
67867634  ORANGE, NavelOrange         NA      MEX       88.4
import pyspark
from pyspark.sql import functions as F

APP_NAME = "Product: Fruit Template"
SPARK_CONF = [
    ("spark.dynamicAllocation.maxExecutors", "5"),
    ("spark.executor.memory", "10g"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memoryOverhead", "2000"),
]
spark_conf = pyspark.SparkConf()
spark_conf.setAppName(APP_NAME)
spark_conf.setAll(SPARK_CONF)
sc = pyspark.SparkContext(conf=spark_conf)
spark = pyspark.sql.SparkSession(sc)

def sql(query):
    return spark.sql(query)

df = sql("""
    SELECT *
    FROM product_table
""")
this returns data
df.filter(F.col("product_type").like("ORANGE%%")).show()
whereas this returns an empty dataframe
df.filter(F.col("product_type").like("ORANGE%")).show()
It may be worth noting that the same issue happens when the LIKE condition is used in the SQL string.
this returns data
df_new = sql("""
SELECT *
FROM product_table
WHERE product_type like 'ORANGE%%'
""")
df_new.show()
whereas this returns an empty dataframe
df_new = sql("""
SELECT *
FROM product_table
WHERE product_type like 'ORANGE%'
""")
df_new.show()
I am using PySpark version 2.3.2.
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F

conf = (SparkConf()
        .set("spark.executor.instances", "24")
        .set("spark.executor.cores", "5")
        .set("spark.executor.memory", "33g")
        .set("spark.driver.memory", "55g")
        .set("spark.driver.maxResultSize", "10g")
        .set("spark.sql.catalogImplementation", "hive")
        .set("mapreduce.fileoutputcommitter.algorithm.version", "2")
        )
spark = (
    SparkSession.builder.appName("default")
    .enableHiveSupport()
    .config(conf=conf)
    .getOrCreate()
)

df = spark.createDataFrame(
    [('43635665', 'ORANGE - Blood Orange'),
     ('78960788', 'APPLE GrannySmith'),
     ('12312343', 'APPLE [Organic Washington'),
     ('67867634', 'ORANGE, NavelOrange')],
    ['id', 'product_type'])
df.createOrReplaceTempView("product_table")

def sql(query):
    print(query)
    return spark.sql(query)

df2 = sql("""
    SELECT *
    FROM product_table
""")
df2.filter(F.col("product_type").like("ORANGE%")).show(truncate=False)
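The snippet above matches with a single %, which is the standard SQL wildcard described in the linked docs. One possible explanation for the behaviour in the question, offered here purely as an assumption about the surrounding setup, is that the SQL text passes through Python %-style string formatting somewhere in the pipeline, which collapses %% to %:
# Hypothetical illustration, not taken from the question or answer: if the query
# template is %-formatted before it reaches Spark, a literal percent sign must be
# written as %% to survive the formatting step.
template = "SELECT * FROM product_table WHERE product_type LIKE 'ORANGE%%' AND region = '%s'"
print(template % "NA")
# SELECT * FROM product_table WHERE product_type LIKE 'ORANGE%' AND region = 'NA'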

Dataframe TypeError cannot accept object

I have a list of strings in Python as follows:
['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
 'start_column=column475;to_3=2020-09-07 10:29:34;']
I am trying to convert it into a dataframe in the following way:
schema = StructType([
    StructField('Rows', ArrayType(StringType()), True)
])
rdd = sc.parallelize(test_list)
query_data = spark.createDataFrame(rdd, schema)
print(query_data.schema)
query_data.show()
I am getting the following error:
TypeError: StructType can not accept object
You just need to pass that as a list while creating the dataframe as below ...
a_list = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
          'start_column=column475;to_3=2020-09-07 10:29:34;']
sparkdf = spark.createDataFrame([a_list],["col1", "col2"])
sparkdf.show(truncate=False)
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
|col1 |col2 |
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
|start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;|start_column=column475;to_3=2020-09-07 10:29:34;|
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
You should use schema = StringType() because your rows contain strings rather than structs of strings.
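A minimal sketch of that approach, assuming test_list holds the two strings from the question:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

test_list = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
             'start_column=column475;to_3=2020-09-07 10:29:34;']

# Each string in the list becomes one row in a single column named "value".
df = spark.createDataFrame(test_list, StringType())
df.show(truncate=False)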
I have two possible solutions for you.
SOLUTION 1: Assuming you wanted a dataframe with just one row
I was able to make it work by wrapping the values in test_list in parentheses (making them a single tuple) and using StringType fields.
v = [('start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
      'start_column=column475;to_3=2020-09-07 10:29:34;')]
schema = StructType([
    StructField('col_1', StringType(), True),
    StructField('col_2', StringType(), True),
])
rdd = sc.parallelize(v)
query_data = spark.createDataFrame(rdd, schema)
print(query_data.schema)
query_data.show(truncate=False)
SOLUTION 2: Assuming you wanted a dataframe with just one column
v = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
     'start_column=column475;to_3=2020-09-07 10:29:34;']
from pyspark.sql.types import StringType
df = spark.createDataFrame(v, StringType())
df.show(truncate = False)

Pyspark select from empty dataframe throws exception

The question is similar to this question, but it had no answer. I have a dataframe from which I am selecting data if it exists:
schema = StructType([
    StructField("file_name", StringType(), True),
    StructField("result", ArrayType(StructType()), True),
])
df = rdd.toDF(schema=schema)
print((df.count(), len(df.columns)))  # 0, 2
df.cache()
df = df.withColumn('result', F.explode(df['result']))
get_doc_id = F.udf(lambda line: ntpath.basename(line).replace('_all.txt', ''), StringType())
df = df.filter(df.result.isNotNull()).select(F.lit(job_id).alias('job_id'),
                                             get_doc_id(df['file_name']).alias('doc_id'),
                                             df['result._2'].alias('line_content'),
                                             df['result._4'].alias('line1'),
                                             df['result._3'].alias('line2'))
The above throws an error when the dataframe is empty:
pyspark.sql.utils.AnalysisException: 'No such struct field _2 in ;
Shouldn't it only execute if the result column had data? And how can I overcome this?
Spark executes code lazily, so it won't check whether you have data in your filter condition. Your code fails in the analysis stage because you don't have a column named result._2 in your data. You are passing an empty StructType in your schema for the result column. You should update it to something like this:
schema = StructType([
    StructField("file_name", StringType(), True),
    StructField("result", ArrayType(StructType([
        StructField("line_content", StringType(), True),
        StructField("line1", StringType(), True),
        StructField("line2", StringType(), True)])), True)
])
df = spark.createDataFrame(sc.emptyRDD(), schema=schema)
df = df.withColumn('result', F.explode(df['result']))
get_doc_id = F.udf(lambda line: ntpath.basename(line).replace('_all.txt', ''), StringType())
df = df.filter(df.result.isNotNull()).select(F.lit('job_id').alias('job_id'),
                                             get_doc_id(df['file_name']).alias('doc_id'),
                                             df['result.line_content'].alias('line_content'),
                                             df['result.line1'].alias('line1'),
                                             df['result.line2'].alias('line2'))
The issue is that 'df' does not have '_2', so it ends up throwing errors like:
pyspark.sql.utils.AnalysisException: 'No such struct field _2 in ;
You can try checking whether the column exists with:
if '_2' not in result.columns:
    # Your code goes here
I would generally initialise the column with 0 or None if it does not exist, like:
from pyspark.sql.functions import lit

if '_2' not in result.columns:
    result = result.withColumn('_2', lit(0))
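A self-contained sketch of that pattern (the sample dataframe here is hypothetical, just to make the check runnable):
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe that is missing the '_2' column.
result = spark.createDataFrame([("a", "b")], ["_1", "_3"])

if '_2' not in result.columns:
    # Add the missing column with a default value so downstream selects don't fail.
    result = result.withColumn('_2', lit(0))

result.show()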

Spark: Why does Python read input files twice but Scala doesn't when creating a DataFrame?

I'm using Spark v1.5.2. I wrote a program in Python and I don't understand why it reads the input files twice. The same program written in Scala only reads the input files once.
I use an accumulator to count the number of times that map() is called. From the accumulator value, I infer the number of times the input file is read.
The input file contains 3 lines of text.
Python:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import *

def createTuple(record):  # used with map()
    global map_acc
    map_acc += 1
    return (record[0], record[1].strip())

sc = SparkContext(appName='Spark test app')  # appName is shown in the YARN UI
sqlContext = SQLContext(sc)
map_acc = sc.accumulator(0)
lines = sc.textFile("examples/src/main/resources/people.txt")
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple)  #.cache()
fieldNames = 'name age'
fields = [StructField(field_name, StringType(), True) for field_name in fieldNames.split()]
schema = StructType(fields)
df = sqlContext.createDataFrame(people_rdd, schema)
print 'record count DF:', df.count()
print 'map_acc:', map_acc.value
#people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 6 ##### why 6 instead of 3??
Scala:
import org.apache.spark._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object SimpleApp {
  def main(args: Array[String]) {
    def createTuple(record: Array[String], map_acc: Accumulator[Int]) = { // used with map()
      map_acc += 1
      Row(record(0), record(1).trim)
    }
    val conf = new SparkConf().setAppName("Scala Test App")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val map_acc = sc.accumulator(0)
    val lines = sc.textFile("examples/src/main/resources/people.txt")
    val people_rdd = lines.map(_.split(",")).map(createTuple(_, map_acc))
    val fieldNames = "name age"
    val schema = StructType(
      fieldNames.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val df = sqlContext.createDataFrame(people_rdd, schema)
    println("record count DF: " + df.count)
    println("map_acc: " + map_acc.value)
  }
}
$ spark-submit --class SimpleApp --master local[1] test.jar 2> err
record count DF: 3
map_acc: 3
If I remove the comments from the Python program and cache the RDD, then the input files are not read twice. However, I don't think I should have to cache the RDD, right? In the Scala version I don't need to cache the RDD.
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple).cache()
...
people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 3
$ hdfs dfs -cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19
It happens because in 1.5 createDataFrame eagerly validates the provided schema on a few elements:
elif isinstance(schema, StructType):
    # take the first few rows to verify schema
    rows = rdd.take(10)
    for row in rows:
        _verify_type(row, schema)
In contrast, current versions validate the schema for all elements, but it is done lazily, so you wouldn't see the same behavior. For example, this would fail immediately in 1.5:
from pyspark.sql.types import *
rdd = sc.parallelize([("foo", )])
schema = StructType([StructField("foo", IntegerType(), False)])
sqlContext.createDataFrame(rdd, schema)
but the 2.0 equivalent would fail only when you try to evaluate the DataFrame.
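A sketch of that lazy behavior (assuming a 2.x+ session; the error surfaces only once an action forces evaluation):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("foo", )])
schema = StructType([StructField("foo", IntegerType(), False)])

df = spark.createDataFrame(rdd, schema)  # no error raised here
df.collect()                             # schema verification fails only now, at evaluation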
In general you shouldn't expect that Python and Scala code will behave the same way unless you strictly limit yourself to interactions with the SQL API. PySpark:
- Implements almost all RDD methods natively, so the same chain of transformations can result in a different DAG.
- Interactions with the Java API may require an eager evaluation to provide type information for Java classes.
