How to print a "dictionary" of StringType() in the form of a table with PySpark - Azure

Hi,
The above column is part of a table that I am working with in Databricks. I want to turn the "ecommerce" column into a table of its own, so that "detail", "products", etc. become the columns of a new table. Currently "ecommerce" is a StringType.
I have tried Spark dictionary creation, tabulate and other methods, but without success.
The code that I currently have is:
def ecommerce_wtchk_dlt():
    df = dlt.read_stream("wtchk_dlt")
    ddf = df.select(col("ecommerce"))
    header = ddf[0].keys()
    rows = [x.values() for x in ddf]
    dddf = tabulate.tabulate(rows, header)
    return dddf
Whenever I try to force the type of "ecommerce" to MapType, I get an error saying that, since the original data source is StringType, I can only use that same type.

I reproduced the above and was able to achieve your requirement by using from_json, json_tuple and explode.
This is my sample data with the same format as yours.
Code:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
# Pull the "detail" JSON out of "ecommerce", then "products" out of "detail"
df2 = df.select(F.json_tuple(df["ecommerce"], "detail")).toDF("detail") \
    .select(F.json_tuple(F.col("detail"), "products")).toDF("products")
print("products : ")
df2.show()
# "products" is a JSON array of objects; every field is read as a string
schema = ArrayType(StructType([
    StructField("name", StringType()),
    StructField("id", StringType()),
    StructField("price", StringType()),
    StructField("brand", StringType()),
    StructField("category", StringType()),
    StructField("variant", StringType())
]))
# Parse the array, emit one row per product, then flatten the struct into columns
final_df = (
    df2.withColumn("products", F.from_json("products", schema))
    .select(F.explode("products").alias("products"))
    .select("products.*")
)
print("Final dataframe : ")
final_df.show()
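As a rough plain-Python picture of what the from_json / explode / select("products.*") chain does to a single value, here is a small sketch. It is not Spark code, and the sample JSON and the helper name explode_products are made up for illustration, since the real data is not shown here.

```python
import json
from typing import Dict, List, Optional

# Hypothetical value of the "ecommerce" string column (invented sample data).
ecommerce = json.dumps({
    "detail": {
        "products": [
            {"name": "shirt", "id": "1", "price": "9.99",
             "brand": "acme", "category": "tops", "variant": "blue"},
            {"name": "hat", "id": "2", "price": "4.50",
             "brand": "acme", "category": "caps", "variant": "red"},
        ]
    }
})

# Same field names as the schema in the answer above.
FIELDS = ["name", "id", "price", "brand", "category", "variant"]


def explode_products(raw: str) -> List[Dict[str, Optional[str]]]:
    """Parse the JSON string and return one flat row per product."""
    parsed = json.loads(raw)
    products = parsed.get("detail", {}).get("products", [])
    # Missing keys become None, similar to how from_json yields null for absent fields.
    return [{field: product.get(field) for field in FIELDS} for product in products]


rows = explode_products(ecommerce)
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the final dataframe, with the keys playing the role of the column names.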

How to reliably obtain partition columns of delta table

I need to obtain the partitioning columns of a Delta table, but DESCRIBE delta.`my_table` returns different results on Databricks and locally in PyCharm.
Minimal example:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
delta_table_path = "c:/temp_delta_table"
partition_column = ["rs_nr"]
schema = StructType([
    StructField("rs_nr", StringType(), False),
    StructField("event_category", StringType(), True),
    StructField("event_counter", IntegerType(), True)])
data = [{'rs_nr': '001', 'event_category': 'event_01', 'event_counter': 1},
        {'rs_nr': '002', 'event_category': 'event_02', 'event_counter': 2},
        {'rs_nr': '003', 'event_category': 'event_03', 'event_counter': 3},
        {'rs_nr': '004', 'event_category': 'event_04', 'event_counter': 4}]
sdf = spark.createDataFrame(data=data, schema=schema)
sdf.write.format("delta").mode("overwrite").partitionBy(partition_column).save(delta_table_path)
df_descr = spark.sql(f"DESCRIBE delta.`{delta_table_path}`")
df_descr.toPandas()
On Databricks, this shows the partition column(s):
                  col_name  data_type  comment
0                    rs_nr     string     None
1           event_category     string     None
2            event_counter        int     None
3  # Partition Information
4               # col_name  data_type  comment
5                    rs_nr     string     None
But when running this locally in PyCharm, I get the following different output:
         col_name  data_type  comment
0           rs_nr     string
1  event_category     string
2   event_counter        int
3
4  # Partitioning
5          Part 0      rs_nr
Parsing both types of return value seems ugly to me, so is there a reason that this is returned like this?
Setup:
In PyCharm:
pyspark = 3.2.3
delta-spark = 2.0.0
In Databricks:
DBR 11.3 LTS
Spark = 3.3.0 (I just noticed that this differs; I will test whether 3.3.0 works locally in the meantime)
Scala = 2.12
In PyCharm, I create the connection using:
def get_spark():
    spark = SparkSession.builder.appName('schema_checker')\
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")\
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")\
        .config("spark.sql.catalogImplementation", "in-memory")\
        .getOrCreate()
    return spark
If you're using Python, then instead of executing a SQL command whose output is harder to parse, it's better to use the Python API. The DeltaTable instance has a detail function that returns a DataFrame with details about the table, and this DataFrame has a partitionColumns column, which is an array of strings holding the partition column names. So you can just do:
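from delta.tables import DeltaTable
# detail() returns a one-row DataFrame describing the table
detailDF = DeltaTable.forPath(spark, delta_table_path).detail()
# partitionColumns is an array of strings; read it from that single row
partitions = detailDF.select("partitionColumns").collect()[0][0]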
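If the DESCRIBE output has to be parsed anyway (for example where detail() is not an option), a minimal sketch that copes with the two layouts quoted in the question could look like the following. The function name partition_columns_from_describe is hypothetical, the rows are assumed to be (col_name, data_type, comment) tuples such as those obtained from df_descr.collect(), and layouts other than the two shown above are not covered.

```python
from typing import Iterable, List, Optional, Sequence

# Section headers seen in the two outputs quoted in the question.
PARTITION_HEADERS = ("# Partition Information", "# Partitioning")


def partition_columns_from_describe(
    rows: Iterable[Sequence[Optional[str]]],
) -> List[str]:
    """Extract partition column names from (col_name, data_type, comment) rows."""
    partitions: List[str] = []
    in_partition_block = False
    for col_name, data_type, _comment in rows:
        name = (col_name or "").strip()
        if name in PARTITION_HEADERS:
            in_partition_block = True
            continue
        if not in_partition_block:
            continue
        if name == "# col_name":
            # Repeated header row in the Databricks layout.
            continue
        if name == "" or name.startswith("#"):
            # A blank row or a new section ends the partition block.
            break
        if name.startswith("Part "):
            # Local layout: "Part 0" | "rs_nr"
            partitions.append((data_type or "").strip())
        else:
            # Databricks layout: "rs_nr" | "string"
            partitions.append(name)
    return partitions


databricks_rows = [
    ("rs_nr", "string", None),
    ("event_category", "string", None),
    ("event_counter", "int", None),
    ("# Partition Information", "", ""),
    ("# col_name", "data_type", "comment"),
    ("rs_nr", "string", None),
]
local_rows = [
    ("rs_nr", "string", ""),
    ("event_category", "string", ""),
    ("event_counter", "int", ""),
    ("", "", ""),
    ("# Partitioning", "", ""),
    ("Part 0", "rs_nr", ""),
]

print(partition_columns_from_describe(databricks_rows))
print(partition_columns_from_describe(local_rows))
```

Both calls return the same list for the sample rows, which is the point of normalising the two layouts in one place.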

Access accumulator value after using it in user defined function within df.withColumn in Palantir Foundry

I am trying to use a custom accumulator within Palantir Foundry to aggregate data within a user-defined function that is applied to each row of a dataframe via df.withColumn(...).
From the resulting dataframe, I can see that the accumulator value is incremented as expected. However, the value of the accumulator variable itself in the script does not change during execution.
I see that the Python id of the accumulator variable in the script differs from the Python id of the accumulator within the user-defined function. But that might be expected...
How do I access the accumulator value, whose incrementation can be observed in the resulting dataframe column, from within the calling script after execution? This is the information I am looking for.
from transforms.api import transform_df, Input, Output
import numpy as np
from pyspark.accumulators import AccumulatorParam
from pyspark.sql.functions import udf, struct
global accum
@transform_df(
    Output("ri.foundry.main.dataset.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"),
)
def compute(ctx):
    from pyspark.sql.types import StructType, StringType, IntegerType, StructField
    data2 = [("James","","Smith","36636","M",3000),
        ("Michael","Rose","","40288","M",4000),
        ("Robert","","Williams","42114","M",4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown","","F",-1)
    ]
    schema = StructType([ \
        StructField("firstname",StringType(),True), \
        StructField("middlename",StringType(),True), \
        StructField("lastname",StringType(),True), \
        StructField("id", StringType(), True), \
        StructField("gender", StringType(), True), \
        StructField("salary", IntegerType(), True) \
    ])
    df = ctx.spark_session.createDataFrame(data=data2, schema=schema)
    ####################################
    class AccumulatorNumpyArray(AccumulatorParam):
        def zero(self, zero: np.ndarray):
            return zero
        def addInPlace(self, v1, v2):
            return v1 + v2
    # from pyspark.context import SparkContext
    # sc = SparkContext.getOrCreate()
    sc = ctx.spark_session.sparkContext
    shape = 3
    global accum
    accum = sc.accumulator(
        np.zeros(shape, dtype=np.int64),
        AccumulatorNumpyArray(),
    )
    def func(row):
        global accum
        accum += np.ones(shape)
        return str(accum) + '_' + str(id(accum))
    user_defined_function = udf(func, StringType())
    new = df.withColumn("processed", user_defined_function(struct([df[col] for col in df.columns])))
    new.show(2)
    print(accum)
    return df
results in
+---------+----------+--------+-----+------+------+--------------------+
|firstname|middlename|lastname|   id|gender|salary|           processed|
+---------+----------+--------+-----+------+------+--------------------+
|    James|          |   Smith|36636|     M|  3000|[1. 1. 1.]_140388...|
|  Michael|      Rose|        |40288|     M|  4000|[2. 2. 2.]_140388...|
+---------+----------+--------+-----+------+------+--------------------+
only showing top 2 rows
and
> accum
Accumulator<id=0, value=[0 0 0]>
> id(accum)
140574405092256
If the Foundry boilerplate is removed, resulting in
import numpy as np
from pyspark.accumulators import AccumulatorParam
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StructType, StringType, IntegerType, StructField
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
spark = (
    SparkSession.builder.appName("Python Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
# ctx = spark.sparkContext.getOrCreate()
data2 = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]
schema = StructType(
    [
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True),
        StructField("lastname", StringType(), True),
        StructField("id", StringType(), True),
        StructField("gender", StringType(), True),
        StructField("salary", IntegerType(), True),
    ]
)
# df = ctx.spark_session.createDataFrame(data=data2, schema=schema)
df = spark.createDataFrame(data=data2, schema=schema)
####################################
class AccumulatorNumpyArray(AccumulatorParam):
    def zero(self, zero: np.ndarray):
        return zero
    def addInPlace(self, v1, v2):
        return v1 + v2
sc = SparkContext.getOrCreate()
shape = 3
global accum
accum = sc.accumulator(
    np.zeros(shape, dtype=np.int64),
    AccumulatorNumpyArray(),
)
def func(row):
    global accum
    accum += np.ones(shape)
    return str(accum) + "_" + str(id(accum))
user_defined_function = udf(func, StringType())
new = df.withColumn(
    "processed", user_defined_function(struct([df[col] for col in df.columns]))
)
new.show(2, False)
print(id(accum))
print(accum)
the output obtained within a regular Python environment with pyspark version 3.3.1 on Ubuntu meets the expectations and is
+---------+----------+--------+-----+------+------+--------------------------+
|firstname|middlename|lastname|id   |gender|salary|processed                 |
+---------+----------+--------+-----+------+------+--------------------------+
|James    |          |Smith   |36636|M     |3000  |[1. 1. 1.]_139642682452576|
|Michael  |Rose      |        |40288|M     |4000  |[1. 1. 1.]_139642682450224|
+---------+----------+--------+-----+------+------+--------------------------+
only showing top 2 rows
140166944013424
[3. 3. 3.]
The code outside the transform runs in a different environment from the code inside your transform. When you commit, the checks run the code outside the transform to generate the jobspec, which is technically your executable transform. You can find it in the "details" of your dataset after the checks pass.
The logic within your transform is then detached and runs in isolation each time you hit build. The global accum you define outside the transform is never run and doesn't exist when the code inside compute is running.
global accum  # <-- runs in checks
@transform_df(
    Output("ri.foundry.main.dataset.c0d4fc0c-bb1d-4c7b-86ce-a13ec6666490"),
)
def compute(ctx):
    ...  # some logic  <-- runs during build
The prints in your second code example happen after the df is processed, because new.show(2, False) asks Spark to compute it. The print in the first example happens before the df is processed, since the computation only happens after your return df.
If you want to print after your df is computed, you can use @transform(... instead of @transform_df(... and print after writing the dataframe contents. It should look something like this:
@transform(
    output=Output("ri.foundry.main.dataset.c0d4fc0c-bb1d-4c7b-86ce-a13ec6666490"),
)
def compute(ctx, output):
    df = ...  # some logic
    output.write_dataframe(df)  # please check the function name; I think it was write_dataframe, but I may be wrong
    print(accum)
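The ordering argument above can be illustrated with a plain-Python analogy. This is not Spark or Foundry code: a generator stands in for the lazily evaluated dataframe, a dictionary stands in for the accumulator, and the name lazily_processed is made up for this sketch.

```python
from typing import Dict, Iterable, Iterator


def lazily_processed(rows: Iterable[str], counter: Dict[str, int]) -> Iterator[str]:
    """Yield processed rows; the counter is only touched when a row is pulled."""
    for row in rows:
        counter["seen"] += 1  # side effect, analogous to accum += ...
        yield f"{row}_{counter['seen']}"


counter = {"seen": 0}
pipeline = lazily_processed(["James", "Michael", "Robert"], counter)

# Defining the pipeline has run nothing yet, like printing accum
# before anything has forced the dataframe to be computed.
value_before = counter["seen"]

# Pulling two rows triggers the work, like new.show(2).
first_two = [next(pipeline), next(pipeline)]
value_after = counter["seen"]

print(value_before, first_two, value_after)
```

The counter only changes once rows are actually pulled, which is why a print placed before the computation sees the initial value.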

Dataframe TypeError cannot accept object

I have a list of strings in Python as follows:
['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
 'start_column=column475;to_3=2020-09-07 10:29:34;']
I am trying to convert it into a dataframe in the following way:
schema = StructType([
    StructField('Rows', ArrayType(StringType()), True)
])
rdd = sc.parallelize(test_list)
query_data = spark.createDataFrame(rdd,schema)
print(query_data.schema)
query_data.show()
I am getting the following error:
TypeError: StructType can not accept object
You just need to pass that as a list while creating the dataframe, as below:
a_list = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
          'start_column=column475;to_3=2020-09-07 10:29:34;']
sparkdf = spark.createDataFrame([a_list],["col1", "col2"])
sparkdf.show(truncate=False)
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
|col1 |col2 |
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
|start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;|start_column=column475;to_3=2020-09-07 10:29:34;|
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
You should use schema = StringType() because your rows contain strings rather than structs of strings.
I have two possible solutions for you.
SOLUTION 1: Assuming you wanted a dataframe with just one row
I was able to make it work by wrapping the values in test_list in parentheses and using StringType.
v = [('start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
      'start_column=column475;to_3=2020-09-07 10:29:34;')]
schema = StructType([
    StructField('col_1', StringType(), True),
    StructField('col_2', StringType(), True),
])
rdd = sc.parallelize(v)
query_data = spark.createDataFrame(rdd,schema)
print(query_data.schema)
query_data.show(truncate = False)
SOLUTION 2: Assuming you wanted a dataframe with just one column
v = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
     'start_column=column475;to_3=2020-09-07 10:29:34;']
from pyspark.sql.types import StringType
df = spark.createDataFrame(v, StringType())
df.show(truncate = False)
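If the StructType schema should be kept, another option is to give each record the shape of a row: with a StructType, createDataFrame expects every element to be a tuple, list, dict or Row, not a bare string. The sketch below wraps each string in a 1-tuple; the helper names are made up for illustration, the sample strings are shortened, `spark` is assumed to be an existing SparkSession, and the column is declared as a plain StringType rather than the ArrayType from the question.

```python
from typing import List, Tuple

# Shortened versions of the strings from the question.
test_list = [
    "start_column=column123;to_3=2020-09-07 10:29:24;",
    "start_column=column475;to_3=2020-09-07 10:29:34;",
]


def as_single_field_rows(values: List[str]) -> List[Tuple[str]]:
    """Wrap each string in a 1-tuple so it matches a one-field StructType."""
    return [(value,) for value in values]


def to_dataframe(spark, values: List[str]):
    """Build a one-column DataFrame from the strings (needs a SparkSession)."""
    # Imported lazily so the helper above can be tried without PySpark installed.
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType([StructField("Rows", StringType(), True)])
    return spark.createDataFrame(as_single_field_rows(values), schema)


rows = as_single_field_rows(test_list)
print(rows)
```

Calling to_dataframe(spark, test_list) would then give one row per string in a single column named Rows.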

Pyspark select from empty dataframe throws exception

The question is similar to this question, which had no answer. I have a dataframe from which I am selecting data, if it exists:
schema = StructType([
    StructField("file_name", StringType(), True),
    StructField("result", ArrayType(StructType()), True),
])
df = rdd.toDF(schema=schema)
print((df.count(), len(df.columns))) # 0,2
df.cache()
df = df.withColumn('result', F.explode(df['result']))
get_doc_id = F.udf(lambda line: ntpath.basename(line).replace('_all.txt', ''), StringType())
df = df.filter(df.result.isNotNull()).select(F.lit(job_id).alias('job_id'),
                                             get_doc_id(df['file_name']).alias('doc_id'),
                                             df['result._2'].alias('line_content'),
                                             df['result._4'].alias('line1'),
                                             df['result._3'].alias('line2'))
The above throws an error when the dataframe is empty:
pyspark.sql.utils.AnalysisException: 'No such struct field _2 in ;
Shouldn't it only execute if the result column has data? And how can I overcome this?
Spark executes code lazily, so it won't check whether your filter condition matches any data. Your code fails in the analysis stage because there is no field named result._2 in your schema: you are passing an empty StructType for the result column. You should update it to something like this:
schema = StructType([
    StructField("file_name", StringType(), True),
    StructField("result", ArrayType(StructType([
        StructField("line_content", StringType(), True),
        StructField("line1", StringType(), True),
        StructField("line2", StringType(), True)])), True)
])
df = spark.createDataFrame(sc.emptyRDD(),schema=schema)
df = df.withColumn('result', F.explode(df['result']))
get_doc_id = F.udf(lambda line: ntpath.basename(line).replace('_all.txt', ''), StringType())
df = df.filter(df.result.isNotNull()).select(F.lit('job_id').alias('job_id'),
                                             get_doc_id(df['file_name']).alias('doc_id'),
                                             df['result.line_content'].alias('line_content'),
                                             df['result.line1'].alias('line1'),
                                             df['result.line2'].alias('line2'))
The issue is that 'df' does not have '_2', so it ends up throwing errors like:
pyspark.sql.utils.AnalysisException: 'No such struct field _2 in ;
You can try checking whether the column exists with
if '_2' not in result.columns:
    pass  # Your code goes here
I would generally initialise the column with 0 or None if it does not exist, like
from pyspark.sql.functions import lit
if '_2' not in result.columns:
    result = result.withColumn('_2', lit(0))
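Since _2 in the question is a nested struct field rather than a top-level column, an existence check may need to look inside the schema. A minimal sketch, assuming the schema is available as the dictionary that df.schema.jsonValue() returns; the dictionary below is a hand-written stand-in for the question's schema, and the helper name nested_field_names is made up:

```python
from typing import Any, Dict, List


def nested_field_names(schema_json: Dict[str, Any], column: str) -> List[str]:
    """Return the struct field names under `column` (struct or array of struct)."""
    for field in schema_json.get("fields", []):
        if field["name"] != column:
            continue
        dtype = field["type"]
        # Unwrap array<...> to reach the element type.
        if isinstance(dtype, dict) and dtype.get("type") == "array":
            dtype = dtype["elementType"]
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            return [inner["name"] for inner in dtype["fields"]]
        return []
    return []


# Stand-in for the question's schema: `result` is ArrayType(StructType())
# with no fields declared.
empty_schema = {
    "type": "struct",
    "fields": [
        {"name": "file_name", "type": "string", "nullable": True, "metadata": {}},
        {
            "name": "result",
            "type": {
                "type": "array",
                "elementType": {"type": "struct", "fields": []},
                "containsNull": True,
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

has_field = "_2" in nested_field_names(empty_schema, "result")
print(has_field)
```

With the empty StructType the check is negative, which matches the analysis error: the field is missing from the schema regardless of whether any rows exist.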

PySpark: data doesn't always conform to schema - logic to alter data

I'm new to PySpark and am working on a script that reads from .csv files.
I've explicitly defined the schema below, and the script works perfectly... most of the time.
The issue is that, on occasion, a value that does not conform to the schema enters the files, e.g. '-' might appear in an integer field, and hence we get a type error. The error is thrown when df1.show() is reached in the script.
I'm trying to think of a way to effectively say: if the value does not match the defined datatype, then replace it with ''.
Does anyone know if this may be possible? Any advice would be great!
from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
from pyspark.sql import *
from pyspark.sql.functions import *
from datetime import datetime
#create a context that supports hive
def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://serverip:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
    dt_now = datetime.now()
    today_unixtime = int(dt_now.strftime('%s'))
    today_date = datetime.fromtimestamp(today_unixtime).strftime('%Y%m%d')
    twoday_unixtime = int(dt_now.strftime('%s')) - 24*60*60*2
    twoday = datetime.fromtimestamp(twoday_unixtime).strftime('%Y%m%d')
    hourago = int(dt_now.strftime('%s')) - 60*60*4
    hrdate = datetime.fromtimestamp(hourago).strftime('%H')
    schema = [\
        StructField('field1', StringType(), True),\
        StructField('field2',StringType(), True), \
        StructField('field3',IntegerType(), True) \
    ]
    final_structure = StructType(schema)
    df1 = spark_session.read\
        .option("header","false")\
        .option("delimiter", "\t")\
        .csv('hdfs://hdfspath/dt=%s/*/*/*' %today_date, final_structure)
    usercatschema = [\
        StructField('field1', StringType(), True),\
        StructField('field2',StringType(), True), \
        StructField('field3',StringType(), True) \
    ]
    usercat_structure = StructType(usercatschema)
    df2 = spark_session.read\
        .option("header","false")\
        .option("delimiter", "\t")\
        .csv('hdfs://hdfspath/v0/dt=%s/*' %twoday, usercat_structure)
    df1.show()
    df2.show()
    df1.createOrReplaceTempView("dpi")
    df2.createOrReplaceTempView("usercat")
    finaldf = spark_session.sql('''
    SQL QUERY
    ''')
    finaldf.coalesce(10).write.format("com.databricks.spark.csv").option("header", "true").option('sep', '\t').mode('append').save('hdfs://hdfs path')
Read it as a string type and then cast it to int:
df = df.withColumn("field3", df.field3.cast("int"))
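To make the intended rule concrete, here is a small sketch. The first helper is a plain-Python picture of the rule (values that do not parse become None, since an integer column cannot hold ''); the second wraps the cast from the answer. Both helper names are made up, and the Spark part assumes field3 was read as StringType and that ANSI mode is off; as far as I know, with spark.sql.ansi.enabled set to true the cast raises an error instead of returning null.

```python
from typing import List, Optional


def int_or_none(raw: Optional[str]) -> Optional[int]:
    """Plain-Python picture of the rule: unparseable values become None."""
    try:
        return int(raw.strip())
    except (AttributeError, ValueError):
        # AttributeError covers None input, ValueError covers '-' or ''.
        return None


def cast_field3(df):
    """Apply the answer's cast; `df` must have field3 read as StringType."""
    # Imported lazily so the helper above can be tried without PySpark installed.
    from pyspark.sql import functions as F

    return df.withColumn("field3", F.col("field3").cast("int"))


samples: List[Optional[str]] = ["42", "-", "", None, " 7 "]
converted = [int_or_none(value) for value in samples]
print(converted)  # [42, None, None, None, 7]
```

In the script above, this would mean declaring field3 as StringType in the schema for df1 and calling cast_field3(df1) before df1.show().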
