How to force a Glue DynamicFrame to fail if data doesn't conform to the DataFrame schema? - apache-spark

I have a Glue job (running on Spark) that simply converts a CSV file to Parquet. I don't have control over the CSV data, and as a result I want to capture any inconsistency between the data and the table schema during the conversion to Parquet. For instance, if a column is defined as Integer, I want the job to give me an error if there is any string value in that column. Currently, DynamicFrame resolves this by giving choices (string and integer) in the resulting Parquet file, which is helpful for some use cases, but I'm wondering if there is any way to enforce the schema and have the Glue job throw an error if there is any inconsistency. Here is my code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = dbName, table_name = table, transformation_ctx = "datasource0")
df = datasource0.toDF()
df = df.coalesce(parquetFileCount)
df = convertColDataType(df, "timestamp", "timestamp", dbName, table)
applymapping1 = DynamicFrame.fromDF(df,glueContext,"finalDF")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": path}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

You can solve this by using the Spark-native reader instead of the Glue lib.
Instead of reading from the catalog, read from the corresponding S3 path with a custom schema and FAILFAST mode:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField('id', IntegerType(), True),
                     StructField('name', StringType(), True)])
df = spark.read.option('mode', 'FAILFAST').csv(s3Path, schema=schema)
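If you want to keep the rest of the Glue pipeline (the Parquet sink) as it is, you can convert the validated DataFrame back into a DynamicFrame and reuse the original write. Below is a minimal sketch, assuming the glueContext, s3Path and path variables from the snippets above. FAILFAST is one of Spark's three CSV parse modes (PERMISSIVE, DROPMALFORMED, FAILFAST); with an explicit schema it makes the read raise an exception on the first row that doesn't match, though because Spark is lazy the failure typically surfaces when an action such as the write runs.

from awsglue.dynamicframe import DynamicFrame

# Read with an explicit schema; FAILFAST aborts on the first malformed row.
df = spark.read.option('mode', 'FAILFAST').csv(s3Path, schema=schema)

# Hand the validated DataFrame back to Glue for the Parquet write.
dyf = DynamicFrame.fromDF(df, glueContext, "validated")
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": path},
    format="parquet")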

I had a similar data type issue that I was able to solve by importing another df that I knew was in the correct format. Then I looped over the columns of the two dfs and compared their data types. In this example I reformatted the data types where necessary:
from pyspark.sql.types import TimestampType, DoubleType, IntegerType

df1 = inputfile
df2 = target
if df1.schema != df2.schema:
    colnames = df2.schema.names
    for colname in colnames:
        df1DataType = get_dtype(df1, colname)
        df2DataType = get_dtype(df2, colname)
        if df1DataType != df2DataType:
            if df1DataType == 'timestamp':
                not_string = ''
                df2 = df2.withColumn(colname, df2[colname].cast(TimestampType()))
            elif df1DataType == 'double':
                not_string = ''
                df2 = df2.withColumn(colname, df2[colname].cast(DoubleType()))
            elif df1DataType == 'int':
                not_string = ''
                df2 = df2.withColumn(colname, df2[colname].cast(IntegerType()))
            else:
                not_string = 'not '
            print(not_string + 'updating: ' + colname + ' - from ' + df2DataType + ' to ' + df1DataType)
    target = df2
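The get_dtype helper isn't defined in the snippet above; a minimal version (my assumption, not part of the original answer) could simply look the type up in df.dtypes:

def get_dtype(df, colname):
    # df.dtypes is a list of (column name, type string) pairs, e.g. ('id', 'int')
    return dict(df.dtypes)[colname]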

Related

Dynamic dictionary in pyspark

I am trying to build a dictionary dynamically using PySpark, by reading the table structure from the Oracle database. Here's a simplified version of my code.
Predefined dictionary (convert_dict.py):
conversions = {
    "COL1": lambda c: f.col(c).cast("string"),
    "COL2": lambda c: f.from_unixtime(f.unix_timestamp(c, dateFormat)).cast("date"),
    "COL3": lambda c: f.from_unixtime(f.unix_timestamp(c, dateFormat)).cast("date"),
    "COL4": lambda c: f.col(c).cast("float")
}
Main program:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType
from convert_dict import conversions

spark = SparkSession.builder.appName("file_testing").getOrCreate()
table_name = "TEST_TABLE"
input_file_path = "file:\\\c:\Desktop\foo.txt"
sql_query = "(select listagg(column_name,',') within group(order by column_id) col from user_tab_columns where " \
            "table_name = '" + table_name + "' and column_name not in ('COL10', 'COL11','COL12') order by column_id) table_columns"

struct_schema = StructType([
    StructField("COL1", StringType(), True),
    StructField("COL2", StringType(), True),
    StructField("COL3", StringType(), True),
    StructField("COL4", StringType(), True),
])

data_df = spark.read.schema(struct_schema).option("sep", ",").option("header", "true").csv(input_file_path)

# Flag values whose conversion returns null even though the raw column is not null
validateData = data_df.withColumn(
    "dataTypeValidations",
    f.concat_ws(",",
        *[
            f.when(
                v(k).isNull() & f.col(k).isNotNull(),
                f.lit(k + " not valid")
            ).otherwise(f.lit("None"))
            for k, v in conversions.items()
        ]
    )
)
data_temp = validateData
for k, v in conversions.items():
    data_temp = data_temp.withColumn(k, v(k))
validateData.show()
spark.stop()
If I change the above code to dynamically generate the dictionary from the database:
DATEFORMAT = "yyyyMMdd"
dict_sql = """
(select column_name, case when data_type = 'VARCHAR2' then 'string' when data_type in ('DATE','TIMESTAMP(6)') then 'date' when data_type = 'NUMBER' and NVL(DATA_SCALE,0) <> 0 then 'float' when data_type = 'NUMBER' and NVL(DATA_SCALE,0) = 0 then 'int'
end d_type from user_tab_columns where table_name = 'TEST_TABLE' and column_name not in ('COL10', 'COL11','COL12')) dict
"""
column_df = spark.read.format("jdbc").option("url", url).option("dbtable", dict_sql)\
    .option("user", user).option("password", password).option("driver", driver).load()

conversions = {}
for row in column_df.rdd.collect():
    column_name = row.COLUMN_NAME
    column_type = row.D_TYPE
    if column_type == "date":
        conversions.update({column_name: lambda c: f.col(c)})
    elif column_type == "float":
        conversions.update({column_name: lambda c: f.col(c).cast("float")})
    elif column_type == "date":
        conversions.update({column_name: lambda c: f.from_unixtime(f.unix_timestamp(c, DATEFORMAT)).cast("date")})
    elif column_type == "int":
        conversions.update({column_name: lambda c: f.col(c).cast("int")})
    else:
        conversions.update({column_name: lambda c: f.col(c)})
The conversion of data types doesn't work when the above dynamically generated dictionary is used. For example, if "COL2" contains "20210731", the resulting data from the above code stays the same, i.e. it doesn't get converted to the correct date format, whereas the predefined dictionary works correctly.
Am I missing something here, or is there a better way to implement dynamically generated dictionaries in PySpark?
I had a rookie mistake in my code: in the if-then-else block I had two separate branches for column_type == "date", so the first one (which applies no conversion) always matched and the date cast was never reached.
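For reference, a corrected version of that block might look like the following. This is only a sketch of the fix described above, with the duplicate "date" test removed so the date branch actually applies the cast:

if column_type == "date":
    conversions.update({column_name: lambda c: f.from_unixtime(f.unix_timestamp(c, DATEFORMAT)).cast("date")})
elif column_type == "float":
    conversions.update({column_name: lambda c: f.col(c).cast("float")})
elif column_type == "int":
    conversions.update({column_name: lambda c: f.col(c).cast("int")})
else:
    # VARCHAR2 and anything unmapped stay as plain string columns
    conversions.update({column_name: lambda c: f.col(c)})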

Dataframe TypeError cannot accept object

I have a list of strings in Python as follows:
['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
'start_column=column475;to_3=2020-09-07 10:29:34;']
I am trying to convert it into a dataframe in the following way:
schema = StructType([
    StructField('Rows', ArrayType(StringType()), True)
])
rdd = sc.parallelize(test_list)
query_data = spark.createDataFrame(rdd,schema)
print(query_data.schema)
query_data.show()
I am getting the following error:
TypeError: StructType can not accept object
You just need to pass that as a list while creating the dataframe as below ...
a_list = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
'start_column=column475;to_3=2020-09-07 10:29:34;']
sparkdf = spark.createDataFrame([a_list],["col1", "col2"])
sparkdf.show(truncate=False)
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
|col1 |col2 |
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
|start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;|start_column=column475;to_3=2020-09-07 10:29:34;|
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
You should use schema = StringType() because your rows contain strings rather than structs of strings.
I have two possible solutions for you.
SOLUTION 1: Assuming you wanted a dataframe with just one row
I was able to make it work by wrapping the values in test_list in parentheses and using StringType.
v = [('start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
      'start_column=column475;to_3=2020-09-07 10:29:34;')]
schema = StructType([
    StructField('col_1', StringType(), True),
    StructField('col_2', StringType(), True),
])
rdd = sc.parallelize(v)
query_data = spark.createDataFrame(rdd, schema)
print(query_data.schema)
query_data.show(truncate=False)
SOLUTION 2: Assuming you wanted a dataframe with just one column
from pyspark.sql.types import StringType

v = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
     'start_column=column475;to_3=2020-09-07 10:29:34;']
df = spark.createDataFrame(v, StringType())
df.show(truncate=False)

display DataFrame when using pyspark aws glue

How can I show the DataFrame in an AWS Glue ETL job?
I tried the code below, but it doesn't display anything:
df.show()
Code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "flux-test", table_name = "tab1", transformation_ctx = "datasource0")
sourcedf = ApplyMapping.apply(frame = datasource0, mappings = [("id", "long", "id", "long"), ("Rd.Id_Releve", "string", "Rd.Id_R", "string")])
sourcedf = sourcedf.toDF()
data = []
schema = StructType([
    StructField('PM',
        StructType([
            StructField('Pf', StringType(), True),
            StructField('Rd', StringType(), True)
        ])
    ),
])
cibledf = sqlCtx.createDataFrame(data, schema)
cibledf = sqlCtx.createDataFrame(sourcedf.rdd.map(lambda x: Row(PM=Row(Pf=str(x.id_prm), Rd=None))), schema)
print(cibledf.show())
job.commit()
In your Glue console, after you run your Glue job, the job listing has a column for Logs / Error logs.
Click on the Logs and this will take you to the CloudWatch logs associated with your job. Browse through them for the print statement.
Also please check here: Convert dynamic frame to a dataframe and do show()
Added a working/test code sample.
Code sample:
zipcode_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
    database = "customer_db",
    table_name = "zipcode_master")
zipcode_dynamicframe.printSchema()
zipcode_dynamicframe.toDF().show(10)
Screenshot of zipcode_dynamicframe.show() in the CloudWatch log (screenshot not reproduced here).

Generating multiple columns dynamically using loop in pyspark dataframe

I have a requirement where I have to generate multiple columns dynamically in PySpark. I have written code similar to the below to accomplish this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit

sc = SparkContext()
sqlContext = SQLContext(sc)
cols = ['a', 'b', 'c']
df = sqlContext.read.option("header", "true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    df1 = df.withColumn(i, lit('hi'))
df1.show()
However, I am missing out columns a and b in the final result. Please help.
I changed the code as below. It's working now, but I wanted to know if there is a better way of handling it (a simpler alternative is sketched after the code):
cols = ['a', 'b', 'c']
cols_add = []
flg_first = 'Y'
df = sqlContext.read.option("header", "true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    print('start' + str(df.columns))
    if flg_first == 'Y':
        df1 = df.withColumn(i, lit('hi'))
        cols_add.append(i)
        flg_first = 'N'
    else:
        df1 = df1.select(df.columns + cols_add).withColumn(i, lit('hi'))
        cols_add.append(i)
    print('end' + str(df1.columns))
df1.show()
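As an aside (this is not from the original post), the flag-and-select bookkeeping isn't needed: the first version of the loop only lost columns a and b because each iteration started again from df, so only the last withColumn survived. Reassigning the same variable keeps every generated column. A minimal sketch, assuming the same cols list and input file:

from pyspark.sql.functions import lit

cols = ['a', 'b', 'c']
df = sqlContext.read.option("header", "true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    # reassign df so each added column is kept for the next iteration
    df = df.withColumn(i, lit('hi'))
df.show()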

PySpark: Replace Punctuations with Space Looping Through Columns

I have the following code running successfully in PySpark:
def pd(data):
    df = data
    df = df.select('oproblem')
    text_col = ['oproblem']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
But when I add a second column in and try to loop it, it doesn't work:
def pd(data):
    df = data
    df = df.select('oproblem', 'lca')
    text_col = ['oproblem', 'lca']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
Below is the error I get:
TypeError: 'Column' object is not callable
I think it should be df = df.select(['oproblem', 'lca']) instead of df = df.select('oproblem', 'lca').
Better yet, for code quality purposes, have the select statement use the text_col variable, so you only have to change one line of code if you need to do this with more columns or if your column names change. E.g.:
def pd(data):
    df = data
    text_col = ['oproblem', 'lca']
    df = df.select(text_col)
    ....
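Putting that together with the loop from the question, a complete version of the function might look like this (a sketch reusing the question's regex, not the answerer's exact code):

from pyspark.sql import functions as F

def pd(data):
    text_col = ['oproblem', 'lca']
    df = data.select(text_col)
    for i in text_col:
        # lowercase, then replace punctuation with a space
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df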
