display DataFrame when using pyspark aws glue

How can I show a DataFrame in an AWS Glue ETL job?
I tried df.show() in the code below, but it doesn't display anything.
Code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "flux-test", table_name = "tab1", transformation_ctx = "datasource0")
sourcedf = ApplyMapping.apply(frame = datasource0, mappings = [("id", "long", "id", "long"), ("Rd.Id_Releve", "string", "Rd.Id_R", "string")])
sourcedf = sourcedf.toDF()

data = []
schema = StructType([
    StructField('PM', StructType([
        StructField('Pf', StringType(), True),
        StructField('Rd', StringType(), True)
    ])),
])

cibledf = sqlCtx.createDataFrame(data, schema)
cibledf = sqlCtx.createDataFrame(sourcedf.rdd.map(lambda x: Row(PM=Row(Pf=str(x.id_prm), Rd=None))), schema)
print(cibledf.show())
job.commit()

In your Glue console, after you run your Glue job, the job listing has a column for Logs / Error logs.
Click on Logs and it will take you to the CloudWatch logs associated with your job. Browse through them for the output of your print statement.
Also see: Convert dynamic frame to a dataframe and do show().
Added a working/test code sample:
zipcode_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
    database = "customer_db",
    table_name = "zipcode_master")
zipcode_dynamicframe.printSchema()
zipcode_dynamicframe.toDF().show(10)
Screenshot of the zipcode_dynamicframe.toDF().show(10) output in the CloudWatch log:

Related

how to print a "dictionary" of StringType() in the form of a table with pyspark

Hi,
The column above is part of a table that I am working with in Databricks. What I want to do is turn the "ecommerce" column into a table of its own; in this case, I would have a new table with "detail", "products", etc. as columns. Currently "ecommerce" is a StringType.
I have tried Spark dictionary creation, tabulate and other methods, but without success.
The code I currently have is:
def ecommerce_wtchk_dlt():
    df = dlt.read_stream("wtchk_dlt")
    ddf = df.select(col("ecommerce"))
    header = ddf[0].keys()
    rows = [x.values() for x in ddf]
    dddf = tabulate.tabulate(rows, header)
    return dddf
Whenever I try to force the type of ecommerce to MapType, I get an error saying that since the original data source is StringType I can only use that same type.
I have reproduced the above and was able to achieve your requirement by using from_json, json_tuple and explode.
This is my sample data, in the same format as yours.
Code:
from pyspark.sql import functions as F
from pyspark.sql.types import *
df2 = df.select(F.json_tuple(df["ecommerce"], "detail")).toDF("detail") \
        .select(F.json_tuple(F.col("detail"), "products")).toDF("products")
print("products : ")
df2.show()
schema = ArrayType(StructType([
    StructField("name", StringType()),
    StructField("id", StringType()),
    StructField("price", StringType()),
    StructField("brand", StringType()),
    StructField("category", StringType()),
    StructField("variant", StringType())
]))
final_df = df2.withColumn("products", F.from_json("products", schema)) \
    .select(F.explode("products").alias("products")) \
    .select("products.*")
print("Final dataframe : ")
final_df.show()
My Result:

Why does Spark SQL require double percent symbols for like conditions?

When I use a LIKE condition in Spark SQL, it seems to require two percent symbols, %%.
However, I could not find any documentation on this in the Spark SQL docs, and I am curious what in my setup might be causing this requirement.
https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-like.html
Example data
product_table
id        product_type                region  location  measurement
43635665  ORANGE - Blood Orange       EU      FRA       30.5
78960788  APPLE GrannySmith           NA      USA       16.0
12312343  APPLE [Organic Washington]  NA      CAN       7.1
67867634  ORANGE, NavelOrange         NA      MEX       88.4
import pyspark
from pyspark.sql import functions as F

APP_NAME = "Product: Fruit Template"
SPARK_CONF = [
    ("spark.dynamicAllocation.maxExecutors", "5"),
    ("spark.executor.memory", "10g"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memoryOverhead", "2000"),
]
spark_conf = pyspark.SparkConf()
spark_conf.setAppName(APP_NAME)
spark_conf.setAll(SPARK_CONF)
sc = pyspark.SparkContext(conf=spark_conf)
spark = pyspark.sql.SparkSession(sc)

def sql(query):
    return spark.sql(query)

df = sql("""
    SELECT *
    FROM product_table
""")
This returns data:
df.filter(F.col("product_type").like("ORANGE%%")).show()
whereas this returns an empty DataFrame:
df.filter(F.col("product_type").like("ORANGE%")).show()
It may be worth noting that the same issue happens when the LIKE condition is used in the SQL string.
This returns data:
df_new = sql("""
    SELECT *
    FROM product_table
    WHERE product_type like 'ORANGE%%'
""")
df_new.show()
whereas this returns an empty DataFrame:
df_new = sql("""
    SELECT *
    FROM product_table
    WHERE product_type like 'ORANGE%'
""")
df_new.show()
I am using PySpark version 2.3.2.
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

conf = (SparkConf()
    .set("spark.executor.instances", "24")
    .set("spark.executor.cores", "5")
    .set("spark.executor.memory", "33g")
    .set("spark.driver.memory", "55g")
    .set("spark.driver.maxResultSize", "10g")
    .set("spark.sql.catalogImplementation", "hive")
    .set("mapreduce.fileoutputcommitter.algorithm.version", "2")
)
spark = (
    SparkSession.builder.appName("default")
    .enableHiveSupport()
    .config(conf=conf)
    .getOrCreate()
)

df = spark.createDataFrame(
    [('43635665', 'ORANGE - Blood Orange'),
     ('78960788', 'APPLE GrannySmith'),
     ('12312343', 'APPLE [Organic Washington]'),
     ('67867634', 'ORANGE, NavelOrange')],
    ['id', 'product_type'])
df.createOrReplaceTempView("product_table")

def sql(query):
    print(query)
    return spark.sql(query)

df2 = sql("""
    SELECT *
    FROM product_table
""")

df2.filter(F.col("product_type").like("ORANGE%")).show(truncate=False)
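Spark SQL's LIKE itself only needs a single %, as the single-% filter above shows. One setup that does force %% is when the query text passes through Python percent-style string formatting before it reaches spark.sql; a hedged sketch of such a wrapper (sql_fmt and the prefix parameter are hypothetical, not part of the question's code):
# Hypothetical wrapper that applies %-style substitution to the query text.
# Inside it, a literal % must be written as %%, which would explain needing
# 'ORANGE%%' even though Spark SQL's LIKE only needs one %.
def sql_fmt(query, params):
    return spark.sql(query % params)

df_new = sql_fmt(
    """
    SELECT *
    FROM product_table
    WHERE product_type LIKE '%(prefix)s%%'
    """,
    {"prefix": "ORANGE"},
)
df_new.show()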

Writing spark DataFrame In Apache Hudi Table

I am new to Apache Hudi and am trying to write my DataFrame to my Hudi table using the Spark shell. Since it is the first time, I am not creating any table and am writing in overwrite mode, so I expect it to create the Hudi table. This is the code I am writing:
spark-shell \
--packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
//Initialize a Spark Session for Hudi
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SparkSession
val spark1 = SparkSession.builder()
  .appName("hudi-datalake")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .getOrCreate()
//Write to a Hudi Dataset
val inputDF = Seq(
("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
).toDF("id", "creation_date", "last_update_time")
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> "work.hudi_test",
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "work.hudi_test",
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName)
// Upsert Data
// Create a new DataFrame from the first row of inputDF with a different creation_date value
val updateDF = inputDF.limit(1).withColumn("creation_date", lit("2014-01-01"))
updateDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).saveAsTable("work.hudi_test")
While running this write statement I get the error message below:
java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2
Can someone kindly guide me on how I should write this statement?
Here is a working sample for your question in pyspark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = (
    SparkSession.builder.appName("Hudi_Data_Processing_Framework")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .config(
        "spark.jars.packages",
        "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.2",
    )
    .getOrCreate()
)
input_df = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ("id", "creation_date", "last_update_time"),
)
hudi_options = {
    # ---------------DATA SOURCE WRITE CONFIGS---------------#
    "hoodie.table.name": "hudi_test",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "last_update_time",
    "hoodie.datasource.write.partitionpath.field": "creation_date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 1,
    "hoodie.insert.shuffle.parallelism": 1,
    "hoodie.consistency.check.enabled": True,
    "hoodie.index.type": "BLOOM",
    "hoodie.index.bloom.num_entries": 60000,
    "hoodie.index.bloom.fpp": 0.000000001,
    "hoodie.cleaner.commits.retained": 2,
}
# INSERT
(
    input_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi_test")
)

# UPDATE
update_df = input_df.limit(1).withColumn("creation_date", lit("2014-01-01"))
(
    update_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi_test")
)

# REAL UPDATE
update_df = input_df.limit(1).withColumn("last_update_time", lit("2016-01-01T13:51:39.340396Z"))
(
    update_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi_test")
)

output_df = spark.read.format("org.apache.hudi").load("/tmp/hudi_test/*/*")
output_df.show()
Output:
The Hudi table in the filesystem looks as follows:
Note:
Your update operation actually creates a new partition and does an insert, since you are modifying the partition column (2015-01-01 -> 2014-01-01). You can see that in the output.
I have also provided a second update example that changes last_update_time to 2016-01-01T13:51:39.340396Z; this one actually updates id 100 in partition 2015-01-01 from 2015-01-01T13:51:39.340396Z to 2016-01-01T13:51:39.340396Z.
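To see both behaviors side by side, you can filter on Hudi's metadata columns; a small sketch using the same path as above (_hoodie_record_key and _hoodie_partition_path are standard Hudi meta fields):
# After the three writes above, id 100 should show up under both the
# 2014-01-01 and 2015-01-01 partitions, and the 2015-01-01 copy should carry
# the 2016 last_update_time.
output_df = spark.read.format("org.apache.hudi").load("/tmp/hudi_test/*/*")
output_df.filter("_hoodie_record_key = '100'") \
    .select("_hoodie_partition_path", "id", "creation_date", "last_update_time") \
    .show(truncate=False)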
More samples can be found in the Hudi Quick Start Guide.

how to force Glue DynamicFrame to fail if data doesn't conform to the dataframe schema?

I have a Glue job (running on Spark) that simply converts a CSV file to Parquet. I don't have control over the CSV data, so I want to capture any inconsistency between the data and the table schema during the conversion to Parquet. For instance, if a column is defined as integer, I want the job to give me an error if there is any string value in that column. Currently, the DynamicFrame handles this by writing a choice type (string and int) to the resulting Parquet file, which is helpful for some use cases, but I'm wondering if there is any way to enforce the schema and have the Glue job throw an error if there is any inconsistency. Here is my code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = dbName, table_name = table, transformation_ctx = "datasource0")
df = datasource0.toDF()
df = df.coalesce(parquetFileCount)
df = convertColDataType(df, "timestamp", "timestamp", dbName, table)
applymapping1 = DynamicFrame.fromDF(df,glueContext,"finalDF")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": path}, format = "parquet", transformation_ctx = "datasink4")
job.commit()
You can solve this by using the Spark native library instead of the Glue library.
Instead of reading from the catalog, read from the corresponding S3 path with a custom schema and FAILFAST mode:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField('id', IntegerType(), True),
                     StructField('name', StringType(), True)])
df = spark.read.option('mode', 'FAILFAST').csv(s3Path, schema=schema)
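Because reads are lazy, the schema violation only surfaces once an action scans the data. A small hedged sketch of what that looks like for the CSV-to-Parquet job, assuming the file at s3Path contains a bad row such as "abc,John" (outputPath is illustrative):
# With mode=FAILFAST, the first action that touches the bad row raises a
# SparkException about malformed records, failing the job instead of silently
# producing a choice type like the DynamicFrame does.
df = spark.read.option('mode', 'FAILFAST').csv(s3Path, schema=schema)
df.write.mode('overwrite').parquet(outputPath)  # fails here if any row violates the schema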
I had a similar data type issue that I was able to solve by importing another DataFrame that I knew was in the correct format. Then I looped over the columns of the two DataFrames and compared their data types. In this example I recast the data types where necessary:
from pyspark.sql.types import TimestampType, DoubleType, IntegerType

df1 = inputfile
df2 = target

if df1.schema != df2.schema:
    colnames = df2.schema.names
    for colname in colnames:
        df1DataType = get_dtype(df1, colname)
        df2DataType = get_dtype(df2, colname)
        if df1DataType != df2DataType:
            if df1DataType == 'timestamp':
                not_string = ''
                df2 = df2.withColumn(colname, df2[colname].cast(TimestampType()))
            elif df1DataType == 'double':
                not_string = ''
                df2 = df2.withColumn(colname, df2[colname].cast(DoubleType()))
            elif df1DataType == 'int':
                not_string = ''
                df2 = df2.withColumn(colname, df2[colname].cast(IntegerType()))
            else:
                not_string = 'not '
            print(not_string + 'updating: ' + colname + ' - from ' + df2DataType + ' to ' + df1DataType)
target = df2
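The loop relies on a get_dtype helper that isn't shown; a minimal sketch of what it is assumed to do:
def get_dtype(df, colname):
    # Return the Spark dtype string (e.g. 'int', 'double', 'timestamp') for one column.
    return dict(df.dtypes).get(colname)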

Spark: Why does Python read input files twice but Scala doesn't when creating a DataFrame?

I'm using Spark v1.5.2. I wrote a program in Python and I don't understand why it reads the input files twice. The same program written in Scala only reads the input files once.
I use an accumulator to count the number of times that map() is called. From the accumulator value, I infer the number of times the input file is read.
The input file contains 3 lines of text.
Python:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import *

def createTuple(record):  # used with map()
    global map_acc
    map_acc += 1
    return (record[0], record[1].strip())

sc = SparkContext(appName='Spark test app')  # appName is shown in the YARN UI
sqlContext = SQLContext(sc)
map_acc = sc.accumulator(0)

lines = sc.textFile("examples/src/main/resources/people.txt")
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple)  #.cache()
fieldNames = 'name age'
fields = [StructField(field_name, StringType(), True) for field_name in fieldNames.split()]
schema = StructType(fields)
df = sqlContext.createDataFrame(people_rdd, schema)
print 'record count DF:', df.count()
print 'map_acc:', map_acc.value
#people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 6 ##### why 6 instead of 3??
Scala:
import org.apache.spark._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object SimpleApp {
  def main(args: Array[String]) {
    def createTuple(record: Array[String], map_acc: Accumulator[Int]) = { // used with map()
      map_acc += 1
      Row(record(0), record(1).trim)
    }

    val conf = new SparkConf().setAppName("Scala Test App")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val map_acc = sc.accumulator(0)

    val lines = sc.textFile("examples/src/main/resources/people.txt")
    val people_rdd = lines.map(_.split(",")).map(createTuple(_, map_acc))
    val fieldNames = "name age"
    val schema = StructType(
      fieldNames.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val df = sqlContext.createDataFrame(people_rdd, schema)
    println("record count DF: " + df.count)
    println("map_acc: " + map_acc.value)
  }
}
$ spark-submit --class SimpleApp --master local[1] test.jar 2> err
record count DF: 3
map_acc: 3
If I remove the comments from the Python program and cache the RDD, then the input files are not read twice. However, I don't think I should have to cache the RDD, right? In the Scala version I don't need to cache the RDD.
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple).cache()
...
people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 3
$ hdfs dfs -cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19
It happens because in 1.5 createDataFrame eagerly validates the provided schema on a few elements:
elif isinstance(schema, StructType):
    # take the first few rows to verify schema
    rows = rdd.take(10)
    for row in rows:
        _verify_type(row, schema)
In contrast, current versions validate the schema for all elements, but it is done lazily, so you wouldn't see the same behavior. For example, this would fail instantly in 1.5:
from pyspark.sql.types import *
rdd = sc.parallelize([("foo", )])
schema = StructType([StructField("foo", IntegerType(), False)])
sqlContext.createDataFrame(rdd, schema)
but the 2.0 equivalent would only fail when you try to evaluate the DataFrame.
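That eager check is also what doubles the accumulator: any extra pass over an uncached RDD re-runs the map stage and increments the accumulator again. A minimal sketch, reusing the sc, accumulator pattern and three-line input file from the question:
acc = sc.accumulator(0)

def tag(x):
    # counts how many times the map function runs
    global acc
    acc += 1
    return x

rdd = sc.textFile("examples/src/main/resources/people.txt").map(tag)
rdd.count()   # first evaluation of the uncached RDD: acc.value == 3
rdd.take(10)  # second evaluation, analogous to the eager schema check: acc.value == 6
Caching the RDD, as in the commented-out .cache() in the question, avoids that second pass.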
In general, you shouldn't expect Python and Scala code to behave the same way unless you strictly limit yourself to interactions with the SQL API. PySpark:
- implements almost all RDD methods natively, so the same chain of transformations can result in a different DAG;
- may require eager evaluation when interacting with the Java API, in order to provide type information for Java classes.
