Getting an error while converting a Spark DataFrame to a Glue DynamicFrame - apache-spark

I converted a DynamicFrame to a Spark DataFrame and then back to a DynamicFrame, and I am getting the error below.
Code
from awsglue.dynamicframe import DynamicFrame

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": source_table,
        "dynamodb.throughput.read.percent": "1.5",
        "dynamodb.splits": "700"
    }
)
df_raw = dyf.toDF()
db_fields = df_raw.columns
df_raw = DynamicFrame.fromDF(df_raw, glueContext, "df_raw")
Error
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.amazonaws.services.glue.DynamicFrame.apply.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.argString()Ljava/lang/String;
at net.snowflake.spark.snowflake.SnowflakeTelemetry$.net$snowflake$spark$snowflake$SnowflakeTelemetry$$planTree(SnowflakeTelemetry.scala:80)
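The stack trace points at the Snowflake Spark connector (SnowflakeTelemetry) rather than at the conversion code itself: a NoSuchMethodError on LogicalPlan.argString() usually means a jar on the job's classpath was built against a different Spark version than the one Glue is running. As a sanity check (a sketch, not taken from the question), the round trip can be verified on a small in-memory DataFrame with no extra connectors involved; the column names below are placeholders.

# Sketch: confirm that the DataFrame <-> DynamicFrame round trip works on its own.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Small in-memory DataFrame; "id" and "value" are placeholder column names.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
dyf = DynamicFrame.fromDF(df, glueContext, "sanity_check")
print(dyf.toDF().count())  # should print 2 if the conversion itself is healthy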

Related

Cast datatype from array to String for multiple columns in Spark throwing error

I have a dataframe df that contains three columns of type array. I am trying to save the output to CSV, so I converted the data types to string.
import org.apache.spark.sql.functions._
val df2 = df.withColumn("Total", col("total").cast("string")),
("BOOKID", col("BOOKID").cast("string"),
"PublisherID", col("PublisherID").cast("string")
.write
.csv(path="D:/pennymac/SOLUTION1/OUTPUT")
But I am getting the error "Cannot resolve symbol write".
Spark 2.2
Scala
Try the code below.
It is not possible to add multiple columns inside a single withColumn call.
val df2 = df
  .withColumn("Total", col("total").cast("string"))
  .withColumn("BOOKID", col("BOOKID").cast("string"))
  .withColumn("PublisherID", col("PublisherID").cast("string"))

df2.write.csv("D:/pennymac/SOLUTION1/OUTPUT")

AWS Glue : java.lang.UnsupportedOperationException: CSV data source does not support binary data type

I'm trying to implement an upsert with AWS Glue and the Databricks Redshift connector, using preactions and postactions. Here is the code:
sample_dataframe.write.format("com.databricks.spark.redshift")\
    .option("url", "jdbc:redshift://staging-db.asdf.ap-southeast-1.redshift.amazonaws.com:5439/stagingdb?user=sample&password=pwd")\
    .option("preactions", PRE_ACTION)\
    .option("postactions", POST_ACTION)\
    .option("dbtable", temporary_table)\
    .option("extracopyoptions", "region 'ap-southeast-1'")\
    .option("aws_iam_role", "arn:aws:iam::1234:role/AWSService-Role-ForRedshift-etl-s3")\
    .option("tempdir", args["TempDir"])\
    .mode("append")\
    .save()
I'm getting the following error
py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
: java.lang.UnsupportedOperationException: CSV data source does not support binary data type.
at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.org$apache$spark$sql$execution$datasources$csv$CSVUtils$$verifyType$1(CSVUtils.scala:127)
at org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131)
at org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
Maybe I've missed something. Please help. TIA.
I've also tried passing preactions and postactions as connection_options (below), and this doesn't seem to work either:
redshift_datasink = glueContext.write_dynamic_frame_from_jdbc_conf(frame = sample_dyn_frame, catalog_connection='Staging' , connection_options = connect_options, redshift_tmp_dir = args["TempDir"], transformation_ctx = "redshift_datasink")
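The Redshift connector stages the DataFrame in S3 under tempdir before issuing a COPY, and with a CSV staging format that step cannot serialize BinaryType columns, which is what the UnsupportedOperationException is about. A hedged workaround, not taken from the question, is to cast the binary columns to string (or drop them) before writing; the sketch below assumes the sample_dataframe from the code above.

# Sketch: cast BinaryType columns to string so the CSV staging step can handle them.
from pyspark.sql.functions import col
from pyspark.sql.types import BinaryType

binary_cols = [f.name for f in sample_dataframe.schema.fields
               if isinstance(f.dataType, BinaryType)]

cleaned_df = sample_dataframe
for c in binary_cols:
    # A plain cast is lossy for arbitrary bytes; pyspark.sql.functions.base64
    # is an alternative encoding if the raw bytes need to be preserved.
    cleaned_df = cleaned_df.withColumn(c, col(c).cast("string"))

# cleaned_df can then be written with the same .write.format(...) chain as above.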

Spark SQL : custom Hive UDF GenericInternalRow cannot be cast ArrayData

I'm using Spark 1.6 with Scala and R (through SparkR and sparklyr).
I have a dataframe containing binary data representing a 2D array of Double. I want to deserialize the binary data with a Hive UDF (for compatibility with R), but Spark crashes with this error:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getArray(rows.scala:48)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:221)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScalaImpl(CatalystTypeConverters.scala:190)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toScalaImpl(CatalystTypeConverters.scala:153)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toScala(CatalystTypeConverters.scala:110)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toScala(CatalystTypeConverters.scala:283)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toScala(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$2.apply(CatalystTypeConverters.scala:414)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollectPublic$1.apply(SparkPlan.scala:174)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollectPublic$1.apply(SparkPlan.scala:174)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:350)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:311)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
Here is the UDF class:
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.BytesWritable

class DeserBytesTo2DArrayDouble extends UDF {
  // SerializationUtils and logger are project helpers defined elsewhere.
  def evaluate(input: BytesWritable): Array[Array[Double]] = {
    if (input == null) return null
    val res = SerializationUtils.deserializeFromByteArray(input.getBytes,
      classOf[Array[Array[Double]]])
    logger.info("Deserialized data : {}", res)
    res
  }
}
And an example of using it:
import sqlContext.implicits._

val data = Array(Array(1d, 2d, 3d))
val bytArr = SerializationUtils.serializeToByteArray(data)
val df = List(("a", bytArr)).toDF("ID", "DATAB")
df.registerTempTable("toto")
sqlContext.sql("CREATE TEMPORARY FUNCTION b2_2darrD as 'package.to.DeserBytesTo2DArrayDouble'")
sqlContext.sql("select id, b2_2darrD(datab) from toto").show

Writing to S3 with spark raises java.io.IOException: file doesn't exist

I'm trying to write a dataframe to S3.
Strangely it complains that the file doesn't exist when I'm trying to write it.
My code:
schema = StructType([StructField('column1', IntegerType())])
data = [(x,) for x in range(10)]
df = spark.createDataFrame(data, schema)
url = "s3://XXX/YYY/write_test.csv"
df.write.csv(url)
The exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
: java.io.IOException: /YYY/write_test.csv doesn't exist
Using
spark 2.1.1
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.3.jar
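With hadoop-aws 2.7.x used outside of EMR, the plain s3:// scheme typically resolves to Hadoop's old block-based S3 filesystem, which expects its own block layout in the bucket and fails with exactly this kind of "doesn't exist" IOException; the usual pairing for these jar versions is the s3a:// scheme with fs.s3a.* credentials. A sketch under that assumption (bucket, prefix and keys are placeholders):

# Sketch, assuming hadoop-aws 2.7.3 + aws-java-sdk 1.7.4 on the classpath.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder credentials
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

url = "s3a://XXX/YYY/write_test.csv"   # note s3a://, not s3://
df.write.csv(url)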

Applying a specific schema on an Apache Spark data frame

I am trying to apply a particular schema to a dataframe. The schema seems to have been applied, but all dataframe operations like count, show, etc. fail with a NullPointerException, as shown below:
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:218)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
Here is my code:
var fieldSchema = ListBuffer[StructField]()
val columns = mydf.columns
for (i <- 0 until columns.length) {
  val columns = mydf.columns
  val colName = columns(i)
  fieldSchema += StructField(colName, mydf.schema(i).dataType, true, null)
}
val schema = StructType(fieldSchema.toList)
val newdf = sqlContext.createDataFrame(df.rdd, schema) << df is the original dataframe
newdf.printSchema() << this prints the new applied schema
println("newdf count:"+newdf.count()) << this fails with null pointer exception
In short, there are actually three dataframes:
df - the original dataframe
mydf - the dataframe whose schema I am trying to apply to df
newdf - a new dataframe created from df's data, but with the different schema
