Issue with Spark SQL to Pandas - python-3.x

I'm getting an error in the following code:
txn = spark.sql('select * from temporary view').toPandas()
I got Py4JJavaError: An error occurred while calling o420.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure.
Please help!
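For reference, a minimal sketch of how a temporary view is usually registered and then pulled into pandas (assumptions: source_df and the view name my_temp_view are placeholders, not from the original post):
# Register the view under a single-word name; a name containing a space,
# like "temporary view", cannot be referenced directly in SQL.
source_df.createOrReplaceTempView("my_temp_view")
# Collect a small sample first; toPandas() brings the whole result to the
# driver, so a stage failure often only surfaces when the full table is scanned.
txn_sample = spark.sql("SELECT * FROM my_temp_view LIMIT 1000").toPandas()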

Related

Problem saving a Spark timestamp into Azure Synapse datetime2(7)

I have a table in an Azure Synapse database with only one column, of datatype datetime2(7).
In Azure Databricks I have a table with the following schema:
df.schema
StructType(List(StructField(dates_tst,TimestampType,true)))
When I try to save to Synapse, I get an error message:
Py4JJavaError: An error occurred while calling o535.save.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 15.0 failed 4 times, most recent failure: Lost task 3.3 in stage 15.0 (TID 46) (10.139.64.5 executor 0): com.microsoft.sqlserver.jdbc.SQLServerException: 110802;An internal DMS error occurred that caused this operation to fail
SqlNativeBufferBufferBulkCopy.WriteTdsDataToServer, error in OdbcDone: SqlState: 42000, NativeError: 4816, 'Error calling: bcp_done(this->GetHdbc()) | SQL Error Info: SrvrMsgState: 1, SrvrSeverity: 16, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column type from bcp client for colid 1. | Error calling: pConn->Done() | state: FFFF, number: 75205, active connections: 35', Connection String: Driver={pdwodbc17e};app=TypeD00-DmsNativeWriter:DB2\mpdwsvc (56768)-ODBC;autotranslate=no;trusted_connection=yes;server=\\.\pipe\DB.2-e2f5d1c1f0ba-0\sql\query;database=Distribution_24
EDIT: Runtime version 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
EDIT 2:
It was solved; the errors were:
I was using the incorrect format in the write options, "com.microsoft.sqlserver.jdbc.spark", and changed it to "com.databricks.spark.sqldw".
There were also errors in the scope credentials.
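A minimal sketch of what the corrected write path might look like with the Databricks Synapse connector (assumptions: the JDBC URL, tempDir container, and table name below are placeholders, not the original configuration):
# Write the DataFrame to Synapse via the com.databricks.spark.sqldw format,
# staging the data in ADLS and forwarding the storage credentials.
(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", jdbc_url)  # placeholder Synapse JDBC URL
    .option("tempDir", "abfss://container@account.dfs.core.windows.net/tmp")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.dates_table")  # placeholder target table
    .mode("append")
    .save())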
org.apache.spark.SparkException: Job aborted due to stage failure
Generally, the above error occurs when you perform operations on a column that contains null values.
Replace the null values with valid datetime values.
Also check your Spark version.
Refer to this SO answer by Lyuben Todorov.
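A minimal PySpark sketch of the suggested null replacement (assumption: the placeholder timestamp 1900-01-01 is only an example; choose a default that makes sense for your data):
from pyspark.sql import functions as F
# Substitute a valid timestamp wherever dates_tst is null before writing,
# so the bulk copy into datetime2(7) never sees a null it cannot handle.
df_clean = df.withColumn(
    "dates_tst",
    F.coalesce(F.col("dates_tst"), F.lit("1900-01-01 00:00:00").cast("timestamp")))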

Error when brace-expansion includes more than ~25 files in amazon-s3 read from spark

I have just upgraded to using Spark 3 instead of Spark 2.4.
The following code ran fine in Spark 2.4:
df = spark.read.parquet('s3a://bucket/path/{'+
'file1,'+
'file2,'+
'file3,'+
'file4,'+
'file5,'+
'file6,'+
'file7,'+
'file8,'+
'file9,'+
'file10,'+
'file11,'+
'file12,'+
'file13,'+
'file14,'+
'file15,'+
'file16,'+
'file17,'+
'file18,'+
'file19,'+
'file20,'+
'file21,'+
'file22,'+
'file23,'+
'file24,'+
'file25'+
'}')
but in Spark 3 I get this error:
Py4JJavaError: An error occurred while calling o944.parquet.
: org.apache.hadoop.fs.s3a.AWSS3IOException: getFileStatus on s3a://
...
com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: aaa), S3 Extended Request ID:
If I reduce the number of files to fewer than about 24, the query completes successfully in Spark 3.
I can't find any reference to limits on the number of files in a brace expansion like this in S3. What might be going wrong? How can it be fixed?
Why not just give Spark the entire directory to process and let it scan the files?
df = spark.read.parquet('s3a://bucket/path/')
There is a 1024-character limit on an AWS query. Somehow, this was not a problem in Spark 2.
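If only those specific files are needed, a minimal sketch of passing them as separate paths instead of one long brace-expansion string (assumption: the bucket, prefix, and file names are placeholders matching the question):
# DataFrameReader.parquet accepts multiple paths, so the file list can be
# passed as separate arguments rather than a single glob string.
files = ['file{}'.format(i) for i in range(1, 26)]
paths = ['s3a://bucket/path/' + f for f in files]
df = spark.read.parquet(*paths)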

ORA-01555: snapshot too old: rollback segment number with name “” too small - SonarQube

I am getting an error while publishing results on SonarQube.
Error querying database. Cause: org.apache.ibatis.executor.result.ResultMapException: Error attempting to get column 'RAWLINEHASHES' from result set. Cause: java.sql.SQLException: ORA-01555: snapshot too old: rollback segment number 2 with name "_SYSSMU2_111974964$" too small
Cause: org.apache.ibatis.executor.result.ResultMapException: Error attempting to get column 'RAWLINEHASHES' from result set. Cause: java.sql.SQLException: ORA-01555: snapshot too old: rollback segment number 2 with name "_SYSSMU2_111974964$" too small
The pipeline executed for 2 hr 30 min.
Can you please help?
The error that you are getting is ORA-01555, which is an Oracle error message.
Your pipeline is executing something against an Oracle database, and after running for a long time it hits this error.
For ways to avoid this error, see: https://blog.enmotech.com/2018/09/10/ora-01555-snapshot-old-error-ways-to-avoid-ora-01555-snapshot-too-old-error/

Getting an error while writing to Elasticsearch from Spark with a custom mapping id

I'm trying to write a DataFrame from Spark to Elasticsearch with a custom mapping id, and when I do that I'm getting the below error.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 16 times, most recent failure: Lost task 0.15 in stage 14.0 (TID 860, ip-10-122-28-111.ec2.internal, executor 1): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: [DataFrameFieldExtractor for field [[paraId]]] cannot extract value from entity [class java.lang.String] | instance
Below is the configuration used to write to ES:
import org.elasticsearch.spark.sql._  // required for DataFrame.saveToEs

val config = Map(
  "es.nodes" -> node,
  "es.port" -> port,
  "es.clustername" -> clustername,
  "es.net.http.auth.user" -> login,
  "es.net.http.auth.pass" -> password,
  "es.write.operation" -> "upsert",
  "es.mapping.id" -> "paraId",
  "es.resource" -> "test/type")

df.saveToEs(config)
I'm using ES 5.6 and Spark 2.2.0. Let me know if you have any insight on this.
Thanks!
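For comparison, a minimal PySpark sketch of the same write through the DataFrame writer API (assumptions: the elasticsearch-spark connector is on the classpath; node, port, login, and password are placeholders; this is not the original poster's code):
# Upsert the DataFrame into the test/type index using paraId as the document id.
(df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", node)
    .option("es.port", port)
    .option("es.net.http.auth.user", login)
    .option("es.net.http.auth.pass", password)
    .option("es.write.operation", "upsert")
    .option("es.mapping.id", "paraId")
    .mode("append")
    .save("test/type"))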

"Missing EOF" Error message when querying using the spark-cassandra-connector

I want to query a Cassandra table with the spark-cassandra-connector using the following statements:
sc.cassandraTable("citizens","records")
.select("identifier","name")
.where( "name='Alice' or name='Bob' ")
And I get this error message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 81.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 81.0 (TID 9199, mydomain):
java.io.IOException: Exception during preparation of
SELECT "identifier", "name" FROM "citizens"."records" WHERE token("id") > ? AND token("id") <= ? AND name='Alice' or name='Bob' LIMIT 10 ALLOW FILTERING:
line 1:127 missing EOF at 'or' (...<= ? AND name='Alice' [or] name...)
What am I doing wrong here and how can I make an or query using the where clause of the connector?
Your OR clause is not valid CQL. For these few key values (I'm assuming name is a key) you can use an IN clause.
.where( "name in ('Alice', 'Bob') ")
The where clause is used for pushing CQL down to Cassandra, so only valid CQL can go inside it. If you are looking for Spark-side SQL-like syntax, check out Spark SQL and Datasets (see the sketch below).
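A minimal PySpark sketch of that Spark-side alternative (assumptions: the spark-cassandra-connector is on the classpath and the keyspace/table names match the question; the OR filter is evaluated by Spark rather than pushed down as CQL):
# Load the Cassandra table as a DataFrame and filter on the Spark side.
records = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="citizens", table="records")
    .load())
result = records.select("identifier", "name").filter("name = 'Alice' OR name = 'Bob'")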
