My AWS Glue job for Hudi CDC is failing on a column that is a precombine field (see the error message below). I have validated that there are no NULL values in this column (it has an AFTER UPDATE trigger and a default of NOW()). When I query the parquet files using Spark, the only records that show NULL are those marked with an operation ('op') of DELETE. From my understanding, Hudi only transmits the PRIMARY KEY on a DELETE operation and nothing else.
Why is Hudi failing on the precombine field with a NULL value for DELETE operations? How can I fix this? Am I missing an option or something? Any help is greatly appreciated.
Error message:
2022-06-06 19:05:13,633 ERROR [Executor task launch worker for task
2.0 in stage 46.0 (TID 264)] executor.Executor (Logging.scala:logError(94)): Exception in task 2.0 in stage 46.0 (TID
264) org.apache.hudi.exception.HoodieException: The value of
last_modified_date can not be null
Hudi options:
options = {
    "hoodie_overrides": {
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "last_modified_date",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        "hoodie.datasource.hive_sync.support_timestamp": "true",
    }
}
Spark query of parquet files: (output omitted; NULL values in last_modified_date appear only on rows where 'op' is DELETE)
You can try setting this configuration to "false":
hoodie.combine.before.delete
Source: the Hudi configuration reference for hoodie.combine.before.delete.
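If your job builds its Hudi options as a Python dict like the one shown in the question, the override slots in alongside the existing keys. A minimal sketch, reusing the keys from the question (the surrounding Glue job wiring is assumed):

```python
# Sketch of the question's options dict with the suggested override added.
options = {
    "hoodie_overrides": {
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "last_modified_date",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        "hoodie.datasource.hive_sync.support_timestamp": "true",
        # Skip the combine-before-delete step so DELETE records, which carry
        # only the record key and a NULL precombine value, no longer fail.
        "hoodie.combine.before.delete": "false",
    }
}
```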
I have a database in Azure synapse with only one column with datatype datetime2(7).
In Azure Databricks I have a table with the following schema.
df.schema
StructType(List(StructField(dates_tst,TimestampType,true)))
When I try to save on Synapse, I get an error message
Py4JJavaError: An error occurred while calling o535.save.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 15.0 failed 4 times, most recent failure: Lost task 3.3 in stage 15.0 (TID 46) (10.139.64.5 executor 0): com.microsoft.sqlserver.jdbc.SQLServerException: 110802;An internal DMS error occurred that caused this operation to fail
SqlNativeBufferBufferBulkCopy.WriteTdsDataToServer, error in OdbcDone: SqlState: 42000, NativeError: 4816, 'Error calling: bcp_done(this->GetHdbc()) | SQL Error Info: SrvrMsgState: 1, SrvrSeverity: 16, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column type from bcp client for colid 1. | Error calling: pConn->Done() | state: FFFF, number: 75205, active connections: 35', Connection String: Driver={pdwodbc17e};app=TypeD00-DmsNativeWriter:DB2\mpdwsvc (56768)-ODBC;autotranslate=no;trusted_connection=yes;server=\\.\pipe\DB.2-e2f5d1c1f0ba-0\sql\query;database=Distribution_24
EDIT: Runtime version 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
EDIT 2:
Solved. The errors were:
I was using the wrong format in the write options: I had "com.microsoft.sqlserver.jdbc.spark" and changed it to "com.databricks.spark.sqldw".
There were also errors in the scope credentials.
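For reference, a minimal sketch of the corrected write path through the Azure Databricks Synapse (SQL DW) connector; the URL, tempDir, and table name below are placeholders, not values from the original job:

```python
# Sketch: write the DataFrame through the Synapse connector instead of the
# plain JDBC driver. Every connection value here is a placeholder.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>")
   .option("tempDir", "abfss://<container>@<storage>.dfs.core.windows.net/tmp")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.my_table")
   .mode("append")
   .save())
```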
org.apache.spark.SparkException: Job aborted due to stage failure
Generally, the above error occurs when you perform operations on a column that contains null values.
Replace the null values with valid datetime values.
Also check your Spark version.
Refer to this SO answer by Lyuben Todorov.
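If some rows really do contain NULL timestamps, one way to replace them before writing is coalesce with a sentinel date. A sketch assuming the dates_tst column from the schema above (the sentinel value is arbitrary):

```python
from pyspark.sql import functions as F

# Replace NULL timestamps with an arbitrary sentinel before writing.
df_clean = df.withColumn(
    "dates_tst",
    F.coalesce(F.col("dates_tst"), F.lit("1900-01-01 00:00:00").cast("timestamp")),
)
```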
I created a new table in Hive, partitioned and stored as ORC.
I am writing into this table from Spark, using append mode, the ORC format, and partitioning.
It fails with the exception:
org.apache.spark.sql.AnalysisException: The format of the existing table test.table1 is `HiveFileFormat`. It doesn't match the specified format `OrcFileFormat`.;
If I change the format from "orc" to "hive" while writing, it still fails, this time with an exception saying that Spark is not able to understand the underlying structure of the table.
So this issue is happening because Spark cannot write into the Hive table in append mode, since it cannot create a new table. Overwrite succeeds because Spark recreates the table.
But my use case is to write in append mode from the start, and insertInto also does not work, specifically for partitioned tables. I am pretty much blocked. Any help would be great.
Edit1:
Working on HDP 3.1.0 environment.
Spark Version is 2.3.2
Hive Version is 3.1.0
Edit 2:
// Reading the table
val inputdf=spark.sql("select id,code,amount from t1")
// Writing into the table
inputdf.write.mode(SaveMode.Append).partitionBy("code").format("orc").saveAsTable("test.t2")
Edit 3: Using insertInto()
val df2 =spark.sql("select id,code,amount from t1")
df2.write.format("orc").mode("append").insertInto("test.t2");
I get the error as:
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN HiveMetastoreCatalog: Unable to infer schema for table test.t1 from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.
If I rerun the insertInto command I get the following exception :
20/05/17 19:16:37 ERROR Hive: MetaException(message:The transaction for alter partition did not commit successfully.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
Error in hive metastore logs :
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(907)) - 163: alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(349)) - ugi=X#A.ORG ip=10.10.1.36 cmd=alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions_with_environment_context(5119)) - New partition values:[BR]
2020-05-17T21:17:43,913 ERROR [pool-8-thread-198]: metastore.ObjectStore (ObjectStore.java:alterPartitions(4397)) - Alter failed
org.apache.hadoop.hive.metastore.api.MetaException: Cannot change stats state for a transactional table without providing the transactional write state for verification (new write ID -1, valid write IDs null; current state null; new state {}
I was able to resolve the issue by using external tables for my use case. There is currently an open issue in Spark related to the ACID properties of Hive. Once I create the Hive table as external, I am able to append to both partitioned and non-partitioned tables.
https://issues.apache.org/jira/browse/SPARK-15348
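A rough sketch of that workaround, shown in PySpark; the column types and the storage location are assumptions, since the original post does not state them:

```python
# Recreate the target as an EXTERNAL table. On HDP 3.x, managed tables are
# transactional (ACID) by default, which is what breaks Spark's append path.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS test.t2 (id INT, amount DOUBLE)
    PARTITIONED BY (code STRING)
    STORED AS ORC
    LOCATION '/warehouse/external/t2'
""")

# insertInto is positional, so select the partition column last
# to match the table's column order (id, amount, code).
inputdf = spark.sql("select id, amount, code from t1")
inputdf.write.mode("append").insertInto("test.t2")
```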
The count() on a DataFrame loaded from IBM Bluemix object storage throws the following exception when inferSchema is enabled:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 3 in stage 43.0 failed 10 times, most recent failure: Lost task 3.9 in stage 43.0 (TID 166, yp-spark-dal09-env5-0034): java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:554)
at java.lang.Integer.parseInt(Integer.java:627)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
I don't get the above exception if I disable inferSchema.
Why am I getting this exception? By default, how many rows are read by Databricks' spark-csv when inferSchema is enabled?
This was actually an issue with the spark-csv package (null value still not correctly parsed #192) that was dragged into Spark 2.0. It has been corrected and pushed in Spark 2.1.
Here is the associated PR: [SPARK-18269][SQL] CSV datasource should read null properly when schema is lager than parsed tokens.
Since you are already using Spark 2.0, you can easily upgrade to 2.1 and drop the spark-csv package. It's not needed anyway.
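With the built-in reader, loading the CSV looks roughly like this (a sketch; the path and options are placeholders):

```python
# Spark 2.x ships a native CSV reader, so the spark-csv package is unnecessary.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("path/to/data.csv"))
df.count()
```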
I want to query a Cassandra table using the spark-cassandra-connector with the following statements:
sc.cassandraTable("citizens","records")
.select("identifier","name")
.where( "name='Alice' or name='Bob' ")
And I get this error message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 81.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 81.0 (TID 9199, mydomain):
java.io.IOException: Exception during preparation of
SELECT "identifier", "name" FROM "citizens"."records" WHERE token("id") > ? AND token("id") <= ? AND name='Alice' or name='Bob' LIMIT 10 ALLOW FILTERING:
line 1:127 missing EOF at 'or' (...<= ? AND name='Alice' [or] name...)
What am I doing wrong here, and how can I make an OR query using the where clause of the connector?
Your OR clause is not valid CQL. For a small number of key values like this (I'm assuming name is a key) you can use an IN clause instead:
.where("name in ('Alice', 'Bob')")
The where clause is used for pushing CQL down to Cassandra, so only valid CQL can go inside it. If you are looking for Spark-side SQL-like syntax, check out Spark SQL and Datasets.
I am trying to create indexes on Hive on Azure HDInsight with Tez enabled.
I can successfully create the indexes, but I can't rebuild them: the job fails with this output:
Map 1: -/- Reducer 2: 0/1
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1421234198072_0091_1_01, diagnostics=[Vertex Input: measures initializer failed.]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1421234198072_0091_1_00, diagnostics=[Vertex > received Kill in INITED state.]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
I created my table and indexes with the following job:
DROP TABLE IF EXISTS Measures;
CREATE TABLE Measures(
  topology string,
  val double,
  date timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE LOCATION 'wasb://<mycontainer>@<mystorage>.blob.core.windows.net/';
CREATE INDEX measures_index_topology ON TABLE Measures (topology) AS 'COMPACT' WITH DEFERRED REBUILD;
CREATE INDEX measures_index_date ON TABLE Measures (date) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX measures_index_topology ON Measures REBUILD;
ALTER INDEX measures_index_date ON Measures REBUILD;
Where am I going wrong, and why does the index rebuild fail?
Best regards
It looks like Tez might have a problem with generating an index on an empty table. I was able to get the same error as you (without using the JSON SerDe), and if you look at the application logs for the DAG that fails, you might see something like:
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:254)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:299)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getSplits(TezGroupedSplitsInputFormat.java:68)
at org.apache.tez.mapreduce.hadoop.MRHelpers.generateOldSplits(MRHelpers.java:263)
at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:139)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:154)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:146)
...
If you populate the table with a single dummy record, it seems to work fine. I used:
INSERT INTO TABLE Measures SELECT market,0,0 FROM hivesampletable limit 1;
After that, the index rebuild was able to run without error.