Unable to import snapshot meta file in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
Running yb-admin import_snapshot and getting the following error. Looking for a pointer; before running the import_snapshot, I drop and load the schema/tables from the snapshot schema SQL.
Error running import_snapshot: Not found (yb/master/catalog_manager_ent.cc:1344): Unable to import snapshot meta file /tmp/yb2/8a347ab6-654e-4287-b3f0-13717a21bf6d/8a347ab6-654e-4287-b3f0-13717a21bf6d.snapshot: Not found new tablet with expected partition keys: - ?: INTERNAL_ERROR (master error 34)
Using YugabyteDB 2.7.1.1 with the YSQL layer on Ubuntu 20.04.

That version is relatively old. There were some backup/restore fixes for YSQL in later versions that may address this issue.
This looks like a partitioning mismatch between the current table and the table in the snapshot. It sounds like commit 2f9d4dcaacff5b3fad5e2016c4d73b9101196d38, which returns a clearer error like:
> Wrong number of tablets in created PGSQL_TABLE_TYPE table hints in namespace id 000030af000030008000000000000000: 2 (expected 6)
instead of the unclear:
> Not found new tablet with expected partition keys: - UU
Versions 2.8 and 2.11 should have the fix.
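For reference, a minimal sketch of the import step, assuming placeholder master addresses and that the target YSQL database is named yugabyte (adjust both for your cluster):
yb-admin -master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 import_snapshot /tmp/yb2/8a347ab6-654e-4287-b3f0-13717a21bf6d/8a347ab6-654e-4287-b3f0-13717a21bf6d.snapshot ysql.yugabyte
On the upgraded version, the clearer tablet-count error should tell you how many tablets the recreated tables are expected to have before the import can succeed.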

Related

Delta Lake change data feed - delete, vacuum, read - java.io.FileNotFoundException

I used the following to write to Google Cloud Storage:
df.write.format("delta").partitionBy("g","p").option("delta.enableChangeDataFeed", "true").mode("append").save(path)
Then I inserted data in versions 1, 2, 3, and 4.
I deleted some of the data in version 5.
Ran:
deltaTable.vacuum(8)
Then I tried to read starting from version 3:
spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 3)
.load(path)
Caused by: java.io.FileNotFoundException:
File not found: gs://xxx/yyy.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
I deleted the cluster and tried to read again. Same issue. Why is it looking for the vacuumed files?
I expected to see all the data inserted starting from version 3.
When running deltaTable.vacuum(8), note that you're removing the files that are more than 8 hours old. Even if you had deleted version 5 of the table, if the files for version 3 are older than 8 hours, then the only files that would be available are those for the most current version (in this case version 4).
Adding the following setting worked:
spark.sql.files.ignoreMissingFiles -> true
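For context, that flag can be set on the active Spark session before reading the change feed; a minimal sketch in Scala (path is the same placeholder as above):
// Skip data files that no longer exist (e.g. removed by vacuum) instead of failing the read.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

val changes = spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 3)
  .load(path)
Note that this only suppresses the missing-file errors; change data whose underlying files were already vacuumed away cannot be recovered.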

Why does DeltaTable.forPath throw "[path] is not a Delta table"?

I'm trying to read a Delta Lake table which I loaded previously using Spark, and I'm using the IntelliJ IDE.
val dt = DeltaTable.forPath(spark, "/some/path/")
Now when I try to read the table again I get the error below. It was working fine, but suddenly it throws this error. What might be the reason?
Note:
I checked the files in the Delta Lake path - they look good.
A colleague was able to read the same Delta Lake files.
Exception in thread "main" org.apache.spark.sql.AnalysisException: `/some/path/` is not a Delta table.
at org.apache.spark.sql.delta.DeltaErrors$.notADeltaTableException(DeltaErrors.scala:260)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:593)
at com.datalake.az.core.DeltaLake$.delayedEndpoint$com$walmart$sustainability$datalake$az$core$DeltaLake$1(DeltaLake.scala:66)
at com.datalake.az.core.DeltaLake$delayedInit$body.apply(DeltaLake.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.datalake.az.core.DeltaLake$.main(DeltaLake.scala:18)
at com.datalake.az.core.DeltaLake.main(DeltaLake.scala)
AnalysisException: /some/path/ is not a Delta table.
AnalysisException is thrown when the given path has no transaction log under the _delta_log directory.
There could be other issues, but that's the first thing to check.
By the way, from the stack trace I figured you may not be using the latest and greatest Delta Lake 2.0.0. Please upgrade as soon as possible, as it brings tons of improvements you don't want to miss.
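As a quick way to run that first check programmatically, you can ask Delta whether the path holds a transaction log before calling forPath; a small sketch using the io.delta.tables API (the path is the placeholder from the question):
import io.delta.tables.DeltaTable

// isDeltaTable returns true only if the path contains a valid _delta_log transaction log.
if (DeltaTable.isDeltaTable(spark, "/some/path/")) {
  val dt = DeltaTable.forPath(spark, "/some/path/")
  dt.toDF.show(10)
} else {
  println("No Delta transaction log found under /some/path/_delta_log - check the path, filesystem scheme, and credentials in use.")
}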

Partitioned table on Synapse

I'm trying to create a new partitioned table in my SQL DW (Synapse) based on a partitioned table in Spark (Synapse) with:
%%spark
val df1 = spark.sql("SELECT * FROM sparkTable")
df1.write.partitionBy("year").sqlanalytics("My_SQL_Pool.dbo.StudentFromSpak", Constants.INTERNAL )
Error : StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
java.sql.SQLException:
com.microsoft.sqlserver.jdbc.SQLServerException: External file access
failed due to internal error: 'File
/synapse/workspaces/test-partition-workspace/sparkpools/myspark/sparkpoolinstances/c5e00068-022d-478f-b4b8-843900bd656b/livysessions/2021/03/09/1/tempdata/SQLAnalyticsConnectorStaging/application_1615298536360_0001/aDtD9ywSeuk_shiw47zntKz.tbl/year=2000/part-00004-5c3e4b1a-a580-4c7e-8381-00d92b0d32ea.c000.snappy.parquet:
HdfsBridge::CreateRecordReader - Unexpected error encountered
creating the record reader: HadoopExecutionException: Column count
mismatch. Source file has 5 columns, external table definition has 6
columns.' at
com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper.executeUpdateStatement(SQLAnalyticsJDBCWrapper.scala:89)
at
thanks
The sqlanalytics() function name has been changed to synapsesql(). It does not currently support writing partitioned tables, but you could implement this yourself, e.g. by writing multiple tables back to the dedicated SQL pool and then using partition switching there.
The syntax is simply (as per the documentation):
df.write.synapsesql("<DBName>.<Schema>.<TableName>", <TableType>)
An example would be:
df.write.synapsesql("yourDb.dbo.yourTablePartition1", Constants.INTERNAL)
df.write.synapsesql("yourDb.dbo.yourTablePartition2", Constants.INTERNAL)
Now do the partition switching in the database using the ALTER TABLE ... SWITCH PARTITION syntax.
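A rough sketch of the "one table per partition" approach, reusing df1 from the question (the year values and staging table names are made up for illustration, and the connector's Constants/implicits are assumed to be in scope as in the snippets above):
import org.apache.spark.sql.functions.col

// Write each year's slice into its own staging table in the dedicated SQL pool,
// then attach each one with ALTER TABLE ... SWITCH PARTITION on the database side.
Seq(2000, 2001, 2002).foreach { year =>
  df1.filter(col("year") === year)
     .write
     .synapsesql(s"My_SQL_Pool.dbo.StudentStaging_$year", Constants.INTERNAL)
}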

How to implement rdd.bulkSaveToCassandra in datastax

I am using a DataStax cluster with DSE 5.0.5:
[cqlsh 5.0.1 | Cassandra 3.0.11.1485 | DSE 5.0.5 | CQL spec 3.4.0 | Native proto
I am using spark-cassandra-connector 1.6.8.
I tried to implement the below code, but the import is not working.
val rdd: RDD[SomeType] = ... // create some RDD to save
import com.datastax.bdp.spark.writer.BulkTableWriter._
rdd.bulkSaveToCassandra(keyspace, table)
Can someone suggest how to implement this code? Are there any dependencies required for this?
The Cassandra Spark Connector has a saveToCassandra method that can be used like this (taken from the documentation):
val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
There is also saveAsCassandraTableEx, which allows you to control schema creation and other things; it's also described in the documentation referenced above.
To use them you need to import com.datastax.spark.connector._, as described in the "Connecting to Cassandra" document.
And you need to add the corresponding dependency, but this depends on what build system you use.
The bulkSaveToCassandra method is available only when you're using DSE's connector. You need to add the corresponding dependencies - see the documentation for more details. But even the primary developer of the Spark connector says it's better to use saveToCassandra instead.
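Putting the open-source connector pieces together, a minimal sbt + Scala sketch (the 1.6.8 version matches the one mentioned in the question, and the test.words table is assumed to already exist with word and count columns):
// build.sbt - pick the artifact matching your Scala version
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.8"

// In the Spark application
import com.datastax.spark.connector._   // brings saveToCassandra and SomeColumns into scope

val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))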

Unable to query HIVE Parquet based EXTERNAL table from spark-sql

We have an external Hive table which is stored as Parquet. I am not the owner of the schema this Hive/Parquet table is in, so I don't have much info.
The problem here is that when I try to query that table from the spark-sql> shell prompt (not by using Scala like spark.read.parquet("path")), I get 0 records with the message "Unable to infer schema". But when I created a managed table using CTAS in my personal schema just for testing, I was able to query it from the spark-sql> shell prompt.
When I try it from spark-shell> via spark.read.parquet("../../00000_0").show(10), I am able to see the data.
So this makes it clear that something is wrong between the external Hive table, Parquet, and Spark SQL (shell).
If locating the schema were the issue, then it should behave the same when accessing it through the Spark session (spark.read.parquet("")).
I am using MapR 5.2, Spark version 2.1.0
Please suggest what the issue might be.
