Spark SQL - Stale Table Cache When Reusing Table Name - apache-spark

I'm running Spark 3.3.1 in standalone cluster mode, using Hadoop 3.3.4 as storage. I'm attempting to run a large script that reuses a permanent table as a scratch (temp) table at different stages. Here's a rough outline of how the table is used:
DROP TABLE IF EXISTS AiTemp;
CREATE TABLE AiTemp AS
SELECT *
FROM SomeTable;
-- Do some work with the table
-- Drop the table
DROP TABLE IF EXISTS AiTemp;
-- Some unrelated code does some unrelated things here
-- Later on in the script, I reuse the table for an unrelated purpose
DROP TABLE IF EXISTS AiTemp;
CREATE TABLE AiTemp AS
SELECT *
FROM SomeOtherTable;
-- Do some different work with the table
--Drop the table
DROP TABLE IF EXISTS AiTemp;
Even though I drop the table each time after I am finished with it, I'm still getting a caching error:
22/11/09 20:15:20 WARN TaskSetManager: Lost task 27.1 in stage 230.0 (TID 2612) (10.0.11.2 executor 1): java.io.FileNotFoundException:
File does not exist: hdfs://sparkmaster:9000/user/spark/warehouse/aitemp/part-00027-7640da4d-0c27-484e-b2ca-7bd20ed86371-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.errors.QueryExecutionErrors$.readCurrentFileNotFoundError(QueryExecutionErrors.scala:661)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:212)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:554)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
The script doesn't fail after this message is shown; it continues executing. Since the script keeps running, does this mean Spark refreshed the cache and retried the task, and that the failed task eventually completed?
Additionally, what causes this message? It appears that I could put an UNCACHE TABLE IF EXISTS statement before each drop to ensure the cache is invalidated/cleared, but it doesn't feel like that should be necessary if I'm explicitly dropping the table.
EDIT: I put a bunch of UNCACHE TABLE calls before dropping AND creating the table in question, but the error persisted. I changed my code to create the table and fill it in two separate steps, and that works, but I'd still like to understand why this happened in the first place.
Here's how the code looks now:
DROP TABLE IF EXISTS AiTemp;
CREATE TABLE AiTemp LIKE SomeTable;
INSERT INTO AiTemp
SELECT *
FROM SomeTable;
DROP TABLE IF EXISTS AiTemp;
EDIT 2: The update to separate CREATE/INSERT doesn't fix the issue. I'm still receiving the error.

Let Spark do the cleanup for you.
Use
df.createTempView('AiTemp')
df2.createTempView('AiTemp2')
And don't worry about the cleanup.
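For illustration, a minimal PySpark sketch of that approach, using the table names from the question and createOrReplaceTempView so the same name can be reused; this is an assumption about the surrounding job, not the original code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# First stage: expose the data as a session-scoped temp view instead of a permanent table.
spark.table("SomeTable").createOrReplaceTempView("AiTemp")
# ... do some work with AiTemp ...

# Later stage: simply overwrite the view. Nothing is written to the warehouse
# directory, so there are no files on HDFS to go stale or be cached.
spark.table("SomeOtherTable").createOrReplaceTempView("AiTemp")
# ... do some different work with AiTemp ...
Temp views disappear when the session ends, so no DROP is needed.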

Related

Spark is throwing FileNotFoundException while accessing cached table

There is a table that I read early in the script, but the job will fail mid-run if the underlying data changes in a partition I read, e.g.:
java.io.FileNotFoundException: File does not exist:
hdfs://R2/projects/.../country=AB/date=2021-08-20/part-00005e4-4fa5-aab4-93f02feaf746.c000
Even when I specifically cache the table and perform an action, the script still fails down the line if the above happens.
df.cache()
df.show(1)
My question is, how is this possible?
If I cache the data on memory/disk, why does it matter if the underlying file is updated or not?
Edit: the code is very long; the main thing is:
df = read in the table whose underlying data is in the above HDFS folder
df.cache() and df.show() immediately after it, since Spark evaluates lazily; the show() forces the caching to happen
Later, when I refer to df, the script fails with java.io.FileNotFoundException if the underlying data has changed:
new_df = df.join(other_df, 'id', 'right')
As discussed in the comment section, Spark automatically evicts cached data on an LRU (Least Recently Used) basis whenever it runs short of memory.
In your case Spark might have evicted the cached table. If there is no cached data, the original lineage is used to recompute the DataFrame, and that recomputation throws an error if the underlying files are missing.
You can try increasing the memory or using the DISK_ONLY storage level.
df.persist(StorageLevel.DISK_ONLY)
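For completeness, a minimal PySpark sketch of that suggestion; the source path and the count() action are illustrative assumptions, not from the original post:
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical read, standing in for the partitioned table in the question.
df = spark.read.parquet("hdfs://R2/projects/some_table")

# Keep the data on executor local disk so later stages prefer the persisted
# copy over re-reading the (possibly changed) HDFS files.
df = df.persist(StorageLevel.DISK_ONLY)

# count() touches every partition and fully materializes the persisted data;
# show(1) may only compute the partitions needed to produce a single row.
df.count()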

When the underlying files have changed, should PySpark refresh the view or the source tables?

Let's say we have a Hive table foo that's backed by a set of parquet files on e.g. s3://some/path/to/parquet. These files are known to be updated at least once per day, but not always at the same hour of the day.
I have a view on that table, for example defined as
spark.sql("SELECT bar, count(baz) FROM foo GROUP BY bar").createOrReplaceTempView('foo_view')
When I use foo_view, the application will occasionally fail with
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 975.0 failed 4 times, most recent failure: Lost task 0.3 in stage 975.0 (TID 116576, 10.56.247.98, executor 193): com.databricks.sql.io.FileReadException: Error while reading file s3a://some/path/to/parquet. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I've tried prefixing all my queries on foo_view with a call to spark.catalog.refreshTable('foo'), but the problem keeps on showing up.
Am I doing this right? Or should I call refreshTable() on the view instead of the source table?
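For reference, a sketch of the refresh-then-recreate pattern described above, with foo and foo_view as in the question; whether refreshing the source table alone is enough is exactly what is being asked, so treat this as an illustration rather than a confirmed fix:
# Invalidate Spark's cached metadata/file listing for the source table...
spark.catalog.refreshTable("foo")

# ...then rebuild the temp view so its plan is re-resolved against the
# refreshed table before the next query runs.
spark.sql("SELECT bar, count(baz) FROM foo GROUP BY bar").createOrReplaceTempView("foo_view")

result = spark.sql("SELECT * FROM foo_view")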

Spark staging directory race condition when writing to hive partition?

I am seeing intermittent exceptions when attempting to write a Dataset to a partition in a hive table.
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/devl_fr9.db/fr9_ftdelivery_cpy_2_4d8eebd3_9691_47ce_8acc_b2a5123dabf6/.spark-staging-d996755c-eb81-4362-a393-31e8387104f0/date_id=20180604/part-00000-d996755c-eb81-4362-a393-31e8387104f0.c000.snappy.parquet for client 10.56.219.20 already exists
If I check HDFS, the relevant path does not exist. I can only assume this is some race condition involving temp staging files. I am using Spark 2.3.
A possible reason for this issue is that, during a job's execution, a task started writing data to that file and failed.
When a task fails, the data that it had already written is not deleted/purged by Spark (confirmed at least in 2.3 and 2.4). Therefore, when a different executor attempts to re-execute the failed task it will attempt to write to a file with the same name, and you'll get a FileAlreadyExistsException.
In your case, the file that already exists is called part-00000-d996755c-eb81-4362-a393-31e8387104f0.c000, so it's likely that you have a log message in stderr indicating that task 00000 was lost due to failure, something like
WARN TaskSetManager: Lost task 00000 in stage...
If you fix the reason for this failure - probably an OutOfMemoryError, if the issue is intermittent - the FileAlreadyExistsException will likely go away, because the task will no longer fail and leave temporary files behind.

How to refresh a table and do it concurrently?

I'm using Spark Streaming 2.1. I'd like to periodically refresh some cached tables (loaded by a Spark-provided data source such as Parquet or MySQL, or by a user-defined data source).
How do I refresh the table?
Suppose I have some table loaded by
spark.read.format("").load().createTempView("my_table")
and it is also cached by
spark.sql("cache table my_table")
Is the following code enough to refresh the table, so that the next time the table is loaded it will automatically be cached?
spark.sql("refresh table my_table")
or do I have to do that manually with
spark.table("my_table").unpersist
spark.read.format("").load().createOrReplaceTempView("my_table")
spark.sql("cache table my_table")
is it safe to refresh the table concurrently?
By concurrent I mean using ScheduledThreadPoolExecutor to do the refresh work apart from the main thread.
What will happen if the Spark is using the cached table when I call refresh on the table?
Spark 2.2.0 introduced a feature for refreshing the metadata of a table when it has been updated by Hive or some external tool.
You can achieve this by using the API:
spark.catalog.refreshTable("my_table")
This API will update the metadata for that table to keep it consistent.
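As a rough illustration of calling that API on a schedule (the question mentions ScheduledThreadPoolExecutor; this PySpark sketch uses a plain timer thread and a hypothetical 30-second interval, and does not by itself address the thread-safety part of the question):
import threading

def refresh_my_table(interval_seconds=30):
    # Invalidate cached data and metadata for the table; Spark reloads it on next access.
    # Assumes `spark` is the active SparkSession.
    spark.catalog.refreshTable("my_table")
    # Schedule the next refresh.
    threading.Timer(interval_seconds, refresh_my_table, [interval_seconds]).start()

refresh_my_table()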
I had a problem reading a table from Hive using a SparkSession, specifically the table method, i.e. spark.table(table_name). Every time after writing the table and then trying to read it,
I got this error:
java.IO.FileNotFoundException ... The underlying files may have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I tried to refresh the table using spark.catalog.refreshTable(table_name) and also via sqlContext; neither worked.
My solution was to write the table and then read it back using:
val usersDF = spark.read.load(s"/path/table_name")
It works fine.
Is this a problem? Maybe the data on HDFS is not updated yet?

Why does data corruption happen in Cassandra 1.2?

I dropped a column in Cassandra 1.2 a couple of days ago by:
1. dropping the whole table,
2. recreating the table without the column,
3. re-running the insert statements (without the column).
The reason I did it that way is that Cassandra 1.2 doesn't support the "drop column" operation.
Today I was notified by the Ops team about a data corruption issue.
My questions:
What is the root cause?
How to fix it?
ERROR [ReadStage:79] 2014-11-04 11:29:55,021 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:79,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0 (/data/cassandra/data/xxx/yyy/zzz-Data.db, 1799885 bytes remaining)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:110)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:40)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:90)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:171)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:199)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:160)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:291)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1398)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1214)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1130)
at org.apache.cassandra.db.Table.getRow(Table.java:344)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:44)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0 (/data/cassandra/data/xxx/yyy/zzz-Data.db, 1799885 bytes remaining)
at org.apache.cassandra.db.ColumnSerializer$CorruptColumnException.create(ColumnSerializer.java:148)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:86)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:73)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:106)
... 24 more
ERROR [ReadStage:89] 2014-11-04 11:29:58,076 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:89,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:376)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392)
at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:355)
at org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:108)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:92)
at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:73)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:106)
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:40)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:90)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:171)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:199)
C* 1.2 supports column deletions for CQL tables - http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_delete.html
However, I don't see anything wrong with the procedure you described for re-creating the table without your column. Here are some steps to move forward.
Assumptions -
The corruption you are seeing is in the new table, not the old one (do they have the same name?)
You have a replication factor and number of nodes high enough that you can take this node offline
Your client's load-balancing policy is set up appropriately, so that when the node goes down it will fail over to another node
Procedure -
1) Take your node offline
nodetool drain
This will flush memtables and make your node stop accepting requests.
2) Run nodetool scrub
nodetool scrub [keyspace][table]
If this completes successfully, you are done: bring your node back up by restarting Cassandra and run nodetool repair <keyspace> <table>.
3) If scrub errored out (probably with a corruption error), try the sstablescrub utility. SSH into the box and run:
sstablescrub <keyspace> <table>
Note: run this as the same OS user you use to start Cassandra.
If this completes successfully, you are done: bring your node back up by restarting Cassandra and run nodetool repair <keyspace> <table>.
4) If this doesn't work (again erroring out with a corruption error), you will have to remove the SSTable and rebuild it from your other replicas using repair:
mv the culprit SSTable from your data directory to a backup directory (delete it later once the data is rebuilt)
restart Cassandra
nodetool repair <keyspace> <cf> -- this repair will take time.
Please let me know if you are able to reproduce this corruption.
