PySpark: how to clear readStream cache? - apache-spark

I am reading a directory with Spark's readStream. Earlier I gave a local path, but got a FileNotFoundException. I have since changed the path to an HDFS path, but the execution log shows it is still referring to the old setting (the local path).
22/06/01 10:30:32 WARN scheduler.TaskSetManager: Lost task 0.2 in stage 1.0 (TID 3, my.nodes.com, executor 3): java.io.FileNotFoundException: File file:/home/myuser/testing_aiman/data/fix_rates.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:129)
In fact, I have hardcoded the path variable, but it is still referring to the previously set local path.
df = spark.readStream.csv("hdfs:///user/myname/streaming_test_dir",sep=sep,schema=df_schema,inferSchema=True,header=True)
I also ran spark.sql("CLEAR CACHE").collect(), but it didn't help either.

Before calling spark.readStream, I ran the following code:
spark.sql("REFRESH \"file:///home/myuser/testing_aiman/data/fix_rates.csv\"").collect
spark.sql("CLEAR CACHE").collect
REFRESH <file:///path/that/showed/FileNotFoundException> actually did the trick.
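Putting it together, a minimal sketch of the sequence that worked (the paths, sep, and df_schema placeholders are taken from the question, not verified here):
# Invalidate the cached metadata for the stale local file, then clear the SQL cache,
# and only then build the streaming DataFrame against the HDFS path.
spark.sql('REFRESH "file:///home/myuser/testing_aiman/data/fix_rates.csv"').collect()
spark.sql("CLEAR CACHE").collect()
df = spark.readStream.csv(
    "hdfs:///user/myname/streaming_test_dir",
    sep=sep,            # separator defined elsewhere in the job
    schema=df_schema,   # schema defined elsewhere in the job
    header=True,
)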

Related

Manually Deleted data file from delta lake

I have manually deleted a data file from Delta Lake, and now the command below is giving an error:
mydf = spark.read.format('delta').load('/mnt/path/data')
display(mydf)
Error
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions
I have tried restarting the cluster with no luck.
I also tried the settings below:
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.databricks.io.cache.enabled", "false")
Any help on repairing the transaction log or fixing the error would be appreciated.
As explained before, you must use VACUUM to remove files: manually deleting files does not update the Delta transaction log, which is what Spark uses to identify which files to read.
In your case you can also use the FSCK REPAIR TABLE command.
As per the docs:
"Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted."
The above error indicates that you have manually deleted a data file without using the proper DELETE statement.
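A minimal sketch of running it from PySpark (the table name is a placeholder, not from the original post; FSCK REPAIR TABLE operates on a registered Delta table):
# DRY RUN only lists the file entries that would be removed from the transaction log.
spark.sql("FSCK REPAIR TABLE my_delta_table DRY RUN").show(truncate=False)
# Without DRY RUN, the entries for the missing files are actually removed.
spark.sql("FSCK REPAIR TABLE my_delta_table")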
As per the MS docs, you can try the VACUUM command. Using the VACUUM command fixes the error.
%sql
vacuum 'Your_path'
For more information, refer to this link.
The FSCK command worked for me. Thanks, all.

Delta Lake change data feed - delete, vacuum, read - java.io.FileNotFoundException

I used the following to write to Google Cloud Storage:
df.write.format("delta").partitionBy("g","p").option("delta.enableChangeDataFeed", "true").mode("append").save(path)
And then I inserted data in versions 1, 2, 3, and 4.
I deleted some of the data in version 5.
Then I ran:
deltaTable.vacuum(8)
I tried to read starting at version 3:
spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 3) \
    .load(path)
Caused by: java.io.FileNotFoundException:
File not found: gs://xxx/yyy.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
I deleted the cluster and tried to read again. Same issue. Why is it looking for the vacuumed files?
I expected to see all the data inserted starting from version 3.
When running deltaTable.vacuum(8), note that you are removing the files that are more than 8 hours old. Even though the delete happened in version 5 of the table, if the files for version 3 are older than 8 hours, the only files still available are those of the most current version (in this case version 4).
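If the goal is to keep older versions or the change feed readable, one option is to vacuum with a longer retention. A hedged sketch (the 168-hour value is illustrative, not from the original post):
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, path)
# Retain 7 days (168 hours) of files instead of 8 hours, so earlier versions stay readable.
deltaTable.vacuum(168)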
Adding the following setting worked:
spark.sql.files.ignoreMissingFiles → true
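For reference, a minimal sketch of applying that setting from PySpark before retrying the change-feed read (path and startingVersion are the same placeholders as in the question):
# Tell Spark to skip files that are listed in metadata but no longer exist,
# instead of failing the whole read.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
df = (spark.read.format("delta")
      .option("readChangeFeed", "true")
      .option("startingVersion", 3)
      .load(path))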

How do I refresh an HDFS path?

I am running a SparkSession in a Jupyter notebook.
I sometimes get an error on a DataFrame initialized by spark.read.parquet(some_path) when the files under that path have changed, even though I cache the DataFrame.
For example, the reading code is:
sp = spark.read.parquet(TB.STORE_PRODUCT)
sp.cache()
Sometimes sp cannot be accessed anymore and complains:
Py4JJavaError: An error occurred while calling o3274.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 326.0 failed 4 times, most recent failure: Lost task 10.3 in stage 326.0 (TID 111818, dc38, executor 7): java.io.FileNotFoundException: File does not exist: hdfs://xxxx/data/dm/sales/store_product/part-00000-169428df-a9ee-431e-918b-75477c073d71-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
The problem
'REFRESH TABLE tableName' doesn't work, because I don't have a Hive table; it is only an HDFS path.
Restarting the SparkSession and reading the path again solves the problem, but I don't want to restart the SparkSession: it wastes a lot of time.
One more thing: executing sp = spark.read.parquet(TB.STORE_PRODUCT) again doesn't work. I can understand why; Spark would need to scan the path again, or there must be an option/setting to force it to scan. Keeping the whole path listing in memory is not smart.
spark.read.parquet doesn't have a force-scan option:
Signature: spark.read.parquet(*paths)
Docstring:
Loads Parquet files, returning the result as a :class:`DataFrame`.
You can set the following Parquet-specific option(s) for reading Parquet files:
* ``mergeSchema``: sets whether we should merge schemas collected from all Parquet part-files. This will override ``spark.sql.parquet.mergeSchema``. The default value is specified in ``spark.sql.parquet.mergeSchema``.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
.. versionadded:: 1.4
Source:
#since(1.4)
def parquet(self, *paths):
"""Loads Parquet files, returning the result as a :class:`DataFrame`.
You can set the following Parquet-specific option(s) for reading Parquet files:
* ``mergeSchema``: sets whether we should merge schemas collected from all \
Parquet part-files. This will override ``spark.sql.parquet.mergeSchema``. \
The default value is specified in ``spark.sql.parquet.mergeSchema``.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
"""
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File: /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/sql/readwriter.py
Type: method
Is there a proper way to solve my problem?
The problem is caused by DataFrame.cache.
I need to clear that cache first; then reading again solves the problem.
Code:
try:
    sp.unpersist()
except:
    pass
sp = spark.read.parquet(TB.STORE_PRODUCT)
sp.cache()
You can try two solutions:
one is to unpersist the DataFrame before every read, as suggested by @Mithril,
or just create a temp view and trigger the REFRESH command:
sp.createOrReplaceTempView('sp_table')
spark.sql('''REFRESH TABLE sp_table''')
df = spark.sql('''select * from sp_table''')

Spark SQL SaveMode.Overwrite gives FileNotFoundException

I want to read a dataset from an S3 directory, make some updates, and overwrite it back to the same location. What I do is:
dataSetWriter.writeDf(
    finalDataFrame,
    destinationPath,
    destinationFormat,
    SaveMode.Overwrite,
    destinationCompression)
However, my job fails with this error message:
java.io.FileNotFoundException: No such file or directory 's3://processed/fullTableUpdated.parquet/part-00503-2b642173-540d-4c7a-a29a-7d0ae598ea4a-c000.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Why is this happening? Is there anything that I am missing with the "overwrite" mode?
thanks

Reading multiple avro files into RDD from a nested directory structure

Suppose I have a directory which contains a bunch of Avro files and I want to read them all in one shot. This code works fine:
val path = "hdfs:///path/to/your/avro/folder"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
However, if the folder contains subfolders and the Avro files are in those subfolders, then I get an error:
5/10/30 14:57:47 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 6,
hadoop1): java.io.FileNotFoundException: Path is not a file: /folder/subfolder
Is there any way I can read all the Avro files (even in subdirectories) into an RDD?
All of them have the same schema, and I am on Spark 1.3.0.
Edit:
Based on the suggestion below, I executed this line in my Spark shell:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
and this solved the problem... but now my code is very slow, and I don't understand what a MapReduce setting has to do with Spark.
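Since the rest of this page is PySpark, here is a rough equivalent of that recursive-input setting from Python; this is a hedged sketch that goes through the JVM Hadoop configuration via the private _jsc handle:
# Make Hadoop input formats recurse into subdirectories when listing input files.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.input.fileinputformat.input.dir.recursive", "true"
)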
