How do I refresh an HDFS path? - apache-spark

I am running a SparkSession in a Jupyter notebook.
I sometimes get an error on a DataFrame that was created by spark.read.parquet(some_path) when the files under that path have changed, even though I cache the DataFrame.
For example, the reading code is:
sp = spark.read.parquet(TB.STORE_PRODUCT)
sp.cache()
Sometimes sp can no longer be accessed and complains:
Py4JJavaError: An error occurred while calling o3274.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 326.0 failed 4 times, most recent failure: Lost task 10.3 in stage 326.0 (TID 111818, dc38, executor 7): java.io.FileNotFoundException: File does not exist: hdfs://xxxx/data/dm/sales/store_product/part-00000-169428df-a9ee-431e-918b-75477c073d71-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
The problem
'REFRESH TABLE tableName' doesn't work, because:
I don't have a Hive table; it is only an HDFS path.
Restarting the SparkSession and reading that path again solves the problem, but
I don't want to restart the SparkSession; it would waste a lot of time.
One more thing:
Executing sp = spark.read.parquet(TB.STORE_PRODUCT) again doesn't work. I can understand why: Spark would need to scan the path again, or there must be an option/setting to force a rescan. Keeping the whole file listing in memory is not smart.
spark.read.parquet doesn't have a force-scan option:
Signature: spark.read.parquet(*paths)
Docstring:
Loads Parquet files, returning the result as a :class:`DataFrame`.
You can set the following Parquet-specific option(s) for reading Parquet files:
* ``mergeSchema``: sets whether we should merge schemas collected from all Parquet part-files. This will override ``spark.sql.parquet.mergeSchema``. The default value is specified in ``spark.sql.parquet.mergeSchema``.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
.. versionadded:: 1.4
Source:
#since(1.4)
def parquet(self, *paths):
"""Loads Parquet files, returning the result as a :class:`DataFrame`.
You can set the following Parquet-specific option(s) for reading Parquet files:
* ``mergeSchema``: sets whether we should merge schemas collected from all \
Parquet part-files. This will override ``spark.sql.parquet.mergeSchema``. \
The default value is specified in ``spark.sql.parquet.mergeSchema``.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
"""
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File: /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/sql/readwriter.py
Type: method
Is there a proper way to solve my problem?

The problem is caused by DataFrame.cache.
I need to clear that cache first; then reading the path again solves the problem.
Code:
try:
    sp.unpersist()
except:
    pass
sp = spark.read.parquet(TB.STORE_PRODUCT)
sp.cache()
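
If this has to happen often, the same unpersist-and-reload steps can be wrapped in a small helper (a minimal sketch; the name reload_parquet is made up for illustration, not part of any Spark API):

def reload_parquet(spark, old_df, path):
    """Drop any cached copy of old_df, then re-read and re-cache the path."""
    if old_df is not None:
        try:
            old_df.unpersist()   # forget the stale cached blocks
        except Exception:
            pass                 # nothing was cached; ignore
    new_df = spark.read.parquet(path)   # re-lists the files under the path
    new_df.cache()
    return new_df

sp = reload_parquet(spark, sp, TB.STORE_PRODUCT)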

You can try two solutions:
one is to unpersist the DataFrame before reading every time, as suggested by #Mithril;
or just create a temp view and trigger the REFRESH command:
sp.createOrReplaceTempView('sp_table')
spark.sql('''REFRESH TABLE sp_table''')
df=spark.sql('''select * from sp_table''')
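
Not covered in the answers above, but if you are on Spark 2.2 or later, the catalog API can also refresh what Spark has cached for a path directly, without a temporary view. A minimal sketch, reusing the path constant from the question:

# Invalidate cached data and file listings for everything under this path,
# then re-read and re-cache it.
spark.catalog.refreshByPath(TB.STORE_PRODUCT)
sp = spark.read.parquet(TB.STORE_PRODUCT)
sp.cache()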

Related

Reading Parquet file with Pyspark returns java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths

I am trying to load parquet files in the following directories:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
This is what I wrote in PySpark:
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.parquet(s3_bucket_location_of_data)
but I received the following error:
Py4JJavaError: An error occurred while calling o109.parquet.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
After reading other StackOverflow posts like this, I tried the following:
base_path="s3://dir1/" # I have tried to set this to "s3://dir1/model=m1/version=newest/versionnumber=3/scores/" as well, but it didn't work
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.option("basePath", base_path).parquet(s3_bucket_location_of_data)
but that returned a similar error message to the one above. I am new to Spark/PySpark and I don't know what I could be doing wrong here. Thank you in advance for your answers!
You don't need to specify the detailed path. Just load the files from the base_path.
df = spark.read.parquet("s3://dir1")
df.filter("model = 'm1' and version = 'newest' and versionnumber = 3")
The directory structure is already partitioned by three columns: model, version, and versionnumber. So read from the base path and filter on the partition columns; that way Spark reads all the Parquet files under the matching partition path.
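
As a quick sanity check (a sketch reusing the same bucket layout), the three partition columns should show up in the schema when you read from the base path, and the filter then prunes down to the partitions you actually need:

df = spark.read.parquet("s3://dir1")
df.printSchema()   # expect model, version and versionnumber to appear as columns
scores = df.filter("model = 'm1' and version = 'newest' and versionnumber = 3")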

PySpark: how to clear readStream cache?

I am reading a directory with Spark's readStream. Earlier I gave a local path but got a FileNotFoundException. I have changed the path to an HDFS path, but the execution log still shows it referring to the old setting (the local path).
22/06/01 10:30:32 WARN scheduler.TaskSetManager: Lost task 0.2 in stage 1.0 (TID 3, my.nodes.com, executor 3): java.io.FileNotFoundException: File file:/home/myuser/testing_aiman/data/fix_rates.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:129)
In fact I have hardcoded the path variable, but it still refers to the previously set local path.
df = spark.readStream.csv("hdfs:///user/myname/streaming_test_dir",sep=sep,schema=df_schema,inferSchema=True,header=True)
I also ran spark.sql("CLEAR CACHE").collect(), but it didn't help either.
Before running the spark.readStream(), I ran the following code:
spark.sql("REFRESH \"file:///home/myuser/testing_aiman/data/fix_rates.csv\"").collect
spark.sql("CLEAR CACHE").collect
REFRESH <file:///path/that/showed/FileNotFoundException> actually did the trick.
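
In other words, the working order of operations looks roughly like this (a sketch reusing the paths from the question; sep and df_schema are assumed to be defined already):

# 1. Invalidate the stale entry for the old local path.
spark.sql('REFRESH "file:///home/myuser/testing_aiman/data/fix_rates.csv"').collect()
# 2. Clear any remaining cached entries.
spark.sql("CLEAR CACHE").collect()
# 3. Only then start the stream against the HDFS directory.
df = spark.readStream.csv("hdfs:///user/myname/streaming_test_dir",
                          sep=sep, schema=df_schema, header=True)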

Writing spark.sql dataframe result to parquet file

I created the following Spark session with Hive support:
# creating Spark context and connection
spark = (SparkSession.builder.appName("appName").enableHiveSupport().getOrCreate())
and am able to see the results of the following query:
spark.sql("select year(plt_date) as Year, month(plt_date) as Mounth, count(build) as B_Count, count(product) as P_Count from first_table full outer join second_table on key1=CONCAT('SS',key_2) group by year(plt_date), month(plt_date)").show()
However, when I try to write the resulting DataFrame from this query to HDFS, I get an error.
I am able to save the resulting DataFrame of a simpler version of this query to the same path. The problem appears when I add functions such as count(), year(), etc.
What is the problem, and how can I save the results to HDFS?
It is giving an error due to the '(' present in the column name 'year(CAST(plt_date AS DATE))'.
Use selectExpr to rename it:
data = data.selectExpr("year(CAST(plt_date AS DATE)) as nameofcolumn")
Upvote if it works.
Refer to: Rename Spark Column
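
Putting the two pieces together, a minimal end-to-end sketch (the output path hdfs:///tmp/plt_report is a placeholder): run the aggregation, rename the offending auto-generated column if it is still present, and write the result to HDFS as Parquet.

result = spark.sql("""
    select year(plt_date) as Year, month(plt_date) as Month,
           count(build) as B_Count, count(product) as P_Count
    from first_table full outer join second_table on key1 = CONCAT('SS', key_2)
    group by year(plt_date), month(plt_date)
""")

# withColumnRenamed is a no-op when the old name is absent, so this is safe
# even if the aliases above already took effect.
result = result.withColumnRenamed("year(CAST(plt_date AS DATE))", "Year")

result.write.mode("overwrite").parquet("hdfs:///tmp/plt_report")   # placeholder output path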

Select not null values in dataframes in Spark

I'm reading a CSV file in Spark 2.0 and counting the not-null values in a column using the following:
val df = spark.read.option("header", "true").csv(dir)
df.filter("IncidntNum is not null").count()
and it works fine when I test it using spark-shell. When I create a jar file containing the code and submit it with spark-submit, I get an exception at the second line above:
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '' expecting {'(', 'SELECT', ..
== SQL ==
IncidntNum is not null
^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
Any idea why this would happen when I'm using the code that works in spark-shell?
This question has been sitting around a while, but better late than never.
The most likely reason I can think of is that when running using spark-submit you are running in "cluster" mode. That means the driver process will be located on a different machine than when you run spark-shell. That could cause Spark to read a different file.

Reading multiple avro files into RDD from a nested directory structure

Suppose I have a directory which contains a bunch of Avro files and I want to read them all in one shot. This code works fine:
val path = "hdfs:///path/to/your/avro/folder"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
However, if the folder contains subfolders and the Avro files are in those subfolders, then I get an error:
5/10/30 14:57:47 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 6,
hadoop1): java.io.FileNotFoundException: Path is not a file: /folder/subfolder
Is there any way I can read all the Avro files (even in subdirectories) into an RDD?
All the Avro files have the same schema, and I am on Spark 1.3.0.
Edit:
Based on the suggestion below, I executed this line in my spark-shell:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
and this solved the problem... but now my code is very, very slow, and I don't understand what a MapReduce setting has to do with Spark.
