java.io.FileNotFoundException in Spark structured streaming job - apache-spark

I am trying to run a Spark Structured Streaming job that reads CSV files from a local directory and loads them into HDFS in Parquet format.
I start a PySpark session as follows:
pyspark2 --master yarn --executor-memory 8G --driver-memory 8G
The code looks as follows:
from pyspark.sql.types import StructType
sch = StructType(...)
spark.readStream \
    .format("csv") \
    .schema(sch) \
    .option("header", True) \
    .option("delimiter", ',') \
    .load("<Load_path>") \
    .writeStream \
    .format("parquet") \
    .outputMode("append") \
    .trigger(processingTime='10 seconds') \
    .option("path", "<Hdfs_path>") \
    .option("checkpointLocation", "<Checkpoint Loc>") \
    .start()
The load path is like file:////home/pardeep/file2, where file2 is a directory name (not a file).
The job runs fine at first, but after more CSV files are added to the source folder it gives the error below:
Caused by: java.io.FileNotFoundException: File file:<file>.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
The error does not always appear after the first file is added; sometimes it comes after the first file, sometimes after the second.
There is another job that moves files into this folder: it writes to a temp folder and then moves the files here.
At the start there are already some files in the directory, and new files keep arriving continuously (every 2-3 minutes). I am not sure how to do the refresh (what is the table name?) or when to do it, because this is a streaming job.
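For context, the file source expects each file to appear in the watched directory atomically, which is typically achieved with a move on the same filesystem. A minimal sketch of what the producer job could do, assuming the staging directory and the watched directory are on the same filesystem (paths here are illustrative, not from the original post):
import os

def publish_atomically(tmp_path, watched_dir):
    # Write the file outside the watched directory first, then make it appear
    # in a single step. os.rename is atomic only when source and destination
    # live on the same filesystem.
    final_path = os.path.join(watched_dir, os.path.basename(tmp_path))
    os.rename(tmp_path, final_path)

publish_atomically("/home/pardeep/tmp/part-0001.csv", "/home/pardeep/file2")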

Related

Failed to run pyspark's .withColumn function

I am trying to run PySpark in PyCharm on Windows 10, but I kept getting a weird error message on Node 81 related to the JVM when trying to execute the simple functions .withColumn() and .withColumnRenamed(). I have a tmp folder on my desktop (see the attached image), and I set all the environment variables for HADOOP_PATH, JAVA_HOME, PATH, PYTHON_PATH and SPARK_HOME. I was also able to create the Spark session with the following lines of code:
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Data Est") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", 400) \
    .config("spark.sql.broadcastTimeout", -1) \
    .config("spark.sql.session.timezone", "UTC") \
    .config("spark.local.dir", "<some directory path on desktop>") \
    .getOrCreate()
System Environment Variables - Windows 10 64-bit
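As a quick sanity check (a sketch, not from the original post), the configuration the running session actually picked up can be printed after getOrCreate():
# Confirms whether spark.driver.memory, spark.local.dir, etc. were applied.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)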

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with PySpark.
I created the Spark session with configuration to include the JDBC jar file, but it didn't solve the problem. The current code to create the session looks like this:
path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.builder \
    .config("spark.jars", path) \
    .config("spark.driver.extraClassPath", path) \
    .config("spark.executor.extraClassPath", path) \
    .enableHiveSupport() \
    .getOrCreate()
Note that I've tried every configuration variant I know of
(checking permissions, pointing at both the HDFS and local paths, adding or removing configuration, ...).
And then, the code to load the data is:
sql = "SOME_SQL_TO_RETRIEVE_DATA"
df = spark.read.format('jdbc') \
    .option('dbtable', sql) \
    .option('url', 'jdbc:mariadb://{host}:{port}/{db}') \
    .option("user", SOME_USER) \
    .option("password", SOME_PASSWORD) \
    .option("driver", 'org.mariadb.jdbc.Driver') \
    .load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I tried this with spark-submit, I saw this log message:
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who suffers from the same problem:
I figured it out. The Spark documentation says:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead of setting the configuration in Python code, I added arguments to spark-submit, following that documentation:
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
MY_PYTHON_SCRIPT.py
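Equivalently (per the quoted note about the "default properties file"), the same settings could go into conf/spark-defaults.conf; a sketch using the same placeholder path:
spark.driver.extraClassPath  /PATH/TO/JDBC/my-jdbc.jar
spark.jars                   /PATH/TO/JDBC/my-jdbc.jar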

spark dataframe not successfully written in elasticsearch

I am writing data from my Spark DataFrame into ES. I printed the schema and the total record count, and everything seems fine until the dump starts. The job runs successfully and no issue/error is raised in the Spark job, but the index doesn't end up with the amount of data it should have.
I have 1800k records to dump, and sometimes it dumps only 500k, sometimes 800k, etc.
Here is the main section of the code:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config('spark.yarn.executor.memoryOverhead', '4096') \
    .enableHiveSupport() \
    .getOrCreate()
final_df = spark.read.load("/trans/MergedFinal_stage_p1", multiline="false", format="json")
print(final_df.count()) # It is perfectly ok
final_df.printSchema() # Schema is also ok
## Issue arises when the data gets written to the DB ##
final_df.write.mode("ignore").format(
'org.elasticsearch.spark.sql'
).option(
'es.nodes', ES_Nodes
).option(
'es.port', ES_PORT
).option(
'es.resource', ES_RESOURCE,
).save()
My resources are also ok.
Command to run the Spark job:
time spark-submit --class org.apache.spark.examples.SparkPi --jars elasticsearch-spark-30_2.12-7.14.1.jar --master yarn --deploy-mode cluster --driver-memory 6g --executor-memory 3g --num-executors 16 --executor-cores 2 main_es.py
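Regarding mode("ignore") in the write above: "ignore" maps to Spark's SaveMode.Ignore, whose documented contract is to skip the save entirely when data already exists at the destination. For comparison, an append-mode variant (a sketch assuming the same ES_Nodes, ES_PORT and ES_RESOURCE values; not from the original post) would look like:
# "append" adds documents instead of skipping the write when the index exists.
final_df.write.mode("append") \
    .format('org.elasticsearch.spark.sql') \
    .option('es.nodes', ES_Nodes) \
    .option('es.port', ES_PORT) \
    .option('es.resource', ES_RESOURCE) \
    .save()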

Why does spark-submit ignore the package that I include as part of the configuration of my spark session?

I am trying to include the org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 package as part of my Spark code (via the SparkSession builder). I understand that I can download the JAR myself and include it, but I would like to figure out why the following is not working as expected:
from pyspark.sql import SparkSession
import pyspark
import json
if __name__ == "__main__":
spark = SparkSession.builder \
.master("local") \
.appName("App Name") \
.config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "first_topic") \
.load() \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
query = df \
.writeStream \
.format("console") \
.outputMode("update") \
.start()
When I run the job:
spark-submit main.py
I receive the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.load.
: org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If I instead include the packages via the --packages flag, the dependencies are downloaded and the code runs as expected:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 main.py
The code also works if I open the PySpark shell and paste the code above. Is there a reason that spark-submit ignores the configuration?
I think that configurations like "spark.jars.packages" should be set either in spark-defaults or passed as command-line arguments; setting them at runtime shouldn't work.
Against better judgement:
I remember some people claiming something like this worked for them, but I would say the dependency was already there (installed in the local repo) and just got loaded.
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3")
spark = SparkSession.builder \
    .master("local") \
    .appName("App Name") \
    .config(conf=conf) \
    .getOrCreate()
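For completeness, the spark-defaults route mentioned above would be a single line in conf/spark-defaults.conf (a sketch, assuming a standard Spark installation layout):
spark.jars.packages  org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3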
When you run spark-submit, it already creates a SparkSession that is reused by your code - thus you have to provide everything through spark-submit.
However, you do not need to actually use spark-submit to run your Spark code. Assuming your main method looks like this:
def main():
    spark = SparkSession.builder.config(...).getOrCreate()
    # your spark code below
    ...
You can run this code just via python:
> python ./my_spark_script.py
This will run your program correctly.
I faced the same problem; after googling, I found this link:
https://issues.apache.org/jira/browse/SPARK-21752
According to Sean R. Owen (srowen): "At that point, your app has already launched. You can't change the driver classpath."

Spark can not load the csv file

spark = SparkSession.builder \
    .master("spark://ip:7077") \
    .appName("usres mobile location information analysis") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.memory", "2g") \
    .config('spark.executor.cores', "2") \
    .config("spark.executor.extraClassPath", "/opt/anaconda3/jars/ojdbc6.jar") \
    .config("spark.executor.pyspark.memory", "2g") \
    .config("spark.driver.maxResultSize", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.extraClassPath", "/opt/anaconda3/jars/ojdbc6.jar") \
    .enableHiveSupport() \
    .getOrCreate()
I am trying to read a CSV file located on my local PC in a report folder, but it is not located on the master. Is there any problem with my code? I use the following code to read the CSV file:
info_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .load("report/info.csv")
And I get the following error, which shows that Spark can't find the file. What is the probable solution?
Py4JJavaError: An error occurred while calling o580.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 31, ip , executor 4): java.io.FileNotFoundException: File file:/C:/Users/taimur.islam/Desktop/banglalink/Data Science/High Value Prediction/report/info.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
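Since the job runs against a standalone cluster (master spark://ip:7077), executors on other machines cannot see a path that exists only on the local PC, which matches the executor-side FileNotFoundException above. One hedged workaround sketch, assuming the CSV is small and exists only on the driver machine (not from the original thread):
import pandas as pd

# Read the file on the driver, where it actually exists, then hand it to Spark.
# Only reasonable for files small enough to fit in driver memory.
pdf = pd.read_csv("report/info.csv")
info_df = spark.createDataFrame(pdf)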
