Spark cannot load the CSV file - python-3.x

spark = SparkSession.builder \
    .master("spark://ip:7077") \
    .appName("users mobile location information analysis") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.extraClassPath", "/opt/anaconda3/jars/ojdbc6.jar") \
    .config("spark.executor.pyspark.memory", "2g") \
    .config("spark.driver.maxResultSize", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.extraClassPath", "/opt/anaconda3/jars/ojdbc6.jar") \
    .enableHiveSupport() \
    .getOrCreate()
I am trying to read a CSV file that sits in a report folder on my local PC, but it is not present on the master machine. Is there any problem with my code? I use the following code to read the CSV file:
info_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .load("report/info.csv")
And I get the following error, which shows that Spark cannot find the file. What is the probable solution?
Py4JJavaError: An error occurred while calling o580.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 31, ip , executor 4): java.io.FileNotFoundException: File file:/C:/Users/taimur.islam/Desktop/banglalink/Data Science/High Value Prediction/report/info.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
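Because the session runs in client deploy mode against a remote standalone master, the executors on the worker machines cannot see a path that exists only on the local PC. A minimal sketch of one common workaround (an assumption, not the only fix) is to read the file on the driver with pandas and convert it, or to copy the file to a filesystem every node can reach, such as HDFS:
import pandas as pd

# Sketch: read the CSV on the driver, where the file actually lives, and hand it
# to Spark so the executors never need the local path. Assumes the file fits in
# driver memory; for larger files, copy them to HDFS (or another shared
# filesystem) and load them from there instead.
local_df = pd.read_csv("report/info.csv")
info_df = spark.createDataFrame(local_df)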

Related

Writing to console through Python (PySpark) Kafka failing

I have written a simple program for reading a CSV file through a Kafka topic and writing it to the console.
When I run the job I am able to print the schema, but the message is not getting displayed.
Below is the Spark config:
.config("spark.jars",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.executor.extraClassPath",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.executor.extraLibrary",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.driver.extraClassPath",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/**commons-pool2-2.11.1.jar**") \
.config("spark.cassandra.connection.host", c_host_nm) \
.config("spark.cassandra.connection.port", c_port_number) \
.getOrCreate()
I am getting the error:
pyspark.sql.utils.StreamingQueryException: Query [id = dfa3b327-ac6b-4426-956b-0587501592d2, runId = 49ee4949-47a9-4c8c-b46e-d121ae9ac759] terminated with exception: Writing job aborted
The start of the log says:
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate
Any thoughts on what needs to be done here?
(Python 3.9.2, Kafka 3.2.1, PySpark 3.3.0)
Thanks
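One thing worth checking (an assumption based on the error, not something the log proves): a NoSuchMethodError in org.apache.spark.kafka010.KafkaTokenUtil usually means the Kafka connector was built for a different Spark version than the one running, here spark-sql-kafka-0-10_2.12-3.2.1 against PySpark 3.3.0. A sketch of letting Spark resolve a connector that matches the running version instead of pinning jars by hand:
from pyspark.sql import SparkSession

# Sketch: let Spark pull a Kafka connector that matches the running Spark
# version (3.3.0 here) rather than mixing a 3.2.1 connector jar with PySpark 3.3.0.
spark = SparkSession.builder \
    .appName("kafka-to-console") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0") \
    .getOrCreate()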

Read CSV file on Spark

I have started working with Spark and ran into a problem.
I tried reading a CSV file using the code below:
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv")
df.show(5)
Error:
Py4JJavaError: An error occurred while calling o38.csv.
: java.lang.OutOfMemoryError: Java heap space
I am working on Linux Ubuntu in VirtualBox, under ~/Serverspace.
You can try increasing the driver memory by creating the Spark session like below:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "4g") \
    .appName('read-csv') \
    .getOrCreate()
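With the larger driver heap, the same read can be retried; a small usage sketch using the path from the question (header and schema inference are assumptions about the file):
# Retry the read with the 4g driver; header/inferSchema are assumed, adjust as needed.
df = spark.read.csv(
    "/home/oybek/Serverspace/Serverspace/Athletes.csv",
    header=True,
    inferSchema=True,
)
df.show(5)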

java.io.FileNotFoundException in Spark structured streaming job

I am trying to run a Spark Structured Streaming job that reads CSV files from a local directory and loads them into HDFS in Parquet format.
I start the pyspark job as follows:
pyspark2 --master yarn --executor-memory 8G --driver-memory 8G
The code looks as follows:
from pyspark.sql.types import StructType
sch = StructType(...)
spark.readStream \
    .format("csv") \
    .schema(sch) \
    .option("header", True) \
    .option("delimiter", ',') \
    .load("<Load_path>") \
    .writeStream \
    .format("parquet") \
    .outputMode("append") \
    .trigger(processingTime='10 seconds') \
    .option("path", "<Hdfs_path>") \
    .option("checkpointLocation", "<Checkpoint Loc>") \
    .start()
The load path is like file:////home/pardeep/file2, where file2 is a directory name (not a file).
It runs fine at the start, but after more CSV files are added to the source folder, it gives the error below:
Caused by: java.io.FileNotFoundException: File file:<file>.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
The error does not always come right after the first file is added; sometimes it appears after the first addition, sometimes after the second.
There is another job that moves files into this folder: it writes them to a temp folder and then moves them here.
At the start there are some files present in the directory, and new files keep arriving continuously (every 2-3 minutes). I am not sure how to do the refresh (what is the table name?) or when to do it, because this is a streaming job.
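A hedged sketch of one way to make the stream tolerate files that vanish between listing and reading; the usual root cause is that the move into file2 is not atomic (a move across filesystems is a copy), so the more robust fix is to have the producing job write to a temp directory on the same filesystem and rename into place:
# Sketch: ignore files that disappear after the micro-batch has been planned.
# This masks the symptom; the cleaner fix is an atomic rename into the source
# directory so half-written or half-moved files are never listed.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")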

Writing large DataFrame from PySpark to Kafka runs into timeout

I'm trying to write a DataFrame which has about 230 million records to Kafka, more specifically to a Kafka-enabled Azure Event Hub, but I'm not sure if that's actually the source of my issue.
EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'
dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("topic", "mytopic") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()
This starts up fine and writes about 3-4 million records successfully (and pretty fast) to the queue, but then the job stops after a couple of minutes with messages like these:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 248, 10.139.64.5, executor 1): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Expiring 61 record(s) for mytopic-18: 32839 ms has passed since last append
or
org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 8.0 failed 4 times, most recent failure: Lost task 13.3 in stage 8.0 (TID 348, 10.139.64.5, executor 1): kafkashaded.org.apache.kafka.common.errors.TimeoutException: The request timed out.
Also, I never see the checkpoint file being created/written to.
I also played around with .option("kafka.delivery.timeout.ms", 30000) and different values but that didn't seem to have any effect.
I'm running this on an Azure Databricks cluster, runtime version 5.0 (includes Apache Spark 2.4.0, Scala 2.11).
I don't see any errors like throttling on my Event Hub, so that should be ok.
Finally figured it out (mostly):
It turns out the default batch size of about 16000 bytes (Kafka's batch.size is measured in bytes) was too large for the endpoint. After I set the batch.size parameter to 5000, it worked and is writing at about 700k messages per minute to the Event Hub. Also, the timeout parameter above was wrong and was just being ignored; the correct one is kafka.request.timeout.ms.
The only remaining issue is that it still randomly runs into timeouts and apparently starts from the beginning again, so I'm ending up with duplicates. Will open another question for that.
dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.batch.size", 5000) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("kafka.request.timeout.ms", 120000) \
    .option("topic", "raw") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()

Spark Hive reporting ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
test_pokes.py
However, I encounter the error:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0485/container_e09_1477084339086_0485_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
...
File /container_e09_1477084339086_0485_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at
...
...
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at
...
... 27 more
Caused by: MetaException(message:Failed to instantiate listener named: com.ibm.biginsights.bigsql.sync.BIEventListener, reason: java.lang.ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getMetaStoreListeners(MetaStoreUtils.java:1478)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:481)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
... 32 more
See also previous errors related to this issue:
hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
The solution was to use hive-site.xml from the spark-client folder:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/iop/current/spark-client/conf/hive-site.xml \
test_pokes.py
This is captured in the docs: http://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_spark.html
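For reference, on newer Spark releases the same Hive read is usually done through a SparkSession with Hive support enabled rather than a HiveContext; a sketch (not specific to BigInsights 4.2, and it still assumes hive-site.xml is supplied, e.g. via --files):
from pyspark.sql import SparkSession

# Sketch: SparkSession-based equivalent of the HiveContext script above,
# assuming the Hive metastore configuration is on the classpath.
spark = SparkSession.builder \
    .appName("test_pokes") \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("select * from pokes").show()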
