Pyspark failed to save df to S3 - python-3.x

I want to save a pyspark dataframe of ~14 million rows into 6 different files.
After cleaning the data:
clean_data.repartition(6).write.option("sep", "\t").option("header", "true").csv("s3_path", mode="overwrite")
I got this error:
An error was encountered:
An error occurred while calling o258.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:195)

Related

Problem when calling createOrReplaceGlobalTempView on a dataframe

On my new Windows machine, I am using Spark 3.1.1 (with winutils for Hadoop) to create a global temp view from a csv file, like so:
DF.createOrReplaceGlobalTempView("firstTable");
Where DF is a Dataset<Row> containing the csv data.
I get the following error when creating the global view:
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: Error while running command to get file permissions : ExitCodeException exitCode=-1073741515:
What's the problem?
Thanks

Py4JJavaError: An error occurred while calling o389.csv

I'm new to pyspark. I'm running pyspark on Databricks, and my data is stored in Azure Data Lake Storage. I'm trying to read a csv file from ADLS into a pyspark data frame, so I wrote the following code:
import pyspark
from pyspark import SparkContext
from pyspark import SparkFiles
df = sqlContext.read.csv(SparkFiles.get("dbfs:mycsv path in ADSL/Data.csv"),
header=True, inferSchema= True)
But I'm getting this error message:
Py4JJavaError: An error occurred while calling o389.csv.
Can you suggest how to rectify this error?
The SparkFiles class is intended for accessing files shipped as part of the Spark job. If you just need to read a CSV file that lives on ADLS, use spark.read.csv directly, like:
df = spark.read.csv("dbfs:mycsv path in ADSL/Data.csv",
header=True, inferSchema=True)
It's better not to use sqlContext; it's kept only for compatibility reasons.
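
For contrast, here is a minimal sketch of what SparkFiles is actually meant for, using a hypothetical file URL: a side file is shipped to every node with addFile and then resolved to a local path with SparkFiles.get.
from pyspark import SparkFiles
# Ship a side file to every node in the job (hypothetical URL):
spark.sparkContext.addFile("https://example.com/lookup.csv")
# Resolve the local path where Spark placed the file:
local_path = SparkFiles.get("lookup.csv")
lookup_df = spark.read.csv("file://" + local_path, header=True)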

Read multiple text files into a single dataframe

I'm trying to read multiple text files into a single DataFrame in Pyspark and then call show(), but I get an error on the second file path.
BUYERS10_m1 = spark.read.text(Buyers_F1_path, Buyers_F2_path)
BUYERS10_m1.show()
Py4JJavaError: An error occurred while calling o245.showString.
: java.lang.IllegalArgumentException: For input string: "s3a://testing/Buyers/File2.TXT"
Does anyone have any idea why I'm getting this error and how to resolve it?
The following should work:
spark.read.text("s3a://testing/Buyers/File{1,2}.TXT")

Unable to read a parquet file locally in Spark

I am running Pyspark locally and trying to read a parquet file into a data frame from a notebook.
df = spark.read.parquet("metastore_db/tmp/userdata1.parquet")
I am getting this exception:
An error occurred while calling o738.parquet.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Does anyone know how to fix this?
Assuming that you are running Spark locally, you should be doing something like:
df = spark.read.parquet("file:///metastore_db/tmp/userdata1.parquet")

Unable to save data frame as Hive table, throwing file not found exception

When I try to save a data frame as a Hive table in pyspark:
df_writer.saveAsTable('hive_table', format='parquet', mode='overwrite')
I am getting the following error:
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hostname:8020/apps/hive/warehouse/testdb.db/hive_table
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
The path exists up to 'hdfs://hostname:8020/apps/hive/warehouse/testdb.db/'.
Please provide your inputs.
Try using DataFrameWriter, as in this Scala snippet:
df.write.mode(SaveMode.Append).insertInto(s"${dbName}.${t.table}")
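The snippet above is Scala. A PySpark equivalent would look like the sketch below; note that insertInto requires the target table to already exist in the metastore (unlike saveAsTable, which creates it):
# PySpark sketch; 'testdb.hive_table' must already exist:
df.write.mode("append").insertInto("testdb.hive_table")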
