I want to save a PySpark dataframe of ~14 million rows into 6 different files.
After cleaning the data:
clean_data.repartition(6).write.option("sep", "\t").option("header", "true").csv("s3_path", mode="overwrite")
I got this error:
An error was encountered:
An error occurred while calling o258.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:195)
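For reference, a minimal sketch of the same write, reformatted for readability (assuming clean_data is the cleaned dataframe from above); coalesce(6) is shown as a cheaper alternative to repartition(6), since it only reduces the partition count and avoids a full shuffle:

# Sketch of the intended write; coalesce avoids the full shuffle that
# repartition triggers when you are only reducing the partition count.
(clean_data
    .coalesce(6)
    .write
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3_path", mode="overwrite"))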
On my new Windows machine, I am using Spark 3.1.1 (with winutils for Hadoop) to create a global temp view from a CSV file, like so:
DF.createOrReplaceGlobalTempView("firstTable");
where DF is a Dataset<Row> containing the CSV data.
I get the following error on the global view creation:
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: Error while running command to get file permissions : ExitCodeException exitCode=-1073741515:
What's the problem?
Thanks
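For comparison, the equivalent flow in PySpark (a minimal sketch; "my_data.csv" is a hypothetical stand-in for the actual CSV, and a correctly configured winutils is assumed):

# Read the CSV and register it as a global temp view.
df = spark.read.csv("my_data.csv", header=True, inferSchema=True)
df.createOrReplaceGlobalTempView("firstTable")

# Global temp views are resolved through the special global_temp database.
spark.sql("SELECT * FROM global_temp.firstTable").show()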
I'm new to PySpark. I'm running PySpark on Databricks, and my data is stored in Azure Data Lake Storage (ADLS). I'm trying to read a CSV file from ADLS into a PySpark dataframe, so I wrote the following code:
import pyspark
from pyspark import SparkContext
from pyspark import SparkFiles
df = sqlContext.read.csv(SparkFiles.get("dbfs:mycsv path in ADSL/Data.csv"),
                         header=True, inferSchema=True)
But I'm getting this error message:
Py4JJavaError: An error occurred while calling o389.csv.
Can you suggest how to rectify this error?
The SparkFiles class is intended for accessing files shipped as part of the Spark job. If you just need to read a CSV file that is already on ADLS, use spark.read.csv, like:
df = spark.read.csv("dbfs:mycsv path in ADSL/Data.csv",
                    header=True, inferSchema=True)
Also, it's better not to use sqlContext; it's kept only for compatibility reasons.
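For illustration, if the ADLS container were mounted into DBFS (the mount point /mnt/adls below is hypothetical), the read would look like:

# Hypothetical mount point; replace /mnt/adls with the actual DBFS
# mount of your ADLS container.
df = spark.read.csv("/mnt/adls/Data.csv", header=True, inferSchema=True)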
I'm trying to read multiple text files into a single DataFrame in PySpark and then call show(), but I'm getting an error on the second file path.
BUYERS10_m1 = spark.read.text(Buyers_F1_path,Buyers_F2_path)
BUYERS10_m1.show()
Py4JJavaError: An error occurred while calling o245.showString.
: java.lang.IllegalArgumentException: For input string: "s3a://testing/Buyers/File2.TXT"
Does anyone have any idea why I'm getting this error and how to resolve it?
The following should work:
spark.read.text("s3a://testing/Buyers/File{1,2}.TXT")
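The likely cause of the original error: DataFrameReader.text takes a single paths argument (a string or a list of strings), so the second positional argument is consumed by the wholetext parameter, which Spark then fails to parse as a boolean. Passing the paths as a list should therefore also work (a sketch, reusing the path variables from the question):

# Pass multiple paths as one list; a second positional string would be
# taken as the `wholetext` flag, hence the IllegalArgumentException.
BUYERS10_m1 = spark.read.text([Buyers_F1_path, Buyers_F2_path])
BUYERS10_m1.show()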
I am running PySpark locally and trying to read a Parquet file into a dataframe from a notebook.
df = spark.read.parquet("metastore_db/tmp/userdata1.parquet")
I am getting this exception:
An error occurred while calling o738.parquet.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Does anyone know how to do it?
Assuming that you are running Spark locally, you should be doing something like:
df = spark.read.parquet("file:///metastore_db/tmp/userdata1.parquet")
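If the Parquet file lives under the notebook's working directory rather than at the filesystem root, resolving the absolute path explicitly avoids ambiguity (a sketch; os.path.abspath resolves relative to wherever the notebook was started):

import os

# Resolve the relative path against the current working directory
# before handing it to Spark with the file:// scheme.
local_path = os.path.abspath("metastore_db/tmp/userdata1.parquet")
df = spark.read.parquet("file://" + local_path)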
When I am trying to save a dataframe as a Hive table in PySpark:
df_writer.saveAsTable('hive_table', format='parquet', mode='overwrite')
I am getting the following error:
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hostname:8020/apps/hive/warehouse/testdb.db/hive_table
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
The path exists up to 'hdfs://hostname:8020/apps/hive/warehouse/testdb.db/'.
Please provide your inputs.
Try using DataFrameWriter with insertInto (shown here in PySpark; db_name and table_name are placeholders):

df.write.mode("append").insertInto(f"{db_name}.{table_name}")
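Alternatively, a sketch using a database-qualified saveAsTable, which lets Spark create and manage the table path itself (assuming the testdb database already exists):

# Qualifying the table name avoids resolving 'hive_table' against the
# current database; Spark manages the warehouse path itself.
df.write.mode("overwrite").format("parquet").saveAsTable("testdb.hive_table")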