spark sqlContext read parquet S3 path not found - apache-spark

I am using Spark 2.3 with Scala 2.11.8 on AWS EMR and am seeing an S3 "path not found" error, but the path exists. aws s3 ls clearly shows the directory and its contents are fine.
org.apache.spark.sql.AnalysisException: Path does not exist: s3://dev-us-east-1/data/v1/output/20190115/individual/part-00000-b8450da0-15e9-482e-b588-08d6baa0637a-c000.snappy.parquet;
val df = sqlContext.read.parquet("s3://dev-us-east-1/data/v1/output/" + dt + "/individual/part-*.snappy.parquet")
Other folders/files load just fine with the same code. I'm wondering if there are file size limits, or a memory issue masquerading as a path issue? I've also read about using s3a:// and s3n:// rather than s3://, but I am new to Spark, and a quick try changing my path to s3a:// got me an ACCESS DENIED exception.
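For reference, a minimal sketch of the same read without globbing individual part files, assuming dt holds the date string 20190115 from the error message. Spark discovers the part files on its own when given the partition directory, and on EMR the s3:// scheme already maps to the EMR S3 connector; this is a sketch, not a confirmed fix.
// Sketch: read the whole date directory and let Spark discover the part files.
val dt = "20190115"  // assumed value, taken from the path in the error message
val df = sqlContext.read.parquet(s"s3://dev-us-east-1/data/v1/output/$dt/individual/")
df.printSchema()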

Related

How to read a CSV file from an FTP server using PySpark in Databricks Community

I am trying to fetch a file over FTP (hosted on Hostinger) using PySpark in Databricks Community.
Everything works fine until I try to read that file using spark.read.csv('MyFile.csv').
Following is the code and an error,
PySpark Code:
res=spark.sparkContext.addFile('ftp://USERNAME:PASSWORD#URL/<folder_name>/MyFile.csv')
myfile=SparkFiles.get("MyFile.csv")
spark.read.csv(path=myfile) # Errors out here
print(myfile, sc.getRootDirectory()) # Outputs almost same path (except dbfs://)
Error:
AnalysisException: Path does not exist: dbfs:/local_disk0/spark-ce4b313e-00cf-4f52-80d8-dff98fc3eea5/userFiles-90cd99b4-00df-4d59-8cc2-2e18050d395/MyFile.csv
Because spark.sparkContext.addFile downloads the file to the driver's local disk, while Databricks uses DBFS as the default filesystem, the path is resolved against dbfs:/ and not found, which is why you are getting the error. Please try the code below, which prefixes the path with the file: scheme, to see if it fixes your issue.
res=spark.sparkContext.addFile('ftp://USERNAME:PASSWORD#URL/<folder_name>/MyFile.csv')
myfile=SparkFiles.get("MyFile.csv")
spark.read.csv(path='file:'+myfile)
print(myfile, sc.getRootDirectory())

How to convert a local CSV file to a Spark dataframe on a Jupyter server?

My CSV file is inside my directory on the Jupyter server. I am getting an error whenever I try to load it in my notebook as a Spark dataframe using spark.read.csv.
The error I am getting is:
AnalysisException: u'Path does not exist.
spark.read.csv expects the location of my file to be on HDFS, while it is inside my Jupyter directory. How do I resolve this?
To copy the file from the local filesystem to HDFS, try this:
hadoop fs -copyFromLocal /local/path/to/file.csv /user/hadoop/hadoopdir
Or
hdfs dfs -put /local/path/to/file.csv /user/hadoop/hadoopdir
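Once the file has been copied, point spark.read.csv at the HDFS path, or skip the copy entirely with an explicit file: URI. A minimal Scala sketch, assuming the spark session provided by spark-shell or a notebook and a CSV with a header row (the same calls and paths work from PySpark):
// Read the copy that was put into HDFS (destination from the commands above).
val dfFromHdfs = spark.read.option("header", "true").csv("hdfs:///user/hadoop/hadoopdir/file.csv")
// Or read straight from the local filesystem by being explicit about the scheme.
// This only works if the path is visible on every node (always true in local mode).
val dfLocal = spark.read.option("header", "true").csv("file:///local/path/to/file.csv")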

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using SparkContext (sc) from spark-shell in the terminal. For some reason, going through the IntelliJ application and Spark Streaming is not working. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So the method expects a directory path as its parameter, not the path to a single file.
So I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark Streaming will not read pre-existing files, so first run the spark-submit command and then create the file in the monitored directory. Make sure that in the spark-submit command you give only the directory name and not the file name. Below is a sample command; here I am passing the directory name to the application as its first parameter. You can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt
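For context, a rough sketch of what such a streaming job can look like, with the monitored directory taken from the first argument (a minimal example, not the asker's actual HdfsWordCount class):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    // args(0) is the directory to monitor, e.g. file:///Users/userName/Documents/Notes/MoreNotes/
    val lines = ssc.textFileStream(args(0))
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}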

(python) Spark .textFile(s3://...) access denied 403 with valid credentials

In order to access my S3 bucket I have exported my credentials:
export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESSS_ACCESS_KEY=
I can verify that everything works by doing
aws s3 ls mybucket
I can also verify with boto3 that it works in python
import boto3

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
    .put(Body=open("text.py", "rb"), ContentType="text/x-py")
This works and I can see the file in the bucket.
However when I do this with spark:
spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*")
I get a nice
Caused by: org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId><RequestId>XXXXX</RequestId><HostId>xxxxxxx</HostId></Error>
This is how I submit the job locally:
spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py
Why does it work from the command line and with boto3, but Spark is choking?
EDIT:
Same issue using s3a:// with
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "xxxx")
hadoopConf.set("fs.s3a.secret.key", "xxxxxxx")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
and same issue using aws-sdk 1.7.4 and hadoop 2.7.2
Spark will automatically copy your AWS credentials into the s3n and s3a secrets. Apache Spark releases don't touch s3:// URLs, because in Apache Hadoop the s3:// scheme is associated with the original, now deprecated, S3 client, which is incompatible with everything else.
On Amazon EMR, s3:// is bound to Amazon's own EMR S3 connector, and the EC2 VMs provide the secrets to the executors automatically, so I don't think it bothers with the environment variable propagation mechanism. It might also be that, because of how it sets up the authentication chain, you can't override the EC2/IAM credentials.
If you are trying to talk to S3 and you are not running in an EMR VM, then presumably you are using Apache Spark with the Apache Hadoop JARs, not the EMR versions. In that world, use URLs with s3a:// to get the latest S3 client library.
If that doesn't work, look at the troubleshooting section of the Apache docs. There's a section on 403 errors there, including recommended steps for troubleshooting. It can be due to classpath/JVM version problems as well as credential issues, and even clock skew between the client and AWS.
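A minimal Scala sketch of that s3a:// setup outside EMR, assuming the credentials live in the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables and that matching hadoop-aws and aws-java-sdk JARs are on the classpath (the PySpark version in the question's EDIT sets the same fs.s3a.* keys):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-read-sketch").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Same read as in the question, but against the s3a connector.
val rdd = spark.sparkContext.textFile("s3a://mybucket/my/path/*")
println(rdd.count())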

Spark Shell Read Local Parquet File

Is it possible to read Parquet files from the local file system in the Spark shell?
I'm having the problem that it wants to read them from HDFS.
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")
df = sqlContext.parquetFile('/Users/file.parquet')
Also tried: 'file:///Users/file.parquet'
Error:
java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/Users/file.parquet
Everything worked magically when I moved to Spark 1.5 instead of 1.3.
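For anyone on a newer Spark, a short sketch of reading a local Parquet file from the shell with an explicit file: scheme, assuming the spark session that spark-shell provides and that the file is readable on the driver machine:
// In spark-shell (Spark 2.x), the SparkSession is predefined as `spark`.
val df = spark.read.parquet("file:///Users/file.parquet")
df.show(5)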
