Spark Shell Read Local Parquet File - apache-spark

Is it possible to read Parquet files from the local file system in the Spark shell?
The problem I'm having is that it tries to read them from HDFS.
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")
df = sqlContext.parquetFile('/Users/file.parquet')
Also tried: 'file:///Users/file.parquet'
Error:
java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/Users/file.parquet

Everything worked magically when I moved to Spark 1.5 instead of 1.3.
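For reference, a minimal sketch of the same read on a newer Spark (2.x+), assuming the file really sits on the driver's local filesystem; the explicit file:// scheme keeps Spark from resolving the path against the default filesystem (HDFS):
from pyspark.sql import SparkSession
# Local master, so the driver and executors all see the same local filesystem.
spark = SparkSession.builder.master("local[*]").appName("LocalParquetRead").getOrCreate()
# file:// forces the local filesystem instead of the cluster default (HDFS).
df = spark.read.parquet("file:///Users/file.parquet")
df.show()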

Related

How to convert a local CSV file to a Spark dataframe on a Jupyter server?

My CSV file is in my directory on the Jupyter server. I get an error whenever I try to load it in my notebook as a Spark dataframe using spark.read.csv.
The error I am getting is:
AnalysisException: u'Path does not exist.
spark.read.csv expects my file to be on HDFS, while it is in my Jupyter directory. How can I resolve this?
To copy the file from the local filesystem to HDFS, try this:
hadoop fs -copyFromLocal /local/path/to/file.csv
Or
hdfs dfs -put /local/path/to/file.csv /user/hadoop/hadoopdir
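After the copy, a minimal PySpark sketch for reading the file back from HDFS (the destination directory is the one from the put example above, and it assumes the file kept its name; adjust the path to wherever you actually copied it):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ReadCsvFromHdfs").getOrCreate()
# A path without a scheme resolves against the default filesystem, i.e. HDFS here.
df = spark.read.csv("/user/hadoop/hadoopdir/file.csv", header=True, inferSchema=True)
df.show(5)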

spark sqlContext read parquet S3 path not found

I am using Spark 2.3 with Scala 2.11.8 on AWS EMR and am seeing "s3 path not found", but the path exists; aws s3 ls clearly shows the directory and its contents are fine.
org.apache.spark.sql.AnalysisException: Path does not exist: s3://dev-us-east-1/data/v1/output/20190115/individual/part-00000-b8450da0-15e9-482e-b588-08d6baa0637a-c000.snappy.parquet;
val df= sqlContext.read.parquet("s3://dev-us-east-1/data/v1/output/"""+dt+"""/individual/part-*.snappy.parquet")
Other folders/files load just fine with the same code. I am wondering if there is a file size limit or a memory issue masquerading as a path issue? I've also read about using s3a:// or s3n:// rather than s3://, but I am new to Spark, and a quick try changing my path to s3a:// got me an ACCESS DENIED exception.
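For what it's worth, if you go down the s3a:// route, the s3a connector needs credentials of its own; a hedged sketch of one common way to supply them via the spark.hadoop.* passthrough (the key values are placeholders, and the object path is the one from the question):
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("S3aRead") \
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY") \
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY") \
    .getOrCreate()
# Same prefix as in the question, but read through the s3a connector.
df = spark.read.parquet("s3a://dev-us-east-1/data/v1/output/20190115/individual/")
df.printSchema()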

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using the SparkContext (sc) from spark-shell in the terminal. For some reason, going through the IntelliJ application with Spark Streaming is not working. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So the method expects a directory path as its parameter.
I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark Streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure that in the spark-submit command you give only the directory name, not the file name. Below is a sample command; here, I am passing the directory name through the spark-submit command as the first program argument. You can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt
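For completeness, a minimal PySpark sketch of the same directory-monitoring pattern (assuming Spark 2.x, where the DStream API is available); again, only files created in the directory after the stream starts are picked up:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "LocalDirStream")
ssc = StreamingContext(sc, 5)  # 5-second batch interval
# Monitor the directory, not a single file.
lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
lines.pprint()
ssc.start()
ssc.awaitTermination()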

Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:
spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)
When I run the script, it raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
Then, I found out that I have to add file:// in the file path so it can read the file locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1):
java.io.FileNotFoundException: File
file:/home/hadoop/observations_temp.csv does not exist
I think this is because the file:// prefix just reads the file locally and does not distribute the file across the other nodes.
Do you know how I can read the csv file and make it available to all the other nodes?
You are right that your file is missing from your worker nodes, which is what raises the error you got.
Here is the official documentation Ref. External Datasets.
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
So basically you have two solutions:
You copy your file onto each worker before starting the job;
Or you upload it to HDFS with something like this (recommended solution):
hadoop fs -put localfile /user/hadoop/hadoopfile.csv
Now you can read it with:
df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
It seems that you are also using AWS S3. You can always try to read the file directly from S3 without downloading it (with the proper credentials, of course).
Some suggest that the --files flag of spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small, but then you wouldn't need Spark.
Alternatively, I would stick with HDFS (or any distributed file system).
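Going back to the direct-from-S3 option mentioned above, a hedged sketch (the bucket and prefix are placeholders; on EMR the s3:// scheme goes through EMRFS, which normally picks up the cluster's IAM role):
# Reuses the spark session from the question; replace bucket/prefix with your own.
df = spark.read.csv("s3://my-bucket/some/prefix/observations_temp.csv", header=True)
df.show(5)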
I think what you are missing is explicitly setting the master while initializing the SparkSession; try something like this:
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
and then read the file in the same way you have been doing:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv')
This should solve the problem...
Might be useful for someone running Zeppelin on a Mac using Docker.
Copy your files to a custom folder: /Users/my_user/zeppspark/myjson.txt
docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0
On Zeppelin you can run this to get your file:
%pyspark
json_data = sc.textFile('/zeppelin/notebook/myjson.txt')

How to avoid "Not a file" exceptions when reading from HDFS with spark

I copy a tree of files from S3 to HDFS with S3DistCp in an initial EMR step. hdfs dfs -ls -R hdfs:///data_dir shows the expected files, which look something like:
/data_dir/year=2015/
/data_dir/year=2015/month=01/
/data_dir/year=2015/month=01/day=01/
/data_dir/year=2015/month=01/day=01/data01.12345678
/data_dir/year=2015/month=01/day=01/data02.12345678
/data_dir/year=2015/month=01/day=01/data03.12345678
The 'directories' are listed as zero-byte files.
I then run a Spark step which needs to read these files. The loading code is:
sqlctx.read.json('hdfs:///data_dir', schema=schema)
The job fails with a Java exception:
java.io.IOException: Not a file: hdfs://10.159.123.38:9000/data_dir/year=2015
I had (perhaps naively) assumed that Spark would recursively descend the 'dir tree' and load the data files. If I point it at S3, it loads the data successfully.
Am I misunderstanding HDFS? Can I tell Spark to ignore zero-byte files? Can I use S3DistCp to flatten the tree?
In the Hadoop configuration of the current Spark context, enable "recursive" reads for the Hadoop InputFormat before getting the SQL context:
val hadoopConf = sparkCtx.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This resolves the "Not a file" error.
Next, to read multiple files, see:
Hadoop job taking input files from multiple directories
Or union the list of files into a single dataframe:
Read multiple files from a directory using Spark
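As a sketch of the multi-directory read with the layout shown in the question, Spark's readers also accept Hadoop glob patterns, so the leaf directories can be targeted directly (this reuses the sqlctx and schema from the question and assumes the same year=/month=/day= structure):
# Glob down to the day-level directories so only real data files are touched.
df = sqlctx.read.json("hdfs:///data_dir/year=*/month=*/day=*", schema=schema)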
Problem solved with:
spark-submit ...
--conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \
--conf spark.hive.mapred.supports.subdirectories=true \
...
In Spark version 2.1.0, the parameters must be set this way:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
