Hdfs file access in spark - apache-spark

I am developing an application , where I read a file from hadoop, process and store the data back to hadoop.
I am confused what should be the proper hdfs file path format. When reading a hdfs file from spark shell like
val file=sc.textFile("hdfs:///datastore/events.txt")
it works fine and I am able to read it.
But when I sumbit the jar to yarn which contains same set of code it is giving the error saying
org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt
When I add name node ip as hdfs://namenodeserver/datastore/events.txt everything works.
I am bit confused about the behaviour and need an guidance.
Note: I am using aws emr set up and all the configurations are default.

if you want to use sc.textFile("hdfs://...") you need to give the full path(absolute path), in your example that would be "nn1home:8020/.."
If you want to make it simple, then just use sc.textFile("hdfs:/input/war-and-peace.txt")
That's only one /
I think it will work.

Problem solved. As I debugged further fs.defaultFS property was not used from core-site.xml when I just pass path as hdfs:///path/to/file. But all the hadoop config properties are loaded (as I logged the sparkContext.hadoopConfiguration object.
As a work around I manually read the property as sparkContext.hadoopConfiguration().get("fs.defaultFS) and appended this in the path.
I don't know is it a correct way of doing it.

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in Client mode, but once I try to run it in the Cloudera Workbench with several executors, I get an ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help, the same error remains.
According to Spark Configuration doc, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand how these py-files to be added would have to look like.
Would it be enough to just create a pyton script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way how I can install flashtext on every node?
Edit:
In the meantime, I figured that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code from Flashtext and imported it to the main file without installing the actual library.
I think in order to point Spark executors to python modules extracted from your archive, you will need to add another config setting, that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
.config('spark.executorEnv.PYTHONPATH', './myUDFs')
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]...Add the environment
variable specified by EnvironmentVariableName to the Executor process.
The user can specify multiple of these to set multiple environment
variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench) but if you're trying to run Spark on YARN, you may need to use slightly different setting spark.yarn.dist.archives.
Also, please make sure that your driver log contains message confirming that an archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:

assertion failed: conflicting directory structures detected. Suspicious paths

I am trying to read data from aws s3 where I am having error.
s3 bucket and paths for example as below:
s3://USA/Texas/Austin/valid
s3://USA/Texas/Austin/invalid
s3://USA/Texas/Houston/valid
s3://USA/Texas/Houston/invalid
s3://USA/Texas/Dallas/valid
s3://USA/Texas/Dallas/invalid
s3://USA/Texas/San_Antonio/valid
s3://USA/Texas/San_Antonio/invalid
when I try to read as
spark.read.parquet("s3://USA/Texas/Austin/valid")
or
spark.read.parquet("s3://USA/Texas/Austin/invalid")
or
spark.read.parquet("s3://USA/Texas/Austin")
it works just fine.
but when I try to read as
spark.read.parquet("s3://USA/Texas/*")
or
spark.read.parquet("s3://USA/Texas")
it throws an exception.
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
as per suggestion I can read them individually but I have more then 500 files, to read them individually and union them will be hectic.
is there any other way to achieve this?
I am using HDFS with Parquet but I ran into the same issue. For me, setting the basePath to a path level above anything you will be accessing in that query works.
Also, I believe the '*' is unnecessary, though I'm not sure of the behavior of S3 on this one.
eg.
spark.read.option("basePath", "s3://USA/Texas/").parquet("s3://USA/Texas/")
Perhaps this is off-base for your S3 scenario but will hopefully help someone else with HDFS getting the same error.
If you can use Hive, then set two configurations
hive.input.dir.recursive=true
hive.mapred.supports.subdirectories=true
and create external table on the root path. Then, the table should read all the subdirectories data in the table but the schema should be the same or it will get an error.

Spark create a temp directory structure on each node

I am working on a spark java wrapper which uses third party libraries, which will read files from a hard coded directory name say "resdata" from where job executes. I know this is twisted but will try to explain.
when I execute the job it is trying to find the required files in the path something like this below,
/data/Hadoop/yarn/local//appcache/application_xxxxx_xxx/container_00_xxxxx_xxx/resdata
I am assuming it is looking for the files in the current data directory , under that looking for directory name "resdata". At this point I don't know how to configure the current directory to any path on hdfs or local.
So looking for options to create directory structure similar to what the third party libraries expecting and copying required files over there. This I need to do on each node. I am working on spark 2.2.0
Please help me in achieving this?
just now got the answer I need to put all the files under resdata directory and zip it say restdata.zip, pass the file using the options "--archives" . Then each node will have directory restdata.zip/restdata/file1 etc

Junk Spark output file on S3 with dollar signs

I have a simple spark job that reads a file from s3, takes five and writes back in s3.
What I see is that there is always additional file in s3, next to my output "directory", which is called output_$folder$.
What is it? How I can prevent spark from creating it?
Here is some code to show what I am doing...
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
five = x.take(5)
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")
After the job I have s3 "directory" called output which contains results and another s3 object called output_$folder$ which I don't know what it is.
Changing S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer getting created since I started using s3a://.
Ok, it seems I found out what it is.
It is some kind of marker file, probably used for determining if the S3 directory object exists or not.
How I reached this conclusion?
First, I found this link that shows the source of
org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir
method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
Then I googled other source repositories to see if I am going to find different version of the method. I didn't.
At the end, I did an experiment and rerun the same spark job after I removed the s3 output directory object but left output_$folder$ file. Job failed saying that output directory already exists.
My conclusion, this is hadoop's way to know if there is a directory in s3 with given name and I will have to live with that.
All the above happens when I run the job from my local, dev machine - i.e. laptop. If I run the same job from a aws data pipeline, output_$folder$ does not get created.
s3n:// and s3a:// doesn't generate marker directory like <output>_$folder$
If you are using hadoop with AWS EMR., I found moving from s3 to s3n is straight forward since they both use same file system implementation, whereas s3a involves AWS credential related code change.
('fs.s3.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3n.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

Path for persistent, managed hive tables in Spark 1.4

I am new to Spark and working on JavaSqlNetworkWordCount example to append the word count in a persistent table. I understand that I can only do it via HiveContext. HiveContext, however, keeps trying to save the table in /user/hive/warehouse/. I have tried changing the path by adding
hiveContext.setConf("hive.metastore.warehouse.dir", "/home/user_name");
and by adding the property
<property><name>hive.metastore.warehouse.dir</name>
<value>/home/user_name</value></property>
$SPARK_HOME/conf/hive-site.xml, but nothing seems to work. If anyone else has faced this problem, please let me know if/how you resolved it. I am using Spark1.4 on my local RHEL5 machine.
I think I solved the problem. It looks like spark-submit was creating a metastore_db directory in root directory of the jar file. If metastore_db exists, then hive-stie.xml values are ignored. As soon as I removed that directory, code picked up values from hive-site.xml. I still cannot set the value of the hive.metastore.warehouse.dir property from the code, though.

Resources