pySpark local mode - loading text file with file:/// vs relative path - apache-spark

I am just getting started with Spark and I am trying out examples in local mode...
I noticed that in some examples, when creating the RDD, a relative path to the file is used, and in others the path starts with "file:///". The second option did not work for me at all: "Input path does not exist".
Can anyone explain the difference between using the plain file path and putting "file:///" in front of it?
I am using Spark 2.2 on a Mac, running in local mode.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf = conf)
#This will work providing the relative path
lines = sc.textFile("code/test.csv")
#This will not work
lines = sc.textFile("file:///code/test.csv")

sc.textFile("code/test.csv") means test.csv in /<hive.metastore.warehouse.dir>/code/test.csv on HDFS.
sc.textFile("hdfs:///<hive.metastore.warehouse.dir>/code/test.csv") is equal to above.
sc.textFile("file:///code/test.csv") means test.csv in /code/test.csv on local file system.

Related

How to use wildcard in hdfs file path while list out files in nested folder

I'm using the code below to list out files in nested folders:
val hdfspath="/data/retail/apps/*/landing"
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"$hdfspath")).filter(_.isDirectory).map(_.getPath).foreach(println)
If I use the path as below:
val hdfspath = "/data/retail/apps"
I get results, but with val hdfspath = "/data/retail/apps/*/landing" I get a "path does not exist" error. Please help me out.
According to this answer, you need to use globStatus instead of listStatus:
fs.globStatus(new Path(s"$hdfspath")).filter(_.isDirectory).map(_.getPath).foreach(println)
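A rough PySpark equivalent of the same fix, going through the JVM Hadoop API via py4j (assumes an existing SparkContext sc; globStatus may return null when nothing matches, hence the guard):
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
hdfspath = "/data/retail/apps/*/landing"
# globStatus expands the wildcard; listStatus treats it as a literal directory name.
statuses = fs.globStatus(jvm.org.apache.hadoop.fs.Path(hdfspath)) or []
for status in statuses:
    if status.isDirectory():
        print(status.getPath())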

How to handle reading non-existing file in spark

I am trying to read some files from HDFS using spark sc.wholeTextFiles. I pass a list of the required files, yet the job keeps throwing
py4j.protocol.Py4JJavaError: An error occurred while calling o98.showString.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist:
if one of the files does not exist.
How can I skip the missing files and read only the ones that exist?
To know if a file exists (and delete it, in my case) I do the following:
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
if (fs.exists(new Path(fullPath))) {
  println("Output directory already exists. Deleting it...")
  fs.delete(new Path(fullPath), true)
}
Use the JVM Hadoop FileSystem from the Spark context to check whether a file exists:
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/file.csv"))
For example:
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("test.csv"))
True
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("fail_test.csv"))
False
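To actually skip the missing inputs (the original question), the same check can filter a path list before the read; a rough sketch, with made-up example paths:
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
candidates = ["/data/part1.txt", "/data/part2.txt", "/data/missing.txt"]  # example paths
existing = [p for p in candidates if fs.exists(jvm.org.apache.hadoop.fs.Path(p))]
# wholeTextFiles accepts a comma-separated list of paths
rdd = sc.wholeTextFiles(",".join(existing))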

MLlib not saving the model data in Spark 2.1

We have a machine learning model that looks roughly like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import StringIndexer

sc = SparkContext(appName = "MLModel")
sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(data_res_promo)
# where data_res_promo comes from a pandas DataFrame
indexer = StringIndexer(inputCol="Fecha_Code", outputCol="Fecha_Index")
train_indexer = indexer.fit(df)
train_indexer.save('ALSIndexer')  # this saves the indexer architecture
On my machine, when I run it locally, it generates a folder ALSIndexer/ that contains the parquet files and all the information about the model.
When I run it on our Azure Spark cluster, it does not generate the folder on the main node (nor on the slaves). However, if we try to rewrite it, it says:
cannot overwrite folder
which means it is somewhere, but we can't find it.
Would you have any pointers?
Spark will by default save files to the distributed filesystem (probably HDFS). The files will therefore not be visible on the nodes themselves, but, since they are present there, you get the "cannot overwrite folder" error message.
You can easily access the files through HDFS and copy them to the main node. This can be done on the command line with either of these commands:
1. hadoop fs -get <HDFS file path> <local system directory path>
2. hadoop fs -copyToLocal <HDFS file path> <local system directory path>
It can also be done by importing org.apache.hadoop.fs.FileSystem and using the methods available there.
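For example, a rough PySpark sketch of that FileSystem route via py4j (the /tmp destination is just an example path):
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
src = jvm.org.apache.hadoop.fs.Path("ALSIndexer")       # the folder saved on HDFS
dst = jvm.org.apache.hadoop.fs.Path("/tmp/ALSIndexer")  # local destination (example)
fs.copyToLocalFile(src, dst)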

Pyspark running external program using subprocess can't read files from hdfs

I'm trying to run an external program (such as bwa) within PySpark. My code looks like this:
import sys
import subprocess
from pyspark import SparkContext

def bwaRun(args):
    a = ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', ref, args]
    result = subprocess.check_output(a)
    return result

sc = SparkContext(appName = 'sub')
ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'
input = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'
chunk_name = []
chunk_name.append(input)
data = sc.parallelize(chunk_name, 1)
print(data.map(bwaRun).collect())
I'm running Spark on a YARN cluster with 6 slave nodes, and each node has the bwa program installed. When I run the code, the bwaRun function can't read the input files from HDFS. It's kind of obvious why this doesn't work: when I tried to run the bwa program locally with
bwa mem hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq
on the shell, it didn't work because bwa can't read files from HDFS.
Can anyone give me an idea of how I could solve this?
Thanks in advance!
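One possible direction, sketched here only as an illustration (the /tmp target and the local reference path are assumptions, not from the question): pull the HDFS inputs onto the worker's local disk before invoking bwa, since an external binary cannot open hdfs:// URIs itself.
import subprocess

def bwa_run_local(hdfs_fastq):
    # Sketch only: copy the HDFS input to the worker's local disk first,
    # e.g. via the hdfs CLI, then point bwa at the local copies.
    local_fastq = "/tmp/" + hdfs_fastq.split("/")[-1]  # assumed local target
    subprocess.check_call(["hdfs", "dfs", "-get", hdfs_fastq, local_fastq])
    # Assumes the reference was distributed to every worker beforehand.
    local_ref = "/tmp/human_g1k_v37_chr13_26577411_30674729.fasta"
    return subprocess.check_output(
        ["/home/hd_spark/tool/bwa-0.7.13/bwa", "mem", local_ref, local_fastq])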

spark standalone without hdfs

I've been trying a simple wordcount app on Spark standalone.
I have 1 Windows machine and 1 Linux machine;
Windows runs the master and a slave,
Linux runs a slave.
The connection was fast and simple.
I try to avoid using HDFS, but I do want to work on a cluster.
My code so far is:
String fileName = "full path at client";
File file = new File(fileName);
String uri = file.toURI().toString();
SparkConf conf = new SparkConf().setAppName("stam").setMaster("spark://192.168.15.17:7077").setJars(new String[] { ..,.. });
sc = new JavaSparkContext(conf);
sc.addFile(uri);
JavaRDD<String> textFile = sc.textFile(SparkFiles.get(getOnlyFileName(fileName))).cache();
This fails with
Input path does not exist:........
or
java.net.URISyntaxException: Relative path in absolute URI
depending on what I try; the error comes from the Linux slave.
Any idea if this is possible? The file is being copied to all the slaves' work directories.
Please help.
This cannot be done.
I've moved from standalone to YARN.
