Spark standalone without HDFS - Linux

I've been trying a simple word-count app on Spark standalone.
I have 1 Windows machine and 1 Linux machine;
Windows runs the master and a slave,
Linux runs a slave.
The connection was fast and simple.
I want to avoid using HDFS, but I do want to work on a cluster.
My code so far is:
String fileName = "full path at client";
File file = new File(fileName);
String uri = file.toURI().toString();

SparkConf conf = new SparkConf()
        .setAppName("stam")
        .setMaster("spark://192.168.15.17:7077")
        .setJars(new String[] { ..,.. });
sc = new JavaSparkContext(conf);
sc.addFile(uri); // copies the file into every worker's work directory

JavaRDD<String> textFile = sc.textFile(SparkFiles.get(getOnlyFileName(fileName))).cache();
This fails with
Input path does not exist:........
or
java.net.URISyntaxException: Relative path in absolute URI
depending on what I try; the error comes from the Linux slave.
Any idea if this is possible?
The file is being copied to every slave's work directory.
Please help.

This cannot be done.
I've moved from standalone to YARN.

Related

EMR 6.3 Spark 3.1.1 resource file MalformedInputException

I have a Maven project with a categories.txt file in src/main/resources.
I have a simple job:
package com.test.utilityjobs

import scala.io.Source

object CategoriesLoadingTestJob {
  def main(args: Array[String]): Unit = {
    val categoryListSource = Source.fromInputStream(getClass.getResourceAsStream("/categories.txt"))
    categoryListSource.getLines().toList.foreach(println)
  }
}
This works fine when launched on my local machine or on EMR 5.x.
However, on EMR 6.3, whenever I launch this simple job, I get this error:
java.nio.charset.MalformedInputException: Input length = 1
I've also tried
val categoryListSource: BufferedSource = Source.fromResource("cat2.txt")
but this gives me the same error.
I've checked the file encoding: it is UTF-8. The compiler encoding is UTF-8. I've tried with other files and everything works fine.
It is possible to specify the encoding when reading a file in Scala, so I tried
val categoryListSource = Source.fromInputStream(getClass.getResourceAsStream("/categories.txt"))("UTF-8")
and it worked. It still complains about files with a BOM, so I removed the BOM from the existing file.

Update an ini file with conf file data using a shell script

I have the two files below, a.conf and b.ini.
File a.conf contains driver paths, which need to be updated in b.ini against the matching driver.
a.conf
#find driver directory and replace that
oracle=/client64/lib #bla bla
db2=/opt/db2/lib
#dvs=/opt/dvs/lib
b.ini
[SQLSERVER]
Driver = /opt/local/lib/libtdsodbc.so
HOST = 192.168.220.156
PORT = 1433
TDS_VERSION = 8.0
[ORACLE]
Driver=/usr/lib/oracle/19.5/client64/lib/libsqora.so.19.1
HOST = 192.168.220.182
PORT = 1521
I have to write a shell script that reads all the values that are not commented out in a.conf and updates the Driver path in b.ini accordingly.
I am new to shell scripting; any kind of help would be appreciated.
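One way to approach it, sketched in Python rather than pure shell, and assuming that every uncommented key in a.conf (oracle, db2, ...) names a [SECTION] in b.ini whose Driver line should be replaced with that key's value:
import re

def load_conf(path):
    # collect key=value pairs, ignoring blank lines and anything after '#'
    drivers = {}
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            if "=" in line:
                key, value = line.split("=", 1)
                drivers[key.strip().upper()] = value.strip()
    return drivers

def update_ini(ini_path, drivers):
    # rewrite the Driver line of every [SECTION] that has an entry in drivers
    section = None
    out = []
    with open(ini_path) as fh:
        for line in fh:
            header = re.match(r"\[(.+)\]", line.strip())
            if header:
                section = header.group(1).upper()
            elif section in drivers and re.match(r"\s*Driver\s*=", line, re.I):
                line = "Driver = %s\n" % drivers[section]
            out.append(line)
    with open(ini_path, "w") as fh:
        fh.writelines(out)

update_ini("b.ini", load_conf("a.conf"))
The same line-by-line idea (strip comments, split on =, rewrite the Driver line of the matching section) translates directly to awk or sed if a pure shell solution is required.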

pySpark local mode - loading text file with file:/// vs relative path

I am just getting started with Spark and I am trying out examples in local mode...
I noticed that in some examples the relative path to the file is used when creating the RDD, and in others the path starts with "file:///". The second option did not work for me at all: "Input path does not exist".
Can anyone explain the difference between using the plain file path and putting "file:///" in front of it?
I am using Spark 2.2 on a Mac, running in local mode.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf=conf)

# This works when given the relative path
lines = sc.textFile("code/test.csv")

# This does not work
lines = sc.textFile("file:///code/test.csv")
sc.textFile("code/test.csv") means test.csv at /<hive.metastore.warehouse.dir>/code/test.csv on HDFS.
sc.textFile("hdfs:///<hive.metastore.warehouse.dir>/code/test.csv") is equivalent to the above.
sc.textFile("file:///code/test.csv") means test.csv at /code/test.csv on the local file system.
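In other words, "file:///" is always followed by an absolute path starting at the root of the local filesystem, so to read the same file that the relative form finds, the full path has to be spelled out after the scheme. A minimal sketch (the resolved absolute path is only illustrative):
import os
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("path-demo")
sc = SparkContext(conf=conf)

# resolve the relative path against the current working directory,
# e.g. /Users/me/project/code/test.csv
abs_path = os.path.abspath("code/test.csv")

# becomes file:///Users/me/project/code/test.csv
lines = sc.textFile("file://" + abs_path)
print(lines.count())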

MLlib not saving the model data in Spark 2.1

We have a machine learning model that looks roughly like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import StringIndexer

sc = SparkContext(appName="MLModel")
sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(data_res_promo)  # data_res_promo comes from a pandas dataframe

indexer = StringIndexer(inputCol="Fecha_Code", outputCol="Fecha_Index")
train_indexer = indexer.fit(df)
train_indexer.save('ALSIndexer')  # this saves the indexer architecture
On my machine, when I run it locally, it generates a folder ALSIndexer/ that contains the parquet files and all the information on the model.
When I run it on our Azure Spark cluster, it does not generate the folder on the main node (nor on the slaves). However, if we try to rewrite it, it says:
cannot overwrite folder
Which means it is somewhere, but we can't find it.
Would you have any pointers?
Spark will by default save files to the distributed filesystem (probably HDFS). The files will therefore not be visible on the nodes themselves, but, since they are present, you get the "cannot overwrite folder" error message.
You can easily access the files through HDFS and copy them to the main node. This can be done on the command line with one of these commands:
1. hadoop fs -get <HDFS file path> <Local system directory path>
2. hadoop fs -copyToLocal <HDFS file path> <Local system directory path>
It can also be done by importing org.apache.hadoop.fs.FileSystem and using the methods available there.
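Since the job above is PySpark, the same copy can also be scripted from the main node by shelling out to the first command; the /tmp target directory is only an example:
import subprocess

# pull the saved ALSIndexer folder out of HDFS onto the main node's local disk
subprocess.check_call(["hadoop", "fs", "-get", "ALSIndexer", "/tmp/ALSIndexer"])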

Pyspark running external program using subprocess can't read files from hdfs

I'm trying to run an external program (such as bwa) within PySpark. My code looks like this:
import sys
import subprocess
from pyspark import SparkContext

def bwaRun(args):
    a = ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', ref, args]
    result = subprocess.check_output(a)
    return result

sc = SparkContext(appName='sub')

ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'
input = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'

chunk_name = []
chunk_name.append(input)
data = sc.parallelize(chunk_name, 1)
print data.map(bwaRun).collect()
I'm running Spark on a YARN cluster with 6 slave nodes, and each node has the bwa program installed. When I run the code, the bwaRun function can't read the input files from HDFS. It's kind of obvious that this doesn't work, because when I tried to run the bwa program locally with
bwa mem hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq
on the shell, it didn't work either, because bwa can't read files from HDFS.
Can anyone give me an idea of how I could solve this?
Thanks in advance!
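One common workaround (a sketch only, assuming the hdfs command-line client is on every worker's PATH and /tmp is writable) is to copy the HDFS inputs onto the worker's local disk inside the task and hand bwa the local copies, since bwa itself only understands local paths:
import os
import subprocess
from pyspark import SparkContext

ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'

def bwaRun(hdfs_fastq):
    local_dir = '/tmp/bwa_input'
    if not os.path.isdir(local_dir):
        os.makedirs(local_dir)
    local_ref = os.path.join(local_dir, os.path.basename(ref))
    local_fastq = os.path.join(local_dir, os.path.basename(hdfs_fastq))
    # pull both inputs out of HDFS onto this worker's local disk first;
    # note that bwa mem also expects the index files (.amb, .ann, .bwt, .pac, .sa)
    # next to the reference, so those would have to be fetched the same way
    for src, dst in [(ref, local_ref), (hdfs_fastq, local_fastq)]:
        if not os.path.exists(dst):
            subprocess.check_call(['hdfs', 'dfs', '-get', src, dst])
    # bwa is then run exactly as before, but against the local copies
    return subprocess.check_output(
        ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', local_ref, local_fastq])

sc = SparkContext(appName='sub')
data = sc.parallelize(['hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'], 1)
print(data.map(bwaRun).collect())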
