Spark's saveAsTextFile method behaves strangely in the Java API: it does not work correctly in my program - apache-spark

I am new to Spark and ran into this problem when running my test program. I installed Spark on a Linux server with just one master node and one worker node. Then I wrote a test program on my laptop, with code like this:
`JavaSparkContext ct = new JavaSparkContext("spark://192.168.90.74:7077", "test", "/home/webuser/spark/spark-1.5.2-bin-hadoop2.4", new String[0]);
ct.addJar("/home/webuser/java.spark.test-0.0.1-SNAPSHOT-jar-with-dependencies.jar");
// Typed collections so the RDD element type matches the data (integers, not strings)
List<Integer> list = new ArrayList<>();
list.add(1);
list.add(6);
list.add(9);
JavaRDD<Integer> rdd = ct.parallelize(list);
System.out.println(rdd.collect());
rdd.saveAsTextFile("/home/webuser/temp");
ct.close();`
I expected to get /home/webuser/temp on my server, but in fact the program creates c://home/webuser/temp on my laptop, which runs Windows 8. I don't understand:
shouldn't saveAsTextFile() run on Spark's worker node? Why does it run on my laptop, which I suppose is Spark's driver?

It depends on which filesystem is the default for your Spark installation. From what you describe, your default filesystem is file:///, which is the out-of-the-box default. To change this, modify the fs.defaultFS property in the core-site.xml of your Hadoop configuration. Alternatively, you can specify the filesystem URL directly in your code, i.e.:
rdd.saveAsTextFile("hdfs://192.168.90.74/home/webuser/temp");
if 192.168.90.74 is your Namenode.
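To check which default filesystem a given Hadoop configuration declares, one option is to read core-site.xml directly. The following is only a minimal sketch in plain Python, assuming the usual configuration/property/name/value layout of that file. The function name read_default_fs, the sample XML and port 8020 are illustrative assumptions, not taken from the question.

```python
import xml.etree.ElementTree as ET


def read_default_fs(core_site_xml: str) -> str:
    """Return the fs.defaultFS value from core-site.xml text.

    Falls back to "file:///" (the local filesystem) when the property
    is not declared in the file.
    """
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "fs.defaultFS":
            return prop.findtext("value", "").strip()
    return "file:///"


# Hypothetical file content; the port is an assumption for illustration.
SAMPLE_CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.90.74:8020</value>
  </property>
</configuration>
"""

configured = read_default_fs(SAMPLE_CORE_SITE)
unconfigured = read_default_fs("<configuration/>")
print(configured)
print(unconfigured)
```

If the second case applies to the configuration your driver actually loads, a path without a scheme such as /home/webuser/temp is resolved against a local filesystem rather than HDFS.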

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help, the same error remains.
According to the Spark Configuration doc, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what the py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. To fix it, I followed this article. I also took the source code from Flashtext and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
.config('spark.executorEnv.PYTHONPATH', './myUDFs')
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]... Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:
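On the question of what a file passed via spark.submit.pyFiles would have to look like: that setting takes .zip, .egg or .py files that are placed on the PYTHONPATH, so a zip archive containing a pure-Python package is one workable shape. The sketch below runs without Spark and only demonstrates the underlying mechanism, namely that Python can import a package straight from a zip file once the archive is on sys.path. The package name my_udfs and its shout function are hypothetical.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def build_module_zip(target_dir: Path) -> Path:
    """Pack a tiny pure-Python package into a zip archive and return its path."""
    zip_path = target_dir / "my_udfs.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        # The package directory sits at the top level of the archive.
        zf.writestr("my_udfs/__init__.py", "")
        zf.writestr(
            "my_udfs/text_utils.py",
            "def shout(s):\n    return s.upper() + '!'\n",
        )
    return zip_path


workdir = Path(tempfile.mkdtemp())
archive = build_module_zip(workdir)

# Putting the archive itself on the import path makes its packages importable.
sys.path.insert(0, str(archive))
text_utils = importlib.import_module("my_udfs.text_utils")
print(text_utils.shout("hello"))  # prints HELLO!
```

A script that merely contains "from flashtext import KeywordProcessor" would not be enough, because the import would still fail on an executor where flashtext itself is missing. The archive has to contain the module's source.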

Hdfs file access in spark

I am developing an application in which I read a file from Hadoop, process it, and store the data back to Hadoop.
I am confused about the proper HDFS file path format. When reading an HDFS file from the Spark shell like
val file=sc.textFile("hdfs:///datastore/events.txt")
it works fine and I am able to read it.
But when I submit the jar containing the same code to YARN, it fails with the error
org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt
When I add the NameNode address, as in hdfs://namenodeserver/datastore/events.txt, everything works.
I am a bit confused about this behaviour and need some guidance.
Note: I am using an AWS EMR setup and all the configurations are default.
If you want to use sc.textFile("hdfs://..."), you need to give the full (absolute) path including the NameNode authority; in your example that would be "nn1home:8020/..".
If you want to keep it simple, just use sc.textFile("hdfs:/input/war-and-peace.txt").
Note that there is only one / after hdfs:.
I think it will work.
Problem solved. As I debugged further, I found that the fs.defaultFS property from core-site.xml was not used when I just pass the path as hdfs:///path/to/file, even though all the Hadoop config properties are loaded (I verified this by logging the sparkContext.hadoopConfiguration object).
As a workaround, I manually read the property with sparkContext.hadoopConfiguration().get("fs.defaultFS") and prepended it to the path.
I don't know whether this is the correct way of doing it.
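The workaround described above (reading fs.defaultFS and combining it with the path) can be sketched as plain string handling, independent of Spark. This is only an illustration under assumptions: the helper name qualify_path and the NameNode address namenodeserver:8020 are hypothetical, and in a real job the default would come from sparkContext.hadoopConfiguration. Only absolute paths are handled.

```python
from urllib.parse import urlparse


def qualify_path(path: str, default_fs: str) -> str:
    """Fill in the scheme and authority of default_fs when path lacks an authority."""
    parsed = urlparse(path)
    default = urlparse(default_fs)
    if parsed.netloc:
        return path  # already has an authority, e.g. hdfs://host/...
    if parsed.scheme and parsed.scheme != default.scheme:
        return path  # a different filesystem (e.g. file:), leave untouched
    if not parsed.path.startswith("/"):
        raise ValueError("only absolute paths are handled in this sketch")
    return f"{default.scheme}://{default.netloc}{parsed.path}"


# Hypothetical default filesystem value for illustration.
DEFAULT_FS = "hdfs://namenodeserver:8020"

print(qualify_path("hdfs:///datastore/events.txt", DEFAULT_FS))
print(qualify_path("/datastore/events.txt", DEFAULT_FS))
print(qualify_path("hdfs://othernode/datastore/events.txt", DEFAULT_FS))
```

The first two calls produce the same fully qualified URI, while the third is returned unchanged because it already names a NameNode.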

Metrics System not recognizing Custom Source/Sink in application jar

Followup from here.
I've added a Custom Source and Sink in my application jar and found a way to get a static, fixed metrics.properties onto the stand-alone cluster nodes. When I launch my application, I give the static path: spark.metrics.conf="/fixed-path/to/metrics.properties". Despite my custom source/sink being in my code/fat-jar, I get a ClassNotFoundException on CustomSink.
My fat-jar (with Custom Source/Sink code in it) is on hdfs with read access to all.
Here is everything I have already tried setting (since executors can't find the Custom Source/Sink in my application fat-jar):
spark.executor.extraClassPath = hdfs://path/to/fat-jar
spark.executor.extraClassPath = fat-jar-name.jar
spark.executor.extraClassPath = ./fat-jar-name.jar
spark.executor.extraClassPath = ./
spark.executor.extraClassPath = /dir/on/cluster/* (although * is not at file level, there are more directories - I have no way of knowing random application-id or driver-id to give absolute name before launching the app)
It seems like this is how executors are getting initialized for this case (please correct me if I am wrong) -
1. The driver announces the jar location (hdfs://../fat-jar.jar) and some properties such as spark.executor.memory.
2. N executors spin up on the cluster (depending on configuration).
3. They start downloading hdfs://../fat-jar.jar but initialize the metrics system in the meantime (? - not sure of this step).
4. The metrics system looks for the Custom Sink/Source classes, since they are mentioned in metrics.properties, even before the fat-jar (which actually contains them) has finished downloading (this is my hypothesis).
5. ClassNotFoundException - CustomSink not found!
Is my understanding correct? Moreover, is there anything else I can try? If anyone has experience with custom source/sinks, any help would be appreciated.
I stumbled upon the same ClassNotFoundException when I needed to extend the existing GraphiteSink class, and here's how I was able to solve it.
First, I created a CustomGraphiteSink class in org.apache.spark.metrics.sink package:
package org.apache.spark.metrics.sink;
public class CustomGraphiteSink extends GraphiteSink {}
Then I specified the class in metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.CustomGraphiteSink
And passed this file to spark-submit via:
--conf spark.metrics.conf=metrics.properties
In order to use a custom source/sink, one has to distribute it using spark-submit --files and add it to the executor classpath via spark.executor.extraClassPath.

How to check Spark configuration from command line?

Basically, I want to check a property of Spark's configuration, such as "spark.local.dir" through command line, that is, without writing a program. Is there a method to do this?
There is no option for viewing the Spark configuration properties from the command line.
Instead, you can check them in the spark-defaults.conf file. Another option is to view them in the web UI.
The application web UI at http://driverIP:4040 lists Spark properties in the “Environment” tab. Only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
For more details, you can refer to the Spark Configuration documentation.
The following command prints your configuration properties to the console:
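Because spark-defaults.conf is a plain-text file of key/value lines, looking up a single property can be scripted and then run from a shell. The parser below is a minimal sketch, not Spark's own loader: it assumes one property per line, with the key separated from the value by whitespace or "=", and it ignores escapes and line continuations. The function name and the sample content are hypothetical.

```python
import re

# key, then optional whitespace and/or one "=" or ":", then the value
_LINE = re.compile(r"^([^\s=:]+)[ \t]*[=:]?[ \t]*(.*)$")


def parse_spark_defaults(text: str) -> dict:
    """Parse the text of a spark-defaults.conf style file into a dict."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props


# Hypothetical file content for illustration.
SAMPLE = """\
# Example settings
spark.local.dir            /tmp/spark-scratch
spark.executor.memory=4g
spark.driver.extraJavaOptions -Dlog.level=INFO
"""

props = parse_spark_defaults(SAMPLE)
print(props.get("spark.local.dir", "<not set in file>"))
print(props.get("spark.sql.shuffle.partitions", "<not set in file>"))
```

This only reports what the file declares. Values set through SparkConf or on the spark-submit command line can override it, and properties missing from the file fall back to Spark's built-in defaults.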
sc.getConf.toDebugString
We can check a single property in the Spark shell using the command below:
scala> spark.conf.get("spark.sql.shuffle.partitions")
res33: String = 200
Based on http://spark.apache.org/docs/latest/configuration.html, Spark provides three locations to configure the system:
Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties.
Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
Logging can be configured through log4j.properties.
I haven't heard of a method for doing this through the command line.
In the PySpark shell, the following command lists the Spark configuration:
sc._conf.getAll()

Running spark code locally on eclipse with spark installed on remote server

I have configured Eclipse for Scala, created a Maven project, and written a simple word count Spark job on Windows. My Spark and Hadoop installation is on a Linux server. How can I launch my Spark code from Eclipse on the Spark cluster (which is on Linux)?
Any suggestions?
Actually, the answer is not as simple as you would expect.
I will make several assumptions: first, that you use sbt; second, that you are working on a Linux-based computer; third, that you have two classes in your project, say RunMe and Globals; and last, that you want to configure the settings inside the program. Thus, somewhere in your runnable code you must have something like this:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RunMe {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("mesos://master:5050") // If you use Mesos, and if your network resolves the hostname master to its IP.
      .setAppName("my-app")
      .set("spark.executor.memory", "10g")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc) // SQLContext is built from the SparkContext
    // your code comes here
  }
}
The steps you must follow are:
Compile the project from its root directory by using:
$ sbt assembly
Send the job to the master node. This is the interesting part (assuming your project contains the directory target/scala/, which holds the .jar file of the compiled project):
$ spark-submit --class RunMe target/scala/app.jar
Notice that, because I assumed the project has two or more classes, you have to specify which class you want to run. Furthermore, I bet that the approaches for YARN and Mesos are very similar.
If you are developing a project on Windows and want to deploy it in a Linux environment, you would create an executable JAR file, copy it to your home directory on the Linux machine, and reference it in your Spark script (in your terminal). This is possible thanks to the portability of the Java Virtual Machine. Let me know if you need more help.
To achieve what you want, you would need:
First: Build the jar (if you use gradle -> fatJar or shadowJar)
Second: In your code, when you create the SparkConf, you need to specify the master address, spark.driver.host and the relative jar location, something like:
SparkConf conf = new SparkConf()
.setMaster("spark://SPARK-MASTER-ADDRESS:7077")
.set("spark.driver.host", "IP address of your local machine")
.setJars(new String[]{"path\\to\\your\\jar file.jar"})
.setAppName("APP-NAME");
And third: Just right-click and run from your IDE. That's it!
What you are looking for is the master where the SparkContext should be created.
You need to set your master to be the cluster you want to use.
I invite you to read the Spark Programming Guide or follow an introductory course to understand these basic concepts. Spark is not a tool you can start working with overnight; it takes some time.
http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark
