I am a beginner with Spark. I wrote the code below and ran it on multiple nodes.
I have one master node and four worker nodes. I ran my code several times and, to my surprise, sometimes only some of the worker nodes did any work, while other times all of them worked because they were assigned the data that the master specified.
I didn't set up any detailed configuration, so this behavior looks weird to me.
I want all my worker nodes to process at the same time so that I get better and faster results. How can I achieve this?
I attached my code and commands below. They are very straightforward, so I skipped a detailed explanation. Thanks.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
/**
* Created by dst on 2/1/17.
*/
public class Test {
    public static void main(String[] args) throws Exception {
        String inputFile = args[0];
        String outputFile = args[1];

        SparkConf conf = new SparkConf().setAppName("Data Transformation")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(inputFile);
        JavaRDD<String> newLine = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                List<String> ret = new ArrayList<String>();
                List<String> ls = Arrays.asList(s.split("\t"));

                String values = ls.get(ls.size() - 1);
                List<String> value = Arrays.asList(values.split("\\|"));

                for (int i = 0; i < value.size(); ++i) {
                    String ns = ls.get(0) + "\t" + ls.get(1) + "\t" + ls.get(2) + "\t" + ls.get(3) + "\t" + ls.get(4) + "\t" + ls.get(5);
                    ns = ns + "\t" + value.get(i);
                    ret.add(ns);
                }
                return ret.iterator();
            }
        });

        newLine.saveAsTextFile(outputFile);
    }
}
Spark-submit.
spark-submit \
--class Test \
--master spark://spark.dso.xxxx \
--executor-memory 10G \
/home/jumbo/user/sclee/dt/jars/dt_01_notcache-1.0-SNAPSHOT.jar \
/user/sclee/data/ /user/sclee/output
Referring to the documentation, try setting spark.deploy.spreadOut = false and check whether the behavior remains the same after this setting.
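For what it's worth, spark.deploy.spreadOut is a standalone-master property (its default is true, which spreads an application's executors across the worker nodes; false consolidates them onto as few nodes as possible), so as far as I know it has to be set where the master runs, for example via SPARK_MASTER_OPTS in conf/spark-env.sh before restarting the master. A sketch:
# conf/spark-env.sh on the node running the standalone master (restart the master afterwards)
SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"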
Related
I have a multilabel text classification problem that I tried to resolve using the binary relevance method, by creating one binary classifier per label.
After the training phase, I have to load 10000 classifier models to run the classification phase on all my documents using Spark.
But for some unknown reason, it becomes very slow when I try to load more than 1000 models: Spark creates a new thread each time, which progressively slows down the process, and I don't know why.
Here is the minimal code that illustrates my problem.
package entrepot.spark;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.sql.SparkSession;
public class maintest {
    public static void main(String[] args) throws FileNotFoundException, IllegalArgumentException, IOException {
        try (SparkSession spark = SparkSession.builder().appName("test").getOrCreate()) {
            // Listing directories to get the list of labels
            Set<String> labels = new HashSet<>();
            FileStatus[] filesstatus = FileSystem.get(spark.sparkContext().hadoopConfiguration()).listStatus(new Path("C:\\Users\\*\\Desktop\\model\\"));
            for (int i = 0; i < filesstatus.length; i++) {
                if (filesstatus[i].isDirectory()) {
                    labels.add(filesstatus[i].getPath().getName());
                }
            }

            List<MultilayerPerceptronClassificationModel> models = new ArrayList<>();
            // Here is the problem
            for (String label : labels) {
                System.out.println(label);
                MultilayerPerceptronClassificationModel model = MultilayerPerceptronClassificationModel.load("C:\\Users\\*\\Desktop\\model\\" + label + "\\CL\\");
                models.add(model);
            }
            System.out.println("done");
        }
    }
}
I'm running the program on Windows, with Spark 2.1.1 and Hadoop 2.7.3, using the following command line:
.\bin\spark-submit ^
--class entrepot.spark.maintest ^
--master local[*] ^
/C:/Users/*/eclipse-workspace/spark/target/spark-0.0.1-SNAPSHOT.jar
To download a small repetitive sample of one of my label models, here is the link: we.tl/T50s9UffYV (why can't I post a simple link??)
PS: Even though the models are serializable, I couldn't save and load everything at once using a Java collection and an object stream, because I get a Scala conversion error. Instead, I'm using the save/load static methods from MLlib on each model, resulting in hundreds of thousands of files.
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:92)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:253)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:330)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:304)
at sparkExample.spExample.ClusteringDSPOC.main(ClusteringDSPOC.java:45)
My code is
package sparkExample.spExample;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class ClusteringDSPOC {
    private static final Pattern SPACE = Pattern.compile(" ");
    private static final SparkContext sc = new SparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
    private static final String POSTGRESQL_DRIVER = "org.postgresql.Driver";
    private static final String POSTGRESQL_USERNAME = "xyz";
    private static final String POSTGRESQL_PWD = "xyz";
    private static final String POSTGRESQL_CONNECTION_URL = "jdbc:postgresql://192.168.111.130:5432/xyzdb?user=" + POSTGRESQL_USERNAME + "&password=" + POSTGRESQL_PWD;
    private static final String POSTGRESQL_TABLE = "(select id, duration from abc where duration is not null ) as abc";

    public static void main(String[] args) throws Exception {
        // Datasource options
        SparkSession spark = SparkSession.builder().appName("JavaKMeansExample").getOrCreate();
        Class.forName(POSTGRESQL_DRIVER);
        Properties options = new Properties();

        Dataset<Row> sdrDS = spark.read().format("libsvm").jdbc(POSTGRESQL_CONNECTION_URL, POSTGRESQL_TABLE, options);
        Dataset<Row> durationDS = sdrDS.select("duration");

        KMeans kmeans = new KMeans().setK(2).setSeed(1L);
        KMeansModel model = kmeans.fit(durationDS);
    }
}
I am following this
https://spark.apache.org/docs/latest/ml-clustering.html.
I get this error when the fit method is called. Please help me fix this, or suggest an alternative way to do it. Thanks.
Here I am trying to divide duration into 2 to 3 clusters and then map each cluster to an id. I am able to do the same thing using the Spark MLlib library in this way:
package sparkExample.spExample;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
public class ClusteringPOC1 {
    private static final Pattern SPACE = Pattern.compile(" ");
    private static final JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
    private static final String POSTGRESQL_DRIVER = "org.postgresql.Driver";
    private static final String POSTGRESQL_USERNAME = "abc";
    private static final String POSTGRESQL_PWD = "abc";
    private static final String POSTGRESQL_CONNECTION_URL = "jdbc:postgresql://192.168.111.130:5432/abcdb?user=" + POSTGRESQL_USERNAME + "&password=" + POSTGRESQL_PWD;
    private static final SQLContext sqlContext = new SQLContext(sc);

    public static void main(String[] args) throws Exception {
        // Datasource options
        Map<String, String> options = new HashMap<String, String>();
        options.put("driver", POSTGRESQL_DRIVER);
        options.put("url", POSTGRESQL_CONNECTION_URL);
        options.put("dbtable", "(select id, duration from sdr_log where duration is not null ) as sdr_log");

        Dataset<Row> sdrDF = sqlContext.load("jdbc", options);
        JavaRDD<Row> sdrData = sdrDF.toJavaRDD();
        sdrData.cache();

        JavaRDD<Vector> durationData = sdrData.map(row -> {
            double value = new Double(row.get(2).toString());
            return Vectors.dense(value);
        });
        durationData.cache();

        // These values were not shown in the original snippet; 2 clusters and 20 iterations are placeholders.
        int numClusters = 2;
        int numIterations = 20;

        KMeansModel clusters = KMeans.train(durationData.rdd(), numClusters, numIterations);
        JavaRDD<Integer> clusterLabel = clusters.predict(durationData);
        JavaRDD<Long> id = sdrData.map(row -> new Long(row.get(1).toString()));
        JavaPairRDD<Long, Integer> clusterLableData = id.zip(clusterLabel);
        clusterLableData.saveAsTextFile("data/mlib/kmeans_output11.txt");
    }
}
But I want to do this with the Spark ML library.
K-means is an unsupervised clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other.
Dataset<Row> durationDS = sdrDS.select("duration");
In your code, you select a single column, 'duration', and set the number of clusters to 2. But how can the data be grouped into clusters when the model is given no basis to do so?
The essence of unsupervised learning algorithms, K-means in this case, is that you do not need to specify parameters describing the logic of the dataset. You only need to pass (fit) the dataset to the model, and it groups the data into clusters.
In the K-means algorithm, the model assigns each point to the nearest of the K cluster centroids. It needs a features column to cluster on, whereas you are passing a single plain column.
It is better to use Spark's DataFrame API to resolve the error you are facing.
Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.
Import the table into a DataFrame object:
> Dataset<Row> jdbcDF = spark.read().jdbc(POSTGRESQL_CONNECTION_URL, POSTGRESQL_TABLE, options);
You can now drop the columns you don't want using jdbcDF.drop("columnName").
And/or fit your dataset this way:
> KMeansModel model = kmeans.fit(jdbcDF);
Also, it would be great if you could provide the dataset.
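One common way (not shown in the answer above, so treat this as a sketch rather than the answer's own code) to give the ml KMeans the "features" vector column it expects is VectorAssembler. This assumes the sdrDS DataFrame from the question's code, with a numeric duration column:
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assemble the numeric column(s) into the vector column that ml's KMeans looks for by default.
Dataset<Row> withFeatures = new VectorAssembler()
        .setInputCols(new String[]{"duration"})
        .setOutputCol("features")
        .transform(sdrDS);

KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(withFeatures);   // no more "Field "features" does not exist"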
I was trying to run a sample Spark Streaming program, but I get this error:
16/06/02 15:25:42 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:624)
at com.streams.spark_consumer.SparkConsumer.main(SparkConsumer.java:56)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:624)
at com.streams.spark_consumer.SparkConsumer.main(SparkConsumer.java:56)
My code is given below. I know there are a few unused imports; I was doing something else and getting the same error, so I modified that code to run the sample program given on the Spark Streaming website:
package com.streams.spark_consumer;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import scala.Tuple2;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
import org.apache.spark.api.java.JavaSparkContext;
public class SparkConsumer {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        System.out.println("Yes, it is running"); // just to know if this part of the code is executed
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        System.out.println("Yes, I told you it is running 1"); // just to know if this part of the code is executed

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words = lines.flatMap(
            new FlatMapFunction<String, String>() {
                public Iterable<String> call(String x) {
                    return Arrays.asList(x.split(" "));
                }
            });

        JavaPairDStream<String, Integer> pairs = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            });

        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            });

        jssc.start();
        jssc.awaitTermination();
    }
}
Can anybody help me out with this?
I am using a local master. Even so, I have tried starting and stopping a master (and slaves); I didn't know why that might help, but I have already tried it just in case.
According to Spark documentation
Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
So use any of the following output operations after your transformations (see the sketch after this list):
print()
foreachRDD(func)
saveAsObjectFiles(prefix, [suffix])
saveAsTextFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
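For example, based on the question's code (a minimal sketch; wordCounts is the final JavaPairDStream built above), registering print() before starting the context is enough:
// Register an output operation so the DStream graph has something to execute.
wordCounts.print(); // prints the first elements of each batch to the console

jssc.start();
jssc.awaitTermination();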
I am looking for some help or example code that illustrates pyspark calling user-written Java code outside of Spark itself, where the Java code takes a Spark context from Python and then returns an RDD built in Java.
For completeness, I'm using Py4J 0.81, Java 8, Python 2.7, and Spark 1.3.1.
Here is what I am using for the Python half:
import pyspark
sc = pyspark.SparkContext(master='local[4]',
                          appName='HelloWorld')
print "version", sc._jsc.version()
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
print gateway.entry_point.getRDDFromSC(sc._jsc)
The Java portion is:
import java.util.Map;
import java.util.List;
import java.util.ArrayList;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import py4j.GatewayServer;
public class HelloWorld {
    public JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc) {
        JavaRDD<Integer> result = null;
        if (jsc == null) {
            System.out.println("XXX Bad mojo XXX");
            return result;
        }

        int n = 10;
        List<Integer> l = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            l.add(i);
        }
        result = jsc.parallelize(l);
        return result;
    }

    public static void main(String[] args) {
        HelloWorld app = new HelloWorld();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Running it produces the following on the Python side:
$ spark-1.3.1-bin-hadoop1/bin/spark-submit main.py
version 1.3.1
sc._jsc <class 'py4j.java_gateway.JavaObject'>
org.apache.spark.api.java.JavaSparkContext#50418105
None
The Java side reports:
$ spark-1.3.1-bin-hadoop1/bin/spark-submit --class "HelloWorld" --master local[4] target/hello-world-1.0.jar
XXX Bad mojo XXX
The problem appears to be that I am not correctly passing the JavaSparkContext from Python to Java. The same failure (the JavaRDD being null) occurs when I use sc._jsc.sc() from Python.
What is the correct way to invoke user-defined Java code that uses Spark from Python?
So I've got an example of this in a branch that I'm working on for Sparkling Pandas. The branch lives at https://github.com/holdenk/sparklingpandas/tree/add-kurtosis-support and the PR is at https://github.com/sparklingpandas/sparklingpandas/pull/90.
As it stands, it looks like you have two different gateway servers, which seems like it might cause some problems. Instead, you can just use the existing gateway server and do something like:
sc._jvm.what.ever.your.class.package.is.HelloWorld.getRDDFromSC(sc._jsc)
assuming you make that a static method as well.
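For reference, here is a minimal sketch of what the Java side could look like with getRDDFromSC made static, as suggested above (assuming the jar containing HelloWorld is already on the driver's classpath, e.g. passed via --jars, so no separate GatewayServer is needed):
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HelloWorld {
    // Static so it can be reached from PySpark through the existing gateway:
    //   sc._jvm.HelloWorld.getRDDFromSC(sc._jsc)
    public static JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc) {
        List<Integer> l = new ArrayList<Integer>();
        for (int i = 0; i < 10; i++) {
            l.add(i);
        }
        return jsc.parallelize(l);
    }
}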
The Spark standalone cluster looks like it's running without a problem:
http://i.stack.imgur.com/gF1fN.png
I followed this tutorial.
I have built a fat jar for running this JavaApp on the cluster. Before maven package:
find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
The content of SimpleApp.java is:
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
public class SimpleApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("spark://10.35.23.13:7077")
                .setAppName("My app")
                .set("spark.executor.memory", "1g");
        JavaSparkContext sc = new JavaSparkContext(conf);

        String logFile = "/home/ubuntu/spark-0.9.1/test_data";
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        System.out.println("Lines with a: " + numAs);
    }
}
This program only works when the master is set with setMaster("local"). Otherwise I get this error:
$java -cp path_to_file/simple-project-1.0-allinone.jar SimpleApp
http://i.stack.imgur.com/doRSn.png
There's an anonymous class (which extends Function) in the SimpleApp.java file. This class is compiled to SimpleApp$1, which must be shipped to each worker in the Spark cluster.
The simplest way to do this is to add the jar explicitly to the Spark context. Add something like sparkContext.addJar("path_to_file/simple-project-1.0-allinone.jar") after creating the JavaSparkContext, and rebuild your jar file. Then the main Spark program (called the driver program) will automatically deliver your application code to the cluster, as in the sketch below.
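A minimal sketch of that change, based on the SimpleApp code above (the jar path is the example path from the question and may differ on your machine):
SparkConf conf = new SparkConf()
        .setMaster("spark://10.35.23.13:7077")
        .setAppName("My app")
        .set("spark.executor.memory", "1g");
JavaSparkContext sc = new JavaSparkContext(conf);

// Ship the application jar (which contains the anonymous class SimpleApp$1) to the workers.
sc.addJar("path_to_file/simple-project-1.0-allinone.jar");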