streaming.StreamingContext: Error starting the context, marking it as stopped [Spark Streaming] - apache-spark

I was trying to run a sample Spark Streaming program, but I get this error:
16/06/02 15:25:42 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:624)
at com.streams.spark_consumer.SparkConsumer.main(SparkConsumer.java:56)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:624)
at com.streams.spark_consumer.SparkConsumer.main(SparkConsumer.java:56)
My code is given below. I know there are a few unused imports; I was doing something else and getting the same error, so I modified that code to run the sample program given on the Spark Streaming website:
package com.streams.spark_consumer;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import scala.Tuple2;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
import org.apache.spark.api.java.JavaSparkContext;
public class SparkConsumer {
private static final Pattern SPACE = Pattern.compile(" ");
public static void main(String[] args) throws Exception {
System.out.println("Han chal raha hai"); //just to know if this part of the code is executed
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
System.out.println("Han bola na chal raha hau chutiye 1"); //just to know if this part of the code is executed
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
JavaDStream<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String x) {
return Arrays.asList(x.split(" "));
}
});
JavaPairDStream<String, Integer> pairs = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
jssc.start();
jssc.awaitTermination();
}
}
Can anybody help me out with this?
I am using a local master; even so, I have tried starting and stopping a master (and slaves). I didn't know why that might help, but I have already tried it just in case.

According to the Spark documentation:
Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
So use one of the output operations after your transformations, as in the sketch after this list:
print()
foreachRDD(func)
saveAsObjectFiles(prefix, [suffix])
saveAsTextFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
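For example, with the code from the question, adding a print() on wordCounts before starting the context registers an output operation (a minimal sketch; any of the operations above would do):
wordCounts.print(); // output operation: without one, DStreamGraph.validate() fails as shown above
jssc.start();
jssc.awaitTermination();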

Related

Returning java.util.Map from spark UDF results in scala.MatchError: {} (of class java.util.HashMap)

I'm new to Apache Spark and I'm learning how to use it in Java. I would like to define and use a user-defined function (UDF), and I get a scala.MatchError when returning a java.util.HashMap.
Here is my code for extracting hashtags from a tweets dataset and adding a new column with a map of each hashtag and its number of occurrences in the respective tweet:
// Open spark session
SparkSession sparkSession = SparkSession.builder().master("local[*]").appName("TwitterAnalyticsExample").getOrCreate();
// Load training data
Dataset<Row> twitterData = sparkSession.read().format("json").load(inputFilePath);
UDF1 extractHashtags = new UDF1<String, Map<String, Integer>>() {
@Override
public Map<String, Integer> call(String tweet) throws Exception {
Map<String, Integer> result = new HashMap<>();
Pattern pattern = Pattern.compile("#\\w*");
Matcher matcher = pattern.matcher(tweet);
while (matcher.find()) {
result.merge(matcher.group(), 1, (v1, v2) -> v1 + v2);
}
return result;
}
};
sparkSession.sqlContext().udf().register("extractHashtags", extractHashtags, DataTypes.StringType);
twitterData.limit(50).select(callUDF("extractHashtags", col("text"))).show(20);
and following imports:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Any hint on what I am doing wrong? Is the return type java.util.Map a problem for a UDF? What could I use instead?
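One thing worth checking (an assumption on my part, not something stated in the post): the UDF returns a Map<String, Integer>, but it is registered with DataTypes.StringType, so Catalyst expects a string and fails on the HashMap. A minimal sketch of registering it with a matching map type instead:
// Sketch: declare the UDF's return type as map<string,int> rather than string.
sparkSession.sqlContext().udf().register(
        "extractHashtags",
        extractHashtags,
        DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType));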

Spark custom output path after partition columns

In Spark, is it possible to have suffix in the path after partition by columns?
For example:
I am writing the data to the following path:
/db_name/table_name/dateid=20171009/event_type=TEST/
dataset.write().partitionBy("event_type").save("/db_name/table_name/dateid=20171009");
Is it possible to create it to the following with dynamic partition?
/db_name/table_name/dateid=20171009/event_type=TEST/1507764830
It turns out newTaskTempFile is the right place for this; the FileOutputCommitter approach described below doesn't work for dynamic partitions.
public String newTaskTempFile(TaskAttemptContext taskContext, Option<String> dir, String ext) {
Option<String> dirWithTimestamp = Option.apply(dir.get() + "/" + timestamp);
return super.newTaskTempFile(taskContext, dirWithTimestamp, ext);
}
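For context, a rough sketch of where such an override could live (the class name and timestamp field are hypothetical, the constructor mirrors the ESSQLHadoopMapReduceCommitProtocol shown further down, and details vary between Spark versions):
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol;
import scala.Option;
public class TimestampedCommitProtocol extends SQLHadoopMapReduceCommitProtocol {
    // One timestamp per protocol instance; the instance is created on the driver and
    // serialized to the tasks, so all tasks of a single write see the same value.
    private final String timestamp = Long.toString(System.currentTimeMillis());
    public TimestampedCommitProtocol(String jobId, String path, boolean isAppend) {
        super(jobId, path, isAppend);
    }
    @Override
    public String newTaskTempFile(TaskAttemptContext taskContext, Option<String> dir, String ext) {
        // dir carries the dynamic partition directory (e.g. "event_type=TEST"); this sketch
        // assumes partitioned output, i.e. dir is defined, and appends the extra path level.
        Option<String> dirWithTimestamp = Option.apply(dir.get() + "/" + timestamp);
        return super.newTaskTempFile(taskContext, dirWithTimestamp, ext);
    }
}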
//sample json
{"event_type": "type_A", "dateid":"20171009", "data":"garbage" }
{"event_type": "type_B", "dateid":"20171008", "data":"garbage" }
{"event_type": "type_A", "dateid":"20171007", "data":"garbage" }
{"event_type": "type_B", "dateid":"20171006", "data":"garbage" }
// save as partition
spark.read
.json("./data/sample.json")
.write
.partitionBy("dateid", "event_type").saveAsTable("sample")
//result
After reading the source code, the FileOutputCommitter is the way to do this.
SparkSession spark = SparkSession
.builder()
.master("local[2]")
.config("spark.sql.parquet.output.committer.class", "com.estudio.spark.ESParquetOutputCommitter")
.config("spark.sql.sources.commitProtocolClass", "com.estudio.spark.ESSQLHadoopMapReduceCommitProtocol")
.getOrCreate();
ESSQLHadoopMapReduceCommitProtocol.realAppendMode = false;
spark.range(10000)
.withColumn("type", rand()
.multiply(6).cast("int"))
.write()
.mode(SaveMode.Append)
.partitionBy("type")
.format("parquet")
.save("/tmp/spark/test1/");
Here is the customized ParquetOutputCommitter; it's the place to customize the output path. In this case, we're suffixing the timestamp. We have to make sure it's synchronized. Here is the code:
import lombok.extern.slf4j.Slf4j;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.parquet.hadoop.ParquetOutputCommitter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
@Slf4j
public class ESParquetOutputCommitter extends ParquetOutputCommitter {
private final static Map<String, Path> pathMap = new HashMap<>();
public final static synchronized Path getNewPath(final Path path) {
final String key = path.toString();
log.debug("path.key: {}", key);
if (pathMap.containsKey(key)) {
return pathMap.get(key);
}
final Path newPath = new Path(path, Long.toString(System.currentTimeMillis()));
pathMap.put(key, newPath);
log.info("---> Path: {}, newPath: {}", path, newPath);
return newPath;
}
public ESParquetOutputCommitter(Path outputPath, TaskAttemptContext context) throws IOException {
super(getNewPath(outputPath), context);
log.info("this: {}", this);
}
}
We can also use the getNewPath method to get the customized path. So far, this works for SaveMode.Overwrite.
SaveMode.Append is a little different. So, to cover Append mode, we need to override SQLHadoopMapReduceCommitProtocol to always return the customized ParquetOutputCommitter. Here is the code:
import lombok.extern.slf4j.Slf4j;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol;
import org.apache.spark.sql.internal.SQLConf;
import java.lang.reflect.Constructor;
@Slf4j
public class ESSQLHadoopMapReduceCommitProtocol extends SQLHadoopMapReduceCommitProtocol {
public static boolean realAppendMode = false;
private String jobId;
private String path;
private boolean isAppend;
public ESSQLHadoopMapReduceCommitProtocol(String jobId, String path, boolean isAppend) {
super(jobId, path, isAppend);
this.jobId = jobId;
this.path = path;
this.isAppend = isAppend;
}
@Override
public OutputCommitter setupCommitter(TaskAttemptContext context) {
try {
OutputCommitter committer = context.getOutputFormatClass().newInstance().getOutputCommitter(context);
if (realAppendMode) {
log.info("Using output committer class {}", committer.getClass().getCanonicalName());
return committer;
}
final Configuration configuration = context.getConfiguration();
final String key = SQLConf.OUTPUT_COMMITTER_CLASS().key();
final Class<? extends OutputCommitter> clazz;
clazz = configuration.getClass(key, null, OutputCommitter.class);
if (clazz == null) {
log.info("Using output committer class {}", committer.getClass().getCanonicalName());
return committer;
}
log.info("Using user defined output committer class {}", clazz.getCanonicalName());
if (FileOutputCommitter.class.isAssignableFrom(clazz)) {
Constructor<? extends OutputCommitter> ctor = clazz.getDeclaredConstructor(Path.class, TaskAttemptContext.class);
committer = ctor.newInstance(new Path(path), context);
} else {
Constructor<? extends OutputCommitter> ctor = clazz.getDeclaredConstructor();
committer = ctor.newInstance();
}
return committer;
} catch (Exception e) {
e.printStackTrace();
return super.setupCommitter(context);
}
}
}
I also added a static flag, realAppendMode, to turn all of this off.
Again, I am not a Spark expert yet; let me know if there is any issue with this solution.

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:92)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:253)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:330)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:304)
at sparkExample.spExample.ClusteringDSPOC.main(ClusteringDSPOC.java:45)
My code is
package sparkExample.spExample;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class ClusteringDSPOC {
private static final Pattern SPACE = Pattern.compile(" ");
private static final SparkContext sc = new SparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
private static final String POSTGRESQL_DRIVER = "org.postgresql.Driver";
private static final String POSTGRESQL_USERNAME = "xyz";
private static final String POSTGRESQL_PWD = "xyz";
private static final String POSTGRESQL_CONNECTION_URL = "jdbc:postgresql://192.168.111.130:5432/xyzdb?user=" + POSTGRESQL_USERNAME + "&password=" + POSTGRESQL_PWD;
private static final String POSTGRESQL_TABLE = "(select id, duration from abc where duration is not null ) as abc";
public static void main(String[] args) throws Exception {
//Datasource options
SparkSession spark = SparkSession.builder().appName("JavaKMeansExample").getOrCreate();
Class.forName(POSTGRESQL_DRIVER);
Properties options = new Properties();
Dataset<Row> sdrDS = spark.read().format("libsvm").jdbc(POSTGRESQL_CONNECTION_URL, POSTGRESQL_TABLE, options);
Dataset<Row> durationDS = sdrDS.select("duration");
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(durationDS);
}
}
I am following this
https://spark.apache.org/docs/latest/ml-clustering.html.
I am getting this error when the fit method is called. Please help me fix this, or suggest an alternate way to do it. Thanks.
Here I am trying to divide duration into 2 or 3 clusters and then map each cluster to an id. I am able to do the same thing using the Spark MLlib library in this way:
package sparkExample.spExample;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
public class ClusteringPOC1 {
private static final Pattern SPACE = Pattern.compile(" ");
private static final JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
private static final String POSTGRESQL_DRIVER = "org.postgresql.Driver";
private static final String POSTGRESQL_USERNAME = "abc";
private static final String POSTGRESQL_PWD = "abc";
private static final String POSTGRESQL_CONNECTION_URL = "jdbc:postgresql://192.168.111.130:5432/abcdb?user=" + POSTGRESQL_USERNAME + "&password=" + POSTGRESQL_PWD;
private static final SQLContext sqlContext = new SQLContext(sc);
public static void main(String[] args) throws Exception {
//Datasource options
Map<String, String> options = new HashMap<String, String>();
options.put("driver", POSTGRESQL_DRIVER);
options.put("url", POSTGRESQL_CONNECTION_URL);
options.put("dbtable", "(select id, duration from sdr_log where duration is not null ) as sdr_log");
Dataset<Row> sdrDF = sqlContext.load("jdbc", options);
JavaRDD<Row> sdrData = sdrDF.toJavaRDD();
sdrData.cache();
JavaRDD<Vector> durationData = sdrData.map(row -> {
double value = new Double(row.get(2).toString());
return Vectors.dense(value);
});
durationData.cache();
KMeansModel clusters = KMeans.train(durationData.rdd(), numClusters, numIterations);
JavaRDD<Integer> clusterLabel = clusters.predict(durationData);
JavaRDD<Long> id = sdrData.map(row -> new Long(row.get(1).toString()));
JavaPairRDD<Long, Integer> clusterLableData = id.zip(clusterLabel);
clusterLableData.saveAsTextFile("data/mlib/kmeans_output11.txt");
}
}
But I want to do this with spark ml library.
K-means is an unsupervised clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other.
Dataset<Row> durationDS = sdrDS.select("duration");
In your code, you select a single column, 'duration', and set the number of clusters to 2. But how can the data be grouped into clusters when the model is given no basis to do so?
The essence of unsupervised learning algorithms, K-means in this case, is that you do not need to specify parameters relating to the logic of the dataset. You just pass (fit) the dataset to the model and it groups it into clusters.
In the K-means algorithm, the model assigns each point to the nearest of the K cluster centroids. It needs a numeric feature vector to cluster on, whereas you're passing a raw single column.
It is better to use Spark's DataFrame API to resolve the error you are facing.
Spark automatically reads the schema from the database table and maps its types back to Spark SQL's types.
Import into a DataFrame object:
> Dataset<Row> jdbcDF = spark.read().jdbc(POSTGRESQL_CONNECTION_URL, POSTGRESQL_TABLE, options);
You can now drop columns you don't want using jdbcDF.drop("ColumnName").
Or/and fit your dataset this way:
> KMeansModel model = kmeans.fit(jdbcDF);
Also, it would be great if you could provide the dataset.
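For what it's worth, an addition that is not part of the original answer: the exception Field "features" does not exist usually means Spark ML's KMeans cannot find its feature column (it defaults to a Vector column named "features"). A minimal sketch, assuming the goal is to cluster on the numeric duration column from the question, using VectorAssembler to build that column before calling fit:
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// Assemble the numeric "duration" column into the Vector column KMeans expects.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"duration"})
        .setOutputCol("features");
Dataset<Row> featureDS = assembler.transform(sdrDS); // sdrDS loaded via JDBC as in the question
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(featureDS);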

why some nodes were not assigned to allocate data on spark?

I am a beginner with Spark. I wrote the code below and ran it on multiple nodes.
I have one master node and four worker nodes. I ran my code multiple times and, to my surprise, sometimes some of the workers did not participate, while other times all the worker nodes worked because they were assigned the data the master specified.
I didn't set up any detailed configuration, so this behavior looks weird to me.
I want all my worker nodes to process at the same time to get better and faster results. How can I achieve this?
I attached my code and commands. It is very straightforward, so I skipped a detailed explanation. Thanks.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
/**
* Created by dst on 2/1/17.
*/
public class Test {
public static void main(String[] args) throws Exception {
String inputFile = args[0];
String outputFile = args[1];
SparkConf conf = new SparkConf().setAppName("Data Transformation")
.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile(inputFile);
JavaRDD<String> newLine = lines.flatMap(new FlatMapFunction<String, String>() {
public Iterator<String> call(String s) throws Exception {
List<String> ret = new ArrayList<String>();
List<String> ls = Arrays.asList(s.split("\t"));
String values = ls.get(ls.size()-1);
List<String> value = Arrays.asList(values.split("\\|"));
for(int i=0;i<value.size();++i){
String ns = ls.get(0)+"\t"+ls.get(1)+"\t"+ls.get(2)+"\t"+ls.get(3)+"\t"+ls.get(4)+"\t"+ls.get(5);
ns = ns + "\t" + value.get(i);
ret.add(ns);
}
return ret.iterator();
}
});
newLine.saveAsTextFile(outputFile);
}
}
The spark-submit command:
spark-submit \
--class Test \
--master spark://spark.dso.xxxx \
--executor-memory 10G \
/home/jumbo/user/sclee/dt/jars/dt_01_notcache-1.0-SNAPSHOT.jar \
/user/sclee/data/ /user/sclee/output
Referring to the documentation, try setting spark.deploy.spreadOut = false and check whether the behavior remains the same after this setting.
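For reference (my assumption about the setup, not part of the original answer): spark.deploy.spreadOut is a standalone-cluster master property, so it is normally set on the master, for example through SPARK_MASTER_OPTS in conf/spark-env.sh before restarting the master:
export SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
With spreadOut=true (the default) the master tries to spread an application's executors across as many worker nodes as possible; with false it consolidates them onto as few nodes as possible.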

How can I make Selenium tests inside my JSF project?

I have a quite big JSF 1.2 project and I want to write some integration tests for it.
The perfect situation would be if I could run these tests from my project and they opened my browser and performed all the actions (with Selenium) written in my test cases. Of course, opening a browser is not required as long as the tests run anyway :)
I've tried a few possibilities, but I still can't attach any Selenium library to my project, and I realized that I just don't know where to start. Can you give me some direction?
This might help you; you can write your test logic inside the test method.
package com.test;
import org.junit.After;
import org.junit.Before;
import org.junit.Ignore;
import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import java.util.concurrent.TimeUnit;
import static org.junit.Assert.fail;
public class test1 {
private WebDriver driver;
private String baseUrl;
private StringBuffer verificationErrors = new StringBuffer();
@Before
public void setUp() throws Exception {
driver = new ChromeDriver(); // requires the ChromeDriver binary on the PATH
baseUrl = "http://www.google.com";
driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
}
@Ignore
@Test
public void test1() throws Exception {
// your test code
}
@After
public void tearDown() throws Exception {
driver.quit();
String verificationErrorString = verificationErrors.toString();
if (!"".equals(verificationErrorString)) {
fail(verificationErrorString);
}
}
private boolean isElementPresent(By by) {
try {
driver.findElement(by);
return true;
} catch (NoSuchElementException e) {
return false;
}
}
}
You just need to run the test1 class against whatever you want to test; it will work on it automatically.
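Since the question also mentions not being able to attach a Selenium library to the project, here is a minimal sketch of the Maven dependencies that are usually needed (the version numbers are only illustrative; pick ones that match your environment):
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>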
