I am using Spark 2.0.0.
Is there a way to pass parameters from the Spark driver to the executors? I tried the following.
class SparkDriver {
    public static void main(String[] argv) {
        SparkConf conf = new SparkConf().setAppName("test").setMaster("yarn");
        SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
        Dataset<Row> input = sparkSession.read().load("inputfilepath");
        Dataset<Row> modifiedinput = input.mapPartitions(new customMapPartition(5), Encoders.bean(Row.class));
    }
}

class customMapPartition implements MapPartitionsFunction<Row, Row> {
    private static final long serialVersionUID = -6513655566985939627L;
    private static Integer variableThatHastobePassed = null;

    public customMapPartition(Integer passedInteger) {
        customMapPartition.variableThatHastobePassed = passedInteger;
    }

    @Override
    public Iterator<Row> call(Iterator<Row> input) throws Exception {
        System.out.println("number that is passed " + variableThatHastobePassed);
        return input;
    }
}
As mentioned above, I wrote a custom MapPartitionsFunction to pass the parameters, and I am accessing the static variable in the call method of the partition function. This worked when I ran it locally with .setMaster("local"), but it did not work when run on a cluster with .setMaster("yarn") (it printed null in the System.out.println statements).
Is there a way to pass parameters from the driver to the executors?
My bad, I was using
private static Integer variableThatHastobePassed = null;
The variable should not be declared as static.
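A minimal sketch of the corrected class, assuming the rest of the driver stays the same: the non-static instance field is serialized together with the function object and shipped to the executors.

class customMapPartition implements MapPartitionsFunction<Row, Row> {
    private static final long serialVersionUID = -6513655566985939627L;
    // Instance field: serialized with this object and therefore visible on the executors.
    private final Integer variableThatHastobePassed;

    public customMapPartition(Integer passedInteger) {
        this.variableThatHastobePassed = passedInteger;
    }

    @Override
    public Iterator<Row> call(Iterator<Row> input) throws Exception {
        System.out.println("number that is passed " + variableThatHastobePassed);
        return input;
    }
}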
I am new to Flink, and I want to store Kafka streaming data in Cassandra. I've converted the String into a POJO. My POJO is as follows:
@Table(keyspace = "sample", name = "contact")
public class Person implements Serializable {

    private static final long serialVersionUID = 1L;

    @Column(name = "name")
    private String name;

    @Column(name = "timeStamp")
    private LocalDateTime timeStamp;

    // getters and setters omitted
}
and my conversion takes place as below:
stream.flatMap(new FlatMapFunction<String, Person>() {
    @Override
    public void flatMap(String value, Collector<Person> out) {
        try {
            out.collect(objectMapper.readValue(value, Person.class));
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
    }
}).print(); // I need to use the proper method to convert to a DataStream.

env.execute();
I read the documentation at the link below for reference:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/cassandra.html
The Cassandra sink accepts a DataStream instance. I need to convert the result of my conversion into a DataStream and store it in Cassandra.
Can't create Cassandra Pojo Sink also gave me some ideas.
There is a method .forward() which returns a DataStream<Reading> forward, and when I pass that instance to

CassandraSink.addSink(forward)
        .setHost("localhost")
        .build();

I get: cannot access org.apache.flink.streaming.api.scala.DataStream
How can I convert my POJO so it can be stored in Cassandra?
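A minimal sketch of what this seems to be aiming for, assuming the Java DataStream API (org.apache.flink.streaming.api.datastream.DataStream, not the scala package) and the flink-connector-cassandra POJO sink; the host is taken from the snippet above:

// The flatMap already yields a DataStream<Person>; keep a reference instead of calling print()
DataStream<Person> persons = stream.flatMap(new FlatMapFunction<String, Person>() {
    @Override
    public void flatMap(String value, Collector<Person> out) {
        try {
            out.collect(objectMapper.readValue(value, Person.class));
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
    }
});

// The POJO sink uses the @Table/@Column mappings on Person
CassandraSink.addSink(persons)
        .setHost("localhost")
        .build();

env.execute();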
I want to use my own tokenizer to tokenize text stored as Dataset<String>, and get Dataset<Seq<String>> (so I can pass it to CountVectorizer).
Expected input (/tmp/fulltext.txt):
t1 t2 t3
t4 t5
Expected output:
[t1, t2, t3]
[t4, t5]
The tokenizer I wrote is below (basically what it does now is the same as the Tokenizer shipped with Spark, but I'll need to rewrite it to support tokenization of Chinese text, so I cannot use the official Tokenizer):
public class Utils {
    public static Seq<String> segment(String text) {
        String[] array = text.split(" ");
        List<String> tokens = new ArrayList<>();
        for (String term : array) {
            tokens.add(term.toLowerCase());
        }
        return JavaConverters
                .asScalaIteratorConverter(tokens.iterator())
                .asScala()
                .toSeq();
    }
}
The Spark application I'm trying to make is
public class TokenizeTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Tokenize Test")
                .getOrCreate();

        Dataset<String> rawText = spark
                .read()
                .textFile("/tmp/fulltext.txt")
                .cache();

        Encoder<Seq> listEncoder = Encoders.bean(Seq.class);
        // Compilation error
        Dataset<Seq<String>> newText = rawText
                .map((MapFunction<String, Seq<String>>) s -> Utils.segment(s), listEncoder);
        newText.show();

        spark.stop();
    }
}
I'm a Spark beginner, and the above code is just what I thought would work (after reading the official guide). But it turns out TokenizeTest doesn't compile at all. Do you think there is a way to fix it?
Using Scala collections like this won't work. For one, Seq is not bean compatible; for another, it is generic.
If you just want to split, use arrays, with segment defined as:
public class Utils {
    public static String[] segment(String text) {
        return text.split(" ");
    }
}
and TokenizeTest defined as:
public class TokenizeTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Tokenize Test")
                .getOrCreate();

        Dataset<String> rawText = spark
                .read()
                .textFile("/path/to/file")
                .cache();

        Encoder<String[]> listEncoder = spark.implicits().newStringArrayEncoder();
        Dataset<String[]> newText = rawText
                .map((MapFunction<String, String[]>) s -> Utils.segment(s), listEncoder);
        newText.show();

        spark.stop();
    }
}
In practice though, you might consider either org.apache.spark.sql.functions.split or org.apache.spark.ml.feature.Tokenizer instead of reinventing the wheel.
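For example, a minimal sketch of both options, continuing from the rawText dataset above ("value" is the default column name that textFile produces; "tokens" is an arbitrary output column name):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Built-in split: one array<string> column named "tokens", one row per input line
Dataset<Row> tokens = rawText.select(split(col("value"), " ").alias("tokens"));
tokens.show();

// Or the ML Tokenizer, which also lowercases, like the custom segment above
Tokenizer tokenizer = new Tokenizer().setInputCol("value").setOutputCol("tokens");
Dataset<Row> tokenized = tokenizer.transform(rawText.toDF());
tokenized.select("tokens").show();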
I have a basic question on Spark closures. I am not able to distinguish the code behavior between scenarios 2 & 3: both produce the same output, but based on my understanding scenario 3 should not work as expected.
The below code is common to all scenarios:
class A implements Serializable {
    String t;

    A(String t) {
        this.t = t;
    }
}

// Initialize Spark context
JavaSparkContext context = ....
// Create RDD
JavaRDD<String> rdd = context.parallelize(Arrays.asList("a", "b", "c", "d", "e"), 3);
Scenario 1: don't do this, because A is initialized in the driver and not visible on the executors.
A a=new A("pqr");
rdd.map(i->i+a.t).collect();
Scenario 2: the recommended way of sharing an object
Broadcast<A> broadCast = context.broadcast(new A("pqr"));
rdd.map(i->broadCast.getValue().t+i).collect();
//output: [pqra, pqrb, pqrc, pqrd, pqre]
Scenario 3: why does this code work as expected even though I instantiate A in the driver?
class TestFunction implements Function<String, String>, Serializable {
    private A val;

    public TestFunction() { }

    public TestFunction(A a) {
        this.val = a;
    }

    @Override
    public String call(String integer) throws Exception {
        return val.t + integer;
    }
}
TestFunction mapFunction = new TestFunction(new A("pqr"));
System.out.println(rdd.map(mapFunction).collect());
//output: [pqra, pqrb, pqrc, pqrd, pqre]
Note: I am running the program in cluster mode.
The generated Java bytecode for scenarios 1 & 3 is almost the same. The benefit of using Broadcast (scenario 2) is that the broadcast object is sent to each executor only once and reused by other tasks on that executor. Scenarios 1 & 3 always send the object A to the executors for each task.
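A minimal sketch (variable names reuse the snippets above; the check itself is illustrative) of how to observe this: each task deserializes its own copy of a closure-captured object, while broadcast.getValue() returns the instance cached on the executor.

import java.util.Collections;
import java.util.List;
import org.apache.spark.broadcast.Broadcast;

A a = new A("pqr");
Broadcast<A> broadCast = context.broadcast(new A("pqr"));

// One string per partition: the JVM identity hash of the object that task sees.
List<String> closureCopies = rdd.mapPartitions(it ->
        Collections.singletonList("closure: " + System.identityHashCode(a)).iterator())
    .collect();
List<String> broadcastCopies = rdd.mapPartitions(it ->
        Collections.singletonList("broadcast: " + System.identityHashCode(broadCast.getValue())).iterator())
    .collect();

System.out.println(closureCopies);   // typically a different hash for every task
System.out.println(broadcastCopies); // typically one hash per executor JVM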
I want to implement a class that has a function that reads from HBase via Spark, like this:
public abstract class QueryNode implements Serializable {

    private static final long serialVersionUID = -2961214832101500548L;

    private int id;
    private int parent;

    protected static Configuration hbaseConf;
    protected static Scan scan;
    protected static JavaSparkContext sc;

    public abstract RDDResult query();

    public int getParent() {
        return parent;
    }

    public void setParent(int parent) {
        this.parent = parent;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public void setScanToConf() {
        try {
            ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
            String scanToString = Base64.encodeBytes(proto.toByteArray());
            hbaseConf.set(TableInputFormat.SCAN, scanToString);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This is a parent class; I have some subclasses that implement the method query() to read from HBase. But if I make Configuration, Scan and JavaSparkContext non-static, I get errors saying these classes are not serializable.
Why must these fields be static? Is there some other way to solve this problem? Thanks.
You can try marking these fields transient to avoid the serialization exception, for example:
Caused by: java.io.NotSerializableException: org.apache.spark.streaming.api.java.JavaStreamingContext
This tells Java that you just don't want to serialize these fields:
protected transient Configuration hbaseConf;
protected transient Scan scan;
protected transient JavaSparkContext sc;
Are you initializing JavaSparkContext, Configuration and Scan in main or in some static method? With static, your fields are shared across all instances, but whether static should be used depends on your use case.
The transient approach is better than static, though, because serializing a JavaSparkContext does not make sense: it is created on the driver.
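As a hedged sketch of one common pattern for transient fields (assuming org.apache.hadoop.hbase.HBaseConfiguration is available; the getter name is illustrative): a transient field comes back as null after deserialization, so it can be re-created lazily wherever it is actually used.

protected transient Configuration hbaseConf;

protected Configuration getHbaseConf() {
    // Rebuilt on whichever JVM calls this after deserialization left the field null
    if (hbaseConf == null) {
        hbaseConf = HBaseConfiguration.create();
    }
    return hbaseConf;
}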
-- edit after discussion in the comments:
Javadoc for newAPIHadoopRDD:
public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>> JavaPairRDD<K, V> newAPIHadoopRDD(
        org.apache.hadoop.conf.Configuration conf,
        Class<F> fClass,
        Class<K> kClass,
        Class<V> vClass)

conf - Configuration for setting up the dataset. Note: This will be put into a Broadcast. Therefore if you plan to reuse this conf to create multiple RDDs, you need to make sure you won't modify the conf. A safe approach is always creating a new conf for a new RDD.
Broadcast:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
So basically I think that for this case static is OK (you create hbaseConf only once), but if you want to avoid static, you can follow the suggestion in the Javadoc and always create a new conf for a new RDD.
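A minimal sketch of that "new conf per RDD" approach (the method and table name are hypothetical; scan is assumed to be built by the caller):

JavaPairRDD<ImmutableBytesWritable, Result> buildRdd(JavaSparkContext sc, Scan scan) throws IOException {
    // A fresh Configuration per RDD, so the broadcast copy is never mutated afterwards
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.INPUT_TABLE, "my_table"); // hypothetical table name
    conf.set(TableInputFormat.SCAN,
             Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray()));
    return sc.newAPIHadoopRDD(conf,
                              TableInputFormat.class,
                              ImmutableBytesWritable.class,
                              Result.class);
}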
I am using checkpointing in my application, and when my application restarts after a failure, I get a NullPointerException on SQLContext.
I assume the application is not able to recover the SQLContext because of serialization/deserialization issues. Is SQLContext not serializable?
Here is my code:
// DriverClass
final JavaSparkContext javaSparkCtx = new JavaSparkContext(conf);
final SQLContext sqlContext = new SQLContext(javaSparkCtx);

JavaStreamingContextFactory javaStreamingContextFactory = new JavaStreamingContextFactory() {
    @Override
    public JavaStreamingContext create() { // only executed the first time
        JavaStreamingContext jssc = new JavaStreamingContext(javaSparkCtx, Durations.minutes(1));
        jssc.checkpoint(CHECKPOINT_DIRECTORY);

        HashMap<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", "abc.xyz.localdomain:6667");
        //....

        JavaDStream<String> fullMsg = messages
                .map(new MapFunction());
        fullMsg.foreachRDD(new SomeClass(sqlContext));

        return jssc;
    }
};
}
// Closure Class
public class SomeClass implements Serializable, Function<JavaRDD<String>, Void> {
    SQLContext sqlContext;

    public SomeClass(SQLContext sqlContext) {
        this.sqlContext = sqlContext;
    }

    public void doSomething() {
        this.sqlContext.createDataFrame(); // here is the NullPointerException
    }
    //.......
}
SQLContext is serializable because Spark SQL needs to use SQLContext on the executor side internally. However, you should not serialize it into the Streaming checkpoint. Instead, you should get it from the RDD, like this: SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
See the Streaming docs for more details: http://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#dataframe-and-sql-operations
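A minimal sketch of how SomeClass might look with that change (the body is illustrative; only the getOrCreate call comes from the answer above):

public class SomeClass implements Serializable, Function<JavaRDD<String>, Void> {

    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        // Looked up from the SparkContext attached to the RDD, so nothing
        // SQLContext-related has to be recovered from the checkpoint.
        SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
        // ... use sqlContext.createDataFrame(...) here instead of the captured field ...
        return null;
    }
}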