I am using checkpointing in my application, and when the application restarts after a failure, I get a NullPointerException on SQLContext.
I assume the application cannot recover the SQLContext because of serialization/deserialization issues. Is SQLContext not serializable?
Here is my code:
//DriverClass
final JavaSparkContext javaSparkCtx = new JavaSparkContext(conf);
final SQLContext sqlContext = new SQLContext(javaSparkCtx);
JavaStreamingContextFactory javaStreamingContextFactory = new JavaStreamingContextFactory() {
@Override
public JavaStreamingContext create() { //only first time executed
// TODO Auto-generated method stub
JavaStreamingContext jssc = new JavaStreamingContext(javaSparkCtx, Durations.minutes(1));
jssc.checkpoint(CHECKPOINT_DIRECTORY);
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list",
"abc.xyz.localdomain:6667");
//....
JavaDStream<String> fullMsg = messages
.map(new MapFunction());
fullMsg.foreachRDD(new SomeClass(sqlContext));
return jssc;
}
};
}
//Closure Class
public class SomeClass implements Serializable, Function<JavaRDD<String>, Void> {
SQLContext sqlContext;
public SomeClass(SQLContext sqlContext) {
// TODO Auto-generated constructor stub
this.sqlContext = sqlContext;
}
public void doSomething() {
this.sqlContext.createDataFrame(); // the NullPointerException is thrown here
}
//.......
}
SQLContext is Serializable because Spark SQL uses it internally on the executor side. However, you should not serialize it into the Streaming checkpoint. Instead, get it from the RDD inside the function, like this: SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
See Streaming docs for more details: http://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#dataframe-and-sql-operations
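The difference between storing a context in the closure and looking it up at call time can be sketched without Spark. The following is a plain-Java analogy only: FakeContext, CapturingTask and LookupTask are hypothetical names, and the sketch assumes the NullPointerException comes from state that does not survive deserialization when the closure is restored from the checkpoint.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Plain-Java analogy, not Spark code; all class names here are hypothetical.
class CheckpointRecoverySketch {

    /** Stand-in for a context whose live state is not written out with the object. */
    static class FakeContext implements Serializable {
        private static final long serialVersionUID = 1L;
        private static FakeContext instance;

        // transient: skipped by Java serialization, so it is null in a deserialized copy
        transient String liveHandle = "connected";

        static synchronized FakeContext getOrCreate() {
            if (instance == null) {
                instance = new FakeContext();
            }
            return instance;
        }
    }

    /** Mirrors the question: the context is stored in a field and serialized with the task. */
    static class CapturingTask implements Serializable {
        private static final long serialVersionUID = 1L;
        private final FakeContext context;

        CapturingTask(FakeContext context) {
            this.context = context;
        }

        boolean handleAvailable() {
            return context.liveHandle != null;
        }
    }

    /** Mirrors the answer: nothing is stored; the context is looked up when the task runs. */
    static class LookupTask implements Serializable {
        private static final long serialVersionUID = 1L;

        boolean handleAvailable() {
            return FakeContext.getOrCreate().liveHandle != null;
        }
    }

    /** Serializes and deserializes an object, roughly like a checkpoint write and a recovery. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CapturingTask captured = roundTrip(new CapturingTask(FakeContext.getOrCreate()));
        LookupTask lookedUp = roundTrip(new LookupTask());
        System.out.println("captured context usable:  " + captured.handleAvailable());  // false
        System.out.println("looked-up context usable: " + lookedUp.handleAvailable());  // true
    }
}
```

In the Spark case the lookup corresponds to calling SQLContext.getOrCreate(rdd.context()) inside the function passed to foreachRDD, rather than passing the SQLContext to the constructor.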
As the title asks: how can I get the taskID or mapperID (something like the partitionID in Spark) in a Hive UDF?
You can access task information using TaskContext:
import org.apache.spark.TaskContext
sc.parallelize(Seq[Int](), 4).mapPartitions(_ => {
val ctx = TaskContext.get
val stageId = ctx.stageId
val partId = ctx.partitionId
val hostname = java.net.InetAddress.getLocalHost().getHostName()
Iterator(s"Stage: $stageId, Partition: $partId, Host: $hostname")}).collect.foreach(println)
Similar functionality was added to PySpark in Spark 2.2.0 (SPARK-18576):
from pyspark import TaskContext
import socket
def task_info(*_):
    # TaskContext.get() returns the context of the task currently running on the executor
    ctx = TaskContext.get()
    return ["Stage: {0}, Partition: {1}, Host: {2}".format(
        ctx.stageId(), ctx.partitionId(), socket.gethostname())]

for x in sc.parallelize([], 4).mapPartitions(task_info).collect():
    print(x)
I think it will give you the task information you are looking for, including the map ID.
I found the answer on my own. We can get the taskID in a Hive UDF as follows:
public class TestUDF extends GenericUDF {
private Text result = new Text();
private String tmpStr = "";
@Override
public void configure(MapredContext context) {
// get the total number of map tasks
int numTasks = context.getJobConf().getNumMapTasks();
// get the current task ID
String taskID = context.getJobConf().get("mapred.task.id");
this.tmpStr = numTasks + "_h_xXx_h_" + taskID;
}
@Override
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException {
return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
}
@Override
public Object evaluate(DeferredObject[] arguments) {
result.set(this.tmpStr);
return this.result;
}
@Override
public String getDisplayString(String[] children) {
return "RowSeq-func()";
}
}
However, this only works with the MapReduce execution engine; it does not work with the Spark SQL engine.
Test code:
add jar hdfs:///home/dp/tmp/shaw/my_udf.jar;
create temporary function seqx AS 'com.udf.TestUDF';
with core as (
select
device_id
from
test_table
where
p_date = '20210309'
and product = 'google'
distribute by
device_id
)
select
seqx() as seqs,
count(1) as cc
from
core
group by
seqx()
order by
seqs asc
Result with the MR engine (we get the number of tasks and the taskID successfully):
Result with the Spark engine using the same SQL (the UDF has no effect, and we get no taskID information):
If you run your HQL on the Spark engine, call the Hive UDF from it, and really need the partitionId in Spark, see the code below:
import org.apache.spark.TaskContext;
public class TestUDF extends GenericUDF {
private Text result = new Text();
private String tmpStr = "";
@Override
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException {
//get spark partitionId
this.tmpStr = TaskContext.getPartitionId() + "-initial-pid";
return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
}
public Object evaluate(DeferredObject[] arguments) {
//get spark partitionId
this.tmpStr = TaskContext.getPartitionId() + "-evaluate-pid";
result.set(this.tmpStr);
return this.result;
}
}
As shown above, you can get the Spark partitionId by calling TaskContext.getPartitionId() in the overridden initialize or evaluate method of the UDF class.
Note: your UDF must take a parameter, as in select my_udf(param), so that it is initialized in multiple tasks. If the UDF takes no parameter, it is initialized on the driver, which has no TaskContext and no partitionId, so you get nothing.
The image below shows the result of running the above UDF on the Spark engine; the partitionIds are retrieved successfully:
I want to use my own tokenizer to tokenize text stored as Dataset<String>, and get Dataset<Seq<String>> (so I can pass it to CountVectorizer).
Expected input (/tmp/fulltext.txt):
t1 t2 t3
t4 t5
Expected output:
[t1, t2, t3]
[t4, t5]
The tokenizer I wrote is below. For now it does basically the same thing as the Tokenizer shipped with Spark, but I will need to rewrite it to support Chinese text, so I cannot use the official Tokenizer:
public class Utils {
public static Seq<String> segment(String text) {
String[] array = text.split(" ");
List<String> tokens = new ArrayList<>();
for (String term : array) {
tokens.add(term.toLowerCase());
}
return JavaConverters
.asScalaIteratorConverter(tokens.iterator())
.asScala()
.toSeq();
}
}
The Spark application I'm trying to write is:
public class TokenizeTest {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Tokenize Test")
.getOrCreate();
Dataset<String> rawText = spark
.read()
.textFile("/tmp/fulltext.txt")
.cache();
Encoder<Seq> listEncoder = Encoders.bean(Seq.class);
// Compilation error
Dataset<Seq<String>> newText = rawText
.map((MapFunction<String, Seq<String>>) s -> Utils.segment(s), listEncoder);
newText.show();
spark.stop();
}
}
I'm a Spark beginner; the code above is just what I thought would work after reading the official guide. But TokenizeTest does not compile at all. Is there a way to fix it?
Using Scala collections like this won't work. For one thing, Seq is not bean-compatible; for another, it is generic.
If you just want to split, use arrays, with segment defined as:
public class Utils {
public static String[] segment(String text) {
return text.split(" ");
}
}
and TokenizeTest defined as:
public class TokenizeTest {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Tokenize Test")
.getOrCreate();
Dataset<String> rawText = spark
.read()
.textFile("/path/to/file")
.cache();
Encoder<String []> listEncoder = spark.implicits().newStringArrayEncoder();
Dataset<String []> newText = rawText
.map((MapFunction<String, String []>) s -> Utils.segment(s), listEncoder);
newText.show();
spark.stop();
}
}
In practice though, you might consider either org.apache.spark.sql.functions.split or org.apache.spark.ml.feature.Tokenizer instead of reinventing the wheel.
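If a custom tokenizer is still needed (for example as a starting point for Chinese text), the array-based segment can keep the lowercasing from the question's version. A minimal plain-Java sketch, assuming whitespace-separated tokens; the class name SegmentSketch is hypothetical, and the whitespace handling is an illustration rather than anything Spark requires:

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper class; only the segment method corresponds to the thread's Utils.segment.
class SegmentSketch {

    /** Splits text on runs of whitespace and lowercases each token. */
    static String[] segment(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // "".split(...) would yield one empty token, so return no tokens instead
            return new String[0];
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .map(token -> token.toLowerCase(Locale.ROOT))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(segment("T1 t2 T3")));  // [t1, t2, t3]
        System.out.println(Arrays.toString(segment("t4 t5")));     // [t4, t5]
    }
}
```

Because the method returns String[], it fits the same MapFunction<String, String[]> and string-array encoder used in TokenizeTest above.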
I am using Spark 2.0.0.
Is there a way to pass parameters from the Spark driver to the executors? I tried the following.
class SparkDriver {
public static void main(String argv[]){
SparkConf conf = new SparkConf().setAppName("test").setMaster("yarn");
SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
Dataset<Row> input = sparkSession.read().load("inputfilepath");
Dataset<Row> modifiedinput = input.mapPartitions(new customMapPartition(5),Encoders.bean(Row.class));
}
class customMapPartition implements MapPartitionsFunction{
private static final long serialVersionUID = -6513655566985939627L;
private static Integer variableThatHastobePassed = null;
public customMapPartition(Integer passedInteger){
customMapPartition.variableThatHastobePassed= passedInteger;
}
@Override
public Iterator<Row> call(Iterator<Row> input) throws Exception {
System.out.println("number that is passed " + variableThatHastobePassed);
return input; // pass the rows through unchanged so the method compiles
}
}
As shown above, I wrote a custom MapPartitionsFunction to pass the parameter, and I access the static variable in its call method. This worked when I ran locally with setMaster("local"), but not on a cluster with setMaster("yarn"): the System.out.println statements printed null.
Is there a way to pass parameters from the driver to the executors?
My bad, I was using
private static Integer variableThatHastobePassed = null;
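The variable should not be declared static. Static fields are not part of an object's serialized form, so a value assigned on the driver never reaches the executors, whereas an instance field is serialized together with the function object.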
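Why the static version prints null on a cluster can be reproduced with plain Java serialization. This is a sketch under assumptions: PassingFunction is a hypothetical stand-in for customMapPartition, and resetting the static field by hand simulates an executor JVM that loads the class fresh.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Plain-Java sketch, not Spark code; PassingFunction is a hypothetical stand-in.
class StaticFieldSketch {

    static class PassingFunction implements Serializable {
        private static final long serialVersionUID = 1L;

        static Integer staticValue;           // belongs to the class; never serialized
        private final Integer instanceValue;  // belongs to the object; serialized with it

        PassingFunction(Integer value) {
            staticValue = value;
            this.instanceValue = value;
        }

        Integer instanceValue() {
            return instanceValue;
        }
    }

    static byte[] serialize(Serializable object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] shipped = serialize(new PassingFunction(5));

        // Simulate a fresh executor JVM, where the class is loaded with its static field unset.
        PassingFunction.staticValue = null;

        PassingFunction onExecutor = (PassingFunction) deserialize(shipped);
        System.out.println("static field:   " + PassingFunction.staticValue);  // null
        System.out.println("instance field: " + onExecutor.instanceValue());   // 5
    }
}
```

In local mode the driver and the tasks share one JVM, so the static field set in the constructor is still visible, which would explain why the problem only showed up on YARN.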
How can I create a DataFrame from a JavaRDD that contains Integers? I tried something like the code below, but it does not work.
List<Integer> input = Arrays.asList(101, 103, 105);
JavaRDD<Integer> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, Integer.class);
I got a ClassCastException saying org.apache.spark.sql.types.IntegerType$ cannot be cast to org.apache.spark.sql.types.StructType.
How can I achieve this?
Apparently (although not intuitively), this createDataFrame overload only works for "bean" types, meaning types that do not correspond to any built-in Spark SQL type.
You can see in the source code that the class you pass is matched with a Spark SQL type in JavaTypeInference.inferDataType, and the result is cast to a StructType (see dataType.asInstanceOf[StructType] in SQLContext.getSchema). But the built-in "primitive" types (like IntegerType) are NOT StructTypes. This looks like a bug or undocumented behavior to me.
WORKAROUNDS:
Wrap your Integers with a "bean" class (that's ugly, I know):
public static class MyBean {
final int value;
MyBean(int value) {
this.value = value;
}
public int getValue() {
return value;
}
}
List<MyBean> input = Arrays.asList(new MyBean(101), new MyBean(103), new MyBean(105));
JavaRDD<MyBean> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, MyBean.class);
dataframe.show(); // this works...
Convert to RDD<Row> yourself:
// convert to Rows:
JavaRDD<Row> rowRdd = inputRDD.map(new Function<Integer, Row>() {
@Override
public Row call(Integer v1) throws Exception {
return RowFactory.create(v1);
}
});
// create schema (this looks nicer in Scala...):
StructType schema = new StructType(new StructField[]{new StructField("number", IntegerType$.MODULE$, false, Metadata.empty())});
DataFrame dataframe = sqlcontext.createDataFrame(rowRdd, schema);
dataframe.show(); // this works...
Now in Spark 2.2 you can do the following to create a Dataset.
Dataset<Integer> dataSet = sqlContext().createDataset(javardd.rdd(), Encoders.INT());
The Spark standalone cluster looks like it is running without a problem:
http://i.stack.imgur.com/gF1fN.png
I followed this tutorial.
I have built a fat jar for running this JavaApp on the cluster. Before maven package:
find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
The content of SimpleApp.java is:
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
public class SimpleApp {
public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setMaster("spark://10.35.23.13:7077")
.setAppName("My app")
.set("spark.executor.memory", "1g");
JavaSparkContext sc = new JavaSparkContext (conf);
String logFile = "/home/ubuntu/spark-0.9.1/test_data";
JavaRDD<String> logData = sc.textFile(logFile).cache();
long numAs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("a"); }
}).count();
System.out.println("Lines with a: " + numAs);
}
}
This program only works when the master is set with setMaster("local"). Otherwise I get this error:
$java -cp path_to_file/simple-project-1.0-allinone.jar SimpleApp
http://i.stack.imgur.com/doRSn.png
SimpleApp.java contains an anonymous class (one that extends Function). It is compiled to SimpleApp$1, which must be shipped to each worker in the Spark cluster.
The simplest way to do this is to add the jar explicitly to the Spark context. Add something like sparkContext.addJar("path_to_file/simple-project-1.0-allinone.jar") after creating the JavaSparkContext, and rebuild your jar file. The main Spark program (called the driver program) will then automatically deliver your application code to the cluster.
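The naming of the compiled anonymous class can be checked with a small plain-Java sketch. It uses java.util.function.Predicate in place of Spark's Function, and AnonymousClassSketch is a hypothetical class name, so this only illustrates how javac names anonymous classes, not the Spark API itself.

```java
import java.util.function.Predicate;

// Plain-Java sketch; Predicate stands in for Spark's Function, and the class name is hypothetical.
class AnonymousClassSketch {

    /** Returns a filter written as an anonymous class, like the one in SimpleApp. */
    static Predicate<String> containsA() {
        return new Predicate<String>() {
            @Override
            public boolean test(String s) {
                return s.contains("a");
            }
        };
    }

    public static void main(String[] args) {
        // The anonymous class gets its own name and its own .class file,
        // e.g. AnonymousClassSketch$1 when compiled in the default package.
        System.out.println(containsA().getClass().getName());
    }
}
```

That separate class file is what a worker has to be able to load when it deserializes the filter, which is why the application jar needs to reach the workers.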