How to map String to Seq<String> in Spark in Java

I want to use my own tokenizer to tokenize text stored as Dataset<String>, and get Dataset<Seq<String>> (so I can pass it to CountVectorizer).
Expected input (/tmp/fulltext.txt):
t1 t2 t3
t4 t5
Expected output:
[t1, t2, t3]
[t4, t5]
The tokenizer I wrote is below (for now it does essentially the same thing as the Tokenizer shipped with Spark, but I will need to extend it to support tokenizing Chinese text, so I cannot use the official Tokenizer):
import java.util.ArrayList;
import java.util.List;
import scala.collection.JavaConverters;
import scala.collection.Seq;
public class Utils {
public static Seq<String> segment(String text) {
String[] array = text.split(" ");
List<String> tokens = new ArrayList<>();
for (String term : array) {
tokens.add(term.toLowerCase());
}
return JavaConverters
.asScalaIteratorConverter(tokens.iterator())
.asScala()
.toSeq();
}
}
The Spark application I'm trying to make is
public class TokenizeTest {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Tokenize Test")
.getOrCreate();
Dataset<String> rawText = spark
.read()
.textFile("/tmp/fulltext.txt")
.cache();
Encoder<Seq> listEncoder = Encoders.bean(Seq.class);
// Compilation error
Dataset<Seq<String>> newText = rawText
.map((MapFunction<String, Seq<String>>) s -> Utils.segment(s), listEncoder);
newText.show();
spark.stop();
}
}
I'm a beginner with Spark; the above code is just what I think should work after reading the official guide, but TokenizeTest doesn't compile at all. Is there a way to fix it?

Using Scala collections like this won't work. For one, Seq is not bean-compatible; for another, it is generic.
If you just want to split, use plain arrays, with segment defined as:
public class Utils {
public static String[] segment(String text) {
return text.split(" ");
}
}
and TokenizeTest defined as:
public class TokenizeTest {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Tokenize Test")
.getOrCreate();
Dataset<String> rawText = spark
.read()
.textFile("/path/to/file")
.cache();
Encoder<String []> listEncoder = spark.implicits().newStringArrayEncoder();
Dataset<String []> newText = rawText
.map((MapFunction<String, String []>) s -> Utils.segment(s), listEncoder);
newText.show();
spark.stop();
}
}
In practice, though, you might consider org.apache.spark.sql.functions.split or org.apache.spark.ml.feature.Tokenizer instead of reinventing the wheel.
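For reference, here is a minimal sketch of the functions.split approach, assuming the same SparkSession (spark) and input file as above; the column names text and tokens are just illustrative choices:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

// split each line on spaces; the result is a single array<string> column
Dataset<Row> tokens = spark.read()
    .textFile("/tmp/fulltext.txt")
    .toDF("text")
    .select(split(col("text"), " ").alias("tokens"));
tokens.show(false);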

Related

Transform PCollection<KV> to custom class

My goal is to read a file from GCS and write it to Cassandra.
New to Apache Beam/Dataflow, I could find most of the hand on build with Python. Unfortunately CassandraIO is only Java native with Beam.
I used the word count example as a template and try to get rid of the TextIO.write() and replace it with a CassandraIO.<Words>write().
Here my java class for the Cassandra table
package org.apache.beam.examples;
import java.io.Serializable;
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;
@Table(keyspace = "test", name = "words", readConsistency = "ONE", writeConsistency = "QUORUM",
caseSensitiveKeyspace = false, caseSensitiveTable = false)
public class Words implements Serializable {
// private static final long serialVersionUID = 1L;
@PartitionKey
@Column(name = "word")
public String word;
@Column(name = "count")
public long count;
public Words() {
}
public Words(String word, int count) {
this.word = word;
this.count = count;
}
@Override
public boolean equals(Object obj) {
Words other = (Words) obj;
return this.word.equals(other.word) && this.count == other.count;
}
}
And here the pipeline part of the main code.
static void runWordCount(WordCount.WordCountOptions options) {
Pipeline p = Pipeline.create(options);
// Concepts #2 and #3: Our pipeline applies the composite CountWords transform, and passes the
// static FormatAsTextFn() to the ParDo transform.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply(new WordCountToCassandra.CountWords())
// Here I'm not sure how to transform PCollection<KV> into PCollection<Words>
.apply(MapElements.into(TypeDescriptor.of(Words.class)).via(PCollection<KV<String, Long>>)
}))
.apply(CassandraIO.<Words>write()
.withHosts(Collections.singletonList("my_ip"))
.withPort(9142)
.withKeyspace("test")
.withEntity(Words.class));
p.run().waitUntilFinish();
}
My understanding is that I need a PTransform to go from PCollection<T1> to PCollection<T2>, but I don't know how to write that mapping.
If it's 1:1 mapping, MapElements.into is the right choice.
You can either specify a class that implements SerializableFunction<FromType, ToType>, or simply use a lambda, for example:
.apply(MapElements.into(TypeDescriptor.of(Words.class)).via((KV<String, Long> kv) -> new Words(kv.getKey(), kv.getValue().intValue())));
Please check MapElements for more information.
If the transformation is not one-to-one, there are other available options such as FlatMapElements or ParDo.
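For completeness, here is a rough ParDo-based sketch of the same 1:1 mapping, assuming the Words class shown above; treat it as an illustration rather than the canonical form:
// needs imports: org.apache.beam.sdk.transforms.DoFn, org.apache.beam.sdk.transforms.ParDo, org.apache.beam.sdk.values.KV
.apply(ParDo.of(new DoFn<KV<String, Long>, Words>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        KV<String, Long> kv = c.element();
        // convert each (word, count) pair into a Words entity for CassandraIO
        c.output(new Words(kv.getKey(), kv.getValue().intValue()));
    }
}))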

How to get the taskID or mapperID(something like partitionID in Spark) in a hive UDF?

As the title says: how can I get the task ID or mapper ID (something like the partition ID in Spark) in a Hive UDF?
You can access task information using TaskContext:
import org.apache.spark.TaskContext
sc.parallelize(Seq[Int](), 4).mapPartitions(_ => {
val ctx = TaskContext.get
val stageId = ctx.stageId
val partId = ctx.partitionId
val hostname = java.net.InetAddress.getLocalHost().getHostName()
Iterator(s"Stage: $stageId, Partition: $partId, Host: $hostname")}).collect.foreach(println)
A similar functionality has been added to PySpark in Spark 2.2.0 (SPARK-18576):
from pyspark import TaskContext
import socket
def task_info(*_):
    ctx = TaskContext()
    return ["Stage: {0}, Partition: {1}, Host: {2}".format(
        ctx.stageId(), ctx.partitionId(), socket.gethostname())]

for x in sc.parallelize([], 4).mapPartitions(task_info).collect():
    print(x)
This should give you the task information you are looking for, including the map ID.
I found the answer on my own: we can get the task ID in a Hive UDF as below:
public class TestUDF extends GenericUDF {
private Text result = new Text();
private String tmpStr = "";
@Override
public void configure(MapredContext context) {
// get the total number of tasks
int numTasks = context.getJobConf().getNumMapTasks();
// get the current task ID
String taskID = context.getJobConf().get("mapred.task.id");
this.tmpStr = numTasks + "_h_xXx_h_" + taskID;
}
@Override
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException {
return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
}
@Override
public Object evaluate(DeferredObject[] arguments) {
result.set(this.tmpStr);
return this.result;
}
@Override
public String getDisplayString(String[] children) {
return "RowSeq-func()";
}
}
However, this only works with the MapReduce execution engine; it does not work when the query runs on the Spark SQL engine.
Test code as below:
add jar hdfs:///home/dp/tmp/shaw/my_udf.jar;
create temporary function seqx AS 'com.udf.TestUDF';
with core as (
select
device_id
from
test_table
where
p_date = '20210309'
and product = 'google'
distribute by
device_id
)
select
seqx() as seqs,
count(1) as cc
from
core
group by
seqx()
order by
seqs asc
Result with the MR engine (screenshot omitted): the task count and task ID are returned successfully.
Result with the Spark engine for the same SQL (screenshot omitted): the UDF does not receive the task information, so we get nothing about the task ID.
If you run your HQL on the Spark engine, call the Hive UDF there, and really need the partition ID in Spark, see the code below:
import org.apache.spark.TaskContext;
public class TestUDF extends GenericUDF {
private Text result = new Text();
private String tmpStr = "";
@Override
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException {
//get spark partitionId
this.tmpStr = TaskContext.getPartitionId() + "-initial-pid";
return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
}
public Object evaluate(DeferredObject[] arguments) {
//get spark partitionId
this.tmpStr = TaskContext.getPartitionId() + "-evaluate-pid";
result.set(this.tmpStr);
return this.result;
}
}
As shown above, you can get the Spark partition ID by calling TaskContext.getPartitionId() in the overridden initialize or evaluate method of the UDF class.
Note: your UDF must take a parameter, e.g. select my_udf(param); this forces the UDF to be initialized inside the tasks. If the UDF has no parameter, it is initialized on the driver, and the driver has no TaskContext or partition ID, so you would get nothing.
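If you want to detect driver-side initialization explicitly, here is a small hedged sketch using TaskContext.get(), which returns null outside of a task:
import org.apache.spark.TaskContext;

// inside initialize() or evaluate(): fall back to a placeholder when running on the driver
TaskContext ctx = TaskContext.get();
String pid = (ctx != null) ? ctx.partitionId() + "-pid" : "driver-no-partition";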
The screenshot of the result produced by the above UDF on the Spark engine is omitted here; it showed the partition IDs being returned successfully.

Pass parameters from driver to executors in spark

I am using spark 2.0.0.
Is there a way to pass parameters from spark driver to executors? I tried the following.
class SparkDriver {
public static void main(String argv[]){
SparkConf conf = new SparkConf().setAppName("test").setMaster("yarn");
SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
Dataset<Row> input = sparkSession.read().load("inputfilepath");
Dataset<Row> modifiedinput = input.mapPartitions(new customMapPartition(5),Encoders.bean(Row.class));
}
class customMapPartition implements MapPartitionsFunction{
private static final long serialVersionUID = -6513655566985939627L;
private static Integer variableThatHastobePassed = null;
public customMapPartition(Integer passedInteger){
customMapPartition.variableThatHastobePassed= passedInteger;
}
@Override
public Iterator<Row> call(Iterator<Row> input) throws Exception {
System.out.println("number that is passed " + variableThatHastobePassed);
}
}
As mentioned above, I wrote a custom MapPartitionsFunction to pass the parameter, and I access the static variable in its call method. This worked when I ran locally with .setMaster("local"), but not on a cluster with .setMaster("yarn") (the System.out.println statements printed null).
Is there a way to pass parameters from driver to executors.
My bad, I was using
private static Integer variableThatHastobePassed = null;
The variable should not be declared static.
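A minimal sketch of the corrected function, assuming the same Row-to-Row pass-through as in the question; the key point is that the value lives in a non-static instance field, so it is serialized with the function and shipped to the executors:
import java.util.Iterator;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Row;

class CustomMapPartition implements MapPartitionsFunction<Row, Row> {
    private static final long serialVersionUID = 1L;
    private final Integer variableThatHasToBePassed;

    CustomMapPartition(Integer passedInteger) {
        this.variableThatHasToBePassed = passedInteger;
    }

    @Override
    public Iterator<Row> call(Iterator<Row> input) {
        // the field was captured on the driver and travels with the serialized function
        System.out.println("number that is passed " + variableThatHasToBePassed);
        return input;
    }
}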

checkpoint SqlContext nullpointerException issue

I am using checkpointing in my application, and when my application restarts after a failure, I get a NullPointerException on SQLContext.
I assume the application is not able to recover the SQLContext because of serialization/deserialization issues. Is SQLContext not serializable?
Here is my code below
//DriverClass
final JavaSparkContext javaSparkCtx = new JavaSparkContext(conf);
final SQLContext sqlContext = new SQLContext(javaSparkCtx);
JavaStreamingContextFactory javaStreamingContextFactory = new JavaStreamingContextFactory() {
@Override
public JavaStreamingContext create() { //only first time executed
// TODO Auto-generated method stub
JavaStreamingContext jssc = new JavaStreamingContext(javaSparkCtx, Durations.minutes(1));
jssc.checkpoint(CHECKPOINT_DIRECTORY);
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list",
"abc.xyz.localdomain:6667");
//....
JavaDStream<String> fullMsg = messages
.map(new MapFunction());
fullMsg.foreachRDD(new SomeClass(sqlContext));
return jssc;
}
};
}
//Closure Class
public class SomeClass implements Serializable, Function<JavaRDD<String>, Void> {
SQLContext sqlContext;
public SomeClass(SQLContext sqlContext) {
// TODO Auto-generated constructor stub
this.sqlContext = sqlContext;
}
public void doSomething() {
this.sqlContext.createDataFrame(); // here is the NullPointerException
}
//.......
}
SQLContext is Serializable because Spark SQL needs to use SQLContext on the executor side internally. However, you should not serialize it into the Streaming checkpoint. Instead, get it from the RDD, like this: SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
See Streaming docs for more details: http://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#dataframe-and-sql-operations
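For illustration, a rough sketch of that pattern applied to the foreachRDD call from the question (1.x streaming API assumed; building the actual DataFrame is left out):
fullMsg.foreachRDD(rdd -> {
    // recover the SQLContext from the RDD's SparkContext instead of closing over it
    SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
    // ... build and use DataFrames with sqlContext here ...
});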

How to create a Spark DataFrame from an Integer RDD

How can I create a DataFrame from a JavaRDD containing Integers? I have done something like the following, but it is not working.
List<Integer> input = Arrays.asList(101, 103, 105);
JavaRDD<Integer> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, Integer.class);
I got a ClassCastException saying org.apache.spark.sql.types.IntegerType$ cannot be cast to org.apache.spark.sql.types.StructType.
How can I achieve this?
Apparently (although not intuitively), this createDataFrame overload can only work for "Bean" types, which means types that do not correspond to any built-in Spark SQL type.
You can see in the source code that the class you pass is matched to a Spark SQL type in JavaTypeInference.inferDataType, and the result is cast to a StructType (see dataType.asInstanceOf[StructType] in SQLContext.getSchema). But the built-in "primitive" types (like IntegerType) are not StructTypes, so this looks like a bug or undocumented behavior to me.
WORKAROUNDS:
Wrap your Integers with a "bean" class (that's ugly, I know):
public static class MyBean {
final int value;
MyBean(int value) {
this.value = value;
}
public int getValue() {
return value;
}
}
List<MyBean> input = Arrays.asList(new MyBean(101), new MyBean(103), new MyBean(105));
JavaRDD<MyBean> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, MyBean.class);
dataframe.show(); // this works...
Convert to RDD<Row> yourself:
// convert to Rows:
JavaRDD<Row> rowRdd = inputRDD.map(new Function<Integer, Row>() {
@Override
public Row call(Integer v1) throws Exception {
return RowFactory.create(v1);
}
});
// create schema (this looks nicer in Scala...):
StructType schema = new StructType(new StructField[]{new StructField("number", IntegerType$.MODULE$, false, Metadata.empty())});
DataFrame dataframe = sqlcontext.createDataFrame(rowRdd, schema);
dataframe.show(); // this works...
Now in Spark 2.2 you can do the following to create a Dataset.
Dataset<Integer> dataSet = sqlContext.createDataset(javardd.rdd(), Encoders.INT());
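Alternatively, a minimal sketch using SparkSession (assumed here to be named spark) and a local list, if you are on Spark 2.x:
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// create the Dataset directly from a Java list of integers
Dataset<Integer> ds = spark.createDataset(Arrays.asList(101, 103, 105), Encoders.INT());
ds.show();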
