Combining Two Spark Streams On Key - apache-spark

I have two Kafka streams that contain results from two parallel operations. I need a way to combine both streams so I can process the results in a single Spark transform. Is this possible? (Illustration below.)
Stream 1 {id:1,result1:True}
Stream 2 {id:1,result2:False}
JOIN(Stream 1, Stream 2, On "id") -> Output Stream {id:1,result1:True,result2:False}
Current code that isn't working:
kvs1 = KafkaUtils.createStream(sparkstreamingcontext, ZOOKEEPER, NAME+"_stream", {"test_join_1": 1})
kvs2 = KafkaUtils.createStream(sparkstreamingcontext, ZOOKEEPER, NAME+"_stream", {"test_join_2": 1})
messages_RDDstream1 = kvs1.map(lambda x: x[1])
messages_RDDstream2 = kvs2.map(lambda x: x[1])
messages_RDDstream_Final = messages_RDDstream1.join(messages_RDDstream2)
When I pass two sample JSONs to each Kafka queue with the same ID field, nothing is returned in my final RDD stream. I imagine I am missing the stage of converting my Kafka JSON string message into a tuple?
I have also tried the following:
kvs1.map(lambda (key, value): json.loads(value))
and
kvs1.map(lambda x: json.loads(x))
To no avail
Cheers
Adam

A simple lookup in Spark's documentation would have given you the answer.
You can use the join operation.
join(otherStream, [numTasks]) :
When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
For example: val streamJoined = stream1.join(stream2)
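In this case the streams also need to be keyed before the join, since the Kafka messages arrive as plain JSON strings. Below is a minimal Scala sketch of that missing step; the regex-based extractId helper is an assumption for illustration only, and a real job would use a proper JSON library such as Jackson or json4s.
import org.apache.spark.streaming.dstream.DStream
// Pull the numeric "id" field out of the raw JSON payload (illustrative only).
def extractId(json: String): String =
  """"id"\s*:\s*(\d+)""".r.findFirstMatchIn(json).map(_.group(1)).getOrElse("")
// Key both message streams by id, then join them batch by batch.
def joinById(stream1: DStream[String], stream2: DStream[String]): DStream[(String, (String, String))] = {
  val keyed1 = stream1.map(msg => (extractId(msg), msg))
  val keyed2 = stream2.map(msg => (extractId(msg), msg))
  keyed1.join(keyed2) // yields (id, (json1, json2)) for ids present in both batches
}
Note that a DStream join only matches keys that appear in the same batch interval; if the two results can arrive in different batches, you need a windowed or stateful operation on top of this.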

What you need can be done using the join() method of key-value pair DStreams:
// Test data
val input1 = List((1, true), (2, false), (3, false), (4, true), (5, false))
val input2 = List((1, false), (2, false), (3, true), (4, true), (5, true))
val input1RDD = sc.parallelize(input1)
val input2RDD = sc.parallelize(input2)
import org.apache.spark.streaming.{Seconds, StreamingContext}
val streamingContext = new StreamingContext(sc, Seconds(3))
// Creates a DStream from the test data
import scala.collection.mutable
val input1DStream = streamingContext.queueStream[(Int, Boolean)](mutable.Queue(input1RDD))
val input2DStream = streamingContext.queueStream[(Int, Boolean)](mutable.Queue(input2RDD))
// Join the two streams together by merging them into a single dstream
val joinedDStream = input1DStream.join(input2DStream)
// Print the result
joinedDStream.print()
// Start the context, time out after one batch, and then stop it
streamingContext.start()
streamingContext.awaitTerminationOrTimeout(5000)
streamingContext.stop()
Results in:
-------------------------------------------
Time: 1468313607000 ms
-------------------------------------------
(4,(true,true))
(2,(false,false))
(1,(true,false))
(3,(false,true))
(5,(false,true))

I have joined two queueStreams using Spark with Java. Please have a look at the code below.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import com.google.common.collect.Queues;
import scala.Tuple2;
public class SparkQueueStreamJoin {
public static void main(String[] args) throws InterruptedException {
// Test data
List<Pair<Integer, Boolean>> input1 = Arrays.asList(Pair.of(1,true), Pair.of(2,false), Pair.of(3,false), Pair.of(4,true), Pair.of(5,false));
List<Pair<Integer, Boolean>> input2 = Arrays.asList(Pair.of(1,false), Pair.of(2,false), Pair.of(3,true), Pair.of(4,true), Pair.of(5,true));
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkQueueStreamJoin ")
.set("spark.testing.memory", "2147480000");
//System.setProperty("hadoop.home.dir", "C:/H`enter code here`adoop/hadoop-2.7.1");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Pair<Integer, Boolean>> input1RDD = sc.parallelize(input1);
JavaRDD<Pair<Integer, Boolean>> input2RDD = sc.parallelize(input2);
JavaStreamingContext streamingContext = new JavaStreamingContext(sc, Durations.seconds(3));
Queue<JavaRDD<Pair<Integer, Boolean>>> queue1RDD = Queues.newLinkedBlockingQueue();
queue1RDD.add(input1RDD);
Queue<JavaRDD<Pair<Integer, Boolean>>> queue2RDD = Queues.newLinkedBlockingQueue();
queue2RDD.add(input2RDD);
// Creates a DStream from the test data
JavaInputDStream<Pair<Integer, Boolean>> input1DStream = streamingContext.queueStream(queue1RDD, false);
JavaInputDStream<Pair<Integer, Boolean>> input2DStream = streamingContext.queueStream(queue2RDD, false);
JavaPairDStream<Integer, Boolean> pair1DStream = input1DStream.mapToPair(new PairFunction<Pair<Integer, Boolean>, Integer, Boolean>() {
@Override
public Tuple2<Integer, Boolean> call(Pair<Integer, Boolean> rawEvent) throws Exception {
return new Tuple2<>(rawEvent.getKey(), rawEvent.getValue());
}
});
JavaPairDStream<Integer, Boolean> pair2DStream = input2DStream.mapToPair(new PairFunction<Pair<Integer, Boolean>, Integer, Boolean>() {
@Override
public Tuple2<Integer, Boolean> call(Pair<Integer, Boolean> rawEvent) throws Exception {
return new Tuple2<>(rawEvent.getKey(), rawEvent.getValue());
}
});
// Union two streams together by merging them into a single dstream
//JavaDStream<Pair<Integer, Boolean>> joinedDStream = input1DStream.union(input2DStream);
// Join the two streams together by merging them into a single dstream
JavaPairDStream<Integer, Tuple2<Boolean, Boolean>> joinedDStream = pair1DStream.join(pair2DStream);
// Print the result
joinedDStream.print();
// Start the context, time out after one batch, and then stop it
streamingContext.start();
streamingContext.awaitTerminationOrTimeout(5000);
streamingContext.stop();
}
}
Output:
-------------------------------------------
Time: 1511444352000 ms
-------------------------------------------
(1,(true,false))
(2,(false,false))
(3,(false,true))
(4,(true,true))
(5,(false,true))

Related

Is it possible to write a dataframe into 2 files of different types?

We can use the following API to write a dataframe to local files.
df.write.parquet(path)
df.write.json(path)
However, can I write to both Parquet and JSON in one go without computing the dataframe twice?
By the way, I don't want to cache the data in memory, because it's too big.
If you don't cache/persist the dataframe, then it will need to be recomputed for each output format.
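For reference, a minimal sketch of the conventional approach, persisting to disk rather than memory so that each write does not recompute the dataframe (parquetPath and jsonPath are placeholders):
import org.apache.spark.storage.StorageLevel
// Persist to disk only, so the data is not kept in memory but is still
// materialized once and reused by both writes.
df.persist(StorageLevel.DISK_ONLY)
df.write.parquet(parquetPath)
df.write.json(jsonPath)
df.unpersist()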
Alternatively, we can implement an org.apache.spark.sql.execution.datasources.FileFormat to write both formats in a single pass.
DuplicateOutFormat demo
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType
/**
* Very Dangerous Toy Code. DO NOT USE IN PRODUCTION.
*/
class DuplicateOutFormat
extends FileFormat
with DataSourceRegister
with Serializable {
override def inferSchema(sparkSession: SparkSession, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] = {
throw new UnsupportedOperationException()
}
override def prepareWrite(sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory = {
val format1 = options("format1")
val format2 = options("format2")
val format1Instance = DataSource.lookupDataSource(format1, sparkSession.sessionState.conf)
.newInstance().asInstanceOf[FileFormat]
val format2Instance = DataSource.lookupDataSource(format2, sparkSession.sessionState.conf)
.newInstance().asInstanceOf[FileFormat]
val writerFactory1 = format1Instance.prepareWrite(sparkSession, job, options, dataSchema)
val writerFactory2 = format2Instance.prepareWrite(sparkSession, job, options, dataSchema)
new OutputWriterFactory {
override def getFileExtension(context: TaskAttemptContext): String = ".dup"
override def newInstance(path: String, dataSchema: StructType, context: TaskAttemptContext): OutputWriter = {
val path1 = path.replace(".dup", writerFactory1.getFileExtension(context))
val path2 = path.replace(".dup", writerFactory2.getFileExtension(context))
val writer1 = writerFactory1.newInstance(path1, dataSchema, context)
val writer2 = writerFactory2.newInstance(path2, dataSchema, context)
new OutputWriter {
override def write(row: InternalRow): Unit = {
writer1.write(row)
writer2.write(row)
}
override def close(): Unit = {
writer1.close()
writer2.close()
}
}
}
}
}
override def shortName(): String = "dup"
}
SPI
We should create an SPI file /META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the content:
com.github.sparkdemo.DuplicateOutFormat.
demo usage
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
class DuplicateOutFormatTest extends FunSuite {
val spark = SparkSession.builder()
.master("local")
.getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
test("testDuplicateWrite") {
val data = Array(
("k1", "fa", "20210901", 16),
("k2", null, "20210902", 15),
("k3", "df", "20210903", 14),
("k4", null, "20210904", 13)
)
val tempDir = System.getProperty("java.io.tmpdir") + "spark-dup-test" + System.nanoTime()
val df = sc.parallelize(data).toDF("k", "col2", "day", "col4")
df.write
.option("format1", "csv")
.option("format2", "orc")
.format("dup").save(tempDir)
df.show(1000, false)
}
}
WARNING
Spark SQL couples some things in DataFrameWriter#saveToV1Source and other source code that we cannot change, so this custom DuplicateOutFormat is just a demo and lacks tests. The full demo is on GitHub.

Trying to send records (string) using foreachBatch and foreach to AWS SNS

I am trying to send records from a Kinesis stream to an AWS SNS topic, but nothing seems to happen and I do not receive the messages on the topic. Below is my code. Please advise what I am doing incorrectly here. I have tried both foreach and foreachBatch; please refer to the commented section in the code for foreachBatch. To test it out I have even tried hardcoding a string to see if foreach works, but no luck. Help please!
//Stream reader (Consumer)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import com.amazonaws.services.kinesis.model.PutRecordRequest
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import java.nio.ByteBuffer
import scala.util.Random
import java.util.Base64
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
import com.amazonaws.services.sns.{AmazonSNS, AmazonSNSClientBuilder}
import com.amazonaws.services._
import org.apache.spark.sql._
import com.amazonaws.services.sns.model._
val dataSchema = new StructType()
.add("serialnumber", StringType)
.add("purposeId", StringType)
.add("action", StringType)
.add("locale", StringType)
.add("datetime", TimestampType)
.add("ingestiontime", LongType)
val kinesisStreamName = "mystream"
val kinesisRegion = "us-west-2"
val kinesisDF = spark.readStream
.format("kinesis")
.option("streamName", kinesisStreamName)
.option("region", kinesisRegion)
.option("initialPosition", "latest") //"TRIM_HORIZON")
.load()
val recordsStreamDf = kinesisDF
.selectExpr("cast (data as STRING) jsonData")
.select(from_json($"jsonData", dataSchema)/*.cast("struct<installmentNo: integer, loanId: integer>")*/.as("consents"))
.select("consents.*")
display(recordsStreamDf)
val query =
recordsStreamDf
.writeStream.foreach(
new ForeachWriter[Row] {
def open(partitionId: Long, version: Long): Boolean = {
true
}
def process(record: Row): Unit = {
SNSPublisher("test").publishMessage()
}
def close(errorOrNull: Throwable): Unit = {
true
}
}
)
/*.foreachBatch{ (batchDF: DataFrame, batchId: Long) =>
val data = batchDF.withColumn("jsonData", to_json(struct($"serialnumber", $"purposeId", $"action", $"locale", $"datetime", $"ingestiontime"))).select($"jsonData").collect
for (i <- data) {
try {
SNSPublisher(i.getString(0)).publishMessage()
println("Metrics posted to SNS")
} catch {
case e: Exception =>
println("Exception posting message to SNS: " + e)
throw e
}
}
batchDF.foreachPartition {
partitionData =>
partitionData.foreach(row => {
SNSPublisher(row.getString(0)).publishMessage()
})
}
}*/
.outputMode("append")
.format("json")
.queryName("count")
.option("path", "s3://path/dev/user_personal/abhishek.ghosh/") // counts = name of the in-memory table
.option("checkpointLocation", "s3://path/dev/user_personal/abhishek.ghosh/checkpointLocation/")
//.outputMode("complete") // complete = all the counts should be in the table
.start()
//SNS Publisher class for notification of metrics
class SNSPublisher(message: String) {
implicit val sns: SNSPublisher = this
implicit val snsClient: AmazonSNS = AmazonSNSClientBuilder.standard.build
val this.message=message
var msgId=""
def publishMessage() {
msgId = snsClient.publish(new PublishRequest().withTopicArn("my_arn").withMessage(message)).getMessageId
}
}
object SNSPublisher {
def apply(message: String): SNSPublisher = {
new SNSPublisher(message)
}
}

How to get the entity from aggregateMessages instead of just the vertexId

package com.mypackage
import org.apache.spark.graphx._
import org.apache.spark.{SparkContext, SparkConf}
/**
* Created by sidazhang on 11/8/16.
*/
case class Person(age: Int)
case class EdgeImpl()
object GraphApp {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("SparkMain").setMaster("local[1]")
val sc = new SparkContext(conf)
val vertices =
sc.parallelize(Array((1L, Person(10)), (2L, Person(15)),
(3L, Person(20)), (4L, Person(30))))
// Create an RDD for edges
val relationships =
sc.parallelize(Array(Edge(2L, 1L, EdgeImpl()),
Edge(3L, 1L, EdgeImpl()), Edge(4L, 1L, EdgeImpl())))
val graph = Graph(vertices, relationships)
// Compute the number of older followers and their total age
val olderFollowers: VertexRDD[Array[Person]] = graph.aggregateMessages[Array[Person]](
ctx => ctx.sendToDst(Array(ctx.srcAttr)),
// Merge the array of followers
(a, b) => a ++ b
)
// Here I only have the id of the person and a list of his followers.
// How do I get the vertex of the person
olderFollowers.collect.foreach { case (id, followers) => followers.foreach(println(id, _)) }
}
}
The question is that through the aggregateMessages API, I end up with only the vertexId. How do I get the actual vertex?
(The question is inline)
You have to join it back with the original data:
graph.joinVertices(olderFollowers)(someMergingFunction).vertices
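For instance, here is a minimal sketch, reusing graph and olderFollowers from the question above and assuming the goal is to pair each destination vertex's Person attribute with its collected followers, using VertexRDD.innerJoin:
// Join the aggregation result back against the vertex attributes.
val followersWithPerson: VertexRDD[(Person, Array[Person])] =
  graph.vertices.innerJoin(olderFollowers) { (id, person, followers) => (person, followers) }
followersWithPerson.collect.foreach { case (id, (person, followers)) =>
  followers.foreach(f => println(s"vertex $id (age ${person.age}) has a follower of age ${f.age}"))
}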

Trying to apply GBT on a set of data, getting ClassCastException

I am getting "Exception in thread "main" java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute".
Source code
package com.spark.lograthmicregression;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.GBTClassificationModel;
import org.apache.spark.ml.classification.GBTClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.catalyst.expressions.AttributeReference;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import com.google.common.collect.ImmutableMap;
import scala.collection.mutable.Seq;
public class ClickThroughRateAnalytics {
private static SimpleDateFormat sdf = new SimpleDateFormat("yyMMddHH");
public static void main(String[] args) {
final SparkConf sparkConf = new SparkConf().setAppName("Click Analysis").setMaster("local");
try (JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf)) {
SQLContext sqlContext = new SQLContext(javaSparkContext);
DataFrame dataFrame = sqlContext.read().format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true")
.load("/splits/sub-suaa");
// This will keep data in memory
dataFrame.cache();
// This will describe the column
// dataFrame.describe("hour").show();
System.out.println("Rows before removing missing data : " + dataFrame.count());
// This will describe column details
// dataFrame.describe("click", "hour", "site_domain").show();
// This will calculate variance between columns +ve one increases
// second increases and -ve means one increases other decreases
// double cov = dataFrame.stat().cov("click", "hour");
// System.out.println("cov : " + cov);
// It provides quantitative measurements of the statistical
// dependence between two random variables
// double corr = dataFrame.stat().corr("click", "hour");
// System.out.println("corr : " + corr);
// Cross Tabulation provides a table of the frequency distribution
// for a set of variables
// dataFrame.stat().crosstab("site_id", "site_domain").show();
// For frequent items
// System.out.println("Frequest Items : " +
// dataFrame.stat().freqItems(new String[] { "site_id",
// "site_domain" }, 0.3).collectAsList());
// TODO we can also set maximum occurring item to categorical
// values.
// This will replace null values with average for numeric columns
dataFrame = modifiyDatFrame(dataFrame);
// Removing rows which have some missing values
dataFrame = dataFrame.na().replace(dataFrame.columns(), ImmutableMap.of("", "NA"));
dataFrame.na().fill(0.0);
dataFrame = dataFrame.na().drop();
System.out.println("Rows after removing missing data : " + dataFrame.count());
// TODO Binning and bucketing
// normalizer will take the column created by the VectorAssembler,
// normalize it and produce a new column
// Normalizer normalizer = new
// Normalizer().setInputCol("features_index").setOutputCol("features");
dataFrame = dataFrame.drop("app_category_index").drop("app_domain_index").drop("hour_index").drop("C20_index")
.drop("device_connection_type_index").drop("C1_index").drop("id").drop("device_ip_index").drop("banner_pos_index");
DataFrame[] splits = dataFrame.randomSplit(new double[] { 0.7, 0.3 });
DataFrame trainingData = splits[0];
DataFrame testData = splits[1];
StringIndexerModel labelIndexer = new StringIndexer().setInputCol("click").setOutputCol("indexedclick").fit(dataFrame);
// Here we will be sending all columns which will participate in
// prediction
VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(findPredictionColumns("click", dataFrame))
.setOutputCol("features_index");
GBTClassifier gbt = new GBTClassifier().setLabelCol("indexedclick").setFeaturesCol("features_index").setMaxIter(10).setMaxBins(69000);
IndexToString labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel");
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { labelIndexer, vectorAssembler, gbt, labelConverter });
trainingData.show(1);
PipelineModel model = pipeline.fit(trainingData);
DataFrame predictions = model.transform(testData);
predictions.select("predictedLabel", "label").show(5);
MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel")
.setPredictionCol("prediction").setMetricName("precision");
double accuracy = evaluator.evaluate(predictions);
System.out.println("Test Error = " + (1.0 - accuracy));
GBTClassificationModel gbtModel = (GBTClassificationModel) (model.stages()[2]);
System.out.println("Learned classification GBT model:\n" + gbtModel.toDebugString());
}
}
private static String[] findPredictionColumns(String outputCol, DataFrame dataFrame) {
String columns[] = dataFrame.columns();
String inputColumns[] = new String[columns.length - 1];
int count = 0;
for (String column : dataFrame.columns()) {
if (!column.equalsIgnoreCase(outputCol)) {
inputColumns[count++] = column;
}
}
return inputColumns;
}
/**
* This will replace empty values with mean.
*
* @param columnName
* @param dataFrame
* @return
*/
private static DataFrame modifiyDatFrame(DataFrame dataFrame) {
Set<String> numericColumns = new HashSet<String>();
if (dataFrame.numericColumns() != null && dataFrame.numericColumns().length() > 0) {
scala.collection.Iterator<Expression> iterator = ((Seq<Expression>) dataFrame.numericColumns()).toIterator();
while (iterator.hasNext()) {
Expression expression = iterator.next();
Double avgAge = dataFrame.na().drop().groupBy(((AttributeReference) expression).name()).avg(((AttributeReference) expression).name())
.first().getDouble(1);
dataFrame = dataFrame.na().fill(avgAge, new String[] { ((AttributeReference) expression).name() });
numericColumns.add(((AttributeReference) expression).name());
DataType dataType = ((AttributeReference) expression).dataType();
if (!"double".equalsIgnoreCase(dataType.simpleString())) {
dataFrame = dataFrame.withColumn("temp", dataFrame.col(((AttributeReference) expression).name()).cast(DataTypes.DoubleType))
.drop(((AttributeReference) expression).name()).withColumnRenamed("temp", ((AttributeReference) expression).name());
}
}
}
// Fit method of StringIndexer converts the column to StringType(if
// it is not of StringType) and then counts the occurrence of each
// word. It then sorts these words in descending order of their
// frequency and assigns an index to each word. StringIndexer.fit()
// method returns a StringIndexerModel which is a Transformer
StringIndexer stringIndexer = new StringIndexer();
String allCoumns[] = dataFrame.columns();
for (String column : allCoumns) {
if (!numericColumns.contains(column)) {
dataFrame = stringIndexer.setInputCol(column).setOutputCol(column + "_index").fit(dataFrame).transform(dataFrame);
dataFrame = dataFrame.drop(column);
}
}
dataFrame.printSchema();
return dataFrame;
}
@SuppressWarnings("unused")
private static void copyFile(DataFrame dataFrame) {
dataFrame
.select("id", "click", "hour", "C1", "banner_pos", "site_id", "site_domain", "site_category", "app_id", "app_domain", "app_category",
"device_id", "device_ip", "device_model", "device_type", "device_conn_type", "C14", "C15", "C16", "C17", "C18", "C19", "C20",
"C21")
.write().format("com.databricks.spark.csv").option("header", "true").option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.save("/splits/sub-splitaa-optmized");
}
@SuppressWarnings("unused")
private static Integer parse(String sDate, int field) {
try {
if (sDate != null && !sDate.toString().equalsIgnoreCase("hour")) {
Date date = sdf.parse(sDate.toString());
Calendar cal = Calendar.getInstance();
cal.setTime(date);
return cal.get(field);
}
} catch (ParseException e) {
e.printStackTrace();
}
return 0;
}
}
I am using Spark with Java. A sample file will be:
id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
10000679056417042096,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,07d7df22,a99f214a,9644d0bf,779d90c2,1,0,18993,320,50,2161,0,35,-1,157
10000720757801103869,0,14102100,1005,0,d6137915,bb1ef334,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,05241af0,8a4875bd,1,0,16920,320,50,1899,0,431,100077,117
10000724729988544911,0,14102100,1005,0,8fda644b,25d4cfcd,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,b264c159,be6db1d7,1,0,20362,320,50,2333,0,39,-1,157
I am late in replying, but I was also facing the same error while using GBT on a dataset from a CSV file.
I added .setLabels(labelIndexer.labels()) to labelConverter and this solved the problem.
IndexToString labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels())

How can I use the Spark SQLContext object in a Spark SQL registered function?

I am new to Spark SQL. The concat function is not available in Spark SQL queries, so we have registered a SQL function; within this function I need to access another table, and for that we have written a Spark SQL query on the SQLContext object.
When I invoke this query I am getting a NullPointerException. Please can you help with this?
Thanks in advance
// This is my code
class SalesHistory_2(sqlContext:SQLContext,sparkContext:SparkContext) extends Serializable {
import sqlContext._
import sqlContext.createSchemaRDD
try{
sqlContext.registerFunction("MaterialTransformation", Material_Transformation _)
def Material_Transformation(Material_ID: String): String =
{
var material:String =null;
var dd = sqlContext.sql("select * from product_master")
material
}
/* Product master*/
val productRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_PRODUCT_MASTER.txt")
val product_schemaString = productRDD.first
val product_withoutHeaders = dropHeader(productRDD)
val product_schema = StructType(product_schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val productdata = product_withoutHeaders.map{_.replace("|", "| ")}.map(x=> x.split("\\|"))
var product_rowRDD = productdata.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val product_srctableRDD = sqlContext.applySchema(product_rowRDD, product_schema)
product_srctableRDD.registerTempTable("product_master")
cacheTable("product_master")
/* Customer master*/
/* Sales History*/
val srcRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_TRADE_SALES_HISTORY_DS_4_20150119.txt")
val schemaString= srcRDD.first
val withoutHeaders = dropHeader(srcRDD)
val schema = StructType(schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val lines = withoutHeaders.map {_.replace("|", "| ")}.map(x=> x.split("\\|"))
var rowRDD = lines.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val srctableRDD = sqlContext.applySchema(rowRDD, schema)
srctableRDD.registerTempTable("SALES_HISTORY")
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
val path: Path = Path ("D:/Realease 8.0/files/output/")
try {
path.deleteRecursively(continueOnFailure = false)
} catch {
case e: IOException => // some file could not be deleted
}
val successRDDToFile = srcResults.map { x => x.mkString("|")}
successRDDToFile.coalesce(1).saveAsTextFile("D:/Realease 8.0/files/output/")
}
catch {
case ex: Exception => println(ex) // TODO: handle error
}
this.sparkContext.stop()
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) {
lines.drop(1)
}
lines
})
}
}
The answer here is rather short and probably disappointing - you simply cannot do something like this.
The general rule in Spark is that you cannot trigger an action or a transformation from inside another action or transformation; or, to be a little more precise, outside the driver the SparkContext is no longer accessible / defined.
Calling Spark SQL for each row in the Sales History RDD looks like a very bad idea:
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
You'd better use a join between your RDDs and forget your custom function:
val srcResults = sqlContext.sql("SELECT s.*, p.* FROM SALES_HISTORY s join product_master p on s.Material_ID=p.ID")
