Cogrouping not supported in streaming DataSet/DataFrames - apache-spark

Executing action:Metrics:P{"input":"tripMetrics","isEnabled":"true","class":"com.mobileum.wcmodel.execution.actions.SaveStreamAction","properties":{"path":"output/tripMetrics","triggerWindow":"5 minutes","checkpointLocation":"output/checkpoints/tripMetrics","format":"console","queryName":"GtpDetailModel"}}}
Exception in thread "main" org.apache.spark.sql.AnalysisException: CoGrouping with a streaming DataFrame/Dataset is not supported;;
I have a use case where I have to cogroup two datasets in streaming. However, when doing so I get an exception that cogrouping of Datasets/DataFrames in streaming is not supported.
@Override
public List<Dataset<Row>> transform(SparkSession sparkSession, Map<String, Dataset<Row>> inputDatasets, Properties properties) {
Encoder<Row> encoder = RowEncoder.apply((StructType)new CatalystSqlParser(sparkSession.sqlContext().conf()).parseDataType("struct<hostnetworkid:string,partnercountryid:string>"));
try {
Iterator<Map.Entry<String,Dataset<Row>>> itr= inputDatasets.entrySet().iterator();
Dataset<Row> trip = null;
Dataset<Row> registration= null;
while(itr.hasNext()){
trip=itr.next().getValue();
registration=itr.next().getValue();
}
KeyValueGroupedDataset<Long, TripModel> tripKeyValueGroupedDataset =
trip.map((MapFunction<Row, TripModel>) TripModel :: new , Encoders.bean(TripModel.class))
.groupByKey((MapFunction<TripModel, Long>) TripModel::getKey, Encoders.LONG());
KeyValueGroupedDataset<Long, RegistrationModel> regKeyValueGroupedDataset =
registration.map((MapFunction<Row, RegistrationModel>) RegistrationModel :: new , Encoders.bean(RegistrationModel.class))
.groupByKey((MapFunction<RegistrationModel, Long>) RegistrationModel::getKey, Encoders.LONG());
Dataset<Row> cogrouped = tripKeyValueGroupedDataset.cogroup(regKeyValueGroupedDataset, (CoGroupFunction<Long,TripModel, RegistrationModel, Row>) ( key, it1, it2) ->
{
Iterable<TripModel> iterable = () -> it1;
List<TripModel> tripModelList = StreamSupport
.stream(iterable.spliterator(), false)
.collect(Collectors.toList());
List<Row> a1 = new ArrayList<Row>();
a1.add(RowFactory.create(tripModelList.get(0).getCosid(),"asdf"));
return a1.iterator();
}, encoder);
// (assumed completion; the original snippet is truncated here)
return Collections.singletonList(cogrouped);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
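Structured Streaming (as of Spark 2.x) does not implement cogroup on streaming Datasets. A minimal sketch of one commonly used alternative, not taken from the original post: replace the cogroup with a stream-stream equi-join, supported from Spark 2.3 onwards, and do the per-key combination on the joined rows. The column name key below is an assumption for illustration.

// Sketch only, not the original author's code: stream-stream inner equi-join
// on an assumed "key" column present in both streaming Datasets. Add watermarks
// on both sides if the join state must be cleaned up over time.
Dataset<Row> joined = trip.alias("t")
        .join(registration.alias("r"),
                functions.col("t.key").equalTo(functions.col("r.key")),
                "inner");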

Related

Retrieve nested Data from Firebase Database android

Snapshot of my firebase realtime database
I want to extract all the data under the "Orders" node. Please tell me how I should model my data class for Android in Kotlin.
I tried this type of modeling, after getting the reference to (Orders/uid/):
Order.kt
data class Order(
val items: ArrayList<MyItems> = ArrayList(),
val timeStamp:Long=0,
val totalCost:Int=0
)
MyItems.kt
data class MyItems(
val Item:ArrayList<Menu>=ArrayList()
)
Menu.kt
data class Menu(
val menCategory:String="",
val menName:String="",
val menImage:String="",
val menId:String="",
val menQuantity:Int=0,
val menCost:Int=0
)
After a lot of thinking and research online, I was finally able to model my classes and attach a value event listener. Here it goes:
Order.kt
data class Order(
val items: ArrayList<HashMap<String, Any>> = ArrayList(),
val timeStamp: Long = 0,
val totalCost: Int = 0
)
OItem.kt
data class OItem(
val menCategory: String = "",
val menId: String = "",
val menImage: String = "",
val menName: String = "",
val menPrice: Int = 0,
var menQuantity: Int = 0
)
MainActivity.kt
val uid = FirebaseAuth.getInstance().uid
val ordList = ArrayList<Order>() // holds the fetched orders
val ref = FirebaseDatabase.getInstance().getReference("Orders/$uid")
ref.addListenerForSingleValueEvent(object : ValueEventListener {
override fun onCancelled(error: DatabaseError) {
//
}
override fun onDataChange(p0: DataSnapshot) {
p0.children.forEach {
val order = it.getValue(Order::class.java)
ordList.add(order!!)
}
Log.d("hf", ordList.toString())
}
})

Skipping first few lines in Spark

I have Spark 2.0 code which reads .gz (text) files and writes them to a Hive table.
How do I ignore the first two lines from all of my files? I just want to skip the first two lines.
SparkSession spark = SparkSession
.builder()
.master("local")
.appName("SparkSessionFiles")
.config("spark.some.config.option", "some-value")
.enableHiveSupport()
.getOrCreate();
JavaRDD<mySchema> peopleRDD = spark.read()
.textFile("file:///app/home/emm/zipfiles/myzips/")
.javaRDD()
.map(new Function<String, mySchema>()
{
@Override
public mySchema call(String line) throws Exception
{
String[] parts = line.split(";");
mySchema mySchema = new mySchema();
mySchema.setCFIELD1 (parts[0]);
mySchema.setCFIELD2 (parts[1]);
mySchema.setCFIELD3 (parts[2]);
mySchema.setCFIELD4 (parts[3]);
mySchema.setCFIELD5 (parts[4]);
return mySchema;
}
});
// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> myDF = spark.createDataFrame(peopleRDD, mySchema.class);
myDF.createOrReplaceTempView("myView");
spark.sql("INSERT INTO myHIVEtable SELECT * from myView");
UPDATE: Modified code
Lambdas are not working in my Eclipse, so I used regular Java syntax. I am getting an exception now.
.....
Function2 removeHeader= new Function2<Integer, Iterator<String>, Iterator<String>>(){
public Iterator<String> call(Integer ind, Iterator<String> iterator) throws Exception {
System.out.println("ind="+ind);
if((ind==0) && iterator.hasNext()){
iterator.next();
iterator.next();
return iterator;
}else
return iterator;
}
};
JavaRDD<mySchema> peopleRDD = spark.read()
.textFile(path) //file:///app/home/emm/zipfiles/myzips/
.javaRDD()
.mapPartitionsWithIndex(removeHeader,false)
.map(new Function<String, mySchema>()
{
........
java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:268)
at java.util.LinkedList.remove(LinkedList.java:683)
at org.apache.spark.sql.execution.BufferedRowIterator.next(BufferedRowIterator.java:49)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.next(WholeStageCodegenExec.scala:374)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.next(WholeStageCodegenExec.scala:368)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at com.comcast.emm.vodip.SparkSessionFiles.SparkSessionFiles$1.call(SparkSessionFiles.java:2480)
at com.comcast.emm.vodip.SparkSessionFiles.SparkSessionFiles$1.call(SparkSessionFiles.java:2476)
You could do something like this:
JavaRDD<mySchema> peopleRDD = spark.read()
.textFile("file:///app/home/emm/zipfiles/myzips/")
.javaRDD()
.mapPartitionsWithIndex((index, iter) -> {
if (index == 0 && iter.hasNext()) {
iter.next();
if (iter.hasNext()) {
iter.next();
}
}
return iter;
}, true);
...
In Scala, the syntax is simpler. For example:
rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(2) else iter }
EDIT:
I modified the code to avoid the exception.
This code will only delete the first 2 lines of the RDD, not of every file.
If you want to remove the first 2 lines of every file, I suggest you create an RDD for each file, apply .mapPartitionsWithIndex(...) to each RDD, and then union the results, as sketched below.
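A minimal sketch of that per-file variant, not from the original answer, assuming the same hypothetical input directory as above and that a driver-side loop over the files is acceptable:

// Sketch only: one RDD per file, drop the first two lines of each, then union.
// Needs org.apache.hadoop.fs.{FileSystem, FileStatus, Path} and the Spark Java API;
// assumes the enclosing method may throw IOException.
String dir = "file:///app/home/emm/zipfiles/myzips/";
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
FileSystem fs = FileSystem.get(URI.create(dir), jsc.hadoopConfiguration());

JavaRDD<String> cleaned = jsc.emptyRDD();
for (FileStatus status : fs.listStatus(new Path(dir))) {
    JavaRDD<String> oneFile = jsc.textFile(status.getPath().toString())
            .mapPartitionsWithIndex((index, iter) -> {
                if (index == 0) {                            // header lines sit in the first partition
                    for (int i = 0; i < 2 && iter.hasNext(); i++) {
                        iter.next();                         // skip one header line of this file
                    }
                }
                return iter;
            }, false);
    cleaned = cleaned.union(oneFile);
}
// cleaned can now go through the same .map(...) to mySchema as in the question.

Each .gz file is not splittable, so a single-file textFile RDD has one partition and checking index == 0 per file RDD is enough.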

Spark: cleaner way to build Dataset out of Spark streaming

I want to create an API which looks like this:
public Dataset<Row> getDataFromKafka(SparkContext sc, String topic, StructType schema);
Here:
topic - the Kafka topic name from which the data is going to be consumed.
schema - the schema information for the Dataset.
So my function contains the following code:
JavaStreamingContext jsc = new JavaStreamingContext(javaSparkContext, Durations.milliseconds(2000L));
JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(
jsc, String.class, String.class,
StringDecoder.class, StringDecoder.class,
kafkaConsumerConfig(), topics
);
Dataset<Row> dataSet = sqlContext.createDataFrame(javaSparkContext.emptyRDD(), schema);
DataSetHolder holder = new DataSetHolder(dataSet);
LongAccumulator stopStreaming = sc.longAccumulator("stop");
directStream.foreachRDD(rdd -> {
RDD<Row> rows = rdd.values().map(value -> {
//get type of message from value
Row row = null;
if (END == msg) {
stopStreaming.add(1);
row = null;
} else {
row = new GenericRow(/*row data created from values*/);
}
return row;
}).filter(row -> row != null).rdd();
holder.union(sqlContext.createDataFrame(rows, schema));
holder.get().count();
});
jsc.start();
// stop the stream once stopStreaming is greater than 0; this check is spawned as a new thread
return holder.get();
Here DataSetHolder is a wrapper class around Dataset, used to combine the results of all the RDDs.
class DataSetHolder {
private Dataset<Row> df = null;
public DataSetHolder(Dataset<Row> df) {
this.df = df;
}
public void union(Dataset<Row> frame) {
this.df = df.union(frame);
}
public Dataset<Row> get() {
return df;
}
}
This doesn't look good at all, but I had to do it. I am wondering what a better way to do this would be. Or does Spark provide anything for this?
Update
So after consuming all the data from the stream, i.e. from the Kafka topic, we create a DataFrame out of it so that a data analyst can register it as a temp table and fire any query against it to get meaningful results.
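For what it is worth, Spark 2.x does ship a provision for exactly this pattern: Structured Streaming with the Kafka source and the memory sink. The sketch below is an illustration only, assuming the spark-sql-kafka-0-10 package is on the classpath; the broker address, topic name and the kafka_data table name are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToQueryableTable {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("KafkaToQueryableTable")
                .getOrCreate();

        // Structured Streaming Kafka source: records arrive as binary key/value columns.
        Dataset<Row> kafkaRows = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
                .option("subscribe", "my-topic")                     // placeholder topic
                .load()
                .selectExpr("CAST(value AS STRING) AS value");

        // The "memory" sink materializes the stream as an in-memory temp table,
        // which analysts can then query with ordinary Spark SQL.
        StreamingQuery query = kafkaRows.writeStream()
                .format("memory")
                .queryName("kafka_data")
                .outputMode("append")
                .start();

        query.awaitTermination(10_000);                              // let some data arrive
        spark.sql("SELECT count(*) FROM kafka_data").show();
        spark.stop();
    }
}

The memory sink keeps everything on the driver, so it suits ad-hoc analysis of modest volumes; for larger data a file or Hive sink with a temp view over the output directory is the usual substitute.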

Persisting data to DynamoDB using Apache Spark

I have an application where:
1. I read JSON files from S3 using SqlContext.read.json into a DataFrame.
2. Then I do some transformations on the DataFrame.
3. Finally I want to persist the records to DynamoDB, using one of the record values as the key and the rest of the JSON parameters as values/columns.
I am trying something like:
JobConf jobConf = new JobConf(sc.hadoopConfiguration());
jobConf.set("dynamodb.servicename", "dynamodb");
jobConf.set("dynamodb.input.tableName", "my-dynamo-table"); // Pointing to DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com");
jobConf.set("dynamodb.regionid", "us-east-1");
jobConf.set("dynamodb.throughput.read", "1");
jobConf.set("dynamodb.throughput.read.percent", "1");
jobConf.set("dynamodb.version", "2011-12-05");
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat");
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat");
DataFrame df = sqlContext.read().json("s3n://mybucket/abc.json");
RDD<String> jsonRDD = df.toJSON();
JavaRDD<String> jsonJavaRDD = jsonRDD.toJavaRDD();
PairFunction<String, Text, DynamoDBItemWritable> keyData = new PairFunction<String, Text, DynamoDBItemWritable>() {
public Tuple2<Text, DynamoDBItemWritable> call(String row) {
DynamoDBItemWritable writeable = new DynamoDBItemWritable();
try {
System.out.println("JSON : " + row);
JSONObject jsonObject = new JSONObject(row);
System.out.println("JSON Object: " + jsonObject);
Map<String, AttributeValue> attributes = new HashMap<String, AttributeValue>();
AttributeValue attributeValue = new AttributeValue();
attributeValue.setS(row);
attributes.put("values", attributeValue);
AttributeValue attributeKeyValue = new AttributeValue();
attributeValue.setS(jsonObject.getString("external_id"));
attributes.put("primary_key", attributeKeyValue);
AttributeValue attributeSecValue = new AttributeValue();
attributeValue.setS(jsonObject.getString("123434335"));
attributes.put("creation_date", attributeSecValue);
writeable.setItem(attributes);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return new Tuple2(new Text(row), writeable);
}
};
JavaPairRDD<Text, DynamoDBItemWritable> pairs = jsonJavaRDD
.mapToPair(keyData);
Map<Text, DynamoDBItemWritable> map = pairs.collectAsMap();
System.out.println("Results : " + map);
pairs.saveAsHadoopDataset(jobConf);
However I do not see any data getting written to DynamoDB. Nor do I get any error messages.
I'm not sure, but yours seems more complex than it needs to be.
I've used the following to write an RDD to DynamoDB successfully:
val ddbInsertFormattedRDD = inputRDD.map { case (skey, svalue) =>
val ddbMap = new util.HashMap[String, AttributeValue]()
val key = new AttributeValue()
key.setS(skey.toString)
ddbMap.put("DynamoDbKey", key)
val value = new AttributeValue()
value.setS(svalue.toString)
ddbMap.put("DynamoDbKey", value)
val item = new DynamoDBItemWritable()
item.setItem(ddbMap)
(new Text(""), item)
}
val ddbConf = new JobConf(sc.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "my-dynamo-table")
ddbConf.set("dynamodb.throughput.write.percent", "0.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)
Also, have you checked that you have provisioned enough write capacity?
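Since the question is in Java, here is a rough Java adaptation of the same idea. It is a sketch only, under the same assumptions (emr-dynamodb-hadoop connector on the classpath, the target table already created with enough write capacity); inputRDD below is a hypothetical JavaPairRDD<String, String> of key/value pairs, and the attribute names are placeholders.

// Sketch only: inputRDD is a hypothetical JavaPairRDD<String, String>.
// Classes come from org.apache.hadoop.dynamodb (DynamoDBItemWritable),
// com.amazonaws.services.dynamodbv2.model (AttributeValue), java.util and scala.Tuple2.
JavaPairRDD<Text, DynamoDBItemWritable> ddbInsertFormattedRDD = inputRDD.mapToPair(kv -> {
    Map<String, AttributeValue> ddbMap = new HashMap<>();

    AttributeValue key = new AttributeValue();
    key.setS(kv._1());
    ddbMap.put("DynamoDbKey", key);                // hash key attribute (name is an assumption)

    AttributeValue value = new AttributeValue();
    value.setS(kv._2());
    ddbMap.put("DynamoDbValue", value);            // payload attribute (name is an assumption)

    DynamoDBItemWritable item = new DynamoDBItemWritable();
    item.setItem(ddbMap);
    return new Tuple2<>(new Text(""), item);
});

JobConf ddbConf = new JobConf(sc.hadoopConfiguration());
ddbConf.set("dynamodb.output.tableName", "my-dynamo-table");
ddbConf.set("dynamodb.throughput.write.percent", "0.5");
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat");
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf);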

How can I use the Spark SQLContext object in a Spark SQL registered function?

I am new to Spark SQL. The concat function was not available in our Spark SQL query, so we registered a SQL function; within this function I need to access another table, and for that we have written a Spark SQL query on the SQLContext object.
When I invoke this query I get a NullPointerException. Can you please help with this?
Thanks in advance.
// This is my code
class SalesHistory_2(sqlContext:SQLContext,sparkContext:SparkContext) extends Serializable {
import sqlContext._
import sqlContext.createSchemaRDD
try{
sqlContext.registerFunction("MaterialTransformation", Material_Transformation _)
def Material_Transformation(Material_ID: String): String =
{
var material:String =null;
var dd = sqlContext.sql("select * from product_master")
material
}
/* Product master*/
val productRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_PRODUCT_MASTER.txt")
val product_schemaString = productRDD.first
val product_withoutHeaders = dropHeader(productRDD)
val product_schema = StructType(product_schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val productdata = product_withoutHeaders.map{_.replace("|", "| ")}.map(x=> x.split("\\|"))
var product_rowRDD = productdata.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val product_srctableRDD = sqlContext.applySchema(product_rowRDD, product_schema)
product_srctableRDD.registerTempTable("product_master")
cacheTable("product_master")
/* Customer master*/
/* Sales History*/
val srcRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_TRADE_SALES_HISTORY_DS_4_20150119.txt")
val schemaString= srcRDD.first
val withoutHeaders = dropHeader(srcRDD)
val schema = StructType(schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
val lines = withoutHeaders.map {_.replace("|", "| ")}.map(x=> x.split("\\|"))
var rowRDD = lines.map(line=>{
Row.fromSeq(line.map {_.trim() })
})
val srctableRDD = sqlContext.applySchema(rowRDD, schema)
srctableRDD.registerTempTable("SALES_HISTORY")
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
val path: Path = Path ("D:/Realease 8.0/files/output/")
try {
path.deleteRecursively(continueOnFailure = false)
} catch {
case e: IOException => // some file could not be deleted
}
val successRDDToFile = srcResults.map { x => x.mkString("|")}
successRDDToFile.coalesce(1).saveAsTextFile("D:/Realease 8.0/files/output/")
}
catch {
case ex: Exception => println(ex) // TODO: handle error
}
this.sparkContext.stop()
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) lines.drop(1) else lines
})
}
}
The answer here is rather short and probably disappointing: you simply cannot do something like this.
The general rule in Spark is that you cannot trigger an action or transformation from within another action or transformation; or, to be a bit more precise, outside the driver the SparkContext is no longer accessible / defined.
Calling Spark SQL for each row of the SALES_HISTORY RDD looks like a very bad idea:
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
You'd be better off using a join between your RDDs and forgetting your custom function:
val srcResults = sqlContext.sql("SELECT s.*, p.* FROM SALES_HISTORY s join product_master p on s.Material_ID=p.ID")
