I have a Kafka topic that contains JSON messages, for example:
{"jsonCode":"1234", "jsonData":{.....}}
{"jsonCode":"1234", "jsonData":{.....}}
{"jsonCode":"1235", "jsonData":{.....}}
{"jsonCode":"1235", "jsonData":{.....}}
{"jsonCode":"1236", "jsonData":{.....}}
My question is whether I can create the following hash map while reading from the topic:
["1234", [list of JSONs with jsonCode 1234]]
["1235", [list of JSONs with jsonCode 1235]]
["1236", [list of JSONs with jsonCode 1236]]
Is this possible? How can I do this mapping?
I want to read from Kafka using Spark Streaming, get all unread messages on the topic, and create the hash map.
Thanks.
Do you have any consumer configuration set up in your code? The consumer config typically requires key and value deserializers.
Check that, while reading from the topic, you are able to read values as key-value pairs. Typically your consumer should look like this:
final Consumer<String, String> consumer = ...; // consumer built with your consumer config
final ConsumerRecords<String, String> consumerRecords = consumer.poll(pollTimeout);
consumerRecords.forEach(record -> {
    System.out.printf("[Consumer Record: (key - %s, value - %s, partition - %d, offset - %d)]\n",
            record.key(), record.value(), record.partition(), record.offset());
    // parse your JSON from either the key or the value
    String value = jsonParser(record.value()); // let's parse from the value
    // ...
});
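To build the hash map the question asks for, here is a minimal sketch of the grouping step, assuming the messages look like the examples above and that Jackson is available for the JSON parsing (the original post does not name a JSON library):
import java.io.IOException;
import java.util.*;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
Map<String, List<String>> messagesByCode = new HashMap<>();
consumerRecords.forEach(record -> {
    try {
        JsonNode node = mapper.readTree(record.value());
        String code = node.get("jsonCode").asText();            // e.g. "1234"
        messagesByCode.computeIfAbsent(code, k -> new ArrayList<>())
                      .add(record.value());                     // keep the raw JSON grouped per code
    } catch (IOException e) {
        // skip or log malformed messages
    }
});
The same grouping can also be done per micro-batch in Spark Streaming, for example with mapToPair on the jsonCode followed by groupByKey.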
We have a requirement to group by multiple fields in a dynamic way on a huge data set. The data is stored in Hazelcast Jet cluster. Example: if Person class contains 4 fields: age, name, city and country. We first need to group by city and then by country and then we may group by name based on conditional parameters.
We already tried using distributed collections and it did not work. Even when we tried the Pipeline API, it threw an error.
Code:
IMap res = client.getMap("res"); // res is a distributed map
Pipeline p = Pipeline.create();
JobConfig jobConfig = new JobConfig();
p.drawFrom(Sources.<Person>list("inputList"))
.aggregate(AggregateOperations.groupingBy(Person::getCountry))
.drainTo(Sinks.map(res));
jobConfig = new JobConfig();
jobConfig.addClass(Person.class);
jobConfig.addClass(HzJetListClientPersonMultipleGroupBy.class);
Job job = client.newJob(p, jobConfig);
job.join();
Then we read from the map in the client and destroy it.
Error Message on the server:
Caused by: java.lang.ClassCastException: java.util.HashMap cannot be cast to java.util.Map$Entry
groupingBy aggregates all the input items into a HashMap where the key is extracted using the given function. In your case it aggregates a stream of Person items into a single HashMap<String, List<Person>> item.
You need to use this:
p.drawFrom(Sources.<Person>list("inputList"))
.groupingKey(Person::getCountry)
.aggregate(AggregateOperations.toList())
.drainTo(Sinks.map(res));
This will populate the res map with a list of persons for each country.
Remember, without groupingKey() the aggregation is always global. That is, all items in the input will be aggregated to one output item.
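As a small follow-up to the "read from the map in the client and destroy it" step mentioned in the question, here is a sketch of the client-side read (the generic types are an assumption based on the grouping above):
// After job.join(): the sink map holds country -> list of persons.
IMap<String, List<Person>> result = client.getMap("res");
result.forEach((country, persons) ->
        System.out.println(country + " -> " + persons.size() + " persons"));
result.destroy(); // the question states the map is destroyed after reading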
I am using Spark Streaming, and data is being sent to Kafka. I am sending a Map to Kafka. Assuming I have a Map of 20 elements (which may grow to 1,000 within a streaming batch duration) like below:
HashMap<Integer,String> input = new HashMap<Integer,String>();
input.put(11,"One");
input.put(312,"two");
input.put(33,"One");
input.put(24,"One");
input.put(35,"One");
input.put(612,"One");
input.put(7,"One");
input.put(128,"One");
input.put(9,"One");
input.put(10,"One");
input.put(11,"One1");
input.put(12,"two1");
input.put(13,"One1");
input.put(14,"One1");
input.put(15,"One1");
input.put(136,"One1");
input.put(137,"One1");
input.put(158,"One1");
input.put(159,"One1");
input.put(120,"One1");
Set<Integer> inputKeys = input.keySet();
Iterator<Integer> inputKeysIterator = inputKeys.iterator();
while (inputKeysIterator.hasNext()) {
    Integer key = inputKeysIterator.next();
    ProducerRecord<Integer, String> record =
            new ProducerRecord<Integer, String>(topic, key % 10, input.get(key));
    kafkaProducer.send(record);
}
My Kafka topic has 10 partitions. Here I am calling kafkaProducer.send() 20 times, and hence making 20 Kafka calls. How can I send the whole data set in one batch, i.e. in one Kafka call, while still ensuring each record goes to a specific partition driven by the formula key % 10, as in:
ProducerRecord<Integer, String> record =
        new ProducerRecord<>(topic, key % 10, input.get(key));
One option I see: linger.ms=1 may ensure batching, but at the cost of 1 ms of latency.
How can I avoid this latency, and also avoid making 20 network (Kafka) calls, or at least minimize the number of Kafka calls?
The Kafka producer API already sends messages in batches, even though you call send() individually, one record at a time.
See batch.size in the docs; it is measured in bytes, not messages, but you can force an actual network send by calling flush() on the producer.
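As a hedged illustration of those two knobs (the property values below are arbitrary examples, not recommendations):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
// bootstrap.servers and the key/value serializers go here as usual
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // batch size is in bytes, not record count
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);          // wait up to 5 ms for a batch to fill

KafkaProducer<Integer, String> producer = new KafkaProducer<>(props);
// ... the send() loop from the question ...
producer.flush(); // force whatever is still buffered onto the wire now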
Regarding the partitions, you'll need to write your own Partitioner. Simply passing the mod value as the key doesn't guarantee you won't have a hash collision in the default partitioner.
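A minimal sketch of such a partitioner (the class name ModuloPartitioner, and the assumption that your record keys are Integers, are mine rather than from the original post):
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class ModuloPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return Math.floorMod((Integer) key, numPartitions); // key % 10 when the topic has 10 partitions
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
You would register it on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, ModuloPartitioner.class.getName()), after which the key itself no longer decides the partition.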
I am experimenting with a Spark job that streams data from Kafka and writes to Cassandra.
The sample I am working with takes a bunch of words in a given time interval and publishes the word count to Cassandra. I am also trying to publish the timestamp along with the word and its count.
What I have so far is as follows:
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, zkQuorum, groupId, topicMap);
JavaDStream<String> lines = messages.map(Tuple2::_2);
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
.reduceByKey((i1, i2) -> i1 + i2);
Now I am trying to append the timestamp to these records. What I have tried is something like this:
Tuple3<String, Date, Integer> finalRecord =
wordCounts.map(s -> new Tuple3<>(s._1(), new Date().getTime(), s._2()));
This, of course, is flagged as wrong in my IDE. I am completely new to working with the Spark libraries and to writing functions in this (I guess lambda-based) form.
Can someone help me correct this error and achieve what I am trying to do?
After some searching on the web and studying some examples, I was able to achieve what I wanted as follows.
In order to append the timestamp attribute to the existing two-value Tuple, I had to create a simple bean which represents my Cassandra row.
public static class WordCountRow implements Serializable {
    String word = "";
    long timestamp;
    Integer count = 0;
    // plus a (word, timestamp, count) constructor that assigns the three fields
}
Then, I had to map the (word, count) Tuple2 objects in the JavaPairDStream structure to a JavaDStream structure that holds objects of the above WordCountRow class.
JavaDStream<WordCountRow> wordCountRows = wordCounts.map((Function<Tuple2<String, Integer>, WordCountRow>)
tuple -> new WordCountRow(tuple._1, new Date().getTime(), tuple._2));
Finally, I could call the foreachRDD method on this structure (which yields WordCountRow objects) and write the rows to Cassandra one after the other.
wordCountRows.foreachRDD((VoidFunction2<JavaRDD<WordCountRow>, Time>) (rdd, time) -> {
    final SparkConf sc = rdd.context().getConf();
    final CassandraConnector cc = CassandraConnector.apply(sc);
    rdd.foreach((VoidFunction<WordCountRow>) wordCount -> {
        try (Session session = cc.openSession()) {
            String query = String.format(Joiner.on(" ").join(
                    "INSERT INTO test_keyspace.word_count",
                    "(word, ts, count)",
                    "VALUES ('%s', %s, %s);"),
                    wordCount.word, wordCount.timestamp, wordCount.count);
            session.execute(query);
        }
    });
});
Thanks
I am looking to store protobuf messages in HBase/HDFS using Spark Streaming, and I have the two questions below:
1. What is the efficient way of storing a huge number of protobuf messages, and the efficient way of retrieving them to do some analytics? For example, should they be stored as Strings/byte[] in HBase, or should they be stored in Parquet files in HDFS, etc.?
2. How should the hierarchical structure of a protobuf message be stored? I mean, should the nested elements be flattened out before storage, or is there any mechanism to store them as is? If the nested elements are collections or maps, should they be exploded and stored as multiple rows?
The sample structure of a protobuf message is shown below:
> +--MsgNode-1
> +--Attribute1 - String
> +--Attribute2 - Int
> +--MsgNode-2
> +--Attribute1 - String
> +--Attribute2 - Double
> +--MsgNode-3 - List of MsgNode-3's
> +--Attribute1 - Int
I am planning to use Spark Streaming to collect the protobuf messages as bytes and store them in HBase/HDFS.
Question 1:
What is the efficient way of storing a huge number of protobuf messages, and the efficient way of retrieving them to do some analytics? For example, should they be stored as Strings/byte[] in HBase, or should they be stored in Parquet files in HDFS, etc.?
I would recommend:
- Store the protobuf as Parquet/Avro files (splitting them into meaningful messages with an Avro schema).
This can be achieved using the DataFrames API in Spark 1.5 and above (partitionBy with SaveMode.Append); see this a-powerful-big-data-trio article.
If you store them as strings or byte arrays, you can't do data analytics directly; querying the raw data is not possible.
If you are using Cloudera, Impala (which supports Parquet/Avro) can be used to query the raw data.
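A minimal sketch of that write path with the Spark 1.5-era DataFrames API; MsgNodeRow, parsedRowsRdd, the msgDate column, and the HDFS path are illustrative names, not from the original answer:
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;

// parsedRowsRdd is an assumed JavaRDD<MsgNodeRow> produced by your protobuf parser;
// sqlContext is the usual SQLContext built from the SparkContext.
DataFrame df = sqlContext.createDataFrame(parsedRowsRdd, MsgNodeRow.class);
df.write()
  .mode(SaveMode.Append)      // append each micro-batch
  .partitionBy("msgDate")     // pick a partition column that matches your query pattern
  .parquet("hdfs:///data/protobuf_parquet");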
Question 2:
How should the hierarchical structure of a protobuf message be stored? I mean, should the nested elements be flattened out before storage, or is there any mechanism to store them as is? If the nested elements are collections or maps, should they be exploded and stored as multiple rows?
If you store the data in a raw format from Spark Streaming, how will you query it when the business wants to know what kind of data they received (and this requirement is very common)?
First of all, you have to understand your data, i.e. the relations between the different messages within the protobuf, so that you can decide whether a single row or multiple rows are appropriate. Then develop a protobuf parser to parse the message structure. Based on your data, convert it to an Avro GenericRecord and save it as a Parquet file.
TIP:
Protobuf parsers can be developed in different ways based on your requirements. One generic way is the example below.
public SortedMap<String, Object> convertProtoMessageToMap(GeneratedMessage src) {
final SortedMap<String, Object> finalMap = new TreeMap<String, Object>();
final Map<FieldDescriptor, Object> fields = src.getAllFields();
for (final Map.Entry<FieldDescriptor, Object> fieldPair : fields.entrySet()) {
final FieldDescriptor desc = fieldPair.getKey();
if (desc.isRepeated()) {
final List<?> fieldList = (List<?>) fieldPair.getValue();
if (fieldList.size() != 0) {
final List<String> arrayListOfElements = new ArrayList<String>();
for (final Object o : fieldList) {
arrayListOfElements.add(o.toString());
}
finalMap.put(desc.getName(), arrayListOfElements);
}
} else {
finalMap.put(desc.getName(), fieldPair.getValue().toString());
}
}
return finalMap;
}
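To connect the parser above to the Parquet/Avro recommendation, here is a rough sketch of turning the flattened field map into an Avro GenericRecord. The two-field schema and the protoMessage variable are illustrative assumptions; align the field names and types with your real proto before using anything like this:
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Illustrative schema standing in for MsgNode-1 (both fields kept as strings here).
Schema schema = SchemaBuilder.record("MsgNode1").fields()
        .optionalString("attribute1")
        .optionalString("attribute2")
        .endRecord();

GenericRecord avroRecord = new GenericData.Record(schema);
for (Map.Entry<String, Object> field : convertProtoMessageToMap(protoMessage).entrySet()) {
    if (schema.getField(field.getKey()) != null) { // copy only the fields the schema knows about
        avroRecord.put(field.getKey(), field.getValue().toString());
    }
}
// avroRecord can now be written with parquet-avro (for example via AvroParquetWriter).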
I am implementing a stream learner for text classification. There are some single-valued parameters in my implementation that need to be updated as new stream items arrive. For example, I want to change the learning rate as new predictions are made. However, I doubt that there is a way to update a broadcast variable after the initial broadcast. So what happens if I need to re-broadcast a variable every time I update it? If there is a way to do it, or a workaround for what I want to accomplish in Spark Streaming, I'd be happy to hear about it.
Thanks in advance.
I got this working by creating a wrapper class over the broadcast variable. The updateAndGet method of the wrapper class returns the refreshed broadcast variable. I am calling this function inside dStream.transform, as per the Spark documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
Transform Operation states:
"the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches."
The BroadcastWrapper class will look like this:
public class BroadcastWrapper {

    private Broadcast<ReferenceData> broadcastVar;
    private Date lastUpdatedAt = Calendar.getInstance().getTime();

    private static BroadcastWrapper obj = new BroadcastWrapper();

    private BroadcastWrapper() {}

    public static BroadcastWrapper getInstance() {
        return obj;
    }

    public JavaSparkContext getSparkContext(SparkContext sc) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
        return jsc;
    }

    public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext) {
        Date currentDate = Calendar.getInstance().getTime();
        long diff = currentDate.getTime() - lastUpdatedAt.getTime();
        if (broadcastVar == null || diff > 60000) { // let's say we want to refresh every 1 min = 60000 ms
            if (broadcastVar != null)
                broadcastVar.unpersist();
            lastUpdatedAt = new Date(System.currentTimeMillis());
            // your logic to refresh the reference data
            ReferenceData data = getRefData();
            broadcastVar = getSparkContext(sparkContext).broadcast(data);
        }
        return broadcastVar;
    }
}
You can use this updateAndGet function inside the stream.transform method, which allows RDD-to-RDD transformations:
objectStream.transform(rdd -> {
    Broadcast<Object> var = BroadcastWrapper.getInstance().updateAndGet(rdd.context());
    /* your code to manipulate the RDD using var */
    return rdd;
});
Refer to my full answer in this post: https://stackoverflow.com/a/41259333/3166245
Hope it helps
My understanding is that once a broadcast variable is initially sent out, it is read-only. I believe you can update the broadcast variable on the local node, but not on remote nodes.
Maybe you need to consider doing this outside Spark. How about using a NoSQL store (Cassandra, etc.) or even Memcached? You can then update the variable from one task and periodically check the store from the other tasks.
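A very rough sketch of that idea in the Java streaming API; fetchLearningRateFromStore() and updateModel() are hypothetical helpers standing in for your Cassandra/Memcached client and your learner, not part of any real API:
// Each batch, every partition re-reads the current learning rate from the external
// store instead of relying on a read-only broadcast variable.
dStream.foreachRDD(rdd ->
    rdd.foreachPartition(records -> {
        double learningRate = fetchLearningRateFromStore(); // one lookup per partition, not per record
        records.forEachRemaining(record -> updateModel(record, learningRate));
    })
);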
I came up with an ugly workaround, but it worked!
We can see how a broadcast value is fetched from a broadcast object just by its broadcast id: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L114
So I periodically rebroadcast through the same broadcast id.
val broadcastFactory = new TorrentBroadcastFactory()
broadcastFactory.unbroadcast(BroadcastId, true, true)
// append some ids to initIds
val broadcastContent = broadcastFactory.newBroadcast[Set[String]](initIds, false, BroadcastId)
And I can get BroadcastId from the first broadcast value:
val ids = ssc.sparkContext.broadcast(initIds)
// broadcast id
val BroadcastId = ids.id
Then the workers use ids as a normal Broadcast value:
def func(record: Array[Byte], bc: Broadcast[Set[String]]) = ???
bkc.unpersist(true)                  // drop the cached copies on the executors
bkc.destroy()                        // release the old broadcast entirely
bkc = sc.broadcast(tableResultMap)   // re-broadcast the refreshed value
bkv = bkc.value
You may try this; I can't guarantee whether it is effective.
It is best to collect the data to the driver and then broadcast it to all nodes.
Use DStream#foreachRDD to collect the computed RDDs at the driver and, once you know when you need to change the learning rate, use SparkContext#broadcast(value) to send the new value to all nodes.
I would expect the code to look something like the following:
dStreamContainingBroadcastValue.foreachRDD { rdd =>
  val valueToBroadcast = rdd.collect()
  val refreshedBroadcast = sc.broadcast(valueToBroadcast) // keep a reference for later use
}
You may also find this thread from the Spark user mailing list useful. Let me know if that works.