I am new to Flink, and I want to store Kafka streaming data in Cassandra. I've converted the incoming String into a POJO. My POJO is as below:
@Table(keyspace = "sample", name = "contact")
public class Person implements Serializable {
private static final long serialVersionUID = 1L;
@Column(name = "name")
private String name;
@Column(name = "timeStamp")
private LocalDateTime timeStamp;
and my conversion takes place as below:
stream.flatMap(new FlatMapFunction<String, Person>() {
public void flatMap(String value, Collector<Person> out) {
try {
out.collect(objectMapper.readValue(value, Person.class));
} catch (JsonProcessingException e) {
e.printStackTrace();
}
}
}).print(); // I need to use the proper method here to get a DataStream.
env.execute();
I read the documentation at the link below for reference:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/cassandra.html
The Cassandra sink accepts a DataStream instance. I need to take my converted records and store them into Cassandra.
The question "Cant create Cassandra Pojo Sink" also gave me some ideas.
There is a method .forward() which returns DataStream<Reading> forward, and when I pass that instance to
CassandraSink.addSink(forward)
.setHost("localhost")
.build();
I get: cannot access org.apache.flink.streaming.api.scala.DataStream
How can I convert my POJO stream so it can be stored in Cassandra?
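For reference, a minimal sketch of one possible wiring, assuming the Java DataStream API (org.apache.flink.streaming.api.datastream.DataStream) and the flink-connector-cassandra POJO sink: keep the result of flatMap() as a DataStream<Person> instead of calling print(), then hand that stream to CassandraSink.addSink(). This is only a sketch under those assumptions, not a verified answer.
// Sketch only; assumes imports for DataStream, CassandraSink, FlatMapFunction, Collector and
// JsonProcessingException, plus the stream, objectMapper and env variables from the snippets above.
DataStream<Person> persons = stream.flatMap(new FlatMapFunction<String, Person>() {
    @Override
    public void flatMap(String value, Collector<Person> out) {
        try {
            out.collect(objectMapper.readValue(value, Person.class));
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
    }
});

// Person carries the @Table/@Column mapping annotations, so the POJO sink can persist it directly.
CassandraSink.addSink(persons)
    .setHost("localhost")
    .build();

env.execute();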
My goal is to read a file from GCS and write it to Cassandra.
I am new to Apache Beam/Dataflow, and most of the hands-on material I could find is built with Python. Unfortunately, CassandraIO is only available natively in Java for Beam.
I used the word count example as a template and tried to get rid of the TextIO.write() and replace it with a CassandraIO.<Words>write().
Here is my Java class for the Cassandra table:
package org.apache.beam.examples;
import java.io.Serializable;
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;
@Table(keyspace = "test", name = "words", readConsistency = "ONE", writeConsistency = "QUORUM",
caseSensitiveKeyspace = false, caseSensitiveTable = false)
public class Words implements Serializable {
// private static final long serialVersionUID = 1L;
@PartitionKey
@Column(name = "word")
public String word;
@Column(name = "count")
public long count;
public Words() {
}
public Words(String word, int count) {
this.word = word;
this.count = count;
}
@Override
public boolean equals(Object obj) {
Words other = (Words) obj;
return this.word.equals(other.word) && this.count == other.count;
}
}
And here is the pipeline part of the main code:
static void runWordCount(WordCount.WordCountOptions options) {
Pipeline p = Pipeline.create(options);
// Concepts #2 and #3: Our pipeline applies the composite CountWords transform, and passes the
// static FormatAsTextFn() to the ParDo transform.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply(new WordCountToCassandra.CountWords())
// Here I'm not sure how to transform PCollection<KV<String, Long>> into PCollection<Words>
.apply(MapElements.into(TypeDescriptor.of(Words.class)).via(/* ??? */))
.apply(CassandraIO.<Words>write()
.withHosts(Collections.singletonList("my_ip"))
.withPort(9142)
.withKeyspace("test")
.withEntity(Words.class));
p.run().waitUntilFinish();
}
My understanding is that a PTransform is used to go from a PCollection<T1> to a PCollection<T2>. I don't know how to write that mapping.
If it's 1:1 mapping, MapElements.into is the right choice.
You can either specify a class that implements SerializableFunction<FromType, ToType>, or simply use a lambda, for example:
.apply(MapElements.into(TypeDescriptor.of(Words.class)).via(kv -> new Words(kv.getKey(), kv.getValue())));
Please check MapElements for more information.
If the transformation is not one-to-one, there are other available options such as FlatMapElements or ParDo.
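To tie this back to the pipeline in the question, here is a rough sketch of how that mapping step could slot in, assuming CountWords emits a PCollection<KV<String, Long>> as in the WordCount example (the .intValue() call is only there because the Words constructor above takes an int):
// Sketch only; assumes the imports already used in the question (MapElements, TypeDescriptor, KV, CassandraIO, Collections).
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
    .apply(new WordCountToCassandra.CountWords())
    // 1:1 mapping from KV<String, Long> to the annotated Words entity
    .apply(MapElements
        .into(TypeDescriptor.of(Words.class))
        .via((KV<String, Long> kv) -> new Words(kv.getKey(), kv.getValue().intValue())))
    .apply(CassandraIO.<Words>write()
        .withHosts(Collections.singletonList("my_ip"))
        .withPort(9142)
        .withKeyspace("test")
        .withEntity(Words.class));

p.run().waitUntilFinish();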
Versions: Datastax Java driver 3.1.4, Cassandra 3.10
Consider the following table:
create table object_ta
(
objid bigint,
version_date timestamp,
objecttype ascii,
primary key (objid, version_date)
);
And a mapped class:
@Table(name = "object_ta")
public class ObjectTa
{
@Column(name = "objid")
private long objid;
@Column(name = "version_date")
private Instant versionDate;
@Column(name = "objecttype")
private String objectType;
public ObjectTa()
{
}
public ObjectTa(long objid)
{
this.objid = objid;
this.versionDate = Instant.now();
}
public long getObjId()
{
return objid;
}
public void setObjId(long objid)
{
this.objid = objid;
}
public Instant getVersionDate()
{
return versionDate;
}
public void setVersionDate(Instant versionDate)
{
this.versionDate = versionDate;
}
public String getObjectType()
{
return objectType;
}
public void setObjectType(String objectType)
{
this.objectType = objectType;
}
}
After creating a mapper for this class (mm is a MappingManager for the session on mykeyspace)
final Mapper<ObjectTa> mapper = mm.mapper(ObjectTa.class);
On calling
mapper.save(new ObjectTa(1));
I get
Query preparation failed: INSERT INTO mykeyspace.object_ta (objid,objid,version_date,objecttype) VALUES (?,?,?,?);:
com.datastax.driver.core.exceptions.InvalidQueryException: The column names contains duplicates
at com.datastax.driver.core.Responses$Error.asException(Responses.java:136)
at com.datastax.driver.core.SessionManager$4.apply(SessionManager.java:220)
at com.datastax.driver.core.SessionManager$4.apply(SessionManager.java:196)
at com.google.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:906)
at com.google.common.util.concurrent.Futures$1$1.run(Futures.java:635)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
I am at a loss to understand why the duplicate objid is generated in the query.
Thank you in advance for pointers to the problem.
Clemens
I think it is because of the inconsistent use of case between the field name (objid) and the setters/getters (getObjId). If you rename getObjId and setObjId to getObjid and setObjid respectively, I believe it might work.
In a future release, the driver mapper will allow the user to be more explicit about whether setters/getters are used (JAVA-1310) and what the naming conventions are (JAVA-1316).
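If that is indeed the cause, the rename would look roughly like this (only the accessor names change; the field and its @Column mapping stay as they are):
// Accessor names aligned with the field's casing, so the mapper does not
// infer a second "objId" column alongside the @Column-mapped "objid".
public long getObjid()
{
    return objid;
}

public void setObjid(long objid)
{
    this.objid = objid;
}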
I want to implement a class that has a function that reads from HBase via Spark, like this:
public abstract class QueryNode implements Serializable{
private static final long serialVersionUID = -2961214832101500548L;
private int id;
private int parent;
protected static Configuration hbaseConf;
protected static Scan scan;
protected static JavaSparkContext sc;
public abstract RDDResult query();
public int getParent() {
return parent;
}
public void setParent(int parent) {
this.parent = parent;
}
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
public void setScanToConf() {
try {
ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
String scanToString = Base64.encodeBytes(proto.toByteArray());
hbaseConf.set(TableInputFormat.SCAN, scanToString);
} catch (IOException e) {
e.printStackTrace();
}
}}
This is a parent class; I have some subclasses that implement the method query() to read from HBase. But if I make Configuration, Scan, and JavaSparkContext non-static, I get errors saying these classes are not serializable.
Why must these fields be static? Is there some other way to solve this problem? Thanks.
You can try marking these fields as transient to avoid a serialization exception like
Caused by: java.io.NotSerializableException: org.apache.spark.streaming.api.java.JavaStreamingContext
This way you tell Java that you just don't want to serialize these fields:
protected transient Configuration hbaseConf;
protected transient Scan scan;
protected transient JavaSparkContext sc;
Are you initializing JavaSparkContext, Configuration, and Scan in main or in a static method? With static, your fields are shared across all instances, but whether static should be used depends on your use case.
Still, the transient approach is better than static, because serializing a JavaSparkContext does not make sense: it is created on the driver.
-- edit after discussion in comment:
Java doc for newAPIHadoopRDD:
public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>> JavaPairRDD<K,V> newAPIHadoopRDD(
        org.apache.hadoop.conf.Configuration conf,
        Class<F> fClass,
        Class<K> kClass,
        Class<V> vClass)
conf - Configuration for setting up the dataset. Note: This will be put into a Broadcast. Therefore if you plan to reuse this conf to create multiple RDDs, you need to make sure you won't modify the conf. A safe approach is always creating a new conf for a new RDD.
Broadcast:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
So basically I think that for this case static is OK (you create hbaseConf only once), but if you want to avoid static, you can follow the suggestion in the javadoc and always create a new conf for each new RDD.
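A minimal sketch of that suggestion, with an illustrative table name and assuming the usual HBase/Spark imports (the TableInputFormat/scan wiring mirrors setScanToConf() from the question):
// Build a fresh Configuration for each RDD instead of mutating a shared static one.
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "my_table"); // hypothetical table name
conf.set(TableInputFormat.SCAN, scanToString);      // scan serialized as in setScanToConf()

JavaPairRDD<ImmutableBytesWritable, Result> rdd =
        sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);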
I am using Hazelcast 3.6.1 and implementing distinct-aggregate functionality with a custom MapReduce job to get Solr-facet-like results.
public class DistinctMapper implements Mapper<String, Employee, String, Long>{
private transient SimpleEntry<String, Employee> entry = new SimpleEntry<String, Employee>();
private static final Long ONE = Long.valueOf(1L);
private Supplier<String, Employee, String> supplier;
public DistinctMapper(Supplier<String, Employee, String> supplier) {
this.supplier = supplier;
}
@Override
public void map(String key, Employee value, Context<String, Long> context) {
System.out.println("Object "+ entry + " and key "+key);
entry.setKey(key);
entry.setValue(value);
String fieldValue = (String) supplier.apply(entry);
//getValue(value, fieldName);
if (null != fieldValue){
context.emit(fieldValue, ONE);
}
}
}
The mapper is failing with a NullPointerException, and the sysout statement says the entry object is null.
SimpleEntry : https://github.com/hazelcast/hazelcast/blob/v3.7-EA/hazelcast/src/main/java/com/hazelcast/mapreduce/aggregation/impl/SimpleEntry.java
Can you point me to the issue in the above code? Thanks.
The entry field is transient. This means that it is not serialized, so when the DistinctMapper object is deserialized on a Hazelcast node, its value is null.
Removing the transient keyword will solve the NullPointerException.
On a side note:
Why do you need this entry field at all? It doesn't seem to have any use.
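A minimal sketch of the second suggestion, building the entry locally inside map() so there is no field that needs to be serialized at all (assuming the Hazelcast SimpleEntry linked above):
@Override
public void map(String key, Employee value, Context<String, Long> context) {
    // Local, per-call entry: nothing extra is shipped with the mapper.
    SimpleEntry<String, Employee> entry = new SimpleEntry<String, Employee>();
    entry.setKey(key);
    entry.setValue(value);
    String fieldValue = (String) supplier.apply(entry);
    if (fieldValue != null) {
        context.emit(fieldValue, ONE);
    }
}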
I have a User table and its corresponding POJO:
@Table
public class User {
@Column(name = "id")
private String id;
// lots of fields
@Column(name = "address")
@Frozen
private Optional<Address> address;
// getters and setters
}
@UDT
public class Address {
@Field(name = "id")
private String id;
@Field(name = "country")
private String country;
@Field(name = "state")
private String state;
@Field(name = "district")
private String district;
@Field(name = "street")
private String street;
@Field(name = "city")
private String city;
@Field(name = "zip_code")
private String zipCode;
// getters and setters
}
I want to map the UDT "address" to an Optional<Address>.
Because I use "cassandra-driver-mapping:3.0.0-rc1" and "cassandra-driver-extras:3.0.0-rc1", there are lots of codecs I can use.
For example: OptionalCodec.
I want to register it with the CodecRegistry and pass a TypeCodec to OptionalCodec's constructor.
But TypeCodec is an abstract class, so I can't instantiate it directly.
Does anyone have an idea how to instantiate an OptionalCodec?
Thank you, @Olivier Michallat. Your solution works!
But I was a little confused about how to register the OptionalCodec with the CodecRegistry.
You must initialize a Session first, then pass the Session to a MappingManager, get the correct TypeCodec, and register the codecs.
It's a little weird that you must initialize the Session first in order to get the TypeCodec!?
Cluster cluster = Cluster.builder()
.addContactPoints("127.0.0.1")
.build();
Session session = cluster.connect(...);
cluster.getConfiguration()
.getCodecRegistry()
.register(new OptionalCodec(new MappingManager(session).udtCodec(Address.class)))
.register(...);
// use session to operate DB
The MappingManager has a method that will create the codec from the annotated class:
TypeCodec<Address> addressCodec = mappingManager.udtCodec(Address.class);
OptionalCodec<Address> optionalAddressCodec = new OptionalCodec(addressCodec);
codecRegistry.register(optionalAddressCodec);
Not really an answer, but I hope it helps. I couldn't make Optional work with a UDT in Scala. However, List and Array work fine.
Here is a Scala solution for driver version 4.x:
val reg = session.getContext.getCodecRegistry
val yourTypeUdt: UserDefinedType = session.getMetadata.getKeyspace(keyspace).flatMap(_.getUserDefinedType("YOUR_TYPE")).get
val yourTypeCodec: TypeCodec[UserDefinedType] = reg.codecFor(yourTypeUdt)
reg.asInstanceOf[MutableCodecRegistry].register(TypeCodecs.listOf(yourTypeCodec))
Don't forget to use java.util.* types instead of your normal Scala types.