How to convert a DataFrame to a Dataset when an object reference typed as the parent class is held as a composition inside another class? - apache-spark

I am trying to convert a DataFrame to a Dataset, and the Java class structure is as follows:
class A:
public class A {
    private int a;

    public int getA() {
        return a;
    }

    public void setA(int a) {
        this.a = a;
    }
}
class B:
public class B extends A {
    private int b;

    public int getB() {
        return b;
    }

    public void setB(int b) {
        this.b = b;
    }
}
and class C:
public class C {
    private A a;

    public A getA() {
        return a;
    }

    public void setA(A a) {
        this.a = a;
    }
}
and the data in the DataFrame is as follows:
+-----+
| a |
+-----+
|[1,2]|
+-----+
When I apply Encoders.bean[C](classOf[C]) to the DataFrame, the object reference in class C (declared as A, but actually an instance of B) does not return true when I check .isInstanceOf[B]; I get false. The output of the Dataset is as follows:
+-----+
| a |
+-----+
|[1,2]|
+-----+
How do we get all the fields of A and B under the C object while iterating over it in foreach?
Code:
object TestApp extends App {
  implicit val sparkSession = SparkSession.builder()
    .appName("Test-App")
    .config("spark.sql.codegen.wholeStage", value = false)
    .master("local[1]")
    .getOrCreate()

  var schema = new StructType()
    .add("a", new ArrayType(new StructType().add("a", IntegerType, true).add("b", IntegerType, true), true))

  var dd = sparkSession.read.schema(schema).json("Test.txt")
  var ff = dd.as(Encoders.bean[C](classOf[C]))
  ff.show(truncate = false)

  ff.foreach(f => {
    println(f.getA.get(0).isInstanceOf[A]) // ---true
    println(f.getA.get(0).isInstanceOf[B]) // ---false
  })
}
Content of file Test.txt: {"a":[{"a":1,"b":2}]}

Spark Catalyst uses Google reflection to derive a schema from Java beans.
Take a look at JavaTypeInference.scala#inferDataType. It collects the field names from the bean's getters and computes the Spark type from each getter's return type.
Since class C has a getter getA() whose declared return type is A, and A in turn has a getter getA() whose return type is int, the schema is created as struct<a:struct<a:int>>, where struct<a:int> is derived from the getA of class A. The declared type A is all the encoder ever sees, which is why the b field never makes it into the schema and the deserialized object is not an instance of B.
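You can see this for yourself by printing the schema the bean encoder infers. A minimal Java sketch (the question uses Scala, but the call is the same):
// Inspect the schema Catalyst infers from bean class C:
// only A's fields appear, because the getter is declared as A.
Encoders.bean(C.class).schema().printTreeString();
// root
//  |-- a: struct (nullable = true)
//  |    |-- a: integer (nullable = false)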
The solution to this problem that I can think of is:
// Modify your class C to hold the real class reference rather than its supertype
public class C {
    private B a;

    public B getA() {
        return a;
    }

    public void setA(B a) {
        this.a = a;
    }
}
Output:
root
|-- a: struct (nullable = true)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
+------+
|a |
+------+
|[1, 2]|
+------+
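With C holding a B, both fields are materialized. A hedged Java sketch of iterating over the result (the question's code is Scala, but the idea carries over; dd is assumed to be the Dataset<Row> read as in the question, and ForeachFunction comes from org.apache.spark.api.java.function):
Dataset<C> ds = dd.as(Encoders.bean(C.class));
ds.foreach((ForeachFunction<C>) c -> {
    // both fields are populated now that the schema includes b
    System.out.println(c.getA().getA() + ", " + c.getA().getB());
});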

Related

Spark Streaming Convert Dataset<Row> to Dataset<CustomObject> in java

I've recently started working with Apache Spark and came across a requirement where I need to read a Kafka stream and feed the data into Cassandra. While doing so I ran into an issue: the streams are SQL-based, while the Cassandra connector works on RDDs (I may be wrong here, please do correct me), so I was struggling to get this working. I somehow made it work for now, but I'm not sure that's the right way to implement it.
Below is the code
Schema
StructType getSchema() {
    StructField[] structFields = new StructField[]{
        new StructField("id", DataTypes.LongType, true, Metadata.empty()),
        new StructField("name", DataTypes.StringType, true, Metadata.empty()),
        new StructField("cat", DataTypes.StringType, true, Metadata.empty()),
        new StructField("tag", DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty())
    };
    return new StructType(structFields);
}
Stream reader:
Dataset<Row> results = kafkaDataset.select(
    col("key").cast("string"),
    from_json(col("value").cast("string"), getSchema()).as("value"),
    col("topic"),
    col("partition"),
    col("offset"),
    col("timestamp"),
    col("timestampType"));
results.select("value.*")
    .writeStream()
    .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
        @Override
        public void call(Dataset<Row> dataset, Long batchId) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            List<DealFeedSchema> list = new ArrayList<>();
            List<Row> rowList = dataset.collectAsList();
            if (!rowList.isEmpty()) {
                rowList.forEach(row -> {
                    if (row == null) {
                        logger.info("Null DataSet");
                    } else {
                        try {
                            list.add(mapper.readValue(row.json(), DealFeedSchema.class));
                        } catch (JsonProcessingException e) {
                            logger.error("error parsing Data", e);
                        }
                    }
                });
                JavaRDD<DealFeedSchema> rdd = new JavaSparkContext(session.sparkContext()).parallelize(list);
                javaFunctions(rdd).writerBuilder(Constants.CASSANDRA_KEY_SPACE,
                        Constants.CASSANDRA_DEAL_TABLE_SPACE, mapToRow(DealFeedSchema.class)).saveToCassandra();
            }
        }
    })
    .start().awaitTermination();
Although this works fine, I need to know if there's a better way to do this. If there is, please let me know how to achieve it.
Thanks in advance.
For those who are looking for a way, you can refer to this code as an alternative. :)
To convert Dataset<Row> to Dataset<DealFeedSchema> in Java:
1. Java Bean for DealFeedSchema
import java.util.List;

public class DealFeedSchema {
    private long id;
    private String name;
    private String cat;
    private List<String> tag;

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getCat() {
        return cat;
    }

    public void setCat(String cat) {
        this.cat = cat;
    }

    public List<String> getTag() {
        return tag;
    }

    public void setTag(List<String> tag) {
        this.tag = tag;
    }
}
2. Load the test data
Dataset<Row> dataFrame = spark.createDataFrame(Arrays.asList(
        RowFactory.create(1L, "foo", "cat1", Arrays.asList("tag1", "tag2"))
), getSchema());
dataFrame.show(false);
dataFrame.printSchema();
/**
* +---+----+----+------------+
* |id |name|cat |tag |
* +---+----+----+------------+
* |1 |foo |cat1|[tag1, tag2]|
* +---+----+----+------------+
*
* root
* |-- id: long (nullable = true)
* |-- name: string (nullable = true)
* |-- cat: string (nullable = true)
* |-- tag: array (nullable = true)
* | |-- element: string (containsNull = true)
*/
3. Convert Dataset<Row> to Dataset<DealFeedSchema>
Dataset<DealFeedSchema> dealFeedSchemaDataset = dataFrame.as(Encoders.bean(DealFeedSchema.class));
dealFeedSchemaDataset.show(false);
dealFeedSchemaDataset.printSchema();
/**
* +---+----+----+------------+
* |id |name|cat |tag |
* +---+----+----+------------+
* |1 |foo |cat1|[tag1, tag2]|
* +---+----+----+------------+
*
* root
* |-- id: long (nullable = true)
* |-- name: string (nullable = true)
* |-- cat: string (nullable = true)
* |-- tag: array (nullable = true)
* | |-- element: string (containsNull = true)
*/
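From here you can operate on the beans directly instead of raw rows. A small hedged sketch (ForeachFunction is from org.apache.spark.api.java.function):
dealFeedSchemaDataset.foreach((ForeachFunction<DealFeedSchema>) deal ->
        System.out.println(deal.getId() + " -> " + deal.getName()));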
Just write the data from Spark Structured Streaming without converting to an RDD: you only need to switch to Spark Cassandra Connector 2.5.0, which added this capability together with much more. When you use it, your code will look like the following (I don't have a Java example, but it should be similar to this):
val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "some_checkpoint_location")
  .option("keyspace", "test")
  .option("table", "sttest_tweets")
  .start()
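For completeness, a hedged Java sketch of the same write against the results DataFrame and the Constants from the question (untested, since the answer above only shows Scala):
results.select("value.*")
    .writeStream()
    .outputMode(OutputMode.Update())
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "some_checkpoint_location")
    .option("keyspace", Constants.CASSANDRA_KEY_SPACE)
    .option("table", Constants.CASSANDRA_DEAL_TABLE_SPACE)
    .start()
    .awaitTermination();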

How to add a DataTable through the type registry in Cucumber 4.0

For Cucumber 3.0 I was using:
typeRegistry.defineDataTableType(DataTableType.entry(CustomData.class));

public class CustomData {
    private int id;
    private int val;
    private Region region;
    private Boolean isExisting;
    private String type;
    // getter and setter methods
}
How do I convert this in Cucumber 4.0.0 as part of configureTypeRegistry?
My step in the feature file is:
When I set the custom data
| region | id | val | isExisting | type   |
| NA     | 2  | 10  | true       | custom |
There are a few ways to do it. For options 2 and 3 below you'll have to add a dependency on jackson-databind to your project.
import com.fasterxml.jackson.databind.ObjectMapper;
import io.cucumber.core.api.TypeRegistry;
import io.cucumber.core.api.TypeRegistryConfigurer;
import io.cucumber.datatable.DataTableType;
import java.util.Map;
class TypeRegistryConfiguration implements TypeRegistryConfigurer {
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public void configureTypeRegistry(TypeRegistry typeRegistry) {
        // 1. Define the mapping yourself.
        // (1 and 2 are alternatives; register only one DataTableType per class)
        typeRegistry.defineDataTableType(
            new DataTableType(MyType.class,
                (Map<String, String> entry) -> {
                    MyType object = new MyType();
                    object.setX(entry.get("X"));
                    return object;
                }
            )
        );
        // 2. Define a data table type that delegates to an object mapper
        typeRegistry.defineDataTableType(
            new DataTableType(MyType.class,
                (Map<String, String> entry) -> objectMapper.convertValue(entry, MyType.class)
            )
        );
        // 3. Define a default data table entry that takes care of all mappings
        typeRegistry.setDefaultDataTableEntryTransformer(
            (entryValue, toValueType, cellTransformer) ->
                objectMapper.convertValue(entryValue, objectMapper.constructType(toValueType)));
    }
}
And in v5 you would do it like this:
import com.fasterxml.jackson.databind.ObjectMapper;
import io.cucumber.java.DataTableType;
import io.cucumber.java.DefaultDataTableEntryTransformer;
import java.lang.reflect.Type;
import java.util.Map;
class TypeRegistryConfiguration {
    private final ObjectMapper objectMapper = new ObjectMapper();

    // 1. Define the mapping yourself
    // (1 and 2 have the same signature and are alternatives; keep only one)
    @DataTableType
    public MyType myType(Map<String, String> entry) {
        MyType object = new MyType();
        object.setX(entry.get("X"));
        return object;
    }

    // 2. Define a data table type that delegates to an object mapper
    @DataTableType
    public MyType myType(Map<String, String> entry) {
        return objectMapper.convertValue(entry, MyType.class);
    }

    // 3. Define a default data table entry that takes care of all mappings
    @DefaultDataTableEntryTransformer
    public Object defaultDataTableEntry(Map<String, String> entry, Type toValueType) {
        return objectMapper.convertValue(entry, objectMapper.constructType(toValueType));
    }
}
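Either way, a step definition can then consume the table directly as a list of beans. A hedged sketch against the CustomData table from the question (the class and method names are illustrative):
import io.cucumber.java.en.When;
import java.util.List;

public class CustomDataSteps {
    @When("I set the custom data")
    public void iSetTheCustomData(List<CustomData> rows) {
        // each table row is converted to a CustomData instance by the
        // registered DataTableType (or by the default entry transformer)
        rows.forEach(row -> System.out.println(row));
    }
}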

Is there a way to find out which class instance created 'this'

I (one instance of a class) want to find out which class instantiated me.
I have a class C that is instantiated by class A and class B. I want to find out which class instantiated me, so that I can access a variable from that class.
The usual way is to pass an identifier into C's constructor, saying "hey, I am from class A", along with the variable x, for C to consume in whatever way is appropriate for it.
e.g.:
public class A {
    public int x;

    public A() {
        C c = new C();
    }
}

public class B {
    public int x;

    public B() {
        C c = new C();
    }
}

public class C {
    public void CMethod() {
        // I want to access int x from the class that instantiated me,
        // e.g. if I know it's B then B.x ...
    }
}
There is no way to know without some hacking (see below). This looks like a case for an interface…
Classes A and B implement HasX, which defines a getX() method. You can pass an instance of either class to the constructor of C, which accepts anything that implements HasX. C can then call getX() on the object without needing to know which type it actually is, and it will get the appropriate x value.
public interface HasX {
    public int getX();
}

public class A implements HasX {
    private int x;

    public A() {
        C c = new C(this);
    }

    public int getX() {
        return x;
    }
}

public class B implements HasX {
    private int x;

    public B() {
        C c = new C(this);
    }

    public int getX() {
        return x;
    }
}

public class C {
    HasX hasX;

    public C(HasX hasX) {
        this.hasX = hasX;
    }

    public void doStuff() {
        int x = hasX.getX();
    }
}
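For illustration, a minimal sketch of wiring this up by hand (the constructors above already do the same via new C(this)):
HasX provider = new A();  // could just as well be new B()
C c = new C(provider);
c.doStuff();              // reads x through the interface, whichever class provided it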
To answer your original question though: the object which created an object is not stored anywhere… but you can do some hacking when C is constructed to find out the class. Here is some code I once used for a logging implementation, which detects the caller by looking back along the stack trace of a Throwable. Again, this is not good practice, but you asked, so… :)
From: https://github.com/slipperyseal/atomicobjects/blob/master/atomicobjects-lang/src/main/java/net/catchpole/trace/PrintTrace.java
public C() {
    String whoCalledMe = whereAmI(new Throwable());
}

private String whereAmI(Throwable throwable) {
    for (StackTraceElement ste : throwable.getStackTrace()) {
        String className = ste.getClassName();
        // search stack for first element not within this class
        if (!className.equals(this.getClass().getName())) {
            int dot = className.lastIndexOf('.');
            if (dot != -1) {
                className = className.substring(dot + 1);
            }
            return className + '.' + ste.getMethodName();
        }
    }
    return "";
}
You might want to edit this to simply return the class name, or even do a Class.forName() to resolve the actual class.
If you want the actual objects, and there is only ever one instance of each class, you could put the objects in a Map keyed on class name. But gee, what a mess around :)
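A hedged sketch of that Map idea, assuming exactly one instance per class (the registry name is illustrative):
Map<String, HasX> registry = new HashMap<>();
registry.put(A.class.getName(), new A());
registry.put(B.class.getName(), new B());

// later, C could look its creator up by class name
HasX creator = registry.get(A.class.getName());
int x = creator.getX();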

Do not include a few parent class elements when XML is constructed from the child

Is it possible to omit a few fields from the parent class when XML is constructed out of the child class, while those elements are still present when XML is constructed from the parent class?
Example
Parent class
@XmlRootElement(name = "location")
@XmlType(propOrder = { "id", "name" })
@JsonPropertyOrder({ "id", "name" })
public class Parent {
    private Integer id;
    private String name;

    @XmlElement(name = "id", nillable = true)
    @JsonProperty("id")
    public Integer getId() {
        return id;
    }

    @JsonProperty("id")
    public void setId(Integer id) {
        this.id = id;
    }

    @XmlElement(name = "name", nillable = true)
    @JsonProperty("name")
    public String getName() {
        return name;
    }

    @JsonProperty("name")
    public void setName(String name) {
        this.name = name;
    }
}
Child class
@XmlRootElement(name = "location")
@XmlType(propOrder = { "id" })
@JsonPropertyOrder({ "id" })
public class Child extends Parent {
    @XmlElement(name = "id", nillable = true)
    @JsonProperty("id")
    public Integer getId() {
        return super.getId();
    }

    @JsonProperty("id")
    public void setId(Integer id) {
        super.setId(id);
    }
}
I do not want the name field when XML is constructed from the child class. However, it should be present when XML is constructed from the parent class.
Try overriding the getter and setter for name in the subclass and annotating them with @JsonIgnore and/or @XmlTransient.
EDIT
Indeed, @XmlTransient does not work with polymorphism as I expected (and as @JsonIgnore does). What you can try is:
- move all the content of the Parent class to an abstract Base class
- mark Base as @XmlTransient
- make Parent extend Base and add no content to it
- make Child extend Base
Here is some synthetic example I worked on. It can easily be translated to your particular classes.
Class Base
@XmlRootElement(name = "location")
@XmlSeeAlso(value = {Parent.class, Child.class})
@XmlTransient
public abstract class Base {
    private String a;
    private String b;

    @XmlElement(name = "a")
    public String getA() {
        return a;
    }

    public void setA(String a) {
        this.a = a;
    }

    @XmlElement(name = "b", nillable = true)
    public String getB() {
        return b;
    }

    public void setB(String b) {
        this.b = b;
    }
}
Class Parent
@XmlRootElement(name = "location")
@XmlType
public class Parent extends Base {
}
Class Child
@XmlRootElement(name = "location")
@XmlType
public class Child extends Base {
    private String c;

    @XmlElement(name = "c")
    public String getC() {
        return c;
    }

    public void setC(String c) {
        this.c = c;
    }

    @Override
    @XmlTransient
    public String getB() {
        return super.getB();
    }

    @Override
    public void setB(String b) {
        super.setB(b);
    }
}
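To check the workaround, a hedged JAXB marshalling sketch (standard javax.xml.bind API; field values omitted for brevity):
JAXBContext context = JAXBContext.newInstance(Parent.class, Child.class);
Marshaller marshaller = context.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
marshaller.marshal(new Parent(), System.out); // includes <b>
marshaller.marshal(new Child(), System.out);  // no <b> element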
Obviously, if the class hierarchy grows larger, it may be harder to maintain such a workaround. In those cases, you may think about choosing composition rather than inheritance.

Accessing the value of a class variable defined in one class from another class

I have the following scenario. I have three classes:
Class A
Class B
Class C
In class A an object of class B is created.
In class B an object of class C is created.
There is a public class variable defined in class C which I want to access using an object of class A in a page.
Is there any way to do this directly?
Thanks in advance
Regards
Mathew
You could create a property on A that references the C object:
class A
{
    public B B { get; set; }
    public int CFoo { get { return B.C.Foo; } set { B.C.Foo = value; } }
    public A() { B = new B(); }
}

class B
{
    public C C { get; set; }
    public B() { C = new C(); }
}

class C
{
    public int Foo { get; set; }
}
From your page, you would do this:
A a = new A();
// sets A.B.C.Foo
a.CFoo = 1;
