How to customize column mappings with Spark Cassandra Connector in Java? - apache-spark

I want to change a column mapping to append to a collection instead of overwriting it. Is there a better way to customize column mappings with the Spark Cassandra Connector in Java than the following?
ColumnName song_id = new ColumnName("song_id", Option.empty());
CollectionColumnName key_codes = new ColumnName("key_codes", Option.empty()).append();
List<ColumnRef> collectionColumnNames = Arrays.asList(song_id, key_codes);
scala.collection.Seq<ColumnRef> columnRefSeq = JavaApiHelper.toScalaSeq(collectionColumnNames);
javaFunctions(songStream)
.writerBuilder("demo", "song", mapToRow(PianoSong.class))
.withColumnSelector(new SomeColumns(columnRefSeq))
.saveToCassandra();
This is taken from this Spark Streaming code sample.

Just make your column refs using CollectionColumnName, which has the constructor:
case class CollectionColumnName(
columnName: String,
alias: Option[String] = None,
collectionBehavior: CollectionBehavior = CollectionOverwrite) extends ColumnRef
You can rename the column by setting alias, and you can change the insert behavior with collectionBehavior, which takes one of the following objects:
/** Insert behaviors for Collections. */
sealed trait CollectionBehavior
case object CollectionOverwrite extends CollectionBehavior
case object CollectionAppend extends CollectionBehavior
case object CollectionPrepend extends CollectionBehavior
case object CollectionRemove extends CollectionBehavior
This means you can just do:
CollectionColumnName appendColumn =
    new CollectionColumnName("ColumnName", Option.empty(), CollectionAppend$.MODULE$);
which looks a bit more Java-y and is a bit more explicit. Did you have any other goals for this code?
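For comparison, a minimal, untested sketch of the same append write through the connector's Scala API (the RDD contents and contact point are made up; only the "demo"/"song" keyspace and table come from the question):
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("append-example")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
val sc = new SparkContext(conf)

// Each tuple maps positionally to (song_id, key_codes).
val songs = sc.parallelize(Seq(("song-1", List(1, 2, 3))))

// "key_codes" append builds a CollectionColumnName with CollectionAppend,
// so the written elements are appended to the existing collection.
songs.saveToCassandra("demo", "song", SomeColumns("song_id", "key_codes" append))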

Related

Apache Spark: pass Column as Transformer parameter

I defined a pipeline Transformer like this:
class MyTransformer(condition: Column) extends SparkTransformer {
override def transform(dataset: Dataset[_]): DataFrame = {...}
}
which is then used in a pipeline:
val pipeline = new Pipeline()
pipeline.setStages(Array(new MyTransformer(col("test").equalTo(lit("value")))))
pipeline.fit(df).transform(mydf)
In my transformer, I want to apply a transformation only on rows that verify the condition.
It results in a serialization issue:
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: (test = value))
- field (class: my.project.MyTransformer, name: condition, type: class org.apache.spark.sql.Column)
- ...
In my understanding, the Transformers are serialized to be dispatched to the executors, so every parameter should be serializable.
How can I bypass it? Is there a workaround?
Thx.
This question seems a bit old...
I don't know if my (untested) idea matches your needs.
A solution could be to use a SQL expression (a String instance):
val pipeline = new Pipeline()
pipeline.setStages(Array(new MyTransformer("test = 'value'")))
pipeline.fit(df).transform(mydf)
and to use functions.expr() to convert the expression String to a Column instance in the Transformer.transform method.
This way, the condition is serializable, and the non-serializable objects are created only when needed on the executors.
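A minimal, untested sketch of that idea, using the standard org.apache.spark.ml.Transformer base class (the filter in transform is just placeholder logic; class and column names are made up):
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset}

// The condition is kept as a plain String (serializable); the Column is only
// built inside transform(), so nothing non-serializable is captured.
class MyTransformer(condition: String, override val uid: String) extends Transformer {

  def this(condition: String) = this(condition, Identifiable.randomUID("myTransformer"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.filter(expr(condition)).toDF() // placeholder: keep rows matching the condition

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): MyTransformer =
    copyValues(new MyTransformer(condition, uid), extra)
}

// Usage, as in the answer above:
// new Pipeline().setStages(Array(new MyTransformer("test = 'value'")))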

How to create an update statement where a UDT value need to be updated using QueryBuilder

I have the following udt type
CREATE TYPE tag_partitions(
year bigint,
month bigint);
and the following table
CREATE TABLE ${tableName} (
tag text,
partition_info set<FROZEN<tag_partitions>>,
PRIMARY KEY ((tag))
)
The table schema is mapped using the following model
case class TagPartitionsInfo(year:Long, month:Long)
case class TagPartitions(tag:String, partition_info:Set[TagPartitionsInfo])
I have written a function which should create an Update.IfExists query, but I don't know how I should update the UDT value. I tried to use set but it isn't working.
def updateValues(tableName:String, model:TagPartitions, id:TagPartitionKeys):Update.IfExists = {
val partitionInfoType:UserType = session.getCluster().getMetadata
.getKeyspace("codingjedi").getUserType("tag_partitions")
//create value
//the logic below assumes that there is only one element in the set
val partitionsInfoSet:Set[UDTValue] = model.partition_info.map((partitionInfo:TagPartitionsInfo) =>{
partitionInfoType.newValue()
.setLong("year",partitionInfo.year)
.setLong("month",partitionInfo.month)
})
println("partition info converted to UDTValue: "+partitionsInfoSet)
QueryBuilder.update(tableName).
`with`(QueryBuilder.WHAT_TO_DO_HERE_TO_UPDATE_UDT("partition_info",partitionsInfoSet))
.where(QueryBuilder.eq("tag", id.tag)).ifExists()
}
The mistake was that I was passing partitionsInfoSet as a Scala Set, but the driver needs a Java Set. I needed to convert it using setAsJavaSet:
QueryBuilder.update(tableName).`with`(QueryBuilder.set("partition_info",setAsJavaSet(partitionsInfoSet)))
.where(QueryBuilder.eq("tag", id.tag))
.ifExists()
}
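For reference, an untested variation of the same fix using scala.collection.JavaConverters' .asJava decorator instead of setAsJavaSet (a style preference, not from the original answer; tableName, partitionsInfoSet and id are the names from the function above):
import scala.collection.JavaConverters._
import com.datastax.driver.core.querybuilder.QueryBuilder

// .asJava converts the Scala Set[UDTValue] into a java.util.Set the driver can bind.
QueryBuilder.update(tableName)
  .`with`(QueryBuilder.set("partition_info", partitionsInfoSet.asJava))
  .where(QueryBuilder.eq("tag", id.tag))
  .ifExists()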
Although it doesn't answer your exact question, wouldn't it be easier to use the Object Mapper for this? Something like this (I didn't modify it heavily to match your code):
@UDT(name = "scala_udt")
case class UdtCaseClass(id: Integer, @(Field @field)(name = "t") text: String) {
def this() {
this(0, "")
}
}
@Table(name = "scala_test_udt")
case class TableObjectCaseClassWithUDT(@(PartitionKey @field) id: Integer,
udts: java.util.Set[UdtCaseClass]) {
def this() {
this(0, new java.util.HashSet[UdtCaseClass]())
}
}
and then just create an instance of the case class and call mapper.save on it. (Also note that you need to use Java collections unless you have imported the Scala codecs.)
The primary reasons for using the Object Mapper are ease of use and better performance, because it uses prepared statements under the hood instead of built statements, which are much less efficient.
You can find more information about the Object Mapper + Scala in an article that I wrote recently.
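An untested usage sketch for the annotated classes above, assuming the DataStax Java driver 3.x object mapper (the contact point is made up; "codingjedi" is the keyspace from the question):
import com.datastax.driver.core.Cluster
import com.datastax.driver.mapping.MappingManager

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("codingjedi")

// The MappingManager builds a Mapper for the annotated table class.
val manager = new MappingManager(session)
val mapper = manager.mapper(classOf[TableObjectCaseClassWithUDT])

// save() issues a prepared INSERT under the hood.
val udts = new java.util.HashSet[UdtCaseClass]()
udts.add(UdtCaseClass(1, "some text"))
mapper.save(TableObjectCaseClassWithUDT(1, udts))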

Spark : Create dataframe with default values

Can we put a default value in a field of a dataframe while creating it? I am creating a Spark dataframe from List<Object[]> rows as:
List<org.apache.spark.sql.Row> sparkRows = rows.stream().map(RowFactory::create).collect(Collectors.toList());
Dataset<org.apache.spark.sql.Row> dataset = session.createDataFrame(sparkRows, schema);
While looking for a way, I found that org.apache.spark.sql.types.DataTypes contains an object of the org.apache.spark.sql.types.Metadata class. The documentation does not specify the exact purpose of the class:
/**
* Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean,
* Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and
* Array[Metadata]. JSON is used for serialization.
*
* The default constructor is private. User should use either [[MetadataBuilder]] or
* `Metadata.fromJson()` to create Metadata instances.
*
* @param map an immutable map that stores the data
*
* @since 1.3.0
*/
This class supports only a very limited set of datatypes, and there is no out-of-the-box API for using it to set a default value during dataset creation.
Where does one use this metadata? Can someone share a real-life use case?
I know we could have our own map function iterate over rows.stream().map(RowFactory::create) and put in default values. But is there any way we could do this using the Spark APIs?
Edit: I am expecting something similar to Oracle's DEFAULT functionality: we define a default value for each column, according to its datatype, and while creating the dataframe, if a value is missing or null, this default value is used.
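For what it's worth, an untested sketch of one partial workaround using only the DataFrame API: fill nulls with per-column defaults right after creating the dataframe (this covers present-but-null values, not missing fields; the column names and defaults below are made up):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("defaults-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((Some(1), None: Option[String]), (None: Option[Int], Some("b")))
  .toDF("id", "name")

// na.fill substitutes the given default wherever the column is null.
val withDefaults = df.na.fill(Map("id" -> -1, "name" -> "unknown"))
withDefaults.show()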

Spark custom estimator including persistence

I want to develop a custom estimator for Spark which also handles persistence of the great Pipeline API. But as "How to Roll a Custom Estimator in PySpark mllib" put it, there is not a lot of documentation out there (yet).
I have some data cleansing code written in Spark and would like to wrap it in a custom estimator. Some NA substitutions, column deletions, filtering and basic feature generation are included (e.g. birth date to age).
transformSchema will use the case class of the dataset: ScalaReflection.schemaFor[MyClass].dataType.asInstanceOf[StructType]
fit will only fit, e.g., the mean age to use as NA substitute.
What is still pretty unclear to me:
transform in the custom pipeline model will be used to apply the "fitted" estimator to new data. Is this correct? If yes, how should I transfer the fitted values, e.g. the mean age from above, into the model?
how to handle persistence? I found some generic loadImpl method within private Spark components, but I am unsure how to transfer my own parameters, e.g. the mean age, into the MLReader / MLWriter which are used for serialization.
It would be great if you could help me with a custom estimator - especially with the persistence part.
First of all, I believe you're mixing up two different things:
Estimators, which represent stages that can be fitted. The Estimator fit method takes a Dataset and returns a Transformer (model).
Transformers, which represent stages that can transform data.
When you fit a Pipeline it fits all the Estimators and returns a PipelineModel. The PipelineModel can transform data by sequentially calling transform on all the Transformers in the model.
how should I transfer the fitted values
There is no single answer to this question. In general you have two options:
Pass parameters of the fitted model as the arguments of the Transformer.
Make parameters of the fitted model Params of the Transformer.
The first approach is typically used by the built-in Transformers, but the second one should work in some simple cases; a minimal sketch of the first option follows below.
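A minimal, untested sketch of the first option, with made-up names (the fitted mean age is passed straight into the model's constructor and an "age" column is assumed):
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.functions.{coalesce, col, lit}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset}

// The fitted value (meanAge) lives in the model as a plain constructor argument.
class MeanAgeModel(override val uid: String, val meanAge: Double)
    extends Model[MeanAgeModel] {

  def this(meanAge: Double) = this(Identifiable.randomUID("meanAgeModel"), meanAge)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("age", coalesce(col("age"), lit(meanAge))) // replace null ages

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): MeanAgeModel =
    copyValues(new MeanAgeModel(uid, meanAge), extra)
}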
how to handle persistence
If the Transformer is defined only by its Params, you can extend DefaultParamsReadable.
If you use more complex arguments, you should extend MLWritable and implement an MLWriter that makes sense for your data. There are multiple examples in the Spark source which show how to implement data and metadata reading / writing.
If you're looking for an easy-to-comprehend example, take a look at CountVectorizer(Model), where:
Estimator and Transformer share common Params.
Model vocabulary is a constructor argument, model parameters are inherited from the parent.
Metadata (parameters) is written and read using DefaultParamsWriter / DefaultParamsReader.
Custom implementation handles data (vocabulary) writing and reading.
The following uses the Scala API but you can easily refactor it to Python if you really want to...
First things first:
Estimator: implements .fit() that returns a Transformer
Transformer: implements .transform() and manipulates the DataFrame
Serialization/Deserialization: Do your best to use the built-in Params and leverage the simple DefaultParamsWritable trait + a companion object extending DefaultParamsReadable[T]. In other words, stay away from MLReader / MLWriter and keep your code simple.
Parameter passing: Use a common trait extending Params and share it between your Estimator and Model (a.k.a. Transformer).
Skeleton code:
// Common Parameters
trait MyCommonParams extends Params {
final val inputCols: StringArrayParam = // usage: new MyMeanValueStuff().setInputCols(...)
new StringArrayParam(this, "inputCols", "doc...")
def setInputCols(value: Array[String]): this.type = set(inputCols, value)
def getInputCols: Array[String] = $(inputCols)
final val meanValues: DoubleArrayParam =
new DoubleArrayParam(this, "meanValues", "doc...")
// more setters and getters
}
// Estimator
class MyMeanValueStuff(override val uid: String) extends Estimator[MyMeanValueStuffModel]
with DefaultParamsWritable // Enables Serialization of MyCommonParams
with MyCommonParams {
override def copy(extra: ParamMap): Estimator[MyMeanValueStuffModel] = defaultCopy(extra) // default
override def transformSchema(schema: StructType): StructType = schema // no changes
override def fit(dataset: Dataset[_]): MyMeanValueStuffModel = {
// your logic here. I can't do all the work for you! ;)
this.setMeanValues(computedMeans) // computedMeans: Array[Double] produced by your logic above
copyValues(new MyMeanValueStuffModel(uid + "_model").setParent(this))
}
}
// Companion object enables deserialization of MyCommonParams
object MyMeanValueStuff extends DefaultParamsReadable[MyMeanValueStuff]
// Model (Transformer)
class MyMeanValueStuffModel(override val uid: String) extends Model[MyMeanValueStuffModel]
with DefaultParamsWritable // Enables Serialization of MyCommonParams
with MyCommonParams {
override def copy(extra: ParamMap): MyMeanValueStuffModel = defaultCopy(extra) // default
override def transformSchema(schema: StructType): StructType = schema // no changes
override def transform(dataset: Dataset[_]): DataFrame = {
// your logic here: zip inputCols and meanValues, toMap, replace nulls with NA functions
// you have access to both inputCols and meanValues here!
}
}
// Companion object enables deserialization of MyCommonParams
object MyMeanValueStuffModel extends DefaultParamsReadable[MyMeanValueStuffModel]
With the code above you can Serialize/Deserialize a Pipeline containing a MyMeanValueStuff stage.
Want to look at a really simple implementation of an Estimator? MinMaxScaler! (My example is actually simpler, though...)
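An untested usage sketch for the skeleton above (the column name and save path are made up, and fit() only works once the "your logic here" parts are filled in):
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("estimator-sketch").getOrCreate()
val df = spark.range(10).toDF("age")

val pipeline = new Pipeline().setStages(Array(
  new MyMeanValueStuff("myMeanValueStuff").setInputCols(Array("age"))))

// fit() produces a MyMeanValueStuffModel inside the PipelineModel ...
val model = pipeline.fit(df)
model.write.overwrite().save("/tmp/my-mean-value-pipeline")

// ... and the DefaultParamsReadable companions let it be loaded back.
val reloaded = PipelineModel.load("/tmp/my-mean-value-pipeline")
reloaded.transform(df).show()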

How to use java.time.LocalDate in Cassandra query from Spark?

We have a table in Cassandra with a column start_time of type date.
When we execute the following code:
val resultRDD = inputRDD.joinWithCassandraTable(KEY_SPACE,TABLE)
.where("start_time = ?", java.time.LocalDate.now)
We get the following error:
com.datastax.spark.connector.types.TypeConversionException: Cannot convert object 2016-10-13 of type class java.time.LocalDate to com.datastax.driver.core.LocalDate.
at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:45)
at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$LocalDateConverter$$anonfun$convertPF$14.applyOrElse(TypeConverter.scala:449)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$LocalDateConverter$.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:439)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$LocalDateConverter$.convert(TypeConverter.scala:439)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter$$anonfun$convertPF$29.applyOrElse(TypeConverter.scala:788)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:771)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.convert(TypeConverter.scala:771)
at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$8.apply(BoundStatementBuilder.scala:93)
I've tried to register custom converters according to the documentation:
object JavaLocalDateToCassandraLocalDateConverter extends TypeConverter[com.datastax.driver.core.LocalDate] {
def targetTypeTag = typeTag[com.datastax.driver.core.LocalDate]
def convertPF = {
case ld: java.time.LocalDate => com.datastax.driver.core.LocalDate.fromYearMonthDay(ld.getYear, ld.getMonthValue, ld.getDayOfMonth)
case _ => com.datastax.driver.core.LocalDate.fromYearMonthDay(1971, 1, 1)
}
}
object CassandraLocalDateToJavaLocalDateConverter extends TypeConverter[java.time.LocalDate] {
def targetTypeTag = typeTag[java.time.LocalDate]
def convertPF = { case ld: com.datastax.driver.core.LocalDate => java.time.LocalDate.of(ld.getYear(), ld.getMonth(), ld.getDay())
case _ => java.time.LocalDate.now
}
}
TypeConverter.registerConverter(JavaLocalDateToCassandraLocalDateConverter)
TypeConverter.registerConverter(CassandraLocalDateToJavaLocalDateConverter)
But it didn't help.
How can I use JDK8 Date/Time classes in Cassandra queries executed from Spark?
I think the simplest thing to do in a where clause like this is to just call:
sc
.cassandraTable("test","test")
.where("start_time = ?", java.time.LocalDate.now.toString)
.collect
and just pass in the string, since that is a well-defined conversion.
There seems to be an issue in the TypeConverters where your converter is not taking precedence over the built-in converter. I'll take a quick look.
--Edit--
It seems like the registered converters are not being properly transferred to the executors. In local mode the code works as expected, which makes me think this is a serialization issue. I would open a ticket on the Spark Cassandra Connector for this.
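As an alternative to passing the string, an untested sketch of converting explicitly to the driver's LocalDate before binding it (inputRDD, KEY_SPACE and TABLE are the names from the question):
import com.datastax.driver.core.{LocalDate => DriverLocalDate}

// Build the driver's LocalDate from java.time.LocalDate by hand.
val today = java.time.LocalDate.now
val driverDate = DriverLocalDate.fromYearMonthDay(today.getYear, today.getMonthValue, today.getDayOfMonth)

val resultRDD = inputRDD.joinWithCassandraTable(KEY_SPACE, TABLE)
  .where("start_time = ?", driverDate)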
Cassandra's date format is yyyy-MM-dd HH:mm:ss.SSS, so if you are using Java 8 you can use the code below to parse the Cassandra date into a LocalDateTime and then apply your logic:
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
val dateTime = LocalDateTime.parse(cassandraDateTime, formatter);
Or you can convert the LocalDate to the Cassandra date format and check it that way.
