Can we put a default value in a field of dataframe while creating the dataframe? I am creating a spark dataframe from List<Object[]> rows as :
List<org.apache.spark.sql.Row> sparkRows = rows.stream().map(RowFactory::create).collect(Collectors.toList());
Dataset<org.apache.spark.sql.Row> dataset = session.createDataFrame(sparkRows, schema);
While looking for a way, I found that org.apache.spark.sql.types.DataTypes contains object of org.apache.spark.sql.types.Metadata class. The documentation does not specify what is the exact purpose of the class :
/**
* Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean,
* Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and
* Array[Metadata]. JSON is used for serialization.
*
* The default constructor is private. User should use either [[MetadataBuilder]] or
* `Metadata.fromJson()` to create Metadata instances.
*
* #param map an immutable map that stores the data
*
* #since 1.3.0
*/
This class supports a very limited datatypes, and there is no out of the box api for making use of this class for inserting a default value while dataset creation.
Where does one use the metadata, can someone share any real life use case?
I know we can have our own map function to iterate over the rows.stream().map(RowFactory::create) and put default values. But is there any way we could do this using spark apis?
Edit : I am expecting some way similar to Oracle's DEFAULT functionality. We define a default value for each column, according to its datatype, and while creating the dataframe, if there is no value or null, then use this default value.
Related
The documentation says that one triggers a spark job and the other one does not. I am not sure I understand what that means. Could you help me understand the difference between the two?
The source of truth comes from latest code:
/**
* Zips this RDD with its element indices. The ordering is first based on the partition index
* and then the ordering of items within each partition. So the first item in the first
* partition gets index 0, and the last item in the last partition receives the largest index.
*
* This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
* This method needs to trigger a spark job when this RDD contains more than one partitions.
*
* #note Some RDDs, such as those returned by groupBy(), do not guarantee order of
* elements in a partition. The index assigned to each element is therefore not guaranteed,
* and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
* the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
*/
def zipWithIndex(): RDD[(T, Long)] = withScope {
new ZippedWithIndexRDD(this)
}
/**
* Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
* 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
* won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
*
* #note Some RDDs, such as those returned by groupBy(), do not guarantee order of
* elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
* and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
* the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
*/
def zipWithUniqueId(): RDD[(T, Long)]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1396
I know that you can use the provided helper functions to retrieve subfields from message segments for example
val nameDf = df.select(segment_field("PID", 4).alias("name"))
Is there a way I can extract the entire message segment ("PID") instead of having to put in an index for each subfield?
PID|||d40726da-9b7a-49eb-9eeb-e406708bbb60||Heller^Keneth||||||140 Pacocha Way Suite 52^^Northampton^Massachusetts^^USA
segment_field as implemented here is a helper method to extract a single value given the index.
/**
* Extracts a field from a message segment.
*
* #param segment The ID of the segment to extract.
* #param field The index of the field to extract.
* #param segmentColumn The name of the column containing message segments.
* Defaults to "segments".
* #return Yields a new column containing the field of a message segment.
*
* #note If there are multiple segments with the same ID, this function will
* select the field from one of the segments. Order is undefined.
*/
def segment_field(segment: String,
field: Int,
segmentColumn: Column = col("segments")): Column = {
filter(segmentColumn, s => s("id") === lit(segment))
.getItem(0)
.getField("fields")
.getItem(field)
}
However, if you are interested in extracting all the field values regardless of index you may replicate this behaviour as
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def segment_fields(segment: String,
segmentColumn: Column = col("segments")): Column = {
filter(segmentColumn, s => s("id") === lit(segment))
.getItem(0)
.getField("fields")
}
and use as such
val nameDf = df.select(segment_fields("PID").alias("values"))
You may then extract or transform the data as desired.
I have the following udt type
CREATE TYPE tag_partitions(
year bigint,
month bigint);
and the following table
CREATE TABLE ${tableName} (
tag text,
partition_info set<FROZEN<tag_partitions>>,
PRIMARY KEY ((tag))
)
The table schema is mapped using the following model
case class TagPartitionsInfo(year:Long, month:Long)
case class TagPartitions(tag:String, partition_info:Set[TagPartitionsInfo])
I have written a function which should create an Update.IfExists query: But I don't know how I should update the udt value. I tried to use set but it isn't working.
def updateValues(tableName:String, model:TagPartitions, id:TagPartitionKeys):Update.IfExists = {
val partitionInfoType:UserType = session.getCluster().getMetadata
.getKeyspace("codingjedi").getUserType("tag_partitions")
//create value
//the logic below assumes that there is only one element in the set
val partitionsInfoSet:Set[UDTValue] = model.partition_info.map((partitionInfo:TagPartitionsInfo) =>{
partitionInfoType.newValue()
.setLong("year",partitionInfo.year)
.setLong("month",partitionInfo.month)
})
println("partition info converted to UDTValue: "+partitionsInfoSet)
QueryBuilder.update(tableName).
`with`(QueryBuilder.WHAT_TO_DO_HERE_TO_UPDATE_UDT("partition_info",partitionsInfoSet))
.where(QueryBuilder.eq("tag", id.tag)).ifExists()
}
The mistake was I was adding partitionsInfoSet in the table but it is a Set of Scala. I needed to convert into Set of Java using setAsJavaSet
QueryBuilder.update(tableName).`with`(QueryBuilder.set("partition_info",setAsJavaSet(partitionsInfoSet)))
.where(QueryBuilder.eq("tag", id.tag))
.ifExists()
}
Although, it didn't answer your exact question, wouldn't it be easier to use Object Mapper for this? Something like this (I didn't modify it heavily to match your code):
#UDT(name = "scala_udt")
case class UdtCaseClass(id: Integer, #(Field #field)(name = "t") text: String) {
def this() {
this(0, "")
}
}
#Table(name = "scala_test_udt")
case class TableObjectCaseClassWithUDT(#(PartitionKey #field) id: Integer,
udts: java.util.Set[UdtCaseClass]) {
def this() {
this(0, new java.util.HashSet[UdtCaseClass]())
}
}
and then just create case class and use mapper.save on it. (Also note that you need to use Java collections, until you're imported Scala codecs).
The primary reason for using Object Mapper could be ease of use, and also better performance, because it's using prepared statements under the hood, instead of built statements that are much less efficient.
You can find more information about Object Mapper + Scala in article that I wrote recently.
I need to use an UDF in Spark that takes in a timestamp, an Integer and another dataframe and returns a tuple of 3 values.
I keep hitting error after error and I'm not sure I'm trying to fix it right anymore.
Here is the function:
def determine_price (view_date: org.apache.spark.sql.types.TimestampType , product_id: Int, price_df: org.apache.spark.sql.DataFrame) : (Double, java.sql.Timestamp, Double) = {
var price_df_filtered = price_df.filter($"mkt_product_id" === product_id && $"created"<= view_date)
var price_df_joined = price_df_filtered.groupBy("mkt_product_id").agg("view_price" -> "min", "created" -> "max").withColumn("last_view_price_change", lit(1))
var price_df_final = price_df_joined.join(price_df_filtered, price_df_joined("max(created)") === price_df_filtered("created")).filter($"last_view_price_change" === 1)
var result = (price_df_final.select("view_price").head().getDouble(0), price_df_final.select("created").head().getTimestamp(0), price_df_final.select("min(view_price)").head().getDouble(0))
return result
}
val det_price_udf = udf(determine_price)
the error it gives me is:
error: missing argument list for method determine_price
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `determine_price _` or `determine_price(_,_,_)` instead of `determine_price`.
If I start adding the arguments I keep running in other errors such as Int expected Int.type found or object DataFrame is not a member of package org.apache.spark.sql
To give some context:
The idea is that I have a dataframe of prices, a product id and a date of creation and another dataframe containing product IDs and view dates.
I need to determine the price based on which was the last created price entry that is older than the view date.
Since each product ID has multiple view dates in the second dataframe. I thought an UDF is faster than a cross join. If anyone has a different idea, I'd be grateful.
You cannot pass the Dataframe inside UDF as UDF will be running on the Worker On a particular partition. And as you cannot use RDD on Worker( Is it possible to create nested RDDs in Apache Spark? ), similarly you cannot use the DataFrame on Worker too.!
You need to do a work around for this !
What is the best way to partition the data by a field into predefined partition count?
I am currently partitioning the data by specifying the partionCount=600. The count 600 is found to give best query performance for my dataset/cluster setup.
val rawJson = sqlContext.read.json(filename).coalesce(600)
rawJson.write.parquet(filenameParquet)
Now I want to partition this data by the column 'eventName' but still keep the count 600. The data currently has around 2000 unique eventNames, plus the number of rows in each eventName is not uniform. Around 10 eventNames have more than 50% of the data causing data skew. Hence if I do the partitioning like below, its not very performant. The write is taking 5x more time than without.
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("eventName").parquet(filenameParquet)
What is a good way to partition the data for these scenarios? Is there a way to partition by eventName but spread this into 600 partitions?
My schema looks like this:
{
"eventName": "name1",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data": {
"type": "EventData",
"dataDetails": {
"name": "detailed1",
"id": "1234",
...
...
}
}
}
Thanks!
This is a common problem with skewed data and there are several approaches you can take.
List bucketing works if the skew remains stable over time, which may or may not be the case, especially if new values of the partitioning variable are introduced. I have not researched how easy it is to adjust list bucketing over time and, as your comment states, you can't use that anyway because it is a Spark 2.0 feature.
If you are on 1.6.x, the key observation is that you can create your own function that maps each event name into one of 600 unique values. You can do this as a UDF or as a case expression. Then, you simply create a column using that function and then partition by that column using repartition(600, 'myPartitionCol) as opposed to coalesce(600).
Because we deal with very skewed data at Swoop, I've found the following workhorse data structure to be quite useful for building partitioning-related tools.
/** Given a key, returns a random number in the range [x, y) where
* x and y are the numbers in the tuple associated with a key.
*/
class RandomRangeMap[A](private val m: Map[A, (Int, Int)]) extends Serializable {
private val r = new java.util.Random() // Scala Random is not serializable in 2.10
def apply(key: A): Int = {
val (start, end) = m(key)
start + r.nextInt(end - start)
}
override def toString = s"RandomRangeMap($r, $m)"
}
For example, here is how we build a partitioner for a slightly different case: one where the data is skewed and the number of keys is small so we have to increase the number of partitions for the skewed keys while sticking with 1 as the minimum number of partitions per key:
/** Partitions data such that each unique key ends in P(key) partitions.
* Must be instantiated with a sequence of unique keys and their Ps.
* Partition sizes can be highly-skewed by the data, which is where the
* multiples come in.
*
* #param keyMap maps key values to their partition multiples
*/
class ByKeyPartitionerWithMultiples(val keyMap: Map[Any, Int]) extends Partitioner {
private val rrm = new RandomRangeMap(
keyMap.keys
.zip(
keyMap.values
.scanLeft(0)(_+_)
.zip(keyMap.values)
.map {
case (start, count) => (start, start + count)
}
)
.toMap
)
override val numPartitions =
keyMap.values.sum
override def getPartition(key: Any): Int =
rrm(key)
}
object ByKeyPartitionerWithMultiples {
/** Builds a UDF with a ByKeyPartitionerWithMultiples in a closure.
*
* #param keyMap maps key values to their partition multiples
*/
def udf(keyMap: Map[String, Int]) = {
val partitioner = new ByKeyPartitionerWithMultiples(keyMap.asInstanceOf[Map[Any, Int]])
(key:String) => partitioner.getPartition(key)
}
}
In your case, you have to merge several event names into a single partition, which would require changes but I hope the code above gives you an idea how to approach the problem.
One final observation is that if the distribution of event names values a lot in your data over time, you can perform a statistics gathering pass over some part of the data to compute a mapping table. You don't have to do this all the time, just when it is needed. To determine that, you can look at the number of rows and/or size of output files in each partition. In other words, the entire process can be automated as part of your Spark jobs.