I have the following model:
case class TagPartitionsInfo(year: Long, month: Long)
case class TagPartitions(tag: String, partition_info: Set[TagPartitionsInfo])
The data in the Cassandra table is stored as follows:
tag | partition_info
------------+--------------------------------------------------
javascript | {{year: 2018, month: 1}, {year: 2018, month: 2}}
When I query the table, I try to create a TagPartitions instance from each row of the ResultSet as follows, but my code isn't compiling. The issue seems to be the way I am extracting the Set from the row:
TagPartitions(row.getString("tag"),row.getSet[TagPartitionsInfo]("partition_info",TagPartitionsInfo.getClass))
The error is Cannot resolve symbol getSet.
I also tried row.getSet("partition_info",TagPartitionsInfo.getClass) but then I see the error Type mismatch, expected Set[TagPartitionsInfo], actual util.Set[Any]
What am I doing wrong?
This worked. As I am using a UDT, I have to use UserType and UDTValue to convert the UDT into my model:
import scala.collection.JavaConverters._

val tag = row.getString("tag")

// look up the user-defined type so the set elements can be read as UDTValue
val partitionInfoType: UserType = session.getCluster().getMetadata
  .getKeyspace("codingjedi").getUserType("tag_partitions")

// read the column as a Java Set of UDTValue
val partitionsInfo = row.getSet("partition_info", partitionInfoType.newValue().getClass)
println("tag is " + tag + " and partition info converted to UDTValue: " + partitionsInfo)

// convert the Java Set[UDTValue] into a Scala Set[UDTValue]
val udtValueScalaSet: Set[UDTValue] = partitionsInfo.asScala.toSet

// convert Set[UDTValue] to Set[TagPartitionsInfo]
val partitionInfoSet: Set[TagPartitionsInfo] =
  udtValueScalaSet.map(partition => TagPartitionsInfo(partition.getLong("year"), partition.getLong("month")))

TagPartitions(tag, partitionInfoSet)
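For reference, here is a minimal sketch (the SELECT and the table name are assumptions) of wrapping the same conversion in a helper and applying it to a whole ResultSet; classOf[UDTValue] should behave the same as partitionInfoType.newValue().getClass:
import scala.collection.JavaConverters._

def rowToTagPartitions(row: Row): TagPartitions = {
  // the elements of the set column deserialise to UDTValue
  val udtValues = row.getSet("partition_info", classOf[UDTValue]).asScala
  TagPartitions(
    row.getString("tag"),
    udtValues.map(v => TagPartitionsInfo(v.getLong("year"), v.getLong("month"))).toSet)
}

// apply the helper to every row of the result set (table name assumed)
val allPartitions: List[TagPartitions] =
  session.execute("SELECT * FROM tag_partitions_table").asScala.toList.map(rowToTagPartitions)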
Related
I have the following UDT:
CREATE TYPE tag_partitions(
year bigint,
month bigint);
and the following table
CREATE TABLE ${tableName} (
tag text,
partition_info set<FROZEN<tag_partitions>>,
PRIMARY KEY ((tag))
)
The table schema is mapped using the following model
case class TagPartitionsInfo(year:Long, month:Long)
case class TagPartitions(tag:String, partition_info:Set[TagPartitionsInfo])
I have written a function which should create an Update.IfExists query, but I don't know how to update the UDT value. I tried to use set but it isn't working:
def updateValues(tableName: String, model: TagPartitions, id: TagPartitionKeys): Update.IfExists = {
  val partitionInfoType: UserType = session.getCluster().getMetadata
    .getKeyspace("codingjedi").getUserType("tag_partitions")

  // convert each TagPartitionsInfo in the model into a UDTValue
  val partitionsInfoSet: Set[UDTValue] = model.partition_info.map((partitionInfo: TagPartitionsInfo) => {
    partitionInfoType.newValue()
      .setLong("year", partitionInfo.year)
      .setLong("month", partitionInfo.month)
  })
  println("partition info converted to UDTValue: " + partitionsInfoSet)

  QueryBuilder.update(tableName)
    .`with`(QueryBuilder.WHAT_TO_DO_HERE_TO_UPDATE_UDT("partition_info", partitionsInfoSet))
    .where(QueryBuilder.eq("tag", id.tag)).ifExists()
}
The mistake was that I was putting partitionsInfoSet into the statement as a Scala Set. I needed to convert it into a Java Set using setAsJavaSet:
  // e.g. import scala.collection.JavaConversions.setAsJavaSet
  QueryBuilder.update(tableName)
    .`with`(QueryBuilder.set("partition_info", setAsJavaSet(partitionsInfoSet)))
    .where(QueryBuilder.eq("tag", id.tag))
    .ifExists()
}
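For completeness, a minimal usage sketch (the table name here is an assumption): the returned Update.IfExists is a regular built statement, so it can be executed directly with the session.
val stmt = updateValues("tag_partitions_by_tag", model, id)
val rs = session.execute(stmt)
println("update applied: " + rs.wasApplied())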
Although it doesn't answer your exact question, wouldn't it be easier to use the Object Mapper for this? Something like this (I didn't modify it heavily to match your code):
@UDT(name = "scala_udt")
case class UdtCaseClass(id: Integer, @(Field @field)(name = "t") text: String) {
  def this() {
    this(0, "")
  }
}

@Table(name = "scala_test_udt")
case class TableObjectCaseClassWithUDT(@(PartitionKey @field) id: Integer,
                                       udts: java.util.Set[UdtCaseClass]) {
  def this() {
    this(0, new java.util.HashSet[UdtCaseClass]())
  }
}
and then just create an instance of the case class and use mapper.save on it. (Also note that you need to use Java collections until you've imported Scala codecs.)
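For example, a minimal sketch of saving an entity with the mapper (assuming session is an already-connected driver Session):
import com.datastax.driver.mapping.MappingManager

val manager = new MappingManager(session)
val mapper = manager.mapper(classOf[TableObjectCaseClassWithUDT])

// build the Java set of UDT values and save the whole row
val udts = new java.util.HashSet[UdtCaseClass]()
udts.add(UdtCaseClass(1, "some text"))
mapper.save(TableObjectCaseClassWithUDT(1, udts))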
The primary reasons for using the Object Mapper would be ease of use and better performance, because it uses prepared statements under the hood instead of built statements, which are much less efficient.
You can find more information about the Object Mapper + Scala in an article that I wrote recently.
I need to use a UDF in Spark that takes in a timestamp, an integer and another dataframe and returns a tuple of 3 values.
I keep hitting error after error, and I'm not sure I'm fixing it the right way anymore.
Here is the function:
def determine_price(view_date: org.apache.spark.sql.types.TimestampType, product_id: Int,
                    price_df: org.apache.spark.sql.DataFrame): (Double, java.sql.Timestamp, Double) = {
  var price_df_filtered = price_df.filter($"mkt_product_id" === product_id && $"created" <= view_date)
  var price_df_joined = price_df_filtered.groupBy("mkt_product_id")
    .agg("view_price" -> "min", "created" -> "max")
    .withColumn("last_view_price_change", lit(1))
  var price_df_final = price_df_joined
    .join(price_df_filtered, price_df_joined("max(created)") === price_df_filtered("created"))
    .filter($"last_view_price_change" === 1)
  var result = (price_df_final.select("view_price").head().getDouble(0),
                price_df_final.select("created").head().getTimestamp(0),
                price_df_final.select("min(view_price)").head().getDouble(0))
  return result
}
val det_price_udf = udf(determine_price)
the error it gives me is:
error: missing argument list for method determine_price
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `determine_price _` or `determine_price(_,_,_)` instead of `determine_price`.
If I start adding the arguments, I keep running into other errors, such as "Int expected, Int.type found" or "object DataFrame is not a member of package org.apache.spark.sql".
To give some context:
The idea is that I have a dataframe of prices, with a product id and a creation date, and another dataframe containing product IDs and view dates.
I need to determine the price based on the last price entry that was created before the view date.
Since each product ID has multiple view dates in the second dataframe, I thought a UDF would be faster than a cross join. If anyone has a different idea, I'd be grateful.
You cannot pass a DataFrame into a UDF, because the UDF runs on the workers, on a particular partition. And just as you cannot use an RDD on a worker (see "Is it possible to create nested RDDs in Apache Spark?"), you cannot use a DataFrame on a worker either.
You need to work around this, for example by expressing the lookup as a join, as in the sketch below.
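A hedged sketch of one such workaround (column names are taken from the question; the views DataFrame name views_df is an assumption): join the views to the prices, then keep, for each view, the price row with the latest created value before the view date.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// every view paired with all price entries created before it
val joined = views_df.join(price_df,
  views_df("product_id") === price_df("mkt_product_id") &&
  price_df("created") <= views_df("view_date"))

// keep only the most recent price entry per (product_id, view_date)
val w = Window.partitionBy("product_id", "view_date").orderBy(col("created").desc)
val latest_price = joined
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")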
I am using Spark 1.5.0 and I have this issue:
val df = paired_rdd.reduceByKey {
  case (val1, val2) => val1 + "|" + val2
}.toDF("user_id", "description")
Here is sample data for df; as you can see, the description column has this format (text1#text3#weight|text1#text3#weight|...):
user1
book1#author1#0.07841217886795074|tool1#desc1#0.27044260397331488|song1#album1#-0.052661673730870676|item1#category1#-0.005683148395350108
I want to sort this df based on weight in descending order. Here is what I tried:
First, split the contents at "|", then for each of those strings split at "#", take the 3rd string (the weight), and convert it to a double value:
val getSplitAtWeight = udf((str: String) => {
str.split("|").foreach(_.split("#")(2).toDouble)
})
Then sort based on the weight value returned by the udf (in descending order):
val df_sorted = df.sort(getSplitAtWeight(col("description")).desc)
I get the following error:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type Unit is not supported at
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:153)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:29)
at
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:64)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:29)
at org.apache.spark.sql.functions$.udf(functions.scala:2242)
Changing foreach in your udf to map as follows will eliminate the exception:
def getSplitAtWeight = udf((str: String) => {
str.split('|').map(_.split('#')(2).toDouble)
})
The problem with your method is that foreach on the Array doesn't return anything, i.e., its result is of type Unit; that's why you get the exception. Note also that the fix splits on the character '|' rather than the String "|": String.split takes a regular expression, and the regex "|" would split the string into single characters. To understand more about foreach, check this blog.
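As a hedged follow-up (assuming the goal is to order rows by their largest weight): returning a single Double instead of the whole array makes the sort key unambiguous.
def getMaxWeight = udf((str: String) =>
  str.split('|').map(_.split('#')(2).toDouble).max)

val df_sorted = df.sort(getMaxWeight(col("description")).desc)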
I am trying to make a function that will pull the column names out of a dataframe schema. So what I have is the initial function defined:
val df = sqlContext.parquetFile(inputVal.toString)
val dfSchema = df.schema
def schemaMatchP(schema: StructType) : Map[String,List[Int]] =
schema
// get the 1st word (column type) in upper cases
.map(columnDescr => columnDescr
If I do something like this:
.map(columnDescr => columnDescr.toString.split(',')(0).toUpperCase)
I will get STRUCTFIELD(HH_CUST_GRP_MBRP_ID,BINARYTYPE,TRUE)
How do you handle a StructField so I can grab the first element (the column name) out of each field in the schema? So my column names: HH_CUST_GRP_MBRP_ID, etc.
When in doubt, look at what the source does itself. DataFrame.toString has the answer :). StructField is a case class with a name property, so just do:
schema.map(f => s"${f.name}")
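As a small extension (a sketch assuming df is the DataFrame loaded above), StructField also carries the data type, so both pieces can be pulled out without parsing its toString:
val namesAndTypes: Seq[(String, String)] =
  df.schema.map(f => (f.name, f.dataType.typeName.toUpperCase))
// e.g. ("HH_CUST_GRP_MBRP_ID", "BINARY")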
I have downloaded the Microsoft Dynamic Query API and am using a dynamic query to filter the data by dates. I have written the following query:
Entities db = new Entities();
DateTime d = new DateTime(2014, 1, 17);
var lst = db.MSTPriorityS.Where("ModifiedOn == #0", d.Date.ToString()).ToList();
The result count I am getting is 0, while there is data in the database table.
Please advise what I am doing wrong.
I think the problem is probably where you convert the DateTime to a String.
You can create your query step by step, and type-safe; follow "Creating dynamic queries with entity framework".
You can use a lambda expression instead: var lst = db.MSTPriorityS.Where(u => u.ModifiedOn == System.Data.Objects.EntityFunctions.TruncateTime(d))