I have a UDF that accepts string parameters as well as fields, but it seems that "callUDF" can only accept fields.
I found a workaround using selectExpr(...) or by using spark.sql(...), but I wonder if there is any better way of doing that.
Here is an example:
Schema - id, map[String, String]
spark.sqlContext.udf.register("get_from_map", (map: Map[String, String], att: String) => map.getOrElse(att, ""))
val data = spark.read...
data.selectExpr("id", "get_from_map(map, 'attr')").show(15)
This will work, but I was kind of hoping for a better approach like:
data.select($"id", callUDF("get_from_map", $"map", "attr"))
Any ideas? Am I missing something?
I haven't seen any JIRA ticket open about this, so either I'm missing something or I'm misusing it.
Thanks!
You can use the lit function for that:
data.select($"id", callUDF("get_from_map", $"map", lit("attr")))
Essentially, using lit() allows you to pass literals (strings, numbers) where columns are expected.
You might also want to wrap your function with the udf function, so you can use it directly rather than going through callUDF:
import org.apache.spark.sql.functions._
val getFromMap = udf((map: Map[String, String], att: String) => map.getOrElse(att, ""))
data.select($"id", getFromMap($"map", lit("attr")))
I am getting the scan result in a string like this:
DriverId=60cb1daa20056c0c92ebe457,Amount=10.0
I want to retrieve the driver id and the amount from this string.
How can I retrieve them?
Please help...
It depends on your overall format. Basic operations like substrings as suggested by #iLoveYou3000 can work fine if you really have this fixed format.
If the keys are dynamic, or could be changed in the future, you could also use more general approaches, for instance using split():
val attributeStrings = input.split(",")
val attributesMap = attributeStrings.map { it.split("=") }.associate { it[0] to it[1] }
val driverId = attributesMap["DriverId"]
val amount = attributesMap["Amount"]?.toDouble() // or ?.toBigDecimal()
This is one of the possible ways that I could think of.
val driverID = str.substringAfter("DriverId=", "").substringBefore(",", "")
val amount = str.substringAfter("Amount=", "")
Context
In many of the SQL queries I write, I find myself combining Spark's predefined functions in exactly the same way, which often results in verbose and duplicated code, and my developer instinct is to want to refactor it.
So, my question is this: is there some way to define some kind of alias for function combinations without resorting to UDFs (which are to be avoided for performance reasons), the goal being to make the code clearer and cleaner? Essentially, what I want is something like UDFs but without the performance penalty. Also, these functions MUST be callable from within a Spark SQL query, i.e. usable in spark.sql calls.
Example
For example, let's say my business logic is to reverse some string and hash it like this (please note that the function combination here is irrelevant; what is important is that it is some combination of existing predefined Spark functions, possibly many of them):
SELECT
sha1(reverse(person.name)),
sha1(reverse(person.some_information)),
sha1(reverse(person.some_other_information))
...
FROM person
Is there a way of declaring a business function without paying the performance price of using a UDF, allowing the code just above to be rewritten as:
SELECT
business(person.name),
business(person.some_information),
business(person.some_other_information)
...
FROM person
I have searched around quite a bit in the Spark documentation and on this website and have not found a way of achieving this, which is pretty weird to me because it looks like a pretty natural need, and I don't understand why you should necessarily pay the black-box price of defining and calling a UDF.
Is there a way of declaring a business function without paying the performance price of using a udf
You don't have to use a UDF; you can extend the Expression class, or for the simplest operations, UnaryExpression. Then you only have to implement a few methods and you are done. It is natively integrated into Spark, and it additionally lets you benefit from advanced features such as code generation.
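For illustration, here is a minimal, untested sketch of what the Expression route could look like for the reverse-and-hash example, assuming Spark 2.x catalyst internals (newer versions also require overriding withNewChildInternal); the class name ReverseSha1 is just made up for this example:
import java.security.MessageDigest

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// CodegenFallback spares us from writing doGenCode; drop it if you want real codegen.
case class ReverseSha1(child: Expression) extends UnaryExpression with CodegenFallback {
  override def dataType: DataType = StringType

  // Only called for non-null inputs: reverse the string, then hex-encode its SHA-1.
  override protected def nullSafeEval(input: Any): Any = {
    val reversed = input.asInstanceOf[UTF8String].toString.reverse
    val digest = MessageDigest.getInstance("SHA-1").digest(reversed.getBytes("UTF-8"))
    UTF8String.fromString(digest.map("%02x".format(_)).mkString)
  }
}

// Wrap it in a Column to use it from the DataFrame API.
def businessExpr(column: Column): Column = new Column(ReverseSha1(column.expr))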
That said, in your case adding the business function is even more straightforward with plain Column composition:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{reverse, sha1}

def business(column: Column): Column = {
  sha1(reverse(column))
}
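Used from the DataFrame API, the query from the question then becomes something like this sketch (assuming person is loaded as a DataFrame with the columns shown above):
import org.apache.spark.sql.functions.col

person.select(
  business(col("name")),
  business(col("some_information")),
  business(col("some_other_information")))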
MUST be callable from within a spark-sql query usable in spark.sql calls
This is trickier, but achievable.
You need to create a custom function registrar:
import scala.collection.mutable

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression

object FunctionAliasRegistrar {

  val funcs: mutable.Map[String, Seq[Column] => Column] = mutable.Map.empty

  def add(name: String, builder: Seq[Column] => Column): this.type = {
    funcs += name -> builder
    this
  }

  def registerAll(spark: SparkSession) = {
    funcs.foreach { case (alias, builder) =>
      def b(children: Seq[Expression]) = builder.apply(children.map(expr => new Column(expr))).expr
      spark.sessionState.functionRegistry.registerFunction(FunctionIdentifier(alias), b)
    }
  }
}
Then you can use it as follows:
import org.apache.spark.sql.functions._

FunctionAliasRegistrar
  .add("business1", children => lower(reverse(children.head)))
  .add("business2", children => upper(reverse(children.head)))
  .registerAll(spark)
dataset.createTempView("data")
spark.sql(
"""
| SELECT business1(name), business2(name) FROM data
|""".stripMargin)
.show(false)
Output:
+--------------------+--------------------+
|lower(reverse(name))|upper(reverse(name))|
+--------------------+--------------------+
|sined |SINED |
|taram |TARAM |
|1taram |1TARAM |
|2taram |2TARAM |
+--------------------+--------------------+
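If you want friendlier column names than the generated lower(reverse(name)) headers above, you can alias the calls as usual:
spark.sql("SELECT business1(name) AS business1_name FROM data").show(false)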
Hope this helps.
Does anyone know how to set up
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
to actually consume binary files?
Where can I find all the InputFormat classes? The documentation gives no links for that. I imagine that the ValueClass is related to the InputFormatClass somehow.
In the non-streaming version, using the binaryFiles method, I can get byte arrays for each file. Is there a way I can get the same with Spark Streaming? If not, where can I find those details, i.e. which input formats are supported and which value class each of them produces? Finally, can one pick any KeyClass, or aren't all those elements connected?
I would appreciate it if someone could clarify the use of this method.
EDIT1
I have tried the following:
val bfiles = ssc.fileStream[BytesWritable, BytesWritable, SequenceFileAsBinaryInputFormat]("/xxxxxxxxx/Casalini_streamed")
However, the compiler complains as follows:
[error] /xxxxxxxxx/src/main/scala/EstimatorStreamingApp.scala:14: type arguments [org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat] conform to the bounds of none of the overloaded alternatives of
[error] value fileStream: [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path => Boolean, newFilesOnly: Boolean, conf: org.apache.hadoop.conf.Configuration)(implicit evidence$10: scala.reflect.ClassTag[K], implicit evidence$11: scala.reflect.ClassTag[V], implicit evidence$12: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path => Boolean, newFilesOnly: Boolean)(implicit evidence$7: scala.reflect.ClassTag[K], implicit evidence$8: scala.reflect.ClassTag[V], implicit evidence$9: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String)(implicit evidence$4: scala.reflect.ClassTag[K], implicit evidence$5: scala.reflect.ClassTag[V], implicit evidence$6: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)]
[error] val bfiles = ssc.fileStream[BytesWritable, BytesWritable, SequenceFileAsBinaryInputFormat]("/xxxxxxxxx/Casalini_streamed")
What am I doing wrong?
Follow the link to read about all the Hadoop input formats.
I found a well-documented answer here about the sequence file format.
You are facing the compilation issue because of an import mismatch: Hadoop mapred vs. mapreduce.
For example, in Java:
JavaPairInputDStream<BytesWritable, BytesWritable> dstream =
    jssc.fileStream("/somepath",
        org.apache.hadoop.io.BytesWritable.class,
        org.apache.hadoop.io.BytesWritable.class,
        org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat.class);
I didn't try it in Scala, but it should be something similar:
val dstream = ssc.fileStream[
    org.apache.hadoop.io.BytesWritable,
    org.apache.hadoop.io.BytesWritable,
    org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat]("/somepath")
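As a hedged follow-up sketch (untested), once the stream is set up you can pull the raw bytes out of the BytesWritable values, which should give you something comparable to what sc.binaryFiles provides in batch mode:
// Each record is a (key, value) pair of BytesWritable; copyBytes() yields an Array[Byte].
val byteArrays = dstream.map { case (_, value) => value.copyBytes() }
byteArrays.foreachRDD(rdd => println(s"records in this batch: ${rdd.count()}"))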
I finally got it to compile.
The compilation problem was in the import. I used
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat
I replaced it with
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat
Then it works. However, I have no idea why. I don't understand the difference between the two hierarchies; the two files seem to have the same content, so it is hard to tell. If someone could help clarify that here, I think it would help a lot.
Can anyone please help me with the below query?
I have an RDD with 5 columns. I want to join it with a table in Cassandra.
I know that there is a way to do that by using "joinWithCassandraTable".
I saw a syntax for it somewhere.
Syntax:
RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb"))
.on(SomeColumns("colc"))
Can anyone please send me the correct syntax?
I would actually like to know where to specify the column name of the table that is the key to join on.
joinWithCassandraTable works by pulling from C* only the partitions whose keys match your RDD entries, so it only works on partition keys.
The documentation is here
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
and API Doc is here
http://datastax.github.io/spark-cassandra-connector/ApiDocs/1.6.0-M2/spark-cassandra-connector/#com.datastax.spark.connector.RDDFunctions
The joinWithCassandraTable (jWCT) method can be used without the fluent API by specifying all the arguments in the method call:
def joinWithCassandraTable[R](
    keyspaceName: String,
    tableName: String,
    selectedColumns: ColumnSelector = AllColumns,
    joinColumns: ColumnSelector = PartitionKeyColumns)
But the fluent API can also be used:
joinWithCassandraTable[R](keyspace, tableName).select(AllColumns).on(PartitionKeyColumns)
These two calls are equivalent
Your example
RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb")).on(SomeColumns("colc"))
uses the objects from RDD to join against colc of tablename and only returns cola and colb as join results.
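For instance, going by the signature above (the keyspace, table and column names here are hypothetical), joining on the partition key column colc and reading back only cola and colb could look like this sketch:
import com.datastax.spark.connector._  // brings in SomeColumns and the joinWithCassandraTable syntax

val joined = rdd.joinWithCassandraTable(
  "my_keyspace", "my_table",
  selectedColumns = SomeColumns("cola", "colb"),
  joinColumns = SomeColumns("colc"))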
Use the below syntax for the join in Cassandra:
joinedData = rdd.joinWithCassandraTable(keyspace, table).on(partitionKeyColumns).select(columnNames)
It will look something like this:
val joinedData = rdd.joinWithCassandraTable(keyspace, table).on(SomeColumns("emp_id")).select("emp_name", "emp_city")
I am using Spark 1.3. I have a dataset where the dates in column (ordering_date column) are in yyyy/MM/dd format. I want to do some calculations with dates and therefore I want to use jodatime to do some conversions/formatting. Here is the udf that I have :
val return_date = udf((str: String, dtf: DateTimeFormatter) => dtf.formatted(str))
Here is the code where the UDF is being called. However, I get an error saying "Not Applicable". Do I need to register this UDF, or am I missing something here?
val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), DateTimeFormat.forPattern("yyyy/MM/dd"))
)
I don't believe you can pass in the DateTimeFormatter as an argument to the UDF. You can only pass in a Column. One solution would be to do:
val return_date = udf((str: String, format: String) => {
  DateTimeFormat.forPattern(format).formatted(str)
})
And then:
val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), lit("yyyy/MM/dd"))
)
Honestly, though, both this and your original approach have the same problem: they parse yyyy/MM/dd using forPattern for every record. It would be better to create a singleton object wrapped around a Map[String, DateTimeFormatter], maybe like this (thoroughly untested, but you get the idea):
object DateFormatters {
  var formatters = Map[String, DateTimeFormatter]()

  def getFormatter(format: String): DateTimeFormatter = {
    if (formatters.get(format).isEmpty) {
      formatters = formatters + (format -> DateTimeFormat.forPattern(format))
    }
    formatters.get(format).get
  }
}
Then you would change your UDF to:
val return_date = udf((str: String, format: String) => {
  DateFormatters.getFormatter(format).formatted(str)
})
That way, DateTimeFormat.forPattern(...) is only called once per format per executor.
One thing to note about the singleton object solution is that you can't define the object in the spark-shell: you have to pack it up in a JAR file and use the --jars option to spark-shell if you want to use the DateFormatters object in the shell.