I have this UDF:
val DataSplitter = udf { device: String =>
  val a = "value1,value2,value3".split(",")
  val b = "value4,value5,value6".split(",")
  (a, b)
}
and I would like to use it in my query like this:
val df2 = df1.select(col("aa"), DataSplitter.apply(col("cc")).as("dd"))
val df3 = df2.select(col("aa"), explode(col("dd._1")).as("containers"), explode(col("dd._2")).as("parameters"))
but Spark complains that I cannot use two explode calls in the same select.
How can I solve this?
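For reference, a minimal sketch of one possible workaround (not from the original post): keep df2 as defined above and split the query into two steps, so that each select contains only one explode.
// Sketch, using the column names aa, cc and dd from the question: explode one array per select.
val df3 = df2
  .select(col("aa"), explode(col("dd._1")).as("containers"), col("dd._2").as("params"))
  .select(col("aa"), col("containers"), explode(col("params")).as("parameters"))
Note that this produces the cross product of the two arrays; if the elements should instead be paired positionally, the UDF would need to zip them into a single array of pairs.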
I'd like to use a specific UDF with Spark.
Here's the plan:
I have a table A (10 million rows) and a table B (15 million rows).
I'd like to use a UDF to compare one column of table A with one column of table B.
Is it possible?
Here's a sample of my code. At some point I also need to require that the result of my compare UDF is greater than 0.9:
DataFrame dfr = df
  .select("name", "firstname", "adress1", "city1", "compare(adress1, adress2)")
  .join(dfa, df.col("adress1").equalTo(dfa.col("adress2"))
    .and(df.col("city1").equalTo(dfa.col("city2"))))
  ...;
Is it possible?
Yes, you can. However, it will be slower than the built-in operators, because Spark will not be able to do predicate pushdown.
Example:
val similarity = udf((x: String, y: String) => { /* compute the similarity here */ 0.0 })
val df3 = df1.join(df2, similarity(df1("field1"), df2("field1")) > 0.9)
For example:
val df1 = Seq(1, 2, 3, 4).toDF("x")
val df2 = Seq(1, 3, 7, 11).toDF("q")
val diff = org.apache.spark.sql.functions.udf((x: Int, q: Int) => Math.abs(x - q))
val df3 = df1.join(df2, diff(df1("x"), df2("q")) > 1)
You can also return a Boolean directly from the user-defined function.
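For instance, a minimal sketch of that Boolean variant, reusing df1 and df2 from the example above (the name farApart is mine, not from the original):
// The UDF itself returns the Boolean join condition
val farApart = udf((x: Int, q: Int) => Math.abs(x - q) > 1)
val df4 = df1.join(df2, farApart(df1("x"), df2("q")))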
I am trying to register temp tables from dynamically built DataFrames.
I am getting the output as a string, and I am not sure if there is a way to execute it, or to convert the string back to a DataFrame, so that the temp table can be created.
Here are the steps to replicate this issue:
import org.apache.spark.sql._
val contact_df = sc.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
val acct_df = sc.makeRDD(1 to 5).map(i => (i, i / i)).toDF("value", "devide")
val dataframeJoins = Array(
  Row("x", "", "", "", "Y", "", 1, "contact_hotline_df", "contact_df", "acct_nbr", "hotline_df", "tm49_acct_nbr"),
  Row("x", "", "", "", "Y", "", 2, "contact_hotline_acct_df", "acct_df", "tm06_acct_nbr", "contact_hotline_df", "acct_nbr")
)
val dfJoinbroadcast = sc.broadcast(dataframeJoins)
val DFJoins1 = for (row <- dfJoinbroadcast.value) yield {
  row(8) + ".registerTempTable(\"" + row(8) + "\")"
}
for (rows <- 0 until DFJoins1.size) {
  println(DFJoins1(rows))
  DFJoins1(rows)
}
Here is the output of the above for loop :
contact_df.registerTempTable("contact_df")
acct_df.registerTempTable("acct_df")
I am not getting any error, but the table is not getting created.
When I run sqlContext.sql("select * from contact_df") I get an error saying the table does not exist.
Is there a way to convert the string to a DataFrame call and execute it so that the temp table gets created?
Please suggest.
Thanks,
Sreehari
Your code only concatenates the strings and prints the result, that's it. The registerTempTable method is never actually called, which is why you can't use the table in the SQL query. Try this:
// assuming we have this string to object mapping
val tableNameToDf = Map("contact_df" -> contact_df, "acct_df" -> acct_df)
you could restructure your for loop into something like:
val dfJoins = for (row <- dfJoinbroadcast.value) yield {
  val wannabeTable = row(8).toString
  tableNameToDf(wannabeTable).createOrReplaceTempView(wannabeTable)
  wannabeTable
}
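For what it's worth (assuming a Spark version where createOrReplaceTempView and sqlContext.sql are both available), once the loop has run the views exist and the query from the question should succeed:
sqlContext.sql("select * from contact_df").show()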
I am trying to fetch values from HBase using the column names, and below is my code:
val cf = Bytes.toBytes("cf")
val tkn_col_num = Bytes.toBytes("TKN_COL_NUM")
val tkn_col_val = Bytes.toBytes("TKN_COL_VAL")
val col_name = Bytes.toBytes("COLUMN_NAME")
val sc = new SparkContext("local", "hbase-test")
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, input_table)
conf.set(TableInputFormat.SCAN_COLUMNS, "cf:COLUMN_NAME cf:TKN_COL_NUM cf:TKN_COL_VAL")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
hBaseRDD.map { case (x, y) => y }.collect().foreach(println)
val colMap: Map[String, (Int, String)] = hBaseRDD.map { case (x, y) =>
  (Bytes.toString(y.getValue(cf, col_name)),
    (Bytes.toInt(y.getValue(cf, tkn_col_num)),
     Bytes.toString(y.getValue(cf, tkn_col_val))))
}.collect().toMap
colMap.foreach(println)
sc.stop()
Now Bytes.toString(y.getValue(cf, col_name)) works and I get the expected column names from the table; however, Bytes.toInt(y.getValue(cf, tkn_col_num)) gives me some seemingly random values (I guess they are offset values for the cell, but I am not sure). Below is the output that I am getting:
(COL1,(-2147483639,sum))
(COL2,(-2147483636,sum))
(COL3,(-2147483645,count))
(COL4,(-2147483642,sum))
(COL5,(-2147483641,sum))
The integer values should be 1, 2, 3, 4, 5. Can anyone please guide me on how to get the true integer column data?
Thanks
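One thing worth checking (an assumption on my part; the question does not say how TKN_COL_NUM was written): if the cells were stored as text rather than as the 4-byte value produced by Bytes.toBytes(1), then Bytes.toInt will not return meaningful numbers, and decoding the bytes as a string before parsing would be one thing to try:
// assumption: the cell holds a UTF-8 string such as "1", not a 4-byte integer
val colNum = Bytes.toString(y.getValue(cf, tkn_col_num)).trim.toInt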
I query many DataFrames from Solr.
These DataFrames should be unioned into a single DataFrame.
var sub = sc.textFile("file:/home/zeppelin/query_term.txt")
def qmap(filter: String, options: Map[String, String]): DataFrame = {
  val qm = Map("query" -> filter)
  val df = sqlContext.read.format("solr").options(options).options(qm).load
  df
}
val dfs = sub.map(x => qmap(x,subject_options)).reduce((x,y) => x.unionAll(y))
However, running a count action on dfs throws exceptions.
Please give me some methods or thoughts to fix it.
Thanks.
Replace
var sub = sc.textFile("file:/home/zeppelin/query_term.txt")
with
var sub = sc.textFile("file:/home/zeppelin/query_term.txt").collect
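The reason (not spelled out in the original answer): qmap calls sqlContext.read, which only works on the driver, so mapping it over a distributed RDD fails. After .collect, sub is a plain local Array[String], and the same pipeline runs entirely on the driver:
// sub is now a local collection, so qmap executes on the driver
val dfs = sub.map(x => qmap(x, subject_options)).reduce((x, y) => x.unionAll(y))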
I am a beginner with Apache Spark. I want to combine two RDDs into a result RDD with the code below:
def runSpark(stList: List[SubStTime], icList: List[IcTemp]): Unit = {
  val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val st = sc.parallelize(stList).map(st => ((st.productId, st.routeNo), st)).groupByKey()
  val ic = sc.parallelize(icList).map(ic => ((ic.productId, ic.routeNo), ic)).groupByKey()
  // TODO
  // val result = st.join(ic).mapValues( )
  sc.stop()
}
Here is what I want to do:
List[ST] -> map -> Map(Key, st) -> groupByKey -> Map(Key, List[st])
List[IC] -> map -> Map(Key, ic) -> groupByKey -> Map(Key, List[ic])
STRDD join ICRDD to get Map(Key, (List[st], List[ic]))
I have a function that compares listST and listIC and returns a List[result]; result contains both the SubStTime and IcTemp information:
def calcIcSt(st: List[SubStTime], ic: List[IcTemp]): List[result]
I don't know how to use mapValues, or some other way, to get my result.
Thanks
// groupByKey yields Iterable values, so convert them to the List types calcIcSt expects
val result = st.join(ic).mapValues { case (sts, ics) => calcIcSt(sts.toList, ics.toList) }