spark dataframe union dataframe in spark-solr - apache-spark

I query many DataFrames from Solr.
I want to union these DataFrames into a single DataFrame:
var sub = sc.textFile("file:/home/zeppelin/query_term.txt")
def qmap(filter: String, options: Map[String, String]): DataFrame = {
val qm = Map(
"query" -> filter
)
val df = sqlContext.read.format("solr").options(options).options(qm).load
return df
}
val dfs = sub.map(x => qmap(x,subject_options)).reduce((x,y) => x.unionAll(y))
However, running the count action on dfs throws exceptions.
How can I fix this?
Thanks.

Replace
var sub = sc.textFile("file:/home/zeppelin/query_term.txt")
with
var sub = sc.textFile("file:/home/zeppelin/query_term.txt").collect
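The reason this helps: sc.textFile returns an RDD, so the function passed to sub.map runs on the executors, where sqlContext cannot be used to create DataFrames. After collect, sub is a local Array[String] and the map and reduce run on the driver. A minimal sketch of the pattern follows, assuming Spark 2.x (SparkSession and union instead of sqlContext and unionAll). The qmap here is a hypothetical stand-in that builds a tiny in-memory DataFrame instead of reading from Solr, and the object and method names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionOnDriver {
  // Hypothetical stand-in for the Solr read in qmap: one tiny DataFrame per query term.
  def qmap(spark: SparkSession, term: String): DataFrame = {
    import spark.implicits._
    Seq((term, term.length)).toDF("term", "len")
  }

  // terms is a local collection, so map and reduce run on the driver,
  // where creating DataFrames is allowed.
  def unionAllTerms(spark: SparkSession, terms: Seq[String]): DataFrame =
    terms.map(t => qmap(spark, t)).reduce(_ union _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("union-sketch").getOrCreate()
    // Stand-in for sc.textFile("...").collect
    val terms = Seq("spark", "solr", "union")
    unionAllTerms(spark, terms).show()
    spark.stop()
  }
}
```

Note that reduce throws on an empty collection, so an empty term file would need separate handling.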

Related

Cannot use 2 explode in the same select

I have this UDF:
val DataSplitter = udf({
device: String =>
val a = "value1,value2,value3";
val b = "value4,value5,value6";
(a,b)
})
and I would like to use it in my query like this:
val df2 = df1.select(col("aa"), DataSplitter.apply(col("cc")).as("dd"))
val df3 = df2.select(col("aa"), explode(col("dd._1")).as("containers"), explode(col("dd._2")).as("parameters"))
but it complains that I cannot use two explode calls in the same select.
How can I solve this?

Join apache spark dataframes properly with scala avoiding null values

Hello everyone!
I have two DataFrames in Apache Spark (2.3) and I want to join them properly. I will explain below what I mean by 'properly'. First of all, the two DataFrames hold the following information:
nodeDf: ( id, year, title, authors, journal, abstract )
edgeDf: ( srcId, dstId, label )
The label is 0 or 1, depending on whether node1 is connected to node2.
I want to combine these two DataFrames into one DataFrame with the following information:
JoinedDF: ( id_from, year_from, title_from, journal_from, abstract_from, id_to, year_to, title_to, journal_to, abstract_to, time_dist )
time_dist = abs(year_from - year_to)
By 'properly' I mean that the query must be as fast as possible and that the result must not contain null rows or cells (values in a row).
I have tried the following, but it took 500-540 seconds to execute the query and the final DataFrame contains null values. I don't even know whether the DataFrames were joined correctly.
Note that the node file from which I create nodeDf has 27770 rows and the edge file (edgeDf) has 615512 rows.
Code:
val spark = SparkSession.builder().master("local[*]").appName("Logistic Regression").getOrCreate()
val sc = spark.sparkContext
val data = sc.textFile("resources/data/training_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1), fields(2).toInt)
})
val data2 = sc.textFile("resources/data/test_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1))
})
import spark.implicits._
val trainingDF = data.toDF("srcId","dstId", "label")
val testDF = data2.toDF("srcId","dstId")
val infoRDD = spark.read.option("header","false").option("inferSchema","true").format("csv").load("resources/data/node_information.csv")
val infoDF = infoRDD.toDF("srcId","year","title","authors","jurnal","abstract")
println("Showing linksDF sample...")
trainingDF.show(5)
println("Rows of linksDF: ",trainingDF.count())
println("Showing infoDF sample...")
infoDF.show(2)
println("Rows of infoDF: ",infoDF.count())
println("Joining linksDF and infoDF...")
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" === $"b.srcId")
println(joinedDF.count())
joinedDF = joinedDF.select($"a.srcId",$"a.dstId",$"a.label",$"b.year",$"b.title",$"b.authors",$"b.jurnal",$"b.abstract")
joinedDF.show(5)
val graphX = new GraphX()
val pageRankDf =graphX.computePageRank(spark,"resources/data/training_set.txt",0.0001)
println("Joining joinedDF and pageRankDf...")
joinedDF = joinedDF.as("a").join(pageRankDf.as("b"),$"a.srcId" === $"b.nodeId")
var dfWithRanks = joinedDF.select("srcId","dstId","label","year","title","authors","jurnal","abstract","rank").withColumnRenamed("rank","pgRank")
dfWithRanks.show(5)
println("Renaming joinedDF...")
dfWithRanks = dfWithRanks
.withColumnRenamed("srcId","id_from")
.withColumnRenamed("dstId","id_to")
.withColumnRenamed("year","year_from")
.withColumnRenamed("title","title_from")
.withColumnRenamed("authors","authors_from")
.withColumnRenamed("jurnal","jurnal_from")
.withColumnRenamed("abstract","abstract_from")
var infoDfRenamed = dfWithRanks
.withColumnRenamed("id_from","id_from")
.withColumnRenamed("id_to","id_to")
.withColumnRenamed("year_from","year_to")
.withColumnRenamed("title_from","title_to")
.withColumnRenamed("authors_from","authors_to")
.withColumnRenamed("jurnal_from","jurnal_to")
.withColumnRenamed("abstract_from","abstract_to").select("id_to","year_to","title_to","authors_to","jurnal_to","jurnal_to")
var finalDF = dfWithRanks.as("a").join(infoDF.as("b"),$"a.id_to" === $"b.srcId")
finalDF = finalDF
.withColumnRenamed("year","year_to")
.withColumnRenamed("title","title_to")
.withColumnRenamed("authors","authors_to")
.withColumnRenamed("jurnal","jurnal_to")
.withColumnRenamed("abstract","abstract_to")
println("Dropping unused columns from joinedDF...")
finalDF = finalDF.drop("srcId")
finalDF.show(5)
Here are my results!
Please ignore all calculations and code related to pgRank. Is there a proper way to make this join work?
You can filter out null values first and then join, so that they never reach the result:
df.filter($"ColumnName".isNotNull)
Alternatively, use the null-safe equality operator <=>, which treats two null values as equal, in the join condition:
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" <=> $"b.srcId")
In Spark 2.1 or later, the same comparison is available as the eqNullSafe function:
var joinedDF = trainingDF.join(infoDF,trainingDF("srcId").eqNullSafe(infoDF("srcId")))
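For the result shape described in the question (node columns for both ends of each edge plus time_dist), one possible sketch is to join the edge DataFrame against two renamed copies of the node DataFrame. This is an illustration under assumptions, not the asker's code: the object and helper names are hypothetical, the node DataFrame is assumed to have at least id and year columns, and rows with nulls are dropped up front with na.drop(). Because both joins are inner joins, edges whose endpoints are missing from the node table are discarded rather than padded with nulls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{abs, col}

object JoinBothEnds {
  // Append a suffix to every node column, e.g. year -> year_from.
  def withSuffix(nodes: DataFrame, suffix: String): DataFrame =
    nodes.columns.foldLeft(nodes)((df, c) => df.withColumnRenamed(c, s"${c}_$suffix"))

  def joinEdges(edges: DataFrame, nodes: DataFrame): DataFrame = {
    // Drop node rows that contain any null before joining.
    val clean = nodes.na.drop()
    val from = withSuffix(clean, "from")
    val to = withSuffix(clean, "to")
    edges
      .join(from, col("srcId") === col("id_from")) // inner join: unmatched edges are dropped
      .join(to, col("dstId") === col("id_to"))
      .withColumn("time_dist", abs(col("year_from") - col("year_to")))
      .drop("srcId", "dstId")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
    import spark.implicits._
    // Tiny made-up sample data.
    val nodes = Seq((1, 2000, "A"), (2, 2003, "B"), (3, 1999, "C")).toDF("id", "year", "title")
    val edges = Seq((1, 2, 1), (2, 3, 0), (1, 9, 1)).toDF("srcId", "dstId", "label")
    joinEdges(edges, nodes).show()
    spark.stop()
  }
}
```

Renaming the columns before the join avoids ambiguous column names afterwards, so no aliases are needed in the join conditions.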

spark dataframe reducebykey (with non-unique key values) and custom value operation

I am using Spark 1.5.0 with the following code.
case class myCaseClass(user_id: String, description: String)
Here is my UDF
val getConcatenated = udf( (first: String, second: String, third: String) => { first + " " + second + " " + third} )
Here is where I generate my dataframe
val df_description = df.withColumn("description",getConcatenated(col("text1"), col("text2"), col("weight"))).select("user_id","description")
Now I want to do a reduceByKey operation on this DataFrame, which has two columns (both are strings). My user_ids are not unique, and I want to concatenate all description entries for a given user_id.
How can I achieve that?
I can do something like this:
val description_rdd = df_description.map(row => myCaseClass(row.getString(0), row.getString(1)))
But how do I generate a pair RDD here? I then want to switch back to a DataFrame by calling the createDataFrame method on the RDD.
The code below creates a DataFrame with your key column and a column holding a sequence of your descriptions:
import org.apache.spark.rdd.PairRDDFunctions
import sqlContext.implicits._ // needed for toDF outside the Spark shell
val pairRDD : PairRDDFunctions[String, String] = df_description.rdd.map(row => (row.getString(0), row.getString(1)))
val groupedRDD = pairRDD.groupByKey().map(p => (p._1, p._2.toSeq))
val groupedDF = groupedRDD.toDF()
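If a single concatenated string per user_id is wanted rather than a sequence, reduceByKey can be applied to the same pair RDD. The following is a minimal sketch under assumptions: it uses the Spark 2.x SparkSession (on Spark 1.5 the import would be sqlContext.implicits._ instead), the object and method names are hypothetical, and the input DataFrame is assumed to have the two string columns from the question in that order. The order in which values are concatenated is not guaranteed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ConcatByKey {
  def concatDescriptions(spark: SparkSession, df: DataFrame): DataFrame = {
    import spark.implicits._
    df.rdd
      .map(row => (row.getString(0), row.getString(1))) // (user_id, description) pairs
      .reduceByKey((a, b) => a + " " + b)               // concatenate values per key
      .toDF("user_id", "description")                   // back to a DataFrame
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("concat-sketch").getOrCreate()
    import spark.implicits._
    // Made-up sample rows with a repeated user_id.
    val df = Seq(("u1", "a"), ("u2", "b"), ("u1", "c")).toDF("user_id", "description")
    concatDescriptions(spark, df).show()
    spark.stop()
  }
}
```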

Filtering a Dataframe based on another Dataframe in Spark

I have a dataframe df with columns
date: timestamp
status : String
name : String
I'm trying to find the last status for each name:
val users = df.select("name").distinct
val final_status = users.map( t =>
{
val _name = t.getString(0)
val userRecord = df.where(col("name") === _name)
val lastRecord = userRecord.sort(desc("date")).first
lastRecord
})
This works with an array, but with a Spark DataFrame it throws java.lang.NullPointerException.
Update 1: Using dropDuplicates
df.sort(desc("date")).dropDuplicates("name")
Is this a good solution?
This
df.sort(desc("date")).dropDuplicates("name")
is not guaranteed to work. The answers to the following question should work for you:
spark: How to do a dropDuplicates on a dataframe while keeping the highest timestamped row
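One common way to keep the newest row per name is a window function. This is offered as a sketch under assumptions and is not taken from the linked answers. It assumes the date, status and name columns from the question and Spark 1.6 or later, where the function is called row_number; the object and method names are hypothetical.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LatestPerName {
  def lastStatus(df: DataFrame): DataFrame = {
    // Number the rows of each name from newest to oldest.
    val byNameNewestFirst = Window.partitionBy("name").orderBy(col("date").desc)
    df.withColumn("rn", row_number().over(byNameNewestFirst))
      .where(col("rn") === 1) // keep only the newest row per name
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("latest-sketch").getOrCreate()
    import spark.implicits._
    // Made-up sample rows.
    val df = Seq(
      (Timestamp.valueOf("2020-01-01 00:00:00"), "new", "alice"),
      (Timestamp.valueOf("2020-02-01 00:00:00"), "done", "alice"),
      (Timestamp.valueOf("2020-01-15 00:00:00"), "new", "bob")
    ).toDF("date", "status", "name")
    lastStatus(df).show()
    spark.stop()
  }
}
```

Unlike sorting followed by dropDuplicates, the ordering here is part of the window definition, so the row that survives does not depend on how the data happens to be partitioned. If two rows of the same name share the same date, which of them is kept is arbitrary.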

How to combine two RDDs with a function to get a result RDD

I am a beginner with Apache Spark. I want to combine two RDDs into a result RDD with the code below:
def runSpark(stList:List[SubStTime],icList:List[IcTemp]): Unit ={
val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
val sc = new SparkContext(conf)
val st = sc.parallelize(stList).map(st => ((st.productId,st.routeNo),st)).groupByKey()
val ic = sc.parallelize(icList).map(ic => ((ic.productId,ic.routeNo),ic)).groupByKey()
//TODO
//val result = st.join(ic).mapValues( )
sc.stop()
}
Here is what I want to do:
List[ST] -> map -> Map(Key, st) -> groupByKey -> Map(Key, List[st])
List[IC] -> map -> Map(Key, ic) -> groupByKey -> Map(Key, List[ic])
STRDD join ICRDD -> Map(Key, (List[st], List[ic]))
I have a function that compares listST and listIC and returns a List[result]; each result contains both SubStTime and IcTemp information:
def calcIcSt(st:List[SubStTime],ic:List[IcTemp]): List[result]
I don't know how to use mapValues or some other method to get my result.
Thanks
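groupByKey yields Iterable values, so convert them to List before passing them to calcIcSt:
val result = st.join(ic).mapValues( x => calcIcSt(x._1.toList, x._2.toList) )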
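A self-contained sketch of the whole pipeline follows. The case class fields and the body of calcIcSt are hypothetical placeholders, since the question shows only the signature: here calcIcSt simply pairs every SubStTime with every IcTemp that shares the key. The object and method names are also made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical, simplified versions of the question's types.
case class SubStTime(productId: String, routeNo: Int, time: Long)
case class IcTemp(productId: String, routeNo: Int, temp: Double)
case class Result(productId: String, routeNo: Int, time: Long, temp: Double)

object JoinGrouped {
  // Placeholder for the question's calcIcSt: pairs every st with every ic.
  def calcIcSt(st: List[SubStTime], ic: List[IcTemp]): List[Result] =
    for (s <- st; i <- ic) yield Result(s.productId, s.routeNo, s.time, i.temp)

  def run(sc: SparkContext,
          stList: List[SubStTime],
          icList: List[IcTemp]): RDD[((String, Int), List[Result])] = {
    val st = sc.parallelize(stList).map(s => ((s.productId, s.routeNo), s)).groupByKey()
    val ic = sc.parallelize(icList).map(i => ((i.productId, i.routeNo), i)).groupByKey()
    // join keeps only keys present on both sides; values are (Iterable, Iterable).
    st.join(ic).mapValues { case (sts, ics) => calcIcSt(sts.toList, ics.toList) }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OD").setMaster("local[*]"))
    val stList = List(SubStTime("p1", 1, 10L), SubStTime("p1", 1, 20L), SubStTime("p2", 7, 30L))
    val icList = List(IcTemp("p1", 1, 36.5))
    run(sc, stList, icList).collect().foreach(println)
    sc.stop()
  }
}
```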
