How to combine two RDDs into a result RDD using a function - apache-spark

I am a beginner with Apache Spark. I want to combine two RDDs into a result RDD with the code below:
def runSpark(stList: List[SubStTime], icList: List[IcTemp]): Unit = {
  val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val st = sc.parallelize(stList).map(st => ((st.productId, st.routeNo), st)).groupByKey()
  val ic = sc.parallelize(icList).map(ic => ((ic.productId, ic.routeNo), ic)).groupByKey()
  //TODO
  //val result = st.join(ic).mapValues( )
  sc.stop()
}
Here is what I want to do:
List[ST] -> map -> Map(Key, st) -> groupByKey -> Map(Key, List[st])
List[IC] -> map -> Map(Key, ic) -> groupByKey -> Map(Key, List[ic])
STRDD join ICRDD gives Map(Key, (List[st], List[ic]))
I have a function that compares listST and listIC and returns a List[result], where result contains both SubStTime and IcTemp information:
def calcIcSt(st:List[SubStTime],ic:List[IcTemp]): List[result]
I don't know how to use mapValues or some other way to get my result. Thanks.

val result = st.join(ic).mapValues( x => calcIcSt(x._1,x._2) )
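Note that groupByKey produces Iterable values rather than List, so with calcIcSt declared over Lists the grouped values may need an explicit conversion. A minimal sketch, assuming the types from the question:

// st and ic are keyed by (productId, routeNo) with Iterable values after groupByKey,
// so convert each Iterable to a List before calling calcIcSt.
val result = st.join(ic).mapValues { case (stGroup, icGroup) =>
  calcIcSt(stGroup.toList, icGroup.toList)
}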

Related

Join apache spark dataframes properly with scala avoiding null values

Hello everyone!
I have two DataFrames in Apache Spark (2.3) and I want to join them properly. I will explain below what I mean by 'properly'. First of all, the two dataframes hold the following information:
nodeDf: ( id, year, title, authors, journal, abstract )
edgeDf: ( srcId, dstId, label )
The label is 0 or 1, depending on whether node1 is connected with node2 or not.
I want to combine these two dataframes into one dataframe with the following information:
JoinedDF: ( id_from, year_from, title_from, journal_from, abstract_from, id_to, year_to, title_to, journal_to, abstract_to, time_dist )
time_dist = abs(year_from - year_to)
By 'properly' I mean that the query must be as fast as possible, and that the result must not contain null rows or cells (null values in a row).
I have tried the following, but it took 500-540 seconds to execute the query and the final dataframe contains null values. I don't even know if the dataframes were joined correctly.
I want to mention that the node file from which I create the nodeDF has 27770 rows and the edge file (edgeDf) has 615512 rows.
Code:
val spark = SparkSession.builder().master("local[*]").appName("Logistic Regression").getOrCreate()
val sc = spark.sparkContext
val data = sc.textFile("resources/data/training_set.txt").map(line => {
  val fields = line.split(" ")
  (fields(0), fields(1), fields(2).toInt)
})
val data2 = sc.textFile("resources/data/test_set.txt").map(line => {
  val fields = line.split(" ")
  (fields(0), fields(1))
})
import spark.implicits._
val trainingDF = data.toDF("srcId","dstId", "label")
val testDF = data2.toDF("srcId","dstId")
val infoRDD = spark.read.option("header","false").option("inferSchema","true").format("csv").load("resources/data/node_information.csv")
val infoDF = infoRDD.toDF("srcId","year","title","authors","jurnal","abstract")
println("Showing linksDF sample...")
trainingDF.show(5)
println("Rows of linksDF: ",trainingDF.count())
println("Showing infoDF sample...")
infoDF.show(2)
println("Rows of infoDF: ",infoDF.count())
println("Joining linksDF and infoDF...")
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" === $"b.srcId")
println(joinedDF.count())
joinedDF = joinedDF.select($"a.srcId",$"a.dstId",$"a.label",$"b.year",$"b.title",$"b.authors",$"b.jurnal",$"b.abstract")
joinedDF.show(5)
val graphX = new GraphX()
val pageRankDf =graphX.computePageRank(spark,"resources/data/training_set.txt",0.0001)
println("Joining joinedDF and pageRankDf...")
joinedDF = joinedDF.as("a").join(pageRankDf.as("b"),$"a.srcId" === $"b.nodeId")
var dfWithRanks = joinedDF.select("srcId","dstId","label","year","title","authors","jurnal","abstract","rank").withColumnRenamed("rank","pgRank")
dfWithRanks.show(5)
println("Renameming joinedDF...")
dfWithRanks = dfWithRanks
.withColumnRenamed("srcId","id_from")
.withColumnRenamed("dstId","id_to")
.withColumnRenamed("year","year_from")
.withColumnRenamed("title","title_from")
.withColumnRenamed("authors","authors_from")
.withColumnRenamed("jurnal","jurnal_from")
.withColumnRenamed("abstract","abstract_from")
var infoDfRenamed = dfWithRanks
.withColumnRenamed("id_from","id_from")
.withColumnRenamed("id_to","id_to")
.withColumnRenamed("year_from","year_to")
.withColumnRenamed("title_from","title_to")
.withColumnRenamed("authors_from","authors_to")
.withColumnRenamed("jurnal_from","jurnal_to")
.withColumnRenamed("abstract_from","abstract_to").select("id_to","year_to","title_to","authors_to","jurnal_to","jurnal_to")
var finalDF = dfWithRanks.as("a").join(infoDF.as("b"),$"a.id_to" === $"b.srcId")
finalDF = finalDF
.withColumnRenamed("year","year_to")
.withColumnRenamed("title","title_to")
.withColumnRenamed("authors","authors_to")
.withColumnRenamed("jurnal","jurnal_to")
.withColumnRenamed("abstract","abstract_to")
println("Dropping unused columns from joinedDF...")
finalDF = finalDF.drop("srcId")
finalDF.show(5)
Here are my results!
Ignore all the calculations and code related to pgRank! Is there any proper way to make this join work?
You can filter your data first and then join; in that case you will avoid nulls:
df.filter($"ColumnName".isNotNull)
Use the <=> operator in your join column condition:
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" <=> $"b.srcId")
In Spark 2.1 or greater there is the eqNullSafe function:
var joinedDF = trainingDF.join(infoDF,trainingDF("srcId").eqNullSafe(infoDF("srcId")))
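As a further sketch (my own suggestion, not part of the original answer), the whole from/to structure can be built by joining the node information twice under suffixed column names. This avoids the long chains of withColumnRenamed and computes time_dist in the same pass; the column names are the ones used in the question:

import org.apache.spark.sql.functions.{abs, col}

// Rename all infoDF columns once per side of the relation.
val from = infoDF.toDF(infoDF.columns.map(_ + "_from"): _*)
val to = infoDF.toDF(infoDF.columns.map(_ + "_to"): _*)

val joinedFromTo = trainingDF
  .join(from, trainingDF("srcId") === from("srcId_from")) // inner join, so unmatched rows (and their nulls) are dropped
  .join(to, trainingDF("dstId") === to("srcId_to"))
  .withColumn("time_dist", abs(col("year_from") - col("year_to")))
  .drop("srcId_from", "srcId_to")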

Spark scala: convert Iterator[char] to RDD[String]

I am reading data from a file and have reached a point where the datatype is Iterator[Char]. Is there a way to transform Iterator[Char] to RDD[String], which I can then transform to a DataFrame/Dataset using a case class?
Below is the code:
val fileDir = "inputFileName"
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(171).map{arr => arr.update(2, 32);arr}.flatMap{arr => arr.update(3, 32); arr}
val convert_char = remove_comp.map( _.toChar)
This returns convert_char: Iterator[Char] = non-empty iterator.
Thanks
Not sure what you are trying to do, but this should answer your question:
val ic: Iterator[Char] = ???
val spark : SparkSession = ???
val rdd: RDD[String] = spark.sparkContext.parallelize(ic.map(_.toString).toSeq)
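If the goal is to keep the original fixed-width records together (the question groups the bytes by 171), a variation of the same idea is to regroup the characters into record-sized strings before parallelizing and then map them onto a case class. This is only a sketch: Record and its single payload field are hypothetical, and the 171-character width is taken from the question's grouping.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical record type; the real field layout would come from the file format.
case class Record(payload: String)

val spark: SparkSession = ???
val ic: Iterator[Char] = ???

// Regroup the characters into 171-character record strings, then parallelize.
val records: Seq[String] = ic.grouped(171).map(_.mkString).toSeq
val recordRdd: RDD[String] = spark.sparkContext.parallelize(records)

// Convert to a Dataset via the case class, as the question intends.
import spark.implicits._
val recordDs = recordRdd.map(Record(_)).toDS()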

Why has the data changed after converting to parquet format, when testing by unioning two dataframes?

I wrote a function that operates on a csv file and converts it to parquet format, and I want to make sure the data stays the same, with nothing lost or added. So I wrote a test for it, but it turns out they are not the same.
My logic is:
1) Make the csv into dataframe A.
2) Write dataframe A in parquet format, saved to a dir.
3) Read the parquet file into a new dataframe B.
4) Then A.union(B).
5) Count A and B and A.union(B).
If the three counts are the same, then I can conclude that they are the same data.
But the third one is different.
def doJob(sc: SparkContext, data: RDD[String]): DataFrame = {
logInfo("Extracting omniture data")
val result = data
.filter(_.contains("PAGE."))
.filter(_.contains(".PACKAGE"))
val sqlsqlContext = new SQLContext(sc)
//just ignore above codes...
val packagesCsvDF = sqlsqlContext.load("com.databricks.spark.csv", Map("path" -> "file:///D:/test/testsample.csv", "header" -> "true"))
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
//
// // we should have some additional filter here
// val mydf = packagesDF.groupBy($"page_url").agg(last($"pagename"),last($"prop46"),last($"prop56"),last($"post_evar34"))
// logInfo("show mydf")
// mydf.show()
//TODO
// save files
logInfo("Saving omniture packages data to S3")
if (true) {
packagesCsvDF
.repartition(sc.defaultParallelism, col("pagename"))
.write
.mode(SaveMode.Append)
.partitionBy("pagename")
.parquet("file:///D:/test/parquet")
logInfo("packagesDF")
}
packagesCsvDF // Has this packagesCsvDF still not been changed at this point??????
}
TEST:
object ParquetDataTestsSpec {
def main (args: Array[String] ): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("parquet data test Logs").setMaster("local"))
val input = PackagesOmnitureMapReduceJob.formatToJson(sc.textFile("file:///D:/test/option.json", sc.defaultParallelism))
val df = PackagesOmnitureMapReduceJob.doJob(sc, input)//call the function I want to test in "file:///D:/test/parquet"
val sqlContext = new SQLContext(sc)
val SourceCSVDF = sqlContext.load("com.databricks.spark.csv", Map("path" -> "file:///D:/test/testsample.csv", "header" -> "true"))// original
val parquetDataFrame = sqlContext.read.parquet("file:///D:/test/parquet") //get the new dataframe
val dfCount = df.count()
val SourceCSVDFcount = SourceCSVDF.count()
val parquetDataCount = parquetDataFrame.count()
val unionCount = parquetDataFrame.union(SourceCSVDF).count()
println(dfCount,SourceCSVDFcount,parquetDataCount,unionCount)
}
}
This prints:
(200,200,200,400)
Then I try to write all the dataframes to json:
parquetDataFrame.write.json("file:///D:/test/parquetDataFrame")
SourceCSVDF.write.json("file:///D:/test/SourceCSVDF")
df.write.json("file:///D:/test/Desktop/df")
And when I open the json files, I find they are all the same. Is the problem coming from the keyword union?
val unionalldis3 = parquetDataFrame.unionAll(SourceCSVDF).distinct().count()
Then it is right...
But I am very confused. I thought union() was the deduplicating version of unionAll...
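For what it is worth, a short note: in the DataFrame/Dataset API, union keeps duplicates and behaves like SQL UNION ALL (it is the successor of unionAll), so two identical 200-row frames are expected to union into 400 rows. Only an explicit distinct() gives set semantics. A minimal sketch using the names from the question:

// DataFrame.union keeps duplicates (SQL UNION ALL semantics).
val doubled = parquetDataFrame.union(SourceCSVDF)
doubled.count()            // 400 when both inputs hold the same 200 rows
doubled.distinct().count() // 200 if the rows really are identical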

Regarding Spark Dataframereader jdbc

I have a question regarding the mechanics of Spark DataFrameReader. I would appreciate it if anybody could help me. Let me explain the scenario here.
I am creating a DataFrame from a DStream like this. This is the input data:
var config = new HashMap[String,String]();
config += ("zookeeper.connect" ->zookeeper);
config += ("partition.assignment.strategy" ->"roundrobin");
config += ("bootstrap.servers" ->broker);
config += ("serializer.class" -> "kafka.serializer.DefaultEncoder");
config += ("group.id" -> "default");
val lines = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc,config.toMap,Set(topic)).map(_._2)
lines.foreachRDD { rdd =>
if(!rdd.isEmpty()){
val rddJson = rdd.map { x => MyFunctions.mapToJson(x) }
val sqlContext = SQLContextSingleton.getInstance(ssc.sparkContext)
val rddDF = sqlContext.read.json(rddJson)
rddDF.registerTempTable("inputData")
val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(sqlContext, jdbcUrl, "ABCD","A",numOfPartiton,lowerBound,upperBound)
Here is the code of ReadDataFrameHelper
def readDataFrameHelperFromDB(sqlContext:HiveContext,jdbcUrl:String,dbTableOrQuery:String,
columnToPartition:String,numOfPartiton:Int,lowerBound:Int,highBound:Int):DataFrame={
val jdbcDF = sqlContext.read.jdbc(url = jdbcUrl, table = dbTableOrQuery,
columnName = columnToPartition,
lowerBound = lowerBound,
upperBound = highBound,
numPartitions = numOfPartiton,
connectionProperties = new java.util.Properties()
)
jdbcDF
}
Lastly, I am doing a join like this:
val joinedData = rddDF.join(dbDF,rddDF("ID") === dbDF("ID")
&& rddDF("CODE") === dbDF("CODE"),"left_outer")
.drop(dbDF("code"))
.drop(dbDF("id"))
.drop(dbDF("number"))
.drop(dbDF("key"))
.drop(dbDF("loaddate"))
.drop(dbDF("fid"))
joinedData.show()
My input DStream will have 1000 rows and the database table contains millions of rows. So when I do this join, will Spark load all the rows from the database, or will it read only those specific rows from the DB whose code and id match the input DStream?
As specified by zero323, I have also confirmed that the data is read in full from the table. I checked the DB session logs and saw that the whole dataset is getting loaded. Thanks zero323.
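If reading the whole table is too expensive, one workaround (my own sketch, not part of the original answer) is to push the restriction down to the database by passing a subquery as the table argument of read.jdbc, built from the keys present in the current micro-batch. The IN-list construction below assumes the ID values are strings and that the batch is small, as the question's 1000 rows suggest; "ABCD" is the table name from the question.

// Collect the keys of the current micro-batch (assumed small) on the driver.
val ids = rddDF.select("ID").distinct().collect().map(_.get(0)).mkString("'", "','", "'")

// Push the filter into the query Spark sends to the database.
val pushdownQuery = s"(SELECT * FROM ABCD WHERE ID IN ($ids)) AS t"
val dbDfFiltered = sqlContext.read.jdbc(jdbcUrl, pushdownQuery, new java.util.Properties())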

Spark: Perform SQL function on RDD subsets

I need to perform a SQL function on RDD subsets of increasing size. For this I have to take subsets from the input RDD with the take function:
def main(args: Array[String]) {
  // set up environment
  val conf = new SparkConf()
    .setMaster("local[5]")
    .setAppName("Test")
    .set("spark.executor.memory", "4g")
  val sc = new SparkContext(conf)
  val cntPairsRdd = cntsRdd.map(n => {
    val sample = data0.take(n)
    val dataRDD = sc.parallelize(sample)
    val df = dataRDD.toDF()
    val result = df.select( ...)
    val xCnt = result.count
    (n, xCnt)
  })
}
cntsRdd is a set of increasing integers. The take function returns a list, not an RDD. So to make my SQL work I first need to convert the list to an RDD and then to a dataframe. Unfortunately, inside a map function Spark does not allow creating another RDD; in other words, in Spark one cannot create an RDD inside another RDD. For the same reason Spark does not support SparkContext serialization, and I get a serialization exception when trying sc.parallelize(sample).
Please advise a workaround to perform the SQL function on RDD subsets, as defined in this scenario.
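One possible workaround, sketched below under the assumption that cntsRdd is small enough to collect and that data0's element type has an encoder (a case class or other supported type), is to move the loop onto the driver, so that take, parallelize and toDF all run in driver code rather than inside a transformation. It also assumes a SparkSession named spark is in scope for the implicits:

import spark.implicits._

val counts = cntsRdd.collect()             // bring the subset sizes to the driver
val results = counts.map { n =>
  val sample = data0.take(n)               // driver-side action, returns an Array
  val df = sc.parallelize(sample).toDF()   // legal here: this runs on the driver
  // apply the real select(...) from the original code here before counting
  val xCnt = df.count()
  (n, xCnt)
}
val cntPairsRdd = sc.parallelize(results)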
