Spark scala: convert Iterator[char] to RDD[String] - apache-spark

I am reading data from a file and have reached to a point where the datatype is Iterator[char]. Is there a way to transform Iterator[char] to RDD[String]? which then I can transform to Dataframe/Dataset using case class.
Below is the code:
val fileDir = "inputFileName"
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(171).map{arr => arr.update(2, 32);arr}.flatMap{arr => arr.update(3, 32); arr}
val convert_char = remove_comp.map( _.toChar)
This return convert_char: Iterator[Char] = non-empty iterator
Thanks

Not sure what you are trying to do, but this should answer your question:
val ic: Iterator[Char] = ???
val spark : SparkSession = ???
val rdd: RDD[String] = spark.sparkContext.parallelize(ic.map(_.toString).toSeq)

Related

To avoid manual files errors how to code dynamic datatype of a column check in spark/scala

We are getting lot of manual files which we need to validate the few datatypes before process the data-frame. Can someone please suggest how can I proceed on this requirement. Basically need to write one spark Generic/common program which should work for many files. if possible please send more detail on this email id as well pathirammi1#gmail.com.
Wondering if your files have records with delimiter seperated (like csv file). If yes, you could very well read it as a text file and split the records based and delimiter and process it.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object RDDFromCSVFile {
def main(args:Array[String]): Unit ={
def splitString(row:String):Array[String]={
row.split(",")
}
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
val rdd = sc.textFile("randomfile.csv")
val rdd2:RDD = rdd.map(row=>{
val strArray = splitString(row)
val field1 = strArray(0)
val field2 = strArray(1)
val field3 = strArray(3)
val field4 = strArray(4)
// DO custom code here and return to create RDD
})
rdd2.foreach(a=>println(a.toString))
}
}
If you have non-structured data then you should use below code
import org.apache.spark.sql.SparkSession
object RDDFromWholeTextFile {
def main(args:Array[String]): Unit = {
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
val rdd = sc.wholeTextFiles("alice.txt")
rdd.foreach(a=>println(a._1+"---->"+a._2))
}
}
Hope this helps !!
Thanks,
Naveen

Spark Testing Base returning a tuple of Dstream

I'm using spark-unit-test and I don't get to compile the code.
test("Testging") {
val inputInsert = A("data2")
val inputDelete = A("data1")
val outputInsert = B(1)
val outputDelete = C(1)
val input = List(List(inputInsert), List(inputDelete))
val output = (List(List(outputInsert)), List(List(outputDelete)))
//Why doesn't it compile?? I have tried many things here.
testOperation[A,(B,C)](input, service.processing _, output)
}
My method is:
def processing(avroDstream: DStream[A]) : (DStream[B],DStream[C]) ={...}
What does the "_" means in this case?

why the data has changed after convert to parquet format testing by union two dataframe?

I wrote a function to operation on a csv file, to convert it to parquet format.
and I wonder how to make sure the data is the same,not lost or add.
So I wrote a test for it. But it turns out they are not the same:
My logic is:
1) make the csv to dataframe A.
2)and make the dataframe A to parquet format ,save to a dir.
3)read the parquet file to be a new dataframe B.
4)then A.union(B).
5)count the A and B and A.union(B).
If the three are the same ,then I can get to the conclusion that they are the same data.
But I get third one different.
def doJob(sc: SparkContext, data: RDD[String]): DataFrame = {
logInfo("Extracting omniture data")
val result = data
.filter(_.contains("PAGE."))
.filter(_.contains(".PACKAGE"))
val sqlsqlContext = new SQLContext(sc)
//just ignore above codes...
val packagesCsvDF = sqlsqlContext.load("com.databricks.spark.csv", Map("path" -> "file:///D:/test/testsample.csv", "header" -> "true"))
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
//
// // we should have some additional filter here
// val mydf = packagesDF.groupBy($"page_url").agg(last($"pagename"),last($"prop46"),last($"prop56"),last($"post_evar34"))
// logInfo("show mydf")
// mydf.show()
//TODO
// save files
logInfo("Saving omniture packages data to S3")
if (true) {
packagesCsvDF
.repartition(sc.defaultParallelism, col("pagename"))
.write
.mode(SaveMode.Append)
.partitionBy("pagename")
.parquet("file:///D:/test/parquet")
logInfo("packagesDF")
}
packagesCsvDF//Is this packagesCsvDF have not been changed yet??????
}
TEST:
object ParquetDataTestsSpec {
def main (args: Array[String] ): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("parquet data test Logs").setMaster("local"))
val input = PackagesOmnitureMapReduceJob.formatToJson(sc.textFile("file:///D:/test/option.json", sc.defaultParallelism))
val df = PackagesOmnitureMapReduceJob.doJob(sc, input)//call the function I want to test in "file:///D:/test/parquet"
val sqlContext = new SQLContext(sc)
val SourceCSVDF = sqlContext.load("com.databricks.spark.csv", Map("path" -> "file:///D:/test/testsample.csv", "header" -> "true"))// original
val parquetDataFrame = sqlContext.read.parquet("file:///D:/test/parquet") //get the new dataframe
val dfCount = df.count()
val SourceCSVDFcount = SourceCSVDF.count()
val parquetDataCount = parquetDataFrame.count()
val unionCount = parquetDataFrame.union(SourceCSVDF).count()
println(dfCount,SourceCSVDFcount,parquetDataCount,unionCount)
}
}
print:
(200,200,200,400)
then I try to parse all the dataframe to json:
parquetDataFrame.write.json("file:///D:/test/parquetDataFrame")
SourceCSVDF.write.json("file:///D:/test/SourceCSVDF")
df.write.json("file:///D:/test/Desktop/df")
and when I open the json files, I find they are so all same..Is the problem is coming with the key word union?
val unionalldis3 = parquetDataFrame.unionAll(SourceCSVDF).distinct().count()
then it is right...
But I am very confused.I thought union() is the distincted unionAll....

Regarding Spark Dataframereader jdbc

I have a question regarding Mechanics of Spark Dataframereader. I will appreciate if anybody can help me. Let me explain the Scenario here
I am creating a DataFrame from Dstream like this. This in Input Data
var config = new HashMap[String,String]();
config += ("zookeeper.connect" ->zookeeper);
config += ("partition.assignment.strategy" ->"roundrobin");
config += ("bootstrap.servers" ->broker);
config += ("serializer.class" -> "kafka.serializer.DefaultEncoder");
config += ("group.id" -> "default");
val lines = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc,config.toMap,Set(topic)).map(_._2)
lines.foreachRDD { rdd =>
if(!rdd.isEmpty()){
val rddJson = rdd.map { x => MyFunctions.mapToJson(x) }
val sqlContext = SQLContextSingleton.getInstance(ssc.sparkContext)
val rddDF = sqlContext.read.json(rddJson)
rddDF.registerTempTable("inputData")
val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(sqlContext, jdbcUrl, "ABCD","A",numOfPartiton,lowerBound,upperBound)
Here is the code of ReadDataFrameHelper
def readDataFrameHelperFromDB(sqlContext:HiveContext,jdbcUrl:String,dbTableOrQuery:String,
columnToPartition:String,numOfPartiton:Int,lowerBound:Int,highBound:Int):DataFrame={
val jdbcDF = sqlContext.read.jdbc(url = jdbcUrl, table = dbTableOrQuery,
columnName = columnToPartition,
lowerBound = lowerBound,
upperBound = highBound,
numPartitions = numOfPartiton,
connectionProperties = new java.util.Properties()
)
jdbcDF
}
Lastly I am doing a Join like this
val joinedData = rddDF.join(dbDF,rddDF("ID") === dbDF("ID")
&& rddDF("CODE") === dbDF("CODE"),"left_outer")
.drop(dbDF("code"))
.drop(dbDF("id"))
.drop(dbDF("number"))
.drop(dbDF("key"))
.drop(dbDF("loaddate"))
.drop(dbDF("fid"))
joinedData.show()
My input DStream will have 1000 rows and data will contains million of rows. So when I do this join, will spark load all the rows from database and read those rows or will this just read those specific rows from DB which have the code,id from the input DStream
As specified by zero323, i have also confirmed that data will be read full from the table. I checked the DB session logs and saw that whole dataset is getting loaded.
Thanks zero323

How two RDD according to funcation get Result RDD

I am a beginner of Apache Spark. I want to filter two RDD into result RDD with the below code
def runSpark(stList:List[SubStTime],icList:List[IcTemp]): Unit ={
val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
val sc = new SparkContext(conf)
val st = sc.parallelize(stList).map(st => ((st.productId,st.routeNo),st)).groupByKey()
val ic = sc.parallelize(icList).map(ic => ((ic.productId,ic.routeNo),ic)).groupByKey()
//TODO
//val result = st.join(ic).mapValues( )
sc.stop()
}
here is what i want to do
List[ST] ->map ->Map(Key,st) ->groupByKey ->Map(Key,List[st])
List[IC] ->map ->Map(Key,ic) ->groupByKey ->Map(Key,List[ic])
STRDD join ICRDD get Map(Key,(List[st],List[ic]))
I have a function compare listST and listIC get the List[result] result contains both SubStTime and IcTemp information
def calcIcSt(st:List[SubStTime],ic:List[IcTemp]): List[result]
I don't know how to use mapvalues or other some way to get my result
Thanks
val result = st.join(ic).mapValues( x => calcIcSt(x._1,x._2) )

Resources