why the data has changed after convert to parquet format testing by union two dataframe? - apache-spark

I wrote a function to operation on a csv file, to convert it to parquet format.
and I wonder how to make sure the data is the same,not lost or add.
So I wrote a test for it. But it turns out they are not the same:
My logic is:
1) make the csv to dataframe A.
2)and make the dataframe A to parquet format ,save to a dir.
3)read the parquet file to be a new dataframe B.
4)then A.union(B).
5)count the A and B and A.union(B).
If the three are the same ,then I can get to the conclusion that they are the same data.
But I get third one different.
def doJob(sc: SparkContext, data: RDD[String]): DataFrame = {
logInfo("Extracting omniture data")
val result = data
.filter(_.contains("PAGE."))
.filter(_.contains(".PACKAGE"))
val sqlsqlContext = new SQLContext(sc)
//just ignore above codes...
val packagesCsvDF = sqlsqlContext.load("com.databricks.spark.csv", Map("path" -> "file:///D:/test/testsample.csv", "header" -> "true"))
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
//
// // we should have some additional filter here
// val mydf = packagesDF.groupBy($"page_url").agg(last($"pagename"),last($"prop46"),last($"prop56"),last($"post_evar34"))
// logInfo("show mydf")
// mydf.show()
//TODO
// save files
logInfo("Saving omniture packages data to S3")
if (true) {
packagesCsvDF
.repartition(sc.defaultParallelism, col("pagename"))
.write
.mode(SaveMode.Append)
.partitionBy("pagename")
.parquet("file:///D:/test/parquet")
logInfo("packagesDF")
}
packagesCsvDF//Is this packagesCsvDF have not been changed yet??????
}
TEST:
object ParquetDataTestsSpec {
def main (args: Array[String] ): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("parquet data test Logs").setMaster("local"))
val input = PackagesOmnitureMapReduceJob.formatToJson(sc.textFile("file:///D:/test/option.json", sc.defaultParallelism))
val df = PackagesOmnitureMapReduceJob.doJob(sc, input)//call the function I want to test in "file:///D:/test/parquet"
val sqlContext = new SQLContext(sc)
val SourceCSVDF = sqlContext.load("com.databricks.spark.csv", Map("path" -> "file:///D:/test/testsample.csv", "header" -> "true"))// original
val parquetDataFrame = sqlContext.read.parquet("file:///D:/test/parquet") //get the new dataframe
val dfCount = df.count()
val SourceCSVDFcount = SourceCSVDF.count()
val parquetDataCount = parquetDataFrame.count()
val unionCount = parquetDataFrame.union(SourceCSVDF).count()
println(dfCount,SourceCSVDFcount,parquetDataCount,unionCount)
}
}
print:
(200,200,200,400)
then I try to parse all the dataframe to json:
parquetDataFrame.write.json("file:///D:/test/parquetDataFrame")
SourceCSVDF.write.json("file:///D:/test/SourceCSVDF")
df.write.json("file:///D:/test/Desktop/df")
and when I open the json files, I find they are so all same..Is the problem is coming with the key word union?

val unionalldis3 = parquetDataFrame.unionAll(SourceCSVDF).distinct().count()
then it is right...
But I am very confused.I thought union() is the distincted unionAll....

Related

Schema not writing to csv even if header=true is set

I am trying to create an empty dataframe and simply writing it to csv file.I was expecting schema will get written to file as I have specified header=true while writing but its creating empty .csv file.
I have tried setting different properties but nothing is working.
object HeaderTest extends App {
val spark = SparkSession.builder.master("local").appName("learning
spark").getOrCreate
val sc = spark.sparkContext
import spark.implicits._
val df: DataFrame = Seq.empty[(String, Int)].toDF("k", "v")
val f = "E:\\data.csv"
df.write.mode("overwrite").option("header", "true").csv(f)
}

To avoid manual files errors how to code dynamic datatype of a column check in spark/scala

We are getting lot of manual files which we need to validate the few datatypes before process the data-frame. Can someone please suggest how can I proceed on this requirement. Basically need to write one spark Generic/common program which should work for many files. if possible please send more detail on this email id as well pathirammi1#gmail.com.
Wondering if your files have records with delimiter seperated (like csv file). If yes, you could very well read it as a text file and split the records based and delimiter and process it.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object RDDFromCSVFile {
def main(args:Array[String]): Unit ={
def splitString(row:String):Array[String]={
row.split(",")
}
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
val rdd = sc.textFile("randomfile.csv")
val rdd2:RDD = rdd.map(row=>{
val strArray = splitString(row)
val field1 = strArray(0)
val field2 = strArray(1)
val field3 = strArray(3)
val field4 = strArray(4)
// DO custom code here and return to create RDD
})
rdd2.foreach(a=>println(a.toString))
}
}
If you have non-structured data then you should use below code
import org.apache.spark.sql.SparkSession
object RDDFromWholeTextFile {
def main(args:Array[String]): Unit = {
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
val rdd = sc.wholeTextFiles("alice.txt")
rdd.foreach(a=>println(a._1+"---->"+a._2))
}
}
Hope this helps !!
Thanks,
Naveen

Spark load files collection in batch and find the line from each file with additional info from file level

I have the files collection specified with comma separator, like:
hdfs://user/cloudera/date=2018-01-15,hdfs://user/cloudera/date=2018-01-16,hdfs://user/cloudera/date=2018-01-17,hdfs://user/cloudera/date=2018-01-18,hdfs://user/cloudera/date=2018-01-19,hdfs://user/cloudera/date=2018-01-20,hdfs://user/cloudera/date=2018-01-21,hdfs://user/cloudera/date=2018-01-22
and I'm loading the files with Apache Spark, all in once with:
val input = sc.textFile(files)
Also, I have additional information associated with each file - the unique ID, for example:
File ID
--------------------------------------------------
hdfs://user/cloudera/date=2018-01-15 | 12345
hdfs://user/cloudera/date=2018-01-16 | 09245
hdfs://user/cloudera/date=2018-01-17 | 345hqw4
and so on
As the output, I need to receive the DataFrame with the rows, where each row will contain the same ID, as the ID of the file from which this line was read.
Is it possible to pass this information in some way to Spark in order to be able to associate with the lines?
Core sql approach with UDF (the same thing you can achieve with join if you represent File -> ID mapping as Dataframe):
import org.apache.spark.sql.functions
val inputDf = sparkSession.read.text(".../src/test/resources/test")
.withColumn("fileName", functions.input_file_name())
def withId(mapping: Map[String, String]) = functions.udf(
(file: String) => mapping.get(file)
)
val mapping = Map(
"file:///.../src/test/resources/test/test1.txt" -> "id1",
"file:///.../src/test/resources/test/test2.txt" -> "id2"
)
val resutlDf = inputDf.withColumn("id", withId(mapping)(inputDf("fileName")))
resutlDf.show(false)
Result:
+-----+---------------------------------------------+---+
|value|fileName |id |
+-----+---------------------------------------------+---+
|row1 |file:///.../src/test/resources/test/test1.txt|id1|
|row11|file:///.../src/test/resources/test/test1.txt|id1|
|row2 |file:///.../src/test/resources/test/test2.txt|id2|
|row22|file:///.../src/test/resources/test/test2.txt|id2|
+-----+---------------------------------------------+---+
text1.txt:
row1
row11
text2.txt:
row2
row22
This could help (not tested)
// read single text file into DataFrame and add 'id' column
def readOneFile(filePath: String, fileId: String)(implicit spark: SparkSession): DataFrame = {
val dfOriginal: DataFrame = spark.read.text(filePath)
val dfWithIdColumn: DataFrame = dfOriginal.withColumn("id", lit(fileId))
dfWithIdColumn
}
// read all text files into DataFrame
def readAllFiles(filePathIdsSeq: Seq[(String, String)])(implicit spark: SparkSession): DataFrame = {
// create empty DataFrame with expected schema
val emptyDfSchema: StructType = StructType(List(
StructField("value", StringType, false),
StructField("id", StringType, false)
))
val emptyDf: DataFrame = spark.createDataFrame(
rowRDD = spark.sparkContext.emptyRDD[Row],
schema = emptyDfSchema
)
val unionDf: DataFrame = filePathIdsSeq.foldLeft(emptyDf) { (intermediateDf: DataFrame, filePathIdTuple: (String, String)) =>
intermediateDf.union(readOneFile(filePathIdTuple._1, filePathIdTuple._2))
}
unionDf
}
References
spark.read.text(..) method
Create empty DataFrame

Spark scala: convert Iterator[char] to RDD[String]

I am reading data from a file and have reached to a point where the datatype is Iterator[char]. Is there a way to transform Iterator[char] to RDD[String]? which then I can transform to Dataframe/Dataset using case class.
Below is the code:
val fileDir = "inputFileName"
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(171).map{arr => arr.update(2, 32);arr}.flatMap{arr => arr.update(3, 32); arr}
val convert_char = remove_comp.map( _.toChar)
This return convert_char: Iterator[Char] = non-empty iterator
Thanks
Not sure what you are trying to do, but this should answer your question:
val ic: Iterator[Char] = ???
val spark : SparkSession = ???
val rdd: RDD[String] = spark.sparkContext.parallelize(ic.map(_.toString).toSeq)

How two RDD according to funcation get Result RDD

I am a beginner of Apache Spark. I want to filter two RDD into result RDD with the below code
def runSpark(stList:List[SubStTime],icList:List[IcTemp]): Unit ={
val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
val sc = new SparkContext(conf)
val st = sc.parallelize(stList).map(st => ((st.productId,st.routeNo),st)).groupByKey()
val ic = sc.parallelize(icList).map(ic => ((ic.productId,ic.routeNo),ic)).groupByKey()
//TODO
//val result = st.join(ic).mapValues( )
sc.stop()
}
here is what i want to do
List[ST] ->map ->Map(Key,st) ->groupByKey ->Map(Key,List[st])
List[IC] ->map ->Map(Key,ic) ->groupByKey ->Map(Key,List[ic])
STRDD join ICRDD get Map(Key,(List[st],List[ic]))
I have a function compare listST and listIC get the List[result] result contains both SubStTime and IcTemp information
def calcIcSt(st:List[SubStTime],ic:List[IcTemp]): List[result]
I don't know how to use mapvalues or other some way to get my result
Thanks
val result = st.join(ic).mapValues( x => calcIcSt(x._1,x._2) )

Resources