spark dataframe null value count - apache-spark

I am new to Spark and I want to calculate the null rate of each column (I have 200 columns). My function is as follows:
import org.apache.spark.sql.DataFrame

def nullCount(dataFrame: DataFrame): Unit = {
  val args = dataFrame.columns.length
  val cols = dataFrame.columns
  val d = dataFrame.count()
  println("Following are the null value rates of each column")
  for (i <- Range(0, args)) {
    val nullrate = dataFrame.rdd.filter(r => r(i) == -900).count.toDouble / d
    println(cols(i), nullrate)
  }
}
But I find it's too slow. Is there a more efficient way to do this?

Adapted from this answer by zero323 (note that count(c) counts only non-null values, so this gives each column's non-null fraction):

import org.apache.spark.sql.functions.{col, count, when}

df.select(df.columns.map(c => (count(c) / count("*")).alias(c)): _*)

To count the -900 sentinel instead:

df.select(df.columns.map(
  c => (count(when(col(c) === -900, col(c))) / count("*")).alias(c)): _*)
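
A combined variant (a sketch only, not from the original answer) that treats a value as missing when it is either a real null or the -900 sentinel, still over the same df:

import org.apache.spark.sql.functions.{col, count, when}

// One aggregation job: for every column, count rows that are null or -900 and divide by the row count.
val missingRates = df.select(
  df.columns.map(c =>
    (count(when(col(c).isNull || col(c) === -900, 1)) / count("*")).alias(c)): _*)

missingRates.show(truncate = false)

Because all the aggregates live in a single select, Spark computes them in one scan of the data, which is what makes this so much faster than running one RDD filter per column.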

Related

How to generate large word count file in Spark?

I want to generate a word-count file of 10 million lines for a performance test (each line has the same sentence), but I have no idea how to code it. Could you give me example code that saves the file directly to HDFS?
You can try something like this: generate one column with values from 1 to 100k and one with values from 1 to 100, then explode both of them with explode(column). You can't generate a single column with 10 million values because the Kryo buffer will throw an error. I don't know whether this is the best-performing way to do it, but it is the fastest way I can think of right now.
import org.apache.spark.sql.functions.{explode, lit, udf}
import spark.implicits._ // assumes an active SparkSession named spark

// UDF that returns the sequence 1..s as an array column.
val generateList = udf((s: Int) => {
  val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
  for (i <- 1 to s) {
    buf += i
  }
  buf
})

val someDF = Seq(
  ("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
).toDF("sentence")

val someDfWithMilColumn = someDF.withColumn("genColumn1", generateList(lit(100000)))
  .withColumn("genColumn2", generateList(lit(100)))

// Explode the 100k-element array: 1 row becomes 100,000 rows.
val someDfWithMilColumn100k = someDfWithMilColumn
  .withColumn("expl_val", explode($"genColumn1")).drop("expl_val", "genColumn1")

// Explode the 100-element array: 100,000 rows become 10,000,000 rows.
val someDfWithMilColumn10mil = someDfWithMilColumn100k
  .withColumn("expl_val2", explode($"genColumn2")).drop("genColumn2", "expl_val2")

someDfWithMilColumn10mil.write.parquet(path)
You can do it by cross-joining the two DataFrames as below; the code explanation is inline.
import org.apache.spark.sql.SaveMode

object GenerateTenMils {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    spark.conf.set("spark.sql.crossJoin.enabled", "true") // Enable cross join
    import spark.implicits._

    // Create a DF with your sentence
    val df = List("each line has the same sentence").toDF

    // Create another Dataset with 10,000,000 records
    spark.range(10000000)
      .join(df)                             // Cross join the dataframes
      .coalesce(1)                          // Output to a single file
      .drop("id")                           // Drop the extra column
      .write
      .mode(SaveMode.Overwrite)
      .text("src/main/resources/tenMils")   // Write as text file
  }
}
You could follow this approach: use tail recursion to generate the list of sentences and the DataFrames, and union to build the big DataFrame.
import scala.annotation.tailrec
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession
  .builder()
  .appName("TenMillionsRows")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
  .config("spark.app.id", "TenMillionsRows")   // To silence Metrics warning
  .getOrCreate()

val sc = spark.sparkContext
import spark.implicits._

/**
  * Returns a List with num copies of the sentence.
  * @param sentence
  * @param num
  * @return
  */
def getList(sentence: String, num: Int): List[String] = {
  @tailrec
  def loop(st: String, n: Int, acc: List[String]): List[String] = {
    n match {
      case num if num == 0 => acc
      case _ => loop(st, n - 1, st :: acc)
    }
  }
  loop(sentence, num, List())
}

/**
  * Returns a DataFrame that is the union of num DataFrames.
  * @param lst
  * @param num
  * @return
  */
def getDataFrame(lst: List[String], num: Int): DataFrame = {
  @tailrec
  def loop(ls: List[String], n: Int, acc: DataFrame): DataFrame = {
    n match {
      case n if n == 0 => acc
      case _ => loop(lst, n - 1, acc.union(sc.parallelize(ls).toDF("sentence")))
    }
  }
  loop(lst, num, sc.parallelize(List(sentence)).toDF("sentence"))
}

val sentence = "hope for the best but prepare for the worst"
val lSentence = getList(sentence, 100000)
val dfs = getDataFrame(lSentence, 100)

println(dfs.count())
// output: 10000001

dfs.write.orc("path_to_hdfs") // write the dataframe to an ORC file
// you can also save the file as parquet, txt, json, ...
// with dataframe.write
Hope this helps.

Spark Find the Occurrences of Matched Strings

How can I find the occurrences of the matched string with the code snippet below? I'm able to get the filtered strings as output, but not the occurrences.
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)

    // Load our input data.
    val input = sc.textFile("file:///tmp/ganesh/*")
    val matched_pattern = input.filter(line => line.contains("Title"))

    // Split it up into words.
    val words = matched_pattern.flatMap(line => line.split(" "))

    // Transform into pairs and count.
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }

    // Save the word count back out to a text file, causing evaluation.
    counts.saveAsTextFile("file:///tmp/sparkout")
  }
}
Here is an example with broadcast variable usage; the stopWords here are in fact the words to include.
val dfsFilename = "/FileStore/tables/7dxa9btd1477497663691/Text_File_01-880f5.txt"
val readFileRDD = spark.sparkContext.textFile(dfsFilename)

// stopWords (include words):
// res4: Array[String] = Array(The the is Is a A to To OK ok I)
val stopWordsInput = spark.sparkContext.textFile("/FileStore/tables/filter_words.txt")
val stopWords = stopWordsInput.flatMap(x => x.split(" ")).map(_.trim).collect.toSet
val broadcasted = sc.broadcast(stopWords)

val wcounts1 = readFileRDD
  .map(x => x.replaceAll("[^A-Za-z0-9]", " ").trim.toLowerCase)
  .flatMap(line => line.split(" "))
  .filter(broadcasted.value.contains(_))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wcounts1.collect
returns:
res2: Array[(String, Int)] = Array((The,1), (I,3), (to,1), (the,1))
You can embellish this with a broadcast of the stopWords, which is what I did. I saw your XML input, so there is a replaceAll you can adjust to your liking. I also added a step to lower-case everything.
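
If all you need is the raw occurrence count rather than per-word totals, a minimal sketch building on the question's own input RDD (not part of the original answer) could look like this:

// Number of lines that contain the pattern.
val matchedLineCount = input.filter(line => line.contains("Title")).count()

// Total occurrences of the token itself, in case a line contains it more than once.
val titleOccurrences = input
  .flatMap(line => line.split(" "))
  .filter(word => word.contains("Title"))
  .count()

println(s"Lines with 'Title': $matchedLineCount, token occurrences: $titleOccurrences")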

spark: How to merge shuffled rdd efficiently?

I have 5 shuffled key-value RDDs: one big one (1,000,000 records) and 4 relatively small ones (100,000 records each). All RDDs were shuffled with the same number of partitions. I have two strategies to merge the five:
1. Merge the 5 RDDs together
2. Merge the 4 small RDDs together and then join the result with the big one
I thought strategy 2 would be more efficient, as it would not re-shuffle the big one, but the experiment shows strategy 1 is more efficient. The code and output follow:
Code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}

object MergeStrategy extends App {

  Logger.getLogger("org").setLevel(Level.ERROR)
  Logger.getLogger("akka").setLevel(Level.ERROR)

  val conf = new SparkConf().setMaster("local[4]").setAppName("test")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val bigRddSize = 1e6.toInt
  val smallRddSize = 1e5.toInt
  println(bigRddSize)

  val bigRdd = sc.parallelize((0 until bigRddSize)
    .map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
  bigRdd.take(10).foreach(println)

  val smallRddList = (0 until 4).map(i => {
    val rst = sc.parallelize((0 until smallRddSize)
      .map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
    println(rst.count)
    rst
  }).toArray

  // strategy 1
  {
    val begin = System.currentTimeMillis
    val s1Rst = sc.union(Array(bigRdd) ++ smallRddList).distinct(100)
    println(s1Rst.count)

    val end = System.currentTimeMillis
    val timeCost = (end - begin) / 1000d
    println("S1 time count: %.1f s".format(timeCost))
  }

  // strategy 2
  {
    val begin = System.currentTimeMillis
    val smallMerged = sc.union(smallRddList).distinct(100).cache
    println(smallMerged.count)

    val s2Rst = bigRdd.fullOuterJoin(smallMerged).flatMap({ case (key, (left, right)) =>
      if (left.isDefined && right.isDefined) Array((key, left.get), (key, right.get)).distinct
      else if (left.isDefined) Array((key, left.get))
      else if (right.isDefined) Array((key, right.get))
      else throw new Exception("Cannot happen")
    })
    println(s2Rst.count)

    val end = System.currentTimeMillis
    val timeCost = (end - begin) / 1000d
    println("S2 time count: %.1f s".format(timeCost))
  }
}
Output
1000000
(688282474,0)
(-255073127,0)
(872746474,0)
(-792516900,0)
(417252803,0)
(-1514224305,0)
(1586932811,0)
(1400718248,0)
(939155130,0)
(1475156418,0)
100000
100000
100000
100000
1399777
S1 time count: 39.7 s
399984
1399894
S2 time count: 49.8 s
Was my understanding of shuffled RDDs wrong? Can anybody give some advice?
Thanks!
I found a method to merge RDDs more efficiently; see the following two merging strategies:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{HashPartitioner, SparkContext, SparkConf}

import scala.collection.mutable.ArrayBuffer

object MergeStrategy extends App {

  Logger.getLogger("org").setLevel(Level.ERROR)
  Logger.getLogger("akka").setLevel(Level.ERROR)

  val conf = new SparkConf().setMaster("local[4]").setAppName("test")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val rddCount = 20
  val mergeCount = 5
  val dataSize = 20000
  val parts = 50

  // generate data
  scala.util.Random.setSeed(943343)
  val testData = for (i <- 0 until rddCount)
    yield sc.parallelize(scala.util.Random.shuffle((0 until dataSize).toList).map(x => (x, 0)))
      .partitionBy(new HashPartitioner(parts))
      .cache
  testData.foreach(x => println(x.count))

  // strategy 1: merge, then re-partition the result with a HashPartitioner
  {
    val buff = ArrayBuffer[RDD[(Int, Int)]]()
    val begin = System.currentTimeMillis
    for (i <- 0 until rddCount) {
      buff += testData(i)
      if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
        val merged = sc.union(buff).distinct
          .partitionBy(new HashPartitioner(parts)).cache
        println(merged.count)
        buff.foreach(_.unpersist(false))
        buff.clear
        buff += merged
      }
    }
    val end = System.currentTimeMillis
    val timeCost = (end - begin) / 1000d
    println("Strategy 1 Time Cost: %.1f".format(timeCost))
    assert(buff.size == 1)
    println("Strategy 1 Complete, with merged Count %s".format(buff(0).count))
  }

  // strategy 2: merge without re-partitioning the result
  {
    val buff = ArrayBuffer[RDD[(Int, Int)]]()
    val begin = System.currentTimeMillis
    for (i <- 0 until rddCount) {
      buff += testData(i)
      if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
        val merged = sc.union(buff).distinct(parts).cache
        println(merged.count)
        buff.foreach(_.unpersist(false))
        buff.clear
        buff += merged
      }
    }
    val end = System.currentTimeMillis
    val timeCost = (end - begin) / 1000d
    println("Strategy 2 Time Cost: %.1f".format(timeCost))
    assert(buff.size == 1)
    println("Strategy 2 Complete, with merged Count %s".format(buff(0).count))
  }
}
The result shows that strategy 1 (time cost 20.8 seconds) is more efficient than strategy 2 (time cost 34.3 seconds). My PC runs Windows 8 with a 4-core 2.0 GHz CPU and 8 GB of memory.
The only difference is that strategy 1 partitions the merged result with a HashPartitioner while strategy 2 does not. As a result, strategy 1 produces a ShuffledRDD but strategy 2 a MapPartitionsRDD. I think RDD.distinct processes a ShuffledRDD more efficiently than a MapPartitionsRDD.
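
To see that difference, you can inspect RDD.partitioner directly; a small sketch (not part of the original answer, reusing sc and parts from the code above):

import org.apache.spark.HashPartitioner

val withPartitioner    = sc.parallelize((0 until 1000).map(x => (x, 0))).partitionBy(new HashPartitioner(parts))
val withoutPartitioner = sc.parallelize((0 until 1000).map(x => (x, 0)))

println(withPartitioner.partitioner.isDefined)    // true: the RDD carries a HashPartitioner
println(withoutPartitioner.partitioner.isDefined) // false: no partitioner, so later wide operations must shuffle fully

// Strategy 1 calls partitionBy after every merge round precisely to keep the
// merged RDD in this "partitioner defined" state before the next union and distinct.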

Writing an efficient aggregation function for Spark SQL

I wrote an aggregation function which returns a range-encoded representation of a column of Long data. I ran it on a 1 GB Parquet file with 50 columns. My cluster has 55 executors with 4 cores per node. The run time is around 5 minutes even after caching the DataFrame. Is there any way to run this query more efficiently?
Here is the UDAF -
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

import scala.collection.mutable.{ArrayBuffer, WrappedArray}

class Concat extends UserDefinedAggregateFunction {

  def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(StructField("value", LongType) :: Nil)

  def bufferSchema: StructType = StructType(
    StructField("concatenation", ArrayType(LongType, false)) :: Nil
  )

  def dataType: DataType = ArrayType(LongType, false)

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, new ArrayBuffer[Long]())
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val l = buffer.getAs[ArrayBuffer[Long]](0).toBuffer.asInstanceOf[ArrayBuffer[Long]]
    val v = input.getAs[Long](0)
    val n = l.size
    if (n < 2) {
      l += v
      l += 0L
    } else {
      val x1 = l(n - 2)
      val x2 = l(n - 1)
      if (x1 - 1 == v) {
        l(n - 2) = v
        l(n - 1) = x2 + 1
      } else if (x1 + x2 + 1 == v) {
        l(n - 1) = x2 + 1
      } else {
        l += v
        l += 0L
      }
    }
    buffer.update(0, l)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getAs[WrappedArray[Long]](0)
    val b = buffer2.getAs[WrappedArray[Long]](0)
    buffer1.update(0, a ++ b)
  }

  def evaluate(buffer: Row): Any = {
    buffer.getSeq(0)
  }
}
Here is how I am running the query -
val concat = new Concat
sqlContext.udf.register("lcon", concat)

val df = sqlContext.read.parquet("file_url")
df.cache
df.registerTempTable("agg11")

val results = sqlContext.sql("SELECT lcon(Id) FROM agg11 WHERE Status IN (1) AND Device IN (1,4) AND Medium IN (1)").collect
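
Expressed with the DataFrame API instead of the SQL string (a sketch only, not from the original post; concat here is the same UDAF instance created above):

import org.apache.spark.sql.functions.col

// Same filters and aggregation as the SQL version, calling the UDAF instance directly.
val results = df
  .filter(col("Status") === 1 && col("Device").isin(1, 4) && col("Medium") === 1)
  .agg(concat(col("Id")).alias("encodedIds"))
  .collect()

This produces essentially the same plan as the SQL version, so it is a readability change rather than a performance fix.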

How to process a subset of input records in a batch, i.e. the first second in 3-sec batch time?

If I set Seconds(1) for the batch time in StreamingContext, like this:
val ssc = new StreamingContext(sc, Seconds(1))
In 3 seconds I will receive 3 seconds of data, but I only need the first second of data and can discard the next 2 seconds. So can I spend 3 seconds processing only the first second of data?
You can do this via updateStateByKey if you keep track of a counter, for example like below:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamEveryThirdApp {

  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "Streaming Test")
    implicit val ssc = new StreamingContext(sc, Seconds(1))
    ssc.checkpoint("./checkpoint")

    // generate stream
    val inputDStream = createConstantStream

    // increase seconds counter
    val accStream = inputDStream.updateStateByKey(updateState)

    // keep only 1st-second records
    val firstOfThree = accStream.filter { case (key, (value, counter)) => counter == 1 }
    firstOfThree.print()

    ssc.start()
    ssc.awaitTermination()
  }

  def updateState: (Seq[Int], Option[(Option[Int], Int)]) => Option[(Option[Int], Int)] = {
    case (values, state) =>
      state match {
        // If there is no previous state, this is the first second
        case None => Some((Some(values.sum), 1))
        // If this is the 3rd second - remove the state
        case Some((prevValue, 3)) => None
        // If this is not the first second - increase the counter, but don't calculate values
        case Some((prevValue, counter)) => Some((None, counter + 1))
      }
  }

  def createConstantStream(implicit ssc: StreamingContext): ConstantInputDStream[(String, Int)] = {
    val seq = Seq(
      ("key1", 1),
      ("key2", 3),
      ("key1", 2),
      ("key1", 2)
    )
    val rdd = ssc.sparkContext.parallelize(seq)
    val inputDStream = new ConstantInputDStream(ssc, rdd)
    inputDStream
  }
}
If you have time information within your data, you could also use a 3-second window, stream.window(Seconds(3), Seconds(3)), and filter records by the time information in the data; quite often this is the preferred approach.
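
For example, a sketch of that window-based variant (not from the original answer; it assumes each record carries its own event timestamp in epoch milliseconds as the second tuple element):

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def keepFirstSecondOfWindow(events: DStream[(String, Long)]): DStream[(String, Long)] = {
  events
    .window(Seconds(3), Seconds(3))   // one non-overlapping 3-second window every 3 seconds
    .transform { rdd =>
      if (rdd.isEmpty()) rdd
      else {
        val windowStart = rdd.map(_._2).min()                    // earliest event time in the window
        rdd.filter { case (_, ts) => ts < windowStart + 1000L }   // keep only the first second of data
      }
    }
}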
