Spark - How to word count without RDD - apache-spark

It looks like the RDD API is to be removed from Spark.
Announcement: DataFrame-based API is primary API
The RDD-based API is expected to be removed in Spark 3.0.
How, then, do you implement programs like word count in Spark?

The data you manipulate as tuples with the RDD API can be thought of, and manipulated, as columns/fields in a SQL-like manner with the DataFrame API.
import org.apache.spark.sql.functions.{col, explode, split}

df.withColumn("word", explode(split(col("lines"), " ")))
  .groupBy("word")
  .count()
  .orderBy(col("count").desc)
  .show()
+---------+-----+
| word|count|
+---------+-----+
| foo| 5|
| bar| 2|
| toto| 1|
...
+---------+-----+
Notes:
This code snippet requires the imports from org.apache.spark.sql.functions shown above.
Relevant examples can be found in this question's answers.
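For reference, here is a minimal sketch of how the df used above could be built; the input path is hypothetical, and spark.read.text produces a single string column named "value", which is renamed to "lines" to match the snippet.

// Hypothetical input path; spark.read.text yields one string column, "value".
val df = spark.read.text("path/to/input.txt")
  .withColumnRenamed("value", "lines")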

Related

Spark starting job with all workers, but at some point using only 1 executor in a single worker when doing count()

I have a DataFrame which is partitioned and relatively small.
I am trying to do a simple count().
At the start all workers and executors participate in the task, but at some point in the job only 1 worker and 1 core are working, even though the data is distributed among the workers in a balanced way.
I have tried coalescing to 1 and also repartitioning to 2 * the number of cores, still with no effect: no matter what kind of action I run on this DataFrame, it always starts with all workers and ends up working on only a single one.
I'd appreciate it if anyone has any idea what could be wrong.
Information on the DF:
Total Count:
13065
Partitions:
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 9| 5557|
| 10| 62|
| 11| 167|
| 0| 128|
| 1| 83|
| 2| 110|
| 3| 129|
| 4| 131|
| 5| 78|
| 6| 6429|
| 7| 39|
| 8| 152|
+--------------------+-----+
Screenshots from the application master were attached, showing the DAG, the event timeline, and the task table.
A task always takes place on a single executor: it never gets chopped up into pieces and distributed since it is the most atomic bit of work a Spark executor does.
Looking at the screenshot with the table of tasks, we see 12 tasks (are these all the tasks in your stage, or are there more?): most of them take <1s, but one takes 4.6 min.
That is an interesting observation: if one task takes much longer than all the others, it makes sense that you end up with a single executor calculating alone at the end.
So your problem is really a data-skew question: why does this single task take so much longer than the others? There is not enough information in your question to answer that, but I would start by looking at the following:
As you can see in the screenshot of your stage, the stage reads from a ShuffledRowRDD, which means it is partitioned by something. What is it partitioned by? How many shuffle partitions does it read? What is the value of your spark.sql.shuffle.partitions configuration? (See the diagnostic sketch below.)
In the task table you can see a GC time of 1s for that task (far larger than for the other tasks). Are you counting large objects? Could the sizes of your objects differ dramatically?
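As a rough diagnostic sketch (not from the original answer), you could check the shuffle-partition setting and count the rows per partition of the DataFrame, here assumed to be called df, to confirm where the skew sits:

import org.apache.spark.sql.functions.spark_partition_id

// Current shuffle-partition setting
println(spark.conf.get("spark.sql.shuffle.partitions"))

// Row count per physical partition of df
df.groupBy(spark_partition_id().as("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()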
Hope this helps you in figuring this one out!

Spark partitionBy | save by column value rather than columnName={value}

I am using Scala and Spark; my Spark version is 2.4.3.
My DataFrame looks like this; there are other columns which I have not included because they are not relevant.
+-----------+---------+---------+
|ts_utc_yyyy|ts_utc_MM|ts_utc_dd|
+-----------+---------+---------+
|       2019|       01|       20|
|       2019|       01|       13|
|       2019|       01|       12|
|       2019|       01|       19|
|       2019|       01|       19|
+-----------+---------+---------+
Basically I want to store the data in a bucketed format like
2019/01/12/data
2019/01/13/data
2019/01/19/data
2019/01/20/data
I am using the following code snippet:
df.write
  .partitionBy("ts_utc_yyyy", "ts_utc_MM", "ts_utc_dd")
  .format("csv")
  .save(outputPath)
But the problem is that it is getting stored along with the column names, like below.
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=12/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=13/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=19/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=20/data
How do I save without the column name in the folder name?
Thanks.
This is the expected behaviour. Spark uses Hive partitioning so it writes using this convention, which enables partition discovery, filtering and pruning. In short, it optimises your queries by ensuring that the minimum amount of data is read.
Spark isn't really designed for the output you need. The easiest way for you to solve this is to have a downstream task that will simply rename the directories by splitting on the equals sign.
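Here is a rough sketch of such a post-processing step, assuming the output lives on a Hadoop-compatible filesystem; the helper name is illustrative and not part of the original answer.

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Recursively rename "column=value" directories to just "value".
def stripColumnNames(dir: Path): Unit =
  fs.listStatus(dir).filter(_.isDirectory).foreach { status =>
    val renamed = new Path(dir, status.getPath.getName.split("=").last)
    fs.rename(status.getPath, renamed)
    stripColumnNames(renamed)
  }

stripColumnNames(new Path(outputPath))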

Spark SQL HAVING clause without Group/Aggregate

I am wondering how the HAVING clause works in Spark SQL without GROUP BY or any aggregate function.
1) Can we rely on HAVING without an aggregate function?
2) Is there any other way to filter the columns that are generated at that select level?
I have tried executing the Spark SQL below and it works fine, but can we rely on this?
spark.sql("""
select 1 as a having a=1
""").show()
spark.sql("""
select 1 as a having a=2
""").show()
+---+
| a|
+---+
| 1|
+---+
+---+
| a|
+---+
+---+
In some databases/engines, when GROUP BY is not used in conjunction with HAVING, HAVING is treated as a WHERE clause.
Normally the WHERE clause is used for this.
I would not rely on HAVING without a GROUP BY.
To answer 1) Can we rely on HAVING without an aggregate function?
No, you cannot rely on this behavior. Treating HAVING without GROUP BY as a WHERE clause is no longer the default behavior as of Spark 2.4. If you really want to use HAVING like a WHERE clause, you can get the old behavior back by setting the configuration spark.sql.legacy.parser.havingWithoutGroupByAsWhere to true.
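As a minimal sketch, the portable alternative is to filter a subquery with WHERE instead of relying on HAVING without GROUP BY; the legacy flag below is the one mentioned above, and depending on your deployment it may need to be set at session startup rather than at runtime.

// Portable: filter the derived column with WHERE on a subquery.
spark.sql("""
  select a from (select 1 as a) t where a = 1
""").show()

// Stopgap: restore the pre-2.4 behavior of HAVING without GROUP BY.
spark.conf.set("spark.sql.legacy.parser.havingWithoutGroupByAsWhere", "true")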

How to sort within partitions (and avoid sort across the partitions) using RDD API?

It is the default behavior of the Hadoop MapReduce shuffle to sort the shuffle keys within each partition, but not across partitions (it is total ordering that makes keys sorted across partitions).
I would like to ask how to achieve the same thing with the Spark RDD API (sort within each partition, but do not sort across partitions).
RDD's sortByKey method does total ordering.
RDD's repartitionAndSortWithinPartitions sorts within partitions but not across them; unfortunately, it adds an extra step to do a repartition.
Is there a direct way to sort within partitions but not across them?
You can use the Dataset sortWithinPartitions method:
import spark.implicits._

sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show
+----+
|text|
+----+
| d|
| e|
| f|
| a|
| b|
| c|
+----+
In general, the shuffle machinery is an important factor in sorting within partitions, because it reuses shuffle structures to sort without loading all the data into memory at once.
I've never had this need before, but my first guess would be to use any of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition.
Since they give you a Scala Iterator, you could call .toSeq on it and then apply any of the sorting methods of Seq, e.g. sortBy, sortWith, or sorted.
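A minimal RDD-based sketch of that suggestion, assuming the same sample data as above: mapPartitions sorts each partition locally and keeps the existing partitioning, at the cost of materializing each partition in memory as a Seq.

val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

// Sort each partition locally; preservesPartitioning keeps any existing partitioner.
val sortedWithin = rdd.mapPartitions(
  iter => iter.toSeq.sorted.iterator,
  preservesPartitioning = true)

// Print each partition's contents to verify the per-partition order.
sortedWithin.glom().collect().foreach(p => println(p.mkString(", ")))
// expected: d, e, f
//           a, b, c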

Describe on Dataframe is not displaying the complete resultset

I am using Spark 1.6. Calling describe on a DataFrame is not displaying the column headers and the values. Please see below:
val data = sc.textFile("/tmp/sample.txt")
data.toDF.describe().show
This gives the result below; please let me know why it is not displaying the entire result set.
+-------+
|summary|
+-------+
| count|
| mean|
| stddev|
| min|
| max|
+-------+
I think you just need to use the show method.
sc.textFile("/tmp/sample.txt").toDF.show
As far as displaying the complete dataset goes, be careful, as you will need to collect the results on the driver in order to do this. You may want to consider using take instead if the file is large.
val data = sc.textFile("/tmp/sample.txt").toDF
data.collect.foreach(println)
or
data.take(100).foreach(println)
This is because Spark 1.6 treats every field as String by default, and it does not provide summary statistics for the String type. In Spark 2.1, however, the columns are correctly inferred as their respective data types (Int/String/Double etc.), and the summary statistics include all the columns in the file rather than being restricted to numerical fields.
I feel df.describe() works more elegantly in Spark 2.1 than in Spark 1.6.
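A minimal sketch of a workaround on Spark 1.6, assuming /tmp/sample.txt contains one numeric value per line: cast the string column to a numeric type before calling describe, so the summary statistics are actually computed.

import org.apache.spark.sql.functions.col

// toDF gives a single string column; cast it to double so describe() can
// compute count/mean/stddev/min/max for it.
sc.textFile("/tmp/sample.txt")
  .toDF("value")
  .select(col("value").cast("double").as("value"))
  .describe()
  .show()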
