How can Spark write larger files without additional data? - apache-spark

I use Spark on EMR to process data and write it to S3. The data is partitioned by date. When we re-process data for the same date, I use a custom function that compares the dataframe currently being processed with the data already in S3. Both datasets are merged so that no data is lost.
My issue is that between the first write and the second write of the same data, the total size of the data is different in S3.
The first write results in 200 files of variable sizes (20-100KB) for a total of 74MB. The second write results in 200 files of fixed sizes (about 430KB each) for a total of 84MB.
I compared the data from the two writes by loading them back into dataframes: the row counts are similar and the data is the same (I checked with df1.exceptAll(df2)).
Why is there a difference in file sizing between first and second writes?
Where could this additional 10MB come from?
I do not use repartition or coalesce anywhere.
Thanks in advance.

Maybe for some reason there are duplicates in the second dataframe and your validation doesn't handle that scenario. Since exceptAll is not symmetric, you need to run the same check with the dataframes swapped.
Sample:
import spark.implicits._
val df1 = Seq(
(1,2,3),
(4,5,6)
).toDF("col_a", "col_b", "col_c")
val df2 = Seq(
(1,2,3),
(4,5,6),
(4,5,6)
).toDF("col_a", "col_b", "col_c")
df1.show()
df2.show()
// output:
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
| 1| 2| 3|
| 4| 5| 6|
+-----+-----+-----+
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
| 1| 2| 3|
| 4| 5| 6|
| 4| 5| 6|
+-----+-----+-----+
exceptAll validations:
df1.exceptAll(df2).show()
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
+-----+-----+-----+
df2.exceptAll(df1).show()
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
| 4| 5| 6|
+-----+-----+-----+
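If duplicates are the cause, you can also look for them directly by grouping on all columns and keeping only the combinations that occur more than once. A minimal PySpark sketch (the sample above is Scala), assuming df2 is the data read back from the second write:
from pyspark.sql import functions as F
# group on every column and keep the combinations that appear more than once
df2.groupBy(df2.columns).count().filter(F.col("count") > 1).show()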

Related

How to get 1000 records from dataframe and write into a file using PySpark?

I have 100,000+ records in a dataframe. I want to create files dynamically and push 1000 records into each file. Can anyone help me solve this? Thanks in advance.
You can use the maxRecordsPerFile option while writing the dataframe.
If you want the whole dataframe written with 1000 records per file, use repartition(1); to write 1000 records per file within each existing partition, use coalesce(1).
Example:
# 1000 records written per file in each partition
df.coalesce(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)
# 1000 records written per file for the whole dataframe; 100 files created for 100,000 records
df.repartition(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)
# or set the config on the Spark session
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
# or
spark.sql("set spark.sql.files.maxRecordsPerFile=1000").show()
df.coalesce(1).write.mode("overwrite").parquet(<path>)
df.repartition(1).write.mode("overwrite").parquet(<path>)
Method-2:
Calculate the number of partitions, then repartition the dataframe:
df = spark.range(10000)
# calculate the number of partitions needed for ~1000 records each
no_partitions = int(df.count() / 1000)
from pyspark.sql.functions import *
# repartition and check the number of records in each partition
df.repartition(no_partitions).\
    withColumn("partition_id", spark_partition_id()).\
    groupBy(col("partition_id")).\
    agg(count("*")).\
    show()
#+-----------+--------+
#|partition_id|count(1)|
#+-----------+--------+
#| 1| 1001|
#| 6| 1000|
#| 3| 999|
#| 5| 1000|
#| 9| 1000|
#| 4| 999|
#| 8| 1000|
#| 7| 1000|
#| 2| 1001|
#| 0| 1000|
#+-----------+--------+
df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)
First, create a row number column:
from pyspark.sql import Window, functions as F
df = df.withColumn('row_num', F.row_number().over(Window.orderBy('any_column')))
Now, run a loop and keep saving the records.
for i in range(0, df.count(), 1000):
    # row_number() starts at 1, so shift the range by one
    records = df.where(F.col("row_num").between(i + 1, i + 1000))
    records.toPandas().to_csv("file-{}.csv".format(i))

Compare two large dataframes using pyspark

I am currently working on a data migration assignment, trying to compare two dataframes from two different databases using pyspark, find the differences between them, and record the results in a csv file as part of data validation. I am looking for a performance-efficient solution for two reasons: the dataframes are large, and the table keys are unknown.
# Approach 1 - Not sure about the performance and it is case-sensitive
df1.subtract(df2)
# Approach 2 - Creating a row hash for each row in the dataframe
from pyspark.sql import Row
piperdd = df1.rdd.map(lambda x: hash(x))
r = Row("h_cd")
df1_new = piperdd.map(r).toDF()
The problem I am facing with approach 2 is that the final dataframe (df1_new) contains only the hash column (h_cd), but I need all the columns of dataframe1 (df1) along with the hash column (h_cd), since I need to report the row differences in a csv file. Please help.
Try it with dataframes instead of RDDs; it is more concise.
df1 = spark.createDataFrame([(a, a*2, a+3) for a in range(10)], "A B C".split(' '))
#df1.show()
from pyspark.sql.functions import hash
df1.withColumn('hash_value', hash('A','B', 'C')).show()
+---+---+---+-----------+
| A| B| C| hash_value|
+---+---+---+-----------+
| 0| 0| 3| 1074520899|
| 1| 2| 4|-2073566230|
| 2| 4| 5| 2060637564|
| 3| 6| 6|-1286214988|
| 4| 8| 7|-1485932991|
| 5| 10| 8| 2099126539|
| 6| 12| 9| -558961891|
| 7| 14| 10| 1692668950|
| 8| 16| 11| 708810699|
| 9| 18| 12| -11251958|
+---+---+---+-----------+
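To report the differences while keeping every original column, one option is to add the hash column to both dataframes and then anti-join on it. A minimal sketch, assuming df1 and df2 both have the columns A, B, C from the example above; the output path is just a placeholder:
from pyspark.sql.functions import hash
# add the same row hash to both dataframes, keeping all original columns
df1_hashed = df1.withColumn('hash_value', hash('A', 'B', 'C'))
df2_hashed = df2.withColumn('hash_value', hash('A', 'B', 'C'))
# rows of df1 whose hash has no match in df2; every column of df1 is retained
diff = df1_hashed.join(df2_hashed.select('hash_value'), on='hash_value', how='left_anti')
# write the differences out for the validation report (placeholder path)
diff.write.mode('overwrite').csv('/tmp/df_differences')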

SparkSQL - got duplicate rows after join & groupBy

I have 2 dataframes with columns as shown below.
Note: Column uid is not a unique key, and there are duplicate rows with the same uid in the dataframes.
val df1 = spark.read.parquet(args(0)).drop("sv")
val df2 = spark.read.parquet(args(1))
scala> df1.orderBy("uid").show
+----+----+---+
| uid| hid| sv|
+----+----+---+
|uid1|hid2| 10|
|uid1|hid1| 10|
|uid1|hid3| 10|
|uid2|hid1| 2|
|uid3|hid2| 10|
|uid4|hid2| 3|
|uid5|hid3| 5|
+----+----+---+
scala> df2.orderBy("uid").show
+----+----+---+
| uid| pid| sv|
+----+----+---+
|uid1|pid2| 2|
|uid1|pid1| 1|
|uid2|pid1| 2|
|uid3|pid1| 3|
|uid3|pidx|999|
|uid3|pid2| 4|
|uidx|pid1| 2|
+----+----+---+
scala> df1.drop("sv")
.join(df2, "uid")
.groupBy("hid", "pid")
.agg(count("*") as "xcnt", sum("sv") as "xsum", avg("sv") as "xavg")
.orderBy("hid").show
+----+----+----+----+-----+
| hid| pid|xcnt|xsum| xavg|
+----+----+----+----+-----+
|hid1|pid1| 2| 3| 1.5|
|hid1|pid2| 1| 2| 2.0|
|hid2|pid2| 2| 6| 3.0|
|hid2|pidx| 1| 999|999.0|
|hid2|pid1| 2| 4| 2.0|
|hid3|pid1| 1| 1| 1.0|
|hid3|pid2| 1| 2| 2.0|
+----+----+----+----+-----+
In this demo case, everything looks good.
But when I apply the same operations on the production large data, the final output contains many duplicate rows (of same (hid, pid) pair).
I thought the groupBy operator would behave like select distinct hid, pid from ..., but obviously it does not.
So what's wrong with my operation? Should I repartition the dataframe by hid, pid?
Thanks!
-- Update
And if I add .drop("uid") once I join the dataframes, then some rows are missed from the final output.
scala> df1.drop("sv")
.join(df2, "uid").drop("uid")
.groupBy("hid", "pid")
.agg(count("*") as "xcnt", sum("sv") as "xsum", avg("sv") as "xavg")
.orderBy("hid").show
To be honest, I think the problem is in the data, not the code. Of course there shouldn't be any duplicates if the pid and hid values are truly distinct; what looks like a duplicate may actually be a different string (I've seen rogue Cyrillic characters in data before that make visually identical values compare as different).
To debug the issue, you can look at which combinations of uid and sv values make up each duplicate row:
df1.drop( "sv" )
.join(df2, "uid")
.groupBy( "hid", "pid" )
.agg( collect_list( "uid" ), collect_list( "sv" ) )
.orderBy( "hid" )
.show
After that you'll have a starting point for assessing your data. Or, if the lists of uid (and sv) are the same for the duplicated rows, file a bug.
I think I might have found the root cause.
Maybe this is caused by the AWS S3 consistency model.
The background is that I submitted 2 Spark jobs to create the 2 tables, and a third job to join them (I split them so that if one fails I don't have to re-run the others).
I put these 3 spark-submit commands in a shell script that runs them in sequence, and got a result with duplicated rows.
When I re-ran the last job just now, the result looks good.

Why do I get a different result when I use partitionBy in a Window with Spark/Scala?

I use a window sum function to get the cumulative sum of a value from an RDD, but when I convert the resulting DataFrame back to an RDD, I find that it has only one partition. When does the repartitioning occur?
val rdd = sc.parallelize(List(1,3,2,4,5,6,7,8), 4)
val df = rdd.toDF("values").
withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
df.show()
println(s"numPartitions ${df.rdd.getNumPartitions}")
// 1
//df is:
// +------+----+
// |values|csum|
// +------+----+
// | 1| 1|
// | 2| 3|
// | 3| 6|
// | 4| 10|
// | 5| 15|
// | 6| 21|
// | 7| 28|
// | 8| 36|
// +------+----+
I added partitionBy to the Window, but the result is wrong. What should I do? This is my changed code:
val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
val sqlContext = new SQLContext(m_sparkCtx)
import sqlContext.implicits._
val df = rdd.toDF("values").withColumn("csum", sum(col("values")).over(Window.partitionBy("values").orderBy("values")))
df.show()
println(s"numPartitions ${df.rdd.getNumPartitions}")
//1
//df is:
// +------+----+
// |values|csum|
// +------+----+
// | 1| 1|
// | 6| 6|
// | 3| 3|
// | 5| 5|
// | 4| 4|
// | 8| 8|
// | 7| 7|
// | 2| 2|
// +------+----+
A window function has a partitionBy API for grouping the dataframe and an orderBy for ordering the grouped rows in ascending or descending order.
In your first case you hadn't defined partitionBy, so all the values were placed in a single group for ordering purposes, which shuffles the data into one partition.
In your second case partitionBy is defined on values itself. Since each value is distinct, every row ends up in its own group.
The number of partitions in the second case is 200, because that is Spark's default number of shuffle partitions when you haven't configured it and a shuffle occurs.
To get the same result from your second case as from the first, you need to group your dataframe as in the first case, i.e. into one group. For that you can create another column with a constant value and use it for partitionBy, as in the sketch below.
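For illustration, here is a minimal sketch of that constant-column trick in PySpark (the question uses Scala; df and the values 1-8 below are just stand-ins for the original data):
from pyspark.sql import Window
from pyspark.sql import functions as F
df = spark.range(1, 9).toDF("values")
# add a constant column so every row falls into the same window group,
# reproducing the single-group cumulative sum of the first case
w = Window.partitionBy("group").orderBy("values")
df.withColumn("group", F.lit(0)) \
  .withColumn("csum", F.sum("values").over(w)) \
  .drop("group") \
  .show()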
When you create a column as
withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
The Window.orderBy("values") is ordering the values of column "values" in single partition since you haven't defined partitionBy() method to define the partition.
This is changing the number of partition from initial 4 to 1.
The number of partitions is 200 in your second case because partitionBy() triggers a shuffle and the default number of shuffle partitions is 200. If you need 4 partitions, you can use methods like repartition(4) or coalesce(4).
Hope you got the point!

PySpark: Randomize rows in dataframe

I have a dataframe and I want to randomize rows in the dataframe. I tried sampling the data by giving a fraction of 1, which didn't work (interestingly this works in Pandas).
It works in Pandas because taking a sample in local systems is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes which rows are members of the sample, not their order.
You can order the DataFrame by a column of random numbers:
from pyspark.sql.functions import rand
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)
## +---+
## | x|
## +---+
## | 2|
## | 7|
## | 14|
## +---+
## only showing top 3 rows
but it is:
expensive - because it requires a full shuffle, which is something you typically want to avoid.
suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing it is relatively useless without collecting.
This code works for me without any RDD operations:
import pyspark.sql.functions as F
df = df.select("*").orderBy(F.rand())
Here is a more elaborate example:
import pandas as pd
import pyspark.sql.functions as F
# create a small pandas DataFrame for the example
pandas_df = pd.DataFrame(([1,2],[3,1],[4,2],[7,2],[32,7],[123,3]), columns=["id","col1"])
df = sqlContext.createDataFrame(pandas_df)
df = df.select("*").orderBy(F.rand())
df.show()
+---+----+
| id|col1|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 7| 2|
| 32| 7|
|123| 3|
+---+----+
df.select("*").orderBy(F.rand()).show()
+---+----+
| id|col1|
+---+----+
| 7| 2|
|123| 3|
| 3| 1|
| 4| 2|
| 32| 7|
| 1| 2|
+---+----+
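One small usage note: rand() accepts an optional seed, so if you need the shuffled order to be reproducible across runs you can pass one (the value 42 below is arbitrary):
# reproducible shuffle: the same seed gives the same ordering for the same data and partitioning
df.orderBy(F.rand(seed=42)).show()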
