Split large dataframe into small ones Spark - apache-spark

I have a DF with 200 million rows. I can't group this DF, and I have to split it into 8 smaller DFs (approx 30 million rows each). I've tried this approach but with no success. Without caching the DF, the counts of the split DFs do not match the larger DF. If I use cache, I run out of disk space (my config is 64 GB RAM and a 512 GB SSD).
Considering this, I thought about the following approach:
Load the entire DF
Assign one of 8 random numbers to each row of the DF
Distribute the random numbers evenly across the DF
Consider the following DF as example:
+------+--------+
| val1 | val2 |
+------+--------+
|Paul | 1.5 |
|Bostap| 1 |
|Anna | 3 |
|Louis | 4 |
|Jack | 2.5 |
|Rick | 0 |
|Grimes| null|
|Harv | 2 |
|Johnny| 2 |
|John | 1 |
|Neo | 5 |
|Billy | null|
|James | 2.5 |
|Euler | null|
+------+--------+
The DF has 14 rows. I thought of using a window function to create the following DF:
+------+--------+----+
| val1 | val2 | sep|
+------+--------+----+
|Paul | 1.5 |1 |
|Bostap| 1 |1 |
|Anna | 3 |1 |
|Louis | 4 |1 |
|Jack | 2.5 |1 |
|Rick | 0 |1 |
|Grimes| null|1 |
|Harv | 2 |2 |
|Johnny| 2 |2 |
|John | 1 |2 |
|Neo | 5 |2 |
|Billy | null|2 |
|James | 2.5 |2 |
|Euler | null|2 |
+------+--------+----+
Considering the last DF, I will filter by sep. My question is: how can I use a window function to generate the sep column of the last DF?

Since you are randomly splitting the dataframe into 8 parts, you could use randomSplit():
split_weights = [1.0] * 8
splits = df.randomSplit(split_weights)
for df_split in splits:
    # do what you want with the smaller df_split
    ...
Note that this will not ensure the same number of records in each df_split. There may be some fluctuation, but with 200 million records it will be negligible.

If you want to process each split and store it to files, you can use the loop counter in the file names to avoid getting them mixed up:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('parquet-files')
split_w = [1.0] * 5
splits = df.randomSplit(split_w)
for count, df_split in enumerate(splits, start=1):
    df_split.write.parquet(f'split-files/split-file-{count}', mode='overwrite')
The file sizes will be roughly the same, with some slight differences.
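As for the window-based sep column the question describes, here is a minimal sketch using ntile() (it assumes df is the original dataframe and that val1 provides a stable ordering; note that a window with no partitionBy pulls all rows into a single partition, which can be slow at 200 million rows, so randomSplit() is usually preferable):
from pyspark.sql import Window, functions as F

# assign each row to one of 8 roughly equal, contiguous buckets
w = Window.orderBy("val1")
df_sep = df.withColumn("sep", F.ntile(8).over(w))

# then filter by sep to get each smaller DF
df_part_1 = df_sep.filter(F.col("sep") == 1)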

Related

PySpark: Timeslice and split rows in dataframe with 5 minutes interval on a specific condition

I have a dataframe with the following columns:
+-----+----------+--------------------------+-----------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-----------+
| 0 | 128 | 2019-12-03 12:00:00.0 | 0 |
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 2 | 128 | 2019-12-03 12:37:00.0 | 0 |
| 3 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:17:00.0 | 0 |
+-----+----------+--------------------------+-----------+
I am trying to split the timestamp column into rows at 5-minute intervals for indicator values which are not 0.
Explanation:
The first entry is at timestamp = 2019-12-03 12:00:00.0 with indicator = 0, so do nothing.
Moving on to the next entry, timestamp = 2019-12-03 12:30:00.0 with indicator = 1, I want to split the timestamp into rows at 5-minute intervals until we reach the next entry, which is timestamp = 2019-12-03 12:37:00.0 with indicator = 0.
If there is a case where timestamp = 2019-12-03 13:15:00.0 has indicator = 1 and the next timestamp = 2019-12-03 13:17:00.0 has indicator = 0, I'd like to split the row treating both times as having indicator = 1, since 13:17:00.0 falls between 13:15:00.0 and 13:20:00.0, as shown below.
How can I achieve this with PySpark?
Expected Output:
+-----+----------+--------------------------+-------------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-------------+
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 1 | 128 | 2019-12-03 12:35:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:20:00.0 | 1 |
+-----+----------+--------------------------+-------------+
IIUC, you can filter rows based on the indicators of the current and the next rows, and then use array + explode to create the new rows (for testing purposes, I added some more rows to your original example):
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('sourceid').orderBy('timestamp')
# add a flag to check if the next indicator is '0'
df1 = df.withColumn('next_indicator_is_0', F.lead('indicator').over(w1) == 0)
df1.show(truncate=False)
+---+--------+---------------------+---------+-------------------+
|id |sourceid|timestamp |indicator|next_indicator_is_0|
+---+--------+---------------------+---------+-------------------+
|0 |128 |2019-12-03 12:00:00.0|0 |false |
|1 |128 |2019-12-03 12:30:00.0|1 |true |
|2 |128 |2019-12-03 12:37:00.0|0 |false |
|3 |128 |2019-12-03 13:12:00.0|1 |false |
|4 |128 |2019-12-03 13:15:00.0|1 |true |
|5 |128 |2019-12-03 13:17:00.0|0 |false |
|6 |128 |2019-12-03 13:20:00.0|1 |null |
+---+--------+---------------------+---------+-------------------+
df1.filter("indicator = 1 AND next_indicator_is_0") \
.withColumn('timestamp', F.expr("explode(array(`timestamp`, `timestamp` + interval 5 minutes))")) \
.drop('next_indicator_is_0') \
.show(truncate=False)
+---+--------+---------------------+---------+
|id |sourceid|timestamp |indicator|
+---+--------+---------------------+---------+
|1 |128 |2019-12-03 12:30:00.0|1 |
|1 |128 |2019-12-03 12:35:00 |1 |
|4 |128 |2019-12-03 13:15:00.0|1 |
|4 |128 |2019-12-03 13:20:00 |1 |
+---+--------+---------------------+---------+
Note: you can reset the id column by using F.row_number().over(w1) or F.monotonically_increasing_id(), based on your requirements.
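For example, a minimal sketch of resetting id, assuming the exploded result above has been assigned to a dataframe named result (a hypothetical name):
# sequential ids per sourceid, reusing the same window spec w1
result = result.withColumn('id', F.row_number().over(w1))
# or, if the ids only need to be unique rather than sequential:
# result = result.withColumn('id', F.monotonically_increasing_id())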

How to combine and sort different dataframes into one?

Given two dataframes which may have completely different schemas, except for an index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to Spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so that when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each of them, ordered globally by timestamp?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of the same name and type that represent completely different data, so merging the schemas is not an option.
What you need is a full outer join, possibly renaming one of the columns; something like df1.join(df2.withColumnRenamed("length", "length2"), Seq("timestamp"), "full_outer").
See this example, built from yours (just less typing):
// data shaped as your example (requires import spark.implicits._ for toDF)
case class t1(ts: Int, width: Int, l: Int)
case class t2(ts: Int, name: String, l: Int)
// create data frames
val df1 = Seq(t1(1, 10, 20), t1(3, 5, 3)).toDF
val df2 = Seq(t2(0, "sample", 3), t2(2, "test", 6)).toDF
df1.join(df2.withColumnRenamed("l", "l2"), Seq("ts"), "full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+
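For completeness, a hypothetical PySpark equivalent of the same full outer join (it assumes df1 and df2 are loaded with the columns shown in the question):
df3 = (df1.withColumnRenamed("length", "df1_length")
          .join(df2.withColumnRenamed("length", "df2_length"),
                on="timestamp", how="full_outer")
          .orderBy("timestamp"))
df3.show()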

How to operate global variable in Spark SQL dataframe row by row sequentially on Spark cluster?

I have a dataset which looks like this:
+-------+------+-------+
|groupid|rownum|column2|
+-------+------+-------+
| 1 | 1 | 7 |
| 1 | 2 | 9 |
| 1 | 3 | 8 |
| 1 | 4 | 5 |
| 1 | 5 | 1 |
| 1 | 6 | 0 |
| 1 | 7 | 15 |
| 1 | 8 | 1 |
| 1 | 9 | 13 |
| 1 | 10 | 20 |
| 2 | 1 | 8 |
| 2 | 2 | 1 |
| 2 | 3 | 4 |
| 2 | 4 | 2 |
| 2 | 5 | 19 |
| 2 | 6 | 11 |
| 2 | 7 | 5 |
| 2 | 8 | 6 |
| 2 | 9 | 15 |
| 2 | 10 | 8 |
...and more rows follow.
I want to add a new column, "column3": if consecutive column2 values are less than 10, they are all assigned the same number, for example 1. If a value larger than 10 appears in column2, that row is dropped, and the column3 value of the following rows is increased by 1. For example, when groupid = 1, the column3 value from rownum 1 to 6 will be 1 and rownum 7 will be dropped; the column3 value of rownum 8 will be 2, and rownums 9 and 10 will be dropped. After this procedure, the table will look like this:
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
| 1 | 1 | 7 | 1 |
| 1 | 2 | 9 | 1 |
| 1 | 3 | 8 | 1 |
| 1 | 4 | 5 | 1 |
| 1 | 5 | 1 | 1 |
| 1 | 6 | 0 | 1 |
| 1 | 7 | 15 | drop | this row will be dropped, in fact not exist
| 1 | 8 | 1 | 2 |
| 1 | 9 | 13 | drop | like above
| 1 | 10 | 20 | drop | like above
| 2 | 1 | 8 | 1 |
| 2 | 2 | 1 | 1 |
| 2 | 3 | 4 | 1 |
| 2 | 4 | 2 | 1 |
| 2 | 5 | 19 | drop | ...
| 2 | 6 | 11 | drop | ...
| 2 | 7 | 5 | 2 |
| 2 | 8 | 6 | 2 |
| 2 | 9 | 15 | drop | ...
| 2 | 10 | 8 | 3 |
In our project, the dataset is expressed as a dataframe in Spark SQL.
I tried to solve this problem with a UDF, like this:
var last_rowNum: Int = 1
var column3_Num: Int = 1

def assign_column3_Num(rowNum: Int): Int = {
  if (rowNum == 1) { // do nothing, just assign 1
    column3_Num = 1
    last_rowNum = 1
    return column3_Num
  }
  /** if the difference between rownums is 1, the rows get the same column3
    * value; if not, column3_Num is incremented so they are different
    */
  if (rowNum - last_rowNum == 1) {
    last_rowNum = rowNum
    return column3_Num
  } else {
    column3_Num += 1
    last_rowNum = rowNum
    return column3_Num
  }
}

spark.sqlContext.udf.register("assign_column3_Num", assign_column3_Num _)

df.filter("column2 < 10") // drop the larger rows
  .withColumn("column3", callUDF("assign_column3_Num", col("rownum"))) // add column3
As you can see, I use global variables. However, this only works in Spark local[1] mode. If I use local[8] or yarn-client, the result is totally wrong! This is because of Spark's execution model: the tasks operate on the global variables without respecting groupid or ordering.
So the question is: how can I assign the right number when Spark is running on a cluster?
Should I use a UDF, a UDAF, an RDD, or something else?
Thank you!
You can achieve your requirement by defining a udf function as below (comments are given for clarity):
import org.apache.spark.sql.functions._

def createNewCol = udf((rownum: collection.mutable.WrappedArray[Int], column2: collection.mutable.WrappedArray[Int]) => { // udf function
  var value = 1                                       // value for column3
  var previousValue = 0                               // value for checking condition
  var arrayBuffer = Array.empty[(Int, Int, Int)]      // initialization of array to be returned
  for ((a, b) <- rownum.zip(column2)) {               // zipping the collected lists and looping
    if (b > 10 && previousValue < 10)                 // checking condition for column3
      value = value + 1                               // adding 1 for column3
    arrayBuffer = arrayBuffer ++ Array((a, b, value)) // adding the values
    previousValue = b
  }
  arrayBuffer
})
Now use the algorithm defined in the udf function. To get the desired result, you need to collect the values of rownum and column2, grouping them by groupid and sorting them by rownum, and then call the udf function. The next steps are to explode and select the necessary columns (commented for clarity):
df.orderBy("rownum").groupBy("groupid").agg(collect_list("rownum").as("rownum"), collect_list("column2").as("column2")) //collecting in order for generating values for column3
.withColumn("new", createNewCol(col("rownum"), col("column2"))) //calling udf function and storing the array of struct(rownum, column2, column3) in new column
.drop("rownum", "column2") //droping unnecessary columns
.withColumn("new", explode(col("new"))) //exploding the new column array so that each row can have struct(rownum, column2, column3)
.select(col("groupid"), col("new._1").as("rownum"), col("new._2").as("column2"), col("new._3").as("column3")) //selecting as separate columns
.filter(col("column2") < 10) // filtering the rows with column2 greater than 10
.show(false)
You should get the desired output:
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
|1 |1 |7 |1 |
|1 |2 |9 |1 |
|1 |3 |8 |1 |
|1 |4 |5 |1 |
|1 |5 |1 |1 |
|1 |6 |0 |1 |
|1 |8 |1 |2 |
|2 |1 |8 |1 |
|2 |2 |1 |1 |
|2 |3 |4 |1 |
|2 |4 |2 |1 |
|2 |7 |5 |2 |
|2 |8 |6 |2 |
|2 |10 |8 |3 |
+-------+------+-------+-------+
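An alternative sketch that avoids collect_list by using only window functions (shown in PySpark for brevity; it assumes the input dataframe is named df and has the columns groupid, rownum and column2 as above):
from pyspark.sql import Window, functions as F

w = Window.partitionBy("groupid").orderBy("rownum")

result = (df
    # flag rows where a value greater than 10 starts a new "drop" run
    .withColumn("bump", ((F.col("column2") > 10) &
                         (F.coalesce(F.lag("column2").over(w), F.lit(0)) < 10)).cast("int"))
    # column3 = 1 + number of drop runs seen so far within the group
    .withColumn("column3", F.lit(1) + F.sum("bump").over(
        w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
    # drop the rows with large column2 values
    .filter(F.col("column2") < 10)
    .drop("bump"))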

Compute a value using multiple preceding rows

I have a DataFrame that contains events ordered by timestamp.
Certain events mark the beginning of a new epoch:
+------+-----------+
| Time | Type |
+------+-----------+
| 0 | New Epoch |
| 2 | Foo |
| 3 | Bar |
| 11 | New Epoch |
| 12 | Baz |
+------+-----------+
I would like to add a column with epoch number, that, for simplicity, can be equal to the timestamp of its beginning:
+------+-----------+-------+
| Time | Type | Epoch |
+------+-----------+-------+
| 0 | New Epoch | 0 |
| 2 | Foo | 0 |
| 3 | Bar | 0 |
| 11 | New Epoch | 11 |
| 12 | Baz | 11 |
+------+-----------+-------+
How can I achieve this?
The naive algorithm would be to write a function that goes backwards until it finds a row with $"Type" === "New Epoch" and takes its $"Time". If I knew the maximum number of events within an epoch, I could probably implement it by calling lag() that many times. But for the general case I have no ideas.
Below is my solution. Briefly, I create a dataframe that represents the epoch intervals and then join it with the original dataframe.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val ds = List((0, "New Epoch"), (2, "Fo"), (3, "Bar"), (11, "New Epoch"), (12, "Baz")).toDF("Time", "Type")
val epoch = ds.filter($"Type" === "New Epoch")
val spec = Window.orderBy("Time")
val epochInterval = epoch.withColumn("next_epoch", lead($"Time", 1).over(spec))//.show(false)
val result = ds.as("left").join(epochInterval.as("right"), $"left.Time" >= $"right.Time" && ($"left.Time" < $"right.next_epoch" || $"right.next_epoch".isNull))
.select($"left.Time", $"left.Type", $"right.Time".as("Epoch"))
result.show(false)
+----+---------+-----+
|Time|Type |Epoch|
+----+---------+-----+
|0 |New Epoch|0 |
|2 |Fo |0 |
|3 |Bar |0 |
|11 |New Epoch|11 |
|12 |Baz |11 |
+----+---------+-----+
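For reference, a window-only sketch of the same idea (PySpark syntax; it assumes ds is the dataframe from the question, and note that the unpartitioned window moves all rows into a single partition):
from pyspark.sql import Window, functions as F

w = Window.orderBy("Time").rowsBetween(Window.unboundedPreceding, Window.currentRow)
result = ds.withColumn(
    "Epoch",
    # carry forward the Time of the most recent 'New Epoch' row
    F.last(F.when(F.col("Type") == "New Epoch", F.col("Time")), ignorenulls=True).over(w))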

Using orderBy with dataframes in Spark (Python)

I am not sure what is going wrong here. I have a dataframe:
>DFexample.columns
>['url','weight1','weight2']
and I am trying to order it in descending order by weight2:
>DFexample.orderBy(DFexample.weight2.desc()).show(4)
+-----+--------+-------------------+
| url |weight1 | weight2           |
+-----+--------+-------------------+
| x   | 0      | 9.800000342342E-4 |
| x2  | 1      | 2.432432432       |
| x3  | 1.2    | 2.134234234       |
| x4  | 1.32   | 1.232324          |
+-----+--------+-------------------+
Everything seems to be ordered except for the first value. Why would this happen?
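One thing worth checking (an assumption, since the schema is not shown): if weight2 is a string column, orderBy sorts it lexicographically, which would put '9.800000342342E-4' first. A quick sketch to verify and work around that:
DFexample.printSchema()  # is weight2 a double or a string?

# if it is a string, cast it before ordering to get a numeric sort
from pyspark.sql import functions as F
DFexample.orderBy(F.col("weight2").cast("double").desc()).show(4)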
