Spark Chunk/split large array column into multiple dataframes

I have one outlier record that has an array of items. The array is between 900 MB and 980 MB:
+---+----------+----------------------------------------------------------------+
| id| version| items |
+---+----------+----------------------------------------------------------------+
| a| 1 |{"a":"100", "b":"200", ...}, {"a":"100", "b":"200", ...}, .....n
+---+----------+----------------------------------------------------------------+
I want to dynamically repartition the array to be less than 50mb:
file 1 ( 50 mb or less):
+---+----------+------------------------------------------------------+
| id| version| items |
+---+----------+------------------------------------------------------+
| a| 1 |{"a":"100", "b":"200", ...}, .....
+---+----------+------------------------------------------------------+
file 2 ( 50 mb or less):
+---+----------+------------------------------------------------------+
| id| version| items |
+---+----------+------------------------------------------------------+
| a| 1 |{"a":"100", "b":"200", ...}, ...
+---+----------+------------------------------------------------------+
file 3 ( 50 mb or less):
+---+----------+------------------------------------------------------+
| id| version| items |
+---+----------+------------------------------------------------------+
| a| 1 |{"a":"100", "b":"200", ...}, ...
+---+----------+------------------------------------------------------+
...etc
However, if a record's array is already less than 50 MB, the solution should leave it alone (or repartition it dynamically), so that it works equally well for very large and very small data sets.
I've tried dF.repartition(n), but dF.count() yields 1 record, so the first file is 900 MB and the other N files are 0 bytes.
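One way to approach this (not tested against your data) is to explode the array with its element position, bucket consecutive elements into fixed-size chunks, and re-assemble one smaller array per chunk. A minimal PySpark sketch, assuming items is an array column and that items_per_chunk and the input/output paths are placeholder values you would tune so each chunk stays under 50 MB:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical input: one wide record whose "items" column is a huge array
df = spark.read.parquet("input-path")

items_per_chunk = 5000  # assumption: tune this so each chunk stays under ~50 MB

chunked = (
    df
    # explode together with the element position so we can bucket deterministically
    .select("id", "version", F.posexplode("items").alias("pos", "item"))
    # bucket consecutive items into fixed-size chunks
    .withColumn("chunk", (F.col("pos") / items_per_chunk).cast("int"))
    # re-assemble one smaller array per chunk
    .groupBy("id", "version", "chunk")
    .agg(F.collect_list("item").alias("items"))
)

# one output directory per chunk; a record whose array already fits produces a single chunk
chunked.write.partitionBy("chunk").mode("overwrite").parquet("output-path")
Because small records simply land in a single chunk, the same sketch handles both the 900 MB outlier and ordinary rows.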

Related

Split large dataframe into small ones Spark

I have a DF that has 200 million lines. I can't group this DF and I have to split it into 8 smaller DFs (approximately 30 million lines each). I've tried this approach but with no success. Without caching the DF, the counts of the split DFs do not add up to the larger DF. If I use cache I run out of disk space (my config is 64 GB RAM and a 512 GB SSD).
Considering this, I thought about the following approach:
Load the entire DF
Give 8 random numbers to this DF
Distribute the random numbers evenly in the DF
Consider the following DF as an example:
+------+--------+
| val1 | val2 |
+------+--------+
|Paul | 1.5 |
|Bostap| 1 |
|Anna | 3 |
|Louis | 4 |
|Jack | 2.5 |
|Rick | 0 |
|Grimes| null|
|Harv | 2 |
|Johnny| 2 |
|John | 1 |
|Neo | 5 |
|Billy | null|
|James | 2.5 |
|Euler | null|
+------+--------+
The DF has 14 lines; I thought of using a window to create the following DF:
+------+--------+----+
| val1 | val2 | sep|
+------+--------+----+
|Paul | 1.5 |1 |
|Bostap| 1 |1 |
|Anna | 3 |1 |
|Louis | 4 |1 |
|Jack | 2.5 |1 |
|Rick | 0 |1 |
|Grimes| null|1 |
|Harv | 2 |2 |
|Johnny| 2 |2 |
|John | 1 |2 |
|Neo | 5 |2 |
|Billy | null|2 |
|James | 2.5 |2 |
|Euler | null|2 |
+------+--------+----+
Considering the last DF, I will then filter by sep. My question is: how can I use a window function to generate the sep column of the last DF?
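For the window-based sep column the question asks about, one possible sketch uses ntile over an ordered window (the source path and seed below are placeholders). Note that an unpartitioned window forces a single-partition sort, which is expensive at 200 million rows; the answer below suggests randomSplit, which avoids that cost:
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("parquet-files")  # placeholder source

# ntile(8) buckets the ordered rows into 8 roughly equal groups
w = Window.orderBy(F.rand(seed=42))
df_sep = df.withColumn("sep", F.ntile(8).over(w))

part_1 = df_sep.filter(F.col("sep") == 1).drop("sep")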
Since you are randomly splitting the dataframe into 8 parts, you could use randomSplit():
split_weights = [1.0] * 8
splits = df.randomSplit(split_weights)
for df_split in splits:
    # do what you want with the smaller df_split
    pass
Note that this will not ensure the same number of records in each df_split. There may be some fluctuation, but with 200 million records it will be negligible.
If you want to process the splits and store them to files, you can name them with a counter to avoid getting them mixed up:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('parquet-files')
split_w = [1.0] * 5
splits = df.randomSplit(split_w)
for count, df_split in enumerate(splits, start=1):
    df_split.write.parquet(f'split-files/split-file-{count}', mode='overwrite')
The files will be roughly the same size, with only slight differences.

cassandra select query to select rows when atleast one column value condition is true

I have a requirement where there are six amount columns. I want to run a select query that returns rows where at least one of the column values is > 0. I tried an OR condition, but that is not supported. Is there any other alternative? I tried selecting records by checking for > 0 values on each column separately, but this approach is a headache if we need to check many columns.
Example records:
|id| col1| col2| col3| col4| col5| col6|
-----------------------------------------
|1 | 2.0 | 0 | 0 | 0 | 0 | 0 |
|2 | 0 | 0 | 2.0 | 3.0 | 5.0 | 66 |
|3 | 0 | 0 | 0 | 0 | 0 | 0 |
|4 | null| null| null| null| null| null|
What I need is the records for ids 1 and 2, which have at least one column value > 0.
You cannot query Cassandra like that. To achieve such results you may need to rely on your application (for example, fetching the rows and then checking the columns). If you are scanning the complete database, you can use Spark for faster processing.
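For the Spark route, a minimal PySpark sketch that builds the "at least one column > 0" predicate programmatically (the keyspace and table names are placeholders, and this assumes the spark-cassandra-connector is on the classpath):
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# placeholder keyspace/table; assumes the spark-cassandra-connector package is available
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(table="my_table", keyspace="my_keyspace")
      .load())

amount_cols = ["col1", "col2", "col3", "col4", "col5", "col6"]

# build (col1 > 0) OR (col2 > 0) OR ... without writing each predicate by hand;
# rows where every column is null (like id 4) evaluate to null and are filtered out
any_positive = reduce(lambda a, b: a | b, [F.col(c) > 0 for c in amount_cols])

df.filter(any_positive).show()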

Sparksql get sample rows with where clause

Is it possible to get a sample of n rows from a query with a where clause?
I tried to use the tablesample clause below, but I ended up only getting records from the first partition, '2021-09-14'.
select * from (select * from table where ts in ('2021-09-14', '2021-09-15')) tablesample (100 rows)
You can utilise monotonically_increasing_id or rand to generate an additional column, which can be used to order your dataset and produce the necessary sampling field.
Both of these functions can be used together or individually.
Furthermore, you can use a LIMIT clause to sample your required N records.
NOTE - orderBy would be a costly operation.
Data Preparation
input_str = """
1 2/12/2019 114 2
2 3/5/2019 116 1
3 3/3/2019 120 6
4 3/4/2019 321 10
6 6/5/2019 116 1
7 6/3/2019 116 1
8 10/1/2019 120 3
9 10/1/2019 120 3
10 10/1/2020 120 3
11 10/1/2020 120 3
12 10/1/2020 120 3
13 10/1/2022 120 3
14 10/1/2021 120 3
15 10/6/2019 120 3
""".split()
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "shipment_id ship_date customer_id quantity".split()))
n = len(input_values)
input_list = [tuple(input_values[i:i+4]) for i in range(0,n,4)]
sparkDF = sql.createDataFrame(input_list, cols)
sparkDF = sparkDF.withColumn('ship_date',F.to_date(F.col('ship_date'),'d/M/yyyy'))
sparkDF.show()
+-----------+----------+-----------+--------+
|shipment_id| ship_date|customer_id|quantity|
+-----------+----------+-----------+--------+
| 1|2019-12-02| 114| 2|
| 2|2019-05-03| 116| 1|
| 3|2019-03-03| 120| 6|
| 4|2019-04-03| 321| 10|
| 6|2019-05-06| 116| 1|
| 7|2019-03-06| 116| 1|
| 8|2019-01-10| 120| 3|
| 9|2019-01-10| 120| 3|
| 10|2020-01-10| 120| 3|
| 11|2020-01-10| 120| 3|
| 12|2020-01-10| 120| 3|
| 13|2022-01-10| 120| 3|
| 14|2021-01-10| 120| 3|
| 15|2019-06-10| 120| 3|
+-----------+----------+-----------+--------+
Order By - Monotonically Increasing ID & Rand
sparkDF.createOrReplaceTempView("shipment_table")
sql.sql("""
SELECT
*
FROM (
SELECT
*
,monotonically_increasing_id() as increasing_id
,RAND(10) as random_order
FROM shipment_table
WHERE ship_date BETWEEN '2019-01-01' AND '2019-12-31'
ORDER BY monotonically_increasing_id() DESC ,RAND(10) DESC
LIMIT 5
)
""").show()
+-----------+----------+-----------+--------+-------------+-------------------+
|shipment_id| ship_date|customer_id|quantity|increasing_id| random_order|
+-----------+----------+-----------+--------+-------------+-------------------+
| 15|2019-06-10| 120| 3| 8589934593|0.11682250456449328|
| 9|2019-01-10| 120| 3| 8589934592|0.03422639313807285|
| 8|2019-01-10| 120| 3| 6| 0.8078688178371882|
| 7|2019-03-06| 116| 1| 5|0.36664222617947817|
| 6|2019-05-06| 116| 1| 4| 0.2093704977577|
+-----------+----------+-----------+--------+-------------+-------------------+
If you are using a Dataset, there is built-in functionality for this, as outlined in the documentation:
sample(withReplacement: Boolean, fraction: Double): Dataset[T]
Returns a new Dataset by sampling a fraction of rows, using a random seed.
withReplacement: Sample with replacement or not.
fraction: Fraction of rows to generate, range [0.0, 1.0].
Since: 1.6.0
Note: This is NOT guaranteed to provide exactly the fraction of the total count of the given Dataset.
To use this you'd filter your dataset against whatever criteria you're looking for, then sample the result. If you need an exact number of rows rather than a fraction you can follow the call to sample with limit(n) where n is the number of rows to return.
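A minimal PySpark sketch of that filter-then-sample pattern, assuming the table and column names from the question and a placeholder fraction large enough to yield at least 100 rows:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("table")  # placeholder table name

sampled = (
    df
    .filter(F.col("ts").isin("2021-09-14", "2021-09-15"))   # the WHERE clause
    .sample(withReplacement=False, fraction=0.01, seed=42)  # approximate sample
    .limit(100)                                             # exact cap on rows returned
)
sampled.show()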

How to delete rows which have the same values in only one column in Spark DataFrame

I have a Spark's DataFrame like this one below
*----------*-------*
| Node ID | value |
*----------*-------*
| Node 1 | 0 |
| Node 2 | 1 |
| Node 3 | 0 |
| Node 2 | 0 |
*----------*-------*
Is there any way to detect duplicate node IDs (e.g. Node 2 in the DataFrame above) in the Node ID column and delete the duplicate rows, even though those rows differ in the value column?
For example, can I output a new DataFrame like the one below, in which the row with NodeID = Node 2, value = 1 has been deleted from the original?
*----------*-------*
| Node ID | value |
*----------*-------*
| Node 1 | 0 |
| Node 3 | 0 |
| Node 2 | 0 |
*----------*-------*
Try a Window function with a filter to achieve this:
scala> var df = Seq(("Node 1", 0), ("Node 2", 1), ("Node 3", 0), ("Node 2", 0)).toDF("NodeID", "value")
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
scala> var win = Window.partitionBy("NodeID").orderBy("value")
scala> df.withColumn("result", row_number().over(win)).filter(col("result") < 2).drop("result").orderBy("NodeID").show()
+------+-----+
|NodeID|value|
+------+-----+
|Node 1| 0|
|Node 2| 0|
|Node 3| 0|
+------+-----+
Filtering on row_number lets you keep exactly as many records per group as your requirements call for.
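For reference, a minimal PySpark sketch of the same row_number approach, mirroring the Scala above:
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Node 1", 0), ("Node 2", 1), ("Node 3", 0), ("Node 2", 0)],
    ["NodeID", "value"],
)

# keep only the first row per NodeID, ordering by value ascending
win = Window.partitionBy("NodeID").orderBy("value")
result = (df.withColumn("rn", F.row_number().over(win))
            .filter(F.col("rn") == 1)
            .drop("rn"))
result.orderBy("NodeID").show()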

How to combine and sort different dataframes into one?

Given two dataframes, which may have completely different schemas, except for a index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of same name and type, but represent completely different data, so merging schema is not a possibility.
What you need is a full outer join, possibly renaming one of the columns, something like df1.join(df2.withColumnRenamed("length","length2"), Seq("timestamp"), "full_outer")
See this example, built from yours (just less typing)
// data shaped as your example
case class t1(ts:Int, width:Int,l:Int)
case class t2(ts:Int, name:String, l:Int)
// create data frames
val df1 = Seq(t1(1,10,20),t1(3,5,3)).toDF
val df2 = Seq(t2(0,"sample",3),t2(2,"test",6)).toDF
df1.join(df2.withColumnRenamed("l","l2"),Seq("ts"),"full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+
