Pyspark filter or split based on row value - apache-spark

I'm working with some data in PySpark where metadata and actual data are mixed: the strings are interrupted by a "STOP" string. The number of strings between two "STOP" markers varies, and we would like to filter out short runs.
Below is an example DataFrame that uses ints instead of strings, with 0 as the stop signal:
df = spark.createDataFrame([(1,),(0,),(3,),(0,),(4,),(4,),(5,),(0,)])
My goal is a filter function where I can specify how many elements there must be between two stop signals for the data to be kept. E.g. with min_length = 2, we would end up with this DataFrame:
df = spark.createDataFrame([(4,),(4,),(5,),(0,)])
My idea was to create a separate column and build a group id in there:
df.select("_1", F.when(df["_1"]==0, 0).otherwise(get_counter())).show()
The get_counter function should count how many times we've already seen "STOP" (or 0 in the example df). Due to the distributed nature of Spark, that does not work though.
Is it somehow possible to achieve this easily by filtering? Or is it possible to split the DataFrame every time "STOP" occurs? I could then delete the DataFrames that are too short and merge the rest again.
Preferably this would be solved in PySpark or Spark SQL. But if someone knows how to do this with the spark-shell, I'd also be curious :)

Spark SQL implementation:
with t2 as (
    select
        monotonically_increasing_id() as id
        , _1
        , case when _1 = 0 then 1 else 0 end as stop
    from
        t1
)
, t3 as (
    select
        *
        , row_number() over (partition by stop order by id) as stop_seq
    from
        t2
)
select * from t3 where stop_seq > 2
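Note that row_number() here counts rows across the whole table rather than within each run, so the query matches the sample data but does not generalize to other inputs (for example, a long run at the very start). As an illustration only, here is a minimal plain-Python sketch of the per-run logic the question asks for. The function name filter_runs and the decision to drop a trailing run that has no closing stop are assumptions, not part of the original answer. In Spark, the same grouping could presumably be built from a running sum of the stop flag over an ordered window, followed by a count per group.

```python
def filter_runs(values, stop=0, min_length=2):
    """Keep only runs (data rows plus their closing stop) with enough data rows.

    Hypothetical helper for illustration; it processes an ordered Python list,
    not a distributed DataFrame.
    """
    kept = []
    run = []
    for value in values:
        run.append(value)
        if value == stop:
            # The run ends at the stop marker; the marker itself is not data.
            if len(run) - 1 >= min_length:
                kept.extend(run)
            run = []
    # Assumption: rows after the last stop marker are dropped.
    return kept


sample = [1, 0, 3, 0, 4, 4, 5, 0]
result = filter_runs(sample, stop=0, min_length=2)
print(result)
```

With the sample data from the question and min_length=2, only the run 4, 4, 5 and its closing 0 survive.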

Related

Inconsistent duplicated row on Spark

I'm seeing weird behavior with Apache Spark, which I run in a Python notebook on Azure Databricks. I have a DataFrame with two columns of interest: name and ftime.
I found that I sometimes get duplicated values and sometimes not, depending on how I fetch the data:
df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').count()
# Result is 1
But when I run
len(df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').collect())
# Result is 2
I get 2 rows, which are exactly the same. The two cells are run one after the other, and the order doesn't change anything.
I tried creating a temp view in Spark with
df.createOrReplaceTempView('df_referential')
but I run into the same problem:
%sql
SELECT name, ftime, COUNT(*)
FROM df_referential
GROUP BY name, ftime
HAVING COUNT(*) > 1
returns no result, while
%sql
SELECT *
FROM df_referential
WHERE name = 'test' AND ftime = '2022-07-18'
returns two rows, perfectly identical.
I'm having a hard time understanding why this happens. I expect these queries to return only one row, and the JSON file that the data is read from contains only one occurrence of the data.
If someone can point out what I'm doing wrong, that would be a great help.

drop_duplicates after unionByName

I am trying to stack two DataFrames (with unionByName()) and then drop duplicate entries (with drop_duplicates()).
Can I trust that unionByName() will preserve the order of the rows, i.e., that df1.unionByName(df2) will always produce a DataFrame whose first N rows are df1's? Because, if so, when applying drop_duplicates(), df1's row would always be preserved, which is the behaviour I want.
unionByName() does not guarantee that the records from df1 come first, followed by those from df2. These are distributed, parallel tasks, so you definitely can't rely on that.
One solution is to add a technical priority column to each DataFrame, then call unionByName() and use the row_number() analytic function to order by priority within each ID, and then select the row with the highest priority (in the case below, 1 ranks higher than 2).
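Take a look at the Scala code below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

// Tag each source so that df1's rows win (priority 1 beats priority 2)
val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))

df1WithPriority
  .unionByName(df2WithPriority)
  .withColumn(
    "row_num",
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  // Keep only the highest-priority row per ID
  .where(col("row_num") === lit(1))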
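To make the priority idea concrete outside Spark, here is a small plain-Python sketch of the same rule: tag each row with its source priority, then keep the lowest priority number per ID regardless of the order in which rows arrive. The function name union_keep_first and the dictionary-based rows are assumptions made for illustration; this is not how Spark executes the window function.

```python
def union_keep_first(rows1, rows2, key="ID"):
    """Union two lists of dict rows, preferring rows1 when the key collides.

    Hypothetical helper for illustration only.
    """
    tagged = [(1, row) for row in rows1] + [(2, row) for row in rows2]
    best = {}
    for priority, row in tagged:
        k = row[key]
        # A lower priority number wins, independent of arrival order.
        if k not in best or priority < best[k][0]:
            best[k] = (priority, row)
    return [row for _, row in best.values()]


df1_rows = [{"ID": 1, "v": "a"}, {"ID": 2, "v": "b"}]
df2_rows = [{"ID": 2, "v": "x"}, {"ID": 3, "v": "c"}]
merged = sorted(union_keep_first(df1_rows, df2_rows), key=lambda r: r["ID"])
print(merged)
```

Because the choice depends only on the priority tag, the result does not rely on any row ordering after the union.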

Spark condition on partition column from another table (Performance)

I have a huge Parquet table named stored, partitioned on the registration_ts column.
I'd like to filter this table based on data obtained from a small table named stream.
In the SQL world the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
In Dataframe world:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but performance suffers because partition pruning is not applied. Spark does a full scan of the stored table, which is too expensive.
For example, this takes 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the second example the partition filter is propagated to the joined stream table.
I'd like to achieve partition filtering on a dynamic set of partitions.
The only solution I was able to come up with is:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
This collects the partitions to the driver and runs a second query. It works fine only for a small number of partitions. When I tried this solution with 500k distinct partitions, the delay was significant.
But there must be a better way ...
Here's one way to do it in PySpark; I've verified in Zeppelin that it uses the set of values to prune the partitions.
from pyspark.sql.functions import col, collect_set

# collect_set returns the distinct values as a list, and collect() returns a list of rows.
# The first [0] takes the first row; the second [0] takes its first column,
# which is the list of distinct values.
filter_list = (
    spark.read.orc(HDFS_PATH)
    .agg(collect_set(COLUMN_WITH_FILTER_VALUES))
    .collect()[0][0]
)

# You may want to check filter_list here to make sure the first spark.read actually
# returned a valid list of values before doing the next spark.read and pruning partitions.

# Use filter_list with isin to prune the partitions
df = spark.read.orc(HDFS_PATH).filter(col(PARTITION_COLUMN).isin(filter_list))
df.show(5)

Spark multiple count distinct in single expression

I have the following code written in Spark using Scala and the SQL API:
sourceData
  .groupBy($"number")
  .agg(
    countDistinct(when(...something...)),
    countDistinct(when(...something...)),
    countDistinct(when(...something...)),
    countDistinct(when(...something...)),
    countDistinct(when(...something...)))
When I check the execution plan, Spark internally does something called "expand" and multiplies the records 5 times (once for each count distinct column). As I already have billions of records, this becomes very inefficient. Is there a more efficient way to do this? And please do not say countApproxDistinct, as I need exact values :)
You could try to engineer new columns (1 or 0) before the aggregations and then just do a max(). This should reduce the number of scans.
sourceData
  .withColumn("engineered_col1", expr("CASE WHEN ... THEN 1 ELSE 0 END"))
  .withColumn("engineered_col2", expr("CASE WHEN ... THEN 1 ELSE 0 END"))
  .groupBy($"number")
  .agg(max($"engineered_col1"), max($"engineered_col2"))

Using Large Look up table

Problem statement:
I have two tables: Data (40 columns) and LookUp (2 columns). I need to match col10 in the Data table against the lookup table to extract the relevant value.
However, I cannot use an equi join. I need a join based on like/contains, because the values in the lookup table contain only part of the value in the Data table, not the complete value. Hence some regex-based matching is required.
Data size:
Data table: approx. 2.3 billion entries (1 TB of data)
Lookup table: approx. 1.4 million entries (50 MB of data)
Approaches tried:
1. Using the database (I am using Google BigQuery): a join based on LIKE ran for close to 3 hours and still returned no result. I believe a regex-based join leads to a Cartesian join.
2. Using Apache Beam/Spark: I tried to construct a trie for the lookup table, which is then shared/broadcast to the worker nodes. However, with this approach I get an OOM because I am creating too many strings. I tried increasing memory to 4 GB+ per worker node, but to no avail.
I am using the trie to extract the longest matching prefix.
I am open to using other technologies like Apache Spark, Redis, etc.
Please suggest how I can go about handling this problem.
This processing needs to be performed on a day-to-day basis, hence both time and resources need to be optimized.
"However I cannot make equi join"
The below is just to give you an idea to explore for addressing your equi join issue in pure BigQuery.
It is based on an assumption I derived from your comments, and covers the use case where you are looking for the longest match from the far right to the left; matches in the middle do not qualify.
The approach is to reverse both the url (col10) and shortened_url (col2) fields, then SPLIT() them and UNNEST() while preserving positions:
UNNEST(SPLIT(REVERSE(field), '.')) part WITH OFFSET position
With this done, you can now do an equi join, which can potentially address your issue to some extent.
So, you JOIN by parts and positions, then GROUP BY the original url and shortened_url, keeping only those groups HAVING a count of matches equal to the count of parts in shortened_url; finally you GROUP BY url and keep only the entry with the highest number of matching parts.
Hope this can help :o)
This is for BigQuery Standard SQL:
#standardSQL
WITH data_table AS (
  SELECT 'cn456.abcd.tech.com' url UNION ALL
  SELECT 'cn457.abc.tech.com' UNION ALL
  SELECT 'cn458.ab.com'
), lookup_table AS (
  SELECT 'tech.com' shortened_url, 1 val UNION ALL
  SELECT 'abcd.tech.com', 2
), data_table_parts AS (
  SELECT url, x, y
  FROM data_table, UNNEST(SPLIT(REVERSE(url), '.')) x WITH OFFSET y
), lookup_table_parts AS (
  SELECT shortened_url, a, b, val,
    ARRAY_LENGTH(SPLIT(REVERSE(shortened_url), '.')) len
  FROM lookup_table, UNNEST(SPLIT(REVERSE(shortened_url), '.')) a WITH OFFSET b
)
SELECT url,
  ARRAY_AGG(STRUCT(shortened_url, val) ORDER BY weight DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT url, shortened_url, COUNT(1) weight, ANY_VALUE(val) val
  FROM data_table_parts d
  JOIN lookup_table_parts l
  ON x = a AND y = b
  GROUP BY url, shortened_url
  HAVING weight = ANY_VALUE(len)
)
GROUP BY url
with the result:
Row  url                  shortened_url  val
1    cn457.abc.tech.com   tech.com       1
2    cn456.abcd.tech.com  abcd.tech.com  2
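For comparison, the same "longest match anchored at the right" rule can be sketched in plain Python with a hash lookup instead of a join. This is a different technique from the SQL above, shown only as an illustration: it assumes, as the answer does, that matches align on '.' boundaries and must end at the right edge of the url. The function name longest_suffix_match and the in-memory dictionary are assumptions; whether a lookup of the size described would fit in worker memory as a dictionary would need to be checked.

```python
def longest_suffix_match(url, lookup):
    """Return (shortened_url, val) for the longest dot-aligned suffix of url
    found in lookup, or None when nothing matches.

    Hypothetical helper for illustration only.
    """
    parts = url.split(".")
    # Drop leading labels one at a time, so the longest suffix is tried first.
    for start in range(len(parts)):
        candidate = ".".join(parts[start:])
        if candidate in lookup:
            return candidate, lookup[candidate]
    return None


lookup = {"tech.com": 1, "abcd.tech.com": 2}
urls = ["cn456.abcd.tech.com", "cn457.abc.tech.com", "cn458.ab.com"]
results = {url: longest_suffix_match(url, lookup) for url in urls}
for url, match in results.items():
    print(url, match)
```

Each url needs at most one dictionary probe per label, so the cost per row does not grow with the size of the lookup table.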
