I have a spark dataframe that looks like this:
+----+------+-------------+
|user| level|value_pair |
+----+------+-------------+
| A | 25 |(23.52,25.12)|
| A | 6 |(0,0) |
| A | 2 |(11,12.12) |
| A | 32 |(17,16.12) |
| B | 22 |(19,57.12) |
| B | 42 |(10,3.2) |
| B | 43 |(32,21.0) |
| C | 33 |(12,0) |
| D | 32 |(265.21,19.2)|
| D | 62 |(57.12,50.12)|
| D | 32 |(75.12,57.12)|
| E | 63 |(0,0) |
+----+------+-------------+
How do I extract the values in the value_pair column and add them to two new columns called value1 and value2, using the comma as the separator.
+----+------+-------------+-------+
|user| level|value1 |value2 |
+----+------+-------------+-------+
| A | 25 |23.52 |25.12 |
| A | 6 |0 |0 |
| A | 2 |11 |12.12 |
| A | 32 |17 |16.12 |
| B | 22 |19 |57.12 |
| B | 42 |10 |3.2 |
| B | 43 |32 |21.0 |
| C | 33 |12 |0 |
| D | 32 |265.21 |19.2 |
| D | 62 |57.12 |50.12 |
| D | 32 |75.12 |57.12 |
| E | 63 |0 |0 |
+----+------+-------------+-------+
I know I can separate the values like so:
df = df.withColumn('value1', pyspark.sql.functions.split(df['value_pair'], ',')[0])
df = df.withColumn('value2', pyspark.sql.functions.split(df['value_pair'], ',')[1])
But how do I also get rid of the parentheses?
For the parentheses, as shown in the comments, you can use regexp_replace, but you need to prefix them with \, since the backslash is the escape character in regular expressions.
Also, I believe you need to first remove the brackets, and then split the column.
from pyspark.sql.functions import split
from pyspark.sql.functions import regexp_replace
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\(", ""))
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\)", ""))
df = df.withColumn('value1', split(df['value_pair'], ',').getItem(0)) \
.withColumn('value2', split(df['value_pair'], ',').getItem(1))
>>> df.show(truncate=False)
+----+-----+-----------+------+---------+
|user|level|value_pair |value1|value2 |
+----+-----+-----------+------+---------+
| A |25 |23.52,25.12|23.52 |25.12 |
| A |6 |0,0 |0 |0 |
| A |2 |11,12.12 |11 |12.12 |
| A |32 |17,16.12 |17 |16.12 |
| B |22 |19,57.12 |19 |57.12 |
| B |42 |10,3.2 |10 |3.2 |
| B |43 |32,21.0 |32 |21.0 |
| C |33 |12,0 |12 |0 |
| D |32 |265.21,19.2|265.21|19.2 |
| D |62 |57.12,50.12|57.12 |50.12 |
| D |32 |75.12,57.12|75.12 |57.12 |
| E |63 |0,0 |0 |0 |
+----+-----+-----------+------+---------+
As you may have noticed, I slightly changed your code for how you grab the 2 items.
More information can be found here
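As a side note, a character class lets you strip both parentheses in a single regexp_replace call, so value_pair does not have to be overwritten first. This is only a sketch of an equivalent approach:
from pyspark.sql.functions import regexp_replace, split

# Strip '(' and ')' in one pass, then split on the comma.
clean = regexp_replace(df['value_pair'], r'[()]', '')
df = df.withColumn('value1', split(clean, ',').getItem(0)) \
       .withColumn('value2', split(clean, ',').getItem(1))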
Considering that I have the following DF:
|-----------------|
|Date | Cod |
|-----------------|
|2022-08-01 | A |
|2022-08-02 | A |
|2022-08-03 | A |
|2022-08-04 | A |
|2022-08-05 | A |
|2022-08-01 | B |
|2022-08-02 | B |
|2022-08-03 | B |
|2022-08-04 | B |
|2022-08-05 | B |
|-----------------|
And considering that I have a backward observation of 2 days, how can I generate the following output DF?
|------------------------------|
|RefDate | Date | Cod
|------------------------------|
|2022-08-03 | 2022-08-01 | A |
|2022-08-03 | 2022-08-02 | A |
|2022-08-03 | 2022-08-03 | A |
|2022-08-04 | 2022-08-02 | A |
|2022-08-04 | 2022-08-03 | A |
|2022-08-04 | 2022-08-04 | A |
|2022-08-05 | 2022-08-03 | A |
|2022-08-05 | 2022-08-04 | A |
|2022-08-05 | 2022-08-05 | A |
|2022-08-03 | 2022-08-01 | B |
|2022-08-03 | 2022-08-02 | B |
|2022-08-03 | 2022-08-03 | B |
|2022-08-04 | 2022-08-02 | B |
|2022-08-04 | 2022-08-03 | B |
|2022-08-04 | 2022-08-04 | B |
|2022-08-05 | 2022-08-03 | B |
|2022-08-05 | 2022-08-04 | B |
|2022-08-05 | 2022-08-05 | B |
|------------------------------|
I know that I can use loops to generate this output DF, but loops don't perform well here since I can't cache the DF in memory (my original DF has approximately 6 billion rows). So, what is the best way to get this output?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType

data_1 = [
    ("2022-08-01", "A"),
    ("2022-08-02", "A"),
    ("2022-08-03", "A"),
    ("2022-08-04", "A"),
    ("2022-08-05", "A"),
    ("2022-08-01", "B"),
    ("2022-08-02", "B"),
    ("2022-08-03", "B"),
    ("2022-08-04", "B"),
    ("2022-08-05", "B")
]
schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Cod", StringType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
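Since the MVCE stores Date as strings, you may want to cast it to a proper date type before doing date arithmetic; a small optional sketch, assuming the yyyy-MM-dd format shown above:
from pyspark.sql import functions as F

# Convert the string column to DateType so datediff operates on real dates
# rather than relying on implicit casts.
df_1 = df_1.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))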
You could try a self join. My thoughts: if your cluster and session are configured optimally, it should work with 6B rows.
from pyspark.sql import functions as func

data_sdf.alias('a'). \
    join(data_sdf.alias('b'),
         [func.col('a.cod') == func.col('b.cod'),
          func.datediff(func.col('a.date'), func.col('b.date')).between(0, 2)],
         'inner'
         ). \
    drop(func.col('a.cod')). \
    selectExpr('cod', 'a.date as ref_date', 'b.date as date'). \
    show()
# +---+----------+----------+
# |cod| ref_date| date|
# +---+----------+----------+
# | B|2022-08-01|2022-08-01|
# | B|2022-08-02|2022-08-01|
# | B|2022-08-02|2022-08-02|
# | B|2022-08-03|2022-08-01|
# | B|2022-08-03|2022-08-02|
# | B|2022-08-03|2022-08-03|
# | B|2022-08-04|2022-08-02|
# | B|2022-08-04|2022-08-03|
# | B|2022-08-04|2022-08-04|
# | B|2022-08-05|2022-08-03|
# | B|2022-08-05|2022-08-04|
# | B|2022-08-05|2022-08-05|
# | A|2022-08-01|2022-08-01|
# | A|2022-08-02|2022-08-01|
# | A|2022-08-02|2022-08-02|
# | A|2022-08-03|2022-08-01|
# | A|2022-08-03|2022-08-02|
# | A|2022-08-03|2022-08-03|
# | A|2022-08-04|2022-08-02|
# | A|2022-08-04|2022-08-03|
# +---+----------+----------+
# only showing top 20 rows
This will also generate records for the initial 2 dates, which can be discarded; one way to do that is sketched below.
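A sketch of one way to drop them, assuming the earliest date per cod marks the start of its observation window and that the joined result above is assigned to a dataframe I'll call result:
from pyspark.sql import Window
from pyspark.sql import functions as func

# Keep only reference dates at least 2 days after the first date of each cod.
w = Window.partitionBy('cod')
result = result.withColumn('min_date', func.min('ref_date').over(w)) \
    .filter(func.datediff('ref_date', 'min_date') >= 2) \
    .drop('min_date')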
I have a csv as described:
s_table | s_name | t_cast | t_d |
aaaaaa | juuoo | TRUE |float |
aaaaaa | juueo | TRUE |float |
aaaaaa | ju4oo | | |
aaaaaa | juuoo | | |
aaaaaa | thyoo | | |
aaaaaa | juioo | | |
aaaaaa | rtyoo | | |
I am trying to use a PySpark when condition to check t_cast for rows where s_table matches and, if it is TRUE, return a statement in a new column.
What I've tried is:
filters = filters.withColumn(
    "p3",
    f.when((f.col("s_table") == "aaaaaa") & (f.col("t_cast").isNull()), f.col("s_name"))
     .when((f.col("s_table") == "aaaaaa") & (f.col("t_cast") == True),
           f"CAST({f.col('s_table')} AS {f.col('t_d')}) AS {f.col('s_table')}"))
What I am trying to achieve is for the column p3 to return this:
s_table | s_name | t_cast | t_d | p_3 |
aaaaaa | juuoo | TRUE |float | cast ('juuoo' as float) as 'juuoo' |
aaaaaa | juueo | TRUE |float | cast ('juueo' as float) as 'juuoo' |
aaaaaa | ju4oo | | | ju4oo |
aaaaaa | juuoo | | | juuoo |
aaaaaa | thyoo | | | thyoo |
aaaaaa | juioo | | | juioo |
aaaaaa | rtyoo | | | rtyoo |
But the result that I get is:
CAST(Column<'s_field'> AS Column<'t_data_type'>) AS Column<'s_field'>,
CAST(Column<'s_field'> AS Column<'t_data_type'>) AS Column<'s_field'>,
I feel like I am almost there, but I can't quite figure it out.
You need to use the Spark concat function instead of Python string formatting to get the expected string. Something like:
import pyspark.sql.functions as F
filters = filters.withColumn(
    "p3",
    F.when((F.col("s_table") == "aaaaaa") & (F.col("t_cast").isNull()), F.col("s_name"))
     .when((F.col("s_table") == "aaaaaa") & F.col("t_cast"),
           F.expr(r"concat('CAST(\'', s_name, '\' AS ', t_d, ') AS \'', s_table, '\'')"))
)
filters.show(truncate=False)
#+-------+------+------+-----+----------------------------------+
#|s_table|s_name|t_cast|t_d |p3 |
#+-------+------+------+-----+----------------------------------+
#|aaaaaa |juuoo |true |float|CAST('juuoo' AS float) AS 'aaaaaa'|
#|aaaaaa |juueo |true |float|CAST('juueo' AS float) AS 'aaaaaa'|
#|aaaaaa |ju4oo |null |null |ju4oo |
#|aaaaaa |juuoo |null |null |juuoo |
#|aaaaaa |thyoo |null |null |thyoo |
#|aaaaaa |juioo |null |null |juioo |
#|aaaaaa |rtyoo |null |null |rtyoo |
#+-------+------+------+-----+----------------------------------+
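As a side note, format_string can build the same string without escaping quotes inside a SQL expression; a sketch of the equivalent logic, under the same conditions as above:
import pyspark.sql.functions as F

# printf-style formatting: each %s is filled from the columns in order.
cast_expr = F.format_string("CAST('%s' AS %s) AS '%s'",
                            F.col("s_name"), F.col("t_d"), F.col("s_table"))
filters = filters.withColumn(
    "p3",
    F.when((F.col("s_table") == "aaaaaa") & F.col("t_cast").isNull(), F.col("s_name"))
     .when((F.col("s_table") == "aaaaaa") & F.col("t_cast"), cast_expr)
)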
I have two similar dataframes, one has a single date and the other has multiple dates plus an additional column:
df:
| yyyy_mm_dd | id | region | country | product | count |
|------------|-----|--------|----------|---------|-------|
| 2021-06-14 | 111 | EMEA | Spain | P1 | 10 |
| 2021-06-14 | 111 | EMEA | England | P1 | 9 |
| 2021-06-14 | 111 | EMEA | France | P1 | 10 |
| 2021-06-14 | 111 | EMEA | Spain | P2 | 299 |
| 2021-06-14 | 111 | EMEA | England | P2 | 39 |
| 2021-06-14 | 111 | EMEA | France | P2 | 10 |
| 2021-06-14 | 112 | LATAM | Brazil | P1 | 64 |
| 2021-06-14 | 112 | LATAM | Paraguay | P2 | 21 |
| 2021-06-14 | ... | ... | ... | ... | ... |
df1:
| yyyy_mm_dd | id | region | country | product | count | fullfilments |
|------------|-----|--------|----------|---------|-------|--------------|
| 2021-06-14 | 111 | EMEA | Spain | P1 | 1 | 1 |
| 2021-06-14 | 111 | EMEA | England | P1 | 1 | 3 |
| 2021-06-14 | 111 | EMEA | France | P1 | 2 | 4 |
| 2021-06-14 | 111 | EMEA | Spain | P2 | 1 | 1 |
| 2021-06-14 | 111 | EMEA | England | P2 | 2 | 1 |
| 2021-06-14 | 111 | EMEA | France | P2 | 1 | 5 |
| 2021-06-14 | 112 | LATAM | Brazil | P1 | 2 | 2 |
| 2021-06-14 | 112 | LATAM | Paraguay | P2 | 21 | 1 |
| 2021-06-14 | ... | ... | ... | ... | ... | ... |
| 2021-06-13 | 111 | EMEA | Spain | P1 | 0 | 1 |
| 2021-06-13 | 111 | EMEA | England | P2 | 0 | 2 |
df1 has many dates of grouped data and df only has one date. I would like to replace the count column in df1 with the count in df for matching rows (yyyy_mm_dd, id, region, country, product) and retain fullfilments.
I could probably join both together and drop count in the first df; however, I only want to replace count where the date matches, and retain all other rows in df1.
You can simply join and use the coalesce function.
When you do the left join from the first dataframe to the second, only the matching records have a non-null new_count value. Now, use the coalesce function, which returns the first value when it is not null and the second value when the first is null.
coalesce(a , b ) => a
coalesce(a , null) => a
coalesce(null, b ) => b
From your dataframes,
from pyspark.sql import functions as f
df1 = spark.read.option("inferSchema","true").option("header","true").csv("test1.csv")
+----------+---+------+--------+-------+-----+
|yyyy_mm_dd|id |region|country |product|count|
+----------+---+------+--------+-------+-----+
|2021-06-14|111|EMEA |Spain |P1 |10 |
|2021-06-14|111|EMEA |England |P1 |9 |
|2021-06-14|111|EMEA |France |P1 |10 |
|2021-06-14|111|EMEA |Spain |P2 |299 |
|2021-06-14|111|EMEA |England |P2 |39 |
|2021-06-14|111|EMEA |France |P2 |10 |
|2021-06-14|112|LATAM |Brazil |P1 |64 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |
+----------+---+------+--------+-------+-----+
df2 = spark.read.option("inferSchema","true").option("header","true").csv("test2.csv")
+----------+---+------+--------+-------+-----+------------+
|yyyy_mm_dd|id |region|country |product|count|fullfilments|
+----------+---+------+--------+-------+-----+------------+
|2021-06-14|111|EMEA |Spain |P1 |1 |1 |
|2021-06-14|111|EMEA |England |P1 |1 |3 |
|2021-06-14|111|EMEA |France |P1 |2 |4 |
|2021-06-14|111|EMEA |Spain |P2 |1 |1 |
|2021-06-14|111|EMEA |England |P2 |2 |1 |
|2021-06-14|111|EMEA |France |P2 |1 |5 |
|2021-06-14|112|LATAM |Brazil |P1 |2 |2 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |1 |
|2021-06-13|111|EMEA |Spain |P1 |0 |1 |
|2021-06-13|111|EMEA |England |P2 |0 |2 |
+----------+---+------+--------+-------+-----+------------+
The join of the two dataframes is given as follows:
cols_to_join = ['yyyy_mm_dd', 'id', 'region', 'country', 'product']
df3 = df2.join(df1.withColumnRenamed('count', 'new_count'), cols_to_join, 'left') \
.withColumn('count', f.coalesce('new_count', 'count')).drop('new_count')
df3.show(truncate=False)
+----------+---+------+--------+-------+-----+------------+
|yyyy_mm_dd|id |region|country |product|count|fullfilments|
+----------+---+------+--------+-------+-----+------------+
|2021-06-14|111|EMEA |Spain |P1 |10 |1 |
|2021-06-14|111|EMEA |England |P1 |9 |3 |
|2021-06-14|111|EMEA |France |P1 |10 |4 |
|2021-06-14|111|EMEA |Spain |P2 |299 |1 |
|2021-06-14|111|EMEA |England |P2 |39 |1 |
|2021-06-14|111|EMEA |France |P2 |10 |5 |
|2021-06-14|112|LATAM |Brazil |P1 |64 |2 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |1 |
|2021-06-13|111|EMEA |Spain |P1 |0 |1 |
|2021-06-13|111|EMEA |England |P2 |0 |2 |
+----------+---+------+--------+-------+-----+------------+
Every time you need to retrieve columns from different dataframes, you must join them:
import pyspark.sql.functions as f
df2 = df1.join(df.withColumnRenamed('count', 'new_count'),
               on=['yyyy_mm_dd', 'id', 'region', 'country', 'product'], how='left')
df2 = (df2
       .withColumn('count', f.coalesce('new_count', 'count'))
       .drop('new_count'))
df2.show(truncate=False)
I have a dataframe which contains negative numbers in accountancy notation, i.e.:
df.select('sales').distinct().show()
+------------+
| sales |
+------------+
| 18 |
| 3 |
| 10 |
| (5)|
| 4 |
| 40 |
| 0 |
| 8 |
| 16 |
| (2)|
| 2 |
| (1)|
| 14 |
| (3)|
| 9 |
| 19 |
| (6)|
| 1 |
| (9)|
| (4)|
+------------+
only showing top 20 rows
The numbers wrapped in () are negative. How can I replace them with minus values instead, i.e. (5) becomes -5, and so on?
Here is what I have tried:
sales = (
df
.select('sales')
.withColumn('sales_new',
sf.when(sf.col('sales').substr(1,1) == '(',
sf.concat(sf.lit('-'), sf.col('sales').substr(2,3)))
.otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|sales    |sales_new|
+---------+---------+
| 151 | 151 |
| 134 | 134 |
| 151 | 151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
| 151 | 151 |
| 50 | 50 |
| 101 | 101 |
| 134 | 134 |
|(134) |-134 |
| 46 | 46 |
| 151 | 151 |
| 134 | 134 |
| 185 | 185 |
| 84 | 84 |
| 188 | 188 |
|(94) |-94) |
| 38 | 38 |
| 21 | 21 |
+---------+---------+
The issue is that the length of sales can vary so hardcoding a value into the substring() won't work in some cases.
I have tried using regexp_replace but get an error that:
PatternSyntaxException: Unclosed group near index 1
sales = (
df
.select('sales')
.withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)
This can be solved with a case statement and regular expression together:
import pyspark.sql.functions as sf
from pyspark.sql.functions import regexp_replace

sales = (
    df
    .select('sales')
    .withColumn('sales_new', sf.when(sf.col('sales').substr(1, 1) == '(',
                                     sf.concat(sf.lit('-'), regexp_replace(sf.col('sales'), r'\(|\)', '')))
                               .otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|sales |sales_new|
+---------+---------+
|151 |151 |
|134 |134 |
|151 |151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
|151 |151 |
|50 |50 |
|101 |101 |
|134 |134 |
|(134) |-134 |
|46 |46 |
|151 |151 |
|134 |134 |
|185 |185 |
|84 |84 |
|188 |188 |
|(94) |-94 |
|38 |38 |
|21 |21 |
+---------+---------+
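For what it's worth, a single regexp_replace with a backreference can do both steps at once, and the result can then be cast to a numeric type; a sketch, assuming the bracketed values contain digits only:
from pyspark.sql import functions as sf

# "(151)" -> "-151"; values without parentheses are left untouched.
sales = (
    df
    .select('sales')
    .withColumn('sales_new',
                sf.regexp_replace(sf.trim(sf.col('sales')), r'\((\d+)\)', '-$1')
                  .cast('float'))
)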
You can slice the string from the second character to the second-to-last character, negate the value, and then convert it to float, for example:
def convert(number):
    try:
        number = float(number)
    except ValueError:
        # strip the surrounding parentheses and negate
        number = -float(number[1:-1])
    return number
You can then apply this function to every element of the sales column, for example by wrapping it in a UDF, as sketched below.
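A minimal sketch of that, assuming the convert function above (note that a native column expression such as the regexp_replace approach will usually be faster than a Python UDF):
from pyspark.sql import functions as sf
from pyspark.sql.types import FloatType

# Wrap the plain Python function in a UDF and apply it to the trimmed column.
convert_udf = sf.udf(convert, FloatType())
df = df.withColumn('sales_new', convert_udf(sf.trim(sf.col('sales'))))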
I have a dataframe with the following columns:
+-----+----------+--------------------------+-----------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-----------+
| 0 | 128 | 2019-12-03 12:00:00.0 | 0 |
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 2 | 128 | 2019-12-03 12:37:00.0 | 0 |
| 3 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:17:00.0 | 0 |
+-----+----------+--------------------------+-----------+
I am trying to split the timestamp column into rows of 5-minute time intervals for indicator values which are not 0.
Explanation:
The first entry is at timestamp = 2019-12-03 12:00:00.0 with indicator = 0, so do nothing.
Moving on to the next entry with timestamp = 2019-12-03 12:30:00.0 and indicator = 1, I want to split the timestamp into rows at 5-minute intervals until we reach the next entry, which is timestamp = 2019-12-03 12:37:00.0 with indicator = 0.
If there is a case where timestamp = 2019-12-03 13:15:00.0 has indicator = 1 and the next timestamp = 2019-12-03 13:17:00.0 has indicator = 0, I'd like to split the row as if both times had indicator = 1, since 13:17:00.0 falls between 13:15:00.0 and 13:20:00.0, as shown below.
How can I achieve this with PySpark?
Expected Output:
+-----+----------+--------------------------+-------------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-------------+
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 1 | 128 | 2019-12-03 12:35:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:20:00.0 | 1 |
+-----+----------+--------------------------+-------------+
IIUC, you can filter rows based on the indicators of the current and the next rows, and then use array + explode to create new rows (for testing purposes, I added some more rows to your original example):
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('sourceid').orderBy('timestamp')
# add a flag to check if the next indicator is '0'
df1 = df.withColumn('next_indicator_is_0', F.lead('indicator').over(w1) == 0)
df1.show(truncate=False)
+---+--------+---------------------+---------+-------------------+
|id |sourceid|timestamp |indicator|next_indicator_is_0|
+---+--------+---------------------+---------+-------------------+
|0 |128 |2019-12-03 12:00:00.0|0 |false |
|1 |128 |2019-12-03 12:30:00.0|1 |true |
|2 |128 |2019-12-03 12:37:00.0|0 |false |
|3 |128 |2019-12-03 13:12:00.0|1 |false |
|4 |128 |2019-12-03 13:15:00.0|1 |true |
|5 |128 |2019-12-03 13:17:00.0|0 |false |
|6 |128 |2019-12-03 13:20:00.0|1 |null |
+---+--------+---------------------+---------+-------------------+
df1.filter("indicator = 1 AND next_indicator_is_0") \
.withColumn('timestamp', F.expr("explode(array(`timestamp`, `timestamp` + interval 5 minutes))")) \
.drop('next_indicator_is_0') \
.show(truncate=False)
+---+--------+---------------------+---------+
|id |sourceid|timestamp |indicator|
+---+--------+---------------------+---------+
|1 |128 |2019-12-03 12:30:00.0|1 |
|1 |128 |2019-12-03 12:35:00 |1 |
|4 |128 |2019-12-03 13:15:00.0|1 |
|4 |128 |2019-12-03 13:20:00 |1 |
+---+--------+---------------------+---------+
Note: you can reset the id column by using F.row_number().over(w1) or F.monotonically_increasing_id(), depending on your requirements; see the sketch below.
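A minimal sketch of the row_number variant, assuming the exploded result above is assigned to a dataframe I'll call df2 and a 0-based id is wanted:
from pyspark.sql import Window, functions as F

# Renumber rows per sourceid in timestamp order after the explode.
w1 = Window.partitionBy('sourceid').orderBy('timestamp')
df2 = df2.withColumn('id', F.row_number().over(w1) - 1)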