I have an initial PySpark dataframe from which I would like to take the MIN and MAX of a date column and then create a new PySpark dataframe containing a daily timeseries between that MIN and MAX.
I will then join it with my initial dataframe to find the missing days (rows where the other columns of my initial DF are null).
I tried in many different ways to build the timeseries DF, but it doesn't seem to work in PySpark. Any suggestions?
The max value of a column can be extracted like this:
df.agg(F.max('col_name')).head()[0]
A date-range dataframe can be created like this:
df2 = spark.sql("SELECT sequence(to_date('2000-01-01'), to_date('2000-02-02'), interval 1 day) as date_col").withColumn('date_col', F.explode('date_col'))
And then join.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, '2022-04-01'),(2, '2022-04-05')], ['id', 'df1_date']).select('id', F.col('df1_date').cast('date'))
df1.show()
# +---+----------+
# | id| df1_date|
# +---+----------+
# | 1|2022-04-01|
# | 2|2022-04-05|
# +---+----------+
min_date = df1.agg(F.min('df1_date')).head()[0]
max_date = df1.agg(F.max('df1_date')).head()[0]
df2 = spark.sql(f"SELECT sequence(to_date('{min_date}'), to_date('{max_date}'), interval 1 day) as df2_date").withColumn('df2_date', F.explode('df2_date'))
df3 = df2.join(df1, df1.df1_date == df2.df2_date, 'left')
df3.show()
# +----------+----+----------+
# | df2_date| id| df1_date|
# +----------+----+----------+
# |2022-04-01| 1|2022-04-01|
# |2022-04-02|null| null|
# |2022-04-03|null| null|
# |2022-04-04|null| null|
# |2022-04-05| 2|2022-04-05|
# +----------+----+----------+
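Since the goal is to find the missing days, you can then keep only the rows where the join produced nulls (a small sketch on top of the example above):
missing_days = df3.filter(F.col('id').isNull()).select('df2_date')
missing_days.show()
# +----------+
# |  df2_date|
# +----------+
# |2022-04-02|
# |2022-04-03|
# |2022-04-04|
# +----------+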
I have a dataframe df1 like this:
and another dataframe df2 like this:
How could I join df2 with df1 using left join so that my output would look like the following?
You can split values in df1 and explode them before the join.
df3 = df1.withColumn('Value', F.explode(F.split('Value', ';')))
df4 = df2.join(df3, 'Value', 'left')
Full example:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([('apple;banana', 150), ('carrot', 20)], ['Value', 'Amount'])
df2 = spark.createDataFrame([('apple',), ('orange',)], ['Value'])
df3 = df1.withColumn('Value', F.explode(F.split('Value', ';')))
df4 = df2.join(df3, 'Value', 'left')
df4.show()
# +------+------+
# | Value|Amount|
# +------+------+
# | apple| 150|
# |orange| null|
# +------+------+
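For reference, the intermediate df3 (after split and explode, before the join) contains one row per individual value:
df3.show()
# +------+------+
# | Value|Amount|
# +------+------+
# | apple|   150|
# |banana|   150|
# |carrot|    20|
# +------+------+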
Dealing with nulls. If you have nulls in the column "Value" in both dataframes that you want to join on successfully, you will need to use eqNullSafe equality. This condition would normally keep the "Value" columns from both dataframes in the output dataframe, so to explicitly remove one of them, I suggest using aliases on the dataframes.
from pyspark.sql import functions as F
df1 = spark.createDataFrame([('apple;banana', 150), (None, 20)], ['Value', 'Amount'])
df2 = spark.createDataFrame([('apple',), ('orange',), (None,)], ['Value'])
df3 = df1.withColumn('Value', F.explode(F.coalesce(F.split('Value', ';'), F.array(F.lit(None)))))
df4 = df2.alias('a').join(
    df3.alias('b'),
    df2.Value.eqNullSafe(df3.Value),
    'left'
).drop(F.col('b.Value'))
df4.show()
# +------+------+
# | Value|Amount|
# +------+------+
# | apple| 150|
# | null| 20|
# |orange| null|
# +------+------+
Use SQL "like" operator in left outer join.
Try this
//Input
spark.sql(" select 'apple;banana' value, 150 amount union all select 'carrot', 50 ").createOrReplaceTempView("df1")
spark.sql(" select 'apple' value union all select 'orange' ").createOrReplaceTempView("df2")
//Output
spark.sql("""
select a.value, b.amount
from df2 a
left join df1 b
on ';'||b.value||';' like '%;'||a.value||';%'
""").show(false)
+------+------+
|value |amount|
+------+------+
|apple |150 |
|orange|null |
+------+------+
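If you prefer to stay in the PySpark DataFrame API, a roughly equivalent sketch could look like this (the sample dataframes here are an assumption that mirrors the temp views above):
from pyspark.sql import functions as F
df1 = spark.createDataFrame([('apple;banana', 150), ('carrot', 50)], ['value', 'amount'])
df2 = spark.createDataFrame([('apple',), ('orange',)], ['value'])
# wrap both sides in ';' delimiters and use the same LIKE condition as the join predicate
df3 = df2.alias('a').join(
    df1.alias('b'),
    F.expr("concat(';', b.value, ';') like concat('%;', a.value, ';%')"),
    'left'
).select('a.value', 'b.amount')
df3.show()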
I have two different dataframes in PySpark, both of string type. The first dataframe contains single words, while the second contains strings of words, i.e. sentences. I have to check for the existence of each value from the first dataframe's column in the second dataframe's column. For example,
df2
+---+------+------+-----------------+
|age|height|  name|        Sentences|
+---+------+------+-----------------+
| 10|    80| Alice|   'Grace, Sarah'|
| 15|  null|   Bob|          'Sarah'|
| 12|  null|   Tom|'Amy, Sarah, Bob'|
| 13|  null|Rachel|       'Tom, Bob'|
+---+------+------+-----------------+
Second dataframe
df1
+-------+
|  token|
+-------+
|  'Ali'|
|'Sarah'|
|  'Bob'|
|  'Bob'|
+-------+
So, how can I search for each token of df1 in the df2 Sentences column? I need a count for each word, added as a new column in df1.
I have tried this solution, but it only works for a single word, i.e. not for a complete dataframe column.
Considering the dataframes in the previous answer:
from pyspark.sql.functions import explode, split, length, trim

# split the sentences into individual names and trim the whitespace
df3 = df2.select('Sentences', explode(split('Sentences', ',')).alias('friends'))
df3 = df3.withColumn("friends", trim("friends")).withColumn("length_of_friends", length("friends"))
display(df3)

# join on the tokens and count the occurrences per name
df3 = df3.join(df1, df1.token == df3.friends, how='inner').groupby('friends').count()
display(df3)
You could use a PySpark udf to create the new column in df1.
The problem is that you cannot access a second dataframe inside a udf (see here).
As advised in the referenced question, you can pass the sentences as a broadcast variable.
Here is a working example:
from pyspark.sql.types import *
from pyspark.sql.functions import udf
# Instantiate df2
cols = ["age", "height", "name", "Sentences"]
data = [
    (10, 80, "Alice", "Grace, Sarah"),
    (15, None, "Bob", "Sarah"),
    (12, None, "Tom", "Amy, Sarah, Bob"),
    (13, None, "Rachel", "Tom, Bob")
]
df2 = spark.createDataFrame(data).toDF(*cols)

# Instantiate df1
cols = ["token"]
data = [
    ("Ali",),
    ("Sarah",),
    ("Bob",),
    ("Bob",)
]
df1 = spark.createDataFrame(data).toDF(*cols)
# Creating broadcast variable for the Sentences column of df2
lstSentences = [data[0] for data in df2.select('Sentences').collect()]
sentences = spark.sparkContext.broadcast(lstSentences)

def countWordInSentence(word):
    # Count the sentences (read from the broadcast variable) that contain the word
    return sum(1 for item in sentences.value if word in item)

func_udf = udf(countWordInSentence, IntegerType())
df1 = df1.withColumn("COUNT", func_udf(df1["token"]))
df1.show()
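With the sample data above, Ali does not appear in any sentence, Sarah appears in three and Bob in two, so the output of df1.show() should come out along these lines:
# +-----+-----+
# |token|COUNT|
# +-----+-----+
# |  Ali|    0|
# |Sarah|    3|
# |  Bob|    2|
# |  Bob|    2|
# +-----+-----+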
I created a synthetic dataset and I am trying to experiment with repartitioning based on one column. The objective is to end up with a balanced (equal-size) set of partitions, but I cannot achieve this. Is there a way it could be done, preferably without resorting to RDDs and without saving the dataframe?
Example code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f
import pandas as pd
import random

spark = SparkSession.builder.appName('learn').getOrCreate()
nr = 500
data = {'id': [random.randint(0,5) for _ in range(nr)], 'id2': [random.randint(0,5) for _ in range(nr)]}
data = pd.DataFrame(data)
df = spark.createDataFrame(data)
# df.show()
df = df.repartition(3, 'id')
# see the different partitions
for ipart in range(3):
    print(f'partition {ipart}')

    def fpart(partition_idx, iterator, target_partition_idx=ipart):
        if partition_idx == target_partition_idx:
            return iterator
        else:
            return iter(())

    res = df.rdd.mapPartitionsWithIndex(fpart)
    res = res.toDF(schema=df.schema)
    # res.show(n=5, truncate=False)
    print(f"number of rows {res.count()}, unique ids {res.select('id').drop_duplicates().toPandas()['id'].tolist()}")
It produces:
partition 0
number of rows 79, unique ids [3]
partition 1
number of rows 82, unique ids [0]
partition 2
number of rows 339, unique ids [5, 1, 2, 4]
so the partitions are clearly not balanced.
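As a side check, a quicker way to inspect the partition sizes (a small sketch, not part of the original code) is to group by the built-in partition id:
df.groupBy(f.spark_partition_id().alias('partition_id')).count().show()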
I saw in How to guarantee repartitioning in Spark Dataframe that this is expected, because the assignment of rows to partitions is based on the hash of the column id modulo 3 (the number of partitions):
df.select('id', f.expr("hash(id)"), f.expr("pmod(hash(id), 3)")).drop_duplicates().show()
that produces
+---+-----------+-----------------+
| id| hash(id)|pmod(hash(id), 3)|
+---+-----------+-----------------+
| 3| 519220707| 0|
| 0|-1670924195| 1|
| 1|-1712319331| 2|
| 5| 1607884268| 2|
| 4| 1344313940| 2|
| 2| -797927272| 2|
+---+-----------+-----------------+
but I find this strange. The point of specifying the column in the repartition function is to split the values of id across different partitions. If the column id had more than the 6 unique values of this example the distribution would be better, but still not guaranteed to be balanced.
Is there a way to achieve this?
I have the below Spark DataFrame.
I want to promote Row 1 to the column headings, so that the new Spark DataFrame would be:
I know this can be done in pandas easily as:
new_header = pandaDF.iloc[0]
pandaDF = pandaDF[1:]
pandaDF.columns = new_header
But I don't want to convert to a pandas DF, as I have to persist this into a database, which would mean converting the pandas DF back to a Spark DF, registering it as a table and then writing it to the db.
Try with .toDF and filter out the rows containing the column values.
Example:
#sample dataframe
df.show()
#+----------+------------+----------+
#| prop_0| prop_1| prop_2|
#+----------+------------+----------+
#|station_id|station_name|sample_num|
#| 101| Station101| Sample101|
#| 102| Station102| Sample102|
#+----------+------------+----------+
from pyspark.sql.functions import *
# take the values of the first row to use as the new column names
cols = [str(c) for c in df.first()]
df.toDF(*cols).filter(~col("station_id").isin(*cols)).show()
#+----------+------------+----------+
#|station_id|station_name|sample_num|
#+----------+------------+----------+
#| 101| Station101| Sample101|
#| 102| Station102| Sample102|
#+----------+------------+----------+
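For completeness, a minimal end-to-end sketch of the same idea (the sample data here is an assumption mirroring the dataframe shown above):
from pyspark.sql.functions import col
df = spark.createDataFrame(
    [('station_id', 'station_name', 'sample_num'),
     ('101', 'Station101', 'Sample101'),
     ('102', 'Station102', 'Sample102')],
    ['prop_0', 'prop_1', 'prop_2'])
header = [str(c) for c in df.first()]  # values of the first row become the new column names
df.toDF(*header).filter(~col(header[0]).isin(header)).show()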
I have a dataframe named DF like this
Dataframe DF
I have the below code
from pyspark.sql import Row

def func(row):
    temp = row.asDict()
    temp["concat_val"] = "|".join([str(x) for x in row])
    put = Row(**temp)
    return put

DF.show()
row_rdd = DF.rdd.map(func)
concat_df = row_rdd.toDF()
concat_df.show()
I am getting a result like this
However, I want an output which removes the id and nm column values from the concat_val column.
The table should look like below:
Please suggest a way to remove the id and nm values.
So here you are trying to concat the columns txt and uppertxt, with the values delimited by "|". You can try the below code.
# Load required libraries
from pyspark.sql.functions import *
# Create DataFrame
df = spark.createDataFrame([(1,"a","foo","qwe"), (2,"b","bar","poi"), (3,"c","mnc","qwe")], ["id", "nm", "txt", "uppertxt"])
# Concat column txt and uppertxt delimited by "|"
# Approach - 1 : using concat function.
df1 = df.withColumn("concat_val", concat(df["txt"] , lit("|"), df["uppertxt"]))
# Approach - 2 : Using concat_ws function
df1 = df.withColumn("concat_val", concat_ws("|", df["txt"] , df["uppertxt"]))
# Display Output
df1.show()
Output
+---+---+---+--------+----------+
| id| nm|txt|uppertxt|concat_val|
+---+---+---+--------+----------+
| 1| a|foo| qwe| foo|qwe|
| 2| b|bar| poi| bar|poi|
| 3| c|mnc| qwe| mnc|qwe|
+---+---+---+--------+----------+
You can find more info on concat and concat_ws in the Spark docs.
I hope this helps.