split my dataframe depending on the number of nodes pyspark - python-3.x

I'm trying to split my dataframe depending on the number of nodes (of my cluster),
my dataframe looks like :
If i had node=2, and dataframe.count=7 :
So, to apply an iterative approach the result of split will be :
My question is : how can i do this ?

You can do that (have a look at the code below) with one of the rdd partition functions, but I don't recommend it as
long as you are not fully aware of what you are doing and the reason why you are doing this. In general (or better for most usecase) it is better to let spark handle the data distribution.
import pyspark.sql.functions as F
import itertools
import math
#creating a random dataframe
l = [(x,x+2) for x in range(1009)]
columns = ['one', 'two']
df=spark.createDataFrame(l, columns)
#create on partition to asign a partition key
df = df.coalesce(1)
#number of nodes (==partitions)
pCount = 5
#creating a list of partition keys
#basically it repeats range(5) several times until we have enough keys for each row
partitionKey = list(itertools.chain.from_iterable(itertools.repeat(x, math.ceil(df.count()/pCount)) for x in range(pCount)))
#now we can distribute the data to the partitions
df = df.rdd.partitionBy(pCount, partitionFunc = lambda x: partitionKey.pop()).toDF()
#This shows us the number of records within each partition
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()
Output:
+------------+-----+
|partition_id|count|
+------------+-----+
| 1| 202|
| 3| 202|
| 4| 202|
| 2| 202|
| 0| 201|
+------------+-----+

Related

Pyspark replace string in every column name

I am converting Pandas commands into Spark ones. I bumped into wanting to convert this line into Apache Spark code:
This line replaces every two spaces into one.
df = df.columns.str.replace(' ', ' ')
Is it possible to replace a string from all columns using Spark?
I came into this, but it is not quite right.
df = df.withColumnRenamed('--', '-')
To be clear I want this
//+---+----------------------+-----+
//|id |address__test |state|
//+---+----------------------+-----+
to this
//+---+----------------------+-----+
//|id |address_test |state|
//+---+----------------------+-----+
You can apply the replace method on all columns by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
| 1| 2| 3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
| 1| 2| 3|
+---+------------+-----+
On the sidenote: calling withColumnRenamed makes Spark create a Projection for each distinct call, while a select makes just single Projection, hence for large number of columns, select will be much faster.
Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
new_column = column.replace("__", "_")
df = df.withColumnRenamed(column, new_column)
Would this solve your issue?

Create PySpark dataframe with timeseries column

I have an initial PySpark dataframe from which I would like to take the MIN and MAX from a date column and then create a new PySpark dataframe with a timeseries (daily date), using the MIN and MAX from my initial dataframe.
I will use it to then join with my initial dataframe and find missing days (null in the rest of the column of my inital DF).
I tried in many different ways to build the timeseries DF, but it doesn't seem to work in PySpark. Any suggestions?
Max column's value can be extracted like this:
df.agg(F.max('col_name')).head()[0]
Date range df can be created like this:
df2 = spark.sql("SELECT sequence(to_date('2000-01-01'), to_date('2000-02-02'), interval 1 day) as date_col").withColumn('date_col', F.explode('date_col'))
And then join.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, '2022-04-01'),(2, '2022-04-05')], ['id', 'df1_date']).select('id', F.col('df1_date').cast('date'))
df1.show()
# +---+----------+
# | id| df1_date|
# +---+----------+
# | 1|2022-04-01|
# | 2|2022-04-05|
# +---+----------+
min_date = df1.agg(F.min('df1_date')).head()[0]
max_date = df1.agg(F.max('df1_date')).head()[0]
df2 = spark.sql(f"SELECT sequence(to_date('{min_date}'), to_date('{max_date}'), interval 1 day) as df2_date").withColumn('df2_date', F.explode('df2_date'))
df3 = df2.join(df1, df1.df1_date == df2.df2_date, 'left')
df3.show()
# +----------+----+----------+
# | df2_date| id| df1_date|
# +----------+----+----------+
# |2022-04-01| 1|2022-04-01|
# |2022-04-02|null| null|
# |2022-04-03|null| null|
# |2022-04-04|null| null|
# |2022-04-05| 2|2022-04-05|
# +----------+----+----------+

how can we force dataframe repartitioning to be balanced in spark?

I created a synthetic dataset and I trying to experiment with repartitioning based on a one column. The objective is to end up with a balanced (equal size) number of partitions, but I cannot achieve this. Is there a way it could be done, preferably without resorting to RDDs and saving the dataframe?
Example code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f
spark = SparkSession.builder.appName('learn').getOrCreate()
import pandas as pd
import random
from pyspark.sql.types import *
nr = 500
data = {'id': [random.randint(0,5) for _ in range(nr)], 'id2': [random.randint(0,5) for _ in range(nr)]}
data = pd.DataFrame(data)
df = spark.createDataFrame(data)
# df.show()
df = df.repartition(3, 'id')
# see the different partitions
for ipart in range(3):
print(f'partition {ipart}')
def fpart(partition_idx, iterator, target_partition_idx=ipart):
if partition_idx == target_partition_idx:
return iterator
else:
return iter(())
res = df.rdd.mapPartitionsWithIndex(fpart)
res = res.toDF(schema=schema)
# res.show(n=5, truncate=False)
print(f"number of rows {res.count()}, unique ids {res.select('id').drop_duplicates().toPandas()['id'].tolist()}")
It produces:
partition 0
number of rows 79, unique ids [3]
partition 1
number of rows 82, unique ids [0]
partition 2
number of rows 339, unique ids [5, 1, 2, 4]
so the partitions are clearly not balanced.
I saw in How to guarantee repartitioning in Spark Dataframe that this is explainable because assigning to partitions is based on the hash of column id modulo 3 (the number of partitions):
df.select('id', f.expr("hash(id)"), f.expr("pmod(hash(id), 3)")).drop_duplicates().show()
that produces
+---+-----------+-----------------+
| id| hash(id)|pmod(hash(id), 3)|
+---+-----------+-----------------+
| 3| 519220707| 0|
| 0|-1670924195| 1|
| 1|-1712319331| 2|
| 5| 1607884268| 2|
| 4| 1344313940| 2|
| 2| -797927272| 2|
+---+-----------+-----------------+
but I find this strange. The point of specifying the column in the repartition function is to somehow split the values of id to different partitions. If the column id had more unique values than 6 in this example it would work better, but still.
Is there a way to achieve this?

Pyspark parallelized loop of dataframe column

I have a raw Dataframe pyspark with encapsulate column. I need to loop on all columns to unwrap those columns. I don't know name columns and they could change. So I need generic algorithm. The problem is that I can't use classic loop (for) because I need a paralleled code.
Example of Data:
Timestamp | Layers
1456982 | [[1, 2],[3,4]]
1486542 | [[3,5], [5,5]]
In layers, it's a column which contain other columns (with their own column names). My goal is to have something like this:
Timestamp | label | number1 | text | value
1456982 | 1 | 2 |3 |4
1486542 | 3 | 5 |5 |5
How can I make a loop on columns with pyspark function?
Thanks for advice
You can use reduce function to this. I dont know what you want to do but lets suppose you wanna add 1 to all columns:
from functools import reduce
from pyspark.sql import functions as F
def add_1(df, col_name):
return df.withColumn(col_name, F.col(col_name)+1) # using same column name will update column
reduce(add_1, df.columns, df)
Edit:
I am not sure about solving it without converting rdd. Maybe this can be helpful:
from pyspark.sql import Row
flatF = lambda col: [item for item in l for l in col]
df \
.rdd \
.map(row: Row(timestamp=row['timestamp'],
**dict(zip(col_names, flatF(row['layers']))))) \
.toDF()

Trim in a Pyspark Dataframe

I have a Pyspark dataframe(Original Dataframe) having below data(all columns have string datatype). In my use case i am not sure of what all columns are there in this input dataframe. User just pass me the name of dataframe and ask me to trim all the columns of this dataframe. Data in a typical dataframe looks like as below:
id Value Value1
1 "Text " "Avb"
2 1504 " Test"
3 1 2
Is there anyway i can do it without being dependent on what all columns are present in this dataframe and get all the column trimmed in this dataframe. Data after trimming aall the columns of dataframe should look like.
id Value Value1
1 "Text" "Avb"
2 1504 "Test"
3 1 2
Can someone help me out? How can i achieve it using Pyspark dataframe? Any help will be appreciated.
input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using trim() function in #osbon123's answer.
from pyspark.sql.functions import trim
for c_name in df.columns:
df = df.withColumn(c_name, trim(col(c_name)))
You should avoid using withColumn because it creates a new DataFrame which is time-consuming for very large dataframes. I created the following function based on this solution, but now it works with any dataframe even when it has string and non-string columns.
from pyspark.sql import functions as F
def trim_string_columns(of_data: DataFrame) -> DataFrame:
data_trimmed = of_data.select([
(F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name) for c in of_data.schema
])
return data_trimmed
This is the cleanest (and most computationally efficient) way I've seen it done to trim all spaces in all columns. If you want underscores to replace spaces, simply replace "" with "_".
# Standardize Column names no spaces to underscore
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use dtypes function in DataFrame API to get the list of Cloumn Names along with their Datatypes and then for all string columns use "trim" function to trim the values.
Regards,
Neeraj

Resources