Pyspark parallelized loop of dataframe column - python-3.x

I have a raw PySpark DataFrame with an encapsulated column. I need to loop over all columns to unwrap them. I don't know the column names and they could change, so I need a generic algorithm. The problem is that I can't use a classic for loop, because I need the code to run in parallel.
Example of Data:
Timestamp | Layers
1456982 | [[1, 2], [3, 4]]
1486542 | [[3, 5], [5, 5]]
Layers is a column that contains other columns (with their own column names). My goal is to end up with something like this:
Timestamp | label | number1 | text | value
1456982 | 1 | 2 | 3 | 4
1486542 | 3 | 5 | 5 | 5
How can I loop over the columns with PySpark functions?
Thanks for any advice.

You can use the reduce function for this. I don't know exactly what you want to do, but let's suppose you want to add 1 to all columns:
from functools import reduce
from pyspark.sql import functions as F
def add_1(df, col_name):
    return df.withColumn(col_name, F.col(col_name) + 1)  # reusing the same column name updates the column in place
reduce(add_1, df.columns, df)
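As a quick illustration, here is a minimal sketch of the same pattern on made-up data (the column names are arbitrary, not from the question):
# hypothetical two-column DataFrame, just to show the reduce pattern
df = spark.createDataFrame([(1, 10), (2, 20)], ['a', 'b'])
reduce(add_1, df.columns, df).show()
#+---+---+
#|  a|  b|
#+---+---+
#|  2| 11|
#|  3| 21|
#+---+---+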
Edit:
I am not sure how to solve it without converting to an RDD. Maybe this can be helpful:
from pyspark.sql import Row
# col_names: the list of unwrapped column names, known in advance
flatF = lambda col: [item for l in col for item in l]  # flatten one level of nesting
df \
    .rdd \
    .map(lambda row: Row(timestamp=row['timestamp'],
                         **dict(zip(col_names, flatF(row['layers']))))) \
    .toDF()
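If Layers is actually a struct column (the question suggests the nested values have their own column names), a DataFrame-only sketch that avoids the RDD round trip may be enough; the schema below is an assumption, adjust it to the real one:
# assumes layers is a struct column, e.g. struct<label:int, number1:int, text:int, value:int>
# (hypothetical field names)
df.select('timestamp', 'layers.*').show()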

Related

In Pyspark get most frequent string from a column with list of strings

I am working with a dataframe which contains a column whose values are lists:
id | values
1 | ['good','good','good','bad','bad','good','good']
2 | ['bad','badd','good','bad',Null,'good','bad']
....
How could I get the most frequently occurring string in each list?
expected output:
id | most_frequent
1 | 'good'
2 | 'bad'
....
I don't see a reason to explode and groupBy here (compute-intensive shuffle operations); with Spark 2.4+ we can use higher-order functions to get your desired output:
from pyspark.sql import functions as F
df \
    .withColumn("most_common", F.expr("""sort_array(transform(array_distinct(values),
        x -> array(aggregate(values, 0, (acc, t) -> acc + IF(t = x, 1, 0)), x)), False)[0][1]""")) \
    .show(truncate=False)
#+---+----------------------------------------+-----------+
#|id |values                                  |most_common|
#+---+----------------------------------------+-----------+
#|1  |[good, good, good, bad, bad, good, good]|good       |
#|2  |[bad, badd, good, bad,, good, bad]      |bad        |
#+---+----------------------------------------+-----------+
We can also use array_max instead of sort_array.
from pyspark.sql import functions as F
df \
    .withColumn("most_common", F.expr("""array_max(transform(array_distinct(values),
        x -> array(aggregate(values, 0, (acc, t) -> acc + IF(t = x, 1, 0)), x)))[1]""")) \
    .show(truncate=False)
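For completeness, on Spark versions before 2.4 (no higher-order functions), a sketch of the explode/groupBy route mentioned above could look like this, at the cost of a shuffle:
from pyspark.sql import functions as F
from pyspark.sql import Window

# explode the lists, count occurrences per (id, value), keep the top value per id
counts = df.withColumn('val', F.explode('values')).groupBy('id', 'val').count()
w = Window.partitionBy('id').orderBy(F.desc('count'))
counts.withColumn('rn', F.row_number().over(w)) \
    .filter('rn = 1') \
    .select('id', F.col('val').alias('most_frequent')) \
    .show()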

split my dataframe depending on the number of nodes pyspark

I'm trying to split my dataframe depending on the number of nodes (of my cluster).
My dataframe looks like this:
If I had node=2 and dataframe.count=7,
then, to apply an iterative approach, the result of the split would be:
My question is: how can I do this?
You can do that (have a look at the code below) with one of the RDD partition functions, but I don't recommend it unless you are fully aware of what you are doing and why you are doing it. In general (or rather, for most use cases) it is better to let Spark handle the data distribution.
import pyspark.sql.functions as F
import itertools
import math
#creating a random dataframe
l = [(x,x+2) for x in range(1009)]
columns = ['one', 'two']
df=spark.createDataFrame(l, columns)
#coalesce to one partition so we can assign a partition key to each row
df = df.coalesce(1)
#number of nodes (==partitions)
pCount = 5
#creating a list of partition keys
#basically it repeats range(5) several times until we have enough keys for each row
partitionKey = list(itertools.chain.from_iterable(itertools.repeat(x, math.ceil(df.count()/pCount)) for x in range(pCount)))
#now we can distribute the data to the partitions
df = df.rdd.partitionBy(pCount, partitionFunc = lambda x: partitionKey.pop()).toDF()
#This shows us the number of records within each partition
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()
Output:
+------------+-----+
|partition_id|count|
+------------+-----+
| 1| 202|
| 3| 202|
| 4| 202|
| 2| 202|
| 0| 201|
+------------+-----+
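If an exact, hand-built assignment is not strictly required, a simpler sketch that lets Spark distribute the rows (as recommended above) is to repartition:
# round-robin repartitioning spreads the rows roughly evenly across 5 partitions
df = df.repartition(5)
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()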

truncating all strings in a dataframe column after a specific character using pyspark

I have a dataframe df that contains a column of strings like so:
+--------+
|Products|
+--------+
|Z9L57.W3|
|H9L23.05|
|PRL57.AF|
+--------+
I would like to truncate each string after the '.' character so that it looks like:
+--------------+
|Products_trunc|
+--------------+
|         Z9L57|
|         H9L23|
|         PRL57|
+--------------+
I tried using the split function, but it only works for a single string and not lists.
I also tried
df['Products_trunc'] = df['Products'].str.split('.').str[0]
but I am getting the following error:
TypeError: 'Column' object is not callable
Does anyone have any insights into this?
Thank You
Your code looks as if you are used to pandas. Truncating in PySpark works a bit differently. Have a look below:
from pyspark.sql import functions as F
l = [
    ('Z9L57.W3',),
    ('H9L23.05',),
    ('PRL57.AF',)
]
columns = ['Products']
df = spark.createDataFrame(l, columns)
The withColumn function allows you to modify an existing column or create a new one. It takes two parameters: the column name and the column expression. You modify a column when the column name already exists.
df = df.withColumn('Products', F.split(df.Products, r'\.').getItem(0))
df.show()
Output:
+--------+
|Products|
+--------+
| Z9L57|
| H9L23|
| PRL57|
+--------+
You create a new column when you choose a column name that does not yet exist.
df = df.withColumn('Products_trunc', F.split(df.Products, r'\.').getItem(0))
df.show()
Output:
+--------+--------------+
|Products|Products_trunc|
+--------+--------------+
|Z9L57.W3| Z9L57|
|H9L23.05| H9L23|
|PRL57.AF| PRL57|
+--------+--------------+
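If you would rather avoid the regex escape in split, a sketch using substring_index (which keeps everything before the first '.') should give the same result:
df = df.withColumn('Products_trunc', F.substring_index(df.Products, '.', 1))
df.show()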

Pyspark substring of one column based on the length of another column

Using Pyspark 2.2
I have a spark DataFrame with multiple columns. I need to input 2 columns to a UDF and return a 3rd column
Input:
+-----+------+
|col_A| col_B|
+-----+------+
| abc|abcdef|
| abc| a|
+-----+------+
Both col_A and col_B are StringType()
Desired output:
+-----+------+-------+
|col_A| col_B|new_col|
+-----+------+-------+
| abc|abcdef| abc|
| abc| a| a|
+-----+------+-------+
I want new_col to be a substring of col_A with the length of col_B.
I tried
udf_substring = F.udf(lambda x: F.substring(x[0],0,F.length(x[1])), StringType())
df.withColumn('new_col', udf_substring([F.col('col_A'),F.col('col_B')])).show()
But it gives the TypeError: Column is not iterable.
Any idea how to do such manipulation?
There are two major things wrong here.
First, you defined your udf to take in one input parameter when it should take 2.
Secondly, you can't use the DataFrame API functions within the udf. (Calling the udf serializes the data to Python, so you need to use plain Python syntax and functions.)
Here's a proper udf implementation for this problem:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def my_substring(a, b):
    # You should add in your own error checking
    return a[:len(b)]

udf_substring = F.udf(lambda x, y: my_substring(x, y), StringType())
And then call it by passing in the two columns as arguments:
df.withColumn('new_col', udf_substring(F.col('col_A'),F.col('col_B')))
However, in this case you can do this without a udf using the method described in this post.
df.withColumn(
    'new_col',
    F.expr("substring(col_A, 0, length(col_B))")
)
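If you prefer to stay in the Python column API instead of an SQL expression string, Column.substr also accepts Column arguments, so a udf-free sketch could be:
df.withColumn(
    'new_col',
    F.col('col_A').substr(F.lit(1), F.length('col_B'))
).show()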

Using Filter in spark dataframes (python)

I have a dataframe with the following format:
+------+--------+
| id | values |
+------+--------+
| 1 |[1,2,3] |
+------+--------+
| 2 |[1,2,3] |
+------+--------+
| 3 |[1,3] |
+------+--------+
| 4 |[1,2,8] |
.
.
.
And I want to filter and keep the rows where the length of the list in the values column is equal to or greater than 3. Assuming that the dataframe is called df, I am doing the following:
udf_filter = udf(lambda value: len(alist)>=3,BooleanType())
filtered_data = df.filter(udf_filter("values"))
When I run:
filtered_data.count()
It always gives a different result. How can that be possible?
Notes:
df comes from another dataframe by sampling it (same seed)
df.count always gives the same number
Edit:
I am using the following code to take the sample from the original table:
df = df_original.sample(False, 0.01, 42)
Even though I am using seed=42, running it multiple times does not give the same results. To avoid that, I persist the df, and then it always gives the same results:
df.persist()
But what I don't understand is why the seed doesn't give the same sample rows. What could be the reason for that?
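As a side note, the lambda above references alist instead of its value argument, and a udf is not needed for this check at all. A minimal sketch using the built-in size function, with the sample persisted so it is only evaluated once, could look like this:
from pyspark.sql import functions as F

df = df_original.sample(False, 0.01, 42).persist()  # materialize the sample once
filtered_data = df.filter(F.size('values') >= 3)
filtered_data.count()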
