I have a dataframe with the following format:
+----+---------+
| id | values  |
+----+---------+
| 1  | [1,2,3] |
| 2  | [1,2,3] |
| 3  | [1,3]   |
| 4  | [1,2,8] |
+----+---------+
.
.
.
I want to filter and keep the rows where the length of the list in the values column is greater than or equal to 3. Assuming the dataframe is called df, I am doing the following:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

udf_filter = udf(lambda value: len(value) >= 3, BooleanType())
filtered_data = df.filter(udf_filter("values"))
When I run:
filtered_data.count()
It always gives a different result. How is that possible?
Notes:
df comes from another dataframe by sampling it (same seed)
df.count() always gives the same number
Edit:
I am using the following code to take the sample from the original table:
df = df_original.sample(False, 0.01, 42)
Even though I am using seed=42, running it multiple times does not give the same results. To avoid that I persist df, and then it always gives the same results:
df.persist()
But what I don't understand is why the seed does not give the same sample rows. What could be the reason for that?
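For reference, the same length filter can also be written with the built-in size function, which avoids the UDF, and persisting the sample pins the rows down so repeated actions agree (a minimal sketch based on the code above):
from pyspark.sql import functions as F

#persist the sampled dataframe so repeated actions see the same rows
df = df_original.sample(False, 0.01, 42).persist()

#keep rows whose values array has at least 3 elements
filtered_data = df.filter(F.size("values") >= 3)
filtered_data.count()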
I'm using the Spark DataFrame API.
I'm trying to give sum() a list parameter containing column names as strings.
When I put the column names directly into the function, the script works.
When I try to provide them to the function as a parameter of type list, I get the error:
"py4j.protocol.Py4JJavaError: An error occurred while calling o155.sum.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String"
Using the same kind of list parameter for groupBy() works.
This is my script:
groupBy_cols = ['date_expense_int', 'customer_id']
agged_cols_list = ['total_customer_exp_last_m','total_customer_exp_last_3m']
df = df.groupBy(groupBy_cols).sum(agged_cols_list)
When I write it like this, it works:
df = df.groupBy(groupBy_cols).sum('total_customer_exp_last_m','total_customer_exp_last_3m')
I also tried to give sum() a list of Column objects by using
agged_cols_list2 = []
for i in agged_cols_list:
    agged_cols_list2.append(col(i))
but that also didn't work.
Unpack your list using the asterisk notation:
df = df.groupBy(groupBy_cols).sum(*agged_cols_list)
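The reason is that sum() expects each column name as a separate positional argument; a bare Python list is handed to the JVM as one object that cannot be cast to a column name string, hence the ClassCastException. A minimal self-contained sketch (with made-up data) to illustrate:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#hypothetical data matching the column names from the question
data = [(20200101, 1, 10.0, 30.0), (20200101, 1, 5.0, 15.0)]
cols = ['date_expense_int', 'customer_id',
        'total_customer_exp_last_m', 'total_customer_exp_last_3m']
df = spark.createDataFrame(data, cols)

groupBy_cols = ['date_expense_int', 'customer_id']
agged_cols_list = ['total_customer_exp_last_m', 'total_customer_exp_last_3m']

#the * expands the list into separate string arguments:
#.sum('total_customer_exp_last_m', 'total_customer_exp_last_3m')
df.groupBy(groupBy_cols).sum(*agged_cols_list).show()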
If you have a df like below and want to sum a list of fields
df.show(5,truncate=False)
+---+---------+----+
|id |subject |mark|
+---+---------+----+
|100|English |45 |
|100|Maths |63 |
|100|Physics |40 |
|100|Chemistry|94 |
|100|Biology |74 |
+---+---------+----+
only showing top 5 rows
from pyspark.sql.functions import sum, col
agged_cols_list = ['subject', 'mark']
df.groupBy("id").agg(*[sum(col(c)) for c in agged_cols_list]).show(5, truncate=False)
+---+------------+---------+
|id |sum(subject)|sum(mark)|
+---+------------+---------+
|125|null |330.0 |
|124|null |332.0 |
|155|null |304.0 |
|132|null |382.0 |
|154|null |300.0 |
+---+------------+---------+
Note that sum(subject) becomes null as it is a string column.
In this case you may want to apply count to subject and sum to mark. So you can use a dictionary
summary={ "subject":"count","mark":"sum" }
df.groupBy("id").agg(summary).show(5,truncate=False)
+---+--------------+---------+
|id |count(subject)|sum(mark)|
+---+--------------+---------+
|125|5 |330.0 |
|124|5 |332.0 |
|155|5 |304.0 |
|132|5 |382.0 |
|154|5 |300.0 |
+---+--------------+---------+
only showing top 5 rows
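If the generated column names such as count(subject) are inconvenient downstream, a small variation (the alias names here are made up) uses explicit aggregation expressions so you can rename them; this continues the df from the example above:
from pyspark.sql import functions as F

df.groupBy("id").agg(
    F.count("subject").alias("n_subjects"),  #number of subject rows per id
    F.sum("mark").alias("total_mark")        #sum of marks per id
).show(5, truncate=False)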
I am working with a dataframe that contains a column whose values are lists:
id | values
1 | ['good','good','good','bad','bad','good','good']
2 | ['bad','badd','good','bad',Null,'good','bad']
....
How can I get the most frequently occurring string in each list?
expected output:
id | most_frequent
1 | 'good'
2 | 'bad'
....
I don't see a reason to explode and groupBy here (compute-intensive shuffle operations), as with Spark 2.4+ we can use higher-order functions to get your desired output:
from pyspark.sql import functions as F
df\
.withColumn("most_common", F.expr("""sort_array(transform(array_distinct(values),\
x-> array(aggregate(values, 0,(acc,t)->acc+IF(t=x,1,0)),x)),False)[0][1]"""))\
.show(truncate=False)
#+---+----------------------------------------+-----------+
#|id |values |most_common|
#+---+----------------------------------------+-----------+
#|1 |[good, good, good, bad, bad, good, good]|good |
#|2 |[bad, badd, good, bad,, good, bad] |bad |
#+---+----------------------------------------+-----------+
We can also use array_max instead of sort_array.
from pyspark.sql import functions as F
df\
.withColumn("most_common", F.expr("""array_max(transform(array_distinct(values),\
x-> array(aggregate(values, 0,(acc,t)->acc+IF(t=x,1,0)),x)))[1]"""))\
.show(truncate=False)
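For readability, here is the same expression with the example data recreated and a comment per step. Note that the count is implicitly cast to string inside the [count, element] pair, so array_max compares the pairs lexicographically, which is fine for small counts like these:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, ['good', 'good', 'good', 'bad', 'bad', 'good', 'good']),
     (2, ['bad', 'badd', 'good', 'bad', None, 'good', 'bad'])],
    ['id', 'values'])

#array_distinct(values)               -> the unique elements of the list
#aggregate(values, 0, (acc,t) -> ...) -> counts how often element x occurs in the list
#transform(..., x -> array(count, x)) -> builds a [count, element] pair per distinct element
#array_max(...)[1]                    -> takes the greatest pair and returns its element
most_common = F.expr(
    "array_max(transform(array_distinct(values), "
    "x -> array(aggregate(values, 0, (acc, t) -> acc + IF(t = x, 1, 0)), x)))[1]")

df.withColumn("most_common", most_common).show(truncate=False)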
I'm trying to split my dataframe depending on the number of nodes in my cluster.
For example, with nodes=2 and dataframe.count()=7, I want the rows distributed into 2 roughly equal parts so I can process them with an iterative approach.
My question is: how can I do this?
You can do that (have a look at the code below) with one of the RDD partition functions, but I don't recommend it unless you are fully aware of what you are doing and why you are doing it. In general (or better: for most use cases) it is preferable to let Spark handle the data distribution.
import pyspark.sql.functions as F
import itertools
import math
#creating a random dataframe
l = [(x,x+2) for x in range(1009)]
columns = ['one', 'two']
df=spark.createDataFrame(l, columns)
#coalesce to one partition so we can assign a partition key to each row
df = df.coalesce(1)
#number of nodes (==partitions)
pCount = 5
#creating a list of partition keys
#basically it repeats range(5) several times until we have enough keys for each row
partitionKey = list(itertools.chain.from_iterable(itertools.repeat(x, math.ceil(df.count()/pCount)) for x in range(pCount)))
#now we can distribute the data to the partitions
df = df.rdd.partitionBy(pCount, partitionFunc = lambda x: partitionKey.pop()).toDF()
#This shows us the number of records within each partition
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()
Output:
+------------+-----+
|partition_id|count|
+------------+-----+
| 1| 202|
| 3| 202|
| 4| 202|
| 2| 202|
| 0| 201|
+------------+-----+
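If the goal is only to get one roughly equal slice per node and the exact key-to-partition mapping does not matter, a simpler sketch (applied to the df built at the top of the example, reusing the F import) keeps the distribution in Spark's hands:
#let Spark distribute the rows over 5 partitions
df = df.repartition(5)

#check how many records ended up in each partition
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()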
I have a dataframe df that contains a column of strings like so:
+---------+
| Products|
+---------+
| Z9L57.W3|
| H9L23.05|
| PRL57.AF|
+---------+
I would like to truncate each string at the '.' character so that it looks like:
+--------------+
|Products_trunc|
+--------------+
| Z9L57        |
| H9L23        |
| PRL57        |
+--------------+
I tried using the split function, but it only works for a single string and not lists.
I also tried
df['Products_trunc'] = df['Products'].str.split('.').str[0]
but I am getting the following error:
TypeError: 'Column' object is not callable
Does anyone have any insights into this?
Thank You
Your code looks like you are used to pandas. Truncating in PySpark works a bit differently. Have a look below:
from pyspark.sql import functions as F
l = [
( 'Z9L57.W3' , ),
( 'H9L23.05' ,),
( 'PRL57.AF' ,)
]
columns = ['Products']
df=spark.createDataFrame(l, columns)
The withColumn function allows you to modify an existing column or create a new one. The function takes 2 parameters: a column name and a column expression. You modify a column when the column name already exists.
df = df.withColumn('Products', F.split(df.Products, r'\.').getItem(0))
df.show()
Output:
+--------+
|Products|
+--------+
| Z9L57|
| H9L23|
| PRL57|
+--------+
You create a new column when you choose a column name that does not exist yet.
df = df.withColumn('Products_trunc', F.split(df.Products, r'\.').getItem(0))
df.show()
Output:
+--------+--------------+
|Products|Products_trunc|
+--------+--------------+
|Z9L57.W3| Z9L57|
|H9L23.05| H9L23|
|PRL57.AF| PRL57|
+--------+--------------+
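An equivalent sketch uses substring_index, which treats the delimiter literally and so avoids the regex escape (starting from the original df with the full product codes):
from pyspark.sql import functions as F

#keep everything before the first '.' in each product code
df = df.withColumn('Products_trunc', F.substring_index(df.Products, '.', 1))
df.show()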
I have a raw PySpark dataframe with a nested (encapsulated) column. I need to loop over all the columns to unwrap them. I don't know the column names and they could change, so I need a generic algorithm. The problem is that I can't use a classic for loop, because I need parallelized code.
Example of Data:
Timestamp | Layers
1456982 | [[1, 2],[3,4]]
1486542 | [[3,5], [5,5]]
Layers is a column which contains other columns (with their own column names). My goal is to have something like this:
Timestamp | label | number1 | text | value
1456982   | 1     | 2       | 3    | 4
1486542   | 3     | 5       | 5    | 5
How can I make a loop over the columns with PySpark functions?
Thanks for the advice.
You can use the reduce function for this. I don't know exactly what you want to do, but let's suppose you want to add 1 to all columns:
from functools import reduce
from pyspark.sql import functions as F
def add_1(df, col_name):
    return df.withColumn(col_name, F.col(col_name) + 1)  # using the same column name will update the column

df = reduce(add_1, df.columns, df)
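As a self-contained usage sketch (with made-up column names a and b), reduce simply chains the helper once per column name:
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sample = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])

def add_1(df, col_name):
    return df.withColumn(col_name, F.col(col_name) + 1)

#equivalent to add_1(add_1(sample, 'a'), 'b')
reduce(add_1, sample.columns, sample).show()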
Edit:
I am not sure how to solve it without converting to an RDD. Maybe this can be helpful:
from pyspark.sql import Row
#column names for the unwrapped fields, taken from the expected output above
col_names = ['label', 'number1', 'text', 'value']
#flatten the list of lists in the Layers column into a single list
flatF = lambda col: [item for sublist in col for item in sublist]
df \
    .rdd \
    .map(lambda row: Row(Timestamp=row['Timestamp'],
                         **dict(zip(col_names, flatF(row['Layers']))))) \
    .toDF()