I'm using the Spark DataFrame API.
I'm trying to give sum() a list parameter containing column names as strings.
When I put the column names directly into the function, the script works.
When I try to provide them to the function as a parameter of type list, I get the error:
"py4j.protocol.Py4JJavaError: An error occurred while calling o155.sum.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String"
Using the same kind of list parameter for groupBy() works.
This is my script:
groupBy_cols = ['date_expense_int', 'customer_id']
agged_cols_list = ['total_customer_exp_last_m','total_customer_exp_last_3m']
df = df.groupBy(groupBy_cols).sum(agged_cols_list)
When I write it like this, it works:
df = df.groupBy(groupBy_cols).sum('total_customer_exp_last_m','total_customer_exp_last_3m')
I also tried to give sum() a list of Columns by using
agged_cols_list2 = []
for i in agged_cols_list:
    agged_cols_list2.append(col(i))
That also didn't work.
Unpack your list using the asterisk notation:
df = df.groupBy(groupBy_cols).sum(*agged_cols_list)
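If you also want control over the resulting column names (sum() produces names like sum(total_customer_exp_last_m)), here is a sketch of an equivalent agg form, assuming pyspark.sql.functions is imported as F:
from pyspark.sql import functions as F

# same grouping, but each aggregate gets an explicit alias
df = (df.groupBy(groupBy_cols)
        .agg(*[F.sum(c).alias('sum_' + c) for c in agged_cols_list]))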
If you have a df like the one below and want to sum a list of fields:
df.show(5,truncate=False)
+---+---------+----+
|id |subject |mark|
+---+---------+----+
|100|English |45 |
|100|Maths |63 |
|100|Physics |40 |
|100|Chemistry|94 |
|100|Biology |74 |
+---+---------+----+
only showing top 5 rows
agged_cols_list=['subject', 'mark']
df.groupBy("id").agg(*[sum(col(c)) for c in agged_cols_list]).show(5,truncate=False)
+---+------------+---------+
|id |sum(subject)|sum(mark)|
+---+------------+---------+
|125|null |330.0 |
|124|null |332.0 |
|155|null |304.0 |
|132|null |382.0 |
|154|null |300.0 |
+---+------------+---------+
Note that sum(subject) becomes null, as it is a string column.
In this case you may want to apply count to subject and sum to mark, so you can use a dictionary:
summary = {"subject": "count", "mark": "sum"}
df.groupBy("id").agg(summary).show(5,truncate=False)
+---+--------------+---------+
|id |count(subject)|sum(mark)|
+---+--------------+---------+
|125|5 |330.0 |
|124|5 |332.0 |
|155|5 |304.0 |
|132|5 |382.0 |
|154|5 |300.0 |
+---+--------------+---------+
only showing top 5 rows
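For reference, the dictionary form is shorthand for spelling the aggregates out explicitly; a sketch of the equivalent call, assuming pyspark.sql.functions is imported as F:
from pyspark.sql import functions as F

# count(subject) and sum(mark), written as explicit aggregate expressions
df.groupBy("id").agg(F.count("subject"), F.sum("mark")).show(5, truncate=False)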
Related
I have the following dataframe, extracted with this command:
extract = data.select('properties.id', 'flags')
| id    | flags                   |
|-------|-------------------------|
| v_001 | "{"93":true,"83":true}" |
| v_002 | "{"45":true,"76":true}" |
The desired result is:
| id    | flags |
|-------|-------|
| v_001 | 93    |
| v_001 | 83    |
| v_002 | 45    |
| v_002 | 76    |
I tried to apply explode as follows:
extract = data.select('properties.id', explode(col('flags')))
But I encountered the following:
cannot resolve 'explode(flags)' due to data type mismatch: input to function explode should be array or map type, not struct<93:boolean,83:boolean,45:boolean,76:boolean>
This makes sense as the schema of the column is not compatible with the explode function. How can I adjust the function to get my desired result? Is there a better way to solve this problem?
P.S.: The desired table schema is not the best design, but this is out of my scope since it would involve another discussion.
As you might have already seen, explode requires an array or map type as input, and it seems you only want the keys from the dict in flags.
So, you can first convert flags to MapType and use map_keys to extract all the keys into a list.
df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
This will result in something like this:
+-----+--------+
| id| flags|
+-----+--------+
|v_001|[93, 83]|
|v_002|[45, 76]|
+-----+--------+
Then you can use explode on the flags.
.select('id', F.explode('flags'))
+-----+---+
| id|col|
+-----+---+
|v_001| 93|
|v_001| 83|
|v_002| 45|
|v_002| 76|
+-----+---+
The whole code:
df = (df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
        .select('id', F.explode('flags')))
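A minimal end-to-end sketch with made-up data, assuming flags arrives as a JSON string (as it is printed in the question's table); if it is already a struct, add the to_json step from the update below:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import MapType, StringType, BooleanType

spark = SparkSession.builder.getOrCreate()

# hypothetical reconstruction of the question's data
df = spark.createDataFrame([('v_001', '{"93":true,"83":true}'),
                            ('v_002', '{"45":true,"76":true}')],
                           ['id', 'flags'])

# parse the JSON string to a map, keep only its keys, then explode
(df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
   .select('id', F.explode('flags'))
   .show())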
Update
It is probably better to supply the schema and read flags as MapType in the first place, but if your JSON is complex and it is hard to create the schema, you can convert the struct to a String once and then convert that to MapType.
# Add this line before `from_json`
df = df.select('id', F.to_json('flags').alias('flags'))

# Or you can do it in one shot.
df = (df.withColumn('flags', F.map_keys(F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType()))))
        .select('id', F.explode('flags')))
Below is my sample dataframe for household things.
Here W represents Wooden, G represents Glass, and P represents Plastic, and different items are classified into those categories.
So I want to identify which category (W, G, P) each item falls into. As an initial step, I tried classifying it for Chair.
M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup', ''),
                                ('W-Chair', ''),
                                ('W-Shelf;G-Cup;P-Chair', ''),
                                ('G-Cup;P-ShowerCap;W-Board', '')],
                               ['Household_chores_arrangements', 'Chair'])
M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| |
| W-Chair| |
| W-Shelf;G-Cup;P-Chair| |
| G-Cup;P-ShowerCap;W-Board| |
+-----------------------------+-----+
I tried to do it for one condition where I can mark it as W, but I am not getting the expected results; maybe my condition is wrong.
df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)
Is there a better way to do this in PySpark?
Expected output
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| W|
| W-Chair| W|
| W-Shelf;G-Cup;P-Chair| P|
| G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+
Thanks @mck for the solution.
Update
In addition to that, I was trying to explore the regexp_extract option further, so I altered the sample set:
M = sqlContext.createDataFrame([('Wooden|Chair', ''),
                                ('Wooden|Cup;Glass|Chair', ''),
                                ('Wooden|Cup;Glass|Showercap;Plastic|Chair', '')],
                               ['Household_chores_arrangements', 'Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair
    from M
""")
display(df)
Result:
+----------------------------------------+------+
|Household_chores_arrangements           |Chair |
+----------------------------------------+------+
|Wooden|Chair                            |Wooden|
|Wooden|Cup;Glass|Chair                  |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+
I changed the delimiter to | instead of - and made changes in the query as well. I was expecting the results below, but got the wrong result shown above:
+----------------------------------------+-------+
|Household_chores_arrangements           |Chair  |
+----------------------------------------+-------+
|Wooden|Chair                            |Wooden |
|Wooden|Cup;Glass|Chair                  |Glass  |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+
If only the delimiter is changed, do we need to change anything else in the query?
Update 2
I have found the solution to the issue in the update above.
For the pipe delimiter, the | has to be escaped with four backslashes in the pattern.
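Based on that, here is a sketch of the corrected query, with the literal pipe before Chair escaped by four backslashes inside the Python string:
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)\\\\|Chair', 1), '') as Chair
    from M
""")
display(df)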
You can use regexp_extract to extract the categories, and if no match is found, replace empty string with null using nullif.
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair
from M
""")
df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup |W |
|W-Chair |W |
|W-Shelf;G-Cup;P-Chair |P |
|G-Cup;P-ShowerCap;W-Board |null |
+-----------------------------+-----+
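For reference, the same logic can be written with the DataFrame API instead of SQL (a sketch, assuming pyspark.sql.functions is imported as F; when without an otherwise clause yields null where no match is found):
from pyspark.sql import functions as F

# extract the category letter before "-Chair"; empty string means no match
extracted = F.regexp_extract('Household_chores_arrangements', '([A-Z])-Chair', 1)
df = M.withColumn('Chair', F.when(extracted != '', extracted))
df.show(truncate=False)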
I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext within the text column?
Input data:
+---------------------------+---------+
| text | subtext |
+---------------------------+---------+
| Where is my string? | is |
| Hm, this one is different | on |
+---------------------------+---------+
Expected output:
+---------------------------+---------+----------+
| text | subtext | position |
+---------------------------+---------+----------+
| Where is my string? | is | 6 |
| Hm, this one is different | on | 9 |
+---------------------------+---------+----------+
Note: I can do this using a static text/regex without issue; I have not been able to find any resources on doing this with a row-specific text/regex.
You can use locate. You need to subtract 1 because the string index starts from 1, not 0.
import pyspark.sql.functions as F
df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))
df2.show(truncate=False)
+-------------------------+-------+--------+
|text |subtext|position|
+-------------------------+-------+--------+
|Where is my string? |is |6 |
|Hm, this one is different|on |9 |
+-------------------------+-------+--------+
Another way, using the position SQL function:
from pyspark.sql.functions import expr
df1 = df.withColumn('position', expr("position(subtext in text) -1"))
df1.show(truncate=False)
#+-------------------------+-------+--------+
#|text |subtext|position|
#+-------------------------+-------+--------+
#|Where is my string? |is |6 |
#|Hm, this one is different|on |9 |
#+-------------------------+-------+--------+
pyspark.sql.functions.instr(str, substr)
Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
import pyspark.sql.functions as F
df.withColumn('pos',F.instr(df["text"], df["subtext"]))
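Note that instr is also 1-based, so to line up with the expected output above you would subtract 1 here as well (a sketch building on the snippet above):
# subtract 1 to convert the 1-based position to a 0-based one
df.withColumn('pos', F.instr(df["text"], df["subtext"]) - 1)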
You can use locate itself. The problem is that the first parameter of locate (substr) has to be a Python string, not a Column, so you can use the expr function and write the call in SQL, where subtext is resolved as a column.
Please find the correct code below:
df = input_df.withColumn("poss", F.expr("locate(subtext, text, 1)"))
I have a dataframe df that contains a list of strings like so:
+---------+
| Products|
+---------+
| Z9L57.W3|
| H9L23.05|
| PRL57.AF|
+---------+
I would like to truncate each string at the '.' character such that it looks like:
+--------------+
|Products_trunc|
+--------------+
|         Z9L57|
|         H9L23|
|         PRL57|
+--------------+
I tried using the split function, but it only works for a single string and not lists.
I also tried
df['Products_trunc'] = df['Products'].str.split('.').str[0]
but I am getting the following error:
TypeError: 'Column' object is not callable
Does anyone have any insights into this?
Thank You
Your code looks like you are used to pandas. Truncating in PySpark works a bit differently. Have a look below:
from pyspark.sql import functions as F
l = [
    ('Z9L57.W3',),
    ('H9L23.05',),
    ('PRL57.AF',)
]
columns = ['Products']
df = spark.createDataFrame(l, columns)
The withColumn function allows you to modify existing columns or create new ones. The function takes 2 parameters: a column name and a column expression. You modify a column when the column name already exists.
df = df.withColumn('Products', F.split(df.Products, r'\.').getItem(0))
df.show()
Output:
+--------+
|Products|
+--------+
| Z9L57|
| H9L23|
| PRL57|
+--------+
You create a new column when you choose a column name that does not already exist.
df = df.withColumn('Products_trunc', F.split(df.Products, r'\.').getItem(0))
df.show()
Output:
+--------+--------------+
|Products|Products_trunc|
+--------+--------------+
|Z9L57.W3| Z9L57|
|H9L23.05| H9L23|
|PRL57.AF| PRL57|
+--------+--------------+
I have a dataframe with the following format:
+----+---------+
| id | values  |
+----+---------+
| 1  | [1,2,3] |
+----+---------+
| 2  | [1,2,3] |
+----+---------+
| 3  | [1,3]   |
+----+---------+
| 4  | [1,2,8] |
.
.
.
And I want to filter, keeping only the rows where the length of the list in the values column is greater than or equal to 3. Assuming that the dataframe is called df, I am doing the following:
udf_filter = udf(lambda value: len(value) >= 3, BooleanType())
filtered_data = df.filter(udf_filter("values"))
When I run:
filtered_data.count()
It always gives a different result. How is that possible?
Notes:
df comes from another dataframe by sampling it (same seed)
df.count() always gives the same number
Edit:
I am using the following code to take the sample from the original table:
df = df_original.sample(False, 0.01, 42)
Even though I am using seed=42, if I run it multiple times it does not give the same results. To avoid that, I persist the df and then it always gives the same results:
df.persist()
But what I don't understand is why the seed doesn't give the same sample rows. What could be the reason for that?
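For reference, the length filter described above can also be written without a UDF, using the built-in size function (a sketch, assuming pyspark.sql.functions is imported as F; this alone does not explain the varying counts):
from pyspark.sql import functions as F

# size() returns the number of elements in an array or map column
filtered_data = df.filter(F.size('values') >= 3)
filtered_data.count()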