Dynamically add padding Zeros - apache-spark

mock_data = [('TYCO', ' 1303','13'),('EMC', ' 120989 ','123'), ('VOLVO ', '102329 ','1234'),('BMW', '1301571345 ',' '),('FORD', '004','21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2','col3'])
+-------+------------+-----+
| col1 | col2| col3|
+-------+------------+-----+
| TYCO| 1303| 13|
| EMC| 120989 | 123|
|VOLVO | 102329 | 1234|
| BMW|1301571345 | |
| FORD| 004|21212|
+-------+------------+-----+
Trim col2; then, based on the remaining length (10 minus the trimmed length of col2), dynamically left-pad col3 with zeros and concatenate col2 and col3.
from pyspark.sql.functions import length, trim
df2 = df.withColumn('length_col2', 10 - length(trim(df.col2)))
+-------+------------+-----+-----------+
| col1| col2| col3|length_col2|
+-------+------------+-----+-----------+
| TYCO| 1303| 13| 6|
| EMC| 120989 | 123| 4|
|VOLVO | 102329 | 1234| 4|
| BMW|1301571345 | | 0|
| FORD| 004|21212| 7|
+-------+------------+-----+-----------+
Expected output:
+-------+----------+-----+-------------
| col1| col2 | col3|output
+-------+----------+-----+-------------
| TYCO| 1303 | 13|1303000013
| EMC| 120989 | 123|1209890123
|VOLVO | 102329 | 1234|1023291234
| BMW| 1301571345 | |1301571345
| FORD| 004 |21212|0040021212
+-------+----------+-----+-------------

What you are looking for is the rpad function in pyspark.sql.functions, also listed in the Spark SQL built-in functions reference => https://spark.apache.org/docs/2.3.0/api/sql/index.html
See the solution below:
%pyspark
mock_data = [('TYCO', ' 1303','13'),('EMC', ' 120989 ','123'), ('VOLVO ', '102329 ','1234'),('BMW', '1301571345 ',' '),('FORD', '004','21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2','col3'])
df.createOrReplaceTempView("input_df")
spark.sql("SELECT *, concat(rpad(trim(col2),10,'0') , col3) as OUTPUT from input_df").show(20,False)
and the result:
+-------+------------+-----+---------------+
|col1 |col2 |col3 |OUTPUT |
+-------+------------+-----+---------------+
|TYCO | 1303 |13 |130300000013 |
|EMC | 120989 |123 |1209890000123 |
|VOLVO |102329 |1234 |10232900001234 |
|BMW |1301571345 | |1301571345 |
|FORD |004 |21212|004000000021212|
+-------+------------+-----+---------------+
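Note that rpad pads col2 itself out to 10 characters before appending col3, which is why the OUTPUT above is longer than the 10-character strings in the question's expected output. To match that expected output, col3 would be left-padded instead, e.g. `concat(trim(col2), lpad(trim(col3), 10 - length(trim(col2)), '0'))`. A minimal plain-Python sketch of that padding rule (hypothetical helper, for illustration only):

```python
def pad_concat(col2: str, col3: str, width: int = 10) -> str:
    """Trim both fields, then left-pad col3 with zeros so the
    concatenated result is exactly `width` characters long."""
    left = col2.strip()
    right = col3.strip()
    if len(left) >= width:
        return left  # nothing left to pad (e.g. the BMW row)
    return left + right.zfill(width - len(left))

print(pad_concat(' 1303', '13'))    # -> 1303000013
print(pad_concat('004', '21212'))   # -> 0040021212
```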

Related

PySpark: boolean previous values with conditions

I have data like this:
data = [("110125","James","2021-12-05","NY","PA",60000),("110125","James","2021-12-07","NY","PA",3000),("110125","James","2021-12-07","NY","AT",3000),
("5225","Michael","2021-12-25","LA","AT",60000),("5225","Michael","2021-12-17","LA","PA",15000),("5225","Michael","2021-12-17","LA","PA",65000)]
columns = ["id","Name","Date","Local","Office","salary"]
df = spark.createDataFrame(data = data, schema = columns)
Input:
+--------+--------+----------+-----+------+------+
| id |Name |Date |Local|Office|salary|
+--------+--------+----------+-----+------+------+
| 110125| James |2021-12-05|NY |PA | 60000|
| 110125| James |2021-12-07|NY |PA | 3000 |
| 110125| James |2021-12-07|NY |AT | 3000 |
| 5225 | Michael|2021-12-25|LA |AT | 60000|
| 5225 | Michael|2021-12-17|LA |PA | 15000|
| 5225 | Michael|2021-12-17|LA |PA | 65000|
+--------+--------+----------+-----+------+------+
I want a new column 'Check' that is True if any of the four values Date, Local, Office, or Salary differs from the previous row's values for the same id and Name.
Output:
+--------+--------+----------+-----+------+------+-----+
| id |Name |Date |Local|Office|salary|Check|
+--------+--------+----------+-----+------+------+-----+
| 110125| James |2021-12-05|NY |PA | 60000| |
| 110125| James |2021-12-07|NY |PA | 3000 | True|
| 110125| James |2021-12-07|NY |AT | 3000 | True|
| 5225 | Michael|2021-12-25|LA |AT | 60000| |
| 5225 | Michael|2021-12-17|LA |PA | 15000| True|
| 5225 | Michael|2021-12-17|LA |PA | 65000| True|
+--------+--------+----------+-----+------+------+-----+
My PySpark code:
df.groupby("ID", "Name").withColumn("Check", F.when(
    (F.col('Local') == F.lag('Local')) | (F.col('Office') == F.lag('Office')) |
    (F.col('Date') == F.lag('Date')) | (F.col('salary') == F.lag('salary')), False).otherwise(True))
AttributeError: 'GroupedData' object has no attribute 'withColumn'
You want to use a window function:
from pyspark.sql import Window, functions as F
w = Window.partitionBy("id", "name").orderBy("Date")
df = df.withColumn(
    "Check",
    ~((F.col('Local') == F.lag('Local').over(w))
      & (F.col('Office') == F.lag('Office').over(w))
      & (F.col('Date') == F.lag('Date').over(w))
      & (F.col('salary') == F.lag('salary').over(w)))
)
df.show()
#+------+-------+----------+-----+------+------+-----+
#| id| Name| Date|Local|Office|salary|Check|
#+------+-------+----------+-----+------+------+-----+
#|110125| James|2021-12-05| NY| PA| 60000| null|
#|110125| James|2021-12-07| NY| PA| 3000| true|
#|110125| James|2021-12-07| NY| AT| 3000| true|
#| 5225|Michael|2021-12-17| LA| PA| 15000| null|
#| 5225|Michael|2021-12-17| LA| PA| 65000| true|
#| 5225|Michael|2021-12-25| LA| AT| 60000| true|
#+------+-------+----------+-----+------+------+-----+
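The nulls in the first row of each partition come from lag() returning NULL there, which makes the whole comparison NULL. The window logic can be simulated in plain Python (hypothetical check_column helper, not part of the answer's code):

```python
from itertools import groupby

def check_column(rows):
    """Simulate lag() over a window partitioned by (id, Name) and ordered
    by Date: None for the first row of each partition (lag is NULL there),
    True when any of the four tracked fields differs from the previous row."""
    fields = ("Date", "Local", "Office", "salary")
    ordered = sorted(rows, key=lambda r: (r["id"], r["Name"], r["Date"]))
    out = []
    for _, part in groupby(ordered, key=lambda r: (r["id"], r["Name"])):
        prev = None
        for r in part:
            out.append(None if prev is None
                       else any(r[f] != prev[f] for f in fields))
            prev = r
    return out
```

If a plain False is wanted instead of NULL on the first row, the Spark expression can be wrapped in F.coalesce(..., F.lit(False)).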

Expand the last value of a string column to each group in a pandas DataFrame

I have the following Pandas dataframe:
+--------+----+
|id |name|
+--------+----+
| 1| |
| 1| |
| 1| |
| 1|Carl|
| 2| |
| 2| |
| 2|John|
+--------+----+
What I want to achieve is to expand the last value of each group to the rest of the group:
+--------+----+
|id |name|
+--------+----+
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 2|John|
| 2|John|
| 2|John|
+--------+----+
It looks pretty easy but I am struggling to achieve it because of the columns' type.
What I've tried so far is:
df['name'] = df.groupby('id')['name'].transform('last')
This works for int or float columns, but not for string columns.
I am getting the following error:
No numeric types to aggregate
Thanks in advance.
Edit
bfill() is not valid because I can have the following:
+--------+----+
|id |name|
+--------+----+
| 1| |
| 1| |
| 1| |
| 1|Carl|
| 2| |
| 2| |
| 2| |
| 3| |
| 3| |
| 3|John|
+--------+----+
In this case, I want id = 2 to remain as NaN, but with bfill() it would end up as John, which is incorrect. The desired output would be:
+--------+----+
|id |name|
+--------+----+
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 2| |
| 2| |
| 2| |
| 3|John|
| 3|John|
| 3|John|
+--------+----+
If the empty values are NaN, you could try bfill:
df['name'] = df['name'].bfill()
If not, replace the empty strings with NaN first.
Try this.
import pandas as pd
import numpy as np

dff = pd.DataFrame({"id":   [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                    "name": ["", "", "", "car1", "", "", "", "", "", "john"]})
dff = dff.replace('', np.nan)

def c(x):
    # Broadcast the group's single non-null string to the whole group,
    # or return empty strings when the group has no value at all.
    if sum(pd.isnull(x)) != np.size(x):
        l = [v for v in x if type(v) == str]
        return [l[0]] * np.size(x)
    else:
        return [""] * np.size(x)

df = dff.groupby('id')["name"].apply(lambda x: c(list(x)))
df = df.to_frame().reset_index()
df = df.set_index('id').name.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'name'})
Output:
id name
0 1 car1
1 1 car1
2 1 car1
3 1 car1
0 2
1 2
2 2
0 3 john
1 3 john
2 3 john
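For completeness, the asker's original transform('last') approach also works once the empty strings are converted to NaN, and it leaves all-empty groups (like id = 2 above) as NaN. A minimal sketch, assuming a pandas version whose GroupBy.transform('last') supports object dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id":   [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "name": ["", "", "", "Carl", "", "", "", "", "", "John"]})

# Treat empty strings as missing, then broadcast each group's last
# non-null value; groups with no value at all (id = 2) stay NaN.
df["name"] = (df["name"].replace("", np.nan)
                        .groupby(df["id"])
                        .transform("last"))
```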

PySpark - Select rows where the column has non-consecutive values after grouping

I have a dataframe of the form:
|user_id| action | day |
------------------------
| d25as | AB | 2 |
| d25as | AB | 3 |
| d25as | AB | 5 |
| m3562 | AB | 1 |
| m3562 | AB | 7 |
| m3562 | AB | 9 |
| ha42a | AB | 3 |
| ha42a | AB | 4 |
| ha42a | AB | 5 |
I want to filter out users that are seen only on consecutive days, i.e. keep only users seen on at least one non-consecutive day. The resulting dataframe should be:
|user_id| action | day |
------------------------
| d25as | AB | 2 |
| d25as | AB | 3 |
| d25as | AB | 5 |
| m3562 | AB | 1 |
| m3562 | AB | 7 |
| m3562 | AB | 9 |
where the last user has been removed, since he appeared only on consecutive days.
Does anyone know how this can be done in Spark?
Using Spark SQL window functions and without any UDFs. The DataFrame construction is done in Scala, but the SQL part will be the same in Python. Check this out:
val df = Seq(("d25as","AB",2),("d25as","AB",3),("d25as","AB",5),("m3562","AB",1),("m3562","AB",7),("m3562","AB",9),("ha42a","AB",3),("ha42a","AB",4),("ha42a","AB",5)).toDF("user_id","action","day")
df.createOrReplaceTempView("qubix")
spark.sql(
""" with t1( select user_id, action, day, row_number() over(partition by user_id order by day)-day diff from qubix),
t2( select user_id, action, day, collect_set(diff) over(partition by user_id) diff2 from t1)
select user_id, action, day from t2 where size(diff2) > 1
""").show(false)
Results:
+-------+------+---+
|user_id|action|day|
+-------+------+---+
|d25as |AB |2 |
|d25as |AB |3 |
|d25as |AB |5 |
|m3562 |AB |1 |
|m3562 |AB |7 |
|m3562 |AB |9 |
+-------+------+---+
PySpark version:
>>> from pyspark.sql.functions import *
>>> values = [('d25as','AB',2),('d25as','AB',3),('d25as','AB',5),
... ('m3562','AB',1),('m3562','AB',7),('m3562','AB',9),
... ('ha42a','AB',3),('ha42a','AB',4),('ha42a','AB',5)]
>>> df = spark.createDataFrame(values,['user_id','action','day'])
>>> df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| ha42a| AB| 3|
| ha42a| AB| 4|
| ha42a| AB| 5|
+-------+------+---+
>>> df.createOrReplaceTempView("qubix")
>>> spark.sql(
... """ with t1( select user_id, action, day, row_number() over(partition by user_id order by day)-day diff from qubix),
... t2( select user_id, action, day, collect_set(diff) over(partition by user_id) diff2 from t1)
... select user_id, action, day from t2 where size(diff2) > 1
... """).show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
+-------+------+---+
>>>
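The row_number() - day trick in t1 can be sanity-checked in plain Python (hypothetical has_gap helper, for illustration only): the difference is constant across a run of consecutive days, so more than one distinct value means the user has a gap.

```python
def has_gap(days):
    """Mirror of the SQL: row_number() - day is constant over a run of
    consecutive days, so a user with only consecutive days produces one
    distinct difference, and size(diff2) > 1 detects a gap."""
    diffs = {i - d for i, d in enumerate(sorted(days))}
    return len(diffs) > 1

print(has_gap([3, 4, 5]))   # ha42a -> False, filtered out
print(has_gap([2, 3, 5]))   # d25as -> True, kept
```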
Read the comments in between; the code will then be self-explanatory.
from pyspark.sql.functions import udf, collect_list, explode, col
#Creating the DataFrame
values = [('d25as','AB',2),('d25as','AB',3),('d25as','AB',5),
('m3562','AB',1),('m3562','AB',7),('m3562','AB',9),
('ha42a','AB',3),('ha42a','AB',4),('ha42a','AB',5)]
df = sqlContext.createDataFrame(values,['user_id','action','day'])
df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| ha42a| AB| 3|
| ha42a| AB| 4|
| ha42a| AB| 5|
+-------+------+---+
# Grouping together the days in one list.
df = df.groupby(['user_id','action']).agg(collect_list('day'))
df.show()
+-------+------+-----------------+
|user_id|action|collect_list(day)|
+-------+------+-----------------+
| ha42a| AB| [3, 4, 5]|
| m3562| AB| [1, 7, 9]|
| d25as| AB| [2, 3, 5]|
+-------+------+-----------------+
# Creating a UDF to check whether the days are consecutive (boolean return
# type, so the comparison below works on booleans). Only keep the False ones.
check_consecutive = udf(lambda row: sorted(row) == list(range(min(row), max(row)+1)), 'boolean')
df = df.withColumn('consecutive', check_consecutive(col('collect_list(day)')))\
       .where(col('consecutive') == False)
df.show()
+-------+------+-----------------+-----------+
|user_id|action|collect_list(day)|consecutive|
+-------+------+-----------------+-----------+
| m3562| AB| [1, 7, 9]| false|
| d25as| AB| [2, 3, 5]| false|
+-------+------+-----------------+-----------+
# Finally, exploding the DataFrame from above to get the result.
df = df.withColumn("day", explode(col('collect_list(day)')))\
.drop('consecutive','collect_list(day)')
df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
+-------+------+---+

Randomly Split DataFrame by Unique Values in One Column

I have a pyspark DataFrame like the following:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val12 | val22 | 1 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
Each row has a groupId and multiple rows can have the same groupId.
I want to randomly split this data into two datasets. But all the data having a particular groupId must be in one of the splits.
This means that if d1.groupId = d2.groupId, then d1 and d2 are in the same split.
For example:
# Split 1:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
+--------+--------+-----------+
# Split 2:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val12 | val22 | 1 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
What is a good way to do this in PySpark? Can I use the randomSplit method somehow?
You can use randomSplit to split just the distinct groupIds, and then use the results to split the source DataFrame using join.
For example:
split1, split2 = df.select("groupId").distinct().randomSplit(weights=[0.5, 0.5], seed=0)
split1.show()
#+-------+
#|groupId|
#+-------+
#| 1|
#+-------+
split2.show()
#+-------+
#|groupId|
#+-------+
#| 0|
#| 2|
#+-------+
Now join these back to the original DataFrame:
df1 = df.join(split1, on="groupId", how="inner")
df2 = df.join(split2, on="groupId", how="inner")
df1.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 1|val12|val22|
#| 1|val15|val25|
#| 1|val16|val26|
#+-------+-----+-----+
df2.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 0|val11|val21|
#| 0|val14|val24|
#| 2|val13|val23|
#+-------+-----+-----+
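The invariant of the join-based split (every groupId lands entirely on one side) can be sketched in plain Python (hypothetical split_by_group helper, mirroring randomSplit on the distinct IDs followed by the join):

```python
import random

def split_by_group(rows, key, weight=0.5, seed=0):
    """Assign whole groups to one of two splits at random, so every row
    sharing a key ends up on the same side (like randomSplit + join)."""
    rng = random.Random(seed)
    groups = sorted({key(r) for r in rows})
    side1 = {g for g in groups if rng.random() < weight}
    split1 = [r for r in rows if key(r) in side1]
    split2 = [r for r in rows if key(r) not in side1]
    return split1, split2

rows = [("val11", "val21", 0), ("val12", "val22", 1), ("val13", "val23", 2),
        ("val14", "val24", 0), ("val15", "val25", 1), ("val16", "val26", 1)]
s1, s2 = split_by_group(rows, key=lambda r: r[2])
```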

Spark: How do I explode data and also add the column name, in PySpark or Scala Spark?

I want to explode multiple columns and consolidate them into a single column, recording the source column name in each row.
Input data:
+-----------+-----------+-----------+
| ASMT_ID | WORKER | LABOR |
+-----------+-----------+-----------+
| 1 | A1,A2,A3| B1,B2 |
+-----------+-----------+-----------+
| 2 | A1,A4 | B1 |
+-----------+-----------+-----------+
Expected Output:
+-----------+-----------+-----------+
| ASMT_ID |WRK_CODE |WRK_DETL |
+-----------+-----------+-----------+
| 1 | A1 | WORKER |
+-----------+-----------+-----------+
| 1 | A2 | WORKER |
+-----------+-----------+-----------+
| 1 | A3 | WORKER |
+-----------+-----------+-----------+
| 1 | B1 | LABOR |
+-----------+-----------+-----------+
| 1 | B2 | LABOR |
+-----------+-----------+-----------+
| 2 | A1 | WORKER |
+-----------+-----------+-----------+
| 2 | A4 | WORKER |
+-----------+-----------+-----------+
| 2 | B1 | LABOR |
+-----------+-----------+-----------+
Probably not the cleanest approach, but a couple of explodes and a unionAll is all you need.
import org.apache.spark.sql.functions._
df1.show
+-------+--------+-----+
|ASMT_ID| WORKER|LABOR|
+-------+--------+-----+
| 1|A1,A2,A3|B1,B2|
| 2| A1,A4| B1|
+-------+--------+-----+
df1.cache
val workers = df1.drop("LABOR")
.withColumn("WRK_CODE" , explode(split($"WORKER" , ",") ) )
.withColumn("WRK_DETL", lit("WORKER"))
.drop("WORKER")
val labors = df1.drop("WORKER")
.withColumn("WRK_CODE" , explode(split($"LABOR", ",") ) )
.withColumn("WRK_DETL", lit("LABOR") )
.drop("LABOR")
workers.unionAll(labors).orderBy($"ASMT_ID".asc , $"WRK_CODE".asc).show
+-------+--------+--------+
|ASMT_ID|WRK_CODE|WRK_DETL|
+-------+--------+--------+
| 1| A1| WORKER|
| 1| A2| WORKER|
| 1| A3| WORKER|
| 1| B1| LABOR|
| 1| B2| LABOR|
| 2| A1| WORKER|
| 2| A4| WORKER|
| 2| B1| LABOR|
+-------+--------+--------+
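The same explode-and-union shape can be sanity-checked in plain Python (hypothetical explode_columns helper, for illustration only):

```python
def explode_columns(rows, id_col, cols):
    """Split each comma-separated column into one row per code,
    recording the source column name alongside it."""
    out = []
    for r in rows:
        for c in cols:
            for code in r[c].split(","):
                out.append((r[id_col], code.strip(), c))
    return out

rows = [{"ASMT_ID": 1, "WORKER": "A1,A2,A3", "LABOR": "B1,B2"},
        {"ASMT_ID": 2, "WORKER": "A1,A4", "LABOR": "B1"}]
result = explode_columns(rows, "ASMT_ID", ["WORKER", "LABOR"])
```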
