Trim additional whitespace between the names in PySpark - apache-spark

How do I trim the additional spaces between the names in a PySpark dataframe?
Below is my dataframe
+----------------------+----------+
|name                  |account_id|
+----------------------+----------+
|abc    xyz    pqr     |1         |
|pqm    rst            |2         |
+----------------------+----------+
Output I want
+-------------+----------+
|name         |account_id|
+-------------+----------+
|abc xyz pqr  |1         |
|pqm rst      |2         |
+-------------+----------+
I tried using regexp_replace, but it removes the spaces completely. Is there any other way to implement this? Thanks a lot!

I tried using regexp_replace with the pattern '\s+' and a single space as the replacement, and I got the output I wanted.
from pyspark.sql.functions import col, regexp_replace
df = df.withColumn("name", regexp_replace(col("name"), r'\s+', ' '))
Output
+-----------+----------+
|       name|account_id|
+-----------+----------+
|abc xyz pqr|         1|
|    pqm rst|         2|
+-----------+----------+
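For completeness, here is a minimal, self-contained sketch of the same idea (the sample data and the extra trim() call are additions of mine, not part of the answer above): regexp_replace collapses runs of whitespace into a single space, and trim() removes any leading or trailing spaces that remain.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, trim

spark = SparkSession.builder.getOrCreate()

# Sample data with extra spaces between the name parts.
df = spark.createDataFrame(
    [("abc   xyz   pqr", 1), ("pqm   rst", 2)],
    ["name", "account_id"],
)

# Collapse repeated whitespace, then strip leading/trailing spaces.
cleaned = df.withColumn("name", trim(regexp_replace(col("name"), r"\s+", " ")))
cleaned.show(truncate=False)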

Related

Split column on condition in dataframe

The data frame I am working on has a column named "Phone" and I want to split it on "/" or "," so that I get the data shown below in separate columns. For example, the first row is 0674-2537100/101 and I want to split it on "/" into two columns with the values 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------------------------+
| Phone | Phone1 |
+-----------------------------------+
| 0674-2537100 | 0674-2537101 |
| 0674-2725627 | |
| 0671 – 2647509 | |
| 2392229 | |
| 2586198 | 2583361 |
| 0663-2542855 | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579 | 0671-3200479 |
+-----------------------------------+
My idea for a solution: take the lengths of the strings on both sides of the separator (/), compute their difference, and copy the substring of the first column from character position [:difference-1] onto the second column.
So far my progress is:
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error and the column has only nan values after I run this. Please help me out here.
Considering you're only going to have at most two values (a single '/') in the 'Phone' column, here's what you can do:
import pandas as pd

def split_phone_number(row):
    '''
    This function takes a row of the dataframe as input and returns the row with the appropriate values.
    '''
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have two or fewer values, update
    # the passed row's columns with the appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row

# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349', '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, add that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
              Phone
0  0674-2537100/101
1       0674-257349
2       0671-257349
3            257349
4        257349/100
5           101/100
6    5688343/438934
Output:
          Phone  Phone1
0  0674-2537100     101
1   0674-257349
2   0671-257349
3        257349
4        257349     100
5           101     100
6       5688343  438934
For further reading:
dataframe.apply()
Hope this helps. Cheers!
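The desired output in the question also copies the area-code prefix onto the shorter second number (101 becomes 0674-2537101), which the answer above does not do. Building on the asker's own length-difference idea, here is a hedged sketch of one way to do that; the completion rule (prepend the leading characters the second number is missing) is inferred from the sample data.
import pandas as pd

df = pd.DataFrame({'Phone': ['0674-2537100/101', '0674-2725627',
                             '2586198/2583361', '0663-2542855/2405168']})

# Split on '/'; rows without a separator get an empty second part.
parts = df['Phone'].str.split('/', n=1, expand=True)
df['Phone'] = parts[0]
df['Phone1'] = parts[1].fillna('')

# If the second number is shorter than the first, prepend the characters it
# is "missing" from the front of the first number (rule inferred from the data).
def complete(first, second):
    if second and len(second) < len(first):
        return first[: len(first) - len(second)] + second
    return second

df['Phone1'] = [complete(f, s) for f, s in zip(df['Phone'], df['Phone1'])]
print(df)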

Append a monotonically increasing id column that increases on column value match

I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example I have the following table
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" in col1 is present. So the result would look like
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zippering but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
Spark does not provide any default function to achieve this kind of functionality. I would most probably do it this way:
// inputDF contains Col1 | Col2
import spark.implicits._
val df = inputDF.select("Col1").distinct.rdd.zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }.toDF("Col1", "Col3")
val finalDF = inputDF.join(df, Seq("Col1"), "left")
The problem I can see here is the join, which will result in a shuffle.
You can also check other auto-increment APIs here.
Use a window and take a running sum over it of the value 1 whenever Col1 = 'A'.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
| B| 543| 0|
| A| 1231| 1|
| B|14234| 1|
| B|34234| 1|
| B| 3434| 1|
| A|43242| 2|
| B|43242| 2|
| B|56453| 2|
+----+-----+----+
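One caveat (my addition, not part of the answer above): Spark does not guarantee row order once the data has been shuffled, and the question says the initial order matters. Here is a hedged sketch of the same running-sum idea with an explicit ordering column, assuming the dataframe still reflects the ingest order when it is first read:
import pyspark.sql.functions as f
from pyspark.sql import Window

# Capture the current (ingest) order before any shuffling operations.
df_ordered = df.withColumn("row_order", f.monotonically_increasing_id())

# Ordering the window without a partition pulls everything into one partition,
# which is the price of a global running count.
w = Window.orderBy("row_order").rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = df_ordered.withColumn(
    "Col3",
    f.sum(f.when(f.col("Col1") == "A", 1).otherwise(0)).over(w),
).drop("row_order")
result.show()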

Match first instance of cell in a column not equal to the other column with offset

Suppose I have the following table
+---+-------+-------+
| | A | B |
+---+-------+-------+
| 1 | 7,75 | 9,50 |
+---+-------+-------+
| 2 | 9,50 | 10,50 |
+---+-------+-------+
| 3 | 10,50 | 11,50 |
+---+-------+-------+
| 4 | 11,75 | 13,00 |
+---+-------+-------+
| 5 | 13,00 | 14,00 |
+---+-------+-------+
| 6 | 14,00 | 15,25 |
+---+-------+-------+
| 7 | 15,25 | |
+---+-------+-------+
I want to find the first occurrence in column A that does not match column B offset by -1 row (in this case it should give me 11,75, in A4).
I would like to avoid using VBA.
I tried using MATCH, but I'm unsure how to do the condition with 2 tables
Would this work for you:
=LOOKUP(1,1/(A2:A7<>B1:B6),A2:A7)
No need to CSE LOOKUP().
use INDEX/AGGREGATE:
=INDEX(A:A,AGGREGATE(15,7,ROW(A2:A7)/(A2:A7<>B1:B6),1))

Optimize spark dataframe operation

I have a Spark (version 2.4) dataframe of the following pattern.
+----------+
| ColumnA |
+----------+
| 1000#Cat |
| 1001#Dog |
| 1000#Cat |
| 1001#Dog |
| 1001#Dog |
+----------+
I am conditionally applying a regex to remove the number that is prefixed to the string, using the following code
dataset.withColumn("ColumnA",when(regexp_extract(dataset.col("ColumnA"), "\\#(.*)", 1)
.equalTo(""), dataset.col("ColumnA"))
.otherwise(regexp_extract(dataset.col("ColumnA"), "\\#(.*)", 1)));
which results in a dataframe in the following format
+---------+
| ColumnA |
+---------+
| Cat |
| Dog |
| Cat |
| Dog |
| Dog |
+---------+
This runs correctly and produces the desired output.
However, the regexp_extract operation is applied twice: once to check whether the extracted string is empty, and if not, again to extract the value from the column.
Is there any optimization that can be done on this code to make it perform better?
Use the split function instead of regexp_extract.
Please check the code below along with the execution times.
scala> df.show(false)
+--------+
|columna |
+--------+
|1000#Cat|
|1001#Dog|
|1000#Cat|
|1001#Dog|
|1001#Dog|
+--------+
scala> spark.time(df.withColumn("parsed",split($"columna","#")(1)).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000#Cat|Cat |
|1001#Dog|Dog |
|1000#Cat|Cat |
|1001#Dog|Dog |
|1001#Dog|Dog |
+--------+------+
Time taken: 14 ms
scala> spark.time { df.withColumn("ColumnA",when(regexp_extract($"columna", "\\#(.*)", 1).equalTo(""), $"columna").otherwise(regexp_extract($"columna", "\\#(.*)", 1))).show(false) }
+-------+
|ColumnA|
+-------+
|Cat |
|Dog |
|Cat |
|Dog |
|Dog |
+-------+
Time taken: 22 ms
scala>
Use the contains function to check for a '#' in the column:
scala> spark.time(df.withColumn("parsed",when($"columna".contains("#"), lit(split($"columna","#")(1))).otherwise("")).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000#Cat|Cat |
|1001#Dog|Dog |
|1000#Cat|Cat |
|1001#Dog|Dog |
|1001#Dog|Dog |
+--------+------+
Time taken: 14 ms
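Another single-pass option (my sketch, shown in PySpark syntax, not part of the original answer) is regexp_replace: stripping an optional leading "digits#" prefix takes one expression, and rows without a '#' simply do not match the pattern and pass through unchanged. The dataset name follows the question.
from pyspark.sql.functions import regexp_replace

# "1000#Cat" -> "Cat"; values without a numeric prefix are left as-is.
result = dataset.withColumn("ColumnA", regexp_replace("ColumnA", r"^\d+#", ""))
result.show(truncate=False)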

Use a dataframe as lookup for another dataframe

I've two dataframes, df_1 and df_2.
df_1 is my master dataframe and df_2 is a lookup dataframe.
I want to test whether the value in df_1['col_c1'] contains any of the values from df_2['col_a2'].
If this is true (there can be multiple matches!):
add the value(s) from df_2['col_b2'] to df_1['col_d1']
add the value(s) from df_2['col_c2'] to df_1['col_e1']
How can I achieve this?
I've really no idea and therefore I can't share any code for this.
Sample df_1
col_a1 | col_b1 | col_c1 | col_d1 | col_e1
----------------------------------------------------
1_001 | aaaaaa | bbbbccccdddd | |
1_002 | zzzzz | ggggjjjjjkkkkk | |
1_003 | pppp | qqqqffffgggg | |
1_004 | sss | wwwcccyyy | |
1_005 | eeeeee | eecccffffll | |
1_006 | tttt | hhggeeuuuuu | |
Sample df_2
col_a2 | col_b2 | col_c2
------------------------------
ccc | 2_001 | some_data_c
jjj | 2_002 | some_data_j
fff | 2_003 | some_data_f
Desired output df_1
col_a1 | col_b1 | col_c1 | col_d1 | col_e1
------------------------------------------------------------------------------
1_001 | aaaaaa | bbbbccccdddd | 2_001 | some_data_c
1_002 | zzzzz | ggggjjjjjkkkkk | 2_002 | some_data_j
1_003 | pppp | qqqqffffgggg | 2_003 | some_data_f
1_004 | sss | wwwcccyyy | 2_001 | some_data_c
1_005 | eeeeee | eecccffffll | 2_001;2_003 | some_data_c; some_data_f
1_006 | tttt | hhggeeuuuuu | |
df_1 has approx. 45,000 rows and df_2 approx. 16,000 rows. (I also added a non-matching row to the sample.)
I've been struggling with this for hours, but I really have no idea.
I don't think merging is an option because there's no exact match.
Your help is greatly appreciated.
Use:
#extract values matched by df_2["col_a2"] into a new column
s = (df_1['col_c1'].str.extractall(f'({"|".join(df_2["col_a2"])})')[0].rename('new')
.reset_index(level=1, drop=True))
#repeat rows with duplicated match
df_1 = df_1.join(s)
#add new columns by map
df_1['col_d1'] = df_1['new'].map(df_2.set_index('col_a2')['col_b2'])
df_1['col_e1'] = df_1['new'].map(df_2.set_index('col_a2')['col_c2'])
#aggregate join
cols = df_1.columns.difference(['new','col_d1','col_e1']).tolist()
df = df_1.drop('new', axis=1).groupby(cols).agg(','.join).reset_index()
print (df)
col_a1 col_b1 col_c1 col_d1 col_e1
0 1_001 aaaaaa bbbbccccdddd 2_001 some_data_c
1 1_002 zzzzz ggggjjjjjkkkkk 2_002 some_data_j
2 1_003 pppp qqqqffffgggg 2_003 some_data_f
3 1_004 sss wwwcccyyy 2_001 some_data_c
4 1_005 eeeeee eecccffffll 2_001,2_003 some_data_c,some_data_f
This will solve it:
df['col_d1'] = df.apply(lambda x: ','.join([df2['col_b2'][i] for i in range(len(df2)) if df2['col_a2'][i] in x.col_c1]), axis=1)
df['col_e1'] = df.apply(lambda x: ','.join([df2['col_c2'][i] for i in range(len(df2)) if df2['col_a2'][i] in x.col_c1]), axis=1)
Output
col_a1 col_b1 col_c1 col_d1 \
0 1_001 aaaaaa bbbbccccdddd 2_001
1 1_002 zzzzz ggggjjjjjkkkkk 2_002
2 1_003 pppp qqqqffffgggg 2_003
3 1_004 sss wwwcccyyy 2_001
4 1_005 eeeeee eecccffffll 2_001 , 2_003
col_e1
0 some_data_c
1 some_data_j
2 some_data_f
3 some_data_c
4 some_data_c; some_data_f
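A small variation on the answers above (my sketch, not from the original answers): using a col_a2-indexed lookup avoids relying on the lookup frame's default integer index, keeps non-matching rows such as 1_006 as empty strings, and joins multiple matches with ';' as in the desired output. It assumes the col_a2 values are unique.
import pandas as pd

lookup = df_2.set_index('col_a2')

# For each col_c1 value, collect every col_a2 key it contains as a substring.
matches = df_1['col_c1'].apply(lambda text: [key for key in lookup.index if key in text])

# Join the looked-up values, leaving non-matching rows empty.
df_1['col_d1'] = matches.apply(lambda keys: ';'.join(lookup.loc[keys, 'col_b2']) if keys else '')
df_1['col_e1'] = matches.apply(lambda keys: ';'.join(lookup.loc[keys, 'col_c2']) if keys else '')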
