How to replace special characters in PySpark? - apache-spark

I am fairly new to PySpark, and I am trying to do some text pre-processing with it.
I have columns Name and ZipCode that belong to a Spark data frame new_df. The column Name contains values like WILLY:S MALMÖ, EMPORIA, and ZipCode contains values like 123 45, which is a string too. I want to remove characters like : and , from Name and remove the space inside ZipCode.
I tried the following, but nothing seems to work:
new_df = new_df.withColumn('Name', sfn.regexp_replace('Name', r',' , ' '))
new_df = new_df.withColumn('ZipCode', sfn.regexp_replace('ZipCode', r' ' , ''))
I also tried other suggestions from SO and other websites, but nothing seems to work.

Use the character class [,:] to match , or : and replace it with a space ' ' in the Name column; for ZipCode, match the space ' ' and replace it with an empty string ''. Inside square brackets | is a literal character rather than alternation, so [,|:] would also replace any | in the data.
Example:
from pyspark.sql.functions import regexp_replace
new_df.show(10,False)
#+-----------------------+-------+
#|Name |ZipCode|
#+-----------------------+-------+
#|WILLY:S MALMÖ, EMPORIA|123 45 |
#+-----------------------+-------+
new_df.withColumn('Name', regexp_replace('Name', r'[,:]' , ' ')).\
withColumn('ZipCode', regexp_replace('ZipCode', r' ' , '')).\
show(10,False)
#or, to remove any run of whitespace from ZipCode
new_df.withColumn('Name', regexp_replace('Name', r'[,:]' , ' ')).\
withColumn('ZipCode', regexp_replace('ZipCode', r'\s+' , '')).\
show(10,False)
#+-----------------------+-------+
#|Name |ZipCode|
#+-----------------------+-------+
#|WILLY S MALMÖ EMPORIA|12345 |
#+-----------------------+-------+
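The two patterns can also be tried outside Spark with Python's built-in re module. This is only a sketch: it assumes that these simple constructs (a character class and \s+) behave the same way in the Java regex engine that Spark's regexp_replace relies on, and the variable names are made up for illustration.

```python
import re

name = "WILLY:S MALMÖ, EMPORIA"
zip_code = "123 45"

# A character class lists single characters; no | is needed between them.
clean_name = re.sub(r"[,:]", " ", name)

# \s+ removes any run of whitespace, not just a single space.
clean_zip = re.sub(r"\s+", "", zip_code)

# Inside [...] the | is a literal pipe, so this class also rewrites "|".
pipe_demo = re.sub(r"[,|:]", " ", "A|B")

print(repr(clean_name))
print(repr(clean_zip))
print(repr(pipe_demo))
```

Because the comma is followed by a space in the input, replacing it with a space leaves two consecutive spaces in the cleaned name; a second replacement of \s+ with ' ' would collapse them if that matters.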

Related

How do I sort pandas data frame that has a multi-index?

I have the DataFrame shown in the image, and I want to sort it by the first 'всичко' column, the one under 'Общо'.
This is the output when I type:
df.columns =
MultiIndex([( ' Общо', ' всичко'),
( ' Общо', ' мъже'),
( ' Общо', ' жени'),
('В градовете', ' всичко'),
('В градовете', ' мъже'),
('В градовете', ' жени'),
( 'В селата', ' всичко'),
( 'В селата', ' мъже'),
( 'В селата', ' жени')],
names=['Области', 'Общини'])
and
df.index =
Index(['Общо за страната', 'Благоевград', 'Банско', 'Белица', 'Благоевград',
'Гоце Делчев', 'Гърмен', 'Кресна', 'Петрич', 'Разлог',
...
'Нови пазар', 'Смядово', 'Хитрино', 'Шумен', 'Ямбол', 'Болярово',
'Елхово', 'Стралджа', 'Тунджа', 'Ямбол'],
dtype='object', length=294)
Again, I need to sort by the 'всичко' column in descending order.
Best regards.
I tried using df.sort_values(), but I am having difficulty working around the MultiIndex.
You can use:
df = df.sort_values([(' Общо', ' всичко')], ascending=False)  # pass the column name as a tuple
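A small self-contained sketch of the same idea follows. The numbers and the reduced set of rows and columns are invented for illustration; note also that in the question the labels appear with leading spaces (' Общо'), and the tuple passed to sort_values has to match the real labels exactly.

```python
import pandas as pd

# Hypothetical miniature of the frame from the question (made-up values).
columns = pd.MultiIndex.from_tuples(
    [("Общо", "всичко"), ("Общо", "мъже"), ("В градовете", "всичко")],
    names=["Области", "Общини"],
)
df = pd.DataFrame(
    [[10, 4, 7], [30, 16, 20], [20, 9, 12]],
    index=["Банско", "Петрич", "Разлог"],
    columns=columns,
)

# A column of a MultiIndex frame is addressed by a tuple of its level labels.
sorted_df = df.sort_values(by=("Общо", "всичко"), ascending=False)
sorted_index = sorted_df.index.tolist()
print(sorted_index)
```

If the tuple raises a KeyError, printing df.columns.tolist() shows the exact labels, including any stray whitespace.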

Replace $$ or more with a single space using regex in Python

In the following list of strings I want to replace every run of $$ or more with a single space.
E.g. $$ should become one space, and $$$$ or more should also become only one space.
I am using the following regex, but I'm not sure if it serves the purpose:
regex_pattern = r"['$$']{2,}?"
Following is the test string list:
['1', 'Patna City $$$$ $$$$$$$$View Details', 'Serial No:$$$$5$$$$ $$$$Deed No:$$$$5$$$$ $$$$Token No:$$$$7$$$$ $$$$Reg Year:2020', 'Anil Kumar Singh Alias Anil Kumar$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad$$$$$$$$Claimant$$$$$$$$Late. Md. Serajuddin', 'Anil Kumar Singh Alias Anil Kumar', 'Executant', 'Late. Harinandan Singh', 'Md. Shahzad Ahmad', 'Claimant', 'Late. Md. Serajuddin', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000']
Regarding:
"I am using the following regex but I'm not sure if it serves the purpose"
The pattern ['$$']{2,}? can be written as ['$]{2,}? and matches 2 or more characters, each being either ' or $, in a non-greedy way.
Your pattern currently hits the right characters only because the data contains no parts like '' or $'.
Because the pattern is non-greedy, it matches only 2 characters at a time: it will not match all 3 characters in $$$, and $$$$ is matched as two separate pairs, so it is replaced by two spaces.
Instead, match 2 or more dollar signs greedily, so that a run of any length, odd or even, is matched as a whole:
regex_pattern = r"\${2,}"
In the replacement use a space.
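Is this what you need?
import re
# data is the list of strings from the question; build a new list,
# because reassigning the loop variable would not change the list itself
data = [re.sub(r'\${2,}', ' ', d) for d in data]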
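The difference between the lazy and the greedy quantifier can be checked on a short string. The sketch below also includes one possible variation, which is an assumption about what "only one space" should mean: it folds the whitespace around and between neighbouring $ runs into the same single space. The variable names are illustrative.

```python
import re

# Lazy: {2,}? stops after two characters, so one "$" is left behind.
lazy = re.sub(r"\${2,}?", " ", "a$$$b")

# Greedy: the whole run is consumed and replaced once.
greedy = re.sub(r"\${2,}", " ", "a$$$b")

line = "Patna City $$$$ $$$$$$$$View Details"

# Each run becomes one space, but the spaces already present stay as well.
simple = re.sub(r"\${2,}", " ", line)

# Variation: treat runs plus surrounding whitespace as one separator.
collapsed = re.sub(r"(?:\s*\${2,}\s*)+", " ", line)

print(repr(lazy))
print(repr(greedy))
print(repr(simple))
print(repr(collapsed))
```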

How to identify number near word using regex

I need to identify the numbers that follow keywords such as number:, no:, etc.
Tried:
import re
matchstring="Sales Quote"
string_lst = ['number:', 'No:','no:','number','No : ']
x=""" Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""
documentnumber= re.findall(r"(?:(?<="+matchstring+ '|'.join(string_lst)+r')) [\w\d-]',x,flags=re.IGNORECASE)
print(documentnumber)
Required output: 36886DJ9, 89745DFD, 7964KL, 879654DF, 9874656LD
Is there a way to do this?
Actually your solution is very close. You just need some missing parentheses and a check for optional whitespace (every pattern fragment is written as a raw string so that \s, \w and \d are not treated as string escapes):
documentnumber = re.findall(r"(?:(?<=" + matchstring + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
However, this won't match the last one (9874656LD) because of the missing whitespace between "sales" and "quote". If you want to build it in the same way as the rest of the pattern, replace the lookbehind with a non-capturing group and join the words with \s?:
documentnumber = re.findall(r"(?:(?:" + r"\s?".join(matchstring.split()) + r").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
Output:
['36886DJ9', '89745DFD', '7964KL', '879654DF', '9874656LD']
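The same pattern can be assembled a little more defensively. The sketch below is one possible variation rather than the answerer's code: it passes every fragment through re.escape so that punctuation in the keywords is taken literally, tries longer keywords first, and requires at least one character in the captured number. The names match_string, keywords and text are illustrative.

```python
import re

match_string = "Sales Quote"
keywords = ["number:", "No:", "no:", "number", "No : "]
text = """ Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""

# Words of the phrase may be separated by at most one whitespace character.
prefix = r"\s?".join(re.escape(word) for word in match_string.split())

# Longest keyword first, so "number:" is tried before "number".
alternatives = "|".join(
    re.escape(k) for k in sorted(keywords, key=len, reverse=True)
)

pattern = re.compile(rf"{prefix}.*?(?:{alternatives})\s?([\w-]+)", re.IGNORECASE)
numbers = pattern.findall(text)
print(numbers)
```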

How to extract several timestamp pairs from a list in Python

I have extracted all timestamps from a transcript file. The output looks like this:
('[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, '
'00:00:10,830, 00:00:10,830, 00:00:14,070, 00:00:14,070, 00:00:16,890, '
'00:00:16,890, 00:00:19,080, 00:00:19,080, 00:00:21,590, 00:00:21,590, '
'00:00:24,030, 00:00:24,030, 00:00:26,910, 00:00:26,910, 00:00:29,640, '
'00:00:29,640, 00:00:31,920, 00:00:31,920, 00:00:35,850, 00:00:35,850, '
'00:00:38,629, 00:00:38,629, 00:00:40,859, 00:00:40,859, 00:00:43,170, '
'00:00:43,170, 00:00:45,570, 00:00:45,570, 00:00:48,859, 00:00:48,859, '
'00:00:52,019, 00:00:52,019, 00:00:54,449, 00:00:54,449, 00:00:57,210, '
'00:00:57,210, 00:00:59,519, 00:00:59,519, 00:01:02,690, 00:01:02,690, '
'00:01:05,820, 00:01:05,820, 00:01:08,549, 00:01:08,549, 00:01:10,490, '
'00:01:10,490, 00:01:13,409, 00:01:13,409, 00:01:16,409, 00:01:16,409, '
'00:01:18,149, 00:01:18,149, 00:01:20,340, 00:01:20,340, 00:01:22,649, '
'00:01:22,649, 00:01:26,159, 00:01:26,159, 00:01:28,740, 00:01:28,740, '
'00:01:30,810, 00:01:30,810, 00:01:33,719, 00:01:33,719, 00:01:36,990, '
'00:01:36,990, 00:01:39,119, 00:01:39,119, 00:01:41,759, 00:01:41,759, '
'00:01:43,799, 00:01:43,799, 00:01:46,619, 00:01:46,619, 00:01:49,140, '
'00:01:49,140, 00:01:51,240, 00:01:51,240, 00:01:53,759, 00:01:53,759, '
'00:01:56,460, 00:01:56,460, 00:01:58,740, 00:01:58,740, 00:02:01,640, '
'00:02:01,640, 00:02:04,409, 00:02:04,409, 00:02:07,229, 00:02:07,229, '
'00:02:09,380, 00:02:09,380, 00:02:12,060, 00:02:12,060, 00:02:14,840, ]')
In this output, there are always timestamp pairs, i.e. always 2 consecutive timestamps belong together, for example: 00:00:03,950 and 00:00:06,840, 00:00:06,840 and 00:00:09,180, etc.
Now, I want to extract all these timestamp pairs separately so that the output looks like this:
00:00:03,950 - 00:00:06,840
00:00:06,840 - 00:00:09,180
00:00:09,180 - 00:00:10,830
etc.
For now, I have the following (very inconvenient) solution for my problem:
# get first part of first timestamp
a = res_timestamps[2:15]
print(dedent(a))
# get second part of first timestamp
b = res_timestamps[17:29]
print(b)
# combine timestamp parts
c = a + ' - ' + b
print(dedent(c))
Of course, this is very bad since I cannot extract the indices manually for all transcripts. Trying to use a loop has not worked yet because each item is not a timestamp but a single character.
Is there an elegant solution for my problem?
I appreciate any help or tip.
Thank you very much in advance!
Regex to the rescue!
A solution that works perfectly on your example data:
import re
from pprint import pprint
pprint(re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data))
This prints:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890'),
('00:00:16,890', '00:00:19,080'),
('00:00:19,080', '00:00:21,590'),
('00:00:21,590', '00:00:24,030'),
('00:00:24,030', '00:00:26,910'),
('00:00:26,910', '00:00:29,640'),
('00:00:29,640', '00:00:31,920'),
('00:00:31,920', '00:00:35,850'),
('00:00:35,850', '00:00:38,629'),
('00:00:38,629', '00:00:40,859'),
('00:00:40,859', '00:00:43,170'),
('00:00:43,170', '00:00:45,570'),
('00:00:45,570', '00:00:48,859'),
('00:00:48,859', '00:00:52,019'),
('00:00:52,019', '00:00:54,449'),
('00:00:54,449', '00:00:57,210'),
('00:00:57,210', '00:00:59,519'),
('00:00:59,519', '00:01:02,690'),
('00:01:02,690', '00:01:05,820'),
('00:01:05,820', '00:01:08,549'),
('00:01:08,549', '00:01:10,490'),
('00:01:10,490', '00:01:13,409'),
('00:01:13,409', '00:01:16,409'),
('00:01:16,409', '00:01:18,149'),
('00:01:18,149', '00:01:20,340'),
('00:01:20,340', '00:01:22,649'),
('00:01:22,649', '00:01:26,159'),
('00:01:26,159', '00:01:28,740'),
('00:01:28,740', '00:01:30,810'),
('00:01:30,810', '00:01:33,719'),
('00:01:33,719', '00:01:36,990'),
('00:01:36,990', '00:01:39,119'),
('00:01:39,119', '00:01:41,759'),
('00:01:41,759', '00:01:43,799'),
('00:01:43,799', '00:01:46,619'),
('00:01:46,619', '00:01:49,140'),
('00:01:49,140', '00:01:51,240'),
('00:01:51,240', '00:01:53,759'),
('00:01:53,759', '00:01:56,460'),
('00:01:56,460', '00:01:58,740'),
('00:01:58,740', '00:02:01,640'),
('00:02:01,640', '00:02:04,409'),
('00:02:04,409', '00:02:07,229'),
('00:02:07,229', '00:02:09,380'),
('00:02:09,380', '00:02:12,060'),
('00:02:12,060', '00:02:14,840')]
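You could output this in your desired format by storing the result of re.findall and looping over it:
timestamps = re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data)
for start, end in timestamps:
    print(f"{start} - {end}")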
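Put together on a shortened copy of the data, the regex approach might look like the sketch below. The names raw, TIMESTAMP, pairs and lines are illustrative, and the sample string keeps only the first three pairs from the question.

```python
import re

# Shortened sample in the same shape as the question's string.
raw = (
    "[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, "
    "00:00:09,180, 00:00:10,830, ]"
)

# One timestamp: HH:MM:SS,mmm
TIMESTAMP = r"\d{2}:\d{2}:\d{2},\d{3}"

# Two capture groups per match, so findall returns (start, end) tuples.
pairs = re.findall(rf"({TIMESTAMP}), ({TIMESTAMP})", raw)

lines = [f"{start} - {end}" for start, end in pairs]
print("\n".join(lines))
```

This relies on every end timestamp being repeated as the next start, as in the question's data; findall consumes the text pair by pair without overlapping.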
Here's a solution without regular expressions:
Clean the string, and split on ', ' to create a list.
Use list slicing to select the even-indexed and odd-indexed values and zip them together.
# give data as your string
# convert data into a list by removing end brackets and spaces, and splitting
data = data.replace('[, ', '').replace(', ]', '').split(', ')
# use list slicing and zip the two components
combinations = list(zip(data[::2], data[1::2]))
# print the first 5
print(combinations[:5])
[out]:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890')]

How to escape a variable whose value is already a string?

Create a sample.csv for the discussion.
cat > sample.csv <<EOF
class;grade
tom:class(3+2);80
tom:class(2+2);90
marry:class(3+2);85
marry:class(2+2);70
EOF
Show the data in sample.csv.
cat sample.csv
class;grade
tom:class(3+2);80
tom:class(2+2);90
marry:class(3+2);85
marry:class(2+2);70
Let's read it with pandas:
import pandas as pd
df = pd.read_csv('sample.csv',sep=';')
df
class grade
0 tom:class(3+2) 80
1 tom:class(2+2) 90
2 marry:class(3+2) 85
3 marry:class(2+2) 70
Now I want to select all records whose field class contains the string class(3+2), as below:
tom:class(3+2) 80
marry:class(3+2) 85
Get it this way:
classname = 'class\(3\+2\)'
df[df['class'].str.contains(pat=classname)]
class grade
0 tom:class(3+2) 80
2 marry:class(3+2) 85
The difficulty is that classname has already been assigned the value class(3+2):
classname='class(3+2)'
df[df['class'].str.contains(pat=classname)]
The code above no longer works. How can I escape the variable classname, whose value is already the string class(3+2)?
Note: you can't write classname = 'class\(3\+2\)'; its value is fixed as classname='class(3+2)'.
Set regex=False:
classname='class(3+2)' # ( ) and + are regex metacharacters, so turn regex off to match the string literally
df[df['class'].str.contains(pat=classname, regex=False)]
Out[166]:
class grade
0 tom:class(3+2) 80
2 marry:class(3+2) 85
If you insist on using regex for the search, you need to escape the + as well, and use a raw string, like so:
classname = r'class\(3\+2\)'
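When the value arrives in a variable and cannot be rewritten by hand, the standard library's re.escape can add the backslashes at run time. The sketch below uses plain re.search over a list that mirrors the class column, so it runs without pandas; the names rows and selected are illustrative.

```python
import re

classname = "class(3+2)"        # the value arrives as a plain string
pattern = re.escape(classname)  # backslash-escapes the regex metacharacters

# Stand-in for the "class" column of sample.csv.
rows = [
    "tom:class(3+2)",
    "tom:class(2+2)",
    "marry:class(3+2)",
    "marry:class(2+2)",
]

selected = [row for row in rows if re.search(pattern, row)]
print(pattern)
print(selected)
```

With pandas the escaped pattern would be passed in the same way, for example df['class'].str.contains(re.escape(classname)).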
