Regex for simple pattern in python - python-3.x

I have a string containing exactly one pair of parentheses (and some words between them), and lots of other words.
How would one create a regex to split the string into [ words before (, words between (), words after )]?
e.g.
line = "a bbbb cccc dd ( ee fff ggg ) hhh iii jk"
would be split into
[ "a bbbb cccc dd", "ee fff ggg", "hhh iii jk" ]
I've tried
line = re.compile("[^()]+").split(line)
but it doesn't work.

It seems that in the process you also want to remove the leading and trailing whitespace, i.e., the whitespace before and after ( and ). You could try:
>>> import re
>>> line = "a bbbb cccc dd ( ee fff ggg ) hhh iii jk"
>>> re.split(r'\s*[\(\)]\s*', line)
['a bbbb cccc dd', 'ee fff ggg', 'hhh iii jk']
>>>
>>> # to make it look as in your description ...
>>> line = re.compile(r'\s*[\(\)]\s*').split(line)
>>> line
['a bbbb cccc dd', 'ee fff ggg', 'hhh iii jk']
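For comparison, here is a minimal sketch of why the pattern from the question fails: it treats every run of non-parenthesis characters as the delimiter, so only the parentheses themselves survive the split. The variable names attempt and fixed are just illustrative.

```python
import re

line = "a bbbb cccc dd ( ee fff ggg ) hhh iii jk"

# The question's pattern: each run of non-parenthesis characters is a
# delimiter, so the words are discarded and only "(" and ")" remain,
# plus an empty string at each end.
attempt = re.compile(r"[^()]+").split(line)
print(attempt)

# Splitting on the parentheses instead, together with any whitespace
# around them, keeps the words and drops the delimiters.
fixed = re.split(r"\s*[()]\s*", line)
print(fixed)
```

Inside a character class the parentheses need no escaping, so [()] is equivalent to [\(\)].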

To split the string into three parts, I think the simplest option is to use three capture groups (some_regex)(another_regex)(yet_another_regex). In your case, the first part is any run of characters other than (, followed by a literal (, then any run of characters other than ), followed by a literal ), and finally any remaining characters.
Therefore the regex is ([^(]*)\(([^)]*)\)(.*), which you can then use to retrieve the groups (your desired output, apart from the surrounding whitespace, which str.strip() can remove):
>>> import re
>>> pattern = re.compile(r'([^(]*)\(([^)]*)\)(.*)')
>>> pattern.match(line).groups()
('a bbbb cccc dd ', ' ee fff ggg ', ' hhh iii jk')
With:
([^(]*) the first group
([^)]*) the second group
(.*) the last group
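As a small sketch building on the three groups above, each captured group can be stripped to get exactly the list from the question. The name parts is illustrative, and the check on match is an added assumption for lines that contain no parentheses.

```python
import re

line = "a bbbb cccc dd ( ee fff ggg ) hhh iii jk"
pattern = re.compile(r"([^(]*)\(([^)]*)\)(.*)")

match = pattern.match(line)
# match is None when the line has no "(...)" pair, so guard against it.
parts = [group.strip() for group in match.groups()] if match else []
print(parts)
```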

Related

Excel to pandas, whitespace as row delimiter

I have some badly formatted Excel data that looks like this:
aaaa
bbbb
cccc
aaaa
bbbb
cccc
dddd
Is there a decent way to, using the white space as a delimiter, turn each segment into a pandas row with a default value to fix raggedness?
I'd like the end result to be something like this:
aaaa bbbb cccc ""
aaaa bbbb cccc dddd
Thanks!
I think I was able to get something that works by bringing the column in as a list, and then trying to pull out each section of the list separated by any number of '' values. There may be a better way to do these things, but maybe the idea is helpful (and it does work for my example at least).
The sample data I used here is like yours, with another little chunk at the end. Hopefully I'm understanding it right.
l = ['aaaa',
'bbbb',
'cccc',
'',
'',
'',
'aaaa',
'bbbb',
'cccc',
'dddd',
'',
'asdf',
'badfd'
]
With that list l, the idea is to loop through the list for the number of '' items, and pull out the items between lastspace (the last blank-space value) and thisspace (the current blank-space value).
Consecutive '' values need special handling, which is what the while thisspace == lastspace+1... piece does: when it finds one, it skips ahead to the next '' value. Each skipped value also has to advance the loop counter, so the outer loop is a while loop rather than a for loop, where a manual increment would be overwritten on the next iteration.
l.index() raises a ValueError when it can't find the value, so once we have gone past the last '' we need this error handling.
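import pandas as pd

d = []
lastspace = -1
i = 0
while i <= l.count(''):
    try:
        thisspace = l.index('', lastspace+1)
        # skip over runs of consecutive '' values
        while thisspace == lastspace+1:
            lastspace = thisspace
            thisspace = l.index('', lastspace+1)
            i += 1
        # everything between two blanks becomes one row
        d.append(l[lastspace+1:thisspace])
        lastspace = thisspace
        i += 1
    except ValueError:
        # no more '' values: the rest of the list is the last row
        d.append(l[lastspace+1:])
        i += 1
df = pd.DataFrame(d)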
The dataframe I get in the end looks like this:
0 1 2 3
0 aaaa bbbb cccc None
1 aaaa bbbb cccc dddd
2 asdf badfd None None
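The same grouping idea can probably be written more compactly with itertools.groupby, which clusters consecutive items that share a key. This is only a sketch of an alternative, not the method used above; the helper name split_on_blanks and its fill parameter are made up for illustration.

```python
from itertools import groupby


def split_on_blanks(cells, fill=""):
    """Group consecutive non-blank cells into rows padded to equal width."""
    # bool('') is False, so groupby alternates between runs of blank and
    # non-blank cells; only the non-blank runs are kept as rows.
    rows = [list(group) for non_blank, group in groupby(cells, key=bool) if non_blank]
    width = max((len(row) for row in rows), default=0)
    # Pad short rows with the default value to fix the raggedness.
    return [row + [fill] * (width - len(row)) for row in rows]


cells = ['aaaa', 'bbbb', 'cccc', '', '', '',
         'aaaa', 'bbbb', 'cccc', 'dddd', '',
         'asdf', 'badfd']
rows = split_on_blanks(cells)
print(rows)
```

The padded rows can then be passed to pd.DataFrame(rows) in the same way as the list d above.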
I have used the following text file test.txt to simulate your inputs (note the two spaces between the first cccc and the second aaaa, standing in for the blank row):
aaaa bbbb cccc  aaaa bbbb cccc dddd
Is this correct?
import numpy as np
import pandas as pd
df = pd.read_csv('test.txt', delimiter = ' ', header = None)
0 1 2 3 4 5 6 7
0 aaaa bbbb cccc NaN aaaa bbbb cccc dddd
And to further process the dataframe
df = df.fillna('')
print(np.array(df.loc[0,:].to_list()).reshape(-1,4))
[['aaaa' 'bbbb' 'cccc' '']
['aaaa' 'bbbb' 'cccc' 'dddd']]

Why doesn't the negated character class work as expected?

xyz mnl pqt aaaa ccc
yz mn ats aa cbc ddd eee ggg
I want to match the first two columns with:
[^\s]*\s[^\s]*\s
But this pattern matches up to all columns but the last one. That is:
xyz mnl pqt aaaa
yz mn ats aa cbc ddd eee
I don't understand this in VIM.
Two things:
In Vim, \s doesn't work inside a [] collection: [^\s] excludes only a literal backslash and the letter s, so it also matches spaces. Use \S instead.
Prefix the regex with ^ to anchor it to the beginning of each line.
^\S*\s\S*\s
Which matches:
xyz mnl pqt aaaa ccc
^^^^^^^^
yz mn ats aa cbc ddd eee ggg
^^^^^^
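As a quick cross-check outside Vim, the corrected pattern can be tried with Python's re module. This is only an illustrative sketch: Python's regex dialect differs from Vim's (in Python, [^\s] does work inside a character class), but \S and \s mean the same thing in both.

```python
import re

lines = [
    "xyz mnl pqt aaaa ccc",
    "yz mn ats aa cbc ddd eee ggg",
]

# Anchored at the start of the line: two runs of non-whitespace,
# each followed by a single whitespace character.
pattern = re.compile(r"^\S*\s\S*\s")
matches = [pattern.match(line).group(0) for line in lines]
print(matches)
```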

How to compare two columns in same file and store the difference in new file with the unchanged column according to it?

Row Actual Expected
1 AAA BBB
2 CCC CCC
3 DDD EEE
4 FFF GGG
5 HHH HHH
I want to compare actual and expected and store the difference in a file. Like
Row Actual Expected
1 AAA BBB
3 DDD EEE
4 FFF GGG
I have used awk -F, '{if ($2!=$3) {print $1,$2,$3}}' Sample.csv, but it only seems to compare integer values, not string values.
You can use AWK to do this
awk '{if($2!=$3) print $0}' oldfile > newfile
where
$2 and $3 are second and third columns
!= means second and third columns do not match
$0 means whole line
> newfile redirects to new file
I prefer an awk solution (can handle more fields and easier to understand), but you could use
sed -r '/\t([^ ]*)\t\1$/d' Sample.csv
Assuming the file uses tab or some other delimiter to separate the columns, then tsv-filter from eBay's TSV Utilities supports this type of field comparison directly. For the file above:
$ tsv-filter --header --ff-str-ne 2:3 file.tsv
Row Actual Expected
1 AAA BBB
3 DDD EEE
4 FFF GGG
The --ff-str-ne option compares two fields in a row for non-equal strings.
Disclaimer: I'm the author.
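For readers who prefer a script, the field comparison from the answers above can be sketched in Python. It assumes whitespace-separated columns and at least three fields per line; the function name differing_rows is made up for illustration.

```python
def differing_rows(lines):
    """Keep every line whose second and third fields differ as strings."""
    kept = []
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if fields[1] != fields[2]:
            kept.append(line)
    return kept


sample = [
    "Row Actual Expected",
    "1 AAA BBB",
    "2 CCC CCC",
    "3 DDD EEE",
    "4 FFF GGG",
    "5 HHH HHH",
]
result = differing_rows(sample)
print("\n".join(result))
```

As with the awk one-liner, the header line is kept simply because "Actual" and "Expected" differ.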

How to replace the character I want in a line

1 aaa bbb aaa
2 aaa ccccccccc aaa
3 aaa xx aaa
How can I replace the second aaa with yyy on each line?
1 aaa bbb yyy
2 aaa ccccccccc yyy
3 aaa xx yyy
Issuing the following command will solve your problem.
:%s/\(aaa.\{-}\)aaa/\1yyy/g
Another way would be with \zs and \ze, which mark the beginning and end of a match in a pattern. So you could do:
:%s/aaa.*\zsaaa\ze/yyy
In other words, find "aaa" followed by anything and then another "aaa", and replace only that second "aaa" with "yyy".
If you have three "aaa"s on a line, this won't work, though, and you should use \{-} instead of *. (See :h non-greedy)
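The non-greedy idea carries over to other regex engines. As a hedged sketch in Python (not Vim syntax), .*? plays the role of .\{-}, and count=1 limits the substitution to the first match on each line; the variable names are illustrative.

```python
import re

lines = ["1 aaa bbb aaa", "2 aaa ccccccccc aaa", "3 aaa xx aaa"]

# Capture the first "aaa" plus the shortest possible stretch after it,
# then replace the "aaa" that follows with "yyy".
replaced = [re.sub(r"(aaa.*?)aaa", r"\1yyy", line, count=1) for line in lines]
print(replaced)

# With three occurrences, only the second one is changed.
three_hits = re.sub(r"(aaa.*?)aaa", r"\1yyy", "aaa b aaa c aaa", count=1)
print(three_hits)
```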

Delete whole line containing given string

Is there a way to delete the whole line if it contains a specific word using sed? i.e.
I have the following:
aaa bbb ccc
qqq fff yyy
ooo rrr ttt
kkk ccc www
I want to delete lines that contain 'ccc' and leave other lines intact. In this example the output would be:
qqq fff yyy
ooo rrr ttt
All this using sed. Any hints?
sed -n '/ccc/!p'
or
sed '/ccc/d'
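The question asks for sed, so the following is only a side-by-side sketch of the same filter in Python, with the sample lines held in a list rather than read from a file.

```python
lines = [
    "aaa bbb ccc",
    "qqq fff yyy",
    "ooo rrr ttt",
    "kkk ccc www",
]

# Keep only the lines that do not contain the substring "ccc",
# which mirrors what sed '/ccc/d' prints.
kept = [line for line in lines if "ccc" not in line]
print("\n".join(kept))
```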
