Regex splitting on newline outside of quotes in VBA Macros [closed] - excel

I have a file that contains pipe-separated strings. I want to split its content on newlines that fall outside double quotes, using VBA's Split function if possible.
The file data looks like this.
fileStr = abc|hbd|hss
abd|"shs
hshs"|jdjd
hddn|hddd|sdjdd
The desired output is as follows:
Row 1 -> abc|hbd|hss
Row 2 -> abd|"shs
hshs"|jdjd
Row 3 -> hddn|hddd|sdjdd
I have tried Split(strData, vbNewLine), but it's not working.
Can you please give a code snippet that I can use directly in my VBA code?
Note: I need this in a VBA macro, not in another language.

I suggest you match, rather than split, using the following regular expression (with re.Global = True).
(?=.)[^"\r\n]*(?:"(?:[^"]*")+)?[^"\r\n]*
The expression can be broken down as follows.
(?=.)            # positive lookahead asserts that a character follows
[^"\r\n]*        # match >= 0 characters other than double-quotes
                 # and line terminators
(?:              # begin a non-capture group
  "              # match a double-quote
  (?:            # begin a non-capture group
    [^"]*        # match >= 0 characters other than double-quotes
    "            # match a double-quote
  )+             # end inner non-capture group and execute >= 1 times
)?               # end outer non-capture group and make it optional
[^"\r\n]*        # match >= 0 characters other than double-quotes
                 # and line terminators
The purpose of the positive lookahead at the beginning is to avoid matching empty strings.
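As a quick sanity check of the pattern (not a VBA solution), the same expression can be run through Python's re module, where findall returns every match in the way a global match would. The sample text mirrors the question's data with plain \n line breaks, and the names ROW_PATTERN, file_str and rows are made up for this sketch.

```python
import re

# Same pattern as in the answer above.
ROW_PATTERN = re.compile(r'(?=.)[^"\r\n]*(?:"(?:[^"]*")+)?[^"\r\n]*')

# Sample data from the question; the quoted field spans two lines.
file_str = 'abc|hbd|hss\nabd|"shs\nhshs"|jdjd\nhddn|hddd|sdjdd'

# Each match is one logical row; the newline inside the quotes is kept.
rows = ROW_PATTERN.findall(file_str)
for number, row in enumerate(rows, start=1):
    print(f"Row {number} -> {row!r}")
```

The second row comes back as a single match even though it contains a newline, which is the behaviour the question asks for.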

Related

Need guidance with Regular Expression in Python

I need help with one of my current tasks, in which I am trying to pick only the table names out of a query via Python.
Let's say the query looks like this:
Create table a.dummy_table1
as
select a.dummycolumn1,a.dummycolumn2,a.dummycolumn3 from dual
I pass this query into Python using StringIO and then extract only the strings that start with "a." and contain "_", like this:
table_list = set(re.findall(r'\ba\.\w+', str(data)))
Here data is the dataframe into which I have parsed the query using StringIO.
In table_list I am getting the output below:
a.dummy_table1
a.dummycolumn1
a.dummycolumn2
whereas the expected output is:
a.dummy_table1
Please let me know how to get this done. I have tried the above regular expression, but it is not working properly.
Any help would be highly appreciated.
Your current regex r"\ba\.\w+" simply matches any string which:
Begins with "a" at a word boundary (the "\ba" part)
Followed by a period (the "\." part)
Followed by 1 or more word characters, that is letters, digits, or underscores (the "\w+" part).
Because nothing in it requires an underscore, it also matches the column references.
If I've understood your problem correctly, you are looking to extract from str(data) any string fragments which match this pattern instead:
Begins with "a"
Followed by a period
Followed by 1 or more word characters
Followed by an underscore
Followed by 1 or more word characters
Thus, the regular expression should have "_\w+" added to the end to match criteria 4 and 5:
table_list = set(re.findall(r"\ba\.\w+_\w+", str(data)))
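A minimal sketch comparing the two patterns on the query from the question, using a plain string instead of the dataframe. The last pattern is a hypothetical alternative, not part of the answer: it assumes the table name always follows the words CREATE TABLE, so it does not depend on the name containing an underscore.

```python
import re

query = (
    "Create table a.dummy_table1\n"
    "as\n"
    "select a.dummycolumn1,a.dummycolumn2,a.dummycolumn3 from dual"
)

# Original pattern: also picks up the column references.
all_refs = set(re.findall(r"\ba\.\w+", query))

# Pattern from the answer: requires an underscore after the "a." prefix.
table_list = set(re.findall(r"\ba\.\w+_\w+", query))

# Hypothetical alternative: capture the name that follows CREATE TABLE.
created = re.findall(r"(?i)\bcreate\s+table\s+([\w.]+)", query)

print(sorted(all_refs))
print(table_list)
print(created)
```

The underscore rule only works while table names contain an underscore and column names do not, so which pattern fits depends on the naming conventions in the real queries.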

Strip characters to the left of a specific character in a pandas column

I have the following data:
key German
0 0:- Profile 1
1 1:- Archetype Realist*in
2 2:- RIASEC Code: R- Realistic
3 3:- Subline Deine Stärke? Du bleibst dir selber treu.
4 4:- Copy Dein Erfolg basiert auf deiner praktischen Ver...
In the "Key" column I would like to remove the numbers and colon dash which follows. This order is always the same (from the left). So for the first row I would like to remove "0:- ", and just leave "Profile 1". I am struggling to find the correct regex expression to do what I want. Originally I tried the following:
df_json['key'] = df_json['key'].map(lambda x: x.strip(':- ')[1])
However, this approach is too restrictive since there can be multiple words in the field.
I would like to use pd.Series.str.replace(), but I can't figure out the correct regular expression to achieve the desired result. Any help would be greatly appreciated.
With your shown samples, please try the following. It applies the pandas replace function to the German column of the dataframe and uses the regex ^[0-9]+:-\s+ to replace the matched prefix with an empty string.
df['German'] = df['German'].replace(r'^[0-9]+:-\s+', '', regex=True)
Explanation:
^[0-9]+: matches one or more digits at the start of the value.
:-\s+: matches a colon, followed by a hyphen, followed by 1 or more whitespace characters.
What about just using pandas.Series.str.partition instead of regular expressions:
df['German'] = df['German'].str.partition()[2]
This would split the series on the 1st space only and grab the trailing part. Alternatively to partition you could also just split:
df['German'] = df['German'].str.split(' ', n=1).str[1]
If regex is a must for you, maybe use a lazy quantifier to match up to the 1st space character:
df['German'] = df['German'].replace('^.*? +','', regex=True)
Where:
^ - Start line anchor.
.*? - Any 0+ (lazy) characters other than newline, up to;
" +" - 1+ literal space characters.
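The three variants above can be compared on plain Python strings, since str.partition, str.split and re.sub follow the same logic as their pandas .str counterparts. This is only a sketch on two sample values from the question; the variable names are made up.

```python
import re

samples = [
    "0:- Profile 1",
    "3:- Subline Deine Stärke? Du bleibst dir selber treu.",
]

results = []
for value in samples:
    by_partition = value.partition(" ")[2]        # everything after the 1st space
    by_split = value.split(" ", 1)[1]             # split once, keep the tail
    by_regex = re.sub(r"^.*? +", "", value)       # lazy match up to the 1st space(s)
    results.append((by_partition, by_split, by_regex))
    print(by_partition)
```

All three agree on these inputs because each value contains at least one space after the prefix.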
You need
df_json['key'] = df_json['key'].str.replace(r'^\d+:-\s*', '', regex=True)
Details:
^ - start of string
\d+ - one or more digits
: - a colon
- - a hyphen
\s* - zero or more whitespaces
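The pattern can be checked outside pandas with the standard re module. This is a small sketch: the first two keys come from the question, while "12:- Copy" and "No prefix" are made-up values to show a multi-digit prefix and a value without any prefix.

```python
import re

PREFIX = re.compile(r"^\d+:-\s*")

keys = [
    "0:- Profile",
    "2:- RIASEC Code: R- Realistic",
    "12:- Copy",       # hypothetical multi-digit prefix
    "No prefix",       # hypothetical value without a prefix
]

# Only the leading "digits:-" part is removed; later colons and hyphens stay.
cleaned = [PREFIX.sub("", key) for key in keys]
print(cleaned)
```

Because the pattern is anchored with ^, the ": R- " inside the third sample row is left untouched.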
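Alternatively, extract the non-whitespace (\S) and non-digit (\D) characters that immediately follow the unwanted prefix. Note that this lookbehind only handles a single-digit prefix, and \D+ stops at the first digit:
df['GermanFiltered'] = df['German'].str.extract(r"((?<=^\d:-\s)\S+\D+)", expand=False)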

Python - catch only specific tokens from file and ignore one line and multi line comments and new-lines

Given a file (this is the input), I have to tokenize it (so to speak) - I have to retrieve only the following 5 types of tokens:
identifiers - A sequence of letters, digits, and underscore ('_') not starting with a digit - like the built-in function - isidentifier().
integers - A decimal number in the range 0 .. 32767.
strings - '"' A sequence of Unicode characters not including double-quote or newline '"'.
keywords - as specified in the below code.
Symbols - as specified in the below code.
So I actually tried to retrieve only those relevant chunks from the file, but some content of comments may seem to match a symbol/keyword, and that's the problem.
The code I have come up with, to retrieve only the relevant tokens:
identifier_regex = r'\w+'
integer_regex = r'\d+'
string_regex = r'".*"'
keyword_regex = ('class|method|function|constructor|int|boolean|char|void|'
                 'var|static|field|let|do|if|else|while|return|true|false|'
                 'null|this')
symbol_regex = r'{|}|\[|\]|\(|\)|\.|,|;|\+|-|\*|\/|&|\||<|>|=|~'
composed_regex = r'({}|{}|{}|{}|{})'.format(identifier_regex,
                                            integer_regex,
                                            string_regex,
                                            keyword_regex,
                                            symbol_regex)
Possible types of comments:
// Comment to end of line
/* Comment until closing */
/** API documentation comment */
So I think the only remaining problem is skipping empty lines and comments (one-line comments with // and multi-line comments with /* */ or /** */). For that I have come up with this regex:
\s+|//.*
1. But \s can also ignore relevant whitespaces from the file, no?
2. //.* is for catching one-line comments.
3. Regarding multi-line comments, I don't know how to handle them, as it can be spread over multiple lines...
Sample input and output example:
Here is an image displaying a sample input and its expected output
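One possible approach to the comment problem, sketched under assumptions: remove comments first with a single regex that also matches string literals, so that comment markers inside a string are left alone, and let re.DOTALL carry /* ... */ across lines. The names COMMENT_OR_STRING, TOKEN, strip_comments and tokenize are hypothetical, and the sketch neither checks the 0 .. 32767 integer range nor separates keywords from identifiers.

```python
import re

# Strings are matched first so that "//" or "/*" inside them is not
# treated as the start of a comment.
COMMENT_OR_STRING = re.compile(
    r'"[^"\n]*"'      # string literal: kept as-is
    r'|//[^\n]*'      # one-line comment
    r'|/\*.*?\*/',    # block comment, including /** ... */
    re.DOTALL,        # let "." cross line breaks inside block comments
)

# Token pattern: string, integer, identifier (or keyword), single symbol.
TOKEN = re.compile(r'"[^"\n]*"|\d+|[A-Za-z_]\w*|[{}\[\]().,;+\-*/&|<>=~]')


def strip_comments(source):
    """Replace every comment with one space and keep string literals."""
    def keep_strings(match):
        text = match.group(0)
        return text if text.startswith('"') else " "
    return COMMENT_OR_STRING.sub(keep_strings, source)


def tokenize(source):
    """Return the raw token strings; whitespace is skipped implicitly."""
    return TOKEN.findall(strip_comments(source))


sample = (
    'class Main { // entry point\n'
    '  /** doc\n'
    '      spans lines */\n'
    '  let s = "a // b"; /* x */ let n = 42;\n'
    '}'
)
tokens = tokenize(sample)
print(tokens)
```

Whitespace needs no pattern of its own here: findall simply skips anything that no alternative matches, so \s never swallows part of a token. A keyword can be told apart from an identifier afterwards by looking the matched text up in a set of keywords.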

Python regex search until specific word and exclude everything behind it

I have a script in which the string always contains "get the" followed later by a second "get".
The "ONE TWO THREE" part can vary; it could also be "THIRTEEN FORTY" or "SIX". After these variations there will always be a second "get".
I have the following code:
variable = 'get the ONE TWO THREE get FOUR FIVE'
myVariable = re.compile(r'(?<=get the) .*')
myVariableSearch = myVariable.search(variable)
mySearchGroup = myVariableSearch.group()
print(mySearchGroup)
#prints ONE TWO THREE get FOUR FIVE
I want my script to exclude the 2nd "get" and everything behind it. My desired result is to be just the "ONE TWO THREE".
How do I exclude this? Any help would be appreciated!
You can use
\bget\s+the\s+(.*?)(?=\s*\bget\b|$)
Details
\bget\s+the\s+ - whole word get, 1+ whitespaces, the, 1+ whitespaces
(.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
(?=\s*\bget\b|$) - a positive lookahead that requires 0+ whitespaces and then a whole word get, or end of string immediately on the right of the current location.
See the Python demo:
import re
variable = 'get the ONE TWO THREE get FOUR FIVE'
myVariableSearch = re.search(r'\bget\s+the\s+(.*?)(?=\s*\bget\b|$)', variable)
mySearchGroup = ''
if myVariableSearch:
    mySearchGroup = myVariableSearch.group(1)
print(mySearchGroup)
# => ONE TWO THREE

Other than text how to remove numbers , punctuation, white spaces and special characters from text? [duplicate]

I just scraped text data from a website, and that data contains numbers, special characters, and punctuation. After splitting the data I tried to keep only the plain text, but I'm still getting spaces, numbers, and special characters. How can I remove all of those and keep only the text?
url = 'www.example.com'
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
extracted_data = text.split()
refined_data = []
SYMBOLS = '{}()[].,:;+-*/&|<>=~0123456789'
for i in extracted_data:
    if i not in SYMBOLS:
        refined_data.append(i)
print("\n", "$" * 50, "HEYAAA we got arround: ", len(refined_data), " of keywords! Here are they: ","$" * 50, "\n")
print(type(refined_data))
output:
1.My
2.system
3.showing
4.error
5.404
6.I
7.don't
8.understand
9.why
10. it
11. showing ,
12.like
13.this?
14.53251
15.$45
extracted_data is the result of text.split().
Called without arguments, str.split() splits your text on any run of whitespace.
The not in operator then checks whether i (the entire token) occurs in SYMBOLS. SYMBOLS is a single string, so this is a substring test: it is true only if the whole token appears somewhere inside that string.
So is 'system' in SYMBOLS? No, it is not. Therefore, the body of your if statement is executed and the token is appended to refined_data.
Is '53251' in SYMBOLS? No, it is not a substring of '0123456789'. Therefore, it is appended as well.
And so on.
Such a comparison only ever drops whole tokens; it never removes unwanted characters from a token. You should be using str.strip(), which removes the given characters from both ends of a string.
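A minimal sketch of the str.strip() idea, using the words from the question's output as input. One assumption to flag: instead of the question's SYMBOLS it strips string.punctuation plus string.digits, a wider set that also covers characters such as ? and $. The variable names are made up.

```python
import string

SYMBOLS = '{}()[].,:;+-*/&|<>=~0123456789'

# "in" on two strings is a substring test, not a per-character test.
print('45' in SYMBOLS)    # contiguous run inside '0123456789'
print('404' in SYMBOLS)   # not a contiguous run, so not found

text = ("My system showing error 404 I don't understand why it "
        "showing , like this? 53251 $45")

# Assumption: strip all ASCII punctuation and digits, not just SYMBOLS.
unwanted = string.punctuation + string.digits

refined_data = []
for token in text.split():
    word = token.strip(unwanted)   # trims both ends only
    if word:                       # drop tokens that were nothing but symbols
        refined_data.append(word)
print(refined_data)
```

Since strip() only trims the ends, the apostrophe inside "don't" survives; removing characters from the middle of a token would need str.translate or a regular expression instead.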
