extract substring from large string

extract substring from large string - python-3.x

I have a string as:
string="(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"
I want to extract the conversation text only (text in between name and timestamp) expected output as:
comments=['Good Morning How're you?','Hi, I'm Good.','we got the
update on work.It will get complete by next week.']
What I have tried is:
comments=re.findall(r'---\s*\n(.(?:\n(?!(?:(\s\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*)\w+\s*\n)?---).))',string)

You could use a single capture group:
^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)
The pattern matches:
^ Start of string
---\s*\n Match --- optional whitespace chars and a newline
(?!.* has (?:joined|left) the conversation|\* \* \*) Assert that the line does not contain a has joined or has left the conversation part, or contains * * *
\S.* Match at least a non whitespace char at the start of the line and the rest of the line
( Capture group 1 (this will be returned by re.findall)
(?:\n(?!\(\d|---).*)* Match all lines the do not start with ( and a digit or --
) Close group 1
See a regex demo and a Python demo.
Example
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
result = [m.strip() for m in re.findall(pattern, s, re.M) if m]
print(result)
Output
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']

I've assumed:
The text of interest begins after a block of three lines: a line containing a timestamp, followed by the line "---", which may be padded to the right with spaces, followed by a line comprised of a string of letters containing one period which is neither at the beginning nor end of that string and that string may be padded on the right with spaces.
The block of text of interest may contain blank lines, a blank line being a string that contains nothing other than spaces and a line terminator.
The last line of the block of text of interest cannot be a blank line.
I believe the following regular expression (with multiline (m) and case-indifferent (i) flags set) meets these requirements.
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)
The blocks of lines of interest are contained in capture group 1.
Start your engine!
The elements of the expression are as follows.
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n # match timestamp line
-{3} *\r?\n # match 3-hyphen line
[a-z]+\.[a-z]+ *\r?\n # match name
( # begin capture group 1
(?: # begin non-capture group (a)
.*[^ (\n].*\r?\n # match a non-blank line
| # or
\ *\r?\n # match a blank line
(?= # begin a positive lookahead
(?: # begin non-capture group (b)
\ *\r?\n # match a blank line
)* # end non-capture group b and execute 0+ times
(?! # begin a negative lookahead
\(\d{4}\-\d{2}\-\d{2} .*\) # match timestamp line
) # end negative lookahead
.*[^ (\n] # march a non-blank line
) # end positive lookahead
)* # end non-capture group a and execute 0+ times
) # end capture group 1

Here is a self-documenting regex that will strip leading and trailing whitespace:
(?x)(?m)(?s) # re.X, re.M, re.S (DOTALL)
(?: # start of non capturing group
^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n # date and time
(?!---\s*\r?\nad\.ft has) # next lines are not the ---\n\ad.ft etc.
---\s*\r?\n # --- line
[\w.]+\s*\r?\n # name line
\s* # skip leading whitespace
) # end of non-capture group
# The folowing is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
See Regex Demo
See Python Demo
import re
string = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""
regex = r'''(?x)(?m)(?s) # re.X, re.M, re.S (DOTALL)
(?: # start of non capturing group
^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n # date and time
(?!---\s*\r?\nad\.ft has) # next lines are not the ---\n\ad.ft etc.
---\s*\r?\n # --- line
[\w.]+\s*\r?\n # name line
\s* # skip leading whitespace
) # end of non-capture group
# The folowing is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
'''
matches = re.findall(regex, string)
print(matches)
Prints:
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work.\nIt will get complete by next week.']

Related

Split a big text file into multiple smaller one on set parameter of regex

I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text files into multiple small text files and save them as .txt in my system on occurences of ..... [multiple period markers] saved like
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I have figured that by usinf regex of sort of following can be done
txt =re.sub(r'(([^\w\s])\2+)', r' ', txt).strip() #for letters more than 2 times
but not able to figure out completely.
The saved text files should be named as group1_sdsdsd.txt , group1_ddss.txt and group1_xxxx.txt [group1 being identifier for the specific big text file as I have multiple bigger text files and need to do same on all to know which big text file i am splitting.

If you want to get the parts with multiple dots only on the same line, you can use and get the separate parts, you might use a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of string
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
See a regex demo and a Python demo with the separate parts.
Example code
import re
pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
matches = re.finditer(pattern, s, re.MULTILINE)
your_path = "/your/path/"
for matchNum, match in enumerate(matches, start=1):
f = open(your_path + "group1_{}".format(match.group(1)), 'w')
f.write(match.group())
f.close()

Python - Disect/Tokenize and Iterate Over Segments of Multilined Text with re

Assuming a VCD file with a structure like the one that follows as a minimum example:
#0 <--- section
b10000011#
0$
1%
0&
1'
0(
0)
#2211 <--- section
0'
#2296 <--- section
b0#
1$
#2302 <--- section
0$
I want to split the whole thing into timestamp sections and search in every one of them for certain values. That is to first isolate the section inbetween the #0 and #2211 timestamp, then the section inbetween the #2211 and #2296 and so on.
I am trying to do this with python in the following way.
search_space = "
#0
b10000011#
0$
1%
0&
1'
0(
0)
#2211
0'
#2296
b0#
1$
#2302
0$"
# the "delimiter"
timestamp_regex = "\#[0-9]+(.*)\#[0-9]+"
for match in re.finditer(timestamp_regex, search_space, flags=re.DOTALL|re.MULTILINE):
print(match.groups())
But it has no effect. What is the proper way to handle such scenario with the re package?

You need to use a lazy quantifier ? here.
I made some little changes like this:
timestamp_regex = r"(\#[0-9]+)(.+?)(?=\#[0-9]+|\Z)"
for match in re.finditer(timestamp_regex, search_space, flags=re.DOTALL|re.MULTILINE):
print(f"section: {match.group(1)}\nchunk:{match.group(2)}\n----")
output:
section: #0
chunk:
b10000011#
0$
1%
0&
1'
0(
0)
----
section: #2211
chunk:
0'
----
section: #2296
chunk:
b0#
1$
----
section: #2302
chunk:
0$
----
Check the pattern at Regex101
Details:
(\#[0-9]+) - 1st capturing group consisting of # and one or more digits
(.+?) - 2nd capturing group - match anything one or more times non-greedy (match as little as possible)
(?=\#[0-9]+|\Z) - Positive lookahead on \#[0-9]+ OR \Z which is the end of your input string (2nd capturing group is followed by either another section or the end of string). End of string is needed here because for the last section there is only the chunk and no following #[0-9]+, so the chunk is followed by end of string.

Remove leading dollar sign from data and improve current solution

I have string like so:
"Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
I'd like to retieve the datte from the table name, so far I use:
"\$[0-9]+"
but this yields $20220316. How do I get only the date, without $?
I'd also like to get the table name: n_Cars_12345678$20220316
So far I have this:
pattern_table_info = "\(([^\)]+)\)"
pattern_table_name = "(?<=table ).*"
table_info = re.search(pattern_table_info, message).group(1)
table = re.search(pattern_table_name, table_info).group(0)
However I'd like to have a more simpler solution, how can I improve this?
EDIT:
Actually the table name should be:
n_Cars_12345678
So everything before the "$" sign and after "table"...how can this part of the string be retrieved?

You can use a regex with two capturing groups:
table\s+([^()]*)\$([0-9]+)
See the regex demo. Details:
table - a word
\s+ - one or more whitespaces
([^()]*) - Group 1: zero or more chars other than ( and )
\$ - a $ char
([0-9]+) - Group 2: one or more digits.
See the Python demo:
import re
text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
rx = r"table\s+([^()]*)\$([0-9]+)"
m = re.search(rx, text)
if m:
print(m.group(1))
print(m.group(2))
Output:
n_Cars_1234567
20220316

You can write a single pattern with 2 capture groups:
\(table (\w+\$(\d+))\)
The pattern matches:
\(table
( Capture group 1
\w+\$ match 1+ word characters and $
(\d+) Capture group 2, match 1+ digits
) Close group 1
\) Match )
See a Regex demo and a Python demo.
import re
s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
m = re.search(r"\(table (\w+\$(\d+))\)", s)
if m:
print(m.group(1))
print(m.group(2))
Output
n_Cars_1234567$20220316
20220316

ConfigObj - indentation missing for multiline string value in dict while having set the indentation option

How do I set indentation for string-based value spanning multiple lines in ConfigObj such that the second line, etc. does surpass the delimiter?
For example:
about = {'Info' : {'Purpose': 'blabla continues for fixed chars ...\
\n and another line of bla ... etc.'}} # here nicely aligned under "b" from bla.
# 11 white-spaces.
config = ConfigObj(indent_type= 3*' ', interpolation=True, encoding='utf8')
config.filename = 'lol.ini'
config['About'] = about
config.write()
this result shows in the ini file as:
[About]
[[Info]]
Purpose = '''blabla continues for fixed chars ...
and another line of bla ... etc.''' # here the indentation goes sub-optimal/wrong.
# 11 white spaces but missing the indentations (6 white-spaces)
For two levels the indentation shift would be 6 white-spaces to add (for "About"and for "Info" 3 earch). Apparently, "interpolation=True" is not what does the trick. Any suggestions?
Configobj ver. = 5.0.6
Py = 3.9

awk-insert row with specific text within specific position

I have a file where the first couple of rows start with # mark, then follow the classical netlist, where also can be there rows begin with # mark. I need to insert one row with text protect between block of first rows begining on # and first row of classical netlist. In the end of file i need insert row with word unprotect. It will be good to save this modified text to new file with specific name because of the original file protected.
Sample file:
// Generated for: spectre
// Design library name: Kovi
// Design cell name: T_Line
// Design view name: schematic
simulator lang=spectre
global 0
parameters frequency=3.8G Zo=250
// Library name: Kovi
// Cell name: T_Line
// View name: schematic
T8 (7 0 6 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T7 (net034 0 net062 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T5 (net021 0 4 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T4 (net019 0 2 0) tline z0=Zo f=3.8G nl=0.5 vel=1

How about sed
sed -e '/^#/,/^#/!iprotect'$'\n''$aunprotect'$'\n' input_file > new_file
Inserts 'protect' on a line by itself after the first block of commented lines, then adds 'unprotect' at the end.
Note: Because I use $'\n' in place of literal newline bash is assumed as the shell.

Since you awk'd the post
awk 'BEGIN{ protected=""} { if($0 !~ /#/ && !protected){ protected="1"; print "protect";} print $0}END{print "unprotect";}' input_file > output_file
As soon a row is detected without # as the first non-whitespace character, it will output a line with protect. At the end it will output a line for unprotect.
Test file
#
#
#
#Preceded by a tab
begin protect
#
before unprotect
Result
#
#
#
#Preceded by tab
protect
begin protect
#
before unprotect
unprotect
Edit:
Removed the [:space:]* as it seems that is already handled by default.
Support //
If you wanted to support both # and // in the same script, the regex portion would change to /#|\//. The special character / has to be escaped by using \.
This would check for at least one /.
Adding a quantifier {2} will match // exactly: /#|\/{2}/

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string