How to extract the substring preceding marker?

I have a string:
[3016] - Device is ready...
[10ice is loading..13] - v3[3016] - Device is ready...
[1r 0.[3016] - Device is ready.
Everything except '[3016] - Device is ready...' is 'noise'.
The key phrase here is "Device is ready".
3016 is a timestamp in msec. I need to extract '3016' from the string for further operations.
I tried the following:
if "Device is ready" in reply:
    # set a pattern for extracting time from the result
    found = re.findall(r"\[.*\]", reply)
    # Cut timestamp from reply
    x = [tm[1:-1] for tm in found]
If the reply is 'clean' ([3016] - Device is ready...) this works, but if there is 'noise' in the reply it doesn't. Can someone point me in the right direction or perhaps assist with the code? Thanks in advance.

If there is a single key, and it should precede the marker Device is ready, you can capture the digits first.
\[(\d+)].*\bDevice is ready\b
The pattern matches:
\[(\d+)] Capture 1+ digits between square brackets in group 1
.* Match any char except a newline, 0+ times
\bDevice is ready\b Then match Device is ready between word boundaries
import re
strings = [
    "[3016] - Device is ready...",
    "[10ice is loading..13] - v3[3017] - Device is ready...",
    "[1r 0.[3018] - Device is ready.",
    "[1r 0 - Device is ready. [3019]",
]
pattern = r"\[(\d+)].*\bDevice is ready\b"
for s in strings:
    match = re.search(pattern, s)
    if match:
        print(match.group(1))
Output
3016
3017
3018
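Since the timestamp is needed for further operations, it may help to wrap the lookup in a small helper that returns an integer. The following is only a sketch and not part of the answer above: the name `ready_timestamp_ms` is hypothetical, and it assumes the wanted bracket sits directly before ` - Device is ready`, so that the bracket nearest the marker is taken even if the noise contains other bracketed numbers.

```python
import re

# Assumption: the wanted "[digits]" is directly followed by " - Device is ready".
READY_RE = re.compile(r"\[(\d+)\]\s*-\s*Device is ready\b")

def ready_timestamp_ms(reply):
    """Return the timestamp (in msec) that precedes the marker, or None."""
    match = READY_RE.search(reply)
    return int(match.group(1)) if match else None

replies = [
    "[3016] - Device is ready...",
    "[10ice is loading..13] - v3[3017] - Device is ready...",
    "[1r 0.[3018] - Device is ready.",
    "[1r 0 - Device is ready. [3019]",
]
timestamps = [ready_timestamp_ms(r) for r in replies]
print(timestamps)
```

The last reply yields None because its bracketed number comes after the marker, not before it.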

You should use a regex group () to extract the number. found will be a list of all the numbers found inside []:
if "Device is ready" in reply:
    # set a pattern for extracting time from the result
    found = re.findall(r"\[(\d+)\]", reply)
    print(found[0])

Related

Regex to find text & value in large text

As I SSH into CM, run commands and start reading the CLI output, I get the following
back:
# * A lot more output above but been removed *
terminal_output = """
[24;1H [79b[1GCommand: disp sys cust<<[23;0H[0;7m [79b[1G[0m[24;0H [79b[1G[1;0H[0;7m [79b[1G[0m[2;0H [79b[1G[3;1H[0J7[1;1H[0;7mdisplay system-parameters customer-options [0m8[1;65H[0;7mPage 1 of 12[0m[2;33HOPTIONAL FEATURES[4;8HG3 Version: [4;20HV20 [4;50HSoftware Package: [4;68HEnterprise [5;10HLocation: [5;20H2[6;10HPlatform: [6;20H28 [5;51HSystem ID (SID): [5;68H9990093751 [6;51HModule ID (MID): [6;68H1 [8;60HUSED[9;29HPlatform Maximum Ports: [9;53H 81000[9;60H 436[10;35HMaximum Stations: [10;53H 135[10;60H 110[11;27HMaximum XMOBILE Stations: [11;53H 41000[11;60H 0[12;17HMaximum Off-PBX Telephones - EC500: [12;53H 135[12;60H 2[13;17HMaximum Off-PBX Telephones - OPS: [13;53H 135[13;60H 40[14;17HMaximum Off-PBX Telephones - PBFMC: [14;53H 135[14;60H 0[15;17HMaximum Off-PBX Telephones - PVFMC: [15;53H 135[15;60H 0[16;17HMaximum Off-PBX Telephones - SCCAN: [16;53H 0[16;60H 0[17;22HMaximum Survivable Processors: [17;53H 313[17;62H 1[22;9H(NOTE: You must logoff & login to effect the permission changes.)[2;50H[0m
"""
It's a lot of ANSI escape codes (I think?) which sort of makes the output not too readable but anyways, what I'm trying to get back is the following from the text above:
Maximum Stations: 135 110
I know from my understanding that a Regex would be required for this.
The Regexes that I tried using but did not work:
r'Maximum Stations:\s*(\d+)(\d+)'
r'Maximum Stations: \d+'
If anyone knows how to filter out these ANSI character codes so they don't appear in the final output that'd be great too.
Thank you.

You can try the following:
"(Maximum Stations:)\s\[\d*;\d*H\s*(\d*)\[\d*;\d*H\s*(\d*)"gm
It produces three groups: the first holds the "Maximum Stations:" text, and the other two each hold one of the numbers you want to capture. You would have to combine the groups to get your final output.
I don't know if this will be generic enough for your application, though.
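For the second part of the question (filtering out the escape codes), one possible approach is to strip the control sequences first and then run a simple pattern on the cleaned text. This is a sketch under an assumption: it presumes the raw stream contains the ESC byte (`\x1b`) before each `[row;colH` sequence, which the pasted output above does not show. The names `ANSI_CSI` and `sample` are illustrative only.

```python
import re

# Assumption: each control sequence starts with ESC (\x1b) followed by "[".
# General CSI shape: ESC [ parameter bytes, intermediate bytes, one final byte.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

# Hypothetical fragment of the raw terminal stream, with ESC bytes restored.
sample = (
    "\x1b[10;35HMaximum Stations: \x1b[10;53H  135\x1b[10;60H  110"
    "\x1b[11;27HMaximum XMOBILE Stations: "
)

# Replace every control sequence with a space so neighbouring fields stay apart.
clean = ANSI_CSI.sub(" ", sample)

m = re.search(r"Maximum Stations:\s*(\d+)\s+(\d+)", clean)
if m:
    result = f"Maximum Stations: {m.group(1)} {m.group(2)}"
    print(result)
```

Replacing with a space instead of an empty string keeps `135` and `110` from running together once the cursor-positioning codes are gone.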

Split a big text file into multiple smaller one on set parameter of regex

I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text file into multiple small text files and save them as .txt on my system at each occurrence of ..... [multiple period markers], saved like
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I have figured out that it can be done with a regex along the lines of the following
txt =re.sub(r'(([^\w\s])\2+)', r' ', txt).strip() #for letters more than 2 times
but I have not been able to work it out completely.
The saved text files should be named group1_sdsdsd.txt, group1_ddss.txt and group1_xxxx.txt (group1 being the identifier for the specific big text file: I have multiple big text files, need to do the same on all of them, and want to know which big text file each part came from).

If the lines of dots always stand on their own line, with the name on the line between them, you can get the separate parts with a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of the line (the pattern is used with re.MULTILINE)
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
Example code
import re

pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
your_path = "/your/path/"

for match in re.finditer(pattern, s, re.MULTILINE):
    # group 1 is the name found between the two lines of dots
    with open(your_path + "group1_{}.txt".format(match.group(1)), 'w') as f:
        f.write(match.group())
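Because the question mentions several big files, each needing its own identifier, the same pattern could be wrapped in a function that derives the prefix from the source file name. This is a sketch only: `split_big_file` is a hypothetical helper, it assumes the identifier is the file's stem (for example `group1` for `group1.txt`), and the demonstration data is made up.

```python
import re
import tempfile
from pathlib import Path

SECTION_RE = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*",
    re.MULTILINE,
)

def split_big_file(big_file, out_dir):
    """Write one <stem>_<name>.txt per section and return the created paths."""
    big_file, out_dir = Path(big_file), Path(out_dir)
    text = big_file.read_text()
    created = []
    for match in SECTION_RE.finditer(text):
        # Assumption: the identifier is the big file's name without extension.
        target = out_dir / f"{big_file.stem}_{match.group(1)}.txt"
        target.write_text(match.group())
        created.append(target)
    return created

# Small demonstration in a temporary directory (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "group1.txt"
    source.write_text(
        "....\nsdsdsd\n......\nfirst body.\n......\nddss\n.....\nsecond body.\n"
    )
    names = sorted(p.name for p in split_big_file(source, tmp))
    first_part = (Path(tmp) / "group1_sdsdsd.txt").read_text()
    print(names)
```

Calling the function once per big file keeps the outputs of different source files apart through their prefixes.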

Remove leading dollar sign from data and improve current solution

I have string like so:
"Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
I'd like to retrieve the date from the table name. So far I use:
"\$[0-9]+"
but this yields $20220316. How do I get only the date, without $?
I'd also like to get the table name: n_Cars_12345678$20220316
So far I have this:
pattern_table_info = "\(([^\)]+)\)"
pattern_table_name = "(?<=table ).*"
table_info = re.search(pattern_table_info, message).group(1)
table = re.search(pattern_table_name, table_info).group(0)
However, I'd like a simpler solution. How can I improve this?
EDIT:
Actually the table name should be:
n_Cars_12345678
So everything before the "$" sign and after "table"...how can this part of the string be retrieved?

You can use a regex with two capturing groups:
table\s+([^()]*)\$([0-9]+)
See the regex demo. Details:
table - a word
\s+ - one or more whitespaces
([^()]*) - Group 1: zero or more chars other than ( and )
\$ - a $ char
([0-9]+) - Group 2: one or more digits.
See the Python demo:
import re
text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
rx = r"table\s+([^()]*)\$([0-9]+)"
m = re.search(rx, text)
if m:
    print(m.group(1))
    print(m.group(2))
Output:
n_Cars_1234567
20220316

You can write a single pattern with 2 capture groups:
\(table (\w+\$(\d+))\)
The pattern matches:
\(table
( Capture group 1
\w+\$ match 1+ word characters and $
(\d+) Capture group 2, match 1+ digits
) Close group 1
\) Match )
import re
s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
m = re.search(r"\(table (\w+\$(\d+))\)", s)
if m:
    print(m.group(1))
    print(m.group(2))
Output
n_Cars_1234567$20220316
20220316
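If the date is needed as a real date value and the table name without the `$` suffix (as in the question's EDIT), named groups can make the result easier to read. This is a sketch, not either answerer's code: `parse_table_info` is a hypothetical helper, and it assumes the suffix is always eight digits in YYYYMMDD order.

```python
import re
from datetime import datetime

# Assumption: the part after "$" is always an 8-digit YYYYMMDD date.
TABLE_RE = re.compile(r"\(table\s+(?P<name>\w+)\$(?P<date>\d{8})\)")

def parse_table_info(message):
    """Return (table_name, date) or None when the message has no table part."""
    m = TABLE_RE.search(message)
    if m is None:
        return None
    day = datetime.strptime(m.group("date"), "%Y%m%d").date()
    return m.group("name"), day

message = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
info = parse_table_info(message)
print(info)
```

`strptime` raises `ValueError` for an impossible date such as `20221340`, which may or may not be the behaviour you want.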

regex to match paragraph in between 2 substrings

I have a string that looks like this:
string=""
( 2021-07-10 01:24:55 PM GMT )TEST
---
Badminton is a racquet sport played using racquets to hit a shuttlecock across
a net. Although it may be played with larger teams, the most common forms of
the game are "singles" (with one player per side) and "doubles" (with two
players per side).
( 2021-07-10 01:27:55 PM GMT )PATRICKWARR
---
Good morning, I am doing well. And you?
---
---
* * *""
I am trying to split the string into parts like this:
text=['Badminton is a racquet sport played using racquets to hit a
shuttlecock across a net. Although it may be played with larger teams,
the most common forms of the game are "singles" (with one player per
side) and "doubles" (with two players per side).','Good morning, I am
doing well. And you?']
What I have tried as:
text=re.findall(r'\( \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} PM GMT \)\w+ [\S\n]--- .*',string)
I can't work out how to extract multiple lines.

You can use
(?m)^\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n---\s*\n(.*(?:\n(?!(?:\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n)?---).*)*)
See the regex demo. Details:
^ - start of line
{left_rx} - left boundary
--- - three hyphens
\s*\n - zero or more whitespaces and then an LF char
(.*(?:\n(?!(?:{left_rx})?---).*)*) - Group 1:
.* - zero or more chars other than line break chars as many as possible
(?:\n(?!(?:{left_rx})?---).*)* - zero or more (even empty, due to .*) lines that do not start with the (optional) left boundary pattern followed with ---
The boundary pattern defined in left_rx is \(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n, it is basically the same as the original, I used \s* to match any zero or more whitespaces or \s+ to match one or more whitespaces between "words".
See the Python demo:
import re
text = '''string=""\n( 2021-07-10 01:24:55 PM GMT )TEST \n--- \nBadminton is a racquet sport played using racquets to hit a shuttlecock across\na net. Although it may be played with larger teams, the most common forms of\nthe game are "singles" (with one player per side) and "doubles" (with two\nplayers per side). \n \n \n\n \n\n( 2021-07-10 01:27:55 PM GMT )PATRICKWARR \n--- \nGood morning, I am doing well. And you? \n \n \n\n \n \n \n--- \n \n \n \n \n--- \n \n* * *""'''
left_rx = r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n"
rx = re.compile(fr"^{left_rx}---\s*\n(.*(?:\n(?!(?:{left_rx})?---).*)*)", re.M)
print ( [x.strip().replace('\n', ' ') for x in rx.findall(text)] )
Output:
['Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).', 'Good morning, I am doing well. And you?']

One of the approaches:
import re
# Replace every \n with a space so that words on wrapped lines stay separated
string = string.replace('\n', ' ')
# Replace the date string '( 2021-07-10 01:27:55 PM GMT )PATRICKWARR ' and strings like '* * *' with ''
string = re.sub(r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} [AP]M GMT\s*\)\w+|\*+", '', string)
data = string.split('---')
data = [item.strip() for item in data if item.strip()]
print (data)
Output:
['Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).', 'Good morning, I am doing well. And you?']

splitting a file.txt into two file with a condition

How can I split the given file into two different files, one for result codes and one for warning codes? Below is a single text file; I want to split it into two files, and I have many more files like it to split.
Result Codes:
0 - SYS_OK - "Ok"
1 - SYS_ERROR_E - "System Error"
1001 - MVE_SYS_E - "MTE System Error"
1002 - MVE_COMMAND_SYNTAX_ERROR_E - "Command Syntax is wrong"
Warning Codes:
0 - SYS_WARN_W - "System Warning"
100001 - MVE_SYS_W - "MVE System Warning"
200001 - SLEA_SYS_W - "SLEA System Warning"
200002 - SLEA_INCOMPLETE_SCRIPTED_OVERRIDE_COMMAND_W - "One or more of the entered scripted override commands has missing mandatory parameters"
300001 - L1_SYS_W - "L1 System Warning"

Well, on first glance, the distinction seems to be that "warnings" all contain the character sequence _W - and anything that doesn't is "results". Did you notice that?
awk '/_W -/{print >"warnings";next}{print >"results"}'

Here is a Python solution.
I am assuming you have the list of warning codes.
import re

# Collect the known warning codes from warning-codes.txt
warn_codes = set()
with open('warning-codes.txt') as warnings:
    for line in warnings:
        m = re.search(r'(\d+) .*', line)
        if m:
            warn_codes.add(m.group(1))

with open('log.txt') as log_file, \
        open('output-warnings.txt', 'w') as ow, \
        open('output-results.txt', 'w') as ors:
    for line in log_file:
        m = re.search(r'(\d+) .*', line)
        # each line keeps its own newline, so none is appended
        if m and m.group(1) in warn_codes:
            ow.write(line)
        elif m:
            ors.write(line)
        else:
            print("none")
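Note that in the sample file the code 0 appears under both headings, so classifying by the number alone is ambiguous. An alternative sketch is to track which heading was seen last. The helper `split_by_heading` is hypothetical, and it assumes the headings are spelled exactly `Result Codes:` and `Warning Codes:` on their own lines.

```python
def split_by_heading(lines):
    """Group non-empty lines under the heading that precedes them."""
    sections = {"Result Codes:": [], "Warning Codes:": []}
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped in sections:
            # A heading line switches the active section.
            current = stripped
        elif stripped and current is not None:
            sections[current].append(stripped)
    return sections

# Shortened copy of the sample from the question.
sample = """Result Codes:
0 - SYS_OK - "Ok"
1 - SYS_ERROR_E - "System Error"
Warning Codes:
0 - SYS_WARN_W - "System Warning"
100001 - MVE_SYS_W - "MVE System Warning"
"""
sections = split_by_heading(sample.splitlines())
results = sections["Result Codes:"]
warnings = sections["Warning Codes:"]
print(len(results), len(warnings))
```

Each list can then be written to its own file, for example with `"\n".join(results)`; for a real file, pass the open file object instead of `sample.splitlines()`.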
