Separate complex text from number patterns from a single string - string

I have a text file that looks something like this (but with hundreds of lines):
1147-1 SYRUP: DR.PEPPER 5GALLON/BOX
1653-1 SYRUP: DIET DR.PEPPER 5GAL/BOX
2011-2 WATER DISTILLED 6 / 1 GA
1217-2 ALL PURPOSE RASPBERRY FIL 40#
1273-1 STRAWBERRY PIE FILLING 38#
2893-1 BREAD: SOURDOUGH 12/1# OVAL
2287-1 BREAD SQUAW: 8/2.25#LF
1929-1 VINEGAR HERB CONT GRDN 12/12.7
1949-2 KETCHUP: 16/14OZ-PLASTIC BTLS
2408-1 CONE 3 NAB SAMPLER 28/45
2939-1 DULCE LECH FLVR PKT 3/12 EA CS
3017-1 GINGRBRD FLVR PKT 3/12 EA CS
3055-2 EGGNOG FLVR PKT 3/12 EA CS
3192-1 ORIGINAL MRS. DASH SEASONING
I've written code that reads the text file line by line, strips the numbers at the beginning, and saves the next portion (i.e. SYRUP: DR.PEPPER 5GALLON/BOX, ALL PURPOSE RASPBERRY FIL 40#) with Mid(nextLine, 10, 30). I want to split that portion into the name (SYRUP: DR.PEPPER, ALL PURPOSE RASPBERRY FIL), the size (5GALLON, 40#), the number of units of that size (e.g. 12/1# = 12 x 1 LB), and the unit (BOX, LB). As you can see, almost every line is different, but there are many similarities. I'm not really sure what to do next. I have been trying to use:
re.Pattern = "GALLON|BOX|#|LF|GAL|EA|CS|BTLS"
to pull out the unit portion but I don't know what else to do.
Here is the code I have so far for this portion:
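Function NumericOnly(s As String) As String
    ' Early-bound RegExp needs a reference to "Microsoft VBScript Regular Expressions 5.5"
    Static re As RegExp
    If re Is Nothing Then Set re = New RegExp
    re.IgnoreCase = True
    re.Global = True
    re.Pattern = "GALLON|BOX|#|LF|GAL|EA|CS|BTLS"
    ' Assign to the function name so the stripped string is actually returned
    NumericOnly = re.Replace(s, "")
End Function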
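One possible direction, sketched in Python rather than the VBA used above: match an optional pack count, a size, and a unit in one regular expression, and treat everything before the match as the name. The names `PACK` and `split_description` are hypothetical, and the unit list is an assumption taken only from the sample lines. Lines without a recognisable size and unit (such as 28/45 or 3/12 EA CS) return None and would need their own rules.

```python
import re

# Assumed unit spellings, taken from the sample lines; extend as needed.
PACK = re.compile(
    r"(?:(\d+)\s*/\s*)?"       # optional pack count, e.g. "12/"
    r"(\d+(?:\.\d+)?)\s*"      # size, e.g. "5" or "2.25"
    r"(GALLON|GAL|GA|OZ|#)",   # unit written right after the size
    re.IGNORECASE,
)


def split_description(description):
    """Return (name, count, size, unit), or None if no pack info is found."""
    m = PACK.search(description)
    if m is None:
        return None
    count = int(m.group(1)) if m.group(1) else 1
    size = float(m.group(2))
    # "#" is shorthand for pounds in the sample data
    unit = "LB" if m.group(3) == "#" else m.group(3).upper()
    name = description[:m.start()].strip()
    return name, count, size, unit
```

The unit alternation has no word-boundary check, so a description such as "3 GARLIC" would be misread as 3 GA; a real implementation would need to guard against that.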

Related

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and I just rolled them into a single list I called 'keyLabelRun'
so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that contain "OS=" and the elements that follow them. Anything else, whether it be "SV=", "PE=", etc., I want to skip over until I get to the next "OS=".
The number of elements before the next "OS=" is arbitrary, and that is where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
    if keyLabelrun[i].count('OS='):
        OSarr.append(keyLabelrun[i])
    if keyLabelrun[i+1].count('=') != 1:
        continue
But I think the elements that do not include "OS=" are what is tripping me up.
At the end I'm also going to join them back together into their own elements, but I feel I will be able to handle that after this.
In my attempt, I am trying to append all the elements I'm looking for, in order, to a new list 'OSarr'.
If anyone can lend a hand, it would be much appreciated.
Thank you.
This list of strings came from a dataset that is a text file of the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
    OSarr.append(keyLabelrun[G])
    G += 1
    if keyLabelrun[G].count('='):
        while keyLabelrun[G].count('OS=') != 1:
            G += 1
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
    species_parts = []
    is_os = False
    for word in description.split():
        if word[:3] == 'OS=':
            is_os = True
            species_parts.append(word[3:])
        elif '=' in word:
            is_os = False
        elif is_os:
            species_parts.append(word)
    return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO
for record in SeqIO.parse('input.fa', 'fasta'):
    species = extract_species(record.description)
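As an alternative to tracking state word by word, the same field can probably be pulled out with a single regular expression. This is only a sketch: it assumes every field after the species name has the form KEY=value with a key made of word characters, as in the sample headers, and `OS_FIELD` and `species_from_header` are hypothetical names.

```python
import re

# Lazily capture everything after "OS=" up to the next " KEY=" field
# or the end of the header line.
OS_FIELD = re.compile(r"OS=(.*?)(?=\s+\w+=|$)")


def species_from_header(header):
    """Return the species name from a FASTA header, or None if there is no OS= field."""
    m = OS_FIELD.search(header)
    return m.group(1) if m else None
```

A hyphenated word such as XF-1 stays part of the species name, because the lookahead only stops at a word that is immediately followed by "=".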

Regex to extract lines that starts with any character but not with characters /words like Rs or SBC using Python

Below is the input file:
8% OFF
Sugar Free Gold Sweetener
500 tablets
Rs230
Rs 250
SBC IconClub Price: Rs227
43% OFF
Palm Oil (Bottle)
1 l
Rs186
Rs 330
SBC IconClub Price: Rs185
I am trying to extract the lines that do not start with Rs, SBC, or a number. These are the product names, which start with a word. I used the regex below:
^[A-Za-z].*$
This only removed the lines starting with numbers.
However, I am unable to remove the lines below, which start with Rs or SBC.
Rs230
Rs 250
SBC IconClub Price: Rs227
Rs186
Rs 330
SBC IconClub Price: Rs185
Could anyone please help me with a regex in Python that gives the product names only? The products are on lines 2 and 8.
You can exclude lines starting with SBC, RS, or Rs using a negative lookahead:
^(?!SBC|R[sS])[A-Za-z].*$
If the strings all start with an uppercase char, you can also use
^(?!SBC|R[sS])[A-Z].*$
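A minimal sketch of applying the first pattern in Python. The `re.MULTILINE` flag is needed so that `^` and `$` anchor at every line rather than only at the start and end of the whole text; `PRODUCT_LINE` and `product_names` are hypothetical names.

```python
import re

# Negative lookahead rejects lines starting with SBC, Rs, or RS;
# the letter class rejects lines starting with a digit.
PRODUCT_LINE = re.compile(r"^(?!SBC|R[sS])[A-Za-z].*$", re.MULTILINE)


def product_names(text):
    """Return every line of text that looks like a product name."""
    return PRODUCT_LINE.findall(text)
```

Because the pattern contains no capturing group, `findall` returns the full matching lines.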

Parsing heterogenous data from a text file in Python

I am trying to parse raw data results from a text file into an organised tuple but having trouble getting it right.
My raw data from the textfile looks something like this:
Episode Cumulative Results
EpisodeXD0281119
Date collected21/10/2019
Time collected10:00
Real time PCR for M. tuberculosis (Xpert MTB/Rif Ultra):
PCR result Mycobacterium tuberculosis complex NOT detected
Bacterial Culture:
Bottle: Type FAN Aerobic Plus
Result No growth after 5 days
EpisodeST32423457
Date collected23/02/2019
Time collected09:00
Gram Stain:
Neutrophils Occasional
Gram positive bacilli Moderate (2+)
Gram negative bacilli Numerous (3+)
Gram negative cocci Moderate (2+)
EpisodeST23423457
Date collected23/02/2019
Time collected09:00
Bacterial Culture:
A heavy growth of
1) Klebsiella pneumoniae subsp pneumoniae (KLEPP)
ensure that this organism does not spread in the ward/unit.
A heavy growth of
2) Enterococcus species (ENCSP)
Antibiotic/Culture KLEPP ENCSP
Trimethoprim-sulfam R
Ampicillin / Amoxic R S
Amoxicillin-clavula R
Ciprofloxacin R
Cefuroxime (Parente R
Cefuroxime (Oral) R
Cefotaxime / Ceftri R
Ceftazidime R
Cefepime R
Gentamicin S
Piperacillin/tazoba R
Ertapenem R
Imipenem S
Meropenem R
S - Sensitive ; I - Intermediate ; R - Resistant ; SDD - Sensitive Dose Dependant
Comment for organism KLEPP:
** Please note: this is a carbapenem-RESISTANT organism. Although some
carbapenems may appear susceptible in vitro, these agents should NOT be used as
MONOTHERAPY in the treatment of this patient. **
Please isolate this patient and practice strict contact precautions. Please
inform Infection Prevention and Control as contact screening might be
indicated.
For further advice on the treatment of this isolate, please contact.
The currently available laboratory methods for performing colistin
susceptibility results are unreliable and may not predict clinical outcome.
Based on published data and clinical experience, colistin is a suitable
therapeutic alternative for carbapenem resistant Acinetobacter spp, as well as
carbapenem resistant Enterobacteriaceae. If colistin is clinically indicated,
please carefully assess clinical response.
EpisodeST234234057
Date collected23/02/2019
Time collected09:00
Authorised by xxxx on 27/02/2019 at 10:35
MIC by E-test:
Organism Klebsiella pneumoniae (KLEPN)
Antibiotic Meropenem
MIC corrected 4 ug/mL
MIC interpretation Resistant
Antibiotic Imipenem
MIC corrected 1 ug/mL
MIC interpretation Sensitive
Antibiotic Ertapenem
MIC corrected 2 ug/mL
MIC interpretation Resistant
EpisodeST23423493
Date collected18/02/2019
Time collected03:15
Potassium 4.4 mmol/L 3.5 - 5.1
EpisodeST45445293
Date collected18/02/2019
Time collected03:15
Creatinine 32 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Creatinine 28 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Essentially the pattern is that ALL information starts with a unique EPISODE NUMBER and follows with a DATE and TIME and then the result of whatever test. This is the pattern throughout.
What I am trying to parse into my tuple is the date, time, name of the test and the result - whatever it might be. I have the following code:
from collections import namedtuple

with open(filename) as f:
    data = f.read()
data = data.splitlines()
DS = namedtuple('DS', 'date time name value')
parsed = list()
idx_date = [i for i, r in enumerate(data) if r.strip().startswith('Date')]
for start, stop in zip(idx_date[:-1], idx_date[1:]):
    chunk = data[start:stop]
    date = time = name = value = None
    for row in chunk:
        if not row: continue
        row = row.strip()
        if row.startswith('Episode'): continue
        if row.startswith('Date'):
            _, date = row.split()
            date = date.replace('collected', '')
        elif row.startswith('Time'):
            _, time = row.split()
            time = time.replace('collected', '')
        else:
            name, value, *_ = row.split()
            print (name)
            parsed.append(DS(date, time, name, value))
print(parsed)
My problem is that I cannot find a way to parse the heterogeneous test RESULT in a form that I can use later. For example, for the tuple DS ('DS', 'date time name value') I would like:
DATE = 21/10/2019
TIME = 10:00
NAME = Real time PCR for M tuberculosis or Potassium
RESULT = Negative or 4.7
Any advice appreciated. I have hit a brick wall.
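One way to approach the pattern described above is to first group the lines by episode and keep the free-form result lines untouched, then interpret each test type separately later. The sketch below does only the grouping step. It assumes episode numbers are capital letters followed by digits and that dates and times look like those in the sample; `EpisodeRecord` and `parse_report` are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class EpisodeRecord:
    episode: str
    date: str = ""
    time: str = ""
    result_lines: list = field(default_factory=list)


EPISODE = re.compile(r"^Episode\s*([A-Z]+\d+)$")
DATE = re.compile(r"^Date collected\s*(\d{2}/\d{2}/\d{4})$")
TIME = re.compile(r"^Time collected\s*(\d{2}:\d{2})$")


def parse_report(text):
    """Group report lines into one EpisodeRecord per episode number."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if (m := EPISODE.match(line)):
            records.append(EpisodeRecord(m.group(1)))   # a new episode starts
        elif not records:
            continue                                    # preamble before the first episode
        elif (m := DATE.match(line)):
            records[-1].date = m.group(1)
        elif (m := TIME.match(line)):
            records[-1].time = m.group(1)
        else:
            records[-1].result_lines.append(line)       # raw test name / result text
    return records
```

Keeping `result_lines` as raw text postpones the hard part: a PCR result, an antibiogram, and a chemistry value each need their own rule for splitting the name from the value.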

Lua string.gsub with Multiple Patterns

I am working on renaming movie titles that contain unwanted words. string.gsub can replace a pattern with the empty string "", but I have around 200 patterns that need to be replaced with "".
Right now I have to call string.gsub once for every pattern. Is there a way to put all the patterns into a single string.gsub call? I have searched the web for a solution but haven't found anything.
The movie title looks like this: B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT
and I want to remove the extra words like 2013, Hindi, 720p, DvDRip, CROPPED, AAC, x264, RickyKT.
You can pass to string.gsub a table as the third argument like this:
local movie = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
movie = movie:gsub("%S+", {["2013"] = "", ["Hindi"] = "", ["720p"] = "",
["DvDRip"] = "", ["CROPPED"] = "", ["AAC"] = "",
["x264"] = "", ["RickyKT"] = ""})
print(movie)
Put all of the patterns in a table and then enumerate the table, calling string.gsub() for each pattern:
str = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
patterns = {"pattern1", "pattern2", "pattern3"}
for i,v in ipairs(patterns) do
  str = string.gsub(str, v, "")
end
This will require many invocations of string.gsub(), but the code should be much more maintainable than having a lot of string.gsub() calls.
To avoid writing a key and a value in the table for every new entry, I'd write a function that handles a numerically indexed table (the patterns being the values).
This way I don't need to write {["pattern_n"] = ""} for every new pattern.
Example:
PATTERNS = {"2013", "Hindi", "720p", "DvDRip", "CROPPED", "AAC", "x264", "RickyKT"}
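function replace(match)
  -- Compare with ==, not v:find(match): find would treat the token as a
  -- Lua pattern and would also match substrings of the entries in PATTERNS.
  for i, v in ipairs(PATTERNS) do
    if v == match then
      return ""
    end
  end
  -- Returning nil tells gsub to keep the original token
  return nil
end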
local movie = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
movie = movie:gsub("%S+", replace)
print(movie)
You could wrap it in a simple function so that you do not need to repeat the code for each string, or just call string.gsub directly with the replacement value you need.
Function:
local large_name = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
function clean_name(str)
  local v = string.gsub(str, "(.-)%s([%(%[']?%d%d%d?%d?[%)%]]?)%s*(.*)", "%1")
  return v
end
print(clean_name(large_name))
Using only string.gsub on the value:
local large_name = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
local clean_name = string.gsub(large_name, "(.-)%s([%(%[']?%d%d%d?%d?[%)%]]?)%s*(.*)", "%1")
print(clean_name)
The pattern captures the movie title as the first group: everything before the first whitespace-separated token that looks like a year. The year is matched as the second group (two to four digits, optionally wrapped in brackets or preceded by an apostrophe), and the whole string is replaced by the first group. This way you do not need to list every value that can appear in a movie name, and it avoids many false positives.
I added a test loop to try different movie names:
local testing = {"Whiplash 2014 [1080p]",
"Anon (2018) [WEBRip] [1080p] [YTS.AM]",
"Maze Runner The Death Cure 2018 [WEBRip] [1080p] [YTS.AM]",
"12 Strong [2018] [WEBRip] [1080p] [YTS.AM]",
"Kingsman The Secret Service (2014) [1080p]",
"The Equalizer [2014] [1080p]",
"Annihilation 2018 [WEBRip] [1080p] [YTS.AM]",
"The Shawshank Redemption '94",
"Assassin's Creed 2016 HC 720p HDRip 850 MB - iExTV",
"Captain Marvel (2019) [WEBRip] [1080p] [YTS.AM]",}
for k,v in pairs(testing) do
  local result = string.gsub(v, "(.-)%s([%(%[']?%d%d%d?%d?[%)%]]?)%s*(.*)", "%1")
  print(result)
end
Output:
Whiplash
Anon
Maze Runner The Death Cure
12 Strong
Kingsman The Secret Service
The Equalizer
Annihilation
The Shawshank Redemption
Assassin's Creed
Captain Marvel

Reading a specific txt file and re-arrange it to a given format

Below is the output of a chemical analysis instrument. I need to rearrange the format so that the percentage figure for each element goes below its name. My question is: how do I read this file word by word? How can I pick, for instance, word number 12?
txt file format:
Header_1 Date Time Method_Name (Filter_Name) Calc_Mode Heat No. Quality Anal. Code Sample ID C Si Mn P S Cr Mo Ni Al Co Cu Nb Ti V W Pb Sn As Bi Ca Sb Se B Zn N Fe Place Code Work Phase
Single 13.01.13 09:51:10 Fe-10 Test AutoResult 12A 00001.040 00000.437 00000.292 00000.023 00000.007 00001.505 00000.263 00000.081 00000.012 00000.014 00000.110 00000.155 00000.040 00000.098 00000.015 00000.014 00000.013 00000.012 00000.002 00000.001 00000.016 00000.014 00000.005 00000.001 00000.016 00095.813
To find word 12, read the line character by character until you have seen 11 instances of whatever is being used to separate words (which you have not specified); what follows, until the next such separator, will be the 12th word.
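In a language with a built-in split function, the separator counting described above is usually done for you. The sketch below is in Python and assumes the words are separated by whitespace, which the question does not confirm. It also assumes the data line ends with the element values, as in the sample. `nth_word` and `element_percentages` are hypothetical names.

```python
HEADER = ("Header_1 Date Time Method_Name (Filter_Name) Calc_Mode Heat No. Quality "
          "Anal. Code Sample ID C Si Mn P S Cr Mo Ni Al Co Cu Nb Ti V W Pb Sn As Bi "
          "Ca Sb Se B Zn N Fe Place Code Work Phase")
DATA = ("Single 13.01.13 09:51:10 Fe-10 Test AutoResult 12A 00001.040 00000.437 "
        "00000.292 00000.023 00000.007 00001.505 00000.263 00000.081 00000.012 "
        "00000.014 00000.110 00000.155 00000.040 00000.098 00000.015 00000.014 "
        "00000.013 00000.012 00000.002 00000.001 00000.016 00000.014 00000.005 "
        "00000.001 00000.016 00095.813")


def nth_word(line, n):
    """Return the n-th whitespace-separated word of line (1-based)."""
    return line.split()[n - 1]


def element_percentages(header, data, first="C", last="Fe"):
    """Pair each element name in the header with its value in the data line."""
    names = header.split()
    elements = names[names.index(first):names.index(last) + 1]
    # Assumption: the values are the last len(elements) words of the data line
    values = data.split()[-len(elements):]
    return {el: float(v) for el, v in zip(elements, values)}


print(nth_word(DATA, 12))                      # 00000.007
print(element_percentages(HEADER, DATA)["Fe"])  # 95.813
```

With the values in a dictionary keyed by element name, printing each percentage below its name becomes a formatting step rather than a parsing one.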

Resources