parsing dates from strings

I have a list of strings in Python like this:
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00) from these strings.
So far I have used a for loop like the one below, but it is very time-consuming. Is there a better way to do this?
import glob
import pandas as pd

for files in glob.glob("AM_B0_*.flac.h5"):
    if files[11] == '_':
        # Two-digit depth (e.g. D10.3): the date starts one character later
        year = files[12:16]
        month = files[17:19]
        day = files[20:22]
        hour = files[23:25]
        minute = files[25:27]
        second = files[27:29]
        tindex = pd.date_range(start='%d-%02d-%02d %02d:%02d:%02d' % (int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
    else:
        # One-digit depth (e.g. D0.7)
        year = files[11:15]
        month = files[16:18]
        day = files[19:21]
        hour = files[22:24]
        minute = files[24:26]
        second = files[26:28]
        tindex = pd.date_range(start='%d-%02d-%02d %02d:%02d:%02d' % (int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')

Try this (it locates the second-to-last '-', so no if-else is needed):
filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
def find_second_last(text, pattern):
    return text.rfind(pattern, 0, text.rfind(pattern))
for files in filesall:
    start = find_second_last(files, '-') - 4  # from yyyy- part
    timepart = (files[start:start+17]).replace("T", " ")
    # insert 2 ':'s
    timepart = timepart[:13] + ':' + timepart[13:15] + ':' + timepart[15:]
    # print(timepart)
    tindex = pd.date_range(start=timepart, periods=60, freq='10S')

Instead of hard-coding files[11], find the last or second-to-last index of '_' and slice relative to it; then you don't have to write the same code twice. Alternatively, use a regex to parse the string.
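The regex route could look roughly like the sketch below. It assumes every file name contains one timestamp of the form YYYY-MM-DDTHHMMSS; the names `STAMP` and `parse_stamp` are made up for illustration and are not from the question.

```python
import re
from datetime import datetime

# Matches a timestamp such as 2016-08-05T151000 anywhere in the file name.
STAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{6}")

def parse_stamp(name):
    """Return the datetime embedded in name, or None if no timestamp is found."""
    m = STAMP.search(name)
    return datetime.strptime(m.group(), "%Y-%m-%dT%H%M%S") if m else None

names = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
         'AM_B0_D10.3_2017-03-17T110000.flac.h5',
         'AM_B0_D4.4_2016-08-05T151000.flac.h5']
stamps = [parse_stamp(n) for n in names]
print(stamps[2])  # 2016-08-05 15:10:00
```

Because the pattern searches for the timestamp itself, the width of the D field no longer matters. Each resulting datetime can be passed as start= to pd.date_range in place of the hand-built string.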

Related

How to identify number near word using regex

I need to identify the numbers that follow keywords such as number:, no:, etc.
What I tried:
import re
matchstring="Sales Quote"
string_lst = ['number:', 'No:','no:','number','No : ']
x=""" Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""
documentnumber= re.findall(r"(?:(?<="+matchstring+ '|'.join(string_lst)+r')) [\w\d-]',x,flags=re.IGNORECASE)
print(documentnumber)
Required output: 36886DJ9, 89745DFD, 7964KL, 879654DF, 9874656LD
Is there any solution?
Actually your solution is very close. You just need some missing parentheses and a check for optional whitespace:
documentnumber = re.findall(r"(?:(?<=" + matchstring + ").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
However, this won't match the last one (9874656LD) because of the missing whitespace between "sales" and "quote". If you want to handle that in the same way as the rest of the pattern, replace the lookbehind with a non-capturing group and join the words with \s?:
documentnumber = re.findall(r"(?:(?:" + r"\s?".join(matchstring.split()) + ").*?(?:" + '|'.join(string_lst) + r')\s?)([\w\d-]*)', x, re.IGNORECASE)
Output:
['36886DJ9', '89745DFD', '7964KL', '879654DF', '9874656LD']
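A self-contained variant of the same idea is sketched below. It differs from the answer in two ways that are my own assumptions, not the answerer's: each piece is passed through re.escape so that any special characters in the keywords stay literal, and the capture uses [\w-]+ so that an empty match is never returned. The names `prefix`, `alternation` and `pattern` are illustrative.

```python
import re

matchstring = "Sales Quote"
string_lst = ['number:', 'No:', 'no:', 'number', 'No : ']
x = """ Sentence1: Sales Quote number 36886DJ9 is entered
Sentence2: SALES QUOTE No: 89745DFD is entered
Sentence3: Sales Quote No : 7964KL is entered
Sentence4: SALES QUOTE NUMBER:879654DF is entered
Sentence5: salesquote no: 9874656LD is entered"""

# "Sales\s?Quote": the words of the phrase with optional whitespace between them.
prefix = r"\s?".join(re.escape(word) for word in matchstring.split())
# Alternation of the keywords, each escaped so it is matched literally.
alternation = "|".join(re.escape(keyword) for keyword in string_lst)

pattern = re.compile(rf"(?:{prefix}).*?(?:{alternation})\s?([\w-]+)", re.IGNORECASE)
documentnumber = pattern.findall(x)
print(documentnumber)
```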

running for loop until arbitrary index (python 3.x)

I have some strings that I split on spaces (' ') and rolled into a single list called 'keyLabelRun',
so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that contain "OS=" and the ones that follow them. Anything else, whether it is "SV=", "PE=", etc., I want to skip until I get to the next "OS=".
The number of elements before the next "OS=" is arbitrary, and that is where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
    if keyLabelrun[i].count('OS='):
        OSarr.append(keyLabelrun[i])
        if keyLabelrun[i+1].count('=') != 1:
            continue
But I think the elements that do not include "OS=" are what is tripping me up.
At the end I'm going to join them back together into their own elements, but I feel I will be able to handle that after this.
In my attempt, I am trying to append all the elements I'm looking for, in order, to a new list 'OSarr'.
If anyone can lend a hand, it would be much appreciated.
Thank you.
This list of strings came from a dataset that is a text file in the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
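OSarr = []
G = 0
n = len(keyLabelrun)
while G < n:
    OSarr.append(keyLabelrun[G])
    G += 1
    # At a non-OS "key=value" field, skip ahead to the next "OS=" (or the end);
    # the bounds checks avoid an IndexError after the last "OS=" block
    if G < n and '=' in keyLabelrun[G]:
        while G < n and 'OS=' not in keyLabelrun[G]:
            G += 1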
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
    species_parts = []
    is_os = False
    for word in description.split():
        if word[:3] == 'OS=':
            is_os = True
            species_parts.append(word[3:])
        elif '=' in word:
            is_os = False
        elif is_os:
            species_parts.append(word)
    return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO
for record in SeqIO.parse('input.fa', 'fasta'):
    species = extract_species(record.description)
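As an alternative to tracking state word by word, a single regular expression can capture the species name. This is only a sketch: it assumes the field after the species always looks like KEY=value with a key made of word characters, and the names `OS_FIELD` and `species_from_header` are hypothetical.

```python
import re

# Capture everything after "OS=" up to the next " KEY=" field or the end of the line.
OS_FIELD = re.compile(r"OS=(.+?)(?=\s+\w+=|$)")

def species_from_header(header):
    """Return the species name from a FASTA header, or None if there is no OS= field."""
    m = OS_FIELD.search(header)
    return m.group(1) if m else None

headers = [
    ">tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0",
    ">tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0",
    ">sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0",
]
species = [species_from_header(h) for h in headers]
print(species)
```

The lazy .+? together with the lookahead stops the capture just before the first KEY= token, so multi-word names such as "Bacillus subtilis XF-1" are kept whole.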

kdb/q: How to apply a string manipulation function to a vector of strings to output a vector of strings?

Thanks in advance for the help. I am new to kdb/q, coming from a Python and C++ background.
Just a simple syntax question: I have a string with fields and their corresponding values
pp_str: "field_1:abc field_2:xyz field_3:kdb"
I wrote an atomic (scalar) function to extract the value of a given field.
get_field_value: {[field; pp_str] pp_fields: " " vs pp_str; pid_field: pp_fields[where like[pp_fields; field,":*"]]; start_i: (pid_field[0] ss ":")[0] + 1; end_i: count pid_field[0]; indices: start_i + til (end_i - start_i); pid_field[0][indices]}
show get_field_value["field_1"; pp_str]
"abc"
show get_field_value["field_3"; pp_str]
"kdb"
Now how do I generalize this so that if I input a vector of fields, I get a vector of values? I want to input ("field_1"; "field_2"; "field_3") and output ("abc"; "xyz"; "kdb"). I tried multiple approaches (below) but I just don't understand kdb/q's syntax well enough to vectorize my function:
/ Attempt 1 - Fail
get_field_value[enlist ("field_1"; "field_2"); pp_str]
/ Attempt 2 - Fail
get_field_value[; pp_str] /. enlist ("field_1"; "field_3")
/ Attempt 3 - Fail
fields: ("field_1"; "field_2")
get_field_value[fields; pp_str]
To run your function for each field, you could project it over the pp_str variable and use each for the others:
q)get_field_value[;pp_str]each("field_1";"field_3")
"abc"
"kdb"
Kdb actually has built-in functionality to handle this: https://code.kx.com/q/ref/file-text/#key-value-pairs
q){#[;x](!/)"S: "0:y}[`field_1;pp_str]
"abc"
q)
q){#[;x](!/)"S: "0:y}[`field_1`field_3;pp_str]
"abc"
"kdb"
I think this might be the syntax you're looking for.
q)get_field_value[; pp_str]each("field_1";"field_2")
"abc"
"xyz"

How to extract several timestamp pairs from a list in Python

I have extracted all timestamps from a transcript file. The output looks like this:
('[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, '
'00:00:10,830, 00:00:10,830, 00:00:14,070, 00:00:14,070, 00:00:16,890, '
'00:00:16,890, 00:00:19,080, 00:00:19,080, 00:00:21,590, 00:00:21,590, '
'00:00:24,030, 00:00:24,030, 00:00:26,910, 00:00:26,910, 00:00:29,640, '
'00:00:29,640, 00:00:31,920, 00:00:31,920, 00:00:35,850, 00:00:35,850, '
'00:00:38,629, 00:00:38,629, 00:00:40,859, 00:00:40,859, 00:00:43,170, '
'00:00:43,170, 00:00:45,570, 00:00:45,570, 00:00:48,859, 00:00:48,859, '
'00:00:52,019, 00:00:52,019, 00:00:54,449, 00:00:54,449, 00:00:57,210, '
'00:00:57,210, 00:00:59,519, 00:00:59,519, 00:01:02,690, 00:01:02,690, '
'00:01:05,820, 00:01:05,820, 00:01:08,549, 00:01:08,549, 00:01:10,490, '
'00:01:10,490, 00:01:13,409, 00:01:13,409, 00:01:16,409, 00:01:16,409, '
'00:01:18,149, 00:01:18,149, 00:01:20,340, 00:01:20,340, 00:01:22,649, '
'00:01:22,649, 00:01:26,159, 00:01:26,159, 00:01:28,740, 00:01:28,740, '
'00:01:30,810, 00:01:30,810, 00:01:33,719, 00:01:33,719, 00:01:36,990, '
'00:01:36,990, 00:01:39,119, 00:01:39,119, 00:01:41,759, 00:01:41,759, '
'00:01:43,799, 00:01:43,799, 00:01:46,619, 00:01:46,619, 00:01:49,140, '
'00:01:49,140, 00:01:51,240, 00:01:51,240, 00:01:53,759, 00:01:53,759, '
'00:01:56,460, 00:01:56,460, 00:01:58,740, 00:01:58,740, 00:02:01,640, '
'00:02:01,640, 00:02:04,409, 00:02:04,409, 00:02:07,229, 00:02:07,229, '
'00:02:09,380, 00:02:09,380, 00:02:12,060, 00:02:12,060, 00:02:14,840, ]')
In this output, there are always timestamp pairs, i.e. always 2 consecutive timestamps belong together, for example: 00:00:03,950 and 00:00:06,840, 00:00:06,840 and 00:00:09,180, etc.
Now, I want to extract all these timestamp pairs separately so that the output looks like this:
00:00:03,950 - 00:00:06,840
00:00:06,840 - 00:00:09,180
00:00:09,180 - 00:00:10,830
etc.
For now, I have the following (very inconvenient) solution for my problem:
# get first part of first timestamp
a = res_timestamps[2:15]
print(dedent(a))
# get second part of first timestamp
b = res_timestamps[17:29]
print(b)
# combine timestamp parts
c = a + ' - ' + b
print(dedent(c))
Of course, this is very bad since I cannot extract the indices manually for all transcripts. Trying to use a loop has not worked yet because each item is not a timestamp but a single character.
Is there an elegant solution for my problem?
I appreciate any help or tip.
Thank you very much in advance!
Regex to the rescue!
A solution that works perfectly on your example data:
import re
from pprint import pprint
pprint(re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data))
This prints:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890'),
('00:00:16,890', '00:00:19,080'),
('00:00:19,080', '00:00:21,590'),
('00:00:21,590', '00:00:24,030'),
('00:00:24,030', '00:00:26,910'),
('00:00:26,910', '00:00:29,640'),
('00:00:29,640', '00:00:31,920'),
('00:00:31,920', '00:00:35,850'),
('00:00:35,850', '00:00:38,629'),
('00:00:38,629', '00:00:40,859'),
('00:00:40,859', '00:00:43,170'),
('00:00:43,170', '00:00:45,570'),
('00:00:45,570', '00:00:48,859'),
('00:00:48,859', '00:00:52,019'),
('00:00:52,019', '00:00:54,449'),
('00:00:54,449', '00:00:57,210'),
('00:00:57,210', '00:00:59,519'),
('00:00:59,519', '00:01:02,690'),
('00:01:02,690', '00:01:05,820'),
('00:01:05,820', '00:01:08,549'),
('00:01:08,549', '00:01:10,490'),
('00:01:10,490', '00:01:13,409'),
('00:01:13,409', '00:01:16,409'),
('00:01:16,409', '00:01:18,149'),
('00:01:18,149', '00:01:20,340'),
('00:01:20,340', '00:01:22,649'),
('00:01:22,649', '00:01:26,159'),
('00:01:26,159', '00:01:28,740'),
('00:01:28,740', '00:01:30,810'),
('00:01:30,810', '00:01:33,719'),
('00:01:33,719', '00:01:36,990'),
('00:01:36,990', '00:01:39,119'),
('00:01:39,119', '00:01:41,759'),
('00:01:41,759', '00:01:43,799'),
('00:01:43,799', '00:01:46,619'),
('00:01:46,619', '00:01:49,140'),
('00:01:49,140', '00:01:51,240'),
('00:01:51,240', '00:01:53,759'),
('00:01:53,759', '00:01:56,460'),
('00:01:56,460', '00:01:58,740'),
('00:01:58,740', '00:02:01,640'),
('00:02:01,640', '00:02:04,409'),
('00:02:04,409', '00:02:07,229'),
('00:02:07,229', '00:02:09,380'),
('00:02:09,380', '00:02:12,060'),
('00:02:12,060', '00:02:14,840')]
You could output this in your desired format like so (with timestamps holding the list returned by re.findall):
for start, end in timestamps:
    print(f"{start} - {end}")
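Put together, the regex approach can be run end to end as in the sketch below. The input is a shortened sample in the same shape as the question's string; `STAMP` and `lines` are names chosen for illustration.

```python
import re

# Shortened sample in the same shape as the string from the question.
res_timestamps = ("[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, "
                  "00:00:09,180, 00:00:10,830, ]")

# One timestamp: HH:MM:SS,mmm
STAMP = r"\d{2}:\d{2}:\d{2},\d{3}"

# Each match consumes two consecutive timestamps, so findall yields (start, end) pairs.
timestamps = re.findall(rf"({STAMP}), ({STAMP})", res_timestamps)
lines = [f"{start} - {end}" for start, end in timestamps]
print("\n".join(lines))
```

For this sample the script prints the three lines "00:00:03,950 - 00:00:06,840", "00:00:06,840 - 00:00:09,180" and "00:00:09,180 - 00:00:10,830".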
Here's a solution without regular expressions:
Clean the string and split on ', ' to create a list.
Use list slicing to select the even- and odd-indexed values and zip them together.
# give data as your string
# convert data into a list by removing end brackets and spaces, and splitting
data = data.replace('[, ', '').replace(', ]', '').split(', ')
# use list slicing and zip the two components
combinations = list(zip(data[::2], data[1::2]))
# print the first 5
print(combinations[:5])
[out]:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890')]

pd.to_datetime to solve '2010/1/1' rather than '2010/01/01'

I have a dataframe which contains a column 'trade_dt' like this:
2009/12/1
2009/12/2
2009/12/3
2009/12/4
I got this error:
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y-&m-%d')
ValueError: time data '2009/12/1' does not match format '%Y-&m-%d' (match)
How can I solve it? Thanks~
You need to change the format so that it matches the data: replace & with % and - with /:
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y/%m/%d')
With the sample data it also works if you remove format (but I'm not sure about the real data):
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'])
print (benchmark)
trade_dt
0 2009-12-01
1 2009-12-02
2 2009-12-03
3 2009-12-04
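The format argument of pd.to_datetime uses strftime-style directives, and the %m and %d directives also accept values without a leading zero when parsing, so '2010/1/1' needs no special handling. The sketch below illustrates this with the standard library's datetime.strptime rather than pandas, so it runs without extra dependencies; the sample values are made up.

```python
from datetime import datetime

raw = ["2009/12/1", "2009/12/2", "2010/1/1"]

# %m and %d parse both "1" and "01", so only the separators have to match.
parsed = [datetime.strptime(s, "%Y/%m/%d") for s in raw]
print(parsed[2])  # 2010-01-01 00:00:00

# A format with the wrong separator is rejected, as in the question.
try:
    datetime.strptime("2009/12/1", "%Y-%m-%d")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```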
