How to extract several timestamp pairs from a list in Python

I have extracted all timestamps from a transcript file. The output looks like this:
('[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, '
'00:00:10,830, 00:00:10,830, 00:00:14,070, 00:00:14,070, 00:00:16,890, '
'00:00:16,890, 00:00:19,080, 00:00:19,080, 00:00:21,590, 00:00:21,590, '
'00:00:24,030, 00:00:24,030, 00:00:26,910, 00:00:26,910, 00:00:29,640, '
'00:00:29,640, 00:00:31,920, 00:00:31,920, 00:00:35,850, 00:00:35,850, '
'00:00:38,629, 00:00:38,629, 00:00:40,859, 00:00:40,859, 00:00:43,170, '
'00:00:43,170, 00:00:45,570, 00:00:45,570, 00:00:48,859, 00:00:48,859, '
'00:00:52,019, 00:00:52,019, 00:00:54,449, 00:00:54,449, 00:00:57,210, '
'00:00:57,210, 00:00:59,519, 00:00:59,519, 00:01:02,690, 00:01:02,690, '
'00:01:05,820, 00:01:05,820, 00:01:08,549, 00:01:08,549, 00:01:10,490, '
'00:01:10,490, 00:01:13,409, 00:01:13,409, 00:01:16,409, 00:01:16,409, '
'00:01:18,149, 00:01:18,149, 00:01:20,340, 00:01:20,340, 00:01:22,649, '
'00:01:22,649, 00:01:26,159, 00:01:26,159, 00:01:28,740, 00:01:28,740, '
'00:01:30,810, 00:01:30,810, 00:01:33,719, 00:01:33,719, 00:01:36,990, '
'00:01:36,990, 00:01:39,119, 00:01:39,119, 00:01:41,759, 00:01:41,759, '
'00:01:43,799, 00:01:43,799, 00:01:46,619, 00:01:46,619, 00:01:49,140, '
'00:01:49,140, 00:01:51,240, 00:01:51,240, 00:01:53,759, 00:01:53,759, '
'00:01:56,460, 00:01:56,460, 00:01:58,740, 00:01:58,740, 00:02:01,640, '
'00:02:01,640, 00:02:04,409, 00:02:04,409, 00:02:07,229, 00:02:07,229, '
'00:02:09,380, 00:02:09,380, 00:02:12,060, 00:02:12,060, 00:02:14,840, ]')
The timestamps always come in pairs, i.e. every 2 consecutive timestamps belong together, for example: 00:00:03,950 and 00:00:06,840, 00:00:06,840 and 00:00:09,180, etc.
Now I want to extract these timestamp pairs separately so that the output looks like this:
00:00:03,950 - 00:00:06,840
00:00:06,840 - 00:00:09,180
00:00:09,180 - 00:00:10,830
etc.
For now, I have the following (very inconvenient) solution for my problem:
# get first part of first timestamp
a = res_timestamps[2:15]
print(dedent(a))
# get second part of first timestamp
b = res_timestamps[17:29]
print(b)
# combine timestamp parts
c = a + ' - ' + b
print(dedent(c))
Of course, this is very bad since I cannot extract the indices manually for all transcripts. Trying to use a loop has not worked yet because each item is not a timestamp but a single character.
Is there an elegant solution for my problem?
I appreciate any help or tip.
Thank you very much in advance!

Regex to the rescue!
A solution that works on your example data (your_data is the string shown in the question):
import re
from pprint import pprint
timestamps = re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data)
pprint(timestamps)
This prints:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890'),
('00:00:16,890', '00:00:19,080'),
('00:00:19,080', '00:00:21,590'),
('00:00:21,590', '00:00:24,030'),
('00:00:24,030', '00:00:26,910'),
('00:00:26,910', '00:00:29,640'),
('00:00:29,640', '00:00:31,920'),
('00:00:31,920', '00:00:35,850'),
('00:00:35,850', '00:00:38,629'),
('00:00:38,629', '00:00:40,859'),
('00:00:40,859', '00:00:43,170'),
('00:00:43,170', '00:00:45,570'),
('00:00:45,570', '00:00:48,859'),
('00:00:48,859', '00:00:52,019'),
('00:00:52,019', '00:00:54,449'),
('00:00:54,449', '00:00:57,210'),
('00:00:57,210', '00:00:59,519'),
('00:00:59,519', '00:01:02,690'),
('00:01:02,690', '00:01:05,820'),
('00:01:05,820', '00:01:08,549'),
('00:01:08,549', '00:01:10,490'),
('00:01:10,490', '00:01:13,409'),
('00:01:13,409', '00:01:16,409'),
('00:01:16,409', '00:01:18,149'),
('00:01:18,149', '00:01:20,340'),
('00:01:20,340', '00:01:22,649'),
('00:01:22,649', '00:01:26,159'),
('00:01:26,159', '00:01:28,740'),
('00:01:28,740', '00:01:30,810'),
('00:01:30,810', '00:01:33,719'),
('00:01:33,719', '00:01:36,990'),
('00:01:36,990', '00:01:39,119'),
('00:01:39,119', '00:01:41,759'),
('00:01:41,759', '00:01:43,799'),
('00:01:43,799', '00:01:46,619'),
('00:01:46,619', '00:01:49,140'),
('00:01:49,140', '00:01:51,240'),
('00:01:51,240', '00:01:53,759'),
('00:01:53,759', '00:01:56,460'),
('00:01:56,460', '00:01:58,740'),
('00:01:58,740', '00:02:01,640'),
('00:02:01,640', '00:02:04,409'),
('00:02:04,409', '00:02:07,229'),
('00:02:07,229', '00:02:09,380'),
('00:02:09,380', '00:02:12,060'),
('00:02:12,060', '00:02:14,840')]
You could print the list returned by re.findall (called timestamps here) in your desired format like so:
for start, end in timestamps:
    print(f"{start} - {end}")

Here's a solution without regular expressions:
Clean the string and split it on ', ' to create a list.
Use list slicing to select the even- and odd-indexed values and zip them together.
# data is your string
# convert data into a list by removing the end brackets and splitting on ', '
data = data.replace('[, ', '').replace(', ]', '').split(', ')
# use list slicing and zip the two components
combinations = list(zip(data[::2], data[1::2]))
# print the first 5
print(combinations[:5])
[out]:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890')]
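
The cleaning and pairing steps above can also be combined with the "start - end" output format asked for in the question. This is a minimal sketch on a shortened sample string; the names stamps and lines are only illustrative.

```python
# Shortened sample in the same shape as the extracted transcript string.
data = "[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, 00:00:10,830, ]"

# Strip the bracket residue, then split into individual timestamps.
stamps = data.replace("[, ", "").replace(", ]", "").split(", ")

# Pair even-indexed (start) entries with odd-indexed (end) entries.
lines = [f"{start} - {end}" for start, end in zip(stamps[::2], stamps[1::2])]

print("\n".join(lines))
```

Splitting first is what makes the loop work: iterating over the raw string yields single characters, while iterating over the split list yields whole timestamps.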

Related

How do I sort pandas data frame that has a multi-index?

I have a DataFrame (its columns and index are shown below) and I want to sort it by the first 'всичко' column, the one under 'Общо'.
This is the output when I type:
df.columns =
MultiIndex([( ' Общо', ' всичко'),
( ' Общо', ' мъже'),
( ' Общо', ' жени'),
('В градовете', ' всичко'),
('В градовете', ' мъже'),
('В градовете', ' жени'),
( 'В селата', ' всичко'),
( 'В селата', ' мъже'),
( 'В селата', ' жени')],
names=['Области', 'Общини'])
and
df.index =
Index(['Общо за страната', 'Благоевград', 'Банско', 'Белица', 'Благоевград',
'Гоце Делчев', 'Гърмен', 'Кресна', 'Петрич', 'Разлог',
...
'Нови пазар', 'Смядово', 'Хитрино', 'Шумен', 'Ямбол', 'Болярово',
'Елхово', 'Стралджа', 'Тунджа', 'Ямбол'],
dtype='object', length=294)
Again, I need to sort by the 'всичко' column in descending order.
Best regards.
I tried using df.sort_values() but I am having difficulties working around the MultiIndex.
You can use:
df = df.sort_values([(' Общо', ' всичко')], ascending=False)  # give the column name as a tuple
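
To see the tuple-based lookup in isolation, here is a small sketch on a made-up DataFrame. The labels ("Total", "all", and so on) and the values are hypothetical stand-ins for the real table. Note that the tuple has to match the column labels exactly, including the leading spaces visible in the question's df.columns output.

```python
import pandas as pd

# Hypothetical miniature of the table: two top-level groups, two sub-columns each.
columns = pd.MultiIndex.from_tuples(
    [("Total", "all"), ("Total", "men"), ("Cities", "all"), ("Cities", "men")]
)
df = pd.DataFrame(
    [[10, 4, 7, 3], [30, 16, 20, 11], [20, 9, 12, 5]],
    index=["A", "B", "C"],
    columns=columns,
)

# A column under a MultiIndex is addressed by the full tuple of its level labels.
sorted_df = df.sort_values([("Total", "all")], ascending=False)

print(sorted_df.index.tolist())
```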

Split a big text file into multiple smaller one on set parameter of regex

I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text file into multiple small text files and save them as .txt on my system at each occurrence of ..... (multiple period markers), saved like
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I have figured out that something like the following regex could be used
txt =re.sub(r'(([^\w\s])\2+)', r' ', txt).strip() #for letters more than 2 times
but I have not been able to work it out completely.
The saved text files should be named group1_sdsdsd.txt, group1_ddss.txt and group1_xxxx.txt (group1 being an identifier for the specific big text file, since I have multiple big text files, need to do the same on all of them, and want to know which one each part came from).
If you want to match the parts that are delimited by lines consisting only of dots, and get each part separately, you might use a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of a line (the pattern is used with re.MULTILINE)
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
Example code
import re
pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
matches = re.finditer(pattern, s, re.MULTILINE)
your_path = "/your/path/"
for match in matches:
    # group 1 is the name found between the two dotted lines
    with open(your_path + "group1_{}.txt".format(match.group(1)), 'w') as f:
        # write the whole matched part, including the dotted header
        f.write(match.group())
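
Since the question mentions several big files, each with its own identifier, the same pattern could be wrapped in a small helper. This is only a sketch: the function name split_sections and its parameters group_id and out_dir are hypothetical, and the sample text is a shortened stand-in for the real data.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as above, compiled once with MULTILINE so ^ matches at line starts.
PATTERN = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*", re.MULTILINE
)

def split_sections(text, group_id, out_dir):
    """Write each dotted-header section to <out_dir>/<group_id>_<name>.txt (hypothetical helper)."""
    written = []
    for match in PATTERN.finditer(text):
        target = Path(out_dir) / f"{group_id}_{match.group(1)}.txt"
        target.write_text(match.group())
        written.append(target.name)
    return written

# Shortened stand-in for one big text file.
sample = (
    "....\nsdsdsd\n..........\nfirst body line.\nsecond body line.\n"
    "......\nddss\n................\nother body.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    names = split_sections(sample, "group1", tmp)
    first_file = (Path(tmp) / names[0]).read_text()

print(names)
```

Calling the helper once per big file with a different group_id keeps the origin of every small file visible in its name.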

pd.to_datetime to solve '2010/1/1' rather than '2010/01/01'

I have a dataframe which contains a column 'trade_dt' like this
2009/12/1
2009/12/2
2009/12/3
2009/12/4
When I try to convert it, I get this error:
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y-&m-%d')
ValueError: time data '2009/12/1' does not match format '%Y-&m-%d' (match)
How can I solve it? Thanks!
You need to change the format so that it matches the data: replace & with % and - with /:
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y/%m/%d')
With the sample data it also works without format (but I am not sure about the real data):
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'])
print (benchmark)
trade_dt
0 2009-12-01
1 2009-12-02
2 2009-12-03
3 2009-12-04
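
As a quick check of the corrected format string on values without zero padding, here is a small self-contained sketch. The three sample dates are illustrative, with '2010/1/1' taken from the question title.

```python
import pandas as pd

# Sample values without zero-padded month or day, as in the question.
benchmark = pd.DataFrame({"trade_dt": ["2009/12/1", "2009/12/2", "2010/1/1"]})

# The format uses / as separator to match the data; %m and %d also accept single digits.
benchmark["trade_dt"] = pd.to_datetime(benchmark["trade_dt"], format="%Y/%m/%d")

parsed = benchmark["trade_dt"].dt.strftime("%Y-%m-%d").tolist()
print(parsed)
```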

parsing dates from strings

I have a list of strings in python like this
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00) from these strings.
So far I have used a for loop like the one below, but it is very time-consuming. Is there a better way to do this?
for files in glob.glob("AM_B0_*.flac.h5"):
    if files[11]=='_':
        year=files[12:16]
        month=files[17:19]
        day= files[20:22]
        hour=files[23:25]
        minute=files[25:27]
        second=files[27:29]
        tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year),int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
    else:
        year=files[11:15]
        month=files[16:18]
        day= files[19:21]
        hour=files[22:24]
        minute=files[24:26]
        second=files[26:28]
        tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this. It locates the date from the second-to-last '-', so no if/else case is needed:
filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
def find_second_last(text, pattern):
    return text.rfind(pattern, 0, text.rfind(pattern))
for files in filesall:
    start = find_second_last(files,'-') - 4 # from yyyy- part
    timepart = (files[start:start+17]).replace("T"," ")
    #insert 2 ':'s
    timepart = timepart[:13] + ':' + timepart[13:15] + ':' +timepart[15:]
    # print(timepart)
    tindex=pd.date_range(start= timepart, periods=60, freq='10S')
Instead of hard-coding files[11], find the index of the last or second-to-last _ and slice relative to it. Then you don't have to write the same code twice. Or use a regex to parse the string.
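
The regex suggestion could look roughly like this. It is a sketch that uses only the standard library; the names STAMP and parse_stamp are made up for illustration, and it assumes the stamp always has the YYYY-MM-DDTHHMMSS shape seen in the question.

```python
import re
from datetime import datetime

# Matches the 'YYYY-MM-DDTHHMMSS' stamp wherever it sits in the file name.
STAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{6}")

def parse_stamp(filename):
    """Return the datetime embedded in filename, or None if there is none (hypothetical helper)."""
    match = STAMP.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(), "%Y-%m-%dT%H%M%S")

names = [
    "AM_B0_D0.0_2016-04-01T010000.flac.h5",
    "AM_B0_D10.3_2017-03-17T110000.flac.h5",
    "AM_B0_D4.4_2016-08-05T151000.flac.h5",
]
stamps = [parse_stamp(name) for name in names]

print(stamps[2])
```

Because the pattern searches for the stamp itself, the varying width of the D0.0 / D10.3 part no longer matters. The resulting datetime can then be passed as start to pd.date_range as in the original loop.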

Use matlab to search excel data file for time range and copy data into variable

In my Excel file I have a time column in 12-hour clock format and a number of data columns. I have pasted a snippet of it in this post as code since I can't attach a file. I am trying to build a GUI that takes input from the user like so:
start time: 7:29:32 AM
End time: 7:29:51 AM
Then do the following:
calculate the time that has passed in seconds (should be just a row count, data is gathered once a second)
copy the data in the time range from the "Data 3" column into a variable
perform other calculations on the copied data as needed
I am having some trouble figuring out how to search the time data and find its location, since it imports as text with xlsread. Any ideas?
The data looks like this:
Time Data 1 Data 2 Data 3 Data 4 Data 5
7:29:25 AM 0.878556385 0.388400561 0.076890401 0.93335277 0.884750618
7:29:26 AM 0.695838393 0.712762566 0.014814069 0.81264949 0.450303694
7:29:27 AM 0.250846937 0.508617941 0.24802015 0.722457624 0.47119616
7:29:28 AM 0.206189924 0.82970364 0.819163787 0.060932817 0.73455323
7:29:29 AM 0.161844331 0.768214077 0.154097877 0.988201094 0.951520263
7:29:30 AM 0.704242494 0.371877481 0.944482485 0.79207359 0.57390951
7:29:31 AM 0.072028024 0.120263127 0.577396985 0.694153791 0.341824004
7:29:32 AM 0.241817775 0.32573323 0.484644494 0.377938298 0.090122672
7:29:33 AM 0.500962945 0.540808907 0.582958676 0.043377373 0.041274613
7:29:34 AM 0.087742217 0.596508236 0.020250297 0.926901109 0.45960323
7:29:35 AM 0.268222071 0.291034947 0.598887588 0.575571111 0.136424853
7:29:36 AM 0.42880255 0.349597405 0.936733938 0.232128788 0.555528823
7:29:37 AM 0.380425154 0.162002488 0.208550466 0.776866494 0.79340504
7:29:38 AM 0.727940393 0.622546124 0.716007768 0.660480612 0.02463804
7:29:39 AM 0.582772435 0.713406643 0.306544291 0.225257421 0.043552277
7:29:40 AM 0.371156954 0.163821476 0.780515577 0.032460418 0.356949005
7:29:42 AM 0.484167263 0.377878242 0.044189636 0.718147456 0.603177625
7:29:43 AM 0.294017186 0.463360581 0.962296024 0.504029061 0.183131098
7:29:44 AM 0.95635086 0.367849494 0.362230918 0.984421096 0.41587606
7:29:45 AM 0.198645523 0.754955312 0.280338922 0.79706146 0.730373691
7:29:46 AM 0.058483961 0.46774544 0.86783339 0.147418954 0.941713252
7:29:47 AM 0.411193343 0.340857813 0.162066261 0.943124515 0.722124394
7:29:48 AM 0.389312994 0.129281042 0.732723258 0.803458815 0.045824426
7:29:49 AM 0.549633038 0.73956852 0.542532728 0.618321989 0.358525184
7:29:50 AM 0.269925317 0.501399748 0.938234302 0.997577871 0.318813506
7:29:51 AM 0.798825842 0.24038537 0.958224157 0.660124357 0.07469288
7:29:52 AM 0.963581196 0.390150081 0.077448543 0.294604314 0.903519943
7:29:53 AM 0.890540963 0.50284339 0.229976565 0.664538451 0.926438543
7:29:54 AM 0.46951573 0.192568637 0.506730373 0.060557482 0.922857391
7:29:55 AM 0.56552394 0.952136998 0.739438663 0.107518765 0.911045415
7:29:56 AM 0.433149875 0.957190309 0.475811126 0.855705733 0.942255155
and this is the code I am using:
[Data,Text] = xlsread('C:\Users\data.xlsx',2);
IndexStart=strmatch('7:29:29 AM',Text,'exact'); %start time
IndexEnd=strmatch('2:30:29 PM',Text,'exact'); %end time
seconds = IndexEnd-IndexStart;
TestData = Data([IndexStart: IndexEnd],:);
You probably need to:
Use strfind to find the relevant string in the data imported
Use datenum to convert the date to serial date numbers, to be able to calculate the elapsed time between the two points.
It would help if you posted your code so far though.
EDIT based on comments:
Here's what I would do for cycling through the list of start and end times:
[Data,Text] = xlsread('C:\Users\data.xlsx',2);
start_times = {'7:29:29 AM','7:29:35 AM','7:29:44 AM','7:29:49 AM'}; % etc...
end_times = {'2:30:29 PM','2:30:59 PM','2:31:22 PM','2:32:49 PM'}; % etc...
elapsed_time = zeros(length(start_times),1);
TestData = cell(length(start_times),1); % need a cell array because data can/will be of unequal lengths
for k=1:length(start_times)
    IndexStart=strmatch(start_times{k},Text,'exact'); %start time
    IndexEnd=strmatch(end_times{k},Text,'exact'); %end time
    elapsed_time(k) = IndexEnd-IndexStart;
    TestData{k} = Data([IndexStart: IndexEnd],:);
end
Alternatively, use "Import Data" in the Variable section of the Home tab. There you can choose how the data is imported, for example with or without headings and in which format.
