So I have some unclean HTML:
"<table class="content divbackground"><tr><td class='title'> </td><td class='title'>From</td><td class='title'>To</td></tr><tr><td class='entry'>Monday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr><tr><td class='entry'>Tuesday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr><tr><td class='entry'>Wednesday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr><tr><td class='entry'>Thursday</td><td class='entry'>09:00</td><td class='entry'>20:00</td></tr><tr><td class='entry'>Friday</td><td class='entry'>09:00</td><td class='entry'>20:00</td></tr><tr><td class='entry'>Saturday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr><tr><td class='entry'>Sunday</td><td class='entry'>11:00</td><td class='entry'>18:00</td></tr></table></td></td>"
It's the opening hours of a pharmacy (the information is published on a public register).
Now I could parse the HTML using a parser, but I find that this is not robust to errors and I still have to pull out the code between <table> and </table>.
Is there some nice Unix command (sed?) that searches for all occurrences of:
XX:XX
inside <td></td> tags
where X must be a number?
Handling HTML with regex is not good practice. However, if your input format is fixed, you can try this grep line (the -P option and \K need a grep with PCRE support, such as GNU grep):
grep -oP '<td[^>]*>\K\d\d:\d\d' input
With your example input, it outputs:
09:00
18:00
09:00
18:00
09:00
18:00
09:00
20:00
09:00
20:00
09:00
18:00
11:00
18:00
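The same pattern can be tried outside the shell. What follows is only a minimal Python sketch, assuming the fixed input format described above; the sample string is a shortened copy of the table from the question, and the names `TIME_IN_CELL` and `extract_times` are made up for illustration. The capture group plays the role that \K plays in the grep pattern.

```python
import re

# Shortened copy of the table from the question (illustrative sample only).
html = (
    "<table class=\"content divbackground\"><tr><td class='title'> </td>"
    "<td class='title'>From</td><td class='title'>To</td></tr>"
    "<tr><td class='entry'>Monday</td><td class='entry'>09:00</td>"
    "<td class='entry'>18:00</td></tr>"
    "<tr><td class='entry'>Sunday</td><td class='entry'>11:00</td>"
    "<td class='entry'>18:00</td></tr></table></td></td>"
)

# An opening <td ...> tag followed immediately by digit digit : digit digit.
# Only the captured time is returned, not the tag itself.
TIME_IN_CELL = re.compile(r"<td[^>]*>(\d\d:\d\d)")


def extract_times(text):
    """Return every XX:XX value that directly follows an opening <td> tag."""
    return TIME_IN_CELL.findall(text)


times = extract_times(html)
print("\n".join(times))
```

Like the grep line, this only holds up while the time is the first thing inside the cell; any leading whitespace or nested markup would need a looser pattern.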
When I run my function to retrieve history values from Wonderware into Excel, it only loads 4 of the 50 tags.
I have tried selecting the cells that contain the tag names, but it seems to pick up only 4 of them, and I need all 50.
=wwWideHistory3("LEWC2012INSQL", AFTagBinding,"Row7",AFStartBinding,AFEndBinding,254,0,0,0,0,3,0,"",3,"",-1,0,"","NoFilter",16384)
These are the 4 tags that it loads; I need all 50:
DateTime L2_PT6130 L2_PT6230 NU_DC01_DX01_DISPLAY NU_DC01_DX02_DISPLAY
6/1/19 6:00:00 AM 2.255943775 3.29255867 0.039721143 -0.231841758
6/2/19 10:00:00 AM 2.26124382 3.646864653 0.084306099 -0.209954605
6/3/19 2:00:00 PM 2.498756409 3.312580585 0.042153049 -0.206712067
6/4/19 6:00:00 PM 2.703880787 3.238382339 0.027561609 -0.233463034
6/5/19 10:00:00 PM 2.20412302 3.113344669 0.091601819 -0.229409859
6/7/19 2:00:00 AM 2.044145584 2.985558987 2.3581388 0.968709469
6/8/19 6:00:00 AM 2.187830925 3.223267794 2.323281527 0.663099885
I already have a list of scheduled appointments and want to show all possible time slots available between 7:30 AM and 5:00 PM for a 2-hour appointment. I've tried a visual and managed to get it through a hack, but I need it to work just from reading the table below.
SCHEDULED APPOINTMENTS
|---------------------|-------------------|
| Start Date/Time | End Date/Time |
| 6/12/2019 7:30 AM | 6/12/2019 8:30 AM |
| 6/12/2019 8:45 AM | 6/12/2019 9:15 AM |
| 6/12/2019 3:00 PM | 6/12/2019 3:30 PM |
| 6/12/2019 3:45 PM | 6/12/2019 4:15 PM |
| 6/12/2019 4:15 PM | 6/12/2019 5:00 PM |
|---------------------|-------------------|
EXPECTED OUTCOME:
6/12/2019 9:15 AM
6/12/2019 9:30 AM
6/12/2019 9:45 AM
6/12/2019 10:00 AM
6/12/2019 10:15 AM
6/12/2019 10:30 AM
6/12/2019 10:45 AM
6/12/2019 11:00 AM
6/12/2019 11:15 AM
6/12/2019 11:30 AM
6/12/2019 11:45 AM
6/12/2019 12:00 PM
6/12/2019 12:15 PM
6/12/2019 12:30 PM
6/12/2019 12:45 PM
6/12/2019 1:00 PM
To get just that list directly would require VBA, which is possible, but Stack Overflow is not a write-your-code-for-you service. We would help if you got stuck with your code, but you need to know how to code in the first place and have made a start.
That said, if you accept a slightly easier solution, then a single formula can give you your desired result:
Convert your appointments range to a data table with column headings "Start" and "End"
Set the table name to "Appointments"
Store your new appointment length (2) in a cell and give it the name "Length"
Create a list of every possible appointment start time, starting from A1
Enter this formula next to the first time in B1, and save it by pressing CTRL+SHIFT+ENTER:
=AND((ROUND(Appointments[Start],4)>=ROUND(A1+Length/24,4))+(ROUND(Appointments[End],4)<=ROUND(A1,4)),ROUND(A1-TRUNC(A1),4)<=ROUND((17-Length)/24,4))
Then fill down that formula against every time slot and it will say TRUE for the available time slots.
For each possible time slot, the formula checks that every existing appointment either finishes on or before the time slot or starts 2 or more hours after it. It also checks that there are at least 2 hours left in the day before finishing at 5 PM. To handle a different length for the new appointment, change the value in the "Length" cell.
The ROUND functions are added to eliminate issues with floating point precision on fractions/times not always correctly identifying when 2 times are the same.
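The check the formula performs can also be written out procedurally. The following is a rough Python sketch rather than a translation of the Excel formula: the function name `available_slots` is invented for illustration, and the 15-minute step between candidate slots is an assumption read off the expected outcome in the question.

```python
from datetime import datetime, timedelta


def available_slots(appointments, day_start, day_end, length, step):
    """Return candidate start times that do not clash with any appointment.

    A slot starting at t is free when every appointment either ends on or
    before t, or starts at or after t + length. The slot must also finish
    by day_end.
    """
    slots = []
    t = day_start
    while t + length <= day_end:
        if all(start >= t + length or end <= t for start, end in appointments):
            slots.append(t)
        t += step
    return slots


# Appointments copied from the table in the question.
day = (2019, 6, 12)
appointments = [
    (datetime(*day, 7, 30), datetime(*day, 8, 30)),
    (datetime(*day, 8, 45), datetime(*day, 9, 15)),
    (datetime(*day, 15, 0), datetime(*day, 15, 30)),
    (datetime(*day, 15, 45), datetime(*day, 16, 15)),
    (datetime(*day, 16, 15), datetime(*day, 17, 0)),
]

slots = available_slots(
    appointments,
    day_start=datetime(*day, 7, 30),
    day_end=datetime(*day, 17, 0),
    length=timedelta(hours=2),       # the "Length" value
    step=timedelta(minutes=15),      # assumed slot granularity
)

for slot in slots:
    print(slot.strftime("%m/%d/%Y %H:%M"))
```

Because datetime arithmetic is exact here, no rounding is needed; the ROUND calls in the formula only exist to work around floating-point time values in Excel.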
Epoch time for 2nd July 2018, 11 PM (IST):
> moment('2018-07-02T23:00:00.000').unix()
1530552600
Now when I convert from epoch back to IST, it adds 7 extra minutes:
> moment.unix(1530552600).tz("Asia/Kolkata").format("DD:MM:YYYY HH:MM z");
'02:07:2018 23:07 IST'
When converted to the ET timezone, the time is also about 30 minutes off from what I expect. ET is 9.5 hours behind IST, so it should have been '02:07:2018 13:30 EDT' (1:30 PM):
> moment.unix(1530552600).tz("America/New_York").format("DD:MM:YYYY HH:MM z");
'02:07:2018 13:07 EDT'
Your format string is wrong: you used MM (month) where you meant mm (minutes), so the "07" in the output is the month, not the minutes.
Try:
moment.unix(1530552600).tz("Asia/Kolkata").format("DD:MM:YYYY HH:mm z");
For all other format tokens, see the Moment.js documentation.
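The expected values can be cross-checked independently of Moment.js. Here is a small Python sketch, assuming fixed UTC offsets (+05:30 for IST, and -04:00 for EDT, which applies in July 2018) instead of a timezone database; the helper name `format_epoch` is made up for illustration. Note that Python's strftime has the same kind of trap with the case reversed: `%m` is the month and `%M` is the minutes.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are an assumption that holds for this date only;
# a real application would use a timezone database.
IST = timezone(timedelta(hours=5, minutes=30), "IST")
EDT = timezone(timedelta(hours=-4), "EDT")


def format_epoch(epoch, tz):
    """Format an epoch timestamp as DD:MM:YYYY HH:mm plus the zone name."""
    # %m = month, %M = minutes (the opposite case convention to Moment.js).
    return datetime.fromtimestamp(epoch, tz).strftime("%d:%m:%Y %H:%M %Z")


ist_text = format_epoch(1530552600, IST)
edt_text = format_epoch(1530552600, EDT)
print(ist_text)
print(edt_text)
```

With the minutes token in place, the two results differ by exactly 9.5 hours, which is what the question expected.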
I am having this problem: when I run this code with the file below, it gives me an index out of range error.
import sys

f = open(sys.argv[1], 'r')
file_contents = [x.split('\t')[2:5] for x in f.readlines()]
# Set the variables for average and total for cities
total = 0
city = set()
for line in file_contents:
    print(line[0])
This is the content of the file:
2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex
2012-01-01 09:00 Fort Worth Women's Clothing 153.57 Visa
2012-01-01 09:00 San Diego Music 66.08 Cash
2012-01-01 09:00 Pittsburgh Pet Supplies 493.51 Discover
2012-01-01 09:00 Omaha Children's Clothing 235.63 MasterCard
You need to close the file after reading from it. The recommended practice is to open it using the with statement, which closes it automatically:
import sys

with open(sys.argv[1], 'r') as f:
    file_contents = [x.split(' ')[2:5] for x in f.readlines()]

# Set the variables for average and total for cities
total = 0
city = set()
for line in file_contents:
    print(line[0])
However, the actual issue is that you are splitting the lines on \t. Split on a blank space instead and it should give you what you need.
OUTPUT
09:00
09:00
09:00
09:00
09:00
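One way an index error can arise from this kind of code is a line with fewer than three separated fields, for example a blank trailing line or a line that does not contain the expected separator: the slice [2:5] is then empty and line[0] fails. The following is a defensive sketch only; the helper name `middle_fields` is invented, and the tab-separated sample lines are an assumption, since the real separator in the asker's file is not known.

```python
def middle_fields(lines, sep="\t"):
    """Return fields 2..4 of each line, skipping lines that have none."""
    rows = []
    for raw in lines:
        fields = raw.rstrip("\n").split(sep)[2:5]
        if fields:  # an empty slice means the line was blank or too short
            rows.append(fields)
    return rows


# Hypothetical sample: two data lines with a blank line between them.
sample = [
    "2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex\n",
    "\n",
    "2012-01-01\t09:00\tOmaha\tChildren's Clothing\t235.63\tMasterCard\n",
]

rows = middle_fields(sample)
for row in rows:
    print(row[0])
```

Passing a different `sep` lets the same function be tried against a space-separated file.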
I want to extract the text from the table at http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm into a text file, as plain text without HTML tags, from the Mac OS X command line.
I tried a lot of sed commands, but sed will only print the whole file again. What am I doing wrong?
Example of what I tried
sed -n '/<tr>/,/<\/tr>/p' scoretable.htm (will just print table contents with html tags :( )
A little TXR web scraping, with the help of wget to grab the page:
@(deffilter nobr ("<br />" ""))
@(deffilter brsp ("<br />" " "))
@(deffilter nosp (" " ""))
@(next "!wget 2>/dev/null -O - http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm")
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
@(skip)
<div class="scoreTableArea">
@(collect)
<h2 class="unify">@year - @event</h2>
@  (filter brsp event)
@  (collect)
<tr>
<td class="center">@pos</td>
<td>@player</td>
<td>@company</td>
<td>@date</td>
<td class="center">@points</td>
</tr>
@    (filter nobr player company date points)
@    (filter nosp pos points)
@  (until)
</tbody>
@  (end)
@(end)
@(output :filter :from_html)
@  (repeat)
Event: @event
Year: @year
DATE POS PT PLAYER COMPANY
@    (repeat)
@{date -10} @{pos -2} @{points 2} @{player 16} @company
@    (end)
@  (end)
@(end)
Sample run:
$ txr scoretable.txr
Event: Teeing off to Clobber Ken
Year: 2011
DATE POS PT PLAYER COMPANY
Sept 2011 1 40 John Durrant King Sumners Partnership
Sept 2011 2 34 Grahame Pettit Amiri Construction
Oct 2011 3 31 Tony Deacon Gleeds
Oct 2011 4 29 Tony Boyle Lacey Hickey Caley
Oct 2011 5 29 Richard Hemming Scott White and Hookins
Sept 2011 6 29 Ian McCoy Selway Joyce
June 2011 7 27 Julian Larkin C&G Properties
Sept 2011 8 25 Roque Menezes Capita Symonds
June 2011 9 22 Shawn Lambert PWP Architects
Sept 2011 10 22 Kevin Lendon Amiri Construction
Event: Ken Watson (HNW Architects) Undisputed Amiri Golf Demon of the Downs
Year: 2010
DATE POS PT PLAYER COMPANY
2010 1 40 Ken Watson HNW Architects
2010 2 37 David Heda London Clancy
2010 3 34 Gordon Brown Currie & Brown
2010 4 32 Alistair Taylor Wildbrook Properties
5 30 Andy Goodridge City Estates
6 25 Russ Pitman Henderson Green
7 24 Phil Piper Piper Whitlock
8 23 Kevin Miller Urban Pulse Architects
9 19 Simon Asquith Godsall Arnold Partnership
10 19 Shawn Lambert PWP Architects
11 18 Martin Judd Davis Langdon
sed -n 's;</\?td>;;gp' scoretable.html | \
sed -e 's;<td class="center">;;' \
-e 's;<.*>;;'
Note that I use ; instead of / as my delimiter; I find it a bit easier to read. sed will use whatever character you put after s as the delimiter.
Okay, now the explanation. The first line:
-n suppresses sed's default output, but the p at the end of the command tells sed to print every line on which the substitution matched. This gets us only the lines wrapped in <td> tags. At the same time, I'm finding anything that matches </\?td> and substituting it with nothing. /\? means the / is optional (it may appear zero times or once), so this matches both the opening and closing tags. The g at the end, for global, means that sed won't stop after the first match in a line. Without g it would only substitute the opening tag.
The output from this is piped into sed again on the second line:
-e just specifies that there is an editing command to run. If you're just running one command it's implied, but here I run two (the next one is on the third line).
This removes <td class="center">, and the next line removes any other tags (in this case the <br> tags).
The last command can only be run if you're sure that there is at most one tag on a line. Otherwise, the .* will be greedy and match too much, so in:
<td class="center">24 </ br>
it would match the entire line, and remove everything.
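The greedy behaviour is easy to reproduce outside sed. A minimal Python sketch, using the example line above, compares the greedy pattern with one that stops at the first > of each tag; the variable names are illustrative. The second pattern, <[^>]*>, is also valid in sed, where it would be used with the g flag.

```python
import re

line = '<td class="center">24 </ br>'

# Greedy: .* runs from the first < to the last > on the line.
greedy = re.sub(r"<.*>", "", line)

# Per-tag: [^>]* cannot cross a >, so each tag is matched separately.
per_tag = re.sub(r"<[^>]*>", "", line)

print(repr(greedy))
print(repr(per_tag))
```

The first result is empty because the whole line was consumed, while the second keeps the cell text between the two tags.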