Extract text from HTML table

I want to extract the text from the table at http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm into a text file as plain text, without HTML tags, from the Mac OS X command line.
I tried a lot of sed commands, but sed will only print the whole file again. What am I doing wrong?
An example of what I tried:
sed -n '/<tr>/,/<\/tr>/p' scoretable.htm (this just prints the table contents, HTML tags and all :( )

A little TXR web scraping, with the help of wget to grab the page:
@(deffilter nobr ("<br />" ""))
@(deffilter brsp ("<br />" " "))
@(deffilter nosp (" " ""))
@(next "!wget 2>/dev/null -O - http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm")
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
@(skip)
<div class="scoreTableArea">
@(collect)
<h2 class="unify">@year - @event</h2>
@ (filter brsp event)
@ (collect)
<tr>
<td class="center">@pos</td>
<td>@player</td>
<td>@company</td>
<td>@date</td>
<td class="center">@points</td>
</tr>
@ (filter nobr player company date points)
@ (filter nosp pos points)
@ (until)
</tbody>
@ (end)
@(end)
@(output :filter :from_html)
@ (repeat)
Event: @event
Year: @year
DATE       POS PT PLAYER          COMPANY
@ (repeat)
@{date -10} @{pos -2} @{points 2} @{player 16} @company
@ (end)
@ (end)
@(end)
Sample run:
$ txr scoretable.txr
Event: Teeing off to Clobber Ken
Year: 2011
DATE       POS PT PLAYER          COMPANY
Sept 2011  1  40 John Durrant     King Sumners Partnership
Sept 2011  2  34 Grahame Pettit   Amiri Construction
Oct 2011   3  31 Tony Deacon      Gleeds
Oct 2011   4  29 Tony Boyle       Lacey Hickey Caley
Oct 2011   5  29 Richard Hemming  Scott White and Hookins
Sept 2011  6  29 Ian McCoy        Selway Joyce
June 2011  7  27 Julian Larkin    C&G Properties
Sept 2011  8  25 Roque Menezes    Capita Symonds
June 2011  9  22 Shawn Lambert    PWP Architects
Sept 2011  10 22 Kevin Lendon     Amiri Construction
Event: Ken Watson (HNW Architects) Undisputed Amiri Golf Demon of the Downs
Year: 2010
DATE       POS PT PLAYER          COMPANY
2010       1  40 Ken Watson       HNW Architects
2010       2  37 David Heda       London Clancy
2010       3  34 Gordon Brown     Currie & Brown
2010       4  32 Alistair Taylor  Wildbrook Properties
           5  30 Andy Goodridge   City Estates
           6  25 Russ Pitman      Henderson Green
           7  24 Phil Piper       Piper Whitlock
           8  23 Kevin Miller     Urban Pulse Architects
           9  19 Simon Asquith    Godsall Arnold Partnership
           10 19 Shawn Lambert    PWP Architects
           11 18 Martin Judd      Davis Langdon
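If you don't have TXR, a rough Python sketch using only the standard library can dump the same cells (the URL is the one from the question; the entity and whitespace handling here is best-effort, not the thread's own method):
import urllib.request
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.row = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.row:
            self.rows.append(self.row)
            self.row = []

    def handle_data(self, data):
        # data arrives with entities already decoded (convert_charrefs=True)
        if self.in_cell:
            self.row[-1] += data

html = urllib.request.urlopen(
    "http://www.amiriconstruction.co.uk/goodwoodgolf/scoretable.htm"
).read().decode("utf-8", errors="replace")

parser = CellText()
parser.feed(html)
for row in parser.rows:
    # collapse runs of whitespace (including <br /> line breaks and &nbsp;)
    print("\t".join(" ".join(cell.split()) for cell in row))
Each table row comes out as one tab-separated line, with the <br /> breaks and non-breaking spaces collapsed.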

sed -n 's;</\?td>;;gp' scoretable.html | \
sed -e 's;<td class="center">;;' \
-e 's;<.*>;;'
Note that I use ; instead of / as my delimiter, as I find it a bit easier to read. Sed will use whatever character you put after the s as the delimiter.
Okay, now the explanation. The first line:
-n will suppress output, but the p at the end of the command tells sed to print all lines matching the pattern. This will get us only the lines wrapped in <td> tags. At the same time, I'm finding anything that matches </\?td> and substituting it with nothing. \? means the preceding / may appear zero times or once, so this matches both the opening and closing tags. The g at the end means global: sed won't stop trying to match the pattern after it succeeds for the first time in a line. Without g it would only substitute the opening tag.
The output from this is piped into sed again on the second line:
-e just specifies that there is an editing command to run. If you're just running one command it's implied, but here I run two (the next one is on the third line).
This removes <td class="center">, and the next line removes any other tags (in this case the <br /> tags).
The last command is only safe if you're sure that there's at most one tag on a line. Otherwise, the .* will be greedy and match too much, so in:
<td class="center">24 <br />
it would match the entire line, and remove everything.
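The greedy behaviour is easy to demonstrate in Python, whose regexes treat .* the same way (sed's BRE has no non-greedy operator, but Python's .*? shows what the fix would look like):
import re

line = '<td class="center">24 <br />'
# Greedy: .* spans from the first '<' to the last '>', wiping the whole line.
print(repr(re.sub(r"<.*>", "", line)))   # ''
# Non-greedy: .*? stops at the first '>', so each tag is removed separately.
print(repr(re.sub(r"<.*?>", "", line)))  # '24 '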

Grep total amount of specific elements based on date

Is there a way in Linux to filter multiple files full of data in one command, without writing a script?
For this example I want to know how many males appear by date. The problem is that a specific date (January 3rd) appears in 2 separate files:
file1
Jan 1 john male=yes
Jan 1 james male=yes
Jan 2 kate male=no
Jan 3 jonathan male=yes
file2
Jan 3 alice male=no
Jan 4 john male=yes
Jan 4 jonathan male=yes
Jan 4 alice male=no
I want the total number of males for each date from all files. If there are no males for a specific date, no output should be given.
Jan 1 2
Jan 3 1
Jan 4 2
The only way I can think of is to count the males for one specific date at a time, but this would not be performant: in real-world examples there could be many more files, and manually entering all the dates would be a waste of time. Any help would be appreciated, thank you!
localhost:~# cat file1 file2 | grep "male=yes" | grep "Jan 1" | wc -l
2
grep -h 'male=yes' file? | \
cut -c-6 | \
awk '{c[$0] += 1} END {for(i in c){printf "%6s %4d\n", i, c[i]}}'
The grep prints the male lines, cut keeps only the first 6 characters (the date), and awk counts each date and prints every date with its counter at the end.
Given your files the output will be:
Jan 1 2
Jan 3 1
Jan 4 2
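For comparison, the same counting as a small Python sketch (the file names file1 and file2 are taken from the question):
from collections import Counter

counts = Counter()
for name in ("file1", "file2"):
    with open(name) as f:
        for line in f:
            if "male=yes" in line:
                counts[line[:6].strip()] += 1  # first 6 chars hold the date

for date in sorted(counts):
    print(date, counts[date])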

Python Web-scraped text, print vertically

This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
url = ("https://overwatchleague.com/en-us/schedule?stage=regular_season&week=12")
driver = webdriver.Chrome('C:\\Program Files (x86)\\chromedriver.exe')
driver.get(url)
MatchScores = driver.find_elements_by_xpath('//*[@id="__next"]/div/div/div[3]/div[3]/div[1]/div[2]/div[13]/div/section/div[3]')[0]
Results = MatchScores.text
print('-----------------------------')
print(Results)
When I run it, I get something like this:
FRI, JUL 02
FINAL
Paris Eternal
1
-
3
San Francisco Shock
MATCH DETAILS
FRI, JUL 02
FINAL
Washington Justice
0
-
3
Atlanta Reign
MATCH DETAILS
This continues for the other matches. Is there a way for me to print it so that it comes out like this?
FRI, JUL 02 FINAL Paris Eternal 1 - 3 San Francisco Shock
FRI, JUL 02 FINAL Washington Justice 0 - 3 Atlanta Reign
Any help would be appreciated; it would be a bonus if it could be printed without the "MATCH DETAILS" at the end.
You really just need to convert all newlines to spaces, which you can do by splitting on newlines and then joining with spaces.
You can remove the trailing MATCH DETAILS by slicing off the last 13 characters ("MATCH DETAILS" is 13 characters long).
Hence it is like this:
print(' '.join(Results.split('\n'))[:-13])
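Note that [:-13] only trims the final MATCH DETAILS. If there are several matches in Results and you want one line per match, with every MATCH DETAILS label dropped, a small sketch:
# Split on the label, then collapse each block's newlines into single spaces.
for block in Results.split("MATCH DETAILS"):
    line = " ".join(block.split())
    if line:  # skip the empty piece after the last label
        print(line)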

How to extract and create new columns from specific match

I have a column bike_name and I want to know the easiest way to split it into year and CC.
CC should contain the numeric data attached before the word cc. In some cases, where cc is not available, it should remain blank.
The year column should contain just the year, which is always the last word.
TVS Star City Plus Dual Tone 110cc 2018
Royal Enfield Classic 350cc 2017
Triumph Daytona 675R 2013
TVS Apache RTR 180cc 2017
Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018
Yamaha FZs 150cc 2015
You can extract them separately: year is the last 4 characters, CC is via a regex:
df["year"] = df.bike_name.str[-4:]
df["CC"] = df.bike_name.str.extract(r"(\d+)cc").fillna("")
where the regex looks for a sequence of digits immediately followed by the literal "cc"; rows with no match give NaN, so we fill them with an empty string,
to get
                                    bike_name  year   CC
0    TVS Star City Plus Dual Tone 110cc 2018  2018  110
1           Royal Enfield Classic 350cc 2017  2017  350
2                  Triumph Daytona 675R 2013  2013
3                  TVS Apache RTR 180cc 2017  2017  180
4  Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018  2018  150
5                      Yamaha FZs 150cc 2015  2015  150
If not only extraction but also removal is needed:
df.bike_name = (df.bike_name.str[:-4]
.str.replace(r"\d+cc", "", regex=True)
.str.rstrip())
where the first line removes the year, the second removes the cc part, and lastly we strip trailing spaces from each row,
to get
>>> df
                         bike_name  year   CC
0     TVS Star City Plus Dual Tone  2018  110
1            Royal Enfield Classic  2017  350
2             Triumph Daytona 675R  2013
3                   TVS Apache RTR  2017  180
4  Yamaha FZ S V 2.0 -Ltd. Edition  2018  150
5                       Yamaha FZs  2015  150
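If you prefer doing the extraction and the removal in one step, the same logic fits into a single assign call (a sketch; expand=False makes str.extract return a Series instead of a DataFrame):
df = df.assign(
    year=df.bike_name.str[-4:],
    CC=df.bike_name.str.extract(r"(\d+)cc", expand=False).fillna(""),
    bike_name=(df.bike_name.str[:-4]
               .str.replace(r"\d+cc", "", regex=True)
               .str.rstrip()),
)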

Create New DataFrame Columns Based on Year

I have a pandas DataFrame that contains NFL Quarterback Data from the 2015-2016 to the 2019-2020 Seasons. The DataFrame looks like this
Player         Season End Year    YPG  TD
Tom Brady                 2019  322.6  25
Tom Brady                 2018  308.1  26
Tom Brady                 2017  295.7  24
Tom Brady                 2016  308.7  28
Aaron Rodgers             2019  360.4  30
Aaron Rodgers             2018  358.8  33
Aaron Rodgers             2017  357.9  35
Aaron Rodgers             2016  355.2  32
I want to be able to create new columns that contain the selected year's data and the last three years' data. For example, if the year I select is 2019, the resulting DataFrame would be (SY stands for selected year):
Player         Season End Year  YPG SY  YPG SY-1  YPG SY-2  YPG SY-3  TD
Tom Brady                 2019   322.6     308.1     295.7     308.7  25
Aaron Rodgers             2019   360.4     358.8     357.9     355.2  30
This is how I am attempting to do it:
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']), 'YPG SY'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-1), 'YPG SY-1'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-2), 'YPG SY-2'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-3), 'YPG SY-3'] = NFL_Data['YPG']
However, when I run the code above, it doesn't fill out the columns appropriately. Most of the rows are 0. Am I approaching the problem the right way or is there a better way to attack it?
(Edited to include TD Column)
The first step is to pivot your data frame.
pivoted = df.pivot_table(index='Player', columns='Season End Year', values='YPG')
Which yields
Season End Year   2016   2017   2018   2019
Player
Aaron Rodgers    355.2  357.9  358.8  360.4
Tom Brady        308.7  295.7  308.1  322.6
Then, with year = 2019, you may select:
pivoted.loc[:, range(year, year-3, -1)]
Season End Year   2019   2018   2017
Player
Aaron Rodgers    360.4  358.8  357.9
Tom Brady        322.6  308.1  295.7
Or alternatively as suggested by Quang:
pivoted.loc[:, year:year-3:-1]
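To also carry the TD column through and get the four YPG columns the question asks for, a sketch building on pivoted (df is the original frame, year = 2019):
year = 2019
wide = pivoted.loc[:, range(year, year - 4, -1)]  # columns SY .. SY-3
wide.columns = ["YPG SY", "YPG SY-1", "YPG SY-2", "YPG SY-3"]
td = df.loc[df["Season End Year"] == year, ["Player", "TD"]]
result = td.merge(wide, left_on="Player", right_index=True)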

How do I count lines where a specific column matches two patterns?

year start  year end  location  topic       data type         data value
2016        2017      AL        Alcohol     Crude Prevalence  16.9
2016        2017      CA        Alcohol     Other             15
2016        2017      AZ        Neuropathy  Other             13.1
2016        2017      HI        Smoke       Crude Prevalence  20
2016        2017      IL        Cancer      Other             20
2016        2017      KS        Cancer      Other             14
2016        2017      AZ        Smoke       Crude Prevalence  16.9
2016        2017      KY        Cancer      Other             13.8
2016        2017      LA        Alcohol     Crude Prevalence  18
The task is to count the lines whose "topic" column is "Alcohol" or "Cancer".
I already found the index of the column named "topic", but the contents I extract from it are not correct, so I am not able to count the lines containing "Alcohol" and "Cancer". How can I solve this?
Here is my code:
awk '{print $4}' AAA.csv > topic.txt
head -n5 topic.txt | less
You could try the following:
The call to awk gets the column in question, grep filters for the keywords, and wc -l counts the lines:
$ awk '{ print $4 }' data.txt | grep -e Alcohol -e Cancer | wc -l
6
Using a regexp with grep:
cat data.txt|tr -s " "|cut -d " " -f 4|grep -E '(Alcohol|Cancer)'|wc -l
If you are sure that the words "Alcohol" and "Cancer" only appear in the 4th column, you can just do
grep -E '(Alcohol|Cancer)' data.txt|wc -l
Addition
The OP asks in the comment:
If there are many columns, and I don't know the index of them. How can I extract the columns just based on their name ("topic")?
This code will store in the variable i the column number containing "topic". Essentially, the code stores the first line of data.txt in an array variable s, and then parses the array elements until it finds the desired word. (You have to increase i by one at the end because bash array indices start at 0, while awk/cut field numbers start at 1.)
Note: the code works only if a column named "topic" is actually present.
read -a s < data.txt
for (( i=0; i<${#s[@]}; i++ ))
do
if [ "${s[$i]}" == "topic" ]
then
break
fi
done
i=$(( $i + 1 ))
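The lookup is easier to get right when the file is tab-separated, so that multi-word headers such as "data type" stay in one field; a Python sketch under that assumption (data.txt assumed tab-delimited):
with open("data.txt") as f:
    header = f.readline().rstrip("\n").split("\t")
    col = header.index("topic")  # 0-based position of the "topic" column
    hits = sum(line.rstrip("\n").split("\t")[col] in ("Alcohol", "Cancer")
               for line in f)
print(hits)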
