Do you have any idea how to get this little table from this website to excel?
Normal source code scraping won't work since the results are not stored in the source. Power query doesn't work either...
Edit:
I have tried Power Query. I have some code that downloads data from websites by searching for a class, tag, etc., but all of it searches the page source, not the rendered website, so posting that code just to post something is pointless.
I know that starting off with web scraping can sometimes be cumbersome, and the volume of information out there can be overwhelming, so I decided to kickstart your efforts in the hope that in the future you will at least know where to start.
Inspect the network traffic.
Use your browser's developer tools to inspect the requests being sent when you browse a website. In your case, quite a few requests are sent under the hood when you press search, but you only need one of them: the XHR request that returns the table as its response.
Inspect the request itself
The request basically consists of a URL, which contains the parameters that you select in the dropdown menus; the headers, which in your case are not essential to the result; and a body, which in your case is empty because all the parameters are contained in the URL.
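As a rough illustration of how such a URL is assembled from its parameters, here is a minimal Python sketch. The parameter names are the ones that appear in the request URL for this site; the function name `build_quota_url` is made up for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://ec.europa.eu/taxation_customs/dds2/taric/quota_list.jsp"


def build_quota_url(origin, code, year, status="", critical=""):
    """Return the search URL with every parameter placed in the query string."""
    params = {
        "Lang": "en",
        "Origin": origin,
        "Code": code,
        "Year": year,
        "Status": status,      # may stay empty if it is not part of the search
        "Critical": critical,  # may stay empty if it is not part of the search
        "Expand": "true",
        "Offset": "0",
    }
    # urlencode keeps the insertion order of the dict and escapes the values
    return BASE_URL + "?" + urlencode(params)


url = build_quota_url(origin="UA", code="096714", year="2019")
print(url)
```

Because the body of the request is empty, this URL is everything a client needs in order to repeat the search.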
Inspect the response
The response in your case is HTML. It could have been something else, like JSON. The data you want is in an HTML table with the ID "quotaTable".
<html>
<head>
<!-- Including version.html for defect CUSTD00035918 Start -->
<meta name="application" content="DDS2-TARIC" />
<meta name="version" content="#REL#" />
<!-- Defect# CUSTD00024730 Start -->
<!-- IPG Rule requires the following 7 metatags in all application pages. Additional metatags e.g. version and application can be added if required by the application. -->
<meta http-equiv="Content-Language" content="en">
<meta name="description" content="DDS2-TARIC Application page">
<meta name="reference" content="DDS2-TARIC Reference">
<meta name="creator" content="DG-TAXUD">
<meta name="classification" content="DDS2-TARIC">
<meta name="keywords" content="DDS2-TARIC, TARIC, DDS2">
<meta name="date" content="">
<!-- Defect# CUSTD00024730 End -->
<!-- Including version.html for defect CUSTD00035918 End -->
</head>
<body style="background-color:#FFFFF0;">
<div id="quotaMarkedUpContainer">
<div class='scroller' id="navigation" align=center>
<table>
<tr>
<td>
</td>
<td>
</td>
</tr>
</table>
</div>
<table id="quotaTable" class="list" width="100%" style="padding-left: 7%; padding-right: 7%;">
<thead>
<tr class="columnHeader">
<th>
Order number
</th>
<th>
Origins
</th>
<th style="text-align: center;">
Start date
</th>
<th style="text-align: center;">
End date
</th>
<th style="text-align: right;">
Balance
</th>
<th/>
</tr>
</thead>
<tr class="oddRow">
<td>
096714
</td>
<td>
<div>
Ukraine
</div>
</td>
<td style="text-align: center;">
01-01-2019
</td>
<td style="text-align: center;">
31-12-2019
</td>
<td style="text-align: right;">
0 Kilogram
</td>
<td>
<a id="quotaLink" href="https://ec.europa.eu/taxation_customs/dds2/taric/quota_tariff_details.jsp?Lang=en&StartDate=2019-01-01&Code=096714" style="color:#3247e8; text-decoration:underline;" class='browse_action_a'>[More info]</a>
</td>
</tr>
</table>
<div class='scroller' id="navigation" align=center>
<table>
<tr>
<td>
</td>
<td>
</td>
</tr>
</table>
</div>
</div>
</body>
</html>
Write the code
For that you will need the following references
Microsoft WinHTTP Services, version 5.1 (to create and manipulate HTTP requests)
Microsoft HTML Object Library (to manipulate HTML elements)
Here's an example of how to get one of the table's cells:
Option Explicit
Sub getData()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim table As HTMLTable
Dim url As String, code As String, year As String, origin As String, status As String, critical As String 'the request's parameters
critical = "" 'you can leave it blank if it's not important to your search
status = "" 'you can leave it blank if it's not important to your search
origin = "UA"
year = "2019"
code = "096714"
url = "https://ec.europa.eu/taxation_customs/dds2/taric/quota_list.jsp?Lang=en&Origin=" & origin & "&Code=" & code & "&Year=" & year & "&Status=" & status & "&Critical=" & critical & "&Expand=true&Offset=0" 'build the URL by concatenating the various parameters
With req
.Open "GET", url, False
.send
doc.body.innerHTML = .responseText 'Assign the HTML response to an HTML document object
'Debug.Print .responseText
End With
Set table = doc.getElementById("quotaTable") 'get the table you're interested in
Debug.Print table.Rows(1).Cells(4).innerText 'print the 5th cell of the 2nd row in the immediate window
End Sub
With the response shown above, this prints the Balance cell of the first data row: 0 Kilogram.
For demonstration purposes I'm only showing you how to print the contents of one of the table's cells. You can experiment with the above code and modify it to get access to the other elements of the table as well.
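To read every cell rather than a single one, the same idea can be sketched in Python with only the standard library. This is an illustration, not a port of the VBA above: the `TableRows` class is a hypothetical helper, and `SAMPLE` is a trimmed copy of the response shown earlier instead of a live download.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0        # greater than 0 while inside the target table
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th") and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data.strip()


SAMPLE = """
<table id="quotaTable" class="list">
<thead><tr class="columnHeader">
<th>Order number</th><th>Origins</th><th>Start date</th>
<th>End date</th><th>Balance</th><th/>
</tr></thead>
<tr class="oddRow">
<td>096714</td><td><div>Ukraine</div></td><td>01-01-2019</td>
<td>31-12-2019</td><td>0 Kilogram</td><td><a href="#">[More info]</a></td>
</tr>
</table>
"""

parser = TableRows("quotaTable")
parser.feed(SAMPLE)
header, first_row = parser.rows[0], parser.rows[1]
record = dict(zip(header, first_row))  # map each column name to its value
print(record["Balance"])
```

Pairing the header row with each data row gives a dictionary per record, which is easier to write to a worksheet or CSV file than cells addressed by position.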
I use Chrome, and there the result is stored in the source. I simply copy the HTML code into an online HTML-to-CSV converter:
Html to csv online editor
It works for me. If this does not solve your problem, please describe it in more detail.
I'm using scrapy and need to extract "Gray / Gray" using xpath selectors.
Here's the html snippet:
<div class="Vehicle-Overview">
<div class="Txt-YMM">
2006 GMC Sierra 1500
</div>
<div class="Txt-Price">
Price : $8,499
</div>
<table width="100%" border="0" cellpadding="0" cellspacing="0"
class="Table-Specs">
<tr>
<td>
<strong>2006 GMC Sierra 1500 Crew Cab 143.5 WB 4WD
SLE</strong>
<strong class="text-right t-none"></strong>
</td>
</tr>
<tr>
<td>
<strong>Gray / Gray</strong><br />
<strong>209,123
Miles
/ VIN: XXXXXXXXXX
</td>
</tr>
</table>
I'm stuck trying to extract "Gray / Gray" within the "strong" tag. Any help is appreciated.
This XPath will work in Scrapy and also in Google/Firefox Developer's Console:
//div[@class='Vehicle-Overview']/table[@class='Table-Specs']//tr[2]/td[1]/strong[1]/text()
You can use this code in your spider:
color = response.xpath("//div[@class='Vehicle-Overview']/table[@class='Table-Specs']//tr[2]/td[1]/strong[1]/text()").extract_first()
You can use this XPath expression with your sample XML/HTML:
//div[@class='Vehicle-Overview']/table[@class='Table-Specs']/tr[2]/td[1]/strong[1]
A full XPath given the full file mentioned below with respect to a namespace "http://www.w3.org/1999/xhtml" can be
/html/body/div/div/div[@class='content-bg']/div/div/div[@class='Vehicle-Overview']/table[@class='Table-Specs']/tr[2]/td[1]/strong[1]
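The expression can be tried outside Scrapy as well. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()` step, so the text is read from the element instead). `SNIPPET` is a tidied, well-formed version of the question's HTML wrapped in a made-up `<root>` element, which is an assumption made so that the XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the question's snippet (the original has an
# unclosed <strong>, which an XML parser would reject).
SNIPPET = """
<root>
  <div class="Vehicle-Overview">
    <div class="Txt-YMM">2006 GMC Sierra 1500</div>
    <table class="Table-Specs">
      <tr>
        <td><strong>2006 GMC Sierra 1500 Crew Cab 143.5 WB 4WD SLE</strong></td>
      </tr>
      <tr>
        <td><strong>Gray / Gray</strong><br /><strong>209,123 Miles</strong></td>
      </tr>
    </table>
  </div>
</root>
"""

root = ET.fromstring(SNIPPET)

# Same path as above: second row, first cell, first <strong>
node = root.find(
    ".//div[@class='Vehicle-Overview']"
    "/table[@class='Table-Specs']/tr[2]/td[1]/strong[1]"
)
color = node.text.strip()
print(color)
```

On a real page, a lenient HTML parser such as the one behind Scrapy's selectors is the safer choice, since browsers and scraped markup rarely produce well-formed XML.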
I have a table like this:
Table
<!DOCTYPE html>
<html>
<body>
<table border="1" style="width:100%">
<tr>
<td>email</td>
<td>data</td>
</tr>
<tr>
<td>creator_a@creator.com</td>
<td>"vimeo_profile"=>"", "twitter_profile"=>"", "youtube_profile"=>"", "creator_category"=>"production_company", "facebook_profile"=>"", "linkedin_profile"=>"", "personal_website"=>"", "instagram_profile"=>"", "content_expertise_categories"=>"4,5,8"</td>
</tr>
<tr>
<td>creator_b@creator.com</td>
<td>"twitter_profile"=>"", "creator_category"=>"association", "facebook_profile"=>"", "linkedin_profile"=>"", "personal_website"=>"", "content_expertise_type"=>"image", "content_expertise_categories"=>"4, 6"</td>
</tr>
</table>
</body>
</html>
And I want to query this using PostgreSQL, so I only get the values regarding content_expertise_categories:
*It is important to mention that the number of values varies. The table has many more entries, so I am looking for a solution that extracts the values regardless of whether there are 2 or 15 of them.
Result
<!DOCTYPE html>
<html>
<body>
<table border="1" style="width:100%">
<tr>
<td>email</td>
<td>data</td>
</tr>
<tr>
<td>creator_a@creator.com</td>
<td>4,5,8</td>
</tr>
<tr>
<td>creator_b@creator.com</td>
<td>4,6</td>
</tr>
</table>
</body>
</html>
I have tried substring but can't make it work.
Some help would be much appreciated, thanks!
SELECT
email,
(string_to_array(
data::text,'"content_expertise_categories"=>'::text
)
)[2] as data
FROM users
;
Update:
In your example all strings have "content_expertise_categories" listed last, which suggests you can simply split the string into two pieces. If more key/value pairs can follow it, you'll need an additional split on ',"' and take the [1] part this time...
Mind casting the "data" column to ::text before passing it to the string_to_array function, as it requires the text type and your column appears to be of a different type.
I believe more elegant would be this query:
select
email,
data->'content_expertise_categories' as data
from h
;
But when I posted the first query, I did not know that you use hstore.
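If the column has to be handled on the client side instead, the same extraction can be sketched in Python. This is only an illustration of the key/value format shown in the question: the `categories` function and the `PAIR` pattern are made up for this example, and the pattern assumes that neither keys nor values contain escaped double quotes.

```python
import re

# One "key"=>"value" pair as it appears in the data column
PAIR = re.compile(r'"([^"]*)"=>"([^"]*)"')


def categories(data):
    """Return the content_expertise_categories value as a compact comma list."""
    pairs = dict(PAIR.findall(data))
    raw = pairs.get("content_expertise_categories", "")
    # Normalise "4, 6" to "4,6" and cope with any number of values
    return ",".join(part.strip() for part in raw.split(",") if part.strip())


row_a = ('"vimeo_profile"=>"", "creator_category"=>"production_company", '
         '"content_expertise_categories"=>"4,5,8"')
row_b = ('"twitter_profile"=>"", "content_expertise_type"=>"image", '
         '"content_expertise_categories"=>"4, 6"')

print(categories(row_a))
print(categories(row_b))
```

Unlike splitting the string in two, this lookup does not depend on the key being listed last, which mirrors what the hstore `->` operator does inside the database.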
I know this is really easy for some of you out there. But I have been digging deep on the internet and I cannot find an answer. I need to get the company name that is inside the <a> tag (tbody > tr > td > a), which is eBay-Tradera.com, and the value in the last <td class="bS aR"> of the row, which is 970,80.
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com
</a>
</td>
<td class="aR">
175</td>
<td class="bS aR">0</td><td class="bS aR">0</td><td class="bS aR">187</td>
<td class="aR">0,00%</td><td class="bS aR">124</td>
<td class="aR">0,00%</td>
<td class="bS aR">26</td>
<td class="aR">20,97%</td>
<td class="bS aR">32</td>
<td class="aR">60,80</td>
<td class="aR">25,81%</td>
<td class="bS aR">5 102,00</td>
<td class="bS aR">0,00</td>
<td class="aR">0,00</td>
<td class="bS aR">
970,80
</td>
</tr>
</tbody>
This is my code, where I only try to get the a tag to start off with, but I can't get that to work either:
Set TDelements = document.getElementById("matrix1_group0").document.getElementsbytagname("a").innerHTML
r = 0
C = 0
For Each TDelement In TDelements
Blad1.Range("A1").Offset(r, C).Value = TDelement.innerText
r = r + 1
Next
Thanks in advance. I know that this might be too simple, but I hope that other people have the same issue and that this will be helpful for them as well. The reason for the "r = r + 1" is that there are many more companies on this list. I just wanted to make it as easy as I could. Thanks again!
You will need to specify the element location in the table. Ebay seems to be obfuscating the class-names so we cannot rely on those being consistent. Nor would I usually rely on the elements by their table index being consistent but I don't see any way around this.
I am assuming that this is the HTML document you are searching
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com <!-- <=== You want this? -->
</a>
</td>
<!-- ... -->
</tr>
<!-- ... -->
</tbody>
We can ignore the rest of the document as the table element has an ID. In short, we assume that
.getElementById("matrix1_group0").getElementsByTagName("TR")
will return a collection of html row objects sorted by their appearance.
Set matrix = document.getElementById("matrix1_group0")
Set firstRow = matrix.getElementsByTagName("TR")(0) 'element collections are zero-based
Set firstRowSecondCell = firstRow.getElementsByTagName("TD")(1) 'the second cell holds the link
traderaName = firstRowSecondCell.innerText
Of course you could inline this all as
document.getElementById("matrix1_group0").getElementsByTagName("TR")(0).getElementsByTagName("TD")(1).innerText
but that would make debugging harder. Also if the web-page is ever presented to you in a different format then this won't work. Ebay is deliberately making it hard for you to scrape data off of it for security.
With only the HTML you have shown you can use CSS selectors to obtain these:
a[href*='aProgramInfoApplyRead.action?programId']
Which says a tag with attribute href that contains the string 'aProgramInfoApplyRead.action?programId'. This matches two elements but the first is the one you want.
VBA:
You can use .querySelector method of .document to retrieve the first match
Debug.Print ie.document.querySelector("a[href*='aProgramInfoApplyRead.action?programId']").innerText
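For comparison, here is a minimal Python sketch of the index-based approach applied to a shortened copy of the row from the question. The `CellText` class is a hypothetical helper built on the standard `html.parser` module. It assumes, as the answers above do, that the company name is always in the second cell and the wanted amount in the last cell of each row.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the stripped text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


# Shortened copy of the row in the question (most numeric cells left out)
SAMPLE = """
<tbody id="matrix1_group0">
<tr class="oR">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175" target="_blank">
eBay-Tradera.com
</a>
</td>
<td class="aR">
175</td>
<td class="bS aR">
970,80
</td>
</tr>
</tbody>
"""

parser = CellText()
parser.feed(SAMPLE)

# Indexes are zero-based: cell 1 is the second <td>, -1 is the last one
results = [(row[1], row[-1]) for row in parser.rows]
print(results)
```

Looping over all rows in this way handles the longer list of companies without changing the per-row logic.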
I have an ASP page that generates an XLS table; however, when I open the table, all the rows are stuffed into columns of the default width, which I would like to set myself.
My table looks something like this:
<table>
<thead>
'A for loop makes a series of th
</thead>
'another loop pulls db values
<tr><td>value1</td><td>value2</td> 'etc </tr>
</table>
I have tried the following to set the space
width="3.29in"
&nbsp; spam (barbaric but sometimes effective)
width="400px"
style="width:300px"
none of the above seem to work.
Additionally, here is my ASP header code in case it's relevant:
Response.Clear()
Response.Buffer = False
Response.ContentType = "application/vnd.ms-excel"
Response.AddHeader "Content-Disposition", "attachment; filename=blah.xls"
Also, on a side note, when I print a dollar value like this:
<td>$<%=dbvalue%></td>
the cell shows '$ followed by the value, and I am not sure how to get rid of the single quote.
Do you need the thead tag? Something like this should work:
<html>
<body>
<h1>Report Title</h1>
<table >
<tr>
<th style="width : 300px">header1</th>
<th style="width : 100px">header2</th>
<th style="width : 200px">header..</th>
<th style="width : 300px">....</th>
</tr>
<tr class="row1">
<td >value1</td>
<td >value2</td>
<td >value..</td>
<td >....</td>
</tr>
....
Optionally you can put the table row with th tags inside a <thead> tag
I am trying to write my Watir script to grab the following data (the table headers and the table row data), but I am having trouble figuring out how to access the table. (Once I get that, the rest is a piece of cake.)
Can anyone come up with something that will help me access the table? It doesn't have a name or an ID...
<div id="income">
<table class="tHe" cellspacing="0">
<thead>
<tr>
<th id="companyLabel" class="tFirst" style="width:30%"> Quarter Ending </th>
<th id="201004" class="tFirst right">Apr 10 </th>
<th id="201001" class="tFirst right">Jan 10 </th>
<th id="200910" class="tFirst right">Oct 09 </th>
<th id="200907" class="tFirst right">Jul 09 </th>
<th id="200904" class="tFirst right">Apr 09 </th>
</tr>
</thead>
<tbody id="revenueBody">
<tr>
<td class="indtr">Totals</dfn></td>
<td class="right">2849.00</td>
<td class="right">3177.00</td>
<td class="right">5950.00</td>
<td class="right">4451.00</td>
<td class="right">3351.00</td>
</tr>
...
ie.table(:class=>'tHe') should work if there's no other tables with the same class name
ie.table(:after?, ie.div(:id, 'income')) should work if there's no other div with id 'income'
or ie.table(:index=>0) - you would need to check your page to see what the correct index value for your table is.
But wait, there is more! :)
browser.div(:id => "income").table(:class => 'tHe')
browser.div(:id => "income").table(:index => 1)
...
There is also XPath if you are stuck.
If you fire up the page and access it through Firebug or your browser's native developer tools, you can find the xpath expression for the table and then plug that into the Watir API call.
I think it was in later versions of Watir 1.5.x that support for advanced page querying came in (basically your problem, where there are no ID tags). This page on the watir wiki should help:
Ways Available To Identify HTML Element
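The idea of reaching a table that has no id through its parent element or its class is not specific to Watir. As a rough illustration, the Python sketch below does the same thing with the limited XPath support in the standard `xml.etree.ElementTree` module. `PAGE` is a shortened, well-formed copy of the markup from the question (the stray `</dfn>` is removed), which is an assumption made so that the XML parser accepts it.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
<div id="income">
  <table class="tHe" cellspacing="0">
    <thead><tr>
      <th id="companyLabel" class="tFirst"> Quarter Ending </th>
      <th id="201004" class="tFirst right">Apr 10 </th>
      <th id="201001" class="tFirst right">Jan 10 </th>
    </tr></thead>
    <tbody id="revenueBody"><tr>
      <td class="indtr">Totals</td>
      <td class="right">2849.00</td>
      <td class="right">3177.00</td>
    </tr></tbody>
  </table>
</div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Locate the table by its parent div's id plus its own class
table = root.find(".//div[@id='income']/table[@class='tHe']")

headers = [th.text.strip() for th in table.findall("./thead/tr/th")]
totals = [td.text.strip() for td in table.findall("./tbody/tr/td")]
print(dict(zip(headers, totals)))
```

The path used here is also the kind of expression that a browser's developer tools will produce, so it can be adapted for the XPath option mentioned above.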