I am trying to build a scraper with Node.js that will let me extract news headlines from a large number of domains (they are all different, so I have to be as general as possible in my approach). At the moment I have a working implementation in Python which uses Beautiful Soup and a regex, allowing me to define a set of keywords and return headlines containing those keywords. Below is the relevant snippet of Python code:
for items in soup(text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords)))
To illustrate the expected output, let's assume there is a domain with news articles (below is an HTML snippet containing a headline):
<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/uk-52773032"><h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Time to end Clap for Carers, says founder</h3></a>
The expected output, given the keyword Time, would be a string containing the headline: Time to end Clap for Carers, says founder
My question is: is it possible to do a similar thing with cheerio? What would be the best approach to achieve the same results in Node.js?
EDIT: This works for me now. On top of matching headlines, I also wanted to extract the post URLs:
function match_headlines($) {
    const keywords = ['lockdown', 'quarantine'];
    // Match a capitalised string containing any of the keywords.
    // Note: no 'g' flag, so match.input is available on the result.
    const regexPattern = new RegExp(
        '\\b[A-Z].*?(' + keywords.join('|') + ').*\\b'
    );
    let matches = $('a').map((i, a) => {
        let link = $(a).attr('href');
        let match = $(a).text().match(regexPattern);
        if (match !== null) {
            return {
                headline: match.input,
                post_url: link
            };
        }
    }).get();
    return matches.filter((x) => x != null);
}
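For completeness, a usage sketch of that function (my addition, not part of the original edit); the sample markup and headline below are stand-ins for a really fetched page:
const cheerio = require('cheerio')

// Stand-in for HTML fetched over HTTP
const html = '<a href="/news/uk-123"><h3>New lockdown rules announced</h3></a>'
const $ = cheerio.load(html)

console.log(match_headlines($))
// [ { headline: 'New lockdown rules announced', post_url: '/news/uk-123' } ]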
Maybe something like this:
let re = new RegExp('\\b(' + keywords.join('|') + ')\\b')
let texts = $('a h3').map((i, el) => $(el).text()).get()
let titles = texts.filter(text => text.match(re))
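Applied to the HTML snippet from the question, a self-contained version of that approach might look like this (markup and keyword copied from the example above):
const cheerio = require('cheerio')

const html = '<a href="/news/uk-52773032">' +
    '<h3>Time to end Clap for Carers, says founder</h3></a>'
const keywords = ['Time']
const $ = cheerio.load(html)

let re = new RegExp('\\b(' + keywords.join('|') + ')\\b')
let texts = $('a h3').map((i, el) => $(el).text()).get()
let titles = texts.filter(text => text.match(re))
console.log(titles) // [ 'Time to end Clap for Carers, says founder' ]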
Related
I've tried to write a short program in Node.js that will calculate the euro exchange rate against the dollar.
As everyone knows, Google supplies this information when you search for a simple phrase like "dollar to euro".
So I found this code on GitHub:
var google = require('google')
google.resultsPerPage = 25
var nextCounter = 0

google('node.js best practices', function (err, res) {
    if (err) console.error(err)
    for (var i = 0; i < res.links.length; ++i) {
        var link = res.links[i];
        console.log(link.title + ' - ' + link.href)
        console.log(link.description + "\n")
    }
    if (nextCounter < 4) {
        nextCounter += 1
        if (res.next) res.next()
    }
})
(https://github.com/jprichardson/node-google)
This code prints out the first 100 search results of the query node.js best practices.
But I want to access the little square box on Google that holds the information that is important to me.
Unfortunately, the response didn't return this info.
Thank you!
Take a look at this issue: https://github.com/jprichardson/node-google/issues/10
Looks like you can access the body and $ (a cheerio instance) to get the "box" data from the scraped response. Try finding a valid HTML selector for this box (for instance, I saw that the currency exchange number element has an HTML id of knowledge-currency__tgt-amount, which suggests that each "box" will have its own selector).
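A rough sketch of that idea (untested): it assumes, per the linked issue, that the raw HTML of the scraped page is reachable from the response, guessed here as res.body, and that Google still renders the amount under the id mentioned above:
var google = require('google')
var cheerio = require('cheerio')

google('dollar to euro', function (err, res) {
    if (err) return console.error(err)
    // Assumption: the scraped page HTML is exposed as res.body (see issue #10)
    var $ = cheerio.load(res.body)
    // Assumption: the converted amount lives in this element of the currency "box"
    var amount = $('#knowledge-currency__tgt-amount').text().trim()
    console.log('1 dollar =', amount)
})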
I am trying to retrieve a list of elements using XPath, and from this list I want to retrieve a child element based on class name and click it.
var rowList = XPATH1 + className;
var titleList = className + innerHTMLofChildElement;

for (var i = 0; i < titleList.length; i++) {
    if (titleList[i][0] === title) {
        browser.click(titleList[i][0]); // I don't know what to do inside the click function
    }
}
I had an implementation similar to what you are trying to do, though mine is perhaps more complex because it uses CSS selectors rather than XPath. I'm certain it is not optimized and can most likely be improved upon.
It uses the methods elementIdText() and elementIdClick() from WebdriverIO to work with the text values of the JSON Web Elements, and then clicks the intended element after matching the one you're looking for.
http://webdriver.io/api/protocol/elementIdText.html
http://webdriver.io/api/protocol/elementIdClick.html
Step 1 - Find all the potential elements you want to work with:
// Elements query to return elements matching the selector (titles) as JSON Web Elements.
// Also `browser.elements('<selector>')`
var titles = browser.$$('<XPath or CSS selector>')
Step 2 - Cycle through the elements, stripping out the innerHTML or text values and pushing them into a separate array:
// Create an Array of Titles
var titlesTextArray = [];
titles.forEach(function (elem) {
    // Push all found elements' text values (titles) to the titlesTextArray
    titlesTextArray.push(browser.elementIdText(elem.value.ELEMENT))
})
Step 3 - Cycle through the array of title text values to find what you're looking for, then use the elementIdClick() function to click the desired element:
// Loop through titlesTextArray looking for text matching the desired title.
for (var i = 0; i < titlesTextArray.length; i++) {
    if (titlesTextArray[i].value === title) {
        // Found a match - click the corresponding element
        // that it belongs to, found above in titles
        browser.elementIdClick(titles[i].value.ELEMENT)
    }
}
I wrapped all of this into a function to which I provided the intended text (in your case, a particular title) to search for; a sketch of that wrapper follows. Hope this helps!
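A minimal sketch of such a wrapper, assuming WebdriverIO's synchronous mode and the same element-JSON shape used in the steps above; the function name and parameters are illustrative, not from the original answer:
// Illustrative wrapper: click the first element whose text matches `title`
function clickByTitle(selector, title) {
    var titles = browser.$$(selector)
    var titlesTextArray = []
    titles.forEach(function (elem) {
        titlesTextArray.push(browser.elementIdText(elem.value.ELEMENT))
    })
    for (var i = 0; i < titlesTextArray.length; i++) {
        if (titlesTextArray[i].value === title) {
            browser.elementIdClick(titles[i].value.ELEMENT)
            return true // clicked a match
        }
    }
    return false // no matching title found
}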
I don't know Node.js, but in Java you should be able to achieve your goal with:
titleList[i].findElement(By.className("classToFind"))
assuming titleList[i] is an element of the list you want to get child elements from.
I have a string with HTML markup, and I want to create a Doc element from it, like this:
Doc.FromHtm "<div><p>.....</p>.....</div>"
As I understand it, this is not possible right now. OK: what cannot be sewn accurately, I tried to nail together roughly, using jQuery:
JQuery.JQuery.Of( "." + class'name ).First().Html(html'content)
But to call this code I need to attach an event handler to the Doc element, and that is not implemented in UI.Next.
So I tried to track changes of a model with a given CSS class asynchronously:
let inbox'post'after'render = MailboxProcessor.Start(fun agent ->
    let rec loop (ids'contents : Map<int,string>) : Async<unit> = async {
        // try to receive an event with a new portion of data
        let! new'ids'contents = agent.TryReceive 100
        // calculate the new state of the agent
        let ids'contents =
            // merge the Maps
            ( match new'ids'contents with
              | None -> ids'contents
              | Some (new'ids'contents) ->
                  new'ids'contents @ (Map.toList ids'contents)
                  |> Map.ofList )
            |> Map.filter( fun id content ->
                // calculate the CSS class name
                let class'name = post'text'view'class'name id
                // replace the element's HTML content; keep the entry
                // if the element has not been rendered yet
                JQuery.JQuery.Of( "." + class'name ).First().Html(content).Size() = 0 )
        // accept the new state of the agent
        return! loop ids'contents }
    loop Map.empty )
and then, for example, for one element:
inbox'post'after'render.Post [id, content]
But this is too complicated, unreliable, and not really working.
Please give me an idea of how to solve the problem, if possible. Thanks in advance!
Just in case someone needs to use static HTML in WebSharper on the server (I needed to add some JavaScript to the WebSharper-generated HTML page), there is a fairly new Doc.Verbatim, usable e.g. like this:
let Main ctx action title body =
Content.Page(
MainTemplate.Doc(
title = title,
menubar = MenuBar ctx action,
body = body,
my_scripts = [ Doc.Verbatim JavaScript.Content ]
)
)
Already answered this on https://github.com/intellifactory/websharper.ui.next/issues/22 but copying my answer here:
If all you want is to create a UI.Next Doc from static html you can do the following:
let htmlElem = JQuery.JQuery.Of("<div>hello</div>").Get(0)
let doc = Doc.Static (htmlElem :?> _)
This is not very nice, but it should work, and I don't think there's a better way to do it at the moment. Or maybe you could use templating, but that's not documented yet and I'm not sure it would fit your use case.
Obviously if you want to change the rendered element dynamically you can make it a Var and use Doc.EmbedView together with Doc.Static.
I have StringFilters bound to my table of data.
What I would like from the StringFilter is to be able to search the way you would with queries.
There is a column with names (one name per row), for example: "Steve", "Monica", "Andreas", "Michael", "Steve", "Andreas", ...
I want to get the rows with both Monica and Steve from the StringFilter.
I would like to be able to search like this
Steve+Monica
or
"Steve"+"Monica"
or
"Steve","Monica"
This is one of my StringFilters:
var stringFilter1 = new google.visualization.ControlWrapper({
    controlType: 'StringFilter',
    containerId: 'string_filter_div_1',
    options: {
        filterColumnIndex: 0,
        matchType: 'any'
    }
});
I had a similar problem, and I ended up creating my own function for filtering my rows. I made an example with the behaviour you describe; I'm not sure it's the best or the right way, but it works.
One flaw is that you need to type in the names exactly as they are entered: the whole name, with capital letters.
In the Fiddle, try adding multiple names separated by a "+" (and no spaces).
The function I added looks like this:
function redrawChart(filterString) {
    var filterWords = filterString.split("+")
    var rows = []
    for (var i = 0; i < filterWords.length; i++) {
        rows = rows.concat(data.getFilteredRows([{ value: filterWords[i], column: 0 }]))
    }
    return rows
}
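One caveat with the concat approach (my note, not from the original answer): getFilteredRows is called once per typed name, so a row matching more than one name would have its index added twice. A quick dedupe before returning rows would guard against that:
// Drop duplicate row indices so each matching row is drawn only once
rows = rows.filter(function (rowIndex, i) {
    return rows.indexOf(rowIndex) === i
})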
And the listener that listens for updates to your string input looks like this:
google.visualization.events.addListener(control, 'statechange', function () {
    if (control.getState().value == '') {
        realChart.setView({ 'rows': null })
    } else {
        realChart.setView({ 'rows': redrawChart(control.getState().value) })
    }
    realChart.draw();
});
Probably not a complete solution, but maybe it offers some new ideas and directions for your own thoughts.
How can I grab all the text on a website? I don't just mean Ctrl+A/C. I'd like to be able to extract all the text from a website (and all of its associated pages) and use it to build a concordance of words from that site. Any ideas?
I was intrigued by this, so I've written the first part of a solution.
The code is written in PHP because of the convenient strip_tags function. It's also rough and procedural, but I feel it demonstrates my ideas.
<?php
$url = "http://www.stackoverflow.com";
//To use this you'll need to get a key for the Readability Parser API http://readability.com/developers/api/parser
$token = "";
//I make an HTTP GET request to the Readability API and then decode the returned JSON
$parserResponse = json_decode(file_get_contents("http://www.readability.com/api/content/v1/parser?url=$url&token=$token"));
//I'm only interested in the content string in the JSON object
$content = $parserResponse->content;
//I strip the HTML tags from the article content
$wordsOnPage = strip_tags($content);
$wordCounter = array();
$wordSplit = explode(" ", $wordsOnPage);
//I then loop through each word in the article, keeping count of how many times I've seen it
foreach ($wordSplit as $word)
{
    incrementWordCounter($word);
}
//Then I sort the array so the most frequent words are at the end
asort($wordCounter);
//And dump the array
var_dump($wordCounter);

function incrementWordCounter($word)
{
    global $wordCounter;
    if (isset($wordCounter[$word]))
    {
        $wordCounter[$word] = $wordCounter[$word] + 1;
    }
    else
    {
        $wordCounter[$word] = 1;
    }
}
?>
I needed to do this to configure PHP for the SSL the Readability API uses.
The next step in the solution would be to search for links in the page and call this recursively, in an intelligent way, to handle the associated-pages requirement.
Also, the code above just gives the raw word-count data; you would want to process it some more to make it meaningful.
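Since most of this thread is about Node, here is a rough cheerio equivalent of the same idea as a sketch; it assumes node-fetch and cheerio are installed, skips the Readability API, and simply takes the document's text:
const fetch = require('node-fetch')
const cheerio = require('cheerio')

async function wordCount(url) {
    const html = await (await fetch(url)).text()
    const $ = cheerio.load(html)
    // Rough stand-in for strip_tags: cheerio's text() drops the markup
    // (note: unlike a browser, it also includes <script> contents)
    const words = $('body').text().split(/\s+/).filter(Boolean)
    const counter = {}
    for (const word of words) {
        counter[word] = (counter[word] || 0) + 1
    }
    // Sort ascending by frequency, like PHP's asort
    return Object.entries(counter).sort((a, b) => a[1] - b[1])
}

wordCount('http://www.stackoverflow.com').then(console.log)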