Scrape single column from table on Wikipedia using NodeJS Cheerio

I am trying to scrape a column from a table on Wikipedia: the first column, Symbol, whose values I want to collect into an array in NodeJS. I attempted to scrape only this column using Cheerio and Axios. After running the function I get no syntax errors, but I also get no result. I'm not sure whether the elements I have loaded are correct, so any advice on how I can scrape the Symbol column into an array would be helpful. Below is my code:
const express = require('express');
const app = express();
const http = require('http');
const server = http.createServer(app);
const cheerio = require('cheerio');
const axios = require("axios");

async function read_fortune_500() {
  try {
    const url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    const { data } = await axios({
      method: "GET",
      url: url,
    })
    const $ = cheerio.load(data)
    const elemSelector = '#constituents > tbody > tr:nth-child(1) > td'
    $(elemSelector).each((parentIndex, parentElem) => {
      if (parentIndex <= 9) {
        $(parentElem).children().each((childIndex, childElem) => {
          console.log($(childElem).text())
        })
      }
    })
  } catch (err) {
    console.error(err)
  }
}
read_fortune_500()
Result
[Finished in 1.238s]

To help with your original issue:
For some reason, after running the function I do not get any syntax
errors but I also do not get any result after execution.
The reason for this is that you are calling an async function. Because read_fortune_500 has the async keyword, it actually returns a promise, and you need to wait until that promise resolves. You can do that in a couple of ways.
The easiest way to handle this is to wrap your function call inside an IIFE (immediately invoked function expression):
(async () => {
  await read_fortune_500()
})();
In newer versions of Node (14.8 and later, when using ES modules) you can use top-level await without the wrapper, but hopefully that helps.
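Another option, if you'd rather not wrap anything, is to chain .then() on the promise the function returns; a minimal sketch:

read_fortune_500()
  .then(() => console.log('done'))
  .catch((err) => console.error(err))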
For the second issue, getting the list of symbols, you need to update the query selector you are using:
const $ = cheerio.load(data)
const elements = $("#constituents > tbody > tr > td:nth-child(1)")
elements.each((parentIndex, parentElem) => {
The CSS selector is slightly different: instead of taking every cell of the first row (tr:nth-child(1) > td), the selector above tells Cheerio to look at every table row in the DOM and select only the first cell (td:nth-child(1)) of each row.
Full working code below:

const cheerio = require('cheerio');
const axios = require("axios");

async function read_fortune_500() {
  try {
    const url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    const { data } = await axios({
      method: "GET",
      url: url,
    })
    const $ = cheerio.load(data)
    const elements = $("#constituents > tbody > tr > td:nth-child(1)")
    elements.each((parentIndex, parentElem) => {
      if (parentIndex <= 9) {
        $(parentElem).children().each((childIndex, childElem) => {
          console.log($(childElem).text())
        })
      }
    })
  } catch (err) {
    console.error(err)
  }
}

(async () => {
  await read_fortune_500()
})();
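Since the original goal was to end up with the symbols in an array, here is a minimal variation on the code above that pushes each cell's text instead of logging it (the helper name get_symbols is mine, not from the answer):

const cheerio = require('cheerio');
const axios = require('axios');

async function get_symbols() {
  const { data } = await axios.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
  const $ = cheerio.load(data)
  const symbols = []
  // Same selector as above: the first cell of every row in the #constituents table
  $("#constituents > tbody > tr > td:nth-child(1)").each((i, elem) => {
    symbols.push($(elem).text().trim())
  })
  return symbols
}

(async () => {
  const symbols = await get_symbols()
  console.log(symbols.length, symbols.slice(0, 5))
})();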

Related

NodeJS, Cheerio. How to find text without knowing selectors?

I'm trying to find a specific text. In my case, I have no idea of the selectors, elements, parents, or anything else in the HTML code of the target. I'm just trying to find out if a page has a robots.txt, by searching for 'User-agent:'.
Does anyone know how to search for a specific text in the parsed page, without knowing any other piece of information about it?
getApiTest = async () => {
  axios.get('http://webilizerr.com/robots.txt')
    .then(res => {
      const $ = cheerio.load(res.data)
      console.log($(this).text().trim() === 'User-agent:')
    }).catch(err => console.error(err))
};
Thanks for your time.
You can simply use a regular expression to check whether "User-agent" is part of the returned content.
Be aware: if the scraped page doesn't have a robots.txt file and returns a 404 status code, which is normally the case, axios throws an error. You should account for this in your catch statement.
Here is a working example:
const axios = require("axios");
const cheerio = require("cheerio");

const getApiTest = async () => {
  try {
    const res = await axios.get("https://www.finger.digital/robots.txt");
    const $ = cheerio.load(res.data);
    const userAgentRegExp = new RegExp(/User-agent/g);
    const userAgentRegExpResult = userAgentRegExp.exec($.text());
    if (!userAgentRegExpResult) {
      console.log("Doesn't have robots.txt");
      return;
    }
    console.log("Has robots.txt");
  } catch (error) {
    console.error(error);
    console.log("Doesn't have robots.txt");
  }
};

getApiTest();
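Worth noting: robots.txt is plain text rather than HTML, so loading it into Cheerio isn't strictly necessary; a plain substring check on the response body does the same job. A minimal sketch (the helper name hasRobotsTxt and its origin parameter are mine, not from the answer):

const axios = require("axios");

const hasRobotsTxt = async (origin) => {
  try {
    const res = await axios.get(origin + "/robots.txt");
    // axios returns the plain-text body as a string here
    return typeof res.data === "string" && res.data.includes("User-agent");
  } catch (error) {
    return false; // a 404 (or any request error) is treated as 'no robots.txt'
  }
};

hasRobotsTxt("https://www.finger.digital").then(console.log);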

Firebase cloud function: http function returns null

Here is what I am trying to do:
1) I am introducing functionality to enable users to search for local restaurants.
2) I created an HTTP cloud function, so that when the client delivers a keyword, the function calls an external API to search for the keyword, fetches the responses, and delivers the results.
3) In doing #2, I need to make two separate URL requests and merge the results.
When I checked, the function does call the API, fetch the results and merge them without any issue. However, for some reason, it only returns null to the client.
Below is the code: could someone take a look and advise me on where I went wrong?
exports.restaurantSearch = functions.https.onCall((data, context) => {
  const request = data.request;
  const k = encodeURIComponent(request);
  const url1 = "an_url_to_call_the_external_API" + k;
  const url2 = "another_url_to_call_the_external_API" + k;
  const url_array = [url1, url2];
  const result_array = [];
  const info_array = [];
  url_array.forEach(url => {
    return fetch(url, { headers: { "Authorization": "API_KEY" } })
      .then(response => {
        return response.json()
      })
      .then(res => {
        result_array.push(res.documents);
        if (result_array.length === 2) {
          const new_result_array_2 = [...new Set((result_array))];
          new_result_array_2.forEach(nra => {
            info_array.push([nra.place_name, nra.address_name])
          })
          // info_array is not null at this point, but the below code only returns null when checked from the client
          return info_array;
        }
      })
      .catch(error => {
        console.log(error)
        return 'error';
      })
  })
});
Thanks a lot in advance!
You should use Promise.all() instead of running each promise (fetch request) separately in a forEach loop: onCall only sends the client whatever the handler itself returns, and a return statement inside a forEach callback is discarded, so the handler returns undefined and the client sees null. Also, the function doesn't return anything if result_array.length is not 2. There are only 2 requests here, but it's good to handle all possible cases, so add a return statement for when the condition is not satisfied. Try refactoring your code to this (I've used an async function):
exports.restaurantSearch = functions.https.onCall(async (data, context) => {
  // Do note the async ^^^^^
  const request = data.request;
  const k = encodeURIComponent(request);
  const url1 = "an_url_to_call_the_external_API" + k;
  const url2 = "another_url_to_call_the_external_API" + k;
  const url_array = [url1, url2];
  const responses = await Promise.all(url_array.map((url) => fetch(url, { headers: { "Authorization": "API_KEY" } })))
  const responses_array = await Promise.all(responses.map((response) => response.json()))
  console.log(responses_array)
  const result_array: any[] = responses_array.map((res) => res.documents)
  // Although this if statement is redundant if you will be running exactly 2 promises
  if (result_array.length === 2) {
    const new_result_array_2 = [...new Set((result_array))];
    const info_array = new_result_array_2.map(({ place_name, address_name }) => ({ place_name, address_name }))
    return { data: info_array }
  }
  return { error: "Array length incorrect" }
});
If you'll be running only 2 promises, the other option would be:
// Directly adding promises in Promise.all() instead of using map
const [res1, res2] = await Promise.all([fetch("url1"), fetch("url2")])
const [data1, data2] = await Promise.all([res1.json(), res2.json()])
Also check Fetch multiple links inside of forEach loop
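For completeness, this is roughly how the client calls the callable function and reads what the handler returned, using the Firebase v8 JS SDK (not part of the original answer; the keyword 'pizza' is just a placeholder):

const restaurantSearch = firebase.functions().httpsCallable('restaurantSearch');

restaurantSearch({ request: 'pizza' })
  .then((result) => {
    // result.data is whatever the handler returned, e.g. { data: [...] } or { error: ... }
    console.log(result.data);
  })
  .catch((error) => console.error(error));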

Can't get text from a div

I want to get the content of the div mw-content-text from some Wikipedia pages (these are just examples to learn Node.js). I have made this:
var fetch = require('node-fetch');
var cheerio = require('cheerio');
var fs = require('fs');

var vv = [
  'https://en.wikipedia.org/wiki/Ben_Silbermann',
  'https://en.wikipedia.org/wiki/List_of_Internet_entrepreneurs'
];
var bo = [],
  $;

vv.forEach((t) => {
  fetch(t)
    .then(res => res.text())
    .then((body) => {
      $ = cheerio.load(body);
      var finded = $('#mw-content-text').text();
      bo.push(finded);
    });
});
console.log(bo);
If I output body, it is filled with a string containing the whole HTML page (so this step is OK).
If I output $, it contains a collection, but I'm not sure whether it's populated. (I use the Node.js command prompt to inspect it, but it doesn't look like the right tool; any advice on that too?)
Either way, the bo variable ends up as an empty array.
The issue here is that bo is logged before the fetch calls have completed. I'd suggest using the async/await syntax to ensure we wait for all the requests to return; then we can log the result.
You could follow up with some more processing, like removing empty lines, whitespace, etc., but that shouldn't be too hard (see the sketch after the code below).
var fetch = require('node-fetch');
var cheerio = require('cheerio');

var vv = [
  'https://en.wikipedia.org/wiki/Ben_Silbermann',
  'https://en.wikipedia.org/wiki/List_of_Internet_entrepreneurs'
];

async function getDivcontent() {
  const promises = vv.map(async t => {
    const body = await fetch(t).then(res => res.text());
    const $ = cheerio.load(body);
    return $('#mw-content-text').text();
  });
  return await Promise.all(promises);
}

async function test() {
  let result = await getDivcontent();
  console.log("Result:" + result);
}

test();
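For the clean-up step mentioned above, a minimal sketch (the helper name cleanText is mine, not from the answer):

function cleanText(text) {
  return text
    .split('\n')                      // work line by line
    .map(line => line.trim())        // strip leading/trailing whitespace
    .filter(line => line.length > 0) // drop empty lines
    .join('\n');
}

// e.g. inside getDivcontent: return cleanText($('#mw-content-text').text());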

Check if element is visible in DOM in Node.js

I would like to check if an element is visible in the DOM in Node.js. I use the jsdom library for getting the DOM structure. There are two approaches to checking an element's visibility in client-side JavaScript, but neither works with jsdom in Node.js:
1) the offsetParent property is always null, even for visible elements
2) dom.window.getComputedStyle(el).display returns block, even though the element's CSS rule is display: none
const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

request({ url: 'https://crooked.com/podcast-series/majority-54/', jar: true }, function (e, r, b) {
  const dom = new JSDOM(b);
  test(dom);
});

const test = (dom) => {
  const hiddenElement = dom.window.document.querySelector('.search-outer-lg');
  const visibleElement = dom.window.document.querySelector('.body-tag-inner');
  console.log(dom.window.getComputedStyle(hiddenElement).display); // block
  console.log(visibleElement.offsetParent); // null
}
Is this possible, or is there another way to check an element's visibility in the DOM in Node.js?
I tried puppeteer instead of jsdom and I got the correct display value. Here is the snippet:

const puppeteer = require('puppeteer');
const uri = 'https://crooked.com/podcast-series/majority-54/'; // the URL from the question

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(uri);
  const searchDiv = await page.evaluate(() => {
    const btn = document.querySelector('.search-outer-lg');
    return getComputedStyle(btn).display;
  });
  console.log(searchDiv)
  await browser.close()
})()
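Building on that snippet (inside the same async IIFE), here is a hedged sketch of a reusable check combining the two client-side approaches from the question; isVisible and the selector argument are my naming, not from the answer:

const isVisible = await page.evaluate((selector) => {
  const el = document.querySelector(selector);
  if (!el) return false;
  // Both checks behave correctly here because Puppeteer evaluates them in a real browser
  return el.offsetParent !== null && getComputedStyle(el).display !== 'none';
}, '.search-outer-lg');
console.log(isVisible); // false, since this element is hidden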
Trick method :)

function isHiddenElement(selector) {
  return (document.querySelector(selector).offsetParent === null)
}

if (isHiddenElement('.search-outer-lg')) {
  alert("element hidden");
}

Note that this relies on offsetParent, which (as the question points out) is always null under jsdom, so it only works in a real browser context.
Try it like this:

const a1 = dom.window.document.querySelector('.search-outer-lg');
const componentStyle = dom.window.getComputedStyle(a1)
componentStyle.getPropertyValue('display') // it will return 'none'

And for the offsetParent check:

const offsetParent = window.document.querySelector('.body-tag-inner').offsetParent

It returns the body, which has the classes archive tax-podcast_type term-majority-54 term-98.
I tried this in the browser console without using jsdom; if it doesn't work, tell me.

All my scraped text ends up in one big object instead of separate objects with Cheerio

I'm following a web scraping course that uses Cheerio. I'm practicing on a different website than the one used in the course, and now I run into the problem that all my scraped text ends up in one big object, while every title should end up in its own object. Can someone see what I did wrong? I've already bumped my head against this problem for 2 hours.
const request = require('request-promise');
const cheerio = require('cheerio');

const url = "https://huurgoed.nl/gehele-aanbod";
const scrapeResults = [];

async function scrapeHuurgoed() {
  try {
    const htmlResult = await request.get(url);
    const $ = await cheerio.load(htmlResult);
    $("div.aanbod").each((index, element) => {
      const result = $(element).children(".item");
      const title = result.find("h2").text().trim();
      const characteristics = result.find("h4").text();
      const scrapeResult = { title, characteristics };
      scrapeResults.push(scrapeResult);
    });
    console.log(scrapeResults);
  } catch (err) {
    console.error(err);
  }
}
scrapeHuurgoed();
This is the link to the repo: https://github.com/danielkroon/huurgoed-scraper/blob/master/index.js
Thanks!
That is because of the way you used the selectors: you iterated over div.aanbod, the container that holds all the listings, so find() collected the text of every h2 and every h4 at once. Iterating over the repeating div.item elements instead gives you one result object per listing. I've modified your script to fetch the content as you expected. Currently the script collects titles and characteristics; feel free to add the rest within your script.
This is how you can get the required output:
const request = require('request-promise');
const cheerio = require('cheerio');

const url = "https://huurgoed.nl/gehele-aanbod";
const scrapeResults = [];

async function scrapeHuurgoed() {
  try {
    const htmlResult = await request.get(url);
    const $ = cheerio.load(htmlResult);
    $("div.item").each((index, element) => {
      const title = $(element).find(".kenmerken > h2").text().trim();
      const characteristics = $(element).find("h4").text().trim();
      scrapeResults.push({ title, characteristics });
    });
    console.log(scrapeResults);
  } catch (err) {
    console.error(err);
  }
}
scrapeHuurgoed();
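If you want to persist the results instead of only logging them (this is not part of the original answer, just a minimal sketch using Node's built-in fs module):

const fs = require('fs');

// After the .each() loop, write the collected objects as formatted JSON
fs.writeFileSync('results.json', JSON.stringify(scrapeResults, null, 2));
console.log(`Saved ${scrapeResults.length} listings to results.json`);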
