I'm trying to find specific text on a page. In my case I don't know the selectors, elements, parents, or anything else about the target's HTML; I just want to find out whether the page has a robots.txt, and I'm doing that by searching for 'User-agent:'.
Does anyone know how to search for specific text in the parsed document, without knowing any other piece of information about the page?
getApiTest = async () => {
  axios.get('http://webilizerr.com/robots.txt')
    .then(res => {
      const $ = cheerio.load(res.data)
      console.log($(this).text().trim() === 'User-agent:')
    }).catch(err => console.error(err))
};
Thanks for your time.
You can simply use a regular expression to check whether "User-agent" is part of the returned HTML.
Be aware: if the scraped page doesn't have a robots.txt file, it will normally return a 404 status code, and axios throws an error in that case. You should account for this in your catch statement.
Here is a working example:
const axios = require("axios");
const cheerio = require("cheerio");
const getApiTest = async () => {
try {
const res = await axios.get("https://www.finger.digital/robots.txt");
const $ = cheerio.load(res.data);
const userAgentRegExp = new RegExp(/User-agent/g);
const userAgentRegExpResult = userAgentRegExp.exec($.text());
if (!userAgentRegExpResult) {
console.log("Doesn't have robots.txt");
return;
}
console.log("Has robots.txt");
} catch (error) {
console.error(error);
console.log("Doesn't have robots.txt");
}
};
getApiTest();
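If you also want to distinguish a missing robots.txt (a 404) from other failures, you could inspect the error object axios throws, since axios attaches the HTTP response to it. A minimal sketch along those lines (hasRobotsTxt and the baseUrl parameter are just illustrative names, not part of the code above):

const axios = require("axios");

// Sketch: separate "no robots.txt" (HTTP 404) from other failures.
// For HTTP errors axios sets error.response, so its status can be checked.
const hasRobotsTxt = async (baseUrl) => {
  try {
    const res = await axios.get(`${baseUrl}/robots.txt`);
    return /User-agent/i.test(res.data);
  } catch (error) {
    if (error.response && error.response.status === 404) {
      return false; // the site simply has no robots.txt
    }
    throw error; // network problem or some other HTTP error
  }
};

hasRobotsTxt("https://www.finger.digital")
  .then(result => console.log(result ? "Has robots.txt" : "Doesn't have robots.txt"))
  .catch(err => console.error(err));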
I am using the following function to crawl entire websites, extracting all links from each page using cheerio.
const crawlUrl = async (urlCrawl) => {
  try {
    // Check if url was already crawled
    if (urlsCrawleadas[urlCrawl]) return;
    urlsCrawleadas[urlCrawl] = true;

    console.log('Crawling', urlCrawl)

    const response = await fetch(urlCrawl)
    const html = await response.text()

    // Extract links from every page
    const $ = cheerio.load(html)
    const links = $("a").map((i, link) => link.attribs.href).get()
    const { host } = urlParser.parse(urlCrawl)

    // Filter links from same website, and external sites
    const newLinks = links
      .map(link => {
        if (!link.includes(host.hostname) && link.includes('http://') || !link.includes(host.hostname) && link.includes('https://')) {
          const nueva = urlParser.parse(link)
          othersUrls.push(nueva.host.hostname)
        }
        return link
      })
      .filter(link => link.includes(host.hostname))
      .map(async link => await crawlUrl(link))

    // Remove duplicate urls
    const test = [...othersUrls.reduce((map, obj) => map.set(obj, obj), new Map()).values()]

    // Export urls
    let writer = FS.createWriteStream('data.txt')
    writer.write(test.toString())

    return 'End'
  } catch (error) {
    console.log('function crawler', error)
  }
}
The problem is that I cannot tell when the crawler has finished going through all the URLs, so I have no way to notify the front end that the full task is done.
Regards
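One way to know when the whole crawl is done is to await the recursive calls themselves, for example by collecting them with Promise.all, so the promise returned by the outermost crawlUrl only resolves once every reachable URL has been visited. A rough sketch of that idea (not the code above; it assumes Node 18+ for the global fetch, and the names are illustrative):

const cheerio = require("cheerio");

const crawled = new Set();

const crawlUrl = async (url) => {
  if (crawled.has(url)) return;
  crawled.add(url);

  const response = await fetch(url);
  const html = await response.text();
  const $ = cheerio.load(html);

  // Keep only absolute links that stay on the same host.
  const { hostname } = new URL(url);
  const links = $("a")
    .map((i, el) => $(el).attr("href"))
    .get()
    .filter(link => link && link.startsWith("http") && link.includes(hostname));

  // Await every child crawl before resolving this one.
  await Promise.all(links.map(link => crawlUrl(link)));
};

// Because the recursion is awaited, this only runs once the whole site is crawled,
// which is the point where the front end could be notified.
crawlUrl("https://example.com").then(() => console.log("Crawl finished"));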
As an exercise, I'm creating a simple API that allows users to provide a search term to retrieve links to appropriate news articles across a collection of resources. The relevant function and the route handler that uses it are as follows:
function GetArticles(searchTerm) {
  const articles = [];

  // Loop through each resource
  resources.forEach(async resource => {
    const result = await axios.get(resource.address);
    const html = result.data;

    // Use Cheerio: load the html document and create Cheerio selector/API
    const $ = cheerio.load(html);

    // Filter html to retrieve appropriate links
    $(`a:contains(${searchTerm})`, html).each((i, el) => {
      const title = $(el).text();
      let url = $(el).attr('href');

      articles.push({
        title: title,
        url: url,
        source: resource.name
      });
    })
  })

  return articles; // Empty array is returned
}
And the route handler that uses the function:
app.get('/news/:searchTerm', async (req, res) => {
  const searchTerm = req.params.searchTerm;
  const articles = await GetArticles(searchTerm);
  res.json(articles);
})
The problem I'm getting is that the returned "articles" array is empty. However, if I don't "loop through each resource" as commented at the beginning of GetArticles, but instead perform the main logic on just a single "resource", "articles" is returned with the requested data and is not empty. In other words, if the function is the following:
async function GetArticles(searchTerm) {
  const articles = [];

  const result = await axios.get(resources[0].address);
  const html = result.data;
  const $ = cheerio.load(html);

  $(`a:contains(${searchTerm})`, html).each((i, el) => {
    const title = $(el).text();
    let url = $(el).attr('href');

    articles.push({
      title: title,
      url: url,
      source: resources[0].name
    });
  })

  return articles; // Populated array
}
Then "articles" is not empty, as intended.
I'm sure this has to do with how I'm dealing with the asynchronous nature of the code. I've tried refreshing my knowledge of asynchronous programming in JS but I still can't quite fix the function. Clearly, the "articles" array is being returned before it's populated, but how?
Could someone please help explain why my GetArticles function works with a single "resource" but not when looping over an array of "resources"?
Try this
function GetArticles(searchTerm) {
  return Promise.all(resources.map(resource => axios.get(resource.address)))
    .then(responses => responses.flatMap((result, index) => {
      const html = result.data;

      // Use Cheerio: load the html document and create Cheerio selector/API
      const $ = cheerio.load(html);

      let articles = []

      // Filter html to retrieve appropriate links
      $(`a:contains(${searchTerm})`, html).each((i, el) => {
        const title = $(el).text();
        let url = $(el).attr('href');

        articles.push({
          title: title,
          url: url,
          // Promise.all keeps the order of its inputs, so index maps back to the resource
          source: resources[index].name
        });
      })

      return articles;
    }))
}
The problem in your implementation was here
resources.forEach(async resource...
You defined the callback as async, but when resources.forEach executes it launches those async callbacks without waiting for them to finish.
So your array will always be empty at the moment it is returned.
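To make that concrete, here is a minimal standalone sketch (the delays and values are just illustrative) contrasting forEach with a loop that actually waits:

const wait = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// forEach fires the async callbacks and moves on immediately:
// "forEach done" is logged before any item has finished.
[1, 2, 3].forEach(async (n) => {
  await wait(100);
  console.log("forEach item", n);
});
console.log("forEach done (too early)");

// A for...of loop inside an async function (or Promise.all over a map)
// really waits for each item before continuing.
(async () => {
  for (const n of [1, 2, 3]) {
    await wait(100);
    console.log("for...of item", n);
  }
  console.log("for...of done (after all items)");
})();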
I am trying to create a search function that auto-populates my form with the corresponding record from my database, using a GET API.
Here is my route
//Get number and dets to page
router.get("/sop/:id", function (request, response, next) {
response.render("test", { output: request.params.id });
});
//Elements i want to print my results to
const sopN = document.getElementById("sop");
const CustName = document.getElementById("cusName");
const button = document.getElementById("subBtn");
button.onclick = function (e) {
e.preventDefault();
const url = "http://localhost:6600/api/sopId/";
let sopSearch = document.getElementById("sop");
fetch(`${url}/${sopSearch.value}`, {
method: "GET",
})
.then((response) => response.json())
.then((json) => console.log(json));
};
The error I'm getting is:
"http://localhost:6600/api/sopId/(record)" 404 not found and uncaught in promise Syntax error: unexpected token < in JSON at position 0
Please, any and all assistance will be appreciated.
There's a problem with your URL.
Your route is "/sop/:id", so something like "http://localhost:6600/api/sop/1", but your URL is "http://localhost:6600/api/sopId/" to which you're appending sopSearch.value turning it into "http://localhost:6600/api/sopId/1".
So change this const url = "http://localhost:6600/api/sopId/"; to const url = "http://localhost:6600/api/sop/";
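For completeness, a small sketch of the corrected call, reusing the sopSearch element from the question; checking response.ok before parsing also explains the "Unexpected token <" part of the error, which comes from trying to parse an HTML 404 page as JSON:

const url = "http://localhost:6600/api/sop/"; // matches the /sop/:id route

fetch(`${url}${sopSearch.value}`, { method: "GET" })
  .then((response) => {
    if (!response.ok) {
      // A 404 returns an HTML error page, which is what produced
      // "Unexpected token < in JSON at position 0" when parsed as JSON.
      throw new Error(`Request failed with status ${response.status}`);
    }
    return response.json();
  })
  .then((json) => console.log(json))
  .catch((err) => console.error(err));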
I am trying to scrape a column from the table located on Wikipedia. I am trying to get the first column, Symbol, and use those symbols in an array in NodeJS. I attempted to scrape only this column using Cheerio and Axios. For some reason, after running the function I do not get any syntax errors but I also do not get any result after execution. I'm not sure if the elements I have loaded are correct or not, but any advice on how I can scrape the Symbol column into an array would be helpful. Below is my code:
const express = require('express');
const app = express();
const http = require('http');
const server = http.createServer(app);
const cheerio = require('cheerio');
const axios = require("axios");

async function read_fortune_500() {
  try {
    const url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    const { data } = await axios({
      method: "GET",
      url: url,
    })

    const $ = cheerio.load(data)
    const elemSelector = '#constituents > tbody > tr:nth-child(1) > td'

    $(elemSelector).each((parentIndex, parentElem) => {
      if (parentIndex <= 9) {
        $(parentElem).children().each((childIndex, childElem) => {
          console.log($(childElem).text())
        })
      }
    })
  } catch (err) {
    console.error(err)
  }
}

read_fortune_500()
Result
[Finished in 1.238s]
To help with your original issue:
For some reason, after running the function I do not get any syntax
errors but I also do not get any result after execution.
The reason for this is that you are calling an async function in JavaScript. Because read_fortune_500 has the async keyword, you need to 'wait' for this function to complete. In the JavaScript world, read_fortune_500 actually returns a promise and you need to wait until that promise resolves. You can do that in a couple of ways:
The easiest way to handle this is to wrap your function call inside an IIFE:
(async () => {
  await read_fortune_500()
})();
In newer versions of Node (inside ES modules) you can use await at the top level without the wrapper, but hopefully that helps.
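For reference, a minimal sketch of the top-level await version; it assumes you are in an ES module (a .mjs file or "type": "module" in package.json) and that the function above is exported from its own module, which is hypothetical here:

// main.mjs — ES modules allow top-level await, so no IIFE wrapper is needed.
import { read_fortune_500 } from "./read_fortune_500.mjs"; // hypothetical module exporting the function above

await read_fortune_500();
console.log("done");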
For the second issue, getting a list of symbols, you need to update the query selector you are using:
const $ = cheerio.load(data)
const elements = $("#constituents > tbody > tr > td:nth-child(1)")
elements.each((parentIndex, parentElem) => {
The CSS selector is slightly different but the selector above tells cheerio to look inside each table row in the DOM and then select the first column in that row.
Full working code below:
const cheerio = require('cheerio');
const axios = require("axios");

async function read_fortune_500() {
  try {
    const url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    const { data } = await axios({
      method: "GET",
      url: url,
    })

    const $ = cheerio.load(data)
    const elements = $("#constituents > tbody > tr > td:nth-child(1)")

    elements.each((parentIndex, parentElem) => {
      if (parentIndex <= 9) {
        $(parentElem).children().each((childIndex, childElem) => {
          console.log($(childElem).text())
        })
      }
    })
  } catch (err) {
    console.error(err)
  }
}

(async () => {
  await read_fortune_500()
})();
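Since the original goal was to end up with the symbols in an array, one possible follow-up (a sketch using the same selector; read_sp500_symbols is just an illustrative name, and it depends on the current Wikipedia markup) is to map over the matched cells instead of logging them:

const cheerio = require('cheerio');
const axios = require("axios");

async function read_sp500_symbols() {
  const url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
  const { data } = await axios.get(url)
  const $ = cheerio.load(data)

  // The first <td> of every row in the constituents table holds the symbol.
  return $("#constituents > tbody > tr > td:nth-child(1)")
    .map((i, el) => $(el).text().trim())
    .get() // converts the cheerio collection into a plain array
}

(async () => {
  const symbols = await read_sp500_symbols()
  console.log(symbols.length, 'symbols scraped')
})();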
I'm trying to crawl several web pages to check for broken links and write the results to JSON files; however, after the first file is completed the app crashes with no error popping up...
I'm using Puppeteer to crawl, Bluebird to run each link concurrently and fs to write the files.
WHAT I'VE TRIED:
Switching the file type to '.txt' or '.php'. This works, but I need to create another loop outside the current workflow to convert the files from '.txt' to '.json'. Renaming the file right after writing to it also causes the app to crash.
Using try/catch statements around fs.writeFile, but it never throws an error.
Running the entire app outside of Express. This worked at some point, but I'm trying to use it within the framework.
const express = require('express');
const router = express.Router();
const puppeteer = require('puppeteer');
const bluebird = require("bluebird");
const fs = require('fs');

router.get('/', function(req, res, next) {
  (async () => {
    // Our (multiple) URLs.
    const urls = ['https://www.testing.com/allergy-test/', 'https://www.testing.com/genetic-testing/'];

    const withBrowser = async (fn) => {
      const browser = await puppeteer.launch();
      try {
        return await fn(browser);
      } finally {
        await browser.close();
      }
    }

    const withPage = (browser) => async (fn) => {
      const page = await browser.newPage();

      // Turns request interceptor on.
      await page.setRequestInterception(true);

      // Ignore all the asset requests, just get the document.
      page.on('request', request => {
        if (request.resourceType() === 'document') {
          request.continue();
        } else {
          request.abort();
        }
      });

      try {
        return await fn(page);
      } finally {
        await page.close();
      }
    }

    const results = await withBrowser(async (browser) => {
      return bluebird.map(urls, async (url) => {
        return withPage(browser)(async (page) => {
          await page.goto(url, {
            waitUntil: 'domcontentloaded',
            timeout: 0 // Removes timeout.
          });

          // Search for urls we want to "crawl".
          const hrefs = await page.$$eval('a[href^="https://www.testing.com/"]', as => as.map(a => a.href));

          // Predefine our arrays.
          let links = [];
          let redirect = [];

          // Loops through each /goto/ url on page
          for (const href of Object.entries(hrefs)) {
            const response = await page.goto(href[1], {
              waitUntil: 'domcontentloaded',
              timeout: 0 // Remove timeout.
            });

            const chain = response.request().redirectChain();

            const link = {
              'source_url': href[1],
              'status': response.status(),
              'final_url': response.url(),
              'redirect_count': chain.length,
            };

            // Loops through the redirect chain for each href.
            for (const ch of chain) {
              redirect = {
                status: ch.response().status(),
                url: ch.url(),
              };
            }

            // Push all info of target link into links
            links.push(link);
          }

          // JSONify the data.
          const linksJson = JSON.stringify(links);
          let fileName = url.replace('https://www.testing.com/', '');
          fileName = fileName.replace(/[^a-zA-Z0-9\-]/g, '');

          // Write data to file in /tmp directory.
          fs.writeFile(`./tmp/${fileName}.json`, linksJson, (err) => {
            if (err) {
              return console.log(err);
            }
          });
        });
      }, { concurrency: 4 }); // How many pages to run at a time.
    });
  })();
});

module.exports = router;
module.exports = router;
UPDATE:
So there was nothing wrong with my code... I realized nodemon was stopping the process after each file was saved. Since nodemon detected a "file change", it kept restarting my server after the first item.
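For anyone who hits the same thing: nodemon can be told to ignore the output directory so that writing the JSON files doesn't trigger a restart, either with nodemon --ignore 'tmp/*' on the command line or with a nodemon.json next to package.json. A minimal sketch of the config, assuming the ./tmp directory used above:

{
  "ignore": ["tmp/*"]
}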