Scrape background-images using X-Ray-Scraper - node.js

I've been using X-Ray to scrape websites, and it has been working really well. I can use it to bring in images very easily. The one problem I've run into is that I don't see an easy way to scrape a background image. Say I have a div with a style attribute that sets the background image URL; I'm not sure how to get the background-image URL out of it. I don't think I can just pass the CSS property to the featured-image selector, such as
.featured-image.attr('background-image');
const getWebsiteContent = async (blogURL, selector) => {
  try {
    // X-Ray reads attributes with the selector@attribute syntax
    return await x(blogURL, selector, [{
      slug: 'a@href',
      featuredImage: 'img@src'
    }])
      .paginate(`${pagi}@href`)
      .limit(200)
      .then((response) => {
        spinner.succeed('Got the data');
        return response;
      });
  } catch (error) {
    throw new Error('Cannot get Data from website, try checking your URL');
  }
};

For anyone who wants a solution to this with X-Ray, what I ended up doing is pulling the style attribute from the selector you pass into the object. Given HTML that looks like the following:
<div class="img" style="background-image: url('../path-to-img.jpg')"></div>
Instead of writing .img@src you can write .img@style, which returns the style attribute. From there you need a regex to strip everything that is not the URL of the image.
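The regex step could look something like the sketch below. It assumes the style string contains a single background or background-image declaration with a url(...) value; the helper name extractBackgroundImageUrl is made up for illustration and is not part of X-Ray.

```javascript
// Hypothetical helper: pulls the first url(...) value out of an inline style string.
function extractBackgroundImageUrl(style) {
  if (typeof style !== 'string') return null;
  // Match "background" or "background-image", then url(...) with optional quotes.
  const match = style.match(/background(?:-image)?\s*:[^;]*?url\(\s*(['"]?)(.*?)\1\s*\)/i);
  return match ? match[2] : null;
}

const style = "background-image: url('../path-to-img.jpg')";
console.log(extractBackgroundImageUrl(style)); // ../path-to-img.jpg
```

You would run this over each scraped style value after X-Ray returns its results.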

Related

Cheerio how to remove DOM elements from selection

I am trying to write a bot to convert a bunch of HTML pages to markdown, in order to import them as Jekyll documents. For this, I use puppeteer to get the HTML document, and cheerio to manipulate it.
The source HTML is pretty complex, and polluted with Google ADS tags, external scripts, etc. What I need to do, is to get the HTML content of a predefined selector, and then remove elements that match a predefined set of selectors from it in order to get a plain HTML with just the text and convert it to markdown.
Assume the source html is something like this:
<html>
<head />
<body>
<article class="post">
<h1>Title</h1>
<p>First paragraph.</p>
<script>That for some reason has been put here</script>
<p>Second paragraph.</p>
<ins>Google ADS</ins>
<p>Third paragraph.</p>
<div class="related">A block full of HTML and text</div>
<p>Forth paragraph.</p>
</article>
</body>
</html>
What I want to achieve is something like
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
<p>Forth paragraph.</p>
I defined an array of selectors that I want to strip from the source object:
stripFromText: ['.social-share', 'script', '.adv-in', '.postinfo', '.postauthor', '.widget', '.related', 'img', 'p:empty', 'div:empty', 'section:empty', 'ins'],
And wrote the following function:
const getHTMLContent = async ($, selector) => {
  let value;
  try {
    let content = await $(selector);
    for (const s of SELECTORS.stripFromText) {
      // 1
      content = await content.remove(s);
      // 2
      // await content.remove(s);
      // 3
      // content = await content.find(s).remove();
      // 4
      // await content.find(s).remove();
      // 5
      // const matches = await content.find(s);
      // for (m of matches) {
      //   await m.remove();
      // }
    };
    value = content.html();
  } catch(e) {
    console.log(`- [!] Unable to get ${selector}`);
  }
  console.log(value);
  return value;
};
Where
$ is the cheerio object, created with const $ = await cheerio.load(html);
selector is the DOM selector for the container (in the example above it would be .post)
What I am unable to do is use cheerio to remove() the objects. I tried all 5 versions I left commented in the code, but without success. Cheerio's documentation hasn't helped so far, and I just found this link but the proposed solution did not work for me.
I was wondering if someone more experienced with cheerio could point me in the right direction, or explain to me what I am missing here.
I found a classic newbie error in my code: I was missing an await before the .remove() call.
The working function now looks like this, and works:
const getHTMLContent = async ($, selector) => {
  let value;
  try {
    let content = await $(selector);
    for (const s of SELECTORS.stripFromText) {
      console.log(`--- Stripping ${s}`);
      await content.find(s).remove();
    };
    value = await content.html();
  } catch(e) {
    console.log(`- [!] Unable to get ${selector}`);
  }
  return value;
};
You can remove the elements with remove:
$('script,ins,div').remove()

Convert HTML page to PDF with CSS3 Support

I'm working on a small project whereby I create multiple CVs (or resumes) via an interface I've built in Vue + Laravel, which I can then export to PDF.
I'm having issues though when I export the PDF. Laravel DOMPDF doesn't let me have CSS3 properties inside the PDF, for example flex, or CSS variables. I believe PDFs only support CSS 2.0, but I have seen multiple PDFs being exported that are an exact carbon-copy of the website. For example, resume.io - when you create a CV via their site, they can export it and make it look exactly like the website version.
My question is: does anyone know of a library that I could use that ties into Vue or Laravel that will produce a carbon-copy of the website template into a PDF?
I have tried a few JS libraries that take screenshots of certain elements on the page, then try and fit them together, but it just doesn't work. I basically need a specific element on the page to be selectable and then saved to a PDF. Please see my example below:
As you can see, the white area is the CV preview, so I need that whole section saved to a PDF, minus the right hand side menu and the top-bar. I'm planning on building some really cool templates, but if I can't use modern CSS practices then it's going to be quite hard to make them into a PDF.
At the moment, I've got two views, the CV preview which you can see above, then another view which re-uses partials that are inserted in the PDF template. Obviously though, reusing the partials which have modern CSS applied then makes the PDF break or look broken.
My stack:
Laravel
Vue.js
TailwindCSS
Laravel-DOMPDF
If anyone could advise on the best way to go about this, I'd really appreciate it.
TIA
Since you didn't mention whether the page you're converting is in Vue or Blade, I'll explain both ways.
Here's the library; all you need to do is design a Blade view, then do something like this:
Route::get('/doc', function () {
//
$data = Marketers::all();
// LoadView with $data
$pdf = PDF::loadView('pdf',$data)->setPaper('A4');
// LoadView with Compact
$pdf = PDF::loadView('pdf',compact('data'))->setPaper('A4');
// Then Download it
return $pdf->download('pdf.pdf');
});
Now, in Vue you need jsPDF and html2canvas or html-to-image.
I used html-to-image because I had some character issues with the Persian language, so I'll base my help on the html-to-image library.
<template>
  <!-- Part you want to convert to PDF or ... -->
  <div ref="contentz" id="jsPdf">
    <!-- Contents -->
  </div>
</template>
downloadFull(t) {
  let self = this
  switch (t) {
    case 1: {
      // Landscape A4 page, measured in millimetres
      const doc = new jsPDF("l", "mm", "a4");
      htmlToImage.toCanvas(document.getElementById('jsPdf'))
        .then(function (canvas) {
          var img = canvas.toDataURL("image/jpeg");
          var width = doc.internal.pageSize.getWidth();
          var height = doc.internal.pageSize.getHeight();
          doc.addImage(img, 'JPEG', 0, 0, width, 0);
          doc.save('app.pdf');
        })
        .catch(function (error) {
          self.$notifications.failedNotificationOnGetData(self)
        });
      break;
    }
    case 2:
      htmlToImage.toJpeg(document.getElementById('jsPdf'), { quality: 1 })
        .then(function (dataUrl) {
          // Trigger the download through a temporary link
          var link = document.createElement('a');
          link.download = 'kalabala.jpeg';
          link.href = dataUrl;
          link.click();
        })
        .catch(function (error) {
          self.$notifications.failedNotificationOnGetData(self)
        });
      break;
    default:
      break;
  }
}
So here's my function downloadFull(t), where t is the file type. You might not need it, or you could simplify it to an if/else instead of a switch. First, import the libraries like this:
import jsPDF from 'jspdf';
import htmlToImage from 'html-to-image';
Then set the page dimensions for jsPDF, use html-to-image to get a canvas, read the page width and height into variables, add the canvas image to the doc, and save it. Both cases work the same way, but the first switch case produces a PDF and the second a JPEG.
If you go the first (Blade) way but want the download link in a Vue page, write the controller as I explained above; then, when you make the API call in Vue, set the response type to blob to download the file. Here's an example:
getPDF(type) {
axios({
url : '/api/api_name/exportPDF',
method: 'POST',
responseType: 'blob'
})
.then(res => {
const url = window.URL.createObjectURL(new Blob([res.data]));
const link = document.createElement('a');
link.href = url;
link.setAttribute('download', 'pdf.pdf');
document.body.appendChild(link);
link.click();
})
},
Good Luck.

How to fix this function, to retrieve data contain search keywords?

I'm developing a website using the MERN stack. In that application I want to provide a search facility. There is a text box with a search button next to it. I want to open the search results in a new page. For example,
the data table contains products. If a user types 'milk' and clicks search, a new page should show the details of all milk products.
When I click the search button, the browser URL is updated to this:
http://localhost:3000/search_result/this_is_search_string
This is the frontend code,
<form className="form-inline">
<input type="text"
className="form-control"
value={this.state.name}
onChange={this.on_change_name}/>
<Link to={'/search_result/'+this.state.name} className="btn btn-primary">Search</Link>
</form>
The first problem is that the URL is updated when I click the button, but it does not redirect to the new page.
In the search_result page, the componentDidMount() method looks like this:
componentDidMount(){
axios.get('http://localhost:4000/item/')
.then(response => {
this.setState({ search_result: response.data });
})
.catch(function (error) {
console.log(error);
})
}
I want to pass the value appended to the URL into the second line of the function above,
axios.get('http://localhost:4000/item/'), at the end of /item/, so I can load the retrieved values into the table.
Also, the method in the backend controller file looks like this:
export const get_items_by_name=(req,res)=>{
//Item.find(req.params.id,(err,item)=>{{name: "Renato"}
Item.find({name: "a"},(err,item)=>{
if(err){
res.send(err);
}
res.json(item);
});
};
In SQL there is a way to perform a select query with LIKE, but I'm new to the MERN stack. How could I update the method above to search for and retrieve values matching the string appended to the URL?
I need your help figuring this out. Thanks for your help, especially with implementation support.
To solve your problem, change your controller to something like this and try it:
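export const get_items_by_name = (req, res) => {
  // A RegExp object makes MongoDB do a partial match, similar to
  // SQL's LIKE '%term%'. A string that only looks like a regex
  // ('/.*term.*/') would be compared as literal text instead.
  Item.find({ name: new RegExp(req.body.name) }, (err, item) => {
    if (err) {
      // Return so the response is not sent twice
      return res.send(err);
    }
    res.json(item);
  });
};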
I tested this via Postman and it works. You still need to find a way to do the page rendering part.
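One thing to watch when building a pattern from user input is that characters like . or * keep their regex meaning. A minimal sketch of a safer filter builder follows, assuming MongoDB's $regex and $options query operators; the helper names escapeRegExp and buildNameFilter are hypothetical, and the case-insensitive flag is my own choice rather than something from the question.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds a filter that behaves roughly like SQL's LIKE '%term%'.
function buildNameFilter(term) {
  return { name: { $regex: escapeRegExp(String(term)), $options: 'i' } };
}

// The resulting object can be passed to a find() call as the query filter.
console.log(buildNameFilter('milk'));
```

In the controller this could be used as Item.find(buildNameFilter(req.params.name), ...), assuming the route declares a :name parameter.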

How to display an image with <img> from Mongoose using React front-end

Ultimate goal: have the user upload pictures (less than 16 MB, so no need to worry about GridFS), store each picture in my database (MongoDB, through Mongoose), and display the picture on the screen using the <img> tag.
To upload files I use Multer and add it to the database as follows:
newItem.picture.data = Buffer(fs.readFileSync(req.file.path), 'base64');
newItem.picture.contentType = 'image/png';
And it seems to be successfully added to the mongodb. Looks something like this:
how the image appears on mongodb
I'm able to send a GET request from my front-end and, when I console.log it, this is what I'm getting: Data after being retrieved from database. The question now is, how can I add it to an <img> tag and show the image on the screen? Thanks!
Edit: question has been marked as too broad by the moderators. Fair enough, I wasn't too sure how to approach it. Since I was able to solve it, this is what my front-end looks like.
componentDidMount() {
  const PATH = "http://localhost:8080/apii/items/getitems";
  axios.get(PATH)
    .then(res => {
      // Array of byte values from the serialized Buffer
      let picture64Bit = res.data[0].data.data
      // Convert the bytes back to a base64 string
      picture64Bit = Buffer.from(picture64Bit).toString('base64');
      this.setState({picture: picture64Bit})
    })
    .catch(err => console.log(err))
}
The key here is that res.data[0].data.data is equal to that random-looking list of numbers. I take that and convert it back to base64, so it appears exactly as it did in the first picture above from MongoDB. Then, displaying it inline in an img tag is very easy:
<img src = {`data:image/png;base64,${this.state.picture}`} />
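A minimal sketch of that conversion as a standalone function, in case it helps someone. It assumes the picture arrives either as a plain array of byte values or in the JSON form of a Node Buffer ({ type: 'Buffer', data: [...] }), and that Buffer is available (Node, or a bundler polyfill in the browser). The name toDataUri is hypothetical.

```javascript
// Hypothetical helper: turns a serialized Buffer or a plain array of byte
// values into a data URI that can be used as an <img> src.
function toDataUri(picture, contentType = 'image/png') {
  const bytes = Array.isArray(picture) ? picture : picture.data;
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${contentType};base64,${base64}`;
}

// The first four bytes of a PNG file, used here as stand-in data.
const sample = { type: 'Buffer', data: [137, 80, 78, 71] };
console.log(toDataUri(sample)); // data:image/png;base64,iVBORw==
```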
There are a couple libraries you could use, but I will arbitrarily select Axios for a demonstration. It sounds good if the images are already in Mongo DB.
Your objective is to get photos from the server to the client, so you need a function to get them on demand. You could also investigate fetch or request.
Axios: https://www.npmjs.com/package/axios
In React, try something like this
async getPhotos() {
const res = await Axios.get('/photos')
console.log('RESPONSE', res)
const photos = res.data
console.log('IMAGES', photos)
this.setState({ photos })
}
Here is a more complete example
import React, { Component } from 'react'
import Axios from 'axios'
class List extends Component {
  constructor(props) { // super props allows props to be available
    super(props)       // inside the constructor
    this.state = {
      photos: [], // Initialize empty list to assert existence as Array type
                  // and because we will retrieve a list of jpegs
      error: '',  // Initialize empty error display
    }
  }
  componentDidMount() {
    this.getPhotos() // Do network calls in componentDidMount
  }
  async getPhotos() {
    try {
      const res = await Axios.get('/photos')
      console.log('RESPONSE', res)
      const photos = res.data
      console.log('IMAGES', photos)
      this.setState({ photos, error: '' })
    } catch (e) {
      this.setState({ error: `BRUTAL FAILURE: ${e}` })
    }
  }
  render() {
    // Read both values from state before using them
    const { photos, error } = this.state
    if (error.length) {
      return (
        <div>{error}</div>
      )
    }
    if (!photos.length) {
      return (
        <div>No photos yet</div>
      )
    }
    // Assuming shape { id: 0, caption: 'Cats again', src: 'http://www.com/win.jpg' }
    // Make sure to include key prop when using map (for state management)
    return (
      <ul>
        {photos.map(photo => (
          <li key={photo.id} style={{ position: 'relative' }}>
            <span>{photo.caption}</span>
            <img src={photo.src} alt={photo.caption} />
            <div
              className="overlay"
              style={{
                position: 'absolute',
                width: '100%',
                height: '100%',
              }}
            />
          </li>
        ))}
      </ul>
    )
  }
}
Citation: In React.js should I make my initial network request in componentWillMount or componentDidMount?
If you want to fetch one more photo after, you should try to think immutably and replace the this.state.photos Array with a duplicate of itself plus the new image pushed onto the end of the array. We will use the spread operator for this to do a shallow copy on the existing photos Array. This will allow React to diff against the two states and efficiently update for the new entry.
const res = await Axios.get('/photo?id=1337')
const photo = res.data
this.setState({
  // Copy the existing photos and append the new one
  photos: [...this.state.photos, photo]
})
Note: the secret trick is to never call this.state.photos.push(photo). Treat mutating state directly like that as strictly off limits.
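To make the difference concrete, here is a small sketch outside of React. The helper name appendPhoto and the sample objects are made up for illustration; the point is only that the original array is left untouched and a new one is returned.

```javascript
// Hypothetical helper: returns a new array instead of mutating the old one.
function appendPhoto(photos, photo) {
  return [...photos, photo];
}

const before = [{ id: 1, caption: 'Cats' }];
const after = appendPhoto(before, { id: 2, caption: 'Cats again' });

console.log(before.length);    // 1
console.log(after.length);     // 2
console.log(before === after); // false
```

Inside a component this could be combined with the updater form of setState, for example this.setState(prev => ({ photos: appendPhoto(prev.photos, photo) })).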
In React, try to consider a way you can get an Object or Array. Once you have it in your mind, throw it into a Component's state. As you progress into Redux, you will end up storing items sometimes in the Redux store. That is too complex and unnecessary to describe now. The photos would be available perhaps as this.props.photos via the Redux Connect Function.
For most other times, a Component's state field is an excellent place to store anything of interest to a Component.
You can imagine it like a holder at the top of the Component.

Using Fragment to insert HTML rendered on the back end via dangerouslySetInnerHTML

I used to compile and insert JSX components via
<div key={ ID } dangerouslySetInnerHTML={ { __html: HTML } } />
which wrapped my HTML into a <div>:
<div>my html from the HTML object</div>
Now react > 16.2.0 has support for Fragments and I wonder if I can use that somehow to avoid wrapping my HTML in a <div> each time I get data from the back end.
Running
<Fragment key={ ID } dangerouslySetInnerHTML={ { __html: HTML } } />
will throw a warning
Warning: Invalid prop `dangerouslySetInnerHTML` supplied to `React.Fragment`. React.Fragment can only have `key` and `children` props.
in React.Fragment
Is this supported yet at all? Is there another way to solve this?
Update
Created an issue in the react repo for it if you want to upvote it.
Short Answer
Not possible:
key is the only attribute that can be passed to Fragment. In the
future, we may add support for additional attributes, such as event
handlers.
https://reactjs.org/docs/fragments.html
You may want to chime in and suggest this as a future addition.
https://github.com/facebook/react/issues
In the Meantime
You may want to consider using an HTML parsing library like:
https://github.com/remarkablemark/html-react-parser
Check out this example to see how it will accomplish your goal:
http://remarkablemark.org/blog/2016/10/07/dangerously-set-innerhtml-alternative/
In Short
You'll be able to do this:
<>
{require('html-react-parser')(
'<em>foo</em>'
)}
</>
Update December 2020
This issue (also mentioned by OP) was closed on Oct 2, 2019. However, stemming from the original issue, it seems a RawHTML component has entered the RFC process but has not reached production, and there is no set timeline for when a working solution may be available.
That being said, I would now like to allude to a solution I currently use to get around this issue.
In my case, dangerouslySetInnerHTML was utilized to render plain HTML for a user to download; it was not ideal to have additional wrapper tags included in the output.
After reading around the web and StackOverflow, it seemed most solutions mentioned using an external library like html-react-parser.
For this use-case, html-react-parser would not suffice because it converts HTML strings to React element(s). Meaning, it would strip all HTML that wasn't standard JSX.
Solution:
The code below is the no library solution I opted to use:
//HTML that will be set using dangerouslySetInnerHTML
const html = `<div>This is a div</div>`
The wrapper div within the RawHtml component is purposely named "unwanteddiv".
//Component that will return our dangerouslySetInnerHTML
//Note that we are using "unwanteddiv" as a wrapper
const RawHtml = () => {
return (
<unwanteddiv key={[]}
dangerouslySetInnerHTML={{
__html: html,
}}
/>
);
};
For the purpose of this example, we will use renderToStaticMarkup.
const staticHtml = ReactDomServer.renderToStaticMarkup(
<RawHtml/>
);
The ParseStaticHtml function is where the magic happens, here you will see why we named the wrapper div "unwanteddiv".
//The ParseStaticHtml function will check the staticHtml
//If the staticHtml type is 'string'
//We will remove "<unwanteddiv/>" leaving us with only the desired output
const ParseStaticHtml = (html) => {
if (typeof html === 'string') {
return html.replace(/<unwanteddiv>/g, '').replace(/<\/unwanteddiv>/g, '');
} else {
return html;
}
};
Now, if we pass the staticHtml through the ParseStaticHtml function you will see the desired output without the additional wrapper div:
console.log(ParseStaticHtml(staticHtml));
Additionally, I have created a codesandbox example that shows this in action.
Notice, the console log will throw a warning: "The tag <unwanteddiv> is unrecognized in this browser..." - However, this is fine because we intentionally gave it a unique name so we can easily differentiate and target the wrapper with our replace method and essentially remove it before output.
Besides, receiving a mild scolding from a code linter is not as bad as adding more dependencies for something that should be more simply implemented.
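If the wrapper ever carries attributes, or the same tag name could appear deeper in the markup, a variation that only strips the outermost pair may be safer. This is a sketch under the assumption that the string starts and ends with the wrapper tag and that the tag name is a plain word; the name stripOuterWrapper is hypothetical.

```javascript
// Hypothetical helper: removes one outer wrapper element, including any
// attributes on its opening tag, and leaves the inner HTML untouched.
function stripOuterWrapper(html, tagName) {
  const open = new RegExp(`^<${tagName}(\\s[^>]*)?>`);
  const close = new RegExp(`</${tagName}>$`);
  return html.replace(open, '').replace(close, '');
}

const wrapped = '<unwanteddiv><div>This is a div</div></unwanteddiv>';
console.log(stripOuterWrapper(wrapped, 'unwanteddiv')); // <div>This is a div</div>
```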
I found a workaround using React's ref:
import React, { FC, useEffect, useRef } from 'react'
interface RawHtmlProps {
html: string
}
const RawHtml: FC<RawHtmlProps> = ({ html }) => {
const ref = useRef<HTMLDivElement>(null)
useEffect(() => {
if (!ref.current) return
// make a js fragment element
const fragment = document.createDocumentFragment()
// move every child from our div to new fragment
while (ref.current.childNodes[0]) {
fragment.appendChild(ref.current.childNodes[0])
}
// and after all replace the div with fragment
ref.current.replaceWith(fragment)
}, [ref])
return <div ref={ref} dangerouslySetInnerHTML={{ __html: html }}></div>
}
export { RawHtml }
Here's a solution that works for <td> elements only:
type DangerousHtml = {__html:string}
function isHtml(x: any): x is DangerousHtml {
if(!x) return false;
if(typeof x !== 'object') return false;
const keys = Object.keys(x)
if(keys.length !== 1) return false;
return keys[0] === '__html'
}
const DangerousTD = forwardRef<HTMLTableCellElement,Override<React.ComponentPropsWithoutRef<'td'>,{children: ReactNode|DangerousHtml}>>(({children,...props}, ref) => {
if(isHtml(children)) {
return <td dangerouslySetInnerHTML={children} {...props} ref={ref}/>
}
return <td {...props} ref={ref}>{children}</td>
})
With a bit of work you can make this more generic, but that should give the general idea.
Usage:
<DangerousTD>{{__html: "<span>foo</span>"}}</DangerousTD>

Resources