I have written code to summarize text using NLP, and I am trying to make a Chrome extension that uses it to summarize the contents of a page. How would I access the contents of the page (all of the relevant text)? Does the Chrome extension API have something for this? I am very new to all of this, so I hope my question doesn't bother any of you.
Thank you!
You can inject a content script into the page. Content scripts have access to the page's DOM, so they can read all of its text.
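A minimal sketch of what such a content script might look like, assuming the page's visible text is good enough input for the summarizer. The file name and the function name collectPageText are made up for illustration; innerText and textContent are standard DOM properties.

```javascript
// content-script.js (hypothetical file name, listed under "content_scripts" in manifest.json)

// Return the page's visible text with runs of whitespace collapsed.
// `doc` is passed in so the function can be exercised without a browser.
function collectPageText(doc) {
  const body = doc && doc.body;
  if (!body) return "";
  // innerText skips hidden elements; fall back to textContent if it is missing.
  const raw = body.innerText || body.textContent || "";
  return raw.replace(/\s+/g, " ").trim();
}

// Only runs inside a page, where `document` exists.
if (typeof document !== "undefined") {
  const text = collectPageText(document);
  console.log(text.length + " characters of page text collected");
}
```

From there the text could be handed to the rest of the extension, for example with chrome.runtime.sendMessage, and passed to the summarizer.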
I'm trying to collect a list of "https://..." URLs and store them in a CSV file. I could do it manually, for example by copying the URLs from the website of interest and pasting them into Excel one by one, but that is tedious and would take a lot of time.
Can someone suggest a faster way?
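If you just need the addresses quickly from one page, you could run this JavaScript snippet in the console of your browser: Array.from(document.links).forEach(link => console.log(link.href)). This will output all of the links on that page. (document.links is an HTMLCollection, which has no forEach method of its own, hence the Array.from.)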
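Since the goal is a CSV file, the console approach could be extended along these lines. This is only a sketch: the function name linksToCsv and the single "url" column are assumptions, and the output still has to be copied from the console into a file.

```javascript
// Turn a list of URLs into CSV text: one header row, then one URL per row.
function linksToCsv(hrefs) {
  // Keep only https links and drop duplicates, preserving first-seen order.
  const unique = [...new Set(hrefs.filter((h) => h.startsWith("https://")))];
  // Quote every field and double any embedded quotes, per the usual CSV convention.
  const rows = unique.map((h) => '"' + h.replace(/"/g, '""') + '"');
  return ["url", ...rows].join("\n");
}

// In the browser console: document.links is an HTMLCollection, so convert it first.
if (typeof document !== "undefined") {
  const hrefs = Array.from(document.links, (link) => link.href);
  console.log(linksToCsv(hrefs));
}
```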
If you want to use Python to scrape the page, I would suggest taking a look at this question on Stack Overflow, which uses the Beautiful Soup library.
If content is loaded dynamically on the page with JavaScript, it's probably better to use something like Selenium; see the relevant Stack Overflow question.
When you create a news or blog section with a CMS, it's really easy to make a feed of posts with content previews. When you follow a link to a particular post, you can also see that it consists of various HTML tags and CSS styling, not just plain text, because the post was written in a rich text editor. So just fetching plain text from the database is not enough.
My question is how to achieve the same result when building a website from scratch. It doesn't matter which language is used for the back end; I'm just interested in the idea of how to do it. If you could provide code examples (in any language), it would be greatly appreciated.
OK, I've figured it out. Posting the answer for anybody who has a similar question in the future.
The idea is that you store the text together with its HTML tags in your database, and when you retrieve it you output it into the desired div unescaped. The catch is that almost all view (template) engines escape HTML tags by default, so to output raw HTML you have to use a built-in function specific to that view engine.
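To make the escaped/unescaped difference concrete, here is a small sketch in plain JavaScript. It does not use any particular template engine: escapeHtml only imitates what such engines typically do by default, and storedBody is a made-up post body.

```javascript
// What most template engines do by default: turn markup characters into entities.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical stored post body, as a rich text editor might produce it.
const storedBody = "<p>Hello <strong>world</strong></p>";

// Escaped: the browser would show the tags literally as text.
const escaped = '<div class="post">' + escapeHtml(storedBody) + "</div>";

// Unescaped: the browser renders the paragraph and the bold text.
const unescaped = '<div class="post">' + storedBody + "</div>";

console.log(escaped);
console.log(unescaped);
```

Because unescaped output renders whatever markup is stored, it is usually reserved for content written by trusted authors or passed through a sanitizer first.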
To get the article with its HTML tags into the database, you can either write raw HTML into the input field or attach a rich text editor to the input field. The rich text editor will generate the HTML for you.
I've researched it and found that this is exactly how CMSs work.
So there you have it. If you want to add something, feel free to do so.
We are currently building an application on the Google Cloud Platform that generates reports as Google Docs. For them, it is really important to have a table of contents ... with page numbers. I know this has been a feature request for a few years and that there are add-ons (Paragraph Styles +, which didn't work for us) that provide this, but we are considering building it ourselves. If anybody has a suggestion on how we could start, it would be a great help!
Thanks,
Best bet is to file a feature request on the product forums.
Currently the only way to do that level of manipulation of a doc to provide a custom TOC is to use Apps Script. It provides enough access to the document structure to build and insert a basic table of contents, but I'm not sure there's enough to do paging correctly (unless you force a page break on every page...). There's no method to answer the question "what page is this element on?"
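As a rough starting point, a basic TOC without page numbers might be built in Apps Script roughly like this. It is a sketch under assumptions: the names buildTocLines and insertBasicToc are made up, only heading levels 1 to 3 are handled, and the entries are inserted as plain indented paragraphs at the top of the document.

```javascript
// Pure helper: turn [{ text, level }] into indented TOC lines (no page numbers).
function buildTocLines(headings) {
  return headings
    .filter((h) => h.text.trim() !== "")
    .map((h) => "  ".repeat(h.level - 1) + h.text.trim());
}

// Apps Script entry point (hypothetical name); only meaningful inside Google Apps Script,
// where the DocumentApp service is available.
function insertBasicToc() {
  const body = DocumentApp.getActiveDocument().getBody();
  const H = DocumentApp.ParagraphHeading;
  // Map the heading enum to a numeric level; 0 means "not a heading we list".
  const levelOf = (heading) =>
    heading === H.HEADING1 ? 1 : heading === H.HEADING2 ? 2 : heading === H.HEADING3 ? 3 : 0;
  const headings = body
    .getParagraphs()
    .map((p) => ({ text: p.getText(), level: levelOf(p.getHeading()) }))
    .filter((h) => h.level > 0);
  // Insert in reverse at index 0 so the lines end up in document order at the top.
  buildTocLines(headings)
    .reverse()
    .forEach((line) => body.insertParagraph(0, line));
}
```

Page numbers are deliberately left out, since there is no call that reports which page an element is on.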
Hacks like writing to a DOCX and converting don't work, because TOCs are recognized for what they are and show up without page numbers.
Of course, you could write a DOCX or PDF with the TOC as you'd like and upload it as a blob rather than as a Google Doc. Such files can still be viewed in Drive.
I want to add text to the body element, but I don't know how. Which method will work on the body tag?
Sorry for my English, and thanks for your replies.
In Watir, you can manipulate a web page (the DOM) using JavaScript, like this:
browser.execute_script("document.getElementById('pageContent').appendChild(document.createTextNode('Great Success!'));")
I assume that the point of the question is this:
Not all users interact with the web app just by clicking buttons and links. Some of them do nasty things, like altering HTTP requests to make your system do something it is not supposed to do, or just to have some fun.
To mimic this behavior, you could write a UI test that alters the forms on the web page so that, for example, one could type anything into any field instead of choosing from a limited dropdown.
To do that, the UI test has to:
manipulate the DOM to free the form inputs from their limitations (replace selects with inputs, etc.)
know which values to use; in many cases it's pointless to enter random values, so your web app has to provide some good "unwanted" options.
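The DOM manipulation step could be sketched as follows. The function name replaceSelectsWithInputs is made up, and the idea of passing its source to Watir's browser.execute_script is an assumption about how you would wire it into a test.

```javascript
// Replace every <select> in `doc` with a free-text <input> of the same name,
// so a test can submit values the dropdown would never offer.
// Returns the number of elements replaced.
function replaceSelectsWithInputs(doc) {
  const selects = Array.from(doc.querySelectorAll("select"));
  selects.forEach((select) => {
    const input = doc.createElement("input");
    input.type = "text";
    input.name = select.name; // keep the field name so the form submits the same key
    input.id = select.id;
    input.value = select.value; // start from the currently selected option
    select.replaceWith(input);
  });
  return selects.length;
}

// In a real page this would be called as replaceSelectsWithInputs(document).
```

After that, the test can fill the new text fields with whatever "unwanted" values it wants to try and submit the form as usual.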
Why would you want to modify the web page in Watir? It's for automated testing, not DOM manipulation.
If you want to add something to a DOM element in JavaScript, you can do it like this:
var txt = document.createTextNode(" This text was added to the DIV.");
document.getElementById('myDiv').appendChild(txt);
Or use a DOM manipulation library such as jQuery.
If you have not worked your way through the Watir tutorial, I would suggest you do so. It deals with things like filling in text fields.
Learn to use the developer tools for your browser: Firebug for Firefox, or the built-in tools for IE and Chrome. They will let you look at things as you interact with the site.
If the element is not a normal HTML input field of some sort, then you are dealing with a custom control. Many exist, they vary widely, and there is no one set solution for dealing with them. Without knowing which control you are using, and without being able to interact with a sample of it ourselves, or at least see the HTML, it is very difficult to advise you. We basically have to guess, which is often a waste of everyone's time.
Odds are that if there is a place where you can enter text, it is some form of input control. It might not start out that way, and you may need to click on some other element to make the input area appear, but without a sample of the HTML all we can do is guess.
If this is a commercial control, see if you can find a demo site that shows the control in action. Try googling things like the class names of the elements; often you get lucky.
What's the best way to create scripts for a browser?
I need to parse some HTML pages on different domains.
I am on Windows and mostly use Firefox.
If it's just about retrieving the pages to do whatever you want with them, the built-in urllib module in Python will do that for you.
It sounds like you want to retrieve web pages and parse them to extract meaningful data. I would suggest something like TagSoup (for Java), which fires off SAX events that you can use directly or feed into an XML module of your choice (raw DOM, JDOM, dom4j, XOM, etc.). The TagSoup page also lists a number of references for other languages, such as Beautiful Soup for Python and Rubyful Soup for Ruby.
From there, I would suggest using something like XPath to retrieve the bits of data that you want. Another option would be XSLT to transform the HTML into some unified format that you can more easily manipulate.
I'd recommend Synthetics Web. Here is a working example at jsFiddle.
jsFiddle
http://jsfiddle.net/dwayne05/YkLVw/
Synthetics Web
http://www.syntheticsweb.com/