Extract only content with Scrapy to create WordClouds - python-3.x

I am looking for a smart way to extract only the main content from a range of different webpages. When building word clouds, I always have to define a new set of stopwords (e.g. "links", "contact", ...) so that only the actual content is shown. I would like to avoid creating a stopword list each time I scrape a new website.
My idea is that certain HTML tags tend to hold more of the real content than others. Is filtering by tag a good approach for the preprocessing step, or do you have any other ideas?
Thank you for your help.
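One way to try the tag-based idea is sketched below. It is only a rough illustration, not a Scrapy feature: it uses Python's standard-library html.parser, and the two tag sets (SKIP_TAGS, CONTENT_TAGS) as well as the names MainTextExtractor and extract_main_text are assumptions chosen for the example. Text is kept only when it sits inside a content tag and outside any navigation-like container. The sketch assumes reasonably well-formed HTML with closed tags.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; tune them per site.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
CONTENT_TAGS = {"p", "h1", "h2", "h3", "li", "blockquote"}


class MainTextExtractor(HTMLParser):
    """Collect text found inside content tags but outside skipped containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # how many skipped containers we are inside
        self.content_depth = 0   # how many content tags we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in CONTENT_TAGS:
            self.content_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in CONTENT_TAGS and self.content_depth:
            self.content_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.content_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the filtered text of an HTML string as one space-joined string."""
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In a Scrapy callback, the page source in response.text could be passed to extract_main_text before the result is handed to the word cloud step. Menu items such as "links" or "contact" usually live in nav, header, or footer elements, so they would be dropped without a per-site stopword list, although sites that build their navigation from plain div elements would still need extra rules.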

Related

Iterating through different fieldsets in a form using Selenium Python

I am trying to extract the titles of videos saved on my webserver. They are listed inside a form, and the number of fieldsets in the form changes depending on how many videos there are. I want to iterate through all of the available fieldsets and extract the title of each video, which I've highlighted in the attached picture. I don't know how many fieldsets there will be, so I was thinking of looping over the length of the fieldset list. However, my unfamiliarity with web scraping and web development has left me a little confused.
I've tried a few different things, such as:
results = driver.find_elements(By.XPATH, "/html/body/div[2]/form/fieldset[1]")
However I am unsure how to extract the attributes of the subelements via Selenium.
Thank you
[Image: my webserver HTML page]
If you have as many videos as shown in the image and you need all of them, you can iterate through every fieldset with the code below.
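from selenium.webdriver.common.by import By

all_videos_title = []
# The class name "fileicon" comes from this answer; adjust it to match the page.
all_videos = driver.find_elements(By.XPATH, '//fieldset[@class="fileicon"]')
for sl_video in all_videos:
    # The leading "." keeps the search inside the current fieldset.
    sl_link = sl_video.find_element(By.XPATH, ".//a[@title]")
    all_videos_title.append(sl_link.get_attribute("title"))
print(all_videos_title)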
If the fieldset contains any other anchor ("a") tags, the XPath inside the loop has to change. If this does not work, please share the whole page source so I can give the exact XPath for the data you need.
Please try the XPath below:
//form//ancestor::fieldset//ancestor::a//@title

How to scrape hrefs disguised with unicode (e.g \u003ca href=\)

I'm trying to scrape relative paths contained in hrefs, but they aren't showing up in anything but the main soup pull. If I try to pull hrefs or links specifically, the ones I'm looking for just don't show up, even though I know they are there.
\u003ca href=\"/model/ford-1200\"
\u003ca href=\"/model/ford-1300\"
\u003ca href=\"/model/ford-1400\"
Is there a way to create a list of the 20 or so "u003ca href"s on the page? I'm looking for just the part in the quotes (e.g. /model/ford-1200, /model/ford-1300, /model/ford-1400), collected into a list.
Is this a use case where I'm going to need to sack up and learn JavaScript scraping?
It turns out the issue wasn't solved by addressing the unicode or by using re. The content simply had not loaded far enough to be scraped. I solved this by using Selenium, which allowed me to scrape the hrefs as I normally would.
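For pages where the escaped markup really is present in the raw response text (for example inside an embedded JSON blob), a regular expression can pull the paths out directly. The following is a minimal sketch under that assumption. The names HREF_RE and extract_escaped_hrefs are made up for the example, and the pattern only covers the exact escaping shown in the question.

```python
import re

# Matches the literal text  \u003ca href=\"...\"  and captures the path.
# Each "\\" in the pattern stands for one real backslash in the page text.
HREF_RE = re.compile(r'\\u003ca href=\\"([^"\\]+)\\"')


def extract_escaped_hrefs(text):
    """Return every href value found in unicode-escaped anchor tags."""
    return HREF_RE.findall(text)


# Hypothetical sample of raw page text, written as a raw string so the
# backslashes stay literal.
raw = (
    r'\u003ca href=\"/model/ford-1200\"\u003eFord 1200\u003c/a\u003e'
    r'\u003ca href=\"/model/ford-1300\"\u003eFord 1300\u003c/a\u003e'
)
hrefs = extract_escaped_hrefs(raw)
```

If the list comes back empty for a real page, that points to the situation described above: the links are added by JavaScript after the initial response, so a browser-driven tool such as Selenium is needed.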

Can someone guide me how to collect a list of url address in the tab using python?

I'm trying to collect a list of "https://..." addresses and store them in a CSV file. I could do it manually, e.g. copy the URLs from the website of interest and paste them into Excel one by one, but that is tedious and would take a lot of time.
Can someone suggest a faster way and guide me through it?
If you just need the addresses quickly from one page, you could run this JavaScript snippet in your browser's console: Array.from(document.links).forEach(link => console.log(link.href)). It prints every link on that page. (document.links is an HTMLCollection, which has no forEach of its own, hence the Array.from.)
If you want to use Python to scrape the page, I would suggest taking a look at this question on Stack Overflow, which uses the BeautifulSoup library.
If the page loads dynamic content with JavaScript, it's probably better to use something like Selenium; see the relevant Stack Overflow question.
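As a rough Python sketch of the static-page case: the example below collects every anchor's href from an HTML string and writes the result as CSV. To stay free of third-party installs it uses the standard library's html.parser instead of BeautifulSoup, so it is a swapped-in technique, not the approach from the linked question. LinkCollector, collect_links, and write_links_csv are hypothetical names.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin turns relative paths into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url=""):
    """Return all link URLs found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def write_links_csv(links, fileobj):
    """Write one URL per row, with a header row, to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["url"])
    writer.writerows([link] for link in links)
```

The HTML itself could come from urllib.request.urlopen(url).read().decode() or from a saved copy of the page. When writing to a real file, open it with newline="" as the csv module documentation recommends.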

Python Win32Com First page Header/Footer

Some Context
I am trying to scrape information from a large number of .doc files, so I've written a python program to do the heavy lifting for me. Word has this nifty ability to make the header and footer of the first page different. This is generally useful, but I am running into a problem which I'm not finding a good solution for.
This is how I am accessing headers and footers:
import win32com.client

word_app = win32com.client.Dispatch('Word.Application')
doc = word_app.Documents.Open('path/to/my/word/file.docx')
# Footers(1) is the primary footer, i.e. the one shared across the section
first_footer = doc.Sections(1).Footers(1).Range.Text
print(first_footer)
There is a catch, though: the first page contains header/footer content that is common throughout the document, but also some content that is unique to the first page. The code above does not capture this unique information. It only returns the header/footer text that is shared across the document.
When the first page has unique content in its header and footer, how do I access it using Python's win32com?
After some digging, I have found an answer.
It turns out you need to use a constant called "wdHeaderFooterFirstPage" within the constants bit of the module to access the first page header and footer, like so:
doc.Sections(1).Headers(win32com.client.constants.wdHeaderFooterFirstPage).Range.Text
This returns a string which you can manipulate like normal. Documentation for win32com is hard to find, and translating it from VBA documentation is not as obvious as I would like it to be.
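Putting the pieces together, a minimal sketch might look like the following. It hard-codes the numeric values of Word's WdHeaderFooterIndex enumeration (1 for primary, 2 for first page), so it does not depend on the win32com constants table having been generated. The helper name read_first_page_header_footer is made up for the example, and the file path in the usage comment is the placeholder from the question.

```python
# Values from Word's WdHeaderFooterIndex enumeration.
WD_HEADER_FOOTER_PRIMARY = 1
WD_HEADER_FOOTER_FIRST_PAGE = 2


def read_first_page_header_footer(doc, section_index=1):
    """Return (header_text, footer_text) of the first-page variant.

    `doc` is expected to be a Word Document COM object, for example the
    result of word_app.Documents.Open(...).
    """
    section = doc.Sections(section_index)
    header = section.Headers(WD_HEADER_FOOTER_FIRST_PAGE).Range.Text
    footer = section.Footers(WD_HEADER_FOOTER_FIRST_PAGE).Range.Text
    return header, footer


# Usage on Windows with pywin32 and Word installed (left as comments so the
# helper itself can be imported anywhere):
#
#   import win32com.client
#   word_app = win32com.client.Dispatch("Word.Application")
#   doc = word_app.Documents.Open("path/to/my/word/file.docx")
#   header, footer = read_first_page_header_footer(doc)
```

Keeping the COM calls inside a small function that only needs a document object also makes it easy to loop over many files, which fits the bulk-scraping use case described above.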

Does SharePoint Search support range tags?

I am working on a project to digitize approximately 1 million images, to which metadata will be added to facilitate search.
Each image is, for example, a page in a dictionary. It is not text, just a static scanned image. OCR is not an option :(
My objective is to emulate the current search procedure, which consists of looking through the alphabetical entries until the correct page is found. In the absence of machine-readable text, I am looking at tagging each page with a dictionary range tag, for example (Apple-Canada). So if someone searches for "Banana", it should hit the (Apple-Canada) range tag.
Is this supported in SharePoint out of the box? If not, is there an addon product which provides this functionality or am I looking at building a customized extension?
Any help will be appreciated :)
Installing the IFilter for TIF files is done with a couple of clicks and gives you free OCR along the way. Very good for scanned pages.
On your question though: No, SharePoint does not have any kind of "range" tags or fields. The only vaguely similar thing to what you are requesting is the Thesaurus of the search. There you could define acronyms and synonyms for words and it would actually search for something else. So you could enter Banana but it would actually search for Apple. Some examples here: How to: Customize the Thesaurus in SharePoint Search and Search Server.
Other than that I can only think of a custom implemented search provider giving you the flexibility you need.
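To illustrate what such a custom lookup would have to do, here is a small Python sketch of range matching. It is not SharePoint code and uses no SharePoint API. The page ranges, the page IDs, and the function name find_page are all hypothetical. Each page is described by the first and last word it covers, and a binary search over the sorted first words finds the candidate page.

```python
from bisect import bisect_right


def find_page(ranges, term):
    """Return the page ID whose word range contains `term`, else None.

    `ranges` is a list of (first_word, last_word, page_id) tuples sorted by
    first_word, with non-overlapping ranges.
    """
    key = term.casefold()
    starts = [first.casefold() for first, _, _ in ranges]
    # Index of the last range whose first word is <= the search term.
    i = bisect_right(starts, key) - 1
    if i < 0:
        return None
    _, last, page_id = ranges[i]
    return page_id if key <= last.casefold() else None


# Hypothetical range tags for three scanned pages.
PAGES = [
    ("Aardvark", "Anvil", "page-001"),
    ("Apple", "Canada", "page-002"),
    ("Candle", "Dog", "page-003"),
]
```

With these tags, find_page(PAGES, "Banana") resolves to the (Apple-Canada) page. The comparison here is plain case-insensitive string ordering, so a real dictionary with accents or special collation rules would need a locale-aware sort key.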
