I have crawled some image and file URLs from different web pages using StormCrawler and SOLR, and I have these URLs in the status core of SOLR. Now I want to download the files from these URLs and save them on my machine. Any suggestions on how to do this in a simple and scalable way?
Thank you.
The crawler already downloads them! You don't need to do that again. What you need though is to decide where and how to store the content. If you were to build a search engine, then you'd use the SOLR or Elasticsearch indexers; if you needed to scrape a site, you'd send the extracted metadata into a DB; if what you wanted was to archive the pages, then the WARC module would allow you to generate archives.
Do you want the binary content of the pages or the extracted text and metadata? If you want the former, then the WARC module would be fine. Otherwise, you can always write your own indexer bolt, StdOutIndexer should be a good starting point.
I want to download this website
And I tried IDM and HTTrack, but they didn't work for the JavaScript content.
http://websdr.uk:8074/
Can anyone help me download this frequency streaming content?
Thank you
In order to capture the JavaScript-driven content, you'll need Selenium. It's a browser automation tool used for automated testing and for extracting data from web pages.
https://www.selenium.dev/
Downloading an entire site is straightforward only if the website consists of static pages (HTML, CSS, images). JavaScript-based content is loaded dynamically, which makes it difficult to capture with conventional download tools.
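A minimal sketch of how this could look with the Python Selenium bindings, assuming a local Chrome/ChromeDriver setup; the wait time and output filename are arbitrary choices, and note that the live audio stream itself would still not be captured this way, only the rendered page:

```python
# Minimal sketch: render a JavaScript-heavy page with Selenium and save the resulting HTML.
# Assumes the Python Selenium bindings and a matching ChromeDriver are installed.
import time
from selenium import webdriver

URL = "http://websdr.uk:8074/"  # the page from the question

driver = webdriver.Chrome()
try:
    driver.get(URL)
    time.sleep(10)  # crude wait so the page's JavaScript has time to run
    with open("page_rendered.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)  # HTML after JavaScript execution
finally:
    driver.quit()
```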
I'm trying to understand how web crawling works. There are 3 questions:
1. Do we have to have an initial directory of URLs to build a larger directory of URLs? How does this work?
2. Are there any open source web crawlers written in Python?
3. Where is the best place to learn more about web crawlers?
Answering your second question first: Scrapy is a great tool for web scraping in Python.
When using it, there are a number of ways to start the spiders. A CrawlSpider can be given a list of initial URLs to start from; it then scrapes those pages looking for new links, which are added to the queue of pages to crawl.
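As a rough illustration, here is a minimal CrawlSpider sketch; the domain, start URL, and yielded fields are placeholders rather than anything from the question:

```python
# Minimal CrawlSpider sketch: starts from seed URLs and follows every link it finds.
# Run with:  scrapy runspider crawl_spider.py -o pages.json
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = "example_crawler"
    allowed_domains = ["example.com"]      # stay within one site
    start_urls = ["https://example.com/"]  # the initial URLs to start from

    rules = (
        # Extract every link, queue it for crawling, and pass each fetched page to parse_page.
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```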
Another way to use it is with the SitemapSpider. For this spider you give the crawler a list of URLs of website sitemaps. The spider then looks up the list of pages from each sitemap and crawls those.
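And the equivalent sketch with Scrapy's SitemapSpider, again with a placeholder sitemap URL:

```python
# Minimal SitemapSpider sketch: Scrapy fetches the sitemap, extracts the page URLs,
# and calls parse() once for each listed page.
import scrapy
from scrapy.spiders import SitemapSpider

class ExampleSitemapSpider(SitemapSpider):
    name = "sitemap_crawler"
    sitemap_urls = ["https://example.com/sitemap.xml"]  # sitemaps to read

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```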
I inherited two websites on the same host server and several thousand files/directories that are not being used by those websites. I want to remove the files that are not used. I have tried using the developer tools in Chrome and looking at the Network tab to see what is served to a browser when navigating to those websites, but there are some files that are never passed to the client. Does anyone know a way to do this efficiently?
Keep a list of all the files that you have.
Now write a crawler that will crawl both websites and, for each file it crawls over, remove that file from the initial list.
The files left in the list are the ones that are not used by either of those websites.
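A rough sketch of that idea in Python, assuming both sites are plain server-rendered HTML reachable over HTTP and that you know the document root on the host; the hostnames and path below are placeholders, and anything referenced only from JavaScript would be missed:

```python
# Sketch of the approach above: list every file under the document root, crawl both sites,
# discard each file that gets referenced, and print whatever is left over.
# Uses requests + BeautifulSoup; the URL-path-to-file mapping is a simplification.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SITES = ["https://site-one.example", "https://site-two.example"]  # placeholder hosts
DOC_ROOT = "/var/www"                                             # placeholder path

# 1. List every file under the document root.
all_files = set()
for root, _dirs, names in os.walk(DOC_ROOT):
    for name in names:
        all_files.add(os.path.relpath(os.path.join(root, name), DOC_ROOT))

# 2. Crawl both sites and discard every file that is referenced.
seen_pages, queue = set(), list(SITES)
while queue:
    url = queue.pop()
    if url in seen_pages:
        continue
    seen_pages.add(url)
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    all_files.discard(urlparse(url).path.lstrip("/") or "index.html")  # the page itself
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag, attr in (("a", "href"), ("img", "src"), ("script", "src"), ("link", "href")):
        for element in soup.find_all(tag, **{attr: True}):
            target = urljoin(url, element[attr])
            if urlparse(target).netloc == urlparse(url).netloc:
                all_files.discard(urlparse(target).path.lstrip("/"))
                if tag == "a":
                    queue.append(target)  # follow internal links to more pages

# 3. Whatever is left was never referenced by either site.
print("\n".join(sorted(all_files)))
```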
We have a web application with over 560 pages. I would like a way to catalog the site somehow so that I can review the pages (without having to find each one in the menu or enter the URL). I'd be very glad for ideas on the best way to go about this.
I'd be happy to end up with 560 image files or PDFs, or one large PDF or whatever. I can easily put together a script with all the URLs, but how to pull those up and take a snapshot of some sort and save that to a file or files is where I need help.
The site is written in Java (server) and JavaScript (client).
I found a great plugin for Firefox that made this relatively painless. The plugin is called Screenshot Pimp (hate the name, love what it does). It takes a snapshot of your browser contents and immediately saves it to a file on your hard drive.
So then I wrote a script that would pull each page up in an IFrame with the URL showing above that, and took snapshots of each page. It took a couple hours to cycle through the whole set of 560+ pages, but it worked great, and now I have a catalog of all the pages.
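For a similar catalog without a browser plugin, something like the following Python/Selenium sketch could also work; the URL list, window size, and output directory are placeholders:

```python
# Alternative sketch for the same task: loop over a list of URLs and save a
# screenshot of each page with Selenium instead of a browser plugin.
import os
from selenium import webdriver

URLS = ["https://myapp.example/page1", "https://myapp.example/page2"]  # ~560 entries in practice
OUT_DIR = "screenshots"
os.makedirs(OUT_DIR, exist_ok=True)

driver = webdriver.Firefox()
driver.set_window_size(1280, 1024)
try:
    for i, url in enumerate(URLS, start=1):
        driver.get(url)
        driver.save_screenshot(os.path.join(OUT_DIR, f"page_{i:03d}.png"))
finally:
    driver.quit()
```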
I would like to make sure my website ranks as high as possible whenever my Google Places location ranks high.
I have seen references to creating a locations.kml file and putting it in the root directory of my site. Then creating lines in the sitemap.xml file to point to this .kml file.
I get this from the following statement on the geo locations page:
Google no longer supports the Geo extension to the Sitemap protocol. We recommend that you tell Google about geographically-based URLs by including them in a regular Web Sitemap.
There is a link to the Web Sitemap page
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=183668
I'm looking for examples of how to include Geo location information in the sitemap.xml file.
Would someone please point me to an example so that I can know how to code the reference?
I think the point is that you don't use any specific formatting in the sitemap. You just make sure you include all your locally relevant pages in the sitemap as normal (i.e. you don't include any geo location data in the sitemap).
Googlebot will use its normal methods for determining whether a page should be locally targeted.
(I think Google has found that the Sitemap protocol was being abused and/or misunderstood, so they no longer need it to tell them much about the page. Rather, it's just a way to find pages that might take a long time to discover through conventional means.)
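For illustration, here is a small Python sketch of what such a "regular" sitemap could look like, generated for a couple of placeholder location pages; there is no geo-specific markup, just ordinary <url> entries:

```python
# Sketch: write a plain sitemap.xml listing locally relevant pages as ordinary URLs.
# The page URLs are placeholders for your own location pages.
from xml.sax.saxutils import escape

location_pages = [
    "https://www.example.com/locations/springfield/",
    "https://www.example.com/locations/shelbyville/",
]

entries = "\n".join(
    f"  <url>\n    <loc>{escape(url)}</loc>\n  </url>" for url in location_pages
)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>\n"
)

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(sitemap)
```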