What are some good crawlers that can help download files?

For one of my statistics projects, I need to RANDOMLY download several files from a Google patent page, and each file is a large zip file. The web link is the following:
http://www.google.com/googlebooks/uspto-patents-grants-text.html#2012
Specifically, I want to RANDOMLY select 5 years (the links at the top of the page) and download the corresponding files (i.e., 5 files). Do you guys know if there's some good package out there that is good for this task?
Thank you.

That page contains mostly zip files, and looking at the HTML it should be fairly easy to determine which links will yield a zip file by simply checking for ".zip" in the candidate URLs, so here is what I would recommend:
fetch the page
parse the HTML
extract the anchor tags
for each anchor tag
    if the href of the anchor tag contains ".zip"
        add the href to the list of file links
while more files are needed
    generate a random index i, such that 0 <= i < the number of links in the list
    select the i-th element from the links list
    fetch the zip file
    save the file to disk or load it into memory
If you don't want to get the same file twice, then just remove the URL from the list of links and randomly select another index (until you have enough files or until you run out of links). I don't know what programming language your team codes in, but it shouldn't be very difficult to write a small program that does the above.
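For example, here is a minimal shell sketch of the steps above (assuming curl, grep, sed, shuf and wget are available, and that the hrefs on the page are absolute URLs; the grep pattern may need adjusting to the page's actual markup):
# fetch the page and pull out every href that ends in .zip
curl -s 'http://www.google.com/googlebooks/uspto-patents-grants-text.html' \
  | grep -oE 'href="[^"]*\.zip"' \
  | sed -e 's/^href="//' -e 's/"$//' > ziplinks.txt

# randomly pick 5 links and download them
shuf -n 5 ziplinks.txt | while read -r url; do
  wget "$url"
done
Using shuf -n 5 also takes care of not picking the same link twice, since it samples without replacement.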

Related

shell script to check if images in a folder are being used by a set of HTML files

Some time ago I worked on a team that developed a bunch of educational software, and now it is being reviewed for bugs and updates.
During this process, I noticed that the folder "imgs" has accumulated too many files. Probably one of the developers decided to dump all the images used by each program into that folder. However, because there are so many programs, it would be too painful to check all of them manually (and some of the images are part of the layout, almost invisible).
Is there a way to write a shell script in Linux to check if the files in a given folder are being used by a set of HTML and JS files in another folder?
Go to the images folder and try this
for name in *; do grep -ril "$name" /path/to/soft/ || echo "$name not used"; done
I'm not sure I understood your question correctly, but maybe this will help you:
ls -1 your_source_path | while read -r file
do
    grep -wnr "$file" your_destination_path ||
        echo "no match for file $file"
    # you can add any extra action here
done
In your_source_path you put the directory whose file names should be listed, and your_destination_path is where the script should search for them.
It is not possible to check the generic case, since HTML and JavaScript are too dynamic (e.g. the JavaScript code could build the image file name on the fly). Likewise, images can be specified in CSS style sheets, inline styles, etc.
You will want to review the HTML/JS files and see if it is possible to identify the tags that are actually used to specify images. This will hopefully reduce the number of tags and attribute names that need to be extracted.
As an alternative, if you have access to the server's access log, you can find out which images have been accessed over time and focus the search on images not referenced in the log file.
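For example, a rough sketch of that idea, run from the images folder and assuming an Apache-style log at /var/log/apache2/access.log (the log path is an assumption):
# list images that never show up in the web server's access log
for name in *; do
  grep -q "$name" /var/log/apache2/access.log || echo "$name never requested"
done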

Retrieve contents of a ZIP file on SharePoint without downloading it

I have written a bit of automated code that checks a SharePoint site and looks for a ZIP file (let's call it doc.zip). If doc.zip is found, it downloads it and then checks for a file (say target.docx). doc.zip is about 300MB, so I only want to download it when necessary.
What I would like to know is: given that SharePoint has some ZIP search capability, is it possible to write code using CSOM (C#) to find doc.zip and then retrieve the contents of doc.zip without downloading it?
Just to reiterate, I am comfortable with searching for files in a folder on SP, downloading the file, and unpacking zip entries. What I need is to retrieve a ZIP file's contents on SP without downloading it.
E.g. is there a SP command:
cxt.Load(SomeZipFileQuery);
cxt.ExecuteQuery();
Thanks in advance.
This capability is not available. I do like the idea; having the ability to "parse" zip files on the server side and then download only the relevant bits would be ideal. Perhaps raise this on UserVoice to see if others would also find it useful: https://sharepoint.uservoice.com
Ok, I have proven yet again that stubbornness will prevail.
I have figured out that if I use the /_api/search?query='myfile.zip' REST API to search for my file, the search will also match ZIP files that contain the file I need. And it works perfectly.
Of course there is the added pain of parsing an XML response, but it works very nicely for my case.
At least if someone is looking for this solution, here it is. I won't bore anyone with code, as /_api/search has probably been done to death already in other threads.
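For anyone who does want a starting point, a minimal sketch of such a search call might look like the following (the server URL, the NTLM credentials and the /_api/search/query?querytext= form of the endpoint are assumptions about your environment; parse the returned XML for the result URLs):
# ZIP files that contain target.docx should show up in the search results
curl --ntlm -u 'DOMAIN\user:password' \
  -H 'Accept: application/xml' \
  "http://yourserver/_api/search/query?querytext='target.docx'"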

Dreamweaver library items - links and images get "library/"

Hello Stack Overflow, this is my first post, but you have already helped me a lot of times!
I have made a footer as a library item on my site. The problem is that when I add the library item to a page, every link and image gets "library/" prepended.
Example: in the library item the link to my image is "images/about-us.png", but it turns into "library/images/about-us.png" when I drag the library item into my site. As a result, the image can't be found and doesn't show up.
Any ideas would be very helpful!
/Rasmus (DW CS5.5)
Did you originally have a folder in the root of your site called 'library'? If so, that may have confused Dreamweaver, as by default it stores all library items in a root-level folder called 'Library' (with a capital L), and if that folder doesn't exist, Dreamweaver should try to create it.

Programmatically downloading a large number of <insert file type here>

I'm wondering if there's an easy way to download a large number of files of one arbitrary type, e.g., downloading 10,000 XML files. In the past, I've used Bing's API. It's free and offers unlimited queries. However, it doesn't index as many types of files as Google does. Google indexes XML files, CSV files, and KML files. (These can all be found by doing searches like "filetype:XML".) As far as I know, Bing doesn't index these in a way that's easily searchable. Is there another API that has these capabilities?
How about using wget? You can give wget a URL (for example, a Google search result) and tell it to follow all the links on that page and download them (I bet you could also give it a filter).
Just tried it and got an ERROR 403: Forbidden. Apparently Google blocks requests from wget, so you'll have to provide a different user agent. A quick search turned up this example:
http://www.mail-archive.com/wget#sunsite.dk/msg06564.html
It then worked with the example given.
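The general idea is simply to override wget's default user agent; a sketch along these lines (the user-agent string, filter and URL here are illustrative, not the exact example from that post):
# crawl one level deep, keep only XML files, and identify as a browser
wget --recursive --level=1 --accept '*.xml' \
  --user-agent='Mozilla/5.0 (X11; Linux x86_64)' \
  'http://www.example.com/search-results.html'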

How to retrieve the file path column "ows_EncodedAbsUrl" in search results

I am passing the search query to search.asmx and retrieving the search results through web services. The results return a document path for .txt files and images, and this path can be used to open the file directly.
txt file: "http://server:24669/jap/ww.txt" - it will open the file.
PDF file: "http://server:100/456efg/Forms/DispForm.aspx?ID=3&RootFolder=/456efg" - it will show the PDF properties or the parent folder.
So I need to get a URL that opens the PDF document. The "ows_EncodedAbsUrl" column has the document URL, but it's not retrievable in the search results. Is there any way to solve this?
If you add a PDF iFilter to your SharePoint environment, PDF files will no longer be treated as list items (which is why you currently get the property view link).
Of course, Adobe posts the instructions for this as a PDF.
This change will also start indexing the text of your PDF documents, so they will be more searchable. Be aware that a percentage of the PDF documents' size will be added to your search storage costs, so plan ahead.
This is a cure for the symptom; I do not know if there are other ways to do this.
