Is there a program that will search for all sub-folders of a specific web address?

I am working on a project at work where we currently have a very labor-intensive process of verifying that certain documents have been uploaded to a folder. My bosses would like me to automate this process to save time. I have found tools that can do this for local computer directories, but haven't been able to find one for web addresses.
I tried searching for the kind of search tool I'm looking for, but only found ones that work on local directories, not web addresses.
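If the expected file names are known up front, one way to automate the verification is to probe each expected URL directly; a plain web server won't usually give you a folder listing, but it will tell you whether a particular file is there. A rough sketch in Python (the base URL and file names below are placeholders):

```python
# check_uploads.py - report which expected documents are missing from a web folder.
# The base URL and file names are placeholders; adjust them to your site.
import urllib.error
import urllib.request

BASE_URL = "https://example.com/uploads/"          # hypothetical folder address
EXPECTED_FILES = ["report1.pdf", "report2.pdf"]    # documents you expect to find

def exists(url):
    """Return True if the server answers the request without a 4xx/5xx error."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status < 400
    except urllib.error.HTTPError:
        return False

for name in EXPECTED_FILES:
    url = BASE_URL + name
    print(("OK" if exists(url) else "MISSING") + "\t" + url)
```

Some servers refuse HEAD requests; swapping in a normal GET is the simplest fallback.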

Related

As a file hosting provider, how do you prevent phishing?

We develop a service much like Dropbox or Google Drive or S3 where our customers can host their files and get direct links to them with our domain, e.g. www.ourservice.com/customer_site/path/to/file.
Recently we started receiving emails from our server provider saying that we host phishing .html files. It turned out that some of our customers did host such HTML files; of course we removed the files and blocked those customers.
But the problem is that we would like to prevent this from happening at all, mainly because it hurts our Google Search ranking, and of course we never wanted to be a host for phishing scripts.
How do other file hosting providers solve this? I guess we could run some checks (anti-virus, anti-phishing, etc.) when the customers upload their files, but that would be pretty tedious considering that there are a lot of them. Another option is to periodically check all the files, or all new files, but I'd still like to ask in case I'm missing an easier solution.
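A scan-on-upload hook is one concrete shape the checks mentioned above could take. A minimal sketch, assuming ClamAV's clamscan command-line scanner is available on the server (the file path below is a placeholder):

```python
# scan_upload.py - reject an uploaded file if ClamAV flags it.
# Assumes the clamscan binary from ClamAV is installed; the path is a placeholder.
import subprocess

def is_clean(path):
    """clamscan exits with 0 when no threat is found and 1 when one is."""
    result = subprocess.run(
        ["clamscan", "--no-summary", path],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    uploaded = "/tmp/customer_upload.html"   # hypothetical upload location
    if not is_clean(uploaded):
        print("Rejected:", uploaded)
```

Antivirus signatures alone miss many phishing pages, so in practice this would sit alongside the periodic re-checks you already mention.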

Search engine components

I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.
As far as I know, these search engines consist of:
Search algorithm & code
(Example: a search.py file that accepts a search query from the web interface and returns the search results)
Web interface for querying and showing results
Web crawler
What I am confused about is the Web crawler part.
Do Google's and Yahoo's web crawlers immediately search through every single webpage existing on the WWW? Or do they:
First download all the existing webpages on the WWW, save them on their huge servers, and then search through these saved pages?
If the latter is the case, then wouldn't the results appearing on the Google search page be outdated, since I suppose searching through all the webpages on the WWW takes a tremendous amount of time?
PS. One more question: how exactly does a web crawler retrieve all the web pages existing on the WWW? For example, does it try every possible web address, like www.a.com, www.b.com, www.c.com, and so on? (Although I know this can't be true.)
Or is there some way to get access to all the existing webpages on the world wide web? (Sorry for asking such a silly question.)
Thanks!!
The crawlers search through pages, download them, and save (parts of) them for later processing. So yes, you are right that the results search engines return can easily be outdated. And a couple of years ago they really were quite outdated. Only relatively recently did Google and others start doing more realtime searching, by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, but they took the realtime search offline again in July 2011. Otherwise they, for example, take note of how often a web page changes, so they know which pages to crawl more often than others. And they have special systems for this, such as the Caffeine web indexing system. See also their blog post Giving you fresher, more recent search results.
So what happens is:
Crawlers retrieve pages
Backend servers process them
Parse text, tokenize it, index it for full text search (see the sketch after this list)
Extract links
Extract metadata such as schema.org for rich snippets
Later they do additional computation based on the extracted data, such as
Page rank computation
In parallel they can be doing lots of other stuff such as
Entity extraction for Knowledge graph information
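As a toy illustration of the "tokenize it, index it" step in the list above (not how Google actually stores things, just the basic inverted-index idea):

```python
# A toy inverted index: maps each token to the set of page URLs containing it.
from collections import defaultdict

pages = {
    "http://example.com/a": "the quick brown fox",
    "http://example.com/b": "the lazy dog",
}

index = defaultdict(set)
for url, text in pages.items():
    for token in text.lower().split():   # real engines use much smarter tokenizers
        index[token].add(url)

def search(query):
    """A query is just a lookup plus an intersection of the posting sets."""
    postings = [index[t] for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("the fox"))   # {'http://example.com/a'}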
Discovering what pages to crawl happens simply by starting with a page, following its links to other pages, then following their links, and so on. In addition to that, they have other ways of learning about new web sites: for example, if people use their public DNS server, they will learn about pages that those people visit, and likewise from links shared on G+, Twitter, etc.
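In code, that "start somewhere and follow the links" loop is essentially a breadth-first traversal. A bare-bones sketch (with none of the politeness rules, robots.txt handling, or scheduling a real crawler needs):

```python
# A bare-bones crawler: fetch a page, pull out its links, queue the new ones.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=20):
    """Breadth-first crawl starting from seed, stopping after `limit` URLs."""
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            page = urllib.request.urlopen(url, timeout=10).read()
        except Exception:
            continue                       # unreachable pages are just skipped
        parser = LinkParser()
        parser.feed(page.decode("utf-8", errors="replace"))
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

print(crawl("https://example.com/"))
```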
There is no way of knowing what all the existing web pages are. There may be some that are not linked from anywhere, that no one publicly shares a link to (and whose visitors don't use Google's DNS, etc.), so there is no way of knowing these pages exist. Then there's the problem of the Deep Web. Hope this helps.
Crawling is not an easy task (for example, Yahoo is now outsourcing crawling via Microsoft's Bing). You can read more about it in Brin and Page's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
More details about storage, architecture, etc. can be found, for example, on the High Scalability website: http://highscalability.com/google-architecture

Monitoring the Full Disclosure mailing list

I develop web applications, which use a number of third-party applications/code/services.
As part of the job, we regularly check the Full Disclosure mailing list (http://seclists.org/fulldisclosure/) for anything affecting the products we use.
This is a slow process to do manually and subscribing to the list would cost even more time, as most reports do not concern us.
Since I can't be the only one trying to keep up with any possible problems in the code I use, others have surely encountered (and hopefully solved) this problem before.
What is the best way to monitor the Full Disclosure mailing list for specific products only?
Two generic ways to do the same thing... I'm not aware of any specific open solution for this, but it would be rather trivial to build.
You could write a daily or weekly cron/Jenkins job to scrape the previous period's emails from the archive, looking for your keywords/combinations, and send a batch digest with whatever it finds.
But personally, I'd set up a dedicated email account to subscribe to the various security lists you're interested in, and add a simple automated script that parses new emails for various keywords or combinations of keywords; when it finds a match, it forwards that email on to you/your team. Just be sure to keep the keyword list updated with new products you're using.
You could even do this with a Gmail account and custom rules, which is what I currently do, but in the past I have set up an internal inbox with a simple Python script to forward emails that were of interest.
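A minimal sketch of the "parse new emails for keywords" script, assuming the account is reachable over IMAP (host, credentials, and keyword list are placeholders; only the subject line is checked, and the forwarding step is left as a print):

```python
# filter_fd.py - flag new Full Disclosure messages that mention our products.
# Host, credentials and keywords below are placeholders.
import email
import imaplib

HOST, USER, PASSWORD = "imap.example.com", "secbot", "secret"
KEYWORDS = ["wordpress", "openssl", "jquery"]   # products we care about

with imaplib.IMAP4_SSL(HOST) as imap:
    imap.login(USER, PASSWORD)
    imap.select("INBOX")
    _, data = imap.search(None, "UNSEEN")
    for num in data[0].split():
        _, parts = imap.fetch(num.decode(), "(RFC822)")
        message = email.message_from_bytes(parts[0][1])
        subject = (message.get("Subject", "") or "").lower()
        if any(keyword in subject for keyword in KEYWORDS):
            # In a real setup this would forward the mail to the team instead.
            print("Match:", message.get("Subject"))
```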

Centralized Indexing on a Server for Windows 7 Search

I have read this interesting article
http://www.windowsnetworking.com/articles-tutorials/windows-7/Exploring-Windows-7s-New-Search-Features-Part1.html
http://www.windowsnetworking.com/articles-tutorials/windows-7/Exploring-Windows-7s-New-Search-Features-Part2.html
http://www.windowsnetworking.com/articles-tutorials/windows-7/Exploring-Windows-7s-New-Search-Features-Part3.html
This article ends with "Sadly though, if you want to index network locations then you will be forced to cache the locations that you want to index. Searches of network volumes are still possible even without indexing those locations, but require a bit more effort than a typical search."
I have plenty of files on a network drive; searching is very slow and misses files, and I wish to have them indexed so that I can have fast searches. The files may be edited from time to time (so the index must be updated with the changes, etc.). Making the files available offline is not an option, as it would defeat the purpose of a network drive.
I was wondering: is there a solution to this problem, such as software that runs on an independent machine, so that Windows Search on the workstations connects to that machine when searching the network drive?
I've searched a bit and there is software that can index files, such as Solr (http://lucene.apache.org/solr/), which is built on Lucene.
Is there any software out there that does the whole thing?
Has anyone ever done something like this?
And if it's not possible, why?
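For what it's worth, the Solr side of such a setup is fairly small. A sketch of pushing one file's text into a core over Solr's JSON update API (the URL, the core name "files", the field names, and the network path are assumptions, and something would still have to walk the share and extract text from each file):

```python
# index_file.py - push one document into a Solr core via the JSON update API.
# The Solr URL, core name ("files") and field names are assumptions.
import json
import urllib.request

SOLR_UPDATE = "http://localhost:8983/solr/files/update?commit=true"

# Hypothetical document: the id is the file's network path, the body its text.
with open("report.txt", encoding="utf-8", errors="replace") as handle:
    doc = {"id": r"\\fileserver\share\docs\report.txt", "content": handle.read()}

request = urllib.request.Request(
    SOLR_UPDATE,
    data=json.dumps([doc]).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(request).read().decode("utf-8"))
```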

Creating a log file of specific web traffic statistics

I have a website, hosted on a shared server.
Through cPanel, I am provided with a few traffic analysis logs and tools.
None seem to provide what I'm looking for.
For each day, I'd like to see a log file with a list of unique visitors.
Under each unique visitor (by IP address), I'd like to see the following information:
geographic location (based on IP address)
information to help determine if the visitor was a bot or human
the page URLs they requested (including the exact time of request)
Explanation of my application:
I run a forum on my site. I'd like a better understanding of who is visiting, when they visit, and how they navigate through my forum pages (topics, posts, etc.).
I would appreciate some direction on how to develop this (a script is probably best).
I would (and do) use Google Analytics, as it gives you exactly what you are asking for and a whole lot more (like being able to see live what is happening). It requires you to add some JavaScript code to the application (and for many applications today, plugins are available that do this for you).
If no plugin is available, see https://support.google.com/analytics/answer/1008080?hl=en
This approach will typically be a lot easier than trying to create your own log analyser and installing it on a shared cPanel server.
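That said, since you asked for direction on a script: the raw access log cPanel exposes is usually in the Apache combined format, and a small parser gets you most of the per-visitor report. In this sketch, bot detection is just a crude user-agent check and the geographic lookup is left out (a local GeoIP database could fill that in); the log file name is a placeholder:

```python
# visitors.py - group an Apache combined-format access log by client IP.
# The log path is a placeholder; cPanel usually offers the raw log as a download.
import re
from collections import defaultdict

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
BOT_HINTS = ("bot", "crawler", "spider")

visits = defaultdict(list)
with open("access_log") as log:                     # placeholder file name
    for line in log:
        match = LOG_LINE.match(line)
        if match:
            visits[match["ip"]].append(match.groupdict())

for ip, requests in visits.items():
    agent = requests[0]["agent"].lower()
    kind = "bot?" if any(hint in agent for hint in BOT_HINTS) else "human?"
    print(f"{ip} ({kind}, {len(requests)} requests)")
    for entry in requests:
        print(f"    {entry['time']}  {entry['request']}")
```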
