Is there any effort towards a scraper and bot friendly Internet? [closed]

Closed 8 years ago. This question needs to be more focused and is not currently accepting answers.
I am working on a scraping project for a company. I used Python libraries such as Selenium, mechanize, and BeautifulSoup4, and have been successful in putting the data into a MySQL database and generating the reports they wanted.
But I am curious: why is there no standardization of website structure? Every site has a different name/id for its username/password fields. I looked at the Facebook and Google login pages, and even they name those fields differently. Other elements are likewise named arbitrarily and placed anywhere.
One obvious reason I can see is that bots eat up a lot of bandwidth and websites are primarily targeted at human users. A second reason may be that websites want to show advertisements. There may be other reasons too.
Would it not be better if websites did not have to provide APIs and there were instead a single framework for bot/scraper login? For example, every website could expose a scraper-friendly version that is structured and named according to a universally agreed standard specification, along with a help-like page for scrapers. To access this version of the website, the bot/scraper would have to register itself.
This would open up an entirely different kind of internet to programmers. For example, someone could write a scraper that monitors vulnerability and exploit listing websites and automatically closes the security holes on the user's system. (For this, those websites would have to publish a version whose data can be applied directly, such as patches and where they should be applied.)
All this could easily be done by an average programmer. On the dark side, one could write malware that updates itself with new attack strategies.
I know it is possible to use Facebook or Google login on other websites via OAuth, but that covers only a small part of scraping.
My question boils down to: why is there no such effort out in the community? And if there is one, kindly refer me to it.
I searched Stack Overflow but could not find a similar question, and I am not sure this kind of question is appropriate for Stack Overflow. If not, please refer me to the correct Stack Exchange site.
I will edit the question if something in it does not meet the community criteria, but it is a genuine question.
EDIT: I got the answer, thanks to @b.j.g. There is such an effort by the W3C, called the Semantic Web. (Anyway, I am sure Google will hijack the whole internet one day and make this possible, within my lifetime.)

EDIT: I think what you are looking for is The Semantic Web
You are assuming people want their data to be scraped. In actuality, the data people scrape is usually proprietary to the publisher, and when it is scraped... they lose exclusivity on the data.
I had trouble scraping yoga schedules in the past, and I concluded that the developers were consciously making it difficult to scrape so third parties couldn't easily use their data.
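For what it's worth, the Semantic Web idea is already partly visible in the wild: many sites embed schema.org structured data as JSON-LD, which a scraper can read directly instead of reverse-engineering arbitrary field names. Below is a minimal Python sketch (using requests and BeautifulSoup, which the question already mentions); the URL and the printed fields are placeholders, not a real API.

    # Minimal sketch: pull schema.org JSON-LD blocks out of a page.
    # The URL below is a placeholder, not a real endpoint.
    import json

    import requests
    from bs4 import BeautifulSoup

    def extract_jsonld(url):
        """Return all JSON-LD objects found in <script type="application/ld+json"> tags."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        objects = []
        for tag in soup.find_all("script", type="application/ld+json"):
            try:
                objects.append(json.loads(tag.string or ""))
            except json.JSONDecodeError:
                pass  # skip malformed blocks
        return objects

    if __name__ == "__main__":
        for obj in extract_jsonld("https://example.com/some-article"):
            print(obj.get("@type"), obj.get("name"))

Pages that publish such markup are effectively the "scraper friendly version" the question asks for, just without a registration step.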

Related

Sharing of information between back end and front end developers [closed]

Closed 3 years ago. This question needs to be more focused and is not currently accepting answers.
This question is not related to code or any bug; it is an organisational query. I am a front-end developer, and I consume web APIs developed by the back-end developers in my company. They share the APIs via Postman, segregated into folders per project. The problem is that the nomenclature as well as the functionality of the APIs differs, which creates a lot of confusion when consuming them. Secondly, there is no indication of whether an API is deployed on a server or not, so sometimes I end up writing the code and then realize that the specific API is not deployed yet.
My question is: how does the rest of the world do this? How is communication between developers established in this area? How can one overcome this problem?
I hope I am interpreting your question correctly:
One of the methods used in the industry is Scrum (specifically daily stand-ups), where you talk about the work you intend to perform that day. This gives the back-end developers an opportunity to tell you it is not ready yet. It really depends on the culture in the company. Why are they writing endpoints that are not yet deployed, and if they are not deployed, how difficult is it for you to get them deployed?
Another way is DevOps, which many think of as Scrum for the entire value chain.
These methodologies are not something you can dictate, however; they arose precisely because of the problem you are referring to.
In practice: you should probably ask for another folder in Postman called "SafeToUse" or "ReadyForConsumption"; that way you can clearly see what is on its way and what is ready.
I hope this answers your question. I can't recommend anything more specific without knowing the kind of work you perform; normally, in my experience, the front end and back end for a given project are developed in close communication.
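As a stop-gap while the team settles on a convention, you can also probe the endpoints yourself before writing code against them. The following is a rough Python sketch; the base URL and paths are made-up placeholders, and treating any non-404 response as "deployed" is only a heuristic.

    # Rough sketch: check which endpoints respond at all before coding against them.
    # BASE_URL and ENDPOINTS are placeholders for whatever your Postman folders list.
    import requests

    BASE_URL = "https://api.example.com"
    ENDPOINTS = ["/users", "/projects/42", "/reports/summary"]

    def check_deployed(base_url, paths):
        for path in paths:
            url = base_url + path
            try:
                resp = requests.get(url, timeout=5)
                # Anything other than 404 usually means the route exists
                # (401/403 just mean it wants credentials).
                status = "probably deployed" if resp.status_code != 404 else "not found"
                print(f"{url}: {resp.status_code} ({status})")
            except requests.RequestException as exc:
                print(f"{url}: unreachable ({exc})")

    check_deployed(BASE_URL, ENDPOINTS)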

Protecting from "registration bots"? [closed]

Closed 11 years ago. This question is off-topic and is not currently accepting answers.
What is the best strategy for protecting against "registration bots", i.e. ones that just POST registration forms to my server, creating bogus users?
For my application, it started with just a few such accounts per day, but it has now become a real problem.
I would like to avoid confirmation emails as much as possible. What strategies can prevent this?
You can use a variety of techniques here:
Use a CAPTCHA such as reCAPTCHA.
Present the user with a trivial problem like "2 + 2 = ?". A human will be able to respond correctly, whereas a bot won't.
Add a hidden text field to your form. Bots are programmed to fill in every field they can, so if the hidden field contains data when the form is submitted, discard the request (see the sketch below).
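To make the hidden-field (honeypot) idea concrete, here is a minimal sketch using Flask; the framework, route, and field names are assumptions chosen for illustration, not a prescribed implementation.

    # Honeypot sketch: the "website" field is hidden from humans with CSS,
    # so a real user leaves it empty while a naive bot fills it in.
    from flask import Flask, request

    app = Flask(__name__)

    FORM = """
    <form method="post" action="/register">
      <input name="username">
      <input name="password" type="password">
      <!-- honeypot: invisible to humans, tempting for bots -->
      <input name="website" style="display:none" autocomplete="off" tabindex="-1">
      <button type="submit">Register</button>
    </form>
    """

    @app.route("/register", methods=["GET", "POST"])
    def register():
        if request.method == "POST":
            if request.form.get("website"):
                # Honeypot was filled in: almost certainly a bot.
                return "Registration rejected.", 400
            # ... create the real account here ...
            return "Registered."
        return FORM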
Use something like reCaptcha
Any kind of CAPTCHA will do, e.g. reCAPTCHA, but for common bots even a simple check like "from the checkboxes below, please select the nth one" will do it.
Also, if you use a popular app like phpBB, just a little tweaking of the registration page will do it.
If your site is very popular, then it's a different story altogether, and there will always be a way to write bots specifically designed for your site, but these basic tricks should be enough to stop generic bots.
You could log the IPs of those bots and block them; that is, if they are not rotating through lots of IPs.

Is MediaWiki viable for sensitive information? [closed]

Closed 11 years ago. This question is off-topic and is not currently accepting answers.
I was under the impression that MediaWiki, due to its nature as an "open for all" wiki platform, is not tailored towards managing sensitive information.
I found warnings about this in the MediaWiki FAQ and on some user-account extension pages, such as:
If you need per-page or partial page access restrictions, you are advised to install an appropriate content management package. MediaWiki was not written to provide per-page access restrictions, and almost all hacks or patches promising to add them will likely have flaws somewhere, which could lead to exposure of confidential data. We are not responsible for anything being leaked, leading to loss of funds or one's job.
Now a consultant is telling my boss there is no problem with sensitive information at all. I would like to hear whether he is right and I am worrying too much.
I suppose all of these problems would go away if we used a separate MediaWiki instance for every user group with the same rights.
Think about the risks here:
What sort of data are you planning to populate it with? If it is personal data such as salaries, home addresses or medical records, or if it is credit card data, then you may be required to protect it appropriately (in the US see HIPAA, Gramm-Leach-Bliley, SOX and state data protection legislation; in the UK see the DPA 1998 and FSA regulations; in Japan, J-SOX; globally, PCI DSS).
Aside from those regulations (and a whole lot of others globally), how would your business cope if the data were deleted, published on the Internet, modified, or corrupted?
The answers should help you define an 'appropriate' level of protection, which should then be explained, along with the possible risks, to the board, who can then decide whether it should go in.
(tweak the above based on company size, country etc)

Hacking: how do I find security holes in my own web application? Did I do a good job securing it? [closed]

Closed 9 years ago. This question needs to be more focused and is not currently accepting answers.
Let's say I just finished writing a web application (it never really is finished, right?). I did my best, applying what I know, to prevent any security issues.
But how do I find out whether what I wrote is actually secure?
Are there any (free?) tools available?
Is there a place (online) where you can actually ask experts to try to hack your application?
Your question is better suited to security.stackexchange.com.
There is a similar one there, already answered by many:
https://security.stackexchange.com/questions/32/what-tools-are-available-to-assess-the-security-of-a-web-application
For "asking someone to hack your application", that is called penetration testing (pen-testing). I doubt if there's any free service around. Just Google and pick your service provider.
If you are on Linux, you can use Nikto, a very good tool for finding even the smallest holes in your website.
Just run
sudo apt-get install nikto
in your terminal to install it.
OWASP has a Testing Guide that you can use to test your web application. Most of its tests also include a list of suitable tools for manual or automated testing.
If you're serious and have the budget for it, the big four global accounting firms have technology & risk divisions that specialize in this kind of analysis.
Depending on what tools your web application uses, you can always google "hacking" plus the name of whatever you are using. If, for example, you are using PHP, google "hacking php"; the same goes for MySQL, etc.
Check whether your code allows PHP/MySQL injections, for example (a sketch follows below).
Web applications are never really secure. The more you understand about the tools you are using, and the more you care about security (and are willing to spend money on improving it), the more secure your web app can be.
But it also might not be worth the struggle; just google the common security issues with the tools you are using and try to avoid them.
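To make the injection check concrete, here is a small Python sketch with sqlite3 (an assumption for brevity; the answer above talks about PHP/MySQL, but the principle is identical): never paste user input into SQL, use parameterized queries.

    # Demonstrates why string-built SQL is dangerous and parameterized SQL is not.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
    conn.execute("INSERT INTO users VALUES ('alice', 0)")

    user_input = "nobody' OR '1'='1"

    # Vulnerable: the input becomes part of the SQL text.
    unsafe = f"SELECT * FROM users WHERE name = '{user_input}'"
    print("unsafe:", conn.execute(unsafe).fetchall())   # returns every row

    # Safe: a parameterized query treats the input as data, not SQL.
    safe = "SELECT * FROM users WHERE name = ?"
    print("safe:", conn.execute(safe, (user_input,)).fetchall())  # returns nothing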

Best library/framework for web analysis and automation? [closed]

Closed 4 months ago. This question does not meet Stack Overflow guidelines (it seeks recommendations for tools and libraries) and is not currently accepting answers.
I am asking a pretty high-level question here in order to hopefully get to know some of the pitfalls before setting out. I am planning an application that will visit specific web sites to collect, process and format tabular data. It must then somehow take certain web browser actions (follow a link, post a form, click a button etc) in response to the data that has been collected, giving feedback if something breaks in the process. A central requirement is that it must be easily adaptable to different pages, i.e. the data and menu options on the web pages are largely the same, but formatted differently. The format of the page can change without notice, so error detection and handling must be good.
I was thinking of going with C# and simply using the WebBrowser class in .NET, since it at least has good facilities for manipulating the DOM and running JavaScript without any additional configuration. However, I am reasonably language agnostic. The major thing I am worried about is that WebBrowser doesn't seem to be as well developed for actually performing actions (mouse clicks, etc.), and I am wondering if this is going to bite me in the ass. Also, it is a plus if the program behaves indistinguishably from a human user when seen from the server side.
Has anyone here worked with these kinds of tasks? I have to emphasize that I am not doing testing of web applications here; this is more a robot. Are there any libraries/frameworks out there that are better suited than the .NET standard library with regards to flexibility and ease of use? Are there any major pitfalls to look out for?
I suggest you look at mechanize in combination with BeautifulSoup. It's Perl or Python rather than .NET, but it's exactly what you need.
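To give a feel for that combination, here is a minimal Python sketch; the URL, form index, field names and table layout are all placeholders, and note that mechanize does not execute JavaScript, which matters given the question's requirements.

    # Minimal mechanize + BeautifulSoup sketch: log in, scrape a table, follow a link.
    # All names and URLs below are placeholders for illustration only.
    import mechanize
    from bs4 import BeautifulSoup

    br = mechanize.Browser()
    br.set_handle_robots(False)   # only where you are permitted to scrape
    br.addheaders = [("User-agent", "Mozilla/5.0 (data-collection robot)")]

    br.open("https://example.com/login")
    br.select_form(nr=0)          # pick the first form on the page
    br["username"] = "my_user"    # field names depend on the actual page
    br["password"] = "my_password"
    response = br.submit()

    # Parse the resulting page and pull out tabular data.
    soup = BeautifulSoup(response.read(), "html.parser")
    for row in soup.select("table tr"):
        cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
        if cells:
            print(cells)

    # Follow a pager link by its visible text (raises LinkNotFoundError if absent).
    br.follow_link(text="Next")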
