Why should one use Privoxy when doing scraping? - tor

I've seen some people explaining the use of Tor and Privoxy together for scraping.
- Why does someone have to use Privoxy + Tor instead of Tor alone?
- Is Tor not enough by itself?
- What is the advantage of adding Privoxy?

Related

What security risks are posed by using a local server to provide a browser-based gui for a program?

I am building a relatively simple program to gather and sort data input by the user. I would like to use a local server running through a web browser for two reasons:
HTML forms are a simple and effective means for gathering the input I'll need.
I want to be able to run the program off-line and without having to manage the security risks involved with accessing a remote server.
Edit: To clarify, I mean that the application should be accessible only from the local network and not from the Internet.
As I've been seeking out information on the issue, I've encountered one or two remarks suggesting that local servers have their own security risks, but I'm not clear on the nature or severity of those risks.
(In case it is relevant, I will be using SWI-Prolog for handling the data manipulation. I also plan on using the SWI-Prolog HTTP package for the server, but I am willing to reconsider this choice if it turns out to be a bad idea.)
I have two questions:
What security risks does one need to be aware of when using a local server for this purpose? (Note: In my case, the program will likely deal with some very sensitive information, so I don't have room for any laxity on this issue).
How does one go about mitigating these risks? (Or, where should I look to learn how to address this issue?)
I'm very grateful for any and all help!
There are security risks with any solution. You can use tools proven over the years and still get hacked one day (I speak from my own experience), or you can pay a lot for a security solution and never be hacked. So you always need to weigh the effort against the impact.
Basically, you need to protect four "doors" in your case:
1. Authorization (password interception or, for example, improper usage of cookies)
2. The HTTP protocol level
3. Application input
4. Other ways to access your database (not via HTTP: for example, through an SSH port with a weak password, or by someone taking your computer or hard disk. In some cases you need to properly encrypt the volume.)
Doors 1 and 4 are not specific to Prolog, but door 4 is the only one with some specifics in the case of a local server.
Protecting the HTTP protocol level means not allowing requests that could take control of your SWI-Prolog server. For this purpose I recommend installing a reverse proxy such as nginx, which can block attacks at this level, including some types of DoS. The browser will contact nginx, and nginx will forward the request to your server only if it is a well-formed HTTP request. You can use any other server instead of nginx if it has similar features.
You need to install a proper SSL certificate and enable SSL (HTTPS) in the reverse proxy, not in your SWI-Prolog server. HTTPS encrypts all traffic to the client, while the proxy talks to SWI-Prolog over plain HTTP.
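A minimal sketch of such a reverse-proxy setup, in nginx configuration syntax (the server name, certificate paths, and port numbers are placeholders; it assumes the SWI-Prolog server listens on plain HTTP on the loopback interface only):

    server {
        listen 443 ssl;
        server_name myapp.local;                     # placeholder host name

        ssl_certificate     /etc/ssl/certs/myapp.crt;    # placeholder certificate
        ssl_certificate_key /etc/ssl/private/myapp.key;  # placeholder key

        location / {
            # Forward well-formed HTTPS requests to the SWI-Prolog server,
            # which listens on plain HTTP on localhost only.
            proxy_pass http://127.0.0.1:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }

With something like this in place, only requests that nginx accepts as valid HTTP ever reach the Prolog process.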
Think about authorization. Some methods can be broken very easily. You need to study this topic; there is a lot of information available. I think this is the most important part.
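One well-known building block here is to store only salted, slow password hashes rather than the passwords themselves. A minimal sketch in Python (the iteration count and the placeholder password are illustrative assumptions; this is only one piece of an authorization scheme, not a complete one):

    import hashlib
    import hmac
    import os

    # Store only a salted, slow hash of each password, never the password itself.
    def hash_password(password, salt=None, iterations=600_000):
        salt = salt if salt is not None else os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
        return salt, digest

    def verify_password(password, salt, digest, iterations=600_000):
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
        return hmac.compare_digest(candidate, digest)  # constant-time comparison

    salt, digest = hash_password("placeholder-password")
    assert verify_password("placeholder-password", salt, digest)
    assert not verify_password("wrong-guess", salt, digest)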
The application input problem: the famous example is SQL injection. Study the known examples. All good web frameworks have "entry" procedures that sanitize all possible injections; take existing code and rewrite it in Prolog.
Also, test all input fields with very long strings, different character sets, and so on.
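A minimal illustration of that kind of whitelist validation, sketched here in Python because the principle is language-independent (the field names, length limit, and allowed characters are assumptions made for the example; the idea is what you would port to your Prolog code):

    import re

    # Illustrative whitelist rules; adapt the fields and patterns to your own forms.
    MAX_LEN = 200
    FIELD_PATTERNS = {
        "username": re.compile(r"^[A-Za-z0-9_.-]{1,32}$"),
        "comment": re.compile(r"^[\w\s.,!?'-]{1,200}$"),
    }

    def validate(field, value):
        """Reject unknown fields, overly long input, and unexpected characters."""
        if not isinstance(value, str) or len(value) > MAX_LEN:
            return False
        pattern = FIELD_PATTERNS.get(field)
        return bool(pattern and pattern.match(value))

    assert validate("username", "alice_01")
    assert not validate("username", "x" * 10_000)              # very long string
    assert not validate("comment", "'; DROP TABLE users;--")   # unexpected characters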
As you can see, security is not easy, but you can choose an appropriate level of effort by weighing it against the impact of being hacked.
Also, think about the likely attacker. If somebody is specifically interested in getting your information, all the methods mentioned above matter, but that is a rare case. Most often hackers just scan the Internet and try known exploits against every server they find. In that case your best friends are honeypots and Prolog itself, because the probability that a hacker is interested in SWI-Prolog internals is extremely low (an attacker would need to study the server code well to find a way in).
So I think you will find adequate methods to protect all sensitive data.
But please, never use passwords made of dictionary-word combinations, and never reuse the same password for more than one purpose; that is the most important rule of security. For the same reason you shouldn't give your users access to all information; that protection should be part of the application-level design.
The concerns specific to a local server are a good firewall, a proper network setup, and encryption of the hard drive partition in case your local server can be physically stolen by the "hacker".
But if you mean the application should be accessible only from your local network and not from the Internet, you need much less effort: mainly you need to check your router/firewall setup and the fourth door in my list.
If you have a very limited number of known users, you can simply ask them to use a VPN instead of protecting your server as you would for "global" access.
I'd point out that my post was about a security issue with using port forwarding in Apache to access a Prolog server.
And I do know of a successful Prolog-injection DoS attack on a website built with the SWI-Prolog HTTP framework. I don't believe the website's author wants the details made public, but the possibility is certainly real.
Obviously this attack vector is only possible if the site evaluates Turing-complete code (or code it cannot prove will terminate).
A simple security precaution is to check the Request object and reject requests from anything but localhost.
I'd also point out that, by default, the pldoc server responds only on localhost.
- Anne Ogborn
I think the SWI-Prolog HTTP package is an excellent choice. Jan Wielemaker has put much effort into making it secure and scalable.
I don't think you need to worry about SQL injection; indeed, it would be strange to rely on SQL when you have the power of Prolog at your fingertips...
Of course, you need to properly manage HTTP access in your server...
Just this morning there was an interesting post on the SWI-Prolog mailing list about this topic: Anne Ogborn shares her experience...

In Haskell, how can I write an HTTP client to traverse a website and submit forms?

I'm pretty sure Network.Browser is the library I want to use, but I'm not sure how to use it. I'm a Haskell newbie. I've read Learn You a Haskell and a third of Real World Haskell.
I want to write a program that visits a website, logs in to it (which would require submitting a form and cookies), and then gives me the HTML of some pages.
I'd like to see some examples of how to do these things. The documentation only gives one example. Also, please teach a man to fish: if there is some other place I should be looking to find examples (IMO the best way to learn how to use a library), I'd like to know. Reading the API documentation isn't cutting it.
Side note: the library named Shpider looks perfect, but I'm on Windows and I can't figure out how to install and use curl, which is one of the libraries it depends on.

Simple DNS Server in Node.JS? (Primary/Authoritative DNS Server) (maybe ndns?)

Does anybody know of a DNS server written in Node.JS? I am specifically interested in authoritative DNS servers (as opposed to caching DNS servers).
The only thing this needs to do is to serve A, MX, TXT, SPF, SOA, NS records based on my own algorithm which I will write into a fork or clone of whatever I find to start with.
In fact I may not need all of those types of records. But the important thing is that it must work. I do not want to have DNS debugging issues. I am hoping (expecting) this will not be a problem because DNS is very simple (I have heard).
Is there anything in Node.JS I can start with? If you know that something has been used in production, then please let me know.
The Node.JS DNS Servers I have found are
dnsserver.js (alternate link)
ndns which has an extension called mdns
dns-server
If anyone is using one for production, I would like to know. So far they seem to be very scattered efforts.
Check out https://github.com/tjfontaine/node-dns
Here's "a very basic authority server built with Node.js", in < 500 lines: dnsserver.js
After reviewing all the available Node.js DNS libraries, I found dns2 to be one of the best libraries available in 2020 that is still maintained.
Some of its features:
Pure JavaScript implementation with no dependencies
Both server and client
Many record types supported
Extremely lightweight
DNS over UDP, TCP, and HTTPS supported
https://github.com/song940/node-dns
npm install dns2
I found a DNS server written in Node.js, fun_dns; the source is on GitHub.
bns: a DNS library, server, and validating recursive resolver for Node.js, in pure JavaScript.
Since Java is okay for you, you could have a look at the Eagle DNS project. It is written in Java, supports both MySQL and file-based stores for the records, and allows you to write your own module if those don't fit your needs:
http://www.unlogic.se/projects/eagledns

web crawling, ruby, python, cassandra

I need to write a script that inserts one million records of usernames or emails into a database by crawling the web.
The script can be in any language, such as Python, Ruby, PHP, etc.
Please let me know whether this is possible, and if so, please provide information on how I can build the script.
Thanks
You should also look at Apache Nutch and Apache Gora, which together do what you're looking for: Nutch does the actual crawling, while Gora stores the results in Cassandra, Hive, or MySQL.
It's possible, though it may take some time depending on your machine's performance and your Internet connection. You could use PHP's cURL library to send web requests automatically and then easily parse the data with a library such as Simple HTML DOM or the native PHP DOM. But beware of running out of memory, and I highly recommend running the script from the shell rather than a web browser. Also consider using the multi-cURL functions to speed up the process.
This is extremely easy and fast to implement, although multi-threading would give a huge performance boost in this scenario, so I suggest using one of the other languages you proposed. I know you could do this easily in Java using the Apache HttpClient library, manipulating the DOM and extracting data with the native XPath support, regexes, or one of the many third-party DOM implementations for Java.
I also strongly recommend checking out the Java library HtmlUnit, which could make your life much easier, though you might take a performance hit for it. A good multi-threading implementation would give a huge performance boost, but a bad one could make your program run worse.
Here are some resources for Python (a small sketch tying them together follows the list):
http://docs.python.org/library/httplib.html
http://www.boddie.org.uk/python/HTML.html
http://www.tutorialspoint.com/python/python_multithreading.htm
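To tie those resources together, here is a minimal sketch of a multi-threaded fetcher in Python 3 (the seed URLs, thread count, and table layout are assumptions for illustration; urllib.request and sqlite3 from the standard library stand in for whatever HTTP client and database you actually choose, and httplib lives on as http.client in Python 3):

    import sqlite3
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    SEED_URLS = ["http://example.com/", "http://example.org/"]  # placeholder seed list

    def fetch(url):
        """Download one page; return (url, html) or (url, None) on failure."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return url, resp.read().decode("utf-8", errors="replace")
        except Exception:
            return url, None

    def main():
        db = sqlite3.connect("pages.db")
        db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
        # Fetch pages in parallel; all database writes stay in the main thread.
        with ThreadPoolExecutor(max_workers=8) as pool:   # thread count is arbitrary
            for url, html in pool.map(fetch, SEED_URLS):
                if html is not None:
                    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, html))
        db.commit()
        db.close()

    if __name__ == "__main__":
        main()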
I would add a little on the crawling side.
You said "crawl the web", so the crawling strategy (i.e., after fetching a page, deciding which link to visit next) becomes very important. But if you already have a list of web pages (a seed-URL list), then you simply need to download them and parse out the required data. If you only need to extract email addresses, a regex is the way to go: HTML has no tag for emails, so an HTML/DOM parser won't help you there.
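For illustration, a minimal sketch of that regex approach in Python (the pattern is a deliberately rough approximation of an email address, not a full RFC 5322 matcher):

    import re

    # Rough email pattern: local part, "@", domain with at least one dot.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def extract_emails(html):
        """Return the unique email-like strings found in a page's raw text."""
        return sorted(set(EMAIL_RE.findall(html)))

    print(extract_emails('Contact <a href="mailto:info@example.com">info@example.com</a>'))
    # ['info@example.com']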

Are there any building blocks for a search engine that will scrape other sites?

I want to build a search service for one particular thing. The data is freely available out there via free classifieds services and a host of other sites.
Are there any building blocks I can use, e.g. open-source crawlers that I would customize rather than build from scratch?
Any advice on building such a product? Not just technical, but any privacy/legal things that I might need to take into consideration.
E.g., do I need to 'give credit' to the sites the results come from and link to the originals, if I get them from many places?
Edit: By the way, I am using GWT with JS for the front-end, haven't decided on the language for the back-end. Either PHP or Python. Thoughts?
There are a few building blocks in Python you can use.
BeautifulSoup [http://www.crummy.com/software/BeautifulSoup/] for parsing HTML. It can handle bad markup too, and its API is very easy... way better than any DOM-like tool for me. My friend used it to scrape his old phpBB forum with success. It has pretty good docs.
mechanize [http://wwwsearch.sourceforge.net/mechanize/] is an HTTP client library that simulates a web browser. It handles cookies, form filling, and so on. Also easy to use, but it helps if you understand how HTTP works.
http://dev.scrapy.org/ -- this is a relatively new thing: a whole scraping framework based on Twisted. I haven't played with it much.
I use the first two for my needs; for example, it took about 20 lines of code to build an automatic testing tool for a 3-stage poll, including simulating the wait for the user to enter data and so on. A rough sketch of combining the two is shown below.
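A rough, hedged sketch of that BeautifulSoup + mechanize combination: the login URL, form field names, and CSS class below are made up for illustration, the modern bs4 package name is used for BeautifulSoup, and it assumes pip install mechanize beautifulsoup4 has been run.

    import mechanize
    from bs4 import BeautifulSoup  # modern package name for BeautifulSoup

    br = mechanize.Browser()
    br.set_handle_robots(False)    # only do this on sites you are allowed to scrape

    # Log in through the (hypothetical) first form on the page.
    br.open("http://example.com/login")
    br.select_form(nr=0)
    br["username"] = "my_user"     # placeholder credentials and field names
    br["password"] = "my_password"
    br.submit()

    # Fetch a page behind the login and pull out the parts we care about.
    html = br.open("http://example.com/listings").read()
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("div", class_="listing"):   # hypothetical markup
        print(row.get_text(strip=True))

The cookies set during login are kept by the Browser object, so the second request is already authenticated.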
I made a screen-scraper in Ruby that took like five minutes. Apparently this dude has it down to 60 seconds! I'm not sure if Ruby is as scalable or fast as what you're looking for, but I've never seen a faster route to a proof-of-concept or a prototype.
The secret is a library called "hpricot", which was built for exactly this purpose.
I don't know anything about PHP or Python or what's available for those development systems/languages.
Good luck!

Resources