Saving webpages to a local box - Linux

Is there any way to download a webpage to my local box on the first visit, so that all subsequent visits retrieve the data from the local box rather than the internet? That is, something like a service running on a port: if I access that port instead of the HTTP port, I get the data from the local box.
I need this for parsing webpages whose contents might change on every visit, so that I always get the same content to work with.

You can use a caching proxy such as Squid.
The Squid service stores the webpages locally, and subsequent requests return the stored copy.
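For example, a minimal Squid setup might look like the sketch below. The paths and port are common defaults and may differ on your system, and you may also need offline_mode (or aggressive refresh_pattern rules) so Squid keeps serving the stored copy even after the origin page changes; treat this as an assumption to verify against the Squid documentation.

    # /etc/squid/squid.conf (minimal sketch)
    http_port 3128                               # port your parser will talk to
    cache_dir ufs /var/spool/squid 1000 16 256   # on-disk cache, roughly 1 GB
    http_access allow localhost                  # only local clients may use it
    offline_mode on                              # prefer cached copies over revalidation

Then point your HTTP client at the proxy instead of fetching directly, e.g. set http_proxy=http://localhost:3128 before running wget or your parser.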

Sounds like you're talking about a proxy server.

"I need to use this service for parsing webpages whose contents might change"
Have a look at a spidering engine, e.g. pavuk.

Wget using domain name from within server - how to reduce DNS lookups?

I have a requirement to use the wget command from within the hosting server to download a bunch of HTML files to a particular folder. I am placing the URLs to download in a text file and using the -i (input file) flag of wget. The URLs are of the form https://.com/page1.php, https://.com/page2.php, and so on.
In that case, I believe there will be a DNS resolution for each and every request. Is there a way to optimize the DNS lookups?
You don't need anything else: wget internally caches DNS responses within a single run, so after the first request it will not send any more DNS queries but will directly reuse the IP addresses it has already resolved.
In general, if you want DNS caching beyond a single run, you can install a DNS caching service such as pdnsd on your server.
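For reference, a typical invocation for this kind of batch download (file name and destination folder are placeholders) is:

    # urls.txt contains one URL per line; -P sets the destination folder.
    # Within this single run, wget resolves the host once and reuses the address.
    wget -i urls.txt -P /var/www/downloads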

Saving and accessing files on a mounted drive in Node.js

I have 3 servers running as a cluster managed by Flynn (a bit like Heroku), to which I can attach a shared drive so that, in theory, they all have access to the same files.
However, I'm a bit unsure whether it's possible to access a mounted drive from my Node.js app to save and read files.
Can anyone shed some light on whether this is possible and roughly how I'd go about doing it?
With Node.js, your file system paths have literally nothing to do with the URLs that your server supports. Node.js servers do not serve ANY files by default (unlike some other servers).
If you want an incoming browser request for http://example.com/uploads/test.jpg to read a file from /mnt/shared/uploads, then you have to create a web server route handler that matches the incoming path, reads the data from /mnt/shared/uploads, and writes that data out as the HTTP response.
Depending upon what web server environment you are using, there are helpers to do that mapping. For example, Express has express.static(), which handles some of the mapping automatically. But the web server by itself does not serve any files like this on its own.
So, if what you want is that all incoming requests for http://example.com/uploads/* will be read from /mnt/shared/uploads/*, then you can use express.static() to help you do that like this:
app.use("/uploads", express.static("/mnt/shared/uploads"));
This will take any path it finds after /uploads and look for that path in /mnt/shared/uploads. If a matching file is found, it is automatically served as static content.
So, it works like this, with the incoming URL shown first and the place where express.static() would look for a matching file:
/uploads/test.jpg ==> /mnt/shared/uploads/test.jpg
/uploads/bob/test.txt ==> /mnt/shared/uploads/bob/test.txt
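Putting it together, a minimal server might look like this sketch (the mount path and port are just the assumptions carried over from the example above):

    // Minimal sketch: serve files from the shared mount under /uploads.
    const express = require("express");
    const app = express();

    // Any request to /uploads/<path> is looked up in /mnt/shared/uploads/<path>.
    app.use("/uploads", express.static("/mnt/shared/uploads"));

    app.listen(3000, () => {
      console.log("Listening on port 3000");
    });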

How do I redirect traffic from a domain on other servers to content on mine?

Here's the basic situation:
I have an application on AWS which needs to serve assets to and create 'share' links for content hosted on my AWS servers. I need to figure out a way to still use the URL/domain from another client's infrastructure, so it will essentially whitelabel our application as coming from their services. I was thinking of using Route 53 and a CNAME, but things like the dynamic 'share' URLs will create a huge problem for redirects. Does anybody have any ideas on how this could be accomplished?
I think that you will have to set up your server at the "whitelabeling" location to run a web server that can call the other URLs and return their content. That is, you create a server that responds at whitelabel.com, which then calls myAWS.com and passes the result back to whoever called whitelabel.com. You could make this flexible by allowing the end destination URL to be passed in as a parameter (so, if you call whitelabel.com/foo, it will call myAWS.com/foo), though this has some security ramifications and also requires the consumer to know exactly where things will reside.
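As a rough illustration of that pass-through idea (the whitelabel/myAWS host names are placeholders, and this ignores TLS termination, error handling, and the security caveats mentioned above), a Node.js sketch might look like:

    // Sketch: a server answering for whitelabel.com that forwards each request
    // to the AWS-hosted app and streams the upstream response back.
    const http = require("http");
    const https = require("https");

    http.createServer((clientReq, clientRes) => {
      // whitelabel.com/foo becomes a request to myAWS.example.com/foo
      const upstreamReq = https.request(
        {
          hostname: "myAWS.example.com",          // placeholder for your AWS app
          path: clientReq.url,
          method: clientReq.method,
          headers: { ...clientReq.headers, host: "myAWS.example.com" },
        },
        (upstreamRes) => {
          clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
          upstreamRes.pipe(clientRes);
        }
      );
      clientReq.pipe(upstreamReq);
    }).listen(80);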

Using the HTTP protocol between servers

I have a configuration of two servers working in an intranet.
The first is a web server that produces HTML pages for the browser; this HTML sends requests to the second server, which produces and returns reports (also HTML) according to the value of a GET parameter.
Since this solution is insecure (the passed parameter is exposed), I thought about having the HTML (produced by the first server) send the report requests back to the first server; there, a security check would be made, and the report request would be sent to the reports server over HTTP between the servers, instead of from browser to server.
The report's markup would be returned to the first server (as a string?), added to the response object, and presented in the browser.
Is this a common practice with HTTP?
Yes, it's a common practice. In fact, it works the same way when your web server needs to fetch data from a database (not publicly exposed, i.e. not in the web server's DMZ, for example).
But you need to be able to use dynamic page generation (not static HTML; let's suppose your web server supports PHP or Java, for example). Then:
Your page does the equivalent of an HTTP GET (or POST, or whatever you like) to your second server, sending any required parameters. You can use the cURL libraries, fopen('http://...'), etc.
It receives the result, checks the return code, and can also do optional content manipulation (like replacing some text or URLs).
It sends the result back to the user's browser (see the sketch after this list).
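In PHP you would typically do that with cURL or fopen(); the same pattern, sketched here in Node.js (host names, paths, the parameter, and the authorization check are all placeholders), looks roughly like this:

    // Sketch: server 1 checks the request, fetches the report from server 2
    // over the intranet, and returns the markup to the browser.
    const express = require("express");
    const app = express();

    // Placeholder check; replace with your real authorization logic.
    function userIsAllowed(req) {
      return Boolean(req.get("x-internal-user"));   // hypothetical header check
    }

    app.get("/reports", async (req, res) => {
      // The security check happens here, before anything reaches server 2.
      if (!userIsAllowed(req)) {
        return res.status(403).send("Forbidden");
      }
      const upstream = await fetch(                  // global fetch needs Node 18+
        "http://server2/internal/reports?param1=" +
          encodeURIComponent(req.query.param1 || "")
      );
      const markup = await upstream.text();          // the report HTML as a string
      res.send(markup);                              // pass it on to the browser
    });

    app.listen(8080);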
If you can't (or won't) use dynamic page generation, you can configure your web server to proxy some requests to the second server (for example with Apache's mod_proxy).
For example, when a request comes to server 1 for the URL "http://server1/reports", the web server proxies a request to "http://server2/internal/reports?param1=value1&param2=value2&etc".
The user gets the result of "http://server2/internal/reports?param1=value1&param2=value2&etc" but never sees where it came from (from his point of view, he only knows about http://server1/reports).
You can do more complex manipulations by combining proxying with URL rewriting (so you can use some parameters of the request to server1 in the request to server2).
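With Apache, that basic proxying can be expressed with mod_proxy directives roughly like this (server names and paths are the placeholders from the example above; mod_proxy and mod_proxy_http must be enabled):

    # Sketch for server1's Apache configuration
    ProxyPass        "/reports" "http://server2/internal/reports"
    ProxyPassReverse "/reports" "http://server2/internal/reports"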
If it's not clear enough, don't hesitate to give more details (OS, web server technology, URLs, etc.) so I can give you more hints.
Two other options:
Configure the Internet-facing HTTP server with a proxy (e.g. mod_proxy in Apache).
Leave the server as it is and add an application firewall.

Preferred way to direct users' domain names to my web app?

Background context: ASP.NET / IIS (not sure if it matters)
I have a web app at example.com, and a user of my app gets his own content page at an address like example.com/abc-trinkets. I would like to offer the user the ability to point his own registered domain at my web app so his content is accessed at abctrinkets.com. Initially I'm looking at on the order of 10-100 users with custom domains.
Ideally, I would like my user to have just a single hostname or IP address that he needs to configure with his registrar, and if I change the setup of my servers (different host, changed addresses, load balancing, etc.), the user will not have to change his settings.
I have no trouble handling the requests once they hit my web app, but I am looking for input on the best way to set up the routing so requests actually reach my app/server. I would like a "catch-all" type of behavior that does not require me to individually configure anything for each domain a user might point at me.
I assume I will need some kind of layer between the address I give my user and my actual server... is this something like a managed DNS service or some other type of nameserver setup I would arrange with my host? Or is it something simple that should already be handled by a few settings on my web server? I worry that I am making this more complicated than it needs to be.
Write a script that examines the Host header in the request. In your example, if it's abctrinkets.com, then you'd either redirect or forward the request to /abc-trinkets. You'd still need a database or something for mapping the domain names to the URLs; if you're going to allow arbitrary domain names for each user account, then there's no possible way to avoid that.
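Your stack is ASP.NET/IIS, but as a stack-neutral sketch of that Host-header mapping, here is a small Express example; the domain-to-path table and all names are made up, and in practice the mapping would come from your database.

    // Sketch: map an incoming Host header to the user's content path.
    const express = require("express");
    const app = express();

    // Hypothetical mapping; load this from a database in a real app.
    const domainMap = {
      "abctrinkets.com": "/abc-trinkets",
    };

    app.use((req, res, next) => {
      const target = domainMap[req.hostname];
      if (target && !req.path.startsWith(target)) {
        // Internally rewrite the URL so the normal route handler serves the page.
        req.url = target + req.url;
      }
      next();
    });

    app.get("/abc-trinkets", (req, res) => res.send("User content page"));

    app.listen(80);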
