Can a web parser differentiate between static and dynamic text on a webpage?
For example, there is a string on a webpage:
Hello "Fantastic Four"
Here "Hello" is static data and "Fantastic Four" is dynamic data (say, populated from a database value).
Is it possible for a web parser to detect which content is static and which is dynamic?
I don't think that's possible. The client can't know anything about the code executing on the server, so there is no way to know whether the text was generated by PHP, ASP, or any other language, or whether it is static.
You can look at the URL and HTTP headers to make an educated guess if the file was served statically (directly from the filesystem) or generated. Most "web page parsers" don't get this information, however, and almost all generated pages have static bits in them. (Sometimes those are included directly in the source code, or they could be from a template or SSI file.) Distinguishing those static bits from the rest is impossible.
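As a rough illustration of the "educated guess" from HTTP headers, here is a minimal Python sketch. The function name guess_origin and the particular headers it checks are my own assumptions, not an established rule: file servers commonly send ETag, Last-Modified and Content-Length together, while Set-Cookie or X-Powered-By often point to generated output. It classifies the whole response only and says nothing about which parts of the page text are static.

```python
def guess_origin(headers):
    """Heuristic guess at whether a response came from a file or from code.

    `headers` is a mapping of response header names to values. The header
    choices below are assumptions for illustration; any server can send or
    omit any of them, so treat the result as a hint only.
    """
    names = {name.lower() for name in headers}

    # Headers that usually come from application code rather than a file server.
    if names & {"set-cookie", "x-powered-by"}:
        return "probably generated"

    # File servers typically know the file's size and modification time.
    if {"etag", "last-modified", "content-length"} <= names:
        return "probably static"

    return "unknown"
```

For example, a response carrying only ETag, Last-Modified and Content-Length is reported as "probably static", while one that sets a cookie is reported as "probably generated".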
Related
I'm not a developer/programmer. I'm just someone trying to use Gitit to take notes. I've got it to the point where it runs on Windows, but the math looks best using MathJax. I don't want to rely on a remote CDN to get MathJax working (power cuts and internet disconnections are very frequent here). The author of the app mentions it can be set up in "4 lines of code" in Happstack:
mathjax-script: https://d3eoax9i5htok0.cloudfront.net/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML
# specifies the path to MathJax rendering script.
# You might want to use your own MathJax script to render formulas without
# Internet connection or if you want to use some special LaTeX packages.
# Note: path specified there cannot be an absolute path to a script on your hdd,
# instead you should run your (local if you wish) HTTP server which will
# serve the MathJax.js script. You can easily (in four lines of code) serve
# MathJax.js using http://happstack.com/docs/crashcourse/FileServing.html
# Do not forget the "http://" prefix (e.g. http://localhost:1234/MathJax.js)
The link to the tutorial is broken, so I'd be grateful for some assistance. Is there any MathJax configuration I need to change, or will simply extracting the files do? I'll be writing lots of math in Gitit. I'd prefer not to set up Apache etc. to serve MathJax. Gitit already uses Happstack, so I'd prefer to use that. Thanks!
EDIT: Just to be clear, I'm not sure how to assign port 1234 to serve this script.
OK, I got MathJax working using portable Apache and the MathJax archive downloaded from docs.mathjax.org. The URL needs to be of the form (assuming you extracted the files into apache2/htdocs/MathJax):
http://localhost/MathJax/MathJax.js?config=TeX-AMS-MML_HTMLorMML
I wanted to keep this lightweight by reusing the same instance of Happstack as Gitit, but that seems beyond my skills/available time right now.
EDIT: Just found out that GHC packs everything into one exe when building. So I doubt it is even possible to use the same Happstack instance, since the server's root directory doesn't exist?
From the documentation, the static directory should work just fine:
On receiving a request, gitit always looks first in the static
directory (or in whatever directory is specified for static-dir in the
configuration file). If a file corresponding to the request is found
there, it is served immediately. If the file is not found in static,
gitit next looks in the static subdirectory of gitit's data file
($CABALDIR/share/gitit-x.y.z/data). This is where default css, images,
and javascripts are stored. If the file is not found there either,
gitit treats the request as a request for a wiki page or wiki command.
So, you can throw anything you want to be served statically (for
example, a robots.txt file or favicon.ico) in the static directory.
You can override any of gitit's default css, javascript, or image
files by putting a file with the same relative path in static. Note
that gitit has a default robots.txt file that excludes all URLs
beginning with /_.
(source: https://github.com/jgm/gitit)
Download the MathJax.js file from e.g. cdn.mathjax.org and place it in data/static/js/MathJax.js. Then change the config you quote to:
mathjax-script: http://localhost:5001/js/MathJax.js
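If the static directory route does not work out, the local HTTP server that the config comment describes does not have to be Apache. Below is a minimal sketch using only Python's standard library instead of Happstack (a swapped-in tool, not what the Gitit author suggests). The helper name make_server and the directory name "MathJax" are assumptions for illustration; port 1234 is taken from the example URL in the config comment.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, port=1234):
    """Build an HTTP server that serves files from `directory` on localhost.

    `directory` should be the folder the MathJax archive was extracted into
    (assumed here to contain MathJax.js at its top level).
    """
    # Bind the handler to the chosen directory instead of the current one.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    # Listen on the loopback interface only, so nothing is exposed to the network.
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   make_server("MathJax", 1234).serve_forever()
```

With that running, the config line would presumably become mathjax-script: http://localhost:1234/MathJax.js, matching the form given in the config comment.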
If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site.
What do they mean exactly? And how do I do it?
Your question: What do they mean exactly?
"If you want to serve a static equivalent of your site": static refers to HTML pages that are not dynamically created.
"you might want to consider transforming the underlying content by serving a replacement which is truly static": have 'hard copies' of your pages with the different alternatives.
"One example would be to generate files for all the paths and make them accessible somewhere on your site": go through your site, create a static HTML page (or PDF) of each page, and store them in the file structure that the URL represents.
Example of the last:
http://site.tld/product/pear is today a dynamic page (created on the fly by the code and the database) and does not really live in an actual folder on the server called product. They are suggesting that you create a copy of the dynamically created page and store it in an actual folder on the server called product, with the name pear.
Your question: And how to do it?
Will that work? Sort of. You could add .html to the physical file (the copy of the dynamic page) and save it, but I suspect you will run into all sorts of difficulties that you will need to overcome with redirect rules in places like .htaccess. Another option may be to change the domain part of the URL to include static, i.e. http://static.site.tld/ for the static copies, and keep the original URL as is for the dynamic version.
The other big challenge then becomes maintaining the two copies because the concept they talk about is for the content (what is shown in the browser) to remain static over time. Kind of breaks the whole concept of how we build dynamic web sites today e.g. online shops etc.
For example, if it's a shop, I would use PHP to also create the physical file when a product is added, leaving out the parts that are going to change and including a link to the dynamic info instead, something like:
<?php
$file = 'product/pear.html';
// mysql code here to extract the info and format ready for writing
$content = "<html><head><title>$title_from_db</title></head><body>$page_content_from_db</body></html>";
// Write the contents to the file
file_put_contents($file, $content);
?>
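The "generate files for all the paths" idea can also be sketched more generally: map each URL path to a file under an output folder and write the rendered HTML there. The following is a minimal Python sketch (Python rather than the PHP above, purely for illustration). The names static_path_for and write_static_copy and the output folder name are hypothetical, and the HTML string stands in for whatever your site actually renders.

```python
from pathlib import Path


def static_path_for(root, url_path):
    """Map a URL path such as '/product/pear' to '<root>/product/pear.html'."""
    # Drop empty segments and any '.' or '..' so files stay inside `root`.
    parts = [p for p in url_path.split("/") if p not in ("", ".", "..")]
    if not parts:
        parts = ["index"]  # the site root becomes index.html
    target = Path(root).joinpath(*parts)
    return target.with_name(target.name + ".html")


def write_static_copy(root, url_path, html):
    """Write the rendered page to its static location and return the path."""
    target = static_path_for(root, url_path)
    target.parent.mkdir(parents=True, exist_ok=True)  # e.g. create 'product/'
    target.write_text(html, encoding="utf-8")
    return target
```

Calling write_static_copy("static_site", "/product/pear", html) would create static_site/product/pear.html, which mirrors the product/pear example above.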
Say you have a CSS file loader, style.php:
<?php
header('Content-Type: text/css');
// Output each stylesheet in order as one combined response
foreach (array('style1.css', 'style2.css', 'style3.css') as $f) {
    echo file_get_contents($f);
}
?>
style1.css is 12KB, style2.css is 400kgs, and in the red corner, obese style3.css, weighing 800LBs, is world champion at static resource bandwidth consumption!
I'm using style.php to combine the three files and send them to the client. I'm also using similar php files to send out JS resources, combined.
Is there some .htaccess rule I can use to combine several static resources into one big file and send that on the fly?
/EDIT:
This type of job CAN be handled by .htaccess. I'm sure I've read somewhere about server files being included, or something like that, but I don't remember where. And I've also seen free hosting services that add a custom header or banner regardless of what files you host there.
Well, this type of job (combining CSS files) cannot be handled by .htaccess. At best, you can use mod_deflate to compress the CSS file's contents.
However, in PHP code you can combine and compress various CSS files. Take a look at the third method in http://www.catswhocode.com/blog/3-ways-to-compress-css-files-using-php
Finally take a look at minify here: http://www.minifycss.com/minify-tools/minify-css-tools.php
Eventually found what I was looking for. The thing was called Server-Side Includes (SSI).
I've been reading this but I was just wondering, does Solr have the capability to search static files (i.e. outside of a content management system or a database)?
Some of my files are just straight up html...or server side code with html "blocks"...
Solr can index any text input. The important bit is that it indexes text. So if your static files are not text files, you may need to run them through a tool like Tika first. Then Solr should have no problem indexing the extracted textual data.
There is the ExternalFileField field type, but its use looks limited.
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
I'm developing a file upload with JSF. The application saves three pieces of data about the file:
Filename
Bytes
Content-Type as submitted by the browser.
My problem is that some files are saved with content type = application/octet-stream even if they are *.doc or *.pdf files.
When does the browser submit such a content type?
I would like to clean up the database, so I need to know when the browser's information is incorrect.
Ignore the value sent by the browser. This is indeed dependent on the client platform, browser and configuration used.
If you want full control over content types based on the file extension, then better determine it yourself using ServletContext#getMimeType().
String mimeType = servletContext.getMimeType(filename);
The default MIME types are defined in the web.xml of the servlet container in question. In Tomcat, for example, it's located in /conf/web.xml. You can extend/override it in the webapp's /WEB-INF/web.xml as follows:
<mime-mapping>
<extension>xlsx</extension>
<mime-type>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime-type>
</mime-mapping>
You can also determine the mime type based on the actual file content (because the file extension may not per se be accurate, it can be fooled by the client), but this is a lot of work. Consider using a 3rd party library to do all the work. I've found JMimeMagic useful for this. You can use it as follows:
String mimeType = Magic.getMagicMatch(file, false).getMimeType();
Note that it doesn't detect all MIME types equally reliably. You can also consider a combination of both approaches. E.g. if one returns null or application/octet-stream, use the other. Or if the two return different but "valid" MIME types, prefer the one returned by JMimeMagic.
Oh, I almost forgot to add, in JSF you can obtain the ServletContext as follows:
ServletContext servletContext = (ServletContext) FacesContext.getCurrentInstance().getExternalContext().getContext();
Or if you happen to use JSF 2.x already, use ExternalContext#getMimeType() instead.
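The combination of both approaches can be sketched independently of JSF. The following is a minimal Python illustration, not the JMimeMagic or Servlet API: it checks the first bytes of the upload against a small, deliberately incomplete signature table, falls back to the file extension via the standard mimetypes module, and only then gives up with application/octet-stream. The names SIGNATURES, sniff_content and detect_mime_type are hypothetical.

```python
import mimetypes

# Leading "magic" bytes for a few common formats (illustrative, far from complete).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_content(head):
    """Return a MIME type based on the first bytes of the file, or None."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def detect_mime_type(filename, head):
    """Prefer the content-based guess, then the extension, then a generic type."""
    by_content = sniff_content(head)
    if by_content is not None:
        return by_content
    by_name, _encoding = mimetypes.guess_type(filename)
    return by_name or "application/octet-stream"
```

Here head would be the first few bytes read from the uploaded file. A file named photo.pdf that actually starts with the PNG signature is reported as image/png, which mirrors the "prefer the content-based result" rule above.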
It depends on the OS, the browser, and how the user has configured them. It's based on the way the browser determines the file type of local files (to display them). On most OS/browser combinations this is based on the file's extension, but on some it may be determined by other means (e.g. on Mac OS).
In any case, you shouldn't really rely on the Content-Type sent by the browser. The best approach would be to actually look at the contents of the file. You could probably also use the filename, but keep in mind that browsers aren't necessarily going to be good about telling you that either (though it's probably still a lot more reliable than the Content-Type they send).