Ruby script Nanoc items - nanoc

I'm working on a site to generate reports on a lot of data in unpredictable formats. My (current) plan is to organize the content like so:
/content/raw/ # holds raw .csv, .json, .etc, isn't routed
/content/data/ # holds ruby scripts to generate nice formatted
# JSON from the appropriate raw data files,
# routed to /data/*.json
/content/listings/ # holds ruby scripts to generate JSON which represents
# an HTML table or HighChart object and based upon the
# formatted data items above, routed to /listings/*.json
# (and imported via AJAX to display on appropriate pages)
/content/assets/ # mostly passed through, filtering SASS to CSS, routed to
# /assets/*.ext
/content/pages/ # holds Markdown pages filtered to HTML and included in a
# layout, with a special helper to inject graphs/tables
# by identifying a listing item, routed to /*index.html
I'm not certain this is the best way to go about it, however. In particular, I'm not sure how to make nanoc know to, say, regenerate a listing that depends on a raw data file which has been replaced with a new version. I also need to know how to write the Rules so that they use Ruby code from within the item itself (and I'm not sure this is good practice). Thoughts?

Related

Add Bookmarks to pdf using Pymupdf

How do I add bookmarks to a PDF using PyMuPDF? I have seen many ways using PyPDF2, but since I'm already using PyMuPDF for other annotations I would prefer PyMuPDF for adding bookmarks. I would also like to highlight the text and add bookmarks to it.
You cannot add single bookmarks like you can in other packages.
If you have looked at the details there, or rather in the respective PDF specification, you will have seen that this is an overly and unnecessarily complex task.
PyMuPDF in contrast has this simple approach to offer:
Prepare a Python list that looks like a traditional table of contents (TOC):
Every entry in the list contains the hierarchy level, the text to display and the page number, plus optionally some information about where on the target page the pointer should go.
Then use doc.set_toc(toc_list). All the pesky details are taken care of for you.
If the PDF already has a TOC, extract it to a list of that same structure via toc_list = doc.get_toc().
Then modify as required.
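A minimal sketch of both directions, building a new TOC and extending an existing one (the file names and titles here are made up):

import fitz  # PyMuPDF

doc = fitz.open("report.pdf")  # hypothetical input file

# Build a TOC from scratch: each entry is [level, title, page],
# with page numbers starting at 1.
toc = [
    [1, "Chapter 1", 1],
    [2, "Section 1.1", 2],
    [2, "Section 1.2", 5],
    [1, "Chapter 2", 7],
]
doc.set_toc(toc)

# Or extend an existing TOC instead of replacing it.
toc = doc.get_toc()
toc.append([1, "Appendix", doc.page_count])
doc.set_toc(toc)

doc.save("report-with-toc.pdf")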

How to load only changed portion of YAML file in Ruamel

I am using the ruamel.yaml library to load and process a YAML file.
The YAML file can get updated after I have called
yaml.load(yaml_file_path)
So, I need to call load() on the same YAML file multiple times.
Is there a way, or an optimization parameter to pass to the loader, to load only the new entries in the YAML file?
There is no such facility currently built into ruamel.yaml.
If a file consists of multiple YAML documents, you can optimize the loading by splitting the file on the document marker (---). This is fairly trivial, and then you can load a single document from start to finish.
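As a rough sketch of that splitting approach (the file name and document index are made up; it assumes the --- marker only ever appears on its own line between documents):

import re
from ruamel.yaml import YAML

yaml = YAML(typ='safe')

def load_nth_document(path, n):
    # Naive split on document markers; assumes '---' never occurs
    # inside a block scalar or quoted string.
    with open(path) as fp:
        chunks = re.split(r'^---\s*$', fp.read(), flags=re.MULTILINE)
    documents = [c for c in chunks if c.strip()]
    return yaml.load(documents[n])

data = load_nth_document('data.yaml', 2)  # parse only the third document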
If you only want to reload parts of a document, things get more difficult. If there are anchors and aliases involved, there is no easy way to do this, as an updated part may contain an alias whose (non-updated) anchor definition lives elsewhere in the document. If there are no such aliases, and you know the structure of your file and have a way to determine what got updated, you can do partial loads and update your data structure. You would need to do some parsing of the YAML document yourself, but if you only use a subset of YAML's possibilities, this is often possible.
E.g. if you know that you only have simple scalar keys at the root level mapping of a YAML document, you can parse the document and extract non-indented strings that are followed by the value indicator. Any such string that is not in your "old" data structure is a new key and its value should be parsed (i.e. the YAML document content until the next non-indented string).
The above is far less trivial to do for any added data that is not added at the root level (whether mapping or sequence).
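As an illustration of the root-level case just described, a minimal sketch (the helper name is mine; it assumes a single document with plain scalar keys at the root and no anchors or aliases):

import re
from ruamel.yaml import YAML

yaml = YAML(typ='safe')

def merge_new_root_keys(path, old_data):
    # Parse only the root-level blocks whose key is not yet in old_data.
    with open(path) as fp:
        text = fp.read()
    # Non-indented lines of the form "key:" mark root-level mapping entries.
    keys = list(re.finditer(r'^([^\s#][^:\n]*):', text, flags=re.MULTILINE))
    for idx, match in enumerate(keys):
        key = match.group(1).strip()
        if key in old_data:
            continue
        # The block for a key runs until the next root-level key (or EOF).
        end = keys[idx + 1].start() if idx + 1 < len(keys) else len(text)
        old_data.update(yaml.load(text[match.start():end]))
    return old_data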
Since there is no indication, within the YAML specification, of the complexity of a YAML document (i.e. whether it includes anchors, aliases, tags, etc.), none of this is easy to build into ruamel.yaml itself.
Without specific information on the format of your YAML document and on what can get updated, specific implementation details cannot be given. I assume, however, that you will not update and write out the loaded data; if that is so, make sure to use
yaml = YAML(typ='safe')
when possible, as this will get you much faster loading times than the default round-trip loader provides.

Google says do not change dynamic URLs, and offers this instead; how?

If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site.
What do they mean exactly? And how do I do it?
Your question: What do they mean exactly?
"If you want to serve a static equivalent of your site" - static refers to HTML pages that are not dynamically created.
"you might want to consider transforming the underlying content by serving a replacement which is truly static" - have 'hard copies' of your pages for the different alternatives.
"One example would be to generate files for all the paths and make them accessible somewhere on your site" - go through your site and create static HTML pages (or PDFs) of each one and store them in the file structure that is represented by the URL.
Example of the last:
http://site.tld/product/pear is today a dynamic page (created on the fly by the code and database), but it does not really live in an actual folder on the server called product. They are suggesting that you create a copy of the dynamically created page and store it in an actual folder on the server called product, with the name pear.
Your question: And how to do it?
Will that work? Sort of, if you wanted to: add a .html extension to the physical file (the copy of the dynamic one) and save it, but I suspect you will run into all sorts of difficulties that you will need to overcome with redirect code in places like .htaccess. Another option may be to change the domain part of the URL to include static, i.e. http://static.site.tld/ for the static copies, keeping the original URL as-is for the dynamic version.
The other big challenge then becomes maintaining the two copies, because the concept they talk about is for the content (what is shown in the browser) to remain static over time. That kind of breaks the whole concept of how we build dynamic web sites today, e.g. online shops.
For example, if it's a shop, I would use PHP to also create the physical file when a product is added, leaving out the parts that are going to change and instead including a link to the dynamic info. Something like:
<?php
$file = 'product/pear.html';
// mysql code here to extract the info and format ready for writing
$content = "<html><head><title>$title_from_db</title></head><body>$page_content_from_db</body></html>";
// Write the contents to the file
file_put_contents($file, $content);
?>

how to parse the documents using Crawlers

I am new to this topic, but my requirement is to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest what crawler to use for my requirement and point me to some tutorials or examples of how to parse documents using crawlers.
Thank you.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages), even a simple wget or curl will do.
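If you prefer to script that download step, a minimal Python sketch (the URL and file name are placeholders):

from urllib.request import urlopen

# Save the HTML source of one page to disk so it can be parsed later.
url = "http://example.com/catalog?page=1"  # placeholder URL
with urlopen(url) as response, open("page1.html", "wb") as out:
    out.write(response.read())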
If the data is on a dynamic page (for example, if the data sits behind some forms and a database query is needed to view it), then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away, you usually don't have the intermediate step of explicitly saving the HTML source to disk and then parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
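If you want to drive pdftotext from a script rather than the shell, a minimal sketch (the file names are placeholders; the pdftotext tool must be installed):

import subprocess

# Convert a PDF to plain text, preserving the original layout.
subprocess.run(["pdftotext", "-layout", "report.pdf", "report.txt"], check=True)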
1c. Extracting data from spreadsheet
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source, but don't limit yourself to these options; you can use any language or tool that you're familiar with.
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
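For instance, a small BeautifulSoup sketch (the file name and the CSS selector are made up):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Parse a previously saved HTML page and print the rows of a table.
with open("page1.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")

for row in soup.select("table.products tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
    if cells:
        print(cells)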
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
[1] Note that this list is not an exhaustive list. It's missing many options.
[2] This regular expression isn't bulletproof, there are some extreme cases it will not cover.
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706

How to generate changelog from Trac

I need to generate a changelog from Trac for a specific version as XML and then process it with a custom XSL. It seems one of the default reports fits the case (All Tickets By Milestone (Including closed)). However, if I request it as XML (by adding format=rss to the URL), the output XML does not contain the Status, Resolution, and Milestone fields. How do I configure it to contain all the fields? How do you generate your changelogs from Trac to include them in release notes?
1) Please provide a copy of the query (click on the SQL Query link at the bottom of the report page).
What I find strange is that normally you get more columns in the CSV/RSS-XML reports than in the HTML version (see the Wiki page TracReports and the extract below).
2) Personally, I generate my changelogs directly from Trac into PDF. I personalised the SQL statement as much as possible to get what I want. I prefer to get a result quickly and economically rather than spend a lot of time getting exactly what I want.
===== TracReports extract =====
_column -- Hide data. Prepending an underscore ('_') to a column name instructs Trac to hide the contents from the HTML output. This is useful for information to be visible only if downloaded in other formats (like CSV or RSS/XML).
