Anybody have a good approach to automate the batch creation of custom-formatted Excel workbooks at a regularly scheduled time or after an event occurs (e.g., file created, table loaded)?
To make things even more interesting, let's further assume that the source data is in a relational database or Unix file, and the results need to be e-mailed or transmitted to a Unix web server. Thanks!
This is really a series of questions rather than one. I can see the following distinct questions inside it (assuming you are going to use a scripting language such as Perl or Python):
The task must be performed:
At regular time intervals: use cron
After a predefined event: not much to say here, it depends on what exactly you want.
The data has to be retrieved:
from the database: use your language's bindings for the specific database you are using (for example, the Python bindings for sqlite3)
from a file (what is a Unix file anyway?): depending on the format of the file, you can get away with sed/awk, or write a parser in your scripting (or otherwise :P) language of choice.
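As a rough illustration of the database case, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the in-memory database are made up for the example; a real job would connect to your actual database through its own driver.

```python
import sqlite3


def fetch_rows(conn, query, params=()):
    """Run a query and return (column_names, rows)."""
    cur = conn.execute(query, params)
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


# Hypothetical data: an in-memory database standing in for the real source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.5), ("south", 80.0)],
)
conn.commit()

columns, rows = fetch_rows(
    conn, "SELECT region, amount FROM sales ORDER BY region"
)
print(columns)
print(rows)
```

Returning the column names together with the rows makes it easy to write a header line in whatever output format comes next.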
The data has to be massaged into an Excel workbook. I am not sure what you mean by "custom formatted", but the easiest way to get data into Excel is simply to dump it to .csv. You can go further and produce a "pseudo-xls" by filling in the following template and saving the resulting file as .xls (it actually works):
<table>
  <tr>
    <td>field0</td>
    <td>field1</td>
    <!-- ...more cells... -->
    <td>fieldX</td>
  </tr>
  <!-- ...one <tr> per record... -->
</table>
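A minimal sketch of filling that kind of template from a script, assuming the records are already available as Python tuples. The function name and sample data are hypothetical; the only real point is that cell values should be HTML-escaped so characters like & and < do not break the table.

```python
from html import escape


def rows_to_html_table(rows):
    """Render an iterable of rows as a bare HTML <table> string."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical sample data; the first row acts as the header.
sample = [("name", "amount"), ("A & B <Ltd>", 42)]
html_text = rows_to_html_table(sample)
print(html_text)
```

Writing `html_text` to a file with an .xls extension gives the "pseudo-xls" described above.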
The resulting file has to be:
emailed: use the mail command, which usually points to the default mailer on your system (exim, sendmail, postfix)
"transmitted" to a web server: I am assuming this means "transferred to another machine so that it can be made accessible via http(s)". In that case you can use ftp, sftp or rsync (my favourite).
Sorry for being extremely non-specific, but it is not easy to deduce what exactly you are trying to achieve from your question.
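For the e-mail step, if the job is already a Python script, one alternative to shelling out to the mail command is to build the message with the standard library's email package. This is only a sketch: the addresses, subject, and attachment contents are placeholders, and actually sending it assumes an SMTP server you can reach.

```python
from email.message import EmailMessage


def build_report_email(sender, recipient, subject, filename, payload):
    """Build a message with the report attached as a binary file."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("The scheduled report is attached.")
    msg.add_attachment(
        payload,
        maintype="application",
        subtype="octet-stream",
        filename=filename,
    )
    return msg


# Placeholder addresses and data, for illustration only.
msg = build_report_email(
    "reports@example.com",
    "team@example.com",
    "Nightly report",
    "report.csv",
    b"a,b\r\n1,2\r\n",
)

# Sending is left commented out because it needs a reachable SMTP server:
# import smtplib
# with smtplib.SMTP("localhost") as smtp:
#     smtp.send_message(msg)
```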
Excel can read .csv (comma-separated values) or .tsv (tab-separated values) files. It's 'trivial' to dump your output into a CSV (just make sure you quote or escape any commas, tabs, quotes or newlines in your data), and Excel can then read that.
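To illustrate the escaping point: in Python, the csv module from the standard library does the quoting for you. A minimal sketch with made-up data:

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting fields only where needed."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)  # default "excel" dialect
    writer.writerows(rows)
    return buf.getvalue()


# The second field contains both a comma and double quotes.
text = rows_to_csv([["id", "note"], [1, 'said "hi", then left']])
print(text)
```

The field containing a comma and quotes comes out wrapped in double quotes with the inner quotes doubled, which is the convention Excel expects.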
If you want to produce a .xls file, you'll have to find a library in your language of choice that implements handling of .xls files. For instance, if you're using Python, there's an entire mailing list devoted to talking about doing this.
I'm wondering if this is possible and the best way to accomplish it if it is.
Scenario: We have multiple sites that create a "dated subdirectory" each day at a certain time. The dated subs contain information for that day of business.
I need to pull a single DBF file out of the dated sub each day and either export the data to an ever-expanding Excel file, that contains information from the single DBF file from EACH day so it looks like:
Day 1's information
.
.
.
Day 2's information
.
.
.
Day 3's information
OR
Add a copy of the DBF file from each dated sub to a ZIP file that is done daily.
The name of the DBF file never changes, and can't be deleted.
I'm thinking it could be done with a forfiles command, but am curious if it could be done more efficiently. The file that searches, pulls, and zips would be run as a task nightly.
As an add-on, could it be pushed to a Google Drive for safe storage?
Sorry if this is rambling. This is something I'd love to try to do, but not sure where to start exactly.
-Dated sub created nightly, single file from that directory needs to be pulled or read and transferred either to an Excel file or copied to a ZIP with a way to separate each file, maybe a directory with date as name?
Also, if able, it needs to start with a particular date, like 6/1/2014 but no further back.
Thanks in advance for any help.
Can you merge data from a DBF file to an Excel file? Not really with pure batch, but you can use JScript or VBScript. You'll need the MS ACE OLEDB 12.0 driver. Then you can use one connection string for the DBF and another for the XLSX. (If you're using XLS or CSV, you could get by with the MS Jet driver, running the WOW64 version of cscript.) Once connected, just use SQL queries: SELECT * FROM dbffile and, as you're looping through the recordset, INSERT INTO xlsxfile.
Can you append a DBF file to a zip file? Probably. I'm guessing 7za.exe a will append to the archive if the archive already exists. Try it and see. Or were you wanting to script the zip functionality without 3rd party software?
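If a scripting language is an option instead of 7za, the append-to-archive idea can be sketched with Python's zipfile module. Everything specific here is an assumption: the dated subdirectories are taken to be named YYYY-MM-DD, and the function and parameter names are hypothetical. Each copy is stored under a folder named after its date so the identically named files stay separate, and anything before the start date is skipped.

```python
import zipfile
from datetime import date
from pathlib import Path


def archive_dated_files(root, archive, filename, start):
    """Add root/<YYYY-MM-DD>/<filename> to the archive under its date folder."""
    added = []
    # Mode "a" appends to an existing archive or creates a new one.
    with zipfile.ZipFile(archive, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        existing = set(zf.namelist())
        for sub in sorted(Path(root).iterdir()):
            if not sub.is_dir():
                continue
            try:
                day = date.fromisoformat(sub.name)
            except ValueError:
                continue  # not a dated subdirectory
            source = sub / filename
            arcname = f"{sub.name}/{filename}"
            if day < start or not source.is_file() or arcname in existing:
                continue
            zf.write(source, arcname)
            added.append(arcname)
    return added
```

A nightly task could call something like `archive_dated_files(r"D:\sites\store1", "daily.zip", "data.dbf", date(2014, 6, 1))`, where the paths and file name are placeholders. Because entries already in the archive are skipped, running it twice does not duplicate anything.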
Can it be copied to a Google Drive? Well, yeah, the Google Drive software watches and mirrors a folder on your hard drive. So chances are, copying the file to %userprofile%\Google Drive\ will do what you want without any conscious effort.
Try posting another question. But rather than rambling again, find one specific problem where you're getting stuck, and explain what you've tried without success.
I am new to this topic, but my requirement is to parse documents of different types (HTML, PDF, TXT) using a crawler. Please suggest which crawler to use for my requirement, and point me to some tutorials or examples showing how to parse documents using crawlers.
Thank you.
This is a very broad question, so my answer is also very broad and only touches the surface.
It all comes down to two steps, (1) extracting the data from its source, and (2) matching and parsing the relevant data.
1a. Extracting data from the web
There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.
If the data is on static pages, you can download the HTML source for all the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages), even a simple wget or curl will do.
If the data is on a dynamic page (for example, if the data sits behind a form that you have to submit as a database query before you can view it) then a good strategy is to use an automated web scraping or testing tool. There are many of these.
See this list of Automated Data Collection resources [1]. If you use such a tool, you can extract the data right away; you usually don't have the intermediate step of explicitly saving the HTML source to disk and then parsing it afterwards.
1b. Extracting data from PDF
Try Tabula first. It's an open source web application that lets you visually extract tabular data from PDFs.
If your PDF doesn't have its data neatly structured in simple tables or you have way too much data for Tabula to be feasible, then I recommend using the *NIX command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.
Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option which tries to preserve the original layout in the text output. The default option is to "undo" the physical layout of the document, and instead output the text in reading order.
1c. Extracting data from spreadsheet
Try xls2text for converting to text.
2. Parsing the (HTML/text) data
For parsing the data, there are also many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source. Don't limit yourself to these options, though; you can use any language or tool that you're familiar with.
When you're parsing and extracting the data, you're essentially doing pattern matching.
Look for unique patterns that make it easy to isolate the data you're after.
One method of course is regular expressions. Say I want to extract email addresses from a text file named file.
egrep -io "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" file
The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
[1] Note that this list is not an exhaustive list. It's missing many options.
[2] This regular expression isn't bulletproof; there are some edge cases it will not cover.
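If you would rather stay inside a script, the same kind of pattern works with Python's re module. This sketch writes the pattern with `@` between the local part and the domain, and it has the same limitations noted in [2]. The sample sentence is made up.

```python
import re

# Case-insensitive, like the -i flag of the grep version.
EMAIL_RE = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
    re.IGNORECASE,
)


def find_emails(text):
    """Return every substring of text that matches the email pattern."""
    return EMAIL_RE.findall(text)


found = find_emails("Contact alice@example.com or BOB@Example.ORG today.")
print(found)
```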
Alternatively, you can use a script that I've created which is much better for extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706
At the moment, we use MS Word and MS Excel to mail merge documents that need to be sent to multiple recipients.
For example, say there is a complaint form where the complainant needs to fill in his/her name, address, etc. So we have a .doc file set up with the content and the dynamic entities set up for mail merging, with the name and address details put in an excel file, from where we can happily mail merge to generate all or just the necessary forms/documents.
However, I would like to automate this process, like a form in a website where the complainant can fill in his/her name, address and other details, and we could use that to generate the complaint form automatically and offer it to be downloaded (preferably as a PDF).
Now, the only solution that comes to mind is LaTeX, so that I can just replace the needed entities and compile to PDF. However, that would have to be negotiated with the webhost, depending on whether they offer LaTeX or not.
Is there any other solution? Any other way we could get this done, with something that shouldn't be a problem for most webhosting solutions to offer?
EDIT: I would prefer a non-.NET, or rather non-Microsoft, solution, since the servers are running Linux and, while Mono might be capable of getting the job done, none of our devs know any .NET languages. However, if required we might have to delve into it.
Generating PDF using an XSL. Check the following: Apoc XSL-FO
You will need to create an XML file with the required fields and transform that with this tool.
If you wish to avoid .NET then XSL-FO is worth a look. Try the FOray project.
XSLT has a steep learning curve if you do not have experience with it already. Also, users will not be able to change the templates without asking the XSLT guru to do it.
If your templates are already in MS Word and MS Excel then I would stick with generating MS docs on the server. These are now easy to work with from code since OpenXML - check out OfficeOpenXML and OpenXMLDeveloper
Apache FOP : http://xmlgraphics.apache.org/fop/
I suggest generating rtf on the server: it's easy enough to automatically generate using cpan's RTF::Writer, has converters generating good pdf, can be edited by hand in word, oo-writer & TextEdit, doesn't have any really bad compatibility issues between the main editing applications, and has decent text & resource extraction tools, with text extraction being rather better than pdf.
There's some support for moving between rtf & latex, although the best rtf -> latex converter, docx2tex, depends on the System.IO.Packaging .net module, whose mono implementation isn't yet rock solid.
Postscript — Not a recommendation: it's too much of an unwieldy sledgehammer for this job, but iText will generate the pdf directly from the form data. If you wanted to do fancy things like signed pdf, that would be the way to go.
Postscript #2 — If you break up the Word document into individual files using word's master document representation, then you can clobber one of the parts with hand-generated content. This makes it easy to do something approximating form-filling on word .doc files using just standard file-utils and some trivial rtf->doc tweaking.
How can I read an Open Office 3.0 spreadsheet (.ods) from Groovy? I'd like to select specific columns from a named worksheet. Ideally, it would be useful to add a 'where' clause, or other criteria clause.
I've never used it, but Open Office has a Java API, which of course you could use from Groovy as well. It looks like the best places to start reading are the Developer's Guide, the Java UNO Reference, and the samples in Java and (hey!) Groovy. Hope that helps!
Might be something here at Spring Factories or here at Groovy and JMX. There is a forum for Groovy and Open Office.
Could you export the table / spreadsheet as SQL entries and then use that? You could also look at this plugin for Groovy -- http://www.ifcx.org/
OpenOffice documents are ZIP files which contain the document data as XML plus some other files (style sheets for word documents). Details can be found here.
The main problem with calc is formulas. If you just have tabular data, then you can simply read the cell values and use that. So you can open the ZIP archive, read the content.xml in it and parse that with any XML parser.
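The "open the ZIP, read content.xml, parse it" route can be sketched in a few lines. This is written in Python rather than Groovy, purely as an illustration of the approach. It uses the table and text namespaces from the OpenDocument format, ignores repeated rows and merged (covered) cells, and returns whatever text is stored in each cell without evaluating formulas. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_sheet(path, sheet_name):
    """Return the named sheet as a list of rows, each a list of cell text."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    for sheet in root.iter(f"{{{TABLE_NS}}}table"):
        if sheet.get(f"{{{TABLE_NS}}}name") != sheet_name:
            continue
        rows = []
        for row in sheet.iter(f"{{{TABLE_NS}}}table-row"):
            cells = []
            for cell in row.findall(f"{{{TABLE_NS}}}table-cell"):
                paragraphs = cell.findall(f"{{{TEXT_NS}}}p")
                text = "\n".join("".join(p.itertext()) for p in paragraphs)
                # Identical adjacent cells are stored once with a repeat count.
                repeat = int(
                    cell.get(f"{{{TABLE_NS}}}number-columns-repeated", "1")
                )
                cells.extend([text] * repeat)
            # Drop the run of empty padding cells at the end of the row.
            while cells and cells[-1] == "":
                cells.pop()
            rows.append(cells)
        return rows
    raise KeyError(f"no sheet named {sheet_name!r}")
```

Selecting columns or applying a 'where'-style filter is then ordinary list handling, for example `[(r[0], r[2]) for r in rows[1:] if len(r) > 2 and r[2] == "open"]`, where the column positions and the value are placeholders.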
But when a cell contains a formula, then you need to execute it. In this case, you will have to open the document via the UNO API. Here is the Java version. There is a link where you can download example code that explains how to open ODF documents and how to examine their content. There are also snippets but none of them show how to examine a sheet.
The main disadvantage of UNO is the documentation. Each method is explained somewhere but you have to find the method which solves your problem, first.
Since the title does not mention Groovy (only the question's specifics do), I didn't want to make this a new question.
How to generally read an Open Office spreadsheet document? There are tools for creating one (ooo-python) but not for reading one. They are XML but just bluntly diving into that and trying to get the right logic of extracting the data I want seems so sub-optimal.
What I'd like is features similar to Excel COM support, but from a command line tool (or scripting language).
Our product has the requirement of exporting its native format (essentially an XML file) to Excel for viewing/editing. However, what this entails is having a dependency on Excel (or Office) itself for our product build - something that we do not want.
What we have done is export the data from our native format to a CSV file which can be opened in Excel. If the user also selects the option to open the generated report, we (try to) launch Excel to open it (of course, this requires Excel to be already present on the client system).
The data is, for the most part, a flat list of records.
Is there a better format (or even a better way) to handle this requirement? This is a common requirement for many products - how do you handle this?
Excel versions, both 2007 and several previous, have native XML formats. 2007, obviously, is XML by default, and earlier versions have the ability to save as XML. This SO question deals with the issue. I'd guess a little inspection would give an idea of what's required. I don't know if a XSD/DTD exists for older versions, but a little creative Googling might yield something.
As other people pointed out, it is reasonably easy to generate Excel XML files. You can do this in multiple ways. For example:
By creating a template Excel XML document, and then using XML DOM to stuff your data into the template, or
Converting the template Excel XML into an XSLT, and then simply passing your proprietary XML as input to XSLT.
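A rough sketch of the template idea, using plain string substitution rather than an XML DOM or XSLT. It assumes the Excel 2003 XML Spreadsheet layout (Workbook / Worksheet / Table / Row / Cell / Data); the template, function names, and sample records are made up for illustration, and a real template saved from Excel would carry styles and other elements too.

```python
from xml.sax.saxutils import escape, quoteattr

# Bare-bones workbook template with two placeholders: {name} and {rows}.
TEMPLATE = """<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name={name}>
  <Table>
{rows}
  </Table>
 </Worksheet>
</Workbook>
"""


def cell_xml(value):
    """Render one cell, typing ints and floats as Number, the rest as String."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    kind = "Number" if is_number else "String"
    return f'<Cell><Data ss:Type="{kind}">{escape(str(value))}</Data></Cell>'


def build_workbook(sheet_name, records):
    """Fill the template with one <Row> per record."""
    rows = "\n".join(
        "   <Row>" + "".join(cell_xml(v) for v in rec) + "</Row>"
        for rec in records
    )
    return TEMPLATE.format(name=quoteattr(sheet_name), rows=rows)


xml_text = build_workbook("Report", [("name", "qty"), ("A&B", 3)])
print(xml_text)
```

The result would be saved with an .xml extension; no Excel or Office component is needed on the machine that generates it.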
I'm using ExcelPackage to create spreadsheets in one of my side projects. It works pretty well, but (at least the version I'm using) it's a bit limited when it comes to styling and calculations.
ExcelPackage lets you create OOXML docs (.xlsx files) that are natively compatible with Excel 2007, and you can download a plugin for previous versions of Office from MS.
We export our data either using Excel objects (COM-based code) on the client side or as a CSV file (usually on the server side, but it can be used on the client side too). We also allow copying data from grids in simple HTML format, which can be pasted into Excel without problems.
For one customer we even had to export data [from a SQL stored procedure] into a CSV-like tab-separated format, but with the file named like xxxxx.xls - this way Excel opened the file more correctly than it did a CSV file. Ugly hack, but it worked well.
CSV is the most compatible format (no dependencies on external applications or libraries), but customers don't like it. Maybe we need to incorporate some XLS export code, so that all users will be happy :)
If .csv isn't formatted enough, you could create a template in Excel, and use a little bit of VBA code to import the CSV and format it appropriately. This way your app is only concerned with generating the .CSV, and will use the same .XLS for each export.
If you're careful, you should be able to get this to work with most versions of Excel seamlessly.
With Perl there are several modules that can be used to produce .xlsx files without requiring an Office installation. Among those :
https://metacpan.org/pod/Excel::Writer::XLSX is the most well-known, with support for many Excel features like colors, formatting, etc.
https://metacpan.org/pod/Excel::ValueWriter::XLSX (I'm actually the author) has fewer features but is optimized for fast writing of large amounts of data
If you are working in Java, check out the Apache POI project.
http://poi.apache.org/
Simple, nice, complete, powerful.
We started with Office on the server, but that's not very nice. We had to kill processes that hung, and had quite a bit of a performance dip. We thought about putting it on a different machine, but didn't bother after trying and using Aspose (commercial). We don't have a very large number of simultaneous users, but complex documents. Simple ones can be handled easier with csv.
I've used FlexCel Studio for a couple of projects now. It's very functional and fast. 100% managed code, no dependencies. Sounds like you'd use the "Reports" feature, which allows you to define an empty report template in Excel, then pass it a datatable and voilà, it's populated with your data.
TMS Software
We use a combination of OleDB and Interop. We found that Interop was much faster and used less memory, but it's a pain for compatibility issues, especially when using different language installs of Office.
OleDb has the advantage that you don't require Excel to be installed on the client machine. Both Interop and OleDb support multiple sheets (tables) per workbook which you cannot do with csv.
If you're using C# or VB.Net, and your data is in a DataSet, DataTable or List<>, then you can use my free "Export to Excel" class.
It uses the free Microsoft OpenXML libraries (so you don't need to have Excel on your server), and lets you export your data into a "real" .xlsx file with just one line of code, eg:
DataSet ds = CreateSampleData();
CreateExcelFile.CreateExcelDocument(ds, "C:\\Sample.xlsx");
All source code is provided on the following page along with a demo project, completely free of charge (and popups !)
http://mikesknowledgebase.com/pages/CSharp/ExportToExcel.htm
Hope this helps !