Is any software decent at importing column-aligned text?

Here's something that's really irked me over the years. I've never used any software that, when importing data from a column-aligned text file, can figure out the column breaks in a correct manner.
Excel 2K3 and a lot of other Microsoft components that seem to share a common codebase (like the import options for SQL2K) attempt to figure out the column breaks for you. Unfortunately, they only look at the first n rows, and are often completely wrong.
OpenOffice.Org 3.1 has an import dialog almost exactly like Excel 2K3's, but it doesn't even attempt to guess the column breaks for you. And the latest version of Numbers doesn't appear to handle column-aligned imports at all.
Obviously column-aligned data is undesirable for a number of reasons, but a lot of older software (particularly in-house software various companies have floating around) exports data in this format so I do need to handle it every so often. Surely, somewhere, SOME software imports it well without me coding an import utility myself or manually specifying where twelve zillion columns start and stop?
OSX, Windows, whatever. I'm open to suggestions. The ultimate goal is to get it into a SQL Server table, but simply getting it into an Excel/XML/tab-delimited/etc file in the meantime would be fine because it's easy enough to get into SQL Server from there.

I tend to normalize such data with awk -- perhaps generating a csv file -- before trying to import it into Excel.
See the awk user's manual.
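The same normalization step can be sketched in Python instead of awk. This is only a minimal illustration, assuming you already know the offsets where each column starts; the function name, the sample rows and the offsets are made up for the example.

```python
import csv
import io

def fixed_width_to_csv(lines, starts):
    """Slice each line at the given column start offsets and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Pair each start offset with the next one; the last column runs to end of line.
    bounds = list(zip(starts, starts[1:] + [None]))
    for line in lines:
        line = line.rstrip("\n")
        writer.writerow([line[a:b].strip() for a, b in bounds])
    return out.getvalue()

# Hypothetical sample data: columns start at offsets 0, 5 and 15.
sample = [
    "ID   NAME      QTY",
    "1    Widget     10",
    "22   Gadget, XL  3",
]
print(fixed_width_to_csv(sample, [0, 5, 15]))
```

The csv module quotes any field that contains a comma, so a value like "Gadget, XL" survives the trip into Excel as a single cell.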

I don't think there is a silver bullet for your request. I think the best you can hope for is to define your input format once and be able to reuse that format when you receive a file with the same format again.
As one poster mentioned, you could use awk or, if .NET is more your thing, you could use FileHelpers. It's an open source .NET library that does a good job reading and writing both fixed-length and delimited files. The downside is that you would be creating a .NET application to do the work (either inserting directly into a DB or perhaps creating an output file). On the plus side, once created, you could reuse the mapping classes if you get the same file format again.

Well, obviously no software can be entirely correct in guessing the layout of a fixed column file, since there is no separator (though variable width columns with higher maximum lengths will often produce enough space on the end to start guessing). For example, the following could be anywhere from 1-9 columns (I have personally had to figure out some super packed fixed column layouts like this, only much longer):
135464876
647873159
345467575
If SQL Server is the ultimate destination, have you looked into the SQL Server import wizard?
Right-click your database in Management Studio and select Tasks->Import Data. Proceed through and select "Flat File" as your data source. In the format dropdown change from Delimited to Fixed Width. On the left you can now use the Columns screen to draw the column separators. There is also an advanced and preview screen.
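To make the point about guessing concrete, here is a rough Python sketch of the whitespace heuristic, looking at every row rather than only the first n. It is my own illustration, not what any of the tools mentioned here actually do: a column is assumed to start wherever a non-blank position follows a position that is blank in all rows.

```python
def guess_column_starts(lines):
    """Return offsets where a column appears to start.

    A position counts as a start if it is non-blank in some row and the
    position before it is blank in every row (or it is position 0).
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    blank = [all(r[i] == " " for r in padded) for i in range(width)]
    return [
        i for i in range(width)
        if not blank[i] and (i == 0 or blank[i - 1])
    ]

# Hypothetical padded data: the gaps line up, so the guess works.
padded_rows = [
    "ID   NAME    QTY",
    "1    Widget   10",
    "22   Gadget    3",
]
print(guess_column_starts(padded_rows))

# The packed example above has no blank positions, so only one column is found.
packed_rows = ["135464876", "647873159", "345467575"]
print(guess_column_starts(packed_rows))
```

The heuristic also over-splits whenever a text column happens to contain a space at the same offset in every row, which is why a manual check of the guessed breaks is still needed.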

Try out this demo (I was on the development team):
Personator 4
Install, run the program, go to Tools | ASCII Conversion | Import from ASCII.
The import will be to DBF/FoxPro, but you can then export that file into one of the formats you mentioned.
The start/stop guesser uses a few statistical formulas to try to get the boundaries correct; you get to verify and/or correct with a graphical editor after analysis.

If you save your file as a text file, open it in Microsoft Excel 2007, and select "Fixed Width", Excel will "guess" where the breaks occur (based on whitespace). If it guesses incorrectly, you can still change where the field breaks occur: on step 2 of the wizard, the preview shows vertical lines that you can move left or right by any number of characters. You can see which character number each field break occurs at before importing.

Related

How to read a csv file that has points as thousand separators in Excel

So, I've got a huge csv file that contains numbers that use "." as the thousand separator (I guess this is how they roll in Germany). Some of them are negative numbers.
I have to check that the sum is a certain amount just to be sure they sent me the correct data. When I just replace the dots with nothing I get an incorrect total (close to the total they sent me, but still incorrect). And as I can't review the whole file to find out whether something is wrong somewhere, I can't be certain whether the issue lies with the data or with something I didn't expect (like a line that uses "." as a decimal separator, for example, but maybe there are more exotic cases that I could not quite imagine).
I'm pretty sure there must be a way to make Excel understand that "." is a thousand separator, but so far I haven't managed to make a custom format understand what I'm trying to say.
Well, this is actually half-true: I can make it understand that it should write 1.000.000 instead of 1000000, but I can't make it understand that it should read 1.000.000 as 1000000.
I also tried my luck at changing the separator in File > Options > Advanced > Use system separator, but it doesn't seem to work (like at all; when I change it, nothing changes, maybe this feature is bugged).
NB: I'm French and my default separator is a space. Though I could change the language to English, I can't change it to German because the package is not installed and I can't install anything on my work computer (cause "security and blahblahblah").
Thank you for your kind help.
Regards.
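One way to check the total outside Excel, and to find the lines that don't follow the expected pattern, is a small script. The following Python sketch is only an illustration under stated assumptions: values are expected to look like 1.000.000 or -2.500, with "." only between groups of exactly three digits and an optional "," decimal part; anything else (including a value like 12.5) is reported instead of being summed. The function names are made up.

```python
import re
from decimal import Decimal

# Optional minus, 1-3 digits, then "." plus exactly three digits any number
# of times, then an optional decimal part after "," (assumed German style).
GROUPED = re.compile(r"-?\d{1,3}(\.\d{3})*(,\d+)?")

def parse_grouped(text):
    """Return a Decimal, or None if the text is not a cleanly grouped number."""
    text = text.strip()
    if not GROUPED.fullmatch(text):
        return None
    return Decimal(text.replace(".", "").replace(",", "."))

def sum_and_flag(values):
    """Sum the values that parse cleanly and list the ones that do not."""
    total = Decimal(0)
    suspicious = []
    for lineno, raw in enumerate(values, start=1):
        number = parse_grouped(raw)
        if number is None:
            suspicious.append((lineno, raw))
        else:
            total += number
    return total, suspicious

total, suspicious = sum_and_flag(["1.000.000", "-2.500", "12.5", "300"])
print(total, suspicious)
```

Note that an ungrouped value such as 1234 is also flagged by this pattern, which seems reasonable for a file that is supposed to be grouped throughout, but that is a judgment call.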

Data structure for a grammar checker for LaTeX sources

Let me start with acknowledging that this is a rather broad question, but I need to start somewhere and reduce the design space a bit.
The problem
Grammarly is an online app that provides grammar and spell-checking as a browser plugin. Currently, it supports neither text editors nor LaTeX sources. Grammarly is apparently often confused when forced to deal with annotated text or text that is formatted (e.g. contains wrapped lines). I guess many people could use that tool when writing up scientific papers or pretty much any other LaTeX document. I also presume that other solutions exist or will soon pop up that work similarly.
The solution
In principle, it is not necessary to support Grammarly directly in, e.g., emacs. It suffices to provide a convenient interface to check multiple source files at once. To that end, a simple web app could walk through a directory, read all .tex sources, remove all formatting and markup, and expose the files as an HTML document. The user could open that document, run Grammarly, and apply any fixes. The app would then have to take the corrected text and reapply formatting, markup, etc. to save the now fixed source file.
The question(s)
While it is reasonably simple to create such a web application, there are other requirements to be considered: LaTeX parsing (up to "standard" syntax) and a library like HaTeX could deal with parsing and interpretation. But the process of editing needs some thought. Presuming that the removal of formatting can be implemented by only deleting content, it should be possible to take a correction as a diff and reapply it to the formatted document.
In Haskell, is there a data structure for text editing that supports this use case? That is, a representation of text that can store deletions, find diffs, undo deletions, and move a diff accordingly? If not in Haskell, does something like this exist somewhere else?
Bonus question 2: What is the simplest (as in loc required) web framework in Haskell to set up such a web app? It would serve one HTML document and accept updated versions of the text files. No database is required.
Instead of removing and then adding the text formatting, you could parse the source text into a stream of annotated tokens:
    data AnnotatedChar = AC
      { char :: Char
      , formatting :: String
      }
The following source:
Is \emph{good}.
would be parsed as:
    [AC 'I' "", AC 's' "", AC ' ' "\\emph{", AC 'g' "", ...
Then, extract only the chars from this list, send them to Grammarly, and get back the result. Now, diff the list of annotated characters with the list of characters you got from Grammarly. This way, you only need to deal with a list of characters, but keep the annotations.
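The annotated-character idea can be tried out quickly in Python (not Haskell, just because the standard library ships a diff in difflib). This is a toy sketch under loud assumptions: only `\command{` and `}` are treated as markup, which is nowhere near real LaTeX parsing, and the function names are made up. Each character carries the markup that follows it, as in the example above, with a separate prefix for markup before the first character.

```python
import difflib
import re

# Assumption: the only markup is "\command{" and "}". Real LaTeX needs a parser.
MARKUP = re.compile(r"\\[A-Za-z]+\{|\}")

def annotate(source):
    """Split source into (leading_markup, [[char, markup_after_char], ...])."""
    prefix, chars, pos = "", [], 0
    while pos < len(source):
        m = MARKUP.match(source, pos)
        if m:
            if chars:
                chars[-1][1] += m.group()   # attach markup to the previous char
            else:
                prefix += m.group()         # markup before any text
            pos = m.end()
        else:
            chars.append([source[pos], ""])
            pos += 1
    return prefix, chars

def apply_correction(source, corrected):
    """Re-attach the markup of `source` to the corrected plain text."""
    prefix, chars = annotate(source)
    plain = "".join(c for c, _ in chars)
    out = [prefix]
    matcher = difflib.SequenceMatcher(None, plain, corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(c + fmt for c, fmt in chars[i1:i2])
        else:
            # Emit the new text, then keep the markup that was attached to
            # the characters being replaced or deleted.
            out.append(corrected[j1:j2])
            out.extend(fmt for _, fmt in chars[i1:i2])
    return "".join(out)

print(apply_correction(r"Is \emph{god}.", "Is good."))
```

Where a correction deletes or replaces characters right at a markup boundary, the markup is kept but may land on the wrong side of the new text, so such cases would still need a human look.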

CSV to Excel Doc?

I was curious what the community thinks is the easiest way to take a CSV file and 'save as' an Excel document with only a couple of formulas pasted in?
I am trying to do this behind the scenes, without physically navigating (e.g. opening, selecting Save As, etc). Even though that is already VERY simple, I need to do this in code (think automation).
Background: I have a C++ command-line program generating the .csv, and a C# GUI starting this process. Either program could hold the code, but I figure this is easiest in C# (Interop?). The reasons I don't write the formulas directly into the csv are the number of comma characters, which would mess up the csv, and the fact that other Excel documents need to reference the sheets, so they need to be in .xls format.
=AVERAGE(C2:C999)
=COUNTIFS(C:C,">0",C:C,"<31")
=COUNTIFS(C:C,">31",C:C,"<55")
=COUNTIF(C:C,">55")
Have a look and see whether command-line scripting of OpenOffice will do the job. It can do quite a lot of conversions very easily. Otherwise there are a lot of Excel-producing libraries, for example PHPExcel, but you'd need to wrap some programming around them.
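On the worry about commas inside formulas: a CSV writer that quotes fields properly keeps each formula in a single cell. The Python sketch below only illustrates that quoting, using the formulas from the question; it assumes the spreadsheet application evaluates cells that start with "=" when it opens the CSV (the usual behavior, but worth verifying), and it does not by itself produce an .xls file. The function name and the choice of column E are made up.

```python
import csv
import io

FORMULAS = [
    "=AVERAGE(C2:C999)",
    '=COUNTIFS(C:C,">0",C:C,"<31")',
    '=COUNTIFS(C:C,">31",C:C,"<55")',
    '=COUNTIF(C:C,">55")',
]

def with_formulas(rows, formulas, formula_col=4):
    """Write data rows, then one formula per row in column `formula_col`.

    Column index 4 is column E, which keeps the formulas out of column C
    so that the C:C references do not become circular.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)
    for formula in formulas:
        writer.writerow([""] * formula_col + [formula])
    return out.getvalue()

text = with_formulas([["a", "b", "1"]], FORMULAS)
print(text)
```

Reading the text back with csv.reader returns the formulas unchanged, which is a quick way to confirm the commas and quotes were escaped correctly.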

Creating a Print Monitor / Print Handler

I'm having trouble getting started with building a Print Monitor / Print Handler for Windows using Visual Studio 2012 Ultimate with WDK 8. Basically, this is what I am trying to accomplish:
Create a print monitor (something an application can print to) that generates a file with the content that should be printed (like the default XPS printer or a PDF printer), and then invokes the print handler.
Create a print handler that parses the generated file and performs certain actions with it (check to see if certain text is present, upload the file online, etc).
I feel like the print handler part should not be too hard, but starting with the print monitor is what I'm stuck at. What would I do within VS12? I see options for "Printer Driver V4", "Printer Driver V4 Property Bag", and "Printer XPS Render Filter". Should I use one of those templates, and, if so, what would I do within them? Anything pointing me in the right direction would be appreciated!
EDIT:
Just some more clarification - I only need the text from the print output, but I've read from various sources that getting text-only output leads to no output at all from sources like Firefox, etc since they print text as glyphs.
I will be using the print handler to parse the text for keywords and then upload that information to a web server in a specific format. The print monitor just needs to capture and save the text information from whatever application is printing.
As you pointed out in your comments, some applications such as Firefox print using glyph indices instead of characters. In fact, quite a few do and it's becoming more common. What you need is a print driver. The good news is Microsoft has already written it for you and provided you with sample source code in the WDK. Start by reviewing this to understand your options. The Unidriver is perhaps a little simpler, but the Postscript driver has the advantage of generating output that can readily be transformed to PDF or other formats that retain text information (as opposed to raster page images that lose all text information). As far as I'm concerned, don't even think about XPS; it's just an all-around disaster.
To handle glyph indices, what you'll need to do is add code to the driver's OEMTextOut function that uses the font's cmap tables to translate glyph indices back into character codes. I'm unaware of any public domain libraries that parse font files, so you'll likely have to write your own code to do this. (Hint: If you support only OpenType/TrueType fonts, you'll cover 99% of all printing applications).
Getting the Microsoft sample code to build, install and run is mostly straightforward, but if you're new to the WDK and installing print drivers, plan on spending a week or more on just that. The glyph index translation part is far more complex and you should plan on spending a lot more time on that.
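To show what the glyph index translation amounts to, here is a small Python sketch of the reverse lookup only. It assumes the font's cmap has already been parsed into a dictionary from character code to glyph index (the hard part, not shown), and the cmap values below are invented for the example. A real driver would do this in C inside OEMTextOut.

```python
def invert_cmap(cmap):
    """Build glyph index -> character code from a character code -> glyph index map.

    Several character codes can share one glyph; the lowest code wins here,
    which is an arbitrary choice for the sketch.
    """
    glyph_to_char = {}
    for codepoint in sorted(cmap):
        glyph_to_char.setdefault(cmap[codepoint], codepoint)
    return glyph_to_char

def glyphs_to_text(glyph_ids, glyph_to_char, missing="\ufffd"):
    """Translate glyph indices back to text, marking glyphs the cmap does not cover."""
    return "".join(
        chr(glyph_to_char[g]) if g in glyph_to_char else missing
        for g in glyph_ids
    )

# Hypothetical cmap: space and no-break space share glyph 3.
cmap = {0x20: 3, 0x48: 43, 0x69: 76, 0xA0: 3}
reverse = invert_cmap(cmap)
print(repr(glyphs_to_text([43, 76, 3, 999], reverse)))
```

Glyphs that no character code maps to directly, such as ligatures, come out as the replacement character with this approach and need separate handling.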

Converting troublesome delimited PDF to Excel

I'm trying to convert this delimited PDF to Excel (or some other delimited format). Using Adobe Acrobat 9, I attempt to save it (and copy it) as Excel, but it gives the error message "BAD PDF; error in processing fonts. [348]".
I'm open to any solution that will create a delimited file, ranging from using Adobe Acrobat, to programming to using other apps. The only limitation is that I don't have a budget to buy anything (such as Able2Extract).
The way I was able to export my images and fonts without buying any extra software to do the conversion was this: go to Advanced, PDF Optimizer, select all of the options you want in the left column, and where it says "Make compatible with" select "Acrobat 8.0 and later", then click OK. You are on your road to success.
Note: not really an answer, but some suggestions.
Sounds to me like Crystal Reports is not following the PDF spec closely enough.
I'd make sure CR is fully updated/patched and try generating another file, making sure that "tagging" is enabled - tagging defines the layout structure. I don't have a copy of CR handy, but you may have to define a distiller template to use so that when you print to PDF you can select that job option.
You can also tell it's a bad PDF by using Preflight in Acrobat: it says there is no tag structure (you can add one manually by drawing boxes around each item...). It also says that there is no language set, and that the file is somehow compatible with Acrobat 1.3(?), which isn't supported anymore; it should be 4 at the lowest.
Once you have a "good" PDF you can export to XML/Word and import that into Excel. Also, with Acrobat 8+ you can highlight using the select tool, right-click and choose Open As Spreadsheet. You might be able to get away with just highlighting the whole document -- though I'd hope the XML approach would be best.
Able2Extract does some OCRing and tricky fuzzy logic not only to define tags/layout so it is exportable, but also avoids any font, encoding, etc issue - at least to my knowledge.
In the rare case that you can't get a new file, exporting to plain text/accessible text seems to generate a nice flat text file. You could write a VBScript to parse it (adding your delimiter) and import that into Excel.

Resources