How to translate Unicode to and from MATLAB? - Excel

I have written MATLAB programs that produce plots and tables for chemical substances. I get my input mostly from Excel tables and a local MySQL database. My problem is that quite a few substance names contain Greek letters.
I want to create plots that use exactly the names specified by my colleagues, and tables that show the correct symbols.
An example:
If I create an Excel file containing "α-Methylstyrol" in the first cell and read it with [~,~,tmp] = xlsread('test.xlsx'), tmp will contain '(box with question mark)-Methylstyrol'. If I use the string in a plot (title(tmp)), it is shown as '(right arrow)-Methylstyrol'.
So far I have tried the native2unicode and unicode2native commands on the string, to no effect. I also tried replacing the characters one by one, but the number of characters I need to replace is growing far too fast for me, so I'm really hoping there is a more systematic way.
(We know there are also names that don't contain Greek letters, but we try to adhere to guidelines which prefer these names.)

As far as I understand, MATLAB does not support Unicode nicely. However, it is possible to get Greek letters into plot titles using LaTeX syntax:
title('\alpha-Methanol')
Even though it is not the nicest solution, it should be possible to systematically replace Unicode symbols with LaTeX keywords.
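A minimal sketch of that substitution idea (written in Python purely for illustration; in MATLAB itself this becomes a loop of strrep calls, and the three-entry map is an assumption you would extend to cover your substance names):

# Illustrative map from Greek letters to the LaTeX keywords that
# MATLAB's TeX interpreter understands in titles and labels.
GREEK_TO_LATEX = {
    '\u03b1': r'\alpha',   # α
    '\u03b2': r'\beta',    # β
    '\u03b3': r'\gamma',   # γ
}

def to_latex(name: str) -> str:
    """Replace Greek letters so the result can be passed to title()."""
    for greek, keyword in GREEK_TO_LATEX.items():
        name = name.replace(greek, keyword)
    return name

print(to_latex('\u03b1-Methylstyrol'))  # -> \alpha-Methylstyrol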

I think your problem is that xlsread is not even getting the correct Greek letter out of your sheet.
Just give jexcelapi or poi a try; both are Java libraries for importing xls files. In MATLAB you only need to add the jar file to your path via javaaddpath, and the next steps are like basic Java coding.
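As a quick cross-check that the Greek letter really survives inside the file (my own addition, using the Python library openpyxl rather than the Java classes this answer suggests):

# Read the first cell with a Unicode-aware xlsx reader to confirm the
# sheet contains the Greek letter and xlsread is the one mangling it.
from openpyxl import load_workbook

cell = load_workbook('test.xlsx').active['A1'].value
print(cell, [hex(ord(ch)) for ch in cell])  # expect 0x3b1 for α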

Related

Unicode character order problem when text is displayed

I am working on an application that converts text into other characters of the extended ASCII character set, which are displayed in a custom font.
The program essentially parses the input string using a regex, locates the standard characters, and outputs them as converted, before returning a string with the modified text; this displays correctly when viewed with the correct font.
Every now and again, the function returns a string where the characters are displayed in the wrong order, almost as if they were corrupted or some data were missing from the Unicode double-width spacing. I have examined the binary output and the hex data, and inspected the data in the function before I return it, and everything looks OK, but every once in a while something goes wrong and I can't quite put my finger on it.
To see an example of what I mean when I say the order is weird, just take a look at the following piece of converted text output from the program and try to highlight it with your mouse. You will see that it doesn't highlight in the order you expect, despite how it appears.
Has anyone seen anything like this before and have they any ideas as to what is going on?
ך┼♫יἯ╡П♪דἰ
You are mixing various Unicode characters with different LTR/RTL characteristics.
LTR means "left-to-right" and is the direction that English (and many other western language) text is written.
RTL is "right-to-left" and is used mostly by Arabic and Hebrew (as well as several other scripts).
By default, when rendering Unicode text, the engine will try to use the directionality of the characters to figure out which direction a given part of the text should go. Normally that works just fine, because Hebrew words will contain only Hebrew letters and English words will only use letters from the Latin alphabet, so for each chunk there's an easily guessable direction that makes sense.
But you are mixing letters from different scripts and with different directionality.
For example ך is U+05DA HEBREW LETTER FINAL KAF, but you also use two other Hebrew characters. You can use something like this page to list the Unicode characters you used.
You can either
not use "wrong" directionality letters or
make the direction explicit using a Left-to-right mark character (see the sketch below this list).
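A minimal sketch of the directional controls (Python here just for demonstration; the characters themselves behave the same in any language):

LRM = '\u200e'  # LEFT-TO-RIGHT MARK: zero-width, strongly LTR
LRO = '\u202d'  # LEFT-TO-RIGHT OVERRIDE: forces LTR until popped
PDF = '\u202c'  # POP DIRECTIONAL FORMATTING: ends the override

mixed = '\u05da\u2554\u266b\u05d9'  # Hebrew letters mixed with symbols

# An LRM up front gives the string an LTR context, but runs of RTL
# characters inside it are still reordered among themselves.
print(LRM + mixed)

# The override lays out every character left to right in logical
# order, regardless of its own bidi class.
print(LRO + mixed + PDF)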
Edit: Last but not least, I just realized that you said "custom font": if you expect the text to be displayed with a specific custom font, then you should really be using one of the Private Use Areas in Unicode. They are explicitly reserved for private use like this (i.e. where the characters don't match the publicly defined glyphs for the codepoints). That would also avoid surprises like the ones you are getting, where some of the characters used have different rendering properties.

Unable to paste text in readable format from a PDF

I have a PDF document with the following sample text (screenshot):
But when I copy and paste it into Word or other text editors, all I see is weird characters.
I am not quite sure why it is giving me weird square boxes instead of the clear, human-readable letters shown in the screenshot. Can someone help me get rid of this issue? Or at least, what can I do to identify the root cause of it?
================== Workaround found ==================
I tried converting the document's corrupted Unicode to standard ANSI/Unicode formats, but most of the online services couldn't recognize these garbage characters.
The issue could no doubt be resolved with some programming, but I didn't want to invest time in that approach and preferred something more on the fly.
Finally, as suggested by the user 'mkl', converting the document using an OCR service such as "Sedja" or "Adobe OCR" resolved my issue.

How can I edit a DXF in node.js?

I'd like to make a custom lasered label from a user's input on a website. I have a template DXF file and I'd like to replace placeholder text with the user's input. My problem is that the DXF file is very hard to read as text. Is there any way to make sense of the numeric data? If not, are there other formats (SVG, etc.) that would be easier to work with?
EDIT: The reason I've found it unreadable as text is that the program (SolidWorks) converted the text to curves. At this point I'm trying to figure out how to prevent that.
Autodesk was nice enough to document DXF syntax in great detail. Spend a couple of hours understanding the documentation at the link below, and I think you will find it quite easy to parse and edit using code.
To just replace some placeholder text, it should be as simple as reading the DXF file into a string (a DXF file is no different from a txt file), performing a text replace operation, and saving it back to file. Just make sure that your placeholder text is very unique and does not appear in any of the keywords in the document below (otherwise your DXF file will get corrupted). Something like "PlaceHolderText" will do the trick.
http://images.autodesk.com/adsk/files/autocad_2012_pdf_dxf-reference_enu.pdf
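A minimal sketch of that replace step (written in Python to match the other examples on this page; in node.js the same three steps are fs.readFileSync, String.prototype.replace and fs.writeFileSync, and the file names here are made up):

from pathlib import Path

# A DXF file is plain text, so as long as the placeholder collides
# with no DXF group code or keyword, a plain string replace is safe.
dxf = Path('label_template.dxf').read_text()
dxf = dxf.replace('PlaceHolderText', 'user input goes here')
Path('label_output.dxf').write_text(dxf)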
Edit: More Info
I do a lot of work with Autodesk Inventor, which is in direct competition with SolidWorks, so they are effectively the same tool. We were faced with a similar problem: we needed to place text onto the sheet-metal flat-pattern DXFs that came out of Inventor in order to identify the part, but Inventor simply could not do it (see, exactly the same!). One of our developers had the idea to place a very precise geometry punch onto the flat pattern. After the DXF was generated, he wrote some code that parsed the file and replaced the geometry with a text entity. More specifically, we used a triangle whose side lengths were defined to something like the 7th decimal place; you can then use one of the triangle's vertices to position the text, including its rotation. The process is fully automatic, so once you write the code with the help of the document above (which won't take that long), it will just work. If your engraver can handle text the way you want it, I'd say this is a very good solution. We generate hundreds of parts every day using this code. Hope this helps.
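A rough sketch of the lookup half of that trick (assumptions: Python rather than our in-house code, a made-up marker coordinate, and a DXF whose group codes and values strictly alternate line by line, which is how the format is defined):

# A DXF body is a flat sequence of (group code, value) line pairs:
# code 0 starts an entity, and codes 10/20 hold a point's X/Y.
MARKER_X = '123.4567891'  # hypothetical 7th-decimal-place signature

with open('flat_pattern.dxf') as f:
    lines = [ln.strip() for ln in f]

pairs = list(zip(lines[0::2], lines[1::2]))
for i, (code, value) in enumerate(pairs):
    if code == '10' and value == MARKER_X:
        x, y = value, pairs[i + 1][1]  # the 20 (Y) pair follows the 10
        print(f'marker vertex at ({x}, {y})')
        break

Replacing the marker with text is then a matter of deleting its pairs and appending TEXT entity group codes per the reference above.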

add a duplicate (hidden) text layer to a pdf for extra searching

My problem:
I have a PDF with lots of Roman characters carrying complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the PDF, I would like to add an additional layer, much as one does with hOCR, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding an hOCR layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough, see the sketch below), and add it back in as a hidden but searchable layer.
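For concreteness, the stripping step I have in mind is canonical decomposition followed by dropping the combining marks; a minimal Python sketch (assuming NFD decomposition covers all the diacritics involved):

import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD), then drop the combining marks."""
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in nfd if not unicodedata.combining(ch))

print(strip_diacritics('ṣ ś ṝ ǎ'))  # -> 's s r a'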
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well, I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text along with its coordinates, then ReportLab to write normalized versions of the text to a new PDF, in exactly the same positions, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I used a combination of FontForge and MuPDF to extract the required font(s) from the original PDF.
Finally, having created the new PDF, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the PDF results in the normalized text being copied too. That is acceptable for my present purposes, and I can't see any way around it; the PDF spec doesn't really support my objective, so I don't imagine I can do better than this hackish solution.
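For anyone attempting the same, a heavily trimmed sketch of the middle two steps (pdfminer.six and ReportLab; the Helvetica font and fixed 10 pt size below are stand-ins, whereas I extracted the real font as described above):

import unicodedata

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine
from reportlab.pdfgen import canvas

def strip_diacritics(text: str) -> str:
    # Same NFD-and-drop-marks idea as sketched in the question.
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in nfd if not unicodedata.combining(ch))

def make_overlay(src: str, dst: str) -> None:
    c = canvas.Canvas(dst)
    for page in extract_pages(src):
        c.setPageSize((page.width, page.height))
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                t = c.beginText(line.x0, line.y0)
                t.setFont('Helvetica', 10)  # stand-in for the real font
                t.setTextRenderMode(3)      # 3 = invisible text
                t.textLine(strip_diacritics(line.get_text().strip()))
                c.drawText(t)
        c.showPage()
    c.save()

make_overlay('original.pdf', 'overlay.pdf')
# Then merge, e.g.: pdftk original.pdf multistamp overlay.pdf output combined.pdf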
I have written something similar in C# that adds searchable text by OCR'ing images and converting them to PDF. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image, and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text and create the invisible text objects using the existing font and position information and then save it out to a new file.
This would basically give you the same PDF as you have now and also give you both the original and normalised text as you are getting now.
QuickPDF is a commercial library; if your existing solution works well for you, there is no use buying a commercial engine. The nice thing, though, is that it only requires one SDK, and it would be worth a look if you had more than a few PDFs to convert.

Is any software decent at importing column-aligned text?

Here's something that's really irked me over the years: I've never used any software that, when importing data from a column-aligned text file, can figure out the column breaks correctly.
Excel 2K3 and a lot of other Microsoft components that seem to share a common codebase (like the import options for SQL2K) attempt to figure out the column breaks for you. Unfortunately, they only look at the first n rows, and are often completely wrong.
OpenOffice.org 3.1 has an import dialog almost exactly like Excel 2K3's, but it doesn't even attempt to guess the column breaks for you. And the latest version of Numbers doesn't appear to handle column-aligned imports at all.
Obviously column-aligned data is undesirable for a number of reasons, but a lot of older software (particularly in-house software various companies have floating around) exports data in this format so I do need to handle it every so often. Surely, somewhere, SOME software imports it well without me coding an import utility myself or manually specifying where twelve zillion columns start and stop?
OSX, Windows, whatever; I'm open to suggestions. The ultimate goal is to get the data into a SQL Server table, but simply getting it into an Excel/XML/tab-delimited/etc. file in the meantime would be fine, because it's easy enough to get into SQL Server from there.
I tend to normalize such data with awk -- perhaps generating a CSV file -- before trying to import it into Excel.
See the awk user's manual.
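For instance, a fixed-width-to-CSV normalizer takes only a few lines (shown in Python to match the rest of this page, though the same job is a one-liner in awk; the column layout is a made-up example):

import csv

# Hypothetical layout: (start, width) pairs as 0-based character offsets.
COLUMNS = [(0, 10), (10, 8), (18, 25)]

with open('report.txt') as src, open('report.csv', 'w', newline='') as dst:
    out = csv.writer(dst)
    for line in src:
        out.writerow([line[s:s + w].strip() for s, w in COLUMNS])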
I don't think there is a silver bullet for your request. I think the best you can hope for is to define your input format once and be able to reuse that format when you receive a file with the same format again.
As one poster mentioned, you could use awk; or, if .NET is more your thing, you could use FileHelpers, an open-source .NET library that does a good job reading and writing both fixed-length and delimited files. The downside is that you would be creating a .NET application to do the work (either inserting directly into a DB or perhaps creating an output file). On the plus side, once created, you could reuse the mapping classes if you get the same file format again.
Well, obviously no software can be entirely correct in guessing the layout of a fixed-column file, since there is no separator (though variable-width columns with generous maximum lengths will often leave enough trailing space to start guessing). For example, the following could be anywhere from 1 to 9 columns (I have personally had to figure out some super-packed fixed-column layouts like this, only much longer):
135464876
647873159
345467575
If SQL Server is the ultimate destination, have you looked into the SQL Server import wizard?
Right-click your database in Management Studio and select Tasks -> Import Data. Proceed through the wizard and select "Flat File" as your data source. In the format dropdown, change from Delimited to Fixed Width. On the left you can now use the Columns screen to draw the column separators. There are also an advanced screen and a preview screen.
Try out this demo (I was on the development team):
Personator 4
Install, run the program, go to Tools | ASCII Conversion | Import from ASCII.
The import will be to DBF/FoxPro, but you can then export that file into one of the formats you mentioned.
The start/stop guesser uses a few statistical formulas to try to get the boundaries correct; you get to verify and/or correct with a graphical editor after analysis.
If you save your file as a text file and open it in Microsoft Excel 2007 with "Fixed Width" selected, Excel will guess where the breaks occur based on whitespace, but you can change where the field breaks actually fall: on step 2 of the wizard, the application shows vertical break lines that can be moved left or right by any number of characters, and you can see which character position each break occurs at before importing.
