Unable to paste text in readable format from a PDF - string

I have a PDF document with the following sample text (screenshot) -
But when I copy and paste it into Word or another text editor, all I see are weird characters:





I am not sure why it gives me weird square boxes instead of pasting clear, human-readable letters (as in the screenshot). Can someone help me get rid of this issue? Or at least tell me what I should do to identify its root cause?

================== Workaround found ==================
I tried converting the document's corrupted Unicode to a standard encoding (ANSI/Unicode), but most online services couldn't recognize these garbage characters.
This issue could probably be resolved with some programming, but I didn't want to invest time in that approach and preferred an on-the-fly solution.
Finally, as suggested by user 'mkl', running the document through an OCR service such as Sejda or Adobe OCR resolved my issue.
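For anyone who wants to look at the root cause before falling back to OCR, here is a minimal Python sketch that inspects the pasted characters. It rests on an assumption I cannot confirm for this particular file: that the boxes are code points from the Unicode private-use area (or unassigned/replacement characters), which is what you often get when a PDF font carries no usable mapping back to real characters. The function names and the sample string are hypothetical.

```python
import unicodedata


def describe_chars(text):
    """Return (code point, Unicode category, name) for every character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


def unmapped_ratio(text):
    """Fraction of non-whitespace characters that look unmapped.

    "Unmapped" here means private use (Co), unassigned (Cn),
    or the replacement character U+FFFD.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    suspicious = sum(
        1
        for ch in visible
        if unicodedata.category(ch) in ("Co", "Cn") or ch == "\ufffd"
    )
    return suspicious / len(visible)


# Hypothetical sample: five private-use characters standing in for pasted text.
pasted = "\uf048\uf065\uf06c\uf06c\uf06f"
for entry in describe_chars(pasted):
    print(entry)
print("unmapped ratio:", unmapped_ratio(pasted))
```

If the ratio is close to 1, the text layer itself is probably unusable and OCR is the pragmatic route. If it is close to 0, the problem is more likely in the application you paste into.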

Related

Extracting then importing a Csv.gz file into Excel/SPSS...problem recognising the text qualifier?

I am using a file from Insideairbnb.com for my thesis. It is a csv.gz file so first I extracted it using the 'Archive Utility' for Mac.
It is comma delimited and uses double quotes as the text qualifier, which I specified in the import dialog, but Excel/SPSS still splits the fields at the commas inside the quoted text.
It is a large file that includes full airbnb descriptions and reviews which are contained in double quotations. Unfortunately, there are many commas within the strings of text. I have never seen a csv file with this format but I believe it was put together correctly because I have seen Insideairbnb cited for data in quite a few scholarly articles.
I have included a link to pictures of a snippet of the data on the SPSS import window. If anyone knows how to go about importing this I would greatly appreciate your feedback :)
Thank you in advance!
SPSS screenshot 1: https://i.stack.imgur.com/Iy3dA.png
SPSS screenshot 2: https://i.stack.imgur.com/i7KcG.png
I agree with @sarawhite's comment above; if this is a one-time problem, there are a couple of things I would try.
Open the .csv in Excel and, if it looks right, save it and then try to import it into SPSS, or save it as an .xlsx file and import that (although string variables can cause trouble in either scenario).
OR
Open it in Notepad++ and look at the raw data. You can find and replace double line breaks fairly easily.
I copy-pasted the data into Notepad++ yesterday, then converted it to ANSI and copy-pasted it back into Excel. Yesterday, it worked, but today it doesn't...
Anyways, maybe this thread is helpful for people with the same question. I will try again at a later point in time.
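If a scripted route is acceptable, the quoting rules described in the question can be checked outside Excel/SPSS. The sketch below uses only Python's standard gzip and csv modules. The sample data is made up, the helper name is hypothetical, and I am assuming the real file is UTF-8 encoded, which I have not verified.

```python
import csv
import gzip
import io


def read_gzipped_csv(source, encoding="utf-8"):
    """Read a .csv.gz (path or binary file object) into a list of rows.

    newline="" lets the csv module handle line breaks itself, so commas
    and line breaks inside double-quoted fields stay within one field.
    """
    with gzip.open(source, mode="rt", encoding=encoding, newline="") as handle:
        return list(csv.reader(handle, delimiter=",", quotechar='"'))


# Made-up sample with a comma, a line break and doubled quotes in one field.
sample = (
    "id,description\r\n"
    '1,"Cozy flat, close to the centre,\nwith a ""sunny"" balcony"\r\n'
)
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))

rows = read_gzipped_csv(compressed)
for row in rows:
    print(len(row), row)
```

If every row comes back with the same number of fields, the file is well formed and the trouble lies in the import settings. Line breaks inside quoted fields are a frequent cause of this kind of mis-split.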

recognising encodings in Excel

It is my understanding that txt files do not have encoding information stored so text editors simply make educated guesses about encoding of a given text file and then display the file on screen using that guessed encoding. If the editor guessed right you get your text on the screen, if the editor guessed wrong, then you (sometimes) get gibberish. Am I getting this right so far?
Now on to my problem. I have my bank statements in a csv file. When I open it in MS Excel 14 (MS Office 2010), it recognises the encoding and displays the problematic word as "obračun". Great. When I open the file in Emacs 24.3.1, it fails to recognise the correct encoding and displays the problematic word as "obra鑾n". Not so great.
My question is: how do I ask Excel which encoding the file is in, so that I can tell Emacs, since Excel obviously guessed correctly?
Thanks.
This could be a possible answer: http://metty-mathews.blogspot.si/2013/08/excel2013-character-encoding.html
After I opened ‘Advanced’ – ‘Web Options’ – ‘Encoding’, it said "Central European (Windows)" in "Save this document as:" field. It turns out that's Microsoft's name for Windows-1250 encoding and it turns out my file was indeed encoded with this encoding.
Whether this was pure luck or whether this field really shows the encoding Excel uses to display the text, I do not know.
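A way to cross-check the guess without relying on Excel is to decode the raw bytes with a few candidate encodings and see which ones produce the expected word. The sketch below is only an illustration: the byte string is what "obračun" would look like if saved as Windows-1250, the candidate list is my own choice, and the function name is hypothetical.

```python
def try_decodings(raw, encodings):
    """Decode raw bytes with each candidate; map encoding name to text or None."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding at all.
            results[name] = None
    return results


# Assumed bytes: the word from the statement, saved as Windows-1250.
raw_word = "obračun".encode("cp1250")

candidates = ["utf-8", "cp1250", "iso8859_2", "latin-1"]
decoded = try_decodings(raw_word, candidates)
for name, text in decoded.items():
    print(f"{name:10} -> {text!r}")
```

Note that more than one encoding can decode a single word correctly (here both Windows-1250 and ISO 8859-2 do), so it is worth testing several words with different special characters. Once the encoding is known, Emacs can reopen the file with it via C-x RET r (revert-buffer-with-coding-system).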

How to insert a degree symbol into excel

I'm trying to type a degree symbol into an Excel document, which is then saved as a .csv file before finally being imported into InDesign via an add-on called EasyCatalog. I'm having no luck keeping the degree symbol as a degree symbol when it comes into InDesign and wondered if anyone can help with this.
I'm variably getting lots of other symbols but never the degree symbol. Someone has suggested it is a font encoding issue and said I should use Unicode, but I'm not sure how to make an Excel document export with Unicode character encoding.
Any help gratefully received.
Thanks
Alex
Hold down Alt, type 248 on the numeric keypad, then release Alt. The extended ASCII table shows this code: http://www.theasciicode.com.ar/extended-ascii-code/degree-symbol-ascii-code-248.html
However, if encoding really is the problem:
Excel to CSV with UTF8 encoding may help.
And a good read on the problem you have: http://www.joelonsoftware.com/articles/Unicode.html
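To see why the degree symbol turns into other symbols, it helps to look at the bytes. The Python sketch below writes a tiny CSV as UTF-8 (with a byte order mark, which many importers use as a hint) and then reads the same bytes back with the wrong encoding. The sample rows and the helper name are made up, and I have not tested what EasyCatalog itself expects.

```python
import csv
import io

DEGREE = "\u00b0"  # DEGREE SIGN


def write_csv_bytes(rows, encoding="utf-8-sig"):
    """Serialise rows to CSV and return the encoded bytes."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


data = write_csv_bytes([["city", "temperature"], ["Oslo", "21 " + DEGREE + "C"]])

as_utf8 = data.decode("utf-8-sig")   # read back with the matching encoding
as_cp1252 = data.decode("cp1252")    # same bytes misread as Windows-1252

print(repr(as_utf8))
print(repr(as_cp1252))

# The Alt+248 shortcut relies on the old DOS code page 437,
# where byte 248 (0xF8) is the degree sign.
print(repr(bytes([248]).decode("cp437")))
```

The writer and the reader have to agree on the encoding. If the importing side assumes a legacy code page, a UTF-8 degree sign shows up as two unrelated characters.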

Exporting graphics in .emf format in SPSS causes false encoding of the titles

I'm using SPSS 21 (on Windows 7) to create some descriptive reports. I want to export my graphics in the best format for word processing. I found that the .emf format works well, i.e. the graphs are quite good when I insert them into a word document.
The only problem is that the graph titles contain some umlauts (German characters like ä, ö, ü) and some accents (French characters like é, à, è, etc.), and when I export the graphs, these umlauts and accents are displayed as ü, à etc.
I have already (manually) changed the encoding of the Data and Syntax (in the SPSS Options) to "Unicode". But even after this change, the titles of my graphs are not encoded correctly in the export.
Do you have any idea why?
Many thanks in advance!
This is a bug that was reported recently (introduced in the course of fixing another bug :-)) and has been fixed for a future release.

How to translate Unicode to and from matlab?

I have written MATLAB programs that produce plots and tables for chemical substances. I get my input mostly from Excel tables and a local MySQL database. My problem is that quite a few substance names contain Greek letters.
I want to create plots that use exactly the names specified by my colleagues, and also tables that show the correct symbols.
An example:
If I create an Excel file containing "α-Methylstyrol" in the first cell and read it with [~,~,tmp] = xlsread('test.xlsx'), tmp will contain '(box with question mark)-Methylstyrol'. If I use the string in a plot (title(tmp)), it is shown as '(right arrow)-Methylstyrol'.
So far I have tried the native2unicode and unicode2native commands on the string, but they have no effect. I also tried replacing the characters, but the number of characters I need to replace is growing far too fast, so I'm really hoping there is a more systematic way.
(We know there are also names that don't contain Greek letters, but we try to adhere to guidelines that prefer these names.)
As far as I understand, MATLAB does not support Unicode nicely. However, it is possible to put Greek letters in figure titles using LaTeX syntax.
title('\alpha-Methanol')
Even though it is not the nicest solution, I think it should be possible to replace Unicode symbols with LaTeX keywords.
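The replacement idea can be made systematic with a lookup table rather than one hand-written substitution per name. Here is a small sketch of the logic in Python; it is not MATLAB code, the table covers only lowercase Greek letters, and the function name is hypothetical. The same mapping could be ported to MATLAB, for instance with a containers.Map.

```python
# Lowercase Greek letters mapped to the TeX commands used in plot titles.
GREEK_TO_TEX = {
    "α": r"\alpha", "β": r"\beta", "γ": r"\gamma", "δ": r"\delta",
    "ε": r"\epsilon", "ζ": r"\zeta", "η": r"\eta", "θ": r"\theta",
    "ι": r"\iota", "κ": r"\kappa", "λ": r"\lambda", "μ": r"\mu",
    "ν": r"\nu", "ξ": r"\xi", "π": r"\pi", "ρ": r"\rho",
    "σ": r"\sigma", "τ": r"\tau", "υ": r"\upsilon", "φ": r"\phi",
    "χ": r"\chi", "ψ": r"\psi", "ω": r"\omega",
}


def greek_to_tex(name):
    """Replace each known Greek letter with its TeX command; keep the rest."""
    return "".join(GREEK_TO_TEX.get(ch, ch) for ch in name)


print(greek_to_tex("α-Methylstyrol"))
print(greek_to_tex("β-Pinen"))
```

One caveat: if a Greek letter is immediately followed by another letter, the TeX command would run into it, so a separator such as {} would be needed. Names of the form "α-..." are not affected.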
I think your problem is that xlsread is not even getting the correct Greek letter out of your sheet.
Just give jexcelapi or poi a try. Both are Java libraries for importing xls files. In MATLAB you only need to add the jar file to your path via javaaddpath, and the next steps are like basic Java coding.
