I hope you are all well. I have an Excel sheet that contains the Cyrillic alphabet. I would like to convert it to English/Latin. Is there any easy way that Excel can do this? Thank you in advance for the help.
Your screenshot shows no Cyrillic letters; those would look like, for example, the letters of the Ukrainian alphabet: А а Б б В в Г г Ґ ґ Д д Е е Є є Ж ж З з И и І і Ї ї Й й К к Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ь ь Ю ю Я я
You are instead the victim of a flagrant mojibake case, as shown in the next example. The file 39142772.txt contains some accented characters (all Central European Latin); it is based on lines 1, 10 and 23 of your data, retyped with valid Czech and Hungarian names, and saved with UTF-8 encoding:
==> chcp 65001
Active code page: 65001
==> type D:\test\39142772.txt
1 STÁTNÍ ÚSTAV PRO KONTROLU LÉČIV
10 Pikó, Béla
23 Móricz, István
==> chcp 1252
Active code page: 1252
==> type D:\test\39142772.txt
1 STÃTNÃ ÃšSTAV PRO KONTROLU LÃ‰ÄŒIV
10 PikÃ³, BÃ©la
23 MÃ³ricz, IstvÃ¡n
==>
Explanation: the chcp command changes the active console code page;
chcp 65001 (UTF-8): the file is displayed properly;
chcp 1252 (Western European Latin): the accented characters in the file are displayed mojibake-transformed, exactly as shown in your screenshot;
the same mojibake transformation happens if you import a .txt or .csv file into Excel using the wrong encoding.
Solution: import the .txt or .csv file into Excel using the proper encoding. The procedure is described here: Is it possible to force Excel recognize UTF-8 CSV files automatically?
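For illustration, the same round trip can be reproduced in a few lines of Python (a sketch of the mechanism only, not part of the Excel import itself):

text = "Pikó, Béla"
garbled = text.encode("utf-8").decode("cp1252")    # misread the UTF-8 bytes as cp1252: 'PikÃ³, BÃ©la'
fixed = garbled.encode("cp1252").decode("utf-8")   # re-reading with the right encoding undoes it
print(garbled, "->", fixed)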
After looking all over the Internet, I've come to this.
Let's say I have already made a text file that reads:
Hello World
Well, I want to remove the very last character (in this case d) from this text file.
So now the text file should look like this: Hello Worl
But I have no idea how to do this.
All I want, more or less, is a single backspace function for text files on my HDD.
This needs to work on Linux as that's what I'm using.
Use fileobject.seek() to seek 1 position from the end, then use fileobject.truncate() to remove the remainder of the file:
import os

with open(filename, 'rb+') as filehandle:
    filehandle.seek(-1, os.SEEK_END)
    filehandle.truncate()
This works fine for single-byte encodings. If you have a multi-byte encoding (such as UTF-16 or UTF-32) you need to seek back enough bytes from the end to account for a single codepoint.
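For example, for UTF-16 a minimal sketch (assuming the final character is a single two-byte code unit rather than a surrogate pair) would be:

import os

with open(filename, 'rb+') as filehandle:
    filehandle.seek(-2, os.SEEK_END)   # one UTF-16 code unit is two bytes
    filehandle.truncate()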
For variable-byte encodings, it depends on the codec if you can use this technique at all. For UTF-8, you need to find the first byte (from the end) where bytevalue & 0xC0 != 0x80 is true, and truncate from that point on. That ensures you don't truncate in the middle of a multi-byte UTF-8 codepoint:
with open(filename, 'rb+') as filehandle:
    # move to end, then scan forward until a non-continuation byte is found
    filehandle.seek(-1, os.SEEK_END)
    while ord(filehandle.read(1)) & 0xC0 == 0x80:
        # we just read 1 byte, which moved the file position forward,
        # skip back 2 bytes to move to the byte before the current.
        filehandle.seek(-2, os.SEEK_CUR)

    # last read byte is our truncation point, move back to it.
    filehandle.seek(-1, os.SEEK_CUR)
    filehandle.truncate()
Note that UTF-8 is a superset of ASCII, so the above works for ASCII-encoded files too.
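A quick way to see the bytevalue & 0xC0 != 0x80 test in isolation (Python 3 shown, where indexing a bytes object yields integers; the string 'é' is just an arbitrary example):

data = 'é'.encode('utf-8')        # b'\xc3\xa9'
print(data[0] & 0xC0 != 0x80)     # True: 0xC3 is a lead byte
print(data[1] & 0xC0 != 0x80)     # False: 0xA9 is a continuation byte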
The accepted answer by Martijn is simple and kind of works, but it does not account for text files with:
UTF-8 encoding containing non-English characters (which is the default encoding for text files in Python 3)
one newline character at the end of the file (which is the default in Linux editors like vim or gedit)
If the text file contains non-English characters, neither of the answers provided so far would work.
What follows is an example that solves both problems and also allows removing more than one character from the end of the file:
import os

def truncate_utf8_chars(filename, count, ignore_newlines=True):
    """
    Truncates last `count` characters of a text file encoded in UTF-8.
    :param filename: The path to the text file to read
    :param count: Number of UTF-8 characters to remove from the end of the file
    :param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
    """
    with open(filename, 'rb+') as f:
        last_char = None

        size = os.fstat(f.fileno()).st_size

        offset = 1
        chars = 0
        while offset <= size:
            f.seek(-offset, os.SEEK_END)
            b = ord(f.read(1))

            if ignore_newlines:
                if b == 0x0D or b == 0x0A:
                    offset += 1
                    continue

            if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
                # This is the first byte of a UTF8 character
                chars += 1
                if chars == count:
                    # When `count` number of characters have been found, move current position back
                    # with one byte (to include the byte just checked) and truncate the file
                    f.seek(-1, os.SEEK_CUR)
                    f.truncate()
                    return
            offset += 1
How it works:
Reads only the last few bytes of a UTF-8 encoded text file in binary mode
Iterates over the bytes backwards, looking for the start of a UTF-8 character
Once `count` characters have been found (skipping any trailing newlines, if requested), truncates the file from the start of the count-th character from the end
Sample text file - bg.txt:
Здравей свят
How to use:
filename = 'bg.txt'
print('Before truncate:', open(filename).read())
truncate_utf8_chars(filename, 1)
print('After truncate:', open(filename).read())
Outputs:
Before truncate: Здравей свят
After truncate: Здравей свя
This works with both UTF-8 and ASCII encoded files.
In case you are not reading the file in binary mode and only have text-mode access (e.g. 'r+'), I can suggest the following.
f.seek(f.tell() - 1, os.SEEK_SET)
f.truncate()
In the code above, f.seek() will only accept an offset derived from f.tell() because you do not have 'b' access. With it you set the cursor to the start of the last element, and f.truncate() then removes that last element.
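A minimal sketch of how this might look end to end (the surrounding open and read are my assumption; the tell() arithmetic is only reliable for single-byte/ASCII content in text mode):

import os

with open('myfile.txt', 'r+') as f:
    f.read()                           # move the position to the end of the file
    f.seek(f.tell() - 1, os.SEEK_SET)  # step back one character
    f.truncate()                       # remove everything from the current position on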
with open(urfile, 'rb+') as f:
    f.seek(0, 2)            # seek to end of file
    size = f.tell()         # get the size...
    f.truncate(size - 1)    # truncate at that size, however many characters
Be sure to use binary mode on Windows, since with Unix-style file line endings text mode may otherwise return an illegal or incorrect character count.
with open('file.txt', 'r+') as f:
    f.seek(0, 2)               # seek to end of file; f.seek(0, os.SEEK_END) is legal
    f.seek(f.tell() - 2, 0)    # seek to the second last char of file; f.seek(f.tell()-2, os.SEEK_SET) is legal
    f.truncate()
Note that opening with 'w' would empty the file first, so a read/update mode like 'r+' is needed. Depending on what the last character of the file is, it could be a newline (\n) or anything else.
This may not be optimal, but if the above approaches don't work out, you could do:
with open('myfile.txt', 'r') as file:
    data = file.read()[:-1]
with open('myfile.txt', 'w') as file:
    file.write(data)
The code first opens the file and copies its content, with the exception of the last character, into the string data. Afterwards, the file is truncated to zero length (i.e. emptied), and the content of data is written back to the file with the same name.
This is basically the same as vins ms's answer, except that it does not use the os package and uses the safer 'with open' syntax. It may not be recommended if the text file is huge. (I wrote this since none of the above approaches worked out too well for me in Python 3.8.)
Here is a dirty way (erase and recreate). I don't advise using this, but it is possible to do it like this:
import os

x = open("file").read()
os.remove("file")
open("file", "w").write(x[:-1])
On a Linux system (or Cygwin under Windows) you can use the standard truncate command. You can reduce or increase the size of your file with this command.
In order to reduce a file by 1G the command would be truncate -s -1G filename. In the following example I reduce a file called update.iso by 1G.
Note that this operation took less than five seconds.
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 30802968576 Blocks: 30081024 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:39:00.572940600 -0400
Modify: 2020-06-12 07:39:00.572940600 -0400
Change: 2020-06-12 07:39:00.572940600 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
chris#SR-ENG-P18 /cygdrive/c/Projects
$ truncate -s -1G update.iso
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 29729226752 Blocks: 29032448 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:42:38.335782800 -0400
Modify: 2020-06-12 07:42:38.335782800 -0400
Change: 2020-06-12 07:42:38.335782800 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
The stat command tells you lots of info about a file including its size.
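To answer the original question of removing a single trailing character, the same command with a one-byte decrement applies, i.e. truncate -s -1 filename; note that this removes exactly one byte, so it is only safe for single-byte characters, and a multi-byte UTF-8 character would need as many bytes removed as it occupies.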
I am trying to loop through a set of PDFs (all are OCR'd) in a set of folders and search for key terms in each PDF; if a PDF contains a certain term, I save the folder name, file name, etc. This code is working to an extent, except that it is missing a few PDFs that do contain the search terms. The reason is that when I read in a couple of the PDFs, some pages come back as gibberish (to me at least). For example, say I have read in a PDF named 'the_one.pdf'. It has 278 pages. When I search this document in Adobe Acrobat, I can find 'Search Term 1' on page 171, but when it is read with Python, the output looks something like this:
-ˆ˜
%
˜%˝ˆ
,˙
˚
%.
%,˛#
%˜˚
0"
˚˝
%
˚˝ˆ˙)˛˚˜
˚0˛˚
:&;
#˛˘˘˙
... (and so on for the rest of the page)
Of course, it extracts the majority of pages correctly, but for some reason it won't extract a couple of them. For confidentiality reasons, I can't post the PDFs. Does anyone have any idea why this is happening?
Also, anything you can point out to speed up my code or make it more dynamic would be helpful as well. Always looking to learn.
Best,
J.Dykstra
import PyPDF2
from os import walk
import os
import re
import csv

pdf_location = r'PDF Directory'
x = ['Search term 1', 'Search term 2', 'Search term 3', 'etc..']

key_terms = []
rule = []
filenamey = []

for dirpath, dirnames, filenames in walk(pdf_location):
    for filename in filenames:
        if filename.endswith('.pdf'):
            pdfFileObj = open(os.path.join(dirpath, filename), 'rb')
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
            num_pages = pdfReader.numPages
            count = 0
            text = ""
            while count < num_pages:
                pageObj = pdfReader.getPage(count)
                count += 1
                text += pageObj.extractText()
            for i in x:
                if re.search(i, text, re.IGNORECASE):
                    rulex = dirpath.split("Rule")[1]
                    filenamex = filename
                    key_termx = x[0]
                    key_terms.append(key_termx)
                    rule.append(rulex)
Parsing PDF is a complex task; the 1.7 spec has around 750 pages, and Adobe makes money with it, which is why it works for them.
PDFs internally have tables that hold
"how letters look" (glyphs)
"what Unicode letters those glyphs are mapped to" (you need that to copy and paste something from a PDF correctly)
and a cross-reference of which glyph maps to which Unicode code point. Fonts might be (partly) embedded in the PDF as well.
That is (one reason) why PDFs can look 100% OK and may even be OCR'd OK, but if you copy and paste from a document that has a corrupt mapping between glyphs and Unicode points, you only get gibberish.
I have heard that some programs even provide Unicode mappings for all glyphs that do not match up at all, on purpose (or through bad quality), to prevent copy and paste.
Bottom line: you can try to re-OCR the pages that give you gibberish, for example with Adobe Acrobat Pro (it has built-in OCR features), or just skip them.
You can try some other PDF-reading framework; maybe PyPDF2 gets something not quite right, but chances are slim if it almost always works and fails only for a few special PDFs.
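If you want to quickly check whether another extractor handles those pages any better, here is a minimal sketch with pdfminer.six (just one possible framework, untested against your files; the file name and page number are placeholders):

# pip install pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text('the_one.pdf', page_numbers=[170])  # page_numbers is zero-based, so 170 is page 171
print(text[:500])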
I am just a novice with PDF, and there are more advanced people around who can pipe in on this, but if you cannot share the PDF it is going to be hard to advise anything.
Alternate approaches: Searching text in a PDF using Python?
I use gforth running on Linux boxes.
For one of my mini-applications I want to record formatted text output built from a few different user inputs.
Here is the INPUT$ I use:
: INPUT$
pad swap accept pad swap ;
I think this is correct. I tested it this way:
cr ." enter something : " 4 INPUT$ CR
enter something : toto
ok
cr ." enter something : " 8 INPUT$ CR
enter something : titi
ok
.S <4> 140296186274576 4 140296186274576 4 ok
My file definition:
256 Constant max-line
Create line-buffer max-line 2 + allot
\ prepare file for write permissions:
s" foo.out" w/o create-file throw Value fd-out
: close-output ( -- ) fd-out close-file throw ;
The end goal is to build very small files as:
data1;data2;data3
data4;data5;data6
where each data field is a user input (the user is asked 3 times to insert text, then a second wave of 3 inputs for the next line)
I did not find documentation about how I can use text inputs to build my file.
How can I take the strings from the stack and copy them into the text file in that format? (Using type only echoes the text to my terminal.)
I think you are looking for the Forth write-file and write-line words, which are documented here: https://www.complang.tuwien.ac.at/forth/gforth/Docs-html/General-files.html
write-file ( c-addr u fileid -- ior )
write-line ( c-addr u fileid -- ior )
Pass the address and length of your text buffer, and the file ID (fd-out in your example) to write text to the file. The ior result will be zero on success.
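For example, a minimal sketch combining your INPUT$ with these words (untested; the field width 32 and the word names write-field, write-sep and one-record are my own inventions) might look like:

: write-field ( n -- )                  \ prompt length n, read with INPUT$, write without newline
  INPUT$ fd-out write-file throw ;

: write-sep ( -- )  s" ;" fd-out write-file throw ;

: one-record ( -- )                     \ builds one  data1;data2;data3  line
  cr ." data 1 : " 32 write-field write-sep
  cr ." data 2 : " 32 write-field write-sep
  cr ." data 3 : " 32 INPUT$ fd-out write-line throw ;

Running one-record twice and then your close-output word should leave two ;-separated lines in foo.out.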
Just when I thought I had my head wrapped around converting Unicode to strings, Python 2.7 throws an exception.
The code below loops over a number of accented characters and converts them to their non-accented equivalents. I've put in a special case for the double s.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import unicodedata
def unicodeToString(uni):
    return unicodedata.normalize("NFD", uni).encode("ascii", "ignore")
accentList = [
#(grave accent)
u"à",
u"è",
u"ì",
u"ò",
u"ù",
u"À",
u"È",
u"Ì",
u"Ò",
u"Ù",
#(acute accent)
u"á",
u"é",
u"í",
u"ó",
u"ú",
u"ý",
u"Á",
u"É",
u"Í",
u"Ó",
u"Ú",
u"Ý",
#(circumflex accent)
u"â",
u"ê",
u"î",
u"ô",
u"û",
u"Â",
u"Ê",
u"Î",
u"Ô",
u"Û",
#(tilde )
u"ã",
u"ñ",
u"õ",
u"Ã",
u"Ñ",
u"Õ",
#(diaereses)
u"ä",
u"ë",
u"ï",
u"ö",
u"ü",
u"ÿ",
u"Ä",
u"Ë",
u"Ï",
u"Ö",
u"Ü",
u"Ÿ",
#ring
u"å",
u"Å",
#ae ligature
u"æ",
u"Æ",
#oe ligature
u"œ",
u"Œ",
#c cedilla
u"ç",
u"Ç",
# D stroke?
u"ð",
u"Ð",
# o slash
u"ø",
u"Ø",
u"¿", # Spanish ?
u"¡", # Spanish !
u"ß" # Double s
]
for i in range(0, len(accentList)):
    try:
        u = accentList[i]
        s = unicodeToString(u)
        if u == u"ß":
            s = "ss"
        print("%s -> %s" % (u, s))
    except:
        pass
Without the try/except I get an error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc0' in position 0
: character maps to <undefined>
Is there anything I can do to make the code run without using the try/except? I'm using Sublime Text 2.
try/except does not make Unicode work; it just hides errors.
To fix the UnicodeEncodeError, drop the try/except and see Python, Unicode, and the Windows console.
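If you just need the loop to print something on a cp437 console without crashing, one possible workaround (a sketch of my own, distinct from the fixes in the linked answer) is to encode explicitly for whatever encoding the console reports and let unsupported characters degrade to '?':

import sys

for u in accentList:
    s = unicodeToString(u)
    if u == u"ß":
        s = "ss"
    line = u"%s -> %s" % (u, s.decode("ascii"))
    # encode for the actual console encoding; characters it cannot show become '?'
    print(line.encode(sys.stdout.encoding or "ascii", "replace"))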
When I write '你' in Notepad and save it as test-unicode.txt in Unicode mode, then open it with xxd g:\\test-unicode.txt, I get:
0000000: fffe 604f ..`O
1. fffe stands for little endian
2. the Unicode code point of 你 is \x4f\x60
I want to write the 你 as 604f or 4f60 in the file.
output=open("g://test-unicode.txt","wb")
str1="你"
output.write(str1)
output.close()
error:
TypeError: 'str' does not support the buffer interface
When I change it to the following, there is no error.
output=open("g://test-unicode.txt","wb")
str1="你"
output.write(str1.encode())
output.close()
When I open it with xxd g:\\test-unicode.txt, I get:
0000000: e4bd a0 ...
How can I write 604f or 4f60 into my file the same way Microsoft Notepad does (save as Unicode format)?
"Unicode" as an encoding is actually UTF-16LE.
with open("g:/test-unicode.txt", "w", encoding="utf-16le") as output:
output.write(str1)
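If you also want the leading byte order mark (the fffe pair your xxd dump showed for Notepad's "Unicode" format), a small variant (my addition, not strictly required by the question) is to use the generic utf-16 codec, which writes the BOM for you:

with open("g:/test-unicode.txt", "w", encoding="utf-16") as output:
    output.write(str1)    # file starts with ff fe 60 4f: the BOM, then the UTF-16LE bytes of 你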