Python: special characters only seen when reading a file generated from Unity

I am reading a file in Python that was generated by Unity. The problem is that when I read the text from Python, I get special characters that do not exist in the file.
This is what the file looks like when I open it on the desktop:
il tien en son bec un fromage
This is what the file looks like when I open it from Python 3.6:
ï»؟il tien en son bec un fromage
This is the Python code:
import nltk
path = 'C:\\Users\\HP\\Documents\\test.txt'
with open(path, 'r', encoding="utf-8") as f:
    for line in f:
        print(line)
        test = nltk.word_tokenize(line)
        print(test)
And this is the Unity code:
public void Save () {
    string Ph = Phrase.text;
    if (File.Exists ("C://Users//HP//Documents//test.txt")) {
        File.Delete ("C://Users//HP//Documents//test.txt");
    }
    // Encoding.UTF8 makes StreamWriter write a byte order mark (EF BB BF) at the start of a new file
    StreamWriter file = new StreamWriter ("C://Users//HP//Documents//test.txt", true, Encoding.UTF8);
    UTF8Encoding utf8 = new UTF8Encoding ();
    byte[] encodedBytes = utf8.GetBytes (Ph);
    file.WriteLine (Ph);
    file.Close ();
}
This is so confusing.
UPDATE:
Adding encoding="utf-8-sig" solved the problem, in case this helps anyone.

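Setting the encoding to encoding="utf-8-sig" solved the problem. The extra characters are the UTF-8 byte order mark (BOM) at the start of the file, and utf-8-sig strips it while reading.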
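To see why this works, here is a minimal sketch in Python. It assumes the Unity file starts with the UTF-8 BOM (bytes EF BB BF) and uses a temporary file as a hypothetical stand-in for test.txt:

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the Unity-generated file: a BOM followed by the text.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "il tien en son bec un fromage\n".encode("utf-8"))

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start of the text.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" drops a leading BOM if present and otherwise behaves like "utf-8".
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom[0]))     # the BOM character
print(without_bom.strip())   # the sentence without any leading debris
```

Because utf-8-sig also reads files that have no BOM, it should be a safe default when the producer of the file is not under your control.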

Related

Groovy - how to Decode a base 64 string and save it to a pdf/doc in local directory

I need help decoding a Base64 string and saving it as a PDF/DOC file in a local directory using Groovy.
The script should work in SoapUI.
The Base64 string is 52854 characters long.
I have tried the following:
File f = new File("c:\\document1.doc")
FileOutputStream out = null
byte[] b1 = Base64.decodeBase64(base64doccontent);
out = new FileOutputStream(f)
try {
    out.write(b1)
} finally {
    out.close()
}
But it gives me the error below:
No signature of method: static org.apache.commons.codec.binary.Base64.decodeBase64() is applicable for argument types: (java.lang.String) values: [base64stringlong] Possible solutions: decodeBase64([B), encodeBase64([B), encodeBase64([B, boolean)
Assuming the base64 encoded text is coming from a file, a minimal example for soapUI would be:
import com.itextpdf.text.*
import com.itextpdf.text.pdf.PdfWriter;
String encodedContents = new File('/path/to/file/base64Encoded.txt')
    .getText('UTF-8')
byte[] b1 = encodedContents.decodeBase64();
// Save as a text file
new File('/path/to/file/base64Decoded.txt').withOutputStream {
    it.write b1
}
// Or, save as a PDF
def document = new Document()
PdfWriter.getInstance(document,
    new FileOutputStream('/path/to/file/base64Decoded.pdf'))
document.open()
document.add(new Paragraph(new String(b1)))
document.close()
The File.withOutputStream method ensures the stream is closed when the closure returns.
To convert the byte array to a PDF, I used iText. I dropped itextpdf-5.5.13.jar into soapUI's bin/ext directory and restarted; after that it was available to Groovy.
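If the Base64 string already encodes a complete PDF or DOC file, writing the decoded bytes unchanged in binary mode should be enough; wrapping them in a new PDF paragraph only makes sense when the decoded content is plain text. Here is a minimal sketch of the decode-then-write step, in Python rather than Groovy. The sample bytes and the output file name are hypothetical:

```python
import base64
import os
import tempfile

# Hypothetical sample: a few bytes standing in for the real document contents.
original_bytes = b"%PDF-1.4 sample bytes"
base64_text = base64.b64encode(original_bytes).decode("ascii")

# Decode the Base64 text back into raw bytes.
decoded = base64.b64decode(base64_text)

# Write the bytes unchanged, in binary mode, so the file keeps its original format.
out_path = os.path.join(tempfile.mkdtemp(), "document1.pdf")
with open(out_path, "wb") as out:
    out.write(decoded)
```

The with block closes the file when it exits, which is the same guarantee withOutputStream gives in Groovy.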

Finding and replacing special chars in a file

I'm trying to find and replace some special chars in a file encoded in ISO-8859-1, then write the result to a new file encoded in UTF-8:
package inv
class MigrationScript {
    static main(args) {
        new MigrationScript().doStuff();
    }
    void doStuff() {
        def dumpfile = "path to input file";
        def newfileP = "path to output file"
        def file = new File(dumpfile)
        def newfile = new File(newfileP)
        def x = [
            "þ":"ş",
            "ý":"ı",
            "Þ":"Ş",
            "ð":"ğ",
            "Ý":"İ",
            "Ð":"Ğ"
        ]
        def r = file.newReader("ISO-8859-1")
        def w = newfile.newWriter("UTF-8")
        r.eachLine {
            line ->
            x.each {
                key, value ->
                if(line.find(key)) println "found a special char!"
                line = line.replaceAll(key, value);
            }
            w << line + System.lineSeparator();
        }
        w.close()
    }
}
My input file content is:
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"
The problem is that my code never finds the specified characters. The Groovy script file itself is encoded in UTF-8. I'm guessing that may be the cause of the problem, but I can't encode it in ISO-8859-1 because then I can't write "Ş", "Ğ", etc. in it.
I took your code sample, ran it with an input file encoded in ISO-8859-1, and it worked as expected. Can you double-check that your input file is actually encoded in ISO-8859-1? Here is what I did:
I took the file content from your question and saved it (using Sublime Text) to /tmp/test.txt using Save -> Save with Encoding -> Western (ISO 8859-1).
I checked the file encoding with the following Linux command:
file -i /tmp/test.txt
/tmp/test.txt: text/plain; charset=iso-8859-1
I set the dumpfile variable to /tmp/test.txt and the newfileP variable to /tmp/test_2.txt.
I ran your code and saw this in the console:
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
I checked the encoding of the Groovy file in IntelliJ IDEA: it was UTF-8.
I checked the encoding of the output file:
file -i /tmp/test_2.txt
/tmp/test_2.txt: text/plain; charset=utf-8
I checked the content of the output file:
cat /tmp/test_2.txt
"ş": "ı": "Ş":" "ğ":" "İ":" "Ğ":"
I don't think it matters, but I used Groovy 2.4.13, the most recent version at the time.
I'm guessing that your input file is not encoded the way you expect. Do double-check the encoding of the file: when I save the same content with UTF-8 encoding, your program does not work as expected and I don't see any "found a special char!" entry in the console. When I display the contents of the ISO-8859-1 file in a UTF-8 terminal, I see something like this:
cat /tmp/test.txt
"�": "�": "�":" "�":" "�":" "�":"%
If I save the same content with UTF-8, I see the readable content of the file:
cat /tmp/test.txt
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"%
I hope this helps you find the source of the problem.
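The effect of a wrong input encoding can be reproduced without any files. Here is a minimal sketch in Python, using the mapping from the question; the convert helper is a hypothetical name, not part of the original script:

```python
# Mapping from the question: Latin-1 characters to their Turkish counterparts.
replacements = {"þ": "ş", "ý": "ı", "Þ": "Ş", "ð": "ğ", "Ý": "İ", "Ð": "Ğ"}

def convert(data: bytes, source_encoding: str) -> str:
    """Decode bytes with the given encoding and apply the replacements."""
    text = data.decode(source_encoding)
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

sample = '"þ": "ý": "Þ": "ð": "Ý": "Ð"'

# Case 1: the input really is ISO-8859-1, so decoding yields the expected characters.
latin1_result = convert(sample.encode("iso-8859-1"), "iso-8859-1")

# Case 2: the input is actually UTF-8 but is decoded as ISO-8859-1.
# Each character turns into two unrelated characters, so none of the keys match.
utf8_result = convert(sample.encode("utf-8"), "iso-8859-1")

print(latin1_result)
print(utf8_result)
```

In the second case every replacement silently does nothing, which matches the symptom in the question: no error, and no "found a special char!" output.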

Getting information from a PDF

I have run into a small problem. Basically, I want to extract string data from a PDF file.
More specifically, this PDF file:
http://www.midttrafik.dk/koereplaner/bybusser/aarhus/bybusser-aarhus/18-mejlbyelev-park-all%C3%A9-skaade-moesgaard/koereplan
My problem is that I don't know how to get the names and the times. The PDF lists bus stops and times: street names are in the left column, and bus arrival times fill the rest. The information I want to save is the number before the street name (1-4), the street name, and all of the times.
Translation of some of the terms in the PDF:
Faste minuttal - means that the bus times are the same for the interval under 'Faste
6.56 - 8.11 - means that the minutes listed below apply within this interval.
So the bus will stop at 'Elev Skole, Høvej' at 56, 11, 26, 41, meaning 6.56, 7.11, 7.26, 7.41, 7.56, 8.11.
I don't think I can describe my problem any better, so I hope one of you will be able to help. I don't need finished code; just point me in the right direction: tell me what I could do that might help, or good patterns to use.
Thanks
You can use the PDFBox library to extract the text you want from this PDF file. It works really well; I used it in one of my last projects to index PDF files for full-text search.
Here is the URL of the project:
http://pdfbox.apache.org/index.html
There you'll also find the documentation and some examples of how to extract text from PDFs.
Sample code:
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;
public class LittleExample {
    public static void main(String[] args) {
        PDDocument pd;
        BufferedWriter wr;
        try {
            // this is your pdf from which you would like to extract the text
            File input = new File("/home/ottp/pdffiles/1.pdf");
            // this is the target file to store the extracted text
            File output = new File("/home/ottp/pdffiles/extracts/1.txt");
            pd = PDDocument.load(input);
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            pd.save("CopyOfInvoice.pdf");
            PDFTextStripper stripper = new PDFTextStripper();
            wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
            stripper.writeText(pd, wr);
            if (pd != null) {
                pd.close();
            }
            // close and flush the output stream
            wr.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
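Once the text has been extracted, the 'Faste minuttal' rows described in the question still need to be expanded into concrete times. Here is a rough sketch of that arithmetic in Python. The function name and the "H.MM" string format are assumptions for illustration, and parsing the extracted text into these values is left out:

```python
def expand_fixed_minutes(start, end, minutes):
    """List every departure between start and end (inclusive, "H.MM" strings)
    that falls on one of the given minutes past the hour."""
    start_h, start_m = (int(part) for part in start.split("."))
    end_h, end_m = (int(part) for part in end.split("."))
    first = start_h * 60 + start_m   # minutes since midnight
    last = end_h * 60 + end_m
    times = []
    for hour in range(start_h, end_h + 1):
        for minute in sorted(minutes):
            total = hour * 60 + minute
            if first <= total <= last:
                times.append(f"{hour}.{minute:02d}")
    return times

# Values from the question: interval 6.56 - 8.11 with fixed minutes 56, 11, 26, 41.
departures = expand_fixed_minutes("6.56", "8.11", [56, 11, 26, 41])
print(departures)   # ['6.56', '7.11', '7.26', '7.41', '7.56', '8.11']
```

Converting to minutes since midnight keeps the interval check to a single comparison and avoids special cases at the hour boundaries.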

How to replace a string/word in a text file in Groovy

Hello, I am using Groovy 2.1.5 and I have to write code that shows the contents/files of a directory at a given path, then makes a backup of a file and replaces a word/string in that file.
Here is the code I have used to try to replace a word in the selected file:
String contents = new File( '/geretd/resume.txt' ).getText( 'UTF-8' )
contents = contents.replaceAll( 'visa', 'viva' )
Here is also my complete code. If anyone would like to modify it to make it more efficient, I would appreciate it, since I am learning.
def dir = new File('/geretd')
dir.eachFile {
    if (it.isFile()) {
        println it.canonicalPath
    }
}
copy = { File src,File dest->
    def input = src.newDataInputStream()
    def output = dest.newDataOutputStream()
    output << input
    input.close()
    output.close()
}
//File srcFile = new File(args[0])
//File destFile = new File(args[1])
File srcFile = new File('/geretd/resume.txt')
File destFile = new File('/geretd/resumebak.txt')
copy(srcFile,destFile)
x = " "
println x
def dire = new File('/geretd')
dir.eachFile {
    if (it.isFile()) {
        println it.canonicalPath
    }
}
String contents = new File( '/geretd/resume.txt' ).getText( 'UTF-8' )
contents = contents.replaceAll( 'visa', 'viva' )
As with nearly everything in Groovy, AntBuilder is the easiest route:
def ant = new AntBuilder()
ant.replace(file: "myFile", token: "NEEDLE", value: "replacement")
As an alternative to loading the whole file into memory, you could do each line in turn
new File( 'destination.txt' ).withWriter { w ->
    new File( 'source.txt' ).eachLine { line ->
        w << line.replaceAll( 'World', 'World!!!' ) + System.getProperty("line.separator")
    }
}
Of course, this (like dmahapatro's answer) relies on the words you are replacing not spanning multiple lines.
I use this code to replace port 8080 with ${port.http} directly in a given file:
def file = new File('deploy/tomcat/conf/server.xml')
def newConfig = file.text.replace('8080', '${port.http}')
file.text = newConfig
The first line reads the whole file into a variable. The second line performs the replacement. The third line writes the variable back to the file.
Answers that use "File" objects are good and quick, but in a sandboxed Jenkins Pipeline they usually cause the following error, which can be avoided, but only at the cost of loosening security:
Scripts not permitted to use new java.io.File java.lang.String.
Administrators can decide whether to approve or reject this signature.
This solution, which uses the Pipeline readFile and writeFile steps, avoids the problems described above:
String filenew = readFile('dir/myfile.yml').replaceAll('xxx','YYY')
writeFile file:'dir/myfile2.yml', text: filenew
Refer to this answer, where patterns are replaced. The same principle can be used to replace strings.
Sample:
def copyAndReplaceText(source, dest, Closure replaceText){
    dest.write(replaceText(source.text))
}
def source = new File('source.txt') //Hello World
def dest = new File('dest.txt') //blank
copyAndReplaceText(source, dest) {
    it.replaceAll('World', 'World!!!!!')
}
assert 'Hello World' == source.text
assert 'Hello World!!!!!' == dest.text
Another simple solution would be the following closure:
def replace = { File source, String toSearch, String replacement ->
    source.write(source.text.replaceAll(toSearch, replacement))
}
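For comparison, the whole task from the question (back up the file, then replace a word in the original) can be sketched in Python. The helper name, the "bak" suffix convention, and the demo file contents are assumptions for illustration; a temporary file stands in for /geretd/resume.txt:

```python
import os
import shutil
import tempfile

def backup_and_replace(path, old, new, encoding="utf-8"):
    """Copy path to '<name>bak<ext>', then replace old with new in the original."""
    root, ext = os.path.splitext(path)
    backup_path = root + "bak" + ext
    shutil.copy2(path, backup_path)          # back up first, so the original text survives
    with open(path, "r", encoding=encoding) as f:
        contents = f.read()
    with open(path, "w", encoding=encoding) as f:
        f.write(contents.replace(old, new))  # literal replacement, not a regex
    return backup_path

# Hypothetical demo file standing in for /geretd/resume.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "resume.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("my visa expires soon\n")

backup_path = backup_and_replace(demo_path, "visa", "viva")
```

The key step that the question's code is missing is the write-back: replaceAll only changes the string held in memory, so the result has to be written to the file explicitly.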

Writing a file in j2me

My code is not working and not giving any exception:
OutputConnection con = (OutputConnection) Connector.open("file:///epsd/rescuer.txt", Connector.WRITE);
System.out.println("below con");
OutputStream out = con.openOutputStream();
PrintStream ps = new PrintStream(out);
System.out.println("below ps");
ps.println(name+"!"+no+"!"+"!"+mtype+"!#!");
System.out.println("below println");
ps.close();
con.close();
Control doesn't get past the OutputConnection line. Is this how to append data to a text file in J2ME?
Use the following to get the memory card path and concatenate it with your file name:
String memoryCardPath = System.getProperty("fileconn.dir.memorycard");
String filePath = memoryCardPath + "/rescuer.txt";
OutputConnection con = (OutputConnection) Connector.open(filePath, Connector.WRITE);
