How can i Convert a text file to UCS-2 LE, from whatever the default is? - text

I am looking for a way to convert or save a text file in the UCS-2 LE format; specifically without BOM...i guess.
I have zero knowledge what any of that means actually; but i know i need that because of this wiki page on what i am trying to accomplish: https://developer.valvesoftware.com/wiki/Closed_Captions
in other words:
this is for a specific game engine, "Source Engine," which requires the format in order to compile in-game closed captions for sounds.
I have tried saving the file in Notepad++ using the "UCS-2 LE BOM" option under the encoding menu...there is no option for just "UCS-2 LE" however, and because of this, the captions cannot be compiled for the game engine. I need to save without BOM, "I guess" (because again I don't know what I'm talking about and I assume based on logical conclusions, that I need to not have BOM, whatever that actually means.)
I would like to know about a way to either save a txt file in that encoding format; or a way to convert one.

In my specific case; it appears that my problem boils down to "the program is weird."
what I mean by this is, notepad++ actually does save in the correct format; but I failed to realize that because of a quirk in the caption compiler where it only works if you drag the file onto it; not via command line as previously thought.
I will accept this as the answer when i am allowed to in 2 days.

Related

How can I edit a DXF in node.js?

I'd like to make a custom lasered label from a user's input on a website. I have a template dxf file and I'd like to replace placeholder text with the user input. My problem is the dxf file format is very unreadable in its text format. Is there any way to make sense of the numeric data? If not are there any other formats (svg, etc) that would be easier to work with?
EDIT: The reason I've found it unreadable in terms of text is that the program (Solidworks) converted the text to curves.) At this point I'm trying to figure out how to prevent that.
AutoDesk was nice enough to document DXF syntax in great detail. Spend a couple hours understanding the documentation from the link below, and I think you will find it quite easy to parse and edit using code.
To just replace some placeholder text, it should be just as simple as reading the DXF file into a string (a dxf file is no different than a txt file), performing a text replace operation and saving it back to file. Just make sure that your placeholder text is very unique and is not contained in any of the key words in the document below (otherwise your DXF file will get corrupted). Something like "PlaceHolderText" will do the trick.
http://images.autodesk.com/adsk/files/autocad_2012_pdf_dxf-reference_enu.pdf
Edit: More Info
I do a lot of work with AutoDesk Inventor which is in direct competition with SolidWorks, so they are effectively the same tool. We were faced with a similar problem of needing to place text onto sheet metal flat pattern DXFs that came out of Inventor in order to identify the part, but Inventor simply could not do it (see, exactly the same!). One of our developers had the idea to place a very precise geometry punch onto the flat pattern. After the DXF was generated he wrote some code that parsed the DXF file and replaced the geometry with a text entity. More specifically we used a triangle with sides having each length defined to something like the 7th decimal place. You can then use one of the vertices of the triangle to position the text, including rotation. This process would be automatic, so once you write the code with the help of the document above (which won't take the long), it will just work. If your engraver can handle text the way you want it, I'd say this is a very good solution. We generate hundreds of parts every day using this code. Hope this helps.

Checking if PDF is searchable

I wrote a bash script that extracts plain text from scanned PDF files. I've got lots of PDF's but some are scanned and some other are not. So now my main goal is to improve my script by checking if PDF's are already searchable, so no OCR extraction will be needed.
I've tried:
pdftext -nopgbrk pdf_file.pdf wordlist
to store possible OCR'ed text in wordlist, so then I can check if it's empty and figure out whether it's a searchable PDF or not.
I've also tried pdffonts pdf_file.pdf to check if there're fonts in that PDF and therefore if there's text on it or not.
Both ways work pretty fine but are failing in some cases.
For example, some of the PDF's I need to OCR are digitally signed, and those signatures always add a text layer to PDFs. So when I run any of those two commands, it'll output either the signature's text, or the font that it's using. It's like if it had found plain text just because of the signing. It might just be a scanned PDF with a digital signature, but it'll be detected as a plain text PDF.
Digital signings always add text this way (using Helvetica font):
Signed by: Name
Date: Date CEST
Company: Company Name
So with:
pdftext -nopgbrk pdf_file.pdf wordlist | grep -v -E 'Signed|Date|Company'
I can manage to remove those lines so if it's really a scanned PDF, the output will be empty.
It worked for some PDF's until I noticed a signature that had some other format, so I feel this is pretty much of a work-around and not a great solution.
Is there any way to check if a PDF is fully searchable? I just need a way to extract PDF's text but omitting digital signings. Also grep -v will always depend on our digital signature's format and if it changes then it'll screw up my script.
Thanks.
Unfortunately, there really isn't an easy way to do this in a "non-hacky" way without significantly more involved analysis of the file which would be far beyond the scope and scale of a bash script.
When pdftotext outputs the text for the digital signature, that text is not coming from the digital signature itself. That is stored as an object in the PDF with metadata that pdftotext ignores. Instead, what pdftotext picks up is just that: text which has also been added to the file.
Here's an example from Adobe's sample signed PDF document. First, the digital signature's metadata:
And here is the text which is inserted into the document:
Technically, you can have one without the other, and there is no established format for the text that generally accompanies a digital signature. Therefore, you're stuck either:
Ignoring specific text with grep, as you are doing now, which can be unreliable.
Running OCR on all files and then checking if there is a difference in the text before/after OCR, but then this defeats the whole purpose of checking in the first place.

How to mannually specify Byte Order Mark in CSV

I have a CSV that is encoded in Unicode, however lacks a byte order mark at the start. As such Excel (2013) opens without encoding correctly (i think it assumes ASCII if no BOM specified...), meaning that certain characters are displayed incorectly.
From reading around i have read that a BOM of "\uFEFF" should be entered at the start of the CSV file. I have tried opening in txt editor and adding the characters e.g.
\uFEFF
r1test 1, r1text2, r1text3
r2test 1, r2text2, r2text3
However, this does not solve the problem - the characters "\uFEFF" show up on the first row when I open in excel, rather than it beign interpreted as a BOM. I am not sure what I am doing wrong, and the format of how the text should be specified such that it is interpreted as a BOM, rather than text in the the first of the data
I have only very limited experience using CSV, and only just heard of a BOM... and thus I could be implementing this completely wrong!
(for reference, i know that I could specify the encoding if i use the import data option within excel... however I really want to work out how to get it correctly specified in advance such that I can just open the csv... I have several thousand of these files that I am creating and exporting - once I know how to do this 'manually' [i.e. by adding some text at start of a the file], I can configure to automatically do in Python).
Thanks in advance
For someone else wanting to tell Excel to add a BOM: See if you can "Save as Unicode Text".
source

How to determine file encoding type with Excel VBA

I have built an Excel/VBA tool to validate csv files to ensure the data they contain is valid. They csv can come originate from anywhere (from a full blown unix system or a desktop user saving data out from Excel). The Excel tool is sent out to businesses so they can validate their csv files in their own environment and without taking the risk of their data leaving thier systems. Thus, the solution needs to be in native VBA and not link into external libraries.
So using VBA, I need to be able to automatically detect UTF-8 (with or without BOM) or ANSI file encodings and warn the user if these are not the file encodings used for the csv.
I think this would perhaps involve reading in a few bytes from the start of the file and determining the encoding based on the existance of the byte order mark.
Could you help me get me started on the right track?
Assuming you have the freedom to ask user to choose the correct file type, making them responsible for what they choose as a file ;)
That means, you can create a form where users can choose the filename and the encoding type like how we do on file open wizard.
Else,
I suggest you to use the FileSystemObject. It returns a TextStream which can be utilized to determine the encoding. I doubt VBA supports other types of encoding and please correct me if it does :) and happy to hear. :)
how to detect encoding type
msdn object library model
Here is a link for further considerations:-
change encode type

I want to change the way text is represented internally in ANY Text Editor

I want to use a algorithm to reduce memory used to save the particular text file.I don't really know how text is stored but i have an idea in mind.
Would it be better to extend a open source text editor (if yes than which one) or write a text editor myself.
It would be nice if someone could also give me a link or tutorial to some basics on how text editors work and the way data is stored.
Edited to add
To clarify, what I wanted to do is instead of saving duplicates of a word make a hash table and store the address where it needs to be placed.
That way I wouldn't be storing the duplicates.
This would have become specific to a particular text editor.
Update
thanks everyone I got what all of you'll are trying to say. Anyways all i wanted to do is instead of saving duplicates of a word make a hash table and store the address where it needs to be placed.
This was i wouldn't be storing the duplicates.
Yes and this would have become specific to a particular text editor. never realized that.
I want to use a algorithm to reduce memory used to save the particular text file
If you did this you would no longer have a text editor, but instead you would have created some sort of binary file editor.
The whole point of the text file format is that it is universal, meaning any text file can be open in any other text editor.
Emacs handles compression transparently. Just create a text file with .gz extension. Emacs will automatically compress contents of the file during save operation, and decompress when you open the file next time.
Text is basically stored as-is. i.e., every character takes up a byte or two (wide chars), and there is no conversion done on it when it's saved. It might add an end-of-file character or something though. Don't try coming up with your own algorithm to compress these files. That's why zip-files and other archives were created. They're really good at compressing text. If you wanted to add these feature to your text-editor, you'd have to add some sort of post-save hook to zip it, and then put a hook on the open command to unzip it. Unless you wanted to do it by hand every time. Don't try writing the text editor yourself from scratch, unless (maybe) you're writing notepad. Text editors with syntax highlighting aren't very easy to make, even with the proper libraries. I'd say write a plugin for something like Visual Studio or what have you. Or find an open-source text editor.

Resources