I have a program which writes its output as a binary VTK file. Being a binary file, I am not able to understand it. However, I would like to read the contents of the file for my own understanding. Is there a way to view it in ASCII format (so that I could understand the data better)?
A hex editor or dumper will let you view the contents of the file, but it will likely still be unreadable. As a binary file, its contents are meant to be interpreted by your machine / VTK, and will be nonsensical without knowing the VTK format specifications.
Common hex dumpers include xxd, od, or hexdump.
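For example, assuming the file is called output.vtk (substitute your own file name), you could run:
xxd output.vtk | less       # hex on the left, printable ASCII on the right
od -c output.vtk | head     # byte-by-byte, with escapes for non-printable bytes
Both show each byte as hex (or an escape) alongside any printable ASCII, so you can at least see whatever ASCII header text the file starts with before the binary data begins.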
Hi friends, I need some help.
We have a tool that converts binary files to text files and then stores them in Hadoop (HDFS).
In production, that ingestion tool uses FTP to download files from the mainframe in binary format (EBCDIC), and we don't have access to download files from the mainframe in the development environment.
In order to test the file conversion, we manually create text files, and we are trying to convert them using the dd command (Linux) with these parameters:
dd if=asciifile.txt of=ebcdicfile conv=ebcdic
After passing the result through our conversion tool, the expected output is:
000000000000000 DATA
000000000000000 DATA
000000000000000 DATA
000000000000000 DATA
However, it's returning the following result:
000000000000000 DAT
A000000000000000 DA
TA000000000000000 D
ATA000000000000000
I have tried the cbs, obs, and ibs parameters, setting them to the LRECL (the record length, i.e. the number of characters in each line), without success.
Can anyone help me?
A few things to consider:
How exactly is the data transferred via FTP? Your "in binary format (EBCDIC)" doesn't quite make sense as written. Either FTP transfers in binary mode, in which case nothing is changed or converted during the transfer, or FTP transfers in text mode (a.k.a. ASCII mode), in which case the data is converted from a specific EBCDIC code page to a specific non-EBCDIC code page. You need to know which mode is used and, if text mode, which two code pages are involved.
From the man page for dd, it is unclear which EBCDIC and ASCII code pages are used for the conversion. I'm just guessing here: the EBCDIC code page might be CP-037, and the ASCII side might be CP-437. If these don't match the ones used in the FTP transfer, the resulting test data is incorrect (see the code-page check sketched after this list).
I understand you don't have access to production data in the development environment. However, you should still be able to get test data from the development mainframe using FTP from there. If not, how will you be doing end-to-end testing?
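One quick way to check which EBCDIC table you are actually dealing with is to convert a known sample and compare bytes. This is only a sketch, assuming GNU iconv with the IBM037 table available; adjust the code page name to whatever your mainframe actually uses:
printf 'DATA' | iconv -f ASCII -t IBM037 | od -An -tx1
# CP037 encodes "DATA" as c4 c1 e3 c1; '0' is f0 and a space is 40
If the bytes in the file coming from the mainframe match these values, the data really is CP-037 EBCDIC and your conversion tables should line up.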
The EBCDIC conversion is eating your line endings:
https://www.ibm.com/docs/en/zos/2.2.0?topic=server-different-end-line-characters-in-text-files
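In practical terms: the text file you create has newline-terminated lines, while the mainframe file your tool expects consists of fixed-length records with no newlines at all, so the converted newline byte shifts every record by one position. A sketch of a closer emulation, assuming 20-byte records (15 digits, a space, and DATA); set cbs to whatever record length (LRECL) your tool expects:
dd if=asciifile.txt of=ebcdicfile cbs=20 conv=ebcdic,block
conv=block strips the trailing newline from each input line and pads the record with spaces up to cbs bytes, so the EBCDIC output comes out as fixed-length records the way the conversion tool expects.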
I am looking for a way to convert or save a text file in the UCS-2 LE format, specifically without a BOM... I guess.
I have zero knowledge of what any of that actually means, but I know I need it because of this wiki page on what I am trying to accomplish: https://developer.valvesoftware.com/wiki/Closed_Captions
In other words:
This is for a specific game engine, the Source Engine, which requires that format in order to compile in-game closed captions for sounds.
I have tried saving the file in Notepad++ using the "UCS-2 LE BOM" option under the Encoding menu. There is no option for just "UCS-2 LE", however, and because of this the captions cannot be compiled for the game engine. I need to save without a BOM, "I guess" (because again I don't know what I'm talking about; I assume, based on logical conclusions, that I need to not have a BOM, whatever that actually means).
I would like to know a way to either save a txt file in that encoding format, or a way to convert an existing one.
In my specific case, it appears that my problem boils down to "the program is weird."
What I mean by this is that Notepad++ actually does save in the correct format, but I failed to realize that because of a quirk in the caption compiler: it only works if you drag the file onto it, not via the command line as I previously thought.
I will accept this as the answer when I am allowed to in 2 days.
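For anyone who still needs to do the conversion outside Notepad++, one approach is iconv. This is only a sketch, assuming the source file is UTF-8 and the text uses only characters in the Basic Multilingual Plane (where UTF-16LE and UCS-2 LE are byte-for-byte identical); the file names are placeholders:
iconv -f UTF-8 -t UTF-16LE captions_source.txt > captions_ucs2le.txt
# the UTF-16LE target writes little-endian 16-bit units and does not prepend a BOM
Use whatever file name the caption compiler actually expects.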
How does the Linux command file recognize the encoding of my files?
zell@ubuntu:~$ file examples.desktop
examples.desktop: UTF-8 Unicode text
zell@ubuntu:~$ file /etc/services
/etc/services: ASCII text
The man page is pretty clear
The filesystem tests are based on examining the return from a stat(2)
system call...
The magic tests are used to check for files with data in particular
fixed formats. The canonical example of this is a binary executable
(compiled program) a.out file, whose format is defined in #include
<a.out.h> and possibly #include <exec.h> in the standard include
directory. These files have a 'magic number' stored in a particular
place near the beginning of the file that tells the UNIX operating
system that the file is a binary executable, and which of several
types thereof. The concept of a 'magic' has been applied by extension
to data files. Any file with some invariant identifier at a small
fixed offset into the file can usually be described in this way. The
information identifying these files is read from the compiled magic
file /usr/share/misc/magic.mgc, or the files in the directory
/usr/share/misc/magic if the compiled file does not exist. In
addition, if $HOME/.magic.mgc or $HOME/.magic exists, it will be used
in preference to the system magic files. If /etc/magic exists, it will
be used together with other magic files.
If a file does not match any of the entries in the magic file, it is
examined to see if it seems to be a text file. ASCII, ISO-8859-x,
non-ISO 8-bit extended-ASCII character sets (such as those used on
Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded
Unicode, and EBCDIC character sets can be distinguished by the
different ranges and sequences of bytes that constitute printable text
in each set. If a file passes any of these tests, its character set is
reported.
In short, for regular files, file tests their magic values. If there's no match, it then checks whether the file seems to be a text file, making an educated guess about the specific encoding by looking at the actual values of the bytes in the file.
Oh, and you can also download the source code and look at the implementation for yourself.
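If you only want the encoding that file guessed, without the rest of the description, the --mime-encoding option (or -i for the full MIME type) prints it directly; the output looks along these lines:
file --mime-encoding examples.desktop /etc/services
examples.desktop: utf-8
/etc/services:    us-ascii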
TLDR: Magic File Doesn't Support UTF-8 BOM Markers
(and that's the main charset you need to care about)
The source code is on GitHub so anyone can search it. After doing a quick search, things like BOM, ef bb bf, and feff do not appear at all. That means UTF-8 byte-order-mark reading is not supported. Files made in other applications that use or preserve the BOM marker will all be returned as "charset=unknown" when using file.
In addition, none of the config files mentioned in the Magic File manpage are a part of magic file v. 4.17. In fact, /etc/magicfile/ doesn't exist at all, so I don't see any way in which I can configure it.
If you're stuck trying to get the ACTUAL charset encoding and magic file is all you have, you can determine if you have a UTF-8 file at the Linux CLI with:
hexdump -n 3 -C $path_to_filename
If the above returns the sequence ef bb bf, then you are 99% likely in possession of a BOM-marked UTF-8 file. This is not a 100% certainty, but it is far more useful than magic file, which has no handling whatsoever for byte order marks.
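If you want to script that check rather than eyeball the hex, a small test along these lines works (just a sketch, reusing the $path_to_filename placeholder from above):
if [ "$(head -c 3 "$path_to_filename" | od -An -tx1 | tr -d ' ')" = "efbbbf" ]; then
    echo "UTF-8 BOM present"
else
    echo "no UTF-8 BOM"
fi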
I believe I have a similar problem to this: how to convert unicode text to utf8 text readable? But I want a Python 3.7 solution to it.
I am a complete newbie. I have some experience with Python, so I am trying to use it to make a script that will convert a Unicode file back into the readable text it was before.
The file is a bookmark file I recovered using EaseUS; I then opened the bookmark file and it is written in "unicode", something like "&PŽ¾³kÊ
k-7ÄÜÅe–?XBdyÃ8߯r×»Êã¥bÏñ ¥»X§ÈÕÀ¬Zé‚1öÄEdýŽ‹€†.c"
whereas previously is said something like " "checksum": "112d56adbd0caa2b3693bb0442dd16ff",
"roots": {
"bookmark_bar": {
"children":"
FYI, when I click Save As for the recovered bookmark file, the encoding shown is ANSI and not UTF-8, so maybe it was saved as ANSI. I might be waffling here, but I'm just trying to give you all the information you might need to help me.
I am a newbie who depressingly needs help.
This text isn't "Unicode". It's simply gibberish.
This file has been corrupted -- it may have been overwritten with other data before you were able to recover it. It is unlikely to be recoverable.
In Linux, I use ps2pdf to convert a text file report to PDF in a bash script.
To produce the PS file for ps2pdf, I use the paps command because of the UTF-8 encoding.
The problem is that the PDF file from ps2pdf is about 30 times bigger than the PS file created by paps.
Previously, I used a2ps to convert the text to PS and then fed that to ps2pdf, and the PDF output was a normal size, not big.
Is there any way to reduce the PDF size from paps and ps2pdf? Or what am I doing wrong?
The commands I used are below.
paps --landscape --font="Freemono 10" textfile.txt > textfile.ps
ps2pdf textfile.ps textfile.pdf
Thank you very much.
As the author of paps, I agree with KenS's description of paps' inner workings. Indeed, I chose to create my own font mechanism in the PostScript language. That is history though, as I have just released a new version of paps that uses cairo for its PostScript, PDF, or SVG rendering. This is much more compact than the previous paps output, especially w.r.t. the result after doing ps2pdf. Please check out http://github.com/dov/paps .
For ps2pdf, the easiest way to control the output size is by designating the paper size.
An example command is:
ps2pdf -sPAPERSIZE=a4 -dOptimize=true -dEmbedAllFonts=true YourPSFile.ps
ps2pdf is a wrapper around Ghostscript (ps2pdf is owned by the ghostscript package).
With -sPAPERSIZE=something you define the paper size. Wondering about valid PAPERSIZE values? See http://ghostscript.com/doc/current/Use.htm#Known_paper_sizes
-dOptimize=true lets the created PDF be optimised for loading.
-dEmbedAllFonts=true makes the fonts always look nice.
All of this is from: https://wiki.archlinux.org/index.php/Ps2pdf
I think he means the size on disk, rather than the size of the output media. The 'most likely' scenario normally is that the source contains a large DCT encoded image (JPEG) which is decoded and then compressed losslessly into the PDF file using something like flate.
But that can't be the case here, as it's apparently only text. So the next most likely problem is that the text is being rasterised, which suggests some odd fonts in the PostScript. That is possible if you are using UTF-8 text; it's probably constructing something daft like a CIDFont with TrueType descendant fonts.
However, since the version of Ghostscript isn't given, and we don't have a file to look at, it's really impossible to tell. Older versions of the pdfwrite device did less well at creating optimal files, especially from CIDFonts.
Setting 'Optimize=true' won't actually do anything with the current version of pdfwrite; that's an Acrobat Distiller parameter we no longer implement. Older versions of Ghostscript did use it, but the output wasn't correctly linearised.
The correct parameter for newer versions is '-dFastWebView', which is supposed to be faster when loading from the web if the client can deal with this format. Given the crazy way it's specified, practically no viewer in the world does. However, the file is properly constructed in recent versions, so if you can find a viewer which supports it, you can use this (at the expense of making the PDF file slightly larger).
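For completeness, ps2pdf passes -d options straight through to Ghostscript, so with a recent Ghostscript the linearised ("fast web view") output can be requested like this (a sketch, reusing the file names from the question):
ps2pdf -dFastWebView=true textfile.ps textfile.pdf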
If you would like to post a URL to a PostScript file exhibiting problems I can look at it, but without it there's really nothing much I can say.
Update
The problem is the paps file: it doesn't actually contain any text at all, in the PostScript sense.
Each character is stored as a procedure, where a path is drawn and then filled. This is NOT stored in a font, just in a dictionary. All the content on the page is stored in strings in a paps 'language'. In the case of text, this simply calls the procedure for the relevant glyph(s).
Now, because this isn't a font, the repeated procedures are simply seen by pdfwrite (and pretty much all other PostScript consumers) as a series of paths and fills, and that's exactly what gets written to the output in the PDF file.
Now normally a PDF file would contain text that looks like :
/Helvetica 20 Tf
(AAA) Tj
which is pretty compact; the font would contain the program to draw the 'A', so we only include it once.
The output from paps for the same text would look like (highly truncated):
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
which as you can clearly see is much larger. Whereas with a font we would only include the instructions to draw the glyph once, and then use only a few bytes to draw each occurrence, with the paps output we include the drawing instructions for the glyph each and every time it is drawn.
So the problem is the way paps emits PostScript, and there is nothing that pdfwrite can do about it.
That said, I see that you are using Ghostscript 8.71, which is now 4 years old; you should probably consider upgrading.