Search substring in binary file - string

Friends, please help me with my issue. I have an application that processes data and generates output files (different formats, but mostly images). In every generated file the application puts its watermark: a string that looks like "03-24-5532 [some cyrillic text]".
Every time I use the application, I have to edit each file in Photoshop to replace the watermark string with the required one, and that takes a lot of time.
Is it possible to search for that substring in the application's binary data files (using a hex editor or something else) and replace it? What is the best way to solve this problem?
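For reference, a minimal sketch of that kind of byte-level search-and-replace in Python; the file name, the replacement value and the stored encoding are all assumptions. In most binary formats the replacement must be exactly the same length as the original, otherwise offsets stored elsewhere in the file can break.

# Hypothetical example: swap a watermark string inside a binary file.
old = "03-24-5532".encode("utf-8")   # assumption: adjust the encoding to match how the string is stored
new = "03-24-9999".encode("utf-8")   # hypothetical replacement value
assert len(old) == len(new), "same-length replacement keeps the file layout intact"

with open("application.dat", "rb") as f:   # placeholder file name
    data = f.read()

if old in data:
    with open("application.dat", "wb") as f:
        f.write(data.replace(old, new))

A hex editor does the same thing interactively: search for the byte pattern and overwrite it in place without changing the file length.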

Related

(dd command linux) last byte goes to next line

Hi friends, I need some help.
We have a tool that converts binary files to text files and then stores them in Hadoop (HDFS).
In production, that ingestion tool uses FTP to download files from the mainframe in binary format (EBCDIC), and we don't have access to download files from the mainframe in the development environment.
In order to test the file conversion, we manually create text files and try to convert them using the dd command (Linux) with these parameters:
dd if=asciifile.txt of=ebcdicfile conv=ebcdic
After passing through our conversion tool, the expected result is:
000000000000000 DATA
000000000000000 DATA
000000000000000 DATA
000000000000000 DATA
However, it's returning the following result:
000000000000000 DAT
A000000000000000 DA
TA000000000000000 D
ATA000000000000000
I have tried the cbs, obs and ibs parameters, setting them to the lrecl (the record length, i.e. the number of characters in each line), without success.
Can anyone help me?
A few things to consider:
How exactly is the data transferred via FTP? Your "in binary format (EBCDIC)" doesn't quite add up: FTP either transfers in binary mode, in which case nothing is changed or converted during the transfer, or it transfers in text (ASCII) mode, in which case the data is converted from a specific EBCDIC code page to a specific non-EBCDIC code page. You need to know which mode is used and, if it is text mode, which two code pages are involved.
From the man pages for dd, it is unclear which EBCDIC and ASCII code pages are used for the conversion. I'm just guessing here: the EBCDIC code page might be CP-037, and the ASCII one might be CP-437. If these don't match the ones used in the FTP transfer, the resulting test data is incorrect.
I understand you don't have access to production data in the development environment. However, you should still be able to get test data from the development mainframe using FTP from there. If not, how will you do end-to-end testing?
The EBCDIC conversion is eating your line endings:
https://www.ibm.com/docs/en/zos/2.2.0?topic=server-different-end-line-characters-in-text-files
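If the downstream tool expects fixed-length records with no line terminators, one way to build test data is to pad each line to the record length yourself before converting. A minimal Python sketch, assuming cp037 as the EBCDIC code page and a 20-byte record length read off the sample output (both are guesses to adjust); dd should achieve the same padding with cbs=<record length> and conv=block,ebcdic.

import codecs

RECORD_LENGTH = 20      # assumption: fixed record length expected by the conversion tool
CODE_PAGE = "cp037"     # assumption: one common EBCDIC code page; match whatever the mainframe uses

with open("asciifile.txt", "r", encoding="ascii") as src, open("ebcdicfile", "wb") as dst:
    for line in src:
        record = line.rstrip("\n").ljust(RECORD_LENGTH)[:RECORD_LENGTH]  # drop the newline, pad/trim to fixed length
        dst.write(codecs.encode(record, CODE_PAGE))                      # write one fixed-length EBCDIC record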

How can I convert a text file to UCS-2 LE, from whatever the default is?

I am looking for a way to convert or save a text file in the UCS-2 LE format, specifically without a BOM... I guess.
I have zero knowledge of what any of that actually means, but I know I need it because of this wiki page on what I am trying to accomplish: https://developer.valvesoftware.com/wiki/Closed_Captions
In other words:
This is for a specific game engine, the Source Engine, which requires that format in order to compile in-game closed captions for sounds.
I have tried saving the file in Notepad++ using the "UCS-2 LE BOM" option under the Encoding menu. There is no option for just "UCS-2 LE", however, and because of this the captions cannot be compiled for the game engine. I need to save without a BOM, I guess (again, I don't know what I'm talking about; I'm assuming, based on logical conclusions, that I need to not have a BOM, whatever that actually means).
I would like to know a way to either save a txt file in that encoding or convert an existing one.
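For what it's worth, UCS-2 LE without a BOM is, for ordinary characters, the same byte layout as UTF-16 LE without a BOM, and Python's "utf-16-le" codec writes exactly that with no byte-order mark prepended. A minimal conversion sketch with placeholder file names and an assumed source encoding:

# Convert input.txt (assumed UTF-8) to UCS-2 LE / UTF-16 LE without a BOM.
with open("input.txt", "r", encoding="utf-8") as src:
    text = src.read()

with open("output.txt", "w", encoding="utf-16-le", newline="") as dst:
    dst.write(text)    # "utf-16-le" writes little-endian 16-bit units and no BOM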
In my specific case, it appears that my problem boils down to "the program is weird."
What I mean by this is that Notepad++ actually does save in the correct format, but I failed to realize that because of a quirk in the caption compiler: it only works if you drag the file onto it, not via the command line as I previously thought.
I will accept this as the answer when I am allowed to in 2 days.

How can I edit a DXF in node.js?

I'd like to make a custom lasered label from a user's input on a website. I have a template DXF file and I'd like to replace placeholder text with the user input. My problem is that the DXF file format is very unreadable in its text form. Is there any way to make sense of the numeric data? If not, are there any other formats (SVG, etc.) that would be easier to work with?
EDIT: The reason I've found it unreadable as text is that the program (SolidWorks) converted the text to curves. At this point I'm trying to figure out how to prevent that.
Autodesk was nice enough to document DXF syntax in great detail. Spend a couple of hours understanding the documentation at the link below, and I think you will find it quite easy to parse and edit using code.
To just replace some placeholder text, it should be as simple as reading the DXF file into a string (a DXF file is no different from a txt file), performing a text replace operation and saving it back to file. Just make sure that your placeholder text is unique and is not contained in any of the keywords in the document below (otherwise your DXF file will get corrupted). Something like "PlaceHolderText" will do the trick.
http://images.autodesk.com/adsk/files/autocad_2012_pdf_dxf-reference_enu.pdf
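A minimal sketch of that read-replace-save step, shown in Python for brevity; the same three steps translate directly to node.js with fs.readFileSync, String.replace and fs.writeFileSync. The file names and the inserted label are placeholders.

# Hypothetical file names; "PlaceHolderText" is the unique marker suggested above.
with open("template.dxf", "r", encoding="ascii", errors="ignore") as f:
    dxf = f.read()

dxf = dxf.replace("PlaceHolderText", "Customer label 42")   # substitute the user's input here

with open("label.dxf", "w", encoding="ascii", errors="ignore") as f:
    f.write(dxf)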
Edit: More Info
I do a lot of work with Autodesk Inventor, which is in direct competition with SolidWorks, so they are effectively the same tool. We were faced with a similar problem of needing to place text onto sheet metal flat-pattern DXFs that came out of Inventor in order to identify the part, but Inventor simply could not do it (see, exactly the same!). One of our developers had the idea to place a very precise geometry punch onto the flat pattern. After the DXF was generated, he wrote some code that parsed the DXF file and replaced the geometry with a text entity. More specifically, we used a triangle with sides whose lengths were defined to something like the 7th decimal place. You can then use one of the vertices of the triangle to position the text, including its rotation. This process is automatic, so once you write the code with the help of the document above (which won't take that long), it will just work. If your engraver can handle text the way you want it, I'd say this is a very good solution. We generate hundreds of parts every day using this code. Hope this helps.

Efficient Way of Recording Page Numbers from a Search of a PDF

I have a list of ~1200 queries (part numbers) that are specified somewhere inside a 100-page PDF. What I need to do is record which pages each of the queries appears on in the PDF. I can't think of a clever way of doing this. It would take me 5-20 hours to do it search by search, so if someone can give me a good idea before the 5-hour mark, that would be great!
Assuming you can determine what a "query" is in your context programmatically from the plain text (for example, by using regular expressions):
You could split your PDF into different files (1 file per page) using pdftk
http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Then convert those files to text with a pdf-to-text utility like this one:
http://www.fileguru.com/PDF-To-TXT-Converter/download
or this one
http://www.pdf2text.com/
And finally write yourself a simple script using your favorite programming language to determine which of those files contains a "query" (whatever that looks like).
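A minimal sketch of that last step in Python, assuming each page ends up in a text file named page_001.txt, page_002.txt, ... and that queries.txt holds one part number per line (both naming conventions are assumptions to adjust):

import glob, re

queries = [q.strip() for q in open("queries.txt", encoding="utf-8") if q.strip()]
pages_for = {q: [] for q in queries}

for path in sorted(glob.glob("page_*.txt")):
    page_no = int(re.search(r"\d+", path).group())           # pull the page number out of the file name
    text = open(path, encoding="utf-8", errors="ignore").read()
    for q in queries:
        if q in text:                                         # plain substring match; swap in a regex if needed
            pages_for[q].append(page_no)

for q in queries:
    print(q, pages_for[q] or "not found")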

I want to change the way text is represented internally in ANY Text Editor

I want to use an algorithm to reduce the memory used to save a particular text file. I don't really know how text is stored, but I have an idea in mind.
Would it be better to extend an open-source text editor (if yes, which one?) or to write a text editor myself?
It would also be nice if someone could give me a link or tutorial covering the basics of how text editors work and how the data is stored.
Edited to add
To clarify, what I wanted to do is, instead of saving duplicate copies of a word, build a hash table and store the address where each occurrence needs to be placed.
That way I wouldn't be storing the duplicates.
This would have become specific to a particular text editor.
Update
Thanks everyone, I get what all of you are trying to say. Anyway, all I wanted to do is, instead of saving duplicate copies of a word, build a hash table and store the address where each occurrence needs to be placed.
That way I wouldn't be storing the duplicates.
Yes, and this would have become specific to a particular text editor; I never realized that.
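To make that idea concrete, here is a minimal sketch (hypothetical, not tied to any editor): distinct words are stored once in a table and the body is stored as indices into that table. Note that the result is no longer a plain text file, which is the point the answers below make.

def encode(text):
    words = text.split()     # naive whitespace tokenization; original spacing is not preserved
    table = []               # each distinct word, stored once, in first-seen order
    index = {}               # word -> position in table (the "hash table" from the question)
    body = []
    for w in words:
        if w not in index:
            index[w] = len(table)
            table.append(w)
        body.append(index[w])
    return table, body

def decode(table, body):
    return " ".join(table[i] for i in body)

table, body = encode("the cat sat on the mat because the mat was warm")
print(len(table), "distinct words for", len(body), "tokens")
assert decode(table, body) == "the cat sat on the mat because the mat was warm"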
I want to use an algorithm to reduce the memory used to save a particular text file
If you did this, you would no longer have a text editor; instead you would have created some sort of binary file editor.
The whole point of the text file format is that it is universal, meaning any text file can be opened in any other text editor.
Emacs handles compression transparently. Just create a text file with a .gz extension. Emacs will automatically compress the contents of the file when you save it, and decompress them the next time you open the file.
Text is basically stored as-is, i.e. every character takes up a byte or two (wide chars), and no conversion is done on it when it's saved. It might add an end-of-file character or something, though.
Don't try coming up with your own algorithm to compress these files. That's why zip files and other archives were created; they're really good at compressing text. If you wanted to add this feature to your text editor, you'd have to add some sort of post-save hook to zip it, and then put a hook on the open command to unzip it, unless you wanted to do it by hand every time.
Don't try writing the text editor yourself from scratch, unless (maybe) you're writing a Notepad clone. Text editors with syntax highlighting aren't very easy to make, even with the proper libraries. I'd say write a plugin for something like Visual Studio or what have you, or find an open-source text editor.
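A minimal sketch of that save/open hook idea in Python, with hypothetical helper names; a real editor plugin would wire these into its own save and open commands:

import gzip

def save_compressed(path, text):
    # post-save hook: write the buffer as gzip instead of plain text
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(text)

def open_compressed(path):
    # open hook: decompress transparently before showing the buffer
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()

save_compressed("notes.txt.gz", "the cat sat on the mat " * 1000)
print(len(open_compressed("notes.txt.gz")), "characters restored")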

Resources