How to compare text files character by character using Linux

I am pretty new to Linux, and I would like to know how I can compare two text files in Linux in order to determine the first character of difference between them. Any help will be appreciated.
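On the command line, cmp does exactly this: cmp file1.txt file2.txt reports the offset and line of the first difference, and cmp -l lists every differing byte. If you would rather see how it works, here is a minimal Python sketch (not a polished tool) that walks both files one character at a time; the paths are whatever you pass on the command line:

    #!/usr/bin/env python3
    # Minimal sketch: report the first position where two text files differ.
    # Paths come from the command line; there is no special error handling.
    import sys

    def first_difference(path_a, path_b):
        with open(path_a, encoding="utf-8") as a, open(path_b, encoding="utf-8") as b:
            pos = 0
            while True:
                ca, cb = a.read(1), b.read(1)
                if ca != cb:
                    return pos, ca, cb      # mismatch, or one file ended early
                if ca == "":                # both files ended together: identical
                    return None
                pos += 1

    if __name__ == "__main__":
        result = first_difference(sys.argv[1], sys.argv[2])
        if result is None:
            print("files are identical")
        else:
            pos, ca, cb = result
            print(f"first difference at character {pos}: {ca!r} vs {cb!r}")

Note that cmp counts bytes while the sketch counts decoded characters; for plain ASCII files the two agree.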

Related

What is the best way to search a large file for hexadecimal and export readable results to a file? (OS Agnostic)

My goal is to search a 500 GB file for a series of hexadecimal characters and to export the results into a text file. I need to automate this, as there are many patterns to be searched.
The results need to include the location in the file and the values of the 100 preceding hex characters (represented in both hex and ASCII).
As noted, this is OS agnostic (and language agnostic, if anyone suggests scripts or code).
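One way to approach this (a sketch under assumptions, not a definitive tool): scan the file in large overlapping chunks so a 500 GB file never has to fit in memory, search each chunk for the byte patterns, and write each hit's offset plus the 50 preceding bytes (100 hex characters) in both hex and ASCII. The pattern list, chunk size, and file names below are placeholders:

    #!/usr/bin/env python3
    # Sketch: scan a huge binary file for byte patterns in fixed-size chunks.
    # PATTERNS, the file names, and the sizes are placeholder values.

    PATTERNS = [bytes.fromhex("deadbeef"), bytes.fromhex("cafebabe")]  # hypothetical patterns
    CONTEXT_BYTES = 50                    # 50 bytes = 100 hex characters of context
    CHUNK_SIZE = 64 * 1024 * 1024         # read 64 MiB at a time

    def scan(data_path, report_path):
        # Carry enough bytes between chunks to catch matches that span a boundary
        # and to have the full context available for matches near the chunk start.
        overlap = CONTEXT_BYTES + max(len(p) for p in PATTERNS)
        with open(data_path, "rb") as data, open(report_path, "w") as report:
            tail = b""          # bytes carried over from the previous chunk
            tail_start = 0      # absolute offset of the first byte in `tail`
            while True:
                chunk = data.read(CHUNK_SIZE)
                if not chunk:
                    break
                buf = tail + chunk
                for pattern in PATTERNS:
                    i = buf.find(pattern)
                    while i != -1:
                        # Matches lying entirely inside the carried-over tail were
                        # already reported in the previous round; skip them.
                        if i + len(pattern) > len(tail):
                            context = buf[max(0, i - CONTEXT_BYTES):i]
                            report.write(
                                f"{pattern.hex()} at byte {tail_start + i}: "
                                f"hex={context.hex()} "
                                f"ascii={context.decode('ascii', errors='replace')}\n")
                        i = buf.find(pattern, i + 1)
                tail_start += max(0, len(buf) - overlap)
                tail = buf[-overlap:]

    scan("huge.bin", "hits.txt")   # file names are examples only

Keeping the patterns as raw bytes makes the search a plain byte comparison, which is what keeps the script OS-agnostic.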

Counting the frequency of a word in Wikipedia

I need to extract information from Wikipedia, but I have no idea how to proceed. What I have to do is the following:
Given a word 'w', how can I count the number of times 'w' appears in the whole English Wikipedia? Is there a list already available online? If not, how could I do such a thing? I am new to coding and I'm trying to run some experiments on NLP-related tasks.
First, download the Wikipedia dump (in XML format, for example).
If you are using a UNIX-based OS (e.g. Linux or Mac OS X), you can use grep.
Python can also be used to count occurrences of a specified string in a file; a minimal sketch follows below.
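For the grep route, something like grep -o -w 'linguistics' enwiki-latest-pages-articles.xml | wc -l prints one line per whole-word match and counts them (the word and the dump file name are just examples). The Python sketch below does the same thing while streaming the decompressed dump line by line, so the file never has to fit in memory; again, the names are placeholders:

    #!/usr/bin/env python3
    # Sketch: count whole-word occurrences of a word in a decompressed Wikipedia
    # XML dump, streaming line by line.  The dump path and word are placeholders.
    import re
    import sys

    def count_word(dump_path, word):
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        total = 0
        with open(dump_path, encoding="utf-8", errors="replace") as dump:
            for line in dump:
                total += len(pattern.findall(line))
        return total

    if __name__ == "__main__":
        # e.g.  python count_word.py enwiki-latest-pages-articles.xml linguistics
        print(count_word(sys.argv[1], sys.argv[2]))

Both approaches count matches in the raw XML, so occurrences inside markup and tag names are included; for cleaner numbers you would strip the wiki markup first.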

GNU Assembly: split a string of integers into integers

I'm working on a project for school.
The assignment is as follows:
Implement a sorting algorithm of your choosing in assembly (we are using the GNU Assembler). The input is a text file with a series of numbers separated by newlines.
I'm trying to implement insertion sort.
I have already opened and read the file, and I'm able to print the contents to the terminal.
My problem now is how to split out each number from the file in order to compare and sort them.
I believe Google is glowing at the moment due to my efforts to find an answer (maybe I don't know what I need to type or where to look).
I have tried to get each character from the string, which I'm able to do, BUT I don't know how to put them together again as integers (we only have integers).
If anybody could help with some keywords to search for it would be much appreciated.
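Good keywords to search for are "atoi" and "ASCII to integer conversion in assembly". The usual trick is the one atoi uses: for each ASCII digit, subtract '0' (0x30) to get its numeric value and fold it into a running total with total = total * 10 + digit, starting a new number at every newline. Purely as a reference for the logic the assembly loop needs (not assembly itself), a small Python sketch:

    # Reference for the digit-accumulation loop to implement in assembly:
    # subtract ASCII '0' from each digit byte and fold it into a running total,
    # appending the total to the list whenever a newline ends the current number.
    def parse_integers(text):
        numbers = []
        current = 0
        have_digit = False
        for ch in text:
            if "0" <= ch <= "9":
                current = current * 10 + (ord(ch) - ord("0"))
                have_digit = True
            elif ch == "\n":
                if have_digit:
                    numbers.append(current)
                current, have_digit = 0, False
        if have_digit:                 # in case the file does not end with a newline
            numbers.append(current)
        return numbers

    # parse_integers("42\n7\n1000\n")  ->  [42, 7, 1000]

In assembly the same loop is a compare against '0' and '9', a subtract, a multiply-by-10 (or shift-and-add), and an add, keeping the running total in a register.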

Special characters display differently on Windows and Linux

I have two projects, one on Windows and one on Linux, and I use the same database (Oracle 10g) for both. I have an input file that contains text with special characters (ÁTUL ÁD).
The program logic is this: read the input file data into the database. On Windows the data (including the special characters) is displayed correctly; on Linux the special characters display as other characters. As I already said, I use the same database for both of them. Could you give me some help?
The program is complex; it uses the Spring Batch framework. Maybe the encoding causes the problem, but I have no idea how to solve it. I am using Linux for the first time.
Thanks in advance.
One solution I found that works for me is to use UTF-8 encoding everywhere: for Windows, for Linux, and for the database.
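As an illustration of what goes wrong (a sketch; latin-1 below merely stands in for whatever single-byte default the Linux side picked), the same UTF-8 bytes read with two different charsets:

    # The same bytes interpreted with two different charsets.  latin-1 is only an
    # example of a single-byte encoding; the real default depends on the platform.
    data = "ÁTUL ÁD".encode("utf-8")    # the bytes the input file actually contains

    print(data.decode("latin-1"))       # wrong charset: each Á becomes 'Ã' plus a stray byte
    print(data.decode("utf-8"))         # right charset: ÁTUL ÁD

In a Java/Spring Batch setup that usually means specifying the charset explicitly wherever the file is read, checking the JVM default (the file.encoding system property) on both machines, and making sure the Oracle client and database character settings (NLS_LANG and the database character set) agree with it.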

How to determine codepage of a file (that had some codepage transformation applied to it)

For example, if I know that ć should be ć, how can I find out the codepage transformation that occurred there?
It would be nice if there were an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but the tools are not important; I'll take anything that works, including Python scripts).
EDIT:
Could you please be a little more verbose? Do you know for certain what some substring should be, exactly? Or do you know just the language? Or are you just guessing? And the transformation that was applied, was it correct (i.e. is the result valid in the other charset)? Or was it a single transformation from charset X to Y while the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to the proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer: a UTF-8 text file got opened as an ASCII text file and then exported as CSV.
It's extremely hard to do this in general. The main problem is that all the ASCII-based encodings (ISO-8859-*, DOS and Windows codepages) use the same range of byte values, so no particular byte or set of bytes will tell you what codepage the text is in.
There is one encoding that is easy to tell apart: if the text is valid UTF-8, then it is almost certainly not ISO-8859-* nor any Windows codepage, because while all byte values are valid in those, the chance of a valid UTF-8 multi-byte sequence appearing in such text is almost zero.
Beyond that, it depends on which further encodings may be involved. A valid sequence in Shift-JIS or Big5 is also unlikely to be valid in any other encoding, while telling apart similar encodings like cp1250 and ISO-8859-2 requires spell-checking the words that contain the three or so characters that differ and seeing which way gives fewer errors.
If you can limit the number of transformations that may have happened, it shouldn't be too hard to put together a Python script that tries them out, eliminates the obviously wrong ones, and uses a spell-checker to pick the most likely. I don't know of any existing tool that would do it.
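Such a script might start out like this sketch: it only handles a single wrong decode step, assumes the text was really UTF-8 before being decoded with some single-byte codepage, and uses a hard-coded candidate list instead of a spell-checker; all of that is an assumption, not a finished tool.

    # Sketch: undo a single wrong decode step.  Assumes the text was UTF-8 that got
    # decoded with some single-byte codepage; the candidate list is a guess.
    CANDIDATES = ["cp1250", "cp1252", "iso-8859-2", "iso-8859-1", "cp850"]

    def candidate_fixes(mojibake):
        """Yield (codepage, repaired text) pairs whose round trip back through UTF-8 works."""
        for cp in CANDIDATES:
            try:
                repaired = mojibake.encode(cp).decode("utf-8")
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue            # this codepage cannot reproduce valid UTF-8 bytes
            yield cp, repaired

    for cp, text in candidate_fixes("ć"):   # the example from the question
        print(f"{cp}: {text}")

When several codepages survive the round trip (cp1250 and cp1252 both do here), running each repaired candidate through a spell-checker for the expected language, as described above, is what breaks the tie.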
Tools like that were quite popular a decade ago, but now it is quite rare to see damaged text.
As far as I know, this can be done effectively at least for a particular language. So, if you assume the text language is Russian, say, you could collect statistical information about characters or small groups of characters from a lot of sample texts. E.g. in English the "th" combination appears far more often than "ht".
Then you could permute different encoding combinations and choose the one whose result has the more probable text statistics.
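A toy version of that idea (everything here is illustrative: a real bigram table would be trained on a large corpus of the assumed language, not a hard-coded sample):

    from collections import Counter

    # Sketch: rank candidate decodings by character-bigram frequency.
    def train_bigrams(sample_text):
        """Count character pairs in a sample corpus of the target language."""
        return Counter(sample_text[i:i + 2] for i in range(len(sample_text) - 1))

    def score(text, bigrams):
        """Higher score means the text looks more like the sample corpus."""
        return sum(bigrams.get(text[i:i + 2], 0) for i in range(len(text) - 1))

    def rank_decodings(raw_bytes, encodings, bigrams):
        """Return (score, encoding, decoded text) triples, most probable first."""
        results = []
        for enc in encodings:
            try:
                text = raw_bytes.decode(enc)
            except UnicodeDecodeError:
                continue            # not even a valid byte sequence in this encoding
            results.append((score(text, bigrams), enc, text))
        return sorted(results, reverse=True)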
