I'm working on a project for school.
The assignment is as follows:
Implement a sorting algorithm of your choosing in assembly (we are using the GNU Assembler). The input is a text-file with a series of numbers separated by newline.
I'm then trying to implement insertion sort.
I have already opened and read the file and i'm able to print the content to terminal.
My problem is now how to split each number from the file in order to compare and sort them.
I believe google is glowing at the moment due to my effort to find and answer (maybe I don't know what I need to type or where to look).
I have tried to get each character from the string, which i'm able to do BUT I don't know to put them together again as integers (we only have integers).
If anybody could help with some keywords to search for it would be much appreciated.
Related
My output (csv/json) from my newly-created program (using .NET framework 4.6) need to be converted to a IBM-1027-codepage-binary-file (to be imported to Japanese client's IBM mainframe),
I've search the internet and know that Microsoft doesn't have equivalent to IBM-1027 code page.
So how could I output a IBM-1027-codepage-binary-file if I have an UTF-8 CSV/json file in my hand?
I'm asking around for other solutions, but for now, I think I'm going to have to suggest you do the conversion manually; I assume whichever language you're using allows you to do a hex conversion, at worst. For mainframes, the codepage is usually implicit in the dataset, it isn't something that is included in the file header.
So, what you can do is build a conversion table, from https://www.ibm.com/support/knowledgecenter/en/SSEQ5Y_5.9.0/com.ibm.pcomm.doc/reference/html/hcp_reference26.htm. Grab a character from your json/csv file, convert to the appropriate hex digits, and write those hex digits to a file. Repeat until EOF. (Note to actually write the hex data, not the ascii representation of the hex data.) Make sure that when the client transfers the file to their system, they perform a binary transfer.
If you wanted to get more complicated than that, you could look at enhancing/overriding part of the converter to CP500, which does exist on Microsoft Windows. One of the design points for EBCDIC was to make doing character conversions as simple as possible, so many of the CP500 characters hex representations are the same as the CP1027, with the exception of the Kanji characters.
This is a separate answer, from a colleague; I don't have the ability to validate it, I'm afraid.
transfer the file to the host in raw mode, just tag it as ccsid 1208
(edited)
for uss export _BPXK_AUTOCVT=ALL
oedit/obrowse handles it automatically.
So as many others have asked in the past is there a way to beat the 32k limit per cell in Excel?
I have found ways to do it by splitting the work load into two different .txt files and then merging the two .txt files, however it is a giant PITA and more often then not I end up only using excel to its limits as I do not have time to validate the data after .txt file merges anymore this is a long process and tedious IMO.
However I think that if the limitation is there it is there because it was coded when Microsoft developed Excel, and since they have yet to raise it (2013 version the limit is still the same limit so it would do no good to upgrade)
I also know that many will say if you have a need for information in a single cell in that length then you should use ACCESS well I have no idea how to use ACCESS or how to import a tab delimited file into ACCESS like you would into EXCEL, and then even if I could figure that out I still now have to figure out how to learn all the new commands and he EXCEL equivalents if there is even such a thing.
So I was browsing some blog posts the other day on how to beat limitations by software and I read something about reverse engineering.
Would it be possible to load excel into a hex editor, go in and change every instance of 32767 to something greater?
While 32767 may seem like an arbitrary number, it's actually the upper limit of a 16-bit signed integer (called a short in C). The range of a short goes from -32768 to 32767.
A 16-bit integer can also be unsigned, in which case its range is 0 to 65535.
Since it's impossible for a cell to have a negative number of characters, it seems odd that Microsoft would limit a cell's length based on a signed rather than unsigned 16-bit integer. When they wrote the original program, they probably couldn't imagine anyone storing so much information in a single cell. Using shorts may have simplified the code. (My first computer had only 4K of memory, so it's still amazing to me that Excel can store 8 times that much information in a single cell.)
Microsoft may have kept the 32767 limit to maintain backward compatibility with previous versions of Excel. However, that doesn't really make sense, because the row and column counts greatly increased in recent versions of Excel, making large spreadsheets incompatible with previous versions.
Now to your question of reverse-engineering Excel. It would be a gargantuan task, but not impossible. In the early '90s, I reverse-engineered and wrote vaccines for a few small computer viruses (several hundred bytes). In the '80s, I reverse-engineered an 8KB computer chess program.
When reverse-engineering an executable, you'll need a good disassembler or decompiler. Depending on what you use, you may get assembly-language or C code as the output. But note that this will not be commented code, and you will not see meaningful variable or function names. You'll have to read every line of code to determine what it does. And you'll quickly discover that the executable is the least of your worries. Excel's executable links in a number of DLL files, which would also need reverse-engineering.
To be successful, you will need an extensive knowledge of Windows programming in addition to C or Intel assembly code – not to mention a large amount of patience. Learning Access would be a much simpler task.
I'd be interested in why 32767 is insufficient for your needs. A database may make more sense, and it wouldn't necessarily need to duplicate the functionality of Excel. I store information in a database for output to Web pages, in which case I use HTML+JavaScript for anything that needs to be interactive.
In case anyone is still having this issue:
I had the same problem with generating a pipe-separated file of longitudinal research data. The header row exceeded the 32767 limit. Not an issue unless the end-user opens the file in excel. Work around is to have end-user open file in google sheets, perform the text-to-columns transformation, then download and open file in excel.
https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Length-limit-of-cell-contents-in-Excel-when-opening-exported-bibliographic-data?language=en_US
Jack Straw from Wichita (https://stackoverflow.com/users/10327211/jack-straw-from-wichita) surely you can do an import of a pipe separated file directly into Excel, using Data>Get Data? For me it finds the pipe and treats the piped file in the same way as a CSV. Even if for you it did not, you have an option on the import to specify the separator that you are using in your text file.
Kind regards
Sefton Hall
I am pretty new to Linux, and I would like to know how can I compare two text files in Linux in order to determine the first character of difference between them. Any help will be appreciate it
I have a text corpus which is already aligned at sentence level by construction - it is a list of pairs of English strings and their translation in another language. I have about 10 000 strings of 5 - 20 words each and their translations. My goal is to try to build a metric of the quality of the translation - automatically of course, because I'm dealing with languages I know nothing about :)
I'd like to build a dictionary from this list of translations that would give me the (most probable) translation of each word in the source English strings into the other language. I know the dictionary will be far from perfect but I'm hoping I can have something good enough to flag when a word is not consistently translated, for example, if my dictionary says "Store" is to be tranlated into French by "Magasin" then if I spot some place where "Store" is translated as "Boutique" I can suspect that something is wrong.
So I'd need to:
build a dictionary from my corpus
align the words inside the string/translation pairs
Do you have good references on how to do this? Known algorithms? I found many links about text alignment but they seem to be more at the sentence level than at the word level...
Any other suggestion on how to automatically check whether a translation is consistent would be greatly appreciated!
Thanks in advance.
A freely available (specifically, GPL-licensed) tool for word alignment is GIZA++. I trains the well-known IBM models mentioned in other answers, as well as other statistical models.
You can download it from the GIZA++ site at Google Code, and there is a brief introduction to its usage found at the GIZA++ Apertium. It boils down to this procedure:
Create your parallel corpus, sentence-aligned (you seem to have this already)
Apply the plain2snt tool included in GIZA++ to extract word lists and sentence lists in GIZA++ format
(Optional – only used for some models:) Generate word classes using the mkcls tool (also included)
Run the actual word alignment tool GIZA++. There are various optional configuration settings you can use to determine the type of model generated.
Before you can do this, you must build the tool from source code by running make. The code is written in C++ and compiles well with recent GCC versions.
A few final notes:
There are more than one possible translations for every word; you shouldn't rely on the assumption that a specific translation found in one text is necessarily wrong just because the same word is translated differently in another text;
One word may be translated into a (not necessarily contiguous) sequence of several words, and vice versa. Some words are not translated at all;
GIZA++ is a statistical tool that approximates the correct word alignment; many of the alignments it generates are questionable or incorrect.
This a pretty standard statistical machine translation problem called 'word alignment'.
There are bunch of EM clustering-based models developed by researchers at IBM which I think are the base for most other cooler models being developed today.
Google for 'ibm word alignment models' to find about IBM Models 1 to 5.
This presentation - http://www.stanford.edu/class/cs224n/handouts/cs224n-lecture-05-2011-MT.pdf seems like a good place to start.
Are you using spaces between words? Whatever character you are using, you might check out the slice command in Linux. It gives you the ability to filter words in-between spaces and other characters.
For example if I know that ć should be ć, how can I find out the codepage transformation that occurred there?
It would be nice if there was an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but tools are not important, I'll take anything that works including python scripts)
EDIT:
Could you please be a little more verbose? You know for certain that some substring should be exactly. Or know just the language? Or just guessing? And the transformation that was applied, was it correct (i.e. it's valid in the other charset)? Or was it single transformation from charset X to Y but the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer - utf-8 text file got opened as ascii text file and then exported as csv.
It's extremely hard to do this generally. The main problem is that all the ascii-based encodings (iso-8859-*, dos and windows codepages) use the same range of codepoints, so no particular codepoint or set of codepoints will tell you what codepage the text is in.
There is one encoding that is easy to tell. If it's valid UTF-8, than it's almost certainly no iso-8859-* nor any windows codepage, because while all byte values are valid in them, the chance of valid utf-8 multi-byte sequence appearing in a text in them is almost zero.
Than it depends on which further encodings may can be involved. Valid sequence in Shift-JIS or Big-5 is also unlikely to be valid in any other encoding while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which way you get fewer errors.
If you can limit the number of transformation that may have happened, it shouldn't be too hard to put up a python script that will try them out, eliminate the obvious wrongs and uses a spell-checker to pick the most likely. I don't know about any tool that would do it.
The tools like that were quite popular decade ago. But now it's quite rare to see damaged text.
As I know it could be effectively done at least with a particular language. So, if you suggest the text language is Russian, you could collect some statistical information about characters or small groups of characters using a lot of sample texts. E.g. in English language the "th" combination appears more often than "ht".
So, then you could permute different encoding combinations and choose the one which has more probable text statistics.