How does text processing work? - string

Consider a whole novel (e.g. The Da Vinci Code).
How does e-book reader software process and display the whole book?
Does it put the WHOLE book in one very large string? An array of strings? Or something else?

One of the very first "real" programs I wrote (as part of a class exercise in high school) was a text editor. Part of the requirement for this exercise was for the program to be able to handle documents of arbitrary length (i.e. larger than the available system memory).
We achieved this by opening the file, but reading only the portion of it required to display the current page of data. When the user moves forward or backward in the file, we read that portion of the file and display it.
We can speed the program up by reading ahead to load pages which we anticipate that the user will want, and by retaining recently read pages in memory so that there is no obvious delay when the user moves forward or backward.
So basically, the answer to your question is: "No. With very large text files, it is unusual to load the whole thing into memory at once. A program that can handle files like that will load them in chunks as it needs to, and drop chunks it doesn't need any more."
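A minimal sketch of that chunked approach in C, reading only one fixed-size page of the file at a time. The file name novel.txt and the 4 KB page size are just placeholders for this illustration, not anything from the original answer:

```c
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096  /* bytes loaded per "page"; an arbitrary choice for this sketch */

/* Load one page of the file into a caller-supplied buffer.
   Returns the number of bytes actually read (0 at end of file or on error). */
static size_t load_page(FILE *fp, long page_index, char *buf)
{
    if (fseek(fp, page_index * (long)PAGE_SIZE, SEEK_SET) != 0)
        return 0;
    return fread(buf, 1, PAGE_SIZE, fp);
}

int main(void)
{
    FILE *fp = fopen("novel.txt", "rb");   /* hypothetical file name */
    if (!fp) { perror("fopen"); return 1; }

    char buf[PAGE_SIZE + 1];
    long current_page = 0;                 /* advanced or decremented as the user navigates */

    size_t n = load_page(fp, current_page, buf);
    buf[n] = '\0';
    printf("%s", buf);                     /* display only the visible chunk */

    fclose(fp);
    return 0;
}
```

A real editor would keep a small cache of recently viewed pages and read ahead in the background, as described above, but the core idea is just this seek-and-read of one chunk at a time.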
Complex document formats (such as ebooks) may have lookup tables built into the file to allow the user to search or jump quickly to a given page or chapter. In this respect, they effectively work like a database.
I hope that helps.

Related

Is it possible to expand a gap buffer without copying data?

I was reading this overview of some possible data structures for storing a sequence of characters for the purposes of a text editor. One popular and efficient way is the gap buffer.
When the gap buffer fills up so that there is no longer a gap, the data would need to be copied into the beginning and end of a larger buffer to recreate a gap for further insertion. However, on page 9 of the overview, it states that
with some help from the operating system, we can expand the gap without actually moving any data.
I haven't been able to figure out a way to do that, so I'm wondering if it is really possible. And if so, how could it be done, and in which cases? Or am I misunderstanding what the author meant?
In Linux you can use e.g. mremap() to move data in the virtual address space.
As for Windows, you should use a combination of AllocateUserPhysicalPages(), MapUserPhysicalPages(), VirtualAlloc() and thereabouts.
The whole idea is that instead of copying the data, you change how/where the physical memory (holding the data) appears in the address space. If you aren't familiar with the related concepts, read up on page translation and page tables.
Update: Strictly speaking you may still end up copying up to a page size worth of data when the gap disappears and the cursor position is near the beginning of a memory page. But that should be hardly noticeable on modern systems. You still don't copy the data in all the other pages.
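For illustration only, here is a minimal Linux sketch of the mremap() idea. The gap-buffer bookkeeping itself is omitted; the point is just that the kernel can grow or relocate a mapping by updating page tables rather than copying the page contents:

```c
#define _GNU_SOURCE           /* for mremap() and MREMAP_MAYMOVE */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t old_size = 4 * page;

    /* Mapping holding the text that sits *after* the gap. */
    char *tail = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tail == MAP_FAILED) { perror("mmap"); return 1; }
    strcpy(tail, "text after the gap");

    /* "Grow the gap": let the kernel give the same physical pages a new,
       larger virtual mapping instead of memmove()ing the bytes around. */
    char *moved = mremap(tail, old_size, 8 * page, MREMAP_MAYMOVE);
    if (moved == MAP_FAILED) { perror("mremap"); return 1; }

    printf("still readable without copying: %s\n", moved);
    munmap(moved, 8 * page);
    return 0;
}
```

This only works at page granularity, which is why the update above notes that up to a page's worth of data may still need copying when the gap lands in the middle of a page.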

Data extraction from mainframe to Excel

How do I extract data from a mainframe into Excel? Currently I am fetching data from MS Access, but the requirement is for the mainframe.
Thanks in advance
First, please understand that saying "extract data from mainframe" is similar to saying "extract data from Intel." The following is not comprehensive but is intended to provide an idea of how to ask your question in a manner which can be meaningfully answered.
Please understand there is a big difference between...
what is technically possible
what is allowed in your shop
what is likely to provide a robust and maintainable solution given your requirements
These are three very different things. Some of us answering questions here on Stack Overflow have life experiences that make us reticent about answering questions regarding what is technically possible absent any mention of what is allowed in your shop or what the actual business requirement is that is being solved.
Mainframes have been around for over half a century, and many shops have standard solutions to technical problems. Sometimes the solution is "don't do that, and here's what we do instead." Working against the recommendations of your technical staff, or your shop standards, is career limiting.
What operating system?
z/OS is in common use on mainframes, but there do exist shops that still run one of its ancestors like MVS/XA. The mainframe operating system traces its roots back to OS/360 first available in 1965.
z/TPF
z/Linux usually runs on top of the z/VM hypervisor.
z/VSE
In what sort of file does the data reside?
QSAM or Queued Sequential Access Method, also commonly called flat files.
VSAM or Virtual Storage Access Method. There are several different kinds of VSAM files, including KSDS (Key-Sequenced Data Set), ESDS (Entry-Sequenced Data Set), RRDS (Relative Record Data Set), and Linear (conceptually similar to a memory-mapped file).
a DBMS like DB2 or IMS. A DBMS typically has extract facilities to allow writing a flat file from its own internal format. DB2, for example, stores data in Linear VSAM datasets.
Unix System Services files reside in a different file system than QSAM or VSAM. This will be more familiar, as it has a directory structure where the classic z/OS file system has none.
What does the data look like?
You must know the record layout of the data you wish to retrieve.
It is common for mainframe data to include both text and binary data in a single record, for example a name and a currency amount:
Hopper    Grace     ar%
...which would be...
x'C8969797859940404040C799818385404040404081996C'
...in hex. This is code page 37, commonly referred to as EBCDIC.
Without knowing that the family name is confined to the first 10 bytes, the given name to the subsequent 10 bytes, and the currency amount to packed decimal (also known as binary coded decimal) in the next 3 bytes, you cannot accurately transfer the data, because code page conversion will destroy the currency amount, which is +819.96. Converting to code page 1250, commonly in use on Microsoft Windows, you would end up with...
x'486F707065722020202047726163652020202020617225'
...where the text data is translated but the packed data is destroyed. The packed data no longer has a valid sign in the last nibble (the lower half of the last byte) and the amount itself has been changed.
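As an aside (not part of the original answer), here is a small C sketch that decodes those last three packed-decimal bytes back into the amount; the assumption of two implied decimal places is mine:

```c
#include <stdio.h>

/* Decode packed decimal (COMP-3): two digits per byte, sign in the low
   nibble of the last byte (C or F = positive, D = negative). */
static long packed_to_long(const unsigned char *p, int len, int *negative)
{
    long value = 0;
    for (int i = 0; i < len; i++) {
        value = value * 10 + (p[i] >> 4);          /* high nibble: a digit    */
        if (i < len - 1)
            value = value * 10 + (p[i] & 0x0F);    /* low nibble: a digit...  */
    }
    *negative = ((p[len - 1] & 0x0F) == 0x0D);     /* ...except the last one,
                                                      which carries the sign  */
    return value;
}

int main(void)
{
    unsigned char amount[] = { 0x81, 0x99, 0x6C }; /* the last 3 bytes of the record */
    int neg;
    long cents = packed_to_long(amount, sizeof amount, &neg);
    printf("%s%ld.%02ld\n", neg ? "-" : "+", cents / 100, cents % 100); /* prints +819.96 */
    return 0;
}
```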
Security
Is the data you wish to access covered by privacy legislation? You may have to provide some evidence that whatever protections are in place to guarantee that only authorized personnel have access to this data on the mainframe are also in place once you have transferred it off of the mainframe. Such guarantees may have to satisfy an auditor.
What you need
You need to know what operating system holds your data, you need to know what type of file holds your data (a DBMS isn't a type of file but let's let that go for now), and you need to know your record layout(s).
Typically, the easy way to retrieve data is to extract it from its existing data store (QSAM, VSAM, DBMS) into a flat file where all the data is in a text format. There are mainframe utilities to accomplish this. In extreme cases, a program can be written to accomplish this goal. Once it has been accomplished, you can transfer your data without fear of destroying packed or binary data.
You may be able to read data directly from a DBMS if that's where your data resides, but this may depend on shop standards, including security.
Modern mainframes can transfer data via FTP, FTPS, and SFTP. Which is recommended in your shop is something to ask your technical staff.

Millions of rows in GUI

I would like to implement a GUI handling a huge number of rows and I need to use GTK in Linux.
I started having a look at GtkTreeView with lists, but I don't think adding millions of lines directly to that widget will keep the GUI from slowing down the application.
Do you know whether there is a GTK widget already in place for this problem, or do I have to manage the area that must display those lines myself? Eventually I would write the data directly using GtkDrawingArea (essentially writing a new widget).
Any suggestions about any GTK topic or project I can look at as a starting point for my research?
As suggested in the comments, you can use a cell data func and get the displayed data under control. But I have another idea: millions of lines are much, much more than any amount of information a human user can see and understand. So maybe a better, more usable and user-friendly solution is to display the data in a way the users can more easily navigate.
Imagine opening a huge hierarchy, scrolling down, and forgetting what were the top-level items you opened.
Example of a possible solution: have a combo box which allows the user to choose some filter or category; this can reduce the data to a reasonable amount the user can more easily navigate and build a mental model of, if necessary.
NOTE: As far as I know, GtkTreeView doesn't support sorting/filtering and drag-n-drop at the same time, so if you want to use both features, I suggest you use the existing drag-n-drop functionality (otherwise very complicated to implement by hand) and implement your own sorting/filtering.
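To make the filtering idea concrete, here is a rough GTK 3 sketch in C using GtkTreeModelFilter; the single string column and the hard-coded "A" prefix are placeholders for whatever the combo box in the suggestion above would supply:

```c
#include <gtk/gtk.h>

/* Show only rows whose first column starts with the chosen prefix,
   so the view never has to render the full data set at once. */
static gboolean row_visible(GtkTreeModel *model, GtkTreeIter *iter, gpointer data)
{
    const gchar *prefix = data;            /* e.g. the current combo-box choice */
    gchar *text = NULL;
    gboolean show;

    gtk_tree_model_get(model, iter, 0, &text, -1);
    show = (text != NULL && g_str_has_prefix(text, prefix));
    g_free(text);
    return show;
}

int main(int argc, char **argv)
{
    gtk_init(&argc, &argv);

    GtkListStore *store = gtk_list_store_new(1, G_TYPE_STRING);
    /* ... fill the store from your data source ... */

    GtkTreeModel *filter = gtk_tree_model_filter_new(GTK_TREE_MODEL(store), NULL);
    gtk_tree_model_filter_set_visible_func(GTK_TREE_MODEL_FILTER(filter),
                                           row_visible, "A", NULL);

    GtkWidget *view = gtk_tree_view_new_with_model(filter);
    gtk_tree_view_append_column(GTK_TREE_VIEW(view),
        gtk_tree_view_column_new_with_attributes("Name",
            gtk_cell_renderer_text_new(), "text", 0, NULL));

    GtkWidget *win = gtk_window_new(GTK_WINDOW_TOPLEVEL);
    gtk_container_add(GTK_CONTAINER(win), view);
    gtk_widget_show_all(win);

    gtk_main();
    return 0;
}
```

If every row has the same height, enabling the tree view's fixed-height mode may also help with very large models, since the widget no longer has to measure each row.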

JCL - Get count of non-space chars in a given area (mainframe)

I need to prepare some input data to run through a program, the data should be in the following format.
UID (1-11) | TxtLen (12-16) | Text (17-62)
I can use sort to position the fields properly and get the UID and text fields.
The ‘TxtLen’ field should contain the number of chars from the start of the text field to the last non-blank char in the text field.
i.e. “Hello”’s TxtLen is 5, “Hel lo”’s TxtLen is 6, “Hello World”’s TxtLen is 11, etc...
I want to know if there is a way of getting the TxtLen through JCL only, or is a program required to do this?
-Thanks
You will need a program.
I see a fair number of mainframe questions on Stack Overflow asking if something is possible with "JCL only." Keep in mind that JCL is mostly a means of executing programs, and actually does very little other than that. For instance, when you say
I can use sort to position the fields properly and get the UID and text fields
sort is a program. It happens to be a program found on most systems (though there are different vendors' implementations, IBM has one, SyncSort has one, CA has one, etc.) There are plenty of other programs commonly found on mainframe systems.
And just to be pedantic, JCL doesn't actually do anything, JES does the work as it interprets JCL.
For your particular situation you could create a SORT exit, or process your data in Rexx, or you could use some of the Unix System Services commands and execute those via BPXBATCH or COZBATCH.
I've done ad-hoc conversions like this using a REXX program. The program is pretty straightforward:
allocate the input and output files
open both files
begin loop:
read the input
extract the text field and strip trailing spaces
get length of trimmed text field and format as 5-digit numeric
overlay number back into record in the Len field positions
write out updated record
repeat loop until end of file
close both files
free allocated files
Let me know if you need some actual code. I've found that REXX is superior to COBOL when it comes to string functions and manipulation. I've even created and called REXX routines from COBOL to accomplish just that.
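Purely as an illustration of that loop (in C rather than the REXX the answer describes, and not the author's code), a rough sketch might look like this; the file names and the assumption of fixed 62-byte records are mine:

```c
#include <stdio.h>
#include <string.h>

#define REC_LEN 62   /* UID (1-11), TxtLen (12-16), Text (17-62) */

int main(void)
{
    FILE *in  = fopen("input.dat", "r");   /* hypothetical file names */
    FILE *out = fopen("output.dat", "w");
    char rec[REC_LEN + 2];                 /* room for '\n' and '\0' */

    if (!in || !out) { perror("fopen"); return 1; }

    while (fgets(rec, sizeof rec, in)) {
        int len = (int)strlen(rec);
        if (len > 0 && rec[len - 1] == '\n')
            rec[--len] = '\0';
        if (len < REC_LEN)                 /* this sketch assumes fixed-length records */
            continue;

        /* Length of the Text field with trailing blanks stripped. */
        int end = REC_LEN;
        while (end > 16 && rec[end - 1] == ' ')
            end--;
        int txtlen = end - 16;

        /* Overlay the count, zero-padded to 5 digits, into columns 12-16. */
        char num[6];
        snprintf(num, sizeof num, "%05d", txtlen);
        memcpy(rec + 11, num, 5);

        fprintf(out, "%s\n", rec);
    }

    fclose(in);
    fclose(out);
    return 0;
}
```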

How to test a program processing large amounts of data stored in an unpredictable format

What I have to do
I'm trying to manipulate some rather large amounts of data stored in Excel files (one of the workbooks has as many as 150 sheets). The result of these manipulations may yield approximately 800,000 rows in a database table.
The problem
The data stored in the spreadsheets has an unpredictable format. The company that generated these spreadsheets had no fixed or documented format for exporting these files, and sometimes erroneous data appears. For example, most of the years are represented like "2009", but there are cases where a year is represented as "20". As another example, the data is not really normalized in these files, so I use separators to split the values of certain cells. Sometimes these separators change.
There are things like these that I couldn't predict, and I discovered them only after running an already evolved version of my program over a pretty large part of the available data.
The question
How can one test the correctness of a program in such a situation? Or rather, how to achieve a pretty stable version of the product without running it over the whole available data?
Shall I take a defensive approach and throw exceptions whenever some kind of unexpected issue arises? Then the main loop of the program may catch and log them and continue with the available data? This would yield some processed data, but that means that on a subsequent iteration of the program I have to have checks for what's already inside the database from previous iterations (which I don't really like).
What's your opinion? How would you tackle this problem?
If there is no specification for what the format of the data is, then anything is acceptable.
If not, then there is either an explicit or implicit specification of the data. I would try and nail this down right now. If you can't get an explicit enough definition of the data to write your program so that it can be expected to run without error, then I would say you are taking a very large risk in causing some serious damage depending on how this data is being used.
You should write your program so that it either throws an exception or logs an error whenever running across data that does not meet the specification. Then, run the program on PART of the available data until it runs without exception. This can be viewed as a training set for the development of your program. Then, use some of the saved data to use as a TEST set. This will give you an estimate of how many exceptions/errors your program will generate in production.
Overfitting is a common machine learning concept, but it is useful to other tasks such as this - program development. It is surprising to me how developers can write a bunch of unit tests, code their application to perform well on it, and then expect similar or bug-free performance in production.
If you're not willing to take all these steps (i.e. run your code on essentially all of the data -- since the test set is also making use of the data) then I would say the task is too large to do.
As an aside, rather than creating a definition of a format that is very strange and peculiar to account for all the "errors" in the current data, you might want to create a new, normalized (in the sense these things are simplified away) specification for the data, and then write a "faulty document patcher" that can be run on faulty documents to fix the data.
If the application generating the data is still in production, then you might need to go to the developers of this application to get a buy in on the new spec. Once you have that, you can then start logging bugs against their application, so hopefully the faulty document patcher can be retired.
More likely, I'm guessing that the software developers are long gone, no one understands the code anymore, if it is even running at all.
How can one test the correctness of a program in such a situation? Or rather, how to achieve a pretty stable version of the product without running it over the whole available data?
For every single data type I would set reasonable constraints on the values that it is allowed to be.
If a cell violates these constraints then throw an exception containing the piece of data it failed on and its data type. If a piece of data violated its constraints you can modify the source to include the additional constraints required for that piece of data, and a conversion method to make it uniform.
To give an example using the date issue you mentioned: initially a date would have the constraint that it could only be four digits. When the program came across the "20" it would throw an exception.
Then you could go and allow two-digit dates, and add a method to convert the two-digit dates into four-digit ones to allow further processing.
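A tiny C sketch of that kind of per-field constraint, assuming (just for this example) that a two-digit year should be promoted to 20xx and that anything non-numeric is rejected so it can be logged:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Normalize a year cell: 4-digit years pass through, 2-digit years are
   promoted to 20xx (an assumption for this sketch); anything else is
   rejected so the caller can log it and move on. Returns 1 on success. */
static int normalize_year(const char *cell, int *year_out)
{
    size_t len = strlen(cell);
    for (size_t i = 0; i < len; i++)
        if (!isdigit((unsigned char)cell[i]))
            return 0;                       /* constraint violated */

    if (len == 4) { *year_out = atoi(cell); return 1; }
    if (len == 2) { *year_out = 2000 + atoi(cell); return 1; }
    return 0;                               /* constraint violated */
}

int main(void)
{
    const char *samples[] = { "2009", "20", "x9", "" };   /* illustrative cell values */
    for (size_t i = 0; i < 4; i++) {
        int year;
        if (normalize_year(samples[i], &year))
            printf("'%s' -> %d\n", samples[i], year);
        else
            fprintf(stderr, "constraint violation in year cell: '%s'\n", samples[i]);
    }
    return 0;
}
```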
One question is, will you run your program more than once? From your question it sounds possible you only want to run it once, and then you will then work with the data in the database.
In which case you can be very defensive - throw exceptions whenever unexpected data appears. Run the program repeatedly on ever-larger sets of the data. Initially, solve any exceptions by altering the code, as it's a good rule of thumb that the exceptions you find first are going to be common. You might want to empty the output database between runs.
Later on, you will be finding rare exceptions that might only occur a couple of times in the input. Just solve these by hand and insert the corresponding rows in the database yourself. Or write another small program that reads your exception information and inserts the new rows, rather than running your whole big program again.
Typically for this sort of thing I do as #MarkJ suggested, and I encode the whole thing in unit tests.
So I compose a small datafile that at first contains only a few rows of normal data. That's unit test number 1.
Then I take a quick visual scan of some of the data to spot any obvious exceptions. Unit tests 2 through n.
Finally, I write parser code until it passes all unit tests, and throws and logs exceptions for all unmanaged data.
I then use these oddball bits of data to make new unit tests, and improve the parser until it can pass those too.
Although sometimes accommodating some really strange bit of data adds more parser complexity than it's worth, and I'll just log the exception, dump it, and move on. This is a matter of professional judgment.
How about processing every piece of data (so you don't have to check for dupes)? Those that pass go into the database. The exceptions go into an exception file. The user can open the exception file and make corrections/modifications to the data. Then they can run your program on the exception file.
This will isolate unhandled data for the user to correct and prevent you from processing the same data twice (or more).

Resources