I recently came across an article by Greg Kroah-Hartman on why the Linux kernel does not have a stable API and how the kernel repository is organized as a monotree. When I discussed the article with a friend, it became clear that we had different understandings of what the term tree applies to:
Tree refers to the different sub-folders of a project.
Tree refers to the different forks of the git master branch.
In the first case, contributors would not check out the complete project, e.g. the Linux kernel, but only a sub-folder. These sub-folders could then be combined with, e.g., git-subtree.
In the second case, contributors would have to check out the complete project and basically create a fork of a monorepo.
So what does tree in monotree refer to and how can a project be organized as a monotree with git?
Let's make a few notes here:
The phrase monotree, or even the partial word mono, never appears in the referenced article.
The article has seven occurrences of the word tree.
In six of these seven occurrences, the entire phrase is the main kernel tree. The one reference that does not use this full phrase just says the tree but clearly has the same intent as the other six.
You have tagged this with git linux monorepo (in case the tags change).
Your question amounts to either: What does the author mean by the phrase "the main kernel tree"? or What do people in general mean when they refer to a tree? These are valid questions but not particularly relevant to Git.
Tree in computer science tends to refer to the data structure, which is also pretty loosely defined; see the Wikipedia entry. We have some collection of nodes and edges—mathematically, a graph G defined by its set of vertices V and edges E, where each vertex connects by edges to other vertices—and there are constraints on the graph so that it is minimally connected, or equivalently, maximally acyclic. (See https://en.wikiversity.org/wiki/Introduction_to_graph_theory/Proof_of_Theorem_4 and the answers to What's the difference between the data structure Tree and Graph?)
A tree object in Git specifically refers to the stored Git object of Git-type "tree" (one of four Git object types that are stored in the repository database—the other three are commit, blob, and annotated tag). Such an object stores <mode, name, hash-ID> triples, where the mode and hash-ID identify additional Git objects to associate with the name, which is an arbitrary1 string of bytes excluding NUL and slash (codes 0 and 0x2f or 47 respectively). A commit object stored in Git includes the hash ID of a single tree object. Reading the tree object and locating the sub-objects it lists, then recursively reading their own sub-objects if those objects are trees, results in constructing the minimally-connected graph that is a CS-style tree.
1There's a length limit due to the cache entry ce_namelen field, which has a 32-bit integer type. So no name component can exceed about 4 GB in length. Practically speaking, none should probably exceed 255 bytes, but tree objects in Git don't enforce any particular limit, as far as I know.
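To see these <mode, name, hash-ID> triples for yourself, Git will pretty-print a tree object on request. Here is a minimal Python sketch (assuming a Git repository in the current working directory) that wraps git cat-file -p, which for a tree prints one entry per line:

import subprocess

# Pretty-print the tree object of the current HEAD commit. Each output
# line is "<mode> <type> <hash-ID>\t<name>".
out = subprocess.run(
    ["git", "cat-file", "-p", "HEAD^{tree}"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    meta, name = line.split("\t", 1)
    mode, otype, hash_id = meta.split()
    # Entries of type "tree" are sub-trees; reading them recursively
    # reconstructs the CS-style minimally-connected graph.
    print(mode, otype, hash_id, name)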
A file system tree in Linux is really just a string identifying an entity within the file system, though naming anything other than a directory results in a degenerate tree with just one node in it. By naming a directory, though, you can imply that anyone interpreting this string should read the directory's contents, which are names that (by being concatenated with the string identifying the directory itself) name another Linux file system tree, possibly a degenerate one with a single file or device node or whatever. This kind of recursive enumeration leads to building up a minimally-connected graph, just as with the Git tree object. (Perhaps unsurprisingly, the Linux directory objects have essentially the same constraints on names as the Git tree objects, though they usually have a much smaller maximum component name length, typically 255 bytes or fewer.)
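For comparison, here is a minimal Python sketch of that recursive enumeration (the function name is mine; symlink loops and permission errors are ignored for brevity), which builds the same kind of minimally-connected graph starting from a directory:

import os

def walk_tree(path):
    # Naming a non-directory yields a degenerate tree with one node.
    if not os.path.isdir(path):
        return (path, [])
    # Naming a directory implies reading its contents; each contained
    # name, concatenated with the directory's own string, names another
    # (possibly degenerate) file system tree.
    return (path, [walk_tree(entry.path) for entry in os.scandir(path)])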
Finally, the way the phrase the main kernel tree is used in the article refers to the Linux kernel repository—Linus Torvalds's Git repository for the Linux kernel—and the entire ecosystem around it. There is a lot of room for argument about the details. Here, I will just include a link to this particular InfoWorld article, which seems like a reasonable summary of the state of affairs as of the time it was written (August 2016).
Related
We write app.example.com instead of com.example.app in major DNS-based systems, including WWW, FTP, and email. What is the reason behind this design? Why not the reverse order?
Hostnames existed before the DNS and even before TLDs.
Structure was added to the names through RFC 921 "Domain Name System Implementation Schedule - Revised" (October 1984).
This document explains the change from simple names (no dot) to hierarchical ones, which was needed because the Internet was growing at the time and a single flat list of names was no longer enough to describe every host.
Some excerpts:
The names are being changed from simple names, or globally unique
strings, to structured names, where each component name is unique
only with respect to the superior component name.
...
The elements (or components) of the structured names are separated
with periods, and the elements are written from the most
specific on the left to the most general on the right.
For example: USC-ISIF.ARPA
RFC 882 "DOMAIN NAMES - CONCEPTS and FACILITIES" (November 1983) just says it is a convention:
The domain name of a node or leaf is the path from
the root of the tree to the node or leaf. By convention, the
labels that compose a domain name are read left to right, from the
most specific (lowest) to the least specific (highest).
A clue may come from RFC 1034 "DOMAIN NAMES - CONCEPTS AND FACILITIES" (November 1987) that repeats the above with some details:
The domain name of a node is the list of the labels on the path from
the node to the root of the tree. By convention, the labels that
compose a domain name are printed or read left to right, from the most
specific (lowest, farthest from the root) to the least specific
(highest, closest to the root).
RFCs (see RFC 1166) have a tradition of using "MSB 0" bit numbering: when you write down a byte of 8 bits, you start with the most significant one, the bit with the highest value (the one encoding the decimal value 128).
This was then extended with the concept of network byte order, where the most significant byte comes first.
I guess that the idea of starting with the most specific label of the name comes directly from this idea of the most significant bit first, which means starting with the label farthest from the root, and hence finally having the root at the top right side and reading a full name in a kind of right-to-left pattern.
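To see the two orderings side by side, reversing the labels is a one-liner (Python here, purely for illustration):

name = "app.example.com"
# Reverse the labels to get the root-first (most general first) form.
print(".".join(reversed(name.split("."))))  # -> com.example.app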
How do I extract data from a mainframe into Excel? Currently I am fetching data from MS Access, but the requirement is for the mainframe.
Thanks in advance
First, please understand that saying "extract data from mainframe" is similar to saying "extract data from Intel." The following is not comprehensive but is intended to provide an idea of how to ask your question in a manner which can be meaningfully answered.
Please understand there is a big difference between...
what is technically possible
what is allowed in your shop
what is likely to provide a robust and maintainable solution given your requirements
These are three very different things. Some of us answering questions here on Stack Overflow have life experiences that make us reticent about answering questions regarding what is technically possible absent any mention of what is allowed in your shop or what the actual business requirement is that is being solved.
Mainframes have been around for over half a century, and many shops have standard solutions to technical problems. Sometimes the solution is "don't do that, and here's what we do instead." Working against the recommendations of your technical staff, or your shop standards, is career limiting.
What operating system?
z/OS is in common use on mainframes, but there do exist shops that still run one of its ancestors like MVS/XA. The mainframe operating system traces its roots back to OS/360 first available in 1965.
z/TPF
z/Linux usually runs on top of the z/VM hypervisor.
z/VSE
In what sort of file does the data reside?
QSAM or Queued Sequential Access Method, also commonly called flat files.
VSAM or Virtual Storage Access Method. There are several different kinds of VSAM files, including KSDS (Key-Sequenced Data Set), ESDS (Entry-Sequenced Data Set), RRDS (Relative Record Data Set), and Linear (conceptually similar to a memory-mapped file).
a DBMS like DB2 or IMS. A DBMS typically has extract facilities to allow writing a flat file from its own internal format. DB2, for example, stores data in Linear VSAM datasets.
Unix System Services files reside in a different file system than QSAM or VSAM. This will be more familiar, as it has a directory structure where the classic z/OS file system has none.
What does the data look like?
You must know the record layout of the data you wish to retrieve.
It is common for mainframe data to include both text and binary data in a single record, for example a name and a currency amount:
Hopper Grace ar%
...which would be...
x'C8969797859940404040C799818385404040404081996C'
...in hex. This is code page 37, commonly referred to as EBCDIC.
Without knowing that the family name is confined to the first 10 bytes, the given name to the subsequent 10 bytes, and the currency amount, which is +819.96, to the next 3 bytes in packed decimal (also known as binary coded decimal), you cannot accurately transfer the data, because code page conversion will destroy the currency amount. Converting to code page 1250, commonly in use on Microsoft Windows, you would end up with...
x'486F707065722020202047726163652020202020617225'
...where the text data is translated but the packed data is destroyed. The packed data no longer has a valid sign in the last nibble (the lower half of the last byte) and the amount itself has been changed.
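To make the point concrete, here is a hedged Python sketch (assuming the fixed layout described above and Python's built-in cp037 codec) that decodes the text fields and the packed decimal field separately, rather than converting the whole record:

rec = bytes.fromhex("C8969797859940404040C799818385404040404081996C")

family = rec[0:10].decode("cp037").rstrip()   # "Hopper"
given  = rec[10:20].decode("cp037").rstrip()  # "Grace"

packed = rec[20:23]                            # x'81996C'
digits = "".join(f"{b >> 4}{b & 0x0F}" for b in packed[:-1])
digits += str(packed[-1] >> 4)                 # last nibble is the sign
sign = packed[-1] & 0x0F                       # x'C'/x'F' positive, x'D' negative
amount = int(digits) / 100                     # two implied decimal places
if sign == 0x0D:
    amount = -amount

print(family, given, amount)                   # Hopper Grace 819.96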
Security
Is the data you wish to access covered by privacy legislation? You may have to provide some evidence that whatever protections are in place to guarantee that only authorized personnel have access to this data on the mainframe are also in place once you have transferred it off of the mainframe. Such guarantees may have to satisfy an auditor.
What you need
You need to know what operating system holds your data, you need to know what type of file holds your data (a DBMS isn't a type of file but let's let that go for now), and you need to know your record layout(s).
Typically, the easy way to retrieve data is to extract it from its existing data store (QSAM, VSAM, DBMS) into a flat file where all the data is in a text format. There are mainframe utilities to accomplish this. In extreme cases, a program can be written to accomplish this goal. Once it has been accomplished, you can transfer your data without fear of destroying packed or binary data.
You may be able to read data directly from a DBMS if that's where your data resides, but this may depend on shop standards, including security.
Modern mainframes can transfer data via FTP, FTPS, and SFTP. Which is recommended in your shop is something to ask your technical staff.
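As one hedged illustration (the host, credentials, and dataset name are placeholders; whether this is permitted is, again, a question for your technical staff), Python's ftplib can pull a text-format extract via FTP:

from ftplib import FTP

# z/OS FTP servers expose datasets by their fully qualified names,
# quoted. A text-mode (ASCII) transfer converts EBCDIC text to ASCII,
# which is safe only once packed/binary fields have already been
# extracted into text form as described above.
ftp = FTP("mainframe.example.com")
ftp.login("myuser", "mypassword")
lines = []
ftp.retrlines("RETR 'PROD.EXTRACT.FLATFILE'", lines.append)
ftp.quit()

with open("extract.txt", "w") as f:
    f.write("\n".join(lines))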
I am trying to match a list of entries in a given text file. The list is quite huge. It's a list of organization names, where names can have more than one word. Each text file is a typical write-up with several paragraphs, totaling approximately 5000 words per file. It's plain-text content, and there is no clear boundary by which I can locate organization names.
I am looking for a way by which all the entries from the list are searched for in the text file, and whichever entries match are recognized and tagged.
Is there any tool or framework to do this?
I tried to go through all the text mining tools listed in Wikipedia, but none seems to match this need.
Any inputs would be highly appreciated.
Approach 1: Finite State Machine
You can combine your search terms into a finite state machine (FSM). The resulting FSM can then scan a document for all the terms simultaneously in linear time. Since the FSM can be reused on each document, the expense of creating it is amortized over all the text you have to search.
A good regular expression library will make an FSM under the covers. Writing code to build your own is probably beyond the scope of a Stack Overflow answer.
The basic idea is to start with a regular expression that is an alternation of all your search terms. Suppose your organization list consists of "cat" and "dog". You'd combine those as cat|dog. If you also had to search for "pink pigs", your regular expression would be cat|dog|pink pigs.
From the regular expression, you can build a graph. The nodes of the graph are states, which keep track of what text you've just seen. The edges of the graph are transitions that tell the state machine which state to go to given the current state and the next character in the input. Some states are marked as "final" states, and if you ever get to one of those, you've just found an instance of one of your organizations.
Building the graph from all but the most trivial regular expressions is tedious and can be computationally expensive, so you probably want to find a well-tested regular expression library that already does this work.
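As a sketch of this approach with Python's re module (the organization list here is made up; sorting longer terms first makes the alternation prefer the longest match at a given position):

import re

orgs = ["pink pigs", "cat", "dog"]
# Escape each term and combine them all into one alternation.
pattern = re.compile(
    "|".join(re.escape(o) for o in sorted(orgs, key=len, reverse=True))
)

text = "The cat chased the dog past the pink pigs."
for m in pattern.finditer(text):
    print(m.start(), m.group())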
Approach 2: Search for One Term at a Time
Depending on how many search terms you have, how many documents you have, and how fast your simple text searching tool is (possibly sub-linear), it may be best to just loop through the terms and search each document for each term as a separate command. This is certainly the simplest approach.
for doc in documents:
    for term in search_terms:
        search(term, doc)
Note that nesting the loops this way is probably most friendly to the disk cache.
This is the approach I would take if this were a one-time task. If you have to keep searching new documents (or with different lists of search terms), this might be too expensive.
Approach 3: Suffix Tree
Concatenate all the documents into one giant document, build a suffix tree, sort your search terms, and walk through the suffix tree looking for matches. Most of the details for building and using a suffix array are in this Jon Bentley article from Dr. Dobb's, but you can find many other resources for them as well.
This approach is memory intensive, mostly cache-friendly, and thus very fast.
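A real suffix tree is a lot of code, so here is a hedged Python sketch of the closely related suffix array idea instead (naive construction, fine only for small inputs), using binary search to find every occurrence of a term:

import bisect

text = "the cat sat on the mat with the cat"
# Naive suffix array: every suffix start position, sorted by suffix text.
suffixes = sorted(range(len(text)), key=lambda i: text[i:])
keys = [text[i:] for i in suffixes]  # memory hungry, but simple

def occurrences(term):
    # All suffixes starting with term are contiguous in sorted order.
    lo = bisect.bisect_left(keys, term)
    hits = []
    while lo < len(keys) and keys[lo].startswith(term):
        hits.append(suffixes[lo])
        lo += 1
    return sorted(hits)

print(occurrences("cat"))  # -> [4, 32]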
Use a prefix tree, aka a trie.
Load all your candidate names into the prefix tree.
For your documents, match them against the tree.
A prefix tree looks roughly like this:
{}
+-> a
| +-> ap
| | +-> ... apple
| +-> az
| +-> ... azure
+-> b
+-> ba
+-> ... banana republic
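Here is a minimal dict-based sketch of that idea in Python (the names and the test sentence are made up); every position in the document tries to walk the trie as far as it can:

END = object()  # marker meaning "a complete name ends here"

def insert(trie, name):
    node = trie
    for ch in name:
        node = node.setdefault(ch, {})
    node[END] = name

def find_matches(trie, text):
    matches = []
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break  # no candidate continues with this character
            if END in node:
                matches.append((start, node[END]))
    return matches

trie = {}
for name in ["apple", "azure", "banana republic"]:
    insert(trie, name)
print(find_matches(trie, "we prefer azure over banana republic"))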
The vm_area_struct structure used to link the various sections of a memory-mapped executable file is stored as a red-black tree. Now, as far as I know, and as the post Difference between red-black trees and AVL trees also mentions, AVL trees perform faster lookups than RB trees.
This tree is indexed by the virtual addresses referred to by the process and is created when the process begins its execution. I expect this tree to be used mostly for lookup and at times for insertion and deletion. If this is the case, then why is an AVL tree not preferred over an RB tree as the implementation?
Also, if my understanding is incorrect and the tree actually involves a lot of insertions and deletions in comparison to lookups, please provide a reference to support this claim.
I have seen some articles on TLDP mentioning that an AVL tree was used for this earlier. Please explain on what grounds this change was brought about.
This is addressed in the documentation directory in the kernel source repository.
Documentation/rbtree.txt
Red-black trees are similar to AVL trees, but provide faster realtime
bounded worst case performance for insertion and deletion (at most two
rotations and three rotations, respectively, to balance the tree),
with slightly slower (but still O(log n)) lookup time.
I know how to implement union find in general, but I was wondering whether there would be a way to utilize the set structure in Python to achieve the same result.
For example, we can union sets pretty easily. But I'm not sure how to determine if two elements are in the same set using just sets.
So, I am wondering if there is a data structure in Python that would support such an operation, other than the usual implementation?
You could always solve this problem by visualizing it as trees whose nodes connect to each other via their roots, and then looking up the tree when you want to know whether two nodes are connected. If the two nodes you are comparing have the same root (they are in the same tree), then they are connected.
To connect two nodes, just go to the root of each tree they are in, and make one root become the parent of the other.
This video will give you a great intuition about it:
https://www.youtube.com/watch?v=YIFWCpquoS8&list=PLUX6FBiUa2g4YWs6HkkCpXL6ru02i7y3Q&index=1
The connection between the tree nodes can be made via pointers in a language that supports them, but if your language doesn't (Python), you can create your own pointers by storing positions and links via an array.
The array would be such that its positions represent your nodes, and the values inside it represent the connection of each node toward its root. At the beginning, each position in the array is filled with its own node number, because the nodes initially have no parent, but as you connect nodes the roots change, and the array has to represent this. Actually, the value stored there is the identifier of the root.
But try visualizing the problem first instead of thinking of arrays and too many mathematical artifacts. Dealing with it visually makes the solution seem almost banal, and it can be good guidance while writing code.
I say this because I watched the video from Robert Sedgewick that I just posted, with a graphical simulation of the solution, and implemented it myself without paying too much attention to the code in his book. The intuition the video gave me is much more valuable than any mathematics.
It will help you to encapsulate the nodes into a class, with the following methods:
climbTreeFromNodeUpToRoot
setNewParentToThisNodeAndUpdateHeights
The first method, as the name says, takes you from a node and goes up the tree until it finds the root, which is then returned.
If you compare two nodes with this method (actually, the roots returned by it), you can easily tell whether they are connected by just comparing their roots.
Once you want to connect them, you go up the trees of both nodes, and ask one root to take the other one as its parent.
The trees can grow very big in height (sorry, I don't use the official nomenclature, but this is the term that makes sense to me), so this simple approach will get very slow when you have to climb the tree at a later time.
To prevent trees from becoming too high, don't just set one root as the parent of another without criteria; instead, attach the smaller tree (in terms of height, not quantity of elements) to the higher one.
For this, you need to know the height of each tree, and you can store this information on their respective roots (via an extra array in your case, or an extra pointer from each node in other languages). This information should be updated every time another tree connects to it.
It is not possible for a tree to know on its own that it just got a new tree attached to it, so it's important that every tree attaching to a second one informs the second so it can update its height.
This information can be sent to the root of the second tree, and later used to judge (as written before) which tree is the smaller one. Remember, attaching a small tree to a big one instead of the opposite will save you incredible amounts of time.
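Putting the pieces above together, here is a minimal array-based sketch in Python (the method names are shortened versions of the ones suggested earlier, and the height bookkeeping follows the description above):

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))  # each node starts as its own root
        self.height = [0] * n         # tree height, tracked at the root

    def find(self, x):
        # Climb from the node up to the root of its tree.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def connected(self, a, b):
        # Two nodes are connected iff their trees share a root.
        return self.find(a) == self.find(b)

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.height[ra] < self.height[rb]:
            ra, rb = rb, ra       # attach the shorter tree...
        self.parent[rb] = ra      # ...to the taller one's root
        if self.height[ra] == self.height[rb]:
            self.height[ra] += 1  # equal heights: the result grows by one

ds = DisjointSet(10)
ds.union(1, 2)
print(ds.connected(1, 2))  # True
print(ds.connected(1, 3))  # False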
Do you want something like this?
myset = ...
all(elt in myset for elt in (a, b))