Find strings matching any of a large set of strings

I have two files. One contains a list of 900 filenames. The other file contains a list of command directives, a subset of which reference the files provided in the first list. There are about 11,000 directives.
I want to extract the 900 directives that correspond to the 900 filenames in the first file.
I would like a command line solution for doing this, and if nothing else comes up I will resort to figuring out the chain of loop, grep, appending, piping, etc that is needed to do this.
But I'm hoping for a known working solution to reduce the time and errors it'll take me to work it out.

Well, it turns out that was easier than I thought:
xargs -I{} grep {} file2 < file1
Then just redirect the output to where I need it.
Maybe not the most efficient thing in the world, but it works fast enough for what I need.
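For comparison, grep can read the whole pattern list itself, which avoids starting one grep process per filename. The following is a minimal sketch, not the original poster's method; the sample filenames and directives are made up for illustration, and file1 is assumed to contain no empty lines (an empty pattern matches every line).

```shell
# Hypothetical sample data standing in for the two files in the question.
workdir=$(mktemp -d)
printf '%s\n' 'a.txt' 'b.txt' > "$workdir/file1"
printf '%s\n' 'copy a.txt /dst' 'copy z.txt /dst' 'move b.txt /dst' > "$workdir/file2"

# -F: treat every pattern as a fixed string, so "." is not a regex wildcard
# -f: read the patterns, one per line, from file1
matches=$(grep -F -f "$workdir/file1" "$workdir/file2")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

With the real files this would reduce to grep -F -f file1 file2, redirected wherever the output is needed.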

How to (with good asymptotic complexity) delete all of a specific character from a (very long) line?

I have a line with 25.9 million characters about 2.4 million of which are commas, and I want to remove all of the commas from the line.
If I use the command :s/,//g it constructs a regular expression which is run repeatedly on the line until there are no commas left. This seems to run in O(n^2) time based on empirical measurement. And as such my regular expression runs for well over an hour on this line.
Using a macro is no good because of the redraw that occurs which tends to be somewhat expensive when you are in the middle of such a long line.
Splitting up the lines seems to be the best option, but due to the structure of the file, I'd need to create a new buffer to do so cleanly.
Yes, there are much better ways to output this much data that do not involve CSVs with ridiculous numbers of columns. Let's assume I didn't generate it, but I have it, and I have to work with it.
Is there an asymptotically fast way to simply delete every occurrence of a specific character from a line in vim?
As a text editor, Vim isn't well suited to such pathologically formatted files (as you've already found out).
As others have already commented, tr is a good alternative for removing the commas. Either externally:
$ tr -d , < input.txt
Or from within Vim:
:.! tr -d ,
Vim also has a built-in low-level function :help tr(). Unfortunately, it doesn't handle deletion, only conversion. You could use it to change commas into semicolons in the current line like this:
:call setline('.', tr(getline('.'), ',', ';'))
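Before running tr on the real file, it may be worth checking the behaviour on a small sample. This is only a sketch; the sample string is made up, and the point is that tr reads standard input and deletes every occurrence of the listed character in a single pass.

```shell
# Hypothetical sample line; the real input would come from the file instead.
sample='1,2,,3,4'

# tr only reads standard input, so the data is piped (or redirected) into it.
cleaned=$(printf '%s\n' "$sample" | tr -d ',')
printf '%s\n' "$cleaned"
```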

In a bash function, how do I get stdin into a variable

I want to call a function with a pipe, and read all of stdin into a variable.
I read that the correct way to do that is with read, or maybe read -r or read -a. However, I had a lot of problems in practice doing that (esp. with multi-line strings).
In the end I settled on
function example () {
local input=$(cat)
...
}
What is the idiomatic way to do this?
input=$(cat) is a perfectly fine way to capture standard input if you really need to. One caveat is that command substitutions strip all trailing newlines, so if you want to make sure to capture those as well, you need to ensure that something aside from the newline(s) is read last.
input=$(cat; echo x)
input=${input%x} # Strip the trailing x
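To see the difference the sentinel makes, the two captures can be compared by length. This is a small sketch; the helper names len_plain and len_sentinel are hypothetical and exist only for this comparison.

```shell
# Hypothetical helpers: each prints the number of characters it captured from stdin.
len_plain() {
  local input
  input=$(cat)              # command substitution drops every trailing newline
  printf '%s\n' "${#input}"
}

len_sentinel() {
  local input
  input=$(cat; echo x)      # the sentinel "x" protects the trailing newlines
  input=${input%x}          # strip the sentinel again
  printf '%s\n' "${#input}"
}

# The input is 5 characters long: a, newline, b, newline, newline.
plain=$(printf 'a\nb\n\n' | len_plain)
kept=$(printf 'a\nb\n\n' | len_sentinel)
printf 'plain=%s kept=%s\n' "$plain" "$kept"
```

The plain capture keeps only the 3 characters before the trailing newlines, while the sentinel version keeps all 5.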
Another option in bash 4 or later is to use the readarray command, which will populate an array with each line of standard input, one line per element, which you can then join back into a single variable if desired.
readarray foo
printf -v foo "%s" "${foo[@]}"
I've found that using cat is really slow in comparison to the following method, based on tests I've run:
local input="$(< /dev/stdin)"
In case anyone is wondering, < is just input redirection. From the bash-hackers wiki:
When the inner command is only an input redirection, and nothing else,
for example
$( <FILE )
# or
` <FILE `
then Bash attempts to read the given file and act just if the given
command was cat FILE.
Remarks about portability
In terms of portability, you are likely to go your entire Linux career without ever using a system that lacks /dev/stdin, but in case you want to satisfy that itch, there is a question on Unix Stack Exchange about the portability of directly accessing /dev/{stdin,stdout,stderr} and friends.
One more thing I've come across when working with Linux containers, such as ones built with Docker or Buildah, is that there are situations where /dev/stdin or even /dev/stdout is not available inside the container. I've not been able to say conclusively what causes this.
There are a few overlapping / very similar questions floating around on SO. I answered this here, using the read built-in:
https://stackoverflow.com/a/58452863/3220983
In my answers there, however, I am ONLY concerned with a single line.
The arguable weakness of the cat approach is that it requires spawning a subshell. Otherwise, it's a good one. It's probably the easiest way to deal with multi-line processing, as specifically queried here.
I think the read approach is faster / more resource efficient if you are trying to chain a lot of commands, or iterate through a list calling a function repeatedly.

In bash, how do I execute the contents of a variable, VERBATIM, as though they were a command line?

I need to construct a complex command that includes quoted arguments. As it happens, they are arguments to grep, so I'll use that as my example and deeply simplify the command to just enough to demonstrate the error.
Let's start with a working example:
> COMMAND='/usr/bin/grep _'
> echo $COMMAND
/usr/bin/grep _
> $COMMAND
foo <- I type this, and grep filters it out.
foo_ <- I type this, and.....
foo_ <- ... it matches, so grep emits it.
"foo" is not echoed back because it lacks an underscore, "foo_" has one, so it's returned. Let's get to a demonstration of the problem:
> COMMAND='/usr/bin/grep "_ _"'
> echo -E $COMMAND
/usr/bin/grep "_ _"
> /usr/bin/grep "_ _" <- The exact same command line
foo <- fails to match
foo_ _ <- matches, so it gets echoed back
foo_ _
> $COMMAND <- But that command doesn't work from a variable
grep: _": No such file or directory
In other words, when this command is invoked through a variable name, bash is taking the space between underscores as an argument delimiter - despite the quotes.
Normally, I'd fix this with backslashes:
> COMMAND='/usr/bin/grep "_\ _"'
> $COMMAND
grep: trailing backslash (\)
Okay, maybe I need another layer of escaping the backslash:
> COMMAND='/usr/bin/grep "_\\ _"'
12:32 (master) /Users/ronbarry> $COMMAND
grep: _": No such file or directory
And now we're back to square one - the command line is still being broken up at the space. I can, of course, verify all of this with some debugging, which establishes that the backslashes are surviving, unescaped, and grep is being called with multiple arguments:
> set -x
> $COMMAND
+ /usr/bin/grep '"_\\' '_"' <- grep is being called with two args
I have a solution to the problem that takes advantage of arrays, but packing commands this way (in my full implementation, which I'll spare you) is unfamiliar to most people who'd read my code. To oversimplify the creation of an array-based command:
> declare -a COMMAND=('/usr/bin/grep' '-i' 'a b')
12:44 (master) /Users/ronbarry> ${COMMAND[*]}
foo <- Same old, same old
fooa B <- ...
fooa B <- Matches because of case-insensitive (-i) grep.
Finally we get to the question. Why does bash break up quoted arguments in strings when interpreting them as commands and why doesn't there seem to be a string-y way to get it to work? If I have a command packed in a string variable, it violates the Principle of Least Surprise to have that string interpreted differently than the string itself would be. If someone can point me at some docs that cover all of this, and will set me at peace with why I have to resort to the infinitely uglier mechanism of building up arrays with all of my commands, I'd very much appreciate it.
Disclaimer: After writing the following, I almost decided that the question should be closed for encouraging opinion-based responses. This is an opinion-based response. Proceed at your own risk.
Why does bash break up quoted arguments in strings when interpreting them as commands
Because that's what it does. A more interesting question might be "Why does bash break up strings at all?", to which the only possible answer would be "it seemed like a good idea at the time".
Or, to put it another way: In the beginning, nobody thought of putting spaces into filenames. When you only had a few letters for a filename, you didn't waste any of them on spaces. So it seemed reasonable to represent a list of words as just a space-separated list of words, and that was the basis on which shell languages were developed. So the default behaviour of bash, like that of all unix-y shells, is to consider a string with whitespace in it to be a whitespace-separated list of words.
But, of course, that leads to all sorts of headaches, because strings are not structured data. Sometimes a filename does have whitespace in its name. And not all utility arguments are filenames, either. Sometimes you want to give an argument to a utility which is, for example, a sentence. Without that complication, shells were able to avoid making you type quotes, unlike "real" programming languages where strings need to be quoted. But once you decide that sometimes a space in a string is just another character, you need to have some kind of quoting system. So then the syntax of shells added several quoting forms, each with slightly different semantics. The most common is double-quoting, which marks the contents as a single word but still allows variable expansion.
It remains the case that shell quotes, like quotes in any other language, are simply syntactic constructs. They are not part of the string, and the fact that a particular character in a string was marked with a quote (or, equivalently, a backslash) is not retained as part of the string -- again, just like any other programming language. Strings are not really lists of words; they are just treated that way by default.
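A small experiment may make this concrete. The sketch below uses a hypothetical helper, count_words, that only reports how many arguments it received; the string mirrors the one from the question.

```shell
# Hypothetical helper: prints the number of arguments it was called with.
count_words() {
  printf '%s\n' "$#"
}

command_string='grep "_ _"'

# Unquoted expansion: the value is split on whitespace, and the double quotes
# stored inside the string are ordinary characters, not quoting syntax.
split_count=$(count_words $command_string)

# Typed directly, the quotes are syntax and group "_ _" into one word.
typed_count=$(count_words grep "_ _")

printf 'from variable: %s, typed directly: %s\n' "$split_count" "$typed_count"
```

The variable yields three words (grep, "_ and _"), which is why grep in the question treats the last one as a filename; the directly typed command yields two.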
All of that is not very satisfactory. The nature of shell programming is that you really want a data structure which is a list of "words" -- or, better, a list of strings. And, eventually, shells got around to doing that. Unfortunately, by then there wasn't much syntactic space left in shell languages; it was considered important that the new features not change the behaviour of existing shell scripts. As far as I know, the current shell syntax for arrays was created by David Korn in 1988 (or earlier); eventually, bash also implemented arrays with basically the same syntax.
One of the curiosities in the syntax is that there are three ways of specifying that an entire array should be substituted:
${array[*]} or ${array[@]}: concatenate all the array elements together separated with the first character in $IFS, and then consider the result to be a whitespace-separated list of words.
"${array[*]}": concatenate all the array elements together separated with the first character in $IFS, and then consider the result to be a single word.
"${array[@]}": each array element is inserted as a separate word.
Of these, the first one is essentially useless; the second one is occasionally useful, and the third -- and most difficult to type -- is the one you almost always want.
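The three forms can be compared by counting the words each one produces. This is a sketch that assumes bash (arrays are not part of POSIX sh) and the default $IFS; count_words is a hypothetical helper and the array contents echo the example from the question.

```shell
# Requires bash. Hypothetical helper: prints the number of arguments received.
count_words() {
  printf '%s\n' "$#"
}

cmd=(grep -i 'a b')   # three elements; the last one contains a space

# Unquoted: the elements are split again on whitespace, so 'a b' becomes two words.
unquoted=$(count_words ${cmd[@]})

# Quoted with *: everything is joined into a single word.
star=$(count_words "${cmd[*]}")

# Quoted with @: one word per element, which is what a command needs.
at=$(count_words "${cmd[@]}")

printf 'unquoted=%s star=%s at=%s\n' "$unquoted" "$star" "$at"
```

Only the last form hands grep exactly the three arguments stored in the array, so a stored command would normally be run as "${cmd[@]}".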
In the above brief discussion, I left out any consideration of glob characters and filename expansion, and a number of other shell idiosyncrasies. So don't take it as a complete tutorial, by any means.
why doesn't there seem to be a string-y way to get it to work?
You can always use eval. Unfortunately. If you really really want to get bash to interpret a string as though it were a bash program rather than a string, and if you are prepared to open your script up to all manner of injection attacks, then the shell will happily give you enough rope. Personally, I would never allow a script which used eval to pass code review so I'm not going to expand on its use here. But it's documented.
If I have a command packed in a string variable, it violates the Principle of Least Surprise to have that string interpreted differently than the string itself would be.
Surprise is really in the eye of the beholder. There are probably lots of programmers who think that a newline character really occupies two bytes, and are Surprised when it turns out that in C, '\n'[0] is not a backslash. But I think most of us would be Surprised if it were. (I've tried to answer SO questions based on this misunderstanding, and it is not easy.)
Bash strings, regardless of anything else, are strings. They are not bash programs. Having them suddenly interpreted as bash programs would, in my opinion, not only be surprising but dangerous. At least if you use eval, there is a big red flag for the code reviewer.

The difference between arguments and options pertaining to the linux shell

I'm currently enrolled in an intro to Unix / Linux class and we came to a question that the instructor and I did not agree on.
cp -i file1 file2
Which is true about the preceding command?
A. There is only one utility
B. There is one option
C. There are three arguments
D. file1 will be copied as file2 and the user will be warned before
an overwrite occurs
E. All of the above
I insisted that it was E. All of the above. The instructor has settled on D.
It seems clear that A, B, and D are all correct. The hang-up was C, and whether or not the -i flag was both an option and an argument.
My logic was that all options are arguments but not all arguments are options and since there are multiple true answers listed, then in multiple choice question tradition the answer is more than likely to be E all of the above.
I haven't been able to find the smoking gun on this issue and thought I would throw it to the masters.
I know this is an old thread, but I want to add the following for anyone else that may stumble into a similar disagreement.
$ ls -l junk
-rw-r--r-- 1 you 19 Sep 26 16:25 junk
"The strings that follow the program name on the command line, such as -l
and junk in the example above, are called the program's arguments. Arguments are usually options or names of files to be used by the command."
Brian W. Kernighan & Rob Pike, "The UNIX Programming Environment"
The manual page here states:
Mandatory arguments to long options are mandatory for short options too.
This seems to imply that in the context of this particular question, at least, you're supposed to not consider options to be arguments. Otherwise it becomes very recursive and kind of pointless.
I think the instructor should accept your explanation though, this really is splitting hairs for most typical cases.
I think the term "arguments" is used in different ways in different contexts, which seems to be the root of the disagreement. In support of your stance, though, note that C, in which cp was most likely written, declares the program entry point as main(argc, argv) (types elided), which seems to indicate at least that those who designed the C architecture/library/etc. thought of options as a subset of arguments. Then of course options can have their own arguments (different context), etc....
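The shell's own view can be checked in the same spirit. In this sketch, count_args is a hypothetical stand-in for a utility such as cp; it only reports how many arguments the shell passed to it.

```shell
# Hypothetical stand-in for a utility: report how many arguments it received.
count_args() {
  printf '%s\n' "$#"
}

# Same words as in "cp -i file1 file2", minus the command name itself.
argc=$(count_args -i file1 file2)
printf 'argument count: %s\n' "$argc"
```

The shell counts -i as one of three arguments. A C program would see argc equal to 4 for the same command line, because argv[0] holds the program name, whereas $# does not count the command name.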
This is how I was taught, it is said in this case:
cp -i file1 file2
The right answer would be A B and D but not C.
That is because -i is an option, while file1 and file2 are arguments. Normally options are considered to change the behaviour of an application or command, whereas arguments do not.
I suppose it comes down to semantics whether you consider -i an argument of the original application, since it is a behaviour-changing option (or argument) of cp, but in plain English it is considered an option, not an argument.
That's how I still define the difference between the two parts of a command.
As another example, consider cron jobs. I often use PHP cron jobs, and I normally have both options and arguments associated with the command. Options are always used (in my opinion) to define extra behaviour, while arguments are designed to provide the app and its behaviours with the data required to complete the operation.
Edit
I agree with @unwind: this is splitting hairs, and a lot of the time it comes down to scenario and opinion. It was quite bad of him to even mark on it, really; he should have known this is a subjective question. Tests are completely unfair when filled with subjective questions.
Hmmm... I personally like to distinguish between options and arguments however, you could technically say that options are arguments. I would say that you are correct but I think your instructor settled on D because he doesn't want you to get them confused. For example, the following is equivalent to the above command...:
ARG1="-i" ; ARG2="file1" ; ARG3="file2" ; cp $ARG1 $ARG2 $ARG3
ARG1="-i" ; ARG2="file1" ; ARG3="file2" ; cp $ARG2 $ARG1 $ARG3
ARG1="-i" ; ARG2="file1" ; ARG3="file2" ; cp $ARG2 $ARG3 $ARG1
...whereas cp $ARG1 $ARG3 $ARG2 is not the same. I would say that options are a special type of arguments.

How to solve a linear system in Linux shell?

Does anyone know of a Linux command that reads a linear system of equations from its standard input and writes the solution (if exists) in its standard output?
I want to do something like this:
generate_system | solve_system
You can probably write your own such command using this package.
This is an old question, but showed up in my searches for this problem, so I'm adding an answer here.
I used maxima's solve function. Wrangling the input/output to/from maxima is a bit of a challenge, but can be done.
prepare the system of equations as a comma-separated list -- for example, EQs="C[1]+C[2]=1,C[1]-C[2]=2". I wanted a solution for an unknown number of variables, so I used C[n], but you can use variable names.
prepare a list of variables you wish to solve for -- EQ_VARS="C[1],C[2]"
Maxima will echo all inputs, use line wrap, and return a solution in the form [C[1]=...,C[2]=..]. We need to resolve all of these.
Taken together, this becomes
OUT_VALS=( \
$(maxima --very-quiet \
--batch-string="display2d:false\$linel:9999\$print(map(rhs,float(solve([$EQs],[$EQ_VARS]))[1]))\$" \
| tail -n 1 \
| tr -c '0-9-.e' ' ') )
which will place the solution values into the array $OUT_VALS.
Note that this only handles the Maxima output properly if your problem is correctly constrained -- if the system has no solution, or more than one, the output will not be parsed correctly.
