space/tab/newline insensitive comparison - string

Suppose I have these two files:
File 1:
1 2 3 4 5 6 7
File 2:
1
2
3
4
5
6
7
Is it possible to use diff to compare these two files so that they compare as equal?
(Or if not, what other tools should I use?)
Thanks

You could collapse whitespace so file2 looks like file1, with every number on the same line:
$ cat file1
1 2 3 4 5 6 7
$ cat file2
1
2
4
3
5
6
7
$ diff <(echo $(< file1)) <(echo $(< file2))
1c1
< 1 2 3 4 5 6 7
---
> 1 2 4 3 5 6 7
Explanation:
< file # Equivalent to "cat file", but slightly faster since the shell doesn't
# have to fork a new process.
$(< file) # Capture the output of the "< file" command. Can also be written
# with backticks, as in `< file`.
echo $(< file) # Echo each word from the file. This will have the side effect of
# collapsing all of the whitespace.
<(echo $(< file)) # An advanced way of piping the output of one command to another.
# The shell opens an unused file descriptor (say fd 42) and pipes
# the echo command to it. Then it passes the filename /dev/fd/42 to
# diff. The result is that you can pipe two different echo commands
# to diff.
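A quick way to see what the shell actually passes to diff is to echo a process substitution (the fd number varies between shells and runs):
$ echo <(true)
/dev/fd/63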
Alternatively, you may want to make file1 look like file2, with each number on a separate line. That will produce more useful diff output.
$ diff -u <(printf '%s\n' $(< file1)) file2
--- /dev/fd/63 2012-09-10 23:55:30.000000000 -0400
+++ file2 2012-09-10 23:47:24.000000000 -0400
@@ -1,7 +1,7 @@
1
2
-3
4
+3
5
6
7
This is similar to the first command, with echo changed to printf '%s\n' to put a newline after each word; file2 is already one number per line, so it can be passed to diff as-is.
Note: Both of these commands will fail if the files being diffed are overly long. This is because of the limit on command-line length. If that happens then you will need to workaround this limitation, say by storing the output of echo/printf to temporary files.
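For instance, a minimal sketch of that workaround, using tr instead of echo/printf so the file contents never have to fit on a command line (the temporary file names here are arbitrary):
tr -s ' \t\n' '\n' < file1 > /tmp/file1.words   # one value per line, whitespace squeezed
tr -s ' \t\n' '\n' < file2 > /tmp/file2.words
diff -u /tmp/file1.words /tmp/file2.words
rm -f /tmp/file1.words /tmp/file2.words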

Some diffs have -b (ignore blanks) and -w (ignore whitespace) options, but as Unix utilities are all line-oriented, I don't think "whitespace" will include \n chars.
Double-check that your version of diff doesn't have some fancy GNU options with diff --help | less or man diff.
Is your formatting above correct, with file 1 having all of its data on one line? You could force file2 to match that format with
awk '{printf "%s ", $0}' file2
Or as mentioned in comments, convert file 1
awk '{for (i=1;i<=NF;i++) printf("%s\n", $i)}' file1
But I'm guessing that your data isn't really that simple. Also there are likely line length limitations that will appear when you can least afford the time to deal with them.
Probably not what you want to hear, and diffing of complicated stuff like source-code is not an exact science. So, if you still need help, create a slightly more complicated testcase and add it to your question.
Finally, you'll need to show us what you'd expect the output of such a diff project to look like. Right now I can't see any meaningful way to display such differences for a non-trivial case.
IHTH

If it turns out the data is indeed simple enough to not run into limitations, and the only difference between the files is that the first one separates by space and the second by newline, you can also do process substitution (as was suggested above) but with sed to replace the spaces in the first file with newlines:
diff <(sed 's/ /\n/g' file1) file2
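Note that a literal \n in the replacement text is a GNU sed extension; on BSD/macOS sed you could reach for tr instead (this assumes the values in file1 are separated by single spaces):
diff <(tr ' ' '\n' < file1) file2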

Related

how to filter the lines contain at least one "x" for the first nine characters

for example if we have txt as below
drwxr-sr-x 7 abcdefgetdf
drwxr-sr-x 7 abcdef123123sa
drwxr-sr-- 7 abcdefgetdf
drwxr-sr-- 7 abcdeadfvcxvxcvx
drwxr-sr-x 7 abcdef123ewlld
To answer the question in the title strictly:
awk 'substr($0, 1, 9) ~ /x/' txt
Though if you're interested in files with at least one execute permission bit set, then perhaps find -perm /0111 would be something to look into.
awk solution:
ls -l | awk '$1~/x/'
$1~/x/ - regular expression match, accepts only lines with the 1st field containing x character
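If the input is ls -l output rather than a plain text file, a variant of the same idea (just a sketch, assuming the usual ls -l layout where the first field is the 10-character mode string) restricts the test to the nine permission characters:
ls -l | awk 'substr($1, 2, 9) ~ /x/'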
I couldn't figure out a one liner that just checked the first 9 (or 10) characters for 'x'. But this little script does the trick. Uses "ls -l" as input instead of a file, but it's trivial to pass in a file to filter instead.
#!/bin/bash
ls -l | while IFS= read -r line
do
    perms="${line:0:10}"                 # first 10 characters: file type plus permission bits
    if [[ "${perms}" =~ "x" ]]; then     # keep the line if any of them is an 'x'
        echo "${line}"
    fi
done

Tail inverse / printing everything except the last n lines?

Is there a (POSIX command line) way to print all of a file EXCEPT the last n lines? Use case being, I will have multiple files of unknown size, all of which contain a boilerplate footer of a known size, which I want to remove. I was wondering if there is already a utility that does this before writing it myself.
Most versions of head(1) - GNU derived, in particular, but not BSD derived - have a feature to do this. If you give a negative number for the number of lines to print, it shows the whole file except that many lines at the end.
Like so:
head -n -10 textfile
Probably less efficient than the "wc" + "do the math" + "tail" method, but easier to look at:
tail -r file.txt | tail +NUM | tail -r
Where NUM is one more than the number of ending lines you want to remove, e.g. +11 will print all but the last 10 lines. This works on BSD which does not support the head -n -NUM syntax.
The head utility is your friend.
From the man page of head:
-n, --lines=[-]K
print the first K lines instead of the first 10;
with the leading `-', print all but the last K lines of each file
There's no standard command to do that, but you can use awk or sed to fill a buffer of n lines, and print from the head of the buffer once it's full. E.g. with awk:
awk -v n=5 '{if(NR>n) print a[NR%n]; a[NR%n]=$0}' file
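The same ring-buffer idea, spelled out with comments (n is the number of trailing lines to drop):
awk -v n=5 '
    NR > n { print a[NR % n] }   # the line read n lines ago is now safe to print
    { a[NR % n] = $0 }           # store the current line, overwriting the oldest slot
' file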
cat <filename> | head -n -10 # Everything except last 10 lines of a file
cat <filename> | tail -n +10 # Everything except the first 9 lines of a file (printing starts at line 10)
If the footer starts with a consistent line that doesn't appear elsewhere, you can use sed:
sed '/FIRST_LINE_OF_FOOTER/q' filename
That prints the first line of the footer; if you want to avoid that:
sed -n '/FIRST_LINE_OF_FOOTER/q;p' filename
This could be more robust than counting lines if the size of the footer changes in the future. (Or it could be less robust if the first line changes.)
Another option, if your system's head command doesn't support head -n -10, is to precompute the number of lines you want to show. The following depends on bash-specific syntax:
lines=$(wc -l < filename) ; (( lines -= 10 )) ; head -$lines filename
Note that the head -NUMBER syntax is supported by some versions of head for backward compatibility; POSIX only permits the head -n NUMBER form. POSIX also only permits the argument to -n to be a positive decimal integer; head -n 0 isn't necessarily a no-op.
A POSIX-compliant solution is:
lines=$(wc -l < filename) ; lines=$(($lines - 10)) ; head -n $lines filename
If you need to deal with ancient pre-POSIX shells, you might consider this:
lines=`wc -l < filename` ; lines=`expr $lines - 10` ; head -n $lines filename
Any of these might do odd things if a file is 10 or fewer lines long.
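If that matters, a small guard (just a sketch) skips the head call entirely when the file has 10 or fewer lines:
lines=$(wc -l < filename)
if [ "$lines" -gt 10 ]; then
    head -n $((lines - 10)) filename
fi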
tac file.txt | tail +[n+1] | tac
This answer is similar to user9645's, but it avoids the tail -r command, which is also not a valid option on many systems. See, e.g., https://ubuntuforums.org/showthread.php?t=1346596&s=4246c451162feff4e519ef2f5cb1a45f&p=8444785#post8444785 for an example.
Note that the +1 (in the brackets) was needed on the system I tested it on, but it may not be required on your system. So, to remove the last line, I had to put 2 in the brackets. This is probably because the last line needs to end with a regular line-feed character, which arguably makes the final line a blank line. If it doesn't, the tac command will combine the last two lines, so removing the "last" line (or the first one, from tail's point of view) will actually remove the last two lines.
My answer should also be the fastest solution of those listed to date for systems lacking the improved version of head. So, I think it is both the most robust and the fastest of all the answers listed.
head -n $(($(wc -l < Windows_Terminal.json) - 10)) Windows_Terminal.json
This works on Linux and on macOS; keep in mind that macOS head does not support a negative line count, so this form is quite handy.
N.B.: replace Windows_Terminal.json with your file name, and 10 with the number of trailing lines to drop.
It is simple: add a + in front of the number, which should be one more than the number of leading lines you want to skip.
This example gives you all the lines except the first 9:
tail -n +10 inputfile
(yes, not the first 10, because tail -n +K starts printing at line K; if you want to skip 10, just type
tail -n +11 inputfile)
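A quick way to see the off-by-one: tail -n +K starts printing at line K, so with a five-line input,
$ seq 5 | tail -n +3
3
4
5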

extracting location specific record using sed based on given criteria using bash script

I want to extract data from an ASCII file that looks like the one provided here, specifically the block starting with 1NAME. The block starting with 1NAME can repeat any number of times; I have files where there is only one block and others with as many as 744:
AVERAGE MODELNAME -- RUNNAME
0 1 11121 0. 11122 24.
-9700000 4000000 0 -241200000000 -1620000
1.00000 1000.00000 10 10 1 2 0 15. 11. 0.
1 1 500 400
NAME
11121 0.00 11121 1.00
1NAME
0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00
0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00
0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00
NAME
11121 1.00 11121 2.00
1NAME
1.0000000E+00 45.0000000E+00 01.0000000E+00 115.0000000E+00 5.0000000E+00
2.0000000E+00 66.0000000E+00 09.0000000E+00 180.0000000E+00 4.0000000E+00
3.0000000E+00 80.0000000E+00 70.0000000E+00 130.0000000E+00 5.0000000E+00
I would like to (1) extract values from a given recurring location in the file, starting after "1NAME", (2) pipe the output to a text file and create a header that identifies which location it was pulled from, and (3) create custom code that can take input for multiple locations (say records 1, 5, 8) after 1NAME and write them to separate outputs (for example: one output file for all records at location 1, one output file for location 5, ...).
As an example, say I want to grab records 1, 5 and 8 after 1NAME in the given input file. The output for each record should look as follows, each in a separate record-specific text file labeled GRID#.txt:
GRID 1
0.0000000E+00
00.0000000E+00
GRID 5
0.0000000E+00
5.0000000E+00
GRID 8
0.0000000E+00
09.0000000E+00
I was able to extract data one location at a time using sed. However, I need to extract data from multiple locations in the input file, so I tried to put everything into a script. Following are the steps I took.
The input file has multiple whitespaces and inconsistent blank lines. So I used sed to replace each run of whitespace with a newline, and then, using the piped output from this step, removed all blank lines. This resulted in all the data in the file arranged as one value per row.
sed 's/\s\+/\n/g' input.txt | sed '/^$/d'
To extract data, I then used a sed command (in the following format) on the piped output from step 1.
sed -n -e 11p -e 50p
I tried to put all these commands into a bash (or csh, either option) script with custom row numbers. I tried using foreach (naively) and then learned that it cannot be used within bash. I will be using fellow users' recommended scripts instead.
#!/bin/bash
set FILE=$cwd/sample_or_2day
foreach GRID (23729)
foreach GRIDTIME(28 41)
sed 's/\s\+/\n/g' $FILE | sed '/^$/d' | sed '1,36d' > temp_out
sed -n -e "$GRIDTIME" temp_out | tee $cwd/out_$GRID
Thanks for your patience. I am a nervous programmer and trying to master basics. I spent time looking at sed instruction pages, and user support forums. Any recommendations are welcome - especially with explicit instructions. Thanks!
You provided your attempt at a csh script, but tagged your question as bash. I am answering with a bash script.
The core of your question is how to extract information from a formatted printout. In general, one should avoid such situations: one should use programming environments which are aware of the data structures being manipulated, in order to avoid re-parsing at each step. In the real world, however, such situations arise very often, and one has to cope with them.
Your approach to transform all white spaces into newlines works in your case. Instead of multiple sed commands, the fastest way to achieve it is by
tr -s ' ' '\n'
(the -s option squeezes multiple occurrences of the target character into one, eliminating blank lines)
Then, you are interested in the 7th and 14th lines after each occurrence of a line containing 1NAME. This is done in sed by
sed -n -e '/^1NAME$/{n;n;n;n;n;n;n;p;n;n;n;n;n;n;n;p}'
which means: when you see 1NAME, execute the nextline command seven times, then execute the print command. Do this twice.
You could use a shell variable:
next7='n;n;n;n;n;n;n;p'
And
cat ./sample_or_2day | tr -s ' ' '\n' | sed -n -e '/^1NAME$/'"{$next7;$next7}"
would produce
0.0000000E+00
0.0000000E+00
66.0000000E+00
130.0000000E+00
Correct, but the first block was also taken in. To skip it, let's add the sed instruction you had already figured out, -e1,36d.
$ cat ./sample_or_2day | tr -s ' ' '\n' | sed -n -e1,36d -e'/^1NAME$/'"{$next7;$next7}"
66.0000000E+00
130.0000000E+00
You might also want bash to construct the sed command line for you: for instance, the command
sed -n -e{7..29..7}p
would be expanded by the shell as
sed -n -e7p -e14p -e21p -e28p
which, as you know, means that sed would print those input lines only.
You might also want to learn about for loops in bash, which come in two different flavors, for example:
for var in word1 word2 word3 ...; do ... ; done
for (( i=0; i<10; i++ )); do ...; done
Now, it is not clear to me how you want to manage your output files. I provide a bash version of your script (providing a list of values for GRID, instead of just one), which shows another possible brace expansion in bash.
#!/bin/bash
FILE=./sample_or_2day
for GRID in 23729 23755 23768; do
cat "$FILE" | tr -s ' ' '\n' | sed -n -e{28,41}p >> "./out_$GRID"
done
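To produce one file per requested record, as in the GRID#.txt example from the question, here is a rough sketch using awk in place of sed (assumptions: the input file is ./sample_or_2day and each requested record number actually exists inside its 1NAME block):
for rec in 1 5 8; do
    awk -v r="$rec" '
        $1 == "1NAME" { c = 0; grab = 1; next }   # a new data block starts: restart the count
        grab { for (i = 1; i <= NF; i++)          # walk the values of the block in order
                   if (++c == r) { print $i; grab = 0 } }
    ' ./sample_or_2day > "GRID${rec}.txt"
done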
This is what has worked for me so far, but not as a bash script:
sed 's/\s\+/\n/g' ./sample_or_2day | sed '/^$/d' | sed '1,36d'| sed -n -e{23724..194842..97421}p > './out'
In the above script:
sed 's/\s\+/\n/g' -> replaces each run of whitespace with a newline
sed '/^$/d' -> deletes blank line from piped output
sed '1,36d' -> removes line 1-36 from piped output
sed -n -e{23724..194842..97421}p -> prints record starting line
23724 and at intervals of 97421, until line 194842
'./out' -> outputs to file labeled out
If you're open to adding a python dependency, you may find this'll help:
http://stromberg.dnsalias.org/~strombrg/context-split.html
Or try awk, replacing /^BEGIN/ and /^END/ with your own regexes:
#!/bin/sh
awk '
BEGIN { show=0 }
/^END/ { show=0 }
{ if (show==1) print $0 }
/^BEGIN/ { show=1 }' "$@"
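For the sample file above, the markers could look like this (a sketch only; it prints the data lines of every 1NAME block):
awk '
    /^NAME/  { show=0 }         # a NAME header ends the current data block
    { if (show==1) print $0 }   # inside a 1NAME block: print the data line
    /^1NAME/ { show=1 }         # a 1NAME line starts the next data block
' sample_or_2day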

How to delete the last n lines of a file? [duplicate]

This question already has answers here:
How to use sed to remove the last n lines of a file
(26 answers)
Closed 5 years ago.
I was wondering if someone could help me out.
I'm writing a bash script and I want to delete the last 12 lines of a specific file.
I have had a look around and somehow came up with the following;
head -n -12 /var/lib/pgsql/9.6/data/pg_hba.conf | tee /var/lib/pgsql/9.6/data/pg_hba.conf >/dev/null
But this wipes the file completely.
All I want to do is permanently delete the last 12 lines of that file so I can overwrite it with my own rules.
Any help on where I'm going wrong?
There are a number of methods, depending on your exact situation. For small, well-formed files (say, less than 1M, with regular sized lines), you might use Vim in ex mode:
ex -snc '$-11,$d|x' smallish_file.txt
-s -> silent; this is batch processing, so no UI necessary (faster)
-n -> No need for an undo buffer here
-c -> the command list
'$-11,$d' -> Select the 11 lines from the end to the end (for a total of 12 lines) and delete them. Note the single quote so that the shell does not interpolate $d as a variable.
x -> "write and quit"
For a similar, perhaps more authentic throw-back to '69, the ed line-editor could do this for you:
ed -s smallish_file.txt <<< $'-11,$d\nwq'
Note the $ outside of the single quote, which is different from the ex command above.
If Vim/ex and Ed are scary, you could use sed with some shell help:
sed -i "$(($(wc -l < smallish_file.txt) - 11)),\$d" smallish_file.txt
-i -> inplace: write the change to the file
The line count less 11 for a total of 12 lines. Note the escaped dollar symbol ($) so the shell does not interpolate it.
But using the above methods will not be performant for larger files (say, more than a couple of megs). For larger files, use the intermediate/temporary file method, as the other answers have described. A sed approach:
tac some_file.txt | sed '1,12d' | tac > tmp && mv tmp some_file.txt
tac to reverse the line order
sed to remove the last (now first) 12 lines
tac to reverse back to the original order
More efficient than sed is a head approach:
head -n -12 larger_file.txt > tmp_file && mv tmp_file larger_file.txt
-n NUM shows only the first NUM lines. Negated, as we've done here, it shows everything except the last NUM lines.
But for real efficiency -- perhaps for really large files or where a temporary file would be unwarranted -- truncate the file in place. Unlike the other methods, which involve variations of overwriting the entire old file with the entire new content, this one will be near instantaneous no matter the size of the file.
# In readable form:
BYTES=$(tail -12 really_large.txt | wc -c)
truncate -s -$BYTES really_large.txt
# Inline, perhaps as part of a script
truncate -s -$(tail -12 really_large.txt | wc -c) really_large.txt
The truncate command makes files exactly the specified size in bytes. If the file is too short, it will make it larger, and if the file is too large, it will chop off the excess really efficiently. It does this with filesystem semantics, so it involves writing usually no more than a couple of bytes. The magic here is in calculating where to chop:
-s -NUM -> Note the dash/negative; says to reduce the file by NUM bytes
$(tail -12 really_large.txt | wc -c) -> returns the number of bytes to be removed
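If short files are a concern here too, a guarded variant (just a sketch; it leaves files of 12 or fewer lines untouched) could be:
file=really_large.txt
if [ "$(wc -l < "$file")" -gt 12 ]; then
    truncate -s -"$(tail -n 12 "$file" | wc -c)" "$file"
fi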
So, you pays your moneys and takes your choices. Choose wisely!
Like this:
head -n -12 test.txt > tmp.txt && cp tmp.txt test.txt
You can use a temporary file to store the intermediate result of head -n
I think the code below should work:
head -n -12 /var/lib/pgsql/9.6/data/pg_hba.conf > /tmp/tmp.pg.hba.$$ && mv /tmp/tmp.pg.hba.$$ /var/lib/pgsql/9.6/data/pg_hba.conf
If you are putting it a script, a more readable and easy to maintain code would be:
SRC_FILE=/var/lib/pgsql/9.6/data/pg_hba.conf
TMP_FILE=/tmp/tmp.pg.hba.$$
head -n -12 $SRC_FILE > $TMP_FILE && mv $TMP_FILE $SRC_FILE
I would suggest backing up /var/lib/pgsql/9.6/data/pg_hba.conf before running any script.
Simple and clear script
declare -i start
declare -i cnt
cat dummy
1
2
3
4
5
6
7
8
9
10
11
12
13
cnt=`wc -l dummy | awk '{print $1}'`
start=$((cnt - 12 + 1))
sed "${start},\$d" dummy
OUTPUT (only the first line remains):
1

Splitting a file and its lines under Linux/bash

I have a rather large file (150 million lines of 10 chars). I need to split it in 150 files of 2 million lines, with each output line being alternatively the first 5 characters or the last 5 characters of the source line.
I could do this in Perl rather quickly, but I was wondering if there was an easy solution using bash.
Any ideas?
Homework? :-)
I would think that a simple pipe with sed (to split each line into two) and split (to split things up into multiple files) would be enough.
The man command is your friend.
Added after confirmation that it is not homework:
How about
sed 's/\(.....\)\(.....\)/\1\n\2/' input_file | split -l 2000000 - out-prefix-
?
I think that something like this could work:
out_file=1
out_pairs=0
cat $in_file | while read line; do
if [ $out_pairs -gt 1000000 ]; then
out_file=$(($out_file + 1))
out_pairs=0
fi
echo "${line%?????}" >> out${out_file}
echo "${line#?????}" >> out${out_file}
out_pairs=$(($out_pairs + 1))
done
Not sure if it's simpler or more efficient than using Perl, though.
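A faster variant of the same idea (just a sketch, assuming every input line is exactly 10 characters) streams the file through awk and hands the result to split, instead of reading it line by line in the shell (in_file stands for the large input file):
awk '{ print substr($0, 1, 5); print substr($0, 6) }' in_file |
    split -l 2000000 - out-prefix-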
First five chars of each line variant, assuming that the large file called x.txt, and assuming it's OK to create files in the current directory with names x.txt.* :
split -l 2000000 x.txt x.txt.out && (for splitfile in x.txt.out*; do outfile="${splitfile}.firstfive"; echo "$splitfile -> $outfile"; cut -c 1-5 "$splitfile" > "$outfile"; done)
Why not just use the native Linux split utility?
split -d -l 999999 input_filename
This will output new split files with file names like x00 x01 x02...
For more info, see the manual:
man split
