Removing a header in GNU/Linux - linux

I'm trying to confirm whether I can remove a header line.
Let's say I have a file data.gz:
This line is the header Data
Data line 1
Data line 2
Data line 3
Data line 4
Data line 5
I want to remove the first line before I apply a regular expression:
gunzip -c data.gz | grep -v '^This line is the header Data$' | grep -o 'Data' | sort | uniq -c
Will this remove the header before the second grep (the regular expression) looks for Data? Is there a better method for removing a header in a pipeline?

Yes! The tail command can skip lines counting from the beginning:
$ seq 1 3 | tail -n+2
2
3

Delete first line with sed:
| sed 1d
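Both suggestions drop the header by position rather than by content, so they keep working if the header text changes. A minimal sketch of the whole pipeline, assuming the sample data from the question; the function name count_data is made up for illustration, and printf stands in for gunzip -c data.gz:

```shell
# Hypothetical helper: skip the first line of stdin, then count "Data" matches.
count_data() {
    tail -n +2 | grep -o 'Data' | sort | uniq -c
}

# printf stands in for `gunzip -c data.gz` so the sketch is self-contained.
printf '%s\n' 'This line is the header Data' 'Data line 1' 'Data line 2' |
    count_data
```

Replacing tail -n +2 with sed 1d inside the function should give the same result.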

Related

Concatenate files without last lines of each one

I am concatenating a large number of files into a single one with the following command:
$ cat num_*.dat > dataset.dat
However, due to the structure of the files, I'd like to omit the first two and last two lines of each file. Those lines contain file information which is not important for my needs.
I know about head and tail, but I don't know how to combine them in a UNIX command to solve my issue.
The head command has some odd parameter usage.
You can use the following to list all of the lines except the last two.
$ cat num_*.dat | head -n-2 > dataset.dat
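Next, take that and run the following tail command on it. Write to a different file (trimmed.dat is just an illustrative name), because reading from and appending to the same file is unsafe:
$ tail -n+3 dataset.dat > trimmed.dat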
I believe the following will work as one command.
$ cat num_*.dat | head -n-2 | tail -n+3 > dataset.dat
I tested on a file that had lines like the following:
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
This one will get you started:
cat test.txt | head -n-2 | tail -n+3
From the file above it prints :
Line 3
Line 4
Line 5
The challenge is that cat num_*.dat concatenates all of the files first, and the rest of the pipeline then runs once on the combined stream. As a result, only the first two lines of the first file and the last two lines of the last file are removed.
Final Answer - Need to Write a Bash Script
I wrote a bash script that will do this for you.
This one will iterate through each file in your directory and run the command.
Notice that it appends (>>) to the dataset.dat file.
for file in num_*.dat; do
    # skip anything that is not a regular file
    if [ -f "$file" ]; then
        cat "$file" | head -n-2 | tail -n+3 >> dataset.dat
        echo "$file"
    fi
done
I had two files that looked like the following:
line 1
line 2
line 3
line 4
line 5
line 6
line 7

2 line 1
2 line 2
2 line 3
2 line 4
2 line 5
2 line 6
2 line 7
The final output was:
line 3
line 4
line 5
2 line 3
2 line 4
2 line 5
for i in num_*.dat; do # loop through all files concerned
    cat "$i" | tail -n +3 | head -n -2 >> dataset.dat
done
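Negative counts for head (head -n -2) are a GNU extension and may not be available on every system. A possible alternative using only awk, offered as a sketch rather than a drop-in replacement: the function name trim_edges and the sample file contents are made up for illustration.

```shell
# Hypothetical helper: print a file without its first two and last two lines.
trim_edges() {
    # Buffer every line after the first two, and print each buffered line
    # only once two newer lines have arrived; the final two are never printed.
    awk 'NR > 2 { buf[NR] = $0 }
         NR > 4 { print buf[NR - 2]; delete buf[NR - 2] }' "$1"
}

# Usage sketch in a scratch directory (file contents are illustrative).
dir=$(mktemp -d)
printf '%s\n' h1 h2 a b f1 f2 > "$dir/num_1.dat"
printf '%s\n' h1 h2 c f1 f2 > "$dir/num_2.dat"
for f in "$dir"/num_*.dat; do
    trim_edges "$f"
done > "$dir/dataset.dat"
cat "$dir/dataset.dat"
rm -r "$dir"
```

Redirecting the whole loop once, instead of appending inside it, also avoids stale content when dataset.dat already exists.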

Reorder Lines Based On Previous File Order Before Randomization

I have the following lines in file1:
line 1text
line 2text
line 3text
line 4text
line 5text
line 6text
line 7text
With the command cat file1 | sort -R | head -4 I get the following in file2:
line 5text
line 1text
line 7text
line 2text
I would like to order the lines (not numerically, just the same order as file1) into the following file3:
line 1text
line 2text
line 5text
line 7text
The actual data doesn't have digits. Any easy way to do this? I was thinking of doing a grep and finding the first instance in a loop. But, I'm sure you experienced guys know an easier solution. Your positive input is highly appreciated.
You can decorate with line numbers, select four random lines, sort by line number and remove the line numbers:
$ nl -b a file1 | shuf -n 4 | sort -n -k 1,1 | cut -f 2-
line 2text
line 5text
line 6text
line 7text
The -b a option to nl makes sure that also empty lines are numbered.
Notice that this loads all of file1 into memory, as pointed out by ghoti. To avoid that (and as a generally smarter solution), we can use a different feature of (GNU) shuf: its -i option takes a number range and treats each number as a line. To get four random line numbers from an input file file1, we can use
shuf -n 4 -i 1-$(wc -l < file1)
Now, we have to print exactly these lines. Sed can do that; we just turn the output of the previous command into a sed script and run sed with sed -n -f -. All together:
shuf -n 4 -i 1-$(wc -l < file1) | sort -n | sed 's/$/p/;$s/p/{&;q}/' |
sed -n -f - file1
sort -n sorts the line numbers numerically. This isn't strictly needed, but if we know that the highest line number comes last, we can quit sed afterwards instead of reading the rest of the file for nothing.
sed 's/$/p/;$s/p/{&;q}/' appends p to each line. For the last line, we append {p;q} to stop processing the file.
If the output from sort looks like
27
541
670
774
then the sed command turns it into
27p
541p
670p
774{p;q}
sed -n -f - file1 processes file1, using the output of above sed command as the instructions for sed. -n suppresses output for the lines we don't want.
The command can be parametrized and put into a shell function, taking the file name and the number of lines to print as arguments:
randlines () {
    fname=$1
    nlines=$2
    shuf -n "$nlines" -i 1-$(wc -l < "$fname") | sort -n |
        sed 's/$/p/;$s/p/{&;q}/' | sed -n -f - "$fname"
}
to be used like
randlines file1 4
cat can add line numbers:
$ cat -n file
1 line one
2 line two
3 line three
4 line four
5 line five
6 line six
7 line seven
8 line eight
9 line nine
So you can use that to decorate, sort, undecorate:
$ cat -n file | sort -R | head -4 | sort -n
You can also use awk to decorate with a random number and line index (if your sort lacks -R like on OS X):
$ awk '{print rand() "\t" FNR "\t" $0}' file | sort -n | head -4
0.152208 4 line four
0.173531 8 line eight
0.193475 6 line six
0.237788 1 line one
Then sort with the line numbers and remove the decoration (one or two columns depending if you use cat or awk to decorate):
$ awk '{print rand() "\t" FNR "\t" $0}' file | sort -n | head -4 | cut -f2- | sort -n | cut -f2-
line one
line four
line six
line eight
Another solution could be to sort the whole file:
sort file1 -o file2
and then pick random lines from file2:
shuf -n 4 file2 -o file3
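If file2 already exists, as in the question, another possible approach is to filter file1 by the lines of file2, which prints the matches in file1's order. A minimal sketch, assuming the lines are distinct (a line duplicated in file1 would be printed once per occurrence); the function name reorder_like is made up for illustration:

```shell
# Hypothetical helper: print the lines of the sample ($2) in the order
# in which they appear in the original file ($1).
reorder_like() {
    # -F fixed strings, -x whole-line matches, -f read patterns from a file
    grep -F -x -f "$2" "$1"
}

# Rebuild the question's file1 and file2 in a scratch directory.
dir=$(mktemp -d)
printf 'line %stext\n' 1 2 3 4 5 6 7 > "$dir/file1"
printf 'line %stext\n' 5 1 7 2 > "$dir/file2"
reorder_like "$dir/file1" "$dir/file2" > "$dir/file3"
cat "$dir/file3"
rm -r "$dir"
```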

wc -l is NOT counting the last line of the file if it does not have an end-of-line character

I need to count all lines of a Unix file. The file has 3 lines, but wc -l reports only 2.
I understand that it does not count the last line because it has no end-of-line character.
Could anyone please tell me how to count that line as well?
grep -c returns the number of matching lines. Just use an empty string "" as your matching expression:
$ echo -n $'a\nb\nc' > 2or3.txt
$ cat 2or3.txt | wc -l
2
$ grep -c "" 2or3.txt
3
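It is better to have all lines ending with EOL \n in Unix files. If the last line is known to lack it, you can append one before counting (this overcounts by one when the file already ends with a newline):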
{ cat file; echo ''; } | wc -l
Or this awk:
awk 'END{print NR}' file
This approach will give the correct line count regardless of whether the last line in the file ends with a newline or not.
awk will make sure that, in its output, each line it prints ends with a new line character. Thus, to be sure each line ends in a newline before sending the line to wc, use:
awk '1' file | wc -l
Here, we use the trivial awk program that consists solely of the number 1. awk interprets this cryptic statement to mean "print the line" which it does, being assured that a trailing newline is present.
Examples
Let us create a file with three lines, each ending with a newline, and count the lines:
$ echo -n $'a\nb\nc\n' >file
$ awk '1' file | wc -l
3
The correct number is found.
Now, let's try again with the last new line missing:
$ echo -n $'a\nb\nc' >file
$ awk '1' file | wc -l
3
This still provides the right number. awk automatically corrects for a missing newline but leaves the file alone if the last newline is present.
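To see whether a given file is affected in the first place, one can inspect its last byte. A minimal sketch, relying on the fact that command substitution strips trailing newlines; the function name lacks_final_newline and the sample file names are made up for illustration:

```shell
# Hypothetical helper: succeed if the file is non-empty and its last byte
# is not a newline. $(...) drops a trailing newline, so the result is
# non-empty only when the last byte is some other character.
lacks_final_newline() {
    [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]
}

dir=$(mktemp -d)
printf 'a\nb\nc' > "$dir/without"
printf 'a\nb\nc\n' > "$dir/with"
for f in "$dir/without" "$dir/with"; do
    if lacks_final_newline "$f"; then
        echo "${f##*/}: last line is unterminated"
    else
        echo "${f##*/}: ends with a newline"
    fi
done
rm -r "$dir"
```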
Respect
I respect the answer from John1024 and would like to expand upon it.
Line Count function
I find myself comparing line counts A LOT especially from the clipboard, so I have defined a bash function. I'd like to modify it to show the filenames and when passed more than 1 file a total. However, it hasn't been important enough for me to do so far.
# semicolons are used because this is condensed to one line in my ~/.bash_profile
function wcl(){
  # with no arguments, read from standard input
  if [[ -z "${1:-}" ]]; then
    set -- /dev/stdin "$@";
  fi;
  for f in "$@"; do
    awk 1 "$f" | wc -l;
  done;
}
Counting lines without the function
# Line count of the file
$ cat file_with_newline | wc -l
3
# Line count of the file
$ cat file_without_newline | wc -l
2
# Line count of the file unchanged by cat
$ cat file_without_newline | cat | wc -l
2
# Line count of the file changed by awk
$ cat file_without_newline | awk 1 | wc -l
3
# Line count of the file changed by only the first call to awk
$ cat file_without_newline | awk 1 | awk 1 | awk 1 | wc -l
3
# Line count of the file unchanged by awk because it ends with a newline character
$ cat file_with_newline | awk 1 | awk 1 | awk 1 | wc -l
3
Counting characters (why you don't want to put a wrapper around wc)
# Character count of the file
$ cat file_with_newline | wc -c
6
# Character count of the file unchanged by awk because it ends with a newline character
$ cat file_with_newline | awk 1 | awk 1 | awk 1 | wc -c
6
# Character count of the file
$ cat file_without_newline | wc -c
5
# Character count of the file changed by awk
$ cat file_without_newline | awk 1 | wc -c
6
Counting lines with the function
# Line count function used on stdin
$ cat file_with_newline | wcl
3
# Line count function used on stdin
$ cat file_without_newline | wcl
3
# Line count function used on filenames passed as arguments
$ wcl file_without_newline file_with_newline
3
3
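The extension mentioned above (file names plus a total for several files) could look roughly like the following. This is only a sketch, not the author's function: the name wcl_named and the output format are assumptions, and it counts with awk 'END { print NR }' instead of piping to wc -l.

```shell
# Hypothetical variant of wcl: prints "count name" per file and a total
# line when more than one file is given; reads stdin without arguments.
wcl_named() {
    if [ "$#" -eq 0 ]; then
        awk 'END { print NR }'
        return
    fi
    total=0   # note: not declared local, to stay portable across shells
    for f in "$@"; do
        # NR also counts a final line that lacks a trailing newline
        n=$(awk 'END { print NR }' "$f")
        printf '%s %s\n' "$n" "$f"
        total=$((total + n))
    done
    if [ "$#" -gt 1 ]; then
        printf '%s total\n' "$total"
    fi
}

dir=$(mktemp -d)
printf 'a\nb\nc' > "$dir/file_without_newline"
printf 'a\nb\nc\n' > "$dir/file_with_newline"
wcl_named "$dir/file_without_newline" "$dir/file_with_newline"
rm -r "$dir"
```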

How to count the number of fields in a comma-separated line where commas within brackets are not to be counted as separators?

Let's say I have the following line in my file:
HELLO,1410250216446000,1410250216470330,1410250216470367,329,PE,B,T,GALU,[ , , T, I],3.38,3,A,A, , , , ,0, ,0,0, ,-Infinity,-Infinity,-Infinity, ,,0
if I use
grep -a -w HELLO my_file | head -10 | awk -F '[\t,]' '{print NF}' | less
output is 32.
But I don't want to count the commas within []. I mean [ , , T, I] must be counted as a single word. So that the output of my query is 29.
What will be one line command for doing this in Linux?
Remove content inside brackets using sed. Then continue counting
grep -a -w HELLO my_file|sed "s/\[.*\]//g" | head -10 | awk -F '[\t,]' '{print NF}' | less
output
29
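The pattern \[.*\] is greedy, so with several bracketed groups on one line it would swallow everything between the first [ and the last ]. A hedged variant that matches each group separately and replaces it with a placeholder; the function name count_fields and the placeholder X are assumptions for illustration:

```shell
# Hypothetical helper: count fields per line, treating each [...] group
# as a single field by collapsing it to the placeholder X first.
count_fields() {
    sed 's/\[[^]]*]/X/g' | awk -F '[\t,]' '{ print NF }'
}

# The sample line from the question.
line='HELLO,1410250216446000,1410250216470330,1410250216470367,329,PE,B,T,GALU,[ , , T, I],3.38,3,A,A, , , , ,0, ,0,0, ,-Infinity,-Infinity,-Infinity, ,,0'
printf '%s\n' "$line" | count_fields
```

For the sample line this should print 29, matching the answer above.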

Find unique lines

How can I find the unique lines and remove all duplicates from a file?
My input file is
1
1
2
3
5
5
7
7
I would like the result to be:
2
3
sort file | uniq will not do the job: it shows every value once.
uniq has the option you need:
-u, --unique
only print unique lines
$ cat file.txt
1
1
2
3
5
5
7
7
$ uniq -u file.txt
2
3
Use as follows:
sort < filea | uniq > fileb
You could also print out the unique value in "file" using the cat command by piping to sort and uniq
cat file | sort | uniq -u
While sort takes O(n log(n)) time, I prefer using
awk '!seen[$0]++'
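awk '!seen[$0]++' is an abbreviation for awk '!seen[$0]++ {print}': it prints the line ($0) only if seen[$0] is still zero, that is, the first time the line occurs. Note that this keeps one copy of every line rather than only the lines that occur once.
It takes more space but only O(n) time.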
I find this easier.
sort -u input_filename > output_filename
-u stands for unique.
You can use:
sort data.txt | uniq -u
This sorts the data and keeps only the values that occur once.
uniq -u has been driving me crazy because it did not work.
So instead of that, if you have python (most Linux distros and servers already have it):
Assuming you have the data file in notUnique.txt
# Python 3
# Assumes one value per line;
# otherwise adjust split() accordingly.
uniqueData = []
fileData = open('notUnique.txt').read().split('\n')
for i in fileData:
    # skip blank lines and values already collected
    if i.strip() != '' and i not in uniqueData:
        uniqueData.append(i)
print(uniqueData)
### Another option (fewer keystrokes):
set(open('notUnique.txt').read().split('\n'))
Note that due to empty lines, the final set may contain '' or only-space strings. You can remove that later. Or just get away with copying from the terminal ;)
Just FYI, From the uniq Man page:
"Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'."
One correct way to invoke it:
sort nonUnique.txt | uniq
Example run:
$ cat x
3
1
2
2
2
3
1
3
$ uniq x
3
1
2
3
1
3
$ uniq -u x
3
1
3
1
3
$ sort x | uniq
1
2
3
Spaces might be printed, so be prepared!
uniq -u < file will do the job.
uniq should do fine if your file is or can be sorted; if you can't sort the file for some reason, you can use awk:
awk '{a[$0]++}END{for(i in a)if(a[i]<2)print i}'
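The for (i in a) loop visits the array in an unspecified order, so the output order may differ from the input. If the order matters, one possible variant reads the file twice: first to count, then to print. A minimal sketch (the function name unique_only is made up; it needs a regular file rather than a pipe, because the input is read twice):

```shell
# Hypothetical helper: print only the lines that occur exactly once,
# keeping them in their original order and without sorting.
unique_only() {
    # First pass (NR == FNR): count each line. Second pass: print the
    # lines whose count is exactly 1.
    awk 'NR == FNR { count[$0]++; next } count[$0] == 1' "$1" "$1"
}

dir=$(mktemp -d)
printf '%s\n' 1 1 2 3 5 5 7 7 > "$dir/file"
unique_only "$dir/file"
rm -r "$dir"
```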
sort -d "file name" | uniq -u
This worked for me in a similar case. Use this if the file is not sorted.
You can omit sort if it is already sorted.
This was the first thing I tried:
skilla:~# uniq -u all.sorted
76679787
76679787
76794979
76794979
76869286
76869286
......
Running cat -e all.sorted showed the cause:
skilla:~# cat -e all.sorted
$
76679787$
76679787 $
76701427$
76701427$
76794979$
76794979 $
76869286$
76869286 $
Every second line has a trailing space :(
After removing all trailing spaces it worked!
thank you
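The cleanup step is not shown above. One way to do it, as a sketch rather than the poster's actual command, is to strip trailing whitespace with sed before uniq -u sees the lines. The function name unique_trimmed and the sample values are made up for illustration:

```shell
# Hypothetical helper: remove trailing whitespace from every line, then
# print the lines that occur exactly once.
unique_trimmed() {
    sed 's/[[:space:]]*$//' "$1" | sort | uniq -u
}

dir=$(mktemp -d)
# The second line carries a trailing space, as in the cat -e output above.
printf '%s\n' '76679787' '76679787 ' '76701427' > "$dir/all.sorted"
unique_trimmed "$dir/all.sorted"
rm -r "$dir"
```

Because [[:space:]] also matches carriage returns, the same step should help with files that have Windows line endings.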
Instead of sorting and then using uniq, you could also just use sort -u. From sort --help:
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run

Resources