Cut a file between two line numbers using awk

Say I have a file with 100 lines (not including the header). I want to cut that file down, keeping only the content between lines 51 and 70 (inclusive), as well as the header, so that the resulting file is 20+1 lines.
So far, I have this code:
awk 'NR==1 {h=$0; next} (NR-1)>50 && (NR-1)<71 {filename = "file20.csv"; print h >> filename} {print >> filename}' file100.csv
But it's giving me an error:
fatal: expression for `>>' redirection has null string value
Can somebody help me understand where my syntax is wrong?

The error comes from your last block, {print >> filename}: it runs for every input line, including lines 2 to 50, before filename has been assigned, so the redirection target is a null string. You don't need per-line redirection at all; you can directly use:
awk 'NR==1 || (NR>=51 && NR<=70)'
Note that this only evaluates a condition on NR. When the condition is true, awk performs its default action, {print $0}, so you do not have to write it explicitly.
Then you can redirect to another file:
awk 'NR==1 || (NR>=51 && NR<=70)' file > new_file
If, as in your own (NR-1) condition, you number lines excluding the header, shift the range accordingly: NR==1 || (NR>=52 && NR<=71).
Test
$ seq 100 | awk 'NR==1 || (NR>=51 && NR<=70)'
1
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
It returns 21 lines:
$ seq 100 | awk 'NR==1 || (NR>=51 && NR<=70)' | wc -l
21
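If you would rather keep your original redirection approach, it can be repaired by making sure the file name is assigned before any print uses it, and by writing the header only once. A minimal sketch (header_done is just an illustrative flag; any unset variable works):
awk 'NR==1 {h=$0; next}
     (NR-1)>50 && (NR-1)<71 {
         if (!header_done++) print h > "file20.csv"
         print > "file20.csv"
     }' file100.csv
Note that within a single awk run, > truncates the output file on the first write and appends afterwards, so >> is not needed here.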

Related

Display Second Result of Duplicate Lines in Text File

I am trying to display the second result of duplicate lines in a text file.
Data:
60 60 61 64 63 78 78
Duplicate Lines:
60 60 78 78
Attempted Code:
echo "60 60 61 64 63 78 78" | sed 's/ /\n/g' | uniq -D | tail -1
Current Result:
78
Expected Result:
60 78
You may try this GNU awk solution:
s='60 60 61 64 63 78 78'
awk -v RS='[[:space:]]+' '++fq[$0] == 2' <<< "$s"
60
78
To avoid a line break after each value and print them on one line:
awk -v RS='[[:space:]]+' '++fq[$0] == 2 {printf "%s", $0 RT}' <<< "$s"
60 78
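If your awk is not GNU awk (a regex RS and the RT variable are gawk extensions), a portable sketch is to put each value on its own line first:
s='60 60 61 64 63 78 78'
echo "$s" | tr -s '[:space:]' '\n' | awk '++fq[$0] == 2'
60
78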
Considering that you could have multiple lines in your Input_file, you could try the following.
awk '
{
  delete value
  num=split($0,arr," ")
  for(i=1;i<=num;i++){
    value[arr[i]]++
  }
  for(i=1;i<=num;i++){
    if(value[arr[i]]>1){
      print arr[i]
      delete value[arr[i]]
    }
  }
}
' Input_file
Explanation: Adding detailed explanation for above.
awk '                          ##Starting awk program from here.
{
  delete value                 ##Deleting value array here.
  num=split($0,arr," ")        ##Splitting current line into array arr here.
  for(i=1;i<=num;i++){         ##Traversing through all fields here.
    value[arr[i]]++            ##Counting occurrences of each field in value array.
  }
  for(i=1;i<=num;i++){         ##Traversing through all fields again here.
    if(value[arr[i]]>1){       ##Checking if this value occurs more than once.
      print arr[i]             ##Printing the value that occurs more than once.
      delete value[arr[i]]     ##Deleting the entry to avoid printing it again.
    }
  }
}
' Input_file                   ##Mentioning Input_file name here.
If you do not want only the last one, do not use tail -1. And instead of printing every repeated occurrence, print each duplicate once:
echo "60 60 61 64 63 78 78" | sed 's/ /\n/g' | sort | uniq -d
Note - if the input is not sorted (or duplicate lines are not adjacent), you need to sort it first.
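For example, with non-adjacent duplicates, uniq -d alone finds nothing because it only compares neighbouring lines:
$ printf '%s\n' 60 78 60 78 | uniq -d
$ printf '%s\n' 60 78 60 78 | sort | uniq -d
60
78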
$ printf '%s\n' 60 60 61 64 63 78 78 | uniq -d
60
78
The above assumes a hard-coded list of numbers works for you since, per the example in your question, you thought echo "60 60 61 64 63 78 78" | sed 's/ /\n/g' | uniq -D | tail -1 would work. If you need to store the values in a variable first, then make it an array variable:
$ vals=(60 60 60 61 64 63 78 78)
$ printf '%s\n' "${vals[@]}" | uniq -d
60
78

Even after `sort`, `uniq` is still repeating some values

Reference file: http://snap.stanford.edu/data/wiki-Vote.txt.gz
(It is a gzip-compressed file that contains a file called Wiki-Vote.txt)
These are the first few lines of the file, from head -n 10 Wiki-Vote.txt:
# Directed graph (each unordered pair of nodes is saved once): Wiki-Vote.txt
# Wikipedia voting on promotion to administratorship (till January 2008).
# Directed edge A->B means user A voted on B becoming Wikipedia administrator.
# Nodes: 7115 Edges: 103689
# FromNodeId ToNodeId
30 1412
30 3352
30 5254
30 5543
30 7478
3 28
I want to find the number of nodes in the graph (although it is already given in the header). I ran the following command,
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort | uniq | wc -l
Explanation:
/^#/ matches all the lines that start with #, and !/^#/ matches those that don't.
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt prints the first and second columns of every matched line, each on its own line.
| sort pipes the output to sort.
| uniq should keep only the unique values, but it doesn't.
| wc -l counts the resulting lines, and the count is wrong.
The result of the above command is 8491, which is not 7115 (the number given in the header). I don't know why uniq repeats values; I can tell that it does, since awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort -i | uniq | tail returns:
992
993
993
994
994
995
996
998
999
999
Which contains repeated values. Can someone run the code, confirm that I am not the only one getting the wrong answer, and help me figure out why?
The file has DOS line endings: each line ends with a CR (\r) character before the newline.
You can inspect your tail output, for example, with hexdump -C; the lines starting with # were added by me:
$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | sort | uniq | tail | hexdump -C
00000000 39 39 32 0a 39 39 33 0a 39 39 33 0d 0a 39 39 34 |992.993.993..994|
# ^^ HERE
00000010 0a 39 39 34 0d 0a 39 39 35 0d 0a 39 39 36 0a 39 |.994..995..996.9|
# ^^ ^^
00000020 39 38 0a 39 39 39 0a 39 39 39 0d 0a |98.999.999..|
# ^^
0000002c
Because uniq sees distinct lines, one with a CR and one without, they are not removed. (The CR survives only on values printed from $2, since $2 is the last field on the line and keeps the trailing \r.) Remove the CR character before piping. Note also that sort | uniq is better written as sort -u.
$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | tr -d '\r' | sort -u | wc -l
7115
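If you prefer to strip the carriage return inside awk itself rather than piping through tr, a sketch (modifying $0 with sub() also forces the fields to be re-split):
$ awk '{sub(/\r$/,"")} !/^#/ { print $1; print $2; }' ./wiki-Vote.txt | sort -u | wc -l
7115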

AWK convert minutes and seconds to only seconds

I have a file of times that I'm trying to convert into seconds only. It looks like this:
49.80
1:03.40
1:01.00
1:00.40
1:01.00
I've tried:
awk -F: '{seconds=($1*60);seconds=seconds+$2;print seconds}' file
which outputs:
2988
63.4
61
60.4
61
I've also tried:
sed 's/^/:/g' file | awk -F: '{seconds=($2*60);seconds=seconds+$3;print seconds}'
Which outputs the same results. I would like to obtain these results instead:
49.80
63.4
61
60.4
61
Just add a check for whether the record contains both minutes and seconds:
awk -F: 'NF==1; NF==2{seconds=($1*60);seconds=seconds+$2;print seconds}'
NF is the number of fields in each record (row).
NF==1 If the record contains only one field, no calculation is needed. Note that there is no {action} part associated with this check, so awk performs its default action and prints the entire line.
NF==2 True if the record contains two fields. The associated action then performs the calculation.
Test
$ awk -F: 'NF==1; NF==2{seconds=($1*60);seconds=seconds+$2;print seconds}' file
49.80
63.4
61
60.4
61
The last column in your colon-separated list is seconds, the next-to-last is minutes, and the column before that is hours (although hours were not mentioned in your question). I'd put this structure into code:
awk -F':' '{print (NF>2 ? $(NF-2)*3600 : 0) + (NF>1 ? $(NF-1)*60 : 0) + $(NF)}'
It checks whether enough columns exist (e.g. NF>1) and then either adds the corresponding value (e.g. $(NF-1)*60) or 0.
Test:
> awk -F':' '{print $0": "(NF>2?$(NF-2)*3600:0)+(NF>1?$(NF-1)*60:0)+$(NF)}'
49.80: 49.8
1:03.40: 63.4
1:01.00: 61
1:00.40: 60.4
1:01.00: 61
23:14:12.1: 83652.1
A terser alternative exploits the fact that NF-1 is 0 for a seconds-only line and 1 for a minutes:seconds line, so the minutes term vanishes when there are no minutes:
$ awk -F: '{print (NF-1)*$1*60 + $NF}' file
49.8
63.4
61
60.4
61
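If you ever have hours as well, or any deeper colon nesting, a more general sketch is to fold the fields left to right, multiplying the running total by 60 at each step (this assumes every unit converts at a factor of 60, as minutes and hours do):
awk -F: '{s=0; for(i=1;i<=NF;i++) s = s*60 + $i; print s}' file
With the hours example from the test above, 23:14:12.1 yields 83652.1.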

Compare two files having different numbers of columns and print matching rows to a new file

I have two files with more than 10000 rows:
File1 has 1 column:
23
34
43
54
.
.
File2 has 4 columns:
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
.
.
I want to compare each value in file-1 with the first column of file-2. If the value exists there, print it along with the other three values from file-2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it takes too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do n=$(awk 'NR=="$s1" {print $1}' File1.txt)
p1=1; p2=$(wc -l < File2.txt)
while [ $p1 -le $p2 ]
do awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}'> ofile.txt
(( p1++ ))
done
(( s1++ ))
done
Is there any short/ easy way to do it?
You can do it very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
What it does?
FNR==NR checks whether FNR, the record number within the current file, is equal to NR, the record number across all input files. They are equal only while the first file, file1, is read, because FNR is reset to 1 when awk starts a new file.
{found[$1]++; next} If the check is true, this creates an entry in the associative array found, indexed by $1, the first column of file1; next then skips the rest of the script for that line.
$1 in found This check only applies to the second file, file2. If the value of column 1, $1, is an index in the associative array found, the entire line is printed (the print is not written out because it is awk's default action).
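If both files are sorted on the key column, or can be sorted, the POSIX join utility is an alternative that avoids building an in-memory array. A sketch, assuming a shell with process substitution (bash, ksh, zsh):
join <(sort file1) <(sort file2)
Both inputs must be sorted lexicographically on the join field, which is what plain sort produces.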

Linux: add a line break between the outputs of two awk scripts

I have two awk scripts to run on Linux. The output of each one is on a single line.
How can I separate the two outputs onto two lines?
For example:
awk '{printf $1}' f.txt >> a.txt
awk '{printf $3}' f.txt >> a.txt
The output of the first script is:
35 56 40 28 57
And the second output is:
29 48 73 26
If I run them one after another, the output will become:
35 56 40 28 57 29 48 73 26
Is there any way to get the result to:
35 56 40 28 57
29 48 73 26
Thank you!~
Although I don't understand how you manage to get the spaces between fields with printf $1, you can add an END statement to the first script. (print "" emits a single trailing newline, since print appends the output record separator; print "\n" would emit two.)
awk '{printf $1} END{print ""}'
You can also do this with a single awk command:
awk -v ORS=" " 'BEGIN{ARGV[ARGC++] = ARGV[1]; i = 1 }
NR!=FNR && FNR==1 { printf "\n"; i=3 }
{ print $i }
END { printf "\n" }' f.txt
The BEGIN block appends f.txt to ARGV a second time, so awk reads the same file twice: the first pass prints field 1, and when the second pass starts (NR!=FNR && FNR==1) it emits a newline and switches the printed field from $1 to $3.
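Incidentally, printf $1 on its own prints the fields with no separator at all, so if the spaces in your sample output are intended, it is safer to make the separator explicit. A sketch that keeps your two-command structure (at the cost of one trailing space per line):
awk '{printf "%s ", $1} END{print ""}' f.txt >> a.txt
awk '{printf "%s ", $3} END{print ""}' f.txt >> a.txt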
