I have files similar to the following.
$ ls -1 *.ts | sort -V
media_w1805555829_b1344100_sleng_2197.ts
media_w1805555829_b1344100_sleng_2198.ts
media_w1805555829_b1344100_sleng_2199.ts
media_w1805555829_b1344100_sleng_2200.ts
media_w1501256294_b1344100_sleng_2199.ts
media_w1501256294_b1344100_sleng_2200.ts
media_w1501256294_b1344100_sleng_2201.ts
media_w1501256294_b1344100_sleng_2202.ts
This will print duplicate lines:
$ ls -1 *.ts | sort -V | grep -oP '.*_\K.*(?=.ts)' | sort | uniq -d | sed 's/^/DUPLICATE---:> /'
DUPLICATE---:> 2199
DUPLICATE---:> 2200
I want the output:
DUPLICATE---:> media_w1805555829_b1344100_sleng_2199.ts
DUPLICATE---:> media_w1805555829_b1344100_sleng_2200.ts
DUPLICATE---:> media_w1501256294_b1344100_sleng_2199.ts
DUPLICATE---:> media_w1501256294_b1344100_sleng_2200.ts
ls -1 *.ts | sort -V | awk -F[_.] '
{
    map[$5]+=1;
    map1[$5][$0]
}
END {
    for (i in map)
    {
        if (map[i] > 1)
        {
            for (j in map1[i])
            {
                print "DUPLICATE---:> "j
            }
        }
    }
}' | sort
One-liner
ls -1 *.ts | sort -V | awk -F[_.] '{ map[$5]+=1;map1[$5][$0] } END { for (i in map) { if(map[i]>1) { for (j in map1[i]) { print "DUPLICATE---:> "j } } } }' | sort
Using awk, set the field separator to _ or . and create two arrays. The first (map) holds a count for each trailing number in the file name. The second (map1) is a multidimensional array, an array of arrays keyed first by the number and then by the complete line (the file name); note that arrays of arrays require GNU awk 4+. At the end we loop through map and check for any counts greater than one. If we find any, we loop through the corresponding map1 entries and print the lines (the second index) along with the additional text. We finally run through sort again to get the ordering as required.
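To see what that separator does with the sample names, a quick check on two of the files prints the suffix field that both arrays are keyed on:
$ printf '%s\n' media_w1805555829_b1344100_sleng_2199.ts media_w1501256294_b1344100_sleng_2199.ts |
    awk -F'[_.]' '{ print "key=" $5, "line=" $0 }'
key=2199 line=media_w1805555829_b1344100_sleng_2199.ts
key=2199 line=media_w1501256294_b1344100_sleng_2199.ts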
Use this Perl one-liner:
ls -1 *.ts | perl -lne '
    $cnt{$1}++ if /_(\d+)\.ts$/;
    push @files, [ $_, $1 ];
    END {
        for ( grep $cnt{$_->[1]} > 1, @files ) {
            print "DUPLICATE---:> $_->[0]"
        }
    }'
This eliminates the need to sort.
The %cnt hash holds the count of the suffixes (the parts of the filename that you want to find duplicates in).
@files is an array of arrays. Each of its elements is an anonymous array with 2 elements: the file name and the suffix.
grep $cnt{$_->[1]} > 1, @files : The grep selects the elements of the @files array where the suffix is a dupe.
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
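To get a quick feel for -l and -n together, here is a small illustration (the input string is arbitrary):
$ printf 'abc\n' | perl -ne 'print length($_), "\n"'
4
$ printf 'abc\n' | perl -lne 'print length($_)'
3
With -l the trailing newline is stripped before the code runs (hence 3, not 4) and added back automatically on print.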
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
I have these functions to process a 2GB text file. I'm splitting it into 6 parts for simultaneous processing but it is still taking 4+ hours.
What else can I try to make the script faster?
A bit of detail:
I feed my input csv into a while loop to be read line by line.
I grab the values of five fields from the csv line in the read2col function.
The awk in my mainf function takes the values from read2col and does some arithmetic calculations, rounding the results to 2 decimal places, then prints the line to a text file.
Sample data:
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Script:
read2col()
{
    is_one_way=$(echo "$line" | awk -F'","' '{print $7}')
    price_outbound=$(echo "$line" | awk -F'","' '{print $30}')
    price_exc=$(echo "$line" | awk -F'","' '{print $25}')
    tax=$(echo "$line" | awk -F'","' '{print $27}')
    price_inc=$(echo "$line" | awk -F'","' '{print $26}')
}
#################################################
#for each line in the csv
mainf()
{
    cd $infarepath
    while read -r line; do
        #read the value of csv fields into variables
        read2col
        if [[ $is_one_way == 0 ]]; then
            if [[ $price_outbound > 0 ]]; then
                #calculate price inc and print the entire line to txt file
                echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax -F'","' 'BEGIN {OFS = FS} {$25=pout;$26=(pout+(tax / 2)); print}' >>"$csvsplitfile".tmp
            else
                #divide price exc and inc by 2 if price outbound is not greater than 0
                echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2);$26=(pinc /2); print}' >>"$csvsplitfile".tmp
            fi
        else
            echo $line >>"$csvsplitfile".tmp
        fi
    done < $csvsplitfile
}
The first thing you should do is stop invoking six subshells for running awk for every single line of input. Let's do some quick, back-of-the-envelope calculations.
Assuming your input lines are about 292 characters (as per your example), a 2GB file will consist of a little over 7.3 million lines. That means you are starting and stopping a whopping forty-four million processes.
And, while Linux admirably handles fork and exec as efficiently as possible, it's not without cost:
pax$ time for i in {1..44000000} ; do true ; done
real 1m0.946s
In addition, bash hasn't really been optimised for this sort of processing; its design leads to sub-optimal behaviour for this specific use case. For details on this, see this excellent answer over on one of our sister sites.
An analysis of the two methods of file processing, one program reading an entire file (each line containing just hello) and bash reading it a line at a time, is shown below. The two commands used to get the timings were:
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
For varying file sizes (user+sys time, averaged over a few runs), it's quite interesting:
# of lines    cat-method    while-method
----------    ----------    ------------
     1,000        0.375s          0.031s
    10,000        0.391s          0.234s
   100,000        0.406s          1.994s
 1,000,000        0.391s         19.844s
10,000,000        0.375s        205.583s
44,000,000        0.453s        889.402s
From this it appears that, while the while-read method can hold its own for smaller data sets, it really does not scale well.
Since awk itself has ways to do calculations and formatted output, processing the file with one single awk script, rather than your bash/multi-awk-per-line combination, will make the cost of creating all those processes and line-based delays go away.
This script would be a good first attempt; let's call it prog.awk:
BEGIN {
    FMT = "%.2f"
    OFS = FS
}
{
    isOneWay = $7
    priceOutbound = $30
    priceExc = $25
    tax = $27
    priceInc = $26

    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print
}
You just run that single awk script with:
awk -F'","' -f prog.awk data.txt
With the test data you provided, here's the before and after; the fields that change are numbers 25 and 26 (146.00 and 222.26 on input, 100.50 and 138.63 on output):
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
In an attempt to debug Apache on a very busy server, we have used strace to log all our processes. Now I have thousands of individual strace logs in a folder and I need to find the ones that contain a value of 1.0 or greater. This is the command we used to generate the traces:
mkdir /strace; ps auxw | grep httpd | awk '{print"-p " $2}' | xargs strace -o /strace/strace.log -ff -s4096 -r
This has generated files with the name strace.log.29382 (Where 29382 is the PID of the process).
Now, if I run this command:
for i in `ls /strace/*`; do echo $i; cat $i | cut -c6-12 | sort -rn | head -c 8; done
it will output the filename and the top runtime value, e.g.:
/strace/strace.log.19125
0.13908
/strace/strace.log.19126
0.07093
/strace/strace.log.19127
0.09312
What I am looking for is only to output those with a value of 1.0 or greater.
Sample data: https://pastebin.com/Se89Jt1i
This data does not contain anything 1.0+, but it's the first set of numbers that I'm trying to filter against.
What I do not want to show up:
0.169598 close(85) = 0
What I do want to find:
1.202650 accept4(3, {sa_family=AF_INET, sin_port=htons(4557), sin_addr=inet_addr("xxx.xxx.xxx.xxx")}, [16], SOCK_CLOEXEC) = 85
My pipeline sorts the values so the highest value in the file is always first.
As I am more used to Perl, here is a solution in Perl which should be possible to translate to awk.
One-liner
perl -ane 'BEGIN{@ARGV=</strace/*>}$max=$F[0]if$F[0]>$max;if(eof){push@A,$ARGV if$max>1;$max=0};END{print"$_\n"for@A}'
There is no need to sort the files to get the maximum value; it is just stored in a variable. The part which may be interesting to modify to get more information:
push @A, $ARGV
can be changed to
push#A,"$ARGV:$max"
to get the value.
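Putting that change in place, the whole one-liner becomes:
perl -ane 'BEGIN{@ARGV=</strace/*>}$max=$F[0]if$F[0]>$max;if(eof){push@A,"$ARGV:$max" if$max>1;$max=0};END{print"$_\n"for@A}'
so each reported file name is followed by its maximum value.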
How it works:
-a flag: from perl -h: autosplit mode with -n or -p (splits $_ into @F), by default delimited by one or more spaces.
BEGIN{} and END{} blocks are executed at the beginning and the end; the part which is not in those blocks is executed for each line, as with awk.
</strace/*> is a glob matching which gives a list of files
@ARGV is a special array which contains the command line arguments (here the list of files to process)
eof is a function which returns true when the current line is the last of the current file
$ARGV is current file name
push to append elements to an array
The script version, with warnings enabled (which are useful for fixing bugs):
#!/usr/bin/perl
use strict;
use warnings;
sub BEGIN {
    use File::Glob ();
    @ARGV = glob('/strace/*');
}
my (@A, @F);
my $max = 0;
while (defined($_ = readline ARGV)) {
    @F = split(' ', $_, 0);
    $max = $F[0] if $F[0] > $max;
    if (eof) {
        push @A, "${ARGV}:$max" if $max > 1;
        $max = 0;
    }
}
print "$_\n" foreach (@A);
I'm trying to get specific columns of a CSV file (those whose header contains "SOF"). It is a large file and I need to copy these columns to another CSV file using shell.
I've tried something like this:
#!/bin/bash
awk ' {
i=1
j=1
while ( NR==1 )
if ( "$i" ~ /SOF/ )
then
array[j] = $i
$j += 1
fi
$i += 1
for ( k in array )
print array[k]
}' fil1.csv > result.csv
In this case I've tried to save the numbers of the columns whose header contains "SOF" in an array, and after that copy those columns using these numbers.
Preliminary note: contrary to what one may infer from the code included in the OP, the values in the CSV are delimited with a semicolon.
Here is a solution with two separate commands:
the first parses the first line of your CSV file and identifies which fields must be exported. I use awk for this.
the second only prints the fields. I use cut for this (simpler syntax and quicker than awk, especially if your file is large)
The idea is that the first command yields a list of field numbers, separated with ",", suited to be passed as parameter to cut:
# Command #1: identify fields
fields=$(awk -F";" '
{
for (i = 1; i <= NF; i++)
if ($i ~ /SOF/) {
fields = fields sep i
sep = ","
}
print fields
exit
}' fil1.csv
)
# Command #2: export fields
{ [ -n "$fields" ] && cut -d";" -f "$fields" fil1.csv; } > result.csv
try something like this...
$ awk 'BEGIN {FS=OFS=","}
NR==1 {for(i=1;i<=NF;i++) if($i~/SOF/) {col=i; break}}
{print $col}' file
There is no handling for when the sought header doesn't exist; in that case col stays unset, $col becomes $0, and the whole line is printed.
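If you would rather collect every matching column (not just the first) and print nothing at all when no header matches, a hedged variation could look like this (cols, n and sep are names introduced here for illustration):
$ awk 'BEGIN {FS=OFS=","}
       NR==1 {for (i=1; i<=NF; i++) if ($i ~ /SOF/) cols[++n] = i}
       n     {sep = ""
              for (j=1; j<=n; j++) {printf "%s%s", sep, $(cols[j]); sep = OFS}
              print ""}' file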
One of the useful commands you probably need is "cut":
cut -d , -f 2 input.csv
Here number 2 is the column number you want to cut from your csv file.
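If you need more than one column, -f also accepts a comma-separated list and ranges; the column numbers here are arbitrary:
cut -d , -f 2,5-7 input.csv > result.csv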
Try this one out:
awk '{for(i=1;i<=NF;i++)a[i]=a[i]" "$i}END{for (i in a ){ print a[i] } }' filename | grep SOF | awk '{for(i=1;i<=NF;i++)a[i]=a[i]" "$i}END{for (i in a ){ print a[i] } }'
Is there a way in Linux to ask for the head or tail of a file, but with an additional offset of records to ignore?
For example if the file example.lst contains the following:
row01
row02
row03
row04
row05
If I use head -n3 example.lst I can get rows 1-3, but what if I want it to skip the first row and get rows 2-4?
I ask because some commands have a header which may not be desirable within the search results. For example, du -h ~ --max-depth 1 | sort -rh will return the directory size of all folders within the home directory sorted in descending order, but will put the current directory (i.e. ~) at the top of the result set.
The head and tail man pages don't seem to have any offset parameter, so maybe there is some kind of range command where the required lines can be specified, e.g. range 2-10 or something?
From man tail:
-n, --lines=K
output the last K lines, instead of the last 10;
or use -n +K to output lines starting with the Kth
You can therefore use ... | tail -n +2 | head -n 3 to get 3 lines starting from line 2.
Non-head/tail methods include sed -n "2,4p" and awk "NR >= 2 && NR <= 4".
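On the example.lst from the question, all three give the same slice:
$ tail -n +2 example.lst | head -n 3
row02
row03
row04
$ sed -n '2,4p' example.lst
row02
row03
row04
$ awk 'NR >= 2 && NR <= 4' example.lst
row02
row03
row04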
To get the rows between 2 and 4 (both inclusive), you can use:
head -n4 example.lst | tail -n+2
or
head -n4 example.lst | tail -n3
It took me a lot of time to end up with this solution, which seems to be the only one that covers all use cases (so far):
command | tee full.log | stdbuf -i0 -o0 -e0 awk -v offset=${MAX_LINES:-200} \
'{
    if (NR <= offset) print;
    else {
        a[NR] = $0;
        delete a[NR-offset];
        printf "." > "/dev/stderr"
    }
}
END {
    print "" > "/dev/stderr";
    for (i = NR-offset+1 > offset ? NR-offset+1 : offset+1; i <= NR; i++)
        print a[i]
}'
Feature list:
live output for the head part (obviously live output for the tail part is not possible)
no use of external files
progress bar on stderr, one dot for each line after MAX_LINES, very useful for long-running tasks
avoids possible incorrect logging order due to buffering (stdbuf)
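To see the behaviour quickly, you can feed a small generated stream through the same filter with a tiny offset (seq and offset=3 are used here purely for illustration):
seq 1 10 | tee full.log | stdbuf -i0 -o0 -e0 awk -v offset=3 '{
    if (NR <= offset) print;
    else {
        a[NR] = $0;
        delete a[NR-offset];
        printf "." > "/dev/stderr"
    }
}
END {
    print "" > "/dev/stderr";
    for (i = NR-offset+1 > offset ? NR-offset+1 : offset+1; i <= NR; i++)
        print a[i]
}'
This prints 1, 2 and 3 immediately, seven progress dots on stderr, and 8, 9 and 10 at the end, while full.log keeps the complete stream.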
sed -n 2,4p somefile.txt