obtain the line count of all files in a directory - linux

I have 3 files in the directory "work" which are pumped in on a daily basis.
The files are as shown below:
ZNAMI DOWN COND RESULT_17-08-2015.csv
ZNAMI UP CND RESULT_18-08-2015.csv
ZNAMI DOWN COND RESULT_17-08-2015.csv
These files contain many rows that are just ",,,,,,,,," along with the actual data.
What I need to perform is as below:
Open each file [this should be dynamic, as the date part changes every day].
Remove the lines with ",,,,,,,,,".
Get the line count.
I tried wc -l *.csv but it does not give the total count of all lines.
I also tried sed -i ",,,,,,,,,"d *.csv to remove the lines, but it is not working.

Have a try with this:
grep -v ",,,,,,,,," *.csv | wc -l
This prints every line from the *.csv files that does not contain ,,,,,,,,, to standard output (when more than one file is given, each line is prefixed with its file name). Piping that into wc -l yields the total count of such lines.
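If you also want a per-file breakdown, a small variation with grep -c (which counts instead of printing) should work:
grep -cv ",,,,,,,,," *.csv
With -v and -c together, grep prints file:count pairs, counting only the lines that do not contain ,,,,,,,,,.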

Using awk:
awk '!/,,,,,,,,,/{n++;} END{print n;}' *.csv
This counts every line that does not contain ,,,,,,,,,.
!/,,,,,,,,,/{n++;}
In awk, ! is negation. So, this increments n for every line that does not match ,,,,,,,,,.
END{print n;}
After we have read the last line of the last file, print out the value of n.
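If you also need to actually delete those lines from the files, as the original sed attempt intended, a minimal sketch with the corrected sed syntax (the pattern must be given as a /.../ address) would be:
sed -i '/,,,,,,,,,/d' *.csv
cat *.csv | wc -l
The first command removes the junk lines in place; the second prints a single grand total of the remaining lines across all the csv files.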

Related

Search multiple strings from file in multiple files in specific column and output the count in unix shell scripting

I have searched extensively on the internet about this but haven't found much detail.
Problem Description:
I am using aix server.
I have a pattern.txt file that contains customer_id for 100 customers in the following sample format:
160471231
765082023
75635713
797649756
8011688321
803056646
I have a directory (/home/aswin/temp) with several files (1.txt, 2.txt, 3.txt and so on) which are pipe(|) delimited. Sample format:
797649756|1001|123270361|797649756|O|2017-09-04 23:59:59|10|123769473
803056646|1001|123345418|1237330|O|1999-02-13 00:00:00|4|1235092
64600123|1001|123885297|1239127|O|2001-08-19 00:00:00|10|1233872
75635713|1001|123644701|75635713|C|2006-11-30 00:00:00|11|12355753
424346821|1001|123471924|12329388|O|1988-05-04 00:00:00|15|123351096
427253285|1001|123179704|12358099|C|2012-05-10 18:00:00|7|12352893
What I need to do is search for all the strings from the pattern.txt file in all files in the directory, in the first column of each file, and list each filename with its number of matches. If the same row has more than 1 match it should be counted as 1.
So the output should be something like this (only matches in the first column should count):
1.txt:4
2.txt:3
3.txt:2
4.txt:5
What I have done till now:
cd /home/aswin/temp
grep -srcFf ./pattern.txt * /dev/null >> logfile.txt
This gives the output in the desired format, but it searches for the strings in all columns, not just the first column, so the output count is much higher than expected.
Please help.
If you want to do that with grep, you must change the pattern file.
With your command, you also search in /dev/null, so the output includes /dev/null:0.
I think you meant 2>/dev/null, but that is not needed because you already pass -s to grep.
Your pattern file is in the same directory, so grep searches it as well and outputs pattern.txt:6.
All your files are in the same directory, so -r is not needed.
You put the logfile in the same directory, so the second time you run the command grep searches it too and outputs logfile.txt:0.
If you can modify the pattern file, write each line like ^765082023|
and rename the file so it no longer ends in .txt (otherwise the *.txt glob below would pick it up).
This command then gives you what you are looking for:
grep -scf pattern *.txt >>logfile
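If you prefer not to edit the file by hand, a small sketch for generating that anchored pattern file from the original pattern.txt (assuming the IDs contain no regex metacharacters):
sed 's/.*/^&|/' pattern.txt > pattern
This turns a line such as 797649756 into ^797649756|.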
If you can't modify the pattern file, you can use awk.
awk -F'|' '
NR==FNR{a[$0];next}                 # 1st file: remember every customer_id from pattern.txt
FILENAME=="pattern.txt"{next}       # skip pattern.txt when the *.txt glob matches it again
$1 in a {b[FILENAME]++}             # first column is a known id: count a match for this file
END{for(i in b){print i,":",b[i]}}  # print each filename with its match count
' pattern.txt *.txt >>logfile.txt
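Note that print i,":",b[i] joins its arguments with the default output field separator (a space), so the log entries come out as 1.txt : 4. If you want the exact 1.txt:4 format shown in the question, one possible tweak is to concatenate instead:
awk -F'|' 'NR==FNR{a[$0];next} FILENAME=="pattern.txt"{next} $1 in a {b[FILENAME]++} END{for(i in b)print i":"b[i]}' pattern.txt *.txt >>logfile.txt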

Looping through a file of path and file names and searching within these files for a pattern

I have a file called lookupfile.txt with the following info:
path, including filename
Within bash I would like to search through the files listed in lookupfile.txt for a pattern: myerrorisbeinglookedat. When found, output the matching lines into another record file. All the results can land in the same file.
Please help.
You can write a single grep statement to achieve this:
grep myerrorisbeinglookedat $(< lookupfile.txt) > outfile
Assuming:
the number of entries in lookupfile.txt is small (tens or hundreds)
there are no white spaces or wildcard characters in the file names
Otherwise:
while IFS= read -r file; do
    # print the file names separated by a NUL character '\0'
    # to be fed into xargs
    printf '%s\0' "$file"
done < lookupfile.txt | xargs -0 grep myerrorisbeinglookedat > outfile
xargs takes the output of the loop, tokenizes it correctly, and invokes the grep command. xargs batches up the files based on operating system limits in case there is a very large number of files.
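If your tr accepts the \0 escape (GNU tr does), an equivalent sketch without the explicit loop is:
tr '\n' '\0' < lookupfile.txt | xargs -0 grep myerrorisbeinglookedat > outfile
This converts the newline-separated list into a NUL-separated one and feeds it to xargs in the same way.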

Generate record of files which have been removed by grep as a secondary function of primary command

I asked a question here to remove unwanted lines which contained strings which matched a particular pattern:
Remove lines containing string followed by x number of numbers
anubhava provided a good line of code which met my needs perfectly. This code removes any line which contains the string vol followed by a space and three or more consecutive numbers:
grep -Ev '\bvol([[:blank:]]+[[:digit:]]+){2}' file > newfile
The command will be used on a fairly large csv file and will be initiated by crontab. For this reason, I would like to keep a record of the lines this command is removing, just so I can go back and check that the correct data is being removed - I guess it will be some sort of log containing the lines that did not make the final cut. How can I add this functionality?
Drop grep and use awk instead:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file
The above uses GNU awk for word delimiters (\<) and will append every deleted line to a file named "deleted". Consider adding a timestamp too:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file
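Since the trailing 1 prints every kept line to standard output, you can also capture the cleaned data (mirroring the original grep ... > newfile) by redirecting stdout, for example:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file > newfile
The removed lines (with timestamps) go to "deleted" and the surviving lines go to newfile.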

Find the most common line in a file in bash

I have a file of strings:
string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123
How do I retrieve the most common line in bash (string-string-123)?
You can use sort with uniq
sort file | uniq -c | sort -n -r
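For the sample input above, this prints each distinct line prefixed by its count, most frequent first, roughly like:
      3 string-string-123
      2 string-string-12345
      1 string-string-12345-123
The most common line is then simply the first one.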
You could use awk to do this:
awk '{++a[$0]}END{for(i in a)if(a[i]>max){max=a[i];k=i}print k}' file
The array a keeps a count of each line. Once the file has been read, we loop through the array and find the line with the maximum count.
Alternatively, you can skip the loop in the END block by assigning the line during the processing of the file:
awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file
Thanks to glenn jackman for this useful suggestion.
It has rightly been pointed out that the two approaches above will only print out one of the most frequently occurring lines in the case of a tie. The following version will print out all of the most frequently occurring lines:
awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}' file
Tom Fenech's elegant awk answer works great [in the amended version that prints all most frequently occurring lines in the event of a tie].
However, it may not be suitable for large files, because all distinct input lines are stored in an associative array in memory, which could be a problem if there are many non-duplicate lines; that said, it's much faster than the approaches discussed below.
Grzegorz Żur's answer combines multiple utilities elegantly to implicitly produce the desired result, but:
all distinct lines are printed (highest-frequency count first)
output lines are prefixed by their occurrence count (which may actually be desirable).
While you can pipe Grzegorz Żur's answer to head to limit the number of lines shown, you can't assume a fixed number of lines in general.
Building on Grzegorz's answer, here's a generic solution that shows all most-frequently-occurring lines - however many there are - and only them:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'
If you don't want the output lines prefixed with the occurrence count:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1' |
sed 's/^ *[0-9]\{1,\} //'
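As a quick check against the question's sample input, the first pipeline prints the winning line together with its count, and this sed-filtered variant prints just:
string-string-123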
Explanation of Grzegorz Żur's answer:
uniq -c outputs the set of unique input lines prefixed with their respective occurrence count (-c), followed by a single space.
sort -n -r then sorts the resulting lines numerically (-n), in descending order (-r), so that the most frequently occurring line(s) are at the top.
Note that sort, if -k is not specified, will generally try to sort by the entire input line, but -n causes only the longest prefix that is recognized as an integer to be used for sorting, which is exactly what's needed here.
Explanation of my awk command:
NR==1 {prev=$1} stores the 1st whitespace-separated field ($1) in variable prev for the first input line (NR==1)
$1!=prev {exit} terminates processing, if the 1st whitespace-separated field is not the same as the previous line's - this means that a non-topmost line has been reached, and no more lines need printing.
1 is shorthand for { print } meaning that the input line at hand should be printed as is.
Explanation of my sed command:
^ *[0-9]\{1,\} matches the numeric prefix (denoting the occurrence count) of each output line, as (originally) produced by uniq -c
applying s/...// means that the prefix is replaced with an empty string, i.e., effectively removed.

Search for lines in a file that contain the lines of a second file

So I have a first file with an ID on each line, for example:
458-12-345
466-44-3-223
578-4-58-1
599-478
854-52658
955-12-32
Then I have a second file. It has an ID on each line, followed by information, for example:
111-2457-1 0.2545 0.5484 0.6914 0.4222
112-4844-487 0.7475 0.4749 0.1114 0.8413
115-44-48-5 0.4464 0.8894 0.1140 0.1044
....
The first file only has 1000 lines, with the IDs of the info I need, while the second file has more than 200,000 lines.
I used the following bash command in a fedora with good results:
cat file1.txt | while read line; do cat file2.txt | egrep "^$line\ "; done > file3.txt
However I'm now trying to replicate the results in Ubuntu, and the output is a blank file. Is there a reason for this not to work in Ubuntu?
Thanks!
You can grep for several strings at once:
grep -f id_file data_file
Assuming that id_file contains all the IDs and data_file contains the IDs and data.
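If matches should be limited to the first field, as in the original egrep "^$line " loop, one possible sketch (assuming bash for the process substitution, and IDs with no regex metacharacters) is:
grep -f <(sed 's/^/^/; s/$/ /' file1.txt) file2.txt > file3.txt
The sed turns each ID into an anchored pattern of the form ^ID followed by a space, so only first-column matches are reported.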
Typical job for awk:
awk 'FNR==NR{i[$1]=1;next} i[$1]{print}' file1 file2
This will print the lines from the second file that have an index in the first one. For even more speed, use mawk.
This line works fine for me in Ubuntu:
cat 1.txt | while read line; do cat 2.txt | grep "$line"; done
However, this may be slow, as the second file (200,000 lines) will be grepped 1000 times (once for each line of the first file).
