Identify pattern in text files in a single unix directory - linux

If I want to identify a pattern in filenames within a single Unix directory, which utility would be helpful (awk, for example)?
Input :
$ ls
a_20171007_001.txt
a_20171007_002.txt
b_20171007_001.txt
c_20180101_001.txt
Expected output:
a_20171007_002.txt
b_20171007_001.txt
The output should return the latest version of each file based on the filename, irrespective of file creation time.
The output shouldn't include future-dated files (e.g., if the current date is 20171008, then 20180101 shouldn't appear in the output).
Any suggestions on how to achieve this easily in Unix (awk or sed)?
Thanks a lot for all your solutions. Unfortunately, if the filename prefix doesn't follow a fixed pattern, they don't help.
E.g., input:
ab_bc_all_20171008_001.txt
bc_cd_ad_all_20171008_001.txt
ab_bc_all_20171008_002.txt
ad_dc_cd_ed_all_20180101_001.txt
ae_bc_zx_ed_ac_all_20170918_001.txt
Output:
bc_cd_ad_all_20171008_001.txt
ab_bc_all_20171008_002.txt
ae_bc_zx_ed_ac_all_20170918_001.txt
In the above case the only fixed pattern is that the date field appears after 'all'.
Can you please suggest an approach for this case?
Thanks in advance.

Something like this in Perl:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use Time::Piece;

my $today = localtime->ymd("");
my %latest;
for my $file (glob '*.txt') {
    my ($id, $date, $num) = split /[_.]/, $file;
    $latest{$id}{$date} = $num
        if $date <= $today
           && (! exists $latest{$id}
               || ! exists $latest{$id}{$date}
               || $num > $latest{$id}{$date});
}
for my $id (keys %latest) {
    for my $date (keys %{ $latest{$id} }) {
        say "$id\_$date\_$latest{$id}{$date}.txt";
    }
}

A simple awk solution (the date check is applied in both blocks so a future-dated file can never overwrite a stored one):
$ awk -F_ -v date="$(date +%Y%m%d)" '!($1 in file) && $2<=date {file[$1]=$0} ($1 in file) && $2<=date {if ($0 >= file[$1]) file[$1]=$0} END{for (i in file) print file[i]}' f1
a_20171007_002.txt
b_20171007_001.txt
Explanation:
Store the current date in the date variable in the format yyyymmdd.
While iterating through the records/filenames, if the date in the filename (i.e. $2) is less than or equal to the current date and the prefix (e.g. a, b) doesn't exist in the array file, store the record there, e.g. file["a"]=a_20171007_001.txt. Otherwise it isn't stored; in this example c_20180101_001.txt is rejected straight away.
For subsequent records, if the prefix ($1) already exists in the array file and the date is not in the future, check whether the whole record is lexicographically greater than the stored one. If yes, overwrite the entry in file.

Could you please try the following and let me know if it helps.
ls -1 *.txt | awk -v date=$(date +%Y%m%d) -F"_" 'prev != $1 && val && date_val<=date{print val} {prev=$1; val=$0; date_val=$2} END{if (date_val<=date) print val}'
Adding a more readable form of the solution too.
ls -1 *.txt | awk -v date=$(date +%Y%m%d) -F"_" '
prev != $1 && val && date_val<=date{
    print val
}
{
    prev=$1
    val=$0
    date_val=$2
}
END{
    if (date_val<=date){
        print val
    }
}'

GNU Awk solution for static filename format <prefix>_<date>_<version>.txt:
Example ls -1 output (extended):
a_20171007_001.txt
a_20171007_002.txt
b_20171007_001.txt
c_20180101_001.txt
a_20171007_0010.txt
b_20171007_004.txt
ls -1 | awk -F'[_.]' '{ k=$1"_"$2; if (a[k] < $3) a[k]=$3 }
END{
    for (i in a) {
        split(substr(i, index(i,"_")+1), b, "")
        ts = mktime(sprintf("%d %d %d 00 00 00", b[1]b[2]b[3]b[4], b[5]b[6], b[7]b[8]))
        if (systime() >= ts) print i"_"a[i]".txt"
    }
}'
The output:
b_20171007_004.txt
a_20171007_0010.txt

This one works in plain shell (dash):
d=$(date +%Y%m%d)
ls -1r *_*_*.txt | while IFS='_' read w x y
do
    [ "$x" -le "$d" ] && [ "$v" != "$w$x" ] && { echo "${w}_${x}_${y}"; v="$w$x"; }
done
The spec changed? Try this one:
d=$(date +%Y%m%d)
ls -1r *_*_*.txt | while read l
do
    b="${l%_*_*}"
    a="${l#$b*_}"
    c="${a%_*}"
    [ "$c" -le "$d" ] && [ "$v" != "$b$c" ] && { echo "$l"; v="$b$c"; }
done

$ ls -1r | awk -v today="$(date +%Y%m%d)" -F'_' '($2 <= today) && !seen[$1,$2]++'
b_20171007_001.txt
a_20171007_002.txt
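
For the updated spec, where the prefix can contain any number of underscore-separated words but the date is always the second-to-last field, here is a minimal awk sketch (my own addition, not from the answers above; it assumes version numbers are zero-padded to a fixed width within each prefix):
ls -1 *.txt | awk -F_ -v today="$(date +%Y%m%d)" '
{
    date = $(NF-1)                    # date is always the second-to-last "_" field
    # key = everything before the date field, so the prefix length does not matter
    key = substr($0, 1, length($0) - length($NF) - length(date) - 2)
    # keep the lexicographically greatest non-future filename per key
    if (date <= today && $0 > best[key]) best[key] = $0
}
END { for (k in best) print best[k] }'
With the second sample input and a current date of 20171008, this keeps ab_bc_all_20171008_002.txt, bc_cd_ad_all_20171008_001.txt and ae_bc_zx_ed_ac_all_20170918_001.txt, and drops the 20180101 file.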

Related

Filtering a list by 5 files per directory

So I have a list of files inside a tree of folders:
/home/user/Scripts/example/tmp/folder2/2
/home/user/Scripts/example/tmp/folder2/3
/home/user/Scripts/example/tmp/folder2/4
/home/user/Scripts/example/tmp/folder2/5
/home/user/Scripts/example/tmp/folder2/6
/home/user/Scripts/example/tmp/folder2/7
/home/user/Scripts/example/tmp/folder2/8
/home/user/Scripts/example/tmp/folder2/9
/home/user/Scripts/example/tmp/folder2/10
/home/user/Scripts/example/tmp/other_folder/files/1
/home/user/Scripts/example/tmp/other_folder/files/2
/home/user/Scripts/example/tmp/other_folder/files/3
/home/user/Scripts/example/tmp/other_folder/files/4
/home/user/Scripts/example/tmp/other_folder/files/5
/home/user/Scripts/example/tmp/other_folder/files/6
/home/user/Scripts/example/tmp/other_folder/files/7
/home/user/Scripts/example/tmp/other_folder/files/8
/home/user/Scripts/example/tmp/other_folder/files/9
/home/user/Scripts/example/tmp/other_folder/files/10
/home/user/Scripts/example/tmp/test/example/1
/home/user/Scripts/example/tmp/test/example/2
/home/user/Scripts/example/tmp/test/example/3
/home/user/Scripts/example/tmp/test/example/4
/home/user/Scripts/example/tmp/test/example/5
/home/user/Scripts/example/tmp/test/example/6
/home/user/Scripts/example/tmp/test/example/7
/home/user/Scripts/example/tmp/test/example/8
/home/user/Scripts/example/tmp/test/example/9
/home/user/Scripts/example/tmp/test/example/10
/home/user/Scripts/example/tmp/test/other/1
/home/user/Scripts/example/tmp/test/other/2
/home/user/Scripts/example/tmp/test/other/3
/home/user/Scripts/example/tmp/test/other/4
/home/user/Scripts/example/tmp/test/other/5
/home/user/Scripts/example/tmp/test/other/6
/home/user/Scripts/example/tmp/test/other/7
/home/user/Scripts/example/tmp/test/other/8
/home/user/Scripts/example/tmp/test/other/9
/home/user/Scripts/example/tmp/test/other/10
I basically want to filter this list so that I only have the five highest numbers for each directory.
Any ideas?
Preferably in bash/shell.
Expected output (small sample, because SO complains about too much code):
/home/user/Scripts/example/tmp/test/example/6
/home/user/Scripts/example/tmp/test/example/7
/home/user/Scripts/example/tmp/test/example/8
/home/user/Scripts/example/tmp/test/example/9
/home/user/Scripts/example/tmp/test/example/10
/home/user/Scripts/example/tmp/test/other/6
/home/user/Scripts/example/tmp/test/other/7
/home/user/Scripts/example/tmp/test/other/8
/home/user/Scripts/example/tmp/test/other/9
/home/user/Scripts/example/tmp/test/other/10
Thanks
Edit: using for i in $(for i in $(dirname $(find $(pwd) -type f -name "*[0-9]*" | sort -V) | uniq); do ls $i | sort -V | tail -n 5; done); do readlink -f $i; done works for a small sample size. However, expanding said sample makes the argument list too long for dirname.
Assuming your input data is sorted, try:
awk -F'/[^/]*$' '{if (NR==1 || prev_dir == $1) {i=i+1} else {i=1}; if ( i<=5){ prev_dir=$1 ; print $0}; }'
Explanation:
'/[^/]*$' <-- Set the field-separator regex so that the directory name is the first field.
if (NR==1 || prev_dir == $1) {i=i+1} else {i=1}; <-- Check whether the file is from the same directory as the previous record; if yes, increment the counter by 1, else reset it.
if ( i<=5){ prev_dir=$1 ; print $0}; <-- Print the first 5 records of the current directory. Since this keeps the first five per directory, feed it reverse-sorted input to get the highest five (see the sketch after the demo).
Demo:
$awk -F'/[^/]*$' '{if (NR==1 || prev_dir == $1) {i=i+1} else {i=1}; if ( i<=5){ prev_dir=$1 ; print $0 }; }' temp.txt
/home/user/Scripts/example/tmp/folder2/2
/home/user/Scripts/example/tmp/folder2/3
/home/user/Scripts/example/tmp/folder2/4
/home/user/Scripts/example/tmp/folder2/5
/home/user/Scripts/example/tmp/folder2/6
/home/user/Scripts/example/tmp/other_folder/files/1
/home/user/Scripts/example/tmp/other_folder/files/2
/home/user/Scripts/example/tmp/other_folder/files/3
/home/user/Scripts/example/tmp/other_folder/files/4
/home/user/Scripts/example/tmp/other_folder/files/5
$cat temp.txt
/home/user/Scripts/example/tmp/folder2/2
/home/user/Scripts/example/tmp/folder2/3
/home/user/Scripts/example/tmp/folder2/4
/home/user/Scripts/example/tmp/folder2/5
/home/user/Scripts/example/tmp/folder2/6
/home/user/Scripts/example/tmp/folder2/7
/home/user/Scripts/example/tmp/folder2/8
/home/user/Scripts/example/tmp/folder2/9
/home/user/Scripts/example/tmp/folder2/10
/home/user/Scripts/example/tmp/other_folder/files/1
/home/user/Scripts/example/tmp/other_folder/files/2
/home/user/Scripts/example/tmp/other_folder/files/3
/home/user/Scripts/example/tmp/other_folder/files/4
/home/user/Scripts/example/tmp/other_folder/files/5
/home/user/Scripts/example/tmp/other_folder/files/6
/home/user/Scripts/example/tmp/other_folder/files/7
/home/user/Scripts/example/tmp/other_folder/files/8
/home/user/Scripts/example/tmp/other_folder/files/9
/home/user/Scripts/example/tmp/other_folder/files/10
$
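Since the script keeps the first five records per directory, a reverse version sort upstream yields the highest five, as the bash answer below also suggests; a small sketch (assuming GNU sort for the -V option):
sort -rV temp.txt |
awk -F'/[^/]*$' '{if (NR==1 || prev_dir == $1) {i=i+1} else {i=1}; if (i<=5) {prev_dir=$1; print $0}}' |
tac    # optional: restore ascending order, matching the expected output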
Here is an implementation in plain bash:
#!/bin/bash
prevdir=
while read -r line; do
    dir=${line%/*}
    [[ $dir == "$prevdir" ]] || { n=0; prevdir=$dir; }
    ((n++ < 5)) && echo "$line"
done
You can use it like:
./script < file.list # If file.list already sorted by a reverse version sort
or,
sort -rV file.list | ./script # If the file.list is not sorted
or,
find /home/user/Scripts -type f | sort -rV | ./script
Also, you may want to append | tac to the pipelines above.

get user input in awk script and update it in file

I have a students.txt (RollNo, Name, IDU, CGPA). If the roll number exists, prompt the user to change the IDU and CGPA and update them in the file students.txt.
I made the following script:
#!/bin/bash
display(){
    awk -F ":" -v roll=$1 '{ if ($1 == roll) { name = $2; print name } }
        END { if (name == "") print "not found" }' students.txt
}
echo "Enter the roll no."
read rno
if [ $rno -ge 1000 ] && [ $rno -le 9999 ]
then
    display $rno
    # Now I have a valid $rno and want to update that line
else
    echo "Enter a number between 1000 and 9999"
fi
Now I need help taking user input for the IDU and CGPA values and updating students.txt with those values in the matching record.
In general "-" is used for standard input for awk e.g.
awk '{print($1)}' -
It's not clear to me exactly what you want here. Can't you use additional read statements in the bash part of the script to input the other two values?
First, grep for the roll number:
grep "^$rno:" students.txt
If found, use awk to replace the record (setting OFS up front, so the rewritten line keeps the colon separators even when the match is on the first line):
awk -F: -v OFS=: -v rno=$rno -v idu=$idu -v cgpa=$cgpa '$1==rno { $3=idu; $4=cgpa } { print }' students.txt > tmp.txt && mv tmp.txt students.txt
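Putting the pieces together, a minimal sketch of the whole flow (the prompt texts and the tmp.txt name are illustrative, not from the question; it assumes students.txt is colon-separated, as in the asker's script):
#!/bin/bash
read -p "Enter the roll no.: " rno
if grep -q "^$rno:" students.txt; then
    read -p "Enter new IDU: " idu
    read -p "Enter new CGPA: " cgpa
    awk -F: -v OFS=: -v rno="$rno" -v idu="$idu" -v cgpa="$cgpa" '
        $1 == rno { $3 = idu; $4 = cgpa }   # rewrite only the matching record
        { print }
    ' students.txt > tmp.txt && mv tmp.txt students.txt
else
    echo "Roll number $rno not found"
fi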

Bash command rev to reverse delimiters

I am working on a shell script that converts exported Microsoft in-addr.arpa.txt files to a more useful format so that I can use them in other products for automation purposes. Now I am facing a problem which I (not being a programmer) cannot solve in a simple way.
Sample script
x=123.223.224
echo $x | rev
gives me
422.322.321
but I want the output to be as follows:
224.223.123
Is there an easy way to do this without rev or without putting each group in a variable? Or is there a sample I can use? Or am I using the wrong tools?
Using awk:
x='123.223.224'
awk 'BEGIN{FS=OFS="."} {for (i=NF; i>=2; i--) printf $i OFS; print $1}' <<< "$x"
224.223.123
Use awk for this!
If your text file always contains three octets, simply use . as separator:
echo $x | awk -F. '{ print $3 "." $2 "." $1 }'
For more complex cases, use internal split():
echo $x | awk '{
    n = split($0, a, ".");
    for (i = n; i > 1; i--) {
        printf "%s.", a[i];
    }
    print a[1];
}'
In this sample, split() splits every line (passed as $0) using the delimiter ., saves the resulting array into a, and returns the length of that array (saved to n). Note that unlike C, split() array indexes start at one.
Or python:
python -c "print('.'.join(reversed('$x'.split('.'))))"
Here is my script.
#!/bin/sh
value=$1
delim=$2
total_fields=$(echo "$value" | tr -cd "$delim" | wc -c)
total_fields=$((total_fields + 1))
reverse_value=""
while [ $total_fields -gt 0 ]; do
    cur_value=$(echo "$value" | cut -d"$delim" -f$total_fields)
    if [ $total_fields -ne 1 ]; then
        cur_value="$cur_value$delim"
    fi
    reverse_value="$reverse_value$cur_value"
    total_fields=$((total_fields - 1))
done
echo "$reverse_value"
Using a few small tools.
tr '.' '\n' <<< "$x" | tac | paste -sd.
224.223.123
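If avoiding external tools entirely is preferred, here is a pure-bash sketch (my own addition, not from the answers above; it relies on bash arrays and <<<, so it won't run in plain sh):
#!/bin/bash
x='123.223.224'
IFS=. read -ra groups <<< "$x"    # split on "." into an array
out=""
for (( i = ${#groups[@]} - 1; i >= 0; i-- )); do
    out+="${groups[i]}"           # append the groups in reverse order
    (( i > 0 )) && out+="."
done
echo "$out"                       # prints 224.223.123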

bash: a more effective way to search strings in a file and extract part of strings

I have an input file as follows:
some lines with quote and :
AGE:23
some lines with quote and :
NAME:2,0,"My Name Is"
some lines with quote and :
Currently I use this code to extract information from the file:
age="$(cat "$file" | awk -F ':' '/AGE:/ { print $2 }')"
name="$(cat "$file" | awk -F '"' '/NAME:/ { print $2 }' )"
echo "age: $age"
echo "name: $name"
output:
age: 23
name: My Name Is
I'm searching for a better way to do this than running cat and awk twice. I have tried to do it in one cat/awk invocation but can't figure it out; is that not appropriate in this case? Can anyone point me to a better way, please?
Thanks in advance.
while IFS=: read key value; do
    case $key in
        AGE) age=$value;;
        NAME) name=$(awk -F'"' '{print $2}' <<< "$value");;
    esac
done < "$file"
I like @JohnKugelman's approach, but it can be improved: use colon and quote as the field separators:
while IFS=':"' read -ra fields; do
case ${fields[0]} in
AGE) age=${fields[1]} ;;
NAME) [[ ${fields[1]} == "2,0," ]] && name=${fields[2]} ;;
esac
done < file
With awk, I'd write:
read age name < <(
    awk -F '[:,]' '
        $1 == "AGE" {printf "%s ", $2}
        $1 == "NAME" && $2 == 2 && $3 == 0 {printf "%s ", $NF}
        END {print ""}
    ' filename
)
If the data is as simple as what you have shown in your question, there is no need to use shell for this; awk alone will be more than enough:
awk -F '"' '/AGE/{print tolower($0)}/NAME/{print "name:"$2}' input.txt

GROUP BY/SUM from shell

I have a large file containing data like this:
a 23
b 8
a 22
b 1
I want to be able to get this:
a 45
b 9
I can first sort this file and then do it in Python by scanning the file once. What is a good direct command-line way of doing this?
Edit: the modern (GNU/Linux) solution, as mentioned in comments years ago ;-):
awk '{
    arr[$1] += $2
}
END {
    for (key in arr) printf("%s\t%s\n", key, arr[key])
}' file \
| sort -k1,1
The originally posted solution, based on old Unix sort options:
awk '{
    arr[$1] += $2
}
END {
    for (key in arr) printf("%s\t%s\n", key, arr[key])
}' file \
| sort +0n -1
I hope this helps.
No need for awk here, or even sort -- if you have Bash 4.0, you can use associative arrays:
#!/bin/bash
declare -A values
while read key value; do
    values["$key"]=$(( $value + ${values[$key]:-0} ))
done
for key in "${!values[@]}"; do
    printf "%s %s\n" "$key" "${values[$key]}"
done
...or, if you sort the file first (which will be more memory-efficient; GNU sort is able to do tricks to sort files larger than memory, which a naive script -- whether in awk, python or shell -- typically won't), you can do this in a way that works in older versions (I expect the following to work as far back as bash 2.0):
#!/bin/bash
read cur_key cur_value
while read key value; do
    if [[ $key = "$cur_key" ]]; then
        cur_value=$(( cur_value + value ))
    else
        printf "%s %s\n" "$cur_key" "$cur_value"
        cur_key=$key
        cur_value=$value
    fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
This Perl one-liner seems to do the job:
perl -nle '($k, $v) = split; $s{$k} += $v; END {$, = " "; foreach $k (sort keys %s) {print $k, $s{$k}}}' inputfile
This can be easily achieved with the following one-liner:
cat /path/to/file | termsql "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"
Or:
termsql -i /path/to/file "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"
Here a Python package, termsql, is used, which is a wrapper around SQLite. Note that currently it's not uploaded to PyPI and can only be installed system-wide (setup.py is a little broken), like:
pip install --user https://github.com/tobimensch/termsql/archive/master.zip
Update: in 2020, version 1.0 was finally uploaded to PyPI, so pip install --user termsql can be used.
One way using perl:
perl -ane '
    next unless @F == 2;
    $h{ $F[0] } += $F[1];
    END {
        printf qq[%s %d\n], $_, $h{ $_ } for sort keys %h;
    }
' infile
Content of infile:
a 23
b 8
a 22
b 1
Output:
a 45
b 9
With GNU awk (versions less than 4):
WHINY_USERS= awk 'END {
    for (E in a)
        print E, a[E]
}
{ a[$1] += $2 }' infile
With GNU awk >= 4:
awk 'END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (E in a)
        print E, a[E]
}
{ a[$1] += $2 }' infile
With a sort + awk combination one could try the following, without creating an array:
sort -k1 Input_file |
awk '
prev != $1 && prev{
    print prev, (prevSum ? prevSum : "N/A")
    prev=prevSum=""
}
{
    prev=$1
    prevSum+=$2
}
END{
    if (prev){
        print prev, (prevSum ? prevSum : "N/A")
    }
}'
Explanation: adding a detailed explanation of the above.
sort -k1 Input_file |      ##Sort Input_file by the 1st field and send the output to awk as input.
awk '                      ##Start the awk program here.
prev != $1 && prev{        ##If prev is not equal to the first field and prev is not null:
    print prev, (prevSum ? prevSum : "N/A")   ##print prev and prevSum (if prevSum is null, print N/A),
    prev=prevSum=""        ##then nullify prev and prevSum.
}
{
    prev=$1                ##Assign the 1st field to prev.
    prevSum+=$2            ##Add the 2nd field to prevSum.
}
END{                       ##Start the END block of the awk program.
    if (prev){             ##If prev is not null:
        print prev, (prevSum ? prevSum : "N/A")   ##print prev and prevSum (if prevSum is null, print N/A).
    }
}'
