Generate record of lines which have been removed by grep as a secondary function of primary command - linux

I asked a question here about removing unwanted lines that contained strings matching a particular pattern:
Remove lines containing string followed by x number of numbers
anubhava provided a good line of code which met my needs perfectly. This code removes any line which contains the string vol followed by a space and three or more consecutive numbers:
grep -Ev '\bvol([[:blank:]]+[[:digit:]]+){2}' file > newfile
The command will be used on a fairly large csv file and be initiated by crontab. For this reason, I would like to keep a record of the lines this command is removing, just so I can go back and check that the correct data is being removed - I guess it will be some sort of log containing the lines that did not make the final cut. How can I add this functionality?

Drop grep and use awk instead:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file
The above uses GNU awk for word delimiters (\<) and will append every deleted line to a file named "deleted". Consider adding a timestamp too:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file
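Since the kept lines still go to stdout, a cron-friendly invocation would redirect them just as the original grep did (a sketch; "newfile" is the same placeholder name as in the question):
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file > newfile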

Related

How to get first word of every line and pipe it into dmenu script

I have a text file like this:
first state
second state
third state
Getting the first word from every line isn't difficult, but the problem comes when adding the extra \n required to separate every word (selection) in dmenu, per its syntax:
echo -e "first\nsecond\nthird" | dmenu
I haven't been able to figure out how to add the separating \n. I've tried this:
state=$(awk '{for(i=1;i<=NF;i+=2)print $(i)'\n'}' text.txt)
But it doesn't work. I also tried this:
lol=$(grep -o "^\S*" states.txt | perl -ne 'print "$_"')
But same deal. Not sure what I'm doing wrong.
Your problem is in the AWK script. You need to identify each input line as a record. This way, you can control how each record in the output is separated via the ORS variable (output record separator). By default this separator is the newline, which should be good enough for your purpose.
Now to print the first word of every input record (each line in the input stream in this case), you just need to print the first field:
awk '{print $1}' textfile | dmenu
If you need the output to include the explicit \n string (not the control character), then you can just overwrite the ORS variable to fit your needs:
awk 'BEGIN{ORS="\\n"}{print $1}' textfile | dmenu
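For the sample file, that produces the following (all on one physical line, since every real newline, including the final one, becomes a literal \n):
$ awk 'BEGIN{ORS="\\n"}{print $1}' textfile
first\nsecond\nthird\n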
This could also be done with a simple while loop; could you please try the following. It is simple: while reads the file line by line, creating 2 variables per line, the 1st named first and the other rest; first holds the first field, which we print and pipe to dmenu (dmenu reads its menu entries from stdin rather than from its arguments).
while read -r first rest
do
printf '%s\n' "$first"
done < "Input_file" | dmenu
Based on the text file example, the following should achieve what you require:
awk '{ printf "%s\\n",$1 }' textfile | dmenu
Print the first space separated field of each line along with \n (\n needs to be escaped to stop it being interpreted by awk)
In your code
state=$(awk '{for(i=1;i<=NF;i+=2)print $(i)'\n'}' text.txt)
you attempted to use '\n' inside your awk code; however, the code is what lies between the opening ' and the first following ', so the program awk actually receives is {for(i=1;i<=NF;i+=2)print $(i), and that does not work. You should use " for strings inside awk code.
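A minimal corrected sketch: since awk's print already ends each record with a real newline, which is the separator dmenu expects on stdin, you can drop both the loop and the stray \n:
state=$(awk '{print $1}' text.txt)
echo "$state" | dmenu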
If you merely want to get the nth column, cut will be enough in most cases. Let the states.txt content be
first state
second state
third state
then you can do:
cut -d ' ' -f 1 states.txt | dmenu
Explanation: treat space as delimiter (-d ' ') and get 1st column (-f 1)
(tested in cut (GNU coreutils) 8.30)

Split flat file and add delimiter in Linux

I would like to know how to improve some code that I have.
My shell script reads a flat file and splits it into two files, header and detail, based on the first char of each line. For header the first char is 1 and for detail it is 2. The split files do not include the first char.
Header is delimited by "|", and detail is fixed-width, so I add the delimiter to it later.
What I want is to do this in one single awk, to avoid creating a tmp file.
For splitting the file I use one awk command, and for adding the delimiter another awk command.
This is what I have now:
Input=Input.txt
Header=Header.txt
DetailTmp=DetailTmp.txt
Detail=Detail.txt
#First I split in two files and remove first char
awk -v vFileHeader="$Header" -v vFileDetail="$DetailTmp" '/^1/ {f=vFileHeader} /^2/ {f=vFileDetail} {sub(/^./,""); print > f}' $Input
#Then, I add the delimiter to detail
awk '{OFS="|"};{print substr($1,1,10),substr($1,11,5),substr($1,16,2),substr($1,18,14),substr($1,32,4),substr($1,36,18),substr($1,54,1)}' $DetailTmp > $Detail
Any suggestions?
Input.txt file
120190301|0170117174|FRANK|DURAND|USA
2017011717400052082911070900000000000000000000091430200
120190301|0170117204|ERICK|SMITH|USA
2017011720400052082911070900000000000000000000056311910
Header.txt after splitting
20190301|0170117174|FRANK|DURAND|USA
20190301|0170117204|ERICK|SMITH|USA
DetailTmp.txt after splitting
017011717400052082911070900000000000000000000091430200
017011720400052082911070900000000000000000000056311910
017011727100052052911070900000000000000000000008250000
017011718200052082911070900000000000000000000008102500
017011726300052052911070900000000000000000000008250000
Detail.txt desired
0170117174|00052|08|29110709000000|0000|000000000009143020|0
0170117204|00052|08|29110709000000|0000|000000000005631191|0
0170117271|00052|05|29110709000000|0000|000000000000825000|0
0170117182|00052|08|29110709000000|0000|000000000000810250|0
0170117263|00052|05|29110709000000|0000|000000000000825000|0
Just combine the scripts:
$ awk -v OFS='|' '/^1/{print substr($0,2) > "header"}
/^2/{print substr($0,2,10),substr($0,11,5),... > "detail"}' file
however, you may be better off using FIELDWIDTHS for the detail records instead of all the substr calls, as sketched below.
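For example, with GNU awk's FIELDWIDTHS (a sketch; the widths are taken from the substr calls in the question, plus a leading width of 1 for the record-type character):
gawk -v FIELDWIDTHS='1 10 5 2 14 4 18 1' -v OFS='|' '
  /^1/ { print substr($0,2) > "Header.txt" }            # header: just drop the type char
  /^2/ { print $2,$3,$4,$5,$6,$7,$8 > "Detail.txt" }    # detail: re-join the fixed-width fields with |
' Input.txt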

Find a line and modify it in a csv file given an input

I have a csv file with a list of workers and I want to make a script to modify their work group given their IDs. Lines in the CSV file look like this:
Before:
ID TAG GROUP
niub16677500;B00;AB0
After:
ID TAG GROUP
niub16677500;B00;BC0
How can I do this?
I've been working with awk and sed commands but haven't managed to get anything working yet.
With awk:
awk -F';' -v OFS=';' -v id="niub16677500" -v new_group="BC0" '{if($1==id)$3=new_group}1' input.csv
ID;TAG;GROUP
niub16677500;B00;BC0
Redirect the output to a file (or do the change in place; see the example after the explanations), and note that the csv header should use the same field separator as the body.
Explanations:
-F';' to have input field separator as ;
-v OFS=';' same for the output FS
-v id="niub16677500" -v new_group="BC0" define the variables that you are going to use in the awk commands
'{if($1==id)$3=new_group}1' when the first column is equal to the value contained in variable id, overwrite the 3rd field; the trailing 1 prints every line
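If you want input.csv itself updated, a common pattern is to write to a temporary file and move it back into place (a sketch; "tmp" is a placeholder name):
awk -F';' -v OFS=';' -v id="niub16677500" -v new_group="BC0" '{if($1==id)$3=new_group}1' input.csv > tmp && mv tmp input.csv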
With sed:
id="niub16677500"; new_group="BC0"; sed "/^$id/s/;[^;]*$/;$new_group/" input.csv
ID;TAG;GROUP
niub16677500;B00;BC0
You can either do an in-place change using the -i.bak option (see the example after the explanations), or redirect the output to a file.
Explanations:
Store the values in 2 variables
/^$id/ when you reach a line that starts with the ID stored in the variable id, run the sed search and replace
s/;[^;]*$/;$new_group/ search and replace command that will replace the last field by the new value
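For example, the in-place variant with a backup (GNU and BSD sed both accept -i.bak; the original file is kept as input.csv.bak):
id="niub16677500"; new_group="BC0"; sed -i.bak "/^$id/s/;[^;]*$/;$new_group/" input.csv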
Sed can do it,
echo 'niub16677500;B00;AB0' | sed 's/\(^niub16677500;...;\)\(...\)$/\1BC0/'
will replace the AB0 group in your example with BC0, by matching the user name, a semicolon, any 3 characters and another semicolon, and then matching the remaining 3 characters. As output it repeats the first match with \1 and appends BC0.
You can use :
sed 's/\(^niub16677500;...;\)\(...\)$/\1BC0/' <old_file >new_file
to make a new_file with this change.
https://www.grymoire.com/Unix/Sed.html is a great resource, you should take a look at it.

Bash: Read in file, edit line, output to new file

I am new to linux and new to scripting. I am working in a linux environment using bash. I need to do the following things:
1. read a txt file line by line
2. delete the first line
3. remove the middle part of each line after the first
4. copy the changes to a new txt file
Each line after the first has three sections, the first always ends in .pdf and the third always begins with R0 but the middle section has no consistency.
Example of 2 lines in the file:
R01234567_High Transcript_01234567.pdf High School Transcript R01234567
R01891023_Application_01891023127.pdf Application R01891023
Here is what I have so far. I'm just reading the file, printing it to screen and copying it to another file.
#! /bin/bash
cd /usr/local/bin;
#echo "list of files:";
#ls;
for index in *.txt;
do echo "file: ${index}";
echo "reading..."
exec<${index}
value=0
while read line
do
#value='expr ${value} +1';
echo ${line};
done
echo "read done for ${index}";
cp ${index} /usr/local/bin/test2;
echo "file ${index} moved to test2";
done
So my question is, how can I delete the middle bit of each line, after .pdf but before the R0...?
Using sed:
sed 's/^\(.*\.pdf\).*\(R0.*\)$/\1 \2/g' file.txt
This will remove everything between .pdf and R0 and replace it with single space.
Result for your example:
R01234567_High Transcript_01234567.pdf R01234567
R01891023_Application_01891023127.pdf R01891023
The Hard, Unreliable Way
It's a bit verbose, and much less efficient than what would make sense if we knew that the fields were separated by tab literals, but the following loop does this processing in pure native bash with no external tools:
shopt -s extglob
while IFS= read -r line; do
[[ $line = *".pdf"*R0* ]] || continue # ignore lines that don't fit our format
filename=${line%%.pdf*}.pdf
id=R0${line##*R0}
printf '%s\t%s\n' "$filename" "$id"
done <infile >outfile
${line%%.pdf*} returns everything before the first .pdf in the line; ${line%%.pdf*}.pdf then appends .pdf to that content.
Similarly, ${line##*R0} expands to everything after the last R0; R0${line##*R0} thus expands to the final field starting with R0 (presuming that that's the only instance of that string in the file).
The Easy Way (Using Tab Delimiters)
If cat -t file (on MacOS) or cat -A file (on Linux) shows ^I sequences between the fields (but not within the fields), use the following instead:
while IFS=$'\t' read -r filename title id; do
printf '%s\t%s\n' "$filename" "$id"
done <infile >outfile
This reads the three tab separated fields into variables named filename, title and id, and emits the filename and id fields.
Updated answer assuming tab delim
Since there is a tab delimiter, this is a cinch for awk. Borrowing from my originally deleted answer and @geek1011's deleted answer:
awk -F"\t" '{print $1, $NF}' infile.txt
Here awk splits each record in your file by tab, then prints the first field $1 and the last field $NF where NF is the built in awk variable for the record's Number of Fields; by prepending a dollar sign, it says "The value of the last field in the record".
Original answer assuming space delimiter
Leaving this here in case someone has space delimited nonsense like I originally assumed.
You can use awk instead of using bash to read through the file:
awk 'NR>1{firstRec=""; for(i=1; i<=NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt
awk reads files line by line and processes each record it comes across. Fields are delimited automatically by white space. The first field is $1, the second is $2 and so on. awk has built in variables; here we use NF which is the Number of Fields contained in the record, and NR which is the record number currently being processed.
This script does the following:
If the record number is greater than 1 (not the header) then
Loop through each field (separated by white space here) until we find a field that has "pdf" in it ($i!~/pdf/). Store everything we find up until that field in a variable called firstRec separated by a space (firstRec=firstRec" "$i).
print out the firstRec, then print out whatever field we stopped iterating on (the one that contains "pdf") which is $i, and finally print out the last field in the record, which is $NF (print firstRec,$i,$NF)
You can direct this to another file:
awk 'NR>1{firstRec=""; for(i=1; i<=NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt > outfile.txt
sed may be a cleaner way to go here since, if any part of your line has more than one space between words, awk's default whitespace splitting will lose the multiple spaces.
You can use sed on each line like that:
line="R01234567_High Transcript_01234567.pdf High School Transcript R01234567"
echo "$line" | sed 's/\.pdf.*R0/\.pdf R0/'
# output
R01234567_High Transcript_01234567.pdf R01234567
This replaces everything between .pdf and R0 with a single space.
It doesn't deal with some edge cases, but it's simple and clear.

How to do something like grep -B to select only one line?

Everything is in the title. Basically, let's say I have this input:
some text lalala
another line
much funny wow grep
I grep funny and I want my output to be "lalala"
Thank you
One possible answer is to use either ed or ex to do this (it is trivial in them):
ed - yourfile <<< 'g/funny/.-2p'
(Or replace ed with ex. You might have red, the restricted editor, too; it can't modify files.) This looks for the pattern /funny/ globally, and whenever it is found, prints the line 2 before the matching line (that's the .-2p part). Or, if you want the most recent line containing 'lalala' before the line matching 'funny':
ed - yourfile <<< 'g/funny/?lalala?p'
The only problem is if you're trying to process standard input rather than a file; then you have to save the standard input to a file and process that file, which spoils the concurrency.
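For the three-line sample at the top of the question, that looks like this ("yourfile" is a placeholder name):
$ printf '%s\n' 'some text lalala' 'another line' 'much funny wow grep' > yourfile
$ ed - yourfile <<< 'g/funny/.-2p'
some text lalala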
You can't do negative offsets in sed. GNU sed allows positive offsets, so you could use sed -n '/lalala/,+2p' file to print from the 'lalala' line through the 'funny' line based on finding 'lalala' (which isn't quite what you want), but you cannot find the 'lalala' lines based on finding 'funny'. Standard sed does not allow offsets at all.
If you need to print just the IP address found on a line 8 lines before the pattern-matching line, you need a slightly more involved ed script, but it is still doable:
ed - yourfile <<< 'g/funny/.-8s/.* //p'
This uses the same basic mechanism to find the right line, then runs a substitute command to remove everything up to the last space on the line and print the modified version. Since there isn't a w command, it doesn't actually modify the file.
Since grep -B N always prints the full N lines before each match, you'll have to pipe the output into something like grep or Awk to pick out just the one line you want.
grep -B 2 "funny" file|awk 'NR==1{print $NF; exit}'
You could also just use Awk.
awk -v s="funny" '/[[:space:]]lalala$/{n=NR+2; o=$NF}NR==n && $0~s{print o}' file
For the specific example of an IP address 8 lines before the match as mentioned in your comment:
awk -v s="funny" '
/[[:space:]][0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ {
n=NR+8
ip=$NF
}
NR==n && $0~s {
print ip
}' file
These Awk solutions first find the output field you might want, then print the output only if the word you want exists in the nth following line.
Here's an attempt at a slightly generalized Awk solution. It maintains a circular queue of the last q lines and prints the line at the head of the queue when it sees a match.
#!/bin/sh
: ${q=8}
e=$1
shift
awk -v q="$q" -v e="$e" '{ m[(NR%q)+1] = $0 }
$0 ~ e { print m[((NR+1)%q)+1] }' "${@--}"
Adapting it to a different default (I set it to 8), adding proper option handling (currently, you'd run it like q=3 ./qgrep regex file), or printing just a single field (such as the IP address) rather than the entire line should all be easy enough.
(I also didn't bother to make it work correctly if you see a match in the first q-1 lines. It will just print an empty line then.)
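Run against the sample input from the question, with the offset the answer suggests (and assuming the script above is saved as qgrep and made executable):
$ q=3 ./qgrep funny yourfile
some text lalala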
