I need to check the '|' delimiter count for each line in a text file. For that I used an awk command and stored the count output in a temp file.
It generates the delimiter count for each row, and the script finishes successfully with exit code 0.
But for one of the lines it shows an arithmetic syntax error. Could someone tell me how to resolve this?
I have provided the sample file data, the script, and the script output below. What is causing the arithmetic syntax error?
Text file sample data (each row in the sample below has 5 '|' delimiters):
Name|Address|phone|pincode|location|
xyz|usa|123|111|NY|
abc|uk|123|222|LON|
pqr|asia|123|333|IND|
Script:
Standard_col_cnt="5"
cd /$SRC_FILE_PATH
touch temp.txt
col_cnt=`awk -F"|" '{print NF}' $SRC_FILE_PATH/temp.txt` >>$Logfile 2>&1
while read line
do
i=1
echo $line >/temp.txt
if [ "$col_cnt" -ne "$Standard_col_cnt" ]
then
echo "No of columns are not equal to the standard value in Line no - $i:" >>$Logfile
exit 1
fi
i=`expr $i + 1`
done < $File_name
The awk command generates the output below in the temp file:
5
5
5
5
--------- Script output -----------
script.sh[59]: [: |xyz|usa|123|111|NY|
: arithmetic syntax error
+ expr 1 + 1
+ i=2
+ read line
+ i=1
+ echo 'xyz|usa|123|111|NY|\r'
+ script.sh[48]: /temp.txt: cannot create [Permission denied]
+ 'abc|uk|123|222|LON|\r' -ne 91 ]
script.sh[59]: [: pqr|asia|123|333|IND|: arithmetic syntax error
Your current script resets i to 1 every time a line is read, so the line counter never advances.
It is also unclear how your awk code is supposed to write to the temp file: the file has only just been created, and it is then used to build the col_cnt variable while still empty! A corrected version of the loop is sketched below.
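As a minimal sketch (assuming $File_name and $Logfile are set as in your script, the standard is 5 pipes per line, and the carriage returns visible in your trace need stripping), the loop can count the pipes of each line as it reads it:
Standard_col_cnt=5
i=1                                                               # initialise the counter once, outside the loop
while IFS= read -r line
do
    line=$(printf '%s' "$line" | tr -d '\r')                      # drop the DOS carriage return seen in the trace
    col_cnt=$(printf '%s\n' "$line" | awk -F'|' '{print NF-1}')   # count the pipes on this line only
    if [ "$col_cnt" -ne "$Standard_col_cnt" ]
    then
        echo "No of columns are not equal to the standard value in Line no - $i" >>"$Logfile"
        exit 1
    fi
    i=`expr $i + 1`
done < "$File_name"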
If you want to check the condition that there are 5 '|' pipe delimiters per line, you could do so with just awk.
Sample Data
$ cat test
Name|Address|phone|pincode|location|
xyz|usa|123|111|NY|
abc|uk|123222|LON|
pqr|asia|123|333|IND|
$ export logfile=check.log   # any writable path will do
$ cat script.awk
BEGIN {
FS="|"
Standard_col_cnt=5
logfile=ENVIRON["logfile"]
} {
if (NF-1 != Standard_col_cnt) print "No of columns are not equal to the standard value in Line no - "NR > logfile
}
$ awk -f script.awk test
$ cat "$logfile"
No of columns are not equal to the standard value in Line no - 3
col_cnt=5
grep -o -n '[|]' input_file |awk '{print $1}' | uniq -c| \
awk -v d="$col_cnt" '$1!=d {print "No of columns are not equal to the standard value in Line no - "NR}'
No of columns are not equal to the standard value in Line no - 3
Another approach:
count=5
string="No of columns are not equal to the standard value in Line no -"
grep -o -n '[|]' input_file|cut -d: -f 1| uniq -c|sed "s/^ *//;"| sed "/^[${count} ]/d"|sed "s/^[^${count} ]/${string}/"
No of columns are not equal to the standard value in Line no - 3
Related
I am writing a function in a BASH shell script that should return the lines of CSV files (with headers) that contain more commas than the header line. This can happen because some values inside these files may contain commas. For quality control, I must identify these lines to later clean them up. What I have currently:
#!/bin/bash
get_bad_lines () {
local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $1 | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
echo "File: $1_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
done
}
get_bad_lines products
get_bad_lines users
The output of this program is currently all the comma counts with all of the line numbers in all the files,
and I suspect this is because the input $1 (the folder name, i.e. products and users) conflicts with the reference to $1 in the awk call (where I want to grab the first column, the count of commas for that line in the current file in the loop).
Is this the issue? And if so, could it be solved by referencing the first column and the folder name with different variable names instead of both of them using $1?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(should only show lines for that file having more than 5 commas).
I also tried declaring a variable in the call to awk (as in the accepted answer to Awk field variable clash with function argument), with the same effect:
get_bad_lines () {
local table_name=$1
local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $table_name | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
echo "File: ${table_name}_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
done
}
You can do the whole job in awk to achieve that:
get_bad_lines () {
find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
awk -v table_name="$1" '
NR==1 { num_comma=gsub(/,/, ""); }
/,/ { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0;}
END { if (wrong > 0) {
print(FILENAME" has over "num_comma" commas in the following lines:");
for (i=0;i<wrong;i++) { print(wrong_array[i]); }
}
}' "${my_file}"
done
}
As for why your original awk command failed to print only the lines with too many commas: you are using the shell variable correct_no_of_commas inside a single-quoted awk statement ('$1 > $correct_no_of_commas {print}'). The shell performs no substitution inside single quotes, so awk reads $correct_no_of_commas as is and treats correct_no_of_commas as an awk variable. That variable is undefined in the awk script, so it is an empty string, and awk effectively evaluates $1 > $"", where $"" is equivalent to $0. The count in $1 is therefore compared with the full input line; because the uniq -c output starts with whitespace, $0 is not a numeric string, so the comparison falls back to a string comparison, which is true for every line. Thus $1 > $correct_no_of_commas always matches.
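A minimal sketch of the fix is to pass the count into awk explicitly with -v, keeping the rest of the original pipeline:
# pass the shell variable to awk with -v instead of relying on quote expansion
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c |
    awk -v max="$correct_no_of_commas" '$1 > max {print}'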
You can identify all the bad lines with a single awk command (ENDFILE requires GNU awk):
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you want the line number also to be printed, use this
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv
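ENDFILE is a GNU awk extension; if gawk is not available, a roughly equivalent sketch prints the separator when a new file starts instead of when one ends:
# portable variant: print the separator at the start of every file after the first
awk -F, 'FNR==1{if (NR>1) print "#######\n"; print FILENAME; headerCount=NF} NF>headerCount' /path/here/*.csv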
I have the below requirement. I am running the condition in a loop and it is taking a long time. Is there a one-shot command that will process a 70 MB file without taking so long?
Requirement:
If a #pRECTYPE="SBSB" line contains a #pSBEL_MCTR_RSN="XXX" tag, then we need to copy that tag and append it to the end of the next #pRECTYPE="SBEL" record.
File (note: in the file there will be no blank lines; I have inserted line breaks here only to avoid line continuation):
#pRUKE=dfgt#pRECTYPE="SMDR", #pCONFIG="Y" XXXXXXX
#pRUKE=dfgt#pRECTYPE="SBSB", #pGWID="1234", #pSBEL_MCTR_RSN="KX28", #pSBSB_9000_COLL=""
#pRUKE=dfgt#pRECTYPE="KBSG", #pKBSG_UPDATE_CD="IN", XXXXXXXXXXX
#pRUKE=dfgt#pRECTYPE="SBEL", #pSBEL_EFF_DT="01/01/2017", #pCSPI_ID="JKOX0001", #pSBEL_FI="A"
#pRUKE=dfgt#pRECTYPE="SBEK", #pSBEK_UPDATE_CD="IN",XXXXXXXXXXXXXXXXXXX
#pRUKE=dfgt#pRECTYPE="DBCS", #pDBCS_UPDATE_CD="IN",XXXXXXXXXXXXXXXXXXXXXXXXXX
#pRUKE=dfgt#pRECTYPE="MEME", #pMEME_REL="18", #pMEEL_MCTR_RSN="KX28"
#pRUKE=dfgt#pRECTYPE="ATT0", #pATT0_UPDATE_CD="AP",XXXXXXXXX
#pRUKE=dfgt#pRECTYPE="SBSB", #pGWID="1234", #pSBEL_MCTR_RSN="KX28", #pSBSB_9000_COLL=""
#pRUKE=dfgt#pRECTYPE="KBSG", #pKBSG_UPDATE_CD="IN", XXXXXXXXXXX
example :
Before :
#pRUKE=dfgt#pRECTYPE="SMDR", #pCONFIG="Y" XXXXXXX
#pRUKE=dfgt#pRECTYPE="SBSB", #pGWID="1234", #pSBEL_MCTR_RSN="KX28", #pSBSB_9000_COLL=""
#pRUKE=dfgt#pRECTYPE="KBSG", #pKBSG_UPDATE_CD="IN", XXXXXXXXXXX
#pRUKE=dfgt#pRECTYPE="SBEL", #pSBEL_EFF_DT="01/01/2017", #pCSPI_ID="JKOX0001", #pSBEL_FI="A"
After:
#pRUKE=dfgt#pRECTYPE="SMDR", #pCONFIG="Y" XXXXXXX
#pRUKE=dfgt#pRECTYPE="SBSB", #pGWID="1234", #pSBEL_MCTR_RSN="KX28", #pSBSB_9000_COLL=""
#pRUKE=dfgt#pRECTYPE="KBSG", #pKBSG_UPDATE_CD="IN", XXXXXXXXXXX
#pRUKE=dfgt#pRECTYPE="SBEL", #pSBEL_EFF_DT="01/01/2017", #pCSPI_ID="JKOX0001", #pSBEL_FI="A", #pSBEL_MCTR_RSN="KX28"
After SBSB, if there is no SBEL, then that SBSB can be ignored.
What I did is:
egrep -n "pRECTYPE=\"SBSB\"|pRECTYPE=\"SBEL\"" filename | sed '$!N;/pRECTYPE=\"SBEL\"/P;D' | awk -F\: '{print $1}' | awk 'NR%2{printf "%s,",$0;next;}1' > 4.txt;
by this I will get the line-number pairs, e.g.:
2,4
17,19
Lines 9, 12 and 14 will be ignored.
while read line
do
echo "$line";
SBSB=`echo "$line" | awk -F, '{print $1}'`;
SBEL=`echo "$line" | awk -F, '{print $2}'`;
echo $SBSB;
echo $SBEL;
SBSB_Fetch=`sed -n "$SBSB p" $fil | grep -Eo '(#pSBEL_MCTR_RSN)=[^ ]+' | sed 's/,$//' | sed 's/^/, /g'`;
echo $SBSB_Fetch;
if [[ "$SBSB_Fetch" == "" ]];then
echo "blank";
s=blank;
else
echo "value";
sed -i "${SBEL}s/.*/&${SBSB_Fetch}/" $fil;
fi
done < 4.txt;
Since I am reading and updating each line, it is taking a long time. Is there any way to reduce the run time?
For a 70 MB file it currently takes 4.5 hours.
For performance, you need to really limit how many external tools you invoke inside a loop in a shell script.
This requires GNU awk:
gawk '
/#pRECTYPE="SBSB"/ {match($0, /#pSBEL_MCTR_RSN="[^"]*"/, m)}
/#pRECTYPE="SBEL"/ && isarray(m) {$0 = $0 ", " m[0]; delete m}
1
' file
This should be pretty quick:
only invoking one external command
no shell loops
only have to read the input file once.
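If the result needs to replace the original file (as the sed -i in the question does), one way is to redirect to a temporary file and move it back; the file names below are illustrative:
# write the transformed stream to a new file, then swap it in
gawk '
/#pRECTYPE="SBSB"/ {match($0, /#pSBEL_MCTR_RSN="[^"]*"/, m)}
/#pRECTYPE="SBEL"/ && isarray(m) {$0 = $0 ", " m[0]; delete m}
1
' file > file.new && mv file.new file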
Determine the row number of a cell that is in a specific column and has specific content.
Remarks:
The heading of a column counts as a row.
An empty field in a column counts as a row.
The fields of the csv are separated by commas.
Given:
The follow csv file are given:
file.csv
col_o2g,col_dgjdhu,col_of_interest,,
1234567890,tg75fjksfh,$kj56hahb,,
dsewsf,1234567890,,,
khhhdg,5gfj578fj,1234567890,,
,57ijf6ehg,46h%sgf,,
ubthfgfv,zts576fufj,256hf%(",,
Given variables:
# col variable
col=col_of_interest
# variable with the value of the field of interest
value_of_interest=1234567890
# output variable
# that's the part I am looking for
wanted_line_number=
What I have:
LINE_CNT=$(awk '-F[\t ]*,[\t ]*' -vcol=${col} '
FNR==1 {
for(i=1; i<=NF; ++i) {
if($i == col) {
col = i;
break;
}
}
if(i>NF) {
exit 1;
}
}
FNR>1 {
if($col) maxc=FNR;
}
END{
print maxc;
}' file.csv)
echo line count of lines from column $col
echo "$LINE_CNT"
Wanted output:
echo "The wanted line number are:"
echo $wanted_line_number
output:4
I have been trying to decipher your question, so let me know whether I got it right. I guess in your case you don't know how many columns are present in the csv file, and you also don't know whether the first line is a header or not.
For the second remark I have no automatic solution, so you need to indicate whether line 1 is a header or not via an input parameter.
Let me show you with a test case.
]$ more test.csv
col_1,col_2,col_3,col_4
1234567890,tg75fjksfh,kj56hahb,dkdkdkd
dsewsf,1234567890,,dkdkdk
khhhdg,5gfj578fj,1234567890,akdkdkd
ubthfgfv,zts576fufj,256hf,,
Then you want to know the position of the column of interest in your csv and also the line where the value of interest is located. Here is my example script (which can be improved). Keep in mind that I hardcoded the test.csv file name into the script.
$ cat check_csv.sh
column_of_interest=$1
value_of_interest=$2
with_header=$3
# check which column is the one
if [[ $with_header = "Y" ]];
then
num_cols=$(cat test.csv | awk --field-separator="," "{ print NF }" | head -n 1)
echo "csv contains $num_cols columns"
to_rows=$(cat test.csv | head -n 1 | tr ',' '\n')
iteration=0
for i in $(cat test.csv | head -n 1 | tr ',' '\n')
do
iteration=$(expr $iteration + 1)
counter=$(echo $i | egrep -i "$column_of_interest" | wc -l)
#echo $i
#echo $counter
if [ $counter -eq 1 ]
then
echo "Column of interest $i is located on number $iteration"
export my_col_is=$iteration;
fi
done
# find the line that contains the value of interest
iteration=0
while IFS= read -r line
do
iteration=$(expr $iteration + 1 )
if [[ $iteration -gt 1 ]];
then
#echo $line
is_there=$(echo $line | awk -v temp=$my_col_is -F ',' '{print $temp}' | egrep -i "$value_of_interest"| wc -l)
#echo $is_there
if [ $is_there -gt 0 ];
then
echo "Value of interest $value_of_interest is present on line $iteration"
fi
fi
done < test.csv
fi
Running the example where I want to know the position of column col_2 and the lines where the value 1234567890 appears in that column. I use an option to indicate that the file has a header:
$ more test.csv
col_1,col_2,col_3,col_4
1234567890,tg75fjksfh,kj56hahb,dkdkdkd
dsewsf,1234567890,,dkdkdk
khhhdg,5gfj578fj,1234567890,akdkdkd
ubthfgfv,zts576fufj,256hf,,
$ ./check_csv.sh col_2 1234567890 Y
csv contains 4 columns
Column of interest col_2 is located on number 2
Value of interest 1234567890 is present on line 3
With lines duplicated
$ more test.csv
col_1,col_2,col_3,col_4
1234567890,tg75fjksfh,kj56hahb,dkdkdkd
dsewsf,1234567890,,dkdkdk
khhhdg,5gfj578fj,1234567890,akdkdkd
ubthfgfv,zts576fufj,256hf,,
dsewsf,1234567890,,dkdkdk
dsewsf,1234567890,,dkdkdk
$ ./check_csv.sh col_2 1234567890 Y
csv contains 4 columns
Column of interest col_2 is located on number 2
Value of interest 1234567890 is present on line 3
Value of interest 1234567890 is present on line 6
Value of interest 1234567890 is present on line 7
$
If you want to treat files without a header, you only need to copy the code and drop the head -n 1 parts, but in those cases you cannot get the names of the columns, and you won't know in which column to look. A sketch for that case follows.
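A sketch of that no-header case, keeping the style of the script above (the column has to be passed by number since there is no header to search; parameter names are illustrative):
# usage sketch: ./check_csv.sh <column_number> <value_of_interest> N
column_number=$1
value_of_interest=$2
iteration=0
while IFS= read -r line
do
    iteration=$(expr $iteration + 1)
    is_there=$(echo "$line" | awk -v temp=$column_number -F ',' '{print $temp}' | egrep -i "$value_of_interest" | wc -l)
    if [ $is_there -gt 0 ]
    then
        echo "Value of interest $value_of_interest is present on line $iteration"
    fi
done < test.csv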
col="col_of_interest"
value_of_interest="1234567890"
awk -v FS="," -v coi="$col" -v voi="$value_of_interest" \
'NR==1{
for(i=1; i<=NF; i++){
if(coi==$i){
y=i
}
}
next
}
{if($y==voi){print NR}}' file
Output:
4
See: GNU awk: String-Manipulation Functions (split), Arrays in awk, 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR and man awk
file=./input.csv
d=,
# get column number for col_of_interest
c=$(head -n1 "$file" | grep -oE "[^$d]+" | grep -niw "$col" | cut -d: -f1)
# print column with cut and get line numbers for 1234567890
[ "$c" -gt 0 ] && wanted_line_number=$(cut -d$d -f$c "$file" | grep -niw "$value_of_interest" | cut -d: -f1)
printf "The wanted line number are: %b\n" $wanted_line_number
I would like to write a little shell script that checks whether all lines in a file have the same number of ';' characters.
I have a file containing the following format :
$ cat filename.txt
34567890;098765456789;098765567;9876;9876;EXTG;687J;
4567800987987;09876789;9667876YH;9876;098765;098765;09876;
SLKL987H;09876LKJ;POIUYT;PÖIUYT;88765K;POIUYTY;LKJHGFDF;
TYUIO;09876LKJ;POIUYT;LKJHG;88765K;POIUYTY;OIUYT;
...
...
...
SDFGHJK;RTYUIO9876;4567890LKJHGFD;POIUYTRF56789;POIUY;POIUYT;9876;
I use the following command to determine the number of ';' on each line:
awk -F';' 'NF{print (NF-1)}' filename.txt
I have the following output :
7
7
7
7
...
...
...
7
because the number of ';' on each line of this file is 7.
Now I want to write a script that verifies that all the lines in the file have 7 semicolons. If so, it tells me that the file is correct. Otherwise, if a single line contains more than 7 semicolons, it tells me that the file is not correct.
Rather than printing output, return a value, e.g.
awk -F';' 'NR==1{count = NF} NF!=count{status=1}END{exit status}' filename.txt
If there are no lines or if all lines contain the same number of semicolons, this will exit with status 0. Otherwise, it exits with 1 to indicate failure.
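Used from the shell, the exit status can drive the good/bad message the question asks for, e.g.:
if awk -F';' 'NR==1{count = NF} NF!=count{status=1} END{exit status}' filename.txt
then
    echo "file is correct"
else
    echo "file is not correct"
fi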
Count the number of unique lines and verify that the count is 1.
if (($(awk -F';' 'NF{print (NF-1)}' filename.txt | uniq | wc -l) == 1)); then
echo good
else
echo bad
fi
Just pipe the result through sort -u | wc -l. If all lines have the same number of fields, this will produce one line of output.
Alternatively, just look for a line in awk that doesn't have the same number of fields as the first line.
awk -F';' 'NR==1 {linecount=NF}
linecount != NF { print "Bad line " $0; exit 1}
' filename.txt && echo "Good file"
You can also adapt the old trick used to output only the first of duplicate lines.
awk -F';' '{a[NF]=1}; length(a) > 1 {exit 1}' filename.txt
Each line records its field count as a key in a. We exit with status 1 as soon as a has more than one entry. Basically, a acts as a set of all field counts seen so far.
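The same exit status can be tested directly; note that length() on an array is a GNU awk extension, so this one-liner assumes gawk:
awk -F';' '{a[NF]=1} length(a) > 1 {exit 1}' filename.txt && echo good || echo bad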
Based on all the information you have given me, I ended up doing the following. And it works for me.
nbCol=`awk -F';' '(NR==1){print NF;}' $1`
val=7
awk -F';' 'NR==1{count = NF} NF != count { exit 1}' $1
result=`echo $?`
if [ $result -eq 0 ] && [ $nbCol -eq $val ];then
echo "Good Format"
else
echo "Bad Format"
fi
I have the following code:
names=$(ls *$1*.txt)
head -q -n 1 $names | cut -d "_" -f 2
where the first line finds and stores all names matching the command line input into a variable called names, and the second grabs the first line in each file (element of the variable names) and outputs the second part of the line based on the "_" delim.
This is all good, however I would like to prepend the filename (stored as lines in the variable names) to the output of cut. I have tried:
names=$(ls *$1*.txt)
head -q -n 1 $names | echo -n "$names" cut -d "_" -f 2
however this only prints out the filenames
I have tried
names=$(ls *$1*.txt)
head -q -n 1 $names | echo -n "$names"; cut -d "_" -f 2
and again I only print out the filenames.
The desired output is:
$
filename1.txt <second character>
where there is a single whitespace between the filename and the result of cut.
Thank you.
Best approach, using awk
You can do this all in one invocation of awk:
awk -F_ 'NR==1{print FILENAME, $2; exit}' *"$1"*.txt
On the first line of the first file, this prints the filename and the value of the second column, then exits.
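If the goal is one line of output per matching file rather than only the first file, a per-file variant is a small change (a sketch; nextfile is available in GNU awk and most modern awks):
awk -F_ 'FNR==1{print FILENAME, $2; nextfile}' *"$1"*.txt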
Pure bash solution
I would always recommend against parsing ls - instead I would use a loop. You can also avoid the call to awk and read the first line of each file using bash built-in functionality:
for i in *"$1"*.txt; do
IFS=_ read -ra arr <"$i"
echo "$i ${arr[1]}"
break
done
Here we read the first line of the file into an array, splitting it into pieces on the _.
Maybe something like this will satisfy your need, BUT THIS IS BAD CODING (it parses ls, as noted above):
#!/bin/bash
names=$(ls *$1*.txt)
for f in $names
do
pattern=`head -q -n 1 $f | cut -d "_" -f 2`
echo "$f $pattern"
done
If I didn't misunderstand your goal, this also works.
I've always done it this way; I just found out that it is considered a deprecated approach.
#!/bin/bash
names=$(ls *"$1"*.txt)
for e in $names;
do echo $e `echo "$e" | cut -c2-2`;
done