I have written code to restructure a CSV file based on a control file, which looks like this:
Control file:
1,column1
3,column3
6,column6
4,column4
-1,column9
Based on the above control file, I take the columns at indexes 1, 3, 6, 4 and -1 from source.csv and create a new file using the paste command. If an index value in the control file is -1, I have to insert an entire column of nulls, and its header name will be column9.
Code :
var=1
while read line
do
t=$(echo $line | awk '{ print $1}' | cut -d, -f1)
if [ $t != -1 ]
then
cut -d, -f$t source.csv >file_$var.csv
else
touch file_$var.csv
fi
var=$((var+1))
done < "$file"
ls -v file_*.csv | xargs paste -d, > new_file.csv
Is there a way to convert these lines into awk? Please suggest some ideas.
Before running the script:
sample.csv
column1,column2,column3,column4,column5,column6,column7
a,b,c,d,e,f,g
Output:
new_file.csv
column1,column3,column6,column4,column9
a,c,f,d,
column9 has index -1 in the control file, so it is null: just an empty, comma-separated field.
The basic intention is to restructure the source file based on the control file.
Script:
#Greenplum Database details to read target file structure from Meta Data Tables.
export PGUSER=xxx
export PGPORT=5432
export PGHOST=10.100.20.10
export PGDATABASE=fff
SCHEMA='jiodba'
##Function to explain usage of this script
usage() {
echo "Usage: program.sh -s <Source_folder> -t <Target_folder> -f <file_name> ";
exit 1; }
source_folder=$1
target_folder=$2
file_name=$3
#removes the existing file from current directory
rm -f file_struct_*.csv
# Reading the Header from the Source file.
v_source_header=`head -1 $file_name`
IFS="," # Set the field separator
set $v_source_header # Breaks the string into $1, $2, ...
i=1
for item # A for loop by default loops through $1, $2, ...
do
echo "$i,$item">>source_header.txt
((i++))
done
sed -e "s/
//" source_header.txt | sed -e "s/ \{1,\}$//" > source_headers.txt
rm -f source_header.txt
#Get the target header information from the Greenplum metadata table and write it into target_header.txt.
psql -t -A -F "," -c "select Target_column_position,Target_column_name from jiodba.etl_tbl_sequencing where source_file_name='$file_name' order by target_column_position" > target_header.txt
#Removing trailing spaces and control characters (carriage returns).
sed -e "s/\r//" target_header.txt | sed -e "s/ \{1,\}$//" > target_headers.txt
rm -f target_header.txt
#Compare the source header with the target structure and generate the difference.
awk -F, 'NR==FNR{a[$2]=$1;next} {if ($2 in a) print a[$2]","$2; else print "-1," $2}' source_headers.txt target_headers.txt >>tgt_struct_output.txt
#Loop to read each column index from tgt_struct_output.txt and cut that column from the source file.
file='tgt_struct_output.txt'
var=1
while read line
do
t=$(echo $line | awk '{ print $1}' | cut -d, -f1)
if [ $t != -1 ]
then
cut -d, -f$t $file_name>file_struct_$var.csv
else
touch file_struct_$var.csv
fi
var=$((var+1))
done<"$file"
awk -F, -v OFS=, 'FNR==NR {c[++n]=$2; a[$2]=$1;next} FNR==1{f=""; for (i=1; i<=n; i++)
{printf "%s%s", f, c[i]; b[++k]=i; f=OFS} print "";next}
{for (i=1; i<=n; i++) if(a[c[i]]>0) printf "%s%s", $a[c[i]], OFS; print""
}' tgt_struct_output.txt $file_name
#Paste the different files (columns) into a single file
ls -v file_struct_*.csv | xargs paste -d, | sed -e "s/\r//" > new_file.csv
new_header=`cut -d "," -f 2 target_headers.txt | tr "\n" "," | sed 's/,$//'`
#Replace the header with the original target header in case a column doesn't exist in the target table structure.
sed "1s/.*/$new_header/" new_file.csv
#Removing the Temp files.
rm -f file_struct_*.csv
rm -f source_headers.txt target_headers.txt tgt_struct_output.txt
touch file_struct_1.csv #Just to avoid the error in shell
Sample.csv
BP ID,Prepaid Account No,CurrentMonetary balance ,charge Plan names ,Provider contract id,Contract Item ID,Start Date,End Date
1100001538,001000002506,251,[B2] R2 LTE CHARGE PLAN ,00000000000000000141,[B2] R2 LTE CHARGE PLAN _00155D10E20D1ED39A8E146EA7169A2E00155D10E20D1ED398FD63624498DB4A,16-Oct-12,18-Oct-12
1100003404,001000004029,45.22,B0.3 ECS_CHARGE_PLAN DROP1 V3,00000000000000009349,B0.3 ECS DROP2 V0.2_00155D10E20D1ED39A8E146EA7169A2E00155D10E20D1ED398FD63624498DA2E,16-Nov-13,23-Nov-13
1100006545,001000006620,388.796,B0.3 ECS_CHARGE_PLAN DROP1 V3,00000000000000010477,B0.3 ECS DROP2 V0.2_00155D10E20D1ED39A8E146EA7169A2E00155S00E20D1ED398FD63624498DA2E,07-Nov-12,07-Nov-13
You can try this awk:
awk -F, -v OFS=, 'FNR==NR {c[++n]=$2; a[$2]=$1;next} FNR==1{f=""; for (i=1; i<=n; i++)
{printf "%s%s", f, c[i]; b[++k]=i; f=OFS} print "";next}
{for (i=1; i<=n; i++) if(a[c[i]]>0) printf "%s%s", $a[c[i]], OFS; print""
}' ctrl.csv sample.csv
column1,column3,column6,column4,column9
a,c,f,d,
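Note that the awk above prints nothing at all (not even a separator) for a -1 entry, so the empty column only lines up when the -1 entry is the last one. A variant that always emits a field, so a -1 entry becomes an explicit empty field wherever it appears, might look like this (a sketch, reusing the ctrl.csv and sample.csv names from above):
awk -F, -v OFS=, '
FNR==NR { c[++n]=$2; a[$2]=$1; next }                                   # control file: output order and source column index
FNR==1  { for (i=1; i<=n; i++) printf "%s%s", c[i], (i<n ? OFS : "\n"); next }
        { for (i=1; i<=n; i++) printf "%s%s", (a[c[i]]>0 ? $a[c[i]] : ""), (i<n ? OFS : "\n") }
' ctrl.csv sample.csv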
Related
I am writing a function in a bash shell script that should return the lines of headered CSV files that have more commas than the header. This can happen because some values inside these files may themselves contain commas. For quality control, I must identify these lines so I can clean them up later. What I currently have:
#!/bin/bash
get_bad_lines () {
local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $1 | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
echo "File: $1_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
done
}
get_bad_lines products
get_bad_lines users
The output of this program is currently all the comma counts with all of the line numbers in all of the files,
and I suspect this is because the function argument $1 (the folder name, i.e. products and users) conflicts with the $1 referenced in the awk call (where I want to grab the first column, the comma count for the current line of the file in the loop).
Is this the issue? And if so, could it be solved by referring to the first column and the folder name with different variable names instead of both using $1?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(it should only show lines of that file that have more than 5 commas).
I tried declaring the variable in the call to awk as well, with the same effect (as in the accepted answer to "Awk field variable clash with function argument"):
get_bad_lines () {
local table_name=$1
local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $table_name | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
echo "File: ${table_name}_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
done
}
You can do this entirely in awk:
get_bad_lines () {
find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
awk -v table_name="$1" '
NR==1 { num_comma=gsub(/,/, ""); }
/,/ { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0;}
END { if (wrong > 0) {
print(FILENAME" has over "num_comma" commas in the following lines:");
for (i=0;i<wrong;i++) { print(wrong_array[i]); }
}
}' "${my_file}"
done
}
As for why your original awk command failed to print only the lines with too many commas: you are using the shell variable correct_no_of_commas inside a single-quoted awk statement ('$1 > $correct_no_of_commas {print}'). The shell performs no substitution inside single quotes, so awk reads $correct_no_of_commas as-is and looks for an awk variable named correct_no_of_commas, which is undefined and therefore an empty string. awk then evaluates $1 > $"" as the matching condition, and since $"" is equivalent to $0, awk compares the count in $1 with the whole input line. The input lines from uniq -c have the form <spaces><count> <line_number>, which does not look like a single number, so awk falls back to a string comparison, and the digit in $1 always sorts after the leading spaces of $0. Thus $1 > $correct_no_of_commas is always true.
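If you want to keep the original grep/cut/uniq pipeline, the usual fix is to pass the shell value into awk with -v; for example, a sketch reusing the variable names from the question:
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" \
  | cut -d : -f 1 | uniq -c \
  | awk -v max="$correct_no_of_commas" '$1 > max { print }'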
You can identify all the bad lines with a single awk command:
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you also want the line number to be printed, use this:
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv
I have this data:
cat >data1.txt <<'EOF'
2020-01-27-06-00;/dev/hd1;100;/
2020-01-27-12-00;/dev/hd1;100;/
2020-01-27-18-00;/dev/hd1;100;/
2020-01-27-06-00;/dev/hd2;200;/usr
2020-01-27-12-00;/dev/hd2;200;/usr
2020-01-27-18-00;/dev/hd2;200;/usr
EOF
cat >data2.txt <<'EOF'
2020-02-27-06-00;/dev/hd1;120;/
2020-02-27-12-00;/dev/hd1;120;/
2020-02-27-18-00;/dev/hd1;120;/
2020-02-27-06-00;/dev/hd2;230;/usr
2020-02-27-12-00;/dev/hd2;230;/usr
2020-02-27-18-00;/dev/hd2;230;/usr
EOF
cat >data3.txt <<'EOF'
2020-03-27-06-00;/dev/hd1;130;/
2020-03-27-12-00;/dev/hd1;130;/
2020-03-27-18-00;/dev/hd1;130;/
2020-03-27-06-00;/dev/hd2;240;/usr
2020-03-27-12-00;/dev/hd2;240;/usr
2020-03-27-18-00;/dev/hd2;240;/usr
EOF
I would like to create a .txt file for each filesystem (so hd1.txt, hd2.txt, hd3.txt and hd4.txt) and put in each .txt file the sum of the values for that filesystem from each dataX.txt. It is hard for me to explain in English what I want, so here is an example of the desired result.
Expected content for the output file hd1.txt:
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
Expected content for the file hd2.txt:
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
The implementation I've currently tried:
for i in $(cat *.txt | awk -F';' '{print $2}' | cut -d '/' -f3| uniq)
do
cat *.txt | grep -w $i | awk -F';' -v date="$(cat *.txt | awk -F';' '{print $1}' | cut -d'-' -f-2 | uniq )" '{sum+=$3} END {print date";"$2";"sum}' >> $i
done
But it doesn't work.
Can you show me how to do that?
Because the format seems to be so regular, you can split the input on multiple separators and parse it easily in awk:
awk -v FS='[-;/]' '
prev != $9 {
if (length(output)) {
print output >> fileoutput
}
prev = $9
sum = 0
}
{
sum += $9
output = sprintf("%s-%s;/%s/%s;%d;/%s", $1, $2, $7, $8, sum, $11)
fileoutput = $8 ".txt"
}
END {
print output >> fileoutput
}
' *.txt
Tested on repl, this generates:
+ cat hd1.txt
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
+ cat hd2.txt
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
Alternatively, you could use -v FS=';' and split() the first and second columns to extract the year, the month, and the hdX name, as sketched below.
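For instance, a sketch of that alternative (note that for (k in sum) does not guarantee any particular line order within each hdX.txt):
awk -F';' '
{
    split($1, d, "-")                 # d[1]=year, d[2]=month
    n = split($2, p, "/")             # p[n] is hd1, hd2, ...
    key = d[1] "-" d[2] ";" $2 ";" p[n] ";" $4
    sum[key] += $3                    # accumulate the size per month+device
}
END {
    for (k in sum) {
        split(k, f, ";")
        print f[1] ";" f[2] ";" sum[k] ";" f[4] > (f[3] ".txt")
    }
}' *.txt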
If you seek a bash solution, I suggest you invert the loops - first iterate over files, then over identifiers in second column.
for file in *.txt; do
prev=
output=
while IFS=';' read -r date dev num path; do
hd=$(basename "$dev")
if [[ "$hd" != "${prev:-}" ]]; then
if ((${#output})); then
printf "%s\n" "$output" >> "$fileoutput"
fi
sum=0
prev="$hd"
fi
sum=$((sum + num))
output=$(
printf "%s;%s;%d;%s" \
"$(cut -d'-' -f1-2 <<<"$date")" \
"$dev" "$sum" "$path"
)
fileoutput="${hd}.txt"
done < "$file"
printf "%s\n" "$output" >> "$fileoutput"
done
You could also translate the awk to bash almost 1:1 by using IFS='-;/' in the while read loop; a sketch of the field splitting follows.
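For example, the field-splitting part of such a 1:1 translation could look like this (a sketch; the x1 and x2 placeholders absorb the empty fields produced between ; and /):
while IFS=';-/' read -r year month day hour minute x1 dev hd size x2 path; do
    # reassemble the reduced record, e.g. 2020-01;/dev/hd1;100;/
    printf '%s-%s;/%s/%s;%s;/%s\n' "$year" "$month" "$dev" "$hd" "$size" "$path"
done < data1.txt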
In Linux, how can I create a file whose content comes from a single line in which \n (or any line separator) is translated into multiple lines?
fileA.txt:
trans_fileA::abcd\ndfghc\n091873\nhhjj
trans_fileB::a11d\n11hc\n73345
Code:
while read line; do
file_name=`echo $line | awk -F'::' '{print $1}' `
file_content=`echo $line | awk -F'::' '{print $2}' `
echo $file_name
echo $(eval echo ${file_content})
echo $(eval echo ${file_content}) > fileA.txt
The resulting file trans_fileA should contain:
abcd
dfghc
091873
hhjj
You can do it this way (with bash):
# read input file line by line, without interpreting \n
while read -r line
do
# extract name
name=$(echo $line | cut -d: -f 1)
# extract data
data=$(echo $line | cut -d: -f 3)
# ask sed to replace \n with linefeed and store result in name
echo $data | sed 's/\\n/\n/g' > "$name"
# read data from given file
done < fileA.txt
You can even write it more compactly:
while read -r line
do echo $line | cut -d: -f 3 | sed 's/\\n/\n/g' > "$(echo $line | cut -d: -f 1)"
done < fileA.txt
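For comparison, the same transformation can be done in a single awk pass that writes each output file directly (a sketch, assuming the :: separator and literal \n sequences shown in the question):
awk -F'::' '{
    out = $2
    gsub(/\\n/, "\n", out)     # turn literal \n sequences into real newlines
    print out > $1             # write to the file named in the first field
    close($1)
}' fileA.txt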
Input
119764469|14100733//1,k1=v1,k2=v2,STREET:1:1=NY
119764469|14100733//1,k1=v1,k2=v2,k3=v3
119764469|14100733//1,k1=v1,k4=v4,abc.xyz:1:1=nmb,abc,po.foo:1:1=yu
k1 could be any alphanumeric name with the special characters . and :, e.g. abc.nm.1:1
Expected output (all unique keys); sorting is not required or necessary, and it should be very fast:
k1,k2,STREET:1:1,k3,k4,abc.xyz:1:1
My current approach/solution is:
awk -F',' '{for (i=0; i<=NR; i++) {for(j=1; j<=NF; j++){split($j,a,"="); print a[1];}}}' file.txt | awk '!x[$1]++' | grep -v '|' | sed -e :a -e '$!N; s/\n/ | /; ta'
It works fine, but it is too slow for huge files (which could be MBs or GBs in size).
NOTE: This is needed for a data migration and should use basic Unix shell commands, as production may not allow third-party utilities.
Not sure about the speed, but give this a try:
$ cut -d, -f2- file | # select the key/value pairs
tr ',' '\n' | # split each k=v to its own line
cut -d= -f1 | # select only keys
sort -u | # filter uniques
paste -sd, # serialize back to single csv line
abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
I expect it to be faster than grep since no regex is involved.
Use grep -o to grep only the parts you need:
grep -o -e '[^=,]\+=[^,]\+' file.txt |awk -F'=' '{print $1}' |sort |uniq |tr '\n' ',' |sed 's/,$/\n/'
>>> abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
(sort is needed here because uniq only removes adjacent duplicates)
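A quick illustration of why the sort matters:
printf 'k1\nk2\nk1\n' | uniq       # prints k1, k2, k1 - the non-adjacent duplicate survives
printf 'k1\nk2\nk1\n' | sort -u    # prints k1, k2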
If you don't really need the output all on one line:
$ awk -F'[,=]' '{for (i=2;i<=NF;i+=2) print $i}' file | sort -u
abc.xyz:1:1
k1
k2
k3
k4
STREET:1:1
If you do:
$ awk -F'[,=]' '{for (i=2;i<=NF;i+=2) print $i}' file | sort -u |
awk -v ORS= '{print sep $0; sep=","} END{print RS}'
abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
You could do it all in one awk script, but I'm not sure it'll be as efficient as the above, and it might run into memory issues if/when the array grows to millions of values:
$ cat tst.awk
BEGIN { FS="[,=]"; ORS="" }
{
for (i=2; i<=NF; i+=2) {
vals[$i]
}
}
END {
for (val in vals) {
print sep val
sep = ","
}
print RS
}
$ awk -f tst.awk file
k1,abc.xyz:1:1,k2,k3,k4,STREET:1:1
I hope you can help me.
I am trying to split up a string:
#!/bin/bash
file=$(<sample.txt)
echo "$file"
The file itself contains values like this:
(;FF[4]GM[1]SZ[19]CA[UTF-8]SO[sometext]BC[cn]WC[ja]
What I need is a way to extract the values between the [ ] and set them as variables, for example:
$FF=4
$GM=1
$SZ=19
and so on
However, some files do not contain all values, so in some cases there is no FF[*]. In that case the program should use the value "99".
How do I do this?
Thank you so much for your help.
Greetings
Chris
It may be a bit overcomplicated, but here is another way:
grep -Po '[-a-zA-Z0-9]*' file | awk '!(NR%2) {printf "declare %s=\"%s\";\n", a,$0; next} {a=$0}' | bash
Step by step:
Filter file by printing only the needed blocks:
$ grep -Po '[-a-zA-Z0-9]*' a
FF
4
GM
1
SZ
19
CA
UTF-8
SO
sometext
BC
cn
WC
ja
Reformat so that it specifies the declaration:
$ grep -Po '[-a-zA-Z0-9]*' a | awk '!(NR%2) {printf "declare %s=\"%s\";\n", a,$0; next} {a=$0}'
declare FF="4";
declare GM="1";
declare SZ="19";
declare CA="UTF-8";
declare SO="sometext";
declare BC="cn";
declare WC="ja";
And finally pipe to bash so that it is executed.
Note that the 2nd step could also be rewritten as
xargs -n2 | awk '{print "declare "$1"=\""$2"\";"}'
I'd write this, using ;, [ or ] as awk's field separators:
$ line='(;FF[4]GM[1]SZ[19]CA[UTF-8]SO[sometext]BC[cn]WC[ja]'
$ awk -F '[][;]' '{for (i=2; i<NF; i+=2) {printf "%s=\"%s\" ", $i, $(i+1)}; print ""}' <<<"$line"
FF="4" GM="1" SZ="19" CA="UTF-8" SO="sometext" BC="cn" WC="ja"
Then, to evaluate the output in your current shell:
$ source <(!!)
source <(awk -F '[][;]' '{for (i=2; i<NF; i+=2) {printf "%s=\"%s\" ", $i, $(i+1)}; print ""}' <<<"$line")
$ echo $SO
sometext
To handle the default FF value:
$ source <(awk -F '[][;]' '{
print "FF=99"
for (i=2; i<NF; i+=2) printf "%s=\"%s\" ", $i, $(i+1)
print ""
}' <<< "(;A[1]B[2]")
$ echo $FF
99
$ source <(awk -F '[][;]' '{
print "FF=99"
for (i=2; i<NF; i+=2) printf "%s=\"%s\" ", $i, $(i+1)
print ""
}' <<< "(;A[1]B[2]FF[3]")
$ echo $FF
3
Per your request:
while IFS=\[ read -r A B; do
[[ -z $B ]] && B=99
eval "$A=\$B"
done < <(exec grep -oE '[[:alpha:]]+\[[^]]*' sample.txt)
Although using an associative array would be better:
declare -A VALUES
while IFS=\[ read -r A B; do
[[ -z $B ]] && B=99
VALUES[$A]=$B
done < <(exec grep -oE '[[:alpha:]]+\[[^]]*' sample.txt)
There you have access to both the keys ("${!VALUES[@]}") and the values ("${VALUES['FF']}").
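For example, to list everything that was read, and to fall back to 99 when FF is missing (a usage sketch):
for key in "${!VALUES[@]}"; do
    printf '%s=%s\n' "$key" "${VALUES[$key]}"
done
echo "FF=${VALUES[FF]:-99}"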
I would probably do something like this:
set $(sed -e 's/^(;//' sample.txt | tr '[][]' ' ')
while (( $# >= 2 ))
do
varname=${1}
varvalue=${2}
# do something to test varname and varvalue to make sure they're sane/safe
declare "${varname}=${varvalue}"
shift 2
done
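One way to make that sanity check concrete is to accept only names that are valid shell identifiers; a sketch of what the test-and-declare step inside the loop could look like:
if [[ ${varname} =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]]; then
    declare "${varname}=${varvalue}"
else
    printf 'skipping suspicious name: %s\n' "${varname}" >&2
fi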