How can I add an index to a csv file using awk? For example, let's assume I have a file
data.txt
col1,col2,col3
a1,b1,c1
a2,b2,c2
a3,b3,c3
I would like to add another column, which is the index. Basically, I would like the output to be
,col1,col2,col3
0,a1,b1,c1
1,a2,b2,c2
2,a3,b3,c3
I was trying to use awk '{for (i=1; i<=NF; i++) print $i}' but it does not seem to be working right. And what is the best way to add just a comma for the first line but an incrementing number and a comma to the rest of the lines?
Your attempt prints every field on its own line because print $i runs once per field, and without -F, awk splits on whitespace rather than commas. You may use this awk solution instead; it prepends an empty string to the header line and NR-2 (a zero-based index) to every other line:
awk '{print (NR == 1 ? "" : NR-2) "," $0}' file
,col1,col2,col3
0,a1,b1,c1
1,a2,b2,c2
2,a3,b3,c3
Use this Perl one-liner:
perl -pe '$_ = ( $. > 1 ? ($. - 2) : "" ) . ",$_";' data.txt > out.txt
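With the redirection dropped, running it on the sample data.txt prints:
,col1,col2,col3
0,a1,b1,c1
1,a2,b2,c2
2,a3,b3,c3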
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
$. : Current input line number.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlvar: Perl predefined variables
I would use GNU AWK for this task in the following way. Let file.txt content be
col1,col2,col3
a1,b1,c1
a2,b2,c2
a3,b3,c3
then
awk 'BEGIN{OFS=","}{print NR==1?"":i++,$0}' file.txt
gives output
,col1,col2,col3
0,a1,b1,c1
1,a2,b2,c2
2,a3,b3,c3
Explanation: First I inform GNU AWK that the output field separator (OFS) is , so the arguments to print will be joined with that character. Then for each line I use the so-called ternary operator, condition?valueiftrue:valueiffalse, to decide the 1st argument to print: for the 1st line (NR==1) it is an empty string; for every other line it is the counter i++, whose current value is returned first and then increased by 1. The 2nd argument to print is always the whole original line ($0).
(tested in gawk 4.2.1)
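The post-increment the explanation relies on is easy to see in isolation: an unset awk variable is 0 in a numeric context, so each i++ returns the current value and only then bumps it:
$ awk 'BEGIN { print i++; print i++; print i }'
0
1
2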
Two golfed variants, for gawk and mawk respectively, that read standard input:
gawk 'sub("^",substr(++_",",3^(NF~NR)))' FS='^$' \_=-2
mawk 'sub("^",++_+NF ? _",":",")' FS='^$' \_=-2
Here FS='^$' is a regex that can never match, so every line is a single field, and the trailing \_=-2 is a command-line assignment seeding the counter _ at -2: it reaches -1 on the header line (where only the comma survives) and 0 on the first data line.
,col1,col2,col3
0,a1,b1,c1
1,a2,b2,c2
2,a3,b3,c3
I have the following file and want to grep a value, in this case "225". This value is actually a variable, $pd, so it could change depending on the user's input. It could be an integer or an alphanumeric string, matched case-insensitively and exactly. For example, if the value of the variable is "225", then "0225" or "11225" is not a valid match in the file I'm reading.
Input File:
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.223.10|2000-H1|1/1/3|Undiscoverable|Unkwn
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
10.20.223.10|2000-H1|1/1/8|DeviceY_01225_|Kenmore
10.20.225.10|2000-H1|1/1/8|DeviceY_2250_|Kenmore
Desired Output File:
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
If the user input is "lg", then it should output the line rather than ignore it, because the input file has "LG" in uppercase. (This part is already fixed in the script.)
Desired Output:
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG
$ awk -F'|' -v n='225' '$4 ~ n' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
10.20.223.10|2000-H1|1/1/8|DeviceY_01225_|Kenmore
10.20.225.10|2000-H1|1/1/8|DeviceY_2250_|Kenmore
or if you don't want a partial match (e.g. against the 01225 and 2250 lines above) then one way is:
$ awk -F'|' -v n='225' '$4 ~ ("(^|[^0-9])" n "([^0-9]|$)")' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
or:
$ awk -F'|' -v n='225' '$4 ~ ("(^|_)" n "(_|$)")' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
There are other possibilities too. The right solution depends on requirements you haven't told us about, and may pass or fail on input other than what you've shown us so far.
awk
awk -F"|" -v var="[A-Za-z].225_" '$4 ~ var{print}'
sed
sed -n '/[A-Za-z].225_/p'
grep
grep '[A-Za-z].225_'
Output
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
Using sed:
sed -n '/^\([^|]*\|\)\{3\}[^|]*225/p' < input
Explanation:
the -n option disables automatic output at the end of each sed cycle
the pattern matches arbitrary contents of the first three (\{3\}) columns of data via the \(parenthesized\) pattern [^|]*\| -- any number of non-delimiter characters followed by the column delimiter
it matches additional input at the beginning of the fourth column, but not spanning columns, with a similar subexpression: [^|]*
then comes the literal text you want to match
the p command after the pattern causes the line to be printed to sed's output in the event that it matches the pattern
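If your sed supports extended regular expressions (GNU and BSD sed both do, via -E), the same pattern needs fewer backslashes:
sed -nE '/^([^|]*\|){3}[^|]*225/p' < input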
There's almost certainly an awk solution too, but in Perl it's this (note that, like the first awk command above, this is a partial match):
$ perl -aF'\|' -ne '$F[3] =~ 225 and print' < input
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
10.20.223.10|2000-H1|1/1/8|DeviceY_01225_|Kenmore
10.20.225.10|2000-H1|1/1/8|DeviceY_2250_|Kenmore
-a: Autosplit the input into array @F
-F'\|': Set the autosplit delimiter to |
-n: Run code for each line in the input file
-e: Here's the code to run
$F[3]: The 4th element of the autosplit array @F
=~: Regex match
and print: Print the input line if the regex matches
Update: You can get the string you're interested in from a command line parameter by assigning it in a BEGIN block.
$ perl -aF'\|' -ne 'BEGIN { $x = shift } $F[3] =~ $x and print' 225 < input
I've got a text file with several lines. Every line has words separated by commas, and the number of words per line is not the same. With the help of the awk command, I would like to make every line have the same number of columns. For example, if the text file is as follows:
word1, text, help, test
number, begin
last, line, line
I would like the output to be as follows, where every line has the same number of columns, padded with extra null words:
word1, text, help, test
number, begin, null, null
last, line, line, null
I tried the following code:
awk '{print $0,Null}' file.txt
$ awk 'BEGIN {OFS=FS=", "}
NR==FNR {max=max<NF?NF:max; next}
{for(i=NF+1;i<=max;i++) $i="null"}1' file{,}
First scan to find the max number of columns, then fill in the missing entries on the second pass. If the first line contains all the columns (a header, perhaps), you can change this to
$ awk 'BEGIN {OFS=FS=", "}
NR==1 {max=NF}
{for(i=NF+1;i<=max;i++) $i="null"}1' file
file{,} is expanded by bash to file file, a neat trick to avoid repeating the filename (and eliminate possible typos).
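You can see the brace expansion with echo:
$ echo file{,}
file file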
Passing twice through the input file, using getline on first pass:
awk '
BEGIN {
OFS=FS=", "
while(getline < ARGV[1]) {
if (NF > max) {max = NF}
}
close(ARGV[1])
}
{ for(i=NF+1; i<=max; i++) $i="null" } 1
' file.txt
Alternatively, keeping it simple by running awk twice...
#!/bin/bash
infile="file.txt"
maxfields=$(awk 'BEGIN {FS=", "} {if (NF > max) {max = NF}} END{print max}' "$infile" )
awk -v max="$maxfields" 'BEGIN {OFS=FS=", "} {for(i=NF+1;i<=max;i++) $i="null"} 1' "$infile"
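If the input is not a seekable file (e.g. it comes from a pipe), the two-pass approaches above won't work; a single-pass sketch can buffer the lines in memory instead, which is fine as long as the file fits in RAM:
awk 'BEGIN {OFS=FS=", "}
{line[NR]=$0; if (NF>max) max=NF}
END {for (n=1; n<=NR; n++) {$0=line[n]; for (i=NF+1; i<=max; i++) $i="null"; print}}' file.txt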
Use these Perl one-liners. The first one goes through the file and finds the max number of fields to use. The second one goes through the file and prints the input fields, padded at the end by the null strings:
export num_fields=$( perl -F'/,\s+/' -lane 'print scalar @F;' in_file | sort -nr | head -n1 )
perl -F'/,\s+/' -lane 'print join ", ", map { defined $F[$_] ? $F[$_] : "null" } 0..( $ENV{num_fields} - 1 );' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/,\s+/' : Split into @F on comma with whitespace.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
I want to replace the ">" with variable names starting with ">" and ending with ".". But the following code is not printing the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta.
Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk knows the filename it is currently operating on through the built-in variable FILENAME; I strip the .fasta extension using gensub and store the result in the variable stub. Then I invoke gsub to replace ">" with ">" followed by the content of my variable stub. After that I print the line.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
Could you please try the following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: here is the above code with comments.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing the 1st char, then the array's 1st element, then the substring from the 2nd char to the end of the line.
next ##next will skip all further statements from here.
}
1 ##1 will print all remaining lines (those not starting with >).
' sample01.fasta ##Mentioning Input_file name here.
I'm making this replacement
sed 's/<n3:CustId>.*<\/n3:CustId>/<n3:CustId>'"${orgkey}"'<\/n3:CustId>/' CAMBIOMINI.txt > CAMBIOMINI2.txt
but now I want to replace line by line with a different orgkey value. I want orgkey+=1, but I don't know how to do that in the same command for the whole CAMBIOMINI.txt file.
Sed may not be suitable when you want to alter the substitution for each occurrence.
If my understanding of your requirement is correct, the following would work:
awk 'FNR==NR {orgkey[++i]=$0; next}
{print gensub(/<n3:CustId>[^<]*<\/n3:CustId>/,"<n3:CustId>" orgkey[++j] "</n3:CustId>", "g")} ' orgkey.txt CAMBIOMINI1.txt
where orgkey.txt holds the list of substitutions:
orgkey_a
orgkey_b
orgkey_c
orgkey_d
and CAMBIOMINI1.txt will look like:
<n3:CustId>id1</n3:CustId>
<n3:CustId>id2</n3:CustId>
<n3:CustId>id3</n3:CustId>
<n3:CustId>id4</n3:CustId>
then the result will be:
<n3:CustId>orgkey_a</n3:CustId>
<n3:CustId>orgkey_b</n3:CustId>
<n3:CustId>orgkey_c</n3:CustId>
<n3:CustId>orgkey_d</n3:CustId>
Note that it assumes the tag in CAMBIOMINI1.txt does not appear multiple
times on the same line, as in:
<n3:CustId>id1</n3:CustId> <n3:CustId>id2</n3:CustId>
<n3:CustId>id3</n3:CustId>
<n3:CustId>id4</n3:CustId>
In that case, use a Perl version instead:
perl -nle 'if (@ARGV) {push(@orgkey, $_); next}
s#<n3:CustId>.*?</n3:CustId>#"<n3:CustId>" . $orgkey[$j++] . "</n3:CustId>"#ge; print' orgkey.txt CAMBIOMINI1.txt
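If you literally want a numeric key that increments by one per line, as the orgkey+=1 in the question suggests, a small awk sketch along these lines should work, assuming at most one tag per line and the starting number in the shell variable $orgkey:
awk -v key="$orgkey" '{ if (sub(/<n3:CustId>[^<]*<\/n3:CustId>/, "<n3:CustId>" key "</n3:CustId>")) key++ } 1' CAMBIOMINI.txt > CAMBIOMINI2.txt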
I have a *.csv file. with value as below
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
or sometimes the value is as below
"8801989353984","KSDP05"
"8801957608165","ASDP11"
"8801991455848","CSDP10"
"8801981363116","CSDP07"
"8801921247870","KSDP07"
"8801965386240","CSDP06"
"8801956293036","KSDP10"
"8801984383904","KSDP11"
"8801944211742","ASDP09"
I just want to put the numeric value (e.g. 8801989353984) always in the 1st column. Is it possible using a Bash script?
Sed is also your friend here
Input
cat 41189347
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
Script
sed -E 's/^("[[:alpha:]]+.*"),("[[:digit:]]+")$/\2,\1/' 41189347
Output
"8801942183589","ASDP02"
"8801939151023","ASDP06"
"8801963981740","CSDP04"
"8801946305047","ASDP09"
"8801941195677","ASDP12"
"8801922826186","ASDP05"
"8801983008938","CSDP08"
"8801944346555","ASDP04"
"8801910831518","CSDP11"
awk to the rescue!
$ awk -F, -v OFS=, '$1~/[A-Z]/{t=$2;$2=$1;$1=t}1' file
If the first field has alpha chars, swap the first and second columns and print.
Bash can do the work, but awk might be a better choice for rearranging your file:
sample.csv:
"ASDP02","8801942183589"
"8801944211742","ASDP09"
command:
awk -F, 'BEGIN{OFS=","}{$1=$1;if(substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2)){print $1,$2}else{print $2,$1}}' sample.csv
substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2) strips the surrounding quotes and checks whether the column is numeric. If it is, print the original line; otherwise switch column1 and column2.
Output:
"8801942183589","ASDP02"
"8801944211742","ASDP09"
You can create a pure bash script to generate another file with the structure you need:
#!/bin/bash
csv_file="/path/to/your/csvfile"
output_file="/path/to/output_file"
#Optional
rm -rf "${output_file}"
readarray -t LINES < <(cat < "${csv_file}" 2> /dev/null)
for item in "${LINES[@]}"; do
if [[ $item =~ ^\"([0-9A-Z]+)\"\,\"([0-9]+)\" ]]; then
echo "\"${BASH_REMATCH[2]}\",\"${BASH_REMATCH[1]}\"" >> "${output_file}"
else
echo "$item" >> "${output_file}"
fi
done
This works even if your file is "mixed", that is, with some lines in the right format and other lines in the wrong format.
The following commands assume that the cells in the CSV files do not contain newlines or commas. Otherwise, you should write a more complicated script in Perl, PHP, or another programming language capable of parsing CSV files properly; Bash is definitely not appropriate for this task.
Perl
perl -F, -nle '@F = reverse @F if $F[0] =~ /^"\d+"$/;
print join(",", @F)' file
Beware: if the cells contain newlines or commas, use Perl's Text::CSV module, for instance. Although it is a simple task in Perl, it goes beyond the scope of the current question.
The command splits the input lines by commas (-F,) and stores the result in the @F array, for each line. The items in the array are reversed if the first field $F[0] matches the regular expression. You can also swap the items this way: ($F[0], $F[1]) = ($F[1], $F[0]).
Finally, it joins the array items with commas and prints them to the standard output.
If you want to edit the file in-place, use -i option: perl -i.backup -F, ....
AWK
awk -F, -vOFS=, '/^"[0-9]+",/ {print; next}
{ t = $1; $1 = $2; $2 = t; print }' file
The input and output field separators are set to , with -F, and -vOFS=,.
If the line matches the pattern /^"[0-9]+",/ (the line begins with a "numeric" CSV column), the script prints the record and advances to the next record. Otherwise the next block is executed.
In the next block, it swaps the first two columns and prints the result to the standard output.
If you want to edit the file in-place, see answers to this question.
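GNU awk 4.1 or later can also do the in-place edit itself via its inplace extension; a sketch of the same script (this rewrites file, so test on a copy first):
gawk -i inplace -F, -vOFS=, '/^"[0-9]+",/ {print; next}
{ t = $1; $1 = $2; $2 = t; print }' file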