awk: merge multiple rows into one row per first-column record value - linux

I have one text file which contains some records like below,
100,a
b
c
101,d,e
f
102,g
103,h
104,i
j
k
So, some rows start with a number and some start with a string; I want to merge each run of string rows into the preceding numbered row, like below:
100,a,b,c
101,d,e,f
102,g
103,h
104,i,j,k
How can I use awk to do this?
Thanks

You can do something like:
awk '/^[0-9]/{if(buf){print buf};buf=$0}/^[a-zA-Z]/{buf=buf","$0}END{print buf}' yourfile.txt
This will:
Check whether the current line starts with a number: /^[0-9]/
If so, print what is stored in the variable buf, provided that variable is non-empty: if(buf){print buf}
Then reset buf to the current line: buf=$0
If the current line starts with a letter (/^[a-zA-Z]/),
append the current line to buf with a comma separator: buf=buf","$0
Finally, at the end of the file, print whatever is left in buf: END{print buf}
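If you prefer it spread out, here is the same logic as a multi-line script (a sketch equivalent to the one-liner above, with an added guard so nothing is printed for an empty file):
awk '
/^[0-9]/ {                    # line starts with a digit: start a new record
    if (buf) print buf        # flush the previous record, if any
    buf = $0
}
/^[a-zA-Z]/ {                 # continuation line: append with a comma
    buf = buf "," $0
}
END { if (buf) print buf }    # flush the last record
' yourfile.txt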

Related

How to insert a column at the start of a txt file using awk?

How can I insert a column at the start of a txt file, running from 1 to 2059 (which corresponds to the number of rows I have in my file), using awk? I know the command will be something like this:
awk '{$1=" "}1' File
Not sure what to put between the speech marks to produce 1-2059?
I also want to include a header in the header row, so technically 1 should only go in the second row.
ID Heading1
RQ1293939 -7.0494
RG293I32SJ -903.6868
RQ19238983 -0899977
rq747585950 988349303
FID ID Heading1
1 RQ1293939 -7.0494
2 RG293I32SJ -903.6868
3 RQ19238983 -0899977
4 rq747585950 988349303
So I need to insert the FID with 1 - 2059 running down the first column
What you show does not work: it just replaces the first field ($1) with a space and prints the result. If you do not have empty lines, try:
awk 'NR==1 {print "FID\t" $0; next} {print NR-1 "\t" $0}' File
Explanations:
NR is the awk variable that counts the records (the lines, in our case), starting from 1. So NR==1 is a condition that holds only when awk processes the first line. In this case the action block says to print FID, a tab (\t), and the original line ($0), and then to move to the next line.
The second action block is executed only if the first one has not been executed (due to the final next statement). It prints NR-1, that is the line number minus one, a tab, and the original line.
If you have empty lines and you want to skip them we will need a counter variable to keep track of the current non-empty line number:
awk 'NR==1 {print "FID\t" $0; next} NF==0 {print; next} {print ++cnt "\t" $0}' File
Explanations:
NF is the awk variable that counts the fields in a record (the space-separated words, in our case). So NF==0 is a condition that holds only on empty lines (or lines that contain only spaces). In this case the action block says to print the empty line and move to the next.
The last action block is executed only if none of the two others have been executed (due to their final next statement). It increments the cnt variable, prints it, prints a tab, and prints the original line.
Uninitialized awk variables (like cnt in our example) take value 0 when they are used for the first time as a number. ++cnt increments variable cnt before its value is used by the print command. So the first time this block is executed cnt takes value 1 before being printed. Note that cnt++ would increment after the printing.
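To see the pre/post-increment difference in isolation (a minimal illustration):
awk 'BEGIN { c = 0; print ++c }'    # prints 1: increment first, then use
awk 'BEGIN { c = 0; print c++ }'    # prints 0: use first, then increment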
Assuming you don't really have a blank row between your header line and the rest of your data:
awk '{print (NR>1 ? NR-1 : "FID"), $0}' file
Use awk -v OFS='\t' '...' file if you want the output to be tab-separated or pipe it to column -t if you want it visually tabular.
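For illustration, running the tab-separated variant on the sample data above would give something like:
$ awk -v OFS='\t' '{print (NR>1 ? NR-1 : "FID"), $0}' File
FID	ID Heading1
1	RQ1293939 -7.0494
2	RG293I32SJ -903.6868
3	RQ19238983 -0899977
4	rq747585950 988349303
(The tab only separates the new column; the rest of each line keeps its original spacing.)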

Edit values in one column in 4,000,000 row CSV file

I have a CSV file that I am trying to edit to add a numeric ID-type column with unique integers from 1 to approx. 4,000,000. Some of the fields already have an ID value, so I was hoping I could sort those and then fill in the rest starting from the largest value + 1. However, I cannot open this file for editing in Excel because of its size (I can only see a maximum of 1,048,000 or so rows). Is there an easy way to do this? I am not familiar with coding, so I was hoping there was a way to do it manually, similar to Excel's fill-series feature.
Thanks!
Also: I know there are threads on how to edit a large CSV file, but I was hoping for help with this specific task. Thanks!
To summarize: I basically want to sort the rows by idnumber and then add unique IDs to the rows without an ID value.
[Screenshot of file]
One way, using Notepad++ and a plugin named SQL:
Load the CSV in Notepad++
SELECT a+1,b,c FROM data
Hit 'start'
When starting with a file like this:
a,b,c
1,2,3
4,5,6
7,8,9
The results after look like this:
SQL Plugin 1.0.1025
Query : select a+1,b,c from data
Sourcefile : abc.csv
Delimiter : ,
Number of hits: 3
===================================================================================
Query result:
2,2,3
5,5,6
8,8,9
Or, in words, the first column is incremented by 1.
2nd solution, using gawk, downloaded from https://www.klabaster.com/freeware.htm#mawk:
D:\TEMP>type abc.csv
a,b,c
1,2,3
4,5,6
7,8,9
D:\TEMP>gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print $1+1,$2,$3 }" abc.csv
a,b,c
2,2,3
5,5,6
8,8,9
(g)awk is a tool which reads a file line by line. Each line is then accessible via $0, and its fields via $1, $2, $3, ..., split on a separator.
This separator is set in my example (FS=OFS=\",\";) in the BEGIN section, which is executed only once per input file. Do not get confused by the \": the whole script is enclosed in double quotes (as the Windows shell requires), and the separator value is a double-quoted string too, so the inner quotes have to be escaped as \".
The getline; print $0 takes care of the first line of the CSV, which typically holds the column names.
Then, for every line, this piece of code print $1+1,$2,$3 will increment the first column, and print the second and third column.
To extend this second example:
gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print ($1<5?$1+1:$1),$2,$3 }" abc.csv
The ($1<5?$1+1:$1) checks whether the value of $1 is less than 5 ($1<5); if true, it returns $1+1, otherwise $1. In words: it only adds 1 if the current value is less than 5.
With your data you end up with something like this (untested!):
gawk "BEGIN{ FS=OFS=\",\"; getline; a=42; print $0 }{ if($4+0==0){ a++ }; print ($4<=0?$a:$1),$2,$3 }" input.csv
a=42 to set the initial value for the column values which needs to be update (you need to change this to the correct value )
The if($4+0==0){ a++ } will increment the value of a when the fourth column equals 0 (The $4+0 is done to convert empty values like "" to a numeric value 0).
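On Linux, where the script can be single-quoted so the \" escaping disappears, the same idea might look like this (still a sketch under the same assumptions: the ID sits in the fourth column, 42 is a placeholder for the largest existing ID, and NR==1 replaces getline for the header):
awk 'BEGIN { FS = OFS = ","; a = 42 }
     NR == 1 { print; next }          # pass the header line through
     $4 + 0 == 0 { $4 = ++a }         # empty or zero ID: assign the next number
     { print }' input.csv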

How to edit the left hand column and replace with values in Linux?

I have a text file in which every value in the left-hand column is '0'. Is there a way to change only the left-hand column, replacing all of those zeros with the value 15? I can't use find-and-replace because other columns contain '0' values which must not be altered, and it can't be done manually because the file contains 10,000 lines. I'm wondering if this is possible from the command line or with a script.
Thanks
Using awk:
awk '$1 == 0 { $1 = 15 } 1' file.txt
Replaces the first column with 15 on each line only if the original value is 0.
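For example, with a hypothetical input like this:
$ cat file.txt
0 12 0
5 0 7
0 8 9
$ awk '$1 == 0 { $1 = 15 } 1' file.txt
15 12 0
5 0 7
15 8 9
Note that awk rebuilds a line with single spaces between fields on the lines where $1 is assigned, so any original column alignment may change on those lines.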

Add new column based on value of an existing column

I am trying to transform a delimited file into a table of data in Linux. The meaning of the values in certain columns depends on the value in a separate column. How can I create additional columns based on the value of that column?
Depending on the value of column 2 (i.e. 00 or 01), the interpretation of columns 3 and 4 differs. So if I had the following values:
A1,00,N1,T1
A1,01,N2,T2
A2,00,N3,T3
A2,01,N4,T4
The expected result should be as follows; notice how I now have two new columns:
A1,00,N1,T1,N2,T2
A2,00,N3,T3,N4,T4
$ awk -F, ' #1
{A[$1] = A[$1] FS $3 FS $4} #2
END {for(i in A) print i FS "00" A[i]} #3
' file
A1,00,N1,T1,N2,T2
A2,00,N3,T3,N4,T4
Set Field Separator to comma.
On every line, set Array[first-column] to its current value followed by the third and fourth columns.
At the end, for every index, print the index name, a comma, the string "00", and the value of that index.
The end value of A["A1"] is ,N1,T1,N2,T2
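One caveat: for (i in A) visits array indices in an unspecified order, so the output rows may not follow the input order. If that matters, here is a sketch that remembers the first-seen order of the keys (order and n are names introduced here for illustration):
awk -F, '
!($1 in A) { order[++n] = $1 }        # record keys in first-seen order
{ A[$1] = A[$1] FS $3 FS $4 }
END { for (j = 1; j <= n; j++) print order[j] FS "00" A[order[j]] }
' file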

Append string to column on command line

I have a three-column file to which I would like to append another column, just one word repeated on every line. I tried the following:
paste file.tsv <(echo 'new_text') > new_file.tsv
But the text 'new_text' only appears on the first line, not every line.
How can I get 'new_text' to appear on every line?
Thanks
sed '1,$ s/$/;ABC/' infile > outfile
This appends ;ABC to every line: the $ anchor matches the end of each line, and the 1,$ address range (the default, so it could be omitted) applies the substitution to all lines.
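To match the original question (a tab-separated file and the word new_text), the same idea in awk would be (a sketch):
awk '{ print $0 "\tnew_text" }' file.tsv > new_file.tsv
This appends a tab and new_text to every input line; unlike the paste attempt, the constant is printed once per line rather than once per file.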
