How to merge 2 rows into 1 row at the same column using awk - linux

I just started using UNIX and don't have much experience with scripting. I am struggling to merge 2 rows at the same column. Below is the original data.
The columns are split across 2 rows but ideally should be on 1 row.
But I don't know how to do it.
Original File
User Middle Last
Name Name Name
Htat Ko Lin
John Smith Bill
Trying to achieve:
UserName MiddleName LastName
Htat Ko Lin
John Smith Bill
Thanks!
Htat Ko

This can be done using awk and for loops
awk 'NR==1{for(i=1;i<=NF;i++)a[i]=$i;next}NR==2{for(i=1;i<=NF;i++)$i=a[i]$i}1' file
Output
UserName MiddleName LastName
Htat Ko Lin
John Smith Bill
Explanation
NR==1
If the record number is 1, i.e. the first record, then execute the block that follows.
for(i=1;i<=NF;i++)
Loop from one to the number of fields (NF), incrementing by one each time.
a[i]=$i
Using i as the key, set an element of the array a to field i ($i).
next
Skip all further instructions and move to the next record.
NR==2
Same as before but for record 2
for(i=1;i<=NF;i++)
Exactly the same as before
$i=a[i]$i
Set field i to the value stored in the array followed by the field itself, joining the two header words.
1
This always evaluates to true, so all lines are printed unless next has been used.
Additional notes
If you want to keep the columns aligned, the easiest way to do this is to pipe that command into column -t:
awk '...' file | column -t
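For example, with the sample file above, the combined command might produce output like this (the exact spacing depends on your column implementation):
awk 'NR==1{for(i=1;i<=NF;i++)a[i]=$i;next}NR==2{for(i=1;i<=NF;i++)$i=a[i]$i}1' file | column -t
UserName  MiddleName  LastName
Htat      Ko          Lin
John      Smith       Bill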
Reduced version
awk '{for(i=1;i<=NF;i++)(NR==2&&$i=a[i]$i)||a[i]=$i}NR>1' file

Related

Move all rows in a tsv with a certain date to their own file

I have a TSV file with 4 columns in this format
dog phil tall 2020-12-09 12:34:22
cat jill tall 2020-12-10 11:34:22
The 4th column is a date string, for example 2020-12-09 12:34:22.
I want every row with the same date to go into its own file
For example,
file 20201209 should have all rows that start with 2020-12-09 in the 4th column
file 20201210 should have all rows that start with 2020-12-10 in the 4th column
Is there any way to do this through the terminal?
With GNU awk to allow potentially large numbers of concurrently open output files and gensub():
awk '{print > gensub(/-/,"","g",$(NF-1))}' file
With any awk:
awk '{out=$(NF-1); gsub(/-/,"",out); if (seen[out]++) print >> out; else print > out; close(out)}' file
There are ways to speed up either script by sorting the input first, if that's an issue.
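As a rough sketch of that idea (assuming whitespace-separated fields as in the sample, so the date is the 4th field), sorting first means only one output file needs to be open at a time:
sort -k4,4 file |
awk '{out=$(NF-1); gsub(/-/,"",out)
      if (out!=prev) {if (prev!="") close(prev); prev=out}
      print > out}'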

Split cells containing ALLCAPS words

I have a .csv file containing a column "First + Last name". I'd like to split the cells to get 2 columns (First name and Last name). In each cell, the last name is written in ALLCAPS. So here is my file right now:
First + Last name
-----------------------------
John DOE
Marie-Helen ANDRE-JACQUES
Jean-Claude DOE
And I'd like to split the cells so I have:
First name | Last name
--------------------------------------------------
John | DOE
Marie-Helen | ANDRE-JACQUES
Jean-Claude | DOE
How would I do this in Excel (or Numbers)?
A good way to solve this problem is to use the Flash Fill feature of Excel:
https://support.office.com/en-us/article/using-flash-fill-in-excel-3f9bcf1e-db93-4890-94a0-1578341f73f7?ui=en-US&rs=en-US&ad=US
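If you'd rather do it from the command line instead of Excel, a minimal awk sketch (assuming a file named names.csv with one name per line after the header, where the last whitespace-separated word is the ALLCAPS last name) could look like this:
awk 'NR==1 {print "First name,Last name"; next}
     {last=$NF; first=substr($0, 1, length($0)-length(last)-1); print first "," last}' names.csv
This simply splits each line at the last space, so it will not handle last names that themselves contain spaces.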

Manipulate CSV file: increment cell coordinates/position

I have a CSV file with one entry on each line; three entries form a whole dataset. What I need to do now is put these sets into columns on one row. I have difficulties describing the problem (thus my search did not turn up a solution), so here's an example.
Sample CSV file:
1 Joe
2 Doe
3 7/7/1990
4 Jane
5 Done
6 6/6/2000
What I want in the end is this:
1 Name Surname Birthdate
2 Joe Doe 7/7/1990
3 Jane Done 6/6/2000
I'm trying to find a way to do this automatically, as my actual file consists of 480 datasets, each containing 16 entries, and it would take me days to do it manually.
I was able to fill the first line with Excel's INDIRECT function:
=INDIRECT("A"&COLUMN()-COLUMN($A1))
As COLUMN returns the column number, if I drag the first line down in Excel, obviously this shows exactly the same as the first line:
1 Name Surname Birthdate
2 Joe Doe 7/7/1990
3 Joe Doe 7/7/1990
Now I'm looking for a way to increment the cell position by one:
A B C D
1 Joe =A1 =B1+1 =C1+1
2 Doe =D1+1
3 7/7/1990
4 Jane
Which should lead to:
A B C D
1 Joe =A1 =A2 =A3
2 Doe =A4 =A5 =A6
3 7/7/1990
4 Jane
As you can see in the example, the cell references to column A increment by one, and I have no idea how to do this automatically in Excel. I think there must be a better way than nesting Excel functions, as the task (increment by 1) actually seems pretty easy.
I'm also open to solutions involving sed, awk (of which I only have a very superficial knowledge) or other command line tools.
Your help is appreciated very much!
awk 'BEGIN { y=1; printf "Name Surname Birthdate\n%s",y; x=1; }
{
    if (x == 3) {
        y = y + 1;
        printf "%s\n%s",$2,y;
        x=1;
    }
    else {
        printf " %s ",$2;
        x = x + 1;
    }
}' input_file.txt
This may work for what you want to do. Your sample does not include commas, so I'm not sure whether they are really in there or not. If they are, you will need to modify the code slightly with the -F, flag so that awk treats commas as field separators.
This second code snippet provides the output with a comma delimiter. Again, it assumes that your sample input file did not use commas to delimit the 1 Joe and 2 Doe entries.
awk 'BEGIN { y=1; printf "Name Surname Birthdate\n%s",y; x=1; }
{
    if (x == 3) {
        y = y + 1;
        printf "%s\n%s,",$2,y;
        x=1;
    }
    else {
        printf " %s,",$2;
        x = x + 1;
    }
}' input_file.txt
Both awk scripts set the x and y variables to one; the y variable increments your line numbering. The x variable counts up to 3 and then resets itself back to one, so each entry is printed on the same row until the 3rd item is reached, at which point a newline character is inserted.
There are easier (or more complex) ways to do this with regexes and a language like Perl, but since you mentioned awk, I believe this will work fine.
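Since the question mentions 480 datasets of 16 entries each, here is a more general sketch (assuming the same two-column layout as the sample, with the value in the second field) that joins every n input lines into one output row:
awk -v n=16 '{printf "%s%s", $2, (NR % n ? OFS : ORS)}' input_file.txt
With -v n=3 it reproduces the three-entry example above; the header row would still need to be added separately.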

How to sort lines in a text file according to a second text file

I have two text files.
File A.txt:
john
peter
mary
alex
cloey
File B.txt
peter does something
cloey looks at him
franz is the new here
mary sleeps
I'd like to
merge the two
sort one file according to the other
put the unknown lines of B at the end
like this:
john
peter does something
mary sleeps
alex
cloey looks at him
franz is the new here
$ awk '
NR==FNR { b[$1]=$0; next }
{ print ($1 in b ? b[$1] : $1); delete b[$1] }
END { for (i in b) print b[i] }
' fileB fileA
john
peter does something
mary sleeps
alex
cloey looks at him
franz is the new here
The above will print the remaining items from fileB in a "random" order (see http://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array for details). If that's a problem, then edit your question to clarify your requirements for the order in which those need to be printed.
It also assumes the keys in each file are unique (e.g. peter only appears as a key value once in each file). If that's not the case, then again edit your question to include cases where a key appears multiple times in your sample input/output, and additionally explain how you want them handled.
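If the leftover fileB lines need to come out in the order they appear in fileB rather than a random order, a small variation (with the same uniqueness assumption) would be:
awk '
NR==FNR { b[$1]=$0; order[++n]=$1; next }
{ print ($1 in b ? b[$1] : $1); delete b[$1] }
END { for (i=1; i<=n; i++) if (order[i] in b) print b[order[i]] }
' fileB fileA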

Grep find lines that have 4,5,6,7 and 9 in zip code column

I'm using grep to display all lines that have ONLY 4,5,6,7 and 9 in the zipcode column.
How do I display only the lines of the file that contain the numbers 4, 5, 6, 7 and 9 in the zipcode field?
A sample row is:
15 m jagger mick 41 4th 95115
Thanks
I am going to assume you meant "How do I use grep to..."
If all of the lines in the file have a 5 digit zip at the end of each line, then:
egrep "[45679]{5}$" filename
Should give you what you want.
If there might be whitespace between the zip and the end of the line, then:
egrep "[45679]{5}[[:space:]]*$" filename
would be more robust.
If the problem is more general than that, please describe it more accurately.
The following regex should fetch the desired result:
egrep "[45679]+$" file
If by "grep" you mean, "the correct tool", then the solution you seek is:
awk '$7 ~ /^[45679]*$/' input
This will print all lines of input in which the 7th field consists only of the characters 4,5,6,7, and 9. If you want to specify 'the last column' rather than the 7th, try
awk '$NF ~ /^[45679]*$/' input
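If it really has to be grep and the zip is always the 7th whitespace-separated field, an extended regex along these lines (a sketch, assuming that fixed layout) expresses the same condition:
grep -E '^([^[:space:]]+[[:space:]]+){6}[45679]+[[:space:]]*$' input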
