How to sort lines in a text file according to a second text file - Linux

I have two text files.
File A.txt:
john
peter
mary
alex
cloey
File B.txt:
peter does something
cloey looks at him
franz is the new here
mary sleeps
I'd like to:
- merge the two
- sort one file according to the other
- put the unknown lines of B at the end
like this:
john
peter does something
mary sleeps
alex
cloey looks at him
franz is the new here

$ awk '
NR==FNR { b[$1]=$0; next }
{ print ($1 in b ? b[$1] : $1); delete b[$1] }
END { for (i in b) print b[i] }
' fileB fileA
john
peter does something
mary sleeps
alex
cloey looks at him
franz is the new here
The above will print the remaining items from fileB in a "random" order (see http://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array for details). If that's a problem, then edit your question to clarify your requirements for the order those need to be printed in.
It also assumes the keys in each file are unique (e.g. peter only appears as a key value once in each file). If that's not the case, then again edit your question to include cases where a key appears multiple times in your sample input/output, and additionally explain how you want them handled.
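If the leftover fileB lines must instead come out in their original fileB order, a small variant of the above (an untested sketch, same fileA/fileB inputs) can record the key order explicitly:
awk '
NR==FNR { b[$1]=$0; order[++n]=$1; next }    # fileB: index by key, remember key order
{ print ($1 in b ? b[$1] : $1); delete b[$1] }
END { for (i=1; i<=n; i++) if (order[i] in b) print b[order[i]] }
' fileB fileA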

Related

How to remove 1 instance of each (identical) line in a text file in Linux?

There is a file:
Mary
Mary
Mary
Mary
John
John
John
Lucy
Lucy
Mark
I need to get
Mary
Mary
Mary
John
John
Lucy
I cannot work out how to get the lines ordered according to how many times each line is repeated in the text, i.e. the most frequently occurring lines must be listed first.
If your file is already sorted (most-frequent words at top, repeated words only on consecutive lines), and your question makes it look like that's the case, you could reformulate your problem as: "Skip a word when it is encountered for the first time". Then a possible (and efficient) awk solution would be:
awk 'prev==$0{print}{prev=$0}'
or if you prefer an approach that looks more familiar if coming from other programming languages:
awk '{if(prev==$0)print;prev=$0}'
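Applied to the sample above (file.txt is a hypothetical filename), either one-liner produces exactly the desired output:
$ awk 'prev==$0{print}{prev=$0}' file.txt
Mary
Mary
Mary
John
John
Lucy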
Partially working solutions are below. I'll keep them for reference; maybe they are helpful to somebody else.
If your file is not too big, you could use awk to count identical lines and then output each group the number of times it occurred, minus 1.
awk '
{ lines[$0]++ }
END {
    for (line in lines) {
        for (i = 1; i < lines[line]; ++i) {
            print line
        }
    }
}
'
Since you mentioned that the most frequent line must come first, you have to sort first:
sort | uniq -c | sort -nr | awk '{count=$1;for(i=1;i<count;++i){$1="";print}}' | cut -c2-
Note that the latter will reformat your lines (e.g. collapsing/squeezing repeated spaces). See Is there a way to completely delete fields in awk, so that extra delimiters do not print?
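If that reformatting matters, a hedged variant (file is a hypothetical filename) strips only the count that uniq -c prepends, leaving each line byte-for-byte intact:
sort file | uniq -c | sort -rn | awk '{ n=$1; sub(/^ *[0-9]+ /,""); while (--n) print }'
Here sub() removes just the leading count and its padding, and the while loop prints each line one time fewer than it occurred.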
Don't sort if you don't need to. (Note: _ and __ are just uninitialized awk variables here, so $-__ and $_ both evaluate to $0; each one-liner is the familiar seen[$0]++ idiom, printing a line only from its second occurrence onward.)
nawk '_[$-__]--'
gawk '__[$_]++'
mawk '__[$_]++'
Mary
Mary
Mary
John
John
Lucy
For 1 GB+ files, you can speed things up a bit by preventing FS from splitting unnecessary fields:
mawk2 '__[$_]++' FS='\n'
For 100 GB inputs, one idea would be to use parallel to create, say, 10 instances of awk, piping the full 100 GB to each instance but assigning each of them a particular range to partition on at their end (e.g. instance 4 handles lines beginning with F through Q). Instead of outputting it all and then attempting to sort the monstrosity, each instance could simply tally up and print only a frequency report of how many copies ("Nx") of each unique line ("Lx") it has recorded.
From there one could sort the much smaller report along the column holding the Lx's, then pipe it to one more awk that prints out Nx copies of each line Lx; a single-machine sketch of that follows below.
That is probably a lot faster than trying to sort 100 GB.
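A minimal single-machine sketch of that tally-then-expand idea (the parallel partitioning layer is omitted; bigfile is a hypothetical filename):
# stage 1: emit a small frequency report, one "count<TAB>line" row per unique line
awk '{ n[$0]++ } END { for (l in n) printf "%d\t%s\n", n[l], l }' bigfile |
# stage 2: sort the report (tiny compared to the input), most frequent first
sort -rn |
# stage 3: expand, printing count-1 copies of each line so one instance is dropped
awk '{ c=$1; sub(/^[0-9]+\t/,""); while (--c) print }'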
I created a test scenario by cloning 71 shuffled copies of a raw file with these stats:
uniq rows = 8125950. | UTF8 chars = 160950688. | bytes = 160950688.
i.e. 8.12 million unique rows spanning 154 MB, resulting in a 10.6 GB test file:
in0: 10.6GiB 0:00:30 [ 354MiB/s] [ 354MiB/s] [============>] 100%
rows = 576942450. | UTF8 chars = 11427498848. | bytes = 11427498848.
Even when using just a single instance of awk, it finished filtering the 10.6 GB in about 13.25 minutes, which is reasonable given that it's tracking 8.1 million unique hash keys.
in0: 10.6GiB 0:13:12 [13.7MiB/s] [13.7MiB/s] [============>] 100%
out9: 10.5GiB 0:13:12 [13.6MiB/s] [13.6MiB/s] [<=> ]
( pvE 0.1 in0 < testfile.txt | mawk2 '__[$_]++' FS='\n' )
783.31s user 15.51s system 100% cpu 13:12.78 total
5e5f8bbee08c088c0c4a78384b3dd328 stdin

Manipulate CSV file: increment cell coordinates/position

I have a CSV file with one entry on each line; three entries form a whole dataset. What I need to do now is put these sets into columns in one row. I had difficulties describing the problem (thus my search gave me no solution), so here's an example.
Sample CSV file:
1 Joe
2 Doe
3 7/7/1990
4 Jane
5 Done
6 6/6/2000
What I want in the end is this:
1 Name Surname Birthdate
2 Joe Doe 7/7/1990
3 Jane Done 6/6/2000
I'm trying to find a solution to do this automatically, as my actual file consists of 480 datasets, each set containing 16 entries, and it would take me days to do it manually.
I was able to fill the first line with Excel's indirect function:
=INDIRECT("A"&COLUMN()-COLUMN($A1))
As COLUMN returns the column number, if I drag the first line down in Excel, obviously this shows exactly the same as the first line:
1 Name Surname Birthdate
2 Joe Doe 7/7/1990
3 Joe Doe 7/7/1990
Now I'm looking for a way to increment the cell position by one:
A B C D
1 Joe =A1 =B1+1 =C1+1
2 Doe =D1+1
3 7/7/1990
4 Jane
Which should lead to:
A B C D
1 Joe =A1 =A2 =A3
2 Doe =A4 =A5 =A6
3 7/7/1990
4 Jane
As you can see in the example given, the cell coordinates for A increment by one, and I have no idea how to do this automatically in Excel. I think there must be a better way than using nested Excel functions, as the task (increment by 1) seems actually pretty easy.
I'm also open to solutions involving sed, awk (of which I only have a very superficial knowledge) or other command line tools.
Your help is appreciated very much!
awk 'BEGIN { y=1; printf "Name Surname Birthdate\n%s",y; x=1; }
{
    if (x == 3) {
        y = y + 1;
        printf "%s\n%s",$2,y;
        x = 1;
    }
    else {
        printf " %s ",$2;
        x = x + 1;
    }
}' input_file.txt
This may work for what you want to do. Your sample does not include the commas, so I'm not sure if they are really in there or not. If they are, you will need to modify the invocation slightly with the -F, flag so that awk treats commas as field separators.
This second code snippet will provide the output with a comma delimiter. Again, it is assuming that your sample input file did not have commas to delimit the 1 Joe and 2 Doe.
awk 'BEGIN { y=1; printf "Name Surname Birthdate\n%s",y; x=1; }
{
    if (x == 3) {
        y = y + 1;
        printf "%s\n%s,",$2,y;
        x = 1;
    }
    else {
        printf " %s,",$2;
        x = x + 1;
    }
}' input_file.txt
Both awk scripts set the x and y variables to one; the y variable increments your line numbering. The x variable counts up to 3 and then resets itself back to one, so that three entries are printed per row before a newline character is inserted.
There are easier/more complex ways to do this with regexes and a language like Perl, but since you mentioned awk, I believe this will work fine.
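For comparison, here is a more compact sketch of the same reshaping (hedged: it assumes the space-separated sample exactly as shown, without commas, and it drops the row numbering; input_file.txt is hypothetical):
awk 'BEGIN { print "Name Surname Birthdate" }
{ printf "%s%s", $2, (NR % 3 ? " " : "\n") }' input_file.txt
Every third value ends its row with a newline; every other value is followed by a space.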

save multiple matches in a list (grep or awk)

I have a file that looks something like this:
# a mess of text
Hello. Student Joe Deere has
id number 1. Over.
# some more messy text
Hello. Student Steve Michael Smith has
id number 2. Over.
# etc.
I want to record the pairs (Joe Deere, 1), (Steve Michael Smith, 2), etc. into a list (or two separate lists with the same order). Namely, I will need to loop over those pairs and do something with the names and ids.
(names and ids are on distinct lines, but come in the order: name1, id1, name2, id2, etc. in the text). I am able to extract the lines of interest with
VAR=$(awk '/Student/,/Over/' filename.txt)
I think I know how to extract the names and ids with grep, but it will give me the result as one big block like
`Joe Deere 1 Steve Michael Smith 2 ...`
(and maybe even with a separator between names and ids). I am not sure at this point how to go forward with this, and in any case it doesn't feel like the right approach.
I am sure that there is a one-liner in awk that will do what I need. The possibilities are infinite and the documentation monumental.
Any suggestion?
$ cat tst.awk
/^id number/ {
    # strip the first two words ("Hello. Student") and the last word ("has")
    # from the previously saved name line
    gsub(/^([^ ]+ ){2}| [^ ]+$/,"",prev)
    printf "(%s, %d)\n", prev, $3
}
{ prev = $0 }
$ awk -f tst.awk file
(Joe Deere, 1)
(Steve Michael Smith, 2)
You could also try the following:
awk '
/id number/{
    sub(/\./,"",$3)        # strip the trailing dot from the id
    print val", "$3
    val=""
    next
}
{
    gsub(/Hello\. Student | has.*/,"")   # keep only the student name
    val=$0
}
' Input_file
grep -oP 'Hello. Student \K.+(?= has)|id number \K\d+' file | paste - -
This uses PCRE \K to keep only the name and the id number from their respective lines, then paste - - joins each consecutive name/id pair onto one tab-separated line.

How to merge 2 rows into 1 row at the same column using awk

I just started using UNIX and don't have much experience in scripting. Now I am struggling a lot to merge 2 rows at the same column. Below is the original data.
The column headers are split across 2 rows but ideally should be in 1 row, and I don't know how to do it.
Original File
User Middle Last
Name Name Name
Htat Ko Lin
John Smith Bill
Trying to achieve:
UserName MiddleName LastName
Htat Ko Lin
John Smith Bill
Thanks!
This can be done using awk and for loops
awk 'NR==1{for(i=1;i<=NF;i++)a[i]=$i;next}NR==2{for(i=1;i<=NF;i++)$i=a[i]$i}1' file
Output
UserName MiddleName LastName
Htat Ko Lin
John Smith Bill
Explanation
NR==1
If the record number is 1, i.e. the first record, then execute the next block.
for(i=1;i<=NF;i++)
Loop from one to the number of fields (NF), incrementing by one each time.
a[i]=$i
Using i as a key, set an array element in the array a to the field i ($i).
next
Skip all further instructions and move to the next record.
NR==2
Same as before but for record 2
for(i=1;i<=NF;i++)
Exactly the same as before
$i=a[i]$i
Set field i to the stored header value from the array followed by its current value.
1
Defaults to true, so it prints all lines unless next has been used.
Additional notes
If you want to keep the columns aligned, the easiest way to do this is to pipe that command into column -t:
awk '...' file | column -t
Reduced version
awk '{for(i=1;i<=NF;i++)(NR==2&&$i=a[i]$i)||a[i]=$i}NR>1' file
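If the header ever spans more than two rows, a hedged generalization of the same idea (hdr is a hypothetical variable holding the number of header rows):
awk -v hdr=2 '
NR <= hdr {
    for (i = 1; i <= NF; i++) a[i] = a[i] $i   # glue each header row onto its column
    if (NR < hdr) next
    out = a[1]
    for (i = 2; i <= NF; i++) out = out OFS a[i]
    print out                                  # emit the merged header once
    next
}
1' file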

Expand one column while preserving another

I am trying to get column one repeated for every value in column two, with each value on a new line.
cat ToExpand.txt
Pete horse;cat;dog
Claire car
John house;garden
My first attempt:
cat expand.awk
BEGIN {
    FS="\t"
    RS=";"
}
{
    print $1 "\t" $2
}
awk -f expand.awk ToExpand.txt
Pete horse
cat
dog
Claire car
John
garden
The desired output is:
Pete horse
Pete cat
Pete dog
Claire car
John house
John garden
Am I on the right track here or would you use another approach? Thanks in advance.
You could also change the FS value into a regex and do something like this:
awk -F"\t|;" -v OFS="\t" '{for(i=2;i<=NF;i++) print $1, $i}' ToExpand.txt
Pete horse
Pete cat
Pete dog
Claire car
John house
John garden
I'm assuming that:
The first tab is the delimiter for the name
There's only one tab delimiter. If tab-delimited data occurs after the ; section, use fedorqui's implementation.
It uses an alternate form of setting the OFS value (the -v flag) and loops over the fields after the first to print the expected output.
You can think of RS in your example as making "lines" (records, really) out of your data, and your print block acts on those records instead of on the normal newline-delimited lines. Each record is then further split into fields by your FS. That's why you get the output from your first attempt. You can explore that by printing out the value of NF in your example, as in the probe below.
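To explore that, a quick probe (same ToExpand.txt input) shows how RS=";" regroups the data into records before FS splits them into fields:
$ awk 'BEGIN { FS = "\t"; RS = ";" } { print "record " NR ": NF=" NF }' ToExpand.txt
record 1: NF=2
record 2: NF=1
record 3: NF=3
record 4: NF=1
Records 2 and 4 contain no tab at all, which is why cat and garden were printed without a name.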
Try:
awk '{gsub(/;/,ORS $1 OFS)}1' OFS='\t' file
This replaces every occurrence of a semicolon with a newline, the first field, and the output field separator, so the Pete record becomes three tab-separated Pete lines before the trailing 1 prints the result.
Output:
Pete horse
Pete cat
Pete dog
Claire car
John house
John garden
