Shuffling pairs of lines in two text files - nlp

I'm working on a machine translation project in which I have 4.5 million lines of text in two languages, English and German. I would like to shuffle these lines prior to dividing the data into shards on which I will train my model. I know the shuf command described here allows one to shuffle lines in one file, but how can I ensure that corresponding lines in the second file are also shuffled into the same order? Is there a command to shuffle lines in both files?

TL;DR
1. paste the two files together so each file becomes one tab-separated column of a combined file
2. shuf the combined file
3. cut the combined file back into its two columns
Paste
$ cat test.en
a b c
d e f
g h i
$ cat test.de
1 2 3
4 5 6
7 8 9
$ paste test.en test.de > test.en-de
$ cat test.en-de
a b c 1 2 3
d e f 4 5 6
g h i 7 8 9
Shuffle
$ shuf test.en-de > test.en-de.shuf
$ cat test.en-de.shuf
d e f 4 5 6
a b c 1 2 3
g h i 7 8 9
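If the shuffle needs to be reproducible (for example, to recreate exactly the same shards later), GNU shuf also accepts --random-source=FILE; a commonly used trick, assuming GNU coreutils and bash, is to feed it a fixed byte stream:
$ shuf --random-source=<(yes 42) test.en-de > test.en-de.shuf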
Cut
$ cut -f1 test.en-de.shuf > test.en-de.shuf.en
$ cut -f2 test.en-de.shuf > test.en-de.shuf.de
$ cat test.en-de.shuf.en
d e f
a b c
g h i
$ cat test.en-de.shuf.de
4 5 6
1 2 3
7 8 9
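paste joins corresponding lines with a tab by default and cut -f splits on tabs, so the round trip keeps the sentence pairs aligned as long as neither input file itself contains tab characters. If you would rather not keep the intermediate file, the three steps can also be chained into one pipeline; a minimal sketch in bash (the output file names are just examples):
$ paste test.en test.de | shuf | tee >(cut -f1 > test.shuf.en) | cut -f2 > test.shuf.de
Here tee feeds the shuffled, still-paired lines to both cut commands, so the two output files stay line-aligned.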

Related

How to reorder columns of hundreds of tab-delimited files in Linux?

I have a couple of hundred large tab-delimited files, but the order of the columns differs across them (the same columns, just in different positions). I need to reorder the columns in all of the files and write them back in tab-delimited format.
I would like to write a shell script that takes a specified column order, reorders the columns in every file, and writes the result back. Can someone help me with it?
Here is what the header of my files looks like:
file 1:
sLS72 chrX
A B E C F H
2 1 4 5 7 8
0 0 0 0 0 0
and the header of my second file:
S721 chrX
A E B F H C
12 11 2 3 4 1
0 0 0 0 0 0
Here is the order of the columns that I want to achieve:
Order = [A, B, C, E, F, H]
and here is the expected output for each file based on this ordering:
file 1:
sLS72 chrX
A B C E F H
2 1 5 4 7 8
0 0 0 0 0 0
file 2:
S721 chrX
A B C E F H
12 2 1 11 3 4
0 0 0 0 0 0
I was trying to use awk:
awk -F'\t' '{s2=$A; $3=$B; $4=$C; $5=$E; $1=s}1' OFS='\t' in file
but the problem is that, first, the order of the columns differs between files, and second, the column names start on the second line of each file. In other words, the first line is a header that I don't want to change, while the second line holds the column names, so I want to reorder every file based on that second line. It's kind of tricky.
$ awk -v order="A B C E F H" '
BEGIN {n=split(order,ho)}
FNR==1 {print; next}
FNR==2 {for(i=1;i<=NF;i++) hn[$i]=i}
{for(i=1;i<=n;i++) printf "%s",$hn[ho[i]] (i==n?ORS:OFS)}' file1 > tmp && mv tmp file1
$ cat file1
sLS72 chrX
A B C E F H
2 1 5 4 7 8
0 0 0 0 0 0
If working on multiple files at the same time, change the script so that each result is written to its own output file:
$ awk -v ...
{... printf "%s",$hn[ho[i]] (i==n?ORS:OFS) > (FILENAME"_reordered") }' dir/files*
and do a mass rename afterwards. An alternative is to run the original script in a loop, once per file, as sketched below.
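A minimal sketch of that per-file loop, reusing the same awk program (the dir/files* glob and the .tmp suffix are only illustrative):
$ for f in dir/files*; do
    awk -v order="A B C E F H" '
    BEGIN {n=split(order,ho)}
    FNR==1 {print; next}
    FNR==2 {for(i=1;i<=NF;i++) hn[$i]=i}
    {for(i=1;i<=n;i++) printf "%s",$hn[ho[i]] (i==n?ORS:OFS)}' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done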

unix join command to return all columns in one file

I have two files that I am joining on one column. After the join, I just want the output to be all of the columns, in the original order, from only one of the files. For example:
cat file1.tsv
1 a ant
2 b bat
3 c cat
8 d dog
9 e eel
cat file2.tsv
1 I
2 II
3 III
4 IV
5 V
join -1 1 -2 1 file1.tsv file2.tsv -t $'\t' -o 1.1,1.2,1.3
1 a ant
2 b bat
3 c cat
I know I can use the -o 1.1,1.2,... notation, but my file has over two dozen columns. Is there some wildcard that I can use to say -o 1.* or something?
I'm not aware of any wildcard support in the format string.
From your desired output, I think what you want may be achievable like this, without having to spell out the full enumeration:
grep -f <(awk '{print $1}' file2.tsv ) file1.tsv
1 a ant
2 b bat
3 c cat
Or as an awk-only solution:
awk '{if(NR==FNR){a[$1]++}else{if($1 in a){print}}}' file2.tsv file1.tsv
1 a ant
2 b bat
3 c cat
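If you would rather stay with join, you can also generate the -o list instead of typing it out; a sketch assuming tab-separated files and that the first line of file1.tsv carries the full set of columns:
$ cols=$(head -n1 file1.tsv | awk -F'\t' '{print NF}')
$ join -t $'\t' -1 1 -2 1 -o "$(seq "$cols" | sed 's/^/1./' | paste -sd, -)" file1.tsv file2.tsv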

extracting two ranges of lines from a file and putting them side by side as one data block with shell commands

I have two blocks of data in a file, say foo.txt like the following:
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
I'd like to extract rows 2:4 and 6:8 and put them side by side like this:
b 2 f 6
c 3 g 7
d 4 h 8
I could try using auxiliary files:
sed -n '2,4p' foo.txt > tmp1; sed -n '6,8p' foo.txt > tmp2; paste tmp1 tmp2 > output; rm tmp1 tmp2
But is there a better way to do it without auxiliary files? Thanks!
Using process substitution:
$ paste <(sed -n '2,4p' foo.txt) <(sed -n '6,8p' foo.txt) > output
$ cat output
b 2 f 6
c 3 g 7
d 4 h 8
$
In AWK:
$ awk 'NR==2,NR==4{a[++i]=$0} NR==6,NR==8{b[++j]=$0} END {for(i=1;i<=j;i++) print a[i],b[i]}' file
b 2 f 6
c 3 g 7
d 4 h 8
When between the given record numbers (NR), fill up arrays a and b. In the END, print them side by side.
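The same idea generalizes when the two ranges are not fixed; a small parameterized sketch of the process-substitution version (the variable names are purely illustrative):
$ start1=2 end1=4 start2=6 end2=8
$ paste <(sed -n "${start1},${end1}p" foo.txt) <(sed -n "${start2},${end2}p" foo.txt) > output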

how to compare the second column of one file with the second column of another file and get the difference

How can I compare two files? I need to compare the second column of one file with the second column of another file and get the difference.
Let's say I have the following files.
file 1:
a 3
b 6
c 8
d 7
g 5
p 16
file 2:
a 1
b 6
c 8
d 7
g 5
I need to compare column two of file 1 with column two of file 2 and get the difference.
Desired output (file 1 - file 2):
a 2
b 0
c 0
d 0
g 0
p 16
This awk one-liner works for your example:
awk 'NR==FNR{a[$1]=$2;next}{print $1,a[$1]-$2;delete a[$1]}
END{for(x in a)print x, a[x]}' file1 file2
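For reference, running it on the example files should produce output along these lines (keys present only in file 1, such as p, are printed from the END block, whose for (x in a) loop does not guarantee any particular order):
$ awk 'NR==FNR{a[$1]=$2;next}{print $1,a[$1]-$2;delete a[$1]}
END{for(x in a)print x, a[x]}' file1 file2
a 2
b 0
c 0
d 0
g 0
p 16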

bash: cut columns from one file and append them to the end of another file

I would like to cut two columns from one file and stick them on the end of a second file. The two files have exactly the same number of lines.
file1.txt
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
file2.txt
a b c d e f g h i j
a b c d e f g h i j
a b c d e f g h i j
So far I have been using
cut -f9-10 file2.txt | paste file1.txt - > file3.txt
which outputs exactly what I want
1 2 3 4 5 6 7 8 9 10 i j
1 2 3 4 5 6 7 8 9 10 i j
1 2 3 4 5 6 7 8 9 10 i j
However, I don't want to have to make a new file; I would prefer to update file1.txt so it contains the output above. I've tried
cut -f9-10 file2.txt | paste file1.txt -
but it simply prints everything on screen. Is there a way of just adding columns 9 and 10 to the end of file1.txt?
Use sponge from moreutils! It soaks up standard input and then writes to a file, which lets you replace a file in place at the end of a pipe.
cut -f9-10 file2.txt | paste file1.txt - | sponge file1.txt
Note that you can also do the same thing using paste with a process substitution:
$ paste -d' ' file1.txt <(awk '{print $(NF-1), $NF}' file2.txt) | sponge file1.txt
$ cat file1.txt
1 2 3 4 5 6 7 8 9 10 i j
1 2 3 4 5 6 7 8 9 10 i j
1 2 3 4 5 6 7 8 9 10 i j
This joins file1.txt with the last two columns of file2.txt, using ' ' as the delimiter.
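If moreutils is not available, the usual workaround is a temporary file that replaces the original only after the pipeline has finished (redirecting straight to file1.txt would truncate it before paste reads it); the .tmp name below is just an example:
$ cut -f9-10 file2.txt | paste file1.txt - > file1.txt.tmp && mv file1.txt.tmp file1.txt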
