Merging two files in bash with a twist in shell linux - linux

The following question is somehow tricky but seemingly simple , i need to use bash
let us suppose i have 2 text files, the first on is
FirstFile.txt
0 1
0 2
1 1
1 2
2 0
SecondFile.txt
0 1
0 2
0 3
0 4
0 5
1 0
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
I want to be able to create a new Thirdfile.txt that contains that values that are not in file A , meaning if there is a common variable with file A i want it to be removed. knowing that 2 0 and 0 2 are the same ...
Can you help me out ?

Using awk, you can rearrange the columns so that the lower number is always first. When reading the first file, save them as keys in an associative array. When reading the second file, test if they're not found in the array.
awk '{if ($1 <= $2) { a = $1; b = $2; } else { a = $2; b = $1 } }
FNR==NR { arr[a, b] = 1; next; }
!arr[a, b]' FirstFile.txt SecondFile.txt > ThirdFile.txt
Results:
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5

paste <(cut -f2 a.txt) <(cut -f1 a.txt) > tmp.txt
cat a.txt b.txt tmp.txt | sort | uniq -u
or
cat a.txt b.txt <(paste <(cut -f2 a.txt) <(cut -f1 a.txt)) | sort | uniq -u
Result
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
Explanation
uniq removes duplicate rows from a text file.
uniq requires that its input be sorted.
uniq -u prints only the rows that do not have duplicates.
So, cat a.txt b.txt | sort | uniq -u will almost get you there: Only rows in b.txt that are not in a.txt will get printed. However it doesn't handle the reversed cases, like '1 2' <-> '2 1'.
Therefore, you need a temp file that holds all the reversed removal keys. That's what paste <(cut -f2 a.txt) <(cut -f1 a.txt) does.
Note that cut assumes columns are separated by \t's. If they are not, you will need to specify a delimiter with, for example, -d ' '.

Related

How to sort a group of data in a columnwise manner?

I have a group of data like the attached raw data, when I sort the raw data by sort -n , the data were sorted line by line, the output looks like this:
3 6 9 22
2 3 4 5
1 7 16 20
I want to sort the data in a columnwise manner, the output would look like this:
1 2 4 3
3 6 9 16
5 7 20 22
Ok, I did try something.
My primary ideal is to extract the data columnwise and then sort and then paste them, but I can't get through. Here is my script:
for ((i=1; i<=4; i=i+1))
do
awk '{print $i}' file | sort -n >>output
done
The output:
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
It seems that $i is unchangeable and equals to $0
Thanks a lot.
raw data1
3 6 9 22
5 2 4 3
1 7 20 16
raw data2
488.000000 1236.000000 984.000000 2388.000000 788.000000 704.000000
600.000000 1348.000000 872.000000 2500.000000 900.000000 816.000000
232.000000 516.000000 1704.000000 1668.000000 68.000000 16.000000
244.000000 504.000000 1716.000000 1656.000000 56.000000 28.000000
2340.000000 3088.000000 868.000000 4240.000000 2640.000000 2556.000000
2588.000000 3336.000000 1116.000000 4488.000000 2888.000000 2804.000000
Let me introduce a flexible solution using cut and sort that you can use on any M,N size tab delimited input matrix.
$ cat -vTE data_to_sort.in
3^I6^I9^I22$
5^I2^I4^I3$
1^I7^I20^I16$
$ col=4; line=3;
$ for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | paste $(for i in $(seq ${line}); do echo -n "- "; done) |\
> datamash transpose
1 2 4 3
3 6 9 16
5 7 20 22
If the input file is not \t delimited you need to define proper delimiter to using -d"$DELIM_CHAR" have the cut working properly.
for i in $(seq ${col}); do cut -f$i data_to_sort.in | sort -n; done will separate each column of the file and sort it
paste $(for i in $(seq ${line}); do echo -n "- "; done) the paste column will then recreate a matrix structure
datamash transpose is needed to transpose the intermediate matrix
Thanks to the feedback from Sundeep, let me introduce to you a better solution using pr instead of paste command to generate the columns:
$ col=4; line=3
$ for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | pr -${line}ats | datamash transpose
Last but not least,
$ col=4; for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | pr -${col}ts
1 2 4 3
3 6 9 16
5 7 20 22
The following solution will allow us to not use datamash at all!!!
(many thanks to Sundeep)
Proof that is working for the skeptics and the downvoters...
2nd run with 6 columns:
$ col=6; for i in $(seq ${col}); do cut -f$i <(sed 's/^ \+//g;s/ \+/\t/g' data2) | sort -n; done | pr -${col}ts | tr '\t' ' '
232.000000 504.000000 868.000000 1656.000000 56.000000 16.000000
244.000000 516.000000 872.000000 1668.000000 68.000000 28.000000
488.000000 1236.000000 984.000000 2388.000000 788.000000 704.000000
600.000000 1348.000000 1116.000000 2500.000000 900.000000 816.000000
2340.000000 3088.000000 1704.000000 4240.000000 2640.000000 2556.000000
2588.000000 3336.000000 1716.000000 4488.000000 2888.000000 2804.000000
awk to the rescue!!
awk '{f1[NR]=$1; f2[NR]=$2; f3[NR]=$3; f4[NR]=$4}
END{asort(f1); asort(f2); asort(f3); asort(f4);
for(i=1;i<=NR;i++) print f1[i],f2[i],f3[i],f4[i]}' file
1 2 4 3
3 6 9 16
5 7 20 22
there may a smarter way of doing this as well...

How do I turn a text file with a single column into a matrix?

I have a text file that has a single column of numbers, like this:
1
2
3
4
5
6
I want to convert it into two columns, in the left to right order this way:
1 2
3 4
5 6
I can do it with:
awk '{print>"line-"NR%2}' file
paste line-0 line-1 >newfile
But I think the reliance on two intermediate files will make it fragile in a script.
I'd like to use something like cat file | mystery-zip-command >newfile
You can use paste to do this:
paste -d " " - - < file > newfile
You can also use pr:
pr -ats" " -2 file > newfile
-a - use round robin order
-t - suppress header and trailer
-s " " - use single space as the delimiter
-2 - two column output
See also:
Convert a text file into columns
another alternative
$ seq 6 | xargs -n2
1 2
3 4
5 6
or with awk
$ seq 6 | awk '{ORS=NR%2?FS:RS}1'
1 2
3 4
5 6
if you want the output terminate with a new line in case of odd number of input lines..
$ seq 7 | awk '{ORS=NR%2?FS:RS}1; END{ORS=NR%2?RS:FS; print ""}'
1 2
3 4
5 6
7
awk 'NR % 2 == 1 { printf("%s", $1) }
NR % 2 == 0 { printf(" %s\n", $1) }
END { if (NR % 2 == 1) print "" }' file
The odd lines are printed with no newline after them, to print the first column. The even lines are printed with a space first and a newline after, to print the second column. At the end, if there were an odd number of lines, we print a newline so we don't end in the middle of the line.
With bash:
while IFS= read -r odd; do IFS= read -r even; echo "$odd $even"; done < file
Output:
1 2
3 4
5 6
$ seq 6 | awk '{ORS=(NR%2?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2
3 4
5 6
$
$ seq 7 | awk '{ORS=(NR%2?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2
3 4
5 6
7
$
Note that it always adds a terminating newline - that is important as future commands might depend on it, e.g.:
$ seq 6 | awk '{ORS=(NR%2?FS:RS); print}' | wc -l
3
$ seq 7 | awk '{ORS=(NR%2?FS:RS); print}' | wc -l
3
$ seq 7 | awk '{ORS=(NR%2?FS:RS); print} END{if (ORS==FS) printf RS}' | wc -l
4
Just change the single occurrence of 2 to 3 or however many columns you want if your requirements change:
$ seq 6 | awk '{ORS=(NR%3?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2 3
4 5 6
$ seq 7 | awk '{ORS=(NR%3?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2 3
4 5 6
7
$ seq 8 | awk '{ORS=(NR%3?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2 3
4 5 6
7 8
$ seq 9 | awk '{ORS=(NR%3?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2 3
4 5 6
7 8 9
$
Short awk approach:
awk '{print ( ((getline nl) > 0)? $0" "nl : $0 )}' file
The output:
1 2
3 4
5 6
(getline nl)>0 - getline will get the next record and assign it to variable nl. The getline command returns 1 if it finds a record and 0 if it encounters the end of the file
Short GNU sed approach:
sed 'N;s/\n/ /' file
N - add a newline to the pattern space, then append the next line of input to the pattern space
s/\n/ / - replace newline with whitespace within captured pattern space
seq 6 | tr '\n' ' ' | sed -r 's/([^ ]* [^ ]* )/\1\n/g'

How to get the rows corespondent to a specific column values using linux command.

I have an o/p like below.I want the values of first column correspondent to a input value for second column.
Ex: in column 1, 0 and 1 belongs to 0 value of column 2.
So I need a command in which if I pass 0(second column values) I must get 0,1
dmpgdo dbsconfig 0 | grep AMP | grep Online | awk -F' ' '{print $1,$4}'
0 0
1 0
2 1
3 1
4 2
5 2
6 3
7 3
Will this do?
printf "0 0\n1 0\n2 1\n3 1\n4 2\n5 2\n6 3\n7 3" | awk '{if ($2 == 0) print $1}'
0
1

How can I sort a 10GB file?

I'm trying to sort a big table stored in a file. The format of the file is
(ID, intValue)
The data is sorted by ID, but what I need is to sort the data using the intValue, in descending order.
For example
ID | IntValue
1 | 3
2 | 24
3 | 44
4 | 2
to this table
ID | IntValue
3 | 44
2 | 24
1 | 3
4 | 2
How can I use the Linux sort command to do the operation? Or do you recommend another way?
How can I use the Linux sort command to do the operation? Or do you recommend another way?
As others have already pointed out, see man sort for -k & -t command line options on how to sort by some specific element in the string.
Now, the sort also has facility to help sort huge files which potentially don't fit into the RAM. Namely the -m command line option, which allows to merge already sorted files into one. (See merge sort for the concept.) The overall process is fairly straight forward:
Split the big file into small chunks. Use for example the split tool with the -l option. E.g.:
split -l 1000000 huge-file small-chunk
Sort the smaller files. E.g.
for X in small-chunk*; do sort -t'|' -k2 -nr < $X > sorted-$X; done
Merge the sorted smaller files. E.g.
sort -t'|' -k2 -nr -m sorted-small-chunk* > sorted-huge-file
Clean-up: rm small-chunk* sorted-small-chunk*
The only thing you have to take special care about is the column header.
How about:
sort -t' ' -k2 -nr < test.txt
where test.txt
$ cat test.txt
1 3
2 24
3 44
4 2
gives sorting in descending order (option -r)
$ sort -t' ' -k2 -nr < test.txt
3 44
2 24
1 3
4 2
while this sorts in ascending order (without option -r)
$ sort -t' ' -k2 -n < test.txt
4 2
1 3
2 24
3 44
in case you have duplicates
$ cat test.txt
1 3
2 24
3 44
4 2
4 2
use the uniq command like this
$ sort -t' ' -k2 -n < test.txt | uniq
4 2
1 3
2 24
3 44

How to delete the first column ( which is in fact row names) from a data file in linux?

I have data file with many thousands columns and rows. I want to delete the first column which is in fact the row counter. I used this command in linux:
cut -d " " -f 2- input.txt > output.txt
but nothing changed in my output. Does anybody knows why it does not work and what should I do?
This is what my input file looks like:
col1 col2 col3 col4 ...
1 0 0 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 0 0
5 0 1 1 1
6 1 1 1 0
7 1 0 0 0
8 0 0 0 0
9 1 0 0 0
10 1 1 1 1
11 0 0 0 1
.
.
.
I want my output look like this:
col1 col2 col3 col4 ...
0 0 0 1
0 1 0 1
0 1 0 0
0 0 0 0
0 1 1 1
1 1 1 0
1 0 0 0
0 0 0 0
1 0 0 0
1 1 1 1
0 0 0 1
.
.
.
I also tried the sed command:
sed '1d' input.file > output.file
But it deletes the first row not the first column.
Could anybody guide me?
idiomatic use of cut will be
cut -f2- input > output
if you delimiter is tab ("\t").
Or, simply with awk magic (will work for both space and tab delimiter)
awk '{$1=""}1' input | awk '{$1=$1}1' > output
first awk will delete field 1, but leaves a delimiter, second awk removes the delimiter. Default output delimiter will be space, if you want to change to tab, add -vOFS="\t" to the second awk.
UPDATED
Based on your updated input the problem is the initial spaces that cut treats as multiple columns. One way to address is to remove them first before feeding to cut
sed 's/^ *//' input | cut -d" " -f2- > output
or use the awk alternative above which will work in this case as well.
#Karafka I had CSV files so I added the "," separator (you can replace with yours
cut -d"," -f2- input.csv > output.csv
Then, I used a loop to go over all files inside the directory
# files are in the directory tmp/
for f in tmp/*
do
name=`basename $f`
echo "processing file : $name"
#kepp all column excep the first one of each csv file
cut -d"," -f2- $f > new/$name
#files using the same names are stored in directory new/
done
You can use cut command with --complement option:
cut -f1 -d" " --complement input.file > output.file
This will output all columns except the first one.
As #karakfa notes, it looks like it's the leading whitespace which is causing your issues.
Here's a sed oneliner to do the job (that will account for spaces or tabs):
sed -i.bak "s|^[ \t]\+[0-9]\+[ \t]\+||" input.txt
Explanation:
-i edit existing file in place
.bak backup original file and add .bak file extension (can use whatever you like)
s substitute
| separator (easiest character to read as sed separator IMO)
^ start match at start of the line
[ \t] match space or tab
\+ match one or more times (escape required so sed does not interpret '+' literally)
[0-9] match any number 0 - 9
As noted; the input.txt file will be edited in place. The original content of input.txt will be saved as input.txt.bak. Use just -i instead if you don't want a backup of the original file.
Also, if you know that they are definitely leading spaces (not tabs), you could shorten it to this:
sed -i.bak "s|^ \+[0-9]\+[ \t]\+||" input.txt
You can also achieve this with grep:
grep -E -o '[[:digit:]]([[:space:]][[:digit:]]){3}$' input.txt
Which assumes single character digit and space columns. To accommodate a variable number of spaces and digits you can do:
grep -E -o '[[:digit:]]+([[:space:]]+[[:digit:]]+){3}$' input.txt
If your grep supports the -P flag (--perl-regexp) you can do:
grep -P -o '\d+(\s+\d+){3}$' input.txt
And here are a few options if you are using GNU sed:
sed 's/^\s\+\w\+\s\+//' input.txt
sed 's/^\s\+\S\+\s\+//' input.txt
sed 's/^\s\+[0-9]\+\s\+//' input.txt
sed 's/^\s\+[[:digit:]]\+\s\+//' input.txt
Note that the grep regexes are matching the parts that we want to keep while the sed regexes are matching the parts we want to remove.

Resources