Move rows in CSV files - Linux command

I have a CSV file with many columns and rows of this type:
foo.csv
1 1 x1
1 1 x2
1 1 y1
1 1 y2
. . .
What command should I use or what script should I create in order to get it to look like this:
foo.csv
1 1 x1 1 1 y1
1 1 x2 1 1 y2
. . . . . .
In other words, I want to move the last rows of foo.csv, starting from 1 1 y1, up alongside the first rows as new columns.
Thanks in advance!
Paul

$ cat /tmp/1
abc 1
def 2
ghi 3
pqr 4
uvw 5
xyz 6
$ paste <(head -3 /tmp/1) <(tail -3 /tmp/1)
abc 1 pqr 4
def 2 uvw 5
ghi 3 xyz 6
Here you take three lines for the first column and three lines for the second.
If you don't know how many lines you want to take, you can find the split point first.
$ n=$(cat -n /tmp/1 | grep pqr | cut -f1)
$ paste <(head -$((n-1)) /tmp/1) <(sed 1,$((n-1))d /tmp/1)
abc 1 pqr 4
def 2 uvw 5
ghi 3 xyz 6
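If the file should simply be split in half, as in the original foo.csv, here is a small sketch (my addition, assuming an even number of lines) that computes the split point from the line count:
$ n=$(( $(wc -l < foo.csv) / 2 ))
$ paste <(head -n "$n" foo.csv) <(tail -n +"$((n+1))" foo.csv)
1 1 x1 1 1 y1
1 1 x2 1 1 y2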

Adjust the awk record selector as needed:
paste -d " " <(awk '$3~"x"' tests.txt) <(awk '$3~"y"' tests.txt)

Related

How to sort a group of data in a columnwise manner?

I have a group of data like the raw data attached below. When I sort the raw data with sort -n, the data are sorted line by line, and the output looks like this:
3 6 9 22
2 3 4 5
1 7 16 20
I want to sort the data in a columnwise manner, the output would look like this:
1 2 4 3
3 6 9 16
5 7 20 22
OK, I did try something.
My original idea was to extract the data column by column, sort each column, and then paste them back together, but I can't get it to work. Here is my script:
for ((i=1; i<=4; i=i+1))
do
awk '{print $i}' file | sort -n >>output
done
The output:
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
It seems that $i never changes and always equals $0.
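(As an aside: inside single quotes the shell never expands $i, so awk sees its own variable i, which is uninitialized and numerically 0, and $0 is the whole record. A minimal fix for the loop above is to pass the shell variable in with awk's -v option; the columns still need to be pasted back together afterwards, as the answers below do:)
for ((i=1; i<=4; i=i+1))
do
awk -v col="$i" '{print $col}' file | sort -n >> output
done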
Thanks a lot.
raw data1
3 6 9 22
5 2 4 3
1 7 20 16
raw data2
488.000000 1236.000000 984.000000 2388.000000 788.000000 704.000000
600.000000 1348.000000 872.000000 2500.000000 900.000000 816.000000
232.000000 516.000000 1704.000000 1668.000000 68.000000 16.000000
244.000000 504.000000 1716.000000 1656.000000 56.000000 28.000000
2340.000000 3088.000000 868.000000 4240.000000 2640.000000 2556.000000
2588.000000 3336.000000 1116.000000 4488.000000 2888.000000 2804.000000
Let me introduce a flexible solution using cut and sort that you can use on any M×N tab-delimited input matrix.
$ cat -vTE data_to_sort.in
3^I6^I9^I22$
5^I2^I4^I3$
1^I7^I20^I16$
$ col=4; line=3;
$ for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | paste $(for i in $(seq ${line}); do echo -n "- "; done) |\
> datamash transpose
1 2 4 3
3 6 9 16
5 7 20 22
If the input file is not tab-delimited, you need to specify the proper delimiter with -d"$DELIM_CHAR" for cut to work properly.
for i in $(seq ${col}); do cut -f$i data_to_sort.in | sort -n; done separates each column of the file and sorts it.
paste $(for i in $(seq ${line}); do echo -n "- "; done) then recreates a matrix structure from the stream of sorted values (illustrated below).
datamash transpose is needed to transpose the intermediate matrix.
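To see that paste idiom in isolation: each - makes paste read one line from standard input per output row, so N dashes fold the stream into rows of N.
$ seq 6 | paste - - -
1 2 3
4 5 6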
Thanks to feedback from Sundeep, let me introduce a better solution, using pr instead of paste to generate the columns:
$ col=4; line=3
$ for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | pr -${line}ats | datamash transpose
Last but not least,
$ col=4; for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | pr -${col}ts
1 2 4 3
3 6 9 16
5 7 20 22
This last solution lets us avoid datamash entirely!
(many thanks to Sundeep)
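The difference between the two pr invocations, illustrated on a trivial stream (my example): with -a, pr fills the columns across, so each sorted run becomes a row and a transpose is still needed; without -a, pr fills down, which puts each sorted run straight back into its own column.
$ seq 6 | pr -2ats
1 2
3 4
5 6
$ seq 6 | pr -2ts
1 4
2 5
3 6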
Proof that it works, for the skeptics and the downvoters:
2nd run with 6 columns:
$ col=6; for i in $(seq ${col}); do cut -f$i <(sed 's/^ \+//g;s/ \+/\t/g' data2) | sort -n; done | pr -${col}ts | tr '\t' ' '
232.000000 504.000000 868.000000 1656.000000 56.000000 16.000000
244.000000 516.000000 872.000000 1668.000000 68.000000 28.000000
488.000000 1236.000000 984.000000 2388.000000 788.000000 704.000000
600.000000 1348.000000 1116.000000 2500.000000 900.000000 816.000000
2340.000000 3088.000000 1704.000000 4240.000000 2640.000000 2556.000000
2588.000000 3336.000000 1716.000000 4488.000000 2888.000000 2804.000000
awk to the rescue!! (Note: asort requires GNU awk.)
awk '{f1[NR]=$1; f2[NR]=$2; f3[NR]=$3; f4[NR]=$4}
END{asort(f1); asort(f2); asort(f3); asort(f4);
for(i=1;i<=NR;i++) print f1[i],f2[i],f3[i],f4[i]}' file
1 2 4 3
3 6 9 16
5 7 20 22
there may be a smarter way of doing this as well...
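For reference, here is a more general sketch of the same idea (my addition, not from the original answer; it assumes GNU awk 4+ for arrays of arrays) that does not hardcode the four columns:
awk '{ for (c = 1; c <= NF; c++) f[c][FNR] = $c }
     END { for (c = 1; c <= NF; c++) asort(f[c])
           for (i = 1; i <= FNR; i++)
               for (c = 1; c <= NF; c++)
                   printf "%s%s", f[c][i], (c < NF ? OFS : ORS) }' file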

Merging two files in bash with a twist

The following question is somewhat tricky but seemingly simple; I need to use bash.
Let us suppose I have 2 text files; the first one is
FirstFile.txt
0 1
0 2
1 1
1 2
2 0
SecondFile.txt
0 1
0 2
0 3
0 4
0 5
1 0
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
I want to be able to create a new Thirdfile.txt that contains the values that are not in the first file, meaning that if a pair also occurs in the first file I want it removed, keeping in mind that 2 0 and 0 2 are the same ...
Can you help me out ?
Using awk, you can rearrange the columns so that the lower number is always first. When reading the first file, save the pairs as keys in an associative array. When reading the second file, test whether they're absent from the array.
awk '{if ($1 <= $2) { a = $1; b = $2; } else { a = $2; b = $1 } }
FNR==NR { arr[a, b] = 1; next; }
!arr[a, b]' FirstFile.txt SecondFile.txt > ThirdFile.txt
Results:
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
paste <(cut -f2 a.txt) <(cut -f1 a.txt) > tmp.txt
cat a.txt b.txt tmp.txt | sort | uniq -u
or
cat a.txt b.txt <(paste <(cut -f2 a.txt) <(cut -f1 a.txt)) | sort | uniq -u
Result
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
Explanation
uniq removes duplicate rows from a text file.
uniq requires that its input be sorted.
uniq -u prints only the rows that do not have duplicates.
So, cat a.txt b.txt | sort | uniq -u will almost get you there: only rows in b.txt that are not in a.txt will get printed. However, it doesn't handle the reversed cases, like '1 2' <-> '2 1'.
Therefore, you need a temp file that holds all the reversed removal keys. That's what paste <(cut -f2 a.txt) <(cut -f1 a.txt) does.
Note that cut assumes columns are separated by \t's. If they are not, you will need to specify a delimiter with, for example, -d ' '.
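An alternative that makes the symmetry explicit (my sketch, not from the original answers): normalize both files so the smaller number always comes first, treat the pairs as a set with sort -u (which also collapses duplicates such as 0 1 and 1 0 within the same file), and let comm -13 print the pairs that appear only in the second file. Note the output comes out sorted rather than in the original order.
norm() { awk '{ if ($1 <= $2) print $1, $2; else print $2, $1 }' "$1" | sort -u; }
comm -13 <(norm FirstFile.txt) <(norm SecondFile.txt)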

bash cut columns to one file and save onto the end of another file

I would like to cut two columns from one file and stick them onto the end of a second file. The two files have exactly the same number of lines.
file1.txt
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
file2.txt
a b c d e f g h i j
a b c d e f g h i j
a b c d e f g h i j
a b c d e f g h i j
So far I have been using
cut -f9-10 file2.txt | paste file1.txt - > file3.txt
which outputs exactly what I want
1 2 3 4 5 6 7 8 9 10 i j
1 2 3 4 5 6 7 8 9 10 i j
1 2 3 4 5 6 7 8 9 10 i j
However, I don't want to have to make a new file; I would prefer to alter file1.txt in place. I've tried
cut -f9-10 file2.txt | paste file1.txt -
but it simply prints everything to the screen. Is there a way of just adding columns 9 and 10 onto the end of file1.txt?
Use sponge from moreutils! It allows you to soak up standard input and write to a file. That is, to replace a file in-place after a pipe.
cut -f9-10 file2.txt | paste file1.txt - | sponge file1.txt
Note that you can also do the same thing using paste with process substitution.
$ paste -d' ' file1.txt <(awk '{print $(NF-1), $NF}' file2.txt) | sponge file1.txt
$ cat file1.txt
1 2 3 4 5 6 7 8 9 10 i j
1 2 3 4 5 6 7 8 9 10 i j
1 2 3 4 5 6 7 8 9 10 i j
This joins file1.txt with the last two columns of file2.txt, using ' ' as the delimiter.
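If sponge is not installed, the classic fallback (my sketch) is a temporary file; note that redirecting straight back to file1.txt would truncate it before paste gets to read it:
cut -f9-10 file2.txt | paste file1.txt - > file1.tmp && mv file1.tmp file1.txt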

How to use the paste command with columns of different lengths

I have:
file1.txt file2.txt file3.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
I want to paste all three of these columns into ofile.txt.
I tried
paste file1.txt file2.txt file3.txt > ofile.txt
The result I got in ofile.txt:
ofile.txt:
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
What it should look like:
ofile.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
You can try this paste command in bash using process substitution:
paste <(sed 's/^[[:blank:]]*//' file1.txt) file2.txt file3.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
The sed command removes leading whitespace from file1.txt.
I can reproduce your output when I create the input files with tabs.
paste also uses tabs between the columns and aligns them as it sees fit.
You can see what happens when I replace the tabs with -:
# more x* | tr '\t' '-'
::::::::::::::
x1
::::::::::::::
-1a
-1b
-1c
-1d
::::::::::::::
x2
::::::::::::::
-2a
-2b
::::::::::::::
x3
::::::::::::::
-3a
-3b
-3c
-3d
-3e
-3f
-3g
# paste x? | tr '\t' '-'
-1a--2a--3a
-1b--2b--3b
-1c---3c
-1d---3d
---3e
---3f
---3g
Think about how you want the output. If you want correct indentation, you need to pad the shorter files with tab-only lines, or manipulate the result: turn 3 consecutive tabs into 4, and 4 tabs at the beginning of a line into 5.
sed -e 's/\t\t\t/\t\t\t\t/' -e 's/^\t\t\t\t/\t\t\t\t\t/'
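Alternatively, here is a sketch of the padding idea (my addition, using the file names from the question): pad every file with empty lines up to the length of the longest one before pasting, so the columns stay aligned.
max=0
for f in file1.txt file2.txt file3.txt; do
    n=$(wc -l < "$f")
    (( n > max )) && max=$n
done
# append empty lines until each file has $max lines
for f in file1.txt file2.txt file3.txt; do
    awk -v n="$max" '{ print } END { for (i = NR; i < n; i++) print "" }' "$f" > "$f.pad"
done
paste file1.txt.pad file2.txt.pad file3.txt.pad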

How to extract one column from multiple files, and paste those columns into one file?

I want to extract the 5th column from multiple files, named in a numerical order, and paste those columns in sequence, side by side, into one output file.
The file names look like:
sample_problem1_part1.txt
sample_problem1_part2.txt
sample_problem2_part1.txt
sample_problem2_part2.txt
sample_problem3_part1.txt
sample_problem3_part2.txt
......
Each problem file (1,2,3...) has two parts (part1, part2). Each file has the same number of lines.
The content looks like:
sample_problem1_part1.txt
1 1 20 20 1
1 7 21 21 2
3 1 22 22 3
1 5 23 23 4
6 1 24 24 5
2 9 25 25 6
1 0 26 26 7
sample_problem1_part2.txt
1 1 88 88 8
1 1 89 89 9
2 1 90 90 10
1 3 91 91 11
1 1 92 92 12
7 1 93 93 13
1 5 94 94 14
sample_problem2_part1.txt
1 4 330 30 a
3 4 331 31 b
1 4 332 32 c
2 4 333 33 d
1 4 334 34 e
1 4 335 35 f
9 4 336 36 g
The output should look like this (in the sequence problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2, etc.):
1 8 a ...
2 9 b ...
3 10 c ...
4 11 d ...
5 12 e ...
6 13 f ...
7 14 g ...
I was using:
paste sample_problem1_part1.txt sample_problem1_part2.txt > \
sample_problem1_partall.txt
paste sample_problem2_part1.txt sample_problem2_part2.txt > \
sample_problem2_partall.txt
paste sample_problem3_part1.txt sample_problem3_part2.txt > \
sample_problem3_partall.txt
And then:
for i in `find . -name "sample_problem*_partall.txt"`
do
l=`echo $i | sed 's/sample/extracted_col_/'`
awk '{print $5, $10}' $i > $l
done
And:
paste extracted_col_problem1_partall.txt \
extracted_col_problem2_partall.txt \
extracted_col_problem3_partall.txt > \
extracted_col_problemall_partall.txt
It works fine with a few files, but it's a crazy method when the number of files is large (over 4000).
Could anyone help me with simpler solutions that are capable of dealing with multiple files, please?
Thanks!
Here's one way using awk and a sorted glob of files:
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)
Results:
1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g
Explanation:
For each line of input of each input file:
Add the file's line number to an array, with column 5 as the value.
(a[FNR] ? a[FNR] FS : "") is a ternary operation, which is set up to build up the array's value as a record. It simply asks if the file's line number is already in the array. If so, add the array's value followed by the default field separator before adding the fifth column. Otherwise, if the line number is not in the array, don't prepend anything; just let it equal the fifth column.
At the end of the script:
Use a C-style loop to iterate through the array, printing each of the array's values.
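One note on the glob (my addition): ls -1v sorts the names version-numerically, so part10 sorts after part9, where plain lexical sorting would interleave them. If the file names could ever contain whitespace, a NUL-delimited variant (assuming GNU sort for -zV, and an argument list that fits in a single awk invocation) avoids the unquoted command substitution:
printf '%s\0' sample_problem*_part*.txt | sort -zV |
    xargs -0 awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 }
                  END { for (i = 1; i <= FNR; i++) print a[i] }'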
For only ~4000 files, you should be able to do:
find . -name 'sample_problem*_part*.txt' | xargs paste
If find is giving names in the wrong order, pipe it to sort:
find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
# print filenames in sorted order
find -name sample\*.txt | sort |
# extract 5-th column from each file and print it on a single line
xargs -n1 -I{} sh -c '{ cut -s -d " " -f 5 $0 | tr "\n" " "; echo; }' {} |
# transpose
python transpose.py ?
where transpose.py:
#!/usr/bin/env python
"""Write lines from stdin as columns to stdout."""
import sys
from itertools import izip_longest  # Python 2; on Python 3 use itertools.zip_longest

missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'
for row in izip_longest(*[column.split() for column in sys.stdin],
                        fillvalue=missing_value):
    print " ".join(row)
Output
1 8 a
2 9 b
3 10 c
4 11 d
5 ? e
6 ? f
? ? g
Assuming the first and second files have fewer lines than the third one (missing values are replaced by '?').
Try this one. My script assumes that every file has the same number of lines.
# get number of lines
lines=$(wc -l sample_problem1_part1.txt | cut -d' ' -f1)
for ((i=1; i<=$lines; i++)); do
for file in sample_problem*; do
# get line number $i and delete everything except the last column
# and then print it
# echo -n means that no newline is appended
echo -n $(sed -n ${i}'s%.*\ %%p' $file)" "
done
echo
done
This works. For 4800 files, each 7 lines long, it took 2 minutes 57.865 seconds on an AMD Athlon(tm) X2 Dual Core Processor BE-2400.
PS: The time for my script increases linearly with the number of lines. It would take a very long time to merge files with 1000 lines each. You should consider learning awk and using the script from steve. I tested it: for 4800 files, each with 1000 lines, it took only 65 seconds!
You can pass awk output to paste and redirect it to a new file as follows:
paste <(awk '{print $3}' file1) <(awk '{print $3}' file2) <(awk '{print $3}' file3) > file.txt
