Shifting column titles to right - linux

I have a file that I want to process in bash or Python.
It has 4 columns but only 3 column titles:
input.txt
1STCOLUMN 2NDCOLUMN THIRDCOLUMN
input1 12 33 45
input22 10 13 9
input4 2 23 11
input4534 3 1 1
I am trying to shift the column titles to the right and add the title "INPUTS" to the first column (the input column).
Desired output: Adding the column title
Desired-output-step1.csv
INPUTS 1STCOLUMN 2NDCOLUMN THIRDCOLUMN
input1 12 33 45
input22 10 13 9
input4 2 23 11
input4534 3 1 1
I tried with sed:
sed -i '1iINPUTS, 1STCOLUMN, 2NDCOLUMN, THIRDCOLUMN' input.txt
But I would prefer not to type the column names by hand.
How do I insert just the new title for the first column so that the other column titles shift to the right?

You can specify which line to modify by using a line number:
$ sed '1s/^/INPUTS /' ip.txt
INPUTS 1STCOLUMN 2NDCOLUMN THIRDCOLUMN
input1 12 33 45
input22 10 13 9
input4 2 23 11
input4534 3 1 1
Here, 1 indicates that the s command applies only to the 1st line.
s/^/INPUTS / inserts text at the start of the line; adjust the spacing as needed.
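Since the question allows bash or python, the same first-line-only edit can be sketched in Python. This is a minimal sketch, not part of the original answer; the function name add_first_header and the in-memory sample list are made up for illustration.

```python
def add_first_header(lines, title="INPUTS"):
    """Return a copy of `lines` with `title` prepended to the header line.

    Only the first line changes, mirroring sed's `1s/^/INPUTS /`.
    """
    if not lines:
        return []
    return [f"{title} {lines[0]}"] + list(lines[1:])


# Hypothetical sample data taken from the question's input.txt
sample = [
    "1STCOLUMN 2NDCOLUMN THIRDCOLUMN",
    "input1 12 33 45",
    "input22 10 13 9",
]
for line in add_first_header(sample):
    print(line)
```

As with the sed command, the existing column titles are never retyped; they are simply pushed to the right by the new prefix.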

Instead of counting and testing the spaces, you can let column -t do the padding and formatting:
sed '1s/^/INPUTS /' ip.txt|column -t
This will give you:
INPUTS     1STCOLUMN  2NDCOLUMN  THIRDCOLUMN
input1     12         33         45
input22    10         13         9
input4     2          23         11
input4534  3          1          1
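The padding that column -t performs can be approximated in Python as well. The sketch below is an illustration only: align_columns is a hypothetical helper, and the two-space gap is an assumption chosen to resemble typical column -t output.

```python
def align_columns(lines, gap=2):
    """Pad whitespace-separated fields so that columns line up."""
    rows = [line.split() for line in lines]
    ncols = max((len(row) for row in rows), default=0)
    # Widest cell seen in each column
    widths = [0] * ncols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))
    aligned = []
    for row in rows:
        padded = [cell.ljust(widths[i] + gap) for i, cell in enumerate(row)]
        # Drop the padding that follows the last cell of the row
        aligned.append("".join(padded).rstrip())
    return aligned


for line in align_columns(["INPUTS 1STCOLUMN", "input4534 3"]):
    print(line)
```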

Related

Replace specific lines in one file with data contained in another file

I want to replace specific lines in one file (File 1) with data contained in another file (File 2). For example:
File 1 (Input code):
other lines...
11 !!! Regular Expression
10 0.685682*100
11 0.004910*100
12 0.007012*100
13 0.146041*100
14 0.067827*100
15 0.019460*100
16 0.019277*100
17 0.001841*100
18 0.047950*100
other lines...
File 2 (to add new data):
1 0.36600*100
2 0.44466*100
3 0.0.046*100
4 0.15544*100
5 0.16600*100
6 0.14477*100
7 0.01927*100
8 0.00188*100
9 0.05566*100
How can I replace the input data in File 1 (from line 1 to line n) with the data contained in File 2? I tried using sed as follows:
sed '/!!! Regular Expression/r File2' File1
and I get the following:
1 !!! Regular Expression
2 0.36600*100
3 0.44466*100
4 0.0.046*100
5 0.15544*100
6 0.16600*100
7 0.14477*100
8 0.01927*100
9 0.00188*100
10 0.05566*100
11 0.685682*100
12 0.004910*100
13 0.007012*100
14 0.146041*100
15 0.067827*100
16 0.019460*100
17 0.019277*100
19 0.001841*100
20 0.047950*100
My problem is that this command inserts the lines contained in File 2 but does not replace the existing ones. How can I replace only these lines (from 10 to 18) with the new data?
Thanks in advance.
replace specific lines in one file from 10 to 18 with the data contained in File 2
Let's split "replacing" into "deleting" and "inserting".
[Delete] specific lines in one file from 10 to 18
That's easy:
sed '10,18d'
would delete lines from 10 to 18.
[Insert] the data contained in File 2 [to line 10]
That's also easy:
sed '9r file2'
It appends the content of file2 after line 9, so first line of file2 is the new line 10.
All together:
sed '10,18d; 9r file2'
Example:
# seq 8 | sed '3,6d; 2r '<(seq -f 2%.0f 5)
1
2
21
22
23
24
25
7
8
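The same delete-then-insert idea can be sketched in Python with list slicing. This is only an illustration under the assumption that both files fit in memory as lists of lines; replace_line_range is a hypothetical name, not something from the answer.

```python
def replace_line_range(lines, first, last, replacement):
    """Replace the 1-based, inclusive range first..last with `replacement`.

    Dropping lines[first-1:last] plays the role of sed's `first,last d`;
    splicing `replacement` in at that spot plays the role of `r file2`.
    """
    if not 1 <= first <= last:
        raise ValueError("expected 1 <= first <= last")
    return list(lines[:first - 1]) + list(replacement) + list(lines[last:])


original = [str(n) for n in range(1, 9)]    # same lines as `seq 8`
new_data = [f"2{n}" for n in range(1, 6)]   # same lines as `seq -f 2%.0f 5`
print("\n".join(replace_line_range(original, 3, 6, new_data)))
```

For the question's files, the call would use first=10 and last=18 with the lines of File 2 as the replacement.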

Datamash: Transposing the column into rows based on group in bash

I have a tab-delimited file with 2 columns, like the following:
A 123
A 23
A 45
A 67
B 88
B 72
B 50
B 23
C 12
C 14
I want to transpose the above data based on the first column, like the following:
A 123 23 45 67
B 88 72 50 23
C 12 14
I tried datamash transpose < input-file.txt, but it didn't yield the expected output.
One awk version:
awk '{printf ($1!=f?"\n%s":" "$2),$0;f=$1}' file
A 123 23 45 67
B 88 72 50 23
C 12 14
With this version, you get one leading blank line (and no trailing newline), but it should be fast and handle large data since no loops or array variables are used.
($1!=f?"\n%s":" "$2),$0 If the first field is not equal to f, print a newline followed by the whole line.
If $1 equals f, print only a space and field 2.
f=$1 sets f to the first field.
Using GNU datamash:
datamash --group=1 --field-separator=' ' collapse 2 <file | tr ',' ' '
Output:
A 123 23 45 67
B 88 72 50 23
C 12 14
Input must be sorted, as in the question.
This might work for you (GNU sed):
sed -E ':a;N;s/^((\S+)\s+.*)\n\2/\1/;ta;P;D' file
Append the next line and if the first field of the first line is the same as the first field of the second line, remove the newline and the first field of the second line. Print the first line in the pattern space and then delete it and the following newline and repeat.
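For comparison, the grouping can also be sketched in Python with itertools.groupby. Like the datamash answer, this sketch assumes that rows with the same first field are already adjacent; collapse_groups is a hypothetical name used only for illustration.

```python
from itertools import groupby


def collapse_groups(lines):
    """Join the second fields of consecutive rows that share a first field."""
    rows = (line.split() for line in lines if line.strip())
    collapsed = []
    # groupby only merges adjacent rows, so the input must be grouped already
    for key, group in groupby(rows, key=lambda fields: fields[0]):
        values = [fields[1] for fields in group]
        collapsed.append(" ".join([key] + values))
    return collapsed


data = ["A 123", "A 23", "A 45", "A 67", "B 88", "B 72", "B 50", "B 23", "C 12", "C 14"]
for line in collapse_groups(data):
    print(line)
```

Because split() with no argument splits on any whitespace, the same function handles both tab-delimited and space-delimited input.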

Finding if a column is in a range

I have two files, and I want to find out whether a column of file1 falls within a range given by columns of file2.
file1.txt
1 19
1 21
1 24
2 22
4 45
file2.txt
1 19 23 A
1 20 28 A
4 42 45 A
I am trying to check whether the 1st column of file1.txt is the same as the 1st column of file2.txt and whether the second column of file1.txt is between the 2nd and 3rd columns of file2.txt, and to append the file1.txt line if it is in the range.
So the output should be :
output.txt
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
What I am trying is to find out if first columns are the same:
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' file1.txt file2.txt
1 19 23 A
1 20 28 A
4 42 45 A
But I am not able to add the greater-than/less-than conditions.
How do I add it?
The following may also help you here.
while read first second
do
awk -v fir="$first" -v sec="$second" '$1==fir && ($2<=sec && $3>=sec){print $0,fir,sec}' file2
done < "file1"
Using join + awk:
join file2.txt file1.txt | awk '{if ($2 <= $5 && $5 <= $3) { print $1,$2,$3,$4,$1,$5 } }'
First two files are joined on the first column, then the columns are compared and output printed (with the first column printed twice, as join hides it).
Using awk:
$ awk 'NR==FNR{a[$1]=a[$1]" "$2;next} {split(a[$1],b);for(i in b) if(b[i]>=$2 && b[i]<=$3) print $0,$1,b[i]}' file1 file2
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
The first block stores the elements of file1 in the array a. The array index is the first column of the file, and the array element is the concatenation of all second-column numbers that share that first-column value.
The second block loops over the numbers stored in a under the same index as the first column of file2 and checks whether each number is within the range.
Another approach is to use join:
$ join -o 1.1 1.2 1.3 1.4 1.1 2.2 file2 file1 | awk '$6 >= $2 && $6 <= $3'
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
join -o generates the expected output format. The awk statement filters the lines that are in range.
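The two-pass awk logic above (collect file1 values per key, then test each file2 range) can be sketched in Python as follows. The function name match_ranges and the in-memory lists are assumptions for illustration; values are compared as integers, which matches the sample data.

```python
from collections import defaultdict


def match_ranges(points, ranges):
    """Append each (key, value) from `points` to every range line it falls in."""
    by_key = defaultdict(list)
    for line in points:
        key, value = line.split()
        by_key[key].append(value)
    matches = []
    for line in ranges:
        key, low, high = line.split()[:3]
        for value in by_key.get(key, []):
            # Inclusive bounds, like `b[i]>=$2 && b[i]<=$3` in the awk answer
            if int(low) <= int(value) <= int(high):
                matches.append(f"{line} {key} {value}")
    return matches


file1 = ["1 19", "1 21", "1 24", "2 22", "4 45"]
file2 = ["1 19 23 A", "1 20 28 A", "4 42 45 A"]
for line in match_ranges(file1, file2):
    print(line)
```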

sort multiple column file

I have a file a.dat as following.
1 0.246102 21 1 0.0408359 0.00357267
2 0.234548 21 2 0.0401056 0.00264361
3 0.295771 21 3 0.0388905 0.00305116
4 0.190543 21 4 0.0371858 0.00427217
5 0.160047 21 5 0.0349674 0.00713894
I want to sort the file according to values in second column. i.e. output should look like
5 0.160047 21 5 0.0349674 0.00713894
4 0.190543 21 4 0.0371858 0.00427217
2 0.234548 21 2 0.0401056 0.00264361
1 0.246102 21 1 0.0408359 0.00357267
3 0.295771 21 3 0.0388905 0.00305116
How can I do this from the command line? I read that the sort command can be used for this purpose, but I could not figure out how to use it.
Use sort -k to indicate the column you want to use:
$ sort -k2 file
5 0.160047 21 5 0.0349674 0.00713894
4 0.190543 21 4 0.0371858 0.00427217
2 0.234548 21 2 0.0401056 0.00264361
1 0.246102 21 1 0.0408359 0.00357267
3 0.295771 21 3 0.0388905 0.00305116
This works in this case. Note that -k2 alone compares everything from field 2 to the end of the line, as text.
For future reference, note (as indicated by 1_CR) that you can also indicate the range of columns to be used with sort -k2,2 (just use column 2) or sort -k2,5 (from 2 to 5), etc.
Note that you need to specify the start and end fields for sorting (2 and 2 in this case), and if you need numeric sorting, add n.
sort -k2,2n file.txt
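A numeric sort on a single column can be sketched in Python too. This is a minimal illustration; sort_by_column is a hypothetical name, and the rows are the sample lines from the question.

```python
def sort_by_column(lines, column=2):
    """Sort lines numerically on one whitespace-separated column (1-based)."""
    return sorted(lines, key=lambda line: float(line.split()[column - 1]))


rows = [
    "1 0.246102 21 1 0.0408359 0.00357267",
    "2 0.234548 21 2 0.0401056 0.00264361",
    "3 0.295771 21 3 0.0388905 0.00305116",
    "4 0.190543 21 4 0.0371858 0.00427217",
    "5 0.160047 21 5 0.0349674 0.00713894",
]
for line in sort_by_column(rows):
    print(line)
```

Converting the key to float is what makes the comparison numeric: compared as text, 10 would sort before 9.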

How to extract one column from multiple files, and paste those columns into one file?

I want to extract the 5th column from multiple files, named in a numerical order, and paste those columns in sequence, side by side, into one output file.
The file names look like:
sample_problem1_part1.txt
sample_problem1_part2.txt
sample_problem2_part1.txt
sample_problem2_part2.txt
sample_problem3_part1.txt
sample_problem3_part2.txt
......
Each problem file (1,2,3...) has two parts (part1, part2). Each file has the same number of lines.
The content looks like:
sample_problem1_part1.txt
1 1 20 20 1
1 7 21 21 2
3 1 22 22 3
1 5 23 23 4
6 1 24 24 5
2 9 25 25 6
1 0 26 26 7
sample_problem1_part2.txt
1 1 88 88 8
1 1 89 89 9
2 1 90 90 10
1 3 91 91 11
1 1 92 92 12
7 1 93 93 13
1 5 94 94 14
sample_problem2_part1.txt
1 4 330 30 a
3 4 331 31 b
1 4 332 32 c
2 4 333 33 d
1 4 334 34 e
1 4 335 35 f
9 4 336 36 g
The output should look like: (in a sequence of problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2,etc.,)
1 8 a ...
2 9 b ...
3 10 c ...
4 11 d ...
5 12 e ...
6 13 f ...
7 14 g ...
I was using:
paste sample_problem1_part1.txt sample_problem1_part2.txt > \
sample_problem1_partall.txt
paste sample_problem2_part1.txt sample_problem2_part2.txt > \
sample_problem2_partall.txt
paste sample_problem3_part1.txt sample_problem3_part2.txt > \
sample_problem3_partall.txt
And then:
for i in `find . -name "sample_problem*_partall.txt"`
do
l=`echo $i | sed 's/sample/extracted_col_/'`
`awk '{print $5, $10}' $i > $l`
done
And:
paste extracted_col_problem1_partall.txt \
extracted_col_problem2_partall.txt \
extracted_col_problem3_partall.txt > \
extracted_col_problemall_partall.txt
It works fine with a few files, but it's a crazy method when the number of files is large (over 4000).
Could anyone help me with simpler solutions that are capable of dealing with multiple files, please?
Thanks!
Here's one way using awk and a sorted glob of files:
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)
Results:
1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g
Explanation:
For each line of input of each input file:
Append column 5 to the array entry indexed by the file's line number (FNR).
(a[FNR] ? a[FNR] FS : "") is a ternary operation that builds up the array's value as a record. It asks whether the entry for the current line number is already non-empty. If so, it takes the existing value followed by the default field separator before appending the fifth column. Otherwise, nothing is prepended and the entry is simply set to the fifth column.
At the end of the script:
Use a C-style loop to iterate through the array, printing each of the array's values.
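The row-by-row accumulation in the awk approach above can be sketched in Python. The sketch assumes, as the question states, that every file has the same number of lines, and it takes the files as in-memory lists of lines; paste_column is a hypothetical name.

```python
def paste_column(files, column=5, sep=" "):
    """Take one column (1-based) from each file and join them row by row.

    `files` is a list of files, each given as a list of text lines.
    """
    columns = [[line.split()[column - 1] for line in lines] for lines in files]
    # zip(*columns) turns the per-file columns into per-row tuples
    return [sep.join(row) for row in zip(*columns)]


# First three lines of the question's sample files
part1 = ["1 1 20 20 1", "1 7 21 21 2", "3 1 22 22 3"]
part2 = ["1 1 88 88 8", "1 1 89 89 9", "2 1 90 90 10"]
part3 = ["1 4 330 30 a", "3 4 331 31 b", "1 4 332 32 c"]
for line in paste_column([part1, part2, part3]):
    print(line)
```

The order of the output columns follows the order of the list, so the file names would need to be sorted (for example numerically by problem and part) before reading them.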
For only ~4000 files, you should be able to do:
find . -name 'sample_problem*_part*.txt' | xargs paste
If find is giving names in the wrong order, pipe it to sort:
find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
# print filenames in sorted order
find -name sample\*.txt | sort |
# extract 5-th column from each file and print it on a single line
xargs -n1 -I{} sh -c '{ cut -s -d " " -f 5 $0 | tr "\n" " "; echo; }' {} |
# transpose
python transpose.py '?'
where transpose.py:
#!/usr/bin/env python3
"""Write lines from stdin as columns to stdout."""
import sys
from itertools import zip_longest

# Placeholder for missing cells; defaults to '-' when no argument is given
missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'
# Each input line becomes one output column; short lines are padded
for row in zip_longest(*[column.split() for column in sys.stdin],
                       fillvalue=missing_value):
    print(" ".join(row))
Output
1 8 a
2 9 b
3 10 c
4 11 d
5 ? e
6 ? f
? ? g
This assumes the first and second files have fewer lines than the third one (missing values are replaced by '?').
Try this one. My script assumes that every file has the same number of lines.
# get number of lines
lines=$(wc -l sample_problem1_part1.txt | cut -d' ' -f1)
for ((i=1; i<=$lines; i++)); do
for file in sample_problem*; do
# get line number $i and delete everything except the last column
# and then print it
# echo -n means that no newline is appended
echo -n $(sed -n ${i}'s%.*\ %%p' $file)" "
done
echo
done
This works. For 4800 files, each 7 lines long, it took 2 minutes 57.865 seconds on an AMD Athlon(tm) X2 Dual Core Processor BE-2400.
PS: The time for my script increases linearly with the number of lines. It would take a very long time to merge files with 1000 lines. You should consider learning awk and using the script from steve. I tested it: for 4800 files, each with 1000 lines, it took only 65 seconds!
You can pass awk output to paste and redirect it to a new file as follows:
paste <(awk '{print $3}' file1) <(awk '{print $3}' file2) <(awk '{print $3}' file3) > file.txt
