How to clean up misaligned columns in text separated by spaces? [duplicate] - linux

I have a file with the lines below, which are misaligned. I want to align this file properly so that the words in each line form evenly spaced columns.
3 281 901.188.30.53 901001 1 poihelloswqs-1146414
3 598 901.189.166.233 901001 1 poihelloswqs-90877846
3 300 901.156.77.57 901001 1 poihelloswqs-90137229
3 263 901.156.17.80 901001 1 poihelloswqs-90135797
3 264 901.875.875.79 901001 1 poihelloswqs-1389375
3 265 901.189.153.234 901001 1 poihelloswqs-1568332
3 266 901.218.93.873 901001 1 poihelloswqs-3240561
3 268 901.158.76.23 901001 1 poihelloswqs-3242066
3 269 901.218.30.120 901001 1 poihelloswqs-3242532
It should output something like this, with all the columns properly aligned. Is this possible to do in Linux?
3  281  901.188.30.53    901001  1  poihelloswqs-1146414
3  598  901.189.166.233  901001  1  poihelloswqs-90877846
3  300  901.156.77.57    901001  1  poihelloswqs-90137229
3  263  901.156.17.80    901001  1  poihelloswqs-90135797
3  264  901.875.875.79   901001  1  poihelloswqs-1389375
3  265  901.189.153.234  901001  1  poihelloswqs-1568332
3  266  901.218.93.873   901001  1  poihelloswqs-3240561
3  268  901.158.76.23    901001  1  poihelloswqs-3242066
3  269  901.218.30.120   901001  1  poihelloswqs-3242532

It's very ugly but this may work:
cat xx | ruby -nle 'puts $_.split().join("\t")' | pr --expand-tabs -tT
You may need to replace the single \t with multiple tab characters if your columns vary greatly in width.
This is ruby splitting the line on any whitespace, then joining the fields with tabs. Finally, pr replaces the tabs with spaces (and -tT suppresses the annoying headers and pagination).
If you know the fields are separated by a single character, tr can do the replacement easily:
cat xx | tr ' ' "\t" | pr --expand-tabs -tT
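For completeness: the same alignment is usually a one-liner with column from util-linux, which splits on whitespace and pads each column to its widest entry. A minimal sketch, assuming your input file is named xx as above:
column -t xx
Unlike the pr approach, this needs no tuning when column widths vary a lot, though column is not specified by POSIX (it ships with util-linux on Linux and with the BSD userland on macOS).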

Related

Filter a large text file using ID in another text file

I have two text files. The first is composed of about 60,000 rows and 14 columns; the second has a single column containing a subset of one of the columns (the first column) of the first file. I would like to filter File 1 based on the ID names in File 2. I tried some commands found on the net, but none of them were useful. Here are a few lines of the two files (I'm on a Linux system).
File 1:
Contig100 orange1.1g013919m 75.31 81 12 2 244 14 2 78 4e-29 117 1126 435
Contig1000 orange1.1g045442m 65.50 400 130 2 631 1809 2 400 1e-156 466 2299 425
Contig10005 orange1.1g003445m 83.86 824 110 2 3222 808 1 820 0.0 1322 3583 820
Contig10006 orange1.1g047384m 81.82 22 4 0 396 331 250 271 7e-05 41.6 396 412
File 2:
Contig1
Contig1000
Contig10005
Contig10017
Please let me know your suggestions for solving this issue. Thanks in advance.
You can do this with Python. Read the IDs into a set so the match is exact; testing against the file contents as a single string would let Contig1 also match inside Contig100:
with open('filter.txt', 'r') as f:
    # one ID per line; keep them in a set for exact lookups
    mask = set(line.strip() for line in f)
with open('data.txt', 'r') as f:
    for l in f:
        fields = l.split()
        # keep the line when its first whitespace-separated field is a wanted ID
        if fields and fields[0] in mask:
            print(l.rstrip('\n'))
If you're on Linux/Mac, you can do it on the command line (the $ symbolizes the command prompt; don't type it).
First, we create a file named file2-patterns from your file2 by appending .* to each line:
$ while read line; do echo "$line .*"; done < file2 > file2-patterns
And have a look at that file:
$ cat file2-patterns
Contig1 .*
Contig1000 .*
Contig10005 .*
Contig10017 .*
Now we can use these patterns to filter out lines from file1.
$ grep -f file2-patterns file1
Contig1000 orange1.1g045442m 65.50 400 130 2 631 1809 2 400 1e-156 466 2299 425
Contig10005 orange1.1g003445m 83.86 824 110 2 3222 808 1 820 0.0 1322 3583 820
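For large files, a common awk idiom is worth knowing as well: it matches the first field exactly instead of treating the IDs as regular expressions, so Contig1 can never partially match Contig100. A sketch, not part of the original answer:
awk 'NR==FNR { ids[$1]; next } $1 in ids' file2 file1
The first block runs while reading file2 and records each ID as an array key; the second prints every line of file1 whose first field is one of those keys.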

Problems combining awk scripts

I am trying to use awk to parse a tab-delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have the smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything by itself, but the first part removes the second duplicate and the second part sums the columns):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with something that removes whichever duplicate has the smaller sum.
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 {                     # pass the header line through untouched
    print
    next
}
{
    s = $2 + $3 + $4 + $5          # sum of the four count columns
}
s >= sum[$1] {                     # keep the row with the larger sum
    sum[$1] = s
    if (!($1 in rows))             # remember first-seen order of the gene ids
        a[++n] = $1
    rows[$1] = $0
}
END {
    for (i = 1; i <= n; i++)       # print the surviving rows in input order
        print rows[a[i]]
}' file | column -t
Output:
gene     SRR034450.out.rpkm_0  SRR034451.out.rpkm_0  SRR034452.out.rpkm_0  SRR034453.out.rpkm_0
lmo0001  160                   323                   533                   293
lmo0002  135                   317                   504                   306
lmo0003  1                     4                     5                     3
lmo0004  35                    59                    58                    48
lmo0005  113                   218                   257                   187
lmo0006  279                   519                   653                   539
lmo0007  563                   1053                  1165                  1069
lmo0008  34                    84                    203                   107
lmo0009  13                    45                    90                    49
lmo0010  57                    210                   237                   169
lmo0011  65                    224                   247                   179
lmo0012  65                    226                   250                   215
lmo0013  342                   500                   738                   682
lmo0014  662                   1032                  1283                  1311
lmo0015  321                   413                   631                   637
lmo0016  175                   253                   273                   325
lmo0017  3                     6                     6                     6
lmo0018  33                    38                    46                    45
lmo0019  13                    1                     39                    1
lmo0020  3                     12                    28                    15
lmo0021  3                     4                     14                    12
lmo0022  2                     3                     5                     1
lmo0023  2                     0                     3                     2
lmo0024  1                     0                     2                     6
lmo0330  1                     1                     1                     3
lmo0506  151                   232                   60                    204
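As a quick sanity check (my addition, assuming you saved the output above to a hypothetical parsed.txt): no gene id should survive more than once, so the following should print nothing:
awk 'NR > 1 { print $1 }' parsed.txt | sort | uniq -d   # -d prints only duplicated ids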

How can I swap numbers inside data block of repeating format using linux commands?

I have a huge data file, and I hope to swap some numbers in the 2nd column only, in a file of the following repeating format. The file has 25,000,000 datasets of 8768 lines each.
Edited: here is a shorter 10-line example. Sorry for the inconvenience. This is one typical data block.
# Dataset 1
#
# Number of lines 10
#
# header lines
5 11 3 10 120 90 0 0.952 0.881 0.898 2.744 0.034 0.030
10 12 3 5 125 112 0 0.952 0.897 0.905 2.775 0.026 0.030
50 10 3 48 129 120 0 1.061 0.977 0.965 3.063 0.001 0.026
120 2 4 5 50 186 193 0 0.881 0.965 0.899 0.917 3.669 0.000 -0.005
125 3 4 10 43 186 183 0 0.897 0.945 0.910 0.883 3.641 0.000 0.003
186 5 4 120 125 249 280 0 0.899 0.910 0.931 0.961 3.727 0.000 -0.001
193 6 4 120 275 118 268 0 0.917 0.895 0.897 0.937 3.799 0.000 0.023
201 8 4 278 129 131 280 0 0.921 0.837 0.870 0.934 3.572 0.000 0.008
249 9 4 186 355 179 317 0 0.931 0.844 0.907 0.928 3.615 0.000 0.008
280 10 4 186 201 340 359 0 0.961 0.934 0.904 0.898 3.700 0.000 0.033
#
# Dataset 2
#
# Number of lines 10
...
As you can see, there are 7 header lines at the top and 1 trailing line at the end of the dataset. The header and trailing lines all begin with #. As a result, each data block has 7 header lines, 8768 data lines, and 1 trailing line, 8776 lines in total. The trailing line contains only a single '#'.
I want to swap some numbers in the 2nd column only. First, I want to replace
1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222
of the 2nd column, and then,
666 => 6
333 => 3
222 => 2
of the 2nd column. I want to apply this replacement to every repeating dataset.
I tried this with Python, but the data is too big, so it runs out of memory. How can I perform this swapping with Linux commands like sed or awk?
Thanks
Best,
This might work for you, but you'd have to use GNU awk, as it uses the gensub function and $0 reassignment.
Put the following into an executable awk file (like script.awk):
#!/usr/bin/awk -f
BEGIN {
    a[1] = a[9] = a[10] = a[11] = 6
    a[2] = a[6] = a[7] = a[8] = 3
    a[3] = a[4] = a[5] = 2
}
function swap( c2, val ) {
    val = a[c2]
    return( val=="" ? c2 : val )
}
/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }
47 # print the line
Here's the breakdown:
BEGIN - set up an array a with mappings of the new values.
create a user-defined function swap that provides values for the 2nd column from the a array, or the value itself if it is not mapped. The c2 element is passed in, while val is a local variable (because no 2nd argument is passed in).
when a line starts with a space followed by a number and a space (the pattern), use gensub to replace the first occurrence of the two leading numbers with the first number concatenated with a space and the return value of swap (the action). gensub's replacement text is used here to preserve the first column's data; the second column is passed to swap via the field identifier $2. Using gensub preserves the formatting of the data lines (a standalone illustration of this call follows the list).
47 - an expression that evaluates to true, providing the default action of printing $0, which for data lines may have been modified. Any line that wasn't "data" is printed here without modification.
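Here is a tiny standalone illustration of that gensub call (a made-up one-off, not part of the script), showing that the second number is replaced while the leading spacing survives:
echo " 5 11 rest of line" | gawk '{ print gensub( /^( [0-9]+)( [0-9]+)/, "\\1 X", 1 ) }'
which prints " 5 X rest of line".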
The provided data doesn't show all the cases, so I made up my own test file:
# 2 skip me
9 2 not going to process me
 1 1 don't change the for matting
 2 2 4 23242.223 data
 3 3 data that's formatted
 4 4 7 that's formatted
 5 5 data that's formatted
 6 6 data that's formatted
 7 7 data that's formatted
 8 8 data that's formatted
 9 9 data that's formatted
 10 10 data that's formatted
 11 11 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
Running the executable awk (like ./script.awk data) gives the following output:
# 2 skip me
9 2 not going to process me
 1 6 don't change the for matting
 2 3 4 23242.223 data
 3 2 data that's formatted
 4 2 7 that's formatted
 5 2 data that's formatted
 6 3 data that's formatted
 7 3 data that's formatted
 8 3 data that's formatted
 9 6 data that's formatted
 10 6 data that's formatted
 11 6 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
which looks alright to me, but I'm not the one with 25 million datasets.
You'd also most definitely want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file, perhaps like:
head -n 26328 data | ./script.awk - > tempfile
You can learn more about the elements used in this script here:
awk basics (the man page)
Arrays
User defined functions
String functions - gensub()
And of course, you should spend some quality time reviewing awk related questions and answers on Stack Overflow ;)
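If exact whitespace preservation turns out not to matter, a portable variant (my sketch, any POSIX awk) can simply reassign the field and let awk rebuild the record; note this collapses runs of spaces into single separators:
awk 'BEGIN { a[1]=a[9]=a[10]=a[11]=6; a[2]=a[6]=a[7]=a[8]=3; a[3]=a[4]=a[5]=2 }
     /^#/ { print; next }              # header and trailing lines pass through untouched
     { if ($2 in a) $2 = a[$2]; print }' data > swapped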

Turning rows into columns based on a value in excel?

I have a bunch of data, ordered below one another in Excel (actually OpenOffice). I need it consolidated by the first column, but so that all the data are still shown:
From this (the first column actually goes up to 100):
1 283.038 244
1 279.899 494
1 255.139 992
1 254.606 7329
1 254.5 17145
1 251.008 23278
1 250.723 28758
1 247.753 92703
1 243.43 315278
1 242.928 485029
1 237.475 1226549
1 233.851 2076295
1 232.833 9826327
1 229.656 15965410
1 229.656 30000235
2 286.535 231
2 275.968 496
2 267.927 741
2 262.647 2153
2 258.925 3130
2 253.954 4857
2 249.551 9764
2 244.725 36878
2 243.825 318455
2 242.86 921618
2 238.401 1405028
2 234.984 3170031
2 233.168 4403799
2 229.317 8719139
2 224.395 26986035
2 224.395 30000056
3 269.715 247
3 268.652 469
3 251.214 957
3 249.04 30344
3 245.883 56115
3 241.753 289668
3 241.707 954750
3 240.684 1421766
3 240.178 1865750
3 235.09 2626524
3 233.579 5129755
3 232.517 7018880
3 232.256 18518741
3 228.75 19117443
3 228.75 30000051
to:
1                   2                   3
283.038  244        286.535  231        269.715  247
279.899  494        275.968  496        268.652  469
255.139  992        267.927  741        251.214  957
254.606  7329       262.647  2153       249.04   30344
254.5    17145      258.925  3130       245.883  56115
251.008  23278      253.954  4857       241.753  289668
250.723  28758      249.551  9764       241.707  954750
247.753  92703      244.725  36878      240.684  1421766
243.43   315278     243.825  318455     240.178  1865750
242.928  485029     242.86   921618     235.09   2626524
237.475  1226549    238.401  1405028    233.579  5129755
233.851  2076295    234.984  3170031    232.517  7018880
232.833  9826327    233.168  4403799    232.256  18518741
229.656  15965410   229.317  8719139    228.75   19117443
229.656  30000235   224.395  26986035   228.75   30000051
                    224.395  30000056
This must be really simple, but I couldn't find it. I tried a pivot table, but that only allows me to summarise or count the fields, while I want them all to be displayed. Any ideas?
To elaborate on the pivot table: I put column 1 as the row, column 2 as the column, and column 3 in the middle, but that comes out with a lot of empty cells and summarised values.
I am not sure which search terms to use; searching for unconsolidated pivot tables hasn't provided an answer.
After some discussions with my colleagues, this seemed undoable in Excel. So I just created a short script to save each "run" (which is the first column) in a separate file, from CSV:
awk -F, '{print > $1}' rw.txt
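To get the side-by-side layout from those per-run files, paste can stitch them back together line by line. A follow-up sketch, assuming the previous step produced files named 1, 2 and 3:
paste 1 2 3 > side-by-side.txt
paste joins corresponding lines with tabs, which a spreadsheet will import as separate columns; where one run has extra lines (like run 2 above), the other columns are simply left empty on those rows.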

Count number of occurrences in each line

I have the following file
ENST001 ENST002 4 4 4 88 9 9
ENST004 3 3 3 99 8 8
ENST009 ENST010 ENST006 8 8 8 77 8 8
Basically I want to count how many times ENST* appears in each line, so the expected result is
2
1
3
Any suggestions, please?
Try this; gsub returns the number of substitutions it made, which here is the number of ENST tokens on the line:
awk '{print gsub("ENST[0-9]+","")}' INPUTFILE
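If you prefer grep, the same per-line count can be produced with a shell loop; this is slower, but shows the mechanism explicitly (an illustrative sketch only):
while read -r line; do printf '%s\n' "$line" | grep -o 'ENST[0-9]*' | wc -l; done < INPUTFILE
grep -o prints each match on its own line and wc -l counts them; lines with no match yield 0.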
