Cut command not extracting fields with the default delimiter - Linux

I have a text file from which I must cut fields 3, 4, 5 and 8:
219 432 4567 Harrison Joel M 4540 Accountant 09-12-1985
219 433 4587 Mitchell Barbara C 4541 Admin Asst 12-14-1995
219 433 3589 Olson Timothy H 4544 Supervisor 06-30-1983
219 433 4591 Moore Sarah H 4500 Dept Manager 08-01-1978
219 431 4527 Polk John S 4520 Accountant 09-22-1998
219 432 4567 Harrison Joel M 4540 Accountant 09-12-1985
219 432 1557 Harrison James M 4544 Supervisor 01-07-2000
Since the default delimiter is tab, the command to extract the fields would be:
cut -f 3,4,5,8 filename
The thing is that the output is the same as the original file content. What is happening here? Why doesn't this work?

Your file doesn't actually contain any tab characters.
By default, cut prints any lines which don't contain the delimiter, unless you specify the -s option.
Since your records are aligned on character boundaries rather than tab-separated, you should use the -c option to specify which columns to cut. For example:
cut -c 9-12,14-25,43-57 file
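A quick way to check whether the file really contains tabs (a side note, not part of the original answer): GNU cat -A (cat -vet on other systems) shows each tab as ^I and each line end as $.
cat -A filename | head -n 3
If no ^I appears, cut -f with its default tab delimiter has nothing to split on, which is why the output looks identical to the input.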

Related

Filter a large text file using ID in another text file

I have two text files: one is composed of about 60,000 rows and 14 columns, and the other has a single column containing a subset of one of the columns (the first column) of the first file. I would like to filter file 1 based on the ID names in file 2. I tried some commands from the net but none of them was useful. Here are a few lines of the two text files (I'm on a Linux system):
File 1:
Contig100 orange1.1g013919m 75.31 81 12 2 244 14 2 78 4e-29 117 1126 435
Contig1000 orange1.1g045442m 65.50 400 130 2 631 1809 2 400 1e-156 466 2299 425
Contig10005 orange1.1g003445m 83.86 824 110 2 3222 808 1 820 0.0 1322 3583 820
Contig10006 orange1.1g047384m 81.82 22 4 0 396 331 250 271 7e-05 41.6 396 412
File 2:
Contig1
Contig1000
Contig10005
Contig10017
Please let me know your great suggestion to solve this issue.
Thanks in advance.
You can do this with Python:
with open('filter.txt', 'r') as f:
    # read the IDs from the filter file into a set so the match is exact
    # (a plain substring test would let Contig1 also match Contig100)
    mask = set(line.strip() for line in f)
with open('data.txt', 'r') as f:
    for l in f:
        # keep the line if its first column is one of the wanted IDs
        if l.split()[0] in mask:
            print(l.rstrip('\n'))
If you're on Linux/Mac, you can do it on the command line (the $ symbolizes the command prompt, don't type it).
First, we create a file called file2-patterns from your file2 by appending ‘ .*’ (a space and .*) to each line:
$ while read line; do echo "$line .*"; done < file2 > file2-patterns
And have a look at that file:
$ cat file2-patterns
Contig1 .*
Contig1000 .*
Contig10005 .*
Contig10017 .*
Now we can use these patterns to pull the matching lines from file1.
$ grep -f file2-patterns file1
Contig1000 orange1.1g045442m 65.50 400 130 2 631 1809 2 400 1e-156 466 2299 425
Contig10005 orange1.1g003445m 83.86 824 110 2 3222 808 1 820 0.0 1322 3583 820
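If the IDs must match the first column exactly (so that, say, Contig1 cannot also pull in Contig100), an awk lookup table is another option; a sketch using the same file1/file2 names:
$ awk 'NR==FNR {ids[$1]; next} $1 in ids' file2 file1
The first block reads file2 into the array ids; the second prints only those lines of file1 whose first field is one of those IDs.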

Sed command to find multiple patterns and print the line of the pattern and next nth lines after it

I have a tab-delimited file with 43,075 lines and 7 columns. I sorted the file by column 4 from the highest to the lowest value. Now I need to find 342 genes whose IDs are in column 2. See the example below:
miR Target Transcript Score Energy Length miR Length target
aae-bantam-3p AAEL007110 AAEL007110-RA 28404 -565.77 22 1776
aae-let-7 AAEL007110 AAEL007110-RA 28404 -568.77 21 1776
aae-miR-1 AAEL007110 AAEL007110-RA 28404 -567.77 22 1776
aae-miR-100 AAEL007110 AAEL007110-RA 28404 -567.08 22 1776
aae-miR-11-3p AAEL007110 AAEL007110-RA 28404 -564.03 22 1776
.
.
.
aae-bantam-3p AAEL018149 AAEL018149-RA 28292 -569.7 22 1769
aae-bantam-5p AAEL018149 AAEL018149-RA 28292 -570.93 23 1769
aae-let-7 AAEL018149 AAEL018149-RA 28292 -574.26 21 1769
aae-miR-1 AAEL018149 AAEL018149-RA 28292 -568.34 22 1769
aae-miR-10 AAEL018149 AAEL018149-RA 28292 -570.08 22 1769
There are 124 lines for each gene. However, I want to extract only the top hits for each, for example the top 5, since the file is sorted. I can do it for one gene with the following script:
sed -n '/AAEL018149/ {p;q}' myfile.csv > top-hits.csv
However, it prints only the line of the match. I was wondering if I could use a script to get all 342 genes at once. It would be great if I could get the line of the match and the next 4; then I would have the top 5 hits for each gene.
Any suggestion will be welcome. Thanks
You can also use awk for this:
awk '++a[$2]<=5' myfile.csv
Here, $2 refers to the 2nd column. Since the file is already sorted on the 4th column, this will print the top 5 lines for each gene (2nd column). All 342 genes will be covered, and the header line will be kept.
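If you later need a different cutoff, the same idea takes the count as an awk variable; a small variant of the command above:
awk -v n=5 '++a[$2]<=n' myfile.csv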
Use grep:
grep -A 4 AAEL018149 myfile.csv
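To cover all 342 genes in a single grep pass, you could put the gene IDs one per line into a file (genes.txt is a hypothetical name here) and combine -f with -A 4; grep separates the groups with -- lines, which can be dropped afterwards:
grep -A 4 -f genes.txt myfile.csv | grep -v '^--$'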

grep and egrep selecting numbers

I have to find all entries of people whose zip code has “22” in it. NOTE: this should not include something like Mike Keneally whose street address includes “22”.
Here are some samples of data:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
Here is the command I have so far, but I don't know why it's not working.
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
I guess this is your sample names.txt:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
Your command translates to: match any line satisfying these conditions:
[A-Z][A-Z] has two consecutive upper case characters
\s* zero or more space characters
[0-9]+ one or more digit character
[22] a single character that is either 2 or 2 — a bracket expression matches exactly one character, so this matches one 2, not the string “22”
[0-9]+$ one or more digit characters at the end of the line
To get lines satisfying your requirement (zip code has “22” in it), you can do it this way:
egrep '[A-Z]{2}\s+[0-9]*22' names.txt
If the zip code is always the last field, you can use this awk:
awk '$NF~/22/' file
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Ruth Underwood, Mariemont, OH 42522
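If you also want to require that the “22” sits inside the final ZIP field rather than merely somewhere after the state code, you can anchor the match to the end of the line; a minor variant of the egrep above:
egrep '[A-Z]{2}\s+[0-9]*22[0-9]*$' names.txt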

Combine results of column one Then sum column 2 to list total for each entry in column one

I am a bit of a Bash newbie, so please bear with me here.
I have a text file dumped by another piece of software (which I have no control over) listing each user with the number of times they accessed a certain resource. It looks like this:
Jim 109
Bob 94
John 92
Sean 91
Mark 85
Richard 84
Jim 79
Bob 70
John 67
Sean 62
Mark 59
Richard 58
Jim 57
Bob 55
John 49
Sean 48
Mark 46
.
.
.
My goal here is to get output like this:
Jim [Total for Jim]
Bob [Total for Bob]
John [Total for John]
And so on.
The names change each time I run the query in the software, so a static search on each name and then piping through wc does not help.
This sounds like a job for awk :) Pipe the output of your program to the following awk script:
your_program | awk '{a[$1]+=$2}END{for(name in a)print name " " a[name]}'
Output:
Sean 201
Bob 219
Jim 245
Mark 190
Richard 142
John 208
The awk script itself can be explained better in this format:
# executed on each line
{
    # 'a' is an array. It will be initialized
    # as an empty array by awk on its first usage
    # '$1' contains the first column - the name
    # '$2' contains the second column - the amount
    #
    # on every line the total score of 'name'
    # will be incremented by 'amount'
    a[$1]+=$2
}
# executed at the end of input
END{
    # print every name and its score
    for(name in a)print name " " a[name]
}
Note: to get the output sorted by score, you can add another pipe to sort -r -k2; -r -k2 sorts by the second column in reverse order:
your_program | awk '{a[$1]+=$2}END{for(n in a)print n" "a[n]}' | sort -r -k2
Output:
Jim 245
Bob 219
John 208
Sean 201
Mark 190
Richard 142
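If the totals can have different numbers of digits, a numeric sort is safer than the default lexicographic comparison; a small variant of the same pipeline:
your_program | awk '{a[$1]+=$2}END{for(n in a)print n" "a[n]}' | sort -k2,2 -rn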
Pure Bash:
declare -A result   # an associative array
while read name value; do
    ((result[$name]+=value))
done < "$infile"
for name in ${!result[*]}; do
    printf "%-10s%10d\n" $name ${result[$name]}
done
If the first 'done' is given no redirection from an input file, this script can be used with a pipe:
your_program | ./script.sh
and, to sort the output:
your_program | ./script.sh | sort
The output:
Bob 219
Richard 142
Jim 245
Mark 190
John 208
Sean 201
GNU datamash:
datamash -W -s -g1 sum 2 < input.txt
Output:
Bob 219
Jim 245
John 208
Mark 190
Richard 142
Sean 201

merging two files based on two columns

I have a question very similar to a previous post:
Merging two files by a single column in unix
but I want to merge my data based on two columns (the orders are the same, so there is no need to sort).
Example:
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
And I want the output to look like:
subjectid SubID2 prob_disease name age
12 121 0.009 Jane 16
24 241 0.738 Kristen 90
15 151 0.392 Clarke 78
23 231 1.2E-5 Joann 31
When I use join, it only considers the first column (subjectid) and repeats the subID2 column.
Is there a way of doing this with join, or some other way? Thank you.
The join command doesn't have an option to use more than one field as the joining criterion. Hence, you will have to add some intelligence into the mix. Assuming your files have a FIXED number of fields on each line, you can use something like this:
join f1 f2 | awk '{print $1" "$2" "$3" "$4" "$6}'
provided the field counts are as given in your examples. Otherwise, you need to adjust the print in the awk command by adding or removing fields.
If the row orders are identical, you can still join on a single column and specify which columns to output, like:
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
as described in join(1).
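Another way, if you do want to key on both columns explicitly, is to build a composite key in awk; a sketch assuming the same file_a/file_b names as above, with file_b holding the prob_disease column:
awk 'NR==FNR {val[$1" "$2]=$3; next} {print $1, $2, val[$1" "$2], $3, $4}' file_b file_a
The first block stores prob_disease under the key "subjectid subID2"; the second prints each line of file_a with that value spliced in.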
