I have two files, similar to the ones below:
File 1 - with phenotype information; the first column is the individual ID, and the original file has 400 rows:
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2 - with SNP information; the original file has 400 lines and 42,000 characters per line:
215 20211111201200125201212202220111202005111102
222 20111011212200025002211001111120211015112111
216 20210005201100025210212102210212201005101001
223 20222120201200125202202102210121201005010101
217 20211010202200025201202102210121201005010101
218 02022000252012021022101212010050101012021101
And I need to remove from file 2 the individuals that do not appear in file 1. For example, the expected result is:
215 20211111201200125201212202220111202005111102
222 20111011212200025002211001111120211015112111
216 20210005201100025210212102210212201005101001
223 20222120201200125202202102210121201005010101
I could do this with this code:
awk 'NR==FNR{a[$1]; next} $1 in a{print $0}' file1 file2 > file3
However, when I do my main analysis with the generated file the following error appears:
*** Error in `./airemlf90': free(): invalid size: 0x00007f5041cc2010 ***
*** Error in `./postGSf90': free(): invalid size: 0x00007fec4a04f010 ***
airemlf90 and postGSf90 are the programs I use for the analysis. When I use the original file this problem does not occur. Is the command I used to remove individuals adequate? Another detail I did not mention: some individuals have IDs with 4 characters instead of 3; could that be the cause of the error?
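One quick way to test that hypothesis (a diagnostic sketch, assuming the SNP strings themselves are all the same length, as the 42,000-character lines suggest) is to check whether every line of file3 has the same total length:
awk '{print length($0)}' file3 | sort -nu
# one value printed = lines aligned; several values = the wider IDs shift the genotype column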
Thanks
I wrote a small Python script in a few minutes. It works well; I have tested it with 42,000-character lines and it works fine.
import sys

# rudimentary argument parsing
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()

# First read file 1 and discard all fields except the first one (the key).
with open(file1, "r") as f1:
    for line in f1:
        toks = line.split()  # splits on runs of whitespace, like awk's default fields
        if toks:  # robustness against empty lines
            present.add(toks[0])

# Now read the second file and write to the third one only lines whose id is in the set.
with open(file2, "r") as f2, open(file3, "w") as f3:
    for line in f2:
        toks = line.split()
        if toks and toks[0] in present:
            f3.write(line)
(First install python if not already present.)
Call my sample script mytool.py and run it like this:
python mytool.py file1.txt file2.txt file3.txt
To process several sets of files at once in a bash script (to replace the original solution), it's easy, although not optimal, because it could all be done in a single loop in Python:
<whatever for loop you need>; do
    python mytool.py "$1" "$2" "$3"
done
exactly like you would call awk with 3 files; a concrete version is sketched below.
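For instance, assuming the file pairs follow a numbered naming scheme (pheno1.txt, snp1.txt and filtered1.txt are hypothetical names), the loop could look like this:
for n in 1 2 3; do
    python mytool.py "pheno$n.txt" "snp$n.txt" "filtered$n.txt"
done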
I have a text file supplied.tsv with file paths and a column with file sizes, as follows. I want to ensure that the filenames are unique.
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.cluster.stats 676
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.stats 788
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json 887
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json 887
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz 566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz 566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz.tbi 772
Expected output
Yes all unique filenames
MY PLAN
I will extract the first column from the file:
awk -F"\t" '{print $1}' supplied.tsv > supplied_firstcolumn.txt
Then I will extract the filename and check for distinct lines. Kindly let me know how to do this efficiently.
awk '{ fil[$1]++ } END { for (i in fil) { if (fil[i]>1) { print i" - "fil[i];dup++ } } if (dup < 1) { print "No duplicates" } }' files.txt
Create an array called fil with the filename as the index, and increment the value every time the file is seen. At the end, loop through the fil array: if the value is greater than 1, print the filename and the count, and also increment a duplicates count (dup). If the dup variable is less than 1 at the end of the loop, print "No duplicates".
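Run on the seven sample lines above, this prints the two duplicated paths (in no particular order, since for (i in fil) is unordered):
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json - 2
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz - 2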
how to do this efficiently
As you are interested in whether there are duplicates, not how many, I suggest stopping processing after hitting the 1st duplicate. I would do it the following way. Let file.txt's content be:
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.cluster.stats 676
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.stats 788
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json 887
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json 887
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz 566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz 566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz.tbi 772
then
awk 'BEGIN{uniq=1}(++arr[$1]>=2){uniq=0;exit}END{print uniq ? "all unique" : "found nonunique"}' file.txt
output
found nonunique
Explanation: First I set uniq to 1, which will stay that way if no duplicates are found. Then, for every line, I increase the counter in arr for the given path ($1) and check whether, after that operation, it is greater than or equal to 2. If it is, this means it is the 2nd or a following occurrence, so I set uniq to 0 and stop processing the file using exit, in other words, jump to END. In END I print depending on the uniq value; if you prefer to print only when no duplicates were found, you might use if(uniq){print "all unique"} in END.
(tested in gawk 4.2.1)
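If the goal is to compare only the basenames rather than the full paths (as the question's plan suggests), a sketch along the same lines, splitting $1 on / to get the last path component (untested against the real supplied.tsv):
awk 'BEGIN{uniq=1}
     {n=split($1, parts, "/")}
     (++arr[parts[n]] >= 2){uniq=0; exit}
     END{print uniq ? "all unique" : "found nonunique"}' file.txt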
I have a bash script that takes a single text file, power_coords.txt, containing 115 rows of data; within each row there are four (space-separated) columns containing x, y, z coordinates (first 3 cols) and a name (4th col). Example:
36 54 19 cotc1
45 13 -27 cotc2
1 -6 14 cotc3
....
My script runs the following lines of code to run an operation on each line of the text file:
#!/bin/bash
input="power_coords.txt"
while IFS=" " read x y z name
do
fslmaths avg152T1.nii.gz -mul 0 -add 1 -roi $x 1 $y 1 $z 1 0 1 $name -odt float
done < "$input"
This appears to work just fine; however, when I run the code I get a strange symbol (a small box) in each filename created.
Does anyone know why this is happening and how to fix it? Or a simple way to clean up the filenames (i.e., remove the bit that looks like lego) after the script has run?
Cheers
The small square in your screenshot says "000D", so it's just a carriage return (CR) character: the file has Windows-style (CRLF) line endings, and the CR ends up attached to the last field, name.
First of all, I would recommend checking what's in the variable $name, by just printing it:
while IFS=" " read x y z name
do
    echo "aaa${name}bbb"
done < "$input"
aaa and bbb are there to more easily see the output.
It would also help to check what kind of line endings are used in power_coords.txt. You can check by passing its path to the file utility; it should print which "line terminator" is used in the file.
After figuring out what's being read into $name, you can either convert the line breaks in power_coords.txt, or tweak IFS so that CR is treated as a delimiter as well.
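A minimal sketch of both options (power_coords_unix.txt is a made-up name; dos2unix, if installed, does the same job as tr here):
# Option 1: strip the carriage returns from the input once
tr -d '\r' < power_coords.txt > power_coords_unix.txt

# Option 2: add CR to IFS so read strips it from the last field
while IFS=$' \r' read -r x y z name
do
    echo "[$name]"   # the name should now be clean
done < power_coords.txt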
I have a file and I want to print its data to another file, except for the first line.
Data in the list.txt is
Vik
Ram
Raj
Pooja
OFA
JAL
The output should go into a new file, fd.txt, like below: everything except the first line's data, 'Vik'.
Ram
Raj
Pooja
OFA
JAL
Code that is not working:
find $_filepath -type d > list.txt
for i in 2 3 4 5 .. N
do
echo $i
done<list.txt >>fd.txt
tail -n +2 outputs all lines starting from the second one.
from https://superuser.com/questions/1071448/tail-head-all-line-except-x-last-first-lines
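Applied to the question's files, that is:
tail -n +2 list.txt > fd.txt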
How can I change this file
335
339
666665
666668
to this result
335
336
337
338
339
666665
666666
666667
666668
Explanation: between two numbers of the same length, it should insert the missing numbers to produce a numerically ascending sequence. Many thanks.
I believe this does what you want.
awk 'alen==length($1) {for (i=a;i<=$1;i++) print i}; {a=$1; alen=length(a); if (a==(i-1)) {a++}}'
When alen (the length of a) is the same as the length of the current line, loop from a to $1, printing out all the missing values.
Then set a to the new $1 and alen to the length of a, and when we have just dealt with a missing range (when a is the same as i - 1), increment a so we don't duplicate that number (this handles cases of sequential lines like 335, 339, 350 without duplicating 339).
With credit to @fedorqui for the basic idea.
Edit: I believe this fixes the problem I noted in the comments (which I think is what @JohnB was indicating as well):
awk '{f=0; if (alen==length($1)) {for (i=a;i<=$1;i++) print i} else {f=1}} {a=$1; alen=length(a)} a==(i-1){a++} f{print; a++}'
I feel like there should be a simpler way to do that but I don't see it at the moment.
Edit again: The input file I ended up testing with:
335
339
340
345
3412
34125
666665
666668
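For reference, running the edited command on that input should produce:
335
336
337
338
339
340
341
342
343
344
345
3412
34125
666665
666666
666667
666668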
The first approach is this:
$ awk 'NR%2 {a=$1; next} $1>a {for (i=a;i<=$1;i++) print i}' file
335
336
337
338
339
666665
666666
666667
666668
It can be improved as much as the info and effort you put into your question :)
Explanation
NR%2 {a=$1; next}: NR stands for the number of the record (the line number in this case), so NR%2 is 1 when NR is not a multiple of 2. On odd lines this stores the value of the line in the variable a. Then next stops processing the current line.
$1>a {for (i=a;i<=$1;i++) print i}: in the other case (even lines), if the value is bigger than the one that was stored, it loops from that value up to the current one, printing all the values in between.
I am working in a Linux environment and I would like some help with bash scripting to cut down on simple repetition.
I have a long list of file names (937, to be exact). The file has one file name per row, so there are 937 lines in total.
I would like to add certain text before the file name and add numbers after the file name in order.
so I would like something like this in the text file.
aa.exe
bb.exe
cc.exe
to
asd aa.exe 1
asd bb.exe 2
asd cc.exe 3
any help will be greatly appreciated.
Just for kicks, here's an awk version:
awk '{print "foo", $0, NR}' files.lst
If files.lst consists of:
a.txt
b.txt
c.txt
...then this will output:
foo a.txt 1
foo b.txt 2
foo c.txt 3
Pure Bash:
while read -r line
do
    echo "asd $line $((++i))"
done < inputfile
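With the question's three example names in inputfile, this prints (the counter i starts unset, so $((++i)) evaluates to 1 on the first line):
asd aa.exe 1
asd bb.exe 2
asd cc.exe 3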
Here is a simple Python solution; save it in a text file named so.py.
Since you are still using Python v2.4.2, this code should work with that earlier version:
#!/usr/bin/python
add_text = 'asd' # the string to put in front
fn = open('filenames.txt')
outf = open('outdata.txt', 'w')
i = 1
for filename in fn:
    outf.write('%s %s %d\n' % (add_text, filename.strip(), i))
    i += 1
fn.close()
outf.close()
Expects the names of the files to be in file filenames.txt, and the output generated goes to file outdata.txt.
asd aa.exe 1
asd bb.exe 2
asd cc.exe 3
The text to be added ahead of the filename is fixed in the variable add_text.
To run the script, issue these commands at the Linux prompt:
chmod +x so.py    # this is only needed once
./so.py           # run the script
and it will use the input file to generate the output file.
In vim:
:%s/.\+/\=printf("%s %s %d", "asd", submatch(0), line("."))/
The \= makes the replacement an expression: submatch(0) is the whole matched line and line(".") is its line number, so every nonempty line becomes asd <filename> <line number>.