NTILE a column in csv - Linux

I have a csv file that reads like this:
a,b,c,2
d,e,f,3
g,h,i,3
j,k,l,4
m,n,o,5
p,q,r,6
s,t,u,7
v,w,x,8
y,z,zz,9
I want to assign quintiles to this data (like we do in SQL), preferably using a bash command in Linux. The quintiles, assigned as a new column, would make the final output look like:
a,b,c,2, 1
d,e,f,3, 1
g,h,i,3, 2
j,k,l,4, 2
m,n,o,5, 3
p,q,r,6, 3
s,t,u,7, 4
v,w,x,8, 4
y,z,zz,9, 5
The only thing I am able to achieve is adding a new incremental column to the csv file:
`awk '{$3=","a[$3]++}1' f1.csv > f2.csv`
But I'm not sure how to do the quintiles. Please help. Thanks.

awk '{a[NR]=$0}
END{
    for(i=1;i<=NR;i++) {
        p=100/NR*i
        q=1
        if(p>20){q=2}
        if(p>40){q=3}
        if(p>60){q=4}
        if(p>80){q=5}
        print a[i] ", " q
    }
}' file
Output:
a,b,c,2, 1
d,e,f,3, 2
g,h,i,3, 2
j,k,l,4, 3
m,n,o,5, 3
p,q,r,6, 4
s,t,u,7, 4
v,w,x,8, 5
y,z,zz,9, 5
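The threshold chain above can be generalized to any number of tiles, closer to SQL's NTILE(n); a minimal sketch, where the tiles variable name is my own:

awk -v tiles=5 '{a[NR]=$0}
END{
    for(i=1;i<=NR;i++) {
        # bucket row i into 1..tiles by its position in the file
        q = int((i-1) * tiles / NR) + 1
        print a[i] ", " q
    }
}' file

With tiles=5 on the 9-row sample this yields the 1,1,2,2,3,3,4,4,5 assignment the question asks for.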

Short wc + awk approach:
awk -v n=$(cat file | wc -l) \
'BEGIN{ OFS=","; n=sprintf("%.f", n*0.2); c=1 }
{ $(NF+1)=" "c }!(NR % n){ ++c }1' file
n=$(cat file | wc -l) - get the total number of lines of the input file
n*0.2 - 1/5th (20 percent) of that line count
$(NF+1)=" "c - append a new last field holding the current rank value c
The output:
a,b,c,2, 1
d,e,f,3, 1
g,h,i,3, 2
j,k,l,4, 2
m,n,o,5, 3
p,q,r,6, 3
s,t,u,7, 4
v,w,x,8, 4
y,z,zz,9, 5

Related

How to write code for more than one file in awk

I wrote a script in AWK called exc7
./exc7 file1 file2
In every file there is a matrix
file1 :
2 6 7
10 5 4
3 8 4
file2:
-60 10
10 -60
The code that I wrote is :
#!/usr/bin/awk -f
{
for (i=1;i<=NF;i++)
A[NR,i]=$i
}
END{
for (i=1;i<=NR;i++){
sum += A[i,1]
}
for (i=1;i<=NF;i++)
sum2 += A[1,i]
for (i=0;i<=NF;i++)
sum3 += A[NR,i]
for (i=0;i<=NR;i++)
sum4 += A[i,NF]
print sum,sum2,sum3,sum4
if (sum==sum2 && sum==sum3 && sum==sum4)
print "yes"
}
It should check, for every file, whether the sums of the first column, the last column, the first row and the last row are all the same. It prints the four sums and says yes if they are equal. Then it should print the largest sum of all the numbers in all the files.
When I try it on one file it is right; for example, on file1 it prints:
15 15 15 15
yes
but when I try it on two or more files, like file1 file2, the output is:
-35 8 -50 -31
You should use FNR instead of NR, and with gawk you can use ENDFILE instead of END. However, the following should work with any awk:
awk 'function sumline(last,rn) {n=split(last,lr);
for(i=1;i<=n;i++) rn+=lr[i];
return rn}
function printresult(c1,r1,rn,cn) {print c1,r1,rn,cn;
print (r1==rn && c1==cn && r1==c1)?"yes":"no"}
FNR==1{if(last)
{rn=sumline(last);
printresult(c1,r1,rn,cn)}
rn=cn=c1=0;
r1=sumline($0)}
{c1+=$1;cn+=$NF;last=$0}
END {rn=sumline(last);
printresult(c1,r1,rn,cn)}' file1 file2
15 15 15 15
yes
-50 -50 -50 -50
yes
Essentially, instead of checking for the end of each file, you check for the start of a file and print the previous file's results. The first file needs to be treated differently, and you still need the END block to handle the last file.
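For reference, a minimal gawk sketch of the ENDFILE approach mentioned above; it only sums the first column of each file, and the sum variable is just a placeholder name:

gawk '{ sum += $1 }
      ENDFILE { print FILENAME, sum; sum = 0 }' file1 file2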
UPDATE
Based on the questions you asked, I think it's better for you to keep your script as is and change the way you call it.
for file in file1 file2;
do echo "$file"; ./exc7 "$file";
done
You'll be calling the script once for each file, so all the complications go away.

How can I count the most occurring sequence of 3 letters within a word with a bash script

I have a sample file like
XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant
Here I need to find the most frequently occurring sequences of 3 letters within a word.
Output should be
acc = 5
aco = 3
Is that possible in Bash?
I have absolutely no idea how to accomplish it with awk, sed, or grep.
Any clue how it's possible?
PS: there is no attempted output because I have no idea how to do this; I don't want to write unnecessary awk -F ... attempts that won't help anywhere.
Here's how to get started with what I THINK you're trying to do:
$ cat tst.awk
BEGIN { stringLgth = 3 }
{
for (fldNr=1; fldNr<=NF; fldNr++) {
field = $fldNr
fieldLgth = length(field)
if ( fieldLgth >= stringLgth ) {
maxBegPos = fieldLgth - (stringLgth - 1)
for (begPos=1; begPos<=maxBegPos; begPos++) {
string = tolower(substr(field,begPos,stringLgth))
cnt[string]++
}
}
}
}
END {
for (string in cnt) {
print string, cnt[string]
}
}
$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
This is an alternative method to the solution of Ed Morton. It does less looping but needs a bit more memory. The idea is not to worry about spaces or any non-alphabetic characters; we filter them out at the end.
awk -v n=3 '{ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file
With GNU awk you can do this a bit differently, and a bit more efficiently, by making each whitespace-separated word its own record. That way the filtering at the end is not needed:
awk -v n=3 -v RS='[[:space:]]' '
(length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) print s,a[s] }' file
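As with the first answer, the END output comes out in arbitrary order; piping through the same sort ranks the sequences by count (a usage sketch):

awk -v n=3 -v RS='[[:space:]]' '
(length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) print s,a[s] }' file | sort -k2,2nr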
This might work for you (GNU sed, sort and uniq):
sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -c |
sort -s -k1,1rn |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
Use the first sed invocation to output 3 letter lower case words.
Sort the words.
Count the duplicates.
Sort the counts in reverse numerical order maintaining the alphabetical order.
Use the second sed invocation to manipulate the results into the desired format.
If you only want sequences that occur more than once, listed in alphabetical order and treated case-sensitively, use:
sed -E 's/.(..)/&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -cd |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'

In bash how can I split a column into several columns of fixed size

How can I split a single column into several columns of fixed size? For example, I have a column like this:
1
2
3
4
5
6
7
8
and for size 4, for example, I want to obtain
1 5
2 6
3 7
4 8
or for size 2, for example, I want to obtain
1 3 5 7
2 4 6 8
Using awk:
awk '
BEGIN {
# Number of rows to print
n=4;
}
{
# Add to array with key = 0, 1, 2, 3, 0, 1, 2, ..
l[(NR-1)%n] = l[(NR-1)%n] " " $0
};
END {
# print the array
for (i = 0; i < length(l); i++) {
print l[i];
}
}
' file
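The row count can also be passed in from the command line rather than hard-coded in BEGIN; a minimal sketch (it also avoids the leading space and the gawk-only length(array) call):

awk -v n=4 '{
    key = (NR-1) % n
    # append to the row for this key, adding a separator only after the first value
    if (key in l)
        l[key] = l[key] " " $0
    else
        l[key] = $0
}
END {
    for (i = 0; i < n; i++) print l[i]
}' file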
OK, this is a bit long-winded and not infallible, but the following should work:
td=$( mktemp -d ); split -l <rows> <file> ${td}/x ; paste $( ls -1 ${td}/x* ) ; rm -rf ${td}; unset td
Where <rows> is the number of rows you want per column and <file> is your input file.
Explanation:
td=$( mktemp -d )
Creates a temporary directory so that we can put temporary files into it. Store this in td - it's possible that your shell has a td variable already but if you sub-shell for this your scope should be OK.
split -l <rows> <file> ${td}/x
Split the original file into many smaller files, each <rows> lines long. These will be put into your temp directory and all will be prefixed with x.
paste $( ls -1 ${td}/x* )
Paste these files side by side so that their lines end up in consecutive columns.
rm -rf ${td}
Remove the files and directory.
unset td
Clean the environment.
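A concrete run with the 8-line sample and 4 rows per column would look like this sketch (the chunk names xaa, xab are what split produces by default):

td=$( mktemp -d )
split -l 4 file ${td}/x
paste -d' ' ${td}/x*    # the glob expands alphabetically (xaa, xab, ...), matching split's creation order
rm -rf ${td}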
Assuming you know the number of rows in your column (here, 8):
n=8
# to get output with 4 rows:
seq $n | pr -ts" " -$((n/4))
1 5
2 6
3 7
4 8
# to get output with 2 rows:
seq $n | pr -ts" " -$((n/2))
1 3 5 7
2 4 6 8
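If the row count is not known in advance, it can be taken from the file itself; a small sketch along the same lines:

n=$(wc -l < file)
pr -ts" " -$((n/4)) file    # 4 rows per column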
If you know the desired output width you can use column.
# Display in columns for an 80 column display
cat file | column -c 80
$ cat tst.awk
{ a[NR] = $0 }
END {
OFS=","
numRows = (numRows ? numRows : 1)
numCols = ceil(NR / numRows)
for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
for ( colNr=1; colNr<=numCols; colNr++ ) {
idx = rowNr + ( (colNr - 1) * numRows )
printf "%s%s", a[idx], (colNr<numCols ? OFS : ORS)
}
}
}
function ceil(x, y){y=int(x); return(x>y?y+1:y)}
$ awk -v numRows=2 -f tst.awk file
1,3,5,7
2,4,6,8
$ awk -v numRows=4 -f tst.awk file
1,5
2,6
3,7
4,8
Note that the above produces a CSV with the same number of fields in every row even when the number of input rows isn't an exact multiple of the desired number of output rows:
$ seq 10 | awk -v numRows=4 -f tst.awk
1,5,9
2,6,10
3,7,
4,8,
See https://stackoverflow.com/a/56725452/1745001 for how to do the opposite, i.e. generate a number of rows given a specified number of columns.

Extracting several rows with overlap using awk

I have a big file that looks like this (it actually has 12368 rows):
Header
175566717.000
175570730.000
175590376.000
175591966.000
175608932.000
175612924.000
175614836.000
.
.
.
175680016.000
175689679.000
175695803.000
175696330.000
What I want to do is delete the header, then extract the first 2000 lines (lines 1 to 2000), then lines 1500 to 3500, then 3000 to 5000, and so on...
What I mean is: extract windows of 2000 lines with an overlap of 500 lines between contiguous windows, until the end of the file.
From a previous post, I got this:
tail -n +2 myfile.txt | awk 'BEGIN{file=1} ++count && count==2000 {print > "window"file; file++; count=500} {print > "window"file}'
But that isn't what I want. I don't have the 500-line overlap, and my first window has 1999 rows instead of 2000.
Any help would be appreciated.
awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}
END{while(i<NR-1){for(k=i;k<i+t;k++)print a[k] > i".txt"; close(i".txt");i=i+t-d}}' file
Try the line above; you can change the numbers to fit your requirements, and you can define your own filenames too.
A little test with t=10 (your 2000) and d=5 (your 500):
kent$ cat f
header
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
kent$ awk -v i=1 -v t=10 -v d=5 'NR>1{a[NR-1]=$0}END{while(i<NR-1){for(k=i;k<i+t;k++)print a[k] > i".txt"; close(i".txt");i=i+t-d}}' f
kent$ head *.txt
==> 1.txt <==
1
2
3
4
5
6
7
8
9
10
==> 6.txt <==
6
7
8
9
10
11
12
13
14
15
==> 11.txt <==
11
12
13
14
15
awk is not ideal for this. In Python you could do something like:
with open("data") as fin:
    lines = fin.readlines()
# remove the header
lines = lines[1:]
# print the windows
i = 0
while True:
    print("\n starting window")
    if len(lines) < i + 2000:
        # we're done; whatever is left in the file is ignored
        break
    for line in lines[i:i + 2000]:
        print(line[:-1])  # strip the trailing \n (print adds its own)
    i += 2000 - 500
Reading the entire file into memory is usually not a great idea, and in this case it is not necessary. Given a line number, you can easily compute which window files it should go into. For example:
awk '{
a = int( NR / (t-d));
b = int( (NR-t) / (t-d)) ;
for( f = b; f<=a; f++ ) {
if( f >= 0 && (f * (t-d)) < NR && ( NR <= f *(t-d) + t))
print > ("window"(f+1))
}
}' t=2000 d=500
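To also drop the header, as the question asks, the same program can read from tail, with the sizes passed via -v (a usage sketch):

tail -n +2 myfile.txt | awk -v t=2000 -v d=500 '{
    a = int( NR / (t-d) )
    b = int( (NR-t) / (t-d) )
    for ( f = b; f <= a; f++ )
        if ( f >= 0 && (f * (t-d)) < NR && NR <= f * (t-d) + t )
            print > ("window" (f+1))
}'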

extract columns from multiple text files with bash

I am trying to extract columns from multiple text files (3000 files). A sample of my text file is shown below.
res ABS sum
SER A 1 161.15 138.3
CYS A 2 66.65 49.6
PRO A 3 21.48 15.8
ALA A 4 77.68 72.0
ILE A 5 15.70 9.0
HIS A 6 10.88 5.9
I would like to:
1) print the resnames (first column) only if the sum (last column) is > 25;
2) store the output in one file;
3) add a new column to the output file with the name of the txt file the data was extracted from, and also print the total number of each resname (from all text files, only where sum is > 25).
I would like to get the following output
SER AA.txt
CYS AA.txt
ALA AA.txt
SER BB.txt
Total number of SER - 2
Total number of ALA - 1
Total number of CYS - 1
How can I get this output with Bash? I tried the following code
for i in files/*.txt
do
awk 'BEGIN{FS=OFS=" "}{if($5 > 25) print $1,i}'
done
Any suggestions please?
Try:
awk '{ a[$1]++ }
END { for (k in a) print "Total number of " k " - " a[k] }' FILES
(Not tested)
awk '{
if ($NF ~ /([0-9])+(\.)?([0-9])+/ && $NF > 25) {
print $1, FILENAME;
res[$1]++;
}
}
END {
for (i in res) {
print "Total number of ", i, "-", res[i];
}
}' res.txt
Here's the output I get for your example:
SER res.txt
CYS res.txt
ALA res.txt
Total number of SER - 1
Total number of CYS - 1
Total number of ALA - 1
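To process all the text files at once and collect everything into a single output file, the same program can be given a glob instead of one filename (a usage sketch; files/ comes from the question, output.txt is just an assumed name, and FILENAME will then include the files/ prefix):

awk '{
    if ($NF ~ /([0-9])+(\.)?([0-9])+/ && $NF > 25) {
        print $1, FILENAME;
        res[$1]++;
    }
}
END {
    for (i in res) {
        print "Total number of ", i, "-", res[i];
    }
}' files/*.txt > output.txt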
