Sorting rows of a data file with Linux

I would like to sort the values on each line of a data file in descending order (each line independent of the others), keeping the first field in place. For example, if I have a data file
1 0.1 0.6 0.4
2 0.5 0.2 0.3
3 1.0 0.2 0.8
I would like to end with something like
1 0.6 0.4 0.1
2 0.5 0.3 0.2
3 1.0 0.8 0.2
I have tried the sort command, but it sorts the file's lines against each other, not the values within a line. Transposing the data file and then sorting could also be a good solution, but I don't know any easy way to transpose data files.
Thanks for the help!

Perl to the rescue!
perl -lawne '
print join "\t", $F[0], sort { $b <=> $a } @F[1..$#F]
' < input > output
-n reads the input line by line
-a splits the line on whitespace into the @F array
-l adds newlines to print
See sort, join.
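If GNU awk is available, here is an alternative sketch using its asort() builtin (not in POSIX awk); the field values compare numerically, so the descending order comes out right:
gawk '{
    n = 0
    for (i = 2; i <= NF; i++) vals[++n] = $i          # collect every field after the first
    m = asort(vals)                                   # ascending sort, gawk only
    out = $1
    for (i = m; i >= 1; i--) out = out "\t" vals[i]   # append in descending order
    print out
    delete vals
}' input > output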

Or to read input line by line, use tr and sort like this:
#! /bin/sh
while read -r line; do
echo "$line" | tr ' ' '\n' | sort -k1,1nr -k2 | tr '\n' '\t' >> output
echo >> output
done < input
tr ' ' '\n' is to convert row to column.
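To see what a single line goes through, here is the pipeline applied to the first sample row:
$ echo '1 0.1 0.6 0.4' | tr ' ' '\n' | sort -k1,1nr -k2 | tr '\n' '\t'
1	0.6	0.4	0.1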

Related

Linux: Number of characters in a text file on lines 'x' through 'y'

How do I print the number of characters on lines x - y of a text file?
I tried using wc -m filename.txt
but I couldn't figure out how to limit the search.
You could use head and tail together, substituting the actual line numbers for x and y:
head -n y filename | tail -n $((y-x+1)) | wc -m
For lines 6-10 that would be head -n 10 filename | tail -n 5 | wc -m.
You can use the sed command to select the lines you want and then pipe the output into wc. Something like this would select lines 6-10 and print the number of characters:
sed -n '6,10p' filename.txt | wc -m
Try this:
awk '{ print NR, "-", length($0)}' filename.txt
It will print the line number NR and the characters per line length($0) of filename.txt, so the output will be something like:
1 - 3 # line 1 with 3 characters
2 - 0 # line 2 with no characters
...
In case you just want to print the number of characters for a specific range, let's say from line 1 to 3, this could be used:
awk 'NR>=1 && NR<=3 { print length($0)}' filename.txt
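If what you actually want is the total character count over the range (what the sed | wc -m pipeline reports), here is an awk-only sketch that adds each line's length plus one for its newline:
awk 'NR >= 6 && NR <= 10 { n += length($0) + 1 } END { print n }' filename.txt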

Linux command (Calculating the sum)

I have a .txt file with the following content:
a 3
a 4
a 5
a 6
b 1
b 3
b 5
c 9
c 10
I am wondering if there is any command (no awk if possible) that can read the .txt file and give the following output (sorted by the second column):
c 19
a 18
b 9
You can use awk piped to sort:
awk '{sums[$1] += $2} END {for (i in sums) print i, sums[i]}' file | sort -rnk2
c 19
a 18
b 9
sums[$1] += $2 adds the value of $2 into an array sums indexed by the first field ($1).
sort -rnk2 reverse-sorts the awk output numerically on field 2.
You can use this code:
cat 1.txt | awk '{arr[$1]+=$2}END{for (var in arr) print var," ",arr[var]}' | sort -rnk 2
Explanation:
cat 1.txt - reads the 1.txt file
awk - a language very useful for data manipulation
{arr[$1]+=$2} - for each line of the file, increases the array item keyed by the first field by the value of the second field. The field separator is whitespace by default.
END{for (var in arr) print var," ",arr[var]} - after all lines are processed, prints the array contents
sort -rnk 2 - reverse numeric sort on field 2
Non-awk solutions.
perl
perl -lane '
$sum{$F[0]} += $F[1]
} END {
$, = " ";
print $_, $sum{$_} for reverse sort {$sum{$a} <=> $sum{$b}} keys %sum
' file.txt
(With -n, perl wraps the code in a read loop; the bare } closes that loop and END { } runs after all input, much like awk's END block.)
bash version 4
declare -A sum
while read key val; do (( sum[$key] += $val )); done < file.txt
for key in "${!sum[@]}"; do echo "$key ${sum[$key]}"; done | sort -rn -k2
non-awk challenge accepted
vars=$(cut -d" " -f1 nums | uniq); paste <(echo "$vars") <(cat <(sed -e 's/ /+=/' nums) <(echo "$vars" | sed 's/$/;/') | bc) | sort -k2,2nr
c 19
a 18
b 9
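If GNU datamash is installed, it is another awk-free option; this is a sketch assuming space-separated input as shown (-s sorts the input first, -t sets the field separator, -g 1 groups on field 1):
datamash -s -t ' ' -g 1 sum 2 < file.txt | sort -rnk2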

How to extract the integer or decimal at beginning of each input line, using Linux/Unix utilities?

Given input such as:
1
1a
1.1b
2.0c
How to extract the integer/decimal number at beginning of each input line, using only Linux/Unix command line utilities?
Using awk, you could say:
awk '{print $0+0}'
Awk is available in Linux, BSD, and many other Unix-like operating systems. It helps in this way:
echo "1" | awk '{a+=$0; print a}' # output 1
echo "1a" | awk '{a+=$0; print a}' # output 1
echo "1.1b" | awk '{a+=$0; print a}' # output 1.1
echo "2.0c" | awk '{a+=$0; print a}' # output 2
Some more awk.
For extracting the number exactly as it appears (preserving trailing zeros):
$ awk 'gsub(/[[:alpha:]].*/,x,$1) + 1' << EOF
1
1a
1.1b
2.0c
EOF
1
1
1.1
2.0
For the integer part:
$ awk '{print int($0)}' << EOF
1
1a
1.1b
2.0c
EOF
1
1
1
2
Edit:
If there are blank lines in the file, you can avoid printing zeros for them with the following:
$ awk 'NF{$0+=0}1' << EOF
1
1a
1.1b
2foot4c
2
EOF
1
1
1.1
2
2
Here is a way to do this with sed:
echo "12.3abc" | sed -n 's/^\([0-9.][0-9.]*\).*/\1/p'
Output:
12.3
The group in parentheses matches a run of digits or periods '.' at the beginning of the line. Everything after that is matched by the '.*'.
The \1 says to replace the entire line with just the portion that was matched in the parentheses.
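If your sed supports -E for extended regular expressions (GNU and BSD sed both do), a tighter variant requires digits before an optional decimal part, so a stray leading '.' is not matched:
sed -nE 's/^([0-9]+(\.[0-9]+)?).*/\1/p'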
Assuming your version of grep supports -o:
grep -o '^[0-9.]\+' data.in
NB: This will match any sequence of digits and decimal points at the start of the line.
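If that looseness matters (the class would also accept something like '1.2.3' or a bare leading dot), here is a stricter extended-regex variant, again assuming your grep supports -E and -o:
grep -oE '^[0-9]+(\.[0-9]+)?' data.in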

Find value from one csv in another one (like vlookup) in bash (Linux)

I have already tried all the options I found online, but without success.
Basically I have two csv files (pipe separated):
file1.csv:
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
file2.csv:
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
I need a Linux bash script to find the value at position 3 in file2 based on the content of position 7 in file1.
Example:
file1, line1, pos 7: MAYOBAN
find MAYOBAN in file2, return pos 3 (2400)
the output should be something like this:
**2400**
**2200**
**2200**
**etc...**
Please help
Jacek
A simple approach, far from perfect:
DELIMITER="|"
for i in $(cut -f 7 -d "${DELIMITER}" file1.csv );
do
grep "${i}" file2.csv | cut -f 3 -d "${DELIMITER}";
done
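One caveat: grep "${i}" matches anywhere in the line, so a code appearing in some other field of file2.csv would produce false hits. Anchoring the pattern to the first field avoids that (a variation on the loop above):
DELIMITER="|"
for i in $(cut -f 7 -d "${DELIMITER}" file1.csv); do
    grep "^${i}${DELIMITER}" file2.csv | cut -f 3 -d "${DELIMITER}"
done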
This will work, but since the input files must be sorted, the output order will be affected:
join -t '|' -1 7 -2 1 -o 2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output would look like:
2200
2200
2400
which is useless. In order to have a useful output, include the key value:
join -t '|' -1 7 -2 1 -o 0,2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output then looks like this:
CORKCOR|2200
CORKKIN|2200
MAYOBAN|2400
Edit:
Here's an AWK version:
awk -F '|' 'FNR == NR {keys[$7]; next} {if ($1 in keys) print $3}' file1.csv file2.csv
This loops through file1.csv and creates array entries for each value of field 7. Simply referring to an array element creates it (with a null value). FNR is the record number in the current file and NR is the record number across all files. When they're equal, the first file is being processed. The next instruction reads the next record, creating a loop. When FNR == NR is no longer true, the subsequent file(s) are processed.
So file2.csv is now processed and if it has a field 1 that exists in the array, then its field 3 is printed.
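On the sample files it prints the following; note the order follows file2.csv, since that is the file being scanned for matches:
2400
2200
2200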
You can use Miller (https://github.com/johnkerl/miller).
Starting from input01.txt
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
and input02.txt
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
and running
mlr --csv -N --ifs "|" join -j 7 -l 7 -r 1 -f input01.txt then cut -f 3 input02.txt
you will have
2400
2200
2200
Some notes:
-N to set input and output without header;
--ifs "|" to set the input fields separator;
-l 7 -r 1 to set the join fields of the input files;
cut -f 3 to extract the field named 3 from the join output
cut -d\| -f7 file1.csv | while read -r line
do
grep "$line" file2.csv | cut -d\| -f3
done
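For a closer analogue to VLOOKUP that avoids running grep once per input line, here is a bash 4 sketch that loads file2.csv into an associative array keyed on its first field and then streams file1.csv:
#!/bin/bash
declare -A lookup                      # field 1 of file2.csv -> field 3
while IFS='|' read -r key _ value _; do
    lookup[$key]=$value
done < file2.csv

while IFS='|' read -r _ _ _ _ _ _ key _; do
    echo "${lookup[$key]}"             # field 3 of file2 for file1's field 7
done < file1.csv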

Permutation of columns without repetition

Can anybody give me a piece of code, an algorithm, or something else to solve the following problem?
I have several files, each with a different number of columns, like:
$> cat file-1
1 2
$> cat file-2
1 2 3
$> cat file-3
1 2 3 4
I would like to take the absolute difference of each pair of columns and divide it by the sum of all values in the row, once for each distinct pair of columns (combinations without repeated column pairs):
in file-1 case I need to get:
0.3333 # because |1-2|/(1+2)
in file-2 case I need to get:
0.1666 0.1666 0.3333 # because |1-2|/(1+2+3) and |2-3|/(1+2+3) and |1-3|/(1+2+3)
in file-3 case I need to get:
0.1 0.2 0.3 0.1 0.2 0.1 # because |1-2|/(1+2+3+4) and |1-3|/(1+2+3+4) and |1-4|/(1+2+3+4) and |2-3|/(1+2+3+4) and |2-4|/(1+2+3+4) and |3-4|/(1+2+3+4)
This should work, though I am guessing you made a minor mistake in your expected output. Based on your third example, it should be the following.
Instead of:
in file-2 case I need to get:
0.1666 0.1666 0.3333 # because |1-2|/(1+2+3) and |2-3|/(1+2+3) and |1-3|/(1+2+3)
It should be:
in file-2 case I need to get:
0.1666 0.3333 0.1666 # because |1-2|/(1+2+3) and |1-3|/(1+2+3) and |2-3|/(1+2+3)
Here is the awk one liner:
awk '
NF{
a=0;
for(i=1;i<=NF;i++)
a+=$i;
for(j=1;j<=NF;j++)
{
for(k=j;k<NF;k++)
printf("%s ",-($j-$(k+1))/a)
}
print "";
next;
}1' file
Short version:
awk '
NF{for (i=1;i<=NF;i++) a+=$i;
for (j=1;j<=NF;j++){for (k=j;k<NF;k++) printf("%2.4f ",-($j-$(k+1))/a)}
print "";a=0;next;}1' file
Input File:
[jaypal:~/Temp] cat file
1 2
1 2 3
1 2 3 4
Test:
[jaypal:~/Temp] awk '
NF{
a=0;
for(i=1;i<=NF;i++)
a+=$i;
for(j=1;j<=NF;j++)
{
for(k=j;k<NF;k++)
printf("%s ",-($j-$(k+1))/a)
}
print "";
next;
}1' file
0.333333
0.166667 0.333333 0.166667
0.1 0.2 0.3 0.1 0.2 0.1
Test from shorter version:
[jaypal:~/Temp] awk '
NF{for (i=1;i<=NF;i++) a+=$i;
for (j=1;j<=NF;j++){for (k=j;k<NF;k++) printf("%2.4f ",-($j-$(k+1))/a)}
print "";a=0;next;}1' file
0.3333
0.1667 0.3333 0.1667
0.1000 0.2000 0.3000 0.1000 0.2000 0.1000
@Jaypal just beat me to it! Here's what I had:
awk '{for (x=1;x<=NF;x++) sum += $x; for (i=1;i<=NF;i++) for (j=2;j<=NF;j++) if (i < j) printf ("%.1f ",-($i-$j)/sum)} END {print ""}' file.txt
Output:
0.1 0.2 0.3 0.1 0.2 0.1
prints to one decimal place.
@Jaypal, is there a quick way to printf an absolute value? Perhaps something like abs(value)?
EDIT:
@Jaypal, yes I've tried searching too and couldn't find anything simple :-( It seems if ($i < 0) $i = -$i is the way to go. I guess you could use sed to remove any minus signs:
awk '{for (x=1;x<=NF;x++) sum += $x; for (i=1;i<=NF;i++) for (j=2;j<=NF;j++) if (i < j) printf ("%.1f ", ($i-$j)/sum)} {print ""}' file.txt | sed "s%-%%g"
Cheers!
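For the record, awk has no built-in abs(), but you can define one and skip the sed post-processing. Here is a sketch of the same computation with a user-defined function (sum is reset per line, so multi-line files work too):
awk 'function abs(v) { return v < 0 ? -v : v }
{
    sum = 0
    for (x = 1; x <= NF; x++) sum += $x            # row total
    for (i = 1; i < NF; i++)
        for (j = i + 1; j <= NF; j++)
            printf ("%.1f ", abs($i - $j) / sum)   # |col_i - col_j| / row sum
    print ""
}' file.txt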
As it looks like homework, I will act accordingly.
To count the numbers present in the file, you can use
cat filename | wc -w
Find the first_number by:
cat filename | cut -d " " -f 1
To find the sum in a file:
cat filename | tr " " "+" | bc
Now that you have total_nos, use something like:
for i in $(seq 1 1 $total_nos)
do
#Find the numerator by first_number - $i
#Use the sum you got from above to get the desired value.
done
