Alphanumeric sorting of strings with variable width - Linux

I am stuck on a small sorting step. I have a huge file with >300K entries, and it has to be sorted on a specific column containing alphanumeric identifiers such as:
Rpl12-8
Lrsam1-1
Rpl12-9
Lrsam1-2
Rpl12-10
Lrsam1-5
Rpl12-11
Lrsam1-101
Lrsam2-1
Act-1
Act-100
Act-101
Act-11
The problem is the variable width, so I am unable to specify the second key by character position (e.g. sort -k 1.8n). The sort should be first on the alphabetic prefix, then on the number next to it, and then on the number after "-". Can I sort on the part after "-" using a field delimiter, so that the width of the string no longer matters?
Desired output would be:
Act-1
Act-11
Act-100
Act-101
Lrsam1-1
Lrsam1-2
Lrsam1-5
Lrsam1-101
Lrsam2-1
Rpl12-8
Rpl12-9
Rpl12-10
Rpl12-11

With the above data in input.txt:
sort -t- -k1,1 -k2n input.txt
You can change the field delimiter to - with -t, then sort on the first field only (as a string) with -k1,1, and finally on the 2nd field (as a number) with -k2n.
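For example, with the sample data above this yields exactly the desired order (output as produced by GNU sort; other implementations should agree here):
$ sort -t- -k1,1 -k2n input.txt
Act-1
Act-11
Act-100
Act-101
Lrsam1-1
Lrsam1-2
Lrsam1-5
Lrsam1-101
Lrsam2-1
Rpl12-8
Rpl12-9
Rpl12-10
Rpl12-11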

Related

Find if the first 10 digits of two columns in a CSV file match, in bash

I have a file (names.csv) which contains two columns; the values are separated by a comma:
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
e123456777-anything,e123456999-anything
These columns have values with a 10-character unique identifier, plus some extra junk in the values (-anything).
I want to see if the columns have the prefix matched!
To verify the values in the first and second columns I use:
cat /home/names.csv | parallel --colsep ',' echo column 1 = {1} column 2 = {2}
Which prints the values. Because the values are hex digits, it is cumbersome to verify them one by one just by reading. Is there any way to see whether the first 10 characters of each column pair are exact matches? They might contain special characters!
Expected output (example, but anything that says the columns are matched or not can work):
Matches (including first line):
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
Non-matches
e123456777-anything,e123456999-anything
Here's one way using awk. It prints every line where the first 10 characters of the first two fields match.
% cat /tmp/names.csv
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
e123456777-anything,e123456999-anything
% awk -F, 'substr($1,1,10)==substr($2,1,10)' /tmp/names.csv
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
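To list only the non-matching lines instead, negate the comparison (the same idea, just inverted):
% awk -F, 'substr($1,1,10)!=substr($2,1,10)' /tmp/names.csv
e123456777-anything,e123456999-anything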

Edit values in one column in 4,000,000 row CSV file

I have a CSV file I am trying to edit to add a numeric ID-type column with unique integers from 1 to approximately 4,000,000. Some of the fields already have an ID value, so I was hoping I could just sort those and then fill in starting at the largest value + 1. However, I cannot open this file to edit in Excel because of its size (I can only see the maximum of about 1,048,000 rows). Is there an easy way to do this? I am not familiar with coding, so I was hoping there was a way to do it manually, similar to Excel's fill-series feature.
Thanks!
Also: I know there are threads on how to edit a large CSV file, but I was hoping for help with this specific task. Thanks!
I basically want to sort the rows based on idnumber and then add unique IDs to the rows that are missing an ID value.
Screenshot of file
One way, using Notepad++ and a plugin named SQL:
Load the CSV in Notepad++
SELECT a+1,b,c FROM data
Hit 'start'
When starting with a file like this:
a,b,c
1,2,3
4,5,6
7,8,9
The results after look like this:
SQL Plugin 1.0.1025
Query : select a+1,b,c from data
Sourcefile : abc.csv
Delimiter : ,
Number of hits: 3
===================================================================================
Query result:
2,2,3
5,5,6
8,8,9
Or, in words, the first column is incremented by 1.
Second solution, using gawk (downloaded from https://www.klabaster.com/freeware.htm#mawk):
D:\TEMP>type abc.csv
a,b,c
1,2,3
4,5,6
7,8,9
D:\TEMP>gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print $1+1,$2,$3 }" abc.csv
a,b,c
2,2,3
5,5,6
8,8,9
(g)awk is a tool which reads a file line by line. Each line is accessible via $0, and the parts of the line via $1,$2,$3,... according to a separator.
This separator is set in my example (FS=OFS=\",\";) in the BEGIN section, which runs only once per input file. Do not be confused by the \": the whole script is between double quotes, and a variable (like OFS) is set using double quotes too, so they need to be escaped as \".
The getline; print $0 takes care of the first line of the CSV, which typically holds the column names.
Then, for every remaining line, the piece of code print $1+1,$2,$3 will increment the first column and print the second and third columns.
To extend this second example:
gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print ($1<5?$1+1:$1),$2,$3 }" abc.csv
The ($1<5?$1+1:$1) will check if the value of $1 is less than 5 ($1<5); if true, it returns $1+1, and otherwise $1. In other words, it only adds 1 if the current value is less than 5.
With your data you end up with something like this (untested!):
gawk "BEGIN{ FS=OFS=\",\"; getline; a=42; print $0 }{ if($4+0==0){ a++ }; print ($4+0==0?a:$1),$2,$3 }" input.csv
a=42 sets the initial value for the column values which need to be updated (you need to change this to the correct starting value).
The if($4+0==0){ a++ } will increment the value of a when the fourth column equals 0 (the $4+0 is done to convert empty values like "" to the numeric value 0).
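If the goal is simply to fill the missing IDs starting at the current maximum + 1, a minimal two-pass sketch (assuming, hypothetically, that the ID sits in the fourth column of input.csv and the first line is a header; adjust both to your actual layout) reads the file twice, first to find the largest existing ID and then to fill in the blanks:
gawk "BEGIN{ FS=OFS=\",\" } NR==FNR{ if(FNR>1 && $4+0>max) max=$4+0; next } FNR==1{ print; next } { if($4==\"\"){ $4=++max }; print }" input.csv input.csv > output.csv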

Shell | Sort Date and Month in Ascending order

I want to display/sort the file records in ascending order of date and month; if any values compare equal, they should be ordered by the next column, ascending.
Date & Month to sort: (current scenario)
ver>.....03.02...../ver>
ver>.....19.01...../ver>
ver>.....02.02...../ver>
File content:
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
How can I achieve the following results?
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
I tried using sort: (not working)
sort -n sortfile.txt
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
You can use sort, but you will need to specify the field separator -t '-' so that fields are split on '-', and then give one keydef to sort on the 5th field beginning with the 4th character (the month), another on the 5th field from the 1st character (the day), and finally a version sort on field 6 if all else is equal. That would be:
sort -t '-' -k5.4n -k5.1n -k6V contents
Providing full start and stop characters within each keydef can be done as:
sort -t '-' -k5.4n,5.5 -k5.1n,5.2 -k6V contents
(though for this data the output isn't changed)
Example Use/Output
$ sort -t '-' -k5.4n -k5.1n -k6V contents
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
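To see which part of the line each keydef refers to, you can split a sample line on '-' yourself (a quick sanity check, not part of the solution):
$ echo 'ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>' | cut -d- -f5
03.02
Field 5 is day.month, so -k5.4n compares the month (characters 4-5) first and -k5.1n the day (characters 1-2).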

Uniqing a delimited file based on a subset of fields

I have data such as below:
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
Due to the nature of the last two columns, their values change throughout the day and are repeated regularly. By grouping the way outlined in my desired output (below), I am able to view each time there was a change in their values (with the epoch time in the first column). Is there a way to achieve the desired output shown below:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
So I consolidate the data by the last two columns; however, the consolidation is not completely unique (as can be seen by 207.55,207.5 being repeated).
I have tried:
uniq -f 1
However, the output gives only the first line and does not continue through the list.
The awk solution below does not allow an occurrence which appeared previously to be output again, and so gives this output (below the awk code):
awk '!x[$2 $3]++'
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
I do not wish to sort the data by the second two columns. However, since the first is epoch time, it may be sorted by the first column.
You can't set a delimiter with uniq; fields have to be separated by whitespace. With the help of tr you can:
tr ',' ' ' <file | uniq -f1 | tr ' ' ','
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
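Note that this round trip assumes the values themselves contain no spaces, since every space is translated back to a comma on the way out.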
You can use an Awk statement as below,
awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file
which produces the output as you need.
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
The idea is to store the second and third column values in the variables s and t respectively, and print the line only if either value differs from the previous line.
I found an answer which is not as elegant as Inian's, but it satisfies my purpose.
Since my first column is always epoch time in microseconds and does not grow or shrink in length, I can use the following uniq command:
uniq -s 17
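If the file is not already in time order, it can be pre-sorted numerically on the epoch column first, for example (a sketch; combine with the uniq step):
sort -t, -k1,1n file | uniq -s 17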
You can also compare each line with the previous one manually, with a loop:
previous_line=""
# start at first line
i=1
# suppress first column, that don't need to compare
sed 's#^[0-9][0-9]*,##' ./data_file > ./transform_data_file
# for all line within file without first column
for current_line in $(cat ./transform_data_file)
do
# if previous record line are same than current line
if [ "x$prev_line" == "x$current_line" ]
then
# record line number to supress after
echo $i >> ./line_to_be_suppress
fi
# record current line as previous line
prev_line=$current_line
# increment current number line
i=$(( i + 1 ))
done
# suppress lines
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i $line_to_suppress'd' ./data_file ; done
rm line_to_be_suppress
rm transform_data_file
Since your first field seems to have a fixed length of 17 characters (the 16-digit timestamp plus the , delimiter), you could use the -s option of uniq, which would be more optimal for larger files:
uniq -s 17 file
Gives this output:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
From man uniq:
-f num
Ignore the first num fields in each input line when doing comparisons.
A field is a string of non-blank characters separated from adjacent fields by blanks.
Field numbers are one based, i.e., the first field is field one.
-s chars
Ignore the first chars characters in each input line when doing comparisons.
If specified in conjunction with the -f option, the first chars characters after
the first num fields will be ignored. Character numbers are one based,
i.e., the first character is character one.

How to sort data by the numbers in the third column?

If I have a file consisting of data that looks as follows, how would I sort the data based on the numbers in the third column?
The first two columns are separated not by tabs but by some number of spaces, and the space between the second and third column varies with the size of the number.
Also note that there are spaces within some values of the second column (like lp25( plasmid, with a space between ( and p), while others have no spaces (like chromosome).
HELIX lp25(plasmid 24437 bp RNA linear 29-AUG-2011
HELIX cp9(plasmid 9586 bp DNA helix 29-AUG-2011
HELIX lp28-1(plasmid 25455 bp DNA linear 29-AUG-2011
HELIX chromosome 911724 bp DNA plasmid 29-AUG-2011
Here you go:
sort -n -k 3 test.txt
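With the four lines above saved as test.txt, and assuming the second column contains no embedded spaces (as rendered here), that gives:
$ sort -n -k 3 test.txt
HELIX cp9(plasmid 9586 bp DNA helix 29-AUG-2011
HELIX lp25(plasmid 24437 bp RNA linear 29-AUG-2011
HELIX lp28-1(plasmid 25455 bp DNA linear 29-AUG-2011
HELIX chromosome 911724 bp DNA plasmid 29-AUG-2011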
From man sort:
-n, --numeric-sort compare according to string numerical value
-k, --key=KEYDEF sort via a key; KEYDEF gives location and type
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end. If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace. OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key. If no key is given, use
the entire line as the key.
and also interesting:
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
which tells us that the F fields are separated by whitespace.
