I have a text file like this:
john,3
albert,4
tom,3
junior,5
max,6
tony,5
I'm trying to fetch the records where the column 2 value is the same. My desired output:
john,3
tom,3
junior,5
tony,5
I'm wondering whether uniq -d can be used on the second column.
Here's one way using awk. It reads the input file twice, but avoids the need to sort:
awk -F, 'FNR==NR { a[$2]++; next } a[$2] > 1' file file
Results:
john,3
tom,3
junior,5
tony,5
Brief explanation:
FNR==NR is a common AWK idiom that is true only while the first file in the arguments list is being read. Here, column two is used as an array key and its count is incremented. On the second read of the file, we simply check whether the count for column two's value is greater than one (the next keyword skips the rest of the code during the first pass).
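If you would rather read the file only once, a rough single-pass sketch of the same idea (it buffers the whole file in memory and preserves the original line order):
awk -F, '{ count[$2]++; key[NR] = $2; line[NR] = $0 }
         END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }' file
Here count[] tracks how often each column-two value occurs, while key[] and line[] remember every record so the END block can print only the lines whose value occurred more than once.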
You can use uniq on fields (columns), but not easily in your case.
uniq's -f and -s options skip fields and characters respectively. However, neither of these quite does what you want.
-f counts fields divided by whitespace, but yours are separated by commas.
-s skips a fixed number of characters, but your names are of variable length.
Overall, though, uniq is used to compress input by consolidating duplicates into unique lines. You actually want to retain duplicates and eliminate singletons, which is the opposite of what uniq normally does. It would appear you need a different approach.
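That said, if you really want to press uniq into service, one workaround (just a sketch, assuming GNU uniq for the -D and -w options, and assuming the column-two value fits in 10 characters) is to prepend that value as a fixed-width prefix, sort, keep only the duplicated groups, and strip the prefix again:
awk -F, '{ printf "%-10s%s\n", $2, $0 }' file | sort | uniq -D -w10 | cut -c11-
Note that the output comes out grouped by the column-two value rather than in the original order.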
How to change delimiter from current comma (,) to semicolon (;) inside .txt file using linux command?
Here is my ME_1384_DataWarehouse_*.txt file:
Data Warehouse,ME_1384,Budget for HW/SVC,13/05/2022,10,9999,13/05/2022,27,08,27,08
Data Warehouse,ME_1384,Budget for HW/SVC,09/05/2022,10,9999,09/05/2022,45,58,45,58
Data Warehouse,ME_1384,Budget for HW/SVC,25/05/2022,10,9999,25/05/2022,7,54,7,54
Data Warehouse,ME_1384,Budget for HW/SVC,25/05/2022,10,9999,25/05/2022,7,54,7,54
It is very important that the values of the last two columns are numbers with 2 decimal places; for example, the value of the last 2 columns in the first row is "27,08".
That is probably the main reason why the delimiter couldn't be changed properly.
I tried with:
sed 's/,/;/g' ME_1384_DataWarehouse_*.txt
and every comma was changed, including the ones inside the values of the last 2 columns.
Is there anyone who can help me out with this issue?
With sed you can replace the nth occurrence of a certain lookup string. Example:
$ sed 's/,/;/4' file
will replace the 4th comma with a semicolon.
So, if you know you have 11 fields (10 commas), you can do
$ sed 's/,/;/g;s/;/,/10;s/;/,/8' file
Example:
$ seq 1 11 | paste -sd, | sed 's/,/;/g;s/;/,/10;s/;/,/8'
1;2;3;4;5;6;7;8,9;10,11
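Applied to your sample data (11 fields per line), the first line should come out as:
Data Warehouse;ME_1384;Budget for HW/SVC;13/05/2022;10;9999;13/05/2022;27,08;27,08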
Your question is somewhat unclear, but if you are trying to say "don't change the last comma, or the third-to-last one", a solution to that might be
perl -pi~ -e 's/,(?![^,]+(?:,[^,]+,[^,]+)?$)/;/g' ME_1384_DataWarehouse_*.txt
Perl in isolation does not loop over the input lines, but the -p option tells it to loop over the input one line at a time, like sed, and print every line (there is also -n to simulate the behavior of sed -n). The -i~ says to modify the file in place, saving the original with a tilde appended to its file name as a backup. The regex uses a negative lookahead (?!...) to protect the two commas you want to exempt from the replacement; lookaheads are a modern regex feature that isn't supported by older tools like sed.
Once you are satisfied with the solution, you can remove the ~ after -i to disable the generation of backups.
You can do this with awk:
awk -F, 'BEGIN {OFS=";"} {a=$NF;NF-=1; printf "%s,%s\n",$0,a} ' input_file
This should work with most awk versions (though don't count on the old default Solaris awk).
The idea is to store the last field of the row in a variable, decrement the number of fields (which makes awk rebuild the line with the new delimiter), and then print the rebuilt line, a comma, and the stored last field.
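Since your sample data actually ends with two comma-decimal values ("27,08" and "27,08"), a sketch of the same idea that restores both of them could look like this (assuming every line has exactly 11 comma-separated fields, and an awk such as GNU awk that rebuilds the record when NF is assigned):
awk -F, 'BEGIN { OFS=";" } {
    last = $(NF-1) "," $NF        # last decimal value, e.g. 27,08
    prev = $(NF-3) "," $(NF-2)    # second-to-last decimal value
    NF -= 4                       # drop the four raw fields; $0 is rebuilt with ";"
    print $0, prev, last
}' input_file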
I have a CSV file where I need to dedupe entries where the FIRST field matches, even if the other fields don't match. In addition, the line that is kept should be the one with the highest date in one of the other fields.
This is what my data looks like:
"47917244","000","OTC","20180718","7","2018","20180719","47917244","20180719"
"47917244","000","OTC","20180718","7","2018","20180731","47917244","20180731"
"47917244","000","OTC","20180718","7","2018","20180830","47917244","20180830"
All 3 lines have the same value in the first field. The 9th field is a date field, and I want to dedupe in such a way that the third line, which has the highest date value, is kept, but the other two lines are deleted.
After checking another stackoverflow post (Is there a way to 'uniq' by column?), I got it working by using a mix of sort and awk:
sort -t, -u -k1,1 -k9,9 <file> |
awk -F',' '{ x[$1]=$0 } END { for (i in x) print x[i] }'
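For reference, a sort-free sketch of the same dedupe in a single awk pass (assuming the 9th field is always a zero-padded YYYYMMDD date, so plain string comparison orders it correctly):
awk -F, '$9 > max[$1] { max[$1] = $9; line[$1] = $0 }
         END { for (k in line) print line[k] }' file
For each value of the first field it remembers the line with the largest 9th field; the order in which the surviving lines are printed is unspecified.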
A colleague of mine noticed some odd behaviour with the sort command today, and I was wondering if anyone knows if the output of this command is intentional or not?
Given the file:
ABC_22
ABC_43
ABC_1
ABC_1
ABC_43
ABC_10
ABC_123
We are looking to sort the file with numeric sort, and also make it unique, so we run:
sort file.txt -nu
The output is:
ABC_22
Now, we know that the numeric sort won't work in this case as the lines don't begin with numbers (and that's fine, this is just part of a larger script), but I would have expected something more along the lines of:
ABC_1
ABC_10
ABC_123
ABC_22
ABC_43
Does anyone know why this isn't the case? The sort acts as one would expect if given just the -u or -n options individually.
With -n, an empty number is zero:
Sort numerically. The number begins each line and consists of optional
blanks, an optional ‘-’ sign, and zero or more digits possibly
separated by thousands separators, optionally followed by a
decimal-point character and zero or more digits. An empty number is
treated as ‘0’.
All these lines have an empty number at the start of the line, so they are all zero for sort's numerical uniqueness. If you'd started each line with the same number, say 1, the effect would be the same. You should specify the field containing the numbers explicitly, or use version sort (-V):
$ sort -Vu foo
ABC_1
ABC_10
ABC_22
ABC_43
ABC_123
You are missing the delimiter specification; tell GNU sort to split on _ and use the second field as the key:
sort -nu -t'_' -k2 file
ABC_1
ABC_10
ABC_22
ABC_43
ABC_123
The -n flag selects numerical sort and -u keeps only unique lines; the key part is setting the delimiter to _ with -t'_' and sorting on the second field (the part after _) with -k2.
I have a document (.txt) composed like this:
info1: info2: info3: info4
And I want to show some information by column.
For example, I have different values in the "info3" field, and I want to see only the lines that have "test" in the "info3" column.
I think I have to use sort but I'm not sure.
Any idea?
The previous answers assume that the third column is exactly equal to test. It looks like you were looking for lines where the value of the third column includes test. We can use awk's match function:
awk -F: 'match($3, "test")' file
You can use awk for this. Assuming your columns are delimited by : and column 3 has entries equal to test, the command below lists only those lines.
awk -F':' '$3=="test"' input-file
Assuming that the spacing is consistent, and you're looking for only test in the third column, use
grep ".*:.*: test:.*" file.txt
Or to take care of any spacing that might occur
grep ".*:.*: *test *:.*" file.txt
What's the difference between:
!tail -n +2 hits.csv | sort -k 1n -o output.csv
and
!tail -n +2 hits.csv | sort -t "," -k1 -n -k2 > output.csv
?
I'm trying to sort a csv file by first column first, then by the second column, so that lines with the same first column are still together.
It seems like the first one already does that correctly, by first sorting by the field before the first comma, then by the field following the first comma. (breaking ties, that is.)
Or does it not actually do that?
And what does the second command do/mean? (And what's the difference between the two?) There is a significant difference between the two output.csv files when I run the two.
And, finally, which one should I use? (Or are they both wrong?)
See also the answer by #morido for some other pointers, but here's a description of exactly what those two sort invocations do:
sort -k 1n -o output.csv
This assumes that the "fields" in your file are delimited by a transition from non-whitespace to whitespace (i.e. leading whitespace is included in each field, not stripped, as many might expect/assume), and tells sort to order things by a key that starts with the first field and extends to the end of the line, and assumes that the key is formatted as a numeric value. The output is sent explicitly to a specific file.
sort -t "," -k1 -n -k2
This defines the field separator as a comma, and then defines two keys to sort on. Each key starts at the named field and extends to the end of the line. Neither key carries a type modifier of its own, so both inherit the global -n and are compared numerically (with GNU sort, global ordering options apply to every key that has no modifiers of its own, regardless of where they appear on the command line). Numeric conversion stops at the first character that cannot be part of a number, which here is the first comma, so the first key effectively compares the number in column one, and the second key breaks ties by comparing the number in column two.
Since you didn't provide sample data, it's unknown whether the data in the first two fields is numeric or not, but I suspect you want something like what was suggested in the answer by #morido:
sort -t, -k1,1 -k2,2
or
sort -t, -k1,1n -k2,2n (alternatively sort -t, -n -k1,1 -k2,2)
if the data is numeric.
First off: you want to remove the leading ! from these two commands. In Bash (and probably other shells, since this feature comes from csh), it would otherwise reference the most recent command in your history that started with tail, which does not make sense here.
The main difference between your two versions is that in the first case you are not taking the second column into account.
This is how I would do it:
tail -n +2 hits.csv | sort -t "," -n --key=1,1 --key=2,2 > output.csv
-t specifies the field separator
-n turns on numerical sorting order
--key specifies the fields that should be used for sorting (in order of precedence)
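If the first line of hits.csv is a header you want to keep (which the tail -n +2 suggests), one way, for example, is to write it out first and then append the sorted body:
head -n 1 hits.csv > output.csv
tail -n +2 hits.csv | sort -t "," -n --key=1,1 --key=2,2 >> output.csv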