I have data such as below:
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
Due to the nature of the last two columns, their values change throughout the day and the same values are repeated regularly. By grouping them the way outlined in my desired output (below), I am able to see each time there was a change in their values (with the epoch time in the first column). Is there a way to achieve the desired output shown below:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
So I consolidate the data by the last two columns. However, the consolidation is not completely unique (as can be seen by 207.55,207.5 being repeated later on).
I have tried:
uniq -f 1
However, the output gives only the first line and does not go on through the list (presumably because, with comma-separated data, each line is a single whitespace-delimited field, so skipping one field leaves nothing to compare).
The awk solution below never prints a value pair that has already appeared earlier in the file, and so gives the output shown below the awk code:
awk '!x[$2 $3]++'
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
I do not wish to sort the data by the last two columns. However, since the first column is epoch time, the data may be sorted by the first column.
You can't set the delimiter for uniq; fields have to be separated by whitespace. With the help of tr you can work around that:
tr ',' ' ' <file | uniq -f1 | tr ' ' ','
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
You can use an Awk statement as below,
awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file
which produces the output as you need.
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
The idea is to store the second and third column values in the variables s and t respectively and print the line only when either value differs from the one on the previous line, i.e. whenever the pair changes.
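An equivalent way to express the same adjacent-duplicate check (a sketch along the same lines, not part of the original answer) is to build one key from both columns and compare it with the previous line's key:
awk -F, '($2 "," $3) != prev { print } { prev = $2 "," $3 }' file
This prints a line whenever the value pair differs from the pair on the line immediately before it, which also covers the case where only one of the two columns changed.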
I found an answer which is not as elegant as Inian's, but it satisfies my purpose.
Since my first column is always the epoch time in microseconds and always has the same number of characters, I can use the following uniq command:
uniq -s 17
You can try to manually (with a loop) compare current line with previous line.
prev_line=""
# start at first line
i=1
# strip the first column, which we don't need for the comparison
sed 's#^[0-9][0-9]*,##' ./data_file > ./transform_data_file
# for every line of the file without its first column
for current_line in $(cat ./transform_data_file)
do
# if the previous line is the same as the current line
if [ "x$prev_line" == "x$current_line" ]
then
# record the line number so it can be suppressed afterwards
echo $i >> ./line_to_be_suppress
fi
# record current line as previous line
prev_line=$current_line
# increment the current line number
i=$(( i + 1 ))
done
# suppress lines
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i $line_to_suppress'd' ./data_file ; done
rm line_to_be_suppress
rm transform_data_file
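For reference, a more compact variant of the same compare-with-previous-line idea, using a while read loop and no temporary files (just a sketch; deduped_data_file is a placeholder output name):
prev=""
while IFS= read -r line
do
    # keep only the part after the first comma for the comparison
    rest=${line#*,}
    # print the line only when the value part differs from the previous line's
    if [ "$rest" != "$prev" ]
    then
        printf '%s\n' "$line"
    fi
    prev=$rest
done < ./data_file > ./deduped_data_file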
Since your first field seems to have a fixed length of 17 characters (including the , delimiter), you could use the -s option of uniq, which would be more optimal for larger files:
uniq -s 17 file
Gives this output:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
From man uniq:
-f num
Ignore the first num fields in each input line when doing comparisons.
A field is a string of non-blank characters separated from adjacent fields by blanks.
Field numbers are one based, i.e., the first field is field one.
-s chars
Ignore the first chars characters in each input line when doing comparisons.
If specified in conjunction with the -f option, the first chars characters after
the first num fields will be ignored. Character numbers are one based,
i.e., the first character is character one.
Related
I want to extract 2 columns from a delimited file (delimiter '||') in Unix. This can easily be done if the complete row is on one line, like below
foo||bar||baz||quux
by
cut -d'||' -f1 file_name
but in my case a single row's record wraps onto the next line, for example:
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
and the output from the above command is
foo
quux
whereas it should be just "foo", because that is what is in the first column.
The file contains, in row 1:
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
and in row 2:
foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
||quux2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
The output should be:
foo
foo2
Almost, but the -d switch only takes one char:
cut -d'|' -f1 file_name
Output:
foo
foo2
Note: since the delimiters are doubled, the -f switch won't work as expected if the field number is greater than 1. One way to handle that is to adjust the field number to 2n-1. So to get field #3, use -f$(( (3*2) - 1 )).
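As an illustration of that 2n-1 adjustment, pulling field #3 ("baz") out of the single-line sample row would look like this:
echo 'foo||bar||baz||quux' | cut -d'|' -f$(( (3*2) - 1 ))
baz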
Using awk: since what you want is the first field of every other record (NR%2 is true for odd record numbers), use:
$ awk -F\| 'NR%2{print $1}' file
foo
foo2
Data (four records):
$ cat file
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
||quux2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
An interesting phenomenon is that mawk accepts -F"\|\|" (dual pipes) as the delimiter but GNU awk doesn't.
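A sketch that should behave the same in both, since | is literal inside a bracket expression, is:
awk -F'[|][|]' 'NR%2{print $1}' file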
I have a CSV input with lines like these:
1,zzzz,xxxx,
1,xxxx,xyxy,
2,xxxx,xxxx,
3,yyyy,xxxx,
3,xxxx,yyyy,
3,xxxx,zzzz,
1,ffff,xxxx,
1,aaaa,xxxx,
And I need to discard lines where the first field matches that of the preceding line:
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,
I tried sort | uniq alone, but that didn't work because all the lines are different except for the first field (the number).
Use awk instead of uniq:
awk -F, '$1 != last { last=$1; print }'
-F, sets the field separator to comma. $1 is the contents of the first field, so this prints the line whenever the first field changes.
I got the wanted output with uniq --check-chars=N: uniq then compares only the first N characters of each line, and since the input isn't sorted, the same leading value is still allowed to appear again later in the list.
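For the sample above, where the deciding value is the single leading digit, that would be (assuming GNU uniq; -w is the short form of --check-chars):
uniq --check-chars=1 file
This only holds while the first field has a fixed width of one character; for longer or variable-width keys, the awk answer above is the safer choice.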
I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john#mills.com,19/02/1954
I am trying to separate these records into two categories, MALES and FEMALES.
I have a list of some 5000 Female Names , all in plain text , all in one file.
How can I "grep" the first column ( since I am only matching first names) but still printing the entire customer record ?
I found it easy to "cut" the first column and grep --file=female.names.txt, but this way it's not going to print the entire record any longer.
I am aware of the awk option but in that case I don't know how to read the female names from file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks !
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
This would print the lines of your CSV file whose first field matches any of the names in female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
This would output the lines whose first field is not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
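If you actually want both categories written out in one pass, the same NR==FNR technique can redirect each record by category (a sketch only; the output file names are just placeholders):
awk -F, 'NR==FNR{a[$0]; next} {print $0 > (($1 in a) ? "female.csv" : "male.csv")}' female.names.txt file.csv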
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes all the names in the list of female names to the regular expression ^name, so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the file to match against the data file.
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;
BEGIN {
    # read every name from the first file on the command line into a hash
    while (<ARGV>) {
        chomp;
        $names{$_} = 1;
    }
}
# -a/-F, autosplit each record on commas into @F; print the record
# if its first field is one of the names read above
print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose you have the following lines in a file named test.txt:
abe 123 bdb 532
xyz 593 iau 591
Now you want to find the lines whose first field begins and ends with a vowel. A simple grep would return both lines, but the following gives only the first line, which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Then you want to find the lines whose third field begins and ends with a vowel. Similarly, a simple grep would return both lines, but the following gives only the second line, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The {1,} in the first pair of curly braces specifies that the preceding character class, which covers the range 0 to z in the ASCII table, may occur one or more times. After that comes the field separator, a space in this case. Change the value in the second pair of curly braces ({0} or {2} above) to the desired field number minus 1. Then use a regular expression to state your criteria.
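As an illustration, applying the same vowel test to the second field means using {1} (field number 2 minus 1). With a made-up line whose second field is ada (not part of the original test.txt), it matches:
echo "xyz ada 591" | egrep "^([0-z]{1,} ){1}[aeiou][0-z]+[aeiou]"
xyz ada 591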
I have a text file that has about 500 rows of information.
I am adding a few strings to the beginning of each line separated by a comma (Excel recognizes it as another column).
I have this code so far:
sed -e "2,$s#^# =HYPERLINK(B2,C2), https://otrs.city.pittsburgh.pa.us/index.pl?Action=AgentTicketZoom;TicketID=#"** C:\Users\hd\Desktop\newaction.txt > C:\Users\hd\Desktop\test.txt
I have the columns I want. One column adds a link on to a previous column (easy enough).
The first column will be a formula (string), =HYPERLINK(B2,C2), and I want to increment the 2's to 3's, 4's and so on.
Example:
=HYPERLINK(B2,C2)
=HYPERLINK(B3,C3)
=HYPERLINK(B4,C4)
=HYPERLINK(B5,C5)
=HYPERLINK(B6,C6)
It is my second day coding with sed and awk.
Is there any way I can make this happen using awk and sed?
This Perl one-liner:
perl -pe "BEGIN{$i = 2} s#^#=HYPERLINK(B${i},C${i})#; $i++" "input.txt"
will add =HYPERLINK(B2,C2) to the front of the first line, =HYPERLINK(B3,C3) to the second, and so on. (On a Unix shell, put the program in single quotes instead, so the shell does not expand $i; the double quotes above suit Windows cmd.)
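Since you asked about awk, a rough equivalent of the incrementing prefix (a sketch only; it starts the references at B2/C2 on the first line and leaves out the long URL from your original sed) would be:
awk '{ print "=HYPERLINK(B" NR+1 ",C" NR+1 ")," $0 }' input.txt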
I have a CSV file that look like this:
A,B,C
1,2,3
4,4,4
1,2,6
3,6,9
Is there an easy way to grep all the rows in which the B column is 2, and keep the header? For example, I want the output to be like
A,B,C
1,2,3
1,2,6
I am working under Linux.
Using awk:
awk -F, 'NR==1 || $2==2' file
NR==1 -> if first line,
$2==2 -> if second column is equal to 2. Lines are printed if either of the above is true.
To choose the column using the header column name:
awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
Replace B with the appropriate name of the column which you want to check against.
You can use addresses in sed:
sed -n '1p;/^[^,]*,2/p'
It means:
1p Print the first line.
/ Start a match.
^ Match the beginnning of a line.
[^,] Match anything but a comma
* zero or more times.
, Match a comma.
2 Match a 2.
/p End of match, if it matches, print.
If the header can contain the value you are looking for, you should be more careful:
sed -n '1p;1!{/^[^,]*,2/p}'
1!{ ... } just means "Do the following for lines other than the first one".
For column number n>2, you can add a quantifier:
sed -n '1p;1!{/^\([^,]*,\)\{M\}2/p}'
where M = n-1. The quantifier just means repetition, so the "non-comma, zero or more times, then a comma" group is repeated M times.
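For example, with M=2 (column number n=3), the command to keep the header plus the rows whose third column starts with 2 becomes:
sed -n '1p;1!{/^\([^,]*,\)\{2\}2/p}'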
For true CSV files where a value can contain a comma, switch to Perl and Text::CSV.
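A minimal sketch of that approach for the same "column B equals 2" filter (reading from standard input and the eol setting are my choices, not part of the question):
perl -MText::CSV -e '
    my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
    my $n = 0;
    while (my $row = $csv->getline(\*STDIN)) {
        $n++;
        # keep the header row and any row whose second column is exactly 2
        $csv->print(\*STDOUT, $row) if $n == 1 or $row->[1] eq "2";
    }
' < file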
$ awk -F, 'NR==1 { for (i=1;i<=NF;i++) h[$i] = i; print; next } $h["B"] == 2' file
A,B,C
1,2,3
1,2,6
By the way, sed is an excellent tool for simple substitutions on a single line; for anything else, just use awk - the code will be clearer and MUCH easier to enhance in the future if necessary.