Is there a way to remove only consecutive duplicates? - linux

I have a CSV input with these lines:
1,zzzz,xxxx,
1,xxxx,xyxy,
2,xxxx,xxxx,
3,yyyy,xxxx,
3,xxxx,yyyy,
3,xxxx,zzzz,
1,ffff,xxxx,
1,aaaa,xxxx,
And I need to discard lines where the first field matches that of the preceding line:
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,
I tried sort | uniq alone, but that didn't work because the lines differ in everything except the first field (the number).

Use awk instead of uniq:
awk -F, '$1 != last { last=$1; print }'
-F, sets the field separator to comma. $1 is the contents of the first field, so this prints the line whenever the first field changes.
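For example, assuming the sample above is saved as input.csv (a filename chosen here for illustration), a run reproduces the desired output:
$ awk -F, '$1 != last { last=$1; print }' input.csv
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,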

Got the wanted output with uniq --check-chars=N: uniq then compares only the first N characters of each line, and because the input isn't sorted (and uniq only collapses adjacent lines), the same leading characters are still allowed to appear again later in the list.
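A minimal sketch for the sample above, assuming GNU uniq (where -w is the short form of --check-chars) and a single-digit first field, so checking one character is enough:
$ uniq --check-chars=1 input.csv
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,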

Related

How to add a Header with value after a particular column in linux

Here I want to add a column with header name Gender after the Age column, with a value in every row.
cat Person.csv
First_Name|Last_Name||Age|Address
Ram|Singh|18|Punjab
Sanjeev|Kumar|32|Mumbai
I am using this:
cat Person.csv | sed '1s/$/|Gender/; 2,$s/$/|Male/'
output:
First_Name|Last_Name||Age|Address|Gender
Ram|Singh|18|Punjab|Male
Sanjeev|Kumar|32|Mumbai|Male
I want output like this:
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
I took the second pipe out (for consistency's sake) ... the sed should look like this:
$ sed -E '1s/^([^|]+\|[^|]+\|[^|]+\|)/\1Gender|/;2,$s/^([^|]+\|[^|]+\|[^|]+\|)/\1male|/' Person.csv
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|male|Punjab
Sanjeev|Kumar|32|male|Mumbai
We match and remember the first three fields and replace them with themselves, followed by Gender and male respectively.
Using awk:
$ awk -F"|" 'BEGIN{ OFS="|"}{ last=$NF; $NF=""; print (NR==1) ? $0"Gender|"last : $0"Male|"last }' Person.csv
First_Name|Last_Name||Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
Use '|' as the input field separator and set the output field separator to '|'. Store the last column's value in a variable named last and then remove the last column with $NF="". Then print the appropriate output depending on whether it is the first row or a succeeding row.
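As a minimal illustration of why this works (the NEW column name is just a placeholder): assigning to any field makes awk rebuild $0 with OFS, so after $NF="" the record ends with a trailing '|', ready for the new column and the saved last field to be appended:
$ echo 'a|b|c' | awk -F'|' 'BEGIN{OFS="|"}{last=$NF; $NF=""; print $0 "NEW|" last}'
a|b|NEW|c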

How to cut column data from flat file

I have data in the format below:
111,Ja,M,Oes,2012-08-03 16:42:00,x,xz
112,Ln,d,D,Gn,2012-08-03 16:51:00,y,yx
I need to create files with data in the sequence below:
111,x,xz
112,y,yx
In the output format, we want the first comma-separated value and the last two comma-separated values; there can be any number of commas in between.
Kindly advise how I can generate the required output file from the input file on a Linux machine.
The Awk statement for this is pretty straightforward. Set the input and output field separators and print the fields using $1, $(NF-1) and $NF, where $NF is the value of the last column:
awk 'BEGIN{FS=OFS=","}{print $1,$(NF-1),$NF}' input.csv > newfile.csv
Not much to this one in awk:
awk -F"," 'BEGIN{OFS=","}{print $1,$(NF-1), $NF}' inFile > outFile
We split the lines in awk with a comma -F"," and then print the first field $1, the second to last field $(NF-1), and the last field $NF.
NF is the "Number of fields" so subtracting 1 from it will give you the second to last item.
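A minimal illustration of that, on throw-away input rather than the question's file:
$ echo 'a,b,c,d' | awk -F, '{print NF, $(NF-1), $NF}'
4 c d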
with sed
$ sed -r 's/([^,]+).*(,[^,]+,[^,]+)/\1\2/' file
111,x,xz
112,y,yx
or
$ sed -r 's/([^,]+).*((,[^,]+){2})/\1\2/' file
Another take relies on awk's default whitespace splitting and the fixed character positions in this particular sample (the date field contains a space), so it is fragile:
awk '{print substr($1,1,4) substr($2,10,4)}' file
111,x,xz
112,y,yx

Uniqing a delimited file based on a subset of fields

I have data such as below:
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
Due to the nature of the last two columns, their values change throughout the day and are repeated regularly. By grouping them the way outlined in my desired output (below), I am able to view each time there was a change in their values (with the epoch time in the first column). Is there a way to achieve the desired output shown below:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
So I want to consolidate the data by the last two columns. However, the consolidation is not completely unique, as can be seen by 207.55,207.5 appearing more than once.
I have tried:
uniq -f 1
However, the output gives only the first line and does not continue through the list: because the lines contain no blanks, the whole line counts as a single field, so skipping one field leaves nothing to compare and every line looks identical.
The awk solution below never outputs a combination that has occurred before, and so gives this output (shown below the awk code):
awk '!x[$2 $3]++'
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
I do not wish to sort the data by the second two columns. However, since the first is epoch time, it may be sorted by the first column.
You can't set a delimiter with uniq; fields have to be separated by whitespace. With the help of tr you can:
tr ',' ' ' <file | uniq -f1 | tr ' ' ','
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
You can use an Awk statement as below,
awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file
which produces the output as you need.
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
The idea is to store the second and third column values in the variables s and t respectively, and print the line only when those values differ from the previous line's.
I found an answer which is not as elegant as Inian's but satisfies my purpose.
Since my first column is always an epoch time in microseconds and does not change in length, I can use the following uniq command:
uniq -s 17
You can try to manually (with a loop) compare the current line with the previous one.
prev_line=""
# start at the first line
i=1
# strip the first column, which is not needed for the comparison
sed 's#^[0-9][0-9]*,##' ./data_file > ./transform_data_file
# make sure the list of line numbers to suppress exists and is empty
: > ./line_to_be_suppress
# for every line of the file without its first column
for current_line in $(cat ./transform_data_file)
do
    # if the previous line is the same as the current line
    if [ "x$prev_line" = "x$current_line" ]
    then
        # record its line number so it can be suppressed afterwards
        echo $i >> ./line_to_be_suppress
    fi
    # remember the current line as the previous one
    prev_line=$current_line
    # increment the current line number
    i=$(( i + 1 ))
done
# suppress the recorded lines, bottom up so the remaining numbers stay valid
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i $line_to_suppress'd' ./data_file ; done
rm ./line_to_be_suppress
rm ./transform_data_file
Since your first field seems to have a fixed length of 17 characters (including the , delimiter), you could use the -s option of uniq, which would be more optimal for larger files:
uniq -s 17 file
Gives this output:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
From man uniq:
-f num
Ignore the first num fields in each input line when doing comparisons.
A field is a string of non-blank characters separated from adjacent fields by blanks.
Field numbers are one based, i.e., the first field is field one.
-s chars
Ignore the first chars characters in each input line when doing comparisons.
If specified in conjunction with the -f option, the first chars characters after
the first num fields will be ignored. Character numbers are one based,
i.e., the first character is character one.
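A quick sketch of the difference between the two options, on throw-away input:
$ printf 'a 1\nb 1\nb 2\n' | uniq -f 1    # ignore the first whitespace-separated field
a 1
b 2
$ printf 'a,1\nb,1\nb,2\n' | uniq -s 2    # ignore the first two characters
a,1
b,2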

Uniq skipping middle part of the line when comparing lines

Sample file
aa\bb\cc\dd\ee\ff\gg\hh\ii\jj
aa\bb\cc\dd\ee\ll\gg\hh\ii\jj
aa\bb\cc\dd\ee\ff\gg\hh\ii\jj
I want to skip the 6th field ('ff') when comparing lines for uniqueness, and I also want the count of duplicate lines in front.
I tried this, without any luck:
sort -t'\' -k1,5 -k7 --unique xslin1 > xslout
Expected output
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj
$ awk -F'\' -v OFS='\' '{$6="*"} 1' xslin1 | sort | uniq -c
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj
Discussion
With --unique, sort outputs only unique lines but it does not count them. One needs uniq -c for that. Further, sort outputs all unique lines, not just those that sort to the same value.
The above solution takes the simple approach of assigning * to the sixth field, as you wanted in the output, and then uses the standard pipeline, sort | uniq -c, to produce the count of unique lines.
You can do this in one awk:
awk 'BEGIN{FS=OFS="\\"} {$6="*"; uniq[$0]++}
END {for (i in uniq) print uniq[i] "\t" i}' file
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj

How to use grep or awk to process a specific column (with keywords from text file)

I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john#mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and FEMALES.
I have a list of some 5000 female names, all in plain text, all in one file.
How can I "grep" the first column (since I am only matching first names) but still print the entire customer record?
I found it easy to "cut" the first column and grep --file=female.names.txt, but this way it's not going to print the entire record any longer.
I am aware of the awk option but in that case I don't know how to read the female names from file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks !
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
Would print the lines of your CSV file whose first field matches one of the names in female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
Would output the lines whose first field is not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes all the names in the list of female names to the regular expression ^name, so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the file to match against the data file.
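For instance, if female.names.txt contained just the three sample names above, the process substitution would hand grep one anchored pattern per name:
$ sed 's/.*/^&,/' female.names.txt
^Heather,
^Irene,
^Jane,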
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;
# read the list of female names (first file on the command line) into a lookup hash
BEGIN {
    while (<ARGV>) {
        chomp;
        $names{$_} = 1;
    }
}
# -a/-F, split each record on commas into @F; print it if the first field is a known name
print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose you have the following lines in a file named test.txt:
abe 123 bdb 532
xyz 593 iau 591
Now say you want to find the lines whose first field starts and ends with a vowel. A simple grep would match both lines, but the following gives you only the first line, which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Then say you want to find the lines whose third field starts and ends with a vowel. Similarly, a simple grep would match both lines, but the following gives you only the second line, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The {1,} in the first pair of curly braces specifies that the preceding character class [0-z] (a range in the ASCII table) can occur one or more times; it is followed by the field separator, a space in this case. Change the value in the second pair of curly braces, {0} or {2}, to the desired field number minus 1. Then use a regular expression to state your criteria for that field.
