Removing specific strings from strings in a file

I want to remove specific fields from every line of a semicolon-delimited file.
The file looks something like this:
texta1;texta2;texta3;texta4;texta5;texta6;texta7
textb1;textb2;textb3;textb4;textb5;textb6;textb7
textc1;textc2;textc3;textc4;textc5;textc6;textc7
I would like to remove positions 2, 5 and 7 from every line in the file.
Desired output:
texta1;texta3;texta4;texta6
textb1;textb3;textb4;textb6
textc1;textc3;textc4;textc6
I am trying to write a small shell script using awk, but the code is not working as expected: the semicolons in between and at the end are not being removed.
(Note: I was able to do it with sed, but my file has several hundred thousand records and the sed version takes a lot of time.)
Could you please provide some help on this? Thanks in advance.

Most simply with cut:
cut -d \; -f 1,3-4,6,8- filename
or
cut -d \; -f 2,5,7 --complement filename
I think --complement is GNU-specific, though. The 8- in the first example is not actually necessary for a file with only seven columns; it would include all columns from the eighth forward if they existed. I included it because it doesn't hurt and provides a more general solution to the problem.

I voted the answer by @Wintermute up, but if cut --complement is not available to you or you insist on using awk, then you can do:
awk -v scols=2,5,7 'BEGIN{FS=";"; OFS=";"; split(scols,acols,",")} {
for(i in acols) $acols[i]=""; gsub(";;",";"); sub(/^;/,""); sub(/;$/,""); print}' tmp.txt
The sub() calls strip the separators left behind at the start or end of the line when the first or last field is emptied (otherwise removing field 7 leaves a trailing semicolon).
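If deleted fields could ever be adjacent, or a surviving field could be genuinely empty, the gsub cleanup above can misbehave. A variant that instead rebuilds each record while skipping the unwanted fields (a sketch, keeping the same scols convention) avoids the cleanup entirely:
awk -v scols=2,5,7 'BEGIN {
    FS = OFS = ";"
    n = split(scols, a, ",")
    for (i = 1; i <= n; i++) del[a[i]]   # field numbers to drop
}
{
    out = sep = ""
    for (j = 1; j <= NF; j++)            # copy only the surviving fields
        if (!(j in del)) { out = out sep $j; sep = OFS }
    print out
}' tmp.txt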

Related

optimize sed command for better performance

I have got the following lines in a file:
1231231213123123123|W3A|{ (ABCDE)="8=3AF.R.Y2=133AA=9WW=334MNFN=20120925-22:23:59.998
1231231213123123123|4GM|{ (ABCDE)="8=3AF.R.Y2=123AA=9WW=4AF013DCV=EXAMPLE=ABC
1231231213123123123|KYC|{ (ABCDE)="8=3AF.R.Y2=112AA=9WW=0002DDS=20120921-14:55:21
In order to get the value between '|' characters I am using:
sed -e 's/\(^.*|\)\(.*\)\(|.*$\)/\2/'
And output is:
W3A
4GM
KYC
Which is expected. But as the file has thousands of records, the sed command is taking a lot of time. Is there any way to improve its performance?
Seems to me like you just want to use cut:
cut -d '|' -f 2 file
Set the delimiter to | and print the second field.
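Should you later need to pull out more than one field per line, the awk equivalent (same | delimiter) is:
awk -F'|' '{ print $2 }' file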
Since you're only keeping the 2nd parenthesized group, you're making sed do unnecessary work by remembering the other stuff. Try
sed -e 's/^[^|]*|\([^|]*\)|.*/\1/'
Tom Fenech's answer should be a lot faster though.

renaming files using loop in unix

I have a situation here.
I have a lot of files like the ones below in Linux:
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaac
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaag
I want to remove the $line and put a counter from 0001 to 6000 in its place, for my 6000 such files.
Also, after this is done, I want to remove the trailing 3 characters from each file name.
After the fix the files should look like:
SIPTV_FIPTV_ID0000001_T20141003195717_C0000001000_FWD148_IPV_001.DAT
SIPTV_FIPTV_ID0000002_T20141003195717_C0000001000_FWD148_IPV_001.DAT
Please help.
With some assumptions, I think this should do it:
1. list of the files is in a file named input.txt, one file per line
2. the code is running in the directory the files are in
3. bash is available
awk '{i++;printf "mv \x27"$0"\x27 ";printf "\x27"substr($0,1,16);printf "%05d", i;print substr($0,22,47)"\x27"}' input.txt | bash
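A pure-bash variation on the same idea (a sketch; it assumes the literal text $line appears in every name and that the files glob in the order you want them numbered):
n=0
for f in SIPTV_FIPTV_ID00\$line_*; do
    n=$((n+1))
    printf -v c '%05d' "$n"      # zero-padded counter
    new=${f/\$line/$c}           # swap the literal $line for the counter
    mv -- "$f" "${new%???}"      # %??? drops the 3 trailing junk characters
done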
From the command prompt, give the following command:
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \x27" old "\x27 \x27" $0 "\x27"}'
%
and check the output; if it looks OK,
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \x27" old "\x27 \x27" $0 "\x27"}' | sh
%
%
A commentary: printf '%s\n' *.DAT??? is meant to feed awk the list of filenames you want to modify, one per line (echo would put them all on a single line); you may want something more elaborate if the example names you gave aren't representative of the whole set. Regarding the awk script itself, I used sprintf to generate a string with the correct number of zeroes to replace $line. The idiom "\\$..." with two backslashes to quote the dollar sign is required by gawk and does no harm in mawk, and the generated mv commands quote both names so the shell doesn't try to expand the literal $line as a variable. As a last remark, in similar cases I prefer to make at least a dry run before passing the commands to the shell.

Using grep to find difference between two big wordlists

I have a 78k-line .txt file with British words and a 5k-line .txt file with the most common British words. I want to filter the most common words out of the big list so that I end up with a new list of the less common words.
I managed to solve my problem another way, but I would really like to know what I am doing wrong, since this does not work.
I have tried the following:
# To make sure they are trimmed
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleansed
But this procedure apparently gives me two empty files.
If I run just the grep without cut first, I get words that I know are in both files.
I have also tried this:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
No luck either.
The two text files, in case anyone wants to try for themselves:
https://www.dropbox.com/s/dw3k8ragnvjcfgc/5k-most-common-sorted.txt
https://www.dropbox.com/s/1cvut5z2zp9qnmk/brit-a-z-sorted.txt
After downloading your files, I noticed that (a) brit-a-z-sorted.txt has Microsoft line endings while 5k-most-common-sorted.txt has Unix line endings and (b) you are trying to do whole-line compare (grep -x). So, first we need to convert to a common line ending:
dos2unix <brit-a-z-sorted.txt >brit-a-z-sorted-fixed.txt
Now, we can use grep to remove the common words:
grep -xivFf 5k-most-common-sorted.txt brit-a-z-sorted-fixed.txt >less-common.txt
I also added the -F flag to ensure that the words are interpreted as fixed strings rather than as regular expressions. This also speeds things up.
I note that there are several words in the 5k-most-common-sorted.txt file that are not in brit-a-z-sorted.txt. For example, "British" is in the common file but not in the larger file. Also, the common file has "aluminum" while the larger file has only "aluminium".
What do the grep options mean? For those who are curious:
-f means read the patterns from a file.
-F means treat them as fixed strings, not regular expressions.
-i means ignore case.
-x means do whole-line matches.
-v means invert the match; in other words, print those lines that do not match any of the patterns.
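If you would rather make your comm attempt work: comm -3 still prints two columns (the lines unique to each file), while comm -23 suppresses the second column as well, leaving only the words unique to the big list. A sketch, after fixing the line endings as above (note that, unlike grep -i, this comparison is case-sensitive):
dos2unix <brit-a-z-sorted.txt | sort >big.sorted
sort 5k-most-common-sorted.txt >small.sorted
comm -23 big.sorted small.sorted >less-common.txt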

awk / sed script to remove text

I am currently in need of a way to programmatically remove some text from Makefiles that I am dealing with. The problem is that (for whatever reason) the makefiles are being generated with link commands of -l<full_path_to_library>/<library_name> when they should be generated with -l<library_name>. So what I need is a script to find all occurrences of -l/ and then remove everything up to and including the final /.
Example of what I'm dealing with
-l/home/user/path/to/boost/lib/boost_filesystem
I need it to be
-lboost_filesystem
As you can imagine, this is a stop-gap measure until I fix the real problem (on the generation side), but in the meantime it would be a great help to me if this could work, as I am not too good with awk and sed.
Thanks for any help.
sed -i 's|-l[^ ]*/\([^/ ]*\)|-l\1|g' Makefile
(The | delimiters in the s||| expression save you from escaping the slashes in the paths, and \1 puts back just the captured library name.)
Here you go
echo "-l/home/user/path/to/boost/lib/boost_filesystem" | awk -F"/" '{ print $1 $NF } '

Replacing a line in a csv file?

I have a set of 10 CSV files, which normally have entries of this kind:
a,b,c,d
d,e,f,g
Now, due to some error, entries in these files have become of this kind:
a,b,c,d
d,e,f,g
,,,
h,i,j,k
Now I want to remove the lines containing only commas from all the files. These files are on a Linux filesystem.
Can you recommend a command that removes the erroneous lines from all the files?
It depends on what you mean by replace. If you mean 'remove', then a trivial variant on @wnoise's solution is:
grep -v '^,,,$' old-file.csv > new-file.csv
Note that this deletes just those lines with exactly three commas. If you want to delete malformed lines with any number of commas (including zero) and no other characters on the line, then:
grep -v '^,*$' ...
There are endless other variations on the regex that would deal with other scenarios. Dealing with full CSV data with commas inside quotes starts to need something other than a regex machine. It can be done, within broad limits, especially in more complex regex systems such as PCRE or Perl. But it requires more work.
Check out Mastering Regular Expressions.
sed 's/,,,/replacement/' < old-file.csv > new-file.csv
optionally followed by
mv new-file.csv old-file.csv
Replace or remove? Your post is not clear... For replacement, see wnoise's answer. For removing, you could use:
awk '$0 !~ /,,,/ {print}' <old-file.csv > new-file.csv
What about trying to keep only the lines that match the desired format, instead of handling one kind of exception?
If the provided input is what you really want to match:
grep -E '[a-z],[a-z],[a-z],[a-z]' < oldfile.csv > newfile.csv
If the input is different, provide it, the regular expression should not be too hard to write.
Do you want to replace them with something, or delete them entirely? Either way, it can be done with sed. To delete:
sed -i -e '/^,\+$/ D' yourfile1.csv yourfile2.csv ...
To replace: well, see wnoise's answer, or if you don't want to create new files with the output,
sed -i -e '/^,\+$/ s//replacement/' yourfile1.csv yourfile2.csv ...
or
sed -i -e '/^,\+$/ c\
replacement' yourfile1.csv yourfile2.csv ...
(that should be entered exactly as is, including the line break). Of course, you can also do this with awk or perl or, if you're only deleting lines, even grep:
egrep -v '^,+$' < oldfile.csv > newfile.csv
I tested these to make sure they work, but I'd advise you to do the same before using them (just in case). You can omit the -i option from sed, in which case it'll print out the results (rather than writing them back to the file), or omit the output redirection >newfile.csv from grep.
EDIT: It was pointed out in a comment that some features of these sed commands only work on GNU sed. As far as I can tell, these are the -i option (which can be replaced with shell redirection, sed ... <infile >outfile ) and the \+ modifier (which can be replaced with \{1,\} ).
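For instance, a fully POSIX version of the delete command (a sketch using those two substitutions) would be:
sed '/^,\{1,\}$/d' old-file.csv > new-file.csv && mv new-file.csv old-file.csv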
Most simply:
$ grep -v ,,, oldfile > newfile
$ mv newfile oldfile
Yes, awk or grep are very good options if you are working on a Linux platform. On other platforms you can use a Perl regex, or Perl's join and split functions.
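For example, a Perl one-liner in that spirit (a sketch; it edits the files in place, deleting comma-only lines) might be:
perl -i -ne 'print unless /^,+$/' yourfile1.csv yourfile2.csv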
