Using sed to delete lines present in similar file - string

I have a file listing from an original and a duplicate drive consisting of 985257 lines and 984997 lines respectively.
As the line counts do not match, I am certain that some of the files have not been duplicated.
In order to establish which files are not present, I wish to use sed to filter the original file listing by deleting any lines present in the duplicate listing from the source listing.
I had thought about using a match formula in Excel, but due to the number of lines the program crashes. I thought this approach in sed would be a viable option.
However, I have had no success with my approach so far.
echo "Start"
# Cat the passed argument which is the duplicate file listing
for line in $(cat $1)
do
#sed the $line variable over the larger file and remove
#sed "${line}/d" LiveList.csv
#sed -i "${line}/d" LiveList.csv
#sed -i '${line}' 'd' LiveList.csv
sed -i "s/'${line}'//" /home/listings/LiveList.csv
done
A temporary file is created and grows to the 103.4 MB size of the listing file; however, the listing file itself is not altered at all.
My other concern is that, since the listing was created in Windows, the '\' characters in the paths may be acting as escapes in the string, leading to no matches and therefore no alteration.
Example path:
Path,Length,Extension
Jimmy\tail\images\Jimmy\0001\0014\Text\A0\20\A056TH01-01.html,71982,.html
Please help.

This might work for you:
sort original_list.txt duplicate_list.txt | uniq -u
uniq -u prints only the lines that appear exactly once, i.e. the entries present in one listing but not the other.

The first thing that comes to my mind is just using rsync to copy the missing files as fast as possible. It really works wonders.
If not, you can first sort both files to identify where they differ. You can use some paste trickery to put the differences side by side, or even use diff's side-by-side output. When the files are ordered, diff finds it easy to identify which lines have been added.
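A minimal sketch along those lines, using comm on sorted copies of the two listings. Only LiveList.csv comes from the question; DupeList.csv is a made-up name for the duplicate listing (the question passes it as $1):
sort LiveList.csv > live_sorted.csv
sort DupeList.csv > dupe_sorted.csv   # DupeList.csv is an assumed file name
# comm requires sorted input; -23 suppresses the lines unique to the duplicate
# and the lines common to both, leaving only the lines missing from the duplicate
comm -23 live_sorted.csv dupe_sorted.csv > missing_from_duplicate.csv
# Or, as suggested above, view the differences side by side:
diff --side-by-side --suppress-common-lines live_sorted.csv dupe_sorted.csv | less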


Delete some lines from text using Linux command

I know how to match text using regex patterns but not how to manipulate them.
I have used grep to match and extract lines from a text file, but I want to remove those lines from the text. How can I achieve this without having to write a python or bash shell script?
I have searched on Google and was recommended to use sed, but I am new to it and don't know how it works.
Can anyone point me in the right direction or help me achieve this goal?
The -v option to grep inverts the search, reporting only the lines that don't match the pattern.
Since you know how to use grep to find the lines to be deleted, using grep -v and the same pattern will give you all the lines to be kept. You can write that to a temporary file and then copy or move the temporary file over the original.
grep -v pattern original.file > tmp.file
mv tmp.file original.file
You can also use sed, as shown in shellfish's answer.
There are multiple possible refinements to the grep solution, but for most people most of the time, what is shown is more or less adequate (although it would be a good idea to use a per-process intermediate file name, preferably a random one such as the mktemp command gives you). You can add code to remove the intermediate file on an interrupt; suppress interrupts while moving the file back; use copy-and-remove instead of move if the original file has multiple hard links or is a symlink; and so on. The sed command more or less works around these issues for you, but it is not aware of multiple hard links or symlinks either.
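A minimal sketch of those refinements, reusing the pattern and file name from the commands above; this is illustrative temporary-file handling, not a hardened script:
# create a unique, per-process temporary file name
tmp=$(mktemp original.file.XXXXXX) || exit 1
# clean up the temporary file if the script is interrupted
trap 'rm -f "$tmp"' INT TERM EXIT
grep -v pattern original.file > "$tmp"
mv "$tmp" original.file
trap - INT TERM EXIT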
Create the pattern which matches the lines using grep. Then create a sed script as follows:
sed -i '/pattern/d' file
Explanation:
The -i option means edit the input file in place, thus removing the lines matching pattern.
pattern is the pattern you created for grep, e.g. ^a*b\+.
d is the sed command for delete; it deletes the lines matching the pattern.
file is the input file; it can be given as a relative or absolute path.
For more information see man sed.

Find and Replace Incrementally Across Multiple Files - Bash

I apologize in advance if this belongs on Super User; I always have a hard time discerning whether these bash scripting questions are better placed here or there. I currently know how to find and replace strings in multiple files, and (from searching for a solution to this issue) how to find and replace strings within a single file incrementally, but how to combine the two eludes me.
Here's the explanation:
I have a few hundred files, each in sets of two: a data file (.data) and a message file (.data.ms).
These files are linked via a key value unique to each set of two that looks like: ab.cdefghi
Here's what I want to do:
Step through each .data file and do the following:
Find:
MessageKey ab.cdefghi
Replace:
MessageKey xx.aaa0001
MessageKey xx.aaa0002
...
MessageKey xx.aaa0010
etc.
Incrementing by 1 every time I get to a new file.
Clarifications:
For reference, there is only one instance of "MessageKey" in every file.
The paired files have the same name and differ only in extension, so I could simply step through all the .data files and then all the .data.ms files, apply whatever incremental solution to both, and they would match up fine; I don't need anything fancy to edit the two files in tandem.
For all intents and purposes, whatever currently appears on the line after each MessageKey is garbage; I am throwing it out completely and replacing it with xx.aaa####.
String length matters, so I need xx.aaa0009, xx.aaa0010, not xx.aaa0009, xx.aaa00010 (see the short illustration after these clarifications).
I'm using cygwin.
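For illustration, the fixed-width numbering described above is what printf's zero-padded field width produces; this is a standalone example, not part of the existing files:
# %04d pads the counter to four digits, so every key has the same length
i=9
printf "MessageKey xx.aaa%04d\n" "$i"          # -> MessageKey xx.aaa0009
printf "MessageKey xx.aaa%04d\n" "$((i + 1))"  # -> MessageKey xx.aaa0010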
I would approach this by creating a mapping from old key to new and dumping that into a temp file.
grep MessageKey *.data \
| sort -u \
| awk '{ printf("%s:xx.aaa%04d\n", $1, ++i); }' \
> /tmp/key_mapping
From there, I would confirm that the mapping file looks right before applying it to the files with sed.
cat /tmp/key_mapping \
| while read old new; do
sed -i -e "s:MessageKey $old:MessageKey $new:" *
done
This will probably work for you, but it's neither elegant nor efficient. This is how I would do it if I were only going to run it once. If I were going to run this regularly and efficiency mattered, I would probably write a quick Python script.
@Carl.Anderson got me started on the right track, and after a little tweaking I ended up implementing his solution with some syntax changes.
First of all, this solution only works if all of your files are located in the same directory. I'm sure anyone with even slightly more experience with UNIX than me could modify this to work recursively, but here goes:
First I ran:
-hr "MessageKey" . | sort -u | awk '{ printf("%s:xx.aaa%04d\n", $2, ++i); }' > MessageKey
This command was used to create a find and replace map file called "MessageKey."
The contents of which looked like:
In.Rtilyd1:aa.xxx0087
In.Rzueei1:aa.xxx0088
In.Sfricf1:aa.xxx0089
In.Slooac1:aa.xxx0090
etc...
Then I ran:
cat MessageKey | while IFS=: read old new; do sed -i -e "s/MessageKey $old/MessageKey $new/" *Data ; done
I had to use IFS=: (alternatively, I could have found and replaced all the : characters in the map file with spaces, but the former seemed easier).
Anyway, in the end this worked! Thanks Carl for pointing me in the right direction.

What is the easiest way of extracting strings from source files?

I was asked today to list all image file references in our project to help remove/fix dead references.
All our image names in the source files are enclosed in either simple or double quotes ('image.png' or "image.png").
To extract those I thought of using grep, sed and other tools like that, but so far I have failed to come up with something effective.
I can currently list all lines that contain image names by grepping for the image file extensions (.png, .gif, and so on), but that also brings in lines completely unrelated to my search. My attempt with sed wasn't working when there were several strings per line.
I could probably filter out the list myself, but hey: this is Linux! So there has to be a tool for the job.
How would you do that?
You should be able to extract the file names with something like this:
grep -Eo "['\"][^'\"]*\.(gif|png)['\"]"
The option -o causes grep to list only the matches instead of the whole line. Use tr to remove the quotes:
grep -Eo "['\"][^'\"]*\.(gif|png)['\"]" | tr -d "\"'"

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//'
> dest.txt
The command writes correctly when I specify a small maximum count, -m 100, after source.txt. The problem occurs if I don't specify it, or if I specify a huge number. However, I have previously written to files with grep (not egrep) using huge numbers similar to what I'm trying now, and that was successful. I also checked the last-modified date and time while the command was executing, and it seems no modification is happening in the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc. You can then run the entire command on each of those files (changing the redirection at the end to append: >> dest.txt), as sketched below.
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
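A rough sketch of the whole approach; the egrep invocation is assumed, since the question only shows the pattern, and the per-chunk pipeline simply mirrors the original command:
split -l 10000000 source.txt split_source.
for chunk in split_source.*; do
    # the exact egrep flags are not shown in the question, so this line is a guess
    egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' "$chunk" | sort | uniq | sed -e 's/www.//' >> dest.txt
done
# duplicates can still occur across chunks, so de-duplicate once at the end
sort dest.txt | uniq > dest_uniq.txt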
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem from sort, uniq, or sed (my bet is on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
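For what it's worth, the reordering suggested above would look something like this, with PATTERN standing in for whatever expression the full command uses:
# sort | uniq now runs after sed, so the final output is de-duplicated and sorted
egrep 'PATTERN' source.txt | sed -e 's/www.//' | sort | uniq > dest.txt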

Compare 2 files with shell script

I was trying to find a way to know whether two files are the same, and found this post...
Parsing result of Diff in Shell Script
I used the code in the first answer, but I think it's not working, or at least I can't get it to work properly...
I even tried making a copy of a file and comparing the two (copy and original), and I still get the answer that they are different, when they shouldn't be.
Could someone give me a hand, or explain what's happening?
Thanks so much;
peixe
Are you trying to compare if two files have the same content, or are you trying to find if they are the same file (two hard links)?
If you are just comparing two files, then try:
diff "$source_file" "$dest_file" # without -q
or
cmp "$source_file" "$dest_file" # without -s
in order to see the supposed differences.
You can also try md5sum:
md5sum "$source_file" "$dest_file"
If both files return the same checksum, then they are identical.
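Since the goal is a shell-script check, a minimal sketch using cmp's exit status (the variable names follow the commands above):
if cmp -s "$source_file" "$dest_file"; then
    echo "Files are identical"
else
    echo "Files differ"
fi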
comm is a useful tool for comparing files.
The comm utility will read file1 and file2, which should be
ordered in the current collating sequence, and produce three
text columns as output: lines only in file1; lines only in
file2; and lines in both files.
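For example, a quick usage sketch on unsorted files (the process substitution syntax requires bash; file1 and file2 are placeholder names):
# -12 suppresses columns 1 and 2, leaving only the lines common to both files;
# comm needs sorted input, hence the sort in each process substitution
comm -12 <(sort file1) <(sort file2)
# -3 instead suppresses the common lines, showing what differs between the files
comm -3 <(sort file1) <(sort file2)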
