Find and Replace Incrementally Across Multiple Files - Bash - linux

I apologize in advance if this belongs on Super User; I always have a hard time deciding whether these Bash scripting questions are better placed here or there. From searching for a solution to this issue I already know how to find and replace strings across multiple files, and how to find and replace strings within a single file incrementally, but how to combine the two eludes me.
Here's the explanation:
I have a few hundred files, each in sets of two: a data file (.data) and a message file (.data.ms).
These files are linked via a key value unique to each set of two that looks like: ab.cdefghi
Here's what I want to do:
Step through each .data file and do the following:
Find:
MessageKey ab.cdefghi
Replace:
MessageKey xx.aaa0001
MessageKey xx.aaa0002
...
MessageKey xx.aaa0010
etc.
Incrementing by 1 every time I get to a new file.
Clarifications:
For reference, there is only one instance of "MessageKey" in every file.
The paired files have the same name; only their extensions differ. So I could simply step through all the .data files and then all the .data.ms files, apply whatever incremental solution to both, and they would match up fine; I don't need anything fancy to edit the two files in tandem.
For all intents and purposes, whatever currently appears on the line after each MessageKey is garbage; I am throwing it out completely and replacing it with xx.aaa####.
String length does matter, so I need xx.aaa0009, xx.aaa0010, not xx.aaa0009, xx.aaa00010.
I'm using Cygwin.

I would approach this by creating a mapping from old key to new and dumping that into a temp file.
grep MessageKey *.data \
| sort -u \
| awk '{ printf("%s:xx.aaa%04d\n", $1, ++i); }' \
> /tmp/key_mapping
From there I would confirm that the file looks right before I applied the mapping using sed to the files.
cat /tmp/key_mapping \
| while read old new; do
    sed -i -e "s:MessageKey $old:MessageKey $new:" *
done
This will probably work for you, but it's neither elegant nor efficient. It's how I would do it if I were only going to run it once; if I were going to run this regularly and efficiency mattered, I would probably write a quick Python script instead.

@Carl.Anderson got me started on the right track, and after a little tweaking I ended up implementing his solution with a few syntax changes.
First of all, this solution only works if all of your files are located in the same directory. I'm sure anyone with even slightly more experience with UNIX than me could modify this to work recursively, but here goes:
First I ran:
-hr "MessageKey" . | sort -u | awk '{ printf("%s:xx.aaa%04d\n", $2, ++i); }' > MessageKey
This command was used to create a find and replace map file called "MessageKey."
The contents of which looked like:
In.Rtilyd1:aa.xxx0087
In.Rzueei1:aa.xxx0088
In.Sfricf1:aa.xxx0089
In.Slooac1:aa.xxx0090
etc...
Then I ran:
cat MessageKey | while IFS=: read old new; do sed -i -e "s/MessageKey $old/MessageKey $new/" *Data ; done
I had to use IFS=: (alternatively, I could have found and replaced every : in the map file with a space, but the former seemed easier).
Anyway, in the end this worked! Thanks Carl for pointing me in the right direction.
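For reference, the same result can also be had in a single pass, without the intermediate map file. The following is only a sketch and rests on the assumptions stated in the question: the paired files share a basename, each file contains exactly one MessageKey line (at the start of a line), and whatever follows the key is disposable:

i=0
for f in *.data; do
    i=$((i + 1))
    key=$(printf 'xx.aaa%04d' "$i")
    # Overwrite the single MessageKey line in the .data file and its paired .data.ms file
    sed -i -e "s/^MessageKey .*/MessageKey $key/" "$f" "$f.ms"
done

The counter increments once per .data file, so both files of a pair receive the same new key.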

Related

Linux: Replace first string in file with contents of other file containing quotes and slashes.

I have spent all day today trying to find a proper solution, but I am not able to. My problem:
I have an XML file containing multiple occurrences of the same tag.
Example:
<TASK INSTANCE />
<WORKFLOWLINK CONDITION=""/>
<WORKFLOWLINK CONDITION=""/>
I want to add the contents of another XML file before the first <WORKFLOWLINK. The issue I've run into is that this file is full of double quotes and slashes. I've tried replacing them and escaping them, but to no avail.
My attempts mainly culminated in something like:
sed -e "0,/<WORKFLOWLINK/ /<WORKFLOWLINK/{ r ${filename}" -e "}" ${sourcefile}
If this isn't clear enough I'll get the exact data so you can see.
For the fun of sed:
sed -e "0,/<WORKFLOWLINK/{/<WORKFLOWLINK/{r ${sourcefile}" -e"}}"
The trick is to start a new "pattern/command" pair after your first address-range condition 0,/<WORKFLOWLINK/.
Two directly nested patterns/addresses are not understood; there must be a command after the first pattern, and the additional pair of curly braces {} provides exactly that.
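To make the nesting concrete, here is a small self-contained demo; the file names target.xml and insert.xml are made up for the example, and note that r appends the read file after the matched line:

printf '%s\n' '<TASK INSTANCE />' '<WORKFLOWLINK CONDITION=""/>' '<WORKFLOWLINK CONDITION=""/>' > target.xml
printf '%s\n' '<NEWCONTENT CONDITION="a/b"/>' > insert.xml
# The outer range 0,/<WORKFLOWLINK/ runs only up to the first match; the inner
# /<WORKFLOWLINK/ re-selects that one line, and r reads insert.xml in after it.
sed -e '0,/<WORKFLOWLINK/{/<WORKFLOWLINK/{r insert.xml' -e '}}' target.xml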
Apart from the brain exercise of doing it in sed, @EdMorton is right in recommending an XML processor. His request for an MCVE is also appropriate; I had to do some guessing to see what you want, and I hope I guessed right.
The MCVE should at least have included:
the error message or problem description defining your problem
the initialisation of your environment variables
some sample input; not the original data
You surely would have had an answer earlier and (in case mine does not satisfy you) probably a better one by now.
So, before your next question, please take the tour: https://stackoverflow.com/tour
GNU sed version 4.2.1
GNU bash, version 3.1.17(1)-release (i686-pc-msys)
Everyone,
Thank you for thinking with me, even if I apparently broke some rules.
I have figured out a solution. Granted, it is not as pretty as it could be, but for a one-time action it is good enough.
I have moved from a single command to a combination of two. First I mark the location where I want to put my data:
sed -e "0,/<WORKFLOWLINK/ s/<WORKFLOWLINK/##MARKER##\n\t<WORKFLOWLINK/"
which puts the marker string in the desired location.
After this I replace the marker with the contents of the file I have. I had already managed to make the individual statements work while trying to do it all in one statement, so I just execute them separately.
sed -e "/##MARKER##/{r ${sourcefile}" -e 'd}'

how to use do loop to read several files with similar names in shell script

I have several files named scale1.dat, scale2.dat, scale3.dat ... up to scale9.dat.
I want to read these files in a do loop one by one, and with each file I want to do some manipulation (I want to write the 1st column of each scale*.dat file to a corresponding scale*.txt).
So my question is: is there a way to read files with similar names? Thanks.
The regular syntax for this is
for file in scale*.dat; do
    awk '{print $1}' "$file" >"${file%.dat}.txt"
done
The asterisk * matches any text or no text; if you want to constrain to just single non-zero digits, you could say for file in scale[1-9].dat instead.
In Bash, there is also a non-standard brace expansion syntax scale{1..9}.dat, but this is Bash-only and so will not work in #!/bin/sh scripts. (Your question has both sh and bash tags, so it's not clear which you require; your comment that the Bash syntax is not working for you suggests that you may need a POSIX-portable solution.) Furthermore, Bash has something called extended globbing, which allows for quite elaborate pattern matching. See also http://mywiki.wooledge.org/glob
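A quick way to see the difference between the glob and the brace expansion (assuming, for the demo, that scale1.dat and scale10.dat exist but scale2.dat through scale9.dat do not):

touch scale1.dat scale10.dat
echo scale[1-9].dat    # glob: expands only to files that actually exist -> scale1.dat
echo scale{1..9}.dat   # brace expansion: generates all nine names, existing or not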
For a simple task like this, you don't really need the shell at all, though.
awk 'FNR==1 { if (f) close (f); f=FILENAME; sub(/\.dat/, ".txt", f); }
{ print $1 >f }' scale[1-9]*.dat
(Okay, maybe that's slightly intimidating for a first-timer. But the basic point is that you will often find that the commands you want to use will happily work on multiple files, and so you don't need shell loops at all in those cases.)
I don't think so. Similar names or not, you will have to iterate through all your files (perhaps with a for loop) and use a nested loop to iterate through lines or words or whatever you plan to read from those files.
Alternatively, you can copy your files into one (say, scale-all.dat) and read that single file.

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes correctly when I specify a small maximum match count (-m 100) after source.txt. The problem is when I don't specify it, or when I specify a huge number, even though I was previously able to write to files with grep (not egrep) using similarly huge numbers. I also checked the destination file's last-modified date and time while the command was executing, and no modification seems to happen. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and that sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc. You can then run the entire command on each one of those files (perhaps changing the pipe at the end to append: >> dest.txt).
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
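Putting the pieces together, the split-and-merge approach might look roughly like this; it is only a sketch, reusing the regex from the question, with an arbitrary chunk size:

split -l 10000000 source.txt split_source.
for chunk in split_source.*; do
    # Run the original pipeline on each chunk, appending to the destination file
    egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' "$chunk" | sort | uniq | sed -e 's/www.//' >> dest.txt
done
# De-duplicate across chunks at the end
sort dest.txt | uniq > dest_uniq.txt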
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem coming from sort, uniq, or sed (my bet's on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, and an unsorted result set (maybe that's what you want); see the example after this list.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
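For example, moving the de-duplication to the end of the pipeline (regex taken from the question) would look something like:

egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sed -e 's/www.//' | sort -u > dest.txt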

Using sed to delete lines present in similar file

I have file listings from an original and a duplicate drive consisting of 985257 lines and 984997 lines respectively.
As the number of lines do not match I am certain that some of the files have not duplicated.
In order to establish which files are not present I wish to use sed to filter the original file listing by deleting any lines present in the duplicate listing from the source listing.
I had thought about using a MATCH formula in Excel, but due to the number of lines the program crashes, so I thought this approach in sed would be a viable option.
I have had no success with my approach so far however.
echo "Start"
# Cat the passed argument which is the duplicate file listing
for line in $(cat $1)
do
#sed the $line variable over the larger file and remove
#sed "${line}/d" LiveList.csv
#sed -i "${line}/d" LiveList.csv
#sed -i '${line}' 'd' LiveList.csv
sed -i "s/'${line}'//" /home/listings/LiveList.csv
done
A temporary file is created and fills to the 103.4 MB size of the listing file, but the listing file itself is not altered at all.
My other concern is that as the listing has been created in windows the '\' character may be escaping the string leading to no matches and therefore no alteration.
Example path:
Path,Length,Extension
Jimmy\tail\images\Jimmy\0001\0014\Text\A0\20\A056TH01-01.html,71982,.html
Please help.
This might work for you:
sort original_list.txt duplicate_list.txt | uniq -u
The first thing that comes to my mind is just using rsync to copy the missing files as quickly as possible. It really works wonders.
If not, you can first sort both files to identify where they differ. You can use some paste trickery to put the differences side by side, or even use diff's side-by-side output. When the files are ordered, I think diff finds it easy to identify which lines have been added.
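A rough sketch of the "sort both listings, then compare" idea, reusing the file names from the first answer:

sort original_list.txt > original_sorted.txt
sort duplicate_list.txt > duplicate_sorted.txt
# Lines prefixed with "<" exist only in the original listing, i.e. were not duplicated;
# add -y (or --side-by-side) for the side-by-side view mentioned above.
diff original_sorted.txt duplicate_sorted.txt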

Using sed to print range when pattern is inside the range?

I have a log file full of queries, and I only want to see the queries that have an error. The log entries look something like:
path to file executing query
QUERY
SIZE: ...
ROWS: ...
MSG: ...
DURATION: ...
I want to print all of this stuff, but only when MSG: contains something of interest (an error message). All I've got right now is sed -n '/^path to file/,/^DURATION/p' and I have no idea where to go from here.
Note: Queries are often multiline, so using grep's -B sadly doesn't work all the time (this is what I've been doing thus far, just being generous with the -B value)
Somehow I'd like to use only sed, but if I absolutely must use something else like awk I guess that's fine.
Thanks!
You haven't said what an error message looks like, so I'll assume it contains the word "ERROR":
sed -n '/^MSG.*ERROR/{H;g;N;p;};/^DURATION/{s/.*//;h;d;};H' < logname
(I wish there were a tidier way to purge the hold space. Anyone?...)
I could suggest a solution with grep. That will work if the structure in the log file is always the same as above (i.e. MSG is in the 5th line, and one line follows):
egrep -i '^MSG:.*error' -A 1 -B 4 logfile
That means: If the word error occurs in a MSG line then output the block beginning from 4 lines before MSG till one line after it.
Of course you have to adjust the regexp to recognize an error.
This will not work if the structure of those blocks differs.
Perhaps you can use the cgrep.sed script, as described in the Unix Power Tools book.
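And since the question allows awk as a fallback, here is a sketch that buffers each entry and prints it only when its MSG line mentions ERROR. It assumes entries start with the "path to file" line and end with DURATION, as in the sample layout, and logfile stands in for your actual log:

awk '/^path to file/ { block = ""; keep = 0 }      # new entry: reset the buffer
     { block = block $0 ORS }                      # collect every line of the entry
     /^MSG:.*ERROR/ { keep = 1 }                   # remember that this entry has an error
     /^DURATION/ { if (keep) printf "%s", block }  # end of entry: print it only if flagged
' logfile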

Resources