What is the easiest way of extracting strings from source files? - linux

I was asked today to list all image file references in our project to help remove/fix dead references.
All our image names in the source files are enclosed in either single or double quotes ('image.png' or "image.png").
To extract those I thought of using grep, sed and other tools like that, but so far I have failed to come up with anything effective.
I can currently list all lines that contain image names by grepping for the image file extensions (.png, .gif, and so on), but that also brings in lines completely unrelated to my search. My attempt with sed didn't work when there were several strings on one line.
I could probably filter the list myself, but hey: this is Linux! There has to be a tool for the job.
How would you do that?

You should be able to extract the file names with something like this:
grep -Eo "['\"][^'\"]*\.(gif|png)['\"]"
The option -o causes grep to list only the matches instead of the whole line. Use tr to remove the quotes:
grep -Eo "['\"][^'\"]*\.(gif|png)['\"]" | tr -d "\"'"
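A quick sanity check of the pipeline on a made-up source line (expected output shown as comments):
echo "img = 'logo.png'; icon = \"icons/home.gif\"" | grep -Eo "['\"][^'\"]*\.(gif|png)['\"]" | tr -d "\"'"
# logo.png
# icons/home.gif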

Related

How to pass multiple variables in grep

I have a JSON file that is downloaded using curl. It contains some information about a Confluence page. I want to extract only three parts of the downloaded information: the page id, status and title.
I have written a bash script for this, and my constraint is that I am not sure how to pass multiple variables to the grep command:
id=id #hardcoded
status=status #hardcoded
echo Enter title you are looking for: #taking input from user here
read title_name
echo
echo
echo Here are details
curl -u username:password -sX GET "http://X.X.X.X:8090/rest/api/content?type=page&start=0&limit=200" | python -mjson.tool | grep -Eai "$title_name"|$id|$status"
Aside from a typo (you have an unbalanced quote - please always check the syntax for correctness before posting), the basic idea of your approach would work, in that
grep -Eai "$title_name|$id|$status"
would select those text lines which contain the content of one of the variables title_name, id or status.
However, it is a pretty fragile solution. I don't know what the actual content of those variables can be, but, for instance, if title_name were set to X.Z, it would also match lines containing the string XYZ, since the dot matches any character. Similarly, if title_name contained, say, a lone [ or (, grep would complain about an unmatched bracket or parenthesis.
If you want the strings to be matched literally rather than treated as regular expressions, it is better to write those patterns into a file (one pattern per line) and use
grep -F -f patternfile
for searching. Of course, since you are using bash, you can also use process substitution if you prefer not to use an explicit temporary file.
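For example, a sketch of the process-substitution variant (confluence.json is a placeholder name for wherever you saved the curl output):
grep -F -f <(printf '%s\n' "$title_name" "$id" "$status") confluence.json   # one fixed-string pattern per line, no regex interpretation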

Delete some lines from text using Linux command

I know how to match text using regex patterns but not how to manipulate them.
I have used grep to match and extract lines from a text file, but I want to remove those lines from the text. How can I achieve this without having to write a python or bash shell script?
I have searched on Google and was recommended to use sed, but I am new to it and don't know how it works.
Can anyone point me in the right direction or help me achieve this goal?
The -v option to grep inverts the search, reporting only the lines that don't match the pattern.
Since you know how to use grep to find the lines to be deleted, using grep -v and the same pattern will give you all the lines to be kept. You can write that to a temporary file and then copy or move the temporary file over the original.
grep -v pattern original.file > tmp.file
mv tmp.file original.file
You can also use sed, as shown in shellfish's answer.
There are multiple possible refinements to the grep solution, but for most people most of the time, what is shown is more or less adequate (it would be a good idea to use a per-process intermediate file name, preferably a random one such as the mktemp command gives you). You can add code to remove the intermediate file on an interrupt, suppress interrupts while moving the file back, use copy-and-remove instead of move if the original file has multiple hard links or is a symlink, and so on. The sed command more or less works around these issues for you, but it too is not cognizant of multiple hard links or symlinks.
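A sketch of the refined version (mktemp and the trap are the safeguards mentioned above):
tmp=$(mktemp) || exit 1            # per-process temporary file with a random name
trap 'rm -f "$tmp"' EXIT           # clean up the intermediate file even on interrupt
grep -v pattern original.file > "$tmp"
mv "$tmp" original.file            # note: this still breaks hard links and symlinks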
Create the pattern which matches the lines using grep. Then create a sed script as follows:
sed -i '/pattern/d' file
Explanation:
The -i option means overwrite the input file, thus removing the lines matching pattern.
pattern is the pattern you created for grep, e.g. ^a*b\+.
d is the sed command for delete; it deletes the lines matching the pattern.
file is the input file; it can be given as a relative or absolute path.
For more information see man sed.
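If you want a safety net while learning sed, both GNU and BSD sed accept a backup suffix with -i, for example:
sed -i.bak '/pattern/d' file    # deletes matching lines, keeps the original as file.bak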

Copy a section within two keywords into a target file

I have thousands of files in a directory, and each file contains a number of variable definitions starting with the keyword DEFINE and ending with a semicolon (;). I want to copy all occurrences of the data between these keywords (inclusive) into a target file.
Example: Below is the content of the text file:
/* This code is for lookup */
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
END.
Now from the above content I just want to copy the section starting with DEFINE and ending with ; into a target file, i.e. the output should be:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
This needs to be done for thousands of scripts, with multiple occurrences per script. Please help out.
Thanks a lot, the provided code works, but only to a limited extent: it works when the whole statement is on a single line. The data is not guaranteed to be on one line; it can be spread over multiple lines like below:
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
The code is also in the above fashion. I need to capture all the data between DEFINE and the semicolon (;); after every DEFINE there will be a closing semicolon - this is the pattern.
It sounds like you want grep(1):
grep '^DEFINE.*;$' input > output
Try using grep. Let's say you have files with the extension .txt in the present directory:
grep -ho 'DEFINE.*;' *.txt > outfile
Output:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
Short Description
-o will give you only the matching string rather than the whole line, in case the line also contains something else you want to omit.
-h will suppress the file names before the matching results.
Read the man page of grep by typing man grep in your terminal.
EDIT
If you want capability to search in multiple lines, you can use pcregrep with -M option
pcregrep -M 'DEFINE.*?(\n|.)*?;' *.txt > outfile
Works fine on my system. Check man pcregrep for more details
One can make a simple solution using sed:
sed -n -e '/^DEFINE/{:a p;/;$/!{n;ba}}' your-file
Option -n prevents sed from printing every line; then, each time a line begins with DEFINE, print the line (command p) and enter a loop: until you find a line ending with ;, grab the next line (command n) and loop back to the print command. When exiting the loop, do nothing.
It looks a bit dirty; it seems that the sed15 version has a shorter (and more straightforward) way to achieve this in one line:
sed -n -e '/^DEFINE/,/;$/p' your-file
Indeed, only for this version of sed are both patterns treated this way; for other versions of sed, like mine under Cygwin, the range patterns must be on separate lines to work properly.
One last thing to remember: it does not handle nested ranges, i.e. it stops printing at the first end-pattern it encounters, even if multiple start-patterns have been matched before it. Prefer something with awk if that is a feature you are looking for.
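For completeness, a minimal awk sketch of the same extraction (assuming, as above, that DEFINE starts a line and the closing ; ends one):
awk '/^DEFINE/{p=1} p{print} /;$/{p=0}' your-file
This handles both the single-line and the multi-line case, since the p flag stays set until a line ending in ; has been printed.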

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes output correctly when I specify a small maximum match count, -m 100, after source.txt. The problem is when I don't specify it, or when I specify a huge number. I could previously write to files with grep (not egrep) using huge numbers similar to what I'm trying now, and that was successful. I also checked the last-modified date and time while the command was executing, and it seems no modification happens to the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and that sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc. (split uses two-letter suffixes by default). You can then run the entire command on each one of those files, perhaps changing the pipe to append at the end: >> dest.txt.
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
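Putting it together, a sketch of the whole chunked workflow (PATTERN is a placeholder for the egrep expression from the question, which was not shown in full):
split -l 10000000 source.txt split_source.
for f in split_source.*; do
    egrep -o 'PATTERN' "$f" | sed -e 's/www\.//' >> dest.txt    # PATTERN: placeholder
done
sort dest.txt | uniq > dest_uniq.txt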
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem coming from sort, uniq, or sed (my bet is on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed; otherwise you could end up with duplicates in your result set, AND an unsorted result set. Or maybe that's what you want.
Consider wrapping your regular expression in ^...$ if it's appropriate to anchor it to the beginning of line (^) and end of line ($). Otherwise you'll be matching portions in the middle of a line.
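A tiny illustration of what the anchors change:
printf 'foo\nxfoox\n' | egrep 'foo'      # prints both lines
printf 'foo\nxfoox\n' | egrep '^foo$'    # prints only: foo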

Using sed to delete lines present in similar file

I have file listings from an original and a duplicate drive, consisting of 985257 lines and 984997 lines respectively.
As the numbers of lines do not match, I am certain that some of the files have not been duplicated.
In order to establish which files are not present I wish to use sed to filter the original file listing by deleting any lines present in the duplicate listing from the source listing.
I had thought about using a MATCH formula in Excel, but due to the number of lines the program crashes. I thought this approach in sed would be a viable option.
I have had no success with my approach so far however.
echo "Start"
# Cat the passed argument which is the duplicate file listing
for line in $(cat $1)
do
#sed the $line variable over the larger file and remove
#sed "${line}/d" LiveList.csv
#sed -i "${line}/d" LiveList.csv
#sed -i '${line}' 'd' LiveList.csv
sed -i "s/'${line}'//" /home/listings/LiveList.csv
done
A temporary file is created and fills up to the 103.4 MB of the listing file, but the listing file itself is not altered at all.
My other concern is that, as the listing was created on Windows, the '\' characters may be escaping parts of the string, leading to no matches and therefore no alteration.
Example path:
Path,Length,Extension
Jimmy\tail\images\Jimmy\0001\0014\Text\A0\20\A056TH01-01.html,71982,.html
Please help.
This might work for you:
sort original_list.txt duplicate_list.txt | uniq -u
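Note that uniq -u prints the lines that appear in exactly one of the two listings, i.e. differences from either side. If you only want the lines missing from the duplicate listing, comm is a closer fit; a sketch, assuming the listings are not yet sorted:
comm -23 <(sort original_list.txt) <(sort duplicate_list.txt)    # lines only in the original listing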
The first thing that comes to my mind is just using rsync to take care of copying the missing files as fast as possible. It really works wonders.
If not, you can first sort both files to identify where they differ. You can then use some paste trickery to put the differences side by side, or even use diff's side-by-side output. When the files are ordered, I think diff finds it easy to identify which lines have been added.
