Stream File Contents Until Substring Encountered - linux

I was using:
$ head -n 2 *.xml | grep (..stuff..)
to stream the first 2 lines of all xml files to the grep command. However, I realized that this was not reliable given the structure of these files.
What I need instead is to stream start of each xml file until a particular substring (which all these files have) is encountered.
head does not provide that level of granularity. The substring is simply the start of a tag (e.g. something like "< tag start"). I would be grateful for any ideas. Thanks!

If you know the max number of lines you have before the matching string, you can do something like this:
# cat testfile
123
9
1
1
2
3
4000
TAG
456
# grep -m 1 -B 10 TAG testfile | grep -v TAG
123
9
1
1
2
3
4000
#
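If the maximum number of lines is not known in advance, sed can simply quit at the first match instead; a sketch on the same testfile:
# print everything up to but not including the TAG line
sed -n '/TAG/q;p' testfile
# or keep the TAG line too
sed '/TAG/q' testfile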

Sounds like you want either of these (using GNU awk for nextfile), depending on whether you want the tag line printed or not:
awk '/< tag start/{nextfile} 1' *.xml
awk '1; /< tag start/{nextfile}' *.xml
or less efficiently with any awk:
awk 'FNR==1{f=1} /< tag start/{f=0} f' *.xml
awk 'FNR==1{f=1} f; /< tag start/{f=0}' *.xml
or bringing back some efficiency in this case:
for file in *.xml; do
awk '/< tag start/{exit} 1' "$file"
done
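For a quick check of the difference between the two nextfile variants, a throwaway example:
$ printf 'a\n< tag start>\nb\n' > t.xml
$ awk '/< tag start/{nextfile} 1' t.xml
a
$ awk '1; /< tag start/{nextfile}' t.xml
a
< tag start>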

I appreciate all the responses. I found that I really only needed the content of a single tag, rather than everything from the beginning of the xml file, which simplified the parsing. So for instance, given
<mt:myTag LOTSOFSTUFF >
I really only needed LOTSOFSTUFF. So I simply did:
grep -oP "<mt:myTag(.*)>" *.xml | grep_more
and that worked exactly as needed. Thanks again. I really appreciate it, and sorry I did not realize my use case was simpler than I made it out to be.
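One caveat worth noting: (.*) is greedy, so if a line contains more than one >, the match runs to the last one on the line. A negated character class keeps the match inside the tag; a sketch:
# stop at the first closing '>' rather than the last one on the line
grep -oP '<mt:myTag[^>]*>' *.xml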

Related

Can grep show output only if the line contain another search string? [duplicate]

I am trying to extract text from a file between a < and a >, but only on a line starting with another specific pattern.
So in a file that looks like:
XXX Something here
XXX Something more here
XXX <\Lines like this are a problem>
ZZZ something <\This is the text I need>
XXX Don't need any of this
I would like to print only the <\This is the text I need>.
If I do
sed -n '/^ZZZ/p' FILENAME
it pulls the correct lines I need to look at, but obviously prints the whole line.
sed -n '/<\/,/>/p' FILENAME prints way too much.
I have looked into grouping and tried
sed -n '/^ZZZ/{/<\/,/>/} FILENAME
but this doesn't seem to work at all.
Any suggestions? They will be much appreciated.
(Apologies for formatting, never posted on here before)
sed -n '/^ZZZ/ { s/^.*\(<.*>\).*$/\1/p }' FILENAME
If it does not have to be sed and you have a fairly recent grep, you may use grep's option -o as in
grep '^ZZZ' FILENAME | grep -o '<[^>]*>'
An awk version
awk -F"<|>" '/^ZZZ/ {print "<"$2">"}' file
<\This is the text I need>

Automate and looping through batch script

I'm new to batch scripting. I want to iterate through a list and use the output content to replace a string in another file.
ls -l somefile | grep .txt | awk 'print $4}' | while read file
do
toreplace="/Team/$file"
sed 's/dataFile/"$toreplace"/$file/ file2 > /tmp/test.txt
done
When I run the code I get the error
sed: 1: "s/dataFile/"$torepla ...": bad flag in substitute command: '$'
Example of somefile, which has a list of file paths:
foo/name/xxx/2020-01-01.txt
foo/name/xxx/2020-01-02.txt
foo/name/xxx/2020-01-03.txt
My desired output is to use the list of file paths in somefile to replace a string in the content of file2. Something like this:
This is the directory of locations where data from /Team/foo/name/xxx/2020-01-01.txt ............
I'm not sure if I understand your desired outcome, but hopefully this will help you to figure out your problem:
You have three files in a directory:
TEAM/foo/name/xxx/2020-01-02.txt
TEAM/foo/name/xxx/2020-01-03.txt
TEAM/foo/name/xxx/2020-01-01.txt
And you have another file called to_be_changed.txt which contains the text This is the directory of locations where data from TO_BE_REPLACED ............ If you want to grab the filenames of your three files and insert them into to_be_changed.txt, you can do it like this:
while read file
do
    filename="$file"
    sed "s/TO_BE_REPLACED/${filename##*/}/g" to_be_changed.txt >> changed.txt
done < <(find ./TEAM/ -name "*.txt")
And you will then have made a file called changed.txt which contains:
This is the directory of locations where data from 2020-01-02.txt ............
This is the directory of locations where data from 2020-01-03.txt ............
This is the directory of locations where data from 2020-01-01.txt ............
Is this what you're trying to achieve? If you need further clarification I'm happy to edit this answer to provide more details/explanation.
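The ${filename##*/} expansion above strips the longest prefix matching */ - everything up to and including the last slash - which is what turns each full path into a bare filename. A quick illustration:
$ file=./TEAM/foo/name/xxx/2020-01-01.txt
$ echo "${file##*/}"
2020-01-01.txt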
ls -l somefile | grep .txt | awk 'print $4}' | while read file
No. No, no, nono.
ls -l somefile is only going to show somefile unless it's a directory.
(Don't name a directory "somefile".)
If you mean somefile.txt, please clarify in your post.
grep .txt is going to look through the lines presented for the three characters txt preceded by any character (the dot is a regex wildcard). Since you asked for a long listing of somefile it shouldn't find any, so nothing should be passed along.
awk 'print $4}' has a typo (it's missing the opening brace) and won't run; awk will exit with a syntax error.
Keep it simple. What I suspect you meant was
for file in *.txt
Then in
toreplace="/Team/$file"
sed 's/dataFile/"$toreplace"/$file/ file2 > /tmp/test.txt
it's unclear what you expect $file to be - awk's $4 from an ls -l seems unlikely.
Assuming it's the filenames from the for above, then try
sed "s,dataFile,/Team/$file," file2 > /tmp/test.txt
Does that help? Correct me as needed. Sorry if I seem harsh.
Welcome to SO. ;)

Make grep to exact match strings with and without dash "-"

The problem looks simple and common, so I've looked through many answers but seems that none of them provides appropriate general solution.
I need to grep large tab-separated 6 columns file (*.bed file in fact) to split it by the content of the first column using the list of string variables (items). I just need a row starting with a given string.
I was successfully using
grep -w "$name" inputfile
$name is read from the list of strings
for that purpose, until I hit the case where strings have the following format (example): YAL038W but also YAL038W-A, YAL038W-B, ...
So grep with the -w option also matches YAL038W inside YAL038W-A and YAL038W-B, since "-" is a word separator. It would work with "_" but not with "-".
I've found solutions based on awk which are working fine, for example:
awk -F $'\t' -vsearch=$name '$1==search' inputfile
but awk is terribly slow, over 10 times slower; see the time measurements below.
For a 2.5 Gb input file and >5000 items to look for, the script has already been running for >24 hours!
Example of inputfile:
YAL038W-A 0 48 HWI-1KL176:101:CC27NACXX:3:2208:17646:92047 0 +
YAL038W-A 0 48 HWI-1KL176:101:CC27NACXX:3:2211:17326:31268 0 +
YAL038W 1 50 HWI-1KL176:101:CC27NACXX:8:1205:16311:19319 3 +
YAL038W 1 27 HWI-1KL176:101:CC27NACXX:8:2103:4951:94527 42 +
time grep -w "YAL038W" inputfile > testfile.txt
real 0m3.569s
time awk -F $'\t' -vsearch="YAL038W" '$1==search' inputfile > testfile.txt
real 0m29.521s
I am looking for a FAST solution using grep or something else, and I need to pass the variable to this command in a loop.
An alternative is to modify the input file by replacing "-" with "_", but that is a last resort, I believe...
Thanks in advance
I've found solutions based on awk which are working fine, for example:
awk -F $'\t' -vsearch=$name '$1==search' inputfile
but awk is terribly slow…
I am looking for FAST solution using grep …
If the above awk command worked for you, then this will do:
grep "^$name"$'\t' inputfile
Just search at the beginning of each line for the name followed by a TAB.
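That is still one scan of the 2.5 Gb file per name, though. If all >5000 names are known up front, a single pass that checks the first column against a hash table is usually far faster; a sketch, assuming a hypothetical names.txt holding one identifier per line:
# load names.txt into a lookup table, then keep rows whose first
# tab-separated column appears in it -- one pass over the big file
awk -F '\t' 'NR==FNR { want[$1]; next } $1 in want' names.txt inputfile > matched.bed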

renaming files using loop in unix

I have a situation here.
I have lot of files like below in linux
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaac
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaag
I want to remove the $line and put a counter from 0001 to 6000 in its place, for my 6000 such files.
I also want to remove the trailing 3 characters after this is done, for each file.
After the fix the files should look like:
SIPTV_FIPTV_ID0000001_T20141003195717_C0000001000_FWD148_IPV_001.DAT
SIPTV_FIPTV_ID0000002_T20141003195717_C0000001000_FWD148_IPV_001.DAT
Please help.
With some assumptions, I think this should do it:
1. list of the files is in a file named input.txt, one file per line
2. the code is running in the directory the files are in
3. bash is available
awk '{ i++; printf "mv \x27%s\x27 \x27%s%05d%s\x27\n", $0, substr($0,1,16), i, substr($0,22,47) }' input.txt | bash
from the command prompt give the following command
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \x27" old "\x27", $0}'
%
and check the output; if it looks OK,
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \x27" old "\x27", $0}' | sh
%
A commentary: printf '%s\n' *.DAT??? is meant to feed awk the list of filenames you want to modify, one per line (a plain echo would put them all on a single line); you may want something more elaborate if the example names you gave aren't representative of the whole set. Regarding the awk script itself: I used sprintf to generate a string with the correct number of zeroes to replace $line; the old name is single-quoted in the generated mv command because it contains a literal $, which the shell would otherwise try to expand; the idiom "\\$..." with two backslashes to quote the dollar sign is required by gawk and does no harm in mawk; and as a last remark, in cases like this I prefer to do at least a dry run before passing the commands to the shell.
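For comparison, the same rename can be done in plain bash without generating commands for a second shell; a sketch, assuming bash and that the files sit in the current directory:
n=0
for f in *.DAT???; do
    printf -v num '%05d' "$((++n))"   # zero-padded counter
    new="${f/\$line/$num}"            # replace the literal $line
    mv -- "$f" "${new%???}"           # and drop the trailing 3 characters
done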

How to find the particular text stored in the file "data.txt" and it occurs only once

The line I seek is stored in the file data.txt and is the only line of text that occurs only once.
How do I go about finding that particular line using linux?
This is a little bit old, but I think you are looking for this...
cat data.txt | sort | uniq -u
This will show the unique values that occur only once in the file. I assume you are familiar with OverTheWire if you are asking? If so, this is what you are looking for.
To provide some context (I need more rep to comment) this is a question that features in an online "wargame" called Bandit that involves using the command line to discover passwords on an online Linux server to advance up the levels.
For those who would like to see data.txt in full, I've Pastebin'd it here; it looks like this:
NN4e37KW2tkIb3dC9ZHyOPdq1FqZwq9h
jpEYciZvDIs6MLPhYoOGWQHNIoQZzE5q
3rpovhi1CyT7RUTunW30goGek5Q5Fu66
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
9WV67QT4uZZK7JHwmOH0jnhurJMwoGZU
a2GjmWtTe3tTM0ARl7TQwraPGXgfkH4f
7yJ8imXc7NNiovDuAl1ZC6xb0O0mMBx1
UsvVyFSfZZWbi6wgC7dAFyFuR6jQQUhR
FcOJhZkHlnwqcD8QbvjRyn886rCrnWZ7
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
ME7nnzbId4W3dajsl6Xtviyl5uhmMenv
J5lN3Qe4s7ktiwvcCj9ZHWrAJcUWEhUq
aouHvjzagN8QT2BCMB6e9rlN4ffqZ0Qq
ZRF5dlSuwuVV9TLhHKvPvRDrQ2L5ODfD
9ZjR3NTHue4YR6n4DgG5e0qMQcJjTaiM
QT8Bw9ofH4x3MeRvYAVbYvV1e1zq3Xim
i6A6TL6nqvjCAPvOdXZWjlYgyvqxmB7k
tx7tQ6kgeJnC446CHbiJY7fyRwrwuhrs
One way to do it is to use:
sort data.txt | uniq -u
The sort command is like cat in that it displays the contents of the file; however, it sorts the file lexicographically by lines (it reorders them alphabetically so that matching ones end up together).
The | is a pipe that redirects the output from one command into another.
The uniq command reports or omits repeated lines and by passing it the -u argument we tell it to report only unique lines.
Used together like this, the command will sort data.txt lexicographically by each line, find the unique line and print it back in the terminal for you.
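A tiny illustration of the pipeline on a toy input:
$ printf 'b\na\nb\n' | sort | uniq -u
a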
sort -u data.txt | while read line; do if [ "$(grep -c "$line" data.txt)" -eq 1 ]; then echo "$line"; fi; done
was my solution, until I saw the easier one here:
sort data.txt | uniq -u
Add more information to your post.
What does data.txt look like?
Like this:
11111111
11111111
pass1111
11111111
Or like this
afawfdgd
password
somethin
gelse...
And do you know whether the password is in the file, or are you searching for the non-repeated string?
If you know the password, use something like this:
cat data.txt | grep 'password'
If you don't know the password and it is the only unique line in the file, you must create a script.
For example, in Python:
file = open("data.txt", "r")
for line in file:
    if 'pass' in line:
        print(line)
Of course, replace 'pass' with something else, for example some slice of the line.
And one using only a single tool, awk:
awk '{ count[$0]++ } END { for (line in count) if (count[line] == 1) print line }' data.txt
sort data.txt | uniq -c | grep -E '^ *1 '
and it will print the only text that occurs exactly once: uniq -c prefixes each line with its count, and the pattern anchors on a count of exactly 1.
A bare
sort data.txt | uniq -c | grep 1
is not reliable, since it also matches lines whose text happens to contain a 1, or counts such as 10 or 11.
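For reference, this is what the uniq -c output looks like on the same toy input as above (GNU uniq right-aligns the count), which is why anchoring on the leading spaces and the count matters:
$ printf 'b\na\nb\n' | sort | uniq -c
      1 a
      2 b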
