Extracting text information using rapidminer - text

I have a list of text data from which I want to extract certain portions. I am currently using a regular expression to extract the data I want, but it's starting to get very complicated because each record is slightly different. Is there a way to use Rapidminer to "learn" a regular expression based on some typical examples?
For example, for each of the following records I want to extract the text 24 and 18 into two new attributes:
word 24 on line 18
Wrd 24 of Ln 18
Line 18, Word 24
Word 24 comes after word 22 on line 18 (not line 19)
I have watched all the text processing videos, but none of them show how to do this sort of thing, and I don't really know where to start. Can anyone suggest a way of doing this other than manually creating regular expressions?

The TXR language has a straightforward way to express pattern matching variants without cryptic regular expressions.
Here is your data file:
$ cat 13249396.dat
word 24 on line 18
Wrd 24 of Ln 18
Line 18, Word 24
Word 24 comes after word 22 on line 18 (not line 19)
Here is the txr script:
@(collect)
@ (some)
word @wd on line @ln
@ (or)
Wrd @wd of Ln @ln
@ (or)
Line @ln, Word @wd
@ (or)
Word @wd comes after word @nil on line @ln (@(skip)
@ (end)
@(end)
@(output)
@ (repeat)
@wd:@ln
@ (end)
@(end)
Test run:
$ txr 13249396.txr 13249396.dat
24:18
24:18
24:18
24:18
The script was developed by taking the cases from the sample file and replacing a few things by bits of special syntax.
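If you would rather stay with regular expressions, one way to keep them manageable is to stop matching whole records and instead look for the number that follows a keyword. Below is a minimal Python sketch of that idea. It is not a RapidMiner feature, the names WORD_RE, LINE_RE and extract are made up for illustration, and the patterns are only known to cover the four sample records from the question.

```python
import re

# Hypothetical patterns: a keyword ("word"/"wrd", "line"/"ln") followed by a number.
WORD_RE = re.compile(r"\bwo?rd\s+(\d+)", re.IGNORECASE)
LINE_RE = re.compile(r"\b(?:line|ln)\s+(\d+)", re.IGNORECASE)


def extract(record):
    """Return (word, line) as strings, using the first number after each keyword."""
    word = WORD_RE.search(record)
    line = LINE_RE.search(record)
    return (
        word.group(1) if word else None,
        line.group(1) if line else None,
    )


records = [
    "word 24 on line 18",
    "Wrd 24 of Ln 18",
    "Line 18, Word 24",
    "Word 24 comes after word 22 on line 18 (not line 19)",
]
for record in records:
    print(extract(record))
```

Because search stops at the first match, the fourth record yields the first "Word" and first "line" numbers rather than the later ones. A record shape that puts the wanted number second would need its own rule.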

Related

How to use m with the ed function in a Bash Script [duplicate]

I just need to move a line up in sed. I can select the line with
sed -i '7s///'
I need to move line 7 up 2 lines so it will be line 5.
I can't find anything on the internet to do this without complicated scripts, I can't find a simple solution of moving a specific line a specific number of times.
seq 10|sed '5{N;h;d};7G'
How it works: at line 5, N appends the next line (line 6) to the pattern space, h copies both lines into the hold space, and d deletes them from the pattern space, so nothing is printed yet. At line 7, G appends the hold space contents ("5\n6") to the pattern space, which is now "7\n5\n6". sed then prints the pattern space at the end of the cycle, as it does by default unless -n is given.
ed is better at this, since it has a "move" command that does exactly what you want. To move line 7 to be the line after line 4, just do 7m4. ed doesn't write the data back by default, so you need to explicitly issue a w command to write the data:
printf '7m4\nw\n' | ed input
Although it is perhaps better to use a more modern tool:
ex -s -c 7m4 -c w -c q input
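For anyone scripting this outside sed or ed, the effect of 7m4 can be sketched in Python on a list of lines. This only illustrates the move semantics and is not how ed implements it; the function name move_line is hypothetical.

```python
def move_line(lines, src, dest):
    """Move 1-based line `src` so that it follows 1-based line `dest`.

    dest=0 moves the line to the top, mirroring ed's `<src>m<dest>`.
    """
    result = list(lines)
    moved = result.pop(src - 1)
    # Removing the source shifts every later line up by one,
    # so the insert index depends on which side of src the target is.
    index = dest if dest < src else dest - 1
    result.insert(index, moved)
    return result


lines = [str(n) for n in range(1, 11)]
print(move_line(lines, 7, 4))
```

With the ten numbered lines above, line 7 ends up between 4 and 5, which is the same order the sed one-liner produces.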

notepad++ - search and replace for a word and remove line

Hopefully I can make this understandable.
Say I have this text in a file:
bash-4.2$ 336
1
bash-4.2$ 401
2
bash-4.2$ 403
3
bash-4.2$ 404
4
bash-4.2$ 735
5
bash-4.2$ 894
6
bash-4.2$ 909
7
I want to remove everything on the lines that start "bash", so I am looking for this output:
1
2
3
4
5
6
7
I have been using the regular expression search (with the help of https://regex101.com/r/kT0uE3/1) and if I use this search "bash.*" it removes the line but not the carriage return.
When I change this search to "bash.*\n" it does not find anything (despite regex101 saying it would work).
I think I am missing something obvious and simple but I cannot see the trees for the woods.
Any help is much appreciated.
Ctrl+H
Find what: ^bash-.+\R
Replace with: LEAVE EMPTY
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
bash- # literally
.+ # 1 or more any character but newline
\R #any kind of linebreak
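The same pattern can also be tried outside Notepad++. Python's re module has no \R, so this sketch assumes the file uses \n or \r\n line endings and spells the line break out as \r?\n. The name strip_prompt_lines is hypothetical.

```python
import re

# ^bash-  : "bash-" at the start of a line (re.MULTILINE makes ^ match per line)
# .+      : one or more characters other than \n
# \r?\n   : the line break itself, so the whole line disappears
PROMPT_LINE = re.compile(r"^bash-.+\r?\n", re.MULTILINE)


def strip_prompt_lines(text):
    """Remove every line that starts with 'bash-', including its line break."""
    return PROMPT_LINE.sub("", text)


sample = "bash-4.2$ 336\n1\nbash-4.2$ 401\n2\n"
print(strip_prompt_lines(sample), end="")
```

Including the line break in the match is what removes the line itself instead of leaving an empty one behind.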

Replace a line containing certain characters using vi

I'd like to replace all lines containing "CreateTime=xxxxx" with "CreateTime=2012-01-04 00:00". How can I do this in Vim?
[m18]
Attendees=38230,92242,97553
Duration=2
CreateTime=2012-01-09 22:00
[m20]
Attendees=52000,50521,34025
Duration=2
CreateTime=2012-01-09 00:00
[m22]
Attendees=95892,23689
Duration=2
CreateTime=2012-01-08 17:00
You can use the global substitute operator for this.
:%s/CreateTime=.*$/CreateTime=2012-01-04 00:00/g
You can read the help for the s command from within Vim using:
:help :s
You can read about patterns with :help pattern-overview.
As requested, a bit more about the regular expression match (CreateTime=.*$):
CreateTime= # this part is just a string
. # "." matches any character
* # "*" modifies the "." to mean "0 or more" of any character
$ # "$" means "end of line"
Taken together, it matches CreateTime= followed by any series of characters, consuming the rest of the line.
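The same substitution can be sketched in Python with the re module, assuming the file content is already in a string. The function name reset_create_time is hypothetical. re.MULTILINE is needed here so that $ matches at the end of every line rather than only at the end of the string.

```python
import re

# Same pattern as the Vim command: "CreateTime=" followed by the rest of the line.
CREATE_TIME = re.compile(r"CreateTime=.*$", re.MULTILINE)


def reset_create_time(text, new_value="2012-01-04 00:00"):
    """Replace the value of every CreateTime= entry with new_value."""
    # A function replacement avoids backslash interpretation in new_value.
    return CREATE_TIME.sub(lambda match: "CreateTime=" + new_value, text)


sample = "[m18]\nDuration=2\nCreateTime=2012-01-09 22:00\n"
print(reset_create_time(sample), end="")
```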

Add blank line before a certain phrase in a text file in Linux?

I'm using Kali Linux, trying to sort out some output from Nmap. Basically, I ran a scan with Nmap and need to extract specific pieces of information from it. I've got it to show everything I need using the following command:
cat discovery.txt | grep 'Nmap scan report for\|Service Info: OS:\|OS CPE:\|OS guesses:\|OS matches\|OS details'
Essentially, each section of information I need will start with "Nmap scan report for [IP ADDRESS]"
I'd like to add to my command to have it create a blank line before every appearance of the word "Nmap", to clearly separate each chunk of information.
Is there any command I can use to do this?
sed '/Nmap/i
' file
That's a literal newline after the i
A demo: add a newline before each line ending with a "0" or a "5"
seq 19 | sed '/0$\|5$/i
'
1
2
3
4

5
6
7
8
9

10
11
12
13
14

15
16
17
18
19
Sure, you can use Perl.
perl -pe 's/^Nmap/\nNmap/'
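If the filtered lines end up in a script anyway, the same separation can be sketched in Python. This assumes the grep output has already been split into lines; separate_reports is a hypothetical name and the addresses in the sample are invented.

```python
def separate_reports(lines, marker="Nmap"):
    """Return the lines with an empty line inserted before each one starting with marker."""
    separated = []
    for line in lines:
        if line.startswith(marker):
            separated.append("")  # blank separator before each new report
        separated.append(line)
    return separated


sample = [
    "Nmap scan report for 10.0.0.1",  # made-up address
    "OS details: Linux",
    "Nmap scan report for 10.0.0.2",  # made-up address
]
print("\n".join(separate_reports(sample)))
```

Like the Perl one-liner, this only reacts to lines that start with the marker, whereas the sed /Nmap/ address matches the word anywhere in the line.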
