Select disjoint chunks of code in Vim for yanking

I am wondering if I am able to do this in Vim:
Sample code:
require 'abstract_controller/collector'
require 'active_support/core_ext/hash/reverse_merge'
require 'active_support/core_ext/array/extract_options'
require 'IDONTWANTTHISLINETOBEINCLUDEDINMYYANKREGISTER'
require 'IDONTWANTTHISLINETOBEINCLUDEDINMYYANKREGISTER'
module ActionMailer #:nodoc:
  class Collector
    include AbstractController::Collector
    attr_reader :responses
    def initialize(context, &block)
      @context = context
      @responses = []
      @default_render = block
    end
    def any(*args, &block)
      options = args.extract_options!
      raise "You have to supply at least one format" if args.empty?
      args.each { |type| send(type, options.dup, &block) }
    end
    alias :all :any
    def custom(mime, options={})
      options.reverse_merge!(:content_type => mime.to_s)
      @context.freeze_formats([mime.to_sym])
      options[:body] = block_given? ? yield : @default_render.call
      @responses << options
    end
  end
end
Now suppose I want to yank just some lines and put them in another file. Suppose I want to yank these blocks of lines:
Chunk 1:
require 'abstract_controller/collector'
require 'active_support/core_ext/hash/reverse_merge'
require 'active_support/core_ext/array/extract_options'
Chunk 2:
module ActionMailer #:nodoc:
  class Collector
    include AbstractController::Collector
    attr_reader :responses
    def initialize(context, &block)
      @context = context
      @responses = []
      @default_render = block
    end
Chunk 3:
    def custom(mime, options={})
      options.reverse_merge!(:content_type => mime.to_s)
      @context.freeze_formats([mime.to_sym])
      options[:body] = block_given? ? yield : @default_render.call
      @responses << options
    end
  end
end
These lines don't form one contiguous group; they are separated. So to get what I want I have to yank the blocks in three steps, which I find quite annoying: yank, switch buffer, put, switch back, yank, switch buffer, put, and so on.
So, is there a way to do this more efficiently (in one step)?

Use a register in append mode:
Visually select first three lines, "ay
Visually select next 10 lines, "Ay (note the capital letter)
Visually select chunk 3, "Ay
Go to other buffer, "ap
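Spelled out for the sample above. This is just an illustration: the j counts assume the cursor starts on the first line of each chunk, and :b other stands in for however you normally switch buffers.
:let @a = ''   " optional: empty register a first; qaq does the same
V2j"ay         " chunk 1 into register a
V8j"Ay         " chunk 2 appended (capital A)
V7j"Ay         " chunk 3 appended
:b other       " switch to the target buffer
"ap            " put all three chunks in one go
Because the yanks are linewise, each appended chunk starts on its own line in the register.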
You like registers? This answer is more in-depth.


Deleting until whitespace in Sublime Text

Is there any way in Sublime to delete all characters back to the first whitespace? I know you can use ctrl+delete to do that, but it stops at non-word characters (", :, &, *, etc.). For example, deleting aaa aaa 2+a from the end, it deletes 2+a only as far as the + sign, but it deletes aaa all the way back to the space. I need to change that so it deletes 2+a back to the first space. The solution can be anything: a settings change or a plugin.
I found a solution for this, via this plugin:
https://packagecontrol.io/packages/KeyboardNavigation
The key binding for it is:
{ "keys": ["ctrl+backspace"], "command": "delete_to_beg_of_contig_boundary", "args": {"forward": false} }
It deletes characters right to left until the first whitespace.
I have written a Sublime Text plugin to delete text as you require. It is almost identical to ST's delete_word command but breaks only at whitespace/non-whitespace.
When called the plugin deletes text from the cursor to the next or previous group of characters, the grouping being defined as either whitespace or non-whitespace characters. Thus if run several times in succession it will alternate between deleting groups of whitespace and non-whitespace characters ahead or behind the cursor. The forwards parameter of the run() method (i.e. the command's arg) controls the deletion direction.
Save the plugin somewhere in your config Packages folder hierarchy. e.g.
.../sublime-text-3/Packages/User/DeleteToWhitespace.py
Add key bindings to your user .sublime-keymap file. e.g.
//
// These key bindings override the ST 'delete_word' keys but use whatever keys you want.
// You could use `super+delete` and `super+backspace` and keep ST's delete keys intact.
//
{ "keys": ["ctrl+delete"], "command": "delete_to_whitespace", "args": {"forwards": true} },
{ "keys": ["ctrl+backspace"], "command": "delete_to_whitespace", "args": {"forwards": false} },
Below is the DeleteToWhitespace.py plugin. It has been uploaded to this GitHub Gist – this links directly to the raw source code.
#
# Name:     Delete To Whitespace
# Requires: Plugin for Sublime Text v3
# Command:  delete_to_whitespace
# Args:     forwards: bool (delete backwards if false)
# License:  MIT License
#

import sublime, sublime_plugin, re


class DeleteToWhitespaceCommand(sublime_plugin.TextCommand):
    """
    A Sublime Text plugin that deletes text from the cursor to the next or
    previous group of characters, the grouping being defined as either
    whitespace or non-whitespace characters. Thus if run several times in
    succession it will alternate between deleting groups of whitespace and
    non-whitespace ahead or behind the cursor. The forwards parameter of the
    run() method (i.e. the command's arg) controls the deletion direction.
    """

    def run(self, edit, forwards=True):
        self.edit = edit
        self.forwards = forwards
        if forwards:
            self.delete_forwards()
        else:
            self.delete_backwards()

    def delete_forwards(self):
        whitespace_regex = r"^\s+"
        non_whitespace_regex = r"^\S+"

        for sel in self.view.sel():
            if sel.size() > 0:
                self.view.erase(self.edit, sel)
                continue
            # ∴ sel.a == sel.b == sel.begin() == sel.end()
            # view.full_line() includes the trailing newline (if any).
            cursor = sel.a
            line = self.view.full_line(cursor)
            cursor_to_eol = sublime.Region(cursor, line.end())
            cursor_to_eol_str = self.view.substr(cursor_to_eol)

            match = re.search(whitespace_regex, cursor_to_eol_str)
            if match:
                self.erase_matching_characters(cursor, match)
                continue

            match = re.search(non_whitespace_regex, cursor_to_eol_str)
            if match:
                self.erase_matching_characters(cursor, match)
                continue

    def delete_backwards(self):
        whitespace_regex = r"\s+$"
        non_whitespace_regex = r"\S+$"

        for sel in self.view.sel():
            if sel.size() > 0:
                self.view.erase(self.edit, sel)
                continue
            # ∴ sel.a == sel.b == sel.begin() == sel.end()
            # view.line() excludes the trailing newline (if any).
            cursor = sel.a
            line = self.view.line(cursor)
            cursor_to_bol = sublime.Region(cursor, line.begin())
            cursor_to_bol_str = self.view.substr(cursor_to_bol)

            # Delete the newline of the 'previous' line.
            if cursor_to_bol.size() == 0 and cursor > 0:
                erase_region = sublime.Region(cursor, cursor - 1)
                self.view.erase(self.edit, erase_region)
                continue

            match = re.search(whitespace_regex, cursor_to_bol_str)
            if match:
                self.erase_matching_characters(cursor, match)
                continue

            match = re.search(non_whitespace_regex, cursor_to_bol_str)
            if match:
                self.erase_matching_characters(cursor, match)
                continue

    def erase_matching_characters(self, cursor, match):
        match_len = match.end() - match.start()
        if self.forwards:
            erase_region = sublime.Region(cursor, cursor + match_len)
        else:
            erase_region = sublime.Region(cursor, cursor - match_len)
        self.view.erase(self.edit, erase_region)

How to make vim’s vimgrep command keep indentations?

I’m trying to use vim’s quick fix (or local) list to get some information extracted from a file. For example, I want to get all the method names of a python module (the idea was borrowed from pycharm). What I want to get in vim’s “local list” is just something like the following:
class Foo:
    def one():
    def two():
    def three():
def bar():
def bazz():
To achieve that, I do approximately the following steps:
:" OK, the current buffer is being used.
:let file_name = expand('%:p')
:" The heart of the process is one of vim’s grep-like command.
:execute 'lvimgrep /\v^\s*(class|def)/ '.file_name
:" I open the results with the “lopen” command because “llist”
:" doesn’t allow me to use concealing.
:lopen
:" Since I’m working with one file, I don’t need information
:" about file name, line number etc.
:setlocal conceallevel=3
:syntax match NonText /\v^.+col \d+([:]|[|])/ transparent conceal
:" Please note, I‘m still able to jump to a line
:" with the “ll” command.
But unfortunately I get:
class Foo:
def one():
def two():
def three():
def bar():
def bazz():
All the indents are swallowed! The result is quite useless… I can’t tell which of the functions belong to a class and which are stand-alone.
Please note, the concealing has no meaningful influence on the result: if I took away the last two (conceal-related) commands, nothing significant would change; the file name and line/column numbers would be shown, but the text in the lines would still lose its indentation.
So, my questions are:
Is it possible to make lvimgrep (or an analogue) keep the lines untouched in order to save indentation? Is there a magic command or option to do that? Or should I program my own implementation of lvimgrep?
P.S. I’d like to use vim’s regular expressions. But if it’s impossible, I could switch to the external “grep” command (I’m a linux guy) and use the BRE or ERE syntax as well.
No, currently it is impossible to make lvimgrep (or similar commands) keep leading whitespace in quickfix (location) list entries: space and tab characters are unconditionally stripped from the beginning of the text whenever the entry's prefix (file name, line and column) is longer than 3 characters.
The only way to achieve the desired behavior (at least, using *vimgrep commands) is to modify the source code. For example, you might add an option as demonstrated in the following patch:
diff --git a/runtime/optwin.vim b/runtime/optwin.vim
index 7d3a8804d..caac55cf2 100644
--- a/runtime/optwin.vim
+++ b/runtime/optwin.vim
@@ -1299,6 +1299,7 @@ call <SID>OptionG("ve", &ve)
 call append("$", "eventignore\tlist of autocommand events which are to be ignored")
 call <SID>OptionG("ei", &ei)
 call append("$", "loadplugins\tload plugin scripts when starting up")
+call append("$", "locws\tenables whitespace characters for entries in the location window")
 call <SID>BinOptionG("lpl", &lpl)
 call append("$", "exrc\tenable reading .vimrc/.exrc/.gvimrc in the current directory")
 call <SID>BinOptionG("ex", &ex)
diff --git a/src/option.c b/src/option.c
index aabfc7f53..4ba280806 100644
--- a/src/option.c
+++ b/src/option.c
@@ -1791,6 +1791,9 @@ static struct vimoption options[] =
     {"loadplugins", "lpl",  P_BOOL|P_VI_DEF,
                             (char_u *)&p_lpl, PV_NONE,
                             {(char_u *)TRUE, (char_u *)0L} SCTX_INIT},
+    {"locws",       NULL,   P_BOOL|P_VI_DEF,
+                            (char_u *)&p_locws, PV_NONE,
+                            {(char_u *)FALSE, (char_u *)0L} SCTX_INIT},
     {"luadll",      NULL,   P_STRING|P_EXPAND|P_VI_DEF|P_SECURE,
 #if defined(DYNAMIC_LUA)
                             (char_u *)&p_luadll, PV_NONE,
diff --git a/src/option.h b/src/option.h
index c1a25b342..5e17c459e 100644
--- a/src/option.h
+++ b/src/option.h
@@ -602,6 +602,7 @@ EXTERN char_u *p_lcs; // 'listchars'
 EXTERN int p_lz;    // 'lazyredraw'
 EXTERN int p_lpl;   // 'loadplugins'
+EXTERN int p_locws; // 'locws'
 #if defined(DYNAMIC_LUA)
 EXTERN char_u *p_luadll; // 'luadll'
 #endif
diff --git a/src/quickfix.c b/src/quickfix.c
index 136c472e1..8e206ddd7 100644
--- a/src/quickfix.c
+++ b/src/quickfix.c
@@ -4417,8 +4417,9 @@ qf_update_buffer(qf_info_T *qi, qfline_T *old_last)
 static int
 qf_buf_add_line(buf_T *buf, linenr_T lnum, qfline_T *qfp, char_u *dirname)
 {
-    int     len;
-    buf_T   *errbuf;
+    int     len;
+    buf_T   *errbuf;
+    long    lval;
 
     if (qfp->qf_module != NULL)
     {
@@ -4472,10 +4473,12 @@ qf_buf_add_line(buf_T *buf, linenr_T lnum, qfline_T *qfp, char_u *dirname)
         IObuff[len++] = '|';
         IObuff[len++] = ' ';
 
-        // Remove newlines and leading whitespace from the text.
+        // Remove newlines and leading whitespace from the text, unless
+        // the user explicitly enabled whitespace via the 'locws' option.
         // For an unrecognized line keep the indent, the compiler may
         // mark a word with ^^^^.
-        qf_fmt_text(len > 3 ? skipwhite(qfp->qf_text) : qfp->qf_text,
+        get_option_value((char_u *)"locws", &lval, NULL, 0);
+        qf_fmt_text(len > 3 ? (lval ? qfp->qf_text : skipwhite(qfp->qf_text)) : qfp->qf_text,
                     IObuff + len, IOSIZE - len);
 
         if (ml_append_buf(buf, lnum, IObuff,
With the locws option, you can then enable whitespace characters in quickfix/location entries as follows:
:set locws
Alternative Option
As an alternative, you could just list out the results via :# and :global
:g/\v^\s*(class|def)/#
This will print out the relevant lines with their associated line numbers.
A slightly fancier mapping:
nnoremap <leader>f :keeppatterns g/\v^\s*(class|def)/#<cr>:
With this mapping you can just type the line number and press enter to jump to a line after executing the mapping.
For more help see:
:h :g
:h :#
:h :keeppatterns
:h :range
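Run against the sample buffer from the question, this prints something like the following. The line numbers here are illustrative and depend on the actual file, but note that :# keeps the leading whitespace intact:
  1 class Foo:
  2     def one():
  5     def two():
  8     def three():
 12 def bar():
 15 def bazz():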
Using Quickfix List
In order to use the quickfix list you will need to "mangle" the indent text with another character, e.g. the ▶ used below.
command! PyLocations call <SID>py_locations()
function! s:py_locations()
    let lst = []
    let bufnr = bufnr('%')
    let pat = repeat(' ', shiftwidth())
    let Fn = {l -> substitute(matchstr(l, '^\s*'), pat, '▶', 'g') . matchstr(l, '\S.*')}
    keeppatterns g/\v^\s*(class|def)>/call add(lst, {'bufnr': bufnr, 'lnum': line('.'), 'text': call(Fn, [getline('.')])})
    call setqflist(lst, ' ')
    cwindow
endfunction
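Usage is then simply :PyLocations. With a shiftwidth of 4, the quickfix window shows entries of roughly this shape (the file name here is hypothetical):
example.py|1| class Foo:
example.py|2| ▶def one():
example.py|12| def bar():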
I’ve done it! But it took more than the “half an hour” I supposed earlier.
During the research, I found that vim’s location list (and I‘m sure the quickfix list too) keeps the indentation of a line when it’s unable to recognize the line as valid “goto” information, i.e. when the line format doesn’t correspond to 'errorformat'. (See :help quickfix-valid.) So, to get a nice-looking list it must be rendered manually. But in order to keep the possibility of jumping to the items of a search result, a quickfix or location list must be created as well.
I’ve split the task into two functions: the first one retrieves the data, the second one shows it.
function! s:grep_buffer(pattern)
    let file_name = expand("%")
    let b:grepped = [] |" It will store the search results.
    lexpr! [] |" The local list will give the possibility of jumping.
    for line_number in range(1, line('$'))
        let line_content = getline(line_number)
        if line_content =~ '\V'.a:pattern
            call add(b:grepped, line_content)
            laddexpr file_name.':'.line_number.':'.line_content
        endif
    endfor
endfunction
function! s:show_result()
    if exists('b:grepped')
        let grepped = b:grepped |" After creating a new window it’d be lost.
        vnew
        call append(0, grepped)
        setlocal buftype=nofile |" Don’t make vim save the content.
        setlocal noswapfile
        setlocal nomodifiable
        nn <silent> <buffer> <CR> :exe line(".").'ll'<CR>
        wincmd l |" Now the old window is on the right.
        hide
    endif
endfunction
Of course, a convenient key mapping must be designed. (There is a trailing space in the second line.)
command! -nargs=1 GrepBuffer call <SID>grep_buffer(<f-args>)
nn <leader>g :GrepBuffer
nn <silent> <leader>s :call <SID>show_result()<CR>
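The intended workflow with these mappings, spelled out:
<leader>g class<CR>   " grep the current buffer for 'class'
<leader>s             " review the results with indentation intact
<CR>                  " on a result line: jump to it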
It’s super convenient! When I want to overview the search results again, I call the show_result function which replaces the current window with the search results. I can use all the usual navigation tools to move the cursor through the search results. And all I need to jump to an interesting place is just to hit the enter key!
Thanks for all! The problem is solved, vim is the greatest editor.

I need to clean seismological events from a text file

The question here is related to the same type of file I asked another question about, almost one month ago (I need to split a seismological file so that I have multiple subfiles).
My goal now is to delete events which in their first line contain the string 'RSN 3'. So far I have tried editing the aforementioned question's best answer code like this:
with open(refcatname) as fileContent:
    for l in fileContent:
        check_rsn_3 = l[45:51]
        if check_rsn_3 == "RSN 3":
            line = l[:-1]
            check_event = line[1:15]
            print(check_event, check_rsn_3)
        if not check_rsn_3 == "RSN 3":
            # Strip white spaces to make sure it is an empty line
            if not l.strip():
                subFile.write(
                    eventInfo + "\n"
                )  # Add event to the subfile
                eventInfo = ""  # Reinit event info
                eventCounter += 1
                if eventCounter == 700:
                    subFile.close()
                    fileId += 1
                    subFile = open(
                        os.path.join(
                            catdir,
                            "Paquete_Continental_"
                            + str(fileId)
                            + ".out",
                        ),
                        "w+",
                    )
                    eventCounter = 0
            else:
                eventInfo += l
subFile.close()
Expected results: event info of earthquakes with 'RSN N' (where N≠3).
Actual results: the first line of events with 'RSN 3' is deleted, but not the remaining event info.
Thanks in advance for your help :)
I'd advise against checking whether the string is at an exact location (e.g. l[45:51]), since a single character offset can break that; you can instead check whether the entire line contains "RSN 3" with if "RSN 3" in l.
(As an aside, line = l[:-1] doesn't give you the last character of the line; it strips the final character, i.e. the trailing newline, and slicing like line[1:15] works fine on strings.)
If you need to delete several lines, you can check whether the current line contains "RSN 3" and then skip every following line until one contains "RSN " again:
skip = False
for line in fileContent:
    if "RSN 3" in line:
        skip = True
        continue
    if "RSN " in line and "RSN 3" not in line:
        skip = False
    # rest of the logic
    if skip:
        continue
This way you don't even parse the blocks whose first line contains "RSN 3".
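For completeness, here is a minimal runnable sketch of how that skip flag slots into the splitting loop from the question. The catalogue name, the output directory, and the 700-event limit are placeholders or values carried over from the question; adjust them to your data.
import os

catdir = "."                # output directory (placeholder)
refcatname = "catalog.txt"  # input catalogue (placeholder name)

def new_subfile(file_id):
    # open the next "Paquete_Continental_N.out" subfile
    return open(os.path.join(catdir, "Paquete_Continental_%d.out" % file_id), "w")

skip = False
eventInfo = ""
eventCounter = 0
fileId = 1
subFile = new_subfile(fileId)

with open(refcatname) as fileContent:
    for l in fileContent:
        if "RSN 3" in l:
            skip = True        # first line of an unwanted event
            continue
        if "RSN " in l:
            skip = False       # first line of a wanted event
        if skip:
            continue           # still inside an unwanted event
        if not l.strip():      # a blank line terminates the current event
            subFile.write(eventInfo + "\n")
            eventInfo = ""
            eventCounter += 1
            if eventCounter == 700:   # start a new subfile every 700 events
                subFile.close()
                fileId += 1
                subFile = new_subfile(fileId)
                eventCounter = 0
        else:
            eventInfo += l

subFile.close()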

Include monotonically increasing value in logstash field?

I know there's no built-in "line count" functionality while processing files through logstash (for various understandable and documented reasons). But there should be a mechanism, within any given logstash instance, to have a monotonically increasing variable/count for every parsed line.
I don't want to go the metrics route since it's a continuous polling mechanism (every n seconds). Alternatives include pre-processing the log files, which, given my particular use case, is unacceptable.
Again, let me reiterate: I need the ability to generate/read a monotonically increasing variable that I can set on each event in a logstash filter.
Thoughts?
There's nothing built into logstash to do it.
You can build a filter to do it pretty easily.
Just drop something like this into lib/logstash/filters/seq.rb:
# encoding: utf-8
require "logstash/filters/base"
require "logstash/namespace"
require "set"

#
# This filter adds a sequence number to a log entry
#
# The config looks like this:
#
#     filter {
#       seq {
#         field => "seq"
#       }
#     }
#
# The `field` is the field you want added to the event.
class LogStash::Filters::Seq < LogStash::Filters::Base

  config_name "seq"
  milestone 1

  config :field, :validate => :string, :required => false, :default => "seq"

  public
  def register
    # Nothing
  end # def register

  public
  def initialize(config = {})
    super
    @threadsafe = false
    # This filter needs to keep state.
    @seq = 1
  end # def initialize

  public
  def filter(event)
    return unless filter?(event)
    event[@field] = @seq
    @seq = @seq + 1
    filter_matched(event)
  end # def filter
end # class LogStash::Filters::Seq
This will start at 1 every time Logstash is restarted, but for most situations, this would be ok. If you need something that is persistent across restarts, you need to do a bit more work to persist it somewhere
For anyone finding this in 2018+: logstash now has a ruby filter that makes this much simpler. Put the following in a file somewhere:
# encoding: utf-8

def register(params)
  @seq = 1
end

def filter(event)
  event.set("seq", @seq)
  @seq += 1
  return [event]
end
And then configure it like this in your logstash.conf (substitute in the filename you used):
ruby {
  path => "/usr/local/lib/logstash/seq.rb"
}
It would be pretty easy to make the field name configurable from logstash.conf, but I'll leave that as an exercise for the reader.
I suspect this isn't thread-safe, so I'm running only a single logstash worker.
This is another way to solve the problem; it works for me. Thanks to the previous answer for the note about thread safety. I use the seq field to sort my results in descending order. (Note that %N is the fractional-second part of the current time, so it only orders events within the same second.)
This is my configuration:
logstash.conf
filter {
  ruby {
    code => 'event.set("seq", Time.now.strftime("%N").to_i)'
  }
}
logstash.yml
pipeline.batch.size: 200
pipeline.batch.delay: 60
pipeline.workers: 1
pipeline.output.workers: 1

Compare many text files that contain duplicate "stubs" from the previous and next file and remove duplicate text automatically

I have a large number of text files (1000+) each containing an article from an academic journal. Unfortunately each article's file also contains a "stub" from the end of the previous article (at the beginning) and from the beginning of the next article (at the end).
I need to remove these stubs in preparation for running a frequency analysis on the articles because the stubs constitute duplicate data.
There is no simple field that marks the beginning and end of each article in all cases. However, the duplicate text does seem to be formatted the same and to fall on the same lines in both copies.
A script that compared each file to the next and removed one copy of the duplicate text would be perfect. This seems like it would be a pretty common issue, so I am surprised that I haven't been able to find anything that does it.
The file names sort in order, so a script that compares each file to the next sequentially should work, e.g.
bul_9_5_181.txt
bul_9_5_186.txt
are two articles, one starting on page 181 and the other on page 186. Both of these articles are included below.
There are two volumes of test data located at http://drop.io/fdsayre
Note: I am an academic doing content analysis of old journal articles for a project in the history of psychology. I am no programmer, but I do have 10+ years experience with linux and can usually figure things out as I go.
Thanks for your help
FILENAME: bul_9_5_181.txt
SYN&STHESIA
ISI
the majority of Portugese words signifying black objects or ideas relating to black. This association is, admittedly, no true synsesthesia, but the author believes that it is only a matter of degree between these logical and spontaneous associations and genuine cases of colored audition.
REFERENCES
DOWNEY, JUNE E. A Case of Colored Gustation. Amer. J. of Psycho!., 1911, 22, S28-539MEDEIROS-E-ALBUQUERQUE. Sur un phenomene de synopsie presente par des millions de sujets. / . de psychol. norm, et path., 1911, 8, 147-151. MYERS, C. S. A Case of Synassthesia. Brit. J. of Psychol., 1911, 4, 228-238.
AFFECTIVE PHENOMENA — EXPERIMENTAL
BY PROFESSOR JOHN F. .SHEPARD
University of Michigan
Three articles have appeared from the Leipzig laboratory during the year. Drozynski (2) objects to the use of gustatory and olfactory stimuli in the study of organic reactions with feelings, because of the disturbance of breathing that may be involved. He uses rhythmical auditory stimuli, and finds that when given at different rates and in various groupings, they are accompanied by characteristic feelings in each subject. He records the chest breathing, and curves from a sphygmograph and a water plethysmograph. Each experiment began with a normal record, then the stimulus was given, and this was followed by a contrast stimulus; lastly, another normal was taken. The length and depth of breathing were measured (no time line was recorded), and the relation of length of inspiration to length of expiration was determined. The length and height of the pulsebeats were also measured. Tabular summaries are given of the number of times the author finds each quantity to have been increased or decreased during a reaction period with each type of feeling. The feeling state accompanying a given rhythm is always complex, but the result is referred to that dimension which seemed to be dominant. Only a few disconnected extracts from normal and reaction periods are reproduced from the records. The author states that excitement gives increase in the rate and depth of breathing, in the inspiration-expiration ratio, and in the rate and size of pulse. There are undulations in the arm volume. In so far as the effect is quieting, it causes decrease in rate and depth of
182
JOHN F. SHEPARD
breathing, in the inspiration-expiration ratio, and in the pulse rate and size. The arm volume shows a tendency to rise with respiratory waves. Agreeableness shows
It looks like a much simpler solution would actually work.
No one seems to be using the information provided by the filenames. If you do make use of this information, you may not have to do any comparisons between files to identify the area of overlap. Whoever wrote the OCR probably put some thought into this problem.
The last number in the file name tells you what the starting page number for that file is. This page number appears on a line by itself in the file as well. It also looks like this line is preceded and followed by blank lines. Therefore for a given file you should be able to look at the name of the next file in the sequence and determine the page number at which you should start removing text. Since this page number appears in your file just look for a line that contains only this number (preceded and followed by blank lines) and delete that line and everything after. The last file in the sequence can be left alone.
Here's an outline for an algorithm
choose a file; call it: file1
look at the filename of the next file; call it: file2
extract the page number from the filename of file2; call it: pageNumber
scan the contents of file1 until you find a line that contains only pageNumber
make sure this line is preceded and followed by a blank line.
remove this line and everything after
move on to the next file in the sequence
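A rough Python sketch of that outline, as an illustration rather than a tested tool: it assumes the page number is the last underscore-separated field of the file name (as in bul_9_5_181.txt) and that the files are passed in reading order, which the question says their sorted names give.
import re
import sys

def page_number(filename):
    # bul_9_5_181.txt -> "181" (last underscore-separated field)
    return re.search(r'_(\d+)\.txt$', filename).group(1)

def truncate_at_page(filename, next_page):
    with open(filename) as f:
        lines = f.read().splitlines()
    for i, line in enumerate(lines):
        # a line holding only the next file's starting page number,
        # preceded and followed by blank lines
        if (line.strip() == next_page
                and i > 0 and not lines[i - 1].strip()
                and i + 1 < len(lines) and not lines[i + 1].strip()):
            del lines[i:]      # drop that line and everything after it
            break
    with open(filename, 'w') as f:
        f.write('\n'.join(lines) + '\n')

files = sorted(sys.argv[1:])   # e.g. bul_9_5_181.txt bul_9_5_186.txt ...
for current, following in zip(files, files[1:]):
    truncate_at_page(current, page_number(following))
# the last file in the sequence is left alone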
You should probably try something like this (I've now tested it on the sample data you provided):
#!/usr/bin/ruby

class A_splitter
  Title      = /^[A-Z]+[^a-z]*$/
  Byline     = /^BY /
  Number     = /^\d*$/
  Blank_line = /^ *$/

  attr_accessor :recent_lines, :in_references, :source_glob, :destination_path, :seen_in_last_file

  def initialize(src_glob, dst_path=nil)
    @recent_lines = []
    @seen_in_last_file = {}
    @in_references = false
    @source_glob = src_glob
    @destination_path = dst_path
    @destination = STDOUT
    @buffer = []
    split_em
  end

  def split_here
    if destination_path
      @destination.close if @destination
      @destination = nil
    else
      print "------------SPLIT HERE------------\n"
    end
    print recent_lines.shift
    @in_references = false
  end

  def at_page_break
    ((recent_lines[0] =~ Title and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
     (recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
  end

  def print(*args)
    (@destination || @buffer) << args
  end

  def split_em
    Dir.glob(source_glob).sort.each { |filename|
      if destination_path
        @destination.close if @destination
        @destination = File.open(File.join(@destination_path, filename), 'w')
        print @buffer
        @buffer.clear
      end
      in_header = true
      File.foreach(filename) { |line|
        line.gsub!(/\f/, '')
        if in_header and seen_in_last_file[line]
          # skip it
        else
          seen_in_last_file.clear if in_header
          in_header = false
          recent_lines << line
          seen_in_last_file[line] = true
        end
        3.times { recent_lines.shift } if at_page_break
        if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
          split_here
        elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
          split_here
        elsif recent_lines.length > 4
          @in_references ||= recent_lines[0] =~ /^REFERENCES *$/
          print recent_lines.shift
        end
      }
    }
    print recent_lines
    @destination.close if @destination
  end
end

A_splitter.new('bul_*_*_*.txt', 'test_dir')
Basically, it runs through the files in order, and within each file runs through the lines in order, omitting from each file the lines that were present in the preceding file and printing the rest to STDOUT (from which they can be piped), unless a destination directory is specified (called 'test_dir' in the example; see the last line), in which case files are created in the specified directory with the same name as the file which contained the bulk of their contents.
It also removes the page-break sections (journal title, author, and page number).
It does two split tests:
a test on the title/byline pair
a test on the first title-line after a reference section
(it should be obvious how to add tests for additional split-points).
Retained for posterity:
If you don't specify a destination directory it simply puts a split-here line in the output stream at the split point. This should make it easier for testing (you can just less the output) and when you want them in individual files just pipe it to csplit (e.g. with
csplit -f abstracts - '/---SPLIT HERE---/' '{*}'
or something) to cut it up.
Here is the beginning of another possible solution in Perl (it works as-is but could probably be made more sophisticated if needed). It sounds as if all you are concerned about is removing duplicates across the corpus, and you don't really care whether the last part of one article sits in the next article's file, as long as it isn't duplicated anywhere. If so, this solution will strip out the duplicate lines, leaving only one copy of any given line in the set of files as a whole.
You can either run the script in the directory containing the text files with no argument, or specify a file name containing the list of files you want to process in the order you want them processed. I recommend the latter, as your file names (at least in the sample files you provided) do not naturally list in order with simple commands like ls on the command line or glob in the Perl script, so it won't necessarily compare the correct files to one another if it just runs down that list. If you specify the list, you can guarantee that the files are processed in the correct order, and it doesn't take long to set up properly.
The script simply opens two files and makes note of the first three lines of the second file. It then opens a new output file (original file name + '.new') for the first file and writes out all the lines from the first file into it, until it finds the first three lines of the second file. There is an off chance that the last file doesn't contain three lines from the second file, but in all the files I spot-checked that seemed to be the case, because of the journal-name header and page numbers. One line definitely wasn't enough, as the journal title was often the first line and that would cut things off early.
I should also note that the last file in your list will not be processed (i.e. have a new file created based on it), as it will not be changed by this process.
Here's the script:
#!/usr/bin/perl
use strict;

my @files;
my $count = @ARGV;
if ($count > 0) {
    open (IN, "$ARGV[0]");
    @files = <IN>;
    close (IN);
    chomp(@files);    # strip the newlines read from the list file
} else {
    @files = glob "bul_*.txt";
}
$count = @files;
print "Processing $count files.\n";

my $lastFile = "";
foreach (@files) {
    if ($lastFile ne "") {
        print "Processing $_\n";
        open (FILEB, "$_");
        my @fileBLines = <FILEB>;
        close (FILEB);
        my $line0 = $fileBLines[0];
        if ($line0 =~ /\(/ || $line0 =~ /\)/) {
            # escape parentheses so they don't break the regex match below
            $line0 =~ s/\(/\\\(/;
            $line0 =~ s/\)/\\\)/;
        }
        my $line1 = $fileBLines[1];
        my $line2 = $fileBLines[2];
        open (FILEA, "$lastFile");
        my @fileALines = <FILEA>;
        close (FILEA);
        my $newName = "$lastFile.new";
        open (OUT, ">$newName");
        my $i = 0;
        my $done = 0;
        # copy lines from the first file until the first three
        # lines of the second file are found
        while ($done != 1 and $i < @fileALines) {
            if ($fileALines[$i] =~ /$line0/
                && $fileALines[$i+1] eq $line1
                && $fileALines[$i+2] eq $line2) {
                $done = 1;
            } else {
                print OUT $fileALines[$i];
                $i++;
            }
        }
        close (OUT);
    }
    $lastFile = $_;
}
EDIT: Added a check for parentheses in the first line (which goes into the regex duplicate-check later on); if any are found they are escaped, so that they don't break the duplicate check.
You have a nontrivial problem. It is easy to write code to find the duplicate text at the end of file 1 and the beginning of file 2, but you don't want to delete the duplicate text: you want to split it where the second article begins. Getting the split right might be tricky; one marker is the all-caps title, another is the "BY" at the start of the next line.
It would have helped to have examples from consecutive files, but the script below works on one test case. Before trying this code, back up all your files. The code overwrites existing files.
The implementation is in Lua.
The algorithm is roughly:
Ignore blank lines at the end of file 1 and the start of file 2.
Find a long sequence of lines common to end of file 1 and start of file 2.
This works by trying a sequence of 40 lines, then 39, and so on
Remove sequence from both files and call it overlap.
Split overlap at title
Append first part of overlap to file1; prepend second part to file2.
Overwrite contents of files with lists of lines.
Here's the code:
#!/usr/bin/env lua

local ext = arg[1] == '-xxx' and '.xxx' or ''
if #ext > 0 then table.remove(arg, 1) end

local function lines(filename)
  local l = { }
  for line in io.lines(filename) do table.insert(l, (line:gsub('', ''))) end
  assert(#l > 0, "No lines in file " .. filename)
  return l
end

local function write_lines(filename, lines)
  local f = assert(io.open(filename .. ext, 'w'))
  for i = 1, #lines do
    f:write(lines[i], '\n')
  end
  f:close()
end

local function lines_match(line1, line2)
  io.stderr:write(string.format("%q ==? %q\n", line1, line2))
  return line1 == line2 -- could do an approximate match here
end

local function lines_overlap(l1, l2, k)
  if k > #l2 or k > #l1 then return false end
  io.stderr:write('*** k = ', k, '\n')
  for i = 1, k do
    if not lines_match(l2[i], l1[#l1 - k + i]) then
      if i > 1 then
        io.stderr:write('After ', i-1, ' matches: FAILED <====\n')
      end
      return false
    end
  end
  return true
end

function find_overlaps(fname1, fname2)
  local l1, l2 = lines(fname1), lines(fname2)
  -- strip trailing and leading blank lines
  while l1[#l1]:find '^[%s]*$' do table.remove(l1) end
  while l2[1] :find '^[%s]*$' do table.remove(l2, 1) end
  local matchsize -- # of lines at end of file 1 that are equal to the same
                  -- # at the start of file 2
  for k = math.min(40, #l1, #l2), 1, -1 do
    if lines_overlap(l1, l2, k) then
      matchsize = k
      io.stderr:write('Found match of ', k, ' lines\n')
      break
    end
  end
  if matchsize == nil then
    return false -- failed to find an overlap
  else
    local overlap = { }
    for j = 1, matchsize do
      table.remove(l1) -- remove line from first set
      table.insert(overlap, table.remove(l2, 1))
    end
    return l1, overlap, l2
  end
end

local function split_overlap(l)
  for i = 1, #l-1 do
    if l[i]:match '%u' and not l[i]:match '%l' then -- has caps but no lowers
      -- io.stderr:write('Looking for byline following ', l[i], '\n')
      if l[i+1]:match '^%s*BY%s' then
        local first = {}
        for j = 1, i-1 do
          table.insert(first, table.remove(l, 1))
        end
        -- io.stderr:write('Split with first line at ', l[1], '\n')
        return first, l
      end
    end
  end
end

local function strip_overlaps(filename1, filename2)
  local l1, overlap, l2 = find_overlaps(filename1, filename2)
  if not l1 then
    io.stderr:write('No overlap in ', filename1, ' and ', filename2, '\n')
Are the stubs identical to the end of the previous file? Or different line endings/OCR mistakes?
Is there a way to discern an article's beginning? Maybe an indented abstract? Then you could go through each file and discard everything before the first and after (including) the second title.
Are the titles & author always on a single line? And does that line always contain the word "BY" in uppercase? If so, you can probably do a fair job withn awk, using those criteria as the begin/end marker.
Edit: I really don't think that using diff is going to work as it is a tool for comparing broadly similar files. Your files are (from diff's point of view) actually completely different - I think it will get out of sync immediately. But then, I'm not a diff guru :-)
A quick stab at it, assuming that the stub is strictly identical in both files:
#!/usr/bin/perl
use strict;
use List::MoreUtils qw/ indexes all pairwise /;
my #files = #ARGV;
my #previous_text;
for my $filename ( #files ) {
open my $in_fh, '<', $filename or die;
open my $out_fh, '>', $filename.'.clean' or die;
my #lines = <$in_fh>;
print $out_fh destub( \#previous_text, #lines );
#previous_text = #lines;
}
sub destub {
my #previous = #{ shift() };
my #lines = #_;
my #potential_stubs = indexes { $_ eq $lines[0] } #previous;
for my $i ( #potential_stubs ) {
# check if the two documents overlap for that index
my #p = #previous[ $i.. $#previous ];
my #l = #lines[ 0..$#previous-$i ];
return #lines[ $#previous-$i + 1 .. $#lines ]
if all { $_ } pairwise { $a eq $b } #p, #l;
}
# no stub detected
return #lines;
}
