Read function coefficients from file - gnuplot

Say I have the following file:
1,2,3
4,5,6
7,8,9
I would like to have gnuplot plot 3 polynomials of the form ax^2 + bx +c, using the file values as coefficients. Performing this directly, I would do:
plot x**2+2*x+3, 4*x**2+5*x+6, 7*x**2+8*x+9
But I would like to do this programmatically, for an arbitrary number of lines in the input file.
I think I might be close with code inspired by this answer:
n= "`awk 'END {print NR}' < test.dat`"
i=0
while i<n{
f(a,b,c,x)=a*x**2+b*x+c
plot 'test.dat' every ::i::i using f($1,$2,$3,x)
}
But this fails with undefined variable: x

I've had the same problem for ages and finally I managed to find a solution.
It requires a creative use of the stats command to assign the values from the file to variables, plus a couple of do for [...] loops that write some commands to a temp file:
filename = "InputFileName.dat"
stats filename nooutput
nlines = STATS_records-1
set print "temp.gnuplot"
do for [i=0:nlines] {\
print sprintf("stats filename every ::%i::%i using (a%i=$1,b%i=$2,c%i=$3,0):(0) nooutput",i,i,i,i,i)}
set print
load "temp.gnuplot"
set print "temp.gnuplot"
do for [i=1:nlines] {print sprintf("replot a%i*x**2 + b%i*x + c%i",i,i,i)}
plot a0*x**2 + b0*x +c0
load "temp.gnuplot"
set print
Change the filename variable as you need. You can remove the temp file once you are finished with a system call like this ! rm temp.gnuplot.
I know it is not particularly elegant, and I especially don't like the use of a temp file; if somebody knows how to execute a string variable, I'd be happy to know. But hey, it works (even under Windows)!
Edit: forgot to credit thse for the input.

It's an old thread, but thanks to the previous answer I can propose an upgraded version using strings instead of a temp file. That way, there is no need to bother deleting the useless temp file afterwards.
filename = "InputFileName.dat"
stats filename nooutput
nlines = STATS_records-1
paramstr(N) = sprintf("stats filename every ::%i::%i using (a%i=$1,b%i=$2,c%i=$3,0):(0) nooutput",N,N,N,N,N)
do for [i=0:nlines] {
eval(paramstr(i))
}
plotstr = "p "
do for [i=0:nlines] {
plotstr = plotstr . sprintf("a%i*x**2 + b%i*x + c%i%s",i,i,i,(i == nlines) ? "" : ", ")
}
eval(plotstr)
Some set xrange and set yrange tuning may be necessary to focus on the region you are interested in, though.
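For comparison, the string-building step can also be done outside gnuplot. The following is only a minimal Python sketch, assuming a comma-separated file with exactly three coefficients per line; the helper name build_plot_command is hypothetical, and the result is simply the text of a gnuplot plot command.

```python
def build_plot_command(rows):
    """Return a gnuplot 'plot' command with one quadratic per 'a,b,c' row."""
    terms = []
    for line in rows:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        a, b, c = (float(value) for value in line.split(","))
        terms.append("%g*x**2 + %g*x + %g" % (a, b, c))
    return "plot " + ", ".join(terms)


# The three rows from the question's example file.
print(build_plot_command(["1,2,3", "4,5,6", "7,8,9"]))
```

The printed line could then be pasted into gnuplot or written to a script file that gnuplot loads.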

This question is similar to Plotting a function directly from a text file, however, with more parameters and in several lines.
At the time of OP's question there were no arrays in gnuplot, but you can simply store your values in a string and address them via some functions using word() and real() (check help word and help real).
There is no need for awk (as in the link), no temporary files (solution of @Joseph D'Arimathea) and no extra loops or eval() (solution of @afagot).
Data: SO26680694.dat
1,2,3
4,5,6
7,8,9
Script: (works with gnuplot 4.6.0, March 2012)
### extract multiple parameters from a datafile
reset
FILE = "SO26680694.dat"
set datafile separator ','
myParams = ''
stats FILE u (myParams = myParams.sprintf(' %g %g %g', $1, $2, $3),0) nooutput
myParam(s,i) = real(word(myParams,(s-1)*3+i))
a(s) = myParam(s,1)
b(s) = myParam(s,2)
c(s) = myParam(s,3)
f(x,a,b,c) = a*x**2 + b*x + c
plot for [i=1:3] f(x,a(i),b(i),c(i)) title sprintf("%gx^2 + %gx + %g",a(i),b(i),c(i))
### end of script
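The index arithmetic in myParam(s,i) is the core of this approach: all values sit in one flat, space-separated string, and set s, parameter i is word number (s-1)*3+i. A small Python sketch of the same lookup, written only for illustration (the function name my_param and the per_set argument are assumptions, not part of the gnuplot script):

```python
def my_param(params, s, i, per_set=3):
    """Return parameter i (1-based) of set s (1-based) from a flat string."""
    words = params.split()
    # gnuplot's word() is 1-based, Python lists are 0-based, hence the -1.
    return float(words[(s - 1) * per_set + i - 1])


params = " 1 2 3 4 5 6 7 8 9"   # what the stats line accumulates for the example file
print(my_param(params, 2, 1))   # coefficient a of the second polynomial
```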

Related

How to plot all data files in the directory with gnuplot?

How to plot all data files in the directory with gnuplot? I mean that from each data file a figure will be created. Data files have different names. I tried:
j=0;do for [i in system("ls")] { j=j+1; set term png; set output ''.i.'.png' ; p i u 1:2 w lines lc rgb "navy" t ''.i }
which led to the error: x range is invalid
plot for [fn in system("ls")] fn with lines title ''.i
which led to the error: internal error : STRING operator applied to undefined or non-STRING variable
This plots all data in one figure:
a=system('a=`tempfile`;cat *.dat > $a;echo "$a"')
plot a u 1:2
I expected this to be a frequently asked question with well documented answers, however, I can't find a suitable example right away...
Edit:
You can do it in a "platform-independent" way using gnuplot's variable GPVAL_SYSNAME, which holds the name of the operating system.
Furthermore, the user variables DIR for the directory and EXT for file extension are used. If the output files should be in a different directory, e.g. DIR_IN and DIR_OUT could be defined instead of DIR.
Script:
### plot all datafiles in a directory
reset session
set term pngcairo
DIR = 'Test/' # directory; use '' for current directory
EXT = '.dat' # file extension
FILES = GPVAL_SYSNAME[1:7] eq 'Windows' ? system(sprintf('dir /B %s*%s',DIR,EXT)) : \
system(sprintf('ls %s*%s',DIR,EXT)) # Linux/MacOS
myInput(s) = sprintf('%s%s',DIR,s)
myOutput(s) = sprintf('%s%s.png',DIR,s[1:strlen(s)-strlen(EXT)]) # replace file extension with .png
do for [FILE in FILES] {
set output myOutput(FILE)
plot myInput(FILE) u 1:2 w lines lc "red" title FILE
}
set output
### end of script
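The file-name handling in myOutput() (strip the extension, append .png) is the part that is easiest to get wrong. As an illustration only, here is a Python sketch of the same bookkeeping using pathlib; the names png_name and plot_jobs are hypothetical, and the sketch only pairs input and output names, it does not plot anything.

```python
from pathlib import Path


def png_name(data_file, ext=".dat"):
    """Return the output path: same directory and stem, extension swapped for .png."""
    path = Path(data_file)
    if path.suffix != ext:
        raise ValueError("%s does not end in %s" % (path, ext))
    return path.with_suffix(".png")


def plot_jobs(directory, ext=".dat"):
    """Pair every data file in `directory` with the PNG it should produce."""
    return [(p, png_name(p, ext)) for p in sorted(Path(directory).glob("*" + ext))]


print(png_name("Test/run1.dat"))
```

Because pathlib handles the path separator itself, no operating-system check is needed in this variant.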

How to use grep command in a loop in gnuscript

sorry, I am asking multiple questions.
I have a case.dat file with multiple columns, and I want to extract subsets of the data according to column 2 within the gnuplot script.
I tried the script below, but it gives me an error:
array=""
do for [i=300:800:100] { # I mean start:end:increment. so it should read 300, 400, 500, 600, 700, 800 here
val ="grep i case.dat" # Want to store the command in a value/variable
print val > i.dat #Here I need to store the data in i.dat
}
error
line 45: undefined variable: dat
my bash script is like below
#!/bin/bash
case="data"
for i in `seq 100 100 800`
do
awk '$2=='$i'{print $0}' $case.dat > $i.dat
done
I want to run this at the start of the gnuplot script so that the rest of the script can work with the resulting files.
gnuplot script:
do for [i=300:800:100] {
outfile = sprintf("%d.dat", i)
command = sprintf("grep %d case.dat",i)
set print outfile
print system(command)
unset print
}
This will create separate files 300.dat 400.dat 500.dat and so on.
If you want to keep these data subsets entirely internal to gnuplot, i.e. not create any new files, you could instead create named datablocks $data_300 $data_400 etc:
do for [i=300:800:100] {
eval( sprintf("set print $data_%3d", i) )
print system( sprintf("grep %d case.dat", i) )
unset print
}
Named datablocks can in general be used anywhere you could use a file name, e.g.
plot $data_500 with lines.
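Note that grep matches the number anywhere in a line, whereas the awk command in the question compares column 2 exactly. If the exact column test matters, the split can be sketched in Python as follows. This is a hedged illustration, assuming whitespace-separated columns; the function name split_by_column and its defaults are hypothetical.

```python
from collections import defaultdict


def split_by_column(lines, column=2, wanted=range(300, 900, 100)):
    """Group lines by the numeric value found in the given 1-based column."""
    targets = set(wanted)
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < column:
            continue  # too few columns
        try:
            key = float(fields[column - 1])
        except ValueError:
            continue  # header or non-numeric entry
        if key in targets:
            groups[int(key)].append(line.rstrip("\n"))
    return dict(groups)


rows = ["1 300 a", "2 400 b", "3 300 c", "4 1300 d"]
print(split_by_column(rows))
```

Each group could then be written to its own file (300.dat, 400.dat, ...) before gnuplot is started.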

History of previously opened m-files in MATLAB

Is there any way to find the history of previously opened m-files in MATLAB R2014b from 2 or 3 months ago (a list of file names and paths)?
Matlab R2014b stores its recent files in:
%APPDATA%\MathWorks\MATLAB\R2014b\MATLAB_Editor_State.xml
It's a .xml file so it's easy to load and parse with xmlread. I'm not very familiar with xml parsing syntax, but here is how to get information about files (to be adapted to your needs of course):
function [recentFiles] = GetRecentFiles()
%[
% Opens editor's state file
filepart = sprintf('MathWorks\\MATLAB\\R%s\\%s', version('-release'), 'MATLAB_Editor_State.xml');
filename = fullfile(getenv('APPDATA'), filepart);
document = xmlread(filename);
% Get information about 'File' nodes
recentFiles = struct([]);
fileNodes = document.getElementsByTagName('File');
for fni = 1:(fileNodes.getLength())
attributes = fileNodes.item(fni-1).getAttributes(); % Careful, zero based indexing !
for ai = 1:(attributes.getLength())
% Get node attribute
name = char(attributes.item(ai-1).getName()); % Zero based + need marshaling COM 'string' type
value = char(attributes.item(ai-1).getValue()); % Zero based + need marshaling COM 'string' type
% Save in structure
name(1) = upper(name(1)); % Just because I prefer capital letter for field names ...
recentFiles(fni).(name) = value;
end
end
%]
end
This returns a structure like this:
recentFiles =
1x43 struct array with fields:
AbsPath
LastWrittenTime
Name
NB: I've tried to type in matlab command window matlab.desktop.editor.*, but seems there's nothing regarding recent files (anyway there are a lot of interesting things to manipulate the editor from the command line)
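For readers who want to inspect the state file outside MATLAB, the same "collect the attributes of every File node" idea can be sketched in Python. This is only an illustration: the sample XML below is made up, its root element name is an assumption, and the attribute names (absPath, name, lastWrittenTime) are inferred from the struct fields shown above rather than taken from a real file.

```python
import xml.etree.ElementTree as ET


def recent_files(xml_text):
    """Return one dict of attributes per <File> element in the document."""
    root = ET.fromstring(xml_text)
    return [dict(node.attrib) for node in root.iter("File")]


# Made-up sample; a real MATLAB_Editor_State.xml may be laid out differently.
sample = """<EditorState>
  <File absPath="C:\\work" name="demo.m" lastWrittenTime="2014-11-03T10:00:00"/>
  <File absPath="C:\\work\\lib" name="util.m" lastWrittenTime="2014-10-21T09:30:00"/>
</EditorState>"""

for entry in recent_files(sample):
    print(entry["absPath"], entry["name"])
```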
The last answer was really helpful. I've just modified it to read and open the recent tab files. This works on Matlab R2013a:
function [recentFiles] = recover_tabs()
%[
% Opens editor's state file
filepart = sprintf('MathWorks\\MATLAB\\R%s\\%s', version('-release'), 'MATLAB_Editor_State.xml');
filename = fullfile(getenv('APPDATA'), filepart);
document = xmlread(filename);
% Get information about 'File' nodes
recentFiles = struct([]);
fileNodes = document.getElementsByTagName('File');
for fni = 1:(fileNodes.getLength())
attributes = fileNodes.item(fni-1).getAttributes(); % Careful, zero based indexing !
for ai = 1:(attributes.getLength())
% Get node attribute
name = char(attributes.item(ai-1).getName()); % Zero based + need marshaling COM 'string' type
value = char(attributes.item(ai-1).getValue()); % Zero based + need marshaling COM 'string' type
% Save in structure
name(1) = upper(name(1)); % Just because I prefer capital letter for field names ...
recentFiles(fni).(name) = value;
end
end
% loop to access files in the tab history
for j=1:length(recentFiles)
arquivo = [recentFiles(j).AbsPath '\' recentFiles(j).Name];
% if exists, then open
if exist(arquivo, 'file') == 2
open(arquivo);
end
end
%]
end
Based on the answer by CitizenInsane, but for any Matlab version.
To find the .xml file in any Matlab version, use prefdir:
>> prefdir
ans = '/Users/user/Library/Application Support/MathWorks/MATLAB/R2018a'
MATLAB_Editor_State.xml will be stored there. Therefore the function would be:
function [recentFiles] = GetRecentFiles()
% Opens editor's state file
% prefdir already returns the full preferences folder, so APPDATA must not be prepended
filename = fullfile(prefdir, 'MATLAB_Editor_State.xml');
document = xmlread(filename);
% Get information about 'File' nodes
recentFiles = struct([]);
fileNodes = document.getElementsByTagName('File');
for fni = 1:(fileNodes.getLength())
attributes = fileNodes.item(fni-1).getAttributes(); % Careful, zero based indexing !
for ai = 1:(attributes.getLength())
% Get node attribute
name = char(attributes.item(ai-1).getName()); % Zero based + need marshaling COM 'string' type
value = char(attributes.item(ai-1).getValue()); % Zero based + need marshaling COM 'string' type
% Save in structure
name(1) = upper(name(1)); % Just because I prefer capital letter for field names ...
recentFiles(fni).(name) = value;
end
end
In R2018b you can increase the Most recently used file list in Preferences > Editor/Debugger. I like the methods above, but they do not work if you're working across machines (e.g., using Github). I coded a solution that uses the modified file date from the machine, instead of relying on MATLAB itself.

Xlsx or csv retrieve in Matlab

I have point data in an .xlsx file and I want to read and store it in an array p in Matlab. The points are 3D coordinates (x, y, z), so there are three columns and an unknown number of rows. How can I read them from the .xlsx or .csv file quickly? When I tried reading the .xlsx file, it was slow and returned an empty array. Possibly I could store them in transposed form and transpose them back.
My Code: .Xls read
A = xlsread('data.xlsx')
Output:
A =
[]
My Code: .CSV read
M = csvread('data.csv')
Output:
Error using dlmread (line 139)
Mismatch between file and format string.
Trouble reading number from file (row 2u, field 1u) ==> ;\n
Error in csvread (line 48)
m=dlmread(filename, ',', r, c);
Points Set 1:
-191.2442 187.7193 1.0000;
-155.2995 152.6316 2.0000;
-182.0276 104.6784 3.0000;
-148.8479 84.7953 4.0000;
Points Set 2:
-142.3963 83.6257 5.0000;
-102.7650 133.9181 6.0000;
-56.6820 164.3275 7.0000;
-30.8756 124.5614 8.0000;
-23.5023 118.7135 7.0000;
-9.6774 110.5263 6.0000;
26.2673 90.6433 5.0000;
-42.8571 -6.4327 4.0000;
10.5991 7.6023 3.0000;
Points Set 3:
-73.2719 84.7953 9.0000;
-137.7880 15.7895 10.0000;
-92.6267 -30.9942 9.0000;
-42.8571 19.2982 8.0000;
41.0138 -15.7895 4.0000;
71.4286 -41.5205 6.0000;
90.7834 -14.6199 5.0000;
See if this slightly twisted one using importdata works for you -
C1 = importdata(file1) %%// file1 is your CSV filename
t1 = regexp(C1,'\s','Split')
t2 = horzcat(t1{:})
t2 = strrep(t2,';','')
M = cellfun(@str2num,reshape(t2(~strcmp(t2,'')),3,[])')
Edit 1: This case assumes you have a CSV file that has all the Point Sets clustered together but one by one (without spaces between the Point Sets and their data and also between the end of a Point Set and the declaration of the arrival of the next Point Set).
So, the input CSV file would look like this for the given data in the question -
Points Set 1:
-191.2442 187.7193 1.0000;
-155.2995 152.6316 2.0000;
-182.0276 104.6784 3.0000;
-148.8479 84.7953 4.0000;
Points Set 2:
-142.3963 83.6257 5.0000;
-102.7650 133.9181 6.0000;
-56.6820 164.3275 7.0000;
-30.8756 124.5614 8.0000;
-23.5023 118.7135 7.0000;
-9.6774 110.5263 6.0000;
26.2673 90.6433 5.0000;
-42.8571 -6.4327 4.0000;
10.5991 7.6023 3.0000;
Points Set 3:
-73.2719 84.7953 9.0000;
-137.7880 15.7895 10.0000;
-92.6267 -30.9942 9.0000;
-42.8571 19.2982 8.0000;
41.0138 -15.7895 4.0000;
71.4286 -41.5205 6.0000;
90.7834 -14.6199 5.0000;
Please note that the result of the code is a struct array (one element per Point Set).
Code
C1 = importdata(file1) %%// file1 is your CSV filename
ind1 = cellfun(@isempty,strfind(C1,'Points'))
start_ind = find(~ind1)+1
s1 = find(~ind1)-1;
stop_ind = [s1(2:end) ; numel(ind1)]
for k = 1:numel(start_ind)
data1 = C1(start_ind(k):stop_ind(k))
t1 = regexp(data1,'\s','Split')
t2 = strrep(horzcat(t1{:}),';','')
t2 = t2(~strcmp(t2,''))
array(k).data = cellfun(@str2num,reshape(t2,3,[])'); %%//'
end
You are using the row separator ;\n (semicolon + newline). I don't know of any function that understands this format directly. Using textscan is probably the best option:
fid=fopen(...)
M=cell2mat(textscan(fid,'%f,%f,%f;\n')); % textscan reads from the file identifier
fclose(fid);
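To make the parsing steps from the answers above concrete (skip the "Points Set N:" headers, drop the trailing semicolon, convert three numbers per row), here is a minimal Python sketch. It assumes the layout shown in the question; the function name parse_point_sets is hypothetical, and it accepts either spaces or commas between the numbers.

```python
def parse_point_sets(lines):
    """Return a list of point sets, each a list of [x, y, z] rows."""
    sets = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("Points Set"):
            sets.append([])  # a header line opens a new set
            continue
        if not sets:
            sets.append([])  # data before any header goes into a first set
        values = line.rstrip(";").replace(",", " ").split()
        sets[-1].append([float(v) for v in values])
    return sets


text = [
    "Points Set 1:",
    "-191.2442 187.7193 1.0000;",
    "-155.2995 152.6316 2.0000;",
    "Points Set 2:",
    "-142.3963 83.6257 5.0000;",
]
print(parse_point_sets(text))
```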

Compare many text files that contain duplicate "stubs" from the previous and next file and remove duplicate text automatically

I have a large number of text files (1000+) each containing an article from an academic journal. Unfortunately each article's file also contains a "stub" from the end of the previous article (at the beginning) and from the beginning of the next article (at the end).
I need to remove these stubs in preparation for running a frequency analysis on the articles because the stubs constitute duplicate data.
There is no simple field that marks the beginning and end of each article in all cases. However, the duplicate text does seem to be formatted the same and on the same line in both cases.
A script that compared each file to the next file and then removed 1 copy of the duplicate text would be perfect. This seems like it would be a pretty common issue when programming so I am surprised that I haven't been able to find anything that does this.
The file names sort in order, so a script that compares each file to the next sequentially should work. E.G.
bul_9_5_181.txt
bul_9_5_186.txt
are two articles, one starting on page 181 and the other on page 186. Both of these articles are included below.
There are two volumes of test data located at http://drop.io/fdsayre
Note: I am an academic doing content analysis of old journal articles for a project in the history of psychology. I am no programmer, but I do have 10+ years experience with linux and can usually figure things out as I go.
Thanks for your help
FILENAME: bul_9_5_181.txt
SYN&STHESIA
ISI
the majority of Portugese words signifying black objects or ideas relating to black. This association is, admittedly, no true synsesthesia, but the author believes that it is only a matter of degree between these logical and spontaneous associations and genuine cases of colored audition.
REFERENCES
DOWNEY, JUNE E. A Case of Colored Gustation. Amer. J. of Psycho!., 1911, 22, S28-539MEDEIROS-E-ALBUQUERQUE. Sur un phenomene de synopsie presente par des millions de sujets. / . de psychol. norm, et path., 1911, 8, 147-151. MYERS, C. S. A Case of Synassthesia. Brit. J. of Psychol., 1911, 4, 228-238.
AFFECTIVE PHENOMENA — EXPERIMENTAL
BY PROFESSOR JOHN F. .SHEPARD
University of Michigan
Three articles have appeared from the Leipzig laboratory during the year. Drozynski (2) objects to the use of gustatory and olfactory stimuli in the study of organic reactions with feelings, because of the disturbance of breathing that may be involved. He uses rhythmical auditory stimuli, and finds that when given at different rates and in various groupings, they are accompanied by characteristic feelings in each subject. He records the chest breathing, and curves from a sphygmograph and a water plethysmograph. Each experiment began with a normal record, then the stimulus was given, and this was followed by a contrast stimulus; lastly, another normal was taken. The length and depth of breathing were measured (no time line was recorded), and the relation of length of inspiration to length of expiration was determined. The length and height of the pulsebeats were also measured. Tabular summaries are given of the number of times the author finds each quantity to have been increased or decreased during a reaction period with each type of feeling. The feeling state accompanying a given rhythm is always complex, but the result is referred to that dimension which seemed to be dominant. Only a few disconnected extracts from normal and reaction periods are reproduced from the records. The author states that excitement gives increase in the rate and depth of breathing, in the inspiration-expiration ratio, and in the rate and size of pulse. There are undulations in the arm volume. In so far as the effect is quieting, it causes decrease in rate and depth of
182
JOHN F. SHEPARD
breathing, in the inspiration-expiration ratio, and in the pulse rate and size. The arm volume shows a tendency to rise with respiratory waves. Agreeableness shows
It looks like a much simpler solution would actually work.
No one seems to be using the information provided by the filenames. If you do make use of this information, you may not have to do any comparisons between files to identify the area of overlap. Whoever wrote the OCR probably put some thought into this problem.
The last number in the file name tells you what the starting page number for that file is. This page number appears on a line by itself in the file as well. It also looks like this line is preceded and followed by blank lines. Therefore for a given file you should be able to look at the name of the next file in the sequence and determine the page number at which you should start removing text. Since this page number appears in your file just look for a line that contains only this number (preceded and followed by blank lines) and delete that line and everything after. The last file in the sequence can be left alone.
Here's an outline for an algorithm
choose a file; call it: file1
look at the filename of the next file; call it: file2
extract the page number from the filename of file2; call it: pageNumber
scan the contents of file1 until you find a line that contains only pageNumber
make sure this line is preceded and followed by a blank line.
remove this line and everything after
move on to the next file in the sequence
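The outline above can be sketched as follows. This is a hedged Python illustration, not a tested solution for the real corpus: it assumes file names of the form bul_9_5_186.txt where the last number is the starting page, treats the start or end of a file as a blank line, and uses hypothetical helper names.

```python
import re


def page_number_from_name(filename):
    """Extract the trailing page number from a name like bul_9_5_186.txt."""
    match = re.search(r"_(\d+)\.txt$", filename)
    if match is None:
        raise ValueError("no page number in %r" % filename)
    return match.group(1)


def cut_at_next_page(lines, next_filename):
    """Drop the page-number line of the next article and everything after it."""
    page = page_number_from_name(next_filename)
    for i, line in enumerate(lines):
        if line.strip() != page:
            continue
        before_blank = i == 0 or lines[i - 1].strip() == ""
        after_blank = i == len(lines) - 1 or lines[i + 1].strip() == ""
        if before_blank and after_blank:
            return lines[:i]
    return lines  # marker not found: leave the file untouched


sample = ["end of article", "", "186", "", "NEXT TITLE"]
print(cut_at_next_page(sample, "bul_9_5_186.txt"))
```

Leaving the file unchanged when no marker is found keeps the sketch on the safe side; such files would need a manual look.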
You should probably try something like this (I've now tested it on the sample data you provided):
#!/usr/bin/ruby
class A_splitter
Title = /^[A-Z]+[^a-z]*$/
Byline = /^BY /
Number = /^\d*$/
Blank_line = /^ *$/
attr_accessor :recent_lines,:in_references,:source_glob,:destination_path,:seen_in_last_file
def initialize(src_glob,dst_path=nil)
@recent_lines = []
@seen_in_last_file = {}
@in_references = false
@source_glob = src_glob
@destination_path = dst_path
@destination = STDOUT
@buffer = []
split_em
end
def split_here
if destination_path
@destination.close if @destination
@destination = nil
else
print "------------SPLIT HERE------------\n"
end
print recent_lines.shift
@in_references = false
end
def at_page_break
((recent_lines[0] =~ Title and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
(recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
end
def print(*args)
(@destination || @buffer) << args
end
def split_em
Dir.glob(source_glob).sort.each { |filename|
if destination_path
@destination.close if @destination
@destination = File.open(File.join(@destination_path,filename),'w')
print @buffer
@buffer.clear
end
in_header = true
File.foreach(filename) { |line|
line.gsub!(/\f/,'')
if in_header and seen_in_last_file[line]
#skip it
else
seen_in_last_file.clear if in_header
in_header = false
recent_lines << line
seen_in_last_file[line] = true
end
3.times {recent_lines.shift} if at_page_break
if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
split_here
elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
split_here
elsif recent_lines.length > 4
@in_references ||= recent_lines[0] =~ /^REFERENCES *$/
print recent_lines.shift
end
}
}
print recent_lines
@destination.close if @destination
end
end
A_splitter.new('bul_*_*_*.txt','test_dir')
Basically, it runs through the files in order and, within each file, through the lines in order, omitting from each file the lines that were present in the preceding file. The rest is printed to STDOUT (from which it can be piped) unless a destination directory is specified (called 'test_dir' in the example; see the last line), in which case files are created in that directory with the same name as the file that contained the bulk of their contents.
It also removes the page-break sections (journal title, author, and page number).
It does two split tests:
a test on the title/byline pair
a test on the first title-line after a reference section
(it should be obvious how to add tests for additional split-points).
Retained for posterity:
If you don't specify a destination directory it simply puts a split-here line in the output stream at the split point. This should make it easier for testing (you can just less the output) and when you want them in individual files just pipe it to csplit (e.g. with
csplit -f abstracts - '---SPLIT HERE---' '{*}'
or something) to cut it up.
Here is the beginning of another possible solution in Perl (it works as is but could probably be made more sophisticated if needed). It sounds as if all you are concerned about is removing duplicates across the corpus and don't really care if the last part of one article is in the file for the next one as long as it isn't duplicated anywhere. If so, this solution will strip out the duplicate lines leaving only one copy of any given line in the set of files as a whole.
You can either just run the file in the directory containing the text files with no argument or alternately specify a file name containing the list of files you want to process in the order you want them processed. I recommend the latter as your file names (at least in the sample files you provided) do not naturally list out in order when using simple commands like ls on the command line or glob in the Perl script. Thus it won't necessarily compare the correct files to one another as it just runs down the list (entered or generated by the glob command). If you specify the list, you can guarantee that they will be processed in the correct order and it doesn't take that long to set it up properly.
The script simply opens two files and makes note of the first three lines of the second file. It then opens a new output file (original file name + '.new') for the first file and writes out all the lines from the first file into the new output file until it finds the first three lines of the second file. There is an off chance that there are not three lines from the second file in the last one but in all the files I spot checked that seemed to be the case because of the journal name header and page numbers. One line definitely wasn't enough as the journal title was often the first line and that would cut things off early.
I should also note that the last file in your list of files entered will not be processed (i.e. have a new file created based off of it) as it will not be changed by this process.
Here's the script:
#!/usr/bin/perl
use strict;
my @files;
my $count = @ARGV;
if ($count>0){
open (IN, "$ARGV[0]");
@files = <IN>;
close (IN);
chomp(@files); # strip newlines from the listed file names
} else {
@files = glob "bul_*.txt";
}
$count = @files;
print "Processing $count files.\n";
my $lastFile="";
foreach(@files){
if ($lastFile ne ""){
print "Processing $_\n";
open (FILEB,"$_");
my @fileBLines = <FILEB>;
close (FILEB);
my $line0 = $fileBLines[0];
if ($line0 =~ /\(/ || $line0 =~ /\)/){
$line0 =~ s/\(/\\\(/;
$line0 =~ s/\)/\\\)/;
}
my $line1 = $fileBLines[1];
my $line2 = $fileBLines[2];
open (FILEA,"$lastFile");
my @fileALines = <FILEA>;
close (FILEA);
my $newName = "$lastFile.new";
open (OUT, ">$newName");
my $i=0;
my $done = 0;
while ($done != 1 and $i < @fileALines){
# lines 2 and 3 are compared as strings (eq), not numerically (==)
if ($fileALines[$i] =~ /$line0/
&& $fileALines[$i+1] eq $line1
&& $fileALines[$i+2] eq $line2) {
$done=1;
} else {
print OUT $fileALines[$i];
$i++;
}
}
close (OUT);
}
$lastFile = $_;
}
EDIT: Added a check for parentheses in the first line, which goes into the regex-based duplicate check later on; if any are found, they are escaped so that they don't break that check.
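The three-line marker idea described above can be sketched compactly in Python. This is an illustration under stated assumptions rather than a port of the Perl script: it compares lines with plain string equality instead of a regex, works on lists of lines already read from the two files, and the function name trim_before_marker is hypothetical.

```python
def trim_before_marker(lines_a, lines_b, marker_len=3):
    """Return lines_a up to (not including) the first lines of lines_b."""
    marker = lines_b[:marker_len]
    if len(marker) < marker_len:
        return list(lines_a)  # second file too short to form a marker
    for i in range(len(lines_a) - marker_len + 1):
        if lines_a[i:i + marker_len] == marker:
            return lines_a[:i]
    return list(lines_a)  # marker not found: keep everything


first = ["x", "y", "TITLE", "", "BY SOMEONE", "tail"]
second = ["TITLE", "", "BY SOMEONE", "body"]
print(trim_before_marker(first, second))
```

Using several lines as the marker is what protects against the journal title alone cutting the first file too early.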
You have a nontrivial problem. It is easy to write code to find the duplicate text at the end of file 1 and the beginning of file 2. But you don't want to delete the duplicate text---you want to split it where the second article begins. Getting the split right might be tricky---one marker is the all caps, another is the BY at the start of the next line.
It would have helped to have examples from consecutive files, but the script below works on one test case. Before trying this code, back up all your files. The code overwrites existing files.
The implementation is in Lua.
The algorithm is roughly:
Ignore blank lines at the end of file 1 and the start of file 2.
Find a long sequence of lines common to end of file 1 and start of file 2.
This works by trying a sequence of 40 lines, then 39, and so on
Remove sequence from both files and call it overlap.
Split overlap at title
Append first part of overlap to file1; prepend second part to file2.
Overwrite contents of files with lists of lines.
Here's the code:
#!/usr/bin/env lua
local ext = arg[1] == '-xxx' and '.xxx' or ''
if #ext > 0 then table.remove(arg, 1) end
local function lines(filename)
local l = { }
for line in io.lines(filename) do table.insert(l, (line:gsub('', ''))) end
assert(#l > 0, "No lines in file " .. filename)
return l
end
local function write_lines(filename, lines)
local f = assert(io.open(filename .. ext, 'w'))
for i = 1, #lines do
f:write(lines[i], '\n')
end
f:close()
end
local function lines_match(line1, line2)
io.stderr:write(string.format("%q ==? %q\n", line1, line2))
return line1 == line2 -- could do an approximate match here
end
local function lines_overlap(l1, l2, k)
if k > #l2 or k > #l1 then return false end
io.stderr:write('*** k = ', k, '\n')
for i = 1, k do
if not lines_match(l2[i], l1[#l1 - k + i]) then
if i > 1 then
io.stderr:write('After ', i-1, ' matches: FAILED <====\n')
end
return false
end
end
return true
end
function find_overlaps(fname1, fname2)
local l1, l2 = lines(fname1), lines(fname2)
-- strip trailing and leading blank lines
while l1[#l1]:find '^[%s]*$' do table.remove(l1) end
while l2[1] :find '^[%s]*$' do table.remove(l2, 1) end
local matchsize -- # of lines at end of file 1 that are equal to the same
-- # at the start of file 2
for k = math.min(40, #l1, #l2), 1, -1 do
if lines_overlap(l1, l2, k) then
matchsize = k
io.stderr:write('Found match of ', k, ' lines\n')
break
end
end
if matchsize == nil then
return false -- failed to find an overlap
else
local overlap = { }
for j = 1, matchsize do
table.remove(l1) -- remove line from first set
table.insert(overlap, table.remove(l2, 1))
end
return l1, overlap, l2
end
end
local function split_overlap(l)
for i = 1, #l-1 do
if l[i]:match '%u' and not l[i]:match '%l' then -- has caps but no lowers
-- io.stderr:write('Looking for byline following ', l[i], '\n')
if l[i+1]:match '^%s*BY%s' then
local first = {}
for j = 1, i-1 do
table.insert(first, table.remove(l, 1))
end
-- io.stderr:write('Split with first line at ', l[1], '\n')
return first, l
end
end
end
end
local function strip_overlaps(filename1, filename2)
local l1, overlap, l2 = find_overlaps(filename1, filename2)
if not l1 then
io.stderr:write('No overlap in ', filename1, ' an
Are the stubs identical to the end of the previous file? Or different line endings/OCR mistakes?
Is there a way to discern an article's beginning? Maybe an indented abstract? Then you could go through each file and discard everything before the first and after (including) the second title.
Are the titles & author always on a single line? And does that line always contain the word "BY" in uppercase? If so, you can probably do a fair job with awk, using those criteria as the begin/end marker.
Edit: I really don't think that using diff is going to work as it is a tool for comparing broadly similar files. Your files are (from diff's point of view) actually completely different - I think it will get out of sync immediately. But then, I'm not a diff guru :-)
A quick stab at it, assuming that the stub is strictly identical in both files:
#!/usr/bin/perl
use strict;
use List::MoreUtils qw/ indexes all pairwise /;
my @files = @ARGV;
my @previous_text;
for my $filename ( @files ) {
open my $in_fh, '<', $filename or die;
open my $out_fh, '>', $filename.'.clean' or die;
my @lines = <$in_fh>;
print $out_fh destub( \@previous_text, @lines );
@previous_text = @lines;
}
sub destub {
my @previous = @{ shift() };
my @lines = @_;
my @potential_stubs = indexes { $_ eq $lines[0] } @previous;
for my $i ( @potential_stubs ) {
# check if the two documents overlap for that index
my @p = @previous[ $i.. $#previous ];
my @l = @lines[ 0..$#previous-$i ];
return @lines[ $#previous-$i + 1 .. $#lines ]
if all { $_ } pairwise { $a eq $b } @p, @l;
}
}
# no stub detected
return #lines;
}
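The overlap test itself, finding the longest run of lines that ends one file and starts the next, can be sketched in a few lines of Python. The sketch assumes the stub is strictly identical in both files, borrows the "try 40 lines, then 39, and so on" search from the Lua answer above, and uses hypothetical function names.

```python
def overlap_length(previous, current, max_lines=40):
    """Length of the longest suffix of `previous` that is a prefix of `current`."""
    limit = min(max_lines, len(previous), len(current))
    for k in range(limit, 0, -1):
        if previous[-k:] == current[:k]:
            return k
    return 0


def remove_leading_stub(previous, current):
    """Return `current` without the lines it shares with the end of `previous`."""
    return current[overlap_length(previous, current):]


prev_lines = ["a", "b", "c", "d"]
curr_lines = ["c", "d", "e"]
print(remove_leading_stub(prev_lines, curr_lines))
```

Searching from the longest candidate downwards means a short accidental match (for example a repeated blank line) cannot hide a longer genuine overlap.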
