I need to scan a log file for users who ran certain SQL statements (DROP, CREATE, etc.) and return an array of those users along with which SQL DDL statements they tried (DROP, CREATE, etc.).
I have log files where each line normally looks like this:
'2019-01-14T19:23:50Z UTC [ db=dev user=joeschmoe pid=123 userid=1 xid=1234 ]' LOG: Select *
however, sometimes the select statement will span multiple lines like this:
'2019-01-14T19:23:50Z UTC [ db=dev user=rb pid=16 userid=1 xid=8 ]' LOG: SELECT SUM (num_queries) num_all_queries
,SUM (CASE WHEN lalala is not null THEN num_queries ELSE 0 END) num_b
,SUM (CASE WHEN lalala is null THEN num_queries ELSE 0 END) num_non_b
,SUM (total_queue_time_min) total_queue_time_min
,SUM (CASE WHEN lalala is not null THEN total_queue_time_min ELSE 0 END) b_total_queue_time_min
,SUM (CASE WHEN lalala is null THEN total_queue_time_min ELSE 0 END) non_b_total_queue_time_min
,SUM (CASE WHEN lalala is not null THEN duration_s ELSE 0 END)/60.0 total_burst_usage_min
,SUM (CASE WHEN lalala is not null THEN 1 ELSE 0 END) num_lalalas
,MIN(firsttime) mintime
,MAX(lasttime) maxtime
,DATEDIFF (seconds, mintime, maxtime) workload_duration_s
,workload_duration_s/60.0 workload_duration_min
LEFT JOIN (SELECT b FROM STfdaf LIMIT 1) sq ON sq.but_reon < 100
;
I am combing these logs for certain keywords in the SQL statements. I can write the regex to handle that, but I need help getting this log into a format I can work with. I was originally using a for loop and a regex:
for line in input:
    user_match = re.search("DROP", line, re.IGNORECASE)
This wouldn't be accurate, because when a SQL statement spans multiple lines I wouldn't be able to tie the "DROP" back to the "USER" if the DROP occurred many lines after the initial line.
I'm not sure how to go about doing this, whether it's turning this text file into a list in Python and programmatically combining multiple lines into one, or whatever other options there may be.
I solved this issue by storing the user value in a variable that only updates when a regex match is found. So when a keyword is found, the user value last stored in the variable is returned along with the keyword.
for line in input:
    error = []
    user_match = re.search("USER=b[0-9]{6}", line, re.IGNORECASE)
    serviceuser = re.search("USER=[a-z0-9]*", line, re.IGNORECASE)
    if serviceuser:
        user = serviceuser.group().split("=")[1]
    elif user_match:
        user = user_match.group().split("=")[1]
    ddl = re.search(
        "LINK|.DELETE.|INSERT|TRIGGER|TRUNCATE|UPDATE|WRITE", line, re.IGNORECASE)
    if ddl:
        error.append(user)
        error.append(ddl.group())
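For the multi-line case itself, here is a hedged sketch of the other approach mentioned above: stitching continuation lines onto the entry they belong to before scanning. It assumes every new log entry starts with a quoted ISO timestamp like the samples above; group_entries, the pattern, and the "my.log" file name are illustrative, not from the original code.

import re

# Assumption: every new log entry starts with a quoted ISO-8601 timestamp,
# e.g. '2019-01-14T19:23:50Z UTC [ db=dev user=rb ... ]' LOG: ...
ENTRY_START = re.compile(r"^'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")

def group_entries(loglines):
    """Stitch continuation lines onto the entry they belong to."""
    entries = []
    for line in loglines:
        if ENTRY_START.match(line) or not entries:
            entries.append(line.rstrip("\n"))
        else:
            # No timestamp prefix: this line continues the previous statement
            entries[-1] += " " + line.strip()
    return entries

with open("my.log") as f:  # hypothetical file name
    for entry in group_entries(f):
        if re.search(r"\bDROP\b|\bCREATE\b", entry, re.IGNORECASE):
            print(entry)  # user tag and keyword now sit in the same entry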
This is my bot, which copies smilies in game chat.
The smilies it copies are stored in the table called "emoticons". If someone writes ":)", the bot writes ":)", and so on.
The code inside the loop handles this case: if someone writes, for example, ">:)", the script has to copy ">:)" and not just ":)".
CreateFrame, RegisterEvent, SetScript and SendChatMessage are built-in in-game Lua API functions.
local emoticons = {
":)", "(:", ":))", ">:)", "0:)", ":D", ":]", ":)))", "=]", "?_?", "+.+", ":P", ":3", "^^", "roar", ":V", "D:", ":C", ".D", ".)", "o_o",
"^-^", ":PPP", ":DDD", ":D:D:D", ":DDDD", ":D:D:D:D", ":DDDDD", ":d", ":L", "<O>_<O>", "o/", "+_+", "?_?", "*0*", ":}", ";)", ":))))", "o.o", "<.<''", ":|",
":-)", "^^^^", ":D:D:D:D:D:D", ":D :D :D", "^^^", ":c", ";]", ":9", ">:|", ">.<", ";3", ";P", "T_T", ":3c", ":)))))",
"^^^^^" }
local f = CreateFrame("Frame")
f:RegisterEvent("CHAT_MSG_GUILD")
f:SetScript("OnEvent", function(self, event, text, playerName, _, channelName, playerName2, _, _, channelIndex)
    local msg
    local n = 0
    for x, key in ipairs(emoticons) do
        local l = string.len(emoticons[x])
        if (string.sub(text, -l) == emoticons[x]) then
            if (l > n) then
                msg = emoticons[x]
                n = l
            end
        end
    end
    if (msg) and (playerName ~= UnitName("player")) then
        if (event == "CHAT_MSG_GUILD") then SendChatMessage(msg, "GUILD", nil, channelIndex) end
    end
end)
Is there any way to improve it? For example, if someone writes
"^^^^^^"
the bot copies
"^^^^^"
which is the same with one less "^", because that is what is stored in the table.
My goal is that if someone writes, for example, "^^^^^^" and it is not registered in the table, the script will not respond.
You're probably better off learning how to use string.gmatch.
As an example, let's say you only store one instance of ':D' in your emoticons table. You can iterate through the matches and respond in kind. Here's a small example:
local text = ':D:D:D'
local count = 0
for w in string.gmatch(text, ':D') do
    count = count + 1
end
if count > 0 then
    local response = ''
    for i = 1, count do
        response = response .. ':D'
    end
    print(response) -- prints ':D:D:D'
end
This doesn't handle every case, but hopefully it can help you get started.
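If the goal is strictly the one stated above (stay silent whenever the whole message is not an exact entry in the table), here is a hedged sketch that swaps the suffix loop for an exact-match lookup, reusing the emoticons table and handler shape from the question:

-- Build a set for exact-match lookups (assumes the emoticons table above)
local known = {}
for _, e in ipairs(emoticons) do known[e] = true end

f:SetScript("OnEvent", function(self, event, text, playerName, _, _, _, _, _, channelIndex)
    -- Respond only when the entire message is a registered emoticon, so
    -- "^^^^^^" (not in the table) gets no response at all.
    if known[text] and playerName ~= UnitName("player") then
        if (event == "CHAT_MSG_GUILD") then SendChatMessage(text, "GUILD", nil, channelIndex) end
    end
end)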
:D
I have a final payload in CSV format with some number of employee records. I have 2 main values to work with:
- Sequence field (added for each record)
- personal_id (unique for every employee)
Because each employee can have multiple records, each ID needs its own sequence numbering. In other words, if the employee with ID "123" has 5 records, the sequence values should be 1,2,3,4,5 instead of 1,1,1,1,1. All records are grouped by ID.
Also, the payload below shows only the 2 fields needed for the description; normally it has many more fields. This is a static example; normally the payload will be dynamic.
Here is the input payload:
Sequence;ID
123456
232323
232323
232323
111111
111111
222222
222222
222222
222222
222222
222222
222222
and here is the expected payload:
Sequence;ID
1;123456
1;232323
2;232323
3;232323
1;111111
2;111111
1;222222
2;222222
3;222222
4;222222
5;222222
6;222222
7;222222
Here is the actual payload after using the Groovy script:
1;Sequence;ID
1;123456
1;232323
2;232323
3;232323
1;111111
2;111111
1;222222
2;222222
3;222222
4;222222
5;222222
6;222222
7;222222
I am using the following Groovy script, but the problem is that the first line of the payload is also numbered with 1. Can you show me how to skip the first line of the payload?
The script adds a number to every record to have it counted. The point is to have all records of the same ID counted incrementally from 1. The script does that perfectly, but the issue is, as I said, that I want to skip the first line of the payload ("Sequence;...") so it is not counted.
import com.sap.gateway.ip.core.customdev.util.Message

def Message processData(Message message) {
    def payload = message.getBody(java.lang.String)
    def prevId = ''
    def sequence = 1
    def sb = new StringBuilder()
    payload.eachLine { line ->
        def values = line.split(';')
        if (values[1] != prevId) {
            // New personal ID
            sequence = 1
            prevId = values[1]
        } else {
            // Another line of the same personal ID
            sequence += 1
        }
        values[0] = sequence
        sb.append(values.join(';')).append(System.lineSeparator())
    }
    message.setBody(sb.toString())
    return message
}
Assuming the message body is just a String, .eachLine can be given a two-argument closure: the line and its index. You can use that to do whatever logic you want in your closure: skip index 0 entirely, output it as-is, etc.
Basic example:
payload.eachLine { line, idx ->
    if (idx == 0) {
        // do something with the first line
    } else {
        // everything you currently have
    }
}
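Applied to the script above, a hedged sketch of the full closure (same sequencing logic as your script; the only change is copying the header line through untouched):

payload.eachLine { line, idx ->
    if (idx == 0) {
        // Header line ("Sequence;ID"): pass it through without counting
        sb.append(line).append(System.lineSeparator())
    } else {
        def values = line.split(';')
        if (values[1] != prevId) {
            sequence = 1          // new personal ID, restart at 1
            prevId = values[1]
        } else {
            sequence += 1         // same personal ID, keep counting
        }
        values[0] = sequence
        sb.append(values.join(';')).append(System.lineSeparator())
    }
}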
When working with embedded SQL in RPG, you often end up with a cursor and a dow-loop for processing all rows in your result. The condition in the loop depends on SQLCOD and/or SQLSTT, two globally available variables in an SQLRPGLE program.
But what is the correct way of checking these values? Some suggest SQLCOD = 0, others something else (SQLCOD = +100 or SQLSTT = '02000'). One fails on all warnings, the other does not fail on some errors, so I'm not content.
To illustrate what I do with some code:
Pmain B
D PI
Dmy_ds E DS extname(SOME_TABLE)
D qualified
/free
exec sql
DECLARE cur CURSOR FOR
SELECT *
FROM some_table;
exec sql
OPEN cur;
exec sql
FETCH cur
INTO :my_ds;
dow sql_found();
exec sql
FETCH cur
INTO :my_ds;
enddo;
exec sql
CLOSE cur;
/end-free
Pmain E
Psql_found B
D PI N
/free
// insert return statement here...
/end-free
Psql_found E
I'm looking for the correct return statement here: one that will make me go through all rows if no error occurs and lets me leave when an error occurs. Bonus points for some decent way to check for errors.
SQLSTATE is better, and recommended by IBM.
From IBM's InfoCenter SQL Messages and Codes Reference: SQLCODE and SQLSTATE concepts
SQLSTATE is the preferred standard return code.
SQLSTATE is 5 characters, with the first two bytes identifying a class of conditions.
'00' = Unqualified Successful Completion
'01' = Warning
'02' = No Data
Anything else is an error. I generally only check for '00'.
Simple. Easy. More portable.
Using SQLCODE often involves lists of codes which are, IMHO, less than developer friendly.
Example:
Personally, I generally include definitions and code like this:
D xSQLState# s * inz( %addr(SQLState) )
D xSQLState ds 5 based(xSQLState#)
D xSQLState2 2a
D
D Success_On_SQL C const('00')
D Warning_On_SQL C const('01')
D NoData_On_SQL C const('02')
Then after any SQL operation, I generally check
if xSQLState2 <> Success_On_Sql;
someflag = true;
endif;
The best practice is to process the SQLCODEs you expect (as part of the expected processing) and to add exception code to handle the ones you don't. One implementation:
dow 1=1; // forever
exec sql
FETCH cur
INTO :my_ds;
// normal exit
if sqlstt = SQL_NODATA;
SFLEND = *on;
leave;
endif;
// can't CAST a value
if sqlstt = SQL_CAST; // CAST error
... tell user there's an error and read another
iter;
endif;
// decimal data error
if sqlstt = SQL_DDE;
tell user to call IT and stop reading
leave;
endif;
// whoops! not expected at all. Dump for post-mortem
if sqlstt <> SQL_NORMAL;
... tell user to call IT and stop reading
dump(a);
leave;
endif;
// test for end of loop
// filled subfile page?
enddo; // forever
With this type of implementation you have to intentionally leave the loop; whether you've filled a subfile page, loaded the highest element in an array or hit an error. I'm not sure there is a single, generic implementation that will handle all circumstances. Sometimes you might want to leave the read loop if you have a record lock and sometimes you want to issue a message and try again (for example).
I did some more searching on the topic and found something on IBM's site (quote):
The SQLCODE is also set by the database manager after each SQL
statement is executed as follows:
- If SQLCODE = 0 and SQLWARN0 is blank, execution was successful.
- If SQLCODE = 100, no data was found. For example, a FETCH
statement returned no data, because the cursor was positioned
after the last row of the result table.
- If SQLCODE > 0 and not = 100, execution was successful with a
warning.
- If SQLCODE = 0 and SQLWARN0 = 'W', execution was successful
with a warning.
- If SQLCODE < 0, execution was not successful.
Which would lead me to an sql_found() like this:
Pfound_sql B
D PI N
/free
return (SQLCOD >= 0) and (SQLCOD<>100);
/end-free
Pfound_sql E
That should take care of the End of Data condition and fail on all errors. I'm not sure if there are some warnings that I should take care of (don't want to get trapped in an endless loop, if there is a warning that leads to not reading).
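Combining this with the SQLSTATE advice above, a hedged variant of sql_found() keyed on the SQLSTT class instead, treating '00' (success) and '01' (warning) as "keep reading" and everything else, including '02' (no data), as "stop":

Psql_found B
D PI N
/free
// '00' = success, '01' = warning: keep fetching.
// '02' = no data, anything else = error: leave the loop.
return (%subst(SQLSTT:1:2) = '00') or (%subst(SQLSTT:1:2) = '01');
/end-free
Psql_found E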
Basically our client has been punishing our reports with a serious number of records, which has resulted in a timeout error. Essentially the query isn't good enough: the report does eventually generate, but it takes about 25 minutes, during which time the client is shown a timeout error page. I have been trying to optimize the query, but frankly it is still taking far too long at 10 minutes, and I feel the same issue will arise as the client is talking about using even larger datasets.
Is there any way I can make sure this timeout error doesn't occur? If I generate the report using Visual Studio Reporting Services it may take 10-15 minutes, but at least the report is generated. How can I force the browser to wait until the report is generated?
The client is putting a rush on this, so at least with this solution they will get what they need.
I've changed it to a stored procedure, and all the extra outer joins were added to replace a function call which I thought may have been the cause of the slowdown. An additional index was added in the hope of speeding things up, but no luck so far.
SQL Profiler shows an index seek which takes a hell of a lot of time, but I can't see how to improve this.
This is the execution plan for the original query that was run; the changes have resulted in significant differences we're still trying to figure out.
http://dl.dropbox.com/u/21484692/showplan.SQLPlan
Here is the SQL:
USE [Apollo_Interface_Staging]
GO
/****** Object: StoredProcedure [dbo].[spRptGMPP_UploadResults] Script Date: 01/27/2012 11:09:13 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[spRptGMPP_UploadResults]
-- Add the parameters for the stored procedure here
@StartDate DATETIME,
@EndDate DATETIME,
@UploadStatus VARCHAR(15)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
DECLARE @TrueEndDate DATETIME
SELECT @TrueEndDate = (select DATEADD(ms,-3,DATEADD(dd, DATEDIFF(dd,0,@EndDate), 1)))
SELECT
VIN1
,TR_DATE
,CASE WHEN REC_STAT = 'A' THEN 'Add' WHEN REC_STAT = 'D' THEN 'Delete' WHEN REC_STAT = 'C' THEN 'Change' ELSE REC_STAT END as REC_STAT
, FRAN_CDE
, WARR_ODD
, WARR_MTH
, cast(rds_fee as money)/100 * CASE WHEN REC_STAT = 'D' THEN -1 ELSE 1 END as Fee
,PROV
,ROUND((gst.Percentage)/100 * cast(rds_fee as money)/100 * CASE WHEN REC_STAT = 'D' THEN -1 ELSE 1 END,2) as GST
,CASE WHEN prov NOT IN ('ON','BC') THEN
ROUND((hst.Percentage)/100 * cast(rds_fee as money)/100 *CASE WHEN REC_STAT = 'D' THEN -1 ELSE 1 END,2)
ELSE 0
END as HST
,CASE WHEN prov = 'ON' THEN
ROUND((hst.Percentage)/100 * cast(rds_fee as money)/100 *CASE WHEN REC_STAT = 'D' THEN -1 ELSE 1 END,2)
ELSE 0
END as HST_ON
,CASE WHEN prov = 'BC' THEN
ROUND((hst.Percentage)/100 * cast(rds_fee as money)/100 *CASE WHEN REC_STAT = 'D' THEN -1 ELSE 1 END,2)
ELSE 0
END as HST_BC
,ROUND((hst.Percentage)/100 * cast(rds_fee as money)/100 *CASE WHEN REC_STAT = 'D' THEN -1 ELSE 1 END,2) as HST
,ROUND((1 + (gst.Percentage)/100)
* (qst.Percentage)/100 * cast(rds_fee as money)/100, 2) *CASE WHEN REC_STAT = 'D' THEN -1 ELSE 1 END as QST
,_UploadStatus as UploadStatus
,CAST(YEAR(_UploadDate) as varchar)+ '/' + RIGHT('0' + cast(MONTH(_UploadDate)as varchar) ,2)+ '/' + RIGHT('0' + DAY(_UploadDate),2) as UploadDate
,_UploadDescription as UploadDescription
,'' as VehicleMake--,Apollo.dbo.GetGMManufacturerFromVIN(VIN1) AS VehicleMake
,'' as VehicleYear--,Apollo.dbo.GetVehYear(VIN1) AS VehicleYear
,ID
,ISNULL(mem.MembershipID,'') as MembershipID
,ISNULL(mem.MembershipTermID,'') as MembershipTermID
,ISNULL(p.Name,'') as [Plan]
,CASE mem.ActionTaken
WHEN 0 THEN 'None'
WHEN 1 THEN 'Upgraded'
WHEN 2 THEN 'Reverse Upgrade'
WHEN 3 THEN 'Created'
WHEN 4 THEN 'Cancelled'
WHEN 5 THEN 'Status Changed'
WHEN 6 THEN 'Static Update'
ELSE ''
END as Action
,ISNULL((SELECT SUM(mtd.Amount) FROM Apollo.dbo.MemberTransactionDetail mtd WITH(NOLOCK) WHERE mtd.MemberTransactionID = mem.TransactionID),0) as TransactionAmount
,ISNULL((select SUM(pe.InitialValue) from apollo_drm..[transaction] trx WITH(NOLOCK) inner join apollo_drm..paymentelement pe WITH(NOLOCK) ON pe.transactionid = trx.transactionid where trx.externalref = mem.TransactionID AND trx.ExternalRef > 0),0) as DRMTransactionAmount
,gm.STAMP as DateTimeStamp
FROM
Apollo_Interface_Staging.dbo.gmpp_archive gm WITH(NOLOCK)
LEFT OUTER JOIN Apollo_Interface_Staging.dbo.GMPP_Effected_Memberships mem WITH(NOLOCK) ON mem.GMPP_ImportID = gm.ID
LEFT OUTER JOIN Apollo..MembershipTerm mt WITH(NOLOCK) ON mt.MembershipTermID = mem.MembershipTermID
LEFT OUTER JOIN Apollo..[PlanVersion] pv WITH(NOLOCK) ON pv.PlanVersionID = mt.PlanVersionID
LEFT OUTER JOIN Apollo..[Plan] p WITH(NOLOCK) ON p.PlanID = pv.PlanID
LEFT OUTER JOIN APollo..PRovinceOrSTateTax gst WITH(NOLOCK) ON gst.PRovinceOrStateTaxID = (
SELECT TOP 1 tmp.PRovinceOrStateTaxID
FROM Apollo.dbo.ProvinceOrStateTax tmp
WHERE tmp.TaxTypeProgKey = 'GST'
AND tmp.ProvinceOrStateProgKey = prov
AND tmp.StartDate <= GETDATE()
ORDER BY
tmp.StartDate DESC
)
LEFT OUTER JOIN APollo..PRovinceOrSTateTax hst WITH(NOLOCK) ON hst.PRovinceOrStateTaxID = (
SELECT TOP 1 tmp.PRovinceOrStateTaxID
FROM Apollo.dbo.ProvinceOrStateTax tmp
WHERE tmp.TaxTypeProgKey = 'HST'
AND tmp.ProvinceOrStateProgKey = prov
AND tmp.StartDate <= GETDATE()
ORDER BY
tmp.StartDate DESC
)
LEFT OUTER JOIN APollo..PRovinceOrSTateTax qst WITH(NOLOCK) ON qst.PRovinceOrStateTaxID = (
SELECT TOP 1 tmp.PRovinceOrStateTaxID
FROM Apollo.dbo.ProvinceOrStateTax tmp
WHERE tmp.TaxTypeProgKey = 'QST'
AND tmp.ProvinceOrStateProgKey = prov
AND tmp.StartDate <= GETDATE()
ORDER BY
tmp.StartDate DESC
)
WHERE
STAMP BETWEEN @StartDate AND @TrueEndDate
AND
(
@UploadStatus = 'ALL'
OR
_UploadStatus = @UploadStatus
)
END
In terms of attempted resolutions, I tried the parameter-sniffing route by putting the dates directly into the stored procedure instead of allowing the user to select them; it chopped off around 3 minutes, but this is obviously not really a solution for us, and even then it's still desperately slow.
Waiting for a report should never take this long.
Is it possible that you are doing something incorrectly, or without proper indexing?
Can you post the SQL that accesses the dataset?
The other option you have is to create an RS subscription that exports the report to PDF format and drops it in a client share directory every, say, 30 minutes. That way the client doesn't even have to wait for the report; they can expect it to be ready in the network share folder. You can adjust the schedule directly in Reporting Services so that the report is generated every 10 minutes / 20 minutes / etc.
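On the query itself, one hedged idea: the three TOP 1 tax lookups are written as LEFT OUTER JOINs whose ON clauses run correlated subqueries, and rewriting them as OUTER APPLY can sometimes give the optimizer a cleaner shape (one seek per tax type per row). A sketch of the pattern for GST only, using the table and column names from the query above; the HST and QST blocks would be repeated the same way:

SELECT gm.VIN1,
       gst.Percentage AS GstPercentage  -- feed this into the ROUND() expressions as before
FROM Apollo_Interface_Staging.dbo.gmpp_archive gm WITH (NOLOCK)
OUTER APPLY (
    SELECT TOP 1 tmp.Percentage
    FROM Apollo.dbo.ProvinceOrStateTax tmp
    WHERE tmp.TaxTypeProgKey = 'GST'
      AND tmp.ProvinceOrStateProgKey = gm.PROV
      AND tmp.StartDate <= GETDATE()
    ORDER BY tmp.StartDate DESC
) gst
WHERE gm.STAMP BETWEEN @StartDate AND @TrueEndDate;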
I have a large number of text files (1000+) each containing an article from an academic journal. Unfortunately each article's file also contains a "stub" from the end of the previous article (at the beginning) and from the beginning of the next article (at the end).
I need to remove these stubs in preparation for running a frequency analysis on the articles because the stubs constitute duplicate data.
There is no simple field that marks the beginning and end of each article in all cases. However, the duplicate text does seem to be formatted the same and to fall on the same lines in both cases.
A script that compared each file to the next file and then removed 1 copy of the duplicate text would be perfect. This seems like it would be a pretty common issue when programming so I am surprised that I haven't been able to find anything that does this.
The file names sort in order, so a script that compares each file to the next sequentially should work. E.G.
bul_9_5_181.txt
bul_9_5_186.txt
are two articles, one starting on page 181 and the other on page 186. Both of these articles are included below.
There are two volumes of test data located at http://drop.io/fdsayre
Note: I am an academic doing content analysis of old journal articles for a project in the history of psychology. I am no programmer, but I do have 10+ years experience with linux and can usually figure things out as I go.
Thanks for your help
FILENAME: bul_9_5_181.txt
SYN&STHESIA
ISI
the majority of Portugese words signifying black objects or ideas relating to black. This association is, admittedly, no true synsesthesia, but the author believes that it is only a matter of degree between these logical and spontaneous associations and genuine cases of colored audition.
REFERENCES
DOWNEY, JUNE E. A Case of Colored Gustation. Amer. J. of Psycho!., 1911, 22, S28-539MEDEIROS-E-ALBUQUERQUE. Sur un phenomene de synopsie presente par des millions de sujets. / . de psychol. norm, et path., 1911, 8, 147-151. MYERS, C. S. A Case of Synassthesia. Brit. J. of Psychol., 1911, 4, 228-238.
AFFECTIVE PHENOMENA — EXPERIMENTAL
BY PROFESSOR JOHN F. .SHEPARD
University of Michigan
Three articles have appeared from the Leipzig laboratory during the year. Drozynski (2) objects to the use of gustatory and olfactory stimuli in the study of organic reactions with feelings, because of the disturbance of breathing that may be involved. He uses rhythmical auditory stimuli, and finds that when given at different rates and in various groupings, they are accompanied by characteristic feelings in each subject. He records the chest breathing, and curves from a sphygmograph and a water plethysmograph. Each experiment began with a normal record, then the stimulus was given, and this was followed by a contrast stimulus; lastly, another normal was taken. The length and depth of breathing were measured (no time line was recorded), and the relation of length of inspiration to length of expiration was determined. The length and height of the pulsebeats were also measured. Tabular summaries are given of the number of times the author finds each quantity to have been increased or decreased during a reaction period with each type of feeling. The feeling state accompanying a given rhythm is always complex, but the result is referred to that dimension which seemed to be dominant. Only a few disconnected extracts from normal and reaction periods are reproduced from the records. The author states that excitement gives increase in the rate and depth of breathing, in the inspiration-expiration ratio, and in the rate and size of pulse. There are undulations in the arm volume. In so far as the effect is quieting, it causes decrease in rate and depth of
182
JOHN F. SHEPARD
breathing, in the inspiration-expiration ratio, and in the pulse rate and size. The arm volume shows a tendency to rise with respiratory waves. Agreeableness shows
It looks like a much simpler solution would actually work.
No one seems to be using the information provided by the filenames. If you do make use of this information, you may not have to do any comparisons between files to identify the area of overlap. Whoever wrote the OCR probably put some thought into this problem.
The last number in the file name tells you what the starting page number for that file is. This page number appears on a line by itself in the file as well. It also looks like this line is preceded and followed by blank lines. Therefore for a given file you should be able to look at the name of the next file in the sequence and determine the page number at which you should start removing text. Since this page number appears in your file just look for a line that contains only this number (preceded and followed by blank lines) and delete that line and everything after. The last file in the sequence can be left alone.
Here's an outline for an algorithm (a hedged sketch follows the list):
choose a file; call it: file1
look at the filename of the next file; call it: file2
extract the page number from the filename of file2; call it: pageNumber
scan the contents of file1 until you find a line that contains only pageNumber
make sure this line is preceded and followed by a blank line.
remove this line and everything after
move on to the next file in the sequence
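Here is a sketch of that outline in Python, under some assumptions: filenames look like bul_9_5_181.txt with the starting page as the last underscore-separated field, and the '.trimmed' output suffix is purely illustrative.

#!/usr/bin/env python
import re
import sys

def start_page(name):
    # Assumption: filenames look like bul_9_5_181.txt; the last field is the starting page
    return re.search(r'_(\d+)\.txt$', name).group(1)

files = sorted(sys.argv[1:])
for current, following in zip(files, files[1:]):
    page = start_page(following)
    with open(current) as f:
        lines = f.readlines()
    for i in range(1, len(lines) - 1):
        # A line holding only the next article's page number, surrounded by
        # blank lines, marks where the stub of the next article begins.
        if (lines[i].strip() == page
                and not lines[i - 1].strip()
                and not lines[i + 1].strip()):
            lines = lines[:i]
            break
    with open(current + '.trimmed', 'w') as f:  # '.trimmed' suffix is illustrative
        f.writelines(lines)
# The last file in the sequence is left alone, as the outline says.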
You should probably try something like this (I've now tested it on the sample data you provided):
#!/usr/bin/ruby

class A_splitter
  Title = /^[A-Z]+[^a-z]*$/
  Byline = /^BY /
  Number = /^\d*$/
  Blank_line = /^ *$/
  attr_accessor :recent_lines, :in_references, :source_glob, :destination_path, :seen_in_last_file

  def initialize(src_glob, dst_path = nil)
    @recent_lines = []
    @seen_in_last_file = {}
    @in_references = false
    @source_glob = src_glob
    @destination_path = dst_path
    @destination = STDOUT
    @buffer = []
    split_em
  end

  def split_here
    if destination_path
      @destination.close if @destination
      @destination = nil
    else
      print "------------SPLIT HERE------------\n"
    end
    print recent_lines.shift
    @in_references = false
  end

  def at_page_break
    ((recent_lines[0] =~ Title and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
     (recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
  end

  def print(*args)
    (@destination || @buffer) << args
  end

  def split_em
    Dir.glob(source_glob).sort.each { |filename|
      if destination_path
        @destination.close if @destination
        @destination = File.open(File.join(@destination_path, filename), 'w')
        print @buffer
        @buffer.clear
      end
      in_header = true
      File.foreach(filename) { |line|
        line.gsub!(/\f/, '')
        if in_header and seen_in_last_file[line]
          # skip it
        else
          seen_in_last_file.clear if in_header
          in_header = false
          recent_lines << line
          seen_in_last_file[line] = true
        end
        3.times { recent_lines.shift } if at_page_break
        if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
          split_here
        elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
          split_here
        elsif recent_lines.length > 4
          @in_references ||= recent_lines[0] =~ /^REFERENCES *$/
          print recent_lines.shift
        end
      }
    }
    print recent_lines
    @destination.close if @destination
  end
end

A_splitter.new('bul_*_*_*.txt', 'test_dir')
Basically, it runs through the files in order, and within each file runs through the lines in order, omitting from each file the lines that were present in the preceding file and printing the rest to STDOUT (from which it can be piped), unless a destination directory is specified (called 'test_dir' in the example; see the last line), in which case files are created in the specified directory with the same name as the file which contained the bulk of their contents.
It also removes the page-break sections (journal title, author, and page number).
It does two split tests:
a test on the title/byline pair
a test on the first title-line after a reference section
(it should be obvious how to add tests for additional split-points).
Retained for posterity:
If you don't specify a destination directory it simply puts a split-here line in the output stream at the split point. This should make it easier for testing (you can just less the output) and when you want them in individual files just pipe it to csplit (e.g. with
csplit -f abstracts - '/SPLIT HERE/' '{*}'
or something) to cut it up.
Here is the beginning of another possible solution in Perl (it works as-is, but could probably be made more sophisticated if needed). It sounds as if all you are concerned about is removing duplicates across the corpus, and you don't really care if the last part of one article is in the file for the next one, as long as it isn't duplicated anywhere. If so, this solution will strip out the duplicate lines, leaving only one copy of any given line in the set of files as a whole.
You can either just run the script in the directory containing the text files with no argument, or alternately specify a file name containing the list of files you want to process in the order you want them processed. I recommend the latter, as your file names (at least in the sample files you provided) do not naturally list in order when using simple commands like ls on the command line or glob in the Perl script. Thus it won't necessarily compare the correct files to one another as it just runs down the list (entered or generated by the glob command). If you specify the list, you can guarantee that the files will be processed in the correct order, and it doesn't take that long to set up properly.
The script simply opens two files and makes note of the first three lines of the second file. It then opens a new output file (original file name + '.new') for the first file and writes out all the lines from the first file into the new output file until it finds the first three lines of the second file. There is an off chance that the last three lines of the first file don't match the first three lines of the second, but in all the files I spot-checked that seemed to be the case, because of the journal-name header and page numbers. One line definitely wasn't enough, as the journal title was often the first line and that would cut things off early.
I should also note that the last file in your list will not be processed (i.e. have a new file created based on it), as it will not be changed by this process.
Here's the script:
#!/usr/bin/perl
use strict;

my @files;
my $count = @ARGV;
if ($count > 0) {
    open (IN, "$ARGV[0]");
    @files = <IN>;
    close (IN);
    chomp (@files);  # strip newlines so the names can be used with open()
} else {
    @files = glob "bul_*.txt";
}
$count = @files;
print "Processing $count files.\n";

my $lastFile = "";
foreach (@files) {
    if ($lastFile ne "") {
        print "Processing $_\n";
        open (FILEB, "$_");
        my @fileBLines = <FILEB>;
        close (FILEB);
        my $line0 = $fileBLines[0];
        if ($line0 =~ /\(/ || $line0 =~ /\)/) {
            $line0 =~ s/\(/\\\(/;
            $line0 =~ s/\)/\\\)/;
        }
        my $line1 = $fileBLines[1];
        my $line2 = $fileBLines[2];
        open (FILEA, "$lastFile");
        my @fileALines = <FILEA>;
        close (FILEA);
        my $newName = "$lastFile.new";
        open (OUT, ">$newName");
        my $i = 0;
        my $done = 0;
        while ($done != 1 and $i < @fileALines) {
            if ($fileALines[$i] =~ /$line0/
                && $fileALines[$i+1] eq $line1
                && $fileALines[$i+2] eq $line2) {
                $done = 1;
            } else {
                print OUT $fileALines[$i];
                $i++;
            }
        }
        close (OUT);
    }
    $lastFile = $_;
}
EDIT: Added a check for parentheses in the first line, which goes into the regex duplicate check later on; if found, they are escaped so that they don't mess up the check.
You have a nontrivial problem. It is easy to write code to find the duplicate text at the end of file 1 and the beginning of file 2. But you don't want to delete the duplicate text; you want to split it where the second article begins. Getting the split right might be tricky: one marker is the all-caps title, another is the BY at the start of the next line.
It would have helped to have examples from consecutive files, but the script below works on one test case. Before trying this code, back up all your files. The code overwrites existing files.
The implementation is in Lua.
The algorithm is roughly:
Ignore blank lines at the end of file 1 and the start of file 2.
Find a long sequence of lines common to end of file 1 and start of file 2.
This works by trying a sequence of 40 lines, then 39, and so on
Remove sequence from both files and call it overlap.
Split overlap at title
Append first part of overlap to file1; prepend second part to file2.
Overwrite contents of files with lists of lines.
Here's the code:
#!/usr/bin/env lua

local ext = arg[1] == '-xxx' and '.xxx' or ''
if #ext > 0 then table.remove(arg, 1) end

local function lines(filename)
  local l = { }
  for line in io.lines(filename) do table.insert(l, (line:gsub('', ''))) end
  assert(#l > 0, "No lines in file " .. filename)
  return l
end

local function write_lines(filename, lines)
  local f = assert(io.open(filename .. ext, 'w'))
  for i = 1, #lines do
    f:write(lines[i], '\n')
  end
  f:close()
end

local function lines_match(line1, line2)
  io.stderr:write(string.format("%q ==? %q\n", line1, line2))
  return line1 == line2 -- could do an approximate match here
end

local function lines_overlap(l1, l2, k)
  if k > #l2 or k > #l1 then return false end
  io.stderr:write('*** k = ', k, '\n')
  for i = 1, k do
    if not lines_match(l2[i], l1[#l1 - k + i]) then
      if i > 1 then
        io.stderr:write('After ', i-1, ' matches: FAILED <====\n')
      end
      return false
    end
  end
  return true
end

function find_overlaps(fname1, fname2)
  local l1, l2 = lines(fname1), lines(fname2)
  -- strip trailing and leading blank lines
  while l1[#l1]:find '^[%s]*$' do table.remove(l1) end
  while l2[1]:find '^[%s]*$' do table.remove(l2, 1) end
  local matchsize -- # of lines at end of file 1 that are equal to the same
                  -- # at the start of file 2
  for k = math.min(40, #l1, #l2), 1, -1 do
    if lines_overlap(l1, l2, k) then
      matchsize = k
      io.stderr:write('Found match of ', k, ' lines\n')
      break
    end
  end
  if matchsize == nil then
    return false -- failed to find an overlap
  else
    local overlap = { }
    for j = 1, matchsize do
      table.remove(l1) -- remove line from first set
      table.insert(overlap, table.remove(l2, 1))
    end
    return l1, overlap, l2
  end
end

local function split_overlap(l)
  for i = 1, #l-1 do
    if l[i]:match '%u' and not l[i]:match '%l' then -- has caps but no lowers
      -- io.stderr:write('Looking for byline following ', l[i], '\n')
      if l[i+1]:match '^%s*BY%s' then
        local first = {}
        for j = 1, i-1 do
          table.insert(first, table.remove(l, 1))
        end
        -- io.stderr:write('Split with first line at ', l[1], '\n')
        return first, l
      end
    end
  end
end

local function strip_overlaps(filename1, filename2)
  local l1, overlap, l2 = find_overlaps(filename1, filename2)
  if not l1 then
    io.stderr:write('No overlap in ', filename1, ' an
Are the stubs identical to the end of the previous file? Or are there different line endings/OCR mistakes?
Is there a way to discern an article's beginning? Maybe an indented abstract? Then you could go through each file and discard everything before the first and after (and including) the second title.
Are the title and author always on a single line? And does that line always contain the word "BY" in uppercase? If so, you can probably do a fair job with awk, using those criteria as the begin/end markers.
Edit: I really don't think that using diff is going to work, as it is a tool for comparing broadly similar files. Your files are (from diff's point of view) actually completely different; I think it will get out of sync immediately. But then, I'm not a diff guru :-)
A quick stab at it, assuming that the stub is strictly identical in both files:
#!/usr/bin/perl
use strict;
use List::MoreUtils qw/ indexes all pairwise /;

my @files = @ARGV;
my @previous_text;

for my $filename ( @files ) {
    open my $in_fh,  '<', $filename          or die;
    open my $out_fh, '>', $filename.'.clean' or die;
    my @lines = <$in_fh>;
    print $out_fh destub( \@previous_text, @lines );
    @previous_text = @lines;
}

sub destub {
    my @previous = @{ shift() };
    my @lines = @_;
    my @potential_stubs = indexes { $_ eq $lines[0] } @previous;
    for my $i ( @potential_stubs ) {
        # check if the two documents overlap for that index
        my @p = @previous[ $i .. $#previous ];
        my @l = @lines[ 0 .. $#previous - $i ];
        return @lines[ $#previous - $i + 1 .. $#lines ]
            if all { $_ } pairwise { $a eq $b } @p, @l;
    }
    # no stub detected
    return @lines;
}