Apache Beam hanging on GroupByKey after windowing - not triggering - python-3.x

TL;DR
How do I correctly trigger count windows with the Python SDK?
Problem
I'm trying to make a pipeline for transforming and indexing a Wikipedia dump.
The objective is:
Read from a compressed file - just one process and in a streaming fashion as the file doesn't fit in RAM
Process each element in parallel (ParDo)
Group these elements into count windows (a GroupByKey on a single key to go from streaming to batch) in just one process, to save them in a DB.
Development
For that, I created a simple source class that returns a tuple in the form (index, data, count):
class CountingSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, offset_range_tracker):
        # timestamp = datetime.now()
        k = 0
        with gzip.open(file_name, "rt", encoding="utf-8", errors="strict") as f:
            line = f.readline()
            while line:
                # Structure: index, page, index, page,...
                line = f.readline()
                yield line, f.readline(), k
                k += 1
And I made the pipeline:
_beam_pipeline_args = [
    "--runner=DirectRunner",
    "--streaming",
    # "--direct_num_workers=5",
    # "--direct_running_mode=multi_processing",
]

with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    pipeline = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "With timestamps" >> beam.Map(lambda data: beam.window.TimestampedValue(data, data[-1]))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        # * not working, keep stuck at group - not triggering the window
        | "window"
        >> beam.WindowInto(
            beam.window.GlobalWindows(),
            trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(10)),
            accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING,
        )
        | "Map to tuple" >> beam.Map(lambda data: (None, data))
        # | "Print" >> beam.Map(lambda data: print(data))
        | "Group all per window" >> beam.GroupByKey()
        | "Discard key" >> beam.Values()
        | "Index data" >> beam.Map(index_data)
    )
If I remove the window and go directly from "Filter nones" to "Index data", the pipeline works, but it indexes the elements individually. Also, if I uncomment the print step I can see that I still have data after the "Map to tuple" step, yet it hangs on "Group all per window" without any log output. I also tried timed triggering, changing the window to
>> beam.WindowInto(
    beam.window.FixedWindows(10))
but this changed nothing (it was supposed to behave the same way, since I create a "count timestamp" during data extraction).
Am I misunderstanding something about the windowing? The objective is just to index the data in batches.
Alternative
I can "hack" this last step using a custom do.Fn like:
class BatchIndexing(beam.DoFn):
    def __init__(self, connection_string, batch_size=50000):
        self._connection_string = connection_string
        self._batch_size = batch_size
        self._total = 0

    def setup(self):
        from sqlalchemy import create_engine
        from sqlalchemy.orm import sessionmaker
        from scripts.wikipedia.wikipedia_articles.beam_module.documents import Base
        engine = create_engine(self._connection_string, echo=False)
        self.session = sessionmaker(bind=engine)(autocommit=False, autoflush=False)
        Base.metadata.create_all(engine)

    def start_bundle(self):
        # buffer for string of lines
        self._lines = []

    def process(self, element):
        # Input element is the processed pair
        self._lines.append(element)
        if len(self._lines) >= self._batch_size:
            self._total += len(self._lines)
            self._flush_batch()

    def finish_bundle(self):
        # takes care of the unflushed buffer before finishing
        if self._lines:
            self._flush_batch()

    def _flush_batch(self):
        self.index_data(self._lines)
        # Clear the buffer.
        self._lines = []

    def index_data(self, entries_to_index):
        """
        Index batch of data.
        """
        print(f"Indexed {self._total} entries")
        self.session.add_all(entries_to_index)
        self.session.commit()
and change the pipeline to:
with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    pipeline = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        | "Unroll" >> beam.FlatMap(lambda data: data)
        | "Index data" >> beam.ParDo(BatchIndexing(connection_string, batch_size=10000))
    )
Which "works" but do the last step in parallel (thus, overwhelming de database or generating locked database problems with sqlite) and I would like to have just one Sink to communicate with the database.

Triggering in Beam is not a hard requirement; my guess would be that the trigger does not manage to fire before the input ends. The early trigger of 10 elements means the runner is allowed to trigger after 10 elements, but it does not have to (this relates to how Beam splits inputs into bundles).
FixedWindows(10) uses a fixed 10-second interval, and all your data will have the same timestamp, so that is not going to help either.
If your goal is to group data into batches, there is a very handy transform for that: GroupIntoBatches, which should work for this use case and has additional features like limiting how long a record can wait in the batch before being processed.
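For reference, a minimal sketch of what that could look like in this pipeline (CountingSource, ProcessPage, index_data and dump_path are the ones from the question; the batch size and the max_buffering_duration_secs option are only illustrative, and the latter requires a reasonably recent Beam version):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(["--runner=DirectRunner", "--streaming"])) as pipeline:
    _ = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        # GroupIntoBatches works on keyed elements; a single constant key keeps
        # one batching/writing worker, as in the original design.
        | "Map to tuple" >> beam.Map(lambda data: (None, data))
        | "Batch" >> beam.GroupIntoBatches(10000, max_buffering_duration_secs=60)
        | "Discard key" >> beam.Values()
        | "Index data" >> beam.Map(index_data)
    )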

Related

Apache Beam / Dataflow pub/sub side input with python

I'm new to Apache Beam, so I'm struggling a bit with the following scenario:
Pub/Sub topic using Stream mode
Transform to take out customerId
Parallel PCollection with Transform/ParDo that fetches data from Firestore based on the "customerId" received in the Pub/Sub Topic (using Side Input)
...
The ParDo transform that tries to fetch Firestore data does not run at all. If I use a fixed "customerId" value, everything works as expected ... although that does not use a proper fetch from Firestore (just a simple ParDo). Am I doing something that is not supposed to work?
Including my code below:
class getFirestoreUsers(beam.DoFn):
    def process(self, element, customerId):
        print(f'Getting Users from Firestore, ID: {customerId}')
        # Call function to initialize Database
        db = intializeFirebase()
        """ # get customer information from the database
        doc = db.document(f'Customers/{customerId}').get()
        customer = doc.to_dict() """
        usersList = {}
        # Get Optin Users
        try:
            docs = db.collection(
                f'Customers/{customerId}/DevicesWiFi_v3').where(u'optIn', u'==', True).stream()
            usersList = {user.id: user.to_dict() for user in docs}
        except Exception as err:
            print(f"Error: couldn't retrieve OPTIN users from DevicesWiFi")
            print(err)
        return([usersList])
Main code
def run(argv=None):
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--topic',
        type=str,
        help='Pub/Sub topic to read from')
    parser.add_argument(
        '--output',
        help=('Output local filename'))
    args, pipeline_args = parser.parse_known_args(argv)
    options = PipelineOptions(pipeline_args)
    options.view_as(SetupOptions).save_main_session = True
    options.view_as(StandardOptions).streaming = True
    p = beam.Pipeline(options=options)

    users = (p | 'Create chars' >> beam.Create([
        {
            "clientMac": "7c:d9:5c:b8:6f:38",
            "username": "Louis"
        },
        {
            "clientMac": "48:fd:8e:b0:6f:38",
            "username": "Paul"
        }
    ]))

    # Get Dictionary from Pub/Sub
    data = (p | 'Read from PubSub' >> beam.io.ReadFromPubSub(topic=args.topic)
              | 'Parse JSON to Dict' >> beam.Map(lambda e: json.loads(e))
            )

    # Get customerId from Pub/Sub information
    PcustomerId = (data | 'get customerId from Firestore' >>
                   beam.ParDo(lambda x: [x.get('customerId')]))
    PcustomerId | 'print customerId' >> beam.Map(print)

    # Get Users from Firestore
    custUsers = (users | 'Read from Firestore' >> beam.ParDo(
        getFirestoreUsers(), customerId=beam.pvalue.AsSingleton(PcustomerId)))
    custUsers | 'print Users from Firestore' >> beam.Map(print)
In order to avoid errors when running the function, I had to initialise the "users" dictionary, which I completely ignore afterwards.
I suppose I have several errors here, so your help is much appreciated.
It's not clear to me how the users PCollection is used in the example code (since element is not processed in the process definition). I've re-arranged the code a little bit, added windowing, and used the customer_id as the main input.
class GetFirestoreUsers(beam.DoFn):
    def setup(self):
        # Call function to initialize Database
        self.db = intializeFirebase()

    def process(self, element):
        print(f'Getting Users from Firestore, ID: {element}')
        """ # get customer information from the database
        doc = self.db.document(f'Customers/{element}').get()
        customer = doc.to_dict() """
        usersList = {}
        # Get Optin Users
        try:
            docs = self.db.collection(
                f'Customers/{element}/DevicesWiFi_v3').where(u'optIn', u'==', True).stream()
            usersList = {user.id: user.to_dict() for user in docs}
        except Exception as err:
            print(f"Error: couldn't retrieve OPTIN users from DevicesWiFi")
            print(err)
        return([usersList])

data = (p | 'Read from PubSub' >> beam.io.ReadFromPubSub(topic=args.topic)
          | beam.WindowInto(window.FixedWindows(60))
          | 'Parse JSON to Dict' >> beam.Map(lambda e: json.loads(e)))

# Get customerId from Pub/Sub information
customer_id = (data | 'get customerId from Firestore' >>
               beam.Map(lambda x: x.get('customerId')))
customer_id | 'print customerId' >> beam.Map(print)

# Get Users from Firestore
custUsers = (customer_id | 'Read from Firestore' >> beam.ParDo(
    GetFirestoreUsers()))
custUsers | 'print Users from Firestore' >> beam.Map(print)
From your comment:
the data needed (customerID first and customers data after) is not ready when running the "main" PCollection with original JSON data from Pub/Sub
Did you mean the data in firestore is not ready when reading the Pub/Sub topic?
You can always split the logic into 2 pipelines in your main function and run them one after another.
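A minimal sketch of that second suggestion, with placeholder (hypothetical) transforms rather than your actual reads and writes: the with block waits for the pipeline to finish, so the second pipeline only starts once the first one has completed.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    options = PipelineOptions(argv)

    # Pipeline 1: e.g. prepare or refresh the reference data.
    with beam.Pipeline(options=options) as p1:
        _ = (p1
             | "Create refs" >> beam.Create([{"customerId": "abc"}])
             | "Store refs" >> beam.Map(print))  # placeholder sink

    # Pipeline 2: starts only after pipeline 1 has finished.
    with beam.Pipeline(options=options) as p2:
        _ = (p2
             | "Create events" >> beam.Create([{"customerId": "abc", "value": 1}])
             | "Process" >> beam.Map(print))  # placeholder processing


if __name__ == "__main__":
    run()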

Is there a way to Keep track of all the bad records that are allowed while loading a ndjson file into Bigquery

I have a requirement where I need to keep track of all the bad records that were not fed into BigQuery after allowing max_bad_records. I need them written to a file on storage for future reference. I'm using the BQ API for Python; is there a way we can achieve this? I think if we allow max_bad_records we don't have the details of the failed loads in the BQ load job.
Thanks
Currently, there isn't a direct way of accessing and saving the bad records. However, you can access some job statistics, including the reason why a record was skipped, within BigQuery's _job_statistics().
I have created an example, in order to demonstrate how the statistics will be shown. I have the following sample .csv file in a GCS bucket:
name,age
robert,25
felix,23
john,john
As you can see, the last row is a bad record, because I will import age as INT64 and there is a string in that row. In addition, I used the following code to upload it to BigQuery:
from google.cloud import bigquery
client = bigquery.Client()
table_ref = client.dataset('dataset').table('table_name')
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("name", "STRING"),
bigquery.SchemaField("age", "INT64"),
]
)
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.skip_leading_rows = 1
job_config.max_bad_records = 5
#job_config.autodetect = True
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = "gs://path/file.csv"
load_job = client.load_table_from_uri(
uri, table_ref, job_config=job_config
) # API request
print("Starting job {}".format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print("Job finished.")
destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))
#Below all the statistics that might be useful in your case
job_state = load_job.state
job_id = load_job.job_id
error_result = load_job.error_result
job_statistics = load_job._job_statistics()
badRecords = job_statistics['badRecords']
outputRows = job_statistics['outputRows']
inputFiles = job_statistics['inputFiles']
inputFileBytes = job_statistics['inputFileBytes']
outputBytes = job_statistics['outputBytes']
print("***************************** ")
print(" job_state: " + str(job_state))
print(" non fatal error: " + str(load_job.errors))
print(" error_result: " + str(error_result))
print(" job_id: " + str(job_id))
print(" badRecords: " + str(badRecords))
print(" outputRows: " + str(outputRows))
print(" inputFiles: " + str(inputFiles))
print(" inputFileBytes: " + str(inputFileBytes))
print(" outputBytes: " + str(outputBytes))
print(" ***************************** ")
print("------ load_job.errors ")
The output from the statistics :
*****************************
job_state: DONE
non fatal errors: [{u'reason': u'invalid', u'message': u"Error while reading data, error message: Could not parse 'john' as INT64 for field age (position 1) starting at location 23", u'location': u'gs://path/file.csv'}]
error_result: None
job_id: b2b63e39-a5fb-47df-b12b-41a835f5cf5a
badRecords: 1
outputRows: 2
inputFiles: 1
inputFileBytes: 33
outputBytes: 26
*****************************
As shown above, the errors field returns the non-fatal errors, which include the bad records; in other words, it retrieves the individual errors generated by the job, whereas error_result returns the error information for the job as a whole.
I believe these statistics might help you analyse your bad records. Lastly, you can output them to a file using write(), for example:
with open("errors.txt", "x") as f:
f.write(load_job.errors)
f.close()
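If you want the saved records to stay machine-readable, a small variation (assuming the same load_job as above) is to write each non-fatal error as one JSON object per line:

import json

with open("errors.jsonl", "w") as f:
    # load_job.errors is a list of dicts (see the output above)
    for err in load_job.errors or []:
        f.write(json.dumps(err) + "\n")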

Monitoring WriteToBigQuery

In my pipeline I use WriteToBigQuery something like this:
| beam.io.WriteToBigQuery(
    'thijs:thijsset.thijstable',
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
This returns a Dict as described in the documentation as follows:
The beam.io.WriteToBigQuery PTransform returns a dictionary whose
BigQueryWriteFn.FAILED_ROWS entry contains a PCollection of all the
rows that failed to be written.
How do I print this dict and turn it into a PCollection, or how do I just print the FAILED_ROWS?
If I do: | "print" >> beam.Map(print)
Then I get: AttributeError: 'dict' object has no attribute 'pipeline'
I must have read a hundred pipelines but never have I seen anything after the WriteToBigQuery.
[edit]
When I finish the pipeline and store the results in a variable I have the following:
{'FailedRows': <PCollection[WriteToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn).FailedRows] at 0x7f0e0cdcfed0>}
But I do not know how to use this result in the pipeline like this:
| beam.io.WriteToBigQuery(
    'thijs:thijsset.thijstable',
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
| ['FailedRows'] from previous step
| "print" >> beam.Map(print)
Dead letters to handle invalid inputs are a common Beam/Dataflow pattern and work with both the Java and Python SDKs, but there are not many examples for the latter.
Imagine that we have some dummy input data with 10 good lines and a bad row that does not conform to the table schema:
schema = "index:INTEGER,event:STRING"
data = ['{0},good_line_{1}'.format(i + 1, i + 1) for i in range(10)]
data.append('this is a bad row')
Then, what I do is name the write result (events in this case):
events = (p
    | "Create data" >> beam.Create(data)
    | "CSV to dict" >> beam.ParDo(CsvToDictFn())
    | "Write to BigQuery" >> beam.io.gcp.bigquery.WriteToBigQuery(
        "{0}:dataflow_test.good_lines".format(PROJECT),
        schema=schema,
    )
)
and then access the FAILED_ROWS side output:
(events[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
    | "Bad lines" >> beam.io.textio.WriteToText("error_log.txt"))
This works well with the DirectRunner: it writes the good lines to BigQuery and the bad one to a local file:
$ cat error_log.txt-00000-of-00001
('PROJECT_ID:dataflow_test.good_lines', {'index': 'this is a bad row'})
If you run it with the DataflowRunner you'll need some additional flags. If you face the TypeError: 'PDone' object has no attribute '__getitem__' error you'll need to add --experiments=use_beam_bq_sink to use the new BigQuery sink.
If you get a KeyError: 'FailedRows', it's because the new sink defaults to BigQuery load jobs for batch pipelines:
STREAMING_INSERTS, FILE_LOADS, or DEFAULT. An introduction on loading
data to BigQuery: https://cloud.google.com/bigquery/docs/loading-data.
DEFAULT will use STREAMING_INSERTS on Streaming pipelines and
FILE_LOADS on Batch pipelines.
You can override the behavior by specifying method='STREAMING_INSERTS' in WriteToBigQuery:
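For example, reusing the snippet above (same PROJECT, schema and CsvToDictFn placeholders):

events = (p
    | "Create data" >> beam.Create(data)
    | "CSV to dict" >> beam.ParDo(CsvToDictFn())
    | "Write to BigQuery" >> beam.io.gcp.bigquery.WriteToBigQuery(
        "{0}:dataflow_test.good_lines".format(PROJECT),
        schema=schema,
        # force streaming inserts so the FAILED_ROWS output is populated
        method='STREAMING_INSERTS')
)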
Full code for both DirectRunner and DataflowRunner here.

QTreeView crashing with no apparent reason

I introduced a treeview in the GUI of the program I'm making, and since then it crashes when I attempt to change its model once it has been set.
The course of action is:
load the file using a file dialogue
Clearing the models on the interface objects (tables and treeview). The first time the treeview is not affected since there is no model in it yet.
Populate the treeview model.
other stuff not related to the issue.
The problematic functions are:
The file loading procedure:
def open_file(self):
    """
    Open a file
    :return:
    """
    print("actionOpen_file_click")
    # declare the dialog
    # file_dialog = QtGui.QFileDialog(self)
    # declare the allowed file types
    files_types = "Excel 97 (*.xls);;Excel (*.xlsx);;DigSILENT (*.dgs);;MATPOWER (*.m)"
    # call dialog to select the file
    filename, type_selected = QtGui.QFileDialog.getOpenFileNameAndFilter(self, 'Open file',
                                                                         self.project_directory, files_types)
    if len(filename) > 0:
        self.project_directory = os.path.dirname(filename)
        print(filename)
        self.circuit = Circuit(filename, True)
        # set data structures list model
        self.ui.dataStructuresListView.setModel(self.available_data_structures_listModel)
        # set the first index
        index = self.available_data_structures_listModel.index(0, 0, QtCore.QModelIndex())
        self.ui.dataStructuresListView.setCurrentIndex(index)
        # clean
        self.clean_GUI()
        # load table
        self.display_objects_table()
        # draw graph
        self.ui.gridPlot.setTitle(os.path.basename(filename))
        self.re_plot()
        # show times
        if self.circuit.time_series is not None:
            if self.circuit.time_series.is_ready():
                self.set_time_comboboxes()
        # tree view at the results
        self.set_results_treeview_structure()
        # populate editors
        self.populate_editors_defaults()
The treeview model assignment:
def set_results_treeview_structure(self):
    """
    Sets the results treeview data structure
    #return:
    """
    # self.ui.results_treeView.setSelectionBehavior(QtGui.QAbstractItemView.SelectRows)
    model = QtGui.QStandardItemModel()
    # model.setHorizontalHeaderLabels(['Elements'])
    self.ui.results_treeView.setModel(model)
    # self.ui.results_treeView.setUniformRowHeights(True)

    def pass_to_QStandardItem_list(list_):
        res = list()
        for elm in list_:
            elm1 = QtGui.QStandardItem(elm)
            elm1.setEditable(False)
            res.append(elm1)
        return res

    bus_results = pass_to_QStandardItem_list(['Voltages (p.u.)', 'Voltages (kV)'])
    per_bus_results = pass_to_QStandardItem_list(['Voltage (p.u.) series', 'Voltage (kV) series',
                                                  'Active power (MW)', 'Reactive power (MVar)',
                                                  'Active and reactive power (MW, MVar)', 'Aparent power (MVA)',
                                                  'S-V curve', 'Q-V curve'])
    branches_results = pass_to_QStandardItem_list(['Loading (%)', 'Current (p.u.)',
                                                   'Current (kA)', 'Losses (MVA)'])
    per_branch_results = pass_to_QStandardItem_list(['Loading (%) series', 'Current (p.u.) series',
                                                     'Current (kA) series', 'Losses (MVA) series'])
    generator_results = pass_to_QStandardItem_list(['Reactive power (p.u.)', 'Reactive power (MVar)'])
    per_generator_results = pass_to_QStandardItem_list(['Reactive power (p.u.) series',
                                                        'Reactive power (MVar) series'])

    self.family_results_per_family = dict()

    # nodes
    buses = QtGui.QStandardItem('Buses')
    buses.setEditable(False)
    buses.appendRows(bus_results)
    self.family_results_per_family[0] = len(bus_results)
    names = self.circuit.bus_names
    for name in names:
        bus = QtGui.QStandardItem(name)
        bus.appendRows(per_bus_results)
        bus.setEditable(False)
        buses.appendRow(bus)

    # branches
    branches = QtGui.QStandardItem('Branches')
    branches.setEditable(False)
    branches.appendRows(branches_results)
    self.family_results_per_family[1] = len(branches_results)
    names = self.circuit.branch_names
    for name in names:
        branch = QtGui.QStandardItem(name)
        branch.appendRows(per_branch_results)
        branch.setEditable(False)
        branches.appendRow(branch)

    # generators
    generators = QtGui.QStandardItem('Generators')
    generators.setEditable(False)
    generators.appendRows(generator_results)
    self.family_results_per_family[2] = len(generator_results)
    names = self.circuit.gen_names
    for name in names:
        gen = QtGui.QStandardItem(name)
        gen.appendRows(per_generator_results)
        gen.setEditable(False)
        generators.appendRow(gen)

    model.appendRow(buses)
    model.appendRow(branches)
    model.appendRow(generators)
And the GUI "cleaning":
def clean_GUI(self):
    """
    Initializes the comboboxes and tables
    Returns:
    """
    self.ui.tableView.setModel(None)
    if self.ui.results_treeView.model() is not None:
        self.ui.results_treeView.model().clear()
    self.ui.profile_time_selection_comboBox.clear()
    self.ui.results_time_selection_comboBox.clear()
    self.ui.gridPlot.clear()
The complete code can be seen here
I have seen that this behavior is usually triggered by calls outside the GUI thread, but I don't think this is the case here.
I'd appreciate it if someone could point out the problem. Again, the complete code for testing is here.
The solution to this in my case has been the following:
The QStandardItemModel() variable called model in the code was turned into a class attribute, self.tree_model.
When I want to replace the treeview object's model, I delete the attribute with del self.tree_model.
Then I re-create it with self.tree_model = QStandardItemModel().
This way the TreeView object's model is effectively replaced without crashing.
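In code, the pattern described above looks roughly like this (a sketch based on the description, not the project's exact code):

def set_results_treeview_structure(self):
    # Detach and delete the previous model instead of clearing it in place
    if hasattr(self, 'tree_model'):
        self.ui.results_treeView.setModel(None)
        del self.tree_model
    # Re-create the model as an attribute so it is kept alive by the class
    self.tree_model = QtGui.QStandardItemModel()
    self.ui.results_treeView.setModel(self.tree_model)
    # ... populate self.tree_model with the QStandardItems as before ...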

Exporting from MS Excel to MS Access with intermediate processing

I have an application which produces reports in Excel (.XLS) format. I need to append the data from these reports to an existing table in a MS Access 2010 database. A typical record is:
INC000000004154 Closed Cbeebies BBC Childrens HQ6 monitor wall dropping out. HQ6 P3 3/7/2013 7:03:01 PM 3/7/2013 7:03:01 PM 3/7/2013 7:14:15 PM The root cause of the problem was the power supply to the PC which was feeding the monitor. HQ6 Monitor wall dropping out. BBC Third Party Contractor supply this equipment.
The complication is that I need to do some limited processing on the data.
Specifically, I need to do a couple of lookups converting names to numbers, and also parse a date string (the report for some reason puts the dates into the spreadsheet in text format rather than date format).
Now I could do this in Python using XLRD/XLWT but would much prefer to do it in Excel or Access. Does anyone have any advice on a good way to approach this? I would very much prefer NOT to use VBA so could I do something like record an MS Excel macro and then execute that macro on the newly created XLS file?
You can directly import some Excel data into MS Access, but if your requirement is to do some processing, then I don't see how you will be able to achieve that without one of the following:
an ETL application, like Pentaho or Talend or others.
That will certainly be like using a hammer to crush an ant though.
some other external data processing pipeline, in Python or some other programming language.
VBA (whether through macros or hand-coded).
VBA has been really good at doing that sort of thing in Access for literally decades.
Since you are using Excel and Access, staying within that realm looks like the best solution for solving your issue.
Just use queries:
You import the data without transformation to a table whose sole purpose is to accommodate the data from Excel; then you create queries from that raw data to add the missing information and massage the data before appending the result into your final destination table.
That solution has the advantage of letting you create simple steps in Access that you can easily record using macros.
I asked this question some time ago and decided it would be easier to do it in Python. Gord asked me to share, and here it is (sorry about the delay, other projects took priority for a while).
"""
Routine to migrate the S7 data from MySQL to the new Access
database.
We're using the pyodbc libraries to connect to Microsoft Access
Note that there are 32- and 64-bit versions of these libraries
available but in order to work the word-length for pyodbc and by
implication Python and all its associated compiled libraries must
match that of MS Access. Which is an arse as I've just had to
delete my 64-bit installation of Python and replace it and all
the libraries with the 32-bit version.
Tim Greening-Jackson 08 May 2013 (timATgreening-jackson.com)
"""
import pyodbc
import re
import datetime
import tkFileDialog
from Tkinter import *
class S7Incident:
"""
Class containing the records downloaded from the S7.INCIDENTS table
"""
def __init__(self, id_incident, priority, begin, acknowledge,
diagnose, workaround,fix, handoff, lro, nlro,
facility, ctas, summary, raised, code):
self.id_incident=unicode(id_incident)
self.priority = {u'P1':1, u'P2':2, u'P3':3, u'P4':4, u'P5':5} [unicode(priority.upper())]
self.begin = begin
self.acknowledge = acknowledge
self.diagnose = diagnose
self.workaround = workaround
self.fix = fix
self.handoff = True if handoff else False
self.lro = True if lro else False
self.nlro = True if nlro else False
self.facility = unicode(facility)
self.ctas = ctas
self.summary = "** NONE ***" if type(summary) is NoneType else summary.replace("'","")
self.raised = raised.replace("'","")
self.code = 0 if code is None else code
self.production = None
self.dbid = None
def __repr__(self):
return "[{}] ID:{} P{} Prod:{} Begin:{} A:{} D:+{}s W:+{}s F:+{}s\nH/O:{} LRO:{} NLRO:{} Facility={} CTAS={}\nSummary:'{}',Raised:'{}',Code:{}".format(
self.id_incident,self.dbid, self.priority, self.production, self.begin,
self.acknowledge, self.diagnose, self.workaround, self.fix,
self.handoff, self.lro, self.nlro, self.facility, self.ctas,
self.summary, self.raised, self.code)
def ProcessIncident(self, cursor, facilities, productions):
"""
Produces the SQL necessary to insert the incident in to the Access
database, executes it and then gets the autonumber ID (dbid) of the newly
created incident (this is used so LRO, NRLO CTAS and AD1 can refer to
their parent incident.
If the incident is classed as LRO, NLRO, CTAS then the appropriate
record is created. Returns the dbid.
"""
if self.raised.upper() in productions:
self.production = productions[self.raised.upper()]
else:
self.production = 0
sql="""INSERT INTO INCIDENTS (ID_INCIDENT, PRIORITY, FACILITY, BEGIN,
ACKNOWLEDGE, DIAGNOSE, WORKAROUND, FIX, HANDOFF, SUMMARY, RAISED, CODE, PRODUCTION)
VALUES ('{}', {}, {}, #{}#, {}, {}, {}, {}, {}, '{}', '{}', {}, {})
""".format(self.id_incident, self.priority, facilities[self.facility], self.begin,
self.acknowledge, self.diagnose, self.workaround, self.fix,
self.handoff, self.summary, self.raised, self.code, self.production)
cursor.execute(sql)
cursor.execute("SELECT ##IDENTITY")
self.dbid = cursor.fetchone()[0]
if self.lro:
self.ProcessLRO(cursor, facilities[self.facility])
if self.nlro:
self.ProcessNLRO(cursor, facilities[self.facility])
if self.ctas:
self.ProcessCTAS(cursor, facilities[self.facility], self.ctas)
return self.dbid
def ProcessLRO(self, cursor, facility):
sql = "INSERT INTO LRO (PID, DURATION, FACILITY) VALUES ({}, {}, {})"\
.format(self.dbid, self.workaround, facility)
cursor.execute(sql)
def ProcessNLRO(self, cursor, facility):
sql = "INSERT INTO NLRO (PID, DURATION, FACILITY) VALUES ({}, {}, {})"\
.format(self.dbid, self.workaround, facility)
cursor.execute(sql)
def ProcessCTAS(self, cursor, facility, code):
sql = "INSERT INTO CTAS (PID, DURATION, FACILITY, CODE) VALUES ({}, {}, {}, {})"\
.format(self.dbid, self.workaround, facility, self.ctas)
cursor.execute(sql)
class S7AD1:
"""
S7.AD1 records.
"""
def __init__(self, id_ad1, date, ref, commentary, adjustment):
self.id_ad1 = id_ad1
self.date = date
self.ref = unicode(ref)
self.commentary = unicode(commentary)
self.adjustment = float(adjustment)
self.pid = 0
self.production = 0
def __repr__(self):
return "[{}] Date:{} Parent:{} PID:{} Amount:{} Commentary: {} "\
.format(self.id_ad1, self.date.strftime("%d/%m/%y"), self.ref, self.pid, self.adjustment, self.commentary)
def SetPID(self, pid):
self.pid = pid
def SetProduction(self, p):
self.production = p
def Process(self, cursor):
sql = "INSERT INTO AD1 (pid, begin, commentary, production, adjustment) VALUES ({}, #{}#, '{}', {}, {})"\
.format(self.pid, self.date.strftime("%d/%m/%y"), self.commentary, self.production, self.adjustment)
cursor.execute(sql)
class S7Financial:
"""
S7 monthly financial summary of income and penalties from S7.FINANCIALS table.
These are identical in the new database
"""
def __init__(self, month, year, gco, cta, support, sc1, sc2, sc3, ad1):
self.begin = datetime.date(year, month, 1)
self.gco = float(gco)
self.cta = float(cta)
self.support = float(support)
self.sc1 = float(sc1)
self.sc2 = float(sc2)
self.sc3 = float(sc3)
self.ad1 = float(ad1)
def __repr__(self):
return "Period: {} GCO:{:.2f} CTA:{:.2f} SUP:{:.2f} SC1:{:.2f} SC2:{:.2f} SC3:{:.2f} AD1:{:.2f}"\
.format(self.start.strftime("%m/%y"), self.gco, self.cta, self.support, self.sc1, self.sc2, self.sc3, self.ad1)
def Process(self, cursor):
"""
Insert in to FINANCIALS table
"""
sql = "INSERT INTO FINANCIALS (BEGIN, GCO, CTA, SUPPORT, SC1, SC2, SC3, AD1) VALUES (#{}#, {}, {}, {}, {}, {}, {},{})"\
.format(self.begin, self.gco, self.cta, self.support, self.sc1, self.sc2, self.sc3, self.ad1)
cursor.execute(sql)
class S7SC3:
"""
Miscellaneous S7 SC3 stuff. The new table is identical to the old one.
"""
def __init__(self, begin, month, year, p1ot, p2ot, totchg, succchg, chgwithinc, fldchg, egychg):
self.begin = begin
self.p1ot = p1ot
self.p2ot = p2ot
self.changes = totchg
self.successful = succchg
self.incidents = chgwithinc
self.failed = fldchg
self.emergency = egychg
def __repr__(self):
return "{} P1:{} P2:{} CHG:{} SUC:{} INC:{} FLD:{} EGY:{}"\
.format(self.period.strftime("%m/%y"), self.p1ot, self.p1ot, self.changes, self.successful, self.incidents, self.failed, self.emergency)
def Process(self, cursor):
"""
Inserts a record in to the Access database
"""
sql = "INSERT INTO SC3 (BEGIN, P1OT, P2OT, CHANGES, SUCCESSFUL, INCIDENTS, FAILED, EMERGENCY) VALUES\
(#{}#, {}, {}, {}, {}, {}, {}, {})"\
.format(self.begin, self.p1ot, self.p2ot, self.changes, self.successful, self.incidents, self.failed, self.emergency)
cursor.execute(sql)
def ConnectToAccessFile():
"""
Prompts the user for an Access database file, connects, creates a cursor,
cleans out the tables which are to be replaced, gets a hash of the facilities
table keyed on facility name returning facility id
"""
# Prompts the user to select which Access DB file he wants to use and then attempts to connect
root = Tk()
dbname = tkFileDialog.askopenfilename(parent=root, title="Select output database", filetypes=[('Access 2010', '*.accdb')])
root.destroy()
# Connect to the Access (new) database and clean its existing incidents etc. tables out as
# these will be replaced with the new data
dbcxn = pyodbc.connect("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ="+dbname+";")
dbcursor=dbcxn.cursor()
print("Connected to {}".format(dbname))
for table in ["INCIDENTS", "AD1", "LRO", "NLRO", "CTAS", "SC3", "PRODUCTIONS", "FINANCIALS"]:
print("Clearing table {}...".format(table))
dbcursor.execute("DELETE * FROM {}".format(table))
# Get the list of facilities from the Access database...
dbcursor.execute("SELECT id, facility FROM facilities")
rows = dbcursor.fetchall()
dbfacilities = {unicode(row[1]):row[0] for row in rows}
return dbcxn, dbcursor, dbfacilities
# Entry point
incre = re.compile("INC\d{12}[A-Z]?") # Regex that matches incident references
try:
dbcxn, dbcursor, dbfacilities = ConnectToAccessFile()
# Connect to the MySQL S7 (old) database and read the incidents and ad1 tables
s7cxn = pyodbc.connect("DRIVER={MySQL ODBC 3.51 Driver}; SERVER=localhost;DATABASE=s7; UID=root; PASSWORD=********; OPTION=3")
print("Connected to MySQL S7 database")
s7cursor = s7cxn.cursor()
s7cursor.execute("""
SELECT id_incident, priority, begin, acknowledge,
diagnose, workaround, fix, handoff, lro, nlro,
facility, ctas, summary, raised, code FROM INCIDENTS""")
rows = s7cursor.fetchall()
# Discard any incidents which don't have a reference of the form INC... as they are ancient
print("Fetching incidents")
s7incidents = {unicode(row[0]):S7Incident(*row) for row in rows if incre.match(row[0])}
# Get the list of productions from the S7 database to replace the one we've just deleted ...
print("Fetching productions")
s7cursor.execute("SELECT DISTINCT RAISED FROM INCIDENTS")
rows = s7cursor.fetchall()
s7productions = [r[0] for r in rows]
# ... now get the AD1s ...
print("Fetching AD1s")
s7cursor.execute("SELECT id_ad1, date, ref, commentary, adjustment from AD1")
rows = s7cursor.fetchall()
s7ad1s = [S7AD1(*row) for row in rows]
# ... and the financial records ...
print("Fetching Financials")
s7cursor.execute("SELECT month, year, gco, cta, support, sc1, sc2, sc3, ad1 FROM Financials")
rows = s7cursor.fetchall()
s7financials = [S7Financial(*row) for row in rows]
print("Writing financials ({})".format(len(s7financials)))
[p.Process(dbcursor) for p in s7financials]
# ... and the SC3s.
print("Fetching SC3s")
s7cursor.execute("SELECT begin, month, year, p1ot, p2ot, totchg, succhg, chgwithinc, fldchg, egcychg from SC3")
rows = s7cursor.fetchall()
s7sc3s = [S7SC3(*row) for row in rows]
print("Writing SC3s ({})".format(len(s7sc3s)))
[p.Process(dbcursor) for p in s7sc3s]
# Re-create the productions table in the new database. Note we refer to production
# by number in the incidents table so need to do the SELECT @@IDENTITY to give us the
# autonumber index. To make sure everything is case-insensitive convert the
# hash keys to UPPERCASE.
dbproductions = {}
print("Writing productions ({})".format(len(s7productions)))
for p in sorted(s7productions):
dbcursor.execute("INSERT INTO PRODUCTIONS (PRODUCTION) VALUES ('{}')".format(p))
dbcursor.execute("SELECT ##IDENTITY")
dbproductions[p.upper()] = dbcursor.fetchone()[0]
# Now process the incidents etc. that we have retrieved from the S7 database
print("Writing incidents ({})".format(len(s7incidents)))
[s7incidents[k].ProcessIncident(dbcursor, dbfacilities, dbproductions) for k in sorted(s7incidents)]
# Match the new parent incident IDs in the AD1s and then write to the new table. Some
# really old AD1s don't have the parent incident reference in the REF field, it is just
# mentioned SOMEWHERE in the commentary. So if the REF field doesn't match then do a
# re.search (not re.match!) for it. It isn't essential to match these older AD1s with
# their parent incident, but it is quite useful (and tidy).
print("Matching and writing AD1s".format(len(s7ad1s)))
for a in s7ad1s:
if a.ref in s7incidents:
a.SetPID(s7incidents[a.ref].dbid)
a.SetProduction(s7incidents[a.ref].production)
else:
z=incre.search(a.commentary)
if z and z.group() in s7incidents:
a.SetPID(s7incidents[z.group()].dbid)
a.SetProduction(s7incidents[z.group()].production)
a.Process(dbcursor)
print("Comitting changes")
dbcursor.commit()
finally:
print("Closing databases")
dbcxn.close()
s7cxn.close()
It turns out that the file has additional complications in terms of mangled data, which will require a degree of processing that is a pain to do in Excel but trivially simple in Python. So I will re-use some Python 2.x scripts which use the XLRD/XLWT libraries to munge the spreadsheet.
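For what it's worth, a minimal sketch of that kind of munging with xlrd (the sheet index, the date column index and the date format string are assumptions for illustration, not taken from the actual report):

import datetime
import xlrd

def read_report(path, date_col=6):
    # date_col is a hypothetical column index for the text-format timestamp
    book = xlrd.open_workbook(path)
    sheet = book.sheet_by_index(0)
    records = []
    for row in range(1, sheet.nrows):  # skip the header row
        values = sheet.row_values(row)
        # the report stores dates as text, so parse them explicitly;
        # adjust the format string to match the report's actual layout
        values[date_col] = datetime.datetime.strptime(
            values[date_col], "%m/%d/%Y %I:%M:%S %p")
        records.append(values)
    return records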

Resources