Apache Beam hanging on GroupByKey after windowing - not triggering - python-3.x
TLDR;
How do I correctly trigger count-based windows with the Python SDK?
Problem
I'm trying to make a pipeline for transforming and indexing a Wikipedia dump.
The objective is:
Read from a compressed file - just one process, in a streaming fashion, as the file doesn't fit in RAM
Process each element in parallel (ParDo)
Group these elements in a count window (GroupByKey on a single key to go from streaming to batch) in just one process, to save them to a DB.
Development
For that, I created a simple source class that returns tuples of the form (index, data, count):
class CountingSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, offset_range_tracker):
        # timestamp = datetime.now()
        k = 0
        with gzip.open(file_name, "rt", encoding="utf-8", errors="strict") as f:
            # Structure: index, page, index, page, ...
            line = f.readline()
            while line:
                # Pair each index line with the page line that follows it,
                # and attach a running count.
                yield line, f.readline(), k
                k += 1
                line = f.readline()
And I made the pipeline:
_beam_pipeline_args = [
    "--runner=DirectRunner",
    "--streaming",
    # "--direct_num_workers=5",
    # "--direct_running_mode=multi_processing",
]
with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    pipeline = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "With timestamps" >> beam.Map(lambda data: beam.window.TimestampedValue(data, data[-1]))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        # * not working, keep stuck at group - not triggering the window
        | "window" >> beam.WindowInto(
            beam.window.GlobalWindows(),
            trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(10)),
            accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING,
        )
        | "Map to tuple" >> beam.Map(lambda data: (None, data))
        # | "Print" >> beam.Map(lambda data: print(data))
        | "Group all per window" >> beam.GroupByKey()
        | "Discard key" >> beam.Values()
        | "Index data" >> beam.Map(index_data)
    )
If I remove the window and pass directly from "Filter nones" to "Index data", the pipeline works, but it indexes the elements individually. Also, if I uncomment the print step I can see that I still have data after the "Map to tuple" step, but it hangs on "Group all per window" without any log output. I tried timed triggering too, changing the window to
>> beam.WindowInto(beam.window.FixedWindows(10))
but this changed nothing (which was supposed to behave the same, since I create a "count timestamp" during data extraction).
Am I misunderstanding something about the windowing? The objective was just to index the data in batches.
Alternative
I can "hack" this last step using a custom do.Fn like:
class BatchIndexing(beam.DoFn):
    def __init__(self, connection_string, batch_size=50000):
        self._connection_string = connection_string
        self._batch_size = batch_size
        self._total = 0

    def setup(self):
        from sqlalchemy import create_engine
        from sqlalchemy.orm import sessionmaker
        from scripts.wikipedia.wikipedia_articles.beam_module.documents import Base

        engine = create_engine(self._connection_string, echo=False)
        self.session = sessionmaker(bind=engine)(autocommit=False, autoflush=False)
        Base.metadata.create_all(engine)

    def start_bundle(self):
        # buffer for string of lines
        self._lines = []

    def process(self, element):
        # Input element is the processed pair
        self._lines.append(element)
        if len(self._lines) >= self._batch_size:
            self._total += len(self._lines)
            self._flush_batch()

    def finish_bundle(self):
        # takes care of the unflushed buffer before finishing
        if self._lines:
            self._flush_batch()

    def _flush_batch(self):
        self.index_data(self._lines)
        # Clear the buffer.
        self._lines = []

    def index_data(self, entries_to_index):
        """
        Index batch of data.
        """
        print(f"Indexed {self._total} entries")
        self.session.add_all(entries_to_index)
        self.session.commit()
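One small addition that may help here (my own suggestion, not part of the original class): closing the SQLAlchemy session when the worker tears the DoFn down, so connections are not left open. A minimal sketch:

    def teardown(self):
        # Hypothetical cleanup: release the DB connection when the DoFn is torn down.
        if getattr(self, "session", None) is not None:
            self.session.close()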
I then change the pipeline to:
with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    pipeline = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        | "Unroll" >> beam.FlatMap(lambda data: data)
        | "Index data" >> beam.ParDo(BatchIndexing(connection_string, batch_size=10000))
    )
Which "works" but do the last step in parallel (thus, overwhelming de database or generating locked database problems with sqlite) and I would like to have just one Sink to communicate with the database.
Triggers in Beam are not a hard requirement on the runner. My guess would be that the trigger does not manage to fire before the input ends. The early trigger of 10 elements means the runner is allowed to trigger after 10 elements, but it does not have to (this relates to how Beam splits inputs into bundles).
The FixedWindows(10) window is fixed on a 10-second interval, and your data will all have the same timestamp, so that is not going to help either.
If your goal is to group data into batches, there is a very handy transform for exactly that: GroupIntoBatches. It should work for your use case and has additional features, like limiting how long a record can wait in a batch before being processed.
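For illustration, a minimal sketch of how the tail of the question's pipeline could look with GroupIntoBatches (the step names and the batch size are placeholders; in recent SDK versions GroupIntoBatches also accepts a max_buffering_duration_secs argument to cap how long a batch may wait):

    ...
    | "Filter nones" >> beam.Filter(lambda data: data != [])
    | "Key on None" >> beam.Map(lambda data: (None, data))  # GroupIntoBatches expects (key, value) pairs
    | "Batch" >> beam.GroupIntoBatches(10000)                # emits (None, [up to 10000 elements])
    | "Drop key" >> beam.Values()
    | "Index data" >> beam.Map(index_data)                   # one call per batch; a single key keeps a single writer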
Related
Apache Beam / Dataflow pub/sub side input with python
I'm new to Apache Beam, so I'm struggling a bit with the following scenario:

Pub/Sub topic using streaming mode
Transform to take out customerId
Parallel PCollection with a Transform/ParDo that fetches data from Firestore based on the "customerId" received in the Pub/Sub topic (using a side input)
...

The ParDo transform that tries to fetch the Firestore data does not run at all. If I use a fixed "customerId" value everything works as expected ... although not using a proper fetch from Firestore (a simple ParDo), it works. Am I doing something I'm not supposed to? Including my code below:

class getFirestoreUsers(beam.DoFn):
    def process(self, element, customerId):
        print(f'Getting Users from Firestore, ID: {customerId}')
        # Call function to initialize Database
        db = intializeFirebase()
        """
        # get customer information from the database
        doc = db.document(f'Customers/{customerId}').get()
        customer = doc.to_dict()
        """
        usersList = {}
        # Get Optin Users
        try:
            docs = db.collection(
                f'Customers/{customerId}/DevicesWiFi_v3').where(u'optIn', u'==', True).stream()
            usersList = {user.id: user.to_dict() for user in docs}
        except Exception as err:
            print(f"Error: couldn't retrieve OPTIN users from DevicesWiFi")
            print(err)
        return([usersList])

Main code:

def run(argv=None):
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--topic', type=str, help='Pub/Sub topic to read from')
    parser.add_argument(
        '--output', help=('Output local filename'))

    args, pipeline_args = parser.parse_known_args(argv)
    options = PipelineOptions(pipeline_args)
    options.view_as(SetupOptions).save_main_session = True
    options.view_as(StandardOptions).streaming = True

    p = beam.Pipeline(options=options)

    users = (p | 'Create chars' >> beam.Create([
        {
            "clientMac": "7c:d9:5c:b8:6f:38",
            "username": "Louis"
        },
        {
            "clientMac": "48:fd:8e:b0:6f:38",
            "username": "Paul"
        }
    ]))

    # Get Dictionary from Pub/Sub
    data = (p | 'Read from PubSub' >> beam.io.ReadFromPubSub(topic=args.topic)
              | 'Parse JSON to Dict' >> beam.Map(lambda e: json.loads(e)))

    # Get customerId from Pub/Sub information
    PcustomerId = (data | 'get customerId from Firestore' >> beam.ParDo(lambda x: [x.get('customerId')]))
    PcustomerId | 'print customerId' >> beam.Map(print)

    # Get Users from Firestore
    custUsers = (users | 'Read from Firestore' >> beam.ParDo(
        getFirestoreUsers(), customerId=beam.pvalue.AsSingleton(PcustomerId)))
    custUsers | 'print Users from Firestore' >> beam.Map(print)

In order to avoid errors when running the function I had to initialise the "users" dictionary, which I completely ignore afterwards. I suppose I have several errors here, so your help is much appreciated.
It's not clear to me how the users PCollection is used in the example code (since element is not processed in the process definition). I've re-arranged the code a little bit with windowing and used the customer_id as the main input.

class GetFirestoreUsers(beam.DoFn):
    def setup(self):
        # Call function to initialize Database
        self.db = intializeFirebase()

    def process(self, element):
        print(f'Getting Users from Firestore, ID: {element}')
        """
        # get customer information from the database
        doc = self.db.document(f'Customers/{element}').get()
        customer = doc.to_dict()
        """
        usersList = {}
        # Get Optin Users
        try:
            docs = self.db.collection(
                f'Customers/{element}/DevicesWiFi_v3').where(u'optIn', u'==', True).stream()
            usersList = {user.id: user.to_dict() for user in docs}
        except Exception as err:
            print(f"Error: couldn't retrieve OPTIN users from DevicesWiFi")
            print(err)
        return([usersList])

data = (p | 'Read from PubSub' >> beam.io.ReadFromPubSub(topic=args.topic)
          | beam.WindowInto(window.FixedWindows(60))
          | 'Parse JSON to Dict' >> beam.Map(lambda e: json.loads(e)))

# Get customerId from Pub/Sub information
customer_id = (data | 'get customerId from Firestore' >> beam.Map(lambda x: x.get('customerId')))
customer_id | 'print customerId' >> beam.Map(print)

# Get Users from Firestore
custUsers = (customer_id | 'Read from Firestore' >> beam.ParDo(
    GetFirestoreUsers()))
custUsers | 'print Users from Firestore' >> beam.Map(print)

From your comment:

the data needed (customerID first and customers data after) is not ready when running the "main" PCollection with original JSON data from Pub/Sub

Did you mean the data in Firestore is not ready when reading the Pub/Sub topic? You can always split the logic into 2 pipelines in your main function and run them one after another.
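A minimal sketch of that "two pipelines run one after another" idea (the transform names, placeholder data and topic path are mine, not from the original answer):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    options = PipelineOptions(argv)

    # First pipeline: prepare or verify whatever data must exist beforehand.
    # The with-block waits until this pipeline has finished.
    with beam.Pipeline(options=options) as p1:
        _ = (p1
             | "Prepare reference data" >> beam.Create(["placeholder"])
             | "Inspect" >> beam.Map(print))

    # Second pipeline: the streaming Pub/Sub job, started only after the first completes.
    with beam.Pipeline(options=options) as p2:
        _ = (p2
             | "Read from PubSub" >> beam.io.ReadFromPubSub(topic="projects/<project>/topics/<topic>")
             | "Process" >> beam.Map(print))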
Is there a way to keep track of all the bad records that are allowed while loading an ndjson file into BigQuery
I have a requirement where I need to keep track of all the bad records that were not fed into BigQuery after allowing max_bad_records, so I need them written to a file on storage for future reference. I'm using the BQ API for Python; is there a way we can achieve this? I think that if we allow max_bad_records, we don't have the details of the failed rows in the BQ load job. Thanks
Currently, there isn't a direct way of accessing and saving the bad records. However, you can access some job statistics, including the reason why each record was skipped, within BigQuery _job_statistics().

I have created an example in order to demonstrate how the statistics are shown. I have the following sample .csv file in a GCS bucket:

name,age
robert,25
felix,23
john,john

As you can see, the last row is a bad record, because I will import age as INT64 and there is a string in that row. In addition, I used the following code to upload it to BigQuery:

from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('dataset').table('table_name')

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("age", "INT64"),
    ]
)
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.skip_leading_rows = 1
job_config.max_bad_records = 5
#job_config.autodetect = True
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = "gs://path/file.csv"

load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))

# Below, all the statistics that might be useful in your case
job_state = load_job.state
job_id = load_job.job_id
error_result = load_job.error_result
job_statistics = load_job._job_statistics()
badRecords = job_statistics['badRecords']
outputRows = job_statistics['outputRows']
inputFiles = job_statistics['inputFiles']
inputFileBytes = job_statistics['inputFileBytes']
outputBytes = job_statistics['outputBytes']

print("***************************** ")
print(" job_state: " + str(job_state))
print(" non fatal error: " + str(load_job.errors))
print(" error_result: " + str(error_result))
print(" job_id: " + str(job_id))
print(" badRecords: " + str(badRecords))
print(" outputRows: " + str(outputRows))
print(" inputFiles: " + str(inputFiles))
print(" inputFileBytes: " + str(inputFileBytes))
print(" outputBytes: " + str(outputBytes))
print(" ***************************** ")
print("------ load_job.errors ")

The output from the statistics:

*****************************
job_state: DONE
non fatal errors: [{u'reason': u'invalid', u'message': u"Error while reading data, error message: Could not parse 'john' as INT64 for field age (position 1) starting at location 23", u'location': u'gs://path/file.csv'}]
error_result: None
job_id: b2b63e39-a5fb-47df-b12b-41a835f5cf5a
badRecords: 1
outputRows: 2
inputFiles: 1
inputFileBytes: 33
outputBytes: 26
*****************************

As shown above, the errors field returns the non-fatal errors, which include the bad records; in other words, it retrieves the individual errors generated by the job. The error_result, on the other hand, returns the error information for the job as a whole. I believe these statistics can help you analyse your bad records.

Lastly, you can output them to a file using write(), such as:

with open("errors.txt", "x") as f:
    f.write(str(load_job.errors))
    f.close()
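Since load_job.errors is a list of dictionaries (or None), a more structured way to persist it is as JSON; a small sketch using the standard json module (the file name is just a placeholder):

import json

# load_job.errors is a list of dicts describing the skipped records (or None).
with open("bad_records.json", "w") as f:
    json.dump(load_job.errors or [], f, indent=2)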
Monitoring WriteToBigQuery
In my pipeline I use WriteToBigQuery something like this:

| beam.io.WriteToBigQuery(
    'thijs:thijsset.thijstable',
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

This returns a Dict, as described in the documentation as follows:

The beam.io.WriteToBigQuery PTransform returns a dictionary whose BigQueryWriteFn.FAILED_ROWS entry contains a PCollection of all the rows that failed to be written.

How do I print this dict and turn it into a PCollection, or how do I just print the FAILED_ROWS? If I do:

| "print" >> beam.Map(print)

then I get:

AttributeError: 'dict' object has no attribute 'pipeline'

I must have read a hundred pipelines but never have I seen anything after the WriteToBigQuery.

[edit] When I finish the pipeline and store the result in a variable I have the following:

{'FailedRows': <PCollection[WriteToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn).FailedRows] at 0x7f0e0cdcfed0>}

But I do not know how to use this result in the pipeline like this:

| beam.io.WriteToBigQuery(
    'thijs:thijsset.thijstable',
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
| ['FailedRows'] from previous step
| "print" >> beam.Map(print)
Dead letters to handle invalid inputs are a common Beam/Dataflow usage and work with both the Java and Python SDKs, but there are not many examples of the latter.

Imagine that we have some dummy input data with 10 good lines and a bad row that does not conform to the table schema:

schema = "index:INTEGER,event:STRING"
data = ['{0},good_line_{1}'.format(i + 1, i + 1) for i in range(10)]
data.append('this is a bad row')

Then, what I do is name the write result (events in this case):

events = (p
    | "Create data" >> beam.Create(data)
    | "CSV to dict" >> beam.ParDo(CsvToDictFn())
    | "Write to BigQuery" >> beam.io.gcp.bigquery.WriteToBigQuery(
        "{0}:dataflow_test.good_lines".format(PROJECT),
        schema=schema,
    )
)

and then access the FAILED_ROWS side output:

(events[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
    | "Bad lines" >> beam.io.textio.WriteToText("error_log.txt"))

This works well with the DirectRunner: it writes the good lines to BigQuery and the bad one to a local file:

$ cat error_log.txt-00000-of-00001
('PROJECT_ID:dataflow_test.good_lines', {'index': 'this is a bad row'})

If you run it with the DataflowRunner you'll need some additional flags. If you face the TypeError: 'PDone' object has no attribute '__getitem__' error, you'll need to add --experiments=use_beam_bq_sink to use the new BigQuery sink.

If you get a KeyError: 'FailedRows', it's because the new sink defaults to load jobs for batch pipelines. The method parameter can be STREAMING_INSERTS, FILE_LOADS, or DEFAULT (an introduction on loading data to BigQuery: https://cloud.google.com/bigquery/docs/loading-data). DEFAULT will use STREAMING_INSERTS on streaming pipelines and FILE_LOADS on batch pipelines. You can override the behavior by specifying method='STREAMING_INSERTS' in WriteToBigQuery.

Full code for both DirectRunner and DataflowRunner here.
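The CsvToDictFn used above is not shown in the answer; a minimal sketch of what it could look like for the index:INTEGER,event:STRING schema (my own assumption, not code from the original answer):

import apache_beam as beam

class CsvToDictFn(beam.DoFn):
    """Hypothetical parser: zip the schema's field names with the CSV values."""
    FIELDS = ["index", "event"]

    def process(self, line):
        # A malformed line still yields a dict (possibly with missing or badly typed
        # fields), so it reaches BigQuery and comes back via FAILED_ROWS instead of
        # crashing the parse step.
        yield dict(zip(self.FIELDS, line.split(",")))

For the bad row in the example, this yields {'index': 'this is a bad row'}, which matches the failed-row output shown above.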
QTreeView crashing for no apparent reason
I introduced a treeview in the GUI of the program I'm making, and since then it crashes when I attempt to change its model once it has been set. The course of action is:

Load the file using a file dialogue.
Clear the models on the interface objects (tables and treeview). The first time the treeview is not affected since there is no model in it.
Populate the treeview model.
Other stuff not related to the issue.

The problematic functions are:

The file loading procedure:

def open_file(self):
    """
    Open a file
    :return:
    """
    print("actionOpen_file_click")
    # declare the dialog
    # file_dialog = QtGui.QFileDialog(self)
    # declare the allowed file types
    files_types = "Excel 97 (*.xls);;Excel (*.xlsx);;DigSILENT (*.dgs);;MATPOWER (*.m)"
    # call dialog to select the file
    filename, type_selected = QtGui.QFileDialog.getOpenFileNameAndFilter(self, 'Open file',
                                                                         self.project_directory, files_types)
    if len(filename) > 0:
        self.project_directory = os.path.dirname(filename)
        print(filename)
        self.circuit = Circuit(filename, True)
        # set data structures list model
        self.ui.dataStructuresListView.setModel(self.available_data_structures_listModel)
        # set the first index
        index = self.available_data_structures_listModel.index(0, 0, QtCore.QModelIndex())
        self.ui.dataStructuresListView.setCurrentIndex(index)
        # clean
        self.clean_GUI()
        # load table
        self.display_objects_table()
        # draw graph
        self.ui.gridPlot.setTitle(os.path.basename(filename))
        self.re_plot()
        # show times
        if self.circuit.time_series is not None:
            if self.circuit.time_series.is_ready():
                self.set_time_comboboxes()
        # tree view at the results
        self.set_results_treeview_structure()
        # populate editors
        self.populate_editors_defaults()

The treeview model assignation:

def set_results_treeview_structure(self):
    """
    Sets the results treeview data structure
    :return:
    """
    # self.ui.results_treeView.setSelectionBehavior(QtGui.QAbstractItemView.SelectRows)
    model = QtGui.QStandardItemModel()
    # model.setHorizontalHeaderLabels(['Elements'])
    self.ui.results_treeView.setModel(model)
    # self.ui.results_treeView.setUniformRowHeights(True)

    def pass_to_QStandardItem_list(list_):
        res = list()
        for elm in list_:
            elm1 = QtGui.QStandardItem(elm)
            elm1.setEditable(False)
            res.append(elm1)
        return res

    bus_results = pass_to_QStandardItem_list(['Voltages (p.u.)', 'Voltages (kV)'])
    per_bus_results = pass_to_QStandardItem_list(['Voltage (p.u.) series', 'Voltage (kV) series',
                                                  'Active power (MW)', 'Reactive power (MVar)',
                                                  'Active and reactive power (MW, MVar)', 'Aparent power (MVA)',
                                                  'S-V curve', 'Q-V curve'])
    branches_results = pass_to_QStandardItem_list(['Loading (%)', 'Current (p.u.)', 'Current (kA)', 'Losses (MVA)'])
    per_branch_results = pass_to_QStandardItem_list(['Loading (%) series', 'Current (p.u.) series',
                                                     'Current (kA) series', 'Losses (MVA) series'])
    generator_results = pass_to_QStandardItem_list(['Reactive power (p.u.)', 'Reactive power (MVar)'])
    per_generator_results = pass_to_QStandardItem_list(['Reactive power (p.u.) series', 'Reactive power (MVar) series'])

    self.family_results_per_family = dict()

    # nodes
    buses = QtGui.QStandardItem('Buses')
    buses.setEditable(False)
    buses.appendRows(bus_results)
    self.family_results_per_family[0] = len(bus_results)
    names = self.circuit.bus_names
    for name in names:
        bus = QtGui.QStandardItem(name)
        bus.appendRows(per_bus_results)
        bus.setEditable(False)
        buses.appendRow(bus)

    # branches
    branches = QtGui.QStandardItem('Branches')
    branches.setEditable(False)
    branches.appendRows(branches_results)
    self.family_results_per_family[1] = len(branches_results)
    names = self.circuit.branch_names
    for name in names:
        branch = QtGui.QStandardItem(name)
        branch.appendRows(per_branch_results)
        branch.setEditable(False)
        branches.appendRow(branch)

    # generators
    generators = QtGui.QStandardItem('Generators')
    generators.setEditable(False)
    generators.appendRows(generator_results)
    self.family_results_per_family[2] = len(generator_results)
    names = self.circuit.gen_names
    for name in names:
        gen = QtGui.QStandardItem(name)
        gen.appendRows(per_generator_results)
        gen.setEditable(False)
        generators.appendRow(gen)

    model.appendRow(buses)
    model.appendRow(branches)
    model.appendRow(generators)

And the GUI "cleaning":

def clean_GUI(self):
    """
    Initializes the comboboxes and tables
    Returns:
    """
    self.ui.tableView.setModel(None)
    if self.ui.results_treeView.model() is not None:
        self.ui.results_treeView.model().clear()
    self.ui.profile_time_selection_comboBox.clear()
    self.ui.results_time_selection_comboBox.clear()
    self.ui.gridPlot.clear()

The complete code can be seen here. I have seen that this behavior is usually triggered by calls outside the GUI thread, but I don't think this is the case. I'd appreciate it if someone could point out the problem. Again, the complete code for testing is here.
The solution to this in my case has been the following:

The QStandardItemModel() variable called model in the code was turned into a class-wide attribute, self.tree_model.
When I want to replace the treeview object's model, I delete the existing model with del self.tree_model.
Then I re-create the model with self.tree_model = QStandardItemModel().

This way the TreeView object's model is effectively replaced without crashing.
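A minimal sketch of that pattern, using the widget and method names from the question (this is my illustration of the described fix, not the poster's exact code):

def set_results_treeview_structure(self):
    # Drop the previous model entirely instead of clear()-ing it in clean_GUI().
    if hasattr(self, 'tree_model'):
        del self.tree_model

    # Re-create the model and hand it to the view.
    self.tree_model = QtGui.QStandardItemModel()
    self.ui.results_treeView.setModel(self.tree_model)
    # ... populate self.tree_model with QStandardItems as before ...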
Exporting from MS Excel to MS Access with intermediate processing
I have an application which produces reports in Excel (.XLS) format. I need to append the data from these reports to an existing table in an MS Access 2010 database. A typical record is:

INC000000004154 Closed Cbeebies BBC Childrens HQ6 monitor wall dropping out. HQ6 P3 3/7/2013 7:03:01 PM 3/7/2013 7:03:01 PM 3/7/2013 7:14:15 PM The root cause of the problem was the power supply to the PC which was feeding the monitor. HQ6 Monitor wall dropping out. BBC Third Party Contractor supply this equipment.

The complication is that I need to do some limited processing on the data. Specifically, I need to do a couple of lookups converting names to numbers, and also parse a date string (for some reason the report puts the dates into the spreadsheet in text format rather than date format). Now I could do this in Python using XLRD/XLWT, but I would much prefer to do it in Excel or Access. Does anyone have any advice on a good way to approach this? I would very much prefer NOT to use VBA, so could I do something like record an MS Excel macro and then execute that macro on the newly created XLS file?
You can directly import some Excel data into MS Access, but if your requirement is to do some processing first, then I don't see how you will be able to achieve that without:

an ETL application, like Pentaho or Talend or others. That would certainly be like using a hammer to crush an ant, though.
some other external data-processing pipeline, in Python or some other programming language.
VBA (whether through macros or hand-coded). VBA has been really good at doing that sort of thing in Access for literally decades.

Since you are using Excel and Access, staying within that realm looks like the best way to solve your issue. Just use queries: you import the data without transformation into a table whose sole purpose is to accommodate the data from Excel; then you create queries from that raw data to add the missing information and massage the data before appending the result to your final destination table. That solution has the advantage of letting you create simple steps in Access that you can easily record using macros.
I asked this question some time ago and decided it would be easier to do it in Python. Gord asked me to share, and here it is (sorry about the delay, other projects took priority for a while).

"""
Routine to migrate the S7 data from MySQL to the new Access database.

We're using the pyodbc libraries to connect to Microsoft Access.

Note that there are 32- and 64-bit versions of these libraries available, but in
order to work the word-length for pyodbc, and by implication Python and all its
associated compiled libraries, must match that of MS Access. Which is an arse as
I've just had to delete my 64-bit installation of Python and replace it and all
the libraries with the 32-bit version.

Tim Greening-Jackson 08 May 2013 (timATgreening-jackson.com)
"""
import pyodbc
import re
import datetime
import tkFileDialog
from Tkinter import *

class S7Incident:
    """
    Class containing the records downloaded from the S7.INCIDENTS table
    """
    def __init__(self, id_incident, priority, begin, acknowledge, diagnose, workaround,
                 fix, handoff, lro, nlro, facility, ctas, summary, raised, code):
        self.id_incident = unicode(id_incident)
        self.priority = {u'P1': 1, u'P2': 2, u'P3': 3, u'P4': 4, u'P5': 5}[unicode(priority.upper())]
        self.begin = begin
        self.acknowledge = acknowledge
        self.diagnose = diagnose
        self.workaround = workaround
        self.fix = fix
        self.handoff = True if handoff else False
        self.lro = True if lro else False
        self.nlro = True if nlro else False
        self.facility = unicode(facility)
        self.ctas = ctas
        self.summary = "** NONE ***" if type(summary) is NoneType else summary.replace("'", "")
        self.raised = raised.replace("'", "")
        self.code = 0 if code is None else code
        self.production = None
        self.dbid = None

    def __repr__(self):
        return "[{}] ID:{} P{} Prod:{} Begin:{} A:{} D:+{}s W:+{}s F:+{}s\nH/O:{} LRO:{} NLRO:{} Facility={} CTAS={}\nSummary:'{}',Raised:'{}',Code:{}".format(
            self.id_incident, self.dbid, self.priority, self.production, self.begin,
            self.acknowledge, self.diagnose, self.workaround, self.fix, self.handoff,
            self.lro, self.nlro, self.facility, self.ctas, self.summary, self.raised, self.code)

    def ProcessIncident(self, cursor, facilities, productions):
        """
        Produces the SQL necessary to insert the incident in to the Access database,
        executes it and then gets the autonumber ID (dbid) of the newly created
        incident (this is used so LRO, NRLO, CTAS and AD1 can refer to their parent
        incident). If the incident is classed as LRO, NLRO, CTAS then the appropriate
        record is created. Returns the dbid.
        """
        if self.raised.upper() in productions:
            self.production = productions[self.raised.upper()]
        else:
            self.production = 0

        sql = """INSERT INTO INCIDENTS (ID_INCIDENT, PRIORITY, FACILITY, BEGIN,
                 ACKNOWLEDGE, DIAGNOSE, WORKAROUND, FIX, HANDOFF, SUMMARY, RAISED, CODE, PRODUCTION)
                 VALUES ('{}', {}, {}, #{}#, {}, {}, {}, {}, {}, '{}', '{}', {}, {})
              """.format(self.id_incident, self.priority, facilities[self.facility],
                         self.begin, self.acknowledge, self.diagnose, self.workaround,
                         self.fix, self.handoff, self.summary, self.raised, self.code,
                         self.production)
        cursor.execute(sql)
        cursor.execute("SELECT @@IDENTITY")
        self.dbid = cursor.fetchone()[0]
        if self.lro:
            self.ProcessLRO(cursor, facilities[self.facility])
        if self.nlro:
            self.ProcessNLRO(cursor, facilities[self.facility])
        if self.ctas:
            self.ProcessCTAS(cursor, facilities[self.facility], self.ctas)
        return self.dbid

    def ProcessLRO(self, cursor, facility):
        sql = "INSERT INTO LRO (PID, DURATION, FACILITY) VALUES ({}, {}, {})"\
            .format(self.dbid, self.workaround, facility)
        cursor.execute(sql)

    def ProcessNLRO(self, cursor, facility):
        sql = "INSERT INTO NLRO (PID, DURATION, FACILITY) VALUES ({}, {}, {})"\
            .format(self.dbid, self.workaround, facility)
        cursor.execute(sql)

    def ProcessCTAS(self, cursor, facility, code):
        sql = "INSERT INTO CTAS (PID, DURATION, FACILITY, CODE) VALUES ({}, {}, {}, {})"\
            .format(self.dbid, self.workaround, facility, self.ctas)
        cursor.execute(sql)

class S7AD1:
    """
    S7.AD1 records.
    """
    def __init__(self, id_ad1, date, ref, commentary, adjustment):
        self.id_ad1 = id_ad1
        self.date = date
        self.ref = unicode(ref)
        self.commentary = unicode(commentary)
        self.adjustment = float(adjustment)
        self.pid = 0
        self.production = 0

    def __repr__(self):
        return "[{}] Date:{} Parent:{} PID:{} Amount:{} Commentary: {} "\
            .format(self.id_ad1, self.date.strftime("%d/%m/%y"), self.ref, self.pid,
                    self.adjustment, self.commentary)

    def SetPID(self, pid):
        self.pid = pid

    def SetProduction(self, p):
        self.production = p

    def Process(self, cursor):
        sql = "INSERT INTO AD1 (pid, begin, commentary, production, adjustment) VALUES ({}, #{}#, '{}', {}, {})"\
            .format(self.pid, self.date.strftime("%d/%m/%y"), self.commentary,
                    self.production, self.adjustment)
        cursor.execute(sql)

class S7Financial:
    """
    S7 monthly financial summary of income and penalties from the S7.FINANCIALS table.
    These are identical in the new database.
    """
    def __init__(self, month, year, gco, cta, support, sc1, sc2, sc3, ad1):
        self.begin = datetime.date(year, month, 1)
        self.gco = float(gco)
        self.cta = float(cta)
        self.support = float(support)
        self.sc1 = float(sc1)
        self.sc2 = float(sc2)
        self.sc3 = float(sc3)
        self.ad1 = float(ad1)

    def __repr__(self):
        return "Period: {} GCO:{:.2f} CTA:{:.2f} SUP:{:.2f} SC1:{:.2f} SC2:{:.2f} SC3:{:.2f} AD1:{:.2f}"\
            .format(self.start.strftime("%m/%y"), self.gco, self.cta, self.support,
                    self.sc1, self.sc2, self.sc3, self.ad1)

    def Process(self, cursor):
        """
        Insert in to FINANCIALS table
        """
        sql = "INSERT INTO FINANCIALS (BEGIN, GCO, CTA, SUPPORT, SC1, SC2, SC3, AD1) VALUES (#{}#, {}, {}, {}, {}, {}, {},{})"\
            .format(self.begin, self.gco, self.cta, self.support, self.sc1, self.sc2, self.sc3, self.ad1)
        cursor.execute(sql)

class S7SC3:
    """
    Miscellaneous S7 SC3 stuff. The new table is identical to the old one.
    """
    def __init__(self, begin, month, year, p1ot, p2ot, totchg, succchg, chgwithinc, fldchg, egychg):
        self.begin = begin
        self.p1ot = p1ot
        self.p2ot = p2ot
        self.changes = totchg
        self.successful = succchg
        self.incidents = chgwithinc
        self.failed = fldchg
        self.emergency = egychg

    def __repr__(self):
        return "{} P1:{} P2:{} CHG:{} SUC:{} INC:{} FLD:{} EGY:{}"\
            .format(self.period.strftime("%m/%y"), self.p1ot, self.p1ot, self.changes,
                    self.successful, self.incidents, self.failed, self.emergency)

    def Process(self, cursor):
        """
        Inserts a record in to the Access database
        """
        sql = "INSERT INTO SC3 (BEGIN, P1OT, P2OT, CHANGES, SUCCESSFUL, INCIDENTS, FAILED, EMERGENCY) VALUES\
            (#{}#, {}, {}, {}, {}, {}, {}, {})"\
            .format(self.begin, self.p1ot, self.p2ot, self.changes, self.successful,
                    self.incidents, self.failed, self.emergency)
        cursor.execute(sql)

def ConnectToAccessFile():
    """
    Prompts the user for an Access database file, connects, creates a cursor,
    cleans out the tables which are to be replaced, gets a hash of the
    facilities table keyed on facility name returning facility id
    """
    # Prompts the user to select which Access DB file he wants to use and then attempts to connect
    root = Tk()
    dbname = tkFileDialog.askopenfilename(parent=root, title="Select output database",
                                          filetypes=[('Access 2010', '*.accdb')])
    root.destroy()

    # Connect to the Access (new) database and clean its existing incidents etc. tables out as
    # these will be replaced with the new data
    dbcxn = pyodbc.connect("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=" + dbname + ";")
    dbcursor = dbcxn.cursor()
    print("Connected to {}".format(dbname))
    for table in ["INCIDENTS", "AD1", "LRO", "NLRO", "CTAS", "SC3", "PRODUCTIONS", "FINANCIALS"]:
        print("Clearing table {}...".format(table))
        dbcursor.execute("DELETE * FROM {}".format(table))

    # Get the list of facilities from the Access database...
    dbcursor.execute("SELECT id, facility FROM facilities")
    rows = dbcursor.fetchall()
    dbfacilities = {unicode(row[1]): row[0] for row in rows}
    return dbcxn, dbcursor, dbfacilities

# Entry point
incre = re.compile("INC\d{12}[A-Z]?")  # Regex that matches incident references
try:
    dbcxn, dbcursor, dbfacilities = ConnectToAccessFile()

    # Connect to the MySQL S7 (old) database and read the incidents and ad1 tables
    s7cxn = pyodbc.connect("DRIVER={MySQL ODBC 3.51 Driver}; SERVER=localhost;DATABASE=s7; UID=root; PASSWORD=********; OPTION=3")
    print("Connected to MySQL S7 database")
    s7cursor = s7cxn.cursor()
    s7cursor.execute("""
        SELECT id_incident, priority, begin, acknowledge, diagnose, workaround,
               fix, handoff, lro, nlro, facility, ctas, summary, raised, code
        FROM INCIDENTS""")
    rows = s7cursor.fetchall()

    # Discard any incidents which don't have a reference of the form INC... as they are ancient
    print("Fetching incidents")
    s7incidents = {unicode(row[0]): S7Incident(*row) for row in rows if incre.match(row[0])}

    # Get the list of productions from the S7 database to replace the one we've just deleted ...
    print("Fetching productions")
    s7cursor.execute("SELECT DISTINCT RAISED FROM INCIDENTS")
    rows = s7cursor.fetchall()
    s7productions = [r[0] for r in rows]

    # ... now get the AD1s ...
    print("Fetching AD1s")
    s7cursor.execute("SELECT id_ad1, date, ref, commentary, adjustment from AD1")
    rows = s7cursor.fetchall()
    s7ad1s = [S7AD1(*row) for row in rows]

    # ... and the financial records ...
    print("Fetching Financials")
    s7cursor.execute("SELECT month, year, gco, cta, support, sc1, sc2, sc3, ad1 FROM Financials")
    rows = s7cursor.fetchall()
    s7financials = [S7Financial(*row) for row in rows]
    print("Writing financials ({})".format(len(s7financials)))
    [p.Process(dbcursor) for p in s7financials]

    # ... and the SC3s.
    print("Fetching SC3s")
    s7cursor.execute("SELECT begin, month, year, p1ot, p2ot, totchg, succhg, chgwithinc, fldchg, egcychg from SC3")
    rows = s7cursor.fetchall()
    s7sc3s = [S7SC3(*row) for row in rows]
    print("Writing SC3s ({})".format(len(s7sc3s)))
    [p.Process(dbcursor) for p in s7sc3s]

    # Re-create the productions table in the new database. Note we refer to production
    # by number in the incidents table so need to do the SELECT @@IDENTITY to give us the
    # autonumber index. To make sure everything is case-insensitive convert the
    # hash keys to UPPERCASE.
    dbproductions = {}
    print("Writing productions ({})".format(len(s7productions)))
    for p in sorted(s7productions):
        dbcursor.execute("INSERT INTO PRODUCTIONS (PRODUCTION) VALUES ('{}')".format(p))
        dbcursor.execute("SELECT @@IDENTITY")
        dbproductions[p.upper()] = dbcursor.fetchone()[0]

    # Now process the incidents etc. that we have retrieved from the S7 database
    print("Writing incidents ({})".format(len(s7incidents)))
    [s7incidents[k].ProcessIncident(dbcursor, dbfacilities, dbproductions) for k in sorted(s7incidents)]

    # Match the new parent incident IDs in the AD1s and then write to the new table. Some
    # really old AD1s don't have the parent incident reference in the REF field, it is just
    # mentioned SOMEWHERE in the commentary. So if the REF field doesn't match then do a
    # re.search (not re.match!) for it. It isn't essential to match these older AD1s with
    # their parent incident, but it is quite useful (and tidy).
    print("Matching and writing AD1s".format(len(s7ad1s)))
    for a in s7ad1s:
        if a.ref in s7incidents:
            a.SetPID(s7incidents[a.ref].dbid)
            a.SetProduction(s7incidents[a.ref].production)
        else:
            z = incre.search(a.commentary)
            if z and z.group() in s7incidents:
                a.SetPID(s7incidents[z.group()].dbid)
                a.SetProduction(s7incidents[z.group()].production)
        a.Process(dbcursor)

    print("Committing changes")
    dbcursor.commit()

finally:
    print("Closing databases")
    dbcxn.close()
    s7cxn.close()
It turns out that the file has additional complications in terms of mangled data, which will require a degree of processing that is a pain to do in Excel but trivially simple in Python. So I will re-use some Python 2.x scripts which use the XLRD/XLWT libraries to munge the spreadsheet.
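For what it's worth, a minimal sketch of the kind of xlrd-based munging described above (the column positions, the lookup table and the date format are assumptions for illustration, not details from the original post):

# Python 2.x sketch: read the report with xlrd, convert a name to a number and parse a text date.
import xlrd
from datetime import datetime

NAME_TO_ID = {"HQ6": 6}  # hypothetical lookup table

book = xlrd.open_workbook("report.xls")
sheet = book.sheet_by_index(0)

rows = []
for r in range(sheet.nrows):
    values = sheet.row_values(r)
    values[4] = NAME_TO_ID.get(values[4], 0)  # name -> number lookup (column index guessed)
    values[7] = datetime.strptime(values[7], "%m/%d/%Y %I:%M:%S %p")  # parse the text date (format guessed)
    rows.append(values)
# 'rows' can then be appended to the Access table, e.g. with pyodbc INSERT statements.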