Design Pattern recommendations - Python Selenium multi-page webscraper w/ Parser and Database - python-3.x

I am working on a scraper that is growing bigger and bigger and I'm worried about making the wrong design choices.
I have never done more than short scripts in python and I'm at a loss knowing how to design a project with bigger proportions.
The scraper retrieves data from different, but similar themed websites, so an implementation for each site is needed.
The desired raw text of each website is then put through a parser which searches for the required values.
After retrieving the values they should be stored in a 3N-Database.
In its final evolution the scraper should run on a cloud service and check all the different sites periodically for new data. Speed and performance are not of highest importance but desirable. Most importantly the required data should be retrieved without unnecessary reuse of code.
I'm using the Selenium webdriver and have the driver object implemented as a singleton, so all the requests are done by the same driver object. The website text is then part of state of that object.
All the other functionality is currently modelled as functions, everything in one file. For adding another website to the project I first copied the script and just changed the retrieval part. As it soon occurred to me that that's pretty stupid I wanted to ask for design recommendations.
Would you rather implement a Retriever mother class and inherit from that for every website or is there an even better way to go?
Many thanks for any ideas!

Related

Best practices for creating a customized report based on user form input?

My Question
What are the best practices for creating a customized report based on a user form input? Specifically, how do I create an easy to maintain system which takes user input which is collected in a form and generate multiple paragraphs that explains the results of analysis.
Background
I am working on a very large multiyear project with a startup (who is my client). My job is to program analysis and generate reports to users. The pipeline for data looks like this:
Users enter information into a form -> results are calculated based on user input -> reports are displayed to users that share analysis.
It is really important to my client that some of the analysis results are displayed in paragraphs in a non-formal user friendly tone. The challenge is that the form and analysis are quite complex and will only get more complex over time. An example of the type of template for the paragraphs looks something like this:
resultsParagraphText=`Hi ${userName}. We found that the best ice cream flavour for you is ${bestIceCreamFlavor}. These other flavors ${otherFlavors} might be good for you. Here are the reasons why you might enjoy these flavors: ${reasonsWhyGoodFlavors}.
However we would not recommend these other flavors ${badFlavors}. Here are the reasons you should avoid this bad flavors: ${reasonsWhyBadFlavors}.`
These results paragraphs, of which there of many, have several minor problems which combined are significant:
If there is a bug in the code, minor visual errors would be visible to end users (capitalization errors, missing/extra commas, and so on).
A lot of string comparisons (e.g. if answers.previousFlavors.includes("Vanilla")) are required to generate the results paragraphs. Minor errors in the forms (e.g. vanilla in the form is not capitalized so answers.previousFlavors.includes("Vanilla") returns false even when user enters vanilla.) can cause errors in the results paragraph.
Changes in different parts of the project (form, analysis) directly effect how the results paragraph is made. Bad types, differences in string values, null or undefined values not being caught directly have an impact on how the results paragraph is made.
There are many edge cases (e.g. What if the user has no other suitable good flavors for them? The the sentence These other flavors ${otherFlavors} might be good for you. needs to be excluded).
It is hard to write paragraphs that use templates and have a non-formal tone.
and so on.
I have charts and other types of ways to display results and have explained to the client the challenges of sharing the information in paragraph form.
What I am looking for
I need examples, how tos, best practices on how to build a maintainable system for generating customized paragraphs based on user input. I know how to solve each of the individual issues (as they are fairly simple) but in a large project this will become very hard to maintain.
Notes
I have no clue what tags to use for the post. Feel free to edit/add tags if you know more appropriate ones.
The project is planning to use machine learning in the future other parts of the project. If there is a ML/AI solution that is useful please tell me.
I am working primarily in JavaScript, Python, C, and R, but if there is a library or tool in any other language please tell me. Finding a solution is very important to me and I would be willing to learn a lot find a best solution.
To avoid this question being removed because I have rephrased it to avoid asking for personal opinion, instead asking for existing examples or how tos. I can also imagine that others might find a solution fairly useful. If you can edit it to make the question less subjective please do so.
If you have any questions or need clarification feel free to ask. Any help is appreciated.

tensorflow for classification of strings vs elasticsearch

So, a little bit on my problem.
TL;DR
Can I use machine-learning instead of Elastic Search to find results depending on the user's text input? Is it a good idea?
I am working on a car spare parts project, and we have split the car into 300 parts that we store on the database, with some data for each part (weight, availability, etc).
When the customer inputs the text of his part, we need to be able to classify the part, and map it to one in our database.
The current way it's being done is by people on our team manually mapping the customer text with the parts on our database, we want to automate that process.
We tried using MongoDB text search, but it was often inaccurate since parts have different names in different parts of the country.
So we wanted something that got more accurate results, and improved by the more data we have, we immediately considered TensorFlow, after some research and taking part of Google's Machine Learning Crash Course, I got to that point where it specified:
Models can't learn from string values, so you'll have to perform some feature engineering to convert those values to something numeric
That would be useful in the case we have limited number of features as strings, but we don't know what the user will input as a text.
So, my questions are:
1- Can we use Machine Learning to map text input by the user with some documents on our database?
2- If we can do that, is it a good idea to favor it over other search tools like ElasticSearch?
3- Can ElasticSearch improve its results the more data we have? How?
4- How would you go about this problem?
Note: I'd be doing that in Node.js, and since TensorFlow.js is new, I am inclining to go for other solutions, but if push comes to shove, and the results are much better, I would definitely go there.
TL;DR: Yes and yes.
TS;WM:
This is a perfectly suited problem for machine learning. Especially so, if you have a database of past customer texts that have already been mapped to parts. Ideally, you have hundreds of texts mapped to each part. If that is present, you can design and train a network. And models can learn from string values with some engineering, and it's not that bad.
I'm not sure ElasticSearch would improve much on the network. I don't know much about auto parts trading, but as a wild guess, "the large round thingy that helps change direction" would never be mapped to "steering wheel" by ES but could be learned easily by a network - provided there are at least some examples of people using that text to specify steering wheel.
You can but don't have to necessarily use tensorflow.js for your network. The AI could run on your server as a webservice, and you'd just send over the customer's text to it and it would send back it's recommendations of part SKUs and names.

Automating Raw Export Data Cleansing for Client Onboarding - Format is Always Different

So a bit of a general question. I work as a data analyst for a startup. My primary process involves taking existing customer data a client has and cleansing/normalizing it to fit into our platform once as part of our onboarding process. A member of our team exports their data from their system they are transitioning from or, if they kept track of it in house, we receive their Excel log they used to track it. It is always in a different format and requires extensive cleansing (avg 1 min/record). We take what is usually one large table (.xlxs format), and after cleansing, split it into four .csv files; which we load as four tables on our platform.
I feel I have optimized the process quite well in terms of the process steps and cleansing with excel functions (if, concat, text-to-columns, etc). I have beginner-intermediate skills in VBA and SQL and have just scratched the surface in R; what is frustrating is that I know there is the potential to automate this process but I just don't know where to start. If anyone has experience with something like this, code, a link to an article / another thread, or just some general direction would be much appreciated. Please ask for clarification where you feel it is needed. Thanks.
This will be really hard to do in Excel. If you have the time you can try out Optimus, a Data Cleansing library written in Python and Pyspark (you don't need to know spark). Here is the webpage https://hioptimus.com.
You can create Data Pipelines with it, and I recommend that you do that, try to generalize your processes, and asking the client for more a structure way of passing the data.
The good thing is that you don't need Big Data for running Optimus, bit if you have it some day, the same code will work.
Check out the documentation for more:
http://optimus-ironmussa.readthedocs.io/en/latest/
Let me know if you have doubts!

What is the best method to extract live data from an application? (details inside)

Since I have little experience programming, I first tried posting this "job" on a freelance website. Then, 4 programmers who seemed to what they were doing failed to solve it (maybe they didn't know what they were doing). After this, I decided to attempt it myself, and that is why I came to Stack Overflow, which I believe will point me in the right direction.
The problem appears quite simple: the program in question gives me rows and columns of data, just like a spreadsheet. As time goes by, new rows are added on top. It looks like this:
I just need to replicate this data inside an Excel spreadsheet, so that I can perform analysis.
I will keep it short, as I don't know what further detail I could give. Perhaps looking at the program files could help in establishing what sort of program it is. Download link: http://xpproupdate.xpi.com.br/xppro.zip
Thanks!
Some loose ideas:
Method 1 (assuming this is an app connected to the Internet):
Try packet-sniffing. Instead of extracting the data from the app download a packet sniffing app and look and the data-flow. See on what port the app is exchanging data. If the data is not encryped the tasks should be fairly easy.
As a reference see this packet sniffer in C#:
http://www.codeproject.com/Articles/17031/A-Network-Sniffer-in-C
Method 2 (assuming no connection to the Internet, or if there is encryption involved):
If the data is encrypted or the app simply does not interact with the Internet then try to access the app's Win32 window handle and traverse it's internal controls.
Method 3 (last resort):
Frequent window image screenshot and scraping the data from the image using a simple OCR.

How to ease updating inferno with web performance test scripts

Updating can performance test script e.g. with LoadRunner can take a lot of time and be quite frustrating. If there has been some updates with the applications, you usually have to run the script and then find out what has to be changed, update and run again and so on. Does anyone have some concrete best practices how to ease this updating inferno? One obvious thing is good communication with developers.
It depends on the kind of updates. If the update is dramatic, like adding new fields for user to fill in, then, someone has to manually touch up the test scripts.
If, however, the update is minor, for example, some changes to the hidden fields or changes to the internal names of user-facing fields, then it's possible to write a script that checks the change and automatically updates the test script.
One of the performance test platforms, NetGend, automatically takes care of the hidden fields and the internal names of user-facing fields so it's very easy to create a script to performance-test a HTML form. Tester only needs to fill in the values that he/she would have to enter using a browser, so no correlation is necessary there. Please send me a message if you need to know more about it.
There are many things you can do to insulate your scripts from build to build variability. The higher up the OSI stack you go the lower the maintenance charge, but the higher the resource cost for the virtual user type. Assuming changes are limited to page level resources and a few hidden fields here and there for web sites or applications, then you can record in HTML mode. You blast the EXTRARES sections as the page parser in HTML mode will automatically parse the page and load the page resources even without an explicit reference - It can be a real pain to keep these sections in synch if you have developers who are experimenting quite a bit.
Next up, for forms which have a very high velocity in terms of change consider the use of a web_custom_request() for the one form. You can use correlation statements to pick up all of the name|value pairs as needed and build the form submit dynamically. There will be a little bit more up front work for this but you should have pay offs at around the fourth changed build where you would normally have been rebuilding some scripts.
Take a look at all of the hosts referenced in your code. Parameterize all of these items. I have a template that I use for web virtual users which pairs a default value and the ability to change any of the host names via the control panel extra attributes section. Take a look at the example for lr_get_attrib_string() for how you might implement the pickup and pair that with a check for NULL and a population with a default value in your code
This is going to seem counter intuitive, but comment your script heavily for changes that are occurring often so you know where to take the extra labor change up front to handle a more dynamic data set.
Almost nothing you do with any tool can save you from struuctural changes in the design and flow of the app, such as the insertion of a new page in the workflow, but paying attention to the design on the high change pages, of which there are typically a small number, can result in a test code with a very long life.
Of course if your application is web services based then there is a natual long life to the use of exposed public services. Code may change on the back end of the service, but typically the exposed public interface is very stable.

Resources