For work I need to extract data from websites and write this data to a CSV file. At this stage I'm using Selenium and Perl (a very powerful pair), but yesterday I thought of this solution:
Selenium IDE ---via JS---> Web app on Node.js webserver ------> CSV
Do you think this is possible? Or is there another, more "elegant" solution?
The idea is general, so I can use it for data storage, but testers can also use it to improve their tests with the stored variables, so it's general purpose.
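To make the idea concrete, here is a minimal sketch of what the Node.js end of that pipeline could look like, assuming the Selenium IDE script POSTs a small JSON record to the web app; the port, field names, and file path are placeholders, not part of the actual setup:

// Tiny HTTP endpoint that receives JSON from the Selenium IDE script
// and appends one CSV row per request (placeholder fields).
var http = require('http');
var fs = require('fs');

http.createServer(function (req, res) {
    var body = '';
    req.on('data', function (chunk) { body += chunk; });
    req.on('end', function () {
        var record = JSON.parse(body);   // e.g. { name: "...", value: "..." }
        var row = record.name + ',' + record.value + '\n';
        fs.appendFile('data.csv', row, function (err) {
            res.writeHead(err ? 500 : 200);
            res.end();
        });
    });
}).listen(8080);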
For scraping purposes you can use the jsdom module, as shown here:
http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs
For generating CSV, this module is nice:
https://github.com/koles/ya-csv
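Roughly, the two can be combined like this (a sketch based on the jsdom.env API used in that blog post and ya-csv's createCsvFileWriter; the URL and selectors are placeholders):

// Scrape a page with jsdom + injected jQuery and write rows with ya-csv.
var jsdom = require('jsdom');
var csv = require('ya-csv');

var writer = csv.createCsvFileWriter('output.csv');

jsdom.env('http://example.com/page.html',
          ['http://code.jquery.com/jquery.js'],      // inject jQuery for easy selection
          function (errors, window) {
    if (errors) { console.error(errors); return; }
    var $ = window.$;
    $('table tr').each(function () {
        var cells = $(this).find('td').map(function () {
            return $(this).text();
        }).get();
        if (cells.length) writer.writeRecord(cells); // one CSV row per table row
    });
});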
But there are easier ways to do it, such as using Mechanize in Perl, Ruby, or Python.
I am new to writing Python code. I have currently written a few modules for data analysis projects. The data is queried from AWS Redshift tables and summarized in CSVs and Excel spreadsheets.
At this point I do not want to pass it on to other users in the org, as I do not want to expose the code.
Is there an easy way to operationalize the code without exposing it?
PS: I am in the process of learning front-end development (Flask, HTML, CSS) so users can input data and get results back.
Python programs are almost always shipped as bare source. There are ways of compiling Python code into binaries, but this is not a common thing to do and usually I would not recommend it, as it's not as easy as one might expect (which is too bad, really).
That said, you can check out cx_Freeze and Cython.
For those who don't know, Matillion is an ETL/ELT tool that can be used to handle Snowflake data flows (among others).
One interesting feature is that we can write script tasks in either Bash or Python.
I had a similar experience in the past with SQL Server Integration Services where it was possible to write C# within tasks as well.
IMHO this presented two big flaws:
SSIS packages being stored as a "blob" made them extremely ill-suited to version control. Any tiny change (like just adjusting a task in a pipeline) usually made comparing two versions practically impossible.
Sharing code between tasks was extremely difficult (was it even possible?).
Matillion "jobs" are stored as JSON and, as with SSIS, it is impossible to compare two versions of the same job, regardless of how tiny the change.
Also, coding something big in Python within a simple text window is just not feasible.
So, I would like to write my Python code outside Matillion and just use Matillion tasks as "glue" between the different functions/packages I would write outside.
Does anyone have experience doing this?
How can I make my Python file/package available to Matillion Python scripts?
How could I handle different versions of my Python packages in the different Matillion "Versions" of my jobs?
Thanks
I am sure that by now you might have figured something out.
For everyone, in case they are looking, here is another method:
Within Matillion there is a Python script component.
We can use it to trigger Python scripts outside of Matillion.
I am doing it all the time with the following snippet:
import os

# Run a script on a remote machine over SSH; sshpass supplies the password
# non-interactively, and os.system returns the command's exit status.
return_var = os.system("sshpass -p <Remote_Machine_User_Password> ssh <Remote_Machine_User>@<Remote_Machine_IP> -o StrictHostKeyChecking=no '/path/to/python/executable/in/remote/machine /path/to/python/script/in/remote/machine'")
print(return_var)
Thanks!
To answer your question: "How can I make my Python file/package available to Matillion Python scripts? How could I handle different versions of my Python packages in the different Matillion "Versions" of my jobs?"
I think "bash" component within Matillion will be helpful for you. The only requirement here is you should be able to ssh from matillion server to any other server where your external python package exists.
Within the bash component of your Matillion job, you just need to include the ssh code to the remote server and command line to execute your python package. I believe "Here doc" or Bash Here Document is what will help you:
# Everything between << doc and doc is executed on the remote server
ssh -T $remote_server_ip << doc
ls -lrt
python /path/to/python_script/python_script.py
doc
Regarding your question about versioning, I am not sure, but I am also interested in a solution.
Thanks
From a performance/maintenance point of view, is it better to write my custom NetSuite modules as one big JS file, or as multiple segmented script files?
If you compare it with a server-side JavaScript platform, say Node.js (the most popular), every module is written in a separate file.
I generally take an object-oriented JavaScript approach and put each class in a separate file, which helps organise the code.
One approach you can take is to keep separate files during development and merge them with a JS minifier such as the Google Closure Compiler when you deploy for production use. That gives you the best of both worlds, if you are really bothered about every last bit of performance.
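To illustrate the class-per-file pattern being compared to, a hypothetical Node.js-style split could look like this (customer.js and app.js are made-up names):

// customer.js - one class per file, exported as a module
function Customer(name) {
    this.name = name;
}
Customer.prototype.greet = function () {
    return 'Hello, ' + this.name;
};
module.exports = Customer;

// app.js - pulls the class in where it is needed
var Customer = require('./customer');
console.log(new Customer('Acme').greet());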
If you look at the SuiteScript 2.0 architecture, it encourages a modular structure which is easier to manage, as you load only the modules you need, and it is easier to maintain multiple code files (i.e. one per module) for future enhancements, bug fixes, and code reuse.
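For illustration, a shared SuiteScript 2.0 module might look something like this (the file name and the formatDate helper are hypothetical):

/**
 * @NApiVersion 2.x
 * @NModuleScope Public
 */
// Hypothetical shared library, e.g. lib/dateutils.js
define([], function () {
    function formatDate(d) {
        // Format a JS Date as MM/DD/YYYY
        var mm = ('0' + (d.getMonth() + 1)).slice(-2);
        var dd = ('0' + d.getDate()).slice(-2);
        return mm + '/' + dd + '/' + d.getFullYear();
    }
    return { formatDate: formatDate };
});

An entry-point script would then pull it in with define(['./lib/dateutils'], function (dateutils) { ... }), so only the scripts that need the helper actually load it.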
Performance can never be judged by the line count of your module. We generally split code into modules to keep it readable and simple. It is good practice to put all generic functionality into a utility script and use it as a library across all the modules. Again, it depends on your code logic and programming style. So if you want to split your JS file into multiple segments for readability, I don't think it's a bad idea.
I am a beginner learning NodeJS. I am sure the scripting language has its own data types, variables, control structures, iteration structures, etc., but I am not able to find any documentation about them.
Please provide some references.
::EDIT::
Until now JS has run in browsers, where specific functionality could be achieved.
How can I write a standalone program that asks the user to input a date in mm/dd/yyyy format using NodeJS? On the browser side we would say:
val = window.prompt('Enter a Date in MM/DD/YYYY format!','');
Is there a way to write the same code in NodeJS without running it in a browser? Does this also mean that browser-side JS functionality cannot be used in NodeJS? Please clarify.
You write Node.js servers using JavaScript. So you need to look up the JavaScript documentation.
It's all JavaScript, but what you really need is Node's API documentation.
http://nodejs.org/api/documentation.html
As for getting started, looking at some example code is always a good move.
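For example, a rough equivalent of window.prompt in a standalone Node.js program can be built with the built-in readline module (the date check below is just an illustrative regex):

// Prompt for a date on the command line instead of via window.prompt.
var readline = require('readline');

var rl = readline.createInterface({ input: process.stdin, output: process.stdout });

rl.question('Enter a Date in MM/DD/YYYY format! ', function (val) {
    if (/^\d{2}\/\d{2}\/\d{4}$/.test(val)) {
        console.log('You entered: ' + val);
    } else {
        console.log('Not a valid MM/DD/YYYY date.');
    }
    rl.close();
});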
I'm looking to scrape public data off of many different local government websites. This data is not provided in any standard format (XML, RSS, etc.) and must be scraped from the HTML. I need to scrape this data and store it in a database for future reference. Ideally the scraping routine would run on a recurring basis and store only the new records in the database. There should be an easy way for me to tell the new records from the old on each of these websites.
My big question is: What's the best method to accomplish this? I've heard some use YQL. I also know that some programming languages make parsing HTML data easier as well. I'm a developer with knowledge in a few different languages and want to make sure I choose the proper language and method to develop this so it's easy to maintain. As the websites change in the future the scraping routines/code/logic will need to be updated so it's important that this will be fairly easy.
Any suggestions?
I would use Perl with modules WWW::Mechanize (web automation) and HTML::TokeParser (HTML parsing).
Otherwise, I would use Python with the Mechanize module (web automation) and the BeautifulSoup module (HTML parsing).
I agree with David about Perl and Python. Ruby also has Mechanize and is excellent for scraping. The only one I would stay away from is PHP, due to its lack of scraping libraries and clumsy regex functions. As far as YQL goes, it's good for some things, but for scraping it really just adds an extra layer of things that can go wrong (in my opinion).
Well, I would use my own scraping library or the corresponding command line tool.
It can use templates which can scrape most web pages without any actual programming, normalize similar data from different sites to a canonical format and validate that none of the pages has changed its layout...
The command line tool doesn't support databases though; there you would need to program something...
(On the other hand, Webharvest says it supports databases, but it has no templates.)