Using BigQuery to analyze IIS logs

Is there a preferred way (or an example) to load and analyze IIS logs (in Extended Log File Format) using BigQuery? We will also need to auto-partition the data, and we can get log files periodically.
We want to analyze two things: usage of a particular feature, which can be identified by a particular URL pattern, and a conversion funnel of the most popular flows that visitors take through the website, to identify where they come in and leave. Visitors can be identified by a unique ID in a cookie (stored in the logs), and pages can be linked with the referer (also stored in the logs).
Thanks in advance

It's easy to load CSV files into BigQuery; both CSV and JSON source formats are supported.
I am not an expert in IIS, but the quickest way to load flat log data into BigQuery is to start with CSV. The IIS log format is pretty straightforward to work with, but you might want to save a step and export it to CSV. A quick search shows that many people use LogParser (note: I have never used it myself) to convert IIS logs into CSV. Perhaps give this or a similar tool a try.
As for "auto-partioning" your BigQuery dataset tables - BigQuery doesn't do this automatically, but it's fairly easy to create a new table for each batch of IIS logs you export.
Depending on the volume of data you are analysing, you should create a new BigQuery table per day or hour.
Scripting this on the command line is pretty easy when using the BigQuery command line tool. Create a new BigQuery load job, with a new table name based on each timeslice of log data you have.
In other words, your BigQuery tables should look something like this:
mydataset.logs_2012_10_29
mydataset.logs_2012_10_30
mydataset.logs_2012_10_31
etc...
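For example, here is a minimal sketch of scripting one load job per day with the bq command line tool, driven from Python. The bucket path, file naming, and schema are placeholders, not taken from your logs; adjust them to whatever LogParser produces.
import subprocess
from datetime import date

# Hypothetical values: replace the bucket, file naming, and schema with your own.
log_day = date(2012, 10, 29)
table = "mydataset.logs_" + log_day.strftime("%Y_%m_%d")
source = "gs://my-log-bucket/iis_" + log_day.strftime("%Y%m%d") + ".csv"
schema = "date:STRING,time:STRING,c_ip:STRING,cs_uri_stem:STRING,cs_referer:STRING,cs_cookie:STRING"

# One load job per timeslice; the table name encodes the date.
subprocess.check_call(["bq", "load", "--source_format=CSV", "--skip_leading_rows=1", table, source, schema])
Run one such job per day (or per hour, for higher volumes), and the date-stamped table names give you the partitioning scheme described above.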
For more information, make sure you read through the BigQuery documentation for importing data.

Related

Cognos REST API and scheduling schema loading

I am trying to find out more information about using the REST API to create a schedule for schema loading. Right now, I have to reload the particular schemas via my data server connections manually (click on every schema and Load Metadata) and would like to automate this process.
Any pointers will be much appreciated.
Thank you
If the metadata of your data warehouse is so in flux that you need to reload it frequently enough to want to automate the process, then you need to understand that your data warehouse is in no way ready for use.
So, the question becomes: why would you want to frequently reload the metadata of a data source schema? I'm guessing that you are refreshing the data in your database and, because your query cache has not expired, you are not seeing the new data.
So the answer is, you probably don't want to do what you think you need to do unless you can convince me otherwise.
Also, if you enter some obvious search terms you will find the Cognos Analytics REST API documentation without too much difficulty.

How to process large .kryo files for graph data using TinkerPop/Gremlin

I am new to Apache TinkerPop.
I have done some basic things like installing the TinkerPop Gremlin Console, creating a graph .kryo file, loading it in the Gremlin Console, and executing some basic Gremlin queries. All good so far.
But I wanted to check how we can process .kryo files that are very large, say more than 1000 GB. If I create a single .kryo file, loading it in the console (or through some code) is not feasible, I think.
Is there any way we can deal with graph data that is this large?
Basically, I have some graph data stored in an Amazon Neptune DB; I want to take it out, store it in files (e.g. .kryo), and process it later with Gremlin queries. Thanks in advance.
Rather than use Kryo, which is Java specific, I would recommend using something more language agnostic such as CSV files. If you are using Amazon Neptune you can use the Neptune Export tool to export your data as CSV files.
See the Neptune Export documentation, Git repo, and CloudFormation template for details.

Is there a way to log stats/artifacts from an AWS Glue job using mlflow?

Could you please let me know if any such feature is available in the current version of MLflow?
I think the general answer here is that you can log arbitrary data and artifacts from your experiment to your MLflow tracking server using mlflow_log_artifact() or mlflow_set_tag(), depending on how you want to do it. If there's an API to get data from Glue and you can fetch it during your MLflow run, then you can log it. Write a csv, save a .png to disk and log that, or declare a variable and access it when you are setting the tag.
This applies for Glue or any other API that you are getting a response from. One of the key benefits of MLflow is that it is such a general framework, so you can track what matters to that particular experiment.
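As a rough sketch in Python (the function names above are the R flavour of the API; the Python equivalents are shown below, and the job name and metrics file are made up for illustration):
import mlflow

# Hypothetical example: "glue_job_metrics.csv" is a file you have already
# fetched or built yourself from the Glue API; MLflow just stores whatever you hand it.
with mlflow.start_run():
    mlflow.set_tag("glue_job_name", "my-glue-job")   # assumed tag name and value
    mlflow.log_artifact("glue_job_metrics.csv")      # any local file works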
Hope this helps!

Azure CosmosDB - Download all documents in collection to local directory

I am trying to download all the documents in my Cosmos DB collection to a local directory. I want to modify a few things in all of the JSON documents using Python, then upload them to another Azure account. What is the simplest, fastest way to download all of the documents in my collection? Should I use the Cosmos DB emulator? I've been told to check out Azure Data Factory; would that help with downloading files locally? I've also been referred to Cosmos DB's data migration tool, and I saw that it facilitates importing data into Cosmos DB, but I can't find much on exporting. I have about 6 GB of JSON documents in my collection.
Thanks.
In the past I've used the DocumentDb (CosmosDb) Data Migration Tool which is available for download from Microsoft.
When running the app you need to specify the source and target.
Make sure that you choose to Import from DocumentDb and specify the connection string and collection you want to export from. If you want to dump the entire contents of your collection the query would just be
SELECT * FROM c
Then under the Target Information you can choose a JSON file which will be saved to your local hard drive. You're free to modify the contents of that file in any way and then use it as Source Information later when you're ready to import it back to another collection.
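If you would rather script the export in Python instead of using the GUI tool, a minimal sketch with the azure-cosmos package looks something like this (the account URL, key, database, and container names are placeholders):
import json
import os
from azure.cosmos import CosmosClient  # pip install azure-cosmos

# Placeholder connection details - substitute your own account values.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<collection>")

os.makedirs("export", exist_ok=True)

# Read every document and dump each one to a local JSON file named after its id.
for item in container.read_all_items():
    with open(os.path.join("export", item["id"] + ".json"), "w") as f:
        json.dump(item, f)
You can then edit the JSON files locally and push them to the second account with the same client pointed at the other connection string.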
I used the migration tool and found that it is great if you have a reasonably sized database, but it does use processing and bandwidth for a considerable period. I had to chunk a 10 GB database and that took too long, so I ended up using Data Lake Analytics to transfer the data via script to SQL Server and Blob Storage. It gives you a lot of flexibility to transform the data and store it either in Data Lake or other distributed systems. It also helps if you are using Cosmos for staging and need to run the data through any cleaning algorithms.
The other advantages are that you can set up batching and you get a lot of processing stats to determine how to optimize large data transformations. Hope this helps. Cheers.

Approach for File System Based Data Storage for Web Application

I am looking for the optimal approach to using file-system-based data storage in a web application.
Basically, I am developing a Google Chrome extension, which is in fact content-script based. Its functionality is as follows:
The extension is a content-script-based one and it will be fired for each webpage the user visits.
The extension will fetch some data continuously (every 5-10 seconds), in JSON format, from a database at a cross-browser location and display that data as a ticker on each webpage. The content script will modify the DOM of web pages to display the ticker.
With the above scheme, I have noticed that the continuous fetching of data greatly increases the server's and the client's bandwidth consumption. Hence, I am planning an approach that maintains the data in a file system, bundled with the extension and accessed locally, to avoid the bandwidth consumption.
The files I can maintain are text, CSV, or even XML. The issue is that I need to read the data files through JavaScript, jQuery, or AJAX, none of which have efficient file-handling and file-access mechanisms.
Can anyone suggest an optimal approach, with suitable file-access mechanisms, for the above problem?
Also, if you can suggest a whole new approach other than file-system-based data storage, that would be really helpful to me.
If all you need is to read some data from files, then you can bundle those files with the extension.
As for which format to store the data in those files, I can think of 3 options:
You can read XML files through XMLHttpRequest and parse them with jQuery. It is very easy and powerful with all the jQuery selectors at your disposal.
You can store data in JSON format, read it the same way, and parse it with JSON.parse().
You can directly make JavaScript objects out of your data and simply include this file into your background page through a <script src="local.js"> tag. So your local.js would look something like this:
var data = [{obj1: "value1"}, ...];
I have used XML for years, based on advice from Microsoft stating that a small-volume site can do this.
But XML almost always loads the whole document, hence its size will influence performance.
I tried some 40,000+ nodes in different browsers three years ago and, strangely enough, MS seems to be the one that can handle this :)
And AJAX was created to stream XML.
