Alloy Analyzer 4.1.10 - Generating Large Datasets

I am new to Alloy. I want to generate large data sets with the help of Alloy, but I am not sure how to do it.
I am able to generate data in small amounts, but I want to generate a large amount of data that is unique.
I am currently using the run command, and it's perfectly fine if I get data in the format Name_0, Name_1, ... (for first names).
Any tips in this regard would be a great help.

Related

How do I find the right data design and the right tools/database/query for the requirement below

I have a requirement but am not able to figure out how I can solve it. I have datasets in the format below:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or, if I put it in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations, and queries over multiple rows, like:
time difference between last 2 rows where id=123
time difference between the last 2 rows where id=123 and grade=A
time difference between the first, 3rd, 5th, and latest rows
all data (or the last 10 records for a particular id) should be easily accessible.
I also need to do further computation. What format should I choose for the dataset,
and what database/tools should I use?
I don't think a relational database is useful here. I am not able to solve it with Solr/Elasticsearch; if you have any ideas, please give a brief explanation. Or any other tool: Spark, Hadoop, Cassandra, any leads?
I am trying out things but any help is appreciated.
Choosing the right technology depends heavily on things related to your SLA: how much latency can your queries tolerate? What are your query types? Is your data big data or not? Is the data updatable? Do we expect late events? Do we need historical data in the future, or can we use techniques like rollup? And so on. To clarify my answer: you can probably solve your problems by using window functions. For example, you can store your data in any of the tools you mentioned and, by using the Presto SQL engine, query it to get your desired result. But not all of them are optimal. Furthermore, these kinds of problems usually cannot be solved with a single tool; a set of tools can cover all the requirements.
tl;dr: the text below does not arrive at a single solution; it introduces a way to think about data modeling and choosing tools.
Let me try to model the problem in order to choose a single tool. I assume your data is not updatable, you need a low-latency response time, we don't expect any late events, and we face a large-volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned by ID. The first, second, and last requirements all rely on ID as the identifying part, and there seems to be no need for joins or global ordering on other fields like time. So we can choose ID as the partition key (physical or logical) and atime as the clustering part: for each ID, events are ordered by time.
The third requirement is a bit vague: do you want results over all data, or per ID?
For computing the first three conditions, we need a tool that supports window functions.
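To make the window-function idea concrete, here is a minimal sketch of the first query (time difference between the last two rows where id=123) using Python's built-in sqlite3 module; it stands in for Presto or any other engine with window functions, and assumes the bundled SQLite is 3.25 or newer. The table and values are toy data.

    import sqlite3

    con = sqlite3.connect(":memory:")  # toy in-memory DB, just to illustrate the idea
    con.execute("CREATE TABLE grades (id INTEGER, atime INTEGER, grade TEXT)")
    con.executemany("INSERT INTO grades VALUES (?, ?, ?)",
                    [(123, 10, "A"), (241, 20, "B"), (123, 35, "C"), (123, 70, "B")])

    # LAG() computes, per id, the previous event time; the outer query keeps
    # only the most recent gap for id = 123.
    row = con.execute("""
        SELECT atime - prev_atime AS diff
        FROM (
            SELECT id, atime,
                   LAG(atime) OVER (PARTITION BY id ORDER BY atime) AS prev_atime
            FROM grades
        )
        WHERE id = 123 AND prev_atime IS NOT NULL
        ORDER BY atime DESC
        LIMIT 1
    """).fetchone()
    print(row)  # (35,) -> gap between the last two rows for id=123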
Based on the notes above, it seems we should choose a tool that has good support for random-access queries. Cassandra, Postgres, Druid, MongoDB, and Elasticsearch are the tools I can currently think of. Let's check them:
Cassandra: It's great for response time on random-access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you have to design your data model carefully, and it seems it is not a good choice here (because of the future need for raw data). We could bypass some of these limitations by using Spark alongside Cassandra, but for now we prefer to avoid adding a new tool to our stack.
Postgres: It's great for random-access queries on indexed columns, and it supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we get data locality for computations). But there is a problem: ID is not unique, so we cannot choose ID as the primary key, and we face some problems with random access. (We could choose ID and atime, as a timestamp column, as a compound primary key, but that does not save us.)
Druid: It's a great OLAP tool. Based on the way Druid stores data (segment files), by choosing the right data model you can run analytic queries over a huge volume of data in sub-second time. It does not support window functions, but with rollup and some other functions (like EARLIEST) we can answer our questions. However, by using rollup we lose the raw data, and we need it.
MongoDB: It supports random-access queries and sharding. We can also get a kind of window function from its aggregation framework by defining pipelines for doing aggregations. It supports capped collections, which we can use to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
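As a hedged illustration of those pipelines, here is a pymongo sketch answering the first question (time difference between the last two events for id=123); the database and collection names are hypothetical, and atime is assumed to be stored as a BSON date or a numeric timestamp.

    from pymongo import DESCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local instance
    events = client["demo"]["events"]                  # hypothetical db/collection

    pipeline = [
        {"$match": {"id": 123}},
        {"$sort": {"atime": DESCENDING}},              # newest first
        {"$limit": 2},                                 # keep only the last two events
        {"$group": {"_id": "$id",
                    "newest": {"$max": "$atime"},
                    "oldest": {"$min": "$atime"}}},
        {"$project": {"diff": {"$subtract": ["$newest", "$oldest"]}}},
    ]
    print(list(events.aggregate(pipeline)))            # difference in ms if atime is a date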
Elasticsearch: It's great at random access, maybe the greatest. With some kinds of filter aggregations we can get a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine that we can answer the first and second questions with ES, but for now I can't form the query in my mind; it takes time to find the right solution with it.
So it seems MongoDB and Elasticsearch can meet our requirements, but there are a lot of 'if's along the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

Append Query and Filter Variables to Output in Oracle OBIEE

I frequently use OBIEE to create custom analyses from predefined subject areas. I often make extensive use of filters, as I'm typically pulling data on a very specific issue from a huge dataset. A problem I have repeatedly come across is trying to recreate a previous analysis complete with the filter variables.
Obviously, if it's something I know I'll come back to, I'll save the query. But that's often not the case. Maybe a month or two will go by and I'll need to generate a new version of a previous report to compare with the original, and I end up not being able to trust the output because I'm not sure it's using the same variables.
Is there a way to append the query and filter variables to the report itself so it can be easily referenced and recreated?
Additional info.
* I almost exclusively output the data from OBIEE as a .csv and do most of the work in Excel pivot tables that I save as .xlsx.
* I'm not a DBA.
a) You can always save filters as presentation catalog objects instead of clicking them together from zero every single time. Think LEGO bricks for adults. OBIEE is based on the concept of stored and reusable objects.
b) You will never have your filters in your CSV output, since CSV is a raw data output and not a formatted/graphical one. If you were using a graphical analysis export, you could always add the "Filters View" to your results.
Long story short: since you're using OBIEE as a data-dump tool and circumventing what it is designed to do and how it is supposed to function, you're constraining yourself in terms of the benefits and functionality you can get from it.

Persisting only part of a data source

I'm using intake to access the catalog catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface.
At the moment I only work with small patches of that data, but accessing it every single time is still quite costly (it's on Google Cloud Storage). So I want to use intake's persist option to store that data locally. However, as far as I've understood from the docs, it looks like one can only persist the whole dataset. For that specific dataset, that would amount to almost 400 dollars at a cost of $0.10 per GB, since the total data is 3976 GB.
Hence my questions:
Is there a way (especially for a zarr file, which in theory should make this quite easy) to persist only part of the data (for instance, only a subset of the variables)?
This is probably more complicated, but can I push things further by persisting only the regions of data I'm interested in (in terms of coordinate values, for instance)?
There is no direct Intake way to do what you are asking for. Intake was conceived as a way to get your data into a format that you can then manipulate as you normally would, i.e., it deals only with the loading part, so a persisted dataset is the same as the original.
However, it is not hard to accomplish manually: grab the xarray Dataset, filter for the region you need, and call to_zarr to save the new dataset. You can then point a simple catalogue entry, like the old one, at the new location.
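For example, here is a minimal sketch of that manual route with intake-xarray; the catalog file name, variable name, and coordinate slices are assumptions you would replace with your own.

    import intake

    cat = intake.open_catalog("master.yaml")   # hypothetical path to your catalog
    source = cat.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface
    ds = source.to_dask()                      # lazy, dask-backed xarray Dataset

    # Keep only the variables and region of interest (names/slices are illustrative).
    subset = ds[["surface_temp"]].sel(xt_ocean=slice(-60, -30),
                                      yt_ocean=slice(-70, -40))

    # Write just that subset to a local zarr store; only these chunks get downloaded.
    subset.to_zarr("GFDL_CM2_6_surface_subset.zarr")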
You could have done this manipulation in a driver directly if this were a specific pattern that would repeat a lot. In fact, we have mooted the idea of whether/how to implement such processing steps in Intake, but there is no plan yet. In the end, we may adopt the work on pipelines in Holoviews to describe processing steps.

Best way to process huge .csv

I need to process a pretty huge .csv (at least 10 million rows, hundreds of columns) with Python. I'd like:
To filter the content based on several criteria (mostly strings, maybe some regular expressions)
To consolidate the filtered data. For instance, grouping it by date and, for each date, counting occurrences based on a specific criterion. Pretty similar to what a pivot table could do.
I'd like to have user-friendly access to that consolidated data
I'd like to generate charts (mostly basic line charts)
Processing must be fast AND light, because computers at work cannot handle much and we're always in a hurry
Given these prerequisites, could you please suggest some ideas? I thought about using pandas. I also thought about dumping the CSV into a SQLite database (because it may be easier to query if I code a user interface). But this is really my first foray into this world, so I don't know where to start. I don't have much time, but I would be very glad if you could offer some advice, some good (and fresh) things to read, interesting libs, and so forth. Sorry if Stack Overflow is not the best place to ask for this kind of help; I'll delete the post if needed. Regards.
Give xsv a shot. It is quite convenient, with decent speed, and it fits the Unix philosophy. However, if the dataset will be used more than ten times, I'd suggest converting the CSV to some binary format, and ClickHouse is a good choice for that.
There are 2 rather different situations:
when your reports (charts, pivot tables) use a limited number of columns from the original CSV, you can pre-aggregate your large CSV file once to get a much smaller dataset. This one-time processing can take some time (minutes), and there is no need to load the whole CSV into memory, as it can be processed as a data stream (row by row). After that, you can use this small dataset for fast processing (filtering, grouping, etc.); see the pandas sketch at the end of this answer.
you don't know which columns of the original CSV may be used for grouping and filtering, so pre-aggregation is not possible. In other words, all 10M rows should be processed in real time (very fast); this is an OLAP use case. This is possible if you load the CSV data into memory once and then iterate over the 10M rows quickly when needed; if that is not possible, the only option is to import it into a database. SQLite is a good lightweight DB, and you can easily import a CSV with the sqlite3 command-line tool. Note that SQL queries over 10M rows might not be so fast, and you'll possibly need to add some indexes.
Another option might be a specialized OLAP database like Yandex ClickHouse: you can use it to query the CSV file directly with SQL (table engine = File) or import the CSV into its column store. This database is lightning fast with GROUP BY queries (it can process 10M rows in under a second).
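Here is a minimal sketch of the one-pass streaming pre-aggregation described in the first situation above, using pandas with chunksize; the column names and the filter are hypothetical placeholders for your own criteria.

    from collections import Counter

    import pandas as pd

    counts = Counter()
    # Read the huge CSV in chunks so memory use stays low.
    for chunk in pd.read_csv("huge.csv", usecols=["date", "status"], chunksize=500_000):
        filtered = chunk[chunk["status"] == "ERROR"]   # your string/regex filters go here
        counts.update(filtered.groupby("date").size().to_dict())

    # Small, pre-aggregated result: occurrences per date, ready for pivoting or charting.
    summary = pd.Series(counts).sort_index()
    summary.to_csv("summary_by_date.csv")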

Building a collaborative filtering recommendation engine using Spark MLlib

I am trying to build a recommendation engine based on collaborative filtering using Apache Spark. I have been able to run recommendation_example.py on my data, with quite good results (MSE ~ 0.9). Some of the specific questions I have are:
How do I make recommendations for users who have not done any activity on the site? Isn't there some API call for popular items, which would give me the most popular items based on user actions? One way to do it is to identify the popular items ourselves, catch the java.util.NoSuchElementException, and return those popular items.
How do I reload the model after some data has been added to the input file? I am trying to reload the model using another function, which tries to save the model, but it gives an error: org.apache.hadoop.mapred.FileAlreadyExistsException. One way to do it is to listen for the incoming data on a parallel thread, save it using model.save(sc, "target/tmp/<some target>"), and then reload the model after significant data has been received. I am lost here on how to achieve that.
It would be very helpful, if I could get some direction here.
For the first part, you can compute, for each item_id, the number of times that item_id appeared. You can use Spark's map and reduceByKey functions for that. After that, find the top 10/20 items with the highest counts. You can also weight items depending on how recent they are.
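A minimal PySpark sketch of that counting step, assuming ratings is the RDD of Rating objects built in recommendation_example.py:

    from operator import add

    # Count how many times each item was rated, then take the 20 most popular.
    item_counts = (ratings
                   .map(lambda r: (r.product, 1))
                   .reduceByKey(add))
    top_items = item_counts.takeOrdered(20, key=lambda kv: -kv[1])
    print(top_items)   # [(item_id, count), ...] to serve as a cold-start fallback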
For the second part, you can save the model with a new name every time. I generally create a folder name on the fly using the current date and time and use the same name to reload the model from the saved folder. You will always have to train the model again, using the past data plus the newly received data, and then use the model to predict.
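And a hedged sketch of the timestamped-save idea, assuming model and sc come from the same example; the path prefix is purely illustrative.

    from datetime import datetime

    from pyspark.mllib.recommendation import MatrixFactorizationModel

    # Save each retrained model under a fresh, timestamped directory so save()
    # never collides with an existing path (FileAlreadyExistsException).
    model_path = "target/tmp/myCollaborativeFilter_" + datetime.now().strftime("%Y%m%d_%H%M%S")
    model.save(sc, model_path)

    # Later, reload the model from that same directory.
    reloaded = MatrixFactorizationModel.load(sc, model_path)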
Independent of platforms like Spark, there are some very good techniques (for example, non-negative matrix factorization) for link prediction, which predicts links between two sets.
Other very effective (and good) recommendation techniques are:
1. Thompson Sampling, 2. MAB (Multi-Armed Bandits). A lot depends on the raw dataset: how is your raw dataset distributed? I would recommend applying the above methods on 5% of the raw dataset, building a hypothesis, using A/B testing, predicting links, and then moving forward.
Again, all these techniques are independent of the platform. I would also recommend starting from scratch instead of using platforms like Spark, which are only useful for large datasets; you can always move to these platforms later for scalability.
Hope it helps!
