How can I implement an iterative optimization problem in Spark

Assume I have the following two sets of data. I'm attempting to associate products on hand with their rolled-up tallies. A roll-up tally may cover products from multiple categories, with a primary and an alternate category. In a relational database I would load the second set of data into a temporary table and use a stored procedure to iterate through the rollup data, decrementing the quantities until they were zero or I had matched the tallies. I'm trying to implement a solution in Spark/PySpark and I'm not entirely sure where to start. I've attached a possible output that I'm trying to achieve, though I recognize there are multiple outputs that would work.
#Rolled Up Quantities#
owner,category,alternate_category,quantity
ABC,1,4,50
ABC,2,3,25
ABC,3,2,15
ABC,4,1,10
#Actual Stock On Hand#
owner,category,product_id,quantity
ABC,1,123,30
ABC,2,456,20
ABC,3,789,20
ABC,4,012,30
#Possible Solution#
owner,category,product_id,quantity
ABC,1,123,30
ABC,1,012,20
ABC,2,456,20
ABC,2,789,5
ABC,3,789,15
ABC,4,012,10
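
The decrement loop described in the question can be sketched in plain Python. This is a minimal greedy sketch, not a general optimizer: it assumes tallies are processed in the order given, that each tally draws from its primary category first and then from its alternate, and the function name `allocate` is hypothetical. In PySpark, one option might be to run a function like this once per owner (for example through grouped `applyInPandas`); that wiring is an assumption and is not shown here.

```python
from collections import defaultdict

def allocate(rollups, stock):
    """Greedy allocation: fill each tally from its primary category, then its alternate.

    rollups: list of (owner, category, alternate_category, quantity)
    stock:   list of (owner, category, product_id, quantity)
    Returns a list of (owner, category, product_id, quantity) rows.
    """
    # Remaining stock per (owner, category): list of [product_id, quantity_left].
    remaining = defaultdict(list)
    for owner, category, product_id, qty in stock:
        remaining[(owner, category)].append([product_id, qty])

    result = []
    for owner, category, alternate, needed in rollups:
        # Try the primary category first, then the alternate one.
        for source in (category, alternate):
            for item in remaining[(owner, source)]:
                if needed == 0:
                    break
                take = min(needed, item[1])
                if take > 0:
                    item[1] -= take      # decrement the stock on hand
                    needed -= take       # decrement the tally still to match
                    result.append((owner, category, item[0], take))
    return result

# Data from the question; IDs are kept as strings so "012" keeps its leading zero.
rollups = [("ABC", "1", "4", 50), ("ABC", "2", "3", 25),
           ("ABC", "3", "2", 15), ("ABC", "4", "1", 10)]
stock = [("ABC", "1", "123", 30), ("ABC", "2", "456", 20),
         ("ABC", "3", "789", 20), ("ABC", "4", "012", 30)]

for row in allocate(rollups, stock):
    print(",".join(str(value) for value in row))
```

On the sample data this greedy order happens to reproduce the possible solution listed above. With other data, a greedy pass may consume alternate stock that a later tally needs as its primary, so the processing order matters.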

Related

Join feature from two different index

We need data from two different Azure Search indexes. Since we could not find any option to join indexes, we currently replicate data from the different indexes into a new index. As a result, we struggle to keep the redundant data in sync across multiple indexes, and we also bear the cost of maintaining the data in the new index.
Is there a better option than our current solution for this use case?

Best architecture for fast filter queries in ArangoDB

I am working on a system where I need fast filtering queries. Basically, it is a set of 50 different fields (booleans, amounts, codes and dates), just like a web-shop filter.
There are about 10,000,000 items.
For the moment I am using MSSQL, with one big table and various indexes, plus a few separate tables where I found it much faster to join than to filter the result in one table.
I usually get a response time of around 1 second on a fairly fast server.
I am considering using ArangoDB for this and wonder which approach is best. Is it better to keep some of the "flags" as separate tables and join, or is it more efficient to put everything in the same document and have each flag as an indexed field? Or would there be any benefit in using the graph/edge feature and making a link back to the same object (or to an object representing the code, for instance)?
The reason I am considering ArangoDB is that my plan is to have a more complex model and will most likely use the graph feature in the future even if the first priority is to get the system up to the current level of features with a similar speed.
Any thoughts?

Using Cognos 10.1 which is better an Inner Join or an "IN" Filter?

I'm using Cognos 10.1 and I have a report that uses two queries each with the same primary key.
Query 1: UniqueIds
Query 2: DetailedInfo
I'm not sure how to tell whether it's better to build a report using the DetailedInfo query with a filter that says PrimaryKey in (UniqueIds.PrimaryKey), or to create a third query that joins UniqueIds to DetailedInfo on PrimaryKey.
I'm new to Cognos and I'm learning to think differently. Using Microsoft SQL Server I'd just use an inner join.
So my question is: in Cognos 10.1, which way is better, and how can I tell what the performance differences are?

You'd better start from the beginning.
Your queries (I hope they are Query Subjects) should be joined in Framework Manager, in a model. Then you can easily filter the second query by applying filters to the first.
Joins in Report Studio are a last resort.

The report writer's ultimate weapon is a well-indexed data warehouse, with a solid framework model built on top.
You want all of your filtering and joining to happen on the database side as much as possible. If not, then large data sets are brought over to the Cognos server before they are joined and filtered by Cognos.
The more work that happens on the database, the faster your reports will be. By building your reports in certain ways, you can mitigate Cognos side processing, and promote database side processing.
The first and best way to do this is with a good Framework Model, as Alexey pointed out. This will allow your reports to be simpler, and pushes most of the work to the database.
However a good model still exposes table keys to report authors so that they can have the flexibility to create unique data sets. Not every report warrants a new Star Schema, and sometimes you want to join the results of queries against two different Star Schema sources.
When using a join or a filter, Cognos attempts to push all of the work to the database as a default. It wants to have the final data set sent to it, and nothing else.
However, when creating your filters, you have two ways of defining variables: with explicit names that refer to modeled data sources (e.g. [Presentation View].[Sales].[Sales Detail].[Net Profit]) or by referring to a column in the current data set (such as [Net Profit]). Using explicit columns from the model will help ensure the filters are applied at the database.
Sometimes that is not possible, such as with a calculated column. For example, if you don't have Net Profit in your database or within your model, you may establish it with a calculated column. If you then filter on [Net Profit] > 1000, Cognos will pull the data set into Cognos before applying your filter. Your final result will be the same, but depending on the size of the data before and after the filter is applied, you could see a performance decrease.
It is possible to have nested queries within your report, and Cognos will generate a single giant SQL statement for the highest-level query, which includes subqueries for all the lower-level data. You can generate the SQL/MDX in order to see how Cognos is building the queries.
Also, try experimenting. Save your report with a new name, try it one way and time it. Run it a few times and take an average execution speed. Time it again with the alternate method and compare.
With smaller data sets, you are unlikely to see any difference. The larger your data set gets, the more your choice of method will affect report speed.

Use joins to merge two queries together so that columns from both queries can be used in the report. Use IN() syntax if your only goal is to filter one query by the existence of corresponding rows in a second. That said, there are likely to be many cases where both methods perform equally well, depending on the number of rows involved, indexes, etc.
By the way, within a report Cognos only supports joins and unions between different queries. You can reference other queries directly in filters even without an established relationship but I've seen quirks with this, like it works when run interactively but not scheduled or exported. I would avoid doing this in reports.

Cassandra Data sync issues

I've been researching Cassandra for over 2 weeks to get a full grasp of it. I've read almost everything on the web about Cassandra and am still not clear on some concepts. They are the following:
As per the documentation, we model our Column Families according to our queries. Hence we need to know our queries beforehand, which is not at all possible in a real-world scenario. We can have a certain set of queries beforehand, but they keep changing with time. Hence, if I designed a model based on my previous queries, then when a new requirement comes in, I need to redesign the model. And as I read in one SO thread, "It's very hard to fix a bad Cassandra data model in the future." For example, I have a user model with fields such as
name, age,phone,imei,address, state,city,registration_type, created_at
Currently, I need to filter (let's say) only by state. I'll make state the PK. Let's name the model UserByState.
Now, after 2-3 months, I get a requirement to filter by created_at. So I'll create a model UserByCreatedAt with created_at as the PK.
Now there are 2 problems:
a) If I create a new model when the requirement comes in, then the new model also needs the data that already exists in the current model. Hence I need to migrate the data from UserByState to UserByCreatedAt, i.e. I need to write a script to copy the data from UserByState to UserByCreatedAt. Correct me if I'm wrong!
If another new filtering requirement comes in, I'll be creating new models and then migrating again, and so on.
b) If I create models beforehand according to the queries, I need to keep their data in sync. In the above case of Users, I created 2 models for 2 queries:
UserByState and UserByCreatedAt
So do I need to issue 2 different write queries, i.e.
UserByState.create(row = value,......)
UserByCreatedAt.create(row = value,......)
And if I have other models, such as 'UserByGender' and so on, do I need to issue separate write queries to each model MANUALLY, or does it happen on its own? This is where the problem of keeping the data in sync arises.

There is no free lunch in distributed systems, and you've hit some of the key limitations on the head.
If you want extremely performant writes that scale horizontally, you end up having to make concessions in other parts of the database. Cassandra chose to sacrifice flexibility in query patterns to ensure extremely fast access for well-defined query patterns.
When most users reach a situation where they need two extremely different and frequent query patterns, they build a second table and update both at once. To get atomicity with the multi-table writes, logged batching can be used to make sure that either all of the data is written or none of it is. Logged batching increases the cost, so this is yet another tradeoff with performance. Beyond that, the normal consistency-level tradeoffs all still apply.
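
As a rough illustration of the multi-table write, the sketch below only builds the CQL text for a logged batch that inserts the same row into each table. The table names `user_by_state` and `user_by_created_at`, the column subset, and the helper `build_logged_batch` are hypothetical; the `?` bind markers assume the statement would be prepared and executed through whichever driver you use, which is not shown.

```python
def build_logged_batch(tables, columns):
    """Return CQL text for one logged batch that inserts the same row into every table."""
    column_list = ", ".join(columns)
    markers = ", ".join("?" for _ in columns)
    inserts = [
        f"  INSERT INTO {table} ({column_list}) VALUES ({markers});"
        for table in tables
    ]
    # In CQL, a plain BEGIN BATCH (without the UNLOGGED keyword) is a logged batch.
    return "\n".join(["BEGIN BATCH", *inserts, "APPLY BATCH;"])

# Hypothetical tables mirroring the UserByState / UserByCreatedAt models in the question.
cql = build_logged_batch(
    ["user_by_state", "user_by_created_at"],
    ["name", "state", "created_at"],
)
print(cql)
```

The application still issues one insert per table; the batch only groups them so that they succeed or fail together.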
For moving data from the old table to the new one, Hadoop/Spark are good options. These are batch-based systems, so they will not provide low latency, but they are great for one-offs like rebuilding a table with a new index, and for cron-job operations.

How to perform intersection operation on two datasets in Key-Value store?

Let's say I have 2 datasets, one for rules, and the other for values.
I need to filter the values based on rules.
I am using a Key-Value store (Couchbase, Cassandra, etc.). I can use multi-get to retrieve all the values from one table and all the rules from the other, and perform the validation in a loop.
However, I find this very inefficient. I move a massive volume of data (the values) over the network, and the client is kept busy filtering.
What is the common pattern for finding the intersection between two tables with Key-Value store?

The idea behind the NoSQL data model is to write data in a denormalized way so that a table can answer one precise query. As an example, imagine you have reviews made by customers on shops. You need to know the reviews made by a user on shops and also the reviews received by a shop. This would be modeled using two tables:
ShopReviews
UserReviews
In the first table you query by shop id and in the second by user id; the data is written twice, but each read is a direct access by a single key.
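
A minimal in-memory sketch of that double write, using two Python dictionaries to stand in for the ShopReviews and UserReviews tables. The function `add_review` and the sample ids are hypothetical; a real store would replace the dictionaries with its own put/get calls.

```python
from collections import defaultdict

shop_reviews = defaultdict(list)  # stands in for ShopReviews, keyed by shop id
user_reviews = defaultdict(list)  # stands in for UserReviews, keyed by user id

def add_review(shop_id, user_id, text):
    """Write the same review twice, once under each access key."""
    review = {"shop_id": shop_id, "user_id": user_id, "text": text}
    shop_reviews[shop_id].append(review)
    user_reviews[user_id].append(review)

add_review("shop-1", "alice", "Great service")
add_review("shop-1", "bob", "Slow delivery")
add_review("shop-2", "alice", "Good prices")

# Each question is answered by a single key lookup, with no join or client-side scan.
print([r["text"] for r in shop_reviews["shop-1"]])
print([r["text"] for r in user_reviews["alice"]])
```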
In the same way, you should organize values by rules (I can't be more precise without knowing the relation between them), and so on. One more consideration: newer versions of NoSQL databases support collections, which might help to model one-to-many relations.
HTH, Carlo
