Create data syncs using two tables - apache-spark

I want to create a data sync in Palantir using an update (update + insert) transaction on three fields from two different tables. There is an option in Palantir syncs to use a twin table, but I can't see how to add three fields from two different tables to the incremental field.
Do you have any idea how to do this in Palantir?

Related

Link two tables in spotfire

I have two tables which I want to connect through a common key. Once that is done, I should be able to select something in one of the tables, which would make the other table respond with the data associated with that common key...
But I am not sure how to see that. I can see they are connected, but I can't get the 'selection' to work.
After you join the tables (not a prerequisite by the way), set up a relationship between the two tables on the common key(s) by going to Edit > Data Table Properties > Relations.
Relationships allow markings and filters to propagate through tables that may not even be joined. For example, with a proper relationship, if I mark one table on a key that is also in another table, that other table will be marked. This can drive visualizations and detailed drill-downs. You can read more in the Spotfire documentation at the link below.
https://docs.tibco.com/pub/sfire-analyst/7.5.0/doc/html/WebHelp/data/data_details_on_manage_relations.htm

Snaplogic querying two sources and joining data together

I am trying to build a Pipeline which queries my Sales records (as one Read activity).
In this Sales schema there are fields that reference a People table; however, it's not a direct connection, as there is a Many-to-Many relationship.
So what I need to do is query my PeopleToSales table for all related records and populate them in a flat structure in my subsequent JSON object.
How can I build the two objects together and join them based on Sales ID? Also, in the event there are multiple matches, how can I choose the first one?
You can read both the Sales records and the PeopleToSales table and then use the Join snap to merge the relevant documents based on whatever ID that defines the relation between them.
After that, you can use the Group By Fields snap to group the documents based on Sales ID.
You can add the Sales ID field (say - $sales_id) in the Fields list in the settings and it will group documents based on the Sales ID.
Also, when using the Group By Fields snap, you first have to sort the documents based on the keys. So, use a Sort snap before the Group By Fields snap.
As far as getting the first object is concerned, after the group-by, you can just get the 0th element of the list (say group[0]).
Please refer to - SnapLogic Docs - Group By Fields
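The join, sort, group, and take-first steps described above can be sketched outside SnapLogic in plain Python. This is only an illustration of the data flow, not SnapLogic configuration; the sample documents, the field names (sales_id, amount, person), and the function join_first_match are assumptions made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample documents; field names are assumptions for illustration.
sales = [
    {"sales_id": 1, "amount": 250},
    {"sales_id": 2, "amount": 400},
]
people_to_sales = [
    {"sales_id": 2, "person": "Bob"},
    {"sales_id": 1, "person": "Alice"},
    {"sales_id": 1, "person": "Carol"},
]


def join_first_match(sales_docs, link_docs):
    """Inner-join on sales_id, then keep the first match per sale."""
    sales_by_id = {doc["sales_id"]: doc for doc in sales_docs}
    # Join step: merge each link document with the sale it points to.
    joined = [
        {**sales_by_id[link["sales_id"]], **link}
        for link in link_docs
        if link["sales_id"] in sales_by_id
    ]
    # Sort step: groupby only groups adjacent items, so sort by the key first
    # (the same reason a Sort snap goes before Group By Fields).
    joined.sort(key=itemgetter("sales_id"))
    # Group step, then take the 0th element of each group.
    return [next(group) for _, group in groupby(joined, key=itemgetter("sales_id"))]


flat = join_first_match(sales, people_to_sales)
```

Because Python's sort is stable, "first" here means the first matching link document in the original input order.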

Cassandra - join two tables and save result to new table

I am working on a self-bi application where users can upload their own datasets which are stored in Cassandra tables that are created dynamically. The data is extracted from files that the user can upload. So, each dataset is written into its own Cassandra table modeled based on column headers in the uploaded file while indexing the dimensions.
Once the data is uploaded, the users are allowed to build reports, analyze, etc., from within the application. I need a way to allow users to merge/join data from two or more datasets/tables based on matching keys and write the result into a new Cassandra table. Once a dataset/table is created, it will stay immutable and data is only read from it.
user table 1: username, email, employee id
user table 2: employee id, manager
I need to merge data in user table 1 and user table 2 on matching employee id and write to new table that is created dynamically.
new table: username, email, employee id, manager
What would be the best way to do this?
The only option that you have is to do the join in your application code. There are too few details to suggest a proper solution.
Please add details about table keys, usage patterns... In general, in Cassandra you model from the usage point of view, i.e. starting with the queries that you'll execute on the data.
To merge two tables following this pattern, you have to do it in the application: create the third (target) table and fill it with data from both tables. Make sure that you read the data in pages so that you don't run out of memory; how much this matters depends on the size of the data.
An alternative is to do the join in Spark, but that may be over-engineering in your case.
You can give the merged table a primary key based on the user, so that the merged data for each user goes into one row; that should be unique, since it is a one-time action.
Then, when the user clicks, you can go through one table in batches using a fetch size (for Java, check the query options; the fetch size gives you a fixed window of rows that is loaded, and when its end is reached the next window is fetched). Let's say you have a fetch size of 1000 items: iterate over them from one table, find the matches in the second table, and once 1000 is reached, send a batch of 1000 inserts to the new table.
If that is too time-consuming, you can, as suggested, use another tool such as Apache Spark or Spring Batch and run the merge in the background, informing the user that it will take place.
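A minimal sketch of the paged, application-side join described above, written in Python with in-memory stand-ins instead of a real Cassandra session. The names read_in_pages, merge_tables, lookup_table2, and write_batch are hypothetical; in a real application the lookup and the batch write would be driver calls against the second table and the new target table.

```python
from itertools import islice


def read_in_pages(rows, fetch_size):
    """Yield lists of at most fetch_size rows, mimicking driver-side paging."""
    iterator = iter(rows)
    while True:
        page = list(islice(iterator, fetch_size))
        if not page:
            return
        yield page


def merge_tables(table1_rows, lookup_table2, write_batch, fetch_size=1000):
    """Join table 1 with table 2 on employee_id, one page at a time.

    lookup_table2(employee_id) returns the matching row or None (assumed helper).
    write_batch(rows) persists one batch into the new table (assumed helper).
    """
    written = 0
    for page in read_in_pages(table1_rows, fetch_size):
        batch = []
        for row in page:
            match = lookup_table2(row["employee_id"])
            if match is not None:
                batch.append({**row, "manager": match["manager"]})
        if batch:
            # At most fetch_size inserts go out per batch.
            write_batch(batch)
            written += len(batch)
    return written


# In-memory stand-ins for the two source tables and the target table.
table1 = [
    {"username": "ann", "email": "ann@example.com", "employee_id": 1},
    {"username": "ben", "email": "ben@example.com", "employee_id": 2},
    {"username": "cat", "email": "cat@example.com", "employee_id": 3},
]
table2 = {
    1: {"employee_id": 1, "manager": "zoe"},
    3: {"employee_id": 3, "manager": "yan"},
}
new_table = []
count = merge_tables(table1, table2.get, new_table.extend, fetch_size=2)
```

Only one page of table 1 is held in memory at a time, which is the point of reading with a fetch size. Rows without a match in table 2 are skipped here (an inner join); whether to keep them is a design decision the question leaves open.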

Portal displaying data from two tables

I have two tables which both include a date field. Currently I have two portals, one for each table (occurrence).
Is it possible to display the results of both of these in one portal, and sort by the date?
Technically a portal can only display records from one table. If you need to join two tables then you have to do this manually or change the design and use one table instead of two (since you want them in the same portal, then the tables are similar to some degree; maybe this similarity can go into its own table).
Sometimes developers use the so-called virtual table technique: they create a table with, say, a field with the record number and a bunch of calculated fields that pick their values from elsewhere, for example, from prefilled global variables. They create a portal to this table, set up the relationship to display the required number of records, and write the code to fill these variables. This way they can show data that isn't stored in any table, combine tables, etc. But it's an arcane technique, I would recommend it only as the last resort.

How would you store contacts in Azure Tables?

Each user of my system can have contacts. Each contact has details like Name, Address, Email, Phone, etc.
Do you think it is a good idea to store these contacts in Azure Tables? I am worried about the following:
How do I search for a specific field (like Email or Phone)?
How do I get only the contacts belonging to a specific user?
How do I sort the contacts by a field?
I think that contacts could be a good candidate for storing in Table Storage - but only if you can partition on the owning person and never really need to search or aggregate across multiple owning users.
One possible design for this is:
store the contacts once with the owning user as partition key and some unique field for row key, but with the fields as columns within each row.
How do I search for a specific field (like Email or Phone)?
You can then ask table storage to search within a partition - it will then do a scan within that partition - which shouldn't be particularly large or slow for any single partition.
How do I get only the contacts belonging to a specific user?
This is just a simple query by partition key only.
How do I sort the contacts by a field?
All results from table storage are sorted by (partitionkey, rowkey) so to sort the contacts for a user, you'll need to query for all of them, and then sort them within your web or worker role.
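The three answers above can be illustrated with a small Python sketch that models the design with in-memory entities. It does not call the Azure SDK; the sample data and the helper names (contacts_for_user, find_by_field, sorted_contacts) are assumptions for illustration, and only the PartitionKey and RowKey property names come from Table Storage itself.

```python
# Each entity: owning user as PartitionKey, a unique contact id as RowKey,
# and the contact details as ordinary columns.
contacts = [
    {"PartitionKey": "user-1", "RowKey": "c-001", "Name": "Zed", "Email": "zed@example.com"},
    {"PartitionKey": "user-1", "RowKey": "c-002", "Name": "Amy", "Email": "amy@example.com"},
    {"PartitionKey": "user-2", "RowKey": "c-001", "Name": "Bob", "Email": "bob@example.com"},
]


def contacts_for_user(entities, user_id):
    """Query by partition key only: all contacts owned by one user."""
    return [e for e in entities if e["PartitionKey"] == user_id]


def find_by_field(entities, user_id, field, value):
    """Search on a non-key field: a scan, but confined to one partition."""
    return [e for e in contacts_for_user(entities, user_id) if e.get(field) == value]


def sorted_contacts(entities, user_id, field):
    """Sort on the client, since storage order is (PartitionKey, RowKey)."""
    return sorted(contacts_for_user(entities, user_id), key=lambda e: e[field])


names = [e["Name"] for e in sorted_contacts(contacts, "user-1", "Name")]
hits = find_by_field(contacts, "user-1", "Email", "zed@example.com")
```

The scan in find_by_field stays cheap only as long as a single user's partition stays small, which is the assumption this design rests on.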
Other designs are, of course, possible -
e.g. you could store each contact in multiple rows in multiple tables - this would then allow you to have pre-formed sort orders within the table storage.
e.g. you could use separate tables instead of separate partitionkeys for each user - this has the advantage that when you delete a user, you can delete the entire table belonging to that user.
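As a rough sketch of the first alternative, one way to get pre-formed sort orders is to write each contact once per sort field, into a separate table whose row key starts with that field's value. The table naming (ContactsByName, ContactsByEmail), the row key layout, and the function build_sorted_tables are assumptions chosen for this example, and the tables are modelled as in-memory lists rather than real storage.

```python
def build_sorted_tables(user_id, contacts_by_id, sort_fields):
    """Return one list of rows per sort field, in storage order."""
    tables = {}
    for field in sort_fields:
        rows = [
            {
                "PartitionKey": user_id,
                # Sort value first, then the contact id to keep the key unique.
                "RowKey": f"{contact[field].lower()}:{contact_id}",
                **contact,
            }
            for contact_id, contact in contacts_by_id.items()
        ]
        # Table storage returns rows ordered by (PartitionKey, RowKey);
        # sorted() mimics that ordering for the in-memory stand-in.
        tables[f"ContactsBy{field}"] = sorted(
            rows, key=lambda r: (r["PartitionKey"], r["RowKey"])
        )
    return tables


contacts_by_id = {
    "c-001": {"Name": "Zed", "Email": "a.zed@example.com"},
    "c-002": {"Name": "Amy", "Email": "m.amy@example.com"},
}
tables = build_sorted_tables("user-1", contacts_by_id, ["Name", "Email"])
by_name = [r["Name"] for r in tables["ContactsByName"]]
by_email = [r["Name"] for r in tables["ContactsByEmail"]]
```

The cost of this design is that every insert, update, or delete of a contact has to touch one row per sort order, and the copies can drift apart if one of those writes fails.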
Note... while it's possible to use table storage for this... actually I almost always seem to end up back in SQL Azure at the moment - it's just so much more powerful and predictable (IMO). When the team deliver secondary indexing, then I might be tempted to use it for more of my data.
