I am new to SnapLogic and I am using the Join snap, but for some reason it doesn’t find any matches, even though I know there are matches. I have written both streams to a file and confirmed that they have matching IDs, but when the Join snap runs it doesn’t return any results. This is an inner join.
Check the Sorted streams setting in your Join snap.
This setting declares how the incoming data is sorted. The available options are:
Ascending
Descending
Unsorted
If an Unsorted data stream is selected, the Snap sorts input data streams before it starts the join operation.
The default value for this setting is Ascending. So, if you are passing unsorted streams of data and the Sorted streams setting has the default value Ascending, then your pipeline won't be able to join the data.
Refer to: Join - SnapLogic Documentation
Why does the order of rows displayed via show differ when I select a subset of the dataframe's columns to display?
Here is the original dataframe:
Here the dates appear in the given order, as you can see via show.
Now the order of rows displayed via show changes when I select a subset of predict_df's columns into a new dataframe and display that.
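A minimal sketch of the pattern being described (the date column name comes from the question; everything else is assumed):

predict_df.show()                      # rows of the full dataframe appear in one order
subset_df = predict_df.select("date")  # new dataframe built from a subset of the columns
subset_df.show()                       # the same rows may now appear in a different order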
This is because a Spark dataframe itself is unordered. It's due to the parallel processing principles that Spark uses: different records may be located in different files (and on different nodes), and different executors may read the data at different times and in different sequences.
So you have to explicitly specify the order in a Spark action using the orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records have an equal date value, then within that date subset the records will still be unordered. So in this case, in order to obtain strictly ordered data, we have to perform orderBy on a set of columns whose combined values are unique across rows. E.g.:
from pyspark.sql.functions import col

df.orderBy(col("date").asc(), col("other_column").desc())
In general, unordered datasets are the normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server in general return unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. And even if we sometimes see the same result for a query, the DBMS doesn't guarantee that another execution will return the same result, especially if the read is performed over a large amount of data.
The situation occurs because show is an action that is called twice.
As no .cache is applied, the whole cycle starts again from the start. Moreover, I tried this a few times and got the same order each time, but not the same order the questioner observed. Processing is non-deterministic.
As soon as I used .cache, the same result was always returned.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the 2nd action will invoke processing from the start again - the basics are evident here as well. And maybe the bottom line is: always do the ordering explicitly, if it matters.
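As a rough sketch (reusing the question's predict_df and its date column; nothing else here is from the original post), caching before calling show twice keeps the displayed order consistent:

cached_df = predict_df.cache()   # persist the computed partitions once the first action runs
cached_df.show()                 # first action materializes and caches the data
cached_df.select("date").show()  # second action reuses the cached partitions, so the row order matches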
Like @Ihor Konovalenko and @mck mentioned, a Spark dataframe is unordered by nature. Also, it looks like your dataframe doesn’t have a reliable key to order by, so one solution is to use monotonically_increasing_id https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html to create an id, and that will keep your dataframe always ordered. However, if your dataframe is big, be aware this function might take some time to generate an id for each row.
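A minimal sketch of that approach (the row_id column name is an assumption, not from the original post):

from pyspark.sql.functions import monotonically_increasing_id

# Attach an increasing (not necessarily consecutive) id that reflects the current row order,
# then order by it whenever a stable display order is needed.
predict_df = predict_df.withColumn("row_id", monotonically_increasing_id())
predict_df.orderBy("row_id").show()
predict_df.select("date", "row_id").orderBy("row_id").show()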
The except() function in Spark works by comparing two dataframes and returning the non-matching records from the first dataframe.
However, I would also like to track the field-level details that are not matching. How do I do this in Spark? Please help.
As mentioned, except will give you complete row mismatches. So I would suggest using a left_anti join instead of except, with a join key or keys as the condition; you can take the primary or composite keys. This will give you the row mismatches w.r.t. those keys. Then you need to write one more query for the rows where the keys matched (i.e. the intersection) but there are mismatches in the other columns: write an inner join w.r.t. the keys with a condition like table1.colA != table2.colA, repeated for all fields.
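A rough PySpark sketch of this approach (df1, df2, and the key/column names are placeholders, not from the original post):

from pyspark.sql import functions as F

# Rows in df1 whose key has no match at all in df2.
missing_rows = df1.join(df2, on="id", how="left_anti")

# Rows whose key matches but at least one non-key column differs.
joined = df1.alias("a").join(df2.alias("b"), on="id", how="inner")
field_mismatches = joined.where(
    (F.col("a.colA") != F.col("b.colA")) | (F.col("a.colB") != F.col("b.colB"))
).select(
    "id",
    F.col("a.colA").alias("colA_left"), F.col("b.colA").alias("colA_right"),
    F.col("a.colB").alias("colB_left"), F.col("b.colB").alias("colB_right"),
)
# Note: != treats nulls as unknown; use ~F.col("a.colA").eqNullSafe(F.col("b.colA")) if columns can be null.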
I have a DynamoDB table where the sort key has a numeric value.
I have a requirement to retrieve the first item which has a lower value than the one that I have.
I have gone through the http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_UpdateItem.html#API_UpdateItem_Examples docs but I can see no way to:
- sort the output
- limit the result to 1 entry
Is there any way to actually achieve what I want with DynamoDB?
EDIT:
According to this: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
The results are sorted using the sort key, and when it's numeric, they are sorted descending. Which is great, but I still can't find any way to get only a single result (I don't want to "pay" for a full table scan in some cases).
Are you searching for the next item which has a lower sort key within the same Partition Key?
In that case, you can use Query as you've found, sort in descending order, and limit to 1. This will not scan the entire table.
Alternatively, if you wish to scan across partitions, unfortunately a table scan is the only way to do this.
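A rough boto3 sketch of the single-partition case (the table name, key names, and values are placeholders, not from the original post):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my_table")  # placeholder table name

response = table.query(
    KeyConditionExpression=Key("pk").eq("some-partition") & Key("sk").lt(100),
    ScanIndexForward=False,  # return items in descending order of the numeric sort key
    Limit=1,                 # stop after the first matching item, so no full scan
)
item = response["Items"][0] if response["Items"] else None  # closest item below the given value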
I'm using Core Data with an NSFetchedResultsController. My data consists of many students and lesson dates. I set my Predicate and Sort Descriptors up to return a sorted lists of lessons for a particular student. I sort ascending or defending, and my table view is loaded and happy.
However, at times, I want to return only the previous two lessons, sorted ascending. How in the world can I construct an NSFetchRequest to only return an array of two items?
I've been trying to fool the table view by modifying rows and sections... and yes, it is getting tangled up and clunky.
It seems I need to nest NSFetchRequests inside the NSFetchedResultsController: first fetching to get the total number of items/sections, and then getting just the last two objects when sorting ascending. How do I limit the results to the last two items when I don't know how many items there are when setting up the NSFetchRequest?
Thanks
Just tell the fetch request how many you want:
[fetchRequest setFetchLimit:2];
Results will be sorted according to your sort descriptor(s), and you'll get the first two results.
I want the rows returned by get_range in pycassa to be in reverse sorted order, i.e. from finish to start.
I know that there is a column_reversed parameter for getting columns in reverse sorted order, but how do I get this done for rows?
Cassandra itself doesn't support getting a range of rows in reverse order. It also doesn't support getting rows in normal sorted order unless you're using an order preserving partitioner, which is almost never recommended. This post is a bit old, but still covers the topic quite well: http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/