Applying condition statement after reduce by key? - apache-spark

I want to apply a condition after using reduceByKey.
For example, I want to keep only the entries whose value is greater than 5.
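One way to read the question is "reduce first, then filter on the reduced value". In PySpark that would usually be written as rdd.reduceByKey(lambda x, y: x + y).filter(lambda kv: kv[1] > 5). The sketch below shows the same two steps in plain Python so it runs without a Spark cluster; the sample pairs and the helper name reduce_by_key are made up for illustration.

```python
def reduce_by_key(pairs, func):
    """Combine all values that share a key, similar in spirit to reduceByKey."""
    acc = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged with func.
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


# Hypothetical (key, value) pairs.
pairs = [("a", 3), ("b", 1), ("a", 4), ("c", 6), ("b", 2)]

# Step 1: reduce by key (here: sum the values per key).
totals = reduce_by_key(pairs, lambda x, y: x + y)

# Step 2: apply the condition to the reduced values.
result = {key: value for key, value in totals.items() if value > 5}
print(result)
```

The condition is applied after the reduction, so a key whose individual values are all below 5 can still pass if its total exceeds 5.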

Related

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is this.
*I intend to filter out (delete) the rows highlighted in yellow.
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
The contents of the file are also shown here.
The bottom two lines contain spaces and asterisks.
Any help would be appreciated.
I'm new to Azure and having trouble.
I need your help.
Add a surrogate key transformation to put a row number on each row. Add a new branch to duplicate the stream and in that new branch, add an aggregate.
Use the aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is a pattern depicted in the screenshot here. Instead of the new branch, I just created 2 sources both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink. For the aggregation expression I used this: "count(1)" as the row count aggregator.
In the first stream, that is the primary data processing stream, I add a Surrogate Key transformation so that I can have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2
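The logic of that pattern (row count from one stream, surrogate key on the other, then a filter) can be sketched outside ADF. This is only a plain-Python illustration of the idea, not Data Flow code; the function name drop_last_rows and the sample rows are assumptions.

```python
def drop_last_rows(rows, n=2):
    """Keep every row except the last n, using a row number and a row count."""
    rows = list(rows)
    # Plays the role of the second stream: count(1) stored in the cached sink.
    row_count = len(rows)
    # Plays the role of the Surrogate Key transformation: sk starts at 1.
    keyed = enumerate(rows, start=1)
    # Plays the role of the Filter: sk <= rowcount - n.
    return [row for sk, row in keyed if sk <= row_count - n]


# Hypothetical file contents: three data rows, then a blank row and an asterisk row.
sample = ["row 1", "row 2", "row 3", "   ", "***"]
print(drop_last_rows(sample))
```

Because the cutoff is derived from the count at run time, it keeps working when the number of rows changes from month to month.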

Sum of rows with multiple conditions in Excel VBA

I need to be able to do a summary of rows based on certain column conditions.
There's a table with the following columns:
ID# (row #)
Part
Customer
Job
QTY
Dept
Pass/Fail
Where ID# can possibly be the only unique value.
From the table the following needs to be obtained:
1. Return all jobs.
2. Return all new jobs: all jobs minus any duplicates, i.e., the first entry for each unique Part-Customer-Job-QTY-Dept combination (where all 5 values are equal).
3. Return all repeat jobs, which should be all ID#-Part-Customer-Job-QTY-Dept-Pass.
For example:
1-Part; Customer; Job; QTY; Dept; Fail
2-Part; Customer; Job; QTY; Dept; Pass
Where Part-Customer-Job-QTY-Dept are equal, with 1 failing and 2 passing.
1 and 2 are easy, but 3 is a little tricky.
Should I just find the ID#'s (rows) that include a fail prior to a pass?
Can it be done in a single loop?
While I'm here, #2 might be tricky as well. Is there any easy way to sum, without duplicates?
Any help will be highly appreciated!
Please let me know if any additional info is needed.
The sample data below should return:
10 for All
8 for New
2 for Repeats
Working through the sample makes me wonder whether simply subtracting New from All would return all repeats.
I am not sure you actually need VBA, since the problem seems solvable with built-in worksheet functions.
The key problem here is to identify all the repeating records for the unique key Part-Customer-Job-QTY-Dept. Note that you don't need to consider the ID and Pass/Fail columns, as these values do not affect your calculation.
Once you know the unique key, you can solve it with the following steps:
1. Add a new column that concatenates the fields to produce the unique key. (F1)=A1&B1&C1&D1&E1
2. For each row, count how many times its unique key appears in that column. (G1)=countif(F:F,F1)
3. A record is a duplicate when the count is larger than one, meaning multiple rows in the data share the same unique key. =countif(G:G,">1")
Once you have the Yes/No answer for each row, you simply count the Yes rows to get the repeat count, and from that the number of new jobs according to your definition.
This can also be implemented in VBA by the same logic.
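As a rough illustration of the same idea in a single pass, here is a Python sketch (not VBA). It treats the first row for each Part-Customer-Job-QTY-Dept key as "new" and takes repeats as All minus New, which is the shortcut the question itself suggests. The field names and the sample rows are invented for the example.

```python
def summarize_jobs(rows):
    """Count all rows, first occurrences of each key, and the remainder."""
    seen = set()
    new = 0
    for row in rows:
        # ID and Pass/Fail are deliberately left out of the key.
        key = (row["part"], row["customer"], row["job"], row["qty"], row["dept"])
        if key not in seen:
            seen.add(key)
            new += 1
    total = len(rows)
    return {"all": total, "new": new, "repeats": total - new}


# Hypothetical data: rows 1 and 2 share the same five key values.
rows = [
    {"id": 1, "part": "P1", "customer": "C1", "job": "J1", "qty": 5, "dept": "D1", "result": "Fail"},
    {"id": 2, "part": "P1", "customer": "C1", "job": "J1", "qty": 5, "dept": "D1", "result": "Pass"},
    {"id": 3, "part": "P2", "customer": "C1", "job": "J2", "qty": 1, "dept": "D1", "result": "Pass"},
    {"id": 4, "part": "P3", "customer": "C2", "job": "J3", "qty": 2, "dept": "D2", "result": "Pass"},
]
print(summarize_jobs(rows))
```

Using a tuple as the key avoids the ambiguity that plain string concatenation can introduce when two different field combinations join to the same text.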

How to get last inserted 10 records in descending order using dynamodb

I am new to Amazon DynamoDB. I want to get the last 10 inserted records in descending order using DynamoDB.
DynamoDB can sort data only by the sort key attribute. The ScanIndexForward option controls whether the results come back in ascending or descending order.
Please note that the ordering applies within a single partition key value only. It will not sort all the items in the table and give you the last 10 records.
ScanIndexForward
Specifies the order for index traversal: If true (default), the
traversal is performed in ascending order; if false, the traversal is
performed in descending order.
Sort key definition and example:
A composite partition-sort key is indexed as a partition key element
and a sort key element. This multi-part key maintains a hierarchy
between the first and second element values. For example, a composite
partition-sort key could be a combination of “UserID” (partition) and
“Timestamp” (sort). Holding the partition key element constant, you
can search across the sort key element to retrieve items. This would
allow you to use the Query API to, for example, retrieve all items for
a single UserID across a range of timestamps.
Sounds like you are using the DynamoDB example here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.NodeJs.01.html
The sample data does not have insertion timestamps.
Another catch is that DynamoDB can only sort by the Sort Key; otherwise you need to perform the sorting in code.
So if your Partition Key is the Year, and the Sort Key is the Title, you need to:
Introduce an attribute which provides you with a timestamp of creation.
Create the table with an LSI of this attribute, or create a GSI using the new attribute as your Sort Key.
Now you can use query!
The Query API has options to:
Sort by the Sort Key in descending order (using the ScanIndexForward parameter)
Limit the number of items returned (using the Limit parameter)
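To make those two options concrete, here is a sketch of the request parameters for the low-level Query operation. The table name "Events", the key attribute "UserID", and the helper name are assumptions for illustration; the sort key is assumed to be a creation timestamp. The block only builds the parameter dictionary, so it runs without AWS access.

```python
def last_n_query_params(table_name, partition_value, n=10):
    """Build Query parameters for the newest n items of one partition."""
    return {
        "TableName": table_name,
        # Query always needs an equality condition on the partition key.
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": "UserID"},  # hypothetical attribute name
        "ExpressionAttributeValues": {":pk": {"S": partition_value}},
        # False = traverse the sort key in descending order.
        "ScanIndexForward": False,
        "Limit": n,
    }


params = last_n_query_params("Events", "user-123")
print(params)

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   items = boto3.client("dynamodb").query(**params)["Items"]
```

If the timestamp lives on an LSI or GSI rather than on the table's own sort key, an "IndexName" entry would also be needed.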
The answer by Abhaya Chauhan is mostly correct, though there is one inaccuracy. The Limit parameter does not actually limit the number of items returned, but rather the number of items evaluated (regardless of whether they match the search criteria).
Thus, if you set a Limit of 10, you might get anywhere between 0 and 10 items. See the docs below for more info:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Limit
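The difference matters mainly when a filter is applied on top of the key condition. The following is a toy simulation in plain Python, not the DynamoDB API: it cuts the candidate list to Limit items first and only then applies the filter, which is the order described in the linked docs. The item shape and the function name are made up.

```python
def simulated_query(items, limit, predicate=None):
    """Mimic 'Limit first, filter second' on an already sorted item list."""
    evaluated = items[:limit]  # Limit caps how many items are read
    if predicate is None:
        return evaluated
    return [item for item in evaluated if predicate(item)]


# Hypothetical items, newest first; every third timestamp has status "ok".
items = [
    {"ts": t, "status": "ok" if t % 3 == 0 else "error"}
    for t in range(20, 0, -1)
]

unfiltered = simulated_query(items, limit=10)
filtered = simulated_query(items, limit=10, predicate=lambda i: i["status"] == "ok")
print(len(unfiltered), len(filtered))
```

To reliably collect 10 matching items with a filter in place, a client would have to keep paging until it has gathered enough results.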

How to retrieve item closest to another item in DynamoDB?

I have a DynamoDB table where the sort key has a numeric value.
I have a requirement to retrieve the first item that has a lower value than the one I have.
I have gone through http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_UpdateItem.html#API_UpdateItem_Examples docs but I can see no way to:
- sort the output
- limit the result to 1 entry
Is there any way to actually achieve what I want with DynamoDB?
EDIT:
According to this: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
The results are sorted by the sort key, and when it's numeric, they are returned in numeric order. Which is great, but I still can't find any way to get only a single result [don't want to "pay" for the full table scan in some cases].
Are you searching for the next item which has a lower sort key within the same Partition Key?
In that case, you can use Query as you've found, sort in descending order, and Limit to 1. This will not scan the entire table.
Alternatively, if you wish to search across partitions, unfortunately a Table Scan is the only way to do this.
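For the single-partition case, the request could look roughly like the sketch below. The attribute names "pk" and "sk", the string-typed partition key, and the helper name are assumptions; only the parameter dictionary is built here, so nothing is sent to AWS.

```python
def closest_lower_query_params(table_name, partition_value, current_value):
    """Build Query parameters for the nearest item below current_value."""
    return {
        "TableName": table_name,
        # Same partition, sort key strictly below the value we already have.
        "KeyConditionExpression": "#pk = :pk AND #sk < :current",
        "ExpressionAttributeNames": {"#pk": "pk", "#sk": "sk"},  # hypothetical names
        "ExpressionAttributeValues": {
            ":pk": {"S": partition_value},
            # The low-level API transports numbers as strings under "N".
            ":current": {"N": str(current_value)},
        },
        "ScanIndexForward": False,  # descending, so the closest lower value comes first
        "Limit": 1,
    }


params = closest_lower_query_params("Readings", "sensor-1", 42)
print(params["KeyConditionExpression"])
```

Because the "lower than" test is part of the key condition rather than a filter, the single item read is already a match.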

Find out what numbers are not in sequential order

I have a list of numbers, ranging from 100000 to 101000, and I need to find which ones are not in order. Is there any way to do this? I don't want to go through a list of 1000 numbers by hand.
PS. I am taking this data from SQL, so in this instance I cannot sort the data. I just need to know which numbers are not in the correct order.
If your numbers start in A1 then:
=IF(SMALL(A:A,ROW())=A1,"")
in Row1 and copied down should indicate those that are out of order.
If it is just the number in the SQL entry, you can sort it directly when writing the query by using ORDER BY in your SQL statement.
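The same comparison the worksheet formula makes (the value in row n against the n-th smallest value) can be sketched in Python. The function name and the short sample list are made up for illustration.

```python
def out_of_order(values):
    """Return (row, value) pairs where the value differs from sorted order."""
    expected = sorted(values)
    return [
        (row, actual)
        for row, (actual, wanted) in enumerate(zip(values, expected), start=1)
        if actual != wanted
    ]


# Hypothetical extract: 100003 and 100002 are swapped.
values = [100000, 100001, 100003, 100002, 100004]
print(out_of_order(values))
```

Like the formula, this flags every position that differs from the sorted list, so a single number that sits far from its proper place will mark all the rows in between as well.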
