Without using a loop, check numerous conditions in a column using pyspark - apache-spark

I have a huge dataset (more than 10 million rows), like:
az
az
az
az
ca
bb
bb
bb
az
ca
bb
.
.
.
There are a few constraints: "ca" cannot come after "az", and "az" cannot come after "bb". Is there a quick way to enforce this in PySpark without using a loop?
I would like to add a column like the one below.
"ca" cannot come after "az" ---- replace "ca" with "az"
"az" cannot come after "bb" ---- replace "az" with "bb"
az
az
az
az
az
bb
bb
bb
ca
ca
ca
.
.

As mentioned by Luff Li, you appear to be asking for the functionality of ORDER BY. You can specify multiple conditions, such as:
order by col1 desc,
some_function(col2),
col3 desc
In this case the SQL analyzer within Spark can "consider" all of the sorting conditions at once, avoiding the "for loop" you mentioned.
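As a rough illustration of ordering by a derived key, here is a minimal sketch in plain Python rather than Spark. The `RANK` mapping and the `order_values` helper are hypothetical names, and the ranking az < bb < ca is an assumption read off the constraints in the question. In PySpark the same idea could probably be expressed by passing a `when(...).otherwise(...)` ranking expression to `orderBy`. Note that sorting only reorders rows; it does not replace any values.

```python
# Hypothetical ranking (an assumption): "az" first, then "bb", then "ca".
RANK = {"az": 0, "bb": 1, "ca": 2}


def order_values(values):
    """Return the values sorted by their assumed rank, with no explicit loop."""
    return sorted(values, key=RANK.__getitem__)


# The sample column from the question.
sample = ["az", "az", "az", "az", "ca", "bb", "bb", "bb", "az", "ca", "bb"]
ordered = order_values(sample)
print(ordered)
```

The sort is stable, so rows with the same value keep their original relative order.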

Related

Dynamically rename header to 1,2,3,4,5

I'm trying to rename the columns in the table below with ascending integers. The dates will change week on week, so I need a way to rename them dynamically in Alteryx. Is there a way of doing this with the Dynamic Rename tool, or perhaps another method?
To be turned into:
Week Start  1   2   3   4   5
Week        46  47  47  48  49
Thanks
You'll need to use a combination of the Transform tools and a Record ID. Use the Transpose tool with 'Week Start' as your "Key column", then use the Summarize tool, grouping by the new Name field, to get a list of all your dates, and sort the dates in ascending order.
Add a Record ID afterwards to give each date an ID. Join this back to the original Transpose output and then use Cross Tab to get the output you want.
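The logic behind those steps (list the date headers, sort them, number them, and map each header to its number) can be sketched outside Alteryx. This is only an illustration in Python: the function name and the ISO-formatted sample headers are hypothetical, since the original table is not shown.

```python
from datetime import date


def rename_date_columns(columns, key_column="Week Start"):
    """Replace each date header with its 1-based position in ascending date order."""
    # Assumption: every header other than the key column is an ISO date string.
    date_columns = sorted(
        (c for c in columns if c != key_column), key=date.fromisoformat
    )
    mapping = {c: str(i) for i, c in enumerate(date_columns, start=1)}
    # The key column is not in the mapping, so it keeps its name.
    return [mapping.get(c, c) for c in columns]


# Hypothetical headers; the real dates change week on week.
headers = ["Week Start", "2023-11-20", "2023-11-13", "2023-11-27"]
renamed = rename_date_columns(headers)
print(renamed)
```

Because the numbering is derived from the sorted dates, it does not depend on which particular dates appear in a given week.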

How can I get a list of records missing from each group of records?

I have a list of Items, which each have a list of Settings, and I am trying to get all the items to have the same list of settings, so out of the current record set, I want to extract a list of settings that need to be added.
Ideally using just Excel formulas and manipulation (not VBA), how can I achieve this?
I have tried going through manually and comparing items against each other, but it feels like there should be a much faster programmatic way of doing this. Here is the example set of records:
Item Setting
---- --------
BS EDADM
BS EDDIS
BS EDREG
BS EEPRE
CC EDADM
CC EDREG
LB EDDIS
LB EDREG
LB EEPRE
Based upon the example set, I expect the output to be:
Item Setting
---- --------
CC EDDIS
CC EEPRE
LB EDADM
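The expected output is every (Item, Setting) combination that does not already exist in the record set. The question asks for Excel formulas, so the following Python is only a sketch of that logic; the function name is hypothetical.

```python
def missing_settings(records):
    """Return the (item, setting) pairs absent from records, sorted by item then setting."""
    items = sorted({item for item, _ in records})
    settings = sorted({setting for _, setting in records})
    present = set(records)
    # Check every possible combination against the pairs that already exist.
    return [(i, s) for i in items for s in settings if (i, s) not in present]


# The example record set from the question.
records = [
    ("BS", "EDADM"), ("BS", "EDDIS"), ("BS", "EDREG"), ("BS", "EEPRE"),
    ("CC", "EDADM"), ("CC", "EDREG"),
    ("LB", "EDDIS"), ("LB", "EDREG"), ("LB", "EEPRE"),
]
missing = missing_settings(records)
print(missing)
```

The same idea should carry over to a spreadsheet: build the full list of Item and Setting combinations, then keep the ones that have no match in the existing records.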

3 conditions with nested if and AND function in Excel

I am simply trying to categorise some values according to whether they satisfy certain conditions.
If a student's marks meet the requirements of a plan, I need to show the corresponding plan.
Like this:
If the student satisfies the Plan 1 conditions,
the result should be "Plan 1".
If the student satisfies the Plan 2 conditions,
the result should be "Plan 2".
If the student doesn't meet either of the above conditions, then show "NO PLAN".
I used this formula:
=IF(AND(H86>=200,Q86>=250,Y86>=14,AD86>=17,AI86>=18,AN86>=18),"Plan 1",IF(OR(H86>=180,H86<200,Q86>=225,Q86<250,Y86>=13,Y86<14,AD86>=15,AD86<17,AI86>=16,AI86<18,AN86>=16,AN86<18),"Plan 2","NO Plan"))
In this formula, the second IF has to cover two cases: if a mark is between the Plan 2 and Plan 1 thresholds, we need to show Plan 2, and if a mark is greater than or equal to the Plan 2 threshold, we also need to show Plan 2. This is where I am stuck.
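One possible reading of the requirement is sketched below in Python. The thresholds are taken from the formula above. It is an assumption that Plan 2 requires every mark to meet its lower threshold (an AND rather than the OR used in the formula). The function name is hypothetical.

```python
# Thresholds from the formula, in the order H, Q, Y, AD, AI, AN.
PLAN_1 = (200, 250, 14, 17, 18, 18)
PLAN_2 = (180, 225, 13, 15, 16, 16)


def classify(marks):
    """Return the best plan whose thresholds are all met, or "NO PLAN"."""
    if all(m >= t for m, t in zip(marks, PLAN_1)):
        return "Plan 1"
    # Assumption: Plan 2 needs every mark to reach its Plan 2 threshold.
    if all(m >= t for m, t in zip(marks, PLAN_2)):
        return "Plan 2"
    return "NO PLAN"


print(classify((200, 250, 14, 17, 18, 18)))
print(classify((210, 230, 14, 15, 18, 16)))
print(classify((170, 230, 14, 15, 18, 16)))
```

Because the Plan 2 check only runs after the Plan 1 check has failed, it needs no upper bounds such as `H86<200`. Under this reading, the second IF in the formula would wrap only the six `>=` comparisons against the Plan 2 thresholds in AND.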

How to split one row into multiple rows in Talend

I need help migrating one row from my old DB to multiple rows in my new DB.
I have data like:
OID CUSTOMER_NAME DOB ADDRESS
1 XYZ 03/04/1987 ABC
In my new DB I am storing data as key-value pairs, like:
OID KEY VALUE
1 CUSTOMER_NAME XYZ
1 DOB 03/04/1987
1 ADDRESS ABC
Could someone please help me do this using the Talend tool?
You can use a tMap with multiple outputs linked to the same output as one possible solution here, but it is not dynamic. Why can't you split the single row into multiple rows in the source select query itself?
If you want to use this tMap option, see below:
tOracleInput(anyotherinput)-->tMap-->toutput/tlogrow
Take this row as input to the tMap component, and in the tMap create one output group, say out_1.
Now in out_1, drag and link the OID and CUSTOMER_NAME columns from the input.
Now create another output group, out_2, in this tMap, and when the "add a output" dialog appears,
select "create join table from" and choose the out_1 group in the dropdown, so that the output rows from the out_2 group also go to the out_1 group.
So our tMap will have only one output group, out_1, containing rows from both out_1 and out_2. Now in out_2, drag and link the OID and DOB columns.
Similarly, repeat this for out_3 and link the OID and ADDRESS columns.
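The effect of the tMap setup above is to emit one (OID, KEY, VALUE) row per source column. That transformation can be sketched in Python for illustration. This is not Talend code, and the function name is hypothetical.

```python
def to_key_value_rows(row, id_column="OID"):
    """Turn one wide row into a list of (id, key, value) tuples, one per column."""
    oid = row[id_column]
    # Every column except the id becomes its own output row.
    return [(oid, key, value) for key, value in row.items() if key != id_column]


# The sample row from the question.
source_row = {"OID": 1, "CUSTOMER_NAME": "XYZ", "DOB": "03/04/1987", "ADDRESS": "ABC"}
kv_rows = to_key_value_rows(source_row)
for kv in kv_rows:
    print(kv)
```

Unlike the fixed out_1, out_2 and out_3 groups, this sketch does not hard-code the column names, which is the sense in which the tMap option is "not dynamic".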
Use tSplitRow to do it. Please see below.
After spending an hour or two, I found a solution using Talend without writing a single line of Java code.
If you follow all my steps, you will get the desired result.
Note: I took your inputs as the source for this development, so actuals may differ.
Add a tMap after your input source.
Concatenate the source columns with commas into a single column.
At the end of the concatenated columns, add a semicolon. See the image for more details.
After the tMap, add a tNormalize component and configure it as in the image.
Add a tDenormalize component and configure it as in the image.
Add a tExtractDelimitedFields component and configure it as shown in the image.
Add another tMap and configure it as shown in the image.
Now you have two output flows, so add another tNormalize component for each output.
Configure the first tNormalize component as shown in the image.
Configure the second tNormalize component with the settings shown in the image.
Our final job will look like the image below.
After doing all these things, you will have this output.
Now you can create another sub job to process these outputs, joining them and creating a new one as per your requirement.
tOracleInput(anyotherinput)-->tSplitRow-->toutput/tlogrow
You can use tPivotToColumnsDelimited.
Read more about it on the Talend Help Center.
This component will rotate your table on the basis of a specified row.
Thanks.

Grouping Zip Codes by Region in Excel

I have roughly 50,000 random entries in a worksheet categorized by Zip Code, and I need to group them by Zone. I have a reference list that shows which Zip corresponds to which Zone. How can I add a column to the worksheet with the Zone that corresponds to each Zip without manually looking it up and typing it in?
This is what the reference list looks like:
Zone Zip
1 03227
1 03254
1 03269
...
2 05687
2 05691
etc
Here's an example using VLookup
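The lookup itself is a key-to-value mapping from Zip to Zone. As a sketch of that idea in Python rather than Excel: the function name and the zip `99999` are hypothetical, and the mapping holds only the reference rows shown above. Zips are kept as strings so that leading zeros survive.

```python
# Reference list from the question, keyed by Zip (strings keep the leading zeros).
ZONE_BY_ZIP = {"03227": 1, "03254": 1, "03269": 1, "05687": 2, "05691": 2}


def add_zone_column(zips, zone_by_zip):
    """Pair each Zip with its Zone, or None when the Zip is not in the reference list."""
    return [(z, zone_by_zip.get(z)) for z in zips]


rows = add_zone_column(["05691", "03227", "99999"], ZONE_BY_ZIP)
print(rows)
```

In Excel, VLOOKUP searches the first column of the lookup range, so the reference list would need Zip to the left of Zone for a direct lookup.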
