extract second row values in spark data frame - apache-spark

I have a Spark DataFrame for a table (1000000x4), sorted by the second column.
I need to get two values: second row, column 0, and second row, column 3.
How can I do it?

If you just need the values, it's pretty simple: use the DataFrame's underlying RDD. You didn't specify the language, so I'll take the liberty of showing how to do this in Python 2.
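df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])
requiredRows = [0, 2]
# zipWithIndex yields (row, index) pairs; keep the pairs whose index is wanted.
# Indexing the pair avoids a tuple-unpacking lambda, which Python 3 rejects.
data = (df.rdd.zipWithIndex()
        .filter(lambda pair: pair[1] in requiredRows)
        .collect())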
You can now work with the values inside the data list. I left the index inside every tuple on purpose, to give you a better idea of how this works.
print(data)
# [(Row(name=u'Bonsanto', age=20, balance=2000.0), 0),
#  (Row(name=u'Mises', age=60, balance=1000.0), 2)]
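To connect this back to the question (second row, columns 0 and 3), here is a small plain-Python model of what zipWithIndex followed by filter does. It is only a sketch of the logic, not Spark code: the four-column rows and the zip_with_index helper are made up for illustration, and enumerate stands in for RDD.zipWithIndex.

```python
# Hypothetical 4-column rows, assumed to be already sorted.
rows = [
    ("a", 1, 10.0, "x"),
    ("b", 2, 20.0, "y"),
    ("c", 3, 30.0, "z"),
]


def zip_with_index(items):
    """Mimic RDD.zipWithIndex: yield (item, index) pairs, index starting at 0."""
    for index, item in enumerate(items):
        yield item, index


required_rows = {1}  # the second row, because indices are zero-based

# Same shape as the collected result: a list of (row, index) pairs.
picked = [(row, i) for row, i in zip_with_index(rows) if i in required_rows]

second_row = picked[0][0]
col0, col3 = second_row[0], second_row[3]
print(col0, col3)  # b y
```

On a real DataFrame the rows would be Row objects, which can also be indexed by position, so the last step should carry over unchanged.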

Related

split row value by separator and create new columns

I have a PySpark dataset with a column “channels” that looks like this:
channels
name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5
I want to create 5 new columns i.e “channel1, channel2, channel3, channel4, channel5”.
Then, I want to split the contents of the “channels” column using the comma separator. After splitting values from each row, I want to put each separated value in a different column.
For example for the first row, the columns should look like this:
channel1 channel2 channel3 channel4 channel5
name1 name2 name3 name4 ~
When an element is not found, I want to use ~ as the column value. For example, the first row has only 4 values instead of 5, so I used ~ for the channel5 column.
I only want to use ~, not None or NULL.
How can I achieve this result in pyspark?
I tried this:
df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
df = df.drop("channels_split")
but it gives me an error that:
`~` is missing
You're referencing the column `~`, but it is missing from the schema. Please check your code.
Note that I am using pyspark within Foundry
coalesce expects Column arguments, and you are passing a string. I think you should use lit("~") instead.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.coalesce.html
This is more of a spark problem than a Foundry problem though.
What @M_S said is correct: as the error message states, coalesce needs a column, so you should use the lit function.
But be careful: if spark.sql.ansi.enabled is set to True, this code will throw an ArrayIndexOutOfBoundsException whenever the array has fewer than 5 items.
Another way to do this is to make sure the array has at least 5 elements on every row by appending some ~ values, then take the first 5 items. Alternatively, add a temporary column holding the length of the array and use a when condition.
By the way, you don't need to repeat df = every time
from pyspark.sql.functions import array_repeat, col, concat, lit, split

df = (
    df
    .withColumn("channels_split", split(col("channels"), ","))
    # Five "~" fillers, so indexes 0 to 4 always exist after the concat
    .withColumn("default_values", array_repeat(lit("~"), 5))
    .withColumn("channels_split", concat(col("channels_split"), col("default_values")))
    # Use col(...) here: the original df has no channels_split column yet
    .withColumn("channel1", col("channels_split")[0])
    .withColumn("channel2", col("channels_split")[1])
    .withColumn("channel3", col("channels_split")[2])
    .withColumn("channel4", col("channels_split")[3])
    .withColumn("channel5", col("channels_split")[4])
    .drop("channels_split", "default_values")
)
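The pad-then-take-the-first-five idea is easy to check outside Spark. The following is a plain-Python sketch of the per-row logic only; split_channels and its width and filler parameters are hypothetical names, not part of any Spark API.

```python
def split_channels(value, width=5, filler="~"):
    """Split on commas, pad with `filler`, and keep exactly `width` items."""
    parts = value.split(",")
    # Appending `width` fillers guarantees at least `width` items,
    # so the slice never comes up short.
    return (parts + [filler] * width)[:width]


rows = [
    "name1,name2,name3,name4",
    "happy1,happy2",
    "entity1,entity2,entity3,entity4,entity5",
]
result = [split_channels(row) for row in rows]
for channels in result:
    print(channels)
# ['name1', 'name2', 'name3', 'name4', '~']
# ['happy1', 'happy2', '~', '~', '~']
# ['entity1', 'entity2', 'entity3', 'entity4', 'entity5']
```

Because every padded list has at least five items, positions 0 to 4 are always valid, which is why this approach avoids the out-of-bounds problem described above.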

Pandas discard items in set using a different set

I have two columns in a pandas dataframe; parents and cte. Both columns are made up of sets. I want to use the cte column to discard overlapping items in the parents column. The dataframe is made up of over 6K rows. Some of the cte rows have empty sets.
Below is a sample:
data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date', 'sales.flat_opportunity_detail', 'dets', 'dets2', 'channel_partner'}
,{'seed', 'dw_salesforce.sf_dw_partner_application'}],
'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)
df
I've used .discard(cte) previously but I can't figure out how to get it to work.
I would like the output to look like the following:
data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date', 'sales.flat_opportunity_detail', 'channel_partner'}
,{'dw_salesforce.sf_dw_partner_application'}],
'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)
df
NOTE: dets, dets2 and seed have been removed from the corresponding parents cell.
Once the cte is compared to the parents, I don't need data from that row again. The next row will only compare data on that row and so on.
You need to use a loop here.
A list comprehension will likely be the fastest:
df['parents'] = [P.difference(C) for P,C in zip(df['parents'], df['cte'])]
output:
                                             parents            cte
0  {channel_partner, select, opportunity, loan_ag...  {dets, dets2}
1          {dw_salesforce.sf_dw_partner_application}         {seed}
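As for why .discard(cte) did not work: discard removes a single element, so passing it a whole set makes Python look for that set as one member, and nothing is removed. A minimal plain-Python sketch of the difference, using ordinary lists as stand-ins for the two DataFrame columns (the short sample values are made up):

```python
parents = {"seed", "dw_salesforce.sf_dw_partner_application"}
cte = {"seed"}

# discard() removes ONE element equal to its argument. Given a set, Python
# searches for an equivalent frozenset member, finds none, and does nothing.
attempt = set(parents)
attempt.discard(cte)
print(attempt == parents)  # True: nothing was removed

# difference() (or the - operator) removes every element found in cte.
print(parents.difference(cte))  # {'dw_salesforce.sf_dw_partner_application'}

# Row by row over two parallel lists; an empty cte leaves the row unchanged.
parents_col = [{"a", "dets", "dets2"}, {"seed", "b"}, {"c"}]
cte_col = [{"dets", "dets2"}, {"seed"}, set()]
cleaned = [p - c for p, c in zip(parents_col, cte_col)]
print(cleaned)  # [{'a'}, {'b'}, {'c'}]
```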

pandas: multiple conditions and perform arithmetic operation

I have a data frame with a few columns. I want to check multiple conditions (>=, <=). When the conditions are met, I want to subtract the corresponding row's column value from the score.
import numpy as np  # needed for np.where below
import pandas as pd
data = {'diag_name': ['diag1', 'diag2', 'diag3', 'diag4', 'diag5', 'diag6', 'diag7'],
        'final_score': [100, 90, 89, 100, 100, 99, 100],
        'count_corrected': [2, 0, 2, 2, 0, 1, 1]}
# Create DataFrame
df = pd.DataFrame(data)
To explain with an example: if final_score is 100 and count_corrected > 0, then the corresponding value of count_corrected needs to be subtracted from final_score. Otherwise, final_score_new will be the same as final_score.
df['final_score_new']=np.where((df.count_corrected>0) & (df.final_score==100),100-df.count_corrected,df['final_score'])
df[['final_score_new','final_score','count_corrected']] ## check
I hope I am doing the operation and the checks correctly. I wanted to confirm that I am not messing up any indices.
Thank you.
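One way to sanity-check the expression is to recompute the column row by row in plain Python and compare. This is only a cross-check sketch built from the sample data above; the variable names are chosen for illustration.

```python
final_score = [100, 90, 89, 100, 100, 99, 100]
count_corrected = [2, 0, 2, 2, 0, 1, 1]

# Subtract the correction only when the score is exactly 100 and there is
# at least one correction; otherwise keep the score unchanged.
expected = [
    score - count if (score == 100 and count > 0) else score
    for score, count in zip(final_score, count_corrected)
]
print(expected)  # [98, 90, 89, 98, 100, 99, 99]
```

If df['final_score_new'].tolist() matches this list, the np.where call is doing what the description asks. Both conditions and both value arguments come from the same DataFrame, so they line up row by row.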

How to store a df.column in a list without index in a loop?

df.shape (15,4)
I want to store 4th column of df within the loop in a list. What I'm trying is:
l = []
n = 1000  # number of iterations
for i in range(0, n):
    # df expressions and result calculation equations
    l.append(df.iloc[:, 2])  # This stores the values with the index. I want to store them without indices while keeping inplace=True.
df_new = pd.DataFrame(np.array(l), columns=df.index)
I want the list l to hold only the values from the df column, not a pandas.core.series Series object in each cell.
Use df.iloc[:, 2].tolist() inside append to get the desired result. Note that iloc[:, 2] selects the third column; the fourth column would be iloc[:, 3].

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
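from pyspark.sql import Row

vals = [
    Row(ID=1, VAL=None),
    Row(ID=2, VAL=None),
    Row(ID=3, VAL=None),
    Row(ID=4, VAL=None),
    Row(ID=5, VAL=None),
    Row(ID=6, VAL=None),
    Row(ID=7, VAL=None),
    Row(ID=8, VAL=None),
    Row(ID=9, VAL=None),
    Row(ID=10, VAL=None)
]
# VAL is all None, so its type cannot be inferred; pass the schema explicitly
df = spark.createDataFrame(vals, "ID int, VAL string")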
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket" and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code that you can map to your own solution.
Using a window function over a single partition, we can generate a sequential number with row_number() for each row of the DataFrame and store it in, say, a column named row_num.
Next, your "rules" can be represented as another small DataFrame: [min_row_num, max_row_num, label].
All you need to do is join those two datasets on the row number, which adds the new column:
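from pyspark.sql.functions import col

# Alias both sides so the qualified names 'df1.…' and 'df2.…' resolve
result = (
    df1.alias('df1')
    .join(df2.alias('df2'),
          on=col('df1.row_num').between(col('df2.min_row_num'), col('df2.max_row_num')))
    .select('df1.*', 'df2.label')
)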

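The three steps (number the rows, describe the rules as ranges, join on the range) can be traced in plain Python to see what the join produces. This is a model of the logic only, not Spark code. The ID values are made up, and ordering by ID is an assumption, since the question does not say how the rows should be ordered.

```python
ids = [17, 4, 42, 8, 15, 23, 16, 99, 3, 7]  # hypothetical non-consecutive IDs

# Step 1: the equivalent of row_number() over one ordered partition,
# giving a 1-based sequence (ordering by ID is an assumption).
numbered = [(row_num, id_) for row_num, id_ in enumerate(sorted(ids), start=1)]

# Step 2: the rules as (min_row_num, max_row_num, label).
rules = [(1, 3, "lets"), (4, 6, "bucket"), (7, 10, "this")]

# Step 3: the range join, where each row picks up the label whose range
# contains its row number.
labelled = [
    (id_, label)
    for row_num, id_ in numbered
    for low, high, label in rules
    if low <= row_num <= high
]
for pair in labelled:
    print(pair)
```

The ranges do not overlap and together cover row numbers 1 to 10, so every row gets exactly one label: three rows get "lets", three get "bucket" and four get "this".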