How to use map to select columns in RDD - apache-spark

I have an RDD dataset of flights and I have to select specific columns from it: column numbers 9, 4, 5, 8 and 17. I then need to create a SQL DataFrame from the result.
I tried the following but I get an error in the map.
q9 = data.map(lambda x: [x[i] for i in [9,4,5,8,17]])
sqlContext.createDataFrame(q9_1, ['Flight Num', 'DepTime', 'CRSDepTime', 'UniqueCarrier', 'Dest']).show(n=20)
What would you do? thanks!
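One thing that stands out in the snippet is the name mismatch: the map result is bound to q9 but q9_1 is passed to createDataFrame. Below is a rough, self-contained sketch of the pattern (select columns by index in a map, then build a DataFrame), using made-up rows and the SparkSession entry point instead of sqlContext; createDataFrame behaves the same way through either.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# made-up rows standing in for the flight records; only the referenced
# indexes (4, 5, 8, 9, 17) need to exist
data = sc.parallelize([tuple(range(20)), tuple(range(100, 120))])

# select the wanted columns by position
q9 = data.map(lambda x: [x[i] for i in [9, 4, 5, 8, 17]])

# pass the same variable that was just built (q9, not q9_1)
spark.createDataFrame(q9, ['Flight Num', 'DepTime', 'CRSDepTime', 'UniqueCarrier', 'Dest']).show(n=20)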

Related

Pandas discard items in set using a different set

I have two columns in a pandas dataframe: parents and cte. Both columns are made up of sets. I want to use the cte column to discard overlapping items in the parents column. The dataframe has over 6K rows, and some of the cte rows have empty sets.
Below is a sample:
import pandas as pd

data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date', 'sales.flat_opportunity_detail', 'dets', 'dets2', 'channel_partner'}
,{'seed', 'dw_salesforce.sf_dw_partner_application'}],
'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)
df
I've used .discard(cte) previously but I can't figure out how to get it to work.
I would like the output to look like the following:
data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date', 'sales.flat_opportunity_detail', 'channel_partner'}
,{'dw_salesforce.sf_dw_partner_application'}],
'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)
df
NOTE: dets, dets2 and seed have been removed from the corresponding parents cell.
Once a row's cte has been compared to its parents, I don't need that row's data again; the next row should only compare against its own data, and so on.
You need to use a loop here.
A list comprehension will likely be the fastest:
df['parents'] = [P.difference(C) for P,C in zip(df['parents'], df['cte'])]
output:
parents cte
0 {channel_partner, select, opportunity, loan_ag... {dets, dets2}
1 {dw_salesforce.sf_dw_partner_application} {seed}
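One detail worth confirming, since the question mentions that some cte rows are empty sets: P.difference(C) with an empty C simply returns P unchanged, so those rows need no special handling. A tiny made-up frame to check:
import pandas as pd

df = pd.DataFrame({
    'parents': [{'a', 'b'}, {'c'}],
    'cte': [{'b'}, set()],  # second row has an empty cte set
})
df['parents'] = [P.difference(C) for P, C in zip(df['parents'], df['cte'])]
print(df['parents'].tolist())  # [{'a'}, {'c'}] -- the empty-set row is left untouched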

How to compute unique values for a group of rows and create a column for all records using that value?

I have a dask dataframe with only 'Name' and 'Value' columns, similar to the table below.
How do I compute the 'Average' column? I tried groupby in dask, but that just gives me a dataframe of 2 records containing the average of A and B.
You can compute the per-Name averages as a small table and then left join your original table with it on Name. From https://docs.dask.org/en/latest/dataframe-joins.html:
small = small.repartition(npartitions=1)
result = big.merge(small)
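For completeness, a rough end-to-end sketch of that approach; the Name/Value sample rows are made up and the 'Average' column name is taken from the question:
import pandas as pd
import dask.dataframe as dd

# made-up rows standing in for the Name/Value table in the question
pdf = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'], 'Value': [1, 3, 10, 20]})
ddf = dd.from_pandas(pdf, npartitions=2)

# per-Name averages: a small two-row table
avg = ddf.groupby('Name')['Value'].mean().to_frame('Average').reset_index()

# single partition on the small side, then left-join it back onto every record
avg = avg.repartition(npartitions=1)
result = ddf.merge(avg, on='Name', how='left')
print(result.compute())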

pyspark RDD - Left outer join on specific key

I have two tables, A and B, each with hundreds of columns. I am trying to apply a left outer join on the two tables, but they have different keys.
I created a new column in B with the same key name as in A, and was then able to apply the left outer join. However, how do I join the two tables if I cannot make the column names consistent?
This is what I have tried:
a = spark.table('a').rdd
b = spark.table('b')
b = b.withColumn("acct_id",col("id"))
b = b.rdd
a.leftOuterJoin(b).collect()
If you have DataFrames, why are you creating RDDs from them? Is there a specific need?
Try the below command on the dataframes -
a.join(b, a.column_name==b.column_name, 'left').show()
Here are a few commands you can use to investigate your dataframe:
##Get column names of dataframe
a.columns
##Get column names with their datatype of dataframe
a.dtypes
##What is the type of object (eg. dataframe, rdd etc.)
type(a)
DataFrames are faster than RDDs, and you already have DataFrames, so I suggest:
from pyspark.sql.functions import col

a = spark.table('a')
b = spark.table('b').withColumn("acct_id", col("id"))
result = a.join(b, a["id"] == b["acct_id"], "left")
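As a self-contained illustration (made-up tables and column names), the join condition can also reference the differently named keys directly, so strictly speaking no renamed column is needed:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up stand-ins for tables a and b; only the key columns matter here
a = spark.createDataFrame([(1, "x"), (2, "y")], ["acct_id", "a_val"])
b = spark.createDataFrame([(1, "m"), (3, "n")], ["id", "b_val"])

# left join on keys with different names -- no withColumn or rename required
result = a.join(b, a["acct_id"] == b["id"], "left")
result.show()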

split, operate and union dataframe in spark

How can we split a dataframe, operate on each individual split, and union the results of all the splits back together?
Let's say I have a dataframe with the columns below. I need to split the dataframe based on channel, operate on each individual split (which adds a new column called bucket), and then union the results back together.
account,channel,number_of_views
groupBy only allows simple aggregated operations. On each split dataframe I need to do feature extraction.
Currently, all feature transformers in spark-mllib support only a single dataframe.
You can randomly split like this:
val Array(training_data, validat_data, test_data) = raw_data_rating_before_spilt.randomSplit(Array(0.6,0.2,0.2))
This will create three dataframes; do what you want with each of them, and then you can join or union them back:
val finalDF = df1.join(df2, df1.col("col_name")===df2.col("col_name"))
You can also join multiple dataframes at the same time.
Is this what you want, or something else?
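For the channel-based split the question actually describes, here is a rough PySpark sketch of the split / per-split operation / union pattern; the sample rows and the bucketing rule (views above the split's mean) are just stand-ins for the real feature extraction:
from functools import reduce
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# made-up rows with the columns from the question
df = spark.createDataFrame(
    [("a1", "web", 10), ("a2", "mobile", 5), ("a3", "web", 7), ("a4", "mobile", 20)],
    ["account", "channel", "number_of_views"],
)

channels = [r["channel"] for r in df.select("channel").distinct().collect()]

parts = []
for ch in channels:
    part = df.filter(F.col("channel") == ch)
    # stand-in for per-split feature extraction: bucket by this split's mean views
    mean_views = part.agg(F.avg("number_of_views")).first()[0]
    part = part.withColumn("bucket", (F.col("number_of_views") > mean_views).cast("int"))
    parts.append(part)

# union the per-channel results back into one dataframe
result = reduce(lambda x, y: x.unionByName(y), parts)
result.show()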

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically, I just want to take the values from df, remove any that are found in df_re, and return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
df1.as("df1").join(
df2.as("df2"),
col("df1.uniq_id") === col("df2.uniq_id"),
"left_outer"
).filter($"df2.uniq_id".isNull)
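The same idea sketched in PySpark, with made-up rows and simplified column names based on the question's output; the filter keeps only the df rows whose UNIQUE_ID has no match in df_re:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("173888", 1), ("6866144:ST64PSIMT5MB:1", 2)],
                           ["UNIQUE_ID", "ROI_VALUE"])
df_re = spark.createDataFrame([("6866144:ST64PSIMT5MB:1",)], ["UNIQUE_ID"])

# left_outer join, then keep rows where the right side found no match
result = (df.join(df_re.withColumnRenamed("UNIQUE_ID", "RE_ID"),
                  col("UNIQUE_ID") == col("RE_ID"),
                  "left_outer")
            .filter(col("RE_ID").isNull())
            .drop("RE_ID"))
result.show(truncate=False)
On Spark 2.x a "left_anti" join type does this in a single step, but the left_outer + isNull pattern above matches the pre-2.0 constraint mentioned in the question.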
