How to insert RDD data into a DataFrame in PySpark? - apache-spark

Please find the pseudocode below:
# source dataframe with 5 columns
# creating a target dataframe with a schema of 6 columns
for item in source_dataframe:
    # adding a column to the list by checking item.column2
    list = [item.column1, item.column2, newcolumn]
    # creating an RDD out of this list
    # now I need to add this RDD to the target dataframe?

You could definitely explain your question in a bit more detail or give some sample code. I'm interested in how others will solve it. My proposed solution is this one:
df = (
    sc.parallelize([
        (134, "2016-07-02 12:01:40"),
        (134, "2016-07-02 12:21:23"),
        (125, "2016-07-02 13:22:56"),
        (125, "2016-07-02 13:27:07")
    ]).toDF(["itemid", "timestamp"])
)
# DataFrames do not expose map() directly; go through .rdd
rdd = df.rdd.map(lambda x: (x[0], x[1], 10))
df2 = rdd.toDF(["itemid", "timestamp", "newCol"])
# Combine Column conditions with & (Python's `and` does not work on Columns)
df3 = (
    df.join(
        df2,
        (df.itemid == df2.itemid) & (df.timestamp == df2.timestamp),
        "inner"
    )
    .drop(df2.itemid)
    .drop(df2.timestamp)
)
I'm converting the RDD to a DataFrame. Afterwards I join both DataFrames, which duplicates some columns, so finally I drop those duplicated columns.
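For the original question (adding a sixth, derived column without going through an RDD at all), withColumn is usually simpler. A minimal sketch, assuming the column names from the pseudocode above and a placeholder when/otherwise rule standing in for "checking item.column2":
from pyspark.sql import functions as F

# Sketch: derive the new column directly on the source DataFrame.
# The condition below is a placeholder for whatever check you need on column2.
target_df = source_dataframe.withColumn(
    "newcolumn",
    F.when(F.col("column2") > 0, F.lit("flagged")).otherwise(F.lit("ok"))
)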

Related

Reshaping a Pandas Data Frame and Creating New Columns Based on a Column

I want to reshape a data frame like the following:
import pandas as pd

df1 = pd.DataFrame(
    columns=['Serial', 'Seq_Sp', 'PT', 'FirstPT', 'DiffAngle', 'R1'],
    data=[['1001W', '2_1', 15.13, 15.07, 1.9, 7.4],
          ['1001W', '2_2', 16.02, 15.80, 0.0, 0.05],
          ['1001W', '2_3', 14.3, 15.3, 6, 0.32],
          ['1001W', '2_4', 14.18, 15.07, 2.2, 0.16],
          ['6279W', '2_1', 15.13, 15.13, 2.3, 0.31],
          ['6279W', '2_2', 13.01, 15.04, 1.3, 0.04],
          ['6279W', '2_3', 14.13, 17.04, 2.3, 0.31],
          ['6279W', '2_4', 14.01, 17.23, 3.1, 1.17]])
display(df1)
And create a new one with unique serial numbers and a long vector of new columns, like the following:
df2 = pd.DataFrame(
    columns=['Serial',
             'PT_2_1', 'FirstPT_2_1', 'DiffAngle_2_1', 'R1_2_1',
             'PT_2_2', 'FirstPT_2_2', 'DiffAngle_2_2', 'R1_2_2',
             'PT_2_3', 'FirstPT_2_3', 'DiffAngle_2_3', 'R1_2_3',
             'PT_2_4', 'FirstPT_2_4', 'DiffAngle_2_4', 'R1_2_4'],
    data=[['1001W', 15.13, 15.07, 1.9, 7.4, 16.02, 15.80, 0.0, 0.05, 14.3, 15.3, 6, 0.32, 14.18, 15.07, 2.2, 0.16],
          ['6279W', 15.13, 15.13, 2.3, 0.31, 13.01, 15.04, 1.3, 0.04, 14.13, 17.04, 2.3, 0.31, 14.01, 17.23, 3.1, 1.17]])
df2
I appreciate any help!
Use pivot_table:
df2 = df1.pivot_table(index='Serial', columns='Seq_Sp')
df2.columns = df2.columns.map('_'.join).str.strip('_')
df2
Partial output omitted.
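If you also want Serial back as an ordinary column, matching the expected df2 above, a small follow-up sketch:
# Serial is currently the index; turn it back into a regular column
df2 = df2.reset_index()
df2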

Convert multiple DataFrames to numpy arrays

I am attempting to convert 5 dataframes to numpy arrays in a loop.
df = [df1, df2, df3, df4, df5]
for index, x in enumerate(df):
    x = x.to_numpy()
print(type(df3)) still gives me a pandas DataFrame as the output.
This does not save the result back into the list; x is just a loop variable. Assign back by index instead:
for index, x in enumerate(df):
    df[index] = x.to_numpy()
Then you can access each array with
df[0]
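If you would rather not overwrite the original list, here is a sketch of an alternative that keeps the DataFrames and the arrays side by side (the names arrays and arrays_by_name are illustrative):
dfs = [df1, df2, df3, df4, df5]
# A new list of numpy arrays, in the same order as dfs
arrays = [x.to_numpy() for x in dfs]
# Or keyed by a label, so each array stays addressable by name
arrays_by_name = {f"df{i + 1}": x.to_numpy() for i, x in enumerate(dfs)}
print(type(arrays_by_name["df1"]))  # <class 'numpy.ndarray'>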

Spark DataFrame Join: Non-matching Records from the First DataFrame

Hi all, I have two DataFrames and I'm applying a join condition on them.
After the join I want all the rows from the first DataFrame whose name, id, code and lastname do not match the second DataFrame. I have written the code below.
val df3 = df1.join(df2, df1("name") !== df2("name_2") &&
    df1("id") !== df2("id_2") &&
    df1("code") !== df2("code_2") &&
    df1("lastname") !== df2("lastname_2"), "inner")
  .drop(df2("id_2"))
  .drop(df2("name_2"))
  .drop(df2("code_2"))
  .drop(df2("lastname_2"))
Expected result:
DF1
id,name,code,lastname
1,A,001,p1
2,B,002,p2
3,C,003,p3
DF2
id_2,name_2,code_2,lastname_2
1,A,001,p1
2,B,002,p4
4,D,004,p4
DF3
id,name,code,lastname
3,C,003,p3
Can someone please tell me whether this is the correct way to do this, or should I use a SQL query with 'NOT IN'? I am new to Spark and this is my first time using the DataFrame methods, so I am not sure whether this is the correct approach.
I recommend using the Spark API to work with the data:
val df1 =
Seq((1, "20181231"), (2, "20190102"), (3, "20190103"), (4, "20190104"), (5, "20190105")).toDF("id", "date")
val df2 =
Seq((1, "20181231"), (2, "20190102"), (4, "20190104"), (5, "20190105")).toDF("id", "date")
Option 1. You can get all the rows that are not included in the other DataFrame:
val df3=df1.except(df2)
Option 2. You can use specific fields to do an anti join, for example 'id':
val df3 = df1.as("table1").join(df2.as("table2"), $"table1.id" === $"table2.id", "leftanti")
df3.show()
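For the original four-column case, one reading of the expected DF3 is "rows of DF1 with no match in DF2 on any of the four columns". A sketch of that as a left anti join, written in PySpark to match the rest of this page (column names taken from the question):
cond = (
    (df1["id"] == df2["id_2"]) |
    (df1["name"] == df2["name_2"]) |
    (df1["code"] == df2["code_2"]) |
    (df1["lastname"] == df2["lastname_2"])
)
# left_anti keeps the df1 rows for which no df2 row satisfies the condition
df3 = df1.join(df2, cond, "left_anti")
df3.show()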

How to get the numeric columns from a PySpark DataFrame and calculate the z-score

sparkSession = SparkSession.builder.appName("example").getOrCreate()
df = sparkSession.read.json('hdfs://localhost/abc/zscore/')
I am able to read the data from HDFS, and I want to calculate the z-score for the numeric columns only.
You can convert df to pandas and calculate the z-score:
from pyspark.sql import SparkSession
from scipy.stats import zscore

sparkSession = SparkSession.builder.appName("example").getOrCreate()
df = sparkSession.read.json('hdfs://localhost/SmartRegression/zscore/').toPandas()
num_cols = df._get_numeric_data().columns
results = df[num_cols].apply(zscore)
print(results)
Note that toPandas() does not work for big datasets, as it tries to load the whole dataset into driver memory.
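If the dataset really is too large for toPandas(), here is a sketch of the same z-score computed natively in Spark (standard DataFrame API; df is the DataFrame read above, and stddev here is the sample standard deviation):
from pyspark.sql import functions as F
from pyspark.sql.types import NumericType

# Pick out the numeric columns from the schema
num_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]

# One pass to collect the mean and stddev of every numeric column
stats = df.select(
    *[F.mean(c).alias(c + "_mean") for c in num_cols],
    *[F.stddev(c).alias(c + "_std") for c in num_cols]
).first()

# z-score = (value - mean) / stddev, per column
zscored = df.select(
    *[((F.col(c) - stats[c + "_mean"]) / stats[c + "_std"]).alias(c + "_zscore")
      for c in num_cols]
)
zscored.show()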

How to change the column names of a Pandas DataFrame that was saved with "pickle"?

I saved a Pandas DataFrame with "pickle". When I load it back, it looks like Figure A (which is fine). But when I try to change the names of the columns, it looks like Figure B.
What am I doing wrong? What other ways are there to change the column names?
Figure A
Figure B
import pandas as pd
df = pd.read_pickle('/home/myfile')
df = pd.DataFrame(df, columns=('AWA', 'REM', 'S1', 'S2', 'SWS', 'ALL'))
df
pd.read_pickle already returns a DataFrame.
And you're trying to create a DataFrame from an existing DataFrame, just with renamed columns. That's not necessary...
As you want to rename all columns:
df.columns = ['AWA', 'REM','S1','S2','SWS','ALL']
Renaming specific columns in general could be achieved with:
df.rename(columns={'REM':'NewColumnName'},inplace=True)
Pandas docs
I have just solved it.
df = pd.read_pickle('/home/myfile')
df1 = pd.DataFrame(df.values * 100)
df1.index = 'Feature' + (df1.index + 1).astype(str)
df1.columns = ('AWA', 'REM', 'S1', 'S2', 'SWS', 'ALL')
df1
