PYSPARK // drop rows like df_new = df_old.iloc[::nth, :] - apache-spark

I'm working with pyspark data frames (pyspark.sql.dataframe.DataFrame).
The data frame consists of ~20,000 rows and I want to extract only 1 in 20 rows into a new data frame, which then has ~1,000 rows.
In python/pandas this can easily be done with df_new = df_old.iloc[::20, :].
How can this be done in pyspark?
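One possible sketch (not an answer from the original post): attach a positional index with zipWithIndex and keep every 20th row. Note that without an explicit ordering Spark only guarantees the current partition order, so the selection is purely positional. The DataFrame below is a stand-in for the real one.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_old = spark.range(20000).toDF("value")   # placeholder for the real ~20,000-row DataFrame

# zipWithIndex pairs each row with its position; the original row lands in a struct column
indexed = df_old.rdd.zipWithIndex().toDF(["row", "idx"])
df_new = indexed.filter(indexed.idx % 20 == 0).select("row.*")
print(df_new.count())   # roughly 1,000 rows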

Related

How to transpose a Pandas DataFrame and name the new columns?

I have a simple Pandas DataFrame with 3 columns. I am trying to transpose it and then rename the columns of the new dataframe, and I am having a bit of trouble.
df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})
I tried using
df = df.T
which transposes the DataFrame into:
TotalInvoicedPrice,123
TotalProductCost,18
ShippingCost,5
So now I have to add the column names "Metrics" and "Values" to this data frame.
I tried using
df.columns["Metrics","Values"]
but I'm getting errors.
What I need to get is DataFrame that looks like:
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
Let's reset the index, then set the column labels:
df.T.reset_index().set_axis(['Metrics', 'Values'], axis=1)
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
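An equivalent variant using rename instead of set_axis (a small sketch; after the transpose and reset_index the value column is labeled 0 by default):
df.T.reset_index().rename(columns={'index': 'Metrics', 0: 'Values'})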
Maybe you can avoid the transpose operation (which has a little performance overhead):
# YOUR DATAFRAME
df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})
# FORM THE LISTS FROM YOUR COLUMNS AND FIRST ROW VALUES
l1 = df.columns.values.tolist()
l2 = df.iloc[0].tolist()
# CREATE A DATA FRAME
df2 = pd.DataFrame(list(zip(l1, l2)), columns=['Metrics', 'Values'])
print(df2)

How to update an empty Pandas dataframe with the sum of the bottom x rows of a column from another Pandas dataframe

I would like to sum the bottom x rows of each column of the dataframe and store the result in another, empty data frame.
I tried the code below but I could not update the dataframe.
The master DataFrame is 'df_new_final' and it contains numerical values.
I want to fill 'df_new_final_tail' with the sums of the last 15 rows of the master DataFrame. But df_new_final_tail is still empty, even though I can see that 'sum_x' is getting calculated. Not sure why it is not getting updated.
Master DataFrame -> df_new_final
Child DataFrame -> df_new_final_tail
df_series_list = df_series.columns.values.tolist()
df_new_final_tail = pd.DataFrame(columns=df_series_list)
for items in df_series_list:
    sum_x = df_new_final.tail(15)[items + '_buy'].sum()
    df_new_final_tail[items] = sum_x
Thanks
Convert the Series produced by sum to a one-column DataFrame with Series.to_frame, and then to a one-row DataFrame by transposing with DataFrame.T:
df_new_final_tail = df_new_final.tail(15).sum().to_frame().T
If df_series is another DataFrame and the column names are the same, just with the suffix _buy in df_new_final, use:
items = df_series.columns
df_new_final_tail = df_new_final.tail(15)[items+'_buy'].sum().to_frame().T
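For illustration, a tiny self-contained run of the to_frame().T pattern (the column names a_buy and b_buy are made up, not from the question):
import pandas as pd

df_new_final = pd.DataFrame({'a_buy': range(20), 'b_buy': range(20)})
# sum() over the last 15 rows gives a Series; to_frame().T turns it into one row
df_new_final_tail = df_new_final.tail(15).sum().to_frame().T
print(df_new_final_tail)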

How to plot this matplotlib code efficiently

I am new to Python and doing a time series analysis of stocks. I created a data frame of the rolling average of 5 stocks according to the percentage change in their close price, so this df has 5 columns, and I have another df with the rolling average of the percentage change of the index's closing price. I want to plot each individual stock column of the df together with the index df. I wrote this code:
fig.add_subplot(5,1,1)
plt.plot(pctchange_RA['HUL'])
plt.plot(N50_RA)
fig.add_subplot(5,1,2)
plt.plot(pctchange_RA['IRCON'])
plt.plot(N50_RA)
fig.add_subplot(5,1,3)
plt.plot(pctchange_RA['JUBLFOOD'])
plt.plot(N50_RA)
fig.add_subplot(5,1,4)
plt.plot(pctchange_RA['PVR'])
plt.plot(N50_RA)
fig.add_subplot(5,1,5)
plt.plot(pctchange_RA['VOLTAS'])
plt.plot(N50_RA)
NOTE: pctchange_RA is a pandas df of 5 stocks and N50_RA is an index df with one column.
You can put your column names in a list and then just loop over it and create the subplots dynamically. Pseudocode would look like the following:
import matplotlib.pyplot as plt

fig = plt.figure()
cols = ['HUL', 'IRCON', 'JUBLFOOD', 'PVR', 'VOLTAS']
for i, col in enumerate(cols):
    ax = fig.add_subplot(5, 1, i + 1)
    ax.plot(pctchange_RA[col])
    ax.plot(N50_RA)
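An alternative sketch using plt.subplots, which creates all five axes in one call (same assumed variable names pctchange_RA and N50_RA; sharex keeps the time axis aligned across panels):
import matplotlib.pyplot as plt

cols = ['HUL', 'IRCON', 'JUBLFOOD', 'PVR', 'VOLTAS']
fig, axes = plt.subplots(len(cols), 1, sharex=True, figsize=(8, 10))
for ax, col in zip(axes, cols):
    ax.plot(pctchange_RA[col], label=col)
    ax.plot(N50_RA, label='N50')
    ax.legend(loc='upper right')
plt.show()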

How to select multiple rows and take the mean value based on the name of the row

From this data frame I would like to select rows with the same concentration and almost the same name. For example, the first three rows have the same concentration and the same name except for the ending Dig_I, Dig_II, Dig_III. I would like to select these three rows, take the mean value of each column, and then create a new data frame from the result.
Here is the whole data frame:
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new_df = df.groupby('concentration').mean()
Note: This will only find the averages for columns with dtype float or int; it will drop the img_name column and take the averages of all the remaining columns.
This may be faster...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js").groupby('concentration').mean()
If you would like to preserve the img_name...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new = df.groupby('concentration').mean()
pd.merge(df, new, left_on = 'concentration', right_on = 'concentration', how = 'inner')
Does that help?

Spark: copy the top N selected rows to a new data frame

I have a data frame which has millions of rows. I need to select all rows of the top 100 ids into another data frame.
I know how to get the top column_ids:
df.groupBy("some_column_id").count()
This will return the column ids and their counts; now I need to filter the rows of these top 100 ids into another data frame.
My sample table is below. Since users 123 and 678 have more rows than the others, I want these two in a separate data frame.
How to do this?
Try
val c = df.groupBy("student id").count.orderBy(desc("count")).limit(100)
val a = df.join(c,df.col("student id") === c.col("student id"), "leftsemi")
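A rough PySpark equivalent of the same leftsemi-join idea (a sketch; the column name "student id" is taken from the snippet above):
from pyspark.sql.functions import desc

top = df.groupBy("student id").count().orderBy(desc("count")).limit(100)
result = df.join(top, on="student id", how="leftsemi")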
You can do this with window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val newDF = df
.withColumn("count", count("*").over(Window.partitionBy("student_id")))
.withColumn("rank", rank().over(Window.orderBy(col("count").desc)))
.where(col("rank") <= 100)
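For completeness, the same window-function approach written in PySpark (a sketch; student_id is assumed as in the Scala snippet, and ties in the count can push the rank filter past 100 ids):
from pyspark.sql import Window
from pyspark.sql.functions import col, count, rank

newDF = (df
         .withColumn("count", count("*").over(Window.partitionBy("student_id")))
         .withColumn("rank", rank().over(Window.orderBy(col("count").desc())))
         .where(col("rank") <= 100))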
