pyspark: Operate on column sets in parallel - apache-spark

If I were interested in working on groups of rows in Spark, I could, for example, add a column denoting group membership and use a pandas groupby to operate on those groups in parallel (independently). Is there a way to do this for columns instead? I have a df with many feature columns and 1 target column, and I would like to run a calculation on each set {column_i, target} for i = 1...p.
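One way to approach this, as a minimal sketch (assuming Spark 3.x with pyarrow available, feature columns named f1...f3 and a target column named target, none of which come from the question itself), is to unpivot the wide DataFrame into long format so each feature becomes its own group of rows, and then apply a grouped pandas UDF per (feature, target) group, which Spark runs in parallel:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real frame (assumed names: f1..f3 plus target).
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0, 0), (4.0, 5.0, 6.0, 1), (7.0, 8.0, 9.0, 0)],
    ["f1", "f2", "f3", "target"],
)
feature_cols = ["f1", "f2", "f3"]

# Unpivot to long format: one row per (feature, value, target).
stack_expr = "stack({n}, {pairs}) as (feature, value)".format(
    n=len(feature_cols),
    pairs=", ".join("'{c}', {c}".format(c=c) for c in feature_cols),
)
long_df = df.selectExpr(stack_expr, "target")

def per_feature(pdf: pd.DataFrame) -> pd.DataFrame:
    # Example per-{column_i, target} calculation: Pearson correlation.
    return pd.DataFrame({"feature": [pdf["feature"].iloc[0]],
                         "corr": [pdf["value"].corr(pdf["target"])]})

# Each feature's group is processed independently and in parallel.
result = long_df.groupBy("feature").applyInPandas(
    per_feature, schema="feature string, corr double")
result.show()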

Related

How to fix NaN values when concatenating multiple tables in a Jupyter Notebook

I am using a Jupyter Notebook. I have concatenated multiple tables, but when I run the head() command I cannot see the values in the age and gender columns of the table; instead it shows me NaN values against each user_id.
The following image_1 shows the output when I concatenated the two different tables.
How can I sort out this issue, or can you suggest another way to concatenate the tables so that I can see all of the table values?
Or do I need to access the tables separately and apply operations on each table on its own?
I am expecting to get the values in the age and gender columns rather than NaN values.
When I use these tables separately, they show correct results, but I have a big-data problem, so I need to concatenate the tables to access each of the feature columns. In the end, I can apply operations on the concatenated table's features.
I have been testing your problem with the two CSVs from your GitHub.
First of all, I loaded the two tables as 'df1' and 'df2' after importing the pandas library.
import pandas as pd
df1 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_1.csv')
df2 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_2.csv')
Then, using the pandas library, you can merge both dataframes by choosing the column to join on, in this case 'user_id'.
final_df= pd.merge(df1, df2,on='user_id')
Finally, we have 'final_df' with all the information from both tables and without NaNs.
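As a quick sanity check (assuming the age and gender columns described above), you can confirm that the merge left no NaNs:
final_df.head()
# Count missing values per column; both counts should be 0 after merging on 'user_id'
final_df[['age', 'gender']].isna().sum()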

how to access the first row of each group after using groupby in pandas

I am new to Python and pandas and need help with the following.
I have some data in a dataframe on which I am using the groupby function. After using groupby, I want to access the first row of each group and copy it to a different dataframe.
For example:
grouped = com.groupby(['Month'])
Using this gives me groups created month-wise. I just want to extract the first row from each of the 12 groups and copy those row details to another dataframe.
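A minimal sketch of one way to do this (the 'com' dataframe below is a small hypothetical stand-in with a Month column):
import pandas as pd

# Hypothetical data standing in for the questioner's 'com' dataframe
com = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Feb', 'Feb'],
    'Sales': [10, 20, 30, 40],
})

# head(1) keeps the first row of each group, with all original columns
first_rows = com.groupby('Month').head(1)

# Alternatively, first() takes the first non-null value of every column per group
first_per_group = com.groupby('Month').first().reset_index()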

Spark RDD write to Cassandra

I have the below Cassandra table schema:
ColumnA Primary Key
ColumnB Clustering Key
ColumnC
ColumnD
Now, I have a Spark RDD with columns ordered as
RDD[ColumnC, ColumnA, ColumnB, ColumnD]
So, when I am writing to the Cassandra table, I need to make sure the ordering is correct, so I have to specify the column ordering using SomeColumns:
rdd.saveToCassandra(keyspace, table, SomeColumns("ColumnA", "ColumnB", "ColumnC", "ColumnD"))
Is there any way I can pass all the column names as a list instead? I am asking because I have around 140 columns in my target table and cannot give all the names as part of SomeColumns, so I am looking for a cleaner approach.
PS: I cannot write it from a DataFrame; I am looking only for a solution based on RDDs.
You can use the following syntax to expand a sequence into a list of arguments:
SomeColumns(names_as_sequence: _*)
Update:
If you have a sequence of column names as strings, then you need to do:
SomeColumns(names_as_string_seq.map(x => x.as(x)): _*)

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of the dataframes is the user_id column (in your question you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you only need the distinct values of one column from each dataframe.
In order to improve the performance, I'd do the following:
Create two small DFs which hold only the user_id column of each dataframe.
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant column).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: this is equivalent to dropDuplicates()).
This will dramatically reduce the size of each dataframe, as each new dataframe will hold only the distinct values of the user_id column.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection.
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since that gets rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
Or the version with distinct should work as well:
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()

Is there a way to slice dataframe based on index in pyspark?

In Python or R, there are ways to slice a DataFrame using an index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on the location of rows?
Short Answer
If you already have an index column (suppose it is called 'id'), you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. For that you should have some ordering built into your data, based on some other column (orderBy("someColumn")).
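For example, a minimal sketch of adding such an index with a window function (assuming the DataFrame is called df and that someColumn defines the desired row order; note that between is inclusive on both ends):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# A window with no partitioning gives a global row order, but it pulls all rows
# into a single partition, so this is only practical for moderately sized data.
w = Window.orderBy("someColumn")
df_indexed = df.withColumn("id", F.row_number().over(w) - 1)  # 0-based index

df_indexed.where(F.col("id").between(5, 10)).show()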
Full Explanation
No it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas). Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest components of a Spark job.)
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your Spark dataframe to a Koalas dataframe.
Koalas is a dataframe library by Databricks that gives an almost pandas-like interface to Spark dataframes. See here: https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here
