I have dataframe1 (see image) and dataframe2 (see image), and the inputs for dataframe1 are dynamic.
I have to create a final dataframe, taking into account the dynamic nature of dataframe1 and its mapping to the columns of dataframe2.
Output 1: the final dataframe, which will have column names only after mapping both dataframes based on the given condition.
Related
I have a mapping/lookup table/DF according to which I have to extract values from a highly nested JSON/dictionary. These values have to be inserted as column values into a Delta table. How do I do this while leveraging PySpark's parallelism?
I know I can collect() the mapping dataframe, open the JSON file, update each column of a row of a temp df, and append it to the Delta table, but that will not run in parallel.
Alternatively, I broadcast the dict/JSON, iterate over the mapping dataframe using foreach(), and upsert my Delta table according to a when condition. But Column.when() does not allow me to update a Delta table, nor does DeltaTable.merge() allow me to compare a dataframe with a dict.
If I were interested in working on groups of rows in Spark, I could, for example, add a column that denotes group membership and use a pandas groupby to operate on these rows in parallel (independently). Is there a way to do this for columns instead? I have a df with many feature columns and one target column, and I would like to run calculations on each set {column_i, target} for i = 1...p.
I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines in order to produce a plot.
How can I perform such a query efficiently?
As described in the question, the only relevant part of the dataframe is the column user_id (in your question you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which will hold only the user_id column of each dataframe.
This dramatically reduces the size of each dataframe, as it will hold only one column (the only relevant one).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates()).
This dramatically reduces the size of each dataframe, as each new dataframe will hold only the distinct values of the user_id column.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the two minimal dataframes above in order to get the unique values of the intersection, as sketched below.
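A minimal sketch of that last step, reusing the dfAuseridDist and dfBuseridDist dataframes defined above (broadcast is assumed to come from pyspark.sql.functions, as in the question):
from pyspark.sql.functions import broadcast

# Join the two de-duplicated single-column dataframes and count the
# user_ids that appear in both, i.e. the size of the intersection.
intersection_count = (
    dfAuseridDist
    .join(broadcast(dfBuseridDist), ["user_id"], how="inner")
    .count()
)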
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since that way you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()
In Python or R, there are ways to slice a DataFrame using an index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in PySpark to slice data based on the location of rows?
Short Answer
If you already have an index column (suppose it is called 'id'), you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. Your data should have some ordering built in, based on one or more other columns (orderBy("someColumn")); a sketch of adding such an index follows.
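One way to add an index, as a minimal sketch (assuming a column someColumn that defines the desired order; note that a window without partitionBy collects all rows into a single partition, which can be expensive on large data):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number the rows 1..N in the order given by someColumn (a stand-in column name).
w = Window.orderBy("someColumn")
df_indexed = df.withColumn("id", row_number().over(w))

df_indexed.where(col("id").between(5, 10))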
Full Explanation
No, it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas). Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (The shuffling of data is typically one of the slowest components of a Spark job.)
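As an illustration (not part of the original answer), an ordering-dependent function such as lag needs a window specification, and a window ordered over the whole DataFrame forces Spark to sort and shuffle the data:
from pyspark.sql import Window
from pyspark.sql.functions import lag

# "someColumn" is a stand-in for whatever column defines the order in your data.
w = Window.orderBy("someColumn")
df_with_prev = df.withColumn("prev_value", lag("someColumn").over(w))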
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your Spark dataframe to a Koalas dataframe.
Koalas is a dataframe library by Databricks that provides an almost pandas-like interface to Spark dataframes. See here: https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here
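Koalas has since been folded into Spark itself as the pandas API on Spark (pyspark.pandas, Spark 3.2+); on a recent Spark version, a roughly equivalent sketch would be:
# Requires Spark 3.2+, where Koalas ships as the pandas API on Spark
psdf = your_spark_df.pandas_api()   # convert to a pandas-on-Spark DataFrame
psdf.iloc[0:500]                    # your indexes here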
I have a DataFrame df with a column column and I would like to convert column into a vector (e.g. a DenseVector) so that I can use it in vector and matrix products.
Beware: I don't need a column of vectors; I need a vector object.
How to do this?
I found the VectorAssembler transformer (link), but this doesn't help me, as it converts some DataFrame columns into a vector column, which is still a DataFrame column; my desired output should instead be a vector.
About the goal of this question: why am I trying to convert a DF column into a vector? Assume I have a DF with a numerical column and I need to compute a product between a matrix and this column. How can I achieve this? (The same could hold for a DF numerical row.) Any alternative approach is welcome.
Here is how:
from pyspark.ml.linalg import DenseVector  # pyspark.mllib.linalg also has a DenseVector

DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())
but it doesn't make sense in any practical scenario.
Spark Vectors are not distributed, and are therefore applicable only if the data fits in the memory of a single (driver) node. If that is the case, you wouldn't be using a Spark DataFrame for processing in the first place.
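To connect this back to the stated goal (a matrix product with the column), once the data has been collected to the driver the product is plain local linear algebra; a sketch, with a made-up NumPy matrix M whose inner dimension matches the vector:
import numpy as np
from pyspark.ml.linalg import DenseVector

# Collect the column to the driver as a local DenseVector, as above.
v = DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())

# Hypothetical matrix; only its inner dimension (len(v)) needs to match the vector.
M = np.random.rand(3, len(v))
result = M @ v.toArray()  # plain NumPy matrix-vector product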