iterating complex dataframe with array of structfield - apache-spark

I have data in one of dataframe's column with the following schema
<type 'list'>: [StructField(data,StructType(List(StructField(account,StructType(List(StructField(Id,StringType,true),StructField(Name,StringType,true),StructField(books,ArrayType(StructType(List(StructField(bookTile,StringType,true),StructField(bookId,StringType,true),StructField(bookName,StringType,true))),true),true)))))))]
I want to iterate over it, extract each value, and create a new dataframe. Are there any built-in functions in pyspark that support this, or should I iterate it myself? Is there an efficient way?
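One possible approach (a sketch assuming the nesting is data.account with a books array as in the schema above; the alias names are placeholders) is to select the nested fields by their dotted paths and explode the array so each book becomes its own row:
from pyspark.sql.functions import col, explode

# Flatten data.account and turn the books array into one row per book.
flat_df = (
    df.select(
        col("data.account.Id").alias("accountId"),
        col("data.account.Name").alias("accountName"),
        explode(col("data.account.books")).alias("book"),
    )
    .select(
        "accountId",
        "accountName",
        col("book.bookId"),
        col("book.bookTile"),
        col("book.bookName"),
    )
)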

Related

(PySpark) Update a delta table based on conditional expression while iterating over a lookup df and extract values to insert from a nested dict?

I have a mapping/lookup table/DF according to which I have to extract values from a highly nested json/dictionary. These values have to be inserted as column values to a delta table. How do I do this leveraging pyspark's parallelism?
I know I can collect() the mapping dataframe, open the JSON file, update each column of a row of a temp df, and append to the delta table, but that will not run in parallel.
Alternatively, I can broadcast the dict/JSON, iterate over the mapping dataframe using foreach(), and upsert my delta table according to a when condition. But Column.when() does not allow me to update a delta table, nor does delta.tables.merge() allow me to compare a dataframe with a dict.
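One direction worth sketching (not from the original question; nested_dict, target_path, key, and value are placeholder names) is to turn the dict into a small source DataFrame first, then let Delta's merge perform the upsert in parallel instead of iterating with foreach():
from delta.tables import DeltaTable
from pyspark.sql import Row

# Build a source DataFrame from the (broadcast) dict, then merge it into the target table.
source_df = spark.createDataFrame([Row(key=k, value=v) for k, v in nested_dict.items()])

target = DeltaTable.forPath(spark, target_path)
(
    target.alias("t")
    .merge(source_df.alias("s"), "t.key = s.key")
    .whenMatchedUpdate(set={"value": "s.value"})
    .whenNotMatchedInsertAll()
    .execute()
)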

Is there a way to slice dataframe based on index in pyspark?

In Python or R, there are ways to slice a DataFrame by index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on the location of rows?
Short Answer
If you already have an index column (suppose it was called 'id') you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. You should have some ordering built into your data based on some other column (orderBy("someColumn")), as in the sketch below.
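One way to add such an index (a sketch; "someColumn" is a placeholder for whatever column defines your ordering) is a row_number over a window:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Assign a 0-based sequential index based on the ordering of someColumn,
# then filter by position as in the answer above.
# Note: a window with no partitionBy pulls all rows into a single partition,
# which is expensive for large data.
w = Window.orderBy(col("someColumn"))
df_indexed = df.withColumn("id", row_number().over(w) - 1)
df_indexed.where(col("id").between(5, 10))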
Full Explanation
No it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas). Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest components of a Spark job.)
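For instance, an order-dependent operation like the following (a generic sketch; "someColumn" is a placeholder) forces Spark to sort and shuffle the data:
from pyspark.sql import Window
from pyspark.sql.functions import lag, lead

# Looking at the previous/next row only makes sense with an explicit ordering,
# which Spark has to establish by sorting and shuffling.
w = Window.orderBy("someColumn")
df.withColumn("prev_value", lag("someColumn").over(w)) \
  .withColumn("next_value", lead("someColumn").over(w))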
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your Spark dataframe to a Koalas dataframe.
Koalas is a dataframe library by Databricks that gives an almost pandas-like interface to Spark dataframes. See https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here

Is there a way to get the column data type in pyspark?

It has been discussed that the way to find a column's datatype in pyspark is to use df.dtypes (get datatype of column using pyspark). The problem with this is that for datatypes like an array or struct you get something like array<string> or array<integer>.
Question: Is there a native way to get the pyspark data type, like ArrayType(StringType,true)?
Just use schema:
df.schema[column_name].dataType
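For example (using a hypothetical array-of-strings column named "books"):
# Returns the full DataType object instead of the string "array<string>".
df.schema["books"].dataType               # e.g. ArrayType(StringType(), True)
df.schema["books"].dataType.elementType   # StringType()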

How to automatically index DataFrame created from groupby in Pandas

Using the Kiva Loan_Data from Kaggle, I aggregated the Loan Amounts by country. Pandas allows the result to be easily turned into a DataFrame, but it indexes on the country data. reset_index can be used to create a numerical/sequential index, but I'm guessing I am adding an unnecessary step. Is there a way to create an automatic default index when creating a DataFrame like this?
Use as_index=False (see the pandas groupby and split-apply-combine documentation):
df.groupby('country', as_index=False)['loan_amount'].sum()

Is there a good(immutable) way to pre-define column for RDD, or remove column from RDD?

I was trying to add columns to a Spark RDD that I loaded from a CSV file. When I call withColumn() it returns a new RDD, and I don't want to force new RDD creation, so can I somehow adjust the RDD schema (the best way I can imagine is to add a column to the schema and then do a map by row, adding a value to the new column)? The same question applies to removing a column from the RDD if the schema is already defined by the CSV file.
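A minimal sketch, assuming the CSV is actually loaded as a DataFrame (withColumn() is a DataFrame method); the file path and column names are placeholders. Spark DataFrames are immutable, so withColumn/drop always return a new, lazily evaluated object, but they do not copy the underlying data:
from pyspark.sql.functions import lit

df = spark.read.csv("input.csv", header=True)                 # placeholder path
df2 = df.withColumn("new_column", lit(None).cast("string"))   # add an empty column
df3 = df2.drop("unwanted_column")                             # remove a column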
