How to get the size of a list returned by column in pyspark - apache-spark

+------+--------------------------------------------------------------------------------------------------------------------+--------------------------------------+
| name | contact                                                                                                            | address                              |
+------+--------------------------------------------------------------------------------------------------------------------+--------------------------------------+
| max  | [{"email": "watson#commerce.gov", "phone": "650-333-3456"}, {"email": "emily#gmail.com", "phone": "238-111-7689"}] | {"city": "Baltimore", "state": "MD"} |
| kyle | [{"email": "johnsmith#yahoo.com", "phone": "425-231-8754"}]                                                        | {"city": "Barton", "state": "TN"}    |
+------+--------------------------------------------------------------------------------------------------------------------+--------------------------------------+
I am working with a dataframe in PySpark that has a few columns, including the two mentioned above. I need to create columns dynamically based on the contact fields.
When I use the "." operator on contact, as in contact.email, I get a list of emails. I need to create a separate column for each of the emails: contact.email0, contact.email1, etc.
I found this code online, which partially does what I want, but I don't completely understand it.
employee_data.select(
    'name', *[col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]).show(truncate=False)
The range is static in this case, but my range could be dynamic. How can I get the size of the list so I can loop through it? I tried size(col('contact.email')) and len(col('contact.email')), but I got an error saying the col('column name') object is not iterable.
Desired output, something like this:
+------+---------------------+-----------------+
| name | contact.email0      | contact.email1  |
+------+---------------------+-----------------+
| max  | watson#commerce.gov | emily#gmail.com |
| kyle | johnsmith#yahoo.com | null            |
+------+---------------------+-----------------+

You can get the desired output by using the pivot function:
from pyspark.sql.functions import posexplode_outer, expr, concat, lit, col, first

# convert the contact array of structs to an array of emails using the transform function
# explode that array together with each element's position
# then pivot on the position to get one column per email
df.select("name", posexplode_outer(expr("transform(contact, c -> c.email)"))) \
  .withColumn("email", concat(lit("contact.email"), col("pos"))) \
  .groupBy("name").pivot("email").agg(first("col")) \
  .show(truncate=False)
+----+-------------------+---------------+
|name|contact.email0 |contact.email1 |
+----+-------------------+---------------+
|kyle|johnsmith#yahoo.com|null |
|max |watson#commerce.gov|emily#gmail.com|
+----+-------------------+---------------+

To understand what the solution you found does, we can print the expression in a shell:
>>> [F.col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]
[Column<'contact.email[0] AS `contact.email0`'>, Column<'contact.email[1] AS `contact.email1`'>]
Basically, it creates two columns, one for the first element of the array contact.email and one for the second element. That's all there is to it.
SOLUTION 1
Keep this solution. But you need to find the max size of your array first:
import pyspark.sql.functions as F

max_size = df.select(F.max(F.size("contact"))).first()[0]
df.select('name',
          *[F.col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(max_size)])\
  .show(truncate=False)
SOLUTION 2
Use posexplode to generate one row per element of the array + a pos column containing the index of the email in the array. Then use a pivot to create the columns you want.
df.select('name', F.posexplode('contact.email').alias('pos', 'email'))\
  .withColumn('pos', F.concat(F.lit('contact.email'), 'pos'))\
  .groupBy('name')\
  .pivot('pos')\
  .agg(F.first('email'))\
  .show()
Both solutions yield:
+----+-------------------+---------------+
|name|contact.email0 |contact.email1 |
+----+-------------------+---------------+
|max |watson#commerce.gov|emily#gmail.com|
|kyle|johnsmith#yahoo.com|null |
+----+-------------------+---------------+

You can use the size function to get the length of the list in the contact column. Keep in mind that size returns a Column, not a Python integer, so you cannot pass it directly to range (that is exactly the "object is not iterable" error you saw). You first have to bring the maximum array size back to the driver and then use that number to dynamically create a column for each email. Here's an example:
from pyspark.sql.functions import col, size
from pyspark.sql.functions import max as spark_max

contact_size = employee_data.select(spark_max(size(col('contact')))).first()[0]
employee_data.select(
    'name', *[col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(contact_size)]).show(truncate=False)
In Spark 3.5+ you can use array_size in place of size in the same way.

Related

how to add column name to the dataframe storing result of correlation of two columns in pyspark?

I have read a csv file and need to find correlation between two columns.
I am using df.stat.corr('Age','Exp') and result is 0.7924058156930612.
But I want to have this result stored in another dataframe with the header "correlation".
+------------------+
|       correlation|
+------------------+
|0.7924058156930612|
+------------------+
Following up on what @gupta_hemant commented.
You can create a new column as
df.withColumn("correlation", df.stat.corr("Age", "Exp").collect()[0].correlation)
(I am guessing the exact syntax here, but it should be something like this)
After reviewing the code, the syntax should be
import pyspark.sql.functions as F
df.withColumn("correlation", F.lit(df.stat.corr("Age", "Exp")))
Try this and let me know.
corrValue = df.stat.corr("Age", "Exp")
newDF = spark.createDataFrame(
    [(corrValue,)],  # note the trailing comma: each row must be a tuple
    ["corr"]
)
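A quick usage check of that approach (the value is the correlation reported in the question; the show() output below is just illustrative):
newDF.show()
# +------------------+
# |              corr|
# +------------------+
# |0.7924058156930612|
# +------------------+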

Cast some columns and select all columns without explicitly writing column names

I want to cast some columns and then select all others
id, name, property, description = column("id"), column("name"), column("property"), column("description")
select([cast(id, String).label('id'), cast(property, String).label('property'), name, description]).select_from(events_table)
Is there any way to cast some columns and select all of them without mentioning all the column names?
I tried
select([cast(id, String).label('id'), cast(property, String).label('property')], '*').select_from(events_table)
py_.transform(return_obj, lambda acc, element: acc.append(dict(element)), [])
But I get two extra columns (7 in total), namely the cast ones, and I can't convert the result to a dictionary, which throws a KeyError.
I'm using FastAPI, SQLAlchemy and databases (async).
Thanks
Pretty sure you can do
select_columns = []
for field in events_table.keys():
    select_columns.append(getattr(events_table.c, field))

select(select_columns).select_from(events_table)
to select all fields from that table. You can also keep a list of the fields you actually want to select instead of events_table.keys(), like
select_these = ["id", "name", "property", "description"]
select_columns = []
for field in select_these:
    select_columns.append(getattr(events_table.c, field))

select(select_columns).select_from(events_table)
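Building on that, here is a minimal sketch of what the original question seems to be after, assuming events_table is a SQLAlchemy Table and the 1.x-style select([...]) used above: cast id and property to String and keep every other column as-is, without listing all the names by hand.
from sqlalchemy import String, cast, select

cast_these = {"id", "property"}      # columns to cast, taken from the question
select_columns = []
for column in events_table.c:        # iterate over every column of the table
    if column.name in cast_these:
        select_columns.append(cast(column, String).label(column.name))
    else:
        select_columns.append(column)

query = select(select_columns).select_from(events_table)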

How to dynamically know if pySpark DF has a null/empty value for given columns?

I have to check whether the incoming data has any null, "" or " " values. The columns I have to check are not fixed. I am reading from a config where the column names are stored for the different files along with their permissible null-ability.
+----------+------------------+--------------------------------------------+
| FileName | Nullable | Columns |
+----------+------------------+--------------------------------------------+
| Sales | Address2,Phone2 | OrderID,Address1,Address2,Phone1,Phone2 |
| Invoice | Bank,OfcAddress | InvoiceNo,InvoiceID,Amount,Bank,OfcAddress |
+----------+------------------+--------------------------------------------+
So for each data file I have to see which fields shouldn't contain nulls, and on that basis either process the file or error it out. Is there any pythonic way to do this?
The table structure you’re showing makes me believe you have read the file containing these job details as a Spark DataFrame. You probably shouldn’t, as it’s very likely not big data. If you have it as a Spark DataFrame, collect it to the driver, so that you can create separate Spark jobs for each file.
Then, each job is fairly straightforward: you have a certain file location from which you must read. That info is captured by the FileName, I presume. Now, I will also presume the file format for each of these files is identical. If not, you’ll have to add meta data indicating the file format. For now, I assume it’s CSV.
Next, you must determine the subset of columns that needs to be checked for the presence of nulls. That’s easy: given that you have a list of all columns in the DataFrame (which could’ve been derived from the DataFrame generated by the previous step (the loading)) and a list of all columns that can contain nulls, the list of columns that can’t contain nulls is simply the difference between these two.
Finally, you aggregate over the DataFrame the number of nulls within each of these columns. As this is a DataFrame aggregate, there's only one row in the result set, so you can take its head to bring it to the driver. Cast it to a dict for easier access to the attributes.
I’ve added a function, summarize_positive_counts, that returns the columns where there was at least one null record found, thereby invalidating the claim in the original table.
df.show(truncate=False)
# +--------+---------------+------------------------------------------+
# |FileName|Nullable |Columns |
# +--------+---------------+------------------------------------------+
# |Sales |Address2,Phone2|OrderID,Address1,Address2,Phone1,Phone2 |
# |Invoice |Bank,OfcAddress|InvoiceNo,InvoiceID,Amount,Bank,OfcAddress|
# +--------+---------------+------------------------------------------+
jobs = df.collect()  # bring it to the driver, to create new Spark jobs from it

from pyspark.sql.functions import col, sum as spark_sum

def report_null_counts(frame, job):
    cols_to_verify_not_null = (set(job.Columns.split(","))
                               .difference(job.Nullable.split(",")))
    null_counts = frame.agg(*(spark_sum(col(_).isNull().cast("int")).alias(_)
                              for _ in cols_to_verify_not_null))
    return null_counts.head().asDict()

def summarize_positive_counts(filename, null_counts):
    return {filename: [colname for colname, nbr_of_nulls in null_counts.items()
                       if nbr_of_nulls > 0]}

for job in jobs:  # embarrassingly parallelizable
    frame = spark.read.csv(job.FileName, header=True)
    null_counts = report_null_counts(frame, job)
    print(summarize_positive_counts(job.FileName, null_counts))

Store aggregate value of a PySpark dataframe column into a variable

I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType column. When I get a distinct count of event_date, the result is an integer variable, but when I get the max of the same column, the result is a dataframe. I would like to understand which operations result in a dataframe and which in a variable. I would also like to know how to store the max of the event date as a variable.
Code that results in an integer type:
loop_cnt=test1.select('event_date').distinct().count()
type(loop_cnt)
Code that results in dataframe type:
last_processed_dt=test1.select([max('event_date')])
type(last_processed_dt)
Edited to add a reproducible example:
schema = StructType([StructField("event_date", TimestampType(), True)])
df = sqlContext.createDataFrame([(datetime(2015, 8, 10, 2, 44, 15),),(datetime(2015, 8, 10, 3, 44, 15),)], schema)
Code that returns a dataframe:
last_processed_dt=df.select([max('event_date')])
type(last_processed_dt)
Code that returns a variable:
loop_cnt=df.select('event_date').distinct().count()
type(loop_cnt)
You cannot directly access the values in a dataframe. A dataframe returns Row objects; instead, it gives you the option to convert them into Python dictionaries. Go through the following example, where I calculate the average word count:
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordCountsDF = wordsDF.groupBy(wordsDF['word']).count()
wordCountsDF.show()
Here are the word count results:
+--------+-----+
| word|count|
+--------+-----+
| cat| 2|
| rat| 2|
|elephant| 1|
+--------+-----+
Now I calculate the average of the count column and apply the collect() operation on it. Remember that collect() returns a list; here the list contains only one element.
averageCount = wordCountsDF.groupBy().avg('count').collect()
Result looks something like this.
[Row(avg(count)=1.6666666666666667)]
You cannot directly access the average value as a plain Python variable; you have to convert the Row into a dictionary to access it.
results = {}
for i in averageCount:
    results.update(i.asDict())
print(results)
Our final result looks like this:
{'avg(count)': 1.6666666666666667}
Finally, you can access the average value using:
print(results['avg(count)'])
1.6666666666666667
I'm pretty sure df.select([max('event_date')]) returns a DataFrame because there could be more than one row that has the max value in that column. In your particular use case no two rows may have the same value in that column, but it is easy to imagine a case where more than one row can have the same max event_date.
df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.
If you want code to get the max event_date and store it as a variable, try the following:
max_date = df.select([max('event_date')]).distinct().collect()
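As a small follow-up sketch (the variable names are just illustrative): collect() returns a list of Row objects, so to end up with the timestamp itself in a Python variable you still need to index into the result.
from pyspark.sql.functions import max as spark_max  # avoid shadowing Python's built-in max

max_date_rows = df.select(spark_max('event_date')).collect()  # a one-element list of Rows
max_date = max_date_rows[0][0]                                # the actual max timestamp value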
Using collect()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).collect()[0][0]
Using first()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).first()[0]
last_processed_dt = df.select([max('event_date')])
To get the max of the date as a value, we should instead try something like
last_processed_dt = df.select([max('event_date').alias("max_date")]).collect()[0]
last_processed_dt["max_date"]
Based on sujit's example, we can actually print the value without iterating/looping over [Row(avg(count)=1.6666666666666667)], by accessing averageCount[0][0].
Note: we are not going through a loop, because it is going to return only one value.
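For instance, reusing the averageCount list from sujit's answer:
# averageCount is [Row(avg(count)=1.6666666666666667)], a one-element list of Rows
print(averageCount[0][0])  # 1.6666666666666667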
Try this:
loop_cnt = test1.select('event_date').distinct().count()
count() is an action, so loop_cnt is already a plain integer and there is nothing left to collect().
Hope this helps.
What you can try is accessing the value with the collect() function:
trainDF.fillna({'Age': trainDF.select('Age').agg(avg('Age')).collect()[0][0]})
As of Spark 3.0, you can do the following:
loop_cnt = test1.select('event_date').distinct().count()
print(loop_cnt)
Since count() is an action that already returns an integer, no extra collect() is needed.

How to loop through each row of dataFrame in pyspark

E.g
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints the entire table on the terminal. But I want to access each row in that table using for or while to perform further calculations.
You simply cannot. DataFrames, like other distributed data structures, are not iterable and can only be accessed through dedicated higher-order functions and/or SQL methods.
You can of course collect
for row in df.rdd.collect():
    do_something(row)
or convert to a local iterator with toLocalIterator
for row in df.rdd.toLocalIterator():
    do_something(row)
and iterate locally as shown above, but it defeats the whole purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
Using list comprehensions in python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().
Or more abbreviated:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
             for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
To loop through the rows in the Date column:
rows = df3.select('Date').collect()
final_list = []
for i in rows:
    final_list.append(i[0])

print(final_list)
Give it a try like this:
result = spark.createDataFrame([('SpeciesId', 'int'), ('SpeciesName', 'string')], ["col_name", "data_type"])
for f in result.collect():
    print(f.col_name)
If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.
Note that this will return a PipelinedRDD, not a DataFrame.
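For example, a minimal sketch (assuming the same sample dataframe with name, age and city columns as above; the uppercasing is only for illustration):
transformed = sample.rdd.map(lambda row: (row.name.upper(), row.age, row.city))
print(type(transformed))      # <class 'pyspark.rdd.PipelinedRDD'>
print(transformed.collect())  # brings the transformed rows back to the driver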
In the answer above,
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
should be
tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]}
because name, age, and city are not variables but simply keys of the dictionary.
