PySpark dataframe: split JSON column values into multiple top-level columns - apache-spark

I have a JSON column which can contain any number of key:value pairs. I want to create new top-level columns for these key:value pairs.
For example, if I have this data:
A                                  B
"{\"C\":\"c\" , \"D\":\"d\"...}"   b
this is the output that I want:
B C D ...
b c d
There are a few similar questions about splitting a column into multiple columns, but none of them work in this case. Can anyone please help? Thanks in advance!

You are looking for org.apache.spark.sql.functions.from_json: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$#from_json(e:org.apache.spark.sql.Column,schema:String,options:java.util.Map[String,String]):org.apache.spark.sql.Column
Here's the Python code commit related to SPARK-17699: https://github.com/apache/spark/commit/fe33121a53384811a8e094ab6c05dc85b7c7ca87
Sample usage from the commit:
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import from_json
>>> data = [(1, '''{"a": 1}''')]
>>> schema = StructType([StructField("a", IntegerType())])
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(from_json(df.value, schema).alias("json")).collect()
[Row(json=Row(a=1))]
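To get the separate top-level columns the question asks for, you can expand the parsed struct with a "json.*" selection. A minimal sketch, assuming the column names A and B and string-typed keys C and D from the question's example (note that from_json needs the schema up front, so truly arbitrary keys would require inferring the schema from the data first):
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("C", StringType()), StructField("D", StringType())])
df = spark.createDataFrame([('{"C": "c", "D": "d"}', 'b')], ("A", "B"))
# Parse the JSON string, then promote each struct field to a top-level column
df.withColumn("json", from_json(df.A, schema)).select("B", "json.*").show()
+---+---+---+
|  B|  C|  D|
+---+---+---+
|  b|  c|  d|
+---+---+---+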

Related

Convert lists present in each column to its respective datatypes

I have a sample dataframe as given below.
import pandas as pd
import numpy as np

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds list values. I want to remove those lists while preserving the datatypes of the items inside them, for all columns.
The final output should look like the result shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (it works on lists, strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Join each variable-length list into a string; leave NaN entries untouched
df['Interest'] = df['Interest'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
Option 2. explode Age and Sex (multi-column explode requires pandas >= 1.3), then join Interest as above:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
Both options return:
df
  ID Age     Sex         Interest
0  A  20    Male     Dance, Music
1  B  21    Male    Dance, Sports
2  C  19  Female  Hiking, Surfing
3  D  24     NaN              NaN
EDIT
Option 3. If you have many columns which contain lists with possible missing values as np.nan, you can get the list-column names and then loop over them as follows:
# Get the columns which contain at least one Python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item:
    if (df[c].str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        # Join list items into a string; leave NaN entries untouched
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
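Note that the .str accessor propagates missing values, which is why Options 1 and 3 are safe for the NaN rows; a quick check (assuming the imports above):
pd.Series([['Male'], np.nan]).str[0]
0    Male
1     NaN
dtype: object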

Finding common rows between two dataframes based on a column using pandas

I have two dataframes. I need to extract rows based on common values in column 'a'. However, instead of creating one merged data frame at the end, I want to retain the two separate data frames.
For example:
###Consider the following input
df1 = pd.DataFrame({'a':[0,1,1,2,3,4], 'b':['q','r','s','t','u','v'],'c':['a','b','c','d','e','f']})
df2 = pd.DataFrame({'a':[1,4,5,6], 'b':['qq','rr','ss','tt'],'c':[1,2,3,4]})
The expected output is:
###df1:
   a  b  c
0  1  r  b
1  1  s  c
2  4  v  f
###df2:
   a   b  c
0  1  qq  1
1  4  rr  2
How can I achieve this result? Any insights will be appreciated.
You can generalize this with NumPy's intersect1d:
import numpy as np
intersection_arr = np.intersect1d(df1['a'], df2['a'])
df1 = df1.loc[df1['a'].isin(intersection_arr),:]
df2 = df2.loc[df2['a'].isin(intersection_arr),:]
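For the inputs above this keeps the rows with a in {1, 4}; the original indices are preserved, so add .reset_index(drop=True) to match the expected output exactly:
df1
   a  b  c
1  1  r  b
2  1  s  c
5  4  v  f
df2
   a   b  c
0  1  qq  1
1  4  rr  2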
More than two dataframes:
import numpy as np
from functools import reduce
intersection_arr = reduce(np.intersect1d, (df1['a'], df2['a'], df3['a']))
df1 = df1.loc[df1['a'].isin(intersection_arr),:]
df2 = df2.loc[df2['a'].isin(intersection_arr),:]
df3 = df3.loc[df3['a'].isin(intersection_arr),:]
Alternatively, with plain pandas isin (filter df1 against df2, then df2 against the already-filtered df1):
df1 = df1[df1['a'].isin(df2['a'])].reset_index(drop=True)
df2 = df2[df2['a'].isin(df1['a'])].reset_index(drop=True)

From dict to pandas dataframe as rows

I am sure I must be missing something basic here. As far as I know, you can create a dataframe from a dict with pd.DataFrame.from_dict(), but I am not sure how the key-value pairs of a dict can be put in as rows of the dataframe.
For instance, given this example
d = {'a':1,'b':2}
the desired output would be:
col1 col2
0 a 1
1 b 2
I know that the index might be a problem, but that can be handled with a simple index=[0].
Duplicate of Convert Python dict into a dataframe.
Simple answer for Python 3:
import pandas as pd
d = {'a': 1, 'b': 2, 'c': 3}
# Each (key, value) tuple from d.items() becomes one row
df = pd.DataFrame(list(d.items()), columns=['cola', 'colb'])
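For the example dict this produces one row per key-value pair:
df
  cola  colb
0    a     1
1    b     2
2    c     3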
This code should help you:
# Wrap each value in a list so the dict becomes a one-row frame,
# then transpose it so the keys end up as rows
d = {k: [v] for k, v in d.items()}
pd.DataFrame(d).T.reset_index().rename(columns={'index': 'col1', 0: 'col2'})

Python - Pandas Dataframe with Multiple Names per Column

Is there a way in pandas to give the same column of a pandas dataframe two names, so that I can index the column by only one of the two names? Here is a quick example illustrating my problem:
import pandas as pd
index=['a','b','c','d']
# The list of tuples here is really just to
# somehow visualize my problem below:
columns = [('A','B'), ('C','D'),('E','F')]
df = pd.DataFrame(index=index, columns=columns)
# I can index like that:
df[('A','B')]
# But I would like to be able to index like this:
df[('A',*)] #error
df[(*,'B')] #error
You can create a multi-index column:
df.columns = pd.MultiIndex.from_tuples(df.columns)
Then you can do:
df.loc[:, ("A", slice(None))]
Or: df.loc[:, (slice(None), "B")]
Here slice(None) selects all labels at that level, so ("A", slice(None)) matches every column whose first level is "A", and (slice(None), "B") matches every column whose second level is "B", regardless of the other level's name; it is semantically the same as :. You can also write this with pandas' IndexSlice helper: df.loc[:, pd.IndexSlice[:, "B"]] for the second case.
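A minimal sketch of an alternative, using DataFrame.xs to select by a single level name (the level argument picks which of the two names to match):
import pandas as pd

index = ['a', 'b', 'c', 'd']
columns = pd.MultiIndex.from_tuples([('A', 'B'), ('C', 'D'), ('E', 'F')])
df = pd.DataFrame(index=index, columns=columns)

df.xs('A', axis=1, level=0, drop_level=False)  # columns whose first name is 'A'
df.xs('B', axis=1, level=1, drop_level=False)  # columns whose second name is 'B'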

How to modify a column value in a row of a spark dataframe?

I am working with a data frame with the following structure:
Here I need to modify each record so that if a column is listed in post_event_list, I populate that column with the corresponding post_ column value. So in the above example, for both records I need to populate col4 and col5 with the post_col4 and post_col5 values. Can someone please help me do this in pyspark?
Maybe this is what you want, in pyspark 2.
Suppose df is the DataFrame:
from pyspark.sql import Row

row = df.rdd.first()
d = row.asDict()
d['col4'] = d['post_col4']
new_row = Row(**d)
Now we have a new Row object; putting this code in a map function over df.rdd lets you transform every row of df.
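A minimal sketch of that map approach, assuming post_event_list is a string (or array) naming the columns to overwrite and that each col shares a type with its post_ counterpart, so the original schema can be reused (fix_row is a hypothetical helper, not from the original answer):
from pyspark.sql import Row

def fix_row(row):
    d = row.asDict()
    # Overwrite col4/col5 with their post_ counterparts when listed in post_event_list
    for c in ('col4', 'col5'):
        if c in d['post_event_list']:
            d[c] = d['post_' + c]
    return Row(**d)

new_df = spark.createDataFrame(df.rdd.map(fix_row), df.schema)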
You can use when/otherwise from pyspark.sql.functions, with a UDF that tests membership in post_event_list. Something like:
import pyspark.sql.functions as sf
from pyspark.sql.types import BooleanType

contains_col4_udf = sf.udf(lambda x: 'col4' in x, BooleanType())
df.select(sf.when(contains_col4_udf('post_event_list'), sf.col('post_col4')).otherwise(sf.col('col4')).alias('col4'))
Here is the doc: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.otherwise
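If post_event_list is an array column, a UDF-free variant is possible with array_contains (a sketch, not from the original answer):
import pyspark.sql.functions as sf

df.select(
    sf.when(sf.array_contains('post_event_list', 'col4'), sf.col('post_col4'))
      .otherwise(sf.col('col4'))
      .alias('col4')
)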
