Convert PySpark data frame to JSON with each column as a key - apache-spark

I'm working on PySpark. I have a data frame which I need to dump as a JSON file, but the JSON file should have the following format, for example -
{"Column 1": [9202, 9202, 9202, ....], "Column 2": ["FEMALE", "No matching concept", "MALE", ....]}
So there should be one key for each column, and the corresponding value should be a list of all the values in that column.
I tried converting this to a pandas data frame and then to a dict before dumping it as JSON, and was successful in doing that, but as the data volume is very large I want to do it directly on the PySpark data frame.
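For reference, the pandas detour described in the question can be as short as this (a sketch, assuming the data fits in driver memory):
# Bring the Spark DataFrame to the driver and build {column: [values]} in one step
data_dict = df.toPandas().to_dict(orient="list")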

One way is to collect each column's values as an array before you write to JSON. Try this:
from pyspark.sql.functions import collect_list

column_arrays = [collect_list(c).alias(c) for c in df.columns]
df2 = df.groupBy().agg(*column_arrays)
df2.coalesce(1).write.mode("overwrite").json("/path")
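With the sample data from the question, the single part file written above should contain one JSON record roughly of the form:
{"Column 1": [9202, 9202, 9202], "Column 2": ["FEMALE", "No matching concept", "MALE"]}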

# Pure-Python alternative: collect the rows once, then build {column: [values]}
rows = df.collect()
L = []
for j in range(len(df.columns)):
    arr = []
    for i in range(len(rows)):
        arr.append(rows[i][j])
    L.append(arr)
columns = df.columns
data_dict = dict(zip(columns, L))
print(data_dict)
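Since the question asks for a JSON file, the resulting dict can then be serialized with the standard library (the output path below is just a placeholder):
import json

with open("/tmp/columns.json", "w") as f:
    json.dump(data_dict, f)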

Related

Creating a pandas DataFrame from json object and appending a column to it

From an API, I get information about who has filled in a particular form and when they did it, as a json object. I get the below data from 2 forms, formid = ["61438732", "48247759dc"]. The json object is stored in the r_json_online variable.
r_json_online = {
    'results': [
        {'submittedAt': 1669963478503,
         'values': [{'email': 'brownsvilleselect#gmail.com'}]},
        {'submittedAt': 1669963259737,
         'values': [{'email': 'brewsterdani33#gmail.com'}]},
        {'submittedAt': 1669963165956,
         'values': [{'email': 'thesource95#valpo.edu'}]}
    ]
}
I have used the json_normalize function to de-nest the json object and insert the values into a DataFrame called form_submissions. This is the code I have used:
import pandas as pd
from pandas import json_normalize

submissions = []
formid = ["61438732", "48247759dc"]
for i in range(0, len(formid)):
    submissions.extend(r_json_online["results"])
    form_submissions = pd.DataFrame()
    for j in submissions:
        form_submissions = form_submissions.append(json_normalize(j["values"]))
        form_submissions = form_submissions.append({'createdOn': j["submittedAt"]}, ignore_index=True)
        form_submissions = form_submissions.append({'formid': formid[i]}, ignore_index=True)
form_submissions['createdOn'] = form_submissions['createdOn'].fillna(method='bfill')
form_submissions['formid'] = form_submissions['formid'].fillna(method='bfill')
form_submissions = form_submissions.dropna(subset='email')
Code explanation:
I have created an empty list called submissions
For each value in the formid list, I'm running the for loop.
In the for loop:
a. I have added data to the submissions list
b. Created an empty DataFrame, normalized the json object and appended the values to the DataFrame from each element in the submissions list
Expected Output:
I wanted the first 3 rows to have formid = '61438732'
The next 3 rows should have the formid = '48247759dc'
Actual Output:
The formid is the same for all the rows
The problem is that you are using the line form_submissions = pd.DataFrame() inside the loop, which resets your DataFrame each time.
This can easily be attained by converting the data into two DataFrames and doing a cartesian product (cross merge) between them:
formids = ["61438732", "48247759dc"]
form_submissions_df = json_normalize(r_json_online['results'], record_path=['values'], meta=['submittedAt'])
# converting the form_ids list to a dataframe
form_ids_df = pd.DataFrame(formids, columns=['form_id'])
# cross merge for the cartesian product result
form_submissions_df.merge(form_ids_df, how="cross")
Results
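Roughly, with the sample data this gives six rows, each submission paired with each form id (an output sketch, not an exact rendering):
                         email    submittedAt     form_id
0  brownsvilleselect#gmail.com  1669963478503    61438732
1  brownsvilleselect#gmail.com  1669963478503  48247759dc
2     brewsterdani33#gmail.com  1669963259737    61438732
3     brewsterdani33#gmail.com  1669963259737  48247759dc
4        thesource95#valpo.edu  1669963165956    61438732
5        thesource95#valpo.edu  1669963165956  48247759dc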

pd series from dictionary and xlsx generation with the values from the dict

I would like to generate an XLSX file with keys and values from a dictionary. Example below:
statistics = {
    "a:": f"textt",
    "b": " ",
    "c": f"{len(list_1)}",
}
df = pd.DataFrame(
    {'Statistics': pd.Series(statistics.keys()),
     'Statistics Values': pd.Series(statistics.values())})
writer = pd.ExcelWriter(f"{output_xlsx_file}", engine='xlsxwriter')
df['Statistics'].to_excel(writer, sheet_name='Statistics', index=False)
df['Statistics Values'].to_excel(writer, sheet_name='Statistics', startcol=1, index=False)
The expected result is to have an xlsx file with 2 columns: the dict's keys in the first column and the dict's values in the second column.
This does happen, with one exception: for dict values that are a number, like the 3rd one in my example, within the XLSX there is a quote in front of the number.
Any idea how I can make that a real number and get rid of that quote? If I want to add up the numbers in xlsx it will fail, as they are not seen as numbers.
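One way to avoid the quote marker would be to keep numeric values as real numbers in the dict rather than formatting them as strings, along these lines (a sketch; list_1 and the rest of the question's code are assumed unchanged):
statistics = {
    "a:": "textt",
    "b": " ",
    "c": len(list_1),  # left as an int so to_excel writes a real number
}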

Splitting Multiple values inside a Pandas Column into Separate Columns

I have a dataframe with a column which contains two different values together with their names (see the plain-text data under EDIT below).
How do I transform it into separate columns?
So far, I tried the following:
use df[col].apply(pd.Series) - it didn't work since the data in the column is not in dictionary format.
Tried separating the columns on the semi-colon (";") sign, but it is not a good idea since the given dataframe might have any number of columns depending on the response.
EDIT:
Data in plain text format:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
How about:
df2 = (df["ClusterName"]
.str.replace("Date:", "")
.str.replace("Bucket:", "")
.str.split(";", expand=True))
df2.columns = ["Date", "Bucket"]
EDIT:
Without hardcoding the variable names, here's a quick hack. You can clean it up (and make less silly variable names):
df_temp = df.ClusterName.str.split(";", expand=True)
cols = []
for col in df_temp:
    df_temptemp = df_temp[col].str.split(":", expand=True)
    df_temp[col] = df_temptemp[1]
    cols.append(df_temptemp.iloc[0, 0])
df_temp.columns = cols
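Applied to the same sample frame, this ends up with the same Date and Bucket columns as above, except that the column names are read from the prefixes in the data instead of being hardcoded.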
So .. maybe like this ...
Set up the data frame:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)
df
Parse over the dataframe breaking apart by colon and semi-colon
ls = []
for index, row in df.iterrows():
    splits = row['ClusterName'].split(';')
    print(splits[0].split(':')[1], splits[1].split(':')[1])
    ls.append([splits[0].split(':')[1], splits[1].split(':')[1]])
df = pd.DataFrame(ls, columns=['Date', 'Bucket'])
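Note that iterating with iterrows is generally slower than the vectorized .str approach above on large frames, but it can be easier to follow and debug.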

How to replace cell in pandas?

I have a Pandas dataframe created from CSV with the following headers:
podcast_name,user_name,description,image,ratings,category,itunes_link,rss,email,latest_date,listener_1,listener_2,listener_3,listener_4,listener_5,listener_6,listener_7,listener_8,listener_9,listener_10,listener_11,listener_12,listener_13,listener_14,listener_15,listener_16,listener_17,listener_18
This dataframe was loaded from several files and cleared of duplicates:
import glob
import os
import pandas

all_files = glob.glob(os.path.join("data/*.csv"))
df = pandas.concat((pandas.read_csv(f) for f in all_files))
df.drop_duplicates(keep=False, inplace=True)
Now I want to check and replace some values from category. For example, I have a keywords dict:
categories = {
    "Comedy": ["Comedy Interviews", "Improv", "Stand-Up"],
    "Fiction": ["Comedy Fiction", "Drama", "Science Fiction"]
}
So I want to check if the value in category is equal to one of the values from a list. For example, if I have a line with Improv in the category column, I want to replace Improv with Comedy.
Honestly, I have no idea how to do this.
Create helper dictionary and replace:
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in categories.items() for k in oldv}
print (d)
{'Comedy Interviews': 'Comedy', 'Improv': 'Comedy',
'Stand-Up': 'Comedy', 'Comedy Fiction': 'Fiction',
'Drama': 'Fiction', 'Science Fiction': 'Fiction'}
df['category'] = df['category'].replace(d)
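As a quick check on a hypothetical category column:
import pandas as pd

df = pd.DataFrame({"category": ["Improv", "Drama", "News"]})
print(df["category"].replace(d).tolist())
# ['Comedy', 'Fiction', 'News'] - values not present in d are left unchanged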

pyspark rdd of csv to data frame with large number of columns dynamically

I have an existing rdd which consists of a single column of text with many (20k+) comma separated values.
How can I convert this to a data frame without specifying every column literally?
from pyspark.sql import Row

# split into columns
split_rdd = input_rdd.map(lambda l: l.split(","))
# convert to Row types
rows_rdd = split_rdd.map(lambda p: Row(
    field_1=p[0],
    field_2=p[1],
    field_3=float(p[2]),
    field_4=float(p[3])
))
df = spark.createDataFrame(rows_rdd)
How can I dynamically create the field_1=p[0], ... dict?
For example
row_dict = dict(
    field_1=p[0],
    field_2=p[1],
    field_3=float(p[2]),
    field_4=float(p[3])
)
is invalid syntax since the 'p[0]' needs to be quoted, but then it is a literal and doesn't get evaluated in the lambda function.
This is a large enough dataset that I need to avoid writing out the rdd and reading it back into a dataframe for performance.
You could try using dictionary comprehension in your creation of the row instance:
df = split_rdd\
    .map(lambda p: {'field_%s' % index: val
                    for (index, val) in enumerate(p)})\
    .map(lambda p: Row(**p))\
    .toDF()
This first maps each split list of column values from split_rdd into a dictionary with dynamically generated field_N keys mapped to the respective values. These dictionaries are then used to create the Row instances.
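One difference from the hand-written version above is that every dynamically generated field stays a string; if some columns need numeric types, they could be cast afterwards, for example (the column name here is only illustrative):
from pyspark.sql.functions import col

df = df.withColumn("field_2", col("field_2").cast("double"))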
