Spark dataframes: convert nested JSON to separate columns - apache-spark

I have a stream of JSONs with the following structure that gets converted to a dataframe:
{
    "a": 3936,
    "b": 123,
    "c": "34",
    "attributes": {
        "d": "146",
        "e": "12",
        "f": "23"
    }
}
The dataframe's show function results in the following output:
sqlContext.read.json(jsonRDD).show
+----+-----------+---+---+
| a| attributes| b| c|
+----+-----------+---+---+
|3936|[146,12,23]|123| 34|
+----+-----------+---+---+
How can I split the attributes column (nested JSON structure) into attributes.d, attributes.e and attributes.f as separate columns in a new dataframe, so that the new dataframe has the columns a, b, c, attributes.d, attributes.e and attributes.f?

If you want columns named from a to f:
df.select("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
If you want columns named with attributes. prefix:
df.select($"a", $"b", $"c", $"attributes.d" as "attributes.d", $"attributes.e" as "attributes.e", $"attributes.f" as "attributes.f")
If names of your columns are supplied from an external source (e.g. configuration):
val colNames = Seq("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
df.select(colNames.head, colNames.tail: _*).toDF(colNames:_*)

Using the attributes.d notation, you can create new columns and have them in your DataFrame; see the withColumn() method (available in the Java, Scala and Python APIs).

Use Python
Load the DataFrame using Python's pandas library.
Change the data type from 'str' to 'dict'.
Get the values of each feature.
Save the results to a new file.
import pandas as pd

data = pd.read_csv("data.csv")  # load the CSV file from disk
json_data = data['Desc']  # get the Desc column
data = data.drop(columns=['Desc'])  # delete the Desc column
Total, Defective = [], []  # initialize the output lists
for i in json_data:
    i = eval(i)  # change the type from 'str' to 'dict' (trusted data only)
    Total.append(i['Total'])  # append the 'Total' feature
    Defective.append(i['Defective'])  # append the 'Defective' feature
# finally, complete the DataFrame
data['Total'] = Total
data['Defective'] = Defective
data.to_csv("result.csv")  # save to result.csv and check it
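A safer variant of the same idea uses json.loads (or ast.literal_eval for single-quoted dicts) instead of eval, and json_normalize to expand the parsed dicts in one go; the CSV below is a hypothetical stand-in for data.csv:

```python
import io
import json
import pandas as pd

# Hypothetical stand-in for data.csv with a JSON-encoded Desc column.
csv_text = 'id,Desc\n1,"{""Total"": 10, ""Defective"": 2}"\n2,"{""Total"": 8, ""Defective"": 0}"\n'
data = pd.read_csv(io.StringIO(csv_text))

# Parse each JSON string once, then expand the dicts into new columns.
parsed = data["Desc"].apply(json.loads)
data = data.drop(columns=["Desc"]).join(pd.json_normalize(parsed.tolist()))
# data.to_csv("result.csv", index=False) would save it as before.
```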

Related

Dealing with nested JSON in PySpark

I'm loading a JSON file into PySpark:
df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
| data| includes| meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+
The two columns I'm interested in here are data and includes. For data, I ran the following:
df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
| author_id| created_at| id|public_metrics| text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312| {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777| {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+
Which is something I can work with. However, I can't do the same with the includes column, as it has {} enclosing the [].
Is there a way for me to deal with this using PySpark?
EDIT:
If you were to look at the includes sections in the JSON file, it looks like:
"includes": {"users": [{"id": "893899303" .... }, ...]},
So ideally in the first table in my question, I'd want the includes to be users, or at least be able to drill down to users
As your includes column is a struct containing a users array, you can use .getItem() to pull the array out by field name, that is:
df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")

pd series from dictionary and xlsx generation with the values from the dict

I would like to generate an XLSX file with keys and values from a dictionary. Example below:
statistics = {
    "a:": f"textt",
    "b": " ",
    "c": f"{len(list_1)}",
}
df = pd.DataFrame(
    {'Statistics': pd.Series(statistics.keys()),
     'Statistics Values': pd.Series(statistics.values())})
writer = pd.ExcelWriter(f"{output_xlsx_file}", engine='xlsxwriter')
df['Statistics'].to_excel(writer, sheet_name='Statistics', index=False)
df['Statistics Values'].to_excel(writer, sheet_name='Statistics', startcol=1, index=False)
The expected result is an XLSX file with two columns: the dict's keys in column 1 and the dict's values in column 2.
This does happen, with one exception: for dict values that are numbers (like the third one in my example), the XLSX shows a quote in front of the number.
Any idea how I can make that a real number and get rid of the quote? If I want to add up the numbers in the XLSX it fails, as they are not seen as numbers.
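One way to avoid the quote marker is to convert numeric-looking strings to real numbers before building the DataFrame, so to_excel writes them as numbers. A sketch, assuming the dict values are strings as in the example (a literal "42" stands in for the elided f"{len(list_1)}"):

```python
import pandas as pd

def coerce(value):
    """Return an int or float when the string parses as one, else the value unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except (TypeError, ValueError):
            pass
    return value

statistics = {"a:": "textt", "b": " ", "c": "42"}  # "42" stands in for a numeric string
df = pd.DataFrame({
    "Statistics": list(statistics.keys()),
    "Statistics Values": [coerce(v) for v in statistics.values()],
})
# df.to_excel("output.xlsx", index=False) would now store 42 as a number, not text.
```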

Pandas : how to consider content of certain columns as list

Let's say I have a simple pandas dataframe named df :
0 1
0 a [b, c, d]
I save this dataframe into a CSV file as follows:
df.to_csv("test.csv", index=False, sep="\t", encoding="utf-8")
Then later in my script I read this csv :
df = pd.read_csv("test.csv", index_col=False, sep="\t", encoding="utf-8")
Now what I want to do is use explode() on column '1', but it does not work, because the content of column '1' is no longer a list after the round trip through the CSV file.
What I tried so far is to change the type of column '1' into a list with astype(), without any success.
Thank you in advance.
Try this. Since you are reading from a CSV file, the value in column A ('1' in your case) is essentially a string, so you need to parse it back into a list:
import pandas as pd
import ast
df = pd.DataFrame({"A": ["['a','b']", "['c']"], "B": [1, 2]})
df["A"] = df["A"].apply(ast.literal_eval)
Now, the following works!
df.explode("A")
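An alternative is to parse the column while reading, via read_csv's converters parameter, so the list type is restored in one step (sketched with an in-memory CSV matching the question's separator):

```python
import ast
import io
import pandas as pd

# In-memory stand-in for test.csv, written with sep="\t" as in the question.
csv_text = "0\t1\na\t['b', 'c', 'd']\n"
df = pd.read_csv(io.StringIO(csv_text), sep="\t",
                 converters={"1": ast.literal_eval})

exploded = df.explode("1")  # works now, since column '1' holds real lists
```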

Convert PySpark data frame to JSON with each column as a key

I'm working on PySpark. I have a data frame which I need to dump as a JSON file, but the JSON file should have the following format, for example -
{"Column 1": [9202, 9202, 9202, ....], "Column 2": ["FEMALE", "No matching concept", "MALE", ....]}
So there should be one key for each column, and the corresponding value should be a list of all the values in that column.
I tried converting this to a Pandas data frame and then to a dict before dumping it as JSON, and was successful in doing that, but as the data volume is very large I want to do it directly on the PySpark data frame.
One way is to collect each column's values into an array before you write to JSON. Try this:
from pyspark.sql.functions import collect_list

column_arrays = [collect_list(c).alias(c) for c in df.columns]
df2 = df.groupBy().agg(*column_arrays)
df2.coalesce(1).write.mode("overwrite").json("/path")
columns = df.columns
rows = df.collect()  # collect once; calling df.collect() inside the loop re-runs the job per cell
L = []
for j in range(len(columns)):
    arr = [rows[i][j] for i in range(len(rows))]
    L.append(arr)
data_dict = dict(zip(columns, L))
print(data_dict)

How do I programmatically parse a fixed-width text file in PySpark?

This post does a great job of showing how to parse a fixed-width text file into a Spark dataframe with pyspark (pyspark parse text file).
I have several text files I want to parse, but they each have slightly different schemas. Rather than writing out the same procedure for each one as the previous post suggests, I'd like to write a generic function that can parse a fixed-width text file given the widths and column names.
I'm pretty new to pyspark, so I'm not sure how to write a select statement where the number of columns and their types are variable.
Any help would be appreciated!
Say we have a text file like the one in the example thread:
00101292017you1234
00201302017 me5678
in "/tmp/sample.txt", and a dictionary containing, for each file name, a list of column names and a list of widths:
schema_dict = {
    "sample": {
        "columns": ["id", "date", "string", "integer"],
        "width": [3, 8, 3, 4]
    }
}
We can load the dataframes and split them into columns iteratively, using:
import numpy as np

input_path = "/tmp/"
df_dict = dict()
for file in schema_dict.keys():
    df = spark.read.text(input_path + file + ".txt")
    start_list = np.cumsum([1] + schema_dict[file]["width"]).tolist()[:-1]
    df_dict[file] = df.select(
        [
            df.value.substr(
                start_list[i],
                schema_dict[file]["width"][i]
            ).alias(schema_dict[file]["columns"][i]) for i in range(len(start_list))
        ]
    )
Checking the first file with df_dict["sample"].show() gives:
+---+--------+------+-------+
| id| date|string|integer|
+---+--------+------+-------+
|001|01292017| you| 1234|
|002|01302017| me| 5678|
+---+--------+------+-------+
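The start positions produced by np.cumsum can be sanity-checked with plain Python slicing; substr(start, width) is 1-based, so each slice begins at start - 1:

```python
import numpy as np

widths = [3, 8, 3, 4]
columns = ["id", "date", "string", "integer"]
line = "00101292017you1234"  # first record from /tmp/sample.txt

# Same start positions the Spark code computes: [1, 4, 12, 15].
starts = np.cumsum([1] + widths).tolist()[:-1]
record = {name: line[start - 1:start - 1 + width]
          for name, start, width in zip(columns, starts, widths)}
```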
