I am trying to create several dictionaries out of a table of comments read from a CSV file whose columns are listed below. I need to create a dictionary for every row (hopefully using a loop so I don't have to create them all manually), where the dictionary keys are the column names:
ID
ReviewType
Comment
However, I cannot figure out a fast way to do this. I tried creating a list of dictionaries using the following code:
# Import libraries
import csv
import json
import pprint

# Open file
reader = csv.DictReader(open('Comments.csv', 'rU'))

# Create list of dictionaries
dict_list = []
for line in reader:
    dict_list.append(line)

pprint.pprint(dict_list)
However, now I do not know how to access the dictionaries, or whether the key/value pairs are matched properly, because in the pprint output:
The ID, ReviewType and Comment do not seem to be showing up as dictionary keys.
The Comment value seems to be showing up as a list of half-sentences.
Is there any way to just create one dictionary for each row instead of a list of dictionaries?
Note: I did look at this question, but it didn't really help.
Here you go. I put the comment into an array:
# Import libraries
import csv
import json
import pprint

def readPerfReviewCSVToDict(csvPath):
    # Open file and read each row into a dictionary
    reader = csv.DictReader(open(csvPath, 'rU'))
    perfReviewsDictionary = []
    for line in reader:
        perfReviewsDictionary.append(line)

    # Split each comment string into a list of words
    perfReviewsDictionaryWithCommentsSplit = []
    for item in perfReviewsDictionary:
        itemId = item["id"]
        itemType = item["type"]
        itemComment = item["comments"]
        itemCommentList = itemComment.split()
        perfReviewsDictionaryWithCommentsSplit.append(
            {'id': itemId, 'type': itemType, 'comments': itemCommentList})

    return perfReviewsDictionaryWithCommentsSplit

dict_list = readPerfReviewCSVToDict("test.csv")
pprint.pprint(dict_list)
The output is:
[{'comments': ['test', 'ape', 'dog'], 'id': '1', 'type': 'Test'},
{'comments': ['dog'], 'id': '2', 'type': 'Test'}]
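Each dictionary in the returned list can then be accessed by position and key; a small sketch based on the output above:

first_row = dict_list[0]
print(first_row['id'])        # '1'
print(first_row['comments'])  # ['test', 'ape', 'dog']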
Since you haven't given a reproducible example with a sample DataFrame, I've created one for you:
import pandas as pd
df = pd.DataFrame([[1, "Contractor", "Please post"], [2, "Developer", "a reproducible example"]])
df.columns = ['ID', 'ReviewType', 'Comment']
On your computer, instead of doing this, run:
df = pd.read_csv(file_path)
to read in the csv file as a pandas DataFrame.
Now I will create a list called dictList, which will be empty initially, and I am going to populate it with a dictionary for each row in the DataFrame df:
dictList = []

# Iterate over each row in df
for i in df.index:
    # Create an empty dictionary for each row
    rowDict = {}

    # Populate it
    rowDict['ID'] = df.at[i, 'ID']
    rowDict['ReviewType'] = df.at[i, 'ReviewType']
    rowDict['Comment'] = df.at[i, 'Comment']

    # Once it is populated, append it to the list,
    # then go to the next row and repeat
    dictList.append(rowDict)
Now, iterating over the list of dictionaries we have created for my example:
for i in dictList:
    print(i)
We get
{'ID': 1, 'ReviewType': 'Contractor', 'Comment': 'Please post'}
{'ID': 2, 'ReviewType': 'Developer', 'Comment': 'a reproducible example'}
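As an aside, pandas can build the same structure in a single call with DataFrame.to_dict; a minimal sketch using the df above:

dictList = df.to_dict('records')
# [{'ID': 1, 'ReviewType': 'Contractor', 'Comment': 'Please post'}, ...]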
Do you want this?
DICT = {}
for line in reader:
    DICT[line['ID']] = line
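With the DictReader from your question, this gives one dictionary per row keyed by its ID, so a row can be looked up directly; for example (assuming an ID of '1' exists in the file):

print(DICT['1']['Comment'])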
I have the stories_data dictionary below, which I'm able to create a df from. But since owner is a dictionary as well, I would like to get the value of that nested dictionary, so the owner column would contain 178413540.
import numpy as np
import pandas as pd

stories_data = {'caption': 'Tel_gusto', 'like_count': 0, 'owner': {'id': '178413540'}, 'headers': {'Content-Encoding': 'gzip'}}

x = pd.DataFrame(stories_data.items())
x.set_index(0, inplace=True)
stories_metric_df = x.transpose()
del stories_metric_df['headers']
I've tried this, but it gets the key, not the value:
stories_metric_df['owner'].explode().apply(pd.Series)
You can use .str, even for objects/dicts:
stories_metric_df['owner'] = stories_metric_df['owner'].str['id']
Output:
>>> stories_metric_df
0    caption like_count      owner
1  Tel_gusto          0  178413540
Another solution would be to skip the explode, and just extract id:
stories_metric_df['owner'].apply(pd.Series)['id']
although I suspect my first solution would be faster.
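The same .str indexing works on any object Series that holds dicts, not just this DataFrame column; a minimal standalone sketch:

import pandas as pd

s = pd.Series([{'id': '178413540'}, {'id': '42'}])
print(s.str['id'])
# 0    178413540
# 1           42
# dtype: object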
I have been extracting data from many APIs. I would like to add a common column across all of them.
I have tried the code below:
df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, index=False)
But in the DataFrame's id column I'm getting only the last iterated id, and for APIs that return many rows I'm getting invalid data.
For performance reasons it would be better to first store the data in a dictionary and then create the DataFrame from that dictionary:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 200):
    # simulate the dataframe retrieved from the pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])

df = pd.DataFrame(d)
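The same idea extends to the multi-row case mentioned in the question (an API response that normalizes to several rows): extend each column's list instead of appending a single value, and repeat the id once per row. A sketch under those assumptions:

import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 4):
    # simulate a multi-row frame returned by pd.json_normalize()
    rows = pd.DataFrame({'field1': [f'a-{i}', f'b-{i}'], 'field2': [10 * i, 20 * i]})
    d['id'].extend([i] * len(rows))   # repeat the common id for every row
    for k, v in rows.to_dict('list').items():
        d[k].extend(v)

df = pd.DataFrame(d)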
I am working on a csv file that has multiple columns.
The file looks something like this...
A,B,C
1,'x;y;z','e;f;g'
2,'w;x;y','r;s;t'
3,'','p;q;r'
Each cell in the file contains a string whose parts are separated by ";".
I want to create a single list by reading each cell and splitting each cell based on the separator.
I have been able to do this but there are performance issues.
The csv file is huge and so I am looking for an optimized version.
The column names are known upfront. My code is given below.
My current solution is:
Make a list by reading all the rows from each column
Flatten the list
Split the items in the list if the item is a string, and append them to a new list
Remove duplicates from the list
import pandas as pd
from io import StringIO
from collections import Iterable
from functools import reduce
import operator

csv_path = 'my_dir'

# load the data with pd.read_csv
dataDF = pd.read_csv(csv_path)
dataDF.fillna(" ")

result = []
cols = ['A', 'B', 'C']
for i in cols:
    result.append(dataDF[i].tolist())
result = reduce(operator.concat, result)
print(result)

my_list = []
for token in result:
    if isinstance(token, str):
        my_list.append(token.split(";"))
my_list = reduce(operator.concat, my_list)
my_list = list(set(my_list))
If you have many repeated values, this will probably go faster.
import pandas as pd
from itertools import chain

# load the data with pd.read_csv
dataDF = pd.DataFrame({'A': [1, 2, 3], 'B': ['x;y;z', 'w;x;y', ''], 'C': ['e;f;g', 'r;s;t', 'p;q;r']})
dataDF.fillna(" ", inplace=True)

results_set = set()
for i in dataDF.columns:
    try:
        results_set.update(chain(*dataDF[i].str.split(';').values))
    except AttributeError:
        # non-string columns (e.g. the numeric column A) have no .str accessor
        pass

print(results_set)
Try this one:
import pandas as pd
from itertools import chain

# load the data with pd.read_csv
dataDF = pd.DataFrame({'A': [1, 2, 3], 'B': ['x;y;z', 'w;x;y', ''], 'C': ['e;f;g', 'r;s;t', 'p;q;r']})
dataDF.fillna(" ", inplace=True)

list_of_lists = []
for i in dataDF.columns:
    try:
        list_of_lists.extend(dataDF[i].str.split(';').values)
    except AttributeError:
        # skip columns without string values
        pass

print(set(chain(*list_of_lists)))
I am using python-docx to extract two tables from a document.
I have iterated over the tables and created a list of lists. Each individual list represents a table, and within that I have dictionaries per row. Each dictionary contains a key / value pair. The key is the column heading from the table and value is the cell contents for that row's data for that column.
I am facing difficulty when creating a data frame for each table and writing each table to a separate Excel sheet.
from docx.api import Document
import pandas as pd
import csv
import json
import unicodedata

document = Document('Sampletable1.docx')
tables = document.tables
print(len(tables))

big_data = []
for table in document.tables:
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        # the first row holds the column headings, used as dictionary keys
        if i == 0:
            keys = tuple(text)
            continue
        dic = dict(zip(keys, text))
        data.append(dic)
    big_data.append(data)

print(big_data)
The output of the above code is:
2
[[{'Asset': 'Growth investments', 'Target investment mix': '66.50%', 'Actual investment mix': '66.30%', 'Variance': '-0.20%'}, {'Asset': 'Defensive investments', 'Target investment mix': '33.50%', 'Actual investment mix': '33.70%', 'Variance': '0.20%'}], [{'Owner': 'REST Super', 'Product': 'Superannuation', 'Type': 'Existing', 'Status': 'Existing', 'Customer 2': 'Customer 1'}, {'Owner': 'TWUSUPER TransPension', 'Product': 'TTR Pension', 'Type': 'New', 'Status': 'New', 'Customer 2': 'Customer 1'}, {'Owner': 'TWUSUPER', 'Product': 'Superannuation', 'Type': 'Existing', 'Status': 'Existing'}]]
How do I access the above lists??
Further, I tried to create a pandas data frame:
# write the data into a data frame
for thing in big_data:
    # print(thing)
    df = pd.DataFrame(thing)
    print(df)

writer = pd.ExcelWriter('dftable3.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
I got the first table into the Excel file but am unable to write the second table.
I am expecting both tables to be in the same Excel workbook (dftable3.xlsx) but in different worksheets (Sheet1, Sheet2).
I have attached the images of the tables.
Thanks in advance
How do I access the above lists??
You already did, by iterating over them, or printing them.
Consider using the pretty-print library:
import pprint
pprint.pprint(big_data)
I am expecting ... different worksheets(Sheet1,Sheet2)
Well, that's unlikely, given the constant 'Sheet1' argument you supplied.
Here is one way to accomplish that:
writer = pd.ExcelWriter('dftable3.xlsx', engine='xlsxwriter')
for i, thing in enumerate(big_data):
    df = pd.DataFrame(thing)
    df.to_excel(writer, sheet_name=f'Sheet{i}')
writer.save()
Note the scope of writer -- it must be longer lived than each of the constituent dfs.
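On newer pandas versions writer.save() is deprecated in favor of writer.close(); using the ExcelWriter as a context manager sidesteps that, and starting the sheet numbering at 1 gives the Sheet1/Sheet2 names you expected. A sketch:

with pd.ExcelWriter('dftable3.xlsx', engine='xlsxwriter') as writer:
    for i, thing in enumerate(big_data):
        pd.DataFrame(thing).to_excel(writer, sheet_name=f'Sheet{i + 1}')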
Suppose I have a JSON file with lines in the following structure:
{
  "a": 1,
  "b": {
    "bb1": 1,
    "bb2": 2
  }
}
I want to change the value of key bb1 or add a new key, like bb3.
Currently, I use spark.read.json to load the JSON file into Spark as a DataFrame and df.rdd.map to map each row of the RDD to a dict. Then I change the nested key's value or add a nested key, and convert the dict back to a Row. Finally, I convert the RDD back to a DataFrame.
The workflow works as follows:
def map_func(row):
    dictionary = row.asDict(True)
    # add a new key or change a key's value here
    return as_row(dictionary)  # as_row converts the dict back to a Row recursively

df = spark.read.json("json_file")
df.rdd.map(map_func).toDF().write.json("new_json_file")
This could work for me, but I am concerned that converting DataFrame -> RDD (Row -> dict -> Row) -> DataFrame would kill the efficiency.
Is there any other method that could meet this need without the cost in efficiency?
The final solution I used is withColumn with a dynamically built schema for b.
First, we can get b_schema from the df schema:
b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
After that, b_schema is a dict and we can add a new field to it:
b_schema['fields'].append({"metadata":{},"type":"string","name":"bb3","nullable":True})
Then we can convert it to a StructType:
from pyspark.sql.types import StructType

new_b = StructType.fromJson(b_schema)
In the map_func, we could convert Row to dict and populate the new field:
from pyspark.sql.functions import udf

def map_func(row):
    data = row.asDict(True)
    data['bb3'] = data['bb1'] + data['bb2']
    return data

map_udf = udf(map_func, new_b)
df.withColumn('b', map_udf('b')).collect()
Thanks @Mariusz
You can use map_func as a UDF and therefore omit converting DF -> RDD -> DF, while still having the flexibility of Python to implement business logic. All you need is to create a schema object:
>>> from pyspark.sql.types import *
>>> new_b = StructType([StructField('bb1', LongType()), StructField('bb2', LongType()), StructField('bb3', LongType())])
Then you define map_func and udf:
>>> from pyspark.sql.functions import *
>>> def map_func(data):
... return {'bb1': 4, 'bb2': 5, 'bb3': 6}
...
>>> map_udf = udf(map_func, new_b)
Finally, apply this UDF to the DataFrame:
>>> df = spark.read.json('sample.json')
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=4, bb2=5, bb3=6))
EDIT:
According to the comment: you can add a field to an existing StructType in an easier way, for example:
>>> df = spark.read.json('sample.json')
>>> new_b = df.schema['b'].dataType.add(StructField('bb3', LongType()))
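Plugged into the same UDF as above, this new_b behaves identically; a short continuation sketch (assuming the map_func defined earlier):

>>> map_udf = udf(map_func, new_b)
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=4, bb2=5, bb3=6))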