Pandas - DataFrame with a dictionary column, save the value instead - python-3.x

I have the stories_data dictionary below, from which I'm able to create a df. But since owner is a dictionary as well, I would like to get the value of that dictionary, so the owner column would contain 178413540.
import numpy as np
import pandas as pd
stories_data = {'caption': 'Tel_gusto', 'like_count': 0, 'owner': {'id': '178413540'}, 'headers': {'Content-Encoding': 'gzip'}}
x = pd.DataFrame(stories_data.items())
x.set_index(0, inplace=True)
stories_metric_df = x.transpose()
del stories_metric_df['headers']
I've tried this, but it gets the key, not the value:
stories_metric_df['owner'].explode().apply(pd.Series)

You can use .str, even for objects/dicts:
stories_metric_df['owner'] = stories_metric_df['owner'].str['id']
Output:
>>> stories_metric_df
0    caption like_count      owner
1  Tel_gusto          0  178413540
Another solution would be to skip the explode, and just extract id:
stories_metric_df['owner'].apply(pd.Series)['id']
although I suspect my first solution would be faster.
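As a side note, if you are building the frame from the raw dict anyway, pd.json_normalize can flatten the nested dicts up front; a minimal sketch assuming the same stories_data as above:
import pandas as pd

stories_data = {'caption': 'Tel_gusto', 'like_count': 0, 'owner': {'id': '178413540'}, 'headers': {'Content-Encoding': 'gzip'}}

# nested dicts become dotted columns: 'owner.id', 'headers.Content-Encoding'
flat = pd.json_normalize(stories_data)
flat = flat.drop(columns=['headers.Content-Encoding']).rename(columns={'owner.id': 'owner'})
print(flat)
#      caption  like_count      owner
# 0  Tel_gusto           0  178413540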

Related

column comprehension robust to missing values

I have only been able to create a two-column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like to do, using this same basic format, is initialize the dataframe with three columns: 'id', 'id2' and 'value'. I have a separate dict, called id_lookup, that contains the necessary lookup info.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible encounters. For my purposes, simply putting it all together and placing 'N/A' or something for those errors will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}

new_column = defaultdict(str)

# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'

# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)

# Print the updated DataFrame
print(df)
which gives:
   id  value  id2
0   1     10    A
1   2     20    B
2   3     30    C
3   4     40  N/A
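A more idiomatic alternative, a minimal sketch assuming the same df and id_lookup as above: Series.map returns NaN for ids missing from the lookup, and fillna substitutes the placeholder, with no explicit loop:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}

# map() yields NaN where the id is absent from id_lookup; fillna() fills the gap
df['id2'] = df['id'].map(id_lookup).fillna('N/A')
print(df)  # same output as above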

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been doing data extraction from many APIs. I would like to add a common column across all of them.
I have tried the below:
import requests
import json
import pandas as pd

df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, ignore_index=True)
But in the dataframe's id column I'm getting only the last iterated id, and in the case of APIs that return many rows I'm getting invalid data.
For performance reasons it would be better to first store the data in a dictionary and then create the dataframe from that dictionary:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 200):
    # simulate dataframe retrieved from pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])

df = pd.DataFrame(d)
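Another common pattern, a sketch under the same simulated-response assumption: collect the per-call frames in a list and concatenate once at the end, which also avoids the deprecated DataFrame.append:
import pandas as pd

frames = []
for i in range(1, 200):
    # simulate the frame returned by pd.json_normalize() for one API call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}']})
    frames.append(row)

# a single concat at the end is much cheaper than appending inside the loop
df = pd.concat(frames, ignore_index=True)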

'NA' handling in python pandas

I have a dataframe with Name and Age fields. The Name column has missing values as well as the string 'NA'. When I read the file using pd.read_excel, both the missing values and 'NA' become NaN. How can I avoid this issue?
This is my code:
import pandas as pd

data = {'Name': ['Tom', '', 'NA', '', 'Ricky', 'NA', ''], 'Age': [28, 34, 29, 42, 35, 33, 40]}
df = pd.DataFrame(data)
df.to_excel("test1.xlsx", sheet_name="test")

data = pd.read_excel("./test1.xlsx")
To avoid this, just set keep_default_na to False:
df = pd.read_excel('test1.xlsx', keep_default_na=False)
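If you still want truly empty cells to come through as NaN while only the literal 'NA' strings survive, a sketch combining keep_default_na with na_values (that this is the behaviour you want is an assumption here):
import pandas as pd

# turn off the built-in NA string list, then opt back in only for empty cells
df = pd.read_excel('test1.xlsx', keep_default_na=False, na_values=[''])
print(df['Name'].tolist())  # 'NA' stays a string; empty cells become NaN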

Inputting null values through np.select's "default" parameter

Trying to write values to a column given certain conditions, with a null value as the default, using the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col': list('ABCDE')})
cond1 = df['col'].eq('A')
cond2 = df['col'].isin(['B', 'E'])
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=np.NaN)
But it gives 'nan' as a string value in the column.
df['new_col'].unique()
#array(['foo', 'bar', 'nan'], dtype=object)
Is there a way to directly change it to null from this code?
Found the correct solution, which uses None as the default value:
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=None)
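A quick check that the column now holds real nulls rather than the string 'nan'; a sketch reusing the df and conditions from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': list('ABCDE')})
cond1 = df['col'].eq('A')
cond2 = df['col'].isin(['B', 'E'])

# with default=None the unmatched rows hold real None values, not 'nan' strings
df['new_col'] = np.select([cond1, cond2], ['foo', 'bar'], default=None)
print(df['new_col'].unique())      # ['foo' 'bar' None]
print(df['new_col'].isna().sum())  # 2 (rows 'C' and 'D')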
Just tested it myself and it behaves properly. Check the output of np.select(conditions, choices, default=np.nan) manually; maybe there are 'nan' strings in choices somewhere.
Try specifying dropna=True manually in .value_counts(); maybe it defaults to False somehow?
What I tested it with:
import numpy as np
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris['sepal_length'] = np.select(iris.values[:, :4].T > 5, iris.values[:, :4].T, default=np.nan)
print(iris['sepal_length'].value_counts())
print(iris.sepal_length.value_counts(dropna=False))

How to create a dictionary from CSV using a loop function?

I am trying to create several dictionaries out of a table of comments from a CSV file.
I need to create a dictionary for every row (hopefully using a loop so I don't have to create them all manually), where the dictionary keys are:
ID
ReviewType
Comment
However, I cannot figure out a fast way to do this. I tried creating a list of dictionaries using the following code:
# Import libraries
import csv
import json
import pprint
# Open file
reader = csv.DictReader(open('Comments.csv', 'r'))

# Create list of dictionaries
dict_list = []
for line in reader:
    dict_list.append(line)
pprint.pprint(dict_list)
However, now I do not know how to access the dictionaries, or whether the key-value pairs are matched properly, since in the printed output:
The ID, ReviewType and Comment do not seem to be showing as dictionary keys
The Comment value seems to be showing as a list of half-sentences
Is there any way to just create one dictionary for each row instead of a list of dictionaries?
Note: I did look at this question; however, it didn't really help.
Here you go. I put the comments into an array:
# Import libraries
import csv
import pprint

# Open the file and read each row into a dictionary
def readPerfReviewCSVToDict(csvPath):
    reader = csv.DictReader(open(csvPath, 'r'))
    perfReviewsDictionary = []
    for line in reader:
        perfReviewsDictionary.append(line)

    perfReviewsDictionaryWithCommentsSplit = []
    for item in perfReviewsDictionary:
        itemId = item["id"]
        itemType = item["type"]
        itemComment = item["comments"]
        # Split the comment string into a list of words
        itemCommentDictionary = itemComment.split()
        perfReviewsDictionaryWithCommentsSplit.append({'id': itemId, 'type': itemType, 'comments': itemCommentDictionary})

    return perfReviewsDictionaryWithCommentsSplit

dict_list = readPerfReviewCSVToDict("test.csv")
pprint.pprint(dict_list)
The output is:
[{'comments': ['test', 'ape', 'dog'], 'id': '1', 'type': 'Test'},
{'comments': ['dog'], 'id': '2', 'type': 'Test'}]
Since you haven't given a reproducible example with a sample DataFrame, I've created one for you:
import pandas as pd

df = pd.DataFrame([[1, "Contractor", "Please post"], [2, "Developer", "a reproducible example"]])
df.columns = ['ID', 'ReviewType', 'Comment']
On your computer, instead of doing this, type:
df = pd.read_csv(file_path)
to read in the csv file as a pandas DataFrame.
Now I will create a list called dictList, which will be empty initially; I am going to populate it with a dictionary for each row in the DataFrame df:
dictList = []

# Iterate over each row in df
for i in df.index:
    # Create an empty dictionary for each row
    rowDict = {}
    # Populate it
    rowDict['ID'] = df.at[i, 'ID']
    rowDict['ReviewType'] = df.at[i, 'ReviewType']
    rowDict['Comment'] = df.at[i, 'Comment']
    # Once it's populated, append it to the list and go to the next row
    dictList.append(rowDict)
Now, iterating over the list of dictionaries we have created for my example:
for i in dictList:
    print(i)
We get
{'ID': 1, 'ReviewType': 'Contractor', 'Comment': 'Please post'}
{'ID': 2, 'ReviewType': 'Developer', 'Comment': 'a reproducible example'}
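For reference, pandas can build the same list of per-row dictionaries in a single call; a sketch assuming the df defined above:
# each row becomes one {column: value} dictionary
dictList = df.to_dict('records')
print(dictList)
# [{'ID': 1, 'ReviewType': 'Contractor', 'Comment': 'Please post'},
#  {'ID': 2, 'ReviewType': 'Developer', 'Comment': 'a reproducible example'}]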
Do you want this?
DICT = {}
for line in reader:
    DICT[line['ID']] = line
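If the goal is a single dictionary keyed by ID rather than a list, the pandas equivalent is also a one-liner; a sketch assuming the df from the previous answer:
# {ID: {column: value, ...}} -- one inner dict per row
by_id = df.set_index('ID').to_dict('index')
print(by_id)
# {1: {'ReviewType': 'Contractor', 'Comment': 'Please post'},
#  2: {'ReviewType': 'Developer', 'Comment': 'a reproducible example'}}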
