I can't seem to figure out how to show the actual column names in the JSON after the dataframe has been transposed. Any thoughts, please?
from pandasql import *
import pandas as pd
pysqldf = lambda q: sqldf(q, globals())
q1 = """
SELECT
beef as beef, veal as veal, pork as pork, lamb_and_mutton as lamb
FROM
meat m
LIMIT 3;
"""
meat = load_meat()
df = pysqldf(q1)
#df = df.reset_index(drop=True)
#print(df.T.to_json(orient='records'))
df1 = df.T.reset_index(drop=True)
df1.columns = range(len(df1.columns))
print(df.T.to_json(orient='records'))
Output
[{"0":751.0,"1":713.0,"2":741.0},{"0":85.0,"1":77.0,"2":90.0},{"0":1280.0,"1":1169.0,"2":1128.0},{"0":89.0,"1":72.0,"2":75.0}]
Expected Output
[ { "0": "beef", "1": 751, "2": 713, "3": 741},{"0": "veal", "1": 85, "2": 77, "3": 90 },{"0": "pork", "1": 1280, "2": 1169, "3": 1128},{ "0": "lamb", "1": 89, "2": 72, "3": 75 }]
Try this:
Where df:
beef veal pork lamb
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Use T, reset_index, and set_axis:
df.T.reset_index()\
.set_axis(range(len(df.columns)), axis=1, inplace=False)\
.to_json(orient='records')
Output:
'[{"0":"beef","1":0,"2":4,"3":8},{"0":"veal","1":1,"2":5,"3":9},{"0":"pork","1":2,"2":6,"3":10},{"0":"lamb","1":3,"2":7,"3":11}]'
The code below is a simplified version of code that reads a page with Selenium and stores the page's URL in a list depending on a condition on the page. (The original is too long, so I've simplified parts of it.)
a = ["70", "80", "90", "100", "110"]
b = [112, 1513, 14, 505, 36]
c = ["url_1", "url_2", "url_3", "url_4", "url_5"]
last = []
num = 0
for l in a:
    if 110 == int(l):
        last.insert(0, c[num])
    elif 100 == int(l):
        last.append(c[num])
    elif 90 == int(l):
        last.append(c[num])
    elif 80 == int(l):
        last.append(c[num])
    elif 70 == int(l):
        last.append(c[num])
    num += 1
print(last)
Originally, the idea was to add the elements of c to last according to the magnitude of the corresponding elements of a.
But I found that I also need to order the contents by the elements of list b.
What I want to do is sort primarily by the numeric values of a, in descending order, and at the same time break ties by the corresponding values of b, largest first.
For example
a = ["70", "100", "90", "100", "100"]
b = [112, 1513, 14, 505, 36]
c = ["url_1", "url_2", "url_3", "url_4", "url_5"]
When several elements of a are '100', as above, I also want to consider the numbers at the same indices in b and rank those entries by that size. The elements of c at the same indices then go into last in that order, so finally:
last = ["url_2", "url_4", "url_5", "url_3", "url_1"]
I want to build the list in this order. I've failed at it all day. Help would be appreciated.
You can do this using built-ins:
>>> a = ["70", "100", "90", "100", "100"]
>>> b = [112, 1513, 14, 505, 36]
>>> c = ["url_1", "url_2", "url_3", "url_4", "url_5"]
>>>
>>> sorted(zip(map(int, a), b, c), key=lambda x: x[:2], reverse=True)
[(100, 1513, 'url_2'), (100, 505, 'url_4'), (100, 36, 'url_5'), (90, 14, 'url_3'), (70, 112, 'url_1')]
Then if you want to extract only the "urls":
>>> x = sorted(zip(map(int, a), b, c), key=lambda x: x[:2], reverse=True)
>>> [i[2] for i in x]
['url_2', 'url_4', 'url_5', 'url_3', 'url_1']
The way sorted() works on tuples is that it compares them element by element, index by index. That means you can sort based on different "columns" of each zipped tuple.
You can customize that ordering via the key keyword argument, which takes any function (or, as above, a lambda) that maps each item to the value to sort by.
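For instance, tuples compare lexicographically, which is exactly what makes the two-level sort work:
>>> (100, 1513) > (100, 505) > (90, 14)
True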
Use pandas DataFrame functions:
import pandas as pd
a = ["70", "100", "90", "100", "100"]
b = [112, 1513, 14, 505, 36]
c = ["url_1", "url_2", "url_3", "url_4", "url_5"]
df = pd.DataFrame([int(i) for i in a], columns=['a'])
df['b'] = b
df['c'] = c
df.sort_values(by=['a', 'b'], ascending=False, inplace=True)
print(df['c'].to_list())
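If the two keys ever need different sort directions, note that ascending also accepts one flag per key, for example:
df.sort_values(by=['a', 'b'], ascending=[False, True], inplace=True)  # a descending, b ascending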
I have a dataframe that looks like this:
df = pd.DataFrame({'id': ["1", "2", "1", "3", "3", "4"],
                   'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
                   'code': ["CB25", "CD15", "CZ10", None, None, "AZ51"],
                   'col_example': ["22", None, "22", "55", "55", "121"],
                   'comments': ["bonjour", "bonjour", "bonjour", "hola", "Hello", None]})
Result:
id date code col_example .... comments
0 1 2019 CB25/CZ10 22 .... bonjour (and not bonjour // bonjour)
1 2 2011 CD15 None .... bonjour
2 3 2017 None 55 .... hola // Hello
3 4 2018 AZ51 121 .... None
I want to keep a single row per id.
If two ids are the same, I would like:
If one comment is None and the other is a str: keep only the comment that is not None (example: id = 1, keep the comment "bonjour")
If both comments are str: concatenate the two comments with "//" (example: id = 3, comments = "hola // Hello")
So far I have tried sort_values and drop_duplicates, without success.
Thank you
I believe you need GroupBy.agg with a lambda that drops the NaNs and joins the comments, 'last' for the date, and then DataFrame.replace to turn empty joined strings back into None:
df1 = (df.groupby('id')
         .agg({'date': 'last',
               'comments': lambda x: ' // '.join(x.dropna())})
         .replace({'comments': {'': None}})
         .reset_index())
print(df1)
id date comments
0 1 2019 bonjour
1 2 2011 bonjour
2 3 2017 hola // Hello
3 4 2018 None
EDIT: To avoid dropping all the other columns, it is necessary to aggregate all of them; you can build the aggregation dictionary dynamically, like this:
df = pd.DataFrame({'id': ["1", "2", "1", "3", "3", "4"],
                   'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
                   'code': ["CB25", "CD15", "CB25", None, None, "AZ51"],
                   'col_example': ["22", None, "22", "55", "55", "121"],
                   'comments': [None, "bonjour", "bonjour", "hola", "Hello", None]})
print(df)
id date code col_example comments
0 1 2017 CB25 22 None
1 2 2011 CD15 None bonjour
2 1 2019 CB25 22 bonjour
3 3 2013 None 55 hola
4 3 2017 None 55 Hello
5 4 2018 AZ51 121 None
d = dict.fromkeys(df.columns.difference(['id','comments']), 'last')
d['comments'] = lambda x: ' // '.join(x.dropna())
print(d)
{'code': 'last', 'col_example': 'last', 'date': 'last',
'comments': <function <lambda> at 0x000000000ECA99D8>}
df1 = (df.groupby('id')
         .agg(d)
         .replace({'comments': {'': None}})
         .reset_index())
print(df1)
id code col_example date comments
0 1 CB25 22 2019 bonjour
1 2 CD15 None 2011 bonjour
2 3 None 55 2017 hola // Hello
3 4 AZ51 121 2018 None
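If, as in the expected output from the original question, differing code values for the same id should also be combined with "/", the same dictionary trick extends naturally. A sketch (the '/'.join over distinct non-null codes is my assumption about the desired rule):
d = dict.fromkeys(df.columns.difference(['id', 'comments', 'code']), 'last')
d['comments'] = lambda x: ' // '.join(x.dropna())
d['code'] = lambda x: '/'.join(x.dropna().unique())  # assumed rule: join distinct codes with '/'
df1 = (df.groupby('id')
         .agg(d)
         .replace({'comments': {'': None}, 'code': {'': None}})
         .reset_index())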
I have a pandas dataframe named tshirt_orders from an API call looking like this:
Alice, small, red
Alice, small, green
Bob, small, blue
Bob, small, orange
Cesar, medium, yellow
David, large, purple
How can I get this into a dictionary-style format, keyed first by size, with sub-keys for name and a sub-list for color underneath, so that I can address it while iterating over tshirt_orders?
Like this:
size:
  small:
    Name:
      Alice:
        Color:
          red
          green
      Bob:
        Color:
          blue
          orange
  medium:
    Name:
      Cesar:
        Color:
          yellow
  large:
    Name:
      David:
        Color:
          purple
What would be the best way to do this? It is in a pandas dataframe, but changing that isn't a problem if there are better solutions.
The closest fit is to write the DataFrame to YAML.
First create the nested dictionaries in a dict comprehension:
print(df)
A B C
0 Alice small red
1 Alice small green
2 Bob small blue
3 Bob small orange
4 Cesar medium yellow
5 David large purple
d = {k: v.groupby('A', sort=False)['C'].apply(list).to_dict()
     for k, v in df.groupby('B', sort=False)}
print(d)
{'small': {'Alice': ['red', 'green'],
'Bob': ['blue', 'orange']},
'medium': {'Cesar': ['yellow']},
'large': {'David': ['purple']}}
Add size as the top-level key and then write to a YAML file:
import yaml
with open('result.yml', 'w') as yaml_file:
    yaml.dump({'size': d}, yaml_file, default_flow_style=False, sort_keys=False)
size:
  small:
    Alice:
    - red
    - green
    Bob:
    - blue
    - orange
  medium:
    Cesar:
    - yellow
  large:
    David:
    - purple
Or create a JSON file:
import json
with open("result.json", "w") as twitter_data_file:
json.dump({'size': d}, twitter_data_file, indent=4)
{
    "size": {
        "small": {
            "Alice": [
                "red",
                "green"
            ],
            "Bob": [
                "blue",
                "orange"
            ]
        },
        "medium": {
            "Cesar": [
                "yellow"
            ]
        },
        "large": {
            "David": [
                "purple"
            ]
        }
    }
}
EDIT:
df = df.assign(A1='Name', B1='size', C1='Color')
df1 = df.groupby(['B1','B','A1','A','C1'], sort=False)['C'].apply(list).reset_index()
#https://stackoverflow.com/a/19900276
def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0], sort=False)
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d
d = recur_dictify(df1)
print(d)
{'size': {'small': {'Name': {'Alice': {'Color': ['red', 'green']},
'Bob': {'Color': ['blue', 'orange']}}},
'medium': {'Name': {'Cesar': {'Color': ['yellow']}}},
'large': {'Name': {'David': {'Color': ['purple']}}}}}
import yaml
with open('result.yml', 'w') as yaml_file:
    yaml.dump(d, yaml_file, default_flow_style=False, sort_keys=False)
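As a quick sanity check (a sketch, assuming result.yml was written as above), you can load the file back and inspect the structure:
import yaml
with open('result.yml') as f:
    print(yaml.safe_load(f))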
I am trying to convert a single-digit string to an integer. For example, if I have "2" (a str), I want to change it to 2 (an int). I know that the int() function in Python can do this for me, but I want to know: if I made a dictionary like this,
strToNumDict = {
    "0": 0,
    "1": 1,
    "2": 2,
    "3": 3,
    "4": 4,
    "5": 5,
    "6": 6,
    "7": 7,
    "8": 8,
    "9": 9
}
Would using this dictionary to convert single digits be faster than the int() function? And if one is faster, is it fast enough to make a difference on which one I should use?
Let's do a quick timeit benchmark. We generate one random digit and convert it to an integer, 1,000,000 times:
strToNumDict = {
    "0": 0,
    "1": 1,
    "2": 2,
    "3": 3,
    "4": 4,
    "5": 5,
    "6": 6,
    "7": 7,
    "8": 8,
    "9": 9
}
def convert1(s):
    return strToNumDict[s]
def convert2(s):
    return int(s)
import timeit
import random
digits = list('0123456789')
t1 = timeit.timeit(lambda: convert1(random.choice(digits)), number=1_000_000)
t2 = timeit.timeit(lambda: convert2(random.choice(digits)), number=1_000_000)
print(t1)
print(t2)
Prints on my machine (AMD 2400G, Python 3.6.8):
0.6220340259897057
0.727682675991673
So the dict-based version is marginally faster in this case. Personally, I don't think it's worth the effort: with int() you also get conversion of negative numbers, numbers greater than 9, and so on, and it's more readable.
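For instance, int() handles inputs that the digit dict simply cannot:
>>> int("-3"), int("42")
(-3, 42)
>>> strToNumDict["42"]
Traceback (most recent call last):
  ...
KeyError: '42'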
I have a Dataframe with one column where each cell in the column is a JSON object.
players
0 {"name": "tony", "age": 57}
1 {"name": "peter", age": 46}
I want to convert this to a data frame as:
name age
tony 57
peter 46
Any ideas how I do this?
Note: the original JSON object looks like this...
{
"players": [{
"age": 57,
"name":"tony"
},
{
"age": 46,
"name":"peter"
}]
}
Use the DataFrame constructor if the values are dicts:
# If the cell values are strings, convert them to dicts first:
# print(type(df.loc[0, 'players']))
# <class 'str'>
# import ast
# df['players'] = df['players'].apply(ast.literal_eval)
print(type(df.loc[0, 'players']))
<class 'dict'>
df = pd.DataFrame(df['players'].values.tolist())
print(df)
age name
0 57 tony
1 46 peter
But it's better to use json_normalize on the original JSON object, as suggested by @jpp:
from pandas.io.json import json_normalize
data = {
    "players": [{
        "age": 57,
        "name": "tony"
    },
    {
        "age": 46,
        "name": "peter"
    }]
}
df = json_normalize(data, 'players')
print(df)
age name
0 57 tony
1 46 peter
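Note that in pandas 1.0 and later json_normalize is exposed at the top level, so the equivalent call is simply (a sketch, assuming the same data dict as above):
import pandas as pd
df = pd.json_normalize(data, 'players')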
This can do it:
df = df['players'].apply(pd.Series)
However, it's slow:
In [20]: timeit df.players.apply(pd.Series)
1000 loops, best of 3: 824 us per loop
@jezrael's suggestion is faster:
In [24]: timeit pd.DataFrame(df.players.values.tolist())
1000 loops, best of 3: 387 us per loop
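The gap comes from apply(pd.Series) constructing a new Series object for every row, while passing a list of dicts to the DataFrame constructor builds the whole frame in a single pass.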