Pandas dataframe transpose with column name instead of index - python-3.x

I can't seem to figure out how to show the actual column names in the JSON after the dataframe has been transposed. Any thoughts, please?
from pandasql import *
import pandas as pd
pysqldf = lambda q: sqldf(q, globals())
q1 = """
SELECT
beef as beef, veal as veal, pork as pork, lamb_and_mutton as lamb
FROM
meat m
LIMIT 3;
"""
meat = load_meat()
df = pysqldf(q1)
#df = df.reset_index(drop=True)
#print(df.T.to_json(orient='records'))
df1 = df.T.reset_index(drop=True)
df1.columns = range(len(df1.columns))
print(df.T.to_json(orient='records'))
Output
[{"0":751.0,"1":713.0,"2":741.0},{"0":85.0,"1":77.0,"2":90.0},{"0":1280.0,"1":1169.0,"2":1128.0},{"0":89.0,"1":72.0,"2":75.0}]
Expected Output
[ { "0": "beef", "1": 751, "2": 713, "3": 741},{"0": "veal", "1": 85, "2": 77, "3": 90 },{"0": "pork", "1": 1280, "2": 1169, "3": 1128},{ "0": "lamb", "1": 89, "2": 72, "3": 75 }]

Try this. Given df:
beef veal pork lamb
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Use T, reset_index, and set_axis:
df.T.reset_index()\
.set_axis(range(len(df.columns)), axis=1, inplace=False)\
.to_json(orient='records')
Output:
'[{"0":"beef","1":0,"2":4,"3":8},{"0":"veal","1":1,"2":5,"3":9},{"0":"pork","1":2,"2":6,"3":10},{"0":"lamb","1":3,"2":7,"3":11}]'


Creating a List with 3 Conditions in Python

The code below is a simplified version of a script that reads a page with Selenium and stores the page address in a variable depending on the condition of the page. (The original is too long, so I've shortened it here.)
a = ["70", "80", "90", "100", "110"]
b = [112, 1513, 14, 505, 36]
c = ["url_1", "url_2", "url_3", "url_4", "url_5"]
last = []
num = 0
for l in a:
    if 110 == int(l):
        last.insert(0, c[num])
    elif 100 == int(l):
        last.append(c[num])
    elif 90 == int(l):
        last.append(c[num])
    elif 80 == int(l):
        last.append(c[num])
    elif 70 == int(l):
        last.append(c[num])
    num += 1
print(last)
Originally the idea was to fill last according to the magnitude of the elements of a (appending the matching element of c).
But I found that I also need to reorder the contents by the elements of list b.
What I want is to sort by the values of a in descending numerical order and, where values of a are equal, sort again by the corresponding values of b, largest first.
For example
a = ["70", "100", "90", "100", "100"]
b = [112, 1513, 14, 505, 36]
c = ["url_1", "url_2", "url_3", "url_4", "url_5"]
When handling the '100' entries of a above, the sizes of the numbers at the same indexes in b should also be considered, and those entries ranked by that value. The elements of c at the same indexes are then put into last in that order, so finally
last = ["url_2", "url_4", "url_5", "url_3", "url_1"]
I want to build the list in this order. I've been trying all day and failing. Help!
You can do this using built-ins:
>>> a = ["70", "100", "90", "100", "100"]
>>> b = [112, 1513, 14, 505, 36]
>>> c = ["url_1", "url_2", "url_3", "url_4", "url_5"]
>>>
>>> sorted(zip(map(int, a), b, c), key=lambda x: x[:2], reverse=True)
[(100, 1513, 'url_2'), (100, 505, 'url_4'), (100, 36, 'url_5'), (90, 14, 'url_3'), (70, 112, 'url_1')]
Then if you want to extract only the "urls":
>>> x = sorted(zip(map(int, a), b, c), key= lambda x: x[:2], reverse=True)
>>> [i[2] for i in x]
['url_2', 'url_4', 'url_5', 'url_3', 'url_1']
The way sorted() works here is that, when the key returns a tuple, the values are compared element by element, index by index. That means you can have it sort on several "columns" of each item at once.
You customize that ordering via the key keyword argument. You can give it any function (here, a lambda that slices out the first two fields of each tuple).
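If you prefer not to write the lambda, operator.itemgetter builds the same (value, b) key; a minimal sketch reusing the lists above:
from operator import itemgetter

a = ["70", "100", "90", "100", "100"]
b = [112, 1513, 14, 505, 36]
c = ["url_1", "url_2", "url_3", "url_4", "url_5"]

# itemgetter(0, 1) pulls the (int(a), b) pair out of each zipped tuple, so ties on a fall back to b
x = sorted(zip(map(int, a), b, c), key=itemgetter(0, 1), reverse=True)
print([url for _, _, url in x])
# ['url_2', 'url_4', 'url_5', 'url_3', 'url_1']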
Use pandas DataFrame functions:
import pandas as pd

a = ["70", "100", "90", "100", "100"]
b = [112, 1513, 14, 505, 36]
c = ["url_1", "url_2", "url_3", "url_4", "url_5"]
df = pd.DataFrame([int(i) for i in a], columns=['a'])
df['b'] = b
df['c'] = c
df.sort_values(by = ['a','b'], ascending=False, inplace=True)
print(df['c'].to_list())

Drop duplicates and concat in pandas

I have a dataframe that looks like this:
'id': ["1", "2", "1", "3", "3", "4"],
'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
'code': ["CB25", "CD15", "CZ10", None, None, "AZ51"],
'col_example': ["22", None, "22", "55", "55", "121"],
'comments': ["bonjour", "bonjour", "bonjour", "hola", "Hello", None]
Desired result:
id date code col_example .... comments
0 1 2019 CB25/CZ10 22 .... bonjour (and not bonjour // bonjour)
1 2 2011 CD15 None .... bonjour
2 3 2017 None 55 .... hola // Hello
3 4 2018 AZ51 121 .... None
I want to keep a single row per id.
If two ids are the same, I would like:
If one comment is None and the other is a string: keep only the comment that is not None (example: id = 1, keep the comment "bonjour").
If both comments are strings: concatenate the two comments with " // " (example: id = 3, comments = "hola // Hello").
For the moment I have tried sort_values and drop_duplicates, without success.
Thank you.
I believe you need GroupBy.agg per id with 'last' for date and a lambda that joins the non-NaN comments, then DataFrame.replace to turn empty joined strings back into None:
df1 = (df.groupby('id')
         .agg({'date': 'last',
               'comments': lambda x: ' // '.join(x.dropna())})
         .replace({'comments': {'': None}})
         .reset_index())
print (df1)
id date comments
0 1 2019 bonjour
1 2 2011 bonjour
2 3 2017 hola // Hello
3 4 2018 None
EDIT: To avoid dropping the other columns you need to aggregate all of them; you can build the aggregation dictionary dynamically, like:
df = pd.DataFrame({'id': ["1", "2", "1", "3", "3", "4"],
                   'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
                   'code': ["CB25", "CD15", "CB25", None, None, "AZ51"],
                   'col_example': ["22", None, "22", "55", "55", "121"],
                   'comments': [None, "bonjour", "bonjour", "hola", "Hello", None]})
print (df)
id date code col_example comments
0 1 2017 CB25 22 None
1 2 2011 CD15 None bonjour
2 1 2019 CB25 22 bonjour
3 3 2013 None 55 hola
4 3 2017 None 55 Hello
5 4 2018 AZ51 121 None
d = dict.fromkeys(df.columns.difference(['id','comments']), 'last')
d['comments'] = lambda x: ' // '.join(x.dropna())
print (d)
{'code': 'last', 'col_example': 'last', 'date': 'last',
'comments': <function <lambda> at 0x000000000ECA99D8>}
df1 = (df.groupby('id')
         .agg(d)
         .replace({'comments': {'': None}})
         .reset_index())
print (df1)
id code col_example date comments
0 1 CB25 22 2019 bonjour
1 2 CD15 None 2011 bonjour
2 3 None 55 2017 hola // Hello
3 4 AZ51 121 2018 None
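As a side note (my own variation, assuming pandas 0.25 or newer and the df from the EDIT above): the same dynamic aggregation can be written with named aggregation, keeping the code close to the dictionary approach:
# build (column, aggfunc) pairs instead of plain aggfuncs
d = {c: (c, 'last') for c in df.columns.difference(['id', 'comments'])}
d['comments'] = ('comments', lambda x: ' // '.join(x.dropna()))

df1 = (df.groupby('id')
         .agg(**d)
         .replace({'comments': {'': None}})
         .reset_index())
print (df1)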

Massage csv dataframe into dictionary style

I have a pandas dataframe named tshirt_orders from an API call looking like this:
Alice, small, red
Alice, small, green
Bob, small, blue
Bob, small, orange
Cesar, medium, yellow
David, large, purple
How can I get this into a dictionary-style format where the top-level keys are the sizes, with sub-keys for each name and a sub-list of colors underneath, so that I can address it while iterating over tshirt_orders?
Like this:
size:
  small:
    Name:
      Alice:
        Color:
          red
          green
      Bob:
        Color:
          blue
          orange
  medium:
    Name:
      Cesar:
        Color:
          yellow
  large:
    Name:
      David:
        Color:
          purple
What would be the best way to do this? The data is in a pandas dataframe, but changing that isn't a problem if there are better solutions.
The closest to what you want is to write the DataFrame to YAML.
First create nested dictionaries with a dict comprehension:
print (df)
A B C
0 Alice small red
1 Alice small green
2 Bob small blue
3 Bob small orange
4 Cesar medium yellow
5 David large purple
d = {k:v.groupby('A', sort=False)['C'].apply(list).to_dict()
for k, v in df.groupby('B', sort=False)}
print (d)
{'small': {'Alice': ['red', 'green'],
'Bob': ['blue', 'orange']},
'medium': {'Cesar': ['yellow']},
'large': {'David': ['purple']}}
Then add size as the top-level key and write to a yaml file:
import yaml
with open('result.yml', 'w') as yaml_file:
    yaml.dump({'size': d}, yaml_file, default_flow_style=False, sort_keys=False)
size:
  small:
    Alice:
    - red
    - green
    Bob:
    - blue
    - orange
  medium:
    Cesar:
    - yellow
  large:
    David:
    - purple
Or create json:
import json
with open("result.json", "w") as twitter_data_file:
json.dump({'size': d}, twitter_data_file, indent=4)
{
    "size": {
        "small": {
            "Alice": [
                "red",
                "green"
            ],
            "Bob": [
                "blue",
                "orange"
            ]
        },
        "medium": {
            "Cesar": [
                "yellow"
            ]
        },
        "large": {
            "David": [
                "purple"
            ]
        }
    }
}
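As a side note, since the asker said moving away from pandas is not a problem: the same simple nesting can also be built in plain Python with collections.defaultdict. A sketch of my own, rebuilding the example data with the same A/B/C column names as above:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'A': ['Alice', 'Alice', 'Bob', 'Bob', 'Cesar', 'David'],
                   'B': ['small', 'small', 'small', 'small', 'medium', 'large'],
                   'C': ['red', 'green', 'blue', 'orange', 'yellow', 'purple']})

d = defaultdict(lambda: defaultdict(list))
for name, size, color in df[['A', 'B', 'C']].itertuples(index=False):
    d[size][name].append(color)

# convert back to plain dicts before dumping to yaml/json
d = {size: dict(names) for size, names in d.items()}
print (d)
# {'small': {'Alice': ['red', 'green'], 'Bob': ['blue', 'orange']},
#  'medium': {'Cesar': ['yellow']}, 'large': {'David': ['purple']}}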
EDIT:
df = df.assign(A1='Name', B1='size', C1='Color')
df1 = df.groupby(['B1','B','A1','A','C1'], sort=False)['C'].apply(list).reset_index()

#https://stackoverflow.com/a/19900276
def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0], sort=False)
    d = {k: recur_dictify(g.iloc[:,1:]) for k,g in grouped}
    return d

d = recur_dictify(df1)
print (d)
{'size': {'small': {'Name': {'Alice': {'Color': ['red', 'green']},
'Bob': {'Color': ['blue', 'orange']}}},
'medium': {'Name': {'Cesar': {'Color': ['yellow']}}},
'large': {'Name': {'David': {'Color': ['purple']}}}}}
import yaml
with open('result.yml', 'w') as yaml_file:
    yaml.dump(d, yaml_file, default_flow_style=False, sort_keys=False)

int() or dict for single digit string to int conversions

I am trying to convert a single-digit string to an integer. For example, if I have "2" (a str) I want to change it to 2 (an int). I know that the int() function in Python can do this for me, but I want to know: if I made a dictionary like this,
strToNumDict = {
    "0": 0,
    "1": 1,
    "2": 2,
    "3": 3,
    "4": 4,
    "5": 5,
    "6": 6,
    "7": 7,
    "8": 8,
    "9": 9
}
Would using this dictionary to convert single digits be faster than the int() function? And if one is faster, is it fast enough to make a difference on which one I should use?
Let's do a quick timeit benchmark. We generate one digit at random and convert it to an integer (1,000,000 times):
strToNumDict = {
    "0": 0,
    "1": 1,
    "2": 2,
    "3": 3,
    "4": 4,
    "5": 5,
    "6": 6,
    "7": 7,
    "8": 8,
    "9": 9
}

def convert1(s):
    return strToNumDict[s]

def convert2(s):
    return int(s)

import timeit
import random

digits = list('0123456789')

t1 = timeit.timeit(lambda: convert1(random.choice(digits)), number=1_000_000)
t2 = timeit.timeit(lambda: convert2(random.choice(digits)), number=1_000_000)

print(t1)
print(t2)
Prints on my machine (AMD 2400G, Python 3.6.8):
0.6220340259897057
0.727682675991673
So the dict-based version is marginally faster in this case. Personally, I don't think it's worth the effort. With int() you also get conversion of negative numbers, numbers greater than 9, etc., and it's more readable.
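To make the robustness point concrete, a couple of throwaway checks (my addition, reusing strToNumDict from above):
print(int("-3"))      # -3
print(int("42"))      # 42
# strToNumDict["-3"]  # would raise KeyError: the dict only covers the single digits "0"-"9"
# strToNumDict["42"]  # would raise KeyError as well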

Convert list of Pandas Dataframe JSON objects

I have a Dataframe with one column where each cell in the column is a JSON object.
players
0 {"name": "tony", "age": 57}
1 {"name": "peter", age": 46}
I want to convert this to a data frame as:
name age
tony 57
peter 46
Any ideas how I do this?
Note: the original JSON object looks like this...
{
"players": [{
"age": 57,
"name":"tony"
},
{
"age": 46,
"name":"peter"
}]
}
Use the DataFrame constructor if the values are dicts:
#print (type(df.loc[0, 'players']))
#<class 'str'>
#import ast
#df['players'] = df['players'].apply(ast.literal_eval)
print (type(df.loc[0, 'players']))
<class 'dict'>
df = pd.DataFrame(df['players'].values.tolist())
print (df)
age name
0 57 tony
1 46 peter
But it is better to use json_normalize on the original JSON object, as suggested by @jpp:
from pandas import json_normalize  # on pandas < 1.0: from pandas.io.json import json_normalize

json = {
    "players": [{
        "age": 57,
        "name": "tony"
    },
    {
        "age": 46,
        "name": "peter"
    }]
}
df = json_normalize(json, 'players')
print (df)
age name
0 57 tony
1 46 peter
This can do it:
df = df['players'].apply(pd.Series)
However, it's slow:
In [20]: timeit df.players.apply(pd.Series)
1000 loops, best of 3: 824 us per loop
@jezrael's suggestion is faster:
In [24]: timeit pd.DataFrame(df.players.values.tolist())
1000 loops, best of 3: 387 us per loop
