Grouping all column values of a pandas dataframe into a dictionary - python-3.x

I have a pandas dataframe that looks something like this:
df=pd.DataFrame({'a':['A','B','C','A'], 'b':[1,4,1,3], 'c':[0,6,1,0], 'd':[1,0,0,5]})
I want a dataframe that will look like this:
The original dataframe is grouped by the values in column 'a', and the corresponding rows are saved as dictionaries in a new column 'dict'. The key-value pairs are the column names and the values in those columns, respectively. If a value in column 'a' has multiple entries (e.g. A occurs twice in column 'a'), a list of dictionaries should be created for that value.
How can I do this? (Please ignore any grammatical mistakes, and ask if anything in the question is too vague.)

Don't do this. Pandas was never designed to hold lists, tuples, or dicts in series or columns. You can concoct expensive workarounds, but these are not recommended.
The main reason holding lists in a series is not recommended is that you lose the vectorised functionality that comes with NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like a list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods.
See also "What are the advantages of NumPy over regular Python lists?" The arguments in favour of Pandas are the same as for NumPy.
But if you really need it:
# 'records' is the full orient name; newer pandas versions reject the abbreviation 'r'
df = df.groupby('a').apply(lambda x: x.to_dict('records')).reset_index(name='dict')
print(df)
a dict
0 A [{'a': 'A', 'b': 1, 'c': 0, 'd': 1}, {'a': 'A'...
1 B [{'a': 'B', 'b': 4, 'c': 6, 'd': 0}]
2 C [{'a': 'C', 'b': 1, 'c': 1, 'd': 0}]
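In line with the advice above, if the goal is only to look up the rows for each value of 'a', a plain Python dict may be enough and avoids storing dicts inside a column. The following is a minimal sketch under that assumption; the name records_by_a is illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C', 'A'],
                   'b': [1, 4, 1, 3],
                   'c': [0, 6, 1, 0],
                   'd': [1, 0, 0, 5]})

# Build one ordinary dict: each key is a value of column 'a',
# each value is the list of row dicts belonging to that group.
records_by_a = {key: group.to_dict('records') for key, group in df.groupby('a')}

print(records_by_a['A'])
```

Here records_by_a['A'] holds two row dicts because A occurs twice, while 'B' and 'C' each map to a one-element list.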

Related

Check if each Column in pandas DF is only values [ 0-9]

I am trying to scan a dataframe for columns whose values contain only the digits 0-9. I want to exclude or flag columns in this dataframe that contain alpha/numerical values.
df_analysis[df_analysis['unique_values'].astype(str).str.contains(r'^[0-9]*$', na=True)]
import pandas as pd
df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"], "numbers": [6, 4, 5], "more_numbers": [1, 2, 3], "mixed": ["wef", 8, 9]})
print(df.select_dtypes(include=["int64"]).columns.to_list())
print(df.select_dtypes(include=["object"]).columns.to_list())
Create dataframe with multiple columns. Use .select_dtypes to find the columns that are integers and return them as a list. You can add "float64" or any other numeric type to the include list.
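Note that select_dtypes looks at the column dtype, so a mixed column such as "mixed" is simply reported as object. If the check should be on the values themselves, as the regex in the question suggests, one possible sketch is to convert each column to strings and test every value with str.fullmatch. The helper name is_digits_only is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"],
                   "numbers": [6, 4, 5],
                   "more_numbers": [1, 2, 3],
                   "mixed": ["wef", 8, 9]})

def is_digits_only(column):
    # True only if every value, rendered as a string, consists solely of 0-9
    return column.astype(str).str.fullmatch(r"[0-9]+").all()

digit_only = df.apply(is_digits_only)                     # one boolean per column
digit_columns = digit_only[digit_only].index.tolist()     # columns to keep
flagged_columns = digit_only[~digit_only].index.tolist()  # columns to flag

print(digit_columns)
print(flagged_columns)
```

With this approach a missing value becomes the string "nan" and therefore flags its column; whether that is the desired behaviour depends on the data.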
Output:

Unpacking a JSON dump row by row and using its keys as separate columns takes a very long time in Pandas

I have a column in a dataframe with JSON data encased in a list. Each row basically contains something like this:
[{'id': 'A', 'price': 43}, {'id': 'B', 'price': 57}, {'id': 'C', 'price': 99}]
....
What I want is to unpack the JSON object so that the keys go into separate columns in the existing dataframe, and the values are updated in each row under the respective column. Basically the above should yield something like this, with A, B and C as the columns:
A, B, C
43, 57, 99
...
This is my naive implementation, but it is EXTREMELY slow, and has been running for the last 1 hour on a dataset with millions of rows.
for n in range(df.shape[0]):
    for dump in df['jsondump'].iloc[n]:
        df.loc[n, dump['id']] = dump['price']
What is a better way of doing this?
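One approach that is usually much faster than assigning cell by cell with df.loc is to build one plain dict per row first and construct the new columns in a single DataFrame call. The sketch below assumes every list item has the 'id' and 'price' keys shown in the question; the second sample row and the names rows, prices and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'jsondump': [
        [{'id': 'A', 'price': 43}, {'id': 'B', 'price': 57}, {'id': 'C', 'price': 99}],
        [{'id': 'A', 'price': 10}, {'id': 'C', 'price': 30}],  # hypothetical second row
    ]
})

# Turn each row's list of {'id', 'price'} items into one {id: price} dict
rows = [{item['id']: item['price'] for item in dump} for dump in df['jsondump']]

# Build all new columns at once, aligned to the original index
prices = pd.DataFrame(rows, index=df.index)

# Attach the new columns to the existing dataframe
result = df.join(prices)
print(result)
```

Ids that are missing from a row end up as NaN in that row, which also turns the affected column into a float column.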

Dataframe column to list of strings (with groupby)

I have a dataframe and I want to get one of its columns as a list of strings, so that from something like:
df = pd.DataFrame({'customer':['a','a','a','b','b'],
'location':['1','2','3','4','5']})
I can get a dataframe like:
a ['1','2','3']
b ['4','5']
where one column is the customer and another is a list of strings of their location.
I have tried df.astype(str).values.tolist() but I can't seem to groupby in order to get the list per customer.
Just use
df.groupby('customer').location.unique()
Out[58]:
customer
a [1, 2, 3]
b [4, 5]
Name: location, dtype: object
The values are strings; the quotes are just not shown in the output:
df.groupby('customer').location.unique().iloc[0][0]
Out[61]: '1'
Note also that strings inside a list are displayed without quotes when the list is stored in a pandas object:
pd.Series([['1','2']])
Out[64]:
0 [1, 2]
dtype: object
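Keep in mind that unique() drops repeated locations and returns NumPy arrays rather than Python lists. If actual lists with duplicates preserved are wanted, aggregating with list is one alternative; the sketch below uses the question's data, and the name locations is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'a', 'b', 'b'],
                   'location': ['1', '2', '3', '4', '5']})

# Collect every location per customer into a Python list,
# then turn the group labels back into a regular column
locations = df.groupby('customer')['location'].agg(list).reset_index()

print(locations)
```

The result has one row per customer, with the 'location' column holding a list of strings.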

Converting dataframe into dictionary

I have a dataframe with three columns containing 220 datapoints. Now I need to make one column the key and the other column the value and remove the third column. How do I do that?
I created the dataframe by scraping Wikipedia in order to build a keyword search. Now I need to create an index of the terms it contains, for which dictionaries are the most effective. How do I create a dictionary out of a dataframe where one column is the key for another column?
I have used a sample dataframe with 3 columns and 3 rows, as you have not provided the actual data. You can replace it with your data and column names.
I have used a for loop with iterrows() to loop over each row.
Code:
import pandas as pd
df = pd.DataFrame(
    {'Alphabet': ['A', 'B', 'C'],
     'Number': [1, 2, 3],
     'To_Remove': [10, 15, 8]})
sample_dictionary = {}
# Use one column as the key and another as the value; 'To_Remove' is ignored
for index, row in df.iterrows():
    sample_dictionary[row['Alphabet']] = row['Number']
print(sample_dictionary)
Output:
{'A': 1, 'B': 2, 'C': 3}
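The same mapping can also be built without an explicit row loop, which tends to be shorter and faster on larger frames. The two variants below are a sketch using the sample column names from this answer; by_zip and by_series are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Alphabet': ['A', 'B', 'C'],
                   'Number': [1, 2, 3],
                   'To_Remove': [10, 15, 8]})

# Variant 1: pair the two columns element by element
by_zip = dict(zip(df['Alphabet'], df['Number']))

# Variant 2: make the key column the index, then convert the value column
by_series = df.set_index('Alphabet')['Number'].to_dict()

print(by_zip)
print(by_series)
```

In both cases the third column is simply never referenced, so there is no need to drop it first. If the key column contains duplicates, later rows overwrite earlier ones.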
You can use the Pandas method
pd.DataFrame.to_dict
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
Example
import pandas as pd
# Original dataframe
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [0.5, 0.75, 1.0],
                   'col3': [0.1, 0.9, 1.9]},
                  index=['a', 'b', 'c'])
# To dictionary (to_dict takes an optional orient, not the dataframe itself)
dictionary = df.to_dict()
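The shape of the resulting dictionary depends on the orient argument of to_dict. As a rough illustration on a smaller two-column frame (the variable names are made up for this example), a few common orients look like this:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [0.5, 0.75, 1.0]},
                  index=['a', 'b', 'c'])

# Default orient: {column -> {index label -> value}}
as_columns = df.to_dict()

# 'list': {column -> list of values}
as_lists = df.to_dict('list')

# 'records': one dict per row, index labels dropped
as_records = df.to_dict('records')

# 'index': {index label -> {column -> value}}
as_index = df.to_dict('index')

print(as_columns)
print(as_index)
```

For the question's case of one key column and one value column, the 'index' or default orient applied after set_index on the key column is usually the closest fit.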

Retrieve pandas dataframe column index

Say I have a pandas dataframe. I can access the columns either by their name or by their index.
Is there a simple way in which I can retrieve the column index given its name?
Use get_loc on the columns Index object to return the ordinal index value:
In [283]:
df = pd.DataFrame(columns=list('abcd'))
df
Out[283]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []
In [288]:
df.columns.get_loc('b')
Out[288]:
1
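As a small follow-up sketch, the position returned by get_loc can be passed to iloc to select the same column by position, and get_indexer looks up several names at once. The sample row of data here is invented for the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=list('abcd'))

# Ordinal position of a single column name
position = df.columns.get_loc('b')

# Selecting by that position gives the same column as selecting by name
same_column = df.iloc[:, position]

# Positions of several names in one call (returns a NumPy array)
positions = df.columns.get_indexer(['b', 'd'])

print(position, positions)
```

get_indexer reports -1 for a name that is not present, whereas get_loc raises a KeyError.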
What do you mean by index exactly?
I bet you are referring to index as a list index, right?
Because Pandas has another kind of index too.
If I understand the question correctly, you can do the following:
my_df = pd.DataFrame(columns=['A', 'B', 'C'])
my_columns = my_df.columns.tolist()
print(my_columns)  # yields ['A', 'B', 'C'], so you can recover the index as follows
my_columns.index('C')  # yields 2

Resources