Changing column data types in a Pandas DataFrame using a for loop - python-3.x

I want to change the data type of a few columns in a DataFrame from Int/Float to Object.
I am trying to do this with a for loop, but the data types are not changing.
for i in convert_to_object:
    i = i.astype('object', inplace=True)

# convert_to_object is a list of columns in the data frame df_app1
convert_to_object = [df_app1.REGION_RATING_CLIENT,
                     df_app1.REGION_RATING_CLIENT_W_CITY,
                     df_app1.REG_REGION_NOT_LIVE_REGION,
                     df_app1.REG_REGION_NOT_WORK_REGION,
                     df_app1.LIVE_REGION_NOT_WORK_REGION,
                     df_app1.REG_CITY_NOT_LIVE_CITY,
                     df_app1.REG_CITY_NOT_WORK_CITY,
                     df_app1.LIVE_CITY_NOT_WORK_CITY]

for i in convert_to_object:
    i = i.astype('object', inplace=True)
df.info()
The data type of the given columns should change to Object, but it does not.

As far as I know, DataFrame.astype does not support an inplace parameter. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html. How about:
cols = [
    "REGION_RATING_CLIENT",
    "REGION_RATING_CLIENT_W_CITY",
    "REG_REGION_NOT_LIVE_REGION",
    "REG_REGION_NOT_WORK_REGION",
    "LIVE_REGION_NOT_WORK_REGION",
    "REG_CITY_NOT_LIVE_CITY",
    "REG_CITY_NOT_WORK_CITY",
    "LIVE_CITY_NOT_WORK_CITY"
]
df[cols] = df[cols].astype(object)
Consider this example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [1.2, 3.4]
})
print(df.dtypes)
df[['A', 'B']] = df[['A', 'B']].astype(object)
print(df.dtypes)
Returns:
A      int64
B    float64
dtype: object

A    object
B    object
dtype: object
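If you prefer to keep the loop from the question, the key point is that rebinding the loop variable never touches the DataFrame; you have to assign the converted Series back by column name. A minimal sketch using the cols list above:
# Sketch: the loop-based equivalent of the assignment above; each converted
# Series is assigned back by name, because astype() returns a new object
# rather than modifying the frame in place.
for col in cols:
    df[col] = df[col].astype(object)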

Related

Applying a function on a pandas groupby object returns a variable type

When applying a user-defined function f that takes a pd.DataFrame as input and returns a pd.Series on a pd.DataFrame.groupby object, the type of the returned object seems to depend on the number of unique values present in the field used to perform the grouping operation.
I am trying to understand why the API behaves this way, and I am looking for a neat way to have a pd.Series returned regardless of the number of unique values in the grouping field.
I went through the split-apply-combine section of the pandas documentation, and it seems like the single-valued dataframe is treated as a pd.Series, which does not make sense to me.
import pandas as pd
from typing import Union

def f(df: pd.DataFrame) -> pd.Series:
    """
    User-defined function
    """
    return df['B'] / df['B'].max()

# Should only output a pd.Series
def perform_apply(df: pd.DataFrame) -> Union[pd.Series, pd.DataFrame]:
    return df.groupby('A').apply(f)

# Some dummy dataframe with multiple values in field 'A'
df1 = pd.DataFrame({'A': 'a a b'.split(),
                    'B': [1, 2, 3],
                    'C': [4, 6, 5]})

# Subset of the dataframe with a single value in field 'A'
df2 = df1[df1['A'] == 'a'].copy()

res1 = perform_apply(df1)
res2 = perform_apply(df2)
print(type(res1), type(res2))
# --------------------------------
# -> <class 'pandas.core.series.Series'> <class 'pandas.core.frame.DataFrame'>
pandas : 1.4.2
python : 3.9.0
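One possible workaround, only a sketch rather than an authoritative fix, and reusing f and the frames from the code above: coerce the result back to a Series whenever apply returns the wide, one-row-per-group DataFrame.
# Sketch: if apply() returns a DataFrame (the single-group case), stack() folds
# its columns back into a Series with a (group, original index) MultiIndex,
# matching the shape of the multi-group output.
def perform_apply_as_series(df: pd.DataFrame) -> pd.Series:
    res = df.groupby('A').apply(f)
    if isinstance(res, pd.DataFrame):
        res = res.stack()
    return res

print(type(perform_apply_as_series(df1)), type(perform_apply_as_series(df2)))
# -> <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>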

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been extracting data from many APIs. I would like to add a common column across all of the APIs.
I have tried the below:
df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = request.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, index=False)
But in the dataframe's id column I'm getting only the last iterated id, and in the case of APIs that return many rows I'm getting invalid data.
For performance reasons it would be better to first store the data in a dictionary and then create the dataframe from this dictionary:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 200):
    # simulate the dataframe retrieved from the pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])

df = pd.DataFrame(d)
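An alternative sketch, not part of the answer above and assuming the same simulated per-request frame: collect each frame in a list, add the id to that frame, and concatenate once after the loop.
# Sketch: one frame per request, id assigned on that frame, a single concat at the end.
frames = []
for i in range(1, 200):
    row = pd.DataFrame({'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    row['id'] = i
    frames.append(row)

df = pd.concat(frames, ignore_index=True)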

GroupBy and Sum with a RangeIndex

I'm reading in a CSV with wide data, which I convert to long data. The data contains daily values for all of 2020. I'm trying to aggregate this by month and sum. This is what I have tried:
import pandas as pd
df = pd.read_csv('Notebooks/updated_predicted_data.csv', parse_dates=['Unnamed: 0'])
df.rename( columns={'Unnamed: 0':'yyyy_mm_dd'}, inplace=True)
df = df.melt(id_vars=['yyyy_mm_dd'])
df.rename(columns={'variable': 'name'}, inplace=True)
df.rename(columns={'value': 'predicted_value'}, inplace=True)
df['predicted_value'] = df['predicted_value'].str.replace('€', '')
df['predicted_value'] = df['predicted_value'].str.replace(',', '')
df['predicted_value'] = df['predicted_value'].astype(int)
df.dtypes
yyyy_mm_dd       datetime64[ns]
name                     object
predicted_ttv             int64
dtype: object
All of the above works fine, but when I then try to group the data, I run into this issue:
sum_df = df.groupby(pd.Grouper(freq='M'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Why do I get the above error, despite converting my date column from string to datetime and my predicted column from string to int?
You can add the key parameter to Grouper to pass the name of the datetime column, group by it together with the name column in a list, and add the sum aggregation:
sum_df = (df.groupby(['name', pd.Grouper(freq='M', key='yyyy_mm_dd')])['predicted_ttv']
            .sum()
            .reset_index())
Or convert yyyy_mm_dd to a DatetimeIndex first:
sum_df = (df.set_index('yyyy_mm_dd')
            .groupby(['name', pd.Grouper(freq='M')])['predicted_ttv']
            .sum()
            .reset_index())
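A further option, just a sketch rather than part of the answer above: group on a monthly period derived from the datetime column, which avoids Grouper entirely.
# Sketch: dt.to_period('M') turns each timestamp into its month, so the groupby
# needs neither a DatetimeIndex nor a Grouper.
sum_df = (df.groupby(['name', df['yyyy_mm_dd'].dt.to_period('M')])['predicted_ttv']
            .sum()
            .reset_index())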

How to write a Pandas DataFrame into an HDF5 dataset

I'm trying to write data from a Pandas dataframe into a nested HDF5 file, with multiple groups and datasets within each group. I'd like to keep it as a single file which will grow daily in the future. I've had a go with the following code, which shows the structure of what I'd like to achieve:
import h5py
import numpy as np
import pandas as pd

file = h5py.File('database.h5', 'w')

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

groups = ['A', 'B', 'C']
for m in groups:
    group = file.create_group(m)
    dataset = ['1', '2', '3']
    for n in dataset:
        data = df
        ds = group.create_dataset(m + n, data.shape)
        print("Dataset dataspace is", ds.shape)
        print("Dataset Numpy datatype is", ds.dtype)
        print("Dataset name is", ds.name)
        print("Dataset is a member of the group", ds.parent)
        print("Dataset was created in the file", ds.file)
        print("Writing data...")
        ds[...] = data
        print("Reading data back...")
        data_read = ds[...]
        print("Printing data...")
        print(data_read)

file.close()
This way the nested structure is created, but it loses the index and columns. I've tried
df.to_hdf('database.h5', ds, table=True, mode='a')
but it didn't work; I get this error:
AttributeError: 'Dataset' object has no attribute 'split'
Can anyone shed some light, please? Many thanks.
df.to_hdf() expects a string as a key parameter (second parameter):
key : string
identifier for the group in the store
so try this:
df.to_hdf('database.h5', ds.name, table=True, mode='a')
where ds.name should return you a string (key name):
In [26]: ds.name
Out[26]: '/A1'
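As a side note, and only a sketch assuming pandas' HDFStore/PyTables backend rather than h5py: the nested layout can also be written directly with path-like string keys, which keeps the index and columns intact.
# Sketch: write the frame under nested keys; format='table' keeps each node appendable.
for m in ['A', 'B', 'C']:
    for n in ['1', '2', '3']:
        df.to_hdf('database.h5', key=m + '/' + n, format='table', mode='a')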
I thought I'd have a go with pandas/PyTables and the HDFStore class instead of h5py, so I tried the following:
import numpy as np
import pandas as pd

db = pd.HDFStore('Database.h5')

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])

groups = ['A', 'B', 'C']
i = 1
for m in groups:
    subgroups = ['d', 'e', 'f']
    for n in subgroups:
        db.put(m + '/' + n, df, format='table', data_columns=True)
It works: 9 groups (groups in PyTables instead of datasets in h5py?) are created, from A/d to C/f. Columns and indexes are preserved, and I can do the dataframe operations I need. I'm still wondering, though, whether this is an efficient way to retrieve data from a specific group which will become huge in the future, i.e. operations like
db['A/d'].Col1[4:]
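On the efficiency question, a sketch of the pattern I understand to be more scalable: because the nodes were written with format='table', HDFStore.select can push row and column filtering down into the store instead of loading the whole group first.
# Sketch: read only the needed rows/columns of one group; the where clause is
# evaluated inside the store rather than after a full load.
subset = db.select('A/d', where="index >= '2000-01-05'", columns=['Col1'])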

Merge and then sort columns of a dataframe based on the columns of the merging dataframe

I have two dataframes, both indexed with timestamps. I would like to preserve the order of the columns in the first dataframe that is merged.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The above code generates two data frames, the first with columns C, B, and A, and the second with columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. What I want is for the columns in the output of the merge to be C_1, C_2, B_1, B_2, A_1, D_2, i.e. the order of the columns from the first data frame is preserved, and any column shared with the second data frame is placed next to the corresponding column.
Is there a setting in merge, or can I use sort_index to do this?
EDIT: Maybe a better way to phrase the sorting process would be to call it uncollating, where matching columns are put together, and so on.
Using an OrderedDict, as you suggested.
from collections import OrderedDict
from itertools import chain

c = df3.columns.tolist()
o = OrderedDict()
for x in c:
    o.setdefault(x.split('_')[0], []).append(x)

c = list(chain.from_iterable(o.values()))
df3 = df3[c]
An alternative involves extracting the prefixes and then calling sorted on the columns.
# https://stackoverflow.com/a/46839182/4909087
p = [s[0] for s in c]
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df3 = df3[c]
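With the sample frames above, both approaches should yield the column order C_1, C_2, B_1, B_2, A, D; note that merge only adds the _1/_2 suffixes to overlapping columns, so A and D keep their bare names rather than becoming A_1 and D_2. A quick check:
print(df3.columns.tolist())
# expected: ['C_1', 'C_2', 'B_1', 'B_2', 'A', 'D']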
