ValueError: arrays must all be same length in python using pandas DataFrame - python-3.x

I'm a newbie in python and using Dataframe from pandas package (python3.6).
I set it up like below code,
df = DataFrame({'list1': list1, 'list2': list2, 'list3': list3, 'list4': list4, 'list5': list5, 'list6': list6})
and it gives an error like ValueError: arrays must all be same length
So I checked all the length of arrays, and list1 & list2 have 1 more data than other lists. If I want to add 1 data to those other 4 lists(list3, list4, list5, list6) by using pd.resample, then how should I write code...?
Also, those lists are time series list with 1 minute.
Does anybody have an idea or help me out here?
Thanks in advance.
EDIT
So I changed as what EdChum said.
and added time list at the front. it is like below.
2017-04-01 0:00 895.87 730 12.8 4 19.1 380
2017-04-01 0:01 894.4 730 12.8 4 19.1 380
2017-04-01 0:02 893.08 730 12.8 4 19.3 380
2017-04-01 0:03 890.41 730 12.8 4 19.7 380
2017-04-01 0:04 889.28 730 12.8 4 19.93 380
and I typed code like
df.resample('1min', how='mean', fill_method='pad')
And it gives me this error: TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

I'd just construct a Series for each list and then concat them all:
In [38]:
l1 = list('abc')
l2 = [1,2,3,4]
s1 = pd.Series(l1, name='list1')
s2 = pd.Series(l2, name='list2')
df = pd.concat([s1,s2], axis=1)
df
Out[38]:
list1 list2
0 a 1
1 b 2
2 c 3
3 NaN 4
As you can pass a name arg for the Series ctor it will name each column in the df, plus it will place NaN where the column lengths don't match
resample refers to when you have a DatetimeIndex for which you want to rebase or adjust the length based on some time period which is not what you want here. You want to reindex which I think is unnecessary and messy:
In [40]:
l1 = list('abc')
l2 = [1,2,3,4]
s1 = pd.Series(l1)
s2 = pd.Series(l2)
df = pd.DataFrame({'list1':s1.reindex(s2.index), 'list2':s2})
df
Out[40]:
list1 list2
0 a 1
1 b 2
2 c 3
3 NaN 4
Here you'd need to know the longest length and then reindex all Series using that index, if you just concat it will automatically adjust the lengths and fill missing elements with NaN

According to this documentation, it looks quite difficult to do this with pd.resample() : You should calculate a frequence which add only one value to your df, and the function seems really not made for this ^^ (seems to permit easy reshaping, ex : 1 min to 30sec or 1h) ! You'd better try what EdChum did :P

Related

Pandas shift datetimeindex takes too long time running

I have a running time issue with shifting a large dataframe with datetime index.
Example using created dummy data:
df = pd.DataFrame({'col1':[0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,'col3':list(np.random.randint(0,100000,14*10**5)),'col2':list(pd.date_range('2020-01-01','2020-08-01',freq='M'))*2*10**5})
df.col3=df.col3.astype(str)
df.drop_duplicates(subset=['col3','col2'],keep='first',inplace=True)
If I shift not using datetimeindex, it only takes about 12s:
%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s
But when I use datetimeindex, as that situation that I need, it takes about 40 minutes:
%%time
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s
In my situation, I need the data from shift(1) until shift(6) and merge them with original data by col2 and col3. So I use for looping and merge.
Is there any solution for this? Thanks for your answer, will appreciate so much any respond.
Ben's answer solves it:
%%time
tmp=df1[['col1','col3', 'col2']].assign(col2 = lambda x: x['col2'] + MonthEnd(2)).set_index(['col3', 'col2']).add_suffix(f'_{2}').fillna(0).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 5.94 s
also implement to the looping:
%%time
res=(pd.concat([df1.assign(col2 = lambda x: x['col2'] + MonthEnd(i)).set_index(['col3', 'col2']).add_suffix(f'_{i}') for i in range(0,7)],axis=1).fillna(0)).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 1min 44s
Actually, my real data is already using MonthEnd(0) so I just use loop in range(1,7). I also implement to multiple columns so I don't use astype and implement reindex because I use left merge.
The two operations are slightly different, and the results are not the same because your data (at least the dummy here) is not ordered and especially if you have missing dates for some col3 values. That said, the time difference seems enormous. So I think you should go a bit differently.
One way is to add X MonthEnd to col2 for X from 0 to 6, use concat all of them, after set_index the col3 and col2, add_suffix to keep track of the "shift" value. fillna and convert the dtype to original one. The rest is mostly cosmetic depending on your needs.
from pandas.tseries.offsets import MonthEnd
res = (
pd.concat([
df.assign(col2 = lambda x: x['col2'] + MonthEnd(i))
.set_index(['col3', 'col2'])
.add_suffix(f'_{i}')
for i in range(0,7)],
axis=1)
.fillna(0)
# depends on your original data
.astype(df['col1'].dtype)
# if you want a left merge ordered like original df
#.reindex(pd.MultiIndex.from_frame(df[['col3','col2']]))
# if you want col2 and col3 back as columns
# .reset_index()
)
Note that concat does a outer join by default, so you end up with month that where not in your original data and col1_0 is actually the original data with my random numbers.
print(res.head(10))
col1_0 col1_1 col1_2 col1_3 col1_4 col1_5 col1_6
col3 col2
0 2020-01-31 7 0 0 0 0 0 0
2020-02-29 8 7 0 0 0 0 0
2020-03-31 2 8 7 0 0 0 0
2020-04-30 3 2 8 7 0 0 0
2020-05-31 4 3 2 8 7 0 0
2020-06-30 12 4 3 2 8 7 0
2020-07-31 13 12 4 3 2 8 7
2020-08-31 0 13 12 4 3 2 8
2020-09-30 0 0 13 12 4 3 2
2020-10-31 0 0 0 13 12 4 3
This is an issue with groupby + shift. The problem is that if you specify an axis other than 0 or a frequency it falls back to a very slow loop over the groups. If neither of those are specified it's able to use a much faster path, which is why you see an order of magitude difference between the performance.
The relevant code in for DataFrame.GroupBy.shift is:
def shift(self, periods=1, freq=None, axis=0, fill_value=None):
"""..."""
if freq is not None or axis != 0:
return self.apply(lambda x: x.shift(periods, freq, axis, fill_value))
Previously this issue extended to specifying a fill_value

Anyone know how to write in a short way the folowing code [duplicate]

I created a DataFrame from a list of lists:
table = [
['a', '1.2', '4.2' ],
['b', '70', '0.03'],
['x', '5', '0' ],
]
df = pd.DataFrame(table)
How do I convert the columns to specific types? In this case, I want to convert columns 2 and 3 into floats.
Is there a way to specify the types while converting the list to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the dtype for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns, and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.
You have four main options for converting types in pandas:
to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorial types (very useful).
infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).
Read on for more detailed explanations and usage of each of these methods.
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().
This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame.
>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0 8
1 6
2 7.5
3 3
4 0.9
dtype: object
>>> pd.to_numeric(s) # convert everything to float values
0 8.0
1 6.0
2 7.5
3 3.0
4 0.9
dtype: float64
As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:
# convert Series
my_series = pd.to_numeric(my_series)
# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
You can also use it to convert multiple columns of a DataFrame via the apply() method:
# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame
# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
As long as your values can all be converted, that's probably all you need.
Error handling
But what if some values can't be converted to a numeric type?
to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here's an example using a Series of strings s which has the object dtype:
>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0 1
1 2
2 4.7
3 pandas
4 10
dtype: object
The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':
>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string
Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:
>>> pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 4.7
3 NaN
4 10.0
dtype: float64
The third option for errors is just to ignore the operation if an invalid value is encountered:
>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched
This last option is particularly useful for converting your entire DataFrame, but don't know which of our columns can be converted reliably to a numeric type. In that case, just write:
df.apply(pd.to_numeric, errors='ignore')
The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).
That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?
to_numeric() gives you the option to downcast to either 'integer', 'signed', 'unsigned', 'float'. Here's an example for a simple series s of integer type:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
Downcasting to 'integer' uses the smallest possible integer that can hold the values:
>>> pd.to_numeric(s, downcast='integer')
0 1
1 2
2 -7
dtype: int8
Downcasting to 'float' similarly picks a smaller than normal floating type:
>>> pd.to_numeric(s, downcast='float')
0 1.0
1 2.0
2 -7.0
dtype: float32
2. astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).
Call the method on the object you want to convert and astype() will try and convert it for you:
# convert all DataFrame columns to the int64 dtype
df = df.astype(int)
# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})
# convert Series to float16 type
s = s.astype(np.float16)
# convert Series to Python strings
s = s.astype(str)
# convert Series to categorical type - see docs for more details
s = s.astype('category')
Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.
As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
Be careful
astype() is powerful, but it will sometimes convert values "incorrectly". For example:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
These are small integers, so how about converting to an unsigned 8-bit type to save memory?
>>> s.astype(np.uint8)
0 1
1 2
2 249
dtype: uint8
The conversion worked, but the -7 was wrapped round to become 249 (i.e. 28 - 7)!
Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
3. infer_objects()
Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).
For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:
>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a object
b object
dtype: object
Using infer_objects(), you can change the type of column 'a' to int64:
>>> df = df.infer_objects()
>>> df.dtypes
a int64
b object
dtype: object
Column 'b' has been left alone since its values were strings, not integers. If you wanted to force both columns to an integer type, you could use df.astype(int) instead.
4. convert_dtypes()
Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.
Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type, if all of the values are integers (or missing values): an object column of Python integer objects are converted to Int64, a column of NumPy int32 values, will become the pandas dtype Int32.
With our object DataFrame df, we get the following result:
>>> df.convert_dtypes().dtypes
a Int64
b string
dtype: object
Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).
Column 'b' contained string objects, so was changed to pandas' string dtype.
By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:
>>> df.convert_dtypes(infer_objects=False).dtypes
a object
b string
dtype: object
Now column 'a' remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype) but didn't infer exactly what dtype of integer it should have so did not convert it. Column 'b' was again converted to 'string' dtype as it was recognised as holding 'string' values.
Use this:
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]:
one two three
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
df.dtypes
Out[17]:
one object
two object
three object
df[['two', 'three']] = df[['two', 'three']].astype(float)
df.dtypes
Out[19]:
one object
two float64
three float64
This below code will change the datatype of a column.
df[['col.name1', 'col.name2'...]] = df[['col.name1', 'col.name2'..]].astype('data_type')
In place of the data type, you can give your datatype what you want, like, str, float, int, etc.
When I've only needed to specify specific columns, and I want to be explicit, I've used (per pandas.DataFrame.astype):
dataframe = dataframe.astype({'col_name_1':'int','col_name_2':'float64', etc. ...})
So, using the original question, but providing column names to it...
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col_name_1', 'col_name_2', 'col_name_3'])
df = df.astype({'col_name_2':'float64', 'col_name_3':'float64'})
pandas >= 1.0
Here's a chart that summarises some of the most important conversions in pandas.
Conversions to string are trivial .astype(str) and are not shown in the figure.
"Hard" versus "Soft" conversions
Note that "conversions" in this context could either refer to converting text data into their actual data type (hard conversion), or inferring more appropriate data types for data in object columns (soft conversion). To illustrate the difference, take a look at
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [4, 5, 6]}, dtype=object)
df.dtypes
a object
b object
dtype: object
# Actually converts string to numeric - hard conversion
df.apply(pd.to_numeric).dtypes
a int64
b int64
dtype: object
# Infers better data types for object data - soft conversion
df.infer_objects().dtypes
a object # no change
b int64
dtype: object
# Same as infer_objects, but converts to equivalent ExtensionType
df.convert_dtypes().dtypes
Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.
# df is the DataFrame, and column_list is a list of columns as strings (e.g ["col1","col2","col3"])
# dependencies: pandas
def coerce_df_columns_to_numeric(df, column_list):
df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')
So, for your example:
import pandas as pd
def coerce_df_columns_to_numeric(df, column_list):
df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1','col2','col3'])
coerce_df_columns_to_numeric(df, ['col2','col3'])
df = df.astype({"columnname": str})
#e.g - for changing the column type to string
#df is your dataframe
Create two dataframes, each with different data types for their columns, and then appending them together:
d1 = pd.DataFrame(columns=[ 'float_column' ], dtype=float)
d1 = d1.append(pd.DataFrame(columns=[ 'string_column' ], dtype=str))
Results
In[8}: d1.dtypes
Out[8]:
float_column float64
string_column object
dtype: object
After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.
df.info() gives us initial datatype of temp which is float64
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 132 non-null object
1 temp 132 non-null float64
Now, use this code to change the datatype to int64:
df['temp'] = df['temp'].astype('int64')
if you do df.info() again, you will see:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 132 non-null object
1 temp 132 non-null int64
This shows you have successfully changed the datatype of column temp. Happy coding!
Starting pandas 1.0.0, we have pandas.DataFrame.convert_dtypes. You can even control what types to convert!
In [40]: df = pd.DataFrame(
...: {
...: "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
...: "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
...: "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
...: "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
...: "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
...: "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
...: }
...: )
In [41]: dff = df.copy()
In [42]: df
Out[42]:
a b c d e f
0 1 x True h 10.0 NaN
1 2 y False i NaN 100.5
2 3 z NaN NaN 20.0 200.0
In [43]: df.dtypes
Out[43]:
a int32
b object
c object
d object
e float64
f float64
dtype: object
In [44]: df = df.convert_dtypes()
In [45]: df.dtypes
Out[45]:
a Int32
b string
c boolean
d string
e Int64
f float64
dtype: object
In [46]: dff = dff.convert_dtypes(convert_boolean = False)
In [47]: dff.dtypes
Out[47]:
a Int32
b string
c object
d string
e Int64
f float64
dtype: object
In case you have various objects columns like this Dataframe of 74 Objects columns and 2 Int columns where each value have letters representing units:
import pandas as pd
import numpy as np
dataurl = 'https://raw.githubusercontent.com/RubenGavidia/Pandas_Portfolio.py/main/Wes_Mckinney.py/nutrition.csv'
nutrition = pd.read_csv(dataurl,index_col=[0])
nutrition.head(3)
Output:
name serving_size calories total_fat saturated_fat cholesterol sodium choline folate folic_acid ... fat saturated_fatty_acids monounsaturated_fatty_acids polyunsaturated_fatty_acids fatty_acids_total_trans alcohol ash caffeine theobromine water
0 Cornstarch 100 g 381 0.1g NaN 0 9.00 mg 0.4 mg 0.00 mcg 0.00 mcg ... 0.05 g 0.009 g 0.016 g 0.025 g 0.00 mg 0.0 g 0.09 g 0.00 mg 0.00 mg 8.32 g
1 Nuts, pecans 100 g 691 72g 6.2g 0 0.00 mg 40.5 mg 22.00 mcg 0.00 mcg ... 71.97 g 6.180 g 40.801 g 21.614 g 0.00 mg 0.0 g 1.49 g 0.00 mg 0.00 mg 3.52 g
2 Eggplant, raw 100 g 25 0.2g NaN 0 2.00 mg 6.9 mg 22.00 mcg 0.00 mcg ... 0.18 g 0.034 g 0.016 g 0.076 g 0.00 mg 0.0 g 0.66 g 0.00 mg 0.00 mg 92.30 g
3 rows × 76 columns
nutrition.dtypes
name object
serving_size object
calories int64
total_fat object
saturated_fat object
...
alcohol object
ash object
caffeine object
theobromine object
water object
Length: 76, dtype: object
nutrition.dtypes.value_counts()
object 74
int64 2
dtype: int64
A good way to convert to numeric all columns is using regular expressions to replace the units for nothing and astype(float) for change the columns data type to float:
nutrition.index = pd.RangeIndex(start = 0, stop = 8789, step= 1)
nutrition.set_index('name',inplace = True)
nutrition.replace('[a-zA-Z]','', regex= True, inplace=True)
nutrition=nutrition.astype(float)
nutrition.head(3)
Output:
serving_size calories total_fat saturated_fat cholesterol sodium choline folate folic_acid niacin ... fat saturated_fatty_acids monounsaturated_fatty_acids polyunsaturated_fatty_acids fatty_acids_total_trans alcohol ash caffeine theobromine water
name
Cornstarch 100.0 381.0 0.1 NaN 0.0 9.0 0.4 0.0 0.0 0.000 ... 0.05 0.009 0.016 0.025 0.0 0.0 0.09 0.0 0.0 8.32
Nuts, pecans 100.0 691.0 72.0 6.2 0.0 0.0 40.5 22.0 0.0 1.167 ... 71.97 6.180 40.801 21.614 0.0 0.0 1.49 0.0 0.0 3.52
Eggplant, raw 100.0 25.0 0.2 NaN 0.0 2.0 6.9 22.0 0.0 0.649 ... 0.18 0.034 0.016 0.076 0.0 0.0 0.66 0.0 0.0 92.30
3 rows × 75 columns
nutrition.dtypes
serving_size float64
calories float64
total_fat float64
saturated_fat float64
cholesterol float64
...
alcohol float64
ash float64
caffeine float64
theobromine float64
water float64
Length: 75, dtype: object
nutrition.dtypes.value_counts()
float64 75
dtype: int64
Now the dataset is clean and you are able to do numeric operations with this Dataframe only with regex and astype().
If you want to collect the units and paste on the headers like cholesterol_mg you can use this code:
nutrition.index = pd.RangeIndex(start = 0, stop = 8789, step= 1)
nutrition.set_index('name',inplace = True)
nutrition.astype(str).replace('[^a-zA-Z]','', regex= True)
units = nutrition.astype(str).replace('[^a-zA-Z]','', regex= True)
units = units.mode()
units = units.replace('', np.nan).dropna(axis=1)
mapper = { k: k + "_" + units[k].at[0] for k in units}
nutrition.rename(columns=mapper, inplace=True)
nutrition.replace('[a-zA-Z]','', regex= True, inplace=True)
nutrition=nutrition.astype(float)
Is there a way to specify the types while converting to DataFrame?
Yes. The other answers convert the dtypes after creating the DataFrame, but we can specify the types at creation. Use either DataFrame.from_records or read_csv(dtype=...) depending on the input format.
The latter is sometimes necessary to avoid memory errors with big data.
1. DataFrame.from_records
Create the DataFrame from a structured array of the desired column types:
x = [['foo', '1.2', '70'], ['bar', '4.2', '5']]
df = pd.DataFrame.from_records(np.array(
[tuple(row) for row in x], # pass a list-of-tuples (x can be a list-of-lists or 2D array)
'object, float, int' # define the column types
))
Output:
>>> df.dtypes
# f0 object
# f1 float64
# f2 int64
# dtype: object
2. read_csv(dtype=...)
If you're reading the data from a file, use the dtype parameter of read_csv to set the column types at load time.
For example, here we read 30M rows with rating as 8-bit integers and genre as categorical:
lines = '''
foo,biography,5
bar,crime,4
baz,fantasy,3
qux,history,2
quux,horror,1
'''
columns = ['name', 'genre', 'rating']
csv = io.StringIO(lines * 6_000_000) # 30M lines
df = pd.read_csv(csv, names=columns, dtype={'rating': 'int8', 'genre': 'category'})
In this case, we halve the memory usage upon load:
>>> df.info(memory_usage='deep')
# memory usage: 1.8 GB
>>> pd.read_csv(io.StringIO(lines * 6_000_000)).info(memory_usage='deep')
# memory usage: 3.7 GB
This is one way to avoid memory errors with big data. It's not always possible to change the dtypes after loading since we might not have enough memory to load the default-typed data in the first place.
I thought I had the same problem, but actually I have a slight difference that makes the problem easier to solve. For others looking at this question, it's worth checking the format of your input list. In my case the numbers are initially floats, not strings as in the question:
a = [['a', 1.2, 4.2], ['b', 70, 0.03], ['x', 5, 0]]
But by processing the list too much before creating the dataframe, I lose the types and everything becomes a string.
Creating the data frame via a NumPy array:
df = pd.DataFrame(np.array(a))
df
Out[5]:
0 1 2
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
df[1].dtype
Out[7]: dtype('O')
gives the same data frame as in the question, where the entries in columns 1 and 2 are considered as strings. However doing
df = pd.DataFrame(a)
df
Out[10]:
0 1 2
0 a 1.2 4.20
1 b 70.0 0.03
2 x 5.0 0.00
df[1].dtype
Out[11]: dtype('float64')
does actually give a data frame with the columns in the correct format.
I had the same issue.
I could not find any solution that was satisfying. My solution was simply to convert those float into str and remove the '.0' this way.
In my case, I just apply it on the first column:
firstCol = list(df.columns)[0]
df[firstCol] = df[firstCol].fillna('').astype(str).apply(lambda x: x.replace('.0', ''))
If you want convert one column from string format I suggest use this code"
import pandas as pd
#My Test Data
data = {'Product': ['A','B', 'C','D'],
'Price': ['210','250', '320','280']}
data
#Create Data Frame from My data df = pd.DataFrame(data)
#Convert to number
df['Price'] = pd.to_numeric(df['Price'])
df
Total = sum(df['Price'])
Total
else if you going to convert a number of column values to number I suggest to you first filter your values and save in empty array and after that convert to number. I hope this code solve your problem.
Convert string representation of long numbers to integers
By default, astype(int) converts to int32, which wouldn't work (OverflowError) if a number is particularly long (such as phone number); try 'int64' (or even float) instead:
df['long_num'] = df['long_num'].astype('int64')
On a side note, if you get SettingWithCopyWarning, then make a copy of your frame and do whatever you were doing again. For example, if you were converting col1 and col2 to float dtype, then do:
df = df.copy()
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
# or use assign
df = df.assign(**{k: df[k].astype(float) for k in ['col1', 'col2']})
Convert integers to timedelta
Also, the long string/integer maybe datetime or timedelta, in which case, use to_datetime or to_timedelta to convert to datetime/timedelta dtype:
df = pd.DataFrame({'long_int': ['1018880886000000000', '1590305014000000000', '1101470895000000000', '1586646272000000000', '1460958607000000000']})
df['datetime'] = pd.to_datetime(df['long_int'].astype('int64'))
# or
df['datetime'] = pd.to_datetime(df['long_int'].astype(float))
df['timedelta'] = pd.to_timedelta(df['long_int'].astype('int64'))
Convert timedelta to numbers
To perform the reverse operation (convert datetime/timedelta to numbers), view it as 'int64'. This could be useful if you were building a machine learning model that somehow needs to include time (or datetime) as a numeric value. Just make sure that if the original data are strings, then they must be converted to timedelta or datetime before any conversion to numbers.
df = pd.DataFrame({'Time diff': ['2 days 4:00:00', '3 days', '4 days', '5 days', '6 days']})
df['Time diff in nanoseconds'] = pd.to_timedelta(df['Time diff']).view('int64')
df['Time diff in seconds'] = pd.to_timedelta(df['Time diff']).view('int64') // 10**9
df['Time diff in hours'] = pd.to_timedelta(df['Time diff']).view('int64') // (3600*10**9)
Convert datetime to numbers
For datetime, the numeric view of a datetime is the time difference between that datetime and the UNIX epoch (1970-01-01).
df = pd.DataFrame({'Date': ['2002-04-15', '2020-05-24', '2004-11-26', '2020-04-11', '2016-04-18']})
df['Time_since_unix_epoch'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').view('int64')
astype is faster than to_numeric
df = pd.DataFrame(np.random.default_rng().choice(1000, size=(10000, 50)).astype(str))
df = pd.concat([df, pd.DataFrame(np.random.rand(10000, 50).astype(str), columns=range(50, 100))], axis=1)
%timeit df.astype(dict.fromkeys(df.columns[:50], int) | dict.fromkeys(df.columns[50:], float))
# 488 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.apply(pd.to_numeric)
# 686 ms ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

How to create a dataframe from records that has Timestamps as index?

I am trying to create a dataframe in which i have Timestamps as index but it throws error. I can use the same methodology to create a dataframe in case the index is not a Timestamp. The following codebit is a bare min example:
Works fine:
pd.DataFrame.from_dict({'1':{'a':1,'b':2,'c':3},'2':{'a':1,'c':4},'3':{'b':6}})
output:
1 2 3
a 1 1.0 NaN
b 2 NaN 6.0
c 3 4.0 NaN
BREAKS
o=np.arange(np.datetime64('2017-11-01 00:00:00'),np.datetime64('2017-11-01 00:00:00')+np.timedelta64(3,'D'),np.timedelta64(1,'D'))
pd.DataFrame.from_records({o[0]:{'a':1,'b':2,'c':3},o[1]:{'a':1,'c':4},o[2]:{'b':6}})
output:
KeyError Traceback (most recent call last)
<ipython-input-627-f9a075f611c0> in <module>
1 o=np.arange(np.datetime64('2017-11-01 00:00:00'),np.datetime64('2017-11-01 00:00:00')+np.timedelta64(3,'D'),np.timedelta64(1,'D'))
2
----> 3 pd.DataFrame.from_records({o[0]:{'a':1,'b':2,'c':3},o[1]:{'a':1,'c':4},o[2]:{'b':6}})
~/anaconda3/envs/dfs/lib/python3.6/site-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1617 if columns is None:
1618 columns = arr_columns = ensure_index(sorted(data))
-> 1619 arrays = [data[k] for k in columns]
1620 else:
1621 arrays = []
~/anaconda3/envs/dfs/lib/python3.6/site-packages/pandas/core/frame.py in <listcomp>(.0)
1617 if columns is None:
1618 columns = arr_columns = ensure_index(sorted(data))
-> 1619 arrays = [data[k] for k in columns]
1620 else:
1621 arrays = []
KeyError: Timestamp('2017-11-01 00:00:00')
Please help me understand the behavior and what I am missing. Also, how do go about creating a dataframe from records that has Timestamps as indices
Change from_records to from_dict (just like in your working example)
and everything executes fine.
Another, optional hint: Since you create a Pandas DataFrame, use
pandasonic native way to create datetime values to use as column names:
o = pd.date_range(start='2017-11-01', periods=3)
Edit
I noticed that if you create o object the way I proposed (as a
date_range), you can use even from_records.
Edit 2
You wrote that you want datetime objects as the index, whereas
your code attempts to set them as column names.
If you want datetime objects as the index, run something like:
df = pd.DataFrame.from_records({'1':{o[0]:1, o[1]:2, o[2]:3},
'2':{o[0]:1, o[2]:4}, '3':{o[1]:6}})
The result is:
1 2 3
2017-11-01 1 1.0 NaN
2017-11-02 2 NaN 6.0
2017-11-03 3 4.0 NaN
Another way to create the above result is:
df = pd.DataFrame.from_records([{'1':1, '2':1}, {'1':2, '3':6}, {'1':3, '2':4}], index=o)

Returning non-NaN values when calling df.X in pandas

This is a rather specific follow up to this question on creating pandas dataframes when entries have different lengths.
I have a dataset where I have:
general environmental variables that apply to the whole problem (e.g. avg precipitation)
values at, say, specific depth (e.g. average amount of water at any depth after rainfall)
so my data looks like
d = dict{'depth': [1,2,3], 'var1',[.01,.009,.002],'globalvar',[2.5]}
df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
>>
depth globalvar var1
0 1 2.5 0.010
1 2 NaN 0.009
2 3 NaN 0.002
Is there a way to call globalvar, e.g. df.globalvar without calling df.globalvar[1]? Is there a more pythonic way to do this?
You can do with stack
df.stack().loc[pd.IndexSlice[:,'globalvar']]
Out[445]:
0 2.5
dtype: float64
Or dropna
df.globalvar.dropna()

Create of multiple subsets from existing pandas dataframe [duplicate]

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).
I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.
I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):
import pandas as pd
def splitframe(data, name='name'):
n = data[name][0]
df = pd.DataFrame(columns=data.columns)
datalist = []
for i in range(len(data)):
if data[name][i] == n:
df = df.append(data.iloc[i])
else:
datalist.append(df)
df = pd.DataFrame(columns=data.columns)
n = data[name][i]
df = df.append(data.iloc[i])
return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Can I ask why not just do it by slicing the data frame. Something like
#create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] *4, 'Ob1' : np.random.rand(16), 'Ob2' : np.random.rand(16)})
#create unique list of names
UniqueNames = data.Names.unique()
#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame() for elem in UniqueNames}
for key in DataFrameDict.keys():
DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
Firstly your approach is inefficient because the appending to the list on a row by basis will be slow as it has to periodically grow the list when there is insufficient space for the new entry, list comprehensions are better in this respect as the size is determined up front and allocated once.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this and if required not drop the column.
Then generate a list of all the unique entries and then you can perform a lookup using these entries and crucially if you only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', axis=1, inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
You can convert groupby object to tuples and then to dict:
df = pd.DataFrame({'Name':list('aabbef'),
'A':[4,5,4,5,5,4],
'B':[7,8,9,4,2,3],
'C':[1,3,5,7,1,0]}, columns = ['Name','A','B','C'])
print (df)
Name A B C
0 a 4 7 1
1 a 5 8 3
2 b 4 9 5
3 b 5 4 7
4 e 5 2 1
5 f 4 3 0
d = dict(tuple(df.groupby('Name')))
print (d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print (d['a'])
Name A B C
0 a 4 7 1
1 a 5 8 3
It is not recommended, but possible create DataFrames by groups:
for i, g in df.groupby('Name'):
globals()['df_' + str(i)] = g
print (df_a)
Name A B C
0 a 4 7 1
1 a 5 8 3
Easy:
[v for k, v in df.groupby('name')]
Groupby can helps you:
grouped = data.groupby(['name'])
Then you can work with each group like with a dataframe for each participant. And DataFrameGroupBy object methods such as (apply, transform, aggregate, head, first, last) return a DataFrame object.
Or you can make list from grouped and get all DataFrame's by index:
l_grouped = list(grouped)
l_grouped[0][1] - DataFrame for first group with first name.
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
The method in the OP works, but isn't efficient. It may have seemed to run forever, because the dataset was long.
Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
.groupby returns a groupby object, that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
The value of each key in df_dict, will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
The original question wanted a list of DataFrames, which can be done with a list-comprehension
df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns # for test dataset
# load data for example
df = sns.load_dataset('planets')
# display(df.head())
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}
print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
# or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
df_dict['df1].head(3) or df_dict['Astrometry'].head(3)
There are only 2 in this group
method number orbital_period mass distance year
113 Astrometry 1 246.36 NaN 20.77 2013
537 Astrometry 1 1016.00 NaN 14.98 2010
df_dict['df2].head(3) or df_dict['Eclipse Timing Variations'].head(3)
method number orbital_period mass distance year
32 Eclipse Timing Variations 1 10220.0 6.05 NaN 2009
37 Eclipse Timing Variations 2 5767.0 NaN 130.72 2008
38 Eclipse Timing Variations 2 3321.0 NaN 130.72 2008
df_dict['df3].head(3) or df_dict['Imaging'].head(3)
method number orbital_period mass distance year
29 Imaging 1 NaN NaN 45.52 2005
30 Imaging 1 NaN NaN 165.00 2007
31 Imaging 1 NaN NaN 140.00 2004
For more information about the seaborn datasets
NASA Exoplanets
Alternatively
This is a manual method to create separate DataFrames using pandas: Boolean Indexing
This is similar to the accepted answer, but .loc is not required.
This is an acceptable method for creating a couple extra DataFrames.
The pythonic way to create multiple objects, is by placing them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x/60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized that's why this is so fast
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
The method based on list comprehension and groupby- Which stores all the split dataframe in list variable and can be accessed using the index.
Example
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pd series using some labels into a list of chunks
For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
here's a small function which might help some (efficiency not perfect probably, but compact + more or less easy to understand):
def get_splited_df_dict(df: 'pd.DataFrame', split_column: 'str'):
"""
splits a pandas.DataFrame on split_column and returns it as a dict
"""
df_dict = {value: df[df[split_column] == value].drop(split_column, axis=1) for value in df[split_column].unique()}
return df_dict
it converts a DataFrame to multiple DataFrames, by selecting each unique value in the given column and putting all those entries into a separate DataFrame.
the .drop(split_column, axis=1) is just for removing the column which was used to split the DataFrame. the removal is not necessary, but can help a little to cut down on memory usage after the operation.
the result of get_splited_df_dict is a dict, meaning one can access each DataFrame like this:
splitted = get_splited_df_dict(some_df, some_column)
# accessing the DataFrame with 'some_column_value'
splitted[some_column_value]
The existing answers cover all good cases and explains fairly well how the groupby object is like a dictionary with keys and values that can be accessed via .groups. Yet more methods to do the same job as the existing answers are:
Create a list by unpacking the groupby object and casting it to a dictionary:
dict([*df.groupby('Name')]) # same as dict(list(df.groupby('Name')))
Create a tuple + dict (this is the same as #jezrael's answer):
dict((*df.groupby('Name'),))
If we only want the DataFrames, we could get the values of the dictionary (created above):
[*dict([*df.groupby('Name')]).values()]
I had similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe in 500 dataframes (10stores*50stores) to apply Machine Learning models to each of them and I couldn't do it manually.
This is the head of the dataframe:
I have created two lists;
one for the names of dataframes
and one for the couple of array [item_number, store_number].
list=[]
for i in range(1,len(items)*len(stores)+1):
global list
list.append('df'+str(i))
list_couple_s_i =[]
for item in items:
for store in stores:
global list_couple_s_i
list_couple_s_i.append([item,store])
And once the two lists are ready you can loop on them to create the dataframes you want:
for name, it_st in zip(list,list_couple_s_i):
globals()[name] = df.where((df['item']==it_st[0]) &
(df['store']==(it_st[1])))
globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
Hope this will be helpful!

Resources