Efficiently create categorical data frame with nulls - python-3.x

I want to create a categorical data frame with nulls and set the categories before expanding the index. The index is very large, and I want to avoid the memory spike, but I cannot seem to do this.
Example:
# memory spike
df = pd.DataFrame(index=list(range(0, 1000)), columns=['a', 'b'])
df.info(memory_usage='deep')
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
a 0 non-null object
b 0 non-null object
dtypes: object(2)
memory usage: 70.3 KB
Convert to Categorical:
for col in df.columns:
    df[col] = df[col].astype('category')
# set categories for columns
df['a'] = df['a'].cat.add_categories(['d', 'e', 'f'])
df['b'] = df['b'].cat.add_categories(['g', 'h', 'i'])
# check memory usage
df.info(memory_usage='deep')
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
a 0 non-null category
b 0 non-null category
dtypes: category(2)
memory usage: 9.9 KB
Is there a way to do this while avoiding the memory spike?

If the data frame is created by the DataFrame constructor, the columns can be initialized as category types.
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
cat_type1 = CategoricalDtype(["d", "e", "f"])
cat_type2 = CategoricalDtype(["g", "h", "i"])
index = pd.Index(range(1000))
df = pd.DataFrame({"a": pd.Series([np.nan] * len(index), dtype=cat_type1, index=index),
"b": pd.Series([np.nan] * len(index), dtype=cat_type2, index=index)},
index=index)
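As a quick sanity check (a sketch assuming the same pandas setup as in the question), the directly constructed frame should already report category dtypes and a small memory footprint, with no intermediate object columns:
# verify the dtypes and memory usage of the directly constructed frame
print(df.dtypes)
df.info(memory_usage='deep')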
An alternative solution is the following.
cols = ["a", "b"]
index = pd.Index(range(1000))
df = pd.DataFrame({k: [np.nan] * len(index) for k in cols}, index=index, dtype="category")
df["a"].cat.set_categories(["d", "e", "f"], inplace=True)
df["b"].cat.set_categories(["g", "h", "i"], inplace=True)
If the data frame is created via methods such as read_csv, the dtype keyword argument can be used to make sure the output columns have the desired data types in the first place, rather than converting after the data frame is created, which consumes more memory.
df = pd.read_csv("file.csv", dtype={"a": cat_type1, "b": cat_type2})
Here, the category values can also be inferred directly from the data by passing dtype={"a": "category"}. Specifying the categories beforehand saves the inference overhead and lets the parser check that the data values match the specified categories. It is also necessary if some category values do not occur in the data.
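For illustration (a sketch with the same hypothetical file.csv), an explicit CategoricalDtype keeps categories even when they never occur in the file, whereas inference only picks up the values that are actually present:
# explicit dtype: the categories are exactly those declared, e.g. 'f' is kept
# even if it never appears in file.csv
df = pd.read_csv("file.csv", dtype={"a": cat_type1, "b": cat_type2})
print(df["a"].cat.categories)
# inferred dtype: the categories are whatever values the column happens to contain
df_inferred = pd.read_csv("file.csv", dtype={"a": "category", "b": "category"})
print(df_inferred["a"].cat.categories)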

Related

Writing a CSV, or reading a CSV changes my pandas data frame from float16 into float64. How can I avoid this?

I have a data frame test_df with the following information
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 4097 entries, index to 4095
dtypes: float16(4096), object(1)
memory usage: 800.9+ KB
Clearly, all the columns are of data type float16 except the first column, and the total size of the data frame is about 800 KB. Now I save this data frame as a CSV file using Watson Studio as follows
# import the lib
from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space()
# save the data frame as csv
wslib.save_data("test_df.csv", test_df.to_csv(index=False, header=False).encode())
Checking the size of the CSV, it is suddenly 1.8 MB. For some reason the size doubled.
Now, when reading in the same CSV again with the following code
import itc_utils.flight_service as itcfs
readClient = itcfs.get_flight_client()
nb_data_request = {
    'data_name': """test_df.csv""",
    'interaction_properties': {
        #'row_limit': 500,
        'infer_schema': 'true',
        'infer_as_varchar': 'false'
    }
}
flightInfo = itcfs.get_flight_info(readClient, nb_data_request=nb_data_request)
test_df = itcfs.read_pandas_and_concat(readClient, flightInfo, timeout=10000)
test_df.index.name = None
# rename first column to 'index'
test_df.rename(columns={'COLUMN1': 'index'}, inplace=True)
# rename the rest of the columns with consecutive integers
new_columns = {}
for i in range(len(test_df.columns) - 1):
    new_columns[test_df.columns[i + 1]] = str(i)
test_df = test_df.rename(columns=new_columns)
And checking info now gives
test_df.info()
Time taken: 0.0468 minutes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 4097 entries, index to 4095
dtypes: float64(4096), object(1)
memory usage: 3.1+ MB
Now the data type is float64 and it's 3.1 MB. How can I avoid this?
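CSV is a plain-text format and stores no dtype metadata, so the float16 information is lost in the round trip. One workaround sketch (not specific to the Watson Studio Flight service, and assuming every numeric column should be float16) is to downcast after reading:
import numpy as np
# downcast all numeric columns back to float16 after the CSV round trip
float_cols = test_df.select_dtypes(include=[np.number]).columns
test_df[float_cols] = test_df[float_cols].astype(np.float16)
test_df.info()
Alternatively, a binary format such as Parquet (DataFrame.to_parquet) preserves dtypes across the round trip.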

How to extract 4dimensional data from a list of pandas dataframes?

I have a list of 500 dataframes (in the form of .csv files); 500 = 20 (time) × 25 (energy) bins. In other words, each dataframe is a measurement of flux at a single time and energy, represented as a 150x150 mesh grid corresponding to the x and y spatial coordinates. However, I would like to transform these data into 4-d coordinates of the form Flux(x, y, t, E), so that I have a new set of dataframes with columns E and rows t for any given (x, y) position.
I am not sure how to approach the problem. I would appreciate your help in giving me some sort of roadmap for doing this procedure.
Note:
The time and energy of each dataframe are encoded in the name of the corresponding .csv file, in the form time-5e+35-energy0.00023-position.csv, where t = -5 × 10^35 and E = 0.00023.
What I know:
The 500 dataframes (20 t × 25 E) must be converted to 22,500 dataframes, one per (x, y) coordinate of the 150×150 grid. However, this is very time-consuming and I am not sure if there is any other package in python3 that can do the job more easily.
Here is code that combines your files into one big Pandas dataframe with 11,250,000 rows (25 × 20 × 150 × 150):
import pandas as pd
from glob import glob
import re
from datetime import datetime
pattern_file_name = re.compile(r'time-(.*)-energy(.*)-position.csv')
start_time = datetime.now()
result_df = None
for file_name in glob('time-*.csv'):
    # extract time and energy values from the file name
    if not pattern_file_name.match(file_name):
        raise ValueError(f'file name {file_name} failed pattern match.')
    time_s, energy_s = pattern_file_name.findall(file_name)[0]
    time, energy = float(time_s), float(energy_s)
    print(f'Processing | {time_s} | {energy_s} |...')
    df = pd.read_csv(file_name, header=None)
    # assuming the CSV (i) has no headers, (ii) is an array of 150x150 floats
    # with no missing or problematic values, and (iii) each row represents a
    # fixed y-coordinate; adjust to your needs
    df.index.name = 'y'
    df = df.stack()
    df.index.rename('x', level=-1, inplace=True)
    df = df.swaplevel().sort_index().reset_index().rename(columns={0: 'flux'})
    # df is now (x, y, flux); x and y each vary from 0 to 149
    df.insert(0, 't', time)
    df.insert(0, 'E', energy)
    result_df = df if result_df is None else pd.concat([result_df, df])
result_df = result_df.set_index(['E', 't', 'x', 'y']).sort_index()
# result_df is now (E, t, x, y) -> flux
result_df.to_csv('output.csv', index=True)
final_time = datetime.now()
delta_time = final_time - start_time
print(f'Completed in {delta_time}')
The main steps are as follows:
Loop over file names
Extract t and E values from file name
Read square matrix of flux values from file
Transform 150 × 150 square matrix to Pandas dataframe of length 22,500
Add columns to keep track of E and t
Append local result to a global, ever-increasing result vector
Finally, leave the loop and save results to disk as CSV
The resulting CSV file will have 5 columns. The first four represent (E, t, x, y) and the last column is the value of the flux field at those coordinates.
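To recover the per-position table the question asks for (a sketch; x0 and y0 are hypothetical coordinates), the combined frame can be sliced on (x, y) and pivoted so that rows are t and columns are E:
# pick one spatial position and build its t-by-E flux table
x0, y0 = 10, 20
sub = result_df.xs((x0, y0), level=['x', 'y'])
flux_table = sub.reset_index().pivot(index='t', columns='E', values='flux')
print(flux_table.shape)  # expected (20, 25): 20 time bins by 25 energy bins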

Pandas copy slice warning appears to be inconsistent?

I know there are a million posts about the Pandas DataFrame copy slice warning, and I have researched this... but I still don't understand why the warning is NOT raised by Line 10 below but IS raised by Line 15. Using Python 3.8.3 & pandas 1.0.5.
import pandas as pd
#### Example DataFrame
myid = [1, 1, 1, 2, 2]
myorder = [3, 2, 1, 2, 1]
y = [3642, 3640, 3632, 3628, 3608]
x = [11811, 11812, 11807, 11795, 11795]
df = pd.DataFrame(list(zip(myid, myorder, x, y)),
                  columns=['myid', 'myorder', 'x', 'y'])
df.sort_values(by=['myid', 'myorder'], inplace=True) ## LINE 10
df.reset_index(drop=True, inplace=True)
idval = 2
tempdf = df[df.myid == idval]
tempdf.sort_values(by=['myid', 'myorder'], inplace=True) ## LINE 15
tempdf.reset_index(drop=True, inplace=True)
This line:
tempdf = df[df.myid == idval]
is creating a view called tempdf on df; the underlying data has not been copied. Think of tempdf as a pre-recorded filter applied to df, so further changes will only be applied to those rows that meet the filter.
This means that if you update tempdf, you will be updating df - hence the warning.
To avoid the warning you would need to do the following, which forces df and tempdf to use differing underlying data structures:
tempdf = df[df.myid == idval].copy()
Now changes to tempdf will have no impact on df, so your warning goes away.
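Putting it together, a minimal sketch of the corrected flow (same data as above) that raises no warning and leaves df untouched:
tempdf = df[df.myid == idval].copy()
tempdf.sort_values(by=['myid', 'myorder'], inplace=True)  # no SettingWithCopyWarning
tempdf.reset_index(drop=True, inplace=True)
print(len(df), len(tempdf))  # df still has all 5 rows, tempdf only the filtered ones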

How to encode multiple categorical columns for test data efficiently?

I have multiple categorical columns (nearly 50). I am using custom-made frequency encoding on the training data and saving the result as a nested dictionary. For the test data I am using a map function to encode, and unseen labels are replaced with 0. But I need a more efficient way.
I have already tried the pandas replace method, but it doesn't handle unseen labels and leaves them as they are. I am also concerned about time: I want, say, 80 columns and 1 row to be encoded within 60 ms. I just need the most efficient way to do it. I have taken my example from here.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
                       'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                       'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                                    'New_York']})
My dict looks something like this:
enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
       'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
       'location': {'New_York': 0, 'San_Diego': 1}}
for col in enc:
    if col in input_df.columns:
        input_df[col] = input_df[col].map(enc[col]).fillna(0)
Further, I want multiple columns to be encoded at once; I don't want a loop over every column. I guess we can't do that with map. Hence replace would be a good choice, but as said, it doesn't handle unseen labels.
EDIT:
This is the code I am using for now. Please note there is only 1 row in the test data frame (not quite sure whether I should handle it as a numpy array to reduce time). I need to decrease this time to under 60 ms. Further, I have a dictionary only for mapping (can't use one-hot because of the use case). Current time = 331.74 ms. Any idea how to do it more efficiently? Not sure that multiprocessing will work. Further, with the replace method I have run into several issues: 1. It does not handle unseen labels and leaves them as they are (for strings this is an issue). 2. It has problems with overlapping of keys and values.
from string import ascii_lowercase
import itertools
import pandas as pd
import numpy as np
import time
def iter_all_strings():
    for size in itertools.count(1):
        for s in itertools.product(ascii_lowercase, repeat=size):
            yield "".join(s)

l = []
for s in iter_all_strings():
    l.append(s)
    if s == 'gr':
        break
columns = l

df = pd.DataFrame(columns=columns)
for col in df.columns:
    df[col] = np.random.randint(1, 4000, 3000)

transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
print(f"The length of the dictionary is {len(transform_dict)}")

# Creating another test data frame
df2 = pd.DataFrame(columns=columns)
for col in df2.columns:
    df2[col] = np.random.randint(1, 4000, 1)
print(f"The shape of the 2nd data frame is {df2.shape}")

t1 = time.time()
for col in df2.columns:
    df2[col] = df2[col].map(transform_dict[col]).fillna(0)
print(f"Time taken is {time.time() - t1}")
# print(df)
Firstly, when you want to encode categorical variables that are not ordinal (meaning there is no inherent ordering between the values of the variable/column, e.g. cat, dog), you should use one-hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
                   'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                   'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                                'New_York']})
enc = [['cat', 'dog', 'monkey'],
       ['Brick', 'Champ', 'Ron', 'Veronica'],
       ['New_York', 'San_Diego']]
ohe = OneHotEncoder(categories=enc, handle_unknown='ignore', sparse=False)
Here, I have modified your enc in a way that can be fed into the OneHotEncoder.
Now comes the question of how we are going to handle the unseen labels.
With handle_unknown='ignore', unseen values get zeros in all the dummy variables, which in a way helps the model understand that it is an unknown value.
# fit first so that ohe.categories_ is available for building the column names
transformed = ohe.fit_transform(df)
colnames = ['{}_{}'.format(col, val)
            for col, unique_values in zip(df.columns, ohe.categories_)
            for val in unique_values]
pd.DataFrame(transformed, columns=colnames)
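As a quick check of the unseen-label behaviour (a sketch with hypothetical values that were not in the training data), transforming such a row yields all-zero dummy columns:
# 'hamster', 'Alice' and 'Boston' were never seen during fit, so every dummy is 0
new_row = pd.DataFrame({'pets': ['hamster'], 'owner': ['Alice'], 'location': ['Boston']})
pd.DataFrame(ohe.transform(new_row), columns=colnames)
(Note: in newer scikit-learn releases the sparse= argument of OneHotEncoder is called sparse_output=.)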
Update:
If you are fine with ordinal encoding, the following change could help.
df2.apply(lambda row: [transform_dict[col].get(val, 0)
                       for col, val in row.items()],
          axis=1,
          result_type='expand')
#1000 loops, best of 3: 1.17 ms per loop

How to infer a schema for a pyspark dataframe?

There are many question on this site regarding how to convert a pyspark rdd to a dataframe. But none of them answer the question of how to convert a SQL table style rdd to a dataframe while preserving type.
I have an rdd that is exactly a list of dicts in python:
>>> rdd.take(1)
[{'se_error': 0, 'se_subjective_count': 0, 'se_word_count': 10, 'se_entity_summary_topic_phrases': {}, 'se_entity_hits': 1, 'se_entity_summary': 'rt #mercuryinrx: disgusting. cut it out FOCALENTITY twitter.com/anons4cetacean', 'se_query_with_hits': 0, 'id': 180034992495.0, 'se_objective_count': 2, 'se_category': {}, 'se_sentence_count': 2, 'se_entity_sentiment': 0.0, 'se_document_sentiment': -0.49000000953674316, 'se_entity_themes': {}, 'se_query_hits': 0, 'se_named_entities': {}}]
>>> rdd.take(1)[0].keys()
dict_keys(['se_error', 'se_subjective_count', 'se_word_count', 'se_entity_summary_topic_phrases', 'se_entity_hits', 'se_entity_summary', 'se_query_with_hits', 'id', 'se_objective_count', 'se_category', 'se_sentence_count', 'se_entity_sentiment', 'se_document_sentiment', 'se_entity_themes', 'se_query_hits', 'se_named_entities'])
All rows have the same columns. All columns have the same datatype. This is trivial to turn into a dataframe in pandas.
out = rdd.take(rdd.count())
outdf = pd.DataFrame(out)
This of course defeats the purpose of using spark! I can demonstrate that the columns are all the same datatype as well.
>>> typemap = [{key: type(val) for key, val in row.items()} for row in out]
>>> typedf = pd.DataFrame(typemap)
>>> for col in list(typedf):
...     print(typedf[col].value_counts())
<class 'float'> 1016
Name: id, dtype: int64
<class 'dict'> 1010
Name: se_category, dtype: int64
<class 'float'> 1010
Name: se_document_sentiment, dtype: int64
<class 'int'> 1010
Name: se_entity_hits, dtype: int64
...
It goes on further, but they are all one type, or else they are None.
How do I do this in spark? Here are some tries that don't work:
>>> outputDf = rdd.toDF()
...
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
>>> outputDf = rdd.toDF(sampleRatio=0.1)
...
File "/usr/hdp/current/spark-client/python/pyspark/sql/types.py", line 905, in <lambda>
return lambda row: dict((kconv(k), vconv(v)) for k, v in row.items())
AttributeError: 'NoneType' object has no attribute 'items'
What is the issue here? Why is it so hard to figure out the datatype in a column that only has one python datatype?
The solution here is in these lines:
<class 'float'> 1016
Name: id, dtype: int64
<class 'dict'> 1010
Name: se_category, dtype: int64
There are 1016 rows in total in this rdd, but in 6 of those rows the column se_category is absent. That is why you only see 1010 dict objects. This is no problem for pandas, which simply infers the type from the rest of the column and fills in the blanks with a missing-value marker (NaN).
Spark doesn't do that. If you think about it from the perspective of Java, which is the language underlying the rdd objects, this makes complete sense. Since I have been programming mostly python, a dynamically-typed language, for some time, it didn't occur to me immediately that this was a problem. But in a statically-typed language, it would be expected that something has a defined type at compile time.
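A quick way to confirm this from the collected sample (a sketch reusing the out list of dicts gathered above with rdd.take) is to count how often each key actually appears:
from collections import Counter
# keys whose count is lower than the number of rows are missing in some rows,
# e.g. se_category should show up fewer times than id here
key_counts = Counter(k for row in out for k in row)
print(len(out), key_counts)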
The solution is to 'declare' each row returned to the rdd as a set of objects with types, thus imitating static typing. So I declare
{"int_field": 0, "list_field": [], "float_field": 0.0, "string_field": ""}
before I fill in any of the values. This way, if a value is not updated by the function that generates the rdd, the row still has all the correct types in place, and
from pyspark.sql import Row
outputDf = rdd.map(lambda x: Row(**x)).toDF()
successfully converts this rdd to a dataframe.
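An alternative sketch (assuming a SparkSession named spark, and guessing reasonable types for a few of the fields) is to declare the schema explicitly and build tuples in schema order, so that missing keys simply become nulls instead of breaking type inference:
from pyspark.sql.types import StructType, StructField, FloatType, LongType, MapType, StringType

# only a few of the 16 fields are shown for illustration; the value type of
# se_category is an assumption, since the sample rows only contain empty dicts
schema = StructType([
    StructField("id", FloatType(), True),
    StructField("se_document_sentiment", FloatType(), True),
    StructField("se_entity_hits", LongType(), True),
    StructField("se_category", MapType(StringType(), FloatType()), True),
])
# build tuples in schema order; dict.get returns None for missing keys, which
# becomes a typed null in the resulting dataframe
rows = rdd.map(lambda d: (d.get("id"), d.get("se_document_sentiment"),
                          d.get("se_entity_hits"), d.get("se_category")))
outputDf = spark.createDataFrame(rows, schema=schema)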
