Getting all acceptable string arguments to DataFrameGroupby.aggregate - python-3.x

So, I have a piece of code that takes a groupby object and a dictionary mapping columns in the groupby to strings, indicating aggregation types. I want to validate that all the values in the dictionary are strings that pandas accepts in its aggregation. However, I don't want to use a try/except (which, without a loop, will only catch a single problem value). How do I do this?
I've already tried importing the SelectionMixin from pandas.core.generic and checking against the values in SelectionMixin._cython_table, but this clearly isn't an exhaustive list. My version of pandas is 0.20.3.
Here's an example of how I want to use this:
class SomeModule:
    ALLOWED_AGGREGATIONS = ...  # this is where I would save the collection of allowed values

    @classmethod
    def aggregate(cls, df, groupby_cols, aggregation_dict):
        disallowed_aggregations = list(
            set(aggregation_dict.values()) - set(cls.ALLOWED_AGGREGATIONS)
        )
        if len(disallowed_aggregations):
            val_str = ', '.join(disallowed_aggregations)
            raise ValueError(
                f'Unallowed aggregations found: {val_str}'
            )
        return df.groupby(groupby_cols).agg(aggregation_dict)


How can I convert from SQLite3 format to dictionary

How can I convert my SQLite3 table to a Python dictionary, where the column names and values of the table become the keys and values of the dictionary?
I have made a package to solve this issue, in case anyone else runs into this problem:
aiosqlitedict
Here is what it can do:
Easy conversion between an SQLite table and a Python dictionary, and vice versa.
Get the values of a certain column as a Python list.
Order your list ascending or descending.
Insert any number of columns into your dict.
Getting Started
We start by connecting to our database along with the reference column:
from aiosqlitedict.database import Connect
countriesDB = Connect("database.db", "user_id")
Make a dictionary
The dictionary should be created inside an async function.
async def some_func():
    countries_data = await countriesDB.to_dict("my_table_name", 123, "col1_name", "col2_name", ...)
You can insert any number of columns, or you can get all of them by specifying the column name as '*':
    countries_data = await countriesDB.to_dict("my_table_name", 123, "*")
So now you have made some changes to your dictionary and want to export it to SQL format again?
Convert dict to sqlite table
async def some_func():
    ...
    await countriesDB.to_sql("my_table_name", 123, countries_data)
But what if you want a list of values for a specific column?
Select method
You can get a list of all values of a certain column.
country_names = await countriesDB.select("my_table_name", "col1_name")
To limit your selection, use the limit parameter.
country_names = await countriesDB.select("my_table_name", "col1_name", limit=10)
You can also arrange your list by using the ascending parameter and/or the order_by parameter, specifying a certain column to order your list by.
country_names = await countriesDB.select("my_table_name", "col1_name", order_by="col2_name", ascending=False)
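For comparison, here is a minimal sketch of the same table-to-dictionary idea using only the standard-library sqlite3 module rather than aiosqlitedict. The table name, the column names and the helpers row_to_dict and table_to_dicts are made up for illustration; setting row_factory to sqlite3.Row is what lets each row be turned into a dict keyed by column name.

```python
import sqlite3

# In-memory database with a made-up table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute(
    "CREATE TABLE countries (user_id INTEGER PRIMARY KEY, name TEXT, capital TEXT)"
)
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [(123, "France", "Paris"), (456, "Japan", "Tokyo")],
)


def row_to_dict(connection, user_id):
    """Return the row for user_id as {column_name: value}, or None if absent."""
    row = connection.execute(
        "SELECT * FROM countries WHERE user_id = ?", (user_id,)
    ).fetchone()
    return dict(row) if row is not None else None


def table_to_dicts(connection):
    """Return every row of the table as a list of dictionaries."""
    cursor = connection.execute("SELECT * FROM countries ORDER BY user_id")
    return [dict(r) for r in cursor]


record = row_to_dict(conn, 123)
all_records = table_to_dicts(conn)
print(record)
```

This version is synchronous; if the rest of your code is async, the package above or an async driver is probably the better fit.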

Python3 - Return CSV with row-level errors for missing data

New to Python. I'm importing a CSV, and if any data is missing I need to return a CSV with an additional column indicating which rows are missing data. A colleague suggested that I import the CSV into a dataframe, then create a new dataframe with a "Comments" column, fill it with a comment on the intended rows, and append it to the original dataframe. I'm stuck at the step of filling my new dataframe, "dferr", with the correct number of rows to match up with "dfinput".
Have Googled, "pandas csv return error column where data is missing", but haven't found anything related to creating a new CSV that marks bad rows. I don't even know if the proposed way is the best way to go about this.
import pandas as pd
dfinput = None
try:
    dfinput = pd.read_csv(r"C:\file.csv")
except:
    print("Uh oh!")
if dfinput is None:
    print("Ack!")
    quit(10)
dfinput.reset_index(level=None, drop=False, inplace=True, col_level=0,
                    col_fill='')
dferr = pd.DataFrame(columns=['comment'])
print("Empty DataFrame", dferr, sep='\n')
Expected results: "dferr" would have an index column with number of rows equal to "dfinput", and comments on the correct rows where "dfinput" has missing values.
Actual results: "dferr" is empty.
My understanding of 'missing data' here would be null values. It seems that for every row, you want the names of null fields.
df = pd.DataFrame([[1, 2, 3],
                   [4, None, 6],
                   [None, 8, None]],
                  columns=['foo', 'bar', 'baz'])
# Create a dataframe of True/False, True where a criterion is met
# (in this case, a null value)
nulls = df.isnull()
# Iterate through every row of *nulls*,
# and extract the column names where the value is True by boolean indexing
colnames = nulls.columns
null_labels = nulls.apply(lambda s:colnames[s], axis=1)
# Now you have a pd.Series where every entry is an array
# (technically, a pd.Index object)
# Pandas arrays have a vectorized .str.join method:
df['nullcols'] = null_labels.str.join(', ')
The .apply() method in pandas can sometimes be a bottleneck in your code; there are ways to avoid using this, but here it seemed to be the simplest solution I could think of.
EDIT: Here's an alternate one-liner (instead of using .apply) that might cut down computation time slightly:
import numpy as np
df['nullcols'] = [', '.join(colnames[x]) for x in nulls.values]
This might be even faster (a bit more work is required):
np.where(df.isnull(),df.columns,'')
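One way the remaining work for the np.where variant might look is sketched below. It reuses the example frame from above; the intermediate name labels is my own, and the join step is just one possible way to collapse each row into a single comment string.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, None, 6], [None, 8, None]],
    columns=["foo", "bar", "baz"],
)

# 2-D array holding the column name where a value is null and '' elsewhere.
labels = np.where(df.isnull().to_numpy(), df.columns.to_numpy(), "")

# The "bit more work": drop the empty strings and join what is left, row by row.
df["nullcols"] = [", ".join(name for name in row if name) for row in labels]

print(df)
```

Rows without missing data end up with an empty string in nullcols, and the frame can then be written back out with df.to_csv.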

Pyspark applying foreach

I'm new to PySpark and I want to play a bit with a couple of functions to better understand how I could use them in more realistic scenarios. For a while I have been trying to apply a specific function to each number in an RDD. My problem is that when I try to print what I got from my RDD, the result is None.
My code:
from pyspark import SparkConf , SparkContext
conf = SparkConf().setAppName('test')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
changed = []
def div_two (n):
    opera = n / 2
    return opera
numbers = [8,40,20,30,60,90]
numbersRDD = sc.parallelize(numbers)
changed.append(numbersRDD.foreach(lambda x: div_two(x)))
#result = numbersRDD.map(lambda x: div_two(x))
for i in changed:
    print(i)
I would appreciate a clear explanation of why None ends up in the list, and of the right approach to achieve this using foreach, if that is possible.
Thanks
Your definition of div_two seems fine, though it can be reduced to
def div_two (n):
    return n/2
And you have converted the list of integers to an RDD, which is good too.
The main issue is that you are trying to add the result of the foreach call to the changed list. But if you look at the definition of foreach
def foreach(self, f) Inferred type: (self: RDD, f: Any) -> None
it says that the return type is None, and that is what gets printed.
You don't need a list variable to print the changed elements of an RDD. You can simply write a printing function and call it from foreach:
def printing(x):
    print(x)
numbersRDD.map(div_two).foreach(printing)
You should get the results printed.
You can still add the RDD to a list variable, but an RDD is itself a distributed collection and a list is a collection too. So if you add an RDD to a list you will have a collection of collections, which means you need two loops:
changed.append(numbersRDD.map(div_two))
def printing(x):
    print(x)
for i in changed:
    i.foreach(printing)
The main difference between your code and mine is that I used map (a transformation) instead of foreach (an action) when adding the RDD to the changed variable, and I used two loops to print the elements of the RDD.

Spark: Join within UDF or map function

I have to write a complex UDF, in which I have to do a join with a different table, and return the number of matches. The actual use case is much more complex, but I've simplified the case here to minimum reproducible code. Here is the UDF code.
def predict_id(date,zip):
    filtered_ids = contest_savm.where((F.col('postal_code')==zip) & (F.col('start_date')>=date))
    return filtered_ids.count()
When I define the UDF using the below code, I get a long list of console errors:
predict_id_udf = F.udf(predict_id,types.IntegerType())
The final line of the error is:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I want to know what is the best way to go about it. I also tried map like this:
result_rdd = df.select("party_id").rdd\
.map(lambda x: predict_id(x[0],x[1]))\
.distinct()
It also resulted in a similar final error. I want to know whether there is any way to do a join within a UDF or map function, for each row of the original dataframe.
I have to write a complex UDF, in which I have to do a join with a different table, and return the number of matches.
It is not possible by design. If you want to achieve an effect like this, you have to use high-level DataFrame / RDD operators:
df.join(contest_savm,
    (F.col('postal_code')==df["zip"]) & (F.col('start_date') >= df["date"])
).groupBy(*df.columns).count()

How to handle min_itemsize exception in writing to pandas HDFStore

I am using pandas HDFStore to store dfs which I have created from data.
store = pd.HDFStore(storeName, ...)
for file in downloaded_files:
    try:
        with gzip.open(file) as f:
            data = json.loads(f.read())
            df = json_normalize(data)
            store.append(storekey, df, format='table', append=True)
    except TypeError:
        pass
        #File Error
I have received the error:
ValueError: Trying to store a string with len [82] in [values_block_2] column but
this column has a limit of [72]!
Consider using min_itemsize to preset the sizes on these columns
I found that it is possible to set min_itemsize for the column involved, but this is not a viable solution, as I know neither the maximum length I will encounter nor all the columns in which the problem will occur.
Is there a way to automatically catch this exception and handle it each time it occurs?
I think you can do it this way:
store.append(storekey, df, format='table', append=True, min_itemsize={'Long_string_column': 200})
Basically it's very similar to the following CREATE TABLE SQL statement:
create table df(
    id int,
    str varchar(200)
);
where 200 is the maximum allowed length for the str column.
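Since the column names and lengths are not known up front, one option might be to measure the string columns of the first frame and build the min_itemsize dict from that. The helper string_column_widths and its headroom factor below are hypothetical, not part of pandas, and this is only a sketch: the widths are fixed when the table is first created, so a later string longer than the reserved width would still raise the same ValueError.

```python
import pandas as pd


def string_column_widths(df, headroom=2.0):
    """Map each string column to a byte width to reserve, padded by `headroom`."""
    widths = {}
    for col in df.columns:
        if not pd.api.types.is_string_dtype(df[col]):
            continue
        # Measure the UTF-8 encoded length, since non-ASCII text takes more
        # bytes than characters.
        longest = df[col].dropna().astype(str).str.encode("utf-8").str.len().max()
        if pd.notna(longest):
            widths[col] = int(longest * headroom)
    return widths


df = pd.DataFrame({"id": [1, 2], "comment": ["short", "a much longer string"]})
min_itemsize = string_column_widths(df)
print(min_itemsize)

# The dict would then be passed on the first append that creates the table:
# store.append(storekey, df, format='table', min_itemsize=min_itemsize)
```

Picking a generous headroom trades some disk space for fewer failures on later files.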
The following links might be very helpful:
https://www.google.com/search?q=pandas+ValueError%3A+Trying+to+store+a+string+with+len+in+column+but+min_itemsize&pws=0&gl=us&gws_rd=cr
HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex
