Python-Hypothesis: specifying and managing NaN values

I'm trying to use Hypothesis to generate a set of dataframes that I'll merge together. I want each individual column to be allowed to have NaN values, and I want to allow Hypothesis to generate some wacky examples.
But I mostly want to focus on examples where there is at least one row in each dataframe with actual values - and in particular, I'd like to be able to generate dataframes with some information shared between corresponding columns, such that a merged dataframe is not empty. (E.g. I want some values from 'store' in store.csv to overlap with values from 'store' in train.csv.)
I have some example code here that generates NaN values and wacky examples all over the place, but most of the generated examples contain very few non-NaN values. (A dataframe strategy starts on line 57.)
Any suggestions for how to create slightly more 'realistic' examples? Thanks!

Your solution looks fine to me, but here are two more tactics that might help:
Use the fill=st.nothing() argument to columns and series to disable the filling behaviour. This makes the entries dense instead of sparse(ish), so there's a substantial runtime cost but a noticeable increase in example density. Alternatively, fill=st.floats(allow_nan=False) might be cheaper and still work!
Use a .filter(...) on the strategy to reject dataframes without any NaN-free rows. A typical rule of thumb is to avoid .filter when it would reject more than half the examples, and to look for an alternative when it rejects more than a tenth... but this could easily be combined with the first point.
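For concreteness, here's a minimal sketch of both tactics combined, assuming hypothetical column names ('store', 'sales') and element strategies; adapt them to your real schema:

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

dense_frames = data_frames(
    columns=[
        # fill=st.nothing() disables the sparse fill value, so every entry
        # is drawn from `elements` (slower, but much denser examples)
        column('store', elements=st.integers(min_value=1, max_value=10),
               fill=st.nothing()),
        column('sales', elements=st.floats(allow_nan=True, allow_infinity=False),
               fill=st.nothing()),
    ],
    index=range_indexes(min_size=1, max_size=20),
).filter(
    # reject any dataframe that has no completely NaN-free row
    lambda df: len(df.dropna()) > 0
)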

Answering my own question, but I'd love to hear other answers.
I ended up doing two things:
1) Requiring that the end user not give garbage files. (Just because we have a magical property-generation framework doesn't absolve us of the responsibility of having common sense, which I forgot.)
2) Testing for things that are reasonable accidents but not absolute garbage, by requiring that each dataframe have at least one row with no NaNs. With that requirement, I generate the non-NaN dataframe, and then add some NaNs afterward.
From there, ipython and .example() make it easy to see what's going on.
Example code below (google_files and google_weeks are custom strategies previously created)
# Create dataframes from the strategies above
# We'll create dataframes with all non-NaN values, then add NaNs to rows
# after the fact
df = draw(data_frames([
    column('file', elements=google_files),
    column('week', elements=google_weeks),
    column('trend', elements=integers(min_value=0, max_value=100))],
    index=range_indexes(min_size=1, max_size=100)))

# Add the NaNs
# With other dataframes, this ended up getting written into a function
rows = len(df)
df.loc[rows + 1] = [np.nan, '2014-01-05 - 2014-01-11', 42]
df.loc[rows + 2] = ['DE_BE', np.nan, 42]
df.loc[rows + 3] = ['DE_BE', '2014-01-05 - 2014-01-11', np.nan]
df.loc[rows + 4] = [np.nan, np.nan, np.nan]
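For reference, here's a minimal sketch of how the snippet above might be wrapped into a reusable strategy so that .example() works in ipython; google_files and google_weeks below are hypothetical stand-ins for the custom strategies mentioned earlier:

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

google_files = st.sampled_from(['DE_BE', 'DE_AG'])            # placeholder strategy
google_weeks = st.sampled_from(['2014-01-05 - 2014-01-11'])   # placeholder strategy

@st.composite
def trend_frames(draw):
    df = draw(data_frames([
        column('file', elements=google_files),
        column('week', elements=google_weeks),
        column('trend', elements=st.integers(min_value=0, max_value=100))],
        index=range_indexes(min_size=1, max_size=100)))
    # ... append the NaN rows exactly as in the snippet above ...
    return df

trend_frames().example()   # prints a concrete dataframe to inspect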

Related

Pandas DataFrame indexing problem from 40,000 to 49,999

I have a strange problem with my code (At least it is strange for me!).
I have a Pandas DataFrame called "A". One of the column names is "asin". I want to select all rows that contain a specific value in that column, so I write this simple code:
df2 = A[A['asin']=='B0000UYZG0']
And it works as expected, except for rows from 40,000 to 49,999!!
It doesn't work on those rows at all!
For example, df2 = A[A['asin']=='0077614992'] (row 50,000) works, but df2 = A[A['asin']=='B0006O0WW6'] (row 49,999) does not.
I haven't tried all 10,000 rows, but the ones I tested at random all fail.
I've grown accustomed to fixing bugs like this one; usually it happens because of an unexpected dtype, or because the string you see displayed isn't actually THE string stored in the column. It seems your issue is mostly the second case.
So let's first strip any whitespace from your "string" column:
df2['asin'] = df2.asin.str.strip()
# going with the assumption that df2 is your non-functional df object
After that, try rerunning your filter:
df2[df2['asin'].eq('0077614992')]
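If stripping doesn't fix it, a quick diagnostic along these lines (a toy sketch with made-up data, assuming A is the original dataframe) can reveal hidden whitespace or dtype surprises:

import pandas as pd

A = pd.DataFrame({'asin': ['B0006O0WW6 ', '0077614992']})  # note the trailing space
print(A['asin'].dtype)               # object
print(A['asin'].str.len().tolist())  # [11, 10] -> hidden whitespace revealed
A['asin'] = A['asin'].str.strip()
print(A[A['asin'] == 'B0006O0WW6'])  # now the match works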

When I combine two pandas columns with zip into a dict it reduces my samples

I have two columns in pandas: df.lat and df.lon.
Both have a length of 3897 and 556 NaN values.
My goal is to combine both columns and make a dict out of them.
I use the code:
dict(zip(df.lat, df.lon))
This creates a dict, but with one element less than my original columns.
I used len() to confirm this. I cannot figure out why the dict has one element less than my columns when both columns have the same length.
Another problem is that the dict contains only the raw values, not the keys "lat" and "lon".
Maybe someone here has an idea?
You may have a different length if there are repeated values in df.lat: you can't have duplicate keys in a dictionary, so those entries get dropped.
A more flexible approach may be to use the native df.to_dict() method in pandas. In this example the orientation you want is probably 'records'. Full code:
df[['lat', 'lon']].to_dict('records')
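Here's a small demonstration of both points with made-up coordinates:

import pandas as pd

df = pd.DataFrame({'lat': [52.5, 52.5, 48.1], 'lon': [13.4, 13.5, 11.6]})

d = dict(zip(df.lat, df.lon))
print(len(df), len(d))  # 3 2 -> the duplicate lat 52.5 keeps only its last lon
print(d)                # {52.5: 13.5, 48.1: 11.6}

# to_dict('records') keeps every row and labels the values with the column names
print(df[['lat', 'lon']].to_dict('records'))
# [{'lat': 52.5, 'lon': 13.4}, {'lat': 52.5, 'lon': 13.5}, {'lat': 48.1, 'lon': 11.6}]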

How to check where rows in a dataframe are missing?

I'm working on a rather lengthy data manipulation project, and one of the datasets that I merged into my dataframe is 96 rows shorter (6963 vs 6867), and I cannot for the life of me figure out where.
Checking .head() and .tail() shows that all the datasets in the dataframe start and end on the same date, but I have reason to think it's interfering with my other analysis, for obvious reasons.
Checking crypto_df[['doge_close', 'doge_open']] shows they are the correct length of 6963 (the total number of dates in the dataframe), but I cannot find where in the dataset the rows are missing.
Is there a function to check this? I don't think I'm being specific enough on Google, if there is.
If you are looking for rows that have NaN, do this:
crypto_df[crypto_df['doge_close'].isnull()]
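A small self-contained sketch of the same idea, with made-up data, shows how the filter pinpoints exactly which dates are missing values:

import numpy as np
import pandas as pd

crypto_df = pd.DataFrame(
    {'doge_close': [0.05, np.nan, 0.07], 'doge_open': [0.04, 0.06, 0.07]},
    index=pd.date_range('2021-01-01', periods=3),
)
missing = crypto_df[crypto_df['doge_close'].isnull()]
print(missing.index)  # DatetimeIndex(['2021-01-02']) -> the row with a missing close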

Pandas read_table with duplicate names

When reading a table while specifying duplicate column names - let's say two different names - pandas 0.16.1 will copy the last two columns of the data over and over again.
In [1]:
df = pd.read_table('Datasets/tbl.csv', header=0, names=['one','two','one','two','one'])
df
tbl.csv contains a table with 5 distinct columns; instead of showing all of them, the last two are repeated.
Out[1]:
one two one two one
0 0.132846 0.120522 0.132846 0.120522 0.132846
1 -0.059710 -0.151850 -0.059710 -0.151850 -0.059710
2 0.003686 0.011072 0.003686 0.011072 0.003686
3 -0.220749 -0.029358 -0.220749 -0.029358 -0.220749
The actual table has different values in every column. Here, the same two columns (corresponding to the two last ones in the file) are repeated. No error or warning is given.
Do you think this is a bug or is it intended? I find it very dangerous to silently change an input like that. Or is it my ignorance?
Using duplicate values in indexes is inherently problematic.
They lead to ambiguity: code that you think works fine can suddenly fail on DataFrames with non-unique indexes. argmax, for instance, can lead to a similar pitfall when DataFrames have duplicates in the index.
It's best to avoid putting duplicate values in (row or column) indexes if you can. If you need to use a non-unique index, use it with care, and double-check the effect duplicate values have on the behavior of your code.
In this case, you could use
df = pd.read_csv('Datasets/tbl.csv', header=0)
df.columns = ['one','two','one','two','one']
instead.
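As a small illustration of the ambiguity, with made-up data: once column labels are duplicated, selecting 'one' returns a three-column DataFrame rather than a Series, which easily breaks downstream code:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['one', 'two', 'one', 'two', 'one'])
print(df['one'])        # a DataFrame containing all three 'one' columns
print(df['one'].shape)  # (1, 3)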

Stata tab over entire dataset

In Stata is there any way to tabulate over the entire data set as opposed to just over one variable/column? This would give you the tabulation over all the columns.
Related - is there a way to find particular values in Stata if one does not know which column they occur in? The output would be which column and row they are located in or at least which column.
Stata does not use row and column terminology except with reference to matrices and vectors. It uses the terminology of observations and variables.
You could stack or reshape the entire dataset into one variable if and only if all variables are numeric or all are string. If that assumption is incorrect, then you would need to convert numeric variables to string, at least temporarily, before you could do that. I guess wildly that you are only interested in blocks of variables that are all either numeric or string.
When you say "tabulate" you may mean the tabulate command. That command has limits on the number of rows and/or columns it can show, which might bite here, but with a small amount of work list could be used for a simple table with many more values.
tabm from tab_chi on SSC may be what you seek.
For searching across several variables, you could automate a loop.
I'd say that if this is a felt need, it is quite probable that you have the wrong data structure for at least some of what you want to do and should reshape. But further details might explode that.
