I've read about coercing errors and eyeballing data like this, but I was wondering if there's a better way to automate the cleanup:
I have a regular dataset being outputted by a system, then manually modified by some folks. Unfortunately, the modifications tend to be inconsistent with the original system outputs, so I end up having to manually standardize the formatting before loading it into a pandas dataframe. Is there a smarter way to do this -- i.e., something I'm not aware of where pandas or some other function would be able to clean this for me?
Sample dataframe column with its messy, mixed formats:
START_DTTIME
-----
2021-11-01 17:10:00
2021-11-01 17:12:00
2021-11-01 17:15:00
11/3/2021
11/4/21
UNKNOWN
UNK
2021-10-04 14:01:20
10-5-21
10-30-2021
???
2021-10-05 14:03:21
The dataset is usually pretty manageable in size (no more than 100 records daily), so I was thinking that, if absolutely necessary, I could just write a function to loop through each record and check for all the common variations (there are only so many ways one could type in a date, right?)... but that's a last resort; I wanted to check whether there's a "smarter" way to do this before I do something wildly inefficient. :-)
If it helps, I only care about the DATE; the time is actually extraneous info produced by the system, but as you can observe in the non-standardized formatting, the manual inputs only consist of the date.
Thank you!
Dataframe
df = pd.DataFrame({'START_DTTIME': ['2021-11-01 17:10:00', '11/3/2021', '11/4/21', 'UNKNOWN', '10-30-2021', '???']})
Convert the column to datetime, coercing errors to NaT, then select the rows that are not NaT:
df[pd.to_datetime(df['START_DTTIME'], errors='coerce').notna()]
START_DTTIME
0 2021-11-01 17:10:00
1 11/3/2021
2 11/4/21
4 10-30-2021
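Since you only care about the date, here's a hedged follow-up sketch building on the answer above; format='mixed' assumes pandas >= 2.0 (older versions infer the format per element by default):
import pandas as pd

df = pd.DataFrame({'START_DTTIME': ['2021-11-01 17:10:00', '11/3/2021', '11/4/21', 'UNKNOWN', '10-30-2021', '???']})

# Parse each entry individually, coercing anything unparseable to NaT
parsed = pd.to_datetime(df['START_DTTIME'], errors='coerce', format='mixed')

# Keep only the rows that parsed, and pull out just the date part
clean = df[parsed.notna()].copy()
clean['START_DATE'] = parsed.dt.date  # assignment aligns on index, so only the kept rows get values
print(clean)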
I have a dataframe like the one in image-1 (the input dataframe), and I want to rename its rows/indices with dates (dtype='datetime64[ns]') in YYYY-MM-DD format.
So I used the index-renaming option shown in image-2 below, which takes the last date of every 6th month for every row, incrementing until the end. It did rename the rows, but it ended up turning all the data values into NaNs. I tried transposing the dataframe as well, with the same result.
After that I tried a few other things, shown in image-3, all of which were unfruitful and mostly produced TypeError: 'DatetimeIndex' object is not callable.
As the final solution, I ended up creating a dataframe of all the dates (image-4), merging the two dataframes by columns (image-5), and then setting the very first column as the row names (image-6).
The dates have a weird format when converted to a list, and I'm wondering why that is (image-7). How do we get exactly year-month-date? I tried different combinations but nothing fruitful came of it; strftime is the way to go here, but how?
The reason I went down this strftime route is that I was thinking of producing a list of dates in a sensible YYYY-MM-DD format and then using something like df.rename(index=list_dates) to replace the default 0 1 2 index with the dates as the new index names (see the sketch below).
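For the strftime part, a minimal sketch (the dates below are illustrative, not taken from the images):
import pandas as pd

dates = pd.to_datetime(['2020-06-30', '2020-12-31', '2021-06-30'])  # DatetimeIndex
labels = dates.strftime('%Y-%m-%d')  # plain 'YYYY-MM-DD' strings, no time component
print(list(labels))  # ['2020-06-30', '2020-12-31', '2021-06-30']

# df.rename(index=...) expects a mapping or function, so it's simpler to assign
# the formatted labels directly (assuming len(labels) == len(df)):
# df.index = labels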
So, I have a solution, but is it an economical one, or are there better solutions available?
This is an attempt to share my solution for those who can use it, and to learn new solutions from the wizards here.
Best regards,
I'm working on a rather lengthy data manipulation project, and one of the datasets that I merged into my dataframe is 96 rows shorter (6963 vs 6867), and I cannot for the life of me figure out where those rows went.
Checking .head() and .tail() shows that all datasets in the dataframe start and end on the same date, but I have reason to think it's interfering with my other analysis, for obvious reasons.
Checking crypto_df[['doge_close', 'doge_open']] shows they are the correct length of 6963 (the total number of dates in the dataframe), but I cannot find where in the dataset the rows are missing.
Is there a function to check this? I don't think I'm being specific enough on google, if there is.
If you are looking for rows that have NaN, do this:
crypto_df[crypto_df['doge_close'].isnull()]
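If you also want to see which dates those rows correspond to, a small sketch along the same lines (this assumes the dates form the dataframe's index; use the date column instead if they don't):
# crypto_df: the merged dataframe described above
missing_dates = crypto_df.index[crypto_df['doge_close'].isnull()]
print(len(missing_dates))   # should be 96
print(missing_dates[:5])    # the first few missing dates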
I'm trying to use Hypothesis to generate a set of dataframes that I'll merge together. I want each individual column to be allowed to have NaN values, and I want to allow Hypothesis to generate some wacky examples.
But I mostly want to focus on examples where there is at least one row in each dataframe with actual values - and in particular, I'd like to be able to generate dataframes with some information shared between corresponding columns, such that a merged dataframe is not empty. (E.g. I want some values from 'store' in store.csv to overlap with values from 'store' in train.csv.)
I have some example code here that generates NaN values and wacky examples all over the place, but most of the generated examples contain very few non-NaN values. (A dataframe strategy starts on line 57.)
Any suggestions for how to create slightly more 'realistic' examples? Thanks!
Your solution looks fine to me, but here are two more tactics that might help:
Use the fill=st.nothing() argument to columns and series to disable the filling behaviour. This makes the entries dense instead of sparse(ish), so there's a substantial runtime cost but a noticeable change in example density (see the sketch below). Alternatively, fill=st.floats(allow_nan=False) might be cheaper and still work!
Use a .filter(...) on the strategy to reject dataframes without any NaN-free rows. A typical rule of thumb is to avoid .filter(...) when it would reject more than half the examples, and to look for an alternative when it's over a tenth... but this could easily be combined with the first point.
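A minimal sketch of both tactics, with illustrative column names and value ranges (not taken from the linked code):
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

# Dense columns: fill=st.nothing() disables the sparse fill behaviour, so every
# entry comes from the elements strategy, and allow_nan=False keeps NaNs out
dense_frames = data_frames(
    columns=[
        column('store', elements=st.integers(min_value=1, max_value=10), fill=st.nothing()),
        column('sales', elements=st.floats(allow_nan=False, allow_infinity=False), fill=st.nothing()),
    ],
    index=range_indexes(min_size=1, max_size=20),
)

# Second tactic: reject any frame that has no NaN-free row at all
dense_frames = dense_frames.filter(lambda df: len(df.dropna()) > 0)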
Answering my own question, but I'd love to hear other answers.
I ended up doing two things:
1) Requiring that the end user not give garbage files. (Just because we have a magical property-generation framework doesn't absolve us of the responsibility of having common sense, which I forgot.)
2) Testing for things that are reasonable accidents but not absolute garbage, by requiring that each dataframe have at least one row with no NaNs. With that requirement, I generate the non-NaN dataframe, and then add some NaNs afterward.
From there, ipython and .example() make it easy to see what's going on.
Example code below (google_files and google_weeks are custom strategies previously created)
import numpy as np
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

# Wrapped in a @composite strategy so that draw() is available; the wrapper
# name is just for this example. google_files and google_weeks are the custom
# strategies previously created.
@st.composite
def google_trend_frames(draw):
    # Create dataframes from the strategies above: all non-NaN values first,
    # then add NaNs to some rows after the fact
    df = draw(data_frames(
        [
            column('file', elements=google_files),
            column('week', elements=google_weeks),
            column('trend', elements=st.integers(min_value=0, max_value=100)),
        ],
        index=range_indexes(min_size=1, max_size=100),
    ))
    # Add the NaNs
    # With other dataframes, this ended up getting written into a function
    rows = len(df)
    df.loc[rows + 1] = [np.nan, '2014-01-05 - 2014-01-11', 42]
    df.loc[rows + 2] = ['DE_BE', np.nan, 42]
    df.loc[rows + 3] = ['DE_BE', '2014-01-05 - 2014-01-11', np.nan]
    df.loc[rows + 4] = [np.nan, np.nan, np.nan]
    return df
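For quick inspection in ipython (as mentioned above), .example() draws a sample frame; google_trend_frames is the illustrative wrapper name from the code above:
sample = google_trend_frames().example()
print(sample.tail())  # the appended NaN rows show up at the end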
I have a dataframe of 27 columns (26 are numeric variables and the 27th column tells me which group each row belongs to). There are 7 groups in total. I'm trying to apply the Kruskal-Wallis test to each variable, split by group, to determine whether there is a significant difference or not.
I have tried:
df.groupby(['treatment']).apply(kruskal)
which throws the error "Need at least two groups in stats.kruskal()".
My other attempts haven't produced an output either. I'll be doing similar analyses on a regular basis and with larger datasets. Can someone help me understand this issue and how to fix it?
With SciPy, you could do it like this for each variable:
scipy.stats.kruskal(*[group["variable"].values for name, group in df.groupby("treatment")])
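The original groupby('treatment').apply(kruskal) fails because each call only sees a single group's sub-dataframe, so kruskal receives one sample instead of seven, hence the "at least two groups" error. To run the test over all 26 variables at once, a minimal sketch (assuming the layout described in the question):
from scipy import stats

# df: the 27-column dataframe described above, with 'treatment' as the grouping column
results = {}
for col in df.columns.drop('treatment'):
    samples = [group[col].values for _, group in df.groupby('treatment')]
    results[col] = stats.kruskal(*samples)

for col, res in results.items():
    print(f"{col}: H = {res.statistic:.3f}, p = {res.pvalue:.4f}")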
I have about 100 columns with some empty values that I would like to replace with zeroes. I know how to do this with a single column using Calculate and Replace, but I wanted to see if there was a way to do this with multiple columns at once.
Thanks!
You could script it, but it'd probably take you as long to write the script as it would to do it manually with a transformation. A better idea would be to fix it in the data source itself before you import it, so Spotfire doesn't have to apply the transformation every time; if you're dealing with a large amount of data, that could hinder performance.