How to check where rows in a dataframe are missing? - python-3.x

I'm working on a rather lengthy data manipulation project, and one of the datasets that I merged into my dataframe is 96 rows shorter than the rest (6867 vs 6963), and I cannot for the life of me figure out where the missing rows are.
Checking .head() and .tail() shows that all the datasets in the dataframe start and end on the same date, but I have reason to think the gap is interfering with my other analysis, for obvious reasons.
Checking crypto_df[['doge_close', 'doge_open']] shows they are the correct length of 6963 (the total number of dates in the dataframe), but I cannot find where in the dataset the rows are missing.
Is there a function to check this? If there is, I don't think I'm being specific enough on Google to find it.

If you are looking for rows that have NaN, do this:
crypto_df[crypto_df['doge_close'].isnull()]
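To see which dates those rows are missing on, a minimal sketch, assuming crypto_df is indexed by date and the merged doge columns came through as NaN wherever the shorter dataset had no row:
# Dates where the merged doge data is empty
missing_dates = crypto_df.index[crypto_df['doge_close'].isnull()]
print(len(missing_dates))   # should be 96 (6963 - 6867)
print(missing_dates)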

Related

Pandas DataFrame indexing problem from 40,000 to 49,999

I have a strange problem with my code (At least it is strange for me!).
I have a Pandas DataFrame called "A". One of its columns is named "asin". I want to select all the rows matching a specific value, so I write this simple code:
df2 = A[A['asin']=='B0000UYZG0']
And it works as expected, except for the data from 40,000 to 49,999!!
It doesn't work on these rows at all!
Refer to the picture: df2 = A[A['asin']=='0077614992'] (related to row 50,000) works, but df2 = A[A['asin']=='B0006O0WW6'] (related to row 49,999) does not work!
I haven't tried all 10,000 rows, but I have tested them randomly and found no answer.
I have grown accustomed to fixing bugs like this one; usually when it happens it's because of an unexpected dtype, or because the string you see displayed isn't actually THE string itself. It seems your issue is mostly the second case.
So let's first clear your "string" column of any whitespace.
df2['asin'] = df2.asin.str.strip()
# Assuming df2 is the dataframe object that is giving you trouble
After that, try rerunning your filter:
df2[df2['asin'].eq('0077614992')]
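If stripping doesn't fix it, a hedged diagnostic sketch for checking the dtype and what a failing cell actually contains (the row position here is only illustrative):
print(A['asin'].dtype)                  # should be object, i.e. strings
sample = A['asin'].iloc[49999]          # some row inside the failing range (illustrative)
print(repr(sample))                     # repr() exposes hidden whitespace or odd unicode
print(A['asin'].str.len().describe())   # an unexpectedly large max hints at padding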

Clean inconsistent date formatting in pandas dataframe column

I've read about coercing errors and eyeballing stuff like this, but I was wondering if there was a more optimal way to automate this issue:
I have a regular dataset being outputted by a system, then manually modified by some folks. Unfortunately, the modifications tend to be inconsistent with the original system outputs, so I end up having to manually standardize the formatting before loading it into a pandas dataframe. Is there a smarter way to do this -- i.e., something I'm not aware of where pandas or some other function would be able to clean this for me?
Sample dataframe column with its messy data formats:
START_DTTIME
-----
2021-11-01 17:10:00
2021-11-01 17:12:00
2021-11-01 17:15:00
11/3/2021
11/4/21
UNKNOWN
UNK
2021-10-04 14:01:20
10-5-21
10-30-2021
???
2021-10-05 14:03:21
The dataset usually is pretty manageable in size (no more than 100 records daily), so I was thinking if absolutely necessary I could just make a function to loop through each record checking for all the different variations that are commonplace (there are only so many different ways one could type in a date, right?)... but that's a last resort as I wanted to check if there's a "smarter" way to do this first before I do something wildly inefficient. :-)
If it helps, I only care about the DATE; the time is actually extraneous info produced by the system, but as you can observe in the non-standardized formatting, the manual inputs only consist of the date.
Thank you!
Dataframe
df=pd.DataFrame({'START_DTTIME':['2021-11-01 17:10:00','11/3/2021','11/4/21','UNKNOWN','10-30-2021','???']})
Convert the column to datetime, coercing errors to NaT, and then select the rows that are not NaT:
df[pd.to_datetime(df['START_DTTIME'], errors='coerce').notna()]
START_DTTIME
0 2021-11-01 17:10:00
1 11/3/2021
2 11/4/21
4 10-30-2021
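Since only the date matters here, a hedged follow-up sketch that keeps just the parsed date and drops the unparseable entries (on newer pandas versions you may need format='mixed' to handle the mixed formats):
import pandas as pd

parsed = pd.to_datetime(df['START_DTTIME'], errors='coerce')
cleaned = df.loc[parsed.notna()].assign(START_DATE=parsed.dt.date)
print(cleaned)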

Python-Hypothesis: specifying and managing NaN values

I'm trying to use Hypothesis to generate a set of dataframes that I'll merge together. I want each individual column to be allowed to have NaN values, and I want to allow Hypothesis to generate some wacky examples.
But I mostly want to focus on examples where there is at least one row in each dataframe with actual values - and in particular, I'd like to be able to generate dataframes with some information shared between corresponding columns, such that a merged dataframe is not empty. (E.g. I want some values from 'store' in store.csv to overlap with values from 'store' in train.csv.)
I have some example code here that generates NaN values and wacky examples all over the place, but most of the generated examples contain very few non-NaN values. (A dataframe strategy starts on line 57.)
Any suggestions for how to create slightly more 'realistic' examples? Thanks!
Your solution looks fine to me, but here are two more tactics that might help:
Use the fill=st.nothing() argument to columns and series, to disable the filling behaviour. This makes the entries dense instead of sparse(ish), so there's a substantial runtime cost but a noticeable change in the example density. Alternatively, fill=st.floats(allow_nan=False) might be cheaper and still work!
Use a .filter(...) on the strategy to reject dataframes without any nan-free rows. A typical rule of thumb is to avoid using .filter when it would reject more than half the examples and look for an alternative when it's over a tenth... but this could be combined with the first point easily enough.
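For what it's worth, a hedged sketch of both tactics combined, with made-up column names and element strategies standing in for the ones in the linked example:
import hypothesis.strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

dense_frames = data_frames(
    columns=[
        # fill=st.nothing() disables the sparse fill, so every entry is drawn
        column('store', elements=st.integers(min_value=1, max_value=10), fill=st.nothing()),
        # or keep the fill but make it NaN-free, which is cheaper
        column('sales', elements=st.floats(allow_nan=True), fill=st.floats(allow_nan=False)),
    ],
    index=range_indexes(min_size=1, max_size=50),
).filter(lambda df: df.notna().all(axis=1).any())   # keep frames with at least one NaN-free row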
Answering my own question, but I'd love to hear other answers.
I ended up doing two things:
1) Requiring that the end user not give garbage files. (Just because we have a magical property-generation framework doesn't absolve us of the responsibility of having common sense, which I forgot.)
2) Testing for things that are reasonable accidents but not absolute garbage, by requiring that each dataframe have at least one row with no NaNs. With that requirement, I generate the non-NaN dataframe, and then add some NaNs afterward.
From there, ipython and .example() make it easy to see what's going on.
Example code below (google_files and google_weeks are custom strategies previously created)
# Create dataframes from the strategies above
# We'll create dataframes with all non-NaN values, then add NaNs to rows
# after the fact
df = draw(data_frames([
    column('file', elements=google_files),
    column('week', elements=google_weeks),
    column('trend', elements=integers(min_value=0, max_value=100))],
    index=range_indexes(min_size=1, max_size=100)))

# Add the NaNs
# With other dataframes, this ended up getting written into a function
rows = len(df)
df.loc[rows + 1] = [np.NaN, '2014-01-05 - 2014-01-11', 42]
df.loc[rows + 2] = ['DE_BE', np.NaN, 42]
df.loc[rows + 3] = ['DE_BE', '2014-01-05 - 2014-01-11', np.NaN]
df.loc[rows + 4] = [np.NaN, np.NaN, np.NaN]
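For context, a hedged sketch of the scaffolding the snippet above assumes: it lives inside an @st.composite strategy, with st.sampled_from standing in for the custom google_files and google_weeks strategies that aren't shown here.
import numpy as np
import hypothesis.strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

@st.composite
def frames_with_some_nans(draw):
    # draw a fully non-NaN frame first, then append rows containing NaNs
    df = draw(data_frames([
        column('file', elements=st.sampled_from(['DE_BE', 'DE_BY'])),
        column('week', elements=st.sampled_from(['2014-01-05 - 2014-01-11'])),
        column('trend', elements=st.integers(min_value=0, max_value=100))],
        index=range_indexes(min_size=1, max_size=100)))
    df.loc[len(df) + 1] = [np.nan, '2014-01-05 - 2014-01-11', 42]
    return df

# In ipython, frames_with_some_nans().example() shows what gets generated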

Spark - Have I read from csv correctly?

I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"',
ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
When I tried it with sample csv data from another source and did display(df), it showed a neatly displayed header row followed by data.
When I try it on my main data, which has 40 columns, and millions of rows, it simply displays the first 20 column headers and no data rows.
Is this normal behavior or is it reading it wrong?
Update:
I shall mark the question as answered as the tips below are useful. However my results from doing:
df.show(5, truncate=False)
currently shows:
(truncated show() output: the entire header row, ��"periodID","DAXDate","Country Name", ..., "ProfitCenterRegion","ProfitCenterCountry", comes back as one single column, with stray �� characters at the start, because the file is pipe-delimited rather than comma-delimited)
I shall have to go back to basics and preview the csv in a text editor to find out what the correct format for this file is and figure out what's going wrong. Note, I had to update my code to the following to deal with the pipe delimiter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
Yes, this is normal behaviour. The dataframe function show() displays 20 rows by default. You can set a different value (but keep in mind that it doesn't make sense to print all the rows of your file) and also stop it from truncating the columns. For example:
df.show(100, truncate=False)
It is a normal behaviour. You can view the content of your data in different ways:
show(): Shows you, in a formatted way, the first 20 rows. You can specify the number of rows you want to display as an argument (it's fine if you provide a value much higher than your data has). Columns are truncated too by default; you can specify truncate=False to show them in full (as #cronoik correctly said in his answer).
head(): The same as show(), but it prints the data in a "row" format. It does not produce a nicely formatted table, but it is useful for a quick, complete look at your data, for example head(1) to show only the first row.
describe().show(): shows a summary that gives you some insight into the data, for example the count of elements and the min/max/average value of each column.
It is normal for Spark dataframes to display a limited number of rows and columns. Your reading of the data should not be a problem. However, to confirm that you have read the csv correctly, you can check the number of rows and columns in the df using
len(df.columns)
or
df.columns
For number of rows
df.count()
In case you need to see the content in detail you can use the option stated by cronoik.
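Putting those checks together, a minimal sketch reusing the df from the question:
print(len(df.columns))     # should match the ~40 columns described in the question
print(df.count())          # total number of rows
df.select(df.columns[:5]).show(5, truncate=False)   # peek at just a few columns in full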

Replacing null values with zeroes in multiple columns [Spotfire]

I have about 100 columns with some empty values that I would like to replace with zeroes. I know how to do this with a single column using Calculate and Replace, but I wanted to see if there was a way to do this with multiple columns at once.
Thanks!
You could script it, but it would probably take you as long to write the script as it would to do it manually with a transformation. A better idea would be to fix it in the data source itself before you import it, so Spotfire doesn't have to do the transformation every time; if you are dealing with a large amount of data, that could hinder your performance.
