Pandas DataFrame indexing problem from 40,000 to 49,999 - python-3.x

I have a strange problem with my code (at least it is strange to me!).
I have a Pandas DataFrame called "A". One of the columns is named "asin". I want to select all rows that match a specific value in that column, so I wrote this simple code:
df2 = A[A['asin']=='B0000UYZG0']
It works as expected, except for rows 40,000 to 49,999!
It doesn't work on this range at all!
Refer to the picture: df2 = A[A['asin']=='0077614992'] (row 50,000) works, but df2 = A[A['asin']=='B0006O0WW6'] (row 49,999) does not!
I have not tried all 10,000 rows, but I tested some at random and none of them returned a result.

I have grown accustomed to fixing bugs like this one. Usually the cause is an unexpected dtype, or the string you see displayed isn't actually the string stored in the column. It seems your issue is mostly the second case.
So let's first strip any whitespace from your "string" column.
df2['asin'] = df2.asin.str.strip()
# I am assuming df2 is the DataFrame that is not working
After that, try rerunning your filter:
df2[df2['asin'].eq('0077614992')]
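To check whether hidden whitespace really is the cause, here is a minimal sketch. The sample frame and the price column are made up for illustration; only the "asin" column name and the IDs come from the question. It prints the raw values with repr() so stray spaces become visible, then compares the filter before and after stripping.

```python
import pandas as pd

# Hypothetical sample data: two of the IDs carry invisible whitespace.
A = pd.DataFrame({
    "asin": ["B0000UYZG0", "B0006O0WW6 ", " 0077614992"],
    "price": [1.0, 2.0, 3.0],
})

# repr() exposes characters that a normal print of the frame hides.
raw_values = [repr(v) for v in A["asin"]]
print(raw_values)

# The exact-match filter finds nothing while the trailing space is present.
before = A[A["asin"] == "B0006O0WW6"]

# Strip leading and trailing whitespace, then filter again.
A["asin"] = A["asin"].astype(str).str.strip()
after = A[A["asin"] == "B0006O0WW6"]

print(len(before), len(after))
```

If the repr() output shows no stray characters, the dtype of the column (A['asin'].dtype) is the next thing to look at.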

Related

Plotting the string value frequency in python

So I have this data frame related to species of spiders and I wanted to see which are the 10 most frequent families of spiders. I used the code below to find out:
n=10
dfc['family'].value_counts()[:n].index.tolist()
I want to create a plot which will show how many of each of those top 10 species exists in the data frame. That is, I want a plot that says 300 of the first species exist and 200 of the second species exist in the data frame, just like this. But I cannot quite figure out the code for this.
Can anyone help me out with it?
Not knowing what your dataframe looks like at all, it is a little tough to give a precise answer, and I didn't check the below code on a dataframe because I didn't have something handy. (also, I assume you are using pandas here).
It sounds like you want (or at least could use) a dataframe that has a column of families and then the next column is just the count of that family in the original. You can accomplish this with groupby().
df2 = dfc.groupby('family').size().reset_index(name='count')
If you then want to keep just the top 10 to make it easy to plot, you can use the nlargest() function in pandas.
df2 = df2.nlargest(10, 'count')
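For the plot itself, a minimal sketch could build on the value_counts() call from the question. The helper names top_counts and plot_top_counts and the sample rows are assumptions for illustration, and the plotting helper needs matplotlib installed; only dfc and the 'family' column come from the question.

```python
import pandas as pd


def top_counts(df, column="family", n=10):
    """Return the n most frequent values of `column` with their counts."""
    # value_counts() sorts by count in descending order by default.
    return df[column].value_counts().head(n)


def plot_top_counts(df, column="family", n=10):
    """Draw a bar chart of the n most frequent values (requires matplotlib)."""
    counts = top_counts(df, column, n)
    ax = counts.plot(kind="bar")
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    return ax


# Made-up sample rows standing in for the real spider data.
dfc = pd.DataFrame(
    {"family": ["Araneidae"] * 3 + ["Salticidae"] * 2 + ["Lycosidae"]}
)
counts = top_counts(dfc, n=2)
print(counts)
```

In a notebook, calling plot_top_counts(dfc) should render the chart; in a script, follow it with matplotlib.pyplot.show().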

How to check where rows in a dataframe are missing?

I'm working on a rather lengthy data manipulation project, and one of the datasets that I merged into my dataframe has 96 fewer rows than the rest (6867 vs 6963), and I cannot for the life of me figure out where.
Checking .head() and .tail() shows that all the datasets in the dataframe start and end on the same date, but I have reason to think it's interfering with my other analysis, for obvious reasons.
Checking crypto_df[['doge_close', 'doge_open']] shows they are the correct length of 6963 (the total number of dates in the dataframe), but I cannot find where in the dataset the rows are missing.
Is there a function to check this? I don't think I'm being specific enough on Google, if there is.
If you are looking for rows that have NaN, do this:
crypto_df[crypto_df['doge_close'].isnull()]
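A small sketch of how that filter can be used to locate the gaps, assuming the dataframe is indexed by date and the merge left NaN where the shorter dataset had no row. The sample values and the btc_close column are made up; crypto_df and doge_close come from the question.

```python
import pandas as pd

# Hypothetical frame indexed by date; doge_close is missing on three days.
dates = pd.date_range("2021-01-01", periods=6, freq="D")
crypto_df = pd.DataFrame(
    {
        "btc_close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "doge_close": [0.1, None, 0.3, None, None, 0.6],
    },
    index=dates,
)

# How many values are missing in each column.
missing_per_column = crypto_df.isna().sum()

# The index labels (dates) of the rows where doge_close is missing.
missing_dates = crypto_df[crypto_df["doge_close"].isna()].index

print(missing_per_column)
print(missing_dates)
```

With the real data, missing_per_column should report 96 for the shorter dataset's columns if the gap does come from NaN rows.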

Spark VectorAssembler creating a dictionary from features

I'm having trouble with VectorAssembler in PySpark.
Task
I have a bunch of data (all numerical) that describes the customer base for a phone company (monthly bills, gender, etc.). One datum can be seen here:
My goal is to try to predict if a customer will churn (leave the service).
Method
In pursuit of this goal, I implemented the following code:
from pyspark.ml.feature import VectorAssembler
ignore = ['Churn_indexed', 'customerID']
vectorAssembler = VectorAssembler(inputCols=[x for x in df_num.columns if x not in ignore],
                                  outputCol='Independent Features')
df_final = vectorAssembler.transform(df)
Where I have made a VectorAssembler to ignore customerID (irrelevant) and churn (to be predicted).
Error
I then printed df_final to ensure it looked as expected and saw the following, where the red dots indicate rows that are of the expected form. The remaining rows do not have the expected form:
For some reason some of the rows have "20" at the beginning followed by a list. I should note that there are 20 features (including Churn and excluding CustomerID) which is perhaps where this pre-appended 20 comes from? I printed an incorrect row and a correct row and it looks like for the incorrect ones the assembler turned the features into a dictionary?
CSV
I opened the CSV and tried looking for extra spaces or bad formatting and could not recognize a pattern for determining what ended up being VectorAssembled properly and what did not.
Has anyone else run into this issue? I've been troubleshooting for a long time to no avail.
Thanks for any help.
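It looks like the VectorAssembler automatically stores rows with many zeros (about half or more, from what I can tell) in their sparse form, as seen here. The leading 20 is the size of the vector, followed by the indices of the non-zero entries and then their values, so it is a sparse vector rather than a dictionary.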
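To make the sparse notation concrete, here is a pure-Python illustration (not Spark's own implementation) of how a (size, indices, values) triple maps back to an ordinary dense row. The helper name sparse_to_dense and the sample numbers are assumptions for illustration.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


# Hypothetical row with 20 features, of which only three are non-zero.
size = 20
indices = [1, 4, 18]
values = [1.0, 29.85, 1.0]

dense_row = sparse_to_dense(size, indices, values)
print(dense_row)
```

Both forms describe the same feature vector, so downstream estimators should treat them the same way.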

String lookup failed on Iteration over a list using pandas dataframe

I have a list of strings, and I am trying to search a pandas DataFrame column for them and delete any rows containing an element of that list.
Here is the code to search a specific column and remove any row containing the quoted substring. In this case, all rows containing 'dave' in the Owner_Name column would be removed. This works great by itself, exactly as expected.
df = df[~df.Owner_Name.str.contains('dave')]
When I try to automate this over a list of 54 or so elements, it gets hung up and only removes some, but not all. Any idea why?
Here is my simple code for the loop (a mock-up to show what I am doing, not my actual code):
badWords= ['random stuff','code words','secret squirrel','blue','black','dave']
for word in badWords:
    df = df[~df.Owner_Name.str.contains(word)]
    print('Total Rows Left',df.shape[0], word)
I am not getting any errors, but it certainly isn't working the way I want. For example, after the loop there are still 'dave' elements in the Owner_Name column, even though it supposedly looped through the list. I even put in breadcrumbs to print the element being passed, so the loop is running, but it is as if str.contains() is not removing the rows properly. I also made sure the case of everything in the df matches my list objects, so that shouldn't be an issue. I am really stumped and can't find anything on Stack about this specific issue.
Adding the answer that worked here. The only change is case=False, which makes the match case-insensitive:
badWords= ['random stuff','code words','secret squirrel','blue','black','dave']
for word in badWords:
    df = df[~df.Owner_Name.str.contains(word,case=False)]
    print('Total Rows Left',df.shape[0], word)
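As an alternative to the loop, the whole list can be checked in one pass by joining the words into a single regular expression. This is a sketch under assumptions: the sample rows are made up, re.escape is added so that words containing regex characters are matched literally, and na=False treats missing names as non-matches.

```python
import re

import pandas as pd

# Hypothetical sample data with mixed case and one missing name.
df = pd.DataFrame(
    {"Owner_Name": ["Dave Smith", "alice", "BLUE corp", None,
                    "secret squirrel llc", "bob"]}
)
badWords = ['random stuff', 'code words', 'secret squirrel',
            'blue', 'black', 'dave']

# Build one pattern such as "random\ stuff|code\ words|...|dave".
pattern = "|".join(re.escape(word) for word in badWords)

# case=False ignores case; na=False counts missing values as "no match".
mask = df["Owner_Name"].str.contains(pattern, case=False, na=False)
cleaned = df[~mask]

print('Total Rows Left', cleaned.shape[0])
```

The loop version does the same job; the single pattern mainly avoids rebuilding the dataframe once per word.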

Spark - Have I read from csv correctly?

I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"',
ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
When I tried it with sample csv data from another source and did display(df), it showed a neatly displayed header row followed by data.
When I try it on my main data, which has 40 columns, and millions of rows, it simply displays the first 20 column headers and no data rows.
Is this normal behavior or is it reading it wrong?
Update:
I shall mark the question as answered as the tips below are useful. However my results from doing:
df.show(5, truncate=False)
currently shows:
|��"periodID","DAXDate","Country Name","Year","TransactionDate","QTR","Customer Number","Customer Name","Customer City","Document Type Code","Order Number","Product Code","Product Description","Selling UOM","Sub Franchise Code","Sub Franchise Description","Product Major Code","Product Major Description","Product Minor Code","Product Minor Description","Invoice Number","Invoice DateTime","Class Of Trade ID","Class Of Trade","Region","AmountCurrencyType","Extended Cost","Gross Trade Sales","Net Trade Sales","Total(Ext Std Cost)","AdjustmentType","ExcludeComment","CurrencyCode","fxRate","Quantity","FileName","RecordCount","Product Category","Direct","ProfitCenter","ProfitCenterRegion","ProfitCenterCountry"|
I shall have to go back to basics and preview the csv in a text editor to find out what the correct format is for this file and figure out what's going wrong. Note, I had to update my code to the following to deal with the pipe delimiter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
Yes, this is normal behaviour. The dataframe function show() displays 20 rows by default. You can set a different value for that (but keep in mind that it doesn't make sense to print all rows of your file) and also stop it from truncating. For example:
df.show(100, truncate=False)
It is normal behaviour. You can view the content of your data in different ways:
show(): Shows the first 20 rows in a formatted table. You can pass the number of rows you want to display as an argument (a value higher than the number of rows in your data is fine). By default, long cell values are truncated too; specify truncate=False to show them in full (as @cronoik correctly said in his answer).
head(): Similar to show(), but it returns the data as Row objects instead of printing a formatted table. It is useful for a quick, complete look at your data, for example head(1) for only the first row.
describe().show(): Shows a summary that gives you insight into the data, for example the count, min, max and average value of each column.
It is normal for Spark dataframes to display limited rows and columns. Your reading of the data should not be a problem. However, to confirm that you have read the csv correctly you can try to see the number of rows and columns in the df, using
len(df.columns)
or
df.columns
For the number of rows:
df.count()
In case you need to see the content in detail you can use the option stated by cronoik.
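On the garbled characters in front of "periodID" in the update: leading junk like that may indicate a byte-order mark, meaning the file is stored in an encoding other than the one it was read with. Rather than opening a text editor, a minimal sketch like the following can inspect the first bytes. The helper name detect_bom is hypothetical, and it assumes the file can be opened from local Python.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE mark begins
# with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(first_bytes):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, name in BOMS:
        if first_bytes.startswith(bom):
            return name
    return None


# Self-contained demo on in-memory bytes instead of the real file.
sample = '"periodID"|"DAXDate"'.encode("utf-16")
print(detect_bom(sample))

# With a real file, something like:
#     with open(file_location, "rb") as f:
#         print(detect_bom(f.read(4)))
```

If a byte-order mark is found, passing the matching name through the CSV reader's encoding option, for example .options(encoding='UTF-16', ...), is one thing to try.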
