PySpark/Jupyter notebook display issue with dataframe - apache-spark

I'm trying to use PySpark with a Jupyter notebook, but when I want to see (a part of) the dataframe, the output is badly formatted (some columns are not even shown).
I would like to have a properly formatted display.
Any idea how to do it?

Your data file is semicolon-separated. Pass that as the separator:
df = spark.read.csv(path, sep=';')
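As a fuller sketch (the file name 'data.csv' is a placeholder, and header=True and inferSchema=True are assumptions about the file):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read with the explicit separator so the columns split correctly
# ('data.csv' is a placeholder; header/inferSchema are assumptions)
df = spark.read.csv('data.csv', sep=';', header=True, inferSchema=True)
df.show(5, truncate=False)  # print a few rows without truncating column values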

Related

Change data in Pandas dataframe by column

I have some data I imported from an Excel spreadsheet as a CSV. I created a dataframe using Pandas and want to change a specific column. The column contains strings such as "5.15.1.0.0". I want to change these strings to floats like "5.15100".
So far I've tried using the method "replace" to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15100)
This, however, does not work: when I print the dataframe again after calling replace, it shows the same dataframe with no changes made. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the "." but I'd prefer not to do it this way as some of the strings represent numbers of different lengths and decimal place values.
Add the parameter inplace, which defaults to False. Setting it to True changes the dataframe in place; the resulting column can then be type cast.
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
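If there are many distinct version strings, a hedged generalization of the same idea (assuming every value is dot-separated with a major component first, as in the question's examples) is to keep the first dot, join the rest, and cast:
import pandas as pd

df = pd.DataFrame({'Fix versions': ['5.15.1.0.0', '5.15.1.1.2']})

# Keep the first dot, join the remaining parts, then cast to float:
# '5.15.1.0.0' -> '5.15100' -> 5.151
parts = df['Fix versions'].str.split('.')
df['Fix versions'] = (parts.str[0] + '.' + parts.str[1:].str.join('')).astype(float)
print(df)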

Pandas DataFrame indexing problem from 40,000 to 49,999

I have a strange problem with my code (at least it is strange to me!).
I have a Pandas DataFrame called "A". One of the column names is "asin". I want to select all rows matching a specific value, so I wrote this simple code:
df2 = A[A['asin']=='B0000UYZG0']
And it works normally as expected, except for the data from 40,000 to 49,999!!
It doesn't work on these rows at all!
Referring to the screenshot, df2 = A[A['asin']=='0077614992'] (around row 50,000) works, but df2 = A[A['asin']=='B0006O0WW6'] (around row 49,999) does not.
I have not tried all 10,000 rows, but the ones I tested at random all fail.
I have grown accustomed to fixing bugs like this one; usually it happens because of an unexpected dtype, or because the string you see displayed isn't actually THE string itself. It seems your issue is the second case.
So let's first strip any whitespace from your "string" column:
df2['asin'] = df2.asin.str.strip()
# assuming df2 is the dataframe whose filter isn't working
After that, try rerunning your filter:
df2[df2['asin'].eq('0077614992')]
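A small self-contained repro of the idea (the data here is made up): hidden whitespace defeats exact equality even though the displayed values look identical:
import pandas as pd

# Hypothetical data: a trailing space on one value
A = pd.DataFrame({'asin': ['B0006O0WW6 ', '0077614992']})

print(A[A['asin'] == 'B0006O0WW6'])   # empty: the stored value has a trailing space

A['asin'] = A['asin'].str.strip()
print(A[A['asin'].eq('B0006O0WW6')])  # matches after stripping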

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

I'm trying to read a csv file with a column of data that has a scrambled ID number that includes the occasional consecutive $$ along with #, numbers, and letters.
SCRAMBLE_ID
AL9LLL677
AL9$AM657
$L9$$4440
#L9$306A1
etc.
I tried the following:
df = pd.read_csv('MASTER~1.CSV',
                 dtype={'SCRAMBLE_ID': str})
which rendered the third entry as L9$4440 (the L9 appears in a serif, italicized font, and the first and second $ vanish).
Faced with an entire column of IDs configured in this manner, what is the best way of dealing with such data? I can imagine two options:
prior to pd.read_csv, replacing the offending symbols with substitutes that don't create this problem (and what would those be), OR,
preserving the IDs as-is: is there a way of storing them in a form that keeps these symbols present without triggering the problem?
Thank you. I've attached a screenshot of the .csv side by side with resulting df (Jupyter notebook) below.
[Screenshot: CSV column alongside the resulting pandas df with $$]
I cannot replicate this using the same values in a mock CSV file.
Are you sure the formatting triggered by the $ symbols isn't happening wherever you render your dataframe values? Have you checked whether the data in the dataframe is what you expect, or are you only judging by the rendered output?
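One quick way to check, as a sketch (the file name is taken from the question): print the repr of each value so you see the stored characters rather than the notebook's rendering:
import pandas as pd

df = pd.read_csv('MASTER~1.CSV', dtype={'SCRAMBLE_ID': str})

# repr() shows the exact characters pandas stored, bypassing any
# $-triggered rendering in the notebook output
for value in df['SCRAMBLE_ID'].head():
    print(repr(value))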

Display full Pandas dataframe in Jupyter without index

I have a pandas dataframe that I would like to pretty-print in full (it's ~90 rows) in a Jupyter notebook. I'd also like to display it without the index column, if possible. How can I do that?
For pretty-printing without an index, I think the right approach is to call the HTML display method (which is what Jupyter does under the hood):
from IPython.display import HTML
HTML(df.to_html(index=False))
(Credit to Display pandas dataframe without index)
As others have suggested, you can use pd.set_option('display.max_rows', None) to lift the row-count limit.
In pandas you can use this:
pd.set_option("display.max_rows", None, "display.max_columns", None)
To drop the index as well, additionally use:
df.to_string(index=False)
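Putting both answers together, a sketch (the 90-row frame below is made up to stand in for the asker's data):
import pandas as pd
from IPython.display import HTML, display

df = pd.DataFrame({'name': [f'row{i}' for i in range(90)], 'value': range(90)})

# Render via HTML with the index suppressed; to_html writes all rows by default
display(HTML(df.to_html(index=False)))

# Or lift the display limits temporarily for a plain-text dump
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.to_string(index=False))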

DataFrame entries got round off when converted to txt

This is what the dataframe looks like before exporting (screenshot not shown); after exporting, the values in the text file appear rounded.
Rounding is not what I want here; I want the text in the .txt file to look like what is shown in the console. So how can I fix this? Any simple solutions?
Did you try writing directly from your Pandas dataframe instead of going through NumPy?
Try df.to_csv('output.txt', sep='\t', float_format='%g')
For more details see pandas.DataFrame.to_csv
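A short sketch of that suggestion (the data is made up); float_format controls how floats are written:
import pandas as pd

df = pd.DataFrame({'x': [0.123456789012, 3.141592653589793]})

# '%g' trims to 6 significant digits; use e.g. '%.12g' to keep more precision
df.to_csv('output.txt', sep='\t', index=False, float_format='%.12g')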
