Spark - Have I read from csv correctly? - apache-spark

I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"',
ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
When I tried it with sample csv data from another source and did display(df), it showed a neatly displayed header row followed by data.
When I try it on my main data, which has 40 columns and millions of rows, it simply displays the first 20 column headers and no data rows.
Is this normal behavior or is it reading it wrong?
Update:
I shall mark the question as answered as the tips below are useful. However, my results from doing:
df.show(5, truncate=False)
currently shows:
+----------------------------+
|��"periodID","DAXDate","Country Name","Year","TransactionDate","QTR","Customer Number","Customer Name","Customer City","Document Type Code","Order Number","Product Code","Product Description","Selling UOM","Sub Franchise Code","Sub Franchise Description","Product Major Code","Product Major Description","Product Minor Code","Product Minor Description","Invoice Number","Invoice DateTime","Class Of Trade ID","Class Of Trade","Region","AmountCurrencyType","Extended Cost","Gross Trade Sales","Net Trade Sales","Total(Ext Std Cost)","AdjustmentType","ExcludeComment","CurrencyCode","fxRate","Quantity","FileName","RecordCount","Product Category","Direct","ProfitCenter","ProfitCenterRegion","ProfitCenterCountry"|
+----------------------------+
The whole header row lands in a single cell, so I shall have to go back to basics and preview the csv in a text editor to find out the correct format for this file and figure out what's going wrong. Note: I had to update my code to the following to deal with the pipe delimiter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
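The ��-prefixed output above also makes me suspect an encoding problem: those characters are typical of an undecoded byte-order mark. If the text-editor preview confirms that, Spark's CSV reader accepts an encoding option. A minimal sketch, assuming the file turns out to be UTF-16 (adjust to whatever the editor actually reveals):
df = spark.read.format(file_type).options(header='true', quote='\"',
    delimiter='|', encoding='UTF-16',  # assumption: match the file's real encoding
    ignoreLeadingWhiteSpace='true', inferSchema='true').load(file_location)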

Yes, this is normal behaviour. The DataFrame method show() displays 20 rows by default. You can pass a different value (but keep in mind that it doesn't make sense to print all the rows of your file) and also stop it from truncating columns. For example:
df.show(100, truncate=False)

It is normal behaviour. You can view the content of your data in different ways:
show(): shows the first 20 rows in a formatted table. You can pass the number of rows you want to display as an argument (a value much higher than your row count is fine!). Columns are truncated too by default; pass truncate=False to show them in full (as @cronoik correctly said in his answer).
head(): the same as show(), but it prints the data in a "row" format. It does not give a nicely formatted table, but it is useful for a quick, complete look at your data, for example head(1) to show only the first row.
describe().show(): shows a summary that gives you insight into the data, for example the count of elements and the min/max/avg value of each column.
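A quick sketch of the three calls on an already-loaded df (the argument values are just examples):
df.show(5, truncate=False)  # formatted table, first 5 rows, no column truncation
print(df.head(1))           # list of Row objects; quick, complete look at one row
df.describe().show()        # count, mean, stddev, min and max of each column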

It is normal for Spark dataframes to display a limited number of rows and columns. Your reading of the data should not be the problem. However, to confirm that you have read the csv correctly, you can check the number of columns in the df using
len(df.columns)
or
df.columns
and the number of rows using
df.count()
If you need to see the content in detail, you can use the option stated by cronoik.

Related

Pandas DataFrame indexing problem from 40,000 to 49,999

I have a strange problem with my code (at least it is strange to me!).
I have a Pandas DataFrame called "A". One of the column names is "asin". I want to select all the rows containing a specific value of it. So I write this simple code:
df2 = A[A['asin']=='B0000UYZG0']
And it works normally as expected, except for data from 40,000 to 49,999!!
It doesn't work on these data series at all!
Refer to the picture, df2 = A[A['asin']=='0077614992'] (related to 50,000) works but df2 = A[A['asin']=='B0006O0WW6'] (related to 49,999) does not work!
I have not tried all 10,000 rows! But the ones I tested randomly give no answer.
I have grown accustomed to fixing bugs like this one; usually it happens because of an unexpected dtype, or because the string you see displayed isn't actually THE string itself. It seems your issue is mostly the second case.
So let's first clear your "string" column of any whitespace:
df2['asin'] = df2.asin.str.strip()
# assuming that this is your non-functional df object
After that, try rerunning your filter:
df2[df2['asin'].eq('0077614992')]
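If stripping doesn't help, a minimal sketch to test the "not actually THE string" idea, using repr to expose invisible or look-alike characters (A is the original frame from the question):
suspect = A[A['asin'].str.contains('B0006O0WW6', na=False)]
print(suspect['asin'].map(repr).tolist())  # reveals trailing spaces or odd Unicode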

How to split a Pandas dataframe into multiple csvs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
[image: dataframe sample]
Basically, I need to group the data according to the color column (which takes values of 0,0.5 and 1). However, I don't need an overall grouping (this is easy). I need it to create new dataframes every time the value changes. That is, I'd like a new dataframe for every set of rows that are followed by and preceded by 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should understand the for loop well. Python is a great programming language for using library code inside functions and loops. Let's say you have a dataset and you want to walk through it and inspect column a. First, start the loop with for i in dataset:. On the next line, specify the criterion you want with if i[a] > 0.5: inside each iteration. Now, whenever the value is greater than 0.5, you can write the code that adds all the data of the current row to a new dataset. For the sake of your own learning I did not write ready-made code, but a vectorized sketch of the splitting step follows below.
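A vectorized sketch of that splitting step, assuming the column is named 'Color' (an assumption; it also ignores the five-consecutive-zeros smoothing the question mentions):
# Start a new segment id every time the Color value changes between rows.
segment_id = (df['Color'] != df['Color'].shift()).cumsum()
# One sub-dataframe per run of identical values.
segments = [chunk for _, chunk in df.groupby(segment_id)]
# Keep only the red (0.5) and green (1) runs; drop the black (0) ones.
clusters = [s for s in segments if s['Color'].iloc[0] != 0]
Each element of clusters can then be written out with to_csv or visualized on its own.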

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

I'm trying to read a csv file with a column of data that has a scrambled ID number that includes the occasional consecutive $$ along with #, numbers, and letters.
SCRAMBLE_ID
AL9LLL677
AL9$AM657
$L9$$4440
#L9$306A1
etc.
I tried the following:
df = pd.read_csv('MASTER~1.CSV',
dtype = {'SCRAMBLE_ID': str})
which rendered the third entry as L9$4440 (L9 appears in a serif font, italicized, and the first and second $ vanish).
Faced with an entire column of ID numbers configured in this manner, what is the best way of dealing with such data? I can imagine:
PRIOR TO pd.read_csv: replacing the offending symbols with substitutes that don't create this problem (and what would those be?), OR,
is there a way of preserving the IDs as is but making them into a data type that ignores these symbols while keeping them present?
Thank you. I've attached a screenshot of the .csv side by side with the resulting df (Jupyter notebook) below.
[image: csv column to pandas df with $$]
I cannot replicate this using the same values as you in a mock CSV file.
Are you sure that the formatting based on the $ symbol is not occurring wherever you are rendering your dataframe values? Have you checked whether the data in the dataframe is what you expect, or are you only looking at it rendered externally?
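One way to check, as a minimal sketch: print the raw Python strings, which bypasses any rich rendering (the serif, italicized look described in the question is typical of a notebook's math renderer interpreting $...$ pairs):
import pandas as pd

df = pd.read_csv('MASTER~1.CSV', dtype={'SCRAMBLE_ID': str})
print(df['SCRAMBLE_ID'].tolist())  # plain repr of each value, no rendering applied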

Specify rows to read in from excel using pd.read_excel

I have a large excel file and want to select specific rows (not a continuous block) and columns to read in. With columns this is easy; is there a way to do this for rows, or do I need to read in everything and then delete all the rows I don't want?
Consider an excel file with the structure:
,CW18r4_aer_7,,,,,,,
,Vegetation ,,,,,,,
Date,A1,A2,B1,B2,C1,C2,C3,C4
1/7/86,3.80,8.02,7.94,9.81,9.82,4.19,3.88,0.87
2/7/86,0.50,2.02,5.26,3.70,8.59,8.61,9.86,3.27
3/7/86,4.75,3.88,0.46,5.95,9.45,9.62,4.33,1.63
4/7/86,7.64,6.93,2.71,9.96,1.25,0.35,1.84,1.02
5/7/86,3.33,8.24,7.36,7.86,0.43,2.32,2.18,1.91
6/7/86,1.96,1.78,7.45,2.28,5.27,9.94,0.22,2.94
7/7/86,4.67,8.41,1.49,5.48,5.46,1.39,1.85,7.71
8/7/86,8.07,5.60,4.23,3.93,3.92,9.09,9.90,2.15
9/7/86,7.00,5.16,6.10,8.86,7.18,9.42,8.78,5.42
10/7/86,7.53,9.81,3.33,1.50,9.45,6.96,5.41,5.25
11/7/86,0.95,3.84,3.52,5.94,8.77,1.94,5.69,8.62
12/7/86,2.94,3.07,5.13,8.10,6.52,9.93,5.85,3.91
13/7/86,9.33,7.03,5.80,2.45,2.86,7.32,5.00,0.17
14/7/86,7.39,4.85,9.15,2.23,1.70,9.42,2.72,9.32
15/7/86,3.38,4.67,6.63,2.12,5.09,7.71,0.99,9.72
16/7/86,9.85,6.68,3.09,5.05,0.34,5.44,5.99,6.19
I want to take the headers from row 3 and then read in some of the rows and columns.
import pandas as pd
df = pd.read_excel("filename.xlsx", skiprows = 2, usecols = "A:C,F:I", userows = "4:6,13,17:19")
Importantly, this is not a block that can be described by say [A3:C10] or the like.
The userows option does not exist. I know I can skip rows at the top and at the bottom, so presumably I could make lots of data frames and knit them together. But is there a simple way to just read in what you need once? My workaround is to create lots of excel spreadsheets that contain only what I need for the different data frames, but this leaves things very open to me making a mistake I can't find.
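A possible approach: recent pandas versions accept a callable for skiprows, called with each 0-based row index and returning True when the row should be skipped, which can express a non-contiguous selection. A hedged sketch, translating the hoped-for userows selection above into 0-based indices:
import pandas as pd

# Keep row index 2 (the "Date,A1,..." header row) plus the 0-based
# equivalents of rows 4:6, 13 and 17:19; skip everything else.
wanted = {2, 3, 4, 5, 12, 16, 17, 18}
df = pd.read_excel("filename.xlsx",
                   usecols="A:C,F:I",
                   skiprows=lambda i: i not in wanted)
The first row that survives the skip becomes the header, so the headers from row 3 come along for free.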

Excel : Comparing two csv files, and isolate line data from corresponding columns

I have two CSV files: one of 25,000 lines containing all the data, and one of 9,000 lines containing the names for which I need to get the data from the first one.
Someone told me that would be fairly easy using Excel, but I can't seem to find a similar problem.
I've tried comparison tools, but they are not helping me isolate what I need.
Using this example:
Master file:
Name;email;displayname
Bbob;Bbob#mail.com;Bob bob
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
Name file:
Name
Mmartha
Cclaire
What I need to get after comparison:
Name;email;displayname
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
So for the names I have in my second csv, I need to get the entire line from the master csv file.
Right now I could use Notepad compare, for example, but on 25,000 lines, considering what I need, that means a lot of manual labor to come. I think someone must have faced a similar issue, but I can't seem to find a solution right now, so here I am.
Beforehand, apologies for the Dutch screenshots; I'm unsure about the English terms in PowerQuery, but you should be able to follow the procedure.
Using PowerQuery:
Start PowerQuery
Load both source CSV1 and CSV2
Merge queries as new
Select column 1 in both tables and select the Inner option
The result should look like this:
Use the first row as headers:
Delete the 4th column, then close and load the values
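For reference, the same inner join can be sketched in pandas (a minimal sketch; the file names are assumptions):
import pandas as pd

master = pd.read_csv('master.csv', sep=';')   # 25,000-line file with all the data
names = pd.read_csv('names.csv', sep=';')     # 9,000-line file with the Name column

# Inner join on Name keeps only the lines whose name appears in both files.
result = master.merge(names, on='Name', how='inner')
result.to_csv('filtered.csv', sep=';', index=False)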
