Pyspark CSV Incorrect Count - apache-spark

I'm trying to read the below data from a CSV file and end up with a wrong count, although the dataframe contains all the records below. df_inputfile.count() prints 3 although it should have been 4.
It looks like this is happening because of the single comma in the 4th column of the 3rd row. Can someone please explain why?
B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
df_inputfile = (spark.read.format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "false")
                .option("quotedstring", '\"')
                .option("escape", '\"')
                .option("multiline", "true")
                .option("delimiter", ",")
                .load('<path to csv>'))
print(df_inputfile.count()) # Prints 3
print(df_inputfile.distinct().count()) # Prints 4
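Not a definitive diagnosis, but two things are worth noting. First, quotedstring is not a recognized Spark CSV option; the quote character is set with the quote option. Second, one possible explanation for the mismatch is that count() can take a column-pruned parsing path that treats the escaped quotes in row 3 differently from the full parse that distinct() forces. The sketch below is a way to narrow it down rather than a confirmed fix: cache() materializes every row through the full parser so both counts run over the same data, and mode=FAILFAST makes Spark raise on any record it cannot parse instead of silently merging or dropping it.

# Diagnostic sketch, not a confirmed fix; option names are standard Spark CSV options.
df_inputfile = (spark.read.format("csv")
                .option("inferSchema", "true")
                .option("header", "false")
                .option("quote", '"')        # the recognized option is "quote", not "quotedstring"
                .option("escape", '"')
                .option("multiLine", "true")
                .option("mode", "FAILFAST")  # raise on malformed records instead of merging/dropping them
                .load('<path to csv>')
                .cache())                    # force a full parse before counting

print(df_inputfile.count())
print(df_inputfile.distinct().count())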

Related

Join different DataFrames using loop in Pyspark

I have 5 CSV files in a folder and want to join them into one data frame in Pyspark. I use the code below:
name_file =['A', 'B', 'C', 'D', 'V']
for n in name_file:
n= spark.read.csv(fullpath+n+'.csv'
,header=False,
inferSchema= True)
full_data=full_data.join(n,["id"])
Error: I got an unexpected result: the last dataframe just joined with itself.
Expected result: there should be 6 columns. Each CSV has 2 columns, one of them in common with the others, and the join should be on this column. As a result, the final data frame should have the common column plus 5 columns, one from each CSV file.
There seem to be several things wrong with the code, or perhaps you have not provided the complete code.
Have you defined fullpath?
You have set header=False, so how will Spark know that there is an "id" column?
Your indentation looks wrong under the for loop.
full_data has not been defined yet, so how are you using it on the right side of the assignment within the for loop? I suspect you have initialized it to the first csv file and are then attempting to join it with the first csv again.
I ran a small test with the below code, which worked for me and addresses the issues raised above. You can adjust it to your needs.
fullpath = '/content/sample_data/'
full_data = spark.read.csv(fullpath + 'Book1.csv',
                           header=True,
                           inferSchema=True)
name_file = ['Book2', 'Book3']
for n in name_file:
    n = spark.read.csv(fullpath + n + '.csv',
                       header=True,
                       inferSchema=True)
    full_data = full_data.join(n, ["id"])
full_data.show(5)
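If the real files genuinely have no header row, a hedged variant of the same loop would be to name the columns yourself before joining. The two-columns-per-file layout and the col_A/col_B names below are assumptions based on the question, not part of the original code.

fullpath = '/content/sample_data/'  # hypothetical path, as in the test above

# Read the first file and give its two columns explicit names.
full_data = (spark.read.csv(fullpath + 'A.csv', header=False, inferSchema=True)
             .toDF('id', 'col_A'))

for name in ['B', 'C', 'D', 'V']:
    df_n = (spark.read.csv(fullpath + name + '.csv', header=False, inferSchema=True)
            .toDF('id', 'col_' + name))   # rename so each file's value column stays distinct
    full_data = full_data.join(df_n, ['id'])

full_data.show(5)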

Unquoted date in first column of CSV for Python/Pandas read_csv

Incoming CSV from an American Express download looks like the below (I would prefer each field to have quotes around it, but it doesn't). Pandas is treating the quoted long number in the second CSV column as the first column of the data frame, i.e. 320193480240275508 becomes my "Date" column:
12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5', 'Amount', 'AmexCategory', 'DebitCredit']
df = pd.read_csv(filenameIn, names=colnames, header=0, delimiter=",")
delimiter=",")
pd.set_option('display.max_rows', 15)
pd.set_option('display.width', 200)
print (df)
print (df.values)
Start
                                Date  ...  DebitCredit
12/13/19  '320193480240275508'        ...          NaN
I have a routine to reformat the date (to handle things like 1/3/19, and to add the century). It is called like this:
df['Date'][j] = reformatAmexDate2(df['Date'][j])
That routine shows the date as follows:
def reformatAmexDate2(oldDate):
    print ("oldDate=" + oldDate)
oldDate='320193480240275508'
I saw this post which recommended dayfirst=True, and added that, but got the same result. I never even told Pandas that column 1 is a date, so I believe it should treat it as text.
IIUC, the problem seems to be names=colnames: it sets new names for the columns being read from the csv file. As you are trying to read specific columns from the csv file, you can use usecols:
df = pd.read_csv(filenameIn,usecols=colnames, header=0, delimiter=",")
Looking at the data, I hadn't noticed the trailing comma after the last column value, i.e. the comma after "DEBIT":
12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,
I just added another column at the end of my columns array:
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5', 'Amount', 'AmexCategory', 'DebitCredit','NotUsed9']
and life is wonderful.
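For completeness, a small self-contained reproduction of why that ninth name matters. The one-row sample below is made up from the question, and header=None is used here only because the made-up sample has no header row (the real file does, hence header=0 above). With only eight names for nine columns, pandas moves the surplus leading column into the index, which is exactly why the date ended up where it did.

import io
import pandas as pd

# Hypothetical one-row sample matching the question; the trailing comma
# after DEBIT produces a ninth, empty column.
raw = "12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,\n"

colnames8 = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5',
             'Amount', 'AmexCategory', 'DebitCredit']
# Eight names for nine columns: the date becomes the index and the
# transaction number shows up under 'Date'.
print(pd.read_csv(io.StringIO(raw), names=colnames8, header=None))

# Nine names (one per column) keep every field where it belongs.
print(pd.read_csv(io.StringIO(raw), names=colnames8 + ['NotUsed9'], header=None))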

extract sub dataframe loc in for loop

I am having a problem slicing a Pandas dataframe into subsets and then joining them in a for loop.
The original dataframe looks like this:
original DataFrame content
*Forgive me that I blocked out some details to avoid issues on data security/policy. As long as the message is conveyed.
This is a query problem: I need to query for each "Surveyname" from another list which contains the survey names, and output a table of all surveys (in the sequence of the list) with their other information from selected columns.
The original Dataframe has columns of these:
Index(['Surveyname', 'Surveynumber', 'Datasetstatus', 'Datasetname',
'Datasetprocessname', 'Datasetdatatype', 'Datasetseismicformat',
'Datasettapedisplay', 'Inventoryid', 'Inventoryname',
'Inventorybarcode', 'Inventoryremarks', 'Datashipmentnumber',
'Facilityid', 'Facilityname', 'Inventoryfullpathfilename',
'Inventorytype', 'Mmsdatasetname', 'Inventorymediatype',
'Inventoryoriglocation', 'Inventoryreceiveddate',
'Inventorydataattribdesc', 'Firstalternatedesc', 'Secondalternatedesc',
'Thirdalternatedesc', 'field26', 'field27', 'field28', 'field29',
'field30', 'field31', 'field32'],
dtype='object')
And I am selecting only these columns as output:
cols =['Surveyname','Surveynumber','Datasettapedisplay','Inventoryid','Inventorybarcode','Inventoryoriglocation']
I set up an empty dataframe at the start and try to append the "queried" subset dataframe to it, hoping it will grow along with the for loop.
The code looks like this:
f = open('EmptySurveyList2.txt', 'r')
cols = ['Surveyname', 'Surveynumber', 'Datasettapedisplay', 'Inventoryid', 'Inventorybarcode', 'Inventoryoriglocation']
setdf = pd.DataFrame(columns=cols)  # create an empty DataFrame
for line in f:
    print(line)
    # check by string content
    df0 = df_MIG.loc[df_MIG['Surveyname'] == line, cols]
    print(df_MIG.loc[df_MIG['Surveyname'] == line, cols])
    # check by string length for exact match
    df0 = df0.loc[df0['Surveyname'].str.len() == len(line), cols]
    print(df0.loc[df0['Surveyname'].str.len() == len(line), cols])
    print('df0:', len(df0))
    setdf = setdf.append(df0)
    print('setdf:', len(setdf))
However, this code still gives me only a few rows, from the very last survey, in the 'setdf' dataframe.
I went on debugging. I found that in the for loop, the df0 dataframe is not finding the survey information from the main df_MIG for any survey in the list except the last one. Here is the output from printing the length of df0 and setdf:
>...Centauro
>
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>df0: 0
>
>setdf: 0
>
>Blueberry
>
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>df0: 0
>
>setdf: 0
>
>Baha (G)
> Surveyname Surveynumber Datasettapedisplay Inventoryid Inventorybarcode \
>219 Baha (G) 329130 FIN 1538554 4210380
>
>Inventoryoriglocation
>219 /wgdisk/hn0016/mc03/BAHA_329130/MIGFIN_639_256...
> Surveyname Surveynumber Datasettapedisplay Inventoryid Inventorybarcode \
>219 Baha (G) 329130 FIN 1538554 4210380
>
>Inventoryoriglocation
>219 /wgdisk/hn0016/mc03/BAHA_329130/MIGFIN_639_256...
>df0: 1
>
>setdf: 1
While if I do the query outside of the loop,
a = "Blueberry"
df0=df_MIG.loc[df_MIG['Surveyname']==a,cols]
df0=df0.loc[df0['Surveyname'].str.len()==len(a),cols]
setdf=setdf.append(df0)
It works normally with no issues: it finds the rows which have the name of the survey and adds them to setdf.
Debugging outside the loop
This is quite a mystery to me. Can anyone help clarify why, or suggest a better alternative?
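No certainty here, but a sketch under one assumption: iterating over a file with for line in f keeps the trailing newline on every line except possibly the last, so 'Centauro\n' never equals 'Centauro' while the final survey name still matches. Stripping the line before comparing would test that (df_MIG and the column list are the ones from the question), and collecting the pieces for a single concat avoids growing the frame with repeated append calls, which newer pandas versions no longer support.

import pandas as pd

cols = ['Surveyname', 'Surveynumber', 'Datasettapedisplay',
        'Inventoryid', 'Inventorybarcode', 'Inventoryoriglocation']

frames = []
with open('EmptySurveyList2.txt') as f:
    for line in f:
        name = line.strip()          # drop the trailing newline / stray whitespace
        if not name:
            continue
        # exact match against the stripped name; df_MIG is the question's dataframe
        frames.append(df_MIG.loc[df_MIG['Surveyname'] == name, cols])

setdf = pd.concat(frames, ignore_index=True)
print('setdf:', len(setdf))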

Save pandas dataframe into csv file

I have a problem saving a large pandas dataframe to a csv file.
Below is just a snapshot of the first 3 lines of the pandas dataframe.
parent_sernum parent_pid created_date sernum pid pn
0 FCH21467XBN UXXXX-XXX-XXXX 2017-12-20 11:02:00.177 SSGQA20741370EA85A,SSGQA... UXXXX, 22RV-A,Uxxx-xx... 15-104065-01,15-104065-0...
1 FCH21467XBN Uxxxx-xxx-xx 2017-12-20 11:38:45.373 SSGQA20741370EA85A,SSGQA... Uxx-xxx-xxxx-A,Uxx-xx-... 15-104065-01,15-104065-0...
2 FCH2145V0UW Uxxx-xxxx-M4S 2017-12-02 11:01:26.993 SSH8A2071935A2ACDE,SSH8A... Uxx-xx-1X324RV-A,UCS-ML-... 15-104064-01,15-104064-0...
When it comes to saving this dataframe into a csv file, it only captures the first letter and drops the rest from most of the columns, as below.
parent_sernum,parent_pid,created_date,sernum,pid,pn
F,U,2017-12-20 11:02:00.177,S,U,1
F,U,2017-12-20 11:38:45.373,S,U,1
F,U,2017-12-02 11:01:26.993,S,U,1
Below is my code for saving the data (df = dataframe).
Is there any option that I need to set to save everything from the dataframe?
Or is there a limitation on csv file size, so that it automatically captures only a fraction of the data to meet the size restriction?
df.to_csv('sample.csv', index = False, header = True)
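There is no answer recorded here, but one thing can be said with confidence: to_csv has no size limit that silently keeps only a fraction of each value, so the truncation is almost certainly happening before the save. A minimal round trip with made-up values shaped like the snapshot above can confirm that; if this writes the full strings, the problem is in how df was built, not in to_csv.

import pandas as pd

# Made-up values shaped like the snapshot above (cells containing commas).
df_check = pd.DataFrame({
    'parent_sernum': ['FCH21467XBN'],
    'sernum': ['SSGQA20741370EA85A,SSGQA20741370EA85B'],
})

df_check.to_csv('sample_check.csv', index=False, header=True)
print(open('sample_check.csv').read())  # the comma-containing cell is written in full, quoted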

Splitting a row of dataframe into multiple rows in spark. The dataframe only has one column which contains an array of string values

I have a csv file that I am reading into spark.
The only column I am reading has an array of time values. I want each time value to be a different row. I have tried a couple of different things like explode but they don't seem to work for me.
val checkin_data = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/home/saurabh/Projects/BigData/Datasets/YelpDataSet/yelp_academic_dataset_checkin.csv")
  .select("time")
This is the result I get if I select the first row
checkin_data.first()
[[u'Fri-0:2', u'Sat-0:1', u'Sun-0:1', u'Wed-0:2', u'Sat-1:2', u'Thu-1:1', u'Wed-1:1', u'Sat-2:1', u'Sun-2:2', u'Thu-2:1', u'Wed-2:1', u'Fri-3:1', u'Sun-3:3', u'Thu-4:1', u'Tue-4:1', u'Sun-6:1', u'Wed-6:1', u'Fri-10:1', u'Sat-10:1', u'Mon-11:1', u'Wed-11:2', u'Mon-12:1', u'Sat-12:1', u'Tue-12:1', u'Sat-13:2', u'Thu-13:1', u'Tue-13:2', u'Wed-13:2', u'Fri-14:2', u'Sat-14:1', u'Wed-14:1', u'Fri-15:1', u'Sat-15:1', u'Thu-15:1', u'Tue-15:1', u'Fri-16:1', u'Sat-16:2', u'Sun-16:1', u'Tue-16:1', u'Sat-17:3', u'Sun-17:1', u'Fri-18:1', u'Mon-18:1', u'Sat-18:2', u'Sun-18:1', u'Tue-18:2', u'Wed-18:1', u'Fri-19:2', u'Mon-19:1', u'Sun-19:2', u'Thu-19:1', u'Wed-19:1', u'Mon-20:1', u'Sun-20:5', u'Thu-20:1', u'Tue-20:1', u'Wed-20:2', u'Fri-21:2', u'Sun-21:1', u'Thu-21:4', u'Tue-21:1', u'Wed-21:1', u'Fri-22:1', u'Thu-22:1', u'Fri-23:1', u'Mon-23:1', u'Sat-23:3', u'Sun-23:1', u'Thu-23:2', u'Tue-23:1']]
Is there a way I can convert each row into multiple rows like this?
Fri-0:2
Sat-0:1
Sun-0:1
Wed-0:2
Sat-1:2
Thu-1:1
I am new to spark so I am sorry if I could not explain this right. Any help is much appreciated.
Spark SQL's explode method should help you!
Here is a post that might help.
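A short PySpark sketch of that idea (the question's snippet is Scala, but explode behaves the same way); the column name and file path come from the question, and the assumption that the time cell is a comma-separated string is based on the sample output, so adjust the delimiter if it differs:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

checkin_data = (spark.read
    .option("header", "true")
    .csv("/home/saurabh/Projects/BigData/Datasets/YelpDataSet/yelp_academic_dataset_checkin.csv")
    .select("time"))

# Split the comma-separated string into an array, then explode so each
# element becomes its own row.
exploded = checkin_data.withColumn("time", F.explode(F.split(F.col("time"), ",")))
exploded.show(6, truncate=False)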
