Databricks - CSV not loading properly

I have a simple pipe-delimited CSV file that I can load into Databricks; when I display the df it displays nicely. I then try with my main dataset, which is formatted the same way and is an export from SQL Server. After it loads, the output shows that it was loaded (it lists the field names and the data types it inferred -- all string though, which isn't a good sign).
df = spark.read.format("csv").options(header='true', quote='"', delimiter="|",ignoreLeadingWhiteSpace='true',inferSchema='true').load("/mnt/gl/mainfile.csv")
Then I do display(df) and I don't see a nice display. Instead it shows the following:
Job 34 View
(Stages: 1/1)
Job 35 View
(Stages: 1/1)
Job 36 View
(Stages: 1/1)
Obviously the CSV is at fault here, but I have no idea how to go about solving this - I've already been careful with how I export it from SQL Server, so I'm not sure what I'd do differently there.

OK, I solved it. If you get a similar issue, it might mean your CSV is not formatted properly. Open up your CSV using a text editor like Ron's Editor, then visually inspect the data. In my dataset, for some reason the final field, which is a $ amount, had a " in front of it but not at the end of it.
e.g. "12344.67
Not sure why SQL Server would do that (I was using the Import/Export Wizard), but I got rid of the stray " from my exported CSV and it now works fine.
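If you want to sanity-check the file without leaving Databricks, here is a minimal sketch (not from the original post; the path is the one used above and the check is just one assumed way to spot the problem) that flags raw lines containing an unbalanced number of quote characters:

from pyspark.sql import functions as F

# Read the raw file line by line, with no CSV parsing at all.
raw = spark.read.text("/mnt/gl/mainfile.csv")

# Count the double quotes on each line; an odd count means an unbalanced quote.
quote_count = F.length("value") - F.length(F.regexp_replace("value", '"', ""))
suspect = raw.filter(quote_count % 2 != 0)

display(suspect)  # shows the offending lines, e.g. ...|"12344.67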

Related

I could not write a data frame of tweets for a specific hashtag to a CSV file; when I subset some of the variables, the CSV file does not show only the subsetted variables

I extracted Twitter data into an R script via the (rtweet) and (tidyverse) packages for a specific hashtag. After successfully getting the data, I need to subset some of the variables that I want to analyze. I wrote the subsetting code and the console shows the subsetted variables. Despite this, when I try to write this to a CSV file, the written CSV shows all of the variables instead of only the subsetted ones. The code I entered is as follows.
twitter_data_armenian_issue_iki <- search_tweets("ermeniler", n=1000, include_rts = FALSE)
view(twitter_data_armenian_issue_iki)
twitter_data_armenian_issue_iki_clean <- cbind(twitter_data_armenian_issue_iki, users_data(twitter_data_armenian_issue_iki)[,c("id","id_str","name", "screen_name")])
twitter_data_armenian_issue_iki_clean <-twitter_data_armenian_issue_iki_clean[,! duplicated(colnames(twitter_data_armenian_issue_iki_clean))]
view(twitter_data_armenian_issue_iki_clean)
data_bir <-data.frame(twitter_data_armenian_issue_iki_clean)
data_bir[ , c("created_at", "id", "id_str", "full_text", "name", "screen_name", "in_reply_to_screen_name")]
write.csv(data_bir, "newdata.csv", row.names = FALSE, )
If anyone wants to help me, I would be very pleased. Thank you.
I tried to get Twitter data with only the specific columns that I want to analyze. In order to do this, I entered the subsetting code and ran it. But when I tried to write this to the CSV, the written CSV file shows all of the variables. I checked the environment panel and could not see the subsetted data.
My question is: how can I add the subsetted data to the environment and write it to the CSV without any error?

How to manage mangled data when importing from your source in sqoop or pyspark

I have been working on a project to import the Danish 2.5 million ATM transactions data set to derive some visualizations.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using pyspark.
Link to the dataset here : https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled; a sample row is shown below.
So I have a very basic Sqoop command which gets the details from the source database. However, I run into an issue with values that contain a double quote ", especially in the column message_text.
Sqoop Command :
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it was imported:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However the expected output should be
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this, hoping that pyspark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep","|").csv("/user/root/etl_project/part-m-00000", header = False,schema = transaction_schema)
However, when I inspect my rows, I see that the mangled data has caused the dataframe to put the affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns to manage this and use a regex replacement to fill in these values?
Or is there a better way to import the data and manage these mangled values, either in Sqoop or in pyspark?
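One hedged alternative to per-column regex fixes is to turn off quote handling entirely when parsing, so the stray " stays inside message_text instead of swallowing the fields that follow it. A minimal sketch, reusing the path and transaction_schema from the question:

# Spark's CSV reader treats " as a quote character by default; per the
# DataFrameReader.csv docs, setting quote to an empty string disables quoting.
transactions = (
    spark.read
         .option("sep", "|")
         .option("quote", "")
         .schema(transaction_schema)
         .csv("/user/root/etl_project/part-m-00000")
)

transactions.filter(transactions.message_code == "4017").show(1, truncate=False)

Whether the leftover " should then be stripped from message_text (for example with regexp_replace) depends on what the downstream visualizations need.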

How to read sequence files exported from HBase

I used the following code to export an HBase table and save the output to HDFS:
hbase org.apache.hadoop.hbase.mapreduce.Export \
MyHbaseTable1 hdfs://nameservice1/user/ken/data/exportTable1
The output files are binary. If I use pyspark to read the folder:
test1 = sc.textFile('hdfs://nameservice1/user/ken/data/exportTable1')
test1.take(5)
It shows:
u'SEQ\x061org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result\x00\x00\x00\x00\x00\x00\ufffd-\x10A\ufffd~lUE\u025bt\ufffd\ufffd\ufffd&\x00\x00\x04\ufffd\x00\x00\x00'
u'\x00\x00\x00\x067-2010\ufffd\t'
u'|'
u'\x067-2010\x12\x01r\x1a\x08clo-0101 \ufffd\ufffd\ufffd*(\x042\\6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0'
u'u'
I can tell that
'7-2010' in the 2nd line is the Rowkey,
'r' in the 4th line is the column family,
'clo-0101' in the 4th line is the column name,
'6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0' is the value.
I don't know where the 3rd and 5th lines came from. It seems like the HBase export followed its own rules to generate the file; if I use my own way to decode it, the data might get corrupted.
Question:
How can I convert this file back to a readable format? For example:
7-2010, r, clo-0101, 6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0
I have tried:
test1 = sc.sequenceFile('/user/youyang/data/hbaseSnapshot1/', keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
test1.take(5)
and
test1 = sc.sequenceFile('hdfs://nameservice1/user/ken/data/exportTable1'
, keyClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
, valueClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
, keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
, valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
, minSplits=None
, batchSize=100)
No luck; the code did not work. ERROR:
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
Any suggestions? Thank you!
I had this problem recently myself. I solved it by moving away from sc.sequenceFile and instead using sc.newAPIHadoopFile (or just hadoopFile if you're on the old API). The Spark SequenceFile reader appears to only handle keys/values that are Writable types (it's stated in the docs).
If you use newAPIHadoopFile, it uses the Hadoop deserialization logic, and you can specify which serialization types you need in the config dictionary you give it:
hadoop_conf = {"io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization"}
sc.newAPIHadoopFile(
<input_path>,
'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
valueClass='org.apache.hadoop.hbase.client.Result',
keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
conf=hadoop_conf)
Note that the value in hadoop_conf for "io.serializations" is a comma separated list which includes "org.apache.hadoop.hbase.mapreduce.ResultSerialization". That is the key configuration you need to be able to deserialize the Result. The WritableSerialization is also needed in order to be able to deserialize ImmutableBytesWritable.
You can also use sc.newAPIHadoopRDD, but then you also need to set a value for "mapreduce.input.fileinputformat.inputdir" in the config dictionary.
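For reference, a minimal sketch of that newAPIHadoopRDD variant (the HDFS path is the one from the question; the classes and converters mirror the newAPIHadoopFile call above):

hadoop_conf = {
    "io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,"
                         "org.apache.hadoop.hbase.mapreduce.ResultSerialization",
    # newAPIHadoopRDD takes its input path from the config rather than an argument
    "mapreduce.input.fileinputformat.inputdir": "hdfs://nameservice1/user/ken/data/exportTable1",
}

rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
    keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    valueClass='org.apache.hadoop.hbase.client.Result',
    keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
    valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
    conf=hadoop_conf)

rdd.take(1)  # each element should come back as a (rowkey, result-string) pair after conversion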

Filtering SAS datasets created with PROC IMPORT with dbms=xlsx

My client uses SAS 9.3 running on an AIX (IBM Unix) server. The client interface is SAS Enterprise Guide 5.1.
I ran into this really puzzling problem: when using PROC IMPORT in combination with dbms=xlsx, it seems impossible to filter rows based on the value of a character variable (at least, when we look for an exact match).
With an .xls file, the following import works perfectly well; the expected subset of rows is written to myTable:
proc import out = myTable(where=(myString EQ "ABC"))
datafile ="myfile.xls"
dbms = xls replace;
run;
However, using the same data but this time in an .xlsx file, an empty dataset is created (having the right number of variables and adequate column types).
proc import out = myTable(where=(myString EQ "ABC"))
datafile ="myfile.xlsx"
dbms = xlsx replace;
run;
Moreover, if we exclude the where from the PROC IMPORT, the data is seemingly imported correctly. However, filtering is still not possible. For instance, this will create an empty dataset:
data myFilteredTable;
set myTable;
where myString EQ "ABC";
run;
The following will work, but is obviously not satisfactory:
data myFilteredTable;
set myTable;
where myString LIKE "ABC%";
run;
Also note that:
Using compress or other string functions does not help
Filtering using numerical columns works fine for both xls and xlsx files.
My preferred method to read spreadsheets is to use excel libnames, but this is technically not possible at this time.
I wonder if this is a known issue; I couldn't find anything about it so far. Any help appreciated.
It sounds like your strings have extra values on the end not being picked up by compress. Try using the countc function on MyString to see if any extra characters exist on the end. You can then figure out what characters to remove with compress once they're determined.

How to import data into Teradata tables from an Excel file using BTEQ import?

I'm trying to import data into tables from a file using BTEQ import.
I'm facing weird errors while doing this.
Like:
If I'm using a text file as the input data file, with ',' as the field separator, I get the error below:
*** Failure 2673 The source parcel length does not match data that was defined.
or
If I'm using an Excel file as the input data file, I get the error below:
* Growing Buffer to 53200
* Error: Import data size does not agree with byte length.
The cause may be:
1) IMPORT DATA vs. IMPORT REPORT
2) incorrect incoming data
3) import file has reached end-of-file.
*** Warning: Out of data.
Please help me out by giving the syntax for BTEQ import using a txt file as the input data file, and also the syntax if we use an Excel file as the input data file.
Also, is there any specific format required for the input data file so the data is read correctly?
If so, please give me the info about that.
Thanks in advance:)
EDIT
Sorry for not posting the script at first.
I'm new to Teradata and have yet to explore other tools.
I was asked to write the script for BTEQ import.
.LOGON TDPD/XXXXXXX,XXXXXX
.import VARTEXT ',' FILE = D:\cc\PDATA.TXT
.QUIET ON
.REPEAT *
USING COL1 (VARCHAR(2)) ,COL2 (VARCHAR(1)) ,COL3 (VARCHAR(56))
INSERT INTO ( COL1 ,COL2 ,COL3) VALUES ( :COL1 ,:COL2 ,:COL3);
.QUIT
I executed the above script and it succeeds using a txt file (separating the fields with commas) and declaring the data types as VARCHAR.
Sample input txt file:
1,b,helloworld1
2,b,helloworld2
3,D,helloworld1
12,b,helloworld1
I also tried to do the same using tab (\t) as the field separator, but it gives the same old error.
Q) Does this work only for comma-separated txt files?
Also, could you please tell me where I can find the BTEQ manual...
Thanks a lot
Can you post your BTEQ script? May I also ask why you are using BTEQ instead of FastLoad or MultiLoad?
The text file error is possibly due to the data types declared in the USING clause. I believe they need to be declared as VARCHAR when reading delimited input (e.g. declare VARCHAR(10) for INTEGER fields).
As for Excel, I can't find anything in the BTEQ manual that says that BTEQ can handle .xls files.
For your tab delimited files, are you doing this (that's a tab character below)?
.import vartext ' '
Or this?
.import vartext '\t'
The former works, the latter doesn't.
The BTEQ manual that I have is on my work machine. One of the first Google results for "BTEQ manual" should yield one online.
