Loading special characters via PolyBase - Azure

I am trying to load a file whose string fields are delimited by single quotes, and I am able to load the data except for certain records whose string values are in the format below. How can I load the values below using PolyBase in SQL Data Warehouse? Any input is highly appreciated.
Eg:
'Don''t Include'
'1'''
'Can''t'
'VM''s'
External File Format:
CREATE EXTERNAL FILE FORMAT SAMPLE_HEADER
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '''',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss',
        USE_TYPE_DEFAULT = False
    )
);

In this case your string delimiter needs to be something other than a single quote.
I assume you're using a comma-delimited file. You have a couple of options:
Make your column delimiter something other than comma.
Make your string delimiter a character that does not exist in your data
Use an output format other than CSV, such as Parquet or ORC
If you're going to use a custom delimiter, I suggest ASCII Decimal(31) or Hex(0x1F), which is specifically reserved for this purpose.
If you're going to use a string delimiter you might use double-quote (but I'm guessing this is in your data) or choose some other character.
That said, my next guess is that you're going to come across data with embedded carriage returns, and this is going to cause yet another layer of problems. For that reason, I suggest you move your extracts to something other than CSV, and look to Parquet or ORC.
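If Python happens to be available for the extract step, a minimal pandas sketch of that Parquet route might look like the following (file names are placeholders; pandas and pyarrow are assumed to be installed):
# Minimal sketch: re-export a quoted CSV extract as Parquet so PolyBase
# never has to deal with delimiters, quotes, or embedded newlines.
# File names are placeholders; requires pandas and pyarrow.
import pandas as pd

# doublequote handling is on by default, so a value like 'Don''t Include' parses correctly
df = pd.read_csv("extract.csv", quotechar="'")

# Parquet preserves column types and needs no string delimiter at load time
df.to_parquet("extract.parquet", index=False)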

Currently, PolyBase in SQL DW does not support an escape character in the delimited text format, so you cannot load your file directly into SQL DW.
In order to load your file, you can pre-process it. During pre-processing you can generate another data file, either in a binary format (PARQUET or ORC, which PolyBase can read directly) or another delimited file with a special field separator (any character that is not expected in your data, e.g. | or ~). With such a separator there is no need to escape or quote the values.
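As a rough sketch of that pre-processing step (file names and the pipe separator here are just assumptions), Python's csv module can rewrite the quoted extract with a separator that never occurs in the data:
# Sketch: rewrite a single-quote-delimited CSV as a pipe-separated file.
# File names are placeholders; choose any separator that cannot occur in the
# data, e.g. '|', '~' or chr(31) (the ASCII unit separator).
import csv

with open("extract.csv", newline="") as src, \
        open("extract_pipe.txt", "w", newline="") as dst:
    # the csv module unescapes the doubled '' inside quoted fields automatically
    reader = csv.reader(src, delimiter=",", quotechar="'")
    # nothing needs quoting on output if '|' never appears in the data
    writer = csv.writer(dst, delimiter="|")
    for row in reader:
        writer.writerow(row)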
Hope it helps.
From Azure docs:
<format_options> ::=
{
FIELD_TERMINATOR = field_terminator
| STRING_DELIMITER = string_delimiter
| First_Row = integer -- ONLY AVAILABLE SQL DW
| DATE_FORMAT = datetime_format
| USE_TYPE_DEFAULT = { TRUE | FALSE }
| Encoding = {'UTF8' | 'UTF16'}
}

Related

Different delimiters on different lines in the same file for Databricks Spark

I have a file that has a mix of comma-delimited lines and pipe-delimited lines that I need to import into Databricks.
Is it possible to indicate the use of two or more different separators when creating a SQL table in Databricks/Spark?
I see lots of posts for multiple character separators, but nothing on different separators.
https://forums.databricks.com/questions/10743/how-to-read-file-in-pyspark-with-delimiter.html
Possible to handle multi character delimiter in spark
http://blog.madhukaraphatak.com/spark-3-introduction-part-1
etc.
I'm currently using something like this.
create table myschema.mytable (
foo string,
bar string
)
using csv
options (
header = "true",
delimiter = ","
);
One method you could try is to create a Spark DataFrame first and then make a table out of it. Below is an example for a hypothetical case, using PySpark, where the delimiters were | and -.
BEWARE: we are using split, which means everything gets split; e.g. 2000-12-31 is a single value yet it will be split. Therefore we should be very sure that no such case can ever occur in the data. As general advice, one should never accept these types of files, as they are accidents waiting to happen.
How the sample data looks: in this case we have 2 files in our directory with | and - occurring randomly as delimiters.
# Create RDD. Basically read as simple text file.
# sc is spark context
rddRead = sc.textFile("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/")
rddRead.collect() # For debugging
import re # Import for usual python regex
# Create another RDD using simple string operations. This will be similar to a list of lists.
# Give a regex expression to split each string on the anticipated delimiters (this could be dangerous
# if those delimiters occur as part of a value, e.g. 2021-12-31 is a single value in reality,
# but that is the price we have to pay for not having good data).
# For each iteration, k represents 1 element which would eventually become 1 row (e.g. A|33-Mech)
rddSplit = rddRead.map(lambda k: re.split("[|-]+", k)) # Anticipated delimiters are | OR - in this case.
rddSplit.collect() # For debugging
# This block is applicable only if you have headers
lsHeader = rddSplit.first() # Get First element from rdd as header.
print(lsHeader) # For debugging
print()
# Remove rows representing the header. (Note: this assumes the column names are the
# same in all files. If not, each header would have to be filtered out by manually
# specifying all of them, which would be a nightmare for both code quality and maintenance.)
rddData = rddSplit.filter(lambda x: x != lsHeader)
rddData.collect() # For debugging
# Convert rdd to spark dataframe
# Utilise the header we got in earlier step. Else can give our own headers.
dfSpark = rddData.toDF(lsHeader)
dfSpark.display() # For debugging
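An alternative sketch of the same idea, staying in the DataFrame API instead of RDDs (the path is reused from above; the column names below are placeholders):
# Sketch: the same splitting logic without RDDs, using Spark SQL functions.
# Path reused from above; column names below are placeholders.
from pyspark.sql import functions as F

raw = spark.read.text("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/")

# Split each line on any run of the anticipated delimiters (| or -).
parts = raw.select(F.split(F.col("value"), "[|-]+").alias("parts"))

# Pick columns out by position; header rows would still need to be filtered out.
dfAlt = parts.select(
    F.col("parts")[0].alias("col1"),
    F.col("parts")[1].alias("col2"),
    F.col("parts")[2].alias("col3"),
)
dfAlt.show()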

Can you remove a random string of commas and replace with one for exporting to CSV

I am using Netmiko to extract some data from Cisco switches and routers. I would like to put that data into a spreadsheet. For example, show cdp neighbour would give me a string with random white space in it:
Port Name Status Vlan Duplex Speed Type
Et0/0 connected 1 auto auto unknown
Et0/1 connected 1 auto auto unknown
Et0/2 connected routed auto auto unknown
Et0/3 connected 1 auto auto unknown
I thought I could remove the white space and replace it with a comma, but I get this:
Port,,,,,,Name,,,,,,,,,,,,,,,Status,,,,,,,Vlan,,,,,,,Duplex,,Speed,Type
Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/1,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/2,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,routed,,,,,,,auto,,,auto,unknown
Et0/3,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Is there any way of extracting data like the above? Ideally it would go straight into a structured table in Excel (cells and rows), or alternatively I could do what I did and then replace each run of repeated commas with just one, so I can export to CSV and then import to Excel. I may be the most long-winded person you have ever seen because I am so new to programming :)
I'd go with regex matches, which are more flexible; you can adapt this to your needs. I put the data in a file for testing, but you could process one line at a time instead.
Here's the file (called mydata.txt)
Et0/0,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/1,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Et0/2,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,routed,,,,,,,auto,,,auto,unknown
Et0/3,,,,,,,,,,,,,,,,,,,,,,,,connected,,,,1,,,,,,,,,,,,auto,,,auto,unknown
Here's how to read it and write the result to a csv file (mydata.csv)
import re
_re = re.compile('([^,]+)+')
newfile = open(r'mydata.csv', 'w')
with open(r'mydata.txt') as data:
    for line in data.readlines():
        newfile.write(','.join(f for f in _re.findall(line)))
newfile.close()
And here is the output
Et0/0,connected,1,auto,auto,unknown
Et0/1,connected,1,auto,auto,unknown
Et0/2,connected,routed,auto,auto,unknown
Et0/3,connected,1,auto,auto,unknown
Explanation:
The re library allows the use of regular expressions for parsing
text, so the first line imports it.
The second line specifies the regular expression to extract anything
that is not a comma, but it is only a specification; it doesn't
actually do the extraction.
The third line opens the output file, with 'w' specifying that we
can write to it; that file is referenced by the name 'newfile'. The
next line opens the input file.
The fifth line reads each line from the input file one at a time.
The sixth line is an all-at-once operation to separate the non-comma
parts of the input, join them back together separated by commas, and
write the resulting string to the output file.
The last line closes the output file.
I hope I didn't misunderstand you. To turn those repeated commas into a single comma, just run this code with your string s:
while ",," in s:
    s = s.replace(",,", ",")
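The same collapse can also be done in one pass with a regular expression, for example (a small sketch using a sample line from above):
import re

s = "Et0/0,,,,,,,,connected,,,,1,,,,,,auto,,,auto,unknown"
s = re.sub(",+", ",", s)  # collapse every run of commas into a single comma
print(s)                  # Et0/0,connected,1,auto,auto,unknown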

Behavior of spark.read.csv with inferschema=True in case of multiple file loading

I am facing a difficulty (rather, a strange result) while trying to load multiple CSV files simultaneously into Spark with
df = spark.read.csv('/dir/*.csv', header=True, inferSchema=True)
df.printSchema()
# Sale_Value String(nullable=true) # it should be Double/float
# Name String # Fine for all string columns
So basically all my integer and double columns are being converted to string type, while I expect them to be double, since I am passing the inferSchema parameter as True.
I checked my data and confirmed that there are no null values or stray string values in those columns.
The strange thing is:
I read each file into a separate df, e.g. df1 = spark.read.csv(file1, inferSchema=True),
df2 = spark.read.csv(file2, inferSchema=True), then printed the schema of each dataframe. All the schemas are as expected (doubles come out as double, strings as string).
I then started appending the separate dfs into a single df, like df = df1.union(df2) and so on, and checked df.printSchema(); again all results are as expected. No issue there.
So I am confused by this behavior when multiple files are loaded in a single load statement (*.csv).
Is there anything I am missing about inferSchema behavior? Please shed some light.
Edit
My data is enclosed in double quotes (to avoid splitting records when a comma appears inside a field), e.g.: "name","sale_target","sale_V","gender"....
I have 3 files, and 2 of them are quote-enclosed while one is not (that is the reason I faced this issue; I dropped the file without quotes and, guess what, everything worked perfectly).
So is it mandatory that if I am using/reading/loading multiple CSV files, either all files must be quote-enclosed or all files must be without quotes?
File location https://github.com/satya-panda/king-s-things
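For reference, one way to take schema inference out of the picture entirely is to pass an explicit schema when reading all the files together; a rough sketch (the column names and types below are only guesses based on the description above):
# Sketch: supply an explicit schema instead of relying on inferSchema.
# Column names and types are assumptions based on the question above.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("sale_target", DoubleType(), True),
    StructField("sale_V", DoubleType(), True),
    StructField("gender", StringType(), True),
])

df = spark.read.csv("/dir/*.csv", header=True, schema=schema, quote='"')
df.printSchema()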

Error in U-SQL Job on Azure Data Lake

I have lots of json files in my Azure Data Lake account. They are organized as: Archive -> Folder 1 -> JSON Files.
What I want to do is extract a particular field, timestamp, from each JSON file and then just put it in a CSV file.
My issue is:
I started with this script:
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM "correct_path/Assemblies/JSON/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM "correct_path/Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @INPUT_FILE string = @"correct_path/Tracking_3e9.json";

// Extract the different properties from the JSON file using a JsonExtractor
@json =
    EXTRACT Partition string, Custom string
    FROM @INPUT_FILE
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

OUTPUT @json
TO "correct_path/Output/simple.csv"
USING Outputters.Csv(quoting : false);
I get error:
E_STORE_USER_FILENOTFOUND: File not found or access denied
But I do have access to the file in the Data Explorer of the Azure Data Lake, so how can that be?
I don't want to run it for each file one by one. I just want to give it all the files in a folder (like Tracking*.json), or a bunch of folders (like Folder*), and it should go through them and put the output for each file in a single row of the output CSV.
I haven't found any tutorials on this.
Right now I am reading the entire JSON. How do I read just one field, like timestamp, which is a field nested within another field, like data: {timestamp: "xxx"}?
Thanks for your help.
1) I'm not sure why you're running into that error without more information - is it specifically the input file that's missing, or the assemblies?
2) You can use a fileset to extract data from a set of files. Just use {} to denote the wildcard in your input string, and then capture that value in a new column. So, for example, your input string could be @"correct_path/{day}/{hour}/{id}.json", and your extract statement becomes:
EXTRACT
    column1 string,
    column2 string,
    day int,
    hour int,
    id int
FROM @input
3) You'll have to read the entire JSON in your SELECT statement, but you can refine it down to only the rows you want in future rowsets. For example:
@refine =
    SELECT timestamp FROM @json;

OUTPUT @refine
...
It sounds like some of your JSON data is nested however (like the timestamp field). You can find information on our GitHub (Using the JSON UDFs) and in this blog for how to read nested JSON data.
Hope this helps, and please let me know if you have additional questions!

How to parse CSV files with double-quoted strings in Julia?

I want to read CSV files where the columns are separated by commas. The columns can be strings and if those strings contain a comma in their content, they are wrapped in double-quotes. Currently I'm loading my data using:
file = open("data.csv","r")
data = readcsv(file)
But this code would split the following string into 4 pieces whereas it should only be 3:
1,"text, more text",3,4
Is there a way in Julia's Standard Library to parse CSV while respecting quoting or do I have to write my own custom solution?
The readcsv function in base is super-basic (just blindly splitting on commas).
You will probably be happier with readtable from the DataFrames.jl package: http://juliastats.github.io/DataFrames.jl/io.html
To use the package, you just need to run Pkg.add("DataFrames") and then import it with using DataFrames.
The readcsv function in base (0.3 prerelease) can now read quoted columns.
julia> readcsv(IOBuffer("1,\"text, more text\",3,4"))
1x4 Array{Any,2}:
1.0 "text, more text" 3.0 4.0
It is much simpler than DataFrames, but may be quicker if you just need the data as an array.
