Azure Machine Learning - Strip top X rows from dataset

I have a plain-text csv file which I am trying to read in Azure ML Studio - the file format is pretty much like this:
Geolife trajectory
WGS 84
Altitude is in Feet
Reserved 3
0,2,255,My Track,0,0,2,8421376
0
39.984702,116.318417,0,492,39744.1201851852,2008-10-23,02:53:04
39.984683,116.31845,0,492,39744.1202546296,2008-10-23,02:53:10
39.984686,116.318417,0,492,39744.1203125,2008-10-23,02:53:15
39.984688,116.318385,0,492,39744.1203703704,2008-10-23,02:53:20
39.984655,116.318263,0,492,39744.1204282407,2008-10-23,02:53:25
39.984611,116.318026,0,493,39744.1204861111,2008-10-23,02:53:30
The real data starts at line 7. How can I strip off the header lines? These files need to be downloaded on the fly, so I don't think I want to strip the header off with some separate code.

What is your source location - SQL, Blob, or HTTP?
If SQL, then you can use a query to skip the first 6 lines.
If Blob/HTTP, I would suggest reading the file as a single-column TSV, then using a simple R/Python script to drop the first 6 rows and convert it to csv, as sketched below.
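A minimal sketch of that Execute Python Script step for Azure ML Studio (classic) might look like the following; the column names and the assumption that the raw file arrives as a single text column (one line per row) are mine, not from the question.

import pandas as pd

# Sketch of an Execute Python Script module body; assumes the upstream reader
# delivered the raw file as a one-column dataset with no header row.
def azureml_main(dataframe1=None, dataframe2=None):
    data = dataframe1.iloc[6:, 0].str.split(",", expand=True)  # drop the 6 header lines, split on commas
    data.columns = ["lat", "lon", "flag", "alt_ft", "days", "date", "time"]  # illustrative names
    return data,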

Related

Split a file using Azure Data Factory based on delimiter

I'm very new to ADF, and I'm kinda stuck with the following use case:
I want to split a big file into multiple smaller files based on a delimiter. There will be a delimiter after some rows. For example, the following is the input file content:
row1content
row2content
row3content
-----
row4content
row5content
-----
row6content
row7content
row8content
row9content
-----
row10content
row11content
row12content
Here ----- is the delimiter on which I want to split the file into multiple smaller output files, named MyFile1, MyFile2, MyFile3, MyFile4 and so on, such that their content will be as follows (split on the delimiter):
MyFile1:
row1content
row2content
row3content
MyFile2:
row4content
row5content
MyFile3:
row6content
row7content
row8content
row9content
MyFile4:
row10content
row11content
row12content
I'm trying to achieve it using Data flows in ADF.
The source and destination of the input/output files will be Azure Blob Storage.
It'll be really helpful if someone can point me in a direction or to a source from which I can proceed further.
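To make the intended splitting logic concrete, here is a rough Python sketch (local files and placeholder names only, not an ADF Data Flow); in ADF, something along these lines could run from a Custom Activity or a notebook rather than a Data Flow.

# Illustrative only: splits a local file on a delimiter row; in practice the
# source and sinks would be blob storage rather than local paths.
def split_on_delimiter(src_path, delimiter="-----", prefix="MyFile"):
    part, lines = 1, []

    def flush():
        nonlocal part, lines
        if lines:
            with open(f"{prefix}{part}.txt", "w") as out:
                out.write("\n".join(lines) + "\n")
            part += 1
            lines = []

    with open(src_path) as src:
        for line in src:
            if line.strip() == delimiter:
                flush()            # close the current chunk at each delimiter
            else:
                lines.append(line.rstrip("\n"))
    flush()                        # write the final chunk after the last delimiter

split_on_delimiter("input.txt")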

write data to text file in azure data factory version 2

It seems ADF v2 does not support writing data to a TEXT file (.TXT).
After selecting File System as the sink,
I don't see TextFormat on the next screen.
So is there any way to write data to a TEXT file?
Thanks,
Thai
Data Factory only supports these six file formats: Avro, Binary, Delimited text, JSON, ORC, and Parquet.
Please see: Supported file formats and compression codecs in Azure Data Factory.
If we want to write data to a txt file, the only format we can use is Delimited text; when the pipeline finishes, you will get a txt file.
Reference: Delimited text: Follow this article when you want to parse the delimited text files or write the data into delimited text format.
For example, I created a pipeline to copy data from Azure SQL to Blob and chose the DelimitedText format for the sink dataset.
The pipeline then produces a txt file in Blob Storage.
Hope this helps
I think what you are looking for is the DelimitedText dataset. You can specify the extension as part of the file name.

Converting 2TB of gziped multiline JSONs to NDJSONs

For my research I have a dataset of about 20,000 gzipped multiline JSON files (~2TB, all with the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam, I'm convinced that the first step would be to convert this dataset to NDJSON. Most books and tutorials assume you are working with a newline-delimited file.
What is the best way to go about converting this data?
I've tried just launching a large instance on gcloud and using gunzip and jq to do this. Not surprisingly, it seems this will take a long time.
Thanks in advance for any help!
Apache Beam supports decompressing files if you use TextIO, but the delimiter remains the newline, so it won't handle multiline JSON by itself.
For multiline JSON you can read each complete file in parallel, convert the JSON string to a POJO, and then reshuffle the data to utilize parallelism.
So the steps would be:
Get the file list > Read individual files > Parse file content to JSON objects > Reshuffle > ...
You can get the file list with FileSystems.match("gs://my_bucket").metadata().
You can read individual files with Compression.detect(fileResourceId.getFilename()).readDecompressed(FileSystems.open(fileResourceId)).
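For reference, a rough equivalent of those steps in Beam's Python SDK might look like the sketch below (the snippets above are from the Java SDK). The bucket pattern and output path are placeholders, and I'm assuming the default AUTO compression setting decompresses the .gz files based on their extension.

import json
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | fileio.MatchFiles("gs://my_bucket/*.json.gz")   # get the file list
     | fileio.ReadMatches()                            # one ReadableFile per match
     | beam.Map(lambda f: f.read_utf8())               # whole file as one string (AUTO decompression)
     | beam.Map(json.loads)                            # parse the multiline JSON
     | beam.Reshuffle()                                # redistribute work across workers
     | beam.Map(json.dumps)                            # one object per line = NDJSON
     | beam.io.WriteToText("gs://my_bucket/ndjson/part"))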
Converting to NDJSON is not necessary if you use sc.wholeTextFiles. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the filename and ._2 is the content of the file.
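A minimal PySpark sketch of that wholeTextFiles route, with placeholder paths; Hadoop's gzip codec decompresses the .gz files transparently, and each element carries the whole file's content, so multiline JSON is not a problem.

import json
from pyspark import SparkContext

sc = SparkContext(appName="to-ndjson")

(sc.wholeTextFiles("gs://my_bucket/raw/*.json.gz")   # RDD[(filename, whole file content)]
   .map(lambda kv: json.loads(kv[1]))                # parse each file's multiline JSON
   .map(json.dumps)                                  # re-serialize as one line per record
   .saveAsTextFile("gs://my_bucket/ndjson"))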

We have many mainframe files in EBCDIC format. Is there a way in Python to parse or convert a mainframe file into a CSV or text file?

I need to read the records from the mainframe file and apply some filters on the record values.
So I am looking for a solution to convert the mainframe file to CSV, text, or an Excel workbook so that I can easily perform operations on the file.
I also need to validate the record count.
The first question is whether the file is really all EBCDIC text, or whether it also contains binary fields (packed decimal, COMP, and so on).
If it is all text, then FTP'ing with EBCDIC-to-ASCII translation is doable, including from within Python.
If not, then either:
the extraction and conversion to CSV needs to happen on z/OS, perhaps with a COBOL program, and the resulting CSV can then be FTP'ed down with text (EBCDIC-to-ASCII) translation,
or
the data has to be FTP'ed in BINARY mode and then parsed, with the relevant pieces translated.
But, as so often is the case, we need more information.
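If the BINARY route is taken and the records really are fixed-length text, a minimal sketch of the translation step could look like this; the record length and the cp037 (EBCDIC US/Canada) code page are assumptions that depend on the actual file.

RECLEN = 80  # assumed fixed record length

records = []
with open("mainframe.bin", "rb") as src:
    while True:
        chunk = src.read(RECLEN)
        if not chunk:
            break
        records.append(chunk.decode("cp037"))  # EBCDIC (US/Canada) -> str

print(len(records), "records read")            # simple record-count validation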
I was recently processing the hardcopy log and wanted to break the records apart. I used Python to do this, as each record is effectively a fixed-position record with different data items at fixed locations. In my case the entire record was text, but one could easily apply this technique to convert various columns to an appropriate type.
Here is a sample record. I added a few lines to help visualize the data offsets used in the code to access the data:
          1         2         3         4         5         6         7         8         9
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
N 4000000 PROD     19114 06:27:04.07 JOB02679 00000090  $HASP373 PWUB02#C STARTED - INIT 17
Note the fixed column positions for the various items and how they are referenced by position. Using this technique you could process the file and create a CSV with the output you want for processing in Excel.
For my case I used Python 3.
def processBaseMessage(self, message):
    self.command = message[1]            # single character at offset 1
    self.routing = list(message[2:9])    # raw routing-code field
    self.routingCodes = []               # These are routing codes extracted from the system log.
    self.sysname = message[10:18]        # system name, e.g. "PROD"
    self.date = message[19:24]           # Julian date, e.g. "19114"
    self.time = message[25:36]           # timestamp, e.g. "06:27:04.07"
    self.ident = message[37:45]          # job/task identifier, e.g. "JOB02679"
    self.msgflags = message[46:54]       # message flags
    self.msg = [message[56:]]            # message text
You can then format the fields into whatever form you need for further processing; a small usage sketch follows. There are other ways to process mainframe data, but based on the question this approach should suit your needs, though there are many variations.
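As a hypothetical usage of the method above, one could feed log lines through it and write a CSV for filtering in Excel or pandas; the file names and the chosen output fields here are illustrative.

import csv
from types import SimpleNamespace

with open("hardcopy.log") as src, open("hardcopy.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["sysname", "date", "time", "ident", "message"])
    for line in src:
        rec = SimpleNamespace()                      # bare attribute holder in place of a class
        processBaseMessage(rec, line.rstrip("\n"))   # reuse the function defined above
        writer.writerow([rec.sysname.strip(), rec.date, rec.time,
                         rec.ident.strip(), rec.msg[0]])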

Databricks - CSV not loading properly

I have a simple pipe-delimited csv file that I can load into Databricks; when I display the df, it displays nicely. I then try with my main dataset, which is formatted the same way and is an export from SQL Server. After it loads, the output shows that it was loaded (it lists the field names and the data types it inferred - all string, though, which isn't a good sign).
df = spark.read.format("csv").options(header='true', quote='"', delimiter="|",ignoreLeadingWhiteSpace='true',inferSchema='true').load("/mnt/gl/mainfile.csv")
Then I do display(df) and I don't see a nice table. Instead it shows only the following:
Job 34 View
(Stages: 1/1)
Job 35 View
(Stages: 1/1)
Job 36 View
(Stages: 1/1)
Obviously the csv is at fault here, but I've no idea how to go about solving this - I've already been careful with how I export it from SQL Server, so I'm not sure what I'd do differently there.
OK, I solved it. If you get a similar issue it might mean your csv is not formatted properly. Open up your csv in a text editor like Ron's Editor and visually inspect the data. On my dataset, for some reason the final field, which is a $ amount, had a " in front of it but not at the end of it,
e.g. "12344.67
I'm not sure why SQL Server would do that (I was using the Import/Export Wizard), but I removed the " qualifier from my exported csv and it now works fine.
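If fixing the export isn't an option, one possible workaround is to turn quote handling off when reading, so a stray " is treated as plain data instead of the start of an unterminated quoted value; per the Spark CSV reader options, passing an empty string for quote disables quoting (the stray character then remains embedded in that value and would need to be cleaned up afterwards). A sketch based on the read call above:

df = (spark.read.format("csv")
      .options(header="true",
               delimiter="|",
               quote="",                        # empty string turns quoting off
               ignoreLeadingWhiteSpace="true",
               inferSchema="true")
      .load("/mnt/gl/mainfile.csv"))
display(df)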
