Store large text file to DB using OData - string

I need to store a very large string in a backend table, in a single field of type string.
The string I am storing is over 10 million (1 crore) characters long, and it takes a long time to store and retrieve it from the backend.
I tried compression algorithms, but they failed to compress such a large string.
So what is the best way to handle this situation and improve the performance?
Technologies used:
front end - SAP UI5,
gateway - OData,
backend - SAP ABAP.
Compressing methods tried:
https://github.com/tcorral/JSONC
https://github.com/floydpink/lzwCompress.js
The above compression methods weren't able to solve my problem.

Well, Marc is right that transferring XLSX is definitely better/faster than JSON.
ABAP JSON tools are not especially rich, but they are sufficient for most manipulations; more peculiar tasks can be done via internal tables and transformations. So it is highly recommended to perform your operations (XLSX >> JSON) on the backend server.
As for the backend DB table, I agree with Chris N that inserting a 10M-character string into a STRING field is about the worst idea imaginable. The recommended way of storing big files in transparent tables is to use the XSTRING type. XSTRING is a kind of BLOB for ABAP and is much faster at handling binary data.
I ran some SAT performance tests on a sample 14-million-character file, and this is what I got.
INSERT into XSTRING field:
INSERT into STRING field:
As you can see, the net time of the DB operations differs significantly, and not in favour of STRING.
Your upload code can look like this:
DATA: len          TYPE i,
      lt_content   TYPE STANDARD TABLE OF tdline,
      ws_store_tab TYPE zstore_tab.

"Upload the file into an internal table
CALL FUNCTION 'GUI_UPLOAD'
  EXPORTING
    filename   = '/TEMP/FILE.XLSX'
    filetype   = 'BIN'
  IMPORTING
    filelength = len
  TABLES
    data_tab   = lt_content.
IF sy-subrc <> 0.
  MESSAGE 'Unable to upload file' TYPE 'E'.
ENDIF.

"Convert the binary itab to an xstring
CALL FUNCTION 'SCMS_BINARY_TO_XSTRING'
  EXPORTING
    input_length = len
    first_line   = 0
    last_line    = 0
  IMPORTING
    buffer       = ws_store_tab-file "should be of type XSTRING!
  TABLES
    binary_tab   = lt_content
  EXCEPTIONS
    failed       = 1
    OTHERS       = 2.
IF sy-subrc <> 0.
  MESSAGE 'Unable to convert binary to xstring' TYPE 'E'.
ENDIF.

INSERT zstore_tab FROM ws_store_tab.
IF sy-subrc IS INITIAL.
  MESSAGE 'Successfully uploaded' TYPE 'S'.
ELSE.
  MESSAGE 'Failed to upload' TYPE 'E'.
ENDIF.
For parsing and manipulating XLSX, multiple AS ABAP wrappers already exist; examples are here, here and here.
All of this concerns backend-side optimization. Frontend optimizations are welcome from UI5 experts (to whom I don't belong); however, the general SAP recommendation is to move all massive manipulation to the application server.

Read n rows from csv in Google Cloud Storage to use with Python csv module

I have a variety of very large (~4GB each) csv files in different formats. They come from data recorders from over 10 different manufacturers, and I am attempting to consolidate all of them into BigQuery. To load them on a daily basis I want to first load the files into Cloud Storage, determine the schema, and then load into BigQuery. Because some of the files have additional header information (from 2 to ~30 lines), I have written my own functions to determine the most likely header row and the schema from a sample of each file (~100 lines), which I can then use in the job_config when loading the files to BQ.
This works fine when I am working with files from local storage direct to BQ as I can use a context manager and then Python's csv module, specifically the Sniffer and reader objects. However, there does not seem to be an equivalent method of using a context manager direct from Storage. I do not want to bypass Cloud Storage in case any of these files are interrupted when loading into BQ.
What I can get to work:
# initialise variables
with open(csv_file, newline='', encoding=encoding) as datafile:
    dialect = csv.Sniffer().sniff(datafile.read(chunk_size))
    datafile.seek(0)  # rewind after sniffing so the reader starts at the top
    reader = csv.reader(datafile, dialect)
    sample_rows = []
    row_num = 0
    for row in reader:
        sample_rows.append(row)
        row_num += 1
        if row_num > 100:
            break
# Carry out schema and header investigation on sample_rows...
With Google Cloud Storage I have attempted to use download_as_string and download_to_file, which provide binary representations of the data, but then I cannot get the csv module to work with any of it. I tried .decode('utf-8'), which returns one long string full of \r\n's. I then used splitlines() to get a list of the data, but the csv functions still produce a dialect and reader that split the data into single characters per entry.
Has anyone managed to get a work around to use the csv module with files stored in Cloud Storage without downloading the whole file?
After looking at the csv source code on GitHub, I managed to use Python's io and csv modules to solve this problem. io.BytesIO and TextIOWrapper were the two keys. Probably not a common use case, but I thought I would post the answer here to save some time for anyone who needs it.
# Set up a storage client and create a blob object for the csv file you are trying to read from GCS.
content = blob.download_as_string(start=0, end=10240)  # Read a chunk of bytes that covers the header data and some of the recorded data.
bytes_buffer = io.BytesIO(content)
wrapped_text = io.TextIOWrapper(bytes_buffer, encoding=encoding, newline=newline)
dialect = csv.Sniffer().sniff(wrapped_text.read())
wrapped_text.seek(0)
reader = csv.reader(wrapped_text, dialect)
# Do what you will with the reader object
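The io/csv combination can be exercised without touching Cloud Storage at all. Here is a minimal, self-contained sketch that stands an in-memory byte string in for the chunk returned by blob.download_as_string (the data itself is made up for illustration):

```python
import csv
import io

# Stand-in for the byte chunk returned by blob.download_as_string(start=0, end=10240)
content = b"col_a;col_b;col_c\r\n1;2;3\r\n4;5;6\r\n"

bytes_buffer = io.BytesIO(content)
wrapped_text = io.TextIOWrapper(bytes_buffer, encoding="utf-8", newline="")

# Sniff the dialect from the whole chunk, then rewind before parsing
dialect = csv.Sniffer().sniff(wrapped_text.read())
wrapped_text.seek(0)

rows = list(csv.reader(wrapped_text, dialect))
print(rows)
```

The Sniffer correctly detects the ';' delimiter from the chunk, and the reader then yields the header plus the data rows as lists of strings.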

Transform a string to some code in python 3

I store some data in an Excel file, which I extract in JSON format. I also fetch some data with GET requests from an API I created. With all these data, I run some tests (does the data in the Excel file equal the data returned by the API?).
In my case, I may need to store in the Excel file the path for selecting the data from the API JSON returned by the GET.
for example, the API returns :
{"countries":
  [{"code":"AF","name":"Afghanistan"},
   {"code":"AX","name":"Åland Islands"} ...
And in my excel, I store :
excelData['countries'][0]['name']
I can retrieve excelData['countries'][0]['name'] in my code just fine, as a string.
Is there a way to convert excelData['countries'][0]['name'] from a string into code that actually points to and gets the data I need from the API JSON?
here's how I want to use it :
self.assertEqual(str(valueExcel), path)
#path is the string from the excel that tells where to fetch the data from the
# JSON api
I thought the strings would be interpreted, but no:
AssertionError: 'AF' != "excelData['countries'][0]['code']"
- AF
+ excelData['countries'][0]['code']
You are looking for the eval function. Try this:
self.assertEqual(str(valueExcel), eval(path))
Important: Keep in mind that eval can be dangerous, since malicious code could be executed. More warnings here: What does Python's eval() do?
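If you would rather avoid eval entirely, one alternative (my own sketch, not part of the original answer) is to extract the bracketed keys and indexes from the path string with a regular expression and walk the JSON structure using plain indexing:

```python
import re

def resolve_path(data, path):
    """Walk a JSON-like structure using a path string such as
    "excelData['countries'][0]['name']", without eval().
    Only the bracketed keys/indexes are used; the leading
    variable name is ignored."""
    for str_key, int_key in re.findall(r"\[(?:'([^']*)'|(\d+))\]", path):
        data = data[str_key] if str_key else data[int(int_key)]
    return data

api_json = {"countries": [{"code": "AF", "name": "Afghanistan"},
                          {"code": "AX", "name": "Åland Islands"}]}

print(resolve_path(api_json, "excelData['countries'][0]['code']"))  # AF
```

The resolve_path name is mine; the benefit over eval is that a malicious path string can at worst raise a KeyError/IndexError, never execute code.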

We have many mainframe files in EBCDIC format; is there a way in Python to parse or convert a mainframe file into a csv or text file?

I need to read the records from a mainframe file and apply some filters to the record values.
So I am looking for a solution to convert the mainframe file to csv, text, or an Excel workbook so that I can easily perform operations on it.
I also need to validate the record count.
If the files are all text, then FTP'ing with EBCDIC-to-ASCII translation is doable, including from within Python.
If not, then either:
the extraction and conversion to CSV needs to happen on z/OS, perhaps with a COBOL program; the CSV can then be FTP'ed down,
or
the data has to be FTP'ed in BINARY and then parsed, with the relevant pieces translated.
But, as is so often the case, we need more information.
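For the binary-transfer case, Python ships EBCDIC codecs out of the box, so translating the text portions is a one-liner. Note that cp037 below is just one common EBCDIC code page; which one applies depends on your system (cp500, cp1140, etc.):

```python
# Round-trip a string through EBCDIC code page 037.
# In practice ebcdic_bytes would be a slice of the binary file.
ebcdic_bytes = "HELLO, WORLD".encode("cp037")
print(ebcdic_bytes)               # not ASCII: EBCDIC 'H' is 0xC8, not 0x48
text = ebcdic_bytes.decode("cp037")
print(text)  # HELLO, WORLD
```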
I was recently processing the hardcopy log and wanted to break the records apart. I used Python because each record is effectively a fixed-position record with different data items at fixed locations. In my case the entire record was text, but one could easily apply this technique to convert various columns to an appropriate type.
Here is a sample record. I added a few lines to help visualize the data offsets used in the code to access the data:
1 2 3 4 5 6 7 8 9
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
N 4000000 PROD 19114 06:27:04.07 JOB02679 00000090 $HASP373 PWUB02#C STARTED - INIT 17
Note the fixed column positions for the various items and how they are referenced by position. Using this technique you could process the file and create a CSV with the output you want for processing in Excel.
For my case I used Python 3.
def processBaseMessage(self, message):
    self.command = message[1]
    self.routing = list(message[2:9])
    self.routingCodes = []  # routing codes extracted from the system log
    self.sysname = message[10:18]
    self.date = message[19:24]
    self.time = message[25:36]
    self.ident = message[37:45]
    self.msgflags = message[46:54]
    self.msg = [message[56:]]
You can then format the data into whatever form you need for further processing. There are other ways to process mainframe data, but based on the question this approach should suit your needs.
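To illustrate, the slicing from processBaseMessage can be fed straight into Python's csv module to produce the CSV output mentioned above. This is a sketch: the offsets are copied from the function, the field names are my own, and the spacing of the sample record here is approximate:

```python
import csv
import io

# (name, start, end) offsets mirroring processBaseMessage above
FIELDS = [("command", 1, 2), ("routing", 2, 9), ("sysname", 10, 18),
          ("date", 19, 24), ("time", 25, 36), ("ident", 37, 45),
          ("msgflags", 46, 54), ("msg", 56, None)]

record = ("N 4000000 PROD     19114 06:27:04.07 JOB02679 00000090  "
          "$HASP373 PWUB02#C STARTED - INIT 17")

out = io.StringIO()
writer = csv.writer(out)
writer.writerow([name for name, _, _ in FIELDS])               # header row
writer.writerow([record[a:b].strip() for _, a, b in FIELDS])   # one data row
print(out.getvalue())
```

Running this over every record in the file gives a CSV that opens directly in Excel.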

What is the proper way to validate datatype of csv data in spark?

We have a JSON file as input to the Spark program (it describes the schema definition and the constraints to check on each column), and I want to perform data quality checks (NOT NULL, UNIQUE) as well as datatype validation (i.e., check whether the csv file contains data that conforms to the JSON schema).
JSON File:
{
  "id": "1",
  "name": "employee",
  "source": "local",
  "file_type": "text",
  "sub_file_type": "csv",
  "delimeter": ",",
  "path": "/user/all/dqdata/data/emp.txt",
  "columns": [
    {"column_name": "empid", "datatype": "integer", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "empname", "datatype": "string", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "salary", "datatype": "double", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "doj", "datatype": "date", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "location", "string": "number", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]}
  ]
}
Sample CSV input :
empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune
Keep in mind that:
1) I have intentionally put invalid data in the empId and empname fields (check the last record).
2) The number of columns in the JSON file is not fixed.
Question:
How can I ensure that the input data file contains all records conforming to the datatypes given in the JSON file?
We have tried the following:
1) If we try to load the data from the CSV file into a DataFrame with an external schema applied, the Spark program immediately throws a cast exception (NumberFormatException, etc.) and terminates abnormally. But I want to continue the execution flow and log the specific error, e.g. "Datatype mismatch error for column empID".
The above scenario is triggered only when we call some RDD action on the DataFrame, which felt like a weird way to validate the schema.
Please guide me, How we can achieve it in spark?
I don't think there is a free lunch here; you have to write this process yourself, but here is what you can do:
1) Read the csv file as a Dataset of Strings, so that every row loads successfully.
2) Parse the Dataset using the map function, checking for null or datatype problems per column.
3) Add two extra columns: a boolean called something like validRow and a String called something like message or description.
4) In the parser from step 2), use some sort of try/catch or Try/Success/Failure for each value in each column, catch any exception, and set the validRow and description columns accordingly.
5) Filter and write the successful DataFrame/Dataset (validRow set to true) to a success location, and write the error DataFrame/Dataset to an error location.
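The per-value checking described above can be sketched in plain Python; this is roughly the function you would apply with map over the Dataset of strings (the column list is hand-written from the sample JSON, and the helper names are my own):

```python
# Per-row validator: try to cast each field to its declared type and
# record a validRow flag plus a description, instead of letting the
# whole job die on a cast exception.
CASTS = {"integer": int, "double": float, "string": str}

def validate_row(line, columns):
    """columns is a list of (column_name, datatype) pairs from the JSON schema."""
    values = line.split(",")
    errors = []
    for (name, datatype), value in zip(columns, values):
        try:
            CASTS[datatype](value)
        except ValueError:
            errors.append("Datatype mismatch error for column %s" % name)
    # original values plus the two extra columns (validRow, description)
    return values + [not errors, "; ".join(errors)]

schema = [("empid", "integer"), ("empname", "string"), ("salary", "double")]
print(validate_row("1,a,10000", schema))   # valid row
print(validate_row("a,1,10010", schema))   # empid fails the integer cast
```

In Spark you would apply this inside map (or as a UDF), then filter on the validRow flag and write the two result sets separately.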

How to read/write protocol buffer messages with Apache Spark?

I want to read/write protocol buffer messages from/to HDFS with Apache Spark. I have found these suggested approaches:
1) Convert the protobuf messages to JSON with Google's Gson library and then read/write them via SparkSQL. This solution is explained in this link, but I think the conversion to JSON is an extra task.
2) Convert to Parquet files. There are the parquet-mr and sparksql-protobuf GitHub projects for this, but I don't want Parquet files: I always work with all columns (not just some), so the Parquet format gives me no gain (at least I think so).
3) ScalaPB. Maybe this is what I am looking for, but it is written in Scala, which I know nothing about. I am looking for a Java-based solution. This YouTube video introduces ScalaPB and explains how to use it (for Scala developers).
4) Via sequence files. This is what I am looking for, but I have found nothing about it. So, my question is: how can I write protobuf messages to a sequence file on HDFS, and read them back? Any other suggestion would be useful.
5) Via Twitter's Elephant Bird library.
Though it is a bit hidden among your points, you seem to be asking how to write to a sequence file in Spark. I found an example here.
// Import the org.apache.hadoop.io package
import org.apache.hadoop.io._

// We need data in sequence file format to read, so let us see how to write it first.
// Read the data in text file format
val dataRDD = sc.textFile("/public/retail_db/orders")

// Use null as the key; the value will be of type Text when saved in sequence file format.
// For Int and String we do not need to convert the types into IntWritable and Text,
// but for other types we need to convert them to a Writable object.
// For example, if the key/value is of type Long, we might have to
// cast it with new LongWritable(object)
dataRDD.
  map(x => (NullWritable.get(), x)).
  saveAsSequenceFile("/user/`whoami`/orders_seq")
// Make sure to replace `whoami` with the appropriate OS user id

// Saving in a sequence file with a key of type Int and a value of type String
dataRDD.
  map(x => (x.split(",")(0).toInt, x.split(",")(1))).
  saveAsSequenceFile("/user/`whoami`/orders_seq")
// Make sure to replace `whoami` with the appropriate OS user id
