Transform a string to some code in Python 3

I store some data in an Excel file that I extract in JSON format. I also fetch some data with GET requests from an API I created. With all this data, I run some tests (does the data in the Excel file equal the data returned by the API?).
In my case, I may need to store in the Excel file the path used to select the data from the JSON returned by the API's GET.
For example, the API returns:
{"countries":
[{"code":"AF","name":"Afghanistan"},
{"code":"AX","name":"Ă…land Islands"} ...
And in my Excel file, I store:
excelData['countries'][0]['name']
I can retrieve the excelData['countries'][0]['name'] in my code just fine, as a string.
Is there a way to convert excelData['countries'][0]['name'] from a string to some code that actually points and get the data I need from the API json?
Here's how I want to use it:
self.assertEqual(str(valueExcel), path)
#path is the string from the excel that tells where to fetch the data from the
# JSON api
I thought the string would be interpreted, but no:
AssertionError: 'AF' != "excelData['countries'][0]['code']"
- AF
+ excelData['countries'][0]['code']

You are looking for the eval function. Try this:
self.assertEqual(str(valueExcel), eval(path))
Important: Keep in mind that eval can be dangerous, since malicious code could be executed. More warnings here: What does Python's eval() do?
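If you'd rather avoid eval entirely, one alternative is to keep only the bracket path in the Excel cell and walk the parsed JSON with a small helper. A minimal sketch, assuming the path format shown in the question; the get_by_path helper below is purely illustrative, not part of any library:
import json
import re

def get_by_path(data, path):
    """Follow a path string such as "['countries'][0]['name']" through parsed JSON."""
    tokens = re.findall(r"\['([^']+)'\]|\[(\d+)\]", path)  # ('key', '') or ('', 'index') pairs
    value = data
    for key, index in tokens:
        value = value[key] if key else value[int(index)]
    return value

api_json = json.loads('{"countries": [{"code": "AF", "name": "Afghanistan"}]}')

# The Excel cell holds "excelData['countries'][0]['code']"; strip the variable
# name and walk the remaining bracket path.
path = "excelData['countries'][0]['code']"
print(get_by_path(api_json, path.replace("excelData", "", 1)))  # prints AF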

Related

How to include SQLAlchemy Data types in Python dictionary

I've written an application using Python 3.6, pandas and sqlalchemy to automate the bulk loading of data into various back-end databases and the script works well.
As a brief summary, the script reads data from various excel and csv source files, one at a time, into a pandas dataframe and then uses the df.to_sql() method to write the data to a database. For maximum flexibility, I use a JSON file to provide all the configuration information including the names and types of source files, the database engine connection strings, the column titles for the source file and the column titles in the destination table.
When my script runs, it reads the JSON configuration, imports the specified source data into a dataframe, renames source columns to match the destination columns, drops any columns from the dataframe that are not required and then writes the dataframe contents to the database table using a call similar to:
df.to_sql(strTablename, con=engine, if_exists="append", index=False, chunksize=5000, schema="dbo")
The problem I have is that I would also like to specify the data types for the columns in the df.to_sql method and provide them as inputs from the JSON configuration file. However, this doesn't appear to be possible: all the strings in the JSON file need to be enclosed in quotes, and they aren't translated into type objects when read by my code. This is how the df.to_sql call should look:
df.to_sql(strTablename, con=engine, if_exists="append", dtype=dictDatatypes, index=False, chunksize=5000, schema="dbo")
The entries that form the dtype dictionary from my JSON file look like this:
"Data Types": {
"EmployeeNumber": "sqlalchemy.types.NVARCHAR(length=255)",
"Services": "sqlalchemy.types.INT()",
"UploadActivities": "sqlalchemy.types.INT()",
......
and there are many more, one for each column.
However, when the above is read in as a dictionary and passed to the df.to_sql method, it doesn't work: the SQLAlchemy datatypes shouldn't be enclosed in quotes, but I can't get around that in my JSON file. The dictionary values therefore aren't recognised by pandas. They look like this:
{'EmployeeNumber': 'sqlalchemy.types.INT()', ....}
And they really need to look like this:
{'EmployeeNumber': sqlalchemy.types.INT(), ....}
Does anyone have experience of this to suggest how I might be able to have the sqlalchemy datatypes in my configuration file?
You could use eval() to convert the string names to objects of that type:
import sqlalchemy as sa
from pprint import pprint
dict_datatypes = {"EmployeeNumber": "sa.INT", "EmployeeName": "sa.String(50)"}
pprint(dict_datatypes)
"""console output:
{'EmployeeName': 'sa.String(50)', 'EmployeeNumber': 'sa.INT'}
"""
for key in dict_datatypes:
    dict_datatypes[key] = eval(dict_datatypes[key])
pprint(dict_datatypes)
"""console output:
{'EmployeeName': String(length=50),
'EmployeeNumber': <class 'sqlalchemy.sql.sqltypes.INTEGER'>}
"""
Just be sure that you do not pass untrusted input values to functions like eval() and exec().
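If you would rather avoid eval here as well, another option is a whitelist dictionary that maps short, agreed-upon type names in the JSON config to SQLAlchemy type objects. A minimal sketch, assuming you are free to change the strings stored in the JSON file; the names in TYPE_MAP are made up for illustration:
import sqlalchemy as sa

# Whitelist of type names allowed in the JSON config; anything else raises KeyError.
TYPE_MAP = {
    "nvarchar255": sa.types.NVARCHAR(length=255),
    "int": sa.types.INT(),
    "float": sa.types.FLOAT(),
}

# What the JSON configuration would hold (illustrative keys and names)
json_dtypes = {"EmployeeNumber": "nvarchar255", "Services": "int"}

# Translate the string names into real SQLAlchemy type objects
dict_datatypes = {col: TYPE_MAP[name] for col, name in json_dtypes.items()}
# dict_datatypes can now be passed to df.to_sql(..., dtype=dict_datatypes)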

How can I convert a Pyspark dataframe to a CSV without sending it to a file?

I have a dataframe which I need to convert to a CSV file, and then I need to send this CSV to an API. As I'm sending it to an API, I do not want to save it to the local filesystem and need to keep it in memory. How can I do this?
Easy way: convert your dataframe to a pandas dataframe with toPandas(), then save it to a string. To save to a string, not a file, you'll have to call to_csv with path_or_buf=None. Then send the string in an API call.
From to_csv() documentation:
Parameters
path_or_buf : str or file handle, default None
    File path or object; if None is provided the result is returned as a string.
So your code would likely look like this:
csv_string = df.toPandas().to_csv(path_or_buf=None)
Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file. Or you can even use a regular file, just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
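Putting it together, a minimal sketch of the in-memory round trip might look like this, where df is the PySpark dataframe from the question and the endpoint URL and headers are placeholders:
import requests

# Build the CSV entirely in memory as a string
csv_string = df.toPandas().to_csv(path_or_buf=None, index=False)

# Send the CSV text in the request body; adjust the URL and headers to your API
response = requests.post(
    "https://example.com/api/upload",            # hypothetical endpoint
    data=csv_string.encode("utf-8"),
    headers={"Content-Type": "text/csv"},
)
response.raise_for_status()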

Python - API multiple responses

I am currently working on obtaining data from a nested JSON response field called "Result".
After reviewing the API documentation, they say they only return 100 records per request, which means that if we have 425 records I would have to call requests.get at least five times with:
/example
/example?$skip=100
/example?$skip=200
/example?$skip=300
/example?$skip=400
After that is done, it should write the response list to a CSV file. I have parsed the response from the GET with json.loads, converted the dictionary to a list, and created a for loop that writes whatever is in the "Result" dictionary.
My question is: how can I make it loop over requests.get as well and increment the URL value to skip 100, 200, 300, 400? Hope this makes sense.
So after searching and searching, the best way that worked for me was:
Create a for loop over the number of pages that need to be fetched.
toSkip = (i+1) * 100
Concatenate the string with 'url string' + '?$Skip=' + str(toSkip)
Create a request, passing the authorization header
Parse it with json.loads
Write the result to a CSV file or the Google Sheets API
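A rough sketch of that loop; the endpoint URL, authorization header, and response shape are placeholders rather than the real API:
import csv
import json
import requests

BASE_URL = "https://example.com/example"        # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}   # placeholder auth header

TOTAL_RECORDS = 425
PAGE_SIZE = 100
pages = -(-TOTAL_RECORDS // PAGE_SIZE)  # ceiling division -> 5 requests

all_results = []
for i in range(pages):
    skip = i * PAGE_SIZE
    url = BASE_URL if skip == 0 else f"{BASE_URL}?$skip={skip}"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    # "Result" is the nested list described in the question
    all_results.extend(json.loads(response.text)["Result"])

# Write every collected record to a CSV file
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=all_results[0].keys())
    writer.writeheader()
    writer.writerows(all_results)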

What is the proper way to validate datatype of csv data in spark?

We have a JSON file as input to the Spark program (it describes the schema definition and the constraints we want to check on each column), and I want to perform some data quality checks such as NOT NULL and UNIQUE, as well as datatype validation (does the CSV file contain data that conforms to the JSON schema or not?).
JSON File:
{
  "id": "1",
  "name": "employee",
  "source": "local",
  "file_type": "text",
  "sub_file_type": "csv",
  "delimeter": ",",
  "path": "/user/all/dqdata/data/emp.txt",
  "columns": [
    {"column_name": "empid", "datatype": "integer", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "empname", "datatype": "string", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "salary", "datatype": "double", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "doj", "datatype": "date", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]},
    {"column_name": "location", "string": "number", "constraints": ["not null", "unique"], "values_permitted": ["1", "2"]}
  ]
}
Sample CSV input :
empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune
Keep in mind that:
1) I have intentionally put invalid data in the empId and name fields (check the last record).
2) The number of columns in the JSON file is not fixed.
Question:
How can I check whether an input data file contains all records conforming to the datatypes given in the JSON file?
We have tried below things:
1) If we try to load the data from the CSV file into a data frame by applying an external schema, the Spark program immediately throws a cast exception (NumberFormatException, etc.) and terminates abnormally. But I want to continue the execution flow and log the specific error as "Datatype mismatch error for column empID".
The above scenario only works when we call some RDD action on the data frame, which I felt was a weird way to validate the schema.
Please guide me, How we can achieve it in spark?
I don't think there is a free lunch here; you have to write this process yourself. The process you can follow is (a rough sketch comes after the steps below):
Read the CSV file as a Dataset of Strings so that every row loads successfully
Parse the Dataset using the map function to check for null or datatype problems per column
Add two extra columns: a boolean called something like validRow and a String called something like message or description
With the parser mentioned in step 2, do some sort of try/catch or a Try/Success/Failure for each value in each column, catch the exception, and set the validRow and description columns accordingly
Do a filter and write the successful DataFrame/Dataset (validRow set to True) to a success location, and write the error DataFrame/Dataset to an error location
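A condensed PySpark sketch of this flow, using the empId and salar columns from the sample CSV. Note that instead of a map with try/catch it relies on cast(), which yields NULL when a conversion fails, to set the flag; the output paths and the exact rules are illustrative only:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Read every column as a plain string so no row is rejected at load time
raw = spark.read.option("header", True).csv("/user/all/dqdata/data/emp.txt")

# 2./3./4. Flag rows whose values cannot be cast to the expected types;
#          cast() returns NULL when the conversion fails.
checked = (
    raw.withColumn("empid_ok", F.col("empId").cast("int").isNotNull())
       .withColumn("salary_ok", F.col("salar").cast("double").isNotNull())
       .withColumn("validRow", F.col("empid_ok") & F.col("salary_ok"))
       .withColumn(
           "message",
           F.when(~F.col("empid_ok"), F.lit("Datatype mismatch error for column empId"))
            .when(~F.col("salary_ok"), F.lit("Datatype mismatch error for column salary"))
            .otherwise(F.lit(None)),
       )
)

# 5. Split into valid and invalid rows and write each to its own location
checked.filter("validRow").drop("empid_ok", "salary_ok", "validRow", "message") \
       .write.mode("overwrite").csv("/tmp/emp_success")
checked.filter(~F.col("validRow")) \
       .write.mode("overwrite").csv("/tmp/emp_errors")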

Store large text file to DB using oData

I need to store a very large string in a backend table, in a single field of type string.
The string I am storing is over 10 million (1 crore) characters long. It takes a long time to store and retrieve it from the backend.
I tried compression algorithms, which failed to compress such a large string.
So what is the best way to handle this situation and improve the performance?
Technologies used:
front end - SAP UI5,
gateway - oData,
backend - SAP ABAP.
Compression methods tried:
https://github.com/tcorral/JSONC
https://github.com/floydpink/lzwCompress.js
The above compression methods weren't able to solve my problem.
Well, Marc is right stating that transferring XLSX is definitely better/faster than JSON.
ABAP JSON tools are not especially rich, but they are sufficient for most manipulations; more peculiar tasks can be done via internal tables and transformations. So it is highly recommended to perform your operations (XLSX >> JSON) on the backend server.
As for the backend DB table, I agree with Chris N that inserting a 10M-character string into a STRING field is about the worst idea imaginable. The recommended way of storing big files in transparent tables is to use the XSTRING type, which is a kind of BLOB for ABAP and is much faster at handling binary data.
I've run some SAT performance tests on a sample 14-million-character file, comparing an INSERT into an XSTRING field with an INSERT into a STRING field (the trace screenshots are omitted here). The net time of the DB operations differs significantly, not in favour of STRING.
Your upload code can look like this:
DATA: len          TYPE i,
      lt_content   TYPE STANDARD TABLE OF tdline,
      ws_store_tab TYPE zstore_tab.

" Upload the file into an internal table
CALL FUNCTION 'GUI_UPLOAD'
  EXPORTING
    filename   = '/TEMP/FILE.XLSX'
    filetype   = 'BIN'
  IMPORTING
    filelength = len
  TABLES
    data_tab   = lt_content.
IF sy-subrc <> 0.
  MESSAGE 'Unable to upload file' TYPE 'E'.
ENDIF.

" Convert the binary internal table to an xstring
CALL FUNCTION 'SCMS_BINARY_TO_XSTRING'
  EXPORTING
    input_length = len
    first_line   = 0
    last_line    = 0
  IMPORTING
    buffer       = ws_store_tab-file   "field should be of type XSTRING!
  TABLES
    binary_tab   = lt_content
  EXCEPTIONS
    failed       = 1
    OTHERS       = 2.
IF sy-subrc <> 0.
  MESSAGE 'Unable to convert binary to xstring' TYPE 'E'.
ENDIF.

INSERT zstore_tab FROM ws_store_tab.
IF sy-subrc IS INITIAL.
  MESSAGE 'Successfully uploaded' TYPE 'S'.
ELSE.
  MESSAGE 'Failed to upload' TYPE 'E'.
ENDIF.
For parsing and manipulating XLSX multiple AS ABAP wrappers already present, examples are here, here and here.
All this concerns backend-side optimization. Optimizations on the frontend are welcome from UI5 experts (to whom I don't belong); however, the general SAP recommendation is to move all massive manipulation to the application server.
