BULK INSERT or any other insert option to insert file contents passed as a string, without providing a file path

I want to know whether it is possible to import a file's contents without knowing the exact file location/path. Instead of a file path, the file contents would be passed directly as a string, say to a stored procedure.
How can I use any of the insert options such as BULK INSERT or OPENROWSET to import/copy data directly from a string input (instead of giving a file path)? Please give some examples or links, with code if possible.
I can provide more details if needed. Or do I have to perform string manipulation, with looping of some kind, to split the huge string blob into the respective tables/columns?
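For the string-manipulation route, a minimal sketch of what that could look like (assuming SQL Server 2016+ for STRING_SPLIT; dbo.ImportFromString and dbo.StagingLines are hypothetical names, and splitting each line into columns is still left open):
CREATE OR ALTER PROCEDURE dbo.ImportFromString
    @fileContents NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    -- One staging row per line of the incoming string; no file path involved.
    -- CHAR(13) is stripped first in case the payload uses CRLF line endings.
    INSERT INTO dbo.StagingLines (LineText)
    SELECT LTRIM(RTRIM(value))
    FROM STRING_SPLIT(REPLACE(@fileContents, CHAR(13), ''), CHAR(10))
    WHERE LTRIM(RTRIM(value)) <> '';
END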

Related

BigQuery: how to load data from local file content (string)

I have a requirement wherein I will receive file content which I need to load into BigQuery tables. The standard API shows how to load data from a local file, but I don't see any variant of the load method which accepts file content as a string rather than a file path. Any idea how I can achieve this?
As we can see in the source code and official documentation, the load function loads data only from a local file or a Storage File. The allowed formats are:
AVRO,
CSV,
JSON,
ORC,
PARQUET
The load job is created and will run your data load asynchronously. If you would like immediate access to your data, insert it using the table insert function, where you provide the rows to insert into the table:
// Insert a single row
table.insert({
  INSTNM: 'Motion Picture Institute of Michigan',
  CITY: 'Troy',
  STABBR: 'MI'
}, insertHandler);
If you want to load e.g. a CSV file, you first need to save the data to a CSV file in Node.js manually. Then load it as a single-column CSV using the load() method; that will load the whole string as a single column.
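A sketch of that save-then-load workaround with the Node.js client (the dataset/table names and the csvContent variable are placeholders, and it assumes the destination table already exists with a matching schema):
const fs = require('fs');
const os = require('os');
const path = require('path');
const {BigQuery} = require('@google-cloud/bigquery');

async function loadCsvString(csvContent) {
  // Write the received string to a temporary file first...
  const tmpFile = path.join(os.tmpdir(), 'payload.csv');
  fs.writeFileSync(tmpFile, csvContent);

  // ...then hand that file to load(), since load() only accepts a path or a Storage File.
  const table = new BigQuery().dataset('my_dataset').table('my_table');
  const [job] = await table.load(tmpFile, {sourceFormat: 'CSV', skipLeadingRows: 1});
  console.log(`Job ${job.id} completed.`);
}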
Additionally, I can recommend using Dataflow templates, e.g. Cloud Storage Text to BigQuery, which read text files stored in Cloud Storage, transform them using a JavaScript user-defined function (UDF), and output the result to BigQuery. But the data to load needs to be stored in Cloud Storage.

How can I convert a Pyspark dataframe to a CSV without sending it to a file?

I have a dataframe which I need to convert to a CSV file, and then I need to send this CSV to an API. As I'm sending it to an API, I do not want to save it to the local filesystem and need to keep it in memory. How can I do this?
Easy way: convert your dataframe to a Pandas dataframe with toPandas(), then save it to a string. To get a string rather than a file, call to_csv with path_or_buf=None (which is the default). Then send the string in an API call.
From to_csv() documentation:
Parameters
path_or_buf : str or file handle, default None
    File path or object; if None is provided, the result is returned as a string.
So your code would likely look like this:
csv_string = df.toPandas().to_csv(path_or_buf=None)
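From there, posting the in-memory CSV to the API could look something like this (the endpoint URL and headers are placeholders for whatever your API expects):
import requests

# Hypothetical endpoint; adjust headers/auth to match the API you are calling.
response = requests.post(
    "https://example.com/upload",
    data=csv_string.encode("utf-8"),
    headers={"Content-Type": "text/csv"},
)
response.raise_for_status()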
Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file. Or you can even use a regular file, just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
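A sketch of the SpooledTemporaryFile route (the 100 MB threshold is arbitrary; as long as the data stays below it, nothing is written to disk):
import tempfile

with tempfile.SpooledTemporaryFile(max_size=100 * 1024 * 1024, mode="w+") as buffer:
    df.toPandas().to_csv(buffer)  # pandas writes into the in-memory buffer
    buffer.seek(0)
    csv_string = buffer.read()    # same CSV text, ready for the API call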

DoCmd.TransferText where delimiter is semicolon and decimal is comma

I'm trying to import a csv file with:
Dim appAccess As Access.Application
Set appAccess = CreateObject("Access.Application")
appAccess.OpenCurrentDatabase (databasePath)
appAccess.DoCmd.TransferText transferType:=acImportDelim, tableName:=dbTableName, Filename:=strPath, hasFieldNames:=True
I'm using a German machine, where the standard delimiter is ; and the standard decimal separator is ,.
If I use those separators, I get an error (the data is not separated correctly).
If I change the separator in the CSV file to , and the decimal separator to ., the data is loaded into the database, but the . is ignored and numeric values therefore aren't imported correctly.
I don't have the option to create an import specification in Access manually. Is there a way to do this with VBA?
I created a Schema.ini file, which looks like this:
[tempfile.csv]
Format=Delimited(;)
ColNameHeader=True
DecimalSymbol=","
I saved it in the same folder where the csv file is located.
I still get a runtime error saying field1;field2;... is not a header in the target table, so I'm guessing the method didn't use ; as the delimiter.
If you have a look at the documentation of the DoCmd.TransferText method, there is a parameter SpecificationName, which is described as:
A string expression that's the name of an import or export specification you've created and saved in the current database. For a fixed-width text file, you must either specify an argument or use a schema.ini file, which must be stored in the same folder as the imported, linked, or exported text file.
To create a schema file, you can use the text import/export wizard to create the file. For delimited text files and Microsoft Word mail merge data files, you can leave this argument blank to select the default import/export specifications.
So if you are not able to generate that specification using the wizard, you can create the Schema.ini file yourself in the same folder as the files to import. For documentation on how to build that file, see Schema.ini File (Text File Driver).
It should look something like the following I think:
[YourImportFileName.csv]
Format=Delimited(;)
DecimalSymbol=","
Note that you have to generate one ini file for each CSV file you want to import, because the first line is always the name of the import file. So generate the Schema.ini, import, delete the ini, and start over, generating the next ini for the next file.
If you want to generate that ini file with VBA on the fly, have a look at How to create and write to a txt file using VBA.
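A minimal VBA sketch of that idea, writing a Schema.ini next to the file about to be imported (the sub name and arguments are just illustrative), would be:
' Writes a Schema.ini for one CSV file into the folder that holds that file.
Sub WriteSchemaIni(ByVal csvFolder As String, ByVal csvFileName As String)
    Dim fileNum As Integer
    fileNum = FreeFile
    Open csvFolder & "\Schema.ini" For Output As #fileNum
    Print #fileNum, "[" & csvFileName & "]"
    Print #fileNum, "Format=Delimited(;)"
    Print #fileNum, "ColNameHeader=True"
    Print #fileNum, "DecimalSymbol=" & Chr(34) & "," & Chr(34)
    Close #fileNum
End Sub
Call it right before the DoCmd.TransferText line, passing the folder and file name taken from strPath.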

Error in U-SQL Job on Azure Data Lake

I have lots of JSON files in my Azure Data Lake account. They are organized as: Archive -> Folder 1 -> JSON files.
What I want to do is extract a particular field, timestamp, from each JSON file and then just put it in a CSV file.
My issue is:
I started with this script:
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM "correct_path/Assemblies/JSON/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM "correct_path/Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @INPUT_FILE string = @"correct_path/Tracking_3e9.json";

// Extract the different properties from the JSON file using a JsonExtractor
@json =
    EXTRACT Partition string, Custom string
    FROM @INPUT_FILE
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

OUTPUT @json
TO "correct_path/Output/simple.csv"
USING Outputters.Csv(quoting : false);
I get error:
E_STORE_USER_FILENOTFOUND: File not found or access denied
But I do have access to the file in the Data Explorer of Azure Data Lake, so how can that be?
I don't want to run it for each file one by one. I just want to give it all the files in a folder (like Tracking*.json), or just a bunch of folders (like Folder*), and it should go through them and put the output for each file in a single row of the output CSV.
Haven't found any tutorials on this.
Right now I am reading the entire JSON. How can I read just one field like timestamp, which is a field nested within another field, like data: {timestamp: "xxx"}?
Thanks for your help.
1) Without more information I'm not sure why you're running into that error - is it specifically the input file that's missing, or the assemblies?
2) You can use a fileset to extract data from a set of files. Just use {} to denote a wildcard in your input string; the value each wildcard matches is then saved into a new column. So for example, your input string could be @"correct_path/{day}/{hour}/{id}.json", and then your extract statement becomes:
@json =
    EXTRACT column1 string,
            column2 string,
            day int,
            hour int,
            id int
    FROM @input
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
3) You'll have to read the entire JSON in your SELECT statement, but you can refine it down to only the rows you want in future rowsets. For example:
@refine =
    SELECT timestamp FROM @json;

OUTPUT @refine
...
It sounds like some of your JSON data is nested, however (like the timestamp field). You can find information on our GitHub (Using the JSON UDFs) and in this blog on how to read nested JSON data.
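For example, assuming the nested object arrives in the Custom column extracted above and the sample assemblies are already referenced, a sketch using the JsonTuple UDF could look like:
// First pull out the nested "data" object as a JSON string, then take its "timestamp" value.
@withData =
    SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(Custom)["data"] AS data
    FROM @json;

@timestamps =
    SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(data)["timestamp"] AS timestamp
    FROM @withData;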
Hope this helps, and please let me know if you have additional questions!

How to import data in csv format in J?

I want to know how I can import data in CSV format and then how I can deal with it.
I have loaded the file but do not know how to read it.
'',' fixdsv dat ] load '/Users/apple/Downloads/data'
Assuming that the file /Users/apple/Downloads/data is a CSV file, you should be able to load it into a J session as a boxed table like this:
load 'csv'
data=: readcsv '/Users/apple/Downloads/data'
If the file uses delimiters other than commas (e.g. Tabs) then you could use the tables/dsv addon.
data=: TAB readdsv '/Users/apple/Downloads/data'
See the J wiki for more information on the tables/csv and tables/dsv addons.
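Once it is loaded, a few throwaway expressions for poking at the boxed table (just a sketch):
$ data          NB. shape: number of rows and columns
0 1 { data      NB. the first two rows
> 0 { |: data   NB. the first column, opened into a character table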
I think that I would start by reading the file into a variable and then working with that:
data=: 1!:1 <'filepath/filename'  NB. filename and path need to be given as a boxed string
http://www.jsoftware.com/help/dictionary/dx001.htm
Also, you could look at Jd, which is specifically a relational database system, if you are more focused on file management than data processing.
http://code.jsoftware.com/wiki/Jd/Index

Resources