How to import data in CSV format in J?

I want to know how I can import CSV data and then how I can work with it.
I have loaded the file but do not know how to read it.
'',' fixdsv dat ] load '/Users/apple/Downloads/data'

Assuming that the file /Users/apple/Downloads/data is a csv file, you should be able to load it into a J session as a boxed table like this:
load 'csv'
data=: readcsv '/Users/apple/Downloads/data'
If the file uses a delimiter other than commas (e.g. tabs), then you could use the tables/dsv addon:
data=: TAB readdsv '/Users/apple/Downloads/data'
See the J wiki for more information on the tables/csv and tables/dsv addons.

I think that I would start by reading the file into a variable and then working with that.
data=: 1!:1 <'filepath/filename' NB. the path and filename need to be a boxed string
http://www.jsoftware.com/help/dictionary/dx001.htm
Also, you could look at Jd, which is specifically a relational database system, if you are more focused on file management than data processing.
http://code.jsoftware.com/wiki/Jd/Index

Related

How to read the most recent Excel export into a Pandas dataframe without specifying the file name?

I frequent a real estate website that shows recent transactions, from which I will download data to parse within a Pandas dataframe. Everything about this dataset remains identical every time I download it (regarding the column names, that is).
The name of the Excel output may change, though. For example, if I have already downloaded a few of these into my Downloads folder, the exported file may read "Generic_File_(3)" or "Generic_File_(21)", depending on how many older "Generic_File" exports are already there.
Ideally, I'd like my workflow to look like this: export this Excel file of real estate sales, then run a Python script to read the most recent export into a Pandas dataframe. The catch is, I don't want to have to go in and change the filename in the script to match the appended number of the Excel export every time. I want the pd.read_excel method to simply read the "Generic_File" that is appended with the largest number (which will obviously correspond to the most recent export).
I suppose I could always just delete old exports out of my Downloads folder so the newest, freshest export is always named the same ("Generic_File", in this case), but I'm looking for a way to ensure I don't have to do this. Are wildcards the best path forward, or is there some other method to always read in the most recently downloaded Excel file from my Downloads folder?
I would use the os package and create a method to read the file names in the Downloads folder. By parsing the filename strings you could then find the file matching your specified format with the highest copy number. Something like the following might help you get started.
import os
downloads = os.listdir('C:/Users/[username here]/Downloads/')
# Keep only the entries that look like files (i.e. contain a dot).
files = [item for item in downloads if '.' in item]
# INSERT CODE HERE TO IDENTIFY THE FILE OF INTEREST
Regex might be the best way to find matches if you have a diverse listing of files in your downloads folder.
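For instance, here is a sketch of that regex step, assuming the "Generic_File_(N).xlsx" naming from the question; the pattern and folder are hypothetical and should be adjusted to the real export names:
import os
import re
downloads_dir = 'C:/Users/[username here]/Downloads/'
# Hypothetical pattern based on the question's "Generic_File_(3)" examples.
pattern = re.compile(r'Generic_File_\((\d+)\)\.xlsx$')
candidates = []
for name in os.listdir(downloads_dir):
    m = pattern.search(name)
    if m:
        candidates.append((int(m.group(1)), name))
if candidates:
    newest = max(candidates)[1]  # highest copy number = most recent export
    # df = pd.read_excel(os.path.join(downloads_dir, newest))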

How can I convert a Pyspark dataframe to a CSV without sending it to a file?

I have a dataframe which I need to convert to a CSV file, and then I need to send this CSV to an API. As I'm sending it to an API, I do not want to save it to the local filesystem and need to keep it in memory. How can I do this?
Easy way: convert your dataframe to a Pandas dataframe with toPandas(), then save it to a string. To save to a string, not a file, you'll have to call to_csv with path_or_buf=None. Then send the string in an API call.
From the to_csv() documentation:
Parameters
path_or_buf : str or file handle, default None
File path or object; if None is provided the result is returned as a string.
So your code would likely look like this:
csv_string = df.toPandas().to_csv(path_or_buf=None)
Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file. Or you can even use a regular file; just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
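If you go the SpooledTemporaryFile route, a minimal sketch might look like this, assuming df is the Pyspark dataframe from the question and that the CSV fits in the chosen buffer size:
import tempfile
# With max_size larger than the CSV, the "file" never spills to disk.
# df is assumed to be the Pyspark dataframe from the question.
with tempfile.SpooledTemporaryFile(max_size=500 * 2**20, mode='w+') as buf:
    df.toPandas().to_csv(buf)
    buf.seek(0)  # rewind before reading the contents back
    csv_string = buf.read()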

how to read a specific text file in pandas

I want to read a specific line of a csv file with pandas in Python.
Here is the structure of the file:
(example file not preserved)
What would be the best way to fill the values into a dataframe, with the correct names of the parameters?
Thanks for the help.
Possible methods:
The pandas.read_table method seems to be a good way to read a tabular data file (also in chunks):
doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html
pandas also has a good, fast (compiled) csv reader, pandas.read_csv:
doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Ref Link: https://codereview.stackexchange.com/questions/152194/reading-from-a-txt-file-to-a-pandas-dataframe
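For example, a minimal read_csv sketch for pulling a specific line out of such a file; the file name, line positions, and column names are all hypothetical, since the original example was not preserved:
import pandas as pd
df = pd.read_csv(
    'file.txt',
    header=None,                   # the skipped lines are not a header row
    skiprows=2,                    # skip the lines before the row of interest
    nrows=1,                       # read a single data row
    names=['param_a', 'param_b'],  # supply the parameter names yourself
)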

Converting 2TB of gzipped multiline JSONs to NDJSON

For my research I have a dataset of about 20,000 gzipped multiline JSON files (~2TB, all with the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam, I'm convinced that the first step would be to convert this dataset to NDJSON. In most books and tutorials they always assume you are working with some newline-delimited file.
What is the best way to go about converting this data?
I've tried just launching a large instance on gcloud and using gunzip and jq to do this. Not surprisingly, it seems that this will take a long time.
Thanks in advance for any help!
Apache Beam supports unzipping files if you use TextIO, but the delimiter remains the newline.
For multiline JSON you can read each complete file in parallel, convert the JSON string to a POJO, and eventually reshuffle the data to utilize parallelism.
So the steps would be:
Get the file list > Read individual files > Parse file content to JSON objects > Reshuffle > ...
You can get the file list with FileSystems.match("gs://my_bucket").metadata().
You can read individual files with Compression.detect(fileResourceId.getFilename()).readDecompressed(FileSystems.open(fileResourceId)).
Converting to NDJSON is not necessary if you use sc.wholeTextFiles. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the filename and ._2 is the content of the file.
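A minimal PySpark rendering of that idea (the bucket path is hypothetical; gzipped inputs should be decompressed automatically, since each file is read whole):
import json
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
records = (
    sc.wholeTextFiles('gs://my_bucket/jsons/*.gz')
      .map(lambda kv: json.loads(kv[1]))  # parse one multiline JSON per file
)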

Nodejs best way to read xlsx as utf8 text

I need to read an xlsx file in Node.js. The xlsx contains text with accents, apostrophes, and so on. Then I have to save the text in a JSON file.
What are the best practices to perform that task?
Stage 1 - take a look at the node-xlsx module, or at xlsx, which is more robust and possibly better for your needs.
Stage 2 - writing the file to JSON: if the module can return a JSON format then great. If you use xlsx, it has an option to output JSON --> take a look here.
Since you may need to actually strip and/or protect special accents etc., you may need to validate the data which is returned before producing a JSON file.
As for actually writing the JSON file, there is a huge number of NPM modules for the task.
