We need to transform one CSV file into another CSV file (1 file to 1 file), and we are looking for a cheap solution. The first idea that popped into my mind was Excel, but the file will be too big.
1) Is it possible to do a CSV to CSV conversion through XSLT? I can't seem to find a tool or Google result that tells me how I could do it.
2) Is there a better approach to do CSV transformations?
Edit:
It should be possible to automate/schedule the process
My answers below
1) No, XSLT only transforms XML files.
2) Yes, as the answer to question 1 is "no", it is reasonable to assert that there are better approaches. Because CSV is not a standardised format, there is a plethora of approaches to choose from.
Use Rscript to automate the transformation of CSV:
# Rscript --vanilla myscript.R
Where myscript.R is something like:
csv <- read.csv(file = "input.csv", header = TRUE, sep = ",")
# Modify your CSV here ...
write.csv(csv, file = "output.csv", row.names = FALSE)  # row.names = FALSE keeps the output a clean CSV
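To cover the scheduling requirement, the same command can be run from a scheduler such as cron. A minimal crontab sketch, assuming a Unix-like host and an illustrative script path:
# run the transformation every night at 02:00 (path is illustrative)
0 2 * * * Rscript --vanilla /path/to/myscript.R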
Hi, I have a question about Matlab programming. I am using Mac OS and have placed all my audio files in the same folder as Matlab; how do I read all the .wav audio files? I want to automate the process.
Example:
Firstly, I have an Excel sheet with the audio file names and information.
Secondly, I want to extract the audio file names from the Excel sheet (first column) and pass them to the audioread function in Matlab.
I need to use the following audioread function.
[y,Fs]=audioread('audio1.wav');
I want to read audio1.wav and do some calculations on it. After finishing the calculation, I will proceed to read audio2.wav and do the same calculation for it. Can you teach me how to do this and show me the code for this?
Thank you.
In Matlab you can read spreadsheet files with readcell (readmatrix is meant for purely numeric data, so it would turn a file-path column into NaNs). You are probably best to export your spreadsheet of audio files to a CSV file first.
With regard to organising the data, it would be easiest for the spreadsheet to contain the full pathname to each file (e.g. /path/from/root/to/file.wav).
So, say you had an audio_files.csv of file paths like
/path/to/file1.wav, file1data
/path/to/file2.wav, file2data
/path/to/file3.wav, file3data
You could read each file with something like
filename = 'audio_files.csv';
audio_file_list = readcell(filename);    % cell array of the CSV contents
for k = 1:size(audio_file_list, 1)
    audio_file = audio_file_list{k, 1};  % so long as the first column is the file paths
    [y, Fs] = audioread(audio_file);
    % do something to y
end
Of course, the % do something to y will depend entirely on what you want to achieve.
For my research I have a dataset of about 20,000 gzipped multiline JSON files (~2 TB, all with the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam, I'm convinced that the first step is to convert this dataset to NDJSON, since most books and tutorials assume you are working with a newline-delimited file.
What is the best way to go about converting this data?
I've tried launching a large instance on gcloud and using gunzip and jq to do this. Not surprisingly, it seems that this will take a long time.
Thanks in advance for any help!
Apache Beam supports decompressing files if you use TextIO, but the delimiter remains the newline.
For multiline JSON you can read each complete file in parallel, then convert the JSON string to a POJO, and eventually reshuffle the data to utilize parallelism.
So the steps would be
Get the file list > Read individual files > Parse file content to json objects > Reshuffle > ...
You can get the file list with FileSystems.match("gs://my_bucket").metadata().
Read individual files with Compression.detect(fileResourceId.getFilename()).readDecompressed(FileSystems.open(fileResourceId)).
Converting to NDJSON is not necessary if you use sc.wholeTextFiles in Spark. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the file path and ._2 is the content of the file.
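For example, a minimal PySpark sketch of the same idea (the bucket path and file pattern are illustrative; Spark decompresses .gz files transparently):
import json
files = sc.wholeTextFiles("gs://my-bucket/data/*.json.gz")  # (file path, file content) pairs
records = files.map(lambda kv: json.loads(kv[1]))  # parse each whole file as one JSON document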
I need a script that reads CSV files, gets the headers, and filters by a specific column. I have tried researching this, but haven't managed to find anything of quality.
Any help will be deeply appreciated.
There's a standard csv library included with Python.
https://docs.python.org/3/library/csv.html
Its DictReader reads each row as a dictionary, where the first row in the CSV determines the keys.
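A minimal sketch of getting the headers and filtering by a column (the file name, column name, and value are illustrative):
import csv
with open("data.csv", newline="") as f:
    reader = csv.DictReader(f)  # keys come from the header row
    print(reader.fieldnames)  # the headers
    rows = [row for row in reader if row["status"] == "active"]  # filter by a specific column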
You can also use pandas.read_csv for the same task.
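The same headers-and-filter sketch with pandas, under the same illustrative names:
import pandas as pd
df = pd.read_csv("data.csv")
print(list(df.columns))  # the headers
rows = df[df["status"] == "active"]  # filter by a specific column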
In connection with my earlier question, when I give the command,
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()
some part of the data has '\xa0' prefixed to every word, while another part of the data doesn't have that special character. I am attaching 2 pictures, one with '\xa0' and one without. The content shown in both pictures belongs to the same file; only some of the data from that file is read that way by Spark. I have checked the original data file in HDFS, and it was problem free.
I feel that it has something to do with encoding. I tried methods like using the replace option in flatMap, such as flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but none worked for me. This question might sound dumb, but I am a newbie to Apache Spark and require some assistance to overcome this problem.
Can anyone please help me? Thanks in advance.
Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8 encoded file.
One solution is to acquire your file with sc.binaryFiles and then apply the expected encoding yourself.
sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is the content as bytes.
If you need to keep only the text, apply a decoding function:
filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
texts = filePath.map(lambda x: x[1].decode('utf-8'))  # or another encoding depending on your file
I want to know how I can import CSV data into J, and then how I can work with it.
I have loaded the file but do not know how to read it.
'',' fixdsv dat ] load '/Users/apple/Downloads/data'
Assuming that the file /Users/apple/Downloads/data is a csv file then you should be able to load it into a J session as a boxed table like this:
load 'csv'
data=: readcsv '/Users/apple/Downloads/data'
If the file uses delimiters other than commas (e.g. Tabs) then you could use the tables/dsv addon.
data=: TAB readdsv '/Users/apple/Downloads/data'
See the J wiki for more information on the tables/csv and tables/dsv addons.
After loading the file, I think I would start by reading it into a variable and then working with that.
data=: 1!:1 <'filepath/filename' NB. filename and path need to be a boxed string
http://www.jsoftware.com/help/dictionary/dx001.htm
Also, you could look at Jd, which is a relational database system, if you are more focused on file management than data processing.
http://code.jsoftware.com/wiki/Jd/Index