How to add columns from multiple files in U-SQL in ADLA? - azure

I have a lot of CSV files in an Azure Data Lake, containing data of various types (e.g., pressure, temperature, true/false). They are all time-stamped, and I need to collect them into a single file, aligned by timestamp, for machine learning purposes. This is easy enough to do in Java: open a file stream, run a loop over the folder that opens each file, compare timestamps to write the relevant values to the output file, and start a new column (going to the end of the first line) for each file.
While I've worked around the timestamp problem in U-SQL, I'm having trouble coming up with syntax that will run this over the whole folder. The wildcard syntax {*} treats all the files as a single fileset, whereas I need some sort of loop that joins a column from each file individually.
Is there any way to do this, perhaps using virtual columns?

First, you have to think about your problem functionally/declaratively rather than in terms of procedural paradigms such as loops.
Let me try to rephrase your question to see if I can help. You have many CSV files with time-stamped data. Different files can have rows with the same timestamp, and you want all rows for the same timestamp (or range of timestamps) output to a specific file? So you basically want to repartition the data?
What is the format of each of the files? Do they all have the same schema, or different schemas? In the latter case, how can you differentiate them? Based on the filename?
Let me know in the comments if that is a correct declarative restatement and the answers to my questions and I will augment my answer with the next step.

Related

Summing csv columns

I have a CSV file with a couple of columns and about 100k rows. One of the columns is a date, and I was wondering what the easiest way is to count, for each possible date, the number of rows that have that date in the column, and then write a new CSV file with just the date and its row count. Any language or method is fine!
Thanks
[Example of what the data looks like now]
I recommend using CsvHelper and Visual Studio in C#. This is by far the easiest, and a very fast, way to read and process CSV files. CsvHelper is a popular library that makes it easy to process almost any CSV file and is much faster than the standard .NET alternatives.
Here is another blog post about using the library: Keep It Simple with CsvHelper
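Since the question says any language or method is fine, here is a minimal sketch of the counting step in plain Python, as a point of reference; the CsvHelper route in C# follows the same pattern (read the rows, group by the date column, write out the counts). The file names and the "date" column header are assumptions, so adjust them to match your data.

```python
# Count rows per date in a CSV and write date,row_count pairs to a new CSV.
# Assumes the input has a header row and the date column is literally named "date".
import csv
from collections import Counter

counts = Counter()
with open("input.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["date"]] += 1  # one increment per row carrying that date

with open("date_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "row_count"])
    for date, count in sorted(counts.items()):
        writer.writerow([date, count])
```

At roughly 100k rows this fits comfortably in memory, so no streaming tricks are needed.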

Node.js script for editing a .xlsx file

I have a large .xlsx file where each row contains a person's name and various other information. Some rows have duplicate entries throughout the file. I'd like to create a Node.js script that parses the file and deletes the rows with duplicate entries. What is the easiest way to go about this?
I have found Sheet.js to be the easiest way to interact with Excel files in node. They publish the xlsx node module: https://www.npmjs.com/package/xlsx.
The documentation can be a bit confusing, however. If you have specific issues during your implementation, feel free to edit your question with code or ask a new question!
Concerning your specific scenario, the xlsx module comes with some nifty ways to convert spreadsheets to and from arrays of arrays as well as arrays of objects. You say you have "a large .xlsx file". If it is truly massive, you might consider something like a stream read of the spreadsheet, populating a new array with the duplicates as you go. Then, using the original spreadsheet, stream it again into a new document, omitting the entries from the duplicates array.
However, the array-of-arrays helpers and the like might be an easier route. I have done in-memory processing of CSVs with nearly 100,000 rows (~50 MB). It's a bit slow, but definitely possible.
Hope that helps
https://docs.sheetjs.com

Combining CSVs in Power Query returning 1 row of data

I am trying to set up a query that will simply combine data from CSVs into a table as new files get added to a specific folder, where each row contains the data from a separate file. While doing tests with CSVs that I created in Excel, this was very simple: after expanding the content column, I would see an individual row of data for each file.
In practice, however, where I am trying to use CSVs put out by a proprietary Android app, expanding the content column leads to a single row, with the data from all the files placed end to end.
Does this have something to do with there not being an "end of line" character in the CSVs the app is producing? If so, is there an easy way to remedy this without changing the app? If not, is there something simple and direct I can ask the developer to change that would prevent this behavior?
Thanks for any insight!

Duplicates removal from different excel files at a time

I have 5 folders, and each folder contains around 20 Excel sheets.
These Excel sheets contain duplicates within them, and it is becoming very tedious to open every file and remove the duplicates.
Is there any other way to remove duplicates from all these files at once?
All the files contain different sets of duplicates, and no common columns are present.
I really understand your situation, but I think the solution will be one of two:
1. Write a program in any programming language you know that loads the files one by one and removes the duplicates (a minimal sketch of this option is below).
2. (The easier one.) Find a good converter to turn all your files into SQL tables, ask here how to delete duplicated rows from different SQL tables, and then convert the SQL tables back to Excel files.
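If you go with the first option, here is a minimal sketch in Python under a few assumptions the answer does not state: the workbooks are .xlsx files, pandas and openpyxl are available, a "duplicate" means an entire row repeated within the same sheet, and the folder names below are placeholders.

```python
# Walk each folder, drop duplicate rows from every sheet of every workbook,
# and save a cleaned copy next to the original file.
from pathlib import Path
import pandas as pd

folders = ["folder1", "folder2", "folder3", "folder4", "folder5"]  # placeholder names

for folder in folders:
    for path in Path(folder).glob("*.xlsx"):
        sheets = pd.read_excel(path, sheet_name=None)   # dict: sheet name -> DataFrame
        cleaned = {name: df.drop_duplicates() for name, df in sheets.items()}
        out = path.with_name(path.stem + "_deduped.xlsx")
        with pd.ExcelWriter(out) as writer:
            for name, df in cleaned.items():
                df.to_excel(writer, sheet_name=name, index=False)
```

Writing to a new file keeps the originals intact, so you can spot-check a few workbooks before replacing anything.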

Can NetLogo Read an Excel File Format?

I've been using individual lists of data to update variables in my ABM. Unfortunately, due to the size of the data I am using now, it is becoming time-consuming to build the lists and then the tables. The data for these tables changes frequently, so it is not just a one-time thing.
I am hoping to get some ideas for a method of creating a table that can be read directly from an Excel spreadsheet, without spending the time to build the table explicitly by inputting the individual lists. My table includes one list of keys (over 1000 of them) and nearly a hundred variables corresponding to each key that must be updated when the key is called. The data is produced by a different model (not an ABM), which outputs an Excel spreadsheet with keys (X values) and values (Y values). Something like:
X1 Y1,1 Y1,2 Y1,3… Y1,100
X2 Y2,1 Y2,2 Y2,3… Y2,100
…..
X1000 Y1000,1 Y1000,2 Y1000,3…. Y1000,100
If anyone has a faster method for getting large amounts of data from Excel into a NetLogo table, I would be very appreciative.
Two solutions, assuming you do not want to write an extension. You can save the Excel file as CSV and then either:
1. write a NetLogo procedure to read your CSV file (to get started, see http://netlogoabm.blogspot.com/2014/01/reading-from-csv-file.html), or
2. use a scripting language (Python recommended) to read the CSV file and then write out a .nls file with code for creating the table. A minimal sketch of this second route is below.
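The sketch below makes assumptions the answer does not spell out: the exported spreadsheet has been saved as CSV with one key followed by its ~100 values per row, no header row, and numeric keys and values (quote them in the generated code if they are strings). The file names and the reporter name are placeholders.

```python
# Read the exported CSV and generate a .nls file that builds the lookup table
# with the NetLogo table extension (declare `extensions [table]` in the model).
import csv

with open("model_output.csv", newline="") as f:
    rows = list(csv.reader(f))

with open("lookup-table.nls", "w") as out:
    out.write(";; generated file -- do not edit by hand\n")
    out.write("to-report make-lookup-table\n")
    out.write("  report table:from-list (list\n")
    for row in rows:
        key, values = row[0], row[1:]
        # each entry is a two-item list: the key and the list of its values
        out.write("    (list {} (list {}))\n".format(key, " ".join(values)))
    out.write("  )\n")
    out.write("end\n")
```

Rerunning the script whenever the other model produces new output regenerates the .nls file, so the NetLogo code itself never has to change.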
