I have a CSV file with a couple of columns and about 100k rows. One of the columns is a date. What is the easiest way to count, for every date that appears in that column, how many rows have that date, and then write a new CSV file containing just each date and its row count? Any language or method is fine!
Thanks
Example of what the data looks like now
I recommend using CsvHelper with C# in Visual Studio. This is by far the easiest, and a fast, way to read and process CSV files. CsvHelper is a popular library that makes it easy to process almost any CSV file and is much faster than the standard .NET alternatives.
Here is another blog post about using the library: Keep it Simple with CsvHelper
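To give a feel for it, here is a rough sketch of the date-count with CsvHelper. The file names ("input.csv", "counts.csv") and the "Date" column header are placeholders; adjust them to match your actual data.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;

class DateCounter
{
    static void Main()
    {
        var counts = new Dictionary<string, int>();

        // Read the source file and tally rows per date.
        // "input.csv" and the column name "Date" are assumptions for this sketch.
        using (var reader = new StreamReader("input.csv"))
        using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
        {
            csv.Read();
            csv.ReadHeader();
            while (csv.Read())
            {
                var date = csv.GetField("Date");
                counts[date] = counts.TryGetValue(date, out var n) ? n + 1 : 1;
            }
        }

        // Write one row per distinct date with its count.
        using (var writer = new StreamWriter("counts.csv"))
        using (var output = new CsvWriter(writer, CultureInfo.InvariantCulture))
        {
            output.WriteField("Date");
            output.WriteField("Count");
            output.NextRecord();
            foreach (var pair in counts)
            {
                output.WriteField(pair.Key);
                output.WriteField(pair.Value);
                output.NextRecord();
            }
        }
    }
}
```

That produces a counts.csv with one Date,Count row per distinct date (sort the keys first if you want the dates in order).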
I have a large .xlsx file where each row contains a person's name and various other information. Some rows have duplicate entries throughout the file. I'd like to create a Node.js script that parses the file and deletes the rows with duplicate entries. What is the easiest way to go about this?
I have found Sheet.js to be the easiest way to interact with Excel files in node. They publish the xlsx node module: https://www.npmjs.com/package/xlsx.
The documentation can be a bit confusing, however. If you have specific issues during your implementation, feel free to edit your question with code or ask a new question!
Concerning your specific scenario, the xlsx module comes with some nifty ways to convert spreadsheets to and from arrays of arrays as well as arrays of objects. You say you have "a large .xlsx file". If it is truly massive, you might consider something like a stream read from the spreadsheet, populating an array of duplicates as you go. Then, using the original spreadsheet, stream it again into a new document, omitting the entries from the duplicates array.
However, the array-of-arrays helpers and the like might be an easier route. I have done in-memory processing of CSVs with nearly 100,000 rows (~50 MB). It's a bit slow, but definitely possible.
Hope that helps
https://docs.sheetjs.com
I have a lot of CSV files in an Azure Data Lake, containing data of various types (e.g., pressure, temperature, true/false). They are all time-stamped, and I need to collect them into a single file, aligned by timestamp, for machine learning purposes. This is easy enough to do in Java: start a file stream, loop over the folder opening each file, compare timestamps to write the relevant values to the output file, and start a new column (going to the end of the first line) for each file.
While I've worked around the timestamp problem in U-SQL, I'm having trouble coming up with syntax that will let me run this over the whole folder. The wildcard syntax {*} treats all files as the same fileset, whereas I need to run some sort of loop that joins a column from each file individually.
Is there any way to do this, perhaps using virtual columns?
First, you have to think about your problem functionally/declaratively rather than in terms of procedural paradigms such as loops.
Let me try to rephrase your question to see if I can help. You have many csv files with data that is timestamped. Different files can have rows with the same timestamp, and you want to have all rows for the same timestamp (or range of timestamps) output to a specific file? So you basically want to repartition the data?
What is the format of each of the files? Do they all have the same schema or different schemas? In the latter case, how can you differentiate them? Based on filename?
Let me know in the comments if that is a correct declarative restatement and the answers to my questions and I will augment my answer with the next step.
I see that the number of rows in a worksheet is limited to 1,048,576.
Is this just an Excel thing? For example, can I create a CSV file that has more rows, say 5 million rows? I understand I can't open it with Excel, but can I still have the file and access it some other way (say, with C++)?
I assume this is feasible, as CSV is not necessarily an Excel thing, right?
Thanks in advance.
A CSV file is simply a text file formatted in a certain way. Excel's row limit is an artificial limitation of Excel itself; the CSV format has no inherent limit on its size or number of rows.
Excel is most certainly not the only program that can open or create a CSV file. If you create a CSV file with something besides Excel, you can include as many rows or fields as you wish.
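As a quick illustration (the file name and columns here are invented for the example), this C# sketch writes a CSV with 5 million data rows and reads it straight back; the same idea works in C++ or any other language, and only Excel itself would balk at opening the result:

```csharp
using System.IO;

class BigCsv
{
    static void Main()
    {
        // Write 5 million data rows -- well past Excel's 1,048,576-row ceiling.
        using (var writer = new StreamWriter("big.csv"))
        {
            writer.WriteLine("Id,Value");
            for (var i = 1; i <= 5_000_000; i++)
            {
                writer.WriteLine($"{i},{i * 2}");
            }
        }

        // Read it back line by line; nothing outside Excel cares about the row count.
        var rows = 0;
        using (var reader = new StreamReader("big.csv"))
        {
            while (reader.ReadLine() != null)
            {
                rows++;
            }
        }
        System.Console.WriteLine($"{rows} lines (including the header).");
    }
}
```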
Part of my job is to pull a weekly report that lists patching information for around 75,000 PCs. I have to filter out some erroneous data based on certain criteria, then summarize the data myself and update it in a separate spreadsheet. I am comfortable with pivot tables and formulas, but it ends up taking a good couple of hours.
Is there a way to import data from a CSV file into a template that already has my formulas/settings etc. in place, if the data has the same columns but a different number of rows each time?
If you're comfortable with programming, you can use macros. In this case, you would connect to your CSV file, extract the information, and put it in the corresponding places in your spreadsheet. On this question you can find most of what you need to start off: macro to Import csv file into an excel non active worksheet.
I need to import tabular data into my database. The data is supplied via spreadsheets (mostly Excel files) from multiple parties. The format of each of these files is similar but not the same and various transformations will be necessary to massage the data into the final format suitable for import. Furthermore the input formats are likely to change in the future. I am looking for a tool that can be run and administered by regular users to transform the input files.
Now let me list some of the transformations I am looking to do:
swap columns:
Input is:
|Name|Category|Price|
|data|data    |data |
Output is:
|Name|Price|Category|
|data|data |data    |
rename columns:
Input is:
|PRODUCTNAME|CAT |PRICE|
|data       |data|data |
Output is:
|Name|Category|Price|
|data|data    |data |
map columns according to a lookup table, like in the above examples:
replace every occurrence of the string "Car" by "automobile" in the column Category
basic maths:
multiply the price column by some factor
basic string manipulations
Let's say the format of the Price column is "3 x $45"; I would want to split that into two columns, amount and price
filtering of rows by value: exclude all rows containing the word "expensive"
etc.
I have the following requirements:
it can run on any of these platforms: Windows, Mac, Linux
Open Source, Freeware, Shareware or commercial
the transformations need to be editable via a GUI
if the tool requires end-user training to use, that is not an issue
it can handle on the order of 1000-50000 rows
Basically I am looking for a graphical tool that will help the users normalize the data so it can be imported, without me having to write a bunch of adapters.
What tools do you use to solve this?
The simplest solution IMHO would be to use Excel itself - you'll get all of Excel's built-in functions and macros for free. Have your transformation code in a macro that gets called via Excel controls (for the GUI aspect) on a spreadsheet. Find a way to insert that spreadsheet and macro into your client's Excel files. That way you don't need to worry about platform compatibility (it's their file, so they must be able to open it) and all the rest. The other requirements are met as well. The only training would be to show them how to enable macros.
The Mule Data Integrator will do all of this from a CSV file. So you can export your spreadsheet to a CSV file and load the CSV file into the MDI. It can even load the data directly into the database, and the user can specify all of the transformations you requested. The MDI will work fine in non-Mule environments. You can find it here: mulesoft.com (disclaimer: my company developed the transformation technology that this product is based on).
You didn't say which database you're importing into, or what tool you use. If you were using SQL Server, then I'd recommend using SQL Server Integration Services (SSIS) to manipulate the spreadsheets during the import process.
I tend to use MS Access as a pipeline between multiple data sources and destinations - but you're looking for something a little more automated. You can use macros and VB script with Access to help with a lot of the basics.
However, you're always going to have data consistency problems with users misinterpreting how to normalize their information. Good luck!