Creating .txt output with strings and numbers in MATLAB

I'm new to MATLAB.
I have multiple .txt files, up to 1000 lines each, with content like the following:
09.10.2015,08:17:02,51683,8,3286,78,6,7,0,13
I'm trying to merge all the .txt files into one big .txt file that I can use for further analysis.
The .txt files have the same number of columns but different numbers of lines.
I have no trouble merging the files when they contain only numbers, but the date and time columns cause difficulties.
Really would appreciate any help you could give.

This is not a job for MATLAB: you would be reading all the data (parsing the format) and then writing it back out (creating a new file), which is inefficient and could blow up your memory if you have really big data.
This is a job for the shell. On Unix, something like:
cat *.txt > bigFile.txt
Or on Windows (in PowerShell, where cat is an alias for Get-Content):
cat *.txt >> bigFile.txt
Or in cmd:
copy /b *.txt bigFile.txt

You just need to read all the files and store them somehow (in a matrix, in a cell array), whatever suits you better.
Use fopen and fread with a file identifier (fid), or even simpler - http://www.mathworks.com/help/matlab/ref/fscanf.html
Once you have your information fully organized, just write it out with this function - http://www.mathworks.com/help/matlab/ref/fprintf.html

Related

Continuous appending of data to an existing tabular data file (CSV, Parquet) using PySpark

For a project I need to append, frequently but non-periodically, about one thousand or more data files (tabular data) to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). In the end, I need to be able to do some filtering on the resulting file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file; sometimes the data is so big that several files are appended.
And because input files may be appended many times, by different users and from different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV/Parquet files and noticed that for each append a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the data to a new file doesn't seem very efficient here.
"NameNode memory overflow"
Then increase the heap size of the NameNode.
"quickly realized that I was generating A LOT of files"
HDFS write operations almost never append to single files. They append "into a directory" by creating new files, yes.
From Spark, you can use coalesce and repartition to create larger write batches (sketched below).
As you mentioned, you want Parquet, so write that instead; that will give you even smaller file sizes in HDFS.
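A minimal sketch of that approach (the paths and the partition count are placeholders, not the asker's real pipeline): read the batch of newly arrived files, coalesce them down to a few partitions, and append them as Parquet so each ingest adds only a handful of reasonably sized part-files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-append").getOrCreate()

# Read the freshly copied input files (placeholder path).
new_data = spark.read.csv("/user/applepy/incoming/*.csv", header=True, inferSchema=True)

# coalesce() merges partitions without a shuffle; pick a count that yields
# part-files close to the HDFS block size (e.g. 128-256 MB each).
(new_data
    .coalesce(4)
    .write
    .mode("append")
    .parquet("/user/applepy/pyspark_partition/uuid.parquet"))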
"or any other appropriate file format (no DB)"
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest/ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.

How can I find & delete duplicate strings from ~800 GB worth of text files?

I have a dataset of ~800 GB worth of text files, about 50k .txt files in total.
I'd like to go through them and make a master .txt file, with all duplicate lines removed across all the txt files.
I can't find a way to do this that isn't going to take months for my computer to process; ideally I'd like to keep it under a week.
sort -u <data.txt >clean.txt
All you need is a large disk.
sort is quite efficient: it will automatically split the file into manageable chunks, sort each one separately, then merge them (which can be done in O(N) time); and while merging, it will discard the dupes (due to the -u option). But you will need at least the space for the output file, plus the space for all the intermediate files.

In Excel how can I update an external file if it has so many rows that it cannot be loaded?

I have a .csv file with around 2 million rows, and I want to add a new column. The problem is, I could only manage to do that by losing a lot of data (basically everything above ~1.1 million rows). When I used a connection to the external file (so that I could read all rows) and made changes to it in Power Query, the changes were not saved to the .csv file.
You can apply one of several solutions:
Using a text editor which can handle huge files, save the csv file into smaller chunks. Apply the modifications to each chunk. Join the chunks again to get the desired file.
Create a "small" program yourself which loads the csv line by line and applies the modification, writing the resulting data to a second file (see the sketch below).
Maybe some other software can handle a csv of that size. You could even patch LibreOffice for this purpose, to handle 2,000,000+ rows - the source code is available :)
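A minimal Python sketch of the second option, assuming the new column can be computed one row at a time (the file names and the computed value are placeholders):
import csv

with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    writer.writerow(header + ["new_column"])       # extend the header row
    for row in reader:
        writer.writerow(row + ["computed_value"])  # append the new value to each row
Because this streams the file row by row, the 2-million-row csv never has to fit into Excel or into memory.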

Split many CSV files into a few bigger files in Linux

I have a bunch of small CSV files (a few hundred files, about 100 MB each) that I want to pack into several bigger files. I know how to join all (or a subset) of those files into one file - I simply need to use the cat command in Linux and redirect its output to a file. My problem is that the resulting files must not be bigger than some limit (say, 5 GB), i.e. merging all the small files into one is not a solution because the resulting file would be too big. So, I am wondering if there is a way to do it on the command line that would be simpler than writing a bash script looping over the directory?
Thanks.
The split command does exactly what you need. You can have it split STDIN into different output files based on size or number of lines. You can also specify the output file suffix.

How to convert an xml file of the Stack Overflow dump to a csv file

I have a Stack Overflow data dump file in .xml format, nearly 27 GB, and I want to convert it to a .csv file. Could somebody please point me to tools to convert xml to csv, or a Python program?
Use one of the Python xml modules to parse the .xml file. Unless you have much more than 27 GB of RAM, you will need to do this incrementally, so limit your choices accordingly. Use the csv module to write the .csv file.
Your real problem is this: csv files are lines of fields, and they represent a rectangular table. Xml files, in general, can represent more complex structures: hierarchical databases, and/or multiple tables. So your real problem is to understand the data dump format well enough to extract records to write to the .csv file.
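For example, a rough incremental sketch using xml.etree.ElementTree.iterparse plus the csv module, assuming the usual dump layout where each record is a <row .../> element whose attributes are the fields (the file name and chosen columns here are placeholders):
import csv
import xml.etree.ElementTree as ET

fields = ["Id", "PostTypeId", "CreationDate", "Score"]   # pick the attributes you need

with open("Posts.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(fields)
    # iterparse streams the document element by element instead of loading 27 GB at once
    for event, elem in ET.iterparse("Posts.xml", events=("end",)):
        if elem.tag == "row":
            writer.writerow([elem.attrib.get(f, "") for f in fields])
        elem.clear()   # discard the parsed element so memory use stays roughly flat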
I have written a PySpark function to parse the .xml into .csv. XmltoCsv_StackExchange is the GitHub repo. I used it to convert 1 GB of xml in 2-3 minutes on a minimal 2-core, 2 GB RAM Spark setup. It can convert the 27 GB file too; just increase minPartitions from 4 to around 128 in this line:
raw = (sc.textFile(fileName, 4))
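For the 27 GB dump, that change would presumably look like this (128 is just the ballpark the answer gives; tune it to your cluster):
raw = (sc.textFile(fileName, 128))  # minPartitions raised from 4 to ~128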
