I have one file (let's call it enrolled_students.txt) that I need to read in Perl. Each line of this file contains data that requires me to look in other files for more information.
For example, the main database will have names and addresses. But depending on the nationality of each person, I have to refer to other files (sorted by country) to find the matching name, the nationality and home address.
Let's say I have 100 name_of_country.txt files and there are 10,000 lines in my enrolled_students.txt. My questions are:
Do I read each line in enrolled_students.txt and parse the other 100 files one by one to find a match? That seems like an awful way to process this data. Is there a faster way to do this?
Can I execute this process in parallel mode (multithread)?
Thanks,
Hans
What you are trying to do here is similar to what a database engine has to do when joining data from two tables together. A database engine will typically have a number of different join plans to choose from, and it will attempt to choose the best one based on what it knows about the data in each table.
The same applies to you. There are several ways to join the data and the best way will depend on factors such as the size of each of the input files, whether they are pre-sorted, etc.
Some possible approaches:
A 'Nested Loop', where you read each line of the enrolled_students.txt file and for each of those iterate through the other file(s) to find a match. Not likely to be very fast, you would probably only choose this if the files were too large to make any other solution practical.
A 'Hash Join', where you would read one half of the data to be joined (in your example, probably the name_of_country.txt) into a data structure indexed by a hash. Then for each row of the other file, you can look up the corresponding row in the hash. This can be quite high performance, as long as there is enough memory to store at least one of the two sets of data at once.
If both files are in some sorted order, sorted according to the same key, you might be able to use a 'Merge Join'. This is where you read rows from both files at once, matching the records together like teeth in a zipper.
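As a rough illustration of the 'Hash Join' idea, here is a minimal sketch in Python rather than Perl. The tab-separated layout with the name in the first column, the sample rows, and the function names build_index and hash_join are all assumptions made for the example, not something taken from the question.

```python
from typing import Dict, Iterable, Iterator, List


def build_index(country_lines: Iterable[str], sep: str = "\t") -> Dict[str, List[str]]:
    """Load one side of the join into a dict keyed by the first field (the name)."""
    index: Dict[str, List[str]] = {}
    for line in country_lines:
        fields = line.rstrip("\n").split(sep)
        if fields[0]:
            # If a name appears twice, the later line overwrites the earlier one.
            index[fields[0]] = fields[1:]
    return index


def hash_join(student_lines: Iterable[str], index: Dict[str, List[str]],
              sep: str = "\t") -> Iterator[List[str]]:
    """Stream the other side and look each name up in the dict."""
    for line in student_lines:
        fields = line.rstrip("\n").split(sep)
        if not fields[0]:
            continue  # skip blank lines
        match = index.get(fields[0])
        if match is not None:
            yield fields + match


# Hypothetical sample data standing in for the real files.
countries = ["Hans\tGermany\tBerlin\n", "Maria\tSpain\tMadrid\n"]
students = ["Hans\t21\n", "Lee\t19\n", "Maria\t22\n"]
joined = list(hash_join(students, build_index(countries)))
for row in joined:
    print(row)
```

Each file is read exactly once, so the cost grows with the total number of lines rather than with the product of the two file sizes. With real files, the lists above would be replaced by open file handles.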
The above assumes a simple case with two data files that have to be joined. Your question talks about 100 different name_of_country.txt files, which might complicate matters.
In regard to your second question - can you use parallel processing - that would probably only be useful if the processing was CPU-bound. The complexity of producing a forked or threaded solution is probably not warranted unless you find that it is actually CPU bound.
Finally - if you are doing multiple analysis runs of the same data, it might be advisable to import the data into a real database and use that to run queries. That would save you a lot of coding work.
I will treat your question as: how do I efficiently perform a "join" operation on two files? Here is the answer.
Actually there is a join command in Unix.
http://linux.die.net/man/1/join
Suppose you have two files, student and student_with_country:
student: [name] [age] [...]
student_with_country: [name] [country] [...]
you can do:
join student student_with_country (by default, it joins on the first field; both files must already be sorted on that field)
Then the question is how to make it faster by using multiple cores?
The answer is the GNU parallel command. Basically, you can run a simple map-reduce program using it. For example, in this case
cat student_with_country | parallel --block 10M --pipe join student -
It will divide the student_with_country file into blocks of about 10 MB and run the join command on them in parallel. In this way, you can utilize the power of multiple cores.
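For readers who want to see what join does conceptually on sorted input, here is a minimal Python sketch of a merge join. It is not the implementation of the join command; it assumes each key appears at most once per file and compares keys with plain Python string ordering, which may differ from the locale-based ordering join expects. The function name merge_join and the sample rows are made up for the example.

```python
from typing import Iterable, Iterator, List


def merge_join(left: Iterable[List[str]], right: Iterable[List[str]]) -> Iterator[List[str]]:
    """Join rows from two inputs that are both sorted by their first field.

    Assumes keys are unique within each input.
    """
    left_it, right_it = iter(left), iter(right)
    l_row = next(left_it, None)
    r_row = next(right_it, None)
    while l_row is not None and r_row is not None:
        if l_row[0] < r_row[0]:
            l_row = next(left_it, None)   # left key is behind, advance left
        elif l_row[0] > r_row[0]:
            r_row = next(right_it, None)  # right key is behind, advance right
        else:
            yield l_row + r_row[1:]       # keys match, emit the combined row
            l_row = next(left_it, None)
            r_row = next(right_it, None)


# Hypothetical rows, already sorted by name.
student = [["Anna", "20"], ["Hans", "21"], ["Lee", "19"]]
student_with_country = [["Hans", "Germany"], ["Maria", "Spain"]]
merged = list(merge_join(student, student_with_country))
print(merged)
```

Both inputs are consumed in a single forward pass, which is why the sort order matters.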
I have multiple large files, and I use a glob that matches them all to read them into a single dataframe. Then I do some mapping, i.e. processing rows independently of each other. For development purposes, I don't want to process the whole data, so I'm thinking of doing a df.take(5). Will Spark be smart enough to realize that it only needs to read the first five rows of the first file? Thanks!
I'm hoping it will only read the first five records, but I don't know if it does.
I've seen many answers and blog posts suggesting that:
df.repartition('category').write().partitionBy('category')
Will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". The answer at https://stackoverflow.com/a/42780452/529618 seems to explain this.
What is the fastest way to get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write().partitionBy('category')
df.repartition(2, 'category').write().partitionBy('category')
Will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write().partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, a separate count is expensive if df is not cached (or is so big that it would be wasteful to cache just for this purpose). Also, any repartitioning of a dataframe can cause unnecessary shuffling in a multi-stage workflow that has various dataframe outputs along the way.
The "fastest" way probably depends on the actual hardware set-up and actual data (in case it is skewed). To my knowledge, I also agree that df.repartition('category').write().partitionBy('category') will not help solving your problem.
We faced a similar problem in our application, but instead of first doing a count and then the repartition, we separated the writing of the data and the requirement to have only a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure, reads the data per folder/partition, coalesces it to one partition and overwrites the folder with the result. Again, I cannot tell whether that is also the fastest way in your environment, but for us it did the trick.
Some research on this topic led me to the Auto Optimize Writes feature on Databricks for writing to a Delta Table. Here, they use a similar approach: first writing the data and then running a separate OPTIMIZE job to aggregate the files into a single file. In the mentioned link you will find this explanation:
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: Make sure to keep the configuration spark.sql.files.maxRecordsPerFile to 0 (default value) or to a negative number. Otherwise, this configuration alone could lead to multiple files for data with the same value in the column "category".
You can try coalesce(n). coalesce is used to decrease the number of partitions, and it is cheaper than repartition because it avoids a full shuffle.
n = the number of partitions you want in the output.
[Disclaimer: While this question is somewhat specific, I think it circles a very generic issue with Hadoop/Spark.]
I need to process a large dataset (~14TB) in Spark. Not doing aggregations, mostly filtering. Given ~30k files (250 part files, per month for 10 years, each part being ~ 200MB), I would like to load them into a RDD/DataFrame and filter out items based on some arbitrary filters.
To make the listing of the files efficient (I'm on google dataproc/cloud storage, so the driver doing a wildcard glob was very serial and very slow), I precalculate an RDD of the file names, then load them into an RDD (I'm using avro, but file type shouldn't be relevant), e.g.
#returns an array of files to load
files = sc.textFile('/list/of/files/').collect()
#load the files into a dataframe
documents = sqlContext.read.format('com.databricks.spark.avro').load(files)
When I do this, even on a 50-worker cluster, it seems that only one executor is doing the work of reading the files. I've experimented with broadcasting the files list and read a dozen different approaches but I can't seem to crack the issue.
So, is there an efficient way to create a very large dataframe from multiple files? How do I best take advantage of all the potential computing power when creating this RDD?
This approach works very well on smaller sets but, at this size, I see a large number of symptoms like long-running processes with no feedback. Is there some treasure trove of knowledge -- besides @zero323 :-) -- on optimizing Spark at this scale?
Listing 30k files shouldn't be an issue for GCS. Even if each GCS list request, which returns up to 500 files at a time, takes 1 second, that is 60 requests, so all 30k files will be listed in a minute or so. There could be some corner cases with some glob patterns that make it slow, but there were recent optimizations in the GCS connector globbing implementation that could help.
That's why it should be good enough for you to just rely on default Spark API with globbing:
val df = sqlContext.read.avro("gs://<BUCKET>/path/to/files/")
I have been working with databases recently and before that I was developing standalone components that do not use databases.
With all the DB work I have a few questions that sprang up.
Why is a database query faster than a programming language data retrieval from a file?
To elaborate my question further -
Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For reasons of simplicity we will also assume they are all strings of fixed length and they do not have any indexes or primary keys or any other constraints.
Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query Select Name,ID from Employee where DOB="12/12/1985", the DBMS picks up the data from the file, processes it, filters it and gives me a result which is a subset of the 1 million rows of data.
Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk.
When I write a program in C++ or C or C# or Java and do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check for each row of data whether DOB="12/12/1985"; if it matches, I present the row to the user.
This way of doing it by a program is too slow when compared to the speed at which a SQL query returns the results.
I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.
So what happens in a DBMS that makes it faster to retrieve data than through a programming language?
If this question is inappropriate on this forum, please delete but do provide me some pointers where I may find an answer.
I use SQL Server if that is of any help.
Why is a database query faster than a programming language data retrieval from a file
That depends on many things - network latency and disk seek speeds being two of the important ones. Sometimes it is faster to read from a file.
In your description of finding a row within a million rows, a database will normally be faster than seeking in a file because it employs indexing on the data.
If you pre-process your data file and provide index files for the different fields, you could speed up data lookup from the filesystem as well.
Note: databases are normally used not for this feature, but because they are ACID compliant and therefore are suitable for working in environments where you have multiple processes (normally many clients on many computers) querying the database at the same time.
There are lots of techniques to speed up various kinds of access. As @Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date, instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)
On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.
Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.
Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At the minimum, you'll need features to make sure that changing your configuration will not corrupt your file...
In other words, you will have written your own database.
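To make the "date index for your file" idea concrete, here is a minimal Python sketch. The fixed-width layout (10-byte name, 4-byte ID, 10-byte DOB), the sample records, and the names pack_record, build_dob_index and lookup are assumptions invented for the example; the index lives only in memory, whereas a real one would also have to be persisted and kept in sync with the data file.

```python
import io
from collections import defaultdict
from typing import BinaryIO, Dict, List, Tuple

# Hypothetical fixed-width layout: name (10 bytes), id (4 bytes), dob (10 bytes).
NAME_LEN, ID_LEN, DOB_LEN = 10, 4, 10
RECORD_LEN = NAME_LEN + ID_LEN + DOB_LEN


def pack_record(name: str, emp_id: str, dob: str) -> bytes:
    """Build one fixed-width record (assumes ASCII fields of valid length)."""
    return name.ljust(NAME_LEN).encode() + emp_id.encode() + dob.encode()


def build_dob_index(f: BinaryIO) -> Dict[bytes, List[int]]:
    """Scan the file once and map each DOB to the byte offsets of its records."""
    index: Dict[bytes, List[int]] = defaultdict(list)
    offset = 0
    f.seek(0)
    while True:
        record = f.read(RECORD_LEN)
        if len(record) < RECORD_LEN:
            break
        index[record[NAME_LEN + ID_LEN:]].append(offset)
        offset += RECORD_LEN
    return index


def lookup(f: BinaryIO, index: Dict[bytes, List[int]], dob: bytes) -> List[Tuple[str, str]]:
    """Return (name, id) pairs for one DOB by seeking straight to the matches."""
    results = []
    for offset in index.get(dob, []):
        f.seek(offset)
        record = f.read(RECORD_LEN)
        name = record[:NAME_LEN].decode().rstrip()
        emp_id = record[NAME_LEN:NAME_LEN + ID_LEN].decode()
        results.append((name, emp_id))
    return results


# An in-memory stand-in for the flat file.
data = io.BytesIO(
    pack_record("Alice", "0001", "12/12/1985")
    + pack_record("Bob", "0002", "01/01/1990")
    + pack_record("Carol", "0003", "12/12/1985")
)
dob_index = build_dob_index(data)
matches = lookup(data, dob_index, b"12/12/1985")
print(matches)
```

Building the index still costs one full scan, but every later query by date touches only the matching records.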
...an old one, I know... just for if somebody finds this: The question contained "assume ... do not have any indexes"
...so the question was about the sequential dataread fight between the database and a flat file WITHOUT indexes, which the database wins...
And the answer is: if you read record by record from disk, you do lots of disk seeking, which is expensive in terms of performance. By design, a database always loads whole pages, so it gets a number of records all at once. Less disk seeking is definitely faster. If you did a memory-buffered read from a flat file, you could achieve the same or better read performance.
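A memory-buffered scan of a flat file could look roughly like the following Python sketch. The 24-byte record layout, the sample data, and the function name scan_by_dob are assumptions for illustration. It also assumes a buffered file object (such as the one returned by open(path, "rb")), whose read() returns the full requested size until end of file. Python's default file objects already buffer reads, so the large reads here mainly make the batching explicit.

```python
import io
from typing import BinaryIO, Iterator

RECORD_LEN = 24  # hypothetical: 10-byte name + 4-byte id + 10-byte dob
DOB_OFFSET = 14  # the dob field starts after name and id


def scan_by_dob(f: BinaryIO, dob: bytes, records_per_read: int = 4096) -> Iterator[bytes]:
    """Yield matching records, fetching many records per read call."""
    chunk_size = RECORD_LEN * records_per_read
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # Walk the complete records in this chunk; a trailing partial record is ignored.
        for start in range(0, len(chunk) - RECORD_LEN + 1, RECORD_LEN):
            record = chunk[start:start + RECORD_LEN]
            if record[DOB_OFFSET:] == dob:
                yield record


# An in-memory stand-in for the flat file.
rec_a = b"Alice".ljust(10) + b"0001" + b"12/12/1985"
rec_b = b"Bob".ljust(10) + b"0002" + b"01/01/1990"
rec_c = b"Carol".ljust(10) + b"0003" + b"12/12/1985"
found = list(scan_by_dob(io.BytesIO(rec_a + rec_b + rec_c), b"12/12/1985", records_per_read=2))
print(found)
```

This is still a full scan of every record; it only reduces the number of separate read requests.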
I execute a batch update which modifies a few rows within a few column families. In case of a TimedOutException some data could be modified, but possibly not the whole set.
In order to implement a compensating transaction, I would need to know what data (rows) was modified - is there a way to find this out? Does the exception contain this information?
Thanks,
Maciej
Creating a system that can scale out means taking some trade-offs - one of these is facilitating "idempotent" operations in your application.
This means that you would either:
assume that the data was written somewhere and that the node will eventually become consistent
fire the entire contents of the write again, perhaps sleeping a given amount of time or at a less restrictive consistency level
A good description of this approach can be found in section 6 of Pat Helland's "Building on Quicksand" paper: http://arxiv.org/pdf/0909.1788
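The second option, re-sending the whole write, can be sketched in Python as follows. This is not any client library's API: WriteTimeout, write_with_retry and flaky_write are hypothetical names, and the level names "QUORUM" and "ONE" are used only as example consistency levels. The approach is safe only if the write is idempotent, so that applying it twice leaves the same data as applying it once.

```python
import time
from typing import Callable, Sequence


class WriteTimeout(Exception):
    """Stand-in for the client's timeout error (hypothetical)."""


def write_with_retry(write: Callable[[str], None],
                     consistency_levels: Sequence[str],
                     delay_seconds: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Re-send the whole idempotent batch until one attempt succeeds.

    Each attempt uses the next entry in consistency_levels, so the list can
    move from a strict level to a less restrictive one. Returns the level
    that succeeded, or raises WriteTimeout if every attempt timed out.
    """
    last_error = None
    for attempt, level in enumerate(consistency_levels):
        if attempt > 0:
            sleep(delay_seconds)  # wait before firing the write again
        try:
            write(level)
            return level
        except WriteTimeout as exc:
            last_error = exc
    raise WriteTimeout("all attempts timed out") from last_error


# Simulated batch write that times out at the stricter level.
attempts = []


def flaky_write(level: str) -> None:
    attempts.append(level)
    if level == "QUORUM":
        raise WriteTimeout("simulated timeout")


used = write_with_retry(flaky_write, ["QUORUM", "ONE"], sleep=lambda _s: None)
print(used, attempts)
```

Because the full batch is sent again, the caller never needs to know which rows the timed-out attempt managed to modify.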