What's the best way to store a number (<= 10k) of huge csv files (~200k lines) in Cassandra? - cassandra

I tried to store each file as a single string in a text column, but I got errors when saving huge strings (over 45 million characters). Are there any better options than creating a long row_index column as a clustering key and storing the .csv line by line (each CSV record becomes a row in the table)?
Write/read constraints: each file is written only once (as soon as we create the .csv file), and there is a small number of reads (<= 10) for any specific file. We'd like to optimize for fast writes, if possible.
Thanks.

Related

How do I most effectively compress my highly-unique columns?

I have a Spark DataFrame consisting of many double columns that are measurements, but I want a way of annotating each unique row by computing a hash of several other non-measurement columns. This hash results in garbled strings that are highly unique, and I've noticed my dataset size increases substantially when this column is present. How can I sort / lay out my data to decrease the overall dataset size?
I know that the Snappy compression protocol used on my parquet files executes best upon runs of similar data, so I think a sort over the primary key could be useful, but I also can't coalesce() the entire dataset into a single file (it's hundreds of GB in total size before the primary key creation step).
My hashing function is SHA2(128) FYI.
If you have a column that can be computed from the other columns, then simply omit that column before compression, and reconstruct it after decompression.
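A minimal sketch of that idea in plain Python, assuming the hash is a deterministic function of the non-measurement columns. The column names (`site`, `sensor`, `value`, `row_hash`) and the helper functions are hypothetical, and SHA-256 from `hashlib` stands in for whatever hash the real pipeline uses.

```python
import hashlib

KEY_COLUMNS = ("site", "sensor")  # hypothetical non-measurement columns


def row_hash(row):
    """Recompute the hash from the key columns (SHA-256 as a stand-in)."""
    joined = "|".join(str(row[c]) for c in KEY_COLUMNS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def strip_hash(rows):
    """Drop the derived column before the data is written and compressed."""
    return [{k: v for k, v in row.items() if k != "row_hash"} for row in rows]


def restore_hash(rows):
    """Rebuild the derived column after the data is read back."""
    return [{**row, "row_hash": row_hash(row)} for row in rows]


rows = [
    {"site": "A", "sensor": 1, "value": 0.5},
    {"site": "B", "sensor": 2, "value": 1.5},
]
original = restore_hash(rows)     # rows as they look with the hash attached
stored = strip_hash(original)     # what actually gets written to disk
roundtrip = restore_hash(stored)  # what a reader reconstructs
```

Because the hash is fully determined by the key columns, `roundtrip` carries the same values as `original`, while `stored` never holds the high-entropy strings.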

How can I make my Athena SQL query faster

I am running this on AWS Athena, which is based on PrestoDB. My original plan was to query data going back 3 months in order to analyze it. However, even a query covering only the past 2 hours takes more than 30 minutes, at which point it times out. Is there a more efficient way to carry out the query?
SELECT column1, dt, column2
FROM database1
WHERE date_parse(dt, '%Y%m%d%H%i%s') > CAST(now() - interval '1' hour AS timestamp)
The date column is recorded in the form of a string YYYYmmddhhmmss
Likely, the problem is that the query applies a function to the column being filtered. This is inefficient, because the database needs to convert the entire column before it is able to filter it. Such a predicate is said to be non-SARGable.
Your primary effort should go into fixing your data model and storing dates as dates rather than strings.
That said, the string format that you are using to represent dates still makes it possible to use direct filtering. The idea is to convert the filter value to the target string format (rather than converting the column value to a date):
where dt > date_format(now() - interval '1' hour, '%Y%m%d%H%i%s')
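The rewrite relies on the fact that fixed-width YYYYmmddHHMMSS strings sort in the same order as the timestamps they encode. A small Python sketch of that property, under the assumption that every stored value really has this 14-digit form; the sample values and the helper names are made up for illustration, and `%M` is Python's spelling of the minutes field that the SQL pattern writes as `%i`.

```python
from datetime import datetime, timedelta

FMT = "%Y%m%d%H%M%S"  # Python equivalent of the '%Y%m%d%H%i%s' pattern


def cutoff_string(now, hours=1):
    """Format the cutoff once, instead of parsing every stored value."""
    return (now - timedelta(hours=hours)).strftime(FMT)


def recent_rows(dt_strings, now):
    """Keep rows newer than the cutoff using plain string comparison."""
    cutoff = cutoff_string(now)
    return [s for s in dt_strings if s > cutoff]


now = datetime(2020, 3, 1, 12, 0, 0)  # fixed 'now' so the example is repeatable
stored = ["20200301101500", "20200301113000", "20200229235959", "20200301115959"]

by_string = recent_rows(stored, now)
by_parsing = [
    s for s in stored
    if datetime.strptime(s, FMT) > now - timedelta(hours=1)
]
```

Both lists contain the same rows, but the string version converts one value (the cutoff) instead of every row.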
There are a lot of different factors that influence the time it takes for Athena to execute a query. The amount of data usually dominates, but other important factors are the data format (there's a huge difference between CSV and Parquet, for example) and the number of files. In contrast to many other databases, the complexity of the query is less often an important factor, and your query is very straightforward and is not the problem. It doesn't help that you apply a function on both sides of the WHERE condition, but it's not a big deal in Athena: the filtering is brute force, and applying a function to each row costs little compared to IO.
If you provide more information about the number of files, the data format, and so on we can probably help you better, because without that kind of information it could be just about anything. My suspicion is that you have something like a single prefix with tens or hundreds of millions of files – this is the worst possible case for Athena.
When Athena plans a query it lists the table's location on S3. S3's list operation has a page size of 1000, so if there are more files than that Athena will have to list sequentially until it gets the full listing. This cannot be parallelised, and it's also not very fast.
You need to avoid, at almost any cost, having more than 1000 files in the same prefix. If you have more files than that you can add prefixes (directories), because Athena will list S3 as if it were a file system, and parallelise listings of prefixes. 1000 files each in table-data/a/, table-data/b/, table-data/c/ is much better than 3000 files in table-data/.
The reason why I suspect it's lots of small files rather than a lot of data is that if it were a lot of data you would probably have said so, and lots of data is actually something Athena is really good at. Ripping through terabytes of data is no problem unless it's a billion tiny files.
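To put rough numbers on the listing argument, here is a back-of-the-envelope model in Python. It only encodes what is described above (pages of 1000 keys, sequential paging within a prefix, parallel listing across prefixes); it is not a measurement of Athena, and the function names are hypothetical.

```python
import math

PAGE_SIZE = 1000  # keys returned per S3 list call, as stated above


def sequential_list_calls(files_per_prefix):
    """Map each prefix to the number of list pages needed to enumerate it."""
    return {
        prefix: max(1, math.ceil(count / PAGE_SIZE))
        for prefix, count in files_per_prefix.items()
    }


def longest_chain(files_per_prefix):
    """Pages within a prefix are sequential; separate prefixes can run in parallel."""
    return max(sequential_list_calls(files_per_prefix).values())


flat = {"table-data/": 3000}
split = {"table-data/a/": 1000, "table-data/b/": 1000, "table-data/c/": 1000}
```

Under this model `longest_chain(flat)` is 3 while `longest_chain(split)` is 1, and a single prefix holding 100 million files would need 100,000 list calls one after another.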

Create dataframe from text file based on certain criteria

I have a text file that is around 3.3GB. I am only interested in 2 columns in this text file (out of 47). From these 2 columns, I only need rows where col2=='text1'. For example, consider my text file to have values such as:
text file:
col1~col2~~~~~~~~~~~~
12345~text1~~~~~~~~~~~~
12365~text1~~~~~~~~~~~~
25674~text2~~~~~~~~~~~~
35458~text3~~~~~~~~~~~~
44985~text4~~~~~~~~~~~~
I want to create a df where col2=='text1'. What I have done so far is tried to load the entire textfile into my df and then filter out the needed rows. However, since this is a large text file, creating a df takes more than 45 mins. I believe loading only the necessary rows (if possible) would be ideal as the df would be of considerably smaller size and I won't run into memory issues.
My code:
df=pd.read_csv('myfile.txt',low_memory=False,sep='~',usecols=['col1','col2'],dtype={'col2':str})
df1=df[df['col2']=='text1']
In short, can I filter on a column, based on a criterion, while loading the text file into a dataframe, so as to 1) reduce the loading time and 2) reduce the size of the df in memory?
Okay, so I came up with a solution. Basically it has to do with loading the data in chunks, and filtering the chunks for col2=='text1'. This way, I only have one chunk loaded in memory at a time and my final df will only have the data I need.
Code:
final=pd.DataFrame()
df=pd.read_csv('myfile.txt',low_memory=False,sep='~',usecols=['col1','col2'],dtype={'col2':str},chunksize=100000)
for chunk in df:
    a=chunk[chunk['col2']=='text1']
    final=pd.concat([final,a],axis=0)
Better alternatives, if any, will be most welcome!
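One possible refinement, offered as a sketch rather than a definitive answer: calling pd.concat inside the loop copies the accumulated frame on every iteration, so it may be cheaper to collect the filtered chunks in a list and concatenate once at the end. The function name `load_filtered` and the tiny in-memory sample are hypothetical stand-ins for the real 3.3GB file.

```python
import io

import pandas as pd


def load_filtered(source, wanted, chunksize=100_000):
    """Read a '~'-separated file in chunks and keep only rows where col2 == wanted."""
    reader = pd.read_csv(
        source,
        sep="~",
        usecols=["col1", "col2"],
        dtype={"col2": str},
        chunksize=chunksize,
    )
    # Collect the filtered pieces and concatenate once at the end.
    kept = [chunk[chunk["col2"] == wanted] for chunk in reader]
    return pd.concat(kept, ignore_index=True)


# Small in-memory stand-in for 'myfile.txt'.
sample = io.StringIO(
    "col1~col2~col3\n"
    "12345~text1~x\n"
    "12365~text1~y\n"
    "25674~text2~z\n"
    "35458~text3~w\n"
)
result = load_filtered(sample, "text1", chunksize=2)
```

With the real file the call would be something like load_filtered('myfile.txt', 'text1'); only one chunk plus the rows kept so far are held in memory at any time.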

Converting data from .dat to parquet using Pyspark

Why is the number of rows different after converting from .dat to the parquet data format using pyspark? Even when I repeat the conversion on the same file multiple times, I get a different result (slightly more than, slightly less than, or equal to the original row count)!
I am using my Macbook pro with 16 gb
.dat file size is 16.5 gb
spark-2.3.2-bin-hadoop2.7.
I already have the rows count from my data provider (45 million rows).
First I read the .dat file
df_2011 = spark.read.text(filepath)  # renamed from 2011_df: Python names cannot start with a digit
Second, I convert it to parquet, a process that takes about two hours.
df_2011.write.option("compression","snappy").mode("overwrite").save("2011.parquet")
Afterwards, I read the converted parquet file
de_parq = spark.read.parquet(filepath)
Finally, I use "count" to get the number of rows.
de_parq.count()

pyspark: Efficiently have partitionBy write to same number of total partitions as original table

I have a question related to pyspark's partitionBy() function, which I originally posted in a comment on this question. I was asked to post it as a separate question, so here it is:
I understand that df.partitionBy(COL) will write all the rows with each value of COL to their own folder, and that each folder will (assuming the rows were previously distributed across all the partitions by some other key) have roughly the same number of files as were previously in the entire table. I find this behavior annoying. If I have a large table with 500 partitions, and I use partitionBy(COL) on some attribute columns, I now have for example 100 folders which each contain 500 (now very small) files.
What I would like is the partitionBy(COL) behavior, but with roughly the same file size and number of files as I had originally.
As demonstration, the previous question shares a toy example where you have a table with 10 partitions and do partitionBy(dayOfWeek) and now you have 70 files because there are 10 in each folder. I would want ~10 files, one for each day, and maybe 2 or 3 for days that have more data.
Can this be easily accomplished? Something like df.write().repartition(COL).partitionBy(COL) seems like it might work, but I worry that (in the case of a very large table which is about to be partitioned into many folders) having to first combine it to some small number of partitions before doing the partitionBy(COL) seems like a bad idea.
Any suggestions are greatly appreciated!
You've got several options. In my code below I'll assume you want to write in parquet, but of course you can change that.
(1) df.repartition(numPartitions, *cols).write.partitionBy(*cols).parquet(writePath)
This will first use hash-based partitioning to ensure that a limited number of values from COL make their way into each partition. Depending on the value you choose for numPartitions, some partitions may be empty while others may be crowded with values -- for anyone not sure why, read this. Then, when you call partitionBy on the DataFrameWriter, each unique value in each partition will be placed in its own individual file.
Warning: this approach can lead to lopsided partition sizes and lopsided task execution times. This happens when values in your column are associated with many rows (e.g., a city column -- the file for New York City might have lots of rows), whereas other values are less numerous (e.g., values for small towns).
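The skew described in this warning can be illustrated without Spark. The sketch below is a plain-Python simulation, not Spark code: `zlib.crc32` is a deterministic stand-in for Spark's own hash function, and the city names and row counts are made up.

```python
import zlib
from collections import Counter


def partition_of(value, num_partitions):
    """Assign a value to a partition by hash (crc32 is a deterministic stand-in)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions


def rows_per_partition(values, num_partitions):
    """Count how many rows land in each partition, including empty ones."""
    counts = Counter(partition_of(v, num_partitions) for v in values)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed column: one big city and a few small towns.
cities = ["New York City"] * 1000 + ["Smallville"] * 10 + ["Tinytown"] * 5
sizes = rows_per_partition(cities, num_partitions=8)
```

With three distinct values spread over eight partitions, at least five partitions stay empty, and whichever partition receives "New York City" holds nearly all of the rows.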
(2) df.sort(sortCols).write.parquet(writePath)
This option works great when you want (1) the files you write to be of nearly equal sizes and (2) exact control over the number of files written. This approach first globally sorts your data and then finds splits that break up the data into k evenly-sized partitions, where k is specified in the spark config spark.sql.shuffle.partitions. This means that all rows with the same value of your sort key are adjacent to each other, but sometimes they'll span a split and end up in different files. Thus, if your use case requires all rows with the same key to be in the same partition, then don't use this approach.
There are two extra bonuses: (1) by sorting your data, its size on disk can often be reduced (e.g., sorting all events by user_id and then by time will lead to lots of repetition in column values, which aids compression) and (2) if you write to a file format that supports it (like Parquet), then subsequent readers can read the data efficiently by using predicate push-down, because the Parquet writer will write the MAX and MIN values of each column in the metadata, allowing the reader to skip rows if the query specifies values outside of the partition's (min, max) range.
Note that sorting in Spark is more expensive than just repartitioning and requires an extra stage. Behind the scenes Spark will first determine the splits in one stage, and then shuffle the data into those splits in another stage.
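To make the second bonus concrete, here is a simplified Python model of min/max skipping. It does not read Parquet files; `RowGroupStats` and `groups_to_read` are hypothetical names, and the numbers are invented to contrast a sorted layout with an unsorted one.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics stored per chunk of data."""
    name: str
    min_value: int
    max_value: int


def groups_to_read(stats, wanted):
    """Return only the groups whose [min, max] range could contain the wanted value."""
    return [s.name for s in stats if s.min_value <= wanted <= s.max_value]


# Sorted data: each group covers a narrow, non-overlapping range.
sorted_layout = [
    RowGroupStats("g0", 1, 100),
    RowGroupStats("g1", 101, 200),
    RowGroupStats("g2", 201, 300),
]
# Unsorted data: every group spans almost the whole range.
unsorted_layout = [
    RowGroupStats("g0", 3, 298),
    RowGroupStats("g1", 1, 300),
    RowGroupStats("g2", 7, 295),
]
```

A lookup for the value 150 only has to touch g1 in the sorted layout, whereas the unsorted layout cannot rule out any group.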
(3) df.rdd.partitionBy(customPartitioner).toDF().write.parquet(writePath)
If you're using Spark on Scala, then you can write a custom partitioner, which can get around the annoying gotchas of the hash-based partitioner. Custom partitioner classes are not an option in pySpark, unfortunately. If you really want custom partitioning in pySpark, I've found this is possible, albeit a bit awkward, by using rdd.repartitionAndSortWithinPartitions:
(
    df.rdd
    .keyBy(sort_key_function)  # Convert to key-value pairs
    .repartitionAndSortWithinPartitions(
        numPartitions=N_WRITE_PARTITIONS,
        partitionFunc=part_func,
    )
    .values()  # Get rid of keys
    .toDF()
    .write.parquet(writePath)
)
Maybe someone else knows an easier way to use a custom partitioner on a dataframe in pyspark?
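For illustration only, here is what a partition function for the snippet above might look like: a plain Python callable that takes a key and returns an integer partition index. This block does not call Spark. The `DEDICATED` table, the city names, and the value of `N_WRITE_PARTITIONS` are hypothetical, and `zlib.crc32` is used only to get a deterministic hash for the remaining keys.

```python
import zlib

N_WRITE_PARTITIONS = 4  # hypothetical value for the name used above

# Hypothetical hand-picked placement for keys known to be large.
DEDICATED = {"New York City": 0, "Los Angeles": 1}


def part_func(key):
    """Return a partition index: dedicated slots for big keys, hashing for the rest."""
    if key in DEDICATED:
        return DEDICATED[key]
    shared = N_WRITE_PARTITIONS - len(DEDICATED)
    return len(DEDICATED) + zlib.crc32(str(key).encode("utf-8")) % shared


assignments = {
    k: part_func(k)
    for k in ["New York City", "Los Angeles", "Smallville", "Tinytown"]
}
```

The design choice is to keep the heavy keys away from each other and from the long tail, which is exactly what a plain hash cannot guarantee.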
df.repartition(COL).write.partitionBy(COL)
will write out one file per partition. This will not work well if one of your partitions contains a lot of data: e.g. if one partition contains 100GB of data, Spark will try to write out a 100GB file and your job will probably blow up.
df.repartition(2, COL).write.partitionBy(COL)
will write out a maximum of two files per partition, as described in this answer. This approach works well for datasets that are not very skewed (because the optimal number of files per partition is roughly the same for all partitions).
This answer explains how to write out more files for the partitions that have a lot of data and fewer files for the small partitions.
