Dask DataFrame converts underlying parquet files' index from datetime64 to object, why? - python-3.x

I make great efforts to save my parquet files with an index using the datetime64[ns] dtype. But when I then read several of these parquet files into a Dask DataFrame, it converts the index to dtype object (str). Why?
I cannot use the parse_dates argument in my read_parquet call, as it only works on columns. I read each individual underlying parquet file with pandas and checked the dtype of the index; they are consistent.
My code is simple:
try:
    df = dd.read_parquet(data_filenames, columns=list(cols_to_retrieve),
                         engine='pyarrow')
except Exception as ex:
    self.build_error(ex, end_date)
df = df[list(cols_to_retrieve)]
df = df.compute()
What is the recommended approach to fixing Dask's tendency to change dtypes?

It sounds like you might have found a bug. You should put together a small piece of code that creates the file(s) with the datetime type, shows that on loading the index is no longer that type, and post it at https://github.com/dask/dask/issues/new
i.e., we need an MCVE, so that we can debug and fix it, and turn your example into a test against future regressions. It also often reveals something you might have been doing wrong, if there is in fact no bug.
That said, you could also try the fastparquet engine, just in case it happens to cope with your particular data.
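For reference, a minimal reproducer to attach to such an issue might look like the following sketch (file names and column names here are placeholders, not the original data):
import pandas as pd
import dask.dataframe as dd

# Build a small frame with a datetime64[ns] index and write it as two parquet files.
idx = pd.date_range("2021-01-01", periods=10, freq="D", name="timestamp")
pdf = pd.DataFrame({"value": range(10)}, index=idx)
pdf.to_parquet("example_part0.parquet", engine="pyarrow")
pdf.to_parquet("example_part1.parquet", engine="pyarrow")

# Read both files back with Dask and inspect the index dtype.
ddf = dd.read_parquet(["example_part0.parquet", "example_part1.parquet"],
                      engine="pyarrow")
print(ddf.index.dtype)            # expected: datetime64[ns]
print(ddf.compute().index.dtype)  # if this prints object, attach the script to the issue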

Related

Get sizes of individual columns of delta/parquet table

I would like to check how each column of parquet data contributes to total file size / total table size.
I looked through Spark/Databricks commands, parquet-cli and parquet-tools, and unfortunately it seems that none of them provide such information directly. Considering that this is a columnar format, it should be possible to pull it out somehow.
So far the closest I have got is to run parquet-tools meta, sum up the details by column for each row group within the file, then aggregate this across all files of a table. This means iterating over all parquet files and cumbersome parsing of the output.
Maybe there is an easier way?
Your approach is correct. Here is a Python script using DuckDB to find the overall compressed and uncompressed sizes of all columns in a parquet dataset.
import duckdb

con = duckdb.connect(database=':memory:')
print(con.execute("""
    SELECT SUM(total_compressed_size)   AS total_compressed_size_in_bytes,
           SUM(total_uncompressed_size) AS total_uncompressed_size_in_bytes,
           path_in_schema               AS column_name
    FROM parquet_metadata('D:\\dev\\tmp\\parq_dataset\\*')
    GROUP BY path_in_schema
""").df())
Here, D:\\dev\\tmp\\parq_dataset\\* points to a directory (parq_dataset) containing multiple parquet files with the same schema. Something similar should be possible with other libraries such as pyarrow or fastparquet as well.
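For example, a rough pyarrow sketch of the same idea (the directory path is a placeholder, and this assumes all files share a schema) would aggregate the row-group metadata in the parquet footers directly:
import glob
from collections import defaultdict

import pyarrow.parquet as pq

compressed = defaultdict(int)
uncompressed = defaultdict(int)

# Walk every file in the dataset and sum the per-column chunk sizes
# recorded in each file's footer metadata.
for path in glob.glob("parq_dataset/*.parquet"):
    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for c in range(row_group.num_columns):
            col = row_group.column(c)
            compressed[col.path_in_schema] += col.total_compressed_size
            uncompressed[col.path_in_schema] += col.total_uncompressed_size

for name in compressed:
    print(name, compressed[name], uncompressed[name])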

Spark: How to collect a large amount of data without running out of memory

I have the following issue:
I run a SQL query over a set of parquet files on HDFS and then I do a collect in order to get the result.
The problem is that when there are many rows I get an out-of-memory error.
This query requires shuffling, so I cannot run the query on each file individually.
One solution could be to iterate over the values of a column and save the result on disk:
df = sql('original query goes here')
// data = collect(df) <- out of memory
createOrReplaceTempView(df, 't')
for each c in cities:
    x = collect(sql("select * from t where city = c"))
    append x to file
As far as I know it will result in the program taking too much time because the query will be executed for each city.
What is the best way of doing this?
If it is running out of memory, that means the output data is really very large, so
you can write the results to a file instead, such as a parquet file.
If you want to perform further operations on this collected data, you can read it back from that file later.
For large datasets we should not use collect(); instead, use take(100) or take(some_integer) to check that some values are correct.
As @cricket_007 said, I would not collect() your data from Spark to append it to a file in R.
Additionally, it doesn't make sense to iterate over a list of SparkR::distinct() cities and then select everything from those tables just to append them to some output dataset. The only time you would want to do that is if you are trying to do another operation within each group based upon some sort of conditional logic or apply an operation to each group using a function that is NOT available in SparkR.
I think you are trying to get a data frame (either Spark or R) with observations grouped in a way so that when you look at them, everything is pretty. To do that, add a GROUP BY city clause to your first SQL query. From there, just write the data back out to HDFS or some other output directory. From what I understand about your question, maybe doing something like this will help:
sdf <- SparkR::sql('SELECT SOME GREAT QUERY FROM TABLE GROUP BY city')
SparkR::write.parquet(sdf, path="path/to/desired/output/location", mode="append")
This will give you all your data in one file, and it should be grouped by city, which is what I think you are trying to get with your second query in your question.
You can confirm the output is what you want via:
newsdf <- SparkR::read.parquet(x="path/to/desired/output/location/")
View(head(newsdf, num=200))
Good luck, hopefully this helps.
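For completeness, here is a sketch of the same write-without-collect pattern in PySpark (the query, column, and path names are placeholders mirroring the SparkR example above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Run the query once, then let Spark write the result straight to disk,
# one folder per city, without ever pulling the rows back to the driver.
sdf = spark.sql("SELECT SOME GREAT QUERY FROM TABLE")  # result must include a 'city' column

(sdf.write
    .partitionBy("city")
    .mode("append")
    .parquet("path/to/desired/output/location"))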
Since your data is huge, it is no longer possible to collect() it. So you can use a strategy of sampling the data and learning from the sampled data.
import numpy as np
arr = np.array(sdf.select("col_name").sample(False, 0.5, seed=42).collect())
Here you are sampling 50% of the data and just a single column.

to_hdf in python 3 causes MemoryError

I'm writing a piece of code for my thesis which collects a series of .csv files that represent temperature images. I want to make the use of these .csv files more efficient by storing them into a dataframe.
I have multiple videos of 10,000+ frames, and each video should be stored in a separate dataframe. I made code that works for lower numbers of files, however when running it on a bunch of videos (let's say 10), it crashes after a couple of videos with a MemoryError. I already tried gc.collect() and deleted the dataframe after using df.to_hdf to prevent Python from keeping lists in memory that have already been created. My memory usage still keeps increasing until it completely fills up the RAM, and the program crashes.
The relevant piece of the code is shown here:
dfs = {}
for l in range(k*number_of_csv, len(i)):
    df = pd.read_csv(i[l], sep=',', header=None)
    dfs['{:0>8}'.format(i[l])] = df
dfs = pd.concat(dfs)
dfs.to_hdf('experiment_data_series_'+str(k)+'.h5', key='raw', mode='w')
del dfs
gc.collect()
In short: it builds a dataframe from all the .csv files and then stores it in the h5 file.
Can someone spot what is missing to prevent this from over-consuming memory?
I already inserted a chunking procedure, so that the number of .csv files stored in a single h5 is always <20000 (usually the h5 file has a size of 3-4 GB).
I suspect that python allocates some memory for the storing operation, but doesn't release it afterwards, or something like that.
I'd appreciate the help.
kind regards,
Cedric
Since you have stated that each video should be stored in a separate dataframe, I don't think you need to concatenate them all into dfs. Simply read in each .csv file and then write it to your HDFStore under its own key; imagine a dict of DataFrame objects.
Consider the following code:
# create an HDF5 file which will contain all your data;
# this statement automatically opens the h5 file for you
hdf = pd.HDFStore('experiment_data_series.h5')
for l in range(k*number_of_csv, len(i)):
    df = pd.read_csv(i[l], sep=',', header=None)
    # store this dataframe in the HDF5 file under its own key (one key per csv file)
    hdf['raw_{}'.format(l)] = df
# when you are finished writing to the HDF5 file remember to close it
hdf.close()
Later on, you can open the file again and view some information:
hdf = pd.HDFStore('experiment_data_series.h5', mode='r')
print(hdf.info())
print(hdf.keys())
# you can get one dataframe by its key
df = hdf.select('raw_0', auto_close=True)
I think this should work because you won't have all of your data loaded in memory at once, i.e., you are using the disk instead.
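If the individual .csv files are themselves large, a further memory-friendly variant (a sketch with a placeholder file name, not your exact naming scheme) is to stream each file in chunks and append the chunks to a table-format key:
import pandas as pd

hdf = pd.HDFStore('experiment_data_series.h5')

# Read the csv in manageable chunks and append each chunk to the same
# table-format key, so only one chunk is ever held in memory at a time.
for chunk in pd.read_csv('frames_video_0.csv', sep=',', header=None,
                         chunksize=100_000):
    hdf.append('raw_video_0', chunk, format='table')

hdf.close()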

Converting JSON to Parquet in Amazon EMR

I need to achieve the following, and am having difficulty coming up with an approach to accomplish it due to my inexperience with Spark:
Read data from .json.gz files stored in S3.
Each file includes a partial day of Google Analytics data with the schema as specified in https://support.google.com/analytics/answer/3437719?hl=en.
File names are in the pattern ga_sessions_20170101_Part000000000000_TX.json.gz where 20170101 is a YYYYMMDD date specification and 000000000000 is an incremental counter when there are multiple files for a single day (which is usually the case).
An entire day of data is therefore composed of multiple files with incremental "part numbers".
There are generally 3 to 5 files per day.
All fields in the JSON files are stored with quote (") delimiters, regardless of the data type specified in the aforementioned schema documentation. The data frame which results from reading the files (via sqlContext.read.json) therefore has every field typed as string, even though some are actually integer, boolean, or other data types.
Convert the all-string data frame to a properly typed data frame according to the schema specification.
My goal is to have the data frame typed properly so that when it is saved in Parquet format the data types are correct.
Not all fields in the schema specification are present in every input file, or even every day's worth of input files (the schema may have changed over time). The conversion will therefore need to be dynamic, converting the data types of only the fields actually present in the data frame.
Write the data in the properly typed data frame back to S3 in Parquet format.
The data should be partitioned by day, with each partition stored in a separate folder named "partition_date=YYYYMMDD" where "YYYYMMDD" is the actual date associated with the data (from the original input file names).
I don't think the number of files per day matters. The goal is simply to have partitioned Parquet format data that I can point Spectrum at.
I have been able to read and write the data successfully, but have been unsuccessful with several aspects of the overall task:
I don't know how to approach the problem to ensure that I'm effectively utilizing the AWS EMR cluster to its full potential for parallel/distributed processing, either in reading, converting, or writing the data. I would like to size up the cluster as needed to accomplish the task within whatever time frame I choose (within reason).
I don't know how to best accomplish the data type conversion. Not knowing which fields will or will not be present in any particular batch of input files requires dynamic code to retype the data frame. I also want to make sure this task is distributed effectively and isn't done inefficiently (I'm concerned about creating a new data frame as each field is retyped).
I don't understand how to manage partitioning of the data appropriately.
Any help working through an overall approach would be greatly appreciated!
If your input JSONs have a fixed schema, you can specify the DataFrame schema manually, declaring fields as optional. Refer to the official guide.
If you have all values inside "", you can read them as strings and later cast them to the required types.
I don't know how to approach the problem to ensure that I'm effectively...
Use the DataFrame API to read the input; most likely the defaults will be good for this task. If you hit a performance issue, attach the Spark Job Timeline.
I don't know how to best accomplish the data type conversion...
Use the column.cast(DataType) method.
For example, you have 2 JSONs:
{"foo":"firstVal"}{"foo":"val","bar" : "1"}
And you want to read 'foo' as String and bar as integer you can write something like this:
val schema = StructType(
  StructField("foo", StringType, true) ::
  StructField("bar", StringType, true) :: Nil
)

val df = session.read
  .format("json").option("path", s"${yourPath}")
  .schema(schema)
  .load()

val withCast = df.select('foo, 'bar cast IntegerType)
withCast.show()
withCast.write.format("parquet").save(s"${outputPath}")
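To address the partitioning requirement from the question, a hedged PySpark sketch (the bucket paths, the filename regex, and the choice of "visits" as an example numeric field are my assumptions, not something from the answer above) that derives partition_date from the input file names and writes one partition_date=YYYYMMDD folder per day could look like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, input_file_name, regexp_extract

spark = SparkSession.builder.getOrCreate()

# Read the gzipped JSON, tag each row with the YYYYMMDD date taken from its
# source file name, cast a string field that should be numeric, then write
# the result partitioned by day (one partition_date=YYYYMMDD folder per day).
df = spark.read.json("s3://bucket/path/ga_sessions_*.json.gz")

typed = (df
         .withColumn("partition_date",
                     regexp_extract(input_file_name(), r"ga_sessions_(\d{8})_", 1))
         .withColumn("visits", col("visits").cast("long")))

(typed.write
      .partitionBy("partition_date")
      .mode("append")
      .parquet("s3://bucket/output/ga_sessions_parquet/"))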

How do I properly remove/clear an s3 bucket before re-using it in spark?

I am working in a jupyter notebook, creating and then saving a spark dataframe to s3 in python, using spark 2.0.1. The code looks something like
action = 'CREATE'
if action == 'CREATE':
    df = dfA.filter(...)
    df = df.join(...)
    df.coalesce(4).write.format('parquet').save('s3://my/path')
elif action == 'LOAD':
    df = spark.read.parquet('s3://my/path')
I think at some point, I had a bug and wrote a df which had 4 items in it (4 for a specific query) when it should only have 2 (every record was duplicated - probably because I was joining with something without first de-duping it).
After re-working things, I can verify that when I delete the old s3://my/path and then run that create logic so that it can write the location, my df has the 2 items I expect.
What I am confused about is that if I now run the LOAD logic, which should load the dataframe I just wrote with 2 items, replacing my df with the one read from s3, I get a dataframe with the erroneous 4 items in it.
If I start over with a new path, s3://my/path2, then this exercise of creating and loading works.
It seems like a bug with s3, or maybe spark?
It's not a bug; it could be the behaviour of s3. S3 has read-after-write consistency for PUTs of new objects and eventual consistency for overwrite PUTs and DELETEs. When you delete data, it might still appear in listings for some time, but eventually, once the delete has fully propagated to all zones, the listing will show the correct data. (http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel)
I like @madhan-s's answer, but my overwrites/re-reads were all done interactively, reloading pages on aws s3 and checking that I couldn't read the object when it wasn't there, so I'm not sure this explains it (but thanks for that link! I need to learn that kind of stuff!)
Something that might have happened is that I think I did:
df = make_bad_df()
df.save_to_s3()
df = read_from_s3()
df.persist()
At that point, I would have persisted a dataframe I read from a particular s3 object. If I then delete and re-write that s3 bucket, and do something like
df2 = read_from_s3()
Spark has no way of knowing the s3 object has changed; maybe Spark will say: aha, I've already persisted that dataframe, I'll just get it from memory/local disk?
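If a persisted dataframe is indeed the culprit, a sketch of the fix (using the placeholder path from the question; this is my assumption, not something confirmed above) is to drop the cache explicitly before re-reading so Spark goes back to S3:
# Drop the cached copy of the old dataframe and clear any other cached data,
# so the next read actually goes back to S3 instead of local memory/disk.
df.unpersist()
spark.catalog.clearCache()

df2 = spark.read.parquet('s3://my/path')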
