I have a csv file that contains large number of columns. I want to load just one column from that file using spark.
I know that we can use select statement to filter a column. But what i want, while doing the read operation itself, it should load just one column.
In this way, i should be able to avoid extra memory getting used by other columns. Is there any way to do this?
Spark will load complete file and parse for columns. As you mentioned, you can use select to restrict columns in dataframe, so dataframe will have only one column.
Spark will load the complete file in memory and will filter down the column you want with the help of select statements which you have mentioned.
Because all the read operation in spark, reads and scans the whole file as a distributed stream reader gets created (the reader gets instantiated at every node where the data has been stored).
And if your problem is to read the data column-wise then you can store the file in parquet format and read that file. Indeed, parquet is columnar storage and it is exactly meant for this type of use case(you can verify it using the explain).
Related
I am trying to read data from elastic, I could see Column is present as Array of string in elastic but while I am reading by Spark as Dataframe i am seeing as a Srting, how could I handle this data in Spark.
Note: I am trying to read with mode (sqlContext.read.format("org.elasticsearch.spark.sql") becuase i need to write it as CSV file in future.
This is my first question ever so thanks in advance for answering me.
I want to create an external table by Spark in Azure Databricks. I've the data in my ADLS already that are automatically extracted from different sources every day. The folder structure on the storage is like ../filename/date/file.parquet.
I do not want to duplicate the files by saving their copy on another folder/container.
My problem is that I want to add a date column extracted from the folder path to the table neither without copying nor changing the source file.
I am using Spark SQL to create the table.
CREATE TABLE IF EXISTS my_ext_tbl
USING parquet
OPTIONS (path "/mnt/some-dir/source_files/")
Is there any proper way to add such a column in one easy and readable step or I have to read the raw data into Dataframe, add column and then save it as external tabel to different location?
I am aware of that unmanaged tables stores only metadata in dbfs. However, I am wondering is this even possible.
Hope it's clear.
EDIT:
Since it seems like there is no viable solution for that without copying or interfere in source file, I would like to ask how are you handling such challenges?
EDIT2:
I think that link might provide a solution. The difference in my case is that, the date inside the folder path is not the real partition, it's just a date added during the pipeline extracting data from external source.
If one calls df.write.parquet(destination), is the DataFrame schema (i.e. StructType information) saved along with the data?
If the parquet files are generated by other programs other than Spark, how does sqlContext.read.parquet figure out the schema of the DataFrame?
Parquet files automatically preserves the schema of the original data when saving. So there will be no difference if it's Spark or another system that writes/reads the data.
If one or multiple columns are used to partition the data when saving, the data type for these columns are lost (since the information is stored in the file structure). The data types of these can be automatically inferred by Spark when reading (currently only numeric data types and strings are supported).
This automatic inference can be turned off by setting spark.sql.sources.partitionColumnTypeInference.enabled to false, which will make these columns be read as strings. For more information see here.
I have a lot of csv files in a Azure Data Lake, consisting of data of various types (e.g., pressure, temperature, true/false). They are all time-stamped and I need to collect them in a single file according to timestamp for machine learning purposes. This is easy enough to do in Java - start a filestream, run a loop on the folder that opens each file, compares timestamps to write relevant values to the output file, starting a new column (going to the end of the first line) for each file.
While I've worked around the timestamp problem in U-SQL I'm having trouble coming up with syntax that will help me run this on the whole folder. The wildcard syntax {*} treats all files as the same fileset while I need to run some sort of loop to join a column from each file individually.
Is there any way to do this, perhaps using virtual columns?
First you have to think about your problem functional/declaratively and not based on procedural paradigms such as loops.
Let me try to rephrase your question to see if I can help. You have many csv files with data that is timestamped. Different files can have rows with the same timestamp, and you want to have all rows for the same timestamp (or range of timestamps) output to a specific file? So you basically want to repartition the data?
What is the format of each of the files? Do they all have the same schema or different schemas? In the later case, how can you differentiate them? Based on filename?
Let me know in the comments if that is a correct declarative restatement and the answers to my questions and I will augment my answer with the next step.
I had one table with more than 10,000,000 records in Cassandra, but for some reason, I want to build another Cassandra table with the same fields and several additional fields, and I will migrate the previous data into it. And now the two tables are in the same Cassandra cluster.
I want to ask how to finish this task in a shortest time?
And If my new table in the different Cassandra, How to do it?
Any advice will be appreciated!
If you just need to add blank fields to a table, then the best thing to do is use the alter table command to add the fields to the existing table. Then no copying of the data would be needed and the new fields would show up as null in the existing rows until you set them to something.
If you want to change the structure of the data in the new table, or write it to a different cluster, then you'd probably need to write an application to read each row of the old table, transform the data as needed, and then write each row to the new location.
You could also do this by exporting the data to a csv file, write a program to restructure the csv file as needed, then import the csv file into the new location.
Another possible method would be to use Apache Spark. You'd read the existing table into an RDD, transform and filter the data into a new RDD, then save the transformed RDD to the new table. That would only work within the same cluster and would be fairly complex to set up.