I'm fairly new to Databricks. I have an SQL query in a notebook and I want to download the full results (about 3,000 rows) to a CSV file. However, when I run the query, it takes half an hour to display the first 1,000 rows (which is useless to me), and then I have to click "Download full results", which re-runs the query, so the half hour it just spent is completely wasted.
Is there a way to download the full results without first displaying the first 1000 rows in the browser?
Maybe this will help:
Create a variable and load your table into it:
Mytable = spark.table("tableName")
Then write the data to a CSV file with options like:
# Write the DataFrame out as CSV (note: .save() creates a directory of part files)
(Mytable.write
    .format("com.databricks.spark.csv")
    .option("delimiter", ",")  # set the delimiter as per your requirement
    .option("header", "true")
    .save("dbfs:/df/mytabledata.csv"))
Then you can download/access the file under the Databricks instance file system (DBFS).
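If you need a single CSV file rather than a directory of part files, one option (a minimal sketch, assuming the same Mytable DataFrame; the output path is a placeholder) is to coalesce to one partition before writing:
# Coalesce to one partition so Spark writes a single part file
(Mytable.coalesce(1)
    .write
    .format("csv")
    .option("header", "true")
    .mode("overwrite")
    .save("dbfs:/df/mytabledata_single"))
The output directory will still contain one part-*.csv file plus a few metadata files; you can copy or rename that part file with dbutils.fs.cp before downloading it.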
I am quite new to Django, so forgive me if this is an easy answer. I'm creating a simple app that will display data I scrape from the web on a daily basis. This is pricing data for certain items, so on a daily basis the only thing that really changes is the price of a given item. Eventually, I will have an app containing an HTML table that I populate with item IDs and prices using only the data scraped for the current day, i.e., any item ID in the table should appear only once and show today's price. I want to keep historical data for another app where I will display prices over time.
I set up a management command that runs the scraper for me and dumps my data into the SQLite database, and I am also using django-simple-history to store historical data. The way I understand it, my scraper dumps the data into both the table that I created (model_data) and a historical table (model_historicaldata).
What I am hoping my scraper will do on a daily basis is empty the model_data table (not drop the table itself), since the data will already be in the historical table anyway, and then fill it with only the current day's data. The model_historicaldata table should remain untouched at this stage.
Currently, at the beginning of my management command I have model.objects.all().delete(). This seems to work when displaying only today's data via the HTML template; however, when querying model_data in the shell I still see the total count of all items (3,000 in this case, when I only scraped 1,500 today).
# Empty the existing table so it can be filled with only the current day's data
deleted_count, _ = model_data.objects.all().delete()
if deleted_count:
    print("Yesterday's data cleared. Scraping today's data.")
else:
    print("Could not empty yesterday's data...")
# Scrape...
In Python shell:
python manage.py shell
from app.models import CreatedModel
CreatedModel.objects.all() # Shows 3000 items
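For comparison, the history table that django-simple-history maintains can be counted through the history manager it adds to the model (a quick check, using the same names as above):
from app.models import CreatedModel
CreatedModel.objects.count()   # rows currently in model_data
CreatedModel.history.count()   # rows in model_historicaldata (every historical record)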
Basically, I am wondering whether those extra 1,500 items in the model_data table are taking up actual space in the DB, and if so, how I can erase them completely so that when I run the above query I only see 1,500 items there. I'm thinking ahead: once this is actually live I'll be scraping 100K+ items a day, and I think storage would run out fast. Also, for this specific case, is using django-simple-history the most prudent approach, or should I just keep all the data in one table and use filters when querying it in the view?
Currently, I am using a local SQLite database, but I hope to migrate to a remote one once I get the product up and running.
Thanks for the help!
I have an ADF pipeline which executes a Data Flow.
The Data Flow has:
a Source, table A, which has around 1 million rows;
a Filter with a query to select only yesterday's records from the source table;
an Alter Row setting which uses upsert;
a Sink, which is the archival table where the records are upserted.
This whole pipeline takes around 2 hours, which is not acceptable, even though only around 3,000 records are actually transferred/upserted.
The core count is 16. I tried partitioning with round robin and 20 partitions.
A similar archival doesn't take more than 15 minutes for another table which has around 100K records.
I thought of creating a source that would select only yesterday's records, but in the dataset you can only select a table.
Please suggest if I am missing anything to optimize it.
The table on the Dataset really doesn't matter. Whichever activity you use to access that Dataset can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance by using a Copy activity to stage the data into an Azure Storage Blob before using another Copy activity to pull from that Blob into the destination. But for what you're describing here, that doesn't seem necessary.
I am supposed to optimize the performance of an old Access DB in my company. It contains several tables with about 20 columns and 50,000 rows. The speed is very slow, because people work with the whole table and set the filters afterwards.
Now I want to compose a query that reduces the amount of data before the complete rows are transferred to Excel, but the speed is still very slow.
First I tried the new Power Query editor in Excel. I first reduced the rows by selecting only the last few (by date). Then I made an inner join with the second table.
Finally I got fewer than 20 rows returned, and I thought I was fine.
But when I started Excel to perform the query, it took 10-20 seconds to read the data. I could see that Excel loads the complete tables before setting the filters.
My next try was to create the same query directly inside the Access DB, with the same settings. Then I opened this query in Excel, and the time to load the rows was nearly zero. You select "Refresh", and the result is shown instantly.
My question is: is there any way to perform a query in Excel only (without touching the Access file) that is nearly as fast as a query in Access itself?
Best regards,
Stefan
Of course.
Just run an SQL query from MS Query in Excel. You can create the query in Access and copy-paste the SQL into MS Query. Both are executed by the same database engine and should run at exactly the same speed.
See this support page on how to run queries using MS Query in Excel.
More complex solutions using VBA are available, but shouldn't be needed.
I have a Redshift table and it is storing a lot of data. Every weekend I go in and, using Workbench, manually TRUNCATE the last week of data that I no longer need.
I manually have to run
DELETE FROM tableName WHERE created_date BETWEEN timeStamp1 AND timeStamp2;
Is it possible to tell the table, or to set some expiration policy, to remove that data every Sunday for me?
If not, is there a way to automate the delete process every 7 days? Some sort of shell script or cron job in Node.js that does this?
No, there is no in-built ability to run commands on a regular basis on Amazon Redshift. You could, however, run a script on another system that connects to Redshift and runs the command.
For example, a cron job that calls psql to connect to Redshift and execute the command. This could be done in a one-line script.
Alternatively, you could configure an AWS Lambda function to connect to Redshift and execute the command. (You would need to write the function yourself, but there are libraries that make this easier.) Then, you would configure Amazon CloudWatch Events to trigger the Lambda function on a desired schedule (eg once a week).
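For illustration, a minimal sketch of such a Lambda handler using the boto3 Redshift Data API (the cluster identifier, database, and user here are placeholders, and the SQL is adapted from the question; scheduling would still be handled by CloudWatch Events):
import boto3

client = boto3.client("redshift-data")

def handler(event, context):
    # Delete rows older than 7 days; cluster and table details are placeholders
    response = client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="mydb",
        DbUser="admin",
        Sql="DELETE FROM tableName WHERE created_date < DATEADD(day, -7, GETDATE());",
    )
    return response["Id"]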
A common strategy is to actually store data in separate tables per time period (eg a month, but in your case it would be a week). Then, define a view that combines several tables. To delete a week of data, simply drop the table that contains that week of data, create a new table for this week's data, then update the view to point to the new table but not the old table.
By the way...
Your example uses the DELETE command, which is not the same as the TRUNCATE command.
TRUNCATE removes all data from a table. It is an efficient way to completely empty a table.
DELETE is good for removing part of a table but it simply marks rows as deleted. The data still occupies space on disk. Therefore, it is recommended that you VACUUM the table after deleting a significant quantity of data.
A few days back, I had a requirement from a client to import/dump data from SQL Server (a result set of multiple joins producing more than 300 columns) into Excel on a daily basis. However, while running the utility, it gave a "Too many fields" error.
As a workaround, I used a Flat File Destination task to load all the required data (more than 300 columns) into a CSV file, and later saved this CSV as Excel.
This is the only workaround I found for the above scenario.
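If the final save-as-Excel step also needs to be automated, one option (a small sketch; the file paths are placeholders and it assumes pandas with openpyxl installed) is to convert the CSV in a short script:
# Convert the flat-file CSV produced by the package into an .xlsx workbook
import pandas as pd

df = pd.read_csv(r"C:\exports\daily_dump.csv")            # placeholder path
df.to_excel(r"C:\exports\daily_dump.xlsx", index=False)   # .xlsx output needs openpyxl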