A couple days ago I've been working with an sqlite3 database and noticed that after deleting a lot of data (an entire table (DROP) to be exact) the file size did not change (the table used 5MB) at all. Not the database I'm asking about, but maybe this is relevant to my question.
I'm having a bunch of databases that I was looking through using grep -a for a specific message, I successfully found the database in question having 50MB in size, seems perfect! However, opening the database in an sqlite3 database viewer (or the sqlite3 command line tool) shows only a little over 800 entries in the messages table. The current auto increment index is above 18.000, which is the amount of entries the table should have. Going through the .db file with grep, everything I want seems to be there, but I'm guessing it's somehow "hidden"(?).
How can I retrieve ALL the data from an sqlite database? Note that .dump <table> does also not include the messages I'm looking for. For a quick explanation on this and why data doesn't actually get deleted/filesize won't shrink when I delete a 5MB table I'd be thankful too.
sqlite> .dbinfo
database page size: 4096
write format: 1
read format: 1
reserved bytes: 0
file change counter: 366
database page count: 11405
freelist page count: 10372
schema cookie: 2
schema format: 4
default cache size: 0
autovacuum top root: 0
incremental vacuum: 0
text encoding: 1 (utf8)
user version: 7
application id: 0
software version: 3008005
number of tables: 3
number of indexes: 1
number of triggers: 0
number of views: 0
schema size: 737
(Note that I did not create nor ever edited this database, it's Instagram's direct messages database.)
Deleted data is not immediately removed from the database file; see change sqlite file size after "DELETE FROM table".
As shown by the freelist page count, that file is mostly empty, and running VACUUM is likely to throw away the remains of the data you're searching for.
There is no easy way to recover deleted data (because some important management information is missing). If you are interested in only a small amount of data, try a hex editor.
Related
Using .vacuum() on a DeltaLake table is very slow (see Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs).
If I manually deleted the underlying parquet files and did not add a new json log file or add a new .checkpoint.parquet file and change the _delta_log/_last_checkpoint file that points to it; what would the negative impacts to the DeltaLake table be, if any?
Obviously time-traveling, i.e. loading a previous version of the table that relied on the parquet files I removed, would not work. What I want to know is, would there be any issues reading, writing, or appending to the current version of the DeltaLake table?
What I am thinking of doing in pySpark:
### Assuming a working SparkSession as `spark`
from subprocess import check_output
import json
from pyspark.sql import functions as F
awscmd = "aws s3 cp s3://my_s3_bucket/delta/_delta_log/_last_checkpoint -"
last_checkpoint = str(json.loads(check_output(awscmd, shell=True).decode("utf-8")).get('version')).zfill(20)
s3_bucket_path = "s3a://my_s3_bucket/delta/"
df_chkpt_del = (
spark.read.format("parquet")
.load(f"{s3_bucket_path}/_delta_log/{last_checkpoint}.checkpoint.parquet")
.where(F.col("remove").isNotNull())
.select("remove.*")
.withColumn("deletionTimestamp", F.from_unixtime(F.col("deletionTimestamp")/1000))
.withColumn("delDateDiffDays", F.datediff(F.col("deletionTimestamp"), F.current_timestamp()))
.where(F.col("delDateDiffDays") < -7 )
)
There are a lot of options from here. One could be:
df_chkpt_del.select("path").toPandas().to_csv("files_to_delete.csv", index=False)
Where I could read files_to_delete.csv into a bash array and then use a simple bash for loop passing each parquet file s3 path to an aws s3 rm command to remove the files one by one.
This may be slower than vacuum(), but at least it will not be consuming cluster resources while it is working.
If I do this, will I also have to either:
write a new _delta_log/000000000000000#####.json file that correctly documents these changes?
write a new 000000000000000#####.checkpoint.parquet file that correctly documents these changes and change the _delta_log/_last_checkpoint file to point to that checkpoint.parquet file?
The second option would be easier.
However, if there will be no negative effects if I just remove the files and don't change anything in the _delta_log, then that would be the easiest.
TLDR. Answering this question.
If I manually deleted the underlying parquet files and did not add a new json log file or add a new .checkpoint.parquet file and change the _delta_log/_last_checkpoint file that points to it; what would the negative impacts to the DeltaLake table be, if any?
Yes, this could potentially corrupt your delta table.
Let me briefly answers how delta-lake reads a version using _delta_log.
If you want to read version x then it will go to delta log of all versions from 1 to x-1 and will make a running sum of parquet files to read. Summary of this process is saved as a .checkpoint after every 10th version to make this process of running sum efficient.
What do I mean by this running sum?
Assume,
version 1 log says, add add file_1, file_2, file_3
version 2 log says, add delete file_1, file_2, and add file_4
So when reading version no 2, total instruction will be
add file_1, file_2, file_3 -> delete file_1, file_2, and add file_4
So, resultant files read will be file_3 and file_4.
What if you delete a parquet from a file system?
Say in version 3, you delete file_4 from file system. If you don't use .vacuum then delta log will not know that file_4 is not present, it will try to read it and will fail.
hello I pushed some rows into a bigquery table as follows:
errors = client.insert_rows("course-big-query-python.api_data_set_course_33.my_table_aut33",[string_tuple], selected_fields = schema2)
assert errors == []
however when I verify the result at the visual interface I see that the actual table size is 0,
I verify the Streaming buffer statistics there is the table successfully inserted:
I also excecuted a query to the table and the result is appearing stored in a temporal table as follows:
So I would like to appreciate support to insert the table in the corresponding place rather than a temporary table
To load data in BigQuery, you can either stream or batch it in.
If you choose streaming, data will go straight into a temporal space until it gets consolidated into the table.
You can find a longer description of how a streaming insert works here:
https://cloud.google.com/blog/products/gcp/life-of-a-bigquery-streaming-insert
If you want to batch instead of stream, use jobs.load instead of insert_row.
I am new on Neo4J. I am using Cypher to create nodes from the import of a csv file containing on each row the user ID, name and email.
I use this line of code:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Users/Cam/Documents/Neo4j/default.graphdb/import/node_attributes.csv" AS data
create (u:User {id:data.Id, name:data.displayName, email:data.email})
it seems to work as I have this message:
Added 840 labels, created 840 nodes, statement completed in 334 ms.
but when I go on neo4j in the database section; I see that only 25 nodes have been created and that they use only 2 names of my users....
any idea to know how to see them ?
thanks in advance
Are you using the neo4j browser? When you click on a label (or *) in the Node labels section on the left side, it automatically includes a LIMIT 25 clause to the generated Cypher code (to avoid swamping the the UI if there is a lot of data, and to avoid possibly taking too long to respond). For example:
MATCH (n) RETURN n LIMIT 25
If this is your situation, just remove the LIMIT 25 clause to see all your data. For example:
MATCH (n) RETURN n
I have two problems using openpyxl
The number of rows in the spreadsheet are 1048498. The iteration hogs memory so I put a logic to check for first five empty columns and break from it
Logic 1 works for me and code does not indefinitely iterate over the spreadsheet blank cells. I am using P4Python to delete this read only file after I am done reading it. However, openpyxl is still using that file and there is no method except save to close the archive used internally. Since my file is in read only mode, I cannot save the file. When P4 is trying to delete this file, I get this error - "The process cannot access the file because it is being used by another process."
Help is appreciated :)
If you open the file in read-only mode then it will not hog memory. Cells are created only when read. Memory use has been tested with huge files but if you think this is a bug then please submit a bug report with a sample file.
This looks like an existing issue or intended beahvior with openpyxl. If you have a read only file (P4Python sync operation - p4.run_sync(file_path_to_sync)) and if you are reading it using openpyxl, you will not be able to delete the file (P4Python p4.run_sync(file_path_to_sync + '#0') - Remove from workspace) until you save the file which is not possible (or intended in my case) since it is a read only file.
I am facing a problem in converting a large Oracle table to a SAS dataset. I did this earlier and the method worked. However, this time, it is giving me the following error messages.
SAS code:
option compress = yes;
libname sasdata ".";
libname myora oracle user=scott password=tiger path=XYZDATA ;
data sasdata.expt_tabl;
set myora.expt_tabl;
run;
Log file:
You are running SAS 9. Some SAS 8 files will be automatically converted
by the V9 engine; others are incompatible. Please see
http://support.sas.com/rnd/migration/planning/platform/64bit.html
PROC MIGRATE will preserve current SAS file attributes and is
recommended for converting all your SAS libraries from any
SAS 8 release to SAS 9. For details and examples, please see
http://support.sas.com/rnd/migration/index.html
This message is contained in the SAS news file, and is presented upon
initialization. Edit the file "news" in the "misc/base" directory to
display site-specific news and information in the program log.
The command line option "-nonews" will prevent this display.
NOTE: SAS initialization used:
real time 1.63 seconds
cpu time 0.03 seconds
1 option compress = yes;
2 libname sasdata ".";
NOTE: Libref SASDATA was successfully assigned as follows:
Engine: V9
Physical Name: /******/dibyendu
3 libname myora oracle user=scott password=XXXXXXXXXX path=XYZDATA ;
NOTE: Libref MYORA was successfully assigned as follows:
Engine: ORACLE
Physical Name: XYZDATA
4 data sasdata.expt_tabl;
5 set myora.expt_tabl;
6 run;
NOTE: There were 6422133 observations read from the data set MYORA.EXPT_TABL.DATA.
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
ERROR: File SASDATA.EXPT_TABL.DATA is damaged. I/O processing did not complete.
NOTE: The data set SASDATA.EXPT_TABL.DATA has 6422133 observations and 49 variables.
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
ERROR: Expecting page 1, got page -1 instead.
2 The SAS System 21:40 Monday, April 1, 2013
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
NOTE: Compressing data set SASDATA.EXPT_TABL.DATA decreased size by 78.88 percent.
Compressed is 37681 pages; un-compressed would require 178393 pages.
ERROR: File SASDATA.EXPT_TABL.DATA is damaged. I/O processing did not complete.
NOTE: SAS set option OBS=0 and will continue to check statements. This might cause NOTE: No observations in data set.
NOTE: DATA statement used (Total process time):
real time 8:55.98
cpu time 1:39.33
7
ERROR: Errors printed on pages 1,2.
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 8:58.67
cpu time 1:39.40
This is running on a RH Linux Server.
Any suggestion will be appreciated.
Thanks and regards,
This sounds like a space issue on your server. How large is the file system in your default directory (from your libname sasdata '.'; statement)? Use the data set option obs=1 on your Oracle table reference to create a new SAS dataset with one row and inspect the variables.
data sasdata.dummy_test;
set myora.expt_tabl(obs=1);
run;
Perhaps there are extremely large VARCHAR or BLOB columns that are consuming too much space. Remember that SAS does not have a VARCHAR type.
Though I am not totally sure, I believe the main issue was that I was initially trying to create/write the dataset in a directory, which was restricted in some (?) sense. This was indirectly causing trouble, since the dataset created was defective. When I created it elsewhere, it was okay.
Thanks and regards,
Dibyendu