I am using the psycopg2 library with Python 3 on a Linux server to create some temporary tables on Redshift, query those tables for results, and write the results to files on the server.
Since my queries are long and take about 15 minutes to create all the temp tables that I ultimately pull data from, how do I ensure that my connection persists and I don't lose the temp tables that I later query? Right now I just call cursor() before execute(); is there a default timeout for these?
I have noticed that whenever I do a
Select a,b from #results_table
or
select * from #results_table
the query just freezes/hangs, but
select top 35 from #results_table
returns the results (select top 40 fails!). There are about 100 rows in #results_table, and I am not able to get them all. I ran ps aux and the process just stays in the S+ state. If I run the query manually on Redshift it finishes in seconds.
Any ideas?
Re-declaring a cursor doesn't create a new connection when using psycopg2; every cursor opened from the same connection shares that connection's session, so your temp tables remain visible to all of them.
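For what it's worth, a minimal sketch of that pattern (the host, credentials, and source query below are placeholders, and the keepalive settings are just one optional way to keep an idle session from being dropped during the 15-minute build):

import psycopg2

# one connection for the whole job; Redshift temp tables live for the life of this session
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="mydb", user="me", password="...",
    keepalives=1, keepalives_idle=60, keepalives_interval=30, keepalives_count=5,
)

cur = conn.cursor()
cur.execute("CREATE TABLE #results_table AS SELECT a, b FROM some_source_table")  # long-running step
conn.commit()

# a new cursor on the SAME connection still sees the temp table
cur2 = conn.cursor()
cur2.execute("SELECT a, b FROM #results_table")
rows = cur2.fetchall()

If the connection itself were closed or re-created, the # temp tables would disappear with the session, so keep everything on the one conn object.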
I have a small log dataframe which has metadata regarding the ETL performed within a given notebook; the notebook is part of a bigger ETL pipeline managed in Azure Data Factory.
Unfortunately, it seems that Databricks cannot invoke stored procedures, so I'm manually appending a row with the correct data to my log table.
However, I cannot figure out the correct syntax to update a table given a set of conditions.
The statement I use to append a single row is as follows:
spark_log.write.jdbc(sql_url, 'internal.Job',mode='append')
This works swimmingly; however, since my Data Factory is invoking a stored procedure, I need to work in a query like:
query = f"""
UPDATE [internal].[Job] SET
[MaxIngestionDate] date {date}
, [DataLakeMetadataRaw] varchar(MAX) NULL
, [DataLakeMetadataCurated] varchar(MAX) NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?
Looking at the documentation, it only seems to mention using SELECT statements with the query parameter:
Target Database is an Azure SQL Database.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Just to add: this is a tiny operation, so performance is a non-issue.
You can't do single-record updates using JDBC in Spark with dataframes. You can only append or replace the entire table.
You can do updates using pyodbc, which requires installing the MSSQL ODBC driver (see How to install PYODBC in Databricks), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/).
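A rough sketch of the pyodbc route (the server name, database, credentials, and driver string are placeholders, and it assumes the 'ODBC Driver 17 for SQL Server' package is installed on the cluster):

import pyodbc

# placeholder connection details for the Azure SQL Database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=myuser;PWD=mypassword"
)

update_sql = """
UPDATE [internal].[Job]
SET [MaxIngestionDate] = ?,
    [DataLakeMetadataRaw] = NULL,
    [DataLakeMetadataCurated] = NULL
WHERE [IsRunning] = 1
  AND [FinishDateTime] IS NULL
"""

cursor = conn.cursor()
cursor.execute(update_sql, date)  # 'date' being the value from the notebook, as in the question
conn.commit()
conn.close()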
I am a newbie to Hadoop and Hive. My current requirement is to collect stats on the number of records loaded into 15 tables on each run day, instead of executing each SELECT COUNT(*) query and copying the output manually to Excel. Could anyone suggest the best method to automate this task, please?
Note: we do not have any GUI for running Hive queries; we submit Hive queries from a normal Unix terminal.
Export to a CSV or TSV file, then open the file in Excel. Normally this generates a TSV (tab-separated) file; this is how to transform it to comma-separated output if you prefer CSV:
hive -e "SELECT 'table1' as source, count(*) cnt FROM db.table1
UNION ALL
SELECT 'table2' as source, count(*) cnt FROM db.table2" | tr "\t" "," > mydata.csv
Add more tables to the query.
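If writing out all 15 branches by hand gets tedious, one possible way is to build the query from a list in Python and pipe it the same way (a sketch; it assumes the hive CLI is on the PATH and that you substitute your own table names):

import subprocess

tables = ["table1", "table2", "table3"]  # replace with your 15 table names
query = "\nUNION ALL\n".join(
    "SELECT '{t}' AS source, COUNT(*) AS cnt FROM db.{t}".format(t=t) for t in tables
)

# run through the hive CLI and convert tabs to commas, as in the command above
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True, check=True)
with open("mydata.csv", "w") as f:
    f.write(result.stdout.replace("\t", ","))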
You can mount the directory in which you write the output file in Windows using Samba/NFS. Schedule the command using crontab and voilà: every day you have an updated file.
You can also connect directly using ODBC drivers:
https://mapr.com/blog/connecting-apache-hive-to-odbc/
https://learn.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-connect-excel-hive-odbc-driver
Error connecting Hortonworks Hive ODBC in Excel 2013
I have a table in Hive:
db.table_name
When I run the following in Hive, I get results back:
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly
sql("SELECT * FROM db.table_name").show
Also shows nothing. Selecting arbitrary columns out before the show also displays nothing. Performing a count states the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table is created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
And if I read the Parquet files directly, it works:
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently using a workaround: describing the table to get its location and then reading it with spark.read.parquet, as sketched below.
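Roughly, the workaround looks like this (shown in PySpark syntax; the exact column layout of the DESCRIBE FORMATTED output may differ between Spark versions):

# pull the storage location out of the table metadata, then read the files directly
desc = spark.sql("DESCRIBE FORMATTED db.table_name").collect()
location = next(row[1].strip() for row in desc if row[0].strip().startswith("Location"))
spark.read.parquet(location).show()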
Have you refreshed the table metadata? Maybe you need to refresh the table to access the new data:
spark.catalog.refreshTable("my_table")
I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results as a text file.
There is probably some incompatibility with the Hive Parquet format.
I also found a Cloudera report about it (CDH Release Notes): they recommend creating the Hive table manually and then loading data from a temporary table or by query.
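In PySpark terms, that recommendation might look roughly like this (the schema and names are placeholders; the idea is to let Hive own the table definition and only insert into it from Spark):

# create the table with Hive-compatible DDL so Hive and Spark agree on the storage format
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.table_name (
        id INT,
        name STRING
    ) STORED AS PARQUET
""")

# expose the dataframe as a temporary view and load the table by query
query_result.createOrReplaceTempView("query_result_tmp")
spark.sql("INSERT OVERWRITE TABLE db.table_name SELECT id, name FROM query_result_tmp")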
I am working on a simple Spark script and running into issues putting data where I want it and getting the job to work. Specifically, I need to specify the database of the tables when registering a dataframe as a temp table.
df_del_records,df_add_records,df_exclusion_records=get_new_records(dff)
df_del_records.registerTempTable("db.update_deletes_temp_table")
df_add_records.registerTempTable("db.update_adds_temp_table")
df_exclusion_records.registerTempTable("db.exclusions_temp_table")
sqlContext.sql("insert overwrite table db.automated_quantity_updates select * from db.update_deletes_temp_table")
sqlContext.sql("insert into table db.automated_quantity_updates select * from db.update_adds_temp_table")
sqlContext.sql("insert into table db.exclusions select * from db.exclusions_temp_table")
The code above runs without errors but does not yield any results. Removing the database yields results, but that won't work in production, where the database in which the temp tables have to be stored is not whatever default Spark is using. How do I specify which database a temp table should be in when registering a dataframe as a temp table in Spark 1.6?
The temporary table/view created by registerTempTable or createOrReplaceTempView is not tied to any database. It is just a view over the dataframe, with a query plan based on how the dataframe was created.
From Apache Spark's Dataset.scala
Local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it, i.e. it will be automatically dropped when the session terminates. It's not tied to any databases, i.e. we can't use db1.view1 to reference a local temporary view.
emphasis added by me.
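So for the Spark 1.6 code in the question, one way forward is to register the temp tables without any database prefix and qualify only the permanent Hive tables inside the SQL, along these lines:

# temp tables are session-scoped and take no database qualifier
df_del_records.registerTempTable("update_deletes_temp_table")
df_add_records.registerTempTable("update_adds_temp_table")
df_exclusion_records.registerTempTable("exclusions_temp_table")

# only the permanent Hive tables carry the database name
sqlContext.sql("INSERT OVERWRITE TABLE db.automated_quantity_updates "
               "SELECT * FROM update_deletes_temp_table")
sqlContext.sql("INSERT INTO TABLE db.automated_quantity_updates "
               "SELECT * FROM update_adds_temp_table")
sqlContext.sql("INSERT INTO TABLE db.exclusions SELECT * FROM exclusions_temp_table")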
I am new to Neo4j. I am using Cypher to create nodes by importing a CSV file that contains, on each row, a user ID, name, and email.
I use this line of code:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Users/Cam/Documents/Neo4j/default.graphdb/import/node_attributes.csv" AS data
CREATE (u:User {id:data.Id, name:data.displayName, email:data.email})
It seems to work, as I get this message:
Added 840 labels, created 840 nodes, statement completed in 334 ms.
But when I go to Neo4j, in the database section I see only 25 nodes, and they use only 2 of my users' names...
Any idea how to see them all?
Thanks in advance.
Are you using the Neo4j Browser? When you click on a label (or *) in the Node Labels section on the left side, it automatically adds a LIMIT 25 clause to the generated Cypher code (to avoid swamping the UI if there is a lot of data, and to avoid taking too long to respond). For example:
MATCH (n) RETURN n LIMIT 25
If this is your situation, just remove the LIMIT 25 clause to see all your data. For example:
MATCH (n) RETURN n