SELECT statement from two dataframes using RMySQL - string

Consider two data frames, dataFrame1 and dataFrame2:
dataFrame1 has N columns (colmn1, ..., colmnN)
dataFrame2 has 3 columns (col1, col2, col3)
Can I write a statement like:
Select colmn1, colmn2, ..., colmnN, col1, col2 from dataFrame1, dataFrame2
using RMySQL?

Maybe you want the sqldf package instead.
Try this:
library("sqldf")
sqldf("select colmn1, colmn2, ..., colmnN, col1, col2 from dataFrame1, dataFrame2")
Of course, you must replace the ... with the actual column names.

Related

Is there any alternative to merge multiple rows into a single row without using groupBy() & collect_list() in spark?

I am trying to merge multiple rows into a single row after grouping data on a different column.
col1 col2
A 1
A 2
B 1
B 3
to
col1 col2
A 1,2
B 1,3
I am using the below code:
import pyspark.sql.functions as psf

df = spark.sql("select col1, col2, col3,...., colN from tablename where col3 = 'ABCD' limit 1000")
df.select('col1', 'col2').groupby('col1').agg(psf.concat_ws(', ', psf.collect_list(df.col2))).display()
This works fine when there is a small amount of data.
But if I increase the number of rows to 1 million, the code fails with the exception:
java.lang.Exception: Results too large
Is there any alternative to merge multiple rows into a single row in Spark without using the combination of groupby() & collect_list()?
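One possible direction (a minimal sketch only; df, col1 and col2 follow the question, and the order in which values are concatenated is not guaranteed) is to drop to the RDD API and build the comma-separated string with reduceByKey, which combines values pairwise on the executors instead of collecting a list per group:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data matching the example in the question.
df = spark.createDataFrame([('A', 1), ('A', 2), ('B', 1), ('B', 3)], ['col1', 'col2'])

# Map each row to (key, value-as-string), then concatenate per key.
merged = (df.rdd
            .map(lambda r: (r['col1'], str(r['col2'])))
            .reduceByKey(lambda a, b: a + ',' + b)
            .toDF(['col1', 'col2']))
merged.show()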

How do you specify column names with psycopg2

I have a SQL statement such as:
INSERT INTO my_table (col1, col2, col3) VALUES (1,2,3)
I am using psycopg2 to insert data as follows:
cur.execute(
    sql.SQL("INSERT INTO {} VALUES (%s, %s, %s)").format(sql.Identifier('my_table')),
    [1, 2, 3]
)
I don't see how to specify column names in the insert statement, though. The above sql.SQL is "assuming" that 1, 2, 3 are in the order of col1, col2 and col3. For instance, when I want to insert only col3, how would I specify the column name with sql.SQL?
execute just runs the SQL code, so you can mention the columns as in a standard PostgreSQL INSERT statement, like
INSERT INTO TABLE_ABC (col_name_1, col_name_2, col_name_3) VALUES (1, 2, 3)
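If you want to keep composing the statement with the psycopg2.sql module, column names can be passed as identifiers too. A minimal sketch (my_table and col3 follow the question; cur is assumed to be an open cursor as in the original snippet):

from psycopg2 import sql

cols = ['col3']  # only the columns you actually want to insert
query = sql.SQL("INSERT INTO {} ({}) VALUES ({})").format(
    sql.Identifier('my_table'),
    sql.SQL(', ').join(map(sql.Identifier, cols)),       # column list
    sql.SQL(', ').join(sql.Placeholder() * len(cols)),   # matching %s placeholders
)
cur.execute(query, [3])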

Display barplot in column order using pandas

I have a data frame with two columns - col1 and col2, but when I use df.plot.barh, the plot returns results in col2 and col1 order. Is there a way to get the plot to display results in col1 and col2 order?
df = pd.DataFrame(np.random.randint(0,10,(5,2)), columns=['col1','col2'])
df.plot.barh()
will yield a horizontal bar chart (plot not shown).
Instead, using bar():
df = pd.DataFrame(np.random.randint(0,10,(5,2)), columns=['col1','col2'])
df.plot.bar()
In both instances, col1 is plotted first, in that it is closest to the x-axis. To reverse the order of the columns, you need to reverse the order in which they appear in your dataframe. For just two columns, you can use:
df = df[df.columns[::-1]]
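Putting the two snippets together, a minimal sketch with random data:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['col1', 'col2'])

# Reversing the column order reverses the order in which the bars are drawn.
df[df.columns[::-1]].plot.barh()
plt.show()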

Column list specification in INSERT OVERWRITE statement

While trying to OVERWRITE a Hive table with specific columns from Spark (PySpark) using a dataframe, I am getting the below error:
pyspark.sql.utils.ParseException:
mismatched input 'col1' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 36)

== SQL ==
insert OVERWRITE table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe
------------------------------------^^^
Based on https://issues.apache.org/jira/browse/HIVE-9481, it looks like a column list is still not supported in INSERT OVERWRITE, so I tried running without the OVERWRITE keyword, but that still gives me the same error:
sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe")
Note: The above works fine when the specific column-list is not
specified, and the columns between the tables match.
But trying the same via the Hive terminal goes through fine.
INSERT INTO TABLE DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 from DB.TableName2;
Should any property or configuration be set or passed through spark-submit?
Please do let me know if you need more data or information.
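One workaround to try (a sketch only; it assumes the temp view is named dataframe as in the question and that its columns already line up with the target table) is to skip the column list entirely and let Spark resolve columns by position, e.g. with DataFrameWriter.insertInto:

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()

# Select the columns in the same order as the target table; insertInto
# matches columns by position, not by name.
df = sparkSession.table("dataframe").select("Col1", "Col2", "Col3")
df.write.insertInto("DB.TableName", overwrite=True)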

How to load a CSV file if multiple columns contain multiple commas in between, using Spark SQL 1.6

How to load a CSV file if multiple columns contain multiple commas in between, using Spark SQL 1.6? For example, the given CSV file contains some records similar to the below.
col1, col2, col3, col4, col5
abc, xyz, pqr, p,q, qq pr,tt
Now, in a Spark 1.6 dataframe, the columns should contain:
col1: abc
col2: xyz, pqr
col3: p,q
col4: qq pr
col5: tt
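If the fields that contain commas can be quoted in the source file (an assumption; the sample above is unquoted, which makes it ambiguous for any CSV parser), the external spark-csv package can read it in Spark 1.6. A minimal sketch, with a hypothetical file path:

# Spark 1.6: needs the com.databricks:spark-csv package, e.g.
#   spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 ...
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Assumes rows look like: abc,"xyz, pqr","p,q","qq pr",tt
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .option('header', 'true')
      .option('quote', '"')
      .load('/path/to/file.csv'))
df.show()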
