I already have a PySpark dataframe. I'm passing a variable to selectExpr in Databricks to create a new dataframe with the column names that I need.
When I pass the column names with aliases directly into selectExpr I don't get an error, and my new dataframe is created successfully. In the screenshot I have my columns with aliases.
But when I try to pass exactly the same columns into selectExpr through a variable, I get an error:
What am I missing?
Assuming that sql_pattern is a string of all column expressions with aliases, you can split it into a list of strings using the split function:
pivotDF = clean_df.selectExpr(sql_pattern.split(","))
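For example, a minimal sketch assuming sql_pattern holds comma-separated expressions such as "colA as A, colB as B" (stripping whitespace around each expression is just an extra precaution):
exprs = [e.strip() for e in sql_pattern.split(",")]
pivotDF = clean_df.selectExpr(*exprs)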
I am trying to create an Azure Data Factory mapping data flow that is generic for all tables. I am going to pass the table name, the primary column (for join purposes), and the other columns to be used in groupBy and aggregate functions as parameters to the data flow.
[screenshot: parameters passed to the data flow]
I am unable to reference this parameter in groupBy.
Error: DF-AGG-003 - Groupby should reference atleast one column -
MapDrifted1 aggregate(
) ~> Aggregate1,[486 619]
Has anyone tried this scenario? Please help if you have some knowledge on this, or whether it can be handled in a U-SQL script.
We first need to look up your parameter's string name in your incoming source data to locate the metadata and assign it.
Just add a Derived Column before your Aggregate and it will work. Call the column 'groupbycol' in your Derived Column and use this formula: byName($group1).
In your Aggregate, select 'groupbycol' as your group-by column.
I have tried many options including withColumn, udf, lambda, foreach, and map, but I am not getting the expected output. At most, I am able to transform only the first record. The inputfile.json will keep growing, and the expected output should give the XML in the desired structure. I will later produce the expected output to Kafka.
Spark 2.3, Python 2.7. The need is to do this in PySpark.
Edit 1:
I am able to add a column to the main dataframe which has the required XML. I used withColumn and functions.format_string and was able to add strings (the XML structures) to columns of the dataframe.
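For reference, a hypothetical sketch of that approach; the field names and XML layout here are assumptions, not the actual payload:
from pyspark.sql import functions as F

# Build an XML string column from existing fields; format_string fills the
# placeholders with the column values row by row.
df = df.withColumn(
    "newColumn",
    F.format_string("<record><id>%s</id><name>%s</name></record>",
                    F.col("id"), F.col("name")))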
Now my next target is to produce just the value of that new column to Kafka. I am using df.foreachPartition(send_to_kafka) and have created a function as below:
from kafka import SimpleClient, SimpleProducer  # kafka-python client

def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        # sends the whole row dict as a string
        producer.send_messages('test', str(row.asDict()))
But unfortunately it does two things:
a. Produces record on Kafka as {'newColumn':u'myXMLPayload'}. I do not want that. I want only myXMLPayload to be produced on Kafka.
b. It adds a u' prefix to the value because it is a unicode string.
I want to get rid of these two parts and I would be good to go.
Any help would be appreciated.
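A minimal sketch of one way to address both points, assuming the XML lives in a column named newColumn (adjust the column name and topic to your setup): send only that column's value, encoded to bytes, instead of the whole row dict.
from kafka import SimpleClient, SimpleProducer

def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        # row['newColumn'] is the bare XML string; encoding drops the u'' repr
        producer.send_messages('test', row['newColumn'].encode('utf-8'))

df.select('newColumn').foreachPartition(send_to_kafka)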
I am a newbie in Spark. I want to write the dataframe data into a Hive table. The Hive table is partitioned on multiple columns. Through the Hive metastore client I am getting the partition columns and passing them as a variable in the partitionBy clause of the dataframe's write method.
var1="country","state" (getting the partition column names of the Hive table)
dataframe1.write.partitionBy(s"$var1").mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
When I execute the above code, it gives me the error: partition "country","state" does not exist.
I think it is taking "country","state" as a single string.
Can you please help me out?
The partitionBy function takes varargs, not a list. You can use it as
dataframe1.write.partitionBy("country","state").mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
Or, in Scala, you can expand a sequence into varargs like
val columns = Seq("country","state")
dataframe1.write.partitionBy(columns:_*).mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
I have a parquet file with 400+ columns. When I read it, the default datatype attached to a lot of columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True  # exists for spark.read.csv, but not for spark.read.parquet
I tried changing
mergeSchema=True  # but it doesn't improve the results
To manually cast columns as float, I used
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but converts all the actual string column values to null. I can't wrap this in a try/except block as it's not throwing any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
You can use the same logic as Spark: define a preferred type hierarchy and attempt to cast until you find the most selective type that parses all values in the column.
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
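A minimal PySpark sketch of that idea, restricted to a single float check (it runs one Spark job per column, so it is illustrative rather than optimal for 400+ columns):
from pyspark.sql import functions as F

# A column is treated as numeric if casting it to float never turns a
# non-null value into null.
def is_float_castable(df, c):
    casted = F.col(c).cast("float")
    return df.filter(F.col(c).isNotNull() & casted.isNull()).count() == 0

float_cols = [c for c in df_temp.columns if is_float_castable(df_temp, c)]
df_casted = df_temp.select(
    *[F.col(c).cast("float").alias(c) if c in float_cols else F.col(c)
      for c in df_temp.columns])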
There's no easy way currently. There's an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
exists for Scala; the same could be created for PySpark.
I have a loop that is going to create multiple rows of data which I want to convert into a dataframe.
Currently I am creating a CSV-format string, and inside the loop I keep appending rows to it, separated by newlines. I am building it as CSV so that I can also save it as a text file for other processing.
File Header:
output_str="Col1,Col2,Col3,Col4\n"
Inside for loop:
output_str += "Val1,Val2,Val3,Val4\n"
I then create an RDD by splitting it on the newline and convert it into a dataframe as follows.
output_rdd = sc.parallelize(output_str.split("\n"))
output_df = output_rdd.map(lambda x: (x, )).toDF()
It creates a dataframe, but it only has 1 column. I know that is because of the map function, where I am making each line into a tuple with only 1 item in it. What I need is a list with multiple items. So perhaps I should be calling the split() function on every line to get a list. But I have a feeling that there should be a much more straightforward way. Appreciate any help. Thanks.
Edit: To give more information, using Spark SQL I have filtered my dataset to the rows that contain the problem. However, the rows contain information in the following format (separated by '|'), and I need to extract those values from column 3 whose corresponding flag in column 4 is set to 1 (here that is 0xcd):
Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00
So I am collecting the output at the driver and then parsing the last 2 columns after which I am left with regular strings that I want to put back in a dataframe. I am not sure if I can achieve the same using Spark SQL to parse the output in the manner I want.
Yes, indeed your current approach seems a little too complicated... Creating a large string in the Spark driver and then parallelizing it with Spark is not really performant.
First of all, where are you getting your input data from? In my opinion you should use one of the existing Spark readers to read it. For example you can use:
CSV -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
jdbc -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.jdbc
json -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
parquet -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet
not structured text file -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.textFile
In the next step you can preprocess it using the Spark DataFrame or RDD API, depending on your use case.
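For example, a minimal sketch of the CSV reader (the path and options are placeholders for your actual source):
df = spark.read.csv("/path/to/input.csv", header=True, inferSchema=True)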
A bit late, but currently you're applying a map to create a tuple for each row, containing the string as its first element. Instead of this, you probably want to split the string, which can easily be done inside the map step. Assuming all of your rows have the same number of elements, you can replace
output_df = output_rdd.map(lambda x: (x, )).toDF()
with
output_df = output_rdd.map(lambda x: x.split(",")).toDF()
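A slightly fuller sketch of the same idea, assuming output_str is built as above: filter out the trailing empty line and reuse the header row for the column names.
lines = [l for l in output_str.split("\n") if l]   # drop the trailing empty entry
header = lines[0].split(",")                       # ['Col1', 'Col2', 'Col3', 'Col4']
output_df = sc.parallelize(lines[1:]).map(lambda x: x.split(",")).toDF(header)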