I have a dataframe in which there is one column that I want to split; a sample row looks as follows:
"Parameter:'river':Chainage"
Now if I split this row
test_line.split(':')
I receive:
['Parameter', "'river'", 'Chainage']
However, I don't want river to keep the surrounding single quotes, hence:
['Parameter', 'river', 'Chainage']
How do I get there?
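One way to get there is to strip the quotes from each piece after splitting; a minimal sketch using the sample value from above:
test_line = "Parameter:'river':Chainage"
# split on ':' and remove any surrounding single quotes from each piece
parts = [part.strip("'") for part in test_line.split(":")]
print(parts)  # ['Parameter', 'river', 'Chainage']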
I have a PySpark dataset with a column "channels" that looks like this:
channels
name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5
I want to create 5 new columns, i.e. "channel1", "channel2", "channel3", "channel4", "channel5".
Then, I want to split the contents of the “channels” column using the comma separator. After splitting values from each row, I want to put each separated value in a different column.
For example for the first row, the columns should look like this:
channel1 channel2 channel3 channel4 channel5
name1 name2 name3 name4 ~
When an element is not found, I want to use ~ as the column value. For example, in the first row there were only 4 values instead of 5, so for the channel5 column I used ~.
I only want to use ~, not None or NULL.
How can I achieve this result in pyspark?
I tried this:
df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
df = df.drop("channels_split")
but it gives me an error that:
`~` is missing
You're referencing the column `~`, but it is missing from the schema. Please check your code.
Note that I am using PySpark within Foundry.
coalesce expects columns as arguments and you are providing a string; I think you should use lit("~").
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.coalesce.html
This is more of a Spark problem than a Foundry problem, though.
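A sketch of that fix, wrapping the fallback in lit() so coalesce receives a column instead of a plain string (assuming the same column names as above):
from pyspark.sql.functions import coalesce, lit, split

df = df.withColumn("channels_split", split(df["channels"], ","))
for i in range(5):
    # index i returns NULL when the array is too short, so coalesce falls back to "~"
    df = df.withColumn(f"channel{i + 1}", coalesce(df["channels_split"][i], lit("~")))
df = df.drop("channels_split")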
What #M_S said is correct: as the error message states, you need a column and should use the lit function.
But be careful: if spark.sql.ansi.enabled is set to True, this code will throw an ArrayIndexOutOfBoundsException if there are fewer than 5 items in your array.
Another way to do this would be to ensure that your array has at least 5 elements on each row by appending some ~ values and then taking the first 5 items, or by adding a temporary column with the length of the array and using a when condition.
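A sketch of the when-based variant (assuming the same column names; the array access only happens when the index is in range):
from pyspark.sql.functions import col, size, split, when

channels_split = split(col("channels"), ",")
df = df.select(
    "*",
    *[
        # fall back to "~" whenever the array is shorter than i + 1
        when(size(channels_split) > i, channels_split[i]).otherwise("~").alias(f"channel{i + 1}")
        for i in range(5)
    ],
)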
By the way, you don't need to repeat df = every time:
from pyspark.sql.functions import array_repeat, col, concat, lit, split

df = (
    df
    .withColumn("channels_split", split(col("channels"), ","))
    # pad the array with five "~" placeholders so indices 0-4 always exist
    .withColumn("default_values", array_repeat(lit("~"), 5))
    .withColumn("channels_split", concat(col("channels_split"), col("default_values")))
    .withColumn("channel1", col("channels_split")[0])
    .withColumn("channel2", col("channels_split")[1])
    .withColumn("channel3", col("channels_split")[2])
    .withColumn("channel4", col("channels_split")[3])
    .withColumn("channel5", col("channels_split")[4])
    .drop("channels_split", "default_values")
)
I have a pandas dataframe like this:
How do I get the price of item1 without making the 'Items' column an index column?
I tried df['Price (R)'][item1] but it returns the price of item2, while I expect the output to be 1.
The loc operator is required in front of the selection brackets []. When using loc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select. Therefore, the code can be:
result = df.loc[df['Items'] == "item1", "Price (R)"]
The data type of the created output is a pandas Series object.
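A small self-contained sketch with made-up data to illustrate (the column names follow the question):
import pandas as pd

# made-up data mirroring the question's columns
df = pd.DataFrame({"Items": ["item1", "item2"], "Price (R)": [1, 2]})

result = df.loc[df["Items"] == "item1", "Price (R)"]
print(result.iloc[0])  # 1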
I have a problem in which I have to search for the first occurrence of "Employee_ID" in "Mapped_Project_ID", and then pick the values in the array from that first matching occurrence to the last value.
I have one dataframe like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
I want to have an output df like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
Not sure how to achieve this.
Can someone provide help on this, or the logic to handle it in Spark without the need for any UDFs?
Once you have your dataframe, you can use Spark 2.4's higher-order array functions (see https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) to filter out any values within the array that are lower than the value in the Employee_ID column, like so:
myDataframe
  .selectExpr(
    "Employee_Name",
    "Employee_ID",
    "filter(Mapped_Project_ID, x -> x >= Employee_ID) as Mapped_Project_ID"
  );
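If you prefer to stay in PySpark, a hedged equivalent of the same expression (assuming the same column names and that the IDs sort lexically, E101 < E102 < ...):
from pyspark.sql.functions import expr

result = myDataframe.select(
    "Employee_Name",
    "Employee_ID",
    # keep only IDs greater than or equal to the row's Employee_ID
    expr("filter(Mapped_Project_ID, x -> x >= Employee_ID)").alias("Mapped_Project_ID"),
)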
Here is my dataframe
Word 1_gram-Probability
0 ('A',) 0.001461
1 ('45',) 0.000730
Now I just want to select the row where Word is 45. I tried
print(simple_df.loc[simple_df['Word']=='45'])
but i get
Empty DataFrame
What am I missing? Is this the correct way of accessing the row? I also tried ('45',) as the value but that did not work either.
It appears that you have the literal string value "('45',)" in the cell of your dataframe, so you must match it exactly:
simple_df.loc[simple_df['Word']=="('45',)"]
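If you would rather match on the bare value, one hedged alternative is to strip the tuple formatting first:
# Sketch: remove the surrounding "('" and "',)" characters, then compare.
cleaned = simple_df["Word"].str.strip("(),'")
print(simple_df.loc[cleaned == "45"])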
I have a spreadsheet looking like this:
あう to meet
青 あお blue
青い あおい blue
Is there a way that I could convert the data in these columns into three SQL statements that I could then use to enter the data into a database? I am not so much asking how to form the statement; I would like to know whether there's a way I can take the data in the columns and turn it into a script.
If I could convert this into:
col1: 青い col2: あおい col3: blue
I could add and modify this for the correct format
INSERT INTO JLPT (col1, col2, col3) VALUES ('青い', 'あおい', 'blue')
etc
Use the formula
="('"&A1&"', '"&B1&"', '"&C1&"'), "
in column D and copy the formula down for all the rows. Then prepend
INSERT INTO JLPT (col1, col2, col3) VALUES
and you are done. The end result will be something like this:
Just don't forget to delete the last comma (and optionally exchange it for a semicolon) when you copy over the data from Excel.
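If you prefer a script over a formula, a rough Python sketch of the same idea (assuming the sheet has been exported to a hypothetical jlpt.csv with three columns and no header; real code should escape quotes or use parameterized inserts):
import csv

# read every non-empty row from the exported spreadsheet
with open("jlpt.csv", newline="", encoding="utf-8") as f:
    rows = [row for row in csv.reader(f) if row]

# build one VALUES tuple per row and join them into a single INSERT statement
values = ",\n".join("('{}', '{}', '{}')".format(*row[:3]) for row in rows)
print("INSERT INTO JLPT (col1, col2, col3) VALUES\n" + values + ";")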