I I have a Pyspark dataset with a column “channels” that looks like this:
channels
name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5
I want to create 5 new columns i.e “channel1, channel2, channel3, channel4, channel5”.
Then, I want to split the contents of the “channels” column using the comma separator. After splitting values from each row, I want to put each separated value in a different column.
For example for the first row, the columns should look like this:
channel1 channel2 channel3 channel4 channel5
name1 name2 name3 name4 ~
When an element is not found, i want to use ~ as the column value. For example in the first row, there were only 4 values instead of 5 so for the channel5 column, I used ~
I only want to use ~, not None or NULL.
How can I achieve this result in pyspark?
I tried this:
df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
df = df.drop("channels_split")
but it gives me an error that:
`~` is missing
You're referencing the column `~`, but it is missing from the schema. Please check your code.
Note that I am using pyspark within Foundry
Coalesce expects cols as arguments and you are providing String, i think that you should use lit("~")
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.coalesce.html
This is more of a spark problem than a Foundry problem though.
What #M_S said is correct, as the error message was stating, you need a column and should use the lit function.
But be careful if
spark.sql.ansi.enabled is set to True
then this code will throw an ArrayIndexOutOfBoundsException if there are less than 5 items in your array.
Another way to do this would be to ensure that your array has at least 5 elements on each row by adding some ~, then get the first 5 items or by adding a temporary column with the length of the array and use a when condition.
By the way, you don't need to repeat df = every time
df = (
df
.withColumn("channels_split", split(df["channels"], ","))
.withColumn("default_values", array_repeat(lit("~"), 5)))
.withColumn("channels_split", concat(col("channels_split"), col("default_values")))
.withColumn("channel1", df["channels_split"][0])
.withColumn("channel2", df["channels_split"][1])
.withColumn("channel3", df["channels_split"][2])
.withColumn("channel4", df["channels_split"][3])
.withColumn("channel5", df["channels_split"][4])
.drop("channels_split")
)
Related
I have a pandas dataframe like this:
How do I get the price of an item1 without making 'Items column' an index column?
I tried df['Price (R)'][item1] but it returns the price of item2, while I expect output to be 1
The loc operators are required in front of the selection brackets []. When using loc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select. Therefore, the code can be:
result = df.loc[df['Items']=="item1","Price(R)"]
The data type of created output is Pandas Series object.
I have a dataframe df as below.
I want the final dataframe to be like this as follows. i.e, for each unique Name only last 2 rows must be present in the final output.
i tried the following snippet but its not working.
df = df[df['Name']].tail(2)
Use GroupBy.tail:
df1 = df.groupby('Name').tail(2)
Just one more way to solve this using GroupBy.nth:
df1 = df.groupby('Name').nth([-1,-2]) ## this will pick the last 2 rows
I am trying to split a copy off of a Pandas dataframe starting after a certain column by header name.
So far, I've been able to manipulate the column headers or indexes according to a set number of known columns, like below. However, the number of columns will change, and I want to still extract every column that happens after.
In the below example, say I want to grab all columns after 'Tail' even if the 'Body' columns goes to column X. So the below sample with X number of Body columns:
df = pd.DataFrame({'Intro1': ['blah'],
'Intro2': ['blah'],'Intro3': ['blah'],'Body1': ['blah'],'Body2': ['blah'],'Body3': ['blah'],'Body4': ['blah'], ... 'BodyX': ['blah'],'Tail': ['blah'],'OtherTail': ['blah'],'StillAnotherTail': ['blah'],})
Should produce a copy of the dataframe as:
dftail = pd.DataFrame({'Tail': ['blah'],'OtherTail': ['blah'],'StillAnotherTail': ['blah'],})
Ideally I'd like to find a way to combine the two techiques below so that the column starts at 'Tail' and goes to the end of the dataframe:
dftail = [col for col in df if col.startswith('Tail')]
dftail = df.iloc[:, 164:] # column number (164) will change based on 'Tail' index number
How about this:
df_tail = df.iloc[:, list(df.columns).index("Tail"):]
df_tail then prints out:
Tail OtherTail StillAnotherTail
0 blah blah blah
I have a dataframe with 1M+ rows. A sample of the dataframe is shown below:
df
ID Type File
0 123 Phone 1
1 122 Computer 2
2 126 Computer 1
I want to split this dataframe based on Type and File. If the total count of Type is 2 (Phone and Computer), total number of files is 2 (1,2), then the total number of splits will be 4.
In short, total splits is as given below:
total_splits=len(set(df['Type']))*len(set(df['File']))
In this example, total_splits=4. Now, I want to split the dataframe df in 4 based on Type and File.
So the new dataframes should be:
df1 (having data of type=Phone and File=1)
df2 (having data of type=Computer and File=1)
df3 (having data of type=Phone and File=2)
df4 (having data of type=Computer and File=2)
The splitting should be done inside a loop.
I know we can split a dataframe based on one condition (shown below), but how do you split it based on two ?
My Code:
data = {'ID' : ['123', '122', '126'],'Type' :['Phone','Computer','Computer'],'File' : [1,2,1]}
df=pd.DataFrame(data)
types=list(set(df['Type']))
total_splits=len(set(df['Type']))*len(set(df['File']))
cnt=1
for i in range(0,total_splits):
for j in types:
locals()["df"+str(cnt)] = df[df['Type'] == j]
cnt += 1
The result of the above code gives 2 dataframes, df1 and df2. df1 will have data of Type='Phone' and df2 will have data of Type='Computer'.
But this is just half of what I want to do. Is there a way we can make 4 dataframes here based on 2 conditions ?
Note: I know I can first split on 'Type' and then split the resulting dataframe based on 'File' to get the output. However, I want to know of a more efficient way of performing the split instead of having to create multiple dataframes to get the job done.
EDIT
This is not a duplicate question as I want to split the dataframe based on multiple column values, not just one!
You can make do with groupby:
dfs = {}
for k, d in df.groupby(['Type','File']):
type, file = k
# do want ever you want here
# d is the dataframe corresponding with type, file
dfs[k] = d
You can also create a mask:
df['mask'] = df['File'].eq(1) * 2 + df['Type'].eq('Phone')
Then, for example:
df[df['mask'].eq(0)]
gives you the first dataframe you want, i.e. Type==Phone and File==1, and so on.
I have large Excel files, which contain observations of objects. I read the files via pandas, group them and then iterate over each group. For each group I calculate specific and quite complex results - let's say result1, result2 and optional result3.
I define an empty df with predefined columns, in which I insert my calculated values. In the end I combine all dfs to one final df.
Maybe better explained by code:
data = pd.read_excel()
grouped = data.groupby('obj_id')
columns = ['result1','result2','result3']
combined_results = pd.DataFrame(columns=columns)
for obj_id, obj_df in grouped:
obj_results = pd.DataFrame(columns=columns,index=[0])
# creates empty df with one all NaN row
obj_results['result1'] = fooCalculation(obj_df)
obj_results['result2'] = fooCalculation(obj_df)
combined_results = combined_results.append(obj_results, sort=False)
I like my current method, because if optional columns end up having no value, the col still exists, because it was set (to NaN) initially. This way I can one by one calculate my result values per object and update my df row as soon as I have updates.
I can't help but think, that this is not the cleanest way. Especially because obj_results['result1'] = fooCalculation() sets an entire column and I falsely use it to set one value.
What is the clean / best-pratice way here.
Should I instead "cache" the results in a dict and insert it into the combine_results?