pySpark 3.0 how to trim spaces for all columns [duplicate] - apache-spark

This question already has answers here:
Trim in a Pyspark Dataframe
(5 answers)
Closed 1 year ago.
For this dataframe: How to trim all leading and trailing spaces for each column in a loop?
df = spark.createDataFrame(
    [
        (' a', '10 ', ' b '),  # create your data here, be consistent in the types.
    ],
    ['col1', 'col2', 'col3']  # add your column labels here
)
df.show(5)
I know how to do that by specifying each column like below, but I need to do it for all columns in a loop, because in the real case I will not know the column names or how many columns there are.
from pyspark.sql.functions import trim
df = df.withColumn("col2", trim(df.col2))
df.show(5)

You can use a list comprehension to apply trim to all columns:
from pyspark.sql.functions import trim, col
df2 = df.select([trim(col(c)).alias(c) for c in df.columns])
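As a quick sanity check, a sketch using the sample frame above (note that trim operates on strings, so non-string columns would be implicitly cast to strings by this approach):
df2.show()
# expected values: col1='a', col2='10', col3='b'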

Related

Pandas : Reorganization of a DataFrame [duplicate]

This question already has answers here:
Split (explode) pandas dataframe string entry to separate rows
(27 answers)
Closed 2 years ago.
I'm looking for a way to clean the following data:
I would like to output something like this:
with the tokenized words in the first column and their associated labels on the other.
Is there a particular strategy with Pandas and NLTK to obtain this type of output in one go?
Thank you in advance for your help or advice
Given the 1st table, it's simply a matter of splitting the first column and repeating the 2nd column:
import pandas as pd
data = [['foo bar', 'O'], ['George B', 'PERSON'], ['President', 'TITLE']]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])
print(df1)
# For each row, build a Series indexed by the split words, all carrying the row's
# label, then concatenate and turn the word index back into a column.
df2 = pd.concat([pd.Series(row['col2'], row['col1'].split(' '))
                 for _, row in df1.iterrows()]).reset_index()
df2 = df2.rename(columns={'index': 'col1', 0: 'col2'})
print(df2)
The output:
col1 col2
0 foo bar O
1 George B PERSON
2 President TITLE
col1 col2
0 foo O
1 bar O
2 George PERSON
3 B PERSON
4 President TITLE
As for splitting the 1st column, you want to look at the split method, which supports regular expressions and should let you handle the various language delimiters:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
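If you are on pandas 0.25 or later, a shorter sketch of the same idea (assuming the df1 built above) is to split the column and let explode repeat the label for you:
df2 = (df1.assign(col1=df1['col1'].str.split(' '))
          .explode('col1')
          .reset_index(drop=True))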
If the 1st table is not given, there is no way to do this in one go with pandas, since pandas has no built-in NLP capabilities.

Get column names if 'value' is in a list pandas Python

I need to find the column names that contain one of these words: COMPLETE, UPDATED or PARTIAL.
This is my code, which is not working:
import pandas as pd
df = pd.DataFrame({'col1': ['', 'COMPLETE', ''],
                   'col2': ['UPDATED', '', ''],
                   'col3': ['', 'PARTIAL', '']})
print(df)
items=["COMPLETE", "UPDATED", "PARTIAL"]
if x in items:
print (df.columns)
this is the desired output:
I tried to get inspiration from this question, Get column name where value is something in pandas dataframe, but I couldn't wrap my head around it.
We can combine where, isin and stack:
s=df.where(df.isin(items)).stack().reset_index(level=0,drop=True).sort_index()
s
col1 COMPLETE
col2 UPDATED
col3 PARTIAL
dtype: object
Here's one way to do it.
# check each column for any matches from the items list.
matched = df.isin(items).any(axis=0)
# produce a list of column labels with a match.
matches = list(df.columns[matched])
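With the sample frame above, every column contains one of the items, so this should print:
print(matches)
# ['col1', 'col2', 'col3']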

Build Pandas DataFrame with String Entries using 2 Separate DataFrames

Suppose you have two separate pandas DataFrames with the same row and column indices (in my case, the column indices were constructed by .unstack()'ing a MultiIndex built using df.groupby([col1,col2]))
df1 = pd.DataFrame({'a':[.01,.02,.03],'b':[.04,.05,.06]})
df2 = pd.DataFrame({'a':[.04,.05,.06],'b':[.01,.02,.03]})
Now suppose I would like to create a 3rd DataFrame, df3, where each entry of df3 is a string which uses the corresponding element-wise entries of df1 and df2. For example,
df3.iloc[0,0] = '{:.0%}'.format(df1.iloc[0,0]) + '\n' + '{:.0%}'.format(df2.iloc[0,0])
I recognize this is probably easy enough to do by looping over all entries in df1 and df2 and building each entry of df3 from them (which can be slow for large DataFrames), or even by joining the two DataFrames together (which may require renaming columns). But I am wondering if there is a more pythonic / pandorable way of accomplishing this, possibly using applymap or some other built-in pandas function?
The question is similar to Combine two columns of text in dataframe in pandas/python, but the previous question does not consider combining multiple DataFrames into a single one.
IIUC, you just need to add df1 and df2 as strings, joined with '\n':
df3 = df1.astype(str) + '\n' + df2.astype(str)
Out[535]:
a b
0 0.01\n0.04 0.04\n0.01
1 0.02\n0.05 0.05\n0.02
2 0.03\n0.06 0.06\n0.03
You can make use of the vectorized operations of Pandas (given that the dataframes share row and column index)
(df1 * 100).astype(str) + '%\n' + (df2 * 100).astype(str) + '%'
You get
a b
0 1.0%\n4.0% 4.0%\n1.0%
1 2.0%\n5.0% 5.0%\n2.0%
2 3.0%\n6.0% 6.0%\n3.0%
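If you want the integer-percent formatting from the question ('1%' rather than '1.0%'), a sketch along the same lines is to format each frame element-wise first and then concatenate the strings:
df3 = df1.applymap('{:.0%}'.format) + '\n' + df2.applymap('{:.0%}'.format)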

How to add a column to the left of a dataframe [duplicate]

This question already has answers here:
Add column to dataframe with constant value
(10 answers)
Closed 3 years ago.
I am trying to add a column to the left of the dataframe. By default it seems to add to the right. Is there a way to add the columns to the left?
Here is my code:
import pandas as pd
import numpy as np
df = pd.read_csv("/home/Sample Text Files/sample5.csv", delimiter = "\t")
df=pd.DataFrame(df)
df['Creation_DT']=pd.to_datetime('today')
print(df)
Here is the output:
ID,Name,Age Creation_DT
0 1233,Maliva,15 2019-07-17 11:11:37.145194
I want the output to be like this:
Creation_DT, ID, Name, Age
[value], [Value], [Value], [Value]
you can add a line of code to rearrange the columns as follows:
df = df[['Creation_DT', 'ID', 'Name', 'Age']]
another option is to insert the column at position 0 when you create it:
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
try df.insert to place the column at a specific position:
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
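If the other column names are not known in advance, a sketch of a generic reorder (moving Creation_DT to the front without hardcoding the rest) is:
df = df[['Creation_DT'] + [c for c in df.columns if c != 'Creation_DT']]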

Apache Spark -- Parsing the data and convert columns into rows

I need to convert columns into rows. Please help me with the below requirement in Spark Scala code. The input file is |-delimited and one of the columns holds a comma-delimited value; based on that comma delimiter I need to convert it into rows.
my input records:
c11|c12|a,b|c14
c21|c22|a,c,d|c24
expected output :
a,c11,c12,c14
b,c11,c12,c14
a,c21,c22,c24
c,c21,c22,c24
d,c21,c22,c24
Thanks,
Siva
First, read the file as CSV with | as the separator:
This gives a dataframe with the base columns you need, except that the third one is a single string. Let's say this column is called _c2 (the default name for the third column). Now you can split that string to get an array.
We also drop the original column since we don't need it anymore.
Lastly, we use explode to turn the array into rows and drop the intermediate column:
from pyspark.sql.functions import split
from pyspark.sql.functions import explode
df1 = spark.read.csv("pathToFile", sep="|")
df2 = df1.withColumn("splitted", split(df1["_c2"],",")).drop("_c2")
df3 = df2.withColumn("exploded", explode(df2["splitted"])).drop("splitted")
or in Scala (free form):
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.explode
val df1 = spark.read.option("sep", "|").csv("pathToFile")
val df2 = df1.withColumn("splitted", split(df1("_c2"),",")).drop("_c2")
val df3 = df2.withColumn("exploded", explode(df2("splitted"))).drop("splitted")
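To match the expected output exactly (exploded value first), you can reorder the columns on the way out; a sketch for the PySpark version above, assuming the default column names _c0, _c1 and _c3 from the header-less read:
df3.select("exploded", "_c0", "_c1", "_c3").show()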
