How to remove rows with fewer than 3 letters? - apache-spark

I have a PySpark data frame with many rows. Each row is a text, and there is just one column. I want to delete rows with fewer than 3 letters. For example, in the following 4 rows I want to remove the second and third rows (pdf and a):
this is a text
pdf
a
No ways

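You can filter on the length of the column (here named col). The condition below keeps rows longer than 3 characters, so it also drops the 3-letter "pdf", as in the example:
df2 = df.filter('length(col) > 3')
If spaces should not count towards the length, remove them first:
df2 = df.filter("length(replace(col, ' ', '')) > 3")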

Related

Splitting the data of one Excel column into two columns using Python

I have a problem splitting the content of one Excel column, which contains numbers and letters, into two columns: the numbers in one column and the letters in the other.
As you can see in the first photo, there is no space between the numbers and the letters, but the good thing is that the letters are always "ms". I need a method to split them as in the second photo.
Before
After
I tried to use replace, but it did not work; it did not split them.
Is there any other method?
You can use the extract method. Here is an example:
import pandas as pd
df = pd.DataFrame({'time': ['34ms', '239ms', '126ms']})
# first group captures the digits, second group captures the non-digits
df[['time', 'unit']] = df['time'].str.extract(r'(\d+)(\D+)')
# convert time column into integer
df['time'] = df['time'].astype(int)
print(df)
# output:
#    time unit
# 0    34   ms
# 1   239   ms
# 2   126   ms
It is pretty simple.
You need to use pandas.Series.str.split (see the pandas documentation for its syntax).
The code should be:
import pandas as pd
data_before = {'data': ['34ms', '56ms', '2435ms']}
df = pd.DataFrame(data_before)
# splitting on a capture group keeps the matched digits in the result
result = df['data'].str.split(pat=r'(\d+)', expand=True)
# column 0 is the empty string before the digits, so keep columns 1 and 2
result = result.loc[:, [1, 2]]
result.rename(columns={1: 'number', 2: 'string'}, inplace=True)
print(result)
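As a small variant of the extract approach above, named groups can label the resulting columns directly. This is only a sketch: the group names value and unit are chosen here for illustration and do not come from the original answers.

```python
import pandas as pd

df = pd.DataFrame({'time': ['34ms', '239ms', '126ms']})

# Named groups (?P<name>...) become the column names of the extracted frame.
parts = df['time'].str.extract(r'(?P<value>\d+)(?P<unit>\D+)')

# The extracted digits are still strings, so convert them explicitly.
parts['value'] = parts['value'].astype(int)
print(parts)
```

Rows that do not match the pattern would come back as NaN, in which case the astype(int) step would need a fallback.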

How to remove rows of columns whose value count is less than particular number?

df['Brand'].value_counts() gives the number of occurrences of each value in column Brand. I want to remove all rows where the number of occurrences is less than 6. Column Brand is a string.
Use:
df = pd.DataFrame({'Brand':[1,2,3,3,3,3,3,3,3,3]})
# keep rows whose Brand value occurs at least 6 times
df[df.apply(lambda x: df.value_counts()[x]>=6)['Brand'].values]
A more efficient way, if your data is large:
# True for values that occur at least 6 times
temp = df.value_counts()>=6
df[df['Brand'].isin(temp[temp].index.get_level_values(0).values)]
Another way:
df = pd.DataFrame({'Brand':[1,2,3,3,3,3,3,3,3,3]})
temp = df['Brand'].tolist()
df[df['Brand'].apply(lambda x: temp.count(x)>=6)]
with the same result.
You can also do it as below:
counts = df['Brand'].value_counts()
# keep only the rows whose Brand occurs at least 6 times
df = df[df['Brand'].isin(counts[counts >= 6].index)]
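Another way to express the same filter is groupby with transform, which puts each brand's count on every row so it can be compared directly. This is a minimal sketch with made-up sample data (the brand names are hypothetical); it keeps brands that occur at least 6 times, since the question asks to drop those occurring fewer than 6 times.

```python
import pandas as pd

# Hypothetical sample data; 'Brand' is a string column as in the question.
df = pd.DataFrame({'Brand': ['a'] * 6 + ['b'] * 2 + ['c'] * 7 + ['d']})

# Size of each brand's group, aligned with the original rows.
counts = df.groupby('Brand')['Brand'].transform('size')

# Keep only the rows whose brand occurs at least 6 times.
filtered = df[counts >= 6]
print(filtered['Brand'].value_counts())
```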

Transforming multiple data frame columns into one series

I have a dataset df(250,3) with 250 rows and three columns. I want to write a loop that merges the content of each column in my dataframe into one single series(250,1) of 250 rows and 1 column, 'df_single'. The manual operation is the following:
df_single = df['colour']+" "+df['model']+" "+df['size']
How can I create df_single with a for loop, or non-manually?
I tried the following code, but it raises a TypeError:
df_conc=[]
for var in cols:
    cat_list=df_code_part[var]
    df_conc = df_conc+" "+cat_list
TypeError: can only concatenate list (not "str") to list
If you only need to join 3 columns, I think your solution is really good:
df_single = df['colour']+" "+df['model']+" "+df['size']
If you need a general solution for many columns, use DataFrame.astype to convert to strings if necessary, DataFrame.add to append a whitespace, sum to concatenate, and finally Series.str.rstrip to remove the trailing whitespace:
cols = ['colour','model','size']
df_single = df[cols].astype(str).add(' ').sum(axis=1).str.rstrip()
Or:
df_single = df[cols].astype(str).apply(' '.join, axis=1)
If you want to have spaces between columns, run:
df.apply(' '.join, axis=1)
"Ordinary" df.sum(axis=1) concatenates all columns, but without spaces between them.
If you want to use sum, you need:
df_single=df.astype(str).add(' ').sum(axis=1).str.rstrip()
If you don't want to concatenate all the columns, select the ones you need first:
columns=['colour','model','size']
df_single=df[columns].astype(str).add(' ').sum(axis=1).str.rstrip()
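Since the question asked for a for loop, here is a minimal sketch of one. The TypeError in the question comes from starting the accumulator as an empty list; starting from the first column as a Series avoids it. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    'colour': ['red', 'blue'],
    'model': ['X1', 'Y2'],
    'size': ['S', 'M'],
})
cols = ['colour', 'model', 'size']

# Start from the first column as a Series of strings, not an empty list.
df_single = df[cols[0]].astype(str)
for var in cols[1:]:
    # Series + str + Series concatenates element-wise.
    df_single = df_single + " " + df[var].astype(str)

print(df_single.tolist())
```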

How can I add a column that has value for some of the rows and leaves the rest of them empty

I have a dataset that looks like the following:
I would like to add a column (sentences) to this dataframe. As for the rows, I want it to say, for example, sentence1 on row zero and sentence2 at row 6. So basically I want the sentence column to mark the beginning of every sentence in this dataframe. The sentences are separated by an empty row.
I would be grateful if anyone can help me.
Thank you in advance
First, we will find the positions of the empty rows in the dataframe:
import numpy as np
na_index = np.flatnonzero(df.isnull().any(axis=1))
Now, we will create a list of empty values for the new column:
sentences = [None] * len(df)
Now, we should set the first value in our list to "Sentence1", and after that we can mark all the other sentences in a loop:
sentences[0] = 'Sentence1'
index = 2
for a in na_index:
    # skip an empty row at the very end, which has no following sentence
    if a + 1 < len(sentences):
        sentences[a + 1] = 'Sentence' + str(index)
        index += 1
Finally, we need to add a new column to the dataframe:
df["Sentence#"]=sentences
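The same idea can also be written without an explicit loop. This is a sketch under assumptions: the sample data and the column name word are made up, and a sentence is taken to start on any non-empty row that follows an empty row (or is the first row).

```python
import pandas as pd

# Hypothetical data: one token per row, with an empty (missing) row between sentences.
df = pd.DataFrame({'word': ['I', 'like', 'tea', None, 'You', 'too', None, 'Bye']})

# True for rows that contain a missing value.
is_blank = df.isnull().any(axis=1)

# A sentence starts on a non-blank row whose previous row is blank (or absent).
starts = ~is_blank & is_blank.shift(fill_value=True)

# Number the starts 1, 2, 3, ... and write the labels only on those rows.
df['Sentence#'] = None
df.loc[starts, 'Sentence#'] = 'Sentence' + starts.cumsum()[starts].astype(str)
print(df)
```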

How to get the required output in excel?

Please tell me how to do the following.
Say I have a single column A.
Suppose the first field in the first 3 rows contains:
XPWCS432, XPWCS440, XPWCS394, XPWCS395, XPWCS396, XPWCS397, XPWCS398, XPWCS399, XPWCS476, XPWCS390, XPWCS391
XPWCS432, XPWCS470
XPWCS432, XPWCS434, XPWCS312, XPWCS313, XPWCS314, XPWCS315, XPWCS316, XPWCS317, XPWCS318, XPWCS319, XPWCS320, XPWCS321, XPWCS322, XPWCS323, XPWCS324, XPWCS325, XPWCS326, XPWCS327, XPWCS328, XPWCS329, XPWCS330, XPWCS331, XPWCS372, XPWCS332
The output data should look like this:
1) without leading and trailing commas
2) no spaces between values, no duplicates, and values comma-separated
The following conditions should be met:
1) Remove the , (comma) if it appears at the start of the string.
2) Remove any blank spaces in the string.
3) Sort the words in the string in ascending order and remove the duplicate words.
The data (words) in the field change from row to row, i.e. column1 row1 field1 may contain 3 words,
row2 field1 may contain 10 words,
row3 field1 may contain 20 words,
and so on for, say, some 100 rows.
Thanks,
Srihai
I suggest recording a macro with the following Excel commands:
Text to Columns, with "space" and "comma" as delimiters, to remove them.
Transpose the data row to a data column.
Remove duplicates and sort the data.
Transpose the data column back to a data row.
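If the data can be processed outside Excel, the same steps (split, remove duplicates, sort, join) can be sketched in Python. This is not the Excel macro described above, only an illustration of the logic; the function name clean_field is made up here.

```python
def clean_field(field: str) -> str:
    """Remove spaces and stray commas, drop duplicates, and sort the values."""
    # Removing spaces first, then splitting on commas and discarding empty
    # pieces, also takes care of leading and trailing commas.
    values = {part for part in field.replace(' ', '').split(',') if part}
    # A set removes duplicates; sorted() orders the values ascending.
    return ','.join(sorted(values))


print(clean_field(', XPWCS432, XPWCS470, XPWCS432'))  # XPWCS432,XPWCS470
```

The sort here is a plain string sort, which matches numeric order only while all values have the same length, as they do in the sample data.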
