How to get the required output in excel? - excel

Please tell me how to do the following.
Say I have a single column A.
Suppose the first field of the first 3 rows contains:
XPWCS432, XPWCS440, XPWCS394, XPWCS395, XPWCS396, XPWCS397, XPWCS398, XPWCS399, XPWCS476, XPWCS390, XPWCS391
XPWCS432, XPWCS470
XPWCS432, XPWCS434, XPWCS312, XPWCS313, XPWCS314, XPWCS315, XPWCS316, XPWCS317, XPWCS318, XPWCS319, XPWCS320, XPWCS321, XPWCS322, XPWCS323, XPWCS324, XPWCS325, XPWCS326, XPWCS327, XPWCS328, XPWCS329, XPWCS330, XPWCS331, XPWCS372, XPWCS332
The output should look like this:
1) Without leading or trailing commas.
2) No spaces between values, no duplicates, and values comma-separated.
The following conditions should be met:
1) Remove the comma (,) if it appears at the start of the string.
2) Remove any blank spaces in the string.
3) Sort the words in the string in ascending order and remove duplicate words.
The number of words in the field changes from row to row, i.e. column 1, row 1 may contain 3 words,
row 2 may contain 10 words,
row 3 may contain 20 words,
and so on; there may be some 100 rows.
Thanks,
Srihai

I would suggest recording a macro of the following Excel commands:
Text to Columns, with "space" and "comma" as delimiters, to remove them.
Transpose the data row into a data column.
Remove duplicates and sort the data.
Transpose the data column back into a data row.
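Outside of Excel, the same clean-up (split on commas, strip spaces, drop duplicates, sort, re-join) can be sketched in a few lines of Python. This is only an illustration of the steps above, not an Excel macro; the function name normalize_cell is made up for the example, and the sort is plain lexicographic, which is fine here only because all the codes have the same length.

```python
def normalize_cell(cell: str) -> str:
    """Split on commas, remove spaces, drop duplicates, sort, re-join."""
    # Removing every space also takes care of blanks inside a value.
    tokens = [part.replace(" ", "") for part in cell.split(",")]
    # Empty tokens come from leading, trailing or doubled commas.
    unique = sorted({token for token in tokens if token})
    return ",".join(unique)


row = ", XPWCS432, XPWCS470, XPWCS432 ,"
print(normalize_cell(row))  # XPWCS432,XPWCS470
```

Applied to each cell of column A in turn, this yields one cleaned string per row, whatever the number of words in that row.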

Related

How do I drop complete rows (including all values in it) that contain a certain value in my Pandas dataframe?

I'm trying to write a python script that finds unique values (names) and reports the frequency of their occurrence, making use of Pandas library. There's a total of around 90 unique names, which I've anonymised in the head of the dataframe pasted below.
,1,2,3,4,5
0,monday09-01-2022,tuesday10-01-2022,wednesday11-01-2022,thursday12-01-2022,friday13-01-2022
1,Anonymous 1,Anonymous 1,Anonymous 1,Anonymous 1,
2,Anonymous 2,Anonymous 4,Anonymous 5,Anonymous 5,Anonymous 5
3,Anonymous 3,Anonymous 3,,Anonymous 6,Anonymous 3
4,,,,,
I'm trying to drop any row (the full row) that contains the regex expression "^monday.*", intending to indicate the word "monday" followed by any other number of random characters. I want to drop/deselect any cell/value within that row.
To achieve this goal, I've tried using the line of code below (and many other approaches I found on SO).
df = df[df[1].str.contains("^monday.*", case = True, regex=True) == False]
To clarify, I'm trying to search the values of column "1" for the pattern "^monday.*" and then deselect the rows, and all values in those rows, that match the regex. I've successfully removed "monday09-01-2022", "tuesday10-01-2022", etc., but I'm also losing random names that are not in the matching rows.
Any help would be very much appreciated! Thank you!
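One possible cause, assuming column 1 contains missing values: str.contains can return a missing value rather than False for an empty cell, and comparing that to False yields False, so the whole row is dropped along with the names in its other columns. A minimal sketch of a filter that keeps such rows, using na=False and the ~ operator; the small dataframe below is invented for illustration and only loosely mirrors the one in the question.

```python
import pandas as pd

# Hypothetical sample: the last row has no value in column 1,
# but does have a name in column 2.
df = pd.DataFrame({
    1: ["monday09-01-2022", "Anonymous 1", "Anonymous 2", None],
    2: ["tuesday10-01-2022", "Anonymous 1", "Anonymous 4", "Anonymous 7"],
})

# na=False treats missing cells as "no match", so they are not dropped.
is_header_row = df[1].str.contains(r"^monday", regex=True, na=False)

# Keep every row that does not start with "monday" in column 1.
kept = df[~is_header_row]
print(kept)
```

If the column labels were read in as strings rather than integers, df["1"] would be needed instead of df[1].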

How to extract text from a string where multiple entries meet the criteria, and return all values

This is an example of the string, and it can be longer
1160752 Meranji Oil Sats -Mt(MA) (000600007056 0001), PE:Toolachee Gas Sats -Mt(MA) (000600007070 0003)GL: Contract Services (510000), COT: Network (N), CO: OM-A00009.0723,Oil Sats -Mt(MA) (000600007053 0003)
The result needs to be column1 600007056 column2 600007070 column3 600007053
I am working in Spotfire and creating calculated columns through transformations, as I need the columns to join to other data sets.
I have tried the below, but it only picks up the first 600.. number and not the others, and there can be an undefined number of those.
Account is the column with the string
Mid([Account],
Find("(000",[Account]) + Len("(000"),
Find("0001)",[Account]) - Find("(000",[Account]) - Len("(000"))
Thank you!
Assuming my guess is correct, the pattern to look for is:
9 digits, starting with 6, preceded by an opening parenthesis and 3 zeros, and followed by a space, 4 digits and a closing parenthesis.
You can grab individual occurrences with:
column1: RXExtract([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))',1)
column2: RXExtract([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))',2)
etc.
The tricky bit is to find how many columns to define, as you say there can be many. One way to know would be to first calculate the maximum number of occurrences like this:
maxn: Max((Len([Account]) - Len(RXReplace([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))','','g'))) / 9)
still assuming each number to extract has 9 digits. This takes the difference in length between the original [Account] and a copy with every matched pattern replaced by an empty string, and divides it by 9.
Then you know you can define up to maxn columns; for rows with fewer occurrences, the extra columns will be empty.
Note that Spotfire always wants two backslashes for escaping (I had to add more in the editor to make it render correctly; I hope I have not missed any).
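To check the pattern outside Spotfire, here is a small Python sketch that runs the same regular expression over the sample string from the question. It is not Spotfire expression syntax: Python's re module needs only single backslashes in a raw string, and the variable names are chosen for the example.

```python
import re

# Same pattern as above: 9 digits starting with 6, preceded by "(000"
# and followed by a space, 4 digits and ")".
PATTERN = re.compile(r"(?<=\(000)6\d{8}(?=\s\d{4}\))")

account = (
    "1160752 Meranji Oil Sats -Mt(MA) (000600007056 0001), "
    "PE:Toolachee Gas Sats -Mt(MA) (000600007070 0003)"
    "GL: Contract Services (510000), COT: Network (N), "
    "CO: OM-A00009.0723,Oil Sats -Mt(MA) (000600007053 0003)"
)

# All occurrences at once, in order of appearance.
matches = PATTERN.findall(account)
print(matches)  # ['600007056', '600007070', '600007053']

# The length trick used for maxn: characters removed, divided by 9.
count_by_length = (len(account) - len(PATTERN.sub("", account))) // 9
print(count_by_length)  # 3
```

The second print shows that the length-difference trick agrees with the number of matches for this row.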

How to remove rows with less than 3 letters?

I have a PySpark data frame with many rows. Each row is a text, and there is just one column. I want to remove the rows with less than 3 letters. For example, in the following 4 rows I want to remove the second and third rows (pdf and a):
this is a text
pdf
a
No ways
You can filter using the length of the column:
df2 = df.filter('length(col) > 3')
If spaces matter, you can remove them first:
df2 = df.filter("length(replace(col, ' ', '')) > 3")

How can I add a column that has value for some of the rows and leaves the rest of them empty

I have a dataset that looks like the following:
I would like to add a column (sentences) to this dataframe. I want it to say, e.g., sentence1 on row zero and sentence2 on row 6. So basically I want the sentence column to mark the beginning of every sentence in this dataframe. The sentences are separated by a space.
I would be grateful if anyone can help me.
Thank you in advance
First, we will find the positions of the empty rows in the dataframe:
import numpy as np
import pandas as pd
na_index = np.flatnonzero(df.isnull().any(axis=1))
Now, we will create a list of placeholders for the new column:
sentences = [None] * len(df)
Next, we set the first value in the list to "Sentence1", and then mark the start of every other sentence in a loop:
sentences[0] = "Sentence1"
index = 2
for a in na_index:
    # The row after an empty row starts a new sentence.
    if a + 1 < len(df):
        sentences[a + 1] = "Sentence" + str(index)
        index += 1
Finally, we add the new column to the dataframe:
df["Sentence#"] = sentences
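The same idea can also be written without an explicit loop. The sketch below is only one possible variant and rests on assumptions not in the question: a single hypothetical column named word, with an empty (null) row between sentences.

```python
import pandas as pd

# Hypothetical data: one token per row, a null row between sentences.
df = pd.DataFrame({"word": ["The", "cat", None, "A", "dog", "barks"]})

empty = df["word"].isnull()

# A row starts a sentence if it is not empty and the previous row is empty
# (row 0 is treated as if an empty row came before it).
starts = empty.shift(fill_value=True) & ~empty

# Number the starts 1, 2, ... and leave every other row missing.
labels = "Sentence" + starts.cumsum().astype(str)
df["Sentence#"] = labels.where(starts)
print(df)
```

Rows that do not begin a sentence are left as missing values in the new column, matching the requirement that the rest of the column stays empty.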

Compare multiple columns, pull out only cells that appear in every column

I have 10 or so columns in my worksheet. Each column contains about 200 names, and there is no other data on the sheet.
What I'd like to do is create a new column that only contains the names that are common to all the columns. So essentially compare each cell in each column to all the other cells in all the other columns, and only return the common cells.
For example:
Column1 : name_A, name_C, name_F
Column2: name_C, name_B, name_D
Column3: name_C, name_Z, name_X
So in this example, the new column would only contain name_C, because it's the only value common to all three columns.
Is there any way to do this? My knowledge of Excel is quite poor, and I can't find anything similar to my problem online so I would appreciate any help.
Thanks for reading,
N
Putting everything on a single sheet and creating a pivot table is probably more efficient than the algorithm you have in mind.
Here is my mock-up. I added extra names to demonstrate it better.
D(formula) is the easiest version. It lists only the values that appear in all columns, but each one stays on the same line as the corresponding name in column A, with blanks in between and unsorted (giving D(result)).
If you would like all the names to appear at the top, as shown here in column E, you can either sort your table (you will have to re-sort if the columns change) or use my solution below:
Get yourself the MoreFunc add-in for Excel (here is the last working download link I found, and here is a good installation walk-through video).
Once that is done, select cells E1:E8, click the formula bar and type the following: =UNIQUEVALUES(IF(COUNTIF(A1:C8,A1:A8)=3,A1:A8,""))
Accept the formula by pressing Ctrl+Shift+Enter (this creates an array formula, and curly braces will appear around your formula).
    | A      | B      | C      | D (formula)                        | D (result) | E (result, sorted)
----+--------+--------+--------+------------------------------------+------------+-------------------
  1 | name_A | name_C | name_C | =IF(COUNTIF($A$1:$C$8,A1)=3,A1,"") |            | name_m
  2 | name_C | name_B | name_Z | =IF(COUNTIF($A$1:$C$8,A2)=3,A2,"") | name_C     | name_C
  3 | name_F | name_D | name_X | =IF(COUNTIF($A$1:$C$8,A3)=3,A3,"") |            |
  4 | name_t | name_o | name_g | =IF(COUNTIF($A$1:$C$8,A4)=3,A4,"") |            |
  5 | name_y | name_p | name_h | =IF(COUNTIF($A$1:$C$8,A5)=3,A5,"") |            |
  6 | name_u | name_k | name_7 | =IF(COUNTIF($A$1:$C$8,A6)=3,A6,"") |            |
  7 | name_i | name_5 | name_9 | =IF(COUNTIF($A$1:$C$8,A7)=3,A7,"") |            |
  8 | name_m | name_m | name_m | =IF(COUNTIF($A$1:$C$8,A8)=3,A8,"") | name_m     |
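For comparison, the underlying logic (keep only the names present in every column) is a set intersection. The Python sketch below illustrates it on the three example columns from the question; the function name common_values is made up for the example, and it says nothing about how to get the data out of Excel.

```python
def common_values(columns):
    """Return the values found in every column, in first-column order."""
    if not columns:
        return []
    # Values that occur in the first column and in all the others.
    shared = set(columns[0]).intersection(*columns[1:])
    seen = set()
    result = []
    for name in columns[0]:
        # Keep first-column order and report each name once.
        if name in shared and name not in seen:
            seen.add(name)
            result.append(name)
    return result


columns = [
    ["name_A", "name_C", "name_F"],
    ["name_C", "name_B", "name_D"],
    ["name_C", "name_Z", "name_X"],
]
print(common_values(columns))  # ['name_C']
```

Unlike a count that must equal the number of columns, an intersection is not thrown off by a name that appears twice in one column and is missing from another.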
