Power Query: how to create a conditional column that removes numbers and keeps text (Excel)

Col1 contains rows that hold either just numbers or just text. Example:
row 1 = 70
row 2 = RS
row 3 = abcddkss
row 5 = 5
row 6 = 88
and so on.
What I want to do is add a column using logic like this: if Col1 is not a number then Col1 else null.
What I have so far:
= let
    mylist = List.RemoveItems(List.Transform({1..126}, each Character.FromNumber(_)), {"0".."9"})
in
    if List.Contains(mylist, Text.From([Column1])) then [Column1] else null
However, this only works for rows that contain a single letter; it does not work for rows with more than one letter.

You can use this (here the column is named dat):
if Value.Is(Value.FromText([dat]), type number) then null else [dat]

You could also check whether the string consists purely of digit characters:
if [Column1] = Text.Select([Column1], {"0".."9"}) then null else [Column1]

Related

Increase the values in a column based on values in another column in pandas

I have my source data in the form of csv file as below:
id,col1,col2
123,11|22|33||||||,val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7
I need to add a new column (fnlsrc) whose value depends on col1 and col2. If col1 has 9 pipe-separated values and col2 has 3 pipe-separated values, then fnlsrc must hold 9 pipe-separated values, i.e. 3 sets of col2 (val1|val3|val2|val1|val3|val2|val1|val3|val2). The expected output below should make the requirement easier to understand:
id,col1,col2,fnlsrc
123,11|22|33||||||,val1|val3|val2,val1|val3|val2|val1|val3|val2|val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7,val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7
I have tried the following code, but it adds only one set:
zipped = zip(df['col1'], df['col2'])
for s, t in zipped:
    count = int((s.count('|') + 1) / (t.count('|') + 1))
    for val in range(count):
        df['fnlsrc'] = t
As the new column is based on the other two, I would use pandas' apply() function. I defined a function that calculates the new column value from the other two columns, and it is then applied to each row:
def new_value(x):
    # Find out the number of values in both columns
    col1_numbers = x['col1'].count('|') + 1
    col2_numbers = x['col2'].count('|') + 1
    # Calculate how many times col2 should appear in the new column
    repetition = int(col1_numbers / col2_numbers)
    # Create a list of strings containing the values of the new column
    values = [x['col2']] * repetition
    # Join the list of strings with pipes
    return '|'.join(values)

# Apply the function to every row
df['fnlsrc'] = df.apply(new_value, axis=1)
df
Output:
    id                    col1                 col2                                             fnlsrc
0  123          11|22|33||||||       val1|val3|val2       val1|val3|val2|val1|val3|val2|val1|val3|val2
1  456  99||77|||88|||||||||6|  val4|val5|val6|val7  val4|val5|val6|val7|val4|val5|val6|val7|val4|v...
Full output in your input format:
id,col1,col2,fnlsrc
123,11|22|33||||||,val1|val3|val2,val1|val3|val2|val1|val3|val2|val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7,val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7
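The same repetition rule can also be written without apply(). The following is a minimal sketch, assuming both columns always hold strings and that the number of col1 values is a multiple of the number of col2 values (any remainder is dropped by the integer division, as in the answer above). The helper name repeat_to_match is hypothetical.

```python
import pandas as pd

def repeat_to_match(col1: str, col2: str) -> str:
    """Repeat the col2 string until it holds as many pipe-separated values as col1."""
    n_col1 = col1.count('|') + 1
    n_col2 = col2.count('|') + 1
    # Integer division: a remainder, if any, is silently dropped
    return '|'.join([col2] * (n_col1 // n_col2))

# Sample rows copied from the question
df = pd.DataFrame({
    'id': [123, 456],
    'col1': ['11|22|33||||||', '99||77|||88|||||||||6|'],
    'col2': ['val1|val3|val2', 'val4|val5|val6|val7'],
})

# A list comprehension over the two columns avoids a row-wise apply()
df['fnlsrc'] = [repeat_to_match(a, b) for a, b in zip(df['col1'], df['col2'])]
print(df['fnlsrc'].tolist())
```

Iterating over the two columns with zip() sidesteps building a Series for every row, which is what apply(axis=1) does, so it tends to be lighter on larger frames.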

Selecting a different column range for each row

I have a dataframe with 500K rows. It has 7 day columns (D1 to D7) plus a start day and an end day.
For each row, I search for a value (e.g. equal to 0) in the range (startDay, endDay).
For example, for id_1, startDay=1 and endDay=7, so I should search columns D1 to D7.
For id_2, startDay=4 and endDay=7, so I should search columns D4 to D7.
However, I could not manage to search a different column range for each row.
In addition to the above:
if startDay > endDay, I want to see "-999";
otherwise, I need to find the first zero within the day range. For example, for id_3 the first zero is in column D2 (day 2), and the startDay of id_3 is 1, so I want to see 2-1=1 (D2 - startDay);
if I cannot find a 0, I want to see "8".
Here is my data:
import pandas as pd

data = {
    'D1':[0,1,1,0,1,1,0,0,0,1],
    'D2':[2,0,0,1,2,2,1,2,0,4],
    'D3':[0,0,1,0,1,1,1,0,1,0],
    'D4':[3,3,3,1,3,2,3,0,3,3],
    'D5':[0,0,3,3,4,0,4,2,3,1],
    'D6':[2,1,1,0,3,2,1,2,2,1],
    'D7':[2,3,0,0,3,1,3,2,1,3],
    'startDay':[1,4,1,1,3,3,2,2,5,2],
    'endDay':[7,7,6,7,7,7,2,1,7,6]
}
data_idx = ['id_1','id_2','id_3','id_4','id_5',
            'id_6','id_7','id_8','id_9','id_10']
df = pd.DataFrame(data, index=data_idx)
What I want to see:
df_need = pd.DataFrame([0,1,1,0,8,2,8,-999,8,1], index=data_idx)
You can create a boolean array that marks, for each row, which 'Dx' columns lie at or after 'startDay', at or before 'endDay', and hold a value equal to 0. For the first two conditions, you can use np.ufunc.outer, with the ufuncs being np.less_equal and np.greater_equal:
import numpy as np
# .values passes plain NumPy arrays to ufunc.outer
arr_bool = (np.less_equal.outer(df.startDay.values, range(1, 8))     # columns Dx at or after startDay
            & np.greater_equal.outer(df.endDay.values, range(1, 8))  # columns Dx at or before endDay
            & (df.filter(regex='D[0-9]').values == 0))               # columns Dx whose value is 0
Then you can use np.argmax to find the position of the first True in each row. Adding 1 and subtracting 'startDay' gives the values you are looking for. Finally, handle the other conditions with np.select, which replaces the value with -999 if df.startDay >= df.endDay, or with 8 if the row of arr_bool contains no True:
df_need = pd.DataFrame((np.argmax(arr_bool, axis=1) + 1 - df.startDay).values,
                       index=data_idx, columns=['need'])
df_need.need = np.select(condlist=[df.startDay >= df.endDay, ~arr_bool.any(axis=1)],
                         choicelist=[-999, 8],
                         default=df_need.need)
print(df_need)
       need
id_1      0
id_2      1
id_3      1
id_4      0
id_5      8
id_6      2
id_7   -999
id_8   -999
id_9      8
id_10     1
One note: to get -999 for id_7, I used the condition df.startDay >= df.endDay in np.select rather than df.startDay > df.endDay as in your question. You can change it to the strict comparison, in which case you get 8 instead of -999 for id_7.
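To cross-check a vectorized result on a small sample, the rules from the question can be written as a plain row-wise function. This is only a sketch: the helper name first_zero_offset is hypothetical, it uses the strict startDay > endDay comparison from the question, and a row-wise apply() like this is likely too slow to be the main solution for 500K rows.

```python
import pandas as pd

# Data copied from the question
data = {
    'D1': [0, 1, 1, 0, 1, 1, 0, 0, 0, 1],
    'D2': [2, 0, 0, 1, 2, 2, 1, 2, 0, 4],
    'D3': [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    'D4': [3, 3, 3, 1, 3, 2, 3, 0, 3, 3],
    'D5': [0, 0, 3, 3, 4, 0, 4, 2, 3, 1],
    'D6': [2, 1, 1, 0, 3, 2, 1, 2, 2, 1],
    'D7': [2, 3, 0, 0, 3, 1, 3, 2, 1, 3],
    'startDay': [1, 4, 1, 1, 3, 3, 2, 2, 5, 2],
    'endDay': [7, 7, 6, 7, 7, 7, 2, 1, 7, 6],
}
data_idx = [f'id_{i}' for i in range(1, 11)]
df = pd.DataFrame(data, index=data_idx)

def first_zero_offset(row, not_found=8, invalid=-999):
    """Days between startDay and the first zero in D<startDay>..D<endDay>."""
    start, end = int(row['startDay']), int(row['endDay'])
    if start > end:  # strict comparison, as stated in the question
        return invalid
    for day in range(start, end + 1):
        if row[f'D{day}'] == 0:
            return day - start
    return not_found

need = df.apply(first_zero_offset, axis=1)
print(need.tolist())
```

With the strict comparison, id_7 (startDay equal to endDay, no zero in D2) yields 8, which matches the df_need stated in the question.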

How do you search through a row (Row 1) of a CSV file, but search through the next row (Row 2) at the same time?

Imagine a dataframe with three columns and a certain number of rows. The first column holds random values, the second column holds names, and the third column holds ages.
I want to go through every row of this dataframe and find where value 1 appears in the first column. At the same time, whenever value 1 is found, I want to know whether value 2 appears in the same column in the next row.
If so, copy the first row's Value, Name and Age into an empty dataframe. Every time this condition is met, copy that row into the empty dataframe.
EmptyDataframe = pd.DataFrame(columns=['Name', 'Age'])
csvfile = pd.DataFrame(columns=['Value', 'Name', 'Age'])
row_for_csv_dataframe = next(csvfile.iterrows())
for index, row_for_csv_dataframe in csvfile.iterrows():
    if row_for_csv_dataframe['Value'] == '1':
        # How to code this:
        # if the NEXT row after row_for_csv_dataframe finds the 'Value' == 2
        # then copy 'Age' and 'Name' from row_for_csv_dataframe into the empty DataFrame.
Assuming you have a dataframe data like this:
   Value  Name  Age
0      1  Anne   10
1      2  Bert   20
2      3  Caro   30
3      2  Dora   40
4      1  Emil   50
5      1  Flip   60
6      2  Gabi   70
You could do something like this, although it is probably not the most efficient approach:
iterator1 = data.iterrows()
iterator2 = data.iterrows()
next(iterator2)  # advance the second iterator so it is always one row ahead
for (_, current), (_, following) in zip(iterator1, iterator2):
    if current.Value == 1 and following.Value == 2:
        print(current.Value, current.Name, current.Age)
And would get this result:
1 Anne 10
1 Flip 60
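The same look-ahead can also be expressed without iterating, by comparing the Value column with a copy of itself shifted up by one row. This is a minimal sketch, assuming Value holds numbers rather than strings (the question's code compares against '1'); the name matches is introduced here for illustration.

```python
import pandas as pd

# Sample data matching the table in the answer above
data = pd.DataFrame({
    'Value': [1, 2, 3, 2, 1, 1, 2],
    'Name': ['Anne', 'Bert', 'Caro', 'Dora', 'Emil', 'Flip', 'Gabi'],
    'Age': [10, 20, 30, 40, 50, 60, 70],
})

# shift(-1) aligns each row with the Value of the row below it;
# the last row gets NaN there, so it can never match.
mask = (data['Value'] == 1) & (data['Value'].shift(-1) == 2)

# Copy the matching rows into a new dataframe
matches = data.loc[mask, ['Value', 'Name', 'Age']].copy()
print(matches)
```

If the values were read from a CSV as text, the comparisons would need to be against '1' and '2' instead, or the column would need converting first.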

Assign values on a large scale in a database: formula (LibreOffice Calc)

I am trying to create a database in the LibreOffice spreadsheet application. I need the first column to hold IDs, with each ID filling 100 cells. With 2000 IDs of 100 cells each, that makes 200 000 cells (ID values = 1 to 2000).
row#1 : row#100 = Id#1 // row#101 : row#200 = Id#2 ....// row#199901 : row#200000 = Id#2000
What I simply want is to assign the value 1 to the first 100 cells in the first column, the value 2 to the next 100 cells in the same column, and so on, until I have the 2000 IDs in the first column.
So I would like a formula that achieves this without having to select and scroll through the sheet manually 2000 times.
Thanks
If the IDs are in column A:
=QUOTIENT(ROW(A1)-1;100)+1
The formula takes the integer part of (row number - 1) divided by 100 and adds 1, so rows 1 to 100 give 1, rows 101 to 200 give 2, and so on.
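The block arithmetic the question asks for can be checked outside the spreadsheet. Below is a minimal Python sketch, assuming 1-based row numbers and 100 rows per ID; id_for_row is a hypothetical helper name. Subtracting 1 before the integer division is what keeps row 100 in the first block.

```python
def id_for_row(row: int, block_size: int = 100) -> int:
    """ID for a 1-based spreadsheet row when each ID fills block_size rows."""
    return (row - 1) // block_size + 1

# Check the block boundaries named in the question
for row in (1, 100, 101, 200, 199901, 200000):
    print(row, id_for_row(row))
```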
Apply with a loop?
Public Sub test()
    Dim i As Long
    For i = 1 To 2000
        Range("A1:A100").Offset((i - 1) * 100, 0) = i
    Next
End Sub

Java Apache POI reading empty column after row data column

I have 100 rows and 10 columns. I am reading the rows one by one, and the columns of each row as well. After the 10th column, the cursor does not move to the next row.
Instead it continues to columns 11, 12, 13 and so on. Could anyone help me move to the next row once the 10th column has been read, and stop reading the empty 11th column?
Here is some code:
while (rowIterator.hasNext()) {
    row = rowIterator.next();
    while (cellIterator.hasNext()) {
        cell = cellIterator.next();
        if (cell.getColumnIndex() == 0) { }
        .....
        if (cell.getColumnIndex() == 10) { }
    }
}
First, although this will not necessarily fix your problem, you should use the for-each syntax to iterate over rows and cells. Then, once you get past column 10, you can simply break out of the inner loop like so:
for (Row row : sheet) {
    for (Cell cell : row) {
        ...
        if (cell.getColumnIndex() >= 10) break;
    }
}
This is documented in the POI Quick Guide here: https://poi.apache.org/spreadsheet/quick-guide.html#Iterator
NOTE: I am breaking when column index is 10 or greater (that would be the 11th column as the indexes are 0 based). I mention this only because your code example is using column indexes 0 - 10, but your text says that there are only 10 valid columns.
If you want to skip columns, you can use:
while (cellsInRow.hasNext()) {
    Cell currentCell = cellsInRow.next();
    if (currentCell.getColumnIndex() >= 3 && currentCell.getColumnIndex() <= 6) {
        continue;
    }
}
This skips columns 4, 5, 6 and 7 of the Excel file, since the index starts at 0.
