How to snip part of a rows' data and only leave the first 3 digits in Python - python-3.x

0 546/001441
1 540/001495
2 544/000796
3 544/000797
4 544/000798
I have a column in my dataframe that I've provided above. It can have any number of rows depending on the data being crunched. It is one of many columns but the first three numbers match another columns data. I need to cut off everything after the first 3 numbers in order to append it to another dataframe based off of the similar values. Any ideas as to how to get only the first 3 numbers and cut off the remaining 8 values?
So far I've got the whole column singled out as a Series and also as a numpy.array in order to try to convert it to a str instead of an object.
I know this is getting me closer to an answer but i can't seem to figure out how to cut out the unnecessary values
testcut=dfwhynot[0][:3]
this cuts the string where i need it, but how do i do this for the whole column is what i can't figure out.

Assuming the name of your column is col, you can
# Split the column into two
df['col'] = df['col'].apply(lambda row: row.split('/'))
df[['col1', 'col2']] = pd.DataFrame(df_out.values.tolist())
col1 col2
0 546 001441
1 540 001495
2 544 000796
3 544 000797
4 544 000798
then get the minimal element of each col1 group
df.groupby('col1').min().reset_index()
resulting in
col1 col2
0 540 001495
1 544 000796
2 546 001441

Related

subtracting values between columns both ways

I need to find those values that are in column 1 and not in column 2 and vica versa. It can look like this: take fist row in the first column and look if there is same number in the second column if so then on the third column write 0 (substraction) and if there won't be the same number then write searched number or error, doesn't matter. This should work both ways (some numbers can be in col2 but not in col1, those i need to find aswell). So probably there would be 2 formulas in 2 columns. one searching from col1 to col2, and same for col2 to col1. And if there for example in col1 would be twice some value and in col2 just once, than it should show for the first number 0 and for second number error or searched number.
Dataset looks like this:
Col1
Col2.
42646
55
42646
77
33
25
77
Col3
Col4
0
55
0
0
33(or error,NA etc)
25
0
I have tried vlook up, but wasn't sucesfull.
I guess this is what you are looking for. You can use for Col3:
=IF(A2:A6="", "",IF(ISNA(XMATCH(A2:A6,B2:B6)),A2:A6,0))
and for Col4:
=IF(B2:B6="", "", IF(ISNA(XMATCH(B2:B6,A2:A6)),B2:B6,0))
Both formulas returns 0 if the value was found (including blanks), otherwise the missing value.
You can put all together using HSTACK:
= HSTACK(IF(A2:A6="", "",IF(ISNA(XMATCH(A2:A6,B2:B6)),A2:A6,0)),
IF(B2:B6="", "", IF(ISNA(XMATCH(B2:B6,A2:A6)),B2:B6,0)))
Or using LET to avoid repetitions.
= LET(A, A2:A6, B, B2:B6, HSTACK(IF(A="","",IF(ISNA(XMATCH(A,B)),A,0)),
IF(B="", "", IF(ISNA(XMATCH(B,A)),B,0))))
Here is the output:
You can use XLOOKUP too, but the formula is longer, because the first three input arguments are required:
=IF(ISNA(XLOOKUP(A2:A6,B2:B6, A2:A6)),A2:A6,0)
A shame you haven't added a sample in your data that would show what you meant with:
"And if there for example in col1 would be twice some value and in col2 just once, than it should show for the first number 0 and for second number error or searched number."
Your requirements make this a little tricky, but try:
Formula in C1:
=IF(A1="","",IF(COUNTIF(B:B,A1)-COUNTIF(A$1:A1,A1)<0,A1,0))
Formula in D1:
=IF(B1="","",IF(COUNTIF(A:A,B1)-COUNTIF(B$1:B1,B1)<0,B1,0))

Csv file split comma separated values into separate rows and dividing the corresponding dollar amount by the number of comma separated values in panda

beginner here!
I have a csv file with comma separated values. I want to split each comma separated value in different rows in pandas. However, the corresponding dollar amounts should be divided by the number of comma separated values in each cell and export the result in a different csv file.
the csv table and the desired output table
I have used df.explode(IDs) but couldn’t figure out how to divide the Dollar_Amount by the number of IDs in the corresponding cells.
import pandas as pd
in_csv = pd.read_csv(‘inputCSV.csv’)
new_csv = df.explode(‘IDs’)
new_csv.to_csv(‘outputCSV.csv’)
You can divide the dollar amount by the number of ids in each row before using explode. This can be done as follows:
# Preprocessing
df['Dollar_Amount'] = df['Dollar_Amount'].str[1:].str.replace(',', '').astype(float)
df['IDs'] = df['IDs'].str.split(",")
# Compute the new dollar amount and explode
df['Dollar_Amount'] = df['Dollar_Amount'] / df['IDs'].str.len()
df = df.explode('IDs')
# Postprocessing
df['Dollar_Amount'] = df['Dollar_Amount'].round(2).apply(lambda x: '${0:,.2f}'.format(x))
With an example input:
IDs Dollar_Amount A
0 1,2,3,4 $100,000.00 4
1 5,6,7 $50,000.00 3
2 9 $20,000.00 1
3 10,11 $20,000.00 2
The result is as follows:
IDs Dollar_Amount A
0 1 $25,000.00 4
0 2 $25,000.00 4
0 3 $25,000.00 4
0 4 $25,000.00 4
1 5 $16,666.67 3
1 6 $16,666.67 3
1 7 $16,666.67 3
2 9 $20,000.00 1
3 10 $10,000.00 2
3 11 $10,000.00 2
There will be a one line way to do this with a lambda function (if you are new, read up on lambda functions!) but as a slightly less new beginner, I think its easier to think about this as two separate operations.
Operation 1 - get the count of ids, Operation 2 - do the division
If you take a look here https://towardsdatascience.com/count-occurrences-of-a-value-pandas-e5dad02303e9 you'll get a good lesson on how to do the group by you need to get the count of ids and join it back to your data frame. I'd read that because its a much more detailed explainer, but if you want a simple line of code consider this Pandas, how to count the occurance within grouped dataframe and create new column?
Once you have it, the divison is as simple as df['new_col'] = df['col1']/df['col2']

MS excel calculate sum by conditions

I have a table:
col1
col2
134
1
432
2
222
3
21
4
982
5
1352
8
111
9
I need to find all possible sum combinations of col1 values IF col2 sum is 10. (5+4+1, 2+3+5, etc.) & number of terms is 3
Please advice how to solve this task?
To get all unique possible sums based on a give count of items in col2 and sum of col2 is a specific amount, with ms365, try:
Formula in D1:
=LET(inp,B1:B7,cnt,3,sm,10,B,COUNTA(inp),A,MAKEARRAY(B,cnt,LAMBDA(r,c,INDEX(inp,r,1))),D,B^cnt,E,UNIQUE(MAKEARRAY(D,cnt,LAMBDA(rw,cl,INDEX(IF(A="","",A),MOD(CEILING(rw/(D/(B^cl)),1)-1,B)+1,cl)))),F,FILTER(E,MMULT(--(E<>""),SEQUENCE(cnt,,,0))=cnt),G,FILTER(F,BYROW(F,LAMBDA(a,(SUM(a)=sm)*(COUNT(UNIQUE(a,1))=cnt)))),UNIQUE(BYROW(G,LAMBDA(a,SUM(XLOOKUP(a,inp7,A1:A7))))))
You can now change parameters cnt and sm to whichever amount you like.
The trick is borrowed from my answer here to calculate all permutations first. Instead of a range, the single column input is extended using MAKEARRAY().
A short visual representation of what is happening:
Expand the given array based on cnt;
Create a complete list of all possible permutations;
Filter step 2 based on a sum per row and unique numbers (don't use values from col2 more than once);
Lookup each value per row to create a total sum per row;
Return only the unique summations as per screenshot at the top.

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops are performing correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet EXCEPT the values are for the final column of the source spreadsheet rather than values related to each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
#listname = i[4:] + '_norm'
df2 = pd.read_excel(i,header=0,index_col=None, skip_blank_lines=True)
df3 = df2.dropna(axis=0, how='any')
cols = []
for column in df3:
cols.append(column)
for x in cols:
listname = x + ' norm'
maxval = df3[x].max()
print(maxval)
mylist = []
for j in df3[x]:
findNL = (j/maxval)
mylist.append(findNL)
df3[listname] = mylist
saveloc = 'E:/test/'
filename = i[:-18] + '_Normalised.xlsx'
df3.to_excel(saveloc+filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet and renamed according to (listname). The data in each one of these new columns is identical and relates to the final column in the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet, not outputting for each column), and adding it to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You dont need a for loop for this. You can simplify your code:
# Make example dataframe, this is not provided
df = pd.DataFrame({'col1':[1, 2, 3, 4],
'col2':[5, 6, 7, 8]})
print(df)
col1 col2
0 1 5
1 2 6
2 3 7
3 4 8
Now we can use DataFrame.apply and use add_suffix to give the new columns _norm suffix and after that concat the columns to one final dataframe
df_conc = pd.concat([df, df.apply(lambda x: x/x.max()).add_suffix('_norm')],axis=1)
print(df_conc)
col1 col2 col1_norm col2_norm
0 1 5 0.25 0.625
1 2 6 0.50 0.750
2 3 7 0.75 0.875
3 4 8 1.00 1.000
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't notable.
Thanks for your help #Erfan

How to match two sets of data by dates which do not synchronise and include missing values in Excel

Please forgive any errors or shortcomings in this question, it's my first on stackoverflow.
I have two sets of data in Excel of differing lengths and frequency, and would like to be able to place a value of 0 for where they don't synchronise, and match the rest.
For example, dataset 1 could be:
Date Set1
01-01-2010 10
01-03-2010 4
01-04-2010 8
01-05-2010 5
01-06-2010 10
01-09-2010 12
01-10-2010 9
01-11-2010 4
And dataset 2 could be:
Date Set2
01-03-2010 102
01-06-2010 104
01-10-2010 102
I'm looking for an output table that displays the values alongside each other for dates matching, 0 otherwise, like so:
Date Set1 Set2
01-01-2010 10 0
01-03-2010 4 102
01-04-2010 8 0
01-05-2010 5 0
01-06-2010 10 104
01-09-2010 12 0
01-10-2010 9 102
01-11-2010 4 0
I can't seem to be able to crack this with my limited knowledge and the lack of synchronisation in the data. Any help would be much appreciated, thanks.
You can do this using a VLOOKUP nested in an IFERROR statement.
The two equations used (and dragged down to last unique date row) are:
H3 = IFERROR(VLOOKUP(G3,A:B,2,0),0)) & I3 = IFERROR(VLOOKUP(G3,D:E,2,0),0))
This will not work if you have duplicate dates in the same data set with varying values since VLOOKUP will always return the first matched value (reading top down).
Place Set1 in A1:B9 (header in row 1). Add a column of zeros next to it in column C, so A2:A9 is dates, B2:B9 is values and C2:C9 is zeros.
Place Set2 (without the header) in A10:B12; move the Set2 data to column C and put zeros in column B, so A10:A12 is dates, B10:B12 is zeros, C10:C12 is values.
Sort the range A2:C12 by Date (column A).
Easier to show with a screenshot but newbies are not allowed to post images.

Resources