Extracting Numbers with Commas from String - python-3.x

I want to extract the number of square meters from a house description column. For example, I used:
df['description'].str.extract(r'(\d\,\d{1,3}\s?[sS])', expand=True)
to extract 3000 from a string that looks like "The house is 3,000 square meters".
What if I also wanted to extract 800 from "800 square meters", that is, a number with no comma? How would I add that to the pattern? I looked around and still couldn't figure it out.

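You can use str.findall. Note the escaped dot (\.), so that only a literal decimal point is matched rather than any character:
import pandas as pd
s=pd.Series(['llll llll llll 100,000.00 lll lll ll ','xyz 800 bgm bhd','80','1,000.00 and 10'])
s.str.findall(r'(?:[,\d]+\.?\d*)')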
0 [100,000.00]
1 [800]
2 [80]
3 [1,000.00, 10]
dtype: object
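To go from the matched text to a usable number, the sketch below uses the standard-library re module so it runs without pandas. The pattern, the name AREA_RE and the helper extract_area are illustrative assumptions, not part of the original answers. It accepts either a comma-grouped number or a plain run of digits, requires a following "s" or "S" as in the question's pattern, and strips the commas before converting. The same pattern string should also be usable with Series.str.extract.

```python
import re

# Hypothetical pattern: a comma-grouped number (3,000) or plain digits (800),
# followed by an optional space and "s"/"S" as in "square" or "sq".
AREA_RE = re.compile(r'(\d{1,3}(?:,\d{3})+|\d+)\s?[sS]')

def extract_area(description):
    """Return the first area figure as an int, or None if there is none."""
    match = AREA_RE.search(description)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(1).replace(',', ''))

print(extract_area("The house is 3,000 square meters"))
print(extract_area("A flat of 800 square meters"))
print(extract_area("No size given"))
```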

Related

How to swap characters around inside a column in EXCEL?

Specifically, I know ahead of time I only need to swap position 1 and 2 with 4 and 5.
2 Examples:
HEART
New output:
RTAHE
12734
New output:
34712
There are probably more than a handful of ways to do this. If you're interested in a formula, here is one way to go about it:
=RIGHT(A3,2)&MID(A3,3,LEN(A3)-4)&LEFT(A3,2)
It seems to work on some test data I threw together, as long as every value is exactly 5 characters long.
A more robust version, as suggested by @Rafalon, which also handles shorter and longer values:
=MID(A3,4,2)&MID(A3,3,1)&LEFT(A3,2)&MID(A3,6,LEN(A3))
It produces the following results:
Input      Output
1          1
12         12
123        312
1234       4312
12345      45312
123456     453126
1234567    4531267
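For comparison, the same rearrangement as the second formula can be sketched with Python string slicing. The function name swap_pairs is made up for illustration. Like MID and LEFT in Excel, slices that run past the end of the string simply return what is available.

```python
def swap_pairs(text):
    """Swap characters 1-2 with characters 4-5, keeping everything else in place."""
    # Equivalent to MID(A3,4,2) & MID(A3,3,1) & LEFT(A3,2) & MID(A3,6,LEN(A3))
    return text[3:5] + text[2:3] + text[:2] + text[5:]

for value in ["HEART", "12734", "123", "1234567"]:
    print(value, "->", swap_pairs(value))
```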

calculate sumproduct(Excel) in pandas dataframe based on condition

I have a dataframe with the following structure:
SEQ product_name prod_cost non-prd_cost mgmt grand_total
1 prod1 100 200 20 320
2 prod2 200 400 30 630
3 prod3 300 500 40 840
4 prod4 100 300 50 450
I want to calculate a SUMPRODUCT (as in Excel) based on a condition. The condition is based on product_name.
Let's say I want to calculate a variable called
sumprod_prod1_prd_prod3_mgmt = SUMPRODUCT(SEQ 1-4,product_name='prod1'_prod_cost and 'prod3'_mgmt)/2 = 100+40=140
How can I do this in pandas?
I am a bit confused by your question: the Excel SUMPRODUCT function returns the sum of the products of corresponding ranges or arrays, whereas you seem to want the sum of one specific combination of values.
To get the desired value:
sumprod_prod1_prd_prod3_mgmt = df[df['product_name'] == 'prod1']['prod_cost'].values[0] + df[df['product_name'] == 'prod3']['mgmt'].values[0]
This solution gives a single result for the specified values. If you need a solution which provides the same functionality as excel, please update your question and example to better define what you are looking for.
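The difference between the two calculations can be sketched in plain Python, using the rows from the question. The helper names sumproduct and lookup are assumptions made for illustration, and plain lists of dicts stand in for the dataframe so the example runs without pandas.

```python
# Rows copied from the question's table (only the columns used here).
rows = [
    {"product_name": "prod1", "prod_cost": 100, "mgmt": 20},
    {"product_name": "prod2", "prod_cost": 200, "mgmt": 30},
    {"product_name": "prod3", "prod_cost": 300, "mgmt": 40},
    {"product_name": "prod4", "prod_cost": 100, "mgmt": 50},
]

def sumproduct(xs, ys):
    """What Excel's SUMPRODUCT does: multiply pairwise, then sum."""
    return sum(x * y for x, y in zip(xs, ys))

def lookup(rows, product, column):
    """Return one cell: the given column for the first matching product."""
    for row in rows:
        if row["product_name"] == product:
            return row[column]
    raise KeyError(product)

# True SUMPRODUCT of the two columns over all four rows.
full = sumproduct([r["prod_cost"] for r in rows], [r["mgmt"] for r in rows])

# What the question actually asks for: prod1's prod_cost plus prod3's mgmt.
wanted = lookup(rows, "prod1", "prod_cost") + lookup(rows, "prod3", "mgmt")

print(full, wanted)
```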

Insert a 0 (zero) in front of numbers for specific number sequences

I have this data
Name | Code | Price
XXX 102 1000
YYY 4321 1150
ZZZ 202 1150
AAA 123 1000
I can already use CONCATENATE to add a 0 in front of Code, which gives:
0102
04321
0202
0123
Here is where the problem lies. I don't want that 0 in front of 4321. I want a 0 only in front of 3-digit numbers, not numbers with more than 3 digits.
Right-click the column, go to Format Cells --> Custom, type 0000 in the Type box and click OK.
This is the simplest and easiest solution.
Assuming the '102' data is located at B2, just type:
=IF(LEN(B2)<=3,"0"&B2,B2)
and that will do. Alternatively, using the CONCATENATE() function, you may do it like this:
=IF(LEN(B2)<=3,CONCATENATE("0",B2),B2)
Assuming you have the codes in column B:
=IF(LEN(B2)=3,CONCATENATE("0",B2),B2)
If you want to write a formula, then this would be better:
=REPT(0,4-LEN(A1))&A1
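The REPT approach pads every code to a fixed width of four characters, so 3-digit codes gain one zero and 4-digit codes are left alone. A minimal Python sketch of the same idea, using the built-in str.zfill (the helper name pad_code is made up for illustration):

```python
def pad_code(code, width=4):
    """Left-pad a code with zeros up to `width` characters; longer codes are unchanged."""
    return str(code).zfill(width)

codes = [102, 4321, 202, 123]
padded = [pad_code(c) for c in codes]
print(padded)
```

Note that, like the REPT formula, this also pads 1-digit and 2-digit codes up to four characters, which differs from the IF(LEN(B2)=3, ...) version.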

SAS 9.4 Character functions - Why might one cell return unexpected results?

I am manipulating some string variables and certain cells are returning unexpected values with substring and length functions. These cells hold character-formatted dates, as I need to do a few manipulations before converting them to SAS dates, because of the nature of the Excel file they're coming from. Here is an example:
HAVE:
Obs _orig
1 4/3
2 12/16
3 1/13
4 6/2
5 3/10
6 5/4
7 10/14
WANT:
Obs _orig _length _sub_1_2
1 4/3 3 4/
2 12/16 5 12
3 1/13 4 1/
4 6/2 3 6/
5 3/10 4 3/
6 5/4 3 5/
7 10/14 5 10
I am using this code:
data want;
set have;
_strip=strip(_orig);
_sub_1_2=substr(_strip,1,2);
_length=length(_strip);
run;
This is what I get. The discrepancies are in observations 1 and 6.
Obs _orig _length _sub_1_2
1 4/3 5
2 12/16 5 12
3 1/13 4 1/
4 6/2 3 6/
5 3/10 4 3/
6 5/4 5
7 10/14 5 10
Both are cases where SAS calculates a length of 5 when the length should be 3. In both cases, the value of the substring-derived variable is blank altogether. The results are the same if I use compress(), trim(), or trimn() in my code rather than strip(). Thank you for any help you can provide.
It sounds like unprintable characters may have gotten into your data. If you write _orig to the log with the $HEX format (put _orig $hex10.;), what do you see? For a clean value of 4/3 stored in a $5 variable it should be 342F332020:
data want;
length orig $5;
orig='4/3';
len=length(orig);
put orig= len=;
put orig $hex10.;
run;
orig=4/3 len=3
342F332020
To get rid of non-printable characters, you could try:
_strip=compress(_orig,,'kw');
It seems pretty clear to me that your variables have leading spaces or other leading characters that look like spaces on the screen. So for OBS=6 the value of the string is more like "  5/4", which has a length of 5 and whose first two characters both look like spaces. If LENGTHN() of your new _sub_1_2 variable is not 0, then it contains some non-printing character. Perhaps something like 'A0'x, which some web pages use as a non-breaking space, or a tab character ('09'x).
I suspect that you don't want the first two characters, but instead want the first word when using / as the delimiter. You can use the LEFT() or STRIP() function to remove the leading blanks. Or COMPRESS() to remove other junk. So you might use COMPRESS() with the k and d modifiers to only keep the digits and slashes.
data want;
set have;
length first $5 ;
first = scan(compress(_orig,'/','kd'),1,'/');
run;
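The diagnosis and the fix can be illustrated outside SAS as well. The Python sketch below is only an analogy and assumes a hypothetical raw value with two leading non-breaking spaces (byte A0); the actual characters in the asker's data are not known. It prints a hex view comparable to the $HEX output, keeps only digits and slashes as COMPRESS with the k and d modifiers does, and takes the first slash-delimited word as SCAN does.

```python
import re

# Hypothetical raw value: two non-breaking spaces (assumed), then "5/4".
raw = "\u00a0\u00a05/4"

# Hex view of the bytes, similar in spirit to PUT _orig $hex10.;
hex_view = raw.encode("latin-1").hex().upper()

# Keep only digits and slashes, like COMPRESS(_orig, '/', 'kd').
cleaned = re.sub(r"[^0-9/]", "", raw)

# First word with "/" as the delimiter, like SCAN(..., 1, '/').
first = cleaned.split("/")[0]

print(len(raw), hex_view, cleaned, first)
```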

Problem with my formatting (Python)

I am trying to format this line:
print("{0:5} {1:5} {2:5} £{3:5} £{4:5}".format(GTIN,product,Quantity,indprice,finprice,))
However, the output has the pound sign separated from the 20:
46389121 chicken 2 £10.00 £ 20
I want there to be 5 spaces between each variable, like this:
46389121 chicken 2 £10.00 £20
Can someone spot my stupid mistake in my formatting?
EDIT:
print("{} {} {} £{} £{}".format(GTIN,product,Quantity,indprice,finprice))
346389121 chicken 345435435 £10.00 £3454354350
46389121 chicken 2 £10.00 £20
46389121 chicken 2 £10.00 £20
46389121 chicken 23213213 £10.00 £232132130
When I try much bigger numbers, they do not align.
Just put the spaces in the format string:
"{}     {}     {}     £{}     £{}".format(GTIN,product,Quantity,indprice,finprice)
You should put a "<" sign before the width. By default Python uses ">" as the alignment for numbers, which adds spaces (or the fill character) to the left of the number. The "<" symbol adds the padding to the right of the number, which is what you need:
print("{0:<5} {1:<5} {2:<5} £{3:<5} £{4:<5}".format(1, 2, 3, 5, 20))
>> 1     2     3     £5     £20
print("{0:5} {1:5} {2:5} £{3:5} £{4:5}".format(1, 2, 3, 5, 20))
>>     1     2     3 £    5 £   20
or format your individual strings first, then str.join them:
GTIN = 46389121
product = 'chicken'
Quantity = 2
indprice = 10.00
finprice = 20.00
strgs = [str(GTIN), product, str(Quantity), '£{:.2f}'.format(indprice),
         '£{:.2f}'.format(finprice)]
print((5*' ').join(strgs))
that way you can easily change the number of spaces between the individual strings.
Note that a width spec such as '{:5}' may mess up your desired formatting. Here 5 is the minimum width reserved for the value: if the value is shorter, it is padded and you get extra spaces between your data; if it is longer, nothing is added, which is fine in your case.
You could also construct the format string first and then fill it (which amounts to the same as Patrick Haugh's answer):
fmt = (5*' ').join(('{}', '{}', '{}', '£{:.2f}', '£{:.2f}'))
print(fmt.format(GTIN,product,Quantity,indprice,finprice))
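A fixed gap of five spaces keeps the fields separated, but columns only line up across rows if every field in a column is padded to the same width. One way to do that, sketched here with sample rows based on the values in the question, is to measure the longest entry in each column first. The function name format_rows and the gap parameter are illustrative assumptions.

```python
# Sample rows (GTIN, product, quantity, unit price, final price), based on the question.
rows = [
    (46389121, "chicken", 2, 10.00, 20.00),
    (346389121, "chicken", 345435435, 10.00, 3454354350.00),
]

def format_rows(rows, gap=5):
    """Return one aligned text line per row, with `gap` spaces between columns."""
    # Turn every field into text first, so widths can be measured.
    cells = [
        [str(gtin), product, str(qty), "£{:.2f}".format(ind), "£{:.2f}".format(fin)]
        for gtin, product, qty, ind, fin in rows
    ]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in cells) for i in range(5)]
    sep = " " * gap
    return [
        sep.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in cells
    ]

lines = format_rows(rows)
for line in lines:
    print(line)
```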
