I have some data that needs to be converted to strings. Example:
[ABCGHDEF-12345, ABCDKJEF-123235,...]
The example above is not a constant or a string by itself but comes from an Excel sheet (ranging up to 30+ items per row). I want to convert these to strings. Since the data is undefined, explicitly converting it doesn't work. Is there a way to do this iteratively without manually placing double/single quotes around each element?
What I want finally:
["ABCGHDEF-12345", "ABCDKJEF-123235",...]
To convert the string to a list of strings you can try:
s = "[ABCGHDEF-12345, ABCDKJEF-123235]"
s = s.strip("[]").split(", ")
print(s)
Prints:
['ABCGHDEF-12345', 'ABCDKJEF-123235']
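If the row is already a Python list of arbitrary cell values (as an Excel reader would hand it back) rather than one bracketed string, a list comprehension with str() converts every element without any manual quoting - a minimal sketch with hypothetical values:

```python
# Hypothetical row values as they might come out of an Excel reader;
# the last cell is intentionally a number to show mixed types
row = ['ABCGHDEF-12345', 'ABCDKJEF-123235', 12345]

# str() handles any cell type, so no manual quoting is needed
as_strings = [str(x) for x in row]
print(as_strings)  # → ['ABCGHDEF-12345', 'ABCDKJEF-123235', '12345']
```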
I have a data frame (df).
The data frame contains a string column called supported_cpu.
The supported_cpu data is a comma-separated string.
I want to use this data for the ML model.
I had to get the unique values for the column supported_cpu. The output is a list of unique values.
def pars_string(df, col):
    # Split each distinct value of the column on the comma delimiter
    data = df[col].value_counts().reset_index()
    data['index'] = data['index'].str.split(",")
    # Collect every item from the split lists into one flat list
    df_01 = []
    for i in range(data.shape[0]):
        for j in data['index'][i]:
            df_01.append(j)
    # Get the unique values
    list_01 = list(set(df_01))
    # Strip leading/trailing spaces so duplicates collapse properly
    list_02 = [x.strip() for x in list_01]
    # Get the unique values again after stripping
    list_03 = list(set(list_02))
    return list_03
supported_cpu_list = pars_string(df=df,col='supported_cpu')
The output:
I want to map this output to the data frame to encode it for the ML model.
How could I store the output in the data frame? Note: some rows have multiple values (more than one CPU).
Input: string type separated by a comma
Output: I did not know what it should be.
I really recommend that anyone who's starting to use pandas read about vectorization and thinking in terms of columns (aka Series). That is the way pandas was built and the way it is supposed to be used.
From what I understand (I may be wrong), you want to get the unique values from the supported_cpu column. You can use the Series string methods to split that particular column, then flatten the resulting lists using itertools.chain:
from itertools import chain
df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')
unique_vals = set(chain(*df['supported_cpu'].tolist()))
unique_vals = (item for item in unique_vals if item)
Multi-values in some rows should be parsed into single values for later ML model training. The list can be converted to a dataframe simply with pd.DataFrame(supported_cpu_list).
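One common way to encode such a column for an ML model is Series.str.get_dummies, which creates one 0/1 indicator column per unique CPU and handles multi-value rows - a sketch with made-up CPU names standing in for the real supported_cpu data:

```python
import pandas as pd

# Made-up data standing in for the supported_cpu column
df = pd.DataFrame({'supported_cpu': ['i3, i5', 'i5', 'i3, i7']})

# Drop spaces, then build one indicator column per unique CPU
encoded = df['supported_cpu'].str.replace(' ', '').str.get_dummies(sep=',')
df = df.join(encoded)
print(df)
```

The joined indicator columns can then be fed to the model directly.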
I am trying to scrape some data from an apartment listing site.
I want to use the price in calculations, so I need to store it as a number. But it's written as text on the website, like this: 5 670 money/month
I want to remove all the characters and spaces, then make it an integer to save in my db.
I tried a regular expression, but get this error:
TypeError: expected string or bytes-like object
This is the element I collect the price from:
<p class="info-price">399 euro per month</p>
I get the price with xpath like this:
p = response.xpath('//p[@class="info-price"]/text()').extract()
And the output when I collect the name of the object and the price looks like this:
{'object': ['North West End 24'], 'price': ['399\xa0euro\xa0per\xa0month']}
How and when should I convert it?
So I found a solution. Maybe it's a dirty solution and someone will come along with an elegant one-liner.
But as I understand it, the text I scrape with this line
p = response.xpath('//p[@class="info-price"]/text()').extract()
is a list object.
So I add a line to convert it to a string with this code:
p = ''.join(map(str, p)) #Convert to string from list object
And finally to remove all space and text, so I end up with just the price in numbers I use this code
p = re.sub(r'\D', '', p) #Remove all but numbers
So, all in all, this snippet takes the text of the price, converts it to a string, and then removes everything but the numbers.
p = response.xpath('//p[@class="info-price"]/text()').extract()
p = ''.join(map(str, p)) #Convert to string from list object
p = re.sub(r'\D', '', p) #Remove all but numbers
What the .extract() method does is find all occurrences of your xpath expression; that's why it returns a list - there might be more than one result. If you know there's only one result, or you only care about the first one, use .extract_first() instead - it will return the first result as a string (or None, if no match is found), so you don't have to convert the list to a string. (See https://docs.scrapy.org/en/latest/topics/selectors.html#id1)
p = response.xpath('//p[@class="info-price"]/text()').extract_first()
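The cleanup after extraction can be sketched without Scrapy, assuming the extracted text looks like the value shown in the question:

```python
import re

# Hypothetical text as .extract_first() would return it
p = '399\xa0euro\xa0per\xa0month'

# Strip everything that is not a digit, then convert to an integer
price = int(re.sub(r'\D', '', p))
print(price)  # → 399
```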
I am attempting to save an array to a txt file. I used the following function:
numpy.savetxt('C:/Users/Adminstrator/Desktop/mesh/ELIST2-clean.txt',array2DClean, delimiter='\t')
The data in file are shown as following:
1.300000000000000000e+01 2.710000000000000000e+02 2.360000000000000000e+02 7.200000000000000000e+01 2.350000000000000000e+02
2.400000000000000000e+01 2.760000000000000000e+02 2.060000000000000000e+02 1.310000000000000000e+02 1.300000000000000000e+02
3.200000000000000000e+01 2.580000000000000000e+02 2.820000000000000000e+02 2.570000000000000000e+02 5.000000000000000000e+01
3.600000000000000000e+01 2.800000000000000000e+02 5.100000000000000000e+01 5.000000000000000000e+01 1.030000000000000000e+02
3.900000000000000000e+01 2.800000000000000000e+02 2.250000000000000000e+02 1.120000000000000000e+02 1.110000000000000000e+02
4.300000000000000000e+01 2.810000000000000000e+02 1.630000000000000000e+02 2.200000000000000000e+01 1.640000000000000000e+02
4.900000000000000000e+01 2.850000000000000000e+02 1.150000000000000000e+02 1.600000000000000000e+02 1.610000000000000000e+02
How can I format the numbers written to the file as whole integers without xe+y notation?
numpy.savetxt has an argument fmt that takes the number format:
fmt : str or sequence of strs, optional
A single format (%10.5f), a sequence of formats, or a multi-format string, e.g. 'Iteration %d - %10.5f', in which case delimiter is ignored. For complex X, the legal options for fmt are:
a single specifier, fmt='%.4e', resulting in numbers formatted like '(%s+%sj)' % (fmt, fmt)
a full string specifying every real and imaginary part, e.g. ' %.4e %+.4ej %.4e %+.4ej %.4e %+.4ej' for 3 columns
a list of specifiers, one per column - in this case, the real and imaginary part must have separate specifiers, e.g. ['%.3e + %.3ej', '(%.15e%+.15ej)'] for 2 columns
Use numpy.savetxt('C:/Users/Adminstrator/Desktop/mesh/ELIST2-clean.txt', array2DClean, delimiter='\t', fmt='%.0f') to apply a format string that writes the floats in fixed-point notation with zero decimal places.
More info here: https://docs.python.org/3/library/string.html#format-specification-mini-language
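A minimal, self-contained sketch of the effect of fmt, writing to an in-memory buffer instead of the file path above:

```python
import io
import numpy as np

arr = np.array([[13.0, 271.0], [24.0, 276.0]])

# fmt='%.0f' writes each float in fixed-point notation, zero decimals
buf = io.StringIO()
np.savetxt(buf, arr, delimiter='\t', fmt='%.0f')
print(buf.getvalue())
```

With fmt='%d' the values are written as integers, which gives the same output for whole-number floats like these.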
I am trying to convert all items in my dataframe to floats. The types vary at the moment. The following error persists -> ValueError: could not convert string to float: '116,584.54'
The file can be found at https://www.imf.org/external/pubs/ft/weo/2019/01/weodata/WEOApr2019all.xls
I checked the value in Excel; it is a Number. I tried .replace, .astype, and pd.to_numeric.
for i in weo['1980']:
    if i == float:
        print(i)
        i.replace(",", '')
        i.replace("--", np.nan)
    else:
        continue
Also, I have tried:
weo['1980'] = weo['1980'].apply(pd.to_numeric)
You can try using DataFrame.astype to do the conversion, which is usually the recommended approach. As you already attempted in your question, you may have to remove all the commas from the strings in column 1980 first, as they cause the error quoted in your question:
weo['1980'] = weo['1980'].str.replace(',', '')
weo['1980'] = weo['1980'].astype(float)
If you're reading your DataFrame from Excel using pandas.read_excel, you can also specify the thousands argument to do this conversion for you which will likely result in a higher performance:
pandas.read_excel(file, thousands=',')
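A small, self-contained sketch of the cleanup, with made-up values standing in for the WEO column (including a '--' placeholder like the one the question's loop tries to replace):

```python
import pandas as pd

# Made-up values standing in for the WEO '1980' column
weo = pd.DataFrame({'1980': ['116,584.54', '1,234.00', '--']})

# Strip the thousands separators, then convert; errors='coerce'
# turns anything unparseable (like '--') into NaN
weo['1980'] = pd.to_numeric(weo['1980'].str.replace(',', ''), errors='coerce')
print(weo['1980'].tolist())
```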
I had type errors all the time while playing with dataframes. I now always use this to convert all the values that can be converted into floats.
# Convert all columns that can be converted into float into float.
# Error were raised because their type was Object
df = df.apply(pd.to_numeric, errors='ignore')
I want to write a list of strings to a binary file. Suppose I have a list of strings, mylist. Assume the items of the list have a '\t' at the end, except the last one, which has a '\n' at the end (to help me recover the data later). Example: ['test\t', 'test1\t', 'test2\t', 'testl\n']
For a numpy ndarray, I found the following script that worked (got it from here numpy to r converter):
binfile = open('myfile.bin','wb')
binfile = open('myfile.bin','wb')
for i in range(mynpdata.shape[1]):
    binfile.write(struct.pack('%id' % mynpdata.shape[0], *mynpdata[:,i]))
binfile.close()
Does binfile.write automatically parse all the data if the variable has a * in front of it (as in the *mynpdata[:,i] example above)? Would this work the same way with a list of integers (e.g. *myIntList)?
How can I do the same with a list of strings?
I tried it on a single string using this (which I found somewhere on the net):
oneString = 'test'
oneStringByte = bytes(oneString,'utf-8')
struct.pack('I%ds' % (len(oneString),), len(oneString), oneString)
but I couldn't understand why the % within 'I%ds' above is replaced by (len(oneString),) instead of len(oneString) as in the ndarray example, AND also why both len(oneString) and oneString are passed.
Can someone help me with writing a list of strings (if necessary, assume it is written to the same binary file where I wrote out the ndarray)?
There's no need for struct. Simply join the strings and encode them, using either a specified or an assumed text encoding, to turn them into bytes:
''.join(mylist).encode('utf-8')
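A round-trip sketch using the example list from the question, with an in-memory buffer standing in for the binary file (a real file opened with 'wb' works the same way):

```python
import io

mylist = ['test\t', 'test1\t', 'test2\t', 'testl\n']

# Write: join into one string, encode to bytes, write the bytes
buf = io.BytesIO()
buf.write(''.join(mylist).encode('utf-8'))

# Read back: decode, split on the tabs, re-attach the separators
raw = buf.getvalue().decode('utf-8')
parts = raw.split('\t')
recovered = [p + '\t' for p in parts[:-1]] + [parts[-1]]
print(recovered == mylist)  # → True
```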