R: convert all rows to strings

I need to convert all rows of a data frame to strings.
Here is some sample data:
1.12331,4.331123,4.12335435,1,"asd"
1.123453345,5.654456,4.889999,1.45456,"qwe"
2.00098,5.5445,4.768799,1.999999,"ttre"
I read this data into R and got a data frame:
td<-read.table("test.csv", sep=',')
When I run apply(td, 2, as.character) on this data, I get:
V1 V2 V3 V4 V5
[1,] "1.1233" "4.3311" "4.1234" "1.0000" "asd"
[2,] "1.1235" "5.6545" "4.8900" "1.4546" "qwe"
[3,] "2.0010" "5.5445" "4.7688" "2.0000" "ttre"
But when I do the same on only the numeric columns, I get a different result:
apply(td[,1:4], 2, as.character)
V1 V2 V3 V4
[1,] "1.12331" "4.331123" "4.12335435" "1"
[2,] "1.123453345" "5.654456" "4.889999" "1.45456"
[3,] "2.00098" "5.5445" "4.768799" "1.999999"
As a result, I need a data frame with values exactly the same as in the source file. What am I doing wrong?

You can set colClasses in read.table() to read all columns as character:
td <- read.table("test.csv", sep=',',colClasses="character")
td
V1 V2 V3 V4 V5
1 1.12331 4.331123 4.12335435 1 asd
2 1.123453345 5.654456 4.889999 1.45456 qwe
3 2.00098 5.5445 4.768799 1.999999 ttre
str(td)
'data.frame': 3 obs. of 5 variables:
$ V1: chr "1.12331" "1.123453345" "2.00098"
$ V2: chr "4.331123" "5.654456" "5.5445"
$ V3: chr "4.12335435" "4.889999" "4.768799"
$ V4: chr "1" "1.45456" "1.999999"
$ V5: chr "asd" "qwe" "ttre"

The best way to do this is to read the data in as character in the first place; you can do this with the colClasses argument to read.table. The rounding you saw happens because apply() first coerces the data frame to a matrix, and with a character column present as.matrix() runs the numeric columns through format(), which rounds them; when you apply over only the numeric columns, the result is a numeric matrix, so as.character() sees the full values.
td <- read.table("test.csv", sep=',', colClasses="character")

Related

Python: split a string every three words in a dataframe

I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code that is supposed to split the string after every three words:
import pandas as pd
import numpy as np
df1 = {
'State':['Arizona AZ asdf hello abc','Georgia GG asdfg hello def','Newyork NY asdfg hello ghi','Indiana IN asdfg hello jkl','Florida FL ASDFG hello mno']}
df1 = pd.DataFrame(df1,columns=['State'])
df1
def splitTextToTriplet(df):
    text = df['State'].str.split()
    n = 3
    grouped_words = [' '.join(str(text[i:i+n]) for i in range(0,len(text),n))]
    return grouped_words
splitTextToTriplet(df1)
Currently the output is this:
['0 [Arizona, AZ, asdf, hello, abc]\n1 [Georgia, GG, asdfg, hello, def]\nName: State, dtype: object 2 [Newyork, NY, asdfg, hello, ghi]\n3 [Indiana, IN, asdfg, hello, jkl]\nName: State, dtype: object 4 [Florida, FL, ASDFG, hello, mno]\nName: State, dtype: object']
But I am actually expecting this output, as 5 rows in one column of the dataframe:
['Arizona AZ asdf', 'hello abc']
['Georgia GG asdfg', 'hello def']
['Newyork NY asdfg', 'hello ghi']
['Indiana IN asdfg', 'hello jkl']
['Florida FL ASDFG', 'hello mno']
How can I change the code so it produces the expected output?
For efficiency, you can use a regex and str.extractall + groupby/agg:
(df1['State']
.str.extractall(r'((?:\w+\b\s*){1,3})')[0]
.groupby(level=0).agg(list)
)
output:
0 [Arizona AZ asdf , hello abc]
1 [Georgia GG asdfg , hello def]
2 [Newyork NY asdfg , hello ghi]
3 [Indiana IN asdfg , hello jkl]
4 [Florida FL ASDFG , hello mno]
regex:
(                 # start capturing
  (?:\w+\b\s*)    # a word plus any trailing whitespace
  {1,3}           # repeated one to three times
)                 # end capturing
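The captured groups keep the trailing space matched by \s*; if you want it trimmed, a small variation of the same approach (just adding str.strip before the groupby) would be:

import pandas as pd

df1 = pd.DataFrame({'State': ['Arizona AZ asdf hello abc',
                              'Georgia GG asdfg hello def',
                              'Newyork NY asdfg hello ghi',
                              'Indiana IN asdfg hello jkl',
                              'Florida FL ASDFG hello mno']})

out = (df1['State']
       .str.extractall(r'((?:\w+\b\s*){1,3})')[0]
       .str.strip()                 # drop the trailing space captured by \s*
       .groupby(level=0).agg(list)
       )
print(out)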
You can do:
def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    grouped_words = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
    return grouped_words
df1.apply(lambda row: splitTextToTriplet(row), axis=1)
which gives the following DataFrame as output:
                                    0
0    ['Arizona AZ asdf', 'hello abc']
1   ['Georgia GG asdfg', 'hello def']
2   ['Newyork NY asdfg', 'hello ghi']
3   ['Indiana IN asdfg', 'hello jkl']
4   ['Florida FL ASDFG', 'hello mno']
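If you want the result stored as a column of the dataframe (the 5 rows, one column the question asks for), a small follow-up sketch using the splitTextToTriplet defined above:

df1['triplets'] = df1.apply(splitTextToTriplet, axis=1)
print(df1)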

Pandas returns 50x1 matrix instead of 50x7? (read_csv gone wrong)

I'm quite new to Python. I'm trying to load a .csv file with Pandas, but it returns a 50x1 matrix instead of the expected 50x7. I'm a bit uncertain whether it is because my data contains numbers with "," (although I thought the quotechar argument would solve that problem).
EDIT: I should perhaps mention that including the argument sep=',' doesn't solve the issue.
My code looks like this:
df = pd.read_csv('data.csv', header=None, quotechar='"')
print(df.head)
print(len(df.columns))
print(len(df.index))
Any ideas? Thanks in advance
Here is a subset of the data as text
10-01-2021,813,116927,"2,01",-,-,-
11-01-2021,657,117584,"2,02",-,-,-
12-01-2021,462,118046,"2,03",-,-,-
13-01-2021,12728,130774,"2,24",-,-,-
14-01-2021,17895,148669,"2,55",-,-,-
15-01-2021,15206,163875,"2,81",5,5,"0,0001"
16-01-2021,4612,168487,"2,89",7,12,"0,0002"
17-01-2021,2536,171023,"2,93",717,729,"0,01"
18-01-2021,3883,174906,"3,00",2147,2876,"0,05"
Here is the output of the head function:
0
0 27-12-2020,6492,6492,"0,11",-,-,-
1 28-12-2020,1987,8479,"0,15",-,-,-
2 29-12-2020,8961,17440,"0,30",-,-,-
3 30-12-2020,11477,28917,"0,50",-,-,-
4 31-12-2020,6197,35114,"0,60",-,-,-
5 01-01-2021,2344,37458,"0,64",-,-,-
6 02-01-2021,8895,46353,"0,80",-,-,-
7 03-01-2021,6024,52377,"0,90",-,-,-
8 04-01-2021,2403,54780,"0,94",-,-,-
Using your data, I got the expected result (even without quotechar='"'). Could you show us your output?
import pandas as pd
df = pd.read_csv('data.csv', header=None)
print(df)
> 0 1 2 3 4 5 6
> 0 10-01-2021 813 116927 2,01 - - -
> 1 11-01-2021 657 117584 2,02 - - -
> 2 12-01-2021 462 118046 2,03 - - -
> 3 13-01-2021 12728 130774 2,24 - - -
> 4 14-01-2021 17895 148669 2,55 - - -
> 5 15-01-2021 15206 163875 2,81 5 5 0,0001
> 6 16-01-2021 4612 168487 2,89 7 12 0,0002
> 7 17-01-2021 2536 171023 2,93 717 729 0,01
> 8 18-01-2021 3883 174906 3,00 2147 2876 0,05
You need to define the separator explicitly, like this (sep and delimiter are aliases in read_csv, so only one of them should be passed):
df = pd.read_csv('data.csv', header=None, sep=',', quotechar='"')
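As a side note (an assumption about what you may want, not something asked in the question): the quoted values such as "2,01" use a comma as the decimal separator, so if you also want them parsed as numbers rather than strings, the decimal parameter of read_csv can handle that:

import pandas as pd

# decimal=',' makes pandas parse quoted values like "2,01" as the float 2.01
df = pd.read_csv('data.csv', header=None, quotechar='"', decimal=',')
print(df.dtypes)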

How to get all combinations in Python, with repetition

Code example
from itertools import *
from collections import Counter
from tqdm import *
#for i in tqdm(Iterable):
for i in combinations_with_replacement(['1','2','3','4','5','6','7','8'], 8):
    b = ''.join(i)
    if b == '72637721':
        print(b)
When I try product, I get:
for i in product(['1','2','3','4','5','7','6','8'], 8):
TypeError: 'int' object is not iterable
How can I get all the combinations? (I believed it would work before testing it; now I can see what I did wrong.)
I read that combinations_with_replacement returns all of them, but as I can see now, it does not.
I am using Python 3.8.
The output, as requested:
11111111 11111112 11111113 11111114 11111115 11111116 11111117
11111118 11111122 11111123 11111124 11111125 11111126 11111127
11111128 11111133 11111134 11111135 11111136 11111137 11111138
11111144 11111145 11111146 11111147 11111148 11111155 11111156
11111157 11111158 11111166 11111167 11111168 11111177 11111178
11111188 11111222 11111223 11111224 11111225 11111226 11111227
11111228 11111233 11111234 11111235 11111236 11111237 11111238
11111244 11111245 11111246 11111247 11111248 11111255 11111256
11111257 11111258 11111266 11111267 11111268 11111277 11111278
11111288
and here is what it gives at the end:
56666888 56668888 56688888 56888888 58888888 77777777 77777776
77777778 77777766 77777768 77777788 77777666 77777668 77777688
77777888 77776666 77776668 77776688 77776888 77778888 77766666
77766668 77766688 77766888 77768888 77788888 77666666 77666668
77666688 77666888 77668888 77688888 77888888 76666666 76666668
76666688 76666888 76668888 76688888 76888888 78888888 66666666
66666668 66666688 66666888 66668888 66688888 66888888 68888888
88888888
To make it clearer: think of it as counting from 1111 1111 to 8888 8888, but with characters; that is why I tried to do it with permutations/combinations with repetition.
It misses some of the possible combinations of those symbols.
As an example of what I am trying to do: generate every possible arrangement of hex digits, from 0 to F, but not only for those; it should work for any characters.
['1','2','3','4','5','6','7','8'] is only an example;
it could just as well be ['a','b','x','c','d','g','r','8'], etc.
The solution is to use itertools.product instead of combinations_with_replacement:
from itertools import *
for i in product(['1','2','3','4','5','6','7','8'], repeat=8):
    b = ''.join(i)
    if b == '72637721':
        print(b)
For comparison:
itertools.product('ABCD', 'ABCD') -> AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD # full Cartesian product: repeated elements (AA) and mirrored pairs (AB and BA) both appear
itertools.permutations('ABCD', 2) -> AB AC AD BA BC BD CA CB CD DA DB DC # no repeated elements, but mirrored pairs still appear
itertools.combinations_with_replacement('ABCD', 2) -> AA AB AC AD BB BC BD CC CD DD # repeated elements allowed, no mirrored pairs
itertools.combinations('ABCD', 2) -> AB AC AD BC BD CD # no repeated elements and no mirrored pairs
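Since the question also mentions doing this for hex digits (0 to F) or any other characters, here is a short sketch along the same lines; the alphabet and the length of 2 are only illustrative assumptions:

from itertools import product
import string

# any alphabet works; here the 16 hex digits, building all strings of length 2 (16**2 = 256 of them)
hex_digits = string.digits + 'ABCDEF'
all_strings = [''.join(p) for p in product(hex_digits, repeat=2)]
print(len(all_strings))   # 256
print(all_strings[:5])    # ['00', '01', '02', '03', '04']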
Here is updated code that will print all the combinations. It does not matter whether your list mixes strings and numbers.
To make sure the combination length matches the number of elements in your list, I recommend doing:
comb_list = [1, 2, 3, 'a']
comb_len = len(comb_list)
and replace the line with:
comb = combinations_with_replacement(comb_list, comb_len)
from itertools import combinations_with_replacement
comb = combinations_with_replacement([1, 2, 3, 'a'], 4)
for i in list(comb):
    print(''.join([str(j) for j in i]))
This results in the following:
1111
1112
1113
111a
1122
1123
112a
1133
113a
11aa
1222
1223
122a
1233
123a
12aa
1333
133a
13aa
1aaa
2222
2223
222a
2233
223a
22aa
2333
233a
23aa
2aaa
3333
333a
33aa
3aaa
aaaa
I don't know what you are trying to do. Here's an attempt to start a dialogue to get to the final answer:
samples = [1, 2, 3, 4, 5, 'a', 'b']
len_samples = len(samples)
for elem in samples:
    print(str(elem) * len_samples)
The output of this will be as follows:
1111111
2222222
3333333
4444444
5555555
aaaaaaa
bbbbbbb
Is this what you want? If not, please explain in your question what you expect as output.

Pandas read_csv: quotechar does not work

I am not getting the output I expect. I am trying to read a CSV into a dataframe, but it is not working:
sales=pd.read_csv('Downloads/item.csv',sep=',',delimeter='"',error_bad_lines=False,quotechar='"')
This is my CSV file sample:
"account_number,name,item_code,category,quantity,unit price,net_price,date "
"093356,Waters-Walker,AS-93055,Shirt,5,82.68,413.40,2013-11-17 20:41:11"
"659366,Waelchi-Fahey,AS-93055,Shirt,18,99.64,1793.52,2014-01-03 08:14:27"
"563905,""Kerluke, Reilly and Bechtelar"",AS-93055,Shirt,17,52.82,897.94,2013-12-04 02:07:05"
"995267,Cole-Eichmann,GS-86623,Shoes,18,15.28,275.04,2014-04-09 16:15:03"
"524021,Hegmann and Sons,LL-46261,Shoes,7,78.78,551.46,2014-06-18 19:25:10"
"929400,""Senger, Upton and Breitenberg"",LW-86841,Shoes,17,38.19,649.23,2014-02-10 05:55:56"
Please take a look at the company-name fields in the CSV sample (e.g. ""Kerluke, Reilly and Bechtelar""): they are enclosed in doubled quotes ("").
Here is my proposal:
df = pd.read_csv('file.csv')
col_name = 'account_number,name,item_code,category,quantity,unit price,net_price,date'
z = pd.concat([df[col_name].str.split(r'(,(?=\S)|:)', expand=True)], axis=1)
z['date'] = z[14]+z[15]+z[16]+z[17]+z[18]
z = z.drop(columns=[1,3,5,7,9,11,13, 14,15,16,17,18])
z.columns = col_name.split(',')
The crucial part is the regex r'(,(?=\S)|:)': it matches a comma that is followed by a non-space character, or a colon. The |: alternative is also what makes it split on the colons inside the time, which is why the date has to be concatenated back together manually; see the sketch after the output below for a version without that.
Output:
account_number ... date
0 093356 ... 2013-11-17 20:41:11
1 659366 ... 2014-01-03 08:14:27
2 563905 ... 2013-12-04 02:07:05
3 995267 ... 2014-04-09 16:15:03
4 524021 ... 2014-06-18 19:25:10
5 929400 ... 2014-02-10 05:55:56
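A variation on the proposal above (a sketch, assuming the same file.csv and column name): dropping the |: alternative and the capture group splits only on commas followed by a non-space character, so the embedded company names and the timestamps stay intact and no manual date concatenation is needed:

import pandas as pd

df = pd.read_csv('file.csv')
col_name = 'account_number,name,item_code,category,quantity,unit price,net_price,date'

# split only on commas that are immediately followed by a non-space character
z = df[col_name].str.split(r',(?=\S)', expand=True)
z.columns = col_name.split(',')
print(z.head())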

velox package point extraction failure

I am trying to work with the velox package in R 3.4.1, using the current (0.2.0) version of the package. I want to extract raster pixel values using the VeloxRaster_extract_points functionality. After failures with my own data, I ran the exact code provided on page 19 of the current reference manual, which returned the error shown below. I have been unable to find any relevant references to this error online. Any suggestions?
Thanks
> ## Make VeloxRaster with two bands
> set.seed(0)
> mat1 <- matrix(rnorm(100), 10, 10)
> mat2 <- matrix(rnorm(100), 10, 10)
> vx <- velox(list(mat1, mat2), extent=c(0,1,0,1), res=c(0.1,0.1),crs="+proj=longlat +datum=WGS84 +no_defs")
> ## Make SpatialPoints
> library(sp)
> library(rgeos)
> coord <- cbind(runif(10), runif(10))
> spoint <- SpatialPoints(coords=coord)
> ## Extract
> vx$extract_points(sp=spoint)
Error in envRefInferField(x, what, getClass(class(x)), selfEnv) :
‘extract_points’ is not a valid field or method name for reference class “VeloxRaster”
When I tried it, it worked fine in my case:
library('velox')
## Make VeloxRaster with two bands
set.seed(0)
mat1 <- matrix(rnorm(100), 10, 10)
mat2 <- matrix(rnorm(100), 10, 10)
vx <- velox(list(mat1, mat2), extent=c(0,1,0,1), res=c(0.1,0.1),
crs="+proj=longlat +datum=WGS84 +no_defs")
## Make SpatialPoints
library(sp)
library(rgeos)
coord <- cbind(runif(10), runif(10))
spoint <- SpatialPoints(coords=coord)
## Extract
vx$extract_points(sp=spoint)
[,1] [,2]
[1,] 0.76359346 -0.4125199
[2,] 0.35872890 0.3178857
[3,] 0.25222345 -1.1195991
[4,] 0.00837096 2.0247614
[5,] 0.77214219 -0.5922254
[6,] 0.00837096 2.0247614
[7,] 1.10096910 0.5989751
[8,] 1.15191175 -0.9558391
[9,] 0.14377148 -1.5236149
[10,] 1.27242932 0.0465803
I think you may need to reinstall the package.
