load csv and delete \r\n in python

I can't load my CSV correctly with pandas. I have some matrices and vectors, but the values come back with stray \r\n sequences. For example, I tried this:
test = pd.read_csv('test.csv',sep='\t',index_col=0)
test.head()
and I get this:
test['image_gray'][0]
'[[0.4297102 0.4297102 0.42578863 ... 0.37573176 0.34549804 0.30628235]\r\n [0.41794549 0.41402392 0.40618078 ... 0.37573176 0.34549804 0.30628235]\r\n [0.39833765 0.39441608 0.38151255 ... 0.3718102 0.34549804 0.30628235]\r\n ...\r\n [0.03164039 0.01987569 0.01763569 ... 0.55161137 0.55553294 0.55496745]\r\n [0.03385765 0.02771882 0.01763569 ... 0.55945451 0.56281059 0.56281059]\r\n [0.03777922 0.02771882 0.01763569 ... 0.56281059 0.56673216 0.57065373]]'
I don't want the \r\n, and I have the same problem in other columns of my dataframe.
What can I do?

If you're using Windows, try passing an explicit lineterminator to read_csv. Note that pandas only accepts a single character here, so use '\n' rather than os.linesep (which is the two-character '\r\n' on Windows):
pd.read_csv('test.csv', sep='\t', lineterminator='\n')
But the file you have might be malformed with \r\n\n. Check that.
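If the stray \r\n really is the only problem, a post-hoc string cleanup on the affected column also works. A minimal sketch (the sample value here is made up, not the asker's actual data):

```python
import pandas as pd

# A tiny stand-in for the question's dataframe: one cell whose text still
# contains literal "\r\n" sequences after loading.
test = pd.DataFrame({'image_gray': ['[[0.42 0.42]\r\n [0.41 0.41]]']})

# Strip the carriage-return/newline pairs from every string in the column.
test['image_gray'] = test['image_gray'].str.replace('\r\n', '', regex=False)

print(test['image_gray'][0])  # '[[0.42 0.42] [0.41 0.41]]'
```

The same `.str.replace` call can be applied to each of the other affected columns.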

Related

How to read in pandas column as column of lists?

Probably a simple solution but I couldn't find a fix scrolling through previous questions so thought I would ask.
I'm reading in a CSV using pd.read_csv(). One column is giving me issues:
0 ['Bupa', 'O2', 'EE', 'Thomas Cook', 'YO! Sushi...
1 ['Marriott', 'Evans']
2 ['Toni & Guy', 'Holland & Barrett']
3 []
4 ['Royal Mail', 'Royal Mail']
It looks fine here, but when I reference the first value in the column I get:
df['brand_list'][0]
Out : '[\'Bupa\', \'O2\', \'EE\', \'Thomas Cook\', \'YO! Sushi\', \'Costa\', \'Starbucks\', \'Apple Store\', \'HMV\', \'Marks & Spencer\', "Sainsbury\'s", \'Superdrug\', \'HSBC UK\', \'Boots\', \'3 Store\', \'Vodafone\', \'Marks & Spencer\', \'Clarks\', \'Carphone Warehouse\', \'Lloyds Bank\', \'Pret A Manger\', \'Sports Direct\', \'Currys PC World\', \'Warrens Bakery\', \'Primark\', "McDonald\'s", \'HSBC UK\', \'Aldi\', \'Premier Inn\', \'Starbucks\', \'Pizza Hut\', \'Ladbrokes\', \'Metro Bank\', \'Cotswold Outdoor\', \'Pret A Manger\', \'Wetherspoon\', \'Halfords\', \'John Lewis\', \'Waitrose\', \'Jessops\', \'Costa\', \'Lush\', \'Holland & Barrett\']'
Which is obviously a string not a list as expected. How can I retain the list type when I read in this data?
I've tried the import ast method I've seen in other posts: df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x)) Which didn't work.
I've also tried to replicate with dummy dataframes:
df1 = pd.DataFrame({'a': [['test','test1','test3'], ['test59'], ['test'], ['rhg','wreg']],
                    'b': [['erg','retbn','ert','eb'], ['g','eg','egr'], ['erg'], 'eg']})
df1['a'][0]
Out: ['test', 'test1', 'test3']
Which works as I would expect - this suggests to me that the solution lies in how I am importing the data
Apologies, I was being stupid. The following should work:
import ast
df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x))
df['brand_list_new'][0]
Out: ['Bupa','O2','EE','Thomas Cook','YO! Sushi',...]
As desired
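Alternatively, the parsing can happen at read time via the converters= argument of read_csv, so the column never holds strings at all. A sketch with a hypothetical file (the column name matches the question; the contents are made up):

```python
import ast
from io import StringIO
import pandas as pd

# Hypothetical CSV whose brand_list column holds stringified Python lists,
# mimicking the asker's file.
csv_text = 'brand_list\n"[\'Bupa\', \'O2\', \'EE\']"\n"[\'Marriott\', \'Evans\']"\n'

# converters= runs ast.literal_eval on each cell while reading, so the
# column holds real lists instead of strings.
df = pd.read_csv(StringIO(csv_text), converters={'brand_list': ast.literal_eval})

print(df['brand_list'][0])  # ['Bupa', 'O2', 'EE']
```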

Append numpy arrays with unequal rows in a loop

I want to append several arrays of different sizes. However, I don't want to merge them together, just store them in one big list. Here is a simplified version of my code that reproduces the problem:
import numpy as np

total_wavel = 5
tot_values = []
for i in range(total_wavel):
    size = int(np.random.uniform(low=2, high=7))
    values = np.array(np.random.uniform(low=1, high=6, size=(size,)))
    tot_values = np.append(tot_values, values)
Example output:
array([4.88776545, 4.86006097, 1.80835575, 3.52393214, 2.88971373,
1.62978552, 4.06880898, 4.10556672, 1.33428321, 3.81505999,
3.95533471, 2.18424975, 5.15665168, 5.38251801, 1.7403673 ,
4.90459377, 3.44198867, 5.03055533, 3.96271897, 1.93934124,
5.60657218, 1.24646798, 3.14179412])
Expected output:
np.array([np.array([4.88776545, 4.86006097, 1.80835575, 3.52393214]), np.array([2.88971373, 1.62978552, 4.06880898, 4.10556672]), np.array([1.33428321, 3.81505999, 3.95533471, 2.18424975, 5.15665168, 5.38251801]), np.array([1.7403673, 4.90459377, 3.44198867, 5.03055533]), np.array([3.96271897, 1.93934124, 5.60657218, 1.24646798, 3.14179412])])
Or
np.array([[4.88776545, 4.86006097, 1.80835575, 3.52393214], [2.88971373, 1.62978552, 4.06880898, 4.10556672], [1.33428321, 3.81505999, 3.95533471, 2.18424975, 5.15665168, 5.38251801], [1.7403673, 4.90459377, 3.44198867, 5.03055533], [3.96271897, 1.93934124, 5.60657218, 1.24646798, 3.14179412]])
Thank you in advance
In the loop use tot_values.append(list(values)), and after the loop tot_np = np.array(tot_values, dtype=object). The dtype=object is needed because recent NumPy versions refuse to build an array from ragged rows without it.
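Put together, a runnable sketch of that approach (using a seeded generator so the sizes are reproducible; the variable names follow the question):

```python
import numpy as np

rng = np.random.default_rng(0)
total_wavel = 5

tot_values = []                      # a plain Python list, not an array
for _ in range(total_wavel):
    size = int(rng.integers(low=2, high=7))
    values = rng.uniform(low=1, high=6, size=size)
    tot_values.append(values)        # each array stays a separate element

# dtype=object keeps the ragged rows; recent NumPy raises without it.
tot_np = np.array(tot_values, dtype=object)

print([len(v) for v in tot_np])
```

Unlike np.append, which flattens everything into one 1-D array, list.append preserves each array's boundaries.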

trouble with transpose in pd.read_csv

I have a data in a CSV file structured like this
Installation Manufacturing Sales & Distribution Project Development Other
43,934 24,916 11,744 - 12,908
52,503 24,064 17,722 - 5,948
57,177 29,742 16,005 7,988 8,105
69,658 29,851 19,771 12,169 11,248
97,031 32,490 20,185 15,112 8,989
119,931 30,282 24,377 22,452 11,816
137,133 38,121 32,147 34,400 18,274
154,175 40,434 39,387 34,227 18,111
I want to skip the header and transpose the list like this
43934 52503 57177 69658 97031 119931 137133 154175
24916 24064 29742 29851 32490 30282 38121 40434
11744 17722 16005 19771 20185 24377 32147 39387
0 0 7988 12169 15112 22452 34400 34227
12908 5948 8105 11248 8989 11816 18274 18111
Here is my code
import pandas as pd
import csv
FileName = "C:/Users/kesid/Documents/Pthon/Pthon.csv"
data = pd.read_csv(FileName, header= None)
data = list(map(list, zip(*data)))
print(data)
I am getting the error "TypeError: zip argument #1 must support iteration". Any help much appreciated.
You can read_csv in "normal" mode:
df = pd.read_csv('input.csv')
(the column names will be dropped later).
Start processing by replacing NaN with 0:
df.fillna(0, inplace=True)
Then use either df.values.T or df.values.T.tolist(), whichever
better suits your needs.
You should use skiprows=[0] to skip reading the first row, and .T to transpose:
df = pd.read_csv(filename, skiprows=[0], header=None).T

Pandas dataframe.read_csv, quotechar does not work

I am not getting the output as expected.
I am trying to convert a CSV to a dataframe, but it is not working:
sales=pd.read_csv('Downloads/item.csv',sep=',',delimeter='"',error_bad_lines=False,quotechar='"')
This is my CSV file sample:
"account_number,name,item_code,category,quantity,unit price,net_price,date "
"093356,Waters-Walker,AS-93055,Shirt,5,82.68,413.40,2013-11-17 20:41:11"
"659366,Waelchi-Fahey,AS-93055,Shirt,18,99.64,1793.52,2014-01-03 08:14:27"
"563905,""Kerluke, Reilly and Bechtelar"",AS-93055,Shirt,17,52.82,897.94,2013-12-04 02:07:05"
"995267,Cole-Eichmann,GS-86623,Shoes,18,15.28,275.04,2014-04-09 16:15:03"
"524021,Hegmann and Sons,LL-46261,Shoes,7,78.78,551.46,2014-06-18 19:25:10"
"929400,""Senger, Upton and Breitenberg"",LW-86841,Shoes,17,38.19,649.23,2014-02-10 05:55:56"
Please note that some fields in the CSV sample are enclosed in doubled quotes ("").
Here is my proposal:
df = pd.read_csv('file.csv')
col_name = 'account_number,name,item_code,category,quantity,unit price,net_price,date'
z = pd.concat([df[col_name].str.split(r'(,(?=\S)|:)', expand=True)], axis=1)
z['date'] = z[14]+z[15]+z[16]+z[17]+z[18]
z = z.drop(columns=[1,3,5,7,9,11,13, 14,15,16,17,18])
z.columns = col_name.split(',')
The crucial part is the regex r'(,(?=\S)|:)': it splits on any comma not followed by a space. The |: alternative also makes it split on colons, which tears the timestamp apart; if you can fix that, you don't have to manually concat the date.
Output:
account_number ... date
0 093356 ... 2013-11-17 20:41:11
1 659366 ... 2014-01-03 08:14:27
2 563905 ... 2013-12-04 02:07:05
3 995267 ... 2014-04-09 16:15:03
4 524021 ... 2014-06-18 19:25:10
5 929400 ... 2014-02-10 05:55:56
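Another approach, sketched on a shortened sample of the question's file: since each physical line is wrapped in an extra pair of quotes, let the csv module unwrap that outer layer first, then parse the recovered text as an ordinary CSV:

```python
import csv
from io import StringIO
import pandas as pd

# Shortened sample from the question: every line is one big quoted field,
# with inner quotes escaped by doubling.
raw = '''"account_number,name,item_code"
"093356,Waters-Walker,AS-93055"
"563905,""Kerluke, Reilly and Bechtelar"",AS-93055"
'''

# csv.reader sees one field per line and undoes the "" escaping.
inner = '\n'.join(row[0] for row in csv.reader(StringIO(raw)))

# The unwrapped text is now a regular CSV; dtype=str keeps the leading zero.
df = pd.read_csv(StringIO(inner), dtype={'account_number': str})

print(df['name'].tolist())  # ['Waters-Walker', 'Kerluke, Reilly and Bechtelar']
```

This avoids the regex reassembly entirely, at the cost of a second parsing pass.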

Write a pyspark dataframe to text without changing its structure

I have a pyspark dataframe as shown below
+--------------------+
| speed|
+--------------------+
|[5.59239, 2.51329...|
|[0.0191166, 0.169...|
|[0.561913, 0.4098...|
|[0.393343, 0.3580...|
|[0.118315, 0.1183...|
|[0.831407, 0.4470...|
|[1.49012e-08, 0.1...|
|[0.0411047, 0.152...|
|[0.620069, 0.8262...|
|[0.20373, 0.20373...|
+--------------------+
How can I write this dataframe to CSV so that it is saved as shown above? Currently I tried coalesce, but it saved as below:
"[5.59239, 2.51329, 0.141536, 1.27485, 2.35138, 12.9668, 12.9668, 2.52421, 0.330804, 0.459188, 0.459188, 0.651573, 3.15373, 6.11923, 8.8445, 8.0871, 0.855173, 1.43534, 1.43534, 1.05988, 1.05988, 0.778344, 1.20522, 1.70414, 1.70414, 0.0795492, 1.10385, 1.4759, 1.64844, 0.82941, 1.11321, 1.37977, 0.849902, 1.24436, 1.24436, 0.698651, 0.791467, 0.636781, 0.666729, 0.666729, 0.45688, 0.45688, 0.158829, 2.12693, 29.8682, 29.8682, 9.62536, 3.40384, 2.51002, 1.55077, 1.01774, 0.922753, 0.922753, 0.0438924, 0.530669, 0.879573, 0.627267, 0.0532846, 0.0890066, 0.0884833, 0.140008, 0.147534, 0.0180038, 0.0132851, 0.112785, 0.112785, 0.22997, 0.22997, 0.0524423, 0.141886, 0.328422,............]"
But I want to save it in a format that opens properly in Excel, with speed as the column name and its values as a list of lists.
I don't want to use toPandas() as it is memory-intensive.
If I have over-emphasised/under-emphasised something, please let me know in the comments.
df.coalesce(1).write.option("header","true").csv("file:///s/tesing")
I resolved this!
df_Welding_amp.rdd.coalesce(1).saveAsTextFile('home/ram/file.csv')
Though I didn't get it exactly as a list of lists, I was able to get it in row format as below:
Row(speed='[5.59239, 2.51329, 0.141536, 1.27485, 2.35138, 12.9668, 12.9668, 2.52421, 0.330804, 0.459188, 0.459188, 0.651573, 3.15373, 6.11923, 8.8445, 8.0871, 0.855173, 1.43534, 1.43534, 1.05988, 1.05988, 0.778344, 1.20522, 1.70414, 1.70414, 0.0795492, 1.10385, 1.4759, 1.64844, 0.82941........
.....]
Row(speed='[0.0191166, 0.169978, 0.226254, 0.149923, 0.149923, 0.505102, 0.505102, 0.369975, 0.305384, 0.154693, 0.224818, 0.875909, 0.875909, 2.5506, 6.06761, 5.0829, 4.46667, 2.16333, 3.74257, 3.74257, 2.33873, 1.39336, 1.56772, 0.889895, 0.249284, 0.249284, 0.132409, 0.177825, 0.270215, 0.398466, 2.3726, 4.87186, 4.05198, 2.23753, 0.266356, 0.513157, 0.78962, 0.523164, 0.138469, 0.315834, 0.315834]
