Organise and create new dataframe from existing one in Python

Organise and create new dataframe from existing one in Python - python-3.x

I have a list like this.
['9_1_152', '9_2_129', '9_3_22', '9_3_140', '10_3_28', '10_3_134', '10_3_147', '10_5_15', '11_3_18', '11_3_32', '11_3_137', '11_4_150', '12_2_13', '12_3_25', '12_3_151', '12_4_138', '13_4_13', '13_4_27', '13_5_139', '13_5_151', '14_4_16', '14_4_30', '14_4_134', '14_4_146', '15_1_92', '15_2_25', '15_2_122', '15_3_11', '15_4_40', '15_4_73', '15_5_197', '15_6_60', '15_6_103', '15_6_210', '16_1_19', '16_1_34', '16_2_8', '16_2_161', '16_4_51', '16_4_61', '16_4_85', '16_4_109', '16_5_73', '16_7_208', '16_8_213', '17_2_77', '17_4_5', '17_4_44', '17_5_30', '17_5_59', '17_5_97', '17_5_111', '17_5_157', '17_6_177', '17_6_189', '17_9_217', '18_1_22', '18_2_177', '18_2_205', '18_3_163', '18_5_11', '18_5_78', '18_5_107', '18_6_55', '18_6_65', '18_6_89', '18_6_98', '19_1_16', '19_1_68', '19_1_121', '19_1_155', '19_2_181', '19_3_77', '19_3_101', '19_4_37', '19_4_89', '19_5_54', '20_1_22', '20_1_131', '20_1_145', '20_2_172', '20_3_49', '20_6_84', '20_6_159', '20_6_217', '21_2_25', '21_2_139', '21_3_66', '21_4_40', '21_4_191', '21_5_204', '21_6_93', '21_6_108', '22_1_49', '22_1_61', '22_1_134', '22_1_160', '22_1_181', '22_4_1', '22_4_93', '22_5_102', '22_5_211', '22_6_196', '22_6_203', '22_7_12', '22_8_22', '23_3_192', '23_5_92', '23_6_122', '23_6_182', '24_1_87', '24_1_137', '24_2_111', '24_4_76', '24_5_1', '24_6_41', '24_7_12', '24_8_22', '25_1_101', '25_1_137', '25_2_10', '25_2_91', '25_4_165', '25_5_68', '25_6_79', '25_6_113', '25_8_217', '26_2_34', '26_2_66', '26_2_82', '26_2_106', '26_2_117', '26_2_214', '26_4_97', '26_6_172', '26_9_197', '26_10_201', '27_2_34', '27_2_86', '27_4_9', '27_5_49', '27_5_63', '27_5_163', '27_5_190', '27_9_209', '27_10_213', '28_1_205', '28_2_17', '28_2_151', '28_4_58', '28_4_113', '28_4_124', '28_5_169', '28_6_69', '29_1_34', '29_1_81', '29_1_134', '29_1_155', '29_1_173', '29_2_51', '29_6_8', '29_6_21', '30_1_8', '30_1_37', '30_1_126', '30_1_164', '30_2_151', '30_4_65', '30_5_83', '30_5_176', '30_6_50', '31_1_19', '31_1_141', '31_2_58', '31_3_81', '31_5_116', '31_6_45', '32_2_45', '32_2_71', '32_2_97', '32_5_87', '32_5_121', '32_6_21', '32_6_166', '33_1_30', '33_1_55', '33_2_17', '33_2_102', '33_2_166', '33_5_6', '33_5_44', '33_6_117', '34_1_4', '34_1_16', '34_1_43', '34_1_75', '34_1_107', '34_1_116', '34_2_139', '34_5_30', '34_5_183', '35_1_12', '35_3_1', '35_3_39', '35_3_52', '35_3_63', '35_3_73', '35_3_91', '35_3_109', '35_3_118', '35_3_159', '35_3_198', '35_3_210', '35_4_82', '35_4_100', '35_4_131', '35_4_171', '35_4_184', '35_4_222', '35_4_229', '35_5_25', '35_5_145', '37_1_145', '37_1_197', '37_2_132', '37_3_8', '37_3_42', '37_3_56', '37_3_85', '37_3_94', '37_3_112', '37_3_122', '37_3_172', '37_3_186', '37_3_204', '37_3_224', '37_4_103', '37_4_160', '37_4_216', '37_5_25', '37_6_74', '39_1_169', '39_2_157', '39_2_189', '39_3_4', '39_3_15', '39_3_70', '39_3_88', '39_3_97', '39_3_115', '39_3_126', '39_3_179', '39_4_54', '39_4_106', '39_4_142', '39_4_198', '39_4_210', '39_5_39', '42_1_30', '42_1_96', '42_1_141', '42_1_189', '42_2_154', '42_2_197', '42_3_4', '42_3_15', '42_3_46', '42_3_59', '42_3_105', '42_3_166', '42_3_217', '42_4_69', '42_4_79', '42_4_117', '42_4_177', '42_4_204', '42_6_129', '53_3_130', '53_3_143', '53_4_34', '53_4_47', '53_4_156', '53_5_20', '54_4_121', '54_6_13', '54_6_36', '54_6_135', '54_6_147', '55_1_112', '55_2_28', '55_2_143', '55_3_156', '55_5_127', '55_7_3', '55_8_14', '56_3_35', '56_4_20', '56_5_133', '56_6_153', '57_2_21', '57_2_125', '57_2_135', '57_2_147', '57_5_35', '58_2_40', '58_4_23', '58_4_127', '58_4_153', '58_6_141', '166_1_149', '166_2_30', '175_6_17', '175_6_31', '176_6_26', '180_1_26']
I create a dataframe from this list.
x
0 9_1_152
1 9_2_129
2 9_3_22
3 9_3_140
4 10_3_28
.. ...
310 166_2_30
311 175_6_17
312 175_6_31
313 176_6_26
314 180_1_26
I splitted this dataframe
x[['i','r','p']] = x['x'].str.split('_',expand=True)
x['i'] = pd.to_numeric(x['i'], downcast='integer')
x['r'] = pd.to_numeric(x['r'], downcast='integer')
x['p'] = pd.to_numeric(x['p'], downcast='integer')
print(x)
and obtain this one.
x i r p
0 9_1_152 9 1 152
1 9_2_129 9 2 129
2 9_3_22 9 3 22
3 9_3_140 9 3 140
4 10_3_28 10 3 28
.. ... ... .. ...
310 166_2_30 166 2 30
311 175_6_17 175 6 17
312 175_6_31 175 6 31
313 176_6_26 176 6 26
314 180_1_26 180 1 26
[315 rows x 4 columns]
What i would like to do that, create new dataframe.
New elements are column 'i'.
New columns are column 'r'.
New indexes are column 'p'.
Like this
1 2 3 4 5 6
17 175
22 9
28 28
129 9
152 9

This might be what you're looking for.
x_pivot = x.pivot_table(index="p", columns="r", values="i", aggfunc="sum", fill_value="")

Related

Panda returns 50x1 matrix instead of 50x7? (read_csv gone wrong)

I'm quite new to Python. I'm trying to load a .csv file with Panda but it returns a 50x1 matrix instead of expected 50x7. I'm a bit uncertain whether it is becaue my data contains numbers with "," (although I thought the quotechar attribute would solve that problem).
EDIT: Should perhaps mention that including the attribute sep=',' doesn't solve the issue)
My code looks like this
df = pd.read_csv('data.csv', header=None, quotechar='"')
print(df.head)
print(len(df.columns))
print(len(df.index))
Any ideas? Thanks in advance
Here is a subset of the data as text
10-01-2021,813,116927,"2,01",-,-,-
11-01-2021,657,117584,"2,02",-,-,-
12-01-2021,462,118046,"2,03",-,-,-
13-01-2021,12728,130774,"2,24",-,-,-
14-01-2021,17895,148669,"2,55",-,-,-
15-01-2021,15206,163875,"2,81",5,5,"0,0001"
16-01-2021,4612,168487,"2,89",7,12,"0,0002"
17-01-2021,2536,171023,"2,93",717,729,"0,01"
18-01-2021,3883,174906,"3,00",2147,2876,"0,05"
Here is the output of the head-function
0
0 27-12-2020,6492,6492,"0,11",-,-,-
1 28-12-2020,1987,8479,"0,15",-,-,-
2 29-12-2020,8961,17440,"0,30",-,-,-
3 30-12-2020,11477,28917,"0,50",-,-,-
4 31-12-2020,6197,35114,"0,60",-,-,-
5 01-01-2021,2344,37458,"0,64",-,-,-
6 02-01-2021,8895,46353,"0,80",-,-,-
7 03-01-2021,6024,52377,"0,90",-,-,-
8 04-01-2021,2403,54780,"0,94",-,-,-

Using your data I got the expected result. (even without quotechar='"')
Could you maybe show us your output?
import pandas as pd
df = pd.read_csv('data.csv', header=None)
print(df)
> 0 1 2 3 4 5 6
> 0 10-01-2021 813 116927 2,01 - - -
> 1 11-01-2021 657 117584 2,02 - - -
> 2 12-01-2021 462 118046 2,03 - - -
> 3 13-01-2021 12728 130774 2,24 - - -
> 4 14-01-2021 17895 148669 2,55 - - -
> 5 15-01-2021 15206 163875 2,81 5 5 0,0001
> 6 16-01-2021 4612 168487 2,89 7 12 0,0002
> 7 17-01-2021 2536 171023 2,93 717 729 0,01
> 8 18-01-2021 3883 174906 3,00 2147 2876 0,05

You need to define the seperator and delimiter, like this:
df = pd.read_csv('data.csv', header=None, sep = ',', delimiter=',' , quotechar='"')

type error in functions to run point in polygon query on RAPIDS

I want to create a point in polygon query for 14million NYC taxi trips and find out which of the 263 taxi zones the trips were located.
I want to the code on RAPIDS cuspatial. I read a few forums and posts, and came across cuspatial polygon limitations that users can only perform queries on 32 polygons in each run. So I did the following to split my polygons in batches.
This is my taxi zone polygon file
cusptaxizone
(0 0
1 1
2 34
3 35
4 36
...
258 348
259 349
260 350
261 351
262 353
Name: f_pos, Length: 263, dtype: int32,
0 0
1 232
2 1113
3 1121
4 1137
...
349 97690
350 97962
351 98032
352 98114
353 98144
Name: r_pos, Length: 354, dtype: int32,
x y
0 933100.918353 192536.085697
1 932771.395560 191317.004138
2 932693.871591 191245.031174
3 932566.381345 191150.211914
4 932326.317026 190934.311748
... ... ...
98187 996215.756543 221620.885314
98188 996078.332519 221372.066989
98189 996698.728091 221027.461362
98190 997355.264443 220664.404123
98191 997493.322715 220912.386162
[98192 rows x 2 columns])
There are 263 polygons/ taxi zones in total - I want to do queries in 24 batches and 11 polygons in each iteration.
def create_iterations(start, end, batches):
iterations = list(np.arange(start, end, batches))
iterations.append(end)
return iterations
pip_iterations = create_iterations(0, 264, 24)
#loop to do point in polygon query in a table
def perform_pip(cuda_df, cuspatial_data, polygon_name, iter_batch):
cuda_df['borough'] = " "
for i in range(len(iter_batch)-1):
start = pip_iterations[i]
end = pip_iterations[i+1]
pip = cuspatial.point_in_polygon(cuda_df['pickup_longitude'], cuda_df['pickup_latitude'],
cuspatial_data[0][start:end], #poly_offsets
cuspatial_data[1], #poly_ring_offsets
cuspatial_data[2]['x'], #poly_points_x
cuspatial_data[2]['y'] #poly_points_y
)
for i in pip.columns:
cuda_df['borough'].loc[pip[i]] = polygon_name[i]
return cuda_df
When I ran the function I received a type error. I wonder what might cause the issue?
pip_pickup = perform_pip(cutaxi, cusptaxizone, pip_iterations)
TypeError: perform_pip() missing 1 required positional argument: 'iter_batch'

It seems like you are passing in cutaxi for cuda_df, cusptaxizone for cuspatial_data and pip_iterations for polygon_name variable in perform_pip function. There is no variable/value passed for iter_batch defined in perform_pip function:
def perform_pip(cuda_df, cuspatial_data, polygon_name, iter_batch):
Hence, you get the above error which states that iter_batch is missing. As stated in the above comment as well you are not passing the right number of parameters for perform_pip function.
If you edit your code to pass in the right number of variables to perform_pip function the above mentioned error :
TypeError: perform_pip() missing 1 required positional argument: 'iter_batch'
would be resolved.

How to create a dataframe with different random numbers on each column?

I'm trying to make different random numbers but it keeps being the same on every column, how to fix it, using 1 line?
CODE:
yuju= pd.DataFrame()
column_price_x = [random.uniform(65.5,140.5) for i in range(20)]
for i in range(1990,2020):
yuju[i] = column_price_x
yuju
RESULT
EXPECTED:
Different numbers value for each column
How can I deal with it?

Its much easier than you think
In [12]: import numpy as np
In [13]: df = pd.DataFrame(np.random.rand(5,5))
In [14]: df
Out[14]:
0 1 2 3 4
0 0.463645 0.818606 0.520964 0.016413 0.286529
1 0.701693 0.556813 0.352911 0.738017 0.148805
2 0.899378 0.626350 0.821576 0.917648 0.404706
3 0.985617 0.336138 0.443910 0.690457 0.627859
4 0.121281 0.784853 0.799065 0.102332 0.156317
np.random.rand samples from standard uniform distribution (over [0,1])
Edit
if you want uniform distribution over given numbers, use np.random.uniform
In [16]: pd.DataFrame(np.random.uniform(low=65.5,high=140.5,size=(5,5))
...: )
Out[16]:
0 1 2 3 4
0 124.356069 96.718934 100.587485 136.670313 124.134073
1 68.109675 105.677037 86.084935 109.284336 108.393333
2 120.445978 125.036895 92.557137 105.864824 95.297450
3 91.027931 140.040051 94.362951 80.870850 70.106912
4 107.404708 92.472469 84.748544 82.116756 129.313166

here the solution
each iteration you should random again to assign new value for each column
yuju= pd.DataFrame()
for i in range(1990,2020):
yuju[i]= [random.uniform(65.5,140.5) for i in range(20)]
yuju
output
1990 1991 1992 1993 1994 1995 1996 1997 ...
0 73.117785 104.158470 76.704672 136.295814 106.008801 88.129275 96.843800 118.172649 ... 106.08
1 77.146977 131.584449 112.781430 113.071448 118.806880 140.301281 132.196554 136.222878 ... 74.85
2 67.976294 90.571586 137.313729 126.388545 134.941530 119.544528 119.692859 124.883332 ... 82.48
3 76.577618 102.765745 137.014399 84.696234 70.087628 86.180974 121.070030 87.991356 ... 71.67
4 104.675987 134.869611 120.221701 69.652423 105.650834 107.308007 122.372708 80.037225 ... 90.58
5 107.093326 124.649323 138.961846 84.312784 98.964176 87.691698 120.426266 79.888018 ... 97.46
6 97.375159 97.607740 119.027947 77.545403 81.365235 119.204719 75.426836 132.545121 ... 120.15
7 81.099338 94.315767 123.389789 85.734648 134.746295 99.196135 65.963834 72.895016 ... 135.63
8 129.577824 118.482358 137.838454 83.338883 68.603851 138.657750 85.155046 73.311065 ... 91.12
9 129.321333 134.598491 138.810883 119.487502 75.794849 125.314185 118.499014 126.969947 ... 74.86
10 122.704160 118.282868 114.196318 69.668442 112.237553 68.953530 115.395672 114.560736 ... 88.21
11 112.653109 109.635751 78.470715 81.973892 111.413094 76.918852 76.318205 129.423737 ... 103.06
12 80.984595 136.170595 83.258407 112.248942 96.730922 84.922575 104.984614 127.646325 ... 103.24
13 82.658896 97.066191 95.096705 107.757428 93.767250 93.958438 115.113325 98.931509 ... 105.32
14 85.173060 77.257117 72.668875 87.061919 130.088992 80.001858 104.526423 85.237558 ... 87.86
15 68.428850 79.948204 107.060400 92.962859 133.393354 93.806838 99.258857 138.314982 ... 86.80
16 115.105281 110.567551 119.868457 139.482290 103.235046 128.805920 140.131489 107.568099 ... 98.16
17 71.318147 119.965667 97.135972 90.174975 125.738171 115.655945 86.333461 114.574965 ... 134.80
18 134.000260 121.417473 104.832999 129.277671 139.932955 122.623911 92.369881 109.523118 ... 137.47
19 104.444951 111.712214 130.602922 119.446700 88.256841 110.316280 74.611164 88.364896 ... 115.32

Find the shortest paths between connected nodes (csv files)

I'm trying to find the longest shortest path(s) between 2 counties. I was given 2 .txt files, one with all of the nodes (county ID, population, latitude and longitude, and commuters inside the county) and one with the links (source county, destination county, distance, number of commuters).
01001 43671 32.523283 -86.577176 7871
01003 140415 30.592781 -87.748260 45208
01005 29038 31.856515 -85.331312 8370
01007 20826 33.040054 -87.123243 3199
01009 51024 33.978461 -86.554768 8966
01011 11714 32.098285 -85.704915 2237
01013 21399 31.735884 -86.662232 5708
01015 112249 33.741989 -85.817544 39856
01017 36583 32.891233 -85.288745 9281
01019 23988 34.184158 -85.621930 4645
01021 39593 32.852554 -86.689982 8115
01023 15922 32.027681 -88.257855 3472
01025 27867 31.688155 -87.834164 7705
...
01001 01001 0 7871
01001 01007 76.8615966430995 7
01001 01013 87.9182871130127 37
01001 01015 152.858742124667 5
01001 01021 38.1039665382023 350
01001 01031 140.051395101308 8
01001 01037 57.6726084645634 12
01001 01047 48.517875245493 585
01001 01051 38.9559472915165 741
01001 01053 169.524277177911 5
01001 01059 245.323879285783 7
01001 01065 102.775324022097 2
01001 01073 114.124721221283 142
...
01003 48439 932.019063970525 9
01003 53033 3478.13978129133 11
01003 54081 997.783781484149 10
01005 01005 0.000134258785931453 8370
01005 01011 44.3219329413987 72
01005 01021 168.973302699063 7
...
The first file with the nodes is called "THE_NODES.txt" and the second is "THE_LINKS.txt".
How would I use python code to find the longest shortest path(s) between any of the two counties? I assume I start with making a graph of the network, and since the second file has the connections, use 'THE_LINKS.txt' for the edges(I don't know if the weights would be the distance?)? Also, I think these files can only be read as a csv (correct me if I'm wrong), so I can't (or don't know how to) use networkx for this problem.

You can use the read_table function with | separator to read .txt files
node = pd.read_table('node.txt', sep='|', header=None)
links = pd.read_table('links.txt', sep='|', header=None)
Then you need to find the location of countries ( please refer this link : How to select rows from a DataFrame based on column values? ). Then you have to calculate the distance between the countries.
What have you tried so far ? Include that too.

Vim loading and formatting slow

I'm using vim with many plugins, the .vimrc file has a big number of plugins, but yet it was very fast, and suddenly for some reason it's not any more, not sure may be after I started using eslinter, it takes about two second every time I save a file, or open a file in a new tab.
Is there any way I can find which plugin that is causing all that delay?
FUNCTIONS SORTED ON TOTAL TIME
count total (s) self (s) function
1 1.193202 0.000100 <SNR>67_BufWritePostHook()
1 1.192941 0.000469 <SNR>67_UpdateErrors()
1 1.188491 0.000942 <SNR>67_CacheErrors()
1 1.184402 0.000040 287()
1 1.184210 0.000254 286()
1 1.183663 0.000135 SyntaxCheckers_javascript_eslint_GetLocList()
1 1.182540 0.000674 SyntasticMake()
1 1.181614 0.000483 syntastic#util#system()
3 0.023413 0.000393 airline#extensions#tabline#get()
3 0.023020 0.001615 airline#extensions#tabline#tabs#get()
12 0.022842 0.010081 <SNR>180_parse_screen()
2 0.018991 0.003055 381()
12 0.012386 <SNR>180_create_matches()
12 0.011903 0.002183 <SNR>172_OnCursorMovedNormalMode()
8 0.011681 0.000275 <SNR>157_get_seperator()
14 0.010961 0.010719 <SNR>172_OnFileReadyToParse()
46 0.009832 0.003413 airline#highlighter#get_highlight()
10 0.009530 0.000443 <SNR>157_get_transitioned_seperator()
10 0.009087 0.000364 airline#highlighter#add_separator()
10 0.008723 0.000830 <SNR>153_exec_separator()
FUNCTIONS SORTED ON SELF TIME
count total (s) self (s) function
12 0.012386 <SNR>180_create_matches()
14 0.010961 0.010719 <SNR>172_OnFileReadyToParse()
12 0.022842 0.010081 <SNR>180_parse_screen()
92 0.005751 <SNR>153_get_syn()
46 0.009832 0.003413 airline#highlighter#get_highlight()
13 0.003297 <SNR>123_Highlight_Matching_Pair()
2 0.018991 0.003055 381()
1 0.002336 0.002331 gitgutter#sign#remove_signs()
12 0.011903 0.002183 <SNR>172_OnCursorMovedNormalMode()
3 0.001629 airline#extensions#tabline#tabs#map_keys()
3 0.023020 0.001615 airline#extensions#tabline#tabs#get()
1 0.001750 0.001601 gitgutter#async#execute()
12 0.001445 <SNR>146_update()
1 0.001305 0.001302 gitgutter#sign#find_current_signs()
12 0.001276 <SNR>157_get_accented_line()
14 0.001155 <SNR>172_AllowedToCompleteInBuffer()
10 0.003474 0.001059 airline#highlighter#exec()
2 0.001717 0.000985 xolox#misc#cursorhold#autocmd()
1 0.002802 0.000983 347()
1 0.001208 0.000966 gitgutter#sign#upsert_new_gitgutter_signs()

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Organise and create new dataframe from existing one in Python - python-3.x

This might be what you're looking for. x_pivot = x.pivot_table(index="p", columns="r", values="i", aggfunc="sum", fill_value="")

Related

Panda returns 50x1 matrix instead of 50x7? (read_csv gone wrong)

type error in functions to run point in polygon query on RAPIDS

How to create a dataframe with different random numbers on each column?

Find the shortest paths between connected nodes (csv files)

Vim loading and formatting slow

Categories

Resources