Geopandas list comprehension when reading in geometry types

I'm trying to learn list comprehension by looking at some of my already working code. The following reads in geometries (shapefiles) from a list of file paths (layer_path). It filters by geom_type, returning pnt, ln, and ply (points, lines, and polygons).
pnt = gpd.GeoDataFrame()
ln = gpd.GeoDataFrame()
ply = gpd.GeoDataFrame()

for layer in layer_list:
    layer_path = layer_list[layer]
    gdf = gpd.read_file(layer_path)
    if 'geometry' in gdf:
        gdf = gdf.dropna(axis=0, subset=['geometry'])
        for index, row in gdf.iterrows():
            row['gis_layer'] = layer
            if row['geometry'].geom_type == 'Point':
                pnt = pnt.append(row)
            if row['geometry'].geom_type == 'MultiLineString' or row['geometry'].geom_type == 'LineString':
                ln = ln.append(row)
            if row['geometry'].geom_type == 'MultiPolygon' or row['geometry'].geom_type == 'Polygon':
                ply = ply.append(row)
layer_path is a list of file paths to the files which are read in. So, to condense the above, I need something like:
points = [gpd.read_file(layer_path) for layer in gpd.read_file(layer_path) if layer.geometry.geom_type == 'Point']
lines = [gpd.read_file(layer_path) for layer in gpd.read_file(layer_path) if layer.geometry.geom_type == 'Linestring']
polygons = [gpd.read_file(layer_path) for layer in gpd.read_file(layer_path) if layer.geometry.geom_type == 'Polygon']
This does not work. So, step by step: the following reads all geometries into a list of GeoDataFrames, but how do you apply an if statement such as if geom_type == 'Polygon'?
gdf = [gpd.read_file(layer_list) for layer_list in layer_list]
The other issue is that I want to add in row['gis_layer'] = layer which adds the name of the original file to the geodataframe.
Is using list comprehension overkill for this as I'll probably be extracting the geodataframes from the list anyway?
Edit: as I'm returning (geo)dataframes, should this actually use dictionary comprehension instead?
Edit: progress so far gives a list of a single directory, repeated for the number of files found.
ply = [layer_path for layer_list in layer_list for index, row in gpd.read_file(layer_path).iterrows() if row['geometry'].geom_type == 'MultiPolygon' or row['geometry'].geom_type == 'Polygon']
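One way to condense this without a comprehension at all (a sketch of my own, not from the original thread; it assumes layer_list maps layer names to file paths, as the loop above suggests) is to concatenate everything once and then split by geometry type:

import pandas as pd
import geopandas as gpd

frames = []
for layer, path in layer_list.items():   # assumes layer_list is dict-like: name -> path
    gdf = gpd.read_file(path)
    if 'geometry' in gdf:
        gdf = gdf.dropna(subset=['geometry'])
        gdf['gis_layer'] = layer          # tag each feature with its source layer
        frames.append(gdf)

all_features = gpd.GeoDataFrame(pd.concat(frames, ignore_index=True))

# Split by geometry type with boolean masks instead of row-by-row appends.
pnt = all_features[all_features.geom_type == 'Point']
ln = all_features[all_features.geom_type.isin(['LineString', 'MultiLineString'])]
ply = all_features[all_features.geom_type.isin(['Polygon', 'MultiPolygon'])]

Since the filtering is done with boolean masks on one combined GeoDataFrame, neither a list nor a dictionary comprehension is really needed here.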


Display and save contents of a data frame with multi-dimensional array elements

I have created and updated a pandas dataframe to fill details of a section of an image and its corresponding features.
slice_sq_dim = 200
df_slice = pd.DataFrame({'Sample': str,
                         'Slice_ID': int,
                         'Slice_Array': [np.zeros((slice_sq_dim, slice_sq_dim))],
                         'Interface_Array': [np.zeros((slice_sq_dim, slice_sq_dim))],
                         'Slice_Array_Threshold': [np.zeros((slice_sq_dim, slice_sq_dim))]})
I added individual elements of this dataframe by updating the value of each cell through row by row iteration. Once I have completed my dataframe (with around 200 rows), I cannot seem to display more than the first row of its contents. I assume that this is due to the inclusion of multi-dimensional numpy arrays (image slices) as a component. I have also exported this data into a JSON file so that it can act as an input file during the next run. The following code shows how I exactly tried this and also how I fill my dataframe.
Slices_data_file = os.path.join(os.getcwd(), "Slices_dataframe.json")
if os.path.isfile(Slices_data_file):
    print("Using the saved data of slices from previous run..")
    df_slice = pd.read_json(Slices_data_file, orient='records')
else:
    print("No previously saved slice data found..")
    no_of_slices = 20
    for index, row in df_files.iterrows():  # df_files is the previous dataframe with image path details
        path = row['image_path']
        slices, slices_thresh, slices_interface = slice_image(path, slice_sq_dim, no_of_slices)
        # each of the outputs is a list of 20 image slices
        for n, arr in enumerate(slices):
            indx = (indx_row - 1) * no_of_slices + n
            df_slice.Sample[indx] = path
            df_slice.Slice_ID[indx] = n + 1
            df_slice.Slice_Array[indx] = arr
            df_slice.Interface_Array[indx] = slices_interface[n]
            df_slice.Slice_Array_Threshold[indx] = slices_thresh[n]
    df_slice.to_json(Slices_data_file, orient='records')
I would like to do the following things:
Complete the dataframe with the possibility to add further columns of scalar values
View the dataframe normally with multiple rows and iterate using functions such as df_slice.iterrows() which is currently not supported
Save and reuse the database so as to avoid the repeated and time-consuming operations
Any advice or better suggestions?
After some while of searching, I found some topics that helped. pd.Series was very appropriate here. Also, I think there was a "SettingWithCopyWarning" that I chose to ignore somewhere in between. The final code is given below:
Slices_data_file = os.path.join(os.getcwd(), "Slices_dataframe.json")
if os.path.isfile(Slices_data_file):
    print("Using the saved data of slices from previous run..")
    df_slice = pd.read_json(Slices_data_file, orient='columns')
else:
    print("No previously saved slice data found..")
    Sample_col = []
    Slice_ID_col = []
    Slice_Array_col = []
    Interface_Array_col = []
    Slice_Array_Threshold_col = []
    no_of_slices = 20
    slice_sq_dim = 200
    df_slice = pd.DataFrame({'Sample': str,
                             'Slice_ID': int,
                             'Slice_Array': [],
                             'Interface_Array': [],
                             'Slice_Array_Threshold': []})
    for index, row in df_files.iterrows():
        path = row['image_path']
        slices, slices_thresh, slices_interface = slice_image(path, slice_sq_dim, no_of_slices)
        for n, arr in enumerate(slices):
            Sample_col.append(Image_Unique_ID)
            Slice_ID_col.append(n + 1)
            Slice_Array_col.append(arr)
            Interface_Array_col.append(slices_interface[n])
            Slice_Array_Threshold_col.append(slices_thresh[n])
        print("Slicing -> ", Image_Unique_ID, " Complete")
    df_slice['Sample'] = pd.Series(Sample_col)
    df_slice['Slice_ID'] = pd.Series(Slice_ID_col)
    df_slice['Slice_Array'] = pd.Series(Slice_Array_col)
    df_slice['Interface_Array'] = pd.Series(Interface_Array_col)
    df_slice['Slice_Array_Threshold'] = pd.Series(Slice_Array_Threshold_col)
    df_slice.to_json(os.path.join(os.getcwd(), "Slices_dataframe.json"), orient='columns')
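One caveat worth noting (my own addition, not part of the original answer): to_json serializes the numpy arrays as nested lists, so after reading the file back the array columns hold plain Python lists. A small sketch of restoring them, assuming the column names above:

import numpy as np
import pandas as pd

df_slice = pd.read_json("Slices_dataframe.json", orient='columns')

# JSON has no native ndarray type, so rebuild the arrays from the nested lists.
for col in ['Slice_Array', 'Interface_Array', 'Slice_Array_Threshold']:
    df_slice[col] = df_slice[col].apply(np.array)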

Create dictionary with count of values from list

I'm trying to figure out how to create a dictionary with the key as the school and values the wins-losses-draws, based on each item in the list. For example, calling my_dict['Clemson'] would return the string "1-1-1"
"
team_score_list = [['Georgia', 'draw'], ['Duke', 'loss'], ['Virginia Tech', 'win'], ['Virginia', 'loss'], ['Clemson', 'loss'], ['Clemson', 'win'], ['Clemson', 'draw']]
The output for the above list should be the following dictionary:
{'Georgia': '0-0-1', 'Duke': '0-1-0', 'Virginia Tech': '1-0-0', 'Virginia': '0-1-0', 'Clemson': '1-1-1'}
For context, the original data comes from a CSV, where each line is in the form of Date,Opponent,Location,Points For,Points Against.
For example: 2016-12-31,Kentucky,Neutral,33,18.
I've managed to wrangle the data into the above list (albeit probably not in the most efficient manner), however just not exactly sure how to get this into the format above.
Any help would be greatly appreciated!
Not beautiful but this should work.
team_score_list = [
    ["Georgia", "draw"],
    ["Duke", "loss"],
    ["Virginia Tech", "win"],
    ["Virginia", "loss"],
    ["Clemson", "loss"],
    ["Clemson", "win"],
    ["Clemson", "draw"],
]

def gen_dict_lst(team_score_list):
    """Generates dict of list based on team record"""
    team_score_dict = {}
    for team_record in team_score_list:
        if team_record[0] not in team_score_dict.keys():
            team_score_dict[team_record[0]] = [0, 0, 0]
        if team_record[1] == "win":
            team_score_dict[team_record[0]][0] += 1
        elif team_record[1] == "loss":
            team_score_dict[team_record[0]][1] += 1
        elif team_record[1] == "draw":
            team_score_dict[team_record[0]][2] += 1
    return team_score_dict

def convert_format(score_dict):
    """formats list to string for output validation"""
    output_dict = {}
    for key, value in score_dict.items():
        new_val = []
        for index, x in enumerate(value):
            if index == 2:
                new_val.append(str(x))
            else:
                new_val.append(str(x) + "-")
        new_str = "".join(new_val)
        output_dict[key] = new_str
    return output_dict

score_dict = gen_dict_lst(team_score_list)
out_dict = convert_format(score_dict)
print(out_dict)
You can first make a dictionary and insert/increment the win, loss and draw counts while iterating over the list. Here I have shown a way that uses variable names identical to the strings used for win, loss and draw, and then increments the corresponding value in the dictionary via globals()[...] (an idea from another answer).
dct = {}
for i in team_score_list:
    draw = 2
    win = 0
    loss = 1
    if i[0] in dct:
        dct[i[0]][globals()[i[1]]] += 1
    else:
        dct[i[0]] = [0, 0, 0]
        dct[i[0]][globals()[i[1]]] = 1
You can then convert each list to a string using '-'.join(...) to get it into the format you want in the dictionary.
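For instance, a small sketch of that final conversion step (my own addition, not shown in the original answer):

# Join each [win, loss, draw] count list into a "W-L-D" string.
result = {team: '-'.join(str(v) for v in counts) for team, counts in dct.items()}
print(result)   # e.g. {'Georgia': '0-0-1', 'Duke': '0-1-0', ...}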
I now get what you mean:
You could do
a = dict()
f = lambda x, s: str(int(m[x] == '1' or j == s))
for (i, j) in team_score_list:
    m = a.get(i, '0-0-0')
    a[i] = f"{f(0,'win')}-{f(2,'draw')}-{f(4,'loss')}"
{'Georgia': '0-1-0',
'Duke': '0-0-1',
'Virginia Tech': '1-0-0',
'Virginia': '0-0-1',
'Clemson': '1-1-1'}
Now, this is an answer only for this example. If you had a lot of data, it would be better to build a list and then join at the end. E.g.
b = dict()
g = lambda x, s: str(int(m[x]) + (j == s))
for (i, j) in team_score_list:
    m = b.get(i, [0, 0, 0])
    b[i] = [g(0, "win"), g(1, "draw"), g(2, "loss")]
{key: '-'.join(val) for key, val in b.items()}
{'Georgia': '0-1-0',
'Duke': '0-0-1',
'Virginia Tech': '1-0-0',
'Virginia': '0-0-1',
'Clemson': '1-1-1'}

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map to (in the labels_df) culture::french, tag::dogs, tag::men. And I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am wayyy off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from @Danny:
sample_train_df['attribute_ids'].apply(lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num'] for i in x])
*please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned:
I hope this is what you are looking for. I am sure there's a much more efficient way using a lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
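For what it's worth, a sketch of that more efficient lookup (my own addition, assuming labels_df has attribute_id and attribute_name columns and that attribute_ids holds lists of ids):

# Build a plain dict once, then map ids to names with it instead of filtering the DataFrame per id.
id_to_name = dict(zip(labels_df['attribute_id'], labels_df['attribute_name']))
df['new_col'] = df['attribute_ids'].apply(lambda ids: [id_to_name[i] for i in ids])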
This is super ugly, and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion. Until then, this is what got me the result I need.
Split train_df['attribute_ids'] into their own cells/columns:
helper_df = train_df['attribute_ids'].str.split(expand=True)
Combine train_df with the helper_df so I have the id column (they are photo id's):
train_df2 = pd.concat([train_df, helper_df], axis=1)
Drop the original attribute_ids column:
train_df2.drop(columns = 'attribute_ids', inplace=True)
Rename the new columns:
train_df2.rename(columns = {0:'attr1', 1:'attr2', 2:'attr3', 3:'attr4', 4:'attr5', 5:'attr6', 6:'attr7', 7:'attr8', 8:'attr9', 9:'attr10', 10:'attr11'})
Convert the labels_df into a dictionary:
def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping
Map and replace the tag numbers with their corresponding tag names:
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
Create a new column of each observation's tags as one concatenated string:
helper1['new_col'] = helper1[helper1.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis = 1)
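As a footnote (my own sketch, not part of the original answers), the same end result can be reached without the intermediate helper columns, assuming attribute_ids is a space-separated string of ids and my_map maps id strings to attribute names:

# Map each space-separated id to its name and re-join into one comma-separated string.
train_df['new_col'] = train_df['attribute_ids'].apply(
    lambda s: ','.join(my_map.get(i, i) for i in s.split())
)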

Retrieving dict value via hardcoded key, works. Retrieving via computed key doesn't. Why?

I'm generating a common list of IDs by comparing two sets of IDs (the ID sets are from a dictionary, {ID: XML "RECORD" element}). Once I have the common list, I want to iterate over it and retrieve the value corresponding to the ID from a dictionary (which I'll write to disc).
When I compute the common ID list using my diff_comm_checker function, I'm unable to retrieve the dict value the ID corresponds to. It doesn't however fail with a KeyError. I can also print the ID out.
When I hard code the ID in as the common_id value, I can retrieve the dict value.
I.e.
common_ids = diff_comm_checker( list_1, list_2, "text")
# does nothing - no failures
common_ids = ['0603599998140032MB']
#gives me:
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE788>
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE3E0>
So I suspected there was some difference between the strings. I checked both the function output and compared it against the hard-coded values using:
print [(_id, type(_id), repr(_id)) for _id in common_ids][0]
I get exactly the same for both:
>>> ('0603599998140032MB', <type 'str'>, "'0603599998140032MB'")
I have also followed the advice of another question and used difflib.ndiff:
common_ids1 = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "text")
common_ids = ['0603599998140032MB']
print "\n".join(difflib.ndiff(common_ids1, common_ids))
>>> 0603599998140032MB
So again, doesn't appear that there's any difference between the two.
Here's a full, working example of the code:
from StringIO import StringIO
import xml.etree.cElementTree as ET
from itertools import chain, islice

def diff_comm_checker(list_1, list_2, text):
    """Checks 2 lists. If no difference, pass. Else return common set between two lists"""
    symm_diff = set(list_1).symmetric_difference(list_2)
    if not symm_diff:
        pass
    else:
        mismatches_in1_not2 = set(list_1).difference( set(list_2) )
        mismatches_in2_not1 = set(list_2).difference( set(list_1) )
        if mismatches_in1_not2:
            mismatch_logger(
                mismatches_in1_not2, "{}\n1: {}\n2: {}".format(text, list_1, list_2), 1, 2)
        if mismatches_in2_not1:
            mismatch_logger(
                mismatches_in2_not1, "{}\n2: {}\n1: {}".format(text, list_1, list_2), 2, 1)
    set_common = set(list_1).intersection( set(list_2) )
    if set_common:
        return sorted(set_common)
    else:
        return "no common set: {}\n".format(text)

def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

def get_elements_iteratively(file):
    """Create unique ID out of image number and case number, return it along with corresponding xml element"""
    tag = "RECORD"
    tree = ET.iterparse(StringIO(file), events=("start", "end"))
    context = iter(tree)
    _, root = next(context)
    for event, record in context:
        if event == 'end' and record.tag == tag:
            xml_element_2 = ''
            xml_element_1 = ''
            for child in record.getchildren():
                if child.tag == "IMAGE_NUMBER":
                    xml_element_1 = child.text
                if child.tag == "CASE_NUM":
                    xml_element_2 = child.text
            r_id = "{}{}".format(xml_element_1, xml_element_2)
            record.set("R", r_id)
            yield (r_id, record)
            root.clear()

def get_chunks(file, chunk_size):
    """Breaks XML into chunks, yields dict containing unique IDs and corresponding xml elements"""
    iterable = get_elements_iteratively(file)
    for chunk in chunks(iterable, chunk_size):
        ids_records = {}
        for k in chunk:
            ids_records[k[0]] = k[1]
        yield ids_records

def create_new_xml(xml_list):
    chunk = 5000
    chunk_rec_ids_1 = get_chunks(xml_list[0], chunk)
    chunk_rec_ids_2 = get_chunks(xml_list[1], chunk)
    to_write = [chunk_rec_ids_1, chunk_rec_ids_2]
    ######################################################################################
    ### WHAT'S GOING ON HERE ??? WHAT'S THE DIFFERENCE BETWEEN THE OUTPUTS OF THESE TWO ? ###
    common_ids = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
    #common_ids = ['0603599998140032MB']
    ######################################################################################
    for _id in common_ids:
        print _id
        for gen_obj in to_write:
            for kv_pair in gen_obj:
                if kv_pair[_id]:
                    print _id, kv_pair[_id].attrib, kv_pair[_id]

if __name__ == '__main__':
    xml_1 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    xml_2 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    create_new_xml([xml_1, xml_2])
The problem is not the type or value of common_ids returned from diff_comm_checker. The problem is that calling diff_comm_checker (or, more precisely, constructing the arguments passed to it) destroys the values in to_write.
If you try this you will see what I mean
common_ids = ['0603599998140032MB']
diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
This will give the erroneous behavior without using the return value from diff_comm_checker()
This is because to_write is a generator and the call to diff_comm_checker exhausts that generator. The generator is then finished/empty when used in the if-statement in the loop. You can create a list from a generator by using list:
chunk_rec_ids_1 = list(get_chunks(xml_list[0], chunk))
chunk_rec_ids_2 = list(get_chunks(xml_list[1], chunk))
But this may have other implications (memory usage...)
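To illustrate the underlying issue (a minimal sketch of my own, not from the original answer): a generator can only be consumed once, so any second pass over it sees nothing.

gen = (n for n in range(3))

print(list(gen))   # [0, 1, 2]  - the first pass consumes the generator
print(list(gen))   # []         - the second pass finds it already exhausted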
Also, what is the intention of this construct in diff_comm_checker?
if not symm_diff:
    pass
In my opinion nothing will happen here regardless of whether symm_diff is empty or not.

networkx - can't remove_node from graph

I have a G = nx.DiGraph() whose nodes and edges are the following:
G.nodes() = ['10.2.110.1', '10.2.25.65', '10.2.94.87', '10.2.20.209', '10.2.6.206', '10.2.94.55', '10.2.182.10', '10.2.94.86', '10.2.20.2', '10.2.20.1', '10.2.94.94']
G.edges() = [('10.2.110.1', '10.2.20.2'), ('10.2.110.1', '10.2.20.1'), ('10.2.25.65', '10.2.6.206'), ('10.2.94.87', '10.2.94.55'), ('10.2.20.209', '10.2.110.1'), ('10.2.94.55', '10.2.20.209'), ('10.2.182.10', '10.2.182.10'), ('10.2.94.86', '10.2.94.87'), ('10.2.20.2', '10.2.25.65'), ('10.2.20.1', '10.2.182.10'), ('10.2.94.94', '10.2.94.86')]
The above produces the following topology.
As you can see, node_94 is green because it is the starting node. Both node_10 and node_206 are the farEnds.
I want to remove nodes from the graph depending on the number of hops away from the farEnds for node_94.
I have this function which tries to remove nodes depending on how far a node is from a given farEnd.
def getHopToNH(G):
    labelList = {}
    nodes = G
    for startNode in nodes.nodes():
        try:
            farInt = nx.get_node_attributes(nodes, 'farInt')[startNode]
        except:
            farInt = 'NA'
        try:
            p = min([len(nx.shortest_path(nodes, source=startNode, target=end)) for end in farInt])
        except:
            p = 0
        if p < 7:
            labelList = {**labelList, **{str(startNode): 'node_' + str(startNode).split(".")[3]}}
        else:
            nodes.remove_node(startNode)
    return labelList, nodes
However, when running that function, I get the following error:
File "trace_1_7.py", line 87, in getHopToNH
for startNode in nodes.nodes():
RuntimeError: dictionary changed size during iteration
The problem arises with nodes.remove_node(startNode). If I remove that line, the code works fine and produces the plot that you can see above.
How can I accomplish the removal based on the number of hops towards a farEnd?
Thanks!
Lucas
Since networkx internally represents graphs using dicts, when iterating over a graph's nodes we are iterating over the keys of a dictionary (this dictionary maps each node to its attributes). Using remove_node changes the size of this dictionary, which is not allowed while we are iterating over its keys, hence the RuntimeError.
To remove nodes, we maintain a list containing the nodes we want to remove, then remove the nodes in this list after the for loop.
def getHopToNH(G):
    labelList = {}
    nodes = G
    nodes_to_remove = []
    for startNode in nodes.nodes():
        try:
            farInt = nx.get_node_attributes(nodes, 'farInt')[startNode]
        except:
            farInt = 'NA'
        try:
            p = min([len(nx.shortest_path(nodes, source=startNode, target=end)) for end in farInt])
        except:
            p = 0
        if p < 7:
            labelList = {**labelList, **{str(startNode): 'node_' + str(startNode).split(".")[3]}}
        else:
            nodes_to_remove.append(startNode)
    nodes.remove_nodes_from(nodes_to_remove)
    return labelList, nodes
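An alternative (my own note, not part of the original answer) is to iterate over a snapshot of the node list, so removals inside the loop are safe:

# list(G.nodes()) copies the node labels up front, so removing nodes from G
# inside the loop no longer changes the dictionary being iterated over.
for startNode in list(G.nodes()):
    if should_remove(startNode):   # should_remove is a hypothetical predicate for illustration
        G.remove_node(startNode)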
