I want to determine the rows in a data frame that has the same value in some special columns (sex, work class, education).
new_row_data=df.head(20)
new_center_clusters =new_row_data.head(20)
for j in range(len(new_center_clusters)):
row=[]
for i in range(len(new_row_data)):
if (new_center_clusters.iloc[j][5] == new_row_data.iloc[i][5]):
if(new_center_clusters.iloc[j][2] == new_row_data.iloc[i][2]):
if(new_center_clusters.iloc[j][3] == new_row_data.iloc[i][3]):
if(new_center_clusters.iloc[j][0] != new_center_clusters.iloc[i][0]):
row.append(new_center_clusters.iloc[j][0])
row.append(new_center_clusters.iloc[i][0])
myset = list(set(row))
myset.sort()
print(myset)
I need a list that includes all the IDs of similar rows in one list. but I can not merge all the lists in one list.
I get this result:
I need to get like this:
[1,12,8,17,3,18,4,19,5,13,6,9]
Thank you in advance.
if you want combine all list
a=[1,3,4]
b=[2,4,1]
a.extend(b)
it will give output as:
[1,3,4,2,4,1]
similary if you want to remove the duplicates, convert it into set and again list:
c=list(set(a))
it will give output as:
[1,3,4,2]
Related
This one is a bit tricky. I have a list: list_a=[5.,4.,2.,6.] and i want to order this list by ascending but also do the same ordering to another list: list_b=[left,up,right,down]. The output should be:
list_a=[2.,4.,5.,6.]
list_b=[right,up,left,down]
In reality the lists are huge and variable but have the same len (list_a is though always number and a dot). I want to copy the ordering of the list_a to list_b.
Thanks!
You can zip and sorted for this,
sorted_list_b = [x for _,x in sorted(zip(list_a,list_b))]
print(sorted(list_a))
print(sorted_list_b)
I am currently trying to write a function to iterate through a nested list and check if one item from the list, 'team', is already in a separate list 'teams'.
If it is not, I want to append a nested list, 'player_values' with a different item from the original nested list that was examined, in the form of a new list in the nested list.
If it is, I want to append the nested list 'player_values' with the item from the original nested list, but I want to add it to the most recent list in the nested list 'player_values' instead of creating a new list.
Currently, my code looks like this :
def teams_and_games(list, player, idx):
teams = []
player_values = []
x = 0
y = -1
for rows in list:
if player == list[x][BD.player_id] and list[x][BD.team] not in teams:
teams.append(list[x][BD.team])
player_values.append([list[x][idx]])
x += 1
y += 1
elif player == list[x][BD.player_id]:
player_values[y].append(list[x][idx])
x += 1
return player_values, teams
However, when I run the code in my main, using
values, teams = teams_and_games(NiceRow, name, BD.games)
print(values)
print(teams)
It only prints empty lists. The fact that it prints empty lists shows that it is returning the correct variables, but I can't figure out why the code in the function is failing to add anything to the lists. I have tried switching the .append with a more simple list += statement, but the result has been the same so far.
Ideally, I would be getting a nested list, containing an amount of lists equal to the number of items added to the other 'teams' list, and the list of teams in the order they were added.
The data I am working with is a nested list pulled from a .csv file, which has been formatted slightly using the .strip() and .split() commands. Each number has been converted to an int, and strings left as they are. The .CSV file it is from has 19 columns and ~80,000 rows, with each column always being either a string or an int.
In this simple code to read a tsv file of many columes:
InColnames = ['Chr','Pos','Ref','Alt']
tsvin = csv.DictReader(fin, delimiter='\t')
for row in tsvin:
print(', '.join(row[InColnames]))
How can I make the print work ?
The following will do:
for row in tsvin:
print(', '.join(row[col] for col in InCOlNames))
You cannot pass a list of keys to the dict's item-lookup and magically get a list of values. You have to somehow iterate the keys and retrieve each one's value individually. The approach at hand uses a generator expression for that.
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list, instead to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options instead looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.column.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
if df_user.shape[0] > 0:
open_combined = df_user_open.iloc[0, spcltycol] # capture 1st row
for row in range(1, df_user.shape[0]): # union with any subsequent rows
open_combined = open_combined.union(df_user.iloc[row, spcltycol])
new_row = df_user.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
new_row.append(open_combined)
out_rows.append(new_row)
# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
columns = ['UserNbr', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows, In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
I have two dataframes that I want to merge based on the values in one Column (SKU). The 'SKU' values are varied. For example, SKU values range from "00047" to "TPA20839". However, they are always alphanumeric.
When the dataframes are read from the csv files, I convert the 'SKU' column to strings so they merge on the same data type. The data merges correctly, EXCEPT when there are overlapping string values.
For example, there is a df_master['SKU'] value = "6748". There are two similar values in df_inv['SKU'] -> "6748" AND "9006748" (two different items, 6748 == 6748 9006784 != 6748).
This causes this row to NOT appear in the new dataframe.
I want it to to EXACT match, similar to the =MATCH('','',0) function in excel. Can you help me achieve this?
df_master['SKU'] = df_master['SKU'].astype(str)
df_inv['SKU'] = df_inv['SKU'].astype(str)
df_new = pd.merge(df_inv, df_master, on='SKU')
df_new.to_csv('new-master.csv', sep=',', encoding='utf-8')
I think the trick may be to format the data type differently, but I'm not sure.
Try this:
vals_matched = []
haystacks = df_inv['SKU'].astype(str).tolist()
needles = df_master['SKU'].astype(str).tolist()
for needle in needles:
for haystack in haystacks:
if needle in haystack:
vals_matched.append(needle)
break
df_master = df_master[df_master.SKU.astype(str).isin(needles)]
The break statement continues to the next needle, that is, the next value you're trying to match. The reason for that is that a single match is sufficient between two lists.