Remove duplicates from a list in pandas - python-3.x

I have a list like this:
['35UP\nPLx', '35UP']
I need a list of unique elements:
['PLx', '35UP']
I have tried this:
veh_line = list(dict.fromkeys(filter['p_Mounting_Location'].replace('\n',',', regex=True).tolist()))

One approach is to split each entry with str.splitlines and collect the pieces in a set.
Example:
data = ['35UP\nPLx', '35UP']
result = list(set(j for i in data for j in i.splitlines()))
print(result)
Output (the order may vary, because a set is unordered):
['35UP', 'PLx']
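If the order of first appearance matters, a set is not enough. A minimal sketch, assuming the same data as above, that uses dict.fromkeys instead of set (dicts keep insertion order since Python 3.7):

```python
data = ['35UP\nPLx', '35UP']

# Split every entry on line breaks, then drop repeats while keeping
# the order in which each value was first seen.
result = list(dict.fromkeys(part for item in data for part in item.splitlines()))
print(result)  # ['35UP', 'PLx']
```

This is a swap of dict.fromkeys for set, not part of the original answer; use whichever fits the ordering requirement.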

Related

How to create a dataframe from extracted hashtags?

I used the code below to extract hashtags from tweets.
def find_tags(row_string):
    tags = [x for x in row_string if x.startswith('#')]
    return tags
df['split'] = df['text'].str.split(' ')
df['hashtags'] = df['split'].apply(lambda row : find_tags(row))
df['hashtags'] = df['hashtags'].apply(lambda x : str(x).replace('\\n', ',').replace('\\', '').replace("'", ""))
df.drop('split', axis=1, inplace=True)
df
However, when I count them with the code below, the output counts each character instead of each hashtag.
from collections import Counter
d = Counter(df.hashtags.sum())
data = pd.DataFrame([d]).T
data
Output I am getting is:
I think the problem lies with the code that I am using to extract hashtags. But I don't know how to solve this issue.
Change find_tags so that the split and the replace calls happen inside the list comprehension, which keeps each row as a list of tags instead of a string. To count the values, use Series.explode with Series.value_counts:
def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]
df['hashtags'] = df['text'].apply(find_tags)
and then:
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
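Why the original count went wrong can be shown without pandas. In the question each row was converted with str(x), so summing the column most likely concatenated the strings, and Counter then iterated over characters. A minimal sketch, with made-up row values standing in for the hashtags column:

```python
from collections import Counter

# Hypothetical per-row values standing in for df['hashtags'].
as_strings = ["[#a, #b]", "[#a]"]    # rows stored as strings (the original problem)
as_lists = [["#a", "#b"], ["#a"]]    # rows stored as lists of tags (the fix)

# Counting over concatenated strings yields one entry per character.
char_counts = Counter("".join(as_strings))

# Counting over flattened lists yields one entry per hashtag.
tag_counts = Counter(tag for tags in as_lists for tag in tags)

print(char_counts["#"])  # 3
print(tag_counts)        # Counter({'#a': 2, '#b': 1})
```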

How to get index based on value in pyspark list

I have a list like below
[[[Row(cola=53273831, colb=1197), Row(cola=15245438, colb=1198)], [Row(cola=53273831, colb=1198)]]]
Here I want to search for a particular element and get the index value of it. Ex:
mylist.index((([['53273831', '1198']])))
should give me index as 1. But I'm getting error
ValueError: [['53273831', '1198']] is not in list.
This is the code I'm using:
from pyspark.sql.functions import collect_list, struct
df2 = df.groupBy("order").agg(collect_list(struct(["id", "node_id"])).alias("res"))
newrdd = df2.rdd.map(lambda x: (x))
order_info = newrdd.collectAsMap()
dict_values = list(order_info.values())
dict_keys = list(order_info.keys())
a = [[53273831, 1198]]
k2 = dict_keys[dict_values.index(a)]  # This line gives me the error: ValueError: [['53273831', '1198']] is not in list
order_info dict looks like this
{10160700: [Row(id=53273831, node_id=1197), Row(id=15245438, node_id=1198)], 101600201: [Row(id=53273831, node_id=1198)]}
Can you please help me to get the index value from this struct type list?
Each element is a list of Row objects, not a list of lists, so you need to search for a list containing the Row object. You should also take the index from mylist[0], because mylist is nested one level deeper.
from pyspark.sql import Row
mylist = [[[Row(cola=53273831, colb=1197), Row(cola=15245438, colb=1198)], [Row(cola=53273831, colb=1198)]]]
idx = mylist[0].index([Row(cola=53273831, colb=1198)])
This gives you an index of 1.
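For the order_info dictionary in the question, the key can also be looked up directly, without building two parallel lists. A minimal sketch that runs without Spark: the namedtuple below is only a stand-in for pyspark.sql.Row (an assumption made for illustration), and matching_key is a hypothetical name.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark (an assumption).
Row = namedtuple("Row", ["id", "node_id"])

order_info = {
    10160700: [Row(53273831, 1197), Row(15245438, 1198)],
    101600201: [Row(53273831, 1198)],
}

# The value to search for must be a list of Row objects, like the dict values.
target = [Row(53273831, 1198)]

# Scan key/value pairs and return the first key whose value matches.
matching_key = next((key for key, rows in order_info.items() if rows == target), None)
print(matching_key)  # 101600201
```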

Python3 filter/remove strings of a list

I grabbed some data with my crawler and now I have to analyse it.
I have a list with links to some pictures, but I only want to keep the ones without "/Thumbnails".
pictures = ['Media/Shop/922180cruv.jpg', 'Media/Shop/922180cruvdet.jpg', 'Media/Shop/Thumbnails/922180cruvx320x240.jpg', 'Media/Shop/Thumbnails/922180cruvdetx320x240.jpg']
Is there a way to say "if a string contains /Thumbnails, remove it from the list"?
I have a CSV with ~25000 lists in separate columns.
I already tried remove() and indexing, but my lists differ, so the index that works for the first list won't fit the second list, for example.
Use a filtering list comprehension:
pictures = ['Media/Shop/922180cruv.jpg', 'Media/Shop/922180cruvdet.jpg', 'Media/Shop/Thumbnails/922180cruvx320x240.jpg', 'Media/Shop/Thumbnails/922180cruvdetx320x240.jpg']
pictures = [p for p in pictures if "/Thumbnails/" not in p]
print(pictures)
Prints:
['Media/Shop/922180cruv.jpg', 'Media/Shop/922180cruvdet.jpg']
You can use Python's split and join as below. Note that this keeps every entry and only strips the "Thumbnails/" segment from the path, rather than dropping the thumbnail entries.
pictures = ['Media/Shop/922180cruv.jpg', 'Media/Shop/922180cruvdet.jpg',
            'Media/Shop/Thumbnails/922180cruvx320x240.jpg',
            'Media/Shop/Thumbnails/922180cruvdetx320x240.jpg']
pictures2 = []
for pic in pictures:
    if "Thumbnails" in pic:
        # Cut the path at "Thumbnails/" and glue the parts back together.
        pic_list = pic.split("Thumbnails/")
        pic = "".join(pic_list)
        pictures2.append(pic)
    else:
        pictures2.append(pic)
print(pictures2)
Output:
['Media/Shop/922180cruv.jpg', 'Media/Shop/922180cruvdet.jpg', 'Media/Shop/922180cruvx320x240.jpg', 'Media/Shop/922180cruvdetx320x240.jpg']
You can iterate through the list and check whether each string contains the substring /Thumbnails:
result = []
pictures = ['Media/Shop/922180cruv.jpg', 'Media/Shop/922180cruvdet.jpg',
            'Media/Shop/Thumbnails/922180cruvx320x240.jpg',
            'Media/Shop/Thumbnails/922180cruvdetx320x240.jpg']
for picture in pictures:
    # str.find returns -1 when the substring is absent.
    if picture.find('/Thumbnails') == -1:
        result.append(picture)
print(result)
Output:
['Media/Shop/922180cruv.jpg', 'Media/Shop/922180cruvdet.jpg']
Demo: https://onlinegdb.com/rkAo1EL4w
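Since the question mentions many lists of different lengths, the same membership test can be wrapped in a function and applied to each list in turn, which avoids relying on fixed indexes. A minimal sketch; drop_thumbnails and the sample rows are hypothetical names and data, not taken from the question's CSV:

```python
def drop_thumbnails(paths):
    """Return only the paths that do not point into a Thumbnails folder."""
    return [p for p in paths if "/Thumbnails/" not in p]

# Hypothetical rows, each holding its own list of picture paths.
rows = [
    ['Media/Shop/a.jpg', 'Media/Shop/Thumbnails/ax320x240.jpg'],
    ['Media/Shop/Thumbnails/bx320x240.jpg', 'Media/Shop/b.jpg', 'Media/Shop/bdet.jpg'],
]

# Filter every list independently; the lengths do not need to match.
cleaned = [drop_thumbnails(row) for row in rows]
print(cleaned)  # [['Media/Shop/a.jpg'], ['Media/Shop/b.jpg', 'Media/Shop/bdet.jpg']]
```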

How to merge lists from a loop in jupyter?

I want to find the rows in a data frame that have the same values in certain columns (sex, work class, education).
new_row_data = df.head(20)
new_center_clusters = new_row_data.head(20)
for j in range(len(new_center_clusters)):
    row = []
    for i in range(len(new_row_data)):
        if (new_center_clusters.iloc[j][5] == new_row_data.iloc[i][5]):
            if (new_center_clusters.iloc[j][2] == new_row_data.iloc[i][2]):
                if (new_center_clusters.iloc[j][3] == new_row_data.iloc[i][3]):
                    if (new_center_clusters.iloc[j][0] != new_center_clusters.iloc[i][0]):
                        row.append(new_center_clusters.iloc[j][0])
                        row.append(new_center_clusters.iloc[i][0])
    myset = list(set(row))
    myset.sort()
    print(myset)
I need a single list that includes the IDs of all similar rows, but I cannot merge the separate lists into one.
I get this result:
I need a result like this:
[1,12,8,17,3,18,4,19,5,13,6,9]
Thank you in advance.
If you want to combine the lists, use extend:
a = [1, 3, 4]
b = [2, 4, 1]
a.extend(b)
After this, a is:
[1, 3, 4, 2, 4, 1]
Similarly, if you want to remove the duplicates, convert the list to a set and back to a list:
c = list(set(a))
This gives the unique values. A set does not preserve order, so the result is typically:
[1, 2, 3, 4]
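Applied to a loop, the idea is to create one list before the loop, extend it on every iteration, and remove duplicates once at the end. A minimal sketch, where per_iteration is made-up data standing in for the lists printed inside the loop in the question; dict.fromkeys is used instead of set so the first-seen order is kept, as in the desired output:

```python
# Hypothetical lists, one per loop iteration (not the question's real data).
per_iteration = [[1, 12], [8, 17], [1, 12], [3, 18]]

merged = []
for ids in per_iteration:
    # Accumulate into one list instead of printing each list separately.
    merged.extend(ids)

# Drop duplicates while keeping the order of first appearance.
unique_in_order = list(dict.fromkeys(merged))
print(unique_in_order)  # [1, 12, 8, 17, 3, 18]
```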

Extract just one element from a list and write it to a CSV under another name

I have a list:
IDs = ["111111111111", "222222222222"]
and I create a CSV with this code:
for acc in IDs:
    with open("/tmp/test.csv", "a+") as f:
        test = csv.writer(f)
        test.writerow([IDs])
The result is:
{'111111111111', '222222222222'}
What I want to do is something like:
if IDs == "111111111111":
    IDs = "AccountA"
elif IDs == "222222222222":
    IDs = "AccountB"
Expected result in the CSV:
Account A
some information about account a i put later on it
Account B
some information about account a i put later on it
How can I achieve the result?
You could use a dictionary that holds all of your data. The keys on the left are your inputs, and the values on the right are the data you want to write. For this case, take a look at this dictionary:
data = {
    '111111111111': 'AccountA',
    '222222222222': 'AccountB'
}
Then loop over your list and build a new list of names by looking up each ID in the dictionary.
new_ids = []
for x in IDs:
    new_ids.append(data[x])
Now you can pass the new_ids list to your write function.
Hope it helps.
Sincerely, Chris Fowl.
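Putting the lookup and the writing together might look like the sketch below. It is only an illustration under a few assumptions: the file goes to a temporary directory rather than /tmp/test.csv, the names dictionary and path variable are hypothetical, and unknown IDs fall back to the raw ID instead of raising a KeyError.

```python
import csv
import os
import tempfile

IDs = ["111111111111", "222222222222"]
names = {"111111111111": "AccountA", "222222222222": "AccountB"}

# Temporary location used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "test.csv")

# Open the file once, outside the loop, and write one row per account.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for acc in IDs:
        # Write the mapped name for this single ID, not the whole list.
        writer.writerow([names.get(acc, acc)])

# Read the file back to check what was written.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)  # [['AccountA'], ['AccountB']]
```

Writing [acc] (or its mapped name) instead of [IDs] is what keeps each row to a single account.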
