Find number of co-occurring elements between dataframe columns


I have a DataFrame that has a website, categories, and keywords for that website.
Url | categories | keywords
Espn | [sport, nba, nfl] | [half, touchdown, referee, player, goal]
Tmz | [entertainment, sport] | [gossip, celebrity, player]
Goal | [sport, premier_league, champions_league] | [football, goal, stadium, player, referee]
Which can be created using this code:
import pandas as pd

data = [
    {'Url': 'ESPN', 'categories': ['sport', 'nba', 'nfl'],
     'keywords': ['half', 'touchdown', 'referee', 'player', 'goal']},
    {'Url': 'TMZ', 'categories': ['entertainment', 'sport'],
     'keywords': ['gossip', 'celebrity', 'player']},
    {'Url': 'Goal', 'categories': ['sport', 'premier_league', 'champions_league'],
     'keywords': ['football', 'goal', 'stadium', 'player', 'referee']},
]
df = pd.DataFrame(data)
For each word in the keywords column, I want to get the frequency of the categories associated with it. The result might look like this:
{half: {sport: 1, nba: 1, nfl: 1}, touchdown : {sport: 1, nba: 1,
nfl: 1}, referee: {sport: 2, nba: 1, nfl: 1, premier_league: 1,
champions_league:1 }, player: {sport: 3, nba: 1, nfl: 1,
premier_league: 1, champions_league:1 }, gossip: {sport:1,
entertainment:1}, celebrity: {sport:1, entertainment:1}, goal:
{sport:2, premier_league:1, champions_league:1, nba: 1, nfl: 1},
stadium:{sport:1, premier_league:1, champions_league:1} }

Since the columns contain lists, you can explode them to repeat a row once for each element per list:
result = (
    df.explode("keywords")
    .explode("categories")
    .groupby(["keywords", "categories"])
    .size()
)
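If the nested dictionary from the question is the end goal, the same tally can also be built without pandas. The following is a minimal sketch of that alternative, not the answer's method: it assumes the raw `data` list from the question is available and uses `collections.Counter`; the names `cooccurrence` and `nested` are illustrative only.

```python
from collections import Counter, defaultdict

# Same sample rows as in the question.
data = [
    {"Url": "ESPN", "categories": ["sport", "nba", "nfl"],
     "keywords": ["half", "touchdown", "referee", "player", "goal"]},
    {"Url": "TMZ", "categories": ["entertainment", "sport"],
     "keywords": ["gossip", "celebrity", "player"]},
    {"Url": "Goal", "categories": ["sport", "premier_league", "champions_league"],
     "keywords": ["football", "goal", "stadium", "player", "referee"]},
]

# For every keyword, count each category that appears in the same row.
cooccurrence = defaultdict(Counter)
for row in data:
    for keyword in row["keywords"]:
        cooccurrence[keyword].update(row["categories"])

# Convert to plain dicts to match the shape asked for in the question.
nested = {keyword: dict(counts) for keyword, counts in cooccurrence.items()}
```

Each row contributes one count per (keyword, category) pair, which is the same pairing that the two explode calls produce.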

Related

Restructure TSV to list of list of dicts

A simplified look at my data right at parse:
[
{'id':'group1'},
{'id':'member1', 'parentId':'group1', 'size':51},
{'id':'member2', 'parentId':'group1', 'size':16},
{'id':'group2'},
{'id':'member1', 'parentId':'group2', 'size':21},
...
]
The desired output should be like this:
data =
[
[
{'id':'group1'},
{'id':'member1', 'parentId':'group1', 'size':51},
{'id':'member2', 'parentId':'group1', 'size':16}
],
[
{'id':'group2'},
{'id':'member1', 'parentId':'group2', 'size':21},
]
]
The difficulty is that the groups vary in length (some might have 10 members, some might have 3), so it is unclear where each sublist should begin and end. The records are also not uniform: some have only an 'id' entry and no 'parentId' or 'size' entries.
master_data = []
for i in range(len(tsv_data)):
temp = {}
for j in range(?????):
???
How can Python handle arranging vanilla .tsv data into a list of lists as seen above?
I thought a sensible first step was to tally something simple before tackling the whole data set. So I attempted to count all occurrences of group1, based on this discussion:
group_counts = {}
for member in data:
    group = member.get('group1')
    try:
        group_counts[group] += 1
    except KeyError:
        group_counts[group] = 1
However, this returned:
'list' object has no attribute 'get'
Which leads me to believe that counting text occurrences may not be the solution after all.

You could first collect all the groups to build the new data structure, then add each item to its group:
data = [
{
'id': 'group1'
}, {
'id': 'member1',
'parentId': 'group1',
'size': 51
}, {
'id': 'member2',
'parentId': 'group1',
'size': 16
}, {
'id': 'group2'
}, {
'id': 'member1',
'parentId': 'group2',
'size': 21
}, {
'id': 'member3',
'parentId': 'group1',
'size': 16
}
]
result = {}  # Use a dict for easier grouping.
# extract all groups
for dct in data:
    if 'group' in dct['id']:
        result[dct['id']] = [dct]
# extract all items and add to groups
for dct in data:
    if 'parentId' in dct:
        result[dct['parentId']].append(dct)
# keep only the grouped lists, in insertion order
nestedListResult = list(result.values())
Out:
[
[
{
'id': 'group1'
}, {
'id': 'member1',
'parentId': 'group1',
'size': 51
}, {
'id': 'member2',
'parentId': 'group1',
'size': 16
}, {
'id': 'member3',
'parentId': 'group1',
'size': 16
}
], [{
'id': 'group2'
}, {
'id': 'member1',
'parentId': 'group2',
'size': 21
}]
]
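The two passes above can also be folded into one. This is a minimal sketch under two assumptions that go beyond the original answer: a record is treated as a group header when it has no 'parentId' key (rather than when its id contains the text 'group'), and the helper name `group_records` is made up for illustration.

```python
def group_records(records):
    """Group flat records into one list per group, header first."""
    groups = {}
    for rec in records:
        parent = rec.get("parentId")
        if parent is None:
            # Group header: put it at the front, even if members came first.
            groups.setdefault(rec["id"], []).insert(0, rec)
        else:
            # Member: append to its parent's list, creating it if needed.
            groups.setdefault(parent, []).append(rec)
    return list(groups.values())


records = [
    {"id": "group1"},
    {"id": "member1", "parentId": "group1", "size": 51},
    {"id": "member2", "parentId": "group1", "size": 16},
    {"id": "group2"},
    {"id": "member1", "parentId": "group2", "size": 21},
    {"id": "member3", "parentId": "group1", "size": 16},
]
grouped = group_records(records)
```

Keying the dictionary on the group id means members do not have to sit directly after their header in the input.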

group dictionaries and get count

I have a list of dictionaries like this:
list1 = [{'name': 'maik','is_payed': 1, 'brand': 'HP', 'count': 1, 'items': [{'device': 'mouse', 'count': 110}]},{'name': 'milanie','is_payed': 0, 'brand': 'dell', 'count':10, 'items': [{'device': 'bales', 'count': 200}]}]
list2 = [{'name': 'maik','is_payed': 0, 'brand': 'HP', 'count': 20, 'items': [{'device': 'mouse', 'count': 1}]},{'name': 'nikola','is_payed': 1, 'brand': 'toshiba', 'count':10, 'items': [{'device': 'hard', 'count': 20}]}]
my_list= list1 + list2
count = pd.DataFrame(my_list).groupby(['name', 'is_payed'])
final_list_ = []
for commande, group in count:
    print(commande)
    records = group.to_dict("records")
    final_list_.append({"name": commande[0],
                        "payed": commande[1],
                        "occurrence": len(group),
                        "items": pd.DataFrame(records).groupby('device').agg(
                            occurrence=('device', 'count')).reset_index().to_dict('records')})
I don't know how to get the result I want.
The 'payed' field should have the form payed/total_commands.
For example, take maik: he has two commands, one payed and one not, so his final result would look like this:
{'name': 'maik','payed': 1/2, 'brand': 'HP', 'count': 21, 'items': [{'device': 'mouse', 'count': 111}]}

Since you just want to group by "name" and are only interested in the "is_payed" values, let's concentrate on that and ignore the other data.
So for our purposes, your starting data looks like:
my_list = [
{'name': 'maik', 'is_payed': 1},
{'name': 'milanie', 'is_payed': 0},
{'name': 'maik', 'is_payed': 0},
{'name': 'nikola', 'is_payed': 1}
]
Now let's take a first pass over this data, counting how many times we see each name and summing that name's "is_payed" flags:
results = {}
for item in my_list:
    key = item["name"]
    # the default keys must match the ones updated below
    results.setdefault(key, {"is_payed": 0, "count": 0})
    results[key]["count"] += 1
    results[key]["is_payed"] += item["is_payed"]
At this point we have a dictionary that will look like:
{
'maik': {'is_payed': 1, 'count': 2},
'milanie': {'is_payed': 0, 'count': 1},
'nikola': {'is_payed': 1, 'count': 1}
}
Now we will take a pass over this dictionary and create our true final result:
results = [
{"name": key, "payed": f"{value['is_payed']}/{value['count']}"}
for key, value in results.items()
]
Giving us:
[
{'name': 'maik', 'payed': '1/2'},
{'name': 'milanie', 'payed': '0/1'},
{'name': 'nikola', 'payed': '1/1'}
]
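The question's expected output also sums the top-level 'count' and merges the per-device counts in 'items'. The same one-pass idea can be extended to cover that; the following is only a sketch, assuming each name always has the same 'brand' (the first one seen is kept) and using a made-up helper name, `merge_orders`.

```python
from collections import defaultdict

list1 = [
    {'name': 'maik', 'is_payed': 1, 'brand': 'HP', 'count': 1,
     'items': [{'device': 'mouse', 'count': 110}]},
    {'name': 'milanie', 'is_payed': 0, 'brand': 'dell', 'count': 10,
     'items': [{'device': 'bales', 'count': 200}]},
]
list2 = [
    {'name': 'maik', 'is_payed': 0, 'brand': 'HP', 'count': 20,
     'items': [{'device': 'mouse', 'count': 1}]},
    {'name': 'nikola', 'is_payed': 1, 'brand': 'toshiba', 'count': 10,
     'items': [{'device': 'hard', 'count': 20}]},
]


def merge_orders(records):
    """Merge records per name: paid/total ratio, summed counts, merged items."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(rec['name'], {
            'brand': rec['brand'],  # assumption: one brand per name
            'paid': 0, 'orders': 0, 'count': 0,
            'items': defaultdict(int),
        })
        entry['paid'] += rec['is_payed']
        entry['orders'] += 1
        entry['count'] += rec['count']
        for item in rec['items']:
            entry['items'][item['device']] += item['count']
    return [
        {'name': name,
         'payed': f"{e['paid']}/{e['orders']}",
         'brand': e['brand'],
         'count': e['count'],
         'items': [{'device': d, 'count': c} for d, c in e['items'].items()]}
        for name, e in merged.items()
    ]


merged_orders = merge_orders(list1 + list2)
```

As in the answer above, the ratio is kept as a string such as '1/2' rather than evaluated to 0.5.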

Mongoose/Node: selecting element WITHOUT field/column name

So I have a document like this:
datatable: [{
data:[
["ABC", 123, 10, 1],
["ABC", 121, 10, 1],
["DDE", 13, 10, 1],
["OPP", 523, 10, 1]
]
}]
I want to select with a parameter "ABC" and have it return only the arrays containing "ABC", like this:
datatable: [{
data:[
["ABC", 123, 10, 1],
["ABC", 121, 10, 1]
]
}]
I'm starting with this code:
router.get("/", (req, res) => {
  model.find({}).then(val => {
    res.send(val)
  })
})
I can't find a way to match the value without a field name.
I tried using $elemMatch. The other approaches need a column name to match against the value.

How to create nested list in python with oop?

I am trying to write a shopping program in Python, so I need to categorize shopping items under either a default category or a new category that the user adds, like below:
1- The user can add categories and items, and also update them.
shop = [category1[ [item name : apple , count : 2 , price:1$],[item name :orange , count :2 , price:3]],category2[[item name : spoon , count :2 , price :3],[item name :fork , count :4 , price:5]]]

You may be better off using a dictionary for the data:
shop = {
'category1': {
'apple': { 'count': 2, 'price': 1 },
'orange': { 'count': 2, 'price': 3 }
},
'category2': {
'spoon': { 'count': 2, 'price': 3 },
'fork': { 'count': 4, 'price': 5 }
}
}
You can still iterate over the keys if you want, and it provides sensible nesting because you can access the named keys instead of indexes.
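Since the question asks for an object-oriented approach, the dictionary can be wrapped in a small class. This is a sketch only: the class name `Shop` and its methods (`add_category`, `add_item`, `update_item`) are hypothetical and not part of the original answer.

```python
class Shop:
    """Nested dict of category -> item name -> {'count', 'price'}."""

    def __init__(self):
        self.categories = {}

    def add_category(self, name):
        # Create the category if it does not exist yet.
        self.categories.setdefault(name, {})

    def add_item(self, category, name, count, price):
        # Adding an item also creates its category when needed.
        self.add_category(category)
        self.categories[category][name] = {'count': count, 'price': price}

    def update_item(self, category, name, **changes):
        # Raises KeyError if the category or item is unknown.
        self.categories[category][name].update(changes)


shop = Shop()
shop.add_item('category1', 'apple', count=2, price=1)
shop.add_item('category1', 'orange', count=2, price=3)
shop.add_item('category2', 'spoon', count=2, price=3)
shop.update_item('category1', 'apple', count=5)
```

After these calls, `shop.categories` has the same shape as the `shop` dictionary above, so lookups such as `shop.categories['category1']['apple']['count']` work by name.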

How to filter out the same keys that you $slice in mongodb

I have a collection like this:
[
{_id: "1234", myId: 1, a: [1,2,3], b: [23,3,2], c: [3,234,4], ...},
{_id: "5678", myId: 2, a: [21,32,3], b: [32,123,4], c: [32,32,12], ...},
{_id: "3242", myId: 3, a: [21,23,2], b: [12,2,32], c: [12,213,1], ...}
]
There are many more arrays in each document, and each of these arrays is much larger. I want to apply the $slice projection to two of my desired keys (to retrieve the 50 latest values) and have only those two keys returned, using mongo js.
I know how to do the two tasks separately, but I can't figure out how to combine them.
So using {a: {$slice: -50}, b: {$slice: -50}} as the projection in find() returns the last 50 entries, and {a: 1, b: 1} returns only these two keys from the find() result.
How can I do both simultaneously?

You could try adding a dummy field in your projection:
db.test.find({}, {a: {$slice: -2}, b: {$slice: -2}, dummy: 1, _id: 0})
Returns
/* 0 */
{
"a" : [ 2, 3 ],
"b" : [ 3, 2 ]
}
/* 1 */
{
"a" : [ 32, 3 ],
"b" : [ 123, 4 ]
}
/* 2 */
{
"a" : [ 23, 2 ],
"b" : [ 2, 32 ]
}
