I'm trying to set up an NSFetchRequest for an entity Location with properties like country and city:
country | city
------- | -------------
Germany | Berlin
USA     | San Francisco
USA     | New York
Germany | Munich
Germany | Munich
USA     | San Francisco
Germany | Stuttgart
The NSFetchRequest should return the country (or a Location object with the appropriate country) and the number of distinct cities:
[
{ country: 'Germany', cityCount: 3 },
{ country: 'USA', cityCount: 2 }
]
I know that I can just fetch all entries and 'count it myself' but I am interested in how to set up an appropriate fetch request (or if it's possible to do so) and would like to see how you would do it! :)
The correct answer to this question is to refactor the data model to avoid redundancy.
The country strings are repeated unnecessarily in the table, and they make a simple query gratuitously complicated. The model should reflect your data, and writing out "USA" for every American city is neither smart nor efficient.
Your data model should look like this:
Country <----->> City
Now you can just fetch all Country objects and get each country's city count with cities.count.
I had to resort to two separate fetches in order to achieve (I think) what you want. The first fetch gets objectIDs for one object for each distinct combination of country and city. The second fetch is filtered, using an IN predicate, to just these objects. It uses NSExpression and propertiesToGroupBy to get the counts for each country:
// Step 1, get the object IDs for one object for each distinct country and city
var objIDExp = NSExpression(expressionType: NSExpressionType.EvaluatedObjectExpressionType)
var objIDED = NSExpressionDescription()
objIDED.expression = objIDExp
objIDED.expressionResultType = .ObjectIDAttributeType
objIDED.name = "objID"
var fetch = NSFetchRequest(entityName: "Location")
fetch.propertiesToFetch = [objIDED]
fetch.propertiesToGroupBy = ["country", "city"]
fetch.resultType = .DictionaryResultType
let results = self.managedObjectContext!.executeFetchRequest(fetch, error: nil)
// extract the objectIDs into an array...
let objIDArray = (results! as NSArray).valueForKey("objID") as! [NSManagedObjectID];
// Step 2, count using GROUP BY
var countExp = NSExpression(format: "count:(SELF)")
var countED = NSExpressionDescription()
countED.expression = countExp
countED.expressionResultType = .Integer32AttributeType // the count is an integer, not an object ID
countED.name = "count"
var newFetch = NSFetchRequest(entityName: "Location")
newFetch.predicate = NSPredicate(format: "SELF IN %@", objIDArray)
newFetch.propertiesToFetch = ["country", countED]
newFetch.propertiesToGroupBy = ["country"]
newFetch.resultType = .DictionaryResultType
let newResults = self.managedObjectContext!.executeFetchRequest(newFetch, error: nil)
println("\(newResults!)")
One caveat: this can be inefficient. If you have a large number of distinct countries and cities, the IN predicate will slow things down; you might find it more efficient to fetch everything and count it yourself.
I have a query in my workbook that has the counts of successful and failed API calls. What I want to do is allow the user to click on a line within the grid and grab two values from that row as parameters, in order to use them to display all the detail lines in another query.
I found the following "grid row click" example, and I thought it was perfect, but for some reason it is not populating the parameters for the detail query.
Set up a grid row click
This is my count query:
let funcRequests = requests
| where (cloud_RoleName contains "xxxxxapi-dev")
| project cloud_RoleName, name, success, funcId = tostring(customDimensions.InvocationId);
let funcExceptions = exceptions
| extend funcId = tostring(customDimensions.InvocationId), errorMessage = customDimensions.FormattedMessage
| project funcId, errorMessage
| join
(
funcRequests
) on funcId
| project funcId, cloud_RoleName, name, success, errorMessage;
funcRequests
| join kind=leftouter
(
funcExceptions
) on funcId
| summarize totals = count(), successes = countif(success == "True" and errorMessage == ""), failures = countif(success <> "True" or errorMessage <> "") by
service = cloud_RoleName, name
| project Site=service,
["Operation Name"] = name,
["Count"] = totals ,
["Success"] = successes,
["Failure"] = failures,
Status = iif(failures>0,"❌","✔️");
I set up two export parameters in the Advanced Settings tab, one for "service" and one for "name".
I then added the parameter names to my detail query and got a "Query could not run because some parameters are not set. Please set _service, _name"
This is the detail query:
let funcRequests = requests
| where (cloud_RoleName == '{_service}' and operation_Name == '{_name}')
| project cloud_RoleName, name, success, funcId = tostring(customDimensions.InvocationId), details = itemId;
let funcExceptions = exceptions
| extend funcId = tostring(customDimensions.InvocationId), errorMessage = customDimensions.FormattedMessage
| project funcId, errorMessage
| join
(
funcRequests
) on funcId
| project funcId, cloud_RoleName, name, success, errorMessage;
funcRequests
| join kind=leftouter
(
funcExceptions
) on funcId
| project Site=cloud_RoleName,
["Operation Name"] = name,
Status = iif(success=="False","❌","✔️"),
["Details"]=details;
Obviously, I'm missing something, but I'm not sure what it is. Any suggestions would be appreciated!
You have the workbook set up to export a column named "service" as the parameter _service and a column named "name" as the parameter _name,
but you don't have columns with those names.
The projection at the end of your query:
project Site=service,
["Operation Name"] = name,
renamed the "service" column to "Site" and the "name" column to "Operation Name"
so when you select a row, it does exactly what you said, and exports no values for those parameters because they have no values in those rows.
If I change the query to also add columns with the right names, for example by projecting service and name again at the end of the query alongside the renamed display columns, I get the right results.
I have two dataframes. One (df) has an address column containing free-form, human-entered university address strings, and I want to extract city information from it. I use a master city dataset (world_countries) to identify which cities are present in the string and put the match in a new column.
Initially, I attempted this join with contains(), which gave me bad values, as it matches even partial words and lookalikes, like below:
df = df.join(world_countries, df.address.contains(world_countries.city), how='left')
This gave me spurious matches on partial words.
I understand what I need: I need to join on whole words in the strings, so maybe something like a regexp will help. But I don't have a fixed pattern to match on; the match needs to be against an entire column of a dataframe.
I could go the long way around and split the address into multiple columns, then left join on each one to find the match, but that's a lot of work and doesn't seem programmatically correct either.
Please suggest a better way to do this, especially by joining on a regexp against the entire column. Here is the code:
# Note: these imports are assumed; the original snippet did not include them
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.window import Window

def pubmed2(pubmed_clean, worldcities):
    df = pubmed_clean.selectExpr("AffiliationInfo as address").filter(F.col('address').isNotNull()).distinct()
    # Number the rows so the sample can be limited to 100 addresses
    w = Window.orderBy(lit('A'))
    df = df.withColumn("row_num", row_number().over(w))
    df = df.filter(F.col("row_num") <= 100)
    # Lower-case all columns in both dataframes so the matching is case-insensitive
    df = df.select([F.lower(col(c)).alias(c) for c in df.columns])
    world_cities = worldcities.select([F.lower(col(c)).alias(c) for c in worldcities.columns])
    df = udf(df, world_cities)  # note: this helper name shadows pyspark.sql.functions.udf
    final = df.dropDuplicates()
    return final

def udf(df, world_cities):
    # Left join on substring containment: first countries, then cities
    world_countries = world_cities.selectExpr("country").distinct()
    df = df.join(world_countries, df.address.contains(world_countries.country), how='left')
    world_countries = world_cities.selectExpr("city").distinct()
    df = df.join(world_countries, df.address.contains(world_countries.city), how='left')
    return df
Sample data:
Centre for Primary Health Care, University of Basel, Kantonsspital Baselland, Rheinstrasse 26, 4410, Liestal, Switzerland.
Department of Family Medicine, Healthcare System Gangnam Center, Seoul National University Hospital, Seoul, 06236, Republic of Korea.
Department of Medical-Surgical Nursing, School of Nursing & Midwifery, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Department of Nursing, Federal University of the Valleys of Jequitinhonha and Mucuri, Diamantina, Minas Gerais, Brazil. Laboratory of Bioengineering, Federal University of Minas Gerais, Belo Horizonte,Minas Gerais, Brazil.
Sensory Science Centre, Division of Food, Nutrition and Dietetics, School of Biosciences, The University of Nottingham, Sutton Bonington Campus, Leicestershire, LE12 5RD, UK.
University of Michigan Medical School, Ann Arbor, MI, USA.
This works:

# Assumed imports for this snippet (spark is an existing SparkSession)
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ["Centre for Primary Health Care, University of Basel, Kantonsspital Baselland, Rheinstrasse 26, 4410, Liestal, Switzerland", "1"],
    ["Department of Family Medicine, Healthcare System Gangnam Center, Seoul National University Hospital, Seoul, 06236, Republic of Korea", "2"],
    ["Intentional Bad Record(Seoul), Korea", "3"]]).toDF("address", "row_num")

world_countries = spark.createDataFrame([["Seoul", "Republic of Korea"],
                                         ["Liestal", "Switzerland"]]).toDF("City", "Countries")

# Strip spaces, then split on commas, so each address token can be compared for exact equality
df = df.withColumn("address", F.regexp_replace(F.col("address"), " ", "")) \
       .withColumn("asArray", F.split("address", ","))

# array_contains only matches whole tokens, so "Seoul" buried inside a longer token is not matched
df.join(world_countries, F.array_contains(df["asArray"], world_countries["City"]), how='left').show()
As you can see, the record with row_num=3 contains the word "Seoul" inside a longer token, so it won't be matched with Seoul as a city during the join.
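For the regexp-based join the question asks about, here is one possible sketch; it is an assumption-based illustration, not part of the original answer. It joins on an rlike condition that wraps the City value in word boundaries, and it assumes the df and world_countries dataframes above, with the address column still containing its original spaces:

from pyspark.sql import functions as F

# Sketch: join on a word-boundary regex built from the City column, so that
# "Seoul" only matches as a whole word inside the address string.
# Assumes `address` has not had its spaces stripped.
matches = df.join(
    world_countries,
    F.expr(r"address rlike concat('\\b', City, '\\b')"),
    how='left')
matches.show(truncate=False)

Note that a regex join condition like this is a non-equi join, so Spark cannot use a hash join for it; on large dataframes it can be slow.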
Let's say we have two tables, trans and product. Hypothetically, the trans table consists of over a billion rows of purchases made by users.
I am trying to find paired products that are often purchased together (on the same date) by the same user, such as wine and bottle openers, or chips and beer.
I am trying to find the top five paired products and their names.
The trans and product dataframes:
import pandas as pd

trans = {'ID':[1,1,2,2,3,3,1,5,5,6,6,6],
'productID':[11,22,11,22,33,77,11,77,88,11,22,77],
'Year':['2022-01-01','2022-01-01','2020-01-05','2020-01-05','2019-01-01','2019-01-01','2020-01-07','2020-01-08',
'2020-01-08','2021-06-01','2021-06-01','2021-06-01']}
trans = pd.DataFrame(trans)
trans['Year'] = pd.to_datetime(trans['Year'])
trans
product = {'productID':[11,22,33,44,55,77,88],
'prodname':['phone','Charger','eaphones','headset','scratchgaurd','pin','cover']}
product = pd.DataFrame(product)
product
My code so far, where I was trying to rank the items with the same ID and Year and then get the product names:
transprod = pd.merge(trans, product, on='productID', how='inner')
transprod
transprod['Rank'] = transprod.groupby('ID')['Year'].rank(method='dense').astype(int)
transprod = transprod.sort_values(['ID','productID','Rank'])
transprod
Desired Output:
Product 1 | Product 2 | Count
--------- | --------- | -----
phone     | Charger   | 3
Charger   | pin       | 1
eaphones  | pin       | 1
pin       | cover     | 1
Any help is really appreciated. Thanks in advance
You could group the transactions table by ID (and date) and list all product pairs for each order. itertools.combinations is useful here. By taking the set over an order first, you can ignore multiple equal items.
Since it does not matter in which order a pair appears, you could then construct a flat list of all the pairs and use a collections.Counter instance to count them. Sorting each pair first makes sure that you can disregard the order of items within a pair.
The product table can be transformed into a dictionary for easy lookup. This will provide a way to add the product names to the table of results.
from itertools import combinations
from collections import Counter
pairs_by_trans = trans.groupby(['ID', 'Year'])['productID'].agg(
lambda x: list(combinations(set(x), 2)))
pairs_flat = [tuple(sorted(pair)) for row in pairs_by_trans for pair in row]
counts = Counter(pairs_flat)
top_counts = pd.DataFrame(counts.most_common(5),
columns=['pair', 'count'])
prodname = {k: v for k, v in product.values}
top_counts['names'] = top_counts['pair'].apply(lambda x: (prodname[x[0]],
prodname[x[1]]))
top_counts
pair count names
0 (11, 22) 3 (phone, Charger)
1 (33, 77) 1 (eaphones, pin)
2 (77, 88) 1 (pin, cover)
3 (11, 77) 1 (phone, pin)
4 (22, 77) 1 (Charger, pin)
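If you want the result shaped like the desired output table (two product-name columns plus a count), a small follow-up step can split the names tuples into separate columns. This is a possible extension of the answer above, not part of it:

# Split the (name1, name2) tuples into two columns and keep the count
top_counts[['Product 1', 'Product 2']] = pd.DataFrame(
    top_counts['names'].tolist(), index=top_counts.index)
result = top_counts[['Product 1', 'Product 2', 'count']]
print(result)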
The below solution works perfectly fine for me:

# Imports repeated so this snippet runs standalone
import itertools
from itertools import combinations
from collections import Counter

transprod = pd.merge(trans, product, on='productID', how='inner')
transprod['Rank'] = transprod.groupby('ID')['Year'].rank(method='dense').astype(int)
transprod = transprod.sort_values(['ID','productID','Rank'])

def checkprod(x):
    # Keep only rows whose Rank is shared with an adjacent row in the group,
    # i.e. products bought by the same user on the same date
    v1 = (x['Rank'] == x['Rank'].shift(-1))
    return x[v1 | v1.shift(1)]

out = transprod.groupby('ID').apply(checkprod).reset_index(drop=True)
pairs = out.groupby(['ID','Rank'])['prodname'].agg(
    lambda x: list(combinations(set(x), 2)))
Counter(list(itertools.chain(*pairs)))
I have this AutoQuery implementation:
var q = AutoQuery.CreateQuery(request, base.Request).SelectDistinct();
var results = Db.Select<ProductDto>(q);
return new QueryResponse<ProductDto>
{
Offset = q.Offset.GetValueOrDefault(0),
Total = (int)Db.Count(q),
Results = results
};
The request has some joins:
public class ProductSearchRequest : QueryDb<GardnerRecord, ProductDto>
, ILeftJoin<GardnerRecord, RecordToBicCode>, ILeftJoin<RecordToBicCode, GardnerBicCode>
{
}
The records get returned correctly, but the total is wrong. I can see 40,000 records in the database, but it tells me there are 90,000. There are multiple RecordToBicCode rows for each GardnerRecord, so it's giving me the number of records multiplied by the number of RecordToBicCode rows.
How do I make the total match the number of GardnerRecord rows matching the query?
I am using PostgreSQL, so I need the count statement to be like
select count(distinct r.id) from gardner_record r etc...
Does OrmLite have a way to do this?
I tried:
var q2 = q;
q2.SelectExpression = "select count(distinct \"gardner_record\".\"id\")";
q2.OrderByExpression = null;
var count = Db.Select<int>(q2);
But I get an "object reference not set" error.
AutoQuery is returning the correct total count for your query, which has left joins and so will naturally return more results than the original source table.
You can perform a distinct count with:
Total = Db.Scalar<long>(q.Select(x => Sql.CountDistinct(x.Id)));
In an interview I was asked a big data problem where a dataset with the below schema was given:
UserId, MovieId, Rating
Each row holds a user's rating of a movie they have watched (ratings might be based on watch duration or some other criterion; it doesn't matter here).
The problem statement is to get, for each UserId, the list of top-rated MovieIds (rating higher than 7) that the user has not watched yet. Basically, it's a list of movies that could be recommended to a Netflix user.
So, for example, given the below dataset:
User_123 Movie_442 5
User_123 Movie_434 8
User_123 Movie_487 6
User_123 Movie_423 9
User_415 Movie_442 8
User_415 Movie_994 9
User_993 Movie_884 7
User_993 Movie_887 6
User_993 Movie_883 9
I am looking for the below output:
User_123 Movie_883
User_123 Movie_994
User_415 Movie_423
User_415 Movie_434
User_415 Movie_883
User_993 Movie_423
User_993 Movie_434
User_993 Movie_442
User_993 Movie_994
I have a solution for this using Apache Pig, but I am looking for a more optimized approach. Can anyone suggest a better solution? Here is my Pig script:
LoadFile = load '$input' using PigStorage('\t') as (UserId:chararray, MovieId:chararray, Rating:int);
UserIdField = foreach LoadFile generate UserId;
DistinctUsers = distinct UserIdField;
RatingGreaterThan7 = filter LoadFile by Rating > 7;
UsersMovieList = foreach RatingGreaterThan7 generate UserId, MovieId;
CrossJoin = cross DistinctUsers, UsersMovieList;
FilterMovies = filter CrossJoin by NOT(DistinctUsers::UserId == UsersMovieList::UserId);
UsersMovie = foreach FilterMovies generate DistinctUsers::UserId as UserId, UsersMovieList::MovieId as MovieId;
FilterAlreadyWatched = JOIN UsersMovie by (UserId, MovieId) LEFT OUTER , LoadFile by (UserId, MovieId);
FilterAlreadyWatched1 = FILTER FilterAlreadyWatched by (LoadFile::UserId is null AND LoadFile::MovieId is null);
UserMovieWatchList = foreach FilterAlreadyWatched1 generate UsersMovie::UserId, UsersMovie::MovieId;
STORE UserMovieWatchList into '$output' using PigStorage('\t');
If it's an interview question, I would try to look at it from multiple perspectives.
Find the size of the set of all movies having a rating > 7. Say it's 200 MB; then I would extract this data and write it to HDFS.
In the Pig job, load this data into reducer memory (after the GROUP BY), using a custom UDF to find, for a given user, which movies are new to them.
I might even plan to load the movies rated above 7 into HBase and do a lookup from the reducer (after the GROUP BY).
A cross join is not a scalable solution, as the rows in the table keep growing; a broadcast-style sketch of the same idea is shown below.
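For comparison, here is a minimal PySpark sketch of that broadcast idea, avoiding a full cross join against the large ratings table by broadcasting only the small set of highly rated movies. The dataframe and column names are assumptions for illustration, not part of the original question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed input: one row per (user, movie, rating)
ratings = spark.createDataFrame(
    [("User_123", "Movie_442", 5), ("User_123", "Movie_434", 8),
     ("User_415", "Movie_442", 8), ("User_415", "Movie_994", 9)],
    ["UserId", "MovieId", "Rating"])

# Small side: distinct movies rated above 7 by anyone
top_movies = ratings.filter(F.col("Rating") > 7).select("MovieId").distinct()

# Cross every user with the small, broadcast top-movie set; broadcasting
# avoids shuffling the (hypothetically billion-row) ratings table
users = ratings.select("UserId").distinct()
candidates = users.crossJoin(F.broadcast(top_movies))

# Drop the movies each user has already watched with a left anti join
watched = ratings.select("UserId", "MovieId")
recommendations = candidates.join(watched, ["UserId", "MovieId"], "left_anti")
recommendations.orderBy("UserId", "MovieId").show()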