A CSV file has two columns, CATEGORY and MILES. I have to find the % of miles under the Business category and the % of miles under the Personal category, in Python - python-3.x

CATEGORY MILES
Business 5.1
Business 4.6
Business 3.9
Personal 8.5
Business 3.7
Personal 6.2
Personal 11
This is an excerpt from the excel sheet

So you have a text file in CSV format, and you need to read it, convert it into combinations of [Category, Miles], and you want for every Category "% of miles", whatever that may be.
Category Miles
X 1
X 4
X 2
Y 3
I think that you want: "Category X has 70% of the miles, Category Y has 30% of the miles".
To solve this, it is best to cut your problem into smaller pieces.
Given a string fileName, read the file as text
Given a string in CSV format, convert it into a sequence of BusinessMile objects
Given a sequence of BusinessMiles, convert it to "% of miles", according to the definition above.
Cutting your problem into smaller pieces has several advantages:
The function of each piece will be easier to understand
Easier to unit test.
Easier to change, for instance if you don't read from a CSV file, but from a database, or if you don't read from a CSV string, but from an XML or JSON file.
Most important: you will be able to reuse the pieces for other tasks, like: "How many of my rows are about Business?"
The reusability shows most clearly in the fact that several of these pieces already exist and can be used freely: reading the file and converting the CSV text into objects.
For this, consider using the NuGet package CsvHelper. It is easy to use, versatile, and thus one of the most used CSV packages.
So let's assume you have procedures to read the CSV file and to convert it to a sequence of BusinessMiles
enum Category
{
Business,
Personal,
}
class BusinessMile // TODO: invent proper name
{
public Category Category {get; set;}
public Decimal Miles {get; set;}
}
By using an enum you can be certain that after reading the CSV there won't be any incorrect Categories, and it will be easy to add new Categories in a future version. If you don't know at compile time which Categories are allowed, consider using a string instead. The danger is that a typing error, such as "Personnel" instead of "Personal", creates a completely new Category without anyone noticing.
IEnumerable<BusinessMile> ReadBusinessMiles(string csvText)
{
// use CSVHelper to convert the csvText to the sequence
}
IEnumerable<BusinessMile> ReadBusinessMilesFile(string fileName)
{
// either use CSVHelper, or read the file and call the other method
}
After this, your problem will be easy:
string fileName = ...
IEnumerable<BusinessMile> businessMiles = ReadBusinessMilesFile(fileName);
Make groups of BusinessMiles that have the same Category:
var categoryGroups = businessMiles.GroupBy(
businessMile => businessMile.Category,
// parameter resultSelector: for every Category, and all BusinessMiles
// that have this Category, make one new object
(category, businessMilesInThisCategory) => new
{
Category = category,
TotalMiles = businessMilesInThisCategory
.Select(businessMile => businessMile.Miles)
.Sum(),
});
So now you've got:
Category TotalMiles
X 7
Y 3
If you really want to have percentages, you need to get the total of all Miles of all Categories (=10), and divide TotalMiles by this total
var totalMiles = categoryGroups.Select(group => group.TotalMiles).Sum();
var result = categoryGroups.Select(group => new
{
Category = group.Category,
TotalMilesPercentage = 100.0M * group.TotalMiles / totalMiles,
});
In my definition of BusinessMiles, the Miles are a decimal. Take care to convert it if your Miles are integers.
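Since the question asks for Python, the same three steps (read, sum per category, convert to percentages) can be sketched with only the standard library. This is a minimal sketch, assuming the file is comma-separated with a header row CATEGORY,MILES. The function name miles_percentages and the sample text are illustrative and not part of the original post.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def miles_percentages(csv_text):
    """Return {category: percentage of total miles} for CSV text.

    Assumes a header row with the columns CATEGORY and MILES.
    """
    totals = defaultdict(Decimal)  # missing categories start at Decimal(0)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["CATEGORY"].strip()] += Decimal(row["MILES"].strip())

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # avoid dividing by zero on empty input
    return {cat: 100 * miles / grand_total for cat, miles in totals.items()}


# The small X/Y example from above: X has 7 of 10 miles, Y has 3 of 10.
sample = "CATEGORY,MILES\nX,1\nX,4\nX,2\nY,3\n"
percentages = miles_percentages(sample)
for category, pct in percentages.items():
    print(f"{category}: {pct:.1f}%")
```

To read from a real file, pass the result of open(file_name, newline="").read() to the function. Decimal is used for the same reason as in the C# version: it avoids binary floating-point rounding in the sums.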

Related

Isolate lines that dont exist in txt file

I have two text files that have camera models, however not all models in one text file are present in the other, so I want to find the missing models. One issue, though: some models have extra strings in their names, e.g.:
NIKON D610
D610
CANON POWERSHOT A1200
POWERSHOT A1200
"Nikon" and "Canon" are missing from the names in one file.
I've been scratching my head over this for 2 days.
First, some assumptions are required for this answer:
Two strings describing the same model differ only in that they either do or do not contain a manufacturer string.
It is feasible to make a list of all possible manufacturer strings.
If these two assumptions are satisfied, one can ignore every string that is part of the manufacturer string list while comparing two model strings. This way only the rest of the model string is evaluated.
Here is an example in C#. The local strings aClean and bClean are used to not mess up the original strings.
List<string> manufacturers // List of all possible manufacturer strings
List<string> modelsA // List of all model strings from file A
List<string> modelsB // List of all model strings from file B
foreach (string a in modelsA)
{
// Remove manufacturer name and spaces
string aClean = RemoveManufacturer(a).Replace(" ", "");
foreach (string b in modelsB)
{
// Remove manufacturer name and spaces
string bClean = RemoveManufacturer(b).Replace(" ", "");
// Now compare and process the strings.
// Store original strings a or b if required
...
}
}
string RemoveManufacturer(string model)
{
foreach (string manufacturer in manufacturers)
{
// remove manufacturer from model if possible
// (strings are immutable, so the result of Replace must be assigned back)
model = model.Replace(manufacturer, "");
}
return model;
}
This code is far from optimized. But it seems that your use case is not exactly performance sensitive anyways.
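The same idea can be sketched in Python with a set lookup instead of the nested loop. This is only a sketch under the two assumptions above. The function names normalize and missing_models are hypothetical, the comparison is made case-insensitive because the sample shows "NIKON" in upper case, and "NIKON D750" is an invented entry added to show a miss.

```python
def normalize(model, manufacturers):
    """Upper-case the model, drop manufacturer names and all whitespace."""
    cleaned = model.upper()
    for manufacturer in manufacturers:
        cleaned = cleaned.replace(manufacturer.upper(), "")
    return "".join(cleaned.split())


def missing_models(models_a, models_b, manufacturers):
    """Return the entries of models_a that have no counterpart in models_b."""
    known = {normalize(b, manufacturers) for b in models_b}
    return [a for a in models_a if normalize(a, manufacturers) not in known]


manufacturers = ["Nikon", "Canon"]  # assumed to be the complete list
file_a = ["NIKON D610", "CANON POWERSHOT A1200", "NIKON D750"]
file_b = ["D610", "POWERSHOT A1200"]
missing = missing_models(file_a, file_b, manufacturers)
print(missing)
```

Calling missing_models a second time with the two lists swapped gives the models that are in file B but not in file A.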

ILOG CPLEX / OPL dynamic Excel sheet referencing

I'm trying to dynamically reference Excel sheets or tables within the .dat for a Mixed Integer Problem in Vehicle Routing that I'm trying to solve in CPLEX (OPL).
The setup is a: .mod = model, .dat = data and a MS Excel spreadsheet
I have a 2 dimensional array with customer demand data = Excel range (for coding convenience I did not format the excel data as a table yet)
The decision variable in .mod looks like this:
dvar boolean x[vertices][vertices][scenarios]
in .dat:
vertices from SheetRead (data, "Table!vertices");
and
scenarios from SheetRead (data, "dont know how to yet"); this might not be needed
without the scenario Index everything is fine.
But as the demand for the customers changes in this model I'd like to include this via changing the data base reference.
Now what I'd like to do is one of 2 things:
Either:
Change the spreadsheet in Excel so that depending on the scenario I get something like that in .dat:
scenario = 1:
vertices from SheetRead (data, "table-scenario-1!vertices");
scenario = 2:
vertices from SheetRead (data, "table-scenario-2!vertices");
so changing the spreadsheet for new base data,
or:
Change the range within the same spreadsheet:
scenario = 1:
vertices from SheetRead (data, "table!vertices-1");
scenario = 2:
vertices from SheetRead (data, "table!vertices-2");
either way would be fine.
Knowing how 3D Tables in Excel are created using multiple spreadsheets with 2D Tables grouped, the more natural approach seems to be, to have vertices always reference the same range in every Excel spreadsheet while depending on the scenario the spreadsheet/page is switched, but I just don't know how to.
Thanks for the advice.
Unfortunately, the arguments to SheetConnection must be a string literal or an Id (see the OPL grammar in the user manual here). And similarly for SheetRead. This means, you cannot have dynamic sources for a sheet connection.
As we discussed in the comments, one option is to add an additional index to all data: the scenario. Then always read the data for all scenarios and in the .mod file select what you want to actually use.
At https://www.ibm.com/developerworks/community/forums/html/topic?id=5af4d332-2a97-4250-bc06-76595eef1ab0&ps=25 I shared an example where you can set a dynamic name for the Excel file. In the same way you could have a dynamic range. The trick is to use flow control.
sub.mod
float maxOfx = 2;
string fileName=...;
dvar float x;
maximize x;
subject to {
x<=maxOfx;
}
execute
{
writeln("filename= ",fileName);
}
and then the main model is
main {
var source = new IloOplModelSource("sub.mod");
var cplex = new IloCplex();
var def = new IloOplModelDefinition(source);
var opl = new IloOplModel(def,cplex);
for(var k=11;k<=20;k++)
{
var opl = new IloOplModel(def,cplex);
var data2= new IloOplDataElements();
data2.fileName="file"+k;
opl.addDataSource(data2);
opl.generate();
if (cplex.solve()) {
writeln("OBJ = " + cplex.getObjValue());
} else {
writeln("No solution");
}
opl.postProcess();
opl.end();
}
}

My segmented picker has normal Int values as tags, How is this passed to and from CoreData?

My SwiftUI segmented control picker uses plain Int ".tag(1)" etc values for its selection.
CoreData only has Int16, Int32 & Int64 options to choose from, and with any of those options it seems my picker selection and CoreData refuse to talk to each other.
How is this (??simple??) task achieved please?
I've tried every numeric based option within CoreData including Int16-64, doubles and floats, all of them break my code or simply just don't work.
Picker(selection: $addDogVM.gender, label: Text("Gender?")) {
Text("Boy ♂").tag(1)
Text("?").tag(2)
Text("Girl ♀").tag(3)
}
I expected any of the 3 CoreData Int options to work out of the box, and to be compatible with the (standard) Int used by the picker.
Each element of a segmented control is represented by an index of type Int, and this index commences at 0.
So using your example of a segmented control with three segments (for example: Boy ♂, ?, Girl ♀), the segments are represented by the three indexes 0, 1 & 2.
If the user selects the segmented control that represents Girl ♀, then...
segmentedControl.selectedSegmentIndex = 2
When storing a value using Core Data framework, that is to be represented as a segmented control index in the UI, I therefore always commence with 0.
Everything you read from this point onwards is programmer preference. To be clear, there are a number of ways to achieve the same outcome and you should choose one that best suits you and your coding style. Note also that this can be confusing for a newcomer, so I would encourage patience. My only advice: keep things as simple as possible until you've tested and debugged enough to understand the differences.
So to continue:
The Apple Documentation states that...
...on 64-bit platforms, Int is the same size as Int64.
So in the Core Data model editor (.xcdatamodeld file), I choose to apply an Integer 64 attribute type for any value that will be used as an Int in my code.
Also, somewhere, some time ago, I read that if there is no reason to use Integer 16 or Integer 32, then default to the use of Integer 64 in object model graph. (I assume Integer 16 or Integer 32 are kept for backward compatibility.) If I find that reference I'll link it here.
I could write about the use of scalar attribute types here and manually writing your managed object subclass/es by selecting in the attribute inspector Class Codegen = Manual/None, but honestly I have decided such added detail will only complicate matters.
So your "automatically generated by Core Data" managed object subclass/es (NSManagedObject) will use the optional NSNumber? wrapper...
You will therefore need to convert your persisted/saved data in your code.
I do this in two places... when I access the data and when I persist the data.
(Noting I assume your entity is of type Dog and an instance exists of dog i.e. let dog = Dog())
// access
tempGender = dog.gender as? Int
// save
dog.gender = tempGender as NSNumber?
In between, I use a "temp" var property of type Int to work with the segmented control.
// temporary property to use with segmented control
private var tempGender: Int?
UPDATE
I do the last part a little differently now...
Rather than convert the data in code, I made a simple extension to my managed object subclass to execute the conversion. So rather than accessing the Core Data attribute directly and manipulating the data in code, now I instead use this convenience var.
extension Dog {
var genderAsInt: Int {
get {
guard let gender = self.gender else { return 0 }
return Int(truncating: gender)
}
set {
self.gender = NSNumber(value: newValue)
}
}
}
Your picker code...
Picker(selection: $addDogVM.genderAsInt, label: Text("Gender?")) {
Text("Boy ♂").tag(0)
Text("?").tag(1)
Text("Girl ♀").tag(2)
}
Any questions, ask in the comments.

Un-nesting nested tuples to single terms

I have written a UDF (extends EvalFunc<Tuple>) which outputs tuples with inner (nested) tuples.
For example the dump looks like:
(((photo,photos,photo)))
(((wedg,wedge),(audusd,audusd)))
(((quantum,quantum),(mind,mind)))
(((cassi,cassie),(cancion,canciones)))
(((calda,caldas),(nova,novas),(rodada,rodada)))
(((fingerprint,fingerprint),(craft,craft),(easter,easter)))
Now I want to process each of these terms, make them distinct, and give each one an id (RANK). To do this, I need to get rid of the brackets. A simple FLATTEN does not help in this case.
The final output should be like:
1 photo
2 photos
3 wedg
4 wedge
5 audusd
6 quantum
7 mind
....
My code (not the udf part and not the raw parsing):
tags = FOREACH raw GENERATE FLATTEN(tags) AS tag;
tags_distinct = DISTINCT tags;
tags_sorted = RANK tags_distinct BY tag;
DUMP tags_sorted;
I think the return type of your UDF is not optimal for your workflow. Instead of returning a tuple with a variable number of fields (which are tuples), it would be a lot more convenient to return a bag of tuples.
Instead of
(((wedg,wedge),(audusd,audusd)))
you will have
({(wedg,wedge),(audusd,audusd)})
and you will be able to FLATTEN that bag to:
1. make the DISTINCT
2. RANK the tags
To do so, update your UDF like this:
class MyUDF extends EvalFunc<DataBag> {
@Override
public DataBag exec(Tuple input) throws IOException {
// create the DataBag (e.g. BagFactory.getInstance().newDefaultBag()),
// add one tuple per term pair, and return it
}
}
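To make the target of the whole pipeline concrete, here is the flatten, distinct, rank flow sketched in plain Python rather than Pig. It only illustrates the intended result and is not a replacement for the UDF. The helper name flatten is hypothetical, and the ranking is done in sorted order, as RANK ... BY tag would do.

```python
def flatten(nested):
    """Yield every string found in arbitrarily nested tuples or lists."""
    for item in nested:
        if isinstance(item, (tuple, list)):
            yield from flatten(item)
        else:
            yield item


# Two rows from the dump in the question, written as Python tuples.
rows = [
    ((("photo", "photos", "photo"),),),
    ((("wedg", "wedge"), ("audusd", "audusd")),),
]

distinct_sorted = sorted(set(flatten(rows)))          # DISTINCT, then order by tag
ranked = list(enumerate(distinct_sorted, start=1))    # RANK starting at 1
for rank, tag in ranked:
    print(rank, tag)
```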

Setting a df threshold, beyond which, query terms should be ignored

I am using Solr to search and index products from a database. Products have two interesting fields: a name and a description. Product names are normally unique, but sometimes contain common words, which serve as a pre-description of the product. One example would be "UltraScrew - a motor powered screwdriver". Names are generally much shorter than descriptions.
The problem is that when one searches for a common term, documents that contain it in the name get an unwanted boost, over those that contain it only in the description. This is due to the fact that names are shorter, and even with the normalization added afterwards, it is quite visible.
I was wondering if it is possible to filter terms out of the name, not with a dictionary of stop words, but based on the relative document frequency of the term. That means, if a term appears in more than 10% of the available documents, it should be ignored when the name field is queried. The description field should be left untouched.
Is this generally possible?
Maybe you could use your own similarity:
import org.apache.lucene.search.Similarity;
public class MySimilarity extends Similarity {
@Override
public float idf(int docFreq, int numDocs) {
float freq = ((float)docFreq)/((float)numDocs);
if (freq >=0.1) return 0;
return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
}
...
}
and use that one instead of the default one.
You can set the similarity for an indexSearcher at lucene level, see this other answer to a question.
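The arithmetic of that idf override can be checked outside Lucene with a short Python sketch. It reproduces only the formula from the Java snippet and is not a Lucene or Solr API. The function name thresholded_idf and the max_ratio parameter are hypothetical.

```python
import math


def thresholded_idf(doc_freq, num_docs, max_ratio=0.1):
    """Return 0 for terms in at least max_ratio of all documents,
    otherwise the log-based idf used in the Java snippet above."""
    if num_docs <= 0:
        return 0.0  # empty index: nothing to weight
    if doc_freq / num_docs >= max_ratio:
        return 0.0  # too common: the term contributes no score
    return math.log(num_docs / (doc_freq + 1)) + 1.0


common = thresholded_idf(200, 1000)  # term in 20% of the documents
rare = thresholded_idf(9, 1000)      # term in 0.9% of the documents
print(common, rare)
```

Because a term's score contribution is multiplied by its idf, returning 0 makes a too-common term count for nothing in the field that uses this similarity, which is the effect the question asks for on the name field.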
I am not sure if I understood the question correctly, but you could run two separate queries. Pseudo code:
SearchResults nameSearchResults = search("name:X");
if (nameSearchResults.size() * 10 >= corpusSize) { // name-based search useless?
return search("description:X"); // use description-based search
} else {
return search("name:X description:X"); // search both fields
}

Resources