I'm trying to build a string that contains all attributes of a class-object. The object name is jsonData and it has a few attributes, some of them being
jsonData.Serial,
jsonData.InstrumentSerial,
jsonData.Country
I'd like to build a string that has those attribute names in the format of this:
'Serial InstrumentSerial Country'
End goal is to define a schema for a Spark dataframe.
I'm open to alternatives, as long as I know the order of the names in the string/object, because I need to map the schema to the appropriate values.
You'll have to be careful about filtering out unwanted attributes, but try this:
' '.join([x for x in dir(jsonData) if '__' not in x])
That filters out all the "magic methods" like __init__ or __new__.
To include those, do
' '.join(dir(jsonData))
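Both take advantage of Python's built-in dir() function, which returns a list of all attribute names of an object, methods included. Note that dir() sorts the names alphabetically, so the result will not follow the order in which the attributes were defined.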
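Since the order matters for mapping the schema to values, it may help to compare dir() with vars(). This is only a sketch: the JsonData class and its describe method are hypothetical stand-ins for the real class, and it assumes the attributes are plain instance attributes assigned in __init__ (vars() will not work on a class that uses __slots__).

```python
class JsonData:
    """Hypothetical stand-in for the asker's class."""

    def __init__(self, serial, instrument_serial, country):
        self.Serial = serial
        self.InstrumentSerial = instrument_serial
        self.Country = country

    def describe(self):
        return f"{self.Serial} from {self.Country}"


jsonData = JsonData(1, "TDD", "France")

# dir() sorts names alphabetically and also lists methods such as describe.
from_dir = [x for x in dir(jsonData) if "__" not in x]
print(from_dir)  # ['Country', 'InstrumentSerial', 'Serial', 'describe']

# vars() returns the instance __dict__, which keeps assignment order
# and contains only instance attributes.
from_vars = list(vars(jsonData))
schema_string = " ".join(from_vars)
print(schema_string)  # Serial InstrumentSerial Country
```

If the class defines its attributes some other way (properties, __slots__, class-level attributes), an explicit, hand-written list of names is probably the safest way to fix the order.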
I don't quite understand why you want to group the attribute names in a single string.
You could simply keep a list of attribute names, since a Python list preserves its order.
attribute_names = [x for x in dir(jsonData) if '__' not in x]
From there you can create your dataframe. If you don't need to specify the Spark types, you can just do:
df = spark.createDataFrame(data, schema = attribute_names)  # spark is a SparkSession; SparkContext has no createDataFrame
You could also create a StructType and specify the types in your schema.
I guess that you are going to have a list of jsonData records that you want to treat as Rows.
Let's consider it as a list of objects; the logic would be the same for other containers.
You can do that as follows:
my_object_list = [
    jsonDataClass(Serial = 1, InstrumentSerial = 'TDD', Country = 'France'),
    jsonDataClass(Serial = 2, InstrumentSerial = 'TDI', Country = 'Suisse'),
    jsonDataClass(Serial = 3, InstrumentSerial = 'TDD', Country = 'Grece')]

from operator import attrgetter

def build_record(obj, attr_names):
    # Return the values of attr_names for obj, in the given order.
    return attrgetter(*attr_names)(obj)
So the data argument referred to previously would be constructed as:
data = [build_record(x, attribute_names) for x in my_object_list]
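If you control the class definition, one way to get the attribute names in definition order (rather than the alphabetical order dir() gives) is to make it a dataclass. This is only a sketch under that assumption: JsonDataClass below is a hypothetical re-definition, not the original class, and the Spark call is left out so the snippet runs with the standard library alone.

```python
from dataclasses import dataclass, fields
from operator import attrgetter


@dataclass
class JsonDataClass:  # hypothetical stand-in for the real class
    Serial: int
    InstrumentSerial: str
    Country: str


# fields() yields the fields in the order they were declared.
attribute_names = [f.name for f in fields(JsonDataClass)]

my_object_list = [
    JsonDataClass(Serial=1, InstrumentSerial="TDD", Country="France"),
    JsonDataClass(Serial=2, InstrumentSerial="TDI", Country="Suisse"),
    JsonDataClass(Serial=3, InstrumentSerial="TDD", Country="Grece"),
]

# With several names, attrgetter returns one tuple per object,
# in the same order as attribute_names.
get_values = attrgetter(*attribute_names)
data = [get_values(obj) for obj in my_object_list]

print(attribute_names)  # ['Serial', 'InstrumentSerial', 'Country']
print(data[0])          # (1, 'TDD', 'France')
```

One caveat: with a single attribute name, attrgetter returns the bare value instead of a one-element tuple, so a one-column schema would need the value wrapped explicitly.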
I'm trying to remove all values with index vals 1, 2, and 3, from a string list like
['1:1', '2:100.0', '3:100.0',...]. The data is in sparse vector format and was loaded as a pandas dataframe. I used an online regex tester to match the first three positions of this list with success.
But as it exists in my program, the same regex doesn't work. On running:
data = pd.read_csv(r"c:\data.csv")
for index, row in data.iterrows():
    line = parseline(row)
def parseline(line):
    line = line.values.flatten() # data like: ['1:1 2:100.0 3:100.0...']
    stringLine = listToString(line) # data like: 1:1 2:100.0 3:100.0...
    splitLine = stringLine.split(" ") # data like: ['1:1', '2:100.0', '3:100.0',...]
    remove = re.findall(r"'1:1'|'[2,3]:\d+.\d+'")
    splitLine.remove(remove)
    print(splitLine)
I get the following error:
TypeError: findall() missing 1 required positional argument: 'string'
Does anyone have any ideas? Thanks in advance.
The TypeError itself came from calling re.findall() with only a pattern and no string to search. Beyond that, the splitLine object was actually a list, but re.findall() (and re.sub(), which is what I ended up using) operates on a string, not a list. I was simply operating on the wrong data structure. Ultimately:
def parseline(line):
    line = line.values.flatten().tolist()
    stringLine = listToString(line)
    stringLine = re.sub(r"1:1 |2:\d+.\d+ ", "", stringLine)
    ...
did the trick.
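For reference, here is a sketch of two ways to drop the entries with index 1, 2 or 3, one working on the split list and one on the joined string. The function names and sample values are made up for illustration, and the regex is my own variant, not the one from the post: it anchors on whitespace so that an index such as 13 is not mistaken for 3.

```python
import re

UNWANTED = {"1", "2", "3"}


def drop_indices_from_tokens(tokens):
    """Keep only the 'index:value' tokens whose index is not unwanted."""
    return [tok for tok in tokens if tok.split(":", 1)[0] not in UNWANTED]


def drop_indices_from_string(line):
    """Remove 'index:value' tokens with index 1, 2 or 3 from a string.

    (?<!\\S) requires the token to start at the beginning of the string
    or right after whitespace, so '13:...' is left alone.
    """
    return re.sub(r"(?<!\S)[123]:\S+\s*", "", line)


tokens = ["1:1", "2:100.0", "3:100.0", "4:0.5", "13:7.25"]
print(drop_indices_from_tokens(tokens))            # ['4:0.5', '13:7.25']
print(drop_indices_from_string(" ".join(tokens)))  # 4:0.5 13:7.25
```

The list version avoids regular expressions entirely, which may be easier to maintain if the set of unwanted indices changes.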
So I am trying to combine multiple CSV files. I have one CSV with the current part number list of products we stock. Sorry, I can't embed images as I am new. I've seen many similar posts, but none with both a merge and a groupby together.
current_products
I have another csv with a list of image files that are associated with that part but are split up on to multiple rows. This list also has many more parts listed than we offer so merging based on the current_products sku is important.
product_images
I would like to use the first CSV as the reference for the parts I currently use and combine the image files in the following format.
newestproducts
I get an AttributeError: 'function' object has no attribute 'to_csv', although when I just print the output in the terminal it appears to be the way I want it.
current_products = 'currentproducts.csv'
product_images = 'productimages.csv'
image_list = 'newestproducts.csv'
df_current_products = pd.read_csv(current_products)
df_product_images = pd.read_csv(product_images)
df_current_products['sku'] = df_current_products['sku'].astype(str)
df_product_images['sku'] = df_product_images['sku'].astype(str)
df_merged = pd.merge(df_current_products, df_product_images[['sku','images']], on = 'sku', how='left')
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index
#print(df_output)
df_output.to_csv(image_list, index=False)
You are missing the () after reset_index:
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index()
Without the parentheses, df_output is bound to the reset_index method itself rather than to a dataframe (just print type(df_output) to see that), and a method object has no attribute named to_csv.
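The difference between referencing the method and calling it can be seen on a small, made-up frame. The sku values and file names below are invented for illustration; only the column names follow the question.

```python
import pandas as pd

# Hypothetical data standing in for the merged frame.
df_merged = pd.DataFrame({
    "sku": ["A1", "A1", "B2"],
    "images_y": ["a1_front.jpg", "a1_back.jpg", "b2_front.jpg"],
})

grouped = df_merged.groupby(["sku"])["images_y"].apply("&&".join)

not_called = grouped.reset_index   # the method object itself
df_output = grouped.reset_index()  # the DataFrame the method returns

print(type(not_called))  # a bound method, which has no to_csv
print(type(df_output))   # a pandas DataFrame
print(df_output)
```

Note that with how='left', a sku without any image ends up with a missing value after the merge, and '&&'.join would then fail on it, so such rows may need to be dropped or filled before grouping.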
I have a series of dictionaries that all contain the same keys but with different values, e.g. Age in dictionary 1 = 2, Age in dictionary 2 = 4 and so on, so they are broadly identical in structure.
What I would like to do is randomly select one of these dictionaries and then assign specific values from that dictionary to variables. For example, Python randomly chooses Dictionary 1, and I then want to fill the dictAge variable with the age value from Dictionary 1.
import random
dictList = ['myDict', 'otherDict']
mydict = {
    'age' : 10,
    'other': "dummy data"
}
.
.
.
randomDict = random.choice(dictList)
dictAge = randomDict['age']
print(dictAge)
In the case of the code above what should happen is:
randomDict is assigned a random value from the dictList variable (at the top). This sets which dictionary's values will be used going forward.
I next want the dictAge variable to be assigned the age value from the selected dictionary. In this case (as mydict was the only dictionary available) it should be assigned the age value of 10.
The error I am getting is:
TypeError: string indices must be integers
I know this is such a common error but my brain can't quite work out what the best solution is.
(Disclaimer: I haven't used python in ages so I know I am doing something really obviously silly but I can't quite work out what to do).
Right now, you are not actually using the definition of your dicts.
This is because dictList is comprised of strings: ['myDict', 'otherDict'].
So, when doing randomDict = random.choice(dictList), randomDict will either be the string 'myDict', or the string 'otherDict'.
Then you are doing randomDict['age'], which means you are trying to index a string with a string. As the error says, this can't be done: string indices must be integers.
What you want to do is move the definition of dictList after the definitions of your dicts, and have it hold references to the dicts themselves, not strings. Something like:
myDict = {
    'age' : 10,
    'other': "dummy data"
}
.
.
.
dictList = [myDict, otherDict]
In the following piece of code:
dictAge = randomDict['age']
You are trying to index the name of a dictionary variable (a string) returned by the random.choice function.
To make it work you would need to look the name up using locals:
locals()[randomDict]['age']
or, better, correct dictList to contain the dictionaries instead of their names:
dictList = [myDict, otherDict]
In the latter case please note that myDict and otherDict should be declared before dictList.
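Put together, the corrected version might look like the sketch below. The contents of otherDict are invented for illustration, and the dictsByName mapping is an optional extra (not from the question) for when you also need to know which dictionary was picked without resorting to locals().

```python
import random

myDict = {"age": 10, "other": "dummy data"}
otherDict = {"age": 25, "other": "more dummy data"}  # hypothetical contents

# The list holds the dictionaries themselves, not their names,
# so it must be defined after them.
dictList = [myDict, otherDict]

randomDict = random.choice(dictList)
dictAge = randomDict["age"]
print(dictAge)

# Optional: keep the names as keys if you need to know which one was chosen.
dictsByName = {"myDict": myDict, "otherDict": otherDict}
chosenName = random.choice(list(dictsByName))
chosenAge = dictsByName[chosenName]["age"]
print(chosenName, chosenAge)
```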
I have an array of objects which is defined as below:
def list = [{'name':'test','grade':1,'num':1},{'name':'test1','grade':2,'num':2},{'name':'test','grade':1,'num':1}]
I am trying to avoid duplicate num values, so I tried the following:
//Set<String> studentArray = new HashSet<String>(Arrays.asList(studentList.num));
HashSet <String> studentInfo = new HashSet <String>();
studentInfo.addAll(list.num)
println("Information:"+studentInfo);
Now I can see distinct values but in the console, I see the value is appending with an array like [1]. How to see only the value?
A HashSet does not allow duplicate values. However, the code you have constructed creates a set containing a single element: the list of elements 1, 2 and 1. If you print studentArray to the console you will see something like this:
[[1, 2, 1]]
And this is correct, because the type of the constructed structure is Set<List<Integer>>. Used this way, the set would only prevent adding another identical list [1,2,1].
If you want to create a set like [1,2] then you can cast studentList.num as Set.
def studentList = [[name:'test',grade:1,num:1],[name:'test1',grade:2,num:2],[name:'test',grade:1,num:1]]
def studentNums = studentList.num as Set
assert studentNums == [1,2] as Set
data = ['{"osc":{"version":"1.0"}}']
or
data = ['{"device":{"network":{"ipv4_dante":{"auto":"testing"}}}}']
From the code above, I only get random outputs, but I need to get the last value i.e "1.0" or "testing" and so on.
I always need to get the last value. How can I do it using python?
Dictionaries have no "last" element. Assuming your dictionary doesn't branch and you want the "deepest" element, this should work:
import json
data = ['{"device":{"network":{"ipv4_dante":{"auto":"testing"}}}}']
obj = json.loads(data[0])
while isinstance(obj, dict):
    obj = obj[list(obj.keys())[0]]
print(obj)
This should work -
import ast
x = ast.literal_eval(data[0])
while isinstance(x, dict):
    key = list(x.keys())[0]  # dict.keys() cannot be indexed directly in Python 3
    x = x.get(key)
print(x)
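Both answers follow the first key at each level, which is enough for the single-branch examples in the question. If the data can branch and "last" is meant literally, a variant that follows the last value at each level is sketched below. It relies on dicts preserving insertion order (guaranteed since Python 3.7); the function name and the third sample string are made up for illustration.

```python
import json


def last_nested_value(obj):
    """Follow the last value at each level until a non-dict (or empty dict) is reached."""
    while isinstance(obj, dict) and obj:
        obj = list(obj.values())[-1]
    return obj


samples = [
    '{"osc":{"version":"1.0"}}',
    '{"device":{"network":{"ipv4_dante":{"auto":"testing"}}}}',
    '{"a":{"x":1},"b":{"y":2,"z":3}}',  # hypothetical branching input
]

results = [last_nested_value(json.loads(s)) for s in samples]
print(results)  # ['1.0', 'testing', 3]
```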