Reading a TTree branch using PyROOT - pyroot

I would like to read the content of a branch in a TTree using PyROOT. I looked for a possible solution to my problem here: Reading a TTree in root using PyRoot. However, the answer doesn't help me, probably because PyROOT doesn't know the structure of the branch. Please see the output snippet at the bottom and suggest a solution.
Thanks,
Sadia
>>> import ROOT
>>> f = ROOT.TFile("../Ntuple/v0406_default_1012p2_101X_dataRun2_Express_v7_Fill6573_Ntuple.root")
>>> f.ls()
TFile** ../Ntuples/v0406_default_1012p2_101X_dataRun2_Express_v7_Fill6573_Ntuple.root
TFile* ../Ntuples/v0406_default_1012p2_101X_dataRun2_Express_v7_Fill6573_Ntuple.root
KEY: TTree trajTree;1 Trajectory measurements in the Pixel detector.
KEY: TTree eventTree;1 The event.
KEY: TTree clustTree;1 Pixel clusters.
KEY: TTree trajROCEfficiencyTree;1 ROC and module efficiencies.
>>> t = f.Get("trajTree")
>>> t.Print()
******************************************************************************
*Tree :trajTree : Trajectory measurements in the Pixel detector. *
*Entries : 42180482 : Total = 31858466801 bytes File Size = 8076610485 *
* : : Tree compression factor = 3.94 *
******************************************************************************
*............................................................................*
*Br 5 :clust_pix : pix[size][2]/F *
*Entries : 42180482 : Total Size= 1597865089 bytes File Size = 569202378 *
*Baskets : 12058 : Basket Size= 2175488 bytes Compression= 2.81 *
*............................................................................*
*Br 7 :traj : validhit/I:missing:lx/F:ly:lz:glx:gly:glz:clust_near/I:*
* | hit_near:pass_effcuts:alpha/F:beta:norm_charge:d_tr:dx_tr:dy_tr: *
* | d_cl:dx_cl:dy_cl:dx_hit:dy_hit:onedge/I:lx_err/F:ly_err/F *
*Entries :42180482 : Total Size= 4220749516 bytes File Size = 2508894561 *
*Baskets : 28411 : Basket Size= 2275840 bytes Compression= 1.68 *
*............................................................................*
>>> t.clust_pix
<read-write buffer ptr 0x7fba04428200, size 514 at 0x115e5ecf0>
>>> t.traj
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'TTree' object has no attribute 'traj'
>>> t.traj.beta
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'TTree' object has no attribute 'traj'

Self-answering with the help of a colleague: I define the structure of the branch and then set its address.
The structure definition file is structDef.py:
from ROOT import gROOT, AddressOf
gROOT.ProcessLine(
    "struct Traj {\
        Int_t   validhit;\
        Float_t missing;\
        Float_t lx;\
        Float_t ly;\
        Float_t lz;\
        Float_t glx;\
        Float_t gly;\
        Float_t glz;\
        Int_t   clust_near;\
        Float_t hit_near;\
        Float_t pass_effcuts;\
        Float_t alpha;\
        Float_t beta;\
        Float_t norm_charge;\
        Float_t d_tr;\
        Float_t dx_tr;\
        Float_t dy_tr;\
        Float_t d_cl;\
        Float_t dx_cl;\
        Float_t dy_cl;\
        Float_t dx_hit;\
        Float_t dy_hit;\
        Int_t   onedge;\
        Float_t lx_err;\
        Float_t ly_err;\
    };"
)
Then in my main code, I set the branch address.
#!/usr/bin/env python
from ROOT import TFile, TTree
import structDef
from ROOT import Traj
traj = Traj()
f = TFile.Open('../Ntuples/v0406_default_1012p2_101X_dataRun2_Express_v7_Fill6573_Ntuple.root')
t1 = f.Get("trajTree")
t1.SetBranchAddress("traj", structDef.AddressOf(traj, 'validhit'))
for iev in xrange(t1.GetEntries()):
    t1.GetEntry(iev)
    print traj.norm_charge
If anyone has a better solution, I would really appreciate your help, as I do see warnings even though it works for me:
input_line_20:2:9: error: redefinition of 'Traj'
struct Traj { Int_t validhit; Float_t missing; Float_t lx; Float_t ly; Float_t lz; Float_t glx; Float_...
^
input_line_19:2:9: note: previous definition is here
struct Traj { Int_t validhit; Float_t missing; Float_t lx; Float_t ly; Float_t lz; Float_t glx; Float_...
^
17.0971317291
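The "redefinition of 'Traj'" warning appears because gROOT.ProcessLine declares the struct again whenever structDef is imported or executed a second time in the same interpreter session. A minimal guard (a sketch of my own, reusing the same field list as in structDef.py above) is to ask PyROOT whether it already knows the class before declaring it:
import ROOT
from ROOT import gROOT, AddressOf

# Declare the struct only if cling has not seen it yet; this avoids the
# "redefinition of 'Traj'" warning when the module is loaded more than once.
if not hasattr(ROOT, 'Traj'):
    gROOT.ProcessLine(
        "struct Traj {\
            Int_t   validhit;\
            /* ... same remaining fields as listed above ... */\
            Float_t ly_err;\
        };"
    )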
I like Python, but this additional layer makes my macro complicated. Moreover, I would appreciate tips on how to loop over all entries of a tree efficiently in Python, just as in C++. Maybe it is more a question of ROOT vs. PyROOT; currently my macro takes about twice as long as the equivalent C++ code.
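One thing that usually helps plain PyROOT loops (a sketch based on standard TTree features, nothing specific to this ntuple): switch off the branches you do not read, so that GetEntry only has to decompress the traj branch:
# After opening the file and getting the tree, read only the "traj" branch;
# all other branches stay compressed on disk during the loop.
t1.SetBranchStatus("*", 0)
t1.SetBranchStatus("traj", 1)
t1.SetBranchAddress("traj", structDef.AddressOf(traj, 'validhit'))

for iev in xrange(t1.GetEntries()):
    t1.GetEntry(iev)
    print traj.norm_charge
For the remaining gap, letting ROOT do the loop itself (for example with TTree::Draw) or moving the hot loop into C++ is still the usual advice.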

Related

Moviepy: add audio to a video

I am trying to run the following code:
from moviepy.editor import *
videoclip = VideoFileClip("filename.mp4")
audioclip = AudioFileClip("audioname.mp3")
new_audioclip = CompositeAudioClip([videoclip.audio, audioclip])
videoclip.audio = new_audioclip
videoclip.write_videofile("new_filename.mp4")
but when I run it I get the following error:
Traceback (most recent call last):
  File "C:/Users/arthu/PycharmProjects/Comprei da China/video.py", line 5, in <module>
    new_audioclip = CompositeAudioClip([videoclip.audio, audioclip])
  File "C:\Users\arthu\PycharmProjects\Comprei da China\venv\lib\site-packages\moviepy\audio\AudioClip.py", line 285, in __init__
    ends = [c.end for c in self.clips]
  File "C:\Users\arthu\PycharmProjects\Comprei da China\venv\lib\site-packages\moviepy\audio\AudioClip.py", line 285, in <listcomp>
    ends = [c.end for c in self.clips]
AttributeError: 'NoneType' object has no attribute 'end'
Does anybody know how I can solve that?
The problem is that videoclip.audio is None (the original video has no audio track), which is what CompositeAudioClip trips over when it reads .end on each clip. Pass only the new audio clip to CompositeAudioClip:
from moviepy.editor import *
videoclip = VideoFileClip("filename.mp4")
audioclip = AudioFileClip("audioname.mp3")
new_audioclip = CompositeAudioClip([audioclip])
videoclip.audio = new_audioclip
videoclip.write_videofile("new_filename.mp4")
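If you do want to keep the original soundtrack when one exists, a small variation (a sketch of my own, not taken from the moviepy docs) only mixes in videoclip.audio when it is not None:
from moviepy.editor import VideoFileClip, AudioFileClip, CompositeAudioClip

videoclip = VideoFileClip("filename.mp4")
audioclip = AudioFileClip("audioname.mp3")

# videoclip.audio is None for silent videos, which is what caused the
# original AttributeError; only include it when it actually exists.
clips = [audioclip] if videoclip.audio is None else [videoclip.audio, audioclip]
videoclip.audio = CompositeAudioClip(clips)
videoclip.write_videofile("new_filename.mp4")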

Creating a DataFrame from Row results in 'infer schema issue'

When I began learning PySpark, I used a list to create a DataFrame. Now that inferring the schema from a list has been deprecated, I got a warning suggesting that I use pyspark.sql.Row instead. However, when I try to create one using Row, I get a schema-inference issue. This is my code:
>>> row = Row(name='Severin', age=33)
>>> df = spark.createDataFrame(row)
This results in the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 390, in _createFromLocal
struct = self._inferSchemaFromList(data)
File "/spark2-client/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
schema = reduce(_merge_type, map(_infer_schema, data))
File "/spark2-client/python/pyspark/sql/types.py", line 992, in _infer_schema
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
So I created a schema
>>> schema = StructType([StructField('name', StringType()),
... StructField('age',IntegerType())])
>>> df = spark.createDataFrame(row, schema)
but then, this error gets thrown.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 387, in _createFromLocal
data = list(data)
File "/spark2-client/python/pyspark/sql/session.py", line 509, in prepare
verify_func(obj, schema)
File "/spark2-client/python/pyspark/sql/types.py", line 1366, in _verify_type
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 33 in type <type 'int'>
The createDataFrame function takes a list of Rows (among other options) plus the schema, so the correct code would be something like:
from pyspark.sql.types import *
from pyspark.sql import Row
schema = StructType([StructField('name', StringType()), StructField('age',IntegerType())])
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()
Out:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
+-------+---+
| name|age|
+-------+---+
|Severin| 33|
| John| 48|
+-------+---+
In the pyspark docs (link) you can find more details about the createDataFrame function.
You need to create a list of Row objects and pass that list, together with the schema, to your createDataFrame() method. A sample example:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
department1 = Row(id='AAAAAAAAAAAAAA', type='XXXXX',cost='2')
department2 = Row(id='AAAAAAAAAAAAAA', type='YYYYY',cost='32')
department3 = Row(id='BBBBBBBBBBBBBB', type='XXXXX',cost='42')
department4 = Row(id='BBBBBBBBBBBBBB', type='YYYYY',cost='142')
department5 = Row(id='BBBBBBBBBBBBBB', type='ZZZZZ',cost='149')
department6 = Row(id='CCCCCCCCCCCCCC', type='XXXXX',cost='15')
department7 = Row(id='CCCCCCCCCCCCCC', type='YYYYY',cost='23')
department8 = Row(id='CCCCCCCCCCCCCC', type='ZZZZZ',cost='10')
schema = StructType([StructField('id', StringType()), StructField('type',StringType()),StructField('cost', StringType())])
rows = [department1,department2,department3,department4,department5,department6,department7,department8 ]
df = spark.createDataFrame(rows, schema)
If you're just making a pandas dataframe, you can convert each Row to a dict and then rely on pandas' type inference, if that's good enough for your needs. This worked for me:
import pandas as pd
sample = output.head(5) #this returns a list of Row objects
df = pd.DataFrame([x.asDict() for x in sample])
I have had a similar problem recently, and the answers here helped me understand the problem better.
My code:
row = Row(name="Alice", age=11)
spark.createDataFrame(row).show()
resulted in a very similar error:
An error was encountered:
Can not infer schema for type: <class 'int'>
Traceback ...
The cause of the problem: createDataFrame expects a list of rows. So if you only have one Row and don't want to invent more, simply wrap it in a list: [row]
row = Row(name="Alice", age=11)
spark.createDataFrame([row]).show()

I am trying to parse a fixed-width mainframe screenshot file with Python using the struct and string modules. I am getting the error below:

In [1]: import string
In [2]: import struct
In [3]: baseformat = '10s 1x 25s 10s 1x 20s'
In [4]: theline = "DD : 3KZ BD NAME : PETERA QDVISORS LLC"
In [5]: mytup = struct.unpack_from(baseformat,theline,offset=0)
TypeError Traceback (most recent call last)
in ()
----> 1 mytup = struct.unpack_from(baseformat,theline,offset=0)
TypeError: a bytes-like object is required, not 'str'
theline is a str, but struct.unpack_from expects a binary sequence (bytes). You can convert it like so:
theline = bytes("BD : 3KZ BD NAME : CETERA ADVISORS LLC", 'utf-8')
(UTF-8 is probably a reasonable encoding if you're expecting all the input data to be ASCII.)
After that your next problem will be that baseformat defines a parser expecting 67 bytes, and theline is 38 characters.
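One way to handle both points at once (a sketch, assuming the data really is ASCII and the record is simply shorter than the format expects) is to encode the line and pad it out to the width given by struct.calcsize before unpacking:
import struct

baseformat = '10s 1x 25s 10s 1x 20s'
needed = struct.calcsize(baseformat)            # 67 bytes for this format
theline = "DD : 3KZ BD NAME : PETERA QDVISORS LLC"

# Encode to bytes and pad with spaces up to the expected record width.
raw = theline.encode('utf-8').ljust(needed)
fields = struct.unpack_from(baseformat, raw, offset=0)
print([f.decode('utf-8').strip() for f in fields])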

Keep getting error `TypeError: 'float' object is not callable` when trying to run a file using the numpy library

I intend to perform a Newton Raphson iteration on some data I read in from a file. I use the following function in my python program.
def newton_raphson(r1, r2):
    guess1 = 2 * numpy.log(2) / (numpy.pi() * (r1 + r2))
I call this function like so:
if answer == "f":  # if data is in file
    fileName = input("What is the name of the file you want to open?")
    dataArray = extract_data_from_file(fileName)
    resistivityArray = []
    for i in range(0, len(dataArray[0])):
        resistivity_point = newton_raphson(dataArray[0][i], dataArray[1][i])
        resistivityArray += [resistivity_point]
On running the program and entering my file, this returns `TypeError: 'float' object is not callable`. Everything I've read online suggests this is due to a missing operator somewhere in my code, but I can't see where. Why do I keep getting this error, and how do I avoid it?
numpy.pi is not a function, it is a constant:
>>> import numpy
>>> numpy.pi
3.141592653589793
Remove the () call from it:
def newton_raphson(r1, r2):
    guess1 = 2 * numpy.log(2) / (numpy.pi * (r1 + r2))
as that is causing your error:
>>> numpy.pi()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'float' object is not callable
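A minimal check of the corrected function (a sketch; the original presumably goes on to iterate, here it simply returns the first guess):
import numpy

def newton_raphson(r1, r2):
    # numpy.pi is a plain float constant, so it is used without parentheses.
    guess1 = 2 * numpy.log(2) / (numpy.pi * (r1 + r2))
    return guess1

print(newton_raphson(1.0, 2.0))  # roughly 0.147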

Spark join by values

I have two pair RDDs in Spark, like this:
rdd1 = (1 -> [4,5,6,7])
(2 -> [4,5])
(3 -> [6,7])
rdd2 = (4 -> [1001,1000,1002,1003])
(5 -> [1004,1001,1006,1007])
(6 -> [1007,1009,1005,1008])
(7 -> [1011,1012,1013,1010])
I would like to combine them to look like this.
joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013])
(2 -> [1000,1001,1002,1003,1004,1006,1007])
(3 -> [1005,1007,1008,1009,1010,1011,1012,1013])
Can someone suggest how to do this?
Thanks
Dilip
With the Scala Spark API, this would be:
import org.apache.spark.SparkContext._ // enable PairRDDFunctions
val rdd1Flat = rdd1.flatMapValues(identity).map(_.swap)
val rdd2Flat = rdd2.flatMapValues(identity)
rdd1Flat.join(rdd2Flat).values.distinct.groupByKey.collect
Result of this operation is
Array[(Int, Iterable[Int])] = Array(
(1,CompactBuffer(1001, 1011, 1006, 1002, 1003, 1013, 1005, 1007, 1009, 1000, 1012, 1008, 1010, 1004)),
(2,CompactBuffer(1003, 1004, 1007, 1000, 1002, 1001, 1006)),
(3,CompactBuffer(1008, 1009, 1007, 1011, 1005, 1010, 1013, 1012)))
The approach proposed by Gabor will not work, since Spark doesn't support RDD operations performed inside other RDD operations. You'll get a Java NPE thrown by a worker when it tries to access the SparkContext, which is available on the driver only.
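For reference, a PySpark sketch of the same approach (assuming sc is an existing SparkContext; only standard pair-RDD operations are used):
# Flatten both RDDs to (element, key) / (element, value) pairs, join on the
# element, then regroup by the original rdd1 key.
rdd1 = sc.parallelize([(1, [4, 5, 6, 7]), (2, [4, 5]), (3, [6, 7])])
rdd2 = sc.parallelize([(4, [1001, 1000, 1002, 1003]),
                       (5, [1004, 1001, 1006, 1007]),
                       (6, [1007, 1009, 1005, 1008]),
                       (7, [1011, 1012, 1013, 1010])])

rdd1_flat = rdd1.flatMapValues(lambda xs: xs).map(lambda kv: (kv[1], kv[0]))
rdd2_flat = rdd2.flatMapValues(lambda xs: xs)

joined = (rdd1_flat.join(rdd2_flat)   # (element, (rdd1 key, rdd2 value))
          .values()                   # (rdd1 key, rdd2 value)
          .distinct()
          .groupByKey()
          .mapValues(sorted))
print(joined.collect())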
