Index out of range while converting from rdd to dataframe - apache-spark

I am trying to convert a Spark RDD to a DataFrame. The RDD itself is fine, but when I convert it to a DataFrame I get an index out of range error.
from pyspark.sql import Row

alarms = sc.textFile("hdfs://nanalyticsedge.com:8020/hdp/oneday.csv")
alarms = alarms.map(lambda line: line.split(","))
header = alarms.first()
alarms = alarms.filter(lambda line:line != header)
alarms = alarms.filter(lambda line: len(line)>1)
alarms_df = alarms.map(lambda line: Row(IDENTIFIER=line[0],SERIAL=line[1],NODE=line[2],NODEALIAS=line[3],MANAGER=line[4],AGENT=line[5],ALERTGROUP=line[6],ALERTKEY=line[7],SEVERITY=line[8],SUMMARY=line[9])).toDF()
alarms_df.take(100)
Here alarms.count() works fine, whereas alarms_df.count() gives index out of range. The data is an export from Oracle.
From #Dikei's answer I found that:
alarms = alarms.filter(lambda line: len(line) == 10)
gives me a proper DataFrame, but why do rows get lost when the data is a database export, and how do I prevent it?

I think the problem is that some of your lines do not contain 10 elements.
It's easy to check: try changing
alarms = alarms.filter(lambda line: len(line)>1)
to
alarms = alarms.filter(lambda line: len(line) == 10)
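To see what is actually being dropped, you can also inspect the malformed lines before filtering them out. This is a purely diagnostic sketch that reuses the alarms RDD from the question:
bad_rows = alarms.filter(lambda line: len(line) != 10)
print(bad_rows.count())   # how many rows do not have exactly 10 fields
bad_rows.take(5)          # peek at a few of them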

There is no data at the index mentioned in some rows. Try something like the following: if the array has more than 9 elements, print the 10th element.
myData.foreach { x => if (x.size > 9) { println(x(9)) } }
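If the short or long rows come from commas embedded inside quoted values (common in database exports, e.g. in a free-text SUMMARY column), an alternative is to let Spark's CSV reader handle the quoting instead of splitting on raw commas. This is only a minimal sketch, assuming Spark 2.x and the same HDFS path as the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alarms").getOrCreate()

# The csv reader respects quoted fields, so a comma inside SUMMARY
# no longer splits one record into extra columns.
alarms_df = (spark.read
             .option("header", True)
             .option("quote", "\"")
             .csv("hdfs://nanalyticsedge.com:8020/hdp/oneday.csv"))
alarms_df.take(100)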

Related

Does PySpark run operation out-of-sequence due to optimization?

I'm confused about the result my code is giving me. Here is the code I wrote:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def update_cassandra(df: DataFrame, aggr: str):
    aggr_map_dict = {
        'Giornaliera': 'day',
        'Settimanale': 'week',
        'Bi-Settimanale': 'bi_week',
        'Mensile': 'month'
    }

    max_min_dates = df.agg(F.max(df['data']), F.min(df['data'])).collect()[0]
    upper_date = max_min_dates[0]
    lower_date = max_min_dates[1]

    df = df.select('data', 'punto_di_interesse', 'id_telco', 'presenze', 'presenze_uniche',
                   'presenze_00_06', 'presenze_06_08', 'presenze_08_10', 'presenze_10_12',
                   'presenze_12_14', 'presenze_14_16', 'presenze_16_18', 'presenze_18_20',
                   'presenze_20_22', 'presenze_22_24')

    print('contenuto del csv')
    display(df.where(F.col('punto_di_interesse') == 'CC - Neapolis'))

    telco_day_aggr = read_from_cassandra_dev(f'telco_{aggr_map_dict[aggr]}_aggr').where(
        F.col('data').between(lower_date, upper_date))
    if telco_day_aggr.count() == 0:
        telco_day_aggr = create_empty_df()

    print('telco_day_aggr as is')
    display(telco_day_aggr.where(F.col('punto_di_interesse') == 'CC - Neapolis'))

    union_df = df.union(telco_day_aggr)
    print('unione del AS-IS e del csv')
    display(union_df.where(F.col('punto_di_interesse') == 'CC - Neapolis'))

    output_df = (union_df.groupBy('data', 'punto_di_interesse', 'id_telco')
                 .agg(
                     F.sum('presenze').alias('presenze'),
                     F.sum('presenze_uniche').alias('presenze_uniche'),
                     F.sum('presenze_00_06').alias('presenze_00_06'),
                     F.sum('presenze_06_08').alias('presenze_06_08'),
                     F.sum('presenze_08_10').alias('presenze_08_10'),
                     F.sum('presenze_10_12').alias('presenze_10_12'),
                     F.sum('presenze_12_14').alias('presenze_12_14'),
                     F.sum('presenze_14_16').alias('presenze_14_16'),
                     F.sum('presenze_16_18').alias('presenze_16_18'),
                     F.sum('presenze_18_20').alias('presenze_18_20'),
                     F.sum('presenze_20_22').alias('presenze_20_22'),
                     F.sum('presenze_22_24').alias('presenze_22_24')
                 ))
    return output_df

aggregate_df = aggregate_table(df_daily, 'Giornaliera')
write_on_cassandra_dev(aggregate_df, 'telco_day_aggr')
What I expect to achieve is a sort of manual update for Cassandra, because of the way the Cassandra drivers handle writes. So the operations in my head are like this:
read the csv from blob storage and store it in a dataframe (the df variable, the input of the method)
with the max and min dates of this csv file, query the table in Cassandra and save the result in another variable
concatenate the two dataframes
sum up with the groupBy
write the new dataframe to Cassandra, overwriting the existing rows with the new ones
It seems to me that, somehow, what is in the dataframe "df" is written before I can read "telco_day_aggr", and that the union and groupBy parts have no effect. In other words, my Cassandra table ends up containing only the content of df.
I can provide additional information if needed.
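Because Spark evaluates lazily, one way to rule out ordering surprises between the Cassandra read and the later write is to force the read to materialize before building the union. This is only a sketch of how the read inside the function could be changed, under the assumption that the question's helpers (read_from_cassandra_dev, write_on_cassandra_dev) behave as described:
telco_day_aggr = (read_from_cassandra_dev(f'telco_{aggr_map_dict[aggr]}_aggr')
                  .where(F.col('data').between(lower_date, upper_date))
                  .localCheckpoint())  # materializes the read and cuts the lineage before the write

union_df = df.union(telco_day_aggr)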

Python .iloc error trying to input values into dataframe

From my df, which has a huge number of rows, I attempt to manually enter some values for some of the "NaN" entries. My code is below:
pamap2_df["heartrate"].iloc[0:4]=100
It does the task; however, it also throws this back in my face:
C:\Users\the-e\anaconda3\lib\site-packages\pandas\core\indexing.py:1637:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-
docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
df['some_col'].iloc[:4] is so-called chained indexing, and its behaviour is unpredictable. I recommend reading the link in the warning for details.
For updating the data, it is recommended that you use a single .loc or .iloc for both column and index:
col_idx = pamap2_df.columns.get_loc('heartrate')
pamap2_df.iloc[:4, col_idx] = 100
or:
idx = pamap2_df.index[:4]
pamap2_df.loc[idx, 'heartrate'] = 100
Note the warning might still persist if your pamap2_df is a slice of another dataframe. For example:
pamap2_df = df[df['Age'] < some_threshold]
idx = pamap2_df.index[:4]
# this will raise a warning / failure
pamap2_df.loc[idx, 'heartrate'] = 100
# this will do
df.loc[idx, 'heartrate'] = 100
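If pamap2_df really is meant to be an independent slice, another option is to take an explicit copy, so later assignments neither warn nor silently miss the parent frame. A minimal sketch with made-up data and the same column names:
import pandas as pd

df = pd.DataFrame({'Age': [25, 70, 80, 90, 95], 'heartrate': [float('nan')] * 5})  # hypothetical data

pamap2_df = df[df['Age'] < 100].copy()  # .copy() detaches the slice from df
pamap2_df.loc[pamap2_df.index[:4], 'heartrate'] = 100  # no SettingWithCopyWarning now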

RDD to DataFrame - I am getting output in the DataFrame table with quotes, like "2012-10-10", but my required output is without quotes, like 2012-10-10

My input file contains the input below:
"date","time","size","r_version","r_arch","r_os"
"2012-10-01","00:30:13",35165,"2.15.1","i686","linux-gnu"
"2012-10-01","00:30:15",212967,"2.15.1","i686","linux-gnu"
"2012-10-01","02:30:16",167199,"2.15.1","x86_64","linux-gnu"
My present output is shown in the "present output" screenshot (values still wrapped in double quotes); my required output is shown in the "required output" screenshot (values without quotes).
I tried the code below:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setMaster("local").setAppName("logfile")
sc = SparkContext(conf=conf)
spark = SparkSession.builder.appName("yuva").getOrCreate()

lines = sc.textFile("file:///SaprkCourse/filelog.txt")
header = lines.first()
lines = lines.filter(lambda row: row != header)
values = lines.map(lambda x: x.split(","))
df = values.toDF(header.split(","))
df.show()
You should check the data types in the data frame and cast them to String if needed. Maybe the data frame auto-infers the data type as a date.
For example, PySpark may auto-infer "2010-10-02" as a datetime.
You can use the below option while creating the dataframe
option("quote", "\"")
Hope this helps
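For context, here is a minimal sketch of how that option is typically applied with the DataFrame reader (Spark 2.x assumed, same file path as the question); the reader strips the surrounding quote character from each field:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yuva").getOrCreate()

df = (spark.read
      .option("header", True)
      .option("quote", "\"")
      .csv("file:///SaprkCourse/filelog.txt"))
df.show()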
Hi, I hope you are using PySpark 2; if so, you can simply write the command below:
lines= spark.read.csv("file:///SaprkCourse/filelog.txt",header=True)
Otherwise you can edit your code by adding a small function as shown below:
lines= sc.textFile("file:///SaprkCourse/filelog.txt")
header = lines.first()
lines = lines.filter(lambda row : row != header)
def text(x):
    k = x.replace('"', '').strip().split(",")
    return k
values=lines.map(text)
df=values.toDF(header.replace('"','').split(","))
df.show()

How do I give a text key to a dataframe stored as a value in a dictionary?

So I have 3 dataframes - df1, df2, df3. I'm trying to loop through each dataframe so that I can run some preprocessing - set datetime, extract hour to a separate column, etc. However, I'm running into some issues:
If I store the df in a dict as in df_dict = {'df1' : df1, 'df2' : df2, 'df3' : df3} and then loop through it as in
for k, v in df_dict.items():
    if k == 'df1':
        v['Col1']....
    else:
        v['Coln']....
I get a NameError: name 'df1' is not defined
What am I doing wrong? I initially thought I was not reading the df1..3 data in, but that seems to operate OK (as in, it doesn't fail, and it's clearly reading the data in given the time lag; they are big files). The code preceding it (for the load) is:
DF_DATA = { 'df1': 'df1.csv','df2': 'df2.csv', 'df3': 'df3.csv' }
for k, v in DF_DATA.items():
    print(k, v)          # this works to print out both key and value
    k = pd.read_csv(v)   # this does not
I am thinking this may be the cause, but I'm not sure. I'm expecting the load loop to create the 3 dataframes and put them into memory. Then, for the loop at the top of the page, I want to reference the string key in my if-block condition so that each df can get a slightly different preprocessing treatment.
Thanks very much in advance for your assistance.
You didn't create df_dict correctly. Try this:
DF_DATA = { 'df1': 'df1.csv','df2': 'df2.csv', 'df3': 'df3.csv' }
df_dict= {k:pd.read_csv(v) for k,v in DF_DATA.items()}
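A minimal sketch of how the preprocessing loop can then branch on the string key (the 'Col1'/'Coln' column names are placeholders taken from the question):
import pandas as pd

DF_DATA = {'df1': 'df1.csv', 'df2': 'df2.csv', 'df3': 'df3.csv'}
df_dict = {k: pd.read_csv(v) for k, v in DF_DATA.items()}

# each dataframe lives in the dict under its name, so no df1/df2/df3 variables are needed
for k, v in df_dict.items():
    if k == 'df1':
        v['Col1'] = pd.to_datetime(v['Col1'])  # hypothetical column
    else:
        v['Coln'] = pd.to_datetime(v['Coln'])  # hypothetical column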

Accessing a global lookup Apache Spark

I have a list of csv files, each with a bunch of category names as header columns. Each row is a user with a boolean value (0, 1) indicating whether they are part of that category or not. The csv files do not all have the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1,
21322,1,1,1,
21323,1,,,
File 2
user_id,cat4,cat5
21321,1,,,
21323,,1,,
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,1,1,,,
21322,1,1,1,,,
21323,1,1,,,,
Probably the title of the question is misleading, in the sense that it conveys a certain implementation choice: there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files can be divided into tuples of (user, category).
Any number of CSV files containing an arbitrary number of categories can be transformed into this simple format. The resulting CSV is the union of the previous step, the extraction of the total number of categories present, and some data transformation to get it into the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv: RDD[String]): RDD[(String, String)] = {
  val headerLine = csv.first
  val header = headerLine.split(",")
  val data = csv.filter(_ != headerLine).map(line => line.split(","))
  data.flatMap { elem =>
    val merged = elem.zip(header)
    val id = elem.head
    merged.tail.collect { case (v, cat) if v == "1" => (id, cat) }
  }
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
You don't need to do this as a 2-step process if all you need is the resulting values.
A possible design:
1/ Parse your csv. You don't mention whether your data is on a distributed FS, so I'll assume it is not.
2/ Enter your (K,V) pairs into a mutable, parallelized map (to take advantage of Spark).
pseudo-code:
val directory = ..
val map = new mutable.ParHashMap[String, String]()
while (files(i) != null) {
  val file = directory.spark.textFile("/myfile...")
  val cols = file.map(_.split(","))
  map.put(cols(0), cols(i++))
}
and then you can access your (K/V) tuples by way of an iterator on the map.
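As a cross-check of the same idea with the DataFrame API, here is a hedged PySpark sketch (file paths are placeholders): reading each CSV with a header and full-outer-joining on user_id yields the union of all headers, with null cells for categories a user's file did not contain.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("composite-categories").getOrCreate()

paths = ["file1.csv", "file2.csv"]  # hypothetical paths
dfs = [spark.read.option("header", True).csv(p) for p in paths]

# full outer join on user_id unions the headers; missing categories come back as null
combined = reduce(lambda a, b: a.join(b, on="user_id", how="full_outer"), dfs)
combined.orderBy("user_id").show()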
