Unable to read Kinesis stream from SparkStreaming - apache-spark

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Milliseconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import com.amazonaws.auth.AWSCredentials
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.auth.SystemPropertiesCredentialsProvider
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.TrimHorizon
import java.util.Date
val tStream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName(streamName)
  .endpointUrl(endpointUrl)
  .regionName(regionName)
  .initialPosition(new TrimHorizon())
  .checkpointAppName(appName)
  .checkpointInterval(kinesisCheckpointInterval)
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()

tStream.foreachRDD(rdd => if (rdd.count() > 0) rdd.saveAsTextFile("/user/hdfs/test/") else println("No record to read"))
Here, even though I can see data coming into the stream, my Spark job above isn't getting any records. I am sure that I am connecting to the right stream with the correct credentials.
Please help me out.

Related

Import sorting error. Flake8 isort error "I001 isort found an import in the wrong position"

Test error: flake8 (isort) reports "I001 isort found an import in the wrong position". How can I fix it?
The imports in views.py are:
from api.v1.filters import TitleFilter
from api.v1.permissions import (AuthorAdminModeratorOrReadOnly, IsAdmin,
                                IsAdminOrReadOnly)
from api.v1.serializers import (CategorySerializer, CommentSerializer,
                                ConfirmationCodeSerializer, GenreSerializer,
                                ReviewSerializer, SignUpSerializer,
                                TitleReadSerializer, TitleWriteSerializer,
                                UserEditSerializer, UserSerializer)
from api_yamdb.settings import DEFAULT_FROM_EMAIL  # <- the import isort flags
from django.contrib.auth.tokens import default_token_generator
from django.core.mail import send_mail
from django.db.models import Avg
from django.shortcuts import get_object_or_404
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework import filters, permissions, status, viewsets
from rest_framework.decorators import action, api_view, permission_classes
from rest_framework.response import Response
from rest_framework_simplejwt.tokens import AccessToken
from reviews.models import Category, Genre, Review, Title, User
from .mixins import CreateDestroyViewSet
I have run isort . and isort views.py, but nothing changes.
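For reference, below is a sketch of the grouping isort usually expects with its default profile, under the assumption that api, api_yamdb and reviews are configured as first-party packages (for example via known_first_party or src_paths) and the django/rest_framework packages as third-party. If the isort command and flake8's isort plugin read different configurations, isort . can leave the file untouched while flake8 still reports I001.

# Illustrative ordering only; the exact grouping depends on the isort profile
# and on which packages are configured as first-party.
from django.contrib.auth.tokens import default_token_generator
from django.core.mail import send_mail
from django.db.models import Avg
from django.shortcuts import get_object_or_404
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework import filters, permissions, status, viewsets
from rest_framework.decorators import action, api_view, permission_classes
from rest_framework.response import Response
from rest_framework_simplejwt.tokens import AccessToken

from api.v1.filters import TitleFilter
from api.v1.permissions import (AuthorAdminModeratorOrReadOnly, IsAdmin,
                                IsAdminOrReadOnly)
from api.v1.serializers import (CategorySerializer, CommentSerializer,
                                ConfirmationCodeSerializer, GenreSerializer,
                                ReviewSerializer, SignUpSerializer,
                                TitleReadSerializer, TitleWriteSerializer,
                                UserEditSerializer, UserSerializer)
from api_yamdb.settings import DEFAULT_FROM_EMAIL
from reviews.models import Category, Genre, Review, Title, User

from .mixins import CreateDestroyViewSet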

Cloud Schedule Cloud Function to read and write data to BigQuery fails

I am trying to schedule a Cloud Function in GCP that reads from and writes to BigQuery, but the execution scheduled in Cloud Scheduler keeps failing. My function (which, by the way, is validated and active in Cloud Functions) is given by:
def geopos_test(request):
    from flatten_json import flatten
    import requests
    import flatten_json
    import pandas as pd
    import os, json, sys, glob, pathlib
    import seaborn as sns
    from scipy import stats
    import collections
    try:
        collectionsAbc = collections.abc
    except AttributeError:
        collectionsAbc = collections
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import matplotlib.colors as colors
    import matplotlib.ticker as ticker
    import datetime
    from mpl_toolkits.axes_grid1 import make_axes_locatable
    from matplotlib.colors import ListedColormap, LinearSegmentedColormap
    from matplotlib.lines import Line2D
    import math
    from pandas.io.json import json_normalize
    from operator import attrgetter
    from datetime import date, timedelta
    import pandas_gbq
    from google.cloud import bigquery

    # Read the source table into a DataFrame.
    client = bigquery.Client()
    project = "<ProjectId>"
    dataset_id = "<DataSet>"
    dataset_ref = bigquery.DatasetReference(project, dataset_id)
    table_ref = dataset_ref.table('sectional_accuracy')
    table = client.get_table(table_ref)
    sectional_accuracy = client.list_rows(table).to_dataframe()
    sectional_accuracy = sectional_accuracy.drop_duplicates()
    sectional_accuracy = sectional_accuracy.sort_values(['Store'])

    # Load the DataFrame into the destination table.
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("Store", bigquery.enums.SqlTypeNames.STRING),
            bigquery.SchemaField("storeid", bigquery.enums.SqlTypeNames.STRING),
            bigquery.SchemaField("storeIdstr", bigquery.enums.SqlTypeNames.STRING),
            bigquery.SchemaField("Date", bigquery.enums.SqlTypeNames.TIMESTAMP),
            bigquery.SchemaField("Sections", bigquery.enums.SqlTypeNames.STRING),
            bigquery.SchemaField("Percentage", bigquery.enums.SqlTypeNames.FLOAT),
            bigquery.SchemaField("devicePercentage", bigquery.enums.SqlTypeNames.FLOAT),
            bigquery.SchemaField("distance", bigquery.enums.SqlTypeNames.STRING),
        ],
    )
    NtableId = '<ProjectId>.<DataSet>.test'
    job = client.load_table_from_dataframe(sectional_accuracy, NtableId, job_config=job_config)
This function only reads data from one table and writes it to a new one. The idea is to do a load of transformations between the reading and writing.
The function runs as the App Engine default service account, of which I am the owner, and I have added (probably overkill) the Cloud Run Invoker, Cloud Functions Invoker and Cloud Scheduler Job Runner roles.
Now, for the Cloud Scheduler:
I have defined it as an HTTP target with the POST method and the function's URL, and auth via an OIDC token using the same service account as the function. As for the HTTP headers, I have User-Agent with the value Google-Cloud-Scheduler. Note that I have no other headers, as I am uncertain of what they should be.
Yet, it fails every single time with a PERMISSION DENIED message in the log.
What I have tried:
Changing geopos_test(request) to geopos_test(event, context) (see the signature sketch below)
Changing the HTTP header to Content-Type: application/octet-stream or Content-Type: application/json
Changing the service account
What I haven't tried is giving some value in the request body, since I do not know what it could be.
I am now out of ideas. Any help would be appreciated.
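Since one of the attempts above was switching the signature, here is a minimal sketch of the HTTP-triggered form, which is what a Cloud Scheduler HTTP target calls: it keeps the (request) parameter and returns a response, whereas (event, context) applies only to background triggers such as Pub/Sub. The table IDs below reuse the placeholders from the post, and the job.result() call is an illustrative addition rather than something the original code does. The PERMISSION_DENIED status shown in the update below is recorded by Cloud Scheduler when the call is rejected before the function body runs, so this sketch speaks to the signature experiment rather than to the permission error itself.

from google.cloud import bigquery

def geopos_test(request):
    """HTTP trigger: 'request' is a flask.Request; the return value becomes the HTTP response."""
    client = bigquery.Client()
    table = client.get_table("<ProjectId>.<DataSet>.sectional_accuracy")
    df = client.list_rows(table).to_dataframe().drop_duplicates()
    # ... transformations would go here ...
    job = client.load_table_from_dataframe(df, "<ProjectId>.<DataSet>.test")
    job.result()      # wait for the load job to finish; raises if the load fails
    return "ok", 200  # an HTTP-triggered function should return a body (and optional status)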
Update: Error message:
{
  httpRequest: {1}
  insertId: "********"
  jsonPayload: {
    #type: "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
    jobName: "******************"
    status: "PERMISSION_DENIED"
    targetType: "HTTP"
    url: "*************"
  }
  logName: "*************/logs/cloudscheduler.googleapis.com%2Fexecutions"
  receiveTimestamp: "2022-10-24T10:10:52.337822391Z"
  resource: {2}
  severity: "ERROR"
  timestamp: "2022-10-24T10:10:52.337822391Z"
}

iCal4j Principals not found

We are trying to connect to our instance of the CalendarStore, but we don't understand the exception that we get back when executing the code.
The error we're getting:
org.codehaus.groovy.runtime.InvokerInvocationException: net.fortuna.ical4j.connector.ObjectStoreException: net.fortuna.ical4j.connector.FailedOperationException: Principals not found
We also know that the error occurs on line 62 of our script. Please note that I have edited the string values in the variables so as not to expose them publicly.
import com.github.caldav4j.CalDAVCollection;
import com.github.caldav4j.CalDAVConstants;
import com.github.caldav4j.exceptions.CalDAV4JException;
import com.github.caldav4j.methods.CalDAV4JMethodFactory;
import com.github.caldav4j.methods.HttpGetMethod;
import com.github.caldav4j.model.request.CalendarQuery;
import com.github.caldav4j.util.GenerateQuery;
import net.fortuna.ical4j.connector.ObjectStoreException;
import net.fortuna.ical4j.connector.dav.CalDavCalendarCollection;
import net.fortuna.ical4j.connector.dav.CalDavCalendarStore;
import net.fortuna.ical4j.connector.dav.PathResolver;
import org.apache.commons.codec.binary.Base64;
import org.apache.http.*;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.conn.routing.HttpRoute;
import org.apache.http.conn.routing.HttpRoutePlanner;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.conn.DefaultRoutePlanner;
import org.apache.http.impl.conn.DefaultSchemePortResolver;
import org.apache.http.protocol.HttpContext;
import net.fortuna.ical4j.model.Calendar;
import net.fortuna.ical4j.model.Component;
import net.fortuna.ical4j.model.ComponentList;
import net.fortuna.ical4j.model.component.VEvent;
import net.fortuna.ical4j.model.Date;
import net.fortuna.ical4j.data.CalendarBuilder;
import net.fortuna.ical4j.data.ParserException;
import net.fortuna.ical4j.connector.CalendarStore;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.HttpClients;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;
String USER = "User";
String PASS = "dont matter";
g_log.info("Start");
String uri = "our uri";
String prodId = "hm";
URL url = new URL("Our url");
PathResolver pathResolver = PathResolver.CHANDLER;
CalendarStore<CalDavCalendarCollection> calendarStore = new CalDavCalendarStore(prodId, url, pathResolver);
boolean testCon = calendarStore.connect(USER.toString(), PASS.toCharArray());
g_log.info("testCon: "+ testCon);

Not able to import name from root of the project

I have a server.py file written with Falcon, which looks like this:
try:
    import falcon, logging
    from os.path import dirname, realpath, join
    from wsgiref import simple_server
    from .config.config import server_config
    from .middlewares.SQLAlchemySessionManager import SQLAlchemySessionManager
    from .middlewares.GlobalInternalServerErrorManager import InternalServerErrorManager
    from .lib.dbConnector import Session
    from .routes import router
except ImportError as err:
    falcon = None
    raise err

serv_conf = server_config()

salescoachbot = falcon.API(middleware=[
    SQLAlchemySessionManager(Session),
    InternalServerErrorManager()
])
But when I try to import salescoachbot in another folder and file, like:
from ..server import salescoachbot
I get an error:
from ..server import salescoachbot
ImportError: cannot import name 'salescoachbot'
server.py is in the root of the project, and both the root and the folder of the file doing the import have an __init__.py.
What am I doing wrong here?
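As a hedged aside: a "cannot import name" error (rather than an "attempted relative import" one) means the relative import did resolve to server.py, but the name did not exist in it at that moment, which is typically the signature of a circular import. server.py imports .routes near the top, before salescoachbot is created, so any module reached through routes that in turn does from ..server import salescoachbot sees a half-initialised server module. Below is a minimal sketch of server.py reordered so the app object exists before the routes are pulled in; the late import is deliberate, and the sketch is an illustration of the idea rather than a drop-in replacement.

import falcon

from .config.config import server_config
from .middlewares.SQLAlchemySessionManager import SQLAlchemySessionManager
from .middlewares.GlobalInternalServerErrorManager import InternalServerErrorManager
from .lib.dbConnector import Session

serv_conf = server_config()

# Create the WSGI app first, so modules imported below can already see it.
salescoachbot = falcon.API(middleware=[
    SQLAlchemySessionManager(Session),
    InternalServerErrorManager()
])

# Import the routes only after salescoachbot exists; anything they import back
# from this module will now find the name.
from .routes import router  # noqa: E402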

Phoenix "org.apache.phoenix.spark.DefaultSource" error

I am new to Phoenix. I am trying to load an HBase table into Spark through Phoenix, and when I run the job I get the error below.
java.lang.ClassNotFoundException: org.apache.phoenix.spark.DefaultSource
My code:
package com.vas.reports

import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.phoenix.spark
import java.sql.DriverManager
import com.google.common.collect.ImmutableMap
import org.apache.hadoop.hbase.filter.FilterBase
import org.apache.phoenix.query.QueryConstants
import org.apache.phoenix.filter.ColumnProjectionFilter;
import org.apache.phoenix.hbase.index.util.ImmutableBytesPtr;
import org.apache.phoenix.hbase.index.util.VersionUtil;
import org.apache.hadoop.hbase.filter.Filter

object PhoenixRead {
  case class Record(NO: Int, NAME: String, DEPT: Int)

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "phoenixsample")
    val sqlcontext = new SQLContext(sc)
    val numWorkers = sc.getExecutorStorageStatus.map(_.blockManagerId.executorId).filter(_ != "driver").length
    import sqlcontext.implicits._

    val df1 = sc.parallelize(List((2, "Varun", 58),
      (3, "Alice", 45),
      (4, "kumar", 55))).
      toDF("NO", "NAME", "DEPT")
    df1.show()
    println(numWorkers)
    println("printing df2")
    val df = sqlcontext.load("org.apache.phoenix.spark", Map("table" -> "udm_main", "zkUrl" -> "phoenix url:2181/hbase-unsecure"))
    df.show()
  }
}
SPARK-SUBMIT
~~~~~~~~~~~~
spark-submit --class com.vas.reports.PhoenixRead --jars /home/hadoop1/phoenix-core-4.4.0-HBase-1.1.jar /shared/test/ratna-0.0.1-SNAPSHOT.jar
Please look into this and suggest what I am missing.
This is because you need to add the following library files to HBASE_HOME/libs and SPARK_HOME/lib.
in HBASE_HOME/libs:
phoenix-spark-4.7.0-HBase-1.1.jar
phoenix-4.7.0-HBase-1.1-server.jar
in SPARK_HOME/lib:
phoenix-spark-4.7.0-HBase-1.1.jar
phoenix-4.7.0-HBase-1.1-client.jar
