Ok, I have done my searching and I have tried many things. I think it is time to put my question here:
I have been working on taking in other user's SQL Server error logs, parsing out the rows into columns, then bulk inserting the data 1000 at a time. I troubleshoot SQL Server for other people so sp_readerrorlog will only show me my local instance. Finding root cause involves 4 sets of logs (SQL Server, Application Event, System Event, and get-clusterlog outputs and matching up timestamps. A fast load into SQL Server along with the ability to pull the exact timeframe needed will shorten my time spent staring at log files.
I am currently bottlenecked in testing the rows with a regular expression, which does work if I feed it data myself:
def sqlrowmatch(row):
pattern = re.compile(r'\d\d\d\d-\d\d-\d\d\s\d\d:\d\d:\d\d.\d\d')
if pattern.search(row):
return True
else:
return False
given any string that matches above (1111-11-11 11:11:11.11) will return as true. The idea is if in a SQL Server Error Log, if this is matched, then it is a separate entry. this will allow memory graphs, deadlock graphs, and dumps to all be grouped in one entry as opposed to being split over several lines.
However, if I point it at one of the SQL Error Logs, there seems to be extra characters. This is giving re.match and re.show a hard time finding a match. If I load any line in this function,sqlrowmatch(), it reports back false for all rows.
ÿþ <-- this appears to be the first 2 characters at the first line. re.search just doesn't even find it anywhere in the in the different elements.
False is what is returned if I put the function in with the 'with open' as statement:
with open(file, 'r') as sqllog:
for line in sqllog:
print(sqlrowmatch(line))
the first line should always be true if sqlrowmatch() is used.
2018-10-13 22:40:09.41 Server Microsoft SQL Server 2016 (SP2-CU2-GDR) (KB4458621) - 13.0.5201.2 (X64)
So I am lost and my current project is at a halt. Perhaps some seasoned insight from this group can get me going again.
TIA
Interesting enough, I found my answer here: Opening huge text file, unicode issue
open should be done with encoding='utf-16'
It now matches appropriately
Related
I have a Dash application that queries an API, based on a user search query, performs some calculations on the response, then displays the final results to the user on a Dash app. In order to provide a quick response to the user, I am trying to set up a quick result callback and a full result long_callback.
The quick result will grab limited results from the API and display results to the user within 10-15 seconds, while the full results will run in the background, collecting all results (which can take up to 2 minutes), then updates the page with the full results when they are available.
I am curious what the best way to perform this action is, as I have run into forking issues with my current attempt.
My current attempt: Using the diskcache.Cache() as the DiskcacheLongCallbackManager and a database txt file to store availability of results.
I have a database txt file that stores a dictionary, with the keys being the search query and the fields being quick_results: bool, full_results: bool, file_path: str, timestamp: dt (as str).
When a search query is entered and submit is pressed, a callback loads the database file as a variable and then checks the dictionary keys for the presence of this search query.
If it finds the query in the keys of the database, it loads the saved feather file from the provided file_path and returns it to the dash app for generation of the page content.
If it does not find the query in the database keys, it requests limited data from the API, runs calculations, saves the DataFrame as a feather file on disk, then creates an entry in the database with the search query(as the key), the file path of the saved feather file, the current timestamp, and sets the quick_results value to True.
It then loads this feather file from the file_path created and returns it to the dash app for generation of the page content.
A long_callback is triggered at the same time as the above callback, with a 20 second sleep to prevent overlap with the quick search. This callback also loads the database file as a variable and checks if the query is present in the database keys.
If found, it then checks if the full results value is True and if the timestamp is more than 0 days old.
If the full results are unavailable or are more than 0 days old, the long_callback requests full results from the API, performs the same calculations, then updates the already existing search query in the database, making the full_results True and the timestamp the time of completion for the full search.
It then loads the feather file from the file_path and returns it to the dash app for generation of the page content.
If the results are available and less then 1 day old, the long callback simply loads the feather file from the provided file_path and returns it to the dash app for generation of the page content.
The problem I am currently facing is that I am getting a weird forking error on the long callback on only one of the conditions for a full search. I currently have the long_callback setup to perform a full search only if the full results flag is False or the results are more than 0 days old. When the full_results flag is False, the callback runs as expected, updates the database and returns the full results. However, when the results are available but more than 0 days old, the callback hits a forking error and is unable to complete.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec(). Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
I am at a loss as to why the function would run without error on one of the conditions, but then have a forking error on the other condition. The process that runs after both conditions is exactly the same.
By using print statements, I have noticed that this forking error triggers when the function tries to call the requests.get() function on the API.
If this issue is related to how I have setup the background process functionality, I would greatly appreciate some suggestions or assitance on how to do this properly, where I will not face this forking error.
If there is any information I have left out that will be helpful, please let me know and I will try to provide it.
Thank you for any help you can provide.
Trying out the queue system for a better user upload experience with Laravel-Excel.
.env was been changed from 'sync' to 'database' and migrations run. All the necessary use statements are in place yet the error above persists.
The exact error happens here:
Illuminate\Queue\Queue.php:97
$payload = json_encode($this->createPayloadArray($job, $queue, $data));
if (JSON_ERROR_NONE !== json_last_error()) {
throw new InvalidPayloadException(
If I drop ShouldQueue, the file imports perfectly in-session (large file so long wait period for user.)
I've read many stackoverflow, github etc comments on this but I don't have the technical skills to deep-dive to fix my particular situation (most of them speak of UTF-8 but I don't if that's an issue here; I changed the excel save format to UTF-8 but it didn't fix it.)
Ps. Whilst running the migration, I got the error:
SQLSTATE[42000]: Syntax error or access violation: 1071 Specified key was too long; max key length is 767 bytes (SQL: alter table `jobs` add index `jobs_queue_index`(`queue`))
I bypassed by dropping the 'add index'; so my jobs table is not indexed on queue but I don't feel this is the cause.
One thing you can do when looking into json_encode() errors is use the json_last_error_msg() function, which will give you a bit more of a readable error message.
In your case you're getting a '5' back, which is the JSON_ERROR_UTF8 error code. The error message back for this is a slightly more informative one:
'Malformed UTF-8 characters, possibly incorrectly encoded'
So we know it's encountering non-UTF-8 characters, even though you're saving the file specifically with UTF-8 encoding. At first glance you might think you need to convert the encoding yourself in code (like this answer), but in this case, I don't think that'll help. For Laravel-Excel, this seems to be a limitation of trying to queue-read .xls files - from the Laravel-Excel docs:
You currently cannot queue xls imports. PhpSpreadsheet's Xls reader contains some non-utf8 characters, which makes it impossible to queue.
In this case you might be stuck with a slow, non-queueable option, or need to convert your spreadsheet into a queueable format e.g. .csv.
The key length error on running the migration is unrelated. It has been around for a while and is a side-effect of using an older version of MySQL/MariaDB. Check out this answer and the Laravel documentation around index lengths - you need to add this to your AppServiceProvider::boot() method:
Schema::defaultStringLength(191);
I have a set of multiple API's I need to source data from and need four different data categories. This data is then used for reporting purposes in Excel.
I initially created web queries in Excel, but my Laptop just crashes because there is too many querie which have to be updated. Do you guys know a smart workaround?
This is an example of the API I will source data from (40 different ones in total)
https://api.similarweb.com/SimilarWebAddon/id.priceprice.com/all
The data points I need are:
EstimatedMonthlyVisits, TopOrganicKeywords, OrganicSearchShare, TrafficSources
Any ideas how I can create an automated report which queries the above data on request?
Thanks so much.
If Excel is crashing due to the demand, and that doesn't surprise me, you should consider using Python or R for this task.
install.packages("XML")
install.packages("plyr")
install.packages("ggplot2")
install.packages("gridExtra")
require("XML")
require("plyr")
require("ggplot2")
require("gridExtra")
Next we need to set our working directory and parse the XML file as a matter of practice, so we're sure that R can access the data within the file. This is basically reading the file into R. Then, just to confirm that R knows our file is in XML, we check the class. Indeed, R is aware that it's XML.
setwd("C:/Users/Tobi/Documents/R/InformIT") #you will need to change the filepath on your machine
xmlfile=xmlParse("pubmed_sample.xml")
class(xmlfile) #"XMLInternalDocument" "XMLAbstractDocument"
Now we can begin to explore our XML. Perhaps we want to confirm that our HTTP query on Entrez pulled the correct results, just as when we query PubMed's website. We start by looking at the contents of the first node or root, PubmedArticleSet. We can also find out how many child nodes the root has and their names. This process corresponds to checking how many entries are in the XML file. The root's child nodes are all named PubmedArticle.
xmltop = xmlRoot(xmlfile) #gives content of root
class(xmltop)#"XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop) #give name of node, PubmedArticleSet
xmlSize(xmltop) #how many children in node, 19
xmlName(xmltop[[1]]) #name of root's children
To see the first two entries, we can do the following.
# have a look at the content of the first child entry
xmltop[[1]]
# have a look at the content of the 2nd child entry
xmltop[[2]]
Our exploration continues by looking at subnodes of the root. As with the root node, we can list the name and size of the subnodes as well as their attributes. In this case, the subnodes are MedlineCitation and PubmedData.
#Root Node's children
xmlSize(xmltop[[1]]) #number of nodes in each child
xmlSApply(xmltop[[1]], xmlName) #name(s)
xmlSApply(xmltop[[1]], xmlAttrs) #attribute(s)
xmlSApply(xmltop[[1]], xmlSize) #size
We can also separate each of the 19 entries by these subnodes. Here we do so for the first and second entries:
#take a look at the MedlineCitation subnode of 1st child
xmltop[[1]][[1]]
#take a look at the PubmedData subnode of 1st child
xmltop[[1]][[2]]
#subnodes of 2nd child
xmltop[[2]][[1]]
xmltop[[2]][[2]]
The separation of entries is really just us, indexing into the tree structure of the XML. We can continue to do this until we exhaust a path—or, in XML terminology, reach the end of the branch. We can do this via the numbers of the child nodes or their actual names:
#we can keep going till we reach the end of a branch
xmltop[[1]][[1]][[5]][[2]] #title of first article
xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']] #same command, but more readable
Finally, we can transform the XML into a more familiar structure—a dataframe. Our command completes with errors due to non-uniform formatting of data and nodes. So we must check that all the data from the XML is properly inputted into our dataframe. Indeed, there are duplicate rows, due to the creation of separate rows for tag attributes. For instance, the ELocationID node has two attributes, ValidYN and EIDType. Take the time to note how the duplicates arise from this separation.
#Turning XML into a dataframe
Madhu2012=ldply(xmlToList("pubmed_sample.xml"), data.frame) #completes with errors: "row names were found from a short variable and have been discarded"
View(Madhu2012) #for easy checking that the data is properly formatted
Madhu2012.Clean=Madhu2012[Madhu2012[25]=='Y',] #gets rid of duplicated rows
Here is a link that should help you get started.
http://www.informit.com/articles/article.aspx?p=2215520
If you have never used R before, it will take a little getting used to, but it's worth it. I've been using it for a few years now and when compared to Excel, I have seen R perform anywhere from a couple hundred percent faster to many thousands of percent faster than Excel. Good luck.
I am using container engine, and my pods are hosted there.
I am trying to fetch logs, using log api :
http://localhost:8000/api/v1/namespaces/app-test/pods/designer-0/log?tailLines=100&sinceTime=2017-09-17T10:47:58Z
if i used both the query params separately, it works and show the proper result, but if i am using it simultaneously only the top 100 logs are returning, the sinceTime param is get ignored.
my scenario is, i need a log from a specific time, in a chunk like, 100 lines, 100 lines.. like this.
I am not sure, whether it is a bug, or it is not implemented.
I found this from the api reference manual
https://kubernetes.io/docs/api-reference/v1.6/
tailLines - If set, the number of lines from the end of the logs to
show. If not specified, logs are shown from the creation of the
container or sinceSeconds or sinceTime
So, that means if you specify tailLines, it start from the end. I dont see any option explicitly mentioned other than limitBytes. But you will have to play around with it as it does not guarantee number of lines.
tailLines=X tells the server to start that many lines from the end
sinceTime tells the server to start from the specified time
the options are mutually exclusive
Thanks All,
I have later on recognized that, it is not ignoring the sinceTime, as the TailLines intended functionality is return the lines from the last.
So, if i mentioned the sinceTime= 10 PM yesterday, it will return the records from that time..And if also tailLines, is mentioned, so it will return the recent logs from that chunk.
So, it was working as expected. I need to play with LimitBytes for getting the logs in chunk, from that time, Instead of full logs.
OK, so my latest project requires loading an Excel 2007 spreadsheet into a SQL Server table. I'm working in SSIS 2008R2. Based on some stuff I found on the internet, I opened the Excel source in Advanced editor and changed the datatype of the long column to DT_NTEXT, so that it wouldn't truncate it. Then I made the database column VARCHAR(MAX). This runs correctly in debug mode on my laptop.
Then I deployed it to the development server and attempted to load the same test file. It failed with the following error messages:
Error: Code: 0xC0208265
Source: Main Data Flow Task Get Main Data [1]
Description: Failed to retrieve long data for column "DESCR".
End Error
Error: Code: 0xC020901C
Source: Main Data Flow Task Get Main Data [1]
Description: There was an error with output column "DESCR" (72) on output "Excel Source Output" (9). The column status returned was: "DBSTATUS_UNAVAILABLE".
End Error
Error: Code: 0xC0209029
Source: Main Data Flow Task Get Main Data [1]
Description: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The "output column "DESCR" (72)" failed because error code 0xC0209071 occurred, and the error row disposition on "output column "DESCR" (72)" specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.
End Error
Searching for information about the error, I found about a million sites offering the same three suggested solutions:
Add 'IMEX=1' to the extended properties of the connection string.
It was already there.
Change the TypeGuessRows key in the registry.
This was set to zero on the server, which I understand to mean that it should look at the entire file. Nevertheless, I changed it to 8 to match my laptop. The same error occurred when I ran it again. Then I changed it to 1,763, which is more than the number of rows in the spreadsheet. It still gave the same error. So, I put it back to zero. (There's a 1,900-character value in the first row of my test file, so it shouldn't really matter how many it checks, in this case.)
Change the datatype to DT_WSTR(4000) in the source.
The column is supposed to have up to 10,000 characters, so I'm not sure this would be a good idea even if it worked. However, I tried it anyway. This time it gave me a truncation error. I changed the truncation error disposition to "ignore failure" and it loaded the data, but truncated the value to 255 characters. I have verified that the length is 4000 and doesn't get changed when I save the file, but it's still truncating at 255 characters.
I have no idea what else to look at. Any help would be appreciated.
UPDATE 1/29: The package, without any changes, works correctly when running on the pre-production server. It still fails when running on the development server. Both servers have the same version of SSIS (including minor version numbers) as well as the same versions of Windows, Access and Excel. I do not know how to explain this, nor do I know how to tell if it would work in production.
I created a new package with similar non-functional requirements (Excel 2007 file, SSIS 2008, SQL Server 2008 R2, VARCHAR(MAX) target column) and it worked just fine after deployment into the database server. My package:
Metadata at the Excel Source component's output (checked using Advanced Editor): DT_NTEXT
Derived Column component between source and destination to cast to non-unicode from unicode using (DT_TEXT,1252)
Metadata at the OLE DB Destination component's input (checked using Advanced Editor): DT_TEXT
Target Column data type: VARCHAR(MAX)
I do not explicitly use the extended property IMEX in the connection
Executed by right-clicking on the package at the database server, and loaded a file with a few thousand characters per record into the table without truncation. Hope this helps
I have faced this issue while importing an excel file with a field containing more than 255 characters. I solved the issue using Python.
Simply, import the excel in a pandas data frame and then calculate the length of each of those string values per row.
Then, sort the dataframe in descending order. This will enable SSIS to allocate maximum space for that field as it scans the first 3 rows to allocate storage:
df = pd.read_excel(f,sheet_name=0,skiprows = 1)
df = df.drop(df.columns[[0]], axis = 1)
df['length'] = df['Item Description'].str.len()
df.sort_values('length', ascending=False, inplace=True)
writer = ExcelWriter('Clean/Cleaned_'+f[5:])
df.to_excel(writer,sheet_name='Billing',index=False)
writer.save()