Text mining error while creating the DocumentTermMatrix & word cloud

I am getting the error message
Error in simple_triplet_matrix(i, j, v, nrow = length(terms), ncol = length(corpus), : 'i, j' invalid
while creating the DocumentTermMatrix or creating a word cloud. This is happening with all data sets. Here is the code I used:
corpus=tm_map(corpus,tolower)
corpus=tm_map(corpus,removePunctuation)
corpus=tm_map(corpus,removeWords,stopwords("english"))
corpus=tm_map(corpus,stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
library("SnowballC")
dtm=DocumentTermMatrix(corpus)
library("wordcloud")
wordcloud(corpus,min.freq=4,scale=c(5,1),random.color=F,max.words=45,random.order=F)
I have recently changed my computer and reinstalled R (3.3.4). I am not sure if that is causing the problem; everything was working fine on the old PC. Please help.
PS: I have read through the available questions on this topic, and suggestions such as installing the 'SnowballC' package are not helping.
Any help will be much appreciated.
Thanks,
Nidhin VC

Remove this line:
corpus <- tm_map(corpus, PlainTextDocument)
That wrapper can leave the corpus in a state that DocumentTermMatrix cannot index, which is what produces the 'i, j' invalid error.

Related

How to get a right video url of an Instagram post using python

I am trying to build a program with a function that takes the URL of a post as input and outputs the links of the images and videos the post contains. It works really well for images. However, when it comes to getting the links of videos, it returns a wrong URL, and I have no idea how to handle this situation.
https://scontent-lax3-2.cdninstagram.com/v/t50.2886-16/86731551_2762014420555254_3542675879337307555_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjcyMC5jYXJvdXNlbF9pdGVtIiwicWVfZ3JvdXBzIjoiW1wiaWdfd2ViX2RlbGl2ZXJ5X3Z0c19vdGZcIl0ifQ&_nc_ht=scontent-lax3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=WDuXskvIuLEAX9rj7MU&vs=17877888256532912_3147883953&_nc_vs=HBksFQAYJEdCOXJLd1gyMVdhWUNkQUpBS090UGo3eEhDb3hia1lMQUFBRhUAAsgBABUAGCRHTFBXTUFVTXNPaG5XcW9EQU5KUEE5bEZVdVZxYmtZTEFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMBUAABgAFuD4nJGH9sE%2FFQIoAkMzLBdAEszMzMzMzRgSZGFzaF9iYXNlbGluZV8xX3YxEQB17gcA&_nc_rid=97e769e058&oe=5EDF10A5&oh=3713c35f89fca1aa9554a281aa3421ed
https://scontent-gmp1-1.cdninstagram.com/v/t50.2886-16/0_0_0_\x00.mp4?_nc_ht=scontent-gmp1-1.cdninstagram.com&_nc_cat=100&_nc_ohc=Wnu_-GvKHJoAX9F_ui1&oe=5EDE8214&oh=7920ac3339d5bf313e898c3cbec85fa2
Here are two URLs. The first one is copied from the source of the web page, while the second one is copied from the data scraped by pyquery. They come from the same Instagram post, same path, but they are totally different. The first one works well, but the second one doesn't. How can I solve this? How can I get the right video URL?
I have been searching online for a long time, with no luck. Please help, or try to give some ideas on how to achieve this.
Here is my code related to the question:
import json

from pyquery import PyQuery as pq


def getUrls(url):
    URL = str(url)
    html = get_html(URL)  # get_html is a helper defined elsewhere in my code
    doc = pq(html)
    urls = []
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            js_data = json.loads(item.text()[21:-1], encoding='utf-8')
            shortcode_media = js_data["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
            edges = shortcode_media['edge_sidecar_to_children']['edges']
            for edge in edges:
                is_video = edge['node']['is_video']
                if is_video:
                    video_url = edge['node']['video_url']
                    # str.replace returns a new string, so keep the result
                    video_url = video_url.replace(r'\u0026', "&")
                    urls.append(video_url)
                else:
                    display_url = edge['node']['display_resources'][-1]['src']
                    display_url = display_url.replace(r'\u0026', "&")
                    urls.append(display_url)
    return urls
Thanks in advance.
There's nothing wrong with your code. This is a known intermittent issue with Instagram, and other people have encountered it too: https://github.com/arc298/instagram-scraper/issues/545
There doesn't appear to be a known fix yet.
Also, while unrelated to your question, it's worth mentioning that you don't need to inspect the display_resources object to get the URL of the image:
display_url = edge['node']['display_resources'][-1]['src']
There is already a display_url property available (I'm guessing you saw this, based on the variable name?). So you can simply do:
display_url = edge['node']['display_url']
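Putting the two together, the inner loop from getUrls could be pulled into a small helper along these lines. This is just a sketch; collect_media_urls is a made-up name, and edges is the list already extracted in the question's code:
def collect_media_urls(edges):
    # Combines the loop from the question with the display_url shortcut above
    urls = []
    for edge in edges:
        node = edge['node']
        if node['is_video']:
            # Direct MP4 link (when Instagram returns a valid one)
            urls.append(node['video_url'].replace(r'\u0026', "&"))
        else:
            # display_url already points at the full-size image
            urls.append(node['display_url'].replace(r'\u0026', "&"))
    return urls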
I've seen this sometimes when using this Python module instead of HTML-scraping. At least with that module, edge["node"]["videos"]["standard_resolution"]["url"] usually (but not always) gives a non-corrupted value.

Python Machine Learning Regression P.1 pandas error

[PyCharm screenshot of the error message, with the code above it]
I don't know what to do about the error, and I don't know what it means.
Also, there should be a print(data.head()) between lines 8 and 9.
The error is happening in your data = pd.read_csv() line. Check your working directory, or point explicitly to the file location with something like:
data = pd.read_csv("C:/Users/user/filename.csv")
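If it helps, here is a minimal sketch of that check; the path is only a placeholder:
import os
import pandas as pd

# See which directory relative paths are resolved against
print(os.getcwd())

# Either pass an absolute path (placeholder path) ...
data = pd.read_csv("C:/Users/user/filename.csv")

# ... or change the working directory first and use the bare file name
# os.chdir("C:/Users/user")
# data = pd.read_csv("filename.csv")

print(data.head())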

The procedure entry point could not be located in the dynamic link library in deep learning

I get an error in my project (link below):
Movie-Review-Sentiment-Analysis (Sentiment Analysis using a Recursive Neural Network).
This is the relevant part of my code:
hidden_layer = tf.contrib.rnn.BasicLSTMCell(hidden_layer_size)
hidden_layer = tf.contrib.rnn.DropoutWrapper(hidden_layer, dropout_rate)
cell = tf.contrib.rnn.MultiRNNCell([hidden_layer]*number_of_layers)
init_state = cell.zero_state(batch_size, tf.float32)
But I get this error:
NotFoundError:
C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\image\python\ops_distort_image_ops.so not found
tensorflow\contrib\coder\python\ops_coder_ops.so not found
When I execute the code above, I also get a message box that says
the procedure entry point could not be located in the dynamic link library
I am running this code from GitHub.
If someone is still trying to find the answer to this problem, try editing the code like this:
hidden_layer = tf.nn.rnn_cell.BasicLSTMCell(hidden_layer_size)
hidden_layer = tf.nn.rnn_cell.DropoutWrapper(hidden_layer, dropout_rate)
cell = tf.nn.rnn_cell.MultiRNNCell([hidden_layer]*number_of_layers)
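For context, here is a minimal self-contained sketch of the same stacked-LSTM setup with the tf.nn.rnn_cell API (TensorFlow 1.x). The names hidden_layer_size, number_of_layers, and batch_size come from the question; the values and keep_prob are made-up placeholders, and the cells are created once per layer instead of reusing a single object:
import tensorflow as tf

hidden_layer_size = 128   # placeholder values, not from the original project
number_of_layers = 2
batch_size = 32
keep_prob = 0.75

def make_cell():
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_layer_size)
    return tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)

# One cell object per layer avoids accidentally sharing state between layers
cell = tf.nn.rnn_cell.MultiRNNCell([make_cell() for _ in range(number_of_layers)])
init_state = cell.zero_state(batch_size, tf.float32)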

Why would Tensorflow not save run_metadata?

I was simply trying to generate a summary that would show the run_metadata as follows:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary = sess.run([x, y], options=run_options, run_metadata=run_metadata)
train_writer.add_run_metadata(paths.logs, 'step%d' % step)
train_writer.add_summary(paths.logs, step)
I made sure the path to the logs folder exists; this is confirmed by the fact that the summary file is generated, but no metadata is present. To be honest, I am not sure a file is actually generated for the metadata, but when I open TensorBoard the graph looks fine and the session runs dropdown menu is populated. When I select any of the runs, it shows a progress bar "Parsing metadata.pbtxt" that stops and hangs halfway through.
This prevents me from gathering any additional info about my graph. Am I missing something? A similar issue happened when trying to run this tutorial locally (the MNIST summary tutorial). I feel like I am missing something simple. Does anyone have an idea about what could cause this issue? Why would my TensorBoard hang when trying to load session run data?
I can't believe I made it work right after posting the question, but here it goes. I noticed that this line:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
was giving me an error, so I removed the params and turned it into
run_options = tf.RunOptions()
without realizing that this is what caused the metadata not to be parsed. Once I researched the error message:
Couldn't open CUDA library cupti64_90.dll
I looked into this GitHub thread and moved the file into the bin folder. After that I ran my code again with the trace_level param, got no errors, and the metadata was successfully parsed.
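For reference, here is a minimal sketch of collecting and writing run metadata in TensorFlow 1.x. The merged summary op, train op, writer, and step variable are assumed placeholders, not taken from the original code:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# merged_summary and train_op are hypothetical ops from the training graph
summary, _ = sess.run([merged_summary, train_op],
                      options=run_options,
                      run_metadata=run_metadata)

# add_run_metadata expects the RunMetadata proto and a tag that is unique per step
train_writer.add_run_metadata(run_metadata, 'step%d' % step)
train_writer.add_summary(summary, step)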

error uploading csv file on cloud jupyter notebook

I have set up a Google Cloud account because I want to run my deep learning much faster in a Jupyter notebook, but I cannot find a way to read my CSV file.
I downloaded it with wget from my GitHub account, and afterwards I tried
dataset = pd.read_csv('/home/user/.jupyter/SIEMENSTRAIN.csv')
but I get the following error:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
Why? When I read it on my laptop in my Jupyter notebook, everything runs well.
Any suggestions?
I tried the recommended solutions for this error and got the following warning:
/home/user/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':
When I ran dataset.head(), this is what appeared:
Any help please?
There are a number of possibilities that could be causing the problem... I would first make sure that your pandas (pd) version is up to date and compatible.
The more likely cause is that the CSV itself is not right, so pd.read_csv() is not able to parse it correctly (hence the parser error). This may have something to do with the headers, though I'm not sure what your original CSV file looks like. It's worth playing around with read_csv, for example:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
This adjusts two things: the delimiter (replace 'delimiter' with the separator your file actually uses, e.g. ',' or ';'), and whether pandas reads a header row from the CSV.
I go through some pd.read_csv() material in my book about stock prediction (another cool machine learning problem) and deep learning; feel free to check it out.
Good luck!
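If it helps, here is a minimal sketch of those checks, using the path from the question; the ';' separator is only an example of an explicit delimiter:
import pandas as pd

path = '/home/user/.jupyter/SIEMENSTRAIN.csv'

# First, look at the raw beginning of the file to see its real structure
with open(path) as f:
    for _ in range(5):
        print(repr(f.readline()))

# Then pass the delimiter explicitly (swap ';' for whatever the file actually uses)
dataset = pd.read_csv(path, sep=';', header=None)
print(dataset.head())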
I tried what you proposed and this is what I got:
So, any suggestions?
I suppose the path is OK, but the file just won't be read properly, or am I wrong?
