Thursday, November 8, 2018

PHPhun

Ah, how quaint to look back at my old Input/Output post and see the exuberant naivete of youth... sentences like "I'm excited to try to figure out more PHP and get things hopefully running as a start-to-finish 'use case' in the next day or two!" have now been confronted with the reality of many hours banging my head against the wall trying to figure out how to get PHP to work with the R scripts I have. The good news is that I think I've finally overcome the major obstacles, and now I really do feel confident in getting a "use case" demo up in the next couple of days at most.

The goal

The point of PHP is essentially to get the parts of the project to interact with each other. Here's a basic outline of what should be possible:

  • user selects a SOTU from a list, let's say Obama's 2016 SOTU 
  • user clicks a button to display sentiment for that SOTU 
  • the button-press activates a PHP script that: 
    • identifies the proper SOTU 
    • runs an R script on that SOTU 
    • the R script returns a few lines of sentiment (starting with "#A tibble" in the picture below for example) 
    • that output is written to a file 
    • that file is displayed back to the user in their browser
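
In rough PHP terms, the flow above might look something like the following sketch. Everything named here is hypothetical: the SOTU_FILES lookup table, the file names, the sentimentCommand helper, the "sotu" form field, and the placeholder paths. It also assumes the R script accepts the text file as a command-line argument, which is not how the current script works.

```php
<?php
// Hypothetical lookup table: form value => text file. Using a fixed list
// means raw user input is never passed to the shell.
const SOTU_FILES = [
    'obama-2016' => 'texts/obama_2016.txt',
];

// Build the shell command for a selection, or return null if the
// selection is not in the list.
function sentimentCommand(string $choice, string $rscript, string $script): ?string
{
    if (!array_key_exists($choice, SOTU_FILES)) {
        return null; // unknown selection: run nothing
    }
    // escapeshellarg() quotes each piece; 2>&1 keeps error messages
    // in the captured output.
    return escapeshellarg($rscript) . ' ' . escapeshellarg($script) . ' '
        . escapeshellarg(SOTU_FILES[$choice]) . ' 2>&1';
}

// "sotu" is an assumed form field name; the paths are placeholders.
$command = sentimentCommand($_GET['sotu'] ?? '', '/path/to/Rscript', '/path/to/myscript.R');
if ($command !== null) {
    $lines = [];
    exec($command, $lines);                       // run the R script
    file_put_contents('sentiments.txt', implode("\n", $lines) . "\n"); // write output to a file
    echo htmlspecialchars(file_get_contents('sentiments.txt'));        // show it to the user
}
```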



The problem(s)

I will copy here a long portion of an email I sent asking for help from a good friend and talented programmer, Olly, with some extra notes highlighted in yellow:

1. I know that I can do the following in the command line:
echo words >> output.txt
that works fine, so far so good.
the above commands will find output.txt (or create it if it doesn't exist) and append the word "words" to the file.

2. I also know that because I have R installed on my command line and a short script called myscript.R, I can type this into my terminal:

Rscript pathTo/myscript.R >> output.txt
which does what I expect. The output of the script, which is redirected (appended) to output.txt, looks like this:


the script "myscript.R" is running a really basic sentiment analysis on a text (a pre-defined text in this case; the results will be the same every time). By using the >> redirection again, the system sends those results to output.txt, resulting in a file like the one you see above, giving the number of "negative" and "positive" sentiment words found in the text.

3. What I want to have happen next is for PHP to do this script for me, triggered by a user clicking a button on the web page. Then the resulting file should be echoed back to the user so they can see the same results you see above. But I can't do it! Here's what I can do:

4. I can definitely call the PHP script (text-to-file.php), which can pipe arbitrary text to a file, then show the file to the user. I actually did this a couple different ways but I think the cleanest was at this commit, basically:

file_put_contents("../sentiments.md","Text is getting appended then a line break \n", FILE_APPEND);

$output = fopen("../sentiments.md","r");

echo fread($output,filesize("../sentiments.md"));

I could just keep hitting refresh and seeing new lines of "Text is getting appended then a line break" appear in the browser each time (which is what I was expecting).

5. I can also have the php script do:

exec("echo happy >> sentiments.txt");

$output = fopen("sentiments.txt","r");

echo fread($output, filesize("sentiments.txt"));


which again does what I want: I can F5 and keep appending "happy" to sentiments.txt, which shows in the browser, over and over.

But when I simply swap out "echo happy" with "Rscript myscript.R," it no longer works! The file (sentiments.txt) is created, but nothing is written to it, which makes the fread throw an error or at best display its contents: nothing. I'm guessing that the Rscript is just taking too long (it takes a good couple of seconds to execute and that time will only increase with longer documents) but I've had no luck using "sleep" or anything like that to try to introduce a delay between running the command and reading from the file.

Identifying and Overcoming

One really helpful factor in figuring out what was going wrong was redirecting my STDERR (standard error) to my STDOUT (standard output) so that instead of getting blank files, I was getting files that had the error printed to them. Thanks to Brian Storti for this great article explaining the 2>&1 idiom to redirect that output.
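
Here is a minimal sketch of that idea from the PHP side. The helper name runAndCapture is my own invention, not part of PHP. Appending 2>&1 to the command means exec() collects error messages along with normal output, and the third argument to exec() exposes the command's exit code.

```php
<?php
// Hypothetical helper: run a shell command and capture stdout AND stderr.
function runAndCapture(string $command): array
{
    $lines = [];
    $exitCode = 0;
    // "2>&1" sends stderr to wherever stdout is going, so exec() sees both.
    exec($command . ' 2>&1', $lines, $exitCode);
    return ['output' => implode("\n", $lines), 'exitCode' => $exitCode];
}

// A command that does not exist now yields a visible message and a
// non-zero exit code instead of silent, empty output.
$result = runAndCapture('no-such-command-here');
echo "exit code: {$result['exitCode']}\n";
echo $result['output'], "\n";
```

Note that exec() waits for the command to finish before returning, so a slow script by itself should not leave the output empty.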

Error the First: Environment Variables

Once I was able to see the error output that I'd redirected to a file, I was surprised to find an error along the lines of "the term RScript is not recognized as the name of a cmdlet, function, script file..." This was a weird problem, since I knew for sure I had already added Rscript.exe to my PATH variable (after all, I could run it with no problem just by typing it into the terminal).

It turns out that the Apache server has its own environment variables. From the documentation, it seemed like it should be able to use the system's PATH variable too, but in practice that wasn't working. Calling phpinfo() showed the environment variables that Apache was using, and Rscript was nowhere to be found. Stupidly, I couldn't quite figure out how to update the environment variables for Apache, so my solution for now is just to use the "f:/ully/qualified/path/to/Rscript.exe" in the PHP file. This seems like a silly way to do it, but it works, so I'm forging ahead.
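
As a sketch of both halves of this (checking what PATH the PHP process actually sees, and sidestepping PATH with a full path), something like the following could work. The function names pathDirectories and buildRscriptCommand are hypothetical, and the paths are placeholders rather than real install locations.

```php
<?php
// List the directories on the PATH seen by this PHP process (and by
// anything it launches with exec). Under Apache this can differ from
// the PATH of an interactive terminal.
function pathDirectories(): array
{
    $path = getenv('PATH');
    if ($path === false || $path === '') {
        return [];
    }
    // PATH_SEPARATOR is ";" on Windows and ":" elsewhere.
    return explode(PATH_SEPARATOR, $path);
}

// Build a command that does not depend on PATH at all.
// $rscript and $script are placeholders supplied by the caller.
function buildRscriptCommand(string $rscript, string $script): string
{
    return escapeshellarg($rscript) . ' ' . escapeshellarg($script) . ' 2>&1';
}

foreach (pathDirectories() as $dir) {
    echo $dir, "\n";
}
echo buildRscriptCommand('/path/to/Rscript', '/path/to/myscript.R'), "\n";
```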

Error the Second: Package Locations

Once the PHP file knew where to find Rscript.exe, I was a little surprised to see it still throwing errors instead of output. This time, the error was about missing packages in R. I had been following the default practice in the RStudio install, which installs packages to the individual user's profile. Instead, I wanted the packages put into a common library (the difference between these practices is explained in this RStudio support article). I just did a simple copy from the user folder to the common library, and that took care of that.
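
One way to check this kind of problem is to ask R which library folders it searches when it is started by PHP rather than from my own terminal. This is only a sketch: libPathsCommand is a made-up helper name, the path is a placeholder, and the quoting has been thought through for a Unix-style shell and may need adjusting on Windows. .libPaths() itself is R's built-in listing of package library locations.

```php
<?php
// Build a command that prints R's library search paths, one per line.
// $rscript is a placeholder for the full path to the Rscript executable.
function libPathsCommand(string $rscript): string
{
    // The R expression to run; \\n becomes a literal \n for R to interpret.
    $expr = "cat(.libPaths(), sep='\\n')";
    return escapeshellarg($rscript) . ' -e ' . escapeshellarg($expr) . ' 2>&1';
}

// Until the placeholder path is replaced, this prints the shell's error
// message rather than a list of folders.
$lines = [];
$exitCode = 0;
exec(libPathsCommand('/path/to/Rscript'), $lines, $exitCode);
echo implode("\n", $lines), "\n";
```

If the user-profile library shows up when the command is run from a terminal but not when it is run through the web server, that would match the missing-package errors described above.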

Error the Third: Permissions

Once I finally got the PHP to find Rscript.exe AND the R packages it needed, I was still getting weird results. From experimenting some more on the command line, I found that somewhere along the way my permissions had gotten messed up, and the PHP script was no longer allowed to create or append to the text files it needed. This makes sense: when I run commands from the terminal, it knows I am logged in on my own user account, with those permissions. But when the PHP script tries to access those same shell commands, what "user" is it doing so under? What permissions does it have? Honestly, I don't quite know the answer to this yet (security is a weak point of my own software knowledge and I don't think I'm alone among DHers in that regard) but by changing a few permissions and a few locations for where files are written and read from, I have a solution for now.
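
A small diagnostic along these lines might help answer the "what user?" question. It is a sketch only: describeWriteAccess is a hypothetical helper, and __DIR__ stands in for whatever folder the results are really written to.

```php
<?php
// Report which OS account PHP's shell commands run as, and whether PHP
// may write to the directory where results are saved.
function describeWriteAccess(string $dir): array
{
    $who = [];
    // "whoami" exists on both Windows and Linux; 2>&1 keeps any error visible.
    exec('whoami 2>&1', $who);
    return [
        'user' => $who[0] ?? 'unknown',
        'directory' => $dir,
        'exists' => is_dir($dir),
        'writable' => is_writable($dir),
    ];
}

// __DIR__ (the folder holding this script) stands in for the real output folder.
print_r(describeWriteAccess(__DIR__));
```

Loading this through the web server, rather than running it from the terminal, is what reveals the account the server uses. PHP's get_current_user() is not a substitute here, since it returns the owner of the script file rather than the account running it.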

Results

It's not on the live, public server yet (I'll need to hand-edit the full paths to Rscript and so on, since my development server at home is on Windows but the public one for the site is on Ubuntu, so the file path schemes are different). But from the home server, I can now type in "obama," "washington," or anything else, and get taken to a real results page that shows:

  • the term entered in the search box, 
  • whether that matches with "obama" or "washington," 
  • the full 2012 SOTU text if obama was searched, or the full 1790 text if washington was searched, 
  • in the sentiment info box, the number of positive and negative (and the net positive-negative) sentiment word matches in the SOTU from the Bing sentiment lexicon:




The spacing is messed up, but as a proof of concept this reflects a ton of learning and work on my part in a short amount of time. From here, figuring out the SQL database is my only remaining big hurdle; the rest is about scaling up (which will not be a small job)! With this milestone, I'm feeling good about having a minimum viable product in the next few weeks. The next step will be figuring out SQL as well as leaving some of the web-dev stuff to the side to work on the R scripting.
