Sunday, 29 May 2011

How Does Apache Server Work With PHP Module?

Q: How the HECK does Apache server work with the PHP module? What path does an HTTP request take through a PHP-enabled Apache web server?

Problem
I have WAMP (Version 2.0) installed on my Windows 7 operating system. For those who don't know, WAMP is an acronym that stands for Windows Apache MySQL PHP. Just so you know, this version of WAMP comes with the following versions of Apache, PHP and MySQL:

- Apache version 2.2.11
- PHP version 5.3.0
- MySQL version 5.1.36

I need to enable the rewrite module of Apache to process URL rewrite rules, so I have the following line in httpd.conf:

LoadModule rewrite_module modules/mod_rewrite.so
I enable rewrite logging at "C:\wamp\logs\rewrite.log" so I can see how rewriting is done, by adding the following lines in httpd.conf right after LoadModule:

RewriteLog "C:\wamp\logs\rewrite.log"
RewriteLogLevel 9

I also have the following line in httpd.conf to enable PHP processing:

LoadModule php5_module "c:/wamp/bin/php/php5.3.0/php5apache2_2.dll"
Now an HTTP request goes through Apache, which hands it to the rewrite module for URL rewriting, and then has the resulting URL go through the PHP engine. But how is the Apache rewrite module supposed to know when it should stop rewriting the URL and hand it to the PHP module? Is there any way we can control, through rewrite rules, when the processing goes to the PHP engine? Let's answer these questions below.

By the way, if you need to know how Apache server variables including %{DOCUMENT_ROOT}, %{REQUEST_URI} and %{REQUEST_FILENAME} work, refer to Test Whether a Server Variable is Empty in Popular Web Servers.

Questions?

My Answers
The line 'LoadModule php5_module "c:/wamp/bin/php/php5.3.0/php5apache2_2.dll"' in httpd.conf loads the PHP module into Apache. What makes PHP handle pages with the extension 'php' (e.g. black-jacket.php) is the mapping of that extension to the PHP handler, which as far as I can tell the WAMP httpd.conf already contains as 'AddType application/x-httpd-php .php'. By the way, if you'd like PHP to also handle pages with other extensions, add the following line after LoadModule in httpd.conf (so PHP handles .html too):

AddType application/x-httpd-php .html
Now here's the interesting part. How the HECK does the Apache rewrite module know when to pass the request through to the PHP module? The answer, at least for rules in per-directory context on Apache 2.2 (that is what [perdir ...] in the log below means): ONLY if and ONLY when the request is NOT matched by ANY rewrite rule is it passed through to the PHP handler. Even funnier, none of the flags I tried on a rewrite rule tells Apache to pass the request through to the PHP handler right away! If you don't agree let me know.

I've tried [L] and [PT] on the RewriteRule directive, but when the request matches that particular rule, the rewritten request is "internally redirected" and processed by the rewrite module all over again! In the rewrite log it looks like this:

127.0.0.1 - - [29/May/2011:22:26:46 +0800] [localhost/sid#d33140][rid#2254c80/initial/redir#1] (4) [perdir C:/repository/trunk-php/] RewriteCond: input='C:/repository/trunk-php/cache/index.php' pattern='-f' => matched
127.0.0.1 - - [29/May/2011:22:26:46 +0800] [localhost/sid#d33140][rid#2254c80/initial/redir#1] (2) [perdir C:/repository/trunk-php/] rewrite 'index.php' -> '/cache/index.php' # this rule has [L] but it's still internally redirected to rewrite module as the following log statement suggests
127.0.0.1 - - [29/May/2011:22:26:46 +0800] [localhost/sid#d33140][rid#2254c80/initial/redir#1] (1) [perdir C:/repository/trunk-php/] internal redirect with /cache/index.php [INTERNAL REDIRECT]
127.0.0.1 - - [29/May/2011:22:26:46 +0800] [localhost/sid#d33140][rid#2351398/initial/redir#2] (3) [perdir C:/repository/trunk-php/] strip per-dir prefix: C:/repository/trunk-php/cache/index.php -> cache/index.php
127.0.0.1 - - [29/May/2011:22:26:46 +0800] [localhost/sid#d33140][rid#2351398/initial/redir#2] (3) [perdir C:/repository/trunk-php/] applying pattern '(.+)$' to uri 'cache/index.php'
...

The log statement ending in [INTERNAL REDIRECT] tells you the request is internally redirected to the Apache rewrite module for handling again. Only when no rewrite rule has matched the current request will you see the following line in the rewrite log:

127.0.0.1 - - [29/May/2011:22:27:52 +0800] [localhost/sid#d33140][rid#228b510/initial/redir#2] (1) [perdir C:/repository/trunk-php/] pass through C:/repository/trunk-php/cache/index.php
This means that the Apache rewrite module is finally done with the request, and it's the next handler's job to handle it. In this case the PHP engine will simply render C:/repository/trunk-php/cache/index.php.
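To see at a glance how many passes a request makes, you can filter the rewrite log down to the two kinds of lines discussed above. This is just a minimal sketch: the function name rewrite_outcomes is made up for illustration, and it assumes you have a Unix-style shell with grep available (on Windows, for example through Cygwin).

```shell
#!/bin/sh
# rewrite_outcomes LOGFILE (hypothetical helper name)
# Print only the lines where a pass over the rules ends:
# either with another internal redirect or with a final pass through.
rewrite_outcomes() {
    grep -E 'INTERNAL REDIRECT|pass through' "$1"
}

# Example call (the path is the log location used in this post):
# rewrite_outcomes "C:/wamp/logs/rewrite.log"
```

Every "internal redirect" line means one more trip through the rules, and the "pass through" line is where the rewrite module finally lets go of the request.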

So how do you add a handler and control the order of the handlers? To my disappointment I have NOT found any document online that answers this question. There is an "AddHandler" directive in httpd.conf that maps file extensions to a handler, but the WAMP configuration already takes care of that mapping for PHP.

Conclusion
So the conclusion is that, for per-directory rules on Apache 2.2, ONLY if and ONLY when the request is NOT matched by ANY rewrite rule is it passed through to the PHP handler. There is NO flag I could find that tells Apache to pass the request through to the PHP handler right away! Not the 'last' flag [L], and not the 'pass through' flag [PT], at the end of a RewriteRule directive. (Apache 2.4 later added an [END] flag for this purpose, but it does not exist in the 2.2.11 used here.) If you don't agree let me know.

Knowing this, you may at first find it impossible to express the rewrite logic you have in mind. Think deeper and the solution will come. For example, you may have a rule whose rewritten URL you'd like to go straight to the PHP engine. Adding [L] to that rule only ends the current pass: the rewritten URL is internally redirected to the rewrite engine (so server variables such as %{QUERY_STRING}, %{REQUEST_FILENAME} and %{REQUEST_URI} are updated accordingly) and goes through each rule again. So you'll have to make sure the rewritten URL is NOT matched by any RewriteRule on that next pass, for example by guarding the rule with a RewriteCond that excludes URLs already starting with /cache/. Only then is it passed through to the PHP handler.

If you have any questions please let me know and I will do my best to help you!

Test Whether a Server Variable is Empty in Popular Web Servers

Q: How to test whether a variable is empty in server configurations of popular web servers such as Apache and Nginx?

Problem
It is AMAZING how unfriendly server configuration syntax can be. Even tasks as small as testing whether a value is empty can be confusing. Since I've personally used Apache and Nginx for a long time, allow me to unravel the mysteries of how to evaluate whether a server variable (e.g. document root, query string, etc.) is empty in the server's configuration file.

Solution for Apache
In your httpd.conf or .htaccess simply use ="" to evaluate whether an Apache server variable is empty. Use !="" to test whether an Apache server variable is NOT empty. Here's an example:

RewriteEngine on
...
RewriteCond %{DOCUMENT_ROOT}/cache%{REQUEST_URI} -f
RewriteCond %{QUERY_STRING} =""
RewriteRule (.+)$ /cache/$1 [L]
...

This block of rules checks whether %{DOCUMENT_ROOT}/cache%{REQUEST_URI} exists as a file and whether the HTTP request has no URL parameters. If both are true, it rewrites the current request to /cache/{current request uri}.
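If the rule syntax is hard to read, here is the same decision written as a small shell function. It is only a model of the logic, not how Apache evaluates the rules, and the function name cache_rewrite and its three arguments are made up for illustration.

```shell
#!/bin/sh
# cache_rewrite DOC_ROOT REQUEST_URI QUERY_STRING (hypothetical helper)
# Model of the two RewriteCond lines: print the rewritten URI when a cached
# file exists AND the query string is empty; otherwise print the URI unchanged.
cache_rewrite() {
    doc_root=$1
    request_uri=$2
    query_string=$3
    if [ -f "$doc_root/cache$request_uri" ] && [ -z "$query_string" ]; then
        # same effect as: RewriteRule (.+)$ /cache/$1
        printf '/cache/%s\n' "${request_uri#/}"
    else
        printf '%s\n' "$request_uri"
    fi
}

# Example call:
# cache_rewrite /usr/repository/trunk /black-jacket.html ""
```

The -z test plays the role of ="" in the RewriteCond: it is true only when the query string is empty.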

Common Apache Server Variables: By the way, the following comments show how common server variables such as %{DOCUMENT_ROOT}, %{REQUEST_URI} and %{REQUEST_FILENAME} work, as they are often confusing as hell:

#
# Suppose your website is http://www.mensfashionforless.com/
# and document root is set to /usr/repository/trunk (via DocumentRoot directive in httpd.conf).
#
# Now someone issues a request for
# http://www.mensfashionforless.com/2010/10/g-by-guess-grey-low-boot-cut-jeans.html
# then the following is a list of common Apache server variables and their values:
#
# %{DOCUMENT_ROOT} = /usr/repository/trunk
# %{REQUEST_URI} = /2010/10/g-by-guess-grey-low-boot-cut-jeans.html
# %{REQUEST_FILENAME} = /usr/repository/trunk/2010/10/g-by-guess-grey-low-boot-cut-jeans.html
#

Questions? Let me know!

Solution for Nginx
Nginx is a web server popular for its speed and efficiency. Simply use ='' to check whether an Nginx server variable is empty. Use !='' to check whether an Nginx server variable is not empty. Here's an example:
server {
    listen 80;
    server_name www.mensfashionforless.com;
    rewrite_log on;
    ...
    location / {
        # test whether $document_root/cache$request_uri exists
        # in the file system
        if (-f $document_root/cache$request_uri) {
            set $test P;
        }
        # test whether there's no URL argument to this request
        # = '' tests whether the value is empty!
        if ($args = ''){
            set $test "${test}C";
        }
        # if both of the above tests are true, do the rewrite
        if ($test = PC){
            rewrite ^/(.+)$ /cache/$1 last;
            break;
        }
        ...
    }
    ...
}
This block of code checks whether $document_root/cache$request_uri exists as a file in the file system AND whether there is no URL parameter to this HTTP request. If both are true, it rewrites the current request to /cache/{current request uri}. Refer to the post on How do I test multiple conditions in Nginx server configuration? if you are confused by the syntax of testing multiple conditions.

If you have any questions please let me know and I will do my best to help you!

In Nginx Rewrite How To Test Multiple "if" Conditions

Q: How the HECK do I test multiple conditions in "if" statement in Nginx server configuration file?

Problem
It is AMAZING that Nginx server configuration does NOT support multiple conditions natively, meaning no such thing as the following:

if ($request_method = POST && -f $request_filename) {
...
}

In fact it does NOT even support nested conditions, meaning no such syntax as the following:
if ($request_method = POST) {
    if (-f $request_filename) {
        ...
    }
}
Why?? I am just as confused as you are. The lack of common conditionals like AND and OR is a serious inconvenience to webmasters, especially those who are converting from Apache to Nginx. Below we'll see how to get around Nginx not supporting multiple conditions in an "if" block.

Solution
When there's a will there's a way. You can use a hack: set a variable in each condition that tests true, and when the variable shows that both conditions are true, do what you need to do. Here's an example:
server {
    listen 80;
    server_name www.mensfashionforless.com;
    rewrite_log on;
    ...
    location / {
        # test whether $document_root/cache$request_uri exists
        # in the file system
        if (-f $document_root/cache$request_uri) {
            set $test P;
        }
        # test whether there's no URL argument to this request
        # = '' tests whether the value is empty!
        # for more info refer to how to test whether a server variable is empty
        if ($args = ''){
            set $test "${test}C";
        }
        # if both of the above tests are true, do the rewrite
        if ($test = PC){
            rewrite ^/(.+)$ /cache/$1 last;
            break;
        }
        ...
    }
    ...
}
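If the flag trick looks odd, it may help to see the same idea in plain shell. This is only a model for illustration (it is NOT Nginx syntax), and the function name should_rewrite and its arguments are made up: each condition that holds appends one letter, and the final test succeeds only if both letters were appended.

```shell
#!/bin/sh
# should_rewrite CACHED_FILE QUERY_STRING (hypothetical helper)
# Shell model of the Nginx flag trick: succeed only if both conditions hold.
should_rewrite() {
    test_flag=""
    # first condition: the cached file exists
    if [ -f "$1" ]; then
        test_flag="P"
    fi
    # second condition: there are no URL arguments
    if [ -z "$2" ]; then
        test_flag="${test_flag}C"
    fi
    # "PC" can only result when both letters were appended
    [ "$test_flag" = "PC" ]
}

# Example call:
# should_rewrite /usr/repository/trunk/cache/black-jacket.html "" && echo "rewrite"
```

If only one of the conditions holds the flag ends up as "P" or "C", so the final comparison with "PC" fails, which is exactly what makes this behave like AND.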

Incidentally if you are confused by the test-empty-variable syntax refer to how to test whether a server variable is empty. You need to pay extra attention to Nginx's syntax! For example there MUST be a space between if and ( according to the syntax. The following will fail:
...
if($test = PC){
rewrite ^/(.+)$ /cache/$1 last;
break;
}
...
Nginx's server configuration syntax is very unforgiving, so make sure you check the syntax before you restart the Nginx server. The way to do it is to use Nginx's command line tool with the "-t" option. Assuming it's installed at /usr/sbin/nginx, run the following command (leave out -c to check the default configuration file):

/usr/sbin/nginx -t -c /path/to/nginx.conf
If you see the following then it means your syntax is correct:

$ /usr/sbin/nginx -t
2011/05/29 11:43:04 [info] 29803#0: the configuration file /etc/nginx/nginx.conf syntax is ok
2011/05/29 11:43:04 [info] 29803#0: the configuration file /etc/nginx/nginx.conf was tested successfully
$

If your nginx config has syntax errors you'll see something like the following:

$ /usr/sbin/nginx -t
2011/05/29 11:52:08 [emerg] 30258#0: unknown directive "abc" in /etc/nginx/nginx.conf:2
2011/05/29 11:52:08 [emerg] 30258#0: the configuration file /etc/nginx/nginx.conf test failed
$

Simply correct the errors and run the command again until you see the success message. By the way this command will check your nginx configuration RECURSIVELY! Suppose your nginx.conf contains such statements as the following:

include /etc/nginx/sites-enabled/*;
Now you run "nginx -t" to check syntax, and it'll check syntax of every configuration file located in /etc/nginx/sites-enabled/!
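To make sure I never reload a broken configuration, the check and the reload can be tied together. Below is a minimal sketch: the function name check_and_reload is made up, the default path /usr/sbin/nginx is the one assumed above, and it relies on the "-t" and "-s reload" options of the nginx command line tool.

```shell
#!/bin/sh
# check_and_reload [NGINX_BINARY] (hypothetical helper)
# Reload Nginx only if the configuration test succeeds.
check_and_reload() {
    nginx_bin=${1:-/usr/sbin/nginx}
    if "$nginx_bin" -t; then
        # syntax is ok, so ask the running server to reload its configuration
        "$nginx_bin" -s reload
    else
        echo "configuration test failed, not reloading" >&2
        return 1
    fi
}

# Example call (usually needs root privileges):
# check_and_reload
```

If the test fails, the running server keeps its old configuration and you can fix the errors at your own pace.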

If you have any questions please let me know and I will do my best to help you!

Friday, 27 May 2011

A QUICK Unix Shell Script To Crawl an XML Sitemap or sitemap.xml

Background
I'd like to quickly crawl every URL in my XML sitemap because doing so triggers caching of each page, which gives visitors a better user experience. An XML sitemap is usually named sitemap.xml and contains URLs for crawlers to crawl. It looks something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.mensfashionforless.com/</loc>
<priority>1.000</priority>
</url>
...
<url>
<loc>http://www.mensfashionforless.com/black-jacket.html</loc>
<priority>0.5000</priority>
</url>
</urlset>

<loc> is the tag that holds a URL. These are the URLs I'd like to spider.

Here is the Unix shell script!
To achieve this purpose I first extract all the URLs; then I issue an HTTP request to them one by one. Keep in mind I don't need to see the content at all; I just need to issue the request so that the server receives the request and does what it's supposed to do. A good use case is that my server caches webpages on demand. So I use this crawler to make my server cache all the webpages specified in sitemap.xml so that later when someone visits my website they'll see the webpage more quickly.

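# spider.sh: use awk to get URLs from an XML sitemap
# and use wget to spider every one of them
ff()
{
    # read one URL per line and request it without saving the content
    while read -r line1; do
        wget --spider "$line1"
    done
}
# for every line containing <loc>, strip everything up to and including
# <loc> and everything from </loc> on, then print the URL that is left
awk '/<loc>/ { sub(/<\/loc>.*$/, ""); sub(/^.*<loc>/, ""); print }' sitemap.xml | ff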

The above script should run successfully in Bourne-style shells such as Bourne shell, Korn shell and Bash (C shell has a different loop syntax and no functions, so it will not run there). If not let me know! In the script above I use 'awk' to extract URLs and use 'wget' to spider each of the URLs without downloading the contents (done via the --spider option). Save it as 'spider.sh', run 'chmod 700 spider.sh' and then run './spider.sh' to spider your sitemap.xml!
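One thing to watch out for: the script above expects each <loc> tag on its own line. If your sitemap is minified into one long line, a variation like the following may work better. It is only a sketch: the function names extract_locs and spider_sitemap are made up, and it assumes your grep supports the -o option (GNU and BSD grep do, but it is not in every Unix).

```shell
#!/bin/sh
# extract_locs SITEMAP (hypothetical helper)
# Print every URL found between <loc> and </loc>, one per line,
# even when several <loc> tags share a line.
extract_locs() {
    grep -o '<loc>[^<]*</loc>' "$1" | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

# spider_sitemap SITEMAP (hypothetical helper)
# Request every URL without saving the content and report failures.
spider_sitemap() {
    extract_locs "$1" | while read -r url; do
        wget --spider --quiet "$url" || echo "failed: $url" >&2
    done
}

# Example call:
# spider_sitemap sitemap.xml
```

Splitting the extraction from the crawling also makes it easy to check the list of URLs first by running extract_locs on its own.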

If you have any questions please let me know and I will do my best to help you!

Unix Command 'nohup' Does Not Work

Q: I am trying to use Unix command 'nohup' to run a process in the background even when I log out. However 'nohup' does NOT work. Why?

Introduction
'nohup' stands for 'no hang up' and allows you to run a process continually until it ends during which you can log out and close your terminal. This is because 'nohup' suppresses or ignores HUP (also known as hangup) Unix signal allowing the process to still run even after the user who issued it logs out. This is useful when for example you'd like to start running a big process, shut down your computer, go home. When you get home you'd like to log in and see that the process is still running.

Tutorial
Suppose you have a script called 'shell-script.sh'. You run the following to run the script in the background persistently:

$ nohup ./shell-script.sh &
In the current directory a file called 'nohup.out' will be created if it hasn't been created yet. The output of running shell-script.sh goes into nohup.out. Therefore you can run the following to see the output of running shell-script.sh as it rolls:

$ tail -f nohup.out
Problem
The problem is sometimes 'nohup' just doesn't work even though I can run the script fine! Recently I wrote a script to crawl my website and I call it spider.sh. When I run './spider.sh' in the directory where spider.sh exists it works perfectly. However when I run 'nohup ./spider.sh &' it doesn't work. Here's the command prompt trace:

$ nohup ./spider.sh &
[1] 21724
$ nohup: ignoring input and appending output to `nohup.out'

[1]+ Exit 2 nohup ./spider.sh
$
$ cat nohup.out
./spider.sh: 1: Syntax error: "(" unexpected
$

I know there's a problem because when I press Enter after running 'nohup ./spider.sh &' my shell says 'Exit 2', meaning the 'nohup' process has ended with an error. Then in nohup.out I see the syntax error. The weird thing is that spider.sh runs successfully if I simply run it. How come it has syntax errors under 'nohup'? This is because the shell that ends up running the script under 'nohup' is different from the shell that your account uses. spider.sh has no '#!' line, so my login shell runs it itself, while under 'nohup' it is most likely run by /bin/sh, which on some systems is a stricter shell. The syntax differences between Bourne-style shells (e.g. Bourne shell, Bash, Korn shell) are mostly minor, but they are enough to break a script.

Solution
Script spider.sh begins with:

function ff() {
And 'nohup' complains about "(" (however my login shell does NOT complain about it). When I changed it to:

ff() {
It works for both 'nohup' and my login shell! The 'function' keyword is an extension that not every shell accepts, while the plain 'ff()' form is understood by all Bourne-style shells. Fix the syntax errors and 'nohup' will work! Any feedback, feel free to share with us!
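Here is a small self-contained script that shows the portable form. It is only a sketch: instead of calling wget it just prints what it would spider, so you can safely run it both directly and under 'nohup' to compare. The '#!/bin/sh' first line is my addition; it names the shell that should run the script no matter who launches it.

```shell
#!/bin/sh
# Portable function definition: no 'function' keyword, so sh, dash,
# ksh and bash all accept it.
ff() {
    while read -r line1; do
        printf 'would spider: %s\n' "$line1"
    done
}

# Feed one sample URL through the function.
printf '%s\n' 'http://www.mensfashionforless.com/' | ff
```

If you'd rather keep the 'function' keyword, another option is to start the script with '#!/bin/bash' (assuming that is where Bash lives on your system) so that Bash is always the shell that runs it.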

Thursday, 05 May 2011

Combine Multiple Javascript Files Into One

How come while combining multiple JavaScript files into one I get errors?
SCENARIO
Here's what I am trying to do. I have several Javascript files and I combine them into one via 'cat' command or something (refer to Insert Newlines With Unix 'cat' Command To Combine Multiple Files for how to combine several files into one). Now I run the website and the browser is complaining about the Javascript saying it contains errors! But WHY??

PROBLEM
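Don't worry, I'll get to the solution soon, but you need to know why this is happening first. The problem is that when you combine the js files WITHOUT ending each js file properly you run the risk of violating Javascript syntax! At the end of a js source you have the leniency of not ending the last statement with a semicolon and still having it work fine. When it's followed by another Javascript statement, however, you'll run into errors. For example, if one file ends with 'var a = function() {...}' and the next one begins with '(function() {...})();', the combined source calls the first function with the second one as its argument instead of running two separate statements.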

SOLUTION
Simple. Just make sure before you combine the Javascript sources every Javascript source ends with a semicolon (;). I suggest that you add newlines or spaces at the end of the source to make it more readable. Again refer to Insert Newlines With Unix 'cat' Command To Combine Multiple Files for how to combine several files into one!
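If you don't want to edit every source by hand, the semicolon can be added while combining. This is a minimal sketch; the function name combine_js and the file names in the example call are made up for illustration.

```shell
#!/bin/sh
# combine_js OUTPUT FILE... (hypothetical helper)
# Concatenate the given Javascript files into OUTPUT, writing a newline,
# a semicolon and another newline after each one.
combine_js() {
    out=$1
    shift
    : > "$out"                     # start with an empty output file
    for f in "$@"; do
        cat "$f" >> "$out"
        printf '\n;\n' >> "$out"   # end the last statement of this file
    done
}

# Example call:
# combine_js all.js menu.js slideshow.js tracking.js
```

If a source already ends with a semicolon, the extra one is just an empty statement, which Javascript allows.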

If you have any questions please let me know and I will do my best to help you!

Insert Newlines With Unix Cat Command

Q: In Unix how do you use 'cat' command to combine multiple files into one with a newline (or any other character) inserted between each pair of files?

A: What I am trying to do is simple: I have multiple files and I'd like to run a Unix command to combine them into one big file, with newlines or breaks inserted following the content of each file. This is useful, for example, to combine many css files into one so that your website references only one css file while you still use many css files during development for easier understanding and modularization.

It turns out that it's not so simple. You CANNOT just put a newline escape like \n among the arguments of 'cat', because the shell does not turn it into a newline and 'cat' treats every argument as a file name anyway. After some trial and error I finally arrived at the following solution.

SOLUTION
1. Create a file called 'separator.txt' that contains one newline. If you use 'vi' program simply type 'i', 'Enter', 'ESC' or 'Escape', then 'ZZ'. If you want to insert other characters such as 'XXX' simply put 'XXX' in that file.

2. Use the 'cat' Unix command in the following manner, assuming you have three files you'd like to combine, fileA, fileB, fileC:

cat fileA separator.txt fileB separator.txt fileC > all.txt

Now open all.txt and it should contain the content of fileA, fileB, fileC with a newline (or whatever characters you put in the file 'separator.txt') inserted between each of them. If you'd like to insert 2 breaks between each two files simply add 2 newlines in the file 'separator.txt'.
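If you'd rather not keep a separator file around, the same result can be had with a short shell function. This is only a sketch, and the function name cat_with_sep is made up: it prints the separator with 'printf' between each pair of files.

```shell
#!/bin/sh
# cat_with_sep SEPARATOR FILE... (hypothetical helper)
# Print the files in order with SEPARATOR between each pair of files.
cat_with_sep() {
    sep=$1
    shift
    first=1
    for f in "$@"; do
        # print the separator before every file except the first
        [ "$first" -eq 1 ] || printf '%s' "$sep"
        cat "$f"
        first=0
    done
}

# Example call with 'XXX' as the separator:
# cat_with_sep 'XXX' fileA fileB fileC > all.txt
```

To use a newline as the separator, pass a quoted string that contains a real line break as the first argument.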

Questions? Let me know!
 