Should MongoDB be the New “M” in LAMP?
I’m more than a little late to the NoSQL party. I’ve only recently had reason to work with a few of the different “NoSQL” databases. While some of them seem extremely well-suited for different purposes, so far, MongoDB has stood out to me as versatile enough to replace what most web sites have been needing and using from MySQL (and then some). In my opinion, it has several advantages over its MySQL predecessor for the learner, the application creator, and the wildly-successful-overnight startup.
Certainly, if you have need the foreign-key relationships and transactions that ACID compliant databases provide; then go for a traditional RDBMS. PostgreSQL is my personal favorite and I’m sure you or your company has one as well. They will be the standards bearer for cross-application data access, integration points and internal systems for a long time to come. They do hold up well under some load in forward or external facing applications; however, the bar is pretty low for when you have to get “cute” or expensive.
With MongoDB, I’ve provisioned some replica sets and shards, prototyped some data from a couple known applications, written a couple map/reduce jobs and interfaced from PHP and Java. I’ve noticed some advantages which become evidently apparent as you start working with the package.
What are the advantages?
Document versus Row
Well, of course, document storage. (MongoDB is a document-store and relational databases are row/column stores.) Most web apps just don’t fit the row/column paradigm. In addition, it’s counterproductive.
A buddy of mine who builds web applications and a lot of professionals I have worked with struggle to correctly identify where one table ends and another table begins. I can’t even begin to account for the number of hours I’ve wrestled with models; changing models, changing the multiplicity of relationships, realizing a denormalization is necessary in place X, etc.
When you start to think of your web application’s entities as documents, a lot of that goes away. Here is a “testcase”. Here is a “case”. Here is a “profile”. In the beginning, you know the big box entities you have. As your app grows, you’re going to add new fields. You’re going to add new entities. Scalar variables inside documents/entities will become lists. Document storage with MongoDB makes this experience a pleasure and easy versus the sometimes extremely complicated (to create and test) ALTER and UPDATE moment that can occur on DDL schema upgrades.
I’m still waiting to drop the document bomb on my buddy; I just can’t do it over the phone. I want to see his face when that light bulb goes on.
Replica Sets and Sharding
This is one of the main features of MongoDB and the other NoSQL databases as well, if not in name, then in spirit. The built-in ability of MongoDB to self-create copies of its data and to route read/write requests horizontally is a big part of the “cuteness” I referenced above.
Taking MySQL “to the next level” usually involves some form of this in a largely ad-hoc and custom way. Of course, various open source packages, cookbooks, etc. will get you started; however, picking the right recipe and keeping it all running is now a much larger effort. In addition, some degree of responsibility also falls on your applications’ shoulders. If you are looking ahead designing your RDBMS application (or have the experience), some of the headaches can be mitigated in the early design; however, you will already be foregoing a lot of the benefit the RDBMS gives you in the first place (at least on the forward-facing end)
In MongoDB, the routing is provided for you. The failover is provided for you. You can create an application targeting a single MongoDB instance and know that when your application grows or needs to scale it will be transparent to the application.
API
I put this one pretty high up there. Speaking strictly from the P (meaning *only* PHP) in LAMP. I was more than pleasantly surprised when I started using the MongoDB API. It’s all just arrays!
At least from the “document” point-of-view, this is a HUGE deal for the PHP developer. We speak array. We’ve lost a lot of good programs in the mysql_fetch_assoc loop.
Map/Reduce
We all know we’re supposed to understand map/reduce. Your typical web learner and my buddy certainly do not. With MongoDB, when you are ready, it’s there. In addition, it uses all the shards and replicas seamlessly, you just have to provide the computation you want.
One Less Language – No SQL
I’m good at SQL. Really good. Sometimes, it’s exactly what you need and there are those of us who can deliver. However, in a web application, it’s just not necessary.
A majority of what the web application does is: SELECT * FROM X WHERE id=N. I’ve worked with some very elegant frameworks (CakePHP, Hibernate, etc.) and it doesn’t matter how many layers of abstraction or how many lines of code sit between your application and that database. You will need to know some SQL. Your application will have some SQL in it somewhere.
MongoDB’s interactions, queries, map/reduce programs, etc. are all JavaScript. A language you are going to need anyway. SQL in the browser is not a good fit. JavaScript on the backend actually isn’t that bad. Here again, the PHP API helps out some. Because the interactions with MongoDB are so simple for the typical things, you may not even need that (at least until you need to figure out that map/reduce thing). The most complicated thing you may see is the dollar-sign “$”. It’s all just arrays!
Setup/Install
I didn’t think it could any easier than mysqld and an unprotected root password. It can. Well, I take that back, SQLite is pretty easy too. However, MongoDB doesn’t force you to create a database or a table (of any kind) up front. In addition, it doesn’t even require the username of “root”.
Of course, you can specify the directory. You can require logins and lock things down. But, to the learner or the PC development environment, these little trivialities can add up to big timesavers and developer productivity is paramount, especially to the late-night or early-AM aspiring app creators or entrepreneurs.
However, a little feedback for the MongoDB guys… Do I really need to create the directory? Even on Windows? If /data/db doesn’t exist, just put it all in /tmp, C:\temp or whatever. Seriously. J
Summary
I’ve spent a little time digging into MongoDB. This was just a quick run through. I’m hoping some find it useful. I’m sure I left something out. There are probably some disadvantages depending on the application.
Overall, I’m very impressed with the package and do think it is worth a look as a MySQL replacement in anybody’s LAMP stack for its capability set, ability to scale and for the document paradigm’s accessibility with somebody just starting to learn.
Default PostgreSQL String Sort Order Bites Me in the SAS
Sometime during the development of an internal PHP and SAS mixed application, I’ve had some interesting transitions. Notably:
- MySQL -> PostgreSQL
- ACCESS/MYSQL -> ACCESS/ODBC (MySQL) ->ACCESS/ODBC (pgODBC) -> ACCESS/ODBC (DataDirect 64-bit)
- LATIN1 -> UTF-8
- SAS 9.1.3 -> SAS 9.2
Most of these transitions went pretty straightforward. However, one bug got introduced somewhere along the way and I just couldn’t ever seem to figure out what would cause it.
For some reason, to download a list of features to get real ID numbers and then match by name, this wouldn’t work:
LABEL feature_id=”feature_id”;
BY feature_name;
WHERE release_id=&release_id;
RUN;
DATA folders(KEEP=feature_id name);
MERGE folders(IN=in1) features(IN=in2);
BY feature_name;
IF in1;
IF in2;
RUN;
I’d get bizarre errors out of SAS that the list wasn’t sorted. Whenever it occurred, I’d inspect the resultant (and intermediate) datasets, and everything seemed sorted just fine. Instead, I had to have something like this:
PROC SORT data=PG.features(RENAME=(id=feature_id name=feature_name)) out=features;
LABEL feature_id=”feature_id”;
BY feature_name;
WHERE release_id=&release_id;
RUN;
/* 9.2 workaround? For some reason if I sort on a RENAME or don’t, then try to */
/* MERGE after RENAME on a sorted field, it won’t work. */
PROC SORT data=features;
BY feature_name;
RUN;
PROC SQL;
CREATE INDEX feature_name ON
features(feature_name);
MERGE folders(IN=in1) features(IN=in2);
BY feature_name;
IF in1;
IF in2;
RUN;
You can clearly see my frustration (and the blaming of my employer’s own software over Postgres) in the comments. In addition, I probably overkilled the solution by re-sorting and then also creating an index, but it did make the problem go away.
Eventually, I got another bug report from a user with the phrasing, “Incorrect sorting within letter group in Features table”. That led me to this entry in the Postgres wiki:
http://wiki.postgresql.org/wiki/FAQ#How_do_I_change_the_sort_ordering_of_textual_data.3F
I discovered SAS sorts strings based on rules found in “C” locale collation. Even though I read some documentation attributing the default LC_COLLATE setting as “C”, in fact, for my database, it was set to “en_US.UTF-8″. What this basically means is when sorting the following list:
- GLMMOD : Tests
- GLM : ODS Graphics
- GLM : Checklist
- GLMMOD : Checklist
You’ll get:
- GLM : Checklist
- GLMMOD : Checklist
- GLMMOD : Tests
- GLM : ODS Graphics
Which seems incorrect at first, until you realize it sorts disregarding whitespace and special characters. However, SAS and “C” locale collation sort like this:
- GLM : Checklist
- GLM : ODS Graphics
- GLMMOD : Checklist
- GLMMOD : Tests
Because PROC SORT and other SAS mechanisms issue and rely on native database commands for some operations, this behavior can produce results in ordering undesirable for SAS. SAS actually performed very admirably by delegating the sorting to the database, setting the appropriate flags on the dataset, but then still catch the match merge problem at runtime!
Long story short, when using PostgreSQL with SAS, it’s probably a good idea to make sure the database is created with the correct setting for LC_COLLATE. If it is not, you may end up with crazy gyrations like mine in your code. Luckily for me, it’s a fixable scenario whereby the database only needs dumped, then restored after it has been recreated with the desired collation.
Mac OS X Reinstall/Restore via Time Machine is Awesome!
During the holiday vacation, my Macbook started booting up with a rather lengthy progress bar. It would take about 10 minutes to complete. The operating system would come up fine, but it started getting me nervous because it was occurring on every boot, not just once (like an update post-processing after reboot or something). I had a feeling it was filesystem related just based on my experiences with Unix/Linux in the past. I figured some kind of fsck was working its magic and some quick searches seemed to confirm the thought. This was going to be my first attempt and resolving a system issue on the Mac; I’ve been lucky to have been on cruise control on there until now.
I eventually stumbled on the Mac’s console message viewer. He’s extremely handy. He zeroed me in on the culprit, “Incorrect number of thread records (#, #)”. Okay, great message. Maybe if you know the number is incorrect, you could tell me the correct number? Maybe if you know the number is incorrect, you could just fix it? Nah.
After some more research, the next remediation steps were to boot into the install disk, try running Disk Utility and also to start up a console, unmount the main volume and run fsck_hfs manually. Nada. All of those seemed to know all about my incorrect thread count, but none could make a correction. I must admit though, with a lot of Windows (and Linux) diagnostics behind me, the utilities are at least navigable well presented on the install disk.
With the easy stuff behind us, my options now get expensive or drastic. Most of the forum and support threads I found have DiskWarrior as the next logical step to try and fix the volume. At $100, I’m not quite ready to give up. I’m close though because with a BootCamp volume and a pretty important Windows 7 install on the other partition, formatting and redoing all that work doesn’t sound palatable… yet.
I do have AppleCare. I use the online form, open a ticket and schedule a callback on 12/31. Cool! The phone rings. The automated attendant informs me they are closed. What!? Why can I schedule the call then? The call ends. Two minutes later, I’m called again by Mr. Robot. Ugh. I re-schedule for another day to avoid infinite recursion! A slick customer service interface has a bug in it lets me schedule calls when they are on holiday and then has another bug in you can get in a loop.
OCD kicks in. I want this to go away now. What do I have unique to the Windows partition, really? Just a Quickbooks file, that’s not a big deal, back it up separate. Time Machine backup is a little behind, let’s take a new one. Okay, I’ll use the install disk to erase the Mac drive and try re-installing Snow Leopard. I’m sure my users and data will restore from Time Machine. I may have to do some homework to access my Windows partition (if it survives). I’ll probably have to redo my network setup, install iLife, Firefox, eclipse, etc. If my Windows partition doesn’t survive, that’ll be fun the next few days…
The install is pretty quick, 30-60 minutes. No problem. It restarts. Here comes my getting started page. There’s a migrating/moving via Time Machine button. Sure. The next screen was pretty surprising. I had options for restoring four things… something like users, data, settings and apps! Are you serious!? Wow. Something will probably not be quite right, but this is HUGE and a big divergence from my non-Mac experiences…
The Time Machine restore cranks through my files. It completes. I restart.
At the gray startup screen, I hold down “Alt”. Will Windows still be there? It is! Yay! He comes up just fine. That, in and of itself, is extremely cool. I had a feeling the partition would be there just from the screens I was seeing on the OS X install disk. Not sure about Vista or 7, but Windows prior would always blow away any bootloader I would have, no question.
I then start OS X. Surprise, surprise, I must have the correct number of thread records now, it boots in normal time
. The login screen has my users, their little thumbnails, cool.
Logging on was what really blew my mind. Once inside, all updates were already applied, the network settings are all there and everything/EVERYTHING was there; iLife (a separate install), an Eclipse I had downloaded and extracted somewhere funny, every other app and setting. Wow. I was so dumbfounded and filled with confidence about the whole process if it takes me more than an hour to find a button to undo a setting not 100% to my liking in the future, I will re-install from scratch and use the last backup not containing it in a heartbeat… For the Mac-savvy, or anybody who read into Time Machine since it came out, this may seem funny, but I still can’t get over it.
It didn’t escape me, going through the motions, OS X was treating it the same as if I had purchased a new computer and was just “moving” to it with my Time Machine backup. Companies don’t use Linux desktops (yet, ugh), but is it this smooth for the minority that do when going to new hardware? I doubt it. For companies using Windows desktops, entire armies of IT guys are still trying to purchase or develop applications/processes/scripts to make moving to new hardware this easy for the employees at large when hardware refreshes are in order. I know it wasn’t this smooth for me when I got my new gear the last time (XP -> XP). I have another Windows upgrade coming in January (XP -> 7), we’ll see how that goes…
Burning a Blu-ray in Linux
I’ve had a blu-ray burner in my Linux system for quite a while. Since about Fedora 8, I’ve been using commands to burn backups onto single-layer BD-RE media. I gleaned those commands out of a posting about dvd+/-rw tools (google cache). Here are the basics…
Ad-Hoc Burning
growisofs -Z /dev/dvd -R -J /path/to/files
- later -
growisofs -M /dev/dvd -R -J /more/files
- to finalize -
growisofs -M /dev/dvd=/dev/zero
Writing an ISO
growisofs -dvd-compat -Z /dev/dvd=/path/image.iso
Erase the Disk
growisofs -Z /dev/dvd=/dev/zero
Format
dvd+rw-format -ssa=1G /dev/dvd
I’ve recently been trying to do some new things and I thought I would post that as of Fedora 11, Brasero can recognize and write files to my BD-RE media and also erase the disk to do it again. K3b is still at 1.0.5 (not a KDE 4 compatible version) on Fedora and does not recognize the disk correctly for type and capacity, nor does it allow burning.
My recent searches pull up the same results from 2007 and 2008 as before where people were unsure, etc. This is the current state of my world though.
*UPDATE 2010-06-25*
It’s been awhile since I’ve tried this, but now using Fedora 13, I am able to use K3B to burn single-layer BD-RE media at 2.0x speed. I have a Sony BDRW BWU-200S.
Providing SOAP (non-REST) web services with CakePHP
I recently had a need to support a complex SOAP web service from CakePHP. Cake provides some built-in support for REST based web services; however, this situation required more. This post should show how to set this up on your own projects and still utilize all your normal controller and model goodness without too much screwing around.
Pleas see this attachment for the source code described in this article.
The method I will outline here requires the php-soap module.
First, the WSDL. For my project, I started with a WSDL created in another tool. My WSDL specifies a slightly different object set than my CakePHP application. I’m sure with PHP5 and some finessing of the Model classes, you could probably use the same set; however, it was easy enough to just create some really vanilla objects to house the transport objects and use those to communicate with the webservice. Both the WSDL and the receiving controller are present in the attachment.
What you will notice is that the *DTO objects defined in the controller file reciprocate the structure of the objects in the WSDL and the methods also are represented in the controller. I put them in the controller file because it wasn’t really obvious to me where in Cake’s structure “outside code” should really go. I have a separate configs.inc.php I pull in up the class hierarchy, but that’s about as non-conventional as I want to get. Also, this controller is dedicated to just handling webservice requests and I only need these *DTO objects in that case, so locality wins and they are here. No real engineering genius here, their structure mimics what is defined in the WSDL file.
The real magic is in the controller. The controller’s remote() method is what handles the POST from the web via the port binding in the WSDL file. The remote() function sets up some of the basic stuff for SoapServer and is easily identified in the PHP manual. It’s even pretty easy to deduce we’re going to need to use SoapServer->setClass() somewhere and plug the name of our Controller in. However, there was one tidbit in the comments section of the manual regarding SoapServer->setObject(). It wasn’t documented (at the time), but after experimenting and looking at the PHP source, it does exactly what we need here, sets the handling class to an instantiated (aka existing) class object instead of trying to spawn a new one. Because we are already inside the CakePHP framework and running the remote() function, we already have the variables we want from beforeFilter(), we have our models loaded up, we may even have a user context from mod_auth_something. Perfect!!! So, we tell SoapServer to use our instantiated Controller. Once the *DTO classes are mapped and SoapServer is configured, it’s as simple as having it handle STDIN to tickle the rest of the methods in your Controller with the parameters populated. Two more tricks/problems remain: debug level & autoRender.
First, debug level. There’s bound to be a way around it; however, since I test with a web service client, when I do have a problem, I have to debug with lots of $this->log() calls. Turning up debugging to 1 or 2 is problematic because then CakePHP doesn’t spit back properly formed XML to the web service client and usually the client takes a SoapFault when that happens. I stick to debug level of 0 during development and deployment wrt the web service stuff.
Second, autoRender. Because SoapServer does the actual outputting of XML response to the client, I set the layout in the Controller to Ajax and also explicitly call exit() at the end of the remote() method. This ensures that CakePHP doesn’t send back a “Missing View”, half rendered $layout, or any other kind of automatic goodies.
I hope this article is helpful for anybody who might want/need to integrate a more elegant/esoteric webservice into their CakePHP architectures. I’m sure there are probably cleaner ways to put this into custom View classes, utilize Components, etc… however, this was a straightforward approach I found has been working really well for one of my applications.
HowTo: PostgreSQL – Adding more values to an ENUM type
I recently had trouble manipulating an ENUM field I had created in PostgreSQL. I couldn’t find any suggestions or samples easily on Google or in the manual and was able to get it to work, so I post it here. The basic premise is there is an ENUM field type created, I need more possible values and to preserve the existing values I already have to keep code working.
Initial creation of the type and table:
CREATE TYPE var_type AS ENUM('text', 'number', 'date', 'boolean');
CREATE TABLE custom_fields (
id bigserial PRIMARY KEY,
name varchar(50) NOT NULL,
pdf_type var_type NOT NULL
);
Running with this table for some time, invariably, new rows are created and there’s now a migration consideration. As long as you are not using the table column as a reference in a foreign key, the following should work to preserve the data, drop and re-create the type.
The following creates a new column to hold the original text value:
ALTER TABLE custom_fields ADD COLUMN type_text varchar(15);
UPDATE custom_fields SET type_text = pdf_type::text;
We, then, need to drop the existing type and re-create it with the new values we want. CASCADE automatically drops columns that depend on the type:
DROP TYPE var_type CASCADE;
CREATE TYPE var_type AS ENUM('text', 'number', 'date', 'boolean', 'list');
This last part was what I couldn’t figure out without thinking a little more. When you add it back, you have to cast the varchar column back into the ENUM type. I had tried a variety of concoctions here before getting this to work:
ALTER TABLE custom_fields ADD COLUMN pdf_type var_type;
UPDATE custom_fields SET pdf_type = type_text::var_type;
ALTER TABLE custom_fields ALTER pdf_type SET NOT NULL;
ALTER TABLE custom_fields DROP COLUMN type_text;
Web log anonymizer
I recently had need to anonymize the IP addresses in an Apache access log. It seemed like a simple task; however, there weren’t any really good code samples out there directly for it. It’s a pretty simple exercise; however, given there wasn’t anything readily available, I figured I’d post it here so others might make use of it. The only requirement it really had was to be able to process large logs rather fast and to maintain the same IP address mappings for multiple entries in the logs in order to preserve the actual traffic data as it relates to sessions. With a little more work, I’m sure it could select random IP addresses in the same geo as the original one whereas this will probably evenly distribute the IPs across the globe (skewed for actual ownership of the ranges).
So here are the few lines of Perl that got the job done:
#!/usr/bin/perl
if ($#ARGV + 1 < 1) {
print "\n\tUsage:\n";
print "\t------\n\n";
print "\tperl log_anonymize.pl file1 [file2 [file3 [...]]]\n\n";
die "Please specify at least one file to use this script.\n\n";
}
my %forward = ();
my %reverse = ();
foreach (@ARGV) {
open(ORIG, $_)
or die "Failed to open input file for reading.";
open(ANON, "+>", $_.".anon")
or die "Failed to open destination file for writing.";
while (<ORIG>) {
if (/([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)/) {
if (!($forward->{$1})) {
$newIp = getNewIp();
while ($reverse->{$newIp}) {
$newIp = getNewIp();
}
print "New mapping created: $1 -> $newIp\n";
$forward->{$1} = $newIp;
$reverse->{$newIp} = $1;
}
$repl = $forward->{$1};
$_ =~ s/$1/$repl/;
}
print ANON $_;
}
close(ORIG);
close(ANON);
}
exit 0;
sub getNewIp {
return int(rand(256)) . "." . int(rand(256)) . "." . int(rand(256)) . "." . int(rand(256));
}
It is fairly straightforward. You invoke the Perl script with one or more arguments. Every argument should be a path to an access log. For each file, a new file of the same name and “.anon” appended gets created. Across all those files, the script maintains an internal hash of the IPs it has mapped to a new, random IP address and will re-use those mappings as they are encountered. It spits out a little message when the mappings occur so you could do some counts using ‘wc’ or something similar to see how many you had… or you could make it output a count at the end, it’s pretty simple to do either.
So that’s it, easy web log anonymizing via random IP address remapping.
Received SAS Certified Advanced Programmer Credential for SAS 9
After 9 months with SAS, I now have passed the advanced programmer certification exam for SAS 9 on November 8, 2007. You can find details about the credential here.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.