Goodbye Zend, Hello cURL

September 29, 2009

cURL you know it’s true … ooh, ooh, ooh, I love you.

Well, we’ll see at any rate.  I spent a significant number of hours today delving into both the Mechanical Turk API (REST) and cURL.  Up to this point I’ve been relying on Zend_Http, but it’s been rocky.  cURL seems to be a solid, long-running library that has been accessible to PHP since the first 4.x series.  The former has a very clean but limited API.  The latter is super arcane, but immensely more powerful.  Again, that’s the theory!

Much of what cURL does is automagic in the best and worst possible ways.  For example, to send POST parameters you can simply call:

curl_setopt($handle, CURLOPT_POSTFIELDS, array('name' => 'value'));

Fantastic!  However, passing an array also makes cURL send the request as multipart/form-data automatically, and you can’t change that short of passing something other than an array.  Likewise, if you did this:

curl_setopt($handle, CURLOPT_POSTFIELDS, 'name=value');

It’ll configure itself to send application/x-www-form-urlencoded. Yikes, what a weird way to choose.
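
If you want to keep the parameters in an array but still get the urlencoded form, one option is to encode the array yourself with http_build_query() and hand cURL the resulting string. A minimal sketch (the helper name encode_post_fields and the example.com URL are made up for illustration):

```php
<?php
// Hypothetical helper: turn an array into a urlencoded string so that
// cURL sends application/x-www-form-urlencoded instead of multipart.
function encode_post_fields(array $fields)
{
    return http_build_query($fields, '', '&');
}

$body = encode_post_fields(array('name' => 'value', 'q' => 'a b'));

// Only touch cURL if the extension is loaded. Nothing is sent here;
// the handle is merely configured.
if (function_exists('curl_init')) {
    $handle = curl_init('http://example.com/'); // placeholder URL
    curl_setopt($handle, CURLOPT_POST, true);
    curl_setopt($handle, CURLOPT_POSTFIELDS, $body); // string, so urlencoded
}
```

The choice between the two content types is still made by the type of the value, but at least the array stays the single source of the parameters.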


That Darn Name

September 28, 2009

An hour ago at Starbucks I happily tweeted my first experience with a FINE Sharpie Pen on my Moleskine.  “Hmm,” I thought afterwards. “I wonder if loveatfirstwrite.com is taken?”  It’s only somewhat clever but is also rather long for a domain name, so there was a good chance it was free.  Moments ago I navigated to that web site and it exists … sort of.  It goes directly to a page selling the domain.  Doh!  Okay, so it’s only $97, and I’m still interested, so I dig further.

loveatfirstwrite.com redirects to ITDomainNames.net purchase page

I immediately opened three tabs to Archive.org, DNScoop, and StatBrain.  The domain existed back in 2007 for a book called “Love @ First . Write”, has a PR1, and gets a couple dozen visitors a day.  That’s not terrible, but I’m in no rush so I checked the WHOIS to see when it would expire …

Domain Name: LOVEATFIRSTWRITE.COM
Registrar: THAT DARN NAME, INC.
Whois Server: whois.intrustdomains.com
Referral URL: http://www.intrustdomains.com
Name Server: NS1.INTRUSTDOMAINS.NET
Name Server: NS2.INTRUSTDOMAINS.NET
Status: ok
Updated Date: 28-sep-2009
Creation Date: 28-sep-2009
Expiration Date: 28-sep-2010

>>> Last update of whois database: Mon, 28 Sep 2009 22:35:51 UTC <<<

Wha-wha-what? It’s a bit fishy. Either there’s a massive coincidence here, which is entirely possible, or that registrar is using Twitter as a domain tasting source. Yes, it all sounds very conspiracy theory, but imagine they come up with domain names and then only purchase them if they already once existed and have backlinks. My tweet would have been perfect to generate this name. Darn them!


Consuming Feeds

September 23, 2009

Today I forced myself to add another feed consumption feature: populating a page’s contents. In particular I had found a District 9 press kit photostream which made me want to be able to incorporate all the entries as a single page. The challenge is how to chop them up into discrete page objects. I tried several approaches and ended up, for now, just letting my generic HTML parser (developed for the Blogger Firehose) attack the entry contents node:

Sharlto Copley's site has a page based on a Flickr photostream feed

One of my attempts flattened the contents HTML into text, used it as a description, and then looked for enclosure links to use as images. That worked great except Flickr’s enclosure images are the originals and in this case they were huge! Seriously, at over three megabytes a pop, I had to decide whether to shrink images after downloading them or just grab the smaller images. I hastily chose the latter (abandoning the enclosures), but I’d like to point out this doesn’t really affect the feed source (e.g. Flickr), because the picture caching system will only ever download a URL once until manually cleared.

The template Sharlto’s site uses now also dynamically generates the categories on the left whereas previously it was fixed to just “Pictures” and “Videos”. That’s why the news is in the middle, because they’re alphabetically sorted. I’m not sure it’s worthwhile to fix that, but it irks me (generated content should go at the bottom, right?). I had intended to embed the news in his index page, which is possible with the new feed system, but it looks like ass (Google’s News RSS uses fixed HTML and it’s against their T.O.S. to modify it at all). Another option would be to have a feature which embeds a category listing into the body of a page … something I’m curious to try, but it is actually a large architectural addition. I’m trying to keep the count of object types to an absolute minimum.

There are also two other small modifications to his site, can you tell? First, menu items hide their overflow which was necessary for those arbitrary news titles. Secondly, only 5 items in any category are shown and then underneath is a more link which takes you to that particular category’s page. That one has been on my to-do for a long time and obviously wasn’t that difficult. Having it means I can add stuff more readily and not worry about the page expanding into an unmanageable pile of links.
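
For the curious, the “show 5, then a more link” rule boils down to something like the sketch below. This is not the template code itself; the function name, the array keys, and the sample URLs are invented for illustration.

```php
<?php
// Hypothetical sketch: render at most $limit links for a category and
// append a "more" link to the category page when items were cut off.
function render_menu(array $items, $categoryUrl, $limit = 5)
{
    $html = '';
    foreach (array_slice($items, 0, $limit) as $item) {
        $html .= '<li><a href="' . htmlspecialchars($item['url']) . '">'
               . htmlspecialchars($item['title']) . "</a></li>\n";
    }
    if (count($items) > $limit) {
        $html .= '<li><a href="' . htmlspecialchars($categoryUrl) . '">more</a></li>' . "\n";
    }
    return $html;
}

// Seven made-up news items: five are listed, followed by the more link.
$news = array();
for ($i = 1; $i <= 7; $i++) {
    $news[] = array('title' => "Story $i", 'url' => "/news/story-$i.html");
}
$menu = render_menu($news, '/news.html');
```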


FanSiter supports Fart.Go

September 21, 2009

I created Fart.Go in July 2007 as part of a SEO experiment to develop a niche website. Initially it started as poopy.info, which is why the logo ALT is still “I See Poopy”, but I had trouble getting Google to index it. The problem probably had something to do with early 404’s and 500’s, but I blamed the TLD and went to register a new domain. I wanted something that included the word fart and ended up reversing my favorite (Go Farts) resulting in fartgo.com. It’s an awkward name, but near-impossible to misspell, and being different is more a boon for the obscure than a bomb in my opinion.

By all counts it is a successful website. I spent a weekend collecting 50 pieces of content and organizing them in a very specific structure. Its initial host was Blogger, but the blog format just doesn’t suit a small, “tight” collection of categorized pages. Thus a JavaScript to run on WSH was born to generate the HTML which I then manually uploaded (some URL’s, you may notice, still use a blogspot structure of YYYY/MM/name.html). At the time I thought the process was painless enough, but later it became clear just how wrong I was based on how infrequently I updated the thing.

Fart.Go simmered in its own stinky stench with nary a helping hand to waft it into greater spaces for nearly two years. It accrued a meager few updates, literally two or three. It floated to the top of Google for certain phrases and traffic steadily climbed to about 200 unique visitors a day. And finally it made about a buck a day on AdSense. I never implemented even a fraction of the fantastical features planned and yet mere age, stability, and URL permanence solidified its success. Imagine if I could make it better, or is meddling bound to make it worse?

We shall see! In the first week of August, over a month and a half ago, I pondered the ability for FanSiter to support Fart.Go on its platform. I reasoned they shared enough common capability to warrant a bit of effort to implement any additional code necessary. This is back when FanSiter2’s spaghetti shop ran the show and FanSiter3 existed locally in mostly-untested, fragmented form. Thus only last week with the advent of the Blogger Firehose did I set out to complete this task.

Fart.Go (via FanSiter platform) in FireFox 3.X on a EEE 901 netbook.

Talk about underestimating requirements! Tons of programming and manual conversion later, it’s finally online. The only thing that kept me going was my own stubbornness and that feeling of “I’ve already come this far …”. Before I talk about some of the new engine capabilities, let’s look at a couple of the new things about the site output itself:

  • FavIcon is transparent: Whatever I used to create the original ICO from my GIF did a very poor job and I shrugged my shoulders at the result, despite shuddering every time I saw the white boxed poop in the browser. This time I did some simple GIMP edits to remove the more conspicuous anti-aliasing artifacts and then ran convert in Ubuntu (wonderful utility BTW). Voila! It turned out so well that I also used it for list items!
  • Newest content list actually works. Previously it generated the list of 4 or so and then popped off drafts. Thus you’d only see maybe one item since I had a lot of drafts taking up space. That list also shows their publish date, which is hidden elsewhere, to indicate freshness. Yes, that is a deliberate decision to encourage updates!
  • Next post arrow based on date. I don’t remember what the » link went to before, but now it goes to the next, older item by date. Thus when you read the newest page, you can click that to go to one older, and on and on regardless of what categories those pages are in.

Internally I expanded the site-text parsing capabilities to encompass near-arbitrary HTML. FanSiter3 divides page items into discrete objects that then have a type: picture, video, paragraph of text, link, etc. Generally this is done by splitting by line and parsing each of those individually. Now there is a simple buffer between lines of HTML content which allows tags to span multiple lines whereas before these would be repaired individually.
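
To give a rough idea of what such a line buffer does, here is a sketch. It is a crude tag-counting heuristic written for this post, not the FanSiter3 parser; both function names and the list of void tags are assumptions.

```php
<?php
// Rough check: does this chunk of HTML close every tag it opens?
// Void tags (<br>, <img>, ...) and self-closed tags are ignored.
function chunk_is_complete($html)
{
    // A "<" without its ">" means a tag itself is split across lines.
    if (substr_count($html, '<') > substr_count($html, '>')) {
        return false;
    }
    $void = array('br', 'img', 'hr', 'input', 'meta', 'link');
    $balance = 0;
    preg_match_all('#<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>#', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        if (in_array(strtolower($tag[2]), $void) || $tag[3] === '/') {
            continue;
        }
        $balance += ($tag[1] === '/') ? -1 : 1;
    }
    return $balance <= 0;
}

// Split text by line, but keep buffering lines while a tag is still open.
function split_into_chunks($text)
{
    $chunks = array();
    $buffer = '';
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $buffer = ($buffer === '') ? $line : $buffer . "\n" . $line;
        if (chunk_is_complete($buffer)) {
            if (trim($buffer) !== '') {
                $chunks[] = $buffer;
            }
            $buffer = '';
        }
    }
    if (trim($buffer) !== '') {
        $chunks[] = $buffer; // unterminated tail, passed along as-is
    }
    return $chunks;
}

$chunks = split_into_chunks("Hello\n<ul>\n<li>one</li>\n</ul>\n<img src=\"a.png\">");
```

Here the three lines of the list come out as one chunk, while the plain text and the image each stay on their own.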

On the rendering side there are now type and block wrappers allowing specific tags surrounding individual page objects and collections of those of the same type. Fart.Go’s category pages require this since it uses an HTML list (<ul>). They also have different title structures since the site name and category are displayed at the top (more like a blog) and the page title is not connected. By the same token, it uses different tags for each category page and their URI’s are /category/ not /category.html. Plus there’s the complication of directory redirects (/games => /games/). Whew! Liberal formatting capabilities and use of vsprintf() are now customization options for templates and sites.
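
As an illustration of the wrapper idea, the sketch below formats each object with vsprintf() and then wraps the run of same-typed objects in a block format. The array layout, the render_block() name, and the “Jokes” category are made up; only vsprintf() itself and the /games/ URI come from the real thing.

```php
<?php
// Hypothetical template settings: one format per object type, plus a
// "block" format wrapped around a run of objects of that type.
$template = array(
    'type'  => array('link' => '<li><a href="%2$s">%1$s</a></li>'),
    'block' => array('link' => "<ul>\n%s</ul>\n"),
);

function render_block(array $template, $type, array $objects)
{
    $inner = '';
    foreach ($objects as $args) {
        // Each object is an argument list for the type's format string.
        $inner .= vsprintf($template['type'][$type], $args) . "\n";
    }
    return sprintf($template['block'][$type], $inner);
}

$html = render_block($template, 'link', array(
    array('Games', '/games/'),
    array('Jokes', '/jokes/'),
));
echo $html;
```

Because the formats use numbered placeholders (%1$s, %2$s), a template can reorder or drop arguments without the engine having to know.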

Finally, because I’m tiring myself writing this, I introduced the Blogger Firehose which allows one to utilize the UI and layout of a blog to write and post content for multiple sites. Fart.Go needed this because it’s already pushing the limits of how much I can stuff into a single Google Sites page reliably (FireFox on my netbook can barely handle all those DIV’s and BR’s) and I want to be able to recruit a writer or writers without overloading them with unnecessary internals knowledge.

It works reasonably well by parsing the actual HTML output and configuring various settings to discourage people from visiting the raw blogspot and/or (more importantly) linking to it. One issue I’m still working out is the PHP DOM creating empty tags from ones which need to be self-closed (e.g. <img></img>). Pictures and videos are parsed out assuming they meet certain conditions and relevant page objects are created. Other HTML becomes text paragraph objects and tags can be used to set variables like category. Oh yeah, and the title can specify the URI, which has the useful side effect of letting multiple posts add to the same page!
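
On the empty-tag issue, one thing worth trying is to compare how the DOM extension serializes a single node. The sketch below uses only DOMDocument::loadHTML(), saveXML(), and saveHTML(); the sample markup is made up, and I am assuming the pairs come from the serialization step rather than from the parsing.

```php
<?php
// Sketch: serialize one <img> node two ways. Requires the DOM extension.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Hi<img src="a.png"></p></body></html>');
$img = $doc->getElementsByTagName('img')->item(0);

// XML-style output; empty elements are written self-closed unless the
// LIBXML_NOEMPTYTAG option is passed.
$asXml = $doc->saveXML($img);

// HTML-style output of a single node (the node argument needs PHP 5.3.6+).
$asHtml = $doc->saveHTML($img);
```

If LIBXML_NOEMPTYTAG is in play anywhere, dropping it for void elements like img and br would be the first thing I’d check.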

It’s all very cool, but I need to get back to expanding and testing it. Next up is some more Delicious parsing for objects/pages and working with the media namespace in RSS for photostreams.


Fart.Go goes to shit

September 21, 2009

The infamous Fart.Go is now fully on the FanSiter platform. Porting forced me to introduce a bevy of features including the super-awesome, heart-attack-inducing “Attempt to delete all files every 5 minutes”. Yikes!

Yesterday I pushed a metric shit ton of new code live containing the aforementioned bug. The first mysterious moment occurred when testing the new Blogger Firehose (more on that later): blogg3.php suddenly couldn’t be found and everything began burping 404 errors. Now, I discounted the possibility of programming fault and chalked it up to a user error, because dragging from an FTP folder in Nautilus (Ubuntu) will remove it from the server (and mess up the local permissions). I just assumed I had accidentally dragged the wrong direction between windows.

After successfully completing a smoke test, I ate some lunch and relaxed to some ambient “space music”. Following that, in a paranoid check (the only thing that saves me from my own idiocy), I encountered the same problem! Weirdly, and this didn’t tip me off to the seriousness, it gave me the 404 even moments after re-uploading the file. Fuzzy memories prevent me from forming a proper excuse as to why I didn’t go on a surgical journey through the entire source and logs. Stupidity and exhaustion, I suppose (it took hours across multiple days to snip, paste, and restructure the site data into the new CMS format).

One of our cats, Princess Kitty, woke me up with incessant yowling this morning before 6. I had gone to bed past 1. I was not amused. After getting up *sniff*, and pulling myself together, I decided on a whim to check the site. 404! WTF! WTH! With only 15 minutes until my bus arrived, I logged onto FTP to find all but a couple PHP files completely gone.

Oh shit oh shit oh shit oh shit … I’ve been hacked! No, that’s silly, why would they delete some and why wouldn’t they just leave my stuff alone while simultaneously hosting their own stuff? Yeah, not hacked, definitely programmer error.

I spent the bus ride jotting notes on where to look, what functions to scrutinize, and how to scour the logs. After getting set up at Tully’s, however, it turned out to be very simple. One of the very first emails I pulled up from my CRON script (MediaTemple automatically emails you CRON output, so my scripts only output if there are errors) showed it trying to delete literally everything off the system. Just above that …

Warning: Missing argument 1 for DelTree(), called in /nfs/c03/h03/mnt/55261/domains/fansiter.com/html/templates/template1.php on line 13 and defined in /nfs/c03/h03/mnt/55261/domains/fansiter.com/inc/fileio.php on line 4

Notice: Undefined variable: dir in /nfs/c03/h03/mnt/55261/domains/fansiter.com/inc/fileio.php on line 7

Notice: Undefined variable: dir in /nfs/c03/h03/mnt/55261/domains/fansiter.com/inc/fileio.php on line 8

Urgh! Okay, this is totally my fault: the call was being made without a directory being specified in one of the templates. However, what kills me is that PHP issues a warning and continues. It reminds me of VB’s ol’ ON ERROR RESUME NEXT bullshit.

I added two things to the function, a pattern that is starting to become common in all my PHP methods (and here’s my recommendation to others working on this haphazard platform): an isset($dir) check for the directory parameter, and an immediate return on the first unlink() failure unless a $force parameter is set to true. Brittle is better, especially when you’re too lazy to properly test.
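
The real DelTree() isn’t shown in this post, so take the following as a sketch of those two guards only. The function body, the default values, and the true/false return convention are assumptions.

```php
<?php
// Sketch of a guarded recursive delete. Returns true only if $dir and
// everything under it was removed.
function DelTree($dir = null, $force = false)
{
    // Guard 1: refuse to run without a real directory argument.
    if (!isset($dir) || !is_string($dir) || trim($dir) === '' || !is_dir($dir)) {
        return false;
    }
    $entries = scandir($dir);
    if ($entries === false) {
        return false;
    }
    $clean = true;
    foreach ($entries as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        $path = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $entry;
        $removed = (is_dir($path) && !is_link($path))
            ? DelTree($path, $force)
            : @unlink($path);
        if (!$removed) {
            // Guard 2: stop at the first failure unless forced.
            if (!$force) {
                return false;
            }
            $clean = false;
        }
    }
    return $clean && @rmdir($dir);
}
```

Called with no argument, as the template did, this version returns false before it ever looks at the filesystem.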


FanSiter Landing Pages

September 12, 2009

It’s been a slow week for improvements, but I finally knocked out a feature I’m calling Landing Pages which also happens to utilize the first RSS feed code.

Good Eats site demonstrates a Landing Page using a new FSLT: "The Green House"

These are something like parking pages, e.g. “AdSense for Domains”, because they are place-holders of sites-to-be.  That is, they represent a serious intent to flesh out content.  Currently, the FanSiter blog is linking to third-levels based on the post and prior to this feature those links went to 404 pages.  Now a user is given a generated site relevant to the thing they clicked on.

Most of the content comes from Google and Bing through RSS feeds.  These feeds become category pages which allow all templates to function normally as long as they can handle arbitrary categories (as opposed to the fixed “Photos” and “Videos” of some of the originals).  The next feature for RSS feeds will be integrating them on an existing page via an RSS “object”.  One of my ideas is attaching a particular picture page to a Flickr photostream feed.  Then I can show the first N images ending with a link to the entire photostream (and each pic will be linked back to its Flickr source page of course).

Oh, and it’s important to note that there are no ads on FanSiter landing pages.  This is very much intentional.  If the content is basically a glorified mash-up (ugh, using that word stings) from existing web services, then it is certainly against someone’s TOS to use ads to monetize it.  Plus, I think parking pages are one of the great internet scourges, and I honestly want to provide some utility with these beyond maintaining URL permanence.

That said, I do plan on adding an Amazon product block under the body text. ;)

The FanMake3 component which generates the web output is starting to get a wee bit messy, because of its manipulation in FSLT.  I am thinking about a bit of refactoring before adding too many more templates.  I’d like to standardize certain code bits across the board like chopping / abbreviating links, sanitizing titles, etc.  Currently it’s all a big spaghetti pile and the functionality is nice, but it’s hard to keep track of the complexity.

Of course, the web (and life) is messy, so maybe it deserves to be left alone.


Feed me FanSiter

September 11, 2009

I’ve been poring over Google’s AJAX API documentation for more than an hour and have come to the conclusion that their neutered RSS feeds are easier to deal with (though I do like the little news widget; I will probably end up using that in place of ads on un-scrubbed fansites). My issue is all of the weird functions and controls just for listing a set of links based on a query.

I realize their results are essentially proprietary, and they provide a JSON API, but I just want to suck it up ahead of time on PHP and regenerate periodically. Also, this will go on ad-less pages (e.g. landing sites which represent the intent of a fansite before it exists). I really want these to be useful to visitors so my links to them aren’t tainted.

It must be noted that their Blog and News search services provide RSS feeds, and those I plan to use. Bing’s web search is mostly excellent, so I’ll mash those three things together. Maybe I can tap Yahoo for images? Then that leaves Video. In any case, I realized that I need RSS integration first and foremost because it enables me to create (and I shudder calling it this) “mashup” landing pages which provide the best of the titans’ information.

What I plan to do for the landing pages is have them set certain categories to RSS feeds. Like “News by Google”, “Images from Yahoo”, etc. These then get listed as normal in a template, but lead the user offsite. Those category pages themselves will actually be visit-able and have further descriptions of the links. The main page itself will simply explain the fact that this is a landing page (e.g. rather than a “Biography”) that is automatically generated. The landing pages will slowly be replaced, of course, but in the meantime this should fill the gaps nicely.
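
A rough sketch of turning a feed into one of those categories follows. It uses PHP’s built-in SimpleXML rather than the Zend Framework feed classes, purely to keep the example self-contained; the sample feed, the function name, and the array layout are all made up.

```php
<?php
// Made-up sample feed standing in for a real search-result feed.
$xml = <<<XML
<rss version="2.0"><channel>
  <title>News by Google</title>
  <item><title>First story</title><link>http://example.com/1</link><description>One</description></item>
  <item><title>Second story</title><link>http://example.com/2</link><description>Two</description></item>
</channel></rss>
XML;

// Hypothetical helper: map an RSS 2.0 document onto a category whose
// links lead offsite, keeping the description for the category page.
function feed_to_category($xml, $limit = 10)
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return null; // not well-formed XML
    }
    $category = array(
        'name'  => (string) $feed->channel->title,
        'links' => array(),
    );
    foreach ($feed->channel->item as $item) {
        if (count($category['links']) >= $limit) {
            break;
        }
        $category['links'][] = array(
            'title'       => (string) $item->title,
            'url'         => (string) $item->link,
            'description' => (string) $item->description,
        );
    }
    return $category;
}

$category = feed_to_category($xml);
```

Running this ahead of time and caching the resulting array is what lets the templates treat a feed like any other category.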

Finally, this allows me to keep those nifty next/previous footer links which let crawlers find all the fansites (albeit slowly). Alright, so it’s RSS time. Looks like the Zend Framework provides a decent API for working with feeds.


Bing People Search

September 7, 2009

I think Bing is trying to be overly helpful and therefore misinterprets searches. For example, I was trying to remember the name of this person I met at PAX and could only conjure two pieces of information. Here are the results of my searches (tried Bing first):

Google vs Bing on name search

You can guess which result helped me.


Masked Thumbnails and Amazon

September 2, 2009

I’ve had these two features on my list for a while and they were relatively simple to bang out.

Mary Alice's fansite shows off new masked thumbnails and an Amazon block

The image-masking capability I copied from the old FanSiter code written in September 2007. In fact, I also managed to salvage the great blurb I wrote for Mary Alice along with the photos I had found for her (replacing nearly all of what Turk workers turned up). Anyway, now you can specify any mask you want for thumbnails. The only requirement is it must be the same size as the thumbnails and is currently just a “multiplication” filter.
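
For anyone wondering what a “multiplication” filter means here: each channel of a thumbnail pixel is multiplied by the matching channel of the mask pixel and scaled back to 0–255, so white in the mask leaves the pixel alone and black blacks it out. The sketch below works on plain arrays to stay self-contained; it is not the FanSiter code, and the function names are invented. With GD you would read and write the pixels with imagecolorat() and imagesetpixel() instead.

```php
<?php
// Multiply blend for one 0-255 channel value.
function multiply_channel($value, $mask)
{
    return (int) round($value * $mask / 255);
}

// $thumb and $mask are same-sized 2D arrays of array(r, g, b) pixels.
// Returns the masked pixels, or false if the sizes differ.
function apply_mask(array $thumb, array $mask)
{
    if (count($thumb) !== count($mask)) {
        return false;
    }
    $out = array();
    foreach ($thumb as $y => $row) {
        if (count($row) !== count($mask[$y])) {
            return false;
        }
        foreach ($row as $x => $rgb) {
            $out[$y][$x] = array_map('multiply_channel', $rgb, $mask[$y][$x]);
        }
    }
    return $out;
}

// A 1x3 "thumbnail" against a white, a mid-gray, and a black mask pixel.
$thumb  = array(array(array(200, 100, 50), array(200, 100, 50), array(200, 100, 50)));
$mask   = array(array(array(255, 255, 255), array(128, 128, 128), array(0, 0, 0)));
$masked = apply_mask($thumb, $mask);
```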

Second on the list of new features is the ability to specify an Amazon iframe URL to use for a “bigad” block. For someone more obscure like the office manager from Ace of Cakes, this allows you to show a specific product or search results. In the above example, I embedded search results for “Ace of Cakes” so visitors should always see whatever the latest season is and any DVD’s that Amazon feels are related. Not too shabby for feeling blocked earlier.