Crowd Sores

October 14, 2009

I spent the last couple hours melting faces with fireballs in Oblivion (well, mostly turning human foes against one another using illusion magic) to unwind from the brutal moderation experience I went through with Mechanical Turk.  Let me tell you the tale …

In order to streamline the FanSiter seed site creation, I stopped using their “batch” web mode (works just like a mail merge) since it is impossible to retry individual HIT’s.  You get an all-or-nothing kind of deal unless you mitigate it by adding redundant HIT’s.  That redundancy fragmented the data, I now have a contractor dedicated solely to fixing it, and thus the whole thing backfired rather magically.  Sure, there are tons of individual sites, but you only see a portion of the bio’s I paid for.

So I spent significant time and development effort to learn the API and write a few PHP classes to help me make use of it.  They work great and after a bit of mind-numbing testing I unleashed new code which would spawn HIT’s based on Landing-only entries in the plat.  This did its job as expected, but certain parts of my moderation tools were broken and required repairs:

  • Stray ampersands made Blogger puke, so approved HIT’s got disposed without a successful post being created.  I updated the nifty RepairTags() function but then didn’t add code to call it.  Face palm!
  • Mid-way through modifications made against live I changed the moderate() method to return a string status rather than a boolean, but I didn’t correct the way it decided which status to return so it was always approving.  Whoops!  Set a bad precedent and I had to delete terrible data that I’d paid for.
  • Adding another assignment when rejecting a previous one seemed like a good idea, but when people weren’t “getting” a particular name (Orange Avenue had 3 rejections, because people wrote about an album by the same name — WTF?!  And for Tettix someone wrote a blurb on the Cicada bug instead of the musician!) I had to manually go in and blow away the whole HIT.  If you’ve ever used Amazon’s “Manage HIT’s Individually” then you know how painful this is.
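The stray-ampersand repair can be sketched roughly like this.  It is not the actual RepairTags() code; the function name and the regular expression are my own assumptions about what such a fix needs: escape every “&” that isn’t already the start of an entity, and leave existing entities alone.

```php
<?php
// Hypothetical helper, not the real RepairTags(): escape any "&" that does
// not already begin a named (&amp;), decimal (&#38;) or hex (&#x26;) entity.
function escape_stray_ampersands($text)
{
    $pattern = '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/';
    return preg_replace($pattern, '&amp;', $text);
}
```

The lesson from the bullet above still applies: the function only helps if something calls it before the post is submitted.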

There were probably other, smaller issues, but those three got me frustrated.  Add to that the quality of the data and I started to get really irritated.  I had kept the wording simple but specific in the HIT instructions and didn’t duplicate much between its description and the question text.  Unfortunately, it appears many workers ignore the description since it’s not very noticeable, and there was that confusion of writing about a bug or an album rather than a celebrity (despite that being in the title).  I got a lot of completed assignments that were either literal copies of Wikipedia passages or only slightly re-arranged.  There’s a tool called “Google” which is amazing at helping you find duplicate or plagiarized content.  I guess when you spend very little time on something, you don’t care if you get away with it or not.

Having a length requirement (200+ words) allowed me to reject poorly written stuff on that point alone, but then that ended up being a sticky wicket.  I failed to give a reason on a few of these, since I assumed the results being so bad would be apparent to the writer, and ended up with this fun email conversation:

Jim: may i ask why you rejected this?

Me: You’ll have to provide the contents of the HIT, the software I’m using disposes them once they’re rejected.  Two possibilities that come to mind based on the past rejections: too short (requirement was 200+ words) and/or copied paragraphs from Wikipedia or other sources.

Jim: all were over 200 words and nothing was taking from anyone else. I have found that this site is just to good to be true, shame on me for trusting you. I have reported you to amazon.

Me: That’s fine.  I found this, is it yours?

Devon Aoki is an actress and model, she is best known for her roles in Rosencrantz and Guildenstern Are Undead as Anna, Mutant Chronicles as Cpl. Valerie Duval, War as Kira, DOA: Dead or Alive as Sumi, Sin City as Miho, D.E.B.S. as Dominique and 2 Fast 2 Furious as Suki.

Devon Aoki has been doing television and movies since 2003. She has also done some modeling in 2006.

Devon Aoki is a beautiful charming actress that we see going places in the future.

That’s 84 words.

Jim: yes

He didn’t stop there, though.  Here’s another email thread about his one other assignment.

Jim: why did you not pay me? I did the work in my own words, i guess you are just another scam site, like most of the others on here, I will report you to amazon, as I did what you asked of me.

Me:

Jesse Eisenberg is best known for his roles in Kick the Can, The Social Network, Holy Rollers, Camp Hope, Some Boys Don’t Leave, Beyond All Boundaries, Solitary Man, Zombieland, Adventureland and The Hunting Party.

Jesse Eisenberg was born 5 October 1983, New York City, New York, USA.

Jesse Eisenberg has been in the Television and Motion Picture industries since 1999.

Jesse Eisenberg is currently working on 4 movies.

Our outlook for Jesse Eisenberg is watch out he is already a rising star, and we believe anything he touches will turn to gold.

92 words.

Ugh, those are just awful seed bio’s and then he lashes out to defend them. On the one hand I have the right to approve/reject whatever I want, those are the rules of Turk, and I can do it based on quality alone.  On the other I only set the award at 0.49 (0.50 shows up as 0.5) and that may be part of the problem. What I’ve found, though, is raising or lowering that amount still gets you the same amount of cruft to sift!  I did get a few really awesome blurbs today, but all in all probably approved only 20%.

I had wanted to avoid requiring qualifications or a quiz, but now I believe it’s going to be necessary. I have thought a lot about whether or not to just go to a singular writer or writing firm, but those just don’t scale the way I’d like. Plus you get less variety in the style since it’s the same person doing all of them. I like the idea of letting in random people, who have down time to make some bucks rather than screwing around on the Internet. The glaring problem here is the current crap I’m getting, both in content and sass.

So … what to do?  Here are my thoughts.  I’m going to add some qualification requirements, which is more research and coding (argh!).  Then in order to make it a worthwhile pain, I will set the award to something in the neighborhood of $5.  Thus you can flip burgers or do blurb essays on Turk.  Will it work?  I don’t know, but I can’t continue the way I have.  My sanity is at stake.


PHP Sessions

October 8, 2009

I just spent a half hour playing with PHP5’s sessions and would like to disperse some knowledge.  My first question …

How can you check if a session has been started?

session_id() will return an empty string (which evaluates to false) if session_start() has not been called.  However, if you’ve closed the session with session_write_close() / session_commit() and then re-started it, the old ID will be there.  So except for detecting that session_start() has never been called, you can’t reliably tell at all.
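For anyone on PHP 5.4 or later, session_status() reports whether a session is currently open, which session_id() alone cannot do.  Here’s a minimal sketch, assuming that newer PHP; the two helper names are made up for illustration.

```php
<?php
// Hypothetical helper: true only while a session is open for reading/writing.
// Relies on session_status(), which requires PHP 5.4 or later.
function session_is_open()
{
    return session_status() === PHP_SESSION_ACTIVE;
}

// Hypothetical helper: start a session only when none is currently open,
// so library code does not trigger the "already started" message.
function session_start_once()
{
    return session_is_open() || session_start();
}
```

After session_write_close() the status drops back to PHP_SESSION_NONE even though session_id() still returns the old ID, so the helper keeps working across close/re-open cycles.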

Are session_start() calls reference counted (pushed and popped)?

No, they aren’t ref-counted, and subsequent start requests will result in notices.  Likewise, calling session_write_close() multiple times does nothing, but it also doesn’t generate any message.  This means that unless you’re in the entry-point of a script, you shouldn’t be calling session_start() without first checking session_id() to see if it’s been started already.  Of course, since that doesn’t always work, it’s a crapshoot!

Why open and close the session manually and/or multiple times at all?

In most PHP installations a session is implemented as a file.  That file is located based on the session identifier which is probably the file name itself.  So starting a session opens that file for random access (read/write) and committing it causes it to be closed.  Thus there are inherent performance issues with auto-starting sessions and simply “leaving them on and open all the time”.  Your script will close sessions when it ends, but why not open and close only if you need to read/write?  Not doing this also causes contention on the user side with sites where resources are generated by PHP.  If each request has to wait for the last to complete you effectively get a slow, serialized experience.
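The open-late, close-early idea might look like the sketch below.  The function name is hypothetical, and the session_status() check assumes PHP 5.4 or later.

```php
<?php
// Hypothetical helper: open the session only long enough to store one value,
// then release the session file so parallel requests are not kept waiting.
function session_store($key, $value)
{
    // Open the session only if it is not already open (PHP 5.4+ check).
    if (session_status() !== PHP_SESSION_ACTIVE && !session_start()) {
        return false; // the session could not be started
    }
    $_SESSION[$key] = $value;
    session_write_close(); // writes the data and closes the session
    return true;
}
```

Reads work the same way: start, copy what you need out of $_SESSION into local variables, and close before doing any slow work.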


PHP AdHoc Debugging

October 8, 2009

Here’s a quick tip, mostly for myself for when I later forget. Many PHP installations tend to hide the notices and sometimes even the warnings generated by trigger_error() and other problems in the core function set. If you’re debugging, or just always want these printed out, make sure you add the following to the top of your script:

// Report every notice, warning, and strict-standards message.
error_reporting(E_ALL | E_STRICT);
// Flush output as soon as it is written instead of buffering it.
ob_implicit_flush(true);

The first line does what I mentioned and the second is just to have all output flushed at the time it’s written, rather than being buffered. I like this better, because longer-running scripts will expose their code position by what’s been echoed already.


Authorship and Editing

July 2, 2009

Today I completed the final feature for the authorship stage: editing.

AJAX-loaded edit form; text is initialized to previous value.

This brings me up to having done the following:

  • Saving UID on S3 objects.
  • Locking (all or just comments).
  • Moderation (approve/hide); there’s a bug here which gives the author too much power, but I can fix it later.
  • Editing.

As has been the case, the code and output is very rough, but it does what I set out to make it do.  Hurray!


Last Week This Week

June 30, 2009

Damn bladder, I just got to actual work today (at 3:23 PM) and now I have to pee … again.  I can blame this on two things: Panera having caffeine-free Diet Pepsi and my addiction to it.  Anyway, last week suffered due to the upcoming and now-past half-marathon.

Page (categorize) and segment (show/hide) moderation forms.

Basic moderation tools are in place, however, so I’m moving on to authorship.  I must note, that I did not put any alerts in place for unmoderated content.  This means that only if a moderator finds your content will they see it needs to be moderated.  This decision was made while my mind was fuddled on Friday with thoughts of running for 13.1 miles, but the week is gone and so I’m moving on.

Authorship means retaining control of the pages you start and the content therein.  Contributors may add content, but only the author has the ability to edit and redirect.  The latter is a feature which won’t be in this release for authors who are self-hosting their content (the Forust page redirects to their own site/domain).  Authors are niche operators of the page they started, giving them raised privileges.

First point: authors are page moderators, not site moderators.  They cannot move pages around on the site (rename, link, categorize, delete), but they can affect segments within the page.  They also have a unique ability to edit segment text prior to moderation or comments being added.  Let’s explore that first.

Rather than giving a time limit, a new segment’s text may be edited before it is commented on.  Also, if a site moderator gives the segment a thumbs up (‘v’ == ‘1’) then the text cannot be edited either.  This allows a page author to build the content from multiple segments and get it “pristine” before advertising its existence.  They can do this for any segment on the page, even ones added by other users.  Thus if you want control, you must start your own page.  If you want to keep a page author from editing a segment you added, just add a comment to it.
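The edit rule boils down to two checks.  This is a rough sketch, not the Forust code; the array shape (a ‘v’ vote and a ‘comments’ list) and the function name are assumptions made for illustration.

```php
<?php
// Hypothetical segment shape: 'v' holds the site moderator's vote
// ('1' = thumbs up) and 'comments' is the list of comments on the segment.
function segment_is_editable(array $segment)
{
    $approved  = isset($segment['v']) && $segment['v'] == '1';
    $commented = !empty($segment['comments']);
    // Editable only while nobody has approved it or commented on it.
    return !$approved && !$commented;
}
```

A single comment or a single thumbs up is enough to freeze the text.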

Why prevent edits?  Simple: context.  If you comment on a block of text and then that text is changed, the context of the comment is lost.  There is also the possibility of malicious authors submitting good text and then changing it to bad text after a site moderator has approved it.  Neither situation is desirable.

Alright, the other (and much simpler) thing an author can do is choose whether a segment is visible or not.  This is the exact same form displayed to site moderators.  Anonymous visitors will get the visibility of a segment based on the author or site moderator; whichever is latest (time-based).
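“Whichever is latest” can be sketched as a pass over the recorded decisions.  The decision layout (‘by’, ‘visible’, ‘time’) and the function name are hypothetical, and returning null when nobody has decided is my assumption rather than settled behavior.

```php
<?php
// Hypothetical: each decision records who made it, the chosen visibility,
// and a timestamp. The most recent decision wins regardless of who made it.
function latest_visibility(array $decisions)
{
    $latest = null;
    foreach ($decisions as $decision) {
        if ($latest === null || $decision['time'] > $latest['time']) {
            $latest = $decision;
        }
    }
    // null means nobody has decided yet (assumed default).
    return $latest === null ? null : $latest['visible'];
}
```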

Note on visibility: it’s currently set as a CSS class on the segment <div>.  This means that hidden segments are not actually hidden since there’s no stylesheet.  Thus another task this week is a bit of jQuery magic to make these invisible unless the user asks to see them.  It doesn’t help against spam (e.g. SE’s will still index), but protects the user.  I’ll get to the spam issue in that stage.

Lastly, authors must be able to “lock” their pages once they get them the way they want.  This also prevents comments, unfortunately, but that’s a problem for later.  The lock functionality is already there, so this is a simple matter.


One by One

June 24, 2009

My initial plan for the first mod tool was to present a list of all the new pages, but that’s turning out to be more than I want to work on.  Now I’m thinking the mod tool will display the S3 object info, plus the categorize form, and the ACTION URL will be the same tool which will then pick up the next page to moderate.  That way a moderator does them one by one and each submit brings them to the next thing.  This process can certainly be streamlined, but later when I figure out more about what works and what doesn’t.  I also need to move on to the next tool, which is moderating segments on accepted pages, a much harder tool to create.


Rename Complications

June 24, 2009

Giving a new URI to a page is a complex process mired in the asynchronous nature of S3, the goal of keeping every single URI ever approved for permanence, and now basing moderation tools around the initial structure.  The last of these is the most recent addition to the already-long Bough::rename() method.

There is no move method for objects in S3, only copy and delete.  Thus there is now the possibility that the copy will complete and the delete won’t, leaving … two copies of the same object with two URI’s.  I can combat this by adding an ETag check, but what happens when a user intentionally uploads two identical files?  Obviously, not what they expect when only one shows up.

Code to handle failed moves means dealing with redundancy.  The old URI will have a ‘m’ (moved) leaf associated with it and the object with an exact URI key match must exist at the new location, else the move must have failed and needs to be retried.  It’s complicated, but being able to handle possible scenarios is important and I wanted to post this so I know what I was thinking when I come back to this self-correction months or years from now!
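To make the retry logic concrete, here’s a sketch that uses a plain PHP array as a stand-in for the S3 bucket.  Nothing here talks to S3, and the function name, the return values, and the layout of the leaves array are all assumptions.

```php
<?php
// $bucket is a plain array standing in for S3: key => object body.
// $leaves maps a URI to its leaf data; 'm' holds the URI the page moved to.
function finish_move(array &$bucket, array $leaves, $oldUri)
{
    if (!isset($leaves[$oldUri]['m'])) {
        return 'not-moved';             // no move was ever requested
    }
    $newUri = $leaves[$oldUri]['m'];
    if (!isset($bucket[$newUri])) {
        if (!isset($bucket[$oldUri])) {
            return 'lost';              // nothing left to copy from
        }
        $bucket[$newUri] = $bucket[$oldUri]; // the copy failed, retry it
    }
    unset($bucket[$oldUri]);            // retry (or harmlessly repeat) the delete
    return 'moved';
}
```

Because the copy is only redone when the new key is missing and the delete is safe to repeat, the repair can be run any number of times.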


Zend OpenID 2

June 20, 2009

Zend’s tutorial on using their OpenID consumer convinced me to use their framework.  It looks like they based their demonstration on the SimpleOpenID class, yet a glance at the code and its tests tells you it’s built on much more thoughtful underpinnings.  Unfortunately … it doesn’t support OpenID 2!  Their page states that it has OpenID 2 support, but that doesn’t include YADIS or XRI!

The week has ended, however, and I switched to their codebase yesterday for logging in.  I left my own implementation buried within ForustOp_Auth as functions that are never called, mostly just for reference.  I filed a bug about OpenID 2 that I’ll revisit later.  Perhaps I’ll assist the Zend community at that point, who can tell?

I made some user-session decisions this week: you can log in, but the information is not yet necessary or even included in the TSV or object data.  Those things will come next week with Moderation, followed later by Authorship.


Cookies+Session Required

June 18, 2009

Forust is in read-only mode by default unless sessions are enabled.  Additionally, operations cannot be performed unless a prior request has verified that session values are persisted.  This isn’t an issue for web browsers since a GET on the page starts the session and a further POST will replay that cookie with its session ID.

Internally, Forust never calls session_start() and is coded to handle the case where $_SESSION is not available (e.g. session_start() was not called).  The session variable ‘upd’ (last updated timestamp) must be present for any operation to succeed; it is set automatically on $_SESSION at the end of Forust::operate().
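In rough form the guard looks something like the sketch below.  These are illustrative stand-ins rather than the code inside Forust::operate(), and both function names are made up.

```php
<?php
// Hypothetical guard: writes are allowed only when a previous request
// already stored 'upd' on the session, proving the session persists.
// Accepts null so the caller can pass "no session at all".
function can_operate($session)
{
    return is_array($session) && isset($session['upd']);
}

// Hypothetical: run at the end of every operation to record the timestamp.
function mark_updated(array $session, $now)
{
    $session['upd'] = $now;
    return $session;
}
```

A caller would pass isset($_SESSION) ? $_SESSION : null, so a script that never started a session simply stays read-only.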

Even anonymous users must have a session and a user identifier in order to make changes.  The user identifier is generated automatically as “a[timestamp]” which allows for a very simple check for the “a[” prefix and “]” suffix to determine if they have authenticated or not.  It’s true that this means two anonymous users might end up with the same ID, but the consequences are not grave.  The user’s REMOTE_ADDR is not used out of respect for their desire for privacy/anonymity.
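The prefix/suffix check is about as simple as it sounds.  A sketch, with function names invented for illustration:

```php
<?php
// Hypothetical: build an anonymous user ID of the form "a[timestamp]".
function make_anonymous_id($timestamp)
{
    return 'a[' . $timestamp . ']';
}

// Hypothetical: an ID is anonymous when it starts with "a[" and ends with "]".
// Anything else is treated as an authenticated identifier.
function is_anonymous_id($userId)
{
    return strlen($userId) > 3
        && substr($userId, 0, 2) === 'a['
        && substr($userId, -1) === ']';
}
```

Since an OpenID identifier is a URL or XRI, it shouldn’t normally match this pattern.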

Each operation during a session is added to a history list on $_SESSION and counters are incremented to show usage.  It’s possible that I will rate-limit based on the ratio of pageviews to operations, but then it would just encourage spammers to slam the server harder.  More importantly, each operation increments a running total that should at some point be carried in between sessions in order to give more weight to their actions.  Once a spammer is identified, their material will be easily removed, and they’re reset to zero when they start again with a new anonymous user.  Sorting by an author’s operation count allows moderators to serve the highest-quality contributors first.  But I’m getting ahead of myself!

Authentication state is stored on the session as well.  This includes the OpenID information such as claimed identifier, actual identifier (may be delegate), provider, association handle, etc.


Primitive OpenID1 Support

June 17, 2009

I created ForustOp_Auth for all authentication operations (“auth.login” and “auth.logout”) and just got through a successful test using my Blogger OpenID.

Sparse login form, I've entered my OpenID without the scheme.

Blogger's OpenID1 server asks for my approval.

And now I'm authenticated; I made the "create" form temporarily dependent on this for testing.

Authentication results are put into the query string which I'm tracing here.

Rather than use SimpleOpenID directly, I skimmed it to get the gist and wrote mine from scratch.  Just saying that induces a small shudder, but I wanted to fix some bugs (server URL’s with query string parts), use PHP5-specific stuff, and avoid hassles with mixing in their GPLv3 licensed code with my BSD licensed code.

That said, it has some terrible flaws and is nowhere near secure.  It’s enough to get me through the block I had.

You might note that the return_to URL has a parameter ‘p=1’ which has Forust operate on $_GET rather than $_POST.  OpenID servers never POST, unfortunately, so I extended my API a bit (honestly had planned to for testing anyway).

The OpenID server discovery code uses cURL (so https will work) and also the PHP5 DOM which is awesome, if a bit verbose.  It will handle improperly formed HTML and let you parse it like the XML DOM.  I only use it to enumerate all the link elements, but it allowed me to forgo the wacky RegExp patterns used in SimpleOpenID.
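Here’s roughly what the link enumeration looks like.  This sketch leaves out the cURL fetch and takes the HTML as a string; the function name is made up, and the rel values it looks for are the OpenID 1 “openid.server” and “openid.delegate”.

```php
<?php
// Hypothetical helper: collect OpenID 1 endpoints from a page's <link> tags.
function find_openid_links($html)
{
    // Sloppy real-world HTML produces parser warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = array();
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated values.
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        foreach ($rels as $rel) {
            if ($rel === 'openid.server' || $rel === 'openid.delegate') {
                $found[$rel] = $link->getAttribute('href');
            }
        }
    }
    return $found;
}
```

The href comes back as an ordinary string, query string included, so nothing has to be picked apart with patterns.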

Things to be done:

  • Check the signature returned by the server.
  • Use the nonce, but where is it?  Is that only OpenID 2?
  • Check openid_mode (must be == “id_res”).
  • What happens if the same parameters are replayed elsewhere, couldn’t a snoop do that and be logged in with this identity?  This is what the nonce solves (albeit with a slight race condition), right?
  • I’m a bit worried about simply storing the user ID in the PHP session.  Again it seems like something a snoop could take and put in their own requests to operate as that user unless I put the whole site on HTTPS when logged in (ridiculous?).
  • Basic OpenID 2 support so I can use my Gmail account for testing.

There’s probably more, but these are enough questions to keep me busy.