Picture Maximums (PicMax)

November 4, 2009

Having a bit of inspiration after a productive day at the office, I decided to implement this small feature that’s been on my mind for weeks: restricting maximum picture dimensions and/or size during the preload process.  There are two big reasons: (1) FTP push and (2) I have a lot of super-high-quality image sources.  The Grace Park site is the first to take advantage of this, getting a maximum width of 400 (about 25% of the width of some of the sources!).

PicMax is a variable specified in the site text at the site level, e.g. at the very top.  I wanted it to follow this syntax:

picmax=WIDTHxHEIGHT,SIZE

I also wanted each one of those components to be optional.  You can specify all, one, some, or none and achieve the effect you’re looking for.  The aforementioned fansite simply specifies “400” which means “400 maximum width”.  However, what if you just want height?  Easy: “x400” (400 maximum height).  You see, the ‘x’ and ‘,’ are delimiters that need to be present in order to specify what’s after them.  So to cap all pictures at ten thousand bytes, you’d say “,10000”.  Maximum width and size but no height: “400,10000”.  How do I achieve such magic?

preg_match('/^([0-9]+)?x?([0-9]+)?,?([0-9]+)?$/i', $picmax, $m)

Damn I love regular expressions!  Now, to retain your sanity you can specify a zero in whatever spot you don’t want filled in and the code will operate the same.  When I first approached the parsing I wondered how I’d structure the pattern, but it turns out just making everything optional solved it.  Crazy!
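Here is a minimal sketch of how that pattern could be turned into three limits.  The function name parse_picmax() and the array keys are made up for illustration; only the regular expression comes from the post.  One PHP detail worth knowing: preg_match() drops unmatched trailing groups from $m, so each index has to be checked before use.

```php
<?php
// Hypothetical helper: parse_picmax() and its array keys are illustrative names.
// Returns the three limits (0 means "no limit") or false if the text is malformed.
function parse_picmax($picmax)
{
    if (!preg_match('/^([0-9]+)?x?([0-9]+)?,?([0-9]+)?$/i', trim($picmax), $m)) {
        return false;
    }
    // Unmatched trailing groups are missing from $m; unmatched middle
    // groups are empty strings, which cast to 0.
    return array(
        'width'  => isset($m[1]) ? (int) $m[1] : 0,
        'height' => isset($m[2]) ? (int) $m[2] : 0,
        'size'   => isset($m[3]) ? (int) $m[3] : 0,
    );
}
```

With this sketch, “400”, “400x0” and “400x0,0” all come back as the same limits, which is the zero-as-placeholder behavior described above.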

When pictures are resized for scale, their maximum file size is capped at their previous file size.  This prevents the unintended effect of shrinking one and having it grow on disk (the opposite of the desired effect).  Usually resizing dimensions is enough to squish the byte size, but it’s difficult to know what to set the JPG quality value to, and re-saving any lossy format is an easy way towards bloat.
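The post doesn’t say how the cap is enforced, so this is only one possible approach: re-encode at progressively lower quality until the result fits under the cap, and give up if nothing fits.  The function encode_under_cap() and its parameters are assumptions, and the encoder is passed in as a callable so the sketch doesn’t depend on any particular image library.

```php
<?php
// Hypothetical helper. $encode is any callable that takes a JPEG quality
// (0-100) and returns the encoded image as a string of bytes.
function encode_under_cap($encode, $capBytes, $startQuality = 90, $minQuality = 30, $step = 10)
{
    for ($q = $startQuality; $q >= $minQuality; $q -= $step) {
        $bytes = call_user_func($encode, $q);
        if (strlen($bytes) <= $capBytes) {
            return array('quality' => $q, 'bytes' => $bytes);
        }
    }
    return null; // nothing fit under the cap; the caller can keep the original file
}
```

Passing the original file’s size as $capBytes gives the “never grow on disk” behavior; passing the SIZE component of picmax gives the hard byte limit.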


CSS Insert Content

October 31, 2009

On the FanSiter blog I wanted to take advantage of the footer area which, for WordPress and Google Sites alike, is completely off limits to hosted customers.  Luckily for users, and probably a pain for admins, CSS3 allows us to generate and replace content.

A disclaimer before we begin: don’t abuse this for spammy purposes and don’t rely on it for indexing within search engines like Google (it probably isn’t and won’t be for a long time).  This is another, purely-cosmetic modification to your pages.  Which, for the purposes of SEO, is fantastic because you can also reduce redundant content from taking up space, showing up in search engines, etc.  It is relatively easy to detect by service administrators and I have no doubt your banning is imminent if you try to use this to avoid their normal content filters / rules.  WordPress, for example, already automatically removes HTML tags so you can’t use it for links, images, etc. — just text.

.footer_content:before {
content:"Created by Neil C. Obremski • ";
}

That’s all I had to add to my custom CSS in WordPress. What this says is: go inside any tag using the “footer_content” class and insert the content string at the beginning. Let’s see what that does …

CSS inserted text

CSS3 content doesn't show up in FireBug

It works pretty dang well and it shows up in Chrome (and therefore Safari I’m sure), IE8, and FireFox 3.5; i.e. the modern trifecta.  I didn’t try in older browsers, but if it doesn’t show up then it’s not a big deal.  I wanted to start with something rather mundane and actually I tried linking my name to my website, which is when I found out that WP strips out HTML tags.  It’s possible I could find some way around that, but I don’t want to break their trust (and TOS) and lose my blog.

And as noted in the caption, the CSS-generated content does not show up in FireBug, which can provide a bit of mystery to web developers when they’re trying to track down a bug.  This made me think it might be a funny practical joke to play on a designer: use it to insert some rogue content and watch while they freak out trying to figure out where it’s coming from.  In order to hide from file searching functionality, you could use Unicode escaping. :)


Linux Rename using Modified Date

October 22, 2009

I know I’m going to forget how I did this again, especially if I lose the following script I wrote, so I’m posting it here.  Here’s the gist: I have a directory full of files and I want to copy them to a new location while also changing them to lower-case, using a different extension (MOD => mpg), and finally including the Last Modified date in the name (YYYYMMDD-hhmmss).

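#!/bin/bash
# Usage: movcp SOURCE_DIR DEST_DIR
for file in "$1"/*.MOD
do
   # Last-modified time of the source file as reported by stat
   modifdate=$(stat -c %y "$file")
   # Reformat it as YYYYMMDD-hhmmss for the new file name
   formatdate=$(date -d "$modifdate" "+%Y%m%d-%H%M%S")
   echo "cp -p -u $file $2/mov-$formatdate.mpg"
   # Quoting keeps paths with spaces intact; -p preserves timestamps, -u skips up-to-date copies
   cp -p -u "$file" "$2/mov-$formatdate.mpg"
done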

Note to myself: you called it movcp and put it in /usr/bin. Thus I go from this …

Raw video files from SD card for my Canon FS100

Using this …

Linux script copying files off Canon FS100 SD and adding timestamp to their name.

To this:

Canon FS100 files are just MPEG-2 without proper aspect ratio header set

Voila! Now I can archive these off to my ReadyNAS and online backup.


X-Hacker?

October 17, 2009

Ha, clever …

If you're reading this ...


Delicious but Bitter

October 16, 2009

A simple desire to add a link along with a description led me down a completely unexpected path.  However, I did achieve the result I was looking for …

Finally, descriptions on the links page ...

Initially, I thought most of the code was there, I just had to do some switcheroos.  The “lnk” object started as a bit of a bastard and I used the HTML content (“alt” variable) for the link text.  My plan was to allow setting of a “title” variable and then use that as the link text if it existed.  Any kind of weird automagic decision like this in code is not really a great idea, and after futzing with it in various places I realized the mess would have to be lived with or there would be worse consequences.  Rather, I allowed for a “description” variable, and that satisfied me.  In my testing, I noticed some of the sites weren’t appearing in the Plat and that got me wondering about the whole site count in general.

It turns out that the call I was using in Zend’s Delicious object to getRecentPosts() can only get up to 100 as per the API.  Since I had looked up the raw protocol answer myself, I decided it was high time to just call it myself using CurlHttpRequest.  Rather than copy and pasting the URL, I typed it in — bad mistake.  I mistyped the ol’ delicious host “del.icio.us” as “deli.cio.us”.  Even though I know that to be wrong, given how much I’ve typed it in over the years, I still didn’t spot the problem.  This led me on a merry chase for a long while.

Programmers like to argue, but I bet we can agree that the worst challenges are not the hard or complicated ones.  They’re the “this should not be happening, it doesn’t make sense” ones.  If you’ve ever heard one of us say “that’s impossible” then you’ve witnessed the reaction to it.  In my case, it was possible, and came down to a stupid typo which is probably the source of most of these and probably the driving reason behind constants and #define’s.

Within PHP cURL, the user name and password are set as a single string with curl_setopt(…, CURLOPT_USERPWD, …).  I know this and initially, when I was testing with hard-coded values, I called it properly.  After refactoring into a new Delicious class, however, I kept getting 401 errors.  I just about tore my hair out wondering if the ordering was suspect (blaming cURL).  Of course, I was calling my new function with two parameters instead of one.  Stupid, stupid, stupid.  So I updated the function to allow both styles, because I know I’ll make that mistake again later; I figure I’d formalize it now.
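A function that allows both styles might look roughly like this.  The name format_userpwd() and its signature are invented for the sketch and are not the Delicious class from the post; the only real API involved is that CURLOPT_USERPWD expects a single “user:password” string.

```php
<?php
// Hypothetical helper; the name and signature are assumptions.
// Accepts either format_userpwd('user:password') or format_userpwd('user', 'password').
function format_userpwd($user, $password = null)
{
    if ($password === null) {
        return $user;                  // one-argument style: already combined
    }
    return $user . ':' . $password;    // two-argument style: join with a colon
}

// Usage, assuming $ch is an existing cURL handle:
// curl_setopt($ch, CURLOPT_USERPWD, format_userpwd('neil', 'secret'));
```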

Next issue encountered: Delicious XML.  You can put line breaks in the description of a link which is placed into an attribute called “extended” in the XML API and which Zend calls “notes”.  Oh and the title is in a “description” attribute (legacy?).  Anyway, XML allows line breaks in attribute values but it also normalizes them (read: converts them to spaces).  I was relying on the ability to have breaks and now I wasn’t getting any.  For some reason Zend’s object is able to parse these out which makes me think they’re using their own XML parser (maybe SimpleXml allows this?) with the non-standard behavior of leaving line breaks alone.  I pondered my options for a long while before figuring out a nifty regular expression to re-insert line breaks just where I expect them.  This, I decided, was better than suddenly losing line breaks later when Yahoo normalizes their data internally and also writing special XML parsing to grab them.
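The normalization rule is easy to see with a small experiment.  This sketch assumes the SimpleXML extension is available and uses a made-up <post> element shaped like the attributes described above; it says nothing about how Delicious actually encoded its line breaks.  A line break typed literally inside an attribute value is turned into a space by the parser, while one written as a character reference survives.

```php
<?php
// A literal newline inside the attribute value (made-up sample data).
$xml = "<post description=\"Title\" extended=\"line one\nline two\"/>";
$post = simplexml_load_string($xml);
$literal = (string) $post['extended'];   // the parser normalizes the break to a space

// The same break written as the character reference &#10;.
$xml = '<post description="Title" extended="line one&#10;line two"/>';
$post = simplexml_load_string($xml);
$escaped = (string) $post['extended'];   // the break is preserved

var_dump($literal, $escaped);
```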

Whew!


CURLOPT_FILE Segmentation Fault

October 16, 2009

I just blew at least a half hour, probably more, tracking down the weirdest bug having to do with PHP, cURL, and scope.  The initial intent of my changes was to clean up the massive CurlHttpRequest::send() method and take advantage of CURLOPT_FILE to allow sending the output to a file.  So, you can now do this:

$http = new CurlHttpRequest('GET', 'http://neilobremski.wordpress.com/');
$http->send(false, "index.html");

That creates the file “index.html” containing the output of the request, successful or not.  The file contents are written indiscriminately by cURL, since you pass it a file pointer rather than a name.  So I may later add functionality to automatically delete the file if the request failed (i.e. a 200-series status was not returned).

Okay, so let’s talk about this “Segmentation fault” business that curl_exec() kept erupting with.  If you are trying to set CURLOPT_FILE, or any other stream-related option, then you may have seen this and eventually given up on figuring it out.  I don’t blame you, I just about rolled the helper method stuff back into my main method, because for some reason it worked there.  When I set the file pointer and then immediately called curl_exec() in the same method, and therefore scope (PHP is function scope only, I think), then it worked.  For some reason calling curl_setopt(CURLOPT_FILE, …) and curl_exec() in separate places kept blowing up with “Segmentation fault”.  WTF!?

It appears that cURL’s PHP wrapper holds onto the reference to the stream rather than the actual stream resource itself.  In my case, I created the file stream in the method and then passed that in with CURLOPT_FILE.  When the function finished, however, that variable went out of scope and got destroyed.  Since PHP’s cURL is holding onto that variable space, rather than the stream it’s pointing to (I know it’s weird, PHP references are not pointers), it no longer had a valid file handle.  My fix, then, ended up being a pass-by-reference parameter which I pushed the stream out on.  That way the variable didn’t go out of scope.  In my searching I seemed to find a bug number mentioning something like this, but now I can’t find it since FireFox history is abysmal (I’m using 3.0.X, is 3.5 better?).

Whew!

My other fix was to always set CURLOPT_WRITEFUNCTION after closing the file.  That way if cURL got executed again, it went back to my original behavior of writing the output to a function rather than a file stream … otherwise it would segfault again.  Okay, I’m done blathering, I just had to get that out of my system.  YAAARRGH!  I WIN!
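Another way to get the same two effects is to keep the stream in an object property, so it lives as long as the cURL handle does, and to switch back to a write callback before closing the file.  This is a sketch under assumptions, not the CurlHttpRequest class from the post: the class name CurlToFile and its methods are invented, and it assumes the cURL extension is loaded.

```php
<?php
// Hypothetical wrapper for illustration only.
class CurlToFile
{
    private $ch;         // cURL handle
    private $fp = null;  // open file stream, held here so it outlives any one method
    public $body = '';   // collects output when no file is attached

    public function __construct($url)
    {
        $this->ch = curl_init($url);
        $this->writeToBuffer();
    }

    // Point cURL at a callback that appends to $this->body.
    private function writeToBuffer()
    {
        curl_setopt($this->ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) {
            $this->body .= $data;
            return strlen($data); // tell cURL every byte was handled
        });
    }

    // Attach the file in one method ...
    public function attachFile($path)
    {
        $fp = fopen($path, 'wb');
        if ($fp === false) {
            return false;
        }
        $this->fp = $fp;
        return curl_setopt($this->ch, CURLOPT_FILE, $this->fp);
    }

    // ... and execute in another. $this->fp is still alive at this point.
    public function send()
    {
        $this->body = '';
        $ok = curl_exec($this->ch);
        if ($this->fp !== null) {
            $this->writeToBuffer(); // go back to the callback before the stream goes away
            fclose($this->fp);
            $this->fp = null;
        }
        return $ok;
    }
}
```

Calling attachFile() and then send() writes to the file; calling send() again afterwards fills $body instead, because the handle was pointed back at the callback before the stream was closed.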


PHP FTP: Recursive Listing

October 16, 2009

I’m taking a break from Mechanical Turk and temporarily turning my attention to FTP push and miscellaneous fixes.  PHP has a great set of FTP methods built-in and one in particular I wanted to abuse: ftp_rawlist().  This allows you to ask the FTP server for a recursive directory listing.  Unfortunately the raw in the function name is exactly what you’d think: the results are left completely up to you to parse!  The one constant is each line represents a discrete item such as a directory change or file entry.  The way FTP servers structure this information varies wildly between warring platforms, specifically Unix and Windows.

As it happens, I will be testing and deploying on both Unix and Windows web hosts and thus need to be able to support both out of the box for FTP push.  Most comments and sites only concentrate on the Unix/Linux side, so I decided to tackle Windows first.  I created a new file (ftp.php) in ObremSDK and scratched out FtpList().  Here’s a screenshot of the ad hoc test page using it:

Example of ftp_rawlist() followed by parsed version using my new method

My function returns an associative array (dictionary) where the filename is the key and another associative array for the file information is the value.  Since I’ve experienced PHP running out of memory on me, I made all the file information keys use only a single character (‘t’ = timestamp, ‘s’ = size or false if directory).  Amusingly on the key-end, I put the entire relative path name for the file.  This is necessary to use the filename as a key, but also takes up lots of string space if you have deep directories containing lots of files.

One of the annoying problems I had to solve was date parsing.  You’ll note the server is using an American date syntax and only two digits for the year.  This severely confused strtotime() so I had to code in my own little parsing tidbit to force it to be 4 digits.  I didn’t like having code make this kind of assumption, but I don’t really have a choice until the FTP server starts returning 4-digit years.  The way it’s set up should handle that, since I take the date as a chunk rather than as individual, numerical components.
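To make the format concrete, here is a rough sketch of parsing one line of a Windows-style listing into the path key and the ‘t’ / ‘s’ entry described above.  It is not the FtpList() code from the post: the function name, the regular expression, the year pivot (two-digit years below 70 become 20xx) and the choice to treat the server time as UTC are all assumptions, and unlike the post it splits the date into numeric components.

```php
<?php
// Hypothetical parser for lines such as:
//   10-16-09  03:21PM                 1234 index.html
//   10-16-09  09:05AM       <DIR>          thumbs
// Returns array($path, array('t' => timestamp, 's' => size or false)) or false.
function parse_windows_list_line($line, $dir = '')
{
    $pattern = '/^(\d{2})-(\d{2})-(\d{2}|\d{4})\s+(\d{1,2}):(\d{2})(AM|PM)\s+(<DIR>|\d+)\s+(.+)$/i';
    if (!preg_match($pattern, rtrim($line), $m)) {
        return false;
    }
    $year = (int) $m[3];
    if ($year < 100) {
        $year += ($year < 70) ? 2000 : 1900;   // assumed pivot for two-digit years
    }
    $hour = ((int) $m[4]) % 12;                // 12AM becomes 0, 12PM becomes 0 then +12
    if (strtoupper($m[6]) === 'PM') {
        $hour += 12;
    }
    $isDir = (strtoupper($m[7]) === '<DIR>');
    // Key is the full relative path with forward slashes.
    $path = ltrim(str_replace('\\', '/', $dir . '/' . $m[8]), '/');
    return array($path, array(
        't' => gmmktime($hour, (int) $m[5], 0, (int) $m[1], (int) $m[2], $year),
        's' => $isDir ? false : (int) $m[7],
    ));
}
```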

Another thing I decided was to translate path separators to forward slashes.  Windows FTP supports using forward slashes in commands, but its raw output is based on its own command-line system (e.g. just do a DIR /A and I believe you’re getting the same output).  It’s a painful thing that we’re starting to connect everything and having to deal with multiple styles of line-breaks (\n for *nix, \r for Mac pre OSX, and \r\n for Windows) and path separators (/, :, and \).  Can you imagine how much time would be saved in menial tasks from having one style and no legacy data?  Oh, I can dream, I can dream.

Abstracting the system type into just a regular expression was not a viable solution (even though everyone else does it that way), because the system type also determines hacky work-arounds like my aforementioned date fix.  Instead I will be creating a method per system type which is then executed using call_user_func().  These all have the same prefix to avoid confusion, same parameter signature, and return the same thing.  Currently I only have FtpListP_Windows_NT(), but next I’ll be writing FtpListP_Unix() (Linux/BSD servers use this signature too, at least on the systems I’ve checked).
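The dispatch could be sketched like this.  The two FtpListP_ function names come from the post, but their bodies here are stubs, and ftp_list_parse() is an invented name.  The sketch assumes the system type string comes from something like ftp_systype(), which typically reports values such as “UNIX” or “Windows_NT”; because PHP function names are case-insensitive, “UNIX” still resolves to FtpListP_Unix().

```php
<?php
// Stub parsers: same prefix, same parameter signature, same return shape.
function FtpListP_Windows_NT(array $lines)
{
    return array('parser' => 'Windows_NT', 'count' => count($lines));
}

function FtpListP_Unix(array $lines)
{
    return array('parser' => 'Unix', 'count' => count($lines));
}

// Hypothetical dispatcher: picks the parser from the system type string.
function ftp_list_parse($systype, array $lines)
{
    // Keep only characters that are legal in a function name.
    $func = 'FtpListP_' . preg_replace('/[^A-Za-z0-9_]/', '_', $systype);
    if (!is_callable($func)) {
        return false;   // no parser written for this system type yet
    }
    return call_user_func($func, $lines);
}
```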

On a final note, it’s entirely possible I’m duplicating code that’s already out there.  I did spend about 10 minutes on StackOverflow and Google trying to find someone who’d made a reliable method to use, but didn’t come up with anything.  I ignored Zend Framework entirely this time, because I’m working to remove any reliance I have on it — in recovery, overcoming high levels of OOP exposure.


Crowd Sores

October 14, 2009

I spent the last couple of hours melting faces with fireballs in Oblivion (well, mostly turning human foes against one another using illusion magic) to unwind from the brutal moderation experience I went through with Mechanical Turk.  Let me tell you the tale …

In order to streamline the FanSiter seed site creation, I stopped using their “batch” web mode (works just like a mail merge) since it is impossible to retry individual HIT’s.  Thus you get an all-or-nothing kind of deal or you mediate it by adding redundant HIT’s.  I have a contractor dedicated solely to fixing all the data that got fragmented as a result, and thus the whole thing backfired rather magically.  Sure, there are tons of individual sites, but you only see a portion of the bio’s I paid for.

So I spent significant time and development effort to learn the API and write a few PHP classes to help me make use of it.  They work great and after a bit of mind-numbing testing I unleashed new code which would spawn HIT’s based on Landing-only entries in the plat.  This did its job as expected, but certain parts of my moderation tools were broken and required repairs:

  • Stray ampersands made Blogger puke, so approved HIT’s got disposed without a successful post being created.  I updated the nifty RepairTags() function but then didn’t add code to call it.  Face palm!
  • Mid-way through modifications made against live I changed the moderate() method to return a string status rather than a boolean, but I didn’t correct the way it decided which status to return so it was always approving.  Whoops!  Set a bad precedent and I had to delete terrible data that I’d paid for.
  • Adding another assignment when rejecting a previous one seemed like a good idea, but when people weren’t “getting” a particular name (Orange Avenue had 3 rejections, because people wrote about an album by the same name — WTF?!  And for Tettix someone wrote a blurb on the Cicada bug instead of the musician!) I had to manually go in and blow away the whole HIT.  If you’ve ever used Amazon’s “Manage HIT’s Individually” then you know how painful this is.

There were probably other, smaller issues, but those three got me frustrated.  Add to that the quality of the data and I started to get really irritated.  I had kept the wording simple but specific in the HIT instructions and didn’t duplicate much between its description and the question text.  Unfortunately, it appears many ignore the description since it’s not very noticeable, and there was that confusion of writing about a bug or an album rather than a celebrity (despite that being in the title).  I got a lot of completed assignments that were either literal copies of Wikipedia passages or only slightly re-arranged.  There’s a tool called “Google” which is amazing at helping you find duplicate or plagiarized content.  I guess when you spend very little time on something, you don’t care if you get away with it or not.

Having a length requirement (200+ words) allowed me to reject poorly written stuff on that point alone, but then that ended up being a sticky wicket.  I failed to give a reason on a few of these, since I assumed the results being so bad would be apparent to the writer, and ended up with this fun email conversation:

Jim: may i ask why you rejected this?

Me: You’ll have to provide the contents of the HIT, the software I’m using disposes them once they’re rejected.  Two possibilities that come to mind based on the past rejections: too short (requirement was 200+ words) and/or copied paragraphs from Wikipedia or other sources.

Jim: all were over 200 words and nothing was taking from anyone else. I have found that this site is just to good to be true, shame on me for trusting you. I have reported you to amazon.

Me: That’s fine.  I found this, is it yours?

Devon Aoki is an actress and model, she is best known for her roles in Rosencrantz and Guildenstern Are Undead as Anna, Mutant Chronicles as Cpl. Valerie Duval, War as Kira, DOA: Dead or Alive as Sumi, Sin City as Miho, D.E.B.S. as Dominique and 2 Fast 2 Furious as Suki.

Devon Aoki has been doing television and movies since 2003. She has also done some modeling in 2006.

Devon Aoki is a beautiful charming actress that we see going places in the future.

That’s 84 words.

Jim: yes

He didn’t stop there, though.  Here’s another email thread about his one other assignment.

Jim: why did you not pay me? I did the work in my own words, i guess you are just another scam site, like most of the others on here, I will report you to amazon, as I did what you asked of me.

Me:

Jesse Eisenberg is best known for his roles in Kick the Can, The Social Network, Holy Rollers, Camp Hope, Some Boys Don’t Leave, Beyond All Boundaries, Solitary Man, Zombieland, Adventureland and The Hunting Party.

Jesse Eisenberg was born 5 October 1983, New York City, New York, USA.

Jesse Eisenberg has been in the Television and Motion Picture industries since 1999.

Jesse Eisenberg is currently working on 4 movies.

Our outlook for Jesse Eisenberg is watch out he is already a rising star, and we believe anything he touches will turn to gold.

92 words.

Ugh, those are just awful seed bio’s and then he lashes out to defend them. On the one hand I have the right to approve/reject whatever I want, those are the rules of Turk, and I can do it based on quality alone.  On the other I only set the award at 0.49 (0.50 shows up as 0.5) and that may be part of the problem. What I’ve found, though, is raising or lowering that amount still gets you the same amount of cruft to sift!  I did get a few really awesome blurbs today, but all in all probably approved only 20%.

I had wanted to avoid requiring qualifications or a quiz, but now I believe it’s going to be necessary. I have thought a lot about whether or not to just go to a singular writer or writing firm, but those just don’t scale the way I’d like. Plus you get less variety in the style since it’s the same person doing all of them. I like the idea of letting in random people, who have down time to make some bucks rather than screwing around on the Internet. The glaring problem here is the current crap I’m getting, both in content and sass.

So … what to do?  Here are my thoughts.  I’m going to add some qualification requirements, which is more research and coding (argh!).  Then in order to make it a worthwhile pain, I will set the award to something in the neighborhood of $5.  Thus you can flip burgers or do blurb essays on Turk.  Will it work?  I don’t know, but I can’t continue the way I have.  My sanity is at stake.


Silence PHP’s Magic Quotes

October 12, 2009

Here’s a nifty snip of script you can prefix your PHP with to reverse the effects of the “magic” quotes. Thankfully, PHP 6 will no longer have these awful things, but in the meantime:

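// get_magic_quotes_gpc() no longer exists in PHP 8, hence the function_exists() guard.
if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc()) {
    // Undo the automatic backslashes on every GET and POST value.
    foreach ($_GET as $nm => &$s)
        $s = stripslashes($s);
    unset($s);  // break the reference so later writes to $s don't touch $_GET
    foreach ($_POST as $nm => &$s)
        $s = stripslashes($s);
    unset($s);  // same for $_POST
}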

You may have seen the effects of magic quotes and not even known it. A script processing a form is giving you quotes where every apostrophe and/or double-quote character is preceded by a backslash. The original idea, and who knows how the language designers got it approved (no QA?), was to prevent newbie programmers from inadvertently creating SQL injection security holes. What it ended up doing was open up worse problems as noobs struggled to reverse the effects without quite knowing what they were doing.
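One gap in the flat loop: form fields named like tags[] arrive as nested arrays, and stripslashes() only works on strings.  A recursive variant covers that case.  The function name strip_slashes_deep() is made up for this sketch.

```php
<?php
// Hypothetical helper: strips backslashes from a string, or from every
// string inside a (possibly nested) array, preserving the keys.
function strip_slashes_deep($value)
{
    return is_array($value)
        ? array_map('strip_slashes_deep', $value)
        : stripslashes($value);
}

// Usage on servers where magic quotes are still switched on:
// $_GET  = strip_slashes_deep($_GET);
// $_POST = strip_slashes_deep($_POST);
```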


PHP Sessions

October 8, 2009

I just spent a half hour playing with PHP5’s sessions and would like to disperse some knowledge.  My first question …

How can you check if a session has been started?

session_id() will return an empty string (which evaluates as false) if session_start() has not been called.  However, if you’ve closed the session with session_write_close() / session_commit() and then re-started it, the old ID will be there.  So except for detecting that session_start() has never been called, you can’t reliably tell at all.

Are session_start() calls reference counted (pushed and popped) ?

No, they aren’t ref-counted and subsequent start requests will result in warnings.  Likewise calling session_write_close() multiple times does nothing, but it also doesn’t generate a warning.  This means that unless you’re in the entry-point of a script, you shouldn’t be calling session_start() without first checking for session_id() to see if it’s been started already.  Of course, since that doesn’t always work, it’s a crap shoot!
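On PHP 5.4 and later there is a more reliable check than session_id(): session_status() reports whether a session is currently active, including after a close.  A small sketch, with made-up helper names:

```php
<?php
// Requires PHP 5.4 or later for session_status().
// Hypothetical helper: true only while a session is open right now.
function session_is_active()
{
    return session_status() === PHP_SESSION_ACTIVE;
}

// Hypothetical helper: start the session unless one is already open,
// which avoids the warning from a second session_start() call.
function session_start_once()
{
    return session_is_active() ? true : session_start();
}
```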

Why open and close the session manually and/or multiple times at all?

In most PHP installations a session is implemented as a file.  That file is located based on the session identifier which is probably the file name itself.  So starting a session opens that file for random access (read/write) and committing it causes it to be closed.  Thus there are inherent performance issues with auto-starting sessions and simply “leaving them on and open all the time”.  Your script will close sessions when it ends, but why not open and close only if you need to read/write?  Not doing this also causes contention on the user side with sites where resources are generated by PHP.  If each request has to wait for the last to complete you effectively get a slow, serialized experience.
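The open-late, close-early idea can be sketched as two small helpers.  The names are invented, the write helper assumes PHP 5.4+ for session_status(), and the read helper assumes PHP 7.0+ for the read_and_close option of session_start(), which loads the data and releases the file straight away.

```php
<?php
// Hypothetical helper: hold the session file open only for the write itself.
function session_set_value($key, $value)
{
    $opened = (session_status() !== PHP_SESSION_ACTIVE);
    if ($opened && !session_start()) {
        return false;
    }
    $_SESSION[$key] = $value;
    if ($opened) {
        session_write_close();   // release the file so other requests aren't blocked
    }
    return true;
}

// Hypothetical helper: read a value without keeping the session open (PHP 7.0+).
function session_get_value($key, $default = null)
{
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start(array('read_and_close' => true));
    }
    return isset($_SESSION[$key]) ? $_SESSION[$key] : $default;
}
```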