Data Science, how hard can it be?

02/07/2014

Let's write a poem, and let's do this thing the programmer way. I've got this big juicy dataset of poetry I recently acquired from, err, somewhere or other. Let's whip out the statistics and see what we can do with it. Just a little data science. How hard can it be...

But where to start?

How about the length? Let's be honest, no one likes a long poem. Just stick it in a novel if you've got that much to say. But we have the data now. Let's find out exactly how long it should be...

word count

Okay, so maybe long poems aren't such a drain. It doesn't seem like there is much correlation between poem length and rating. Instead the rating measure just gets more unstable as the amount of data drops.
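If you want to poke at this yourself, here is a minimal sketch of the kind of bucketing that would show it. It assumes the poems come as (text, rating) pairs, which is a made-up layout for illustration. The names `mean_rating_by_length` and `bucket_size` are hypothetical too, not anything from the actual analysis.

```python
from collections import defaultdict


def mean_rating_by_length(poems, bucket_size=100):
    """Group poems into word-count buckets and average the ratings.

    `poems` is assumed to be an iterable of (text, rating) pairs.
    Returns {bucket_start: (mean_rating, poem_count)}.
    """
    buckets = defaultdict(list)
    for text, rating in poems:
        word_count = len(text.split())
        # Round the word count down to the start of its bucket.
        bucket_start = (word_count // bucket_size) * bucket_size
        buckets[bucket_start].append(rating)
    return {
        start: (sum(ratings) / len(ratings), len(ratings))
        for start, ratings in sorted(buckets.items())
    }
```

Returning the poem count next to the mean is the useful bit: a bucket with only a handful of poems in it will bounce around a lot, which is one way to see the instability at the long end.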

Either way, I'm still giving a fat zero to the longest poem in the data set, "Ashtaroth - a Dramatic Lyric" by Adam Lindsay Gordon, which is an immense 17,680 words long. Literally nobody has time for that.

I'm liking the look of the other side of the spectrum. The shortest poem in our data set is "Reflection On A Wicked World" by Ogden Nash, consisting of just the following three words:

Purity
Is obscurity.

Nice. Poetic angst at its finest.

So length doesn't seem to matter. We'll keep it short because, y'know, can we really be bothered?

How about line length? In English, weren't we taught all about iambic pentameter, getting a nice rhythm to the poem and all that? How long should each line be, and does being iambic (having a du-DUH rhythm) really matter to how good a poem is?

According to my English teacher, if you're going for it, lines that are 10 syllables long with a nice iambic rhythm should be pretty sweet.

line length

Okay, flat as a pancake. Screw that. Let's take a different approach.

Let's decide what to write about first. Topic is surely more important than length. Should it be Love? Ambition? Existentialism? Even... Data Science? (One can dream.) This data set is tagged with topics for each poem. Let's check out the most popular topics and their average scores to get an idea - any idea.

topics

Wait what? So the best poems are about mentors, moms, chocolate, television, and football? And since when was "concrete" a topic for poetry anyway? I found over 25 poems in the set with that label. I'm no English major, but I've seen some concrete in my time, and it has never inspired me to write a beautiful sonnet.

I'm starting to wonder if this is really a good dataset. Or more specifically, if the rating and topic tagging, which appears to have largely been done by bored American housewives (with a love of concrete), is really going to tell me anything insightful.

In fact, I think I've come up with the perfect poem already - a maternal haiku. Let's stop here.

mom eating chocolate
football on television
concrete between us

Pretty deep, eh? I know.

A special mention goes out to the topic "depression", the only topic out of several hundred (including "concrete") which scored below the default 5.0 score. I guess you'd better face it. The people don't want to hear about your troubled childhood. Go back and write something about candy, or America. Or do you just hate freedom?!

So I guess that isn't going anywhere. But perhaps topic is linked to other things. My teacher always told me that iambic pentameter was like a heartbeat. It expressed passion and power, love and joy. Let's take a look and see if she was right...

iambic

Ouch. Okay, that was clearly a write-off.

Enough on the topics. What about rhymes? Everyone loves a sweet rhyme. The only thing that beats a sweet rhyme is a sweet pun, and iambic-inning to get sick of this poetry business. Eh. Eh.

Let's take a look at how the rhyme schemes stack up.

rhymes

Wow. I mean I always liked him, but I didn't expect my main guy ABCAACCADEADEA, the greatest rhyme scheme ever, to be right up there. Not to mention my buddy, ABBACCCCCDDCCC. Five rhymes in a row, that is just something special. That's like - you can't just step up and write that - it takes something special inside.

But seriously, it doesn't really seem like rhyme schemes have any sensible correlation to rating, even when normalized to take into account the number of poems with that scheme. But it is nice to know that people slightly prefer ABCB to ABCC ... I guess.
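For what it's worth, one common way to stop a scheme with a single lucky poem from topping the chart is to pull each scheme's average towards the overall average, more strongly the fewer poems it has. This is only a sketch of that idea, not necessarily the normalization used for the chart above. The (scheme, rating) input layout and the names `damped_mean_by_scheme` and `prior_weight` are assumptions for illustration.

```python
from collections import defaultdict


def damped_mean_by_scheme(rated, prior_weight=5.0):
    """Average rating per rhyme scheme, shrunk towards the global mean.

    `rated` is assumed to be an iterable of (scheme, rating) pairs.
    `prior_weight` acts like that many imaginary poems rated at the
    global mean, so schemes with few poems stay close to average.
    """
    rated = list(rated)
    global_mean = sum(rating for _, rating in rated) / len(rated)

    ratings_by_scheme = defaultdict(list)
    for scheme, rating in rated:
        ratings_by_scheme[scheme].append(rating)

    return {
        scheme: (sum(ratings) + prior_weight * global_mean)
        / (len(ratings) + prior_weight)
        for scheme, ratings in ratings_by_scheme.items()
    }
```

With a large `prior_weight`, a one-off like ABCAACCADEADEA needs a lot more than one fan to climb the rankings. With `prior_weight=0` you get the plain per-scheme mean back.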

Clearly we're not going to learn how to write a super sweet poem this way.

But I have one final awesome idea.

Let's get the data to write the poem for us. Let's roll out the big guns.

It's Markov Chain time.

Oh yeah. This is it. This is the deep-o-tron.
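For anyone who hasn't met one: a word-level Markov chain just records which words follow which in the source text, then walks those records at random. Here is a minimal sketch of a first-order version. It is an illustration under my own assumptions (lowercased, whitespace-split words, one chain step per word), not the actual deep-o-tron, and `build_chain`, `generate_line`, `START` and `END` are hypothetical names.

```python
import random
from collections import defaultdict

# Sentinel objects marking the start and end of a line. Using plain
# objects means they can never collide with a real word.
START = object()
END = object()


def build_chain(lines):
    """Map each word to the list of words seen directly after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.lower().split()
        if not words:
            continue
        chain[START].append(words[0])
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
        chain[words[-1]].append(END)
    return chain


def generate_line(chain, rng, max_words=12):
    """Random-walk the chain from START until END or max_words."""
    word = rng.choice(chain[START])
    output = []
    while word is not END and len(output) < max_words:
        output.append(word)
        # Repeated followers appear repeatedly in the list, so common
        # transitions are picked more often.
        word = rng.choice(chain[word])
    return " ".join(output)


if __name__ == "__main__":
    corpus = ["a tree", "i see", "all felt", "why"]
    chain = build_chain(corpus)
    rng = random.Random()
    for _ in range(4):
        print(generate_line(chain, rng))
```

Feed it a few thousand poems instead of four tiny lines and you get the grammatical soup that makes the output sound profound.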

they mow to what was happy folk of that second
there
this trivial circumstance to dust mixed up from
over the green field of life is the strife
i wronged and aye and not for the cataract

And never for the cataract!

before that i tell
were stirred
to the image of green

the shamrock the flowers
all in furrowed mountains
in a violin in a merchant who bore
come when we came and
the sun
i said earth are spray
the earth my father will
the clash
oh music
the rickety ferry for many a phantom forest

Beautiful.

A personal favourite - this epic tale - of feeling, and questioning.

a tree
i see
all felt
why

And this saucy verse about perfumed monkeys and bridal ass.

we sit and rather one leap out the ass in her bridal bower
by a joy and holy ground
suddenly she young and baulked his body give medicine on the earth
my softest voice ill fares the warm our southern line
now mere profession noble the perfumed monkeys on yonder house can it in the moon didst not moonbeams that wretched dead
conquers every birth
the last has been at whose lucid
nothing but that is the blast
you think not distant speaking ask ye your languid arms to their gear of

From which I am taking my new exclamation of surprise, "by a joy and holy ground!".

Unfortunately I'm kinda stuck now. So I'll leave you with my final submission. But it's been fun, folks - remember to tune in next time for "Data Science, how hard can it be?".

This poem I call "A Pale English Man on Holiday".

a man
your victory is all its fill her sobbing with happiness
three days passed and brown