Monday, December 27, 2010

As the Sands of the Hourglass...

This is a nice graph showing where people came into and went out of my Gmail chat life. That gray line is the first day I used the service, which as you'll note is NOT in the middle of 2006 (tee hee, previous post. tee hee.)


Personally, I think this is an incredibly telling graph. It's taking, again, my top nine gchat buddies and showing how the volume of our interaction changes. For example, KRS shows a very solid line at the beginning, showing frequent interaction at the beginning of this period, but tapers off. RPM starts off as a very casual gchat friend, but gains in intensity near the end. CRS is an interesting latecomer; she's my wife. She was given a Gmail account by ASU when she was accepted to the doctoral program, and I switched her personal email over to Gmail as well. We didn't start chatting until we were essentially engaged.

You'll note that she still makes it into my top nine. This is because of the size of those early chats. Unfortunately, this graph is not weighted; you get one dot every day we converse, no matter how long or short. I'm still learning R, and am trying to find out exactly how one goes about changing plotting colors according to the values of a variable. (Rather, I know how to do this for larger symbols, but not for dots).
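As far as I can tell, the general trick is to compute one color per point from the variable and then hand the whole list of colors to the plotting call. Here's a minimal sketch of that idea in Python rather than R (Python is just my pick for illustration). The function name `shade_for`, the gray scale, and the sample chat sizes are all made up.

```python
def shade_for(value, lo, hi):
    """Map a value in [lo, hi] to a hex gray: small -> light, large -> dark."""
    if hi == lo:
        frac = 1.0  # only one distinct value, so draw it at full darkness
    else:
        frac = (value - lo) / (hi - lo)
    frac = min(1.0, max(0.0, frac))   # clamp anything outside the range
    level = round(220 * (1 - frac))   # 220 = light gray, 0 = black
    return "#{0:02x}{0:02x}{0:02x}".format(level)

# Hypothetical data: one (day, chat size in bytes) pair per dot.
chats = [("2010-01-04", 1200), ("2010-01-05", 48000), ("2010-01-09", 9500)]
sizes = [size for _, size in chats]

# One color per dot, in the same order as the dots themselves.
colors = [shade_for(s, min(sizes), max(sizes)) for s in sizes]
```

A plotting library that accepts a per-point color list could then draw big chat days dark and small ones light, which would give the weighting this graph is missing.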

A note on the three-letter codes. In the interest of privacy, they are generally not the current complete initials of the person listed. However, in order to make it readable for me, they are pretty close. If you're listed here, and you want me to change the initials to protect your privacy on the internet, please let me know. I'm not Mark Zuckerberg, you know.


Saturday, December 18, 2010

The Size of Shakespeare, or, A Comedy of Errors

So.

It's been a little while, and I haven't neglected you, three blog readers. I've been working on a project in order to blow your collective mind, or at least give it a little something to chew on.

Specifically, I have delved deep into the realms of my Gmail chat logs and have begun to discover: data. Oh man, the trip this has been. And it's not over. There will be charts, there will be graphs, AND! there may be PODCASTING.

I intend to give you the tidbits I have learned in chewable form, piece by piece. Today's episode is: why you should examine your data thoroughly before you make any conclusions.

I wrote a Perl script to turn my wad of uncooked data into a delicious patty; it returned the size of each individual chat file along with other important stats. In the statistical scripting language R, I discovered that the sum total of chat content produced was, in a word, ridiculous. I did some calculations and made a graph that looks a little something like this:


Yes, it appeared that even just my most chatty friend had produced with me a larger corpus of work, bytewise, than Bill Shakespeare himself (the Bard wrote about 5 MB worth). Sweet mercy. Note: I have been using Gmail's Chat client, and occasionally Google Talk, since the former launched in the middle of 2006.

The problem with this graph is that it's not actually accurate. When I examined the files a little further, I realized most of them looked like this:



Uh oh. So, of course, I had to write another Perl script. I found that code accounted for roughly 77 percent of the content created by Gmail chat. Here's the revised graph:
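For the curious, the measurement itself is simple. Here's a minimal sketch in Python (not the Perl I actually used), assuming the "code" is HTML-style tags wrapped around the chat text. The regular expression, the function name `markup_fraction`, and the sample line are illustrative guesses, not the real script or a real log.

```python
import re

# Assumption: the "code" is anything between angle brackets.
TAG = re.compile(r"<[^>]*>")

def markup_fraction(raw):
    """Return the share of bytes in `raw` that is markup rather than text."""
    total = len(raw.encode("utf-8"))
    if total == 0:
        return 0.0
    text = TAG.sub("", raw)  # strip the tags, keep what people typed
    return 1 - len(text.encode("utf-8")) / total

# Hypothetical log line, made up for this example.
sample = '<div class="msg"><span class="who">me:</span> hi there</div>'
print(round(markup_fraction(sample), 2))
```

Run over every chat file and summed up, a number like this is what tells you how much of the pile is actually conversation.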


Tada! I think the most interesting development from this graph is quite simply that, even with the code stripped from the chats, a few of my friends and I have produced an entire corpus of text.

In the next few months, I'll look at some of the sociological implications of the date and size data, and then all the way into the textual aspect of the transcripts. Analyzing the text itself should be tremendously interesting.

[Note: I changed some things as other pursuits have prevented me from diving into the actual text. Someday...someday.]

Thursday, November 18, 2010

Introducing the "Cite Your Source" button

Dear Following Few—

I made a web badge! Take the following HTML and plop it in a blog or other webpage—pass along the fact-checking goodness.






Monday, November 15, 2010

That R-pentomino Is So Hot Right Now

Why virality might be a real-life application of the least competitive game in the world.

Ars Technica's Casey Johnson wrote a stellar article about game theory being a more apt explanation of viral media than actual virology. The article points out that the epidemiological approach "is fitting for some cases, in others it's an oversimplification—a person's exposure to a trend doesn't always guarantee they will adopt it and pass it on." Essentially, this is the beginning of the explanation for why websites and gadgets succeed, while other, similarly featured ones fail.

The researchers from the Ars Technica article ran a couple of models based on game theory principles. The first assumed that the likelihood that a new computer application would be adopted by any given person was directly proportional to the number of friends in said person's network that had adopted it, and that knowledge of friends' adoption or non-adoption was 100%. This doesn't explain much: it creates a world with an infinite barrier to entry, but a preternatural tendency to growth. The second model denied absolute knowledge of friends' choices, and added a "try-it-out" rule: 100% adoption for nodes that had 0% knowledge of friends' tastes.
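To make the two rules concrete, here's a minimal sketch in Python. It's my own reading of the summary above, not the researchers' code: I'm assuming "proportional to the number of friends" means the fraction of a person's friends who have adopted, and the function names and the three-person network are made up.

```python
def model_one(friends_adopted, friends_total):
    """Model 1: perfect knowledge; chance of adopting = share of friends who adopted."""
    if friends_total == 0:
        return 0.0
    return friends_adopted / friends_total

def model_two(known_adopted, known_total):
    """Model 2: partial knowledge, plus the try-it-out rule."""
    if known_total == 0:
        return 1.0  # knows nothing about friends' tastes, so tries it out
    return known_adopted / known_total

# Hypothetical network: each person maps to a list of friends.
network = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}

def chances(adopters):
    """Model 1 adoption chance for every person, given the current adopters."""
    return {
        person: model_one(sum(f in adopters for f in friends), len(friends))
        for person, friends in network.items()
    }

print(chances(set()))   # nobody has adopted yet
print(chances({"a"}))   # one early adopter
```

With no adopters, every chance under the first model is zero, which is the infinite barrier to entry. Seed a single adopter and that person's friends jump to certainty, which is the preternatural growth.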

This was starting to sound a little like Life. Not the cereal, nor the Zen police procedural, but the game. John Conway's Game of Life is a zero-player game. I seriously won't attempt to beat Wikipedia at explaining it (skim it now, then come back), but suffice it to say that outcomes are both: absolutely predictable by machines who know the rules and can compute them on the fly, and terribly unpredictable and surprising to those who don't know or can't, you know, do several hundred computations in a few milliseconds. Patterns that seem small and silly may spread for generations and generations, and intricate designs might collapse in just a few. (Play Life here.)
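If you'd rather poke at it than read about it, here's a minimal sketch of one generation of Life in Python. The rules are the standard ones: a live cell survives with two or three live neighbors, and a dead cell comes alive with exactly three. The function names and the set-of-coordinates representation are my own choices for illustration.

```python
from collections import Counter

def step(live):
    """Advance a set of live (row, col) cells by one generation."""
    # Count, for every cell next to a live cell, how many live neighbors it has.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {
        cell for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

def population_after(live, generations):
    """Run the board forward and report how many cells are alive."""
    for _ in range(generations):
        live = step(live)
    return len(live)

# The R-pentomino: five cells, famous for sprawling far beyond its size.
R_PENTOMINO = {(0, 1), (0, 2), (1, 0), (1, 1), (2, 1)}

print(population_after(R_PENTOMINO, 100))
```

Swap in other starting sets and other generation counts to watch small patterns spread and tidy-looking ones collapse.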

It's that unpredictable propagation that makes Life interesting. And while the rules of the game are surely different from the much more complex rules of social marketing, it stands to reason that a few things are similar: it's more important who knows about your product/website than how many of them there are to start off with. If people that people trust (read that phrase again) know about your content, so much the better. But if the social networks of your early adopters can serve to propagate your message to other widely-trusted individuals, sounds like you have a really solid start.

There are HUGE amounts of conjecture in this one little post. We of course have no clue what the rules to the idea-passing mechanism are, how to determine who the starters for your viral marketing plan are, or what "special sauce" makes an idea likely to be passed. Memetics has largely failed in this regard; future research is desperately needed here.

Thursday, November 11, 2010

Conflictinator Alert - Veterans Day 2010

al-Google Veterans Day

So, here's a tempest in a teapot: an Associated Content post about Google's Veterans Day logo that claims that the 'e' is actually a crescent of Islam. I'm not sure exactly where this ends, but it's possible the author actually believes the letter 'e' is a secret Muslim. Just to be safe, let's add everyone with an 'e' in their names to the No-Fly List.

Of course, when you bait the conflictinator trolls, they inevitably bite. HuffPo's response, of course, is to run with the AC author and claim that there's a widespread backlash about the logo. In their crazy, polarized view of the universe (perhaps fostered by spending too much time on the internet), the extreme right is one step away from besieging Mountain View with assault rifles, and maybe swastikas.

Way to contribute, guys.

All the President's Tax Cuts

Speaking of HuffPo, here's the title: White House Gives In On Tax Cuts. Here's the article (warning: contains serious hedging and low semantic density). Finding David Axelrod's statement that the president actually favors the extension of the tax cuts is hard, but finding anything that sounds like actually "giving in" is like playing Where's Waldo—with a Jackson Pollock painting.

Of course, the Atlantic and a few other outlets took this and ran with an "Obama gives in" type of story.

Wednesday, November 10, 2010

The "Cite Your Source" Project

A little experiment.

As you might have been able to tell, I've been having difficulty finding time to blog this last week or so. (I'm working on other writing projects right now.) I've been thinking about The Problem of Information a lot, and I think I've come up with a short follow-up. It's a little social experiment, and I think it will be interesting to see if it catches on.

We all participate in online communities, whether it be in the comments section of a news website or blog, on Twitter, or just on Facebook. A lot of our arguments work like the discussion on major news outlets: we throw out statistics and other supporting evidence without citing our sources.

As you well know, these stats are not necessarily true, but by and large those who agree (and many who disagree) with the point being made never question the factuality of this data.

I propose that we start. Right now. I know it will definitely make you annoying to people, but I would like to encourage everyone here to respond, at least once, to an online claim made without citing a valid source of evidence, with a polite request for citation.

It would be as simple as: "That's an interesting figure. Would you mind telling me where I can go to verify it?" or "I'm not saying I disagree with your point, but I'd like to know how I can verify that fact." You don't have to be outright contentious about it. In fact, it's probably better if you're not. People don't like having their comment or FB post ripped apart.

If everyone started requesting citation of valid sources even a few times a week, it would go a long way toward a healthier data culture. Thanks.

PS: This page will give you a web badge you can post if you like.

Monday, October 25, 2010

The Chinese Language is the Deep Web

Reading Nicholas Kristof's post "Liu Xiaobo and Chinese Democracy", about Mr. Liu's recent Nobel Peace Prize, I saw a passage that stood out, not only for its content, but also for the offhand way in which it was presented:

Today, Liu presumably doesn’t know that he has won the prize, and the Chinese government is trying to censor the news. But China is changing and censorship no longer works so effectively. It can ban mobile phone users from texting the characters for his name, but young Chinese are smart enough to use substitute characters.

Assuming this actually is the case, it means that within the Chinese languages (and it's clear that they are separate languages, not dialects of one overarching, crazily heterogeneous Chinese language) is a hidden world of possible ideogram-meaning combinations, connected by sound. Here's how that would work:

Every Chinese character represents a word. (Linguists: I know there are exceptions. Thanks.) For example, the word for "work" is 工, pronounced "gong" with a high, steady tone. The word for "attack" is 攻, also pronounced "gong" with a high, steady tone. The word for "supply" as in "power supply" is 供, also "gong" with a high, steady tone. So on with the words for "official business", "palace", and "bow" as in "bow and arrow".

Right now, the censors at Great Firewall HQ (actually called the Propaganda Department; I kid you not) are poring over blogposts and texts and other electronic content, finding subversive messages and stamping them out like bugs. Now, I imagine that a bit of this is done automatically, by keyword, and a great deal more is done by a large government department, full of the average office assortment of flunkies, middle managers, angry bosses, and the ennui that comes along with this setup.

Now imagine an undercurrent of blogs that don't seem to make sense at first glance. They bring up no poisonous keyword hits. They carry no familiar subversive slogans. But for those who would read them aloud, they transfer hopeful messages of democracy, commentary on the Chinese political situation, and perhaps even plans for meetups and other events.

This sound-meaning correspondence is much like what serious internet people call the "deep web". The deep web consists of all the data on the Internet that's not directly accessible to the average end user of a search engine. Deep web data is significantly more voluminous than surface web data. From the Wikipedia page:

Deep Web search reports cannot display URLs like traditional search reports. End users expect their search tools to not only find what they are looking for quickly, but to be intuitive and user-friendly. In order to be meaningful, the search reports have to offer some depth to the nature of content that underlie the sources or else the end-user will be lost in the sea of URLs that do not indicate what content lies underneath them.

By moving context outside of the scope of these messages of Chinese democracy, writers would easily circumvent any mechanical attempts at censorship. Certainly, it's not perfect, but even in a worst-case scenario, this practice could burden the Propaganda Department with the need for more human censors.
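Here's a toy version of the whole idea in Python, just to show why a keyword filter misses a sound-alike swap. Everything in it is a sketch under my own assumptions: the pronunciation table covers only the three "gong" characters from earlier, the banned list is hypothetical, and a real censor's filter is surely more elaborate than a substring check.

```python
# Tiny hand-made pronunciation table (illustrative only).
# The trailing "1" stands for the high, steady tone.
PINYIN = {"工": "gong1", "攻": "gong1", "供": "gong1"}

def keyword_filter(message, banned):
    """Mechanical censorship: flag a message containing any banned string."""
    return any(word in message for word in banned)

def read_aloud(message):
    """What a reader hears: each character replaced by its sound."""
    return " ".join(PINYIN.get(ch, ch) for ch in message)

def substitute(message, swaps):
    """Swap flagged characters for same-sounding ones."""
    return "".join(swaps.get(ch, ch) for ch in message)

banned = ["攻"]                              # hypothetical banned keyword
original = "攻"
disguised = substitute(original, {"攻": "工"})

print(keyword_filter(original, banned))      # is the original caught?
print(keyword_filter(disguised, banned))     # is the disguised version caught?
print(read_aloud(original) == read_aloud(disguised))
```

The disguised text never matches the banned list, yet it sounds identical when read out. Catching it takes a human reading aloud, or a filter that knows pronunciation as well as spelling.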