Pixels & Pilcrows

Wednesday, January 5, 2011

The Information Problem

We have too much content, not enough information, and surprisingly little data.

The great futurist Alvin Toffler predicted a world with relevant data so profuse that information overload would be the inevitable result. He was wrong. The 21st Century does not have a problem with information overload. It has a problem with Content Overload.

You might think that I'm splitting semantic hairs in order to make a polarizing statement, but perhaps it would help my case to point out that the surfeit of content on the internet is of varying informational value. Quite a bit of it is opinion, and much of the stuff that is purportedly informational is of questionable truth value.

It's important to distinguish between content, information, and data. I'm not suggesting that there's anything inherent in the words themselves that usefully separates them, but there are clearly three different phenomena among the stuff we take into our brains.

Content is what I'll call all print and digital "stuff" in general. This definition of content would include movie reviews, lolcats, sports scores, et cetera. Information is content which can be assigned a truth value. That gets tricky, because an opinion piece is clearly not information in and of itself, but may contain information (or misinformation). Data is a special kind of information that appears in tabular or numeric form. The universe of Content and Data can be seen as a scale that runs from the lowest density of fact to the highest.

Data is not always foolproof, as Charles Seife identifies in his data-in-journalism analysis, Proofiness. A lot of figures are either half-right, made to appear to support erroneous conclusions, or just plain made up. Of course, that makes any informational statements based on these data unsupported, and opinions formed around this information, misinformed. To hear Seife tell it, there are a lot more errant conclusions out there than accurate ones--which leads me to believe the following:

Nearly everyone is wrong most of the time.

The most logical solution to being wrong is to check facts to make sure they are accurate before basing opinions and decisions on them. If this were possible, it would certainly make for a much better world. Unfortunately, verifiable fact is hard to come by. In many cases, scientific studies have not been done (or cannot be done). Worse, sometimes results of different studies conflict. Exercise science and nutrition seem especially prone to this: one month, authorities say it's essential to focus mostly on cardio, the next, weights are the thing. Fat used to be the killer, now carbs are. Any attempt to be consistently right, even most of the time, is probably doomed to failure.

Unfortunately for us, the number of decisions we have to make every day appears to be increasing, from financial choices, like bank selection and the use of credit cards, to consumer choices about everything from insurance to running shoes. Further, it seems as though the options we have are also increasing, adding even more difficulty to the task of human decision-making, and making the human life experience a multi-decade pratfall.

As a shortcut to making decisions, we tend to lump like decisions together, as a pattern. We may then seek to explain these patterns with a narrative. For example, my wife loves to shop with coupons. Her decision-making pattern is that if we don't absolutely need a product right away, she doesn't buy it at full price, or even close, ever. The narrative behind this pattern is the idea that companies engage in promotional deals to catch unwary shoppers off their guard, but that by putting in a little extra effort, you can subvert these deals to make your shopping exceptionally cheap.

The progression in my wife's example is from individual decisions, to patterns, to explanatory narratives. This is a practical approach to decision making. The reverse is the ideological approach: to take the narrative and impose it on practices and individual decisions. There is nothing wrong with the ideological approach if you use it to govern decisions made without a thought for immediate outcomes. Ethical decisions, for example. One deals honestly because the narrative of "Honesty is the best policy" embodies a state which the decision-maker wishes to attain, not because it yields a direct benefit. When you use the ideological approach and expect a specific result to a specific decision, things get crazy.

This is because narratives are based on, again, patterns made from lots of little decisions, which the narratives then explain. When the method by which the narrative was built is unknown, the truth value of the narrative is obscured. Using content in the place of information or data in order to make decisions is terribly dangerous. Using unverified information as if it were verified is terribly dangerous. Analyzing data incorrectly is terribly dangerous. And yet, this is what is happening in decision-making all over the world, from households to governments and beyond.

Data in and of itself is incredibly powerful. Bringing up Seife and Proofiness again, it's clear that attaching a number to a fact is a shortcut to credibility. But humans are notorious for misreading and abusing data, and creating narratives based on these tortured numbers. Here's an example debunking a mostly harmless myth about pet ownership.

What I'm driving at is this: humanity might make better decisions if we would take what's reported as news and science, et cetera, with a grain of salt. This proves exceptionally difficult, because in reacting to one incorrect or incomplete narrative, we often create an opposing narrative that is just as incorrect or incomplete. There may not be a solution to the "Information Problem", but I propose that you, Dear Reader, go conduct your own studies to verify my narrative, and live by withholding judgment until proof is overwhelming.

Monday, December 27, 2010

As the Sands of the Hourglass...

This is a nice graph showing where people came into and went out of my Gmail chat life. That gray line is the first day I used the service, which as you'll note is NOT in the middle of 2006 (tee hee, previous post. tee hee.)

Personally, I think this is an incredibly telling graph. It's taking, again, my top nine gchat buddies and showing how the volume of our interaction changes. For example, KRS shows a very solid line at the start, meaning frequent interaction early in this period, but then tapers off. RPM starts off as a very casual gchat friend, but gains in intensity near the end. CRS is an interesting latecomer; she's my wife. She was given a Gmail account by ASU when she was accepted to the doctoral program, and I switched her personal email over to Gmail as well. We didn't start chatting until we were essentially engaged.

You'll note that she still makes it into my top nine. This is because of the size of those early chats. Unfortunately, this graph is not weighted; you get one dot every day we converse, no matter how long or short. I'm still learning R, and am trying to find out exactly how one goes about changing plotting colors according to the values of a variable. (Rather, I know how to do this for larger symbols, but not for dots).

A note on the three-letter codes. In the interest of privacy, they are generally not the current complete initials of the person listed. However, in order to make it readable for me, they are pretty close. If you're listed here, and you want me to change the initials to protect your privacy on the internet, please let me know. I'm not Mark Zuckerberg, you know.

Saturday, December 18, 2010

The Size of Shakespeare, or, A Comedy of Errors

So.

It's been a little while, and I haven't neglected you, three blog readers. I've been working on a project in order to blow your collective mind, or at least give it a little something to chew on.

Specifically, I have delved deep into the realms of my Gmail chat logs and have begun to discover: data. Oh man, the trip this has been. And it's not over. There will be charts, there will be graphs, AND! there may be PODCASTING.

I intend to give you the tidbits I have learned in chewable form, piece by piece. Today's episode is: why you should examine your data thoroughly before you make any conclusions.

I wrote a Perl script to turn my wad of uncooked data into a delicious patty; it returned the size of each individual chat file along with other important stats. In the statistical scripting language R, I discovered that the sum total of chat content produced was, in a word, ridiculous. I did some calculations and made a graph that looks a little something like this:

Yes, it appeared that even just my most chatty friend had produced with me a larger corpus of work, bytewise, than Bill Shakespeare himself (the Bard wrote about 5 MB worth). Sweet mercy. Note: I have been using Gmail's Chat client, and occasionally Google Talk, since the former launched in the middle of 2006.

The problem with this graph is that it's not actually accurate. When I examined the files a little further, I realized most of them looked like this:

Uh oh. So, of course, I had to write another Perl script. I found that code accounted for roughly 77 percent of the content created by Gmail chat. Here's the revised graph:

Tada! I think the most interesting development from this graph is quite simply that, even with the code stripped from the chats, a few of my friends and I have produced an entire corpus of text.
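For anyone who wants to try the same sanity check, here is a minimal sketch of the code-stripping step. It is written in Python rather than Perl, it is not the script used for the graphs above, and it assumes the "code" in a saved chat file is HTML-style markup in angle brackets. The function name `markup_share` and the sample line are made up for illustration.

```python
import re

# Assumption: the "code" in a chat file is anything inside <...> tags.
TAG = re.compile(r"<[^>]*>")


def markup_share(raw):
    """Strip tags from `raw`; return (plain_text, share_of_bytes_that_were_markup)."""
    text = TAG.sub("", raw)
    total = len(raw.encode("utf-8"))
    if total == 0:
        # An empty file has no markup to speak of.
        return text, 0.0
    kept = len(text.encode("utf-8"))
    return text, 1 - kept / total


# Hypothetical one-line chat file, for illustration only.
sample = '<div class="msg"><b>me:</b> hi</div>'
plain, share = markup_share(sample)
print(plain)
print(f"{share:.0%} of the bytes were markup")
```

Running something like this over every file and summing the byte counts before and after stripping is one way to arrive at a figure like the 77 percent above.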

In the next few months, I'll look at some of the sociological implications of the date and size data, and then all the way into the textual aspect of the transcripts. Analyzing the text itself should be tremendously interesting.

[Note: I changed some things as other pursuits have prevented me from diving into the actual text. Someday...someday.]

Thursday, November 18, 2010

Introducing the "Cite Your Source" button

Dear Following Few—

I made a web badge! Take the following HTML and plop it in a blog or other webpage—pass along the fact-checking goodness.

Monday, November 15, 2010

That R-pentomino Is So Hot Right Now

Why virality might be a real-life application of the least competitive game in the world.

Ars Technica's Casey Johnson wrote a stellar article about game theory being a more apt explanation of viral media than actual virology. The article points out that the epidemiological approach "is fitting for some cases, in others it's an oversimplification—a person's exposure to a trend doesn't always guarantee they will adopt it and pass it on." Essentially, this is the beginning of the explanation for why websites and gadgets succeed, while other, similarly featured ones fail.

The researchers from the Ars Technica article ran a couple of models based on game theory principles. The first assumed that the likelihood that a new computer application would be adopted by any given person was directly proportional to the number of friends in said person's network that adopted, and that knowledge of friends' adoption or non-adoption was 100%. This doesn't explain much: it creates a world with an infinite barrier to entry, but a preternatural tendency to growth. The second model denied absolute knowledge of friends' choices, and added a "try-it-out" rule: 100% adoption for nodes that had 0% knowledge of friends' tastes.
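To make the two models a little more concrete, here is a rough Python sketch of my reading of them. It is not the researchers' code. The function names, the tiny friend network, and the choice to express "directly proportional" as the fraction of friends who have adopted are all assumptions for illustration.

```python
def model_one(node, friends, adopted):
    """Model 1 sketch: adoption chance is the share of all friends who adopted."""
    circle = friends[node]
    if not circle:
        return 0.0
    return len(circle & adopted) / len(circle)


def model_two(node, friends, adopted, visible):
    """Model 2 sketch: only friends in visible[node] have known choices."""
    seen = friends[node] & visible[node]
    if not seen:
        # Try-it-out rule: with zero knowledge of friends' tastes, adopt.
        return 1.0
    return len(seen & adopted) / len(seen)


# Hypothetical three-person network: ann knows bob and cat.
friends = {"ann": {"bob", "cat"}, "bob": {"ann"}, "cat": {"ann"}}

# Nobody has adopted yet, so under model 1 nobody ever will.
print(model_one("ann", friends, adopted=set()))

# Same empty world, but ann can't see anyone's choice, so she tries it out.
print(model_two("ann", friends, adopted=set(), visible={"ann": set()}))
```

The first call shows the infinite barrier to entry: with no adopters, every probability is zero and nothing can start. The second shows how the try-it-out rule gives the network a way to seed itself.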

This was starting to sound a little like Life. Not the cereal, nor the Zen police procedural, but the game. John Conway's Game of Life is a zero-player game. I seriously won't attempt to beat Wikipedia at explaining it (skim it now, then come back), but suffice it to say that outcomes are both: absolutely predictable by machines who know the rules and can compute them on the fly, and terribly unpredictable and surprising to those who don't know or can't, you know, do several hundred computations in a few milliseconds. Patterns that seem small and silly may spread for generations and generations, and intricate designs might collapse in just a few. (Play Life here.)
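If you'd rather poke at the rules than read about them, here is a minimal Python sketch of one generation of Life on an unbounded grid, using the standard rules (a dead cell with exactly three live neighbors is born, a live cell with two or three survives). The names `step`, `run`, and the pattern constants are my own; the R-pentomino coordinates follow its usual five-cell shape.

```python
from collections import Counter


def step(live):
    """Advance one generation; `live` is a set of (x, y) live cells."""
    # Count live neighbors for every cell adjacent to at least one live cell.
    counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}


def run(live, generations):
    """Apply `step` repeatedly and return the final set of live cells."""
    for _ in range(generations):
        live = step(live)
    return live


# The R-pentomino:  .XX
#                   XX.
#                   .X.
R_PENTOMINO = {(1, 0), (2, 0), (0, 1), (1, 1), (1, 2)}
BLINKER = {(0, 1), (1, 1), (2, 1)}  # flips between horizontal and vertical

for generation in (0, 10, 100):
    print(generation, len(run(R_PENTOMINO, generation)))
```

The blinker just oscillates forever, while the five-cell R-pentomino famously churns for over a thousand generations before settling down, which is the "small and silly pattern that spreads" point in miniature.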

It's that unpredictable propagation that makes Life interesting. And while the rules of the game are surely different from the much more complex rules of social marketing, it stands to reason that a few things are similar: it's more important who knows about your product/website than how many of them there are to start off with. If people that people trust (read that phrase again) know about your content, so much the better. But if the social networks of your early adopters can serve to propagate your message to other widely-trusted individuals, sounds like you have a really solid start.

There are HUGE amounts of conjecture in this one little post. We of course have no clue what the rules to the idea-passing mechanism are, how to determine who the starters for your viral marketing plan are, or what "special sauce" makes an idea likely to be passed. Memetics has largely failed in this regard; future research is desperately needed here.

Thursday, November 11, 2010

Conflictinator Alert - Veterans Day 2010

al-Google Veterans Day

So, here's a tempest in a teapot: an Associated Content post about Google's Veterans Day logo that claims that the 'e' is actually a crescent of Islam. I'm not sure exactly where this ends, but it's possible the author actually believes the letter 'e' is a secret Muslim. Just to be safe, let's add everyone with an 'e' in their names to the No-Fly List.

Of course, when you bait the conflictinator trolls, they inevitably bite. HuffPo's response, of course, is to run with the AC author and claim that there's a widespread backlash about the logo. In their crazy, polarized view of the universe (perhaps fostered by spending too much time on the internet), the extreme right is one step away from besieging Mountain View with assault rifles, and maybe swastikas.

Way to contribute, guys.

All the President's Tax Cuts

Speaking of HuffPo, here's the title: White House Gives In On Tax Cuts. Here's the article (warning: contains serious hedging and low semantic density). Finding David Axelrod's statement that the president actually favors the extension of the tax cuts is hard, but finding anything that sounds like actually "giving in" is like playing Where's Waldo—with a Jackson Pollock painting.

Of course the Atlantic and a few other outlets took this and ran with an "Obama gives in" type story.

Wednesday, November 10, 2010

The "Cite Your Source" Project

A little experiment.

As you might have been able to tell, I've been having difficulty finding time to blog this last week or so. (I'm working on other writing projects right now.) I've been thinking about The Problem of Information a lot, and I think I've come up with a short follow-up. It's a little social experiment, and I think it will be interesting to see if it catches on.

We all participate in online communities, whether it be in the comments section of a news website or blog, Twitter, or just on Facebook. A lot of our arguments work like discussion on major news outlets: we quote statistics and other supporting evidence without citing our sources.

As you well know, these stats are not necessarily true, but by and large those who agree (and many who disagree) with the point being made never question the factuality of this data.

I propose that we start. Right now. I know it will definitely make you annoying to people, but I would like to encourage everyone here to respond, at least once, to an online claim made without citing a valid source of evidence, with a polite request for citation.

It would be as simple as: "That's an interesting figure. Would you mind telling me where I can go to verify it?" or "I'm not saying I disagree with your point, but I'd like to know how I can verify that fact." You don't have to be outright contentious about it. In fact, it's probably better if you're not. People don't like having their comment or FB post ripped apart.

If everyone started requesting citation of valid sources even a few times a week, it would go a long way toward a healthier data culture. Thanks.

PS: This page will give you a web badge you can post if you like.