Total Drek

Or, the thoughts of several frustrated intellectuals on Sociology, Gaming, Science, Politics, Science Fiction, Religion, and whatever the hell else strikes their fancy. There is absolutely no reason why you should read this blog. None. Seriously. Go hit your back button. It's up in the upper left-hand corner of your browser... it says "Back." Don't say we didn't warn you.

Wednesday, March 09, 2005

Everything I needed to know about data, I learned from my romantic life.

As sociologists, or grad students, or economists (that's a shout-out to you, Tom) or whatever, many of us have had formal training in statistics. Ah, yes, statistics: the discipline that little grad students fear, that grown up scientists enjoy, and that senior grad students approach with the sort of barely-masked trepidation that you rarely see outside of cheap porno.

The funny thing is, despite the frequency with which statistics are taught, and their importance to the discipline, it is really damned hard to find a good introductory statistics textbook. They tend to oscillate back and forth between two extremes, instead. The first extreme is the "You're a goddamn idiot" version, which concentrates on teaching stats as a practical art, but never bothers to explain why they work the way they do. As a consequence, you end up with practitioners for whom statistics are a sort of mystic incantation, rather than a set of fallible techniques. Alternatively, you may get a book from the "I want to actually have sex with Calculus" school of thought, in which case you're going to have so many derivations rammed up your ass, you'll be hacking up asymptotes for a month.

Where is the happy medium? Where is the bowl of statistical porridge that isn't too hot, or too cold, but is instead, just right? Well, I'll tell you this: it sure as shit ain't here. Nevertheless, my own need for a decent statistics book has finally driven me to the ultimate extreme- contemplating writing one myself. Of course, there's no way in hell that I know enough about stats to actually write a book, and similarly given that I tend to understand things in a fairly idiosyncratic way, it's unlikely that I could make myself understood if I did choose to write one. So, my twisted dream of making millions off of a thoroughly excellent textbook (Bwahahahaha!!!) must die stillborn.

That doesn't, however, mean that I can't begin to provide a sort of "Total Drek Big Book 'O Stats" for my loyal (i.e. mentally unbalanced) readers. So, without further delay, I invite you to enjoy the first installment in this more or less useless scholarly work.

The Total Drek Big Book 'O Stats

Chapter Something-or-Other: Data, data, data!


In social science, and statistics, we collect data, or information on the world that is used to draw conclusions. Not all data is alike, however, and different types of data, which mean very different things, must be analyzed in dramatically different ways.

This chapter is intended to introduce the various types of data typically used in social science. We will follow an organizing scheme that proceeds from those data types that are least informative within the context of the general linear model (introduced earlier as a way to both establish relationships between variables, and drive yourself to the brink of madness) to those that are most informative. This is not to say that variables that are uninformative in terms of the GLM are not useful, but rather only that their analysis frequently requires methods not covered in this book. And, since all methods that are worth anything are covered in this book, those other methods can go bugger themselves.

These various data types will also be explained in such a manner as to make their use in standard statistical software packages as trouble-free as possible. For those who have used stats packages before, you are correct in thinking that this is more or less the same as saying lube makes anal sex with a horse easier. So sue me. This is a book of statistics, if you want someone to hold your hand while you learn a stats package, go beg your significant other.

And, as long as we're discussing significant others, we may as well discuss this chapter's theme. For our discussion of data and data types, examples will be drawn from the author's spectacularly-failed romantic life, as well as the romantic lives of my associates. For those in my department, please be advised that only one of these examples (found in the section on string data) will refer in any way to the person also in our department that I dated- and that example isn't a juicy one. While I know I'm disappointing some of you, it just wouldn't be right. I'll go a long way for humor, but despite our being on rather poor terms, I bear this person no true ill will. So, I'm not going to say anything that might be embarrassing to them if at all possible, no matter how much you want me to.

You fucking vultures.

String Data: The first type of data that we must consider is referred to as "string" data. A string variable is one that simply records a word, set of words, or set of characters. As an example, consider exclamations during sex. The exclamations might be recorded by a researcher as a string variable. Thus, we might construct a variable named "sxtlk" in our dataset that might have values like, "Oh, shit, oh, oh, SHIT!" or, "Wow, baby, that feels good," or even, "Ow! Stop biting!" String variables are, in some ways, the most informative variables in a data set. Since the respondents are not restricted to selecting from a menu of options, there is little risk that the way the researcher perceives the field of inquiry will directly influence the results. Respondents can also provide details that otherwise might be missed. Unfortunately, this very flexibility also limits the statistical usefulness of string data. Since a given respondent can choose to answer the question in any way they like, there is little guarantee that the answers will be at all similar across respondents. In other words, the strength of string data for preserving complexity can, in turn, make this data type entirely useless for statistical analysis.
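For the software-minded, here is a rough Python sketch of why string data is such a pain to analyze. The first three values of "sxtlk" come from the example above; the fourth is a made-up respondent, and the normalize helper is a hypothetical bit of cleanup rather than anything a stats package gives you for free.

```python
from collections import Counter

# The first three values come from the "sxtlk" example; the fourth is invented.
sxtlk = [
    "Oh, shit, oh, oh, SHIT!",
    "Wow, baby, that feels good",
    "Ow! Stop biting!",
    "wow baby, that feels GOOD",
]

def normalize(text):
    """Lowercase and drop punctuation so near-identical answers can match."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.strip()

# Raw strings: every answer is its own category, so a tally says nothing.
raw_counts = Counter(sxtlk)

# After cleanup, the two "wow baby" answers collapse into one category.
clean_counts = Counter(normalize(s) for s in sxtlk)
```

Even with the cleanup, most answers still end up in a category of one, which is the whole problem in miniature.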

Even so, string data may remain extremely important. In the earlier example of a string variable, providing the wrong value for a given variable, for example calling out the name of your ex-boyfriend during climax, might lead to rather poor outcomes. In my case, I have managed to circumvent this possibility by only dating women whose names begin with the same letter (actually, of late their names have either been exactly the same, or have been different iterations of the same name). As a result, I need only make an exclamation that begins with the appropriate letter, followed by a more-or-less incomprehensible series of vowel-like sounds. This may seem like a rather extreme solution, but it does the trick.

Categorical: A categorical variable is one that distinguishes types of things, or elements, from each other. These elements are not, however, distinguishable in terms of the amount of something they may or may not have. The difference between elements is, therefore, qualitative in nature. Unlike in string variables these elements are not added to the variable as text, but are coded into the variable as numbers. As an example, consider the reasons why romantic relationships fail, operationalized as a categorical variable "relfail." This variable might have numeric values like 1, 2, & 3, corresponding to meaningful values like, "She hated me," "She began having psychotic episodes," and "Christ, but she was boring." It's important to note, however, that while categorical variables are usually coded numerically, the numbers themselves have essentially no meaning. They are, instead, merely convenient labels that stand in for real values, much as some other woman stands in for you with your lying fuck of a boyfriend while you're away at a conference.
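To see that the numbers really are just labels, consider a minimal Python sketch. The codes and labels follow the "relfail" example; the five observations and the alternative coding are invented for illustration.

```python
from collections import Counter

# Coding scheme from the "relfail" example above.
RELFAIL_LABELS = {
    1: "She hated me",
    2: "She began having psychotic episodes",
    3: "Christ, but she was boring",
}

def tabulate(codes, labels):
    """Count how often each category label occurs."""
    return Counter(labels[c] for c in codes)

def mean(codes):
    """Arithmetic mean of the numeric codes (meaningless here, as shown below)."""
    return sum(codes) / len(codes)

# Made-up observations, for illustration only.
relfail = [1, 3, 3, 2, 3]
counts = tabulate(relfail, RELFAIL_LABELS)

# Hypothetical relabeling: 1 -> 30, 2 -> 10, 3 -> 20. Same cases, same categories.
recode = {1: 30, 2: 10, 3: 20}
relfail_recoded = [recode[c] for c in relfail]
recoded_labels = {recode[c]: label for c, label in RELFAIL_LABELS.items()}
counts_recoded = tabulate(relfail_recoded, recoded_labels)

# The tallies agree, but the "average reason for failure" does not.
mean_original = mean(relfail)
mean_recoded = mean(relfail_recoded)
```

The tallies survive the relabeling untouched, while the mean changes with whatever numbers you happened to pick. That is a decent hint that the mean of a categorical variable is not telling you anything.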

Dummy Variables: A dummy variable, also known in some (less hip) circles as a crisp set variable, is a special case of a categorical variable. A dummy variable is a categorical variable that has only two possible values, most often zero and one. A value of one indicates the presence of something, while a value of zero indicates its absence. It is important to keep in mind that zero is not the precise inverse of one, but is only a statement of not-one. As an example, I recently went on a few dates with an entomologist. If we coded our relationship using a dummy variable for friendship status named "frnd" this variable would, initially, have had a value of zero. This does NOT mean that we are enemies, the inverse of friends, but rather only that our relationship could not be categorized as friendship-based. A zero, then, merely indicates that the case in question does not meet particular criteria, but does not then show what other criteria it may meet. Later, we decided that while we enjoyed hanging out there was no real attraction, and that we should just be friends. Thus, the variable frnd changed values from zero to one, in the process eliminating a great deal of uncertainty.
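The "zero only means not-one" point can be sketched in a few lines of Python. The status values below are hypothetical, and to_dummy is an illustrative helper, not a function from any particular stats package.

```python
def to_dummy(values, target):
    """Return 1 where a value equals `target`, and 0 everywhere else."""
    return [1 if v == target else 0 for v in values]

# Hypothetical relationship statuses for a handful of acquaintances.
status = ["friend", "enemy", "stranger", "friend", "dating"]
frnd = to_dummy(status, "friend")

# Everyone coded 0 is lumped together, whatever they actually are.
not_friends = [s for s, f in zip(status, frnd) if f == 0]
```

Enemies, strangers, and dates all land on zero. The dummy tells you who is a friend and nothing whatsoever about anyone else.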

Ordinal: Ordinal variables are similar to categorical variables, in that they sort elements into distinct categories. Unlike categorical variables, however, they sort on the basis of a single feature which can be rank-ordered. Thus, if we were to sort the women from the earlier "relfail" variable (See the section on categorical data) into a new variable denoting relationship quality, or "relqual," we might assign each case (i.e. woman) an increasing numerical value with increasing relationship quality. So, "Ms. Hatred" might be assigned a 1, "Ms. Dull" a 2, and "Ms. Psychosis," a 3. The numeric values, in contrast to categorical data, indicate changing levels on a single underlying continuum, rather than just sheer difference. It is important to note, however, that in ordinal data the values only indicate a ranking, but not a consistent difference between those ranks. So, while "Ms. Psychosis" might have been a better relationship than "Ms. Dull," that difference in quality between levels may not be the same as the difference in quality between "Ms. Dull," and "Ms. Hatred." This makes logical sense, as dating someone who occasionally hallucinates and has funny beliefs about demons and showers, may not be that much different in quality from dating someone who is simply uninteresting, whereas both are dramatically different from dating someone who appears to have true malicious intent. In short, then, while the ranks do indicate that some sort of distance separates each level of the variable, the precise amount of distance may or may not be constant between levels. Thus, this variable is ideal for recording information that can be sorted by rank, but for which precise measures indicating the details about these ranks are unavailable.
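As a rough illustration in Python: the first coding below is the one from the "relqual" example, while the second is an invented alternative that keeps the same order but changes the spacing.

```python
# Ranks from the "relqual" example: higher means a better relationship.
relqual = {"Ms. Hatred": 1, "Ms. Dull": 2, "Ms. Psychosis": 3}

# Hypothetical alternative coding: same order, different spacing.
relqual_alt = {"Ms. Hatred": 1, "Ms. Dull": 9, "Ms. Psychosis": 10}

def ranking(scores):
    """Return the names sorted from worst to best."""
    return sorted(scores, key=scores.get)

def gaps(scores):
    """Differences between adjacent codes (not meaningful for ordinal data)."""
    ordered = sorted(scores.values())
    return [b - a for a, b in zip(ordered, ordered[1:])]

# Both codings carry the same ordinal information...
same_order = ranking(relqual) == ranking(relqual_alt)

# ...but the gaps between codes depend entirely on the numbers chosen.
gaps_original = gaps(relqual)
gaps_alt = gaps(relqual_alt)
```

Since both codings produce the same ranking, anything that relies only on order gives the same answer for either one. Anything that relies on the size of the gaps is reading tea leaves.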

Interval: Interval data is similar to ordinal data in that it allows the ranking of cases on some sort of underlying continuum. Unlike ordinal data, however, the "distance" between categories is constant and unchanging. Put another way, interval data is similar to a variable that counts instances of something. So, if we create a variable named "orgsm" that indicates the number of times you... well... I think it's obvious, while your boyfriend is performing cunnilingus, that variable would be interval in nature. Each orgasm is a distinct event and they can be counted as they occur. Intermediate positions, however, corresponding to "half an orgasm" are meaningless. The meaninglessness of such intermediate positions is apparent from the common male complaint of blue balls. This type of data is useful for counting discrete events that cannot be subdivided, as it preserves the meaningul information about increasing number and consistent distance. It is also worth pointing out that if we were concerned with the amount of pleasure derived from each successive orgasm, we would not record this as an interval variable unless the amount was constant. If, instead, each successive event becomes less enjoyable, the distance between events on the underlying continuum is fluctuating, and we must record the data as either an ordinal variable, or as the next data type.

As a side note: You have to love an online encyclopedia that includes an entry on "blue balls."

Ratio: Ratio data is composed of a measurement of some parameter that can be sub-divided meaningfully and has an interpretable zero point. Thus, unlike interval data, points between major categories (i.e. one and two) can be interpreted in an informative way. Another way to think about this is that the distance between categories is maximally informative. If we were to construct a variable named, "vmt" corresponding to the amount of vomit deposited on you by your girlfriend while she's sick, we could measure the quantity as a ratio variable. Washing might remove any portion of the vomit and both the amount removed, and the amount remaining, would be interpretable. If vomit were in turn an interval level phenomenon we could only manipulate precise amounts of vomit (a sort of "vomit quantum" if you will) at a time, reducing or increasing its presence only in particular amounts, rather than in any portion we choose.

Fuzzy Set: Fuzzy set data, used only occasionally by social scientists, is a special combination case of dummy variable data and ratio data. In a fuzzy set the lower and upper limits of the variable are fixed at zero and one, indicating full non-membership in a given category or full membership respectively. Intermediate positions, however, are interpretable as partial memberships in both the set of non-members and the set of members. So, if we constructed a fuzzy set variable "date" indicating whether or not two people were involved in an ongoing romantic relationship, we may immediately define the two outer anchor points. One would correspond to full membership in the set of people who are dating- for example steady significant others. Similarly, a value of zero would denote full non-membership, but does not indicate what the actual relationship is. So, good friends who are not dating might qualify as zero, but so would bitter enemies. As with the earlier dummy variable data type, fuzzy set zeroes indicate only negation. In addition to these extremes, however, we find intermediate values. Values greater than .5 indicate that a case is more a member than a non-member, and might correspond to individuals who are dating, but still see other people. Similarly, values less than .5 indicate that cases are more non-members than members, such as a couple that broke up, but retains certain feelings for each other and, occasionally, has freaky monkey sex in the back of a Volkswagen. Fuzzy set data is useful in that it allows both the distinguishing of qualitive categores (i.e. dating and not dating) while at the same time preserving the fine gradations that may exist between these states, "dated a little," "Friends with benefits," "Fuck-buddies," etc. At the same time, it is important to recognize that while degree of membership in a set can be subdivided hypothetically-indefinitely, as with ratio data, ratio data does not constrain the variable to have a maximum extent.
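A small Python sketch of the "date" variable follows. The membership scores are invented, and treating non-membership as one minus membership is the standard fuzzy-set negation, which is assumed here rather than stated above.

```python
def check_membership(m):
    """Fuzzy membership scores must lie between 0 and 1 inclusive."""
    if not 0.0 <= m <= 1.0:
        raise ValueError(f"membership must be in [0, 1], got {m}")
    return m

def non_membership(m):
    """Degree of membership in the complementary set (assumed to be 1 - m)."""
    return 1.0 - check_membership(m)

def more_in_than_out(m):
    """True when a case is more a member than a non-member."""
    return check_membership(m) > 0.5

# Hypothetical scores on the "date" variable, for illustration only.
date = {
    "steady significant others": 1.0,
    "dating, but still seeing other people": 0.7,
    "broke up, but not entirely": 0.3,
    "good friends who are not dating": 0.0,
    "bitter enemies": 0.0,
}

mostly_dating = [case for case, m in date.items() if more_in_than_out(m)]
```

Note that the good friends and the bitter enemies get the same score, just as with the dummy variable. A ratio variable, by contrast, would have no reason to refuse a value of 1.5, whereas check_membership rejects it outright.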

The above data types, while not exhaustive, do provide a solid foundation for the understanding and use of particular types of statistics. In the next chapter, we will explore some basic ways to summarize and explore these data as a preliminary step to statistical analysis.

3 Comments:

Anonymous Anonymous said...

nice. if you won't get a textbook though, you could probly at least submit this to the guys and/or gals of improbable.com. oh, also, under your discussion of fuzzy data, i suspect that something may have gone terribly wrong in the communication of nerve impulses from eye to brain to fingers when you typed "qualitive categores", but hey, i'd still publish it if i could.
-anomic

Wednesday, March 09, 2005 4:50:00 PM  
Anonymous Julie said...

Allow me to use a precise academic term here: abso-f'kin-lutely brilliant. Shame that my current research is entirely qualitative, otherwise I might actually start using some of the concepts taught in a dreary undergrad stats course. I think this is one textbook college students might actually read...

Wednesday, March 09, 2005 5:28:00 PM  
Anonymous Anonymous said...

*small applause for the fearless and indiscreet blogger*
Ta, Drek, you made us laugh.
All I can say is that I hope that is a good thing

Thursday, March 10, 2005 9:28:00 AM  
