Analytics Strategist

July 28, 2008

The sad reality of today’s business

Filed under: Advertising, business strategy, Datarology, Uncategorized — Tags: , — Huayin Wang @ 4:16 pm

The sad reality of today’s business has something to do with analytics, and with technology in general for that matter, in a somewhat twisted way.

July 27, 2008

When data floods, analytics is Noah’s Ark.

Filed under: Business, Datarology, Uncategorized — Tags: , — Huayin Wang @ 3:48 am

get on it fast!

September 25, 2007

the skill levels in data analytics

Filed under: business strategy, Datarology — Tags: , , — Huayin Wang @ 10:43 pm

Data analytics is a collection of disparate techniques and applications covering practically every field and every industry. What holds it together as a coherent discipline is the skill set of the data analyst: the intrinsic structure, levels, and connecting logic of the skill components ultimately define and delineate what data analytics is now and what it will be in the future.

Data analytics is not a mature discipline. Naturally, there is widespread confusion about what data analytics is, and particularly about what the skills and skill levels are. This lack of common understanding has negative consequences for talent search, training and education, project management, etc.

One such misunderstanding originates from mixing up data analytics skills with subject domain knowledge. Subject domain knowledge is accumulated through experience, memorized information, and practice related to the subject matter. The information and insights are stored in the brain and can be readily queried without relying on external data, although they may originally have come from working with data. In contrast, data analytics skills are the skills of extracting intelligent information from data, fresh and on the spot. Without the explicit provision of data, data analytics results in no knowledge! Taking away the data is like taking away the fuel from a data analyst. If data analytics is a Data-Driven process, then applying domain knowledge can be called a Grey-Cell-Driven process. Data is useful food for data analytics. But without data analytic skills, data is useless, just as gasoline is useless to a bicycle.

There are varying skill levels in data analytics. For starters, there are roughly four levels of data analytic skills:

  1. basic
  2. reporting
  3. professional
  4. expert

At level 1, basic analytic skills are mostly obtained from education and experience. They consist of comparing numbers (big/small, high/low, bigger/smaller), calculating percentages, fractions, ratios, and indexes, reading pie charts and bar charts, and understanding two-way tables without relying on others to translate them into words. Use of Excel is optional, but in general, most people at this level are able to put data into a spreadsheet and do some arithmetic calculations. This level does not require any programming skill.

At level 2, reporting analytic skills are generally acquired through work experience. This level consists primarily of data analysis skills using Excel, or analytic tools that can dump data into Excel. It includes the use of formulas, numeric and text functions, and Excel macros, plus a selection of more advanced skills such as pivot tables, VLOOKUP, regression, VBA, Solver, etc. The analytical processes of breaking down and aggregating up, trending, and graphing also belong to this level. Analysts at this level understand the concept of a data table or dataset, with records as rows and fields as columns; record subsetting and filtering; and some ways to measure the strength of the relationship between fields …
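To make "breaking down and aggregating up" concrete, here is a minimal sketch of a pivot-table style summary and a record filter in plain Python rather than Excel. The records, the field names (`region`, `product`, `sales`), and the helper names `pivot_sum` and `filter_rows` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical dataset: each record is a row, each key is a field (column).
rows = [
    {"region": "East", "product": "A", "sales": 120},
    {"region": "East", "product": "B", "sales": 80},
    {"region": "West", "product": "A", "sales": 200},
    {"region": "West", "product": "B", "sales": 50},
    {"region": "East", "product": "A", "sales": 30},
]

def pivot_sum(records, row_field, col_field, value_field):
    """Sum value_field for every (row_field, col_field) pair, like a pivot table."""
    table = defaultdict(lambda: defaultdict(float))
    for rec in records:
        table[rec[row_field]][rec[col_field]] += rec[value_field]
    return {r: dict(cols) for r, cols in table.items()}

def filter_rows(records, **conditions):
    """Keep only the records whose fields equal the given values."""
    return [r for r in records if all(r[k] == v for k, v in conditions.items())]

pivot = pivot_sum(rows, "region", "product", "sales")
east_only = filter_rows(rows, region="East")

for region, by_product in pivot.items():
    print(region, by_product)
```

The same breakdown could be done with a spreadsheet pivot table; the point is only that the underlying operation is grouping records and summing a field.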

At level 3, the hallmark of professional data analytics skills is the ability to not only extract information but also evaluate the reliability of the extracted information. In other words, it consists of the skills to extract intelligent information, rather than just information. It also includes a much expanded set of knowledge extraction skills. At its core are: sampling theory and experimental design, regression and decision tree models, the model development process and common validation principles, basic types of statistical distributions, significance levels and p-values, distribution models for the three basic types of fields (numeric, ordered, categorical), and proper estimation of relationships between fields of different types. Modeling and algorithm knowledge, and the use of software tools and programming languages, are intrinsic to professional data analysts.
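As one illustration of "evaluating the reliability of the extracted information", here is a rough sketch of a percentile bootstrap confidence interval for a mean. The bootstrap is my choice of example, not the only technique at this level, and the sample values and the function name `bootstrap_ci` are hypothetical.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return stat(values), lower, upper

# Hypothetical sample, e.g. ten daily order counts.
sample = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]
point, lower, upper = bootstrap_ci(sample)
print(f"mean = {point:.2f}, approximate 95% interval = ({lower:.2f}, {upper:.2f})")
```

The extracted number is the mean; the interval is the statement about how far that number can be trusted given only ten observations.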

At level 4, expert data analytics skills are generally hard to define. Like tree branches, the higher they are, the more they split, both in direction and in level. The one thing I have noticed about experts is their sensitivity to and awareness of all the explicit and implicit assumptions behind the algorithms used and the general conclusions drawn. Of course, there are many narrower data analytic fields and niches; one could be an expert in one and not in others.

It is also worth mentioning that there are a few skills that are related to, but not part of, data analytics. Among others, they include making analogies, generating pretty charts or animated graphs, and last but not least, selling and promoting data analytics.

Decision Theory and Data Analytics

Filed under: Datarology — Tags: , , — Huayin Wang @ 7:56 pm

Data analytics is the core technology used by businesses today. This is mainly due to the increasing availability of data and the importance of data-driven decision making in business.

The basic elements of a decision making process are:

  • the set of choices or options
  • the set of outcomes, corresponding to the above options
  • a valuation of outcomes

Decision-making is about making a choice (or selecting an option) that makes sense in light of the valuation of outcomes. Without going too crazy with extra assumptions, such as rationality of the decision agent, the above components allow us to analyze the decision-making process. The simplest decision-making case is when there is only one option (in other words, no choice).

In general, there are four types of decision making:

  1. decision making under certainty (the outcome for each choice is known)
  2. decision making under risk (each choice has more than one possible outcome, and the probabilities are known)
  3. decision making under uncertainty/ignorance (the set of possible outcomes is known, but not the probabilities)
  4. decision making in interactive context (game theory, gaming context)

With everything above prepared, known, and fully specified, rational decision making reduces to an optimization process, with the exception of case 4. This is not to say it is simple; in fact, many real-world optimization processes can be exceedingly difficult.
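The first three cases can be sketched as small optimizations over a payoff table. Everything in the example is hypothetical: the options, the outcome names, the payoffs, and the probabilities. For case 3, I use the maximin rule (pick the option with the best worst case), which is only one of several possible criteria under uncertainty.

```python
# Hypothetical payoff table: option -> payoff under each possible outcome.
payoffs = {
    "launch": {"strong_demand": 100, "weak_demand": -40},
    "pilot":  {"strong_demand": 40,  "weak_demand": 10},
    "wait":   {"strong_demand": 0,   "weak_demand": 0},
}

def best_under_certainty(payoffs, known_outcome):
    """Case 1: the outcome is known, so pick the option with the highest payoff."""
    return max(payoffs, key=lambda option: payoffs[option][known_outcome])

def best_under_risk(payoffs, probs):
    """Case 2: probabilities are known, so maximize the expected payoff."""
    def expected(option):
        return sum(probs[outcome] * value for outcome, value in payoffs[option].items())
    return max(payoffs, key=expected)

def best_under_uncertainty(payoffs):
    """Case 3: probabilities are unknown; maximin maximizes the worst-case payoff."""
    return max(payoffs, key=lambda option: min(payoffs[option].values()))

probs = {"strong_demand": 0.6, "weak_demand": 0.4}  # assumed for illustration
print(best_under_certainty(payoffs, "weak_demand"))
print(best_under_risk(payoffs, probs))
print(best_under_uncertainty(payoffs))
```

Case 4 is left out on purpose: once other decision makers react to your choice, a single `max` over your own options is no longer enough.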

Optimization technique is at the core of decision making; it is also the centerpiece technology of data analytics.

In my professional life as an analytics consultant, I have found this basic conceptual framework very valuable. Whenever a new business problem arises, I often start by looking for the core decision-making problem. The subsequent steps are, in turn, figuring out the set of all possible choices, the constraints (which, combined with the choices, give a feasible choice set), and the outcome measures or project objectives (from which valuations are derived).
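Here is a toy sketch of those steps: enumerate the choices, apply a constraint to get the feasible set, then value what is left. The campaigns, their costs and expected responses, and the budget are invented numbers, and brute-force enumeration only works because the example is tiny.

```python
from itertools import combinations

# Hypothetical campaigns: name -> (cost, expected responses).
campaigns = {
    "email": (10, 300),
    "mail": (40, 700),
    "search": (30, 600),
    "tv": (80, 900),
}
budget = 70  # the constraint, assumed for illustration

def all_choices(items):
    """Step 1: every possible combination of campaigns, including none."""
    names = list(items)
    for size in range(len(names) + 1):
        yield from combinations(names, size)

def is_feasible(choice):
    """Step 2: a choice is feasible if its total cost fits the budget."""
    return sum(campaigns[name][0] for name in choice) <= budget

def valuation(choice):
    """Step 3: the outcome measure, here total expected responses."""
    return sum(campaigns[name][1] for name in choice)

feasible_set = [choice for choice in all_choices(campaigns) if is_feasible(choice)]
best = max(feasible_set, key=valuation)
print(best, valuation(best))
```

In real projects each of the three functions is where the hard questions hide: what counts as a choice, which constraints are real, and what the objective actually is.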

What is so valuable about the framework is not that it ultimately gives a formal setup of the problem; more often than not, there are no clear answers to any of the above questions. Instead, it is the process of trying to clear things up that often helps uncover blind spots and missed opportunities that might otherwise be overlooked.

Much of modern Decision Theory is, above and beyond its conceptual frame, quite irrelevant to actual decision making in the real world. The things that get skimmed over, abstracted out, and cut off before a problem becomes a well-specified optimization problem are often the real issues for (good) decision making; and it is in dealing with these things that data analytics plays a big role. Data analytics helps decision making by:

  • reducing risk and uncertainty associated with options using predictive modeling,
  • expanding the set of feasible options, and
  • making optimal choice possible through the use of efficient algorithms

September 24, 2007

three ways of seeing Data Analytics

Filed under: Datarology — Tags: , , — Huayin Wang @ 4:06 pm

There are three different ways of looking at and speaking about Data Analytics: the application-centric, the algorithm-centric, and the data-centric.

Data Analytics is essentially applying a process or algorithm to a set of data to draw intelligent information for a practical purpose. The three ways of looking at Data Analytics are natural reflections of its three key components: data, algorithm and application. Although the essential subject matter can be the same, the day-to-day manifestations of Data Analytics can seem bewilderingly different; nowhere is this felt so acutely as when one is working in data analytic consulting.

The client talk (or “needs” talk, as we sometimes call it) refers to the “churn” model, the mailing “response” model, etc. in a way that is naturally application-centric. The statistician talk, or the “techie” talk, is all about the algorithm: logistic regression, robust regression, support vector machine, etc. It is absolutely amazing sometimes just to listen and observe how the two camps communicate, debate, and argue, and how all this often amounts to very little of substance. I wonder how much emotion and saliva could be saved by knowing this difference.

Such differences are not limited to the words they use, but reflect contextual and directional differences. From an app-centric perspective, a mailing response model is a mailing response model; it does not matter what algorithm is used: logistic regression, neural network, decision tree, SVM, etc. To a modeler, a logistic regression is different from a neural network, even though both might be used in many different applications: churn/attrition model, win-back model, fraud prediction model, etc. When instructed to figure out the best modeling strategy, the decision processes of the different camps (the business/marketing people on one side, the statisticians and data miners on the other) work quite differently. Neither side is properly equipped to think about this with the breadth and depth needed.

The data-centric perspective provides yet another angle. Common to the app-centric and algorithm-centric perspectives is the idea of a purpose, a thing to predict or find out. In contrast, the focus of the data-centric perspective is solely on data. Given a piece or a set of data, what are ALL the things that can analytically be done to it? This is a pure data analytic perspective where the data elements are in their most abstract forms. It is also a wide-open perspective, conducive to and capable of providing high-level generative and creative strategies.

If you have a data table with all numeric fields, what can you do with it? What are all the analytic measures for measuring the “relationship” between two numeric fields? Two character fields? A numeric and an ordered categorical field? What can you do with a numeric field and a LARGE categorical field with millions of unique values? Two LARGE categorical fields?
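As a partial answer to the "relationship" question, here is a sketch with one common measure per pair of field types: Pearson's r for two numeric fields, the correlation ratio (eta squared) for a categorical and a numeric field, and Cramér's V for two categorical fields. These are my picks among many possible measures, the tiny inputs are made up, and none of this addresses the LARGE categorical case, where such measures need much more care.

```python
import math
from collections import Counter, defaultdict

def pearson_r(xs, ys):
    """Numeric vs numeric: linear correlation, between -1 and 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def correlation_ratio(categories, values):
    """Categorical vs numeric: share of variance explained by the category (0 to 1)."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return between / total

def cramers_v(a, b):
    """Categorical vs categorical: chi-square based association (0 to 1)."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    chi2 = 0.0
    for x in count_a:
        for y in count_b:
            expected = count_a[x] * count_b[y] / n
            chi2 += (joint.get((x, y), 0) - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return math.sqrt(chi2 / (n * k))

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(correlation_ratio(["a", "a", "b", "b"], [1, 1, 3, 3]))
print(cramers_v(["x", "x", "y", "y"], ["p", "q", "p", "q"]))
```

Each function assumes both fields actually vary (at least two distinct values); a constant field would cause a division by zero.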

The data-centric perspective is new and rarely used. It is also the most interesting, and the most needed at this stage of development.

Three attributes, three perspectives, a pair of eyes. Even with all these, a single great mind is sometimes still the most needed thing to solve challenging problems.

September 18, 2007

Data Driven Intelligence

Filed under: business strategy, Datarology — Tags: , , , — Huayin Wang @ 6:01 pm

Abundance of intelligent people and intelligence is one major characteristic of our time!

Data Driven Intelligence, at least under its current moniker, is a modern invention. In the broadest sense, it refers to intelligence derived solely from data. It takes data, including meta-data, as the only input while outputting intelligent information.

The professionals in this trade are those with the knowledge and skills to extract intelligent information from data. This profession is still young and diversified. It has been called by many names, including statistics, data analytics, machine learning, data mining, artificial intelligence, knowledge discovery, pattern recognition, etc. I call it Datarology. Feel free to use your own favorite substitute.

But what about the similarly-named Numerology? Isn’t it also taking in data and generating “insightful” information?

It is true that both derive interesting and intelligent information from numbers, or claim to do so. It is amazing to see how much numerologists can derive out of as small a piece of data as a birthdate! Another profession marked by such an ability to derive much from little data or few words, is theology.

What distinguishes datarology from these two is how very careful it is about what information can be reliably drawn from the available data. I can’t imagine a datarologist being excited about working with a single data point: a birthday! This is not an indictment of numerology, or even a challenge to the validity of its intelligence. It is mainly to illustrate the difference between the two. In all fairness, numerologists do not really work with one data point; they work with huge amounts of data going through intricate processes. The key difference lies in the fact that these data and processes are implicit and hidden in the dark (brain cells).

In contrast, Datarology is characterized in large part by its explicitness. It requires that all data and meta-data (including assumptions about the data characteristics) be made explicit; it also promises to make the deductive process explicit. The intelligence-generating process can be so transparent that it could be understood and carried out by a machine!

This is one force that is radically transforming business today and every day. It gives an advantage to businesses that have a lot of data, it improves the efficiency of business operations, and it pushes the digitization of every aspect of business.

Most of all, it creates an evolutionary threat to the traditional forms of intelligence and intelligent people. The intelligence based on remembering facts, folklore, and rules that are readily derivable from data, the type that simply comes with age and experience, is becoming endangered. If this is unfamiliar, read the book Moneyball by Michael Lewis.

It pays to learn this new knowledge and these skills: the capability of extracting intelligence from data, all kinds of data.

The abundance of intelligence is greater now, with the addition of intelligent machines.

 

March 23, 2007

What is Data, really?

Filed under: Datarology — Tags: , , — Huayin Wang @ 8:40 pm

The common definitions found on the web all share a similar construct:

Factual information, especially information organized for analysis or used to reason or make decisions. (Answer.com) (Webster is similar)

There are many versions that define data using the word “fact” or “factual information”. This is unacceptable, for data itself carries no assertion about its own quality; whether it is fact or not is an after-the-fact matter as far as the definition of data is concerned.

Using “information” to define data is not proper either, for whether data is information depends on the users and how they understand the data. Data is more “primitive” than “information”, not the other way around.

I like the following better, although I am not perfectly happy with it: Data is a structured form made up of individual datum elements. I like it because it does not imply any relationship between its explicit form and the external world. It does not limit the structure to any particular kind: table, row, collection, independent observations, etc. are all artificial frames, not general enough to be considered in the definition. It also does not imply the presence of any external knowledge, preprocessing routines, or specialized observers.

“Give me some examples,” you ask. First of all, the simplest example of data is a datum, the atom as far as data is concerned. Because it is a datum, it should not have any sub-structure itself, which is to say it can have nothing but a name or a label. As to the form of a datum, it really does not matter, as long as it is regarded as the simplest data. A datum can have a name and a value.

Next, data can be a collection of observations (each a datum).

Next, a datum can have attributes which “describe” the observation. Examples are “continuous”, “discrete”, “ordered”, etc. An attribute may have a name and a value as well.
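One way to picture this definition is as three small types. This is only my own sketch: the class names `Attribute`, `Datum`, and `Data`, the field names, and the sample values are assumptions, and the collection is deliberately left as a plain list so that no table-like structure is implied.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Attribute:
    """Describes a datum, e.g. its scale; has a name and a value."""
    name: str
    value: Any

@dataclass
class Datum:
    """The atom of data: a name, optionally a value and descriptive attributes."""
    name: str
    value: Any = None
    attributes: List[Attribute] = field(default_factory=list)

@dataclass
class Data:
    """A structured form made up of datum elements; here, simply a collection."""
    items: List[Datum] = field(default_factory=list)

# Hypothetical observations.
age = Datum("age", 42, [Attribute("scale", "continuous")])
color = Datum("color", "red", [Attribute("scale", "discrete")])
observations = Data([age, color])

for datum in observations.items:
    print(datum.name, datum.value, [(a.name, a.value) for a in datum.attributes])
```

A table, a graph, or a time series would each be a different way of arranging `Datum` elements, which is why the definition stops short of naming any one of them.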

What we call a “data table” is just one common form of data. Other forms of data include: network data, transaction data, graph data, time series, text data.

“All this is common sense,” you say. “What’s new?”

Well, all the common data analytics are analytics on “table-like” data. The analytics for other forms of data are far behind. This is a problem; this is also an opportunity.

March 25, 2006

some sayings

Filed under: Datarology — Tags: , — Huayin Wang @ 1:49 pm

science is the process of abstraction of the concretes …
mathematics is the process of concretization of the abstracts …

one who does not know if he/she knows, does not know
one who does not know, does not know that he/she does not know

analytics is the process of destruction by reconstruction
