Re: Languages don't matter. A mathematical refutation

comp.lang.ada

From: brbarkstrom@gmail.com
Subject: Re: Languages don't  matter.  A mathematical refutation
Date: Sat, 4 Apr 2015 08:41:59 -0700 (PDT)
Message-ID: <2747823f-3d48-47ec-b541-f506e03ee313@googlegroups.com>
In-Reply-To: <cutvhal9n5jobh5ojsof28r9hncdjabbvk@4ax.com>

I find myself disagreeing with Dr. Kazakov and agreeing more with
Dr. Martinez.  In the following scenario, I've tried to be careful
about separating the statistical issues from the information that
is needed to explain why two samples differ.  From that standpoint,
a statistician can determine whether the costs sampled for two
development methodologies are likely to come from the same
distribution without knowing anything about the languages, team
quality, organizational structure, and the related items that
have consumed the discussion in much of this thread.

Consider a consortium of development organizations.  Some of them use methodology "A".  Another group uses methodology "C".  All members of
the consortium keep records of their total maintenance costs for ten
years.  The consortium members agree it might be useful to share two pieces of information: the methodology and the cost.  Accordingly, they build a list containing just three fields for each member: name of the organization, the development methodology, and the ten-year cost.
For example, the first four items in this list might contain the elements
	Company 1,    "A",    $10,000
	Company 2,    "A",    $25,000
	Company 3,    "C",    $50,000
	Company 4,    "C",     $5,000
and so on.

Company Q's CEO hires a statistician to sample this list.  The statistician doesn't want to read the whole list, so he samples it in a way that produces two new lists.  The first of these contains the information only from companies that use method "A".  The second contains information only from
companies that use method "C".

To ensure that the samples selected from the original list are not
duplicated, the statistician selects the samples by starting with a random initial index.  He selects the line with that starting index.  It contains just the three fields shown in the example above.  If the methodology is "A", the information goes into the new list with just "A" information.  If the methodology is "C", it goes into another new list with just the "C" information.

The statistician's algorithm then lets an integer pseudorandom number generator select a new positive number and adds that to the index to select the next sample from the original list.  Because the increment is always positive, no line is selected twice.  The algorithm treats the three fields from the new sample in the same way as the first sample.

This process continues until the statistician is satisfied he has enough samples.  [This will probably be somewhere in the vicinity of 20, although it might be more.]
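The sampling procedure described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the function name `sample_by_method`, the `max_step` bound on the random increment, and the decision to stop when the index runs off the end of the list are all hypothetical choices that the description above leaves open.

```python
import random


def sample_by_method(records, n_samples, max_step=5, seed=None):
    """Walk the list from a random start, advancing by a random positive step.

    records: list of (company, method, cost) tuples.
    Returns a dict mapping each methodology to the list of sampled costs.
    """
    rng = random.Random(seed)
    index = rng.randrange(len(records))      # random initial index
    by_method = {"A": [], "C": []}
    for _ in range(n_samples):
        _company, method, cost = records[index]
        by_method.setdefault(method, []).append(cost)
        index += rng.randint(1, max_step)    # step >= 1, so no line repeats
        if index >= len(records):            # assumption: stop at end of list
            break
    return by_method


# The four example rows from the list above.
example = [
    ("Company 1", "A", 10_000),
    ("Company 2", "A", 25_000),
    ("Company 3", "C", 50_000),
    ("Company 4", "C", 5_000),
]
lists = sample_by_method(example, n_samples=3, max_step=1, seed=0)
```

Because the walk stops at the end of the list, it may return fewer than `n_samples` values; a real consortium list would need to be long compared with `n_samples * max_step`.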

The statistician now has two lists, each selected at random from the original list.  At this point, he can apply standard statistical tests to the numerical values for cost to see if the distributions differ significantly.  If he wishes to treat the numerical values as integers (binned into cost ranges), the standard test is the Chi-Square Test.  Because the number of samples with methodology "A" is likely to differ from the number using "C", the appropriate version is the two-sample Chi-Square test for unequal sample sizes.  If he wants to treat the numerical values as floating point (or real), then the standard test is the Kolmogorov-Smirnov test.  Again, there is a one-sample version (for testing whether the distribution is different from a known distribution) and a two-sample version (for comparing two sampled distributions).

Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T., 1986: Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, Cambridge, UK, pp. 469-475

provides a very readable introduction to all four of these tests.
Note that their copyright is still enforced, so be careful about simply lifting the code from this source (or later editions).
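As a rough illustration of the binned comparison, here is a sketch written from the textbook Pearson contingency-table formula rather than taken from any of the books cited here.  The function name `two_sample_chi_square` and the caller-supplied `bin_edges` are assumptions of this sketch; choosing sensible cost bins is left to the statistician.

```python
def two_sample_chi_square(costs_a, costs_c, bin_edges):
    """Pearson chi-square statistic for two binned samples of unequal size.

    bin_edges: increasing list of edges; bin i covers [edge[i], edge[i+1]),
    and the last bin also includes its upper edge.
    Returns (chi_square, degrees_of_freedom).
    """
    def histogram(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                last = i == len(counts) - 1
                in_bin = bin_edges[i] <= v < bin_edges[i + 1]
                if in_bin or (last and v == bin_edges[-1]):
                    counts[i] += 1
                    break
        return counts

    r, s = histogram(costs_a), histogram(costs_c)
    total_r, total_s = sum(r), sum(s)
    if total_r == 0 or total_s == 0:
        raise ValueError("each sample needs at least one value inside the bins")

    chi2 = 0.0
    used_bins = 0
    for r_i, s_i in zip(r, s):
        if r_i + s_i == 0:
            continue                  # empty bins carry no information
        used_bins += 1
        chi2 += (total_s * r_i - total_r * s_i) ** 2 / (
            total_r * total_s * (r_i + s_i)
        )
    return chi2, used_bins - 1
```

The statistic is then compared against a chi-square distribution with the returned number of degrees of freedom; with only about 20 samples per list, the bins need to be wide enough that each holds several values.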

Knuth, D. E., 1998: The Art of Computer Programming, Vol. 2/Seminumerical Algorithms, Third Edition, Addison-Wesley, Boston, MA, pp. 41-58

provides the full derivation of the K-S distribution, as well as other tests for whether a random number generator is producing reliably random output.  The K-S test is covered on pp. 48-55 (including the algorithm in full detail).
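For readers who want to see the shape of the two-sample comparison, here is a minimal sketch of the K-S statistic, written independently of the books cited above.  The function name `ks_two_sample` is hypothetical, and the p-value uses only the standard asymptotic series, which is rough for lists of about 20 samples.

```python
import math


def ks_two_sample(sample_a, sample_c):
    """Return (D, approximate p-value) for the two-sample K-S test.

    D is the largest vertical distance between the two empirical
    cumulative distribution functions.
    """
    a, c = sorted(sample_a), sorted(sample_c)
    n, m = len(a), len(c)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], c[j])
        while i < n and a[i] <= x:    # step both ECDFs past x (handles ties)
            i += 1
        while j < m and c[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))

    # Asymptotic Kolmogorov distribution; the series converges poorly
    # for very small arguments, where the probability is essentially 1.
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.1:
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)
```

A call such as `ks_two_sample(costs_a, costs_c)` on the two sampled cost lists gives the distance and an approximate significance level; a small p-value suggests the two cost distributions differ.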

At the end of the testing, the statistician can report to the CEO on whether the distribution of ten-year costs for methodology "A" has a statistically significant difference from the distribution of costs with methodology "C".

Note that the conclusion says nothing about why the two methodologies differ.  If they are different, maybe it's because the companies using "A" are using a higher level language.  Maybe it's because the companies using "A" are more experienced or have "higher quality" developers.  Those attributes are irrelevant to the question the CEO asked of the statistician, which was "are the costs for these two methodologies drawn from significantly different distributions?".

In order to understand the causes of the differences, the statistician would need to undertake a much broader investigation with many more samples and request many more pieces of information from the members of the consortium.  Probably the most widely-known example of this kind of exploration is found in

Boehm, B. W., Abts, C., Brown, A. W., Chulani, S., Clark, B. K., Horowitz, E., Madachy, R., Reifer, D. and Steece, B., 2000: Software Cost Estimation with COCOMO II, Prentice Hall PTR, Upper Saddle River, NJ

For about 30 years, Boehm has been researching why software development with different methodologies produces differences in the cost of the delivered software.  The cover of this book shows twenty-one factors that he's parameterized as having an effect on cost, and the contents provide all the details needed to estimate the costs.  Unfortunately for the discussion we've been having, Boehm puts language selection into a parameter he calls LTEX, which is "a measure of the level of programming language and software experience of the project team developing the software system or subsystem." [Boehm, et al., pp. 48-49]

It seems clear that it is quite possible to make a random selection
of development approaches even though the methodology of each team is fixed and determinate at the time of the sampling.  The math for demonstrating the statistical significance of the difference between two sample distributions is also quite clear and well-established.

Bruce B
