From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM,
	LOTS_OF_MONEY autolearn=unavailable autolearn_force=no version=3.4.4
X-Received: by 10.70.128.67 with SMTP id nm3mr8497790pdb.6.1428162119399;
        Sat, 04 Apr 2015 08:41:59 -0700 (PDT)
X-Received: by 10.140.21.145 with SMTP id 17mr89522qgl.1.1428162119144; Sat,
 04 Apr 2015 08:41:59 -0700 (PDT)
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!usenet.blueworldhosting.com!feeder01.blueworldhosting.com!peer03.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!l13no697569iga.0!news-out.google.com!k20ni2qgd.0!nntp.google.com!z60no690127qgd.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Sat, 4 Apr 2015 08:41:59 -0700 (PDT)
In-Reply-To: <cutvhal9n5jobh5ojsof28r9hncdjabbvk@4ax.com>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com;
 posting-host=104.169.185.59;
 posting-account=Ies7ywoAAACcdHZMiIRy0M84lcJvfxwg
NNTP-Posting-Host: 104.169.185.59
References: <b3592526-729a-4198-a630-696542b3f3be@googlegroups.com>
 <87h9t95cly.fsf@jester.gateway.sonic.net>
 <04f0759d-0377-4408-a141-6ad178f055ed@googlegroups.com>
 <mfkt8l$u4j$1@dont-email.me> <871tk1z62n.fsf@theworld.com>
 <w19y3129v8q1.jkn83uwyte4p$.dlg@40tude.net>
 <87oan56rpn.fsf@jester.gateway.sonic.net>
 <ojcuhapfmqleoec1r4fu7ierr3p85in78r@4ax.com>
 <877fts7fvm.fsf@jester.gateway.sonic.net>
 <cutvhal9n5jobh5ojsof28r9hncdjabbvk@4ax.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2747823f-3d48-47ec-b541-f506e03ee313@googlegroups.com>
Subject: Re: Languages don't  matter.  A mathematical refutation
From: brbarkstrom@gmail.com
Injection-Date: Sat, 04 Apr 2015 15:41:59 +0000
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 8007
X-Received-Body-CRC: 3959548837
Xref: news.eternal-september.org comp.lang.ada:25418
Date: 2015-04-04T08:41:59-07:00
List-Id: <comp.lang.ada>


I find myself disagreeing with Dr. Kazakov and agreeing more with
Dr. Martinez.  In the following scenario, I've tried to be careful
about separating the statistical issues from the information that
is needed to explain why two samples differ.  From that standpoint,
a statistician can test whether two sampled distributions are likely
to come from the same underlying distribution without knowing anything
about the languages, team quality, organizational structure, and the
related items that have consumed the discussion in much of this thread.

Consider a consortium of development organizations.  Some of them use
methodology "A".  Another group uses methodology "C".  All members of
the consortium keep records of their total maintenance costs for ten
years.  The consortium members agree it might be useful to share the
methodology and the cost.  Accordingly, they build a list containing
just three pieces of information for each member: name of the
organization, the development methodology, and the ten-year cost.
For example, the first four items in this list might contain the elements
	Company 1,   "A",   $10,000
	Company 2,   "A",   $25,000
	Company 3,   "C",   $50,000
	Company 4,   "C",    $5,000
and so on.

Company Q's CEO hires a statistician to sample this list.  The
statistician doesn't want to read the whole list, so he arranges to
sample it so that it creates two new lists.  The first of these
contains the information only from companies that use method "A".
The second contains information only from companies that use
method "C".

To ensure that the samples selected from the original list are not
duplicated, the statistician selects the samples by starting with a
random initial index.  He selects the line with that starting index.
It contains just the three properties in the sample above.  If the
methodology is "A", the information goes into the new list with just
"A" information.  If the methodology is "C", it goes into another new
list with just the "C" information.

The statistician's algorithm then lets an integer pseudorandom number
generator select a new positive number and adds that to the index for
selecting the next sample from the original list.  Because the index
only ever increases, no line is selected twice.  The algorithm treats
the three fields from the new sample in the same way as the first
sample.

This process continues until the statistician is satisfied he has
enough samples.  [This will probably be somewhere in the vicinity of
20, although it might be more.]
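
The sampling walk described above can be sketched in a few lines of
Python.  This is only an illustration under my own assumptions, not a
prescribed procedure: the names Record and sample_by_methodology, the
maximum stride max_step, the choice to draw the starting index from
the first max_step lines, and the rule of stopping once both lists
hold `wanted` entries are all hypothetical choices.

```python
import random
from typing import NamedTuple


class Record(NamedTuple):
    """One line of the consortium list (field names are illustrative)."""
    company: str
    methodology: str
    ten_year_cost: float


def sample_by_methodology(records, wanted=20, max_step=5, rng=None):
    """Walk the list with random positive strides and split the picks.

    Returns a dict with one list of records for "A" and one for "C".
    Stops when both lists hold `wanted` records or the list runs out.
    """
    if rng is None:
        rng = random.Random()
    picks = {"A": [], "C": []}
    if not records:
        return picks
    # Assumed: the random initial index falls within the first stride.
    index = rng.randrange(min(max_step, len(records)))
    while index < len(records) and min(len(p) for p in picks.values()) < wanted:
        record = records[index]
        if record.methodology in picks:
            picks[record.methodology].append(record)
        # The stride is at least 1, so the index strictly increases
        # and no line can be selected twice.
        index += rng.randint(1, max_step)
    return picks
```

Passing a seeded generator, e.g. rng=random.Random(42), makes a run
repeatable.  One list may end up longer than the other, which is why
the tests that follow have to cope with unequal sample sizes.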

The statistician now has two lists - selected at random from the
original list.  At this point, he can apply standard statistical tests
to the numerical values for cost to see if the distributions differ
significantly.  If he wishes to treat the numerical values as integers
(that is, as counts in bins), the standard test is the Chi-Square Test.
Because he is comparing two samples with each other, and the number of
samples with methodology "A" is likely to differ from the number using
"C", the appropriate form is the two-sample Chi-Square for unequal
sample sizes.  If he wants to treat the numerical values as floating
point (or real), then the standard test is the Kolmogorov-Smirnov test.
Again, there is a one-sample version (for testing whether the
distribution is different from a known distribution) and a two-sample
version.
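
To make the Kolmogorov-Smirnov case concrete, here is a minimal
sketch of the form of the test that compares two samples with each
other.  It is my own illustration, not the code from either reference
below.  The function names are hypothetical, and the p-value uses
only the large-sample limiting series
Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2),
so with roughly 20 samples per list it should be read as a rough
guide rather than an exact significance level.

```python
import math


def kolmogorov_q(lam):
    """Limiting tail probability Q(lam) of the Kolmogorov distribution."""
    if lam < 0.2:
        return 1.0  # Q is indistinguishable from 1 this close to zero
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, total))


def ks_two_sample(xs, ys):
    """Return (D, approximate p-value) for two samples of costs.

    D is the largest gap between the two empirical cumulative
    distribution functions.  The sample sizes may differ.
    """
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past every value equal to v (handles ties).
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Effective sample size for two samples of sizes n and m.
    lam = math.sqrt(n * m / (n + m)) * d
    return d, kolmogorov_q(lam)
```

Called as ks_two_sample(costs_a, costs_c), a small p-value suggests
that the two cost distributions differ; it says nothing about why.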

Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T.,
1986: Numerical Recipes: The Art of Scientific Computing, Cambridge
University Press, Cambridge, UK, pp. 469-475

provides a very readable introduction to all four of these tests.
Note that their copyright is still enforced, so be careful about simply
lifting the code from this source (or later editions).

Knuth, D. E., 1998: The Art of Computer Programming, Vol. 2/Seminumerical
Algorithms, Third Edition, Addison-Wesley, Boston, MA, pp. 41-58

provides the full derivation of the K-S distribution, as well as other
tests for whether a random number generator is producing reliably
random output.  The K-S test is covered on pp. 48-55 (including the
algorithm in full detail).

At the end of the testing, the statistician can report to the CEO on
whether the distribution of ten-year costs for methodology "A" has a
statistically significant difference from the distribution of costs
with methodology "C".

Note that the conclusion says nothing about why the two methodologies
differ.  If they are different, maybe it's because the companies using
"A" are using a higher level language.  Maybe it's because the
companies using "A" are more experienced or have "higher quality"
developers.  Those attributes are irrelevant to the question the CEO
asked of the statistician - which was "are the costs for these two
methodologies drawn from significantly different distributions?".

In order to understand the causes of the differences, the statistician
would need to undertake a much broader investigation with many more
samples and request many more pieces of information from the members
of the consortium.  Probably the most widely-known example of this
kind of exploration is found in

Boehm, B. W., Abts, C., Brown, A. W., Chulani, S., Clark, B. K.,
Horowitz, E., Madachy, R., Reifer, D. and Steece, B., 2000: Software
Cost Estimation with COCOMO II, Prentice Hall PTR, Upper Saddle River, NJ

For about 30 years, Boehm has been conducting research into why
software development with different methodologies creates differences
in the cost of the software that's delivered.  The cover of this book
shows twenty-one factors that he's parameterized as having an effect
on cost - and the contents provide all the details needed to estimate
the costs.  Unfortunately for the discussion we've been having, Boehm
puts language selection into a parameter he calls LTEX, which is "a
measure of the level of programming language and software experience
of the project team developing the software system or subsystem."
[Boehm, et al., pp. 48-49]

It seems clear that it is quite possible to make a random selection
of development approaches even though the methodology of each team is
fixed and determinate at the time of the sampling.  The math of
demonstrating the statistical significance of difference between two
sample distributions is also quite clear and well-established.

Bruce B