[SystemSafety] Qualifying SW as "proven in use" [Measuring Software]

Steve Tockey Steve.Tockey at construx.com
Thu Jun 27 17:10:37 CEST 2013


Matthew,
Yes, presuming the components aren't method sized then that's exactly what I'm saying. Consider a simple example of 2 classes, A and B, both of which have 5 methods. Class A's 5 methods are all of average cyclomatic complexity, with correspondingly average defect densities. When you add up the defect counts for the five class A methods and divide that by the sum of the cyclomatic complexities of those same methods, the defects-per-complexity-point will be, as expected, fairly average.

On the other hand, four of class B's methods have very low cyclomatic complexity while one method has very high cyclomatic complexity. Correspondingly, class B's four low-complexity methods have low defect densities while the high-complexity method has a high defect density. When you add up the defect counts for the five class B methods and divide that by the sum of the cyclomatic complexities of those same methods, the defects-per-complexity-point will also tend to the same average.

So the correlation of defects to cyclomatic complexity at the class (or higher) level will always tend to an average because the low defect counts for the low-complexity methods will tend to compensate for the high defect counts in the high-complexity methods. The bigger the "component" (i.e., the more functions/methods in the set), the more pronounced this averaging effect would be. I'm not at all surprised that people see no correlation beyond the function/method level, and I continue to wonder why people even keep looking for it there.
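To make the averaging concrete, here is a small sketch. All the complexity and defect numbers are invented for illustration only, not measured data:

```python
# Invented numbers illustrating how class-level aggregation hides
# method-level variation in defects per cyclomatic-complexity point.

def defects_per_cc_point(complexities, defects):
    """Aggregate defect density over a set of methods."""
    return sum(defects) / sum(complexities)

# Class A: five methods, all of average complexity and defect count.
class_a_cc      = [6, 6, 6, 6, 6]
class_a_defects = [3, 3, 3, 3, 3]

# Class B: four simple methods plus one very complex, defect-prone one.
class_b_cc      = [2, 3, 2, 3, 20]
class_b_defects = [0, 0, 0, 1, 14]

print(defects_per_cc_point(class_a_cc, class_a_defects))  # 0.5
print(defects_per_cc_point(class_b_cc, class_b_defects))  # 0.5 as well

# At the method level the two classes look nothing alike:
for cc, d in zip(class_b_cc, class_b_defects):
    print(cc, round(d / cc, 2))
```

Both classes come out at exactly the same aggregate defects-per-complexity-point, even though class B's per-method densities range from 0 to 0.7.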

I would be willing to bet that if the data used in the Hatton study were broken down to the function/method level, a clear correlation would appear.


Regards,

-- steve




From: Matthew Squair <mattsquair at gmail.com>
Date: Wednesday, June 26, 2013 8:42 PM
To: Steve Tockey <Steve.Tockey at construx.com>, Bielefield Safety List <systemsafety at techfak.uni-bielefeld.de>
Subject: Re: [SystemSafety] Qualifying SW as "proven in use" [Measuring Software]

Thanks Steve,

The full paper can be found at the link below; note that the metrics were applied to each of the three-thousand-odd components in the library. I take it you'd say that even at the component level (presuming components are not method sized) the view is still too wide to generate a meaningful correlation?

http://www.leshatton.org/wp-content/uploads/2012/01/NAG01_01-08.pdf


On Wed, Jun 26, 2013 at 10:18 PM, Steve Tockey <Steve.Tockey at construx.com> wrote:

"Reading the presentation of Les Hatton's 2008 paper, "The role of empiricism in improving the reliability of future software" he found using empirical techniques in a large scale study (of NAG Fortran and C libraries) that cyclomatic complexity was 'effectively useless' and that no metric strongly correlated (some actually weakly anti-correlated)."

I looked at the slide deck on his web site and it appears to me that he's making the same mistake I referred to earlier:

----- begin cut here -----

Maybe we are using different applications of cyclomatic complexity to
code? Yes, sure, increasing the total number of lines of code in some code
base will almost certainly increase the total number of decisions in that
code base, and probably by roughly an equal proportion. 10,000 lines of
code with 2000 decisions almost certainly implies close to 4000 decisions
in 20,000 lines of code.

But I'm not looking for a correlation of overall, total code base
cyclomatic complexity to overall defects. I'm looking for the correlation
of cyclomatic complexity within a single function/method to the defect
density within that same single function/method. Figure 4 in the Schroeder
paper shows a strong correlation of function/method-level cyclomatic
complexity and function/method-level defect density. Again, reverse
engineering from the numbers in Figure 4, shows that the defect density
goes up by more than an order of magnitude between cyclomatic complexity
less than/equal to 5 vs greater than/equal to 15 ***within a single
function***.

----- end cut here -----

My interpretation of Hatton's results is that he's looking at total cyclomatic complexity in the entire code base. It's not relevant at that level. Look at it at the function/method level and it becomes relevant.
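(For anyone who wants to look at their own code at that level, a rough function-level measurement is easy to script. The sketch below approximates McCabe's number as 1 + the count of decision points, using Python's standard ast module; the set of node types counted is a simplification of the full definition, for illustration only.)

```python
# Rough per-function cyclomatic complexity: 1 + number of decision
# points.  The node types below are a simplified approximation of
# McCabe's definition.
import ast

DECISIONS = (ast.If, ast.For, ast.While, ast.IfExp,
             ast.ExceptHandler, ast.BoolOp, ast.Assert)

def cyclomatic_complexity(source):
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISIONS) for node in ast.walk(tree))

src = """
def classify(x):
    if x < 0:
        return "negative"
    for i in range(x):
        if i % 2:
            pass
    return "done"
"""
print(cyclomatic_complexity(src))  # 4: one base path + three decisions
```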

"So perhaps we should not use metrics, period?"

That a tool gets mis-applied is not the fault of the tool, it's the fault of the tool user. People should be educated in proper use of tools before they use them…


Regards,

-- steve



From: Matthew Squair <mattsquair at gmail.com>
Date: Tuesday, June 25, 2013 7:15 PM
To: Bielefield Safety List <systemsafety at techfak.uni-bielefeld.de>

Subject: Re: [SystemSafety] Qualifying SW as "proven in use" [Measuring Software]

Reading the presentation of Les Hatton's 2008 paper, "The role of empiricism in improving the reliability of future software" he found using empirical techniques in a large scale study (of NAG Fortran and C libraries) that cyclomatic complexity was 'effectively useless' and that no metric strongly correlated (some actually weakly anti-correlated).

So it does seem that there is a basis on which we can empirically judge the efficacy of software metrics.

So perhaps we should not use metrics, period?



On Wed, Jun 26, 2013 at 10:54 AM, Derek M Jones <derek at knosof.co.uk> wrote:
Steve,

> I think we both strongly agree that there really needs to be a lot more
> evidence.

Yes.  No point quibbling over how little "little" might be.

> But I'm not looking for a correlation of overall, total code base
> cyclomatic complexity to overall defects. I'm looking for the correlation
> of cyclomatic complexity within a single function/method to the defect
> density within that same single function/method.

Left to their own devices developers follow fairly regular patterns
of code usage.  An extreme outlier on any metric is suspicious and
often worth some investigation; it might be that the developer had
a bad day, or that the function has to implement some complicated
application functionality, or something else.

Outliers are the low hanging fruit.

The problems start, or rather the time wasting starts, when
specific numbers get written into documents and are used to
judge what developers produce.

> along, what we need in the end is a balancing of a collection of syntactic
> complexity metrics. When functions/methods are split, it always increases
> fan out. When functions/methods are merged, it always decreases fan out.
> The complexity didn't go away, it just moved to a different place in the
> code. So having a limit in only one place easily allows people to squeeze
> it into any other place. Having a set of appropriate limits means there's
> a lot less chance of it going unnoticed somewhere else.

Yes, what we need is lots of good quality data for lots of code
attributes so we can start looking at these trade-offs.
Unfortunately the only good quality data I have involves small
numbers of attributes.
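(The splitting/merging trade-off quoted above is easy to see in miniature. The code below is hypothetical, invented purely for illustration:)

```python
# Illustration only: splitting a function lowers its cyclomatic
# complexity but raises its fan-out; the decisions do not go away.

# Before: one function, cc = 4 (three decisions), fan-out = 0.
def ship_order(order):
    if not order.items:
        raise ValueError("empty order")
    if order.total > 1000:
        order.requires_approval = True
    if order.express:
        order.carrier = "overnight"

# After: each function now has cc = 2 or less, but ship_order_split
# calls three helpers (fan-out = 3).  Total decision count: unchanged.
def check_items(order):
    if not order.items:
        raise ValueError("empty order")

def flag_large(order):
    if order.total > 1000:
        order.requires_approval = True

def pick_carrier(order):
    if order.express:
        order.carrier = "overnight"

def ship_order_split(order):
    check_items(order)
    flag_large(order)
    pick_carrier(order)
```

Each piece now sails under any per-function complexity limit, yet the program's decision count is identical; only a set of limits (complexity plus fan-out, say) would notice.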

Having seen what a hash some researchers make of analysing the data
they have, I am loath to accept findings where the data is not made
available.

> accident. Just the same, I'm basically arguing for more professionalism in
> the software industry. I mean seriously, the programmer who was
> responsible for that single C++ class with a single method of 3400 lines
> of code with a cyclomatic complexity over 2400 is a total freaking moron
> who has no business whatsoever in the software industry.

We are not going to move towards professionalism until there are fewer
software development jobs than half-competent developers.  Hiring
people based on their ability to spell 'software' is not an
environment where professionalism takes root.

I keep telling people that the best way to reduce faults in code is
to start sending developers to prison.  Nobody takes me seriously (ok,
yes, it would probably be a difficult case to bring).

> And, we will also always need semantic evaluation of code (which, as I
> said earlier, has to be done by humans) because syntax-based metrics alone
> will probably always be game-able.

Until strong AI arrives that will not happen.
Even the simpler issue of identifier semantics is still way beyond our
reach.  See:
http://www.coding-guidelines.com/cbook/sent792.pdf
for more than you could ever want to know about identifier selection
issues.

>
> Regards,
>
> -- steve
>
>
>
>
> -----Original Message-----
> From: Derek M Jones <derek at knosof.co.uk>
> Organization: Knowledge Software, Ltd
> Date: Tuesday, June 25, 2013 4:21 PM
> To: "systemsafety at techfak.uni-bielefeld.de" <systemsafety at techfak.uni-bielefeld.de>
> Subject: Re: [SystemSafety] Qualifying SW as "proven in use" [Measuring Software]
>
> Steve,
>
> ...
>> "local vs. global" categories, it's just that nobody has yet published
>> any
>> data identifying which ones should be paid attention to and which ones
>> should be ignored.
>
> So you agree that there is no empirical evidence.
>
> Your statement is also true of almost every metrics paper published
> to date.
>
> With so many different metrics having been proposed at least one of
> them is likely to agree with the empirical data that is yet to be
> published.
>
> You cited the paper: “A Practical Guide to Object-Oriented Metrics”
> as the source of the cyclomatic complexity vs fault correlation
> claim.  Fig 4 looks like it contains the data.  No standard
> deviation is given for the values, but this would have to be
> very large to ruin what looks like a reasonable correlation.
>
> Such a correlation can often be found, however:
>
>      o cyclomatic complexity is just one of many 'complexity'
> metrics that have a high correlation with quantity of code,
> so why not just measure lines of code?
>
>      o once developers know they are being judged by some metric
> or other they can easily game the system by actions such as
> splitting/merging functions.  If the metric has a causal connection
> to the quantity of interest, e.g., faults, then everybody is happy
> for developers to do what they will to reduce the metric,
> but if the connection is simply a correlation (based on code
> written by developers not trying to game the system) then
> developers doing whatever it takes to improve the metric value
> is at best wasted time.
>
>>
>> -----Original Message-----
>> From: Todd Carpenter <todd.carpenter at adventiumlabs.com>
>> Date: Monday, June 24, 2013 7:20 PM
>> To: "systemsafety at techfak.uni-bielefeld.de" <systemsafety at techfak.uni-bielefeld.de>
>> Subject: Re: [SystemSafety] Qualifying SW as "proven in use" [Measuring Software]
>>
>> ST> For example, the code quality measure "Cyclomatic Complexity"
>> (reference:
>> ST> Tom McCabe, "A Complexity Measure", IEEE Transactions on Software
>> ST> Engineering, December, 1976) was validated many years ago by simply
>>
>> DMJ> I am not aware of any study that validates this metric to a
>> reasonable
>> DMJ> standard.  There are a few studies that have found a medium
>> DMJ> correlation in a small number of data points.
>>
>> Les Hatton had an interesting presentation in '08, "The role of
>> empiricism
>> in improving the
>> reliability of future software" that shows there is a strong correlation
>> between
>> source-lines-of-code and cyclomatic complexity, and that defects follow a
>> power law distribution:
>>
>>
>> http://www.leshatton.org/wp-content/uploads/2012/01/TAIC2008-29-08-2008.pdf
>>
>> Just another voice, which probably just adds evidence to the argument
>> that we haven't yet found a trivial metric to predict bugs...
>>
>> -TC
>>
>> On 6/24/2013 6:38 PM, Derek M Jones wrote:
>>> All,
>>>
>>>> Actually, getting the evidence isn't that tricky, it's just a lot of
>>>> work.
>>>
>>> This is true of most things (+ getting the money to do the work).
>>>
>>>> Essentially all one needs to do is to run a correlation analysis
>>>> (correlation coefficient) between the proposed quality measure on the
>>>> one
>>>> hand, and defect tracking data on the other hand.
>>>
>>> There is plenty of dirty data out there that needs to be cleaned up
>>> before it can be used:
>>>
>>>
>>> http://shape-of-code.coding-guidelines.com/2013/06/02/data-cleaning-the-next-step-in-empirical-software-engineering/
>>>
>>>
>>>> For example, the code quality measure "Cyclomatic Complexity"
>>>> (reference:
>>>> Tom McCabe, "A Complexity Measure", IEEE Transactions on Software
>>>> Engineering, December, 1976) was validated many years ago by simply
>>>
>>> I am not aware of any study that validates this metric to a reasonable
>>> standard.  There are a few studies that have found a medium
>>> correlation in a small number of data points.
>>>
>>> I have some data whose writeup is not yet available in a good enough
>>> draft form to post to my blog.  I only plan to write about this
>>> metric because it is widely cited and is long overdue for relegation
>>> to the history of good ideas that did not stand the scrutiny of
>>> empirical evidence.
>>>
>>>> finding a strong positive correlation between the cyclomatic complexity
>>>> of
>>>> functions and the number of defects that were logged against those same
>>>
>>> Correlation is not causation.
>>>
>>> Cyclomatic complexity correlates well with lines of code, which
>>> in turn correlates well with number of faults.
>>>
>>>> functions (I.e., code in that function needed to be changed in order to
>>>> repair that defect).
>>>
>>> Changing the function may increase the number of faults.  Creating two
>>> functions where there was previously one will reduce an existing peak
>>> in the distribution of values, but will it result in fewer faults
>>> overall?
>>>
>>> All this stuff with looking for outlier metric values is pure hand
>>> waving.  Where is the evidence that the reworked code is better not
>>> worse?
>>>
>>>> According to one study of 18 production applications, code in functions
>>>> with cyclomatic complexity <=5 was about 45% of the total code base but
>>>> this code was responsible for only 12% of the defects logged against
>>>> the
>>>> total code base. On the other hand, code in functions with cyclomatic
>>>> complexity of >=15 was only 11% of the code base but this same code was
>>>> responsible for 43% of the total defects. On a per-line-of-code basis,
>>>> functions with cyclomatic complexity >=15 have more than an order of
>>>> magnitude increase in defect density over functions measuring <=5.
>>>>
>>>> What I find interesting, personally, is that complexity metrics for
>>>> object-oriented software have been around for about 20 years and yet
>>>> nobody (to my knowledge) has done any correlation analysis at all (or,
>>>> at
>>>> a minimum they have not published their results).
>>>>
>>>> The other thing to remember is that such measures consider only the
>>>> "syntax" (structure) of the code. I consider this to be *necessary* for
>>>> code quality, but far from *sufficient*. One also needs to consider the
>>>> "semantics" (meaning) of that same code. For example, to what extent is
>>>> the code based on reasonable abstractions? To what extent does the code
>>>> exhibit good encapsulation? What are the cohesion and coupling of the
>>>> code? Has the code used "design-to-invariants / design-for-change"? One
>>>> can
>>>> have code that's perfectly structured in a syntactic sense and yet it's
>>>> garbage from the semantic perspective. Unfortunately, there isn't a way
>>>> (that I'm aware of, anyway) to do the necessary semantic analysis in an
>>>> automated fashion. Some other competent software professionals need to
>>>> look at the code and assess it from the semantic perspective.
>>>>
>>>> So while I applaud efforts like SQALE and others like it, one needs to
>>>> be
>>>> careful that it's only a part of the whole story. More work--a lot
>>>> more--needs to be done before someone can reasonably say that some
>>>> particular code is "high quality".
>>>>
>>>>
>>>> Regards,
>>>>
>>>> -- steve
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Peter Bishop <pgb at adelard.com>
>>>> Date: Friday, June 21, 2013 6:04 AM
>>>> To: "systemsafety at techfak.uni-bielefeld.de" <systemsafety at techfak.uni-bielefeld.de>
>>>> Subject: Re: [SystemSafety] Qualifying SW as "proven in use" [Measuring Software]
>>>>
>>>> I agree with Derek
>>>>
>>>> Code quality is only a means to an end
>>>> We need evidence to show the means actually helps to achieve the
>>>> ends.
>>>>
>>>> Getting this evidence is pretty tricky, as parallel developments for
>>>> the
>>>> same project won't happen.
>>>> But you might be able to infer something on average over multiple
>>>> projects.
>>>>
>>>> Derek M Jones wrote:
>>>>> Thierry,
>>>>>
>>>>>> To answer your questions:
>>>>>> 1°) Yes, there is some objective evidence that there is a correlation
>>>>>> between a low SQALE index and quality code.
>>>>>
>>>>> How is the quality of code measured?
>>>>>
>>>>> Below you say that SQALE DEFINES what is "good quality" code.
>>>>> In this case it is to be expected that a strong correlation will exist
>>>>> between a low SQALE index and its own definition of quality.
>>>>>
>>>>>> For example ITRIS has conducted a study where the "good quality" code
>>>>>> is statistically linked to a lower SQALE index, for industrial
>>>>>> software actually used in operations.
>>>>>
>>>>> Again how is quality measured?
>>>>>
>>>>>> No, there is not enough evidence, we wish there would be more people
>>>>>> working on getting the evidence.
>>>>>
>>>>> Is there any evidence apart from SQALE correlating with its own
>>>>> measures?
>>>>>
>>>>> This is a general problem, lots of researchers create their own
>>>>> definition of quality and don't show a causal connection to external
>>>>> attributes such as faults or subsequent costs.
>>>>>
>>>>> Without running parallel development efforts that
>>>>> follow/don't follow the guidelines it is difficult to see how
>>>>> reliable data can be obtained.
>>>>>
>>>>
>>>
>>
>> _______________________________________________
>> The System Safety Mailing List
>> systemsafety at TechFak.Uni-Bielefeld.DE
>>
>>
>>
>>
>

--
Derek M. Jones                  tel: +44 (0) 1252 520 667
Knowledge Software Ltd          blog: shape-of-code.coding-guidelines.com
Software analysis               http://www.knosof.co.uk
_______________________________________________
The System Safety Mailing List
systemsafety at TechFak.Uni-Bielefeld.DE



--
Matthew Squair

Mob: +61 488770655
Email: MattSquair at gmail.com




