Friday, November 28, 2008

Not so much increase in parallel performance

A few days ago I was celebrating the dramatic improvement in computing
performance from using the "snow" package in R
(http://somerandomwalksofmind.blogspot.com/2008/11/my-first-parallel-program.html).
But later, when I posted the question on the R mailing list
(http://www.r-project.org/mail.html), some people pointed out that in
fact I did not get such a more-than-doubling improvement in performance.

The trick, according to Stefan Evert
(http://www.nabble.com/More-than-doubling-performance-with-snow-td20654005.html),
is to look at the elapsed time in the system.time() output rather than
the user time. Measured that way, the boost in speed is not so large.
For example:

> library(snow)
>
> cc <- makePVMcluster(2)
>
> n.size <- 1000
>
> temp <- NULL
> for(i in 1:10){
+ x <- list(matrix(rnorm(n.size^2),n.size))
+ temp <- c(temp,x)
+ }
>
> system.time(t.1 <- clusterApply(cc,temp,"solve"))
user system elapsed
2.980 0.548 21.909
> system.time(t.2 <- lapply(temp,"solve"))
user system elapsed
24.290 0.636 25.058

So to gain a further increase in execution speed using snow, I may
need a computer with many cores (many more than my current dual-core
laptop).
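
To make the comparison concrete, here is a small sketch that computes
the speedup from the timings printed above. The variable names are my
own, and the "efficiency" figure (speedup divided by the number of
workers) is a common rule of thumb rather than anything snow reports.

```r
# Timings copied from the system.time() output above (seconds)
parallel_elapsed <- 21.909
serial_elapsed   <- 25.058
parallel_user    <- 2.980
serial_user      <- 24.290
n_workers        <- 2      # size of the cluster from makePVMcluster(2)

# Speedup as seen on the wall clock: this is the honest number
speedup_elapsed <- serial_elapsed / parallel_elapsed

# Ratio of user times: misleading, because the user time of the
# clusterApply() call does not appear to include the workers' CPU time
speedup_user <- serial_user / parallel_user

# Fraction of the ideal n_workers-fold speedup actually achieved
efficiency <- speedup_elapsed / n_workers

cat("elapsed speedup:", round(speedup_elapsed, 2), "\n")
cat("user-time ratio:", round(speedup_user, 2), "\n")
cat("efficiency:     ", round(efficiency, 2), "\n")
```

The ratio of elapsed times is only a little above 1, while the ratio
of user times is above 8, which is why my first reading was so far
off.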

Sunday, November 23, 2008

My first parallel program

My new laptop computer is a Lenovo Thinkpad T400 with an Intel Core 2
Duo processor, so I was naturally tempted to experiment with parallel
programming. The result was more than exciting. I tried the
following program:

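library(snow)
cc <- makePVMcluster(2)   # start two worker processes
n.loop <- 10              # number of matrices
n.size <- 1000            # each matrix is n.size x n.size
temp <- NULL
for(i in 1:n.loop){
  x <- list(matrix(rnorm(n.size^2),n.size))
  temp <- c(temp,x)       # temp ends up as a list of n.loop matrices
}

system.time(t.1 <- clusterApply(cc,temp,"solve"))   # parallel
system.time(t.2 <- sapply(temp,"solve"))            # serial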

The serial program took 23.9 seconds while the parallel one took only
3.0 seconds. That's more than double the performance!

Then I changed n.loop to 100 and n.size to 100. This time serial
processing outperformed parallel processing.

I guess this is a result of the trade-off between communication time
and computation time. When the computational task is relatively heavy,
the time spent communicating between processing units is relatively
insignificant, so the parallel method works better. When the
computational task is light, more time is wasted sending messages,
and the parallel way takes more time.
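
The trade-off can be put into a toy cost model. This is only a sketch
under assumptions of my own: the function names, the per-task
communication cost comm_time, and all the numbers are hypothetical and
are not measurements from snow. I assume the computation is split
evenly across the workers while the master handles the messages for
each task one at a time.

```r
# Serial cost: every task is computed one after another
serial_time <- function(n_tasks, task_time) {
  n_tasks * task_time
}

# Parallel cost (toy model, assumption): tasks run in rounds of
# n_workers at a time, and each task adds comm_time of messaging
parallel_time <- function(n_tasks, task_time, comm_time, n_workers) {
  rounds <- ceiling(n_tasks / n_workers)
  rounds * task_time + n_tasks * comm_time
}

# Heavy tasks (hypothetical numbers): few tasks, each costing seconds
heavy_serial   <- serial_time(10, 2.5)
heavy_parallel <- parallel_time(10, 2.5, comm_time = 0.1, n_workers = 2)

# Light tasks (hypothetical numbers): many tasks, each costing little
light_serial   <- serial_time(100, 0.002)
light_parallel <- parallel_time(100, 0.002, comm_time = 0.01, n_workers = 2)

cat("heavy: serial", heavy_serial, "parallel", heavy_parallel, "\n")
cat("light: serial", light_serial, "parallel", light_parallel, "\n")
```

In this model the parallel version wins only when task_time is large
compared with comm_time, which matches what I saw when I shrank the
matrices.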

Monday, November 17, 2008

Ranking living organisms for environmental protection

I have to admit that my thoughts were wicked in today's English as a
Second Language class. During the discussion session we were faced
with a controversy over the trade-off between environmental protection
and economic concerns. The textbook Raise the Issue gave a very
extreme case: in Tennessee, the construction of a dam was suspended
in the name of protecting an endangered species of snail. I was very
surprised by this. The dam may benefit many people in terms of flood
control and hydroelectricity production, and the latter is, on the
other hand, good for the environment.

Then it occurred to me that no consensus can be reached if we keep
agonizing over these choices. Wildlife and human welfare can hardly
both be preserved at the same time. We have to devise a cut point, as
we simply cannot save every living thing with our efforts. An ideal
cut line is one where, on one side, the wildlife is protected, and on
the other, wildlife protection is not assigned top priority.

A very intuitive solution is to rank living organisms according to
their genetic similarity to humans. This argument is well founded if
we consider the ultimate goal of environmental protection to be
self-protection. The closer we are to some non-human life form, the
more closely related our fates will be. The deterioration of the
habitats of these life forms may pose an immediate threat to human
society.

Another issue that easily arises is how to find the cut point.
Consider the biosphere as a complicated food-chain graph, where each
point stands for a species and each edge stands for a dependency
relationship. We may simply find, within our budget, the set of
points with the greatest sum of similarity and the smallest
dependency on those not included in the set.
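
To show what such a search might look like, here is a brute-force
sketch on a made-up graph of four species. Everything in it is
hypothetical: the species labels, the similarity scores, the
protection costs, the dependency edges, the budget, and the penalty
weight that trades similarity against dependencies leaving the set.
It is one possible way to write down the idea, not a worked-out
method.

```r
# All values below are invented for illustration only
species    <- c("A", "B", "C", "D")
similarity <- c(A = 0.9, B = 0.7, C = 0.4, D = 0.2)  # similarity to humans
cost       <- c(A = 3,   B = 2,   C = 2,   D = 1)    # cost of protection

# dep[i, j] = 1 means species i depends on species j
dep <- matrix(0, 4, 4, dimnames = list(species, species))
dep["A", "C"] <- 1
dep["B", "D"] <- 1

budget  <- 6
penalty <- 0.5   # cost of each dependency on an unprotected species

# Score a candidate set given as a logical vector over species
score_set <- function(in_set) {
  if (sum(cost[in_set]) > budget) return(-Inf)   # over budget
  outside_deps <- sum(dep[in_set, !in_set, drop = FALSE])
  sum(similarity[in_set]) - penalty * outside_deps
}

# Enumerate all 2^4 subsets, one per row, and keep the best one
subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(species))))
scores  <- apply(subsets, 1, score_set)
best    <- species[subsets[which.max(scores), ]]

cat("protected set:", best, "\n")
```

Enumerating every subset is only feasible for a handful of species;
a real food web would need a smarter search.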

This may provide some rational criteria for environmental issues and
save the time and energy spent on debates and campaigns. Some may
argue that living things that are distant from humans may also be
important. However, they may also get protected if they live within
the protection area of the larger life forms that are listed as
protected.

Of course, the idea above is no simple task. While current
developments in biological science can provide a tree of phylogenetic
similarity, the graph of direct dependence may not be easily obtained.
Besides, the budget mentioned above is difficult to estimate. Thus
it may take the combined effort of (bio)statisticians, biological
scientists, and economists to achieve the ultimate results.