May 16, 2013

Short performance tuning story

Recently +Samuel Lampa has published a blog entry about some simple bioinformatics processing he tried in Python and D. Later more languages were added, D version was hand-tuned to top speed and so on, but what caught my attention is terribly bad performance of initial D version he has wrote, one based on Phobos utilities.

I was curious: "how can naive D implementation behave worse than naive Python implementation?". Good reason to grab some code and start investigating. All code samples used are identical to ones provided by Samuel in his blog, I won't copy them here.

Initial results before any investigation:

$ time python reference.py
37.62173013938618

real 0m9.407s
user 0m9.090s
sys 0m0.047s

$ time pypy reference.py
37.6217301394

real 0m3.496s
user 0m3.367s
sys 0m0.063s

$ time ./phobos
37.6217

real 0m6.806s
user 0m6.763s
sys 0m0.027s

$ time ./manual
37.6217

real 0m0.505s
user 0m0.487s
sys 0m0.013s

So bad :( Phobos version is twice as slow as Python one using pypy and order of magnitude slower than manually written one. Lets see what crucial differences exist between manual and Phobos version and what can be done about it.

First alarm sounds when you notice that manual version uses char instead of dchar and ignores all Unicode possibilities. So source file with DNA sequence is guaranteed to be an ASCII one. Makes sense. However, all functions in std.string must be ready to encounter UTF-8 as this is standard encoding for string type. Checking countchars source shows that it iterates via provided string using dchar symbols. That means that first character bits need to be checked to detect if Unicode code point is encountered and then copy symbol to dchar temporary. Sounds like minor stuff? Results after making countchars ASCII-only:

$ time ./phobos 
37.6217

real 0m3.689s
user 0m3.630s
sys 0m0.040s

Twice the performance at once, coming close to pypy. Sure, those checks were minor but when all your program does is actually decoding symbol sequence and doing few comparisons, it does matter.

Second thing that caught my attention is that countchars actually uses inPattern function inside, which implements custom pattern support (like "a-z" ranges). Python version uses regexp which I am quite sure is cached at least in pypy version but poor Phobos is quite likely to re-parse the pattern every time the function is called. And we don't even need that advanced stuff here. New stupid implementation:

bool inPattern(S)(char c, in S pattern)
{
foreach (size_t i, char p; pattern)
if (c == p)
return true;

return false;
}

Results:

$ time ./phobos 
37.6217

real 0m2.323s
user 0m2.300s
sys 0m0.017s

So far, so good.

Time to do third step and copy obvious logical optimization from manual version to avoid counting same characters twice:
    auto tmp = phobos.countchars(line, "GC");
gcCount += tmp;
totalBaseCount += (tmp + phobos.countchars(line, "TA"));

Results:
$ time ./phobos 
37.6217

real 0m1.610s
user 0m1.587s
sys 0m0.020s

Almost there.

Last step is simple - compare same code using compiler that can actually do clever optimizations. LDC for the rescue:


$ ldmd2 -O -inline -release manual.d
$ time ./manual
37.6217

real 0m0.367s
user 0m0.340s
sys 0m0.023s

$ ldmd2 -O -inline -release phobos.d
$ time ./phobos
37.6217

real 0m0.726s
user 0m0.713s
sys 0m0.013s
Less than 2x difference now. I am satisfied and leaving it up to you to try any further optimizations :)

GtkD 2.2

A new version of GtkD has been released and is available for download at gtkd.org. This version is built on top of Gtk+ 3.8 and GStreamer 1.0. There’s also a new GtkD forum using vibenews and vibe.d.

May 15, 2013

DConf 2013 Day 1 Talk 4

Click through to YouTube for more info, or join the discussion on reddit.

May 13, 2013

DConf 2013 Day 1 Talk 3

Click through to the YouTube page for more info, or join in the discussion on reddit.

May 11, 2013

Visual D 0.3.36

I’ve missed a lot of D news during my mostly-offline, 3-week trip to the States. I decided not to dig through the newsgroups to sum it all up here, primarily because my hot dog shop is so busy now (thanks to a fantastic magazine review and several favorable blog posts) that I’ve hardly had time to even look at a computer. I did manage to get caught up on some Derelict stuff and I’ve been checking newsgroups for the latest announcements.

That brings me to the very latest, which is the real point of this post. Rainer has announced a new version of VisualD, which includes Alexander Bothe’s DParser from Mono-D. There’s also a new “Compile & Run” command along with support for LDC.

On a side note. I was surprised to learn at DConf that Rainer wasn’t already an expert at the Visual Studio framework when he first started the project, but had to learn a lot along the way. He surely must be by now!

DConf 2013 Day 1 Talk 2: Copy and Move Semantics in D by Ali Cehreli

May 10, 2013

Moar Languagez! GC content in Python, D, FPC, C++ and C (and now Go)

Update May 11: Daniel Spångberg now provided a few even more optimized versions: DS Tbl: a table optimized version of the line-by-line reading, as well as optimized versions of the ones that read the whole file at once. On the fastest one, I also tried with -funroll-all-loops to gcc, to see how far we could get. The result (for a 58MB text file): 66 milliseconds!

Update May 16: Thanks to Franzivan Bezerra's post here, I now got some Go code here as well. I modified Francivan's gode a bit, to comply with our latest optimizations here, as well as created a table optimized version (similar to Daniel Spångberg's Table optimized C code). To give a fair comparison of table optimization, I also added a table optimized D version, compiled with DMD, GDC and LDC. Find the results in the first graph below.

Update II, May 16: The link to the input file, used in the examples, got lost. It can be downloaded here!

Update III, May 16: Thanks to Roger Peppe, suggesting some improved versions in this thread on G+, we now got two more, very fast Go versions, clocking even slightly below the idiomatic C++ code. Have a look in the first graph and table below!

Update IV, May 16: Now applied equivalent optimizations like Roger's ones for Go mentioned above, to the "D Tbl" versions (re-use the "line" var as buffer to readln(), and use foreach instead of for loop). Result: D and Go gets extremely similar numbers: 0.317 and 0.319 was the best I could get for D and Go, respectively. See updated graphs below.

Update May 17: See this post on how Francivan Bezerra acheived same performance as C with D, beating the other two (Go and Java) in the comparison. (Direct link to GDoc)

My two previous blog posts (this one and this one) on comparing the performance of a number of languages for a simple but very typical bioinformatics problem, spurred a whole bunch of people to suggest improvements. Nice! :) Many of the comments can be read in the comments on those posts, as well as in the G+ thread, while I got improvements sent by mail from Thomas Hume (thomas.hume -at- labri.fr) and Daniel Spångberg.

Anyway, here comes an updated post, with a whole array of improvements, and a few added languages.

One suggestion I got (by Thomas Koch, I think), was applicable to all the code examples that I got sent to me, namely to count GC, and AT separately, and add to the total sum only in the end. It basically cut a few percent of all execution times, so I went through all the code variants and added it. It did improve the speed in a simipar way for all the codes.

Many people suggested to read the whole file in once instead of reading line by line though. This can cause problems with the many very large files in bioinformatics, so I have divided the results below in solutions that read line by line, and those that read more than that.

Among the code examples I got, was "our own" Daniel Spångberg of UPPMAX, who wrote a whole bunch of C variants, even including an OpenMP parallelized variant, which takes the absolute lead. Nice! :)

More credits in the explanation of the codes, under each of the results sections.

Ok, so let's go to the results:

Performance results

Reading line by line

Here are first the codes that read only line by line. I wanted to present them first, to give a fair comparison.

Python1.692
Cython1.684
Go (8g)1.241
Go (gcc)0.883
Go Tbl (gcc)0.785
Go Tbl (8g)0.767
PyPy0.612
FPC (MiSchi)0.597
D (GDC)0.589
FPC0.586
FPC (leledumbo)0.567
D (DMD)0.54
D Tbl (DMD)0.492
D (LDC)0.479
C++ (HA)0.385
D Tbl (GDC)0.356
Go Tbl (RP) (8g)0.337
Go Tbl (RP) Ptr (8g)0.319
D Tbl (LDC)0.317
C (DS)0.276
C (TH)0.252
C (DS Tbl)0.183

Explanation of code, with links

All of the below codes have have been modified to benefit from the tip from Thomas Koch (thomas -at- koch.ro), to cound AC and GC separately in the inner loop, and sum up only in the end.

Click the language code to see the code!

  • Python: My code, with improvents from lelledumbo (here on the blog).
  • Cython: Same as "Python", compiled to C with Cython, with statically typed integers.
  • Go (8g): Go code compiled with inbuilt Go compiler.
  • Go (gcc): Go code compiled with GCC.
  • Go Tbl (gcc): Go code, table optimized, and compiled with GCC.
  • Go Tbl (8g): Go code, table optimized, and compiled with inbuilt Go compiler.
  • PyPy: My python code compiled with PyPy (1.9.0 with GCC 4.7.0)
  • FPC (MiSchi): FreePascal version improved by MiSchi (Tonne Schindler)
  • D (GDC): My D code, with many improvements suggested from the folks in the G+ thread. Compiled with GDC.
  • FPC: My FPC code
  • FPC (leledumbo): FreePascal version improved by Leledumbo (leledumbo_cool -at- yahoo.co.id)
  • D Tbl (DMD): My table optimized D code, compiled with DMD.
  • D (DMD): My D code, compiled with DMD.
  • D Tbl (GDC): My table optimized D code, compiled with GDC.
  • D (LDC): My D code, compiled with LDC.
  • C++ (HA): Harald Aschiz's C++ code.
  • Go Tbl (RP) (8g): Roger Peppe's optimized Go code, omitting string conversion and using range.
  • Go Tbl (RP) Ptr (8g): Roger Peppe's optimized Go code, additionally using pointers for the table opt.
  • D Tbl (LDC): My table optimized D code, compiled with LDC.
  • C (DS): Daniel Spångberg's C code (See his even faster ones in the next section!)
  • C (TH): Thomas Hume's C code.
  • C (DS Tbl): Daniel Spångberg's line-by-line version, with table optimization.

Reading more lines, or whole file

Here are the codes that don't keep themselves to reading line by line, but do more than so. For example Gerdus van Zyl's pypy-optimized code reads a bunch of lines at a time (1k seems optimal), before processing them, while still allowing to process in a line-by-line fashion. Then there is Daniel Spångberg's exercise in C performance, where by reading the file in at once, and applying some smart C tricks, and even doing some OpenMP-parallelization, cuts down the execution time under 100 ms, and that is for a 1million line file, on a rather slow MacBook air. Quite impressive!

PyPy (GvZ) PyPy PyPy (GvZ) + Opt C (DS Whole) C (DS Whole Tbl) C (DS Whole Tbl Par) C (DS Whole Tbl 16bit) C (DS Whole Tbl Par 16bit) C (DS Whole Tbl Par 16bit Unroll)
0.862s 0.626s 0.551s 0.178s 0.114s 0.095s 0.084s 0.070s 0.066s

Explanation of code, with links

Note: All of the below codes have been modified to benefit from the tip from Thomas Koch (thomas -at- koch.ro), to cound AC and GC separately in the inner loop, and sum up only in the end.

Conclusions

I think there are a number of conclusions one can draw from this:

  • C is in it's own class, when it comes to speed, which is seen in both of the categories above.
  • C++ is also fast, but D with the right compiler is not far behind.
  • Generally D and FPC is very close!
  • Pure python is not near the compiled ones, but surprisingly (for me), by using PyPy, it becomes totally comparable with compiled languages!
  • If you look both at the speed, and the compactness and readability of code at the same time, I find D to be a very strong competitor in this game!

What do you think?

read more

Type-safe tagged unions in the D programming language

A tagged union

Sometime you have some data that can be of type A or B. Very common use cases include decoding various file format or network protocols, communicate with memory mapped devices, some processing that can return an A or a B depending on its result, and many more.

To resolve that issue, it is common to use a union. A union is a aggregate where all members share the same memory. It is very handy, but use it wrongly and you end up badly messing up your memory. I’ll demonstrate in this article how to make this safe in D.

The first step toward this goal is to create a tagged union: a union associated with a tag that indicates which element of the union is currently valid.

1
2
3
4
5
6
7
8
9
10
11
12
struct TaggedUnion {
    union {
        A a;
        B b;
    }

    enum Tag {
        A, B
    }

    Tag tag;
}

Now we have a nice struct that contains either an A or a B, and a tag that indicate which one it is. But this is still unsafe, use it wrong and you’ll wreck your memory. We need to make it safe.

Some encapsulation

The next obvious step is to build the struct in the proper state right away. Let’s put some constructors in place, let’s make all the data private, and provide access to these via controlled, typed, ways. We need to provide a method that checks the tag and calls the right user code with the correct type. Thankfully, D allow to do that – without performance penalty – using template alias parameters.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
struct TaggedUnion {
private:
    union {
        A a;
        B b;
    }

    enum Tag {
        A, B
    }

    Tag tag;

public:
    this(A a) {
        tag = Tag.A;
        this.a = a;
    }

    this(B b) {
        tag = Tag.B;
        this.b = b;
    }

    auto ref apply(alias fun)() {
        final switch(tag) with(Tag) {
            case A:
                return fun(a);

            case B:
                return fun(b);
        }
    }
}

At this point you probably wonder how you can use this. This is dead simple, you simply call the function apply with the code you want to run as a template parameter. Your code will be instantiated with all the possible types.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
TaggedUnion t = ...;

void process(T)(T data) {
    alias Type = typeof(data);

    static if(is(Type == A)) {
        // Code that handle the case where it is an A.
    } else static if(is(Type == B)) {
        // Code that handle the case where it is an B.
    } else {
        static assert(0, "You must handle type " ~ Type.stringof);
    }
}

t.apply!process();

Be careful, as type inference can get in your way here. If case A and B don’t return the same type, you may want to specify it explicitly. Another pain point is that a function that does not return is assumed by dmd to be of return type void (when it can be anything). If you want to throw in some cases, you’ll also need to be explicit. You also can’t use a local function for reasons that seem unclear to me – most likely a dmd bug (2.062).

A generic solution

Now we can repeat this pattern all over the place, but at some point a more generic solution becomes worthwhile. D is quite powerful at this game, so let’s leverage this. We will use some string mixins to construct the exact same union as we did before, but dynamically with several types. Nothing new here, we simply ask the compiler to write the code for us.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
template TaggedUnion(Types ...) {
    private auto getUnionContent() {
        string s;
        foreach(T; Types) {
            s ~=  fullyQualifiedName!T ~ " member_" ~ T.mangleof ~ ";";
        }

        return s;
    }

    private auto getTag() {
        string s;
        foreach(T; Types) {
            s ~= T.mangleof ~ ",";
        }

        return "enum Tag {" ~ s ~ "}";
    }

    private auto getSwitchContent() {
        string s;
        foreach(T; Types) {
            s ~= "case Tag." ~ T.mangleof;
            s ~= ": return fun(member_" ~ T.mangleof ~ ");";
        }

        return s;
    }

    struct TaggedUnion {
    private:
        union {
            mixin(getUnionContent());
        }

        mixin(getTag());

        Tag tag;

    public:
        this(T)(T t) if(is(typeof(mixin("Tag." ~ T.mangleof)))) {
            mixin("tag = Tag." ~ T.mangleof ~ ";");
            mixin("member_" ~ T.mangleof ~ " = t;");
        }

        auto ref apply(alias fun)() {
            final switch(tag) {
                mixin(getSwitchContent());
            }
        }
    }
}

D allows for very expressive type construction, and that is awesome! This small example is simple and shows perfectly how to do it. Unlike many abstractions in programming, this one doesn’t cost anything at runtime as it uses compile time feature of D. This is probably one of the biggest advantages of D : allowing you to build nice abstraction completely at compile time and have the optimizer remove the excess for you.

PS: Once again, a big thank to John Colvin for his help in correcting the article.

May 09, 2013

GC content continued: Python vs D vs C++ vs Free Pascal

My previous blog post, comparing a simple character counting algorithm between python and D, created an interesting discussion on Google+, with many tuning tips for the D version, and finally we also had a C++ version (kindly provided by Harald Achitz).

Then I could not resist to try out something I've been thinking about for a while (comparing D with Free Pascal (FPC)), so I went away and jotted down an FPC version as well (my first FPC code ever, I think).

Anyway, I now just wanted to summarize a comparison between all of these! I have added below the python code, the faster of the two D versions in my previous post, as well as the new additions: Harald's C++ version and my FPC version.

Performance results

PythonD (DMD)D (GDC)D (LDC)C++FPC
7.904s0.652s0.668s0.538s0.399s0.629s

Numbers are execution time in seconds, for counting the fraction of "G" and "C" characters compared to "A", "T", "C" and "T" in a 58 MB text file of nearly 1 million lines of text. And so, why not plot it as well:

My comments

Not surprising maybe that C++ is the fastest. I find it interesting though that both D and FPC perform in the same ballpark at least, with code written by a novice user (although with some minor tweaking).

For me personally, I also got a first answer on the question I've been wondering about for a while: Will D or FreePascal be faster, given that Pascal is an older and more mature language, and doesn't use a Garbage collector, while D on the other hand have better compiler support, which it can benefit from (like LLVM).

Since I've been betting the most on D lately, I find it interesting to see that still, D can perform better, at least with the right compiler. Still I'm impressed that FPC is totally on par with D, and beaten only when D is compiled with the LLVM backed LDC.

On another note though, I tend to like the compactness of the D code. It's not really that far from the python version, if omitting the closing curly brackets, no? Not bad for a statically compiled high performance language! :) (That is, while FreePascal has it's merits from using mostly natural language words, over quirky symbols, which IMO can enhance readability a bit).

The code

Python version

import re
import string
 
def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
    gcCount = 0
    totalBaseCount = 0
    for line in file:
        line = line.strip("\n")
        if not line.startswith(">"):
            gcCount += len(re.findall("[GC]", line))
            totalBaseCount += len(re.findall("[GCTA]", line))
    gcFraction = float(gcCount) / totalBaseCount
    print( gcFraction * 100 )
 
 
if __name__ == '__main__':
    main()

D version

import std.stdio;
import std.string;
import std.algorithm;
import std.regex;
 
void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line;
    char c;
    while (!file.eof()) {
        line = file.readln();
        if (line.length > 0 && !startsWith(line, ">")) {
            for(uint i = 0; i < line.length; i++) {
                c = line[i];
                if (c == 'C' || c == 'T' || c == 'G' || c == 'A') {
                    totalBaseCount++;
                    if ( c == 'G' || c == 'C' ) {
                        gcCount++;
                    }
                }   
            }
        }
    }
    float gcFraction = ( cast(float)gcCount / cast(float)totalBaseCount );
    writeln( gcFraction * 100 );
}

C++ version

(Kindly shared by Harald Achitz!)

#include <string>
#include <fstream>
#include <iostream>
 
int main() {
 
    std::fstream fs("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", std::ios::in);
 
    int gcCount = 0;
    int totalBaseCount = 0;
 
    if ( fs.is_open() )  {
        std::string line;
        while (std::getline(fs, line)) {
            if(*(line.begin()) != '>') {
                for(std::string::const_iterator pos = line.begin(); pos!=line.end() ; ++pos){
                     switch (*pos){
                       case 'A' :
                         totalBaseCount++;
                         break;
 
                       case 'C' :
                         gcCount++;
                         totalBaseCount++;
                         break;
 
                       case 'G' :
                         gcCount++;
                         totalBaseCount++;
                         break;
 
                       case  'T' :
                         totalBaseCount++;
                         break;
 
                       default:
                         break;
                     }
                }
          }
    }
    //std::cout << "gcCount: " << gcCount << " / totalBaseCount: "<< totalBaseCount << " = " ;
    std::cout << ( (float)gcCount / (float)totalBaseCount ) * 100 << std::endl ;
  } else {
    std::cout << "can't open file" ;
  }
  return 0 ;
}

Free Pascal (FPC)

program gc;
{$mode objfpc} // Do not forget this ever
 
uses
    Sysutils;
 
var
    FastaFile: TextFile;
    CurrentLine: String;
    GCCount: Integer;
    TotalBaseCount: Integer;
    i: Integer;
    c: Char;
    GCFraction: Single;
 
begin
    GCCount := 0;
    TotalBaseCount := 0;
 
    AssignFile(FastaFile, 'Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa'); 
    {$I+} //use exceptions
    try  
        Reset(FastaFile);
        repeat
            Readln(FastaFile, CurrentLine); 
            i := 0;
            while i < Length(CurrentLine) do
            begin
                c := CurrentLine[i];
                if (c = 'A') or (c = 'G') or (c = 'C') or (c = 'T')  then
                begin
                    TotalBaseCount := TotalBaseCount + 1;
                    if (c = 'G') or (c = 'C') then
                        GCCount := GCCount + 1;
                end;
                i := i + 1;
            end;
        until(EOF(FastaFile)); 
        CloseFile(FastaFile);
    except
        on E: EInOutError do
        begin
            Writeln('File handling error occurred. Details: '+E.ClassName+'/'+E.Message);
        end;    
    end;
    GCFraction := GCCount / TotalBaseCount;
    Writeln(FormatFloat('00.0000', GCFraction * 100));
end.

Compile flags

I used the following compile commands / flags for the test above (from the Makefile):

d:
        dmd -ofgc_d_dmd -O -release -inline gc.d
        gdmd -ofgc_d_gdc -O -release -inline gc.d
        ldmd2 -ofgc_d_ldc -O -release -inline gc.d
cpp:
        g++ -ogc_cpp -O3 gc.cpp
 
fpc:
        fpc -ogc_fpc -Ur -O3 -TLINUX gc.pas

Let me know if there is other flags that should be used, for some of the above compilers!

read more

Calculating GC content in python and D - How to get 10x speedup in D

Update May 16: More optimizations and languages have been added here, and here (last one includes Python, D, FreePascal, C++, C and Go).

I'm testing to implement some simple bioinformatics algos in both python and D, for comparison.

My first test, of simply calculating the GC content (fraction of DNA letters G and C, as compared to G, C, T and A), reveals some hints about where D shines the most.

Based on this very limited test, D:s strength in terms of performance, seems to be more clear when you hand-code your inner loops, than when using some of the functionality of the standard library (phobos).

When using the countchars function in std.string of D:s standard library, D is on par with, but a tiny bit slower than python. When I hand-code the inner loop more carefully though, I get a 10x speedup of the D code, over my python code.

Test data

As test data, I use a 58MB Fasta file, containing approximately 1 million (989561) lines of DNA sequence.

Python reference implementation

As a reference, I wrote this little python script. Probably there is a faster way to write it, but this is the fastest I could come up with.

import re
import string
 
def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r")
    gcCount = 0
    totalBaseCount = 0
    for line in file:
        line = line.strip("\n")
        if not line.startswith(">"):
            gcCount += len(re.findall("[GC]", line))
            totalBaseCount += len(re.findall("[GCTA]", line))
    gcFraction = float(gcCount) / totalBaseCount
    print( gcFraction * 100 )
 
 
if __name__ == '__main__':
    main()

This takes around 8 seconds on my machine:

[samuel gc]$ time python gc_test.py 
37.6217301394
 
real    0m7.904s
user    0m7.876s
sys     0m0.020s

Then, I tried two implementations in D. First I tried to make my life as simple as possible, by using existing standard library functionality (the countchars function in std.string):

D implementation, using countchars

import std.stdio;
import std.string;
import std.algorithm;
import std.regex;
 
void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line = "";
    while (!file.eof()) {
        line = chomp(file.readln());
        if (!startsWith(line, ">")) {
            gcCount += countchars(line, "[GC]");
            totalBaseCount += countchars(line, "[GCTA]");
        }
    }
    float gcFraction = ( cast(float)gcCount / totalBaseCount );
    writeln( gcFraction * 100 );
}

The results: slightly slower than python with DMD, and slightly faster with GDC and LDC:

[samuel gc]$ dmd -ofgc_slow_dmd -O -release -inline gc_slow.d
[samuel gc]$ gdmd -ofgc_slow_gdc -O -release -inline gc_slow.d
[samuel gc]$ ldmd2 -ofgc_slow_ldc -O -release -inline gc_slow.d
[samuel gc]$ for c in gc_slow_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_slow_dmd:
37.6217
 
real    0m8.088s
user    0m8.049s
sys     0m0.032s
---
gc_slow_gdc:
37.6217
 
real    0m6.791s
user    0m6.764s
sys     0m0.020s
---
gc_slow_ldc:
37.6217
 
real    0m7.138s
user    0m7.096s
sys     0m0.036s

D implementation with hand-written inner loop

On the other hand, when I hand code the string comparison code, like this:

import std.stdio;
import std.string;
import std.algorithm;
import std.regex;
 
void main() {
    File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r");
    int gcCount = 0;
    int totalBaseCount = 0;
    string line;
    char c;
    while (!file.eof()) {
        line = file.readln();
        if (line.length > 0 && !startsWith(line, ">")) {
            for(uint i = 0; i < line.length; i++) {
                c = line[i];
                if (c == 'C' || c == 'T' || c == 'G' || c == 'A') {
                    totalBaseCount++;
                    if ( c == 'G' || c == 'C' ) {
                        gcCount++;
                    }
                }   
            }
        }
    }
    float gcFraction = ( cast(float)gcCount / cast(float)totalBaseCount );
    writeln( gcFraction * 100 );
}

... then I get some significant speedup over python, around 10x (13x with the fastest combination of compiler and optimization flags)!:

[samuel gc]$ dmd -ofgc_fast_dmd -O -release -inline gc_fast.d
[samuel gc]$ gdmd -ofgc_fast_gdc -O -release -inline gc_fast.d
[samuel gc]$ ldmd2 -ofgc_fast_ldc -O -release -inline gc_fast.d
[samuel gc]$ for c in gc_faster_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_faster_dmd:
37.6217
 
real    0m0.652s
user    0m0.624s
sys     0m0.024s
---
gc_faster_gdc:
37.6217
 
real    0m0.668s
user    0m0.644s
sys     0m0.020s
---
gc_faster_ldc:
37.6217
 
real    0m0.538s
user    0m0.520s
sys     0m0.016s

Version info

  • Machine: MacBook air "11, with 1.7GHz Core i5 (dual core).
  • OS: Lubuntu 12.10 32bit
  • Python: 2.7.3
  • DMD version: 2.062
  • GDC(GCC?) version: 4.6.3
  • LDC: 0.10.0 (based on DMD 2.0.60 and LLVM 3.2)

read more

May 08, 2013

Video of Walter’s DConf 2013 Keynote

Both the video and slides for Walter’s keynote at DConf 2013 are now available online. I haven’t looked at it yet, but some people have had some issues. Some pdf versions of the slides have also been posted. Follow the newsgroup announcement for more info.

EDIT: Might as well embed it here since it’s up at  YouTube.

May 06, 2013

Low-Lock Singletons In D

DConf 2013 was a huge success!  I’ve finally recovered enough from my red-eye flight back to the East Coast to blog about a topic that really stuck out in my mind from it.

At DConf, I gave a talk on D-specific design patterns.  (The video will be up on dconf.org shortly.)  One pattern in particular attracted much more audience attention than I thought it would, so I decided to blog about it:  The Low-Lock Singleton Pattern.

Most people who program for a living are probably familiar with the traditional Singleton Pattern, which is used to create a single, globally available instance of a class.  A canonical implementation in D is shown below:

class MySingleton {
  static MySingleton get() {
    if (instance_ is null) {
      instance_ = new MySingleton;
    }
    return instance_;
  }
 private:
  this() {}
  __gshared MySingleton instance_;
 }

Most of these people also probably know that this naive implementation is a concurrency nightmare.  You can’t safely call get() from multiple threads simultaneously.  The most obvious reason is because two instances of MySingleton will get created if get() is called from a second thread before it’s finished running in the first thread that called it, i.e. before instance_ !is null from the first call.

If you want thread safety, the most obvious solution is to wrap everything in a mutex, represented in D by a synchronized block.  Omitting all the boilerplate except the get() method, this would look something like:

  static MySingleton get() {
    synchronized {
      if (instance_ is null) {
        instance_ = new MySingleton;
      }
    }
    return instance_;
  }

Now, if Thread 1 calls get(), it has a chance to finish instantiating instance_ and storing a reference to it before Thread 2 checks whether instance_ is null.  There’s only one problem with this:  It comes at a huge performance cost (benchmarks forthcoming).  Entering a synchronized block is expensive.

One solution that looks correct to the naive eye but is subtly broken is double-checked locking.  The idea here is to check whether instance_ is null outside synchronized block, and then if so, check again inside one:

  static MySingleton get() {
    if (instance_ is null) {
      synchronized {
        if (instance_ is null) {
          instance_ = new MySingleton;
        }
      }
    }
    return instance_;
  }

Intuitively this seems reasonable.  instance_ can’t go from non-null to null, only from null to non-null.  Therefore, if it’s already non-null by the time we get to the first check (the one outside the synchronized block) it would seem we know every thing we need to know and can simply return instance_.  Alas, the real world is not that simple.  The details are complicated and best explained by others, but the executive summary is this:  The sequentiality of program execution is an illusion maintained within a single thread.  Compilers and CPUs reorder instructions and the illusion breaks down when dealing with multiple threads.  For example, instance_ can be assigned a non-null value before the object it refers to is fully constructed.

In D, though, thread-local storage (TLS) is a first class member of the language.  This leads to an interesting workaround that’s similar in spirit to double-checked locking but actually works.  It’s a previously obscure pattern that was independently invented by at least myself and Alexander Terekhov, an IBM researcher working in Java.   It never caught on for some reason, probably because of the abysmal performance of TLS in Java back when it was discovered, as illustrated by Doug Lea’s benchmarks from that era.  Anyhow, the pattern is implemented in D below:

class MySingleton {
  static MySingleton get() {
    if (!instantiated_) {
      synchronized {
        if (instance_ is null) {
          instance_ = new MySingleton;
        }
        instantiated_ = true;
      }
    }
    return instance_;
  }
 private:
  this() {}
  static bool instantiated_;  // Thread local
  __gshared MySingleton instance_;
 }

The idea is simple:  We maintain a thread-local flag to track whether we’ve already entered the synchronized block from the current thread and ensured that instance_ is instantiated.  The synchronized block is entered once per thread, amortized over a whole run of the program.

Benchmarks

Ideally, we’d like thread-safe Singletons that are as fast as thread-unsafe Singletons.  How well does our low-lock TLS-based Singleton perform in practice?  I created a micro-benchmark calling get() in a loop 1 billion times and ran it for naive, thread-unsafe Singletons (unsafe), Singletons that enter a synchronized block on every call to get() (synchronized), and the low-lock TLS-based Singletons (TLS).  The results are shown in the graph below:

chart_1

On GDC the overhead of TLS vs. unsafe is virtually undetectable.  On DMD it’s noticeable but small.  In both cases it pales in comparison to the overhead of synchronizing on every call to get().  Note that this benchmark was run in a single thread, so the mutex used to implement the synchronized statement was never contested.  If it were, the results when synchronizing on every call to get() would be far worse.

There you have it:  An obscure Java pattern that was ahead of its time, resurrected in the presence of more modern thread-local storage implementations in D, lets us have our cake and eat it too:  Fast Singletons that are thread safe without any dependence on obscure memory model details.


May 02, 2013

DConf 2013 Day One

Day one of DConf is behind us. Right now, day two is under way and I’m typing this on my phone because I got back to my room too late last night to do it. It’s safe to say I had a great time.

Things kicked off with a keynote from Walter, followed by several presentations. I won’t summarize them here, as videos will be online eventually. I found all of them informative, but given my interest in game development, Manu’s talk on using D alongside a game engine was the one I was most looking forward to. If you’ve not yet seen it on Twitter, the money quote from that talk was, “The gaming industry is desperate for salvation from C++.”

The best part of the conference has been the mingling. Being able to put faces to all the familiar names is just awesome. I’ve been party to a number of great discussions. It’s so much more interesting to do that face-to-face than on the newsgroups. Plus, I witnessed history last night in the lobby of the Aloft hotel as the rvalue reference problem was solved after a long and spirited discussion.

This has been a fantastic experience so far. And day two has gotten off to a good start.

A few pics here.