Featured Post

x += x++; ?!

Let’s take this code as an example: int x = 3; x += x++; The first thing the compiler does whenever it sees something like z += y is to convert it to z = z + y. This is obviously true for +=, -=, *=, and /=.   Ok, this was easy. Now we have just to consider: x= x + x++; This, by the way, gives...

Read More

Parallel Programming in C# 4.0 using Visual Studio 2010

Posted by Logu Krishnan | Posted in C#, Parallel Programming | Posted on 09-06-2009

Tags: ,

1

Downloaded Visual Studio 2010 Beta 1 yesterday and as I was glancing through it strike to me that this version is stuffed, unless the previous two predecessors VS2005 & VS2008.
Framework has been more enhanced and visual studio IDE itself have got overhauled a bit. Well, I’m not going to give you a list of ALL features – it’s been blogged already around the world. Better Google it or Bing it with “VS2010+Features”

However, few notable features that caught my eyes are “Parallel Programming”, “F# - Functional Programming”, “Velocity – Distributed Caching”, “Azure Tools” and more important of all the evolving Team system.

But I first wanted to dirt my hand with Parallel Computing, because if you are a computer science student – well, you would be more excited about this than others.
Remember the big pillow sized books that we used to read to make this work? Well, things have changed and world have shrunk already. Though I cannot explain all the nitty gritty of parallel programming I will try this to explain in LAY MAN Terms.

Well, during the Stone Age [!] - Most of the computers in the world had only ONE Processors, except those big beasty servers which are always locked up in rooms with high security (well, usually *nix or Solaris servers) – these beasty servers used to manage most of the corporations. These servers had multiple processors and it took huge efforts to write software’s and manage them.

Welcome to the modern world – Every household and every laptop being sold these days at least have two or more processors.

Now – that has posed us a BIG Question? Hardwares have evolved, but has our software evolved to execute on multiple processors? – The answer is NO. At least not in the mainstream programming world – let’s say for example what would happen

  1. If we execute a simple FOR Loop
  2. That would call a service (that takes a longer time)
  3. … and execute sequentially for N Times

On a single processor this is acceptable and we might use threads to increase the efficiency.

Is this still acceptable on a multiple processors? The answer is no. Fine, but how do we get efficiency without the hurdles of running and managing too many threads? Shouldn’t there be an easier way out for this?

Alrighty, without much ado, let me show you how easy(!) this is and a little insight on what happens behind the scenes. Let’s churn out a quick code here based on the same questions we have. Let us say a real long process (Well it could be about counting the stars in the UniverseJ, huh) and let us say you want to do this N times.

In our quest to count all the stars in the universe, let’s first create a data structure for the star and add to the universe, and let us use the good ol` mother of all loops the “FOR” Loop, and see how much inefficient this loop has become these modern days!!

 

“The Sequential execution took almost 30 seconds in my Dual Core Computer.”

And here is the Parallel Computing version of the same method. Yes, the for loop has been replaced with Parallel.For a new entry in System.Threading namespace.
How simpler can this get to?

VOILA! The Parallel execution took Just 3 Seconds in my Dual Core Computer.

Well, That’s a significant performance improvement without Hardware Scale-out or Scale-up, all we are doing is using the existing hardware resource efficiently. So much to a FOR Loop J, Huh. 30 Seconds of execution have become 3 seconds instantly. Look closer to the screenshot – the stars are not counted sequentially, instead it allocates the task to the available CPU in parallel.

Because the loop is run in parallel, each iteration is scheduled and run individually on whatever core is available. This means that the list is not necessarily processed in order, which can drastically impact your code. You should design your code so that each iteration of the loop is completely independent from the others. Any single iteration should not rely on another in order to complete correctly.

Let us catch up more on the insights soon on next part of the same series…

  • Share/Save/Bookmark

PERFORMANCE Code Killer – Unaligned Memory - C# Structs

Posted by Logu Krishnan | Posted in C#, Performance | Posted on 24-02-2009

Tags: ,

1

Recently I was analyzing a .NET Application for performance which had lots of structs defined in it, and happened to hit a strange reality. Unaligned Memory problem!
I was running a profiler, and found that the memory allocated for few structs are huge than it should normally allocate (based on my own math of counting the bytes). When I probed further, there was an interesting discovery. Read on…

Alright here is a little head spinner… What is the difference between the following structures?

struct BadStructure
{
char c1;
int i;
char c2;
}

Nothing much, except the jumbled type declarations… Huh?

Fine, Now let’s look at the size of these structures,

struct GoodStructure
{
int i;
char c1;
char c2;
}

The size of BadStructure Structure in:
.NET Framework 3.5 : Managed sizeof= 12 Bytes, Marshal.Sizeof = 12 Bytes

The size of GoodStructure Structure in:
.NET Framework 3.5 : Managed sizeof= 8 Bytes, Marshal.Sizeof = 8 Bytes

[Note: Size of int=4, char=2]

The Reason behind these differences is “BYTE ALIGNMENT”, As with the default packing in unmanaged C++, integers are laid out on four-byte boundaries, so while the first
character uses two bytes (a char in managed code is a Unicode character, thus occupying two bytes), the integer moves up to the next 4-byte boundary, and the second character uses the subsequent 2 bytes. The resulting structure is 12 bytes when measured with Marshal.SizeOf.

32 bit microprocessors typically organize memory as shown below.


        Byte0  Byte1  Byte2 Byte3

0×1000

0×1004   A0     A1     A2     A3

0×1008

0×100C          B0     B1     B2

0×1010  B3

Most of the processer architectures cannot read data from odd addresses.
Processor Architectures are inefficient in reading the data if it starts at an address not divisible by four.
Memory is accessed by performing 32 bit bus cycles. 32 bit bus cycles can however be performed at addresses that are divisible by 4. So for efficiency purposes, compilers add the so-called pad bytes. The reasons for not permitting misaligned long word reads and writes are not difficult to see. For example, an aligned long word A would be written as A0, A1, A2 and A3.

Here is the IL.

.class nested private sequential ansi sealed beforefieldinit BadValueType

extends [mscorlib]System.ValueType
{
.field public char c1
.field public char c2
.field public int32 i
}
In the .NET Framework 3.5, the JIT does enforce a Sequential layout (if specified) for the managed layout of value types,
We can use the System.Runtime.InteropServices namespace and the StructLayoutAttribute class to control the physical layout of the data fields in the Microsoft .NET Framework 3.5

Thus the microprocessor can read the complete long word in a single bus cycle. If the same microprocessor now attempts to access a long word at address 0×100D, it will have to read bytes B0, B1, B2 and B3. Notice that this read cannot be performed in a single 32 bit bus cycle. The microprocessor will have to issue two different reads at address 0×100C and 0×1010 to read the complete long word. Thus it takes twice the time to read a misaligned long word.

The following byte padding rules will generally work with most 32 bit processor.
a. single byte numbers can be aligned at any address
b. Two byte numbers should be aligned to a two byte boundary
c. Four byte numbers should be aligned to a four byte boundary

This is the cause of the difference.
Fine…. How do we fix this ?

the .NET compilers all apply a StructLayoutAttribute to structures, specifying a Sequential layout. This means that the fields are laid out in the type according to their order in the source file.
Fix = specify StructLayout [LayoutKind Sequential,Pack = 1] for the struct.

Watchout for structures when you create them next time, and think about playing around with ‘m’ structures with ‘n’ size…. m x n = !!!

  • Share/Save/Bookmark

x += x++; ?!

Posted by Logu Krishnan | Posted in C# | Posted on 29-12-2005

0

Let’s take this code as an example:


int
x = 3;
x += x++;

The first thing the compiler does whenever it sees something like z += y is to convert it to z = z + y. This is obviously true for +=, -=, *=, and /=.

 

Ok, this was easy. Now we have just to consider:

x= x + x++;

This, by the way, gives the same result as:

x = x + x;

This, by the way, gives a different result from:

x = x++ + x;

This, by the way gives the same result as:

x = x + ++x;

 

As maddening as this may seem, it actually makes sense (once you understand how it works). But first, what is the difference between x++ and ++x? x++ returns the value of x to the current expression and then increments x. ++x increments x and then return its value to the current expression. Given this factoid (and knowing that c# evaluates expressions left to right), we can then consider what happens in the following case:

 

int x = 3;

x = x + x++;

 

Here is how the compiler conceptually evaluates it:

  1. x = (x) + x++ -> the first x gets evaluated and returns 3, x = 3
  2. x = 3 + (x)++ -> x gets evaluated and returns 3, x = 3
  3. x = 3 + (x++) -> x++ gets evaluated and x is incremented (to 4), x = 4
  4. x = (3 + 3) -> 3 + 3 gets evaluated and returns 6, x = 4
  5. (x = 6) -> x is assigned to 6 (overriding the previous value of 4)

 

Now let’s see how this one works:

int x = 3;

x = x++ + x;

  1. x = (x)++ + x -> x gets evaluated and returns 3, x =3
  2. x = (x++) + x -> x++ gets evaluated and x is incremented, x=4
  3. x = 3 + (x) -> x gets evaluated and returns 4, x = 4
  4. x = 3 + 4 -> 3+4 gets evaluated and returns 7, x = 4
  5. (x=7) -> x is assigned to 7 (overriding the previous value of 4)

 

Now let’s get to this one:

int x = 3;

x = x + ++x;

  1. x = (x) + ++x -> x gets evaluated and returns 3, x=3
  2. x = 3 + (++x) -> ++x gets evaluated and x is incremented, x=4
  3. x = 3 + (x) -> x gets evaluated and returns 4, x=4
  4. x = 3 + 4 -> 3+4 gets evaluated and returns 7, x = 4
  5. (x=7) -> x is assigned to 7 (overriding the previous value of 4)

 

I hope this is clear.

By the way, in c++ the behavior for this expression is undefined…

 

But now… why did we make this legal? Why not err or warn at compilation time? Well…

  • We were wrong, we should have erred or warned, but now it is too late because if we change this we break code OR
  • It is quite complex to form a set of guidelines that the compiler can evaluate to be able to err just in the bizarre cases OR
  • We prefer to spend our time working on things people really care about instead of these corner-corner-corner cases

Found this Interesting Post by  Luca Bolognese Quite Interesting…

  • Share/Save/Bookmark