Author Archives: Gerben

A real world example of digital signature checking

In this post we will see exactly how we can check that an SSL certificate hasn’t been tampered with.

We will use https://google.com as an example and we’re manually going to check that the certificate’s digital signature is valid. Other important steps, such as traversing the entire chain, are beyond the scope of this simple example. Certificates don’t remain valid forever, so today you will get different ones. For the sake of reproducibility, I’ve included the ones I used later on in this post.

When I browsed to Google, it returned 2 certificates to my browser:

  1. Its own certificate
  2. The certificate of the intermediate CA that signed Google’s certificate

We’re going to use the following approach to check the signature on Google’s certificate:


  1. Retrieve the digital signature included in Google’s certificate.

  2. Retrieve the intermediate CA’s public-key from the CA’s certificate.

  3. Decrypt the digital signature in Google’s certificate using the public-key from the intermediate CA. Now we have the hash value that the intermediate CA calculated at the time when it signed Google’s certificate.

  4. Calculate the hash value of Google’s certificate ourselves.

  5. Compare the two hash values. If they are the same, then Google’s certificate has not changed since it was signed and therefore we consider it to be valid.
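The five steps can be sketched end to end with a toy key. This is only an illustration under loud assumptions, not how real certificates are verified: it uses textbook RSA without any padding, a tiny key built from two Mersenne primes, and it reduces the SHA-1 hash modulo the key so that it fits. The names `toy_hash`, `toy_sign` and `toy_verify` are made up for this example.

```python
import hashlib

# Toy RSA key built from two known Mersenne primes (far too small for real use).
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, needs Python 3.8+


def toy_hash(data: bytes) -> int:
    """SHA-1 of the data, reduced so that it is smaller than the toy modulus."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big") % N


def toy_sign(data: bytes) -> int:
    """What the CA does: hash the data and transform the hash with its private key."""
    return pow(toy_hash(data), D, N)


def toy_verify(data: bytes, signature: int) -> bool:
    """Steps 3 to 5: recover the CA's hash, hash the data ourselves, compare."""
    hash_from_signature = pow(signature, E, N)
    return hash_from_signature == toy_hash(data)


certificate_body = b"CN=*.example.test"
signature = toy_sign(certificate_body)
print(toy_verify(certificate_body, signature))   # unchanged data
print(toy_verify(b"CN=*.evil.test", signature))  # modified data
```

The rest of this post does the same thing by hand, with a real 2048 bit key and the padding that real signatures use.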

Retrieve the signature from Google’s certificate


Google’s certificate is listed further on in this post. It’s in the PEM format, which is just a base64 encoded representation of an X.509 certificate. I decoded it back into ‘plain old’ bytes and then I had the ASN.1 DER encoded version of the certificate. Using an ASN.1 viewer I can see that the entire X.509 file has the following structure:

SEQUENCE(3 elem)
    SEQUENCE(8 elem) <-- Google's part of the certificate. It contains 8 things, which I'm not showing here
    SEQUENCE(2 elem) <-- 2 elements that say which algorithm the intermediate CA used to sign Google's part of the certificate. It's SHA-1 with RSA encryption
    BIT STRING(2048 bit) <-- Intermediate CA's signature
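If you don't have an ASN.1 viewer at hand, the outermost layers can also be read with a few lines of Python. The sketch below decodes only the first line of the PEM body of the certificate used in this post and reads the tag and length of the outer SEQUENCE and of its first child. The helper `read_der_header` is a made-up name, and it only handles the simple cases needed here (single byte tags, definite lengths).

```python
import base64

# First line of the PEM body of the certificate used in this post (48 bytes once decoded).
first_pem_line = "MIIGxTCCBa2gAwIBAgIIVGohyFSBd4owDQYJKoZIhvcNAQEFBQAwSTELMAkGA1UE"
der_prefix = base64.b64decode(first_pem_line)


def read_der_header(data: bytes, offset: int):
    """Return (tag, header_length, content_length) of the element at offset."""
    tag = data[offset]
    first_length_byte = data[offset + 1]
    if first_length_byte < 0x80:
        # Short form: the byte itself is the content length.
        return tag, 2, first_length_byte
    # Long form: the low 7 bits say how many bytes hold the length.
    number_of_length_bytes = first_length_byte & 0x7F
    length_bytes = data[offset + 2 : offset + 2 + number_of_length_bytes]
    return tag, 2 + number_of_length_bytes, int.from_bytes(length_bytes, "big")


outer = read_der_header(der_prefix, 0)       # the whole certificate
tbs = read_der_header(der_prefix, outer[1])  # first child: Google's part
print("outer: tag=0x%02X header=%d content=%d" % outer)
print("tbs:   tag=0x%02X header=%d content=%d" % tbs)
```

Both elements have tag 0x30 (SEQUENCE) and a 4 byte header. The outer one holds 1733 bytes of content and Google's part holds 1453 bytes.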

So the last 2048 bits (256 bytes) contain the signature of the certificate. Below is the hex representation of those bytes:

348B7D645A64085B1FF6D86DF35480F9D913EADB09210B7E7402B7779F730077C7C7926A7A953DCD814C35E30608C02586A220795F965AF0E97F3CE5C32E7234FD6259782E447BFF73F6319797CA8DB1EB8D0A58119FB0794EF83ACCD8E45895C91FDCA97BB82FB425811E8A4CF0D41594618A5663BF774AC9CE2DBB9798E6E5BB6C5CCEC68B80D93E8C6748394B3822DE437C4FB93BCF302723ACD4D9ECAC75FFA4993D559C12C2E17228AC917942B1666D9948C6C42FAD1B0EB8F78AB0B38A5B392F85E7BDBFE97FD7534269CBB8FE22B03EF305514668DCE491683B1DD6852DBEE9C21E9C9E955B41E7078ACB722B2555CECBDEAD60AEC4FDC1C9A9686BE8

By the way, if you're doing these steps too and using an ASN.1 viewer, you might have noticed that I skipped the first byte of the contents. That's because it's a BIT STRING, and the following quote from the ITU-T X.690 specification says that the content starts with a byte that's not really part of the content:

The initial octet shall encode, as an unsigned binary integer with bit 1 as the least significant bit, the number of unused bits in the final subsequent octet. The number shall be in the range zero to seven.
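In code, that rule comes down to splitting off the first content byte. A minimal sketch (the function name `split_bit_string` is made up, and the input is a shortened stand-in for the real 257 content bytes):

```python
def split_bit_string(content: bytes):
    """Split BIT STRING content into (unused_bits, payload)."""
    if not content:
        raise ValueError("BIT STRING content must have at least the initial octet")
    unused_bits = content[0]
    if unused_bits > 7:
        raise ValueError("the initial octet must be in the range 0 to 7")
    return unused_bits, content[1:]


# Shortened stand-in for the content bytes shown by the ASN.1 viewer:
# one initial octet (0 unused bits) followed by the start of the signature.
example_content = bytes.fromhex("00348B7D64")
unused_bits, signature_start = split_bit_string(example_content)
print(unused_bits, signature_start.hex().upper())
```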

Retrieve the intermediate CA's public-key from the CA's certificate


The CA's public-key is stored somewhere in the middle of its certificate (not Google's certificate). Here I used the same trick of using an ASN.1 viewer to figure out which part of the ASN.1 contained the key.

The modulus is:

009C2A04775CD850913A06A382E0D85048BC893FF119701A88467EE08FC5F189CE21EE5AFE610DB7324489A0740B534F55A4CE826295EEEB595FC6E1058012C45E943FBC5B4838F453F724E6FB91E915C4CFF4530DF44AFC9F54DE7DBEA06B6F87C0D0501F28300340DA0873516C7FFF3A3CA737068EBD4B1104EB7D24DEE6F9FC3171FB94D560F32E4AAF42D2CBEAC46A1AB2CC53DD154B8B1FC819611FCD9DA83E632B8435696584C819C54622F85395BEE3804A10C62AECBA972011C739991004A0F0617A95258C4E5275E2B6ED08CA14FCCE226AB34ECF46039797037EC0B1DE7BAF4533CFBA3E71B7DEF42525C20D35899D9DFB0E1179891E37C5AF8E7269

There is one odd thing about this modulus. I know that it's a 2048 bit / 256 byte key, yet I have 257 bytes. You might think that we're running into that BIT STRING thing again here, but that's not the case, as the ASN.1 tag specifies that the modulus element is an INTEGER. What's really going on is that an ASN.1 INTEGER is a signed (two's complement) number. The RSA modulus is a 2048 bit unsigned number whose top bit is set, so it is serialized with an extra leading 00 byte to keep it from being read as negative.
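The effect of that extra byte is easy to reproduce on small numbers. The helper below is a sketch (the name `der_integer_content` is made up) that produces the content bytes of a DER INTEGER for non-negative values only:

```python
def der_integer_content(value: int) -> bytes:
    """Content bytes of a DER INTEGER for a non-negative value."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # bit_length() // 8 + 1 always leaves room for a clear sign bit.
    return value.to_bytes(value.bit_length() // 8 + 1, "big")


print(der_integer_content(0x7F).hex())  # top bit clear
print(der_integer_content(0x9C).hex())  # top bit set, like the first byte of the modulus

# Going the other way, the extra byte does not change the value.
print(int.from_bytes(bytes.fromhex("009C2A"), "big") == 0x9C2A)
```

That is also why the leading 00 can simply stay in place when the modulus is pasted into code as a hex literal.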

The exponent is:

01 00 01

Decrypt the signature from Google’s certificate


We know the intermediate CA's public key and we know the bytes that contain the signature of the certificate. So now we can do an RSA decryption on those bytes and voila, we will have the hash that the intermediate CA calculated during the signing process.

I used the following snippet of Python to do this, but most languages should be able to do it:

import sys

#Decrypt the signature from the certificate using the intermediate CA's public RSA key
modulo    = 0x009C2A04775CD850913A06A382E0D85048BC893FF119701A88467EE08FC5F189CE21EE5AFE610DB7324489A0740B534F55A4CE826295EEEB595FC6E1058012C45E943FBC5B4838F453F724E6FB91E915C4CFF4530DF44AFC9F54DE7DBEA06B6F87C0D0501F28300340DA0873516C7FFF3A3CA737068EBD4B1104EB7D24DEE6F9FC3171FB94D560F32E4AAF42D2CBEAC46A1AB2CC53DD154B8B1FC819611FCD9DA83E632B8435696584C819C54622F85395BEE3804A10C62AECBA972011C739991004A0F0617A95258C4E5275E2B6ED08CA14FCCE226AB34ECF46039797037EC0B1DE7BAF4533CFBA3E71B7DEF42525C20D35899D9DFB0E1179891E37C5AF8E7269
exponent  = 0x010001
signature = 0x348B7D645A64085B1FF6D86DF35480F9D913EADB09210B7E7402B7779F730077C7C7926A7A953DCD814C35E30608C02586A220795F965AF0E97F3CE5C32E7234FD6259782E447BFF73F6319797CA8DB1EB8D0A58119FB0794EF83ACCD8E45895C91FDCA97BB82FB425811E8A4CF0D41594618A5663BF774AC9CE2DBB9798E6E5BB6C5CCEC68B80D93E8C6748394B3822DE437C4FB93BCF302723ACD4D9ECAC75FFA4993D559C12C2E17228AC917942B1666D9948C6C42FAD1B0EB8F78AB0B38A5B392F85E7BDBFE97FD7534269CBB8FE22B03EF305514668DCE491683B1DD6852DBEE9C21E9C9E955B41E7078ACB722B2555CECBDEAD60AEC4FDC1C9A9686BE8
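#RSA 'decryption' with a public key is a modular exponentiation: signature^exponent mod modulo
IntermediateCAsHash = pow(signature, exponent, modulo)
#sys.getsizeof returns the size of the Python int object, which is more than the 256 bytes
#of the key, so the printed result starts with some extra zero bytes
bytesOfHash = IntermediateCAsHash.to_bytes(sys.getsizeof(IntermediateCAsHash),byteorder='big', signed=False)
print ( "%s" % ''.join(format(x, '02X') for x in bytesOfHash ))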

Running this code gave me the following output (I manually added line breaks, so remove them if you ever copy/paste this somewhere):

00000000000000000000000000000000
00000000000000000000000000000000
00000000000000000001FFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFF003021300906052B0E03021A
05000414F8F3D8AACF7E27B2F66A2231
C3240682A15ADFF6

The 0000...1FFF...FF00 part is an RSA Encryption Block Type 1 from the PKCS#1 standard and isn't really part of the data that the intermediate CA wanted to encrypt. We can ignore it and focus on the 3021300906052B0E03021A05000414F8F3D8AACF7E27B2F66A2231C3240682A15ADFF6 part. This part is an ASN.1 DER encoded data-structure defined in RFC2313 as:

DigestInfo ::= SEQUENCE {
     digestAlgorithm DigestAlgorithmIdentifier,
     digest Digest
}
DigestAlgorithmIdentifier ::= AlgorithmIdentifier
Digest ::= OCTET STRING

The AlgorithmIdentifier is defined in RFC 5280 as

AlgorithmIdentifier  ::=  SEQUENCE  {
    algorithm               OBJECT IDENTIFIER,
    parameters              ANY DEFINED BY algorithm OPTIONAL
}

So this means we should get:

SEQUENCE(2 elements)
    SEQUENCE(2 elements)
        OBJECT IDENTIFIER
        NULL (see RFC2313)
    OCTET STRING 

And indeed when we use the ASN.1 decoder we get the following output:

SEQUENCE(2 elem)
    SEQUENCE(2 elem)
        OBJECT IDENTIFIER 1.3.14.3.2.26 sha1(OIW)
        NULL
    OCTET STRING(20 byte) F8F3D8AACF7E27B2F66A2231C3240682A15ADFF6

So, now we know that the hash value calculated by the intermediate CA is

F8F3D8AACF7E27B2F66A2231C3240682A15ADFF6
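Instead of reading the block by eye, the padding can also be removed in code. The following is a minimal sketch that assumes a SHA-1 DigestInfo like the one above; the function name `extract_sha1_digest` is made up, and a production verifier would be stricter (for example about the exact block length).

```python
# DER prefix of a DigestInfo that carries a SHA-1 digest (15 bytes),
# taken from the decrypted output above.
SHA1_DIGEST_INFO_PREFIX = bytes.fromhex("3021300906052B0E03021A05000414")


def extract_sha1_digest(block: bytes) -> bytes:
    """Remove block type 1 padding and return the 20-byte SHA-1 digest."""
    data = block.lstrip(b"\x00")       # leading zero bytes carry no value
    if data[:1] != b"\x01":
        raise ValueError("not a type 1 block")
    separator = data.find(b"\x00", 1)  # the 00 that ends the FF padding
    if separator == -1:
        raise ValueError("missing 00 separator")
    padding = data[1:separator]
    if len(padding) < 8 or padding.strip(b"\xff"):
        raise ValueError("padding must be at least eight FF bytes")
    digest_info = data[separator + 1:]
    expected_length = len(SHA1_DIGEST_INFO_PREFIX) + 20
    if len(digest_info) != expected_length or not digest_info.startswith(SHA1_DIGEST_INFO_PREFIX):
        raise ValueError("not a SHA-1 DigestInfo")
    return digest_info[len(SHA1_DIGEST_INFO_PREFIX):]


# Rebuild a 256-byte block with the same layout as the output above.
digest_info = bytes.fromhex(
    "3021300906052B0E03021A05000414F8F3D8AACF7E27B2F66A2231C3240682A15ADFF6"
)
block = b"\x00\x01" + b"\xff" * (256 - 3 - len(digest_info)) + b"\x00" + digest_info
print(extract_sha1_digest(block).hex().upper())
```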

Calculate the hash value of Google's certificate ourselves


Now we are going to repeat the same hash calculation that the intermediate CA did a long time ago. We will:

  1. Need to extract the bytes that represent Google's part of the certificate. This may NOT include any of the bytes that hold the digital signature itself.
  2. Run a SHA1 hash calculation on it.

The Python code below does that. When I run it, it prints:

F8F3D8AACF7E27B2F66A2231C3240682A15ADFF6

This matches the hash we recovered from the signature, so we conclude that Google's certificate has not been tampered with!

import base64
import hashlib
    
def showSha1HashOfCertificate(base64EncodedCert):

    #Before doing the base64 decoding, we need to remove the first and last lines (the PEM markers)
    certificateWithoutMarkerLines = base64EncodedCert.replace("-----BEGIN CERTIFICATE-----","").replace("-----END CERTIFICATE-----","")
    bytesOfCertificate = base64.b64decode(certificateWithoutMarkerLines)

    #The hash is calculated over the DER encoded bytes of the part that the X.509 specs
    #refer to as the 'tbsCertificate' field of the entire certificate.
    #Using the ASN.1 viewer I see that the tbsCertificate (the first member of the sequence) starts at offset 4
    #and is 4 + 1453 bytes long (a 4 byte header plus 1453 bytes of content), so it ends at offset 1461
    bytesOftbsCertificatePart = bytesOfCertificate[4:1461]
    sha1Hasher = hashlib.sha1()
    sha1Hasher.update(bytesOftbsCertificatePart)
    ourHash = sha1Hasher.digest()
    print ("%s" % ''.join(format(x, '02X') for x in ourHash ))

googlesBashe64EncodedCert = """
-----BEGIN CERTIFICATE-----
MIIGxTCCBa2gAwIBAgIIVGohyFSBd4owDQYJKoZIhvcNAQEFBQAwSTELMAkGA1UE
... I removed a lot of the lines for brevity
rsT9wcmpaGvo
-----END CERTIFICATE-----
"""

showSha1HashOfCertificate(googlesBashe64EncodedCert)

The certificates


Below is the certificate for Google (it's a big one!):

-----BEGIN CERTIFICATE-----
MIIGxTCCBa2gAwIBAgIIVGohyFSBd4owDQYJKoZIhvcNAQEFBQAwSTELMAkGA1UE
BhMCVVMxEzARBgNVBAoTCkdvb2dsZSBJbmMxJTAjBgNVBAMTHEdvb2dsZSBJbnRl
cm5ldCBBdXRob3JpdHkgRzIwHhcNMTUwNDA4MTM0MDEwWhcNMTUwNzA3MDAwMDAw
WjBmMQswCQYDVQQGEwJVUzETMBEGA1UECAwKQ2FsaWZvcm5pYTEWMBQGA1UEBwwN
TW91bnRhaW4gVmlldzETMBEGA1UECgwKR29vZ2xlIEluYzEVMBMGA1UEAwwMKi5n
b29nbGUuY29tMFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEy93BzqzWIF9fj2sq
ckQqqm8/USjGY97ncLJMtkAmzNVQ4HGC3pZlYdCTkq89JsFD1UfX81ynnPaQnDtT
QTZs/KOCBF0wggRZMB0GA1UdJQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjCCAyYG
A1UdEQSCAx0wggMZggwqLmdvb2dsZS5jb22CDSouYW5kcm9pZC5jb22CFiouYXBw
ZW5naW5lLmdvb2dsZS5jb22CEiouY2xvdWQuZ29vZ2xlLmNvbYIWKi5nb29nbGUt
YW5hbHl0aWNzLmNvbYILKi5nb29nbGUuY2GCCyouZ29vZ2xlLmNsgg4qLmdvb2ds
ZS5jby5pboIOKi5nb29nbGUuY28uanCCDiouZ29vZ2xlLmNvLnVrgg8qLmdvb2ds
ZS5jb20uYXKCDyouZ29vZ2xlLmNvbS5hdYIPKi5nb29nbGUuY29tLmJygg8qLmdv
b2dsZS5jb20uY2+CDyouZ29vZ2xlLmNvbS5teIIPKi5nb29nbGUuY29tLnRygg8q
Lmdvb2dsZS5jb20udm6CCyouZ29vZ2xlLmRlggsqLmdvb2dsZS5lc4ILKi5nb29n
bGUuZnKCCyouZ29vZ2xlLmh1ggsqLmdvb2dsZS5pdIILKi5nb29nbGUubmyCCyou
Z29vZ2xlLnBsggsqLmdvb2dsZS5wdIISKi5nb29nbGVhZGFwaXMuY29tgg8qLmdv
b2dsZWFwaXMuY26CFCouZ29vZ2xlY29tbWVyY2UuY29tghEqLmdvb2dsZXZpZGVv
LmNvbYIMKi5nc3RhdGljLmNugg0qLmdzdGF0aWMuY29tggoqLmd2dDEuY29tggoq
Lmd2dDIuY29tghQqLm1ldHJpYy5nc3RhdGljLmNvbYIMKi51cmNoaW4uY29tghAq
LnVybC5nb29nbGUuY29tghYqLnlvdXR1YmUtbm9jb29raWUuY29tgg0qLnlvdXR1
YmUuY29tghYqLnlvdXR1YmVlZHVjYXRpb24uY29tggsqLnl0aW1nLmNvbYILYW5k
cm9pZC5jb22CBGcuY2+CBmdvby5nbIIUZ29vZ2xlLWFuYWx5dGljcy5jb22CCmdv
b2dsZS5jb22CEmdvb2dsZWNvbW1lcmNlLmNvbYIKdXJjaGluLmNvbYIIeW91dHUu
YmWCC3lvdXR1YmUuY29tghR5b3V0dWJlZWR1Y2F0aW9uLmNvbTALBgNVHQ8EBAMC
B4AwaAYIKwYBBQUHAQEEXDBaMCsGCCsGAQUFBzAChh9odHRwOi8vcGtpLmdvb2ds
ZS5jb20vR0lBRzIuY3J0MCsGCCsGAQUFBzABhh9odHRwOi8vY2xpZW50czEuZ29v
Z2xlLmNvbS9vY3NwMB0GA1UdDgQWBBRywGdPXVe4yyyclgSRP628eGqncDAMBgNV
HRMBAf8EAjAAMB8GA1UdIwQYMBaAFErdBhYbvPZotXb1gba7Yhq6WoEvMBcGA1Ud
IAQQMA4wDAYKKwYBBAHWeQIFATAwBgNVHR8EKTAnMCWgI6Ahhh9odHRwOi8vcGtp
Lmdvb2dsZS5jb20vR0lBRzIuY3JsMA0GCSqGSIb3DQEBBQUAA4IBAQA0i31kWmQI
Wx/22G3zVID52RPq2wkhC350Ard3n3MAd8fHkmp6lT3NgUw14wYIwCWGoiB5X5Za
8Ol/POXDLnI0/WJZeC5Ee/9z9jGXl8qNseuNClgRn7B5Tvg6zNjkWJXJH9ype7gv
tCWBHopM8NQVlGGKVmO/d0rJzi27l5jm5btsXM7Gi4DZPoxnSDlLOCLeQ3xPuTvP
MCcjrNTZ7Kx1/6SZPVWcEsLhciiskXlCsWZtmUjGxC+tGw6494qws4pbOS+F572/
6X/XU0Jpy7j+IrA+8wVRRmjc5JFoOx3WhS2+6cIenJ6VW0HnB4rLcislVc7L3q1g
rsT9wcmpaGvo
-----END CERTIFICATE-----

And here we have the certificate of the intermediate CA that signed the above certificate:

-----BEGIN CERTIFICATE-----
MIID8DCCAtigAwIBAgIDAjp2MA0GCSqGSIb3DQEBBQUAMEIxCzAJBgNVBAYTAlVT
MRYwFAYDVQQKEw1HZW9UcnVzdCBJbmMuMRswGQYDVQQDExJHZW9UcnVzdCBHbG9i
YWwgQ0EwHhcNMTMwNDA1MTUxNTU1WhcNMTYxMjMxMjM1OTU5WjBJMQswCQYDVQQG
EwJVUzETMBEGA1UEChMKR29vZ2xlIEluYzElMCMGA1UEAxMcR29vZ2xlIEludGVy
bmV0IEF1dGhvcml0eSBHMjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEB
AJwqBHdc2FCROgajguDYUEi8iT/xGXAaiEZ+4I/F8YnOIe5a/mENtzJEiaB0C1NP
VaTOgmKV7utZX8bhBYASxF6UP7xbSDj0U/ck5vuR6RXEz/RTDfRK/J9U3n2+oGtv
h8DQUB8oMANA2ghzUWx//zo8pzcGjr1LEQTrfSTe5vn8MXH7lNVg8y5Kr0LSy+rE
ahqyzFPdFUuLH8gZYR/Nnag+YyuENWllhMgZxUYi+FOVvuOAShDGKuy6lyARxzmZ
EASg8GF6lSWMTlJ14rbtCMoU/M4iarNOz0YDl5cDfsCx3nuvRTPPuj5xt970JSXC
DTWJnZ37DhF5iR43xa+OcmkCAwEAAaOB5zCB5DAfBgNVHSMEGDAWgBTAephojYn7
qwVkDBF9qn1luMrMTjAdBgNVHQ4EFgQUSt0GFhu89mi1dvWBtrtiGrpagS8wEgYD
VR0TAQH/BAgwBgEB/wIBADAOBgNVHQ8BAf8EBAMCAQYwNQYDVR0fBC4wLDAqoCig
JoYkaHR0cDovL2cuc3ltY2IuY29tL2NybHMvZ3RnbG9iYWwuY3JsMC4GCCsGAQUF
BwEBBCIwIDAeBggrBgEFBQcwAYYSaHR0cDovL2cuc3ltY2QuY29tMBcGA1UdIAQQ
MA4wDAYKKwYBBAHWeQIFATANBgkqhkiG9w0BAQUFAAOCAQEAJ4zP6cc7vsBv6JaE
+5xcXZDkd9uLMmCbZdiFJrW6nx7eZE4fxsggWwmfq6ngCTRFomUlNz1/Wm8gzPn6
8R2PEAwCOsTJAXaWvpv5Fdg50cUDR3a4iowx1mDV5I/b+jzG1Zgo+ByPF5E0y8tS
etH7OiDk4Yax2BgPvtaHZI3FCiVCUe+yOLjgHdDh/Ob0r0a678C/xbQF9ZR1DP6i
vgK66oZb+TWzZvXFjYWhGiN3GhkXVBNgnwvhtJwoKvmuAjRtJZOcgqgXe/GFsNMP
WOH7sf6coaPo/ck/9Ndx3L2MpBngISMjVROPpBYCCX65r+7bU2S9cS+5Oc4wt7S8
VOBHBw==
-----END CERTIFICATE-----

Zalenium a stable and scalable Selenium grid

I just want to give a well-deserved thumbs up to Zalando’s Zalenium. Their own description says it best:

Allows anyone to have a disposable, flexible, container based Selenium Grid infrastructure featuring video recording, live preview, basic auth & online/offline dashboards

Getting up and running really is only one docker pull and one docker run command away.

Accessing gpio pins inside a docker container on a raspberry pi

If your container needs access to the GPIO pins, then it must have access to the /dev/gpiomem device. From the command line you can do that like this:

$ docker run --device=/dev/gpiomem:/dev/gpiomem ...rest of commandline...

Here’s how to do it with a docker-compose file:

version: "2"

services:
  app:
    devices:
      - /dev/gpiomem:/dev/gpiomem
    ports:
     ...rest of the file...
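To confirm from inside the container that the mapping worked, a small check like the one below can help. It is only a sketch: the function name `gpio_device_available` is made up, and it tests nothing more than whether the device node exists and is readable and writable for the current process.

```python
import os


def gpio_device_available(path: str = "/dev/gpiomem") -> bool:
    """True if the device node exists and this process may read and write it."""
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


if not gpio_device_available():
    print("/dev/gpiomem is not accessible; was the container started with the device mapping?")
```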

Containerising the development environment

One of the nice things about docker is that we can use all kinds of software without cluttering up our local machine. I really like the ability to have the development environment running in a container. Here is an example where we:

  • Get a Node.js development environment with all required tools and packages
  • Allow remote debugging of the app in the container
  • See code changes immediately reflected inside the container

The Dockerfile below gives us a container with all required tools and packages for a Node.js app. In this example we assume the ‘.’ directory contains the files needed to run the app.

FROM node:9

WORKDIR /code

RUN npm install -g nodemon

COPY package.json /code/package.json
RUN npm install && npm ls
RUN mv /code/node_modules /node_modules
COPY . /code

CMD ["npm", "start"]

That’s nice, but how does this provide remote debugging? And how do code changes propagate to a running container?

Two very normal aspects of docker achieve this. Firstly, docker-compose.yml overrides the CMD ["npm", "start"] statement to start nodemon with the --inspect=0.0.0.0:5858 flag. That starts the app with the debugger listening on port 5858 on all of the machine’s IP addresses. We expose port 5858 to allow remote debuggers to connect to the app in the container.

Secondly, the compose file contains a volume mapping that overrides the /code folder in the container and points it to the directory on the local machine where you edit the code. Combined with the --watch flag, nodemon sees any changes you make to the code and restarts the app in the container with the latest code changes.

Note: If you are running docker on Windows or the code is stored on some network share, then you must use the --legacy-watch flag instead of --watch.

The docker-compose.yml file:

version: "2"

services:
  app:
    build: .
    command: nodemon --inspect=0.0.0.0:5858 --watch
    volumes:
      - ./:/code
    ports:
      - "5858:5858"

Here’s a launch.json for Visual Studio Code to attach to the container.

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Attach",
            "type": "node",
            "request": "attach",
            "port": 5858,
            "address": "localhost",
            "restart": true,
            "sourceMaps": false,
            "outDir": null,
            "localRoot": "${workspaceRoot}",
            "remoteRoot": "/code"
        }
    ]
}
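If the debugger refuses to attach, it can help to first check that the published port is reachable at all. The snippet below is a generic TCP check written in Python (not part of the Node.js setup; the function name `port_is_open` is made up). A successful connection only tells you that something is listening on the port, not that it is the Node.js inspector.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(port_is_open("localhost", 5858))
```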

Docker on Raspbian: cgroup not supported on this system

Are you running Docker on Raspbian and getting the error:

cgroups: memory cgroup not supported on this system

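The best solution is to add cgroup_memory=1 to /boot/cmdline.txt and reboot. The parameter has to go at the end of the existing single line in that file, and the edit needs root rights (with sudo echo ... >> the redirection is done by your own shell, not by root), so use sed:

sudo sed -i '1s/$/ cgroup_memory=1/' /boot/cmdline.txt

Please note, for future releases of Raspbian you will need the following instead:

sudo sed -i '1s/$/ cgroup_enable=memory/' /boot/cmdline.txt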

Alternatively, you can downgrade to an earlier docker version:

sudo apt-get install -y docker-ce=17.09.0~ce-0~raspbian --allow-downgrades
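For the cmdline.txt route, the important detail is that all kernel parameters live on one line. If you prefer to script the change, the logic can be sketched like this. The function name `add_kernel_parameter` and the sample file content are made up for illustration; the function works on a string, so reading and writing the file (as root) is left out.

```python
def add_kernel_parameter(cmdline: str, parameter: str) -> str:
    """Return the cmdline with the parameter appended to its single line."""
    parameters = cmdline.split()
    if parameter not in parameters:
        parameters.append(parameter)
    return " ".join(parameters) + "\n"


# Hypothetical file content, for illustration only.
original = "console=tty1 root=/dev/mmcblk0p2 rootwait\n"
print(add_kernel_parameter(original, "cgroup_memory=1"), end="")
```

Running it twice gives the same result, so the parameter is never added more than once.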

Returning a collection of objects from a PowerShell function

If you’re writing a function that returns a collection, then don’t forget to include the comma operator in the return statement. If you forget it, your function will work correctly when the collection contains multiple objects, but it fails when the collection contains 1 object.

Take the following buggy example:

function GiveMeAllTheThings()
{
    $myarray = @()
    #fill $myarray with results of type String. Assume that
    #run-time conditions determine if it is filled with 
    #0, 1 or more items and that each item is a string
    return $myarray;
}
$result = GiveMeAllTheThings
$result.GetType().FullName

If you execute this code when $myarray has many strings in it, the returned type from the function is System.Object[]. If $myarray has only 1 string in it, then the returned type will be System.String.

The code should have been written like this:

function GiveMeAllTheThings()
{
    ...
    return ,$myarray;
}

Understanding the LINQ nested grouping example

Here’s an explanation of how the default example for LINQ nested grouping actually works. The usual example for nested grouping looks like this:

from student in Students
group student by student.Faculty into Faculty
from dbtgroup in
(
    from student in Faculty
    group student by student.DebtCategory
)
group dbtgroup by Faculty.Key;

The objective of this statement is to first group students by faculty and then, within each faculty, create subgroups of students by their DebtCategory.

So how does this actually work, and what’s the equivalent method/lambda syntax? The first step is to group each student into their faculty. Assume we have the following data:

public class Student
{
   public string Name { get; set; }
   public string Faculty { get; set; }
   public int DebtCategory { get; set; }
}

IList<Student> Students = new List<Student>();
Students.Add(new Student { Name = "John" , Faculty = "IT"     , DebtCategory = 2 });
Students.Add(new Student { Name = "Jane" , Faculty = "IT"     , DebtCategory = 2 });
Students.Add(new Student { Name = "Jesse", Faculty = "Finance", DebtCategory = 2 });
Students.Add(new Student { Name = "Linda", Faculty = "Finance", DebtCategory = 1 });

The following query groups each student into a faculty

var query1 = from student in Students
group student by student.Faculty into Faculty
select Faculty;

//The Method syntax for the above query is:
var query1Method = Students
.GroupBy(student => student.Faculty)
.Select ( Faculty => Faculty);

//This gives us the following IGrouping<string, Student> as result
//
// [0]
//    Key   :  IT
//    Values: 
//          [0] John (IT) (2)
//          [1] Jane (IT) (2)
//
// [1]
//    Key   : Finance
//    Values:
//          [0] Jesse (Finance) (2)
//          [1] Linda (Finance) (1)

The next step is to add another level of grouping:

var query2 = from student in Students
group student by student.Faculty into Faculty
from dbtgroup in
(
    from student in Faculty
    group student by student.DebtCategory
)
select dbtgroup;
//This gives us the following IGrouping<int, Student> as result
//[0]
//  Key   : 2
//  Values:
//        [0] John (IT) (2)
//        [1] Jane (IT) (2)
//
//[1]
//  Key   : 2
//  Values:
//        [0] Jesse (Finance) (2)
//
//[2]
//  Key   : 1
//  Values:
//        [0] Linda (Finance) (1)

// The following is the literal translation of the above Comprehension syntax into method syntax. We're ignoring this as explained below
//	var query2Method = Students
//		.GroupBy(student => student.Faculty)
//		.SelectMany(  Faculty =>Faculty.GroupBy(student => student.DebtCategory)
//					, (Faculty, dbtgroup) => dbtgroup);
	
//The final complete query ends with "group dbtgroup by Faculty.Key;".
//That statement refers to the Faculty range variable from the SelectMany, so instead of
//"(Faculty, dbtgroup) => dbtgroup" the compiler emits a slightly different projection,
//"(Faculty, dbtgroup) => new {Faculty, dbtgroup}", that keeps both values available
var query2Method = Students
.GroupBy(student => student.Faculty)
.SelectMany( Faculty =>Faculty.GroupBy(student => student.DebtCategory)
	 , (Faculty, dbtgroup) => new {Faculty, dbtgroup});

Query2 is close to our desired output; however, its debt-category groups come back as one flat list instead of being nested under their faculty. So the final step is:

var query3 = from student in Students
group student by student.Faculty into Faculty
from dbtgroup in
    (
    from student in Faculty
    group student by student.DebtCategory
    )
group dbtgroup by Faculty.Key;

//The method/lambda syntax is:
var query3Method = Students
.GroupBy(student => student.Faculty)
.SelectMany (
	Faculties => Faculties.GroupBy (student => student.DebtCategory)
	, (Faculty, dbtgroup) => 
		new  
		{
			Faculty = Faculty, 
			dbtgroup = dbtgroup
		} )
.GroupBy( item => item.Faculty.Key, item => item.dbtgroup );

//This gives us the following groups as result
//[0]
//  Key   : IT
//  Values:
//        [0] Key   : 2
//            Values:
//                  [0] John (IT) (2)
//                  [1] Jane (IT) (2)
//[1]
//  Key   : Finance
//  Values:
//        [0] Key   : 2
//            Values:
//                  [0] Jesse (Finance) (2)
//        [1] Key   : 1
//            Values:
//                    [0] Linda (Finance) (1)
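For comparison, here is the same nested grouping outside of LINQ. This is a Python sketch using plain dictionaries, with the same four students; the names `Student` and `group_by_faculty_then_debt` are made up for this example and it is not a translation of what the C# compiler generates.

```python
from collections import namedtuple

Student = namedtuple("Student", ["name", "faculty", "debt_category"])

students = [
    Student("John", "IT", 2),
    Student("Jane", "IT", 2),
    Student("Jesse", "Finance", 2),
    Student("Linda", "Finance", 1),
]


def group_by_faculty_then_debt(items):
    """Outer key: faculty. Inner key: debt category. Values: student names."""
    result = {}
    for student in items:
        debt_groups = result.setdefault(student.faculty, {})
        debt_groups.setdefault(student.debt_category, []).append(student.name)
    return result


nested = group_by_faculty_then_debt(students)
for faculty, debt_groups in nested.items():
    print(faculty)
    for debt_category, names in debt_groups.items():
        print("   ", debt_category, names)
```

The outer dictionary plays the role of the final "group dbtgroup by Faculty.Key", and each inner dictionary corresponds to one faculty's debt-category groups.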