The Digital Cat - algorithms

Data Partitioning and Consistent Hashing

2022-08-23T12:00:00+00:00

This post is an introduction to partitioning, a technique for distributed storage systems, and to consistent hashing, a specific partitioning algorithm that promotes an even distribution of data while allowing the number of servers to change dynamically and minimising the cost of the transition.

My interest in partitioning dates back to 2015 when I was following courses at the MongoDB university and learned about sharding, the name MongoDB uses for partitioning. I was fascinated by the topic and discovered the technique known as consistent hashing; I enjoyed it a lot, so much that I wrote a little demo in Python to understand it better. Later, I focused on other things and forgot the project completely, until recently, when David Eynon sent me a PR on GitHub to replace a deprecated testing library. So, I decided to brush up on my knowledge of consistent hashing and, as I often do on this blog, dump my thoughts in a post.

The topic of distributed storage and data processing is arguably rich and complicated, so while I will try to give a broader context to the concepts that I will introduce, I by no means intend to write a comprehensive guide to the subject matter. The audience of this post is developers who do not know what partitioning and consistent hashing are and want to take their first step into those topics.

Code syntax

You will find some code examples in the post, written in Python. If you are not familiar with the language, these are the main rules:

x**y means x^y, e.g. 2**3 => 8.
x//y means the integer division between x and y, e.g. 11//4 => 2.
x%y means the modulo operation (remainder of integer division), e.g. 11%4 => 3.

Rationale

When we design a system, we might want to scatter data among multiple sources to allow real concurrency of access and a more targeted optimisation.

For example, we might observe that in a given social media application there are two types of queries: some are very infrequent and involve tables related to personal data and the user profile, others are extremely frequent and pretty intensive, and are related to the content shared by the user. In this case, we might decide to store the tables related to the profile and the tables that are related to content in two different systems, A and B (here, the word system might be freely replaced by computer, database, storage system, or other similar components).

This means that the infrequent queries that fetch personal data will be served by system A, while the more frequent and intensive queries related to content will be served by system B.

Suddenly, we have the chance to deploy system B using more powerful and expensive hardware, or an architecture with better performance, without increasing the cost for tables that won't benefit from such an improvement, such as the ones stored in system A.

This is a standard approach in system design, and it requires the introduction of an additional layer of control that will route requests to the right source. This layer might be implemented in several places, for example:

in the code of our application, with conditional structures that query different data sources
in the framework that we are using for the application, for example in a middleware that automatically routes requests according to the nature of the query
in a wrapper around the storage that hides the fact that data exists in two different systems

In the last case, this technique is usually called partitioning.

In this post, I will try to show the challenges we face when we partition data and focus on some of the algorithms that can be used to implement it, in particular on consistent hashing. Please note that, while some of these techniques are used by databases to provide internal partitioning, they have a wider range of applications and might come in handy in different contexts.

Design choices

Every design choice in a system depends on the requirements, and when it comes to data storage the most important factors are the nature of the data, its distribution, and the access patterns. Consider for example databases and Content Delivery Networks (CDNs): both are meant to store data, and the storage size of both can vary substantially. However, there are important differences between the two that greatly affect the design choices. Let's see some simple examples:

databases are meant to store data in a long-term fashion, while caches are by definition short-lived. This means that an important requirement for databases is data preservation, and we should do everything in our power to avoid losing parts of the database. A cache, conversely, holds data for a short time, either predetermined by the system or forced by a change in the data source. As you can see in this case we not only take data loss as part of the equation, but we get to the point where we trigger it on purpose.

applications often make use of range queries, which means that they retrieve sets of results spanning a range of values of one of the keys; for example, you might want to see all employees within a certain range of salaries, or all users that have more than a certain number of followers. In such cases, it makes little sense to scatter data among different physical sources, as this makes the retrieval more complicated and ultimately affects performance. Databases very often see an access pattern of this type, while caches, being usually implemented as key/value stores, do not need to take this into account.

A practical example of partitioning

Let us consider a simple key/value store, for example a common address book where the key is the name of the contact and the value a rich document with their personal details. If multiple users access the store, chances are that the system will at a certain point struggle to serve all the requests, so we might want to partition the data to allow concurrent access. We can for example sort them alphabetically and split them in two, storing all values with a key that begins with the letters A-M in one server and the rest (keys N-Z) in the second one.

This might seem a good idea, but we will soon discover that performance is not great. Unfortunately, our address book doesn't contain the same number of people for each letter, as (for example) we know more people whose name starts with A or C than with X or Z.

That poses a problem, as our partitioning doesn't achieve the desired outcome, that of splitting requests evenly between the two servers. If we increase the number of partitions, serving smaller groups of letters, we will just worsen the problem, to the point where a partition might be completely empty and thus receive no traffic: since the problem comes from the data distribution, we need to find a way to change that property.
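To make the imbalance concrete, here is a minimal sketch of the A-M / N-Z split. The contact names and the function `range_partition` are invented for the illustration and are not part of any real address book.

```python
from collections import Counter

# Hypothetical contact names, invented for this illustration only
contacts = ["Alice", "Adam", "Anna", "Bob", "Carl",
            "Chloe", "Clara", "Dan", "Nina", "Zoe"]

def range_partition(name):
    """Return 0 for names starting with A-M, 1 for names starting with N-Z."""
    return 0 if name[0].upper() <= "M" else 1

# Count how many contacts end up in each partition
load = Counter(range_partition(name) for name in contacts)
print(load[0], load[1])  # 8 2
```

With this sample, the first server holds four times as many keys as the second, even though the alphabet was split exactly in half.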

One way to deal with the problem is to change the boundaries of the partitions so that we get an almost even distribution of values among them. For example, we might store keys starting with A-B in the first partition, keys starting with C-D in the second, and all the rest in the third one.

The problem with such a strategy is that it is highly dependent on the actual data that we are storing. Not only does this mean the solution has to be customised for each use case (the partitions in the example might be good for one address book and completely wrong for another), but adding data to the storage might change the distribution and invalidate the solution.

Hash functions to the rescue

An interesting solution to the problem of distributing data evenly is represented by hash functions. As I explained in my post Introduction to hashing, good hash functions produce a highly uniform distribution, which makes them ideal in this case. Please note that hash functions can help with routing queries and not with storing data. Hashed values cannot replace the content, as hash functions are not injective, i.e. given two different inputs the output might be the same (collision), so they can only be used to decide where to store a piece of information.

We can at this point devise a storage strategy based on hash functions. We can divide the output space of the hash function (codomain) into a certain number of partitions and be sure that each one of them will contain a similar number of elements. For example, the hash function might output a 32-bit number, so we know that each hashed value will be between 0 and 2³² - 1 (4,294,967,295), and from here it's pretty straightforward to find partition boundaries. For example, we can create 16 partitions numbered 0 to 15, each one containing 2²⁸ hash values (268,435,456).

Routing is at this point very simple, as we can mathematically find the partition number given the hash. There are many ways to do this but two simple approaches are

using the integer division hash(k) // partition_size, e.g. hash(k) // 2**28. All keys from 0 to 268435455 end up in partition 0 (268435455 // 2**28), keys from 268435456 to 536870911 end up in partition 1, and so on.

using the modulo operator hash(k) % number_of_partitions, e.g. hash(k) % 16. This assigns values to partitions in a round robin fashion, where key 0 goes to partition 0 (0%16), key 1 to partition 1 (1%16), key 15 to partition 15 (15%16), and then starts again with key 16 which goes to partition 0 (16%16), and so on.
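The two routing approaches can be sketched as follows. I use `zlib.crc32` only as a convenient stand-in that returns a 32-bit value; it is an assumption of this example, not a recommendation, and the names `hash_key`, `route_by_division`, and `route_by_modulo` are made up for the sketch.

```python
import zlib

HASH_BITS = 32
NUM_PARTITIONS = 16
PARTITION_SIZE = 2**HASH_BITS // NUM_PARTITIONS  # 2**28 hash values each

def hash_key(key):
    # Stand-in 32-bit hash: crc32 returns an unsigned 32-bit integer in Python 3
    return zlib.crc32(key.encode("utf-8"))

def route_by_division(h):
    # Contiguous ranges: hashes 0..2**28-1 go to partition 0, and so on
    return h // PARTITION_SIZE

def route_by_modulo(h):
    # Round robin over the hash values: 0 -> 0, 1 -> 1, ..., 16 -> 0
    return h % NUM_PARTITIONS

print(route_by_division(268435455), route_by_division(268435456))  # 0 1
print(route_by_modulo(15), route_by_modulo(16))                    # 15 0
```

Both functions take the hash, not the key, so routing a key is `route_by_division(hash_key(k))` or `route_by_modulo(hash_key(k))`.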

This architecture has the clear advantage that thanks to the properties of hash functions, data is scattered evenly among the partitions. This means that when we query the system, requests will also be divided evenly, thus giving us a good distribution of the load.

As we will see later, however, this is not a good approach for dynamic systems.

Partitioning use cases

Hash functions are definitely interesting but they are not the perfect solution in every case. Let's have a brief look at three different types of systems that might benefit from partitioning and discuss their specific requirements.

Load balancers

Pure load balancers solve a simple problem: to spread requests evenly across multiple identical servers. The key word here is "identical", as you cannot pick the wrong server, thus no routing can result in an error. However, spreading the load unevenly can result in performance loss, and possibly also service failure. For example, if a server gets overloaded queries might hit a timeout while waiting to be served.

For this reason, when load balancing is not content-aware, for example in a simple HTTP server scenario, round-robin partitioning is a good choice. The system just assigns new requests to servers on a rotation basis, which ensures perfectly even distribution. For example, this algorithm is the default choice for AWS Application Load Balancers.
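A round-robin assignment can be sketched in a few lines. The server names and the `assign` function are hypothetical, and a real load balancer clearly has to deal with health checks and servers joining or leaving, which this sketch ignores.

```python
from itertools import cycle

# Hypothetical pool of identical servers
servers = ["server-1", "server-2", "server-3"]
rotation = cycle(servers)

def assign(request):
    # The content of the request is ignored: we just take the next server in turn
    return next(rotation)

assignments = [assign(f"request-{i}") for i in range(6)]
print(assignments)
```

After six requests, each of the three servers has received exactly two of them.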

Clearly, load balancers can be more complicated and feature-rich even without becoming content-aware. The aforementioned AWS ALBs, for example, also support the "least outstanding requests" algorithm, which in simple words means choosing the server with the smallest workload.

Caches

Caches are systems that temporarily store data whose retrieval is expensive, either for the user or for the provider. For example, if a system runs a long query on a database, caching the result will be beneficial both for the system and the database: for the former because a repeated run will get the result much faster, and for the latter because the load of the new query is zero.

Caches can be found everywhere and vary dramatically in size, but they are one of the best examples of systems that benefit from partitioning. As I mentioned before, their standard usage patterns don't include range queries and data loss (flushing) is part of their normal workflow.

A Content Delivery Network (CDN) is a specific type of cache that is distributed geographically. The purpose of the CDN nodes is to store content in a location that is physically near the users, thus increasing the performance of the system. This means that two geographically distinct nodes of a CDN contain the same values (replication), and the routing policy is solely based on the physical position of the user with respect to the node. Internally, though, each CDN node can be implemented using partitioning, which might improve the performance of that specific node.

Databases

As for databases, I already mentioned that the most important problem is range queries or, if you prefer, content-aware partitioning. In general, you can't partition a database without taking into account the content, or you will incur severe performance losses. So, when it comes to databases, partitioning has to be the result of a specific design and can't be applied regardless of the database schema.

To better understand the challenge, let's consider a simple database whose elements are employees with a name and a salary. Now, if we want to partition this database we have to choose a key for the partitioning itself. It might be the primary key, the name, or the salary, as these are the only values available in each record.

Say we use hash functions to partition the database and use the employee salary as a key. Because of the properties of hash functions, employees with the exact same salary will end up being stored in the same partition, but employees with similar salaries might end up in different ones. This depends on the number of partitions, clearly, but the main point is that records that are "near" (according to the selected key) now are potentially very far.

In the example above I used MD5 as the routing hash function, and you can reproduce the calculations using the following Python code

import hashlib

def hash_value(value):
    return int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16)

# 57500283691658467528082923406379043196
hash_value(60000)

# 209589555716047624083879134729984902154
hash_value(60100)

# 12
hash_value(60000) % 16

# 10
hash_value(60100) % 16

Things do not go much better if we use the integer division. If we have 16 partitions, each one of them contains 2¹²⁴ values

# 2
hash_value(60000) // 2**124

# 9
hash_value(60100) // 2**124

Now, let's consider a query that selects all employees within a certain range of salaries. If the database is not partitioned, all records are kept on the same server, and if we optimised the system for such a query, the records will also be physically adjacent (e.g. stored in nearby memory addresses). This makes the query blazing fast, but if the database is partitioned the query has to collect values from multiple partitions which greatly penalises performance.
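The effect on a range query can be observed with a small sketch that counts how many partitions have to be contacted. The function `ranged_partition` and its 20000-wide salary bands are assumptions of mine, used only as a term of comparison for the hashed routing shown earlier.

```python
import hashlib

NUM_PARTITIONS = 16

def hash_value(value):
    return int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16)

def hashed_partition(salary):
    # Hash routing: similar salaries can land anywhere
    return hash_value(salary) % NUM_PARTITIONS

def ranged_partition(salary, band=20000):
    # Hypothetical range routing: one partition per 20000-wide salary band
    return salary // band

# Salaries matched by a range query between 60000 and 60100
salaries = range(60000, 60101, 10)

hashed = {hashed_partition(s) for s in salaries}
ranged = {ranged_partition(s) for s in salaries}

# Number of partitions the query has to contact in each case
print(len(hashed), len(ranged))
```

With range routing all the matching salaries live in a single partition, while with hash routing the same query has to collect results from every partition in `hashed`.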

We can see a real example of this design challenge in the documentation of MongoDB, a non-relational database that supports partitioning (called sharding). MongoDB supports hashed sharding and ranged sharding. In their words

Hashed sharding uses either a single field hashed index or a compound hashed index as the shard key to partition data across your sharded cluster.

Range-based sharding involves dividing data into contiguous ranges determined by the shard key values. In this model, documents with "close" shard key values are likely to be in the same chunk or shard. This allows for efficient queries where reads target documents within a contiguous range. However, both read and write performance may decrease with poor shard key selection.

I highly recommend reading the two pages I linked above as they will give you a good idea of how a real system uses the concepts I introduced and what design challenges are involved when using partitioning.

Caching and scaling strategies

When we design distributed caches, an interesting problem we might face is that of scaling the system in and out to match the current load without wasting resources.

When the cache is under a light load we might want to run a small number of servers, but as soon as the number of requests increases we need to proportionally increase the number of cache nodes if we want to avoid a performance drop. This is usually not a big problem for partitioned databases, since in that case we change the number of partitions only occasionally to adjust performance or to increase the storage size, but caches like CDNs might need continuous adjustments during a single day.

Increasing or decreasing the number of nodes in a distributed cache might however be a pretty destructive action. Depending on the routing algorithm, if we add nodes (scale out) we might need to move data from existing ones to the newly added ones, and if we remove nodes (scale in) we will certainly lose the data contained in them. Both scenarios result in a (potentially massive) cache invalidation which can't be taken lightly.

The hash-based routing method presented in the previous section performs terribly when it comes to scaling because any change in the number of servers impacts the key boundaries of the existing ones. Let's see a practical example of that and calculate the actual figures.

Scaling out with hash partitioning

Every time you consider a process or an algorithm you should have a look at how it behaves in the worst possible condition, to have a glimpse of what you might run into when you use it. For this reason, the following example considers a scale-out scenario in which all cache nodes are full. The best case is obviously when all nodes are empty, but in that case we don't need to scale out at all.

Let's consider a 32-bit hash function and 16 partitions numbered 0 to 15. Since the hash function space is 2³² (4,294,967,296), each partition will contain 2²⁸ hash values (268,435,456). Each node is full, which means that all the possible 2²⁸ slots are assigned to a cached item, that is some data stored in the server that corresponds to that partition. The system is using the integer division routing system.

If we scale out to 17 partitions, increasing the pool by just 1 node, each node will now contain a smaller part of the global data space, as now we split it among more nodes. In particular, each node used to contain 1/16 of the global data (268,435,456), and will now contain 1/17 of it (approx. 252,645,135). Our biggest problem is now managing the transition between the initial setup and the new one.

The first node hosted 1/16 of the data space, the keys from 0 to 268435455. It will now contain 1/17 of the data space, the keys from 0 to 252645134. To simplify the example it is useful to convert everything into a common unit of measure: the node used to contain 17/272 of the space (1/16) and contains now 16/272 (1/17) of it.

This means that 1/272 of the whole data space has to be moved to the second node, corresponding to the keys from 252645135 to 268435455. It is important to note that these keys cannot be moved to the newly added node, but have to be moved to the second node because the algorithm we use maps keys to nodes in order.

This means that the second node will receive 1/272 of the whole data space. Since it originally already contained 17/272 of the whole space it should now theoretically contain 18/272 of it. However, as it happened for the first node, we want to balance the content and reduce it to 16/272, so now we have 2/272 of the whole space that we want to move to the third node.

So, we move 1/272 from node 1 to node 2, 2/272 from node 2 to node 3, 3/272 from node 3 to node 4, and going on with the example we end up moving 16/272 (1/17) from the 16th node to the 17th, which fills it with the correct amount of keys. However, in doing so we moved 136/272 (1/272 + 2/272 + 3/272 + ... + 16/272) of the data space between nodes, which is exactly 50% of it.

So, for any initial size and a scale out of 1 single node, we have to move 50% of the data stored in our cache, and it might only get worse by increasing the number of final nodes until we end up having to move almost 100% of it (in an extreme case). A similar effect plagues the scale-in action, where one or more nodes are removed from the pool, and the keys they contain have to be migrated to the remaining nodes, creating a ripple effect to redistribute the keys according to the algorithm.
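The sum can be checked for any initial size with exact fractions. This is a sketch under the assumption used in the example, that is contiguous and equally sized ranges assigned to nodes in order; the function name `moved_fraction` is mine.

```python
from fractions import Fraction

def moved_fraction(n):
    """Fraction of the hash space that changes node when going from n to n+1 nodes."""
    total = Fraction(0)
    for i in range(1, n + 1):
        # The upper boundary of node i moves from i/n down to i/(n+1):
        # everything between the two values shifts to the next node
        total += Fraction(i, n) - Fraction(i, n + 1)
    return total

print(moved_fraction(16))   # 1/2
print(moved_fraction(100))  # 1/2
```

Each term is i/(n(n+1)), so the sum is n(n+1)/2 divided by n(n+1), which is 1/2 regardless of n.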

Using a modulo routing strategy doesn't change things: as I mentioned before, the core issue is that the addition of new nodes changes the routing of the whole data space, requiring a massive migration of keys in the entire system.
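A quick count for the modulo strategy, again going from 16 to 17 nodes, suggests that it fares even worse than the integer division. The sketch counts over one period of 272 hash values, which is an approximation for a 32-bit space (2³² is not a multiple of 272).

```python
from fractions import Fraction

OLD_NODES, NEW_NODES = 16, 17

# The pair (h % 16, h % 17) repeats every 16 * 17 = 272 hash values,
# so one period is enough to measure the fraction of hashes that change node
period = OLD_NODES * NEW_NODES
moved = sum(1 for h in range(period) if h % OLD_NODES != h % NEW_NODES)

fraction = Fraction(moved, period)
print(moved, fraction)  # 256 16/17
```

Only the hashes from 0 to 15 of each period stay on the same node, so roughly 94% of the keys would have to be migrated.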

A different approach

While the idea of using hash functions looked very promising, we quickly found that the trivial implementation performs very poorly in a dynamic setting. As we clearly saw in the previous section, the problem is that upon scaling at least half of the keys have to be moved across nodes, so if we could find a way to avoid this we could still use hash functions to scatter data uniformly across the nodes.

As you might have already figured out, the issue comes from the attempt to keep all nodes perfectly balanced. The modulo and integer division algorithms distribute keys evenly (as long as the hash function has a good diffusion), but this is a double-edged sword. The balance is extremely beneficial in a static environment, but it is also the Achilles heel of this architecture when we change the number of nodes.

When we design a system, requirements are paramount. Everything we add to the final product should be there to satisfy one or more requirements. However, often requirements clash with each other, and trying to implement all of them at once might lead to situations where there is no apparent solution. In such cases, it is useful to temporarily drop one or more requirements and investigate the options we have, and this is exactly what we can do in this case: maintaining balance is an important feature, but let's see what would happen if we didn't have that requirement.

If we don't care about balancing nodes we can solve the problem with a different approach. Instead of using the integer division to find the slot, we can keep a table of the minimum hash served by each slot and route requests according to that. Each row of the table will have a minimum hash and the node that serves them.

This means that when we increase the number of slots we can just drop a new slot anywhere and assign to it all the keys that fall under its domain. This means that the new node will become the owner of keys that belonged to another node as it happened before, but with an important difference. Now all keys come from another single node, and the amount of keys moved is a fraction of those contained in it (which is much less than half of the keys). In the worst case, we need to move all keys contained in a node, which once again is much less than half of the keys.

As you can see, this relieves the load of one single node. According to what we said before, we are not trying to balance the load of the whole cluster. If we could use this technique to cover multiple spaces with a single added node, though, we could relieve the load of more than one other node. In principle this is simple: we just need to add multiple rows with the same node to the table.
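A minimal sketch of such a table, assuming tiny made-up hash values and node names, shows that adding rows for a new node only moves the keys that fall under the new rows.

```python
import bisect

# Hypothetical routing table: (minimum hash, node), kept sorted by minimum hash
table = [(0, "node-1"), (100, "node-2"), (200, "node-3")]

def route(table, h):
    """Return the node of the last row whose minimum hash is <= h."""
    minimums = [minimum for minimum, _ in table]
    return table[bisect.bisect_right(minimums, h) - 1][1]

sample = (50, 120, 160, 210, 260)
before = {h: route(table, h) for h in sample}

# Adding a node means adding rows: two rows take keys from two different nodes
bisect.insort(table, (150, "node-4"))
bisect.insort(table, (250, "node-4"))

after = {h: route(table, h) for h in sample}
moved = [h for h in sample if before[h] != after[h]]
print(moved)  # [160, 260]
```

The hashes 160 and 260 used to belong to node-2 and node-3 respectively and are now served by node-4, while all the others stay where they were.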

Pay attention to the fact that we added multiple rows, that is multiple partitions, but they are all served by the same physical node. This has several advantages:

It fills the new node with keys coming from several different nodes without rippling effects.
The key transfer load is spread among different nodes, noticeably hitting only the new node.

There is also an interesting turn of events: since keys for the new node are fetched from several different existing nodes, the process will keep the cluster balanced! This is a remarkable outcome: we temporarily dropped a requirement and found a solution that provides that exact requirement in a different way.

The key part of this new process is the idea that multiple partitions can be served by the same node. The only missing part at this point is a way to identify the new partitions (the sets of hashes) served by the new node in a deterministic way.

Consistent hashing

Finally, let me introduce consistent hashing as a technique to implement the process described above.

As we discussed in the previous section, the only missing part is an algorithm that produces a deterministic set of hash ranges for a single new node. These hash ranges represent the partitions served by that node and should be scattered across the whole hash space. It is important for them to be spread because this way they will each receive some keys from existing nodes, instead of migrating a bulk of keys from a single one. The more evenly spread, the better the distribution of the load and the more balanced the resulting cluster.

As we saw previously, any time we need to scatter data across a given space in a deterministic way, hash functions are a good choice, and they can be used in this case as well. The idea is simple: each partition of a node is assigned a name and this name is hashed with the same function used to hash the keys stored in the system. This will produce a deterministic value in the hash space, and that value will be the minimum value served by that partition. Thanks to diffusion the names of all partitions will generate different hash values that won't easily clash, and this is the way we generate the routing table.

Let's see an example, bearing in mind that the specific function can change among implementations.

For simplicity's sake, I used a custom hash function that outputs 28-bit hashes (7 hexadecimal digits). This makes it possible to compare hashes visually and simplifies the example. To do this I took the first 7 digits of the SHA1 hash with the following Python code

import hashlib

def hash_name(name):
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest()[:7], 16)

thus creating a hash function whose values go from 0x0000000 to 0xfffffff. At the end of the post you will find the Python code that I used to generate the following routing tables, and you are free to experiment using different settings.

WARNING: this is not a good hash function! SHA1 produces 160-bit hashes, so taking the first 28 bits reduces the hash space to a microscopic fraction of the total, as we go from 2¹⁶⁰ total hashes to 2²⁸. Please keep in mind that this is done only to simplify the visualisation of the example.

All our nodes are called server-X with X being a letter of the English alphabet, thus giving us server-a, server-b, and so on. I decided to create 5 partitions per server, numbered from 0 to 4, which are generated appending -Y to the name, where Y is the number of the partition. For example:

server-a-0 -- hash --> 148456820
server-a-1 -- hash --> 57674441
server-a-2 -- hash --> 216250418
server-a-3 -- hash --> 30595746
server-a-4 -- hash --> 23746828

If we do this for two nodes (server-a and server-b) and then sort the results we will get a full routing table

 23746828 --> server-a-4 ( 6848918 hashes)
 30595746 --> server-a-3 (27078695 hashes)
 57674441 --> server-a-1 ( 3228787 hashes)
 60903228 --> server-b-2 (17957108 hashes)
 78860336 --> server-b-0 ( 7773725 hashes)
 86634061 --> server-b-4 (61822759 hashes)
148456820 --> server-a-0 (67793598 hashes)
216250418 --> server-a-2 (17304439 hashes)
233554857 --> server-b-3 (29289666 hashes)
262844523 --> server-b-1 ( 5590932 hashes)
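A table like the one above might be generated along these lines. The function `routing_table` and the exact format of the partition names are my assumptions for this sketch, so the output is not guaranteed to match the figures in the table.

```python
import hashlib

def hash_name(name):
    # 28-bit toy hash: the first 7 hexadecimal digits of SHA1
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest()[:7], 16)

def routing_table(nodes, partitions):
    """Return (minimum hash, partition name) rows sorted by minimum hash."""
    rows = []
    for node in nodes:
        for number in range(partitions):
            # Assumed naming scheme: node name, a dash, the partition number
            name = f"{node}-{number}"
            rows.append((hash_name(name), name))
    return sorted(rows)

table = routing_table(["server-a", "server-b"], 5)
for minimum, name in table:
    print(f"{minimum:>9} --> {name}")
```

Sorting the rows is what turns the list of hashed names into a routing table, since each row then serves the hashes up to the minimum of the next one.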

Remember that the hashes in the routing table are the minimum hash served by that partition. For example, the first line tells us that all hashes from 23746828 are served by the partition server-a-4, while hashes from 30595746 are served by the partition server-a-3. This means that the partition server-a-4 serves 6848918 hashes (as you can read in the table). A key whose hash is 79249022 will be served by server-b-0

 60903228 --> server-b-2 (17957108 hashes)
 78860336 --> server-b-0 ( 7773725 hashes)
                     ^
                     |
 79249022 -----------+

 86634061 --> server-b-4 (61822759 hashes)
148456820 --> server-a-0 (67793598 hashes)

Since partitions are not physically separated, but are just virtual entities belonging to a node, the route table can be simplified to

 23746828 --> server-a (37156400 hashes)
 60903228 --> server-b (87553592 hashes)
148456820 --> server-a (85098037 hashes)
233554857 --> server-b (34880598 hashes)
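The simplification amounts to merging adjacent rows that belong to the same node and keeping the minimum hash of the first one. Here is a sketch that uses the values of the full table for two nodes; the helper names `node_of` and `simplify` are hypothetical.

```python
from itertools import groupby

# Full routing table for 2 nodes with 5 partitions each, copied from the post
full_table = [
    (23746828, "server-a-4"), (30595746, "server-a-3"),
    (57674441, "server-a-1"), (60903228, "server-b-2"),
    (78860336, "server-b-0"), (86634061, "server-b-4"),
    (148456820, "server-a-0"), (216250418, "server-a-2"),
    (233554857, "server-b-3"), (262844523, "server-b-1"),
]

def node_of(partition):
    # "server-a-4" -> "server-a"
    return partition.rsplit("-", 1)[0]

def simplify(table):
    """Merge consecutive rows served by the same node."""
    simplified = []
    for node, rows in groupby(table, key=lambda row: node_of(row[1])):
        # The first row of each group carries the minimum hash of the merged range
        simplified.append((next(rows)[0], node))
    return simplified

for minimum, node in simplify(full_table):
    print(f"{minimum:>9} --> {node}")
```

Note that `groupby` only merges rows that are next to each other, which is what we want here: server-a appears twice in the result because server-b sits in between.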

What we achieved is remarkable, but there are still two problems. Let's have a look at a simple routing table for three nodes with 5 partitions each

3 nodes with 5 partitions each

 23746828 --> server-a (23267855 hashes)
 47014683 --> server-c (10659758 hashes)
 57674441 --> server-a ( 3228787 hashes)
 60903228 --> server-b (63557309 hashes)
124460537 --> server-c (23996283 hashes)
148456820 --> server-a (31382512 hashes)
179839332 --> server-c (36411086 hashes)
216250418 --> server-a (17304439 hashes)
233554857 --> server-b (15386579 hashes)
248941436 --> server-c (13903087 hashes)
262844523 --> server-b ( 5590932 hashes)

First, the lowest value is not 0, which means that there are some hashes (23,746,828 in this case) which are not served by any slot. Second, in general the distribution doesn't cover the space evenly, as some nodes receive too many keys compared to others. This second problem isn't actually visible in the setups I showed so far, but it becomes noticeable when increasing the number of servers. For example, with two nodes we have this situation

2 nodes with 5 partitions each

server-a: 122254437 hashes
server-b: 146181018 hashes

while with 5 nodes it becomes

5 nodes with 5 partitions each

server-a: 64211359 hashes
server-b: 66179053 hashes
server-c: 57545779 hashes
server-d: 43217324 hashes
server-e: 37281940 hashes

As you can see, in the second case the load of server-e is 56% that of server-b.

The first problem is easily solved by assigning the initial hashes to the last node, that is considering the hash space mapped on a circle. This means that for 2 nodes with 5 partitions each we have

Routing table of 2 nodes with 5 partitions each

Full routing table
        0 --> server-b-1 (23746828 hashes)
 23746828 --> server-a-4 (6848918 hashes)
 30595746 --> server-a-3 (27078695 hashes)
 57674441 --> server-a-1 (3228787 hashes)
 60903228 --> server-b-2 (17957108 hashes)
 78860336 --> server-b-0 (7773725 hashes)
 86634061 --> server-b-4 (61822759 hashes)
148456820 --> server-a-0 (67793598 hashes)
216250418 --> server-a-2 (17304439 hashes)
233554857 --> server-b-3 (29289666 hashes)
262844523 --> server-b-1 (5590932 hashes)

Simplified routing table
        0 --> server-b (23746828 hashes)
 23746828 --> server-a (37156400 hashes)
 60903228 --> server-b (87553592 hashes)
148456820 --> server-a (85098037 hashes)
233554857 --> server-b (34880598 hashes)

where the partition server-b-1 contains the orphaned initial hashes.
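The lookup on the circle and the per-node totals can be sketched as follows, using the minimum hashes of the table for two nodes. The function names are mine, and I measure the last range up to 0xfffffff, which is what the totals shown in the post suggest.

```python
import bisect
from collections import Counter

HASH_MAX = 0xFFFFFFF  # largest 28-bit hash, used as the end of the last range

# Minimum hash of each partition for 2 nodes with 5 partitions each
table = [
    (23746828, "server-a-4"), (30595746, "server-a-3"),
    (57674441, "server-a-1"), (60903228, "server-b-2"),
    (78860336, "server-b-0"), (86634061, "server-b-4"),
    (148456820, "server-a-0"), (216250418, "server-a-2"),
    (233554857, "server-b-3"), (262844523, "server-b-1"),
]
minimums = [minimum for minimum, _ in table]

def route(h):
    # The index is -1 for hashes below the first minimum, and in Python
    # that selects the last row: this is the wrap-around of the circle
    return table[bisect.bisect_right(minimums, h) - 1][1]

def node_totals():
    totals = Counter()
    # Orphaned initial hashes go to the node of the last partition
    totals[table[-1][1].rsplit("-", 1)[0]] += minimums[0]
    for i, (minimum, partition) in enumerate(table):
        upper = minimums[i + 1] if i + 1 < len(table) else HASH_MAX
        totals[partition.rsplit("-", 1)[0]] += upper - minimum
    return totals

print(route(79249022))  # server-b-0
print(route(1000))      # server-b-1
totals = node_totals()
print(totals["server-a"], totals["server-b"])  # 122254437 146181018
```

The totals match the stats for 2 nodes shown earlier, which confirms that those figures already count the initial hashes as served by server-b.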

The second problem is a matter of statistical approach. The hash function that we use to map the partition name to the key space cannot be controlled, as its diffusion property has been designed to avoid a regular spacing of values. However, if we increase the number of partitions we expect the hash function to spread values across the whole space. At that point, each partition will be assigned just a tiny key space, and the differences between partitions will be less noticeable. In other words, by increasing the number of partitions dramatically we should achieve a better distribution. Let's compare the results of 5 nodes with 2 partitions each

5 nodes with 2 partitions each

server-a 36500586
server-b 76678431
server-c 31738329
server-d 56183426
server-e 67334683

with the results of 5 nodes with 3000 partitions each

5 nodes with 3000 partitions each

server-a 53385222
server-b 53855877
server-c 53755762
server-d 53597662
server-e 53840932
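The comparison can be reproduced with a sketch like the following, which measures the balance as the ratio between the smallest and the largest node. The partition naming and the helpers `node_totals` and `balance` are assumptions of mine, so the exact figures may differ from the ones above.

```python
import hashlib
from collections import Counter

HASH_MAX = 0xFFFFFFF  # end of the last range, as in the tables of the post

def hash_name(name):
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest()[:7], 16)

def node_totals(nodes, partitions):
    # Rows are (minimum hash, node), sorted by minimum hash
    rows = sorted(
        (hash_name(f"{node}-{number}"), node)
        for node in nodes
        for number in range(partitions)
    )
    totals = Counter()
    # Hashes below the first minimum wrap around to the node of the last row
    totals[rows[-1][1]] += rows[0][0]
    for (minimum, node), (upper, _) in zip(rows, rows[1:] + [(HASH_MAX, None)]):
        totals[node] += upper - minimum
    return totals

def balance(totals):
    # 1.0 means perfectly balanced, values near 0 mean a very uneven cluster
    return min(totals.values()) / max(totals.values())

nodes = [f"server-{letter}" for letter in "abcde"]
few = node_totals(nodes, 2)
many = node_totals(nodes, 3000)
print(balance(few), balance(many))
```

With the figures listed above, the ratio goes from about 0.41 with 2 partitions per node to about 0.99 with 3000.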

There is clearly an upper limit to the number of partitions that we can create. If we create more partitions than the possible number of hashes we will end up having empty ones and incurring routing errors as some of them will clash, but this is a purely theoretical case: using standard real hash functions we generate hashes of at least 160 bits, which means a codomain of 2¹⁶⁰ possible values (more than 10⁴⁸). With 10,000 nodes (which is a considerable amount of servers in 2022) the threshold would be greater than 10⁴⁴ partitions per server.
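The orders of magnitude mentioned above are easy to verify with Python's arbitrary precision integers; the variable names are just for this check.

```python
codomain = 2**160   # possible values of a 160-bit hash
nodes = 10_000

# Maximum number of partitions per server before running out of hash values
per_node = codomain // nodes

print(codomain > 10**48)  # True
print(per_node > 10**44)  # True
```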

So far, we achieved great results, but we already managed to properly partition the space with simple techniques. The real power of consistent hashing is in the way it behaves in a dynamic setting.

Consistent hashing and scaling¶

The interesting thing about consistent hashing is its amazing behaviour in a dynamic environment. As you might remember, the problem with hash partitioning was that a change in the number of nodes had ripple effects that resulted in a massive migration of at least half the keys.

With consistent hashing, when we add a new node we need to generate the hash values for its partitions and put them in the routing table; at that point we need to migrate the keys that fall under the domain of the newly created slots. Let's see an example before we discuss the performance.

The initial setup is 2 nodes with 5 partitions each

2 nodes with 5 partitions

Full routing table
        0 --> server-b-1 (23746828 hashes)
 23746828 --> server-a-4 (6848918 hashes)
 30595746 --> server-a-3 (27078695 hashes)
 57674441 --> server-a-1 (3228787 hashes)
 60903228 --> server-b-2 (17957108 hashes)
 78860336 --> server-b-0 (7773725 hashes)
 86634061 --> server-b-4 (61822759 hashes)
148456820 --> server-a-0 (67793598 hashes)
216250418 --> server-a-2 (17304439 hashes)
233554857 --> server-b-3 (29289666 hashes)
262844523 --> server-b-1 (5590932 hashes)

Simplified routing table
        0 -- > server-b (23746828 hashes)
 23746828 -- > server-a (37156400 hashes)
 60903228 -- > server-b (87553592 hashes)
148456820 -- > server-a (85098037 hashes)
233554857 -- > server-b (34880598 hashes)

Stats
server-a 122254437
server-b 146181018

TOTAL HASHES: 268435455/268435455

if we add one node we migrate to this new setup

3 nodes with 5 partitions

Full routing table
        0 --> server-b-1 (23746828 hashes)
 23746828 --> server-a-4 (6848918 hashes)
 30595746 --> server-a-3 (16418937 hashes)
 47014683 --> server-c-3 (10659758 hashes)
 57674441 --> server-a-1 (3228787 hashes)
 60903228 --> server-b-2 (17957108 hashes)
 78860336 --> server-b-0 (7773725 hashes)
 86634061 --> server-b-4 (37826476 hashes)
124460537 --> server-c-2 (23996283 hashes)
148456820 --> server-a-0 (31382512 hashes)
179839332 --> server-c-1 (25303093 hashes)
205142425 --> server-c-4 (11107993 hashes)
216250418 --> server-a-2 (17304439 hashes)
233554857 --> server-b-3 (15386579 hashes)
248941436 --> server-c-0 (13903087 hashes)
262844523 --> server-b-1 (5590932 hashes)

Simplified routing table
        0 -- > server-b (23746828 hashes)
 23746828 -- > server-a (23267855 hashes)
 47014683 -- > server-c (10659758 hashes)
 57674441 -- > server-a ( 3228787 hashes)
 60903228 -- > server-b (63557309 hashes)
124460537 -- > server-c (23996283 hashes)
148456820 -- > server-a (31382512 hashes)
179839332 -- > server-c (36411086 hashes)
216250418 -- > server-a (17304439 hashes)
233554857 -- > server-b (15386579 hashes)
248941436 -- > server-c (13903087 hashes)
262844523 -- > server-b ( 5590932 hashes)

Stats
server-a 75183593
server-b 108281648
server-c 84970214

TOTAL HASHES: 268435455/268435455

Let's have a closer look at what happens with server-c

Simplified routing table
        0 -- > server-b (23746828 hashes)
 23746828 -- > server-a (23267855 hashes) ----+ 10659758 hashes
                                              | from server-a
 47014683 -- > server-c (10659758 hashes) <---+
 57674441 -- > server-a ( 3228787 hashes)
 60903228 -- > server-b (63557309 hashes) ----+ 23996283 hashes
                                              | from server-b
124460537 -- > server-c (23996283 hashes) <---+
148456820 -- > server-a (31382512 hashes) ----+ 36411086 hashes
                                              | from server-a
179839332 -- > server-c (36411086 hashes) <---+
216250418 -- > server-a (17304439 hashes)
233554857 -- > server-b (15386579 hashes) ----+ 13903087 hashes
                                              | from server-b
248941436 -- > server-c (13903087 hashes) <---+
262844523 -- > server-b ( 5590932 hashes)

Globally, server-c receives 47,070,844 hashes from server-a and 37,899,370 hashes from server-b, which results in a migration of approximately 30% of the total hashes. As you can see there is no ripple effect here, as the boundaries of the existing partitions do not change.
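The absence of ripple effects can also be checked programmatically. The following is a minimal sketch, assuming the same hashing scheme as the demo script at the end of the post; `make_lookup` is a hypothetical helper that is not part of that script. It resolves a sample of hashes before and after adding server-c and collects the ones that changed owner.

```python
import bisect
import hashlib


def hash_name(name):
    # First 7 hex digits of the SHA-1 digest, as in the demo script
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return int(digest[:7], 16)


def make_lookup(node_names, partitions):
    # Sorted (min_hash, node_name) boundaries, one per partition
    ring = sorted(
        (hash_name(f"{node}-{number}"), node)
        for node in node_names
        for number in range(partitions)
    )
    boundaries = [min_hash for min_hash, _ in ring]

    def lookup(key_hash):
        # Owner: the partition with the greatest min_hash <= key_hash.
        # Below the first boundary the index is -1, which wraps to the last partition.
        index = bisect.bisect_right(boundaries, key_hash) - 1
        return ring[index][1]

    return lookup


before = make_lookup(["server-a", "server-b"], 5)
after = make_lookup(["server-a", "server-b", "server-c"], 5)

sample = range(0, 2**28, 4099)  # evenly spaced sample of the key space
moved = [key for key in sample if before(key) != after(key)]

print(f"moved {len(moved) / len(sample):.1%} of the sampled hashes")
print("destinations:", {after(key) for key in moved})
```

Every hash that changes owner ends up on the new node: no key is ever moved between two existing nodes.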

Let's consider the performance in the worst case when we add one single node. If we are terribly unlucky (and we use a hash function with clear issues) each partition of the new node will completely cover a partition of an existing node. Assuming that the initial setup with N nodes created a balanced cluster, each node contains 1/Nth of the total keys, and in the worst case we need to move all of them from an existing node to the newly added one.

So, adding one node to a cluster of N nodes using consistent hashing results, in the worst case, in the migration of 1/Nth of the keys. In the previous example, then, we expected to migrate at most 50% of the keys (1/2), and we ended up migrating approximately 30% of them.

This is a terrific result. Not only is it much better than the previous one (at least 50% of the keys), but it gets better as the size of the cluster increases. In a cluster with 100 nodes, adding a node will result (in the worst case!) in the migration of 1/100 of the keys.

Source code¶

All routing tables shown in the post have been created with the following Python script. Please bear in mind that this is just demo code, so things haven't been optimised or designed particularly well. Feel free to change the hash function and the parameters of the script to experiment and see what consistent hashing can do.

consistent_hashing_demo.py

import hashlib
import itertools
import sys
import string

from operator import itemgetter

NUM_NODES = 3
NUM_PARTITIONS = 5


def hash_name(name):
    encoded_name = name.encode("utf-8")
    hash_encoded_name = hashlib.sha1(encoded_name).hexdigest()

    return int(hash_encoded_name[:7], 16)


def create_partitions(node_name, partitions):
    partition_hashes = []

    for partition_number in range(partitions):
        partition_name = f"{node_name}-{partition_number}"
        partition_hash = hash_name(partition_name)

        partition_hashes.append(
            {
                "min_hash": partition_hash,
                "partition_name": partition_name,
                "node_name": node_name,
            }
        )

    return partition_hashes


def create_routing_table(node_names, partitions):
    table = []

    for node_name in node_names:
        table.extend(create_partitions(node_name, partitions))

    table = sorted(table, key=itemgetter("min_hash"))

    return table


if NUM_NODES > len(string.ascii_lowercase):
    print("Too many servers")
    sys.exit(1)

nodes = [f"server-{i}" for i in string.ascii_lowercase[:NUM_NODES]]

routing_table = create_routing_table(nodes, NUM_PARTITIONS)
routing_table = [
    {
        "min_hash": 0,
        "partition_name": routing_table[-1]["partition_name"],
        "node_name": routing_table[-1]["node_name"],
    }
] + routing_table

routing_table_shift = routing_table[1:] + [
    {"min_hash": 0xFFFFFFF, "partition_name": "END"}
]

full_routing_table = []
for i, j in zip(routing_table, routing_table_shift):
    full_routing_table.append(
        {
            "min_hash": i["min_hash"],
            "partition_name": i["partition_name"],
            "node_name": i["node_name"],
            "served_hashes": j["min_hash"] - i["min_hash"],
        }
    )

print("Full routing table")
for r in full_routing_table:
    print(f'{r["min_hash"]:9} --> {r["partition_name"]} ({r["served_hashes"]} hashes)')

grouped_routing_table = itertools.groupby(
    full_routing_table, key=itemgetter("node_name")
)


simplified_routing_table = []
for r in grouped_routing_table:
    consecutive_partitions = list(r[1])

    simplified_routing_table.append(
        {
            "node_name": r[0],
            "min_hash": consecutive_partitions[0]["min_hash"],
            "served_hashes": sum([i["served_hashes"] for i in consecutive_partitions]),
        }
    )

print()
print("Simplified routing table")
for r in simplified_routing_table:
    print(f'{r["min_hash"]:9} -- > {r["node_name"]} ({r["served_hashes"]:8} hashes)')

print()
print("Stats")
stats = []
for node in nodes:
    slots = filter(lambda x: x["node_name"] == node, simplified_routing_table)
    total_hashes = sum([i["served_hashes"] for i in slots])
    stats.append({"node_name": node, "served_hashes": total_hashes})

for r in stats:
    print(r["node_name"], r["served_hashes"])

total_hashes = sum([i["served_hashes"] for i in stats])
print()
print(f"TOTAL HASHES: {total_hashes}/{2**28 - 1}")

Final words¶

I hope this long post was useful to introduce you to the topic of partitioning and in general to system design. As I mentioned, such concepts are currently in use by well-known systems, and still discussed as none of them is perfect, so it is worth understanding the fundamental issues before adopting a specific solution.

Resources¶

Martin Kleppmann, Designing Data-Intensive Applications, Chapter 6 "Partitioning", O’Reilly 2017 official site.
The Wikipedia article about consistent hashing.
A Guide to Consistent Hashing by Juan Pablo Carzolio.
The original article by David Karger et al.: "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web".
An alternative algorithm by John Lamping and Eric Veach: "A Fast, Minimal Memory, Consistent Hash Algorithm".

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

Photo by Alex Lvrs on Unsplash.

Public key cryptography: OpenSSH private keys

2021-06-03T14:00:00+01:00

When you create standard RSA keys with ssh-keygen you end up with a private key in PEM format, and a public key in OpenSSH format. Both have been described in detail in my post Public key cryptography: RSA keys. In 2014, OpenSSH introduced a custom format for private keys that is apparently similar to PEM but is internally completely different. This format is used by default when you create ed25519 keys and it is expected to be the default format for all keys in the future, so it is worth having a look.

While investigating this topic I found a lot of misconceptions and wrong or partially wrong statements on Stack Overflow, so I hope this might be a comprehensive view of what this format is, its relationship with PEM, and the tools that you can use to manipulate it.

I'm not the first programmer to look into this, clearly, and I have to mention two posts that I read before writing this one: OpenSSH ed25519 private key file format written in December 2017 by Peter Lyons and The OpenSSH private key binary format, written in August 2020 by Marin Atanasov Nikolov. I'm sure many others have done this research but these are the resources that I found and I want to say a big thanks to both authors for sharing their findings. I will shamelessly use their results in the following explanation, as I hope others will do with what I'm writing here. Sharing knowledge is one of the best ways to help others.

Please note that all the private keys shown in this post have been trashed after I published it.

Note: as the word "key" can identify several different components of the systems I will describe, I will use the words "private key" and "encryption key" as much as possible. The first is the key that we generate to be used in SSH, while the second is a parameter of a (symmetric) encryption algorithm.

KDFs and protection at rest¶

Describing the introduction of the new format, the OpenSSH changelog says

Add a new private key format that uses a bcrypt KDF to better
protect keys at rest. This format is used unconditionally for
Ed25519 keys, but may be requested when generating or saving
existing keys of other types via the -o ssh-keygen(1) option.
We intend to make the new format the default in the near future.
Details of the new format are in the PROTOCOL.key file.

Before we start dissecting the format, then, it is worth briefly discussing what a KDF is, what bcrypt is, and what it means to protect keys at rest.

Key Derivation Functions

Whenever a system is protected by a password you want to store the latter somewhere. This is clearly necessary to check the validity of the passwords that the user inputs and decide if you should grant access, but you shouldn't store the password in clear text, as a breach in the storage might compromise the whole system. The idea behind storing passwords securely is to run them through a hash function and store the hash: whenever someone inputs a password we can run the hash function again and compare the two hashes. However, we also want to prevent an attacker from reconstructing the password from the hash, so we need a cryptographic hash function, which is a hash function with added requirements to prevent an easy inversion of the process.

The same strategy can be applied when it comes to encryption. An encryption system needs a key (a sequence of bits used to encrypt the message) and we need to derive it from the password given by the user. Encryption keys are required to have a specific length dictated by the encryption algorithm that we use, so hashing looks like a good solution, as all hashes generated by a given algorithm are by definition of the same size. AES, for example, one of the most widespread symmetric block ciphers, uses a key of exactly 128, 192, or 256 bits. Converting the password into a key of predetermined size is called stretching.

Any cryptographic system can be broken using a brute-force attack, as you can always test all possible inputs. In the case of login, we can just input all possible passwords until we get access to the system, while in the case of encryption we can try to decrypt using all possible keys until we obtain a meaningful result. This means that the most important thing we can do to protect such systems is to make brute-force attacks infeasible. This can be done by increasing the key size (using more bits) but also by using a slow stretching algorithm.

While hash functions created for things like digital signatures should be fast, then, hash functions that we use to obfuscate the password (for storage) or to create the key (for encryption/decryption) have to be very slow. The slowness of the processing can frustrate brute-force attacks and make them less effective, if not infeasible. An example: at the current state of technology, you can easily hash 1 trillion passwords a second with a trivial expense, but if each one of those hashes takes 1 second you end up having to wait more than 31,000 years before you test all of them.
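The figure comes from a simple division, shown here as a quick check. The only assumption is a year of 365 days.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

passwords = 10**12    # one trillion candidate passwords
seconds_per_hash = 1  # a deliberately slow hash function

years = passwords * seconds_per_hash / SECONDS_PER_YEAR
print(f"{years:,.0f} years")  # 31,710 years
```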

The process that converts a password into a key is called a Key Derivation Function (KDF) and despite the name it is usually a complex algorithm and not a single mathematical function. PBKDF2 is an important KDF, defined as part of the specification PKCS #5, and it can use any pseudorandom function as part of the key stretching. An important feature of PBKDF2 is that it accepts an iteration count as input, which makes it possible to slow down the process. As we just saw, this is the key to making the algorithm slower in order to adapt to the increasing computing power available to attackers.

bcrypt

The password-hashing function known as bcrypt was created in 1999 and is based on the Blowfish cipher created in 1993. Bcrypt is well known to be an extremely good choice thanks to the simple fact that its slowness can be increased by tuning one of the parameters of the algorithm, called the "cost factor". This is the binary logarithm of the number of iterations done in the setup of the underlying cipher, and its logarithmic nature makes it easy to adapt the whole process to the increasing computational power available to attackers. This post attempts to estimate the time to hash a password of 15 characters with a cost of 30 (the maximum is actually 31) with a decent 2017 laptop (2.8 GHz Intel Core i7, 16 GB RAM). The result turns out to be around 500 days, which makes you understand that bcrypt won't die easily. It is important to note here that bcrypt is not a KDF, but a hash function. As such, it might be part of a KDF, but not replace the whole process.

Protection at rest

Protection at rest refers to the scheme that ensures data is secure when it is stored. Practically speaking, when it comes to SSH keys, we refer to the fact that an attacker who can physically access a key, for example by stealing a laptop, actually owns an encrypted version of the key, which can't be used without first decrypting it. As the attacker is assumed not to know the password used to encrypt the key, the only strategy they can use is to brute-force the key, and here is where the concept of protection at rest comes into play. Actually, the other strategy they can employ is to kidnap you and to force you to reveal the password, but this somehow falls outside the sphere of cryptographic security.

PEM format and protection at rest¶

Now that I clarified some terminology, let's have a look at what the standard PEM format does to store encrypted private keys. As I explained in my post Public key cryptography: RSA keys, a PEM file contains a text header, a text footer, and some content. The content is always an ASN.1 structure created using DER and encoded using base64.

For encrypted private keys, the ASN.1 structure is created following a standard called PKCS #8. This standard uses an encryption scheme called PBES2 described in the specification PKCS #5, which uses a symmetric cipher and a password, previously converted into an encryption key using the KDF called PBKDF2. I hope at this point some if not all of these names ring a bell.

We can roughly sketch the process with the following steps:

Create the private key using the requested asymmetric algorithm (e.g. RSA or ED25519)
Encrypt the private key following PBES2
- Stretch the password into an encryption key using PBKDF2 with one of the possible hash functions and a random salt value
- Encrypt the private key using the newly created encryption key
Represent the encrypted key and the parameters used for PBKDF2 using ASN.1/DER
Encode the result with base64
Add a header and a footer that specify the nature of the content

Let's create an encrypted key with OpenSSL and analyse it. The command I used is

openssl genpkey -aes-256-cbc -algorithm RSA\
    -pkeyopt rsa_keygen_bits:4096 -pass pass:foobar\
    -out key_rsa_4096_openssl_pw

which creates a 4096-bit RSA key and encrypts it with AES using foobar as password. What I get is a file in the aforementioned PEM format.

We can dump the ASN.1 content directly from the PEM format using openssl asn1parse

$ openssl asn1parse -inform pem -in key_rsa_4096_openssl_pw
    0:d=0  hl=4 l=2477 cons: SEQUENCE          
    4:d=1  hl=2 l=  87 cons: SEQUENCE          
    6:d=2  hl=2 l=   9 prim: OBJECT            :PBES2  (1)
   17:d=2  hl=2 l=  74 cons: SEQUENCE          
   19:d=3  hl=2 l=  41 cons: SEQUENCE          
   21:d=4  hl=2 l=   9 prim: OBJECT            :PBKDF2  (2)
   32:d=4  hl=2 l=  28 cons: SEQUENCE          
   34:d=5  hl=2 l=   8 prim: OCTET STRING      [HEX DUMP]:5BE04AE9442D08F0  (4)
   44:d=5  hl=2 l=   2 prim: INTEGER           :0800  (5)
   48:d=5  hl=2 l=  12 cons: SEQUENCE          
   50:d=6  hl=2 l=   8 prim: OBJECT            :hmacWithSHA256  (6)
   60:d=6  hl=2 l=   0 prim: NULL              
   62:d=3  hl=2 l=  29 cons: SEQUENCE          
   64:d=4  hl=2 l=   9 prim: OBJECT            :aes-256-cbc  (3)
   75:d=4  hl=2 l=  16 prim: OCTET STRING      [HEX DUMP]:88BD4E050F7D6691847BEAE813121BB0
   93:d=1  hl=4 l=2384 prim: OCTET STRING      [HEX DUMP]:93C719E39B382D[...]

Please note that I truncated the final OCTET STRING that contains the encrypted key as it is pretty long.

You can clearly see that this key is encrypted using PBES2 (1) and PBKDF2 (2). The algorithm used to encrypt the key is aes-256-cbc (3), as I asked. Specifically, this is AES with a key of 256 bits in CBC mode.

According to the PKCS #5 specification, the PBES2 block contains

PBES2-params ::= SEQUENCE {
       keyDerivationFunc AlgorithmIdentifier {{PBES2-KDFs}},
       encryptionScheme AlgorithmIdentifier {{PBES2-Encs}} }

and indeed we have PBKDF2 (2) for keyDerivationFunc, and aes-256-cbc (3) for encryptionScheme. The parameters of PBKDF2 are specified in the same document as

PBKDF2-params ::= SEQUENCE {
       salt CHOICE {
           specified OCTET STRING,
           otherSource AlgorithmIdentifier {{PBKDF2-SaltSources}}
       },
       iterationCount INTEGER (1..MAX),
       keyLength INTEGER (1..MAX) OPTIONAL,
       prf AlgorithmIdentifier {{PBKDF2-PRFs}} DEFAULT
       algid-hmacWithSHA1 }

As you can see in the ASN.1 dump, the salt is 5BE04AE9442D08F0 (4), the iteration count is 0x0800, that is 2048 (5), and the hash function (prf, pseudorandom function) is hmacWithSHA256 (6) without any additional parameters. The value 2048 for the iterations is a default value in OpenSSL (see the definition of PKCS5_DEFAULT_ITER).
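With these parameters, the stretching step can be sketched using the PBKDF2 implementation in Python's standard library. This is only an illustration of the key derivation: it assumes a 32-byte output because AES-256 needs a 256-bit key, and it doesn't attempt to decrypt the key, as AES is not part of the standard library.

```python
import hashlib

password = b"foobar"                      # the password passed to openssl genpkey
salt = bytes.fromhex("5BE04AE9442D08F0")  # the salt found in the ASN.1 dump
iterations = 0x0800                       # the iteration count in the dump (2048)

# hmacWithSHA256 is the PRF; the assumed key length is 32 bytes (AES-256)
encryption_key = hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=32)

print(len(encryption_key) * 8, "bit key:", encryption_key.hex())
```

Changing the password, the salt, or the iteration count produces a completely different encryption key, which is why all these parameters (except the password) have to be stored in the file.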

OpenSSH's private key format¶

As we saw at the beginning of the post, the OpenSSH team came up with a custom format to store the private keys, so now that we are familiar with the nomenclature and with the way PEM stores encrypted keys, let's see what this new format can do.

The best starting point for our investigation is the tool ssh-keygen which we can use to create private keys. The source can be found in the OpenSSH repository in the file ssh-keygen.c. This file uses two different functions, sshkey_private_to_blob2 (source code) for the new format and sshkey_private_to_blob_pem_pkcs8 (source code) for keys in PKCS #8 format. The former calls bcrypt_pbkdf which comes from OpenBSD (source code).

This function contains a modified implementation of PBKDF2 that uses bcrypt as the core hash function. The comment that you can find at the top of the file bcrypt_pbkdf.c says

/*
 * pkcs #5 pbkdf2 implementation using the "bcrypt" hash
 *
 * The bcrypt hash function is derived from the bcrypt password hashing
 * function with the following modifications:
 * 1. The input password and salt are preprocessed with SHA512.
 * 2. The output length is expanded to 256 bits.
 * 3. Subsequently the magic string to be encrypted is lengthened and modified
 *    to "OxychromaticBlowfishSwatDynamite"
 * 4. The hash function is defined to perform 64 rounds of initial state
 *    expansion. (More rounds are performed by iterating the hash.)
 *
 * Note that this implementation pulls the SHA512 operations into the caller
 * as a performance optimization.
 *
 * One modification from official pbkdf2. Instead of outputting key material
 * linearly, we mix it. pbkdf2 has a known weakness where if one uses it to
 * generate (e.g.) 512 bits of key material for use as two 256 bit keys, an
 * attacker can merely run once through the outer loop, but the user
 * always runs it twice. Shuffling output bytes requires computing the
 * entirety of the key material to assemble any subkey. This is something a
 * wise caller could do; we just do it for you.
 */

As you can see, this is intended to be a pkcs #5 pbkdf2 implementation that uses bcrypt as its underlying hash function. It also mentions some modifications, and it's worth noting that when you modify a standard you are not following the standard any more. I won't run through all the details of the implementation, though, as it's beyond the scope of the post.

So, the OpenSSH private key format ultimately contains a private key encrypted with an encryption key derived through a non-standard version of PBKDF2 that uses bcrypt as its core hash function. The structure that contains the key is not ASN.1, even though it's base64 encoded and wrapped between a header and a footer that are similar to the PEM ones. A description of the structure can be found in https://github.com/openssh/openssh-portable/blob/2dc328023f60212cd29504fc05d849133ae47355/PROTOCOL.key.

Cost factor and rounds

PBKDF2 uses the concept of rounds to make the key stretching slower. This is the number of times the hash function is called internally (using as salt the output of the previous iteration), so in PBKDF2 the number of rounds or iterations is directly proportional to the slowness of the stretching operation.

Bcrypt implements a similar mechanism with its cost factor. The cost factor in the standard bcrypt implementation is defined as the binary logarithm of the number of iterations of a specific part of the process (the repeated expansion of the password and the salt). Using the binary logarithm means that a cost factor of 4 (the minimum) corresponds to 16 iterations, while 31 (the maximum) corresponds to 2,147,483,648 (more than 2 billion) iterations.
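The relationship between cost factor and iterations is a plain power of two, as this short sketch shows (the function name is introduced here for illustration only).

```python
def bcrypt_iterations(cost):
    # The cost factor is the binary logarithm of the number of iterations
    return 2**cost


for cost in (4, 12, 31):
    print(f"cost {cost:2} -> {bcrypt_iterations(cost):,} iterations")
```

Each increment of the cost factor doubles the amount of work, both for the legitimate user and for the attacker.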

In the OpenSSH/OpenBSD implementation things are a bit different.

OpenBSD's version of bcrypt runs with a fixed cost of 6, which results in 64 iterations of the key expansion (source code), but being an implementation of PBKDF2 it can still be hardened by increasing the number of rounds (source code). Those rounds correspond to the value given to the parameter -a of the ssh-keygen command line.
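The number of rounds is stored in the key file itself, so it can be read back without knowing the password. The following is a rough sketch based on my reading of PROTOCOL.key: I assume the decoded blob starts with the magic string "openssh-key-v1" followed by a null byte, then three length-prefixed strings (cipher name, KDF name, KDF options), and that for bcrypt the options contain a length-prefixed salt followed by a 32-bit number of rounds. Please check the file before relying on this layout. The helper names are hypothetical, and the example blob is synthetic.

```python
import base64
import struct

MAGIC = b"openssh-key-v1\x00"


def read_string(data, offset):
    # SSH "string": 32-bit big-endian length followed by that many bytes
    (length,) = struct.unpack_from(">I", data, offset)
    start = offset + 4
    return data[start:start + length], start + length


def parse_header(blob):
    if not blob.startswith(MAGIC):
        raise ValueError("not an OpenSSH private key blob")

    offset = len(MAGIC)
    cipher, offset = read_string(blob, offset)
    kdf, offset = read_string(blob, offset)
    kdf_options, offset = read_string(blob, offset)

    header = {"cipher": cipher.decode(), "kdf": kdf.decode()}
    if kdf == b"bcrypt":
        # Assumed layout of the options: string salt, uint32 rounds
        salt, inner_offset = read_string(kdf_options, 0)
        (rounds,) = struct.unpack_from(">I", kdf_options, inner_offset)
        header.update(salt=salt.hex(), rounds=rounds)
    return header


def load_blob(path):
    # Drop the BEGIN/END lines and decode the base64 body
    with open(path) as key_file:
        body = [line.strip() for line in key_file if not line.startswith("-----")]
    return base64.b64decode("".join(body))


def ssh_string(payload):
    return struct.pack(">I", len(payload)) + payload


# Synthetic header that mimics a key created with -a 100 (not a real key)
example = (
    MAGIC
    + ssh_string(b"aes256-ctr")
    + ssh_string(b"bcrypt")
    + ssh_string(ssh_string(bytes(16)) + struct.pack(">I", 100))
)
print(parse_header(example))
```

For a real key, `parse_header(load_blob(path))` should report the rounds requested when the key was created, where `path` is the location of the private key file.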

How many rounds?

When it comes to KDFs, the advice is always to run as many iterations as possible while keeping the specific application usable, so you need to tune your SSH keys by testing different values on your system. To give you some rough estimations, Wikipedia mentions that for PBKDF2 the number of iterations used by Apple and LastPass is between 2k and 100k. It is worth reiterating though that you shouldn't aim to use other people's figures, in this case. Instead, run tests on your software and hardware.

On my laptop, an i7-8565U with 32GiB of RAM running Kubuntu 20.04 I get the following results, which are pretty linear:

ssh-keygen -a 100 -t ed25519	0.667s
ssh-keygen -a 500 -t ed25519	3.148s
ssh-keygen -a 1000 -t ed25519	6.331s
ssh-keygen -a 5000 -t ed25519	31.624s

A sensible value for me might be between 100 and 500, then, so that I don't have to wait too long every time I push and pull my branches from GitHub.
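If you want to run the same kind of test on your system, the measurements can be automated. This is a minimal sketch that assumes ssh-keygen is installed and available in the path; it creates throwaway keys in a temporary directory, and the passphrase is a placeholder. As far as I can tell the KDF is used only when the key is encrypted, which is why the script sets a non-empty passphrase.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def keygen_command(rounds, key_path, passphrase="benchmark-only"):
    # -a: KDF rounds, -t: key type, -N: passphrase, -f: output file, -q: quiet
    return [
        "ssh-keygen", "-q", "-a", str(rounds), "-t", "ed25519",
        "-N", passphrase, "-f", str(key_path),
    ]


def time_keygen(rounds):
    # Each run writes the key pair into a fresh directory that is then deleted
    with tempfile.TemporaryDirectory() as directory:
        key_path = Path(directory) / "key"
        start = time.perf_counter()
        subprocess.run(keygen_command(rounds, key_path), check=True)
        return time.perf_counter() - start


if __name__ == "__main__" and shutil.which("ssh-keygen"):
    for rounds in (50, 100, 200):
        print(f"-a {rounds}: {time_keygen(rounds):.3f}s")
```

The absolute values depend entirely on your hardware, so what matters is the trend and the delay that you are willing to accept.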

Can we convert private OpenSSH keys into PEM?¶

As OpenSSL doesn't understand the OpenSSH private key format, a common question among programmers and devops is whether it is possible to convert it into a PEM format. As you might have guessed reading the previous sections, the answer is no. The PEM format for private keys uses PKCS #5, so it supports only the standard implementation of PBKDF2.

It's interesting to note that the OpenSSL team also specifically decided not to support this new format as it is not standard (see https://github.com/openssl/openssl/issues/5323).

A poorly documented format¶

PEM, PKCS #8, ASN.1, and all other formats that we use every day, including the OpenSSH public key format, are well documented and standardised in RFCs or similar documents. The OpenSSH private key format is documented in a tiny file that you can find in the source code, but that file doesn't offer more than a quick overview. To have a good understanding of what is going on I had to read the source code, not only of OpenSSH, but also of OpenBSD.

I think poor documentation like this might be acceptable in personal projects or in new tools, but SSH is used by the whole world, and when the team decides to come up with a completely new format for one of its most important elements I would expect them to detail every single bit of it, or at least try to be more open about the reasons and the implementation. I also personally believe that standards can't but benefit intercommunication between systems and, in cryptography, improve security, since they are reviewed and discussed by a wider audience.

The claim is that the new SSH private key format offers a better protection of keys at rest. I'd be very interested to see a cryptanalysis made by some expert (which I'm not). Cryptography is a tricky field, and often things that are apparently smart end up being tragically wrong.

Resources¶

OpenSSL documentation: asn1parse, genpkey
The Base64 encoding
The Abstract Syntax Notation One ASN.1 interface description language
RFC 4251 - The Secure Shell (SSH) Protocol Architecture
RFC 4253 - The Secure Shell (SSH) Transport Layer Protocol
RFC 4716 - The Secure Shell (SSH) Public Key File Format
RFC 2898 - PKCS #5: Password-Based Cryptography Specification Version 2.0
RFC 5208 - Public-Key Cryptography Standards (PKCS) #8: Private-Key Information Syntax Specification Version 1.2
RFC 5958 - Asymmetric Key Packages
RFC 7468 - Textual Encodings of PKIX, PKCS, and CMS Structures


Public key cryptography: SSL certificates

2020-11-04T23:00:00+01:00

In the context of public key cryptography, certificates are a way to prove the identity of the owner of a public key.

While public key cryptography allows us to communicate securely through an insecure network, it leaves the problem of identity untouched. Once we established an encrypted communication we can be sure that the data we send and receive cannot be read or tampered with by third parties. But how can we be sure that the entity on the other side of the communication channel, with which we initiated the communication, is what it claims to be?

In other words, the messages cannot be read or modified by malicious third-parties, but what if we established communication with a malicious actor in the first place? Such a situation can arise during a man-in-the-middle attack, where the low-level network communication is hijacked by a malicious actor who pretends to be the desired recipient of the communication.

In the context of the Internet, and in particular of the World Wide Web, the main concern is that the server that provides services we log into (think of every service that has your personal or financial data like your bank, Google, Facebook, Netflix, etc.) is run by the company that we trust and not by an attacker who wants to steal our data.

In this post I will try to clarify the main components of the certificates system and to explain the meaning of the major acronyms and names that you might hear when you deal with this part of web development.

Clarification: SSL vs TLS¶

In the world of web development and infrastructure management, we normally speak of SSL protocol and of SSL certificates, but it has to be noted that SSL (Secure Sockets Layer) is the name of a deprecated protocol. The current implementation of the protocol used to secure web applications is TLS (Transport Layer Security).

The story of SSL and TLS is rich in events and spans 25 years since its inception by Taher Elgamal at Netscape. In short, SSL had 3 major versions (the first of which was never publicly used), and was replaced by TLS in 1999. TLS itself has gone through 3 revisions at the time of writing, TLS 1.3 being the latest version available.

The TLS/SSL nomenclature is one of many sources of confusion in the complicated world of security and applied cryptography. In this article I will use only the acronym TLS, but I went for SSL in the title because I wanted the subject matter to be recognisable also by developers that are not much into security and cryptography.

X.509 certificates¶

While the problem of the identity in an insecure network can be solved in several ways, the solution embraced to secure the World Wide Web is based on a standard called X.509. When we mention TLS certificates, we usually mean X.509 certificates used in a TLS connection, such as that created by HTTPS.

X.509 is the ITU-T standard used to represent certificates, and has been chosen to be the standard used in the TLS protocol. The standard doesn't only define the binary structure of the certificate itself, but it also defines procedures to revoke the certificates, and establishes a hierarchical system of certification known as certificate path, or certificate chain.

The structure of an X.509 certificate is expressed using ASN.1, a notation used natively by the PEM format (discussed here). You can read the full specification in RFC 2459, in particular Section 4 "Certificate and Certificate Extensions Profile". I will refer to this later when I have a look at a real certificate.

Before I discuss how certificates solve the problem of identity (or ownership of a public key), let's clarify the relationship between them and HTTPS.

HTTPS stands for HTTP Secure, and the core of the protocol consists of running HTTP over TLS. When we access a web site with HTTPS the browser first establishes a TLS connection with the server and then communicates with it using pure HTTP. This means that the whole HTTP protocol is encrypted, as the secure channel is established outside it, and also means that, aside from the different URI scheme https:// instead of http://, there are no differences between the two protocols.
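
To make the layering concrete, here is a minimal Python sketch that uses only the standard library modules socket and ssl. The function names build_request and fetch_status_line are mine, and the host you pass is up to you; the point is only that the request is ordinary HTTP and that TLS is wrapped around the TCP socket before any HTTP byte is sent.

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """Build a plain HTTP/1.1 request; nothing in it is TLS-specific."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def fetch_status_line(host: str, port: int = 443) -> str:
    """Open a TCP socket, wrap it in TLS, then speak ordinary HTTP."""
    # The default context verifies the server certificate and the host name.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as tcp:
        # The TLS handshake (and the certificate check) happens here.
        with context.wrap_socket(tcp, server_hostname=host) as tls:
            tls.sendall(build_request(host))
            first_line = tls.recv(4096).split(b"\r\n", 1)[0]
            return first_line.decode("ascii", "replace")


# Example call (needs network access):
# print(fetch_status_line("www.thedigitalcatonline.com"))
```

If the certificate cannot be verified, wrap_socket raises an exception and no HTTP request is ever sent, which is the behaviour described above.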

Certificates come into play when the browser establishes the TLS connection, which is why you need to set up HTTPS as part of your infrastructure and not in your web application. By the time the HTTP requests reach your application they are already decrypted and accessible in plain text, as the HTTP protocol mandates. We usually say that we "terminate TLS" when a component of our infrastructure manages certificates and decrypts HTTPS into HTTP.

How do certificates work?

The X.509 standard establishes entities called Certificate Authorities (CAs), and creates a hierarchy of trust between them called a chain. The idea is that there is a set of entities that are trusted worldwide by operating systems, browsers, and other network-related software, and that these entities can trust other entities, thus creating a trust network.

While the market of Certificate Authorities is dominated by three major commercial players (see the usage statistics), there are approximately 100 organisations operating worldwide, some of which are non-profit. Not all of these are trusted by all operating systems or browsers, though.

The set of CAs trusted by an organisation is called a root program. The Mozilla community runs a program that is independent from the hardware/software platform, aptly called Mozilla's CA Certificate Program, which uses data contained in the Common CA Database (CCADB). Private companies such as Microsoft, Apple, and Oracle run their own root programs, and software running on the respective platforms (Windows, macOS/iOS, Java) can decide to trust the CAs provided by those programs.

In the open-source world, the Mozilla root program is by far the most influential and important source of information, being used by other software packages and Linux distributions.

It is possible to create certificates that are not signed by any CA, and these are called self-signed certificates. Such certificates can be used with any software that relies on certificates, but this requires the software to disable certificate checking against the Certificate Authorities. Self-signed certificates are obviously useful for testing purposes, but there are scenarios in which it might be desirable not to rely on the CAs and establish a private network of trust.

Example: CA root certificate

The certificates for root CAs that are part of the Mozilla root program can be retrieved from the Common CA Database web page, or can be seen in the Firefox source code directly. On a running Firefox browser you can open the Privacy & Security menu and click on "View Certificates" at the bottom of the page. The CAs are listed under the tab "Authorities".

The interesting thing you can do here is to export a CA certificate. If you do it Firefox will save it in a file with extension .crt, that contains data in PEM format. I exported the certificate for Amazon Root CA 1 and I ended up with the file AmazonRootCA1.crt. If, instead of exporting, you view the certificate, you will end up in a page that allows you to download the certificate and the chain, both in PEM format, in files with the extension .pem. As you see, you are not the only one who is confused.

I described the PEM format in a post on RSA keys, so I won't repeat the whole discussion here. RFC 7468 ("Textual Encodings of PKIX, PKCS, and CMS Structures") describes certificates in section 5. Section 4 mentions the module id-pkix1-e for Certificate, CertificateList, and SubjectPublicKeyInfo, which are defined in RFC 5280 ("Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile").
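
As a quick reminder of what the PEM wrapper amounts to, the following Python sketch converts between PEM text and the underlying binary (DER) data. The helper pem_to_der is my own illustration, and the sample bytes are just the first 16 bytes of the Amazon certificate rather than a complete one; ssl.DER_cert_to_PEM_cert and ssl.PEM_cert_to_DER_cert are part of the standard library.

```python
import base64
import ssl

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def pem_to_der(pem: str) -> bytes:
    """Strip the BEGIN/END lines and Base64-decode what is in between."""
    lines = [line.strip() for line in pem.strip().splitlines()]
    if lines[0] != PEM_HEADER or lines[-1] != PEM_FOOTER:
        raise ValueError("not a PEM certificate")
    return base64.b64decode("".join(lines[1:-1]))


# Stand-in for real DER data: the first 16 bytes of Amazon Root CA 1.
der = bytes.fromhex("3082034130820229a003020102021306")

pem = ssl.DER_cert_to_PEM_cert(der)  # the standard library adds the wrapper
print(pem)

# Both the hand-written helper and the standard library give back the bytes.
assert pem_to_der(pem) == der
assert ssl.PEM_cert_to_DER_cert(pem) == der
```

Neither function validates the content, so they work on any Base64 payload; parsing the ASN.1 structure is a separate step.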

The identifier id-pkix1-e is part of a registry of objects to be used in ASN.1 data created in the framework of the Public-Key Infrastructure using X.509 (PKIX) Working Group, that defined the infrastructure around the X.509 certificates system. Basically it's a standard way to identify binary objects and their structure. You can see a full list of all the objects in RFC 7299 ("Object Identifier Registry for the PKIX Working Group"). Not a very exciting one to read, if you ask me.

I can dump the content of the Amazon Root CA 1 certificate with OpenSSL

$ openssl asn1parse -inform pem -in amazon-root-ca-1.pem
    0:d=0  hl=4 l= 833 cons: SEQUENCE
    4:d=1  hl=4 l= 553 cons: SEQUENCE
    8:d=2  hl=2 l=   3 cons: cont [ 0 ]
   10:d=3  hl=2 l=   1 prim: INTEGER           :02
   13:d=2  hl=2 l=  19 prim: INTEGER           :066C9FCF99BF8C0A39E2F0788A43E696365BCA
   34:d=2  hl=2 l=  13 cons: SEQUENCE
   36:d=3  hl=2 l=   9 prim: OBJECT            :sha256WithRSAEncryption
   47:d=3  hl=2 l=   0 prim: NULL
   49:d=2  hl=2 l=  57 cons: SEQUENCE
   51:d=3  hl=2 l=  11 cons: SET
   53:d=4  hl=2 l=   9 cons: SEQUENCE
   55:d=5  hl=2 l=   3 prim: OBJECT            :countryName
   60:d=5  hl=2 l=   2 prim: PRINTABLESTRING   :US
   64:d=3  hl=2 l=  15 cons: SET
   66:d=4  hl=2 l=  13 cons: SEQUENCE
   68:d=5  hl=2 l=   3 prim: OBJECT            :organizationName
   73:d=5  hl=2 l=   6 prim: PRINTABLESTRING   :Amazon
   81:d=3  hl=2 l=  25 cons: SET
   83:d=4  hl=2 l=  23 cons: SEQUENCE
   85:d=5  hl=2 l=   3 prim: OBJECT            :commonName
   90:d=5  hl=2 l=  16 prim: PRINTABLESTRING   :Amazon Root CA 1
  108:d=2  hl=2 l=  30 cons: SEQUENCE
  110:d=3  hl=2 l=  13 prim: UTCTIME           :150526000000Z
  125:d=3  hl=2 l=  13 prim: UTCTIME           :380117000000Z
  140:d=2  hl=2 l=  57 cons: SEQUENCE
  142:d=3  hl=2 l=  11 cons: SET
  144:d=4  hl=2 l=   9 cons: SEQUENCE
  146:d=5  hl=2 l=   3 prim: OBJECT            :countryName
  151:d=5  hl=2 l=   2 prim: PRINTABLESTRING   :US
  155:d=3  hl=2 l=  15 cons: SET
  157:d=4  hl=2 l=  13 cons: SEQUENCE
  159:d=5  hl=2 l=   3 prim: OBJECT            :organizationName
  164:d=5  hl=2 l=   6 prim: PRINTABLESTRING   :Amazon
  172:d=3  hl=2 l=  25 cons: SET
  174:d=4  hl=2 l=  23 cons: SEQUENCE
  176:d=5  hl=2 l=   3 prim: OBJECT            :commonName
  181:d=5  hl=2 l=  16 prim: PRINTABLESTRING   :Amazon Root CA 1
  199:d=2  hl=4 l= 290 cons: SEQUENCE
  203:d=3  hl=2 l=  13 cons: SEQUENCE
  205:d=4  hl=2 l=   9 prim: OBJECT            :rsaEncryption
  216:d=4  hl=2 l=   0 prim: NULL
  218:d=3  hl=4 l= 271 prim: BIT STRING
  493:d=2  hl=2 l=  66 cons: cont [ 3 ]
  495:d=3  hl=2 l=  64 cons: SEQUENCE
  497:d=4  hl=2 l=  15 cons: SEQUENCE
  499:d=5  hl=2 l=   3 prim: OBJECT            :X509v3 Basic Constraints
  504:d=5  hl=2 l=   1 prim: BOOLEAN           :255
  507:d=5  hl=2 l=   5 prim: OCTET STRING      [HEX DUMP]:30030101FF
  514:d=4  hl=2 l=  14 cons: SEQUENCE
  516:d=5  hl=2 l=   3 prim: OBJECT            :X509v3 Key Usage
  521:d=5  hl=2 l=   1 prim: BOOLEAN           :255
  524:d=5  hl=2 l=   4 prim: OCTET STRING      [HEX DUMP]:03020186
  530:d=4  hl=2 l=  29 cons: SEQUENCE
  532:d=5  hl=2 l=   3 prim: OBJECT            :X509v3 Subject Key Identifier
  537:d=5  hl=2 l=  22 prim: OCTET STRING      [HEX DUMP]:04148418CC8534ECBC0C94942E08599CC7B2104E0A08
  561:d=1  hl=2 l=  13 cons: SEQUENCE
  563:d=2  hl=2 l=   9 prim: OBJECT            :sha256WithRSAEncryption
  574:d=2  hl=2 l=   0 prim: NULL
  576:d=1  hl=4 l= 257 prim: BIT STRING

Let's read part of it using the aforementioned section 4 of RFC 5280.

The signed certificate is a sequence of three main components

   Certificate  ::=  SEQUENCE  {
        tbsCertificate       TBSCertificate,
        signatureAlgorithm   AlgorithmIdentifier,
        signatureValue       BIT STRING  }

and the TBSCertificate structure represents the unsigned certificate (TBS = To Be Signed)

TBSCertificate  ::=  SEQUENCE  {
        version         [0]  EXPLICIT Version DEFAULT v1,
        serialNumber         CertificateSerialNumber,
        signature            AlgorithmIdentifier,
        issuer               Name,
        validity             Validity,
        subject              Name,
        subjectPublicKeyInfo SubjectPublicKeyInfo,
        issuerUniqueID  [1]  IMPLICIT UniqueIdentifier OPTIONAL,
                             -- If present, version MUST be v2 or v3
        subjectUniqueID [2]  IMPLICIT UniqueIdentifier OPTIONAL,
                             -- If present, version MUST be v2 or v3
        extensions      [3]  EXPLICIT Extensions OPTIONAL
                             -- If present, version MUST be v3
        }

Comparing this with the output of OpenSSL we can find fields such as version

   10:d=3  hl=2 l=   1 prim: INTEGER           :02

which according to the documentation means version 3 (the encoded value is 02 because versions are numbered starting from 0). Many values are of type PRINTABLESTRING, so they are readable already in the ASN.1 dump.
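
If you are curious about where the columns hl (header length) and l (content length) of the dump come from, this is a small Python sketch of how a DER header can be read. The function read_header is my own simplification (it assumes single-byte tags, which is enough here), and the input is the first 13 bytes of the Amazon certificate.

```python
from typing import Tuple


def read_header(data: bytes, offset: int) -> Tuple[int, int, int]:
    """Return (tag, header_length, content_length) of the element at offset."""
    tag = data[offset]
    first = data[offset + 1]
    if first < 0x80:
        # Short form: the byte itself is the length of the content.
        return tag, 2, first
    # Long form: the low 7 bits say how many bytes hold the length.
    count = first & 0x7F
    length = int.from_bytes(data[offset + 2 : offset + 2 + count], "big")
    return tag, 2 + count, length


# Certificate header, TBSCertificate header, [0] wrapper, version INTEGER
head = bytes.fromhex("30820341" "30820229" "a003" "020102")

for offset in (0, 4, 8, 10):
    tag, hl, length = read_header(head, offset)
    print(f"{offset:5d}: hl={hl} l={length:4d} tag=0x{tag:02x}")

# The content of the INTEGER at offset 10 is the single byte at offset 12.
version = head[12] + 1  # v1 is encoded as 0, so 02 means version 3
print("version:", version)
```

The offsets and lengths printed by the loop match the first, second, third, and fourth line of the asn1parse output.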

The validity of the certificate is

  110:d=3  hl=2 l=  13 prim: UTCTIME           :150526000000Z
  125:d=3  hl=2 l=  13 prim: UTCTIME           :380117000000Z

and following section 4.1.2.5.1 of the RFC we find out that the certificate is valid between 26 May 2015 and 17 Jan 2038. You can easily read these values in the certificate page in the browser without getting a headache trying to decode ASN.1.
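
The rule in that section is simple enough to be sketched in a few lines of Python: the value has the form YYMMDDHHMMSSZ, and a two-digit year of 50 or more belongs to the 1900s, while anything below 50 belongs to the 2000s. The function name parse_utctime is mine.

```python
from datetime import datetime, timezone


def parse_utctime(value: str) -> datetime:
    """Parse an X.509 UTCTime in the YYMMDDHHMMSSZ form."""
    if len(value) != 13 or not value.endswith("Z"):
        raise ValueError("expected YYMMDDHHMMSSZ")
    yy = int(value[0:2])
    # RFC 5280: YY >= 50 means 19YY, YY < 50 means 20YY.
    year = 1900 + yy if yy >= 50 else 2000 + yy
    return datetime(
        year,
        int(value[2:4]),
        int(value[4:6]),
        int(value[6:8]),
        int(value[8:10]),
        int(value[10:12]),
        tzinfo=timezone.utc,
    )


print(parse_utctime("150526000000Z").date())  # 2015-05-26
print(parse_utctime("380117000000Z").date())  # 2038-01-17
```

I wrote the year rule explicitly instead of relying on the %y directive of strptime, whose pivot year is different from the one mandated by the RFC.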

The CA signed the certificate using a certain algorithm. The algorithm identifier is repeated twice, first in the structure Certificate (signatureAlgorithm AlgorithmIdentifier) and then in the structure TBSCertificate (signature AlgorithmIdentifier). The two fields must have the same value.

   34:d=2  hl=2 l=  13 cons: SEQUENCE
   36:d=3  hl=2 l=   9 prim: OBJECT            :sha256WithRSAEncryption
   47:d=3  hl=2 l=   0 prim: NULL

[...]

  561:d=1  hl=2 l=  13 cons: SEQUENCE
  563:d=2  hl=2 l=   9 prim: OBJECT            :sha256WithRSAEncryption
  574:d=2  hl=2 l=   0 prim: NULL

For this certificate, the algorithm used by Amazon is sha256WithRSAEncryption. This label is described in RFC 4055 ("Additional Algorithms and Identifiers for RSA Cryptography for use in the Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile") as "PKCS #1 version 1.5 signature algorithm with SHA-256". The specific algorithm can be found in RFC 2313 ("PKCS #1: RSA Encryption Version 1.5"). As the name of the algorithm suggests, the certificate is first digested with SHA-256 and the digest is then encrypted using RSA and the private key of the signer.

Speaking of keys, the public key the CA used for the certificate can be found in the field subjectPublicKeyInfo, which is again made of a field of type AlgorithmIdentifier and a bit string with the value of the key. In this case the fields are

  205:d=4  hl=2 l=   9 prim: OBJECT            :rsaEncryption
  216:d=4  hl=2 l=   0 prim: NULL
  218:d=3  hl=4 l= 271 prim: BIT STRING

The algorithm rsaEncryption is described in RFC 3279 ("Algorithms and Identifiers for the Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile"), section 2.3.1 as

      RSAPublicKey ::= SEQUENCE {
         modulus            INTEGER,    -- n
         publicExponent     INTEGER  }  -- e

(sic) or in RFC 8017 ("PKCS #1: RSA Cryptography Specifications Version 2.2")

RSAPublicKey ::= SEQUENCE {
    modulus           INTEGER,  -- n
    publicExponent    INTEGER   -- e
}

We can then use the option -strparse of the module asn1parse to find the actual values

$ openssl asn1parse -inform pem -in amazon-root-ca-1.pem -strparse 218
    0:d=0  hl=4 l= 266 cons: SEQUENCE
    4:d=1  hl=4 l= 257 prim: INTEGER           :B2788071CA78D5E371AF478050747D6ED8D78876F4
9968F7582160F97484012FAC022D86D3A0437A4EB2A4D036BA01BE8DDB48C80717364CF4EE8823C73EEB37F5B5
19F84968B0DED7B976381D619EA4FE8236A5E54A56E445E1F9FDB416FA74DA9C9B35392FFAB02050066C7AD080
B2A6F9AFEC47198F503807DCA2873958F8BAD5A9F948673096EE94785E6F89A351C0308666A14566BA54EBA3C3
91F948DCFFD1E8302D7D2D747035D78824F79EC4596EBB738717F2324628B843FAB71DAACAB4F29F240E2D4BF7
715C5E69FFEA9502CB388AAE50386FDBFB2D621BC5C71E54E177E067C80F9C8723D63F40207F2080C4804C3E3B
24268E04AE6C9AC8AA0D
  265:d=1  hl=2 l=   3 prim: INTEGER           :010001
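
A note on the two INTEGER values: DER stores integers as big-endian two's complement numbers, which is why the modulus occupies 257 bytes instead of 256. Its first significant byte is B2, which has the highest bit set, so a leading 00 is added to keep the number positive (OpenSSL hides that byte here but shows it in the output of the x509 module). The Python sketch below, with a helper name of my choosing, illustrates the rule on the exponent and on the first byte of the modulus.

```python
def der_integer(content: bytes) -> int:
    """Decode the content bytes of a DER INTEGER."""
    # DER integers are big-endian and signed (two's complement).
    return int.from_bytes(content, "big", signed=True)


# The public exponent shown by asn1parse
print(der_integer(bytes.fromhex("010001")))  # 65537

# With the leading 00 the value is positive...
print(der_integer(bytes.fromhex("00b2")))  # 178
# ...without it the same byte would be read as a negative number.
print(der_integer(bytes.fromhex("b2")))  # -78

# 257 content bytes minus the leading 00 leaves 256 bytes, i.e. 2048 bits.
print((257 - 1) * 8)  # 2048
```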

As we already saw for RSA keys, OpenSSL has a specific module for important structures, and X.509 certificates are definitely worth a module, aptly called x509. Using that we can easily decode any certificate

$ openssl x509 -inform pem -in amazon-root-ca-1.pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            06:6c:9f:cf:99:bf:8c:0a:39:e2:f0:78:8a:43:e6:96:36:5b:ca
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C = US, O = Amazon, CN = Amazon Root CA 1
        Validity
            Not Before: May 26 00:00:00 2015 GMT
            Not After : Jan 17 00:00:00 2038 GMT
        Subject: C = US, O = Amazon, CN = Amazon Root CA 1
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
                Modulus:
                    00:b2:78:80:71:ca:78:d5:e3:71:af:47:80:50:74:
                    7d:6e:d8:d7:88:76:f4:99:68:f7:58:21:60:f9:74:
                    84:01:2f:ac:02:2d:86:d3:a0:43:7a:4e:b2:a4:d0:
                    36:ba:01:be:8d:db:48:c8:07:17:36:4c:f4:ee:88:
                    23:c7:3e:eb:37:f5:b5:19:f8:49:68:b0:de:d7:b9:
                    76:38:1d:61:9e:a4:fe:82:36:a5:e5:4a:56:e4:45:
                    e1:f9:fd:b4:16:fa:74:da:9c:9b:35:39:2f:fa:b0:
                    20:50:06:6c:7a:d0:80:b2:a6:f9:af:ec:47:19:8f:
                    50:38:07:dc:a2:87:39:58:f8:ba:d5:a9:f9:48:67:
                    30:96:ee:94:78:5e:6f:89:a3:51:c0:30:86:66:a1:
                    45:66:ba:54:eb:a3:c3:91:f9:48:dc:ff:d1:e8:30:
                    2d:7d:2d:74:70:35:d7:88:24:f7:9e:c4:59:6e:bb:
                    73:87:17:f2:32:46:28:b8:43:fa:b7:1d:aa:ca:b4:
                    f2:9f:24:0e:2d:4b:f7:71:5c:5e:69:ff:ea:95:02:
                    cb:38:8a:ae:50:38:6f:db:fb:2d:62:1b:c5:c7:1e:
                    54:e1:77:e0:67:c8:0f:9c:87:23:d6:3f:40:20:7f:
                    20:80:c4:80:4c:3e:3b:24:26:8e:04:ae:6c:9a:c8:
                    aa:0d
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Basic Constraints: critical
                CA:TRUE
            X509v3 Key Usage: critical
                Digital Signature, Certificate Sign, CRL Sign
            X509v3 Subject Key Identifier: 
                84:18:CC:85:34:EC:BC:0C:94:94:2E:08:59:9C:C7:B2:10:4E:0A:08
    Signature Algorithm: sha256WithRSAEncryption
         98:f2:37:5a:41:90:a1:1a:c5:76:51:28:20:36:23:0e:ae:e6:
         28:bb:aa:f8:94:ae:48:a4:30:7f:1b:fc:24:8d:4b:b4:c8:a1:
         97:f6:b6:f1:7a:70:c8:53:93:cc:08:28:e3:98:25:cf:23:a4:
         f9:de:21:d3:7c:85:09:ad:4e:9a:75:3a:c2:0b:6a:89:78:76:
         44:47:18:65:6c:8d:41:8e:3b:7f:9a:cb:f4:b5:a7:50:d7:05:
         2c:37:e8:03:4b:ad:e9:61:a0:02:6e:f5:f2:f0:c5:b2:ed:5b:
         b7:dc:fa:94:5c:77:9e:13:a5:7f:52:ad:95:f2:f8:93:3b:de:
         8b:5c:5b:ca:5a:52:5b:60:af:14:f7:4b:ef:a3:fb:9f:40:95:
         6d:31:54:fc:42:d3:c7:46:1f:23:ad:d9:0f:48:70:9a:d9:75:
         78:71:d1:72:43:34:75:6e:57:59:c2:02:5c:26:60:29:cf:23:
         19:16:8e:88:43:a5:d4:e4:cb:08:fb:23:11:43:e8:43:29:72:
         62:a1:a9:5d:5e:08:d4:90:ae:b8:d8:ce:14:c2:d0:55:f2:86:
         f6:c4:93:43:77:66:61:c0:b9:e8:41:d7:97:78:60:03:6e:4a:
         72:ae:a5:d1:7d:ba:10:9e:86:6c:1b:8a:b9:59:33:f8:eb:c4:
         90:be:f1:b9

Now I'm pretty sure you want to kill me because I could have shown you this from the start. But I like to understand things, and the easy path doesn't always make everything clear. At any rate, here you have a way to read an X.509 certificate in PEM format.

Please note that in this certificate the Issuer and the Subject are the same entity, as this is a root certificate, which is signed by the same entity that creates it.

        Issuer: C = US, O = Amazon, CN = Amazon Root CA 1
[...]
        Subject: C = US, O = Amazon, CN = Amazon Root CA 1

Moreover, one of the version 3 extensions of the self-signed certificate is a basic constraint with the boolean CA set to true. It also has the extension Key Usage set to Digital Signature, Certificate Sign, CRL Sign, which means that the certificate can be used to sign other certificates.
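
The list of usages comes straight from the four bytes 03020186 that the ASN.1 dump shows for the extension Key Usage: they are a BIT STRING in which each bit, counted from the most significant one, corresponds to one of the names defined for KeyUsage in RFC 5280. The following Python sketch decodes them; the function decode_key_usage is my own and handles only the short lengths that occur in this extension.

```python
from typing import List

# Bit names in the order defined by RFC 5280 (bit 0 is the leftmost bit).
KEY_USAGE_BITS = [
    "digitalSignature",
    "nonRepudiation",
    "keyEncipherment",
    "dataEncipherment",
    "keyAgreement",
    "keyCertSign",
    "cRLSign",
    "encipherOnly",
    "decipherOnly",
]


def decode_key_usage(der: bytes) -> List[str]:
    """Decode a DER BIT STRING that contains the KeyUsage flags."""
    if der[0] != 0x03:
        raise ValueError("not a BIT STRING")
    length = der[1]  # short-form length, enough for this extension
    unused = der[2]  # number of padding bits at the end of the last byte
    payload = der[3 : 2 + length]
    total_bits = len(payload) * 8 - unused
    names = []
    for index in range(min(total_bits, len(KEY_USAGE_BITS))):
        if payload[index // 8] & (0x80 >> (index % 8)):
            names.append(KEY_USAGE_BITS[index])
    return names


print(decode_key_usage(bytes.fromhex("03020186")))
# ['digitalSignature', 'keyCertSign', 'cRLSign']
```

The byte 86 is 10000110 in binary, so bits 0, 5, and 6 are set, which OpenSSL prints as Digital Signature, Certificate Sign, CRL Sign.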

Example: self-signed certificate

You can use OpenSSL to create a self-signed certificate using the module req that you would normally use to create certificate requests. As a self-signed certificate doesn't need approval, the module can directly output the certificate.

$ openssl req -x509 -newkey rsa:2048 -keyout self-signed-key.pem -out self-signed.pem -days 365 -nodes -subj '/CN=localhost'
Generating a RSA private key
....+++++
................+++++
writing new private key to 'self-signed-key.pem'
-----

Note that for simplicity's sake I specified the option -nodes, which prevents the key from being protected with a password, but this is a bad practice. This command creates the two files I mentioned, self-signed-key.pem (the private key) and self-signed.pem (the certificate).

We can read the certificate using the module x509

$ openssl x509 -inform pem -in self-signed.pem  -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            46:e5:2f:8e:42:82:43:b8:ac:88:cb:6d:0c:2f:71:28:a9:fe:00:ec
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = localhost
        Validity
            Not Before: Nov  3 00:23:34 2020 GMT
            Not After : Nov  3 00:23:34 2021 GMT
        Subject: CN = localhost
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
                Modulus:
                    00:b7:14:ef:3b:eb:8b:a9:40:18:c5:d2:eb:1d:4f:
                    5d:e4:a3:17:f3:df:ce:b7:d3:3f:52:58:eb:61:02:
                    a2:68:0a:cd:0f:97:ae:e0:a5:ac:a7:88:cf:a1:15:
                    0a:97:ca:e7:03:8a:a5:c0:66:38:ef:bb:59:4d:48:
                    17:db:a7:bd:fa:4b:50:2a:be:e9:5b:bb:59:65:71:
                    dc:99:73:9c:bc:4d:3b:42:97:91:e9:3b:1a:8a:9d:
                    cc:41:38:ba:8b:8f:df:65:ff:5b:1f:ef:8a:b7:c5:
                    93:07:ce:15:4c:13:72:78:59:64:9a:5b:95:20:b6:
                    b3:8e:aa:c3:29:c3:7f:28:39:43:81:59:e4:0f:26:
                    7c:3f:49:d2:06:05:d9:54:ab:09:65:96:01:cc:c2:
                    72:be:85:1f:40:ea:94:35:04:09:9d:87:eb:a1:90:
                    36:ce:d2:55:f9:ee:08:db:52:78:e8:70:d0:25:89:
                    13:8e:0f:9d:98:98:d1:4d:67:06:8f:8a:61:9e:3a:
                    73:89:aa:0a:0a:1b:05:a7:52:32:ef:1b:78:5a:5f:
                    4b:b6:c9:a7:4e:15:10:04:50:99:00:09:2f:60:8e:
                    aa:20:af:6b:ee:f5:60:0b:29:da:38:1c:b2:73:14:
                    99:a4:ee:5e:89:e6:77:0b:ba:cf:d3:5d:d7:a3:ea:
                    c4:bf
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Subject Key Identifier: 
                64:7B:C1:FC:99:74:56:B7:82:D1:4F:E7:2D:94:77:1A:09:52:26:5C
            X509v3 Authority Key Identifier: 
                keyid:64:7B:C1:FC:99:74:56:B7:82:D1:4F:E7:2D:94:77:1A:09:52:26:5C

            X509v3 Basic Constraints: critical
                CA:TRUE
    Signature Algorithm: sha256WithRSAEncryption
         43:7b:0b:c8:98:b8:6f:72:af:39:4a:d9:76:ce:e3:9d:3a:c7:
         9f:14:b0:4f:20:0a:45:b3:b4:8c:e5:37:4c:bf:15:ad:8e:5c:
         45:4f:3e:b7:ef:8d:60:57:bb:6f:d9:5e:6a:d3:04:05:4a:ff:
         f2:66:b1:76:66:59:7e:24:89:0a:50:28:c9:d5:f5:7a:00:07:
         8a:79:9c:6e:53:43:66:e5:9a:10:d8:f8:e1:f2:c1:f1:17:d0:
         d2:9e:50:80:fe:2a:ca:08:b6:98:e9:b5:a4:82:23:31:45:35:
         33:da:2c:e3:fe:54:f2:bd:f2:61:91:f4:32:e3:7d:4c:3a:e5:
         3a:0f:cd:36:b0:8b:af:9f:8e:3d:0e:0b:a5:df:4a:3a:91:83:
         b3:b2:5f:3c:47:81:73:4f:a2:c1:49:06:75:17:25:fa:5a:8d:
         30:e5:55:7f:9c:3e:15:a8:b5:ab:f7:45:38:e3:76:8e:d4:0d:
         60:fc:42:17:3d:85:72:41:1d:53:9d:58:b0:e9:29:0c:e4:6b:
         14:c2:22:c4:d5:7b:de:36:da:df:d8:a0:4f:a4:0a:f2:3e:ca:
         7e:66:a6:10:38:97:24:73:5b:db:eb:0b:6c:a8:f8:37:15:2c:
         0e:b1:82:44:cc:fe:85:b0:cb:6c:26:4b:4a:70:33:dc:7e:f5:
         84:ba:07:db

As you can see, this certificate has the same value in Issuer and Subject, as happened before for the Amazon Root one. It also has the flag CA set to true, but it doesn't have the extension Key Usage, so unlike the Amazon root certificate it doesn't explicitly state that it can be used to sign other certificates.

Example: this site's certificate

You can see TLS certificates and the chain of trust in action on this very website. Following the documentation of your browser (instructions for Firefox are here), you can see the certificate used by The Digital Cat. At the time of writing the blog is hosted on GitHub Pages, even though I'm using a custom domain, and GitHub partnered with Let's Encrypt to provide certificates for such a configuration (details here).

Indeed, the certificate for thedigitalcatonline.com is provided by "Let's Encrypt Authority X3", which in turn is trusted by Digital Signature Trust Co. with its root CA "DST Root CA X3".

Let's have a look at the three certificates. The one for The Digital Cat is

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            03:93:02:bb:9a:c9:ed:a5:c3:d1:16:00:8b:15:76:af:e5:d9
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
        Validity
            Not Before: Oct 22 04:53:28 2020 GMT
            Not After : Jan 20 04:53:28 2021 GMT
        Subject: CN = www.thedigitalcatonline.com
[...]
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier: 
                63:4E:15:85:56:5A:A4:94:02:C2:16:42:A4:A5:97:9A:38:02:57:97
            X509v3 Authority Key Identifier: 
                keyid:A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1

            Authority Information Access: 
                OCSP - URI:http://ocsp.int-x3.letsencrypt.org
                CA Issuers - URI:http://cert.int-x3.letsencrypt.org/

            X509v3 Subject Alternative Name: 
                DNS:www.thedigitalcatonline.com
[...]

And you can see that this time the Subject is www.thedigitalcatonline.com, but the Issuer is Let's Encrypt Authority X3. The certificate provided by the organisation Let's Encrypt is

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            0a:01:41:42:00:00:01:53:85:73:6a:0b:85:ec:a7:08
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: O = Digital Signature Trust Co., CN = DST Root CA X3
        Validity
            Not Before: Mar 17 16:40:46 2016 GMT
            Not After : Mar 17 16:40:46 2021 GMT
        Subject: C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
[...]
        X509v3 extensions:
            X509v3 Basic Constraints: critical
                CA:TRUE, pathlen:0
            X509v3 Key Usage: critical
                Digital Signature, Certificate Sign, CRL Sign
            Authority Information Access: 
                OCSP - URI:http://isrg.trustid.ocsp.identrust.com
                CA Issuers - URI:http://apps.identrust.com/roots/dstrootcax3.p7c

            X509v3 Authority Key Identifier: 
                keyid:C4:A7:B1:A4:7B:2C:71:FA:DB:E1:4B:90:75:FF:C4:15:60:85:89:10

            X509v3 Certificate Policies: 
                Policy: 2.23.140.1.2.1
                Policy: 1.3.6.1.4.1.44947.1.1.1
                  CPS: http://cps.root-x1.letsencrypt.org

            X509v3 CRL Distribution Points: 

                Full Name:
                  URI:http://crl.identrust.com/DSTROOTCAX3CRL.crl

            X509v3 Subject Key Identifier: 
                A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1
[...]

Here, the Subject is Let's Encrypt Authority X3 (the Issuer of the previous certificate), and the Issuer is DST Root CA X3. Last, the certificate provided by the organisation Digital Signature Trust Co. is

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            44:af:b0:80:d6:a3:27:ba:89:30:39:86:2e:f8:40:6b
        Signature Algorithm: sha1WithRSAEncryption
        Issuer: O = Digital Signature Trust Co., CN = DST Root CA X3
        Validity
            Not Before: Sep 30 21:12:19 2000 GMT
            Not After : Sep 30 14:01:15 2021 GMT
        Subject: O = Digital Signature Trust Co., CN = DST Root CA X3
[...]
        X509v3 extensions:
            X509v3 Basic Constraints: critical
                CA:TRUE
            X509v3 Key Usage: critical
                Certificate Sign, CRL Sign
            X509v3 Subject Key Identifier: 
                C4:A7:B1:A4:7B:2C:71:FA:DB:E1:4B:90:75:FF:C4:15:60:85:89:10
[...]

As happened for the certificate Amazon Root CA 1 that we discussed before, this one is self-signed, having the same value for Subject and Issuer.
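
The way the three certificates link to each other can be summarised with a short Python sketch. The dictionaries contain only the fields copied from the outputs above, and the functions are my own illustration: they compare names and key identifiers, but they do not check any signature, so this is not a replacement for a real validation.

```python
LETS_ENCRYPT = "C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3"
DST_ROOT = "O = Digital Signature Trust Co., CN = DST Root CA X3"

LEAF = {
    "subject": "CN = www.thedigitalcatonline.com",
    "issuer": LETS_ENCRYPT,
    "aki": "A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1",
}
INTERMEDIATE = {
    "subject": LETS_ENCRYPT,
    "issuer": DST_ROOT,
    "ski": "A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1",
    "aki": "C4:A7:B1:A4:7B:2C:71:FA:DB:E1:4B:90:75:FF:C4:15:60:85:89:10",
}
ROOT = {
    "subject": DST_ROOT,
    "issuer": DST_ROOT,
    "ski": "C4:A7:B1:A4:7B:2C:71:FA:DB:E1:4B:90:75:FF:C4:15:60:85:89:10",
}


def links_to(child: dict, parent: dict) -> bool:
    """The child's Issuer must be the parent's Subject, and the Authority
    Key Identifier (when present) must match the Subject Key Identifier."""
    if child["issuer"] != parent["subject"]:
        return False
    aki = child.get("aki")
    return aki is None or aki == parent.get("ski")


def chain_is_linked(chain: list) -> bool:
    """Each certificate must link to the next; the last must be self-issued."""
    for child, parent in zip(chain, chain[1:]):
        if not links_to(child, parent):
            return False
    return chain[-1]["issuer"] == chain[-1]["subject"]


print(chain_is_linked([LEAF, INTERMEDIATE, ROOT]))  # True
print(chain_is_linked([LEAF, ROOT]))  # False
```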

How to verify certificates with OpenSSL

To verify if a certificate is valid we can use the module verify of OpenSSL. By default, OpenSSL doesn't trust anything, and verify relies on a default path in the system to find root certificates. You can see the path running

$ openssl version -d
OPENSSLDIR: "/usr/lib/ssl"

On Ubuntu 20.04, the directory /usr/lib/ssl/certs is a symbolic link to /etc/ssl/certs, which is populated by the package ca-certificates, which in turn is based on Mozilla's CA Certificate Program (details on that package can be found in the source code).
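
If you work with Python, the standard library module ssl can report the same kind of information through the function get_default_verify_paths. Keep in mind that the interpreter might be linked against a different OpenSSL build than the command line tool, so the paths are not guaranteed to be the same as the ones shown above.

```python
import ssl

paths = ssl.get_default_verify_paths()

# Paths that exist on this system (None if they are missing)
print("cafile:", paths.cafile)
print("capath:", paths.capath)

# Paths compiled into the OpenSSL library used by Python
print("compiled-in cafile:", paths.openssl_cafile)
print("compiled-in capath:", paths.openssl_capath)

# Environment variables that can override the defaults
print("override variables:", paths.openssl_cafile_env, paths.openssl_capath_env)
```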

So, if a root certificate is included in the Mozilla program, it is trusted by OpenSSL

$ openssl verify amazon-root-ca-1.pem 
amazon-root-ca-1.pem: OK

while a self-signed certificate is not

$ openssl verify self-signed.pem 
CN = localhost
error 18 at 0 depth lookup: self signed certificate
error self-signed.pem: verification failed

A non-root certificate can be verified specifying which root certificate signed it. So, the certificate for this website is not trusted automatically

$ openssl verify www-thedigitalcatonline-com.pem 
CN = www.thedigitalcatonline.com
error 20 at 0 depth lookup: unable to get local issuer certificate
error www-thedigitalcatonline-com.pem: verification failed

But it is verified specifying the certificate for Let's Encrypt that signed it

$ openssl verify -CAfile lets-encrypt-x3.pem www-thedigitalcatonline-com.pem 
www-thedigitalcatonline-com.pem: OK

because the certificate lets-encrypt-x3.pem is signed by DST_Root_CA_X3.pem which is included in the Mozilla program, and thus included in my Linux distribution.

If I remove the default certificates path OpenSSL doesn't accept the certificate for Let's Encrypt any more

$ openssl verify -no-CApath -CAfile lets-encrypt-x3.pem www-thedigitalcatonline-com.pem
C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
error 2 at 1 depth lookup: unable to get issuer certificate
error www-thedigitalcatonline-com.pem: verification failed

Low-level certificate validation process

Let's have a look at the signature process for X.509 certificates. The process depends on the specific algorithm used to sign the certificate, so I will use the certificate Amazon Root CA 1 as an example, leaving to the reader the investigation about other algorithms.

A signed certificate is made of two parts, the certificate itself and the signature. The signature contains an encrypted hash of the certificate, so the verification is done in three steps:

Decrypt the encrypted hash using the public key
Compute the hash of the certificate using the same algorithm
Compare the hashes
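
The three steps can be sketched in Python with a toy version of RSA. Please note that this is only an illustration of the principle under heavy assumptions: the key is built from two small, well-known Mersenne primes, the digest is reduced modulo n, and there is no padding, so nothing here matches the actual PKCS #1 encoding or should be used in practice. All names are mine.

```python
import hashlib

# Toy key: two Mersenne primes, far too small and too regular for real use.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q  # public modulus
E = 65537  # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def digest_as_int(data: bytes) -> int:
    """Hash the data and squeeze the digest into the toy modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data: bytes) -> int:
    """What the signer does: encrypt the hash with the private key."""
    return pow(digest_as_int(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    recovered = pow(signature, E, N)  # 1. decrypt the encrypted hash
    computed = digest_as_int(data)  # 2. hash the data with the same algorithm
    return recovered == computed  # 3. compare the hashes


signature = sign(b"the certificate to be signed")
print(verify(b"the certificate to be signed", signature))  # True
print(verify(b"a tampered certificate", signature))  # False
```

Only the holder of the private exponent can produce a value that decrypts to the right hash, while anyone who has the public key can run the verification.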

For the Amazon root certificate, we know the signature algorithm and value from the output of openssl x509

$ openssl x509 -inform pem -in amazon-root-ca-1.pem  -noout -text
[...]
    Signature Algorithm: sha256WithRSAEncryption
         98:f2:37:5a:41:90:a1:1a:c5:76:51:28:20:36:23:0e:ae:e6:
         28:bb:aa:f8:94:ae:48:a4:30:7f:1b:fc:24:8d:4b:b4:c8:a1:
         97:f6:b6:f1:7a:70:c8:53:93:cc:08:28:e3:98:25:cf:23:a4:
         f9:de:21:d3:7c:85:09:ad:4e:9a:75:3a:c2:0b:6a:89:78:76:
         44:47:18:65:6c:8d:41:8e:3b:7f:9a:cb:f4:b5:a7:50:d7:05:
         2c:37:e8:03:4b:ad:e9:61:a0:02:6e:f5:f2:f0:c5:b2:ed:5b:
         b7:dc:fa:94:5c:77:9e:13:a5:7f:52:ad:95:f2:f8:93:3b:de:
         8b:5c:5b:ca:5a:52:5b:60:af:14:f7:4b:ef:a3:fb:9f:40:95:
         6d:31:54:fc:42:d3:c7:46:1f:23:ad:d9:0f:48:70:9a:d9:75:
         78:71:d1:72:43:34:75:6e:57:59:c2:02:5c:26:60:29:cf:23:
         19:16:8e:88:43:a5:d4:e4:cb:08:fb:23:11:43:e8:43:29:72:
         62:a1:a9:5d:5e:08:d4:90:ae:b8:d8:ce:14:c2:d0:55:f2:86:
         f6:c4:93:43:77:66:61:c0:b9:e8:41:d7:97:78:60:03:6e:4a:
         72:ae:a5:d1:7d:ba:10:9e:86:6c:1b:8a:b9:59:33:f8:eb:c4:
         90:be:f1:b9

You can see the signed certificate binary values with cat amazon-root-ca-1.pem | tail -n+2 | head -n-1 | base64 -di | hexdump -ve '/1 "%02x "' -e '2/8 "\n"'. While we can recognise the signature in the last 256 bytes we can't easily separate the bytes with the signature algorithm. If we open the signed certificate with an ASN.1 parser, instead, we can easily find the binary value of the certificate part

30 82 03 41 30 82 02 29 a0 03 02 01 02 02 13 06
6c 9f cf 99 bf 8c 0a 39 e2 f0 78 8a 43 e6 96 36
5b ca 30 0d 06 09 2a 86 48 86 f7 0d 01 01 0b 05
00 30 39 31 0b 30 09 06 03 55 04 06 13 02 55 53
31 0f 30 0d 06 03 55 04 0a 13 06 41 6d 61 7a 6f
6e 31 19 30 17 06 03 55 04 03 13 10 41 6d 61 7a
6f 6e 20 52 6f 6f 74 20 43 41 20 31 30 1e 17 0d
31 35 30 35 32 36 30 30 30 30 30 30 5a 17 0d 33
38 30 31 31 37 30 30 30 30 30 30 5a 30 39 31 0b
30 09 06 03 55 04 06 13 02 55 53 31 0f 30 0d 06
03 55 04 0a 13 06 41 6d 61 7a 6f 6e 31 19 30 17
06 03 55 04 03 13 10 41 6d 61 7a 6f 6e 20 52 6f
6f 74 20 43 41 20 31 30 82 01 22 30 0d 06 09 2a
86 48 86 f7 0d 01 01 01 05 00 03 82 01 0f 00 30
82 01 0a 02 82 01 01 00 b2 78 80 71 ca 78 d5 e3
71 af 47 80 50 74 7d 6e d8 d7 88 76 f4 99 68 f7
58 21 60 f9 74 84 01 2f ac 02 2d 86 d3 a0 43 7a
4e b2 a4 d0 36 ba 01 be 8d db 48 c8 07 17 36 4c
f4 ee 88 23 c7 3e eb 37 f5 b5 19 f8 49 68 b0 de
d7 b9 76 38 1d 61 9e a4 fe 82 36 a5 e5 4a 56 e4
45 e1 f9 fd b4 16 fa 74 da 9c 9b 35 39 2f fa b0
20 50 06 6c 7a d0 80 b2 a6 f9 af ec 47 19 8f 50
38 07 dc a2 87 39 58 f8 ba d5 a9 f9 48 67 30 96
ee 94 78 5e 6f 89 a3 51 c0 30 86 66 a1 45 66 ba
54 eb a3 c3 91 f9 48 dc ff d1 e8 30 2d 7d 2d 74
70 35 d7 88 24 f7 9e c4 59 6e bb 73 87 17 f2 32
46 28 b8 43 fa b7 1d aa ca b4 f2 9f 24 0e 2d 4b
f7 71 5c 5e 69 ff ea 95 02 cb 38 8a ae 50 38 6f
db fb 2d 62 1b c5 c7 1e 54 e1 77 e0 67 c8 0f 9c
87 23 d6 3f 40 20 7f 20 80 c4 80 4c 3e 3b 24 26
8e 04 ae 6c 9a c8 aa 0d 02 03 01 00 01 a3 42 30
40 30 0f 06 03 55 1d 13 01 01 ff 04 05 30 03 01
01 ff 30 0e 06 03 55 1d 0f 01 01 ff 04 04 03 02
01 86 30 1d 06 03 55 1d 0e 04 16 04 14 84 18 cc
85 34 ec bc 0c 94 94 2e 08 59 9c c7 b2 10 4e 0a
08

The signature algorithm part is

30 0d 06 09 2a 86 48 86 f7 0d 01 01 0b 05 00
03 82 01 01 00

and the ASN.1 parser tells us that those bytes represent an OBJECT IDENTIFIER which value is 2.16.840.1.101.3.4.2.1. Now, object identifiers are not complicated per se, they are just a way to identify algorithms and other well known components in ASN.1 structures. The description of the field signatureAlgorithm of an x.509 certificate mentions three other RFCs that contains descriptions of the available algorithms. In particular, RFC 4055 contains the description of PKCS #1 one-way hash functions, one of which is

id-sha256  OBJECT IDENTIFIER  ::=  { joint-iso-itu-t(2)
                     country(16) us(840) organization(1) gov(101)
                     csor(3) nistalgorithm(4) hashalgs(2) 1 }

You can see the values in the object identifier between parentheses. Since these are PKCS #1 (a.k.a. RSA) hash functions, OpenSSL identifies it as sha256WithRSAEncryption (see again RFC 4055).

RSA encryption is described in RFC 2313 ("PKCS #1: RSA Encryption Version 1.5") and the signature algorithm based on RSA is described there in section 10. In particular, section 10.2 details the verification process, which is the one we are interested in. The steps are

Bit-string-to-octet-string conversion of the signature
RSA decryption
Digest decoding (ASN.1)
Message digesting and comparison

As for the signature conversion, the sentence

Specifically, assuming that the length in bits of the
signature S is a multiple of eight, the first bit of the signature
shall become the most significant bit of the first octet of the
encrypted data, and so on through the last bit of the signature,
which shall become the least significant bit of the last octet of the
encrypted data.

is a very verbose way to say that the signature is big-endian.

So, the hexadecimal value of the signature is

98f2375a4190a11ac57651282036230eaee628bbaaf894ae48a4307f1bfc248d
4bb4c8a197f6b6f17a70c85393cc0828e39825cf23a4f9de21d37c8509ad4e9a
753ac20b6a897876444718656c8d418e3b7f9acbf4b5a750d7052c37e8034bad
e961a0026ef5f2f0c5b2ed5bb7dcfa945c779e13a57f52ad95f2f8933bde8b5c
5bca5a525b60af14f74befa3fb9f40956d3154fc42d3c7461f23add90f48709a
d9757871d1724334756e5759c2025c266029cf2319168e8843a5d4e4cb08fb23
1143e843297262a1a95d5e08d490aeb8d8ce14c2d055f286f6c49343776661c0
b9e841d7977860036e4a72aea5d17dba109e866c1b8ab95933f8ebc490bef1b9

And reading the field Subject Public Key Info of the certificate we find the public key. Remember that this is a root certificate, so it is signed with the private key that matches the public key it contains, which is not true in general.

The public key's modulus is

b2788071ca78d5e371af478050747d6ed8d78876f49968f7582160f97484012f
ac022d86d3a0437a4eb2a4d036ba01be8ddb48c80717364cf4ee8823c73eeb37
f5b519f84968b0ded7b976381d619ea4fe8236a5e54a56e445e1f9fdb416fa74
da9c9b35392ffab02050066c7ad080b2a6f9afec47198f503807dca2873958f8
bad5a9f948673096ee94785e6f89a351c0308666a14566ba54eba3c391f948dc
ffd1e8302d7d2d747035d78824f79ec4596ebb738717f2324628b843fab71daa
cab4f29f240e2d4bf7715c5e69ffea9502cb388aae50386fdbfb2d621bc5c71e
54e177e067c80f9c8723d63f40207f2080c4804c3e3b24268e04ae6c9ac8aa0d

and the exponent is 0x10001 (default choice).

RSA public-key signature decryption is performed with signature ^ exponent mod modulus, and this operation returns

1fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
fffffffffffffffffffff003031300d0609608648016503040201050004206fc
4b8ac3d2b52c08baf56255e43d22c762962e4facab01ace16d48ec008be0a

Once the padding is removed, we are left with an ASN.1 binary structure that represents the digest

DigestInfo ::= SEQUENCE {
  digestAlgorithm DigestAlgorithmIdentifier,
  digest Digest }

(see RFC 2313 - Section 10.1.2)

The value of Digest can be extracted with an ASN.1 parser or by taking the last 256 bits and is 6fc4b8ac3d2b52c08baf56255e43d22c762962e4facab01ace16d48ec008be0a.
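
If you want to see what the ASN.1 parser does at this step, here is a minimal sketch that removes the padding and reads the DigestInfo structure by hand. The helper names (strip_pkcs1_padding, read_tlv, decode_oid) are mine and not part of any library, and the code assumes short-form DER lengths and a single-byte first arc pair, which is enough for this block but not for ASN.1 in general.

```python
def strip_pkcs1_padding(block: bytes) -> bytes:
    """Remove the 00 01 ff .. ff 00 padding used for signatures."""
    if block[:2] != b"\x00\x01":
        raise ValueError("not a PKCS #1 v1.5 signature block")
    separator = block.index(b"\x00", 2)
    if set(block[2:separator]) != {0xFF}:
        raise ValueError("padding bytes must all be ff")
    return block[separator + 1:]


def read_tlv(data: bytes, offset: int = 0):
    """Read one DER element, assuming a short-form (single byte) length."""
    tag, length = data[offset], data[offset + 1]
    if length >= 0x80:
        raise ValueError("long-form lengths are not handled in this sketch")
    start = offset + 2
    return tag, data[start:start + length], start + length


def decode_oid(content: bytes) -> str:
    """Decode the content bytes of an OBJECT IDENTIFIER into dotted form."""
    # The first byte packs the first two arcs as 40 * first + second
    first = min(content[0] // 40, 2)
    arcs = [first, content[0] - 40 * first]
    value = 0
    for byte in content[1:]:
        # Following arcs are base-128, high bit set on all but the last byte
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            arcs.append(value)
            value = 0
    return ".".join(str(arc) for arc in arcs)


# The decrypted signature shown above, with its leading 00 byte restored
block = bytes.fromhex(
    "0001"
    + "ff" * 202
    + "00"
    + "3031300d060960864801650304020105000420"
    + "6fc4b8ac3d2b52c08baf56255e43d22c762962e4facab01ace16d48ec008be0a"
)

digest_info = strip_pkcs1_padding(block)
_, sequence, _ = read_tlv(digest_info)          # DigestInfo SEQUENCE
_, algorithm, next_offset = read_tlv(sequence)  # digestAlgorithm SEQUENCE
_, oid_bytes, _ = read_tlv(algorithm)           # OBJECT IDENTIFIER
_, digest, _ = read_tlv(sequence, next_offset)  # digest OCTET STRING

print(decode_oid(oid_bytes))
print(digest.hex())
```

The first value printed is 2.16.840.1.101.3.4.2.1, the id-sha256 identifier quoted from RFC 4055, and the second is the 32-byte digest.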

At this point we need to process the certificate bytes (without signature) with the SHA-256 hash function and we will find a matching value of 6fc4b8ac3d2b52c08baf56255e43d22c762962e4facab01ace16d48ec008be0a.

This process (for the specific case of this certificate) can be easily done in Python

from cryptography import x509
from hashlib import sha256

certificate_pem_file = "amazon-root-ca-1.pem"

with open(certificate_pem_file, "rb") as f:
    certificate_pem = f.read()

certificate = x509.load_pem_x509_certificate(certificate_pem)

modulus = certificate.public_key().public_numbers().n
exponent = certificate.public_key().public_numbers().e

signature = int.from_bytes(certificate.signature, "big")

verification = pow(signature, exponent, modulus)

# The digest is the last 32 bytes (64 hex digits) of the decrypted block
digest = bytes.fromhex(hex(verification)[-64:])

calculated_digest = sha256(certificate.tbs_certificate_bytes)

print(digest.hex() == calculated_digest.hexdigest())

This is arguably not the best Python code ever, but it's a simple way to demonstrate the process. As I said, this is far from being general, as it assumes the signature is sha256WithRSAEncryption, which might not be the case.

What I showed you here is what happens when we validate a root certificate. When we validate a non-root certificate the process is exactly the same (taking into account that the algorithms involved might be different), only the public key used to verify the signature doesn't come from the certificate itself, but from the signer's one. So, in the case of this blog, the certificate for www.thedigitalcat.com has a signature created with the private key of Let's Encrypt, which we verify using the public key contained in the Let's Encrypt certificate. In turn, the certificate for Let's Encrypt is signed using the private key of Digital Signature Trust Co. This is what creates the chain of trust.

Algorithms used by root certificates¶

A quick scan of the certificates that are part of the Mozilla program reveals that the vast majority of them use an RSA key to self-sign

$ for i in /etc/ssl/certs/*.pem; do openssl x509 -inform pem -in ${i} -noout -text | grep -E "Public Key Algorithm"; done | sort | uniq -c
     25             Public Key Algorithm: id-ecPublicKey
    114             Public Key Algorithm: rsaEncryption

while some of them use id-ecPublicKey, which is the identifier of elliptic curve algorithms.

When it comes to signature algorithms, instead, there is more variety

$ for i in /etc/ssl/certs/*.pem; do openssl x509 -inform pem -in ${i} -noout -text | grep -E "^    Signature Algorithm"; done | sort | uniq -c
      7     Signature Algorithm: ecdsa-with-SHA256
     18     Signature Algorithm: ecdsa-with-SHA384
     47     Signature Algorithm: sha1WithRSAEncryption
     57     Signature Algorithm: sha256WithRSAEncryption
      9     Signature Algorithm: sha384WithRSAEncryption
      1     Signature Algorithm: sha512WithRSAEncryption

Even here, elliptic curves are slowly being adopted.

If you are using AWS, you can create certificates with ACM, the AWS Certificate Manager. Such certificates cannot be downloaded; they can only be attached to other AWS components. For this reason, the generation process doesn't require you to create any request, as you might have to do with other authorities. Certificates created in ACM are free.

Certificates created in the ACM can be attached to several AWS components, most notably Load Balancers, CloudFront, and API Gateway.

Traditionally, load balancers are the place where TLS is terminated for HTTPS, requiring a connection to port 443. While Application Load Balancers can do that, in 2019 AWS announced support for certificates in Network Load Balancers as well.

Let's Encrypt¶

In an effort to push for HTTP encryption of any public server, the Internet Security Research Group founded in 2016 a non-profit CA named Let's Encrypt, which provides TLS certificates valid for 90 days at no charge. Such certificates can be renewed automatically as part of the setup (certbot) and represent a viable alternative to certificates issued by other CAs, in particular for open source projects. This blog uses a certificate issued by Let's Encrypt (provided by GitHub Pages), which will thus expire in less than 3 months (but is also automatically renewed).

Final words¶

I hope this post helped to clarify some of the most obscure points of certificates, which definitely bugged me when I first approached them. As always when standards are involved, the risk is to get lost in the myriad of documents where information is scattered, and not to realise that some (if not many) parts of the systems we run every day have a long history and thus a big burden of legacy code or nomenclature.

Resources¶

The Wikipedia article on TLS
The Wikipedia article on Certificate authority
The Wikipedia article on X.509
The Wikipedia article on Let's Encrypt
OpenSSL documentation: asn1parse, x509, verify
The Abstract Syntax Notation One ASN.1 interface description language
RFC 2313 - "PKCS #1: RSA Encryption Version 1.5"
RFC 2459 - "Internet X.509 Public Key Infrastructure Certificate and CRL Profile"
RFC 3279 - "Algorithms and Identifiers for the Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile"
RFC 4055 - "Additional Algorithms and Identifiers for RSA Cryptography for use in the Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile"
RFC 5280 - "Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile"
RFC 7299 - "Object Identifier Registry for the PKIX Working Group"
RFC 7468 - "Textual Encodings of PKIX, PKCS, and CMS Structures"
RFC 8017 - "PKCS #1: RSA Cryptography Specifications Version 2.2"
RFC 8446 - "The Transport Layer Security (TLS) Protocol Version 1.3"
pyca/cryptography - The Python pyca/cryptography package

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

Multiple inheritance and mixin classes in Python

2020-03-27T12:00:00+01:00

I recently revisited three old posts on Django class-based views that I wrote for this blog, updating them to Django 3.0 (you can find them here) and noticed once again that the code base uses mixin classes to increase code reuse. I also realised that mixins are not very popular in Python, so I decided to explore them, brushing up my knowledge of the OOP theory in the meantime.

To fully appreciate the content of the post, be sure you grasp two pillars of the OOP approach: delegation, in particular how it is implemented through inheritance, and polymorphism. This post about delegation and this post about polymorphism contain all you need to understand how Python implements those concepts.

Multiple inheritance: blessing and curse¶

General concepts

To discuss mixins we need to start from one of the most controversial subjects in the whole OOP world: multiple inheritance. This is a natural extension of the concept of simple inheritance, where a class automatically delegates method and attribute resolution to another class (the parent class).

Let me state it again, as it is important for the rest of the discussion: inheritance is just an automatic delegation mechanism.

Delegation was introduced in OOP as a way to reduce code duplication. When an object needs a specific feature it just delegates it to another class (either explicitly or implicitly), so the code is written just once.
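
As a minimal sketch of the two flavours, consider the following classes. The names Notifier, ExplicitDelegator, and ImplicitDelegator are hypothetical and exist only for this illustration. The first class delegates explicitly to an object it owns, while the second delegates implicitly through inheritance.

```python
class Notifier:
    def notify(self, user):
        return f"Notification sent to {user}"


class ExplicitDelegator:
    # Composition: the object owns a Notifier and forwards the call by hand
    def __init__(self):
        self._notifier = Notifier()

    def notify(self, user):
        return self._notifier.notify(user)


class ImplicitDelegator(Notifier):
    # Inheritance: the class defines nothing, so Python looks up
    # notify in the parent class automatically
    pass


explicit = ExplicitDelegator()
implicit = ImplicitDelegator()

print(explicit.notify("anna"))
print(implicit.notify("anna"))
```

In both cases the code of notify is written only once, in Notifier. The difference is who decides where the call goes: the programmer in the first case, the language in the second.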

Let's consider the example of a code management website, clearly completely fictional and not inspired by any existing product. Let's assume we created the following hierarchy

      assignable reviewable item
 (assign_to_user, ask_review_to_user)
                 ^
                 |
                 |
                 |
            pull request

which allows us to put in pull request only the specific code required by that element. This is a great achievement, as it is what libraries do for code, but on live objects. Method calls and delegation are nothing more than messages between objects, so the delegation hierarchy is just a simple networked system.

Unfortunately, the use of inheritance over composition often leads to systems that, paradoxically, increase code duplication. The main problem lies in the fact that inheritance can directly delegate to only one other class (the parent class), as opposed to composition, where the object can delegate to any number of other ones. This limitation of inheritance means that we might have a class that inherits from another one because it needs some of its features, but in doing so it receives features it doesn't want, or shouldn't have.

Let's continue the example of the code management portal, and consider an issue, which is an item that we want to store in the system, but cannot be reviewed by a user. If we create a hierarchy like this

    assignable reviewable item
 (assign_to_user, ask_review_to_user)
                 ^
                 |
                 |
                 |
                 |
        +--------+--------+
        |                 |
        |                 |
        |                 |
      issue          pull request
 (not reviewable)

we end up putting the features related to the review process in an object that shouldn't have them. The standard solution to this problem is to increase the depth of the inheritance hierarchy and to derive from the new, simpler ancestor.

   assignable item
  (assign_to_user)
          ^
          |
          |
          |
          |
   +------+--------------+
   |                     |
   |                     |
   |                     |
   |         reviewable assignable item
   |            (ask_review_to_user)
   |                     ^
   |                     |
   |                     |
   |                     |
 issue              pull request

However, this approach stops being viable as soon as an object needs to inherit from a given class but not from the parent of that class. For example, consider an element that has to be reviewable but not assignable, like a best practice that we want to add to the site. If we want to keep using inheritance, the only solution at this point is to duplicate the code that implements the reviewable nature of the item (or the code that implements the assignable feature) and create two different class hierarchies.

   assignable item              +-------->  reviewable item
  (assign_to_user)              |         (ask_review_to_user)
          ^                     |                  ^
          |                     |                  |
          |                     |                  |
          |             CODE DUPLICATION           |
          |                     |                  |
   +------+--------------+      |                  |
   |                     |      |                  |
   |                     |      |                  |
   |                     |      V                  |
   |         reviewable assignable item            |
   |            (ask_review_to_user)               |
   |                     ^                         |
   |                     |                         |
   |                     |                         |
   |                     |                         |
 issue              pull request             best practice

Please note that this doesn't even take into account that the new reviewable item might need attributes from assignable item, which prompts for another level of depth in the hierarchy, where we isolate those features in a more generic class. So, unfortunately, chances are that this is only the first of many compromises we will have to accept to keep the system in a stable state if we can't change our approach.

Multiple inheritance was then introduced in OOP, as it was clear that an object might want to delegate certain actions to a given class, and other actions to a different one, mimicking what life forms do when they inherit traits from multiple ancestors (parents, grandparents, etc.).

The above situation can then be solved by having pull request inherit both from the class that provides the assign feature and from the one that implements the reviewable nature.

   assignable item                          reviewable item
  (assign_to_user)                        (ask_review_to_user)
          ^                                      ^  ^
          |                                      |  |
          |                                      |  |
          |                                      |  |
          |                                      |  |
   +------+-------------+ +----------------------+  |
   |                    | |                         |
   |                    | |                         |
   |                    | |                         |
   |                    | |                         |
   |                    | |                         |
   |                    | |                         |
   |                    | |                         |
   |                    | |                         |
   |                    | |                         |
 issue              pull request              best practice

Generally speaking, then, multiple inheritance is introduced to give the programmer a way to keep using inheritance without introducing code duplication, keeping the class hierarchy simpler and cleaner. Eventually, everything we do in software design is to try and separate concerns, that is, to isolate features, and multiple inheritance can help to do this.
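
The last diagram can be sketched in Python as follows. The class names and the attributes assignee and reviewer are my assumptions, chosen only to mirror the labels in the diagram.

```python
class AssignableItem:
    def assign_to_user(self, user):
        self.assignee = user


class ReviewableItem:
    def ask_review_to_user(self, user):
        self.reviewer = user


class Issue(AssignableItem):
    # Assignable, but not reviewable
    pass


class PullRequest(AssignableItem, ReviewableItem):
    # Delegates to both parents, no code is duplicated
    pass


class BestPractice(ReviewableItem):
    # Reviewable, but not assignable
    pass


pr = PullRequest()
pr.assign_to_user("anna")
pr.ask_review_to_user("bob")

print(pr.assignee, pr.reviewer)
print(hasattr(Issue(), "ask_review_to_user"))
print(hasattr(BestPractice(), "assign_to_user"))
```

Each class receives only the features it needs: the two hasattr checks are both False, while the pull request can be assigned and reviewed.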

These are just examples and might be valid or not, depending on the concrete case, but they clearly show the issues that we can have even with a very simple hierarchy of 4 classes. Many of these problems clearly arise from the fact that we wanted to implement delegation only through inheritance, and I dare to say that 80% of the architectural errors in OOP projects come from using inheritance instead of composition and from using god objects, that is classes that have responsibilities over too many different parts of the system. Always remember that OOP was born with the idea of small objects interacting through messages, so the considerations we make for monolithic architectures are valid even here.

That said, as inheritance and composition implement two different types of delegation (to be and to have), they are both valuable, and multiple inheritance is the way to remove the single provider limitation that comes from having only one parent class.

Why is it controversial?

Given what I just said, multiple inheritance seems to be a blessing. When an object can inherit from multiple parents, we can easily spread responsibilities among different classes and use only the ones we need, promoting code reuse and avoiding god objects.

Unfortunately, things are not that simple. First of all, we face the issue that every microservice-oriented architecture faces, that is the risk of going from god objects (the extreme monolithic architecture) to almost empty objects (the extreme distributed approach), burdening the programmer with too fine-grained a control, which eventually results in a system where relationships between objects are so complicated that it becomes impossible to grasp the effect of a change in the code.

There is a more immediate problem in multiple inheritance, though. As it happens with the natural inheritance, parents can provide the same "genetic trait" in two different flavours, but the resulting individual will have only one. Leaving aside genetics (which is incredibly more complicated than programming) and going back to OOP, we face a problem when an object inherits from two other objects that provide the same attribute.

So, if your class Child inherits from parents Parent1 and Parent2, and both provide the __init__ method, which one should your object use?

class Parent1():
    def __init__(self):
        [...]


class Parent2():
    def __init__(self):
        [...]


class Child(Parent1, Parent2):
    # This inherits from both Parent1 and Parent2,
    # which __init__ does it use?
    pass

Things can even get worse, as parents can have different signatures of the common method, for example

class Parent1:
    # This defines __init__ with a status parameter
    def __init__(self, status):
        [...]


class Parent2:
    # This defines __init__ with a name parameter
    def __init__(self, name):
        [...]


class Child(Parent1, Parent2):
    # This inherits from both Parent1 and Parent2,
    # which __init__ does it use?
    pass

The problem can be extended even further, introducing a common ancestor above Parent1 and Parent2.

class Ancestor:
    # The common ancestor, defines its own __init__ method
    def __init__(self):
        [...]


class Parent1(Ancestor):
    # This inherits from Ancestor but redefines __init__
    def __init__(self, status):
        [...]


class Parent2(Ancestor):
    # This inherits from Ancestor but redefines __init__
    def __init__(self, name):
        [...]


class Child(Parent1, Parent2):
    # This inherits from both Parent1 and Parent2,
    # which __init__ does it use?
    pass

As you can see, we already have a problem when we introduce multiple parents, and a common ancestor just adds a new level of complexity. The ancestor class can clearly be at any point of the inheritance tree (grandparent, great-grandparent, etc.), the important part is that it is shared between Parent1 and Parent2. This is the so-called diamond problem, as the inheritance graph has the shape of a diamond

       Ancestor
        ^   ^
       /     \
      /       \
 Parent1     Parent2
     ^         ^
      \       /
       \     /
        Child

So, while with single-parent inheritance the rules are straightforward, with multiple inheritance we immediately have a more complex situation that doesn't have a trivial solution. Does all this prevent multiple inheritance from being implemented?

Not at all! There are solutions to this problem, as we will see shortly, but this further level of intricacy makes multiple inheritance something that doesn't fit easily in a design and has to be implemented carefully to avoid subtle bugs. Remember that inheritance is an automatic delegation mechanism, as this makes what happens in the code less evident. For these reasons, multiple inheritance is often depicted as scary and convoluted, and usually given some space only in the advanced OOP courses, at least in the Python world. I believe every Python programmer, instead, should familiarise themselves with it and learn how to take advantage of it.

Multiple inheritance: the Python way

Let's see how it is possible to solve the diamond problem. Unlike genetics, we programmers can't afford any level of uncertainty or randomness in our processes, so in the presence of a possible ambiguity such as the one created by multiple inheritance, we need to write down a rule that will be strictly followed in every case. In Python, this rule goes by the name of MRO (Method Resolution Order), which was introduced in Python 2.3 and is described in this document by Michele Simionato.

There is a lot to say about MRO and the underlying C3 linearisation algorithm, but for the scope of this post, it is enough to see how it solves the diamond problem. In case of multiple inheritance, Python follows the usual inheritance rules (automatic delegation to an ancestor if the attribute is not present locally), but the order followed to traverse the inheritance tree now includes all the classes that are specified in the class signature. In the example above, Python would look for attributes in the following order: Child, Parent1, Parent2, Ancestor.

So, as in the case of standard inheritance, this means that the first class in the list that implements a specific attribute will be the selected provider for that resolution. An example might clarify the matter

class Ancestor:
    def rewind(self):
        print("Ancestor: rewind")


class Parent1(Ancestor):
    def open(self):
        print("Parent1: open")


class Parent2(Ancestor):
    def open(self):
        print("Parent2: open")

    def close(self):
        print("Parent2: close")

    def flush(self):
        print("Parent2: flush")


class Child(Parent1, Parent2):
    def flush(self):
        print("Child: flush")


print(Child.__mro__)

c = Child()
c.rewind()
c.open()
c.close()
c.flush()

As you can see, we can access the MRO of any class reading its __mro__ attribute, and as we expected its value is (<class '__main__.Child'>, <class '__main__.Parent1'>, <class '__main__.Parent2'>, <class '__main__.Ancestor'>, <class 'object'>).

So, in this case an instance c of Child provides rewind, open, close, and flush. When c.rewind is called, the code in Ancestor is executed, as this is the first class in the MRO list that provides that method. The method open is provided by Parent1, while close is provided by Parent2. If the method c.flush is called, the code is provided by the Child class itself, that redefines it overriding the one provided by Parent2.

As we see with the flush method, Python doesn't change its behaviour when it comes to method overriding with multiple parents. The first implementation of a method with that name is executed, and the parent's implementation is not automatically called. As in the case of standard inheritance, then, it's up to us to design classes with matching method signatures.
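
A small sketch may help to see the difference between replacing a method and augmenting it. The classes SilentChild and CooperativeChild are hypothetical variations of Child, and the methods return lists of strings instead of printing, just to make the call chain visible.

```python
class Ancestor:
    pass


class Parent1(Ancestor):
    pass


class Parent2(Ancestor):
    def flush(self):
        return ["Parent2: flush"]


class SilentChild(Parent1, Parent2):
    # Replaces flush completely: Parent2.flush is never called
    def flush(self):
        return ["SilentChild: flush"]


class CooperativeChild(Parent1, Parent2):
    # Augments flush: super() continues the search along the MRO,
    # skips Parent1 (which has no flush) and reaches Parent2
    def flush(self):
        return ["CooperativeChild: flush"] + super().flush()


print(SilentChild().flush())
print(CooperativeChild().flush())
```

The first call returns only the message of SilentChild, while the second returns the message of CooperativeChild followed by the one of Parent2.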

Under the bonnet

How does multiple inheritance work internally? How does Python create the MRO list?

Python has a very simple approach to OOP (even though it ultimately ends with a mind-blowing ouroboros, see here). Classes are objects themselves, so they contain data structures that are used by the language to provide features, and delegation is no exception. When we run a method on an object, Python silently uses the __getattribute__ method (provided by object), which uses __class__ to reach the class from the instance, and __bases__ to find the parent classes. The latter, in particular, is a tuple, so it is ordered, and it contains all the classes that the current class inherits from.

The MRO is created using only __bases__, but the underlying algorithm is not that trivial and has to do with the monotonicity of the resulting class linearisation. It is less scary than it sounds, but probably not something you want to read while suntanning. Should you want to, the aforementioned document by Michele Simionato contains all the gory details on class linearisation that you always wanted to explore while lying on the beach.
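
For the curious, this is a simplified sketch of the C3 linearisation that relies only on __bases__. It is not the code that CPython runs, and the function names c3_merge and linearise are mine, but for the diamond example it produces the same order stored in __mro__.

```python
def c3_merge(sequences):
    """Merge several linearisations following the C3 rules."""
    sequences = [list(seq) for seq in sequences if seq]
    result = []
    while sequences:
        for seq in sequences:
            candidate = seq[0]
            # A head is a good candidate only if it doesn't appear
            # in the tail of any of the sequences
            if not any(candidate in other[1:] for other in sequences):
                break
        else:
            raise TypeError("Cannot create a consistent MRO")
        result.append(candidate)
        # Remove the chosen class everywhere and drop empty sequences
        sequences = [
            [cls for cls in seq if cls is not candidate] for seq in sequences
        ]
        sequences = [seq for seq in sequences if seq]
    return result


def linearise(cls):
    """Compute the linearisation of cls using only __bases__."""
    bases = list(cls.__bases__)
    return [cls] + c3_merge([linearise(base) for base in bases] + [bases])


class Ancestor:
    pass


class Parent1(Ancestor):
    pass


class Parent2(Ancestor):
    pass


class Child(Parent1, Parent2):
    pass


print([cls.__name__ for cls in linearise(Child)])
print(linearise(Child) == list(Child.__mro__))
```

The merge step is where monotonicity is enforced: Ancestor cannot be picked right after Parent1 because it still appears in the tail of the linearisation of Parent2, so Parent2 comes first.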

Inheritance and interfaces¶

To approach mixins, we need to discuss inheritance in detail, and specifically the role of method signatures.

In Python, when you override a method provided by an ancestor class, you have to decide if and when to call its original implementation. This gives the programmer the freedom to decide whether they need to just augment a method or to replace it completely. Remember that the only thing Python does when a class inherits from another is to automatically delegate methods that are not implemented.

When a class inherits from another we are ideally creating objects that keep the backward compatibility with the interface of the parent class, to allow a polymorphic use of them. This means that when we inherit from a class and override a method changing its signature we are doing something that is dangerous and, at least from the point of view of polymorphism, wrong. Have a look at this example

class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y

    def move(self, pos_x, pos_y):
        self.pos_x = pos_x
        self.pos_y = pos_y

    def resize(self, size_x, size_y):
        self.size_x = size_x
        self.size_y = size_y


class Rectangle(GraphicalEntity):
    pass


class Square(GraphicalEntity):
    def __init__(self, pos_x, pos_y, size):
        super().__init__(pos_x, pos_y, size, size)

    def resize(self, size):
        super().resize(size, size)

Please note that Square changes the signature of both __init__ and resize. Now, when we instantiate those classes we need to keep in mind the different signature of __init__ in Square

r1 = Rectangle(100, 200, 15, 30)
r2 = Rectangle(150, 280, 23, 55)
q1 = Square(300, 400, 50)

We usually accept that an enhanced version of a class accepts different parameters when it is initialised, as we do not expect it to be polymorphic on __init__. Problems arise when we try to leverage polymorphism on other methods, for example resizing all GraphicalEntity objects in a list

for shape in [r1, r2, q1]:
    size_x = shape.size_x
    size_y = shape.size_y
    shape.resize(size_x*2, size_y*2)

Since r1, r2, and q1 are all objects that inherit from GraphicalEntity we expect them to provide the interface provided by that class, but this fails, because Square changed the signature of resize. The same would happen if we instantiated them in a for loop from a list of classes, but as I said it is generally accepted that child classes change the signature of the __init__ method. This is not true, for example, in a plugin-based system, where all plugins shall be initialised the same way.
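
The failure can be reproduced with a condensed version of the classes above. The try/except block and the list named failed are additions of mine, used only to record which shapes reject the two-argument call.

```python
class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y

    def resize(self, size_x, size_y):
        self.size_x = size_x
        self.size_y = size_y


class Rectangle(GraphicalEntity):
    pass


class Square(GraphicalEntity):
    def __init__(self, pos_x, pos_y, size):
        super().__init__(pos_x, pos_y, size, size)

    def resize(self, size):
        super().resize(size, size)


r1 = Rectangle(100, 200, 15, 30)
q1 = Square(300, 400, 50)

failed = []
for shape in [r1, q1]:
    try:
        shape.resize(shape.size_x * 2, shape.size_y * 2)
    except TypeError:
        # Square.resize accepts a single size, so this call is rejected
        failed.append(type(shape).__name__)

print(failed)
```

The rectangle is resized as expected, while the square raises a TypeError and keeps its original size.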

This is a classic problem in OOP. While we, as humans, perceive a square just as a slightly special rectangle, from the interface point of view the two classes are different, and thus should not be in the same inheritance tree when we are dealing with dimensions. This is an important consideration: Rectangle and Square are polymorphic on the move method, but not on __init__ and resize. So, the question is if we could somehow separate the two natures of being movable and resizeable.

Now, discussing interfaces, polymorphism, and the reasons behind them would require an entirely separate post, so in the following sections, I'm going to ignore the matter and just consider the object interface optional. You will thus find examples of objects that break the interface of the parent, and objects that keep it. Just remember: whenever you change the signature of a method you change the (implicit) interface of the object, and thus you break polymorphism. I'll discuss another time if I consider this right or wrong.

Mixin classes¶

MRO is a good solution that prevents ambiguity, but it leaves programmers with the responsibility of creating sensible inheritance trees. The algorithm helps to resolve complicated situations, but this doesn't mean we should create them in the first place. So, how can we leverage multiple inheritance without creating systems that are too complicated to grasp? Moreover, is it possible to use multiple inheritance to solve the problem of managing the double (or multiple) nature of an object, as in the previous example of a movable and resizeable shape?

The solution comes from mixin classes: those are small classes that provide attributes but are not included in the standard inheritance tree, working more as "additions" to the current class than as proper ancestors. Mixins originate in the LISP programming language, and specifically in what could be considered the first version of the Common Lisp Object System, the Flavors extension. Modern OOP languages implement mixins in many different ways: Scala, for example, has a feature called traits, which live in their own space with a specific hierarchy that doesn't interfere with the proper class inheritance.

Mixin classes in Python

Python doesn't provide support for mixins with any dedicated language feature, so we use multiple inheritance to implement them. This clearly requires great discipline from the programmer, as it violates one of the main assumptions for mixins: their orthogonality to the inheritance tree. In Python, so-called mixins are classes that live in the normal inheritance tree, but they are kept small to avoid creating hierarchies that are too complicated for the programmer to grasp. In particular, mixins shouldn't have common ancestors other than object with the other parent classes.

Let's have a look at a simple example

class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y


class ResizableMixin:
    def resize(self, size_x, size_y):
        self.size_x = size_x
        self.size_y = size_y


class ResizableGraphicalEntity(GraphicalEntity, ResizableMixin):
    pass

rge = ResizableGraphicalEntity(5, 4, 200, 300)
rge.resize(1000, 2000)

Here, the class ResizableMixin doesn't inherit from GraphicalEntity, but directly from object, so ResizableGraphicalEntity gets from it just the resize method. As we said before, this simplifies the inheritance tree of ResizableGraphicalEntity and helps to reduce the risk of the diamond problem. It leaves us free to use GraphicalEntity as a parent for other classes without having to inherit methods that we don't want. Please remember that this happens because the classes are designed to avoid it, and not because of language features: the MRO algorithm just ensures that there will always be an unambiguous choice in case of multiple ancestors.
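A quick way to check what the MRO does with this mixin is to inspect the attribute __mro__ of the final class. The following sketch repeats the classes of the example; the variable mro_names is my own addition and exists only for illustration.

```python
class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y


class ResizableMixin:
    def resize(self, size_x, size_y):
        self.size_x = size_x
        self.size_y = size_y


class ResizableGraphicalEntity(GraphicalEntity, ResizableMixin):
    pass


# Names of the classes in method resolution order
mro_names = [cls.__name__ for cls in ResizableGraphicalEntity.__mro__]
print(mro_names)

# __init__ comes from GraphicalEntity, resize from the mixin
rge = ResizableGraphicalEntity(5, 4, 200, 300)
rge.resize(1000, 2000)
print(rge.size_x, rge.size_y)
```

The mixin shows up in the MRO after GraphicalEntity, and the only ancestor the two classes share is object.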

Mixins cannot usually be too generic. After all, they are designed to add features to classes, but these new features often interact with other pre-existing features of the augmented class. In this case, the resize method interacts with the attributes size_x and size_y that have to be present in the object. There are obviously examples of pure mixins, but since they would require no initialization their scope is definitely limited.

Using mixins to hijack inheritance

Thanks to the MRO, Python programmers can leverage multiple inheritance to override methods that objects inherit from their parents, allowing them to customise classes without code duplication. Let's have a look at this example

class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y

class Button(GraphicalEntity):
    def __init__(self, pos_x, pos_y, size_x, size_y):
        super().__init__(pos_x, pos_y, size_x, size_y)
        self.status = False

    def toggle(self):
        self.status = not self.status

b = Button(10, 20, 200, 100)

As you can see, the Button class extends the GraphicalEntity one in a classic way, using super to call the parent's __init__ method before adding the new status attribute. Now, if I want to create a SquareButton class I have two choices.

I might just override __init__ in the new class

class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y


class Button(GraphicalEntity):
    def __init__(self, pos_x, pos_y, size_x, size_y):
        super().__init__(pos_x, pos_y, size_x, size_y)
        self.status = False

    def toggle(self):
        self.status = not self.status


class SquareButton(Button):
    def __init__(self, pos_x, pos_y, size):
        super().__init__(pos_x, pos_y, size, size)

b = SquareButton(10, 20, 200)

which performs the requested job, but strongly connects the feature of having a single dimension with the Button nature. If we wanted to create a circular image we could not inherit from SquareButton, as the image has a different nature.

The second option is that of isolating the features connected with having a single dimension in a mixin class, and adding it as a parent for the new class

class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y


class Button(GraphicalEntity):
    def __init__(self, pos_x, pos_y, size_x, size_y):
        super().__init__(pos_x, pos_y, size_x, size_y)
        self.status = False

    def toggle(self):
        self.status = not self.status


class SingleDimensionMixin:
    def __init__(self, pos_x, pos_y, size):
        super().__init__(pos_x, pos_y, size, size)


class SquareButton(SingleDimensionMixin, Button):
    pass

b = SquareButton(10, 20, 200)

The second solution gives the same final result, but promotes code reuse, as now the SingleDimensionMixin class can be applied to other classes derived from GraphicalEntity and make them accept only one size, while in the first solution that feature was tightly connected with the Button ancestor class.

Please note that the position of the mixin is important as super follows the MRO. As it is, the MRO of SquareButton is (SquareButton, SingleDimensionMixin, Button, GraphicalEntity, object), so, when we instantiate it the __init__ method is provided by SingleDimensionMixin, which in turn calls through super the method __init__ of Button. The call super().__init__(pos_x, pos_y, size, size) in SingleDimensionMixin and the signature def __init__(self, pos_x, pos_y, size_x, size_y): in Button match, so everything works.

If we defined SquareButton as

class SquareButton(Button, SingleDimensionMixin):
    pass

then the __init__ method would first be provided by Button, and its super would call the __init__ method of GraphicalEntity. This would however result in an error, as we run SquareButton(10, 20, 200), and Button.__init__ expects four parameters.
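The effect of the two orderings can be checked directly. In the following sketch the class WrongSquareButton is a hypothetical name I use to keep both versions side by side, and mro_names is a small helper added for illustration.

```python
class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y


class Button(GraphicalEntity):
    def __init__(self, pos_x, pos_y, size_x, size_y):
        super().__init__(pos_x, pos_y, size_x, size_y)
        self.status = False


class SingleDimensionMixin:
    def __init__(self, pos_x, pos_y, size):
        super().__init__(pos_x, pos_y, size, size)


class SquareButton(SingleDimensionMixin, Button):
    pass


# Hypothetical name: same parents, but with the mixin in second position
class WrongSquareButton(Button, SingleDimensionMixin):
    pass


def mro_names(cls):
    """Return the names of the classes in the MRO of cls."""
    return [c.__name__ for c in cls.__mro__]


print(mro_names(SquareButton))
print(mro_names(WrongSquareButton))

# With the mixin first, the three-argument call works
b = SquareButton(10, 20, 200)

# With the mixin last, Button.__init__ receives too few arguments
error = None
try:
    WrongSquareButton(10, 20, 200)
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```

In the second class the mixin ends up after GraphicalEntity in the MRO, so its __init__ method is never reached.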

Mixins are not used only when you want to change the object's interface, though. Leveraging super we can achieve interesting designs like

class GraphicalEntity:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        self.pos_x = pos_x
        self.pos_y = pos_y
        self.size_x = size_x
        self.size_y = size_y


class Button(GraphicalEntity):
    def __init__(self, pos_x, pos_y, size_x, size_y):
        super().__init__(pos_x, pos_y, size_x, size_y)
        self.status = False

    def toggle(self):
        self.status = not self.status


class LimitSizeMixin:
    def __init__(self, pos_x, pos_y, size_x, size_y):
        size_x = min(size_x, 500)
        size_y = min(size_y, 400)
        super().__init__(pos_x, pos_y, size_x, size_y)


class LimitSizeButton(LimitSizeMixin, Button):
    pass

b = LimitSizeButton(10, 20, 2000, 1000)
print(b.size_x)
print(b.size_y)

Here, the MRO of LimitSizeButton is (<class '__main__.LimitSizeButton'>, <class '__main__.LimitSizeMixin'>, <class '__main__.Button'>, <class '__main__.GraphicalEntity'>, <class 'object'>), which means that when we initialize it the __init__ method is first provided by LimitSizeMixin, which then calls through super the __init__ method of Button, and through the latter the __init__ method of GraphicalEntity.

Remember that in Python, you are never forced to call the parent's implementation of a method, so the mixin here might also stop the dispatching mechanism if that is the requirement of the business logic of the new object.
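As a minimal sketch of a mixin that stops the dispatching, consider the following code. The classes LockedMixin and LockedButton are hypothetical and are not part of the previous examples; the Button class is reduced to the bare minimum.

```python
class Button:
    def __init__(self):
        self.status = False

    def toggle(self):
        self.status = not self.status


class LockedMixin:
    def toggle(self):
        # No call to super().toggle() here: the chain stops
        # in the mixin and Button.toggle is never executed
        pass


class LockedButton(LockedMixin, Button):
    pass


b = LockedButton()
b.toggle()
print(b.status)
```

Since LockedMixin comes before Button in the MRO and doesn't call super, the status of the button never changes.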

A concrete example: Django class-based views

Finally, let's get to the original source of inspiration for this post: the Django codebase. I will show you here how the Django programmers used multiple inheritance and mixin classes to promote code reuse, and you will now hopefully grasp all the reasons behind them.

The example I chose can be found in the code of generic views, and in particular in two classes: TemplateResponseMixin and TemplateView.

As you might know, the Django View class is the ancestor of all class-based views and provides a dispatch method that converts HTTP request methods into Python function calls (CODE). Now, the TemplateView is a view that answers a GET request by rendering a template with the data coming from a context passed when the view is called. Given the mechanism behind Django views, then, TemplateView should implement a get method and return the content of the HTTP response. The code of the class is

class TemplateView(TemplateResponseMixin, ContextMixin, View):
    """
    Render a template. Pass keyword arguments from the URLconf to the context.
    """
    def get(self, request, *args, **kwargs):
        context = self.get_context_data(**kwargs)
        return self.render_to_response(context)

As you can see TemplateView is a View, but it uses two mixins to inject features. Let's have a look at TemplateResponseMixin

class TemplateResponseMixin:
    [...]

    def render_to_response(self, context, **response_kwargs):
        [...]

    def get_template_names(self):
        [...]

I removed the code of the class as it is not crucial for the present discussion; you can see the full class here.

It is clear that TemplateResponseMixin just adds to any class the two methods get_template_names and render_to_response. The latter is called in the get method of TemplateView to create the response. Let's have a look at a simplified schema of the calls:

GET request --> TemplateView.dispatch --> View.dispatch --> TemplateView.get --> TemplateResponseMixin.render_to_response
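The chain of calls can be mimicked without Django using a few toy classes. Please note that this is only a sketch of the mechanism and not the Django code: all the classes prefixed with Toy are hypothetical, the request is reduced to a string with the HTTP method, and the template is a plain Python format string.

```python
class ToyView:
    http_method_names = ["get", "post"]

    def dispatch(self, method, **kwargs):
        # Convert the HTTP method into the name of a Python method
        name = method.lower()
        handler = None
        if name in self.http_method_names:
            handler = getattr(self, name, None)
        if handler is None:
            return "405 Method Not Allowed"
        return handler(**kwargs)


class ToyContextMixin:
    def get_context_data(self, **kwargs):
        return kwargs


class ToyTemplateResponseMixin:
    template = "Hello, {name}!"

    def render_to_response(self, context):
        return self.template.format(**context)


class ToyTemplateView(ToyTemplateResponseMixin, ToyContextMixin, ToyView):
    def get(self, **kwargs):
        context = self.get_context_data(**kwargs)
        return self.render_to_response(context)


view = ToyTemplateView()

# dispatch comes from ToyView, get from ToyTemplateView,
# render_to_response from ToyTemplateResponseMixin
response = view.dispatch("GET", name="world")
print(response)

# No post method is defined anywhere in the hierarchy
print(view.dispatch("POST"))
```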

It might look complicated, but try to follow the code a couple of times and the whole picture will start to make sense. The important thing I want to stress is that the code in TemplateResponseMixin is available for any class that wants to have the feature of rendering a template, for example DetailView (CODE), which receives the feature of showing the details of a single object from SingleObjectTemplateResponseMixin, which inherits from TemplateResponseMixin, overriding its method get_template_names (CODE).

As we discussed before, mixins cannot be too generic, and here we see a good example of a mixin designed to work on specific classes. TemplateResponseMixin has to be applied to classes that contain self.request (CODE), and while this doesn't mean exclusively classes derived from View, it is clear that it has been designed to augment that specific type.

Takeaway points

Inheritance is designed to promote code reuse but can lead to the opposite result
Multiple inheritance allows us to keep the inheritance tree simple
Multiple inheritance leads to possible problems that are solved in Python through the MRO
Interfaces (either implicit or explicit) should be part of your design
Mixin classes are used to add simple changes to classes
Mixins are implemented in Python using multiple inheritance: they have great expressive power but require careful design.

Final words

I hope this post helped you to understand a bit more how multiple inheritance works, and to be less scared by it. I also hope I managed to show you that classes have to be carefully designed and that there is a lot to consider when you create a class system. Once again, please don't forget composition, it's a powerful and too often forgotten tool.

Updates

2020-03-13: GitHub user sureshvv noticed that the LimitSizeMixin method __init__ had the wrong parameters pos_x and pos_y, instead of size_x and size_y. Thanks!

2021-12-20: Alexander fixed a mistake in the part relative to SquareButton and the behaviour of super(). Thanks!

Feedback

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

Public key cryptography: RSA keys

2018-04-25T13:00:00+01:00

I bet you have created an RSA key pair at least once, usually because you needed to connect to GitHub and you wanted to avoid typing your password every time. You diligently followed the documentation on how to create SSH keys and after a couple of minutes your setup was complete.

But do you know what you actually did?

Do you know what the file ~/.ssh/id_rsa really contains? Why did ssh create two files with such different formats? Did you notice that one file begins with ssh-rsa, while the other begins with -----BEGIN RSA PRIVATE KEY-----? Have you noticed that sometimes the header of the second file is missing the RSA part and just says BEGIN PRIVATE KEY?

I believe that a minimum level of knowledge regarding the various formats of RSA keys is mandatory for every developer nowadays, not to mention the importance of understanding them deeply if you want to pursue a career in the infrastructure management world.

RSA algorithm and key pairs

Since the invention of public-key cryptography, various systems have been devised to create the key pair. One of the first ones is RSA, the creation of three brilliant cryptographers, that dates back to 1977. The story of RSA is quite interesting, as it was first invented by an English mathematician, Clifford Cocks, who was however forced to keep it secret by the British intelligence office he was working for.

Keeping in mind that RSA is not a synonym for public-key cryptography but only one of the possible implementations, I wanted to write a post on it because it is still, more than 40 years after its publication, one of the most widespread algorithms. In particular it is the standard algorithm used to generate SSH key pairs, and since nowadays every developer has their public key on GitHub, BitBucket, or similar systems, we may arguably say that RSA is pretty ubiquitous.

I will not cover the internals of the RSA algorithm in this article, however. If you are interested in the gory details of the mathematical framework you may find plenty of resources both on the Internet and in textbooks. The theory behind it is not trivial, but it is definitely worth the time if you want to be serious about the mathematical part of cryptography.

In this article I will instead explore two ways to create RSA key pairs and the formats used to store them. Applied cryptography is, like many other topics in computer science, a moving target, and the tools change often. Sometimes it is pretty easy to find out how to do something (StackOverflow helps), but less easy to get a clear picture of what is going on.

All the examples shown in this post use a 2048-bit RSA key created for this purpose, so all the numbers you see come from a real example. The key was obviously trashed after I wrote the article.

The PEM format

Let's start the discussion about key pairs with the format used to store them. Nowadays the most widely accepted storage format is called PEM (Privacy-Enhanced Mail). As the name suggests, this format was initially created for e-mail encryption but later became a general format to store cryptographic data like keys and certificates. It is described in RFC 7468 ("Textual Encodings of PKIX, PKCS, and CMS Structures").

An example private key in PEM format is the following (the Base64 body is omitted here)

-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----

Basically, you can tell you are dealing with a PEM format from the typical header and footer that identify the content. While the hyphens and the two words BEGIN and END are always present, the PRIVATE KEY part describes the content and can change if the PEM file contains something different from a key, for example an X.509 certificate for SSL.

The PEM format specifies that the body of the content (the part between the header and the footer) is encoded using Base64.

If the private key has been encrypted with a password the header and the footer are different

-----BEGIN ENCRYPTED PRIVATE KEY-----
[...]
-----END ENCRYPTED PRIVATE KEY-----

When the PEM format is used to store cryptographic keys the body of the content is in a format called PKCS #8. Initially a standard created by a private company (RSA Laboratories), it became a de facto standard and has since been described in various RFCs, most notably RFC 5208 ("Public-Key Cryptography Standards (PKCS) #8: Private-Key Information Syntax Specification Version 1.2").

The PKCS #8 format describes the content using a description language called ASN.1 (Abstract Syntax Notation One) and the relative binary encoding DER (Distinguished Encoding Rules) to serialise the resulting structure. This means that Base64-decoding the content will return some binary content that can be processed only by an ASN.1 parser.

Let me visually recap the structure

-----BEGIN label-----
+--------------------------- Base64 ---------------------------+
|                                                              |
| PKCS #8 content:                                             |
| ASN.1 language serialised with DER                           |
|                                                              |
+--------------------------------------------------------------+
-----END label-----

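Please note that, because of the underlying ASN.1 structure, RSA PEM bodies always start with the same characters: MII for 2048 and 4096 bit keys, and MIG for the shorter structure of a 1024 bit public key (a 1024 bit private key is longer than 255 bytes and starts with MII as well).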
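The reason behind these fixed prefixes can be checked with a few lines of Python. A DER structure starts with the tag 0x30 (SEQUENCE) followed by the length, and Base64 turns those first bytes into the first characters of the PEM body. In this sketch the function pem_prefix and the two sample headers are my own choices for illustration: 1214 bytes is the size of a typical 2048-bit key in PKCS #8 format, and 159 bytes is what I assume for the content of a 1024-bit public key.

```python
import base64


def pem_prefix(der_header):
    """Return the first three Base64 characters produced by a DER header."""
    return base64.b64encode(der_header).decode("ascii")[:3]


# 0x82 announces a length stored in the next two bytes (256 to 65535);
# here the first of them is 0x04, as in a 1214-byte (0x04BE) SEQUENCE
long_form = bytes([0x30, 0x82, 0x04])

# 0x81 announces a length stored in the next byte (128 to 255);
# here 0x9F, a 159-byte SEQUENCE
short_form = bytes([0x30, 0x81, 0x9F])

print(pem_prefix(long_form))
print(pem_prefix(short_form))
```

The first prefix is MII and the second one is MIG. The third character depends on the first bits of the length, so the prefix MII holds as long as the structure is shorter than 16384 bytes, which is the case for common RSA key sizes.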

OpenSSL and ASN.1

OpenSSL can directly decode a key in PEM format and show the underlying ASN.1 structure with the module asn1parse

$ openssl asn1parse -inform pem -in private.pem
    0:d=0  hl=4 l=1214 cons: SEQUENCE          
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=2 l=  13 cons: SEQUENCE          
    9:d=2  hl=2 l=   9 prim: OBJECT            :rsaEncryption
   20:d=2  hl=2 l=   0 prim: NULL              
   22:d=1  hl=4 l=1192 prim: OCTET STRING      [HEX DUMP]:308204A40201000282010100B2F5FD3F9F0917112
   CE42F8BF87ED676E15258BE443F36DEAFB0B69BDE2496B495EAAD1B01CAD84271B014E96F79386C636D348516DA74A68
   A8C70FBA882870C47B4218D8F49186DDF72727B9D80C21911C3E337C6E407FFB47C2F2767B0D164D8A1E9AF95F6481BF
   8D9EDFB2E3904B2529268C460256FAFD0A677D29898F10B1D15128A695839FC08EDD584E8335615B1D1D7277BE65C532
   DCA92DDC7050374868B117EA9154914EF9292B8443F13696E4FAD50DED6BD90E5A6F7ED33BE2ECE31C6DD7A4253EE6CD
   C56787DDD1D5CD776614022DB87D03BB22F23285B5A3167AF8DACABBEA40004471337D3781E8C5CCA0EA5E27799B510E
   4EF938C61CAA60D02030100010282010100B24255000A6A03901827333539511E4F4C21BA43CBB72BF0A51060D4E1719
   0AC50A871C57503986696D7CDFCB80D0726EFE2D76DBA55DFDC0425E064CC753810035C6A0F97AA37AB39E7C6215BC1E
   595131D0C3782E5A11213B59F42A1067F8CF43C538992D6BEFD1DE3F6293CE18ECC1173C4E7D6DD7362AD7323E7A218B
   5FFB0F245EB796327CC87493EDD134234ED5F3B14A4C4D92374597F64A6D3CB2C10F0CD2D57E99F58C8D28F2049D1433
   CC4BD677017AD1BDD1C83CFB8FB7E8C8FDCF0B4FB77DE7B8285749CEDFBFD6878F7F7930073F0F42ADDCBA8385D7ED05
   CDFCAA2A2BA757601723A96201FECCC2E65C65E14F65F1D34D6ECDFE3F85401800102818100E1D16389BF6EFF7AE44F6
   57106ED81C81A48B5FB356F83DD4A229E8654BDC036716BBD9D46DFD1498132545054958ACA5CFDA709D97CC8C6A9E92
   03D05F7B9D45E685A19A5F58267FCB17FCF502B32CFEDB94CAEA58EE5F63EBA5F33D09946C8652132344410D3D658748
   BCAE256F24896C2A9AD9340D3C8392652DA8ED7346D02818100CAE155C9B3A4546B5FC3CF4CC80D539D531C406BAC5ED
   82818E977B496F9F614CEFB1179E3BFBFAB22BCA7F88EBB8C9B1327AE70113242DFF0866370B6C76782DBD50DBE1FEE9
   B3316B9AAC7BABB7CFA0A9EF26C3C976CF62DA8F41EFE065458DC7C1CBCA78FB1CB4FF7AA50D116CE1640956A4E89EAD
   F5293FAA13A2349F42102818100BC3B93324E6D92EE7883AA366624F28ABF461ED3B0BE2CF7F805158939F815D20C075
   83E52C6DCA8DDD5FB2C1EE5AC9474A1476CD16ACFDDB1E24EEA2F204939BA1C58068B2D342FC4169D484D36451BC7B82
   F306176D53FC71809A5A25B320277320DAC3D949D504DD9907164EC3EF7BD1BB4DEA82160A7C4E3AA2ADEE88A9D02818
   02915E921A7D7A7A0F70BD8775C2C16BACD91F319DB1679FFE4CBA30A5768D784EF45B90C4E2B0ECDC18323211B06B03
   AD76E39CD482E3D8CCC50EAE270A1813CE6F80688723F07FF18A3110AD1AE16692CAD73BAA7AAA2CE5800D72F4F92489
   296542C1DA87159382B41A4A42933CD18848BBDB39A0A8E9F5288770E27075B010281803AB4E3B841AB234515BF0A8D2
   E40FB6E95389702D834474E9AD849124DC6C1D342738D4E7510265DF6B744EBAA4A88A7995346BEEF047DB024CE8B2A4
   E3923B0566389948AB0BBB031879770DA14F4418AEB75AE98349122A2D9535117B05BEF938A1211A3BE6E882957BC2A5
   F1DE5CA50C26F42EE0A383A2A2B6340D52E1A36

What you see in the code snippet is the private key in ASN.1 format. Remember that DER is only used to go from the text representation of ASN.1 to binary data, so we don't see it unless we decode the Base64 content into a file and open it with a binary editor.
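To get a feeling for what DER looks like at the byte level, here is a minimal sketch of a reader for the header of a DER element. The function read_der_header is hypothetical and handles only single-byte tags and definite lengths, which is enough for the first bytes of the OCTET STRING hex dump shown above.

```python
def read_der_header(data, offset=0):
    """Return (tag, header_length, content_length) of the element at offset."""
    tag = data[offset]
    first = data[offset + 1]
    if first < 0x80:
        # Short form: the byte itself is the length of the content
        return tag, 2, first
    # Long form: the low 7 bits tell how many bytes encode the length
    num_bytes = first & 0x7F
    start = offset + 2
    length = int.from_bytes(data[start:start + num_bytes], "big")
    return tag, 2 + num_bytes, length


# First bytes of the OCTET STRING hex dump printed by asn1parse
data = bytes.fromhex("308204A40201000282010100B2F5")

outer = read_der_header(data, 0)    # tag 0x30 is a SEQUENCE
first_int = read_der_header(data, 4)  # tag 0x02 is an INTEGER
second_int = read_der_header(data, 7)  # another INTEGER

print(outer, first_int, second_int)
```

The three tuples contain the tag, the header length (hl in the output of asn1parse), and the content length (l in the same output) of the first three elements nested in the OCTET STRING.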

Note that the ASN.1 structure contains the type of the object (rsaEncryption, in this case). You can further decode the OCTET STRING field, which is the actual key, specifying the offset

$ openssl asn1parse -inform pem -in private.pem -strparse 22
    0:d=0  hl=4 l=1188 cons: SEQUENCE          
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=4 l= 257 prim: INTEGER           :B2F5FD3F9F0917112CE42F8BF87ED676E15258BE443F36DEAFB
    0B69BDE2496B495EAAD1B01CAD84271B014E96F79386C636D348516DA74A68A8C70FBA882870C47B4218D8F49186DDF
    72727B9D80C21911C3E337C6E407FFB47C2F2767B0D164D8A1E9AF95F6481BF8D9EDFB2E3904B2529268C460256FAFD
    0A677D29898F10B1D15128A695839FC08EDD584E8335615B1D1D7277BE65C532DCA92DDC7050374868B117EA9154914
    EF9292B8443F13696E4FAD50DED6BD90E5A6F7ED33BE2ECE31C6DD7A4253EE6CDC56787DDD1D5CD776614022DB87D03
    BB22F23285B5A3167AF8DACABBEA40004471337D3781E8C5CCA0EA5E27799B510E4EF938C61CAA60D
  268:d=1  hl=2 l=   3 prim: INTEGER           :010001
  273:d=1  hl=4 l= 257 prim: INTEGER           :B24255000A6A03901827333539511E4F4C21BA43CBB72BF0A51
    060D4E17190AC50A871C57503986696D7CDFCB80D0726EFE2D76DBA55DFDC0425E064CC753810035C6A0F97AA37AB39
    E7C6215BC1E595131D0C3782E5A11213B59F42A1067F8CF43C538992D6BEFD1DE3F6293CE18ECC1173C4E7D6DD7362A
    D7323E7A218B5FFB0F245EB796327CC87493EDD134234ED5F3B14A4C4D92374597F64A6D3CB2C10F0CD2D57E99F58C8
    D28F2049D1433CC4BD677017AD1BDD1C83CFB8FB7E8C8FDCF0B4FB77DE7B8285749CEDFBFD6878F7F7930073F0F42AD
    DCBA8385D7ED05CDFCAA2A2BA757601723A96201FECCC2E65C65E14F65F1D34D6ECDFE3F854018001
  534:d=1  hl=3 l= 129 prim: INTEGER           :E1D16389BF6EFF7AE44F657106ED81C81A48B5FB356F83DD4A2
    29E8654BDC036716BBD9D46DFD1498132545054958ACA5CFDA709D97CC8C6A9E9203D05F7B9D45E685A19A5F58267FC
    B17FCF502B32CFEDB94CAEA58EE5F63EBA5F33D09946C8652132344410D3D658748BCAE256F24896C2A9AD9340D3C83
    92652DA8ED7346D
  666:d=1  hl=3 l= 129 prim: INTEGER           :CAE155C9B3A4546B5FC3CF4CC80D539D531C406BAC5ED82818E
    977B496F9F614CEFB1179E3BFBFAB22BCA7F88EBB8C9B1327AE70113242DFF0866370B6C76782DBD50DBE1FEE9B3316
    B9AAC7BABB7CFA0A9EF26C3C976CF62DA8F41EFE065458DC7C1CBCA78FB1CB4FF7AA50D116CE1640956A4E89EADF529
    3FAA13A2349F421
  798:d=1  hl=3 l= 129 prim: INTEGER           :BC3B93324E6D92EE7883AA366624F28ABF461ED3B0BE2CF7F80
    5158939F815D20C07583E52C6DCA8DDD5FB2C1EE5AC9474A1476CD16ACFDDB1E24EEA2F204939BA1C58068B2D342FC4
    169D484D36451BC7B82F306176D53FC71809A5A25B320277320DAC3D949D504DD9907164EC3EF7BD1BB4DEA82160A7C
    4E3AA2ADEE88A9D
  930:d=1  hl=3 l= 128 prim: INTEGER           :2915E921A7D7A7A0F70BD8775C2C16BACD91F319DB1679FFE4C
    BA30A5768D784EF45B90C4E2B0ECDC18323211B06B03AD76E39CD482E3D8CCC50EAE270A1813CE6F80688723F07FF18
    A3110AD1AE16692CAD73BAA7AAA2CE5800D72F4F92489296542C1DA87159382B41A4A42933CD18848BBDB39A0A8E9F5
    288770E27075B01
1061:d=1  hl=3 l= 128 prim: INTEGER           :3AB4E3B841AB234515BF0A8D2E40FB6E95389702D834474E9AD8
    49124DC6C1D342738D4E7510265DF6B744EBAA4A88A7995346BEEF047DB024CE8B2A4E3923B0566389948AB0BBB0318
    79770DA14F4418AEB75AE98349122A2D9535117B05BEF938A1211A3BE6E882957BC2A5F1DE5CA50C26F42EE0A383A2A
    2B6340D52E1A36

Since this is an RSA key, the fields represent specific components of the algorithm. After the version number (the INTEGER 00) we find, in order: the modulus n = pq, the public exponent e, the private exponent d, the two prime numbers p and q, and the values d_p, d_q, and q_inv (for the Chinese remainder theorem speed-up).
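The relationship between these values is easier to see with a toy key. The following sketch uses two tiny primes that I chose for illustration (real keys use primes of 1024 bits or more) and computes d as the inverse of e modulo (p-1)(q-1), which is one common choice and not necessarily what a specific tool does. It requires Python 3.8 or later because of the three-argument pow with a negative exponent.

```python
# Toy primes, chosen only to keep the numbers readable
p = 61
q = 53

n = p * q                    # modulus
e = 17                       # public exponent
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)          # private exponent

# Values stored in the key for the Chinese remainder theorem speed-up
d_p = d % (p - 1)
d_q = d % (q - 1)
q_inv = pow(q, -1, p)

message = 65
ciphertext = pow(message, e, n)

# Plain decryption uses d and n
plain = pow(ciphertext, d, n)

# CRT decryption uses two smaller exponentiations
m1 = pow(ciphertext, d_p, p)
m2 = pow(ciphertext, d_q, q)
h = (q_inv * (m1 - m2)) % p
plain_crt = m2 + h * q

print(n, d, d_p, d_q, q_inv)
print(plain, plain_crt)
```

Both decryptions return the original message, but the second works with numbers that are half the size, which is why the additional values are stored in the private key.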

If the key has been encrypted there are fields with information about the cipher, and the OCTET STRING fields cannot be further parsed because of the encryption.

$ openssl asn1parse -inform pem -in private-enc.pem
    0:d=0  hl=4 l=1311 cons: SEQUENCE          
    4:d=1  hl=2 l=  73 cons: SEQUENCE          
    6:d=2  hl=2 l=   9 prim: OBJECT            :PBES2
   17:d=2  hl=2 l=  60 cons: SEQUENCE          
   19:d=3  hl=2 l=  27 cons: SEQUENCE          
   21:d=4  hl=2 l=   9 prim: OBJECT            :PBKDF2
   32:d=4  hl=2 l=  14 cons: SEQUENCE          
   34:d=5  hl=2 l=   8 prim: OCTET STRING      [HEX DUMP]:7FBE6B5C86A4B922
   44:d=5  hl=2 l=   2 prim: INTEGER           :0800
   48:d=3  hl=2 l=  29 cons: SEQUENCE          
   50:d=4  hl=2 l=   9 prim: OBJECT            :aes-256-cbc
   61:d=4  hl=2 l=  16 prim: OCTET STRING      [HEX DUMP]:7FC1CC749F456498F01E43108E4340DE
   79:d=1  hl=4 l=1232 prim: OCTET STRING      [HEX DUMP]:A5581EDC2797FC4E1AD0B66A00B765900AF1164D8
   F67458C1A4E72F54A65F2B8C0C5AD7E42584B95161FD98FBECA07D8E1049687C365ED157C45F1B57B175D2EF778A1FE7
   D12E50C0DF4248F0E1469DA40F9948581F16546F9582D9DCA83AC07C9466A6E3E6CE98CC241C44DAB32F5891B96DE302
   4B6E6A0F4980C6286D6EB8AA1680AD132810EEFB127DE42968142F4F9A4A2CE55A560C054C54DFFBB720A81F3F50A2A6
   3D748CE06309F55340BD4C74980C48F4C9D41650568A62BBE8E0337653BD4A2F7D47C3A24514B5D3100ED40C164831C6
   5A96DC90AD20F4AEF02E00203B0F0B2D550987AEE8F4C7E0E7C0CFF426B465D3CF568D02EE86AF043345954B0AAA649F
   A9F80E026E2A189EC60772A058615DCFEC9EC4D2D12CDEB7844EAA00202E435A0B9B0A28AC4F2DA213214F773A2319D5
   5A560D5C99246F9895F5EF04D97FF1CE26EFC2FF82249F6E94253CB92EE0A74AE3942285C2DFFC77883709E7FF2569FD
   9C8F58C112CD4A125E40E7BC8599242D71DE7D48416B6A36FBE0B90BA9A05AFB982CAE9AD337C2318582AA328ABC341F
   BB1C036DE334DE327DEC97BA757CBBAED26F25DD74BD8BE9215B479CD49D8357AFA5289A0265ADE025F9FC0CDB1CDBF0
   4C812F20B7CEB58BF12C1FD1756AABD7F557B87E1D245E8062D1DF4078D77AD98BFDF0C0F3A06A7FA11BFAE0EBF8F3EE
   1F8AB0D6D7C905D4D238E2738613EA753E044589CEBDF3714CACEC298653FA45AF5977BDCF23B5DD60B479C7958B8AC1
   8CAA4AA4A79C283805246675BBB8D2D0E5B714320E7E6FE8B2EF73DB9839095229B9653726AB9689B19AB47113F70204
   83B2D1A82FE2EB9ABAB429DDF5ACDEBCAB62BABD48D2DBA1D398B03F9919F1DAC8CDA19D39BBAF2B5FE96C43E78F565C
   465019DF88E71BCE35C6F7F8BE87EB384FA1193345E47CA9382BCEFFC2E6B37681E8D95EB48BC7044F7DCA743217D4C0
   81200502E98EC2CFAA9D17277D5385E65CC8104DA999E31532A8B9B3B4D3E219613AE09BC9F10553CC4E5F135ACD3FB4
   A3BBAB21839CEFBBC0D4BB16AE4FBD7407E6E3709B059BD86AFFE032805CE5FB0B8005009B5964B79E478DA7FE88C20D
   D2FEDA10A0EB3433ADC90AF5DD8772B840A5CD7C5E32D96153E41F12BA501EF1F48C4E20CB0120CFBB6F546C2B6E22E0
   834CB9DFBFA4834FEB4B7374788F781A1634ABF9D1FD014E6DB3749E6A086155521ADB9F271D6BF6F60455903B1D913D
   A639EE9F5CA5135FD2A1873FF35EAB8C151C5B90826E4303233D4BB053EBD929107874CDCCADFFF492A7CB595EADF03E
   4C0FE15326752898F1B9AA3EAC9907D9F276E6AB37AFA34FF8F3DBAB7B009754CF1A13029CD6857686105830F0CF6E99
   476CB07ECAAEA8B5CCC2720479423F8504E783D6712E424C636DAB41203D9EC76F47C4B56F453C42E5626048C24CC585
   F0710514EEF6D4C9644E0721CEAE9F885FBD672742A555095A895C7F0D4E814BEF4D223B13285E95BEDF7357D3545784
   32C1EBB63A6EF1D83E21A08DADA073BF9419C7A3185BB492A13569F262683E7CD86EC66CF671C919789038598EFEC22B
   C8EA1E265A4E0864F9E7253BE32457AC1B186722F3D0FF4AD450D04BA97D5B7DC1AA617DBD25EE8EC912072ABCBF5394
   D08AA276732666D4C349196940BFE869DA909EC03A8E25B23339EE50453CB5F81400B1380CA46AF0FC012CA55F322C1C
   5806E5D76D4CD8308B8FDFE

OpenSSL and RSA keys

Another way to look into a private key with OpenSSL is to use the module rsa. While the module asn1parse is a generic ASN.1 parser, the module rsa knows the structure of an RSA key and can properly output the field names

$ openssl rsa -in private.pem -noout -text
Private-Key: (2048 bit)
modulus:
    00:b2:f5:fd:3f:9f:09:17:11:2c:e4:2f:8b:f8:7e:
    d6:76:e1:52:58:be:44:3f:36:de:af:b0:b6:9b:de:
    24:96:b4:95:ea:ad:1b:01:ca:d8:42:71:b0:14:e9:
    6f:79:38:6c:63:6d:34:85:16:da:74:a6:8a:8c:70:
    fb:a8:82:87:0c:47:b4:21:8d:8f:49:18:6d:df:72:
    72:7b:9d:80:c2:19:11:c3:e3:37:c6:e4:07:ff:b4:
    7c:2f:27:67:b0:d1:64:d8:a1:e9:af:95:f6:48:1b:
    f8:d9:ed:fb:2e:39:04:b2:52:92:68:c4:60:25:6f:
    af:d0:a6:77:d2:98:98:f1:0b:1d:15:12:8a:69:58:
    39:fc:08:ed:d5:84:e8:33:56:15:b1:d1:d7:27:7b:
    e6:5c:53:2d:ca:92:dd:c7:05:03:74:86:8b:11:7e:
    a9:15:49:14:ef:92:92:b8:44:3f:13:69:6e:4f:ad:
    50:de:d6:bd:90:e5:a6:f7:ed:33:be:2e:ce:31:c6:
    dd:7a:42:53:ee:6c:dc:56:78:7d:dd:1d:5c:d7:76:
    61:40:22:db:87:d0:3b:b2:2f:23:28:5b:5a:31:67:
    af:8d:ac:ab:be:a4:00:04:47:13:37:d3:78:1e:8c:
    5c:ca:0e:a5:e2:77:99:b5:10:e4:ef:93:8c:61:ca:
    a6:0d
publicExponent: 65537 (0x10001)
privateExponent:
    00:b2:42:55:00:0a:6a:03:90:18:27:33:35:39:51:
    1e:4f:4c:21:ba:43:cb:b7:2b:f0:a5:10:60:d4:e1:
    71:90:ac:50:a8:71:c5:75:03:98:66:96:d7:cd:fc:
    b8:0d:07:26:ef:e2:d7:6d:ba:55:df:dc:04:25:e0:
    64:cc:75:38:10:03:5c:6a:0f:97:aa:37:ab:39:e7:
    c6:21:5b:c1:e5:95:13:1d:0c:37:82:e5:a1:12:13:
    b5:9f:42:a1:06:7f:8c:f4:3c:53:89:92:d6:be:fd:
    1d:e3:f6:29:3c:e1:8e:cc:11:73:c4:e7:d6:dd:73:
    62:ad:73:23:e7:a2:18:b5:ff:b0:f2:45:eb:79:63:
    27:cc:87:49:3e:dd:13:42:34:ed:5f:3b:14:a4:c4:
    d9:23:74:59:7f:64:a6:d3:cb:2c:10:f0:cd:2d:57:
    e9:9f:58:c8:d2:8f:20:49:d1:43:3c:c4:bd:67:70:
    17:ad:1b:dd:1c:83:cf:b8:fb:7e:8c:8f:dc:f0:b4:
    fb:77:de:7b:82:85:74:9c:ed:fb:fd:68:78:f7:f7:
    93:00:73:f0:f4:2a:dd:cb:a8:38:5d:7e:d0:5c:df:
    ca:a2:a2:ba:75:76:01:72:3a:96:20:1f:ec:cc:2e:
    65:c6:5e:14:f6:5f:1d:34:d6:ec:df:e3:f8:54:01:
    80:01
prime1:
    00:e1:d1:63:89:bf:6e:ff:7a:e4:4f:65:71:06:ed:
    81:c8:1a:48:b5:fb:35:6f:83:dd:4a:22:9e:86:54:
    bd:c0:36:71:6b:bd:9d:46:df:d1:49:81:32:54:50:
    54:95:8a:ca:5c:fd:a7:09:d9:7c:c8:c6:a9:e9:20:
    3d:05:f7:b9:d4:5e:68:5a:19:a5:f5:82:67:fc:b1:
    7f:cf:50:2b:32:cf:ed:b9:4c:ae:a5:8e:e5:f6:3e:
    ba:5f:33:d0:99:46:c8:65:21:32:34:44:10:d3:d6:
    58:74:8b:ca:e2:56:f2:48:96:c2:a9:ad:93:40:d3:
    c8:39:26:52:da:8e:d7:34:6d
prime2:
    00:ca:e1:55:c9:b3:a4:54:6b:5f:c3:cf:4c:c8:0d:
    53:9d:53:1c:40:6b:ac:5e:d8:28:18:e9:77:b4:96:
    f9:f6:14:ce:fb:11:79:e3:bf:bf:ab:22:bc:a7:f8:
    8e:bb:8c:9b:13:27:ae:70:11:32:42:df:f0:86:63:
    70:b6:c7:67:82:db:d5:0d:be:1f:ee:9b:33:16:b9:
    aa:c7:ba:bb:7c:fa:0a:9e:f2:6c:3c:97:6c:f6:2d:
    a8:f4:1e:fe:06:54:58:dc:7c:1c:bc:a7:8f:b1:cb:
    4f:f7:aa:50:d1:16:ce:16:40:95:6a:4e:89:ea:df:
    52:93:fa:a1:3a:23:49:f4:21
exponent1:
    00:bc:3b:93:32:4e:6d:92:ee:78:83:aa:36:66:24:
    f2:8a:bf:46:1e:d3:b0:be:2c:f7:f8:05:15:89:39:
    f8:15:d2:0c:07:58:3e:52:c6:dc:a8:dd:d5:fb:2c:
    1e:e5:ac:94:74:a1:47:6c:d1:6a:cf:dd:b1:e2:4e:
    ea:2f:20:49:39:ba:1c:58:06:8b:2d:34:2f:c4:16:
    9d:48:4d:36:45:1b:c7:b8:2f:30:61:76:d5:3f:c7:
    18:09:a5:a2:5b:32:02:77:32:0d:ac:3d:94:9d:50:
    4d:d9:90:71:64:ec:3e:f7:bd:1b:b4:de:a8:21:60:
    a7:c4:e3:aa:2a:de:e8:8a:9d
exponent2:
    29:15:e9:21:a7:d7:a7:a0:f7:0b:d8:77:5c:2c:16:
    ba:cd:91:f3:19:db:16:79:ff:e4:cb:a3:0a:57:68:
    d7:84:ef:45:b9:0c:4e:2b:0e:cd:c1:83:23:21:1b:
    06:b0:3a:d7:6e:39:cd:48:2e:3d:8c:cc:50:ea:e2:
    70:a1:81:3c:e6:f8:06:88:72:3f:07:ff:18:a3:11:
    0a:d1:ae:16:69:2c:ad:73:ba:a7:aa:a2:ce:58:00:
    d7:2f:4f:92:48:92:96:54:2c:1d:a8:71:59:38:2b:
    41:a4:a4:29:33:cd:18:84:8b:bd:b3:9a:0a:8e:9f:
    52:88:77:0e:27:07:5b:01
coefficient:
    3a:b4:e3:b8:41:ab:23:45:15:bf:0a:8d:2e:40:fb:
    6e:95:38:97:02:d8:34:47:4e:9a:d8:49:12:4d:c6:
    c1:d3:42:73:8d:4e:75:10:26:5d:f6:b7:44:eb:aa:
    4a:88:a7:99:53:46:be:ef:04:7d:b0:24:ce:8b:2a:
    4e:39:23:b0:56:63:89:94:8a:b0:bb:b0:31:87:97:
    70:da:14:f4:41:8a:eb:75:ae:98:34:91:22:a2:d9:
    53:51:17:b0:5b:ef:93:8a:12:11:a3:be:6e:88:29:
    57:bc:2a:5f:1d:e5:ca:50:c2:6f:42:ee:0a:38:3a:
    2a:2b:63:40:d5:2e:1a:36

The fields are the same ones we found in the ASN.1 structure, but in this representation we have a better view of the specific values of the RSA key. You can compare the two and see that the values of the fields are the same.

If you want to learn something about RSA, try to investigate the historical reasons behind the choice of 65537 as a common public exponent (as you can see here in the section publicExponent).

PKCS #8 vs PKCS #1

The first version of the PKCS standard (PKCS #1) was specifically tailored to contain an RSA key. Its ASN.1 definition can be found in RFC 8017 ("PKCS #1: RSA Cryptography Specifications Version 2.2")

RSAPublicKey ::= SEQUENCE {
    modulus           INTEGER,  -- n
    publicExponent    INTEGER   -- e
}

RSAPrivateKey ::= SEQUENCE {
    version           Version,
    modulus           INTEGER,  -- n
    publicExponent    INTEGER,  -- e
    privateExponent   INTEGER,  -- d
    prime1            INTEGER,  -- p
    prime2            INTEGER,  -- q
    exponent1         INTEGER,  -- d mod (p-1)
    exponent2         INTEGER,  -- d mod (q-1)
    coefficient       INTEGER,  -- (inverse of q) mod p
    otherPrimeInfos   OtherPrimeInfos OPTIONAL
}

Subsequently, as the need to describe new types of algorithms increased, the PKCS #8 standard was developed. This can contain different types of keys, and defines a specific field for the algorithm identifier. Its ASN.1 definition can be found in RFC 5958 ("Asymmetric Key Packages")

OneAsymmetricKey ::= SEQUENCE {
     version                   Version,
     privateKeyAlgorithm       PrivateKeyAlgorithmIdentifier,
     privateKey                PrivateKey,
     attributes            [0] Attributes OPTIONAL,
     ...,
     [[2: publicKey        [1] PublicKey OPTIONAL ]],
     ...
   }

PrivateKey ::= OCTET STRING
                     -- Content varies based on type of key. The
                     -- algorithm identifier dictates the format of
                     -- the key.

The definition of the field PrivateKey for the RSA algorithm is the same one used in PKCS #1.

If the PEM format uses PKCS #8 its header and footer are

-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----

If it uses PKCS #1, however, there has to be an external identification of the algorithm, so the header and footer are

-----BEGIN RSA PRIVATE KEY-----
[...]
-----END RSA PRIVATE KEY-----

The structure of PKCS #8 is the reason why we had to parse the field at offset 22 to access the RSA parameters when using the module asn1parse of OpenSSL. If you are parsing a PKCS #1 key in PEM format you don't need this second step.
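Since the PEM label announces which structure is stored in the file, a script can decide how to treat a private key just by looking at the header. The following is a minimal sketch, where the function private_key_format and the dictionary PEM_LABELS are hypothetical names of mine. It recognises only the labels discussed in this post, including ENCRYPTED PRIVATE KEY, which is the one used for encrypted PKCS #8 keys.

```python
# Map the PEM labels discussed in this post to the structure they announce
PEM_LABELS = {
    "PRIVATE KEY": "PKCS #8",
    "ENCRYPTED PRIVATE KEY": "PKCS #8 (encrypted)",
    "RSA PRIVATE KEY": "PKCS #1",
}

def private_key_format(pem_text):
    """Return the format suggested by the first PEM header, or None."""
    for line in pem_text.splitlines():
        line = line.strip()
        if line.startswith("-----BEGIN ") and line.endswith("-----"):
            label = line[len("-----BEGIN "):-len("-----")]
            return PEM_LABELS.get(label)
    return None

pem = "-----BEGIN RSA PRIVATE KEY-----\n[...]\n-----END RSA PRIVATE KEY-----"
print(private_key_format(pem))
```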

Private and public key¶

In the RSA algorithm the public key is built using the modulus and the public exponent, which means that we can always derive the public key from the private key. OpenSSL can easily do this with the module rsa, producing the public key in PEM format

$ openssl rsa -in private.pem -pubout
writing RSA key
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsvX9P58JFxEs5C+L+H7W
duFSWL5EPzber7C2m94klrSV6q0bAcrYQnGwFOlveThsY200hRbadKaKjHD7qIKH
DEe0IY2PSRht33Jye52AwhkRw+M3xuQH/7R8LydnsNFk2KHpr5X2SBv42e37LjkE
slKSaMRgJW+v0KZ30piY8QsdFRKKaVg5/Ajt1YToM1YVsdHXJ3vmXFMtypLdxwUD
dIaLEX6pFUkU75KSuEQ/E2luT61Q3ta9kOWm9+0zvi7OMcbdekJT7mzcVnh93R1c
13ZhQCLbh9A7si8jKFtaMWevjayrvqQABEcTN9N4Hoxcyg6l4neZtRDk75OMYcqm
DQIDAQAB
-----END PUBLIC KEY-----

You can dump the information in the public key specifying the flag -pubin

$ openssl rsa -in public.pem -noout -text -pubin
Public-Key: (2048 bit)
Modulus:
    00:b2:f5:fd:3f:9f:09:17:11:2c:e4:2f:8b:f8:7e:
    d6:76:e1:52:58:be:44:3f:36:de:af:b0:b6:9b:de:
    24:96:b4:95:ea:ad:1b:01:ca:d8:42:71:b0:14:e9:
    6f:79:38:6c:63:6d:34:85:16:da:74:a6:8a:8c:70:
    fb:a8:82:87:0c:47:b4:21:8d:8f:49:18:6d:df:72:
    72:7b:9d:80:c2:19:11:c3:e3:37:c6:e4:07:ff:b4:
    7c:2f:27:67:b0:d1:64:d8:a1:e9:af:95:f6:48:1b:
    f8:d9:ed:fb:2e:39:04:b2:52:92:68:c4:60:25:6f:
    af:d0:a6:77:d2:98:98:f1:0b:1d:15:12:8a:69:58:
    39:fc:08:ed:d5:84:e8:33:56:15:b1:d1:d7:27:7b:
    e6:5c:53:2d:ca:92:dd:c7:05:03:74:86:8b:11:7e:
    a9:15:49:14:ef:92:92:b8:44:3f:13:69:6e:4f:ad:
    50:de:d6:bd:90:e5:a6:f7:ed:33:be:2e:ce:31:c6:
    dd:7a:42:53:ee:6c:dc:56:78:7d:dd:1d:5c:d7:76:
    61:40:22:db:87:d0:3b:b2:2f:23:28:5b:5a:31:67:
    af:8d:ac:ab:be:a4:00:04:47:13:37:d3:78:1e:8c:
    5c:ca:0e:a5:e2:77:99:b5:10:e4:ef:93:8c:61:ca:
    a6:0d
Exponent: 65537 (0x10001)

Generating key pairs with OpenSSL¶

If you want to generate an RSA private key you can do it with OpenSSL

$ openssl genpkey -algorithm RSA -out private.pem \
  -pkeyopt rsa_keygen_bits:2048
......................................................................+++
..........+++

Since OpenSSL is a collection of modules, we specify genpkey to generate a private key. The option -algorithm specifies which algorithm we want to use to generate the key (RSA in this case), -out specifies the name of the output file, and -pkeyopt allows us to set the value for specific key options, in this case the length of the RSA key in bits.

If you want an encrypted key you can generate one specifying the cipher (for example -aes-256-cbc)

$ openssl genpkey -algorithm RSA -out private-enc.pem \
  -aes-256-cbc -pkeyopt rsa_keygen_bits:2048
...........................+++
..........+++
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:

You can see the list of supported ciphers with openssl list-cipher-algorithms (the syntax is openssl list -cipher-algorithms in OpenSSL 1.1.0 and later). In both cases you can then extract the public key with the method shown previously. OpenSSL private keys are created using PKCS #8, so unencrypted keys will be in the form

-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----

and encrypted ones in the form

-----BEGIN ENCRYPTED PRIVATE KEY-----
[...]
-----END ENCRYPTED PRIVATE KEY-----

Generating key pairs with OpenSSH¶

Another tool that you can use to generate key pairs is ssh-keygen, which is a tool included in the SSH suite that is specifically used to create and manage SSH keys. As SSH keys are standard asymmetrical keys we can use the tool to create keys for other purposes.

To create a key pair just run

ssh-keygen -m PEM -t rsa -b 2048 -f key

The option -m specifies the key format. By default OpenSSH uses its own format specified in RFC 4716 ("The Secure Shell (SSH) Public Key File Format").

The option -t specifies the key generation algorithm (RSA in this case), while the option -b specifies the length of the key in bits.

The option -f sets the name of the output file. If not present, ssh-keygen will ask for the name of the file, offering to save it to the default file ~/.ssh/id_rsa. The tool always asks for a password to encrypt the key, but you are allowed to enter an empty one to skip the encryption.

This tool creates two files. One is the private key file, named as requested, and the second is the public key file, named like the private key one but with the extension .pub.

The value PEM specified for the option -m writes the private key using the PKCS #1 format, so the key will be in the form

-----BEGIN RSA PRIVATE KEY-----
[...]
-----END RSA PRIVATE KEY-----

Using -m PKCS8 instead uses PKCS #8 and the key will be in the form

-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----

The OpenSSH public key format

The public key saved by ssh-keygen is written in the so-called SSH-format, which is not a standard in the cryptography world. Its structure is ALGORITHM KEY COMMENT, where the KEY part of the format is encoded with Base64.

For example

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCy9f0/nwkXESzkL4v4ftZ24VJYvkQ/Nt6vsLab3iSWtJXqrRsBythCcbAU6W9
5OGxjbTSFFtp0poqMcPuogocMR7QhjY9JGG3fcnJ7nYDCGRHD4zfG5Af/tHwvJ2ew0WTYoemvlfZIG/jZ7fsuOQSyUpJoxGAlb6
/QpnfSmJjxCx0VEoppWDn8CO3VhOgzVhWx0dcne+ZcUy3Kkt3HBQN0hosRfqkVSRTvkpK4RD8TaW5PrVDe1r2Q5ab37TO+Ls4xx
t16QlPubNxWeH3dHVzXdmFAItuH0DuyLyMoW1oxZ6+NrKu+pAAERxM303gejFzKDqXid5m1EOTvk4xhyqYN user@host

To manually decode the central part of the key you can use base64 and hexdump

$ cat key.pub | cut -d " " -f2 | \
  base64 -d | hexdump -ve '/1 "%02x "' -e '2/8 "\n"'
00 00 00 07 73 73 68 2d 72 73 61 00 00 00 03 01
00 01 00 00 01 01 00 b2 f5 fd 3f 9f 09 17 11 2c
e4 2f 8b f8 7e d6 76 e1 52 58 be 44 3f 36 de af
b0 b6 9b de 24 96 b4 95 ea ad 1b 01 ca d8 42 71
b0 14 e9 6f 79 38 6c 63 6d 34 85 16 da 74 a6 8a
8c 70 fb a8 82 87 0c 47 b4 21 8d 8f 49 18 6d df
72 72 7b 9d 80 c2 19 11 c3 e3 37 c6 e4 07 ff b4
7c 2f 27 67 b0 d1 64 d8 a1 e9 af 95 f6 48 1b f8
d9 ed fb 2e 39 04 b2 52 92 68 c4 60 25 6f af d0
a6 77 d2 98 98 f1 0b 1d 15 12 8a 69 58 39 fc 08
ed d5 84 e8 33 56 15 b1 d1 d7 27 7b e6 5c 53 2d
ca 92 dd c7 05 03 74 86 8b 11 7e a9 15 49 14 ef
92 92 b8 44 3f 13 69 6e 4f ad 50 de d6 bd 90 e5
a6 f7 ed 33 be 2e ce 31 c6 dd 7a 42 53 ee 6c dc
56 78 7d dd 1d 5c d7 76 61 40 22 db 87 d0 3b b2
2f 23 28 5b 5a 31 67 af 8d ac ab be a4 00 04 47
13 37 d3 78 1e 8c 5c ca 0e a5 e2 77 99 b5 10 e4
ef 93 8c 61 ca a6 0d

The structure of this binary file is pretty simple, and is described in two different RFCs. RFC 4253 ("SSH Transport Layer Protocol") states in section 6.6 that

The "ssh-rsa" key format has the following specific encoding:

      string    "ssh-rsa"
      mpint     e
      mpint     n

while the definition of the types string and mpint can be found in RFC 4251 ("SSH Protocol Architecture"), section 5

string

    [...] They are stored as a uint32 containing its length
    (number of bytes that follow) and zero (= empty string) or more
    bytes that are the value of the string.  Terminating null
    characters are not used. [...]

mpint

    Represents multiple precision integers in two's complement format,
    stored as a string, 8 bits per byte, MSB first. [...]

This means that the above sequence of bytes is interpreted as 4 bytes of length (32 bits of the type uint32) followed by that number of bytes of content.

(4 bytes)   00 00 00 07          = 7
(7 bytes)   73 73 68 2d 72 73 61 = "ssh-rsa" (US-ASCII)
(4 bytes)   00 00 00 03          = 3
(3 bytes)   01 00 01             = 65537 (a common value for the RSA exponent)
(4 bytes)   00 00 01 01          = 257
(257 bytes) 00 b2 .. ca a6 0d    = The key modulus

Please note that since we created a key of 2048 bits we should have a modulus of 256 bytes. Instead this key uses 257 bytes prefixing the number with a byte 00 to avoid it being interpreted as negative (two's complement format).

The structure shown above is the reason why all the RSA public SSH keys start with the same 12 characters AAAAB3NzaC1y. This string, decoded from Base64, gives the initial 9 bytes 00 00 00 07 73 73 68 2d 72 (Base64 characters are not a one-to-one mapping of the source bytes). If the exponent is the standard 65537 the key starts with AAAAB3NzaC1yc2EAAAADAQAB, which decoded gives the first 18 bytes 00 00 00 07 73 73 68 2d 72 73 61 00 00 00 03 01 00 01.
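The length-prefixed layout described above can be decoded with a few lines of Python. This is a minimal sketch, where read_fields is a hypothetical helper of mine, and it assumes a well-formed input. To keep it short I decode only the common 24-character prefix mentioned above, which contains the algorithm name and the exponent, but the same loop works on the complete KEY part of a .pub file.

```python
import base64
import struct

def read_fields(blob):
    """Split a byte string into its uint32-length-prefixed fields."""
    fields = []
    offset = 0
    while offset < len(blob):
        # 4 bytes of length, big-endian (the uint32 of RFC 4251)
        (length,) = struct.unpack(">I", blob[offset:offset + 4])
        offset += 4
        fields.append(blob[offset:offset + length])
        offset += length
    return fields

prefix = base64.b64decode("AAAAB3NzaC1yc2EAAAADAQAB")
algorithm, exponent = read_fields(prefix)

print(algorithm.decode("ascii"))                       # ssh-rsa
# mpint values are two's complement, hence signed=True
print(int.from_bytes(exponent, "big", signed=True))    # 65537
```

The signed conversion also shows why the modulus needs the leading byte 00: the single byte b2 read as two's complement is a negative number, while 00 b2 is not.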

Converting between PEM and OpenSSH format¶

We often need to convert files created with one tool to a different format, so this is a list of the most common conversions you might need. I prefer to consider the key format instead of the source tool, but I give a short description of the reason why you might want to perform the conversion.

PEM/PKCS#1 to PEM/PKCS#8

This is useful to convert OpenSSH private keys to a newer format.

openssl pkcs8 -topk8 -inform PEM -outform PEM -in pkcs1.pem -out pkcs8.pem

OpenSSH public to PEM/PKCS#8

To convert public OpenSSH keys into a PEM format using PKCS #8 (prints to stdout)

ssh-keygen -e -f public.pub -m PKCS8

This is easy to remember because -e stands for export. Note that you can also use -m PEM to convert the key into a PEM format that uses PKCS #1.

PEM/PKCS#8 to OpenSSH public

If you need to use in SSH a key pair created with another system

ssh-keygen -i -f public.pem -m PKCS8

This is easy to remember because -i stands for import. As happened when exporting the key, you can import a PEM/PKCS #1 key using -m PEM.

Reading RSA keys in Python¶

In Python you can use the package pycrypto to access a PEM file containing an RSA key with the function RSA.importKey. Now you can hopefully understand the documentation that says

externKey (string) - The RSA key to import, encoded as a string.

An RSA public key can be in any of the following formats:
    * X.509 subjectPublicKeyInfo DER SEQUENCE (binary or PEM encoding)
    * PKCS#1 RSAPublicKey DER SEQUENCE (binary or PEM encoding)
    * OpenSSH (textual public key only)

An RSA private key can be in any of the following formats:
    * PKCS#1 RSAPrivateKey DER SEQUENCE (binary or PEM encoding)
    * PKCS#8 PrivateKeyInfo DER SEQUENCE (binary or PEM encoding)
    * OpenSSH (textual public key only)

For details about the PEM encoding, see RFC1421/RFC1423.

In case of PEM encoding, the private key can be encrypted with DES or 3TDES
according to a certain pass phrase. Only OpenSSL-compatible pass phrases are
supported.

In practice what you can do with a file private.pem is

from Crypto.PublicKey import RSA

# Read the whole PEM file and let the library detect the key format
with open('private.pem', 'r') as f:
    key = RSA.importKey(f.read())

and the variable key will contain an instance of _RSAobj (not a very pythonic name, to be honest). This instance contains the RSA parameters as attributes as stated in the documentation

modulus = key.n
public_exponent = key.e
private_exponent = key.d
first_prime_number = key.p
second_prime_number = key.q
q_inv_crt = key.u

Final words¶

I keep finding on StackOverflow (and on other boards) messages from users who are confused by RSA keys, by the output of the various tools, and by the subtle but important differences between the formats, so I hope this post helped you to get a better understanding of the matter.

Resources¶

The Wikipedia article on RSA
OpenSSL documentation: asn1parse, rsa, genpkey
The Base64 encoding
The Abstract Syntax Notation One ASN.1 interface description language
RFC 4251 - The Secure Shell (SSH) Protocol Architecture
RFC 4253 - The Secure Shell (SSH) Transport Layer Protocol
RFC 4716 - The Secure Shell (SSH) Public Key File Format
RFC 5208 - Public-Key Cryptography Standards (PKCS) #8: Private-Key Information Syntax Specification Version 1.2
RFC 5958 - Asymmetric Key Packages
RFC 7468 - Textual Encodings of PKIX, PKCS, and CMS Structures
RFC 8017 - PKCS #1: RSA Cryptography Specifications Version 2.2
PyCrypto - The Python Cryptography Toolkit

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

Introduction to hashing

2018-04-06T11:30:00+01:00

Have you ever used dictionaries or maps in your language of choice, or have you ever met a mysterious MD5 code while downloading a file from a server? Maybe you are a programmer, and while using Git to manage your code you ended up dealing with strange numbers called SHA1, and surely you have heard at least a couple of times the term cache, which probably needed to be emptied in your browser.

What do all these concepts have in common?

In this post I want to introduce you to the concept of hashing, which is one of the basic topics a good programmer should know. Hashes are such an important topic in computer science that lacking knowledge in this field means being confused about a wide range of other subjects, like cryptography and security. Data structures, Bitcoin and blockchains, HTTPS, all these topics have hashing as one of their building blocks. As you can see, it is worth mastering the concept.

This will obviously be only a humble introduction to the subject matter, as the whole concept is too broad for a single post. You can start a serious study of this important part of computer science by reading the Wikipedia articles linked at the bottom of the page and by reading a good book (or taking a course) in either cryptography or data structures.

A practical example¶

Let me give you a concrete example of hashing before we analyse the matter in depth.

I want to download a big file from the Internet, for example a Linux distribution. I already downloaded it in the past, but I'm not sure if the version I have is the same one available for download now. The file has been renamed, so the original name has been lost. I might obviously just download it again, install it and manually check the version.

I wonder if there is a simpler solution, one that can possibly be automated. Downloading an ISO file from the Internet is nowadays very cheap, but manually checking the version isn't, at least in terms of time. I might also download it and compare all the files contained in both the new and the old ISO images, but then again, this process is not very fast.

The best solution would be for the server to provide a sort of label that depends only on the data in a mathematical and deterministic way. I might then run the same algorithm on the file I already downloaded and get a label that can be easily compared with the one provided by the server. If the process of computing the label is fast enough this might be the perfect solution.

A typical algorithm used for this purpose is MD5, and the label computed by the server could be something like ef67d799b71de37423202c587662c87f. Computing the MD5 of a 600 MB file takes less than a couple of seconds on a modern computer, so I can check if the file I own is the same the server provides in a very short time.

You can test MD5 on your own using the md5sum program that comes with all Linux distributions or other Unix-based systems. Open a terminal and run the following command

echo "This is a simple input string" | md5sum

and the result will be 8a7cc3b47880b5ef880ac6ef30785a1a, independently of your operating system.
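The same check can be done from Python with the module hashlib of the standard library. The following is a minimal sketch, where md5_of_file is a hypothetical helper of mine that reads the file in chunks, so that a big ISO image doesn't have to fit in memory. Mind that echo appends a newline to the string, so the newline has to be part of the input if we want to hash the same data that md5sum received.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# echo appends a newline to the string, so it is part of the hashed data
print(hashlib.md5(b"This is a simple input string\n").hexdigest())
```

Comparing the value returned by md5_of_file for the local file with the label published by the server is enough to know whether the two files are the same.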

MD5 is one of many hash functions that have been invented to deal with problems like the one I exemplified. Recently, I had the need to synchronise daily two AWS S3 buckets containing more than 60 gigabytes of files. Without hash functions it would be impossible to quickly identify the files that need to be copied.

The rest of the post is dedicated to the exploration of such an important and intriguing part of contemporary technology.

Hash functions¶

Let's start from the formal definition of hash function:

A hash function is any function that can be used to map data of arbitrary size to data of fixed size

This description may sound intimidating at first, but it is actually pretty simple. Let's consider a dictionary where you want to look up a word that you don't know, like for example "quagmire". What you do is to jump directly to a section labelled "Q" in the dictionary, then quickly identify the part of the section where words that start with "QU" are, and promptly find the word. Congratulations, you just used a hash function!

Getting the first letter of the word is, as a matter of fact, a function (an operation) that maps (connects) data of arbitrary size (words) to data of fixed size (a single letter of the alphabet). Using this method we can connect any word (also invented ones!) to a letter of the alphabet.

Before we move on, I want to stress one aspect that is clear from the previous example. Through a hash function we can connect a set of potentially infinite values (all the words that we can create) with a finite set (the letters of an alphabet). This is the most important concept we have to keep in mind when dealing with hashing.
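The dictionary example is small enough to be written as code. This is just a sketch, where first_letter_hash is a name of mine, and it assumes that the word is not empty.

```python
def first_letter_hash(word):
    """Map a word of any length to a single uppercase letter."""
    return word[0].upper()

for word in ["quagmire", "quality", "zebra", "floccinaucinihilipilification"]:
    print(word, "->", first_letter_hash(word))
```

Whatever the length of the input, the output is always one character, and words that start with the same letter end up in the same section.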

Uniqueness¶

The result of a hash function is not unique, which means that two different inputs may give the same output. This is pretty easy to understand in the dictionary example, where multiple words can give as a result the same letter, as multiple words begin with that letter.

It is also evident that hash functions cannot produce unique results by design. The goal of a hash function is to map an infinite set to a finite set, so it is obvious that multiple elements of the infinite set will map to the same element in the finite one.

Let me give you a very simple example. Let's create a hash function that returns the first 32 bits (4 bytes) of the input, padding them with zeros if the input is shorter than 32 bits. I will use the ASCII standard to convert strings of characters into hexadecimal numbers, so every letter is represented by 1 byte.

"This is a string"
  |
  +-----> 54 68 69 73 20 69 73 20 61 20 73 74 72 69 6e 67
            |
            +---> 54 68 69 73

"One"
  |
  +-----> 4f 6e 65
            |
            +---> 4f 6e 65 00

"The quick fox"
  |
  +-----> 54 68 65 20 71 75 69 63 6b 20 66 6f 78
            |
            +---> 54 68 65 20   <== COLLISION

"The lazy dog"
  |
  +-----> 54 68 65 20 6c 61 7a 79 20 64 6f 67
            |
            +---> 54 68 65 20   <== COLLISION

As you can see we have multiple input strings with different lengths, and while the first three produce different output values the last one produces the same value as the third one. This is straightforward, as the two strings start with the same four characters and our hash function considers only those to compute its result.
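The hash function used in this example can be sketched in Python in a couple of lines. The name first_32_bits is mine, and the call to hex with a separator requires Python 3.8 or later.

```python
def first_32_bits(text):
    """Return the first 4 bytes of the ASCII encoding, zero-padded on the right."""
    return text.encode("ascii")[:4].ljust(4, b"\x00")

for text in ["This is a string", "One", "The quick fox", "The lazy dog"]:
    print(f"{text!r:20} -> {first_32_bits(text).hex(' ')}")
```

The last two lines of the output are both 54 68 65 20, which is the collision shown in the diagram.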

Such an event is called a collision and it is a direct effect of the non-uniqueness of hash values. It will always happen, with hash functions, that different values produce the same output, and it is important to understand that this is not because our hash function is trivial, but because this is in the very nature of hash functions, for strict mathematical reasons.

Collisions are not intrinsically bad, but we have to be aware they can happen when we develop algorithms that use hash functions. If we are writing a dictionary for a human language where 80% of the words start with "A" it is pointless to use the first letter to partition the book because the first section would be almost as big as the whole tome. This may seem too imaginative an example, but when we manage data structures problems such as this arise more often than not.

In this last example avoiding collisions is easy. We just need to increase the number of characters that we consider until there are no clashes among the outputs. This is a very empirical way to solve the problem, though, and it's possible only because we are dealing with a narrow set of inputs and a very simple hash function. In the next section we will discuss how more complicated hash functions deal with this problem.
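For the four strings of the example, the empirical procedure of considering more and more characters can be sketched as follows. Both prefix_hash and smallest_collision_free_size are hypothetical names of mine, and the approach clearly works only because the set of inputs is known in advance.

```python
def prefix_hash(text, size):
    """Return the first `size` bytes of the ASCII encoding, zero-padded."""
    return text.encode("ascii")[:size].ljust(size, b"\x00")

def smallest_collision_free_size(inputs):
    """Grow the prefix until every input maps to a distinct value."""
    longest = max(len(text) for text in inputs)
    for size in range(1, longest + 1):
        if len({prefix_hash(text, size) for text in inputs}) == len(inputs):
            return size
    return None  # no prefix separates the inputs (for example, duplicates)

inputs = ["This is a string", "One", "The quick fox", "The lazy dog"]
print(smallest_collision_free_size(inputs))  # 5
```

The result is 5 because "The quick fox" and "The lazy dog" share the first four characters and differ at the fifth.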

Digital hash functions¶

As we saw the definition of hash functions involves functions, which are mappings. In other words we just need to describe a process that couples objects from the infinite source set of inputs to the finite destination set of outputs. Taking the first letter of a word is such a process, but other examples may be grouping people according to the colour of the eyes or cataloguing films by production year. Among the various processes that we can use a big role is played by digital processes, that is functions that involve some operation on binary numbers.

When we digitalise something we represent it with a sequence of bits, and once this is done there is no real difference between strings, images, videos, sounds, programs. Everything in a computer is ultimately a sequence of bits, and those sequences can be sliced and changed with pure numerical functions such as additions and multiplications.

Cryptographic hash functions¶

Hash functions play a decisive role in security and in cryptography, and can be found in algorithms that provide authentication, i.e. secure ways to demonstrate the authenticity of some data. While the actual cryptographic techniques are not in the scope of this article, it is important to know that hash functions used for cryptographic purposes are not different, in principle, from hash functions used for other tasks that do not require any degree of security. Cryptographic hash functions, however, must have some specific properties that give the function a certain degree of "robustness". Being able to find the input of a hash function given the output, for example, would be catastrophic for some security algorithms that rely on the infeasibility of such an operation.

"Good" hash functions¶

The definition of hash function is pretty inclusive as the only required property is that of returning a fixed-length output. Hash functions used in practice may however have other properties. Such properties may be desirable or mandatory depending on the application, so functions that are extremely good for cryptography may be a poor choice for data structures like dictionaries.

Let me briefly describe some of the most important properties that you should be aware of.

Determinism

Given the algorithm (with its parameters) and given the input data, the hash must always be the same. The result of the hashing function depends only on the data itself, and not on other external factors like for example time or computer system.

Pay attention to the fact that this definition considers the algorithm and its parameters. This means that we can include external factors in the computation, but they have to be fixed for the whole life of the result itself.

Let's consider a system that uses a hash to speed up searches in some arrays. For several reasons the hashing algorithm employs an initial random seed that is derived from the boot time. As long as the system is running (i.e. it is not rebooted), the algorithm is consistent, and we may consider the random seed as a constant parameter. We may also persist the hashes on a storage, because when we load them they are still perfectly valid. As soon as the system is rebooted, however, the whole set of hashes created during the previous execution becomes invalid and meaningless. This is not the case, though, if the hashing function bases its computation on the actual data only.
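The scenario can be reproduced with a keyed hash function. The following is a minimal sketch that uses BLAKE2 from the module hashlib, a choice of mine that has nothing to do with any specific system, and where seeded_hash and the two seeds are hypothetical names and values.

```python
import hashlib

def seeded_hash(data, seed):
    """Hash data with a seed that acts as a fixed parameter of the algorithm."""
    return hashlib.blake2b(data, key=seed, digest_size=8).hexdigest()

boot_seed = b"seed-from-first-boot"
after_reboot_seed = b"seed-from-second-boot"

# Same data and same seed: the hash is stable
print(seeded_hash(b"some data", boot_seed) == seeded_hash(b"some data", boot_seed))
# Same data but a different seed: hashes stored before the reboot are useless
print(seeded_hash(b"some data", boot_seed) == seeded_hash(b"some data", after_reboot_seed))
```

Python itself behaves in a similar way: the built-in function hash salts the hashes of strings with a random value chosen when the interpreter starts, unless the environment variable PYTHONHASHSEED fixes it, so those values should never be persisted.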

Diffusion

Changing one single bit of the source data should result in a complete change of the hash value. Compare for example the MD5 hash values of two similar strings

The quick brown fox jumps over the lazy dog
  |
  +--> 37c4b87edffc5d198ff5a185cee7ee09

The quick brown fox jumps over the lazy cog
  |
  +--> 15546a0bcace46fd5e12ec29adca5e70

As you can see when a single input byte is different (the letter d in dog becomes a c), the whole result changes.

This implies that every part of the output is computed considering all the bits of the input. A function that returns the first n bits of the input does not have a good diffusion, as two different strings may return exactly the same hash if they have the same first n bits (see the example given above when I spoke about uniqueness). This property is important for cryptographic hash functions.
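To get a feeling for diffusion we can count how many bits change in the input and in the output. This is a minimal sketch, where differing_bits is a hypothetical helper of mine. The strings are hashed here without a trailing newline, so the digests are not necessarily the ones printed above.

```python
import hashlib

def differing_bits(a, b):
    """Count the bit positions where two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

dog = hashlib.md5(b"The quick brown fox jumps over the lazy dog").digest()
cog = hashlib.md5(b"The quick brown fox jumps over the lazy cog").digest()

print(differing_bits(b"dog", b"cog"), "input bits differ")
print(differing_bits(dog, cog), "of", len(dog) * 8, "output bits differ")
```

The letters d and c differ by 3 bits only, while with a function that has a good diffusion we expect roughly half of the 128 output bits to change.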

Minimal change (continuity)

An interesting property of some hash functions is that similar input values map to similar hash values. The exact definition of "similar" may vary, but in general we might associate it with the number of changes from the first output to the second. This behaviour is handy in some searching algorithms, where it is important that similar objects are stored near each other.

Note that this property is somewhat the opposite of diffusion, thus demonstrating that not all these properties might be found in a single hash function.

Uniformity

A hash function has a given finite number of possible outputs, because the output has a finite length. When a hash function is uniform, producing the output for each possible input produces a uniform distribution of outputs, that is there is no output value that is used more often than others. When designing data structures this is often the desirable behaviour, since it leads to a uniform use of resources, for example memory, leading to a uniform behaviour of other algorithms that work on the same structure, like search.

Uniformity is obviously linked to the number of collisions produced by a hash function, and a perfectly uniform hash function will have the same number of collisions for each output value. Increasing the number of possible output values, thus, results in a uniform reduction of collisions.
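A quick experiment shows the difference between a function that spreads the inputs evenly and one that doesn't. In this sketch the 10,000 keys, the 16 buckets, and the helper bucket are all assumptions of mine, chosen only to have something to count.

```python
import hashlib
from collections import Counter

def bucket(key, buckets):
    """Assign a key to one of `buckets` slots using its MD5 digest."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % buckets

keys = [f"key-{i}" for i in range(10_000)]
by_md5 = Counter(bucket(key, 16) for key in keys)
by_first_char = Counter(key[0] for key in keys)

print("MD5 buckets: min", min(by_md5.values()), "max", max(by_md5.values()))
print("first letter:", dict(by_first_char))
```

With MD5 every bucket receives a number of keys close to 10,000 / 16 = 625, while the first letter sends all the keys to the same bucket, as they all start with k.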

Non-invertible

Inverting a function means to create a function that returns the original input given the output. For example multiplication by 2 is an invertible function, as given the result we may easily divide by 2 and retrieve the original input.

With non-injective functions the only caveat is that there are multiple inputs that return the same output, but this doesn't prevent the creation of an inverse function. For example, 3 squared gives 9 and since the inverse of the square function is the square root, we can apply it to the result and retrieve the possible inputs, that is +3 and -3.

With non-invertible functions there is no simple way to find the input given the output. Mathematically we speak of one-way functions, as computing the inverse is either impossible or infeasible. Mind that "infeasible" has a well-defined meaning in mathematics, but I will not go deeper into it in this article. It will be sufficient to consider it as "too hard to compute in a reasonable time with the current state of technology". Cryptographic hash functions must be non-invertible.
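When a function cannot be inverted with a formula, the only way back is to guess inputs and check the result. The following sketch, where find_preimage is a hypothetical name of mine, does exactly that, and it succeeds only because I assume that the input is made of three lowercase letters.

```python
import hashlib
from itertools import product
from string import ascii_lowercase

def find_preimage(target_hex, length):
    """Try every lowercase string of the given length until one matches."""
    for letters in product(ascii_lowercase, repeat=length):
        candidate = "".join(letters)
        if hashlib.md5(candidate.encode("ascii")).hexdigest() == target_hex:
            return candidate
    return None

target = hashlib.md5(b"cat").hexdigest()
print(find_preimage(target, 3))  # at most 26**3 = 17576 candidates
```

The number of candidates grows exponentially with the length of the input, which is what makes this strategy hopeless when the set of possible inputs is large.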

Collision-resistant

A hash function is said to be collision-resistant when it is hard to find two different inputs that produce the same hash value. Mind that the definition of "hard" here is the same as that of "infeasible" in the previous section. This property is very important in cryptography, where collisions can be exploited to crack a cipher.
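To see what happens when a function is not collision-resistant we can weaken one on purpose. This sketch keeps only the first 2 bytes of MD5, which is an assumption made for the sake of the demonstration, and looks for two inputs with the same value. Both short_hash and find_collision are hypothetical names of mine.

```python
import hashlib

def short_hash(data, size=2):
    """First `size` bytes of MD5: a deliberately weak hash for the demo."""
    return hashlib.md5(data).digest()[:size]

def find_collision(size=2):
    """Hash the strings "0", "1", "2", ... until two of them collide."""
    seen = {}
    counter = 0
    while True:
        data = str(counter).encode("ascii")
        value = short_hash(data, size)
        if value in seen:
            return seen[value], data
        seen[value] = data
        counter += 1

first, second = find_collision()
print(first, second, short_hash(first).hex())
```

With only 65536 possible outputs the search has to end within 65537 attempts, and in practice it ends much earlier. The same search on the complete output of a collision-resistant function would not end in any reasonable time.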

Theoretical and practical inputs¶

It is important to understand that the analysis of a hash function can be made considering either theoretical or practical inputs. Theoretical inputs are all the possible inputs, like "all the possible strings", while a set of practical inputs might be "the names of a group of people". The latter might be very large but it is not infinite.

Obviously, a hash function that provides interesting properties when dealing with theoretical inputs will show the same properties when applied to practical inputs, but often such functions are complex and slow. Not to mention that it is very difficult to create them.

Let me show you an example. As we saw above, a hash function that returns the first letter of a string is not a very good one. It lacks the diffusion property, for instance, and its uniformity is questionable, as all the strings that begin with the same letter will produce the same hash, leading to a large number of collisions. This is bad for data structures, so such a function is in theory not optimal.

However, if we are working on a set of strings like

A poor workman blames his tool
Barking dogs seldom bite
Common sense ain't common
Doctors make the worst patients
...
You can't teach an old dog new tricks

where it is known (or evident) that each string begins with a different letter, suddenly our hash function becomes a perfect choice to build a searchable data structure, because given this input set there are no collisions. So, an analysis of the practical inputs is always paramount when we consider hash functions, as theoretically poor functions may perform very well on specific sets of inputs.
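As a sketch of this idea, we can build a lookup table that uses the first letter as key. I use only the five proverbs spelled out above, as the rest of the list is not given, and the assertion in the loop is there to detect a collision should one ever appear.

```python
proverbs = [
    "A poor workman blames his tool",
    "Barking dogs seldom bite",
    "Common sense ain't common",
    "Doctors make the worst patients",
    "You can't teach an old dog new tricks",
]

# The first letter is the hash; each proverb gets its own slot
table = {}
for proverb in proverbs:
    key = proverb[0]
    assert key not in table, f"collision on {key!r}"
    table[key] = proverb

print(table["D"])  # Doctors make the worst patients
```

Finding a proverb requires a single lookup, which is the best we can ask of a searchable data structure.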

A very good example of such an analysis can be found in the source code of the Python language. The implementation of dictionaries contains an in-depth discussion of the choices made when implementing the hashing mechanism behind those structures. You can find it here. If you never approached data structures I recommend starting from a simpler explanation, however, as you might be intimidated by that discussion. You will find a good basic tutorial on hash tables in any data structures course or textbook.

Final words¶

As I said this is just a very quick and humble introduction to hashing. I think you cannot call yourself a programmer nowadays without knowing something about hashing, and what I summarized in this post is enough to understand hash uses like Bitcoin or SSL. If you want to study the topic in depth, however, I recommend taking a course or reading a book on data structures.

Resources¶

Hash function on Wikipedia
Cryptographic hash function on Wikipedia
A lesson on hash functions by Prof. Christof Paar
MIT Professor Srinivas Devadas on Cryptographic hash functions
Wiley & Sons publishes a book on Data Structures and Algorithms in Python
O'Reilly publishes a book on Mastering Algorithms with C: Useful Techniques from Sorting to Encryption

Updates¶

2018-04-28 gixslayer and SevenGlass discussed on reddit the right command line for the md5sum example on Windows. See the original comments.

Feedback¶

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.
