Reproducibility

Reproducibility can be understood through symmetries in fractals

Reproducibility is a very interesting concept related to both Physics and Software, both of which I practice daily. I thought it would be nice to share my thoughts about why it's important and why we should seek it. At first glance, reproducibility already looks important without adding any new ideas to it, because in software, increasing reproducibility lets companies increase their revenue. But it goes a little deeper than that, and even with such a motivation, this concept is not embraced as much as it should be!


Wallpaper above: reference1

Last week, because the /boot partition on my NixOS machine was too small and it was bothering me, I had to repartition it (now it's 2GB) and install NixOS again. By the way, you can find the details in this gist. This is a good excuse for me to write this post and emphasise NixOS's biggest promise, reproducibility. I can guarantee you'll learn something here even though it looks like a rant, so bear with me.

A quick summary of what you'll see here is, of course, some rants here and there :D But mostly, my goal is to shift the paradigm of why life exists, despite the fact that it looks like it violates the Second law of thermodynamics2. Currently the best explanation I'm aware of regarding this question is Evolution and the Second Law3 by Sean Carroll, but it's not satisfying IMHO! After that, I'll jump to software to make some points clear as well. As you'll see, I like to answer big questions from first principles, so it may be boring for you if you're not curious enough.

Before starting, I have to mention that sharing is caring, so I share my ideas freely here, but I really need the credit for my ideas, so I can exchange this credit for money in the future to build my anti-gravity engine! :P So please share them with a reference!

What's reproducibility

To agree on a definition, let's start with Wikipedia's.4

Reproducibility, also known as replicability and repeatability, is a major principle underpinning the scientific method. For the findings of a study to be reproducible means that results obtained by an experiment or an observational study or in a statistical analysis of a data set should be achieved again with a high degree of reliability when the study is replicated. There are different kinds of replication[1] but typically replication studies involve different researchers using the same methodology. Only after one or several such successful replications should a result be recognized as scientific knowledge.

With a narrower scope, reproducibility has been introduced in computational sciences: Any results should be documented by making all data and code available in such a way that the computations can be executed again with identical results.

Let's explain these definitions in science and software contexts separately.

Science

So it all starts with the definition of Science, and the fact that we're living in a universe in which patterns repeat a lot leads us to this concept. I called this the Fractal Hypothesis in one of my previous posts, when I tried to define Science itself5. To justify its name and answer why it's related to fractals6,7, we need to take a look at the definition of a fractal. It goes like this in Wikipedia.

In mathematics, a fractal is a geometric shape containing detailed structure at arbitrarily small scales, usually having a fractal dimension strictly exceeding the topological dimension.

In some sense, the topological dimension8 is the ordinary dimension, which is a non-negative integer, but the fractal dimension can be a real number, and it's defined like this:

In mathematics, Hausdorff dimension is a measure of roughness, or more specifically, fractal dimension, that was first introduced in 1918 by mathematician Felix Hausdorff.[2] For instance, the Hausdorff dimension of a single point is zero, of a line segment is 1, of a square is 2, and of a cube is 3. That is, for sets of points that define a smooth shape or a shape that has a small number of corners—the shapes of traditional geometry and science—the Hausdorff dimension is an integer agreeing with the usual sense of dimension, also known as the topological dimension. However, formulas have also been developed that allow calculation of the dimension of other less simple objects, where, solely on the basis of their properties of scaling and self-similarity, one is led to the conclusion that particular objects—including fractals—have non-integer Hausdorff dimensions.

If you read more about fractal dimension, you'll notice that other definitions are available as well, but the Hausdorff dimension9 is the best one IMHO. The other definition we should take a look at is the concept of Recursion10. Again, based on Wikipedia:

Recursion (adjective: recursive) occurs when a thing is defined in terms of itself or of its type.

In the Hausdorff dimension's definition, self-similarity is what we call the repetition of a pattern in science, and the scaling is a recursion function. But it's not just scaling that can make fractals. For instance, in the Mandelbrot set fractal11 the recursion function is \( f_c(z)=z^2+c \). Therefore, my general claim is that whenever you work with a recursion function, you are actually working with a fractal, so you can calculate its dimension. This calculation is super easy for recursion functions that scale, like the Sierpiński triangle12 and the Koch snowflake13: you just need to find the number of replications the function creates and how much those replications are scaled down, then you have everything you need. In the case of the Mandelbrot set11 and the Newton fractal14 it's much harder, but I would say all the information needed to calculate their Hausdorff dimension is available in that recursion function.
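To make the "super easy" case concrete, here is a minimal sketch (my own toy example, not from the references) of the similarity dimension \( D = \log N / \log(1/s) \) for a recursion function that makes \( N \) copies, each scaled down by \( s \):

```python
import math

def similarity_dimension(replications: int, scale: float) -> float:
    """Similarity dimension of a strictly self-similar fractal: the recursion
    makes `replications` copies, each scaled down by `scale`."""
    return math.log(replications) / math.log(1 / scale)

# Sierpinski triangle: 3 copies, each half the size -> log 3 / log 2 ~ 1.585
print(similarity_dimension(3, 1 / 2))
# Koch curve: 4 copies, each a third the size -> log 4 / log 3 ~ 1.262
print(similarity_dimension(4, 1 / 3))
```

For these strictly self-similar shapes, this similarity dimension coincides with the Hausdorff dimension.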

So when the Fractal Hypothesis says patterns are repeated, someone can make a type for that repeated pattern, which is also called self-similarity, and then create a recursion function from one instance of that type to another instance, or to replicated instances, of that type. For example, in the throw-a-ball experiment, we can create a recursive function that receives the first throw as input and returns the second throw. This function can then receive the second throw and return the third throw, and so on. This can be extended to all experiments: because the repetition is guaranteed, a recursive function exists. In a simple setup of the throw-a-ball experiment, applying that recursive function is like counting the number of repetitions, but more complicated repetitions can be modeled with group theory15, which is the basis for the definition of symmetry in Physics16. However, notice that not all recursion functions can be studied by group theory15. For instance, in the Sierpiński triangle12 the recursion function halves the width and replicates the shape three times, while a group element cannot transform a pattern into three copies in one operation!
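As a toy illustration of the throw-a-ball recursion (a hypothetical sketch of my own, not a claim about any particular experiment):

```python
import math

def throw_range(speed: float, angle: float, g: float = 9.81) -> float:
    """Outcome of one throw: projectile range for a fixed speed and angle."""
    return speed ** 2 * math.sin(2 * angle) / g

def next_throw(setup: tuple[float, float]) -> tuple[float, float]:
    """The recursion function: maps the setup of one throw to the setup of the
    next, identical throw. Composing it n times just counts the repetitions."""
    return setup

setup = (10.0, math.pi / 4)
print(throw_range(*setup))              # first throw
print(throw_range(*next_throw(setup)))  # second throw: the same result, i.e. reproducible
```

More complicated repetitions replace this trivial counting with group operations, as in the space-time example that follows.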

For instance, regarding the Poincaré group17, which is the symmetry group of flat space-time, we can argue that someone can run an experiment in the vicinity of any point in space-time, rotating and moving around, to show that this experiment is repeatable. Hence, there is a recursion function that lets us jump, and rotate, between points in space-time, which is provided by the exponentiation, \(\exp \left(ia_{\mu }P^{\mu }\right)\exp \left({\frac {i}{2}}\omega_{\mu \nu }M^{\mu \nu }\right) \), of its Lie algebra18. This recursion function is enough to jump from one such pattern to its vicinity, which is so small that we couldn't measure their minimum distance yet, so we treat them as if they form a real number line. However, it cannot stay that way forever, just like what happened with atoms and molecules. We thought matter was continuous, but then we found atoms and molecules. The same will happen to the space-time continuum in the future. Notice that for each kind of atom or molecule, we also have a recursion function we can jump with. You may say they look continuous too, but nowadays, for instance in crystals, we work with them in the Lattice model19; they could be even more exciting in liquids and gases, but here we are! We still don't think of vacuum as a state of matter! Unrelated, but it shows our laziness!

But it's not just physics that is built upon recursion functions. The morphisms and functors in category theory20 are recursions. Mathematical induction21 in mathematical logic22 and arithmetic23 is a recursion, which means anything we build in math is a fractal, or part of a fractal. For instance, you can map mathematical logic onto the natural numbers, the same as in Gödel's incompleteness theorems24; then any logical operation, as a recursion function, transfers a set of numbers to another number, so at first glance you can see its fractal dimension is less than one. Here, I just assumed all maps preserve the fractal dimension!

It's not finished, but even up to here reproducibility looks pretty important, so I just wonder why people don't use it enough in the scientific context. In fact, we currently have the Replication crisis25, which means we have scientists who don't care about reproducibility! Told you it's underappreciated! Anyway! The paradigm shift is on its way, so buckle up!

Reductionism

According to Wikipedia, reductionism26 is

Reductionism is any of several related philosophical ideas regarding the associations between phenomena which can be described in terms of other simpler or more fundamental phenomena. It is also described as an intellectual and philosophical position that interprets a complex system as the sum of its parts.

Like any philosophical statement, one cannot reject it in its general form, but if someone sticks to the geometry of our universe and says they can build up every property of any large object from the properties of its smaller parts, then I have a thought experiment to reject it.

This experiment goes like this. Any theory of nature that we currently have uses the concept of a field27 to describe nature. For instance, in Quantum Mechanics28 we have quantum fields, and in General Relativity29 we have the metric. In General Relativity, all of space-time, including everything in it, galaxies and atoms, can be described by only one metric tensor, so every point of space-time has only one numerical tensor as a value.

However, in the Copenhagen interpretation30, we can use the tensor product31 to have multiple wave functions (quantum fields) describing reality, so it will not give us only one numerical tensor as a value at each point. That looks like an obstacle to concluding anything here, but by looking carefully we notice that in real calculations of atoms and molecules we only use one wave function, with some mental tricks to allow it, in Density functional theory (DFT)32. It's nice to mention that in the Resonance interpretation33 we only use the tensor product to separate the wave functions of different measurement devices for the same wave function. We just use the tensor product to gather different values of this function, to work with the conserved quantities, which are measured by devices at different locations. So in atoms and molecules, where we don't have different measurement devices for each electron, proton, and neutron, we don't need to gather them using a tensor product, the same as what we have in DFT. To give you more of a sense of it, for a single atom this function depends on \( n, l, m, s \), as we have for the Hydrogen atom34, plus \( p \) as the number of protons in the nucleus, plus \( N \) as the number of neutrons there. Therefore, no tensor product is needed to describe a single atom. Hence we can conclude that just one tensor of fields is enough to describe reality in a calculable manner.

Depending on the boundaries and the equation of motion, that single tensor can represent all the structures we observe. Notice that if we were using the tensor product to have separate fields for each of the electrons in an atom, like what we have in the Copenhagen interpretation30, we could define the boundaries of atoms properly; that's why we insisted on not using it. However, it's good to clarify that when we measure that field for a galaxy, the atoms and molecules in that field are just noise. We don't even need super large structures like galaxies to ignore the atoms-and-molecules structure in the field; it happens even for bacteria and cells. For their structure, atoms and molecules are just noise. Let's define structure in this model.

A structure is a pattern in the single tensor field with boundaries rigid enough to let it have one or multiple resonance frequencies for any kind of wave (field).

By the Resonance interpretation33, this one again ;), atoms and molecules are structures, the human body is a structure, a galaxy is a structure, etc. Anything that has boundaries rigid enough to reflect waves inside those boundaries will have a resonance frequency, so it will be a structure. These structures can be inside each other, but the small ones can be mere noise for the large one, and therefore have no effect on it. Even when the small ones are not noise to the bigger one, because resonance frequencies depend on the shape of the boundaries, they cannot be deduced from the small structures. So we have a property of the large structures that cannot be explained by the small ones. Therefore the rejection statement above is proved.

You may argue that the boundaries of a large structure are defined by the small ones, like the boundary of a book on a table, so something here doesn't add up! The answer is to think in the general sense: for instance, what's the boundary between the Earth and the Moon? Their boundary can be defined by the metric tensor, not by atoms and molecules. This is also applicable to the book-and-table case, since most of the space between atoms is empty.

But we were not just some fools making up rules. Why did we think we could deduce the properties of a large structure from its small building blocks? The reason is that those rigid boundaries are so useful, you may say powerful, that they distracted us from seeing the whole picture. It started with Newton, who taught us that we can decouple small parts of a big problem, solve them separately, and then attach the solutions to find the big solution. It's a very useful technique in engineering. Especially in software, we have Separation of concerns35, which is so useful for solving problems at large scale. Those rigid boundaries are interfaces, etc. However, it's not always an applicable problem-solving method, as proved above.

Anyway, we got distracted and here we are. The important part is to accept it and move forward to the paradigm shift!

Entropy

For an isolated system in equilibrium, the Entropy36 is

\[ S = k_{\mathrm {B} }\log \Omega \]

where \( \Omega \) is the number of microstates37, i.e., of possible microscopic configurations. But it's not the general formula. The general one is

\[ S=-k_{\mathrm {B} }\sum_{i} p_{i}\log p_{i}, \]

where \( p_{i} \) is the probability for the system to be in the \( i \)th microscopic configuration. However, one can rewrite it as

\[ S=-k_{\mathrm {B} }\langle \log p\rangle = k_{\mathrm {B} }\log \Omega - k_{\mathrm {B} }\langle \log n\rangle \]

where \( p_{i}=n_{i}/\Omega \), so \( n_{i} \) is the number of states in the \( i \)th specific microscopic configuration. Notice that \( \sum_i n_{i} = \Omega \), which means that in the case of a lot of microscopic configurations, which happens most of the time except in situations like crystals, we have \( \Omega \gg n_{i} \); therefore, by applying \( \log(\cdot) \) we have \( \log \Omega \gg \log n_{i} \), and thus for the average as well, \( \log \Omega \gg \langle \log n \rangle \). Hence, even in the general case, the \( k_{\mathrm {B} }\log \Omega \) term governs the entropy's behaviour. This brings us to the question of how we count \( \Omega \).
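As a quick sanity check of the identity above, here is a minimal numeric sketch (with made-up toy numbers) showing that \( -k_{\mathrm{B}}\sum_i p_i\log p_i \) and \( k_{\mathrm{B}}\log\Omega - k_{\mathrm{B}}\langle\log n\rangle \) agree when \( p_i = n_i/\Omega \):

```python
import numpy as np

k_B = 1.380649e-23  # Boltzmann constant, J/K

# Toy numbers: n[i] = number of microstates in the i-th configuration.
n = np.array([5.0, 20.0, 75.0])
Omega = n.sum()
p = n / Omega

S_general = -k_B * np.sum(p * np.log(p))                     # -k_B <log p>
S_split = k_B * np.log(Omega) - k_B * np.sum(p * np.log(n))  # k_B log(Omega) - k_B <log n>

print(S_general, S_split)  # equal up to floating-point error
```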

In the reductionist approach we had before, we just count the number of states of atoms and molecules. Even here you can see a paradox in that approach: molecules are built from atoms, so why should we count those structures separately? But here we are at the next step, where reductionism is not an assumption anymore. Now we know that atoms are structures separate from molecules, so we should count them separately. But a system can have other structures as well that we should count.

First of all, the counting of each structure is orthogonal/independent of the counting of other structures, so \( \Omega \) should be their product, which means that if the total repetition of the \(i\)th structure is \( r_{i} \), then, keeping the previous assumption of a lot of microscopic(?) configurations, we have

\[ \Omega = r_{1}!\,r_{2}!\cdots r_{N}! \]

where the system has only \(N\) types of structures. Therefore, by applying Stirling's approximation38,

\[ S= - k_{\mathrm {B} }\langle \log n\rangle + k_{\mathrm {B} }\sum_{i}^{N} r_{i}\left(\log r_{i} - 1\right) \]
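The Stirling step used here is just \( \log(r!) \approx r\log r - r \); a quick numeric check (with hypothetical repetition counts) shows how good the approximation is:

```python
import math

# Hypothetical repetition counts r_i for three structure types.
r = [1_000, 50_000, 2_000_000]

exact = sum(math.lgamma(ri + 1) for ri in r)          # log(r_i!) via the log-gamma function
stirling = sum(ri * (math.log(ri) - 1) for ri in r)   # r_i * (log r_i - 1)

print(exact, stirling, abs(exact - stirling) / exact)  # relative error is tiny
```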

Second, in both the old approach and the new one, we don't count the repeated structures of space-time itself. Even though there must be a huge finite number of repeated structures in space-time, as we discussed while explaining the Poincaré group17, it's always a constant in a flat space-time, so we can just include it by adding a constant, assuming the volume is constant, like \( C = C(V)\)

\[ S=- k_{\mathrm {B} }\langle \log n\rangle + C + k_{\mathrm {B} }\sum_{i}^{N} r_{i}\left(\log r_{i} - 1\right) \]

It shows that \( S \) is directly related to all the \( r_{i} \), so increasing the number of repetitions of any structure in a system will increase its entropy. And that's it. We can answer big questions with it! Notice that the noise term, \(- k_{\mathrm {B} }\langle \log n\rangle \), and the constant part have insignificant contributions in our measurements, based on

\[ dU=TdS-pdV \]

and

\[ T=\left({\frac {\partial U}{\partial S}}\right)_{V,N} \]

Here, you may need to replay all of thermodynamics' games with this new paradigm. For instance, populating the Earth, which increases the repetition of the human body structure, \( r_{h} \), would decrease Earth's temperature if we populated it to the order of magnitude of the repetition of atoms and molecules, \( r_{a,m} \); but that's not what we observe, you may object! This is a good point, since this model claims even a big structure contributes extra energy, beyond the energy its molecules contribute to the system, and that can objectively be tested by an experiment. This is so exciting! Let me know on Twitter how it goes.

Life

Erwin Schrödinger39 concluded in his book40 that life decreases the entropy of Earth, which raised a lot of concerns later41 that the Second law of thermodynamics2 is not applicable to life, or at least that there is a contradiction between our observations and our theories. Evolution and the Second Law3 tries to explain that the existence of life doesn't violate the Second law of thermodynamics by insisting that we have an open system on Earth, and that the Sun radiates enough entropy at Earth to increase its entropy to its maximum, so there's enough input entropy to the system to cover the loss due to the evolution of life. My problem with this explanation is that entropy is not a conserved quantity, so by providing enough entropy as input we cannot say the system decided, "well, I have enough of that, let's lose some entropy at some points!" The Second law of thermodynamics is a consequence of having a much higher number of disordered states with respect to ordered ones, so the system falls into a disordered state much more often, unless we spend energy and push it in another direction. If we provide entropy as input, the system will try to increase it even more, because there is still a lot more room on the disordered side of the possibilities than on the ordered side. Therefore, it cannot be the correct answer to this paradox!

It's also nice to notice that in the current reductionist approach there's a maximum for the entropy of Earth at its current temperature, but after our paradigm change there's no such maximum, because as soon as the system discovers a new structure that can be repeated in a sustainable manner, it can increase its entropy further. Life is such a repeated structure, which is sustainable on Earth, so it increases the entropy, not decreases it as previously thought. And it's not just life! Atoms and molecules are sustainable structures, so now it makes sense why nature always tries to have more types of them. Also, space-time has structures, as explained before, so we can argue it may expand to increase the repetition of those structures.

Here, it's useful to rephrase the Second law of thermodynamics2 a little bit: time ticks in the direction of increasing entropy, unless the system receives enough energy to push it away. It's important to notice that this is a local process. The system doesn't know anything about the long-term consequences of choosing a state. It can only calculate locally, in the short term, as any system in classical mechanics would do. Just a reminder that any classical system can convert Hamilton's principle42, which is a global condition, into Lagrange's equations43, which are local conditions. That's not the case for a quantum system in the Copenhagen interpretation30. So time would tick to increase the entropy in the Cambrian explosion44, and then, because the repeated structures are not sustainable or an external source of energy pushes the system to decrease its entropy, we get a list of mass extinction events45. If the system could act globally, looking into the future to know there would be a mass extinction after this explosion of structure types, then time would not have ticked in that direction in the first place. Such access to the future would violate classical mechanics, but not quantum mechanics in the Copenhagen interpretation30. This all means classical mechanics should be used here.

It's not just evolution46 that can be explained by this; it looks like we can even answer why events like the Industrial Revolution47 happened. Mass production in the Industrial Revolution means we could create a large number of products, which basically are structures with their own types. Each of those products has its own \( r_{i} \), so such an event could increase the entropy, and therefore time ticks in that direction. However, we still need to make it sustainable for it to stay around, which hasn't happened yet!

This also means the mass extinction that's currently happening on Earth could be one of the causes of global warming, rather than just an effect.

But on the other hand, it also means that if we all buy enough Tesla Model 3s, and if Tesla could produce that many in a sustainable manner, then increasing \( r_{tm3} \) would decrease the temperature of Earth and solve global warming :D Just kidding!

Software

It's not all about documentation, as Wikipedia's definition above suggests, but mostly about clearly defining the input and output of a program, in a way that the output only depends on the inputs, a behaviour we call deterministic48,49. One can achieve that by applying the needed restrictions on the input and output, with the language's type system or other methods, to clarify what the result depends on.

It's worth having a brief look at the history of computers. In the old days, we only had analog computers. They were not deterministic, so every time we ran them there was a possibility they would give us different results. Then we applied our number system, in the form of binary, which is an abstract concept, onto the reality of our machines, which are full of noise, to achieve reproducibility. So we struggled once to reach this point of reproducibility, but, like a kid, we forgot that! We invented Encapsulation50, put it in a paradigm named Object-oriented programming (OOP)51, then poured money into advertising it until it was used everywhere. By definition, Encapsulation hides some inputs that the output depends on, so it violates determinism, as in the sketch below. Thus, we lost all of the reproducibility power of the machines! And it's not the only reason: for instance, we designed Concurrency52 and Parallelism53 APIs that produce race conditions54 by default, unless the user of the API solves them to make the behaviour deterministic again. The result is that our programs are not as deterministic as the underlying binary system designed for them.
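Here is a toy sketch (a hypothetical example of my own) of how encapsulation can hide an input that the output depends on, so the same call no longer reproduces the same result:

```python
# Hypothetical example: `Counter.price` looks like it depends only on `base`,
# but its output also depends on hidden, encapsulated state.

class Counter:
    def __init__(self) -> None:
        self._calls = 0           # hidden input: not visible in the method signature

    def price(self, base: float) -> float:
        self._calls += 1          # mutate hidden state on every call
        return base + self._calls # output depends on the hidden call history

c = Counter()
print(c.price(100.0))  # 101.0
print(c.price(100.0))  # 102.0 -- same argument, different result: not reproducible
```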

The good news is that if you define the input and output of a program, or a function, clearly and restrictively, and also avoid concurrency and parallelism inside it, then the underlying binary system, which is deterministic, will help you have a deterministic program or function. This kind of function has a name: we call it a pure function55.
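For contrast with the sketch above, here is the same toy computation written as a pure function, where everything the result depends on is an explicit argument:

```python
def price(base: float, calls: int) -> float:
    """Pure: no hidden state, no I/O; the output is fully determined by the inputs."""
    return base + calls

print(price(100.0, 1))  # 101.0
print(price(100.0, 1))  # 101.0 -- same inputs, same output: reproducible
```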

I have to mention that using pure functions and a type system is what Functional Programming (FP)56 advertises, and they are great tools to achieve reproducibility, as I explained above; but Functors, Monads, Category theory20, etc., which are other parts of Functional Programming56, do not directly make your code reproducible. So you don't need to shift entirely from OOP to FP to achieve reproducibility.

Personally, reproducibility was so important to me that when I wanted to choose which branch of Software Engineering to work in, I chose Mobile development, because in mobile development your code, as a repeatable pattern, is reproduced on many more machines than, let's say, backend code. Therefore, writing code turns into experiments that run at different locations, which itself is a test of the reproducibility of the underlying Physics and Math. It's so exciting to me!

But unfortunately, this is not why most mobile developers chose this branch, so reproducibility is not as important to the majority of us as it is to me. Even in software generally we don't have a better situation: some tools and technologies are available right now, but developers complain about the steep learning curve! For instance, pure functions should be the small structures of any large deterministic structure, but most popular programming languages don't support pure functions properly. Even in languages that have a simple type system, which restricts data structures to force a minimum of reproducibility, developers tend to write less restrictive code. Of course, the reason why popular programming languages don't support reproducibility is clear: the software industry is growing fast, and young energy for trial and error is enough to make PHP and JavaScript profitable enough for some developers and companies, by iterating fast enough on features and bugs. Looking at you, Meta! Apparently that profit makes them role models for other developers. But this will not hold in the long term.

We need profitable developers and companies who embrace reproducibility and the related technologies. The good news is that time is aligned with our direction, based on the above thoughts on Entropy: time itself ticks in the direction that has more sustainable, reproducible patterns, to increase the entropy as much as possible.


References