GEP for stricter minting rules [Closed]

sabrinasadik · March 22, 2024, 7:21am

Over the last couple of years, it has been amazing to see the ThreeFold Grid (the TFGrid) expanding. The TFGrid is currently live in 57 countries, with 26.16 PB of capacity, 36,484 cores, and a total of 1782 ThreeFold nodes.

Coming into this next chapter, we need to focus on the stability of the TFGrid. The goal for the TFGrid is for users to be able to easily deploy on a node of their choosing, without being worried about the stability and availability of their deployments.

Minting starts from that standpoint. The general question to ask is: is this node available for deployments so that users can rely on it? If the answer is yes, then the TFGrid mints the ThreeFold Tokens.

It has come to our attention that there are some nodes on the TFGrid that unfortunately do not measure up to these standards. We see it as our responsibility to bring this out to the community and propose a stricter set of minting rules so that we all collectively can make sure that the TFGrid is reliable, stable, and fair.

1. Reduce allowed uptime out of bounds from 5 minutes to 1 minute (see diff in https://github.com/threefoldtech/minting_v3/commit/5c99834adb1ca6f477ee06b8e6766459078c3fca for reasoning).
2. Reduce allowed downtime for power-managed nodes from 25 hours to 24 hours (25 was accidentally left in after testing).
3. Add a violation for nodes if uptime increased less than time increased (accounting for 1 minute of skew as per point 1). This was part of the original implementation but was later allowed in the early days of V3 since we were still figuring out how infrastructure was handled, and manual validation at that time showed no issues.
4. Enforce a max delay between uptime reports of 41 minutes (40 minutes zos interval + 1 minute of skew as per point 1). See https://github.com/threefoldtech/minting_v3/issues/22 for reasoning.
5. Add a violation for nodes if the twin does not have a relay set (the node is essentially not usable for the grid).
6. Add a violation for nodes if the twin has a public key in an invalid format (as that won’t work for sure; the node is thus unreachable, similar to the previous point).
7. Improve decoding of twin information to avoid cases where the decoding strategy selects the wrong format, leading to invalid decoded data.
8. Add a violation for the case where a node twin does not exist (this should also be checked by the chain), as these nodes will also not be usable, similar to points 5 and 6.
9. Add a violation for long-term clock skew (cf. https://github.com/threefoldtech/zos/issues/1914).
10. Miscellaneous code improvements that aren’t directly related to payout (e.g. remove dead code, small performance improvements, better wording of violations for easy debugging, …).
11. Add a max delay of half an hour between the node power target rising edge and the node uptime event of reboot. Farmerbot v0.2.0 introduced random wakeups to combat potential fraud. Since periodic wakeups continue as normal, a node could just ignore a random wakeup (as currently nodes are only required to post uptime every 24/25 hours, see point 2, while Farmerbot is running). The idea here is that since a random wakeup is unpredictable, this check essentially makes sure the hardware is still there. This check will also apply to regular wakeups. If the threshold is exceeded, the node gets a violation (and it doesn’t mint for the month). 30 minutes is debatable but should be sufficient for even heavy-duty servers to fully boot. Notice that this (the time between the rising edge of the power target and node boot) is also the delay a user has to wait for a node to come online if they deploy on a power-managed node.
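
To make points 1, 3, 4, and 11 a bit more concrete, here is a minimal Python sketch of how such checks could look. It is not the actual minting_v3 code: the `UptimeReport` type, the function names, and the violation labels are made up for illustration, reboots are simply skipped, and the power-managed case from point 2 is left out. Reading point 1 as "uptime may not grow more than 1 minute faster than elapsed time" is also an interpretation.

```python
from dataclasses import dataclass

SKEW_S = 60                     # point 1: allowed skew, 1 minute
MAX_REPORT_GAP_S = 41 * 60      # point 4: 40 min zos interval + 1 min skew
MAX_WAKEUP_DELAY_S = 30 * 60    # point 11: power target up -> node booted


@dataclass
class UptimeReport:
    """Hypothetical shape of one uptime report (not the real chain type)."""
    timestamp: int  # when the report was received, in seconds
    uptime: int     # uptime the node claims, in seconds


def check_reports(reports):
    """Return (report index, label) pairs for each rule that is broken."""
    violations = []
    for i in range(1, len(reports)):
        prev, cur = reports[i - 1], reports[i]
        elapsed = cur.timestamp - prev.timestamp
        gained = cur.uptime - prev.uptime

        # Point 4: reports may not be further apart than 41 minutes.
        if elapsed > MAX_REPORT_GAP_S:
            violations.append((i, "report_gap_too_long"))

        # Uptime counter went backwards: the node rebooted. A real
        # implementation needs dedicated handling; this sketch skips it.
        if gained < 0:
            continue

        # Point 1 (as interpreted here): uptime grew faster than time.
        if gained > elapsed + SKEW_S:
            violations.append((i, "uptime_out_of_bounds"))
        # Point 3: uptime grew slower than time.
        elif gained < elapsed - SKEW_S:
            violations.append((i, "uptime_increased_too_little"))
    return violations


def woke_in_time(power_target_up_ts, boot_ts):
    """Point 11: the node must boot within 30 minutes of the rising edge."""
    return 0 <= boot_ts - power_target_up_ts <= MAX_WAKEUP_DELAY_S


reports = [
    UptimeReport(timestamp=0, uptime=1000),
    UptimeReport(timestamp=2400, uptime=3400),  # 40 min later, in sync
    UptimeReport(timestamp=4800, uptime=5700),  # uptime 100 s short
    UptimeReport(timestamp=7500, uptime=8400),  # 45 min since last report
]
print(check_reports(reports))
print(woke_in_time(power_target_up_ts=0, boot_ts=13 * 60))
```

In this example the third report is flagged because uptime grew 100 seconds less than the elapsed time, and the fourth because it arrived 45 minutes after the previous one.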

While we understand that points 5, 6, and 8 are not really user errors, the goal of the minting is to mint tokens for usable capacity. If any of these conditions are true, then the node is not usable, and therefore it should not be rewarded with tokens. This also means that even though we do our best to avoid any mistakes or bugs that would cause a farmer to miss out on tokens, in reality it is a possibility, and the ThreeFold Foundation cannot be held responsible for this loss. It also means that our teams give high priority to resolving any such issues or bugs, and we would like to invite anyone from the community to help in reporting or fixing any possible issues.

A new GEP will be launched for this on our platforms, after which these new minting rules will be applied to the minting code to make sure we have a fairer, more honest, and more stable ThreeFold Grid.

We thank you for your understanding.

Sabrina on behalf of the ThreeFold Team

sabrinasadik · October 24, 2023, 9:17am

The DAO proposal is live now until Tuesday October 31st. Please visit dashboard.grid.tf to cast your vote.

RobertL · October 24, 2023, 10:17am

I’m sorry to say I have no idea what most points mean and how to explain them. It may be my lack of English or technical understanding. In short, I have no clue what I’m voting for and how it impacts us. Sorry.

Edit: two suggestions. Maybe some more detailed examples of what points 1-5 mean. It would be nice if there’s a list of nodes affected, so users (clients) can be warned? Additionally, a way to resolve those issues?

Or do I misunderstand?

FLnelson · October 24, 2023, 10:15am

It’s tough love but it is best for the network. Let’s just make sure the commercial servers reliably boot in 30 min; they are not made to boot fast.

RobertL · October 24, 2023, 10:27am

I’m all for making it more robust. I’m just slightly worried that it has an effect on nodes/users who have no clue they’re doing something wrong. The other day I met a user who bought several Titans years ago, in the very early days, just because he knew Weynand and wanted to support the network. Only to find out they stopped functioning long ago, probably because of the version 2 to 3 move.
Anyway, these people have no clue what happens within this project.

Personally I find it very questionable why TF keeps on moving the responsibility and consequences to their users. We sell nodes, and if one of them fails and causes damage to the company that used them, the claim is coming to me. I can’t tell them it’s their own responsibility, but TF somehow can.

So, with the above I’m trying at least to reach my own customers if there’s a chance they’re affected. For that, I need to understand it better.

Again, I have no problem at all improving!

FLnelson · October 24, 2023, 12:02pm

It’s unreasonable to pay people for running hardware that isn’t meeting uptime needs. Where TF can step up is letting these people know there are problems.

RobertL · October 24, 2023, 1:00pm

I agree. But what if the downtime was caused by ZOS?

renauter · October 24, 2023, 1:07pm

I totally agree that provided capacity should be reliable and meet some quality and uptime standards in order to have its TFT minted.
Nevertheless, it is sometimes complicated to comply with all these rules, not because of bad will but most of the time because of a lack of information or communication.
Maybe telling farmers which rules their node does not respect, with a good, simple explanation of how to comply, would be a path forward.
So for me, such stricter minting rules should come with a better tool for interacting with farmers.
E.g. an optional Telegram bot (to warn farmers that there is a violation) + a farming FAQ (to give support in case some action could solve it).
Because even for technical people it is sometimes not easy to understand, so I can imagine how it is for someone who just wants to buy a node to support the project.

Btw, I remember once, after checking the minting of my nodes on the app, I saw that I got much less TFT for the month… Indeed, after having a look at the nodes I realized they were down (probably due to a short power cut while I was away from home, which can be common in some countries). So, since I am not checking each day whether my node is up or down, a warning from some Telegram bot would have been helpful in that case. And this situation could have lasted for some time if I had not checked the app and investigated a bit.

Mik · October 24, 2023, 2:58pm

@renauter
Do you mean a status bot like this one?

renauter · October 24, 2023, 3:06pm

Yes, similar to this one, but one that would send warning messages when something unexpected happens and not only when you ask for status.

Mik · October 24, 2023, 3:30pm

OK good.
Here’s a github issue on this: https://github.com/threefoldtech/zos/issues/2092

Mik · October 24, 2023, 5:13pm

isyal · October 24, 2023, 8:28pm

I support the upcoming changes, but I believe that existing errors in the Farmerbot code should be fixed. One time my paycheck was cut in half because ZOS didn’t detect SSDs properly. So far farmerbot has problems with WoL commands for Fujitsu servers. While in farmerbot 2.0 I managed to make the Fujitsu servers start turning off, unfortunately the WoL commands practically do not work, so I had to turn the server on on a schedule. The proposed changes will make the server I use for the TFT network unusable. I hope that it will be fixed and start working properly, and that the upcoming changes will not negatively affect my income.

fromenator · November 8, 2023, 5:06pm

Is there a guide or place to confirm if requirements 5, 6, or 8 affect any of our farms?

ParkerS · November 8, 2023, 6:26pm

How many total nodes were found that are not actually able to contribute to the grid?

scott · November 20, 2023, 9:12pm

For 5 and 8, you can see this info listed in the explorer. If either piece of data there is missing for any of your nodes, let us know asap and we’ll investigate.

Regarding 6, it’s a bit more tricky, but the good news is that we’ve never actually seen this issue. If a node has accepted a workload or was managed successfully by the farmerbot through at least one cycle, then it’s fine. For anyone with further concerns about this, please reach out to me and I can help run some checks.

I did a survey of all nodes on the Grid a while ago, to check for RMB responsiveness. Of those that were actively submitting uptime reports, I only found three that had issues with their twin configuration (missing public key). Those nodes have all since gone offline.

We also have an ongoing issue with the rmb-peer implementation that causes nodes to sometimes lose the ability to communicate over RMB and thus also prevents them from accepting new workloads. This is hopefully fixed in the next release, but since it only appears “in the wild” and the devs can’t reproduce it, we won’t know until we try it. Just to be clear, those nodes are not excluded from minting.

colossus · December 17, 2023, 8:36pm

Following up here regarding the slash in rewards for last month. I took a 50k TFT hit (this is 1/6 of my monthly yield) and every day since there are 1-2 nodes that report that they do not wake up in time only to then check in less than 10 minutes after a violation. At this rate I will miss out on December completely.

Why the team pushed forth stricter rules regarding node check-in to be 24 hours instead of 25 hours knowing full well that GitHub issue #66 was open and unresolved is perplexing.

I trust that this issue is being investigated and resolved expeditiously.

randynho · December 18, 2023, 9:41am

Adding my experience here as well. I got around 25k TFT less last month than during the months before.

Tonight I had again the 24 hour slash on two nodes.
My scheduled wakeups start at 1 am. Yesterday (Dec. 17) two of my nodes started randomly at 00:43, were online at 00:56 and 00:57, and shut down at 01:29.

Those two nodes got their scheduled wakeup signal tonight (Dec. 18) at 01:23 and 01:33. Both nodes got the 24h slash message at 01:29.

This looks very much like a bug to me. A random wakeup shortly before the scheduled one caused those nodes to shut down after the usual duration, during the time they usually have their daily wakeup.
On the next day they get the daily wakeup call at the usual time, but that’s already too late, or too late to boot.
What’s really annoying is that I can’t do anything about it. Support told me that at least one of my nodes got slashed last month because of a manual boot.

Seems like the 25-hour timespan was an easy way to prevent such issues.
Random wakeups shouldn’t occur around one hour before the scheduled ones now. Or a wakeup should be sent to a node at the latest 23 hours and 29 minutes after the last shutdown.
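
A quick sketch of the timeline for the first node, assuming the 24-hour window is counted from the last shutdown (which matches the 01:29 slash message) and that the node needs the same 13 minutes to boot as it did the day before. Both are assumptions read from the timestamps above, not confirmed behaviour of the minting code.

```python
from datetime import datetime, timedelta

MAX_DOWNTIME = timedelta(hours=24)  # point 2 of the proposal

# Timestamps from the post (year taken from the post date).
random_wakeup = datetime(2023, 12, 17, 0, 43)
back_online = datetime(2023, 12, 17, 0, 56)
shutdown = datetime(2023, 12, 17, 1, 29)
scheduled_wakeup = datetime(2023, 12, 18, 1, 23)

# Assumption: the boot takes as long as it did on Dec. 17.
boot_time = back_online - random_wakeup

# Assumption: the allowed downtime is counted from the shutdown.
deadline = shutdown + MAX_DOWNTIME

expected_online = scheduled_wakeup + boot_time
late_by = expected_online - deadline

print("boot time:      ", boot_time)
print("deadline:       ", deadline)
print("expected online:", expected_online)
print("late by:        ", late_by)
```

Under these assumptions the node comes online a few minutes after the 24-hour deadline even though the wakeup signal was sent before it. The second node, woken at 01:33, had already passed the deadline when its signal arrived.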

FLnelson · December 18, 2023, 3:32pm

If ditching the random wakeups would make the fix easier, I think that should be done. We can trust our current farmers and no new v3 farmers will be possible soon.

randynho · December 18, 2023, 5:13pm

Nicely said. Giving more trust to the farmers.