🐞 [Bug]: Farmerbot fails to start fully due to an RMB communication error. #1191

Open
mahendravarmayadala93 opened this issue Sep 6, 2024 · 7 comments
Labels: farmerbot, type_bug (Something isn't working)

Comments


mahendravarmayadala93 commented Sep 6, 2024

What happened?

Farm ID : 195

The client reported that the nodes managed by their Farmerbot did not shut down.

Upon reviewing the log file, we found that the farmerbot was not starting up due to the following error:

8:20AM DBG failed to read message error="websocket: close 1006 (abnormal closure): unexpected EOF"
8:20AM DBG connecting url=wss://relay.grid.tf

Some additional notes by @scottyeager:

I checked the log file. The core problem is that the bot never fully starts up, and that is indeed due to the RMB communication failure associated with this error.

On each attempt to start, the bot adds one node successfully through the same RMB relay before failing repeatedly on the second node. So some RMB communication is succeeding.

I also checked the rate limiting implementation for RMB. It looks like it only drops messages with an error; it isn't supposed to drop connections entirely if the user tries to send too many messages.
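
To illustrate the distinction drawn above between dropping a message and dropping a connection, here is a minimal Python sketch. It is not the RMB relay's code: the fixed-window algorithm and the names `FixedWindowLimiter`, `Connection`, and `RateLimitExceeded` are all assumptions made for illustration.

```python
import time


class RateLimitExceeded(Exception):
    """Raised for one rejected message; the connection itself is untouched."""


class FixedWindowLimiter:
    """Hypothetical limiter: at most `limit` messages per `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def check(self):
        now = self.clock()
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            raise RateLimitExceeded("too many messages in the current window")
        self.count += 1


class Connection:
    """Hypothetical connection that stays open when a message is rejected."""

    def __init__(self, limiter):
        self.limiter = limiter
        self.open = True
        self.delivered = []

    def send(self, msg):
        try:
            self.limiter.check()
        except RateLimitExceeded as exc:
            # Only this message is dropped; self.open is never set to False.
            return f"error: {exc}"
        self.delivered.append(msg)
        return "ok"
```

Under this model, a sender that exceeds the limit sees per-message errors rather than a closed socket, which is why rate limiting alone would not obviously explain a 1006 closure.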

Log File :

farmerbot_16enuun.log

Which network/s did you face the problem on?

Main

Twin ID/s

No response

Version

No response

Node ID/s

626, 548, 547 (Offline currently) - 3038 (Online)

Farm ID/s

195

Contract ID/s

No response

Relevant log output

Config File

farm_id: 195
never_shutdown_nodes:
  - 626
power:
  periodic_wake_up_start: 09:00AM
  periodic_wake_up_limit: 3

mahendravarmayadala93 added the type_bug (Something isn't working) label Sep 6, 2024
rawdaGastan added this to the v0.16.0 milestone Sep 9, 2024
rawdaGastan self-assigned this Sep 9, 2024
rawdaGastan removed this from 3.15.x Oct 1, 2024

Collaborator

rawdaGastan commented Oct 1, 2024

  • You should make sure the nodes are healthy and working before adding them to the farmerbot.
  • You can try using the --continue-power-on-error flag.

@rawdaGastan rawdaGastan modified the milestones: v0.16.x, v0.17.x Oct 31, 2024

TullysInc commented Nov 19, 2024

@rawdaGastan: There is a more recent report from a second farmer (farmID_250) showing the same error lines in their logs.

farmer@bot:~/farmerbot$ tail -n 50 farmerbot.log
2024/11/18 14:08:47 Connecting to wss://tfchain.grid.tf:443...
2:08PM INF starting peer session=farmerbot-rpc-250 twin=826
2:08PM DBG connecting url=wss://tfchain.grid.tf/ws
2024/11/18 14:08:49 Connecting to wss://tfchain.grid.tf/ws...
2:08PM DBG connecting url=wss://relay.grid.tf
2:08PM DBG Add node nodeID=3736
2:08PM DBG failed to read message error="websocket: close 1006 (abnormal closure): unexpected EOF"
2:08PM DBG connecting url=wss://relay.grid.tf
2:08PM DBG Add node nodeID=4746
2:08PM DBG failed to read message error="websocket: close 1006 (abnormal closure): unexpected EOF"

All nodes included in this config are currently up in the dashboard, so we can probably rule out the suspicion that the nodes were unhealthy before being added to the farmerbot. Also, the --continue-power-on-error flag is already included in the script that was used for setup.

farm_id: 250
included_nodes:
  - 565
  - 3736
  - 6026
  - 4746
  - 4961
  - 5262
  - 4458
  - 5985
  - 3763
never_shutdown_nodes:
  - 565
power:
  periodic_wake_up_start: 09:00AM

@scottyeager

> You can try to use --continue-power-on-error flag

We are advising all farmers to use this flag, but it doesn't seem to help in every case. Aside from the EOF error above, regular timeouts while trying to reach powered-off nodes also seem to block the bot from starting, for example:

Error:
9:13PM FTL error="failed to add node with id 2950 with error: failed to get node 2950 statistics from rmb with error: context deadline exceeded"

@rawdaGastan, can you clarify the expected behavior with --continue-power-on-error?

@rawdaGastan
Collaborator

The --continue-power-on-error flag allows the farmerbot to continue updating and managing nodes even if some nodes have RMB connection errors. If the flag is not set and some nodes have RMB issues, the farmerbot won't be able to start.

It is expected that nodes cannot communicate through RMB when they are offline.


scottyeager commented Nov 25, 2024

> The --continue-power-on-error flag allows the farmerbot to continue updating and managing nodes even if some nodes have RMB connection errors. If the flag is not set and some nodes have RMB issues, the farmerbot won't be able to start.

This matches what we expected. The problem is that we are seeing various cases where the bot does not start due to RMB errors, despite the --continue-power-on-error flag being passed. That's why I was trying to clarify whether there is still some case that should cause the bot to refuse to start due to RMB failures with the flag present.

Assuming no such case exists, our issue is that the bot is still refusing to start with --continue-power-on-error.
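
For reference, the behavior described for the flag can be sketched as follows. This is illustrative Python, not the farmerbot's actual code: the function `add_nodes`, its `continue_on_error` parameter, the `fetch_stats` callback, and the `NodeError` type are hypothetical names standing in for the startup loop and the per-node RMB statistics call.

```python
class NodeError(Exception):
    """Hypothetical fatal startup error, raised when the flag is not set."""


def add_nodes(node_ids, fetch_stats, continue_on_error):
    """Try to add every node; return (added, failed).

    fetch_stats is a stand-in for the per-node RMB call and is expected
    to raise on timeouts or connection errors.
    """
    added = []
    failed = {}
    for node_id in node_ids:
        try:
            fetch_stats(node_id)
        except Exception as exc:
            if not continue_on_error:
                # Without the flag, the first failure aborts startup.
                raise NodeError(
                    f"failed to add node with id {node_id}: {exc}"
                ) from exc
            # With the flag, record the failure and move on to the next node.
            failed[node_id] = str(exc)
            continue
        added.append(node_id)
    return added, failed
```

If the flag worked as described, a timeout on one node would end up in `failed` while the remaining nodes were still added; the reports in this thread suggest startup instead stops at the first failing node.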

@scottyeager

> It is expected that nodes cannot communicate through RMB when they are offline.

These errors are coming from online nodes. I'm also not sure what the severity of these errors is. I did some searching regarding EOF error for websockets and found this:

> The error indicates that the peer closed the connection without sending a close message. The RFC calls this "abnormal closure", but the error is normal to receive.

I also found other suggestions, such as adjusting timeouts on reverse proxies. So it would be good to clarify whether this EOF error is something we should be concerned with addressing.
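
As background for the question above: in RFC 6455, close code 1006 is a reserved value that is never sent on the wire; an endpoint reports it locally when the connection ends without a Close frame. The Python sketch below is only illustrative and is not how the farmerbot or the RMB client handles this; the helper names and the capped exponential backoff policy are assumptions. It pulls the close code out of a log line like the ones in this thread and computes a reconnect schedule.

```python
import re

# A few close codes from RFC 6455. 1006 is reported locally when the
# connection drops without a Close frame; the peer never sends it.
CLOSE_CODE_MEANINGS = {
    1000: "normal closure",
    1001: "going away",
    1006: "abnormal closure (connection lost without a Close frame)",
}

_CLOSE_RE = re.compile(r"websocket: close (\d{4})")


def close_code_from_log(line):
    """Return the WebSocket close code found in a log line, or None."""
    match = _CLOSE_RE.search(line)
    return int(match.group(1)) if match else None


def backoff_delays(base_seconds, cap_seconds, attempts):
    """Capped exponential backoff delays, one per reconnect attempt."""
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(attempts)]
```

For example, `backoff_delays(1, 10, 5)` gives `[1, 2, 4, 8, 10]`. Whether a 1006 closure deserves attention depends on how often it recurs and whether reconnects succeed, which a schedule like this would make visible in the logs.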

@TullysInc

Multiple farmers are reporting the same RMB issue, including the earlier farmid_250. Please see the log snippets below:

farmid250.

Farmid_3148
JeroenV
