
global name 'df' is not defined.

Hi,

Sometimes the slaves lose the connection for a very short period of time (for various reasons: for example, when launching a render of a heavy scene, the network saturates while it transfers the files to all the slaves, but it can also be caused by an unreliable network cable, etc.).

When this happens, this error (see first picture) appears on each slave that lost the connection. Clicking OK resolves the problem, because the network loss was really short. But if I don't click OK, the slave doesn't respond to the coordinator and doesn't take jobs. Sometimes, when I launch a render during the night, I discover the next morning that only half of the slaves are still rendering. Remotely accessing the lost slaves and clicking OK does the trick. But it's a bit tedious and could be a big problem on big farms. It also means you need to check the farm often to see if the slaves are still up.

In the second picture, the first slave was waiting for me to click OK, and I clicked it, but the other two are still waiting. When the wait is too long, their state becomes "not responding".

So would it be possible to make Pandora, when it detects that it lost the connection to the server, wait for a bit and retry instead of freezing and waiting for human input?

EDIT: The connection loss can also happen when the server restarts. So each time the server restarts, you need to go to each slave and click OK. On a big farm, that can be really tedious.

Uploaded files:
  • error2.png
  • error3.png

I totally agree that the slaves shouldn't wait on any user interaction. I improved that in v1.0.3.7. Now there won't be a popup window. Instead there will be a warning in the slave log, and after 10 seconds the slave tries to access the file again. I hope this solves the problem for you when the connection is lost.
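For readers who want to picture the "log a warning, wait, retry" behaviour described above, here is a minimal sketch. It is not Pandora's actual code: `read_with_retry`, `read_func`, `retry_delay`, `max_attempts` and the logger name are all hypothetical names chosen for illustration. Only the 10 second delay comes from the post above.

```python
import logging
import time

# Hypothetical logger name; Pandora's real slave log is not shown in this thread.
logger = logging.getLogger("slave")


def read_with_retry(read_func, retry_delay=10.0, max_attempts=None, sleep=time.sleep):
    """Call read_func() until it succeeds, logging a warning instead of blocking.

    read_func    -- hypothetical callable that accesses a file on the server
    retry_delay  -- seconds to wait between attempts (10 s as in the post above)
    max_attempts -- None means retry forever; otherwise re-raise after that many tries
    sleep        -- injectable so the loop can be exercised without real waiting
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return read_func()
        except OSError as exc:
            # No popup window: just record the problem and try again later.
            logger.warning("Server not reachable (attempt %d): %s", attempt, exc)
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(retry_delay)
```

The point of the design is that nothing in the loop waits for a human: a short network drop or a server restart only costs the slave a few retry intervals.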

Hi,

It mostly solved it. Now the farm runs much more smoothly. But from time to time the slave client crashes, I think, and I get the dialog box that proposes to send a bug report (I don't have a screenshot, sorry). I think it is still due to the slave losing the connection from time to time. Often, just hitting Cancel is enough to get it back up and running, but sometimes I need to restart the slave program.

I was thinking, a cool thing would be to be able to restart a slave from the Render Handler window on the coordinator. It would help solve most of those problems without needing to physically go to the slave or access it remotely.

(Another cool thing would be the ability to remotely update the slaves from the Render Handler window on the coordinator. The coordinator could download the zip file, or the user could provide it manually, and then the coordinator would send it to all the slaves.)

Since version 1.0.3.7 you shouldn't see any error messages on the slave computers. If you have an older version installed, you should update to the latest version. If you still see any error messages, please send me the error, because even if the server connection is lost there shouldn't be an error that blocks the slave.

I just updated Pandora to v1.0.3.10 and I added a few new features. You can now right click on a slave in the Render Handler and restart it. But keep in mind that this only works as long as the slave can receive commands from the coordinator. If a critical error occurred on the slave and it is closed or blocked, this command cannot restart the slave. But you could use it for example to restart the slave when Maya froze to kill the Maya process.

I also added the option to update the Pandora version of all slaves at once from the Render Handler. In the "Help" menu there is now an option "Update slaves...". This will download the zip file and copy it to all slaves. The slaves will install the file and restart automatically. During the update all Pandora processes will be closed. So in case you have the coordinator running on a slave workstation, the coordinator process will be closed too and you have to start it again.

 

The coordinator and all the slaves are using Pandora 1.0.3.7. Next time the error pops up, I'll take a screenshot and use the "send to the dev" button.

Thanks for the restart slave and update slaves features! I'll try them when I can.

Here is the error. Hitting the Close button resolves the problem most of the time, but sometimes I have to restart the slave.

04/07/19 22:30:06 ERROR - PandoraCore v1.0.3.7:
  File "C:\Pandora\Scripts\PandoraSlave.py", line 1658, in <module>
    sys.exit(qApp.exec_())
  File "C:\Pandora\Scripts\PandoraSlave.py", line 230, in func_wrapper
    return func(*args, **kwargs)
  File "C:\Pandora\Scripts\PandoraSlave.py", line 704, in checkAssignments
    debug = self.getConfSetting("debugMode")
  File "C:\Pandora\Scripts\PandoraSlave.py", line 230, in func_wrapper
    return func(*args, **kwargs)
  File "C:\Pandora\Scripts\PandoraSlave.py", line 524, in getConfSetting
    self.createSettings()
  File "C:\Pandora\Scripts\PandoraSlave.py", line 230, in func_wrapper
    return func(*args, **kwargs)
  File "C:\Pandora\Scripts\PandoraSlave.py", line 687, in createSettings
    self.setConfig(configPath=self.slaveConf, confData=sConfig)
  File "C:\Pandora\Scripts\PandoraSlave.py", line 230, in func_wrapper
    return func(*args, **kwargs)
  File "C:\Pandora\Scripts\PandoraSlave.py", line 547, in setConfig
    result = self.core.setConfig(cat=cat, param=param, val=val, data=data, configPath=configPath, delete=delete, confData=confData, silent=silent)
  File "C:\Pandora\Scripts\PandoraCore.py", line 118, in func_wrapper
    erStr = ("%s ERROR - PandoraCore %s:\n%s\n\n%s" % (time.strftime("%d/%m/%y %X"), args[0].version, ''.join(traceback.format_stack()), traceback.format_exc()))


Traceback (most recent call last):
  File "C:\Pandora\Scripts\PandoraCore.py", line 115, in func_wrapper
    return func(*args, **kwargs)
  File "C:\Pandora\Scripts\PandoraCore.py", line 804, in setConfig
    errStr = "The folder couldn't be created:\n\n%s\n\n%s" % (os.path.dirname(configPath), str(e))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 22: ordinal not in range(128)

EDIT: After hitting the Cancel button, a text box shows up saying that the problem could be caused by a special character in the path. But I don't have any, and the paths are the same on all the other slaves. Only two slaves show me this error from time to time.

Uploaded files:
  • error_unknown.png
  • error_unknown2.png

I just fixed that and uploaded version 1.0.3.13.

The special character was probably not in any file path, but in an error message returned by your Windows system when Windows failed to access your server. I could imagine that this happens if your Windows is in a language that uses many of these special characters (like Spanish or German).

Anyway, that should be fixed now.
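To illustrate the kind of failure described here: byte 0xe9 is "é" in the Windows-1252 code page, so a localized OS error message can contain bytes that the ASCII codec refuses to decode. The sketch below is written for Python 3 and is not the actual fix that went into 1.0.3.13; `to_text` is a hypothetical helper, the sample message is made up, and the choice of fallback encodings is an assumption.

```python
def to_text(raw, encodings=("utf-8", "cp1252")):
    """Decode an OS error message without ever raising UnicodeDecodeError.

    Hypothetical helper: tries each encoding in turn, then falls back to
    ASCII with replacement characters so building an error string never fails.
    """
    if isinstance(raw, str):
        return raw
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("ascii", errors="replace")


# Made-up example of a French error message encoded in Windows-1252.
message = b"Le chemin r\xe9seau n'a pas \xe9t\xe9 trouv\xe9"

# Decoding as ASCII fails the same way as in the traceback above.
try:
    message.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

decoded = to_text(message)
print(ascii_failed, decoded)
```

With a helper like this, the path itself can be plain ASCII and the error can still contain accented characters, which matches the case where only the OS message is localized.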

Well, all the PCs have the French version of Windows installed, so you're probably right ^^. I'll try to update the slaves and see if it works.