k8s-device-plugin: NVIDIA device plugin for Kubernetes

Posted by an anonymous user on November 26, 2021
Categories: Google Go, cloud computing, cloud native
License: Apache-2.0 License

Project details

NVIDIA device plugin for Kubernetes

Table of Contents

- About
- Prerequisites
- Quick Start
  - Preparing your GPU Nodes
  - Enabling GPU Support in Kubernetes
  - Running GPU Jobs
- Deployment via helm
- Building and Running Locally
- Changelog
- Issues and Contributing
- Versioning
- Upgrading Kubernetes with the Device Plugin

About

The NVIDIA device plugin for Kubernetes is a Daemonset that allows you to automatically:

- Expose the number of GPUs on each node of your cluster
- Keep track of the health of your GPUs
- Run GPU enabled containers in your Kubernetes cluster.

This repository contains NVIDIA's official implementation of the Kubernetes device plugin.

Please note that:

- The NVIDIA device plugin API is beta as of Kubernetes v1.10.
- The NVIDIA device plugin is still considered beta and is missing:
  - More comprehensive GPU health checking features
  - GPU cleanup features
  - ...
- Support will only be provided for the official NVIDIA device plugin (and not for forks or other variants of this plugin).

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

- NVIDIA drivers ~= 384.81
- nvidia-docker version > 2.0 (see how to install and its prerequisites)
- docker configured with nvidia as the default runtime.
- Kubernetes version >= 1.10

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and nvidia-docker have been installed.

Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit. This is because the new --gpus option hasn't reached kubernetes yet. Example:

    # Add the package repositories
    $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    $ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    $ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

    $ sudo apt-get update && sudo apt-get install -y nvidia-docker2
    $ sudo systemctl restart docker

You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

If runtimes is not already present, head to the install page of nvidia-docker.
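If the node already has a daemon.json with other settings, the nvidia runtime entry has to be merged in rather than pasted over the file. The following is a minimal sketch of that merge, not part of the plugin or of nvidia-docker. The helper name `with_nvidia_default_runtime` and the sample `log-driver` setting are assumptions made for illustration; the runtime path is the one from the example above.

```python
import copy
import json

# Runtime entry taken from the daemon.json example in this README.
NVIDIA_RUNTIME = {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}


def with_nvidia_default_runtime(config: dict) -> dict:
    """Return a copy of a Docker daemon config with nvidia set as the default runtime."""
    merged = copy.deepcopy(config)
    # Keep any runtimes that are already registered and add (or replace) nvidia.
    merged.setdefault("runtimes", {})["nvidia"] = copy.deepcopy(NVIDIA_RUNTIME)
    merged["default-runtime"] = "nvidia"
    return merged


# Hypothetical existing config; in practice it would be loaded from /etc/docker/daemon.json.
existing = {"log-driver": "json-file"}
updated = with_nvidia_default_runtime(existing)
print(json.dumps(updated, indent=4))
```

The input dictionary is left untouched, so the result can be reviewed before it is written back and docker is restarted.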

Enabling GPU Support in Kubernetes

Once you have configured the options above on all the GPU nodes in your cluster, you can enable GPU support by deploying the following Daemonset:

    $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml

Note: This is a simple static daemonset meant to demonstrate the basic features of the nvidia-device-plugin. Please see the instructions below for Deployment via helm when deploying the plugin in a production setting.

Running GPU Jobs

With the daemonset deployed, NVIDIA GPUs can now be requested by a container using the nvidia.com/gpu resource type:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: cuda-container
          image: nvcr.io/nvidia/cuda:9.0-devel
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPUs
        - name: digits-container
          image: nvcr.io/nvidia/digits:20.12-tensorflow-py3
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPUs

WARNING: if you don't request GPUs when using the device plugin with NVIDIA images all the GPUs on the machine will be exposed inside your container.
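When pod specs are generated rather than written by hand, the GPU request is easy to leave out, which triggers the behaviour described in the warning. As a rough sketch, the same pod can be built as a Python dictionary and serialized to JSON, which kubectl accepts in place of YAML. The helper `gpu_container` is hypothetical, and rejecting requests below one GPU is a choice of this sketch, not a rule enforced by the plugin.

```python
import json


def gpu_container(name: str, image: str, gpus: int) -> dict:
    """Build a container entry that requests `gpus` GPUs through resource limits."""
    if gpus < 1:
        # Sketch-level guard: always make the GPU request explicit.
        raise ValueError("request at least one GPU explicitly")
    return {
        "name": name,
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }


# Same pod as the YAML example: two containers, two GPUs each.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-pod"},
    "spec": {
        "containers": [
            gpu_container("cuda-container", "nvcr.io/nvidia/cuda:9.0-devel", 2),
            gpu_container("digits-container", "nvcr.io/nvidia/digits:20.12-tensorflow-py3", 2),
        ]
    },
}

manifest = json.dumps(pod, indent=2)
print(manifest)
```

The printed manifest could then be saved to a file and passed to kubectl create -f.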

Deployment via helm

The preferred method to deploy the device plugin is as a daemonset using helm. Instructions for installing helm can be found here.

The helm chart for the latest release of the plugin (v0.9.0) includes a number of customizable values. The most commonly overridden ones are:

- failOnInitError: fail the plugin if an error is encountered during initialization, otherwise block indefinitely (default 'true')
- compatWithCPUManager: run with escalated privileges to be compatible with the static CPUManager policy (default 'false')
- legacyDaemonsetAPI: use the legacy daemonset API version 'extensions/v1beta1' (default 'false')
- migStrategy: the desired strategy for exposing MIG devices on GPUs that support it [none | single | mixed] (default "none")
- deviceListStrategy: the desired strategy for passing the device list to the underlying runtime [envvar | volume-mounts] (default "envvar")
- deviceIDStrategy: the desired strategy for passing device IDs to the underlying runtime [uuid | index] (default "uuid")
- nvidiaDriverRoot: the root path for the NVIDIA driver installation (typical values are '/' or '/run/nvidia/driver')

When set to true, the failOnInitError flag fails the plugin if an error is encountered during initialization. When set to false, it prints an error message and blocks the plugin indefinitely instead of failing. Blocking indefinitely follows legacy semantics that allow the plugin to deploy successfully on nodes that don't have GPUs on them (and aren't supposed to have GPUs on them) without throwing an error. In this way, you can blindly deploy a daemonset with the plugin on all nodes in your cluster, whether they have GPUs on them or not, without encountering an error. However, doing so means that there is no way to detect an actual error on nodes that are supposed to have GPUs on them. Failing if an initialization error is encountered is now the default and should be adopted by all new deployments.

The compatWithCPUManager flag configures the daemonset to be able to interoperate with the static CPUManager of the kubelet. Setting this flag requires one to deploy the daemonset with elevated privileges, so only do so if you know you need to interoperate with the CPUManager.

The legacyDaemonsetAPI flag configures the daemonset to use version extensions/v1beta1 of the DaemonSet API. This API version was removed in Kubernetes v1.16, so it is only intended to allow newer plugins to run on older versions of Kubernetes.
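Because extensions/v1beta1 is gone from Kubernetes v1.16 onward, it can help to check the cluster version before setting legacyDaemonsetAPI=true. The sketch below is only an illustration of that version cutoff. The function name is hypothetical, and it assumes version strings of the form v1.15.3 or 1.15 without any vendor suffix.

```python
def legacy_daemonset_api_available(kubernetes_version: str) -> bool:
    """Return True if the cluster version still serves extensions/v1beta1 DaemonSets."""
    # Assumes plain "vMAJOR.MINOR[.PATCH]" strings; only major and minor matter here.
    major, minor = (int(part) for part in kubernetes_version.lstrip("v").split(".")[:2])
    # The API version was removed in Kubernetes v1.16.
    return (major, minor) < (1, 16)


for version in ("v1.15.3", "v1.16.0"):
    print(version, legacy_daemonset_api_available(version))
```

A deployment script could refuse to pass the flag whenever this check returns False.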

The migStrategy flag configures the daemonset to be able to expose Multi-Instance GPUs (MIG) on GPUs that support them. More information on what these strategies are and how they should be used can be found in Supporting Multi-Instance GPUs (MIG) in Kubernetes.

Note: With a migStrategy of mixed, you will have additional resources available to you of the form nvidia.com/mig-<slice_count>g.<memory_size>gb that you can set in your pod spec to get access to a specific MIG device.
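The resource name pattern in the note above can be checked mechanically, for example when validating pod specs before submission. This is a small sketch under the assumption that both placeholders are plain decimal integers; the sample name nvidia.com/mig-1g.5gb is used only as an instance of the pattern, and the helper `parse_mig_resource` is hypothetical.

```python
import re

# nvidia.com/mig-<slice_count>g.<memory_size>gb, with both fields assumed to be integers.
MIG_RESOURCE = re.compile(r"^nvidia\.com/mig-(?P<slices>\d+)g\.(?P<memory>\d+)gb$")


def parse_mig_resource(name: str):
    """Return (slice_count, memory_size_gb) for a MIG resource name, or None if it does not match."""
    match = MIG_RESOURCE.match(name)
    if match is None:
        return None
    return int(match["slices"]), int(match["memory"])


print(parse_mig_resource("nvidia.com/mig-1g.5gb"))  # (1, 5)
print(parse_mig_resource("nvidia.com/gpu"))         # None
```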

The deviceListStrategy flag allows one to choose which strategy the plugin will use to advertise the list of GPUs allocated to a container. This is traditionally done by setting the NVIDIA_VISIBLE_DEVICES environment variable as described here. This strategy can be selected via the (default) envvar option. Support was recently added to the nvidia-container-toolkit to also allow passing the list of devices as a set of volume mounts instead of as an environment variable. This strategy can be selected via the volume-mounts option. Details for the rationale behind this strategy can be found here.

The deviceIDStrategy flag allows one to choose which strategy the plugin will use to pass the device ID of the GPUs allocated to a container. The device ID has traditionally been passed as the UUID of the GPU. This flag lets a user decide if they would like to use the UUID or the index of the GPU (as seen in the output of nvidia-smi) as the identifier passed to the underlying runtime. Passing the index may be desirable in situations where pods that have been allocated GPUs by the plugin get restarted with different physical GPUs attached to them.
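To make the difference between the two ID strategies concrete, the sketch below builds the value of NVIDIA_VISIBLE_DEVICES for the envvar device list strategy, once from UUIDs and once from indexes. It is an illustration only and not the plugin's implementation; the `Gpu` type, the `visible_devices` helper, and the UUID strings are placeholders invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    index: int  # position as listed by nvidia-smi
    uuid: str   # placeholder UUID, not a real device


def visible_devices(gpus, device_id_strategy: str = "uuid") -> str:
    """Build a comma-separated device list using either UUIDs or indexes."""
    if device_id_strategy == "uuid":
        ids = [gpu.uuid for gpu in gpus]
    elif device_id_strategy == "index":
        ids = [str(gpu.index) for gpu in gpus]
    else:
        raise ValueError(f"unknown deviceIDStrategy: {device_id_strategy}")
    return ",".join(ids)


allocated = [Gpu(0, "GPU-aaaa"), Gpu(2, "GPU-bbbb")]
print({"NVIDIA_VISIBLE_DEVICES": visible_devices(allocated, "uuid")})
print({"NVIDIA_VISIBLE_DEVICES": visible_devices(allocated, "index")})
```

With the index strategy the value stays the same even if a different physical GPU later occupies that position, which is the restart scenario the paragraph above mentions.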

Please take a look in the following values.yaml file to see the full set of overridable parameters for the device plugin.

https://github.com/NVIDIA/k8s-device-plugin/blob/v0.9.0/deployments/helm/nvidia-device-plugin/values.yaml

Installing via helm install from the nvidia-device-plugin helm repository

The preferred method of deployment is with helm install via the nvidia-device-plugin helm repository.

This repository can be installed as follows:

    $ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    $ helm repo update

Once this repo is updated, you can begin installing packages from it to deploy the nvidia-device-plugin daemonset. Below are some examples of deploying the plugin with the various flags from above.

Note: Since this is a pre-release version, you will need to pass the --devel flag to helm search repo in order to see this release listed.

Using the default values for the flags:

    $ helm install \
        --version=0.9.0 \
        --generate-name \
        nvdp/nvidia-device-plugin

Enabling compatibility with the CPUManager and running with a CPU request of 100m (0.1 core) and a memory limit of 512Mi:

    $ helm install \
        --version=0.9.0 \
        --generate-name \
        --set compatWithCPUManager=true \
        --set resources.requests.cpu=100m \
        --set resources.limits.memory=512Mi \
        nvdp/nvidia-device-plugin
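When installs like these are scripted, the argument list can be assembled from a mapping of chart values instead of being concatenated by hand. The sketch below only builds the command line and does not run helm. The helper `helm_install_args` is hypothetical; the flags it emits (--version, --generate-name, --set) are the ones used in the examples in this section.

```python
import shlex


def helm_install_args(version: str, values: dict, chart: str = "nvdp/nvidia-device-plugin") -> list:
    """Build the argv for a helm install of the device plugin chart with --set overrides."""
    args = ["helm", "install", f"--version={version}", "--generate-name"]
    for key, value in values.items():
        # Render Python booleans the way they are written in the README examples.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["--set", f"{key}={value}"]
    args.append(chart)
    return args


args = helm_install_args(
    "0.9.0",
    {
        "compatWithCPUManager": True,
        "resources.requests.cpu": "100m",
        "resources.limits.memory": "512Mi",
    },
)
print(shlex.join(args))
```

The resulting list could be handed to subprocess.run on a machine where helm is installed and the nvdp repository has been added.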

Use the legacy Daemonset API (only available on Kubernetes < v1.16):

    $ helm install \
        --version=0.9.0 \
        --generate-name \
        --set legacyDaemonsetAPI=true \
        nvdp/nvidia-device-plugin

Enabling compatibility with the CPUManager and the mixed migStrategy:

    $ helm install \
        --version=0.9.0 \
        --generate-name \
        --set compatWithCPUManager=true \
        --set migStrategy=mixed \
        nvdp/nvidia-device-plugin

Deploying via helm install with a direct URL to the helm package

If you prefer not to install from the nvidia-device-plugin helm repo, you can run helm install directly against the tarball of the plugin's helm package. The examples below install the same daemonsets as the method above, except that they use direct URLs to the helm package instead of the helm repo.

Using the default values for the flags:

    $ helm install \
        --generate-name \
        https://nvidia.github.com/k8s-device-plugin/stable/nvidia-device-plugin-0.9.0.tgz

Enabling compatibility with the CPUManager and running with a CPU request of 100m (0.1 core) and a memory limit of 512Mi:

    $ helm install \
        --generate-name \
        --set compatWithCPUManager=true \
        --set resources.requests.cpu=100m \
        --set resources.limits.memory=512Mi \
        https://nvidia.github.com/k8s-device-plugin/stable/nvidia-device-plugin-0.9.0.tgz

Use the legacy Daemonset API (only available on Kubernetes < v1.16):

    $ helm install \
        --generate-name \
        --set legacyDaemonsetAPI=true \
        https://nvidia.github.com/k8s-device-plugin/stable/nvidia-device-plugin-0.9.0.tgz

Enabling compatibility with the CPUManager and the mixed migStrategy:

    $ helm install \
        --generate-name \
        --set compatWithCPUManager=true \
        --set migStrategy=mixed \
        https://nvidia.github.com/k8s-device-plugin/stable/nvidia-device-plugin-0.9.0.tgz

Building and Running Locally

The next sections focus on building the device plugin locally and running it. They are intended purely for development and testing, and are not required by most users. They assume you are pinning to the latest release tag (i.e. v0.9.0), but can easily be modified to work with any available tag or branch.

With Docker

Build

Option 1, pull the prebuilt image from nvcr.io:

    $ docker pull nvcr.io/nvidia/k8s-device-plugin:v0.9.0
    $ docker tag nvcr.io/nvidia/k8s-device-plugin:v0.9.0 nvcr.io/nvidia/k8s-device-plugin:devel

Option 2, build without cloning the repository:

    $ docker build \
        -t nvcr.io/nvidia/k8s-device-plugin:devel \
        -f docker/Dockerfile \
        https://github.com/NVIDIA/k8s-device-plugin.git#v0.9.0

Option 3, if you want to modify the code:

    $ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
    $ docker build \
        -t nvcr.io/nvidia/k8s-device-plugin:devel \
        -f docker/Dockerfile \
        .

Run

Without compatibility for the CPUManager static policy:

    $ docker run \
        -it \
        --security-opt=no-new-privileges \
        --cap-drop=ALL \
        --network=none \
        -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
        nvcr.io/nvidia/k8s-device-plugin:devel

With compatibility for the CPUManager static policy:

    $ docker run \
        -it \
        --privileged \
        --network=none \
        -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
        nvcr.io/nvidia/k8s-device-plugin:devel --pass-device-specs

Without Docker

Build

    $ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build

Run

Without compatibility for the CPUManager static policy:

    $ ./k8s-device-plugin

With compatibility for the CPUManager static policy:

    $ ./k8s-device-plugin --pass-device-specs

Changelog

Version v0.9.0

- Fix bug when using CPUManager and the device plugin MIG mode not set to "none"
- Allow passing list of GPUs by device index instead of uuid
- Move to urfave/cli to build the CLI
- Support setting command line flags via environment variables

Version v0.8.2

- Update all dockerhub references to nvcr.io

Version v0.8.1

- Fix permission error when using NewDevice instead of NewDeviceLite when constructing MIG device map

Version v0.8.0

- Raise an error if a device has migEnabled=true but has no MIG devices
- Allow mig.strategy=single on nodes with non-MIG gpus

Version v0.7.3

- Update vendoring to include bug fix for nvmlEventSetWait_v2

Version v0.7.2

- Fix bug in dockerfiles for ubi8 and centos using CMD not ENTRYPOINT

Version v0.7.1

- Update all Dockerfiles to point to latest cuda-base on nvcr.io

Version v0.7.0

- Promote v0.7.0-rc.8 to v0.7.0

Version v0.7.0-rc.8

- Permit configuration of alternative container registry through environment variables.
- Add an alternate set of gitlab-ci directives under .nvidia-ci.yml
- Update all k8s dependencies to v1.19.1
- Update vendoring for NVML Go bindings
- Move restart loop to force recreate of plugins on SIGHUP

Version v0.7.0-rc.7

- Fix bug which only allowed running the plugin on machines with CUDA 10.2+ installed

Version v0.7.0-rc.6

- Add logic to skip / error out when unsupported MIG device encountered
- Fix bug treating memory as multiple of 1000 instead of 1024
- Switch to using CUDA base images
- Add a set of standard tests to the .gitlab-ci.yml file

Version v0.7.0-rc.5

- Add deviceListStrategyFlag to allow device list passing as volume mounts

Version v0.7.0-rc.4

- Allow one to override selector.matchLabels in the helm chart
- Allow one to override the updateStrategy in the helm chart

Version v0.7.0-rc.3

- Fail the plugin if NVML cannot be loaded
- Update logging to print to stderr on error
- Add best effort removal of socket file before serving
- Add logic to implement GetPreferredAllocation() call from kubelet

Version v0.7.0-rc.2

- Add the ability to set 'resources' as part of a helm install
- Add overrides for name and fullname in helm chart
- Add ability to override image related parameters in the helm chart
- Add conditional support for overriding securityContext in helm chart

Version v0.7.0-rc.1

- Added migStrategy as a parameter to select the MIG strategy to the helm chart
- Add support for MIG with different strategies {none, single, mixed}
- Update vendored NVML bindings to latest (to include MIG APIs)
- Add license in UBI image
- Update UBI image with certification requirements

Version v0.6.0

- Update CI, build system, and vendoring mechanism
- Change versioning scheme to v0.x.x instead of v1.0.0-betax
- Introduced helm charts as a mechanism to deploy the plugin

Version v0.5.0

- Add a new plugin.yml variant that is compatible with the CPUManager
- Change CMD in Dockerfile to ENTRYPOINT
- Add flag to optionally return list of device nodes in Allocate() call
- Refactor device plugin to eventually handle multiple resource types
- Move plugin error retry to event loop so we can exit with a signal
- Update all vendored dependencies to their latest versions
- Fix bug that was inadvertently always disabling health checks
- Update minimal driver version to 384.81

Version v0.4.0

- Fixes a bug with a nil pointer dereference around getDevices:CPUAffinity

Version v0.3.0

- Manifest is updated for Kubernetes 1.16+ (apps/v1)
- Adds more logging information

Version v0.2.0

- Adds the Topology field for Kubernetes 1.16+

Version v0.1.0

- If gRPC throws an error, the device plugin no longer ends up in a non responsive state.

Version v0.0.0

- Reversioned to SEMVER as device plugins aren't tied to a specific version of kubernetes anymore.

Version v1.11

- No change.

Version v1.10

- The device Plugin API is now v1beta1

Version v1.9

- The device Plugin API changed and is no longer compatible with 1.8
- Error messages were added

Issues and Contributing

Check out the Contributing document!

- You can report a bug by filing a new issue
- You can contribute by opening a pull request

Versioning

Before v1.10 the versioning scheme of the device plugin had to match exactly the version of Kubernetes. After the promotion of device plugins to beta this condition was no longer required. We quickly noticed that this versioning scheme was very confusing for users as they still expected to see a version of the device plugin for each version of Kubernetes.

This versioning scheme applies to the tags v1.8, v1.9, v1.10, v1.11, v1.12.

We have now changed the versioning to follow SEMVER. The first version following this scheme has been tagged v0.0.0.

Going forward, the major version of the device plugin will only change following a change in the device plugin API itself. For example, version v1beta1 of the device plugin API corresponds to version v0.x.x of the device plugin. If a new v2beta2 version of the device plugin API comes out, then the device plugin will increase its major version to 1.x.x.

As of now, the device plugin API for Kubernetes >= v1.10 is v1beta1. If you have a version of Kubernetes >= 1.10 you can deploy any device plugin version > v0.0.0.
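The compatibility rule in the last paragraph can be written down as a small check. This is a sketch of the stated rule only, under the assumptions that plugin tags are three-part SEMVER strings without pre-release suffixes and that Kubernetes versions look like v1.19.1 or 1.10; the function names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn 'v1.19.1' or '0.9.0' into a tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def can_deploy(kubernetes_version: str, plugin_version: str) -> bool:
    """Apply the README rule: Kubernetes >= 1.10 can run any plugin version > v0.0.0."""
    plugin = parse_version(plugin_version)
    if len(plugin) != 3:
        # Tags from the old scheme (v1.8 to v1.12) are outside this rule.
        raise ValueError("expected a SEMVER plugin tag such as v0.9.0")
    kubernetes = parse_version(kubernetes_version)
    return kubernetes[:2] >= (1, 10) and plugin > (0, 0, 0)


print(can_deploy("v1.19.1", "v0.9.0"))
print(can_deploy("v1.9.0", "v0.9.0"))
```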

Upgrading Kubernetes with the Device Plugin

Upgrading Kubernetes when you have a device plugin deployed doesn't require you to make any particular changes to your workflow. The API is versioned and is pretty stable (though it is not guaranteed to be non-breaking). Starting with Kubernetes version 1.10, you can use v0.3.0 of the device plugin to perform upgrades, and Kubernetes won't require you to deploy a different version of the device plugin. Once a node comes back online after the upgrade, you will see GPUs re-registering themselves automatically.

Upgrading the device plugin itself is a more complex task. It is recommended to drain GPU tasks as we cannot guarantee that GPU tasks will survive a rolling upgrade. However, we make best efforts to preserve GPU tasks during an upgrade.
