OCFS2
A CLUSTER FILE SYSTEM FOR LINUX
USER'S GUIDE FOR RELEASE 1.4

Sunil Mushran
July 2008

Copyright 2008, 2010 Oracle. All rights reserved.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts and no Back-Cover Texts. The GNU Free Documentation License is available at the website, http://www.gnu.org/licenses/fdl.txt.

TABLE OF CONTENTS

PREFACE
I INTRODUCTION
    Shared File Systems
    Clustered File Systems
II OVERVIEW
    History
    Development
    Support
III NEW FEATURES
    File System Compatibility
    Tools Compatibility
    Distribution Compatibility
    New File System Defaults
IV GETTING STARTED
    A DOWNLOAD AND INSTALL
    B CONFIGURE
    C FORMAT
    D MOUNT
V ADMINISTRATION
    A TUNING
    B FILE SYSTEM CHECK
    C OTHER TOOLS
VI ORACLE RDBMS
VII NOTES
    a) Balanced Cluster
    b) File Deletion
    c) Directory Listing
    d) Synthetic File Systems
    e) Distributed Lock Manager
    f) DLM Debugging
    g) NFS
    h) Limits
    i) System Objects
    j) Heartbeat, Quorum and Fencing
    k) Processes
    l) Future
ACKNOWLEDGEMENTS

PREFACE

The aim of this guide is to be a starting point for users looking to use the OCFS2 file system. First time users should be able to jump to the Getting Started chapter to learn how to configure, format and mount the volume. Users looking to upgrade from an earlier release must refer to the chapter on New Features and review the sections on compatibilities and defaults. Oracle database users, in addition to the above, must also review the chapter titled Oracle RDBMS.

All users should note that this document is best consumed along with the man pages for the various file system specific system calls and the OCFS2 tools. While this document explains the usage in an easy to understand style, it omits the actual parameter names. Users should refer to the man pages for those details.

This guide has been written specifically for the users of OCFS2 1.4, which is available only for the Enterprise distributions. Users of other distributions can also make use of this guide. The information included is up-to-date as of kernel version 2.6.25.

I INTRODUCTION

There are more than 30 disk-based file systems in the mainline Linux kernel alone. While the file systems can be categorized and sub-categorized in various ways, this document splits them into two broad ones: local and shared.

Local, as the name suggests, refers to file systems that can be accessed by a single server only. Most file systems fall under this category. Examples include EXT3/4 (general purpose workloads), XFS (enterprise workloads), JFFS2 (flash drives), ISOFS (cdroms), UFS (UNIX compatibility), NTFS/VFAT (Windows compatibility) and HFS (OS X compatibility).

Shared file systems, on the other hand, allow multiple servers to access them concurrently. This sharing allows users to share information easily, and allows administrators to consolidate storage for easier management and lower cost.

Shared File Systems

The most popular shared file system is the Network File System, NFS. It is relatively easy to set up, uses standard networking hardware and is available on almost all operating systems.

However, NFS provides weak meta-data and data cache coherencies between the client nodes. Meta-data, which in this case refers to the file name, size, modification time, etc., is allowed to go out of sync between client nodes for short periods. While NFS implementations provide options to force full synchronization, they are typically not used because of the performance penalty they impose.

This is not to suggest that the file system itself does not maintain its integrity at all times. It does. The only problem is that the client nodes running the application do not have a consistent view of the file system at all times.

This lack of strict cache coherency goes unnoticed in the normal case. Simple applications do not expect different client nodes to write to the same files and directories concurrently. It causes problems when running applications that expect a consistent view of the file system on all client nodes at all times. This becomes very apparent when one attempts to scale out an application that spawns multiple processes reading and writing to the same set of files.

Making these applications work on NFS requires either making the application work around the weak cache coherency or disabling the various caches. The former requires application changes; the latter imposes a significant performance penalty.

Another class of shared file systems is the clustered file system. Such file systems are not always easy to set up and typically require more than just standard networking hardware. Thus, they are not as popular as NFS. However, they usually provide strict meta-data and data cache coherencies, allowing users to scale out applications easily. Such file systems provide this feature by making use of a locking infrastructure that allows them to coordinate access to resources across all the server nodes.

Clustered File Systems

Clustered file systems have been around since the 1980s. In the past few years they have generated a lot of interest. The availability of affordable enterprise grade systems with Linux/x86 allows users to replace high end SMP boxes with a cluster of quad-core machines. Clustering provides affordable scalability and availability. A node death does not bring down the entire system as it would a big SMP. Instead it allows the system to continue to operate in a degraded mode until the dead node restarts and rejoins the cluster.

Two popular designs in this class of file systems are distributed parallel (LUSTRE) and shared disk (OCFS2, GFS2).

Distributed parallel file systems have the meta-data and data distributed across multiple servers. They scale to 1000+ clients and are very popular in high performance computing. These file systems store the file data in the local disks attached to each server node. Some of these are used to store the file system meta-data, and others the data. The data is typically striped across multiple server nodes to provide high data throughput. While such file systems provide high scalability, the hardware requirements make them unsuitable for small to medium sized clusters.

Shared disk file systems, on the other hand, scale to 100+ nodes. As the name suggests, such file systems require a shared disk (SAN). All server nodes in the cluster must be able to perform I/O directly and concurrently to the disk. A cluster interconnect that provides a low latency network transport is used for communication in between the server nodes.

In this design, the I/Os are issued directly to the SAN by each server node and thus do not suffer from the single server bottleneck inherent in NFS. The cluster interconnect is used by the server nodes to coordinate read/write access to the SAN, providing both on-disk data integrity and strict cache coherencies.

Shared disk clustered file systems come in two flavors: asymmetric and symmetric. Asymmetric refers to file systems that allow each server node to read/write data directly to the SAN, but direct all the meta-data change operations to a single node. Symmetric, on the other hand, allows each node to read and write both meta-data and data directly to the SAN.

OCFS2 is a symmetric shared disk cluster file system.

II OVERVIEW

OCFS2 is a cluster file system designed for use in a shared disk environment. It provides both high performance and high availability. It can also be used in a non-clustered environment, especially ones that are looking for the flexibility to scale out in the future.

Because it provides local file system semantics, it can also be used with applications that are not cluster-aware. They won't make use of parallel I/O, but they will be able to make use of the fail-over facilities. For example, OCFS2 is currently used to provide scalable web servers, fail-over mail servers, fail-over virtual machine image hosting, scalable file-servers, etc.

Some of the notable features of the file system are:

Variable Block sizes
    Supports block sizes ranging from 512 bytes to 4KB.

Extent-based allocations
    Tracks the allocated space in ranges of blocks making it especially efficient for storing large files.

Flexible Allocation
    Supports sparse files and unwritten extents for higher performance and efficient storage. Existing files can have holes punched for even more efficiency.

Journaling
    Supports both ordered and writeback data journaling modes to provide file system consistency in the event of power failure or system crash.

Endian and Architecture neutral
    Allows concurrent mounts on 32-bit and 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64) architectures.

In-built Cluster-stack with DLM
    Includes an easy to configure, in-kernel cluster-stack with a distributed lock manager.

Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os
    Supports all modes of I/Os for maximum flexibility and performance.

Large Inodes
    Block-sized inodes allow it to store small files in the inode itself.

Comprehensive Tools Support
    Provides a familiar EXT3-style tool-set that uses similar parameters for ease-of-use.

History

OCFS2 file system development began in 2003 as a follow up to OCFS. OCFS was targeted exclusively as a data store for Oracle's Real Application Clustered database product (RAC). The goals for the new project were to maintain the raw-like I/O throughput for the database, be POSIX compliant, and provide near local file system performance for meta-data operations.

OCFS2 was intended from the start to be included in the mainline Linux kernel. At the time, there were no clustered file systems in the kernel. OCFS2 v1.0 was released in August 2005. Shortly thereafter, it was merged into Andrew Morton's -mm Linux kernel tree. OCFS2 was merged into the mainline Linux kernel tree in January 2006. The 2.6.16 kernel, released that March, included the file system.

OCFS2 v1.2 was released in April 2006. It targeted the Enterprise Linux distributions, SLES9 from Novell and RHEL4 from Red Hat. The x86, x86_64, ia64, ppc64 and s390x architectures were all supported.

As of July 2008, OCFS2 is available with all three Enterprise Linux distributions, SLES, RHEL and Oracle's EL. It is also available with other distributions, notably Ubuntu, openSUSE, Fedora Core and Debian. The version of the file system on these distributions is from whichever mainline Linux kernel the distribution ships.

Even though the version of the file system available for the Enterprise and other distributions is not the same, the file system maintains on-disk compatibility across all versions.

Development

OCFS2 development began as a project in the Linux Kernel development group in Oracle Corporation. However, since its inclusion in the mainline Linux kernel, it has attracted patch submissions from over 70 developers, and thus is on track to become a community project.

In order to satisfy the needs of our users, who want a stable file system on a Linux distribution of their choice, and the developers, who want a consistent environment for developing new features, the development group follows a few basic ground rules:

1. All new features are first included in the mainline Linux kernel tree.
2. All bug fixes are applied to all active kernel trees.

Active kernel trees include the currently supported Enterprise kernels, the current mainline tree and the kernel trees maintained by the Stable kernel team. The stable trees are tracked by most non-Enterprise Linux kernel distributions.

The source of the file system is made available with the Linux kernel and can be downloaded from http://kernel.org/. The sources for Enterprise kernels are available at http://oss.oracle.com/projects/ocfs2/.

The source of the OCFS2 file system is available under the GNU General Public License (GPL), version 2.

Support

Official support for the file system is typically provided by the distribution. For example, Novell supports the file system on SLES whereas Oracle supports it on EL. Oracle also extends support of the file system to RHEL for use with Oracle's database product.

For other distributions, including users of the mainline Linux kernel, Oracle also provides email support via the ocfs2-users@oss.oracle.com mailing list.

III NEW FEATURES

The 1.4 release provides the features that have been steadily accumulating in the mainline Linux kernel tree for over two years. The list of new features added since the OCFS2 1.2 release is as follows:

1. Ordered Journal Mode

This new default journal mode (mount option data=ordered) forces the file system to flush file data to disk before committing the corresponding meta-data. This flushing ensures that the data written to newly allocated regions will not be lost due to a file system crash. While this feature removes the ever-so-small probability of stale or null data appearing in a file after a crash, it does so at the expense of some performance. Users can revert to the older journal mode by mounting with the data=writeback mount option. It should be noted that file system meta-data integrity is preserved by both journaling modes.

2. File Attribute Support

Allows a user to use the chattr(1) command to set and clear EXT2-style file attributes such as the immutable bit. lsattr(1) can be used to view the current set of attributes.

3. Performance Enhancements

Enhances performance by either reducing the numbers of I/Os or by doing them asynchronously.

    Directory Readahead
        Directory operations asynchronously read the blocks that may get accessed in the future.

    File Lookup
        Improves cold-cache stat(2) times by cutting the required amount of disk I/O in half.

    File Remove and Rename
        Replaces broadcast file system messages with DLM locks for unlink(2) and rename(2) operations. This improves node scalability, as the number of messages does not grow with the number of nodes in the cluster.

4. Splice I/O

Adds support for the new splice(2) system call. This allows for efficient copying between file descriptors by moving the data in kernel.

5. Access Time Updates

Access times are now updated consistently and are propagated throughout the cluster. Since such updates can have a negative performance impact, the file system allows users to tune it via the following mount options:

    atime_quantum=<nrsecs>
        Defaults to 60 seconds. OCFS2 will not update atime unless this number of seconds has passed since the last update. Set to zero to always update it.

    noatime
        This standard mount option turns off atime updates completely.

    relatime
        This is another standard mount option (added in Linux v2.6.20) supported by OCFS2. Relative atime only updates the atime if the previous atime is older than the mtime or ctime. This is useful for applications that only need to know that a file has been read since it was last modified.

Additionally, all time updates in the file system have nanosecond resolution.

6. Flexible Allocation

The file system now supports some advanced features that are intended to allow users more control over file data allocation. These features entail an on-disk change.

    Sparse File Support
        It adds the ability to support holes in files. This allows the ftruncate(2) system call to efficiently extend files. The file system can postpone allocating space until the user actually writes to those clusters.

    Unwritten Extents
        It adds the ability for an application to request a range of clusters to be pre-allocated, but not initialized, within a file. Pre-allocation allows the file system to optimize the data layout with fewer, larger extents. It also provides a performance boost, delaying initialization until the user writes to the clusters. Users can access these features via an ioctl(2), or via fallocate(2) on current kernels.

    Punching Holes
        It adds the ability for an application to remove arbitrary allocated regions within a file. Creating holes, essentially. This could be more efficient if a user can avoid zeroing the data. Users can access these features via an ioctl(2), or via fallocate(2) on later kernels.

7. Shared Writeable mmap(2)

Shared writeable memory mappings are fully supported now on OCFS2.

8. Inline Data

This feature makes use of OCFS2's large inodes by storing the data of small files and directories in the inode block itself. This saves space and can have a positive impact on cold-cache directory and file operations. Data is transparently moved out to an extent when it no longer fits inside the inode block. This feature entails an on-disk change.

9. Online File System Resize

Users can now grow the file system without having to unmount it. This feature requires a compatible clustered logical volume manager. Compatible volume managers will be announced when support is available.

10. Clustered flock(2)

The flock(2) system call is now cluster-aware. File locks taken on one node from user-space will interact with those taken on other nodes. All flock(2) options are supported, including the kernel's ability to cancel a lock request when an appropriate kill signal is received.

(Note: Support for clustered POSIX file locks, also known as lockf(3) or fcntl(2), has not yet been added. We hope to have that available in the near term.)

File System Compatibility

OCFS2 1.4 is fully compatible with the 1.2 on-disk format. The newer version can mount volumes from the older version as-is. However, the network protocol is not compatible between the two versions. Concurrent mounts from nodes running the different versions are not supported.

The development team ensures that all versions of OCFS2 are backward compatible on-disk. Users can upgrade OCFS2 to a newer release knowing it can mount existing volumes.

The same could not be said of the network protocol in the past. We have recently improved the software infrastructure to allow us to maintain network compatibility. This gives us the flexibility to make changes in the protocol. In the new scheme, the cluster keeps track of the active protocol version. Newer nodes that understand the active version may join at the older level. OCFS2 1.4 is not network compatible with OCFS2 1.2, but this change means that most future releases will be compatible with OCFS2 1.4.

Though OCFS2 maintains on-disk compatibility, new features are not enabled by default. Features requiring on-disk changes need to be enabled explicitly. tunefs.ocfs2(8) allows users to enable the features on existing volumes. For new volumes, mkfs.ocfs2(8) can enable them.

Once new features are enabled, the volume cannot be mounted by older versions that do not understand the feature. If you want to mount the volume on an older version, tunefs.ocfs2(8) can also toggle the features off. For more on this, refer to the Format and Tuning sections and the mkfs.ocfs2(8) and tunefs.ocfs2(8) man pages.

Tools Compatibility

The latest version of ocfs2-tools supports all existing versions of the file system.

Distribution Compatibility

OCFS2 1.4 is a back-port of the file system included in the mainline Linux kernel version 2.6.25. It has been written to work only with (RH)EL5 Update 2 and SLES10 SP2. It will continue to work with the newer (RH)EL5 and SLES10 kernel updates but will not work on older releases.

New File System Defaults

The support for sparse files and unwritten extents is activated by default when using mkfs.ocfs2(8) v1.4. Users wishing to retain full compatibility with older file system software must specify --fs-feature-level=max-compat to mkfs.ocfs2(8).

The other change in the defaults concerns the journaling mode. While OCFS2 1.2 supported the writeback data journaling mode, OCFS2 1.4 adds support for the ordered data journaling mode and makes it the default. Users wishing to keep using the writeback mode should mount the volume with the data=writeback option.

In ordered mode, the file system flushes the file data to disk before committing metadata changes. In writeback mode, no such write ordering is preserved. The writeback mode still guarantees internal file system integrity, and can have better overall throughput than ordered mode. However, stale data can appear in files after a crash and a subsequent journal recovery.

IV GETTING STARTED

The OCFS2 software is split into two components, namely, kernel and user-space. The kernel component includes the core file system and the cluster stack. The user-space component provides the utilities to format, tune and check the file system.

Software Packaging

For almost all distributions, the kernel file system component is bundled with the kernel package. This is ideal, as upgrading the kernel automatically upgrades the file system. Novell's SLES distribution uses this approach, as do most non-enterprise Linux distributions.

For Red Hat's RHEL and Oracle's EL distributions, the kernel component is available as a separate package. This package needs to be installed when upgrading the kernel.

In the distributions in which the kernel component is provided separately, care should be taken to install the appropriate package. This is not to say that installing an incorrect package will cause harm. It won't. But it will not work.

An example of OCFS2's kernel component package is:

    ocfs2-2.6.18-92.el5PAE-1.4.1-1.el5.i686.rpm

The package name can be broken down as follows:

    ocfs2-2.6.18-92.el5PAE    Package name
    1.4.1-1                   Package version
    el5                       Distribution
    i686                      Architecture

The package name includes the kernel version. To learn the appropriate package name for your running kernel do:

    $ echo ocfs2-`uname -r`

The user-space component includes two packages: ocfs2-tools (command line interface) and ocfs2console (graphical user interface). These packages are specific to a distribution and architecture only and have no kernel dependency. For example, ocfs2-tools-1.4.1-1.el5.i386.rpm is for the EL5/x86 platform regardless of the exact kernel version in use.
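
The package naming scheme described above can be sketched in shell. This is only an illustration: the function names ocfs2_kernel_pkg and split_pkg are hypothetical helpers, not part of ocfs2-tools, and the splitting assumes the file name follows the name-version-release.distribution.architecture.rpm layout of the example package.

```shell
# Hypothetical helper: print the kernel component package name for a given
# kernel release, defaulting to the running kernel (same idea as the echo
# command in the text).
ocfs2_kernel_pkg() {
    printf 'ocfs2-%s\n' "${1:-$(uname -r)}"
}

# Hypothetical helper: split an RPM file name into the parts listed in the
# breakdown table, using only POSIX parameter expansion.
split_pkg() {
    base=${1%.rpm}                      # drop the .rpm suffix
    arch=${base##*.}; base=${base%.*}   # text after the last dot
    dist=${base##*.}; base=${base%.*}   # next dot-separated field
    release=${base##*-}; base=${base%-*}
    version=${base##*-}; name=${base%-*}
    printf 'name=%s version=%s-%s dist=%s arch=%s\n' \
        "$name" "$version" "$release" "$dist" "$arch"
}

ocfs2_kernel_pkg "2.6.18-92.el5PAE"
split_pkg "ocfs2-2.6.18-92.el5PAE-1.4.1-1.el5.i686.rpm"
```

Passing an explicit kernel release is useful when preparing packages for a kernel that is installed but not yet booted.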

A DOWNLOAD AND INSTALL

The download and install procedures are specific to each distribution. The procedures for some of the more popular distributions are listed below.

Oracle's Enterprise Linux

Both the file system and the tools need to be installed.

    $ up2date --install ocfs2-tools ocfs2console
    $ up2date --install ocfs2-`uname -r`

Novell's SUSE Linux Enterprise Server

Only the tools need to be installed; the file system is bundled with the kernel.

    $ zypper install ocfs2-tools ocfs2console

Canonical's Ubuntu and other Debian-based distributions

Only the tools need to be installed; the file system is bundled with the kernel.

    $ apt-get install ocfs2-tools ocfs2console

Red Hat Enterprise Linux

Both the file system and tools packages need to be downloaded from http://oss.oracle.com/. Once downloaded, the packages can be installed using the rpm utility.

    $ rpm -Uvh ocfs2-tools-1.4.1-1.el5.i386.rpm
    $ rpm -Uvh ocfs2console-1.4.1-1.el5.i386.rpm
    $ rpm -Uvh ocfs2-2.6.18-92.el5PAE-1.4.1-1.el5.i686.rpm

B CONFIGURE

OCFS2 volumes can be mounted as clustered or local (single-node) volumes. Users looking to mount volumes locally can skip the cluster configuration and go straight to formatting. Everyone else will need to configure the O2CB cluster.

The O2CB cluster stack requires cluster layout and cluster timeout configurations.

O2CB Cluster Layout Configuration

The cluster layout is specified in /etc/ocfs2/cluster.conf. It is easy to populate and propagate this configuration file using ocfs2console(8), but one can also do it manually if care is taken to format the file correctly.

While the console utility is intuitive to use, there are a few points to keep in mind.

1. The node name needs to match the hostname. It does not need to include the domain name. For example, the node name for appserver.oracle.com can be appserver.

2. The IP address need not be the one associated with that hostname. That is, any valid IP address on that node can be used. O2CB will not attempt to match the node name (hostname) with the specified IP address.

For best performance, the use of a private interconnect (lower latency) is highly recommended.

The one limitation of the console utility is that it cannot change the IP address and port of existing nodes. Such modifications require stopping the cluster everywhere and manually editing the configuration file on all nodes before restarting it.

Always ensure that the cluster.conf is the same on all nodes in the cluster.

Users who have configured the cluster with ocfs2console(8) can skip to the timeout configuration.

The configuration file is in a stanza format with two types of stanzas: cluster and node. A typical cluster.conf will have one cluster stanza and multiple node stanzas.

The Cluster stanza has two parameters:

    node_count    Total number of nodes in the cluster
    name          Name of the cluster

The Node stanza has five parameters:

    ip_port       IP port#
    ip_address    IP address (preferably private interface)
    number        Unique node number from 0-254
    name          Hostname
    cluster       Name of the cluster

Users populating cluster.conf manually should follow the format strictly. The stanza header must start at the first column and end with a colon, stanza parameters must start after a tab, and a blank line must separate each stanza. Take care to avoid any stray whitespace.
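
Because the layout rules are strict, generating the stanzas with a script can help avoid stray whitespace. The following is a minimal sketch, assuming the usual "parameter = value" spelling of each line; cluster_stanza and node_stanza are hypothetical helper names, and the output would still need to be reviewed and copied to every node by hand.

```shell
# Hypothetical helper: emit a cluster stanza.
# $1 = cluster name, $2 = node count
cluster_stanza() {
    # Header at column one ending in a colon, tab-indented parameters,
    # and a trailing blank line to separate the next stanza.
    printf 'cluster:\n\tnode_count = %s\n\tname = %s\n\n' "$2" "$1"
}

# Hypothetical helper: emit a node stanza.
# $1 = cluster name, $2 = node name, $3 = node number,
# $4 = IP address, $5 = IP port
node_stanza() {
    printf 'node:\n\tip_port = %s\n\tip_address = %s\n\tnumber = %s\n\tname = %s\n\tcluster = %s\n\n' \
        "$5" "$4" "$3" "$2" "$1"
}

# Print a one-node layout to standard output for inspection.
cluster_stanza webcluster 1
node_stanza webcluster node7 7 192.168.0.107 7777
```

Redirecting the combined output to a scratch file and comparing it across nodes is one way to check that every node sees the same layout.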

Example

The following is a sample /etc/ocfs2/cluster.conf that describes a three node cluster.

    cluster:
            node_count = 3
            name = webcluster

    node:
            ip_port = 7777
            ip_address = 192.168.0.107
            number = 7
            name = node7
            cluster = webcluster

    node:
            ip_port = 7777
            ip_address = 192.168.0.106
            number = 6
            name = node6
            cluster = webcluster

    node:
            ip_port = 7777
            ip_address = 192.168.0.110
            number = 10
            name = node10
            cluster = webcluster

O2CB Cluster Timeout Configuration

O2CB has four configurable cluster timeouts that are specified in /etc/sysconfig/o2cb. Using the o2cb init script, one can configure the timeouts as follows:

    $ service o2cb configure
    Configuring the O2CB driver.

    This will configure the on-boot properties of the O2CB driver.
    The following questions will determine whether the driver is loaded on
    boot. The current values will be shown in brackets ('[]'). Hitting
    <ENTER> without typing an answer will keep that current value. Ctrl-C
    will abort.

    Load O2CB driver on boot (y/n) [y]:
    Cluster stack backing O2CB [o2cb]:
    Cluster to start on boot (Enter "none" to clear) [ocfs2]: webcluster
    Specify heartbeat dead threshold (>=7) [31]:
    Specify network idle timeout in ms (>=5000) [30000]:
    Specify network keepalive delay in ms (>=1000) [2000]:
    Specify network reconnect delay in ms (>=2000) [2000]:
    Writing O2CB configuration: OK
    Cluster webcluster already online

The O2CB cluster stack uses these timings to determine whether a node is dead or alive.
While the use of default values is recommended, users can experiment with other values if the defaults are causing spurious fencing.

The O2CB cluster timeouts are:

Heartbeat Dead Threshold
    The disk heartbeat timeout is the number of two-second iterations before a node is considered dead. The exact formula used to convert the timeout in seconds to the number of iterations is:

        O2CB_HEARTBEAT_THRESHOLD = (((timeout in seconds) / 2) + 1)

    For example, to specify a 60 sec timeout, set it to 31. For 120 secs, set it to 61. The default for this timeout is 60 secs (O2CB_HEARTBEAT_THRESHOLD = 31).

Network Idle Timeout
    The network idle timeout specifies the time in milliseconds before a network connection is considered dead. It defaults to 30000 ms.

Network Keepalive Delay
    The network keepalive specifies the maximum delay in milliseconds before a keepalive packet is sent to another node. If the node is alive, it is expected to respond. It defaults to 2000 ms.

Network Reconnect Delay
    The network reconnect delay specifies the minimum delay in milliseconds between connection attempts. It defaults to 2000 ms.
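
The heartbeat threshold formula above is simple integer arithmetic and can be sketched in shell. The function names below are hypothetical and exist only for this illustration; the inverse conversion is derived from the same formula and assumes an even timeout.

```shell
# Hypothetical helper: convert a disk heartbeat timeout in seconds into the
# O2CB_HEARTBEAT_THRESHOLD value, using the formula given in the text.
heartbeat_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Hypothetical helper: convert a threshold back into seconds.
heartbeat_timeout_secs() {
    echo $(( ($1 - 1) * 2 ))
}

heartbeat_threshold 60       # the documented default timeout
heartbeat_threshold 120
heartbeat_timeout_secs 31
```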

To view the currently active cluster timeout values, do:

    $ service o2cb status
    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster webcluster: Online
    Heartbeat dead threshold = 31
    Network idle timeout: 30000
    Network keepalive delay: 2000
    Network reconnect delay: 2000
    Checking O2CB heartbeat: Not active

The o2cb init script has additional commands to manage the cluster. See help for the list of commands.

Kernel Configuration

Two sysctl values need to be set for O2CB to function properly. The first, panic_on_oops, must be enabled to turn a kernel oops into a panic. If a kernel thread required for O2CB to function crashes, the system must be reset to prevent a cluster hang. If it is not set, another node may not be able to distinguish whether a node is unable to respond or slow to respond.

The other related sysctl parameter is panic, which specifies the number of seconds after a panic that the system will be auto-reset. Setting this parameter to zero disables auto-reset; the cluster will require manual intervention. This is not preferred in a cluster environment.

To manually enable panic on oops and set a 30 sec timeout for reboot on panic, do:

    $ echo 1 > /proc/sys/kernel/panic_on_oops
    $ echo 30 > /proc/sys/kernel/panic

To enable the above on every reboot, add the following to /etc/sysctl.conf:

    kernel.panic_on_oops = 1
    kernel.panic = 30

OS Configuration

O2CB also requires two RHEL components, SELINUX and iptables, to be disabled or modified. We will look into supporting SELINUX once extended attribute support has been added to the file system. The firewall, if enabled, must allow traffic on the private network interface.
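
Returning to the sysctl settings in the Kernel Configuration section: when scripting the change to /etc/sysctl.conf, it helps to avoid appending the same key twice. The following is a sketch only. The helper name set_sysctl_line is hypothetical, and the demonstration writes to a temporary file rather than the real configuration file.

```shell
# Hypothetical helper: set "key = value" in a sysctl.conf-style file,
# replacing any earlier assignment of exactly the same key.
# $1 = file, $2 = key, $3 = value
set_sysctl_line() {
    conf=$1 key=$2 value=$3
    touch "$conf"
    # Copy every line whose key (text before '=', blanks removed) differs.
    awk -F= -v k="$key" \
        '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' \
        "$conf" > "$conf.tmp"
    mv "$conf.tmp" "$conf"
    printf '%s = %s\n' "$key" "$value" >> "$conf"
}

# Demonstration against a scratch file, not /etc/sysctl.conf.
demo_conf=$(mktemp)
set_sysctl_line "$demo_conf" kernel.panic_on_oops 1
set_sysctl_line "$demo_conf" kernel.panic 30
cat "$demo_conf"
rm -f "$demo_conf"
```

The key comparison is an exact string match, so setting kernel.panic does not disturb the kernel.panic_on_oops line.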

C FORMAT

This section assumes that the user has enabled the O2CB cluster stack, as it could be required to re-format an existing OCFS2 volume. The cluster stack, however, need not be enabled if the user intends to use the volume only as a local file system.

Like any other file system, the volume needs to be formatted before use. As formatting is a process of initializing a volume, it should be done with care. This is especially true in the OCFS2 environment; the volumes being formatted are shared resources and could be in use on another node.

mkfs.ocfs2(8) has checks in place to prevent overwriting volumes that are in use across the cluster. However, the checks only prevent overwriting existing OCFS2 volumes and will not prevent overwriting, say, an existing in-use ext3 volume. Thus, care should always be taken before formatting any volume.

In addition, while it is not required, it is preferred that the volume being formatted is partitioned. Not only are partitioned volumes less likely to be reused by mistake, some features like mount-by-label only work with partitioned volumes. For more on partitioning, check the man pages of fdisk(8) or parted(8).

Users can format using ocfs2console(8) or the command line tool, mkfs.ocfs2(8). This document explains the usage and the various options to mkfs.ocfs2(8) but expects the user to refer to the man page for exact parameter names.

When run without any options, mkfs.ocfs2(8) heuristically determines the values of the various options. Users can help by providing broad hints, like file system type, that differentiate based on typical usages like database (fewer, large sized files) and mail-server (lots-of, small sized files). Other than the block and cluster sizes, all other options can be modified by tunefs.ocfs2(8).

The main options for mkfs.ocfs2(8) are:

Block Size
    The block size is the unit of space allocated for meta-data by the file system. OCFS2 supports block sizes of 512, 1K, 2K and 4K bytes. The block size cannot be changed after the format. For almost all uses, a 4K-block size is recommended. A 512 byte block size is never recommended.

Cluster Size
    The cluster size is the unit of space allocated for file data. All data allocation is in multiples of the cluster size. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M bytes. For almost all uses, a 4K size is recommended. However, volumes storing database files should not use a value smaller than the database block size and are free to use a larger value.

Node Slots
    A node slot refers to a set of system files, like a journal, that are used exclusively by one node. This limits the number of nodes that can concurrently mount a volume to the number of node slots it has. This number can later be increased or decreased using tunefs.ocfs2(8). Different volumes can have a different number of node slots.

Journal Options
    OCFS2 uses the write-ahead journal, JBD, with a user-configurable size. If left unspecified, the tool determines the appropriate value based on the specified file system type. The defaults are 64MB for datafiles and 256MB for mail.

Volume Label
    Labeling volumes is recommended for easier management. This is especially helpful in a clustered environment in which nodes may detect the devices in different order leading to the same device having different names on different nodes. Labeling allows consistent naming for OCFS2 volumes across a cluster.

Mount Type
    Valid types are cluster and local, with the former also being the default. Specify local if you intend to use the file system on one node only.

File System Type
    Valid types are mail and datafiles. mail refers to its use as a mail server store that has the usage characteristic of lots of small files, requiring lots of meta-data changes that in turn are benefited by using a larger journal. datafiles, on the other hand, suggests fewer large files, requiring fewer meta-data changes, thus not benefiting from a large journal.

File System Features
    Allows users to enable or disable certain file system features, including sparse files, unwritten extents and back-up super blocks. Refer to the man pages for the current list of supported features.

File System Feature Level
    Valid values are max-compat, default and max-features. max-compat enables only those features that are understood by older versions of the file system software. max-features is at the other end of the spectrum. It enables all features that the file system software currently supports. default currently enables support for sparse files and unwritten extents.
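
The size constraints listed under Block Size and Cluster Size can be checked before running the format. The following is a sketch with hypothetical helper names; sizes are given in bytes, and the database rule simply encodes the statement that the cluster size should not be smaller than the database block size.

```shell
# Hypothetical helper: succeed if $1 (bytes) is a supported block size.
valid_block_size() {
    case "$1" in
        512|1024|2048|4096) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeed if $1 (bytes) is a supported cluster size,
# that is, a power of two from 4K up to 1M.
valid_cluster_size() {
    size=4096
    while [ "$size" -le 1048576 ]; do
        if [ "$1" = "$size" ]; then
            return 0
        fi
        size=$(( size * 2 ))
    done
    return 1
}

# Hypothetical helper: succeed if cluster size $1 suits a database whose
# block size is $2 (both in bytes).
cluster_ok_for_db() {
    valid_cluster_size "$1" && [ "$1" -ge "$2" ]
}

if cluster_ok_for_db 131072 8192; then
    echo "128K clusters are acceptable for an 8K database block size"
fi
```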

Examples

To format with all defaults, including heuristically determined block and cluster sizes, default number of node-slots, and the current default feature level, do:

    $ mkfs.ocfs2 -L "myvolume" /dev/sda1
    mkfs.ocfs2 1.4.1
    Filesystem label=myvolume
    Block size=4096 (bits=12)
    Cluster size=4096 (bits=12)
    Volume size=53687074816 (13107196 clusters) (13107196 blocks)
    407 cluster groups (tail covers 11260 clusters, rest cover 32256 clusters)
    Journal size=268435456
    Initial number of node slots: 4
    Creating bitmaps: done
    Initializing superblock: done
    Writing system files: done
    Writing superblock: done
    Writing backup superblock: 3 block(s)
    Formatting Journals: done
    Writing lost+found: done
    mkfs.ocfs2 successful

The values chosen include 4K block and cluster sizes, 4 node slots with 256 MB journal each. The list of file system features enabled can be viewed using tunefs.ocfs2(8).

    $ tunefs.ocfs2 -q -Q "All Features: %M %H %O\n" /dev/sda1
    All Features: BackupSuper SparseAlloc UnwrittenExtents

To format volume for exclusive use as a database store, do:

    $ mkfs.ocfs2 -T datafiles -L "mydatavol" /dev/sda1
    mkfs.ocfs2 1.4.1
    Overwriting existing ocfs2 partition.
    Proceed (y/N): y
    Filesystem Type of datafiles
    Filesystem label=mydatavol
    Block size=4096 (bits=12)
    Cluster size=131072 (bits=17)
    Volume size=53686960128 (409599 clusters) (13107168 blocks)
    13 cluster groups (tail covers 22527 clusters, rest cover 32256 clusters)
    Journal size=33554432
    Initial number of node slots: 4
    Creating bitmaps: done
    Initializing superblock: done
    Writing system files: done
    Writing superblock: done
    Writing backup superblock: 3 block(s)
    Formatting Journals: done
    Writing lost+found: done
    mkfs.ocfs2 successful

mkfs.ocfs2(8) selected a larger cluster size, and the journal size was limited to 32M.
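
The figures in the datafiles output can be cross-checked with shell arithmetic. The relationships below are inferred from the sample output itself (byte counts divided by unit sizes, and groups of 32256 clusters); they are a reading aid, not a statement of how mkfs.ocfs2(8) computes its layout.

```shell
# Figures copied from the "datafiles" sample output.
volume_bytes=53686960128
cluster_bytes=131072
block_bytes=4096
journal_bytes=33554432
clusters_per_group=32256    # "rest cover 32256 clusters"

clusters=$(( volume_bytes / cluster_bytes ))
blocks=$(( volume_bytes / block_bytes ))
# Round up: the last (tail) group may be only partly filled.
groups=$(( (clusters + clusters_per_group - 1) / clusters_per_group ))
tail_clusters=$(( clusters - (groups - 1) * clusters_per_group ))
journal_mib=$(( journal_bytes / 1024 / 1024 ))

echo "clusters=$clusters blocks=$blocks groups=$groups tail=$tail_clusters journal=${journal_mib}M"
```

The same arithmetic applied to the first example (4096-byte clusters, 53687074816 bytes) gives 13107196 clusters in 407 groups with a tail of 11260, matching that output as well.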
23Toformatthevolumewithcustomvalues(e.
g.
4Kblockandclustersizes,8node-slotswith128MBjournalseachandmaximumcompatibilitywitholderfilesystemsoftware)do:#mkfs.
ocfs2-b4K-C4K-N8-L"ocfs2vol"-Jsize=128M\--fs-feature-level=max-compat/dev/sda1mkfs.
ocfs21.
4.
1Overwritingexistingocfs2partition.
Proceed(y/N):yFilesystemlabel=ocfs2volBlocksize=4096(bits=12)Clustersize=4096(bits=12)Volumesize=53687074816(13107196clusters)(13107196blocks)407clustergroups(tailcovers11260clusters,restcover32256clusters)Journalsize=134217728Initialnumberofnodeslots:8Creatingbitmaps:doneInitializingsuperblock:doneWritingsystemfiles:doneWritingsuperblock:doneWritingbackupsuperblock:3block(s)FormattingJournals:doneWritinglost+found:donemkfs.
ocfs2successfulDMOUNTThissectionassumestheuserhasformattedthevolume.
If it is a clustered volume, it is assumed that the O2CB cluster has been configured and started.

Commands to mount and umount OCFS2 volumes are similar to those for other file systems.

    $ mount /dev/sda1 /dir
    …
    $ umount /dir

Users mounting a clustered volume should be aware of the following:

1. The cluster stack must be online for a clustered mount to succeed.
2. The clustered mount operation is not instantaneous; it must wait for the node to join the DLM domain.
3. Likewise, clustered umount is also not instantaneous, as it involves migrating all mastered lock-resources to the remaining nodes.

If the mount fails, detailed errors can be found via dmesg(8). These might include incorrect cluster configuration (say, missing node or incorrect IP address), or a firewall interfering with O2CB network traffic.

To auto-mount volumes on startup, the file system tools include an ocfs2 init service that runs after the o2cb init service has started the cluster. The ocfs2 init service mounts all OCFS2 volumes listed in /etc/fstab.

Mount Options

OCFS2 supports many mount options that are supported by other Linux file systems. The list of supported mount options is as follows:

_netdev
The file system resides on a device that requires network access (used to prevent the system from attempting to mount these file systems until the network has been enabled on the system). mount.ocfs2(8) transparently appends this option during mount. However, users mounting the volume via /etc/fstab must explicitly specify this mount option. This prevents the system from mounting the volume until after the network has been enabled. Conversely, during shutdown, it instructs the system to unmount the volume before shutting down the network.

atime_quantum=nrsecs
This instructs the file system to limit the granularity of atime updates to nrsecs seconds. The default is 60 secs. A low value will hurt performance as atime is updated on every read and write access. To always update atime, set it to zero.

barrier=1
This enables/disables barriers: barrier=0 disables, barrier=1 enables. Barriers are disabled by default.

commit=nrsecs
This instructs the file system to sync all data and metadata every nrsecs seconds. The default value is 5 seconds. This means that if you lose your power, you will lose as much as the latest 5 seconds of work (your file system will not be damaged though, thanks to journaling). This default value (or any low value) will hurt performance, but it is good for data-safety. Setting it to 0 will have the same effect as leaving it at the default (5 seconds). Setting it to very large values will improve performance at the expense of some data-loss.

data=ordered / data=writeback
This specifies the handling of data during metadata journaling.

ordered
This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

writeback
Data ordering is not preserved; data may be written into the main file system after its meta-data has been committed to the journal. This is rumored to be the highest throughput option. While it guarantees internal file system integrity, it can allow null or stale data to appear in files after a crash and journal recovery.

datavolume
Use this mount option to mount volumes storing Oracle data files, control files, redo logs, archive logs, voting disk, cluster registry, etc. (This mount option is only available with OCFS2 1.2 and OCFS2 1.4 for Enterprise distributions EL, RHEL and SLES.)

errors=remount-ro / errors=panic
This defines the behavior when an error (on-disk corruption) is encountered: either remount the file system read-only, or panic and halt the system. By default, the file system is remounted read-only.

intr / nointr
The default, intr, allows signals to interrupt certain cluster operations. nointr disables signals during cluster operations.

localflocks
This disables cluster-aware flock(2).

noatime
This standard mount option turns off atime updates completely.

relatime
This is only available with Linux kernel v2.6.20 and later. Relative atime only updates the atime if the previous atime is older than the mtime or ctime. This is useful for applications that only need to know that a file has been read since it was last modified.

ro
This mounts the file system read-only.

rw
This mounts the file system read-write.
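
The atime-related options above can be read as one small decision rule. The Python sketch below is only an illustration: the function name and its arguments are hypothetical, atime_quantum is modeled as a minimum interval between updates (an assumption), and the kernel's actual logic may differ in detail.

```python
def should_update_atime(atime, mtime, ctime, now, *, noatime=False,
                        relatime=False, atime_quantum=60):
    """Return True if an access at time `now` should update atime.

    Hypothetical model of the mount options described above.
    All times are in seconds since the epoch.
    """
    if noatime:
        # noatime turns off atime updates completely.
        return False
    if relatime:
        # Update only if the previous atime is older than mtime or ctime.
        return atime < mtime or atime < ctime
    # atime_quantum=0 means always update; otherwise limit the granularity
    # (assumption: at most one update per atime_quantum seconds).
    return atime_quantum == 0 or now - atime >= atime_quantum
```

With the default quantum of 60 seconds, a file read 20 seconds after its last recorded access would not have its atime updated in this model, while one read 60 seconds later would.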

Examples

If mounting clustered volumes, start the O2CB cluster service before attempting any mounts.

    $ service o2cb online

To mount device /dev/sda1 at /u01, do:

    $ mount /dev/sda1 /u01

To umount the device mounted at /u01, do:

    $ umount /u01

To mount OCFS2 volumes automatically at boot, (a) enable o2cb and ocfs2 init services to start on boot, and (b) add the mount entries in /etc/fstab.

    $ chkconfig --add o2cb
    o2cb 0:off 1:off 2:on 3:on 4:off 5:on 6:off
    $ chkconfig --add ocfs2
    o2cb 0:off 1:off 2:on 3:on 4:off 5:on 6:off
    $ cat /etc/fstab
    …
    /dev/sda1 /u01 ocfs2 _netdev,defaults 0 0
    …

The _netdev mount option is required for OCFS2 volumes. This mount option instructs the operating system to mount the volume after the network is started and unmount it before the network is stopped.
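
Since a missing _netdev option is easy to overlook, a small check can help. The Python sketch below is a hypothetical helper, not part of ocfs2-tools. It assumes the usual whitespace-separated fstab fields (device, mount point, type, options) and reports OCFS2 entries that lack _netdev.

```python
def ocfs2_entries_missing_netdev(fstab_text):
    """Return mount points of ocfs2 entries that lack the _netdev option."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # Fields: device, mount point, file system type, options, ...
        if len(fields) >= 4 and fields[2] == "ocfs2":
            if "_netdev" not in fields[3].split(","):
                missing.append(fields[1])
    return missing


sample = """\
# illustrative fstab content
/dev/sda1 /u01 ocfs2 _netdev,defaults 0 0
/dev/sdb1 /u02 ocfs2 defaults 0 0
"""
print(ocfs2_entries_missing_netdev(sample))
```

In this sample only /u02 is reported, because its entry omits _netdev.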

To mount-by-label a volume labeled "myvolume", do:

    $ mount -L myvolume /u01

V ADMINISTRATION

The previous section concentrated on getting started. It covered configuring and starting the cluster and formatting and mounting the volume. This section deals with administrative operations that can be performed on the volume. This includes adding slots, resizing, changing the label, checking the file system, etc.

A TUNING

tunefs.ocfs2(8) allows changing most of the file system's parameters. In fact, other than the block and cluster sizes, all other parameters can be modified. As the tool modifies an existing OCFS2 volume, the cluster must be started to see whether the volume is in use across the cluster. If the volume is not in use, all operations can be performed. If it is in use, only online operations may be performed.

tunefs.ocfs2(8) can perform the following tasks:

FS Features Toggle
It allows toggling file system features on and off. Enabling new file system features can make the volume un-mountable with older versions of the software. Having the ability to toggle off features is useful during the transition period. However, care should be taken; this operation is not guaranteed to succeed. For example, disabling sparse file support involves filling all the holes; it will only work if the file system has enough free space to fill them.

Journal Resize
It allows the journal to be grown or truncated. While this operation is not typically recommended, it is provided in case the user wishes to tweak the size of the journal for performance or other reasons.

Volume Label update
It allows the volume label to be changed. Having the ability to change the label is handy, as labels are useful for identifying a volume across a cluster.

Volume Mount Type Toggle
It allows toggling the mount type between cluster and local. When set to local, the mount utility can mount the volume without the cluster stack.

Node Slots update
It allows increasing and decreasing the number of node slots. Node slots dictate the number of nodes that can concurrently mount a volume. Having the ability to increase the node slots is useful in any environment. It also allows users to remove node slots and thus recover space used by journals. This is currently an offline operation. We intend to make add slots an online operation soon.

Volume Resize
It allows growing the size of the volume. This can be performed both when the volume is online and offline. However, as it requires a clustered volume manager to work effectively, the online feature will only be useful when support for a volume manager is announced. We expect to have it available with the next release, if not earlier. The tool does not allow shrinking the volume.

UUID Reset
It allows changing the UUID of the volume. This is useful when using the volume on EMC and NetApp disk arrays that allow volume cloning. As OCFS2 uses the UUID to uniquely identify a volume, this operation needs to be performed on the cloned volume to allow it to distinguish itself from the original.

B FILE SYSTEM CHECK

fsck.ocfs2(8) is the file system check tool. It detects and fixes on-disk errors. Like tunefs.ocfs2(8), it expects the cluster to be online. It needs to ensure the volume is not in use across the cluster.

When run without any options, fsck.ocfs2(8) only replays the journals. For a full scan, specify -f to force-check the volume.

    $ fsck.ocfs2 -f /dev/sdf1
    Checking OCFS2 filesystem in /dev/sdf1:
      label:              apache-data
      uuid:               944483172d30415a9b97d4c7f7c9f43e
      number of blocks:   13107196
      bytes per block:    4096
      number of clusters: 13107196
      bytes per cluster:  4096
      max slots:          4

    /dev/sdf1 was run with -f, check forced.
    Pass 0a: Checking cluster allocation chains
    Pass 0b: Checking inode allocation chains
    Pass 0c: Checking extent block allocation chains
    Pass 1: Checking inodes and blocks.
    Pass 2: Checking directory entries.
    Pass 3: Checking directory connectivity.
    Pass 4a: checking for orphaned inodes
    Pass 4b: Checking inodes link counts.
    All passes succeeded.

The fsck.ocfs2.checks(8) man page has a listing of all checks performed by fsck.ocfs2(8).

Backup Superblocks

A file system superblock stores critical information that is hard to recreate. In OCFS2, it stores the block size, cluster size, and the locations of the root and system directories, among other things. As this block is close to the start of the disk, it is very susceptible to being overwritten by an errant write, for example, dd if=file of=/dev/sda1.

Backup superblocks are copies of the actual superblock. These blocks are dispersed in the volume to minimize the chances of being overwritten. On the small chance that the original is corrupted, the backups are available to scan and fix the corruption.

mkfs.ocfs2(8) enables this feature by default. Users wishing explicitly not to have them can specify --fs-features=nobackup-super during format. tunefs.ocfs2(8) can be used to view whether the feature has been enabled on a device.

    $ tunefs.ocfs2 -q -Q "%M\n" /dev/sda1
    BackupSuper

In OCFS2, the superblock is on the third block. The backups are located at the 1GB, 4GB, 16GB, 64GB, 256GB and 1TB byte offsets. The actual number of backup blocks depends on the size of the device. The superblock is not backed up on devices smaller than 1GB.

fsck.ocfs2(8) refers to these six offsets by numbers, 1 to 6. Users can specify any backup with the -r option to recover the volume. The example below uses the second backup. If successful, fsck.ocfs2(8) overwrites the corrupted superblock with the backup.

    $ fsck.ocfs2 -f -r 2 /dev/sdf1
    [RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576 y
    Checking OCFS2 filesystem in /dev/sdf1:
      label:              apache-data
      uuid:               944483172d30415a9b97d4c7f7c9f43e
      number of blocks:   13107196
      bytes per block:    4096
      number of clusters: 13107196
      bytes per cluster:  4096
      max slots:          4

    /dev/sdf1 was run with -f, check forced.
    Pass 0a: Checking cluster allocation chains
    Pass 0b: Checking inode allocation chains
    Pass 0c: Checking extent block allocation chains
    Pass 1: Checking inodes and blocks.
    Pass 2: Checking directory entries.
    Pass 3: Checking directory connectivity.
    Pass 4a: checking for orphaned inodes
    Pass 4b: Checking inodes link counts.
    All passes succeeded.
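
The relationship between the backup numbers, the byte offsets and the block number reported by fsck.ocfs2(8) can be checked with a little arithmetic. The Python sketch below is illustrative only. It assumes that the offsets are powers of two (1GB = 2^30 bytes, which matches block#1048576 for backup 2 with 4K blocks) and that a backup exists only if its offset lies strictly inside the device. The function name is hypothetical.

```python
GIB = 1 << 30

# Byte offsets of the six backup superblocks, as listed above
# (assumption: 1GB means 2**30 bytes).
BACKUP_OFFSETS = [1 * GIB, 4 * GIB, 16 * GIB, 64 * GIB, 256 * GIB, 1024 * GIB]


def backup_superblocks(device_bytes, block_size=4096):
    """Return (backup number, block number) pairs that fit on the device."""
    return [(number, offset // block_size)
            for number, offset in enumerate(BACKUP_OFFSETS, start=1)
            if offset < device_bytes]


# Volume size taken from the mkfs.ocfs2 example with 4K blocks.
print(backup_superblocks(53687074816))
```

For that roughly 50GB volume the sketch yields three backups, consistent with the "Writing backup superblock: 3 block(s)" line printed by mkfs.ocfs2(8), and backup 2 falls on block 1048576.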

C OTHER TOOLS

The previous sections have covered tools to mount, format, tune and check an OCFS2 volume. This section briefly reviews the remaining tools. The complete usage description for all tools is available in their corresponding man pages.

mounted.ocfs2(8)
This tool detects all OCFS2 volumes. It does so by scanning all the devices listed in /proc/partitions. It has two modes. In the detect (-d) mode, it lists all OCFS2 devices.

    # mounted.ocfs2 -d
    Device     FS     UUID                                  Label
    /dev/sda1  ocfs2  84044d1e-fffd-4330-92c6-3486cab85c0f  myvol
    /dev/sdc1  ocfs2  03b685de-2df6-42ce-9385-665ad4b1ba62  cman-test
    /dev/sdd1  ocfs2  4d6ef662-f9c6-4db8-ab12-8e87aedec207  racdb
    /dev/sdf1  ocfs2  94448317-2d30-415a-9b97-d4c7f7c9f43e  apache-data

In the full (-f) mode, it lists the nodes currently mounting each volume. However, it should be noted that the information is not always accurate. The information is gleaned by dirty-reading the slot-map on the volume, which may be corrupted if the last node to mount the volume crashed. A corrupted slot-map is recovered by the next mount.

    $ mounted.ocfs2 -f
    Device     FS     Nodes
    /dev/sda1  ocfs2  node96, node92
    /dev/sdc1  ocfs2  node40, node35, node32, node31, node34, node33
    /dev/sdd1  ocfs2  Not mounted
    /dev/sdf1  ocfs2  Not mounted

o2cb_ctl(8)
This is the tool used by the o2cb init script to populate the O2CB cluster. It can also be used by users to add new nodes to a running cluster. To add node node4 as node number 4 with IP address 192.168.0.104 using port 7777 to cluster webcluster, do:

    $ o2cb_ctl -C -i -n node4 -t node -a number=4 -a ip_address=192.168.0.104 \
          -a ip_port=7777 -a cluster=webcluster

This command needs to be run on all nodes in the cluster, and the updated /etc/ocfs2/cluster.conf must be copied to the new node. This configuration file needs to be consistent on all nodes.

debugfs.ocfs2(8)
This is the main debugging tool. It allows users to walk directory structures, print inodes, backup files, etc., all without mounting the file system. This tool has been modeled after ext3's debugfs. For more, refer to the man page and the On-Disk Format support guide downloadable from the OCFS2 documentation section.

o2image(8)
This is a new tool modeled after ext3's e2image. It allows users to extract the meta-data from a volume and save it to another file. This file can then be read using debugfs.ocfs2(8). This will be useful for debugging on-disk issues that are not being fixed by fsck.ocfs2(8). Moreover, the OCFS2 developers hope to use this information to analyze the allocation layouts of user volumes in order to better understand typical usage. It should be noted that o2image(8) does not backup any user data. It only backs up meta-data, such as inodes and directory listings.

VI ORACLE RDBMS

One of the earliest uses of the OCFS2 file system was with Oracle's Real Application Cluster database product (RAC) on Linux. Since then, the file system has also seen use with the standalone database product. It provides advantages over other local file systems on Linux, including efficient handling of large files with full direct and asynchronous I/O support, and the ability to convert the file system from local to clustered and back.

This section details some additional configuration and issues one needs to be aware of when using OCFS2 with the Oracle RDBMS.

Mount Options

Two mount options have been added specifically for the Oracle database. One disables signals (nointr); the other forces the database to use direct I/O (datavolume).

It is necessary to mount volumes hosting the datafiles, control files, redo logs, etc. with the nointr mount option. This prevents short I/Os, which could happen if a signal were to interrupt an I/O in progress. This is similar to the nointr mount option in NFS.

The datavolume mount option is more legacy than anything else. It directs the database to perform direct I/O on the datafiles, control files, redo logs, etc. that reside on volumes mounted with this option. It is legacy mainly because the same behavior can be enforced by the init.ora parameter, filesystemio_options. Users using that parameter do not need this mount option for volumes hosting the database files. However, a volume hosting the RAC voting disk file and the cluster registry (OCR) still requires this option.

It should be noted that the datavolume mount option is only available in the OCFS2 1.2 and 1.4 releases for Enterprise distributions. It is not available in other distributions shipping the mainline kernel.

These mount options are not required on a volume for the Oracle home or any other use. Users should not use the same volume for the Oracle home and datafile storage.

Timeouts

RAC uses its own cluster stack, CSS. Both cluster stacks, CSS and O2CB, have configurable timeouts. In later versions of CSS (late 10g and 11g), care has been taken to ensure that the two timeouts are unrelated. This is true when using OCFS2 to host the voting disk files and the cluster registry (OCR) but not the CRS/CSS binaries, which should be installed on a local file system.

Node Numbers

It is best if the node numbers in the two stacks are consistent. Both stacks use the lower node number as a tie-breaker in quorum calculations. Changing node numbers in O2CB involves editing /etc/ocfs2/cluster.conf and propagating the new configuration to all nodes. The new numbers take effect after the cluster is restarted.

Cluster Sizes

When specifying datafiles as the file system type, mkfs.ocfs2(8) sets the cluster size to 128K. However, there is no one correct value and users are free to use a different value. The only point to remember is that the cluster size should not be smaller than the database block size. This is the easiest way to ensure that the database blocks will not be fragmented on disk. As 8K is the typical database block size, use a cluster size of at least that value, if not larger. Note also that the cluster size specifies the smallest file data allocation, so using a large value could lead to space wastage if the volume is being used to store many small files as well.
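
The trade-off described above can be made concrete with a short calculation. The Python sketch below is only an illustration with hypothetical function names. It assumes file data is allocated in whole clusters and ignores metadata overhead.

```python
def cluster_size_ok(cluster_size, db_block_size=8192):
    """The cluster size should not be smaller than the database block size."""
    return cluster_size >= db_block_size


def allocated_bytes(file_size, cluster_size):
    """Space consumed by a file's data when allocated in whole clusters."""
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


# A 1000-byte file on a 128K-cluster volume versus a 4K-cluster volume.
print(allocated_bytes(1000, 131072), allocated_bytes(1000, 4096))
```

In this model a 1000-byte file occupies a full 128K cluster on a datafiles volume, which is why large cluster sizes suit volumes holding a few large files rather than many small ones.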

Modification Times

To allow multiple nodes to concurrently stream I/Os to an Oracle datafile, OCFS2 makes a special dispensation from the POSIX standard by not updating the modification time (mtime) on disk when performing non-extending direct I/O writes. To be precise, while the new mtime is updated in memory, it is not flushed to disk unless the user extends or truncates the file or performs an explicit operation, such as touch(1). This dispensation leads to the file system returning differing time stamps for the same file on different nodes. While this is not ideal, this behavior exists to allow maximum throughput. Updating mtime on each write would negate one of the main benefits (parallel I/O) of a clustered database, because it would serialize the I/Os to each datafile.

Users wishing to view the on-disk timestamp of any file can use the debugfs.ocfs2 tool as follows:

    $ debugfs.ocfs2 -R "stat /relative/path/to/file" /dev/sda1 | grep "mtime:"

Certification

The current information on certifications with Oracle products is available by clicking on the Certify & Availability option on metalink, http://metalink.oracle.com/.

VII NOTES

a) Balanced Cluster

A cluster is a computer. This is a fact and not a slogan. What this means is that an errant node in the cluster can affect the behavior of other nodes. If one node is slow, the cluster operations will slow down on all nodes. To prevent that, it is best to have a balanced cluster. This is a cluster that has equally powered and loaded nodes.

The standard recommendation for such clusters is to have identical hardware and software across all the nodes. However, that is not a hard and fast rule. After all, in OCFS2 we have taken the effort to ensure it works in a mixed architecture environment.

If one uses OCFS2 in a mixed architecture environment, still try to ensure that the nodes are equally powered and loaded. The use of a load balancer can assist with the latter. Power refers to the number of processors, speed, amount of memory, I/O throughput, network bandwidth, etc. In reality, having equally powered heterogeneous nodes is not always practical. In that case, make the lower node numbers more powerful than the higher node numbers. This is because the O2CB cluster stack favors lower node numbers in all of its tie-breaking logic.

This is not to suggest you should add a dual core node in a cluster of quad cores. No amount of node number juggling will help you there.

b) File Deletion

In Linux, rm(1) removes the directory entry. It does not necessarily delete the corresponding inode too. By removing the directory entry, it gives the illusion that the inode has been deleted. This puzzles users when they do not see a corresponding up-tick in the reported free space. The reason is that inode deletion has a few more hurdles to cross.

First is the hard link count. This indicates the number of directory entries pointing to that inode. As long as a directory entry is linked to that inode, it cannot be deleted. The file system has to wait for that count to drop to zero. The second hurdle is the Linux/Unix semantics allowing files to be unlinked even while they are in use. In OCFS2, that translates to in use across the cluster. The file system has to wait for all processes across the cluster to stop using the inode.

Once these two conditions are met, the inode is deleted and the freed bits are flushed to disk on the next sync.

Users interested in following the trail can use debugfs.ocfs2(8) to view the node specific system files orphan_dir and truncate_log. Once the link count is zero, the inode is moved to the orphan_dir. After deletion, the freed bits are added to the truncate_log, where they remain until the next sync, during which the bits are flushed to the global bitmap.

c) Directory Listing

ls(1) may be a simple command, but it is not cheap. What is expensive is not the part where it reads the directory listing, but the second part where it reads all the inodes, also referred to as an inode stat(2). If the inodes are not in cache, this can entail disk I/O. Now, while a cold cache inode stat(2) is expensive in all file systems, it is especially so in a clustered file system. It needs to take a lock on each node, pure overhead in comparison to any local file system.

A hot cache stat(2), on the other hand, has been shown to perform on OCFS2 like it does on EXT3.

In other words, the second ls(1) will be quicker than the first. However, it is not guaranteed. Say you have a million files in a file system and not enough kernel memory to cache all the inodes. In that case, each ls(1) will involve some cold cache stat(2)s.

d) Synthetic File Systems

The OCFS2 development effort included two synthetic file systems, configfs and dlmfs. It also makes use of a third, debugfs.

configfs
configfs has since been accepted as a generic kernel component and is also used by netconsole and fs/dlm. OCFS2 tools use it to communicate the list of nodes in the cluster, details of the heartbeat device, cluster timeouts, and so on to the in-kernel node manager. The o2cb init script mounts this file system at /sys/kernel/config.

dlmfs
dlmfs exposes the in-kernel o2dlm to the user-space. While it was developed primarily for OCFS2 tools, it has seen usage by others looking to add a cluster-locking dimension in their applications. Users interested in doing the same should look at the libo2dlm library provided by ocfs2-tools. The o2cb init script mounts this file system at /dlm.

debugfs
OCFS2 uses debugfs to expose its in-kernel information to user space. For example, listing all the file system cluster locks, dlm locks, dlm state, o2net state, etc. Users can access the information by mounting the file system at /sys/kernel/debug. To auto-mount, add the following to /etc/fstab:

    debugfs /sys/kernel/debug debugfs defaults 0 0

e) Distributed Lock Manager

One of the key technologies in a cluster is the lock manager, which maintains the locking state of all resources across the cluster.
An easy implementation of a lock manager involves designating one node to handle everything. In this model, if a node wanted to acquire a lock, it would send the request to the lock manager. However, this model has a weakness: the lock manager's death causes the cluster to seize up.

A better model is one where all nodes manage a subset of lock resources. Each node maintains enough information for all the lock resources it is interested in. If a node dies, the lock state information maintained by the dead node can be reconstructed by the remaining nodes. In this scheme, the locking overhead is distributed amongst all the nodes. Hence, the term distributed lock manager.

O2DLM is a distributed lock manager. It is based on the specification titled "Programming Locking Application" written by Kristin Thomas and available at the following link.

http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf

f) DLM Debugging

One new feature in the release that has gone unmentioned is the improvement in the debugging infrastructure, especially in O2DLM.
As per kernel convention, all debugging related information is made accessible via the debugfs file system.

Among the information provided is the dlm state. In the sample below, we can see a nine-node cluster that has just lost three nodes: 12, 32 and 35. Node 7 is the recovery master, is currently recovering node 12, and has received the lock states of the dead node from the other live nodes.

    $ cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
    Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001  Key: 0x10748e61
    Thread Pid: 24542  Node: 7  State: JOINED
    Number of Joins: 1  Joining Node: 255
    Domain Map: 7 31 33 34 40 50
    Live Map: 7 31 33 34 40 50
    Mastered Resources Total: 48850  Locally: 48844  Remotely: 6  Unknown: 0
    Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty  Master=Empty
    Purge Count: 0  Refs: 1
    Dead Node: 12
    Recovery Pid: 24543  Master: 7  State: ACTIVE
    Recovery Map: 12 32 35
    Recovery Node State:
            7 - DONE
            31 - DONE
            33 - DONE
            34 - DONE
            40 - DONE
            50 - DONE

The previous version of the file system allowed users to dump the file system lock states (fs_locks).
This one also allows users to dump the dlm lock states (dlm_locks). The difference is that the file system has no knowledge of the other nodes in the cluster. In the sample below, we see that the lock resource is owned (mastered) by node 25 and that node 26 holds the EX (write) lock on that resource.

    $ debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
    Lockres: M000000000000000022d63c00000000  Owner: 25  State: 0x0
    Last Used: 0  ASTs Reserved: 0  Inflight: 0  Migration Pending: No
    Refs: 8  Locks: 6  On Lists: None
    Reference Map: 26 27 28 94 95
     Lock-Queue  Node  Level  Conv  Cookie      Refs  AST  BAST  Pending-Action
     Granted     94    NL     -1    94:3169409  2     No   No    None
     Granted     28    NL     -1    28:3213591  2     No   No    None
     Granted     27    NL     -1    27:3216832  2     No   No    None
     Granted     95    NL     -1    95:3178429  2     No   No    None
     Granted     25    NL     -1    25:3513994  2     No   No    None
     Granted     26    EX     -1    26:3512906  2     No   No    None

Another enhancement in this release is to the debugfs command, fs_locks.
It now supports a -B option to limit the output to only the busy locks. This is useful, as a typical file system can easily have over 100,000 lock resources. Being able to filter out the unnecessary information makes it much easier to isolate the problem.

    $ debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
    Lockres: M000000000000000000000b9aba12ec  Mode: No Lock
    Flags: Initialized Attached Busy
    RO Holders: 0  EX Holders: 0
    Pending Action: Convert  Pending Unlock Action: None
    Requested Mode: Exclusive  Blocking Mode: No Lock
    PR > Gets: 0  Fails: 0  Waits (usec) Total: 0  Max: 0
    EX > Gets: 440247  Fails: 0  Waits (usec) Total: 24104335  Max: 2431630
    Disk Refreshes: 1

With this debugging infrastructure in place, users can debug hang issues as follows:

1. Dump the busy fs locks for all the OCFS2 volumes on the node with hanging processes. If no locks are found, then the problem is not related to O2DLM.
2. Dump the corresponding dlm lock for all the busy fs locks. Note down the owner (master) of all the locks.
3. Dump the dlm locks on the master node for each lock.

At this stage, one should note that the hanging node is waiting to get an AST from the master. The master, on the other hand, cannot send the AST until the current holder has down converted that lock, which it will do upon receiving a Blocking AST. However, a node can only down convert if all the lock holders have stopped using that lock. After dumping the dlm lock on the master node, identify the current lock holder and dump both the dlm and fs locks on that node.
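
Identifying the current lock holder from a dlm_locks dump amounts to finding the granted lock whose level is above NL. The Python sketch below is a hypothetical helper, not part of ocfs2-tools. It assumes the whitespace-separated Lock-Queue layout shown in the sample output above, and the embedded text is an excerpt of that sample.

```python
def lock_holders(dlm_locks_output):
    """Return (node, level) for granted locks held above the NL level."""
    holders = []
    for line in dlm_locks_output.splitlines():
        fields = line.split()
        # Lock-queue rows: Granted <node> <level> <conv> <cookie> ...
        if len(fields) >= 3 and fields[0] == "Granted" and fields[2] != "NL":
            holders.append((int(fields[1]), fields[2]))
    return holders


sample = """\
Lockres: M000000000000000022d63c00000000 Owner: 25 State: 0x0
Lock-Queue Node Level Conv Cookie Refs AST BAST Pending-Action
Granted 94 NL -1 94:3169409 2 No No None
Granted 25 NL -1 25:3513994 2 No No None
Granted 26 EX -1 26:3512906 2 No No None
"""
print(lock_holders(sample))
```

For this excerpt the helper reports node 26 holding the EX lock, which is the node on which to dump the dlm and fs locks next.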

The trick here is to see whether the Blocking AST message has been relayed to the file system. If not, the problem is in the dlm layer. If it has, then the most common reason would be a lock holder, the count for which is maintained in the fs lock.

At this stage, printing the list of processes helps.

    $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

Make a note of all D state processes. At least one of them is responsible for the hang on the first node.

The challenge then is to figure out why those processes are hanging. Failing that, at least get enough information (like alt-sysrq t output) for the kernel developers to review. What to do next depends on where the process is hanging.

If it is waiting for the I/O to complete, the problem could be anywhere in the I/O subsystem, from the block device layer through the drivers to the disk array.

If the hang concerns a user lock (flock(2)), the problem could be in the user's application. A possible solution could be to kill the holder.

If the hang is due to tight or fragmented memory, free up some memory by killing non-essential processes.

The thing to note is that the symptom for the problem was on one node but the cause is on another. The issue can only be resolved on the node holding the lock. Sometimes, the best solution will be to reset that node. Once killed, the O2DLM recovery process will clear all locks owned by the dead node and let the cluster continue to operate. As harsh as that sounds, at times it is the only solution. The good news is that, by following the trail, you now have enough information to file a bug and get the real issue resolved.

g) NFS

OCFS2 volumes can be exported as NFS volumes. This support is limited to NFS version 3, which translates to Linux kernel version 2.4 or later.

Users must mount the NFS volumes on the clients using the nordirplus mount option. This disables the READDIRPLUS RPC call to work around a bug in NFSD, detailed in the following link:

http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html

Users running NFS version 2 can export the volume after having disabled subtree checking (export option no_subtree_check). Be warned, disabling the check has security implications (documented in the exports(5) man page) that users must evaluate on their own.

h) Limits

OCFS2 1.4 has no intrinsic limit on the total number of files and directories in the file system. In general, it is only limited by the size of the device. There are three limits imposed by the current file system:

1. An inode can have at most 32000 hard links. This means that a directory is limited to 32000 sub-directories. This limit may be raised when indexed directories are added to the file system.
2. OCFS2 can address at most 2^32 (approximately four billion) clusters. A file system with 4K clusters can go up to 16TB, while a file system with 1M clusters can reach 4PB.
3. OCFS2 uses JBD for journaling. JBD can address a maximum of 2^32 (approximately four billion) blocks. This limits the current maximum file system size to 16TB. This limit will increase when support for JBD2 is added to OCFS2.
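
The two size limits above combine into a simple minimum. The Python sketch below works through the arithmetic. The function name is hypothetical, and the 16TB JBD figure is assumed to correspond to 4K blocks.

```python
MAX_CLUSTERS = 2 ** 32    # clusters OCFS2 can address
MAX_JBD_BLOCKS = 2 ** 32  # blocks JBD can address

TIB = 2 ** 40
PIB = 2 ** 50


def max_volume_bytes(block_size, cluster_size):
    """Largest volume allowed by both the cluster and the JBD block limits."""
    cluster_limit = MAX_CLUSTERS * cluster_size
    journal_limit = MAX_JBD_BLOCKS * block_size
    return min(cluster_limit, journal_limit)


print(MAX_CLUSTERS * 4096 // TIB)         # cluster limit with 4K clusters, in TB
print(MAX_CLUSTERS * (1 << 20) // PIB)    # cluster limit with 1M clusters, in PB
print(max_volume_bytes(4096, 1 << 20) // TIB)  # effective limit with 4K blocks, in TB
```

Under these assumptions, even a volume formatted with 1M clusters is capped at 16TB by the journal until JBD2 support is added.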

i) System Objects

The OCFS2 file system stores its internal meta-data, including bitmaps, journals, etc., as system files. These are grouped in a system directory. These files and directories are not accessible via the file system interface but can be viewed using the debugfs.ocfs2(8) tool.

To list the system directory (referred to as double-slash), do:

    $ debugfs.ocfs2 -R "ls -l //" /dev/sdd1
    514  drwxr-xr-x  4  0  0      4096  8-Jul-2008 10:18  .
    514  drwxr-xr-x  4  0  0      4096  8-Jul-2008 10:18  ..
    515  -rw-r--r--  1  0  0         0  8-Jul-2008 10:18  bad_blocks
    516  -rw-r--r--  1  0  0   1048576  8-Jul-2008 10:18  global_inode_alloc
    517  -rw-r--r--  1  0  0   1048576  8-Jul-2008 10:18  slot_map
    518  -rw-r--r--  1  0  0   1048576  8-Jul-2008 10:18  heartbeat
    519  -rw-r--r--  1  0  0  81788928  8-Jul-2008 10:18  global_bitmap
    520  drwxr-xr-x  2  0  0      4096  8-Jul-2008 10:18  orphan_dir:0000
    521  drwxr-xr-x  2  0  0      4096  8-Jul-2008 10:18  orphan_dir:0001
    522  -rw-r--r--  1  0  0         0  8-Jul-2008 10:18  extent_alloc:0000
    523  -rw-r--r--  1  0  0         0  8-Jul-2008 10:18  extent_alloc:0001
    524  -rw-r--r--  1  0  0   4194304  8-Jul-2008 10:18  inode_alloc:0000
    525  -rw-r--r--  1  0  0         0  8-Jul-2008 10:18  inode_alloc:0001
    526  -rw-r--r--  1  0  0   4194304  8-Jul-2008 10:18  journal:0000
    527  -rw-r--r--  1  0  0   4194304  8-Jul-2008 10:18  journal:0001
    528  -rw-r--r--  1  0  0         0  8-Jul-2008 10:18  local_alloc:0000
    529  -rw-r--r--  1  0  0         0  8-Jul-2008 10:18  local_alloc:0001
    530  -rw-r--r--  1  0  0         0  8-Jul-2008 10:18  truncate_log:0000
    531  -rw-r--r--  1  0  0         0  8-Jul-2008 10:18  truncate_log:0001

The file names that end with numbers are slot specific and are referred to as node-local system files. The set of node-local files used by a node can be determined from the slot map. To list the slot map, do:

    # debugfs.ocfs2 -R "slotmap" /dev/sdd1
    Slot#   Node#
        0      32
        1      35
        2      40
        3      31
        4      34
        5      33

For more information, refer to the OCFS2 support guides available in the Documentation section at http://oss.oracle.com/projects/ocfs2.
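
The lookup from node number to node-local system files can be sketched in a few lines. The Python below is illustrative only. The helper names are hypothetical, and it assumes the "name:NNNN" naming pattern (a four-digit slot number) seen in the system directory listing above and a two-column slot map as printed by the slotmap command.

```python
# Node-local system file names, as seen in the system directory listing.
NODE_LOCAL_FILES = ["orphan_dir", "extent_alloc", "inode_alloc",
                    "journal", "local_alloc", "truncate_log"]


def parse_slotmap(text):
    """Map node number -> slot number from 'slotmap' output."""
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows hold two integers; the header row is skipped.
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            slot, node = map(int, fields)
            slots[node] = slot
    return slots


def node_local_files(node, slots):
    """Names of the node-local system files used by the given node."""
    return [f"{name}:{slots[node]:04d}" for name in NODE_LOCAL_FILES]


sample = """\
Slot#   Node#
    0      32
    1      35
    2      40
"""
slots = parse_slotmap(sample)
print(node_local_files(40, slots))
```

Here node 40 occupies slot 2, so in this model its journal would be the system file journal:0002.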

j) Heartbeat, Quorum and Fencing

Heartbeat is an essential component in any cluster. It is charged with accurately designating nodes as dead or alive. A mistake here could lead to a cluster hang or a corruption.

O2HB is the disk heartbeat component of O2CB. It periodically updates a timestamp on disk, indicating to others that this node is alive. It also reads all the timestamps to identify other live nodes. Other cluster components, like O2DLM and O2NET, use the O2HB service to get node up and down events.

The quorum is the group of nodes in a cluster that is allowed to operate on the shared storage. When there is a failure in the cluster, nodes may be split into groups that can communicate in their groups and with the shared storage but not between groups. O2QUO determines which group is allowed to continue and initiates fencing of the other group(s).

Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it does not have quorum in a degraded cluster. It does this so that other nodes won't be stuck trying to access its resources.

O2CB uses a machine reset to fence. This is the quickest route for the node to rejoin the cluster.

k) Processes

[o2net]
One per node. It is a work-queue thread started when the cluster is brought on-line and stopped when it is off-lined. It handles network communication for all threads. It gets the list of active nodes from O2HB and sets up a TCP/IP communication channel with each live node. It sends regular keep-alive packets to detect any interruption on the channels.

[user_dlm]
One per node. It is a work-queue thread started when dlmfs is loaded and stopped when it is unloaded. (dlmfs is a synthetic file system that allows user space processes to access the in-kernel dlm.)

[ocfs2_wq]
One per node. It is a work-queue thread started when the OCFS2 module is loaded and stopped when it is unloaded. It is assigned background file system tasks that may take cluster locks, like flushing the truncate log, orphan directory recovery and local alloc recovery. For example, orphan directory recovery runs in the background so that it does not affect recovery time.

[o2hb-14C29A7392]
One per heartbeat device. It is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every two seconds to a block in the heartbeat region, indicating that this node is alive. It also reads the region to maintain a map of live nodes. It notifies subscribers like o2net and o2dlm of any changes in the live node map.

[ocfs2dc]
One per mount. It is a kernel thread started when a volume is mounted and stopped when it is unmounted. It downgrades locks in response to blocking ASTs (BASTs) requested by other nodes.

[kjournald]
One per mount. It is part of JBD, which OCFS2 uses for journaling.

[ocfs2cmt]
One per mount. It is a kernel thread started when a volume is mounted and stopped when it is unmounted. It works with kjournald.

[ocfs2rec]
It is started whenever a node has to be recovered. This thread performs file system recovery by replaying the journal of the dead node. It is scheduled to run after dlm recovery has completed.

[dlm_thread]
One per dlm domain. It is a kernel thread started when a dlm domain is created and stopped when it is destroyed. This thread sends ASTs and blocking ASTs in response to lock level convert requests. It also frees unused lock resources.

[dlm_reco_thread]
One per dlm domain. It is a kernel thread that handles dlm recovery when another node dies. If this node is the dlm recovery master, it re-masters every lock resource owned by the dead node.

[dlm_wq]
One per dlm domain. It is a work-queue thread that o2dlm uses to queue blocking tasks.

l) Future

As clustering has become more popular, so have the demands for a common cluster stack. One of the development projects underway is to allow OCFS2 to work with user-space cluster stacks. This is a work in progress, and we hope to have a solution by (RH)EL6 and SLES11.

Work is underway on the core file system as well. We are looking to add support for extended attributes, POSIX locking, on-line node-slot addition, and JBD2.

If you are interested in contributing, refer to the development wiki at http://oss.oracle.com/osswiki/OCFS2 for a list of projects. You can also email the development team at ocfs2-devel@oss.oracle.com.
ACKNOWLEDGEMENTS
This release marks an important milestone in the development of OCFS2, which began in late 2001. The author wishes to take this opportunity to acknowledge the people whose contributions made this milestone possible.
Kurt Hackel started the OCFS project with the author and went on to design and write large parts of OCFS2, including the distributed lock manager.
Mark Fasheh, who joined the two of us a little after the first release of OCFS, worked on designing and rewriting the entire file system. He went on to add advanced features like sparse files, unwritten extents, inline data, shared writeable mmap, and dlmfs, and is currently an official maintainer of the file system.
Joel Becker, having experienced different cluster stacks, designed O2CB to be the simplest one yet. He is currently adding support for user-space cluster stacks while keeping the original interface intact. He also wrote large chunks of the user-space libraries and configfs. He is also an official maintainer of the file system.
Zach Brown, having worked on Lustre, wrote o2net and other kernel pieces of the O2CB cluster stack. He also wrote fsck.ocfs2, and can now be found working on CRFS, a cache-coherent network file system.
Tiger Yang started by adding support for splice I/Os. He can now be found adding support for extended attributes.
Tao Ma started by identifying and fixing bugs in fsck.ocfs2, went on to write online resize, and can currently be found adding support for extended attributes.
Marcos Matsunaga, the one-man testing machine, tests all production releases, mainline kernels, and development patches in an unwieldy cluster setup.
Wim Coekaerts drove the mainline submission process.
Manish Singh had the undesirable task of debugging the supposed compiler issues we encountered during development. He also wrote the console and lent a hand with the build infrastructure.
Rusty Lynch, Sonic Zhang, John L. Villalovos and Xiaofeng Ling from Intel lent a hand during the dog days of development in 2004.
Christoph Hellwig not only cleaned up a lot of the source code, but also helped get the file system accepted in the mainline kernel.
Jeff Mahoney, Jan Kara, Lars Marowsky-Bree, Andrew Beekhof, Coly Li and Zhen Wei from SUSE Labs continue to ensure the file system works on the SLES distribution.
Fabio Massimo Di Nitto helped get the file system included in the Ubuntu distribution. He also maintains the Debian-specific parts of our build infrastructure.
David Teigland and the members of Red Hat's Linux Cluster group are helping us integrate OCFS2 with the new CMAN.
Various testing teams in Oracle did not give up on the file system even after suffering numerous oopses and hangs in the early days.
EMC, Emulex, HP, IBM, Intel, Network Appliance and SGI provided hardware for testing. Cluster testing is an expensive proposition. Without the help of partners, OCFS2 would not have been possible.