Simple Speech Recognition System using MATLAB and VHDL on Altera DE0.
Demo Video here
Introduction

This project is an attempt to develop a simple speech recognition engine on low-end, educational FPGAs like the Altera DE0. It is also a small challenge to push low-end FPGAs to their limits and tame them into doing advanced work. The system was designed to recognize the digit (1 or 0) spoken into a laptop microphone and then transferred to the FPGA over UART. Both industry and academia have spent considerable effort in this field developing software and hardware to come up with a robust solution. However, because of the large number of accents spoken around the world, this conundrum remains an active area of research.
Speech recognition finds numerous applications, including healthcare, artificial intelligence, human-computer interaction, Interactive Voice Response systems, military, avionics, etc. Another very important application is helping physically challenged people interact with the world in a better way.
Theory

Speech recognition systems can be classified into several models according to the types of utterances to be recognized. These classes take into account the ability to determine the instant at which the speaker starts and finishes the utterance. In this project I aimed to implement an Isolated Word Recognition System, which usually uses a Hamming window over the word being spoken.
Speech recognition engines are broadly classified into 2 types, namely Pattern Recognition and Acoustic Phonetic systems. While the former uses known/trained patterns to determine a match, the latter uses attributes of the human body to compare speech features (phonetics such as vowel sounds). Pattern recognition systems combine well with current computing techniques and tend to have higher accuracy.
The basic structure of a speech recognition system is as follows:
- Speech Signal Recording.
- Spectral Analysis (FFT, Windowing, MFCC, Power Spectrum).
- Probability Estimation (Neural Networks, Hidden Markov Model, VQ).
- Signal Decoding and Decision Making.

Audio signals are captured using microphones and recorded in the time domain (i.e. the signal varies with time). The problem with human voice signals is that they are not stationary, and analyzing such signals in the time domain is complicated and computationally costly. This is where spectral analysis comes in: by applying a set of transformations and processing algorithms to the incoming signal, it is converted into a usable form on which further analysis can be done.
For this I am using:

DFT: The discrete Fourier transform (DFT) converts a finite sequence of equally-spaced samples of a function into a same-length sequence of equally-spaced samples of the discrete-time Fourier transform (DTFT), which is a complex-valued function of frequency.
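To make the DFT definition concrete, here is a minimal Python sketch of the naive transform. It is only an illustration, not the project's MATLAB code; the function name `dft` and the 8-sample test signal are assumptions made for the example.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) discrete Fourier transform of a finite sequence."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Hypothetical test signal: a cosine completing exactly 2 cycles in 8 samples.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
magnitudes = [abs(value) for value in dft(signal)]
```

For this signal the energy lands in bin 2 and its mirror image, bin 6, while the other bins are zero up to rounding error.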
Hamming Window: Whenever you do a finite Fourier transform, you are implicitly applying it to an infinitely repeating signal. So, if the start and end of the finite sample don't match, that will look just like a discontinuity in the signal, and show up as lots of high-frequency nonsense in the Fourier transform, which you don't want.
And if the sample happens to be a beautiful sinusoid but an integer number of periods doesn't happen to fit exactly into the finite sample, your FT will show appreciable energy in all sorts of places nowhere near the real frequency.
Windowing the data makes sure that the ends match up while keeping everything reasonably smooth; this greatly reduces this sort of "spectral leakage".
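As a rough illustration of the windowing step, the sketch below builds a symmetric Hamming window with the usual 0.54 and 0.46 coefficients and multiplies a frame by it. It is written in Python for illustration only; the helper names `hamming` and `apply_window` are assumptions, not names from the project.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(frame):
    """Taper a frame so that both ends shrink towards a small value."""
    window = hamming(len(frame))
    return [sample * weight for sample, weight in zip(frame, window)]

# Hypothetical frame whose first and last samples do not match.
frame = [1.0, 1.0, 1.0, 1.0, 1.0]
tapered = apply_window(frame)
```

After tapering, both ends of the frame are scaled to 0.08 of their original value while the centre sample is left untouched, which is what softens the discontinuity at the frame boundary.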
Euclidean Distance: The Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as the Pythagorean metric.
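A minimal Python sketch of this distance for two equal-length vectors follows; the function name `euclidean_distance` is chosen here for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points given as equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, the points (0, 0) and (3, 4) are 5 units apart.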
Hamming Distance: In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences.
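The definition translates almost directly into code. The Python sketch below is illustrative only, and the function name `hamming_distance` is an assumption.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)
```

For example, "karolin" and "kathrin" differ in three positions, so their Hamming distance is 3.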
FFT: The FFT is a fast, O[N log(N)] algorithm to compute the Discrete Fourier Transform (DFT), which naively is an O[N^2] computation. The FFT operates by decomposing an N point time domain signal into N time domain signals, each composed of a single point. The second step is to calculate the N frequency spectra corresponding to these N time domain signals. Lastly, the N spectra are synthesized into a single frequency spectrum.
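The decompose-then-synthesize idea can be sketched as a recursive radix-2 FFT. This Python version is an illustration under the assumption that the input length is a power of two; it is not the MATLAB built-in the project relies on.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return list(x)  # a single point is its own spectrum
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # spectrum of the even-indexed samples
    odd = fft(x[1::2])   # spectrum of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # synthesize lower half
        out[k + n // 2] = even[k] - twiddle  # synthesize upper half
    return out
```

A unit impulse transforms to a flat spectrum, and a constant signal of four ones transforms to a single non-zero bin of value 4 at index 0.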
Implementation

The system was first intended to be developed on the FPGA only, without external equipment, but this was impossible due to the limited capabilities of the board I have. So I divided the project into 2 stages: the front end (signal acquisition and analysis) and the back end (pattern matching and estimation, decision making and UI).
Frontend (MATLAB): The front end is built in MATLAB because of the ease of doing DSP with its built-in functions. There are 2 programs, one for training and obtaining a mean signal and the other for real-time operation. The steps done in MATLAB are:
- Data Acquisition using microphone.
- Windowing & Fast Fourier Transform.
- Plotting & Data Transmission.

Files in the Frontend: [train.m, recorder.m]
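The training program is described as producing a mean signal. One plausible reading, sketched here in Python rather than MATLAB, is an element-wise average over several recordings of the same word. The function name `mean_template` and the tiny input vectors are assumptions; the actual contents of train.m are not shown in this write-up.

```python
def mean_template(recordings):
    """Element-wise mean of several equal-length feature vectors."""
    if not recordings:
        raise ValueError("at least one recording is required")
    length = len(recordings[0])
    if any(len(r) != length for r in recordings):
        raise ValueError("all recordings must have the same length")
    return [sum(column) / len(recordings) for column in zip(*recordings)]

# Hypothetical spectra from two utterances of the same digit.
template = mean_template([[1, 2], [3, 4]])
```

Averaging several utterances smooths out variation between individual recordings before the template is stored for matching.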
Backend (Altera DE0): Because the Altera DE0 has no ADC, I transmit the data from the computer's microphone through a USB-to-TTL module over the UART protocol. The received data, 1000 samples long, are then compared with the vectors saved from the MATLAB training. The Euclidean distances are calculated, and the vector more likely to be the right one is given a bigger weight. The weights are then compared and the final result is displayed on the 7-segment displays and LEDs.
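The matching step described above amounts to picking the stored template closest to the received vector. The Python sketch below illustrates that idea with squared Euclidean distance, which gives the same ranking as the true distance while avoiding the square root. The names `squared_distance`, `classify` and the 4-point templates are made up for the example; the real design is written in VHDL and works on 1000-sample vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(sample, templates):
    """Return the label whose stored template is closest to the sample."""
    return min(templates, key=lambda label: squared_distance(sample, templates[label]))

# Hypothetical trained templates, far shorter than the real ones.
templates = {"zero": [9, 1, 0, 0], "one": [0, 0, 8, 2]}
result = classify([8, 2, 1, 0], templates)
```

Here the sample lies much closer to the "zero" template, so that label is returned.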
The backend was modelled as a Moore Finite State Machine with 4 states: (Receiving, Calculating Distance, Decision Making, Displaying Results).

Files in the Backend: [Voice_Recognition.vhd, uart_tx.vhd, uart_rx.vhd, uart_parity.vhd, uart.vhd]
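The four-state Moore machine can be sketched in Python as a transition function plus an output table that depends only on the current state. The state names come from the text; the input flags (`frame_received`, `distances_ready`), the output signal names, and the exact transition conditions are assumptions for illustration, since Voice_Recognition.vhd is not reproduced here.

```python
from enum import Enum, auto

class State(Enum):
    RECEIVING = auto()
    CALCULATING_DISTANCE = auto()
    DECISION_MAKING = auto()
    DISPLAYING_RESULTS = auto()

# Moore outputs: a function of the current state only (signal names are hypothetical).
OUTPUTS = {
    State.RECEIVING: {"rx_enable": True, "display_enable": False},
    State.CALCULATING_DISTANCE: {"rx_enable": False, "display_enable": False},
    State.DECISION_MAKING: {"rx_enable": False, "display_enable": False},
    State.DISPLAYING_RESULTS: {"rx_enable": False, "display_enable": True},
}

def next_state(state, frame_received=False, distances_ready=False):
    """Assumed transitions: wait for a full frame, compute, decide, display, repeat."""
    if state is State.RECEIVING:
        return State.CALCULATING_DISTANCE if frame_received else State.RECEIVING
    if state is State.CALCULATING_DISTANCE:
        return State.DECISION_MAKING if distances_ready else State.CALCULATING_DISTANCE
    if state is State.DECISION_MAKING:
        return State.DISPLAYING_RESULTS
    return State.RECEIVING  # after displaying, go back to listening
```

Because the outputs are looked up from the state alone, the machine cycles through recognition without further user input once a frame has arrived.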
Design Choices and Workarounds

Euclidean Distance Calculation: Calculating the Euclidean distance for a 1000-point vector is very expensive to do directly on the FPGA using for loops, so I used a little trick and calculated the weights of the vectors indirectly, by counting only the positions where the distance equals zero. This approach is similar to using K-nearest neighbour in machine learning. In other words, we are really calculating the Hamming distance inversely: counting the positions that match rather than the positions that differ.
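A minimal Python sketch of this workaround: each template's weight is the number of positions where it exactly equals the received vector, and the template with the highest weight wins. The names `match_count` and `classify_by_matches` and the short vectors are illustrative assumptions, not taken from the VHDL source.

```python
def match_count(sample, template):
    """Weight of a template: number of positions where the difference is zero."""
    return sum(1 for s, t in zip(sample, template) if s == t)

def classify_by_matches(sample, templates):
    """Return the label of the template with the most exact matches."""
    return max(templates, key=lambda label: match_count(sample, templates[label]))

# Hypothetical quantized vectors, much shorter than the real 1000 points.
templates = {"zero": [1, 2, 3, 0], "one": [0, 2, 0, 0]}
sample = [1, 2, 3, 4]
winner = classify_by_matches(sample, templates)
```

The weight equals the vector length minus the Hamming distance, so it needs only an equality comparison and a counter per position instead of a subtraction, a multiplication and an accumulation.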
FFT Points Discarding: Since not all of the frequencies are relevant, I kept only 1000 points and discarded the rest of the signal. Also, when taking the FFT, I discarded half of the output because of its symmetry.
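The symmetry mentioned here is a property of any real-valued input: bin N-k of the transform is the complex conjugate of bin k, so the second half of the magnitudes repeats the first. The Python sketch below checks this on a made-up 8-sample signal; `dft` and `one_sided` are illustrative names, and keeping exactly the first N/2 bins is an assumption about how the halving was done.

```python
import cmath

def dft(samples):
    """Naive DFT, sufficient for a small demonstration."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def one_sided(spectrum):
    """Keep the first half of the bins; the rest mirrors it for real input."""
    return spectrum[: len(spectrum) // 2]

# Hypothetical real-valued samples.
signal = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.25, -0.5]
spectrum = dft(signal)
half = one_sided(spectrum)
```

Dropping the mirrored bins halves the amount of data that has to be sent and compared without losing magnitude information.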
Moore FSM: The design was built as a Moore machine so that recognition runs automatically with less user interaction, and also to reduce complexity.
UART Module: UART was used for transmitting data because of the limitations of the FPGA board, the simplicity of its implementation, and the availability of conversion modules on the market.
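For readers unfamiliar with UART, the sketch below shows how one byte is typically framed on the wire: a start bit, eight data bits sent least significant bit first, a parity bit, and a stop bit. The presence of uart_parity.vhd suggests a parity bit is used, but the exact configuration (even parity, one stop bit) is an assumption, and `uart_frame` is a hypothetical helper written in Python for illustration.

```python
def uart_frame(byte, even_parity=True):
    """Return the bit sequence for one byte: start, 8 data bits (LSB first), parity, stop."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # least significant bit first
    parity = sum(data_bits) % 2  # even parity: total number of ones becomes even
    if not even_parity:
        parity ^= 1
    return [0] + data_bits + [parity] + [1]  # start bit is 0, stop bit is 1

frame = uart_frame(0x41)
```

Under these assumptions each byte costs 11 bit periods on the line, which matters when sending a 1000-sample vector per utterance.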
Results

- RAM consumption is around 380 MB on Ubuntu 16.04 LTS for the front end.
- Logic element consumption is 13,757 LEs.
- Consumes 9,144 registers and 10,450 logic functions.
- Uses 46 pins for the UI and data interface.
- Accuracy is 90% for the same speaker and decreases when the speaker changes.
- Can detect 2 numbers (one and zero).

Conclusion

It was shown here that it is possible to implement a basic speech recognition system on the Altera DE0, and that the limited capabilities of the hardware can be overcome with software workarounds.
The system is able to recognize two digits (1 and 0) with high accuracy for the same speaker. It is speaker-dependent to a great extent because of the low number of testing samples. This can be improved by building a bigger dataset from various speakers. Computing and comparing MFCCs alongside the FFT should also make the application more effective and more accurate.
More powerful hardware would allow me to implement more robust algorithms, such as Hidden Markov Models, more easily, and to use more capable ADC chips to record cleaner sound, resulting in more accurate results.