CN101093661B

CN101093661B - A pitch tracking and playing method and system thereof

Info

Publication number: CN101093661B
Application number: CN200610086470XA
Authority: CN
Inventors: 王贵平
Original assignee: Beijing Sunnorth Electronic Technology Co ltd; Sunplus Technology Co Ltd
Current assignee: BEIJING SUNPLUS-EHUE TECHNOLOGY Co Ltd; Sunplus Technology Co Ltd
Priority date: 2006-06-23
Filing date: 2006-06-23
Publication date: 2011-04-13
Anticipated expiration: 2026-06-23
Also published as: CN101093661A

Abstract

A pitch tracking and playback method and system. The system comprises a voice input processing module for receiving input sound and sampling and framing it, a pitch and energy detection module for detecting the pitch and energy of each speech frame, a note segmentation module for segmenting notes according to the energy detection results, a score conversion module for converting the segmented notes into a musical score, and a speech synthesis module for synthesizing the converted score into a Musical Instrument Digital Interface (MIDI) file and playing it. Human humming can thus be used directly as input, and the output is MIDI music. Furthermore, the invention can be applied to embedded systems through measures such as filtering with a second-order low-pass filter during pitch-detection preprocessing, calculating the cross-correlation function only at every other point, and narrowing the pitch frequency search range.

Description

A pitch tracking and playing method and system thereof

Technical field

The present invention relates to pitch tracking technology, and in particular to a method for note segmentation and quantization in a pitch tracking and playback system.

Background art

Good music not only cultivates refined aesthetic taste but is also a very positive way to relieve stress, and it plays an important role in modern society. With the development of digital audio processing technology, the ways in which music is created and preserved keep evolving. People can often be heard humming without being aware of it. Humming is in fact one of the most natural and direct ways to compose music or to query for music, which requires transcribing the hummed signal into a musical score and playing it back through a simple device. A typical humming system consists of two main parts, humming recognition and arrangement, and for some time most effort has concentrated on improving the accuracy of humming recognition.

Humming recognition, that is, capturing pitch information and rhythm changes, depends on pitch tracking technology. Pitch, also called tone or pitch (fundamental) frequency, characterizes the frequency at which the vocal cords vibrate when a voiced sound is produced. A voiced sound is produced by quasi-periodic vibration of the vocal cords, so a corresponding pitch can be obtained; an unvoiced sound is produced without vocal cord vibration and has no pitch parameter.

Existing pitch tracking techniques are implemented in the time domain and the frequency domain using a variety of statistical and non-statistical signal processing methods, but they still have the following disadvantages:

1) The results of pitch tracking methods are mostly presented as charts, text and similar forms, and there is no suitable device for converting the tracking results into audible playback;

2) The note segmentation of pitch tracking devices (a note being, for example, "1, 2, 3, 4, 5, 6, 7", i.e. "do, re, mi, fa, so, la, xi", in numbered musical notation) is not accurate enough, and pitch recognition is not accurate enough;

3) The pitch tracking algorithms are too complex, occupy a large amount of system memory, and often require a large humming database, so they are not suitable for embedded systems.

Chinese patent application CN200410049328, published on April 20, 2005, discloses a humming arrangement system and method, which mainly transcribes the input humming signal into a standard musical score and presents it again. That application has the following disadvantages: 1) the humming signal is only transcribed into a standard score and cannot be played back as sound; 2) the note segmentation of the pitch tracking device is not accurate enough; 3) the note segmentation and recognition algorithm is a statistical Markov model that requires matching against database parameters, so the algorithm is relatively complex, consumes considerable resources, and cannot be applied to embedded systems, which limits the application of the whole system.

U.S. Patent US5986199, issued on November 16, 1999, discloses a voice input device for music data. In one embodiment, preset notes are identified and selected from the sound input signal, while auxiliary note information (including parameters such as the note and the note duration) is also extracted from the sound input signal; the auxiliary note information is used to generate synthesis engine parameters, which modify the preset notes to provide a synthesized note output. In another embodiment, the feature vector of the note segmentation is used to select a preset file, namely a particular instrument file from an instrument preset library. According to the note segmentation, preset notes are selected from the instrument preset file, and the generated note output corresponds to the specified instrument or group of instruments. However, that patent uses a simple peak detection algorithm, and its note segmentation is inaccurate.

Summary of the invention

The technical problem to be solved by the present invention is to provide a pitch tracking and playback method and system that can convert the tracking results directly into audible playback and that is suitable for use in embedded systems.

To solve the above technical problem, the present invention provides a pitch tracking and playback system, comprising: a voice input processing module for receiving input sound and sampling and framing it; a pitch and energy detection module for detecting the pitch and energy of each speech frame; a note segmentation device for segmenting notes according to the energy detection results; and a score conversion module for converting the segmented notes into a musical score; characterized in that the system further comprises a speech synthesis module for synthesizing the converted score into a Musical Instrument Digital Interface (MIDI) file and playing it.

Further, the above system may also have the following feature: the pitch and energy detection module further comprises a preprocessing unit, a normalized cross-correlation function calculation unit, a post-processing unit and a pitch frequency search unit, wherein the preprocessing unit filters with a second-order low-pass filter.

Further, the above system may also have the following feature: the cross-correlation function calculation unit calculates the cross-correlation function at every other point.

Further, the above system may also have the following feature: the pitch frequency search unit searches at every other point, and its search range is within 20 to 120.

Further, the above system may also have the following feature: the note segmentation device comprises a peak detection module, a main control module, a note segmentation module, a storage module and a double-peak determination module, wherein:

The peak detection module is configured to count the speech frames whose energy is greater than a critical value within a segment of continuously rising energy, starting from the first speech frame of the voiced segment or from the last speech frame of the previous falling segment, and the subsequent segment of continuously falling energy. If the count is greater than a third threshold, the section of the energy curve corresponding to these speech frames is determined to form a peak: the maximum energy value on this section is the peak value, the energy value of the last speech frame on this section is the valley value, and the peak and valley positions are the speech frames corresponding to the peak value and the valley value, respectively;

The storage module is configured to store the parameters of the peaks and the start and end positions of the voiced segment;

The double-peak determination module is configured to determine whether the ratio of the difference between the second peak's peak value and the first peak's valley value to the difference between the first peak's peak value and valley value is greater than a first threshold, and whether the number of speech frames between the peak positions of the first peak and the second peak is greater than a second threshold. If both conditions hold, it returns a determination result of success; otherwise it returns a determination result of failure;

The main control module is configured to perform note segmentation on each voiced segment formed by consecutive speech frames whose energy is greater than the critical value, and further comprises a first control unit, a second control unit, a third control unit, a fourth control unit and a fifth control unit, wherein:

The first control unit is configured to invoke the peak detection module starting from the start position of the voiced segment. If no peak is detected, processing of the voiced segment ends; otherwise the first detected peak is taken as the first peak, the start position of the voiced segment and the peak value, valley value, peak position and valley position of the first peak are saved to the storage module, and the second control unit is triggered to continue processing;

The second control unit is configured to invoke the peak detection module. If a next peak is detected before a speech frame whose energy is less than the critical value is detected, the start position of the voiced segment is output to the note segmentation device and the third control unit is triggered to continue processing; otherwise the start and end positions of the voiced segment are output to the note segmentation device;

The third control unit is configured to take the detected next peak as the second peak, record its peak value, valley value, peak position and valley position, and invoke the double-peak determination module. If the returned determination result is success, the fourth control unit is triggered to continue processing; otherwise the fifth control unit is triggered to continue processing;

The fourth control unit is configured to invoke the peak detection module. If a next peak is detected before a speech frame whose energy is less than the critical value is detected, the valley position of the first peak is output to the note segmentation device, the stored parameters of the first peak are overwritten with the corresponding parameters of the second peak, and the third control unit is triggered to continue processing; otherwise the valley position of the first peak and the end position of the voiced segment are output to the note segmentation device;

The fifth control unit is configured to invoke the peak detection module. If a next peak is detected before a speech frame whose energy is less than the critical value is detected, the peak value of the first peak is replaced with the larger of the peak values of the first and second peaks, the valley value of the first peak is replaced with the valley value following that larger peak or with the smaller of the two valley values, the peak position and valley position of the first peak are updated to the speech frames corresponding to the new peak value and valley value, and the third control unit is then triggered to continue processing; otherwise the end position of the voiced segment is output to the note segmentation device;

The note segmentation device is configured to take each two adjacent positions output during the processing of a voiced segment as the start and end positions of one note, thereby completing the note segmentation of that voiced segment.
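The two-condition test performed by the double-peak determination module can be illustrated with a minimal sketch. The function name, argument names and the threshold values in the usage lines are hypothetical; the text above does not give numeric thresholds, and counting the frames between the two peak positions as a simple index difference is an assumption.

```python
def double_peak_ok(peak1, valley1, pos1, peak2, pos2,
                   ratio_threshold, gap_threshold):
    """Return True when the second peak passes both double-peak conditions.

    peak1/valley1: peak and valley energy of the first peak
    peak2:         peak energy of the second peak
    pos1/pos2:     frame indices of the two peak positions
    """
    depth = peak1 - valley1
    if depth <= 0:
        # Degenerate first peak: the ratio is undefined, so report failure.
        return False
    ratio = (peak2 - valley1) / depth
    # Assumption: "frames between the peak positions" is taken as pos2 - pos1.
    return ratio > ratio_threshold and (pos2 - pos1) > gap_threshold


# Hypothetical thresholds, for illustration only.
strong_second_peak = double_peak_ok(100, 40, 5, 90, 15, 0.5, 5)
weak_second_peak = double_peak_ok(100, 40, 5, 50, 15, 0.5, 5)
```

A success result corresponds to the branch handled by the fourth control unit, and a failure result to the branch handled by the fifth control unit.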

Correspondingly, the pitch tracking and playback method provided by the present invention comprises the following steps:

(a) sampling and framing the input sound;

(b) detecting the pitch and energy of each speech frame;

(c) segmenting notes according to the energy detection results, and then converting the note segmentation results into a musical score according to the detected pitch;

(d) synthesizing the converted score into a Musical Instrument Digital Interface (MIDI) file and playing it.

Further, the above method may also have the following feature: detecting the pitch of a speech frame in step (b) is further divided into the following steps:

(b1) preprocessing the framed speech signal, including mean removal and low-pass filtering, the filtering being performed with a second-order low-pass filter;

(b2) comparing the energy of the low-pass filtered speech with a threshold; if it is higher than the threshold, the frame is judged to be voiced and the method proceeds to step (b3); otherwise step (b7) is performed;

(b3) calculating, for each frame, the normalized cross-correlation function value ρ(t) between the voiced signal and its delayed copy;

(b4) performing post-processing to obtain the ρ(t) with the highest correlation and the corresponding optimal delay T;

(b5) performing a pitch frequency search, that is, checking whether the time-delay sample t corresponding to the ρ(t) with the highest correlation lies within the search range; if it does, step (b6) is performed; otherwise step (b7) is performed;

(b6) treating the speech frame as voiced and outputting the optimal delay T, from which the pitch value of the frame can be derived according to the relationship between time and frequency; end;

(b7) treating the speech frame as unvoiced and setting the pitch to 0; end.

Further, the above method may also have the following feature: the cross-correlation function values in step (b3) are calculated at every other point.

Further, the above method may also have the following feature: the pitch frequency search in step (b5) is performed at every other point, and its search range is within 20 to 120.
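Steps (b1) and (b2) can be sketched as follows. This is only an illustration under stated assumptions: frame energy is taken to be the sum of squared samples (the text does not define it), the low-pass filter of step (b1) is omitted for brevity, and the function names and the threshold value are hypothetical.

```python
def remove_mean(frame):
    """Step (b1), mean removal only: subtract the frame average from each sample."""
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]


def frame_energy(frame):
    """Assumed energy measure: sum of squared samples in the frame."""
    return sum(x * x for x in frame)


def passes_energy_check(frame, energy_threshold):
    """Step (b2): a frame continues to step (b3) only if its energy exceeds the threshold.

    Note: the low-pass filtering described in step (b1) is not applied here.
    """
    return frame_energy(remove_mean(frame)) > energy_threshold


# A frame with a constant offset of 4 and an alternating component of +/-1.
energy = frame_energy(remove_mean([3, 5, 3, 5]))
```

Removing the mean first keeps a constant offset in the input from being counted as signal energy.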

Further, the above method may also have the following feature: when notes are segmented in step (c), the following steps are performed for each voiced segment formed by consecutive speech frames whose energy is greater than the critical value:

(c1) taking the first speech frame as the start position of the voiced segment; if a peak is detected, taking the first detected peak as the first peak, recording the start position of the voiced segment and the peak value, valley value, peak position and valley position of the first peak, and performing the next step; if no peak is detected, ending;

(c2) continuing detection; if a next peak is detected before a speech frame whose energy is less than the critical value is detected, outputting the start position of the voiced segment and performing step (c3); otherwise outputting the start and end positions of the voiced segment and performing step (c6);

(c3) taking the detected next peak as the second peak, recording its peak value, valley value, peak position and valley position, and determining whether the ratio of the difference between the second peak's peak value and the first peak's valley value to the difference between the first peak's peak value and valley value is greater than the first threshold, and whether the number of speech frames between the peak positions of the first peak and the second peak is greater than the second threshold; if both hold, performing step (c4); otherwise performing step (c5);

(c4) continuing detection; if a next peak is detected before a speech frame whose energy is less than the critical value is detected, outputting the valley position of the first peak, overwriting the stored parameters of the first peak with the corresponding parameters of the second peak, and returning to step (c3); otherwise outputting the valley position of the first peak and the end position of the voiced segment, and performing step (c6);

(c5) continuing detection; if a next peak is detected before a speech frame whose energy is less than the critical value is detected, overwriting the peak value of the first peak with the larger of the peak values of the first and second peaks, overwriting the valley value of the first peak with the valley value following that larger peak or with the smaller of the two valley values, updating the peak position and valley position of the first peak to the speech frames corresponding to the new peak value and valley value, and then returning to step (c3); otherwise outputting the end position of the voiced segment and performing step (c6);

(c6) taking each two adjacent output positions as the start and end positions of one note, thereby completing the note segmentation of the voiced segment.

When peaks are detected in steps (c1) to (c5), the energies of adjacent speech frames are compared one pair at a time, and a count is made of the speech frames whose energy is greater than the critical value within the segment of continuously rising energy, starting from the first speech frame of the voiced segment or from the last speech frame of the previous falling segment, and the subsequent segment of continuously falling energy. If the count is greater than the third threshold, the section of the energy curve corresponding to these speech frames is determined to form a peak: the maximum energy value on this section is the peak value, the energy value of the last speech frame on this section is the valley value, the peak and valley positions are the speech frames corresponding to the peak value and the valley value, and the first speech frame of the voiced segment, or the last speech frame of the previous falling segment, is the start position of the peak.
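The peak detection rule described above can be sketched as a single scan over the per-frame energy curve. This is a simplified illustration, not the patented implementation: the function name and parameters are hypothetical, `min_frames` stands in for the third threshold, ties between adjacent frames are treated as part of the falling segment, and a rise that reaches the end of the data without a fall is still counted.

```python
def find_peak(energy, start, critical, min_frames):
    """Find the next peak at or after frame `start`.

    Returns (peak_pos, valley_pos), or None if the energy falls to or below
    `critical` (or the data ends) before a sufficiently long peak is found.
    """
    n = len(energy)
    i = start
    while i < n and energy[i] > critical:
        begin = i
        # Continuously rising segment.
        while i + 1 < n and energy[i + 1] > critical and energy[i + 1] > energy[i]:
            i += 1
        peak_pos = i
        # Subsequent continuously falling segment (ties count as falling).
        while i + 1 < n and energy[i + 1] > critical and energy[i + 1] <= energy[i]:
            i += 1
        valley_pos = i
        if valley_pos - begin + 1 > min_frames:
            return peak_pos, valley_pos
        if valley_pos == begin:
            # No progress: the voiced segment ends right after this frame.
            return None
        # Run too short to be a peak: rescan from the last frame of the fall.
    return None


# Two peaks separated by a valley at frame 6, followed by silence.
curve = [1, 5, 9, 12, 8, 4, 3, 7, 11, 6, 2, 0]
first = find_peak(curve, 0, 0.5, 3)
second = find_peak(curve, first[1], 0.5, 3)
```

Restarting the second scan at the valley position of the first peak mirrors the rule that a new peak starts at the last speech frame of the previous falling segment.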

With the method and system of the present invention, human humming can be used directly as input and the output is MIDI music, so changes in pitch can be heard without any further conversion. The invention also simplifies the pitch detection part so that the system occupies little memory and can be applied to embedded systems. In addition, the note segmentation process performs interference-resistant double-peak detection that is prioritized and layered, and introduces more effective detection parameters, which improves the accuracy of note segmentation.

Brief description of the drawings

Fig. 1 is a structural block diagram of the pitch tracking and playback system according to an embodiment of the present invention.

Fig. 2 is an overall flowchart of the pitch tracking and playback method according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of an example pitch curve.

Fig. 4 is a flowchart of the pitch detection algorithm using the normalized cross-correlation function according to an embodiment of the present invention.

Fig. 5 is a graph of the energy and pitch curves used for note segmentation and conversion into a score according to an embodiment of the present invention.

Fig. 6 is a flowchart of note segmentation according to an embodiment of the present invention.

Fig. 7 is a structural block diagram of the note segmentation device according to an embodiment of the present invention.

Detailed description of the embodiments

The invention is applicable to PCs and embedded systems. It is mainly used to track human humming, for example humming on the syllable "la", and can also be used with certain electronic musical instruments. The system input is a human voice and the output is MIDI music. MIDI is short for Musical Instrument Digital Interface and is a file format for recording musical scores.

As shown in Fig. 1, the pitch tracking and playback system of this embodiment comprises the following modules:

A voice input processing module, which receives the input sound, samples and frames it, and outputs the result to the pitch and energy detection module.

A pitch and energy detection module, which detects the pitch and energy of each speech frame to obtain the pitch curve and energy curve of the input sound, and outputs them to the note segmentation device.

A note segmentation device, which segments notes according to the energy detection results and outputs the result to the score conversion module.

A score conversion module, which converts the segmented notes into a musical score and outputs it to the speech synthesis module.

A speech synthesis module, which synthesizes the converted score into a MIDI file and plays it.

As shown in Fig. 2, the overall flow of the pitch tracking and playback method of this embodiment comprises the following steps:

Step 10: sample and frame the input sound.

Human humming can be captured as input through the microphone of the hardware system at a sampling rate of 8 kHz with 16-bit samples; the sampling rate can also be raised to 16 kHz. Speech is normally analyzed and processed on a short-time basis, so the input sound must be divided into frames. In this embodiment the input sound (8 kHz sampling rate, 16-bit representation) is divided into frames of 20 ms each.
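The frame length implied by these figures (8 kHz × 20 ms = 160 samples) can be illustrated with a minimal framing sketch. It assumes non-overlapping frames and drops a trailing partial frame; the text does not specify either choice, and the function name is hypothetical.

```python
def split_into_frames(samples, sample_rate=8000, frame_ms=20):
    """Split PCM samples into consecutive, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000  # 8000 * 20 // 1000 = 160
    n_frames = len(samples) // frame_len        # trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of silence at 8 kHz gives 50 frames of 160 samples each.
frames = split_into_frames([0] * 8000)
```

At the optional 16 kHz rate the same 20 ms frame holds 320 samples.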

Step 20: pitch and energy detection.

Pitch is also called pitch frequency, and pitch detection and estimation is a very important problem in speech processing. Many mature algorithms for pitch detection exist today, mainly divided into time-domain methods, frequency-domain methods and other popular methods.

The characteristics of a speech signal vary with time and remain relatively stationary only over a short interval. It is therefore necessary to detect the pitch and energy of each speech frame to obtain the pitch curve and energy curve of the input sound.

Step 30: note segmentation and conversion into a score.

Pitch detection yields a pitch curve. Fig. 3 is an example showing how pitch changes over time. Some notes stand alone, while others run together and need to be segmented; note segmentation can only be completed by combining the pitch and energy detection results.

Although a pitch curve looks very intuitive, as a parameter representing rich note changes it is still not enough to give people a direct experience. The system of this embodiment can not only detect pitch but also play back the processing results in real time, giving users an all-round sense of the notes they produce and thus broadening its fields of application. For example, it can be used for comparative learning of tonal languages, for retrieving songs by humming, or in smart electronic toys. This requires converting the note segmentation results into a score for playback.
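One common way to turn a detected pitch into a score entry is the standard equal-temperament mapping from frequency to MIDI note number, in which A4 = 440 Hz is note 69. The sketch below is an illustration only: the text does not state how the embodiment quantizes pitch, and using the median of the voiced frame pitches as the pitch of a segmented note is an assumption, as are the function names.

```python
import math


def hz_to_midi_note(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return int(round(69 + 12 * math.log2(freq_hz / 440.0)))


def note_pitch(frame_pitches):
    """Assumed rule: median of the voiced (non-zero) frame pitches of one note."""
    voiced = sorted(p for p in frame_pitches if p > 0)
    return voiced[len(voiced) // 2] if voiced else 0.0


# Frames of one segmented note, with unvoiced frames (pitch 0) at both ends.
pitch_hz = note_pitch([0, 198.0, 200.0, 202.0, 0])
midi_note = hz_to_midi_note(pitch_hz)
```

Rounding to the nearest note number is what quantizes a slightly off-key hum onto the semitone grid.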

Step 40: using conventional synthesis techniques, synthesize the converted score into a MIDI file and play it.

The sound of electronic musical instruments can be processed in the same way by the above steps.

In step 20 above, the pitch detection algorithm of this embodiment, which is based on the normalized cross-correlation function (NCCF), is largely the same as existing methods; the differences are highlighted in the steps below. Referring to Fig. 4, the pitch detection algorithm comprises the following steps:

Step 210: preprocess the framed speech signal, including mean removal and 1000 Hz low-pass filtering.

Step 220: make a voiced/unvoiced decision frame by frame. The energy of the low-pass filtered speech is compared with a threshold; if it is higher than the threshold, the frame is judged to be voiced and the algorithm proceeds to step 230; otherwise step 270 is performed.

Step 230: calculate the normalized cross-correlation function.

The normalized cross-correlation function value ρ(t) between each frame of the voiced signal and its delayed copy is calculated by the following formula:

$\rho(t) = \frac{\sum_{n=0}^{N-1} s(n)\, s(n-t)}{\sqrt{\sum_{n=0}^{N-1} s^{2}(n) \sum_{n=0}^{N-1} s^{2}(n-t)}}, \quad t \in [0, N-1]$

Here s(n) denotes the speech signal and N is the frame length; one speech frame is 20 ms, so at 8 kHz sampling N = 160. t is the time delay in samples. The range [0, N-1] of t is divided into three regions. Within each region, ρ(t) is calculated at every other point, and the degree of similarity (correlation) between the original signal and its delayed copy is compared to find the ρ(t) of greatest similarity. The three regions thus yield three ρ(t) values together with their corresponding time-delay samples and delay times.
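The formula can be evaluated directly, as in the following minimal sketch. Several points are assumptions rather than part of the description: the frame is addressed inside a longer buffer so that s(n - t) can reach back before the frame start, the three-region split is collapsed into one search over lags 20 to 120 for brevity, and all function and parameter names are hypothetical.

```python
import math


def nccf(signal, start, frame_len, lag):
    """rho(lag) for the frame signal[start:start+frame_len] and its delayed copy."""
    num = e_cur = e_del = 0.0
    for n in range(frame_len):
        a = signal[start + n]        # s(n)
        b = signal[start + n - lag]  # s(n - t)
        num += a * b
        e_cur += a * a
        e_del += b * b
    denom = math.sqrt(e_cur * e_del)
    return num / denom if denom > 0.0 else 0.0


def best_lag(signal, start, frame_len=160, lo=20, hi=120, step=2):
    """Evaluate rho at every `step`-th lag; return (lag, rho) of the maximum.

    The caller must ensure start >= hi so that signal[start + n - lag] exists.
    """
    best_t, best_rho = lo, -2.0      # rho is never below -1
    for lag in range(lo, hi + 1, step):
        rho = nccf(signal, start, frame_len, lag)
        if rho > best_rho:
            best_t, best_rho = lag, rho
    return best_t, best_rho


# A 200 Hz sine sampled at 8 kHz repeats every 40 samples.
tone = [math.sin(2 * math.pi * n / 40) for n in range(400)]
lag, rho = best_lag(tone, start=160)
```

With `step=2` only every other lag is evaluated, which is the source of the roughly halved computation that the embodiment attributes to calculating at every other point.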

Step 240: post-processing. The three ρ(t) values are compared to obtain the ρ(t) with the highest correlation and the corresponding optimal delay T.

Step 250: pitch frequency search. Check whether the time-delay sample t corresponding to the ρ(t) with the highest correlation lies within the range [20, 120]. If it does, perform step 260; otherwise perform step 270.

Step 260: treat the speech frame as voiced and output the optimal delay T, from which the pitch value of the frame can be derived according to the relationship between time and frequency.

Step 270: treat the speech frame as unvoiced, set the pitch to 0, and end.
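Steps 250 to 270 can be summarized in a short sketch. It assumes that the optimal delay is expressed in samples, so that the pitch in Hz is the sampling rate divided by the delay; the function name and defaults are hypothetical.

```python
def lag_to_pitch(lag, sample_rate=8000, lo=20, hi=120):
    """Return the pitch in Hz for a lag inside [lo, hi], or 0.0 (unvoiced) otherwise."""
    if lo <= lag <= hi:
        return sample_rate / lag  # step 260: period of `lag` samples
    return 0.0                    # step 270: treated as unvoiced


highest = lag_to_pitch(20)    # 8000 / 20 = 400.0 Hz
lowest = lag_to_pitch(120)    # 8000 / 120, about 66.7 Hz
rejected = lag_to_pitch(147)  # outside the search range, so 0.0
```

Under this assumption, the [20, 120] range at 8 kHz corresponds to pitches between roughly 66.7 Hz and 400 Hz.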

Correspondingly, the pitch and energy detection module can be divided into a preprocessing unit, a normalized cross-correlation function calculation unit, a post-processing unit, a pitch frequency search unit, and a pitch output unit. Compared with the existing units, these have the following characteristics:

The existing preprocessing unit uses a 5th-order elliptic low-pass filter and a numerical filter. This embodiment replaces the 5th-order elliptic low-pass filter with a second-order low-pass filter based on the Haar wavelet, sets the cutoff frequency to 1000 Hz, and removes the numerical filter, which greatly reduces the computational complexity.

Computing the cross-correlation function is very expensive. The existing cross-correlation calculation unit usually evaluates the correlation function at every point, whereas the unit of this embodiment evaluates it only at every other point, which halves the amount of computation.

The existing pitch frequency search unit searches the range 20-147. The unit of this embodiment narrows the search range to 20-120 (it may be made slightly smaller still) and searches every other point, which reduces high-frequency pitch detection errors and the amount of computation.

Pitch detection is the most computationally complex part of the system. Through the improvements to the units described above, this embodiment reduces the complexity without affecting detection accuracy, so that the invention can be applied to embedded systems.

The note segmentation in step 30 above is a key point of the present invention and is described in detail below.

A. Purpose of note segmentation

A pitch curve can be played only after it has been converted into individual notes (for example "1, 2, 3, 4, ..." in numbered notation), that is, into a score. Only in a few cases does one stretch of pitch correspond to a single note; in most cases it corresponds to several notes, so note segmentation is required to complete the score.

B. Background of note segmentation

When a person sings a cappella with lyrics, the pitch curve becomes more complicated. It is affected by changes in pitch, and changes in timbre (or melisma) also affect the rise and fall of the pitch curve, which makes note segmentation harder. For example, a tune has its own variations in pitch, and the same pitch sounds different when played on a piano and on a guitar; this is the effect of timbre. The same tune can be sung by humming "la" or "da", or by singing lyrics. In the latter case the lyrics vary, so the pitch curve changes in a more complicated way than in the former. The method of this embodiment is mainly intended for songs sung in the former way, on "la" or "da".

C. Existing note segmentation methods

Note segmentation aims to find the individual notes. Each note has relatively high energy at its onset; the energy then falls and is sustained until it disappears or is masked by the energy of the next note. This is why the present system uses a peak detection algorithm, as shown in Figure 4.

Peak detection is used to locate the higher-energy peak positions in a signal. Existing techniques include:

A) Wavelet transform algorithms. The wavelet transform inherits from and extends the traditional Fourier transform and is mainly used in signal processing, image processing, speech processing and other fields, but it is relatively complex to implement.

B) Simple peak detection, or amplitude envelope detection, which determines a peak from the first and second derivatives. This method is simple but has poor resistance to interference.

C) Detecting peaks using parameters such as the peak-to-peak value, half-peak value, and peak-to-valley value. This approach has already been applied to note segmentation. Although its results are relatively stable, it does not accurately detect multiple peaks in a speech signal (small peaks surrounding a peak in the signal).

D. Note segmentation method of this embodiment

This embodiment improves on peak detection method C) above, mainly in the following three respects:

1) Preprocessing

The detected energy values of the speech frames are first fed in sequence into a first-order low-pass filter, which removes glitches from the energy curve and thereby improves interference resistance and detection performance;
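A first-order low-pass filter over the per-frame energies can be sketched as a simple recursive smoother. The smoothing coefficient `alpha` and the choice to initialize the filter with the first sample are assumptions; the text does not give the filter coefficients.

```python
def smooth_energy(energies, alpha=0.5):
    """First-order recursive low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n-1].

    Assumption: the filter state starts at the first input value.
    """
    smoothed = []
    prev = None
    for e in energies:
        prev = e if prev is None else alpha * e + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed
```

A one-frame glitch such as the 30 in `[10, 10, 30, 10, 10]` is attenuated to 20 and decays over the following frames rather than appearing as a sharp spike.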

2) Double-peak detection

Referring to Figure 5, the energy curve as a whole contains speech frames whose energy is above a critical value and speech frames whose energy is below it. Note segmentation processes only runs of consecutive speech frames whose energy is above the critical value, referred to here as voiced segments; the first and last speech frames of such a run are the start and end positions of the voiced segment. On the energy curve, energy can be expressed as amplitude or power. Preferably, the critical value of the energy is 26 dB to 30 dB, but the invention is not limited to this.
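Extracting voiced segments from the energy curve can be sketched as a single pass over the frames. The default threshold of 28 dB is simply a value inside the preferred 26 dB to 30 dB range, and the function name is hypothetical.

```python
def voiced_segments(energies_db, threshold_db=28.0):
    """Return (start, end) frame indices of runs whose energy exceeds the threshold."""
    segments = []
    start = None
    for i, e in enumerate(energies_db):
        if e > threshold_db:
            if start is None:
                start = i                      # first frame of a new voiced segment
        elif start is not None:
            segments.append((start, i - 1))    # previous frame closed the segment
            start = None
    if start is not None:                      # segment runs to the end of the input
        segments.append((start, len(energies_db) - 1))
    return segments
```

Each returned pair gives the start and end positions that the later steps output as note boundaries.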

The double-peak detection method of this embodiment is described below using one voiced segment as an example. As shown in Figure 6, the procedure includes the following steps:

Step 300: starting from the start position of the voiced segment, that is, its first speech frame, determine whether a first peak is detected in the voiced segment; if so, perform step 310, otherwise end directly;

Peak detection is needed throughout note segmentation, so the peak detection method used in this embodiment is introduced first. The energies of each pair of consecutive speech frames in the frame sequence are compared one by one, and a count is taken of the speech frames that belong to the continuously rising energy section beginning at the first frame of the voiced segment (or at the last frame of the previous falling energy section) and to the continuously falling energy section that follows it. If this count is greater than a set threshold (preferably 5 to 9; 7 in this embodiment), the energy curve over these frames is considered to form a peak. The maximum energy value on this stretch of the curve is the peak value of the peak, and the energy value of its last speech frame is the valley value of the peak; the peak position and valley position are the speech frames corresponding to the peak value and valley value, respectively. The start position of the peak is the first frame of the voiced segment or the last frame of the previous falling energy section, as described above.

If the number of speech frames in the continuously rising energy section and the following continuously falling section is less than or equal to the threshold, the energy curve over these frames is considered a small spike and is not processed.
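The peak detection rule can be sketched as a scan over one voiced segment. This is an illustrative reading of the text: the dictionary layout and function name are hypothetical, equal consecutive energies are treated as part of the falling section, and the frame shared by two adjacent sections is counted in both.

```python
def find_peaks(energy, min_frames=7):
    """Detect peaks in the smoothed energy of one voiced segment.

    A rising run followed by a falling run forms a peak only if it spans
    more than `min_frames` frames; shorter runs are ignored as small spikes.
    """
    peaks = []
    last = len(energy) - 1
    start = 0
    while start < last:
        i = start
        while i < last and energy[i + 1] > energy[i]:     # continuously rising section
            i += 1
        peak_pos = i
        while i < last and energy[i + 1] <= energy[i]:    # continuously falling section
            i += 1
        valley_pos = i
        if valley_pos - start + 1 > min_frames:
            peaks.append({
                "start": start,
                "peak": energy[peak_pos], "peak_pos": peak_pos,
                "valley": energy[valley_pos], "valley_pos": valley_pos,
            })
        start = valley_pos     # next section begins at the last frame of this fall
    return peaks
```

A nine-frame rise and fall such as `[1, 2, 3, 4, 5, 4, 3, 2, 1]` is reported as one peak, while the five-frame bump `[1, 2, 3, 2, 1]` is skipped as a small spike.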

Step 310: take the first detected peak as the first peak, and record the start position of the voiced segment and the parameters of the first peak, namely its peak value, valley value, peak position, and valley position;

Step 320: continue detection and determine whether a next peak is detected before a speech frame with energy below the critical value is encountered; if not, there are no further peaks before the end of the voiced segment, so perform step 330; if a next peak is detected, output the start position of the voiced segment (step 320a) and perform step 340;

Step 330: determine that the voiced segment corresponds to a single note, output the start position and end position of the voiced segment, and end;

Step 340: take the detected next peak as the second peak, record its peak value, valley value, peak position, and valley position, and determine whether the peak energy difference ratio and the peak-to-peak distance of the first and second peaks are both greater than their preset thresholds; if so, perform step 350, otherwise perform step 380;

The peak energy difference ratio of the first and second peaks is calculated as follows: the difference between the peak value of the second peak and the valley value of the first peak is divided by the difference between the peak value and the valley value of the first peak. When this ratio is smaller than the corresponding threshold, the current second peak is considered to be a small spike occurring on the falling section of the first peak and does not correspond to a note. The threshold against which the calculated ratio is compared is preferably 0.1825 to 0.3125; this embodiment uses 0.1875.

The peak-to-peak distance of two peaks is the number of speech frames between the two peak positions. The threshold against which this number is compared is preferably 5 to 9; this embodiment uses 7. If the peak-to-peak distance is smaller than the threshold, the second peak is very close to the first peak, and the two peaks are not considered to correspond to separate notes.

Of course, the two conditions above, the peak energy difference ratio and the peak-to-peak distance, can each be used on its own, and other double-peak detection conditions can also be used to filter out the transition components.
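The two conditions checked in step 340 can be sketched as a single predicate. The dictionary keys and function name are hypothetical, the peak-to-peak distance is approximated as the difference between the two peak frame indices, and returning `False` when the first peak has no drop between peak and valley is an added safeguard rather than something stated in the text.

```python
def is_separate_note(first, second, ratio_threshold=0.1875, distance_threshold=7):
    """Return True if `second` should be treated as a new note after `first`."""
    drop = first["peak"] - first["valley"]
    if drop <= 0:                                   # assumption: degenerate first peak
        return False
    ratio = (second["peak"] - first["valley"]) / drop
    distance = second["peak_pos"] - first["peak_pos"]
    return ratio > ratio_threshold and distance > distance_threshold

first = {"peak": 40.0, "valley": 30.0, "peak_pos": 5, "valley_pos": 12}
strong = {"peak": 38.0, "valley": 29.0, "peak_pos": 17, "valley_pos": 25}
spike = {"peak": 31.0, "valley": 29.0, "peak_pos": 17, "valley_pos": 25}
```

Here `strong` gives a ratio of (38 - 30) / (40 - 30) = 0.8 at a distance of 12 frames and is accepted, while `spike` gives a ratio of 0.1 and is rejected as a small spike on the falling section of the first peak.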

Step 350: continue detection and determine whether a next peak is detected before a speech frame with energy below the critical value is encountered; if not, there are no further peaks before the end of the voiced segment, so perform step 360; if a next peak is detected, perform step 370;

Step 360: determine that the first peak and the second peak each correspond to a note, output the valley position of the first peak and the end position of the voiced segment, and end;

Step 370: determine that the first peak corresponds to a note and output its valley position; at the same time replace the stored parameters of the first peak with the parameters of the second peak, so that the former second peak becomes the new first peak, and return to step 340;

Step 380: continue detection and determine whether a next peak is detected before a speech frame with energy below the critical value is encountered; if not, there are no further peaks before the end of the voiced segment, so perform step 390; if a next peak is detected, perform step 400;

Step 390: determine that the first peak corresponds to a note, output the end position of the voiced segment, and end;

Step 400: update the parameters of the first peak according to the currently recorded parameters of the first and second peaks, and return to step 340;

When updating the parameters of the first peak, this embodiment replaces the peak value of the first peak with the larger of the two peak values, and replaces the valley value of the first peak either with the valley value that follows this larger peak or with the smaller of the two valley values. The peak position and valley position of the first peak are updated to the speech frames corresponding to the new peak value and valley value.

After detection is complete, or during detection, notes can be segmented according to the output start and end positions of the voiced segment and the output valley positions of the peaks: within one voiced segment, each pair of adjacent output positions is the start and end position of one note.
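The flow of steps 300 to 400 can be sketched as a loop over peaks that have already been detected for one voiced segment. This is an illustrative reading only: the peaks are assumed to be precomputed dictionaries with `peak`, `valley`, `peak_pos`, and `valley_pos` keys, the function name is hypothetical, and the update in step 400 uses the "valley following the larger peak" option.

```python
def segment_notes(seg_start, seg_end, peaks, ratio_thr=0.1875, dist_thr=7):
    """Return the list of boundary positions output for one voiced segment."""
    if not peaks:                                   # step 300: no peak found
        return []
    first = dict(peaks[0])                          # step 310
    if len(peaks) == 1:                             # steps 320 and 330
        return [seg_start, seg_end]
    bounds = [seg_start]                            # step 320a
    for idx in range(1, len(peaks)):
        second = peaks[idx]                         # step 340
        has_next = idx + 1 < len(peaks)
        drop = first["peak"] - first["valley"]
        ratio = (second["peak"] - first["valley"]) / drop if drop > 0 else 0.0
        distance = second["peak_pos"] - first["peak_pos"]
        if ratio > ratio_thr and distance > dist_thr:
            bounds.append(first["valley_pos"])      # steps 360 and 370
            if has_next:
                first = dict(second)                # step 370: second becomes first
            else:
                bounds.append(seg_end)              # step 360
        elif has_next:
            if second["peak"] > first["peak"]:      # step 400: keep the larger peak
                first = dict(second)                # and the valley that follows it
        else:
            bounds.append(seg_end)                  # step 390
    return bounds

a = {"peak": 40.0, "valley": 30.0, "peak_pos": 5, "valley_pos": 12}
b = {"peak": 31.0, "valley": 29.0, "peak_pos": 15, "valley_pos": 18}   # small spike
c = {"peak": 42.0, "valley": 28.0, "peak_pos": 25, "valley_pos": 38}
bounds = segment_notes(0, 40, [a, b, c])
notes = list(zip(bounds, bounds[1:]))
```

In this example the spike `b` is absorbed into the first note, so the segment is split into two notes at the valley position of `a`.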

In general there are always some transition components between two peaks, including small spikes and gentle declines (which belong to the tail of the first peak) and jitter and irregular rising curves (which belong to the onset of the second peak). Double-peak detection handles these transition components effectively and therefore improves the accuracy of conventional peak detection algorithms.

This embodiment uses the note segmentation device shown in Figure 7 to carry out the note segmentation method described above. The device includes a first-order low-pass filter (this unit is optional), a peak detection module, a main control module, a note segmentation module, a storage module, and a double-peak determination module, wherein:

The first-order low-pass filter filters the energy of each speech frame in the detected speech frame sequence; the peak detection module and the main control module operate on the filtered speech frame energies;

The main control module performs note segmentation on each voiced segment formed by consecutive speech frames with energy above the critical value, and further includes a first control unit, a second control unit, a third control unit, a fourth control unit, and a fifth control unit, wherein:

The second control unit calls the peak detection module. If a next peak is detected before a speech frame with energy below the critical value is encountered, it outputs the start position of the voiced segment to the note segmentation module and triggers the third control unit to continue processing; otherwise, it outputs the start and end positions of the voiced segment to the note segmentation module;

The fourth control unit calls the peak detection module. If a next peak is detected before a speech frame with energy below the critical value is encountered, it outputs the valley position of the first peak to the note segmentation module, replaces the stored parameters of the first peak with the parameters of the second peak, and triggers the third control unit to continue processing; otherwise, it outputs the valley position of the first peak and the end position of the voiced segment to the note segmentation module;

The fifth control unit calls the peak detection module. If a next peak is detected before a speech frame with energy below the critical value is encountered, it updates the parameters of the first peak according to the parameters of the first and second peaks (see step 370 in Figure 6) and triggers the third control unit to continue processing; otherwise, it outputs the end position of the voiced segment to the note segmentation module;

The note segmentation module takes each pair of adjacent positions output during the processing of a voiced segment as the start and end positions of one note, thereby completing the note segmentation of that voiced segment.

Sometimes there are small spikes between peaks. In this embodiment, the valley position of the preceding peak is used as the end position of the note corresponding to that peak. In other implementations, however, the start position of the following peak can be used as the end position of the note corresponding to the preceding peak. In the flow, this means that step 340 also records the start position of the second peak, and that steps 360 and 370 first replace the valley position of the first peak with the start position of the second peak and then output the replaced valley position of the first peak. In the device, the third control unit also records the start position of the second peak, and the fourth control unit first replaces the valley position of the first peak with the start position of the second peak and then outputs the replaced valley position of the first peak.

Step 30 of the overall flow also converts the segmented notes into a MIDI score, as follows:

As is well known, each of "1, 2, 3, 4, 5, 6, 7 -- do, re, mi, fa, so, la, xi" within each octave of numbered notation corresponds to a frequency value, and under twelve-tone equal temperament in music theory each corresponds to a different MIDI value, for example:

Octave5 (octave)    MIDI pitch

415.30 Hz    G5# So5#    --- MIDI 68

440.00 Hz    A5 La5    --- MIDI 69

466.16 Hz    A5# La5#    --- MIDI 70

A MIDI score requires note information, including the length of each note and the mean pitch of each note.

The length of a note is obtained from the number of speech frames between two adjacent positions output during note segmentation of the voiced segment. These positions may be a peak start position, a valley position, or the start or end position of the voiced segment.

The mean pitch of a note is obtained by locating the stretch of the pitch curve between two adjacent positions output during note segmentation of the voiced segment and averaging the pitch over that stretch. For example, if the two adjacent output positions are 10 and 35 (expressed as speech frame numbers) and P(n) is the pitch value of the nth speech frame on the pitch curve, the mean pitch of the note is:

Pitch = [P(10)+P(11)+...+P(35)]/(35-10)

The mean pitch is converted into the corresponding frequency f_pitch = f_x/Pitch, where f_x is the sampling frequency.

Thus the frequency value (or octave value) is first obtained from the pitch of the note and then quantized to "1, 2, 3, ..." in numbered notation; finally the MIDI pitch is obtained from the twelve-tone equal temperament formula MIDI = 69 + 12×log₂[(FS/440)×f_pitch]. For example, if the system obtains a note with a pitch of 430 Hz, it first quantizes it to A5 La5 in Octave5, which can then be represented by MIDI 69, a value calculated in advance from the twelve-tone equal temperament formula.
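The conversion from a mean pitch delay to a MIDI note number can be sketched as below. Note the assumption: this sketch uses the commonly stated equal-temperament relation 69 + 12·log2(f / 440) with rounding to the nearest integer, which is not necessarily the exact expression given in the text, and the function names are hypothetical.

```python
import math

def delay_to_frequency(mean_delay, sample_rate=8000):
    """Convert a mean pitch delay in samples to a frequency in Hz."""
    return sample_rate / mean_delay

def frequency_to_midi(freq_hz):
    """Quantize a frequency to the nearest MIDI note number.

    Assumption: standard equal temperament with 440 Hz mapped to MIDI 69.
    """
    return round(69 + 12 * math.log2(freq_hz / 440.0))
```

Under this relation, 430 Hz gives 69 + 12·log2(430/440) ≈ 68.6, which rounds to MIDI 69, in agreement with the example in the text.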

In summary, the present invention can be applied to PCs and embedded systems and can track hummed vocal signals as well as certain electronic musical instruments. By using a time-domain autocorrelation pitch detection algorithm and energy-based note segmentation, the system occupies few resources, the method is simple, and it is convenient and flexible to use.

Claims

1. A pitch tracking and playing system, comprising: a voice input processing module for receiving input sound and sampling and framing it; a pitch and energy detection module for detecting the pitch and energy of each speech frame; a note segmentation device for performing note segmentation according to the energy detection results; and a score conversion module for converting the segmented notes into a score; characterized in that the system further comprises a voice synthesis module for synthesizing the converted score into a musical instrument digital interface file and playing it.

2. The pitch tracking and playing system of claim 1, characterized in that the pitch and energy detection module further comprises a preprocessing unit, a normalized cross-correlation function calculation unit, a post-processing unit, and a pitch frequency search unit, wherein the preprocessing unit performs filtering with a second-order low-pass filter.

3. The pitch tracking and playing system of claim 2, characterized in that the cross-correlation function calculation unit calculates the cross-correlation function at every other point.

4. The pitch tracking and playing system of claim 2 or 3, characterized in that the pitch frequency search unit searches every other point, and its search range lies within 20-120.

5. The pitch tracking and playing system of claim 1, characterized in that the note segmentation device comprises a peak detection module, a main control module, a note segmentation module, a storage module, and a double-peak determination module, wherein:

The peak detection module is used to count the speech frames, with energy greater than the critical value, that belong to the continuously rising energy section beginning at the first speech frame of the voiced segment or at the last speech frame of the previous falling energy section and to the continuously falling energy section that follows it; if this number is greater than a third threshold, the stretch of energy curve corresponding to these speech frames is judged to form a peak, the maximum energy value on this stretch of curve being the peak value of the peak, the energy value of the last speech frame on this stretch of curve being the valley value of the peak, and the peak position and valley position of the peak being the speech frames corresponding to the peak value and the valley value, respectively;

The storage module is used to store the parameters of the peaks and the start and end positions of the voiced segment;

The double-peak determination module is used to judge whether the ratio of the difference between the peak value of the second peak and the valley value of the first peak to the difference between the peak value and the valley value of the first peak is greater than a first threshold, and whether the number of speech frames between the peak positions of the first peak and the second peak is greater than a second threshold; if both conditions hold, it returns a determination result of success, otherwise it returns a determination result of failure;

The main control module is used to perform note segmentation on each voiced segment formed by consecutive speech frames with energy greater than the critical value, and further comprises a first control unit, a second control unit, a third control unit, a fourth control unit, and a fifth control unit, wherein:

The first control unit is used to call the peak detection module starting from the start position of the voiced segment; if no peak is detected, it ends the processing of the voiced segment; otherwise it takes the first detected peak as the first peak, saves the start position of the voiced segment and the peak value, valley value, peak position, and valley position of the first peak to the storage module, and triggers the second control unit to continue processing;

The second control unit is used to call the peak detection module; if a next peak is detected before a speech frame with energy less than the critical value is detected, it outputs the start position of the voiced segment to the note segmentation module and triggers the third control unit to continue processing; otherwise, it outputs the start and end positions of the voiced segment to the note segmentation module;

The third control unit takes the detected next peak as the second peak, records its peak value, valley value, peak position, and valley position, and calls the double-peak determination module; if the returned determination result is success, it triggers the fourth control unit to continue processing, otherwise it triggers the fifth control unit to continue processing;

The fourth control unit is used to call the peak detection module; if a next peak is detected before a speech frame with energy less than the critical value is detected, it outputs the valley position of the first peak to the note segmentation module, overwrites the stored parameters of the first peak with the corresponding parameters of the second peak, and triggers the third control unit to continue processing; otherwise it outputs the valley position of the first peak and the end position of the voiced segment to the note segmentation module;

The fifth control unit is used to call the peak detection module; if a next peak is detected before a speech frame with energy less than the critical value is detected, it replaces the peak value of the first peak with the larger of the peak values of the first peak and the second peak, replaces the valley value of the first peak with the valley value following this larger peak or with the smaller of the two valley values, updates the peak position and valley position of the first peak to the speech frames corresponding to the new peak value and valley value, and then triggers the third control unit to continue processing; otherwise it outputs the end position of the voiced segment to the note segmentation module;

The note segmentation module is used to take each pair of adjacent positions output during the processing of the voiced segment as the start and end positions of one note, thereby completing the note segmentation of the voiced segment.

6. A pitch tracking and playing method, comprising the following steps:

(a) sampling and framing the input sound;

(b) detecting the pitch and energy of each speech frame;

(c) performing note segmentation according to the energy detection results, and then converting the note segmentation result into a score according to the detected pitch;

(d) synthesizing the converted score into a musical instrument digital interface file and playing it.

7. The pitch tracking and playing method of claim 6, characterized in that detecting the pitch of a speech frame in step (b) is further divided into the following steps:

(b1) preprocessing the framed speech signal, including mean processing and low-pass filtering, the filtering being performed with a second-order low-pass filter;

(b2) comparing the energy of the preprocessed speech with a threshold; if it is higher than the threshold, the speech is judged to be voiced and the method proceeds to step (b3); otherwise, step (b7) is performed;

(b3) calculating the normalized cross-correlation function value ρ(t) between each frame of the voiced signal and its delayed copy;

(b4) performing post-processing to obtain the ρ(t) with the highest correlation and the corresponding optimal delay T;

(b5) performing a pitch frequency search, that is, checking whether the delay t in samples corresponding to the ρ(t) with the highest correlation lies within the search range; if it does, step (b6) is performed, otherwise step (b7) is performed;

(b6) judging the frame to be voiced, outputting the optimal delay T, deriving the pitch value of the frame from T according to the relationship between time and frequency, and ending;

(b7) judging the frame to be unvoiced, setting the pitch to 0, and ending.

8. The pitch tracking and playing method of claim 7, characterized in that in step (b3) the cross-correlation function values are calculated at every other point.

9. The pitch tracking and playing method of claim 7, characterized in that in step (b5) the pitch frequency search is performed at every other point, and its search range lies within 20-120.

10. as claim 6 or 8 described pitch tracking and player methods, it is characterized in that, when described step (c) is carried out the note cutting,, carry out following steps energy each voiced segments greater than the continuous speech frame formation of critical value:

(c1) with first speech frame be the reference position of this voiced segments, as detect crest, with first detected crest as primary peak, write down peak value, valley, peak and the valley position of this voiced segments reference position and primary peak, carry out next step, as detecting, finish less than crest;

(c2) continuing detection; if a next crest is detected before a speech frame whose energy is less than the critical value is detected, outputting the start position of the voiced segment and executing step (c3); otherwise, outputting the start and end positions of the voiced segment and executing step (c6);

(c3) taking the detected next crest as the second crest and recording its peak value, valley value, peak position and valley position; judging whether the ratio of the difference between the peak value of the second crest and the valley value of the first crest to the difference between the peak value and the valley value of the first crest is greater than a first threshold, and whether the number of speech frames between the peak positions of the first crest and the second crest is greater than a second threshold; if both conditions hold, executing step (c4); otherwise, executing step (c5);

(c4) continuing detection; if a next crest is detected before a speech frame whose energy is less than the critical value is detected, outputting the valley position of the first crest, overwriting the stored parameters of the first crest with the parameters of the second crest, and returning to step (c3); otherwise, outputting the valley position of the first crest and the end position of the voiced segment, and executing step (c6);

(c5) continuing detection; if a next crest is detected before a speech frame whose energy is less than the critical value is detected, overwriting the peak value of the first crest with the larger of the first crest peak value and the second crest peak value, overwriting the valley value of the first crest with the valley value following that larger peak or with the smaller of the two valley values, updating the peak position and valley position of the first crest to the speech frames corresponding to the new peak value and valley value, and then returning to step (c3); otherwise, outputting the end position of the voiced segment and executing step (c6);

(c6) taking every two adjacent output positions as the start and end positions of one note, thereby completing the note segmentation of the voiced segment.
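The comparison in step (c3) and the merge in step (c5) can be sketched as below. This is an illustration under stated assumptions, not the claimed implementation: the `Crest` record and the function names are hypothetical, the number of frames between the two peaks is approximated by the difference of their frame indices, and of the two valley options named in step (c5) the sketch uses the smaller of the two valley values.

```python
from dataclasses import dataclass


@dataclass
class Crest:
    """Hypothetical record of the four values stored per crest."""
    peak: float
    valley: float
    peak_pos: int    # frame index of the peak
    valley_pos: int  # frame index of the valley


def starts_new_note(first, second, ratio_threshold, min_frame_gap):
    """Step (c3): True if the second crest is judged to begin a new note."""
    rise = first.peak - first.valley
    if rise <= 0:
        return False  # degenerate first crest; the ratio is undefined
    ratio = (second.peak - first.valley) / rise
    frame_gap = second.peak_pos - first.peak_pos
    return ratio > ratio_threshold and frame_gap > min_frame_gap


def merge_crests(first, second):
    """Step (c5): fold the second crest into the first crest."""
    big = first if first.peak >= second.peak else second
    low = first if first.valley <= second.valley else second
    return Crest(big.peak, low.valley, big.peak_pos, low.valley_pos)
```

When `starts_new_note` returns true, step (c4) outputs the valley position of the first crest as a note boundary; otherwise the two crests are treated as one and merged before the next comparison.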

Wherein, when crest detection is performed in steps (a) to (e), the energies of adjacent speech frames are compared pair by pair, and a count is made of the speech frames, with energy greater than the critical value, that belong to a continuous energy-rising segment, beginning at the initial speech frame of the voiced segment or at the last speech frame of the previous energy-falling segment, together with the continuous energy-falling segment that follows it. If this number is greater than a third threshold, the section of the energy curve corresponding to these speech frames is judged to constitute a crest: the maximum energy value on this section is the peak value of the crest; the energy value of the last speech frame on this section is the valley value of the crest; the peak position and valley position of the crest are the speech frames corresponding to the peak value and the valley value, respectively; and the initial speech frame of the voiced segment, or the last speech frame of the previous energy-falling segment, is the start position of the crest.
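The crest detection rule described above can be sketched as a single pass over the frame energies of one voiced segment. The sketch makes several assumptions that the claim does not state: the input list contains only frames whose energy already exceeds the critical value, equal adjacent energies are treated as part of the rising segment, and the function name `find_crests` and its tuple output are hypothetical.

```python
def find_crests(energies, min_frames):
    """Return (start, peak_pos, valley_pos) frame indices for each crest.

    energies:   frame energies of one voiced segment (all assumed to be
                above the critical value).
    min_frames: the "third threshold"; a rise-plus-fall run must contain
                more frames than this to count as a crest.
    """
    crests = []
    n = len(energies)
    start = 0
    while start < n - 1:
        i = start
        # Continuous energy-rising segment (ties count as rising here).
        while i + 1 < n and energies[i + 1] >= energies[i]:
            i += 1
        # The continuous energy-falling segment that follows it.
        while i + 1 < n and energies[i + 1] < energies[i]:
            i += 1
        valley_pos = i  # last frame of the falling segment
        frame_count = valley_pos - start + 1
        if frame_count > min_frames:
            # The peak is the maximum energy on this section of the curve.
            peak_pos = max(range(start, valley_pos + 1),
                           key=lambda k: energies[k])
            crests.append((start, peak_pos, valley_pos))
        # The next crest starts at the last frame of this falling segment.
        start = valley_pos
    return crests
```

Short ripples in the energy curve span too few frames to exceed `min_frames` and are therefore ignored, which is the purpose of the third threshold.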