Something is off in the pitch / NCCF computing functions

I just wanted to point out that something is off in the NCCF calculations in torchaudio (R) that is not present in the Python torchaudio ("pytorchaudio"), so it does not seem to have been carried over from there. I ran some Python code that uses torchaudio to compute pitch using the Kaldi method:

```python
import torch
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T

SPEECH_WAVEFORM, SAMPLE_RATE = torchaudio.load("/Users/frkkan96/Desktop/a1.wav")

# Kaldi-style pitch extraction; the last dimension holds two values per frame
pitch_feature = F.compute_kaldi_pitch(
    waveform=SPEECH_WAVEFORM,
    sample_rate=SAMPLE_RATE,
    frame_length=25.0,
    frame_shift=10.0,
    min_f0=50,
    max_f0=400,
    soft_min_f0=10.0,
    penalty_factor=0.1,
    lowpass_cutoff=1000,
    resample_frequency=4000,
    delta_pitch=0.005,
    nccf_ballast=7000,
    lowpass_filter_width=1,
    upsample_filter_width=5,
    max_frames_latency=0,
    frames_per_chunk=0,
    simulate_first_pass_online=False,
    recompute_frame=500,
    snip_edges=True,
)
pitch, nfcc = pitch_feature[..., 0], pitch_feature[..., 1]
```

When I then look at the output, I am convinced that what I got was one value per windowed portion (frame) of the signal:

```
>>> pitch.size()
torch.Size([1, 402])
>>> nfcc.size()
torch.Size([1, 402])
```

Kaldi pitch extraction is not exposed by the R torchaudio package, but you can get NCCFs using the `functional__compute_nccf` function. This is where I get confused: when I run the following R code (using the native pitch detection), the NCCF output looks nothing like the Python output.

```r
library(torchaudio)

origSoundFile <- "/Users/frkkan96/Desktop/a1.wav"
# beginTime, endTime, windowSize, minF and maxF are set elsewhere (not shown)
audio <- transform_to_tensor(audiofile_loader(filepath = origSoundFile,
                                              offset = beginTime,
                                              duration = (endTime - beginTime), # A duration of 0 seems to be interpreted as the complete file
                                              unit = "time"))
waveform <- audio[[1]]
sample_rate <- audio[[2]]
windowShift <- 10 # ms

pitch <- functional_detect_pitch_frequency(waveform,
                                           sample_rate = sample_rate,
                                           frame_time = windowShift / 1000, # Expects seconds
                                           win_length = windowSize,
                                           freq_low = minF,
                                           freq_high = maxF)

nfcc <- functional__compute_nccf(waveform,
                                 sample_rate = sample_rate,
                                 frame_time = windowShift / 1000, # Expects seconds
                                 freq_low = minF)
```

```
> str(pitch)
Float [1:1, 1:389]
> str(nfcc)
Float [1:1, 1:404, 1:630]
```

Ideally, these two R functions should match the Python interface in their output dimensions. And with identical window shifts (10 ms in this case), the dimensions returned by `functional_detect_pitch_frequency` and `functional__compute_nccf` should be the same, right?
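
For reference, here is a rough sketch of frame arithmetic that could produce the three shapes reported above. It rests on assumptions that have not been confirmed for the R package: that the NCCF function returns one row per frame and one column per candidate lag (`ceil(sample_rate / freq_low)` lags), that the native pitch detector applies a median smoothing over `win_length` frames which shortens the output, and that the Kaldi method uses the usual `snip_edges` framing. The values `num_samples = 178164`, `sample_rate = 44100`, `freq_low = 70` and `win_length = 30` are hypothetical, picked only because they reproduce the reported numbers; they are not taken from the actual file or script.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoids floating-point rounding."""
    return -(-a // b)


def nccf_shape(num_samples, sample_rate, frame_shift_ms, freq_low):
    """Assumed NCCF layout: (frames, lags). Not confirmed for the R port."""
    frame_size = ceil_div(sample_rate * frame_shift_ms, 1000)
    num_frames = ceil_div(num_samples, frame_size)
    num_lags = ceil_div(sample_rate, freq_low)
    return num_frames, num_lags


def frames_after_median_smoothing(num_frames, win_length):
    """Assumed smoothing: pad (win_length - 1) // 2 frames on one side, then slide a window."""
    pad = (win_length - 1) // 2
    return num_frames + pad - win_length + 1


def kaldi_num_frames(num_samples, sample_rate, frame_length_ms, frame_shift_ms):
    """Usual Kaldi framing with snip_edges=True: only complete frames are kept."""
    frame_length = sample_rate * frame_length_ms // 1000
    frame_shift = sample_rate * frame_shift_ms // 1000
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


# Hypothetical inputs (assumptions, see the text above)
NUM_SAMPLES, SAMPLE_RATE = 178164, 44100

print(nccf_shape(NUM_SAMPLES, SAMPLE_RATE, 10, 70))        # (404, 630)
print(frames_after_median_smoothing(404, 30))              # 389
print(kaldi_num_frames(NUM_SAMPLES, SAMPLE_RATE, 25, 10))  # 402
```

If these assumptions hold, the third dimension of the NCCF output would be the lag axis rather than a per-frame score, and the different frame counts would come from framing and smoothing conventions rather than from different window shifts.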


