Friday, 3 August 2018

How Fourier Transforms Work

In a recent talk I tried to explain how Fourier Transforms can be used to estimate the frequency content of signals (i.e. the spectral content). Normally we would use maths to illustrate these concepts - but that doesn't suit everyone, so here's an attempt with only animated images!

First let's consider a signal which contains only one frequency component. We consider sampled signals (i.e. discrete-time signals) - the upper plot on the left shows a sine wave sampled at regular time intervals. In fact any periodic signal can be decomposed into sinusoidal components, but for simplicity we will only consider sine waves.

Since each sample has a magnitude and phase, in the lower left plot we show a polar version of the sample sequence. This rotating "phasor" signal is actually a more general representation than plotting the samples versus time. The original signal can be recovered by taking the horizontal component of the polar diagram. Likewise the vertical component represents another sine wave, with a 90 degree phase difference. (In radio models, these components are called the In-phase and Quadrature signals, or I and Q.)
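For readers who prefer code to pictures, here is a minimal Python sketch of the phasor idea. The function name `phasor_samples` and the values (2 cycles over 16 samples) are assumptions chosen for illustration, not taken from the animations.

```python
import cmath
import math

def phasor_samples(cycles, num_samples, amplitude=1.0):
    """Return one complex (I/Q) sample per time step for a rotating phasor."""
    step = 2 * math.pi * cycles / num_samples  # phase advance per sample
    return [cmath.rect(amplitude, step * n) for n in range(num_samples)]

# Hypothetical example values: 2 cycles spread over 16 samples.
samples = phasor_samples(cycles=2, num_samples=16)

in_phase = [s.real for s in samples]    # horizontal component: the original signal
quadrature = [s.imag for s in samples]  # vertical component: 90 degrees apart
```

With these values one cycle spans 8 samples, so the quadrature sequence is simply the in-phase sequence delayed by 2 samples (a quarter cycle).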

Of course if the signal frequency is lower, the plot will show more samples per cycle (see magenta example). 

Noise has been added to the sine wave shown on the LHS. (In the polar plot, independent noise samples have been added in both the I and Q dimensions.)

From now on we will drop the time-domain plots. Our aim is to estimate the amount of each sine wave component in a sampled signal. Assume our input signal is shown as blue phasor samples and will be compared to 4 "references" plotted below in black. Ref2 is twice the frequency of Ref1, Ref3 is three times, and so on. A Discrete Fourier Transform (DFT) would normally contain many more frequency references, but four is enough for this illustration.

 You might notice that the blue input is the same frequency as Ref2.  How can we use the reference phasors to estimate the input spectrum?  

Assume that for each new sample, we multiply the signal magnitude by the reference magnitude (which is 1) and that we take the difference between their phases. This gives a product phasor for each new sample which will be added to the previous product (as in vector addition).  The result is shown below, with the references shown in different colours for clarity.  
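The multiply-and-accumulate step described above can be sketched in a few lines of Python. The function names and the 16-sample window are assumptions for illustration. Note that multiplying the magnitudes and subtracting the phases is the same as multiplying the signal sample by the complex conjugate of the reference sample.

```python
import cmath
import math

def product_phasor(sample, reference):
    """Multiply the magnitudes and subtract the phases of two complex samples."""
    magnitude = abs(sample) * abs(reference)
    phase = cmath.phase(sample) - cmath.phase(reference)
    return cmath.rect(magnitude, phase)

def running_sums(signal, reference):
    """Vector-add the product phasors, keeping every partial sum."""
    total = 0j
    partial = []
    for s, r in zip(signal, reference):
        total += product_phasor(s, r)
        partial.append(total)
    return partial

# Hypothetical input: 2 cycles over 16 samples, compared against an identical reference.
N = 16
signal = [cmath.rect(1.0, 2 * math.pi * 2 * n / N) for n in range(N)]
ref2 = list(signal)
partial_sums = running_sums(signal, ref2)
```

Because the input and reference rotate together here, every product phasor points the same way and the sum grows by one unit per sample.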

On the RHS sub-plot, observe that the magenta phasor (or vector) sum grows steadily in length during the DFT.   (The phase difference between blue and magenta is zero.)   However the other product sums "curve back" on themselves and their net length is zero at the end of the DFT.  The RHS sub-plots are autoscaling so we can see the initial behaviour more clearly.  The bottom right sub-plot shows the net length of each phasor sum, during the DFT.  This shows, in more conventional plotting style, that the spectral amplitude is zero for all components except Ref2. 
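As a rough numerical check of the behaviour described above, the following sketch compares an input at the Ref2 frequency against four references and reports the net length of each phasor sum. The window length `N = 32` and the helper `phasor` are assumptions for illustration.

```python
import cmath
import math

N = 32  # samples in the DFT window (assumed value)

def phasor(cycles, n):
    """Unit phasor that completes `cycles` rotations over the N-sample window."""
    return cmath.exp(2j * math.pi * cycles * n / N)

signal = [phasor(2, n) for n in range(N)]  # same frequency as Ref2

net_length = {}
for k in (1, 2, 3, 4):
    # Multiply by the conjugate reference (phase difference) and vector-add.
    total = sum(signal[n] * phasor(k, n).conjugate() for n in range(N))
    net_length[k] = abs(total)

for k, length in net_length.items():
    print(f"Ref{k}: {length:.3f}")
```

The Ref2 sum ends with a length close to N, while the other three sums return to approximately zero (up to floating-point rounding).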

If you have followed the figures above -- well done! But is this example too contrived - what happens with noise, or if the input signal has a slightly different frequency? On the right, we see it still works! Now the magenta vector sum is slightly curved - but its length is still much greater than the other components. (We say the input frequency is no longer 'bin-centred'.)
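The same experiment can be tried numerically. In this sketch the input frequency (2.2 cycles per window), the noise level and the random seed are all assumed values picked for illustration, not the ones used in the animation.

```python
import cmath
import math
import random

N = 32              # samples in the DFT window (assumed value)
INPUT_CYCLES = 2.2  # slightly off Ref2, so not bin-centred (assumed value)
NOISE_SIGMA = 0.2   # noise standard deviation per I/Q dimension (assumed value)

rng = random.Random(1)  # fixed seed so the run is repeatable

def phasor(cycles, n):
    """Unit phasor that completes `cycles` rotations over the N-sample window."""
    return cmath.exp(2j * math.pi * cycles * n / N)

# Independent Gaussian noise is added to the I and Q parts of every sample.
noisy_signal = [
    phasor(INPUT_CYCLES, n)
    + complex(rng.gauss(0, NOISE_SIGMA), rng.gauss(0, NOISE_SIGMA))
    for n in range(N)
]

net_length = {}
for k in (1, 2, 3, 4):
    total = sum(noisy_signal[n] * phasor(k, n).conjugate() for n in range(N))
    net_length[k] = abs(total)

strongest = max(net_length, key=net_length.get)
```

The other references no longer sum to exactly zero, since some of the input's energy "leaks" into neighbouring bins, but Ref2 should still come out as the strongest component.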

Needless to say, I encourage you to look at the mathematical description of DFTs. While the DFT takes a lot of numerical processing, thanks to the great work by Cooley and Tukey in the 1960s we now have the very efficient Fast Fourier Transform (FFT). This forms the basis of signal processing in many modern communications systems (and much else as well!)
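To give a flavour of why the FFT is cheaper, here is a textbook recursive radix-2 sketch alongside a direct DFT. It is a teaching example only (it assumes the input length is a power of two), not how a production library would implement it.

```python
import cmath

def dft(x):
    """Direct DFT: every output bin sums over every input sample (N*N products)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# Hypothetical test input: 8 arbitrary complex samples.
example = [complex(n % 3, -n) for n in range(8)]
max_error = max(abs(a - b) for a, b in zip(dft(example), fft(example)))
```

Each level of recursion halves the problem, so the work grows roughly as N log N rather than N squared.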

Sunday, 22 April 2018

Moving Away from Press-To-Talk?

From the start, radio amateurs and many others have used the "press-to-talk" (PTT) approach for voice communications: transmission is initiated by pressing a button, talking continues for a period of time and then the operator invites the other party (or parties) to reply while he or she receives. This approach is currently used for both analog and digital modes. It allows simpler equipment and the same communication channel can be reused for communication in either direction. Obvious disadvantages include the inability of the receiving station to interrupt or reply during an 'over' and lack of feedback to the sender about the reception of their signal (until the next over).

Can we move away from PTT to achieve more natural methods of radio communication, say over HF channels?  Cellular systems achieve duplex operation via rapid time multiplexing, or by the use of multiple frequency allocations (TDD or FDD).  To avoid significant complexity, a time-division scheme with longer frame times could be envisaged as follows:   

  • We assume a software-defined (at least partially) approach whereby speech packets are digitised and only transmitted after the voice-activity-detector (VAD) indicates speech is present.    These packets are transmitted during an allocated period of the time frame. For example if A initiates the call, her packets could be transmitted during the first part of the frame, after which A will receive packets from B, until the end of the frame.  
  • We consider an adaptive scheme where the person who is talking the most will get a larger portion of the Tx time. So if B is mainly listening to A, he might be allocated just the last 10% of the frame for his transmission - which is just enough for some interjections or brief comments, plus (digital) feedback on signal quality, including how much of his speech is queued and waiting to be sent.
  • This quasi-duplex scheme requires a cooperative approach from the operators -- they would be given an indication of how much of their speech is waiting to be sent, and how much from the other end is waiting.  Polite operators would stop talking when the other party wants to say something! 
  • What frame period should be used? Longer frames (e.g. >10 seconds) could allow greater interleaving and robustness, but of course latency will become an increasing problem for two-way communications. Short frames (e.g. a few seconds) will suffer a higher overhead from Tx/Rx switching, guard times, etc. and need tighter synchronisation. Of course we envisage the use of source and channel coding (e.g. ~FreeDV) so need to avoid very short frame durations to suit these algorithms.
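One possible way to turn the adaptive idea into numbers is sketched below. The rule used here (shares proportional to the amount of queued speech, with a 10% floor for each party) is purely an assumption for illustration, as are the function name and default values.

```python
def allocate_frame(queue_a, queue_b, frame_s=4.0, min_share=0.1):
    """Split one frame between A and B in proportion to their queued speech.

    queue_a and queue_b are seconds of speech waiting to be sent.
    Returns (seconds for A, seconds for B).
    """
    total = queue_a + queue_b
    share_a = 0.5 if total == 0 else queue_a / total
    # Keep a minimum slot for each party so interjections and feedback still fit.
    share_a = min(max(share_a, min_share), 1.0 - min_share)
    return share_a * frame_s, (1.0 - share_a) * frame_s

# A has 9 s of speech queued, B has nothing waiting.
tx_a, tx_b = allocate_frame(queue_a=9.0, queue_b=0.0)
```

In this example A gets about 90% of a 4 second frame and B keeps roughly the last 10%.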

Given the likely short pauses and delays in speech delivery under the scheme above, it is hard to say how well it will work. I've therefore created a small Python simulation of this "ADDS" scheme (see figure below) using UDP packet transmission between two Linux PCs, with sound-cards and headsets. (ADDS stands for "adaptive delayed duplex speech".) The VAD is very simple and just checks the maximum sample amplitude in a block. This simulation is rather basic, with no source or channel coding, using 8-bit samples at 8 kHz. The percentage of transmission time at each terminal can only be adjusted manually at present and while the local queue length is visible, the remote queue length is not.
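A peak-amplitude VAD of the kind described can be sketched as follows. This is not the simulation's actual code: the function name, the threshold, and the assumption that samples are unsigned 8-bit values with silence at 128 are all illustrative guesses.

```python
def is_speech(block, threshold=16, midpoint=128):
    """Return True if the peak amplitude in a block of samples exceeds threshold.

    Assumes unsigned 8-bit samples (0..255) where `midpoint` represents silence.
    """
    peak = max((abs(sample - midpoint) for sample in block), default=0)
    return peak > threshold

silence = bytes([128] * 80)               # 10 ms of silence at 8 kHz
loud = bytes([128] * 40 + [200] * 40)     # block containing a loud burst
print(is_speech(silence), is_speech(loud))
```

Only blocks flagged as speech would be queued for transmission. A real VAD would usually add some "hang-over" time so that quiet word endings are not clipped.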

The results seem encouraging so far. Using a frame period of 3 or 4 seconds, the pauses in conversation are obviously noticeable, but not too annoying. On the other hand, natural speech contains silence periods. These are not transmitted, so speech from the other end (that has been waiting) may be delivered faster than it was spoken. The simulation code is on GitHub.

This method would take some effort to implement over a radio channel. Frame sync could use NTP, as for other recent digital modes like WSJT.  For simplicity the frame allocations could be fixed, e.g. 50% of the frame for each party, but performance would suffer significantly.  The adaptive scheme requires the control portions of the frame to be particularly robust which will be challenging on a flaky channel.  It would be sensible to always send the control transmissions in the same part of the time frame. For example A's status and control information (including ID) could be sent in the first (say) 10% of the frame, B's status in the last 10%, with the rest allocated to speech (probably by the calling party A), as required.
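The fixed control positions suggested above could be laid out like this. It is a sketch only: the 4 second frame, the 10% control share and the function names are assumptions, and `now_s` stands in for a clock that both stations have synchronised (e.g. via NTP).

```python
def frame_layout(frame_s=4.0, control_share=0.1):
    """Return (label, start_s, end_s) slots for one frame."""
    ctrl = control_share * frame_s
    return [
        ("A control", 0.0, ctrl),             # A's status, control info and ID
        ("speech", ctrl, frame_s - ctrl),     # shared speech portion
        ("B control", frame_s - ctrl, frame_s),  # B's status and feedback
    ]

def active_slot(now_s, frame_s=4.0, control_share=0.1):
    """Return the label of the slot that is active at synchronised time now_s."""
    offset = now_s % frame_s  # position within the current frame
    for label, start, end in frame_layout(frame_s, control_share):
        if start <= offset < end:
            return label
    return "B control"  # defensive fallback for rounding at the frame edge

print(active_slot(0.1), active_slot(2.0), active_slot(7.9))
```

The speech portion in the middle would then be divided between the two parties as the adaptive scheme requires.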