Monday 23 April 2018

Moving Away from Press-To-Talk?


From the start, radio amateurs and many others have used the "press-to-talk" (PTT) approach for voice communications: transmission is initiated by pressing a button, talking continues for a period of time, and then the operator invites the other party (or parties) to reply while he or she receives. This approach is currently used for both analog and digital modes. It allows simpler equipment, and the same communication channel can be reused for communication in either direction. Obvious disadvantages include the inability of the receiving station to interrupt or reply during an 'over' and the lack of feedback to the sender about the reception of their signal (until the next over).

Can we move away from PTT to achieve more natural methods of radio communication, say over HF channels? Cellular systems achieve duplex operation via rapid time multiplexing (TDD) or by the use of separate frequency allocations (FDD). To avoid significant complexity, a time-division scheme with longer frame times could be envisaged as follows:

  • We assume a software-defined (at least partially) approach whereby speech is digitised into packets, which are only transmitted after the voice-activity-detector (VAD) indicates speech is present. These packets are transmitted during an allocated period of the time frame. For example, if A initiates the call, her packets could be transmitted during the first part of the frame, after which A will receive packets from B until the end of the frame.
  • We consider an adaptive scheme where the person who is talking the most gets a larger portion of the Tx time. So if B is mainly listening to A, he might be allocated just the last 10% of the frame for his transmission, which is just enough for some interjections or brief comments, plus (digital) feedback on signal quality, including how much of his speech is queued and waiting to be sent.
  • This quasi-duplex scheme requires a cooperative approach from the operators -- they would be given an indication of how much of their speech is waiting to be sent, and how much from the other end is waiting.  Polite operators would stop talking when the other party wants to say something! 
  • What frame period should be used? Longer frames (e.g. >10 seconds) could allow greater interleaving and robustness, but of course latency will become an increasing problem for two-way communications. Short frames (e.g. a few seconds) will suffer a higher overhead from Tx/Rx switching, guard times, etc. and need tighter synchronisation. Of course we envisage the use of source and channel coding (e.g. something like FreeDV), so we need to avoid very short frame durations to suit these algorithms.
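
The adaptive split described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the simulation: the function name `split_frame`, the proportional-to-queue rule, the 10% minimum share and the `guard_s` parameter are all hypothetical choices made here for the example.

```python
def split_frame(frame_s, queued_a_s, queued_b_s, min_share=0.10, guard_s=0.0):
    """Return (a_tx_s, b_tx_s), the transmit time in seconds for A and B.

    All names and defaults here are illustrative assumptions.
    """
    usable = frame_s - 2 * guard_s  # assume two Tx/Rx turnarounds per frame
    if usable <= 0:
        raise ValueError("frame too short for the guard times")
    total = queued_a_s + queued_b_s
    # With nothing queued at either end, fall back to an even split.
    share_a = 0.5 if total == 0 else queued_a_s / total
    # Keep a minimum slot for the quieter party (interjections, feedback).
    share_a = min(max(share_a, min_share), 1.0 - min_share)
    a_tx = usable * share_a
    return a_tx, usable - a_tx


if __name__ == "__main__":
    # A has 3 s of speech queued, B has nothing: B keeps only the minimum slot.
    print(split_frame(4.0, queued_a_s=3.0, queued_b_s=0.0))
```

Clamping the share means the listening party always keeps a small slot, which is what lets him interject or report signal quality even while the other end is doing nearly all of the talking.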


Given the likely short pauses and delays in speech delivery under the scheme above, it is hard to say how well it will work. I've therefore created a small Python simulation of this "ADDS" scheme (see figure below) using UDP packet transmission between two Linux PCs, with sound-cards and headsets. (ADDS stands for "adaptive delayed duplex speech".) The VAD is very simple and just checks the maximum sample amplitude in a block. This simulation is rather basic, with no source or channel coding, using 8 bit samples at 8 kHz. The percentage of transmission time at each terminal can only be adjusted manually at present, and while the local queue length is visible, the remote queue length is not.
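
A peak-amplitude VAD of the kind described above might look roughly like the following. This is a sketch, not the simulation's actual code: it assumes unsigned 8 bit PCM (silence at a value of 128), and the threshold of 8 and the 160-sample block (20 ms at 8 kHz) are arbitrary illustrative values.

```python
MIDPOINT = 128  # assumption: unsigned 8 bit PCM, so silence sits at 128


def block_has_speech(block, threshold=8):
    """True if the peak deviation from the midpoint exceeds the threshold."""
    if not block:
        return False
    peak = max(abs(sample - MIDPOINT) for sample in block)
    return peak > threshold


def voiced_blocks(samples, block_len=160, threshold=8):
    """Split a bytes object into blocks and keep only those flagged as speech."""
    blocks = (samples[i:i + block_len] for i in range(0, len(samples), block_len))
    return [b for b in blocks if block_has_speech(b, threshold)]


if __name__ == "__main__":
    silence = bytes([128]) * 160
    loud = bytes([128] * 159 + [200])
    # Only the middle block would be queued for transmission.
    print(len(voiced_blocks(silence + loud + silence)))
```

A detector this simple will trigger on any click or bump, so a real implementation would probably want some smoothing or hang time, but it is enough to decide which blocks are worth queuing.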





The results seem encouraging so far. Using a frame period of 3 or 4 seconds, the pauses in conversation are obviously noticeable, but not too annoying. On the other hand, natural speech contains silence periods. These are not transmitted, so speech from the other end (that has been waiting) may be delivered faster than it was spoken. The simulation code is on GitHub.
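
To see why queued speech can arrive faster than it was spoken, it helps to compare the time spent talking with the time actually sent once silent blocks are dropped. The helper below is a hypothetical illustration; the 20 ms block length and the pattern of voiced and silent blocks are made up for the example and are not measurements from the simulation.

```python
def spoken_and_sent_seconds(voiced_flags, block_s=0.02):
    """Return (spoken_s, sent_s) for a sequence of per-block VAD decisions.

    spoken_s covers every block; sent_s counts only blocks flagged as speech.
    """
    spoken = len(voiced_flags) * block_s
    sent = sum(1 for voiced in voiced_flags if voiced) * block_s
    return spoken, sent


if __name__ == "__main__":
    # Made-up pattern: 1.2 s of speech, a 0.8 s pause, then 1.0 s of speech.
    flags = [True] * 60 + [False] * 40 + [True] * 50
    spoken, sent = spoken_and_sent_seconds(flags)
    print(f"spoken {spoken:.1f} s, sent {sent:.1f} s")
```

In this made-up case the listener hears the utterance with the pause removed, so some of the delay added by waiting for a transmit slot is recovered.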

This method would take some effort to implement over a radio channel. Frame sync could rely on NTP-disciplined clocks, as is done for the recent digital modes in WSJT. For simplicity the frame allocations could be fixed, e.g. 50% of the frame for each party, but performance would suffer significantly. The adaptive scheme requires the control portions of the frame to be particularly robust, which will be challenging on a flaky channel. It would be sensible to always send the control transmissions in the same part of the time frame. For example, A's status and control information (including ID) could be sent in the first (say) 10% of the frame and B's status in the last 10%, with the rest allocated to speech (probably by the calling party A), as required.
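
The frame layout suggested above, with fixed control slots at each end and an adjustable speech boundary in between, could be sketched as follows. The function names, the slot labels and the idea of taking the frame position as the clock time modulo the frame period are assumptions made for illustration; nothing here has been tried over a radio channel.

```python
import time


def frame_layout(frame_s, a_speech_share, control_share=0.10):
    """Return a list of (slot_name, start_s, end_s) tuples for one frame."""
    ctrl = frame_s * control_share
    speech = frame_s - 2 * ctrl
    a_end = ctrl + speech * a_speech_share
    return [
        ("A control", 0.0, ctrl),               # fixed: first 10% of the frame
        ("A speech", ctrl, a_end),              # adaptive boundary at a_end
        ("B speech", a_end, frame_s - ctrl),
        ("B control", frame_s - ctrl, frame_s),  # fixed: last 10% of the frame
    ]


def current_slot(now_s, layout, frame_s):
    """Map a clock time (assumed NTP-synchronised) to the active slot name."""
    pos = now_s % frame_s
    for name, start, end in layout:
        if start <= pos < end:
            return name
    return layout[-1][0]  # guards against float rounding at the frame edge


if __name__ == "__main__":
    layout = frame_layout(4.0, a_speech_share=0.75)
    print(current_slot(time.time(), layout, 4.0))
```

Because the control slots never move, each end can always find the other's status information even if it has lost track of the current speech boundary.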