Reddit bot – string cleanup

While working on a Reddit bot I found myself in a bit of a pickle with strings and retaining the important information in them. The predicament is as follows:

  • I follow YouTube, Vimeo and SoundCloud URLs, extracting the titles of the videos/sounds to use later as search parameters in Spotify. When people create these titles they have a tendency to add meta information, such as release year or upload quality, which conflicts with the Spotify search.

Examples of titles acquired:

title_01 = 'Glassjaw - All Good junkies Go To Heaven (Proper) (High Quality)'
title_02 = 'The White Stripes - Icky Thump (Official Music Video)'
title_03 = 'The Girl with the Dragon Tattoo - Immigrant Song (Title Sequence) [HQ]'

The metadata is in most cases enclosed in brackets of some sort, which makes it fairly convenient to remove. My initial thought was to use RegEx, but I couldn’t manage to create a pattern that would handle multiple occurrences of bracket pairs. Even though finding a RegEx solution to the problem remains a priority, I had to come up with a different solution for now.
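For reference, here is a minimal regex sketch for the simple case. It is not the approach used in this post: the names `BRACKETED` and `strip_bracketed` are made up for illustration, and it assumes the metadata sits in flat, properly closed round or square brackets.

```python
import re

# Optional leading whitespace, an opening bracket, anything up to the
# next closing bracket, and that closing bracket.
BRACKETED = re.compile(r"\s*[(\[][^)\]]*[)\]]")


def strip_bracketed(title):
    """Remove every flat (...) or [...] group from the title (hypothetical helper)."""
    # sub() replaces all non-overlapping matches, so several groups
    # in one title are handled in a single call.
    return BRACKETED.sub("", title).strip()


print(strip_bracketed("The White Stripes - Icky Thump (Official Music Video)"))
print(strip_bracketed("The Girl with the Dragon Tattoo - Immigrant Song (Title Sequence) [HQ]"))
```

A pattern like this does not cope with nested or stray brackets: each match stops at the first closing bracket it meets, so leftovers can remain in the string.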

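While writing the function it became clear that it would work better as a general, recursive “remove all occurrences of this pattern” function rather than one limited to open and close brackets. Run once per bracket type, with the leftover whitespace stripped, it turns the titles above into: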

'Glassjaw - All Good junkies Go To Heaven'
'The White Stripes - Icky Thump'
'The Girl with the Dragon Tattoo - Immigrant Song'

The nifty part is that it handles uneven numbers of open and close brackets (or open and close terms). Say a user got a bit excited when creating a title and wrote “Glassjaw - All Good junkies Go To Heaven (((Proper))((((High Quality))))))”. Or, even better, scrambled the open/close order with “))))(((Proper)()((((High Quality))())())(((”. The function still cleans it up!

def remove_brackets(string, open_term, close_term):
	"""
		Searches for the open and close terms within the string and removes
		them and the content they encapsulate. If only the open or close term
		is found, it is removed from the string.

		The function is recursive and continues until there are no more
		occurrences of the open or close term or the combination of them.
	"""
	# If both an open and a close term are found, remove them and the content they surround.
	if open_term in string and close_term in string:

		# Retrieve the positions of the first open and close terms.
		start = string.index(open_term)
		end = string.index(close_term)

		# Make sure the open term comes before the close term,
		# or else we have two unrelated terms.
		if start < end:

			# Rebuild the string without the terms and their content.
			# Slicing by len(close_term) keeps multi-character terms working.
			string = string[:start] + string[end + len(close_term):]

		# If the close term comes before the open term it is a solitary term.
		else:
			string = string[:end] + string[end + len(close_term):]

	# If there is an open term without a matching close term, remove it.
	elif open_term in string:
		start = string.index(open_term)
		string = string[:start] + string[start + len(open_term):]

	# If there is a close term without a matching open term, remove it.
	elif close_term in string:
		end = string.index(close_term)
		string = string[:end] + string[end + len(close_term):]

	# ----- Recursive behaviour ------
	# If there are any more terms, open or close,
	# call the function again with the string processed so far.
	if open_term in string or close_term in string:
		return remove_brackets(string, open_term, close_term)
	else:
		return string
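The function removes one kind of term per call and leaves the surrounding spaces in place, so a title with both round and square brackets needs a call per bracket type plus a whitespace cleanup. Below is a rough sketch of how that could be wired together, using an iterative loop instead of recursion. The names `strip_terms`, `clean_title` and `BRACKET_PAIRS` are hypothetical and not part of the bot.

```python
# Assumed set of bracket types worth stripping; adjust as needed.
BRACKET_PAIRS = (("(", ")"), ("[", "]"), ("{", "}"))


def strip_terms(text, open_term, close_term):
    """Iterative take on the same idea: loop until no term is left."""
    while open_term in text or close_term in text:
        start = text.find(open_term)   # -1 when the term is absent
        end = text.find(close_term)
        if start != -1 and start < end:
            # Open term precedes a close term: drop both and what they enclose.
            text = text[:start] + text[end + len(close_term):]
        elif end != -1:
            # Solitary close term (no open term in front of it).
            text = text[:end] + text[end + len(close_term):]
        else:
            # Solitary open term with no close term left.
            text = text[:start] + text[start + len(open_term):]
    return text


def clean_title(title, pairs=BRACKET_PAIRS):
    """Strip every bracket type in turn, then collapse leftover whitespace."""
    for open_term, close_term in pairs:
        title = strip_terms(title, open_term, close_term)
    return " ".join(title.split())


print(clean_title("The Girl with the Dragon Tattoo - Immigrant Song (Title Sequence) [HQ]"))
```

The loop form avoids Python's recursion limit, which is unlikely to matter for titles but costs nothing here. Note that `" ".join(title.split())` also collapses repeated spaces inside the title, which may or may not be wanted.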

 

2 thoughts on “Reddit bot – string cleanup”

  1. Hi Jens,
    I’m not a “regex expert” but the following seems to work fine.


    >>> import re
    >>> regex = re.compile(r'^([a-zA-Z0-9\- ]*) .*', re.MULTILINE)
    >>> string = """Glassjaw - All Good junkies Go To Heaven (Proper) (High Quality)
    ... The White Stripes - Icky Thump (Official Music Video)
    ... The Girl with the Dragon Tattoo - Immigrant Song (Title Sequence) [HQ]
    ... Glassjaw - All Good junkies Go To Heaven (((Proper))((((High Quality))))))
    ... Glassjaw - All Good junkies Go To Heaven ))))(((Proper)()((((High Quality))())())((("""
    >>> regex.findall(string)
    ['Glassjaw - All Good junkies Go To Heaven', 'The White Stripes - Icky Thump', 'The Girl with the Dragon Tattoo - Immigrant Song', 'Glassjaw - All Good junkies Go To Heaven', 'Glassjaw - All Good junkies Go To Heaven']

    Maybe this helps to simplify the code or just as a regex refresher.

    Keep up the good work,
    Cheers!

    • Hi!

      Apologies for the late reply, but that's definitely a smoother solution!
      Thank you for posting this. I'm trying to get to grips with regex every chance I get, so this is awesome.

      By the way, nice site you've got, and some really cool projects!

      regards
      Jens
