I use Calibre to manage my libraries of eBooks, PDFs, and so on. When I’m importing catalogs of PDFs into my library things generally run smoothly… except when they don’t. I recently tried to import a block of PDFs whose file names followed the conventions author – title.pdf and author – seriesName seriesIndex – title.pdf. By default, when Calibre reads metadata from the file name, it assumes the naming convention is title-author.pdf (or something similar).
In the case of my PDF collection, the default fails spectacularly, which means I would have to manually edit every file to make corrections. Happily, I found a discussion thread on exactly the problem I was having (Yay Internet!) and got this regex to sort out all my file name parsing woes:
^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?(\[?(?P<series>[^0-9\-]+) (- )?(?P<series_index>[0-9.]+)\]?\s*-\s*)?(?P<title>.+)
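If you want to see what the pattern captures before pasting it into Calibre, here is a minimal sketch using Python's re module, which understands the same (?P<name>...) named groups. The helper name parse_name and the sample file names are made up for illustration, and I'm assuming plain ASCII hyphens as separators and names without the .pdf extension. The whitespace stripping is my own addition, since the author group captures the space before the hyphen.

```python
import re

# The pattern from the thread, split across lines only for readability.
PATTERN = re.compile(
    r"^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))"
    r"(\s*-\s*)?"
    r"(\[?(?P<series>[^0-9\-]+) (- )?(?P<series_index>[0-9.]+)\]?\s*-\s*)?"
    r"(?P<title>.+)"
)


def parse_name(name):
    """Hypothetical helper: return the named groups for one file name.

    Values are stripped of surrounding whitespace; groups that did not
    take part in the match (or matched nothing) come back as None.
    """
    match = PATTERN.match(name)
    if match is None:
        return None
    return {key: (value.strip() if value else None)
            for key, value in match.groupdict().items()}


# Made-up sample names, one for each naming convention.
for sample in ("Jane Doe - Some Title",
               "Jane Doe - Great Saga 2 - Second Book"):
    print(sample, "->", parse_name(sample))
```

For the first sample, series and series_index come back empty and everything after the hyphen becomes the title. For the second, the series name and index are split out before the title.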
Without going into a lot of detail, this regular expression splits the file name into author, series, series index, and title. The series and series index portion is optional: if both are present in the file name, they are captured; otherwise, the match moves directly on to the title. For future reference, here’s what my preferences panel looks like:
Special thanks to the MobileRead forums member Starson17 for taking the time to answer a question way back in December of 2009.